Process safety practices and formal safety management systems have been in place in the industry for many years. Process safety management has been widely credited for reductions in the number of major accidents, proactively saving hundreds of lives and improving industry performance.
OT cyber security, the cyber security of process automation systems such as used by the industry, has a lot in common with the management of process safety and we can learn very much from the experience of formal safety management, build over more than 60 years. Last week I saw a lengthy email chain on a question “if the ISA 99 workgroup (the developers of the IEC 62443 standard) should look for a closer cooperation with the ISO 27001 developers to better align the standards”. Of course such a question results in discussions addressing the importance of ISO 27001 and others emphasizing the difficulties to apply the standard in the industry.
If there is a need for ISA to cooperate with another organization for aligning its standard, than in my opinion they should have a more close cooperation with the AIChE, the ISA of the chemical engineering professionals. The reason for this is that there is a lot to learn from process safety and though IEC 62443 is supposed to be a standard for industrial control systems, there is still a lot of “IT methodology” in the standard.
In this week’s blog I like to address the link between process safety and cyber security again, and discuss some of the lessons in process safety we can (actually should in my opinion) learn from. Not for adding safety to the security triad as some suggest, this is in my opinion for multiple reasons wrong, but because process safety management and OT cyber security management have many things in common when we compare their design and management processes and process safety management is much further in its development than OT cyber security management.
I accomplished in my career several cyber security certifications such as: CISSP, CISM, CRISC, ISO 27001 LA and many more with a pure technical focus. Did all this course material ever discuss industrial control systems (ICS)? No they didn’t, still they where very valuable for developing my knowledge. As an employee working for a major vendor of ICS solutions, my task became more to adopt what was applicable, learn from IT and park those other bits of knowledge for which I saw no immediate application. As my insights further developed I started to combine bits and pieces from other disciplines more easily. In the OT cyber security world, which is relatively immature, we can learn a lot from more mature disciplines such as process safety management. Such learning generally requires us to make adaptions to address the differences.
But despite these differences we can learn from studying the accomplishments of the process safety discipline, this will certainly steepen the learning curve of OT cyber security to make it more mature. If we want to learn from process safety, where better to start than the many publications of CCPS (Center for Chemical Process Safety) and the AIChE.
Process safety starts at risk, process safety studies the problem first before they attempt to solve the problem. Process safety recognizes that all hazards and risk are not equal, consequently it focuses more resources on higher hazards and risks. Does this also apply to cyber security, in my opinion very sparingly.
More mature asset owners in the industry adopted a risk approach in their security strategy, but the majority of asset owners is still very solution driven. They decide on purchasing solutions before they inventoried the cyber security hazards and prioritized them.
What are the advantages of a risk based approach? Risk allows for optimally apportioning limited resources to improve security performance. Risk also provides a better insight in where the problems are and what options there are to solve a problem. Both OT cyber security risk and process safety risk are founded on four pillars:
- Commit to cyber security and process safety – A workforce that is convinced that their management fully supports a discipline as part of its core values will tend to follow process even when no one else is checking this.
- Understand hazards and risk – This is the foundation of a risk based approach, it provides the required insight to make justifiable decisions and apply limited resources in an effective way. This fully applies for both OT cyber security and process safety.
- Manage risk – Plants must operate and maintain the processes that pose risk, keep changes to those processes within the established risk tolerances, and prepare for / respond to / manage incidents that do occur. Nothing in here that doesn’t apply for process safety as well as OT cyber security.
- Learn from experience – If we want to improve, and because of the constantly changing threat landscape we need to improve, we need to learn. Learning we do by observing and challenging what we do. Metrics can help us here, preferably metrics that provide us with leading indicators so we see the problem coming. Also this pillar applies for both disciplines, where process safety tracks near misses OT cyber security can track parameters such as AV engines detecting malware, or firewall rules bouncing off traffic attempting to enter the control network.
Applying these four pillars for OT cyber security would in my opinion significantly improve OT cyber security over time. Based upon what I see in the field, I estimate that less than 10 percent of the asset owners adopted a risk based methodology for cyber security, while more than 50 percent adopted a risk based methodology for process safety or asset integrity.
If OT cyber security doesn’t want to learn from process safety, it will take another 15 years to reach the same level of maturity. If this happens we are bound to see many serious accidents in future, including accidents with casualties. OT cyber security is about process automation lets use the knowledge other process disciplines build over time.
The alternative for a risk based approach is a compliance based approach, well known examples are for North America the NERC CIP standard, and for Europe the French ANSSI standard for cyber security or if we look at process safety the OSHA regulations in the US. All compliance driven initiatives. A compliance driven approach can lead to:
- Decisions based upon ” If it isn’t a regulatory requirement, we are not going to do it”.
- The wrong assumption that stopping the more frequent and easier to measure incidents (like for example the mentioned malware detection) discussed in standards will also stop the low-frequency / high consequence incidents.
- Inconsistent interpretation and application of the practices described in the standard. Standards are often a compromise between conflicting opinions, resulting in soft requirements open for different interpretations.
- Overemphasized focus on the management system, forgetting about the technical aspects of the underlying problems.
- Poor communication between the technically driven staff and the business driven senior management, resulting in management not understanding the importance of the messages they receive and subsequently fail to act.
- High auditing costs, where audits focus on symptoms instead of underlying causes.
- Not moving with the flow of time. Technology is continuously changing, posing new risk to address. Risk that is not identified by a standard developed even as recent as 5 years ago.
This criticism on a compliance approach doesn’t mean I am against standards and their development. Merely I am against standards as an excuse to switch off our brain, close our eyes and ignore where we differ from the average. Risk based processes offer us the foundation to stay aware of all changes in our environment while still using standards as a checklist to make certain we don’t forget the obvious.
Like I mentioned for cyber security the majority of the asset owners would fall in the category standards-based or compliance-based. It is a step forward compared to 10 years ago, when OT cyber security was ignored by many, but it is a long way off from where asset owners are for process safety.
Where we see in process safety the number of accidents decline, we see in cyber security that both the threat event frequency and the threat capability of the threat community rise. To keep up with the growing threat, critical infrastructure should adopt a risk based strategy to keep track with the threat community. Unfortunately many governments are driving for a compliance based strategy because they can more easily audit this and doing this they are setting the requirements too low for a proper protection against the growing threat.
A risk based approach doesn’t exclude compliance with a standard, it just makes the extra step predicting the performance of the various cyber security aspects, independent of any loss events, and improving its security. As such it adds pro-activity to the defense and allows us to keep track with the threat community.
The process safety community recognized the bottlenecks of a compliance based strategy and jumped forward by introducing a risk based approach allowing them to further reduce the number of process safety accidents after several serious accidents happened in the 1980s. Accidents caused by failure of the compliance based management systems.
Because of the malicious aspects inherent to cyber security, because of the fast growing threat capabilities of the threat community and because of an increase in threat events, not jumping to a risk based strategy like the process safety community did is waiting for the first casualties to occur as a result of a cyber attack. TRISIS had the potention be the first attack causing loss of life, we were lucky it failed. But the threat actors have undoubtedly learned from their failure and work on an improved version.
I don’t include the alleged attack on a Siberian pipeline in 1982 as a cyber event as some do. If such an event would happen due to a cyber attack this would be an act of war. So for me we have been lucky so far that cyber impact was mainly a monetary value, but this can change either willingly or accidentally.
It would become a very lengthy blog if I would discuss each of the twenty elements of the risk based safety program or reliability program. But each of these elements has a strong resemblance with what would be appropriate for a cyber security program.
The element I like to jump to is the Hazard Identification and Risk Analysis (HIRA) element. HIRA is the term used to bundle all activities involved in identifying hazards and evaluating the risk induced by these hazards. In my previous blog on risk I showed a more detailed diagram for risk, splitting it in three different forms of risk. For this blog I like to focus on the the hazard part using the following simplified diagram for the same three forms of risk.
On the left side we have the consequence of the cyber security attack, some functional deviation of the automation system. This is what was what was categorized as loss of required performance and loss of ability to perform. The 3rd category, loss of confidentiality, will not lead directly to a process risk so I ignore it here. Loss of required performance caused the automation system to either execute an action that should not have been possible (not meeting design intent) or an action that does not perform as it should (not meeting operation intent). In the case of loss of ability to perform, the automation system could not execute one or more of its functions.
So perhaps the automation system’s configuration was changed in a way that the logical dimensions configured in the automation system no longer represent the physical dimensions of the equipment in the field. For example if the threat actor increases the level range of a tank this does not result into a bigger physical tank volume, so a possibility exists that the tank is overfilled without the operator noticing this in his process displays. The logical representation of the physical system (its operating window) should fit the physical dimensions of the process equipment in the plant. If this is not the case this would be the failure mode “Integrity Operating Window (IOW) deviation” in the “Loss of Required Performance” category.
Similar the threat actor might prevent the operator to stop or start a digital output, the failure mode “Loss of Controllability” in the category “Loss of Ability to Perform”. Not being able to stop or start a digital output might translate to the inability to stop or start a pump in the process system. At least stopping or starting by using the automation system. We might have implemented an electrical switch (safeguard) to do it manually if the automation system would fail.
Not being able to modify a control parameter would give rise to a whole other category of issues for the production process. Each failure mode has a different consequence for the process system equipment and the production process.
Cyber security hazards are a description of a threat actor (threat community) executing a threat event (threat action exploiting a vulnerability) resulting in a specific consequence (the functional deviation) entering a specific failure mode for the automation system function. What the consequence is for the production process and its equipment depends on the automation system function affected and the characteristics of the production system equipment and production process. This area is investigated by the process (safety) hazards. Safety is here between brackets because not every functional deviation results in a consequence for process safety, there can also be consequences for product quality or production loss not impacting process safety at all. If the affected function would be the safety instrumented system (SIS), a deviation in functionality would always affect process safety.
The HIRA for process risk would focus on how the functional deviations influence the production process and the asset integrity of its equipment. As such the HIRA has a wider scope than it would have in a typical process safety hazard analysis / hazop. For cyber security it combines what we call the computer hazop, the analysis of how failures in the automation system impact the production system and the process safety hazop.
On the other hand from a safeguard perspective of the safety hazop / PHA the scope is smaller because we can only influence the functionality of the “functional safety” functions provided by the SIS. Safety has multiple layers of protection and multiple safeguards and barriers that contain a dangerous situation. A cyber security attack can only impact the programmable electronic components (e.g. SIS) of the process safety protection functions.
This is the reason why I protest if people talk about “loss of safety” in the context of cyber security, there are in general more protection mechanism implemented, so safety is not necessarily lost. Adding safety to the triad is also incorrect in my opinion, this should be at minimum adding functional safety because that is the only safety layer that can be impacted by a cyber threat event, but functional safety is also already covered within the definition of loss of required performance.
IEC 62443’s loss of system integrity is not covering all the failure modes covered by loss of required performance. The IEC 62443-1-1 defines integrity as: “Quality of a system reflecting the logical correctness and reliability of the operating system, the logical completeness of the hardware and software implementing the protection mechanisms, and the consistency of the data structures and occurrence of the stored data.”
This definition is fully ignoring the functional aspects in an automation system, therefore it is a too limited cyber security objective for an automation system. For example where do we find in the definition that an automation action needs to be performed on the right moment in the right sequence and appropriately coordinated with other functions.
Consider for example the coordination / collaboration between a conveyor and filling mechanism or a robot. The IEC 62443 seven foundation requirements don’t cover all aspects of an automation function / industrial control system. The combined definitions used by risk based asset integrity management and risk based process safety management do cover these aspects, an example of a missed chance to learn something from an industry that has considerably more experience in its domain than the OT cyber security community has in its own field.
Can we conduct the HIRA process for cyber security in a similar way as we do for process safety? My only answer here is a firm NO!. The malicious aspects of cyber security make it impossible to work in the same way as we do for process safety. The job would just not be repeatable (so results are not consistent) and too big (so extremely time consuming). The number of possible threat events, vulnerabilities, and consequences is just too big to approach this in a workshop setting as we do for process hazard analysis (PHA) / safety hazop.
So in cyber security we work with tooling to capture all the possibilities, we categorize consequences and failure modes to assign them a trustworthy severity value meeting the risk criteria of the plant. But in the end, we end with a risk priority number just like we have in risk based process safety or risk based asset integrity to rank the hazards.
The formula for cyber security risk is more complex because we not only have to account for occurrence (threat x vulnerability) and consequence, but also for detection, and the risk reduction provided by countermeasures, safeguards and barriers. But these are normal deviations, also risk based asset integrity management and risk based process safety management differ at detail level.
The following key principles need to be addressed when we develop, evaluate, or improve any management system for risk:
- Maintain a dependable and consistent practice – So the practice should be documented, the objectives of the benefits must be in terms that demonstrate to management and employees the value of the activities;
- Identify hazards and evaluate risks – Integrate HIRA activities into the life cycle of the ICS. Both design and security operations should be driven by risk based decisions;
- Assess risks and make risk based decisions – Clearly define the analytical scope of HIRAs and assure adequate coverage. A risk system should address all the types of cyber security risk that management wants to control;
- Define the risk criteria and make risk actionable – It is crucial that all understand what a HIGH risk means, and that it is defined what the organization will do when something attains this level of risk. Risk appetite differs depending on the production process or process units within that process;
- Follow through on risk assessment results – Involve competent personnel, make a consistent risk judgement so we can follow through without too much debate if results require this;
Risk diagrams to express process risk generally have less risk levels as a risk assessment diagram for cyber security. This because it has a more direct relationship with the business / mission risk, so actions have a direct business justification. An example risk assessment diagram for process risk is shown in the following diagram:
The ALARP acronym stands for As Low As Reasonably Practicable a commonly used criterion for process related risk. Once we have the cyber security hazards and their process consequence we can assign a business impact to each hazard and create risk assessment matrices for each impact category as explained in my blog on OT cyber security risk using the impact diagram as example. or if preferred the different impact categories can be shown in a single risk assessment matrix.
So far this discussion about the parallels between risk based process safety, risk based asset integrity, and risk based OT cyber security. I noticed in responses to previous blogs, that for many this is an uncharted terrain because they might not be familiar with all three disciplines and the terminology used. Most risk methods used for cyber security have an IT origin. This results in ignoring the action part of an OT system, OT being Data + Action driven where IT is Data driven only. Another reason to more closely look at other risk based management processes applied in a plant.
There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge build up over 42 years of work in this industry. Approximately half of the time working in engineering these automation systems, and half of the time implementing their networks and securing them.
Author: Sinclair Koelemij
OTcybersecurity web site