This blog is about risk assessment in cyber physical systems and some of its foundational principles. I have written several blogs on the topic of risk assessment before, for example “Identifying risk in cyber physical systems” and “ISA 62443-3-2 an unfettered opinion“. Specifically the one criticizing the ISA standard caused several sharp reactions. In the meantime ISA 62443-3-2 has been adopted by the IEC, and a task group focusing on cyber security and safety instrumented systems (ISA 84 work group 9) addresses the same topic. Unfortunately the ISA 84 team seems to have the intention to copy the ISA 62443-3-2 work, and with it the structural flaws of that standard will enter the work of this group. One of these flaws is the ISA approach toward likelihood.
Therefore this blog addresses this very important aspect of a risk assessment in the cyber physical system arena. Likelihood is actually the only factor of interest, because the impact (severity of the consequences) is provided by the process hazard analysis the asset owner makes. That makes likelihood / probability the most important and most challenging variable to determine, because many of the cyber security controls affect the likelihood of a cyber security hazard occurring. Much of the risk reduction is achieved by reducing, through countermeasures, the chance that a cyber attack will succeed.
Having said this about likelihood, we must not ignore the consequences, as I have advocated in previous blogs. This is because risk reduction through “cyber safeguards” (those barriers that either eliminate the possibility of a consequence occurring or reduce the severity of that consequence – see my earlier blogs) is often a more reliable risk reduction than risk reduction through “cyber countermeasures” (those barriers that reduce the likelihood).
But this blog is about likelihood: the bottlenecks we face in estimating the probability, and the selection of the correct quantitative scale that allows us to express a change in process safety risk as a result of the cyber security threat.
At the risk of losing readers before coming to the conclusions, I would still like to cover this topic in the blog and explain the various problems we face. If you are interested in risk estimation I suggest reading it; if not, just forward it to stalk your enemies.
Where to start? It is always dangerous to start with a lot of concepts, because for part of the audience this is known material, while for others it answers questions they might have. I am not going to give many formulas – most of these have been discussed in my earlier blogs – but it is important to understand that there are different forms of risk. These forms often originate from the different questions we want to answer. For example we have:
- Risk priority numbers (RPN): a risk value used for ranking / prioritization of choices. This risk is a mathematical number; the bigger the number, the bigger the risk;
- Loss based risk (LBR): a risk used for justifying decisions. This risk is used in production plants and ties the risk result to monetary loss due to production outage / equipment damage / loss of the permit to operate, or to a loss expressed in terms of human safety or environmental damage. Often risk is estimated for multiple loss categories;
- Temporal risk or actuarial risk: a risk used for identifying the chance of a certain occurrence over time. Insurance companies use this method of risk estimation, for example analyzing the impact of obesity over a 10 year period to understand how it affects their cost.
There are more types of risk than the three forms above, but these three cover the larger part of the areas where risk is used. The only two I will touch upon in this blog are RPN and LBR. Temporal risk – the question of how many cyber breaches caused by malware my plant will face five years from now – is not raised today in our industry. Actually, at this point in time (2021) we couldn’t even estimate it, because we lack the statistical data required to answer the question.
We want to know what the cyber security risk is today: what is my financial, human safety, and environmental risk, can it impact my permit to operate, and how can I reduce this risk in the most effective way? What risk am I willing to accept, what should I keep a close eye on, and what risk will I avoid? RPN and LBR estimates are the methods that provide an answer here, and both are extensively used in the process industry: RPN primarily in asset integrity management and LBR for business / mission risk. LBR is generally derived from the process safety risk, but business / mission risk is not the same as process safety risk.
Risk Priority Numbers are the result of what we call Failure Mode, Effects and Criticality Analysis (FMECA), a method applied for example in asset integrity management. FMECA was developed by the US military over 80 years ago to change their primarily reactive maintenance processes into proactive maintenance processes. The FMECA method focuses on qualitative and quantitative risk identification for preventing failure; if I were to write “for preventing cyber failure” the link to cyber security is made.
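For readers who like to see the arithmetic, the FMECA ranking idea can be sketched in a few lines. The rating scales and the failure modes below are illustrative assumptions, not values from any standard; a common convention rates severity, occurrence and detection each from 1 to 10 and multiplies them.

```python
# Hedged sketch of an RPN (Risk Priority Number) calculation as used in
# FMECA. All ratings and failure modes below are made-up illustrations.

def rpn(severity: int, occurrence: int, detection: int) -> int:
    """RPN = severity x occurrence x detection, each rated 1..10."""
    for rating in (severity, occurrence, detection):
        if not 1 <= rating <= 10:
            raise ValueError("each rating must be between 1 and 10")
    return severity * occurrence * detection

# Hypothetical failure modes: (name, severity, occurrence, detection)
failure_modes = [
    ("sensor drift",           6, 5, 4),
    ("pump seal leak",         8, 3, 5),
    ("malware on HMI station", 9, 4, 7),
]

# Rank highest RPN first -- the ranking is the point, not the absolute number.
for name, s, o, d in sorted(failure_modes, key=lambda m: -rpn(*m[1:])):
    print(f"{name:24s} RPN = {rpn(s, o, d)}")
```

The number itself has no unit; it only orders the failure modes so the maintenance (or, in our case, security) effort can be prioritized.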
For Loss Based Risk various estimation methods are used (for example methods based on annual loss expectancy), most of them very specific to a business and often not relevant for, or not applied in, the process industry. The risk estimation method mostly used today in the process industry is the Layers Of Protection Analysis (LOPA) method. This method is used extensively by the process safety team in a plant. The Hazard and Operability (HAZOP) study identifies the various process hazards, and LOPA adds risk estimation to this process to identify the most prominent risks.
If we know what process consequences can occur (for example a seal leak causing an exposure to a certain chemical, whether or not there is potential ignition, how many people would approximately be present in the immediate area, whether or not the exposure is contained within the fence of the plant, etc.) we can convert the consequence into a monetary / human safety / environmental loss value. This, together with the LOPA likelihood estimate, results in a Loss Based Risk score.
How do we “communicate” risk? Generally in the form of a risk assessment matrix (RAM): a matrix that has a likelihood scale and an impact scale, and that shows (generally using different colors) what we see as the acceptable risk, tolerable risk, and not acceptable risk zones. Other methods exist, but the RAM is the form most often used.
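To make the RAM idea concrete, here is a minimal sketch. The 5×5 layout, the labels and the additive zone banding are illustrative assumptions; every asset owner defines its own scales and acceptance criteria.

```python
# Hedged sketch of a risk assessment matrix (RAM). Scales and zone
# boundaries are illustrative assumptions, not taken from any standard.

LIKELIHOOD = ["Remote", "Unlikely", "Possible", "Likely", "Expected"]  # 0..4
IMPACT     = ["Negligible", "Minor", "Moderate", "Major", "Severe"]    # 0..4

def ram_zone(likelihood_idx: int, impact_idx: int) -> str:
    """Classify a (likelihood, impact) cell into an acceptance zone."""
    score = likelihood_idx + impact_idx   # simple additive banding
    if score <= 2:
        return "acceptable"
    if score <= 5:
        return "tolerable"
    return "not acceptable"

# Print the matrix, highest likelihood row first, as the colored RAM shows it.
for li in reversed(range(len(LIKELIHOOD))):
    row = [ram_zone(li, ii)[:4] for ii in range(len(IMPACT))]
    print(f"{LIKELIHOOD[li]:9s} {row}")
```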
So much for the high-level description; now we need to delve deeper into the “flaws” I mentioned in the introduction. Now the going gets tough: I need to explain the differences between low demand mode and high demand / continuous mode, the differences between IEC 61511 and IEC 62061, some differences in how the required risk reduction is estimated, and also how the event frequency scale underlying the qualitative likelihood scale plays a role in all this. This is a bit of a chicken and egg problem – we need to understand both before understanding the total. Let me try; I start with a deeper dive into process safety and the safety integrity level – the SIL.
In process safety we start with what is called the unmitigated risk, the risk inherent to the production installation. This is already quite different from cyber security risk, where unmitigated risk would be a rather foolish point to start. Connected to the Internet and without firewalls and other protection controls, the process automation system would be penetrated in milliseconds by all kinds of continuously scanning software. Therefore in cyber security we generally start with the residual risk based upon some initial countermeasures, and we investigate whether this risk level is acceptable / tolerable. But let’s skip a discussion on the difference between risk tolerance and risk appetite – a difference ignored by IEC 62443-3-2, though from a risk management perspective an essential difference that buys us time to manage.
In plants, however, the unmitigated situation is relevant. This has to do with what is called the demand rate, and with the fact that process safety does not analyze risk from a malicious / intentional perspective. Incidents in process safety are accidental, caused by wear and tear of materials and equipment, or by human error. No malicious intent – in a HAZOP sheet you will not find scenarios where process operators intentionally open / close valves in a specific combination / sequence to create damage. And typically most hazards identified have a single or sometimes a double cause, but not much more.
Still, asset owners like to see cyber security risk “communicated” in the same risk assessment framework as the conventional business / mission risk derived from the process hazard analysis. Is this possible? More to the point: is the likelihood scale of a cyber security risk estimate the same as the likelihood scale of a process safety risk estimate, if we take into account the event frequencies underlying the often qualitative scales of the risk diagrams?
In a single question: would the qualitative likelihood score of “Expected” for a professional boxer to get knocked out (intentional action) on the job during his career be the same as for a professional accountant (accidental)? Can we measure these scales along the same quantitative event frequency scale? I don’t think so. Even if we ignore for a moment the difference in career length, the probability that the accountant gets knocked out (considering he is working for a reputable company) during his career is much lower. Yet the IEC/ISA 62443-3-2 method does seem to think we can mix these scales. But okay, this argument doesn’t necessarily convince a process safety or asset integrity expert, even though they came to the same conclusion. To explain, I need to dive a little deeper into the SIL topic in the next section and discuss the influence of another factor, the demand rate, on the quantitative likelihood scale.
The hazard analysis (HAZOP) and risk assessment (LOPA) are used in process safety to define the safety requirements. The safety requirements are implemented using what are called Safety Instrumented Functions (SIF). The requirements for a SIF can be seen as two parts:
- Safety function requirements;
- Safety integrity requirements.
The SIFs are implemented to bring the production process to a safe state when a demand (a hazardous event) occurs. A demand occurs, for example, if the pressure in a vessel rises above a specified limit, the trip point. The safety function requirements specify the function and the performance requirements for this function. For example the SIF must close an emergency shutdown valve (the “Close Flow” function); performance parameters can be the closing time and the leakage allowed in closed position. For this discussion the functional requirements are not that important, but the safety integrity requirements are.
When we discuss safety integrity requirements, we can split the topic into four categories:
- Performance measures;
- Architectural constraints;
- Systematic failures;
- Common cause failures.
For this discussion the performance measures, expressed as the Safety Integrity Level (SIL), are of importance. For the potential impact of cyber attacks (some functional deviation or a lost function) the safety function requirements (above), together with the common cause failures, architectural constraints, and systematic failures, are of more interest. But the discussion here focuses on risk estimation – more specifically the likelihood scale – not so much on how and where a cyber attack can cause process safety issues. Those scenarios can be discussed in another blog.
The safety integrity performance requirements define the area I focus on; in these requirements safety integrity performance is expressed as a SIL. IEC 61508 defines 4 safety integrity levels, SIL 1 to SIL 4, with SIL 4 having the strictest requirements. To fulfill a SIL, the safety instrumented function (SIF) must meet both a quantitative requirement and a set of qualitative requirements.
According to IEC 61508 the quantitative requirement is formulated based on either an average probability of failure on demand (PFDavg) for what are called low demand SIFs, or an average frequency of dangerous failure per hour (PFH) for high demand / continuous mode SIFs. Two different event frequency scales are defined, one for the low demand SIFs and one for the high demand SIFs. So depending on how often the protection mechanism is “challenged”, the risk estimation method will differ.
For the low demand SIFs the SIL expresses a range of risk reduction values, and for the high demand SIFs it is a scale of maximum frequencies of dangerous system failures. Two different criteria, two different methods / formulas for estimating risk. If the SIF is demanded less than once a year, low demand mode applies; if the SIF is demanded more than once a year, high demand / continuous mode is required. The LOPA technique for estimating process safety risk is based upon low demand mode – the assumption that the SIF is demanded less than once a year. If we compare our cyber security countermeasure with a SIF, will our countermeasure be demanded just once a year? An antivirus countermeasure is demanded every time we write to disk, not to mention a firewall filter passing / blocking traffic. Risk estimation for cyber security controls is high demand / continuous mode.
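For readers who want the numbers behind the two modes, the commonly cited IEC 61508 quantitative bands can be sketched as a lookup. Treat the values as a sketch to be verified against the standard itself.

```python
# Commonly cited IEC 61508 SIL bands. Low demand mode uses a PFDavg range,
# high demand / continuous mode uses a PFH range (average frequency of a
# dangerous failure per hour). Verify against the standard before use.

LOW_DEMAND_PFD = {   # SIL: (lower bound, upper bound) for PFDavg
    1: (1e-2, 1e-1),
    2: (1e-3, 1e-2),
    3: (1e-4, 1e-3),
    4: (1e-5, 1e-4),
}

HIGH_DEMAND_PFH = {  # SIL: (lower bound, upper bound) for PFH, per hour
    1: (1e-6, 1e-5),
    2: (1e-7, 1e-6),
    3: (1e-8, 1e-7),
    4: (1e-9, 1e-8),
}

def sil_from_pfd(pfd_avg):
    """Return the SIL a low demand SIF achieves with this PFDavg, else None."""
    for sil in (4, 3, 2, 1):
        lo, hi = LOW_DEMAND_PFD[sil]
        if lo <= pfd_avg < hi:
            return sil
    return None

print(sil_from_pfd(5e-3))  # a PFDavg of 0.005 lands in the SIL 2 band
```

Note that the two tables are not interchangeable: one is a dimensionless probability per demand, the other a frequency per hour – exactly the scale difference this blog is about.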
LOPA specifies a series of initiating event frequencies (IEF) for various occurrences. These event frequencies can differ between companies, but with some exceptions they are similar. For example the IEF for a control system sensor failure can be set to once per 10 years, and the IEF for unauthorized changes to the SIF can be set to 50 times per year. These are actual examples from a chemical plant – values from the field, not examples made up for this blog. The 50 seems high, but by giving the event a high IEF it also requires more and stronger protection layers to reduce the risk to within the asset owner’s acceptable risk criteria. So a certain level of subjectivity is possible in LOPA, which creates differences between companies for sometimes the same risk. But most of the values are based on the statistical knowledge available in the process safety discipline, which is far more mature than the cyber security discipline.
In LOPA the IEF is reduced using protection layers (a kind of countermeasure), each with a specific PFDavg. The PFDavg can be a calculated value, determined for every safety design, a value selected from the LOPA guidelines, or an asset owner preferred value.
For example an operator response as protection layer is worth a specific factor (maybe 0.5), so just by having an operator specified as part of the protection layer (maybe a response to an alarm) the new frequency, the Mitigated Event Frequency (MEF), for the failed control system sensor becomes 0.5 x 0.1 = 0.05 per year, equivalent to once per 20 years. Based upon one or more independent protection layers with their PFD, the IEF is reduced to a MEF. If we create a linear scale for the event frequency and sub-divide the event scale into a number of equal qualitative intervals to create a likelihood scale, we can estimate the risk given a specific impact of the hazardous event. The linearity of the scale is important since we use the formula risk equals likelihood times impact; if the scale were logarithmic the multiplication would no longer work.
In reality the LOPA calculation is a bit more complicated, because we also take enabling factors into account, the probability of exposure, and multiple protection layers (including the BPCS control action, the SIF, the operator response, and physical protection layers such as rupture discs and pressure relief valves). But the principle for estimating the likelihood remains a simple multiplication of factors. However, LOPA applies only to the low demand cases, and LOPA uses an event scale based on failures of process equipment and human errors – not an event scale that takes into account how often a cyber asset like an operator station gets infected by malware. It does not take intentional events into account, and it starts with the unmitigated situation. Several differences with risk estimation for cyber security failures.
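The multiplication described above can be sketched as follows; the IEF and PFD numbers are illustrative assumptions, not values from the LOPA guidelines.

```python
# Hedged sketch of the LOPA likelihood arithmetic: the mitigated event
# frequency is the initiating event frequency multiplied by any enabling /
# conditional factors and by the PFD of each independent protection layer.

def mitigated_event_frequency(ief_per_year, pfds, enabling_factor=1.0):
    """MEF = IEF x enabling factor x product of the protection layer PFDs."""
    mef = ief_per_year * enabling_factor
    for pfd in pfds:
        mef *= pfd
    return mef

# Illustrative scenario: control system sensor failure, IEF once per
# 10 years (0.1 / year), mitigated by an operator response (PFD 0.5)
# and a SIL 2 SIF (PFD 0.01).
mef = mitigated_event_frequency(0.1, [0.5, 0.01])
print(f"MEF = {mef:.4f} per year, once per {1 / mef:.0f} years")
```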
Most chemical and refinery processes follow LOPA and qualify for low demand mode, with a maximum demand rate for a SIF of once a year. However, in the offshore industry there is a rising demand for SIL based on high demand mode, and also in the case of machinery (IEC 62061) a high demand / continuous mode is required. Using high demand mode results in a different event scale and different likelihood formulas, and LOPA no longer applies. Now let us look at cyber security and its likelihood scale. When we estimate risk in cyber physical systems we have to account for this; if we don’t, a risk assessment might still be a learning exercise because of the process itself, but translating the results onto a loss based risk scale would be fake.
Let us use the same line of thought used by process safety and translate it into cyber security language. We can see a cyber attack as a Threat Actor applying a specific Threat Action to Exploit a Vulnerability, resulting in a Consequence – the consequence being a functional deviation of the target’s intended operation, if we ignore cyber attacks breaching confidentiality and focus on attacks attempting to impact the cyber physical system.
The threat actor + threat action + vulnerability combination has a certain event frequency; even if we don’t know yet what that frequency is, there is a certain probability of occurrence. That probability depends on many factors. Threat actors differ in capabilities and opportunity (access to the target), the exploit difficulty of a vulnerability differs, and on top of all this we protect these vulnerabilities using countermeasures (the protection layers in LOPA) we need to account for. We install antivirus to protect us against storing a malware infected file on the hard disk, we protect ourselves against repeated logins that attempt brute force authentication by implementing a retry limit, we protect ourselves against compromised login credentials by installing two factor authentication, etc. A vulnerability is not just a software deficiency, but also an unused open port on a network switch, or an unprotected USB port on a server or desktop. There are hundreds of vulnerabilities in a system, each waiting to be exploited. There are also many cyber countermeasures, each with a certain effectiveness, each creating a protection layer. And with defense in depth we generally have multiple protection layers protecting our vulnerabilities. A model very similar to the one discussed for LOPA. This was recognized, and a security method called ROPA (Rings Of Protection Analysis) was developed for physical security, using the same principles to get a likelihood value based upon an event frequency.
Though ROPA was initially developed for physical security, this risk estimation method can also be used for cyber security. What is missing today is a standardized table for the IEF and the PFD of the various protection controls. This is no different from ROPA and LOPA: for both methods there is a guideline, but in daily practice each company defines its own list of IEF and PFDavg factors.
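A ROPA-style estimate for a cyber scenario would follow the same arithmetic as the LOPA example. Everything numeric below is a hypothetical illustration; as noted above, standardized IEF / PFD tables for cyber protection controls do not exist today.

```python
# Hedged sketch of a ROPA-style likelihood estimate for a cyber scenario.
# All frequencies and PFD-like effectiveness values are hypothetical.

def cyber_mef(ief_per_year, ring_pfds):
    """Mitigated event frequency behind layered (defense in depth) rings."""
    mef = ief_per_year
    for pfd in ring_pfds:
        mef *= pfd   # each ring of protection reduces the event frequency
    return mef

# Hypothetical scenario: malware reaching an engineering station.
# IEF 50 / year (a frequently challenged, exposed asset), three rings:
rings = {
    "perimeter firewall":       0.1,   # 1 in 10 attempts passes
    "antivirus engine":         0.2,   # 1 in 5 samples missed
    "application whitelisting": 0.05,  # 1 in 20 executions slips through
}
mef = cyber_mef(50.0, rings.values())
print(f"MEF = {mef:.2f} per year")  # 50 x 0.1 x 0.2 x 0.05 = 0.05
```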
Another important difference with process safety is that cyber and physical security require, like machinery, a high demand / continuous mode, which alters the risk estimation formulas (and event scales!), though the principles remain the same. So again, the event frequency scale for cyber events differs from the event frequency scale used by LOPA. The demand rate for a firewall or antivirus engine is almost continuous, not limited to once a year like in the low demand mode of LOPA for process safety and the business / mission risk derived from it.
The key message in the story above is that the event scale used by process safety for risk estimation is based upon failures of primarily physical equipment and accessories such as sealings, and upon human errors. The event scale of cyber security, in contrast, is linked to the threat actor’s capabilities (tactics, techniques, procedures), motivation, resources (knowledge, money, access to equipment) and opportunity (opportunities differ for an insider and an outsider, some attacks are more “noisy” than others); it is linked to the effectiveness of the countermeasures (ease of bypassing, false negatives, maintenance) and to the exposure of the vulnerabilities (directly exposed to the threat actor, requiring multiple steps to reach the target, detection mechanisms in place, response capabilities). These are different entities, resulting in different scales. We cannot ignore this when estimating loss based risk.
Accepting the conclusion that the event scales differ for process safety and for cyber security is accepting that we cannot use the same risk assessment matrix for showing cyber security risk and process safety risk, as is frequently requested by the asset owner. The impact will be the same, but the likelihood will differ, and as such the risk appetite and tolerance criteria will differ.
So where are the flaws in IEC/ ISA 62443-3-2?
Starting with an inventory of the assets and channels in use is always a good idea. The next step, the High Level Risk Assessment (HLRA), is in my opinion the first flaw. There is no information available after the inventory to estimate likelihood, so the method suggests setting the likelihood to 1; basically the risk assessment becomes a consequence / impact assessment at this point – an activity of the FMECA method. Why is an HLRA done, what is the objective of the standard? Well, it is suggested that these results are later used to determine if the risk is tolerable; if not, a detailed risk assessment is done to identify improvements. It is also seen as a kind of early filter mechanism, focusing on the most critical impact. Additionally the HLRA should provide input for the zone and conduit analysis.
Can we, using consequence severity / impact alone, determine if the risk is tolerable (a limit suggested by ISA; the risk appetite limit is more appropriate than the risk tolerance limit in my view)?
The consequence severity comes from the process hazard analysis (PHA) of the plant, so the HAZOP / LOPA documentation. I don’t think this information links to the likelihood of a cyber event happening. Let me give a silly example to explain. Hazardous event 1: someone slaps me in the face. Hazardous event 2: a roof tile hits me on my head. Without knowledge of the likelihood (the event frequency), I personally would consider the roof tile the more hazardous event. However, if I had some likelihood information telling me the slap in the face would happen every hour and the roof tile once every 30 years, I might reconsider and regret having removed hazard 1 from my analysis. FMECA is a far more valuable process than the suggested HLRA.
Does an HLRA contribute to the zone and conduit design? Zone boundaries are created as a grouping of assets with similar security characteristics, following the IEC 62443 definition. Is process impact a security characteristic I can use to create this grouping? How can I relate a tube rupture in a furnace’s firebox to my security zone?
Well, then we must first ask ourselves how such a tube rupture can occur. Perhaps it is caused by a too high temperature; this I might link to a functional deviation caused by the control system, or perhaps it happens as the consequence of a SIF not acting when it is supposed to act. So I may link the consequence to a process controller or maybe a safety controller causing a functional deviation resulting in the rupture. But this doesn’t bring me the zone; if I further analyze how this functional deviation is caused I get into more and more detail, very far away from the HLRA. An APC application at the level 3 network segment could cause it, a modification through a breached BPCS engineering station at level 2 could cause it, an IAMS modifying the sensor configuration could cause it, and so on.
Security zone risk can be estimated in different ways, for example as zone risk following the ANSSI method – a kind of “risk of the neighborhood you live in”. We can also look at asset risk and set the zone risk equal to the risk of the asset with the highest risk. But all these methods are based on RPN, not on LBR.
HLRAs just don’t provide valuable information. We typically use them for early filtering, making a complex scope smaller, but that is neither required nor appropriate for cyber physical systems. Doing an HLRA helps the risk consultant who doesn’t know the plant or the automation systems in use, but in the process of estimating risk and designing a zone and conduit diagram it has no added value.
As the next step, ISA 62443-3-2 does a detailed risk assessment to identify the risk controls that reduce the risk to a tolerable level. This is another flaw, because a design should be based upon the asset owner’s risk appetite, which is the boundary of the acceptable risk – the risk we are happy to accept without much action. The tolerable risk level is linked to the risk tolerance level; if we go above this we get unacceptable risk. An iterative detailed risk assessment identifying how to improve is of course correct, but we should compare with the risk appetite level, not the risk tolerance level.
If the design is based on the risk tolerance level, there is always panic when something goes wrong. Maybe our AV update mechanism fails; this would lead to an increase in risk, and if we are already at the edge because of our security design, we immediately enter an unacceptable risk situation. Whereas normally our defense in depth, not relying on a single control, would give us some peace of mind to get things fixed. So the risk appetite level is the better target for risk based security design.
Finally there is the attempt to combine the process safety risk and the cyber security risk into one RAM. This is possible, but it requires a mapping of the cyber security risk likelihood scale onto the process risk likelihood scale, because they differ. They do not necessarily differ at the qualitative level, but the underlying quantitative mitigated event frequency scale differs. We cannot say that a Medium score on the cyber security likelihood scale is equivalent to a Medium score on the process safety likelihood scale. And it is the likelihood of the cyber attack that drives the cyber related loss based risk. We have to compare this with the process safety related loss based risk to determine the bigger risk.
This comparison requires a mapping technique to account for the underlying differences in event frequency. When we do this correctly we can spot the increase in business risk due to cyber security breaches. But only an increase – cyber security does not reduce the original process safety risk that determined the business risk before we estimated the cyber security risk.
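One possible shape of such a mapping step, with both scales as illustrative assumptions (in practice they come from the asset owner's risk criteria):

```python
# Hedged sketch of mapping a cyber mitigated event frequency (events per
# year, high demand territory) onto the qualitative likelihood labels of a
# process safety RAM. Band boundaries are illustrative assumptions.

PROCESS_BANDS = [           # (upper bound in events / year, label)
    (1e-5, "Remote"),
    (1e-3, "Unlikely"),
    (1e-1, "Possible"),
    (1e+0, "Likely"),
]                           # anything at or above 1 / year: "Expected"

def likelihood_label(events_per_year):
    """Map an event frequency onto the qualitative likelihood scale."""
    for upper, label in PROCESS_BANDS:
        if events_per_year < upper:
            return label
    return "Expected"

# A cyber MEF (from a ROPA-style estimate) only lands on the process safety
# scale after this explicit mapping step, because the two underlying
# quantitative frequency scales differ.
print(likelihood_label(0.05))
```

The point is not the specific bands but the explicit translation: a cyber "Medium" and a process safety "Medium" share a label only after the underlying frequencies have been reconciled.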
So, a long story – maybe a dull topic for someone not involved in risk estimation on a daily basis – but a topic that needs to be addressed. Risk needs a sound mathematical foundation if we want to base our decisions on the results. If we miss this foundation we create the same difference as there is between astronomy and astrology.
For me, at this point in time, until otherwise convinced by your responses, the IEC / ISA 62443-3-2 method is astrology. It can make people happy going through the process, but lacks any foundation to justify decisions based on this happiness.
There is no relationship between my opinions and the references to publications in this blog, and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 42 years of work in this industry – approximately half of the time engineering these automation systems, and half of the time implementing their networks and securing them.
Author: Sinclair Koelemij
OTcybersecurity web site