In my previous blog (Why process safety risk and cyber security risk differ) I discussed the differences between process safety risk and cyber security risk and why these two risks don’t align. In this blog I would like to discuss some of the cyber security risk criteria, why they differ from process safety risk criteria and how they align with cyber security design objectives. And as usual I will challenge some of the IEC 62443-3-2 misconceptions.
Cyber security can be either prescriptive or risk based. An example of a prescriptive approach is IEC 62443-3-3; the standard is a list of security requirements to follow in order to create resilience against a specific cyber threat. For this IEC 62443 uses security levels with the following definitions:
- SL 1 – Prevent the unauthorized disclosure of information via eavesdropping or casual exposure.
- SL 2 – Prevent the unauthorized disclosure of information to an entity actively searching for it using simple means with low resources, generic skills and low motivation.
- SL 3 – Prevent the unauthorized disclosure of information to an entity actively searching for it using sophisticated means with moderate resources, IACS specific skills and moderate motivation.
- SL 4 – Prevent the unauthorized disclosure of information to an entity actively searching for it using sophisticated means with extended resources, IACS specific skills and high motivation.
The criteria that create the difference between the security levels are the means, resources, skills and motivation of the entity. A few examples to translate the formal text from the standard to what is happening in the field.
An example of an SL 1 level of resilience is protection against a process operator looking over the shoulder of a process engineer to retrieve the engineer’s password. Maybe you don’t consider this a cyber attack, but formally it is a cyber attack by a hopefully non-malicious insider.
An example of an SL 2 level of resilience is protection against malware, for example a process control system that became infected by the WannaCry ransomware. WannaCry didn’t target specific asset owners and did not specifically target process control systems. Any Microsoft Windows system exposing the vulnerability exploited by EternalBlue could become a victim.
An example of an SL 3 level of resilience is protection against the attack on Ukraine’s power grid. This was a targeted attack, using a mixture of standard tactics like phishing to get access to the system and specially crafted software to make modifications in the control system.
An example of an SL 4 level of resilience would be the protection against Stuxnet, the attack on Iran’s uranium enrichment facilities. This attack made use of zero-day vulnerabilities, required detailed knowledge of the production process, and required thorough testing of the attack on a “copy” of the actual system.
The difference between SL 2 and SL 3 is clear, however the difference between SL 3 and SL 4 is a bit arbitrary. For example, is the Triton attack an example of SL 3 or can we say it is SL 4? Like Stuxnet, the Triton attack required a significant investment from the threat actors. But where Stuxnet had clear political motives, the motives for the Triton attack are less clear to me. SL 4 is often linked to resilience against attacks by nation-states, but with all these “nation state sponsored” threat actors that difference often disappears.
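The definitions of SL 2 to SL 4 differ only in the means, resources, skills and motivation of the entity. As a minimal sketch, the Python below puts the wording of the definitions in a table and lists which criteria change between two levels. The class, table and function names are illustrative assumptions and not part of IEC 62443.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreatProfile:
    means: str       # "simple" or "sophisticated"
    resources: str   # "low", "moderate" or "extended"
    skills: str      # "generic" or "IACS specific"
    motivation: str  # "low", "moderate" or "high"


# Wording taken from the SL 2, SL 3 and SL 4 definitions quoted above.
SL_PROFILES = {
    2: ThreatProfile("simple", "low", "generic", "low"),
    3: ThreatProfile("sophisticated", "moderate", "IACS specific", "moderate"),
    4: ThreatProfile("sophisticated", "extended", "IACS specific", "high"),
}

CRITERIA = ("means", "resources", "skills", "motivation")


def differing_criteria(sl_a: int, sl_b: int) -> list[str]:
    """Return the criteria whose wording differs between two security levels."""
    a, b = SL_PROFILES[sl_a], SL_PROFILES[sl_b]
    return [c for c in CRITERIA if getattr(a, c) != getattr(b, c)]


print(differing_criteria(2, 3))
print(differing_criteria(3, 4))
```

Between SL 2 and SL 3 all four criteria change, between SL 3 and SL 4 only resources and motivation change, which fits the observation that this last boundary is harder to draw.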
The IEC 62443-3-3 standard assigns the security levels to security zones and conduits of the control system for defining the technical security requirements for the assets in the zone and communication between zones. The targets for a threat actor are either an asset or a channel (the protocols flowing through the conduit). Though the standard also attempts to address risk with IEC 62443-3-2, there is no clear path from risk to security level. There is no “official” transformation matrix that converts a specific risk level into a security level, quite apart from the many issues such a matrix would have. So overall IEC 62443-3-3 is a prescriptive form of cyber security. This is common for standards and government regulations, which need clear instructions.
The issue with prescriptive cyber security is that it doesn’t address the specific differences between systems. It is a kind of ready-to-wear suit: overall it does a good job, but a tailor-made suit fits better. The tailor-made suit for cyber security is risk based security. In my opinion risk based security is a must for process automation systems supporting the critical infrastructure. Threat actors targeting critical infrastructure are highly skilled professionals, we should not dismiss them as people hacking for fun. The offensive side is also a profession, either servicing a “legal” business such as the military or an “illegal” business such as cyber crime. And this cyber crime business is big business, it is for example bigger than the drugs trade.
Protection against the examples I gave for SL 3 and SL 4 requires a detailed analysis of the attack scenarios available to launch against the process installation, analyzing which barriers are in place to stop or detect these scenarios, and estimating the residual risk after having applied the barriers to see if we meet the risk criteria. For this we need residual risk, the risk after applying the barriers, and residual risk requires quantitative risk analysis to estimate the risk reduction similar to how process safety does this with LOPA as discussed in my previous blog.
Quantitative analysis is required because we want to know the contribution in risk reduction from the various barriers. Barriers are the countermeasures with which we can reduce the likelihood that the attack succeeds, or with which we reduce / take away the impact severity of the consequences of a successful cyber attack. A simple example of a likelihood reducing barrier is an antivirus solution, a simple example of an impact severity reducing barrier is a back-up. These barriers are not limited to various security packages but also include design choices in the automation system and the production facility. The list of barriers I work with in my professional life has grown to over 400 and keeps growing with every detailed risk assessment. The combination of attack scenarios (the TTP of the threat actors), barriers and consequences (the deviations of the process automation functions as a result of a successful cyber attack) is the basis for a risk analysis.
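As a minimal sketch of how the contribution of barriers could be combined, the Python below multiplies the fraction that each barrier lets through. The `Barrier` class, the effectiveness values and the assumption that barriers act independently are illustrative simplifications, not the method of an actual assessment, where the effectiveness figures come out of the analysis itself.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Barrier:
    name: str
    kind: str             # "likelihood" or "impact"
    effectiveness: float  # assumed fraction stopped or absorbed, 0.0 to 1.0


def residual(likelihood: float, severity: float,
             barriers: list[Barrier]) -> tuple[float, float]:
    """Return (residual likelihood, residual severity) after the barriers.

    Assumes the barriers are independent, which is a simplification.
    """
    pass_likelihood = prod(1.0 - b.effectiveness
                           for b in barriers if b.kind == "likelihood")
    pass_impact = prod(1.0 - b.effectiveness
                       for b in barriers if b.kind == "impact")
    return likelihood * pass_likelihood, severity * pass_impact


# Illustrative numbers only, not taken from a real assessment.
barriers = [
    Barrier("antivirus", "likelihood", 0.6),
    Barrier("back-up", "impact", 0.5),
]
likelihood_after, severity_after = residual(0.5, 100.0, barriers)
print(likelihood_after, severity_after)
```

Keeping likelihood reducing and impact reducing barriers apart makes it visible which side of the risk each barrier works on.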
The value for consequence severity can either be a severity level assigned by a subject matter expert to the functional deviation (resulting in a risk priority number (RPN)) or the severity scores / target factors of the worst case process scenarios that can result from the potential functional deviation. These severity scores / target factors come from the HAZOP or LOPA sheets created by the process safety engineers during their analysis of what can go wrong with the process installation and what is required to control the impact when things go wrong. When we use LOPA / HAZOP severity scores for consequence severity we call the resulting risk loss based risk (LBR), the risk that a specific financial, safety, or environmental loss occurs. For security design a risk priority number is generally sufficient to identify the required barriers, however to justify the risk mitigation (investment in those barriers) to management and regulatory organizations loss based risk is often a requirement.
To carry out a detailed risk assessment the risk analyst needs to cover a wide field of skills: understanding the automation system’s working in detail, understanding the process and safety engineering options, and understanding risk. Therefore risk assessments are teamwork between different subject matter experts, however the risk analyst needs to connect all input to get the result.
But risk assessment is also a very rewarding cyber security activity, because communicating risk to management is far easier than explaining to them why we invest in a next generation firewall instead of the lower cost packet filter. Senior management understands risk far better than cyber security technology.
So much for the intro, now let’s look at the risk criteria. In their approach to cyber risk, ISA / IEC tend to look closely at how process safety approaches the subject and attempt to copy it.
However, as explained in my previous blog, there are major differences between cyber security risk and process safety risk, making such a comparison more complex than it seems. Many of these differences also have an impact on the risk criteria and the terminology we use. I would like to discuss two terms in this blog, unmitigated risk and risk tolerance. Let’s start with unmitigated risk.
Both ISA 62443-3-2 and the ISA TR 84 work group use the term unmitigated risk. As an initial action they want to determine unmitigated risk and compare this with the risk criteria. Unmitigated risk is a term used by process safety for the process safety risk prior to applying the safeguards protecting against randomly occurring events that could cause a specific worst case process impact. The safeguards protect against events like a leaking seal, a failing pump, or an operator error. The event frequency of such an initiating event (e.g. a failed pump) is reduced by applying safeguards; the resulting event frequency after applying the safeguard needs to meet specified limits (see my previous blog). Safety engineers basically close an event frequency gap using safeguards with a reliability that actually accomplishes this. The safeguard will bring the process into a safe state by itself. Multiple safeguards can exist but each safeguard will prevent the worst case consequence by itself. It is not that safeguard 1 and safeguard 2 need to accomplish the job together, each of them individually does the job. There might be a preferred sequence but not a dependency between protection layers.
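The frequency reduction described here can be sketched with the usual LOPA arithmetic: the initiating event frequency multiplied by the probability of failure on demand (PFD) of each independent protection layer. The function names, frequencies, PFD values and the target limit below are illustrative assumptions, not values from a real study.

```python
from math import prod


def mitigated_frequency(initiating_per_year: float, pfds: list[float]) -> float:
    """Event frequency per year after independent protection layers."""
    return initiating_per_year * prod(pfds)


def frequency_gap_closed(initiating_per_year: float, pfds: list[float],
                         target_per_year: float) -> bool:
    """True when the mitigated frequency meets the specified limit."""
    return mitigated_frequency(initiating_per_year, pfds) <= target_per_year


# Hypothetical example: a pump failing once per 10 years,
# safeguards with a PFD of 0.01 each, and a limit of 1e-4 per year.
one_layer = frequency_gap_closed(0.1, [0.01], 1e-4)
two_layers = frequency_gap_closed(0.1, [0.01, 0.01], 1e-4)
print(one_layer, two_layers)
```

With these assumed numbers a single layer leaves a gap and a second layer closes it. The point of the sketch is that each layer is credited with a reliability figure, which is exactly what is hard to do for a cyber security barrier.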
Cyber security doesn’t work that way. We might define a specific initiating event frequency for the chance of a virus infection, but after installing an antivirus engine we cannot say the problem is solved and no virus infections will occur. The reliability of a process safety barrier is an entirely different factor than the effectiveness of a cyber security barrier. Both reduce risk, but the process safety barrier reduces the risk to the required limit by itself, with a reliability that allows us to take the credit for the risk reduction. The cyber security barrier most likely requires additional barriers to reach an acceptable event frequency / risk reduction.
Another difference is that in cyber security the initiating event is not occurring randomly, it is an intentional action of a threat actor with a certain propensity (a term introduced by Terry Ingoldsby from Amenaza) to target the plant, select a TTP and target a specific function. The threat actor has different TTPs available to accomplish the objective, so the initiating event frequency is not so much one point on the frequency scale but a range with a low and a high limit. We cannot pick the high limit of this range (representing the highest risk) beforehand because the process control system’s cyber resilience actually determines if a specific TTP can succeed, so we need to “exercise” a range of scenarios to determine the highest event frequency (highest likelihood) applicable within the range.
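“Exercising” a range of scenarios can be sketched as follows: each TTP gets an estimated event frequency and a set of barriers that would stop it, and the installed barriers decide which TTPs remain feasible. The scenario names, the frequencies and the rule that any single listed barrier stops a TTP are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    ttp: str
    frequency: float       # assumed events per year if nothing stops the TTP
    stopped_by: frozenset  # barrier names; any one of them blocks the TTP


def frequency_range(scenarios, installed: set) -> Optional[tuple]:
    """Return (low, high) event frequency over the TTPs that can still succeed."""
    feasible = [s.frequency for s in scenarios
                if not (s.stopped_by & installed)]
    if not feasible:
        return None
    return min(feasible), max(feasible)


# Hypothetical scenarios and numbers.
scenarios = [
    Scenario("malware via USB", 0.5,
             frozenset({"USB protection", "application whitelisting"})),
    Scenario("phishing, then remote access", 0.2,
             frozenset({"network segmentation"})),
    Scenario("custom exploit of the controller", 0.01, frozenset()),
]

print(frequency_range(scenarios, installed=set()))
print(frequency_range(scenarios, installed={"USB protection"}))
```

The high limit of the range moves when a barrier is installed, which is why it cannot be picked before the resilience of the system has been analyzed.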
A third difference is the defense in depth. In process safety defense in depth is implemented by multiple independent protection layers. For example there might be a safety instrumented function configured in a safety system, but additionally there can be physical protection layers like a dike or other forms of bunding to control the impact. In cyber security we also apply defense in depth, it is considered bad practice if security depends on a single control to protect the system. However many of these cyber security controls share a common element, the infrastructure / computer. We can have an antivirus engine and we can add application whitelisting or USB protection to further reduce the chance that malware enters the system but they all share the same computer platform offering a shared point of failure.
Returning to the original remark on unmitigated risk: in cyber security for process automation, unmitigated risk doesn’t exist. The moment we install a process control system we install some inherent protection barriers. Examples are authentication / authorization, various built-in firewalls, encrypted messages, etc. So when we start a cyber security risk analysis there is no unmitigated risk, we can’t ignore the various barriers built into the process automation systems. A better term to use is therefore inherent risk, the risk of the system as we analyze it. The inherent risk is also a residual risk; it is not unmitigated because a range of barriers is already implemented when we start the risk analysis. The question is rather whether the inherent risk meets the risk criteria and, if not, which barriers are required to arrive at a residual risk that does meet the criteria.
The second term I would like to discuss is risk tolerance. Both IEC 62443 and the ISA TR 84 work group state that cyber security must meet the risk tolerance. I fully agree with this; where we differ is that I don’t see risk tolerance as the cyber security design target, whereas the IEC 62443 standard does. In risk theory, risk tolerance is defined as the maximum loss a plant is willing to experience. To further explain my issue with the standard I first discuss how process safety uses risk tolerance, using an F-N diagram from my previous blog.
This F-N curve shows two limits using a red and a green line. Anything above the red line is unacceptable, anything below the green line is acceptable. The area between the two lines is considered tolerable if it meets the ALARP (As Low As Reasonably Practicable) principle. For a risk to be ALARP, it must be possible to demonstrate that the cost involved in reducing the risk further would be grossly disproportionate to the benefit gained. Determining that a risk is reduced to the ALARP level involves an assessment of the risk to be avoided, an assessment of the investment (in money, time and trouble) involved in taking measures to avoid that risk, and a comparison of the two.
In process safety, where the events occur randomly, this is possible; in cyber security, where the events occur intentionally, this is far more difficult.
Can we consider the risk for the power grid in a region with political tensions the same as in a region having no conflicts at all? I worked for refineries that had as a requirement to be resilient against SL 2 level threat actors, but also for refineries that told me they wanted SL 3, and there was even a case where the requirement was SL 4. So in cyber security the ALARP principle is not really followed, there is apparently another limit. The risk tolerance limit is the same for all, but there is also a limit that sets a kind of cyber security comfort level. This level we call the risk appetite, and this level should actually become our cyber security design target level.
Risk appetite and risk tolerance shouldn’t be the same, there should be a margin between the two that allows for the possibility to respond to failures of the security mechanisms. If risk appetite and risk tolerance were the same and this were also our design target, any issue with a security control would result in an unacceptable risk.
In principle unacceptable risk means we should stop the production (as is the case for process safety), so if the limits were the same we would have created a kind of on/off mechanism. For example if we couldn’t update our antivirus engine for some reason, risk would rise above the limit and we would need to stop the plant. Maybe an extreme example and certainly not a situation I see happening in real life. However when risk is no longer acceptable we should have an action defined for this case. There are plants that don’t have a daily update but follow a monthly or quarterly AV signature update frequency (making AV engine effectiveness very low with all the polymorphic viruses), so apparently risk appetite differs.
If we want to define clear actions / consequences for each risk level we need sufficient risk levels to do this and limits that allow us to respond to failing cyber controls. So we need two limits, risk appetite and risk tolerance. The difference between risk appetite and risk tolerance is called risk capacity, having a small risk capacity means that issues can escalate quickly. A security design must also consider the risk capacity in the selection of barriers, “defense in depth” is an important security principle here because this increases risk capacity.
The example risk matrix above (which uses different criteria than the F-N diagram) has 4 risk levels (A, TA, TNA, NA). Typically 4 or 5 risk levels are chosen and for each level the criteria and action are specified. In the example the risk tolerance level is the line that separates TNA and NA. Risk appetite might be either the line between TA and TNA or, for a very risk averse asset owner, the line between A and TA. However the risk capacity of a security design where the risk appetite is defined at the TA / TNA border is much smaller than if our target were the A / TA border. But in both cases a risk increase due to the failure of a cyber security control will not immediately escalate into a Non Acceptable risk condition.
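The relation between the two limits and the levels of such a matrix can be sketched in a few lines of Python. The level names come from the example; the numeric ordering, the function names and the idea of counting risk capacity in whole levels are simplifying assumptions.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    A = 0
    TA = 1
    TNA = 2
    NA = 3


# Risk tolerance is the TNA / NA line, so TNA is the highest tolerable level.
TOLERANCE = RiskLevel.TNA


def risk_capacity(design_target: RiskLevel) -> int:
    """Levels between the design target (risk appetite) and risk tolerance."""
    if design_target > TOLERANCE:
        raise ValueError("design target above risk tolerance")
    return TOLERANCE - design_target


def after_control_failure(design_target: RiskLevel,
                          levels_lost: int = 1) -> RiskLevel:
    """Risk level when a failing control raises risk by `levels_lost` levels."""
    return RiskLevel(min(design_target + levels_lost, RiskLevel.NA))


for target in (RiskLevel.TA, RiskLevel.A):
    print(target.name, risk_capacity(target), after_control_failure(target).name)
```

With the appetite at the TA / TNA border the capacity is one level, at the A / TA border it is two, and in both cases a one level increase stays below NA.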
If we opt for risk based security, we need to have actionable risk levels and in that case a single risk tolerance level as specified and discussed in the IEC 62443-3-2 is just not possible and therefore not practiced for cyber security risk. The ALARP principle and cyber security don’t go together very well.
Maybe a bit of a boring topic, maybe too much text, nevertheless I hope the reader understands why unmitigated risk doesn’t exist in cyber security for process automation. And I certainly hope that the reader will understand that the risk tolerance limit is a bad design target.
There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 43 years of work in this industry. Approximately half of the time working in engineering these automation systems, and half of the time implementing their networks and securing them, and conducting cyber security risk assessments for process installations since 2012.
Author: Sinclair Koelemij
OTcybersecurity web site