This blog is about risk, more precise about a methodology to estimate risk in cyber physical systems. Additionally I discuss some of the hurdles to overcome when estimating risk. It is a method used in both small (< 2000 I/O) and large projects (> 100.000 I/O) with proven success in showing the relationship between different security design options, the cyber security hazards, and the change in residual cyber security risk.
I always thought the knowledge of risk and gambling would go hand in hand, but risk is a surprisingly “recent” discovery. While people gamble thousands of years, Blaise Pascal and Pierre de Fermat developed the risk methodology as recently as 1654. Risk is unique in the sense that it allowed mankind for the first time to make decisions based on forecasting the future using mathematics. Before the risk concept was developed, fate alone decided over the outcome. Through the concept of risk we can occasionally challenge fate.
Since the days of Pascal and De Fermat many other famous mathematicians contributed to the development of the theory around risk. But the basic principles have not changed. Risk estimation, as we use it today, was developed by Frank Knight (1921) a US economist.
Frank Knight defined some basic principles on what he called “risk identification”, I will quote these principles here and discuss them in the context of cyber security risk for cyber physical systems. All mathematical methods today estimating risk still follow these principles. There are some simple alternatives that estimate likelihood (this is generally the difficulty) of an event using some variables that influence likelihood (e.g. using parameters such as availability of information, connectivity, and management) but they never worked very accurate. I start with the simplest of all, principle 1 of the method.
PRINCIPLE 1 – Identify the trigger event
Something initiates the need for identifying risk. This can be to determine the risk of a flood, the risk of a disease, and in our case the risk of an adverse affect on a production process caused by a cyber attack. So the cyber attack on the process control and automation system is what we call the trigger event.
PRINCIPLE 2 – Identify the hazard or opportunity for uncertain gain.
This is a principal formulated in a way typical for an economist. In the world of process control and automation we focus on the hazards of a cyber attack. In OT security a hazard is generally defined as a potential source of harm to a valued asset. A source of discussion is if we define the hazard at automation system level or at process level. Ultimately we of course need the link to the production process to identify the loss event. But for an OT cyber security protection task, mitigating a malware cascading hazard is a more defined task than mitigating a too high reactor temperature hazard would be.
So for me the hazards are the potential changes in the functionality of the process control and automation functions that control the physical process. Or the absence of such a function preventing manual or automated intervention when the physical process develops process deviations. Something I call Loss of Required Performance (performance deviates from design or operations intent) or Loss of Ability to Perform (function is lost, cannot be executed or completed), using the terminology used by the asset integrity discipline.
PRINCIPLE 3 – Identify the specific harm or harms that could result from the hazard or opportunity for uncertain gain.
This is about consequence. Determining the specific harm in a risk situation must always precede an assessment of the likelihood of that harm. If we would start with analyzing the likelihood / probability, we would quickly be asking ourselves questions like “the likelihood of what?” Once the consequence is identified it is easier to identify the probability. In principal a risk analyst needs to identify a specific harm / consequence that can result from a hazard. Likewise the analyst must identify the severity or impact of the consequence. Here starts the first complexity when dealing with OT security risk. In the previous step (PRINCIPLE 2) I already discussed the reason for expressing the hazard initially at control and automation system level to have a meaningful definition I can use to define mitigation controls (Assuming that risk mitigation is the purpose of all this). So for the consequence I do the same I split the consequence of a specific attack on the control and automation system from the consequence for the physical production. When we do this we no longer have what we call a loss event. The consequence for the physical system results in a loss, like no product, or a product with bad quality, or worse perhaps equipment damage or fire and explosion, possibly injured people or casualties, etc.
The answer for this is, what is called a risk priority number. A risk priority number is based upon what we call consequence severity (just a value on a scale). Where “true” risk would be based on an impact expressed in terms of loss. A risk priority number can be used for ranking identified hazards, they cannot be used for justifying investments. For justifying investments we need to have a risk value based upon a loss. But this step can be achieved later. Initially I am interested in selecting the security controls that contribute most to reducing the risk for the control and automation system. Convincing business management to invest in these controls is a next step. To explain this, I use the following picture.
In the middle of the picture there is the functional impact, the deviation in the functionality of the control and automation system. This functional deviation results in a change (or absence off) the control and automation action. This change will be the cause of a deviation in the physical process. I discuss this part later.
PRINCIPLE 4 – Specify the sequence of events that is necessary for the hazard or opportunity for uncertain gain to result in the identified harm(s).
Before we can estimate the uncertainty, the likelihood / probability, we need to identify the specific sequence of events that is necessary for the hazard to result in the identified consequence. The likelihood of that precise sequence occurring will define the probability of the risk. I can use the word risk here because this likelihood is also the likelihood we need to use for the process risk, because it is the cyber-attack that causes the process deviation and the resulting consequence. (See above diagram)
The problem we face that there are many paths leading to from the hazard to the consequence. We need to identify each relevant pathway. On top of this as cyber security specialists we need to add various hurdles for the threat actors to block them reaching the end of the path, the consequence of the attack. This is where counterfactual risk analysis offers the solution. This new methodology helps us achieve this. The method analysis each possible path, based upon a repository filled with hundreds of potential event paths, and estimates the resulting likelihood of each path. Which is the next topic, PRINCIPLE 5.
PRINCIPLE 5 – Identify the most significant uncertainties in the preceding steps.
We can read the time when this statement was written in the sentence “identifying the most significant uncertainties”. In times before counterfactual analysis we needed to limit the number of paths to analyze. This can lead to and actually did lead to incidents because of missing an event path that was considered insignificant or just not identified (e.g. the Fukushima Daiichi nuclear incident). The more complex the problem, the more event paths exist, the easier we forget one. Today the estimation of likelihood and so risk progressed and is dealt with differently. Considering the complexity of the control and automation systems we have today combined with the abundance of tactics, technologies, and procedures available for the threat actor to attack, the number of paths to analyze is very high. Traditional methods can only cover a limited amount of variations, generally obvious attack scenarios we are familiar with before we start the risk analysis. The result of the traditional methods do not offer the level of detail required. Such a method would spot the hazard of malware cascading risk, the risk that malware propagates through the network. But it is not so important to learn how high malware cascading risk is, it is more important to know if it exists, which assets and channels cause it, and which security zones are affected. This information results from the event paths used in described method.
These questions require a risk estimation with a level of detail missed by the legacy methods. This is specifically very important for OT cyber security, because the number of event paths leading to deviation of a specific control and automation function is much larger than for example the number of event paths identified in process safety hazard analysis. An average sized refinery quickly leads to over 10.000 event paths to analyze.
Still we need “true” risk, risk linked to an actual loss. So far we have determined the likelihood for the event paths, we have grouped these paths to link them to hazards, so we have a likelihood for a hazard and we have a likelihood that a specific consequence can happen. Happily we can consolidate the information at this point, because we need to assign severity. Consequences (functional deviations) can be consolidated in what are called failure modes.
These failure modes result in the deviations in the production process. The plant has conducted a process safety hazop (process hazard analysis for US readers) to identify the event paths for the production system. The hazop identifies for a specific physical deviation (e.g. too high temperature, too high pressure, reverse flow, etc.) what the cause could be of this deviation and what the consequence for the production system is. These process event paths have a relationship with the failure modes / consequences identified by the first part of the risk analysis. A specific cause is can only result from a specific failure mode. We can link the cause to the failure mode and get what is called the extended event path (See diagram above) This provides us with part of the production process consequences. These consequences have an impact, an actual loss to get the mission risk required for justification of cyber security investment.
But the hazop information does not provide all possible event paths because there might be a new malicious combination of causes missed (causes can be combined by an attacker in a single attack to create a bigger impact) and we can attack the safeguards. For example we have the safety instrumented system that implements part of the countermeasures that can become a new source of risk.
To explain the role of a SIS, I use above picture to show that OT cyber security has a limited scope within overall process safety (And it would be even more limited if I used the word safety that defines personal safety, process safety, and functional safety). Several of the safeguards specified for the process safety hazard might not be a programmable electronic system and as such not a target for a cyber attack. But some such as the safety instrumented system, or a boiler management system are, so we need to consider them in our analysis and add new extended event paths where required. TRISIS / TRITON showed us SIS is a source of risk.
Since the TRISIS / TRITON cyber attack we need to consider SIS also as a source of new causes most likely not considered in a hazop. The TRISIS/TRITON attack showed us the possibility of modifying the program logic of the logic solver. This can range from simple actions like not closing shutdown valves prior to opening blow down valves and initiating a shutdown action to more complex unit or equipment specific scenarios. Though at operations level we distinguish between manual and automated emergency shutdown, for cyber security we cannot make this difference. Automated shutdown meaning the the shutdown action is triggered by a measured trip level and manual shutdown meaning that the shutdown is triggered by a push button, within the SIS program it is all the same. Once a threat actor is capable of modifying the logic, the difference between manual and automated shutdown disappears and even the highest level of ESD (ESD 0) can be initiated, shutting down the complete plant, potentially with tampered logic.
If we would look at what would be the ultimate loss resulting from a cyber attack, The “only” loss not caused by a cyber attack are so far fire, explosion, and loss of life. This is not because a cyber attack has not the capability to cause these losses, but we were primarily lucky that some attacks failed. Let’s hope we can keep it that way by adequately analyzing risk and mitigating the residual risk to a level that is acceptable / tolerable.
I don’t want to make the blog too long, but in future blogs I might jump back to some of these principles. There is more to explain on the number of risk levels, how to make risk actionable, etc. If you would unravel the text and add some more detail that I didn’t discuss the used risk method is relatively simple as the next diagram shows.
This model is used by the Norwegian offshore industry for emergency preparedness analysis. A less complex analysis as a cyber security analysis is but that difference is primarily in how the risk picture is established. This picture is from the 2010 version (rev 3) but not that different from the rev 2 version (2001) that is freely available on the Internet. This model is also very similar to ISO 31000 shown in the next diagram.
If you read how and where these models are used and how field proven the models are, also in the control and automation world, might explain a bit how surprised I was when I noticed IEC/ISA 62443-3-2 invented a whole new approach with various gaps. New is good when existing methods fail, but if methods exist that meet all requirements for a field proven methodology I think we should use these methods. Plants and engineers don’t benefit from reinventing the wheel. I am adding IEC to the ISA 62443 because last week IEC approved the standard.
I didn’t make this blog to continue the discussion I started in my previous blog, though actually there was no discussion no counter arguments were exchanged – neither did I change my opinion, but to show how risk can / was / is used in projects is important. Specifically because the group of experts doing formal risk assessments is extremely small. Most assessments end up in a short list of potential cyber security risk without identifying the sources of this risk in an accurate manner. In those situations it is difficult understand which countermeasures and safeguards are most effective to mitigate the risk. It also would not provide the level of detail necessary for creating a risk register for managing cyber security based on risk.
There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge build up over 42 years of work in this industry. Approximately half of the time working in engineering these automation systems, and half of the time implementing their networks and securing them.
Author: Sinclair Koelemij
OTcybersecurity web site