Selecting subjects for a blog is always a struggle, on one side there are thousands of subjects, on the other side there is the risk to bore the reader and scare him / her away or even upset the reader. But most of my blogs are inspired by some positive or negative response on what I experience or read, the spur of the moment. Today’s blog has been simmering for a while before I wanted to write it down. This because in my daily occupation I am very much involved in risk assessment and it is complex to write about it without potentially touching intellectual property of my employer.
What inspired me to write the blog is a great post from Sarah Fluchs and a Linked-In response on my blog on Remote Access by Dale Peterson. Apart from these triggers, I was already struggling some time with parts of the Mitre Att&ck framework and some of the developments in the IEC 62443 standard.
The concept of risk is probably one of mankind’s greatest achievements abstracting our “gut feeling” about deciding to do something by rationalizing it in a form we can use for comparison or quantification.
In our daily life risk is often related with some form of danger a negative occurrence. We more often discuss the risk of being infected by Covid-19 than the risk of winning the lottery. And even if there was a risk of winning the lottery, we would quickly associate it with the negative aspects of having too much money. But from a mathematical point of view positive or negative consequence wouldn’t make a difference.
The basic formula for risk is simple (Risk = Likelihood x Impact) but once you start using it, soon many complex questions require an answer. In cyber security we use the formula Threat x Vulnerability x Impact = Risk, Threat x Vulnerability being a measure for likelihood. Though I will not use impact but use consequence / consequence severity and explain the reasons and differences later.
Don’t expect me to explain all my thoughts on this subject in a single blog, it is a very big subject so it takes time, and parts I can’t reveal to protect the commercial rights of my employer. In risk there are very superficial concepts and very complex concepts and rules, an additional hurdle to overcome is that you always get an answer but it is not easy to verify the correctness of that answer. And most of all there are hundreds of forms of risk, so we need some structure and method.
For me risk in process automation systems has only two forms: cyber security risk and mission risk. That doesn’t take away that I can split each of them looking in more detail, for example in cyber security risk I can focus on “cascading risk” the risk that malware propagates through the system, or I can look at risk of unauthorized access, and many more. Mission risk I can break down into such elements as safety risk, reputation risk, risk of operational loss, risk of asset damage, etc. I explain these in more detail later.
You can also group risk to facilitate monitoring the risk, when we do this we build a risk register. If the risk methodology offers sufficient detail it becomes possible to determine the contribution of a specific risk control in reducing the risk. The moment you have established a grip on the risk, it becomes possible to manage the risk and justify decisions based on risk. And you can monitor how changes in the threat landscape impact your plant’s risk.
In process safety the industry does this all the time, because process safety is an in my view more mature discipline. In cyber security the majority of the asset owners has not reached that level of maturity to work with cyber security based on risk. In those cases cyber security is established through creating checklists and conformance to standards specifying lists of requirements, there is a compliance focus in the expectation that the standard will offer security by itself.
Standards and checklists are not taking site and asset specific elements into account, it is easy to either overspend or under spend on cyber security when following standards and checklists. But to start with checklists and standards provide a good base, over time organizations mature and look at cyber security in a more structural way, a way based on risk.
Okay, long intro but when I discuss a subject I don’t like to forget the historic context so let’s discuss this briefly before we become technical.
In the introduction I mentioned that there is a relationship between an organization’s maturity and using risk. This shouldn’t suggest that risk was not considered. My first participation in a risk assessment was approximately 15 years ago. I was asked to participate as a subject matter expert in the risk assessment workshops of an asset owner, my expertise was related to how the control systems of my employer function.
This was the first time I became acquainted on how an asset owner looked at risk. The formulas used, the method followed, the discussions, all made sense for me. I was convinced, we were heading for a great result.
Unfortunately, this turned out to be a dream, the risk results didn’t offer the granularity required to make decisions or to manage cyber security risk. It was almost all high risk, not enough to differentiate decisions on. Though the results failed and the company later took an approach of creating an extensive set of policies to implement security, they didn’t consider it a full failure because the process itself was valuable.
Less than a year later I was asked in the same role for another asset owner, formulas differed a bit but unfortunately the results didn’t. Beware this was all prior to Stuxnet, that changed the view on OT security considerably, the focus was still on preventing malware and keeping hackers out. The idea of a highly sophisticated attack aiming at the process installations was not considered, every installation is unique that will never work was the idea.
But advanced attacks did work we learned in 2010, and we have seen since that it became the standard. Modular malware has been developed supporting the attackers in manipulating process automation functions.
We have seen since than: power outages, damaged blast furnaces, attempts to change the program logic in a safety controller, and several other incidents.
In parallel with these occurrences the interest in risk grew again, it became important to know what could happen, what scenarios were more likely than other scenarios. There was a drive to change the reactive mode and become pro-active.
The methods used had matured and objectives had been adapted. Risk for a medical insurance company has a temporal relation, they want to know how many people will have cancer in 2030 so they can estimate the medical cost and plan their premium, driven by a quantitative approach.
Cyber security and mission risk have a different objective, the objective here is to determine which risk is highest and what can we do to reduce this risk to a level that the business can accept. So it is comparative risk, not aiming at the cost of malware infections over 5 years, but prioritizing and justifying risk reduction controls. It allowed for applying hybrid methods, partially qualitative and partially quantitative.
Globally I still see differences in adopting risk as a security driver, some regions have developed more toward using risk as a management instrument than other regions. But personally I expect that the gap will close and asset owners will embrace risk as the core driver for their cyber security management system. They will learn that communications with their senior management will be easier, because business managers are used to work with risk. Now let’s become technical.
I already mentioned in the introduction that for me cyber security risk and mission risk differ, however we can and have to translate cyber security risk into mission risk. Plants think in terms of mission risk. But we can express how mission risk is influenced by the cyber threat and adjust the risk for it.
The how I will not address in this blog, would be too lengthy and too early in this story to explain. But I do like to have a closer look at mission risk (or rather the mission risk criteria) before discussing cyber security risk in more detail.
When discussing risk it is necessary to categorize it and formulate risk criteria. I created following diagram with some imaginary criteria as an example.
The diagram shows six different impact criteria and five different impact levels. I added some names to the scales but even these vary in real life. Fully imaginary values coming from my left thumb, but reasonable values depending on the industry. Numbers differ very much by individual plant and by industry. Service continuity is expressed in the example in the form of perhaps a power provider, but could as well be expressed (different criteria, categories) to address the impact of down-stream or upstream plants of a specific cyber security hazard. One thing is for certain, every plant has one. Plants are managed by mission risk, they have established their risk criteria. But I hope you also see that there is no relationship with a cyber security hazard at this level, it is all business parameters.
You might say when the cyber attack results in a loss of view (You will see later that I don’t use this type of loss because it doesn’t offer the detail required) there will be operational loss. Can we link this to the impact scale? Very difficult.
Over 30 years ago I was involved in an incident at a refinery, the UPS of the plant failed and in its failure it also caused the main power to fail. This main power was feeding the operator stations, but not the control equipment. So suddenly the operators lost their view on the process. Incident call out was made and within two hours power was restored and operator stations reloaded. Operators had full view again, inspecting the various control loops and smiling everything was fine. So full loss of view, but no operational loss.
Okay, today’s refineries have far more critical processes than in those days, but because of that also has far more automation. Still the example shows the difficulty to link such a high level consequence (Loss of View) to an impact level. Loss of View is a consequence not an impact as MITRE classifies it, the impact would have been some operational loss or damaged equipment.
Similar story I can tell on what MITRE calls Loss of Safety, correct expression would be Loss of Functional Safety because safety is a lot more than functional safety alone. Some plants have to stop when the SIS would fail, other plants are allowed to run for a number of hours without the SIS. Many different situations exist, each of them translating differently to mission risk. So we need to make a clear difference between impact and consequence. The MITRE model doesn’t do so, the model identifies “Inhibit Response Function”, “Impair Process Control” and Impact and mix as such many terms. I think reason for this is that MITRE didn’t structure the model following a functional model of an ICS. This is where I value Sarah’s post so much, “Think in Functions, not in systems”. So let’s go to explaining the term “consequence” and explain the relationship with function.
In principle a threat actor launches a threat action against an exposed vulnerability, which results in a deviation in functionality. This deviation depends in an automation system very much on the target. Apart from this most targets conduct multiple functions, to make it even more complex several functions are created by multiple assets. And when this is the case the channels connecting these assets contribute to risk. This collaboration can even be cross-system.
An operator station or an instrument asset management system, or a compressor control system, or a flow transmitter, all have a different set of functions and it is the threat actor’s aim to use these in his attack. Ultimately there are two consequences for these functions, the function doesn’t meet its design or operations intent or the function is lost so no function at all. See also my very first blog.
When the function doesn’t meet its design or operations intent, the asset integrity discipline calls this “Loss of Required Performance”, when the function fully fails this is called “Loss of Ability to Perform”. Any functional deviation in a process automation system can be expressed in one of the two and these are high level consequences of a cyber attack.
For each asset in the ICS the detailed functionality is known, the asset owner purchases it to perform a certain function in the production process. The process engineer uses all or a part of the functions available. When these functions are missing or doing other things than expected, this results in mission impact. Which impact requires further refinement of the categories and understanding of the production process and equipment used.
We can further sub-categorize (six sub-categories are defined for each) these consequence categories allow us to link them to the production process (part of the plant’s mission).
This is also required, the MITRE categories “Impair Process Control” and “Modify Parameter” don’t make the link to mission impact. Changing the operating window of a control loop (design parameter) has a different impact than changing the control response of a control loop. These parameters are also differently protected and sometimes reside in different parts of the automation system. By changing the operating window the attacker can cause the production process to go outside its design limitations (e.g. too high pressure, too high temperature or level), where if the attacker changes a control parameter the response to changes in the process is impacted.
Depending on the process unit where this happens, mission impact differs. So a cyber attack causes a functional deviation (I ignore the confidentiality part here, see for that mentioned blog). A functional deviation is a consequence, a consequence we can give a severity score (how this is done is not part of this blog), and we can link consequence to mission impact.
Cyber security risk is also based upon the likelihood that a specific cyber security hazard occurs in combination with consequence severity. Mission risk resulting from the cyber security hazard is based upon the likelihood of the cyber security hazard and mission impact caused by the consequence.
Estimating the likelihood and the risk reduction as function of the countermeasures and safeguards is the trick here. Maybe I will discuss that in a future blog, it is by far the most interesting part in risk estimation. But in today’s world of risk analysis, using different names by different vendors such as “security process hazard analysis” or “cyber security HAZOP”, all do similar things. Create an inventory of cyber security hazards, estimate the inherent and residual risk based upon the assets installed and the channels (protocols) used for the communication between the assets, the architecture of the ICS, and the countermeasures and safeguards installed or advised. The difference is in how extensive and detailed is the cyber security hazard repository (the better you know the system (use functions and internal functions), the more extensive and detailed the repository)
Long story on the concept of risk driven cyber security, focusing on function, consequence and impact. Contrary to a checklist based approach a risk driven approach provides the “why” we secure and compares the various ways to reduce the risk, and provides a base to manage and jusity security based on risk.
Not that I think my blog on remote access security is an irresponsible suggestion that a list of controls would be sufficient to take away all risk, certainly not. But sometimes a rule or checklist is a quick way out, and risk a next phase. When teaching a child to cross the street it is far more easy to supply a rule to cross it in a straight line than discuss the temporal factor dynamic exposure creates when crossing the road over a diagonal path. Not wanting to say asset owners are children, just wanting to indicate that initially the rule works better than the method.
Off course the devil is in the detail, especially the situational awareness rule in the list requires knowledge of the cyber security hazards and indicators of compromise. But following the listed rules is already a huge step forward compared to how remote access is sometimes implemented today.
There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.
Author: Sinclair Koelemij