Process safety risk, cyber security risk and societal risk

Introduction

I wanted my next blog to cover quantitative risk estimation, using the TRITON attack scenario published by MITRE Engenuity some months ago as the example for an estimation. But as I started writing the blog I got lost in side notes explaining the choices made.

Topics such as the difference between threat based risk and asset based risk, the regulations around societal risk, conditional likelihood, repeatable events, dependent and independent events, and threat analysis levels slipped into the text, making the blog even longer than usual for me. But these are all elements that play an important role in quantitative risk analysis, and some understanding of them is needed to follow the estimation method.

So I decided to split the blog into multiple blogs and discuss some of these topics separately before publishing the blog on the TRITON / TRISIS risk. The first topic I would like to discuss is societal risk: how it impacts process safety and, most of all, how it can impact cyber security risk criteria. But let’s start with process safety and its relationship with societal risk.

Process Safety risk and Societal risk

Societal risk is not an easy topic to discuss because it has a direct relationship to the law of a country. As such, there are differences between countries. Words can become very important: even ALARP (as low as reasonably practicable) can mean different things in different countries, and some countries don’t follow ALARP at all but work with ALARA (as low as reasonably achievable). Only one word differs, Practicable versus Achievable, but it makes a world of difference in the court room. In this blog I focus on Europe, although I encountered similar regulations in the countries of the Middle East, saw examples in Asia, and expect them to exist in the US too.

Societal risk has a direct relationship with process safety, and as such there is an indirect relationship with cybersecurity, because cybersecurity is one of the enablers of the functional safety task in a production installation such as a chemical factory or refinery.

The topic of quantitative risk analysis and its criteria is discussed in great detail in the CCPS (Center for Chemical Process Safety) publication “Guidelines for developing quantitative safety risk criteria”, actually a must read for OT cyber risk professionals. It was published in 2009, before the StuxNet (2010) and TRITON/TRISIS (2017) cyber security incidents, and therefore contains no reference to cyber security risk, but the principles discussed are still relevant today.

The StuxNet attack showed us the vulnerability of production processes to cyber attacks, and the TRITON/TRISIS attack showed us that the process safety function is not necessarily guaranteed during such a cyber attack. So societal risk caused by cyber attacks should be high on our agenda; however, it is not.

Government regulations set risk criteria that apply to the impact a production process “may” have on individuals or a group of people. Regulations focus on the impact and how frequently it occurs; the cause of this impact is not relevant. This cause can be either a random failure in the production installation or an intentional failure caused by a cyber attack. The societal risk criteria of the law don’t differentiate between the two.

Since zero risk does not exist, we can only reduce the likelihood (the frequency of occurrence) of a specific harmful event. Preventing it from occurring by elimination (risk avoidance) would mean abandoning the production process. We sometimes do this, for example Germany abandoned nuclear power generation, but most often we attempt to reduce the likelihood to lower the risk. The law has specified average event frequencies that set limits for the likelihood of these events. These limits guide the process safety design of a production facility, but they now also set requirements for the cyber security design if we consider loss based risk.

The relationship between societal risk and process safety risk

Let’s first define risk. The CCPS definition of risk is:

“Risk is a measure of human injury, environmental damage, or economic loss in terms of both the incident likelihood and the magnitude of the loss”

The criteria for each are specified differently, with so far a strong focus of the regulations on human injury and environmental damage. The recent wave of ransomware attacks has also shifted government attention toward economic loss, but no limits have been set for it. My focus in this blog is human injury, so process safety.

Let’s consider a simplified imaginary process safety scenario. We have a reactor that needs cooling to prevent a run-away reaction when the temperature gets too high, a run-away reaction that would build up pressure to a level that might cause an explosion or a loss of containment of a hazardous (perhaps flammable) vapor that can cause multiple fatalities. This cooling is provided by a jacket filled with water, water that is supplied and circulated by a pump.

This type of scenario is quite common. A potential cause of such a run-away reaction would be the failure of the pump: a pump failure would stop the circulation of the cooling water, allowing the reactor to heat up with the potential run-away reaction as a result. The plant’s process safety design has set a target that requires reducing the likelihood of such an event to 1 event per 10,000 years (1E-04 per year). If we know the average event frequency of the pump failure, we can consider additional risk reduction controls that reduce the likelihood to the 1E-04 limit. The process for estimating such risk reduction is LOPA (Layer Of Protection Analysis), a quantitative method for process safety risk assessment. A process safety HAZOP is a qualitative method and as such is not capable of generating a likelihood frequency.

LOPA uses a table of event frequencies for the various failures that can occur; for example, a pump is generally expected to fail once every 10 years. We could attempt to improve the reliability of the pump function by adding a redundant pump, but we might still have a common factor that impacts both pumps. A cyber attack might be such a common factor, or a power failure if both pumps are electrically driven.

It would be difficult and costly to realize a reliability of the pump function that meets the 1E-04 criterion of our example. So instead we use a process safety function to reduce the risk by automatically bringing the reactor into a safe state after the pump failure, preventing the explosion and / or loss of containment.

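If we ignore any attempts to improve the pump’s reliability, we need a safety instrumented function (SIF) that is able to bring the reactor into a safe state. With a pump failure expected once every 10 years (1E-01 per year) and a target of 1E-04 per year, this function needs a reliability that reduces the risk by a factor 1000, that is a probability of failure on demand of 1E-03.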
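The arithmetic behind this factor can be sketched in a few lines. This is only a minimal illustration that uses the numbers of the example (pump failure once per 10 years, target 1E-04 per year); the function name is hypothetical, and a real LOPA would also credit other independent protection layers and conditional modifiers.

```python
def required_risk_reduction(initiating_freq_per_year, target_freq_per_year):
    """Return the risk reduction factor (RRF) needed to bring the
    initiating event frequency down to the target frequency."""
    if initiating_freq_per_year <= 0 or target_freq_per_year <= 0:
        raise ValueError("frequencies must be positive")
    return initiating_freq_per_year / target_freq_per_year


# Numbers taken from the example in the text.
pump_failure_freq = 1 / 10   # pump expected to fail once every 10 years
target_freq = 1e-4           # 1 event per 10,000 years

rrf = required_risk_reduction(pump_failure_freq, target_freq)
required_pfd = 1 / rrf       # probability of failure on demand the SIF may have

print(f"required risk reduction factor: {rrf:.0f}")
print(f"maximum probability of failure on demand: {required_pfd:.0e}")
```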

A SIF is an application that is triggered by some event, perhaps the flow rate of the cooling water or the temperature in the reactor vessel, to start initiating actions (closing / opening valves) that bring the reactor into a safe state. Reducing the risk requires that all components used by the SIF have a reliability that meets the criteria resulting in the risk reduction by a factor 1000. This is where the safety integrity level (SIL) comes in.

A SIF typically requires input from one or more sensors (e.g. a flow sensor), a safety instrumented system (SIS) executing the program logic, and one or more safety valves to create the safe state of the reactor. All components of the SIF together need to meet the reliability requirement to reduce the risk by the required factor. A SIL 3 SIF means we reduce risk by at least a factor 1000, a SIL 2 SIF by at least a factor 100. The SIL is a function of the mean time between failures (MTBF) of the components and the test frequency.
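The relation between a SIL and the risk reduction it represents can be illustrated with a small lookup. The bands below are the commonly quoted risk reduction ranges for low demand safety functions from IEC 61508; the function name and the way the result is presented are my own illustration, not part of any standard or tool.

```python
# Risk reduction factor (RRF) band per SIL for low demand mode:
# (lower bound, upper bound) of the RRF.
SIL_RRF_BANDS = {
    1: (10, 100),
    2: (100, 1_000),
    3: (1_000, 10_000),
    4: (10_000, 100_000),
}


def mitigated_frequency_range(initiating_freq_per_year, sil):
    """Return (best case, worst case) event frequency per year after
    crediting a single SIF of the given SIL."""
    low_rrf, high_rrf = SIL_RRF_BANDS[sil]
    best_case = initiating_freq_per_year / high_rrf
    worst_case = initiating_freq_per_year / low_rrf
    return best_case, worst_case


# Pump failure once every 10 years, protected by a SIL 3 SIF.
best, worst = mitigated_frequency_range(0.1, sil=3)
print(f"SIL 3: between {best:.0e} and {worst:.0e} events per year")
```

With the pump failure of the example, the worst case of the SIL 3 band is exactly the 1E-04 per year target, while a SIL 2 function would leave the frequency at 1E-03 per year in the worst case.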

Chemical plants typically require SIL 2 or SIL 3 safety systems to be able to meet the criteria. To reach the required SIL (SIL 3 in our example), the SIS must have a reliability that meets the SIL 3 criteria, but the MTBF of the field equipment also needs to meet the requirements. If the MTBF of a transmitter is too low to meet the criteria, we use multiple transmitters and perhaps a 2oo3 voting mechanism to reach this reliability. The same is true for the safety valves: we may have multiple safety valves in series to increase the likelihood that at least one will act. The SIS is a very special type of controller with many diagnostic functions monitoring the various internal states of the controller and its I/O connections to make certain that it performs when demanded. For this, a SIS undergoes a series of certification tests allowing it to claim a specific SIL. So process safety is capable of preventing an event from causing a high impact by using reliable functions.
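Why redundancy and voting help can be shown with basic probability. The sketch below assumes that every device fails on demand independently with the same probability, so it ignores the common cause failures mentioned earlier; the failure probability of 0.01 is a hypothetical number and the function name is my own.

```python
from math import comb


def koon_failure_probability(k, n, q):
    """Probability that a k-out-of-n group fails on demand.

    The group works when at least k of the n devices work. Each device
    is assumed to fail independently with probability q.
    """
    # The group fails when at least n - k + 1 devices fail.
    return sum(
        comb(n, failed) * q**failed * (1 - q) ** (n - failed)
        for failed in range(n - k + 1, n + 1)
    )


q = 0.01  # hypothetical probability that a single device fails on demand

single = koon_failure_probability(1, 1, q)          # one transmitter
voted_2oo3 = koon_failure_probability(2, 3, q)      # three transmitters, 2oo3 voting
valves_in_series = koon_failure_probability(1, 2, q)  # two valves, one must close

print(f"single device : {single:.2e}")
print(f"2oo3 voting   : {voted_2oo3:.2e}")
print(f"two valves    : {valves_in_series:.2e}")
```

In practice a common cause factor limits the gain, which is exactly why a cyber attack that impacts all redundant devices at once is a concern.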

So in our example above, the process safety function reduced the risk by a factor 1000. As a result we now have a likelihood of 1E-04 per year that a loss of containment or explosion, with a potential consequence of multiple fatalities, occurs due to the pump failure. This example has many simplifications, but it explains the principle of how we reduce risk in process safety.

The other question I need to answer is why we chose 1E-04 and not 1E-02 or 1E-06. This is where the regulations around societal risk come into the picture. Governments set these limits. So let’s take a more detailed look at topics like individual risk, societal risk, and aggregate risk before linking it all to cyber security risk.

Societal risk

Early examples of quantitative risk assessments (QRA) for the process industry go back to the early seventies, with the Canvey Island study in the UK, the Rijnmond study in the Netherlands and an LNG project study in California. Today QRA is the standard approach, resulting in quantitative risk criteria for the industry and standardization of the risk estimation method. Regulations exist for both individual risk and societal risk.

Individual risk expresses the risk to a single person exposed to a hazard, whereas societal risk expresses the cumulative risk to groups of people affected by the incident. This makes societal risk more complex, because it is measured on a scale based on the number of people impacted. Sometimes the term aggregate risk is used; aggregate risk is the special case of societal risk for on-site personnel in buildings.

Criteria for societal risk are stricter than for individual risk. Different criteria also exist for societal risk depending on whether on-site personnel or off-site persons are impacted. Some regulations also differentiate between off-site persons who are aware of process hazards (for example personnel of another plant next to the facility where the incident occurs) and the general public, with less awareness and protective equipment.

Regulations use Frequency-Number (F-N) curves to specify the criteria. In the next diagram I show the criteria for some European countries.

F-N curve showing some differences between European countries

The curve shows the boundaries for societal risk as they are specified within Europe. We can see differences between countries, and we can see the new directive within the European Union for new plants. The above is very familiar to process safety engineers, because these are the limits that determine their target frequency in LOPA (1E-04 in my example). Some companies can use the less restrictive F-N diagrams for on-site personnel; other companies have identified scenarios that might impact the public space and need to follow more restrictive criteria.

In the design of a safety function we typically don’t take the regulatory limit as our target; we add what is called risk capacity by specifying a more restrictive target. So if the regulatory limit were 1E-03, we might set the target to 1E-04, adding “space” between the risk tolerance limit and the acceptable risk limit. Terminology can be sensitive here; specifically the word “acceptable” is a sensitive term in some countries’ legal systems. Alternatives like Tolerable Acceptable and Tolerable Not Acceptable have been used to sub-divide the area between Not Acceptable and Acceptable. This is because a fatality is never acceptable; however, since zero risk doesn’t exist, we still have to assign an actionable risk level to fatal incidents with a very low likelihood of occurring.
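How an F-N criterion line and the risk capacity translate into a design target can be sketched as follows. An F-N criterion is usually a straight line on a log-log plot, so a limit of the form anchor / N^slope is assumed here. The anchor value, the slope and the capacity factor of 10 are purely illustrative assumptions and not the values of any specific country; the function names are hypothetical.

```python
def fn_limit(n_fatalities, anchor_freq=1e-3, slope=2.0):
    """Maximum tolerable frequency per year of events with N or more
    fatalities, for an assumed criterion line anchor_freq / N**slope."""
    if n_fatalities < 1:
        raise ValueError("n_fatalities must be at least 1")
    return anchor_freq / n_fatalities**slope


def design_target(regulatory_limit, capacity_factor=10.0):
    """Design target that keeps a margin (risk capacity) below the limit."""
    return regulatory_limit / capacity_factor


for n in (1, 10, 100):
    limit = fn_limit(n)
    print(f"N >= {n:>3}: limit {limit:.0e} per year, "
          f"design target {design_target(limit):.0e} per year")
```

For N = 1 this reproduces the numbers of the text: a limit of 1E-03 and a target of 1E-04. A steeper slope expresses a stronger aversion against incidents with many fatalities.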

The relationship between cyber security risk and societal risk

Though I write these blogs as a private person with a personal opinion and view on cyber security risk, I cannot avoid linking this view to my experience built as an employee of Honeywell. The team I work for focuses on cyber security consultancy services and security products for the industry. My first risk assessment was almost 10 years ago in 2012 and I am still involved in risk assessments today.

That is almost 10 years of risk assessment experience while working for a major supplier of control and safety solutions, building, maintaining and securing the largest production installations ever built. Working over 40 years for this company in different roles gave me a very detailed insight into how these solutions work and how they are applied to automate production processes. This also gave me detailed insight into a lot of factual data around cyber security risk within plants. Some of these insights I can generalize and discuss in this blog.

At a very high level, when we assess cyber security risk (loss based risk) the process consists of identifying the various system functions, identifying the hazards and their attack scenarios (actually based upon a repository of hundreds of different attack scenarios, including their countermeasures and potential functional deviations, for almost 100 different OT systems of different vendors), conducting a criticality and consequence analysis, and estimating residual risk based on the countermeasures implemented.

Important for this blog are the results of the consequence analysis. These results are derived from analyzing the LOPA and HAZOP studies carried out by asset owners. A task of consequence analysis is to go over all identified process safety scenarios and determine if the causes of these scenarios can be the result of a functional deviation caused by a cyber attack, or if the safeguard implemented by the process safety system can be disrupted or used for the attack.

The results of the more than 30 consequence analyses I did show that on average between 40 and 60% of all identified process scenarios (identified by HAZOP and LOPA) can also be caused by a cyber attack. So in our example of a pump failure, we can also cause this intentionally by stopping the pump and preventing the SIS from acting upon it.

Overall approximately 5% of these “hackable” process scenarios involve fatalities as a potential consequence, and an even higher percentage of scenarios can cause the highest level of economic loss. This percentage of course differs by plant: there are plants without potential fatalities and there are plants with a much higher percentage of scenarios that result in potential fatalities. Nevertheless these numbers show that cyber risk in the process industry has a direct relationship with societal risk, because a cyber attack can cause this type of incident. An important question is: is the cumulative risk of the process safety risk and cyber security risk below the criteria for societal risk?

There are some important rules and conclusions here to consider:

  • Process scenarios that involve potential fatalities require a safety instrumented function to prevent this. So in principle there should not be any scenario possible where an isolated attack on a BPCS, or a failure of a BPCS or process equipment function, results in fatalities. Wherever such a scenario is detected in the analysis, it should be corrected.
  • However, it is possible that an isolated attack on the safety instrumented system (like the TRITON/TRISIS attack) can cause a scenario resulting in multiple fatalities. This makes the SIS a very critical system from a cyber security point of view as well. SIS in this context can be an ESD, BMS, or HIPPS function.
  • The highest impact sensitivity (IS) is scored for attacks that impact both SIS and BPCS. Impact sensitivity is a metric that “weighs” how much “pain” a cyber attack can cause by attacking a specific OT function or set of OT functions (e.g. BPCS and SIS).
  • Apart from SIS and BPCS, other OT functions with a significant IS score are CCS (compressor control), APC (advanced process control), LIMS (laboratory information), IAMS (instrument asset management), and ASM (analyzer management). For CCS, APC, and ASM this is not surprising. IAMS frequently creates a common point to impact both safety and control, while for LIMS we see an increased integration where lab results are transmitted “digitally” to, for example, the APC. This can create scenarios with a high economic loss since LIMS is directly tied in with product quality.

It is essential for the security of a plant to address cyber security through risk analysis. If the threats and their impact are not known, we start doing things without knowing if we address the highest risk. Risk analysis should be based on an accurate mapping of the functional deviations caused by a cyber attack against the process scenarios that can result from these attacks. With accurate I mean not using consequence statements such as “Loss of View” or “Loss of Control”; these are far too general to be used for this mapping. OT functions have much more specific functional deviations, which can be identified if we have a detailed understanding of the workings of the OT function.

Then the most important question: does the cumulative risk of process safety and cyber security risk meet the regulatory criteria? This is the hardest question to answer, because the event frequencies of process safety (based on random failure events) and the event frequencies of cyber security (based on intentional action, and on skills and motivation) differ very much. Nevertheless the cumulative event frequency of the two needs to meet the same regulatory limit; as mentioned, the societal risk criteria are not specified for process safety, they are specified for the production process as a whole. We also know that this cumulative risk is higher than each separate risk: adding new threats doesn’t lower risk, and an intentional occurrence typically has a higher event frequency than a random occurrence. If the consequence is the same, the risk will be higher.
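The principle of the cumulative event frequency can be sketched as a simple sum. For rare events that lead to the same consequence, adding the frequencies of the different causes is the usual approximation. The cyber frequency and the regulatory limit below are hypothetical numbers chosen only to show the mechanics; the function names are my own.

```python
def cumulative_frequency(process_safety_freq, cyber_freq):
    """Approximate combined frequency per year of one consequence that
    can be caused by a random failure or by a cyber attack
    (rare event approximation: the frequencies are added)."""
    return process_safety_freq + cyber_freq


def meets_limit(total_freq, regulatory_limit):
    """True when the combined frequency does not exceed the limit."""
    return total_freq <= regulatory_limit


process_safety_freq = 1e-4   # residual frequency from the LOPA example
cyber_freq = 5e-4            # hypothetical frequency of a successful attack
regulatory_limit = 1e-3      # hypothetical societal risk limit

total = cumulative_frequency(process_safety_freq, cyber_freq)
print(f"cumulative frequency: {total:.1e} per year")
print(f"meets regulatory limit: {meets_limit(total, regulatory_limit)}")
print(f"meets process safety target of 1e-04: {meets_limit(total, 1e-4)}")
```

With these assumed numbers the plant would still be below the regulatory limit, but the risk capacity that the process safety design built in is largely consumed by the cyber risk.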

To discuss this I need the following diagram.

Threat analysis levels

If we analyze cyber security risk for a plant we typically consider threats at OT function level. The asset, channel and application levels are (should be) covered by the design teams during the threat modelling stages of the product development process. However, the results of this analysis at OT function level provide an event frequency based on the likelihood of success of the attack, as if there were a queue of capable threat actors ready and willing to attack the plant. In a normal situation (so no cyber war) this is not so: threat actors are very selective when it comes to executing targeted attacks, specifically if they are very skilled and need advanced methods to succeed.

Therefore not every plant has to fear an attack by a nation sponsored threat actor aiming to cause serious process impact. Attacks have a cost, and the cost of a failed attack (e.g. TRITON / TRISIS) is high. Developing such an attack is a considerable investment for a threat actor. This is of course quite different for ransomware attacks with an economic objective; that type of attack is more common. So we always have to consider risk by looking at different threat actors with different motives and capabilities.

For the overall likelihood of a cyber attack two event frequencies are important: the event frequency related to the OT function level threats which depends on the type of threat actor, the TTP, the countermeasures, and the dynamic and static exposure of the OT function; and a second event frequency at management level defining how often such an attack will happen and what threat actors would be interested. The OT function level risk is basically a technical exercise and can be estimated with various risk assessment methodologies, of which methods based on FAIR are used by multiple service providers.

For analyzing management level threats, other factors play a role. Some are based on historical occurrences, some are driven by geo-politics. Overall this is a more difficult and subjective factor to assess, typically requiring a threat profiling exercise.

The combination of the two event frequencies (using conditional likelihood formulas because one event depends on the other event) results in an overall event frequency for the cyber security risk. This is the frequency that needs to meet the societal risk criteria when considering loss based risk. At OT function level we can reach reliable results; however, this is more difficult at management level.
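The combination of the two levels can be sketched as a conditional likelihood calculation. In this simplified illustration the management level provides how often a threat actor attempts an attack, and the OT function level result is treated as the probability that an attempt succeeds; that split is my simplification for the example. The threat actors and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreatActor:
    name: str
    attempts_per_year: float         # management level: how often an attack is attempted
    p_success_given_attempt: float   # OT function level: likelihood the attempt succeeds


def cyber_event_frequency(actors):
    """Overall frequency per year of a successful attack:
    the sum over actors of f(attempt) * P(success | attempt)."""
    return sum(a.attempts_per_year * a.p_success_given_attempt for a in actors)


# Hypothetical threat profile for one plant and one process scenario.
actors = [
    ThreatActor("nation sponsored actor", attempts_per_year=0.001,
                p_success_given_attempt=0.5),
    ThreatActor("less skilled actor", attempts_per_year=0.1,
                p_success_given_attempt=0.001),
]

overall = cyber_event_frequency(actors)
for a in actors:
    contribution = a.attempts_per_year * a.p_success_given_attempt
    print(f"{a.name}: {contribution:.1e} per year")
print(f"overall cyber event frequency: {overall:.1e} per year")
```

The example also shows the sensitivity: the attempt frequency at management level is the most subjective input, and changing it by a factor 10 changes the result by the same factor.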

So whether cyber security risk meets the societal risk criteria is a question for which there is no reliable answer.

Another difficulty is that we have different cyber security risk assessment methods, often resulting in different quantitative outcomes. Governments have solved this for process safety by standardizing the estimation of societal risk through enforcing the use of a specific method or tool. However these methods or tools do not consider cyber security risk at this point in time.

Unfortunately none of the standards organizations seems willing or able to address this gap. IEC 62443-3-2 is very high level, primarily an introduction to risk, and does not actually address any of the above issues. And from what I have seen of TR84, it is not much better, because it copies IEC 62443-3-2 for a large part and also doesn’t address this legal aspect of process safety and cyber security risk.

But this topic is a gap that needs to be filled, a gap that will be very high on the agenda if a cyber attack occurs with societal impact resulting in multiple fatalities. The TRITON / TRISIS attempt shows that it was merely “luck” and not the result of any impressive cyber security that it didn’t happen.

So maybe an organization like ENISA, that is not organized around volunteers, will consider closing this gap. In order to meet European regulations the gap should be closed.

I hope that people who took the effort to read the blog till the end realize how difficult estimating loss based risk is, and conclude that it might be far easier to use an FMECA method to estimate risk based on a risk priority number for the security design.

However, IEC 62443-3-2 links its method to the HAZOP with its loss based impact, so we are driven into considering this complex field of regulations.

I don’t think this is necessary for a good security design; I think it is primarily the result of an attempt to show that cyber security is important. What is easier in that case than to link it with big brother process safety? But for proper cyber security design an FMECA analysis brings us the same results.

So be careful specifying the need for business related risk if some form of legal or financial justification is not required, it opens a can of worms.


There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 43 years of work in this industry: approximately half of the time working in engineering these automation systems, and half of the time implementing their networks and securing them, and conducting cyber security risk assessments for process installations since 2012.


Author: Sinclair Koelemij

