OT security risk and loss prevention in industrial installations

Approximately 30 years ago J. Bond (Not James but John – “The Hazards of Life and All That”) introduced the three laws of loss prevention:

  • He who ignores the past is condemned to repeat it.
  • Success in preventing a loss is in anticipating the future.
  • You are not in control if a loss has to occur before you measure it.

The message for cyber security seems to be that when we measure and quantify the degree of resilience that OT security provides or an installation requires, we learn something from that. When we measure, we can compare / benchmark, predict and encapsulate the lessons learned in the past in a model looking into the future. One way of doing this is by applying quantitative risk methods.

Assessing risk is done using various methods:

  • We can use a qualitative method, which produces either a risk priority number or a value on an ordinal scale (high, low, ..) for the scenario for which the risk is estimated (if scenarios are used; we discuss this later). The qualitative method is very common because it is easy to do, but it often results in a very subjective risk level. Nevertheless, I know some large asset owners have used such a method to assess cyber security risk for their refinery or chemical plant. Care must be taken that qualitative risk assessments don’t turn into a form of generic risk assessment, becoming more a risk awareness exercise than providing risk in terms of likelihood and consequence as a basis for decision making.
  • Another approach is to use a quantitative method, which produces a quantitative threat event frequency or probability value for the attack scenario analyzed and considers how risk reduction controls reduce the risk. Quantitative methods require more effort and are traditionally only used in critical situations, for example in cases where an incident can result in fatalities. This is because regulatory authorities specify quantitative limits for both worker risk and societal risk for production processes. Today computational methods have also been developed that reduce the effort of risk estimation significantly. An advantage of a quantitative method is that the result expresses risk in terms of likelihood (event frequency or probability) and consequence and links these to the risk reduction the security measures offer, providing a method of justification.
  • A generic risk assessment is an informal risk assessment, e.g. an asset owner and my employer want me to consider the risk whenever I apply changes to an OT function. Generic risk assessment is a high level process, considering what the potential impacts of the modification are, where things can go wrong, how I recover if it goes wrong, and who needs to be aware of the change and the potential impact. Most of us are very familiar with this way of assessing risk, for example when we cross a busy street we do this immediately. Generic risk assessments are also used by the industry, but they are informal and depend very much on the skills and experience of the assessor.
  • Dynamic risk assessment is typically used in situations where risk continuously develops such as is the case for firemen during a rescue operation, or a security operations center might apply this method while handling a cyber attack trying to contain the attack.

If we need to estimate risk for loss events during the conceptual or design phase of an industrial facility, we estimate risk for potential consequences for human health, environment, and finance. In those cases a quantitative method is important because it justifies the selected (or ignored) security measures. A quantitative risk assessment for loss events is an extension of the plant’s process safety risk analysis, typically a process safety HAZOP normally followed by a LOPA.

In oil and gas, refining and the chemical industry it is quite normal that the HAZOP identifies hazards that can result in fatalities or significant financial loss, just as there are loss scenarios identified that result in serious environmental impact. For hazards concerning human safety and environmental loss, governments have set limits on how often such an event may occur per annum.

These limits are used by process safety engineers for setting their design targets for each consequence rating (Target Mitigated Event Likelihood – TMEL), on which they base the selection of the safeguards that reduce the risk in order to meet the required TMEL. Such a risk estimation process is typically carried out in the Layers Of Protection Analysis (LOPA). The HAZOP follows a systematic method to identify all potential safety hazards with their causes and consequences. The LOPA is the step following the HAZOP: a high level risk analysis that estimates the mitigated event likelihood of the most critical hazards, for example the hazards with a high consequence rating.

LOPA is a semi-quantitative method using a set of predefined event frequencies for specific accidents (e.g. pump failure, leaking seal, ..) and predefined values for the risk reduction that safeguards offer when mitigating such an event. For example a safety instrumented function with safety integrity level SIL 2 offers a risk reduction of factor 100, a SIL 3 safeguard would offer a factor 1000, and a process operator intervention a factor 10. The SIL values are a function of the reliability and test frequency of the safeguard. Based upon these factors a mitigated event likelihood is estimated, and this likelihood must meet the TMEL value. If this is not the case, additional or alternative safeguards (controls) are added to meet the TMEL value. Loss events for industrial installations are assigned a consequence rating level, and for each rating level a TMEL has been defined. For example the highest consequence rating might have a TMEL of 1E-05 assigned, the second highest 1E-04, etc.
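The LOPA arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the risk reduction factors are the ones quoted in the text, while the initiating event frequency, the two safeguard selections, and all function names are hypothetical and not taken from any real LOPA table.

```python
import math

# Risk reduction factors as quoted in the text (illustrative LOPA-style credits).
RISK_REDUCTION_FACTOR = {
    "sif_sil2": 100,
    "sif_sil3": 1000,
    "operator_intervention": 10,
}


def mitigated_event_likelihood(initiating_frequency, safeguards):
    """Divide the initiating event frequency (per year) by every credited factor."""
    total_reduction = math.prod(RISK_REDUCTION_FACTOR[s] for s in safeguards)
    return initiating_frequency / total_reduction


def meets_tmel(mel, tmel):
    """True when the mitigated event likelihood does not exceed the target."""
    return mel <= tmel


# Hypothetical scenario: initiating event once every 20 years, TMEL of 1E-05.
INITIATING_FREQUENCY = 0.05
TMEL = 1e-5

first_design = ["sif_sil2", "operator_intervention"]
second_design = ["sif_sil3", "operator_intervention"]

for design in (first_design, second_design):
    mel = mitigated_event_likelihood(INITIATING_FREQUENCY, design)
    print(design, f"MEL = {mel:.1e} per year", "meets TMEL:", meets_tmel(mel, TMEL))
```

In this made-up case the first design does not reach the target, so a stronger safeguard is selected, which mirrors the "add or replace safeguards until the TMEL is met" step.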

For the law the cause of a consequence like a fatality or environmental spill is irrelevant: the production process must offer a likelihood level that meets the legal criteria whatever the cause. For example we might need to meet a TMEL of 1E-06 for a consequence involving multiple fatalities for a greenfield installation in Europe, or a maximum TMEL of 1E-05 for existing brownfield installations. This means that if a cyber attack can cause a fatality / irreversible injury, or major environmental damage, the likelihood of such an event (so its event frequency) must meet the same TMEL as is valid for the safety accident. The law doesn’t give bonus points for the one (safety) being accidental, and the other (cyber security) being intentional. Risk must be reduced to the level specified. Estimating such an event frequency as the result of a cyber attack requires a quantitative process. So far the introduction.

Let’s first look at loss events and how they can arise in a production facility like a chemical plant. For this I created the following overview picture showing the relevant automation function blocks in a production process. It is a high level overview ignoring some shared critical support components such as HVAC, which can also cause loss events if it fails, or various management systems (network, security, documentation, ..) that can be misused to get access to the OT functions. I also ignore network segmentation for now. Network segmentation manages the exposure of the functions, so it is important for estimating the likelihood but not for this loss event discussion.

Overview of the automation functions

I start my explanation by discussing the blocks in the diagram.

  • Process Control Functions – These are the collection of functions that control the production process, such as the basic process control system function (BPCS), the compressor control function (CCS), power management function (PMS), Advanced Process Control functions (APC), Motor Control Center function (MCC), Alarm Management function (ALMS), Instrument Asset Management function (IAMS), and several other functions depending on the type of plant. All specialized systems that execute a specific task in the control of a production process.
  • Process Safety Functions – Examples of process safety functions are the Emergency Shutdown function (ESD), burner management function (BMS), fire and gas detection function (F&G), fire alarm panel function (FAP), and others. Where the control functions execute the production process, the safety function guards that the production process stays within safe process limits and intervenes with a partial or full shutdown if this is not the case.
  • Process diagnostic functions – Examples of process diagnostic functions are machine monitoring systems (MMS) (e.g. vibration monitoring), and process analyzers (PA) (e.g. oxygen level monitoring). These functions can feed the process safety function with information that initiates a shutdown. For example the oxygen concentration in a liquid is measured by a process analyzer; if this concentration is too high an explosion might occur if not prevented by the safety function. Process analyzers are typically clustered in management systems that control their calibration and settings. These functions are network connected. Just as is the case for analyzers, HART, Profibus, and Foundation Fieldbus devices are also network connected and can be attacked.
  • Quality control functions – For example a Laboratory Information Management System (LIMS) keeps track of the lab results, is responsible for the sample management, instrument status, and several other tasks. Traditionally LIMS was a function that was isolated from the process automation functions, but nowadays there are also solutions that feed their data directly into the advanced process control (APC) function.
  • Process metering functions – These are critical for custody transfer applications, monitoring of production rates, and in oil and gas installations monitor also well depletion. For liquids the function is a combination of a pump and precision metering equipment, however there are also power and gas metering systems.
  • Environmental monitoring systems – These systems monitor for example the emission at a stack or discharges to surface water and groundwater. Failure can result in regulatory violations.
  • Business functions – This is a large group of functions ranging from systems that estimate energy and material balances to enterprise resource planning (ERP) systems. Tank management functions, jetty operation, and mooring load monitoring are also considered part of this category. This block contains a mixture of OT related functions and pure information technology systems (e.g. ERP), exchanging information with the automation functions. A business function is not the same as an IEC 62443 level 4 function; it is just a group of functions directly related to the production process and potentially causing a loss event when a cyber attack occurs. They have a more logistic and administrative function, but depending on the architecture can reside at level 3, level 4 or level 5.
  • Process installation / process function – Last but not least, the installation that creates the product: the pumps, generators, compressors, pipes, vessels, reactors, etc. For this discussion I consider valves and transmitters, from a risk perspective, to be elements of the control and safety functions.

The diagram shows numbered lines representing data and command flows between the function blocks. That doesn’t mean that these are the only critical data exchanges that take place. Within the blocks there are also critical interactions between functions and their components that a risk assessment must consider, for example the interaction between BPCS and APC, BPCS and CCS, BPCS and PMS, or ESD and F&G. The same holds for data exchange between the components of a main function, for example the information flow between an operator station and a controller in a BPCS, or the data flow between two safety controllers of a SIS. However that is at a detailed level and mainly of importance when we analyze the technical attack scenarios. For determining the loss scenarios these details are of less relevance.

The diagram attempts to identify where we have direct loss (safety, environmental, financial) as result of the block functions failing to perform their task or performing their task in a way that does not meet the design and operational intentions of the function. Let’s look at the numbers and identify these losses:

  1. This process control function information / command flow can cause losses in all three categories (safety, environmental, finance). In principle a malfunction of the control system shouldn’t be able to cause a fatality, because the safety function should intervene in time. Nevertheless, scenarios exist that are outside the control of the process safety function, where an attack on the BPCS can cause a loss with a high consequence rating. These cyber attack scenarios exist for all three categories, but of course this also depends on the type and criticality of the production process and the characteristics of the process installation.
  2. Process safety functions have the task to intervene when process conditions are no longer under control by the process operator, however they can also be used to cause a loss event independent of the control function. A simple example would be to close a blocking valve (installed as a safety mechanism to isolate parts of the process) on the feed side of a pump, this would overheat and damage the pump if the pump wouldn’t be stopped. Also when we consider burner management for industrial furnaces multiple scenarios exist that can lead to serious damage or explosions if purge procedures would be manipulated. Therefore also the process safety function information/command flow can cause losses in all three categories (Safety, environmental, and finance).
  3. Process diagnostic functions can cause loss events by not providing accurate values, for example incorrect analyzer readings or incorrect vibration information. In the case of rotating equipment a vibration monitoring function can also cause a false trip of for example a compressor or generator. Also this diagnostic function data and command flow can cause losses in all three categories (Safety, environmental, finance), specifically cyber attacks on the process analyzer through their management system can have serious consequences including explosions with potentially multiple fatalities. Analyzers are in my opinion very much under-estimated as potential target, but manipulated analyzer data can result in serious consequences.
  4. The flow between the process control and process safety functions covers the exchange of information and commands between the control function and the process safety function. This would typically be alarm and status information from the process safety function and override commands coming from the process control function. Unauthorized override actions or false override status information can lead to serious safety incidents, just as lost SIL alarms can result in issues with the safety function going unnoticed. In principle this flow can cause losses in all three categories (safety, environmental, finance), though serious environmental damage is not likely.
  5. The flow between the diagnostic and control functions might cause the control function to operate outside normal process conditions, for example because process analyzer data is manipulated. In principle this can cause both environmental and financial damage, but most likely (assuming the process safety function is fully operational) no safety issues. So far I have never come across a scenario for this flow that scores in the highest safety or environmental category, but multiple examples exist for the highest financial category.
  6. The flow between the diagnostic and process safety functions is more critical than flow 5. This is because flow 6 typically triggers the process safety function to intervene. If this information flow is manipulated or blocked, the process safety function might fail to act when demanded, resulting in serious damage or explosions with losses in all three categories (safety, environmental, finance).
  7. This information flow between the quality control function, the process installation, and the process function is exclusively critical for financial loss events. The loss event can be caused by manipulated quality information leading to off spec product either being produced or sold to customers of the asset owner.
  8. Manipulation of the information flow between the business functions and the control functions has primarily a financial impact. Though there are exceptions, for example there exist permit to work solutions (business function) that integrate with process operator status information. Manipulation of this data might potentially lead to safety accidents due to decisions based on incorrect status data.
  9. Manipulation of the information flow between metering functions and the business functions results primarily in financial damage. It normally has neither a safety nor an environmental impact. However the financial damage can be significant.
  10. The environmental monitoring function is an independent stand-alone function whose loss is typically a financial loss event as a result of not meeting environmental regulations. But minor environmental damage can be a consequence too.
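The ten flows and their loss categories can be captured in a small lookup table, which makes it easy to ask which flows matter for a given loss category. The sketch below encodes only the primary categories named for each flow in the list; the exceptions mentioned in the text (for example the permit to work case for flow 8, or minor environmental damage for flow 10) are left out on purpose. The names `Loss`, `FLOW_LOSSES` and `flows_with` are made up for this illustration.

```python
from enum import Flag, auto


class Loss(Flag):
    SAFETY = auto()
    ENVIRONMENTAL = auto()
    FINANCIAL = auto()


ALL_THREE = Loss.SAFETY | Loss.ENVIRONMENTAL | Loss.FINANCIAL

# Primary loss categories per numbered flow, as described in the list above.
FLOW_LOSSES = {
    1: ALL_THREE,                            # process control
    2: ALL_THREE,                            # process safety
    3: ALL_THREE,                            # process diagnostics
    4: ALL_THREE,                            # control <-> safety (environmental unlikely)
    5: Loss.ENVIRONMENTAL | Loss.FINANCIAL,  # diagnostics -> control
    6: ALL_THREE,                            # diagnostics -> safety
    7: Loss.FINANCIAL,                       # quality control
    8: Loss.FINANCIAL,                       # business <-> control (primarily financial)
    9: Loss.FINANCIAL,                       # metering <-> business
    10: Loss.FINANCIAL,                      # environmental monitoring
}


def flows_with(category):
    """Return the flow numbers whose primary losses include the given category."""
    return [number for number, losses in FLOW_LOSSES.items() if category in losses]


print("Safety relevant flows:", flows_with(Loss.SAFETY))
```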

Now that we have looked at the loss events, I would like to move to the topic of how these loss events occur as a result of a cyber attack on the automation functions. The diagram I created for this shows loss event categories on the right side: categories for process loss events and categories for business loss events. I don’t discuss these further in this blog, but these categories are used in the risk assessment reporting to group the risk of loss events, allowing various comparison metrics that show differences between process installations. They are primarily used for benchmarking, however this data also gives an interesting view on the criticality of cyber attacks for different industries and the community.

Loss and threat relationship

The above graphic introduces which cyber security threats we analyze in a risk assessment and why. Threats are basically modelled as threat actors executing threat actions exploiting exposed vulnerabilities, resulting in some unwanted consequence. We bundle threats in hazards, for example the hazard unauthorized access into a BPCS engineering HMI would be the grouping of all threats (attack scenarios) that can accomplish such unauthorized access. If the threat succeeds in accomplishing this unauthorized access, the hazard also has a series of consequences. These consequences differ depending on whether we discuss the risk for the production process / installation or the risk for the business functions. Typically loss events for a process installation can be a safety, environmental, or financial loss event, while the impact on the business functions is typically financial loss.

Important to realize is that risk propagates. Something that starts in the process installation can have serious impact on the business processes, but the reverse is also true: something that starts with a compromise of the business functions can lead to serious failures in the production installation. Therefore a risk assessment doesn’t separate IT and OT technology; it is about functions and their mutual dependencies. This is why a criticality assessment (business impact) is an important instrument for identifying the risk domain for a risk analysis.

In a risk analysis we map the technical hazard consequences on the process and business loss scenarios. On the process / installation side there are basically two potential deviations: either the targeted function no longer performs according to its design or operational intent, or the targeted function stops doing its task altogether. These can result in process loss events. On the business side we have the traditional availability, confidentiality, and integrity impact. Data transfers can be lost, their confidential content can get exposed, or the data can be manipulated, all resulting in business loss events.

So the overall risk estimation process is to identify the scenarios (process and business), determine whether they can be initiated by a cyber attack, and if so identify the functions that are the potential targets and determine for these functions the cyber attack scenarios that lead to the desired functional deviations.

These attack scenarios are coming from a repository of attack scenarios for all the automation functions and their components in scope of the risk analysis. Doing this for multiple threat actors, multiple automation functions, and different solutions (having different architectures, using different protocols, having different dependencies between functions) for different vendors leads to a sizeable repository of cyber attacks with the potential security measures / barriers to counter these attacks.

In an automated risk assessment the technical architecture of the automation system is modelled in order to take the static and dynamic exposure of the targets (assets and protocols) into account, consider the exploitability of the targets, consider the intent, capabilities, and opportunity of the threat actors, and consider the risk reduction offered by the security measures. The result is what we call the threat event frequency, which is the same as what is called the mitigated event likelihood in the process safety context.

So far the attack scenarios considered are all single step attack scenarios against a targeted function. If the resulting mitigated event likelihood (MEL) meets the target mitigated event likelihood (TMEL) of the risk assessment criteria we are done, the risk meets the criteria. If not, we can add security measures to further reduce the mitigated event likelihood. In many cases this will reduce the MEL to the required level. If all functions are sufficiently resilient against a single step cyber attack, then we can conclude the overall system is sufficiently resilient. However there are still cases where the TMEL is not met, even with all possible technical security measures implemented. In that case we need to extend the risk estimate by considering multi-step attack scenarios in the hope that this reduces the overall probability of the cyber attack.

A multi-step attack scenario introduces significant challenges for a quantitative risk assessment. First we need to construct these multi-step attack scenarios; these can be learned from historical cyber attacks and otherwise derived using threat modelling techniques. Another complication is that in analyzing single step attack scenarios we used event frequencies. In multi-step attack scenarios this is no longer possible because we need to consider the conditional probability for the steps: the possibility of step B typically depends on the success of step A. So we need to convert the mitigated event frequencies of a specific single step into probabilities. This requires us to use a specific period, so we calculate the probability for something like: “what is the probability that a threat with a mitigated event frequency of once every 10,000 years will occur in the coming 10 years”. Having a probability value we can add or multiply the probabilities in the estimation process. The chosen period for the probability conversion is not of much importance in this case because in the end we need to convert probability back into event frequency for comparison with the TMEL. Of more importance is whether the conditional events are dependent or independent; this tells us to either multiply (independent) or add (dependent) probabilities, which either decreases or increases the likelihood.
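The conversion between event frequency and probability over a period can be sketched as follows. The text does not say which statistical model is used for this step; the sketch assumes events arrive at a constant rate (a Poisson process), which gives P = 1 - exp(-f * T). The function names are hypothetical.

```python
import math


def frequency_to_probability(frequency_per_year, period_years):
    """Probability of at least one event in the period, assuming a constant rate."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-frequency_per_year * period_years)


def probability_to_frequency(probability, period_years):
    """Inverse conversion: the constant rate that yields this probability."""
    return -math.log1p(-probability) / period_years


# The example from the text: once every 10,000 years, looking 10 years ahead.
p = frequency_to_probability(1e-4, 10)
print(f"Probability over 10 years: {p:.6f}")
print(f"Back to frequency: {probability_to_frequency(p, 10):.1e} per year")
```

For frequencies this low the probability is almost exactly frequency times period, and converting back with the same period returns the original frequency, which is why the choice of period matters little as long as it is used consistently.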

For example if we have a process scenario that requires both to attack the control engineering function and the safety engineering function simultaneously, the result differs significantly if these functions require two independent cyber attacks or if a single cyber attack can accomplish both attack objectives. As would be the case if both engineering functions would reside in the same engineering station. This is why proper separation of the control and safety functions is also from a risk perspective very important. Mathematics and technical solutions go hand in hand.

So in a multi-step scenario we consider all steps of our attack scenario toward the ultimate target that creates the desired functional deviation impacting the production process. If these are all independent steps, the conditional probability decreases compared with the single step estimate, and so does the likelihood when we convert the resulting conditional probability back into an event frequency (using the same period, e.g. the 10 years in my example). So far I never met a case where this wasn’t sufficient to meet the TMEL. However it is essential to construct a multi-step scenario with the fewest possible steps, otherwise we get an incorrect result because of too many steps between the threat actor and the target.
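A minimal sketch of the multi-step estimate, using the engineering station example from above. It assumes a constant-rate (Poisson) conversion between frequency and probability and treats the steps as fully independent, so their probabilities are multiplied. The step frequencies are invented for illustration and do not come from a real assessment.

```python
import math

PERIOD_YEARS = 10  # same period for both conversions, as in the 10 year example


def to_probability(frequency_per_year):
    return -math.expm1(-frequency_per_year * PERIOD_YEARS)


def to_frequency(probability):
    return -math.log1p(-probability) / PERIOD_YEARS


def chain_frequency(step_frequencies):
    """Event frequency of a scenario in which every independent step must succeed."""
    probability = math.prod(to_probability(f) for f in step_frequencies)
    return to_frequency(probability)


# Hypothetical mitigated event frequencies (per year) for the two attacks.
control_engineering_attack = 1e-2
safety_engineering_attack = 1e-3

# Separate engineering stations: two independent attacks are both required.
separated = chain_frequency([control_engineering_attack, safety_engineering_attack])

# Shared engineering station: one successful attack achieves both objectives.
shared = chain_frequency([control_engineering_attack])

print(f"Separated stations: {separated:.2e} per year")
print(f"Shared station:     {shared:.2e} per year")
```

With these made-up numbers the separated design ends up roughly two orders of magnitude below the shared station, which is the quantitative side of the argument for separating control and safety functions.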

Nevertheless there is the theoretical possibility that despite considering all security measures available, and despite considering the extra hurdles a multi-step attack poses, we still don’t meet the TMEL. In that case we have a few options:

One possibility is considering another risk strategy. So far we chose a risk mitigation strategy (trying to reduce risk using security measures). An alternative can be a risk avoidance strategy: abandoning the concept altogether, or redesigning the concept using another technical solution that potentially offers more or different options to mitigate the risk.

Risk strategies such as sharing risk (insurance) or spreading risk typically don’t work when it concerns non-financial risk such as safety risk and environmental risk.

But as I mentioned so far I never encountered a situation where the TMEL could not be met with security measures, in the majority of the cases the compliance can be reached by analyzing single step scenarios for the targets. In some critical cases multi-step scenarios are required to estimate if risk reduction meets the criteria.

We might ask ourselves the question: are we not overprotecting the assets if we attempt to solve mitigation by first establishing resilience against single step cyber attacks? This is certainly a consideration in the case where the insider attack can be ignored, but be aware that the privileged malicious insider can typically execute a single step attack scenario because of his / her presence within the system and authorizations in the system. Offering sufficient protection against an insider attack most often requires procedures, non-technical means to control the risk. However so far no method has been developed that estimates the risk reduction of procedural controls for cyber security.

So what does a quantitative risk assessment offer? First of all a structured insight into a huge and growing number of attack scenarios. It offers a way to justify investment in cyber security. It offers a method to show compliance with regulatory requirements for worker, societal, and environmental risk. And overall it offers consistency of the results and therefore a method for comparison.

What are the challenges? We need data for event frequencies. We need a method to estimate the risk reduction for a security measure. And we need knowledge, detailed knowledge of the infrastructure, the attack scenarios, and process automation, a rare combination that only large and specialized service providers can offer.

The method does not provide actuarial risk, but it has proven in many projects to provide reliable and consistent risk estimates. The data challenge is typically handled by decomposing the problem into smaller problems for which we can find reliable data. Experience and correction over time make it better time after time. In the same way as the semi-quantitative LOPA method gained acceptance in the process safety world, the future for cyber risk analysis is in my opinion quantitative.

Strangely enough there are also many that disagree with my statement; they consider (semi-)quantitative risk for cyber security impossible. They often select a qualitative or even a generic method as an alternative, but both qualitative and generic methods are based on even less objective data, are more subjective, and are not capable of proving compliance with risk regulations. So the data argument is not very strong and has been conquered in many other risk disciplines. The complexity argument is correct, but that is a matter of time and education. LOPA was also not immediately accepted, and LOPA as it is today is very structured and relatively simple to execute by subject matter experts. However we can also discuss data reliability for the LOPA method: comparing LOPA tables used by different asset owners also leads to surprises. Nevertheless LOPA has offered tremendous value for consistent and adequate process safety protection.

Process safety and cyber security differ very much but are also very similar. I believe cyber security must follow the lessons learned by process safety and adapt these for its own unique environment. This doesn’t happen very much, resulting in weak standards such as IEC 62443-3-2 that ignore what happens in the field.

There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 43 years of work in this industry: approximately half of the time working in engineering these automation systems, and half of the time implementing their networks and securing them, and conducting cyber security risk assessments for process installations since 2012.

Author: Sinclair Koelemij

OTcybersecurity web site

Why process safety risk and cyber security risk differ


When cyber security risk for process automation systems is estimated I often see references made to process safety risk. This has several reasons:

  • For estimating risk we need likelihood and consequence. The process safety HAZOP and LOPA processes used by plants to estimate process safety risk identify the consequence of the process scenarios they analyze. These methods also classify the consequence in different categories such as finance, process safety, and environment.
  • People expect a cyber security risk score that is similar to the process safety risk score, a score expressed as loss based risk. The idea is that the cyber threat potentially increases the process safety risk and they like to know how much that risk is increased. Or more precisely how high is the likelihood that the process scenario could occur as result of a cyber attack.
  • The maturity of the process safety risk estimation method is much higher than the maturity of cyber security risk estimation methods in use. Not that strange if you consider that the LOPA method is about 20 years old, and the HAZOP method goes back to the late sixties. When reading publications, or even the standards on cyber security risk (e.g. IEC 62443-3-2), this lack of maturity is easily detected. Often qualitative methods are selected, however these methods have several drawbacks which I discuss later.

This blog will discuss some of these differences and immaturities. I’ve done this in previous blogs, mainly by comparing what the standards say with what I’ve experienced and learned over the past 8 years as a cyber risk analysis practitioner for process automation systems, doing a lot of cyber risk analysis for the chemical, and oil and gas industries. This discussion requires some theory; I will use some everyday examples to make it more digestible.

Let us start with a very important picture to explain process safety risk and its use, but also to show how process safety risk differs from cyber security risk.

Process safety FN curve

There are various ways to express risk; the two most used are risk matrices and FN plots / FN curves. FN curves require a quantitative risk assessment method, such as LOPA, which is used in process safety risk analysis. In an FN curve we can show the risk criteria, the boundaries between what we consider acceptable risk and what we consider unacceptable risk. I took a diagram that I found on the Internet in which we have a number of process safety scenarios (shown as dots on the blue line), their likelihood of occurrence (the vertical axis), and in this case the consequence expressed in fatalities when such a consequence can happen (the horizontal axis). The diagram is taken from a hydrogen plant. These plants are among the most dangerous, which is why we see the relatively high number of scenarios with a single or multiple fatalities.

Process safety needs to meet the regulations / laws that are associated with the plant license. One such “rule” is that the likelihood of “in fence” fatalities must be limited to 1 every 1000 years (1.00E-03). If we look at the risk tolerance line (RED) in the diagram, we see that the boundary between tolerable and intolerable risk lies exactly at the point where the line crosses the 1.00E-03 event frequency (likelihood). Another frequently used limit is the 1.00E-04 event frequency, below which risk is considered acceptable and is not addressed further.
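As a minimal sketch of how these two limits act as risk criteria for a single-fatality scenario, the snippet below sorts an event frequency into one of three bands. The two limits come from the text above; the function name, the band labels and the treatment of values exactly on a limit are assumptions made for illustration, and a real FN criterion also varies with the number of fatalities.

```python
# Risk criteria from the text, in events per year (single-fatality case).
TOLERABLE_LIMIT = 1.0e-3   # above this frequency the risk is intolerable
ACCEPTABLE_LIMIT = 1.0e-4  # at or below this frequency risk is not addressed further


def classify_event_frequency(events_per_year: float) -> str:
    """Return the risk band for an event frequency (band names are illustrative)."""
    if events_per_year > TOLERABLE_LIMIT:
        return "intolerable"
    if events_per_year > ACCEPTABLE_LIMIT:
        return "tolerable"
    return "acceptable"


if __name__ == "__main__":
    for frequency in (1.0e-1, 5.0e-4, 1.0e-5):
        print(f"{frequency:.2E} per year -> {classify_event_frequency(frequency)}")
```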

How does process safety determine this likelihood for a specific process scenario? In process safety we have several structured methods for identifying hazards. One of them is the Hazard and Operability study, in short the HAZOP. In a HAZOP we analyze, for a specific part of the process, whether specific conditions can occur. For example, we can take the flow through an industrial furnace and analyze whether we can have high flow, no flow, or maybe reverse flow. If such a condition is possible, we look at its cause (the initiating event), perhaps no flow because a pump fails. Once we have established the cause, we consider what the process consequence would be. Possibly the furnace tubing will be damaged, the feed material will leak into the furnace, and an explosion might occur. This is what is called the process consequence. This explosion has an impact on safety: one or more field operators might be nearby and be killed by the explosion. There will also be a financial impact, and possibly an environmental impact. A HAZOP is a multi-month process in which a team of specialists goes step by step through all units of the installation and considers all possibilities and the ways to mitigate each hazard. This results in a report with all analysis results documented and classified. HAZOPs are periodically reviewed in a plant to account for possible changes; the time between reviews is what we call the validity period of the analysis.

However, we don’t yet have a likelihood expressed as an event frequency such as used in the FN curve. This is where the LOPA method comes in. LOPA has tables for all typical initiating events (causes), so the event frequency for the failure of a pump has a specific value (for example 1E-01, once every 10 years). How were these tables created? Primarily based on statistical experience. These tables have been published, but can also differ between companies. A polypropylene factory of company A does not by default use the same tables as a polypropylene factory of company B. All use the same principles, but small differences can occur.

In the example we have a failing pump with an initiating frequency of once every 10 years and a process consequence that could result in a single fatality. But we also know that our target for single fatalities should be once per 1000 years or better. So we have to reduce this event frequency of 1E-01 by at least a factor of 100 to get to once per 1000 years.
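The arithmetic of this step can be sketched in a few lines. The two frequencies are the ones from the example; the function and variable names are hypothetical and only serve the illustration.

```python
def required_risk_reduction(initiating_frequency: float, target_frequency: float) -> float:
    """Factor by which the initiating event frequency must be reduced."""
    if initiating_frequency <= 0 or target_frequency <= 0:
        raise ValueError("frequencies must be positive")
    # No reduction is needed when the initiating frequency already meets the target.
    return max(1.0, initiating_frequency / target_frequency)


pump_failure_per_year = 1.0e-1    # once every 10 years
single_fatality_target = 1.0e-3   # once every 1000 years

factor = required_risk_reduction(pump_failure_per_year, single_fatality_target)
print(f"required risk reduction: about a factor {factor:.0f}")
```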

This is why we have protection layers: we are looking for one or more protection layers that offer us a factor of one hundred extra protection. One of these protection layers can be the safety system, for example a safety controller that detects the no-flow condition by measuring the flow and brings the furnace to a safe state using a safety valve. How much “credit” can we take for this shutdown action? This depends on the safety integrity level (SIL) of the safety instrumented function (SIF) we designed. This SIF is more than the safety controller where the logic resides; it includes all components necessary to complete the shutdown function, so it includes the transmitters that measure the flow and the safety valves that close any feed lines and bring other parts of the process into a safe condition.

We assign a SIL to the SIF. We typically have three safety integrity levels, SIL 1, 2, and 3 (SIL 4 does exist). According to LOPA, a SIL 1 SIF gives us a reduction by a factor of 10, a SIL 2 SIF reduces the event frequency by a factor of 100, and a SIL 3 SIF by a factor of 1000.
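Taking only the three credit factors just mentioned, a small sketch can look up the lowest SIL that closes the gap for a given scenario. The table values come from the text; the function names and the small rounding tolerance are assumptions for the illustration.

```python
# Credit per SIL as stated in the text (risk reduction factor for the SIF).
LOPA_SIL_CREDIT = {1: 10, 2: 100, 3: 1000}


def mitigated_frequency(initiating_frequency: float, sil: int) -> float:
    """Event frequency after taking credit for a SIF with the given SIL."""
    return initiating_frequency / LOPA_SIL_CREDIT[sil]


def minimum_sil(initiating_frequency: float, target_frequency: float):
    """Lowest SIL whose credit meets the target, or None if SIL 3 is not enough."""
    tolerance = 1.0 + 1.0e-9  # guards against floating point rounding at the boundary
    for sil in sorted(LOPA_SIL_CREDIT):
        if mitigated_frequency(initiating_frequency, sil) <= target_frequency * tolerance:
            return sil
    return None


# Failing pump (1E-01 per year) against the single-fatality target (1E-03 per year).
print(minimum_sil(1.0e-1, 1.0e-3))
```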

How do we estimate whether a SIF meets the requirements for SIL 1, 2, or 3? This requires us to estimate the average probability of failure on demand for the SIF. This estimation makes use of the mean time between failures of the various components of the SIF and the test frequency of these components. For this blog I skip this part of the theory; we don’t have to go into that level of detail. At a high level, we estimate what we call the probability of failure on demand for the protection layer (the SIF). In our example we need a SIF with a SIL 2 rating, a protection level that is relatively easy to achieve.
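For readers who want a feel for this estimate without the full theory, the sketch below uses the widely quoted single-channel approximation PFDavg ≈ λ_DU × TI / 2 (dangerous undetected failure rate times proof test interval, divided by two) and the low-demand PFDavg bands per SIL. It is a simplification that ignores common cause failures, diagnostics, repair time and architectural constraints, and the failure rate and test interval are made-up example values, not data from a real SIF.

```python
HOURS_PER_YEAR = 8760

# Low-demand PFDavg bands per SIL: lower bound inclusive, upper bound exclusive.
SIL_BANDS = {1: (1e-2, 1e-1), 2: (1e-3, 1e-2), 3: (1e-4, 1e-3)}


def pfd_avg_single_channel(dangerous_undetected_rate_per_hour: float,
                           proof_test_interval_hours: float) -> float:
    """Simplified approximation: PFDavg ~ lambda_DU * TI / 2."""
    return dangerous_undetected_rate_per_hour * proof_test_interval_hours / 2.0


def sil_for_pfd(pfd_avg: float):
    """SIL band the PFDavg falls in, or None if it is outside SIL 1 to 3."""
    for sil, (low, high) in SIL_BANDS.items():
        if low <= pfd_avg < high:
            return sil
    return None


# Made-up example: one dangerous undetected failure per million hours, yearly proof test.
pfd = pfd_avg_single_channel(1.0e-6, HOURS_PER_YEAR)
print(f"PFDavg = {pfd:.2E}, SIL band = {sil_for_pfd(pfd)}")
```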

In the FN curve you can also see process scenarios that require more than a factor of 100, for example a factor of 1000 as in a SIL 3 SIF. This requires a lot more, both from the reliability of the safety controller and from the other components. Maybe a single transmitter is not reliable enough anymore and we need a 2oo3 (two out of three) configuration to have a reliable measurement. Nevertheless the principle is the same: we have some initiating event, and we have one or more protection layers capable of reducing the event frequency by a specific factor. These protection layers can be a safety system (as in my example), but also a physical device (e.g. a pressure relief valve), an alarm from the control system, an operator action, a periodic preventive maintenance activity, etc. LOPA gives each of these protection layers what is called a credit factor, a factor by which we can reduce the event frequency when the protection layer is present.
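When several protection layers apply to the same scenario, their credit factors multiply. The sketch below shows that principle for the pump example; the layer names and the credit values are hypothetical and are not taken from any published LOPA table.

```python
from math import prod


def mitigated_event_frequency(initiating_frequency: float, layer_credits: dict) -> float:
    """Divide the initiating event frequency by the product of all layer credits."""
    # With no layers the product is 1, so the frequency is returned unchanged.
    return initiating_frequency / prod(layer_credits.values())


# Hypothetical layers and credit factors, for illustration only.
layers = {
    "SIF (SIL 1)": 10,
    "operator response to alarm": 10,
}

frequency = mitigated_event_frequency(1.0e-1, layers)
print(f"mitigated event frequency: {frequency:.2E} per year")
```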

So far the theory of process safety risk. One topic I avoided discussing here is how we estimate the average probability of failure on demand (PFDavg) for a protection layer, but it has some relevance for cyber risk estimates. If we went into more detail and discussed the formulas used to estimate the effectiveness / reliability of the protection layer, we would see that the formulas for estimating PFDavg depend on what is called the demand rate. The demand rate is the frequency with which we expect the protection layer will need to act.

The standard (IEC 61511) distinguishes between what is called the low-demand rate and the high / continuous demand rate. The LOPA process is based upon the low-demand rate formulas; the tables don’t work for a high / continuous demand rate. This is an important point to note when we start a quantitative cyber risk analysis, because a cyber protection layer is by default a high / continuous demand rate type of protection layer. This difference impacts the event frequency scale and as such the likelihood scale. So if we were to estimate cyber risk in a similar manner as we estimated process safety risk, we would end up with different likelihood scales. I will discuss this later.

A few important points to note from the above discussion:

  • Process safety risk is based on randomly occurring events, events based on things going wrong by accident, such as a pump failure, a leaking seal, an operator error, etc.
  • The likelihood scale of process safety risk has a “legal” meaning; plants need to meet these requirements. As such, a consolidated process safety and cyber security risk score is not relevant and, because of the estimation differences, not even possible.
  • When we estimate cyber security risk, the process safety risk is only one element. With regard to safety impact the identified safety hazards will most likely be as complete as possible, but the financial impact will not be complete because financial impact might also result from scenarios that do not impact process operations but might impact the business. The process safety hazop or LOPA does not generally address cyber security scenarios for systems that have no potential process impact, for example a historian or metering function.
  • The IEC 62443 standard tries to introduce the concept of “essential” functions and ties these functions directly to the control and safety functions. However, plants and automation functions have many essential tasks not directly related to the control and safety functions, for example various logistic functions. The automation function contains all functions connected to level 0, level 1, level 2, level 3, and the demilitarized zone. When we do a risk analysis these systems should be included, not just the control and safety elements. A ship that cannot dock at a jetty also incurs a significant cost that a cyber risk analysis should consider.
  • Some people suggest that cyber security provides process safety (or, worse, even safety in the wider sense). This is not true: process safety is provided by the safety systems, the various protection layers in place. Cyber security is an important condition for these functions to do their task, but no more than a condition. The Secret Service protects the president of the US against various threats, but it is the president who governs the country, not the Secret Service that enables the president to do his task.

Where does cyber security risk differ from process safety risk? Well first of all they have different likelihood scales. Process safety risk is based on random events, cyber security risk is based on intentional events.

Then there is the difference that a process safety protection layer always offers full protection when it is executed, whereas many cyber security protection layers don’t. We can implement antivirus as a first protection layer and application whitelisting as a second protection layer; both would do their job, but the attacker can still slip through.

Then there is the difference that a cyber security protection layer is almost continually “challenged”, where in process safety the low demand rate is most often applied, which sets the maximum demand rate to once a year.

If we looked at cyber security risk in the same way as LOPA looks at process safety risk, we could define various events with their initiating event frequency. For example, we could suggest that an event such as a malware infection occurs bi-annually. We could assign protection layers against this, for example antivirus, and assign this protection layer a probability of failure on demand (risk reduction factor), that is, a probability of a false negative or false positive. If we have an initiating event (the malware infection) with a certain frequency and a protection layer (antivirus) with a specific reduction factor, we can estimate a mitigated event frequency (of course taking the high demand rate into account).

We can also consider multiple protection layers (e.g. antivirus and application whitelisting) and arrive at a frequency representing the residual risk after applying the two protection layers. Given the various risk factors and parameters that capture the system specific elements, and given a program that evaluates the attack scenarios, we can arrive at a residual risk for one or for hundreds of attack scenarios.
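A deliberately simplified sketch of this idea is shown below. It multiplies an initiating event frequency by the probability that each protection layer fails to stop the event and does this for a small set of scenarios. All scenario names, frequencies and probabilities are invented for the illustration, the layers are assumed to fail independently, and the high demand rate treatment mentioned above is not modeled.

```python
from math import prod


def residual_frequency(initiating_frequency: float, layer_failure_probabilities) -> float:
    """Initiating frequency times the probability that every layer fails to stop the event."""
    return initiating_frequency * prod(layer_failure_probabilities)


# scenario name: (initiating events per year, failure probability per protection layer)
# Invented numbers; the first scenario has antivirus and whitelisting, the second antivirus only.
scenarios = {
    "malware via removable media": (2.0, [0.3, 0.1]),
    "malware via remote access": (0.5, [0.3]),
}

residual = {
    name: residual_frequency(frequency, layers)
    for name, (frequency, layers) in scenarios.items()
}

for name, value in residual.items():
    print(f"{name}: {value:.2f} events per year")
```

The same loop works unchanged for hundreds of scenarios, which is the point of leaving the evaluation to a program.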

Such methods are followed today, not only by the company I work for but also by several other commercial and non-commercial entities. Is it better or worse than a qualitative risk analysis (the alternative)? I personally believe it is better because the method allows us to take multiple protection layers into account. Is it an actuarial type of risk? No, it is not. But the subjectivity of a qualitative assessment has been removed because of the many factors determining the end result, and we now have risk as residual risk based upon taking multiple countermeasures into account.

Still there is another difference between process safety and cyber security risk not accounted for. This is the threat actor in combination with his/her intentions. In process safety we don’t have a threat actor, all is accidental. But in cyber security risk we do have a threat actor and this agent is a factor that influences the initiating event frequency of an attack scenario.

The target attractiveness of facilities differs for different threat actors. A nation state threat actor with all its capabilities is not likely to attack the local chocolate cookie factory, but might show interest in an important pipeline. Different threat actors mean different attack scenarios to include, but they also influence the initiating event frequency itself. Where non-targeted attacks show a certain randomness of occurrence, a targeted attack doesn’t show this randomness.

We might estimate a likelihood for a certain threat actor to achieve a specific objective at the moment that the attack takes place, but this start moment is not necessarily random. Different factors influence this, so expressing cyber risk on a similar event frequency scale as process safety risk is not possible. Cyber security risk is not based on the randomness of the event frequencies. If there is political friction between Russia and Ukraine, the number of cyber attacks and the level of skill applied are much higher than in times without such a conflict.

Therefore cyber security risk and process safety risk cannot be compared. Though the cyber threat certainly increases the process safety risk (the initiating event frequency can be higher and the protection layer might not deliver the level of reliability expected), we cannot express this rise in process safety risk level because of the differences discussed above. Process safety risk and cyber security risk are two different things and should be approached differently. Cyber security has this “Secret Service” role, and process safety this “US president” role. We can estimate the cyber security risk that this “Secret Service” role will fail and the “US president” role is made to do bad things, but that is an entirely different risk than the risk that the “US president” role will fail. It can fail even when the “Secret Service” role is fully active and doing its job. Therefore cyber security risk has no relation with process safety risk; they are two entirely different risks. The safety protection layers provide process safety (resilience against accidental failure), while the cyber security protection layers provide resilience against an intentional and malicious cyber attack.

A wake-up call

Process safety practices and formal safety management systems have been in place in the industry for many years. Process safety management has been widely credited for reductions in the number of major accidents, proactively saving hundreds of lives and improving industry performance.

OT cyber security, the cyber security of process automation systems such as used by the industry, has a lot in common with the management of process safety, and we can learn a great deal from the experience of formal safety management, built up over more than 60 years. Last week I saw a lengthy email chain on the question “if the ISA 99 workgroup (the developers of the IEC 62443 standard) should look for a closer cooperation with the ISO 27001 developers to better align the standards”. Of course such a question results in discussions, with some addressing the importance of ISO 27001 and others emphasizing the difficulties of applying the standard in the industry.

If there is a need for ISA to cooperate with another organization to align its standard, then in my opinion it should seek closer cooperation with the AIChE, the ISA of the chemical engineering professionals. The reason for this is that there is a lot to learn from process safety, and though IEC 62443 is supposed to be a standard for industrial control systems, there is still a lot of “IT methodology” in the standard.

In this week’s blog I would like to address the link between process safety and cyber security again, and discuss some of the lessons in process safety we can (actually should, in my opinion) learn from. Not to add safety to the security triad as some suggest, which is in my opinion wrong for multiple reasons, but because process safety management and OT cyber security management have many things in common when we compare their design and management processes, and because process safety management is much further in its development than OT cyber security management.

In my career I obtained several cyber security certifications, such as CISSP, CISM, CRISC, and ISO 27001 LA, and many more with a purely technical focus. Did all this course material ever discuss industrial control systems (ICS)? No, it didn’t; still, it was very valuable for developing my knowledge. As an employee working for a major vendor of ICS solutions, my task became more to adopt what was applicable, learn from IT, and park those other bits of knowledge for which I saw no immediate application. As my insights further developed I started to combine bits and pieces from other disciplines more easily. In the OT cyber security world, which is relatively immature, we can learn a lot from more mature disciplines such as process safety management. Such learning generally requires us to make adaptations to address the differences.

But despite these differences we can learn from studying the accomplishments of the process safety discipline; this will certainly help OT cyber security mature faster. If we want to learn from process safety, where better to start than the many publications of the CCPS (Center for Chemical Process Safety) and the AIChE?

Process safety starts at risk: it studies the problem first before attempting to solve it. Process safety recognizes that not all hazards and risks are equal; consequently it focuses more resources on the higher hazards and risks. Does this also apply to cyber security? In my opinion, very sparingly.

More mature asset owners in the industry have adopted a risk approach in their security strategy, but the majority of asset owners are still very solution driven. They decide on purchasing solutions before they have inventoried and prioritized the cyber security hazards.

What are the advantages of a risk based approach? Risk allows for optimally apportioning limited resources to improve security performance. Risk also provides a better insight in where the problems are and what options there are to solve a problem. Both OT cyber security risk and process safety risk are founded on four pillars:

  • Commit to cyber security and process safety – A workforce that is convinced that their management fully supports a discipline as part of its core values will tend to follow process even when no one else is checking this.
  • Understand hazards and risk – This is the foundation of a risk based approach; it provides the insight required to make justifiable decisions and apply limited resources in an effective way. This fully applies to both OT cyber security and process safety.
  • Manage risk – Plants must operate and maintain the processes that pose risk, keep changes to those processes within the established risk tolerances, and prepare for / respond to / manage incidents that do occur. There is nothing in here that doesn’t apply to process safety as well as to OT cyber security.
  • Learn from experience – If we want to improve, and because of the constantly changing threat landscape we need to improve, we need to learn. We learn by observing and challenging what we do. Metrics can help us here, preferably metrics that provide us with leading indicators so we see the problem coming. This pillar also applies to both disciplines: where process safety tracks near misses, OT cyber security can track parameters such as AV engines detecting malware, or firewall rules bouncing off traffic attempting to enter the control network.

Applying these four pillars to OT cyber security would in my opinion significantly improve OT cyber security over time. Based upon what I see in the field, I estimate that less than 10 percent of the asset owners have adopted a risk based methodology for cyber security, while more than 50 percent have adopted a risk based methodology for process safety or asset integrity.

If OT cyber security doesn’t want to learn from process safety, it will take another 15 years to reach the same level of maturity. If this happens we are bound to see many serious accidents in the future, including accidents with casualties. OT cyber security is about process automation; let’s use the knowledge other process disciplines have built up over time.

The alternative to a risk based approach is a compliance based approach. Well known examples are the NERC CIP standard for North America and the French ANSSI standard for cyber security in Europe or, if we look at process safety, the OSHA regulations in the US. These are all compliance driven initiatives. A compliance driven approach can lead to:

  • Decisions based upon “If it isn’t a regulatory requirement, we are not going to do it”.
  • The wrong assumption that stopping the more frequent and easier to measure incidents (like for example the mentioned malware detection) discussed in standards will also stop the low-frequency / high consequence incidents.
  • Inconsistent interpretation and application of the practices described in the standard. Standards are often a compromise between conflicting opinions, resulting in soft requirements open for different interpretations.
  • Overemphasized focus on the management system, forgetting about the technical aspects of the underlying problems.
  • Poor communication between the technically driven staff and the business driven senior management, resulting in management not understanding the importance of the messages they receive and subsequently failing to act.
  • High auditing costs, where audits focus on symptoms instead of underlying causes.
  • Not moving with the flow of time. Technology is continuously changing, posing new risk to address, risk that is not identified by a standard developed even as recently as 5 years ago.

This criticism of a compliance approach doesn’t mean I am against standards and their development. I am merely against standards as an excuse to switch off our brain, close our eyes and ignore where we differ from the average. Risk based processes offer us the foundation to stay aware of all changes in our environment while still using standards as a checklist to make certain we don’t forget the obvious.

The four available strategies

As I mentioned, for cyber security the majority of the asset owners would fall in the standards-based or compliance-based category. It is a step forward compared to 10 years ago, when OT cyber security was ignored by many, but it is a long way off from where asset owners are for process safety.

Where we see in process safety the number of accidents decline, we see in cyber security that both the threat event frequency and the threat capability of the threat community rise. To keep pace with the growing threat, critical infrastructure should adopt a risk based strategy. Unfortunately many governments are driving for a compliance based strategy because they can more easily audit this, and in doing so they are setting the requirements too low for proper protection against the growing threat.

A risk based approach doesn’t exclude compliance with a standard; it just takes the extra step of predicting the performance of the various cyber security aspects, independent of any loss events, and improving security accordingly. As such it adds pro-activity to the defense and allows us to keep pace with the threat community.

The process safety community recognized the bottlenecks of a compliance based strategy and jumped forward by introducing a risk based approach, allowing it to further reduce the number of process safety accidents after several serious accidents happened in the 1980s, accidents caused by failure of the compliance based management systems.

Because of the malicious aspects inherent to cyber security, because of the fast growing threat capabilities of the threat community and because of an increase in threat events, not jumping to a risk based strategy like the process safety community did means waiting for the first casualties to occur as a result of a cyber attack. TRISIS had the potential to be the first attack causing loss of life; we were lucky it failed. But the threat actors have undoubtedly learned from their failure and are working on an improved version.

I don’t include the alleged attack on a Siberian pipeline in 1982 as a cyber event as some do. If such an event would happen due to a cyber attack this would be an act of war. So for me we have been lucky so far that cyber impact was mainly a monetary value, but this can change either willingly or accidentally.

It would become a very lengthy blog if I would discuss each of the twenty elements of the risk based safety program or reliability program. But each of these elements has a strong resemblance with what would be appropriate for a cyber security program.

The element I would like to jump to is the Hazard Identification and Risk Analysis (HIRA) element. HIRA is the term used to bundle all activities involved in identifying hazards and evaluating the risk induced by these hazards. In my previous blog on risk I showed a more detailed diagram for risk, splitting it into three different forms of risk. For this blog I would like to focus on the hazard part, using the following simplified diagram for the same three forms of risk.

Simplified risk model

On the left side we have the consequence of the cyber security attack, some functional deviation of the automation system. This is what was categorized as loss of required performance and loss of ability to perform. The third category, loss of confidentiality, will not lead directly to a process risk, so I ignore it here. Loss of required performance causes the automation system to either execute an action that should not have been possible (not meeting design intent) or an action that does not perform as it should (not meeting operation intent). In the case of loss of ability to perform, the automation system cannot execute one or more of its functions.

So perhaps the automation system’s configuration was changed in a way that the logical dimensions configured in the automation system no longer represent the physical dimensions of the equipment in the field. For example, if the threat actor increases the level range of a tank, this does not result in a bigger physical tank volume, so a possibility exists that the tank is overfilled without the operator noticing this on his process displays. The logical representation of the physical system (its operating window) should fit the physical dimensions of the process equipment in the plant. If this is not the case, this would be the failure mode “Integrity Operating Window (IOW) deviation” in the “Loss of Required Performance” category.
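The tank example can be turned into a small consistency check between the configured (logical) range and the physical range of the equipment. This is only a sketch of the idea; the class, the function and the dimensions are hypothetical, and in practice the physical values would have to come from trusted engineering data.

```python
from dataclasses import dataclass


@dataclass
class LevelRange:
    low_m: float   # lowest level in metres
    high_m: float  # highest level in metres


def iow_deviation(configured: LevelRange, physical: LevelRange) -> bool:
    """True when the configured range extends beyond the physical tank dimensions."""
    return configured.low_m < physical.low_m or configured.high_m > physical.high_m


physical_tank = LevelRange(low_m=0.0, high_m=10.0)
tampered_config = LevelRange(low_m=0.0, high_m=12.0)  # level range increased by the threat actor

if iow_deviation(tampered_config, physical_tank):
    print("Configured level range no longer fits the physical tank")
```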

Similarly, the threat actor might prevent the operator from stopping or starting a digital output, the failure mode “Loss of Controllability” in the category “Loss of Ability to Perform”. Not being able to stop or start a digital output might translate to the inability to stop or start a pump in the process system, at least by using the automation system. We might have implemented an electrical switch (safeguard) to do it manually if the automation system fails.

Not being able to modify a control parameter would give rise to a whole other category of issues for the production process. Each failure mode has a different consequence for the process system equipment and the production process.

Cyber security hazards are a description of a threat actor (threat community) executing a threat event (a threat action exploiting a vulnerability) resulting in a specific consequence (the functional deviation), entering a specific failure mode for the automation system function. What the consequence is for the production process and its equipment depends on the automation system function affected and the characteristics of the production system equipment and production process. This area is investigated by the process (safety) hazard analysis. Safety is in brackets here because not every functional deviation results in a consequence for process safety; there can also be consequences for product quality or production loss that do not impact process safety at all. If the affected function is the safety instrumented system (SIS), a deviation in functionality would always affect process safety.

The HIRA for process risk would focus on how the functional deviations influence the production process and the asset integrity of its equipment. As such the HIRA has a wider scope than it would have in a typical process safety hazard analysis / hazop. For cyber security it combines what we call the computer hazop, the analysis of how failures in the automation system impact the production system and the process safety hazop.

On the other hand from a safeguard perspective of the safety hazop / PHA the scope is smaller because we can only influence the functionality of the “functional safety” functions provided by the SIS. Safety has multiple layers of protection and multiple safeguards and barriers that contain a dangerous situation. A cyber security attack can only impact the programmable electronic components (e.g. SIS) of the process safety protection functions.

This is the reason why I protest when people talk about “loss of safety” in the context of cyber security: there are in general more protection mechanisms implemented, so safety is not necessarily lost. Adding safety to the triad is also incorrect in my opinion. At a minimum this should be adding functional safety, because that is the only safety layer that can be impacted by a cyber threat event, but functional safety is already covered within the definition of loss of required performance.

IEC 62443’s loss of system integrity is not covering all the failure modes covered by loss of required performance. The IEC 62443-1-1 defines integrity as: “Quality of a system reflecting the logical correctness and reliability of the operating system, the logical completeness of the hardware and software implementing the protection mechanisms, and the consistency of the data structures and occurrence of the stored data.”

This definition fully ignores the functional aspects of an automation system; therefore it is too limited a cyber security objective for an automation system. For example, where do we find in the definition that an automation action needs to be performed at the right moment, in the right sequence, and appropriately coordinated with other functions?

Consider for example the coordination / collaboration between a conveyor and a filling mechanism or a robot. The seven foundational requirements of IEC 62443 don’t cover all aspects of an automation function / industrial control system. The combined definitions used by risk based asset integrity management and risk based process safety management do cover these aspects. This is an example of a missed chance to learn something from an industry that has considerably more experience in its domain than the OT cyber security community has in its own field.

Can we conduct the HIRA process for cyber security in a similar way as we do for process safety? My only answer here is a firm NO! The malicious aspects of cyber security make it impossible to work in the same way as we do for process safety. The job would just not be repeatable (so results are not consistent) and too big (so extremely time consuming). The number of possible threat events, vulnerabilities, and consequences is just too big to approach this in a workshop setting as we do for process hazard analysis (PHA) / safety hazop.

So in cyber security we work with tooling to capture all the possibilities, we categorize consequences and failure modes to assign them a trustworthy severity value meeting the risk criteria of the plant. But in the end, we end with a risk priority number just like we have in risk based process safety or risk based asset integrity to rank the hazards.

The formula for cyber security risk is more complex because we not only have to account for occurrence (threat x vulnerability) and consequence, but also for detection and for the risk reduction provided by countermeasures, safeguards and barriers. But these are normal differences; risk based asset integrity management and risk based process safety management also differ at detail level.
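To make the structure of such a formula a bit more tangible, here is a minimal Python sketch. It is not the formula used by any specific tool or standard. The scales, the multiplicative combination, and all names (ThreatEvent, occurrence, risk_priority) are my assumptions for illustration only; the point is merely that detection and risk reduction enter the calculation next to occurrence and consequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreatEvent:
    """Hypothetical inputs for one cyber security hazard (all scales assumed)."""
    threat: float          # likelihood the threat action is attempted, 0..1
    vulnerability: float   # likelihood the attempt succeeds, 0..1
    consequence: int       # severity rank, 1 (minor) .. 5 (catastrophic)
    detection: float       # probability the event is detected in time, 0..1
    risk_reduction: float  # combined factor of countermeasures, safeguards
                           # and barriers; 1.0 means no reduction at all


def occurrence(event: ThreatEvent) -> float:
    """Occurrence as threat x vulnerability, as described in the text."""
    return event.threat * event.vulnerability


def risk_priority(event: ThreatEvent) -> float:
    """Assumed combination: undetected occurrence times consequence,
    divided by the risk reduction factor. Used for ranking hazards only."""
    if event.risk_reduction < 1.0:
        raise ValueError("risk_reduction must be >= 1.0")
    undetected = 1.0 - event.detection
    return occurrence(event) * undetected * event.consequence / event.risk_reduction


# Example: the same hazard with and without a countermeasure in place.
unprotected = ThreatEvent(0.5, 0.4, 5, 0.5, 1.0)
protected = ThreatEvent(0.5, 0.4, 5, 0.5, 2.0)
ranking = sorted([unprotected, protected], key=risk_priority, reverse=True)
```

The resulting number only has meaning relative to other hazards scored the same way, which is exactly how a risk priority number is used to rank hazards.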

The following key principles need to be addressed when we develop, evaluate, or improve any management system for risk:

  • Maintain a dependable and consistent practice – So the practice should be documented, and the objectives and benefits must be stated in terms that demonstrate to management and employees the value of the activities;
  • Identify hazards and evaluate risks – Integrate HIRA activities into the life cycle of the ICS. Both design and security operations should be driven by risk based decisions;
  • Assess risks and make risk based decisions – Clearly define the analytical scope of HIRAs and assure adequate coverage. A risk system should address all the types of cyber security risk that management wants to control;
  • Define the risk criteria and make risk actionable – It is crucial that all understand what a HIGH risk means, and that it is defined what the organization will do when something attains this level of risk. Risk appetite differs depending on the production process or process units within that process;
  • Follow through on risk assessment results – Involve competent personnel and make a consistent risk judgement, so we can follow through without too much debate when the results require this.
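The principle of defining the risk criteria and making risk actionable can be sketched in a few lines of Python. Everything in this sketch is hypothetical: the process unit names, the thresholds, and the actions are placeholders I made up, not values from a real plant. It only illustrates that the same risk score can mean a different risk level depending on the risk appetite of the process unit, and that each level is tied to a predefined action.

```python
from enum import Enum


class RiskLevel(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical risk criteria per process unit: (MEDIUM from, HIGH from).
# A unit with a lower risk appetite gets lower thresholds.
CRITERIA = {
    "utilities": (0.10, 0.50),
    "cracker": (0.05, 0.25),
}

# Hypothetical actions the organization has agreed on for each level.
ACTIONS = {
    RiskLevel.LOW: "accept and monitor",
    RiskLevel.MEDIUM: "plan risk reduction in the next maintenance cycle",
    RiskLevel.HIGH: "escalate and mitigate before continuing operation",
}


def classify(score: float, unit: str) -> RiskLevel:
    """Map a risk score to a level using the criteria of the given unit."""
    medium_from, high_from = CRITERIA[unit]
    if score >= high_from:
        return RiskLevel.HIGH
    if score >= medium_from:
        return RiskLevel.MEDIUM
    return RiskLevel.LOW


def required_action(score: float, unit: str) -> str:
    """Return the agreed action for this score in this unit."""
    return ACTIONS[classify(score, unit)]
```

With these assumed criteria a score of 0.3 is a MEDIUM risk for the utilities but a HIGH risk for the cracker, so the follow-up differs without any debate about what HIGH means.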

Risk diagrams to express process risk generally have fewer risk levels than a risk assessment diagram for cyber security. This is because process risk has a more direct relationship with the business / mission risk, so actions have a direct business justification. An example risk assessment diagram for process risk is shown in the following diagram:

Risk assessment diagram for process risk example

The ALARP acronym stands for As Low As Reasonably Practicable, a commonly used criterion for process related risk. Once we have the cyber security hazards and their process consequences, we can assign a business impact to each hazard and create risk assessment matrices for each impact category, as explained in my blog on OT cyber security risk using the impact diagram as example. Or, if preferred, the different impact categories can be shown in a single risk assessment matrix.

Mission impact example
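As a rough illustration of a risk assessment matrix per impact category, consider the following Python sketch. The 3x3 matrix, the region boundaries, and the impact category names are all assumptions chosen for the example; a real plant would use its own risk criteria. The sketch looks up the region (acceptable, ALARP, unacceptable) for each impact category of a hazard and also shows how the categories could be collapsed into a single result.

```python
# Hypothetical 3x3 matrix: rows are likelihood ranks, columns severity ranks.
MATRIX = (
    ("acceptable", "acceptable", "ALARP"),        # likelihood 1 (rare)
    ("acceptable", "ALARP", "unacceptable"),      # likelihood 2 (possible)
    ("ALARP", "unacceptable", "unacceptable"),    # likelihood 3 (likely)
)

# Regions ordered from best to worst, used to pick the worst case.
ORDER = ("acceptable", "ALARP", "unacceptable")


def region(likelihood: int, severity: int) -> str:
    """Look up the matrix cell for 1-based likelihood and severity ranks."""
    if not (1 <= likelihood <= 3 and 1 <= severity <= 3):
        raise ValueError("ranks must be between 1 and 3")
    return MATRIX[likelihood - 1][severity - 1]


def assess(likelihood: int, severity_by_category: dict) -> dict:
    """One matrix lookup per impact category for a single hazard."""
    return {
        category: region(likelihood, severity)
        for category, severity in severity_by_category.items()
    }


def worst(regions: dict) -> str:
    """Collapse the per-category results into the worst region."""
    return max(regions.values(), key=ORDER.index)


# Example hazard with assumed severity ranks per impact category.
result = assess(2, {"safety": 3, "environment": 2, "finance": 1})
```

Keeping the result per category preserves the information on which type of impact drives the risk, while the worst case gives the single value needed when everything is shown in one matrix.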

So far this discussion has been about the parallels between risk based process safety, risk based asset integrity, and risk based OT cyber security. I noticed in responses to previous blogs that for many this is uncharted terrain, because they might not be familiar with all three disciplines and the terminology used. Most risk methods used for cyber security have an IT origin. This results in ignoring the action part of an OT system, OT being Data + Action driven where IT is Data driven only. This is another reason to look more closely at other risk based management processes applied in a plant.

There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 42 years of work in this industry: approximately half of that time engineering these automation systems, and half of it implementing their networks and securing them.

Author: Sinclair Koelemij

OTcybersecurity web site