
ICS cyber security risk criteria


Abstract

In my previous blog (Why process safety risk and cyber security risk differ) I discussed the differences between process safety risk and cyber security risk and why these two risks don’t align. In this blog I’d like to discuss some of the cyber security risk criteria, why they differ from process safety risk criteria, and how they align with cyber security design objectives. And as usual I will challenge some misconceptions around IEC 62443-3-2.


Cyber security can be either prescriptive or risk based. An example of a prescriptive approach is IEC 62443-3-3: the standard is a list of security requirements to follow in order to create resilience against a specific level of cyber threat. For this, IEC 62443 uses security levels with the following definitions:

  • SL 1 – Prevent the unauthorized disclosure of information via eavesdropping or casual exposure.
  • SL 2 – Prevent the unauthorized disclosure of information to an entity actively searching for it using simple means with low resources, generic skills and low motivation.
  • SL 3 – Prevent the unauthorized disclosure of information to an entity actively searching for it using sophisticated means with moderate resources, IACS specific skills and moderate motivation.
  • SL 4 – Prevent the unauthorized disclosure of information to an entity actively searching for it using sophisticated means with extended resources, IACS specific skills and high motivation.

The criteria that create the difference between the security levels are the attacker’s means, resources, skills and motivation. A few examples to translate the formal text of the standard into what happens in the field.

An example of an SL 1 level of resilience is protection against a process operator looking over the shoulder of a process engineer to retrieve the engineer’s password. Maybe you don’t consider this a cyber attack, but formally it is a cyber attack by a hopefully non-malicious insider.

An example of an SL 2 level of resilience is protection against malware, for example a process control system that became infected by the WannaCry ransomware. WannaCry didn’t target specific asset owners and did not specifically target process control systems; any system with an exposed vulnerability to the EternalBlue exploit could become a victim.

An example of an SL 3 level of resilience is protection against the attack on Ukraine’s power grid. This was a targeted attack, using a mixture of standard tactics, like phishing to gain access to the system, and specially crafted software to make modifications in the control system.

An example of an SL 4 level of resilience would be the protection against Stuxnet, the attack on Iran’s uranium enrichment facilities. This attack made use of zero-day vulnerabilities, required detailed knowledge of the production process, and thorough testing of the attack on a “copy” of the actual system.

The difference between SL 2 and SL 3 is clear, however the difference between SL 3 and SL 4 is a bit arbitrary. For example, is the Triton attack an example of SL 3 or can we say it is SL 4? The Triton attack, like Stuxnet, required a significant investment from the threat actors. But where Stuxnet had clear political motives, the motives for the Triton attack are less clear to me. SL 4 is often linked to resilience against attacks by nation-states, but with all these “nation state sponsored” threat actors that distinction often disappears.

The IEC 62443-3-3 standard assigns the security levels to the security zones and conduits of the control system, defining the technical security requirements for the assets in a zone and for the communication between zones. The targets for a threat actor are either an asset or a channel (a protocol flowing through the conduit). Though the standard series also attempts to address risk with IEC 62443-3-2, there is no clear path from risk to security level: there is no “official” transformation matrix that converts a specific risk level into a security level, apart from the many issues such a matrix would have. So overall IEC 62443-3-3 is a prescriptive form of cyber security. This is common for standards and government regulations; they need clear instructions.

The issue with prescriptive cyber security is that it doesn’t address the specific differences between systems. It is a kind of ready-to-wear suit: it does an overall good job, but a tailor-made suit fits better. The tailor-made suit of cyber security is risk based security. In my opinion risk based security is a must for process automation systems supporting the critical infrastructure. Threat actors targeting critical infrastructure are highly skilled professionals; we should not dismiss them as hackers doing it for fun. The offensive side is also a profession, either servicing a “legal” business such as the military or an “illegal” business such as cyber crime. And this cybercrime business is big business, bigger for example than the drugs trade.

Protection against the examples I gave for SL 3 and SL 4 requires a detailed analysis of the attack scenarios that can be launched against the process installation, analyzing which barriers are in place to stop or detect these scenarios, and estimating the residual risk after having applied the barriers to see if we meet the risk criteria. For this we need residual risk, the risk after applying the barriers, and residual risk requires quantitative risk analysis to estimate the risk reduction, similar to how process safety does this with LOPA as discussed in my previous blog.

Quantitative analysis is required because we want to know the contribution in risk reduction from the various barriers. Barriers are the countermeasures with which we can reduce the likelihood that an attack succeeds, or with which we reduce or take away the impact severity of the consequences of a successful cyber attack. A simple example of a likelihood reducing barrier is an antivirus solution; a simple example of an impact severity reducing barrier is a back-up. These barriers are not limited to the various security packages but also include design choices in the automation system and the production facility. The list of barriers I work with during my professional time has grown to over 400 and keeps growing with every detailed risk assessment. The combination of attack scenarios (the TTP of the threat actors), barriers, and consequences (the deviations of the process automation functions as a result of a successful cyber attack) is the basis for a risk analysis.

The value for consequence severity can either be a severity level assigned by a subject matter expert to the functional deviation (resulting in a risk priority number (RPN)) or the severity scores / target factors of the worst-case process scenarios that can result from the potential functional deviation. These severity scores / target factors come from the HAZOP or LOPA sheets created by the process safety engineers during their analysis of what can go wrong with the process installation and what is required to control the impact when things go wrong. When we use LOPA / HAZOP severity scores for consequence severity we call the resulting risk loss based risk (LBR), the risk that a specific financial, safety, or environmental loss occurs. For a security design a risk priority number is generally sufficient to identify the required barriers; however, to justify the risk mitigation (the investment in those barriers) to management and regulatory organizations, loss based risk is often a requirement.
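To make the difference tangible, below is a minimal sketch; the scales, values and the loss category are hypothetical and only illustrate how an RPN (ranking) figure and an LBR (loss) figure are built up differently:

```python
# Sketch: risk priority number (RPN) versus loss based risk (LBR) for one
# functional deviation. All scales and values are hypothetical examples.

likelihood = 3            # qualitative likelihood score (1 = remote .. 5 = expected)

# RPN: a subject matter expert assigns a severity level to the functional deviation
sme_severity = 4          # 1 = negligible .. 5 = severe
rpn = likelihood * sme_severity          # bigger number = bigger risk, used for ranking only

# LBR: severity comes from the worst-case process scenario in the HAZOP / LOPA sheets,
# expressed as an estimated loss for a specific loss category (here: financial)
worst_case_loss_usd = 2_500_000          # hypothetical loss for the process scenario
event_frequency = 0.02                   # hypothetical mitigated event frequency per year
lbr_usd_per_year = event_frequency * worst_case_loss_usd

print(f"RPN = {rpn} (ranking of barriers)")
print(f"LBR = {lbr_usd_per_year:,.0f} USD/year (justification to management)")
```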

To carry out a detailed risk assessment the risk analyst must command a wide range of skills: understanding the automation system’s workings in detail, understanding the process and safety engineering options, and understanding risk. Therefore risk assessments are teamwork between different subject matter experts, but the risk analyst needs to connect all the input to get to the result.

But risk assessment is also a very rewarding cyber security activity, because communicating risk to management is far easier than explaining to them why we invest in a next generation firewall instead of a lower cost packet filter. Senior management understands risk far better than cyber security technology.


So much for the intro, now let’s look at the risk criteria. In their approach to cyber risk, ISA/IEC tend to look closely at how process safety approaches the subject and attempt to copy it.

However, as explained in my previous blog, there are major differences between cyber security risk and process safety risk, making such a comparison more complex than it seems. Many of these differences also have an impact on the risk criteria and the terminology we use. I’d like to discuss two terms in this blog, unmitigated risk and risk tolerance. Let’s start with unmitigated risk.

Both ISA 62443-3-2 and the ISA TR 84 work group use the term unmitigated risk. As an initial action they want to determine the unmitigated risk and compare this with the risk criteria. Unmitigated risk is a term used by process safety for the process safety risk prior to applying the safeguards that protect against randomly occurring events which could cause a specific worst-case process impact. The safeguards protect against events like a leaking seal, a failing pump, or an operator error. The event frequency of such an initiating event (e.g. a failed pump) is reduced by applying safeguards; the resulting event frequency after applying the safeguards needs to meet specified limits (see my previous blog). Safety engineers basically close an event frequency gap using safeguards with a reliability that actually accomplishes this. The safeguard will bring the process into a safe state by itself. Multiple safeguards can exist, but each safeguard will prevent the worst-case consequence by itself. It is not that safeguard 1 and safeguard 2 need to accomplish the job together; each of them individually does the job. There might be a preferred sequence, but not a dependency between protection layers.

Cyber security doesn’t work that way. We might define a specific initiating event frequency for the chance of a virus infection, but after installing an antivirus engine we cannot say the problem is solved and no virus infections will occur. The reliability of a process safety barrier is an entirely different factor than the effectiveness of a cyber security barrier. Both reduce risk, but the process safety barrier reduces the risk to the required limit by itself, with a reliability that allows us to take the credit for the risk reduction. The cyber security barrier most likely requires additional barriers to reach an acceptable event frequency / risk reduction.

Another difference is that in cyber security the initiating event does not occur randomly; it is an intentional action of a threat actor with a certain propensity (a term introduced by Terry Ingoldsby of Amenaza) to target the plant, select a TTP and target a specific function. The threat actor has different TTPs available to accomplish the objective, so the initiating event frequency is not so much one point on the frequency scale but a range with a low and a high limit. We cannot pick the high limit of this range (representing the highest risk) beforehand, because the process control system’s cyber resilience actually determines whether a specific TTP can succeed; we need to “exercise” a range of scenarios to determine the highest event frequency (highest likelihood) applicable within the range, as sketched below.
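A minimal sketch of that idea; the TTPs, frequencies and barrier effectiveness values are invented purely for illustration:

```python
# Sketch: the initiating event frequency of an intentional attack is a range, not a point.
# We "exercise" each candidate TTP against the implemented barriers and keep the highest
# resulting event frequency as the likelihood for the risk estimate. Values are hypothetical.

scenarios = {                               # events per year before barriers
    "phishing + remote access": 0.5,
    "infected USB device": 0.2,
    "compromised supplier laptop": 0.1,
}

barrier_effectiveness = {                   # residual fraction that passes the barriers
    "phishing + remote access": 0.05,
    "infected USB device": 0.10,
    "compromised supplier laptop": 0.20,
}

residual = {ttp: f * barrier_effectiveness[ttp] for ttp, f in scenarios.items()}
worst_ttp = max(residual, key=residual.get)
print(worst_ttp, residual[worst_ttp])       # the highest residual event frequency drives the risk
```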

A third difference is defense in depth. In process safety defense in depth is implemented by multiple independent protection layers. For example there might be a safety instrumented function configured in a safety system, but additionally there can be physical protection layers like a dike or other forms of bunding to control the impact. In cyber security we also apply defense in depth; it is considered bad practice if security depends on a single control to protect the system. However many of these cyber security controls share a common element, the infrastructure / computer. We can have an antivirus engine and we can add application whitelisting or USB protection to further reduce the chance that malware enters the system, but they all share the same computer platform, offering a shared point of failure.

Returning to the original remark on unmitigated risk: in process automation cyber security, unmitigated risk doesn’t exist. The moment we install a process control system we install some inherent protection barriers. Examples are authentication / authorization, various built-in firewalls, encrypted messages, etc. So when we start a cyber security risk analysis there is no unmitigated risk; we can’t ignore the various barriers built into the process automation system. A better term to use is therefore inherent risk: the risk of the system as we analyze it. The inherent risk is also a residual risk, but it is not unmitigated, because a range of barriers is already implemented when we start a risk analysis. The question is rather: does the inherent risk meet the risk criteria, and if not, what barriers are required to arrive at a residual risk that does meet the criteria?


The second term I’d like to discuss is risk tolerance. Both IEC 62443 and the ISA TR 84 work group posit that the cyber security must meet the risk tolerance. I fully agree with this; where we differ is that I don’t see risk tolerance as the cyber security design target, where the IEC 62443 standard does. In risk theory, risk tolerance is defined as the maximum loss a plant is willing to experience. To further explain my issue with the standard I first discuss how process safety uses risk tolerance, using an F-N diagram from my previous blog.

Example F-N diagram

This F-N curve shows two limits, using a red and a green line. Anything above the red line is unacceptable, anything below the green line is acceptable. The area between the two lines is considered tolerable if it meets the ALARP (As Low As Reasonably Practicable) principle. For a risk to be ALARP, it must be possible to demonstrate that the cost involved in reducing the risk further would be grossly disproportionate to the benefit gained. Determining that a risk has been reduced to the ALARP level involves an assessment of the risk to be avoided, an assessment of the investment (in money, time and trouble) involved in taking measures to avoid that risk, and a comparison of the two.
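In its simplest form the ALARP demonstration is a cost/benefit comparison against a “gross disproportion” threshold; below is a minimal sketch, where the threshold and all values are illustrative assumptions rather than prescribed numbers:

```python
# Sketch of an ALARP "gross disproportion" test. All values are hypothetical.
risk_reduction_benefit = 40_000    # expected annual loss avoided by the extra measure (USD/yr)
measure_cost_per_year = 250_000    # annualized cost of the extra measure (USD/yr)
gross_disproportion_factor = 3     # example threshold; chosen per company / regulator in practice

# If the cost exceeds the benefit by more than the disproportion factor, further risk
# reduction is considered grossly disproportionate and the risk can be argued to be ALARP.
is_alarp = measure_cost_per_year > gross_disproportion_factor * risk_reduction_benefit
print("ALARP demonstrated" if is_alarp else "further risk reduction required")
```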

In process safety where the events occur randomly this is possible, in cyber security where the events occur intentionally this is far more difficult.

Can we consider the risk for the power grid in a region with political tensions the same as in a region having no conflicts at all? I have worked for refineries that had a requirement to be resilient against SL 2 level threat actors, but also refineries that told me they wanted SL 3, and there was even a case where the requirement was SL 4. So in cyber security the ALARP principle is not really followed; there is apparently another limit. The risk tolerance limit is the same for all, but there is also a limit that sets a kind of cyber security comfort level. This level we call the risk appetite, and this level should actually become our cyber security design target.

Risk appetite and risk tolerance shouldn’t be the same; there should be a margin between the two that allows for the possibility to respond to failures of the security mechanisms. If risk appetite and risk tolerance were the same, and this would also be our design target, any issue with a security control would result in an unacceptable risk.

In principle, unacceptable risk means we should stop production (as is the case for process safety), so if the limits were the same we would have created a kind of on/off mechanism. For example, if we couldn’t update our antivirus engine for some reason, risk would rise above the limit and we would need to stop the plant. Maybe an extreme example and certainly not a situation I see happening in real life. However, when risk becomes not acceptable we should have an action defined for this case. There are plants that don’t have a daily update but follow a monthly or quarterly AV signature update frequency (making the AV engine’s effectiveness very low with all the polymorphic viruses), so apparently risk appetite differs.

If we want to define clear actions / consequences for each risk level we need sufficient risk levels to do this, and limits that allow us to respond to failing cyber controls. So we need two limits, risk appetite and risk tolerance. The difference between risk appetite and risk tolerance is called the risk capacity; having a small risk capacity means that issues can escalate quickly. A security design must also consider the risk capacity in the selection of barriers; “defense in depth” is an important security principle here because it increases the risk capacity.

Example risk matrix

Above is an example risk matrix (with different criteria than the F-N diagram above) with 4 risk levels (A, TA, TNA, NA). Typically 4 or 5 risk levels are chosen and for each level the criteria and action are specified. In the example above the risk tolerance level is the line that separates TNA and NA. The risk appetite might be either the line between TA and TNA, or between A and TA for a very risk averse asset owner. However, the risk capacity of a security design where the risk appetite is defined at the TA / TNA border is much smaller than if our target were the A / TA border. But in both cases a risk increase due to the failure of a cyber security control will not immediately escalate into a not acceptable risk condition.
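Read as an example interpretation of those four levels (the matrix does not spell out the abbreviations, so the expansions and actions below are my assumption of the typical intent):

  • A – Acceptable: no further action required beyond normal monitoring of the barriers;
  • TA – Tolerable-Acceptable: acceptable, but periodically review whether the barriers still perform as intended;
  • TNA – Tolerable-Not-Acceptable: only tolerable with a defined and scheduled risk reduction plan;
  • NA – Not Acceptable: immediate action required to bring the risk down.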

If we opt for risk based security, we need to have actionable risk levels, and in that case a single risk tolerance level as specified and discussed in IEC 62443-3-2 is just not possible and therefore not practiced for cyber security risk. The ALARP principle and cyber security don’t go together very well.


Maybe a bit of a boring topic, maybe too much text; nevertheless I hope the reader understands why unmitigated risk doesn’t exist in cyber security for process automation. And I certainly hope that the reader will understand that the risk tolerance limit is a bad design target.


There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 43 years of work in this industry. Approximately half of the time working in engineering these automation systems, and half of the time implementing their networks and securing them, and conducting cyber security risk assessments for process installations since 2012.


Author: Sinclair Koelemij

OTcybersecurity web site

Cyber risk assessment is an exact business

Abstract


This blog is about risk assessment in cyber physical systems and some of its foundational principles. I have created several blogs on the topic of risk assessment before, for example “Identifying risk in cyber physical systems” and “ISA 62443-3-2 an unfettered opinion“. Specifically the one criticizing the ISA standard caused several sharp reactions. In the meantime ISA 62443-3-2 has been adopted by the IEC, and a task group focusing on cyber security and safety instrumented systems (ISA 84 work group 9) addresses the same topic. Unfortunately the ISA 84 team seems to have the intention to copy the ISA 62443-3-2 work, and with this the structural flaws will also enter the work of this group. One of these flaws is the ISA approach toward likelihood.

Therefore this blog addresses this very important aspect of a risk assessment in the cyber physical system arena. Likelihood is actually the only factor of interest, because the impact (severity of the consequences) is a factor provided by the process hazard analysis that the asset owner makes. Likelihood / probability is therefore the most important and challenging variable to determine, because many of the cyber security controls affect the likelihood of a cyber security hazard occurring. Much of the risk reduction is achieved by reducing, through countermeasures, the chance that a cyber attack will succeed.

Having said this about likelihood, we must not ignore the consequences, as I have advocated in previous blogs. This is because risk reduction through “cyber safeguards” (those barriers that either eliminate the possibility of a consequence occurring or reduce the severity of that consequence – see my earlier blogs) is often a more reliable risk reduction than through “cyber countermeasures” (those barriers that reduce the likelihood).

But this blog is about likelihood, the bottlenecks we face in estimating the probability and selecting the correct quantitative scale to allow us to express a change in process safety risk as the result of the cyber security threat.

At the risk of losing readers before coming to the conclusions, I would still like to cover this topic in the blog and explain the various problems we face. If you are interested in risk estimation I suggest you read it; if not, just forward it to stalk your enemies.



Where to start? It is always dangerous to start with a lot of concepts, because for part of the audience this is known stuff, while for others it answers some questions they might have. I am not going to give many formulas, most of these have been discussed in my earlier blogs, but it is important to understand that there are different forms of risk. These forms often originate from the different questions we would like to answer. For example we have:

  • Risk priority numbers (RPN): a risk value used for ranking / prioritization of choices. This risk is a mathematical number; the bigger the number, the bigger the risk;
  • Loss based risk (LBR): a risk used for justifying decisions. This risk is used in production plants and ties the risk result to monetary loss due to production outage / equipment damage / permit to operate, or a loss expressed in terms of human safety or environmental damage. Often risk is estimated for multiple loss categories;
  • Temporal risk or actuarial risk: a risk used for identifying the chance of a certain occurrence over time. For example insurance companies use this method of risk estimation, analyzing for example the impact of obesity over a 10 year period to understand how it impacts their cost.

There are more types of risk than the three forms above, but these three cover the larger part of the areas where risk is used. The only two I will touch upon in this blog are RPN and LBR. Temporal risk, the question of how many cyber breaches caused by malware my plant will face 5 years from now, is not raised today in our industry. Actually, at this point in time (2021) we couldn’t even estimate this, because we lack the statistical data required to answer the question.

We would like to know: what is the cyber security risk today, what is my financial, human safety, and environmental risk, can it impact my permit to operate, and how can I reduce this risk in the most effective way? What risk am I willing to accept, what should I keep a close eye on, and what risk will I avoid? RPN and LBR estimates are the methods that provide an answer here and both are extensively used in the process industry: RPN primarily in asset integrity management and LBR for business / mission risk. LBR is generally derived from the process safety risk, but business / mission risk is not the same as process safety risk.


Risk Priority Numbers are the result of what we call a Failure Mode, Effects and Criticality Analysis (FMECA), a method applied by, for example, asset integrity management. FMECA was developed by the US military over 80 years ago to change their primarily reactive maintenance processes into proactive maintenance processes. The FMECA method focuses on qualitative and quantitative risk identification for preventing failure; if I would write “for preventing cyber failure” I create the link to cyber security.

For Loss Based Risk there are various estimation methods in use (for example based on annual loss expectancy), most of them very specific to a business and often not relevant or not applied in the process industry. The risk estimation method most used today in the process industry is the Layers Of Protection Analysis (LOPA) method. This method is used extensively by the process safety team in a plant. The Hazard and Operability (HAZOP) study identifies the various process hazards, and LOPA adds risk estimation to this process to identify the most prominent risks.

If we know what process consequences can occur (for example a seal leak causing exposure to a certain chemical, whether or not there is potential ignition, how many people would approximately be present in the immediate area, whether or not the exposure is contained within the fence of the plant, etc.) we can convert the consequence into a monetary / human safety / environmental loss value. This, together with the LOPA likelihood estimate, results in a Loss Based Risk score.

How do we “communicate” risk? Generally in the form of a risk assessment matrix (RAM): a matrix that has a likelihood scale and an impact scale, and shows (generally using different colors) what we see as acceptable risk, tolerable risk, and not acceptable risk zones. Other methods exist, but the RAM is the form most often used.


A picture of a typical risk assessment matrix with 5 different risk levels and some criteria for risk appetite and risk tolerance. And a picture to have at least one picture in the blog.

So much for the high-level description, now we need to delve deeper into the “flaws” I mentioned in the introduction / abstract. Now that the going gets tough, I need to explain the differences between low demand mode and high demand / continuous mode, as well as the differences between IEC 61511 and IEC 62061, some differences in how the required risk reduction is estimated, and also how the frequency-of-events scale underlying the qualitative likelihood scale plays a role in all this. This is a bit of a chicken and egg problem, we need to understand both before understanding the total. Let me try; I start with a deeper dive into process safety and the safety integrity level – the SIL.


In process safety we start with what is called the unmitigated risk, the risk inherent to the production installation. This is already quite different from cyber security risk, where unmitigated risk would be a foolish point to start. Connected to the Internet and without firewalls and other protection controls, the process automation system would be penetrated in milliseconds by all kinds of continuous scanning software. Therefore in cyber security we generally start with a residual risk based upon some initial countermeasures, and we investigate if this risk level is acceptable / tolerable. But let’s skip a discussion on the difference between risk tolerance and risk appetite, a difference ignored by IEC 62443-3-2 though from a risk management perspective an essential difference that provides us the time to manage.

However in plants the unmitigated situation is relevant; this has to do with what is called the demand rate, and with the fact that we don’t analyze risk from a malicious / intentional perspective. Incidents in process safety are accidental, caused by wear and tear of materials and equipment, or by human error. There is no malicious intent; in a HAZOP sheet you will not find scenarios described where process operators intentionally open / close valves in a specific combination / sequence to create damage. And typically most hazards identified have a single or sometimes a double cause, but not much more.

Still asset owners like to see cyber security risk “communicated” in the same risk assessment framework as the conventional business / mission risk derived from process hazard analysis. Is this possible? More to the point, is the likelihood scale of a cyber security risk estimate the same as the likelihood scale of a process safety risk estimate if we take the event frequencies underlying the often qualitative scales of the risk diagrams into account?

In a single question: would the qualitative likelihood score of Expected for a professional boxer to be knocked out (intentional action) during his career be the same as for a professional accountant (accidental)? Can we measure these scores along the same quantitative event frequency scale? I don’t think so. Even if we ignore for a moment the difference in career length, the probability that the accountant is knocked out (considering he works for a reputable company) during his career is much lower. But the IEC/ISA 62443-3-2 method does seem to think we can mix these scales. But okay, this argument doesn’t necessarily convince a process safety or asset integrity expert, even though they came to the same conclusion. To explain I need to dive a little deeper into the SIL topic in the next section and discuss the influence of another factor, the demand rate, on the quantitative likelihood scale.


The hazard study (HAZOP) and the risk assessment (LOPA) are used in process safety to define the safety requirements. The safety requirements are implemented using what are called Safety Instrumented Functions (SIF). The requirements for a SIF can be seen as two parts:

  • Safety function requirements;
  • Safety integrity requirements;

The SIFs are implemented to bring the production process to a safe state when demands (hazardous events) occur. A demand is, for example, the pressure in a vessel rising above a specified limit, the trip point. The safety function requirements specify the function and the performance requirements for this function. For example the SIF must close an emergency shutdown valve (the “Close Flow” function); performance parameters can be the closing time and the leakage allowed in closed position. For this discussion the functional requirements are not that important, but the safety integrity requirements are.


When we discuss safety integrity requirements, we can split the topic into four categories:

  • Performance measures;
  • Architectural constraints;
  • Systematic failures;
  • Common cause failures.

For this discussion the performance measures, expressed as the Safety Integrity Level (SIL), are of importance. For the potential impact of cyber attacks (some functional deviation or a lost function), the safety function requirements (above) and the common cause failures, architectural constraints, and systematic failures are of more interest. But the discussion here focuses on risk estimation, and more specifically on the likelihood scale, not so much on how and where a cyber attack can cause process safety issues. Those scenarios can be discussed in another blog.

The safety integrity performance requirements define the area I focus on; in these requirements safety integrity performance is expressed as a SIL. IEC 61508 defines 4 safety integrity levels, SIL 1 to SIL 4, SIL 4 having the most strict requirements. To fulfill a SIL, the safety instrumented function (SIF) must meet both a quantitative requirement and a set of qualitative requirements.

According to IEC 61508 the quantitative requirement is formulated based on either an average probability of failure on demand (PFDavg) for what are called low demand SIFs, or a probability of dangerous failure per hour (PFH) for high demand / continuous mode SIFs. Two different event frequency scales are defined, one for the low demand SIFs and one for the high demand SIFs. So depending on how often the protection mechanism is “challenged”, the risk estimation method will differ.
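For reference, the quantitative targets that IEC 61508 attaches to each SIL (quoted here from memory, so please verify against the standard before using them) are roughly:

  • SIL 1 – PFDavg ≥ 10^-2 to < 10^-1 (low demand); PFH ≥ 10^-6 to < 10^-5 per hour (high demand / continuous);
  • SIL 2 – PFDavg ≥ 10^-3 to < 10^-2; PFH ≥ 10^-7 to < 10^-6 per hour;
  • SIL 3 – PFDavg ≥ 10^-4 to < 10^-3; PFH ≥ 10^-8 to < 10^-7 per hour;
  • SIL 4 – PFDavg ≥ 10^-5 to < 10^-4; PFH ≥ 10^-9 to < 10^-8 per hour.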

For the low demand SIFs the SIL expresses a range of risk reduction values, and for the high demand SIFs it is a scale of maximum frequencies of dangerous system failures. Two different criteria, two different methods / formulas for estimating risk. If the SIF is demanded less than once a year, low-demand mode applies; if the SIF is demanded more than once a year, high demand / continuous mode applies. The LOPA technique for estimating process safety risk is based upon the low-demand mode, the assumption that the SIF is demanded less than once a year. If we compare our cyber security countermeasure with a SIF, would our countermeasure be demanded just once a year? An antivirus countermeasure is demanded every time we write to disk, not to mention a firewall filter passing / blocking traffic. Risk estimation for cyber security controls is high demand / continuous mode.

LOPA specifies a series of initiating event frequencies (IEF) for various occurrences. These event frequencies can differ between companies, but with some exceptions they are similar. For example the IEF for a control system sensor failure can be set to once per 10 years, and the IEF for unauthorized changes to the SIF can be set to 50 times per year. These are actual examples for a chemical plant, values from the field, not examples made up for this blog. The 50 seems high, but by giving it a high IEF it also requires more and stronger protection layers to reduce the risk to within the asset owner’s risk criteria. So a certain level of subjectivity is possible in LOPA, which creates differences between companies for sometimes the same risk. But most of the values are based on statistical knowledge available in the far more mature process safety discipline, compared with the cyber security discipline.

In LOPA the IEF is reduced using protection layers (a kind of countermeasure) with a specific PFDavg. The PFDavg can be a calculated value, carried out for every safety design, a value selected from the LOPA guidelines, or an asset owner preferred value.

For example an operator response as protection layer is worth a specific factor (maybe a PFD of 0.5), so just by having an operator specified as part of the protection layers (maybe a response to an alarm) the new frequency, the Mitigated Event Frequency (MEF), for the failed control system sensor becomes 0.5 × 0.1 per year, equivalent to once per 20 years. Based upon one or more independent protection layers with their PFD, the IEF is reduced to a MEF. If we create a linear scale for the event frequency and sub-divide the event scale into a number of equal qualitative intervals to create a likelihood scale, we can estimate the risk given a specific impact of the hazardous event. The linearity of the scale is important since we use the formula risk equals likelihood times impact; if the scale were logarithmic the multiplication wouldn’t work anymore.
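A minimal sketch of that LOPA arithmetic; the IEF and PFD values are the illustrative ones from the text, not values to design with:

```python
# Sketch: LOPA likelihood estimation for one hazardous event. Values are the
# illustrative ones used in the text, not recommended design values.

ief = 1 / 10          # initiating event frequency: sensor failure, once per 10 years

protection_layers = { # independent protection layers and their PFD (probability of failure on demand)
    "operator response to alarm": 0.5,
    # "SIF (SIL 2)": 0.01,   # further independent layers would multiply in the same way
}

mef = ief
for layer, pfd in protection_layers.items():
    mef *= pfd        # each independent layer reduces the event frequency by its PFD

print(f"Mitigated event frequency: {mef} per year (once per {1 / mef:.0f} years)")
```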

In reality the LOPA calculation is a bit more complicated, because we also take enabling factors into account, the probability of exposure, and multiple protection layers (including a BPCS control action, the SIF, operator response, and physical protection layers such as rupture discs and pressure relief valves). But the principle to estimate the likelihood remains a simple multiplication of factors. However, LOPA applies only to the low demand cases, and LOPA uses an event scale based on failures of process equipment and human errors. Not an event scale that takes into account how often a cyber asset like an operator station gets infected by malware, not taking into account intentional events, and it starts with the unmitigated situation. Several differences with risk estimation for cyber security failures.


Most chemical and refinery processes follow LOPA and qualify for low-demand mode, with a maximum demand rate for a SIF of once a year. However, in the offshore industry there is a rising demand for SIL based on high demand mode. Also in the case of machinery (IEC 62061) a high demand / continuous mode is required. Using high demand mode results in a different event scale and different likelihood formulas, and LOPA no longer applies. Now let us look at cyber security and its likelihood scale. When we estimate risk in cyber physical systems we have to account for this; if we don’t, a risk assessment might still be a learning exercise because of the process itself, but translating the results onto a loss based risk scale would be fake.


Let us use the same line of thought used by process safety and translate this into the cyber security language. We can see a cyber attack as a threat actor applying a specific threat action to exploit a vulnerability, resulting in a consequence; the consequence being a functional deviation of the target’s intended operation, if we ignore cyber attacks breaching confidentiality and focus on attacks attempting to impact the cyber physical system.

The threat actor + threat action + vulnerability combination has a certain event frequency; even if we don’t yet know what that frequency is, there is a certain probability of occurrence. That probability depends on many factors. Threat actors differ in capabilities and opportunity (access to the target), the exploit difficulty of a vulnerability differs, and on top of all that we protect these vulnerabilities using countermeasures (the protection layers in LOPA) we need to account for. We install antivirus to protect against storing a malware infected file on the hard disk, we protect against repeated logins that attempt to brute force authentication by implementing a retry limit, we protect against compromised login credentials by installing two factor authentication, etc. A vulnerability is not just a software deficiency, but also an unused open port on a network switch, or an unprotected USB port on a server or desktop. There are hundreds of vulnerabilities in a system, each waiting to be exploited. There are also many cyber countermeasures, each with a certain effectiveness, each creating a protection layer. And with defense in depth we generally have multiple protection layers protecting our vulnerabilities. A very similar model to the one discussed for LOPA. This was recognized, and a security method called ROPA (Rings of Protection Analysis) was developed for physical security, using the same principles to get a likelihood value based upon an event frequency.

Though ROPA was initially developed for physical security, this risk estimation method can also be used for cyber security. What is missing today is a standardized table for the IEF and the PFD of the various protection controls. This is not that different from ROPA and LOPA, because for both methods there is a guideline, but in daily practice each company defines its own list of IEF and PFDavg factors.

Another important difference with process safety is that cyber and physical security require, like machinery, a high demand / continuous mode, which alters the risk estimation formulas (and event scales!), but the principles remain the same. So again the event frequency scale for cyber events differs from the event frequency scale used by LOPA. The demand rate for a firewall or antivirus engine is almost continuous, not limited to once a year like in the low demand mode of LOPA for process safety and the business / mission risk derived from it.
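Put as formulas, this is a simplified sketch of the difference (enabling factors and conditional modifiers are omitted, and the high demand expression assumes demands arrive quasi-continuously):

```latex
% Low demand (LOPA): the mitigated hazard frequency is the initiating event
% frequency reduced by the PFDavg of each independent protection layer.
f_{\mathrm{hazard}} = f_{\mathrm{IEF}} \cdot \prod_{i} \mathrm{PFD}_{\mathrm{avg},i}

% High demand / continuous mode: demands occur (quasi) continuously, so the
% hazard frequency is governed by the dangerous failure frequency of the
% protection function itself rather than by a demand frequency.
f_{\mathrm{hazard}} \approx \mathrm{PFH}
```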


The key message in the above story is that the event scale used by process safety for risk estimation is based upon failures of primarily physical equipment and accessories, such as sealings, and upon human errors. The event scale of cyber security, on the other hand, is linked to the threat actor’s capabilities (tactics, techniques, procedures), motivation, resources (knowledge, money, access to equipment) and opportunity (opportunities differ for an insider and an outsider, and some attacks are more “noisy” than others); it is linked to the effectiveness of the countermeasures (ease of bypassing, false negatives, maintenance) and to the exposure of the vulnerabilities (directly exposed to the threat actor, requiring multiple steps to reach the target, detection mechanisms in place, response capabilities). These are different entities, resulting in different scales. We cannot ignore this when estimating loss based risk.

Accepting the conclusion that the event scales differ for process safety and for cyber security is accepting the impossibility of using the same risk assessment matrix for showing cyber security risk and process safety risk, as is frequently requested by the asset owner. The impact will be the same, but the likelihood will be different, and as such the risk appetite and tolerance criteria will differ.


So where are the flaws in IEC/ ISA 62443-3-2?

Starting with an inventory of the assets and channels in use is always a good idea. The next step, the High Level Risk Assessment (HLRA), is in my opinion the first flaw. There is no information available after the inventory to estimate likelihood, so the method suggests setting the likelihood to 1; basically the risk assessment becomes a consequence / impact assessment at this point in time, an activity of the FMECA method. Why is an HLRA done, what is the objective of the standard? Well, it is suggested that these results are later used to determine if risk is tolerable; if not, a detailed risk assessment is done to identify improvements. It is also seen as a kind of early filter mechanism, focusing on the most critical impact. Additionally the HLRA should provide input for the zone and conduit analysis.

Can we, using consequence severity / impact, determine if the risk is tolerable (a limit suggested by ISA; in my view the risk appetite limit is more appropriate than the risk tolerance limit)?

The consequence severity comes from the process hazard analysis (PHA) of the plant, so the HAZOP / LOPA documentation. I don’t think this information links to the likelihood of a cyber event happening. I’ll give you a silly example to explain. Hazardous event 1: someone slaps me in the face. Hazardous event 2: a roof tile hits me on my head. Without knowledge of the likelihood (event frequency), I personally would consider the roof tile the more hazardous event. However, if I had some likelihood information telling me that being slapped in the face happens every hour and the roof tile hits once every 30 years, I might reconsider and regret having removed hazard 1 from my analysis. FMECA is a far more valuable process than the suggested HLRA.

Does an HLRA contribute to the zone and conduit design? Zone boundaries are created as a grouping of assets with similar security characteristics, following the IEC 62443 definition. Is process impact a security characteristic I can use to create this grouping? How can I relate a tube rupture in a furnace’s firebox to my security zone?

Well, then we must first ask ourselves how such a tube rupture can occur. Perhaps it is caused by a too high temperature; this I might link to a functional deviation caused by the control system, or perhaps it might happen as the consequence of a SIF not acting when it is supposed to act. So I may link the consequence to a process controller, or maybe a safety controller, causing a functional deviation resulting in the rupture. But this doesn’t bring me the zone; if I further analyze how this functional deviation is caused I get into more and more detail, very far away from the HLRA. An APC application in the level 3 network segment could cause it, a modification through a breached BPCS engineering station at level 2 could cause it, an IAMS modifying the sensor configuration could cause it, and so on.

Security zone risk can be estimated in different ways, for example as zone risk following the ANSSI method – a kind of risk of the neighborhood you live in. We can also look at asset risk and set the zone risk equal to the risk of the asset with the highest risk, as sketched below. But all these methods are based on RPN, not on LBR.
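A minimal sketch of the “highest asset risk” option; the asset names and RPN values are hypothetical:

```python
# Sketch: zone risk taken as the risk of the asset with the highest risk priority number.
asset_rpn = {
    "BPCS engineering station": 18,
    "operator station": 12,
    "IAMS server": 15,
}
zone_risk = max(asset_rpn.values())
print(f"Zone risk (RPN) = {zone_risk}, driven by {max(asset_rpn, key=asset_rpn.get)}")
```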

HLRAs just don’t provide valuable information; we typically use them for early filtering, making a complex scope smaller. But that is neither required nor appropriate for cyber physical systems. Doing an HLRA helps the risk consultant who doesn’t know the plant or the automation systems in use, but in the process of estimating risk and designing a zone and conduit diagram it has no added value.

As a next step ISA 62443-3-2 does a detailed risk assessment to identify the risk controls that reduce the risk to a tolerable level. This is another flaw, because a design should be based upon the asset owner’s risk appetite, which is the boundary of the acceptable risk – the risk we are happy to accept without much action. The tolerable risk level is linked to the risk tolerance level; if we go above this we get unacceptable risk. An iterative detailed risk assessment identifying how to improve is of course correct, but we should compare with the risk appetite level, not the risk tolerance level.

If the design is based on the risk tolerance level there is always panic when something goes wrong. Maybe our AV update mechanism fails; this would lead to an increase in risk, and if we are already at the edge because of our security design, we immediately enter an unacceptable risk situation. Where normally our defense in depth, not relying on a single control, would give us some peace of mind to get things fixed. So the risk appetite level is the better target for a risk based security design.

Finally there is this attempt to combine the process safety risk and the cyber security risk into one RAM. This is possible, but it requires a mapping of the cyber security risk likelihood scale onto the process risk likelihood scale, because they differ. They differ not necessarily at the qualitative level, but the underlying quantitative mitigated event frequency scale differs. We cannot say that a Medium score on the cyber security likelihood scale is equivalent to a Medium score on the process safety likelihood scale. And it is the likelihood of the cyber attack that drives the cyber related loss based risk, which we have to compare with the process safety related loss based risk to determine the bigger risk.

This comparison requires a mapping technique to account for the underlying differences in event frequency. When we do this correctly we can spot the increase in business risk due to cyber security breaches. But only an increase; cyber security does not reduce the original process safety risk that determined the business risk before we estimated the cyber security risk.
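As a sketch of such a mapping: the bucket boundaries below are invented for illustration, real boundaries come from the asset owner’s risk criteria and the underlying event frequency data:

```python
# Sketch: map a quantitative mitigated event frequency (per year) onto the qualitative
# likelihood buckets of two different scales. Boundaries below are illustrative only.

def to_bucket(frequency_per_year, boundaries):
    """Return the index of the qualitative likelihood bucket for a given frequency."""
    for i, upper in enumerate(boundaries):
        if frequency_per_year < upper:
            return i
    return len(boundaries)

process_safety_bounds = [1e-4, 1e-3, 1e-2, 1e-1]   # low demand world: rare events
cyber_bounds          = [1e-2, 1e-1, 1, 10]        # cyber world: far higher event frequencies

labels = ["Remote", "Unlikely", "Possible", "Likely", "Expected"]

f_cyber = 0.5   # hypothetical mitigated frequency of a cyber scenario, per year
print("On the cyber scale:         ", labels[to_bucket(f_cyber, cyber_bounds)])
print("On the process safety scale:", labels[to_bucket(f_cyber, process_safety_bounds)])
```

The same mitigated frequency lands in different qualitative buckets on the two scales, which is exactly why a Medium on one scale is not a Medium on the other.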

So a long story, maybe a dull topic for someone not daily involved in risk estimation, but a topic that needs to be addressed. Risk needs a sound mathematical foundation if we want to base our decisions on the results. If we miss this foundation we create the same difference as there is between astronomy and astrology.

For me, at this point in time, until otherwise convinced by your responses, the IEC / ISA 62443-3-2 method is astrology. It can make people happy going through the process, but lacks any foundation to justify decisions based on this happiness.


There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 42 years of work in this industry. Approximately half of the time working in engineering these automation systems, and half of the time implementing their networks and securing them.


Author: Sinclair Koelemij

OTcybersecurity web site

The role of detection controls and a SOC

Abstract


This blog is about cyber security detection controls for cyber physical systems, process automation systems, process control systems, or whatever we want to call them. The blog discusses their value in securing these systems and some of the requirements to consider when implementing them. As usual, my blogs approach the subject from a combination of the knowledge area of cyber security and the knowledge area of automation technology. More specific to this topic is the combination of Time Based Security and Asset Integrity Management.



It’s been a while since I wrote a blog; I postponed my retirement for another 2 years, so there is just too much to do to spend my weekends creating blogs. But sometimes the level of discomfort rises above my high alarm threshold for discomfort, forcing me to analyze the subject to better understand the cause of this discomfort. After seven months of not blogging, there are plenty of topics to discuss. For this blog I want to analyze the requirements for a cyber security detection function for cyber-physical systems. Do they differ from the requirements we set for information systems, and if so, what are these differences? Let’s start with the historical context.


Traditional cyber security concepts were based on a combination of a fortress mentality and risk aversion. This resulted in the building of static defenses as described in the TCSEC, the “Orange Book”, of the 1980s. But almost 40 years later, this no longer works in a world where the information resides increasingly at the very edge of the automation systems. Approximately 20 years ago a new security model was introduced as an answer to the increasing number of issues the traditional models were facing.

This new model was called Time Based Security (TBS); it is a dynamic view of cyber security based on protection, detection and response. Today we still use elements of this model; for example the NIST cybersecurity framework reflects the TBS model by defining security as Identify, Protect, Detect, Respond and Recover. For the topic of this blog, I focus specifically on the three elements of the TBS model: Protect, Detect and React. The idea behind this model is that the formula P > D + R defines a secure system. This formula expresses that the amount of time it takes for a threat actor to break through the protection (P) must be greater than the time it takes to detect the intrusion (D) plus the time it takes to respond (R) to this intrusion.

Time Based Security model
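A minimal sketch of that P > D + R test; the times are hypothetical:

```python
# Sketch: Time Based Security check, P > D + R. Times in minutes; all values are hypothetical.
protection_time = 30   # P: how long the protection is expected to hold against this threat actor
detection_time = 10    # D: time needed to detect the intrusion
reaction_time = 25     # R: time needed to respond once the intrusion is detected

if protection_time > detection_time + reaction_time:
    print("Secure per TBS: P > D + R")
else:
    exposure = (detection_time + reaction_time) - protection_time
    print(f"Not secure per TBS: the attacker has {exposure} minutes of unopposed access")
```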

In the days of the Orange Book we were building our fortresses, but we also discovered that sooner or later someone found a path across, under or through the wall, and we wouldn’t know about it. The TBS model kind of suggests that if we have a very fast detection and response mechanism, we can use a less strong protection mechanism. This is a very convenient idea in a world where systems become increasingly open.

This does not mean that we can throw away our protection strategies; it just means that in addition to protection, we can also implement a good detection and response strategy to compensate for a possible lack of protection. We need a balance: sometimes protection seems the logical answer, in other cases a focus on detection and response makes sense. If we compare this to physical security, it makes sense for a bank to protect a safe against a torch burning through the safe’s wall; delaying the attack increases the time we have to detect and respond. On the other hand, a counter that faces the public and is therefore relatively unprotected may benefit more from a panic button (detection) to initiate a response.

It is important to realize that detection and response are “sister processes” and that they work in series: if one fails, they both fail. Now, with the concept of TBS in mind, let’s move on to a cyber attack on a cyber physical system.


To describe a cyber attack, I use the cyber kill chain defined by Lockheed Martin in combination with a so-called P-F curve. P-F curves are used in the Asset Integrity Management discipline to show equipment health over time to identify the interval between potential failure and functional failure. Since I need a relationship between time and the progress of a cyber attack, I borrowed this technique and created the following diagram.

P-F curve describing the cyber attack over time

A P-F curve has two axes: the vertical axis represents the cyber resilience of the cyber-physical system and the horizontal axis represents the progress of the attack over time. The curve shows several important moments in the timeline of the attack. The P0 point is where it all starts. In order for an attack to happen, we need a particular vulnerability; more than this, the vulnerability must be exposed – the threat actor must be able to exploit it.

The threat actor who wants to attack the plant gathers information and looks for ways to penetrate the defenses. He / she can find multiple vulnerabilities, but eventually one will be selected and the threat actor will start developing an exploit. This can be malware that collects more information, but also when enough information has been collected, the code can be developed to attack the cyber-physical system (CPS).

To do this, we need a delivery method, sometimes along with a propagator module that penetrates deeper into the system to deliver the payload (the code that performs the actual attack on the CPS) to one of the servers in the inner network segments of the system. This could be a USB stick containing malware, but it could also be a phishing email or a website infecting a user’s laptop and perhaps installing a keylogger that waits for the user to connect to the CPS from a remote location. The target is not necessarily the plant the attack was initially aimed at; it could also be a supplier organization that builds the system or provides support services. Sensitive knowledge of the plant is stored in many places; different organizations build, maintain and operate the plant together and as such have a shared responsibility for the security of the plant. There are many scenarios here; for my example I propose that the attacker has developed malicious code with the ability to perform a scripted attack on the Basic Process Control System (BPCS). The BPCS is the automation function controlling the production process; if our target is a chemical plant it will usually be a Distributed Control System (DCS).

The first time security is confronted with the attack is at P1, where the chosen delivery method successfully penetrates the system to deliver its payload. Here we have protection mechanisms such as USB protection checks, antivirus protection, firewalls and maybe two-factor authentication against unauthorized logins. But we should also have detection controls: we can warn the moment a USB device is inserted, we can monitor who is logging into the system, we can even inspect what traffic is going through the firewall and see if it is using a suspicious amount of bandwidth, which could indicate the download of a file. We can monitor many indicators to check if something suspicious is happening. All of this falls into a category defined as “early detection”: detection at a time when we can still stop the attack with an appropriate response before the attack hits the BPCS.

Nevertheless, our threat actor is skilled enough to slip through this first layer of defense, and the malicious code delivers the payload to a server or desktop station of the BPCS. At this point (P2), the malicious code is installed in the BPCS and the threat actor is ready for the final attack (P3). We assume that the threat actor uses a script that describes the steps to attack the BPCS. Perhaps the script writes digital outputs that shut down a compressor, or initiates an override command to bypass a specific safety instrumented function (SIF) in the safety system. Many scenarios are possible at this level in the system once the threat actor has gathered information about how the automation functions are implemented: which tag names represent which function, which parameters to change, which authorizations are required, which parameters are displayed, etc.

At this point, the threat actor has passed several access controls that protect the inner security zones, passed the controls that prevent malicious code from installing on the BPCS servers / desktops, and bypassed the application whitelisting controls that prevent execution of unauthorized code. For detection at this level, we can use different audit events that signal access to the BPCS, and we can also use anomaly detection controls that signal abnormal traffic. An important point here is that the attacker has already penetrated very deep into the system, to such a level that reacting in time to stop the attack is almost impossible without an automated response.

The problem with what we call “late detection” is that we need an automated response to be on time when the actor has an immediate malicious intent. Only when the threat actor postpones the attack until a later moment can we find the time to respond “manually”. However, an automated response requires a lot of knowledge of what to do in different situations and requires a high level of confidence in the detection system. A detection system that delivers an abundance of false positives would become a bigger threat to the production system than a threat actor.


If we look at how quickly attacks on cyber-physical systems progress, we can take the attack on Ukraine’s electricity grid as an example. The time per attack here was estimated to be less than 25 minutes. In these 25 minutes – the speed of onset – the threat actors gained access to the SCADA system, opened the circuit breakers, corrupted the firmware of protocol converters preventing the panel operators from closing the circuit breakers remotely, and wiped the SCADA hard drives. So in our theoretical TBS model, we would need a D + R time of less than 25 minutes to be effective, and even then only partially effective, depending on the moment of detection. This is very short.

Can a security operations center fill this role? What would this require? Let us first look at the detection function to find an answer.


First of all we need a reliable detection mechanism; a detection mechanism that produces many false positives will soon slow down any reaction. What is required for a reliable detection mechanism?

  • In-depth knowledge of the various events that can happen: what is an expected event and what is a suspicious event? This can differ depending on the plant’s operating modes, such as normal operation, abnormal operation, or emergency. System behavior is also quite different during a plant stop. All of these conditions can be a source of suspicious events;
  • In-depth knowledge of the various protocols in use; for systems such as a DCS this includes a wide set of proprietary protocols that are not publicly documented. These have proven to be a considerable hurdle for anomaly detection systems;
  • In-depth knowledge of process operations: which activities are normal and which are not? Should we detect the modification of the travel rate of a valve, is the change of this alarm setting correct, is disabling this process alarm a suspicious event or a normal action to address a transmitter failure? This level of detection is not even attempted today; anomaly detection systems might detect a program download to a controller or PLC, but the process operator would know this immediately even without a warning from the SOC. In all cases close collaboration with process operations is essential, which would slow down the reaction.

What if the security operations center (SOC) is outsourced, what would be different compared to an “in-company” SOC?

  • On the in-depth knowledge of the various events that can happen, the outsourced SOC has a small disadvantage because, as an external organization, it will not always be aware of the different plant states. But overall it will perform similarly to an in-company SOC.
  • On the in-depth knowledge of the various protocols, the outsourced SOC can have a disadvantage because it faces a bigger learning curve, even though serving multiple similar customers helps to build this knowledge. Nevertheless, knowledge of the proprietary protocols remains a hurdle.
  • The in-depth knowledge of process operations is in my opinion fully out of reach for an outsourced SOC. As such, control over this would remain primarily a protection task. Production processes are too specific to formulate criteria for normal and suspicious activity.

What if the threat actor also targets the detection function? Disabling the detection function during the attack would certainly prevent the reaction. In the case of an outsourced SOC we would expect:

  • A dedicated connection between SOC and plant. If the connection runs over public infrastructure such as the Internet, a denial-of-service attack becomes a feasible way to disable the detection function;
  • A local – in the plant – detection system would still provide information on what is going on if the connection to the SOC fails. The expertise of a SOC might be missing but the detection function would still be available;
  • The detection system should preferably be “out of band”, because this makes a simultaneous attack on protection and detection systems more difficult.

How about the reaction function, what do we need for a reaction that can “outperform” the threat actor?

Reaction in a cyber-physical system is not easy because we have two parts: the programmable electronic automation functions that are attacked and the physical system controlled by the automation system. The physical system has completely independent “dynamics”, we can’t just stop an automation function and expect the physical process to be stable in all cases. Even during a plant stop we often need certain functions to remain active, and even after a black shutdown – ESD 0 / ESD 0.1 – some functions are still required. There are rotating reactors that need to keep rotating to prevent a runaway reaction, a cement kiln needs to rotate otherwise it breaks under its own weight, cooling might be required to prevent a thermal runaway, and we need power for many functions.

Stopping a cyber attack by disconnecting from external networks is also not always a good idea. Some malware “detonates” when it loses its connection with a command and control server and might wipe the hard disk to hide its details. Detection therefore requires a good identification of what is happening, and the reaction needs to be aligned with both the attack and the production process. A wrong reaction can easily increase the damage.

While a SOC can often actively contribute to the response to a breach of the security of an information system, this is in my opinion not possible for a cyber-physical system. In a cyber-physical system, the state of the production process is essential, but also which automation functions remain active. In principle, a safety system – if present – can isolate process parts and make the installation pressure-less / de-energized, but even then some functions are required. Process operations therefore play a key role in the decision-making process; it is generally not acceptable to initiate an ESD 0 when a cyber attack occurs. So several preparatory actions are needed to organize a reaction, such as:

  • We need a disconnection process describing under what circumstances disconnection is allowed and what alternative processes need to start up when disconnection is activated. Without having this defined and approved before the situation occurs, it is difficult to respond quickly;
  • We need to have playbooks that document what to do in what situation, depending on which automation function is affected. Different scenarios exist; we need to be prepared and not improvise at the moment the attack occurs, some guidance is required (see the sketch after this list);
  • We need to understand how to contain the attack: is it possible to split a redundant automation system into two halves, an unaffected part and an affected part;
  • We need to understand if we can contain the attack to parts of the plant if multiple “trains” exist in order to limit the overall impact;
  • We need to test / train processes to make sure they work when needed.
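To make the playbook idea a bit more concrete, below is a minimal sketch of how such pre-approved guidance could be structured. The scenario names, actions and decision owners are hypothetical examples, not a recommendation for any specific plant.

```python
# Illustrative playbook skeleton; scenario names, actions and owners are hypothetical.
playbooks = {
    "bpcs_server_malware": {
        "containment": ["isolate affected server from the control network",
                        "fail over to the redundant server"],
        "process_actions": ["verify operator displays against field readings"],
        "decision_owner": "shift supervisor / process operations",
    },
    "sis_engineering_station_compromise": {
        "containment": ["disconnect the engineering station",
                        "set logic solver key switch to RUN (no program changes)"],
        "process_actions": ["review SIF status and recent logic downloads"],
        "decision_owner": "process safety engineer",
    },
}


def respond(scenario: str) -> dict:
    """Look up the pre-approved actions for a given attack scenario."""
    return playbooks.get(scenario, {"containment": ["escalate: no playbook defined"]})


print(respond("bpcs_server_malware")["containment"])
```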

All this would be documented in the incident handling process, where containment and recovery zones and response strategies are defined to organize an adequate and rapid response. But all this must be in accordance with the state of the production process, possible process reconstitution, and the defined / selected recovery strategy.

Process operations plays a key role in the decision making during a cyber attack that impacts the physical system. As such a SOC is primarily an advisor and spectator because of a knowledge gap. Site knowledge is key here; detailed knowledge of the systems and their automation functions is essential for the reaction process.

Of course we should also think about recovery, but the first focus must be on detection and response, because here we can still limit the impact of the cyber attack if our response is timely and adequate. If we fail at this stage, the problems could be bigger than necessary. An OT SOC differs very much from an IT SOC and has a limited role because of the very specific requirements and unique differences per production process.

So in the early detection zone, a SOC can have a lot of value. In the late detection zone, in the case where the threat actor acts immediately, I think the SOC has a limited role and the site team should take up the incident handling process. But in all cases, with or without a SOC, detection and reaction are a key part of cyber security. It is important to realize that detection and reaction are “sister processes”; even the best detection system is rather useless if there is no proper reaction process that supports it.


There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 42 years of work in this industry; approximately half of that time working in engineering these automation systems, and half of the time implementing their networks and securing them.


Author: Sinclair Koelemij

OTcybersecurity web site

The Purdue Reference Model outdated or up-to-date?


Is the Purdue Reference Model (PRM) outmoded? If I listen to the Industrial Internet of Things (IIoT) community, I would almost believe so. IIoT requires connectivity to the raw data of the field devices and our present architectures don’t offer this in an easy and secure way. So let’s look in a bit more detail at the PRM and the IIoT requirements, to see if they really conflict or can coexist side by side.


I start the discussion with the Purdue Reference Model, developed in the late 80’s, a time when big data was anything above 256 Mb and the Internet was still in its early years, a network primarily used by and for university students. PRM’s main objective was creating a hierarchical model for manufacturing automation, what was called computer integrated manufacturing (CIM) in those days. If we look at the model as it was initially published we note a few things:

  • The model has nothing to do with a network, or security. There are no firewalls, there is no DMZ.
  • There is an operator console at level 1. Maybe a surprise for people who started to work in automation in the last 20 years, but in those days quite normal.
  • Level 4 has been split in a level 4A and a level 4B, to segment functions that directly interact with the automation layers and functions requiring no direct interface.
  • There is no level 0, the Process layer is what it is: a bunch of pipes, pumps, vessels and some clever stuff.

Almost a decade later we had ANSI / ISA-95, which took the work of the Purdue university, extended it with some new ideas, and created an international standard that became very influential for ICS design. It was the ANSI / ISA-95 standard that named the Process level, level 0. But in those days level 0 was still the physical production equipment. The ISA-95 standard says the following about level 2, level 1, and level 0: “Level 0 indicates the process, usually the manufacturing or production process. Level 1 indicates manual sensing, sensors, and actuators used to monitor and manipulate the process. Level 2 indicates the control activities, either manual or automated, that keeps the process stable or under control.” So level 0 is the physical production equipment, and level 1 includes the sensors and actuators. It was many years later that people started using level 0 for the field equipment and their networks, all new developments with the rise of Foundation Fieldbus, Profibus, and the HART protocol, but never part of the standard.

It was probably the ISA 99 standard that introduced a new layer between level 4 and level 3, the DMZ. It was the vendor community that started giving it a level name, level 3.5. But level 3.5 has never been a functional level, it was an interface between two trust zones for adding security. Often the way it was implemented contributed little to security, but it was a nice feeling to say we have a demilitarized zone between the corporate network and the control network. So far the history of the Purdue Reference Model and ISA-95 and their contribution to the levels. Now let’s have a look at how a typical industrial control system (ICS) looks without immediately using names for levels.


From a functional viewpoint we can split a traditional ICS architecture into 5 main parts:

  • Production management systems, whose task it is to control the automation functions. This is the typical domain of the DCS and SCADA systems, but compressor control systems (CCS), power management systems (PMS) and the motor control center (MCC) also reside here. Basically everything that takes care of the direct control of the production process. And of course all these systems interact with each other using a wide range of protocols, many of them with security shortcomings;
  • Operations management systems, whose task it is to optimize the production process, but also to monitor and diagnose the process equipment (for example machine monitoring functions such as vibration monitoring), and various management functions such as asset management, accounting and reconciliation functions to monitor the mass balance of process units, and emission monitoring systems to meet environmental regulations;
  • Information systems are the third category; these systems collect raw data and transform it into information to be used for historical trends or to feed other systems. The objective here is to transform the raw data into information and ultimately information into knowledge. The data samples stored are normally one minute snapshots, hourly averages, shift averages, daily averages, etc. Another example of an information system is a custody metering system (CMS) for measuring product transfer for custody purposes.
  • The last domain of the ICS is the Application domain, for example for advanced process control (APC) applications, traditionally performing advisory functions. But over time the complexity of running a production process grew and response to changes in the process became more important, so advisory functions took over tasks from the process operator, directly changing setpoints using setpoint control or controlling valves with direct digital control functions. There are plants today where, if APC fails, production is no longer profitable or can’t reach the required product quality levels.
  • Finally there is the Business domain, generally not part of the ICS. In the business domain we manage, for example, the feedstock and the products to produce. It is the decision domain.
Functional model of a chemical plant or refinery

The production management systems are shown in light blue, the operations management, information management, and application systems in light purple. The model seems to comply with the models of the ANSI / ISA-95 standard. However this model is far from perfect, because operations management functions such as vibration monitoring or corrosion monitoring also have sensors that connect to the physical system, and an asset management system requires information from the field equipment. If we consider a metering system, also part of the information function, it is actually a combination of a physical system and sensors.


And then of course we have all the issues around vendor support obligations. Asset owners expect systems with a very high level of reliability; vendors have to offer this level of reliability in a challenging real-time environment by testing and validating their systems and changes rigorously. Then there is nothing worse than suddenly having to integrate with the unknown, a 3rd party solution. As a result we see systems that have a level 2 function connected to level 3, in principle exposing a critical system more and making the functionality rely on more components, thus reducing reliability.


So there are many conflicts in the model, still it has been used for over 20 years and it still helps to guide us in securing systems today. If we take the same model and add some basic access security controls we get the following:

At a high level this represents the architecture in use today. In smaller systems the firewall between level 2 and 3 might be missing, some asset owners might prefer a DMZ with two firewalls (from different vendors to enforce diversity), and level 4A hosts those functions that interface with the control network; generally different policies are enforced for these systems than required at level 4B. For example asset owners might enforce the use of desktop clients, might limit Internet access for these clients, might restrict email to internal email only, etc. All to reduce the chance of compromising the ICS security.


Despite the fact that the reference model does not support all of the new functions we have in plants today – for example I didn’t discuss wireless, virtualization, and system integration topics at level 2 – the reference model still helps structure the ICS. Is it impossible to add the IIoT function to this model? Let’s have a look at what IIoT requires.


In an IIoT environment we have a three level architecture:

  • The edge tier – The edge tier is for our discussion the most important, this is where the interface is created with the sensors and actuators. Sometimes existing sensors / actuators, but also new sensors and actuators.
  • The platform tier – The platform tier processes and transforms the data and controls the direction of the data flow. So from that point also of importance for the security of the ICS.
  • The enterprise tier – This tier hosts the application and business logic that needs to provide the value. It is external to the ICS, so as long as we have a secure end to end connection between the platform tier and the enterprise tier ICS security needs to trust that the systems and data in this tier are properly secured.

The question is: can we reconcile these two architectures? The answer seems to be what is called the gateway-mediated edge. This is a device that aggregates and controls the data flows from all the sensors and actuators, providing a single point of connection to the enterprise tier. This type of device also performs the necessary translation for communicating with the different protocols used in ICS. The benefits of such a gateway device are the scalability of the solution together with controlling the entry and exit of the data flows from a single controllable point. In this model some of the data processing is kept local, close to the edge, allowing controlled interfaces with different data sources such as Foundation Fieldbus, Profibus, and HART enabled field equipment. Implementing such an interface device doesn’t require changing the Purdue reference model in my opinion, it can make use of the same functional architecture. Additionally, new developments such as APL (Advanced Physical Layer) technology, the technology that will remove the I/O layer as we know it today, will fully align with this concept.
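A minimal sketch of the gateway-mediated edge idea is shown below: field values are read locally (the reader callables stand in for Foundation Fieldbus / Profibus / HART drivers), aggregated, and published northbound over a single controlled path. All names and the publish mechanism are assumptions for illustration; a real gateway would use an authenticated protocol such as OPC UA or MQTT over TLS for the northbound connection.

```python
import json
import time
from typing import Callable, Dict


class EdgeGateway:
    """Sketch of a gateway-mediated edge: poll field data locally, aggregate it,
    and publish a single controlled data flow towards the platform tier."""

    def __init__(self, readers: Dict[str, Callable[[], float]],
                 publish: Callable[[str], None]):
        self.readers = readers    # tag name -> function returning the current value
        self.publish = publish    # the single, controlled northbound data path

    def poll_once(self) -> None:
        snapshot = {"ts": time.time(),
                    "values": {tag: read() for tag, read in self.readers.items()}}
        self.publish(json.dumps(snapshot))   # e.g. an authenticated MQTT/OPC UA session


# Usage with stubbed readers and a print-based publisher (hypothetical tag names)
gw = EdgeGateway({"TI-100": lambda: 87.4, "PI-200": lambda: 5.6}, publish=print)
gw.poll_once()
```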


So I am far from convinced that we need a new model; also today the model doesn’t reflect all the data flows required by today’s functions. The very clear boundaries we used to have, have disappeared with the introduction of new technologies such as virtualization and wireless sensor networks. Centralized process computer systems of the 70s have become decentralized over the last 30 years and are now moving back into centralized data centers where hardware and software have loose ties. Today these data centers are primarily onsite, but over time confidence will grow and larger chunks of the system will move to the cloud.

Despite all these technology changes, the hierarchical structure of the functions hasn’t changed that much. It is the physical location of the functions that changes, undoubtedly demanding a different approach to cyber security. It is the cyber security standards of today that are dated on the day they are released. The PRM was never about security or networking; it is a hierarchical functional model for the relationship between functions, which is as relevant today as it was 30 years ago.

We have a world of functional layers, and between these layers data is exchanged, so we have boundaries to protect. As long as we have bi-directional control over the data flows between the functional layers and keep control over where the data resides, we protect the overall function. If there is something we need to change it is the security standards, not the Purdue reference model. But we need to be careful with the security of these gateways, as the recent OSI PI security issue has taught us.


There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.


Sinclair Koelemij

TRISIS revisited


For this blog I’d like to go back in time, to 2017, the year of the TRISIS attack against a chemical plant in Saudi Arabia. I don’t want to discuss the method of the attack (this has been done excellently in several reports) but want to focus on the potential consequences of the attack, because I noticed that the actual threat is underestimated by many.

The subject of this blog was triggered after reading Joe Weiss’ blog on the US presidential executive order and noticing that some claims were made in the blog that are incorrect in my opinion. After checking what the Dragos report on TRISIS wrote on the subject, and noticing a similar underestimation of the dangers, I decided to write this blog. Let’s start with summing up some statements made in Joe’s blog and the Dragos report that I’d like to challenge.


I start with quoting the part of Joe’s blog that starts with the sentence: “However, there has been little written about the DCS that would also have to be compromised. Compromising the process sensors feeding the SIS and BPCS could have led to the same goal without the complexity of compromising both the BPCS and SIS controller logic and the operator displays.” The colored highlights are mine, to emphasize the part I’d like to discuss.


The sentence seems to suggest (“also have to be compromised”) that the attacker would ultimately also have to attack the BPCS to be effective in an attempt to cause physical damage to the plant. For just tripping the plant by activating a shutdown action the attacker would not need to invest in the complexity of the TRISIS attack; once the attackers had gained access to the control system at the level they did, tens of easier to realize attack scenarios were available if only a shutdown was intended. The assumption that the attacker needs the BPCS and SIS together to cause physical damage is not correct, the SIS can cause physical damage to the plant all by itself. I will explain this later with an example well known to safety engineers: the emergency shutdown of a compressor.


Next I’d like to quote some conclusions in the (excellent) Dragos report on TRISIS. It starts at page 18 with:

Could This Attack Lead to Loss of Life?

Yes. BUT, not easily nor likely directly. Just because a safety system’s security is compromised does not mean it’s safety function is. A system can still fail-safe, and it has performed its function. However, TRISIS has the capability to change the logic on the final control element and thus could reasonably be leveraged to change set points that would be required for keeping the process in a safe condition. TRISIS would likely not directly lead to an unsafe condition but through its modifying of a system could deny the intended safety functionality when it is needed. Dragos has no intelligence to support any such event occurred in the victim environment to compromise safety when it was needed.


The conclusion that the attack could not likely lead to loss of life is in my opinion not correct and shows the same underestimation as made by Joe. As far as I am aware the modified part of the logic has never been published (hopefully someone did analyze it), so the scenario I am going to sketch is just a guess at a potential objective. It is what is called a cyber security hazard; it could have occurred under the right conditions for many production systems, including the one in Saudi Arabia. So let’s start with explaining how shutdown mechanisms in combination with safety instrumented systems (SIS) work, and why some of the cyber security hazards related to SIS can actually lead to significant physical damage and potential loss of life.


A SIS has different functions, like I explained in my earlier blogs. As a slightly simplified summary: there is a preventative protection layer, the Emergency Shutdown System (ESD), and there is a mitigative layer, e.g. the Fire & Gas (F&G) system, detecting fire or gas release and activating actions to extinguish fires and to alert for toxic gases. For our discussion I focus on the ESD function, but interesting scenarios also exist for F&G.

The purpose of the ESD system is to monitor process safety parameters and initiate a shutdown of the process system and/or the utilities if these parameters deviate from normal conditions. A shutdown function is a sequence of actions: opening valves, closing valves, stopping pumps and compressors, routing gases to the flare, etc. These actions need to be done in a certain sequence and within a certain time window; if someone has access to this logic and modifies it, this can have very serious consequences. I would almost say it always has very serious consequences, because the plant contains a huge amount of energy (pressure, temperature, rotational speed, product flow) that needs to be brought to a safe (de-energized) state in a very short amount of time, releasing incredible amounts of energy. If an attacker is capable of tampering with this shutdown process, serious accidents will occur.
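To illustrate why the sequence and timing matter, here is a deliberately simplified sketch of a shutdown sequence as ordered steps with time windows. The tag names, actions and limits are hypothetical and bear no relation to any real SIS application logic.

```python
import time

# Illustrative only: a shutdown sequence is an ordered set of actions, each of
# which must complete within its time window. Tags and limits are hypothetical.
SHUTDOWN_SEQUENCE = [
    ("open hot-bypass recycle valve UV-301", 2.0),
    ("close suction valve UV-101",           5.0),
    ("trip compressor motor K-100",          1.0),
    ("route gas to flare via UV-501",        8.0),
]


def execute_shutdown(execute_step):
    """Run the steps in order and verify each completes within its time window."""
    for action, max_seconds in SHUTDOWN_SEQUENCE:
        started = time.monotonic()
        execute_step(action)                 # command the logic solver outputs
        elapsed = time.monotonic() - started
        if elapsed > max_seconds:
            raise RuntimeError(f"'{action}' took {elapsed:.1f}s, exceeds {max_seconds}s window")


# A tampered sequence (reordered steps, stretched windows, skipped actions) would
# still "run", which is exactly what makes altered logic so dangerous.
execute_shutdown(lambda action: print("executing:", action))
```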


Let’s discuss this scenario in more detail in the context of a centrifugal compressor; most plants have several, so it is always an interesting target for the “OT certified” threat actor. A centrifugal compressor converts kinetic energy of, for example, a gas into pressure, so that a gas flow through pipelines is created, either to transfer a product through the various stages of the production process or perhaps to create energy for opening / closing pneumatically driven valves.

Transient operations, for example the start-up and shutdown of process equipment, always have dangers that need to be addressed. An emergency shutdown, triggered because a condition occurred in the plant that demanded the SIS to transfer the plant to a safe state, is such a transient operation – but in this case unplanned and in principle fully automated, with no process operator to guard the process and correct where needed. The human factor is not considered a very reliable factor in functional safety and is often just too slow. The SIS on the other hand is reliable; the redundancy and the continuous diagnostic checks all warrant a very low probability of failure on demand for SIL 2 and SIL 3 installations. They are designed to perform when needed, no excuses allowed. But this is only so if the program logic is not tampered with; the sequence of actions must be performed as designed and is systematically tested after each change.

Compressors are controlled by compressor control systems (CCS), one of the many sub-systems in an ICS. The main task of a compressor control system is anti-surge control. The surge phenomenon in a centrifugal compressor is a complete breakdown and reversal of the flow through the compressor. A surge causes physical damage to the compressor and pipelines because of the huge forces released when it occurs. Depending on the gas this can also lead to explosions and loss of containment.

An anti-surge controller of the CCS continuously calculates the surge limit (which is dynamic) and controls the compressor operation to stay away from this point of danger. This all works fine during normal operation; however, when an emergency shutdown occurs the basic anti-surge control provided by the CCS has proven to be insufficient to prevent a surge. In order to improve the response and prevent a surge, the process engineer has two design options, called the hot bypass method and the cold bypass method, recycling the gas to allow for a more gradual shutdown. The hot bypass is mostly used because of its closeness to the compressor, which results in a more direct response. Such a hot bypass method requires opening some valves to feed the gas back to the compressor; this action is implemented as a task of the ESD function. The quantity of gas that can be recycled has a limit, so it is not just opening the bypass valve to 100% but opening it by the right amount. Errors in this process or a too slow reaction would easily result in a surge, damaging the compressor, potentially rupturing connected pipes, causing loss of containment, perhaps resulting in fire and explosions, and potentially resulting in casualties and a long production stop with high repair cost.
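The sketch below illustrates the principle only: a grossly simplified surge line and a proportional recycle (hot bypass) valve response. Real anti-surge controllers work with compressor maps, compensated flow and head measurements and carefully tuned dynamics; the formula and coefficients here are invented purely for the example.

```python
def surge_limit_flow(head: float) -> float:
    """Grossly simplified surge line: minimum safe flow as a function of head.
    Real controllers use compressor maps and compensated measurements."""
    return 0.04 * head ** 0.5          # hypothetical coefficient


def recycle_valve_output(flow: float, head: float, margin: float = 1.1) -> float:
    """Open the hot-bypass recycle valve (0..100 %) when flow approaches the surge line."""
    minimum_flow = surge_limit_flow(head) * margin   # control line with a safety margin
    if flow >= minimum_flow:
        return 0.0                                   # enough forward flow, keep valve closed
    shortfall = (minimum_flow - flow) / minimum_flow
    return min(100.0, shortfall * 100.0)             # proportional opening, capped at 100 %


# During an emergency shutdown the forward flow collapses quickly; the ESD logic
# must open the bypass by the right amount at the right moment, not simply to 100 %.
print(recycle_valve_output(flow=2.0, head=4000.0))
print(recycle_valve_output(flow=0.5, head=4000.0))
```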


All of this is under control of the logic solver application, part of the SIS. If the TRISIS attackers had succeeded in loading altered application logic, they would have been capable of causing physical damage to the production installation, damage that could have caused loss of life.

So my conclusion differs a bit: an attack on a SIS can lead to physical damage when the logic is altered, which can result in loss of life. A few changes in the logic and the initiation of the shutdown action would have been enough to accomplish this.


This is just one example of a cyber security hazard in a plant; multiple examples exist showing that the SIS by itself can cause serious incidents. But this blog is not supposed to be a training for certified OT cyber terrorists, so I will leave it at this example, well known to safety engineers.

Proper cyber security requires proper cyber security hazard identification and hazard risk analysis. This has too little focus and is sometimes executed at a level of detail insufficient to identify the real risks in a plant.

I don’t want to criticize the work of others, but I do want to emphasize that OT security is a specialism, not a variation on IT security. ISA published a book, “Security PHA Review”, written by Edward Marszal and Jim McGlone, which addresses the subject of secure safety systems in a, for me, far too simplified manner by basically focusing on an analysis / review of the process safety hazop sheet to identify cyber related hazards.

The process safety hazop doesn’t contain the level of detail required for a proper analysis, nor does the process safety hazop process assume malicious intent. One valve may fail, but multiple valves failing at the same time in a specific sequence is very unlikely and therefore not considered, while these options are fully open to a threat actor with a plan.

Proper risk analysis starts with identifying the systems and sub-systems of an ICS, then identifying cyber security hazards in these systems, identifying which functional deviations can result from these hazards, and then translating how these functional deviations can impact the production system. That is much more than a review of a process hazop sheet on “hackable” safeguards and causes. That type of security risk analysis requires OT security specialists with detailed knowledge of how these systems work, what their functional tasks are, and the imagination and dose of badness to manipulate these systems in a way that is beneficial for an attacker.

Sinclair Koelemij

Interfaced or integrated?

Introduction

Just like the Stuxnet attack in 2010 made the industry aware that targeted cyber-attacks on physical systems are a serious possibility, the TRISIS attack in 2017, targeting a safety system of a chemical plant, caused a considerable rise in awareness of the cyber threat. Many asset owners started to reevaluate their company policies on how to connect BPCS and SIS systems.

That BPCS and SIS need to exchange information is not so much a discussion; all understand that this is required for most installations. But how to do this has become a debate within and between the asset owner and vendor organizations. The discussions have focused on the theme: do we allow an integrated architecture with its advantages, or do we fall back to an interfaced architecture?

I call it a fallback because an interfaced connection between a Basic Process Control System (BPCS) and Safety Instrumented System (SIS) used to be the standard up to approximately 15-20 years ago.

Four main architectures are in use today for exchanging information between BPCS and SIS:

  • Isolated / air-gapped – The two systems are not connected and limit their exchange of data to analog hard-wiring between the I/O cards of the safety controller and the I/O cards of the process controller;
  • Interfaced – The process controller and the safety controller are connected through a serial interface (could also be an Ethernet interface, but most common is serial);
  • Integrated – The BPCS and SIS are connected over a network connection, today the most common architecture;
  • Common – Here the BPCS and SIS are no longer independent from each other and run on a shared platform.

There are variations on the above architectures, but these four categories summarize the main differences well enough for this discussion. Although security risk differs, because of differences in exposure, we cannot say that one architecture is better than another, as business requirements and their associated business / mission risk differ. Take for instance a batch plant that produces multiple different products on the same production equipment; such a plant might require changes to its SIS configuration for every new product cycle, where a continuous process such as a gas treatment plant or a refinery only incidentally requires changes to the safety logic. These constraints do influence decisions on which architecture best meets business requirements. For the chemical, oil & gas, and refining industries the choice is mainly between interfaced and integrated, the topic of this blog.

I start the discussion by introducing some basic requirements for the interaction between BPCS and SIS using the IEC 61511 as reference.

Requirements

The following clauses of the IEC 61511 standard describe the requirements for this separation.

Section 11.2.2:

“Where the SIS is to implement both SIFs and non-SIFs then all the hardware, embedded software and application program that can negatively affect any SIF under normal and fault conditions shall be treated as part of the SIS and comply with the requirements for the highest SIL of the SIFs it can impact.”

Section 11.2.3:

“Where the SIS is to implement SIF of different SIL, then the shared or common hardware and embedded software and application program shall conform to the highest SIL.”

A Safety Instrumented Function (SIF) is the function that takes all necessary actions to bring the production system to a safe state when predefined process conditions are exceeded. A SIF can either be manually activated (a button or key on the console of the process operator) or act when a predefined process limit (trip point) is exceeded, such as for example a high pressure limit. SIL stands for Safety Integrity Level; it is a measure for process safety risk reduction, a level of performance for the SIF. The higher the SIL, the more criteria need to be met to reduce the probability that the SIF is not able to act when demanded by the process conditions. In general a BPCS can’t meet the higher SIL (SIL 2 and above) and a SIS doesn’t provide all the control functions a BPCS has, so we end up with two systems: one system provides the control functions and one system enforces that the production process stays within a safe state; section 11.2.4 of the standard describes this. For a common architecture this is different, because a common architecture has a shared hardware / software platform; therefore such an architecture is generally used for processes with no SIL requirements. The next clauses discuss the separation of the BPCS and SIS functions.

Section 11.2.4:

“If it is intended not to qualify the Basic Process Control System (BPCS) to this standard, then the basic process control system shall be designed to be separate and independent to the extent that the functional integrity of the safety instrumented system is not compromised.”

NOTE 1 Operating information may be exchanged but should not compromise the functional safety of the SIS.

NOTE 2 Devices of the SIS may also be used for functions of the basic process control system if it can be shown that a failure of the basic process control system does not compromise the safety instrumented functions of the safety instrumented system.

So BPCS and SIS should be separate and independent, but it is allowed to exchange operating information (process values, status information, alarms) as long as a failure of the BPCS doesn’t compromise the safety instrumented functions. This requirement allows for instance the use of process data from the SIS, reducing cost in the number of sensors installed. It also allows valves that combine a control and a safety function. And it allows for initiating safety override commands from the BPCS HMI to temporarily override a safety instrumented function to carry out maintenance activities. But functions that under normal operation fully meet the IEC 61511 requirements can be maliciously manipulated when taking cybersecurity into account. An important requirement is that BPCS and SIS should be independent; the following sections discuss this:

Section 11.2.9:

The design of the SIS shall take into consideration all aspects of independence and dependence between the SIS and BPCS, and the SIS and other protection layers.

Section 11.2.10:

A device to perform part of a safety instrumented function shall not be used for basic process control purposes, where a failure of that device results in a failure of the basic process control function which causes a demand on the safety instrumented function, unless an analysis has been carried out to confirm that the overall risk is acceptable.

The diagram below shows the various protection layers that need to be independent. In today’s BPCS the supervisory function and control function are generally combined into a single system, and as such form a single BPCS protection layer.

Figure 1 – Protection layers (Source: Center for Chemical Process Safety – CCPS)

The yellow parts in the diagram form the functional safety part, implemented using what the standard calls Electrical/Electronic/Programmable Electronic Safety-related systems (E/E/PES), or to be more specific systems such as: BPCS (for the control and supervisory functions), SIS ESD (for the preventative emergency shutdown function), and SIS F&G (for the mitigative fire and gas detection function). Apart from these functions, fire and alarm panels are used that interface with the SIS, MIMIC panels also require interfaces, and public address general alarm systems (PAGA) require an interface. Though PAGAs are generally hardwired, fire and alarm panels often use Modbus TCP to connect. These aren’t the only SIS that exist in a plant; boiler management systems (BMS) and high integrity pressure protection systems (HIPPS) are also preventative safety functions implemented in SIS.

Interfaced and integrated architectures

So far the discussion on the components and functions of the architecture; let’s get into more detail to investigate these architectures. There are several reports published describing BPCS <=> SIS architectures. For example the LOGIIC organization published a report in 2018 and ISA published a report in 2017 on the subject. The ISA-TR84.00.09-2017 report ignores architectures with a dedicated isolated network for SIF related traffic. Because the major SIS vendors support and promote this architecture, and in my experience it is also the most commonly used architecture, my examples will follow the architectures defined in the LOGIIC report on SIS. LOGIIC defines the following architectures as examples of integrated and interfaced structures. (Small modifications are mine)

Figure 2 – Interfaced and integrated architecture principles

In the interfaced architecture there are two networks isolated from each other. The BPCS network generally has some path to the corporate network, while the SIS network remains fully isolated. The exchange of data between the two systems makes use of Modbus RTU, where the BPCS is the master and the SIS is the slave. The example shows two serial connections, for example using RS-232C, but a more common alternative is a multi-drop connection using RS-485, where a single serial interface on the BPCS side connects with multiple logic solvers (safety controllers).
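To show how simple and unauthenticated this master/slave exchange is, the sketch below builds a standard Modbus RTU "read holding registers" request, including the CRC-16 the protocol uses for error detection. Note that the CRC only protects against transmission errors; there is no authentication field anywhere in the frame. The slave address and register range are arbitrary example values.

```python
import struct


def crc16_modbus(frame: bytes) -> bytes:
    """Compute the Modbus RTU CRC-16 (poly 0xA001, init 0xFFFF)."""
    crc = 0xFFFF
    for byte in frame:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return struct.pack("<H", crc)  # Modbus transmits the CRC low byte first


def read_holding_registers(slave: int, start: int, count: int) -> bytes:
    """Build a Modbus RTU 'read holding registers' (function 0x03) request frame."""
    pdu = struct.pack(">BBHH", slave, 0x03, start, count)
    return pdu + crc16_modbus(pdu)


# Example: the BPCS (master) polls the logic solver at slave address 2 for 10 registers
request = read_holding_registers(slave=2, start=0, count=10)
print(request.hex(" "))
```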

An integrated architecture has a common network connecting both the BPCS and SIS equipment. Variations exist where separation between a BPCS network segment and SIS network segment is created with a firewall between the two systems (A). Or alternatively a dual-homed OPC server is sometimes used when the communication is making use of the OPC protocol (B). See the diagrams below.

Figure 3 – Some frequently used integrated architectures

Because one of the main targets is the SIS engineering function, architectures C and D were developed. In this case the SIS engineering station can be connected to the SIF network, or alternatively an isolated 3rd network segment is created. Different vendors support different architectures, but the four examples above seem to cover most of the installations. Depending on the protocol used for the communication between BPCS and SIS there might be a need for a publishing service. This is a function mapping the SIS internal addresses into a format the BPCS understands and uses for its communication.

In the C and D architectures an additional security function can be an integrated firewall in the logic solver, restricting access to a configured set of network nodes (servers, stations, controllers) for a defined set of protocols. The above architectures are just some typicals relevant for comparing interfaced and integrated architectures. Actual installations can be more complex because of additional business requirements, such as extending the control and SIF networks to a remote location and the use of instrument asset management systems (IAMS) for centrally managing field equipment.

The architectures above do not show a separation between level 1 and level 2 for the process controller and the logic solver. All controllers / logic solvers are shown as connected to level 2; this is not necessarily correct. Some vendors implement a level 1 / level 2 separation using a micro firewall function for each controller, with a preset series of policies that limit access to a subset of approved protocols and allow for throttling traffic to protect against high volumes of traffic. Other vendors use a dedicated level 1 network segment where they can enforce network traffic limitations, and sometimes it is a mix depending on the technology used. There are also vendor implementations with a dedicated firewall between the level 2 and level 1 network segments. Since I want to discuss the differences independent of a specific vendor implementation, I ignore these differences and consider them all connected as shown in the diagrams.

Let’s now compare the pros and cons of these architectures.

A comparison between interfaced and integrated architectures

For the discussion I assume that the attacker’s objective is to create a functional deviation in the Safety Instrumented Function (SIF), to either deviate from the operational intent / design intent (what I call Loss of Required Performance, for example a modified logic or trip point) or make the SIS fail when it needs to act. Just tripping the plant can be done in many ways; the more dangerous threat is causing physical damage to the production system.

If we want to compare the security of an interfaced and an integrated architecture, the first thing is to look at the possibilities to manipulate the BPCS <=> SIS traffic. For the interfaced architecture this traffic passes over the serial connection, for the integrated architecture the pink highlighted part of the network carries this traffic.

Figure 4 – BPCS <=> SIS traffic

One potential attack scenario is that the attacker attempts to create a sink hole by intercepting all traffic and dropping it. In an integrated architecture relatively simple malware can do this using ARP poisoning as a mechanism to create a Man-In-The-Middle (MITM) attack. The malware hijacks the IP address of one or both of the communicating BPCS / SIS nodes and just drops the messages. This is possible because level 1 and level 2 traffic are normally in the same Ethernet broadcast domain, so actual messages are exchanged using Ethernet MAC addresses. A malware infection on any of the station and server equipment within the broadcast domain can perform this simple attack. Possible consequences of such a simple attack would be that SIF alarms are lost, SIS status messages are lost, SIS process data is lost, and messages to enforce an override will not arrive at the logic solver.

The noticeability of such an attack is relatively high, but this differs depending on the type of traffic and the additional ingenuity added by the attacker. The loss of a SIF alarm would not be detected that quickly, but the loss of process data would most likely be detected immediately. So selectively removing traffic can make the attack less noticeable. The protocol used is also of importance for the success of this attack: a Modbus master (BPCS) would quickly detect that the slave (SIS) does not respond, generate a series of retries and ultimately raise an alarm. Similar detection mechanisms exist when using vendor proprietary protocols. But in those situations where the SIS can initiate traffic independently of the BPCS, a SIF alarm might get lost if no confirmation from the BPCS is required, with the result that the process operator might not receive the new alarm.

More ingenious attacks can make use of the MITM mechanism to capture and replay BPCS <=> SIS traffic. Replaying process data messages would essentially freeze the operator’s information, and replaying override messages might enforce or undo an override command. And of course an attacker can overload an operator this way with false alarms and status change messages, creating the potential for missing an important alarm. The danger of these two attacks is that the attacker doesn’t need to know that much specific system information to be successful.

Enhancing the attack even more by intercepting and modifying traffic or injecting new messages is another option. In that scenario the attacker can attempt to make “intelligent” malicious modifications. There are various ways to reduce the likelihood of a MITM attack, but the risk remains in the absence of secure communications that authenticate, encrypt, confirm reception of a message, and include a timestamp to validate whether the message was sent within a specific time window. If we analyze the protocols used today for communication between BPCS and SIS, most don’t offer these functions.
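As an illustration of what such protection could look like at the application layer, the sketch below wraps a message with a timestamp and an HMAC-SHA256 tag and rejects stale or modified messages on reception. It is a minimal sketch with an assumed pre-shared key and acceptance window; it does not provide encryption or delivery confirmation, and a real implementation would also need key management and sequence numbering.

```python
import hashlib
import hmac
import json
import time

SHARED_KEY = b"replace-with-a-managed-secret"   # hypothetical pre-shared key
MAX_AGE_SECONDS = 2.0                           # illustrative acceptance window


def protect(payload: dict) -> dict:
    """Wrap a BPCS <=> SIS message with a timestamp and an HMAC-SHA256 tag."""
    body = dict(payload, ts=time.time())
    tag = hmac.new(SHARED_KEY, json.dumps(body, sort_keys=True).encode(),
                   hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def verify(message: dict) -> bool:
    """Reject modified, injected, or stale/replayed messages."""
    expected = hmac.new(SHARED_KEY, json.dumps(message["body"], sort_keys=True).encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, message["tag"]):
        return False                                    # tampered or injected
    return abs(time.time() - message["body"]["ts"]) <= MAX_AGE_SECONDS  # delayed/replayed


# Example with a hypothetical override command
msg = protect({"type": "override", "tag_name": "SIF-101", "state": "bypass"})
print(verify(msg))   # True while fresh and untampered
```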

The above MITM scenarios are only possible in an integrated architecture; in an interfaced architecture the communication channel is isolated from the control network by the process controller. An attacker would need physical access to the cabling to perform a similar attack.

Apart from attacking the traffic it is possible to directly attack the logic solver if it has an exposed software vulnerability. In the integrated architecture the logic solver is always exposed to this threat, and only testing, limiting network access to the logic solver, and using software validation techniques can offer protection against it.

In an interfaced architecture the logic solver is no longer exposed to an externally connected network, but this does not offer a guarantee that such an attack can’t happen. The SIS engineering station might get infected with malware and execute an attack. Also in the Stuxnet attack the target system was on an isolated network; this didn’t stop the attackers from penetrating it. But it certainly requires more skills and resources to design such an attack for an interfaced architecture.

In an integrated architecture the SIS engineering station might be exposed. In architecture A above the exposure depends very much on the firewall configuration, but there might be connections required for the antivirus engine, security patches, and in some cases for the publishing function. In architectures C and D the exposure of the SIS engineering station is equivalent to the exposure in an interfaced architecture. However it can still be infected by malware when files are transferred, for example to support the publishing function, or through some supply chain attack. But in general architecture D, where the SIS engineering station is isolated from the SIF traffic, reduces the exposure for a number of possible cyber-attacks. Integrated architectures C and D are considered more secure, but how do they compare with an interfaced architecture?

Two main cyber security hazards seem to make a difference:

a)     Manipulation of BPCS <=> SIS traffic;

b)     Exploitation of a logic solver resident software vulnerability;

To mitigate / reduce the risk of scenario a) the industry needs secure communications between the two systems. Protocols used today such as Modbus TCP, Classic OPC, and most of the proprietary protocols don’t offer this.

To mitigate / reduce the risk of scenario b) we would need mechanisms that prevent or detect the insertion of malicious code. These have partially been implemented, but not completely and not by all vendors as we can learn from the TRISIS incident.

So overall we can say that the exposure in an interfaced environment is smaller than the exposure in an integrated environment. Since exposure directly correlates with likelihood, we can say that the risk in an interfaced architecture is smaller than the risk in an integrated architecture. But security is not only about risk reduction; in the end the cost of securing something must be weighed against the benefits the business gains from selecting an architecture.

I think there is also a price to pay when the industry moves back to interfaced architectures.

Possible disadvantages of a change to interfaced solutions can be:

·       Reduced communication capabilities, e.g. the bandwidth of a serial connection is lower than the bandwidth of an Ethernet connection;

·       The support for Sequence Of Events (SOE) functionality over serial connections would be hindered by not having a single clock source;

·       The cost of the architecture is higher because of the need for additional components and the constraints of cable length restrictions;

·       Sharing instrument readings between SIS and BPCS is limited because of the relatively low speed serial connectivity, which can increase the number of field instruments required;

·       If an Instrument Asset Management System (IAMS) is used, it can no longer benefit from the HART pass-through technology and needs to use the less secure HART multiplexers;

·       The cost for engineering and maintaining a Modbus RTU implementation is higher because of the more complex address mapping;

·       Future developments w.r.t. the Industrial Internet Of Things (IIoT) would become more difficult for safety instrumentation related data. For example the new developments around Advanced Physical Layer (APL) technology are hard to combine with an interfaced architecture.

So generally serial connections were seen as a thing of the past, and some vendors stopped supporting them on their newer equipment. However, the TRISIS attack has put the interfaced architecture very prominently on the table again, specifically in the absence of secure communications between BPCS and SIS.

The cost of moving to an interfaced architecture must be weighed against the lost benefits for the business. How high is the residual risk of an integrated architecture compared to the residual risk of an interfaced architecture? Is moving from integrated to interfaced justified by the difference, and what are the alternatives?

Author: Sinclair Koelemij

Date: May 9, 2020

Cyber security and process safety, how do they converge?

Introduction

There has been a lot of discussion on the relationship between cybersecurity and process safety in Industrial Control Systems (ICS). Articles have been published on the topic of safety / cybersecurity convergence, and on the proposal to add safety to the cybersecurity triad for ICS. The cybersecurity community practicing securing ICS seems to be divided over this subject. This blog approaches the discussion from the process automation practice in combination with the asset integrity management practice as practiced in the chemical, refining, and oil and gas industry. The principles discussed are common for most automated processes, independent of where they are applied. Let’s start the discussion by defining process / functional safety, asset integrity and cybersecurity before we discuss their relationships and dependencies.

Process safety is the discipline that aims at preventing loss of containment, the prevention of unintentional releases of chemicals, energy, or other potentially dangerous materials that can have a serious effect on the production process, people and the environment. The industry has developed several techniques to analyze process safety for a specific production process. Common methods to analyze the process safety risk are Process Hazard Analysis (PHA), Hazard and Operability study (HAZOP), and Layers Of Protection Analysis (LOPA). The area of interest for cybersecurity is what is called Functional Safety, the part that is implemented using programmable electronic devices. Functional safety implements the automated safety instrumented functions (SIF), alarms, permissives and interlocks protecting the production process.

Figure 1 – Protection layers (Source: Center for Chemical Process Safety (CCPS))

The above diagram shows that several automation system components play a role in functional safety: there are two layers implemented by the basic process control system (BPCS), and two layers implemented by the safety instrumented system (SIS), namely the preventative layer implemented by the emergency shutdown function (ESD) and the mitigative layer implemented by the fire and gas system (F&G). The supervisory and basic control layers of the BPCS are generally implemented in the same system, therefore not considered independent and often shown as a single layer. Interlocks and permissives are implemented in the BPCS, where the safety instrumented functions (SIF) are implemented in the SIS (ESD and F&G). Other functional safety functions exist, such as the High Integrity Pressure Protection System (HIPPS) and the Boiler Management System (BMS). For this discussion it is not important to make a distinction between these process safety functions. It is important to understand that the ESD safety function is responsible to act and return the production system to a safe state when the production process enters a hazardous state, where the F&G safety function acts on detection of smoke and gas and will activate mitigation systems depending on the nature, location, and severity of the detected hazard. This includes such actions as initiating warning alarms for personnel, releasing extinguishants, cutting off the process flow, isolating fuel sources, and venting equipment. The BPCS, ESD and F&G form independent layers, so their functions should not rely on each other, but they don’t exist in isolation, extending their ability to prevent or mitigate an incident by engaging with other systems.

Asset integrity is defined as the ability of an asset to perform its required function effectively and efficiently whilst protecting health, safety and the environment and the means of ensuring that the people, systems, processes, and resources that deliver integrity are in place, in use and will perform when required over the whole life-cycle of the asset.

Asset integrity includes elements of process safety. In this context, an asset is a process or facility that is involved in the use, storage, manufacturing, handling or transport of chemicals, but also the equipment comprising such a process or facility. Examples of process control assets include pumps, furnaces, tanks, vessels, piping systems, buildings, but also the BPCS and SIS, among other process automation functions. As soon as a production process is started up, the assets are subject to many different damage and degradation mechanisms depending on the type of asset. For electronic programmable components these can be hardware failures and software failures, but today also failures maliciously caused by cyber-attacks. From an asset integrity perspective there are two types of failure modes:

  • Loss of required performance;
  • Loss of ability to perform as required.

The required performance of an asset is the successful functioning (of course while in service), achieving its operational / design intent as part of a larger system or process. For example, in the context of a BPCS the control function adheres to the defined operating window, such as sensor ranges, data sampling rates, valve travel rates, etc. The BPCS presents the information correctly to the process operator: measured values accurately represent the actual temperatures, levels, flows, and pressures present in the physical system. In the context of a SIS it means that the set trip points are correct, that the measured values are correct, and that the application logic acts as intended when triggered.
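As a small illustration of the concept, required performance can be thought of as the running configuration matching the design intent. The sketch below compares a few running parameters against a design-intent baseline and reports the deviations; the tag names and values are hypothetical examples invented for this illustration.

```python
# Hypothetical design-intent baseline for a few automation parameters
design_intent = {
    "TI-4711.range_high_degC": 450.0,
    "FIC-201.sample_rate_ms": 250,
    "SIF-101.trip_point_bar": 12.5,
}


def check_required_performance(running_config: dict) -> list:
    """Return the parameters whose running value deviates from the design intent."""
    deviations = []
    for parameter, intended in design_intent.items():
        actual = running_config.get(parameter)
        if actual != intended:
            deviations.append((parameter, intended, actual))
    return deviations


# Example: a trip point that was (maliciously or accidentally) changed
print(check_required_performance({
    "TI-4711.range_high_degC": 450.0,
    "FIC-201.sample_rate_ms": 250,
    "SIF-101.trip_point_bar": 18.0,
}))
```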

Loss of ability is not being able to perform at all. An example for a BPCS is loss of view or loss of control: the ability itself fails, whereas with loss of required performance the ability is present but doesn’t execute correctly. Loss of ability is very much related to availability, and loss of required performance to integrity. Nevertheless, I prefer to use the terminology of the asset integrity discipline because it more clearly defines what is missing.

The simplest definition of cybersecurity is that it is the practice of protecting computers, servers, mobile devices, electronic devices, networks, and data from malicious attacks. Typical malicious cyber attacks are gaining unauthorized access to systems and distributing malicious software.

In the context of ICS we often talk about Operational Technology (OT), which is defined as the hardware and software dedicated to detecting and/or causing changes in a production process through monitoring and/or control of physical devices such as sensors, valves, pumps, etc. This is a somewhat limited definition, because process automation systems contain other functions, for example dedicated safety, diagnostic and analysis functions.

The term OT was introduced by the IT community to differentiate between the cybersecurity discipline that protects these OT systems and the one that protects the IT (corporate network) systems. There are various differences between IT and OT systems that justified creating this new category, though there is also significant overlap that frequently confuses the discussion. In this document the OT systems are considered to be the process automation systems: the various embedded devices such as process and safety controllers, network components, computer servers running applications, and the stations for operators and engineers to interface with these process automation functions.

The IT security discipline defined a triad of confidentiality, integrity and availability (CIA) as a model highlighting the three core security objectives of an information system. For OT systems ISA extended these objectives by introducing seven foundational requirement categories (Identification and authentication control, Use control, System integrity, Data confidentiality, Restricted data flow, Timely response to events, Resource availability) to group the security requirements.
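For readers who prefer to see this as data: in IEC 62443 a target security level can be expressed per foundational requirement, effectively a vector of seven values for a zone. The sketch below shows such a structure; the zone and the chosen levels are invented for illustration.

```python
# The seven IEC 62443 foundational requirement (FR) categories, and a
# hypothetical target security level (SL-T) vector for a process control zone.
# The zone and the chosen levels are invented for illustration.
FRS = [
    "FR1 Identification and authentication control",
    "FR2 Use control",
    "FR3 System integrity",
    "FR4 Data confidentiality",
    "FR5 Restricted data flow",
    "FR6 Timely response to events",
    "FR7 Resource availability",
]

sl_target_control_zone = dict(zip(FRS, [2, 2, 3, 1, 2, 2, 3]))

for fr, sl in sl_target_control_zone.items():
    print(f"{fr:<50} SL-T = {sl}")
```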

Now that we have defined the three main disciplines, I would like to discuss whether we need to extend the triad with a fourth element: safety.

Is safety a cybersecurity objective?

Based upon the above introduction to the three disciplines, and taking asset integrity as the leading discipline for plant maintenance organizations, we can define three cybersecurity objectives for process automation systems (summarised in the sketch after this list):

  • Protection against loss of required performance;
  • Protection against loss of ability to perform as required;
  • Protection against loss of confidentiality.
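The sketch below summarises this mapping: each objective, the asset integrity failure mode it protects against, and the closest term from the classic triad, following the reasoning in the previous paragraphs.

```python
# Summary of the argument above: the three OT cybersecurity objectives, the
# asset integrity failure mode each protects against, and the closest CIA term.
OT_OBJECTIVES = [
    ("Protection against loss of required performance",
     "asset performs, but not as required", "integrity"),
    ("Protection against loss of ability to perform",
     "asset does not perform at all", "availability"),
    ("Protection against loss of confidentiality",
     "information about the asset is disclosed", "confidentiality"),
]

for objective, failure_mode, cia in OT_OBJECTIVES:
    print(f"{objective:<50} | {failure_mode:<40} | {cia}")
```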

If we establish these three objectives, we have also established functional safety. Not necessarily process safety, because process safety also depends on non-functional safety elements. But these non-functional safety elements are not vulnerable to cyber attacks, other than potentially being revealed through loss of confidentiality. Based upon all published information on the TRISIS attack against a Triconex safety system, I believe all elements of this attack are covered by these three objectives. The loss of confidentiality can be linked to the 2014 attack that is suggested to be the source of the disclosure of the application logic. The aim of the attack was most likely to cause a loss of required performance by modifying the application logic that is part of the SIF. The loss of ability, the crash of the safety controller, was not the objective but an “unfortunate” consequence of an error in the malicious software.

Functional safety is established by the BPCS and SIS functionality; the combination of interlocks, permissives, and safety instrumented functions contributes to overall process safety. Cybersecurity contributes indirectly to functional safety by maintaining the above three objectives. Loss of required performance and loss of ability would have a direct consequence for process safety; loss of confidentiality might over time lead to the exposure of a vulnerability or contribute to the design of an attack. Required performance overlaps with what a process engineer would call operability, and operability also includes safety, so from this angle too nothing is missing in the three objectives.

Based upon the above reasoning I don’t see a justification for adding safety to the triad. The issue is rather that the availability and integrity objectives of the triad should be aligned with the terminology used by the asset integrity discipline to include safety, which would make OT cybersecurity different from IT cybersecurity.

Author: Sinclair Koelemij

Date: April 2020