Criticizing a standard is no easy matter, once a group moves with a common idea the individual is supposed to stay in line. But the Dutch are direct, some might call it blunt, others rude. And I am Dutch and have a strong feeling the IEC 62443 standard is promising more than it can deliver, so want to address this in today’s blog. Not to question the value of the standard, neither to criticize the people contributing to it often in their private time, just to clarify where the standard has great value and where flaws are appearing.
This is a story about risk, and since the many wonders of OT security risk have my special attention, it is time to clarify my position on the standard in more detail.
The Dutch have a phrase “De knuppel in het hoenderhok gooien”, literally translated it means “Throwing the bat into the chicken shed”. The proper English translation seems to be to “Throw the cat among the pigeons”. Both expressions don’t align very well with my nature, agility and the purpose of this blog. So I was looking for a better phrase and decided to use the Russian phrase “пустить козла в огород” (let a goat into the garden). It seems to be a friendly action from the perspective of both the goat and the threat actor, so let me be the guy that lets the goat into the garden.
As always I like to learn, better understand the rational behind choices made, I don’t write this blog to confront those waiting in their trenches to fight for an unshakable belief in the blessings of a particular standard. Standards have strong points and weaker points.
I am in favor of standards, not too many standards if they merely overlap others, and I see their value when we make our initial steps to guide us in protecting our ICS targets.
I also see their value for vendors when they need to show asset owners that their products have been evaluated for meeting an established set of security requirements, to show that their development teams create new products with security processes embedded in their product development process, and to show that we operate our services with a focus on security. The ISA 99 team has really achieved a fantastic result and contributed to a more secure ICS. ISASecure is an excellent example of this.
But I also see that standards are sometimes misused by people to hide behind, for example if a standard doesn’t address a specific topic. “If the standard doesn’t address it, it isn’t important we can ignore it”. I see that standards are sometimes referred to in procurement documents to demand a certain level of protection. However standards are not written as procurement documents, they allow for many shades of gray and above all their development and refresh cycle struggles with the speed of change of technology in an continuously changing ICS threat landscape. Standards are good, but also have their pitfalls to be aware off. So I remain critical when I apply them.
Europe is a patchwork of standards without many major differences between those standards. Standards seem to bring out the dog in us, by developing new standards without substantially differentiating from existing. Standards sometimes seem to function as the next tree on a dogs walk through life. It is apparently a pleasant activity to develop new standards, though they often developed more into trust zones to make them look different creating hurdles in a global community.
The IEC 62443 / ISA 99 was the first standard that guided us (I know ISO 27001 has older roots, but not aimed at ICS). The standard had a significant influence in our thinking, also on my thinking, vocabulary and became the foundation for many standards since. Thanks to the standard we could make big steps forward.
But I feel the standard also makes promises it can’t meet in my opinion. Specifically the promise that it also addresses the threat posed by advanced threat actors with all the resources they need and operating in multi-discipline teams. The security level 3 and security level 4 type of threat actors of IEC 62443. This aspect I like to address in the blog.
IEC 62443-3-2 was recently released. The key concept of IEC 62443 turns around the definition of security zones and conduits between zones that pass the traffic, the conceptual pipeline that passes the channels (protocols) flowing between assets in the zones.
The standard wants us to determine “zone risk” and uses this risk to assign target security levels to a zone. A similar approach is also used by ANSSI the French standards institute, their standard also includes the risk model on how to estimate the risk.
Such an approach is the logical consequence from a drive to create a list of security requirements and provide a mechanism to select which of the many security requirements are an absolute must to stop a specific threat actor. “Give me a list on what to do” is a very human reaction when we are in a hurry and are happy with what our neighbors do to address a problem. However such an approach does not always take into account that the neighbor might be wrestling with a different problem.
ANSSI doesn’t link their risk result to a threat actor, IEC 62443 does link it to a threat actor in its SL1, SL2, SL 3, and SL4 definitions. ANSSI introduces the threat actor into its risk formula, taking a more capable threat actor into account raises the risk. The risk value determines the security class for the zone and the standard specifies which requirements must be met by the zone.
Within IEC 62443-3-2, the target level is the result of a high level risk assessment and a conversion from risk to a target security level. The IEC 62443-3-3 specifies the security requirements for this target level. Small differences between the two standards not important for my argument, though from a risk analysis perspective the ANSSI method is the theoretical better approach. It doesn’t need the transformation from risk to target level.
Where do I see a flaw in both ANSSI as well as IEC 62443 when addressing cyber security for critical infrastructure. This has to do with the concept of zone risk. Zone risk is in the IEC 62443 a kind of neighborhood risk, the assets live in a good or worse neighborhood. If you happen to live in the bad neighborhood you have to double your door locks, add iron bars before you windows and to also account for situational awareness and incident response you have to buy a big dog.
However zone risk / neighborhood risk doesn’t take the individual asset or threat into account. The protection required for the vault of a bank differs in the neighborhood from the protection required by the grocery that also wants to invite some customers in to do some business. You might say, the bank shouldn’t be in the bad neighborhood but that doesn’t work in an ICS where automation functions often overlap multiple zones.
There are a lot of intrinsic trust relations between components in an ICS that can be misused if we don’t take them into account. That would make security zones either so big that we could account anymore for differences in other characteristics (for example an 24×7 attended environment or an unattended environment) or make the zones so small that we get into trouble protecting the zone perimeter. The ideal solution would be what cyber security experts call the zero-trust-model, each asset is its own zone. This would be an excellent solution, but I don’t see it happen the gap with today’s ICS solutions is just too big, and also here there remain differences that require larger security zones.
Zone risk automatically leads to control based risk, a model where we compare a list with “what-is-good-for-you” controls with a list of implemented controls. The gap between the two lists can be seen as exposing non-protected vulnerabilities that can be exploited. The likelihood that this happens and the consequence severity would result into risk.
Essential for a control based risk model is that we ignore the asset and also don’t consider all threats, just those threats that result from the gap in controls. The concept works when you make your initial steps in life and learn to walk, but when attempting the steeple chase you are flat on your face very soon.
Zone risk is of course ideal for governance because we can make “universal rules”, it is frequently our first reaction to control risk.
The Covid-19 approach in many countries shows the difficulties this model faces. The world has been divided in different zones, sometimes countries, sometimes regions within a country. Governance within that zone has set rules for these regions, and thanks to this we managed to control the attack. So what is the problem?
Now we are on our way back to the “normal” society, while the threat hasn’t disappeared and is still present, we see the cracks in the rule sets. We have to maintain 1.5 meter distance (though rules differ by zone, we have 1 meter zones, and 2 meter zones). We allow planes to fly with all seats occupied, at the same time we deny this for buses and trains. We have restrictions in restaurants, theaters, sport schools, etc.
I am not challenging the many different rules, but just want to indicate that the “asset” perspective differs from the zone perspective and there are times that the zone perspective is very effective to address the initial attack but has issues when we want to continue with our normal lives. Governance today has a difficult job to justify all the differences, while trying to hold on their successful zone concept.
The same I see for OT security, to address our initial security worries the zone approach adopted by IEC 62443 worked fine. The defense team of the asset owners and vendors made good progress and raised the bar for the threat actor team. But the threat actor team also matured and extended their portfolio of devious attack scenarios, and now cracks become apparent in our defense. To repair these cracks I believe we should step up from control based risk to other methods of risk assessment.
The next step on the ladder is asset based risk. In asset based risk we take the asset and conduct threat modeling to identify which threats exist and how we best address them. This results in a list of controls to address the risk. We can compare this list with the controls that are actually implemented and the gap between the two will bring us on the path of estimating risk. Assets in this model are not necessarily physical system components, the better approach is to define assets as ICS functions. But the core of the method is we take the asset/function, analyse what can go wrong and define the required controls to prevent this from happening.
The big problem for governance is that this approach doesn’t lead to a confined set of controls or security requirements. New insights caused by changes in the function or environment of the function may lead to new requirements. But it is an approach that tends to follow technical change. For example the first version of IEC 62443-3-3 ignored the new technologies such as wireless sensors, virtual systems, and IIoT. This technology didn’t exist at the time, but today’s official version is still the version from September 2011. It took approximately 5 years to develop IEC 62443-3-3 and I believe it is an ISA policy to refresh a standard each 5 year. This is a long time in the security world. Ten years ago we had the Stuxnet attack, a lot has happened since than.
In threat based risk we approach it from the threat side, what kind of threats are feasible, what would be the best target for the threat actor team to execute the threat. If we have listed our threats, we can can analyze which controls would be necessary to prevent and detect them if attempted. These controls we can compare with the controls we actually applied in the ICS, and can construct risk from this gap to rank the threats in order of importance.
Following this approach, we truly follow the advice of the Chinese general Sun Tzu when he wrote “If you know the enemy and you know yourself, you don’t have to fear the result of a hundred battles. (Threat based risk) If you know yourself, but not the enemy, you will also be defeated for every victory achieved (Asset based risk). If you don’t know the enemy or yourself, you will succumb in every battle. (Control based risk)” Control based risk neither takes the asset into account (knowing yourself), nor the threat (knowing your enemy).
I don’t know if I have to call the previous part of the blog a lengthy intro, a historical context, or part of my continued story on OT cyber security risk. In the remainder of the blog I like to explain the concept of risk for operational technology, the core of every process automation system, meant to be the core of this blog.
When we talk about risk we make it sound as if only one type of risk exists, we sometimes mix up criticality and risk, and we use terms as high level risk where initial risk better represents what actually is attempted.
Let me try to be more specific starting with a picture I made for this blog. (Don’t worry the next diagrams are bigger).
For me there are three types of risk related to ICS:
- Cyber security risk
- Process risk
- Mission (or business) risk
Ultimately for an asset owner and its management, there is only one risk they are interested in, this is what I call mission risk. It is the risk related to the impact expressed in monetary value, in safety consequence, environmental damage, company image, regulations, and service delivery. Risk that directly threatens the companies existence. See the diagram I shared showing some of the impact categories in my earlier blog on OT cyber security risk.
If we can translate the impact of a potential cyber security incident (a cyber security hazard) into a credible mission risk value we catch the attention of any plant manager. Their world is for a large part risk driven. Management struggles with requests such as “I need a better firewall” (better means generally more expensive) without understanding what this better does in terms of mission risk.
Above model (I will enlarge the diagram) shows risk as a two stage / two step process. Let me explain starting at stage 1.
Part of what follows I already discussed in my ode to multiple honor and glory for Consequence, but I repeat it here partially for clarity and to add some more detail readers requested for in private messages.
The formula for cyber security risk is simple,
Cyber security risk = Threats x Vulnerability x Consequence.
I use the word consequence here, because I reserve the word impact for mission risk.
There is another subtle difference I like to make, I consider IT cyber security risk as a risk where consequence is primarily influenced by data. We are worried that data confidentiality is lost, data integrity is affected, or data is no longer available. The well known CIA triad.
Security in the context of automation adds another factor, that is the actions of the automated function. The sequence of actions certainly becomes of importance, time starts playing a role, it is becoming a game of action and interaction. Another factor, a very dominant factor enters the game. Therefore consequence needs to be redefined, to do this we can group consequence into three “new” categories:
- Loss of required performance (LRP) – Defined as “The functions, do not meet operational / design intent while in service”. Examples are program logic has changed, ranges were modified, valve travel rates were modified, calibrations are off, etc.
- Loss of Ability (LoA) – Defined as “The function stopped providing its intended operation” Examples are loss of view, loss of control, loss of ability to communicate, loss of functional safety, etc.
- Loss of confidentiality (LoC) – Defined as “Information or data in the system was exposed that should have been kept confidential.” Examples are loss of intellectual property, loss of access credential confidentiality, loss of privacy data, loss of production data.
The “new” is quoted because the first two categories are well known categories in asset integrity management used on a daily base in plants. The data part (LoC) is simple, very much alike the confidentiality as defined in IT risk estimates. But for getting a more discriminating level of risk we need to split it up in additional sub-categories, sub-categories I call failure modes. For confidentiality there are four failure modes:
- Loss of intellectual property – Examples of this can be a specific recipe, a particular way of automating the production process that needs to remain secret;
- Loss of production data confidentiality – This might be data that can reveal production cost / efficiency, the availability to deliver a service;
- Loss of access credential confidentiality – This is data that would allow a threat actor to raise his/her access privileges for a specific function;
- Loss of privacy data confidentiality – This type of data is not commonly found in ICS, but there are exceptions where privacy data and with this the GDPR regulations are very important.
Failure modes differ for each function, there are functions for which none of above failure modes play a role and there are functions for which all play a role. Now let’s go to the next two categories Loss of Required Performance and Loss of Ability. These two are very specific for process automation systems like ICS. Loss of required performance has 6 failure modes assigned, in random sequence:
- Integrity Operating Window (IOW) deviations – This applies to functional deviations that allow that the automation functions comes outside its design limitations which could cause immediate physical damage. Examples are modifying a level measurement range, potentially allowing to overfill a tank, or modifying a temperature range potentially damaging the coils in an industrial furnace. IOW is very important in a plant, so many standards have been addressing them;
- Control Performance (CP) deviations – This failure mode is directly linked to the control dynamics, so where in the previous category the attacker modified engineering data, in this failure mode the attacker modifies operator data. For example raising a flow, stopping a compressor, shutting down a steel mill furnace. There are many situations in a plant where this can lead to dangerous situations, examples are when filling a tank with a non-conductive fluids, there are conditions that a static electricity buildup can create a spark. If there would be an explosive vapor in this tank this can lead to major accidents with multiple casualties. Safety incidents have been reported where this happened accidentally. One way to prevent this is to restrict the flow used to fill the tank. If an attacker would have access to this flow control he might maliciously maximize it to cause such a spark and explosion;
- Functional safety configuration / application deviations (FS) – If safety logic or configuration was affected by an attack many things can happen, in most cases bad things because the SIS primary task is to prevent bad things from happening. Loss of functional safety and tampering with the configuration / application is considered a single category. This because there is no such thing as a bit of safety, safety is an all or nothing failure mode;
- Monitoring and diagnostic function deviations (MD) – ICS contains many monitoring and diagnostic functions, for example vibration monitoring functions, condition protection functions, corrosion monitoring, analyzers, gas chromatography, emission monitoring systems, visual monitoring functions such as cameras monitoring the flare, and many more. Impacting these functions can be just as bad, as an example if an attacker were to modify settings in an emission monitoring solution the measured values might be reported as higher than they actually are. If this would become publicly known this would cause a stream of negative publicity and impact the company public image. Modifying vibration monitoring settings might prevent the detection of mechanical wear leading to a larger physical damage than would be the case if detected in time;
- Monitoring data or alarm integrity deviation (MDAI) – This type of failure mode causes that data is no longer accurately represented, or alarms don’t occur when they should (Either too early, too late, maybe too many overflowing a process operator). The ICS function does in all cases its job as designed, however it is fed with wrong data and as a consequence might act in the wrong manner or too early or too late;
- Loss of deterministic / real-time behavior behavior (RB) – This type of failure mode is very specific for real-time OT components. As I explained in my blog on real-time systems tasks are required to finish in time. if a task is not ready when its time slot ended it is bad luck, next time the task starts it starts at the beginning creating the possibility of starvation, where specific tasks never complete.
All above six failure modes have in common that the ICS functions, only it doesn’t function as intended. Either the design has been tampered with or its operational response doesn’t do the job as intended.
The next main category of consequences is Loss of Ability to perform, in this category the function is no longer available. But also here we have multiple failure modes (6) for grouping consequences.
- Loss of controllability and / or intervention (LoCI) – When this occurs we lost the ability to move a system what is called its configuration space. We can’t adjust the flow anymore, pressure controls are lost, this might occur because the actuators fail, or because an operator has no longer access to the functions that allow him to do so.
- Loss of observability (LoO) – This is the situation when we no longer can observe the actual physical state of the production process. This can be because we no longer measure this, but can also occur when we freeze the measured data by for example continuously replaying the same measurement message over and over again.
- Loss of alarm or alerting function (LoAA) – This is the situation where there are no alarms or alerts warning the process operators of anomalies that they should respond to. Not necessarily restricted to BPCS alarms, it can also be alarms on a fire and alarm panel or a PAGA that doesn’t function.
- Loss of coordination and / or collaboration (LoCC) – In an automation system, many functions collaborate or coordinate functions together. Coordination / collaboration is much more than communication, the activities need to be “synchronized” need to be aware of what the other is doing or things might go wrong with a potential high impact.
- Loss of data communication (LoDC) – When we can’t communicate we can’t exchange commands or data. This doesn’t necessarily mean we can’t do anything, many functions can act isolated, sometimes there are hardwired connections to act.
- Loss of reporting functions (historical data / trending) – This failure mode prevents us to report on ICS performance or observe slowly changing trends (creep) by blocking trending capabilities. Loss of reporting can be serious under certain circumstances, specifically for FDA and regulatory compliance.
This were the sixteen failure modes / sub-categories we can use to group consequences, functional deviations caused by a cyber attack. Not every function has all failure modes, and the failure modes have different severity scores for each function. But in general we can group hundreds of functional deviations of an ICS with its many functions (BPCS, SIS, CCS, IAMS, CDP, DAHS, PMS, MCC, MMS, ….) into these 16 failure modes and assign a severity value to each which we use for estimating cyber security risk for the various cyber security hazards.
I will not discuss all blocks of the diagram, that would be a book and no blog, but the criticality block is of importance. Why is criticality (the importance of the function) important?
For OT security criticality is a twin, one is criticality as the importance for the plant’s mission and the other is the importance for the threat actor. They are different things, for example a function such as Instrument Asset Management System (IAMS), is for the plant’s mission not so important, but for the threat actor’s mission it might be a very interesting object because it provides access to all field equipment.
The same differences we see with various technical and security management functions that if used by a threat actor provide him / her with many benefits. If we want to compare cyber security risk from different functions we can not just base this on consequence severity, we need to take function criticality into account. Another important aspect of criticality is that it is not a constant, criticality can increase over time. This provides us with information for our recovery time objectives, recovery point objectives, and maximum tolerable downtime. All critical security design parameters!
The zone and conduit model plays an important role in the Threat x Vulnerability, the likelihood, side of the cyber security risk equation. A very important factor here is exposure, which has two parts: dynamic exposure and static exposure. Connectivity is one of the three parameters for “measuring” static exposure.
The final block of the stage 1 side I like to discuss briefly, perhaps in a later blog in more detail, are the Functional Indicators of Compromise (FIoC).
In cyber security we are used to talk about IoC, general in the form that the threat actor makes some modifications in the registry or transfers some executable into a specific location. The research companies evaluating the cyber attacks, document these IoC in their reports.
But when we discuss functional deviations such as in OT cyber security as result of a cyber attack we have Indicators of Compromise here too, FIoC. This because often data is stored in multiple locations in the system. For example when we configure a controller we set various parameters for the field equipment connected. This data can be downloaded to the field equipment or separately managed by an IAMS. But it is not necessarily so that if changed in one place the other places are tracking it and automatically update.
This offers us one way of detecting inconsistencies and alert upon them. For multiple reasons an important function, but not always done. Similarly many settings in an ICS are always within a certain range, if they vary it are small variations. A cyber attack generally initiates bigger changes we can identify.
Not for all, but for many consequences / failure modes there is a functional indicator of compromise (FIoC) available to warn that something unusual occurred and we should check it.
Let’s slowly move to the end of the blog, I already surpassed my reading time long ago, let’s look at stage 2 the path from cyber security risk to mission risk.
Stage 1 provided us two results, a likelihood score and a set of functional deviations, that we need to bring over to stage 2. The identified functional deviations can cause deviations in the production process, perhaps a too high flow in my tank example, perhaps the possibility to close one or more valves simultaneously.
In order to evaluate their consequence for process risk we need to inspect which functional deviations have unforeseen potential process deviations. Process safety analysis thoroughly goes over the various failure modes that can occur in a plant and analyses how to prevent these to cause a personal safety incident or damage to the equipment.
But what process safety analysis does not do is looking for foul play. A PHA sheet might conclude that the failure of a specific valve is no issue because there is an open connection available, but what if that open connection would be blocked at the same time the threat actor manipulates initiates the failure of the valve. Two or more actions can occur simultaneously in an attack.
That is the activity in the second stage of the risk analysis, needed to arrive at mission risk. First we need to identify the process consequence that are of importance before we can translate this into business / mission impact. The likelihood of such an incident is the same as the likelihood of the cyber security incident that caused all the trouble.
Though the method of analyzing the process safety sheet and the checklist to process is probably just as interesting as many of the stage 1 processes a blog needs to end. So perhaps a later blog, or a book when I am retired.
What did I try to say in this blog? Well first of all I like to repeat, standards are important and the IEC 62443 is the most influential of all and contributes very much to the security of ICS. Doesn’t take away they are not above criticism, it is after all a men made product.
The structure of standards such as IEC 62443 or the French ANSSI standard leads to control based risk because of their zone based strategy. Nothing wrong with this, but in my opinion not sufficient to protect more critical systems that are attacked by more capable threat actors. Like the SL 3 and SL 4 type of threat actors. There I see gaps in the method.
I explained we have three forms of risk, cyber security risk, process risk, and mission risk. They differ because they need different inputs. We cannot just jump from cyber security risk to mission risk forgetting about how the functional deviations in the ICS caused by the cyber attack impact the production process, we need to analyse this step.
I have shown and discussed a two stage method to come from a cyber security hazard to a process failure hazard to mission risk. I didn’t explain the method on how I identified all these cyber security hazards, nor the method of risk analysis analysis to estimate likelihood. This can still be either threat based or asset based, maybe another blog to explain the details.
But at minimum the asset specific elements and the threats are accounted for in the method described, this is missing in the zone risk approach. Functional deviations (consequences) play no role in zone risk estimates.
Does this mean the IEC 62443 isn’t good enough, no absolutely not. All I say in this blog it is not complete enough, security cannot be achieved with just a list of requirements. This plays no role when the objective is to keep the cyber criminals (SL 2) out, it plays a role when attacks become targeted and ICS specific. It is my opinion that for SL 3 and SL 4 we need a detailed asset or threat based risk analysis that potentially adds new requirements (and in my hands-on experience it does) to the requirements identified by the zone risk analysis.
So what would I like to see different?
- IEC 62443 starts with an inventory, this seems for me the right first step.
- Then IEC 62443 starts with a high level risk assessment, this I believe is wrong. Apart from the name (should have been initial risk assessment) I think there is no information available to determine likelihood, so it becomes primarily an impact driven assessment. Than the proper choice in my opinion would have been to conduct a criticality assessment, for criticality I don’t need likelihood. Apart from this, criticality in a plant has a time aspect because of upstream / down stream dependencies. This provides us with recovery time objectives (for the function), recovery point objectives, and maximum tolerable downtime when multiple functions are impacted. This part is missing in the IEC 62443 standard, while an important design parameter. Another reason is of course that when we do asset based or threat based risk and start comparing risk between multiple functions we need to take the function’s criticality into account.
- Then IEC 62443 creates a zone and conduit diagram, no objections here. To determine risk we need static exposure, we need to know which assets go where. So good step.
- Then IEC 62443 does a detailed risk assessment, also the right moment to do so. Hopefully also in IEC 62443 will see the importance of asset or threat based risk assessment. The standard doesn’t discuss this, it might as well be a control based risk assessment. But because there is no criticality assessment I concluded it must be a control based assessment not looking at risk at function level.
I hope I didn’t step on too many toes, only wanted to give a goat a great day. It is just my opinion as private person working as a consultant during business hours. Who on a daily base, during working hours, needs to secure systems, many of them part of national critical infrastructure.
Above is what I am missing, a criticality assessment, and a risk assessment that meets the SL3 / SL4 threat.
There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity. this blog is written based on my personal opinion and knowledge build up over 42 years of work in this industry. Approximately half of the time working in engineering these automation systems, and half of the time implementing their networks and securing them.
Author: Sinclair Koelemij
OTcybersecurity web site