Dare for More, featuring the ICS kill-chain and a steel mill

When selecting a topic for a blog I generally pick one that entails some controversy. This time I select a topic that is generally accepted and attempt to challenge it to create the controversy. I believe very much that OT security is about learning, and that learning requires us to continuously challenge the knowledge we acquire. So this blog is not about trying to know better, or trying to refute a well-founded theorem, but simply about applying some knowledge and reason to investigate the feasibility of other options. In this case: is a one-stage ICS attack feasible?

Five years ago Michael Assante and Robert Lee wrote an excellent paper called “The Industrial Control System Cyber Kill Chain”. I think the readers of this blog should carefully study this document; the paper provides a very good structure for how threat actors approach attacks on industrial control systems (ICS). The core message of the paper is that attacks on ICS, with the objective of damaging production equipment, are two-stage attacks.

The authors describe a first stage where the threat actor team collects information by penetrating the ICS and establishing a persistent connection; the paper calls this the “Management and Enablement Phase”. This phase is followed by the “Sustainment, Entrenchment, Development, and Execution” phase, in which the threat actor captures additional information and prepares for the attack on the ICS.

Both phases are part of the first attack stage. The information collected in stage 1 is used to prepare the second stage in the kill chain. The second stage conducts the attack on the process installation to cause a specific failure mode (the damage to the production system). Typically there is considerable time between the first stage and the second stage, time used for the development of the second stage.

I assume that the rationale of the authors for suggesting two stages is that each plant is unique, and as such each ICS configuration is unique, so to damage the process installation we need to know its unique configuration in detail before we can strike.

This seems like fair reasoning, and the history of ICS attacks supports the theory of two-stage attacks.

  • Stuxnet required the threat actor team to collect information on which variable speed drives were used, in order to cause the speed variations that led to excessive wear in the centrifuge bearings. The team also needed to know which process tags were configured in the Siemens control system, to hide these speed variations from the process operator, and it required a copy of the PLC program code so it could be modified to execute the speed variations. The form of Stuxnet’s stage 1 probably differs very much from the general ICS kill chain stage 1 scenario (probably in part old-fashioned, non-electronic espionage), but this isn’t important. There is a stage 1 collecting information, and a stage 2 that “finishes” the job. So at a high level it matches the ICS kill chain pattern described in the paper.
  • The two attacks on the Ukraine power grid in 2015 and 2016 follow a similar pattern. It is very clear from the efficiency of the attack, the developed tools, and the very specific actions carried out, that the threat actor team had detailed knowledge of what equipment was installed and which process tags were related to the breaker switches. So also in this case the threat actor team executed a stage 1 collecting information; based on this information it planned the stage 2 attack, and in stage 2 the actual objective was realized.
  • The third example, TRISIS / TRITON, follows the same pattern. Apparently the objective of this attack was to modify the safety integrity function in a Tristation safety controller. Details here are incomplete when reading / listening to the various sources. An interesting story, though perhaps at times a bit too dramatic for those of us who regularly spend their days at chemical plants, is for example the Darknet Diaries episode on the TRISIS attack, featuring among others Robert Lee and my former colleague Marina Krotofil. Was the attack directed against a Fire & Gas Tristation, or was the emergency shutdown function the target? This is not very clear in the story; because of the very different set of consequences for the production process, it is an important detail when it comes to understanding the objective of the attack. Nevertheless, also in the TRISIS case we saw a clear two-stage approach: a first attack collecting the process information and the Tristation configuration and logic, and a second attack attempting to install the modified logic and initiate (or wait for) something to happen. In my TRISIS revisited blog I explained a possible scenario to cause considerable damage, and showed that for this to happen we don’t need a simultaneous attack on the control system.

What am I challenging in this blog? For me the million-dollar question is: can an attacker cause physical equipment damage in a one-stage attack, or are ICS attacks by default two-stage attacks? Is a one-stage attack possible, where all the required knowledge is readily available in the public space and known by subject matter experts? And if it is a possibility, what are the characteristics that make such a scenario possible?

Personally I believe it is possible, and I might even have seen an example of such an attack, which I will discuss in more detail. I believe that an attacker with some basic hands-on practice working with a control system can do it in a single stage.

I am not looking for scenarios that cause a loss of production, as for example a ransomware attack would. This is too “easy”; I am looking for targets that, when attacked in the first stage, would suffer considerable physical damage.

I always look into these scenarios to better understand how such an attack would work and what we can learn from it to prevent such a scenario, or at least make it more difficult. In this I like to follow the advice of the Chinese general Sun Tzu: learn to know the opponent, as well as our own shortcomings.

Sun Tzu wrote, over 2500 years ago, the following important lessons, which are also true in OT security:

  • “If you know the enemy and you know yourself, you don’t have to fear the result of a hundred battles.”
  • “If you know yourself, but not the enemy, you will also be defeated for every victory achieved.”
  • “If you don’t know the enemy or yourself, you will succumb in every battle.”

So let’s investigate some scenarios that can lead to physical damage. In order to damage process equipment we have to focus on consequences with failure modes belonging to the categories Loss of Required Performance and Loss of Ability (see my earlier blogs for an explanation), since only failure modes in those categories can cause physical damage in a one-stage attack.

Loss of confidentiality is basically the outcome of the first stage in a two-stage attack: access credentials, intellectual property (e.g. design information), or information in general might all assist in preparing stage 2. But I am looking for a single-stage attack, so I need to rely on knowledge available in the public domain, without the need to collect it from the ICS: gain access to the ICS and cause the damage on the first entry.

From a functional aspect, the type of function to attack for causing direct physical damage would be: BPCS (Basic Process Control System), SIS (Safety Instrumented System), CCS (Compressor Control System), MCC (Motor Control Center), BMS (Boiler Management System), PMS (Power Management System), APC (Advanced Process Control), CPS (Condition Protection System), and IAMS (Instrument Asset Management System).

So plenty of choice for the threat actor team, depending on the objective. It is not difficult for a threat actor to find out which systems are used at a specific site and which vendor provided these functions. This is because if we know the product made, we know which functions are mandatory to have. If we know the business a bit, we can find many detailed success stories on the Internet, revealing a lot of interesting detail for the threat actor team. Additionally, threat actors can have access to a similar system for testing their attack strategies; for example, the TRISIS threat actor team had access to a Triconex safety system for its attack.

To actually make the damage happen we also need to consider the failure modes of the process equipment. Some process equipment would be more vulnerable than others. Interesting candidates would be:

  • Extruders such as used in the thermoplastic industry. An extruder melts raw plastic, adds additives to it, and passes it through a die to get the final product. This is a sensitive process where too high a heat would change the product structure, or an uneven flow might cause product quality issues. But the worst condition is no flow: if this happens, the melted plastic solidifies within the extruder, and considerable time is required to recover. However, this is a failure mode that happens once in a while under normal circumstances as well, so it is an annoyance, but process operations knows how to handle it. Hardly an interesting target for a cyber attack attempting to cause significant damage.
  • How about a kiln in a cement process? If a kiln doesn’t rotate, the construction collapses under its own weight. So stopping a kiln could be a potential target; however, it is not that complex to create a safeguard, a manual override that bypasses the automation system to keep the kiln rotating. So also not an interesting target to pursue for this blog.
  • How about a chemical process where runaway reactions are a potential failure mode? A runaway reaction is a thermally unstable reaction that accelerates with rising temperature and can ultimately lead to explosions and fires. A relatively recent example of such an accident is the Shell Moerdijk accident in 2014. Causing a runaway reaction could certainly be a scenario, because these can be induced relatively easily. The threat actor team wouldn’t need that much specific information to carry out the attack; they could perhaps stop the agitator in the reactor vessel, or stop a cooling process, to trigger the runaway reaction. Once the threat actor has gained access to the control system, this action is not that complex for someone familiar with control systems. So for me a potential single-stage attack target.
  • Another example that came to mind was the attack on the German steel mill in 2014. According to the BSI report the attacker(s) gained access to the office network through phishing emails; once the threat actor had access to the office network, they worked their way into the control systems. How this happened is not documented, but it could have been through some keylogger software. Once the threat actor had access to the control system, they caused what was called “an uncontrolled shutdown” by shutting down multiple control system components. This looks like a one-stage attack; let’s have a closer look.

LESSON 1: Never allow access from the office domain into the control domain without implementing TWO-FACTOR AUTHENTICATION. Not even for your own personnel. Not implementing two-factor authentication is denying the risk of phishing attacks and ignoring defense in depth. When doing so we are taking a considerable risk. See also my blog on remote access, RULE 3.

For the steel mill attack no details are known on what happened, other than that the threat actor, according to the BSI report, shut down multiple control system components and caused massive damage. From here on I am just speculating on what might have happened, but a possible scenario could have been shutting down the cooling system and shutting down BPCS controllers so the operator can’t respond that quickly. The cooling system is one of the most critical parts of a blast furnace. Because the BSI report mentions both “Anlagen” (installations) and “Steuerungskomponenten” (control components), it might have been the above suggestion: first shutting down the motors of the water pumps and then shutting down the controllers to prevent a quick restart of the pumps.

The picture on the left (courtesy of HSE) shows a blast furnace. A blast furnace is charged at the top with iron ore, coke, sinter, and limestone. The charge materials gradually work their way down, reacting chemically and thermally on their path down. This process takes about 6-8 hours depending on the size of the furnace. The furnace is heated by injecting hot air (approx. 1200 °C) mixed with pulverized coal, and sometimes methane gas, through nozzles called tuyeres in the context of a blast furnace.

This results in a furnace temperature of over 2000 °C. To protect the blast furnace shell against these high temperatures it is critical to cool the refractory lining (see picture). This cooling is done by pumping water through the lining. Cooling systems are typically fully redundant, with all equipment doubled.

There would be multiple electrical pump systems and turbine (steam) driven pump systems; if one system fails, the other takes over immediately, one being steam driven, the other electrically driven. So all looks fine from a cooling system failure perspective. The problem is that when we consider OT cyber security we also have to consider malicious failures / forced shutdowns. In those cases redundancy doesn’t work: multiple motors can be switched off at the same time, and if a process valve is forced to a closed or open position, another valve can be forced to do the same, or the opposite. Frequently we see in process safety HAZOP analysis that double failures are not taken into account; this is the field of the cyber security HAZOP, translating the various failure modes of the ICS functions into the ways they can be used to cause process failures.

The question in the above scenario is: could the attacker shut down both pumps, and would a manual restart of the cooling system be possible? If the restart failed, the lining would overheat very quickly. In the case of the Corus furnace in Port Talbot, the cooling system was designed to run two pumps simultaneously, each producing 45,000 litres per minute. So a flow equivalent to 90,000 one-litre water bottles per minute; if we needed a crowd of people to consume this amount of water per minute, we would get a very sizable crowd. Just to give you an idea how much water this is. If the cooling water flow were stopped, the heat in the lining would rise very quickly, and as a result the refractory lining would be damaged by the huge thermal stress created.
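To put those flow numbers in perspective, the heat such a flow can carry away can be sketched with a back-of-the-envelope calculation. The flow figures come from the Port Talbot example above; the 20 °C temperature rise across the cooling circuit is my own illustrative assumption, not plant data.

```python
# Back-of-the-envelope estimate of the heat removed by the cooling circuit.
# Flow figures are from the Port Talbot example; the temperature rise across
# the circuit (DELTA_T) is an assumed illustrative value, not plant data.

PUMPS = 2
FLOW_PER_PUMP_L_PER_MIN = 45_000   # litres per minute, per pump
CP_WATER = 4186                    # J/(kg*K), specific heat of water
DELTA_T = 20                       # K, assumed temperature rise across the lining

total_flow_kg_per_s = PUMPS * FLOW_PER_PUMP_L_PER_MIN / 60  # 1 litre ~ 1 kg
heat_removed_w = total_flow_kg_per_s * CP_WATER * DELTA_T

print(f"Mass flow: {total_flow_kg_per_s:.0f} kg/s")
print(f"Heat removed: {heat_removed_w / 1e6:.0f} MW")
```

Under these assumptions, losing the flow means on the order of a hundred megawatts of heat no longer being removed from the lining, which illustrates why the thermal stress builds up so quickly.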

For safety purposes there is generally also an emergency water tower that acts when the water pressure drops. This system supplies water through gravity; it might not have worked, or it may have had insufficient capacity to prevent damage. Or perhaps the attacker also knew a way to stop this safety control.

Do we need a two-stage attack for the above scenario?

I believe a threat actor with sufficient hands-on experience with a specific BPCS can relatively quickly locate these critical pumps and shut them down. With the same level of knowledge he can subsequently shut down the controller, if his authorizations allow this. Typically engineer-level access would be required, but I have also encountered systems where process operators could shut down a controller. How blast furnaces work, what the critical functions are, and even which companies specialize in supplying these functions, is all in the public domain. So perhaps a single-stage attack is possible.

If the above is a credible scenario for what might have happened in the case of the German steel mill, what can we learn from it, and what can we do to increase the resilience against this attack?

  • First, we should not allow any remote access (access from the corporate network is also remote access) without two-factor authentication if the user would have “write” possibilities in the ICS. The fact that an attacker could gain access to the corporate network, and from there get access to the control network and cause an uncontrolled shutdown, is in my opinion a significant (and basic) security flaw.
  • Second, we should consider implementing dual control for critical functions in the process that have a steady state: for example a compressor providing instrument air, or in this case a cooling water pump and its redundant partners, whose normal state is on. Dual control demands that two users approve the shutdown action before the control system executes the command; these would generally be users with different authorization levels. Such a control would further raise the bar for the threat actor.
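As a sketch of what such a dual control check could look like in logic, assuming a simple two-role approval scheme (the class name, user names, roles, and the pump tag are my own illustration, not any vendor's API; a real BPCS would implement this inside the controller, not in application code):

```python
# Minimal sketch of dual control for a critical shutdown command.
# All names and roles are illustrative; a real system would enforce this
# inside the control system itself.

class DualControlCommand:
    def __init__(self, action, required_roles=("operator", "engineer")):
        self.action = action
        self.required_roles = set(required_roles)
        self.approvals = {}   # user -> role

    def approve(self, user, role):
        self.approvals[user] = role

    def execute(self):
        # Require at least two different users covering the distinct roles.
        roles = set(self.approvals.values())
        if len(self.approvals) >= 2 and self.required_roles <= roles:
            return f"EXECUTED: {self.action}"
        return f"BLOCKED: {self.action} needs two approvals with distinct roles"

cmd = DualControlCommand("shutdown cooling pump P-101A")
cmd.approve("alice", "operator")
print(cmd.execute())            # blocked: only one approval so far
cmd.approve("bob", "engineer")
print(cmd.execute())            # executed: two users, two distinct roles
```

The design point is that a single compromised account, even an engineer account, can no longer issue the shutdown alone.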

LESSON 2: For critical automation functions that can cause significant damage when initiated, consider either isolating them from the control system (hardwired connections) or implementing DUAL CONTROL, to not depend on a single layer of protection – access control. Defense in depth should never be ignored for critical functions. WHEN CRITICAL, DON’T RELY ON A SINGLE CONTROL, APPLY DEFENSE IN DEPTH!

The above two additional security controls would make the sketched scenario much more difficult, almost impossible, to succeed. OT security is more than installing AV, patching, and a firewall. To reduce risk we need to assess security from all points of view: the automation function and its vulnerabilities, our protection of these vulnerabilities, our situational awareness of what is happening, and what threats there are – which attack scenarios could happen.

IEC 62443 specifies the “dual control” requirement as an SL 4 requirement. I personally consider this a flaw in the standard and believe it should already be a mandatory requirement for SL 3, because unauthorized access to critical process equipment is well within the capabilities of the “sponsored attacker” type of threat actor.

Having dual control as a requirement doesn’t mean that all control actions should have this implemented, but there are critical functions in a plant that are seldom switched on or off where adding dual control adds security by adding DEFENSE IN DEPTH. When critical, don’t rely on a single control.

Today’s blog is also a story about the need to approach OT security of ICS in a different manner than we do for IT systems, where network security and hardening are the core activities for prevention. Network security and hardening of the computer systems are essential first steps for an OT system too, but we should never forget the actual automation functions of the OT systems.

This is why OT security teams should approach OT security differently and construct bow-tie diagrams for cyber security hazards, relating cause (a threat actor initiating a threat action exploiting a vulnerability) to consequence (the functional deviation), and identifying security countermeasures, safeguards, and barriers. Only by following this path can we analyze possible process failures as the result of a functional deviation, and design OT systems in a secure way. However, for this method to provide the required information we need a far higher level of detail / discrimination than the terminology used in the past offered. Terminology such as loss of view and loss of control doesn’t lead to identifying countermeasures, safeguards, and barriers. To make the step toward how a cyber attack can impact a production system we need much more granularity; therefore the 16 failure modes discussed in a prior blog.
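A bow-tie for a cyber security hazard can be captured in a very simple data structure. The sketch below is my own minimal illustration (all example entries are hypothetical, loosely modeled on the steel mill scenario discussed earlier), relating a cause through the functional deviation to a process consequence, with preventive countermeasures on the one side and mitigating safeguards / barriers on the other:

```python
# Minimal sketch of a cyber security bow-tie as a data structure.
# All example entries are hypothetical illustrations.

from dataclasses import dataclass, field

@dataclass
class BowTie:
    cause: str                 # threat action exploiting a vulnerability
    top_event: str             # the functional deviation of the ICS function
    consequence: str           # the resulting process failure
    countermeasures: list = field(default_factory=list)  # prevent the top event
    safeguards: list = field(default_factory=list)       # mitigate the consequence

bowtie = BowTie(
    cause="remote actor issues stop commands to cooling pump motors",
    top_event="loss of required performance: cooling water flow stops",
    consequence="thermal stress damages the refractory lining",
    countermeasures=["two-factor authentication", "dual control on pump stop"],
    safeguards=["gravity-fed emergency water tower", "manual pump restart"],
)

print(f"{bowtie.cause} -> {bowtie.top_event} -> {bowtie.consequence}")
```

Even this trivial structure forces the analysis to name a concrete functional deviation, which is exactly the granularity that "loss of view / loss of control" terminology does not provide.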

Each ICS main function (BPCS, SIS, CCS, PMS, …) has a fixed set of functions and potential deviations; it is the core task of the cyber security HAZOP to identify what risk they bring. Not to identify these deviations: they are fixed, because each main function has a set of automation tasks it carries out. It is the deviations, their likelihood, and their severity for the production installation that are of importance, because then we know what benefits controls bring and which controls are of less importance. But we need to consider all options before we can select.

LESSON 3 – My final lesson of the day: first analyze what the problem is before trying to solve it.

Many times I see asset owners surfing the waves of the latest security hype. Every security solution has its benefits and “exciting” features, but this doesn’t mean it is the right solution for your specific problem or addressing your biggest security gap.

To know what the right solutions are, you need a clear view of the cyber security hazards in your system; then prioritize and select the solutions that contribute most. Often many features are already available; start using them.

This leads to reusing methodologies we have now successfully applied for almost 60 years in process safety; we should adapt them where necessary and reuse them for OT security.

There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 42 years of work in this industry: approximately half of the time engineering these automation systems, and half of the time implementing their networks and securing them.

Author: Sinclair Koelemij

OTcybersecurity web site

Letting a goat into the garden

Criticizing a standard is no easy matter; once a group moves with a common idea, the individual is supposed to stay in line. But the Dutch are direct, some might call it blunt, others rude. And I am Dutch and have a strong feeling that the IEC 62443 standard is promising more than it can deliver, so I want to address this in today’s blog. Not to question the value of the standard, nor to criticize the people contributing to it, often in their private time, but simply to clarify where the standard has great value and where flaws appear.

This is a story about risk, and since the many wonders of OT security risk have my special attention, it is time to clarify my position on the standard in more detail.

The Dutch have a phrase “De knuppel in het hoenderhok gooien”, literally translated it means “Throwing the bat into the chicken shed”. The proper English translation seems to be to “Throw the cat among the pigeons”. Both expressions don’t align very well with my nature, agility and the purpose of this blog. So I was looking for a better phrase and decided to use the Russian phrase “пустить козла в огород” (let a goat into the garden). It seems to be a friendly action from the perspective of both the goat and the threat actor, so let me be the guy that lets the goat into the garden.

As always I like to learn and to better understand the rationale behind the choices made. I don’t write this blog to confront those waiting in their trenches to fight for an unshakable belief in the blessings of a particular standard. Standards have strong points and weaker points.

I am in favor of standards, though not of too many standards that merely overlap others, and I see their value in guiding us when we make our initial steps in protecting our ICS.

I also see their value for vendors: when they need to show asset owners that their products have been evaluated against an established set of security requirements, that their development teams create new products with security processes embedded in the product development process, and that they operate their services with a focus on security. The ISA 99 team has really achieved a fantastic result and contributed to a more secure ICS. ISASecure is an excellent example of this.

But I also see that standards are sometimes misused by people to hide behind, for example when a standard doesn’t address a specific topic: “If the standard doesn’t address it, it isn’t important, we can ignore it”. I see that standards are sometimes referred to in procurement documents to demand a certain level of protection. However, standards are not written as procurement documents; they allow for many shades of gray, and above all their development and refresh cycles struggle with the speed of change of technology in a continuously changing ICS threat landscape. Standards are good, but they also have pitfalls to be aware of. So I remain critical when I apply them.

Europe is a patchwork of standards without many major differences between them. Standards seem to bring out the dog in us: we develop new standards without substantially differentiating them from existing ones. Standards sometimes seem to function as the next tree on a dog’s walk through life. It is apparently a pleasant activity to develop new standards, though they have often developed more into trust zones, made to look different, creating hurdles in a global community.

The IEC 62443 / ISA 99 was the first standard that guided us (I know ISO 27001 has older roots, but it is not aimed at ICS). The standard had a significant influence on our thinking, including my own thinking and vocabulary, and became the foundation for many standards since. Thanks to the standard we could make big steps forward.

But I feel the standard also makes promises it can’t keep, in my opinion. Specifically the promise that it also addresses the threat posed by advanced threat actors with all the resources they need, operating in multi-discipline teams: the security level 3 and security level 4 types of threat actor in IEC 62443. This is the aspect I would like to address in this blog.

IEC 62443-3-2 was recently released. The key concept of IEC 62443 revolves around the definition of security zones and the conduits between zones that pass the traffic: the conceptual pipelines carrying the channels (protocols) flowing between assets in the zones.

The standard wants us to determine “zone risk” and uses this risk to assign target security levels to a zone. A similar approach is also used by ANSSI, the French national cyber security agency; their standard also includes the risk model describing how to estimate the risk.

Such an approach is the logical consequence from a drive to create a list of security requirements and provide a mechanism to select which of the many security requirements are an absolute must to stop a specific threat actor. “Give me a list on what to do” is a very human reaction when we are in a hurry and are happy with what our neighbors do to address a problem. However such an approach does not always take into account that the neighbor might be wrestling with a different problem.

ANSSI doesn’t link their risk result to a threat actor; IEC 62443 does link it to a threat actor in its SL 1, SL 2, SL 3, and SL 4 definitions. ANSSI introduces the threat actor into its risk formula: taking a more capable threat actor into account raises the risk. The risk value determines the security class for the zone, and the standard specifies which requirements must be met by the zone.

Within IEC 62443-3-2, the target level is the result of a high-level risk assessment and a conversion from risk to a target security level. IEC 62443-3-3 then specifies the security requirements for this target level. These are small differences between the two standards, not important for my argument, though from a risk analysis perspective the ANSSI method is the theoretically better approach: it doesn’t need the transformation from risk to target level.

Where do I see a flaw in both ANSSI and IEC 62443 when addressing cyber security for critical infrastructure? It has to do with the concept of zone risk. Zone risk in IEC 62443 is a kind of neighborhood risk: the assets live in a good or a bad neighborhood. If you happen to live in the bad neighborhood, you have to double your door locks, add iron bars before your windows, and, to also account for situational awareness and incident response, buy a big dog.

However, zone risk / neighborhood risk doesn’t take the individual asset or threat into account. Within the same neighborhood, the protection required for the vault of a bank differs from the protection required by the grocery that wants to invite customers in to do some business. You might say the bank shouldn’t be in the bad neighborhood, but that doesn’t work in an ICS, where automation functions often span multiple zones.

There are a lot of intrinsic trust relations between components in an ICS that can be misused if we don’t take them into account. Taking them into account would make security zones either so big that we could no longer account for differences in other characteristics (for example a 24×7 attended environment versus an unattended environment), or so small that we get into trouble protecting the zone perimeter. The ideal solution would be what cyber security experts call the zero-trust model, where each asset is its own zone. This would be an excellent solution, but I don’t see it happening; the gap with today’s ICS solutions is just too big, and even then there remain differences that require larger security zones.

Zone risk automatically leads to control-based risk, a model where we compare a list of “what-is-good-for-you” controls with a list of implemented controls. The gap between the two lists can be seen as exposing unprotected vulnerabilities that can be exploited. The likelihood that this happens, combined with the consequence severity, results in risk.
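The mechanics of control-based risk can be sketched in a few lines: take the prescribed control list, subtract what is implemented, and score the gap. The control names and severity weights below are hypothetical illustrations, not taken from any standard:

```python
# Sketch of control-based ("zone") risk: risk is derived purely from the
# gap between a prescribed control list and the implemented controls.
# Control names and severity weights are hypothetical illustrations.

required = {"2FA on remote access": 5, "network segmentation": 4,
            "AV on servers": 2, "security monitoring": 3}
implemented = {"network segmentation", "AV on servers"}

# The gap: prescribed controls that are not in place, with their weights.
gap = {c: w for c, w in required.items() if c not in implemented}
zone_risk = sum(gap.values())

print("Missing controls:", sorted(gap))
print("Zone risk score:", zone_risk)
```

Note what is absent from this calculation: the individual asset behind the controls, and any threat outside the prescribed list. That absence is exactly the limitation of the model.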

Essential for a control-based risk model is that we ignore the asset and also don’t consider all threats, just those threats that result from the gap in controls. The concept works when you make your initial steps in life and learn to walk, but when attempting the steeplechase you are flat on your face very soon.

Zone risk is of course ideal for governance, because we can make “universal rules”; it is frequently our first reaction when trying to control risk.

The Covid-19 approach in many countries shows the difficulties this model faces. The world has been divided into different zones, sometimes countries, sometimes regions within a country. Governance has set rules for these zones, and thanks to this we managed to control the attack. So what is the problem?

Now that we are on our way back to “normal” society, while the threat hasn’t disappeared and is still present, we see the cracks in the rule sets. We have to maintain 1.5 meters distance (though rules differ by zone: there are 1-meter zones and 2-meter zones). We allow planes to fly with all seats occupied, while at the same time we deny this for buses and trains. We have restrictions in restaurants, theaters, sport schools, etc.

I am not challenging the many different rules; I just want to indicate that the “asset” perspective differs from the zone perspective, and that there are times when the zone perspective is very effective to address the initial attack but runs into issues when we want to continue with our normal lives. Governance today has a difficult job justifying all the differences while trying to hold on to its successful zone concept.

I see the same for OT security: to address our initial security worries, the zone approach adopted by IEC 62443 worked fine. The defense team of asset owners and vendors made good progress and raised the bar for the threat actor team. But the threat actor team also matured and extended its portfolio of devious attack scenarios, and now cracks become apparent in our defense. To repair these cracks I believe we should step up from control-based risk to other methods of risk assessment.

The next step on the ladder is asset-based risk. In asset-based risk we take the asset and conduct threat modeling to identify which threats exist and how we best address them. This results in a list of controls to address the risk. We can compare this list with the controls that are actually implemented, and the gap between the two brings us on the path of estimating risk. Assets in this model are not necessarily physical system components; the better approach is to define assets as ICS functions. But the core of the method is that we take the asset/function, analyze what can go wrong, and define the required controls to prevent this from happening.

The big problem for governance is that this approach doesn’t lead to a confined set of controls or security requirements. New insights caused by changes in the function or its environment may lead to new requirements. But it is an approach that tends to follow technical change. For example, the first version of IEC 62443-3-3 does not cover newer technologies such as wireless sensors, virtual systems, and IIoT; this technology didn’t exist at the time, but today’s official version is still the version from September 2011. It took approximately 5 years to develop IEC 62443-3-3, and I believe it is ISA policy to refresh a standard every 5 years. This is a long time in the security world. Ten years ago we had the Stuxnet attack; a lot has happened since then.

In threat based risk we approach it from the threat side: what kind of threats are feasible, and what would be the best target for the threat actor team to execute the threat? Once we have listed our threats, we can analyze which controls would be necessary to prevent and detect them if attempted. These controls we can compare with the controls actually applied in the ICS, and construct risk from this gap to rank the threats in order of importance.

Following this approach, we truly follow the advice of the Chinese general Sun Tzu when he wrote “If you know the enemy and you know yourself, you don’t have to fear the result of a hundred battles. (Threat based risk) If you know yourself, but not the enemy, you will also be defeated for every victory achieved (Asset based risk). If you don’t know the enemy or yourself, you will succumb in every battle. (Control based risk)” Control based risk neither takes the asset into account (knowing yourself), nor the threat (knowing your enemy).

I don’t know whether to call the previous part of the blog a lengthy intro, historical context, or part of my continued story on OT cyber security risk. In the remainder of the blog I would like to explain the concept of risk for operational technology, the core of every process automation system, which is meant to be the core of this blog.

When we talk about risk we make it sound as if only one type of risk exists; we sometimes mix up criticality and risk, and we use terms such as high level risk where initial risk better represents what is actually attempted.

Let me try to be more specific, starting with a picture I made for this blog. (Don’t worry, the next diagrams are bigger.)

The risk model

For me there are three types of risk related to ICS:

  • Cyber security risk
  • Process risk
  • Mission (or business) risk

Ultimately, for an asset owner and its management there is only one risk they are interested in; this is what I call mission risk. It is the risk related to impact expressed in monetary value, safety consequences, environmental damage, company image, regulations, and service delivery. Risk that directly threatens the company’s existence. See the diagram showing some of the impact categories in my earlier blog on OT cyber security risk.

If we can translate the impact of a potential cyber security incident (a cyber security hazard) into a credible mission risk value, we catch the attention of any plant manager. Their world is for a large part risk driven. Management struggles with requests such as “I need a better firewall” (better generally means more expensive) without understanding what this better does in terms of mission risk.

The model above (I will enlarge the diagram) shows risk as a two stage / two step process. Let me explain, starting at stage 1.

Cyber security risk

Part of what follows I already discussed in my ode to the honor and glory of Consequence, but I repeat it here partially for clarity and to add some more detail readers requested in private messages.

The formula for cyber security risk is simple,

Cyber security risk = Threat x Vulnerability x Consequence.

I use the word consequence here, because I reserve the word impact for mission risk.
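As a toy illustration of how this formula can be applied per cyber security hazard and used to rank hazards, consider the sketch below. The hazards and the 1-5 scores are invented for the example; a real assessment would derive them from threat, exposure, and consequence analysis.

```python
# Toy illustration of Risk = Threat x Vulnerability x Consequence,
# applied per hazard to produce a ranking. All names and scores are
# illustrative assumptions.

hazards = {
    "malware via removable media":       {"threat": 4, "vulnerability": 3, "consequence": 4},
    "compromised remote access session": {"threat": 3, "vulnerability": 4, "consequence": 5},
    "insider modifies safety logic":     {"threat": 2, "vulnerability": 2, "consequence": 5},
}

def risk(h: dict) -> int:
    return h["threat"] * h["vulnerability"] * h["consequence"]

ranked = sorted(hazards, key=lambda name: risk(hazards[name]), reverse=True)
for name in ranked:
    print(f"{risk(hazards[name]):3d}  {name}")
# → 60  compromised remote access session
#    48  malware via removable media
#    20  insider modifies safety logic
```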

There is another subtle difference I’d like to make: I consider IT cyber security risk a risk where consequence is primarily influenced by data. We are worried that data confidentiality is lost, data integrity is affected, or data is no longer available; the well-known CIA triad.

Security in the context of automation adds another factor: the actions of the automated function. The sequence of actions becomes important, time starts playing a role, and it becomes a game of action and interaction. Another, very dominant, factor enters the game. Therefore consequence needs to be redefined; to do this we can group consequences into three “new” categories:

  • Loss of required performance (LRP) – Defined as “The functions do not meet operational / design intent while in service”. Examples are program logic has changed, ranges were modified, valve travel rates were modified, calibrations are off, etc.
  • Loss of ability (LoA) – Defined as “The function stopped providing its intended operation”. Examples are loss of view, loss of control, loss of ability to communicate, loss of functional safety, etc.
  • Loss of confidentiality (LoC) – Defined as “Information or data in the system was exposed that should have been kept confidential”. Examples are loss of intellectual property, loss of access credential confidentiality, loss of privacy data, loss of production data.

The “new” is quoted because the first two categories are well-known categories in asset integrity management, used on a daily basis in plants. The data part (LoC) is simple, very much like confidentiality as defined in IT risk estimates. But to get a more discriminating level of risk we need to split it up into additional sub-categories, sub-categories I call failure modes. For confidentiality there are four failure modes:

  • Loss of intellectual property – Examples of this can be a specific recipe, or a particular way of automating the production process that needs to remain secret;
  • Loss of production data confidentiality – This might be data that reveals production cost / efficiency, or the ability to deliver a service;
  • Loss of access credential confidentiality – This is data that would allow a threat actor to raise his/her access privileges for a specific function;
  • Loss of privacy data confidentiality – This type of data is not commonly found in ICS, but there are exceptions where privacy data, and with it the GDPR regulations, are very important.

Failure modes differ for each function; there are functions for which none of the above failure modes play a role, and functions for which all play a role. Now let’s go to the next two categories, Loss of Required Performance and Loss of Ability. These two are very specific to process automation systems like ICS. Loss of required performance has six failure modes assigned, in random sequence:

  • Integrity Operating Window (IOW) deviations – This applies to functional deviations that allow the automated function to move outside its design limitations, which could cause immediate physical damage. Examples are modifying a level measurement range, potentially allowing a tank to be overfilled, or modifying a temperature range, potentially damaging the coils in an industrial furnace. IOWs are very important in a plant, so many standards address them;
  • Control Performance (CP) deviations – This failure mode is directly linked to the control dynamics, so where in the previous category the attacker modified engineering data, in this failure mode the attacker modifies operator data. For example raising a flow, stopping a compressor, shutting down a steel mill furnace. There are many situations in a plant where this can lead to dangerous situations. An example is filling a tank with a non-conductive fluid; under certain conditions a static electricity buildup can create a spark. If there were an explosive vapor in this tank, this could lead to a major accident with multiple casualties. Safety incidents have been reported where this happened accidentally. One way to prevent this is to restrict the flow used to fill the tank. An attacker with access to this flow control might maliciously maximize it to cause such a spark and explosion;
  • Functional safety configuration / application deviations (FS) – If safety logic or configuration is affected by an attack, many things can happen, in most cases bad things, because the SIS’s primary task is to prevent bad things from happening. Loss of functional safety and tampering with the configuration / application are considered a single category, because there is no such thing as a bit of safety; safety is an all or nothing failure mode;
  • Monitoring and diagnostic function deviations (MD) – ICS contains many monitoring and diagnostic functions, for example vibration monitoring functions, condition protection functions, corrosion monitoring, analyzers, gas chromatography, emission monitoring systems, visual monitoring functions such as cameras monitoring the flare, and many more. Impacting these functions can be just as bad. As an example, if an attacker were to modify settings in an emission monitoring solution, the measured values might be reported as higher than they actually are. If this became publicly known, it would cause a stream of negative publicity and hurt the company’s public image. Modifying vibration monitoring settings might prevent the detection of mechanical wear, leading to larger physical damage than would be the case if detected in time;
  • Monitoring data or alarm integrity deviation (MDAI) – This type of failure mode causes data to no longer be accurately represented, or alarms to not occur when they should (either too early, too late, or too many, overwhelming a process operator). The ICS function does its job as designed in all cases; however, it is fed with wrong data and as a consequence might act in the wrong manner, too early, or too late;
  • Loss of deterministic / real-time behavior (RB) – This type of failure mode is very specific to real-time OT components. As I explained in my blog on real-time systems, tasks are required to finish in time. If a task is not ready when its time slot ends, it is bad luck; next time the task starts, it starts from the beginning, creating the possibility of starvation, where specific tasks never complete.

All six failure modes above have in common that the ICS still functions, only not as intended. Either the design has been tampered with, or its operational response doesn’t do the job as intended.

The next main category of consequence is Loss of Ability to perform; in this category the function is no longer available. But here too we have multiple failure modes (6) for grouping consequences.

  • Loss of controllability and / or intervention (LoCI) – When this occurs we lose the ability to move a system through what is called its configuration space. We can’t adjust the flow anymore, pressure controls are lost; this might occur because the actuators fail, or because an operator no longer has access to the functions that would allow him to do so.
  • Loss of observability (LoO) – This is the situation where we can no longer observe the actual physical state of the production process. This can be because we no longer measure it, but can also occur when the measured data is frozen, for example by continuously replaying the same measurement message over and over again.
  • Loss of alarm or alerting function (LoAA) – This is the situation where there are no alarms or alerts warning the process operators of anomalies that they should respond to. This is not necessarily restricted to BPCS alarms; it can also be alarms on a fire alarm panel, or a PAGA system that doesn’t function.
  • Loss of coordination and / or collaboration (LoCC) – In an automation system, many functions collaborate or coordinate their activities. Coordination / collaboration is much more than communication; the activities need to be “synchronized”, aware of what the other is doing, or things might go wrong with a potentially high impact.
  • Loss of data communication (LoDC) – When we can’t communicate we can’t exchange commands or data. This doesn’t necessarily mean we can’t do anything; many functions can act in isolation, and sometimes there are hardwired connections to act on.
  • Loss of reporting functions (historical data / trending) – This failure mode prevents us from reporting on ICS performance or observing slowly changing trends (creep) by blocking trending capabilities. Loss of reporting can be serious under certain circumstances, specifically for FDA and regulatory compliance.

These were the sixteen failure modes / sub-categories we can use to group consequences: functional deviations caused by a cyber attack. Not every function has all failure modes, and the failure modes have different severity scores for each function. But in general we can group the hundreds of functional deviations of an ICS, with its many functions (BPCS, SIS, CCS, IAMS, CDP, DAHS, PMS, MCC, MMS, ….), into these 16 failure modes and assign a severity value to each, which we use for estimating cyber security risk for the various cyber security hazards.
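A data structure for this grouping could look like the sketch below. The per-function severities are invented for the example, and the abbreviations for the confidentiality and reporting failure modes (LoIP, LoPD, LoAC, LoPrD, LoR) are my own shorthand, since the blog does not assign them.

```python
# Sketch of mapping ICS functions to applicable failure modes with an
# illustrative 1-10 severity each. Functions, modes present, and all
# severity values are assumptions for the example.

FAILURE_MODES = ["IOW", "CP", "FS", "MD", "MDAI", "RB",          # Loss of Required Performance
                 "LoCI", "LoO", "LoAA", "LoCC", "LoDC", "LoR",   # Loss of Ability
                 "LoIP", "LoPD", "LoAC", "LoPrD"]                # Loss of Confidentiality

# Severity of each applicable failure mode per function (absent = not applicable).
severity = {
    "BPCS": {"IOW": 9, "CP": 8, "LoCI": 9, "LoO": 8, "MDAI": 7},
    "SIS":  {"FS": 10, "LoAA": 9, "IOW": 8},
    "IAMS": {"LoAC": 7, "MD": 5},
}

def worst_case(function: str) -> tuple:
    """Highest-severity failure mode for a function."""
    modes = severity[function]
    mode = max(modes, key=modes.get)
    return mode, modes[mode]

print(worst_case("SIS"))  # FS: safety is the all-or-nothing failure mode
# → ('FS', 10)
```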

I will not discuss all blocks of the diagram, that would be a book and not a blog, but the criticality block is important. Why is criticality (the importance of the function) important?

For OT security, criticality is a twin: one is the importance of the function for the plant’s mission, the other its importance for the threat actor. These are different things. For example, a function such as an Instrument Asset Management System (IAMS) is not so important for the plant’s mission, but for the threat actor’s mission it might be a very interesting object because it provides access to all field equipment.

We see the same difference with various technical and security management functions that, if used by a threat actor, provide him / her with many benefits. If we want to compare cyber security risk between different functions, we cannot base this on consequence severity alone; we need to take function criticality into account. Another important aspect of criticality is that it is not a constant; criticality can increase over time. This provides us with information for our recovery time objectives, recovery point objectives, and maximum tolerable downtime. All critical security design parameters!
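The time aspect of criticality can be pictured as a step curve: the longer a function is down, the larger the consequence, and the point where the consequence becomes unacceptable is what drives the maximum tolerable downtime. The curve below is an invented example, not data from any plant.

```python
# Sketch of time-dependent criticality. The thresholds, scores, and MTD
# are illustrative assumptions.

def criticality(hours_down: float) -> int:
    """Illustrative step curve: importance of restoring the function."""
    if hours_down < 1:
        return 1    # buffer stock / manual workaround absorbs it
    if hours_down < 8:
        return 3    # production degraded
    if hours_down < 24:
        return 7    # upstream / downstream units start to suffer
    return 10       # plant shutdown

mtd_hours = 24   # beyond this the consequence reaches its maximum
print([criticality(t) for t in (0.5, 4, 12, 48)])
# → [1, 3, 7, 10]
```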

The zone and conduit model plays an important role in the Threat x Vulnerability, the likelihood, side of the cyber security risk equation. A very important factor here is exposure, which has two parts: dynamic exposure and static exposure. Connectivity is one of the three parameters for “measuring” static exposure.

The final block on the stage 1 side that I’d like to discuss briefly, perhaps in more detail in a later blog, is the Functional Indicators of Compromise (FIoC).

In cyber security we are used to talking about IoCs, generally in the form of the threat actor making some modifications in the registry or transferring an executable to a specific location. The research companies evaluating cyber attacks document these IoCs in their reports.

But when we discuss functional deviations as the result of a cyber attack, as in OT cyber security, we have indicators of compromise here too: FIoCs. This is because data is often stored in multiple locations in the system. For example, when we configure a controller we set various parameters for the connected field equipment. This data can be downloaded to the field equipment or separately managed by an IAMS. But a change in one place is not necessarily tracked and automatically updated in the other places.

This offers us a way of detecting inconsistencies and alerting on them. For multiple reasons this is an important function, but it is not always implemented. Similarly, many settings in an ICS are always within a certain range; if they vary, the variations are small. A cyber attack generally initiates bigger changes, which we can identify.

Not for all, but for many consequences / failure modes, a functional indicator of compromise (FIoC) is available to warn that something unusual occurred and should be checked.
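A minimal sketch of such a consistency check is shown below: compare a parameter as stored in the controller with the copy managed by the IAMS, and flag copies that have drifted apart. The tag names, values, and the 10% tolerance are illustrative assumptions.

```python
# Hypothetical FIoC check: two stored copies of the same field device
# parameter should agree; a large difference is an indicator worth
# investigating. All tags and values are invented for the example.

controller = {"FT-101.range_hi": 500.0, "TT-201.range_hi": 350.0}
iams       = {"FT-101.range_hi": 500.0, "TT-201.range_hi": 450.0}  # tampered?

def inconsistencies(a: dict, b: dict, tolerance: float = 0.10) -> list:
    """Parameters whose two stored copies differ by more than the tolerance."""
    flagged = []
    for tag in a.keys() & b.keys():
        if abs(a[tag] - b[tag]) > tolerance * abs(a[tag]):
            flagged.append(tag)
    return sorted(flagged)

print(inconsistencies(controller, iams))
# → ['TT-201.range_hi']
```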

Process risk and mission risk

Let’s slowly move to the end of the blog (I surpassed my reading time long ago) and look at stage 2: the path from cyber security risk to mission risk.

Stage 1 provided us with two results, a likelihood score and a set of functional deviations, that we need to bring over to stage 2. The identified functional deviations can cause deviations in the production process: perhaps too high a flow in my tank example, perhaps the possibility to close one or more valves simultaneously.

In order to evaluate their consequence for process risk, we need to inspect which functional deviations have unforeseen potential process deviations. Process safety analysis thoroughly goes over the various failure modes that can occur in a plant and analyses how to prevent these from causing a personal safety incident or damage to equipment.

But what process safety analysis does not do is look for foul play. A PHA sheet might conclude that the failure of a specific valve is no issue because an open connection is available, but what if that open connection were blocked at the same time the threat actor initiates the failure of the valve? Two or more actions can occur simultaneously in an attack.

That is the activity in the second stage of the risk analysis, needed to arrive at mission risk. First we need to identify the process consequences that are of importance, before we can translate these into business / mission impact. The likelihood of such an incident is the same as the likelihood of the cyber security incident that caused all the trouble.
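The stage 1 to stage 2 hand-over can be expressed as a simple data flow: the likelihood carries over unchanged, while the functional deviation is mapped first to a process consequence and then to a mission impact category. The mappings and the likelihood value below are illustrative assumptions, reusing the tank filling example.

```python
# Sketch of the stage 2 step: carry the stage 1 likelihood forward and
# map functional deviation -> process consequence -> mission impact.
# All entries are invented for the example.

stage1 = {"likelihood": 0.3,                       # from the cyber security risk analysis
          "deviation": "flow control maximized"}   # functional deviation (CP failure mode)

process_consequence = {"flow control maximized": "static spark / tank explosion"}
mission_impact      = {"static spark / tank explosion": "safety + asset damage"}

# The likelihood of the mission impact equals the likelihood of the
# cyber security incident that caused it.
consequence = process_consequence[stage1["deviation"]]
hazard = {
    "likelihood": stage1["likelihood"],
    "process": consequence,
    "impact": mission_impact[consequence],
}
print(hazard)
```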

Though the method of analyzing the process safety sheets is probably just as interesting as many of the stage 1 processes, a blog needs to end. So perhaps a later blog, or a book when I am retired.

What did I try to say in this blog? Well, first of all I like to repeat: standards are important, and IEC 62443 is the most influential of all and contributes very much to the security of ICS. That doesn’t mean it is above criticism; it is, after all, a man-made product.

The structure of standards such as IEC 62443 or the French ANSSI standard leads to control based risk because of their zone based strategy. Nothing wrong with this, but in my opinion it is not sufficient to protect more critical systems attacked by more capable threat actors, like the SL 3 and SL 4 type of threat actors. There I see gaps in the method.

I explained that we have three forms of risk: cyber security risk, process risk, and mission risk. They differ because they need different inputs. We cannot just jump from cyber security risk to mission risk, forgetting about how the functional deviations in the ICS caused by the cyber attack impact the production process; we need to analyse this step.

I have shown and discussed a two stage method to come from a cyber security hazard, via a process failure hazard, to mission risk. I didn’t explain the method by which I identified all these cyber security hazards, nor the method of risk analysis to estimate likelihood. This can still be either threat based or asset based; maybe another blog will explain the details.

But at minimum the asset specific elements and the threats are accounted for in the method described, this is missing in the zone risk approach. Functional deviations (consequences) play no role in zone risk estimates.

Does this mean that IEC 62443 isn’t good enough? No, absolutely not. All I say in this blog is that it is not complete enough; security cannot be achieved with just a list of requirements. This plays no role when the objective is to keep cyber criminals (SL 2) out, but it plays a role when attacks become targeted and ICS specific. It is my opinion that for SL 3 and SL 4 we need a detailed asset or threat based risk analysis that potentially adds new requirements (and in my hands-on experience it does) to the requirements identified by the zone risk analysis.

So what would I like to see different?

  • IEC 62443 starts with an inventory, this seems for me the right first step.
  • Then IEC 62443 continues with a high level risk assessment; this I believe is wrong. Apart from the name (it should have been initial risk assessment), I think there is no information available at this point to determine likelihood, so it becomes primarily an impact driven assessment. The proper choice in my opinion would have been to conduct a criticality assessment; for criticality I don’t need likelihood. Apart from this, criticality in a plant has a time aspect because of upstream / downstream dependencies. This provides us with recovery time objectives (for the function), recovery point objectives, and maximum tolerable downtime when multiple functions are impacted. This part is missing in the IEC 62443 standard, while it is an important design parameter. Another reason is of course that when we do asset based or threat based risk and start comparing risk between multiple functions, we need to take the function’s criticality into account.
  • Then IEC 62443 creates a zone and conduit diagram, no objections here. To determine risk we need static exposure, we need to know which assets go where. So good step.
  • Then IEC 62443 does a detailed risk assessment, also the right moment to do so. Hopefully IEC 62443 will also come to see the importance of asset or threat based risk assessment. The standard doesn’t discuss this; it might as well be a control based risk assessment. But because there is no criticality assessment, I conclude it must be a control based assessment, not looking at risk at the function level.

I hope I didn’t step on too many toes; I only wanted to give a goat a great day. It is just my opinion as a private person who, during business hours, works as a consultant and on a daily basis needs to secure systems, many of them part of national critical infrastructure.

The above is what I am missing: a criticality assessment, and a risk assessment that meets the SL 3 / SL 4 threat.

There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 42 years of work in this industry. Approximately half of that time I worked on engineering these automation systems, and half on implementing their networks and securing them.

Author: Sinclair Koelemij

OTcybersecurity web site

The Classic ICS perimeter

Whenever I think about an ICS perimeter, I see the picture of a well head with its many access controls and monitoring functions. In this mid-week blog I have chosen an easily digestible subject. No risk and certainly no transformers this time, but something I call the “Classic ICS perimeter”.

What is classic about this perimeter? The classic element is that I don’t discuss all those solutions that contribute to the “de-perimeterization” – if this is a proper English word – of the Industrial Control Systems (ICS) when we implement wireless sensor networks, virtualization, or IIoT solutions.

There are many different ICS perimeter architectures; some exist just to split management responsibilities and some add, or attempt to add, additional security. I will discuss in this blog that DCS and SCADA are really two different systems, with different architectures, but also different security requirements.

When I started to make the drawings for this blog, I quickly ended up with 16 different architectures, some bad, some good, but all of them exist (though my memory might have failed me too). The focus in this blog is on the perimeter between the office domain (OD) and the process automation network (PAN). I will briefly detail the PAN into its level 1, level 2, and level 3 segments; see also my blog on the Purdue Reference Model. Internal perimeters between level 2 and level 3 are not discussed in this blog because of the differences that exist between vendor solutions for level 1 and level 2, or when interfacing with legacy systems.

Vendors often differ in their level 2 / level 1 architecture, with different implementation rules to follow to meet vendor requirements and different network equipment. To cover these differences would almost take a second blog. So this time the focus is a bit more on IT technology than my normal focus on OT cyber security; more the data driven IT security world than the data plus automation action driven OT cyber security world.

Maybe the first question is, why do we have a perimeter? Can’t we just air-gap the ICS?

We generally have a perimeter because data needs to be exchanged between the planning processes in the office domain and the operations management / production management functions in the PAN (see also my PRM blog), and sometimes engineers want remote access into the network (see the remote access blog). When the tanker leaves Kuwait, the composition data of the crude is available and the asset owner starts its planning process: which refinery can best process the crude, what is the sulfur level in the crude, and much more. Ultimately, when the crude arrives, is stored in tanks at the terminal, and is forwarded to the refinery to produce the end product, this data is required to set some of the parameters in the automation system. Additionally, production data is required by the management and marketing departments, custody metering systems produce records of how much crude has been imported, and environmental monitoring systems collect information from the stacks and water surface to report that no environmental regulations are violated.

Only for very critical systems, such as nuclear power, have I seen fully isolated systems. Not only are the control systems isolated, but the functional safety systems also remain isolated. Though in this world too, more functions become digital and more functions are interfaced with the environment.

More common is the use of data diodes as a perimeter device in cases where one-way traffic (from PAN to OD) suffices. And in this world too we see compromises, by allowing periodic reversal of the data flow direction to update antivirus systems and security patches. But by far most systems have a perimeter based on a firewall connection between the OD and the PAN, the topic of this blog.

I start the discussion with three simple architecture examples.

Some basic perimeters, frequently used

Architecture 1, a direct connection between the OD and the PAN.

If the connection is exclusively outbound, this can be a secure solution for less critical sites. Though any form of defense in depth is missing here: if the firewall gets compromised, the attacker gets full access to the PAN. A firewall / IPS combination would be preferred. Still, some asset owners allow this architecture to pass outbound history information in the direction of the office domain.

Architecture 2, adding a demilitarized zone (DMZ).

A DMZ is added to allow the inspection of the data before it continues on its path to functions in the PAN. But if we do this we need to actually inspect this data; just forwarding it using the same protocols adds only a little security (hiding the internal network addresses of the PAN), and only if we use a different range and not just continue with private IP address ranges like the 10.10 or 192.168 range.

Alternatively, the PAN can make data available to office domain users by offering an interface to this data in the DMZ, for example a web service providing access to process data. But we had better make sure it is a read-only function, not allowing write actions to the control functions.

The theoretically ideal DMZ only allows traffic that terminates in the DMZ. For example, a function in the PAN sends data to a DMZ function, and a user or server in the OD collects this data from the DMZ, or the reverse. Unfortunately not all solutions offer this possibility; in those cases inbound data from the OD needs to continue toward a PAN function. In this situation we should make certain that we use different protocols for the traffic coming into the DMZ and the traffic going from the DMZ to the PAN function (disjoint protocols).

The reason for the disjoint protocol rule is to prevent a vulnerable service from being used by a network worm to jump from the OD into the PAN. Typical protocols where this can happen are RDP (Microsoft terminal server), SMB (e.g. file shares or print servers), RPC (e.g. RPC DCOM used by classic OPC), and HTTPS (used for data exchange).
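The disjoint protocol rule lends itself to a simple automated check over a rule set: for each path through the DMZ, the protocol on the OD side must differ from the protocol on the PAN side. The rule set below is an invented example.

```python
# Sketch of checking the disjoint-protocol rule: if the same protocol is
# used on both legs of a DMZ path, a worm exploiting that service could
# ride it end to end. The paths listed are illustrative assumptions.

dmz_paths = [
    # (protocol OD -> DMZ, protocol DMZ -> PAN)
    ("https", "opc-ua"),   # disjoint: acceptable
    ("smb",   "ftp"),      # disjoint: acceptable
    ("rdp",   "rdp"),      # same protocol through the DMZ: violation
]

violations = [pair for pair in dmz_paths if pair[0] == pair[1]]
print(violations)
# → [('rdp', 'rdp')]
```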

If the use of disjoint protocols is not available, an IPS function in the firewall becomes very important. The IPS function can intercept the network worm attempting to propagate through the firewall by inspecting the traffic for exploit patterns.

Another important consideration in architectures like this is how to handle time synchronization of the functions in the DMZ. The time server protocol (e.g. NTP) can be used for amplification attacks in an attempt to create a denial of service of a specific function. An amplification attack happens when a small message triggers a large response message; if we combine this with spoofing the source address of the sender, we can use this to attack a function within the PAN and potentially overload it. To protect against this, some firewalls offer a local time server function. In this case the firewall synchronizes with the time server in the PAN and offers a separate service in the DMZ for time synchronization. So no inbound (DMZ to PAN) NTP traffic is required, preventing the denial of service amplification attack from the DMZ toward a PAN function.

Architecture 3, adding an additional firewall.

Adding an additional firewall ensures that if the outer firewall is compromised, the attacker does not have direct access into the PAN. With two firewalls, breaching the first firewall gives access to the DMZ, but a second firewall is there to stop / delay the attacker. This moment needs to be used to detect the first breach by monitoring the traffic and functions in the DMZ for irregular behavior.

This delay / stop works best if the second firewall is of a different vendor / model. If both were from the same vendor, using the same operating software, the exploit causing the breach of the first firewall would most likely also work on the second. Having two different vendors delays the attacker more and increases the chance of detecting the attacker trying to breach the second firewall. DMZs create a very strong defense if properly implemented. If this is not possible we should look for compensating controls, but never forget that defense in depth is a key security principle. It is in general not wise to rely on just one security control to stop the attacker. And please don’t think that the PRM segmentation is enough defense in depth; there are very important operations management functions at level 3 that are authorized to perform critical actions in the production management systems at level 2 and level 1. Know your ICS functions, understand what they do and how they can fail. This is an essential element of OT cyber security; it is not just controlling network traffic.

A variation on architectures 2 and 3 is shown in the next diagram. Here we see, for architecture 4 and architecture 5, two firewalls in series (orange and red). This architecture is generally chosen when two organizations are responsible for the perimeter, for example the IT department and the plant maintenance department, each protecting their realm. Though the access rules for the inbound traffic are the same for both firewalls in architectures 4 and 5, this architecture can offer a little more resilience than architectures 2 / 3 because of the diversity added if we use two different firewalls.

Architecture 6 adds a second internal boundary between the level 3 functions (operations management) and the level 2 / level 1 functions (production management).

For architectures 1 to 5 this might have been implemented with a router between level 3 and level 2 in combination with access control lists. Firewalls can offer more functionality. Especially Next Generation Firewalls (NGFW, a strange marketing invention that seems to hold for all new firewall generations to come) offer the possibility to limit access based on user and on specific applications or proxies, allowing more granular control over access into the real-time environment of the production management functions.

Sometimes plants require Internet access, perhaps for specialized health monitoring systems of the turbine or generator, or maybe for remote access by a support organization. Preferably this is done by creating a secure path through the office domain toward the Internet, but it happens that the IT department doesn't allow this, or that there is no office domain to connect to. In those cases asset owners look for solutions using an Internet provider or a 4G connection.

Architectures with local Internet connectivity

Architecture 7 shows the less secure option, where the DMZ also hosts other functions. Architecture 8, with a separate DMZ for the Internet connection, is preferred because the remote connectivity is kept separate from any function in DMZ 1. This allows more restrictive access filters on the path toward the PAN and reduces the risk for the functions in DMZ 1. The potential risk with architecture 7 is that the firewall connecting to the Internet is breached, giving the attacker access to the functions in the DMZ, potentially compromising these and, as a next step, gaining access to the PAN. We should never directly expose the firewall of the OD perimeter to the Internet; here too, two different firewalls improve security, preferably allowing only end-to-end protected and encrypted connections.

The final four DCS architectures I discuss briefly serve more as examples of an alternative approach.

Some alternative DCS perimeters

Architecture 9 is very similar to architecture 6, but without the DMZ. The MES layer (Manufacturing Execution Systems) hosts the operations management systems. This type of architecture is often seen in organizations where the operations management systems are managed by the IT department.

This type of architecture also occurs when different disciplines “own” the system responsibility, for example a team for mechanical maintenance “owning” a vibration monitoring function, another team “owning” a power management function and the motor control center functions, and maybe a third group “owning” the laboratory management functions.

In this case there are multiple systems, each with its own connection to the corporate network. Splitting up the responsibility for security often creates inconsistencies in architecture and security operations, and as such a higher residual risk for the investment made. Sometimes putting all your eggs into one basket is better, provided we focus our attention on that single basket.

Architecture 10 is the same as architecture 9 but now with a DMZ allowing additional inspections. Architecture 11 is frequently used at larger sites with multiple plants connected to a common level 3 network. There is a single access domain that controls all data exchange with the office domain and hosts various management functions such as back-up, management of the antivirus systems, security patch management, and remote access.

There are some common operations management functions at L3 and each plant has its own production management system connected through a firewall.

Architecture 12 is similar, but the firewall is replaced by a router filtering access to the L2/L1 equipment. In smaller systems this can offer similar security, but as discussed a firewall offers more functions to control access.

An important pitfall in many of these architectures is communication using classic OPC. Thanks to the publications of Eric Byres almost 15 years ago, and the development of the OPC firewall, there is a focus on the classic OPC protocol not being firewall friendly, because of the wide dynamic port range required by RPC / DCOM for the return traffic. But often the more important security issue is the read / write authorization of the OPC server.

Several OPC server solutions enable read / write authorizations server wide, which also exposes control tags and their parameters that are not required for the automation function being implemented.

For example, a vibration management system that has write access to the control function, to signal a shutdown of the compressor on excessive vibrations, may also be permitted to access other process tags and their parameters. Filtering the traffic between the two systems then doesn't provide much extra security if we have no control over the content of that traffic.

Implementing OPC security gateway functionality offers more protection in those cases, limiting which process tags can be browsed by the OPC client, which process tag / parameter can be written to, and which can be read from.
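The idea of such a gateway can be sketched as a simple allowlist. The tag names and rules below are hypothetical; a real OPC security gateway enforces this per client, per tag and per parameter, and far more completely.

```python
# Sketch of tag-level read/write filtering, the principle behind an OPC
# security gateway. Tags and rules are invented for illustration.

ACL = {
    # tag name: operations this client may perform
    "COMPRESSOR_TRIP_CMD": {"write"},   # vibration system may signal a trip
    "VIBRATION_LEVEL": {"read"},        # and read its own measurement
}

def authorize(tag: str, operation: str) -> bool:
    """Deny by default: only explicitly listed tag/operation pairs pass."""
    return operation in ACL.get(tag, set())

assert authorize("COMPRESSOR_TRIP_CMD", "write")       # allowed
assert not authorize("COMPRESSOR_TRIP_CMD", "read")    # not granted
assert not authorize("REACTOR_SETPOINT", "write")      # unlisted tag: denied
```

Contrast this with a server-wide write authorization, where the last call would be allowed and every process tag becomes reachable for the attacker.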

Other improvements relate to OPC UA, where solutions exist that support reverse connect, so the firewall doesn't require inbound traffic rules when communication has to cross the perimeter.

So far, at a high level, some common ICS architectures when a DCS is used for the BPCS (Basic Process Control System) function. The variation in SCADA architectures is smaller; I discuss the four most common ones.

SCADA architecture

The first SCADAs were developed to centralize the control of multiple remote substations. In those days we had local control in the substation (generally with PLCs and RTUs; IEDs came later) and needed a central supervisory function to oversee the substations.

A SCADA architecture generally has a firewall connecting it with the corporate network and a firewall connecting it with the remote substations. These can be a few substations, but also hundreds of locations in the case of pipelines, where block valves are controlled to segment the pipe in case of damage. Architecture 13 is an example of such an architecture. The key characteristic here is that the substations have no firewalls. Architecture 13 might be applied when the WAN is a private network dedicated to connecting the substations / remote locations.

Substation architecture varies very much with the industry. A substation in the power grid has a different architecture than a compressor substation for a pipeline, a block valve station segmenting the pipeline, or a cluster of offshore platforms.

Architecture 14 is an architecture with a fall back control center. If one control center fails, the fall back center can take over from the primary control center. Primary is perhaps the wrong word here, because some asset owners periodically swap between the two centers to verify their operation.

The two control centers require synchronization, which is done over the direct connection between the two. How synchronization takes place depends very much on the distance between the centers; fall back control centers exist on different continents, many thousands of miles apart.

Not shown in the diagram, but often implemented, is a redundant WAN. If the primary connection fails the secondary connection takes over. Sometimes a 4G network is used, sometimes an alternative WAN provider.

Also here diversity is an important control; implementing a fall back WAN using a different technology can offer more protection and thus a lower risk.

SCADA architecture

Architecture 15 is similar to architecture 13, with the difference that there are firewalls at the substations, used when the WAN connections are not private. The complexity here is the substation firewalls in combination with the redundancy of the WAN connections. Architecture 16 adds the fall back control center.

Blogs have to end, though there is much more to explain. In some situations we need to connect both the BPCS function and the SIS function to a remote location; this creates new opportunities for an attacker if not correctly implemented.

A good OT cyber security engineer needs to think like an attacker: consider which options an attacker has and what the attack objective can be. To understand this it is important to understand the functions, because it is these functions and their configuration that can be misused in the plans of the threat actor. Think in functions; consider impact on data and impact on the automation actions. Network security is an important element, but by looking only at network security we will not secure an industrial control system.

There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 42 years of work in this industry, approximately half of the time working in engineering these automation systems, and half of the time implementing their networks and securing them.

Author: Sinclair Koelemij

OTcybersecurity web site

Power transformers and Aurora

Almost two weeks ago, I was inspired by the Wall Street Journal web site and a blog from Joe Weiss to write a blog on a power transformer that was seized by the US government while on its way to its destination: “Are Power Transformers Hackable”. The conclusion was yes, they can be tampered with, but it is very unlikely they were in the WAPA case.

Since then there have been other discussions, posts and interviews, all in my opinion very speculative and in some cases even wrong. However, there was also an excellent analysis report from SANS that provided a good analysis of the topic.

The report shows step by step that there is little (more accurately: no) evidence provided by the blogs and news media that China would have tampered with the power transformer in an attempt to attack the US power grid at a later moment in time.

That was also my conclusion, but the report provides a much more thorough and convincing series of arguments to reach it.

Something I hadn't noticed at the time, and was added by Joe Weiss in one of his blogs, is the link to the Aurora attack. I have always been very skeptical of the Aurora attack scenario, not because I think it is infeasible, but because I think it would be stopped in time and is hard to execute properly. But the proposition of an Aurora attack in combination with tampering with the power transformer is an unthinkable scenario for me.

Let’s step back to my previous blog, but add some more technical detail using the following sketch I made:

Drawing 1 – A hypothetical power transformer (with on load tap changer) control diagram

The diagram shows in the left bottom corner the power transformer with an on load tap changer to compensate for fluctuations in the grid. Such fluctuations are quite normal now that the contribution of renewable power generation sources such as solar and wind energy is increasing. To compensate for them, an automated tap changer continuously adapts the number of windings in the transformer. So it is not unlikely that the WAPA transformer would have an automated On Load Tap Changer (OLTC).

Tap changes under energized conditions, which is always the case in an automated situation, require a make-before-break mechanism to step from one tap position to the next. This makes it a very special switching mechanism with various mechanical restrictions on changing position; one of them is that a step is limited to an adjacent position. Because there is always a moment when both adjacent taps are connected, it is necessary to restrict the current, so we need to add impedance to limit it.

This transition impedance is created in the form of a resistor or reactor and consists of one or more units that bridge adjacent taps, for the purpose of transferring load from one tap to the other without interruption or an appreciable change in the load current. At the same time, they limit the circulating current for the period when both taps are in use.

See the figure on the right for a schematic view of an OLTC. There is a low voltage and a high voltage side on the transformer. The on load tap changer is at the high voltage side, the side with the most windings. In the diagram the tap changer is drawn with 18 steps, which is a typical value, though I have seen designs that allowed as many as 36 steps. A step for this type of tap changer might be 1.25% of the nominal voltage. Because we have three phases, we need three tap changers, and the steps for each of the three phases need to be made simultaneously. Because of the make-before-break principle, the tap can only move one step at a time; it needs to pass all positions on its way to its final destination. When automated on load tap changers are used, the operating position needs to be shown in the control room.

For this, each position has a transducer (not shown in the drawing) that provides a signal indicating where the tap is connected. A motor drive unit moves the tap upward or downward, but always step by step, so the maximum change per step is approximately 1.25%. Whether a step change is required is determined by the voltage regulator (see drawing 1): the voltage regulator measures the voltage (using an instrument transformer) and compares this with an operator-set reference level. Based on the difference between reference level and measured level, the tap is moved upward (decreasing the number of windings) or downward (increasing the number of windings).

To prevent jitter, caused by moving the tap upward and immediately downward again, the engineer sets a delay time between the steps. A typical value is between 15 and 30 seconds.
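The regulator behavior described above can be sketched in a few lines of Python. All values (deadband, delay, tap range) and the convention that a higher tap number raises the output voltage are illustrative assumptions, not a real vendor implementation.

```python
# Simplified automatic voltage regulator logic for an OLTC: compare the
# measured voltage with the operator's reference and move the tap at
# most ONE position per decision, as the mechanics enforce.

DEADBAND_PCT = 1.25   # tolerated deviation before acting (~ one step)
DELAY_S = 20          # typical 15-30 s wait between steps (not simulated)

def next_tap(measured: float, reference: float, tap: int,
             tap_min: int = 1, tap_max: int = 18) -> int:
    """Return the next tap position; assumes higher tap = higher voltage."""
    deviation_pct = 100.0 * (measured - reference) / reference
    if deviation_pct < -DEADBAND_PCT and tap < tap_max:
        return tap + 1   # voltage too low: raise one step
    if deviation_pct > DEADBAND_PCT and tap > tap_min:
        return tap - 1   # voltage too high: lower one step
    return tap           # within the deadband: do nothing

tap = 9
assert next_tap(230.0, 230.0, tap) == 9    # on target: no move
assert next_tap(224.0, 230.0, tap) == 10   # -2.6%: one step up, never more
assert next_tap(238.0, 230.0, tap) == 8    # +3.5%: one step down
```

Note that even a large deviation produces only a single step, and the next step follows only after the configured delay.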

Also, if an operator wants to manually jump from tap position 5 to tap position 10 (manual command mode), the software in the voltage regulator still controls this step by step by issuing consecutive raise / lower tap commands, and on the motor drive side this is mechanically enforced by the construction. On the voltage regulator unit itself, the operator can only press a raise and a lower button, also limited to a single step.

The commands can be given from the local voltage regulator unit, from the HMI station in the substation, or remotely from the SCADA in a central control center. But important for my conclusions later on is to remember it is step by step … tap position by tap position … no athletics allowed or possible.

Now let's discuss the Aurora principle. The Aurora attack scenario is based on creating a repetitive variable load on the generator, for example by disconnecting the generator from the grid and quickly connecting it again. The disconnection causes the generator to suddenly increase its rotation speed because there is no load anymore; connecting the generator to the grid again causes a sudden decrease in rotation speed. Taking the enormous mass of a generator into account, these speed changes result in high mechanical stress on the shaft and bearings, which would ultimately result in damage requiring replacement of the generator. That takes a long time, because generators too are built specifically for a plant.

When we go from full load to no load and back, this is a huge swing in the mechanical forces released on the generator, also because of the weight of the generator: you can't stop it that easily. Additionally, behind the generator we have the turbine that creates the rotation from the steam of the boilers. So it is a relatively easy mechanical calculation to understand that damage will occur when this happens. But that is when the load swings from full load to no load.

Using a tap changer to create an Aurora type of scenario doesn't work. First of all, even if we could make a maximum jump from the lowest tap to the highest tap (which is not the case, because the tap would normally be somewhere around the mid-position, and such a jump is mechanically impossible), it would be a swing of 20-25% in load.

The load variation from a single step is approximately 1.25%, and the next step is only made after the time delay; in the Aurora scenario, a reverse step of 1.25%. This is by no means sufficient to cause the kind of mechanical stress and damage that occurs after a swing from full load to zero load.
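A back-of-the-envelope comparison, using the numbers from the text (the delay value is an illustrative assumption), shows how far apart the two scenarios are:

```python
# Rough comparison of a breaker-based Aurora swing vs one OLTC tap step.
aurora_swing_pct = 100.0   # breaker attack: full load to no load
tap_step_pct = 1.25        # one tap step, per the text
delay_s = 20               # assumed delay before the next (reverse) step

ratio = aurora_swing_pct / tap_step_pct
print(ratio)  # 80.0 -> a single tap step is 80x smaller, and the
              # reverse step comes tens of seconds later, not instantly
```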

Additionally the turbine generator combination is protected by a condition monitoring function that detects various types of vibrations including mechanical vibrations of the shaft and bearings.

The transformer-caused load swing is so small that it will not cause immediate mechanical issues, and the condition monitoring equipment will either alert or shut down the turbine generator combination when it detects anomalous vibrations. The choice between alert or shutdown is a process engineering choice, but in both cases the impact is quite different from the Aurora effect caused by manipulating breakers.

Repetitive tap changes are not good for generators; that is why the time delay was introduced, to prevent additional wear. The small load changes will cause vibrations in the generator, but these vibrations are detected by the condition monitoring system, and this function will prevent damage if the vibrations exceed the limit.

Then the argument can be: but the voltage regulator could have been tampered with. True, but voltage regulators do not come from the same company that supplied the transformer. The same argument holds for the motor drive unit. And you don't have to seize a transformer to check the voltage regulator; anyone can hand-carry it.

And of course, as the SANS report noted, the transformer would need to be placed right behind the turbine generator section. WAPA is a power distribution company, not a power generation company, so that is not a likely situation either.

I read another scenario on the Internet, a scenario based on using the sensors for the attack; therefore I added them to diagram 1. All the sensors in the picture check the condition of the transformer. If they did not accurately represent the values, this might lead to undetected transformer issues and a transformer failure over time. But it would be failure at a random point in time: inconvenient and costly, but those things also happen when all sensors function. Manipulating them as part of a cyber attack to cause damage, however, I don't think is possible. At most the sensors could create a false positive or false negative alarm. So I don't see a feasible attack scenario here that works for conducting an orchestrated attack on the power grid.

In general, if we want to attack substation functions, there are a few options in diagram 1. The attack can come over the network; the SCADA is frequently connected to the corporate network or even to the Internet. A famous example is the attack on the Ukrainian power distribution. We can try penetrating the WAN; these are not always private connections, so there are opportunities here, though I have so far never seen an example of this. And we can attack the time source; time is a very important element in the control of a power system. Though time manipulation has been demonstrated, I haven't seen it used against a power installation. But none of these scenarios is related to the delivery of a Chinese transformer or a reason for intercepting it.

So based on these technical arguments I don't think the transformer can be manipulated in a way that causes an Aurora effect; too many barriers prevent this scenario. Nor do I think that tampering with the sensors would enable an attacker to cause damage in a way where the attacker determines the moment of the attack. The various built-in (and independent) safety mechanisms would also isolate the transformer before physical damage occurs.

For me it is reasonable that the US government initiates this type of inspection; if a supply chain attack by a foreign entity is considered a serious possibility, then white listing vendors and inspecting the deliverables is a requirement for the process. Whether this is a great development for global commerce, I doubt very much. But I am an OT security professional, not an economist, so this is just an opinion.

Enough written for a Saturday evening, I think this ends my blogs on power transformers.

There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.

Sinclair Koelemij

Consequence with capital C

In my previous blog I quickly touched upon the subject of Consequence, the difference between Consequence and impact, and its relationship with functionality. This blog focuses specifically on what Consequence is for OT cyber security risk.

I mentioned that a cyber attack on an OT system consists, in its core elements, of:

A threat actor carrying out a threat action by exploiting a vulnerability resulting in a Consequence that causes an impact in the production process.

In reality this is of course a simplification. The threat actor normally has to carry out multiple threat actions to make progress toward the objective of the attack, as explained by the cyber kill chain concept of Lockheed Martin. And of course the defense team will normally not behave like a sitting duck: it implements countermeasures to shield the potential vulnerabilities in the automation system, and builds barriers and safeguards to limit the severity of specific Consequences, or where possible removes the potential Consequence altogether. Apart from these preventive controls, the defense team would establish situational awareness in an attempt to catch the threat actor team as early as possible in the kill chain.

Sometimes it looks like there are simple recipes for both the threat actors and defenders of an industrial control system (ICS):

  • The threat actor team uses the existing functionality for its attack (something we call hacking), the defense team will counter this by reducing functionality available to the attacker (hardening);
  • The threat actor team adds new functionality to the existing system (we call this malware), the defense team will white list all executable code (application white listing);
  • The threat actor team adds new functionality to the existing system using new components (such as in a supply chain attack), the defense team requests white listing of the supply chain (e.g. the widely discussed US presidential order);
  • Because a wise defense team doesn't trust the perfection of its countermeasures and respects the threat actor team's ingenuity and persistence, the defense team adds some situational awareness to the defense, sniffing around in event logs and traffic streams attempting to capture the threat actors red-handed.

Unfortunately there is no such thing as perfection in cyber security, or something like zero security risk. New vulnerabilities pop up every day; countermeasures aren't perfect and can even become an instrument of the attack (e.g. the Flame attack in 2012, creating a man-in-the-middle between the Windows Update system distributing security patches and the systems of a bank in the Gulf region). New tactics, techniques and procedures (TTP) are continuously developed and new attack scenarios designed; the threat landscape is continuously changing.

To become more pro-active, the OT cyber security defense teams adopted risk as the driver, with cyber security risk expressed by the formula Cyber Security Risk = Threat x Vulnerability x Consequence.
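As a toy illustration of the formula, assume ordinal 1-5 scores. The scores and scenario are invented for illustration; real OT risk methods are considerably more elaborate (likelihood estimation, severity classes, bow-ties).

```python
# Toy scoring of Risk = Threat x Vulnerability x Consequence on a
# 1-5 ordinal scale. All values are invented for illustration.

def risk_score(threat: int, vulnerability: int, consequence: int) -> int:
    return threat * vulnerability * consequence

# A safeguard that reduces Consequence severity lowers risk even when
# threat and vulnerability stay unchanged:
before = risk_score(threat=4, vulnerability=3, consequence=5)
after = risk_score(threat=4, vulnerability=3, consequence=2)
print(before, after)  # 60 24
```

The point: Consequence multiplies through the whole formula, so reducing it is an effective lever even in a changing threat landscape.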

Threats and vulnerabilities are addressed by implementing the traditional countermeasures: a world of continuous change and vulnerability research, the world of network security.

In this world Consequence is often the forgotten variable to manage, an element far less dynamic than the world of threats and vulnerabilities, but offering a more effective and reliable risk reduction than the Threat x Vulnerability part of the formula. So this time the blog is written to honor this essential element of OT cyber security risk, our heroine of the day: Consequence.

Frequent readers of my blog know that I write long intros and tend to follow them up by sketching the historic context of my subject. Same recipe in this blog.

When the industry was confronted with the cyber threat, after more and more asset owners changed over from automation systems based on proprietary technology toward systems built on open technology, a gap in skills became apparent.

The service departments of the vendor organizations and the maintenance departments of the asset owners were not prepared for dealing with the new threat of cyber attacks. In the world of proprietary systems they were protected by the outside world's lack of knowledge of the internal workings of these systems; there was less connectivity between systems, and the number of functions available was smaller. And perhaps they were also protected because in those days it was far clearer who the adversaries were than it is today.

As a consequence, asset owners and vendors went looking for IT-schooled engineers to expand their capabilities, and IT service providers attempted to extend their business scope. IT engineers had grown their expertise since the mid-nineties; they benefited from a learning curve over time, improving their techniques for preventing hacking and infection with malicious code.

So initially the focus in OT security was on network security: a focus on the hardening (attack surface reduction / increasing resilience) of the open platforms and network equipment. However, essential security settings in the functionality of the ICS, the applications running on these platforms, were not addressed or didn't exist yet.

Most of the time the engineers were not even aware these settings existed. These security settings were set during the installation, sometimes as default settings, and were never considered again if there were no system changes. Additionally, the security engineers with a pure IT background had never configured these systems and were not familiar with their functionality. If the Microsoft and network platforms were secure, the system was considered secure.

OT cyber security became organized around network security. I (my personal opinion, certainly not my employer's 🙂, for those mixing up the personal opinion of a technical security guy and the vision of the employer he works for) compare it with the hypothalamus of the brain: the basic functions, sex and a steady heart rhythm, are implemented, but the wonders of the cerebral cortex are not yet activated. The security risk aspects of the automation system and process equipment were ignored: too difficult, and not important because so far never required. This changed after the first attack on cyber physical systems, the Stuxnet attack in 2010.

Network security focuses on the Threat x Vulnerability part of the risk formula; the Consequence part is always linked to the functional aspects of the system. Functional aspects are within the domain of the risk world.

Consequence requires us to look at an ICS as a set of functions, where the cyber attack causes a malicious functional deviation (the Consequence) which ultimately results in impact on the physical part of the system. See my previous blog for an overview picture of some impact criteria.

What makes OT security different from IT security is this functional part. An IT security guy doesn't look that much at functionality; IT security is primarily data driven. An OT security guy might have some focus on data (some data is actually confidential), but his main worry is the automation function: the sequence and timing of production activities. OT cyber security needs to consider the Consequence of a functional deviation, and think about the barriers / safeguards, or whatever name is used, that reduce the overall Consequence severity in order to lower risk.

Therefore Sarah's statement in last week's post, “Don't look at the system but look at the function”, is so relevant when looking at risk. For analyzing OT cyber security risk we need to activate the cerebral cortex part of our OT security brain. So far the historical context of the hero of this blog; let's discuss Consequence and the functions of ICS in more technical detail.

For me, from an OT security perspective, there are three loss possibilities:

  • Loss of Required Performance (LRP) – Defined as “The functions do not meet operational / design intent while in service.” Examples are: program logic was changed, ranges were modified, valve travel rates were modified, calibrations are off, etc.
  • Loss of Ability (LoA) – Defined as “The function stopped providing its intended operation.” Examples are: loss of view, loss of control, loss of ability to communicate, loss of functional safety, etc.
  • Loss of Confidentiality (LoC) – Defined as “Information or data in the system was exposed that should have been kept confidential.” Examples are: loss of intellectual property, loss of access credential confidentiality, loss of privacy data, loss of production data.
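The three categories above can be written down as a small taxonomy. The category definitions follow the text; the example mappings are my own illustrations, and the 16 sub-categories are not reproduced here.

```python
# The three top-level loss categories from the text as an enum, with a
# few example Consequences mapped onto them (mappings are illustrative).

from enum import Enum

class Loss(Enum):
    LRP = "Loss of Required Performance"  # deviates from design intent
    LOA = "Loss of Ability"               # function stops operating
    LOC = "Loss of Confidentiality"       # data or information exposed

EXAMPLES = {
    "valve travel rate modified": Loss.LRP,
    "loss of view": Loss.LOA,
    "access credentials exposed": Loss.LOC,
}

assert EXAMPLES["loss of view"] is Loss.LOA
```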

Each of these three categories has sub-categories to further refine and categorize Consequence (16 in total). Consequence is linked to functions, but functions in an OT environment are generally not provided by a single component or even a single system.

If we consider the process safety function discussed in the TRISIS blog, we see that at minimum this function depends on a field sensor function/asset, an actuator function/asset, and a logic solver function/asset, together called the safety instrumented function (SIF). Additionally there is the HMI engineering function/asset, sometimes a separate HMI operations function/asset, and the channels between these assets.

Still it is a single system with different functions, such as: the safety instrumented function (SIF), an alerting function (SIF alerts), a data exchange function, an engineering function, maintenance functions, operation functions, and some more functions at a detailed level such as diagnostics and sequence of event reporting.

Each of these functions could be targeted by the threat actor as part of the plan, some more likely than others. All these functions are very well documented; for risk we need to conduct what is called Consequence analysis, determining failure modes. A proper understanding of which functional deviations are of interest to a threat actor includes not only the ability to recognize a possible Consequence but also to distinguish how functions can fail. These failure modes are basically the 16 sub-categories of the three main categories LRP, LoA and LoC.

The threat actor team will consider which functional deviations are required for the attack and how to achieve them; the defense team should consider what is possible and how to prevent it. If the defense team of the Natanz uranium enrichment plant had conducted Consequence analysis, they would have recognized the potential for causing mechanical stress on the bearings of the centrifuges as a result of functional deviations caused by the variable frequency drives of the centrifuges. Vibration monitoring would have recognized the change in vibration pattern caused by the small repetitive changes in rotation speed and would almost certainly have raised an alert much earlier than actually happened. The damage (Consequence severity / impact) would have been smaller.

If we jump back to the TRISIS attack, we can say that the threat actor's planned functional deviation (Consequence) could have been the Loss of Ability to execute the safety instrumented function, making the automation function vulnerable if a non-safe situation occurred.

Another scenario described in the TRISIS blog is the Loss of Required Performance, where the logic function is altered in a way that could cause a flow surge when the process shutdown logic for the compressor is triggered after the change to the logic has been made.

A third Consequence discussed was the Loss of Confidentiality that occurred when the program logic was captured in preparation for the attack.

Every ICS has many functions. I extended the diagram in the PRM blog with some typical functions implemented in a new greenfield plant.

Some typical system functions in a modern plant

Quite a collection of automation functions, and this is just for perhaps a refinery or chemical plant. I leave it to the reader to look up the acronyms on the Internet, where you can also appreciate the many solutions and differences that exist. The functions for a power grid, a pipeline, a pulp and paper mill, an oil field, or an offshore platform differ significantly. Today's automation world is a little more diverse than SCADA or DCS. I hope the reader realizes that an ICS has many functions, different packages, different vendors, and many interactions and data exchanges. And this is just the automation side of it; for security risk analysis we also need to include the security and system management functions, because these too induce risk. Some institutes call this "ICS SCADA"; I need to stay friendly, but that is really a bad term. The BPCS (Basic Process Control System, to give one acronym away for free) can be a DCS or a SCADA, but an ICS is a whole lot more today than DCS or SCADA. In the last decade functionality exploded, and I guess the promise of IIoT will grow the number of functions even further.

Each of the functions in the diagram has a fixed set of deviations (Consequences) that are interesting to a threat actor. These are combined with the threat actions that can cause the deviation (the cyber security hazard), and then with the countermeasures and safeguards / barriers that protect the vulnerabilities and limit the Consequence severity, in order to estimate the cyber security risk. The result is a repository of bow-ties for the various functions.
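Conceptually, such a repository could be modeled as a collection of bow-tie records that are matched against the installed functions. The sketch below is a minimal illustration in Python; the class, field names and entries are all made up for the example, not a description of any vendor's actual tooling.

```python
from dataclasses import dataclass

# Hypothetical sketch of one bow-tie entry in such a repository;
# all names and values are illustrative.

@dataclass
class BowTie:
    function: str             # the ICS function the bow-tie belongs to
    consequence: str          # the functional deviation of interest
    threat_actions: list      # threat actions that can cause the deviation
    countermeasures: list     # controls that reduce the likelihood
    safeguards: list          # barriers that limit the Consequence severity

repository = [
    BowTie(
        function="safety integrity function (SIF)",
        consequence="Loss of Ability to Perform",
        threat_actions=["modify the logic solver program"],
        countermeasures=["keyswitch in RUN position", "network segmentation"],
        safeguards=["independent mechanical relief valve"],
    ),
]

# During a cyber security hazop the repository is confronted with the
# installed functionality: only bow-ties for functions actually present apply.
installed_functions = {"safety integrity function (SIF)", "BPCS control"}
applicable = [bt for bt in repository if bt.function in installed_functions]
```

The point of the structure is the confrontation in the last lines: the hazard knowledge lives in the repository, the plant-specific part is the set of installed functions it is matched against.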

A cyber security hazop (risk assessment) is not a workshop where a risk analyst threat models the ICS together with the asset owner; that would become a very superficial risk analysis with little value, considering the many cyber security hazards there are. Costly in time and scarce in results. A cyber security hazop is the confrontation between a repository of cyber security hazards and the installed ICS functionality, countermeasures, safeguards and barriers.

Consequence plays a key role, because Consequence severity is asset owner specific and depends on the criticality analysis of the systems / functions. Consequence severity can never be higher than the criticality of the system that creates the function. All of this is a process, called Consequence analysis, a process with a series of rules and structured checks that allow estimating cyber security risk (I am not discussing the likelihood part in this blog) and linking Consequence to mission impact.

Therefore Consequence with a capital C.

Maybe a bit much for a mid-week blog, but I have seen far too many risk assessments where Consequence was not even considered, or was expressed in terms meaningless for mitigation, such as Loss of View and Loss of Control.

Since Consequence is such an essential element in automation, we can't ignore it.

Understanding the hazards is attempting to understand the devious mind of the threat actor team. Only then can we hope to become pro-active.

There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.

Sinclair Koelemij

OT cyber security risk

Selecting subjects for a blog is always a struggle: on one side there are thousands of subjects, on the other side there is the risk of boring the reader, scaring him or her away, or even upsetting the reader. Most of my blogs are inspired by some positive or negative response to what I experience or read, the spur of the moment. Today's blog has been simmering for a while before I wanted to write it down, because in my daily occupation I am very much involved in risk assessment, and it is complex to write about it without potentially touching intellectual property of my employer.

What inspired me to write this blog is a great post from Sarah Fluchs and a LinkedIn response by Dale Peterson on my blog on Remote Access. Apart from these triggers, I was already struggling for some time with parts of the MITRE ATT&CK framework and some of the developments in the IEC 62443 standard.

The concept of risk is probably one of mankind's greatest achievements: abstracting our "gut feeling" about deciding to do something by rationalizing it in a form we can use for comparison or quantification.

In our daily life risk is often associated with some form of danger, a negative occurrence. We more often discuss the risk of being infected by Covid-19 than the risk of winning the lottery. And even if there was a risk of winning the lottery, we would quickly associate it with the negative aspects of having too much money. But from a mathematical point of view, a positive or negative consequence wouldn't make a difference.

The basic formula for risk is simple (Risk = Likelihood x Impact), but once you start using it, many complex questions soon require an answer. In cyber security we use the formula Risk = Threat x Vulnerability x Impact, with Threat x Vulnerability being a measure for likelihood. Though I will not use impact but consequence / consequence severity, and I explain the reasons and differences later.
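As a minimal illustration of the two formulations, assuming simple numeric scales (real methods use calibrated scales and risk matrices, not a bare multiplication):

```python
# Toy illustration of the two risk formulations mentioned above.
# The 1-5 scales and the plain multiplication are assumptions for
# the sketch, not a real risk method.

def risk_simple(likelihood, impact):
    # Risk = Likelihood x Impact
    return likelihood * impact

def risk_cyber(threat, vulnerability, consequence_severity):
    # Threat x Vulnerability acts as the likelihood measure
    likelihood = threat * vulnerability
    return likelihood * consequence_severity

print(risk_simple(3, 4))    # 12
print(risk_cyber(3, 2, 4))  # 24
```

The second form makes explicit that likelihood is itself a composite: a capable threat against a well protected vulnerability can score the same as a weak threat against an exposed one.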

Don't expect me to explain all my thoughts on this subject in a single blog; it is a very big subject, so it takes time, and parts I can't reveal to protect the commercial rights of my employer. In risk there are very superficial concepts and very complex concepts and rules. An additional hurdle to overcome is that you always get an answer, but it is not easy to verify the correctness of that answer. And most of all, there are hundreds of forms of risk, so we need some structure and method.

For me, risk in process automation systems has only two forms: cyber security risk and mission risk. That doesn't take away that I can split each of them when looking in more detail; for example, within cyber security risk I can focus on "cascading risk", the risk that malware propagates through the system, or I can look at the risk of unauthorized access, and many more. Mission risk I can break down into elements such as safety risk, reputation risk, risk of operational loss, risk of asset damage, etc. I explain these in more detail later.

You can also group risks to facilitate monitoring them; when we do this we build a risk register. If the risk methodology offers sufficient detail, it becomes possible to determine the contribution of a specific risk control in reducing the risk. The moment you have established a grip on the risk, it becomes possible to manage the risk and justify decisions based on risk. And you can monitor how changes in the threat landscape impact your plant's risk.

In process safety the industry does this all the time, because process safety is, in my view, a more mature discipline. In cyber security the majority of asset owners have not reached the level of maturity to work with cyber security based on risk. In those cases cyber security is established through checklists and conformance to standards specifying lists of requirements; there is a compliance focus, in the expectation that the standard will offer security by itself.

Standards and checklists do not take site and asset specific elements into account; it is easy to either overspend or underspend on cyber security when following them. But to start with, checklists and standards provide a good base. Over time organizations mature and look at cyber security in a more structured way, a way based on risk.

Okay, a long intro, but when I discuss a subject I don't like to forget the historic context, so let's discuss this briefly before we become technical.

In the introduction I mentioned that there is a relationship between an organization's maturity and the use of risk. This shouldn't suggest that risk was not considered before. My first participation in a risk assessment was approximately 15 years ago. I was asked to participate as a subject matter expert in the risk assessment workshops of an asset owner; my expertise was related to how the control systems of my employer function.

This was the first time I became acquainted with how an asset owner looked at risk. The formulas used, the method followed, the discussions: all made sense to me. I was convinced we were heading for a great result.

Unfortunately, this turned out to be a dream; the risk results didn't offer the granularity required to make decisions or to manage cyber security risk. It was almost all high risk, not enough to differentiate decisions on. Though the results failed, and the company later took the approach of creating an extensive set of policies to implement security, they didn't consider it a full failure because the process itself was valuable.

Less than a year later I was asked in the same role by another asset owner; the formulas differed a bit, but unfortunately the results didn't. Beware, this was all prior to Stuxnet, which changed the view on OT security considerably; the focus was still on preventing malware and keeping hackers out. The idea of a highly sophisticated attack aimed at the process installations was not considered: every installation is unique, that will never work, was the thinking.

But advanced attacks did work, we learned in 2010, and we have seen since that they became the standard. Modular malware has been developed that supports attackers in manipulating process automation functions.

We have seen since then: power outages, damaged blast furnaces, attempts to change the program logic in a safety controller, and several other incidents.

In parallel with these occurrences the interest in risk grew again; it became important to know what could happen and which scenarios were more likely than others. There was a drive to leave the reactive mode and become pro-active.

The methods used had matured and the objectives had been adapted. Risk for a medical insurance company has a temporal relation: they want to know how many people will have cancer in 2030 so they can estimate the medical cost and plan their premiums, driven by a quantitative approach.

Cyber security and mission risk have a different objective: to determine which risks are highest and what we can do to reduce them to a level the business can accept. So it is comparative risk, not aiming at the cost of malware infections over 5 years, but at prioritizing and justifying risk reduction controls. This allows for hybrid methods, partially qualitative and partially quantitative.

Globally I still see differences in adopting risk as a security driver: some regions have developed more toward using risk as a management instrument than others. But personally I expect that the gap will close and asset owners will embrace risk as the core driver for their cyber security management system. They will learn that communication with their senior management becomes easier, because business managers are used to working with risk. Now let's become technical.

I already mentioned in the introduction that for me cyber security risk and mission risk differ; however, we can and have to translate cyber security risk into mission risk. Plants think in terms of mission risk. But we can express how mission risk is influenced by the cyber threat and adjust the risk for it.

The how I will not address in this blog; it would be too lengthy, and it is too early in this story to explain. But I do want to take a closer look at mission risk (or rather the mission risk criteria) before discussing cyber security risk in more detail.

When discussing risk it is necessary to categorize it and formulate risk criteria. I created the following diagram with some imaginary criteria as an example.

Example of mission risk criteria

The diagram shows six different impact criteria and five different impact levels. I added some names to the scales, but even these vary in real life. The values are fully imaginary, coming from my left thumb, but reasonable depending on the industry; numbers differ very much by individual plant and by industry. Service continuity is expressed in the example as it might be for a power provider, but could as well be expressed (with different criteria and categories) to address the impact of a specific cyber security hazard on downstream or upstream plants. One thing is certain: every plant has one. Plants are managed by mission risk; they have established their risk criteria. But I hope you also see that there is no relationship with a cyber security hazard at this level; these are all business parameters.
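Such criteria could be captured in a simple lookup table, as sketched below. All level names and thresholds are as imaginary as those in the diagram; real criteria are plant and industry specific.

```python
# Imaginary mission risk criteria: a few impact criteria, each with
# five impact levels. All numbers are made up for the illustration.

impact_levels = ["negligible", "minor", "serious", "major", "catastrophic"]

criteria = {
    # thresholds are upper bounds per level, lowest level first
    "operational loss ($)":   [10_000, 100_000, 1_000_000, 10_000_000, 100_000_000],
    "service continuity (h)": [1, 4, 24, 72, 168],
}

def impact_level(criterion, value):
    """Return the impact level name for a measured value."""
    for name, threshold in zip(impact_levels, criteria[criterion]):
        if value <= threshold:
            return name
    return impact_levels[-1]  # beyond the scale: worst level

print(impact_level("service continuity (h)", 30))  # "major"
```

The useful property of writing the criteria down like this is that every consequence, once given a value on one of the business scales, lands on exactly one impact level per criterion.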

You might say that when a cyber attack results in a loss of view (you will see later that I don't use this type of loss, because it doesn't offer the detail required) there will be operational loss. Can we link this to the impact scale? Very difficult.

Over 30 years ago I was involved in an incident at a refinery: the UPS of the plant failed, and in its failure it also caused the main power to fail. This main power fed the operator stations, but not the control equipment. So suddenly the operators lost their view on the process. An incident call-out was made, and within two hours power was restored and the operator stations reloaded. The operators had full view again and, inspecting the various control loops, smiled: everything was fine. So a full loss of view, but no operational loss.

Okay, today's refineries have far more critical processes than in those days, but because of that they also have far more automation. Still, the example shows the difficulty of linking such a high level consequence (Loss of View) to an impact level. Loss of View is a consequence, not an impact as MITRE classifies it; the impact would have been some operational loss or damaged equipment.

I can tell a similar story about what MITRE calls Loss of Safety; the correct expression would be Loss of Functional Safety, because safety is a lot more than functional safety alone. Some plants have to stop when the SIS fails, other plants are allowed to run for a number of hours without the SIS. Many different situations exist, each translating differently to mission risk. So we need to make a clear distinction between impact and consequence. The MITRE model doesn't do so; it identifies "Inhibit Response Function", "Impair Process Control" and Impact, and as such mixes many terms. I think the reason is that MITRE didn't structure the model following a functional model of an ICS. This is where I value Sarah's post so much: "Think in functions, not in systems". So let's explain the term "consequence" and its relationship with function.

In principle a threat actor launches a threat action against an exposed vulnerability, which results in a deviation in functionality. In an automation system this deviation depends very much on the target. Apart from this, most targets perform multiple functions, and to make it even more complex, several functions are created by multiple assets. When this is the case, the channels connecting these assets also contribute to risk. This collaboration can even be cross-system.

An operator station, an instrument asset management system, a compressor control system, a flow transmitter: all have a different set of functions, and it is the threat actor's aim to use these in his attack. Ultimately there are two consequences for these functions: the function doesn't meet its design or operations intent, or the function is lost, so no function at all. See also my very first blog.

When the function doesn't meet its design or operations intent, the asset integrity discipline calls this "Loss of Required Performance"; when the function fully fails, this is called "Loss of Ability to Perform". Any functional deviation in a process automation system can be expressed as one of the two, and these are the high level consequences of a cyber attack.

For each asset in the ICS the detailed functionality is known; the asset owner purchases it to perform a certain function in the production process. The process engineer uses all or a part of the functions available. When these functions are missing or do other things than expected, this results in mission impact. Which impact requires further refinement of the categories and an understanding of the production process and the equipment used.

We can further sub-categorize these consequence categories (six sub-categories are defined for each), which allows us to link them to the production process (part of the plant's mission).
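The high level categories could be captured in code roughly as follows. This is only an illustration of the classification idea; the actual sub-category definitions come from the asset integrity discipline and are not spelled out here.

```python
from enum import Enum

# The high level consequence categories used above. The abbreviations
# follow the text (LRP, LoA, LoC); the sub-categories beneath each
# are deliberately omitted, as the blog does not enumerate them.

class Consequence(Enum):
    LOSS_OF_REQUIRED_PERFORMANCE = "LRP"  # function deviates from intent
    LOSS_OF_ABILITY_TO_PERFORM = "LoA"    # function fully fails
    LOSS_OF_CONFIDENTIALITY = "LoC"       # e.g. program logic captured

# A functional deviation is classified first by category, then by
# sub-category, which makes the link to mission impact possible.
deviation = Consequence.LOSS_OF_REQUIRED_PERFORMANCE
print(deviation.value)  # "LRP"
```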

This is also required; the MITRE categories "Impair Process Control" and "Modify Parameter" don't make the link to mission impact. Changing the operating window of a control loop (a design parameter) has a different impact than changing the control response of a control loop. These parameters are also protected differently and sometimes reside in different parts of the automation system. By changing the operating window the attacker can cause the production process to go outside its design limitations (e.g. too high a pressure, temperature or level), whereas if the attacker changes a control parameter, the response to changes in the process is impacted.

Depending on the process unit where this happens, mission impact differs. So a cyber attack causes a functional deviation (I ignore the confidentiality part here; see the blog mentioned earlier for that). A functional deviation is a consequence, a consequence we can give a severity score (how this is done is not part of this blog), and we can link the consequence to mission impact.

Cyber security risk is based upon the likelihood that a specific cyber security hazard occurs, in combination with the consequence severity. Mission risk resulting from the cyber security hazard is based upon the likelihood of the cyber security hazard and the mission impact caused by the consequence.

Estimating the likelihood and the risk reduction as a function of the countermeasures and safeguards is the trick here. Maybe I will discuss that in a future blog; it is by far the most interesting part of risk estimation. In today's world of risk analysis, different vendors use different names, such as "security process hazard analysis" or "cyber security HAZOP", but all do similar things: create an inventory of cyber security hazards, and estimate the inherent and residual risk based upon the assets installed, the channels (protocols) used for the communication between the assets, the architecture of the ICS, and the countermeasures and safeguards installed or advised. The difference is in how extensive and detailed the cyber security hazard repository is (the better you know the system, its functions and internal functions, the more extensive and detailed the repository).

A long story on the concept of risk driven cyber security, focusing on function, consequence and impact. Contrary to a checklist based approach, a risk driven approach provides the "why" we secure, compares the various ways to reduce the risk, and provides a base to manage and justify security based on risk.

Not that I think my blog on remote access security irresponsibly suggests that a list of controls would take away all risk, certainly not. But sometimes a rule or checklist is a quick way out, and risk a next phase. When teaching a child to cross the street, it is far easier to supply a rule (cross in a straight line) than to discuss the temporal factor that dynamic exposure creates when crossing the road over a diagonal path. Not wanting to say asset owners are children, just wanting to indicate that initially the rule works better than the method.

Of course the devil is in the detail; especially the situational awareness rule in the list requires knowledge of cyber security hazards and indicators of compromise. But following the listed rules is already a huge step forward compared to how remote access is sometimes implemented today.

There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.

Author: Sinclair Koelemij

Remote access

In times of Covid-19 the interest in remote access solutions has grown. Remote access has always been a hot topic for industrial control systems, some asset owners didn’t want any remote access, others made use of their corporate VPN solutions to create access to the control network, and some made use of the remote access facilities provided by their service providers. In this blog I will discuss a set of security controls to consider when implementing remote access.

There are multiple forms of remote access:

  • Remote access from an external organization, for example a service provider. This requires interactive access, often with privileged access to critical ICS functions;
  • Remote access from an internal organization, for example process engineers, maintenance engineers and IT personnel. Also often requires privileged access;
  • Remote operations, this provides the capability to operate the production process from a remote location. Contrary to remote access for support, this type of connection requires 24×7 availability and should not hinder the process operator in carrying out his / her task;
  • Remote monitoring, for example health monitoring of turbines and generators, well head monitoring and similar diagnostic and monitoring functions;
  • Remote monitoring of the technical infrastructure for example for network performance, or remote connectivity to a security operation center (SOC);
  • Remote updates, for example daily updates for the anti-virus signatures, updates for the IPS vaccine, or distribution of security patches.

The rules I discuss in this blog are for remote interactive access for engineering and support tasks: a guy or girl accessing the control system from a remote location to do some work.

In this discussion I consider a remote connection to be a connection with a network with a different level of "trust". I put "trust" between quotes because I don't want to enter into all kinds of formal discussions on what trust is in security, or perhaps even whether we should allow trust in our lives as security professionals.

RULE 1 – (Cascading risk) Enforce disjoint protocols when passing a DMZ.

The diagram shows a very common architecture, but unfortunately also one with very high risk, because it allows inbound access into the DMZ toward the terminal server and outbound access from the terminal server to the ICS function using the same protocol.

Recently we have had several RDP vulnerabilities that were "wormable", meaning a network worm can use the vulnerability to propagate through the network. This allows a network worm that infects the corporate network to reach the terminal server and from there infect the control network, which is high risk in times of ransomware spreading as network worms.

This is a very bad practice and should be avoided! Not enforcing disjoint protocols increases what is called cascading risk: the risk that a security breach propagates through the network.
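The rule can be illustrated with a trivial check: the set of protocols allowed into the DMZ and the set allowed out of it toward the control network should not overlap. A sketch, with made-up rule sets:

```python
# Toy check of RULE 1: the protocol entering the DMZ and the protocol
# leaving it toward the control network must be disjoint. The protocol
# names in the examples are illustrative.

def protocols_disjoint(inbound_protocols, outbound_protocols):
    """True if no protocol is allowed both into and out of the DMZ."""
    return not (set(inbound_protocols) & set(outbound_protocols))

# Bad practice: RDP passes straight through via the terminal server.
print(protocols_disjoint({"RDP"}, {"RDP"}))                # False

# Better: RDP terminates in the DMZ; a different protocol continues.
print(protocols_disjoint({"RDP"}, {"vendor-proprietary"}))  # True
```

The point is that a worm exploiting the inbound protocol cannot ride the same protocol onward: it would need a second, independent vulnerability in the outbound protocol to cross the DMZ.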

RULE 2 – (Risk of hacking) Prevent inbound connections. Architectures where the request is initiated from the "not-trusted" side, like in the terminal server example above, require an inbound firewall port to be open to facilitate the traffic, for example the TCP port for the RDP protocol in the diagram above.

Solutions using a polling mechanism, where a function on the trusted side polls a function on the not-trusted side for access requests, offer a smaller attack surface because the response channel makes use of the firewall's stateful behavior: the port is only temporarily open, for just this specific session.

Inbound connections expose the service that is used; if there is a vulnerability in this service, it might be exploited to gain access. Prevent such connections coming from the Internet at all times. Direct Internet connectivity requires very good protection; an inbound connection for remote access carries a high risk of compromise.

So this is also a very bad practice, unfortunately one I came across several times, because some vendor support organizations use such connectivity.

RULE 3 – (Risk of exposed login credentials) Enforce two-factor authentication. The risk that access credentials are revealed through phishing attacks is relatively big. Two-factor authentication adds the requirement that, apart from knowing the credentials (login / password), the threat actor also needs access to a physical token generator for login.

This raises the bar for a threat actor. Personally I have the most trust in a physical token like a key fob that generates a code. Alternatives are tokens installed in the PC, either as software or as a USB device.

RULE 4 – (Risk of unauthorized access) Enforce an explicit approval mechanism where someone on the trusted side of the connection explicitly needs to "enable" remote access. Typically, after a request over the phone, either the process operator / supervisor or a maintenance engineer needs to approve access before the connection can be established.

Multiple solutions exist: some solutions have this feature built-in, sometimes an Ethernet switch is used, and there are even firewalls where a digital input signal can alter the access rule.

Sometimes implicit approval seems difficult to prevent, for example for access to sub-sea installations, installations in remote areas, or unmanned installations. But also in these situations, implementing explicit approval is often possible with some clever engineering.

RULE 5 – (Risk of prohibited traffic) Don't allow end-to-end tunneled connections between a server in the control network and a server at the not-trusted side (either the corporate network or a service provider network).

Encrypted tunnels prevent the firewall from inspecting the traffic, and so bypass more detailed checks. The best practice is to break the tunnel, have the traffic inspected by the firewall, and re-establish the tunnel to the node on the trusted side of the network. Where to break the tunnel is often a discussion; my preference is to break it in the DMZ. Tunneled traffic might contain clear text communication, so we need to be careful where we expose this traffic when we open the tunnel.

RULE 6 – (Risk of unauthorized activity) Enforce, for connectivity with external users such as service providers, a supervision function where someone on the trusted side can see what the remote user does and intervene when required.

Systems exist that not only supervise the activity visually, but also log all activity, allowing it to be replayed later in time.

RULE 7 – (Risk of unauthorized activity) Make certain there is an easy method to disconnect the connection. A "supervisor" on the trusted side (inside) of the connection must be able to disconnect the remote user. But prevent that this can be done accidentally, because if the remote user is doing some critical activity, his control over that activity shouldn't suddenly be lost.

RULE 8 – (Risk of unauthorized activity) Restrict remote access to automation system nodes as much as possible. Remote access to a safety engineering function might not be a good idea, so prevent this where possible. In my view this should be prevented with a technical control; an administrative policy is fine, but an attacker generally ignores it.

RULE 9 – (Risk of unauthorized access) Restrict remote access to a limited time. A remote access session should be granted for a controlled length of time. The duration of the session needs to match the time requirement of the task; nevertheless, there should be an explicit end time that is reasonable.
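A sketch of such a time-boxed session, with illustrative names; a real implementation would of course live in the access gateway, not in application code:

```python
from datetime import datetime, timedelta

# Sketch of RULE 9: a session is granted with an explicit end time,
# and every access is re-checked against it. Names are illustrative.

class RemoteSession:
    def __init__(self, user, duration_hours):
        self.user = user
        self.expires = datetime.now() + timedelta(hours=duration_hours)

    def is_active(self):
        # Access is denied the moment the granted window has passed.
        return datetime.now() < self.expires

session = RemoteSession("vendor-engineer", duration_hours=4)
print(session.is_active())  # True immediately after granting
```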

RULE 10 – (Risk of exposing confidential data) Enforce the use of secure protocols for remote access; login credentials should never pass the network in clear text. So, for example, don't use protocols such as Telnet for accessing network devices; use the SSH protocol, which encrypts the traffic, instead.

In principle, all traffic that passes networks outside the trusted zones should be encrypted and end-to-end authenticated. Using certificates is a good option, but it had better be a certificate specific to your plant and not a globally used certificate.

In times of state sponsored attackers, the opponent might have the same remote access solution installed and inspected in detail.

RULE 11 – (Risk of exposing confidential data) Don't reveal login credentials of automation system functions to external personnel from service providers. Employees of service providers generally have to support tens or hundreds of installations. They can't memorize all these different access credentials, so mechanisms are quickly adopted to store them, varying from paper to Excel spreadsheets and password managers. Prevent a compromise of this information from compromising your system. Be aware that changing passwords of services is not always an easy task, so control this cyber security hazard.

A better approach might be to manage the access credentials for power users, including external personnel, using password managers that support login as a proxy function. In these systems the user only needs to know his personal login credentials, and the proxy function will use the actual credentials in the background. This has several security advantages:

  • Site specific access credentials are not revealed; if access is no longer required, disabling / removing access to the proxy function is sufficient to block access, without the system's access credentials ever having been compromised.
  • Enforcing access through such a proxy function blocks the possibility of hopping between servers, because the user does not know the actual password. (This does require enforcing certain access restrictions for users in general.)

Also consider separating the management of login credentials for external users from that for internal users (users with a login on the trusted side). You might want to differentiate between what a user is allowed to do remotely and what he can do on site. Best to enforce this with technical controls.

RULE 12 – (Risk of unauthorized activity) Enforce least privilege for remote activities. Where possible, only provide view-only capabilities. This often requires a collaborative session where a remote engineer guides a local engineer to execute the actions; however, it reduces the possibilities for an unauthorized connection to be used for malicious activity.

RULE 13 – (Risk of unauthorized activity) Manage access into the control network from the secure, trusted side. Managing access control and authorizations from the outside of the network is like putting your door key under the doormat outside. Even if the task is done remotely, the management systems should be on the inside.

RULE 14 – (Risk of unauthorized activity) Detection. Efi Kaufman addressed a very valid point in a response.

We need to build our situational awareness. The whole idea of creating a DMZ is to have a zone where we can do some detailed checks before we let the traffic in. In RULE 5 I already mentioned breaking the tunnel open so we can inspect it, but there is of course lots more to inspect. If we don't apply detection mechanisms, we have an exclusive focus on prevention, assuming everyone behaves fine once we open the door.

This is unfortunately not true, so a detection mechanism is required to check that nothing suspicious happens. Exclusively focusing on prevention is a common trap, and I fell into it!

Robin, a former pen tester, pointed out that it is important to monitor the antivirus function. As a pen tester he was able to compromise a system because the antivirus alerts triggered by his payload were not monitored, giving him all the time he needed to investigate and modify the payload until it worked.

RULE 15 – (Risk of hacking, cascading risk) Patching, patching, patching. There is little excuse not to patch remote access systems, or systems in the DMZ in general. Exposure demands that we patch.

Beware that these systems can have connectivity with the critical systems, and their users might be logged in with powerful privileges, privileges that could be misused by an attacker. This makes patching all the more important.

RULE 16 – (Risk of malware infection, cascading risk) Keep your AV up to date. Now that we are taking on some security operations tasks, we had better make sure our AV signatures are up to date.

RULE 17 – (Risk of unauthorized activity) Robin also addressed in a response the need to enforce that the remote user logs the purpose of each remote access request. This makes it easier to identify whether processes are followed and whether people are abusing privileges, for example logging in for 8 hours a day for a month instead of coming to site.
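As an illustration of this control, here is a minimal sketch that flags users whose access pattern suggests permanent remote work rather than occasional support. The log format (user, date, hours-connected records) and the thresholds are assumptions for illustration only.

```python
# Hypothetical sketch for RULE 17: flag remote users whose session
# pattern suggests full-time remote work instead of coming to site.
# The (user, date, hours) record layout is an invented example format.
from collections import defaultdict

def flag_suspicious_users(sessions, min_days=20, min_hours_per_day=7.0):
    """sessions: iterable of (user, date, hours_connected) tuples."""
    long_days = defaultdict(set)
    for user, date, hours in sessions:
        if hours >= min_hours_per_day:
            long_days[user].add(date)
    # Users with 20+ near-full working days of remote access in the period
    return sorted(u for u, days in long_days.items() if len(days) >= min_days)

sessions = [("eng1", f"2020-04-{d:02d}", 8.0) for d in range(1, 25)]
sessions += [("eng2", "2020-04-03", 1.5), ("eng2", "2020-04-10", 0.5)]
print(flag_suspicious_users(sessions))  # ['eng1']
```

A real implementation would of course read the remote access gateway’s own logs; the point is only that checking the logged purpose against actual usage can be automated.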

Seventeen simple rules to consider when implementing remote access, which came to mind while typing this blog.

If I missed important controls, please let me know and I will add them.

Use these rules when considering how to establish remote connectivity for support-type tasks. Risk appetite differs, so engineers might select only some rules and accept a somewhat higher risk.

But Covid-19 should not lead to an increased risk of cyber incidents through solutions that increase the exposure of the automation systems in an irresponsible manner.

There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.

Author: Sinclair Koelemij

Thanks to CISO CLUB, a Russian version of this blog is available.

Are Power Transformers hackable?

There is revived attention for supply chain attacks after the seizure of a Chinese transformer in the port of Houston. While on its way to a US power transmission company – the Western Area Power Administration (WAPA) – the 226-ton transformer was rerouted to Sandia National Laboratories in Albuquerque, New Mexico, for inspection for possible malicious implants.

The sudden inspection happened more or less at the same time that the US government issued a presidential directive aiming to whitelist the vendors allowed to supply solutions for the US power grid, and to exclude others from doing so. So my curiosity was raised, additionally triggered by the Wall Street Journal’s claim that transformers do not contain software-based control systems and are passive devices. Is this really true in 2020? So the question is: are power transformers “hackable”, or must we see the inspection exclusively as a step in increasing trade restrictions?

Before looking into potential cyber security hazards related to the transformer, let’s first look at some history of supply chain “attacks” relevant to industrial control systems (ICS). I focus here on supply chain attacks using hardware products, because in the area of software products Trojan horses are quite common.

Many supply chain attacks in the industry are based on having purchased counterfeit products. These frequently result in dangerous situations, but are generally driven by economic motives and not so much by a malicious intent to damage the production installation. Some examples of counterfeits are:

  • Transmitters – We have seen counterfeit transmitters that didn’t qualify for the intrinsic safety zone qualifications specified by the genuine product sheet, creating a dangerous potential for explosions when such products are installed in zone 1 and zone 0 areas where explosive gases may be present.
  • Valves – We have seen counterfeit valves whose mechanical specifications didn’t meet the spec sheet of the genuine product. This might lead to the rupture of the valve, resulting in a loss of containment with potentially dangerous consequences.
  • Network equipment – On the electronic front we have seen counterfeit Cisco network equipment that could be used to create a potential backdoor in the network.

However it seems that the “attack” here is more an exploit of the asset owner’s weakness for low prices (even if they sound ridiculously low), in combination with highly motivated companies trying to earn some fast money, than an intentional and targeted attack on the asset integrity of an installation.

That companies selling these products are often found in Asia, with China as the absolute leader according to reports, is probably explained by a different view of patents, standards and intellectual property in a fast-growing region, in addition to China’s economic size – not necessarily a plot against an arrogant Western enemy.

The most spectacular example of such an incident is the counterfeit Cisco equipment that ended up in US military networks. But as far as I know, it was never shown in this case either that the equipment’s functionality was maliciously altered. The main problem was a higher failure rate caused by low manufacturing standards, potentially impacting the networks’ reliability. Nevertheless this too was a security incident, because of the potential for malicious functionality.

Proven malicious incidents have also occurred, for instance in China, where computer equipment was sold with malware pre-installed – malware not detectable by antivirus engines. So the option to attack industrial control systems through the supply chain certainly exists but, as far as I am aware, has never succeeded.

But there is always the potential that functionality is maliciously altered, so we need to see the above incidents as security breaches and consider them a serious cyber security hazard we need to address. Additionally, power transformers are quite different from the hardware discussed above, so a supply chain attack on the US power grid using power transformers requires a different analysis. If it happened and was detected, it would mean the end of business for the supplier, so the stakes are high and the chances that it happens are low. Let’s look now at the case of the power transformer.

For many people, a transformer might not look like an intelligent device. But in today’s OT world everything becomes smart (not excluding the possibility that we ourselves are the exception), so we also have smart power transformers today. Partially surfing on the waves of the smart hype, but also adding new functions that can be targeted.

Of course I have no information on the specifications of the WAPA transformer, but it is a new transformer, so it probably makes use of today’s technology. Seizing a transformer is not a small thing – transformers used in the power transmission world are designed for 345 kilovolts or more and can weigh as much as 410 tons (820,000 lb in the non-metric world) – so there must be a good reason to do so.

One of the reasons is of course that it is very critical and expensive equipment (it can cost $5,000,000+) and is built specifically for the asset owner. If it failed and was damaged, replacement would take a long time. So this equipment must not only be secure, but also very reliable, and thus worth an inspection from different viewpoints.

What would be the possibilities for an attacker to use such a huge lump of metal for executing a devious attack on a power grid? Is it really possible, are there scenarios to do so?

Since there are many different types of power transformers, I need to make a choice and decided to focus on what are called conservator transformers; these transformers have some special features and require some active control to operate. Looking at OT security from a risk perspective, I am more interested in whether a feasible attack scenario exists – are there exposed vulnerabilities to attack, what would be the threat action – than in a specific technical vulnerability in the equipment or software that makes it happen today. To get a picture of what a modern power transformer looks like, you can play with the following demo (demo).

Look for instance at the Settings tab and select the tap position table, from where we can control, or at minimum monitor, the on-load tap changer (OLTC). Tap changers select variable turn ratios to either increase or decrease the turn ratio, regulating the output voltage against variations on the input side. Another interesting selection appears when selecting the Home icon, leading you directly to the Buchholz safety relay. Also look at the interface protocol GOOSE; I would say it all looks very smart.

I hope everyone realizes from this little web demo that what is frequently called a big lump of dumb metal might actually be very smart, containing a lot more than a few sensors to measure temperature and level as the Wall Street Journal suggests. Like I said, I don’t know WAPA’s specification, so maybe they really ordered a big lump of dumb metal, but typically when buying new equipment companies look ahead and adopt the new technologies available.

Let’s look in a bit more detail at the components of the conservator power transformer. Being a safety function, the Buchholz relay is always a good place to start if we want to break something. The relay is trying to prevent something bad from happening; what is this, how does the relay counter it, and can we interfere?

A power transformer is filled with insulating oil to insulate and serve as a coolant between the windings. The Buchholz relay connects the overhead conservator (a tank with insulating oil) and the main oil tank of the transformer body. If a transformer fails or is overloaded, this causes extra heat; heated insulating oil forms gas, and the trapped gas presses the insulating oil level further down (driving it into the conservator tank past the Buchholz relay function), reducing the insulation between the windings. The lower level could cause an arc, speeding up the process and causing more gas pressure, pressing the insulating oil even further away and exposing the windings.

It is the Buchholz relay’s task to detect this and operate a circuit breaker to isolate the transformer before the fault causes additional damage. If the relay doesn’t do its task quickly enough, the transformer windings might be damaged, causing a long outage for repair. In principle, Buchholz relays as I know them are mechanical devices working with float switches to initiate an alarm and the action. So I assume there is not much to tamper with from a cyber point of view.
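To make the relay’s two-stage behaviour concrete, here is a toy sketch; the thresholds are invented for illustration and are not a real relay specification. Slow gas accumulation raises an alarm, while a sudden oil surge trips the breaker:

```python
# Toy illustration (invented thresholds) of a Buchholz relay's two stages:
# slow gas accumulation -> alarm; sudden oil surge toward the conservator
# tank -> trip the circuit breaker to isolate the transformer.
def buchholz(gas_volume_ml, oil_flow_m_per_s):
    trip = gas_volume_ml > 400 or oil_flow_m_per_s > 1.0
    alarm = trip or gas_volume_ml > 150
    return alarm, trip

print(buchholz(50, 0.1))    # (False, False) healthy transformer
print(buchholz(200, 0.1))   # (True, False)  gas accumulating -> alarm
print(buchholz(250, 1.5))   # (True, True)   oil surge -> trip breaker
```

The point of the sketch is that the real device implements this with float switches, not software, which is exactly why there is so little to tamper with remotely.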

How about the tap changer? This looks more promising, specifically an on-load tap changer (OLTC). There are various interesting scenarios here: can we make step changes that impact the grid? When two or more power transformers work in parallel, can we create out-of-step situations between the different phases by causing differences in operation time?

An essential requirement for all methods of tap changing under load is that circuit continuity must be maintained throughout the tap stepping operation. So we need a make-before-break principle of operation, which means that, at least momentarily, a connection is made simultaneously to two adjacent taps on the transformer. This results in a circulating current between these two taps. To limit this current, an impedance in the form of either resistance or inductive reactance is required; if not limited, the circulating current would be a short-circuit between taps. Thus time also plays a role. The voltage change between taps is a design characteristic of the transformer and is normally small, approximately 1.25% of the nominal voltage. So if we want to do something bad, we need to make a bigger step than expected. The range seems to be somewhere between +2% and -16% in 18 steps, so quite a jump is possible if we can increase the step size.
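The step figures above can be checked with a bit of arithmetic. A sketch, using an illustrative 345 kV rating and the +2% / -16% range in 18 steps mentioned in the text (the values are illustrative, not from any datasheet):

```python
# Back-of-the-envelope check of the tap-step figures in the text:
# a +2% to -16% regulation range covered in 18 steps gives the
# per-step voltage change. All numbers are illustrative.
v_nominal = 345_000.0            # volts, a typical transmission rating
range_high, range_low = 0.02, -0.16
steps = 18

step_pct = (range_high - range_low) / steps    # fraction of nominal per step
print(f"step size: {step_pct * 100:.2f}% of nominal")      # 1.00%
print(f"step size: {step_pct * v_nominal / 1000:.1f} kV")  # ~3.4 kV

# A tampered controller jumping several taps at once would apply a far
# larger voltage change than the grid expects from one tap operation:
print(f"5-tap jump: {5 * step_pct * 100:.1f}% "
      f"({5 * step_pct * v_nominal / 1000:.1f} kV)")
```

This roughly matches the ~1.25% per step quoted above, and shows why forcing multi-tap jumps, or enlarging the effective step, is the interesting attack primitive.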

To make it a bit more complex, a transformer can be designed with two tap changers, one for in-phase changes and one for out-of-phase changes; this also might provide us with some options to cause trouble.

So plenty of ingredients seem to be available: we need to do things in a certain sequence, we need to do them within a certain time, and we need to limit the change to prevent voltage disturbances. Step changers use a motor drive, and motor drives are controlled by motor drive units, so it looks like we have an OT function – again a bigger attack surface than a few sensors and a lump of metal would provide. And then of course we saw GOOSE in the demo, a protocol with known issues, and we have the IEDs that control all this and provide protection: a wider scope to investigate and secure, though not part of the power transformer itself.

Is this all going to happen? I don’t think so. The Chinese company making the transformers is a business, and a very big one; if it were caught tampering with the power transformers, that would be bad for business. Could they intentionally leave some vulnerabilities in the system? Theoretically yes, but since multiple parties are involved (the delivery also contains non-Chinese parts) it is not likely to happen. But I have seen enough food for a more detailed analysis and inspection to find it very acceptable that power transformers, too, are assessed for their OT security posture when used in critical infrastructure.

So on the question whether power transformers are hackable, my vote would be yes. On the question whether Sandia will find any malicious tampering, my vote would be no. Good to run an inspection, but bad to create so much fuss around it.

There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.

Sinclair Koelemij

The Purdue Reference Model outdated or up-to-date?

Is the Purdue Reference Model (PRM) outmoded? If I listened to the Industrial Internet of Things (IIoT) community, I would almost believe so. IIoT requires connectivity to the raw data of the field devices, and our present architectures don’t offer this in an easy and secure way. So let’s look in a bit more detail at the PRM and the IIoT requirements, to see if they really conflict or can coexist side by side.

I start the discussion with the Purdue Reference Model, developed in the late 80s. It was developed in a time when big data was anything above 256 MB and the Internet was still in its early years, a network primarily used by and for university students. PRM’s main objective was creating a hierarchical model for manufacturing automation, what was called computer integrated manufacturing (CIM) in those days. If we look at the model as it was initially published, we note a few things:

  • The model has nothing to do with a network, or security. There are no firewalls, there is no DMZ.
  • There is an operator console at level 1. Maybe a surprise for people who started to work in automation in the last 20 years, but in those days quite normal.
  • Level 4 has been split in a level 4A and a level 4B, to segment functions that directly interact with the automation layers and functions requiring no direct interface.
  • There is no level 0; the Process layer is what it is: a bunch of pipes, pumps, vessels and some clever stuff.

Almost a decade later we had ANSI/ISA-95, which took the work of Purdue University, extended it with some new ideas, and created an international standard that became very influential for ICS design. It was the ANSI/ISA-95 standard that named the Process level, level 0. But in those days level 0 was still the physical production equipment. The ISA-95 standard says the following about level 2, level 1, and level 0: “Level 0 indicates the process, usually the manufacturing or production process. Level 1 indicates manual sensing, sensors, and actuators used to monitor and manipulate the process. Level 2 indicates the control activities, either manual or automated, that keeps the process stable or under control.” So level 0 is the physical production equipment, and level 1 includes the sensors and actuators.

It was many years later that people started using level 0 for the field equipment and their networks – all new developments with the rise of Foundation Fieldbus, Profibus, and the HART protocol, but never part of the standard. It was probably the ISA 99 standard that introduced a new layer between level 4 and level 3, the DMZ. It was the vendor community that started giving it a level name, level 3.5. But level 3.5 has never been a functional level; it was an interface between two trust zones for adding security. Often the way it was implemented contributed little to security, but it was a nice feeling to say we have a demilitarized zone between the corporate network and the control network.

So far the history of the Purdue Reference Model and ISA-95 and their contribution to the levels. Now let’s have a look at how a typical industrial control system (ICS) looks, without immediately using names for levels.

From a functional viewpoint we can split a traditional ICS architecture in 5 main parts:

  • Production management systems, whose task it is to control the automation functions. This is the typical domain of the DCS and SCADA systems, but compressor control (CCS), power management systems (PMS) and the motor control center (MCC) also reside here – basically everything that takes care of the direct control of the production process. And of course all these systems interact with each other using a wide range of protocols, many of them with security shortcomings;
  • Operations management systems, whose task it is to optimize the production process, but also to monitor and diagnose the process equipment (for example machine monitoring functions such as vibration monitoring), and various management functions such as asset management, accounting and reconciliation functions to monitor the mass balance of process units, and emission monitoring systems to meet environmental regulations;
  • Information systems are the third category. These systems collect raw data and transform it into information to be used for historical trends or to feed other systems. The objective here is to transform the raw data into information and ultimately information into knowledge. The data samples stored are normally one-minute snapshots, hourly averages, shift averages, daily averages, etc. Another example of an information system is a custody metering system (CMS) for measuring product transfer for custody purposes.
  • The last domain of the ICS is the Application domain, for example for advanced process control (APC) applications, traditionally performing advisory functions. But over time the complexity of running a production process grew and response to changes in the process became more important, so advisory functions took over tasks from the process operator, immediately changing the setpoints using setpoint control or controlling the valve with direct digital control functions. There are plants today where, if APC failed, production would no longer be profitable or couldn’t reach the required product quality levels.
  • Finally there is the Business domain, generally not part of the ICS. In the business domain we manage, for example, the feedstock and the products to produce. It is the decision domain.
Functional model of a chemical plant or refinery
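The information-system role described above, rolling raw one-minute snapshots up into hourly averages for the historian, can be sketched as follows (the data layout is an assumption for illustration):

```python
# Sketch of the information-system role: turning raw one-minute
# snapshots into hourly averages, the first step from data toward
# information. The (minute, value) layout is an invented example.
def hourly_averages(samples):
    """samples: list of (minute_index, value); returns {hour: average}."""
    buckets = {}
    for minute, value in samples:
        buckets.setdefault(minute // 60, []).append(value)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}

raw = [(m, 100.0 + (m % 2)) for m in range(120)]  # two hours of 1-min data
print(hourly_averages(raw))  # {0: 100.5, 1: 100.5}
```

Shift and daily averages are just further roll-ups of the same kind.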

The production management systems are shown in light blue; the operations management, information management, and application systems in light purple. The model seems to comply with the models of the ANSI/ISA-95 standard. However, this model is far from perfect, because operations management functions such as vibration monitoring or corrosion monitoring also have sensors that connect to the physical system, and an asset management system requires information from the field equipment. A metering system, also part of the information function, is actually a combination of a physical system and sensors.

And then of course we have all the issues around vendor support obligations. Asset owners expect systems with a very high level of reliability; vendors have to offer this level of reliability in a challenging real-time environment by testing and validating their systems and changes rigorously. Then there is nothing worse than suddenly having to integrate with the unknown, a 3rd-party solution. As a result we see systems that have a level 2 function connected to level 3, in principle exposing a critical system more and making the functionality rely on more components, thus reducing reliability.

So there are many conflicts in the model, yet it has been used for over 20 years and still helps guide us in securing systems today. If we take the same model and add some basic access security controls, we get the following:

At a high level this represents the architecture in use today. In smaller systems the firewall between level 2 and 3 might be missing, and some asset owners might prefer a DMZ with two firewalls (from different vendors to enforce diversity). Level 4A hosts those functions that interface with the control network, and generally different policies are enforced for these systems than required at level 4B. For example, asset owners might enforce the use of desktop clients, limit Internet access for these clients, restrict email to internal email only, etc. All to reduce the chance of compromising ICS security.

Although the reference model does not support all of the new functions we have in plants today – for example I didn’t discuss wireless, virtualization, and system integration topics at level 2 – it still helps structure the ICS. Is it impossible to add the IIoT function to this model? Let’s have a look at what IIoT requires.

In an IIoT environment we have a three level architecture:

  • The edge tier – The edge tier is, for our discussion, the most important; this is where the interface with the sensors and actuators is created. Sometimes existing sensors / actuators, but also new ones.
  • The platform tier – The platform tier processes and transforms the data and controls the direction of the data flows, so from that point of view it is also of importance for the security of the ICS.
  • The enterprise tier – This tier hosts the application and business logic that needs to provide the value. It is external to the ICS, so as long as we have a secure end-to-end connection between the platform tier and the enterprise tier, ICS security needs to trust that the systems and data in this tier are properly secured.

The question is: can we reconcile these two architectures? The answer seems to be what is called the gateway-mediated edge. This is a device that aggregates and controls the data flows from all the sensors and actuators, providing a single point of connection to the enterprise tier. This type of device also performs the necessary translation for communicating with the different protocols used in ICS. The benefits of such a gateway device are the scalability of the solution, together with control of the entry and exit of the data flows from a single controllable point. In this model some of the data processing is kept local, close to the edge, allowing controlled interfaces with different data sources such as Foundation Fieldbus, Profibus, and HART-enabled field equipment. Implementing such an interface device doesn’t require changing the Purdue reference model in my opinion; it can make use of the same functional architecture. Additionally, new developments such as APL (Advanced Physical Layer) technology, the technology that will remove the I/O layer as we know it today, will fully align with this concept.
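A minimal sketch of the gateway-mediated edge idea: protocol-specific readers feed one gateway, which forms the single controllable exit point toward the platform tier. The tag names, readers, and publish target are invented for illustration:

```python
# Minimal sketch of a gateway-mediated edge: protocol-specific readers
# feed one gateway that is the single, controllable exit point to the
# platform tier. Tags, readers and the publish target are illustrative.
from typing import Callable, Dict

class EdgeGateway:
    def __init__(self, publish: Callable[[dict], None]):
        self.readers: Dict[str, Callable[[], float]] = {}
        self.publish = publish          # the ONE outbound data path

    def register(self, tag: str, reader: Callable[[], float]) -> None:
        self.readers[tag] = reader      # e.g. a HART or Profibus read

    def poll_and_forward(self) -> dict:
        snapshot = {tag: read() for tag, read in self.readers.items()}
        self.publish(snapshot)          # single point to inspect/control
        return snapshot

outbound = []
gw = EdgeGateway(publish=outbound.append)
gw.register("TI-101", lambda: 74.2)     # temperature reading, stubbed
gw.register("PI-205", lambda: 12.8)     # pressure reading, stubbed
gw.poll_and_forward()
print(outbound)  # [{'TI-101': 74.2, 'PI-205': 12.8}]
```

Because everything leaves through `publish`, that single path is where inspection, rate limiting, and direction control can be enforced – the property that makes this architecture reconcilable with the PRM boundaries.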

So I am far from convinced that we need a new model; even today the model doesn’t reflect all the data flows required by today’s functions. The very clear boundaries we used to have disappeared with the introduction of new technologies such as virtualization and wireless sensor networks. The centralized process computer systems of the 70s became decentralized over the last 30 years and are now moving back into centralized data centers where hardware and software have loose ties. Today these data centers are primarily on site, but over time confidence will grow and larger chunks of the system will move to the cloud.

Despite all these technology changes, the hierarchical structure of the functions hasn’t changed that much. It is the physical location of the functions that changes, undoubtedly demanding a different approach to cyber security. It is the cyber security standards of today that are dated on the day they are released. The PRM was never about security or networking; it is a hierarchical functional model for the relationship between functions, which is as relevant today as it was 30 years ago.

We have a world of functional layers, and between these layers data is exchanged, so we have boundaries to protect. As long as we have bi-directional control over the data flows between the functional layers and keep control over where the data resides, we protect the overall function. If there is something we need to change, it is the security standards, not the Purdue reference model. But we need to be careful with the security of these gateways, as the recent OSI PI security issue has taught us.

There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.

Sinclair Koelemij

TRISIS revisited

For this blog I’d like to go back in time, to 2017, the year of the TRISIS attack against a chemical plant in Saudi Arabia. I don’t want to discuss the method of the attack (this has been done excellently in several reports) but want to focus on the potential consequences of the attack, because I noticed that the actual threat is underestimated by many.

The subject of this blog was triggered by reading Joe Weiss’ blog on the US presidential executive order and noticing that some claims made in the blog are, in my opinion, incorrect. After checking what the Dragos report on TRISIS wrote on the subject, and noticing a similar underestimation of the dangers, I decided to write this blog. Let’s start by summing up some statements made in Joe’s blog and the Dragos report that I would like to challenge.

I start with quoting the part of Joe’s blog that begins with the sentence: “However, there has been little written about the DCS that would also have to be compromised. Compromising the process sensors feeding the SIS and BPCS could have led to the same goal without the complexity of compromising both the BPCS and SIS controller logic and the operator displays.” The color highlights are mine, to emphasize the part I would like to discuss.

The sentence seems to suggest (“also have to be compromised”) that the attacker would ultimately also have to attack the BPCS to be effective in an attempt to cause physical damage to the plant. For just tripping the plant by activating a shutdown action, the attacker would not need to invest in the complexity of the TRISIS attack. Once having gained access to the control system at the level the attackers did, tens of easier-to-realize attack scenarios were available if only a shutdown was intended. The assumption that the attacker needs the BPCS and SIS together to cause physical damage is not correct; the SIS can cause physical damage to the plant all by itself. I will explain this later with an example well known to safety engineers: the emergency shutdown of a compressor.

Next I’d like to quote some conclusions in the (excellent) Dragos report on TRISIS. It starts at page 18 with:

Could This Attack Lead to Loss of Life?

Yes. BUT, not easily nor likely directly. Just because a safety system’s security is compromised does not mean it’s safety function is. A system can still fail-safe, and it has performed its function. However, TRISIS has the capability to change the logic on the final control element and thus could reasonably be leveraged to change set points that would be required for keeping the process in a safe condition. TRISIS would likely not directly lead to an unsafe condition but through its modifying of a system could deny the intended safety functionality when it is needed. Dragos has no intelligence to support any such event occurred in the victim environment to compromise safety when it was needed.

The conclusion that the attack could not likely lead to the loss of life is, in my opinion, not correct and shows the same underestimation as made by Joe. As far as I am aware, the modified part of the logic has never been published (hopefully someone did analyze it), so the scenario I am going to sketch is just a guess at a potential objective. It is what is called a cyber security hazard; it could have occurred under the right conditions in many production systems, including the one in Saudi Arabia. So let’s start with explaining how shutdown mechanisms work in combination with safety instrumented systems (SIS), and why some of the cyber security hazards related to SIS can actually lead to significant physical damage and potential loss of life.

A SIS has different functions, as I explained in my earlier blogs. As a slightly simplified summary: there is a preventative protection layer, the Emergency Shutdown System (ESD), and there is a mitigative layer, e.g. the Fire & Gas (F&G) system, detecting fire or gas release and activating actions to extinguish fires and to alert for toxic gases. For our discussion I focus on the ESD function, but interesting scenarios also exist for F&G.

The purpose of the ESD system is to monitor process safety parameters and initiate a shutdown of the process system and/or the utilities if these parameters deviate from normal conditions. A shutdown function is a sequence of actions: opening valves, closing valves, stopping pumps and compressors, routing gases to the flare, etc. These actions need to be done in a certain sequence and within a certain time window; if someone has access to this logic and modifies it, this can have very serious consequences. I would almost say it always has very serious consequences, because the plant contains a huge amount of energy (pressure, temperature, rotational speed, product flow) that needs to be brought to a safe (de-energized) state in a very short amount of time, releasing incredible forces. If an attacker is capable of tampering with this shutdown process, serious accidents will occur.
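To illustrate why sequence and timing matter, here is a toy sketch; the step names and deadlines are invented and this is not real SIS logic. It checks whether a logged shutdown execution respected both the required order and the time windows:

```python
# Illustrative sketch (NOT real SIS logic): an ESD shutdown is a sequence
# of actions that must run in a fixed order, each within a time window.
# Reordering or delaying steps is itself an attack on the safety function.
SEQUENCE = [
    ("open_hot_bypass_valve", 0.5),   # step name, deadline in seconds
    ("close_suction_valve", 2.0),
    ("stop_compressor_motor", 3.0),
    ("route_gas_to_flare", 5.0),
]

def validate_execution(log):
    """log: list of (step_name, elapsed_seconds) as executed."""
    if [name for name, _ in log] != [name for name, _ in SEQUENCE]:
        return False                  # wrong order or missing step
    return all(t <= deadline
               for (_, t), (_, deadline) in zip(log, SEQUENCE))

good = [("open_hot_bypass_valve", 0.3), ("close_suction_valve", 1.8),
        ("stop_compressor_motor", 2.9), ("route_gas_to_flare", 4.0)]
late = [("open_hot_bypass_valve", 1.2), ("close_suction_valve", 1.8),
        ("stop_compressor_motor", 2.9), ("route_gas_to_flare", 4.0)]
print(validate_execution(good), validate_execution(late))  # True False
```

A tampered logic solver that merely delays or swaps two steps still "shuts down" the plant, yet fails this check, and in reality such a deviation is what releases the stored energy destructively.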

Let’s discuss this scenario in more detail in the context of a centrifugal compressor; most plants have multiple, so always an interesting target for the “OT certified” threat actor. Centrifugal compressors increase the kinetic energy of, for example, a gas and convert it into pressure, so a gas flow through pipelines is created, either to transfer a product through the various stages of the production process or perhaps to create energy for opening / closing pneumatically driven valves.

Transient operations, for example the start-up and shutdown of process equipment, always have dangers that need to be addressed. An emergency shutdown, triggered because a condition occurred in the plant that demanded the SIS to transfer the plant to a safe state, is such a transient operation. But in this case it is unplanned and in principle fully automated, with no process operator to guard the process and correct where needed. The human factor is not considered a very reliable factor in functional safety and is often just too slow. A SIS, on the other hand, is reliable: the redundancy and the continuous diagnostic checks all warrant a very low probability of failure on demand for SIL 2 and SIL 3 installations. They are designed to perform when needed, no excuses allowed. But this is only so if the program logic is not tampered with; the sequence of actions must be performed as designed and is systematically tested after each change.

Compressors are controlled by compressor control systems (CCS), one of the many sub-systems in an ICS. The main task of a compressor control system is anti-surge control. Surge in a centrifugal compressor is a complete breakdown and reversal of the flow through the machine. A surge causes physical damage to the compressor and pipelines because of the huge forces released, and depending on the gas it can also lead to explosions and loss of containment.

The anti-surge controller of the CCS continuously calculates the surge limit (which is dynamic) and steers the compressor’s operating point away from this point of danger. This works fine during normal operation; during an emergency shutdown, however, the basic anti-surge control provided by the CCS has proven insufficient to prevent a surge. To improve the response, the process engineer has two design options, a hot bypass or a cold bypass, both recycling gas to allow for a more gradual shutdown. The hot bypass is used most often because its closeness to the compressor results in a more direct response. A hot bypass requires opening valves to feed gas back to the compressor, an action implemented as a task of the ESD function. The quantity of gas that can be recycled has a limit, so it is not simply a matter of opening the bypass valve to 100% but of opening it by the right amount. Errors in this process, or a reaction that is too slow, would easily result in a surge, damaging the compressor, potentially rupturing connected pipes, causing loss of containment, perhaps resulting in fire and explosions, and potentially leading to casualties and a long production stop with high repair costs.
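The control principle can be sketched roughly as follows. The surge-line model, margin, and gain below are invented for illustration; a real CCS derives the surge line from the compressor map supplied by the manufacturer.

```python
def surge_limit_flow(pressure_ratio, k=0.8):
    """Hypothetical surge line: minimum safe flow rises with pressure ratio."""
    return k * pressure_ratio ** 0.5

def recycle_valve_opening(flow, pressure_ratio, margin=0.10, gain=5.0):
    """Return the anti-surge (recycle) valve opening in [0, 1].

    The control line sits `margin` above the surge line; the valve
    opens proportionally as flow drops below the control line.
    """
    control_line = surge_limit_flow(pressure_ratio) * (1.0 + margin)
    if flow >= control_line:
        return 0.0                   # safe: keep the recycle valve closed
    deficit = (control_line - flow) / control_line
    return min(1.0, gain * deficit)  # open by the right amount, capped at 100%
```

Note where the attack surface sits: shrink the margin, bias the flow measurement, or corrupt the gain, and the controller will calmly steer the machine toward surge while appearing to function.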

All of this is under the control of the logic solver application, part of the SIS. If the TRISIS attackers had succeeded in loading altered application logic, they would have been capable of causing physical damage to the production installation, damage that could have caused loss of life.

So my conclusion differs a bit: an attack on the SIS alone can lead to physical damage when its logic is altered, which can result in loss of life. A few changes in the logic and the initiation of the shutdown action would have been enough to accomplish this.

This is just one example of a cyber security hazard in a plant; multiple examples exist showing that the SIS by itself can cause serious incidents. But this blog is not supposed to be a training course for certified OT cyber terrorists, so I will leave it at this example, well known to safety engineers.

Proper cyber security requires proper cyber security hazard identification and risk analysis. This gets too little focus and is sometimes executed at a level of detail insufficient to identify the real risks in a plant.

I don’t want to criticize the work of others, but I do want to emphasize that OT security is a specialism, not a variation on IT security. ISA published a book, “Security PHA Review”, written by Edward Marszal and Jim McGlone, which addresses the subject of secure safety systems in a manner that is, in my view, far too simplified: it basically focuses on an analysis / review of the process safety HAZOP sheet to identify cyber related hazards.

The process safety HAZOP doesn’t contain the level of detail required for a proper analysis, nor does the HAZOP process assume malicious intent. One valve may fail, but multiple valves failing at the same time in a specific sequence is considered very unlikely and is therefore not analyzed, while exactly these options are fully open to a threat actor with a plan.

Proper risk analysis starts with identifying the systems and sub-systems of an ICS, then identifying the cyber security hazards in these systems, identifying which functional deviations can result from these hazards, and then translating how these functional deviations can impact the production system. That is much more than a review of a process HAZOP sheet for “hackable” safeguards and causes. That type of security risk analysis requires OT security specialists with detailed knowledge of how these systems work and what their functional tasks are, plus the imagination and the dose of badness needed to manipulate these systems in a way that benefits an attacker.
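The chain described above, from system to hazard to functional deviation to process impact, can be captured in a simple traceable structure. This is only a sketch of the bookkeeping, with hypothetical names; the hard work is filling it in with real system knowledge.

```python
from dataclasses import dataclass, field

@dataclass
class CyberHazard:
    description: str             # e.g. "unauthenticated logic download"
    deviations: list             # functional deviations it can cause

@dataclass
class Subsystem:
    name: str                    # e.g. "SIS logic solver", "CCS"
    hazards: list = field(default_factory=list)

def impact_paths(subsystems):
    """Enumerate every (subsystem, hazard, functional deviation) chain.

    Each chain is a candidate for a consequence analysis against the
    production process, which is the step a HAZOP-sheet review skips.
    """
    return [
        (s.name, h.description, d)
        for s in subsystems
        for h in s.hazards
        for d in h.deviations
    ]
```

Enumerating the paths is trivial; judging which functional deviations translate into which failure modes of the production system is where the OT specialist earns their keep.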

Sinclair Koelemij