When selecting a topic for a blog I generally pick one that entails some controversy. This time I have selected a topic that is generally accepted, and attempt to challenge it to create the controversy. I believe very much that OT security is about learning, and that learning requires continuously challenging the new knowledge we acquire. So this blog is not about trying to know better, or trying to refute a well-founded theory, but simply about applying some knowledge and reason to investigate the feasibility of other options. In this case: is a one-stage ICS attack feasible?
Five years ago Michael Assante and Robert Lee wrote an excellent paper called “The Industrial Control System Cyber Kill Chain“. I think the readers of this blog should carefully study this document; the paper provides a very good structure for how threat actors approach attacks on industrial control systems (ICS). The core message of the paper is that attacks on ICS, with the objective to damage production equipment, are two-stage attacks.
The authors describe a first stage where the threat actor team collects information by penetrating the ICS and establishing a persistent connection; the paper calls this the “Management and Enablement Phase“. This phase is followed by the “Sustainment, Entrenchment, Development, and Execution” phase, in which the threat actor captures additional information and prepares for the attack on the ICS.
Both phases are part of the first attack stage. The information collected in stage 1 is used to prepare the second stage of the kill chain: the attack on the process installation to cause a specific failure mode (“the damage of the production system”). Typically there is considerable time between the first stage and the second stage, time used to develop the second stage.
I assume that the rationale of the authors for suggesting two stages is that each plant is unique, and as such each ICS configuration is unique, so to damage the process installation we need to know its unique configuration in detail before we can strike.
This seems like fair reasoning, and the history of ICS attacks supports the theory of two-stage attacks.
- Stuxnet required the threat actor team to collect information on which variable speed drives were used, in order to cause the speed variations that created the excessive wear in the centrifuge bearings. The threat actor team also needed information on which process tags were configured in the Siemens control system to hide these speed variations from the process operator, and the threat actors required a copy of the PLC program code so they could modify it to execute the speed variations. The form of Stuxnet’s stage 1 probably differs very much from the general ICS kill chain stage 1 scenario (probably in part an old-fashioned, non-electronic espionage activity), but this isn’t important. There is a stage 1 collecting information, and a stage 2 that “finishes” the job. So at a high level it matches the ICS kill chain pattern described in the paper.
- The two attacks on the Ukrainian power grid in 2015 and 2016 follow a similar pattern. It is very clear from the efficiency of the attack, the tools developed, and the very specific actions carried out that the threat actor team had detailed knowledge of what equipment was installed and which process tags were related to the breaker switches. So in this case too, the threat actor team executed a stage 1 collecting information, then based on this information planned the stage 2 attack, and in the stage 2 attack realized the actual objective.
- The third example, TRISIS / TRITON, follows the same pattern. Apparently the objective of this attack was to modify the safety integrity function in a Triconex safety controller. Details here are incomplete when reading / listening to the various sources. An interesting story, though perhaps at times a bit too dramatic for those of us who regularly spend their days at chemical plants, is for example the Darknet Diaries episode on the TRISIS attack, featuring among others Robert Lee and my former colleague Marina Krotofil. Was the attack directed against a Fire & Gas controller, or was the emergency shutdown function the target? This is not very clear in the story, and because of the very different set of consequences for the production process it is an important detail when it comes to understanding the objective of the attack. Nevertheless, also in the TRISIS case we saw a clear two-stage approach: a first attack collecting the process information and the Triconex configuration and logic, and a second attack attempting to install the modified logic and initiate (or wait for) something to happen. In my TRISIS revisited blog I explained a possible scenario to cause considerable damage and showed that for this to happen we don’t need a simultaneous attack on the control system.
What am I challenging in this blog? For me the million dollar question is: can an attacker cause physical equipment damage in a one-stage attack, or are ICS attacks by default two-stage attacks? Is a one-stage attack possible where all required knowledge is readily available in the public space and known by subject matter experts? And if it is a possibility, what are the characteristics that make such a scenario possible?
Personally I believe it is possible, and I might even have seen an example of such an attack, which I will discuss in more detail. I believe that an attacker with some basic hands-on practice working with a control system can do it in a single stage.
I am not looking for scenarios that cause a loss of production, as for example a ransomware attack would. That is too “easy”; I am looking for targets that, when attacked in the first stage, would suffer considerable physical damage.
I am always looking into these scenarios to better understand how such an attack would work and what we can learn from it to prevent such a scenario, or at least make it more difficult. In this I like to follow the advice of the Chinese general Sun Tzu: learn to know the opponent as well as our own shortcomings.
Over 2500 years ago Sun Tzu wrote the following important lessons, which are also true in OT security:
- “If you know the enemy and know yourself, you need not fear the result of a hundred battles.”
- “If you know yourself but not the enemy, for every victory gained you will also suffer a defeat.”
- “If you know neither the enemy nor yourself, you will succumb in every battle.”
So let’s investigate some scenarios that can lead to physical damage. In order to damage process equipment we have to focus on consequences with failure modes belonging to the categories Loss of Required Performance and Loss of Ability (see my earlier blogs for an explanation), since only failure modes in those categories can cause physical damage in a one-stage attack.
Loss of confidentiality is basically the outcome of the first stage in a two-stage attack: access credentials, intellectual property (e.g. design information), or information in general all might assist in preparing stage 2. But I am looking for a single-stage attack, so the attacker needs to rely on knowledge available in the public domain without the need to collect it from the ICS, gain access into the ICS, and cause the damage on first entry.
From a functional aspect, the types of functions to attack for causing direct physical damage would be: BPCS (Basic Process Control System), SIS (Safety Instrumented System), CCS (Compressor Control System), MCC (Motor Control Center), BMS (Boiler Management System), PMS (Power Management System), APC (Advanced Process Control), CPS (Condition Protection System), and IAMS (Instrument Asset Management System).
So plenty of choice for the threat actor team, depending on the objective. It is not difficult for a threat actor to find out which systems are used at a specific site and which vendor provided these functions, because if we know the product made, we know which functions are mandatory to have. If we know the business a bit, we can find many detailed success stories on the Internet, revealing a lot of interesting detail for the threat actor team. Additionally, threat actors can gain access to a similar system for testing their attack strategies, as for example the TRISIS threat actor team had access to a Triconex safety system for its attack.
To actually make the damage happen we also need to consider the failure modes of the process equipment. Some process equipment is more vulnerable than other equipment. Interesting candidates would be:
- Extruders such as those used in the thermoplastic industry. An extruder melts raw plastic, adds additives to it, and passes it through a die to get the final product. This is a sensitive process where too much heat would change the product structure and an uneven flow might cause product quality issues. But the worst condition is no flow: if flow stops, the melted plastic solidifies within the extruder, and considerable time is required to recover from this. However, this is a failure mode that also happens once in a while under normal circumstances, so it is an annoyance that process operations knows how to handle. Hardly an interesting target for a cyber attack attempting to cause significant damage.
- How about a kiln in a cement process? If a kiln doesn’t rotate, the construction would collapse under its own weight. So stopping a kiln could be a potential target; however, it is not that complex to create a safeguard with a manual override that bypasses the automation system to keep the kiln rotating. So also not an interesting target to pursue for this blog.
- How about a chemical process where runaway reactions are a potential failure mode? A runaway reaction is a thermally unstable reaction that accelerates with rising temperature and can ultimately lead to explosions and fires. A relatively recent example of such an accident is the Shell Moerdijk accident in 2014. Causing a runaway reaction could certainly be a scenario, because these can be induced relatively easily. The threat actor team wouldn’t need that much specific information to carry out the attack; they could perhaps stop the agitator in the reactor vessel, or stop a cooling process, to trigger the runaway reaction. Once the threat actor has gained access to the control system, this action is not that complex for someone familiar with control systems. So for me a potential single-stage attack target (a small sketch of the mechanism follows after this list).
- Another example that came to mind is the attack on the German steel mill in 2014. According to the BSI report the attacker(s) gained access to the office network through phishing emails; once the threat actor had access to the office network, they worked their way into the control systems. How this happened is not documented, but it could have been through some keylogger software. Once the threat actor had access to the control system they caused what was called “an uncontrolled shutdown” by shutting down multiple control system components. This looks like a one-stage attack, let’s have a closer look.
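Before taking that closer look, here is the runaway reaction sketch promised above. It is a minimal Semenov-style heat balance in Python; every parameter is an illustrative assumption of mine, not data from any real reactor. It shows the point that matters for the attack scenario: a reactor that sits perfectly stable at its operating point runs away, with accelerating speed, once the cooling is switched off.

```python
# Minimal Semenov-style heat balance of an exothermic reactor.
# ALL parameters are illustrative assumptions, not data from a real plant.
import math

M_CP   = 2.0e7   # heat capacity of the reaction mass [J/K] (assumed)
Q0     = 2.0e5   # heat generation at the 350 K operating point [W] (assumed)
T_E    = 60.0    # exponential temperature sensitivity of the reaction [K] (assumed)
UA     = 4.0e3   # cooling capacity [W/K] (assumed)
T_COOL = 300.0   # coolant temperature [K]
T_TRIP = 450.0   # temperature where we assume relief / vessel damage [K]

def q_generated(T):
    """Arrhenius-like heat release, exponential around the operating point."""
    return Q0 * math.exp((T - 350.0) / T_E)

def q_removed(T, cooling_on):
    """Heat removed by the cooling system; zero once the pumps are stopped."""
    return UA * (T - T_COOL) if cooling_on else 0.0

T, t, dt = 350.0, 0.0, 1.0            # start at the stable operating point
while T < T_TRIP:
    cooling_on = t < 3600.0           # attacker stops the cooling after 1 hour
    T += (q_generated(T) - q_removed(T, cooling_on)) / M_CP * dt
    t += dt
    if t % 600 == 0:                  # progress report every 10 minutes
        print(f"t={t/60:5.0f} min  T={T:5.1f} K  cooling={'on' if cooling_on else 'OFF'}")

print(f"Assumed damage temperature reached {t/3600:.1f} hours into the attack.")
```

With the cooling on, heat generation and removal balance at 350 K and nothing happens. With the cooling off, the temperature rise feeds the reaction rate, which feeds the temperature rise; that positive feedback is what makes this failure mode attractive for a one-stage attack, because no plant-specific configuration data is needed to understand it.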
LESSON 1: Never allow access from the office domain into the control domain without implementing TWO-FACTOR AUTHENTICATION. Not even for your own personnel. Not implementing two-factor authentication is denying the risk of phishing attacks and ignoring defense in depth. When doing so we are taking a considerable risk. See also my blog on remote access, RULE 3.
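To make the lesson concrete, below is a minimal sketch of such a second factor on a remote-access gateway, assuming the open source pyotp library for time-based one-time passwords. The function names and the stubbed password check are my own illustrative choices, not any vendor’s API.

```python
# Sketch of a TOTP second factor gating access to the control domain.
# Requires: pip install pyotp. check_password() is a stub for illustration.
import pyotp

def check_password(user: str, password: str) -> bool:
    # Placeholder: in reality this queries AD/LDAP or a local user store.
    return password == "correct-horse"               # illustration only

def grant_control_domain_access(user, password, otp_code, totp_secret) -> bool:
    """Open a session into the control domain only when BOTH factors pass."""
    if not check_password(user, password):           # factor 1: something you know
        return False
    return pyotp.TOTP(totp_secret).verify(otp_code)  # factor 2: something you have

secret = pyotp.random_base32()                       # provisioned on the user's token
code = pyotp.TOTP(secret).now()                      # what the token displays now
print(grant_control_domain_access("engineer", "correct-horse", code, secret))
```

With such a gate in place, a phished password alone is no longer enough to reach the control domain.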
For the steel mill attack there are no details known about what happened, other than that the threat actor, according to the BSI report, shut down multiple control system components and caused massive damage. From here on I am just speculating about what might have happened, but a possible scenario could have been shutting down the cooling system and shutting down BPCS controllers so the operator can’t respond quickly. The cooling system is one of the most critical parts of a blast furnace. Because the BSI report mentions both “Anlagen” and “Steuerungskomponenten”, it might have been the above suggestion: first shutting down the motors of the water pumps and then shutting down the controllers to prevent a quick restart of the pumps.
The picture on the left (courtesy of HSE) shows a blast furnace. A blast furnace is charged at the top with iron ore, coke, sinter, and limestone. The charge materials gradually work their way down, reacting chemically and thermally on their path down. This process takes about 6-8 hours, depending on the size of the furnace. The furnace is heated by injecting hot air (approx. 1200 °C) mixed with pulverized coal, and sometimes methane gas, through nozzles that in the context of a blast furnace are called tuyeres.
This results in a furnace temperature of over 2000 °C. To protect the blast furnace shell against these high temperatures it is critical to cool the refractory lining (see picture). This cooling is done by pumping water through the lining. Cooling systems are typically fully redundant, with all equipment doubled.
There would be multiple electrical pump systems and turbine (steam) driven pump systems; if one system fails, the other takes over immediately, one being steam driven, the other electrically driven. So all looks fine from a cooling system failure perspective. The problem is that when we consider OT cyber security, we also have to consider malicious failures / forced shutdowns. In those cases redundancy doesn’t work: multiple motors can be switched off simultaneously, and a process valve forced to a closed or open position can be accompanied by another valve forced the same way or in the opposite direction. Frequently we see in process safety HAZOP analysis that double failures are not taken into account; this is the field of the cyber security HAZOP, translating the various failure modes of the ICS functions into how they can be used to cause process failures.
The question in the above scenario is: could the attacker shut down both pumps, and would a manual restart of the cooling system be possible? If this failed, the lining would overheat very quickly. In the case of the Corus furnace in Port Talbot, the cooling system was designed to run two pumps simultaneously, each producing 45,000 litres per minute. So a flow equivalent to 90,000 one-litre water bottles per minute; if we needed a crowd of people to consume this amount of water per minute, we would get a very sizable crowd. Just to give you an idea how much water this is. If the cooling water flow stopped, the heat in the lining would rise very quickly, and as a result the refractory lining would be damaged by the huge thermal stress created. The small calculation below gives a feel for the amount of heat involved.
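As a back-of-the-envelope check, the heat this flow carries away can be estimated with Q = m·cp·ΔT. The flow figures come from the Port Talbot example above; the 10 K temperature rise of the water across the cooling circuit is purely my own assumption for illustration.

```python
# Back-of-the-envelope heat removal of the blast furnace cooling circuit.
FLOW_L_PER_MIN = 2 * 45_000         # two pumps at 45,000 litres per minute each
CP_WATER       = 4186.0             # specific heat of water [J/(kg*K)]
DELTA_T        = 10.0               # ASSUMED temperature rise across the lining [K]

mass_flow = FLOW_L_PER_MIN / 60.0   # [kg/s]; 1 litre of water is about 1 kg
heat_removed = mass_flow * CP_WATER * DELTA_T

print(f"mass flow:    {mass_flow:7.0f} kg/s")
print(f"heat removed: {heat_removed / 1e6:7.1f} MW")
```

Even with this conservative ΔT the circuit removes on the order of 60 MW; the moment both pumps stop, that heat has nowhere to go but into the refractory lining.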
For safety purposes there is generally also an emergency water tower, which acts when the water pressure drops. This system supplies water through gravitational force; it might not have worked, or it may have had insufficient capacity to prevent damage. Or perhaps the attacker also knew a way to stop this safety control.
Do we need a two-stage attack for the above scenario?
I believe a threat actor with sufficient hands-on experience with a specific BPCS can relatively quickly locate these critical pumps and shut them down. With the same level of knowledge he can subsequently shut down the controller, if his authorizations allow this. Typically engineer-level access would be required, but I have also encountered systems where process operators could shut down a controller. How blast furnaces work, what the critical functions are, and even which companies specialize in supplying these functions is all in the public domain. So perhaps a single-stage attack is possible.
If the above is a credible scenario for what might have happened in the case of the German steel mill, what can we learn from it, and what can we do to increase the resilience against this attack?
- First, we should not allow any remote access (access from the corporate network is also remote access) without two-factor authentication if the user has “write” possibilities in the ICS. The fact that an attacker could gain access to the corporate network, from there get access into the control network, and cause an uncontrolled shutdown is in my opinion a significant (and basic) security flaw.
- Second, we should consider implementing dual control for critical functions in the process that have a steady state: for example a compressor providing instrument air, or in this case a cooling water pump and its redundant partners, whose normal state is on. Dual control would demand that two users approve the shutdown action before the control system executes the command; these would generally be users with different authorization levels. Such a control would further raise the bar for the threat actor, as the sketch below shows.
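Here is a minimal sketch of such a dual-control gate, assuming nothing more than standard Python; the class name, the roles, and the pump tag are my own illustrative choices, not a vendor implementation.

```python
# Sketch of a dual-control gate for a critical shutdown command.
from dataclasses import dataclass, field

@dataclass
class DualControlGate:
    """Execute a critical command only after two different users,
    holding two different roles, have independently approved it."""
    command: str
    approvals: dict = field(default_factory=dict)   # user -> role

    def approve(self, user: str, role: str) -> None:
        self.approvals[user] = role

    def execute(self) -> bool:
        distinct_users = len(self.approvals) >= 2
        distinct_roles = len(set(self.approvals.values())) >= 2
        if distinct_users and distinct_roles:
            print(f"EXECUTING: {self.command}")
            return True
        print(f"BLOCKED: {self.command} (needs two approvers with different roles)")
        return False

gate = DualControlGate("stop cooling water pump P-101A")  # hypothetical tag
gate.approve("operator_1", "operator")
gate.execute()                       # blocked: a single approver is not enough
gate.approve("supervisor_1", "supervisor")
gate.execute()                       # now the shutdown command is executed
```

A compromised operator account alone can then no longer stop the pump; the attacker would need to compromise two accounts with different authorization levels.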
LESSON 2: For critical automation functions that can cause significant damage when initiated, consider either isolating them from the control system (hardwired connections) or implementing DUAL CONTROL, so as not to depend on a single layer of protection – access control. Defense in depth should never be ignored for critical functions. WHEN CRITICAL, DON’T RELY ON A SINGLE CONTROL, APPLY DEFENSE IN DEPTH!
The above two additional security controls would make the sketched scenario much more difficult, almost impossible, to pull off. OT security is more than installing AV, patching, and a firewall. To reduce risk we need to assess security from all points of view: the automation function and its vulnerabilities, our protection of these vulnerabilities, our situational awareness of what is happening, and what threats there are – which attack scenarios could happen.
IEC 62443 specifies the “dual control” requirement as an SL 4 requirement. I personally consider this a flaw in the standard and believe it should already be a mandatory requirement for SL 3, because unauthorized access to critical process equipment is well within the capabilities of the “sponsored attacker” type of threat actor.
Having dual control as a requirement doesn’t mean that all control actions should have it implemented, but there are critical functions in a plant that are seldom switched on or off where adding dual control adds security by adding DEFENSE IN DEPTH. When critical, don’t rely on a single control.
Today’s blog is once more a story on the need to approach OT security of ICS in a different manner than we do for IT systems, where network security and hardening are the core activities for prevention. Network security and hardening of the computer systems are essential first steps for an OT system too, but we should never forget the actual automation functions of the OT systems.
This is why OT security teams should approach OT security differently and construct bow-tie diagrams for cyber security hazards, relating cause (a threat actor initiating a threat action exploiting a vulnerability) to consequence (the functional deviation) and identifying security countermeasures, safeguards, and barriers. Only by following this path can we analyze possible process failures as the result of a functional deviation, and can we design OT systems in a secure way. However, for this method to provide the required information we need a far higher level of detail / discrimination than the terminology used in the past offered. Terminology such as loss of view and loss of control doesn’t lead to identifying countermeasures, safeguards, and barriers. To make a step toward understanding how a cyber attack can impact a production system we need much more granularity, hence the 16 failure modes discussed in a prior blog.
Each ICS main function (BPCS, SIS, CCS, PMS, ….) has a fixed set of functions and potential deviations; it is the core task of the cyber security HAZOP to identify what risk they bring. The task is not to identify these deviations – these are fixed, because each main function has a set of automation tasks it carries out. It is the deviations, their likelihood, and their severity for the production installation that are of importance, because then we know what benefits controls bring and which controls are of less importance. But we need to consider all options before we can select. A sketch of how such an entry could be captured and ranked follows below.
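As an illustration, a bow-tie / cyber security HAZOP entry could be captured as simply as the structure below. The failure mode names reuse the categories mentioned earlier in this blog, but the specific scores and barriers are illustrative assumptions of mine; a real analysis would of course carry far more detail.

```python
# Sketch of capturing and ranking cyber security HAZOP / bow-tie entries.
from dataclasses import dataclass

@dataclass
class BowTieEntry:
    function: str        # ICS main function, e.g. BPCS, SIS
    threat_action: str   # cause: threat actor exploiting a vulnerability
    deviation: str       # the functional deviation (one of the failure modes)
    consequence: str     # resulting process failure
    barriers: list       # countermeasures / safeguards / barriers
    likelihood: int      # 1 (rare) .. 5 (almost certain)
    severity: int        # 1 (minor) .. 5 (catastrophic)

    @property
    def risk(self) -> int:
        return self.likelihood * self.severity

hazards = [
    BowTieEntry("BPCS", "unauthorized write to pump controls",
                "loss of required performance", "cooling water flow stops",
                ["two-factor authentication", "dual control"], 3, 5),
    BowTieEntry("SIS", "modified logic downloaded to safety controller",
                "loss of ability", "no trip on demand",
                ["key switch in RUN position", "logic change alarm"], 2, 5),
]

# Rank hazards so the controls that reduce the most risk come first.
for h in sorted(hazards, key=lambda e: e.risk, reverse=True):
    print(f"risk={h.risk:2d}  {h.function}: {h.deviation} -> {h.consequence}")
```

Ranking the entries by likelihood times severity is what tells us which controls contribute most, which leads directly to the final lesson.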
LESSON 3 – My final lesson of the day: first analyze what the problem is before trying to solve it.
Many times I see asset owners surfing the waves of the latest security hype. Every security solution has its benefits and “exciting” features, but this doesn’t mean it is the right solution for your specific problem, or that it addresses your biggest security gap.
To know what the right solutions are, you need a clear view of the cyber security hazards in your system, and then prioritize and select those solutions that contribute most. There are often many features already available; start using them.
This leads to reusing methodologies that we have now applied successfully for almost 60 years in process safety; we should adapt them where necessary and reuse them for OT security.
There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 42 years of work in this industry, approximately half of that time working on engineering these automation systems, and the other half implementing their networks and securing them.
Author: Sinclair Koelemij
OTcybersecurity web site