This blog is about cyber security detection controls for cyber physical systems, process automation systems, process control systems or whatever we want to call them. The blog discusses their value in securing these systems and some of the requirements to consider when implementing them. As usual, my blogs approach the subject from a combination of the knowledge area of cyber security and the knowledge area of automation technology. More specific to this topic is the combination of Time Based Security and Asset Integrity Management.
It’s been a while since I last wrote a blog. I postponed my retirement for another 2 years, so there was just too much to do to spend my weekends creating blogs. But sometimes the level of discomfort rises above my high alarm threshold for discomfort, forcing me to analyze the subject to better understand the cause of this discomfort. After seven months of no blogging, there are plenty of topics to discuss. For this blog I want to analyze the requirements for a cybersecurity detection function for cyber-physical systems. Do they differ from the requirements we set for information systems, and if so, what are these differences? Let’s start with the historical context.
Traditional cybersecurity concepts were based on a combination of fortress mentality and risk aversion. This resulted in the building of static defenses as described in the TCSEC – the Orange Book – of the 1980s. But almost 40 years later, this no longer works in a world where the information resides increasingly at the very edge of the automation systems. Approximately 20 years ago a new security model was introduced as an answer to the increasing number of issues the traditional models were facing.
This new model was called Time Based Security (TBS); it is a dynamic view of cyber security based on protection, detection and response. Today we still use elements of this model; for example, the NIST Cybersecurity Framework builds on the TBS model by defining security as Identify, Protect, Detect, Respond and Recover. For the topic of this blog, I focus specifically on three elements of the TBS model: Protect, Detect and Respond. The idea behind this model is that the formula P > D + R defines a secure system. This formula expresses that the amount of time it takes a threat actor to break through the protection (P) must be greater than the time it takes to detect (D) the intrusion plus the time it takes to respond (R) to this intrusion.
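The P > D + R condition can be expressed as a trivial check. A minimal sketch, with made-up numbers purely for illustration:

```python
# Hedged sketch of the Time Based Security condition P > D + R.
# All time values below are illustrative assumptions, not measurements.

def is_secure(protection_time: float, detection_time: float,
              reaction_time: float) -> bool:
    """A system is considered secure when the time the protection holds
    (P) exceeds the combined detection (D) and reaction (R) time."""
    return protection_time > detection_time + reaction_time

# Example: protection delays the attacker 60 minutes,
# detection takes 20 minutes and reaction another 30.
print(is_secure(60, 20, 30))  # True: 60 > 20 + 30 does not hold... it does: 60 > 50
print(is_secure(60, 25, 40))  # False: 60 is not greater than 65
```

The interesting consequence, discussed below, is that shrinking D and R buys the defender room to live with a weaker P.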
In the days of the Orange Book we were building our fortresses, but we also discovered that sooner or later someone found a path across, under or through the wall and we wouldn’t know about it. The TBS model kind of suggests that if we have a very fast detection and response mechanism, we can use a less strong protection mechanism. This is a very convenient idea in a world where systems become increasingly open.
This does not mean that we can throw away our protection strategies, it just means that in addition to protection, we can also implement a good detection and response strategy to compensate for a possible lack of protection. We need a balance, sometimes protection seems the logical answer, in other cases a focus on detection and response makes sense. If we compare this to physical security, it makes sense for a bank to protect a safe from the use of a torch burning through the safe’s wall. Delaying the attack increases the time we have to detect and respond. On the other hand, a counter that faces the public and is therefore relatively unprotected may benefit more from a panic button (detection) to initiate a response.
It is important to realize that detection and response are “sister processes” and that they are in series. If one fails, they both fail. Now with the concept of TBS in mind, let’s move on to a cyber attack on a cyber-physical system.
To describe a cyber attack, I use the cyber kill chain defined by Lockheed Martin in combination with a so-called P-F curve. P-F curves are used in the Asset Integrity Management discipline to show equipment health over time to identify the interval between potential failure and functional failure. Since I need a relationship between time and the progress of a cyber attack, I borrowed this technique and created the following diagram.
A P-F curve has two axes: the vertical axis represents the cyber resilience of the cyber-physical system and the horizontal axis represents the progress of the attack over time. The curve shows several important moments during the attack. The P0 point is where it all starts. In order for an attack to happen, we need a particular vulnerability; even more than this, the vulnerability must be exposed – the threat actor must be able to exploit it.
The threat actor who wants to attack the plant gathers information and looks for ways to penetrate the defenses. He or she can find multiple vulnerabilities, but eventually one will be selected and the threat actor will start developing an exploit. This can be malware that collects more information, or, once enough information has been collected, the code that performs the actual attack on the cyber-physical system (CPS).
To do this, we need a delivery method, sometimes along with a propagator module that penetrates deeper into the system to deliver the payload (the code that performs the actual attack on the CPS) to one of the servers in the inner network segments of the system. This could be a USB stick containing malware, but it could also be a phishing email or a website infecting a user’s laptop and perhaps installing a keylogger that waits for the user to connect to the CPS from a remote location. The initial point of compromise is not necessarily the plant itself; it could also be a supplier organization that builds the system or provides support services. Sensitive knowledge of the plant is stored in many places; different organizations build, maintain and operate the plant together and as such have a shared responsibility for the security of the plant. There are many scenarios here; for my example I propose that the attacker has developed malicious code with the ability to perform a scripted attack on the Basic Process Control System (BPCS). The BPCS is the automation function controlling the production process; if our target were a chemical plant, this would usually be a Distributed Control System (DCS).
The first time security is faced with the attack is at P1, where the chosen delivery method successfully penetrates the system to deliver its payload. Here we have protection mechanisms such as USB protection checks, antivirus protection, firewalls and maybe two-factor authentication against unauthorized logins. But we should also have detection controls: we can alert the moment a USB device is inserted, we can monitor who is logging into the system, we can even inspect what traffic is going through the firewall and see if it is using a suspicious amount of bandwidth, which could indicate the download of a file. We can monitor many indicators to check if something suspicious is happening. All of this falls into a category defined as “early detection”: detection at a time when we can still stop the attack with an appropriate response before the attack hits the BPCS.
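One of the early-detection indicators mentioned above – suspicious firewall bandwidth – can be sketched as a simple statistical threshold check. This is a minimal illustration, not a real detection product; the baseline values, sample values and the three-sigma threshold are all assumptions for the example:

```python
# Hedged sketch of one "early detection" indicator: flag a firewall
# bandwidth sample that deviates strongly from a learned baseline,
# which could indicate a large file download. All numbers are
# illustrative assumptions, not real measurements.

from statistics import mean, stdev

def is_bandwidth_anomaly(baseline_mbps, sample_mbps, sigma=3.0):
    """Return True when the sample exceeds mean + sigma * stdev
    of the baseline measurements."""
    threshold = mean(baseline_mbps) + sigma * stdev(baseline_mbps)
    return sample_mbps > threshold

baseline = [10.2, 11.1, 9.8, 10.5, 10.9, 9.7, 10.4]  # quiet-period samples
print(is_bandwidth_anomaly(baseline, 48.0))  # True: possible file download
print(is_bandwidth_anomaly(baseline, 11.3))  # False: within normal range
```

A real deployment would of course learn baselines per link and per time of day; the point here is only that early-detection indicators can be cheap and simple compared with the deep process knowledge required later in the kill chain.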
Nevertheless, our threat actor is skilled enough to slip through this first layer of defense and the malicious code delivers the payload to a server or desktop station of the BPCS. At this point (P2), the malicious code is installed in the BPCS and the threat actor is ready for the final attack (P3). We assume that the threat actor uses a script that describes the steps to attack the BPCS. Perhaps the script writes digital outputs that shut down a compressor, or initiates an override command to bypass a specific safety instrumented function (SIF) in the safety system. Many scenarios are possible at this level in the system once the threat actor has gathered information about how the automation features are implemented: which tag names represent which function, which parameters to change, which authorizations are required, which parameters are displayed, etc.
At this point, the threat actor has passed several access controls that protect the inner security zones, passed the controls that prevent malicious code from being installed on the BPCS servers / desktops, and passed the application whitelisting controls that prevent execution of unauthorized code. For detection at this level, we can use various audit events that signal access to the BPCS, and we can also use anomaly detection controls that signal abnormal traffic. An important point here is that the attacker has already penetrated very deep into the system, at such a level that stopping the attack in time is almost impossible without an automated response.
The problem with what we call “late detection” is that we need an automated response to be on time when the actor has immediate malicious intent. Only when the threat actor postpones the attack until a later moment can we find the time to respond “manually”. However, an automated response requires a lot of knowledge of what to do in different situations and requires a high level of confidence in the detection system. A detection system that delivers an abundance of false positives would become a bigger threat to the production system than the threat actor.
If we look at how quickly attacks on cyber-physical systems progress, we can take the 2015 attack on Ukraine’s electricity grid as an example. The time per attack was estimated to be less than 25 minutes. In these 25 minutes – the speed of onset – the threat actors gained access to the SCADA system: opened the circuit breakers; corrupted the firmware of a protocol converter, preventing the panel operators from closing the circuit breakers remotely; and wiped the SCADA hard drives. So in our theoretical TBS model, we would need a D + R time of less than 25 minutes to be effective, even if only partially effective, depending on the moment of detection. This is very short.
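To make the tightness of that budget concrete: with a fixed reaction time, the remainder of the 25 minutes is all the detection function gets. The reaction time below is an illustrative assumption, not a measured value:

```python
# Applying the P > D + R time budget to the grid attack example.
# The 25-minute speed of onset comes from the text above; the
# assumed manual reaction time is an illustrative assumption.

attack_duration_min = 25    # estimated speed of onset per attack
reaction_time_min = 15      # assumed time needed for a manual reaction
detection_budget_min = attack_duration_min - reaction_time_min
print(detection_budget_min)  # minutes left to detect the intrusion: 10
```

Ten minutes to detect, triage and escalate a late-stage intrusion is a demanding target for any SOC, which is the point of the sections that follow.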
Can a security operations center fill this role? What would this require? Let us first look at the detection function to find an answer.
First of all, we need a reliable detection mechanism; a detection mechanism that generates false positives will soon slow down any reaction. What is required for a reliable detection mechanism:
- In-depth knowledge of the various events that can happen, what is an expected event and what is a suspicious event? This can differ depending on the plant’s modes such as: normal operation, abnormal operation, or emergency. Also system behavior is quite different during a plant stop. All of these conditions can be a source of suspicious events;
- In-depth knowledge of the various protocols in use; for systems such as a DCS this includes a wide set of proprietary protocols that are not publicly documented. These have proven to be a considerable hurdle for anomaly detection systems;
- In-depth knowledge of process operations: which activities are normal and which are not? Should we detect the modification of the travel rate of a valve? Is the change of this alarm setting correct? Is disabling this process alarm a suspicious event or a normal action to address a transmitter failure? This level of detection is not even attempted today; anomaly detection systems might detect the program download of a controller or PLC, but the process operator would also know this immediately without a warning from the SOC. In all cases close collaboration with process operations is essential, which would slow down the reaction.
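The first requirement in the list – that the same event can be expected in one plant mode and suspicious in another – can be sketched as a mode-aware classification table. The event names, plant modes and rules below are invented for illustration only:

```python
# Hedged sketch: classifying the same event differently depending on
# the plant mode. Event names, modes and rules are illustrative
# assumptions, not a real rule set.

RULES = {
    # (event, plant_mode) -> classification
    ("controller_download", "plant_stop"):       "expected",
    ("controller_download", "normal_operation"): "suspicious",
    ("alarm_disabled",      "normal_operation"): "review",
    ("alarm_disabled",      "emergency"):        "expected",
}

def classify(event: str, plant_mode: str) -> str:
    """Unknown (event, mode) combinations default to 'review' so the
    detection function does not silently ignore what it has no rule for."""
    return RULES.get((event, plant_mode), "review")

print(classify("controller_download", "plant_stop"))        # expected
print(classify("controller_download", "normal_operation"))  # suspicious
print(classify("valve_travel_rate_changed", "emergency"))   # review
```

Building and maintaining such a table is exactly where the in-depth plant knowledge is needed; without it, every entry degenerates to "review" and the false-positive problem returns.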
What if the security operations center (SOC) is outsourced? What would be different from an “in-company” SOC?
- On the in-depth knowledge of the various events that can happen, the outsourced SOC will have a small disadvantage because as an external organization the SOC will not always be aware of the different plant states. But overall it will perform similarly to an in-company SOC.
- On the in-depth knowledge of the various protocols, the outsourced SOC can have a disadvantage because of a steeper learning curve, though serving multiple similar customers helps flatten it. Nevertheless, knowledge of the proprietary protocols would remain a hurdle.
- The in-depth knowledge of process operations is in my opinion fully out of reach for an outsourced SOC. As such control over this would remain primarily a protection task. Production processes are too specific to formulate criteria for normal and suspicious activity.
What if the threat actor also targets the detection function? Disabling the detection function during the attack would certainly prevent the reaction. In the case of an outsourced SOC we would expect:
- A dedicated connection between SOC and plant. If the connection were over a public network such as the Internet, a denial-of-service attack to disable the detection function would become feasible;
- A local – in the plant – detection system would still provide information on what is going on if the connection to the SOC fails. The expertise of a SOC might be missing but the detection function would still be available;
- The detection system would preferably be “out of band”, because this would make a simultaneous attack on protection and detection systems more difficult.
How about the reaction function? What do we need for a reaction that can “outperform” the threat actor?
Reaction in a cyber-physical system is not easy because we have two parts: the programmable electronic automation functions that are attacked, and the physical system controlled by the automation system. The physical system has completely independent “dynamics”; we can’t just stop an automation function and expect the physical process to be stable in all cases. Even during a plant stop we often need certain functions to remain active, and even after a black shutdown – ESD 0 / ESD 0.1 – some functions are still required. There are rotating reactors that need to keep rotating to prevent a runaway reaction, a cement kiln needs to keep rotating or it breaks under its own weight, cooling might be required to prevent a thermal runaway, and we need power for many functions.
Stopping a cyber attack by disconnecting from external networks is also not always a good idea. Some malware “detonates” when it loses its connection with a command and control server and might wipe the hard disk to hide its details. Detection therefore requires a good identification of what is happening, and the reaction needs to be aligned with the attack but also with the production process. A wrong reaction can easily increase the damage.
While a SOC can often actively contribute to the response to a breach of the security of an information system, this is in my opinion not possible for a cyber-physical system. In a cyber-physical system, the state of the production process is essential, but also which automation functions remain active. In principle, a safety system – if present – can isolate process parts and depressurize / de-energize the installation, but even then some functions are required. Process operations therefore plays a key role in the decision-making process; it is generally not acceptable to initiate an ESD 0 when a cyber attack occurs. So several preparatory actions are needed to organize a reaction, such as:
- We need a disconnection process describing under what circumstances disconnection is allowed and what alternative processes need to start up when disconnection is activated. Without having this defined and approved before the situation occurs, it is difficult to respond quickly;
- We need to have playbooks that document what to do in which situation, depending on which automation function is affected. Different scenarios exist; we need to be ready and not improvise at the moment the attack occurs, so some guidance is required;
- We need to understand how to contain the attack: is it possible to split a redundant automation system into two halves, an affected part and an unaffected part;
- We need to understand if we can contain the attack to parts of the plant if multiple “trains” exist in order to limit the overall impact;
- We need to test and train these processes to make sure they work when needed.
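The playbooks mentioned above can also be kept in a machine-readable form, which makes them easier to test and train against. The structure and all field values below are an illustrative assumption of what such an entry might look like, not a real procedure:

```python
# Hedged sketch of a machine-readable response playbook entry, as one
# possible way to document "what to do in which situation". The fields
# and values are illustrative assumptions, not a real procedure.

from dataclasses import dataclass

@dataclass
class Playbook:
    affected_function: str           # which automation function is affected
    plant_states: list               # plant states in which this entry applies
    containment_steps: list          # ordered actions to contain the attack
    disconnect_allowed: bool         # per the approved disconnection process
    requires_operations_approval: bool = True  # process operations decide

pb = Playbook(
    affected_function="BPCS engineering workstation",
    plant_states=["normal_operation", "abnormal_operation"],
    containment_steps=[
        "isolate the affected network segment",
        "split the redundant system into affected / unaffected halves",
    ],
    disconnect_allowed=False,
)
print(pb.requires_operations_approval)  # True: operations stay in the loop
```

Defaulting `requires_operations_approval` to `True` encodes the point made earlier: in a cyber-physical system, process operations has the final say over any response action.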
All this would be documented in the incident handling process where containment and recovery zones, and response strategies are defined to organize an adequate and rapid response. But all this must be in accordance with the state of the production process, possible process reconstitution and the defined / selected recovery strategy.
Process operations plays a key role in the decision making during a cyber attack that impacts the physical system. As such, a SOC is primarily an advisor and spectator because of a knowledge gap. Site knowledge is key here; detailed knowledge of the systems and their automation functions is essential for the reaction process.
Of course we should also think about recovery, but the first focus must be on detection and response, because here we can still limit the impact of the cyber attack if our response is timely and adequate. If we fail at this stage, the problems could be bigger than necessary. An OT SOC differs very much from an IT SOC and has a limited role because of the very specific requirements and unique differences per production process.
So in the early detection zone, a SOC can have a lot of value. In the late detection zone, in the case where the threat actor acts immediately, I think the SOC has a limited role and the site team should take up the incident handling process. But in all cases, with or without a SOC, detection and reaction are a key part of cyber security. It is important to realize that detection and reaction are “sister processes”; even the best detection system is rather useless if there is no proper reaction process that supports it.
There is no relationship between my opinions and references to publications in this blog and the views of my employer in whatever capacity. This blog is written based on my personal opinion and knowledge built up over 42 years of work in this industry – approximately half of that time engineering these automation systems, and half implementing their networks and securing them.
Author: Sinclair Koelemij
OTcybersecurity web site