In this blog I would like to discuss some of the specific requirements for securing OT systems, or what used to be called Real-Time Systems (RTS), specifically the real-time requirements of these systems and the potential impact of cyber security controls on them. An RTS needs to be treated differently when we secure it, so that its real-time performance is maintained. If we don’t, we can create very serious consequences for both the continuity of the production process and the safety of the plant staff.
I am regularly confronted with engineers with an IT background working in ICS who lack familiarity with these requirements and implement cyber security solutions in a way that directly impacts the real-time performance of the systems, sometimes with far-reaching consequences that lead to production stops. This blog is not a training document; its intention is to make people aware of these dangers and to help them avoid these dangers in the future.
I start by explaining the real-time requirements and how they are applied within industrial control systems, and then follow up with how cyber security can impact these requirements if incorrectly implemented.
According to Gartner, Operational Technology (OT) is defined as the hardware and software that detects or causes a change, through direct monitoring and / or control of industrial equipment, assets, processes and events.
Figure 1 – A generic RTS architecture, nowadays also called an OT system
OT is a relatively new term that primarily came into use to express that there is a difference between OT and IT systems, at the time when IT engineers started discovering OT systems. Before the OT term was introduced, the automation system engineering community called these systems Real-Time Systems (RTS). Real-Time Systems have been in use for over 50 years, long before we even considered cyber security risk for these systems or considered the need to differentiate between IT and OT. It was clear to all that these were special systems; no discussion was needed. Time changed this, and today we need to explain that these systems are different and therefore need to be treated differently.
The first real-time systems made use of analog computers, but with the rise of mini-computers and later the micro-processor, the analog computers were replaced by digital computers in the 1970s. These mini-computer based systems evolved into micro-processor based systems making use of proprietary hardware and software solutions in the late 1970s and 1980s. The 1990s was the time these systems started to adopt open technology, initially Unix based technology, but with the introduction of Microsoft Windows NT this soon became the platform of choice. Today’s real-time systems, the industrial control systems, are for a large part based on similar technology as used in the corporate networks for office automation: Microsoft servers, desktops, thin clients, and virtual systems. Only the controllers, PLCs, and field equipment are still using proprietary technology, though also for this equipment many of the software components are developed by only a few companies and used by multiple vendors. So within this part of the system a form of standardization occurred as well.
In today’s automation landscape, RTS are everywhere. RTS are implemented in, for instance, cars, airplanes, robotics, spacecraft, industrial control systems, parking garages, building automation, tunnels, trains, and many more applications. Whenever we need to interact with the real world by observing something and acting upon what we observe, we typically use an RTS. The RTS applied are generally distributed RTS, where multiple RTS exchange information over a network. In the automotive industry the Controller Area Network (CAN) or Local Interconnect Network (LIN) is used; in aerospace we use ARINC 629 (named after Aeronautical Radio, Incorporated), used for example in Boeing and Airbus aircraft; and networks such as Foundation Fieldbus, Profibus, and Ethernet are examples of networks connecting RTS within industrial control systems (ICS) such as DCS and SCADA.
Real-time requirements typically express that an interaction must occur within a specified timing bound. This is not the same as saying that the interaction must be as fast as possible; what is essential is a deterministic periodicity for all activity. If an ICS needs to sample a series of process values every 0.5 seconds, this needs to be done in such a way that the time between the samples is constant. To accurately measure a signal, the Nyquist-Shannon theorem states that the sample frequency must be more than twice the highest frequency present in the signal measured. If this principle is not maintained, the values for pressure, flow, and temperature will deviate from their actual values in the physical world. Depending on the technology used, tens to hundreds of values can be measured by a single controller. Different lists are maintained within the controller for scanning these process values, each with a specific scan frequency (sampling rate). Variation in this scan frequency, called jitter, is simply not allowed.
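To make the sampling rule concrete, here is a minimal Python sketch. It is an illustration only, not how any particular controller does it: the function names are made up for this example, and it assumes a pure sine wave as the measured signal. It checks the Nyquist-Shannon criterion and shows which (wrong) frequency a signal appears to have when it is sampled too slowly.

```python
def is_sampled_fast_enough(signal_hz: float, sample_hz: float) -> bool:
    """Nyquist-Shannon criterion: sample rate above twice the signal frequency."""
    return sample_hz > 2 * signal_hz


def apparent_frequency(signal_hz: float, sample_hz: float) -> float:
    """Frequency a pure sine of signal_hz appears to have when sampled at sample_hz.

    Sampling folds every frequency into the band 0 .. sample_hz / 2, so a
    signal that is sampled too slowly shows up as a slower one (aliasing).
    """
    nearest_multiple = round(signal_hz / sample_hz) * sample_hz
    return abs(signal_hz - nearest_multiple)


if __name__ == "__main__":
    sample_hz = 2.0  # a 0.5 second scan corresponds to a 2 Hz sample rate
    for signal_hz in (0.2, 0.9, 1.5, 1.9):
        print(
            f"signal {signal_hz} Hz:",
            "ok" if is_sampled_fast_enough(signal_hz, sample_hz) else "aliased",
            f"appears as {apparent_frequency(signal_hz, sample_hz):.2f} Hz",
        )
```

With a 0.5 second scan, anything that fluctuates faster than 1 Hz is reported as a slower fluctuation than really exists in the process, which is why fast signals are placed in a faster scan list.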
Measuring a level in a tank can be done with a much lower sampling rate than measuring a pressure signal that fluctuates continuously. So different tasks exist in an RTS, each scanning a specific set of process points within an assigned time slot. An essential rule is that a task needs to complete the sampling of its list of process points within the time reserved for it; there is no possibility to delay other tasks in order to complete the list. If there is not sufficient time, then the points that remain in the list are simply skipped, and the next cycle starts again at the start of the list. This is what is called a time-triggered execution strategy. With time-triggered execution, activities occur at predefined instants of time, like a task that samples a list of process values every 0.5 seconds, and another task that does the same for another list every 1 second, or every 5 seconds, etc. A time-triggered strategy can lead to starvation if the system becomes overloaded, because the points at the end of a list are then skipped cycle after cycle.
There also exists an event-triggered execution strategy, for example when a sampled value (e.g. a level) reaches a certain limit and an alarm goes off, when a sampled value has changed by a certain amount, or when a digital process point changes from open to closed. Apart from collecting information, an RTS also needs to respond to changes in process parameters. If the RTS is a process controller, the process operator might change the setpoint of the control loop, or adjust the gain or another parameter. And of course there is an algorithm to be executed that determines the action to take, for example changing an output value toward an actuator or applying a boolean rule that opens or closes a contact.
In ICS up to 10 – 15 years ago this activity resided primarily within a single process controller; when information was required from another controller, it was exchanged through analog wiring between the controllers. However, hard-wiring is costly, so when functionality became available that allowed this exchange of information over the network (what is called peer-2-peer control), it was used more and more (see figure 2). Various mechanisms were developed to ensure that a loss of communication between the controllers is detected and can be acted upon if it occurs.
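The skip rule of a time-triggered scan can be sketched in a few lines of Python. This is a simplified simulation under stated assumptions, not vendor code: the tag names, the per-point cost, and the function name `run_scan_cycle` are hypothetical, and time is simulated with fixed costs instead of a real clock so the behavior is repeatable.

```python
def run_scan_cycle(points, cost_ms, budget_ms):
    """Sample points in list order until the time slot is used up.

    Returns (sampled, skipped). Skipped points are not carried over:
    the next cycle starts again at the top of the list.
    """
    sampled, skipped = [], []
    used_ms = 0.0
    for point in points:
        if used_ms + cost_ms[point] <= budget_ms:
            used_ms += cost_ms[point]
            sampled.append(point)
        else:
            # Out of time: this point and the rest of the list are skipped.
            skipped = points[len(sampled):]
            break
    return sampled, skipped


if __name__ == "__main__":
    scan_list = ["FI101", "PI102", "TI103", "LI104"]  # hypothetical tags
    normal = {tag: 100.0 for tag in scan_list}        # assumed cost per point
    overloaded = {tag: 200.0 for tag in scan_list}
    print(run_scan_cycle(scan_list, normal, budget_ms=500.0))
    print(run_scan_cycle(scan_list, overloaded, budget_ms=500.0))
```

In the overloaded case the same points at the end of the list are skipped in every cycle, which is the starvation effect of a time-triggered strategy.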
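The event conditions mentioned above (a limit reached, a value changed by a certain amount, a digital signal changing state) can be sketched as follows. This is a minimal illustration; the function names, event labels, and the deadband parameter are assumptions made for this example and do not come from a specific system.

```python
def analog_events(tag, last_reported, current, high_limit, deadband):
    """Return the events raised by a new sample of an analog process point."""
    events = []
    # Limit event: the value crossed the high limit since the last report.
    if last_reported < high_limit <= current:
        events.append((tag, "HIGH_ALARM", current))
    # Change event: the value moved by at least the configured amount.
    if abs(current - last_reported) >= deadband:
        events.append((tag, "CHANGE", current))
    return events


def digital_events(tag, last_state, current_state):
    """Return a state-change event when a digital point flips, e.g. open to closed."""
    if current_state != last_state:
        return [(tag, "STATE_CHANGE", current_state)]
    return []


if __name__ == "__main__":
    print(analog_events("LI104", 78.0, 81.5, high_limit=80.0, deadband=2.0))
    print(analog_events("LI104", 50.0, 50.5, high_limit=80.0, deadband=2.0))
    print(digital_events("XV201", "OPEN", "CLOSED"))
```

Unlike the time-triggered scan, nothing is reported here as long as the process is quiet, so the load generated by event-triggered functions depends on what the process is doing.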
Figure 2 – Example process control loop using peer-2-peer communication
One of these mechanisms is what is called mode shedding. Control loops have a mode; the names and details differ between systems and vendors, but commonly used names are Manual (MAN), Automatic (AUTO), Cascade (CASC), and Computer (COM). In general, when the mode is MAN, the control algorithm is no longer executed and the actuator remains in its last position. When the mode is AUTO, the control algorithm is executed and makes use of its local setpoint (entered by the process operator) and the measured process value to adjust its output. When the mode is CASC, the control algorithm receives its setpoint value from the output of another source. This can be a source within the controller or an external source that makes use of, for example, the network. If such a control algorithm doesn’t receive its value in time, mode shedding occurs. The mode to which the algorithm falls back is generally configurable, but often manual mode is selected. This freezes the control action and requires an operator intervention. Failures may happen, as long as the result is deterministic: better to fail than to continue in some unknown state. So within an ICS, network performance is essential for real-time performance, essential to keep all control functions doing their designed task, and essential for deterministic behavior.
Another important function is redundancy. Most ICS make use of redundant process controllers, redundant input/output (I/O) functions, and redundant servers, so if a process controller fails, the control function continues to operate because it is taken over by the redundant controller. A key requirement here is that this switch-over needs to be what is called a bump-less transfer: the control loops may not be impacted in their execution because another controller has taken over the function. This requires a very fast switch-over that regular network technology often can’t handle. If the switch-over takes too long, the mode-shedding mechanism is again triggered to keep the process in a deterministic state. The difference with the previous example is that in this case mode shedding doesn’t occur in a single control loop but in all control loops configured in that controller, so a major process upset will occur. A double controller failure would normally lead to a production stop, resulting in high costs. Two redundant controllers need to be continuously synchronized, an important task running under the same real-time execution constraints as the point sampling discussed earlier. Execution of the synchronization task needs to complete within its set interval, and this exchange of data takes place over the network. If somehow the network is not performing as required and the data is not exchanged in time, the switch-over might fail when needed.
So network performance is critical in an RTS. Cyber security, however, can negatively impact this if implemented in an incorrect manner. Before we discuss this, let’s have a closer look at a key factor for network performance: network latency.
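The mode shedding logic can be sketched as a small Python class. This is a simplified model under assumptions: the class name `CascadeLoop`, the `timeout_s` parameter, and the way time is passed in are all hypothetical, and real systems differ per vendor in the details.

```python
from dataclasses import dataclass


@dataclass
class CascadeLoop:
    timeout_s: float          # assumed: maximum allowed age of the remote setpoint
    shed_mode: str = "MAN"    # configurable fallback mode, often manual
    mode: str = "CASC"
    setpoint: float = 0.0
    last_update_s: float = 0.0

    def receive_setpoint(self, value: float, now_s: float) -> None:
        """Store a setpoint that arrived from the other controller."""
        self.setpoint = value
        self.last_update_s = now_s

    def execute(self, now_s: float) -> str:
        """Called every control cycle; sheds the mode if the setpoint is stale."""
        if self.mode == "CASC" and now_s - self.last_update_s > self.timeout_s:
            self.mode = self.shed_mode
        return self.mode


if __name__ == "__main__":
    loop = CascadeLoop(timeout_s=2.0)
    loop.receive_setpoint(55.0, now_s=10.0)
    print(loop.execute(now_s=11.0))  # setpoint is fresh, loop stays in cascade
    print(loop.execute(now_s=12.5))  # setpoint is too old, loop sheds
```

Note that in this sketch a setpoint arriving after the shed does not put the loop back in cascade; the mode stays where it is until someone changes it, which mirrors the operator intervention described above.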
Factors affecting the performance in a wired ICS network are:
- Bandwidth – the transmission capacity in the network. Typically 100 Mbps or 1000 Mbps.
- Throughput – the average of actual traffic transferred over a given network path.
- Latency – the time taken to transmit a packet from one network node (e.g. a server or process controller) to the receiving node.
- Jitter – this is best described as the variation in end-to-end delay.
- Packet loss – the transmission might be disturbed because of a noisy environment such as cables close to high voltage equipment or frequency converters.
- Quality of service – Most ICS networks have some mechanism in place that sets the priority for traffic based on its function. This prevents less critical functions from delaying the traffic of more critical functions, such as a process operator intervention or a process alarm.
The factor that is most often impacted by badly implemented security controls is network latency, so let’s have a closer look at this.
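To show how some of these factors relate to each other, the sketch below derives latency, jitter, packet loss, and throughput from a set of measured one-way delays. It is only an illustration: the function name and parameters are made up, and jitter is computed here as the difference between the largest and smallest delay, which is one of several definitions in use.

```python
from statistics import mean


def summarize_path(sent_count, delays_ms, payload_bytes, duration_s):
    """Summarize a network path; delays_ms holds one entry per packet that arrived."""
    received = len(delays_ms)
    return {
        "latency_ms": mean(delays_ms),
        # Peak-to-peak variation in end-to-end delay (assumed definition).
        "jitter_ms": max(delays_ms) - min(delays_ms),
        "packet_loss_pct": 100.0 * (sent_count - received) / sent_count,
        "throughput_mbps": received * payload_bytes * 8 / duration_s / 1e6,
    }


if __name__ == "__main__":
    # Hypothetical measurement: 10 packets sent in 1 second, 8 arrived.
    delays = [2.0, 2.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0]
    for name, value in summarize_path(10, delays, 1250, 1.0).items():
        print(name, value)
```

A single delayed packet barely moves the average latency but shows up fully in the jitter, which is why both are looked at separately for real-time traffic.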
There are four types of delay that cause latency:
- Queuing delay – This is the time a packet waits in the buffers of the devices on its path. Queuing delay depends on the number of hops for a given end-to-end path and on the load of each device. It is typically caused by routers, firewalls, and intrusion prevention systems (IPS).
- Transmission delay – This is the time taken to transmit all the bits of the frame containing the packet, so the time between emission of the first bit of the packet and emission of the last bit. It is determined by the frame size and the bit rate of the link. This is a factor normally not influenced by specific cyber security controls; an exception is when a data diode is implemented, where the type of data diode can have an influence.
- Propagation delay – This is the time a bit needs to travel from the sender to the receiver. The main factors influencing propagation delay are the cable type (copper, fiber, dark fiber) and the cable distance, for example very long fiber cables. All network equipment in the path adds its own forwarding time on top of this, and a firewall (depending on the type of firewall) and an IPS contribute to this as well.
- Processing delay – This is the time taken by the software execution of the protocol stack. Processing delay is created by access control lists, by encryption, and by integrity checks built into the protocol, whether TCP, UDP, or IP.
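As a rough illustration of how these delays add up, the sketch below computes an end-to-end latency budget using the common textbook formulas: transmission delay is frame size divided by link bit rate, and propagation delay is cable length divided by signal speed. The function names, the hop values, and the signal speed of about 2 x 10^8 m/s (roughly two thirds of the speed of light, a common approximation for copper and fiber) are assumptions for this example; queuing and processing delays are simply given as inputs because they depend on the device and its load.

```python
def transmission_delay_us(frame_bytes, link_mbps):
    """Time to put the whole frame on the link: bits / (Mbit/s) gives microseconds."""
    return frame_bytes * 8 / link_mbps


def propagation_delay_us(cable_m, speed_m_per_s=2.0e8):
    """Time for a bit to travel the cable, in microseconds (assumed signal speed)."""
    return cable_m / speed_m_per_s * 1e6


def end_to_end_delay_us(frame_bytes, hops):
    """Sum the four delay types over every hop of the path."""
    total = 0.0
    for hop in hops:
        total += transmission_delay_us(frame_bytes, hop["link_mbps"])
        total += propagation_delay_us(hop["cable_m"])
        total += hop["queuing_us"] + hop["processing_us"]
    return total


if __name__ == "__main__":
    # Hypothetical path: one switch hop, then a firewall hop with heavier processing.
    path = [
        {"link_mbps": 100, "cable_m": 100, "queuing_us": 10, "processing_us": 5},
        {"link_mbps": 100, "cable_m": 100, "queuing_us": 50, "processing_us": 200},
    ]
    print(f"{end_to_end_delay_us(1500, path):.1f} microseconds")
```

Playing with the numbers shows that on short cables the propagation part is negligible, and that the queuing and processing added by an extra filtering device quickly dominates the budget.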
Let’s discuss the potential conflict between real-time performance and cyber security.
The impact of cyber security on real-time performance
How do we create real-time performance within Ethernet, a network never designed for providing real-time performance? There is only one way to do this, and that is by creating over-capacity. A typical well-configured ICS network has considerable over-capacity to handle peak loads and prevent delays that can impact the RTS requirements. However, the only one managing this over-capacity is the ICS engineer designing the system. The time available for tasks to execute is a design parameter that needs to fit both small and large systems. Making certain that network capacity is sufficient is complex in redundant and fault tolerant networks. A redundant network has two paths available between nodes; a fault tolerant network has four paths available. How this redundancy or fault tolerance is created can impact the available bandwidth / throughput. In systems where the redundant paths are also used for traffic, network paths can become saturated by high throughput, for example caused by creating automated back-ups of server nodes or distributing cyber security patches (especially Windows 10 security patches). Because this traffic can make use of multiple paths, it becomes constrained when it hits the spot in the network where redundancy or fault tolerance ends and the traffic has to fall back to a much lower bandwidth. Quality of service can help a little here, but when the congestion impacts the processing of the network equipment, extra delays will occur for the prioritized traffic as well.
Another source of network latency can be the implementation of anomaly detection systems making use of port spanning. A port span has some impact on the network equipment. Generally this is not much, but it depends very much on the base load of this equipment and on how the span is configured. Similarly, low-cost network taps can also add significant latency. This has caused issues in the field.
Another source of delay is the access filters. Ideally, when we segment an ICS network in its hierarchical levels (level 1, level 2, level 3), we want to restrict the traffic between the segments as much as possible, but specifically at level 1 and level 2 this can cause network latency that potentially impacts control critical functions such as peer-2-peer control and controller redundancy. Additionally, the higher the network latency, the fewer process points can be configured for overview graphics in operator stations, because these also have a configurable periodic scan, a scan rate that can also be temporarily raised by the operator for control tuning purposes.
The way vendors manage this traffic load is through their system specifications, limiting the number of operator stations, servers, and points per controller, and breaking up the segments into multiple clusters. These specifications are verified with intensive testing to meet performance under all foreseen circumstances. This type of testing can only be done on a test bed that supports the maximum configuration; the complexity and impact of these tests make it impossible to verify proper operation on an operational system. Because of this, vendors will always resist implementing functions in the level 1 / level 2 parts of the system that can potentially impact performance and are not tested.
In the field we very often see security controls implemented in a way that can cause serious issues, leading to dangerous situations and / or production stops. Controls are implemented without proper testing, configurations are created that cause a considerable processing delay, and networks are modified in such a way that a single mistake of a field service engineer can lead to a full production stop, in some cases impacting both the control side and the safety side.
Still, we need to find a balance between adding the required security controls to ICS and preventing serious failures. This requires a separate set of skills: engineers who understand how the systems operate and which requirements need to be met, and who have the capabilities to test the more intrusive controls before implementing them. This makes IT security very different from OT security.