Remote access


In times of Covid-19 the interest in remote access solutions has grown. Remote access has always been a hot topic for industrial control systems, some asset owners didn’t want any remote access, others made use of their corporate VPN solutions to create access to the control network, and some made use of the remote access facilities provided by their service providers. In this blog I will discuss a set of security controls to consider when implementing remote access.

There are multiple forms of remote access:

  • Remote access from an external organization, for example a service provider. This requires interactive access, often with privileged access to critical ICS functions;
  • Remote access from an internal organization, for example process engineers, maintenance engineers and IT personnel. Also often requires privileged access;
  • Remote operations, this provides the capability to operate the production process from a remote location. Contrary to remote access for support this type of connection requires 24×7 availability and should not hinder the process operator to carry out his / her task;
  • Remote monitoring, for example health monitoring of turbines and generators, well head monitoring and similar diagnostic and monitoring functions;
  • Remote monitoring of the technical infrastructure for example for network performance, or remote connectivity to a security operation center (SOC);
  • Remote updates, for example daily updates for the anti-virus signatures, updates for the IPS vaccine, or distribution of security patches.

The rules I discuss in this blog are for remote interactive access for engineering and support tasks, a guy or girl accessing the control system from a remote location for doing some work.

In this discussion I consider a remote connection to be a connection with a network with a different level of “trust”. I put “trust” between quotes because I don’t want to enter in all kind of formal discussions on what trust is in security, perhaps even if we should allow trust in our life as security professional.


RULE 1 – (Cascading risk) Enforce disjoint protocols when passing a DMZ.

The diagram shows a very common architecture, but unfortunately also one with very high risk because of allowing inbound access into the DMZ toward the terminal server and allowing outbound access from the terminal server to the ICS function using the same protocol.

Recently we have had several RDP vulnerabilities that were “wormable”, meaning a network worm can make use of the vulnerability to propagate through the network. So allowing a network worm that infects the corporate network to reach the terminal server and from there infect the control network. Which is high risk in times of ransomware spreading as a network worm.

This is a very bad practice and should be avoided! Not enforcing disjoint protocols increases what is called cascading risk, the risk that a security breach propagates through the network.

RULE 2 – (Risk of hacking) Prevent inbound connections. Architectures where the request is initiated from the “not-trusted” side, like in above terminal server example, require an inbound firewall port to be open to facilitate the traffic. For example in above diagram the TCP port for the RDP protocol.

Solutions using a polling mechanism, where a function on the trusted side polls a function on the not-trusted side for access requests, offers a smaller attack surface because the response channel makes use of the firewall’s stateful behavior, where the port is only temporarily open for just this specific session.

Inbound connections expose the service that is used, if there would be a vulnerability in this service this might be exploited to gain access. Prevent at all times such connections coming from the Internet. Direct Internet connectivity requires a very good protection, an inbound connection for remote access offers a high risk for compromise.

So also a very bad practice, unfortunately a practice I came across to several times because some vendor support organizations use such connectivity.

RULE 3 – (Risk of exposed login credentials) Enforce two-factor authentication. The risk that the access credentials are revealed through phishing attacks capturing the access credentials is relatively big. Two factor authentication adds to this the requirements that apart from knowing the credentials (login / password) the threat actor also needs to possess access to a physical token generator for login.

This raises the bar for a threat actor. Personally I have the most trust in a physical token like a key fob that generates a code. Alternatives are tokens installed in the PC, either as software or as a USB device.

RULE 4 – (Risk of unauthorized access) Enforce an explicit approval mechanism where someone on the trusted side of the connection explicitly needs to “enable” remote access. Typically after a request over the phone either the process operator / supervisor or a maintenance engineer needs to approve access before the connectivity can be established.

Multiple solutions exist, some solutions have this feature build-in, sometimes an Ethernet switch is used, and there are even firewalls where a digital input signal can alter the access rule.

Sometimes implicit approval seems difficult to prevent, for example access to sub-sea installations, access to installations in remote areas, or access to unmanned installations. But also for these situations implementing explicit approval is often possible with some clever engineering.

RULE 5 – (Risk of prohibited traffic) Don’t allow for end to end tunneled connections between a server in the control network and a server at the not trusted side (either corporate network or a service provider network)

Encrypted tunnels prevent the firewall to inspect the traffic, so bypass more detailed checks on the traffic. So best practice is to break the tunnel have it inspected by the firewall and reestablish the tunnel to the node on the trusted side of the network. Where to break the tunnel is often a discussion, my preference is to break it in the DMZ. Tunneled traffic might contain clear text communication, so we need to be careful where to expose this traffic if we open the tunnel.

RULE 6 – (Risk of unauthorized activity) Enforce for connectivity with external users, such as service providers, a supervision function where someone on the trusted side can see what the remote user does and intervene when required.

Systems exist that no only supervise the activity visual, but also log all activity allowing it to be replayed later in time.

RULE 7 – (Risk of unauthorized activity) Make certain there is an easy method to disconnect the connection. A “supervisor” on the trusted side (inside) of the connection must be able to disconnect the remote user. But prevent that this can be done accidentally, because if the remote user does some critical activity, his control over the activity shouldn’t be suddenly lost.

RULE 8 – (Risk of unauthorized activity) Restrict remote access to automation system nodes as much as possible. Remote access to a safety engineering function might not be a good idea, so prevent this where possible. Where in my view this should be prevented with a technical control, an administrative policy is fine but generally not considered by an attacker.

RULE 9 – (Risk of unauthorized access) Restrict remote access for a limited time. A remote access session should be granted for a controlled length of time. The time duration of the session needs to match the time requirement of the task, never the less there should be an explicit end time that is reasonable.

RULE 10 – (Risk of exposure confidential data) Enforce the use of secure protocols for remote access, login credentials should not pass the network in clear text at any time. So for example don’t use protocols such as Telnet for accessing network devices, use the SSH protocol that encrypts the traffic instead.

In principle all traffic that passes networks outside the trusted zones should be encrypted and end to end authenticated. Using certificates is a good option, but it better be a certificate specifically for your plant and not a globally used certificate.

In times of state sponsored attackers, the opponent might have the same remote access solution installed and inspected in detail.

RULE 11 – (Risk of exposure confidential data) Don’t reveal login credentials of automation system functions to external personnel from service providers. Employees of service providers generally have to support tens or hundreds of installations. They can’t memorize all these different access credentials, so quickly mechanism are used to store these. Varying from paper to Excel spreadsheets and password managers, prevent that a compromise of this information compromises your system. Be aware that changing passwords of services is not always an easy task, so control this cyber security hazard.

A better approach might be to manage the access credentials for power users, including external personnel, using password managers that support login as a proxy function. In these systems the user only needs to know his personnel login credentials and the proxy function will use the actual credentials in the background. This has several security advantages:

  • Site specific access credentials are not revealed, if access is no longer required, disabling / removing access to the proxy function is sufficient to block access without ever having compromised the system’s access credentials.
  • Enforcing access through such a proxy function blocks the possibility of hopping between servers, because the user is not aware of the actual password. (This does require to enforce certain access restrictions for users in general.)

Also consider separating the management of login credentials for external users from the management of login credentials for internal users (users with a login on the trusted side). You might want to differentiate between what a user is allowed to do remotely and what he can do when on site. Best to enforce this with technical controls.

RULE 12 – (Risk of unauthorized activity) Enforce least privilege for remote activities. Where possible only provide view-only capabilities. This often requires a collaborative session where a remote engineer guides a local engineer to execute the actions, however it reduces the possibilities for an unauthorized connection to be used for malicious activity.

RULE 13 – (Risk of unauthorized activity) Manage access into the control network from the secure side, the trusted side. Managing access control and authorizations from the outside of the network is like putting your door key outside under the doormat. Even if the task is done from remote, the management systems should be on the inside.

RULE 14 – (Risk of unauthorized activity) Detection. Efi Kaufman addressed in a response a very valid point.

We need to build our situational awareness. The whole idea of creating a DMZ zone is to have one zone where we can do some detailed checks before we let the traffic in. In RULE 5, I already mentioned to break the tunnel open so we can inspect, but there is of course lots more to inspect. If we don’t apply detection mechanisms we have an exclusive focus on prevention assuming everyone behaves fine when we open the door.

This is unfortunately not true, so a detection mechanism is required to check if nothing suspicious happens. Exclusively focusing on prevention is a common trap, and I fell in it!

Robin, a former pen tester pointed out that it is important to monitor the antivirus function, as pentester he was able to compromise a system, because av triggering on the payload was not monitored, giving him all the time to investigate and modify the payload until it worked.

RULE 15 – (Risk of hacking, cascading risk) Patching, patching, patching. There is little excuse not to patch remote access systems, or systems in the DMZ in general. Exposure demands that we patch.

Beware that these systems can have connectivity with the critical systems, their users might be logged in using powerful privileges, privileges that could be misused by the attacker. Therefore patching is also very important.

RULE 16 – (Risk of malware infection, cascading risk) Keep your AV up to date. While we start to do some security operation tasks, better to make sure our AV signatures are up to date.

RULE 17 – (Risk unauthorized activities) Robin addressed in a response the need for enforcing that the remote user logs for each request the purpose of the remote access request. This facilitates identification if processes are followed, and people are not abusing privileges or logging in daily for 8 hours/day for a month instead of coming to site.


Seventeen simple rules to consider when implementing remote access that popped up in my mind while typing this blog.

If I missed important controls please let me know than I will add them.

Use these rules when considering how to establish remote connectivity for support type of tasks. Risk appetite differs, so engineers might only want to select some rules and accept a bit higher risk.

But Covid-19 should not lead to an increased risk of cyber incidents by implementing solutions that increase exposure on the automation systems in an irresponsible manner.


There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.


Author: Sinclair Koelemij

Thanks to CISO CLUB a Russian version

Are Power Transformers hackable?

There is a revived attention for supply chain attacks after the seize of a Chinese transformer in the port of Houston. While on its way to a US power transmission company – Western Area Power Administration (WAPA) – the 226 ton transformer was rerouted to Sandia National Laboratories in Albuquerque New Mexico for inspection on possible malicious implants.

The sudden inspection happens more or less at the same time that the US government issued a presidential directive aiming for white listing vendors allowed to supply solutions for the US power grid, and excluding others to do so. So my curiosity is raised and additionally triggered by the Wall Street Journal claim that transformers do not contain software-based control systems and are passive devices. Is this really true in 2020? So the question is, are power transformers “hackable” or must we see the inspection exclusively as a step in increasing trade restrictions.


Before looking into potential cyber security hazards related to the transformer, let’s first look at some history of supply chain “attacks” relevant for industrial control systems (ICS). I focus here on supply chain attacks using hardware products because in the area of software products, Trojan horses are quite common.

Many supply chain attacks in the industry are based on having purchased counterfeit products. Frequently resulting in dangerous situations, but generally driven by economic motives and not so much by a malicious intent to damage the production installation. Some examples of counterfeits are:

  • Transmitters – We have seen counterfeit transmitters that didn’t qualify for the intrinsic safety transmitter zone qualifications specified by the genuine product sheet. And as such creating a dangerous potential for explosions in a plant when these products would be installed in zone 1 and zone 0 areas with a potential for the presence of explosive gases.
  • Valves – We have seen counterfeit valves, where mechanical specifications didn’t meet the spec sheet of the genuine product. This might lead to the rupture of the valve resulting in a loss of containment with potential dangerous consequences.
  • Network equipment – On the electronic front we have seen counterfeit Cisco network equipment that could be used to create a potential backdoor in the network.

However it seems that the “attack” here is more an exploit of the asset owner’s vulnerability for low prices (even if they sound ridiculously low), in combination with highly motivated companies trying to earn some fast money, than an intentional and targeted attack on the asset integrity of an installation.

That companies selling these products are often found in Asia, with China as the absolute leader according to reports, is probably caused by a different view / attitude toward patents, standards and intellectual property in a fast growing region and additionally China’s economic size. Not necessarily a plot against an arrogant Western world enemy.

The most spectacular example of such an incident is where counterfeit Cisco equipment ended up in the military networks of the US. But as far as I know, it was also in this case never shown that the equipment’s functionality was maliciously altered. Main problem was a higher failure rate caused by low manufacturing standards, potentially impacting the networks reliability. Never the less also here a security incident because of the potential for malicious functionality.

Also proven malicious incidents have occurred, for instance in China, where computer equipment was sold with already pre-installed malware. Malware not detectable by antivirus engines. So the option to attack industrial control systems through the supply chain certainly exist, but as far as I am aware never succeeded.

But there is always the potential that functionality is maliciously altered, so we need to see above incidents as security breaches and consider them to be a serious cyber security hazard we need to address. Additionally power transformers are quite different from the hardware discussed above, so a supply chain attack on US power grid using power transformers is a different analysis. If it would happen and was detected it would mean end of business for the supplier, so stakes are high and chances that it happens are low. Let’s look now at the case of the power transformer.


For many people, a transformer might not look like an intelligent device. But in today’s world everything in the OT world becomes smart (not excluding the possibility we ourselves might be the exception), so we also have smart power transformers today. Partially surfing on the waves of the smart hype, but also adding new functions that can be targeted.

Of course I have no information on the specifications of the WAPA transformer, but it is a new transformer so probably making use of today’s technology. Since seizing a transformer is not a small thing, transformers used in the power transmission world are designed to carry 345 kilo volts or more and can weigh as much as 410 tons (820.000 lb in the non-metric world), there must be a good reason to do so.

One of the reasons is of course that it is very critical and expensive equipment (can be $5.000.000+) and is built specifically for the asset owner. If it would fail and be damaged, replacement would take a long time. So this equipment must not only be secure, but also be very reliable. So worth an inspection from different viewpoints.

What would be the possibilities for an attacker to use such a huge lump of metal for executing a devious attack on a power grid. Is it really possible, are there scenarios to do so?

Since there are many different types of power transformers, I need to make a choice and decided to focus on what are called conservator transformers, these transformers have some special features and require some active control to operate. Looking at OT security from a risk perspective, I am more interested in if a feasible attack scenario exists – are there exposed vulnerabilities to attack, what would be the threat action – then in a specific technical vulnerability in the equipment or software that make it happen today. To get a picture of what a modern power transformer looks like, the following demo you can play with (demo).

Look for instance at the Settings tab and select the tap position table from where we can control or at minimum monitor the onload tap changes (OLTC). Tap changers select variable turn ratios to either increase or decrease the turn ratio to regulate the output voltage for variations on the input side. Another interesting selection you find when selecting the Home icon, leading you directly to the Buchholz safety relay. Also look at the interface protocol Goose, I would say it all looks very smart.

I hope everyone realizes from this little web demo, that what is frequently called a big lump of dumb metal might actually be very smart and containing a lot more than a view sensors to measure temperature and level as the Wall Street Journal suggests. Like I said I don’t know WAPA’s specification, so maybe they really ordered a big lump of dumb metal but typically when buying new equipment companies look ahead and adopt the new technologies available.

Let’s look in a bit more detail to the components of the conservator power transformer, being a safety function the Buchholz relay is always a good point to start if we want to break something. The relay is trying to prevent something bad from happening, what is this and how does this relay counter this, can we interfere?

A power transformer is filled with insulating oil to insulate and serve as a coolant between the windings. The Buchholz relay connects between the overhead conservator (a tank with insulating oil) and the main oil tank of the transformer body. If a transformer fails, or is overloaded this causes extra heat, heated insulating oil forms gas and the trapped gas presses the insulating oil level further down (driving it into the conservator tank passing the Buchholz relay function) so reducing the insulation between the windings. The lower level could cause an arc, speeding up the process and causing more gas pressure, pressing the insulating oil even more away and exposing the windings.

It is the Buchholz relay’s task to detect this and operate a circuit breaker to isolate the transformer before the fault causes additional damage. If the relay wouldn’t do its task quick enough the transformer windings might be damaged causing a long outage for repair. In principal Buchholz relays, as I know them, are mechanical devices working with float switches to initiate an alarm and the action. So I assume there is not much to tamper with from a cyber point of view.

How about the tap changer? This looks more promising, specifically an on load tap changer (OLTC). There are various interesting scenarios here, can we make step changes that impact the grid? When two or more power transformers work in parallel, can we create out-of-step situations between the different phases by causing differences in operation time?

An essential requirement for all methods of tap changing under load is that circuit continuity must be maintained throughout the tap stepping operation. So we need a make-before-break principle of operation, which causes at least momentary, that a connection is made simultaneously to two adjacent taps on the transformer. This results in a circulating current between these two taps. To limit this current, an impedance in the form of either resistance or inductive reactance is required. If not limited the circulating current would be a short-circuit between taps. Thus time also plays a role. Voltage change between taps is a design characteristic of the transformer, this is normally small approximately 1.25% of the nominal voltage. So if we want to do something bad, we need to make a bigger step than expected. The range seems to be somewhere between +2% and -16% in 18 steps, so quite a jump is possible if we can increase the step size.

To make it a bit more complex, a transformer can be designed with two tap changers one for in phase changes and one for out of phase changes, this also might provide us with some options to cause trouble.

So plenty of ingredients seem to be available, we need to do things in a certain sequence, we need to do it within a certain time, and we need to limit the change to prevent voltage disturbances. Step changers use a motor drive, and motor drives are controlled by motor drive units, so it looks like we have an OT function. Again a bit larger attack surface than a few sensors and a lump of metal would provide us. And then of course we saw Goose in the demo, a protocol with issues, and we have the IEDs that control all this and provide protection, a wider scope to investigate and secure but not part of the power transformer.

Is this all going to happen? I don’t think so, the Chinese company making the transformers is a business, and a very big one. If they would be caught tampering with the power transformers than that is bad for business. Can they intentionally leave some vulnerabilities in the system, theoretically yes but since multiple parties (the delivery contains also non-Chinese parts) are involved it is not likely to happen. But I have seen enough food for a more detailed analysis and inspection to find it very acceptable that also power transformers are assessed for their OT security posture when used in critical infrastructure.

So on the question are power transformers hackable, my vote would be yes. On the question will Sandia find any malicious tampering, my vote would be no. Good to run an inspection but bad to create so much fuss around it.


There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.


Sinclair Koelemij

The Purdue Reference Model outdated or up-to-date?


Is the Purdue Reference Model (PRM) outmoded? If I listen to the Industrial Internet of Things (IIoT) community, I would almost believe so. IIoT requires connectivity to the raw data of the field devices and our present architectures don’t offer this in an easy and secure way. So let’s look in a bit more detail to the PRM and the IIoT requirements, to see if they really conflict or can coexist side by side?


I start the discussion with the Purdue Reference Model, developed in the late 80’s. Developed in a time that big data was anything above 256 Mb and the Internet was still in its early years, a network primarily used by and for university students. PRM’s main objective was creating a hierarchical model for manufacturing automation, what was called computer integrated manufacturing (CIM) in those days. If we see the model as it was initially published we note a few things:

  • The model has nothing to do with a network, or security. There are no firewalls, there is no DMZ.
  • There is an operator console at level 1. Maybe a surprise for people who started to work in automation in the last 20 years, but in those days quite normal.
  • Level 4 has been split in a level 4A and a level 4B, to segment functions that directly interact with the automation layers and functions requiring no direct interface.
  • There is no level 0, the Process layer is what it is a bunch of pipes, pumps, vessels and some clever stuff.

Almost a decade later we had ANSI / ISA-95 that took the work of the Purdue university and extended it with some new ideas and created an international standard that became very influential for ICS design. It was the ANSI / ISA-95 standard that named the Process level, level 0. But in those days level 0 was still the physical production equipment. The ISA-95 standard says following on level 2, level 1, and level 0 : ” Level 0 indicates the process, usually the manufacturing or production process. Level 1 indicates manual sensing, sensors, and actuators used to monitor and manipulate the process. Level 2 indicates the control activities, either manual or automated, that keeps the process stable or under control.” So level 0 is the physical production equipment, and level 1 includes the sensors and actuators. It was many years later that people started using level 0 for the field equipment and their networks, all new developments with the rise of Foundation Fieldbus, Profibus, the HART protocol, but never part of the standard. It was probably the ISA 99 standard that introduced a new layer between level 4 and level 3, the DMZ. It was the vendor community that started giving it a level name, level 3.5. But level 3.5 has never been a functional level, it was an interface between two trust zones for adding security. Though often the way it was implemented contributed little to security, but it was a nice feeling to say we have a demilitarized zone between the corporate network and the control network. So far the history of the Purdue Reference Model and ISA-95 and their contribution to the levels. Now lets have a look at how a typical industrial control system (ICS) looks without immediately using names for levels.


From a functional viewpoint we can split a traditional ICS architecture in 5 main parts:

  • Production management system, which task it is to control the automation functions. Typical domain of the DCS and SCADA systems. But also compressor control (CCS), power management systems (PMS) and the motor control center (MCC) reside here. basically everything that takes care of the direct control of the production process. And of course all these systems interact with each other using a wide range of protocols, many of them with security short comings;
  • Operations management system, which task it is to optimize the production process, but also to monitor and diagnose the process equipment (for example machine monitoring functions such as vibration monitoring), and various management functions such as asset management, accounting and reconciliation functions to monitor the mass balance of process units, and emission monitoring systems to meet environmental regulations;
  • Information systems is the third category, these systems collect raw data and transform it into information to be used for historical trends or to feed other systems with information. The objective here is to transform the raw data into information and ultimately information into knowledge. The data samples stored are normally one minute snapshots, hourly averages, shift averages, daily averages, etc. An other example of an information system is custody metering system (CMS) for measuring product transfer for custody purposes.
  • The last domain of the ICS is the Application domain, for example for advanced process control (APC) applications, traditionally performing advisory functions. But overtime the complexity of running an production processes grew, response to changes in the process became more important so advisory functions were taking over the task of the process operator immediately changing the setpoints using setpoint control or controlling the valve with direct digital control functions. There are plants today that if APC would fail, the production is no longer profitable or can’t reach the required product quality levels.
  • Finally there is the Business domain, generally not part of the ICS. in the business domain we mange for example the feed stock and products to produce. It is the decision domain.
Functional model of a chemical plant or refinery

The production management systems are shown in the light blue color, the operations management, information management, and application systems in the light purple color. The model seems to comply with the models of the ANSI / ISA-95 standard. However this model is far from perfect, because an operations management function as vibration monitoring, or corrosion monitoring also have sensors that connect to the physical system. And an asset management system requires information from the field equipment. If we consider a metering system, also part of the information function, it is actually a combination of a physical system and sensors.


And then of course we have all the issues around vendor support obligations. Asset owners expect systems with a very high level of reliability, vendors have to offer this level of reliability in a challenging real-time environment by testing and validating their systems and changes rigorously. Than there is nothing worse than suddenly having to integrate with the unknown, a 3rd party solution. As a result we see systems that have a level 2 function connected to level 3, in principle exposing a critical system more and making the functionality rely on more components so reducing reliability.


So many conflicts in the model, still it has been used for over 20 years, and it still helps us to guide us in securing systems today. If we take the same model and add some basic access security controls we get the following:

At high level this represents the architecture in use today, in smaller systems the firewall between level 2 and 3 might miss, some asset owners might prefer a DMZ with two firewalls (from different vendors to enforce diversity), the level 4A hosts those functions that interface with the control network and generally different policies are enforced for these systems than required at level 4B. For example asset owners might enforce the use of desktop clients, might limit Internet access for these clients, might restrict email to internal email only, etc. All to reduce the chance on compromising the ICS security.


Despite that the reference model is not supporting all of the new functions we have in plants today – for example I didn’t discuss wireless, virtualization, and system integration topics at level 2 – the reference model still helps structure the ICS. Is it impossible to add the IIoT function to this model? Let’s have a look at what IIoT requires?


In an IIoT environment we have a three level architecture:

  • The edge tier – The edge tier is for our discussion the most important, this is where the interface is created with the sensors and actuators. Sometimes existing sensors / actuators, but also new sensors and actuators.
  • The platform tier – The platform tier processes and transforms the data and controls the direction of the data flow. So from that point also of importance for the security of the ICS.
  • The enterprise tier – This tier hosts the application and business logic that needs to provide the value. It is external to the ICS, so as long as we have a secure end to end connection between the platform tier and the enterprise tier ICS security needs to trust that the systems and data in this tier are properly secured.

The question is can we reconcile these two architectures? The answer seems to be what is called the gateway-mediated edge. This is a device that aggregates and controls the data flows from all the sensors and actuators providing a single point of connection to the enterprise tier. This type of device also performs the necessary translation for communicating with the different protocols used in ICS. The benefits of such a gateway device is the scalability of the solution together with controlling the entry and exit of the data flows from a single controllable point. In this model some of the data processing is kept local, close to the edge, allowing controlled interfaces with different data sources such as Foundation Field Bus, Profibus, and HART enabled field equipment. To implement such an interface device doesn’t require to change the Purdue reference model in my opinion, it can make use of the same functional architecture. Additionally new developments such as the APL (Advanced Physical Layer) technology, the technology that will remove the I/O layer as we know it today, will fully align with this concept.


So I am far from convinced that we need a new model, also today the model doesn’t reflect all the data flows required by today’s functions. The very clear boundaries we used to have, have disappeared with the introduction of new technologies such as virtualization and wireless sensor networks. Centralized process computer systems of the 70s, have disappeared becoming decentralized over the last 30 years and are now moving back into centralized data centers where hardware and software have lose ties. Today these data centers are primarily onsite, but over time confidence will grow and larger chunks of the system will move to the cloud.

Despite all these technology changes, the hierarchical structure of the functions hasn’t change that much. It is the physical location of the functions that changes, undoubtedly demanding a different approach to cyber security. It are the cyber security standards of today that are dated on the day they are released. The PRM was never about security or networking, it is a hierarchical functional model for the relationship between functions which is as relevant today as it was 30 years ago.

We have a world of functional layers, and between these layers data is exchanged, so we have boundaries to protect. As long as we have bi-directional control over the data flows between the functional layers and keep control over where the data resides, we protect the overall function. If there is something we need to change it are the security standards, not the Purdue reference model. But we need to be careful with the security of these gateways as the recent OSI PI security risk has learned us.


There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.


Sinclair Koelemij

TRISIS revisited


For this blog I like to go back in time, to 2017 the year of the TRISIS attack against a chemical plant in Saudi Arabia. I don’t want to discuss the method of the attack (this has been done excellently in several reports) but want to focus on the potential consequences of the attack because I noticed that the actual threat is underestimated by many.

The subject of this blog was triggered after reading Joe Weiss’ blog on the US presidential executive order and noticing that some claims were made in the blog that are incorrect in my opinion. After checking what the Dragos report on TRISIS wrote on the subject, and noticing a similar underestimation of the dangers I decided to write this blog. Let’s start with summing up some statements made in Joe’s blog and the Dragos report that I like to challenge.


I start with quoting the part of Joe’s blog that starts with the sentence: “However, there has been little written about the DCS that would also have to be compromised. Compromising the process sensors feeding the SIS and BPCS could have led to the same goal without the complexity of compromising both the BPCS and SIS controller logic and the operator displays.” The color high lights are mine to emphasize the part I like to discuss.


The sentence seems to suggest (“also have to be compromised”) that the attacker would ultimately also have to attack the BPCS to be effective in an attempt to cause physical damage to the plant. For just tripping the plant by activating a shutdown action the attacker would not need to invest in the complexity of the TRISIS attack. Once gaining access to the control system at the level the attackers did, tens of easier to realize attack scenarios were available if only a shutdown was intended. The assumption that the attacker needs the BPCS and SIS together to cause physical damage is not correct, the SIS can cause physical damage to the plant all by it self. I will explain this later with a for safety engineers well known example of an emergency shutdown of a compressor.


Next I like to quote some conclusions in the (excellent) Dragos report on TRISIS. It starts at page 18 with:

Could This Attack Lead to Loss of Life?

Yes. BUT, not easily nor likely directly. Just because a safety system’s security is compromised does not mean it’s safety function is. A system can still fail-safe, and it has performed its function. However, TRISIS has the capability to change the logic on the final control element and thus could reasonably be leveraged to change set points that would be required for keeping the process in a safe condition. TRISIS would likely not directly lead to an unsafe condition but through its modifying of a system could deny the intended safety functionality when it is needed. Dragos has no intelligence to support any such event occurred in the victim environment to compromise safety when it was needed.


The conclusion that the attack could not likely lead to the loss of life, is in my opinion not a correct conclusion and shows the same underestimation as made by Joe. As far as I am aware the part of the modified logic has never been published (hopefully someone did analyze) so the scenario I am going to sketch is just guessing a potential objective. It is what is called a cyber security hazard, it could have occurred under the right conditions for many production systems including the one in Saudi Arabia. So let’s start with explaining how shutdown mechanisms in combination with safety instrumented systems (SIS) work, and why some of the cyber security hazards related to SIS can actually lead to significant physical damage and potential loss of life.


A SIS has different functions like I explained in my earlier blogs. A little bit simplified summary, there is a preventative protection layer the Emergency Shutdown System (ESD) and there is a mitigative layer, e.g. the Fire & Gas system detecting fire or gas release and activating actions to extinguish fires and to alert for toxic gases. For our discussion I focus on the ESD function, but interesting scenarios also exist for F&G.

The purpose of the ESD system is to monitor process safety parameters and initiate a shutdown of the process system and/or the utilities if these parameters deviate from normal conditions. A shutdown function is a sequence of actions, opening valves, closing valves, stopping pumps and compressors, routing gases to the flare, etc. These actions need to be done in a certain sequence and within a certain time window, if someone has access to this logic and modifies the logic this can have very serious consequences. I almost would say, it always has very serious consequences because the plant contains a huge amount of energy (pressure, temperatures, rotational speed, product flow) that needs to be brought to a safe (de-energized) state in a very short amount of time, releasing incredible powers. If an attacker is capable of tampering with this shutdown process serious accidents will occur.


Let’s discuss this scenario in more detail in the context of a centrifugal compressor, most plants have multiple so always an interesting target for the “OT certified” threat actor. Centrifugal compressors increase the kinetic energy of for example a gas into a pressure so a gas flow through pipelines is created either to transfer a product through the various stages of the production process or perhaps to create energy for opening / closing pneumatic driven valves.

Transient operations, for example the start-up and shutdown of process equipment, always have dangers that need to be addressed. An emergency shutdown because there occurred in the plant a condition that demanded the SIS to transfer the plant to a safe state, is such a transient operation. But in this case unplanned and in principle fully automated, no process operator to guard the process and correct where needed. The human factor is not considered a very reliable factor in functional safety and is often just too slow. SIS on the other hand is reliable, the redundancy and the continuous diagnostic checks all warrant a very low failure on demand probability for SIL 2 and SIL 3 installations. They are designed to perform when needed, no excuses allowed. But this is only so if the program logic is not tampered with, the sequence of actions must be performed as designed and is systematically tested after each change.

Compressors are controlled by compressor control systems (CCS), one of the many sub-systems in an ICS. The main task of a compressor control system is anti surge control. The surge phenomenon in a centrifugal compressor is a complete breakdown and reversal of the flow through the compressor. A surge causes physical damage to the compressor and pipelines because of the huge forces released if a surge occurs. Depending on the gas this can also lead to explosions and loss of containment.

An anti surge controller of the CCS continuously calculates the surge limit (which is dynamic) and controls the compressor operation to stay away from this point of danger. This all works fine during normal operation, however when an emergency shutdown occurs the basic anti surge control provided by the CCS has shown to be insufficient to prevent a surge. In order to improve the response and prevent a surge, the process engineer has two design options called a hot bypass method or a cold bypass method recycling the gas to allow for a more gradual shutdown. The hot bypass is mostly used because of its closeness to the compressor which results into a more direct response. Such a hot bypass method requires to open some valves to feed the gas back to the compressor, this action is implemented as a task of the ESD function. The quantity of gas that can be recycled has a limit, so it is not just opening the bypass valve to 100% but opening it with the right amount. Errors in this process or a too slow reaction would easily result into a surge, damaging the compressor, potentially rupturing connected pipes, causing loss of containment, perhaps resulting in fire and explosions, and potentially resulting in casualties and a long production stop with high repair cost.


All of this is under control of the logic solver application part of the SIS. If the TRISIS attacker’s would have succeeded into loading altered application logic, they would have been capable of causing physical damage to the production installation, damage that could have caused loss of life.

So my conclusion differs a bit, an attack on a SIS can lead to physical damage when the logic is altered, which can result in loss of life. A few changes in the logic and the initiation of the shutdown action would have been enough to accomplish this.


This is just one example of a cyber security hazard in a plant, multiple examples exist showing that the SIS by itself can cause serious incidents. But this blog is not supposed to be a training for certified OT cyber terrorist so I keep it with this for safety engineers well known example.

Proper cyber security requires proper cyber security hazard identification and hazard risk analysis. This has too little focus and is sometimes executed at a level of detail insufficient to identify the real risks in a plant.

I don’t want to criticize the work of others, but do want to emphasize that OT security is a specialism not a variation on IT security. ISA published a book “Security PHA Review” written by Edwar Marsazal and Jim MgClone which addresses the subject of secure safety systems in a for me far too simplified manner by basically focusing on an analysis / review of the process safety hazop sheet to identify cyber related hazards.

The process safety hazop doesn’t contain the level of detail required for a proper analysis, neither does the process safety hazop process assume malicious intent. One valve may fail, but multiple valves at the same time in a specific sequence is very unlikely and not considered. While these options are fully open to the threat actor with a plan.

Proper risk analysis starts with identifying the systems and sub-systems of an ICS, than identifying cyber security hazards in these systems, identifying which functional deviations can result from these hazards, and than translate how these functional deviations can impact the production system. That is much more than a review of a process hazop sheet on “hackable” safeguards and causes. That type of security risk analysis requires OT security specialists with a detailed knowledge on how these systems work, what their functional tasks are, and the imagination and dose of badness to manipulate these systems in a way that is beneficial for an attacker .

Sinclair Koelemij

How does advisory ICSA-20-133-02 impact sensor security?

May 12, CISA issued an ICS advisory on the OSIsoft PI system, ICSA-20-133-02. OSI PI is an interesting system with interfaces to many systems, so always good to have a closer look when security flaws are discovered. The CISA advisory lists a number of potential consequences that can result from a successful attack. Among which:

  • A local attacker can modify a search path and plant a binary to exploit the affected PI System software to take control of the local computer at Windows system privilege level, resulting in unauthorized information disclosure, deletion, or modification.
  • A local attacker can exploit incorrect permissions set by affected PI System software. This exploitation can result in unauthorized information disclosure, deletion, or modification if the local computer also processes PI System data from other users, such as from a shared workstation or terminal server deployment.

Because the OSI PI system also has the capability to interface with field equipment using HART-IP, I became curious what cyber security hazards related to field equipment security are induced by this flaw. Even though the advisory mentions an attack by a “local attacker”, a local attacker is easily replaced by some sophisticated malware created by nation state sponsored threat actors. So local or remote attacker doesn’t make a big difference here.

To get more detail there are two interesting publications on how the HART-IP connector is used for collecting data from field equipment:

  1. OSIsoft Live library – Introduction to PI connector for HART-IP
  2. Emerson press bulletin together with an interesting architecture published.

These documents show the following architecture.

Figure 1 – Breached HART IP connector having access to all field equipment

If an attacker or malware gains access to the server executing the HART IP connector, and the security advisory seems to suggest this possibility, an attacker can gain simple access to the field equipment through using the configured virtual COM ports that connect the server with the HART multiplexers. The OSIsoft document describes the HART commands used to collect the data.  Among others it starts with sending a command 0 to the HART multiplexer, the connected field equipment will return information on the vendor, the device type, and some communication specific details among which the address. In a HART environment it is not required to know the specific addresses and type of connected field devices, the HART devices report this information to the requester using the various available commands. Applications such as asset management systems for field equipment are “self configuring”, they get all the information they need from the sensor and actuators. Only additional configuration required is adding tagnames and organizing the equipment in logic groups.

But when an attacker gets access to the OSI PI connector (perhaps through malware), it is quite simple (even scriptable) to inject other commands toward the field equipment, commands such as command 42 (Field device reset) or command 52 (Set device variable to zero) and a long list of other destructive commands that can modify the range, the engineering units, the damping values and some field devices even allow that the low range can be set higher than the high range value. Such a change would effectively reverse the control direction.

The situation can be even worse if both the field devices of the BPCS and SIS would be connected to a common system. In this case it becomes possible to launch a simultaneous attack on the BPCS and SIS, potentially crippling both systems at the same time with potential devastating consequences for the production equipment and the safety of personnel. See also my blogs “Interfaced or Integrated” and “Cybersecurity and safety, how do they converge?”. We always need to be careful putting all our eggs in the same basket.

Often these systems (other examples are a Computerized Maintenance Management System (CMMS) and Instrument Asset Management System (IAMS)) reside at level 3 of the process control network. I consider such an architecture a bad practice, exposure of the field equipment is raised this way. There should never be a path from level 3 to level 0 (where the field equipment resides) without a guarantee that data can only be read. In my opinion such an architecture poses a high cyber security risk.

The recently published OSI PI security issue shows that we have to be careful with how we connect systems, and what the consequences are when such a system would be breached. We create network segments to reduce the risk for the most critical parts of the system such as field devices. Many might say this application is just an interface that only collects data from field instruments for analysis purposes and therefore it does not create a high risk. This assessment will be completely different when we consider what a threat actor can do when he/she gains access to the server and misuses the functionality available.

Like I stated in my blog on sensor security, the main risk for field equipment is not their inherent insecurity but the way we connect the equipment in the system. Proper architecture is a key element in OT security. This blog is another example for this statement.


There is no relationship between my opinions and publications in this blog and the views of my employer in whatever capacity.


Author: Sinclair Koelemij

Are sensors secure, is life an unhealthy affair?

Introduction

Another topic increasingly discussed in the OT security community is the security of our field equipment. The sensors through which we read process values and the actuators that control the valves in the production system. But if we would look at the automation functions in the power management systems we might add devices like IEDs and feeder relays. In this blog I focus on the instrumentation as used for controlling a refining processes, chemical processes, or oil and gas in general. The power management is an entirely different world, using different architectures and different instrumentation. Another area I don’t discuss are the wireless sensors used by technologies such as ISA 100 and WirelessHART. Another category I don’t discuss in this blog, are the actuators using a separate field unit to control the valve. The discussion in this blog focuses on architectures using either the traditional I/O or field bus concepts, which represent the majority of solutions used in the industry.

The first question I like to raise is when is a device secure, is this exclusively a feature of the intrinsic qualities of the device or do other aspects play a role? To explain I like to make an analogy with personal health. Am I considered not healthy because my personal resilience might be breached by a virus, toxic gas, fire, a bullet, or a car while crossing a zebra? I think when we consider health we accept that living is not a healthy exercise if we don’t protect the many vulnerabilities we have. We are continuously exposed by a wide range of attacks on our health. However we seem to accept our intrinsic vulnerabilities and choose to protect these by wearing protective masks, clothes, sometimes helmets and bullet free vests depending on the threat and the ways we are exposed. Where possible we mitigate the intrinsic vulnerabilities by having an operation or taking medicine, and sometimes we adapt our behavior by exercising more, eating less, and stop smoking. But I seldom have heard the statement “Life is an unhealthy affair”, though there are many arguments to support such a statement.

So why are people saying sensors are not secure? Do we have different requirements for a technical solution than the acceptance we seem to have of our own inadequacies? I would say certainly, when ever we develop something new we should try our best to establish functional perfection and nowadays we add to this resilience against cyber attacks. So any product development process should take cyber security into account, including threat modelling. And if the product is ready for release it should be tested and preferably certified for its resilience against cyber attacks.

But how about these hundreds of millions (if not billions) sensors in the field that sometimes do their task for decades and if they fail are replaced by identical copies to facilitate a fast recovery. Can we claim that these sensors are not secure, just because their intrinsic cyber security vulnerabilities exist? My personal opinion is that even an intrinsically vulnerable sensor can be secured when we analyze the cyber security hazards and control the sensor’s exposure. I would even extend this claim for embedded control equipment in general, it is the exposure that drives the insecurity once exposure is controlled we reduce the risk of becoming a victim of a cyber attack.

Some cyber security principles

Which elements play a role in exposure? If we look at exposure we need to differentiate between asset exposure and channel exposure. The asset here is the sensor, the channel is the communication protocol, for example the HART protocol or some other protocol when we discuss Foundation Fieldbus or Profibus or any of the many other protocols in use to communicate with sensors. Apart from this, in risk analysis we differentiate between static exposure as consequence of our design choices, and dynamic exposure as consequence of our maintenance (or rather lack of maintenance) activities. To measure exposure we generally consider a number of parameters.

One such parameter would be connectivity, if we would allow access to the sensor from the Internet it would certainly have a very high degree of exposure. We can grade exposure by assigning trust levels to security zones. The assets within the zone communicate with other assets in other zones, based on this we can grade connectivity and its contribution to increasing the likelihood of a specific cyber security hazard (threat) to happen. Another factor that plays a role is complexity of the asset, in general the more complex an asset the more vulnerable it becomes because the chance on software errors grows with the number of code lines. A third parameter would be accessibility, the level of control over access to the asset also determines its exposure. If there is nothing stopping an attacker to access and make a change in the sensor, and there is no registration that this access and change occurred, the exposure is higher.

If we would have a vault with a pile of euros (my vendor neutrality objective doesn’t imply any regional preferences) and would put this vault with an open door at the local bus stop, it would soon be empty. The euros would be very exposed. Apart from assessing the assets we also need to assess the channel, the communication protocols. Is it an end to end authenticated connection, is it clear text or encrypted, or is it an inbound connection or strictly outbound, who initiates the communication, these type of factors determine channel exposure.

We can reduce exposure by adding countermeasures, such as segmentation or more general the selection of a specific architecture, adding write protections, authentications and encryption mechanisms. Another possibility available is the use of voting systems, for example 2oo4 systems as used in functional safety architectures to overcome the influence of a sensor’s mean time to failure (MTTF) on the probability of failure on demand (PFD), a specification that is required to meet a  safety integrity level (SIL).

I/O architectures

Input / output (I/O) is the part of the ICS that feeds the process controllers, safety controllers and PLCs with the process values measured by the sensors in the field. Flows, pressures, temperatures, levels are measured using the sensors, these sensors provide a value to the controllers using I/O cards (or at least that is the traditional way, we discuss other solutions later), this was originally an analog signal (e.g. 1 – 5V or 4-20 mA) and later a digital signal, or in the case of HART a hybrid solution where the digital signal is superimposed on the analog signal.

Figure 1 – Classic I/O architecture

The sensors are connected through two wires with the I/O module / card. Generally there are marshaling panels and junction boxes used to connect all the wires. In classic I/O, an I/O module also had a fixed function, for example an analog input, analog output, digital input, digital output, or counter input function. Sometimes special electronic auxiliary cards were used to further condition the signals such as adding a time delay on an analog output to control the valves travel rate. But also when much of this was replaced with digital communication the overall architecture didn’t change that much. A controller, had I/O cards (often multiple racks of I/O cards connected to an internal bus) and I/O cards connected to the field equipment. A further extension was the development of remote I/O to better support systems that are very dispersed. A new development in today’s systems is the use of I/O cards that have a configurable function, the same card can be configured as an analog output, a digital input, or analog input. This flexibility created major project savings, but of course like any function can also become a potential target. Just like the travel rate became a potential target when we replaced the analog auxiliary cards based on RC circuitry with a digital function that could be configured.

From a cyber security point of view, the sensors and actuators have a small exposure in this I/O architecture, apart from any physical exposure in the field where they could be accessed with handheld equipment. Today, many plants still have this type of I/O, which I would consider secure from a cyber security perspective. The attacker would need to breach the control network and the controller to reach the sensor. If this would happen the loss of control over the process controller would be of more importance than the security of the sensor.

However the business required, for both reasons of safety and cost, that the sensors could be managed centrally. Maintenance personnel would not have to do their work within the hazardous areas of the plant, but could do their work from their safe offices. The centralization additionally added efficiency and improved documentation to the maintenance process. So a connection was required between a central instrument asset management system (IAMS) and the field equipment to realize this function. There are in principle two paths possible for such a connection, either through an interface between the IAMS function and the controller over the network or a dedicated interface between the IAMS and the field equipment. If we take the HART protocol as example, the connection through the controller is called HART pass-through. The HART enabled I/O card has a HART master function that accepts and processes HART command messages that are send from the IAMS to the controller, which passes it on to the I/O card for execution. However for this to work the controller should either support the HART IP protocol or embed the HART messages in its vendor proprietary protocol. In most cases controllers accept only the proprietary vendor protocols, where PLCs often support the HART-IP protocol for managing field equipment. But the older generation process controllers, safety controllers, and PLCs doesn’t support the HART-IP protocol so a dedicated interface became a requirement. In the next diagram I show these two architectures in figure 2.

Figure 2 – Simplified architectures connecting an IAMS

The A architecture on the right of the diagram shows the connection over the control network segment between the controller and the IAMS, the traffic in this architecture is exposed if it is not using encryption. So if an attacker can intercept and modify this traffic he has the potential to modify the sensor configuration by issuing HART commands. Also when an attacker would get access to the IAMS it becomes possible to make modifications.

However most controllers support a write  enable / disable mechanism that prevents modifications to the sensor. So from a security point of view we have to enforce that this parameter is in the disable setting. Changing this setting is logged, so can be potentially monitored and alerted on. Of course there are some complications around the write disable setting which have to do with the use of what are called Device Type Managers (DTM) which would fill a whole new blog. But in general we can say if we protect the IAMS, the traffic between IAMS and controller, and enforce this write disable switch the sensors are secure. If an attacker would be able to exploit any of these vulnerabilities we are in bigger trouble than having less cyber resilient field equipment.

The B architecture on the left side of figure 2 is a different story. In this architecture a HART multiplexer is used to make a connection with the sensors and actuators. In principle there are two types of HART multiplexers in use, with slightly different cyber security characteristics. We have HART multiplexers that use the serial RS 485 technology allowing for various multi-drop connections with other multiplexers (number 1 in figure 2), and we have HART multiplexers that directly connect to the Ethernet (number 2 in figure 2).

Let’s start with the HART multiplexers making use of RS 485. There are two options to connect these with the IAMS, a direct connection using an RS 485 interface card in the IAMS server or by using a protocol converter that connects to the Ethernet in combination with the configuration of a virtual COM port in the IAMS server. When we have a direct serial link the exposure is primarily the exposure of the IAMS server, when we use the option with the protocol converter the HART IP traffic is exposed on the control network. However if we select the correct protocol converter, and configure it correctly, we can secure this with encryption that also provides us a level of authentication. If this architecture is correctly implemented then security of the sensor depends on security of the protocol converter and the IAMS. An important security control we miss in this architecture is the write protection parameter. This can be compensated by making use of tamper proof field equipment, this type of equipment has a physical dip switch on the instrument providing a write disable function. The new version of the IEC 61511 also suggests this control for safety related field equipment that can be accessed remotely.

The other type of HART multiplexer with a direct Ethernet interface is more problematic to secure. These multiplexers typically use a communication DTM. Without going into an in depth discussion on DTMs and how to secure the use of device type managers, there are always two DTMs. One DTM communicating with the field device understanding all the functions provided by the field device, and a communication DTM that controls the connection with the field device. A vendor of a HART multiplexer making use of an Ethernet interface typically also provides a communication DTM to communicate with the multiplexer. This DTM is software (a .net module) that runs into what is called an FDT (Field Device Tool), a container that executes the various DTMs supporting the different field devices. Each manufacturer of a field device provides a DTM (called a Device DTM) that controls the functionality available for the IAMS or other control system components that need to communicate with the sensor. The main exposure in this architecture is created by the Communication DTM, frequently the communication DTM doesn’t properly secure the traffic. Often encryption is missing, so exposing the HART command messages to modifications, and also authentication is missing allowing rogue connections with the HART multiplexer. Additionally there is often no hash code check on the DTM modules in the IAMS, allowing the DTM to be replaced by a Trojan adding some malicious functionality. The use of HART multiplexers does expose the sensors and adds additional vulnerabilities we need to address to the system. Unfortunately we see that some IIoT solutions are making use of this same HART multiplexer mechanism to collect the process values to send them into the cloud for analysis. If not properly implemented, and security assessments frequently conclude this, sensor integrity can be at risk in these cases.

Systems using the traditional I/O modules are costly because of all the wiring, so bus type systems were developed such as Foundation Field bus and Profibus. Additionally sensors and actuators gained processing power, this allowed for architectures where the field equipment could communicate and implement control strategies at field device level (level 0). Foundation Fieldbus (FF) supports this functionality for the field equipment. Just like with the I/O architecture we have two different architectures that pose different security challenges. I start the discussion with the two different FF architectures used in ICS.

Figure 3 – Foundation Fieldbus architectures

Architecture A in figure 3 is the most straight forward, we have a controller with an a Foundation Fieldbus (FF) interface card that connects to the H1 token bus of FF. The field devices are connected  to the bus using junction boxes and there is a power source for the field equipment. Control can take place either in the controller making use of the field devices or of field devices connected to its regular I/O. All equipment on the control network needs to communicate with the controller and its I/O and interface cards for sampling data or making modifications.

DTMs are not supported for FF, the IAMS makes use of device descriptor files that describe a field device in the electronic device descriptor language (EDDL). No active code is required, like we had with the HART solution when we used DTMs, the HART protocol is not used to communicate with the field equipment.

Also the HART solution in figure 2 supports device descriptor files, this was the traditional way of communicating with a HART field device, however this method offers less functionality (e.g. missing graphic capabilities) in comparison with DTMs. So the DTM capabilities became more popular, also the Ethernet connected HART multiplexer required the use of a communication DTM.

In architecture A the field device is shielded by the controller, the field device may be vulnerable but the attacker would need to breach the nodes connected to the control network before he / she can attempt to exploit a vulnerability in the field equipment. If an attacker would be able to do this the plant would already be in big trouble. So the exposed parts are the assets connected to the control network and the channels flowing in the control network, not so much the field equipment. Weakest vulnerability are most often the channels lacking secure communication that implements authentication and encryption.

In architecture B we have a different architecture, here the control system and IAMS interface with the FF equipment through a gateway device. The idea here is that all critical control takes place locally on the H1 bus. But if required also the controller function can make use of the field equipment data through the gateway. Management of the field equipment through the IAMS would be similar as in architecture A, however now using the gateway.

The exposure in this architecture (B) is higher the moment the control loops would depend on traffic passing the control network. This should be avoided for critical control loops. Next to the channel exposure, the gateway asset is the primary defense for access to the field equipment. It depends on the capabilities to authenticate traffic and the overall network segmentation how strong this defense is. In general architecture A offers more resilience against a cyber attack than architecture B. From a functional point of view architecture B offers several benefits over architecture A.

There are many other bus systems in use such as for example Devicebus, Sensorbus, AS-i bus, and Profibus. Depending on the vendor and regional factors their popularity differs. To limit the length of the blog I pick one and end with a discussion on Profibus (PROcess FieldBUS), which is an originally German standard that is frequently used within Europe. Field buses were developed to reduce point to point wiring between I/O cards and the field devices. Their primary difference between the available solutions is the way they communicate. For example, Sensorbus communicates at bit level for instance with proximity switches, buttons, motor starters, etc. Devicebus allows for communication in the byte range (8 bytes max per message), a bus system used when for example diagnostic information is required or larger amounts of data need to be exchanged.  The protocol is typically used for discrete sensors and actuators.

Profibus has several applications, there is for example Profibus DP, short for Decentralized Peripherals, this is an RS 485 based protocol. Profibus DP is primarily used in the discrete manufacturing industry. Profibus PA, short for Process Automation, is a master / slave protocol developed to communicate with smart sensors. It supports such features as floating point values in the messages, it is similar to Foundation Fieldbus with an important difference that the protocol doesn’t support control at the bus level. So when we need to implement a control loop, the control algorithm runs at controller level where in FF it can run at fieldbus level.

Figure 4 – Profibus architectures

Similar to the FF architectures, also for the Profibus (PB) architectures we have the choice between an interface directly with a controller (architecture A) or a gateway interface with the control network (architecture B). An important difference here is that contrary to the FF functionality, in a PB system there can’t be any local control. So we can collect data from the sensors and actuators and we can send data.

Because we don’t have local control, the traffic to the field devices and therefore all control functionality is now exposed to the network. To reduce this exposure often a micro firewall is added so the Profibus gateway can directly communicate with the controller without its traffic being exposed.

In architecture A the field devices are again shielded from a direct attack by the controller, in architecture B it is the gateway that shields the field devices but the traffic is exposed. Architecture C solves this issue by adding the micro firewall which shields the traffic between controller and field equipment. Though architecture B is vulnerable, there are proper solutions available and in use to solve this. So this architecture should be avoided.

Never the less also in these architectures the exposure of the field equipment is primarily indirectly through the control network, and when an attacker gains access to this network no direct access to the field equipment should be possible.

So far we didn’t discuss IIoT for the FF and PB architectures. We mentioned a solution when we discussed the HART architecture, where we saw a server was added to the system that used the HART IP protocol to collect the data from the field equipment, similar architectures exist for both Foundation Fieldbus and Profibus where a direct connection to the field bus is created to collect data. In these cases the exposure of the field equipment directly depends on a server that either has a direct or indirect connection with an external network. This type of connectivity must be carefully examined for which cyber security hazards exist, what are the attack scenarios, what are the vulnerabilities exploited and what are the potential consequences and what is the estimated risk. Based on this analysis we can decide on how to best protect these architectures, and determine which security controls contribute most to reducing the risk.

Summary and conclusions

Are the sensors not sufficiently secure? In my opinion this is not the case for the refining, chemical, and oil and gas industry. I didn’t discuss in this blog SCADA and power systems, but so far never encountered situations that differed much from the discussed architectures. Field devices are seldom directly exposed, with exception of the wireless sensors we didn’t discuss and the special actuators that have a local field unit to control the actuators. Especially the last category has several cyber security hazards that need to be addressed , but it is a small category of very special actuators. I admit that sensors today have almost no protection mechanisms, with the exception of the tamper proof sensors, ISA 100 and WirelessHART sensors, but because the exposure is limited I don’t see it as the most important security issue we have in OT systems. Access to the control network reducing exposure through hardening of the computer equipment, enforcing least privilege for users and control functions is far more effective than replacing sensors for more secure devices. A point of concern are the IIoT solutions available today, these solutions increase the exposure of the field devices much more and not all solutions presented seem to offer an appropriate level of protection.

Author: Sinclair Koelemij

Date: May 23, 202

Cyber Security in Real-Time Systems

Introduction

In this blog I like to discuss some of the specific requirements for securing OT systems, or what used to be called Real-Time Systems (RTS). Specifically because of the real-time requirements for these systems and the potential impact on these requirements by cyber security controls. RTS needs to be treated differently when we secure them to maintain this real-time performance. If we don’t we can created very serious consequences for both the continuity of the production process as well as the safety of the plant staff.

I am regularly confronted with engineers with an IT background working in ICS that lack familiarity with the requirements and implement cyber security solutions in a way that directly impact the real-time performance of the systems with sometimes very far reaching consequences leading to production stops. This blog is not a training document, its intention is to make people aware of these dangers and suggest to avoid them in future.

I start with explaining the real-time requirements and how they are applied within industrial control systems and then follow-up with how cyber security can impact these requirements if incorrectly implemented.

Real-Time Systems

According to Gartner, Operational Technology (OT) is defined as the hardware and software that detects or causes a change, through direct monitoring and / or control of industrial equipment, assets, processes and events.

Figure 1 – A generic RTS architecture, nowadays also called an OT system

OT is a relatively new term that primarily came into use to express that there is a difference between OT and IT systems, at the time when IT engineers started discovering OT systems. Before the OT term was introduced the automation system engineering community called these systems Real-Time Systems (RTS). Real-Time Systems are in use for over 50 years, long before we even considered cyber security risk for these systems or considered the need to differentiate between IT and OT. It was clear for all, these were special systems no discussion needed. Time changed this, and today we need to explain that these systems are different and therefore need a to be treated differently.

The first real-time systems made use of analog computers, but with the rise of mini-computers and later the micro-processor the analog computers were replaced by digital computers in the 1970s. These mini computer based systems evolved into micro-processor based systems making use of proprietary hardware and software solutions in the late 1970s and 1980s. The 1990s was the time these systems started to adopt open technology, initially Unix based technology but with the introduction of Microsoft Windows NT this soon became the platform of choice. Today’s real-time systems, the industrial control systems, are for a large part based on similar technology as used in the corporate networks for office automation, Microsoft servers, desktops, thin clients, and virtual systems. Only the controllers, PLCs and field equipment are still using proprietary technology, though also for this equipment many of the software components are developed by only a few companies and used by multiple vendors. So also within this part of the system a form of standardization occurred.

In today’s automation landscape, RTS are everywhere. RTS is implemented in for instance cars, air planes, robotics, space craft, industrial control systems, parking garages, building automation, tunnels, trains, and many more applications. Whenever we need to interact with the real world by observing something and act upon what we observe, we typically use an RTS. The RTS applied are generally distributed RTS, where we have multiple RTS that exchange information over a network. In the automotive industry the Controller Area Network (CAN) or Local Interconnect Network (LIN) is used, in aerospace we use ARINC 629  (named after Aeronautical Radio, Incorporated) for example used in Boeing and Airbus aircraft, and networks such as Foundation Fieldbus, Profibus, and Ethernet are examples connecting RTS within industrial control systems (ICS)such as DCS and SCADA.

Real-time requirements typically express that an interaction must occur within a specified timing bound. This is not the same as that the interaction must be as fast as possible, a deterministic periodicity is essential for all activity. If an ICS needs to sample a series of process values each 0.5 second, this needs to be done in a way that the time between the samples is constant. To accurately measure a signal, the Nyquist-Shannon theorem states that we need to have a sample frequency with at minimum twice the frequency of the signal measured. If this principle is not maintained values for pressure, flow, and temperature will deviate from their actual value in the physical world. Depending on the technology used tens to hundreds of values can be measured by a single controller, different lists are maintained within the controller for scanning these process values each with a specific scan frequency (sampling rate). Variation in this scan frequency, called jitter, is just not allowed.

Measuring a level in a tank can be done with a much lower sampling rate than measuring a pressure signal that fluctuates continuously. So different tasks exist in an RTS that scan a specific set of process points within an assigned time slot. An essential rule is that this task needs to complete the sampling of its list of process points within the time reserved for it, there is no possibility to delay other tasks to complete the list. If there is not sufficient time than the points that remain in the list are just skipped. The next cycle will start again at the start of the list. This is what is called a time triggered execution strategy, a time triggered strategy can lead to starvation if the system becomes overloaded.  With a time-triggered execution, activities occur at predefined instances of time, like a task that samples every 0.5 second a list of process values, and another task that does the same for another list every 1 second, or every 5 second, etc.

There also exists an event triggered execution strategy, for example when a sampled value (e.g. a level) reaches a certain limit an alarm will go off. Or if a sampled value has changed for a certain amount or if the process point is a digital signal that changed from open to closed. Apart from collecting information an RTS also needs to respond to changes in process parameters. If the RTS is a process controller the process operator might change the setpoint of the control loop or adjust the gain or another parameter. And of course there is an algorithm to be executed that determines the action to execute, for example the change of an output value toward an actuator or a boolean rule that opens or closes a contact.

In ICS up to 10 – 15 years ago this activity resided primarily within a process controller, when information was required from another controller this was exchanged through analog wiring between the controllers. However hard-wiring is costly, so when functionality became available that allowed this exchange of information over the network (what is called peer-2-peer control) it was more and more used. (See figure 2) Various mechanisms were developed to prevent that a loss of communication between the controllers would not be detected and could be acted upon if it occurred.

Figure 2 – Example process control loop using peer-2-peer communication

One of these mechanisms is what is called mode shedding. Control loops have a mode, names sometimes differ per vendor, but commonly used names are Manual (MAN), Automatic (AUTO), Cascade (CASC), Computer (COM). The names and details differ between different systems, but in general when the mode is in MAN, the control algorithm is not executed anymore and the actuator remains in its last position. When the mode is AUTO the process algorithm is executed and makes use of its local setpoint (entered by the process operator) and measured process value to adjust its output. When the mode is CASC the control algorithm receives its setpoint value from the output of another source. this can be a source within the controller or an external source that makes use of for example the network. If such a control algorithm doesn’t receive its value in time, mode shedding occurs. It is generally configurable to what mode the algorithm falls back but often manual mode is selected. This freezes the control action and requires an operator intervention, failures may happen as long as the result is deterministic. Better fail than continuing with some unknown state. So within an ICS network performance is essential for real-time performance, essential to keep all control functions doing their designed task, essential for a deterministic behavior.

Another important function is redundancy, most ICS make use of redundant process controllers, redundant input/output (I/O) functions, and redundant servers, so if a process controller fails, the control function continuous to operate, because it is taken over by the redundant controller. A key requirement here is that this switch-over needs to be what is called  a bump-less transfer. So the control loops may not be impacted in their execution because another controller has taken over the function. This requires a very fast switch-over that the regular network technology often can’t handle. If the switch-over function would take too long, we would have again this mode-shedding mechanism triggered to keep the process in a deterministic state. The difference with the previous example is that in this case mode shedding wouldn’t occur in a single process loop but in all process loops configured in that controller. So a major process upset will occur. A double controller failure would normally lead to a production stop, resulting in high costs. Two redundant controllers need to be continuously synchronized, an important task running under the same real-time execution constraints as the example of the point sampling discussed earlier. Execution of the synchronization task needs to complete within its set interval, this exchange of data takes place over the network. If somehow the network is not performing as required and the data is not exchanged in time the switch-over might fail when needed.

So network performance is critical in an RTS, cyber security however can negatively impact this if implemented in an incorrect manner. Before we discuss this let’s have a closer look at a key factor for network performance, network latency.

Network latency

Factors affecting the performance in a wired ICS network are:

  • Bandwidth – the transmission capacity in the network. Typically 100 Mbps or 1000 Mbps.
  • Throughput – the average of actual traffic transferred over a given network path.
  • Latency – the time taken to transmit a packet from one network node (e.g. a server or process controller) to the receiving node.
  • Jitter – this is best described as the variation in end-to-end delay.
  • Packet loss – the transmission might be disturbed because of a noisy environment such as cables close to high voltage equipment or frequency converters.
  • Quality of service – Most ICS networks have some mechanism in place that set the priority for traffic based on the function. This to prevent that less critical functions can delay the traffic of more critical functions, such as an process operator intervention or a process alarm.

The factor that is most often impacted by badly impacted security controls is network latency, so let’s have a closer look at this.

There are four types of delay that cause latency:

  • Queuing delay – Queuing delay depends on the number of hops for a given end-to-end path. This typically caused by routers, firewalls, and intrusion prevention systems (IPS).
  • Transmission delay – This is the time taken to transmit all the bits of the frame containing the packet. So the time taken between emission of the first bit of the packet and the emission of the last bit. The main factor influencing transmission delay is cable type (copper, fiber, dark fiber), and cable distance. For example very long fiber cables. This is a factor normally not influenced by specific cyber security controls, exception is when data-diode is implemented. The type of data diode can have influence.
  • Propagation delay – This is the time between emission of the first bit and reception of the last bit. Propagation delay is created by all network equipment, but also a firewall (and type of firewall) and IPS contribute to this.
  • Processing delay – This is the time taken by the software execution of the protocol stack. Processing delay is created by access control lists, by encryption, by integrity checks build in the protocol, either for TCP, UDP, or IP.

Let’s discuss the potential conflict between real-time performance and cyber security.

The impact of cyber security on real-time performance

How do we create real-time performance within Ethernet, a network never designed for providing real-time performance? There is only one way to do this and that is creating over-capacity. A typical well configured ICS network has a considerable over-capacity to handle peak loads and prevent delays that can impact the RTS requirements. However the only one managing this over-capacity is the ICS engineer designing the system. The time available for tasks to execute is a design parameter that needs to meet small and large systems. To make certain that network capacity is sufficient is complex in redundant and fault tolerant networks. A redundant network has two paths available between nodes, a fault tolerant network has four paths available. Depending on how this redundancy or fault tolerance is created, can impact the available bandwidth / throughput. In systems where the redundant paths are also used for traffic network paths can become saturated by high throughput, for example caused by creating automated back-ups of server nodes, or distributing cyber security patches (Especially Windows 10 security patches). Because this traffic can make use of multiple paths, it becomes constrained when it hits the spot in the network redundancy or fault tolerance ends and the traffic has to fall back to a much lower bandwidth. Quality of service can help a little here but when the congestion impacts the processing of the  network equipment, extra delays will occur also for the prioritized traffic.

Another source of network latency can be the implementation of anomaly detection systems making use of port spanning. A port span has some impact on the network equipment, partially depending on how configured, generally not much but this depends very much on the base load of this equipment and its configuration. Similarly low cost network taps also can add significant latency. This has caused issues in the field.

Another source of delay are the access filters. Ideally when we segment an ICS network in its hierarchical levels (level 1, level 2, level 3) we want to restrict the traffic between the segments as much as possible, but specifically at the levels 1 and level 2 this can cause network latency that potentially impacts control critical functions such as the peer-2-peer control and controller redundancy. Additionally the higher the network latency the less process points can be configured for overview graphics in operator stations, because also these have a configurable periodic scan. A scan that can also be temporarily raised by the operator for control tuning purposes.

The way vendors manage this traffic load is by their system specifications limiting the number of operator stations, servers, and points per controller, and breaking up the segments into multiple clusters. These specifications are verified with intensive testing to meet performance under all foreseen circumstances. This type of testing can only be done on a test bed that supports the maximum configuration, the complexity and impact of these tests make it impossible to verify proper operation on an operational system. because of this vendors will always resist against implementing functions in the level 1 / levels 2 parts of the system that can potentially impact performance and are not tested.

In the field we see very often that security controls are implemented in a way that can cause serious issues that can lead to dangerous situations and / or production stops. Controls are implemented without proper testing, configurations are created that cause a considerable processing delay, networks are modified in a way that a single mistake of a field service engineer can lead to a full production stop. In some cases impacting both the control side as well as the safety side.

Still we need to find a balance between adding the required security controls to ICS and preventing serious failures. This requires a separate set of skills, engineers that understand how the systems operate, which requirements need to be met, and have the capabilities to test the more intrusive controls before implementing them. This makes IT security very different from OT security.

 

 

Interfaced or integrated?

Introduction

Like the Stuxnet attack in 2010 made the industry aware that targeted cyber-attacks on physical systems are a serious possibility, the TRISIS attack in 2017 targeting a safety system of a chemical plant also caused a considerable rise in awareness of the cyber-threat. Many asset owners started to reevaluate their company policies on how to connect BPCS and SIS systems.

That BPCS and SIS need to exchange information is not so much a discussion, all understand that this is required for most installations. But how to do this has become a debate within and between the asset owner and vendor organizations. The discussions have focused on the theme do we allow an integrated architecture with its advantages, or do we fallback to an interfaced architecture.

I call it a fallback because an interfaced connection between a Basic Process Control System (BPCS) and Safety Instrumented System (SIS) used to be the standard up to approximately 15-20 years ago.

Four main architectures are in use today for exchanging information between BPCS and SIS:

  • Isolated / air-gapped – The two systems are not connected and limit their exchange of data through analog hard-wiring between the I/O cards of the safety controller and the I/O cards of the process controller;
  • Interfaced – The process controller and the safety controller are connected through a serial interface (could also be an Ethernet interface, but most common is serial);
  • Integrated – The BPCS and SIS are connected over a network connection, today the most common architecture;
  • Common – Here the BPCS and SIS are no longer independent from each other and either run on a shared platform.

There are variations on above architecture, but above four categories summarize the main differences good enough for this discussion. Although security risk differs, because of differences in exposure,  we cannot say that one architecture is better than another as business requirements and their associated business / mission risk differs. Take for instance a batch plant that produces multiple different products on the same production equipment, such a plant might require changes to its SIS configuration for every new product cycle, where a continuous process of a gas treatment plant or a refinery only incidentally require changes to the safety logic. These constraints do influence decisions on which architecture best meets business requirements. For the chemical, oil & gas, and refining industries the choice is mainly between interfaced or integrated the topic of the blog.

I start the discussion by introducing some basic requirements for the interaction between BPCS and SIS using the IEC 61511 as reference.

Requirements

Following clauses of the IEC 61511 standard are describing the requirements for this separation.

Section 11.2.2:

“Where the SIS is to implement both SIFs and non-SIFs then all the hardware, embedded software and application program that can negatively affect any SIF under normal and fault conditions shall be treated as part of the SIS and comply with the requirements for the highest SIL of the SIFs it can impact.”

Section 11.2.3:

“Where the SIS is to implement SIF of different SIL, then the shared or common hardware and embedded software and application program shall conform to the highest SIL.”

A Safety Integrity Function (SIF) is the function that takes all necessary actions to bring the production system to a safe state when predefined process conditions are exceeded. A SIF can either be manually activated (a button or key on the console of the process operator) or acting when a predefined process limit (trip point) is exceeded, such as for example a high pressure limit. SIL stands for Safety Integrity Limit, it is a measure for process safety risk reduction, a level of performance for the SIF. The higher the SIL level the more criteria need to be met to reduce the probability on failure that the SIF is not able to act when demanded by the process conditions. In general a BPCS can’t meet the higher SIL (SIL 2 and above) and a SIS doesn’t provide all the control functions a BPCS has, so we end up with two systems. One system provides the control functions and one system enforces that the production process stays within a safe state, section 11.2.4 of the standard describes this. For a common architecture this is different because a common architecture has a shared hardware / software platform, therefore such an architecture is generally used for processes with no SIL requirements. Next clauses discuss the separation of the BPCS and SIS functions.

Section 11.2.4:

“If it is intended not to qualify the Basic Process Control System (BPCS) to this standard, then the basic process control system shall be designed to be separate and independent to the extent that the functional integrity of the safety instrumented system is not compromised.”

NOTE 1 Operating information may be exchanged but should not compromise the functional safety of the SIS.

NOTE 2 Devices of the SIS may also be used for functions of the basic process control system if it can be shown that a failure of the basic process control system does not compromise the safety instrumented functions of the safety instrumented system.

So BPCS and SIS should be separate and independent but it is allowed to exchange operating information (process values, status information, alarms) as long as a failure of the BPCS doesn’t compromise the safety instrumented functions. This requirement allows for instance the use process data from the SIS, reducing cost in the number of sensors installed. It also allows valves that combine a control and a safety function. And it allows for initiating safety override commands from the BPCS HMI to temporarily override a safety instrumented function to carry out maintenance activities. But functions that under normal operation fully meet the IEC 61511 requirements, can be maliciously manipulated when taking cybersecurity into account. An important requirement is that BPCS and SIS should be independent, the following sections discuss this:

Section 11.2.9:

The design of the SIS shall take into consideration all aspects of independence and dependence between the SIS and BPCS, and the SIS and other protection layers.

Section 11.2.10:

A device to perform part of a safety instrumented function shall not be used for basic process control purposes, where a failure of that device results in a failure of the basic process control function which causes a demand on the safety instrumented function, unless an analysis has been carried out to confirm that the overall risk is acceptable.

Below diagram shows the various protection layers that need to be independent. In today’s BPCS the supervisory function and control function are generally combined into a single system, as such form a single BPCS protection layer.

Figure 1 – Protection layers (Source: Center for Chemical Process Safety – CCPS)

The yellow parts in the diagram form the functional safety part implemented using what the standard calls Electrical/Electronic/Programmable Electronic Safety- related system (E/E/PES), or to be more specific systems such as: BPCS ( for the control and supervisory functions), SIS ESD ( for the preventative emergency shutdown function), SIS F&G (for the mitigative fire and gas detection function). Apart from these functions fire and alarm panels are used that interface with SIS, MIMIC panels also require interfaces, and public address general alarm systems (PAGA) require an interface. Though PAGA’s generally are hardwired connected, fire and alarm panels often use Modbus TCP to connect. These aren’t the only SIS that exist in a plant, also boiler management systems (BMS) and high integrity pressure protection systems (HIIPS) are preventative safety functions implemented in SIS.

Interfaced and integrated architectures

So far the discussion on the components and functions of the architecture let’s get into more detail to investigate these architectures. There are several reports published describing BPCS ó SIS  architectures. For example the LOGIIC organization published a report in 2018 and ISA published a report in 2017 on the subject. The ISA-TR84.00.09-2017 report ignores architectures with a dedicated isolated network for SIF related traffic. Because the major SIS vendors support and promote this architecture and in my experience it is also the most commonly used architecture my examples will follow architectures defined in the LOGIIC report on SIS. LOGIIC defines following architectures as example of integrated and interfaced structures. (Small modifications are mine)

Figure 2 – Interfaced and integrated architecture principles

In the interfaced architecture there are two networks isolated from each other. The BPCS network generally has some path to the corporate network, while the SIS network remains fully isolated. The exchange of data between the two systems makes use of Modbus RTU where the BPCS is the master and the SIS is the slave. The example shows two serial connections for example using the RS-232C protocol, but a more common alternative is a multi-drop connection using the RS-485 protocol, where a single serial interface on the BPCS side connects with multiple logic solvers (safety controllers).

An integrated architecture has a common network connecting both the BPCS and SIS equipment. Variations exist where separation between a BPCS network segment and SIS network segment is created with a firewall between the two systems (A). Or alternatively a dual-homed OPC server is sometimes used when the communication is making use of the OPC protocol (B). See the diagrams below.

Figure 3 – Some frequently used integrated architectures

Because one of the main targets is the SIS engineering function, architecture C and D were developed. In this case the SIS engineering station can be connected to the SIF network or alternatively an isolated 3rd network segment is created. Different vendors support different architectures, but the four examples above seem to cover most of the installations. Depending on the communication protocol used for the communication between BPCS and SIS there might be a need for a publishing service. This is a function mapping the SIS internal addresses into a format the BPCS understands and uses for its communication.

In the C and D architecture an additional security function can be an integrated firewall in the logic solver restricting access to a configured set of network nodes (servers, stations, controllers) for a defined set of protocols. Above architectures are just some typicals relevant for comparing interfaced and integrated architectures. Actual installations can be more complex because of additional business requirements such as extending the control and SIF networks to a remote location and the use of instrument asset management systems (IAMS) for centrally managing field equipment.

The architectures above do not show a separation between level 1 and level 2 for the process controller and the logic solver. All controllers / logic solvers are shown as connected to level 2, this is not necessarily correct. Some vendors implement a level 1 / level 2 separation using a micro firewall function for each controller with a preset series of policies that limit access to a sub set of approved protocols and allow for throttling traffic to protect against high volumes of traffic. Other vendors use a dedicated level 1 network segment where they can enforce network traffic limitations, and sometimes it is a mix depending on the technology used. There are also vendor implementations with a dedicated firewall between the level 2 and level 1 network segment. Since I want to discuss the differences independent of a specific vendor implementation I ignore these differences and consider them all connected as shown in the diagrams.

Let’s now compare the pros and cons of these architectures.

A comparison between interfaced and serial architectures

For the discussion I assume that the attacker’s objective is to create a functional deviation in the Safety Integrity Function (SIF) to either deviate from the operation intent / design intent (What I call Loss of Required Performance, for example modified logic or trip point) or making the SIS fail when it needs to act. Just tripping the plant can be done in many ways, the more dangerous threat is causing physical damage to the production system.

If we want to compare the security of an interfaced and integrated architecture the first thing is to look at the possibilities to manipulate the BPCS <=> SIS traffic. For the interfaced architecture this traffic passes the serial connection, for the integrated architecture the pink high lighted part of the network passes this traffic.

Figure 4 – BPCS <=> SIS traffic

One potential attack scenario is that the attacker attempts to create a sink hole by intercepting all traffic and dropping it. In an integrated architecture relative simple malware can do this using ARP poisoning as a mechanism to create a Man-In-The-Middle attack (MITM). The malware hijacks the IP address of one or both of the communicating BPCS/SIS nodes and just drops the message. This is possible because level 1 and level 2 traffic are normally in the same Ethernet broadcast domain, so actual messages are exchanged using Ethernet MAC addresses. A malware infection on any of the station and server equipment within the broadcast domain can perform this simple attack. Possible consequences of such a simple attack would be that SIF alarms are lost, SIS status messages are lost, SIS process data is lost, and messages to enforce an override will not arrive at the logic solver.

The noticeability of such an attack is relatively high, but this differs depending on the type of traffic and additional ingenuity added by the attacker. The loss of a SIF alarm would not be detected that quickly, but the loss process data would most likely be immediately detected. So selectively removing traffic can make the attack less noticeable. Also the protocol used is of importance for the success of this attack, a Modbus master (BPCS) would quickly detect if the slave (SIS) would not respond and generate a series of retries and ultimately raises an alarm. Similar detection mechanisms exist when using vendor proprietary protocols. But in those situations where the SIS can initiate traffic independent of the BPCS, than a SIF alarm might get lost if there is not a confirmation from the BPCS is required causing that the process operator might not receive the new alarm.

More ingenious attacks can make use of the MITM mechanism to capture and replay BPCS ó SIS traffic. Replaying process data messages would essentially freeze the operators information, replaying override messages might override or undo an override command. And of course an attacker can overload an operator this way with false alarms and status change messages causing the potential for missing an important alarm. The danger of these two attacks is that the attacker doesn’t need to know that much specific system information to be successful.

Enhancing the attack even more by intercepting and modifying traffic or injecting new messages is another option. In that scenario the attacker can attempt to make “intelligent” malicious modifications. There are various ways to reduce the likelihood of a MITM attack, but the risk remains in the absence of secure communications that authenticate, encrypt, confirms reception of a message, and includes a timestamp to validate if the message was send within a specific time window. If we analyze the protocols used today for communication between BPCS and SIS most don’t offer these functions.

Above MITM scenarios are only possible in an integrated architecture, in an interfaced architecture the communication channel is isolated from the control network by the process controller. An attacker would need physical access to the cabling to perform a similar attack.

Apart from attacking the traffic it is possible to directly attack the logic solver if it would have an exposed software vulnerability. In the integrated architecture the logic solver is always exposed to this threat and only testing, limiting network access to the logic solver, and using software validation techniques can offer protection against this threat.

In an interfaced architecture, the logic solver is no longer exposed to an external connected network but this does not offer a guarantee that such an attack can’t happen. The SIS engineering station might get infected with malware and execute an attack. Also in the Stuxnet attack the target system was on an isolated network, this didn’t stop the attackers to penetrate. But it certainly requires more skills and resources to design such an attack for an interfaced architecture.

In an integrated architecture the SIS engineering station might be exposed. In architecture A above the exposure depends very much on the firewall configuration, but there might be connections required for the antivirus engine, security patches, and in some cases for the publishing function. In architectures C and D the exposure of the SIS engineering station is equivalent to the exposure in an interfaced architecture. However it can still be infected by malware when files are transferred, for example to support this publishing function, or through some supply chain attack. But in general architecture D, where the SIS engineering station is isolated from the SIF traffic reduces the exposure for a number of possible cyber-attacks. Integrated architectures C and D are considered more secure, but how do they compare with an interfaced architecture?

Two main cyber security hazards seem to make a difference:

a)     Manipulation of BPCS <=> SIS traffic;

b)     Exploitation of a logic solver resident software vulnerability;

To mitigate / reduce the risk of scenario a) the industry needs secure communications between the two systems. Protocols used today such as Modbus TCP, Classic OPC, and most of the proprietary protocols don’t offer this.

To mitigate / reduce the risk of scenario b) we would need mechanisms that prevent or detect the insertion of malicious code. These have partially been implemented, but not completely and not by all vendors as we can learn from the TRISIS incident.

So overall we can say that the exposure in an interfaced environment is smaller than the exposure in an integrated environment. Since exposure directly correlates with likelihood we can say that the risk in an interfaced architecture is smaller than the risk in an integrated architecture. But security is not all about risk reduction, in the end the cost of security something must be weight against the benefits the business has for selecting an architecture.

I think there is also a price to pay when the industry moves back to interfaced architectures.

Possible disadvantages of a change to interfaced solutions can be:

·       Reduced communication capabilities, e.g. the bandwidth of a serial connection is lower than the bandwidth of an Ethernet connection;

·       The support for Sequence Of Event (SOE) functionality over serial connections would be hindered by not having a single clock source;

·       The cost of the architecture is higher because the need for additional components and being constrained by cable length restrictions;

·       Sharing instrument readings between SIS and BPCS is limited because of the relatively low speed serial connectivity, this can increase the number of field instruments required;

·       If Instrument Asset Management Systems (IAMS) is used these systems can no longer benefit from the HART pass-through technology and need to use the less secure HAT multiplexers;

·       The cost for engineering and maintaining a Modbus RTU implementation is higher because of the more complex address mapping;

·       Future developments w.r.t. the Industrial Internet Of Things (IIoT) would become more difficult for safety instrumentation related data. For example the new developments around Advanced Physical Layer (APL) technology are hard to combine with an interfaced architecture.

So generally serial connections were seen as a thing of the past and some vendors stopped supporting them on their newer equipment. However the TRISIS attack has put the interfaced architecture very prominent on the table again, specifically in the absence of secure communications between BPCS and SIS.

The cost of moving to an interfaced architecture must be weight against the lost benefits for the business. How high is the residual risk of an integrated architecture compared to the residual risk of an interfaced architecture. Is moving from integrated to interfaced justified by the difference and what are the alternatives?

Author: Sinclair Koelemij

Date: May 9, 2020

Cyber security and process safety, how do they converge?

Introduction

There has been a lot of discussion on the relationship between cybersecurity and process safety in Industrial Control Systems (ICS). Articles have been published on the topic of safety / cybersecurity convergence, and on the topic to add safety to the cybersecurity triad for ICS. The cybersecurity community practicing securing ICS seems to be divided over this subject. This blog approaches the discussion from the process automation practice in combination with the asset integrity management practice as practiced in the chemical, refining, and oil and gas industry. The principles discussed are common for most automated processes, independent where applied. Let’s start the discussion by defining process / functional safety, asset integrity and cybersecurity before we discuss their relationships and dependencies.

Process safety is the discipline that aims at preventing loss of containment, the prevention of unintentional releases of chemicals, energy, or other potentially dangerous materials that can have a serious effect to the production process, people and the environment. The industry has developed several techniques to analyze process safety for a specific production process. Common methods to analyze the process safety risk are Process Hazard Analysis (PHA), Hazard and Operability study (HAZOP), and Layers Of Protection Analysis (LOPA). The area of interest for cybersecurity is what is called Functional Safety, the part that is implemented using programmable electronic devices. Functional safety implements the automated safety integrity functions (SIF), alarms, permissives and interlocks protecting the production process.

Figure 1 – Protection layers (Source: Center for Chemical Process Safety (CCPS))

Above diagram shows that several automation system components play a role in functional safety, there are two layers implemented by the basic process control system (BPCS), and two layers implemented by the safety instrumented system (SIS): the preventative layer implemented by the emergency shutdown function (ESD) and the mitigative layer implemented by the fire and gas system (F&G). The supervisory and basic control layers of the BPCS are generally implemented in the same system, therefore not considered independent and often shown as a single layer. Interlocks and permissives are implemented in the BPCS, where the safety integrity functions (SIF) are implemented in the SIS (ESD and F&G). Other functional safety functions exist such as the High Integrity Pressure Protection System (HIPPS) and Boiler Management System (BMS). For this discussion it is not important to make a distinction between these process safety functions. Important is too understand that the ESD safety function is responsible to act and return the production system to a safe state when the production process entered a hazardous state, where the F&G safety function acts on detection of smoke and gas and will activate mitigation systems depending on the nature, location, and severity of the detected hazard. This includes such actions as initiating warning alarms for personnel, releasing extinguishants, cutting off the process flow, isolating fuel sources, and venting equipment. The BPCS, ESD and F&G form independent layers, so their functions should not rely on each other but they don’t exist in isolation extending their ability to prevent or mitigate an incident by engaging with other systems.

Asset integrity is defined as the ability of an asset to perform its required function effectively and efficiently whilst protecting health, safety and the environment and the means of ensuring that the people, systems, processes, and resources that deliver integrity are in place, in use and will perform when required over the whole life-cycle of the asset.

Asset integrity includes elements of process safety. In this context, an asset is a process or facility that is involved in the use, storage, manufacturing, handling or transport of chemicals, but also the equipment comprising such a process or facility. Examples of process control assets include pumps, furnaces, tanks, vessels, piping systems, buildings, but also includes the BPCS and SIS among other process automation functions. As soon as a production process is started up the assets are subject to many different damage and degradation mechanisms depending on the type of asset. For electronic programmable components this can be hardware failures, software failures, but today also maliciously caused failures by cyber-attacks. From an asset integrity perspective there are two types of failure modes:

  • Loss of required performance;
  • Los of ability to perform as required;

The required performance of an asset is the successful functioning (of course while in service) achieving its operational / design intent as part of a larger system or process. For example in the context of a BPCS the control function adheres to the defined operating window, such as sensor ranges, data sampling rates, valve travel rates, etc. The BPCS presents the information correctly to the process operator, measured values accurately represent the actual temperatures, levels, flows, and pressures as present in the physical system. In the context of a SIS it means that the set trip points are correct, that the measured values are correct, and that the application logic acts as intended when triggered.

Loss of ability is not achieving that required performance. An example for a BPCS is loss of view or loss of control, the ability fails where in loss of required performance the ability is present but doesn’t execute correctly. Loss of ability is very much related to availability and loss of required performance to integrity. Never the less I prefer to use the terminology used by the asset integrity discipline because it more clearly defines what is missing.

The simplest definition of cybersecurity is that it is the practice of protecting computers, servers, mobile devices, electronic devices, networks, and data from malicious attacks. Typical malicious cyber-attacks are gaining unauthorized access into systems, and distributing malicious software.

In the context of ICS we often talk about Operational Technology (OT), which is defined as the hardware and software dedicated to detecting and / or causing changes in a production processes through monitoring and/or control of physical devices such as sensors, valves, pumps, etc. This is a somewhat limited definition because process automation systems contain other functions such as for example dedicated safety, diagnostic and analyzing functions.

The term OT was introduced by the IT community to differentiate between cybersecurity disciplines that protect these OT systems and those that protect the IT (corporate network) systems. There are various differences between IT and OT systems that justified to create this new category, though there is also significant overlap frequently confusing the discussion. In this document the OT systems are considered the process automation systems, the various embedded devices such as process and safety controllers, network components, computer servers running applications and stations for operators and engineers to interface with these process automation functions.

The IT security discipline defined a triad based on confidentiality, integrity, availability (CIA) as a model highlighting the three core security objectives in an information system. For OT systems ISA extended the number of security objectives by introducing 7 foundational requirement categories (Identification and authentication control, Use control, System integrity, Data confidentiality, Restricted data flow, Timely response to events, Resource availability) to group the requirements.

Now we have defined the three main disciplines I like to discuss the topic if we need to extend the triad with a fourth element safety.

Is safety a cybersecurity objective?

Based upon the above introduction to the three disciplines and taking asset integrity as the leading discipline for plant maintenance organizations, we can define three cybersecurity objectives for process automation systems:

  • Protection against loss of Required Performance;
  • Protection against Loss of Ability;
  • And protection against Loss of Confidentiality.

If we have established these three objectives we also established functional safety. Not necessarily process safety because this also depends non-functional safety elements. But these non-functional safety elements are not vulnerable to cyber attacks other than potentially revealing these elements through loss of confidentiality. Based upon all information on the TRISIS attack against a Triconex safety system, I believe all elements in this attack have been covered. The loss of confidentiality can be linked to the 2014 attack that is suggested to be the source of the disclosure of the application logic. The aim of the attack was most likely causing a loss of required performance by modifying the application logic that is part of the SIF. The loss of ability, the crash of the safety controller, was not the objective but an “unfortunate” consequence of an error in the malicious software.

Functional safety is established by the BPCS and SIS functionality, the combination of interlocks, permissives, and the safety integrity functions contribute to overall process safety. Cybersecurity contributes indirectly to functional safety by maintaining above three objectives. Loss of required performance and loss of ability would have a direct consequence to the process safety, loss of confidentiality might lead over time to the exposure of a vulnerability or contribute to the design of the attack. Required performance is overlapping what a process engineer would call operability, operability also includes safety so also from this angle nothing is missing in the three objectives.

Based upon above reasoning I don’t see a justification for adding safety to the triad, the issue is more that the availability and integrity objectives of the triad should be aligned with the terminology used by the asset integrity discipline to include safety. Which would make OT cybersecurity different from IT cybersecurity.

Author: Sinclair Koelemij

Date: April 2020