Zero Downtime MES System Upgrades

6 min · Pedro Fernandes

How Critical Manufacturing implemented a Blue-Green Deployment MVP to achieve near-zero downtime upgrades

Database Deployments

Have you ever felt that unnerving sensation when performing a system upgrade? You’re worried that the upgrade might take too long, or that something could go wrong.

Indeed, these are two of the issues that most often weigh against upgrading an MES:

  • Downtime - the system may become unavailable for a long period of time;
  • Unexpected failure - if the upgrade fails, then another maintenance window needs to be scheduled.

In an ideal world, we could upgrade an MES with next-to-no downtime, even on major upgrades! What if we lived in that world?

Status

At Critical Manufacturing, we currently support two strategies for MES system upgrades between minor/major versions:

  • Transactional Deployment
  • Non-Transactional Deployment

The first strategy, which we may also refer to as traditional deployment, is the default strategy. The database upgrade is wrapped in a SQL transaction, which means that if something goes wrong, the transaction can be rolled back to the initial state, at no additional risk to the production environment.

The cost of this safety is time – running the whole upgrade inside a SQL transaction adds a performance overhead compared with running it without one.

To achieve faster upgrades, we have introduced support for a second strategy that does not use SQL transactions and is therefore much faster. However, to achieve the same level of safety, we need to back up the production database beforehand so that we can restore it if an unexpected error occurs.
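The difference between the two strategies can be illustrated with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for the production database; the `material` table, the failing statement and all function names are hypothetical and do not reflect how the MES installer is actually implemented. It only shows how each strategy recovers from a failed upgrade, not the difference in speed.

```python
import sqlite3


def make_db():
    """Create an in-memory stand-in for the production database."""
    conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
    conn.execute("CREATE TABLE material (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO material (name) VALUES ('wafer')")
    return conn


def columns(conn):
    """Return the column names of the hypothetical material table."""
    return [row[1] for row in conn.execute("PRAGMA table_info(material)")]


def upgrade_steps(conn):
    """Hypothetical upgrade: one valid schema change, then a failing step."""
    conn.execute("ALTER TABLE material ADD COLUMN lot TEXT")
    conn.execute("INSERT INTO missing_table VALUES (1)")  # raises: no such table


def transactional_upgrade(conn):
    """Run every step inside one transaction and roll back on any error."""
    conn.execute("BEGIN")
    try:
        upgrade_steps(conn)
        conn.execute("COMMIT")
        return True
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        return False


def non_transactional_upgrade(conn):
    """Back up first, run steps without a transaction, restore on error."""
    snapshot = sqlite3.connect(":memory:")
    conn.backup(snapshot)  # copy the current state into the snapshot
    try:
        upgrade_steps(conn)
        return True
    except sqlite3.Error:
        snapshot.backup(conn)  # restore the pre-upgrade state
        return False
    finally:
        snapshot.close()
```

In both cases the database ends up in its pre-upgrade state after a failure. The transactional variant relies on the database engine to undo the changes, while the non-transactional variant pays the cost of a backup up front and of a restore if something fails.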

Figure: Deployment Strategies. Comparison of successful and unsuccessful upgrades

System downtime is not always required, though: it can be avoided when we are moving between consecutive patch or minor versions and the package installation takes less than 10 minutes.

Our standard upgrade process follows a rolling strategy: container pods for the target MES version are provisioned while the pods running the current version remain active until the upgrade completes. This ensures that requests continue to be served during the database upgrade.

When upgrading across major versions, however, this strategy cannot be applied. Such upgrades typically require the installation of multiple database packages and involve changes to the database schema. As a result, service downtime during the upgrade is unavoidable.

As we can see, when we pick a deployment strategy for major upgrades, we almost have to choose between shorter downtime and a safer process. What we truly want is a strategy that combines both - for that, we will talk about a third contender!

Blue-Green Deployment

Blue-Green deployment is a strategy that aims to reduce both the risk and the downtime of performing a system upgrade.

It consists of booting up a clone of the production environment where the upgrade will actually be performed. Therefore, we have the blue environment (production) and the green environment (tentative). During the deployment, the blue environment is not modified in any way until the very end - this means that everyone, including operators and equipment, can continue using the MES like they usually do. More than that, they can access the green environment (in read-only mode) to validate that the upgrade is going smoothly.

Figure: Blue-Green sample. High-level overview

One particular design decision I want to focus on is how to handle the database. One possible solution is to duplicate the application-level components and use a single database instance. In the context of our MES, we quickly realized this would not be a good fit for two particular reasons:

  1. Because we are modifying the production database, we are still required to perform a backup in case something goes wrong.
  2. It becomes harder to deal with possible breaking changes in the data model between major versions.

Because of this, we have also opted to duplicate the database instance, as we see in the image above. With this approach, we have more freedom to deal with data model changes, and the production database is never touched by the upgrade: even if something goes wrong, the production environment will not suffer any impact.

But this strategy does have one major challenge when compared to the alternative: there needs to be a data replication mechanism to synchronize both environments.

Minimum Viable Product (MVP)

At Critical Manufacturing, we have decided to explore the idea of supporting Blue-Green Deployments in our MES – so we have built an MVP. Let’s see how it works, step by step.

To see this process in action, take a look at the following video (please turn on the sound):

Installation

We start by backing up the database of the blue environment. Once that is done, we install the green environment with the same MES version, supplying the respective backup. This ensures that the green environment is an identical copy of the original one. Then, we upgrade the green environment to the desired version. This process is made easier by the DevOps Center section of the Critical Manufacturing Customer Portal, which is a tailor-made solution to install MES environments.

Figure: Green installation

Synchronization

Once the upgrade is complete, we introduce a component responsible for synchronizing the data from the blue to the green environment. To achieve this, we take advantage of the existing replication mechanism: the component subscribes to a Kafka topic that publishes an event for each operation performed in the blue environment, and replicates each of those operations in the green database.
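To make the replication idea more concrete, here is a minimal sketch of how such events could be applied to the green side. It does not use a real Kafka client: the `ChangeEvent` shape, the operation names and the `GreenReplica` class are assumptions made for illustration, and the green database is modeled as a plain dictionary. The one detail borrowed from Kafka is that each event carries an increasing offset, which lets the replica skip events that are delivered more than once.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical event describing one operation in the blue environment."""
    offset: int                   # position of the event in the topic
    operation: str                # "upsert" or "delete" (assumed vocabulary)
    key: str                      # identifier of the affected entity
    value: Optional[dict] = None  # new state for upserts


@dataclass
class GreenReplica:
    """In-memory stand-in for the green database."""
    rows: dict = field(default_factory=dict)
    last_offset: int = -1

    def apply(self, event: ChangeEvent) -> bool:
        """Apply one event; return False if it had already been applied."""
        if event.offset <= self.last_offset:
            return False  # redelivered event, safe to ignore
        if event.operation == "upsert":
            self.rows[event.key] = event.value
        elif event.operation == "delete":
            self.rows.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.operation}")
        self.last_offset = event.offset
        return True
```

Tracking the last applied offset keeps the replay idempotent, which matters because a consumer that restarts may receive some events again.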

Additionally, we periodically run data migration scripts targeting the data that users modified in the course of the upgrade. The goal is to avoid running all of the scripts at once at the end of the upgrade, which could increase the downtime.
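The benefit of migrating periodically instead of all at once can be sketched as follows. The `IncrementalMigrator` class, the `migrate_row` transformation and the idea of tracking modified keys in a set are all hypothetical; they only illustrate why the final pass stays small when earlier passes have already processed most of the modified data.

```python
def migrate_row(row):
    """Hypothetical migration: the new data model stores the name upper-cased."""
    return {"Name": row["name"].upper()}


class IncrementalMigrator:
    """Migrates only the rows modified since the previous pass."""

    def __init__(self):
        self.dirty = set()   # keys modified since the last pass
        self.migrated = {}   # rows already converted to the new data model

    def mark_modified(self, key):
        """Record that a user changed this row during the upgrade."""
        self.dirty.add(key)

    def run_pass(self, source):
        """Migrate every dirty row and return how many were processed."""
        count = 0
        for key in sorted(self.dirty):
            self.migrated[key] = migrate_row(source[key])
            count += 1
        self.dirty.clear()
        return count
```

Each periodic pass empties the set of modified rows, so the last pass, which runs while the system is unavailable, only has to handle whatever changed since the previous one.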

Figure: Environment synchronization mechanism

Environment Switch

The synchronization process is now running, so all that is left is deciding when to finalize the upgrade by promoting the green environment to production. To help the system administrator decide, we estimate the expected downtime by analyzing the number of events that still have to be processed. When the administrator feels comfortable, they can make the switch, at which point the system shuts down the blue environment, finalizes the synchronization and then routes the traffic from the blue to the green environment. This causes a brief period of downtime (in the demo above, it took only around 20 seconds). And that’s it! The system has been upgraded and the blue environment can be discarded.
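As a rough illustration of the downtime estimate, the sketch below divides the number of pending events by the observed processing rate and adds a fixed overhead for the switch itself. This formula, the function name and its parameters are assumptions made for the example; the post does not describe how the MVP actually computes its estimate.

```python
def estimate_downtime_seconds(pending_events, events_per_second,
                              switch_overhead_seconds=0.0):
    """Estimate downtime as time to drain pending events plus a fixed overhead.

    All three inputs are hypothetical measurements: the backlog still to be
    replicated, the recent replication throughput, and the time needed to
    stop the blue environment and reroute traffic.
    """
    if pending_events < 0:
        raise ValueError("pending_events must not be negative")
    if events_per_second <= 0:
        raise ValueError("events_per_second must be positive")
    return pending_events / events_per_second + switch_overhead_seconds
```

With this model, the estimate shrinks as the synchronization component catches up, so the administrator can wait for the backlog to become small before making the switch.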

Figure: Post-upgrade result

Summary

In this post, we have looked at a way to perform system upgrades that tackles the main pain points most MES deployments face: downtime and confidence. Blue-Green deployments are not yet released, but we have already seen some very promising results - near-zero downtime is certainly within our grasp!

Author

Hi! My name is Pedro Fernandes. ✌️

I’m a Software Engineer working at Critical Manufacturing in the R&D area. You can find me on LinkedIn.

Pedro Fernandes
Senior Software Engineer