The Hard Truth About Multi-Cloud MES Deployments: Why State Management is Your Real Problem
Manufacturing Execution Systems (MES) have become mission-critical infrastructure for modern factories. They’re the digital nervous system that keeps production lines running, resources optimized, and compliance requirements met.
The latest AWS and Azure outages revived a question I often get on calls with customers and prospects: can we deploy the MES in a multi-cloud or cloud-plus-on-premises configuration in a fully redundant way? As organizations explore cloud flexibility and multi-region deployments to achieve resilience, they’re discovering that distributed architectures introduce a problem far more complex than infrastructure orchestration: how do you keep your database synchronized across multiple infrastructures without destroying application performance?
Part 1: Why Your MES Cannot Afford Downtime
The Business Reality of Manufacturing Downtime
Manufacturing downtime is catastrophically expensive. According to recent industry data, [unplanned downtime costs the world’s 500 biggest companies 11% of their revenues - approximately $1.4 trillion annually](https://www.automation.com/article/white-paper-true-costs-downtime-2024). In the automotive sector alone, a single hour of production stoppage now costs approximately $2.3 million - or $600 per second.
For a typical large manufacturing facility, unplanned downtime averages 323 production hours per year, translating to $532,000 per hour in lost revenue, financial penalties, idle staff time, and line restart costs - amounting to $172 million annually per plant. These aren’t edge cases or theoretical numbers - they’re institutional realities across Fortune 500 manufacturers.
Why High-Availability Isn’t Optional
A Manufacturing Execution System drives real-time production decisions: material allocation, quality checks, equipment scheduling, and regulatory compliance documentation. When your MES goes offline, you don’t just lose visibility - you lose the ability to make decisions at all. Production doesn’t pause elegantly; it halts chaotically. Work-in-progress (WIP) inventory stalls. Quality escapes increase. Regulatory traceability breaks.
This is why high-availability (HA) and disaster recovery (DR) aren’t luxuries. They’re infrastructure requirements. Organizations deploy SQL Server Availability Groups specifically to ensure Recovery Time Objectives (RTO) measured in minutes and Recovery Point Objectives (RPO) that limit data loss to acceptable windows.
The standard production deployment for a Critical Manufacturing MES includes:
- A multi-node Kubernetes cluster for application redundancy
- SQL Server Always-On Availability Groups with synchronous commits for zero-loss failover
- Network infrastructure optimized for sub-100 millisecond latency between nodes
This architecture works exceptionally well within a single region or data center. The problem emerges the moment you try to stretch it across geographies.
The technologies mentioned throughout this article are illustrative. The failure scenarios we will cover apply to any system that stores state.
Part 2: The Technical Nightmare of Distributed State
The Fundamental Constraint: Latency and Consistency
Here’s where the physics of networks collides with the mathematics of distributed databases. To maintain zero Recovery Point Objective (RPO) across multiple clouds or a cloud-plus-on-premises hybrid deployment, you need synchronous replication. Each database transaction must be confirmed as received and written to disk on every replica before the primary node acknowledges the write to the application.
This guarantee is powerful: if the primary datacenter burns down at 3 PM, you can fail over to your secondary in another cloud and know with absolute certainty that all committed data is present. No data loss. None.
But here’s the cost: synchronous replication is fundamentally latency-constrained.
SQL Server Always-On: The Synchronous-Commit Trap
SQL Server Always-On Availability Groups offer two fundamental synchronization modes:
Synchronous Commit Mode:
- Every transaction is written to the primary and ALL secondary replicas’ transaction logs before the primary acknowledges success to the application
- The primary waits for all secondaries to confirm receipt and hardening
- Provides RPO = 0 - zero data loss
- Performance impact is direct and unavoidable: the application’s transaction latency increases by the round-trip time (RTT) to the farthest replica
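Availability mode is configured per replica. Here is a minimal sketch of forcing a secondary into synchronous commit; the availability group and replica names ([MES_AG], 'SQLNODE-DR') are placeholders, not Critical Manufacturing defaults:

```sql
-- Run on the primary: make the DR replica a synchronous-commit partner.
-- From this point on, every commit waits for this replica to harden the log.
ALTER AVAILABILITY GROUP [MES_AG]
MODIFY REPLICA ON 'SQLNODE-DR'
WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
```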
In a single data center with fiber-optic connectivity (sub-millisecond latency), this overhead is acceptable. In a multi-cloud or hybrid cloud-plus-on-premises scenario, the numbers get brutal.
Under synchronous replication, every single database write operation adds latency to your transaction time. A manufacturing system processing production orders, recipe changes, and quality measurements might generate 100-5000 database transactions per minute across the factory floor. Each added millisecond of latency compounds.
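As a rough back-of-envelope illustration (the 70 ms figure is the cross-region round trip Microsoft cites below; the write count is purely illustrative): added wait per user operation ≈ RTT × number of sequential synchronous writes, so an operation that issues 100 sequential writes over a 70 ms link picks up roughly 7 seconds of pure replication wait before any real work is done.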
This is why your MES becomes sluggish. User operations that should complete in single-digit seconds now require 10-15 seconds. The system feels broken - because it is.
Asynchronous Commit Mode:
- Transactions are immediately acknowledged to the application after writing to the primary only
- The primary then ships transaction logs to secondaries asynchronously, without waiting
- The secondary applies changes to its copy independently, in the background
- Performance impact: zero. You pay no latency penalty
- Data loss exposure: significant
Here’s the trap: in asynchronous mode, if your primary data center fails while the secondary is still catching up on the replication queue, you’ve lost committed transactions. The user confirmed the order. It’s gone.
Worse: in production systems, the lag can be seconds or minutes, not milliseconds. Network congestion, disk write bottlenecks on the secondary, or high transaction volume cause replication queues to back up. You have no guarantee that the secondary has ANY recent data, let alone all the data you need.
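You can see exactly how far behind a secondary is from the primary’s DMVs. A minimal sketch (run on the primary; the columns shown are illustrative, not a complete monitoring query):

```sql
-- How much log is still queued for each secondary, and how stale its copy is.
SELECT ar.replica_server_name,
       drs.synchronization_state_desc,   -- SYNCHRONIZED / SYNCHRONIZING / NOT SYNCHRONIZING
       drs.log_send_queue_size,          -- KB of log not yet sent to the secondary
       drs.redo_queue_size,              -- KB received but not yet applied on the secondary
       drs.last_commit_time              -- commit time of the last transaction visible there
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id
WHERE drs.is_local = 0;
```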
Real-World Evidence: The Azure DevOps September 2018 Outage
Don’t take my word for it. Microsoft learned this lesson the hard way.
On September 4, 2018, Azure DevOps (formerly VSTS) suffered a 21+ hour regional outage affecting South Central US, cascading to global services. A lightning strike hit the datacenter, forcing power-down procedures. With the region down, Microsoft faced a critical decision: fail over to the paired US North Central region using the asynchronously replicated copy of the data, or wait for the primary region to recover.
Here’s what Microsoft’s postmortem revealed about why they couldn’t simply fail over:
“The round-trip latency is added to every write. This adds approximately 70ms for each round trip between South Central US and US North Central. For some of our key services, that’s too long… Since every write only succeeds when two different sets of services in two different regions can successfully commit the data and respond, there is twice the opportunity for slowdowns and failures. As a result, either availability suffers (halted while waiting for the secondary write to commit) or the system must fall back to asynchronous replication.”
Microsoft’s architects faced the exact tradeoff we’re describing: synchronous replication across regions destroyed performance; asynchronous replication meant unacceptable data loss risk. They chose to wait for recovery rather than fail over to a secondary they couldn’t trust.
Their conclusion is damning: “Achieving perfect synchronous replication across regions to guarantee zero data loss on every fail over across regions at any point in time is not possible for every service that also needs to be fast.”
This is a $200+ billion company with unlimited engineering resources and deep cloud platform expertise. And they still couldn’t solve this problem. That should tell you something about the fundamental constraints you’re facing.
The Real Problem: You Can’t Split the Difference
Here’s what many architects try: asynchronous replication for cross-region links (to dodge the latency), synchronous replication for intra-region replicas (to keep local data protection).
This works technically. It doesn’t work philosophically.
The moment your disaster recovery scenario is “secondary data center has everything up to 5 minutes ago,” you’re managing a distributed system in a state of eventual consistency. You’re no longer designing for zero-loss failover; you’re designing for “acceptable loss.”
For a pharmaceutical or medical devices manufacturer required to maintain traceable records of every production step, or an automotive supplier operating under strict regulatory requirements, “acceptable loss” isn’t a simple business decision - it’s a compliance question. What transactions are you comfortable losing? Which regulatory violations are you willing to explain?
And that assumes everything works as designed. In practice:
- If a synchronous secondary disconnects or falls too far behind during high load, SQL Server marks it as not synchronized and the primary stops waiting for it, committing as if the replica were asynchronous. Your “RPO = 0” guarantee silently becomes “RPO = unknown”, and you won’t find out until you try to fail over. (A quick DMV check for this is sketched after this list.)
- Network partitions introduce split-brain scenarios. If the WAN link flickers between your cloud provider and your on-premises facility, which replica is primary? Who’s authoritative? How do you merge divergent datasets afterward?
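A quick check for the first failure mode, sketched assuming you are querying the primary (the DMV columns are standard, the alerting around them is up to you):

```sql
-- Flag synchronous-commit replicas the primary is no longer actually waiting for.
SELECT ar.replica_server_name,
       ar.availability_mode_desc,        -- configured mode for the replica
       drs.synchronization_state_desc,   -- actual state right now
       drs.is_commit_participant         -- 0 = commits do not wait for this replica
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = drs.replica_id
WHERE ar.availability_mode_desc = 'SYNCHRONOUS_COMMIT'
  AND drs.synchronization_state_desc <> 'SYNCHRONIZED';
```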
Additional Operational Nightmares
The latency problem is just the technical surface. Beneath it lurk operational complexities:
Cost Explosion:
Multi-cloud deployments require dedicated network connectivity to achieve reliable, low-latency synchronous replication. These connections run hundreds to thousands of dollars per month depending on bandwidth and geography. Your cloud bills don’t shrink; they multiply by the number of regions.
Maintenance Complexity:
Each cloud provider (AWS, Azure, Google Cloud) has different networking models, different security postures, different operational runbooks. A network engineer needs specialized knowledge for each. A single misconfiguration cascades across your entire deployment. Automated failover becomes a minefield of edge cases - what happens when one cloud provider has a regional outage but your on-premises facility is fine?
Monitoring and Observability:
In a single-region deployment, you measure lag with SQL Server DMVs and dashboard counters. Across regions and providers, you’re blind until you build specialized cross-region replication monitoring, which is itself complex.
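One widely used approximation for that kind of dashboard is to compare commit times between the primary and each secondary. A rough sketch (run on the primary; it estimates seconds of potential data loss per secondary rather than giving a hard guarantee, and column availability varies slightly by SQL Server version):

```sql
-- Approximate RPO per secondary: seconds of committed work it would lose right now.
SELECT ar.replica_server_name,
       DATEDIFF(second, s.last_commit_time, p.last_commit_time) AS est_data_loss_seconds
FROM sys.dm_hadr_database_replica_states AS p
JOIN sys.dm_hadr_database_replica_states AS s
  ON s.group_database_id = p.group_database_id
 AND s.is_primary_replica = 0
JOIN sys.availability_replicas AS ar
  ON ar.replica_id = s.replica_id
WHERE p.is_primary_replica = 1;
```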
The Uncomfortable Reality
Multi-cloud MES deployments are achievable. They’re not easy.
The most pragmatic architectures accept the latency reality:
- Keep synchronous replication within a single metropolitan region (cloud availability zones or co-located data centers with sub-15ms latency). This preserves RPO = 0 for local failures.
- Use asynchronous replication for cross-region or cloud-to-premises links, accepting a defined recovery point objective (typically 1-5 minutes of acceptable data loss) and a longer recovery time objective. A sketch of this mixed topology follows this list.
- Design your MES architecture to tolerate this limitation: Some functions (real-time production control, quality checks) stay in the primary region. Only non-critical analytics or reporting workloads run in secondary regions.
- Invest heavily in network engineering: Private WAN links, traffic prioritization, and redundant connectivity paths are non-negotiable costs.
- Establish clear policies about failover decisions: When do you accept data loss? Who decides? What’s your communication plan to customers? Microsoft learned that you need to give customers the choice of whether to fail over with possible data loss or wait for recovery - not decide for them.
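As a concrete illustration of that mixed topology, here is a minimal, hypothetical sketch: two local replicas in synchronous commit with automatic failover, and one remote replica in asynchronous commit with manual failover. Every name and endpoint below is a placeholder, not a recommended default:

```sql
-- Illustrative only: the local pair is synchronous (RPO = 0 locally, automatic failover),
-- the remote/DR replica is asynchronous (no latency penalty, but a non-zero RPO).
CREATE AVAILABILITY GROUP [MES_AG]
FOR DATABASE [MesDb]
REPLICA ON
  'SQLNODE1' WITH (
      ENDPOINT_URL = 'TCP://sqlnode1.plant.local:5022',
      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
      FAILOVER_MODE = AUTOMATIC),
  'SQLNODE2' WITH (
      ENDPOINT_URL = 'TCP://sqlnode2.plant.local:5022',
      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
      FAILOVER_MODE = AUTOMATIC),
  'SQLNODE-DR' WITH (
      ENDPOINT_URL = 'TCP://sqlnode-dr.cloud.example:5022',
      AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
      FAILOVER_MODE = MANUAL);
```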
This isn’t the “cloud is everything, everywhere” narrative that marketing decks promote. It’s the reality of maintaining both performance and reliability across distributed systems.
The hard truth: distributed systems are hard because the physics is hard. Synchronous replication provides the guarantees you want but destroys the performance you need. Asynchronous replication preserves performance but breaks your guarantees. There is no free lunch, and anyone telling you otherwise isn’t thinking through the failure scenarios.
For Critical Manufacturing’s MES customers, the question isn’t “should we do multi-cloud?” It’s “which specific failure scenarios justify which specific tradeoffs?” And that answer is different for every customer, every geography, and every regulatory environment.
Author
Hi! My name is José Pedro Silva. ✌️
I’m the R&D Director at Critical Manufacturing. Passionate about cybersecurity and everything Software Engineering.
You can find me on LinkedIn.
