SL#56 - All Three Availability Zones Failed Together. Twice in Five Weeks.

On 29 May 2026, a thunderstorm took down all three Availability Zones in Azure's West US 2 region at the same time. Five weeks earlier, on 24 April, all three zones in East US degraded too. Different region, different root cause, same outcome: the thing you spread your workload across precisely so it would not fail as a unit, failed as a unit.

If you run anything serious on a single cloud region, you have almost certainly drawn the same architecture diagram everyone draws. Three zones. Replicas in each. A load balancer in front. The mental model behind that diagram is that the zones are independent failure domains, so the probability of losing all three at once is the product of three small numbers, which rounds to never. Both Azure incidents are a reminder that this multiplication only holds for one specific failure mode, and that your diagram quietly assumes it holds for all of them.

Here is the thesis: a multi-AZ deployment is insurance against a single named peril, and the two incidents this spring were claims filed against perils the policy never covered. One was physical and lived above the layer that AZs isolate. The other was logical and lived inside a service whose state spanned the zones. Neither is exotic. Both will happen to you.

What "independent" actually promises

Read the marketing and an Availability Zone sounds like a hermetically sealed bunker. Read the engineering and it is narrower than that. A zone is one or more datacenters with independent power, cooling, and networking, physically separated from the other zones in the region, close enough for synchronous replication and far enough that a single physical event should not take out more than one. The whole construct is designed around one dominant historical cause of datacenter loss: power. A zone has its own utility feed, its own generators, its own batteries, because the failure that used to take buildings offline was a power failure, and isolating power is what stops one building's bad day from becoming three.

That is the multiplication that works. If the perils are independent single-building events, three zones really do give you the product of three small probabilities. The problem is that "independent" is a property of a specific cause, not a property of the zones in general. The moment a failure shares a dependency that sits above the power layer, or reaches through a service whose state lives in all three zones, the independence you paid for evaporates and the three numbers stop multiplying. The chance of losing all three collapses to the chance of the shared cause firing, which is one number, and typically a far larger one than the product.
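To make the arithmetic concrete, here is a minimal sketch of a common-cause model. The probabilities are invented for illustration and are not measured Azure figures; the function name is hypothetical.

```python
def p_all_zones_down(p_zone: float, p_shared: float, zones: int = 3) -> float:
    """Probability that every zone is down in a given window.

    p_zone:   chance that one zone fails on its own (independent cause).
    p_shared: chance of a shared cause that takes out every zone at once.
    Both values are illustrative assumptions, not measured figures.
    """
    independent = p_zone ** zones
    # Either the shared cause fires, or it does not and all zones
    # happen to fail independently in the same window.
    return p_shared + (1.0 - p_shared) * independent


independent_only = p_all_zones_down(p_zone=0.001, p_shared=0.0)
with_shared_cause = p_all_zones_down(p_zone=0.001, p_shared=0.0001)

print(f"independent only:  {independent_only:.3e}")
print(f"with shared cause: {with_shared_cause:.3e}")
```

With these made-up inputs, a shared cause that is ten times rarer than a single-zone failure still raises the chance of losing all three zones by about five orders of magnitude. The product term barely registers once any shared cause exists.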

The physical correlation: weather is bigger than a building

The West US 2 incident is almost a textbook demonstration that the power abstraction held and everything above it did not. From Azure's preliminary post-incident review (tracking ID GHRP-84G): a severe thunderstorm caused utility power loss to multiple datacenter facilities serving the region. The generators did exactly what they exist to do. In Azure's own words, "the transition to generator power worked successfully, across each of the Availability Zones, so the utility power disruption did not directly impact any IT infrastructure or customer workloads."

So the layer that AZ design is built to isolate, power, behaved perfectly. And the region still went down, because "independent of the transition from utility to generator power, a subset of the components within the mechanical cooling system faulted, resulting in them entering a 'lockout' protective state." Temperatures climbed. Thermal protection systems started shutting hardware down to avoid physical damage. The shutdowns hit "at least one datacenter within each Availability Zone."

The detail that should bother you is the word Azure used: "systemic." The cooling faults "were driven by a systemic failure within the cooling infrastructure, which occurred in multiple Availability Zones and resulted in repeated failures, despite our design that generally isolates Availability Zones." The weather event was geographically large enough, and the cooling design was similar enough across zones, that the same protective lockout tripped in all three. The failure domain that mattered here was not the building. It was the weather system, and the weather system does not respect the boundaries on your architecture diagram.

This is the general shape of correlated physical failure. Your zones are isolated on the axis the cloud provider chose to isolate them on. A failure that travels on a different axis, a regional weather front, a shared cooling design with a shared failure mode, a fiber conduit that several "independent" paths happen to share, sees one big target, not three small ones. You did not buy protection against it. You bought protection against the last war.

The logical correlation: quorum has no zone

The April incident in East US is the other half of the lesson, and it is the half people forget because there is no storm to point at. The trigger, per Azure's PIR (tracking ID 5GP8-W0G), was a latent regression in a recently deployed version of an internal control-plane service called PubSub, which sits between resource providers and the networking agents on each host. A single partition in one zone hit lock contention. The automatic failover to a secondary replica did not complete. The manual failover did not complete either.

Then the interesting part. As the team rolled back zone by zone and load redistributed across the region, the impact did not stay put. It moved. "There were two periods of time during which we were unable to maintain two simultaneous fully healthy instances of the PubSub service across availability zones... This resulted in a temporary loss of quorum within the service. As the service attempted to self-heal, customer impact shifted between availability zones, leading to periods of degraded behavior across multiple zones."

Sit with that. The zones were physically fine the entire time. Power was fine, cooling was fine, no datacenter went dark. The thing that spanned the zones was a distributed service that needed a quorum of healthy replicas across them to function, and when the regression knocked replicas out faster than they could be rebuilt, the quorum requirement turned three independent-looking zones into one shared fate. Your stateless workload running in those zones inherited the blast radius of a control-plane service you do not operate and probably did not know was on the critical path of a VM create.

Logical correlation is harder to design against than physical correlation because it is invisible on the diagram. The shared dependency is not a power bus or a chiller. It is a consensus group, a metadata service, a regional control plane, a single Kafka cluster, a database whose primary lives in one zone and whose failover has never actually been tested under load. Every one of those is a wire connecting your "independent" zones, and most of them are wires you did not draw.
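A minimal sketch of how a quorum requirement couples the zones, assuming a hypothetical service with one replica per zone and a simple majority rule. The PIR does not describe PubSub's actual replica layout or quorum rule, so the names and the timeline here are illustrative only.

```python
from typing import Dict


def has_quorum(healthy_by_zone: Dict[str, bool]) -> bool:
    """Majority quorum over one replica per zone (an assumed layout)."""
    healthy = sum(healthy_by_zone.values())
    return healthy > len(healthy_by_zone) // 2


# Hypothetical timeline: each step records replica health per zone.
timeline = [
    {"zone-1": True, "zone-2": True, "zone-3": True},    # steady state
    {"zone-1": False, "zone-2": True, "zone-3": True},   # one replica stuck
    {"zone-1": False, "zone-2": False, "zone-3": True},  # second replica rebuilding
    {"zone-1": True, "zone-2": False, "zone-3": True},   # first replica back
]

quorum_history = [has_quorum(step) for step in timeline]
for step, ok in zip(timeline, quorum_history):
    down = [zone for zone, healthy in step.items() if not healthy]
    print(f"down={down or 'none'} quorum={'yes' if ok else 'LOST'}")
```

In the third step every zone's physical infrastructure is untouched, yet the service is unavailable to callers in all three, including the zone whose replica is healthy. That is the wire the diagram does not show.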

The part that actually sets the clock: state

There is a second pattern that both incidents share, and it is the one with the most direct operational payoff. In neither case did compute set the recovery time. State did.

In West US 2, once cooling came back, the recovery split cleanly. Compute came back fast: by 06:15 UTC roughly half the affected VMs had recovered, and by 12:00 UTC about 95% had, "after premium Storage services had been largely recovered." Storage was the long pole. Azure is blunt about why: "affected storage systems needed manual data integrity validation before being returned to service. This validation is designed to ensure that customer data remains consistent and uncorrupted, but must be performed sequentially and cannot be significantly accelerated. Storage recovery was the primary factor in the extended duration of this incident." Cooling was restored by 06:00 UTC. The incident did not fully close until 02:30 UTC the next day. Almost all of that twenty-hour tail was state being carefully, un-parallelizably checked.

In East US, the same asymmetry shows up wearing a different costume. The control-plane service co-located compute and data on the same Service Fabric nodes for performance, which meant rollback sometimes required rebuilding full replicas on new nodes. Azure: "Recovery time was further extended because compute and data are co-located on the same nodes. In some cases, rollback requires rebuilding full replicas on new nodes, which significantly increases the time required to complete each stage." Restarting a stateless process is instant. Rebuilding a replica means moving and re-validating state, and state moves at the speed of bytes and consistency checks, not the speed of a scheduler.
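The asymmetry can be sketched as a toy recovery model, assuming stateless instances restart in parallel waves while stateful units are validated one at a time. Every number below is made up for illustration; none comes from either PIR.

```python
import math


def compute_recovery_minutes(instances: int, restart_minutes: float, parallelism: int) -> float:
    """Stateless restarts run in parallel waves of size `parallelism`."""
    waves = math.ceil(instances / parallelism)
    return waves * restart_minutes


def state_recovery_minutes(units: int, validate_minutes: float) -> float:
    """Integrity validation is assumed to be strictly sequential."""
    return units * validate_minutes


# Illustrative inputs only.
compute = compute_recovery_minutes(instances=2000, restart_minutes=5, parallelism=500)
state = state_recovery_minutes(units=40, validate_minutes=30)
total = max(compute, state)  # the workload is back only when both are

print(f"compute: {compute:.0f} min, state: {state:.0f} min, total: {total:.0f} min")
```

In this model, doubling the scheduler's parallelism halves the compute figure and leaves the total unchanged, because the sequential term dominates.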

This is not an Azure quirk. It is the deepest result in the literature on cloud availability. Google's 2010 OSDI study of its own globally distributed storage, "Availability in Globally Distributed Storage Systems," found that "correlation among node failures dwarfs all other contributions to unavailability in our production environment," and that larger failure bursts cluster hard inside physical domains: every burst of more than 20 nodes had rack affinity above 0.7, and every burst of more than 40 had affinity of at least 0.9. The failures that hurt are correlated, they ride physical structure, and the recovery is gated by getting state consistent again. Sixteen years later, two Azure PIRs say the same thing in plainer language.

Yes, but multi-AZ still earns its keep

None of this means you should collapse back to a single zone, and it does not mean multi-AZ is theater. The Google data cuts both ways. Yes, correlated failures dominate the unavailability that remains. But the failures that multi-AZ prevents, the single-rack, single-room, single-power-domain events, are common, and stopping them from becoming user-visible is most of the value. Multi-AZ is the right default. It is cheap relative to the outages it absorbs, and the vast majority of the time the bad day stays inside one zone exactly as designed. Both Azure incidents are notable precisely because they are the exception, not the rule. If correlated multi-zone loss were routine, you would not have needed me to point at two examples.

The honest claim is narrower than "multi-AZ is broken." It is this: multi-AZ is priced and reasoned about as though it covers all causes of regional loss, and it covers one. The fix is not to abandon it. The fix is to stop letting it be the entire disaster-recovery story, and to know exactly which perils it leaves on the table.

What to do Monday

Three things, in order of how little they cost.

First, write down the failure mode your multi-AZ setup actually isolates, and the ones it does not. If the answer to "what happens when a region-wide weather event trips a shared cooling design, or a regional control plane loses quorum" is a shrug, your three-zone diagram is giving you a confidence it has not earned. Multi-AZ is your defense against single-domain physical failure. It is not your defense against correlated regional failure, and the only real defense against that is another region.
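One lightweight way to do the writing-down is a dependency inventory that records the scope each dependency fails at. The sketch below uses invented entries and a hypothetical Dependency type; substitute your own.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Dependency:
    name: str
    failure_scope: str      # "zone" or "region": the smallest unit it fails as
    on_critical_path: bool  # does a user-facing request need it?


def not_covered_by_multi_az(deps: List[Dependency]) -> List[str]:
    """Dependencies that fail at regional scope and sit on the critical path."""
    return [d.name for d in deps if d.failure_scope == "region" and d.on_critical_path]


# Illustrative entries only; none of these describe a real deployment.
inventory = [
    Dependency("utility power", "zone", True),
    Dependency("regional control plane", "region", True),
    Dependency("single Kafka cluster", "region", True),
    Dependency("batch reporting job", "region", False),
]

uncovered = not_covered_by_multi_az(inventory)
print("Needs a second region or a degraded mode:", uncovered)
```

The hard part is not the code but deciding the scope honestly. Anything with cross-zone state or a shared design belongs in the "region" column until proven otherwise.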

Second, find your stateful recovery bottleneck and measure its recovery time, not your compute's. Both incidents say the same thing: compute comes back in minutes, state comes back in hours, and state sets the clock. Whatever your equivalent of "storage needs sequential integrity validation" is, your database failover, your replica rebuild, your cache rewarm, that number is your real recovery time, whatever your stated recovery time objective says. Most teams know how fast they can launch a VM and have no idea how long their primary takes to come back consistent. The second number is the one your customers feel.
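A minimal timing harness for that measurement, assuming you can supply your own is_consistent check. It is a hypothetical callable: for example, a query confirming that the new primary accepts writes and that replication has caught up.

```python
import time
from typing import Callable


def measure_recovery_seconds(
    is_consistent: Callable[[], bool],
    poll_interval: float = 1.0,
    timeout: float = 3600.0,
) -> float:
    """Poll until the stateful layer reports consistent; return elapsed seconds."""
    start = time.monotonic()
    while True:
        if is_consistent():
            return time.monotonic() - start
        if time.monotonic() - start > timeout:
            raise TimeoutError("stateful layer did not recover within the timeout")
        time.sleep(poll_interval)


# Stand-in check for demonstration: reports consistent on the third poll.
polls = {"count": 0}


def fake_check() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


elapsed = measure_recovery_seconds(fake_check, poll_interval=0.01, timeout=5.0)
print(f"recovered after {elapsed:.2f}s and {polls['count']} polls")
```

Start the clock when you trigger the failover, not when the replacement process starts, and run it under production-like load. The check should test consistency, not just that a port is open.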

Third, run the chaos experiment that matters. Killing a random VM tells you your stateless tier reschedules, which you already knew. The experiment that would have predicted both Azure incidents is killing a shared dependency: take quorum away from a control-plane service, force a region's worth of failover load onto a stateful layer at once, and watch whether the recovery time is the one you have been quoting in your SLA. It usually is not.

The diagram with three zones is not wrong. It is just answering one question, and you have been reading it as if it answered all of them.
