SL#49 - The Eight-Hour Outage After a Seven-Minute Fix

On May 19, at 22:20 UTC, Google Cloud's automated systems placed Railway's production account into a suspended state. Railway noticed in 90 seconds. A P0 ticket landed with GCP at 22:22. The account was unsuspended at 22:29. From ticket to reinstatement, the provider-side fix took seven minutes; the suspension itself lasted nine.

Railway's status page marked the incident resolved at 07:58 the next morning. That is more than nine hours from the first alert to "done," and roughly eight hours of customer-visible failure after the account was restored. Workloads on Railway Metal and AWS, infrastructure that Google Cloud had no business affecting, started returning 404 errors. Logins broke. Builds queued up and then jammed. By the time GitHub's OAuth rate limits kicked in around 02:47, the original GCP suspension had been over for more than four hours.
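The durations quoted above can be checked with a few lines of arithmetic. This is only a sketch: the timestamps are the ones given in the text, and the helper `at` and the variable names are made up for illustration.

```python
from datetime import timedelta


def at(hhmm: str, day: int = 0) -> timedelta:
    """Offset from midnight UTC on May 19; day=1 means May 20."""
    hours, minutes = map(int, hhmm.split(":"))
    return timedelta(days=day, hours=hours, minutes=minutes)


# Timestamps as reported in the text above (UTC).
suspended = at("22:20")
ticket_filed = at("22:22")
unsuspended = at("22:29")
oauth_rate_limited = at("02:47", day=1)
marked_resolved = at("07:58", day=1)

ticket_to_fix = unsuspended - ticket_filed
suspension_to_resolved = marked_resolved - suspended
fix_to_rate_limit = oauth_rate_limited - unsuspended

print(ticket_to_fix)           # 0:07:00
print(suspension_to_resolved)  # 9:38:00
print(fix_to_rate_limit)       # 4:18:00
```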

The interesting question is not how a cloud provider managed to suspend a major customer's account by accident. That is a governance story and it has been told several times already. The interesting question is why a provider error that was fixed in seven minutes produced an eight-hour customer-facing outage, and what that tells you about every multi-cloud architecture currently in production.

What "multi-cloud" actually meant for Railway

Railway runs what they call a mesh ring: high-availability fiber interconnects between their bare-metal datacenters, AWS, and GCP. The pitch in their own incident report is that this gives them redundancy across providers. If one cloud has a bad day, the mesh routes around it.

That was true for the data plane. Workloads were running across three providers. AWS instances were healthy. Metal instances were healthy. The fiber interconnects were carrying traffic.

It was not true for the control plane.

Railway's network control plane API, the service that populates the routing tables used by their edge proxies, was hosted on machines inside Google Cloud. The edge proxies cached the routing tables they pulled from that API. While the cache held, the data plane kept working. Workloads on Metal and AWS continued to serve traffic for about an hour after GCP went dark. Then the cache entries expired, the proxies tried to refresh from the control plane, the control plane was unreachable, and the entire mesh forgot where its endpoints were.
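The cache-expiry behavior described above can be sketched in a few lines. This is not Railway's code: the class `RoutingCache`, the `serve_stale_on_error` flag, and the shape of the routing table are all assumptions made for illustration. With the flag off, the cache behaves as described, working until the TTL passes and then failing hard. With it on, the proxy keeps its last known good table when a refresh fails, a common mitigation that is separate from the fix Railway has announced.

```python
import time
from typing import Callable, Dict, Optional

RoutingTable = Dict[str, str]  # hypothetical shape: hostname -> backend address


class RoutingCache:
    """Caches a routing table fetched from a control plane (illustrative only)."""

    def __init__(self, fetch: Callable[[], RoutingTable], ttl_seconds: float,
                 serve_stale_on_error: bool,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._serve_stale = serve_stale_on_error
        self._clock = clock
        self._table: Optional[RoutingTable] = None
        self._fetched_at = 0.0

    def get(self) -> RoutingTable:
        now = self._clock()
        if self._table is not None and (now - self._fetched_at) < self._ttl:
            return self._table  # still fresh, control plane not consulted
        try:
            self._table = self._fetch()
            self._fetched_at = now
            return self._table
        except ConnectionError:
            if self._serve_stale and self._table is not None:
                return self._table  # last known good, possibly outdated
            raise  # no usable table: the proxy no longer knows its endpoints
```

Serving stale routes has its own cost: a workload that moved after the last successful refresh will be routed to the wrong place. It trades correctness for availability and does not replace a control plane that can be reached from more than one provider.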

This is the failure mode that matters. Workloads were not down. The network did not know how to reach them. Three clouds of compute, fully operational, fully isolated from each other at the data layer, all rendered invisible by a single dependency at the control layer.

The fix Railway has committed to, removing that hard dependency so traffic can be directed independently of any single cloud provider, is the right one. But the more useful question is how many other teams are running the same architecture without realising it.

Account suspension is not in the failure model

Most disaster recovery planning is built around a small set of failure modes that have been studied to death. Hardware fails. Networks partition. Disks corrupt. Regions fall over. Cosmic rays flip bits. The chaos engineering tooling, the SRE textbooks, and the cloud providers' own resilience patterns all assume the failures look like that.

Account suspension is a different beast. It is not a fault. It is a decision. Specifically, it is a decision made by an automated policy at your provider that may be triggered by signals you cannot see and cannot test against. Railway's report is specific: the suspension "extended to many accounts within Google Cloud" and was applied "as part of an automated action" with no proactive outreach. You cannot game-day this. You cannot inject it in a chaos experiment without coordinating with your provider. You cannot Terraform around it.

And the blast radius is total. A hardware failure takes out a rack, an AZ, occasionally a region. An account suspension takes out everything you have on that provider, simultaneously, without grace period. Compute, storage, networking, identity. The unit of failure is the contract, not the machine.

If your control plane lives entirely inside one account on one provider, your data plane inherits that provider's account-level failure domain whether you want it to or not. Multi-cloud at the workload tier does not change this. The cache helps for an hour; then the cache helps for zero hours.
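One way to make that inheritance concrete is to walk the dependency graph and collect every account a component transitively relies on. The inventory below is a hypothetical simplification loosely shaped like the incident (workloads behind an edge proxy that depends on a routing API). The component and account names are invented, and only hard dependencies are modeled.

```python
from typing import Dict, List, Set

# Hypothetical inventory: component -> provider account it runs in.
ACCOUNT: Dict[str, str] = {
    "metal-workloads": "metal",
    "aws-workloads": "aws",
    "edge-proxy": "metal",
    "routing-api": "gcp",
}

# Hypothetical hard dependencies: component -> components it cannot work without.
DEPENDS_ON: Dict[str, List[str]] = {
    "metal-workloads": ["edge-proxy"],
    "aws-workloads": ["edge-proxy"],
    "edge-proxy": ["routing-api"],
}


def failure_domains(component: str, account: Dict[str, str],
                    depends_on: Dict[str, List[str]]) -> Set[str]:
    """Accounts whose loss breaks `component`, following dependencies transitively."""
    seen: Set[str] = set()
    domains: Set[str] = set()
    stack = [component]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        domains.add(account[current])
        stack.extend(depends_on.get(current, []))
    return domains


def affected_by(suspended: str, account: Dict[str, str],
                depends_on: Dict[str, List[str]]) -> List[str]:
    """Every component that inherits the suspended account as a failure domain."""
    return sorted(c for c in account
                  if suspended in failure_domains(c, account, depends_on))


print(affected_by("gcp", ACCOUNT, DEPENDS_ON))
# ['aws-workloads', 'edge-proxy', 'metal-workloads', 'routing-api']
```

A cache in front of a dependency does not remove the edge from this graph. It only delays when the failure becomes visible.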

What "the backup didn't work" looks like in two flavors

Twelve days earlier, on May 7, a room in an AWS data center in Northern Virginia overheated and the cooling units in availability zone use1-az4 failed. EC2 instances and EBS volumes on the affected racks lost power within the hour. Coinbase was offline for approximately seven hours.

Multi-AZ would have saved them. They had multi-AZ for most workloads. Their matching engine, the latency-critical core of the exchange, was deliberately not multi-AZ. The trade-off was conscious: cross-AZ network hops on every order match would have given up the microseconds that competitive exchanges spend their lives chasing. So Coinbase made the call, ran the matching engine in one zone, and built a backup that was supposed to take over when the zone died.

Their Head of Platform was unusually direct in the postmortem statement: backup systems "did not work as expected during the incident, extending the outage and forcing engineers to manually execute disaster recovery procedures." Engineers wrote, tested, deployed, and validated a fix while the production system was on fire.

Two incidents, two providers, two architectural shapes, same conclusion. A backup that has not been exercised under the specific conditions of the actual failure is a hypothesis, not a backup. Coinbase's backup was distributed in some sense but not in the sense that survived this particular failure. Railway's mesh was redundant in some sense but not in the sense that survived this particular failure. Both teams shipped what looked like resilient designs and both designs broke in the way that mattered.

The shared lesson is one engineers keep relearning at every scale: your failover only protects you from failure modes you have actually tested it against. Everything else is an architecture diagram with arrows pointing in optimistic directions.

What true control-plane independence costs

Railway's planned remediations are concrete enough to be useful as a template. Three things, in order of difficulty:

First, remove the hard dependency on a single-cloud control plane. Every dataset the edge needs to keep serving traffic should be reachable from at least two clouds, with no priority ordering that defaults all reads to the suspended provider. The cost is operational complexity: you now run the same control plane in multiple places, with quorum semantics, and you become responsible for cross-cloud consistency.
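A minimal sketch of "no priority ordering" on the read path, assuming the same routing data is already replicated to every provider: the edge tries replicas in random order and takes the first one that answers. The function `fetch_from_any` and the replica names are hypothetical, and the sketch deliberately ignores the harder part, which is keeping those replicas consistent.

```python
import random
from typing import Callable, Dict, Optional, Tuple

RoutingTable = Dict[str, str]  # hypothetical shape: hostname -> backend address


def fetch_from_any(sources: Dict[str, Callable[[], RoutingTable]],
                   rng: Optional[random.Random] = None) -> Tuple[str, RoutingTable]:
    """Try every control-plane replica in random order; return the first success."""
    rng = rng or random.Random()
    names = list(sources)
    rng.shuffle(names)  # no provider is the default first choice
    failed = []
    for name in names:
        try:
            return name, sources[name]()
        except ConnectionError:
            failed.append(name)  # unreachable replica: move on to the next one
    raise ConnectionError(f"all control-plane replicas unreachable: {sorted(failed)}")
```

Randomizing the order means a suspended provider costs each read at most one failed attempt. A fixed order that lists the same provider first would instead send every read through it.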

Second, extend high-availability database shards across providers so that a sudden disappearance of one cloud still leaves quorum. The cost is a meaningful jump in storage and bandwidth bills. Asynchronous cross-cloud replication is much cheaper than synchronous, but asynchronous means accepting a recovery point objective measured in seconds or minutes, not zero. Most teams cannot pay for synchronous and most teams should not need to.
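Whether a given replica placement "still leaves quorum" is a counting exercise. The sketch below assumes majority quorum over the original cluster size, which is how most consensus-based databases behave. The function name and provider labels are illustrative.

```python
from collections import Counter
from typing import List


def survives_any_provider_loss(replica_providers: List[str]) -> bool:
    """True if a majority of replicas remains after losing any single provider."""
    total = len(replica_providers)
    majority = total // 2 + 1
    per_provider = Counter(replica_providers)
    # Losing a provider removes all of its replicas at once.
    return all(total - lost >= majority for lost in per_provider.values())


print(survives_any_provider_loss(["gcp", "gcp", "aws"]))                    # False
print(survives_any_provider_loss(["gcp", "aws"]))                           # False
print(survives_any_provider_loss(["gcp", "aws", "metal"]))                  # True
print(survives_any_provider_loss(["gcp", "gcp", "aws", "aws", "metal"]))    # True
```

Two providers are never enough under this rule, because losing either one leaves at most half the replicas. Surviving the loss of a whole provider takes at least three independent failure domains, with no single one holding half or more of the replicas.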

Third, demote the provider whose account was suspended to a secondary or failover-only role, keeping it out of the data plane hot path. This is the political one. It is the architectural acknowledgement that you have to design as though the provider can vanish, even when you have signed enterprise contracts and have an account manager. The cost is honesty about how much your business depends on a single procurement relationship.

These three changes are not exotic. They are what "multi-cloud" was always supposed to mean. The reason teams skip them is that they look expensive in the months when nothing is on fire. The reason teams adopt them is that one Tuesday in May, something is on fire and the cost of not having them is suddenly the entire revenue line.

Yes, but most of us are not Railway

The reasonable objection: most engineering teams do not run multi-cloud meshes. Most teams have one cloud, a few regions, maybe a CDN, and a vague intention to think about disaster recovery later. The Railway story is a fascinating case study but it does not seem to apply.

It does. The question generalizes once you strip out the multi-cloud framing.

Where does your control plane live, and what failure modes does it inherit from that location? The control plane for a SaaS app might be a Postgres database in one AZ. The control plane for a deploy pipeline might be a single GitHub Actions runner. The control plane for an authentication system might be a managed Cognito or Auth0 tenant. Each of those is a single point of decision-making whose failure mode is inherited by every workload downstream, regardless of how those workloads are distributed.

The substantive question is not "are you multi-cloud." It is "if the system that decides where requests should go disappears for an hour, does the rest of the architecture know what to do?" For most teams, the honest answer is no, and they have not been forced to test it because the system that decides where requests should go has been quietly reliable. That reliability is a property of the provider's good week, not the architecture's good design.

What to do on Monday

Pull up your architecture diagram. Find the box labeled "control plane" or "orchestration" or "service discovery" or, if you do not have a box like that, the place where the routing decisions actually get made. Note which provider, which region, which account that box lives in.

Then ask the uncomfortable question: if the entity that owns that account decides tomorrow morning that you are violating some policy, by mistake or otherwise, and freezes the account at 09:00, what is the timeline for your customers? Not the timeline for getting the account unfrozen. The timeline for the cached state in front of the control plane to expire, the dependent services to start failing, the queues to back up, the partner integrations to start rate-limiting you for retrying, and the manual recovery to complete after the account is restored.

If that timeline is longer than you can defend to your customers, you have a control plane location problem, and the fact that it has not bitten you yet is luck. Railway and Coinbase both spent the last two weeks publishing what their luck running out looked like. The cheap version of learning from their incidents is to do the audit before yours runs out too.
