SL#56 - All Three Availability Zones Failed Together. Twice in Five Weeks.
Multi-AZ is sold as protection against correlated failure. Two Azure incidents this spring show the independence assumption only ever covered one peril, and that your stateful layer never recovers at the speed of your compute.
SL#49 - The Eight-Hour Outage After a Seven-Minute Fix
Railway's Google Cloud account was suspended at 22:20 UTC and reinstated at 22:29. The platform was still down at 06:14 the next morning. The lesson is bigger than Railway.