Cloud Failure Modes: Incidents, Cascades, and Blast Radius

Overview

Cloud infrastructure is not failure-free — it is designed for failure at a component level while attempting to maintain availability at the service level. Understanding how cloud failures actually occur, propagate, and are resolved is essential for every system architect. Real production incidents teach lessons that no documentation or training exercise can replicate. This file dissects major cloud outages in technical detail, extracts the underlying failure patterns, and derives architectural principles that make systems more resilient.

The failures examined here are not exotic edge cases. They are representative of the classes of failures that regularly affect cloud infrastructure. The scale of cloud operations means that low-probability failures happen regularly: if a failure has a one-in-a-million chance per operation and you perform a billion operations per day, that failure happens 1000 times daily.
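The arithmetic above can be checked with a short sketch. The function names are illustrative, and the model assumes each operation fails independently with the same probability, which real systems only approximate.

```python
import math


def expected_failures(p_failure: float, operations: int) -> float:
    """Expected number of failures when each operation fails independently."""
    return p_failure * operations


def prob_at_least_one(p_failure: float, operations: int) -> float:
    """P(at least one failure) = 1 - (1 - p)^n, computed stably for tiny p."""
    return -math.expm1(operations * math.log1p(-p_failure))


if __name__ == "__main__":
    # One-in-a-million failure rate, a billion operations per day.
    print(round(expected_failures(1e-6, 1_000_000_000)))
    # Even at a million operations, at least one failure is more likely than not.
    print(prob_at_least_one(1e-6, 1_000_000))
```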

Prerequisites

Understanding of distributed systems consistency and consensus
Familiarity with BGP routing fundamentals
Understanding of DNS and DNSSEC
Basic knowledge of queuing theory (feedback loops, saturation)
Familiarity with cloud storage architectures (EBS, S3)

Cloud Failure Taxonomy

Cloud failures are not monolithic. Different classes of failure have different impact scopes, different detection times, and different recovery strategies.

FAILURE TAXONOMY
════════════════════════════════════════════════════════════
│ Class             │ Scope      │ Recovery   │ Example     │
├───────────────────┼────────────┼────────────┼─────────────┤
│ Instance failure  │ One VM     │ Minutes    │ Hardware    │
│                   │            │            │ fault       │
├───────────────────┼────────────┼────────────┼─────────────┤
│ AZ failure        │ ~1/3 region│ Hours      │ Power,      │
│                   │            │            │ cooling     │
├───────────────────┼────────────┼────────────┼─────────────┤
│ Region failure    │ All AZs    │ Hours-days │ Rare, but   │
│                   │            │            │ occurred    │
├───────────────────┼────────────┼────────────┼─────────────┤
│ Control plane     │ Create/    │ Hours      │ AWS Dec     │
│ failure           │ modify ops │            │ 2021        │
├───────────────────┼────────────┼────────────┼─────────────┤
│ Data plane failure│ Active     │ Minutes-   │ S3 2017     │
│                   │ traffic    │ hours      │             │
├───────────────────┼────────────┼────────────┼─────────────┤
│ Human error       │ Variable   │ Variable   │ S3 2017,    │
│                   │            │            │ FB 2021     │
├───────────────────┼────────────┼────────────┼─────────────┤
│ Cascading failure │ Multi-     │ Hours      │ EBS 2011    │
│                   │ service    │            │             │
├───────────────────┼────────────┼────────────┼─────────────┤
│ CDN/Edge failure  │ Global     │ <2 hours   │ Fastly 2021 │
│                   │            │            │             │
├───────────────────┼────────────┼────────────┼─────────────┤
│ DNS failure       │ Global     │ Hours      │ FB 2021     │
│                   │            │            │             │
└───────────────────┴────────────┴────────────┴─────────────┘

AWS us-east-1 EBS Cascade — April 2011

What Happened

AWS was performing a routine network change in one us-east-1 Availability Zone to upgrade capacity on the primary EBS network. The traffic shift was executed incorrectly: traffic was moved onto a lower-capacity redundant network that could not carry the load. A large number of EBS nodes simultaneously lost connectivity to the nodes holding their replicas.

EBS is designed to re-mirror data when a node loses contact with its replica. When connectivity was restored, thousands of EBS volumes simultaneously began searching for free space on which to re-establish mirroring. Free capacity in the cluster ran out quickly, and the EBS control plane, which coordinates re-mirroring, became overwhelmed by the simultaneous flood of requests from thousands of volumes at once.

The control plane queue filled. Re-mirroring could not complete. The volumes entered a "stuck" state — neither fully available nor fully failed, but consuming control plane capacity. As the control plane became saturated, even healthy volumes that subsequently needed any coordination were unable to get responses.

Initial state: network maintenance begins
       │
       ▼
Large cluster of EBS volumes lose primary connectivity
       │
       ▼
All affected volumes simultaneously initiate re-mirror protocol
       │
       ▼
Re-mirror requests flood EBS control plane
       │
       ▼
Control plane queue saturates (positive feedback loop begins)
       │
       ├──→ New re-mirror requests queue behind existing ones
       │         │
       │         └──→ Timeouts increase → retry storms → more load
       │
       └──→ Healthy volume operations also queue → degrade

This is a classic thundering herd problem combined with a retry storm. The initial failure triggered a response mechanism that was not designed for simultaneous mass activation.
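The feedback loop in the diagram can be illustrated with a toy discrete-time model. It is a sketch under strong assumptions, not a model of EBS itself: a single queue served at a fixed rate, clients that resend a request each time the estimated wait crosses another multiple of their timeout, and stale requests that stay in the queue. All parameter values are hypothetical.

```python
def simulate_queue(capacity: int, steady_arrivals: int, burst: int,
                   timeout_ticks: int, max_retries: int, ticks: int) -> list[int]:
    """Return the queue length after each tick.

    A one-time burst arrives at tick 0. Once the estimated wait exceeds the
    client timeout, every arriving request is also retried, so offered load
    rises exactly when the server is least able to absorb it.
    """
    queue = 0
    history = []
    for tick in range(ticks):
        wait = queue / capacity                       # ticks a new request would wait
        retries = min(max_retries, int(wait // timeout_ticks))
        offered = steady_arrivals * (1 + retries)     # originals plus duplicates
        if tick == 0:
            offered += burst
        queue = max(0, queue + offered - capacity)
        history.append(queue)
    return history


if __name__ == "__main__":
    common = dict(capacity=100, steady_arrivals=60, timeout_ticks=2,
                  max_retries=3, ticks=10)
    print(simulate_queue(burst=150, **common))  # drains back to zero
    print(simulate_queue(burst=500, **common))  # never recovers
```

With steady load at 60% of capacity, the small burst drains, but the large burst pushes the wait past the timeout and the queue grows without bound even though the original trigger is long gone.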

The Multi-AZ Surprise

The most damaging revelation was that many customers believed they had achieved multi-AZ redundancy for their databases (deploying MySQL primary in us-east-1a, replica in us-east-1b), yet both their primary and replica databases were affected.

The reason: EBS volumes themselves live in a single AZ, but at the time the EBS control plane served the whole region. When stuck volumes in one AZ tied up the control plane, API calls to create, attach, and snapshot volumes slowed or failed in the other AZs too. AWS's summary also reports that a small fraction of RDS Multi-AZ databases failed to fail over because of a previously unseen bug. Customers had placed instances in separate AZs and assumed the whole stack was independent; the shared regional control plane was a dependency they could not see.

The lesson is profound: AZ isolation at the application deployment layer does not guarantee AZ isolation at every underlying infrastructure layer. The EC2 instance is in AZ-a and its replica in AZ-b, but both may rely on a management layer that spans AZ boundaries internally.

Duration and Impact

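Recovery took far longer than the trigger suggests: according to AWS's summary, EBS volumes in the affected AZ remained degraded for days, not hours. Some RDS deployments, including a small share of Multi-AZ ones, were affected. Reddit, Foursquare, Quora, and other major services experienced outages. Netflix came through largely unaffected because it had already designed for the loss of an AZ, and its investment in Chaos Engineering and the "Simian Army" accelerated significantly following this incident.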

AWS S3 us-east-1 — February 28, 2017

What Happened

At 9:37 AM PST, an AWS engineer executing a runbook for an S3 billing system debugging task entered an incorrect command. The command was intended to remove a small number of servers from the S3 index subsystem. Instead, a larger subset of index servers was specified — a typo in the command argument.

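S3's internal architecture uses an index subsystem to maintain metadata and location information for every object in the region, and a placement subsystem to allocate storage for new objects. The removed servers supported both. With a large fraction of that capacity suddenly gone, both subsystems required a full restart, and while they restarted S3 could not serve GET, LIST, PUT, or DELETE requests: reads returned errors, and existing objects appeared to not exist.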

AWS attempted to restart the removed S3 subsystems, but had not done so at this scale in years. The restart process itself took longer than expected: the index servers needed to verify their state against the overall cluster state before coming back online. This process, designed for safety, required several hours at this scale.

9:37 AM: Runbook command executed with incorrect arguments
         → Too many S3 index servers removed
                │
                ▼
         S3 read operations begin failing (cannot locate objects)
         S3 PUT operations also fail (index and placement subsystems both impaired)
         S3 LIST operations fail
                │
                ▼
         Engineers begin restart sequence for index subsystem
                │
                ▼
         Restart verification takes unexpected hours
         (system had not been restarted at this scale in years)
                │
                ▼
         1:54 PM PST (4:54 PM EST): Full recovery (~4 hours 17 minutes duration)

The Cascade Effect

S3's failure caused cascading failures across AWS because S3 is a foundational dependency of AWS's own infrastructure:

AWS Console: the management console used S3 to serve static assets. The console became unavailable while S3 was degraded.
AWS Service Health Dashboard: the dashboard itself used S3 and could not update to show the S3 outage in real time (meta-irony).
Lambda: Lambda function code packages stored in S3 could not be retrieved for cold starts.
CloudFormation, Elastic Beanstalk, CodeDeploy: all use S3 for artifact storage.
Third-party services: thousands of companies use S3 as their primary storage tier. Any service that made S3 calls on the critical path was affected.

Lessons Learned

Cell-based architecture: following this incident, AWS accelerated the adoption of shuffle sharding and cell-based architectures for S3 and other services. In a cell-based architecture, failures are bounded to a cell (a shard of the overall system). A runbook error affecting cell 3 of 200 cells impacts 0.5% of capacity, not 100%.

Tooling that limits blast radius: the erroneous command removed too many servers because the tooling allowed it. Modern runbook tooling should include guardrails: "this operation would affect 40% of index servers — this exceeds the 10% safety threshold. Confirm override?"
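A guardrail of this kind can be sketched as a pre-flight check that runs before any capacity-removal command. The names (check_removal, BlastRadiusExceeded) and the 10% default are hypothetical, chosen to match the example message above; they do not describe AWS's actual tooling.

```python
class BlastRadiusExceeded(Exception):
    """Raised when an operation would touch more of the fleet than allowed."""


def check_removal(targets: set[str], fleet: set[str],
                  max_fraction: float = 0.10, override: bool = False) -> float:
    """Return the fraction of the fleet affected, or refuse the operation.

    `override` models an explicit, separately approved confirmation step.
    """
    unknown = targets - fleet
    if unknown:
        raise ValueError(f"targets not in fleet: {sorted(unknown)}")
    fraction = len(targets) / len(fleet)
    if fraction > max_fraction and not override:
        raise BlastRadiusExceeded(
            f"operation would affect {fraction:.0%} of servers; "
            f"this exceeds the {max_fraction:.0%} safety threshold"
        )
    return fraction


if __name__ == "__main__":
    fleet = {f"index-{i}" for i in range(100)}
    print(check_removal({f"index-{i}" for i in range(5)}, fleet))
    try:
        check_removal({f"index-{i}" for i in range(40)}, fleet)
    except BlastRadiusExceeded as err:
        print("refused:", err)
```

The check is deliberately placed in the tool rather than the runbook, so a mistyped argument fails closed instead of relying on the operator to notice.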

Monitoring that doesn't depend on itself: S3's own status dashboard depended on S3. When designing monitoring, ensure the monitoring system does not depend on the system being monitored. Use out-of-band communication channels (PagerDuty, separate cloud provider for status pages).

Fastly Global Outage — June 8, 2021

What Happened

At 09:47 UTC, approximately 85% of Fastly's CDN network began returning errors. Major customers including the New York Times, BBC, Reddit, GitHub, Stack Overflow, Amazon, Twitch, and The Guardian became unavailable. Most of the network recovered within about an hour; Fastly reported 95% of its network operating normally within 49 minutes.

The root cause was a dormant software bug in a specific version of Varnish (Fastly's edge cache software), triggered by a configuration change made by a single customer in the Fastly UI.

In May 2021, Fastly had deployed a software update that introduced a latent bug. The bug remained inactive — no customer configuration in production triggered the specific code path. On June 8, a customer updated their service configuration, which triggered the bug's code path. Because configuration changes in a CDN propagate globally (to serve consistent behavior across all edge nodes), the bug manifested simultaneously across 85% of Fastly's PoPs worldwide.

May 2021: Software update deployed with latent bug
    │
    │ (bug dormant — no customer config triggers it)
    │
June 8, 09:47 UTC: Customer updates configuration
    │
    ▼
Configuration change propagates globally (within seconds)
    │
    ▼
Bug triggers: Varnish processes crash on affected PoPs
    │
    ▼
85% of Fastly's edge nodes become unavailable simultaneously
    │
    ▼
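10:27 UTC: Fastly engineers identify the triggering customer configuration (40 minutes)
    │
    ▼
Mitigation deployed: disable the customer's configuration change
    │
    ▼
10:36 UTC: Affected services begin to recover (95% of network normal, 49 minutes in)
    │
    ▼
11:00 UTC: Majority of services recovered (incident declared mitigated at 12:35 UTC)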

Why Simultaneous?

CDN configuration changes are designed to be globally consistent: you push a config, it applies everywhere. This property, which is a feature (consistent caching behavior globally), becomes a catastrophic risk when the config change triggers a bug. The blast radius was about 85% of the network, rather than a single PoP, because the design assumption was "configs cannot cause crashes."

This is an architectural coupling problem: the deployment plane (config changes) was not isolated from the data plane (serving traffic). Modern CDN architectures use staged rollouts for configuration changes (10% → 50% → 100% with automatic rollback on error rate increase).
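A staged rollout with automatic rollback can be sketched as follows. This is a minimal illustration, not Fastly's mechanism: apply, revert, and error_rate are hypothetical callbacks supplied by the caller, and the 1% error threshold is an assumption.

```python
from typing import Callable, Sequence


def staged_rollout(nodes: Sequence[str],
                   apply: Callable[[str], None],
                   revert: Callable[[str], None],
                   error_rate: Callable[[Sequence[str]], float],
                   stages_pct: Sequence[int] = (10, 50, 100),
                   max_error_rate: float = 0.01) -> bool:
    """Apply a change in stages; revert everything if errors rise.

    Returns True if the change reached every node, False if it was rolled back.
    """
    done: list[str] = []
    for pct in stages_pct:
        # Ceiling of len(nodes) * pct / 100, and at least one node per stage.
        target = max(1, (len(nodes) * pct + 99) // 100)
        for node in nodes[len(done):target]:
            apply(node)
            done.append(node)
        # Health gate: measure only the nodes that already carry the change.
        if error_rate(done) > max_error_rate:
            for node in reversed(done):
                revert(node)
            return False
    return True


if __name__ == "__main__":
    pops = [f"pop-{i}" for i in range(20)]
    active: set[str] = set()
    # A config that crashes every node it reaches is stopped after 2 of 20 PoPs.
    print(staged_rollout(pops, active.add, active.discard, lambda done: 1.0))
    print(sorted(active))
```

In practice the health gate would also wait for a soak period before reading the error rate; that delay is omitted here to keep the sketch short.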

Facebook Global Outage — October 4, 2021

What Happened

At 15:39 UTC, Facebook, Instagram, WhatsApp, and Oculus went offline globally for approximately 6 hours. The outage was caused by a backbone configuration error that led to BGP route withdrawals and, as a consequence, a global DNS failure.

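Facebook operates its own Autonomous System (AS32934). Their data center network is connected to the internet via BGP peering at multiple exchange points. During routine maintenance on the backbone network, a command intended to assess backbone capacity instead took down the backbone connections, and an audit tool that should have blocked the command failed because of a bug. Facebook's DNS servers are designed to withdraw their own BGP announcements when they cannot reach the data centers, so the BGP announcements for the prefixes covering Facebook's authoritative DNS servers were withdrawn from the global routing table.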

Maintenance command sent to backbone routers
    │
    ▼
BGP routes for Facebook's prefixes (157.240.0.0/17, etc.)
withdrawn from global routing table
    │
    ├──→ Facebook's own services cannot reach each other
    │    (internal services also used BGP for inter-DC routing)
    │
    └──→ External internet cannot route to Facebook's servers

DNS Cascade:
    │
    ▼
Facebook's authoritative DNS servers are inside Facebook's network
    │
    ▼
After BGP withdrawal, DNS resolvers worldwide cannot reach
Facebook's authoritative DNS servers
    │
    ▼
DNS queries for facebook.com, instagram.com, whatsapp.com
all fail with SERVFAIL
    │
    ▼
TTL cache entries expire across the internet (~5 minutes)
    │
    ▼
DNS failure layer added on top of routing failure

The Physical Access Problem

Facebook's internal tools for managing network infrastructure were also hosted inside Facebook's network. When BGP went down, the tools to fix BGP were also unreachable. Engineers could not log into the systems remotely to push configuration fixes.

Facebook's on-call engineers had to physically travel to data center facilities to manually override systems. Some engineers encountered delays because their physical access badges use an authentication system that itself depends on internal network connectivity — staff were locked out of their own data centers.

This illustrates a critical out-of-band access requirement: emergency remediation systems (and physical access control) must be independent of the production network being fixed.

DNS Retry Amplification

During the outage, DNS resolvers worldwide were receiving SERVFAIL responses (or timeouts) for facebook.com and retrying aggressively, and applications and users retried on top of that. At the scale of the global DNS system, the retry storm generated a significant DDoS-like load on DNS resolver infrastructure worldwide. DNSSEC does not appear to have played a part: facebook.com was not DNSSEC-signed at the time, and signature validation only adds work when a signed answer actually arrives. The amplification came from retries against unreachable authoritative servers.

Cloudflare, which operates the 1.1.1.1 resolver, reported resolvers handling roughly 30x the usual query volume during the outage, as clients and resolvers worldwide repeatedly retried the failing lookups.

AWS November 25, 2020 — Kinesis and Cascading Dependencies

What Happened

At 5:15 AM PST, AWS Kinesis began experiencing errors in the US-EAST-1 region. Kinesis is AWS's managed streaming service, but more critically, it is a foundational internal dependency for AWS's own monitoring and observability infrastructure. This event is distinct from the US-EAST-1 network congestion event of December 7, 2021, which appears in the taxonomy above.

The trigger was a routine addition of capacity to the Kinesis front-end fleet. Each front-end server keeps one operating-system thread per peer server in the fleet, so adding servers pushed every server's thread count above the maximum configured for the operating system. Once the limit was exceeded, front-end servers could not build their view of the fleet and failed to process requests.
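AWS's summary of the Kinesis event describes each front-end server holding one operating-system thread per peer in the fleet, so per-server thread count grows with fleet size. Under that assumption, a pre-flight check for capacity changes might look like the sketch below. The function names, the baseline thread count, and the 20% headroom are hypothetical.

```python
def threads_per_server(fleet_size: int, baseline_threads: int) -> int:
    """One thread per peer, plus threads the server needs for everything else."""
    return baseline_threads + (fleet_size - 1)


def max_safe_fleet_size(os_thread_limit: int, baseline_threads: int,
                        headroom_pct: int = 20) -> int:
    """Largest fleet that keeps every server under the limit, with headroom."""
    budget = os_thread_limit * (100 - headroom_pct) // 100
    return max(0, budget - baseline_threads + 1)


def capacity_change_is_safe(current_size: int, added: int, os_thread_limit: int,
                            baseline_threads: int, headroom_pct: int = 20) -> bool:
    """Check a proposed capacity addition before it is deployed."""
    limit = max_safe_fleet_size(os_thread_limit, baseline_threads, headroom_pct)
    return current_size + added <= limit


if __name__ == "__main__":
    print(threads_per_server(fleet_size=1000, baseline_threads=500))
    print(capacity_change_is_safe(7400, 200, os_thread_limit=10_000,
                                  baseline_threads=500))
```

The point of the check is that the limit is a property of the whole fleet: adding one server raises the thread count on every existing server, so the failure appears fleet-wide rather than on the new hosts alone.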

Kinesis failure cascaded into AWS-internal observability systems that relied on Kinesis to propagate metrics and alarms. CloudWatch Alarms could not process new data. Automated scaling and remediation systems that relied on CloudWatch Alarms stopped functioning.

The Monitoring Paradox

The cascading failure revealed a structural weakness: AWS's monitoring systems depended on the services they were monitoring. When Kinesis went down, the automated systems that would normally detect and remediate service degradations were themselves degraded.

Kinesis fails (thread count exceeds OS limit)
    │
    ▼
CloudWatch Metrics ingestion uses Kinesis internally
    │
    ▼
CloudWatch Alarms cannot evaluate (no new metric data)
    │
    ▼
Auto Scaling policies based on CloudWatch Alarms stop working
    │
    ▼
Other services' automated remediations dependent on alarms are blind
    │
    ▼
Cognito, EventBridge, Lambda, Auto Scaling, AppSync all degrade
    │
    ▼
Impact ripples to end customers of those services

The dependency chain was opaque — many service teams did not know their service's health checking was routed through Kinesis. This kind of hidden dependency is one of the most dangerous patterns in distributed systems.
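Once dependencies are declared explicitly, the question "what breaks if this service fails?" becomes a graph traversal. The sketch below uses a small hand-written graph that loosely mirrors the chain in the diagram; the service names and edges are illustrative, not AWS's real dependency data.

```python
from collections import defaultdict, deque


def impacted_by(failed: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: for each dependency, who relies on it?
    dependents: dict[str, set[str]] = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(service)

    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in impacted and service != failed:
                impacted.add(service)
                queue.append(service)
    return impacted


if __name__ == "__main__":
    graph = {
        "cloudwatch-metrics": {"kinesis"},
        "cloudwatch-alarms": {"cloudwatch-metrics"},
        "auto-scaling": {"cloudwatch-alarms"},
        "cognito": {"kinesis"},
        "s3": set(),
    }
    print(sorted(impacted_by("kinesis", graph)))
```

A traversal like this only finds dependencies that someone has recorded, which is why the hidden ones described above are so dangerous.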

Architectural Lessons

Cell-Based Architecture

Divide the system into small, independent cells. Each cell serves a subset of users. A failure (including operator error) in one cell affects only that cell's users. AWS's S3 now uses shuffle sharding for this purpose — each customer's requests are handled by a subset of S3 nodes, not the entire fleet.

Without cells:                  With cells (6 cells, 1 fails):
┌──────────────┐                ┌────┐┌────┐┌────┐┌────┐┌────┐┌────┐
│              │                │ C1 ││ C2 ││ C3 ││ C4 ││ C5 ││ C6 │
│  All traffic │   failure →    │    ││    ││ !! ││    ││    ││    │
│  all nodes   │   100% impact  │17% ││17% ││17% ││17% ││17% ││17% │
└──────────────┘                └────┘└────┘└────┘└────┘└────┘└────┘
                                             ↑ only this cell fails
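Shuffle sharding takes the cell idea one step further: each customer is assigned a small, deterministic subset of nodes, so two customers rarely share exactly the same set. The sketch below picks the subset by ranking nodes on a per-customer hash (a rendezvous-style scheme). That selection method is an assumption made for illustration, not a description of how S3 assigns shards.

```python
import hashlib
import math


def shard_for(customer_id: str, nodes: list[str], shard_size: int) -> list[str]:
    """Deterministically choose `shard_size` nodes for a customer."""
    ranked = sorted(
        nodes,
        key=lambda node: hashlib.sha256(f"{customer_id}:{node}".encode()).digest(),
    )
    return sorted(ranked[:shard_size])


def distinct_shards(node_count: int, shard_size: int) -> int:
    """Number of possible shards: node_count choose shard_size."""
    return math.comb(node_count, shard_size)


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(8)]
    print(shard_for("customer-a", nodes, 2))
    print(shard_for("customer-b", nodes, 2))
    # 8 nodes in shards of 2 give 28 combinations, so a request pattern that
    # takes down one customer's shard fully overlaps about 1 customer in 28.
    print(distinct_shards(8, 2))
```

Customers whose shard only partly overlaps the failed one still have a healthy node to retry against, which is what keeps the blast radius small.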

Always-On Monitoring Independence

Monitoring systems must not depend on the services they monitor. Run observability infrastructure separately, ideally in a different cloud provider or region. Use synthetic monitoring (canary probes from outside your VPC) that would detect a total outage even if internal monitoring is blind.
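An external canary can be very small. The sketch below uses only the Python standard library and is meant to run from infrastructure that shares nothing with the system being probed. The health-check URLs are placeholders, and the rule "alert when at least two endpoints fail" is an assumed policy.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code, or None if the endpoint is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code          # server answered, but with an error status
    except OSError:
        return None              # DNS failure, refused connection, timeout


def outage_detected(urls: Iterable[str],
                    fetch: Callable[[str], Optional[int]] = http_status,
                    min_failures: int = 2) -> bool:
    """Declare an outage when enough endpoints are unreachable or return 5xx."""
    failures = 0
    for url in urls:
        status = fetch(url)
        if status is None or status >= 500:
            failures += 1
    return failures >= min_failures


if __name__ == "__main__":
    # Placeholder endpoints; replace with real health-check URLs.
    endpoints = ["https://app.example.com/health", "https://api.example.com/health"]
    print(outage_detected(endpoints))
```

Requiring more than one failure reduces false alarms from a single flaky endpoint; the alert itself should be delivered through a channel that does not run on the monitored system.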

Out-of-Band Management Channels

Maintain break-glass access paths: serial consoles for physical servers, out-of-band network management interfaces, VPN access from networks independent of the primary production network. For SaaS/cloud: ensure your runbook execution environment, on-call tooling, and status page all run outside the system they support.

Blast Radius Reduction for Changes

Every change (code deployment, config change, runbook command) should have a bounded blast radius:
- Staged rollouts (1% → 10% → 100% with automated rollback)
- Change validation guards ("this would affect 40% of fleet. Confirm?")
- Configuration change canaries (validate behavior on 1% of edge nodes before global propagation)
- Automated rollback triggers (error rate increase → automatic revert)

Correlated AZ Failure Avoidance

Multi-AZ does not automatically mean independent failures. As demonstrated in the 2011 EBS incident, backend infrastructure layers may have dependencies that cross AZ boundaries. Explicitly verify with your cloud provider that each layer of your stack is independently isolated across AZs. Use AWS AZ-aware deployment checks and test multi-AZ resilience with chaos engineering.

Debugging Notes

During a suspected regional cloud failure, always check the provider's service health dashboard AND independent monitoring services (downdetector.com, isitdownrightnow.com, Cloudflare Radar). Provider dashboards are sometimes delayed, as in the S3 2017 incident.
For cascading failures, draw the dependency graph before debugging. Identify which service's failure initiated the cascade vs which services are secondary victims.
BGP routing issues: use BGP monitoring services (BGPView, RIPEstat, RouteViews) to observe prefix reachability from external vantage points.
When DNS fails, test authoritative DNS (dig @authoritative-ns record) and recursive resolution (dig @8.8.8.8 record) separately to isolate whether the authoritative servers themselves are unreachable or the problem lies with the recursive resolver and its cache.

Security Implications

Human error (fat-finger, incorrect runbook argument) caused two of the five major incidents here. Operational security includes process controls (four-eyes principle for dangerous operations, change approval workflows, test in staging first).
BGP hijacking (deliberate route injection) is a real attack vector. RPKI (Resource Public Key Infrastructure) partially mitigates this by cryptographically binding IP prefixes to ASNs, but adoption is incomplete.
DNSSEC provides authenticity for DNS responses but does not improve availability. It would not have helped during the Facebook outage, where the authoritative servers were simply unreachable and resolvers could only retry.
Physical access control systems that depend on the production network are a security and availability risk. Hardware tokens, offline authentication, and break-glass physical access procedures are required.

Performance Implications

Retry storms are a leading cause of cascading failures. Implement exponential backoff with jitter (not linear or fixed-interval retry). Set retry budget limits. Use circuit breakers to stop retrying failing endpoints immediately rather than continuing to load them.
The time-to-recovery for the 2017 S3 incident was dominated by restart verification time — a process that had not been exercised at scale. Regularly practice large-scale restarts in non-production environments (game days, disaster recovery exercises).
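Exponential backoff with jitter can be sketched as below. This uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap. The parameter defaults are illustrative, and the sleep and rng arguments exist only so the behavior can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(op: Callable[[], T], attempts: int = 5,
                      base: float = 0.1, cap: float = 10.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Call `op`, retrying on any exception with full-jitter backoff.

    The delay before retry n is uniform in [0, min(cap, base * 2**n)].
    After `attempts` failed calls, the last exception is re-raised, which
    acts as a simple per-call retry budget.
    """
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(rng() * min(cap, base * 2 ** attempt))
    raise RuntimeError("attempts must be at least 1")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("downstream unavailable")
        return "ok"

    print(call_with_retries(flaky))
```

The jitter matters more than the exponent: without it, clients that failed together retry together, and the downstream service sees the same synchronized spike at each interval.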

Failure Modes

Correlated dependencies: your redundant systems secretly depend on the same underlying service. Map all critical dependencies explicitly. Use fault injection and chaos testing to verify independence.
Recovery mechanism as amplifier: re-mirror storms (EBS 2011), retry storms, auto-scaling storms. Recovery mechanisms designed for isolated failures become destabilizing under correlated failures. Rate-limit all recovery operations.
Monitoring blindness: alerting systems that depend on the systems they monitor. Build observability infrastructure with independence as a first-class requirement.
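Rate-limiting recovery operations, as recommended above for recovery mechanisms that act as amplifiers, can be done with a token bucket placed in front of the recovery path. The sketch below is a generic single-threaded token bucket, not a description of how EBS throttles re-mirroring; the rate and burst values are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Admit at most `capacity` operations in a burst and `rate` per second after."""

    def __init__(self, rate: float, capacity: float,
                 now: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._now = now
        self._tokens = capacity
        self._last = now()

    def try_acquire(self) -> bool:
        """Return True if the operation may proceed now, False if it must wait."""
        current = self._now()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (current - self._last) * self.rate)
        self._last = current
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, capacity=5)
    # 1000 volumes ask to re-mirror at once; only the burst allowance proceeds.
    admitted = sum(bucket.try_acquire() for _ in range(1000))
    print(admitted)
```

Callers that are refused should back off with jitter rather than poll in a tight loop, otherwise the limiter itself becomes the target of a retry storm.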

Modern Usage

Chaos engineering (Netflix Chaos Monkey, AWS Fault Injection Simulator, Gremlin) is now standard practice at resilience-focused organizations. Systematically injecting failures in production (during business hours, with safeguards) validates that architecture resilience claims are accurate. Discovering that your multi-AZ architecture has a hidden single point of failure during a game day is far better than discovering it during a real incident.

AWS has invested heavily in post-incident architecture changes: S3 shuffle sharding, independent control and data plane designs, and Kinesis architectural separation from core observability are all direct results of the incidents described here.

Future Directions

Formal verification of operational runbooks (mathematical proof that a command cannot affect more than N% of infrastructure before execution)
Automated dependency graph generation: services declare dependencies in a registry, and infrastructure teams can query "what breaks if service X fails?"
AI-assisted incident response: LLMs trained on incident runbooks and system logs to suggest remediation steps during active incidents (AWS is already experimenting with this via the Q Developer incident response assistant)

Exercises

Build a dependency map for a 3-tier application (web → app → database). Identify hidden dependencies (DNS resolver, NTP, certificate revocation endpoints). For each dependency, calculate the blast radius if it fails and define a mitigation.
Implement exponential backoff with jitter in a language of your choice. Test it under load: simulate a downstream service failure and measure the retry load generated by 1000 concurrent clients with and without jitter.
Design a chaos engineering experiment to validate that your application handles an AZ failure correctly. What assertions would you make? What automatic rollback conditions would you set?
Read all five post-mortems referenced below. For each incident, write one sentence describing the failure mode and one sentence describing the architectural change that would prevent it.
Implement a circuit breaker pattern using a state machine (Closed → Open → Half-Open). Test it with simulated failure injection and verify that it stops sending requests during the Open state and recovers correctly.

References

AWS Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region (April 2011)
AWS Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region (March 2017)
Fastly Incident Summary for June 8, 2021 (https://www.fastly.com/blog/summary-of-june-8-fastly-network-incident)
Facebook Engineering: More details about the October 4 outage (https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/)
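AWS Summary of the Amazon Kinesis Event in the Northern Virginia (US-EAST-1) Region (November 2020)
AWS Detailed Event Summary: December 7, 2021 (https://aws.amazon.com/message/12721/)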
Colm MacCárthaigh: "Reliability and Availability" (re:Invent 2019, ARC303)
Netflix Technology Blog: "5 Lessons We've Learned Using AWS" (2010)
Amazon Builders' Library: "Avoiding insurmountable queue backlogs" and "Implementing health checks"