06 — Cloud Outage Post-Mortems

Technical Overview

Cloud platform outages are the modern equivalent of the datacenter fires of the mainframe era — except that a single failure domain now covers millions of customers simultaneously. The post-mortems in this document cover six major cloud outages spanning AWS, Azure, Google Cloud, Fastly, and Facebook from 2011 to 2021. Each is examined with 5-10 paragraphs of technical depth: the timeline, root cause chain, detection, blast radius, recovery, and the architectural changes that followed. Collectively, they tell the story of how cloud platforms learned to build for failure at scale.

Prerequisites

  • Cloud storage fundamentals: EBS, S3, object storage replication
  • BGP and DNS (see 04-network-and-distributed-system-failures.md)
  • Certificate lifecycle management and PKI
  • Traffic engineering: load balancing, anycast, edge infrastructure
  • Cell-based architecture and blast radius limitation
  • AWS service dependency map: Kinesis, CloudWatch, EC2 metadata service
  • Varnish and CDN configuration management

Historical Context

The first decade of public cloud (2006-2016) was a period of discovering failure modes at scale that had never existed before. Single AZ failures were expected; the discovery that multi-AZ configurations shared failure domains was not. Certificate expiry had been a known risk; that it would cascade globally via shared infrastructure was not anticipated at scale. The Facebook 2021 outage showed that "internal infrastructure depending on Internet-facing routing" was still a failure mode that a $1T company had not closed.

Each of these post-mortems contains the actual lessons that shaped modern cloud architecture. They are required reading for anyone building or operating at cloud scale.


Post-Mortem 1: AWS us-east-1 April 2011 — EBS Cascade Failure

Timeline

  • April 21, 2011, 00:47 PDT: AWS engineers begin a network configuration change to upgrade connectivity between the us-east-1 backbone and a redundant network. A miscommunication causes traffic to shift to a lower-capacity backup path instead of the intended higher-capacity path.

  • 00:47-01:00: The lower-capacity path becomes congested. EBS nodes within a single Availability Zone (us-east-1a) lose connectivity to their replica nodes in other network segments of the same AZ.

  • 01:00-02:00: Hundreds of EBS primary storage nodes, having lost contact with their replicas, enter a "re-mirroring" state simultaneously. Each node begins searching the cluster for available storage capacity to create a new replica, then begins copying its full volume to the new replica location.

  • 02:00-05:00: The re-mirroring storm overwhelms the EBS cluster's internal network. The EBS control plane — responsible for accepting new volume creation requests, tracking volume state, and coordinating replica assignments — begins experiencing severe latency and then stops responding to requests.

  • 05:00 onwards: EBS volumes fall into two categories: (1) already-stuck volumes that cannot complete re-mirroring due to control plane unavailability, and (2) new requests for volumes that fail immediately. The stuck volumes represent tens of thousands of customer EBS volumes in a limbo state: primary data intact, replica synchronization stuck, control plane unable to process state transitions.

  • April 21 - April 29: Manual recovery. AWS engineers develop and execute a process to individually verify stuck volumes, cancel failed re-mirror operations, and schedule fresh replications. With 10,000+ affected volumes, this takes 8 days to complete at full pace.

Technical Root Cause Chain

The cascade had three distinct phases, each making the next worse:

Phase 1 — Trigger: A single network configuration change moved traffic to a backup path. This was a routine, apparently low-risk change. The backup path had insufficient capacity for the full AZ's traffic, creating congestion.

Phase 2 — Resonance: EBS's replication design at the time assumed that if a primary node lost contact with its replica, the primary should find a new replica immediately. This was correct behavior for a single-node failure. With hundreds of nodes triggering simultaneously due to the same network event, the behavior became a "re-mirroring storm." The design did not include rate limiting or coordination for this scenario — each node acted independently and simultaneously.

Phase 3 — Collapse: The EBS control plane was a shared service for the entire us-east-1 EBS fleet. The control plane had not been designed to handle simultaneous state-transition requests from hundreds of nodes in a re-mirroring state. At capacity, control plane requests timed out. This created a deadlock: nodes needed the control plane to complete re-mirroring, but the control plane was overloaded by re-mirroring requests.

Detection

AWS's monitoring detected the network anomaly almost immediately. The EBS control plane overload was detected within minutes. However, the scope of the cascade — specifically how many volumes would enter a stuck state — was not fully apparent for several hours. External customer detection (via error rates on EC2 instances and RDS) was widespread within 30 minutes.

Blast Radius

The outage affected the us-east-1 region. Approximately 10-14% of all EBS volumes in us-east-1 were affected (AWS did not publish exact numbers). Customers included: Reddit (down for hours), Quora (degraded), Foursquare (down), and many others. The outage was severe enough that major technology media covered it as front-page news.

The multi-AZ surprise: Many customers had deployed Multi-AZ architectures (RDS Multi-AZ, EC2 in multiple AZs). These customers found their multi-AZ configurations also failed. The reason: while EC2 and RDS were in different AZs, the underlying EBS infrastructure — the cluster management and control plane — was shared across AZs within the region. When the control plane failed, it affected volumes in all AZs.

This discovery was perhaps the most important learning from the outage: multi-AZ redundancy at the service layer did not guarantee isolation from region-level infrastructure failures.

Recovery

AWS engineers developed a pipeline for recovering stuck volumes:

  1. Verify primary data integrity (checksums on volume content)
  2. Cancel the failed re-mirror operation
  3. Schedule a fresh replica creation from verified primary data
  4. Monitor replica sync to completion

Steps 1-3 were relatively fast per volume; the bottleneck was the sheer number of volumes. AWS ran this process in parallel across engineering teams, recovering approximately 1,000-2,000 volumes per day.
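
The per-volume steps above can be sketched as a small state machine. This is a minimal illustration, not AWS's tooling: the `Volume` fields, the `State` names, and the use of SHA-256 as the integrity check are all assumptions made for the example. The `days_to_drain` helper only does the throughput arithmetic from the paragraph above.

```python
import hashlib
import math
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    STUCK = "stuck"                  # re-mirror never completed
    QUARANTINED = "quarantined"      # integrity check failed, needs a human
    REPLICA_SCHEDULED = "scheduled"  # steps 1-3 done, waiting on sync (step 4)


@dataclass
class Volume:
    # Hypothetical record; real EBS metadata is not public.
    volume_id: str
    primary_data: bytes
    expected_sha256: str
    state: State = State.STUCK


def recover(volume: Volume) -> State:
    """Run steps 1-3 for one stuck volume; step 4 (sync monitoring) comes later."""
    # Step 1: verify primary data before changing any replication state.
    actual = hashlib.sha256(volume.primary_data).hexdigest()
    if actual != volume.expected_sha256:
        volume.state = State.QUARANTINED
        return volume.state
    # Step 2 (cancel the failed re-mirror) and step 3 (schedule a fresh
    # replica) would call the control plane here; the sketch only records
    # the resulting state.
    volume.state = State.REPLICA_SCHEDULED
    return volume.state


def days_to_drain(backlog: int, volumes_per_day: int) -> int:
    """Whole days needed to clear a backlog at a fixed daily rate."""
    return math.ceil(backlog / volumes_per_day)
```

At the quoted rate of 1,000-2,000 volumes per day, a backlog of 10,000 volumes drains in 5 to 10 days, which brackets the 8 days in the timeline.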

Architectural Lessons and Changes

  1. AZ-isolated control planes: Post-2011, AWS redesigned EBS to use separate control plane instances per AZ, preventing a region-level control plane failure.

  2. Re-mirroring rate limiting: EBS added global rate limiting on re-mirroring operations. No matter how many nodes simultaneously enter re-mirroring state, the total re-mirroring bandwidth is capped to prevent network saturation.

  3. Multi-AZ means fully independent failure domains: AWS redesigned the "multi-AZ" promise for EBS and RDS to ensure that AZ-level failures truly cannot cascade. Physical network separation, separate power feeds, and separate control planes per AZ were verified.

  4. Customer architecture guidance: AWS produced explicit guidance (the "Multi-AZ doesn't mean you're protected from region-level failures" warning) and recommended cross-region replication for true resilience.

  5. Automated volume recovery tooling: The 8-day manual recovery built the case for and drove investment in automated storage recovery pipelines.
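
The global re-mirroring cap in lesson 2 can be illustrated with a simple admission controller. This is a sketch under stated assumptions: the class name, the bandwidth figures, and the idea of a single coordinator are hypothetical and are not a description of how EBS implements its limit.

```python
from collections import deque
from typing import Optional


class RemirrorCoordinator:
    """Admits re-mirror jobs so their combined bandwidth stays under a global cap."""

    def __init__(self, cap_mbps: int, per_job_mbps: int):
        # Assumes every job uses the same bandwidth, which keeps the math simple.
        self.max_active = cap_mbps // per_job_mbps
        self.active = set()
        self.waiting = deque()

    def request(self, node_id: str) -> bool:
        """Return True if the node may start copying now, False if it was queued."""
        if len(self.active) < self.max_active:
            self.active.add(node_id)
            return True
        self.waiting.append(node_id)
        return False

    def finish(self, node_id: str) -> Optional[str]:
        """Mark a job done and admit the oldest waiting node, if there is one."""
        self.active.discard(node_id)
        if self.waiting and len(self.active) < self.max_active:
            next_node = self.waiting.popleft()
            self.active.add(next_node)
            return next_node
        return None
```

With a 1,000 Mbps cap and 100 Mbps per job, 300 nodes that lose their replicas at once produce 10 running copies and 290 queued ones, rather than 300 simultaneous full-volume copies.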

Reference: AWS post-mortem: aws.amazon.com/message/65648


Post-Mortem 2: Azure October 2012 — Storage Certificate Expiry Cascade

Timeline

  • October 27, 2012: Microsoft Azure experienced a global outage affecting Azure Storage in multiple regions.

  • Root cause (discovered): An X.509 certificate used for internal service-to-service authentication within the Azure storage infrastructure expired. The certificate had a finite validity period that was not being tracked by automated certificate management tooling.

  • Impact timeline: When the certificate expired, services that depended on it for mutual TLS authentication began failing authentication. Because the certificate was used by a foundational storage infrastructure component, the failure cascaded: storage services that depended on the expiring certificate began returning errors. Application services depending on those storage services compounded the outage.

  • Duration: Portions of the outage lasted several hours across multiple Azure regions.

Technical Root Cause Chain

Phase 1 — The expired certificate:

Internal Azure services used X.509 certificates for mutual TLS authentication — a service could only communicate with another if both could verify each other's certificates. This is a sound security model. The failure was operational: the certificate was not enrolled in an automated renewal system. When the certificate reached its validity end date, it became invalid, and authentication between the affected services began failing.

Phase 2 — Cascade through shared infrastructure:

The affected certificate was in a shared storage infrastructure component, not isolated to a single service. Many Azure storage services depended on it. The failure propagated through the dependency graph: services that depended on the expired-certificate service began failing, and services that depended on those began failing further. The cascade was proportional to the depth and breadth of the dependency graph.

Phase 3 — Global scope:

Azure had deployed this infrastructure component globally — it was not isolated to a single region. When the certificate expired, it expired everywhere simultaneously, causing a global (not regional) outage.

Detection

The expiry was not detected proactively. The failure was detected reactively when services began returning authentication errors. Internal monitoring caught the errors immediately, but correlating "authentication failures" to "expired certificate" in the specific internal component took time to trace.

Blast Radius

Azure Storage was affected globally. Applications deployed across multiple Azure regions found their storage access failing simultaneously. The Microsoft Azure team had not yet deployed the "AZ" model that later characterized Azure's regional design — regions were less isolated in 2012 than they became later.

Recovery

Emergency certificate renewal and deployment. Certificate renewal for a widely-deployed certificate involves:

  1. Generating new certificate and key
  2. Distributing new certificate to all service instances
  3. Coordinating restart/reload of services to pick up new certificate
  4. Verifying that authentication succeeds with new certificate

At Azure's scale in 2012, this process took hours. Once the new certificate was deployed globally and services restarted, the cascade unwound.

Architectural Lessons and Changes

  1. Automated certificate lifecycle management: Post-2012, Azure (and the broader cloud industry) invested heavily in automated certificate rotation. Every certificate should have: expiry monitoring with alerts at 90/30/7/1 days before expiry, automated renewal tooling, and automated deployment.

  2. Certificate expiry chaos engineering: Deliberately expiring certificates in controlled tests to verify that the rotation process works and that services handle certificate rotation without downtime (hot reload of certificates without restart).

  3. Dependency scoping: A certificate used by foundational shared infrastructure should be designed to fail gracefully or use separate certificates per region to prevent global cascade.

  4. Azure's investment in certificate management: Azure Key Vault (launched 2015) provides managed certificate lifecycle, with automated renewal notifications and integration with Azure services. This was directly motivated by the 2012 incident and others.

  5. Industry-wide lesson: Certificate expiry incidents have affected Google, AWS, Microsoft, and many others. The lesson is universal: certificates require automation, not operational discipline.
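
The 90/30/7/1-day alerting schedule from lesson 1 reduces to a small pure function. The sketch below assumes the certificate's `notAfter` time has already been parsed into a timezone-aware `datetime`; the function name and the returned strings are illustrative only.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

ALERT_DAYS = (90, 30, 7, 1)  # thresholds taken from the lesson above


def expiry_alert(not_after: datetime, now: datetime) -> Optional[str]:
    """Return the most urgent alert tier that applies, or None if none does."""
    remaining = not_after - now
    if remaining <= timedelta(0):
        return "EXPIRED"
    # Check the tightest threshold first so the most urgent tier wins.
    for days in sorted(ALERT_DAYS):
        if remaining <= timedelta(days=days):
            return f"expires within {days}d"
    return None
```

Running a check like this on a schedule against every certificate in an inventory is the easy part; the harder operational problem, as the incident shows, is making sure every certificate is in the inventory.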


Post-Mortem 3: Google Cloud March 2019 — Network Congestion Cascade

Timeline

  • March 12, 2019, 11:45 PDT: Google begins a routine network configuration change — updating traffic engineering rules for an internal backbone link.

  • 11:45-12:20: The configuration change introduces a routing error. Traffic that should be routed directly between Google data centers is instead routed via longer paths, causing traffic to traverse links not sized for the load.

  • 12:20 onwards: Congested backbone links begin dropping packets. Google's internal services — and customer-facing GCP services — begin experiencing elevated error rates.

  • 12:20-17:00: The congestion affects: Google Compute Engine (VM control plane), Google Kubernetes Engine, Google Cloud Storage, BigQuery, and Stackdriver (now Cloud Monitoring). Gmail and YouTube also experience elevated error rates (they run on Google's internal infrastructure).

  • 17:00: Google engineers identify the root cause and roll back the configuration change.

  • ~21:00: Full recovery after traffic stabilizes.

Technical Root Cause Chain

The configuration change: Google operates one of the largest private networks in the world — a custom internal WAN ("B4" backbone) connecting Google data centers globally. Traffic engineering on B4 uses software-defined routing to allocate bandwidth among traffic classes. A change to B4's traffic engineering configuration reassigned traffic to a different set of backbone paths.

The path selection error: The updated configuration sent a significant volume of production traffic through paths with insufficient capacity. Specifically, traffic for cloud services in us-east1 was redirected through a path that was also serving other high-bandwidth traffic, causing the link to hit capacity.

Congestion under Google's internal architecture: Google's services make extensive use of RPCs (Remote Procedure Calls) between microservices. Under packet loss, RPC timeouts trigger, which triggers retries, which increases network load, which increases packet loss — a congestion collapse analogous to the 1986 TCP collapse but at the internal RPC layer.
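
The retry feedback loop can be made concrete with a little arithmetic. The sketch below assumes each attempt fails independently with a fixed probability, which understates a real collapse (loss rises as load rises), and the function names and default timings are illustrative. The second function shows exponential backoff with full jitter, a common way to damp synchronized retries.

```python
import random
from typing import List, Optional


def attempts_per_request(loss: float, max_retries: int) -> float:
    """Expected sends per logical request if each attempt fails with probability `loss`."""
    # Attempt k is made only when the previous k attempts all failed:
    # 1 + p + p^2 + ... + p^max_retries.
    return sum(loss ** k for k in range(max_retries + 1))


def backoff_delays(max_retries: int, base_s: float = 0.1, cap_s: float = 5.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Delay before each retry: uniform in [0, min(cap, base * 2^attempt)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))
            for attempt in range(max_retries)]
```

With three retries, 1% loss adds about 1% extra traffic, while 50% loss nearly doubles it (1.875 sends per request), which is exactly when the network can least afford the extra load.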

The Gmail and YouTube impact: Google's consumer services (Gmail, YouTube, Google Search) share infrastructure with GCP. When the backbone congestion occurred, it affected both GCP customers and Google's own consumer services. This shared fate meant the blast radius was enormous — not just GCP enterprise customers but hundreds of millions of consumer users.

Detection

Google's global monitoring systems detected the error rate increases within minutes of the configuration change. Identifying that a specific network configuration change was responsible, rather than a service-specific bug, took approximately 30 minutes of investigation. The investigation involved:

  • Correlating error rate increases with network link utilization metrics
  • Identifying which backbone paths were congested
  • Tracing which network change had altered traffic routing
  • Confirming that rollback was safe and would not create a different failure

Blast Radius

The outage affected:

  • GCP services in the US (primarily us-east1, us-central1)
  • Google Compute Engine, Kubernetes Engine, Cloud Storage, BigQuery, Cloud SQL
  • Gmail (email delivery delays, slow loading)
  • YouTube (elevated error rates, buffering)
  • Google Search (slight elevation in errors)
  • Google Workspace (Drive, Docs, Sheets: slow or unavailable)

Duration: approximately 4 hours of significant impact, with tail impact extending to 7 hours.

Recovery

The configuration rollback was the primary recovery action. However, because the congestion had been ongoing for hours, simply removing the routing error did not immediately restore service:

  • In-flight requests that had timed out needed to be retried
  • Retry storms from clients (browsers, mobile apps, server-side clients) created additional load at recovery time
  • Google's backend services needed to drain queues built up during the degraded period

Recovery was gradual over approximately 3-4 hours after the rollback.

Architectural Lessons and Changes

  1. Traffic engineering changes need staged rollout: Configuration changes to network routing should be applied incrementally — first to a small fraction of traffic, verified for impact, then expanded. A change that affects 100% of traffic simultaneously is dangerous.

  2. Network configuration change controls: Changes to backbone traffic engineering should be subject to the same canary deployment discipline as software deployments. "Change 5% of traffic, verify, change 25%, verify" etc.

  3. GCP/Google infrastructure separation: The shared fate between GCP and Google consumer services (Gmail, YouTube) in the same outage was notable. Google has since invested in further isolation between consumer and cloud infrastructure layers.

  4. Dependency on internal network for monitoring: Google's monitoring and alerting systems also depend on the internal network. When the network is congested, monitoring signals may be delayed or dropped — potentially masking the severity of an outage.

  5. Congestion control for RPC: Google uses protocols with explicit congestion control for internal RPCs (QUIC-based internal protocols), analogous to TCP's congestion control, to prevent RPC retry storms from amplifying congestion.
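
The "change 5%, verify, change 25%, verify" discipline from lessons 1 and 2 can be sketched as a loop with a health gate. The callback names and the stage percentages beyond 5 and 25 are assumptions for illustration; a real rollout would also wait for a soak period between applying a stage and checking health.

```python
from typing import Callable, Sequence


def staged_rollout(apply: Callable[[int], None],
                   healthy: Callable[[], bool],
                   rollback: Callable[[], None],
                   stages: Sequence[int] = (5, 25, 50, 100)) -> int:
    """Apply a change in growing traffic fractions; return the percentage left in effect."""
    for pct in stages:
        apply(pct)
        # Gate every expansion on a health check so a bad change stops early.
        if not healthy():
            rollback()
            return 0
    return stages[-1]
```

The point of the structure is that a routing change which would congest links at full traffic has a chance to show up as a failed health check while only a fraction of traffic is exposed.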


Post-Mortem 4: Fastly June 2021 — Varnish Config Bug Triggered by Single Customer

Timeline

  • June 8, 2021, 09:47 UTC: A single Fastly customer changes a specific setting in their Fastly service configuration.

  • 09:47-09:58: Approximately 85% of Fastly's global CDN network goes offline. Affected sites include: UK Government (gov.uk), New York Times, Guardian, Financial Times, Reddit, Twitch, GitHub, HBO Max, CNN, BBC, PayPal.

  • 09:58 UTC: Fastly engineers identify the root cause.

  • 10:27 UTC: Fastly pushes a software update to disable the trigger. Service restoration begins.

  • 10:43 UTC: Full service restored for most customers (~56 minutes total outage).

Technical Root Cause Chain

Phase 1 — The bug, deployed weeks earlier:

In May 2021, Fastly deployed a software update to its globally-deployed Varnish cache management infrastructure. Varnish is the reverse proxy and cache software that Fastly uses at its edge nodes. The update introduced a bug in how Fastly's Varnish Configuration Language (VCL) parsing handled a specific combination of settings.

The bug was a latent code path: it did not trigger during normal operation. It required a specific, unusual combination of VCL configuration options that was not present in the default or most common configurations. The bug passed all pre-deployment tests because none of the tests used the triggering configuration combination.

Phase 2 — The customer change that triggered it:

On June 8, a Fastly customer (reported to be Fastly itself, testing a new feature; the exact customer identity was not publicly confirmed) updated their CDN configuration in a way that matched the triggering pattern. Specifically, the update set a VCL option that, in combination with another option in the customer's existing configuration, hit the code path containing the bug.

Phase 3 — Global propagation:

Fastly deploys VCL configuration changes to all its edge nodes globally. When a customer updates their CDN service, the updated VCL is pushed to all Fastly edge nodes that serve that customer's traffic (which can be all of them, for a large customer). The push is fast — designed for rapid configuration propagation.

When the triggering VCL combination was deployed globally, every Fastly edge node that received it crashed or entered an error state within seconds. Because Fastly's infrastructure is globally distributed and the VCL push was global, 85% of the edge network failed simultaneously.

Phase 4 — The 11-minute window:

The 11 minutes between first customer impact (09:47) and Fastly engineer identification of root cause (09:58) was unusually fast for a global CDN outage. Fastly's monitoring immediately showed the global error rate spike. Engineers saw the correlation between the VCL push and the error rate increase. The 11-minute investigation was essentially "confirm correlation = causation, identify the triggering VCL combination."

Detection

Fastly's monitoring detected the global error rate increase within ~1 minute. The correlation to a specific configuration push was identified rapidly because Fastly's deployment systems log VCL push events with timestamps that could be correlated directly with the error onset.

Blast Radius

~85% of Fastly's CDN capacity offline simultaneously. Every website serving content through Fastly (which includes many of the world's highest-traffic sites) was either completely down or severely degraded. The UK Government's official website (gov.uk) going offline during the outage became particularly noted because government services are expected to have high availability.

The 56-minute duration was short relative to other major outages in this document. The fast detection and fix reflected well on Fastly's incident response.

Recovery

Fastly's fix was a software update that disabled the triggering VCL feature flag globally. Pushing this update to all edge nodes resolved the crash condition. Nodes that had been in error state restarted and rejoined the CDN pool within minutes of receiving the fix.

The rapid recovery also reflected the nature of the bug: once the triggering configuration was neutralized, there were no persistent state corruption issues to recover from. Varnish caches repopulate quickly from origins.

Architectural Lessons and Changes

  1. Latent bug detection requires configuration fuzzing: A bug that only triggers on unusual configuration combinations will not be found by standard unit/integration tests. Configuration space fuzzing — systematically testing unusual combinations — is required.

  2. Global VCL deployment risk: Fastly's model of deploying customer VCL changes globally is efficient but creates a mechanism for a single customer configuration change to trigger a global outage. Staged deployment of configuration changes (like staged software deployment) would limit blast radius.

  3. Feature flag lifetime management: The triggering bug was in a feature deployed weeks before the triggering event. Features with limited testing that are deployed globally but not yet widely used represent latent risk. Feature flags should be time-bounded and reviewed regularly.

  4. CDN customer isolation: CDN edge nodes running all customers' VCL on shared infrastructure creates shared-fate risk. Customer-specific crashes should be isolated to a sandbox, not propagate to the edge node itself.

  5. Fastly's post-mortem transparency: Fastly published a detailed, honest post-mortem within 12 hours of the incident. This transparency (including acknowledging "a specific customer configuration triggered a bug we deployed") was noted as a positive industry example of post-mortem culture.
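
The configuration fuzzing in lesson 1 can be approximated with pairwise combination testing: vary every pair of options across all their values while holding the rest at defaults. This is a minimal sketch. The option names are invented for the example and are not real VCL settings, and `load` stands in for whatever parses or compiles a configuration.

```python
from itertools import combinations, product
from typing import Any, Callable, Dict, Iterator, List

Config = Dict[str, Any]


def pairwise_configs(options: Dict[str, List[Any]]) -> Iterator[Config]:
    """Yield configs covering every value pair for every pair of options."""
    # The first listed value of each option is treated as its default.
    defaults = {name: values[0] for name, values in options.items()}
    for a, b in combinations(options, 2):
        for value_a, value_b in product(options[a], options[b]):
            yield {**defaults, a: value_a, b: value_b}


def find_crashes(load: Callable[[Config], None],
                 options: Dict[str, List[Any]]) -> List[Config]:
    """Run `load` on every pairwise config and collect the ones that raise."""
    crashes = []
    for config in pairwise_configs(options):
        try:
            load(config)
        except Exception:
            crashes.append(config)
    return crashes
```

Pairwise coverage grows with the square of the number of options rather than exponentially, so it stays cheap enough to run before every deployment. It would not catch a bug that needs three or more options to line up.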

Reference: Fastly post-mortem: fastly.com/blog/summary-of-june-8-outage


Post-Mortem 5: Facebook October 4, 2021 — BGP Withdrawal and Physical Access Failure

(See also detailed BGP analysis in 04-network-and-distributed-system-failures.md)

Timeline

  • October 4, 2021, 15:40 UTC: Facebook engineers executing a routine backbone maintenance command accidentally withdraw all BGP routes for Facebook's IP address space.

  • 15:40-15:41: Facebook's DNS servers, Facebook's edge infrastructure, and all Facebook-owned applications (Instagram, WhatsApp, Messenger, Oculus) become unreachable from the public Internet.

  • 15:41: Facebook's DNS servers fail health checks (they cannot be reached either). Automated health-check failure triggers DNS servers to remove themselves from service.

  • 15:41-16:00: Facebook engineers attempt to SSH to backbone routers to restore BGP routes. SSH requires network access; network is down. Engineers recognize this is going to require physical access.

  • ~16:30-17:00: Teams dispatched to Facebook data centers. Badge access systems are on Facebook's own internal network — offline. Security teams provide manual overrides.

  • 17:00-21:00: Engineers use out-of-band management consoles (serial consoles, dedicated management network) to access backbone routers. Work through restoring BGP routes.

  • 21:27 UTC: BGP routes re-advertised. Internet routing to Facebook IPs restored.

  • 21:27-22:30 UTC: Traffic recovery. Hundreds of millions of clients simultaneously attempt to reconnect. Load spikes require staged recovery to prevent re-crashing infrastructure.

  • ~22:30 UTC: Full recovery. Total outage: ~6 hours 50 minutes.

Technical Root Cause Chain

Phase 1 — The configuration command:

Facebook's backbone network uses a custom routing infrastructure. Engineers were performing an audit and update of BGP configurations as part of a planned maintenance. A specific command (the exact nature of which Facebook described vaguely in their post-mortem as a "configuration change that intended to assess the availability of global backbone capacity") was issued incorrectly.

The command, as executed, caused the backbone routers to withdraw all BGP route advertisements for Facebook's address space. In BGP terms: Facebook's autonomous system ceased announcing its own IP prefixes to the Internet. From the Internet's perspective, Facebook's IP addresses ceased to be routable.

Phase 2 — DNS cascade:

Facebook's authoritative DNS servers are located at Facebook's IP addresses. With those IPs unroutable, DNS queries for any Facebook domain could not reach the authoritative DNS servers. Recursive resolvers worldwide timed out and returned SERVFAIL for facebook.com, instagram.com, and whatsapp.com. (NXDOMAIN asserts that a name does not exist and requires an answer from an authoritative server, so an unreachable authority produces SERVFAIL instead.)

Facebook's DNS servers themselves had health-check automation: if a DNS server is repeatedly failing to respond to probes, it should be removed from the DNS service pool (to prevent sending queries to a broken server). The health checks were failing because the servers were unreachable. The automation withdrew the DNS servers from service. This was correct behavior in isolation but created a compounding effect: now even if BGP routes returned, the DNS servers might not be in service.
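
The distinction the automation lacked, "this server is broken" versus "everything is unreachable", can be sketched by comparing a server's own probe result with its peers'. The function name, the quorum value, and the idea that each server can see peer probe results are assumptions for illustration, not a description of Facebook's system.

```python
from typing import List


def should_withdraw(my_probe_ok: bool, peer_probes_ok: List[bool],
                    quorum: float = 0.5) -> bool:
    """Withdraw this server only when it looks uniquely broken."""
    if my_probe_ok:
        return False
    if not peer_probes_ok:
        # No peers to compare against: fall back to the original behavior.
        return True
    healthy_fraction = sum(peer_probes_ok) / len(peer_probes_ok)
    # If most peers are failing too, the fault is probably in the network,
    # and removing every server would only deepen the outage.
    return healthy_fraction >= quorum
```

A single failing server among healthy peers is still withdrawn, so the normal failure mode is handled as before; only the fleet-wide case changes.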

Phase 3 — Physical access failure:

This was the most operationally interesting phase. Facebook's data center badge access was managed by a system running on Facebook's own internal network infrastructure. With BGP routes withdrawn, internal network connectivity between data centers was also affected (Facebook's internal services used many of the same IP addresses and routing infrastructure as public-facing services).

When engineers arrived at data centers, badge readers were offline or unable to contact the authentication backend. Physical security guards were present but needed override authorization from Facebook's physical security team — adding time. The failure of a security system during a major network outage created a cascading physical access problem.

Phase 4 — Recovery complexity:

Even after engineers reached the backbone router consoles and restored BGP routes, recovery was not instantaneous:

  • BGP convergence takes 30-90 seconds globally
  • Resolver caches held failure responses from the outage; those entries needed to expire before clients would retry
  • Facebook's DNS servers needed to be re-added to the DNS service pool manually
  • The load spike from hundreds of millions of simultaneous reconnections required careful rate limiting to avoid re-triggering failures

Detection

Facebook's monitoring systems use many external dependencies. With BGP routes withdrawn, Facebook's own internal monitoring was itself degraded. Engineers monitoring Facebook's status had to rely on third-party reports (Cloudflare Radar showing route withdrawal, Downdetector showing user impact) to get early indicators.

Facebook's NOC (Network Operations Center) detected the BGP withdrawal via external routing monitoring — routes visible in BGP looking glasses worldwide disappeared. The internal correlation ("this is because of the maintenance command") took approximately 15-20 minutes.

Blast Radius

Every Facebook-owned application was offline globally for ~7 hours:

  • Facebook: 2.9 billion users
  • Instagram: 1.5 billion users
  • WhatsApp: 2 billion users
  • Messenger: ~1.3 billion users
  • Oculus VR devices: dependent on Facebook authentication, many non-functional
  • Facebook Workplace (enterprise product): offline

Estimated revenue loss: ~$100M+ (Meta's revenue run rate is approximately $110B/year).

Staff impact: approximately 1,000+ Facebook engineers were working the incident. Many could not access internal tools (which also ran on Facebook's infrastructure), reducing their effectiveness.

Architectural Lessons and Changes

  1. Management plane must be independent of data plane: The backbone router management (SSH access, out-of-band console) must be reachable regardless of the state of the data plane (customer-facing BGP routes). Facebook implemented dedicated out-of-band management networks post-incident.

  2. Physical access systems cannot depend on the network being managed: Badge access systems must have local authentication fallback or be on physically separate infrastructure from the network they are used to access. Facebook implemented local authentication caching for badge systems.

  3. BGP configuration change safeguards: Any command that could result in route withdrawal should have a rate limit and an automatic "roll back if routes are not re-advertised within N minutes" safety.

  4. Internal services on public IP space: Facebook's internal services sharing IP address space with public-facing services created shared fate. Post-incident, Facebook reviewed which internal services could be isolated to RFC1918 private address space that is independent of public BGP.

  5. DNS health check automation: The automated removal of DNS servers during an outage (correct behavior in normal failure mode) became a problem during a BGP withdrawal (the "failure" was network-level, not server-level). Health check automation should distinguish between "server is broken" and "network is unreachable."
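
The automatic rollback in lesson 3 is similar in spirit to the commit-confirmed pattern some router operating systems offer: a change reverts itself unless an operator confirms it within a window. The sketch below is a toy model with hypothetical names. Time is passed in explicitly so the logic is easy to test, and the prefixes are documentation ranges, not Facebook's.

```python
from typing import FrozenSet, Optional


class ConfirmedCommit:
    """Applies a route change that reverts unless confirmed before a deadline."""

    def __init__(self, routes: FrozenSet[str], window_s: float):
        self.routes = routes
        self.window_s = window_s
        self._previous: Optional[FrozenSet[str]] = None
        self._deadline: Optional[float] = None

    def apply(self, new_routes: FrozenSet[str], now: float) -> None:
        """Install new routes and start the confirmation window."""
        self._previous = self.routes
        self.routes = new_routes
        self._deadline = now + self.window_s

    def confirm(self, now: float) -> bool:
        """Make the change permanent; only possible before the deadline."""
        if self._deadline is None or now >= self._deadline:
            return False
        self._previous = None
        self._deadline = None
        return True

    def tick(self, now: float) -> None:
        """Driven by a local timer, so it still works when the network is down."""
        if self._deadline is not None and now >= self._deadline:
            self.routes = self._previous
            self._previous = None
            self._deadline = None
```

The key property is that the revert path runs locally on the device. An operator who has just cut off their own access cannot confirm, so the change undoes itself.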

Reference: Facebook Engineering post-mortem: engineering.fb.com/2021/10/04/networking-infrastructure/outage (published same day)


Post-Mortem 6: AWS us-east-1 December 7, 2021 — Kinesis Failure and Monitoring Collapse

Timeline

  • December 7, 2021, approximately 07:30 PST: AWS's Kinesis Data Streams service in us-east-1 begins experiencing elevated error rates due to a thread count overflow in the Kinesis front-end fleet.

  • 07:30-09:00: Kinesis errors begin cascading to AWS services that use Kinesis for internal coordination and event streaming. AWS's CloudWatch (monitoring) service uses Kinesis internally. CloudWatch begins failing.

  • 09:00-12:00: With CloudWatch degraded, AWS's internal monitoring systems are unable to accurately report service health. Operators managing other AWS services have reduced visibility. The irony: the monitoring system failure masks and delays diagnosis of the outage.

  • 12:00+: AWS identifies the Kinesis root cause (thread count overflow). Engineers begin the repair process for Kinesis, which requires scaling down and restarting the front-end fleet.

  • ~22:00: Kinesis fully restored. Downstream services recover. Total duration: ~14 hours for full resolution.

Technical Root Cause Chain

Phase 1 — The thread count overflow in Kinesis:

Kinesis Data Streams' front-end servers used a thread pool to handle incoming requests. Each thread in the pool had metadata stored in a server-side component. The metadata storage had a fixed size — it could accommodate a specific number of thread entries.

In December 2021, AWS added a new monitoring capability to Kinesis front-end servers that required additional per-thread metadata. The change did not update the maximum thread count capacity in the metadata storage layer. When the new monitoring threads started, the metadata storage was already at or near capacity from existing threads.

The metadata storage overflowed: new threads could not be registered. When threads could not be registered, the server began returning errors for new Kinesis requests. Because the issue affected all front-end servers (they all received the same software update), the errors were region-wide in us-east-1.

Before change: 
  front-end thread count: N
  metadata storage capacity: N+50 (comfortable headroom)

After change:
  front-end thread count: N + monitoring_threads
  monitoring_threads: 50-100 additional threads
  total: N + monitoring_threads > N+50 (storage capacity) once monitoring_threads > 50
  → thread registration fails
  → new requests cannot be handled
  → Kinesis returns errors
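
A pre-deployment check along these lines would flag the overflow before rollout. This is a minimal sketch: the function name `check_thread_capacity` and the numbers are illustrative assumptions, not AWS tooling or real fleet values.

```shell
# Hypothetical pre-deployment capacity check.
# $1 = current thread count per server
# $2 = threads added by the change
# $3 = metadata storage capacity (maximum registrable threads)
# Returns 0 if the new total fits, 1 otherwise.
check_thread_capacity() {
  local current=$1 added=$2 capacity=$3
  local total=$((current + added))
  if [ "$total" -le "$capacity" ]; then
    echo "ok: $total of $capacity thread slots used"
    return 0
  fi
  echo "FAIL: $total threads exceed capacity $capacity by $((total - capacity))"
  return 1
}

# Illustrative values: N=1000, capacity N+50, change adds 80 threads.
check_thread_capacity 1000 80 1050 || echo "block the rollout"
```

Run as a release gate, the check fails fast on the deploy host instead of on every production server at once.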

Phase 2 — Why CloudWatch was affected:

AWS CloudWatch is AWS's monitoring, metrics, logging, and alerting service. Internally, CloudWatch uses Kinesis Data Streams for processing metric data, log ingestion, and event streaming. When Kinesis began failing, CloudWatch's ability to ingest and process metrics was degraded.

The blast radius through CloudWatch was wide because many AWS services that depend on CloudWatch for health checks, alarms, and operational coordination began operating with degraded or delayed monitoring data.

Phase 3 — The monitoring failure masked the outage:

This is the most important lesson from this incident. AWS's internal operational processes depend heavily on CloudWatch for detecting when a service is degraded and for providing engineers with visibility into the root cause. When CloudWatch itself was degraded:

  • Service health dashboards showed stale or incomplete data
  • Engineers investigating downstream service failures had incomplete visibility
  • Alerts that should have fired did not fire (the alert pipeline itself was broken)
  • The estimated number of affected services and customers was initially underestimated

The failure of the monitoring system created a second-order problem: engineers had to debug the outage with reduced visibility, extending the time to identify root cause and increasing the time to recovery.
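
One defence against stale dashboards is to treat old data as "unknown" rather than "healthy". The sketch below assumes a hypothetical `classify_health` helper with integer inputs; the thresholds are placeholders, not values from the incident.

```shell
# Hypothetical health classifier that refuses to trust stale data.
# $1 = epoch seconds of the newest datapoint
# $2 = current epoch seconds
# $3 = error rate of that datapoint, in whole percent
# $4 = maximum acceptable datapoint age in seconds (default 120)
# $5 = error-rate alarm threshold in percent (default 5)
classify_health() {
  local last_ts=$1 now=$2 error_pct=$3 max_age=${4:-120} alarm_pct=${5:-5}
  local age=$((now - last_ts))
  if [ "$age" -gt "$max_age" ]; then
    echo "unknown"      # stale: the pipeline may be broken, do not assume healthy
  elif [ "$error_pct" -ge "$alarm_pct" ]; then
    echo "unhealthy"
  else
    echo "healthy"
  fi
}

now=$(date +%s)
classify_health "$((now - 30))" "$now" 1      # prints: healthy
classify_health "$((now - 900))" "$now" 1     # prints: unknown
```

An "unknown" result can then page an operator in the same way as "unhealthy", so a silent alert pipeline is itself an alert.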

Phase 4 — Services affected by Kinesis directly:

Beyond CloudWatch, Kinesis is used internally by multiple AWS services for data streaming, event coordination, and capacity scaling signals. Services affected included:

  • Amazon Connect (contact center): uses Kinesis for real-time metrics
  • Amazon EventBridge (event bus): partial degradation
  • Cognito (authentication): degraded (uses Kinesis internally)
  • Various EC2 control plane operations: degraded
  • Auto Scaling signal processing: delayed

Detection

Kinesis errors were detected by AWS's monitoring within minutes. However, because the monitoring system itself was degrading at the same time, the full picture of impact was not immediately clear. Customers detected the problem through elevated CloudWatch error rates and through failures in downstream services.

The correlation "Kinesis failure → CloudWatch failure → monitoring blindness → underestimated impact" was identified over the first few hours of the incident.

Blast Radius

The us-east-1 region was affected. Services impacted included Kinesis itself, CloudWatch, Cognito, Amazon Connect, Auto Scaling, and portions of the EC2 control plane. The monitoring failure had an outsize effect on perceived blast radius: because monitoring was degraded, many teams did not immediately see their services as affected.

The monitoring failure extended the recovery: Normally, as a root cause is fixed, operators watch monitoring dashboards to confirm service recovery. With CloudWatch degraded, recovery confirmation was slower — operators had less confidence that fixed services were actually recovering normally.
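
When dashboards cannot be trusted, recovery can be confirmed with direct probes instead. The following is a sketch only: `probe_success_rate` is a hypothetical helper, and the URL in the commented example is a placeholder.

```shell
# Hypothetical out-of-band probe helper.
# $1 = number of probe attempts; remaining arguments = probe command to run.
# Prints the integer percentage of attempts that succeeded.
probe_success_rate() {
  local attempts=$1 ok=0 i=0
  if [ "$attempts" -le 0 ]; then
    echo "attempts must be positive" >&2
    return 1
  fi
  shift
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    fi
    i=$((i + 1))
  done
  echo $((ok * 100 / attempts))
}

# Example against a placeholder endpoint:
# probe_success_rate 10 curl -fsS --max-time 2 https://service.example.com/health
probe_success_rate 5 true    # prints: 100
```

Because the probe command is passed in, the same helper works with `curl`, `dig`, or a service-specific CLI call.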

Recovery

The Kinesis fix required:

  1. Identifying the thread count overflow as the root cause (this required investigating front-end server logs while working around degraded monitoring)
  2. Reducing the thread count on the Kinesis front-end fleet by rolling back the new monitoring change
  3. Restarting front-end servers gradually (restarting all of them simultaneously risks a thundering herd)
  4. Verifying Kinesis recovery (difficult with degraded monitoring)
  5. Allowing CloudWatch to recover as Kinesis recovered
  6. Confirming all downstream services had recovered
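
The gradual restart can be sketched as a batch planner. The function name `plan_restart_batches` and the host names are hypothetical; the source does not state the batch sizes AWS used.

```shell
# Hypothetical batch planner for a gradual fleet restart.
# $1 = batch size; remaining arguments = host names.
# Prints one line per batch, hosts separated by spaces.
plan_restart_batches() {
  local size=$1 count=0 line="" host
  shift
  for host in "$@"; do
    line="${line:+$line }$host"
    count=$((count + 1))
    if [ "$count" -eq "$size" ]; then
      echo "$line"
      line=""
      count=0
    fi
  done
  # Emit the final, possibly smaller, batch.
  if [ -n "$line" ]; then
    echo "$line"
  fi
}

# Five placeholder front-end hosts, restarted two at a time.
plan_restart_batches 2 fe-1 fe-2 fe-3 fe-4 fe-5
# fe-1 fe-2
# fe-3 fe-4
# fe-5
```

Between batches, an operator would wait for the restarted hosts to pass a direct health check before continuing, which keeps reconnect load bounded.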

Because the failure cascaded (the Kinesis failure degraded CloudWatch, which in turn caused monitoring blindness), each step was slower than normal due to reduced operational visibility.

Architectural Lessons and Changes

  1. Monitoring systems must not share failure domains with the systems they monitor. CloudWatch using Kinesis internally was a circular dependency: the monitor depended on the monitored. Post-incident, AWS invested in a "monitoring of monitors" layer using a separate, isolated infrastructure path. External monitoring probes (synthetic monitoring from outside AWS) became a primary signal that does not depend on internal AWS infrastructure.

  2. Thread count capacity must be validated before software changes. The overflow was a resource capacity bug — adding threads without verifying storage capacity. This type of error should be caught in load testing that exercises the maximum thread count.

  3. Cell-based architecture for monitoring: A single Kinesis Data Streams cluster serving all of us-east-1's monitoring needs created a SPOF. Distributing monitoring data processing across cells (each cell serving a partition of the monitoring load) limits blast radius.

  4. Operational runbooks for degraded-monitoring scenarios: When monitoring is degraded, operators need manual procedures for assessing service health (direct service-level health checks, out-of-band probes). These runbooks should be maintained and practiced.

  5. Kinesis back-pressure handling in CloudWatch: CloudWatch should have a fallback path for when Kinesis is unavailable, either local buffering with delayed ingestion or a secondary non-Kinesis transport. Monitoring infrastructure must itself be designed to tolerate the loss of its own dependencies.
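
The local-buffering fallback from item 5 might look like the sketch below. Everything here is hypothetical: `send_primary` is a stand-in for the real transport, and `SPOOL_FILE`, `emit_metric`, and `flush_spool` are illustrative names, not CloudWatch internals.

```shell
# Hypothetical spool-to-disk fallback for a metrics emitter.
SPOOL_FILE="${SPOOL_FILE:-/tmp/metric-spool.$$}"

# Stand-in for the real transport; fails when PRIMARY_UP is not 1.
send_primary() {
  [ "${PRIMARY_UP:-1}" -eq 1 ]
}

# Try the primary transport; on failure append the datapoint to the spool.
emit_metric() {
  if send_primary "$1"; then
    echo "sent: $1"
  else
    echo "$1" >> "$SPOOL_FILE"
    echo "buffered: $1"
  fi
}

# Replay buffered datapoints once the primary transport is back.
flush_spool() {
  [ -f "$SPOOL_FILE" ] || return 0
  local line
  while IFS= read -r line; do
    send_primary "$line" || return 1    # still down: keep the spool for later
    echo "replayed: $line"
  done < "$SPOOL_FILE"
  rm -f "$SPOOL_FILE"
}

PRIMARY_UP=0; emit_metric "cpu=91 ts=1700000000"    # buffered locally
PRIMARY_UP=1; flush_spool                           # replayed after recovery
```

A production version would also need a size bound on the spool and deduplication on replay, since a flush interrupted halfway through replays the whole file next time.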

Reference: AWS post-mortem: aws.amazon.com/message/12721


ASCII Diagram: Cloud Outage Failure Pattern Taxonomy

Failure Type 1: Shared Control Plane (EBS 2011, Kinesis 2021)
  ┌────────────┐    ┌────────────┐    ┌────────────┐
  │ Service AZ1 │    │ Service AZ2 │    │ Service AZ3 │
  └─────┬──────┘    └─────┬──────┘    └─────┬──────┘
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
                   ┌──────▼───────┐
                   │ Shared       │ ← Single point of failure
                   │ Control Plane│   affects all AZs
                   └──────────────┘

Failure Type 2: Config Cascade (Fastly 2021, Google 2019)
  Software Update → All Nodes → Global Failure
  [No staged rollout → zero blast radius limitation]

Failure Type 3: Self-referential dependency (Kinesis/CloudWatch 2021)
  Service A → uses → Service B (monitoring)
  Service B → depends on → Service A
  Service A fails → Service B fails → no visibility into A's failure

Failure Type 4: Physical access dependency (Facebook 2021)
  Network fails → Access control fails → Cannot reach network equipment
  [Management plane depends on data plane]

Failure Type 5: Human error amplification (S3 2017)
  Operator typo → Overly-broad command → Large blast radius
  [No rate limit, no confirmation, no automated rollback]
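
The self-referential dependency in Failure Type 3 can be detected mechanically if the dependency edges are written down. The sketch below assumes `tsort` (part of coreutils on most systems) is installed and relies on it reporting loops on stderr; the function name and the edge list are illustrative.

```shell
# Hypothetical cycle detector for a service dependency list.
# Reads "service dependency" pairs on stdin, one pair per line.
# Returns 0 if the graph contains a cycle, 1 otherwise.
has_dependency_cycle() {
  # Discard the sorted output and keep only stderr, where tsort reports loops.
  if tsort 2>&1 >/dev/null | grep -q .; then
    return 0
  fi
  return 1
}

# Illustrative edges mirroring Failure Type 3.
printf '%s\n' \
  "service-a monitoring" \
  "monitoring service-a" | has_dependency_cycle && echo "cycle found"
```

Feeding the function the output of a dependency inventory turns the check into a CI gate. Note that any other `tsort` error, such as malformed input, is also reported as a cycle by this sketch.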

Debugging Notes

# AWS: Check service health without depending on CloudWatch
# Use external synthetic monitoring or direct endpoint probes
curl -I https://kinesis.us-east-1.amazonaws.com/
aws kinesis list-streams --region us-east-1  # direct service health check

# Check EBS volume state
aws ec2 describe-volumes \
  --filters Name=status,Values=error,creating \
  --query 'Volumes[*].[VolumeId,State]'

# Fastly: Check CDN status without using the CDN
# Use direct origin IP (bypassing CDN) to verify origin health
curl --resolve example.com:443:ORIGIN_IP https://example.com/health

# BGP monitoring (external to your network)
# Use public looking glasses: lg.he.net, bgpview.io
# Check: are your prefixes still visible from external ASes?

# DNS health without relying on your own DNS
dig @8.8.8.8 yourdomain.com  # Google DNS
dig @1.1.1.1 yourdomain.com  # Cloudflare DNS
# If both return SERVFAIL or time out, the authoritative servers are likely
# unreachable from the public Internet (possibly a BGP-level issue)

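# Certificate expiry check (-servername sends SNI so the correct certificate is returned)
echo | openssl s_client -connect hostname:443 -servername hostname 2>/dev/null | \
  openssl x509 -noout -dates
# Or: check all internal services by looping over a host list
# (hosts.txt is a placeholder file with one hostname per line)
while read -r host; do
  echo | openssl s_client -connect "$host:443" -servername "$host" 2>/dev/null | \
    openssl x509 -noout -enddate
done < hosts.txt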

Security Implications

  • Out-of-band management access security: OOB access (serial consoles, IPMI) bypasses normal access controls — must be independently secured.
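  • BGP security: Unauthorized BGP announcements and withdrawals are a form of denial of service at the Internet routing level. RPKI route origin validation mitigates hijacked announcements; it does not prevent an operator's own routers from withdrawing routes, which requires change-control safeguards on the routers themselves.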
  • Certificate expiry as attack vector: An attacker who can time their attack to coincide with a certificate expiry can cause authentication failures that are harder to diagnose.

Performance Implications

  • Cloud outages illustrate that shared infrastructure has shared performance characteristics; cells improve both performance isolation and fault isolation.
  • Monitoring overhead (the Kinesis thread count bug was introduced by adding monitoring threads) can itself introduce performance and stability issues — the monitoring infrastructure must be sized for its own load.

Failure Modes Summary

| Incident | Root Failure | Cascade Mechanism | Duration |
|---|---|---|---|
| AWS EBS 2011 | Network config → re-mirror storm | Control plane overload | 10 days (full) |
| Azure 2012 | Certificate expiry | Auth cascade | Hours |
| Google Cloud 2019 | Network config → congestion | RPC retry storm | 4-7 hours |
| Fastly 2021 | Latent VCL bug | Customer config triggered global edge crash | 56 min |
| Facebook 2021 | BGP route withdrawal | DNS cascade + physical access | 7 hours |
| AWS us-east-1 2021 | Thread overflow in Kinesis | Monitoring failure masked impact | 14 hours |

Modern Usage

  • AWS Resilience Hub: Automated resilience assessment for AWS workloads — identifies single points of failure and recommends cell-based isolation.
  • Multi-region active-active: Lessons from these outages drove adoption of active-active multi-region architectures for critical services.
  • Chaos Engineering: Netflix Chaos Monkey (now open-sourced), AWS Fault Injection Simulator — systematically injecting failures to find hidden dependencies before they cause production outages.
  • Automated rollback: Feature flag platforms (LaunchDarkly, AWS AppConfig) with automatic rollback on error rate thresholds — directly addresses the Fastly VCL scenario.

Future Directions

  • BCP (Business Continuity Planning) automation: Automated orchestration of failover to backup regions or providers during outage, rather than manual operator response.
  • Formal verification of network configuration changes: Using TLA+ specifications or other model-checking approaches to verify, before deployment, that a network config change does not create unreachable nodes.
  • Dependency mapping tools: Automated discovery of service dependencies (including transitive monitoring dependencies) to identify circular dependencies and shared failure domains.

Exercises

  1. Design a redundant monitoring architecture that does not share a failure domain with the services it monitors. Specify: what infrastructure the monitoring runs on, how it gets data from services, and how it alerts operators when both the service AND the monitoring are failing.

  2. For the EBS 2011 outage: design a re-mirroring subsystem with rate limiting that would have prevented the storm. What are the tradeoffs between recovery speed and storm prevention?

  3. For the Facebook 2021 outage: design a physical access control system for data centers that degrades gracefully when network connectivity to the authentication backend is lost. How do you balance security (prevent unauthorized physical access) with availability (allow authorized engineers in during outages)?

  4. Implement a "dead man's switch" for network configuration changes: a system that automatically rolls back a BGP configuration change if the originating session doesn't send a "keep this change" confirmation within 5 minutes. What failure modes does this introduce?

  5. Analyze the Fastly 2021 outage in terms of: (a) fault isolation — why did one customer's config change affect all customers? (b) testing — what test would have caught the latent bug? (c) rollout — how would staged deployment have limited the blast radius?

References

  • AWS. "Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region." aws.amazon.com/message/65648, April 2011.
  • AWS. "Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region." aws.amazon.com/message/41926, March 2017.
  • AWS. "Post-Event Summary - AWS Service Event in the Northern Virginia (US-EAST-1) Region." aws.amazon.com/message/12721, December 2021.
  • Fastly. "Summary of June 8 outage." fastly.com/blog/summary-of-june-8-outage, June 2021.
  • Facebook Engineering. "More details about the October 4 outage." engineering.fb.com, October 5, 2021.
  • Google Cloud. "Google Cloud Infrastructure Components Incident #19009." cloud.google.com/support/docs/incident-response/2019/march
  • Vogels, Werner. "Lessons Learned from AWS Outages." Various keynote presentations, 2012-2022.
  • Hamilton, James. "On Designing and Deploying Internet-Scale Services." LISA 2007. (foundational principles for cloud reliability)
  • Nygard, Michael T. "Release It! Design and Deploy Production-Ready Software." 2nd ed., 2018. (circuit breakers, bulkheads, timeouts — the patterns these outages motivated)