04 — Network and Distributed System Failures

Technical Overview

Networks fail in ways that are qualitatively different from local system failures: they introduce partial failures (some nodes see the failure, others do not), timing uncertainty (a delayed message is indistinguishable from a lost one), and adversarial behavior (a BGP speaker can announce routes it has no authority over). Distributed systems built atop unreliable networks inherit all of these failure modes plus new ones arising from coordination, consistency, and cascading effects. This document examines five landmark network and distributed system failures, each of which produced lasting changes in how the Internet and cloud infrastructure are designed.

Prerequisites

  • BGP (Border Gateway Protocol): AS paths, route advertisement, route propagation
  • TCP congestion control: slow start, congestion avoidance, AIMD
  • DNS: resolution hierarchy, caching, TTL, authoritative vs recursive resolvers
  • Distributed system fundamentals: consensus, CAP theorem, failure detectors
  • Cloud infrastructure: control plane vs data plane, cell-based architecture
  • AWS service architecture: S3, DynamoDB, Kinesis, internal services

Historical Context

The Internet was designed for survivability, not correctness. BGP assumed AS operators were trustworthy. TCP assumed congestion was rare. DNS assumed the resolution tree was stable. Each of these assumptions was violated catastrophically in the incidents below, forcing the Internet's operators to add security and robustness mechanisms that the original designers never anticipated.


Case Study 1: BGP Routing Disasters

1a. Pakistan Telecom Hijacks YouTube (February 24, 2008)

What happened:

On February 24, 2008, Pakistan Telecom (AS17557) accidentally took YouTube offline globally for approximately 2 hours. Pakistan's telecommunications regulator had ordered YouTube blocked in Pakistan. Pakistan Telecom implemented the block by advertising a more-specific BGP route for YouTube's address space to their upstream provider (PCCW, AS3491).

Technical root cause:

Routers forward on the longest matching prefix, so a more-specific route wins regardless of its BGP path attributes:

YouTube's legitimate route:
  208.65.152.0/22 (covers 208.65.152.0 – 208.65.155.255)
  Advertised by AS36561 (YouTube/Google)

Pakistan Telecom's hijack route:
  208.65.153.0/24 (more specific — covers 208.65.153.0 – 208.65.153.255)
  Advertised by AS17557 (Pakistan Telecom)

Forwarding decision on any router that accepted both routes:
  For destination 208.65.153.128:
    Both routes match
    /24 is more specific than /22
    → Select Pakistan Telecom's route (the hijack)
    → Traffic sent to Pakistan Telecom's null route (block)
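
The selection logic above can be sketched in a few lines of Python using the standard `ipaddress` module. This is a minimal illustration of longest-prefix matching over the two routes from the incident, not a model of a real router's RIB or FIB; the function name `longest_prefix_match` is an assumption made for the example.

```python
import ipaddress

# The two routes from the incident: (prefix, origin AS), as given in the text above.
ROUTES = [
    (ipaddress.ip_network("208.65.152.0/22"), 36561),  # YouTube's legitimate route
    (ipaddress.ip_network("208.65.153.0/24"), 17557),  # Pakistan Telecom's more-specific
]


def longest_prefix_match(destination, routes):
    """Return the (prefix, origin_as) pair with the longest prefix containing destination."""
    addr = ipaddress.ip_address(destination)
    candidates = [(net, asn) for net, asn in routes if addr in net]
    if not candidates:
        return None  # no route at all: the packet would be dropped
    return max(candidates, key=lambda route: route[0].prefixlen)


inside = longest_prefix_match("208.65.153.128", ROUTES)   # covered by both routes
outside = longest_prefix_match("208.65.152.10", ROUTES)   # covered only by the /22
print(inside)
print(outside)
```

Only addresses inside the hijacked /24 are diverted; the rest of the /22 still follows YouTube's own announcement.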

The critical escalation: PCCW accepted the more-specific route from Pakistan Telecom and propagated it globally. Within minutes, the route had propagated to virtually every BGP router on the Internet. All traffic to YouTube's /24 subnet was being blackholed in Pakistan.

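Timeline (UTC, following the RIPE NCC RIS analysis of the incident):

  • 18:47: Pakistan Telecom (AS17557) starts announcing 208.65.153.0/24; PCCW (AS3491) propagates it and YouTube becomes unreachable for most of the Internet
  • 20:07: YouTube (AS36561) starts announcing the same /24, pulling part of the traffic back
  • 20:18: YouTube announces 208.65.153.0/25 and 208.65.153.128/25, which win on longest-prefix match wherever they are accepted
  • 21:01: PCCW withdraws the routes originated by AS17557
  • Total impact: roughly two hours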

Root cause (fundamental): BGP has no built-in origin authentication. Any AS can advertise any prefix. There is no mechanism for a receiving AS to verify that the advertiser has legitimate authority over that address space.

Fix and lasting changes:

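  1. RPKI (Resource Public Key Infrastructure): A cryptographic framework where IP address block holders publish signed "Route Origin Authorizations" (ROAs). A ROA states: "AS X is authorized to originate prefix P, up to a maximum prefix length." Routers that perform route origin validation can reject routes that are RPKI-invalid, meaning a ROA covers the prefix but the origin AS or the prefix length does not match. Routes with no covering ROA are classified "not found" and are usually still accepted. Deployment has accelerated since 2018: by 2024, ~40% of BGP routes are covered by RPKI ROAs.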

  2. IRR (Internet Routing Registry) filtering: More consistent enforcement of IRR-based prefix filtering by large ISPs, preventing customer and peer ASes from advertising prefixes they have not registered.

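  3. Google's response: YouTube (owned by Google) began announcing more-specific routes for critical subnets. Because most networks filter IPv4 prefixes longer than /24, originating the /24s yourself leaves a hijacker no more-specific prefix that will propagate widely; the hijacker can at best tie on prefix length.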
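
To make the RPKI item concrete, here is a minimal sketch of route origin validation in the spirit of RFC 6811, which classifies a route as valid, invalid, or not-found. The `ROA` class, the `validate_origin` function, and the ROA contents are assumptions for illustration (RPKI did not exist in 2008, so this ROA is hypothetical); real validators also verify the cryptographic signatures, which is omitted here.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class ROA:
    prefix: ipaddress.IPv4Network
    max_length: int   # longest prefix length the origin may announce
    origin_as: int


def validate_origin(route_prefix, origin_as, roas):
    """Classify a route as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        # A ROA covers the route if the route equals or is more specific than the ROA prefix.
        if route.version == roa.prefix.version and route.subnet_of(roa.prefix):
            covered = True
            if roa.origin_as == origin_as and route.prefixlen <= roa.max_length:
                return "valid"
    # Covered by at least one ROA but matched by none: invalid. Otherwise unknown.
    return "invalid" if covered else "not-found"


# Hypothetical ROA for YouTube's /22 with max length 22.
roas = [ROA(ipaddress.ip_network("208.65.152.0/22"), 22, 36561)]

print(validate_origin("208.65.152.0/22", 36561, roas))
print(validate_origin("208.65.153.0/24", 17557, roas))
print(validate_origin("192.0.2.0/24", 64500, roas))
```

Note that with a maximum length of 22, a /24 announced by AS36561 itself would also be invalid, so operators who plan defensive more-specific announcements need ROAs that allow them.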

1b. Level 3/Cogent Route Leak (June 2015)

What happened:

In June 2015, Telekom Malaysia (AS4788) accidentally advertised ~170,000 routes — essentially the entire Internet routing table — to their upstream provider Level 3. Level 3 accepted and propagated these routes. Cogent then accepted them from Level 3. For several hours, a significant portion of Internet traffic was being routed suboptimally or dropped.

Technical root cause:

Telekom Malaysia's BGP configuration caused them to re-advertise routes learned from one provider to another: a "route leak." Export policy normally prevents this. Routes learned from providers and peers are tagged (commonly with BGP communities) and re-advertised only to customers. Either Telekom Malaysia's routers did not apply that policy, or Level 3 did not filter what it received from a customer (for example with a maximum-prefix limit).

The failure chain: Malaysia → Level 3 → Cogent → global propagation of routes that now passed through Telekom Malaysia for prefixes it did not own. Traffic to many global destinations was routed through Malaysia (high latency, or dropped at Malaysia's limited international capacity).

Root cause: No BGP path validation, and ISPs not consistently filtering routes based on prefix length or IRR registration.

1c. Facebook BGP Withdrawal (October 4, 2021)

What happened:

On October 4, 2021, Facebook, Instagram, and WhatsApp went offline globally for approximately 6 hours. The proximate cause was a BGP configuration change that accidentally withdrew all of Facebook's BGP routes from the Internet. More interesting was why it took 6 hours to fix.

Technical root cause:

Facebook's engineers were executing a routine maintenance operation: auditing and updating BGP configuration on the backbone routers. A command with an unintended side effect was issued to the backbone routers. The command caused the backbone routers to withdraw all BGP routes that advertised Facebook's IP address space to the Internet.

Normal state:
  Facebook's BGP routes advertised to Internet
  DNS servers resolve facebook.com → 157.240.x.x (Facebook IPs)
  Clients connect to Facebook's edge servers

After BGP withdrawal:
  No BGP routes to Facebook's IPs exist on the Internet
  DNS queries for facebook.com: DNS resolvers query Facebook's authoritative
    DNS servers... but cannot reach them (no BGP route)
  DNS queries time out globally

Cascade:
  Facebook's internal systems also used Facebook's BGP-advertised IPs
  Internal services: monitoring, access control, VPN
  All internal services go offline
  Facebook's DNS servers themselves went offline (they used internal IPs
    that were also unreachable)
  Facebook's DNS servers started failing health checks
  Health check failures triggered DNS servers to withdraw themselves
    from DNS service (auto-remove from rotation)
  All Facebook DNS capacity removed

The physical access problem:

The normal remediation path — SSH into the backbone routers and issue the correct BGP commands — required network access. With all BGP routes withdrawn, no external SSH was possible. Internal SSH depended on internal services (jump hosts, access control systems) that ran on the now-unreachable IP space.

Facebook engineers had to physically drive to their data centers. Physical access required badge authentication. Badge authentication ran on Facebook's own infrastructure — which was offline. Engineers needed to obtain overrides from security teams to enter the building.

Once inside, the backbone routers had out-of-band management access (serial consoles, dedicated management network). Using these, engineers were able to re-advertise the withdrawn BGP routes.

But the recovery was not instantaneous: when the BGP routes came back, hundreds of millions of clients simultaneously tried to reconnect to Facebook. The sudden traffic spike caused overload on Facebook's infrastructure (DNS, load balancers, application servers), requiring careful staged recovery.

Timeline (UTC):

  • 15:40: BGP routes withdrawn
  • 15:40-21:27: Outage (5h47m)
  • 16:00: Physical access problem recognized
  • ~17:00-18:00: Engineers achieve physical data center access
  • ~18:00-21:00: Diagnosis and BGP reconfiguration via out-of-band console
  • 21:27: BGP routes re-advertised, services begin recovery
  • 22:00+: Full recovery after staged traffic restart

Architectural lessons:

  1. Out-of-band management access must be independent of the infrastructure being managed. Facebook's management plane (SSH, badge access systems) ran on the same infrastructure it managed.
  2. BGP configuration changes need staged rollout with automatic rollback. A command that withdraws all routes should require manual confirmation or have an automatic revert if routes are not re-advertised within N minutes.
  3. DNS auto-withdrawal from rotation can amplify outages. Health checks causing DNS removal created a positive feedback loop.
  4. Physical access procedures must have bypass paths. Badge systems that run on the internet-dependent infrastructure are a failure mode.
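
The automatic-revert idea in lesson 2 resembles the "commit confirmed" feature of some router operating systems. The sketch below is a hypothetical illustration, not Facebook's tooling: the `ConfirmedCommit` class, its method names, and the documentation prefixes are all assumptions. The clock is passed in explicitly so the behavior is easy to follow and test.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ConfirmedCommit:
    """Apply a change, then roll it back unless it is confirmed before a deadline."""
    apply: Callable[[], None]
    rollback: Callable[[], None]
    timeout_s: float
    state: str = "idle"
    deadline: Optional[float] = None

    def commit(self, now: float) -> None:
        self.apply()
        self.deadline = now + self.timeout_s
        self.state = "pending"

    def confirm(self, now: float) -> bool:
        """The operator proves they still have access; only valid before the deadline."""
        if self.state == "pending" and now <= self.deadline:
            self.state = "confirmed"
            return True
        return False

    def tick(self, now: float) -> None:
        """Runs periodically on the device itself, so it needs no network access."""
        if self.state == "pending" and now > self.deadline:
            self.rollback()
            self.state = "rolled_back"


# Hypothetical change that withdraws every advertised prefix (documentation ranges).
advertised = {"192.0.2.0/24", "198.51.100.0/24"}
saved = set(advertised)

change = ConfirmedCommit(
    apply=advertised.clear,
    rollback=lambda: advertised.update(saved),
    timeout_s=600,
)
change.commit(now=0.0)    # routes withdrawn; the operator is locked out and cannot confirm
change.tick(now=601.0)    # deadline passed without confirmation, so the change is reverted
print(change.state, sorted(advertised))
```

The important design property is that the revert timer runs locally on the device, so it still fires when the change has cut off every remote management path.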


Case Study 2: TCP Congestion Collapse (1986)

What Happened

In October 1986, the Internet suffered the first of a series of congestion collapses. As Van Jacobson later reported, throughput on the path between LBL and UC Berkeley (two sites 400 yards apart, separated by two IMP hops) dropped from 32 Kbps to 40 bps, roughly a thousandfold reduction. The cause: TCP congestion collapse, where the response to packet loss added load instead of removing it, causing a self-reinforcing downward spiral.

Technical Root Cause

TCP in 1986 had no congestion control. When a network link became congested, routers dropped packets. Senders' TCPs experienced timeouts and retransmitted — but retransmitted at the same rate as before (no backoff). The retransmissions increased load, caused more drops, more timeouts, more retransmissions — the network locked in a positive feedback loop approaching zero useful throughput.

The collapse mechanism:

Step 1: Link becomes slightly congested → router drops packets
Step 2: TCP sender times out waiting for ACK
Step 3: TCP sender retransmits — at same rate (no AIMD knowledge)
Step 4: Multiple senders do the same
Step 5: Link now extremely congested → more drops
Step 6: All TCP senders retransmitting as fast as possible
Step 7: Link 100% utilized, almost entirely by retransmissions
         Goodput (useful data delivered) approaches zero
Step 8: "Congestion collapse"

The network was stuck: every additional retransmission added load that caused more drops, which triggered more retransmissions.

Van Jacobson's Fix (1988)

Van Jacobson (LBL) analyzed the collapse and published "Congestion Avoidance and Control" (ACM SIGCOMM 1988) — one of the most cited papers in computer science. He implemented the fix in 4.3BSD Tahoe TCP.

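The four algorithms (slow start, congestion avoidance, and fast retransmit shipped in 4.3BSD Tahoe; fast recovery was added in 4.3BSD Reno in 1990):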

1. Slow Start:

Begin connection with cwnd = 1 MSS (not the full window)
For each ACK received: cwnd += 1 MSS
Window doubles each RTT until ssthresh reached
Prevents initial burst from overwhelming network

2. Congestion Avoidance (AIMD — Additive Increase, Multiplicative Decrease):

Above ssthresh:
  For each ACK: cwnd += 1/cwnd (cwnd measured in MSS; increases by ~1 MSS per RTT)
  Linear increase — probing for available bandwidth slowly

On loss (timeout or triple duplicate ACK):
  ssthresh = cwnd / 2
  cwnd = 1 (timeout) or cwnd = ssthresh (fast recovery)
  Multiplicative decrease — fast reaction to congestion signal

3. Fast Retransmit:

Receive 3 duplicate ACKs → retransmit lost segment immediately
Don't wait for the retransmission timeout (RTO), which can be 1-3 seconds

4. Fast Recovery:

After fast retransmit:
  ssthresh = cwnd / 2
  cwnd = ssthresh (not 1 — still have useful segments in flight)
  Continue sending without re-entering slow start
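
The four mechanisms above can be combined into a small per-RTT simulation. This is a deliberately simplified sketch: cwnd is measured in MSS, each loop iteration is one RTT, and loss events are supplied by the caller rather than derived from a queue model. The function name `simulate_cwnd` and its parameters are assumptions for illustration.

```python
def simulate_cwnd(rtts, dupack_loss_rtts=(), timeout_rtts=(), initial_ssthresh=16.0):
    """Return cwnd (in MSS) at the start of each RTT.

    dupack_loss_rtts: RTT indices where loss is detected by triple duplicate ACK.
    timeout_rtts:     RTT indices where loss is detected by retransmission timeout.
    """
    cwnd, ssthresh = 1.0, initial_ssthresh
    trace = []
    for t in range(rtts):
        trace.append(cwnd)
        if t in timeout_rtts:
            ssthresh = max(cwnd / 2, 2.0)
            cwnd = 1.0                       # timeout: restart from slow start
        elif t in dupack_loss_rtts:
            ssthresh = max(cwnd / 2, 2.0)
            cwnd = ssthresh                  # fast recovery: skip slow start
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)   # slow start: double per RTT
        else:
            cwnd += 1.0                      # congestion avoidance: +1 MSS per RTT
    return trace


trace = simulate_cwnd(12, dupack_loss_rtts={8})
print(trace)
```

Plotting `trace` for a longer run with periodic losses gives the familiar sawtooth: exponential growth up to ssthresh, linear growth after it, and a halving at each loss.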

The AIMD property provides global stability: when all TCP senders independently increase linearly and decrease multiplicatively, they converge to fair bandwidth sharing without central coordination.
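
The convergence claim can be checked numerically with a toy model of two flows sharing one link. This sketch assumes synchronized feedback (both flows see congestion in the same round), which is a simplification; the function name `aimd_two_flows` and the parameter values are illustrative.

```python
def aimd_two_flows(x1, x2, capacity, rounds, increase=1.0, decrease=0.5):
    """Evolve two flow rates under AIMD with a shared congestion signal."""
    for _ in range(rounds):
        if x1 + x2 > capacity:
            # Multiplicative decrease: the gap between the flows shrinks by the same factor.
            x1 *= decrease
            x2 *= decrease
        else:
            # Additive increase: the gap between the flows is unchanged.
            x1 += increase
            x2 += increase
    return x1, x2


# Start far from fair: one flow has almost nothing, the other most of the link.
a, b = aimd_two_flows(x1=1.0, x2=80.0, capacity=100.0, rounds=1000)
print(round(a, 3), round(b, 3))
```

Additive increase preserves the difference between the two rates while every multiplicative decrease halves it, so the difference decays toward zero and the flows end up with equal shares.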

Lasting Architectural Changes

  • TCP's congestion control is now standardized in RFC 2581 (1999), later replaced by RFC 5681 (2009)
  • Variants: Reno (1990), Vegas (1994), CUBIC (Linux default since 2.6.19), BBR (Google, 2016 — production at YouTube and cloud)
  • BBR (Bottleneck Bandwidth and RTT): Google's 2016 replacement for CUBIC. Instead of using packet loss as a congestion signal, BBR models the network path's bottleneck bandwidth and RTT, achieving higher utilization on high-bandwidth-delay-product paths and more stable performance on lossy links.

Case Study 3: Amazon DynamoDB Availability Event (2015)

What Happened

On September 20, 2015, Amazon DynamoDB in the US-East region experienced a significant availability event lasting roughly five hours. DynamoDB is one of AWS's most critical services and is used internally by many other AWS services. According to AWS's post-incident summary, a brief network disruption caused the internal metadata service to become overloaded, which cascaded to affect DynamoDB availability.

Technical Root Cause

DynamoDB's architecture at the time included a metadata service used for partition management. Storage servers periodically asked it for their "membership" (the set of table partitions each server is responsible for), and request routing relied on the same data to reach the correct storage nodes for a given partition key.

The event was triggered by a brief network disruption that cut off a portion of the storage servers from the metadata service. When connectivity returned, those servers all requested fresh membership data at once. Adoption of Global Secondary Indexes had increased the size of the membership data for many tables, so responses were already close to the retrieval time limit. Under the burst of load, requests timed out and were retried, which kept the metadata service overloaded.

Normal operation:
  Storage server → periodic membership request → metadata service answers in time
  DynamoDB request → routed to storage node → served
  Metadata service load: low, with headroom

During the event:
  Brief network disruption → many storage servers miss their membership renewal
  Network recovers → those servers request membership simultaneously
  Large membership responses (more partitions per table because of
    Global Secondary Indexes) → processing time exceeds the retrieval limit
  Each timeout → retry against the metadata service
  Metadata service: saturated, too busy to accept added capacity
  Storage servers without confirmed membership remove themselves from service

Cascade:
  DynamoDB requests to affected partitions fail
  Errors and timeouts propagated to DynamoDB callers
  AWS services using DynamoDB: SQS, EC2 Auto Scaling, CloudWatch
  Those services begin returning errors or degraded responses
  Services depending on those services degrade further

Blast Radius

DynamoDB unavailability or degradation affected:

  • Directly: applications using DynamoDB
  • Indirectly: AWS services with DynamoDB dependencies (including SQS, EC2 Auto Scaling, and CloudWatch metrics)
  • Duration: roughly five hours of impaired service

Fix and Architectural Lesson

Immediate fix: Rate limiting and backoff at the metadata service, emergency capacity scaling.
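
The client side of the backoff mentioned above is commonly implemented as capped exponential backoff with random jitter, so that callers that failed together do not retry together. The sketch below shows the "full jitter" variant; it is an illustration of the general technique, not AWS's internal implementation, and the function name and default values are assumptions.

```python
import random


def backoff_delays(attempts, base_s=0.1, cap_s=10.0, rng=None):
    """Return one randomized delay (seconds) per retry attempt.

    The ceiling doubles with each attempt up to cap_s, and the actual delay
    is drawn uniformly from [0, ceiling] so that clients spread out.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# A seeded generator makes the example reproducible.
delays = backoff_delays(8, rng=random.Random(1))
print([round(d, 3) for d in delays])
```

Without the jitter, every client that timed out at the same moment would retry at the same moments too, reproducing the original spike at each backoff step.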

Architectural lesson — Cell-based architecture:

This event (and the 2011 EBS event) taught AWS the danger of "shared fate" components — services where a single failure domain can affect all customers and all dependent services simultaneously.

The response was the widespread adoption of cell-based architecture at AWS:

Pre-cell:
  DynamoDB Metadata Service (single global cluster)
  Failure: all DynamoDB customers affected

Cell-based:
  DynamoDB Cell 1 (serves customers A-F)
  DynamoDB Cell 2 (serves customers G-M)
  DynamoDB Cell 3 (serves customers N-Z)

  Each cell is fully independent: separate metadata, storage, networking
  A hot customer in Cell 1 affects only Cell 1
  No cross-cell dependencies for data path

The cell-based architecture limits the blast radius of any single failure: a bad traffic pattern from one customer can only affect their cell, not the global service. AWS has applied this pattern broadly (DynamoDB, Kinesis, Route 53, and others).
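
A minimal sketch of the routing layer implied by this design: each customer is mapped deterministically to one cell, so a cell failure affects only the customers assigned to it. Hash-based assignment is an assumption chosen for brevity (the illustration above uses alphabetical ranges, and production systems typically keep an explicit mapping so customers can be moved); the function names are hypothetical.

```python
import hashlib


def cell_for_customer(customer_id, num_cells):
    """Map a customer to a cell index with a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_cells


def affected_customers(customers, failed_cell, num_cells):
    """Return the customers whose cell has failed; all others are unaffected."""
    return [c for c in customers if cell_for_customer(c, num_cells) == failed_cell]


customers = [f"customer-{i}" for i in range(1000)]
num_cells = 4
impacted = affected_customers(customers, failed_cell=0, num_cells=num_cells)
print(f"{len(impacted)} of {len(customers)} customers are in the failed cell")
```

With four cells, a single cell failure touches roughly a quarter of the customers rather than all of them; more cells shrink the blast radius further at the cost of more operational overhead.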

Reference: Colm MacCárthaigh's "Shuffle Sharding: Massive and Magical Fault Isolation" post on the AWS Architecture Blog (2014) publicly described a related fault-isolation technique; the 2015 event confirmed the necessity.


Case Study 4: AWS S3 us-east-1 Outage (February 28, 2017)

What Happened

On February 28, 2017, Amazon S3 in the us-east-1 region experienced a major outage lasting approximately 4 hours. The cause was human error amplified by an overly-broad automation script, combined with tight coupling between S3 subsystems.

Technical Root Cause

The S3 team was debugging an issue that made the S3 billing system progress more slowly than expected. The servers involved also supported two other S3 subsystems: the "index" subsystem, which manages the metadata and location information of all S3 objects in the region, and the "placement" subsystem, which allocates storage for new objects and depends on the index subsystem.

Following an established playbook, an authorized engineer ran a command intended to remove a small number of servers from one of the subsystems used by the billing process. The command was a well-known operational command. However, one of its inputs was entered incorrectly:

Intended command (illustrative syntax, not the real tool):
  remove_servers --subsystem <billing-related subsystem> --count 10
  (remove a small number of servers)

Actual command (paraphrased):
  remove_servers --subsystem <billing-related subsystem> --count <much larger number>
  (remove a much larger set of servers, including index and placement capacity)

The automation executed the command as specified:
  → Removed a very large number of servers
  → Index and placement subsystem capacity reduced drastically
  → Both subsystems required a full restart
  → GET, LIST, PUT, and DELETE requests failed while the index subsystem restarted
  → Cascade: other us-east-1 services that depended on S3 became unavailable

Why recovery took 4 hours:

S3's index subsystem had not been fully restarted at this scale in years. The restart process — bringing index servers back online, rebuilding their in-memory state, and re-synchronizing — took far longer than expected:

  • Restarted index servers had to run safety checks validating the integrity of their metadata before serving traffic
  • At S3's scale (trillions of objects), this validation takes hours
  • Attempting to restart all servers simultaneously would itself cause a different overload (thundering herd on data access)
  • AWS engineers had to carefully stagger the restart, monitoring for stability at each step

The AWS status dashboard could not be updated:

The AWS Service Health Dashboard (status.aws.amazon.com) depended on S3 in us-east-1: its administration console could not update the status of individual services while S3 was down. As a result, the dashboard kept showing "Service is operating normally" long after the outage started, and AWS fell back to its Twitter feed and a banner on the dashboard to communicate status. Engineers monitoring AWS's status had to use third-party sources (DownDetector, Twitter). This caused significant confusion.

Architectural Lessons

  1. Operational commands with large blast radius must have safeguards. The command that removed too many servers should have: (a) required an explicit upper-bound percentage parameter, (b) had an automatic "too many servers" guard, (c) required confirmation when the count exceeded a safe threshold.
  2. Restart/recovery procedures must be tested at scale. The multi-hour restart was a surprise because the recovery path had never been executed at full scale. Post-incident, AWS ran periodic drills of large-scale subsystem restarts.
  3. Status infrastructure must not depend on the service it monitors. The AWS status dashboard depending on S3 was a critical self-referential dependency. Post-incident, AWS changed the dashboard's administration console to run across multiple regions.
  4. Typos are inevitable. Operational tooling must be defensive about the scale of destructive operations.
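
A guard of the kind described in lessons 1 and 4 can be very small. The sketch below is hypothetical tooling, not AWS's implementation: the function name, the exception type, and the thresholds are assumptions. It refuses a removal that exceeds a per-operation batch limit or that would take the fleet below its minimum required capacity.

```python
class UnsafeRemovalError(Exception):
    """Raised when a capacity removal would exceed configured safety limits."""


def check_removal(fleet_size, requested, min_required, max_batch):
    """Validate a removal request and return the fleet size after removal."""
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    if requested > max_batch:
        raise UnsafeRemovalError(
            f"refusing to remove {requested} servers at once (limit {max_batch})"
        )
    if fleet_size - requested < min_required:
        raise UnsafeRemovalError(
            f"removal would leave {fleet_size - requested} servers, "
            f"below the minimum of {min_required}"
        )
    return fleet_size - requested


# Intended operation: a small removal passes.
remaining = check_removal(fleet_size=1000, requested=10, min_required=800, max_batch=25)
print(remaining)

# Mistyped input: rejected before any server is touched.
try:
    check_removal(fleet_size=1000, requested=100, min_required=800, max_batch=25)
except UnsafeRemovalError as err:
    print("blocked:", err)
```

The two checks are independent on purpose: the batch limit slows removal down so mistakes surface early, while the capacity floor holds even when many small removals add up.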

Reference: AWS post-mortem "Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region." aws.amazon.com/message/41926.


Case Study 5: TCP Congestion and the Modern BBR Story

(Supplementary technical case — modern relevance)

Bufferbloat and the Limits of CUBIC

In the 2010s, a phenomenon called "bufferbloat" was identified by Jim Gettys (Bell Labs): ISP and home router buffers had grown enormous (megabytes), causing TCP to fill these buffers. A full buffer introduces seconds of queuing delay for interactive traffic while bulk transfers run. CUBIC (the default TCP congestion control since Linux 2.6.19) reacts to loss — but with large buffers, packets are queued for seconds before being dropped. Latency spikes to 1-5 seconds during congestion became common on cable modem and DSL links.

This is a distributed system failure mode: the congestion control algorithm (designed for the 1990s Internet with small buffers) interacted badly with the 2010s Internet's oversized buffers.
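
The latency figures follow from simple arithmetic: a full buffer of B bytes in front of a link of R bits per second adds 8B/R seconds of queuing delay. The sketch below works one example; the buffer size and link rate are assumed values chosen for illustration, not measurements from the text.

```python
def queue_delay_s(buffer_bytes, link_bits_per_s):
    """Queuing delay added by a full FIFO buffer draining at the link rate."""
    return buffer_bytes * 8 / link_bits_per_s


# Assumed example: a 256 KiB buffer in front of a 1 Mbit/s uplink.
delay = queue_delay_s(buffer_bytes=256 * 1024, link_bits_per_s=1_000_000)
print(f"{delay:.2f} s of added latency when the buffer is full")
```

A loss-based sender keeps growing its window until that buffer overflows, so the buffer sits near full for the duration of a bulk transfer and every interactive packet waits behind it.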

BBR's architectural response:

Google's BBR (Bottleneck Bandwidth and RTT) congestion control (2016) explicitly models network state rather than inferring it from packet loss:

CUBIC model: "Packet loss = congestion → reduce cwnd"
BBR model:   "Measure RTT and delivery rate continuously
              Estimate bottleneck bandwidth: BtlBw
              Estimate minimum RTT: RTprop
              Target: cwnd = BtlBw × RTprop (the BDP — bandwidth-delay product)
              This fills the pipe without overfilling buffers"
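
As a worked example of the BDP target, the sketch below converts a bottleneck bandwidth and a round-trip propagation time into bytes and packets in flight. The input values and the 1460-byte MSS are assumptions for illustration. BBR itself estimates BtlBw and RTprop from ongoing measurements and applies gain factors on top of the BDP, which this sketch does not model.

```python
import math


def bdp_bytes(btlbw_bits_per_s, rtprop_s):
    """Bandwidth-delay product: the data needed in flight to keep the pipe full."""
    return btlbw_bits_per_s * rtprop_s / 8


def bdp_packets(btlbw_bits_per_s, rtprop_s, mss_bytes=1460):
    """BDP expressed as a whole number of MSS-sized packets, rounded up."""
    return math.ceil(bdp_bytes(btlbw_bits_per_s, rtprop_s) / mss_bytes)


# Assumed path: 100 Mbit/s bottleneck, 40 ms round-trip propagation time.
print(bdp_bytes(100_000_000, 0.040))
print(bdp_packets(100_000_000, 0.040))
```

Any data in flight beyond the BDP cannot increase throughput; it can only sit in a queue at the bottleneck, which is exactly the standing queue that bufferbloat describes.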

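Google reported that deploying BBR at YouTube improved network throughput by about 4% on average and reduced round-trip time by about 33%. Google Cloud announced in 2017 that it uses BBR for traffic from its services. BBR (v1) has been in the mainline Linux kernel since 4.9; BBRv2 was published as an out-of-tree development branch rather than merged into mainline.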


ASCII Diagram: BGP Route Propagation (YouTube Hijack)

                    INTERNET BGP MESH

YouTube AS36561                       Pakistan Telecom AS17557
announces:                            announces:
208.65.152.0/22 ──────────────────→  [208.65.153.0/24] (hijack)
                                               │
                                               │ to PCCW AS3491
                                               ▼
                                   PCCW AS3491 accepts and propagates
                                        [208.65.153.0/24]
                                               │
                                    ┌──────────┼──────────┐
                                    ▼          ▼          ▼
                               AS1234       AS5678    AS9012
                              (accepts)   (accepts)  (accepts)
                                    │          │          │
                                    └──────────┴──────────┘
                                              │
                                    Global Internet prefers /24
                                    (longer prefix = more specific)
                                    All traffic to 208.65.153.x
                                    → Pakistan Telecom → blackhole

RPKI solution:
  YouTube publishes ROA: "AS36561 may announce 208.65.152.0/22"
  Pakistan Telecom announces 208.65.153.0/24 (not in ROA)
  RPKI-validating routers: REJECT — no valid ROA for this origin
  Hijack blocked at RPKI-enabled peers

Debugging Notes

# Check BGP route table (on a BGP router or via looking glass)
show ip bgp 208.65.153.0/24  # Cisco IOS
show route 208.65.153.0/24 detail  # Juniper

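# IRR and RPKI checks
bgpq4 -A AS36561  # query IRR for AS36561's prefixes (aggregated prefix list)
routinator validate --asn 17557 --prefix 208.65.153.0/24  # RPKI origin validation (assumes Routinator is installed and synced)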

# Check TCP congestion state
ss -i  # Linux socket statistics with congestion info
cat /proc/net/sockstat  # socket statistics
sysctl net.ipv4.tcp_congestion_control  # current algorithm

# Set BBR
sysctl -w net.core.default_qdisc=fq  # fq pacing is recommended for BBR (required before Linux 4.13)
sysctl -w net.ipv4.tcp_congestion_control=bbr

# DNS diagnostics (for BGP withdrawal scenario)
dig @8.8.8.8 facebook.com  # external resolver
dig @a.ns.facebook.com facebook.com  # direct to FB authoritative (would fail during outage)

# List established connections that are using BBR (-i prints the congestion control algorithm)
ss --tcp --info state established | grep bbr

Security Implications

  • BGP hijacking can redirect traffic for surveillance, MITM attacks, or service disruption. RPKI is the mitigation but is not universally deployed.
  • DynamoDB metadata overload shows that a shared control component is a potential DoS vector: any load pattern that amplifies into metadata requests can overload the metadata service and affect all customers.
  • Facebook BGP withdrawal demonstrates that BGP configuration errors can be weaponized as denial of service; access to backbone router configuration must be tightly controlled.
  • TCP congestion collapse can be induced by senders that ignore congestion signals: a single unresponsive high-bandwidth flow on a shared link can starve well-behaved flows, which makes it a denial-of-service vector.

Performance Implications

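  • BGP convergence after a route withdrawal typically takes tens of seconds to a few minutes globally (the MRAI timer suggested by RFC 4271 is 30s for eBGP, and withdrawals trigger path exploration)
  • TCP slow start limits initial throughput; BBR's Startup phase also ramps up exponentially, but paces packets instead of sending them in bursts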
  • DNS TTL caching can cause 5-30 minute propagation delays for DNS changes — a feature (reduces load) and a bug (slow recovery from outages)
  • Cell-based architectures add routing overhead for cross-cell requests but eliminate shared fate

Failure Modes

| Failure | Trigger | Detection | Recovery Time |
|---|---|---|---|
| BGP hijack | More-specific route advertisement | NOC route monitoring, RPKI | 2 hours (YouTube) |
| TCP congestion collapse | No congestion control + packet loss | Throughput → 0 | Operator intervention |
| DynamoDB metadata overload | Request surge on the shared metadata service | DynamoDB error rate spike | Hours |
| S3 command typo | Overly-broad remove command | S3 error rate → 100% | 4 hours |
| Facebook BGP withdrawal | Misconfigured backbone router | Global DNS failure | 6 hours (physical access) |

Modern Usage

  • RPKI (Resource Public Key Infrastructure): deployed by Cloudflare, AT&T, Comcast, and many others. ~40% of BGP prefixes have valid ROAs as of 2024.
  • MANRS (Mutually Agreed Norms for Routing Security): Industry initiative for BGP operational security — filtering, anti-spoofing, coordination.
  • BBR: v1 has been in the mainline Linux kernel since 4.9; BBRv2 is deployed in Google's production infrastructure and published as an out-of-tree kernel branch.
  • AWS Cell-based architecture: Standard for all new AWS service designs post-2015.
  • BCP38 (Network Ingress Filtering): Prevents IP spoofing — still not universally deployed.

Future Directions

  • BGPsec: Cryptographic signing of the full AS path (not just origin) — specified in RFC 8205 but deployment is minimal due to performance overhead
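  • QUIC congestion control: QUIC (the transport under HTTP/3) implements congestion control in user space (RFC 9002 describes a NewReno-style algorithm; implementations commonly offer CUBIC or BBR), allowing faster iteration without kernel changes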
  • P4-programmable network devices: Allow custom BGP processing and faster mitigation of routing anomalies
  • Service mesh security: mTLS between all microservices — a distributed analogue of the Facebook problem where internal services should not depend on external network reachability for internal communication

Exercises

  1. Simulate TCP congestion collapse: use tc netem to add packet loss to a loopback interface. Run a TCP throughput test with iperf3. Observe how different congestion control algorithms (Reno, CUBIC, BBR) behave. Compare throughput and RTT under 1%, 5%, and 10% loss.

  2. Look up the current RPKI state of a well-known prefix using a looking glass (routeviews.org or lg.he.net). Check if the prefix has a valid ROA. Look up the ROA at ARIN/RIPE RPKI repositories.

  3. Reproduce a BGP scenario using a BGP simulator (GNS3 + FRRouting or Containerlab). Create two ASes. Have AS2 advertise a more-specific prefix for AS1's address space. Observe that traffic is hijacked. Then deploy RPKI filtering and observe the hijack being blocked.

  4. Design a cell-based architecture for a hypothetical key-value store. Define: how many cells, what data goes in each cell, what cross-cell operations are required, and how you handle a cell-level failure without affecting other cells.

  5. Read Van Jacobson's "Congestion Avoidance and Control" (1988). Implement slow start and AIMD in a Python simulation. Graph cwnd over time for a connection experiencing periodic packet loss. Verify the sawtooth pattern characteristic of TCP congestion control.

References

  • Renesys. "Pakistan Hijacks YouTube." renesys.com, February 2008. (Original BGP monitoring report)
  • Lepinski, Matt; Kent, Steve. "An Infrastructure to Support Secure Internet Routing." RFC 6480, 2012. (RPKI specification)
  • Jacobson, Van. "Congestion Avoidance and Control." ACM SIGCOMM, 1988.
  • Cardwell, Neal et al. "BBR: Congestion-Based Congestion Control." ACM Queue, 2016.
  • AWS. "Summary of the Amazon S3 Service Disruption." aws.amazon.com/message/41926, March 2017.
  • MacCárthaigh, Colm. "Shuffle Sharding: Massive and Magical Fault Isolation." aws.amazon.com/blogs/architecture, April 2014.
  • Janardhan, Santosh. "More details about the October 4 outage." Engineering at Meta (engineering.fb.com), October 2021.
  • Gettys, Jim. "Bufferbloat: Dark Buffers in the Internet." IEEE Internet Computing, 2011.
  • Huston, Geoff. "RPKI is Coming of Age." APNIC Blog, various 2019-2024.