Skip to content

Section 39: Large-Scale Case Studies

Purpose and Scope

Case studies of real production systems reveal truths that no theoretical treatment can provide: the design decisions forced by actual workloads, the engineering tradeoffs that only manifest at scale, and the failures that reshape architecture. This section presents deep technical analysis of the foundational systems built by Google, Meta, Amazon, Netflix, and Cloudflare, alongside internals of the critical infrastructure software — Kafka, Redis, PostgreSQL, Cassandra, ClickHouse, Envoy, and NGINX — that forms the backbone of modern distributed systems. Each case study focuses on the architectural decisions, the problems they solved, and the lessons applicable to systems you will build.


Prerequisites

  • Distributed systems (Section 17): consensus, replication, consistency
  • Database internals (Section 18): storage engines, transactions, indexing
  • Networking (Section 15, 16): TCP, HTTP, anycast, BGP
  • System design fundamentals (Section 38): scalability patterns, caching, sharding
  • Linux and kernel fundamentals (Section 03, 04)

Learning Objectives

By the end of this section, you will be able to:

  1. Describe Google Borg's scheduling model and explain how it influenced Kubernetes
  2. Explain Spanner's TrueTime mechanism and why it enables external consistency
  3. Describe Bigtable's data model, SSTable/MemTable storage engine, and compaction
  4. Explain DynamoDB's partition model, consistent hashing, and the evolution to single-table design
  5. Describe Kafka's log-based architecture, partition leadership, and consumer group mechanics
  6. Explain Redis's single-threaded event loop, persistence options, and cluster sharding
  7. Describe PostgreSQL's MVCC implementation, WAL, and vacuum process
  8. Explain Cloudflare's global anycast architecture and how eBPF is used for DDoS mitigation

Architecture Overview

Google's Systems Stack

  Applications (Search, Ads, Gmail, YouTube)
       |
  +----v-----------------------+
  |  Borg (2004) / Omega /     |  Cluster management, task scheduling
  |  Kubernetes (2014)         |  Bin packing, resource quotas, health checks
  +----+-------+---------------+
       |       |
  +----v-+   +-v---------+
  |Colossus|  |  Bigtable |  (2006)  Petabyte-scale NoSQL
  |(GFS v2)|  |  HBase    |          Tablet-based key-value
  |Chunk   |  +-----------+
  |servers |
  +----+---+   +-----------+
       |       |  Spanner  |  (2012)  Globally distributed SQL
       |       |  TrueTime |          External consistency
       |       +-----------+
       |
  +----v-----------+
  | Chubby (2006)  |  Distributed lock service, configuration
  | (Paxos-based)  |  Used by GFS master, Bigtable master
  +----------------+

  +------------------+
  | Dapper (2010)    |  Distributed tracing system
  | Trace spans      |  Instrumented in all RPC stubs
  +------------------+

  +------------------+
  | MapReduce (2004) |  Batch processing; data transforms at PB scale
  | -> Flume         |  Replaced by Dataflow/Beam for streaming
  +------------------+

Spanner's TrueTime

  Problem: Distributed transactions require total order.
  Global clock synchronization to ~7 nanoseconds is impossible.

  TrueTime API:
  TT.now()  returns [earliest, latest] — an interval, not a point
  TT.after(t) returns true if t < TT.now().earliest
  TT.before(t) returns true if t > TT.now().latest

  GPS + Atomic clocks per datacenter -> bounded uncertainty (~7ms max)

  Commit Wait:
  Transaction coordinator assigns timestamp ts = TT.now().latest
  Waits until TT.after(ts) is true (i.e., ts is definitely in the past)
  Only then releases locks and returns to client

  Result: any transaction that starts after this commit's timestamp
  observes its effects. External consistency without global lock.

  TrueTime uncertainty: ~1-7ms typical -> commit wait = ~1-7ms
  Cost: latency. Benefit: true serializable external consistency at global scale.

Kafka's Log Architecture

  Topic: "user-events", 6 Partitions

  Partition 0 (Leader: Broker 1):
  +----+----+----+----+----+----+
  |msg0|msg1|msg2|msg3|msg4|msg5|  -> offset log, append-only
  +----+----+----+----+----+----+
       |              |
  Broker 2 (replica)  Consumer Group A, offset=3

  Producer:
  Partitioner(key) -> partition number
  Append to leader's segment file (sendfile() -> zero-copy)

  Consumer Group:
  N consumers, M partitions: each partition assigned to exactly 1 consumer
  Offset stored in __consumer_offsets topic (committed after processing)
  Group coordinator: a broker manages assignment via group protocol
  Rebalance triggered on member join/leave

  Storage:
  Segment files (.log) + index files (.index, .timeindex) per partition
  Log compaction: retain latest value per key (not time-based deletion)
  Log retention: time-based (7 days) or size-based

  Key guarantee: order within a partition, not across partitions
  Throughput: millions of messages/sec per broker

PostgreSQL MVCC and WAL

  MVCC (Multi-Version Concurrency Control):
  Each row has: xmin (created by txn ID), xmax (deleted by txn ID)

  INSERT: row with xmin=current_txn, xmax=0
  UPDATE: mark old row xmax=current_txn, insert new row with xmin=current_txn
  DELETE: mark row xmax=current_txn

  SELECT: visibility check: xmin committed AND xmax = 0 (or not committed)

  Readers never block writers; writers never block readers.

  WAL (Write-Ahead Log):
  +----------+     +----------+     +----------+
  | WAL buf  | --> | WAL file | --> | Replica  |
  | (shared) |     | (pg_wal) |     | (stream) |
  +----------+     +----------+     +----------+
       |
  Flush to WAL before committing -> durability
  Pages written to heap after WAL -> buffer pool can absorb writes

  Vacuum: reclaims dead tuples (old versions no longer visible to any txn)
  autovacuum daemon: triggered by dead tuple threshold
  Without vacuum: table bloat, transaction ID wraparound (XID exhaustion)

  Buffer pool:
  8KB pages in shared_buffers (RAM); dirty pages written to disk
  checkpoint: periodic flush of all dirty pages; WAL truncated after

Cloudflare Global Anycast

  Client: Sydney, Australia
  DNS query for 1.1.1.1 -> any Cloudflare resolver
  BGP announces 1.1.1.1/32 from 285+ PoPs globally
  Sydney AS announces 1.1.1.1 -> nearest Cloudflare PoP wins

  +--------+    BGP     +----------+     IP anycast     +----------+
  |Internet|  announce  |Cloudflare|  <-- same IP -->   |Cloudflare|
  |Routers |  1.1.1.1   |Sydney PoP|                    |London PoP|
  +--------+  from both +----------+                    +----------+

  DDoS mitigation with eBPF:

  Incoming packet -> XDP (eXpress Data Path) hook in NIC driver
  eBPF program:
    1. Parse packet headers at line rate (no copy, no allocation)
    2. Check source IP against blocklist (BPF hash map)
    3. Check packet rate (per-IP counters in BPF map)
    4. Drop if matching rule: XDP_DROP (returns before kernel stack)
    5. Pass legitimate traffic: XDP_PASS

  Throughput: ~20 Mpps per core, before kernel network stack involvement
  Cloudflare has handled 3.8 Tbps DDoS attacks using this approach

Key Concepts

  • Borg / Kubernetes: Google's internal cluster management system (Borg, 2004) directly inspired Kubernetes (open-sourced 2014). Key concepts: declarative desired state, scheduler as bin-packing optimizer, health probes, preemption priorities. Borg managed millions of tasks across ~100,000 machines.
  • Bigtable: Distributed sorted map: row key -> column family:column qualifier -> timestamp -> value. Built on GFS (Colossus). Tablet servers manage row ranges; master assigns tablets. SSTable on-disk format with MemTable in-memory buffer.
  • Dynamo / DynamoDB: Amazon's always-writeable key-value store (2007 paper). Consistent hashing with virtual nodes, vector clocks for conflict detection, sloppy quorum (W+R>N not required). DynamoDB (2012) removed vector clocks, added conditional writes, single-table design patterns.
  • Netflix Chaos Engineering: Deliberately injecting failures in production to build confidence in resilience. Simian Army: Chaos Monkey (kills instances), Chaos Gorilla (kills AZs), Chaos Kong (kills regions). Moved from Netflix-specific tools to open-source Chaos Engineering principles (Principles of Chaos Engineering, 2016).
  • Redis: In-memory data structure server. Single-threaded event loop (io_threads in 6.0+). Persistence: RDB snapshots or AOF (Append-Only File). Cluster mode: 16384 hash slots, distributed with gossip protocol. Common patterns: sorted sets for leaderboards, pub/sub, Lua scripting, distributed locks (Redlock, contested).
  • Cassandra: Wide-column NoSQL database. Decentralized (no master), gossip-based membership, consistent hashing with vnodes. Data model: partition key -> clustering key -> columns. Tunable consistency (ONE, QUORUM, ALL). Compaction strategies: STCS, LCS, TWCS for time-series.
  • ClickHouse: Column-oriented OLAP database. MergeTree storage engine: data sorted by primary key, stored in column files. Parts written atomically; background merges. Vectorized query execution (SIMD). Achieves billions of rows per second on analytical queries.
  • Envoy Proxy: L7 proxy written in C++ by Lyft (open-sourced 2016). xDS API for dynamic configuration (no reload). Filter chains for request processing (rate limit, auth, retry, circuit breaker). Used as sidecar in Istio/Linkerd service meshes.

Major Historical Milestones

Year Milestone
2003 Google GFS paper — distributed filesystem for large files
2004 Google MapReduce paper — batch compute at warehouse scale
2006 Google Bigtable paper — NoSQL at petabyte scale
2006 Amazon Dynamo paper — AP eventually-consistent key-value store
2006 Hadoop 0.1 — open-source MapReduce + HDFS (Yahoo)
2007 Memcached at Facebook — cache-aside at scale; Leslie Halton
2009 Netflix on AWS — cloud-native architecture begins
2010 Google Colossus — GFS successor; no single master
2010 Google Dapper paper — distributed tracing
2011 Kafka 0.7 — LinkedIn open-sources distributed log
2012 Google Spanner paper — globally distributed SQL, TrueTime
2012 Amazon DynamoDB — Dynamo principles as managed service
2012 Redis Cluster — horizontal scaling for Redis
2013 Netflix Simian Army — chaos engineering open-sourced
2014 Google Kubernetes open-sourced — Borg principles for everyone
2015 ClickHouse internal at Yandex (open-sourced 2016)
2016 Envoy Proxy — Lyft's L7 proxy open-sourced
2016 Meta TAO paper — distributed object/association cache
2017 Cloudflare Workers — serverless at edge, V8 isolates
2018 Cloudflare adopts eBPF/XDP for DDoS at 1 Tbps+
2019 Amazon Aurora Global Database — <1 second RPO across regions
2019 Meta Tupperware — containerized workload management
2020 Kafka 2.8 KRaft mode — removes ZooKeeper dependency
2022 ClickHouse Cloud — ClickHouse Inc. managed service
2023 Amazon DynamoDB Standard-IA — tiered storage

Modern Relevance

These systems are not historical artifacts — they are active production systems serving billions of users. Kafka is the de facto standard for event streaming; understanding its guarantees prevents subtle bugs in distributed applications. PostgreSQL is the dominant open-source relational database; its MVCC internals explain why long-running transactions cause table bloat. Redis is in virtually every production stack as a cache or pub/sub bus. Envoy is the universal data plane for service meshes and API gateways. Studying these systems reveals how theoretical concepts from distributed systems, database internals, and networking combine in real engineering. The papers behind these systems (Bigtable, Dynamo, Spanner, MapReduce, Raft) are the canon of systems literature that every infrastructure engineer should have read.


File Map

39-large-scale-case-studies/
├── 00-overview.md              <- This file
├── 01-google-borg-and-kubernetes.md
├── 02-google-spanner.md
├── 03-google-colossus.md
├── 04-google-bigtable.md
├── 05-google-mapreduce.md
├── 06-google-chubby.md
├── 07-google-dapper.md
├── 08-meta-linux-at-scale.md
├── 09-meta-tupperware.md
├── 10-meta-thrift.md
├── 11-meta-tao.md
├── 12-aws-nitro.md
├── 13-aws-dynamodb.md
├── 14-aws-aurora.md
├── 15-aws-s3.md
├── 16-netflix-chaos-engineering.md
├── 17-netflix-evcache.md
├── 18-netflix-zuul.md
├── 19-cloudflare-anycast.md
├── 20-cloudflare-ebpf.md
├── 21-cloudflare-workers.md
├── 22-kafka-internals.md
├── 23-redis-internals.md
├── 24-postgresql-internals.md
├── 25-cassandra-architecture.md
├── 26-clickhouse-internals.md
├── 27-envoy-proxy.md
└── 28-nginx-internals.md

Cross-References

  • Section 17 (Distributed Systems): Consensus, consistency models — Spanner TrueTime, Dynamo quorums
  • Section 18 (Database Internals): PostgreSQL MVCC, Cassandra LSM trees, ClickHouse columnar storage
  • Section 22 (Kubernetes): Direct descendant of Google Borg architecture
  • Section 23 (Observability): Google Dapper's influence on OpenTelemetry distributed tracing
  • Section 38 (System Design): These case studies are the real implementations of Section 38 patterns
  • Section 32 (AI Infrastructure): Meta AI clusters, Google TPU infrastructure, AWS AI chips (Trainium, Inferentia)