Section 39: Large-Scale Case Studies
Purpose and Scope
Case studies of real production systems reveal truths that no theoretical treatment can provide: the design decisions forced by actual workloads, the engineering tradeoffs that only manifest at scale, and the failures that reshape architecture. This section presents deep technical analysis of the foundational systems built by Google, Meta, Amazon, Netflix, and Cloudflare, alongside internals of the critical infrastructure software — Kafka, Redis, PostgreSQL, Cassandra, ClickHouse, Envoy, and NGINX — that forms the backbone of modern distributed systems. Each case study focuses on the architectural decisions, the problems they solved, and the lessons applicable to systems you will build.
Prerequisites
- Distributed systems (Section 17): consensus, replication, consistency
- Database internals (Section 18): storage engines, transactions, indexing
- Networking (Section 15, 16): TCP, HTTP, anycast, BGP
- System design fundamentals (Section 38): scalability patterns, caching, sharding
- Linux and kernel fundamentals (Section 03, 04)
Learning Objectives
By the end of this section, you will be able to:
- Describe Google Borg's scheduling model and explain how it influenced Kubernetes
- Explain Spanner's TrueTime mechanism and why it enables external consistency
- Describe Bigtable's data model, SSTable/MemTable storage engine, and compaction
- Explain DynamoDB's partition model, consistent hashing, and the evolution to single-table design
- Describe Kafka's log-based architecture, partition leadership, and consumer group mechanics
- Explain Redis's single-threaded event loop, persistence options, and cluster sharding
- Describe PostgreSQL's MVCC implementation, WAL, and vacuum process
- Explain Cloudflare's global anycast architecture and how eBPF is used for DDoS mitigation
Architecture Overview
Google's Systems Stack
Applications (Search, Ads, Gmail, YouTube)
|
+----v-----------------------+
| Borg (2004) / Omega / | Cluster management, task scheduling
| Kubernetes (2014) | Bin packing, resource quotas, health checks
+----+-------+---------------+
| |
+----v-+ +-v---------+
|Colossus| | Bigtable | (2006) Petabyte-scale NoSQL
|(GFS v2)| | HBase | Tablet-based key-value
|Chunk | +-----------+
|servers |
+----+---+ +-----------+
| | Spanner | (2012) Globally distributed SQL
| | TrueTime | External consistency
| +-----------+
|
+----v-----------+
| Chubby (2006) | Distributed lock service, configuration
| (Paxos-based) | Used by GFS master, Bigtable master
+----------------+
+------------------+
| Dapper (2010) | Distributed tracing system
| Trace spans | Instrumented in all RPC stubs
+------------------+
+------------------+
| MapReduce (2004) | Batch processing; data transforms at PB scale
| -> Flume | Replaced by Dataflow/Beam for streaming
+------------------+
Spanner's TrueTime
Problem: Distributed transactions require total order.
Global clock synchronization to ~7 nanoseconds is impossible.
TrueTime API:
TT.now() returns [earliest, latest] — an interval, not a point
TT.after(t) returns true if t < TT.now().earliest
TT.before(t) returns true if t > TT.now().latest
GPS + Atomic clocks per datacenter -> bounded uncertainty (~7ms max)
Commit Wait:
Transaction coordinator assigns timestamp ts = TT.now().latest
Waits until TT.after(ts) is true (i.e., ts is definitely in the past)
Only then releases locks and returns to client
Result: any transaction that starts after this commit's timestamp
observes its effects. External consistency without global lock.
TrueTime uncertainty: ~1-7ms typical -> commit wait = ~1-7ms
Cost: latency. Benefit: true serializable external consistency at global scale.
Kafka's Log Architecture
Topic: "user-events", 6 Partitions
Partition 0 (Leader: Broker 1):
+----+----+----+----+----+----+
|msg0|msg1|msg2|msg3|msg4|msg5| -> offset log, append-only
+----+----+----+----+----+----+
| |
Broker 2 (replica) Consumer Group A, offset=3
Producer:
Partitioner(key) -> partition number
Append to leader's segment file (sendfile() -> zero-copy)
Consumer Group:
N consumers, M partitions: each partition assigned to exactly 1 consumer
Offset stored in __consumer_offsets topic (committed after processing)
Group coordinator: a broker manages assignment via group protocol
Rebalance triggered on member join/leave
Storage:
Segment files (.log) + index files (.index, .timeindex) per partition
Log compaction: retain latest value per key (not time-based deletion)
Log retention: time-based (7 days) or size-based
Key guarantee: order within a partition, not across partitions
Throughput: millions of messages/sec per broker
PostgreSQL MVCC and WAL
MVCC (Multi-Version Concurrency Control):
Each row has: xmin (created by txn ID), xmax (deleted by txn ID)
INSERT: row with xmin=current_txn, xmax=0
UPDATE: mark old row xmax=current_txn, insert new row with xmin=current_txn
DELETE: mark row xmax=current_txn
SELECT: visibility check: xmin committed AND xmax = 0 (or not committed)
Readers never block writers; writers never block readers.
WAL (Write-Ahead Log):
+----------+ +----------+ +----------+
| WAL buf | --> | WAL file | --> | Replica |
| (shared) | | (pg_wal) | | (stream) |
+----------+ +----------+ +----------+
|
Flush to WAL before committing -> durability
Pages written to heap after WAL -> buffer pool can absorb writes
Vacuum: reclaims dead tuples (old versions no longer visible to any txn)
autovacuum daemon: triggered by dead tuple threshold
Without vacuum: table bloat, transaction ID wraparound (XID exhaustion)
Buffer pool:
8KB pages in shared_buffers (RAM); dirty pages written to disk
checkpoint: periodic flush of all dirty pages; WAL truncated after
Cloudflare Global Anycast
Client: Sydney, Australia
DNS query for 1.1.1.1 -> any Cloudflare resolver
BGP announces 1.1.1.1/32 from 285+ PoPs globally
Sydney AS announces 1.1.1.1 -> nearest Cloudflare PoP wins
+--------+ BGP +----------+ IP anycast +----------+
|Internet| announce |Cloudflare| <-- same IP --> |Cloudflare|
|Routers | 1.1.1.1 |Sydney PoP| |London PoP|
+--------+ from both +----------+ +----------+
DDoS mitigation with eBPF:
Incoming packet -> XDP (eXpress Data Path) hook in NIC driver
eBPF program:
1. Parse packet headers at line rate (no copy, no allocation)
2. Check source IP against blocklist (BPF hash map)
3. Check packet rate (per-IP counters in BPF map)
4. Drop if matching rule: XDP_DROP (returns before kernel stack)
5. Pass legitimate traffic: XDP_PASS
Throughput: ~20 Mpps per core, before kernel network stack involvement
Cloudflare has handled 3.8 Tbps DDoS attacks using this approach
Key Concepts
- Borg / Kubernetes: Google's internal cluster management system (Borg, 2004) directly inspired Kubernetes (open-sourced 2014). Key concepts: declarative desired state, scheduler as bin-packing optimizer, health probes, preemption priorities. Borg managed millions of tasks across ~100,000 machines.
- Bigtable: Distributed sorted map: row key -> column family:column qualifier -> timestamp -> value. Built on GFS (Colossus). Tablet servers manage row ranges; master assigns tablets. SSTable on-disk format with MemTable in-memory buffer.
- Dynamo / DynamoDB: Amazon's always-writeable key-value store (2007 paper). Consistent hashing with virtual nodes, vector clocks for conflict detection, sloppy quorum (W+R>N not required). DynamoDB (2012) removed vector clocks, added conditional writes, single-table design patterns.
- Netflix Chaos Engineering: Deliberately injecting failures in production to build confidence in resilience. Simian Army: Chaos Monkey (kills instances), Chaos Gorilla (kills AZs), Chaos Kong (kills regions). Moved from Netflix-specific tools to open-source Chaos Engineering principles (Principles of Chaos Engineering, 2016).
- Redis: In-memory data structure server. Single-threaded event loop (io_threads in 6.0+). Persistence: RDB snapshots or AOF (Append-Only File). Cluster mode: 16384 hash slots, distributed with gossip protocol. Common patterns: sorted sets for leaderboards, pub/sub, Lua scripting, distributed locks (Redlock, contested).
- Cassandra: Wide-column NoSQL database. Decentralized (no master), gossip-based membership, consistent hashing with vnodes. Data model: partition key -> clustering key -> columns. Tunable consistency (ONE, QUORUM, ALL). Compaction strategies: STCS, LCS, TWCS for time-series.
- ClickHouse: Column-oriented OLAP database. MergeTree storage engine: data sorted by primary key, stored in column files. Parts written atomically; background merges. Vectorized query execution (SIMD). Achieves billions of rows per second on analytical queries.
- Envoy Proxy: L7 proxy written in C++ by Lyft (open-sourced 2016). xDS API for dynamic configuration (no reload). Filter chains for request processing (rate limit, auth, retry, circuit breaker). Used as sidecar in Istio/Linkerd service meshes.
Major Historical Milestones
| Year | Milestone |
|---|---|
| 2003 | Google GFS paper — distributed filesystem for large files |
| 2004 | Google MapReduce paper — batch compute at warehouse scale |
| 2006 | Google Bigtable paper — NoSQL at petabyte scale |
| 2006 | Amazon Dynamo paper — AP eventually-consistent key-value store |
| 2006 | Hadoop 0.1 — open-source MapReduce + HDFS (Yahoo) |
| 2007 | Memcached at Facebook — cache-aside at scale; Leslie Halton |
| 2009 | Netflix on AWS — cloud-native architecture begins |
| 2010 | Google Colossus — GFS successor; no single master |
| 2010 | Google Dapper paper — distributed tracing |
| 2011 | Kafka 0.7 — LinkedIn open-sources distributed log |
| 2012 | Google Spanner paper — globally distributed SQL, TrueTime |
| 2012 | Amazon DynamoDB — Dynamo principles as managed service |
| 2012 | Redis Cluster — horizontal scaling for Redis |
| 2013 | Netflix Simian Army — chaos engineering open-sourced |
| 2014 | Google Kubernetes open-sourced — Borg principles for everyone |
| 2015 | ClickHouse internal at Yandex (open-sourced 2016) |
| 2016 | Envoy Proxy — Lyft's L7 proxy open-sourced |
| 2016 | Meta TAO paper — distributed object/association cache |
| 2017 | Cloudflare Workers — serverless at edge, V8 isolates |
| 2018 | Cloudflare adopts eBPF/XDP for DDoS at 1 Tbps+ |
| 2019 | Amazon Aurora Global Database — <1 second RPO across regions |
| 2019 | Meta Tupperware — containerized workload management |
| 2020 | Kafka 2.8 KRaft mode — removes ZooKeeper dependency |
| 2022 | ClickHouse Cloud — ClickHouse Inc. managed service |
| 2023 | Amazon DynamoDB Standard-IA — tiered storage |
Modern Relevance
These systems are not historical artifacts — they are active production systems serving billions of users. Kafka is the de facto standard for event streaming; understanding its guarantees prevents subtle bugs in distributed applications. PostgreSQL is the dominant open-source relational database; its MVCC internals explain why long-running transactions cause table bloat. Redis is in virtually every production stack as a cache or pub/sub bus. Envoy is the universal data plane for service meshes and API gateways. Studying these systems reveals how theoretical concepts from distributed systems, database internals, and networking combine in real engineering. The papers behind these systems (Bigtable, Dynamo, Spanner, MapReduce, Raft) are the canon of systems literature that every infrastructure engineer should have read.
File Map
39-large-scale-case-studies/
├── 00-overview.md <- This file
├── 01-google-borg-and-kubernetes.md
├── 02-google-spanner.md
├── 03-google-colossus.md
├── 04-google-bigtable.md
├── 05-google-mapreduce.md
├── 06-google-chubby.md
├── 07-google-dapper.md
├── 08-meta-linux-at-scale.md
├── 09-meta-tupperware.md
├── 10-meta-thrift.md
├── 11-meta-tao.md
├── 12-aws-nitro.md
├── 13-aws-dynamodb.md
├── 14-aws-aurora.md
├── 15-aws-s3.md
├── 16-netflix-chaos-engineering.md
├── 17-netflix-evcache.md
├── 18-netflix-zuul.md
├── 19-cloudflare-anycast.md
├── 20-cloudflare-ebpf.md
├── 21-cloudflare-workers.md
├── 22-kafka-internals.md
├── 23-redis-internals.md
├── 24-postgresql-internals.md
├── 25-cassandra-architecture.md
├── 26-clickhouse-internals.md
├── 27-envoy-proxy.md
└── 28-nginx-internals.md
Cross-References
- Section 17 (Distributed Systems): Consensus, consistency models — Spanner TrueTime, Dynamo quorums
- Section 18 (Database Internals): PostgreSQL MVCC, Cassandra LSM trees, ClickHouse columnar storage
- Section 22 (Kubernetes): Direct descendant of Google Borg architecture
- Section 23 (Observability): Google Dapper's influence on OpenTelemetry distributed tracing
- Section 38 (System Design): These case studies are the real implementations of Section 38 patterns
- Section 32 (AI Infrastructure): Meta AI clusters, Google TPU infrastructure, AWS AI chips (Trainium, Inferentia)