Section 17: Distributed Systems
Purpose and Scope
A distributed system is a collection of independent computers that appears to its users as a single coherent system. The field's central challenge is that the components fail independently, communicate over unreliable networks, and have no shared clock — yet applications demand strong correctness guarantees, high availability, and predictable performance simultaneously.
This section covers the theoretical foundations (CAP theorem, PACELC, FLP impossibility) and the practical algorithms that production systems are built on: Paxos, Raft, and Zab for consensus; gossip protocols and SWIM for failure detection; consistent hashing and sharding for data distribution; vector clocks and CRDTs for causality and conflict resolution; 2-phase commit and Saga for distributed transactions. The section deliberately bridges theory and implementation, treating the gap between a paper algorithm and a production deployment as a subject of study in its own right.
Prerequisites
- Section 15 (Networking): TCP semantics, network partitions, latency distributions
- Section 16 (TCP/IP Internals): timeouts, connection failures, partial failure modes
- Section 10 (Synchronization): local synchronization as a baseline for comparison
- Familiarity with basic probability and discrete math
Learning Objectives
Upon completing this section you will be able to:
- State the CAP theorem precisely, explain why it is often misapplied, and use PACELC for more nuanced trade-off analysis.
- Explain the FLP impossibility result and what it implies for practical consensus (timeouts, failure detectors).
- Trace a Paxos execution: prepare/promise, accept/accepted, and how the algorithm handles concurrent proposers.
- Explain Raft's decomposition into leader election, log replication, and safety, and how it simplifies Paxos for implementors.
- Describe the consistency model spectrum from sequential consistency through linearizability, causal consistency, and eventual consistency.
- Explain consistent hashing and virtual nodes, and analyze its behavior under node addition/removal.
- Design a system using vector clocks or version vectors to detect and resolve concurrent writes.
- Explain CRDTs: what makes a data type conflict-free, examples (G-Counter, LWW-Register, OR-Set).
- Explain 2-phase commit's failure modes and why the coordinator is a single point of blocking.
- Describe the Saga pattern and its compensating transaction model.
Architecture Overview
Application Layer
┌──────────────────────────────────────────────────────────────────┐
│ Client Request │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Coordination Layer │
│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────────┐ │
│ │ Consensus │ │ Leader │ │ Distributed Lock │ │
│ │ Raft / Paxos│ │ Election │ │ (Chubby/etcd/ZK) │ │
│ └─────────────┘ └──────────────┘ └────────────────────────┘ │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Data Distribution Layer │
│ ┌──────────────────┐ ┌───────────────────────────────────┐ │
│ │ Consistent Hash │ │ Replication (sync/async/semi) │ │
│ │ Virtual Nodes │ │ Quorum (majority / W+R > N) │ │
│ └──────────────────┘ └───────────────────────────────────┘ │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Failure Detection Layer │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ Heartbeat / SWIM / Gossip / Phi Accrual Detector │ │
│ └──────────────────────────────────────────────────────────┘ │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Physical Layer (N nodes, unreliable network) │
│ Node A ────── Node B ────── Node C │
│ │ │ │
│ CRASH PARTITION ←─── Node D │
└──────────────────────────────────────────────────────────────────┘
Consistency Spectrum:
Linearizable ──► Sequential ──► Causal ──► FIFO ──► Eventual
(strongest) (weakest)
Spanner/etcd Zookeeper DynamoDB-style DNS Cassandra default
Key Concepts
- CAP Theorem (Brewer, 2000): A distributed system cannot simultaneously guarantee Consistency (linearizability), Availability (every request receives a response), and Partition Tolerance (operation despite message loss). In practice, partitions occur, so the choice is C vs A during a partition.
- PACELC: Extension of CAP: when there is no Partition (P), the trade-off is between latency (L) and consistency (C). More actionable for system design.
- FLP Impossibility: In an asynchronous system with even one crash failure, there is no deterministic consensus algorithm that always terminates (Fisher, Lynch, Paterson 1985). Practical systems escape via timeouts and failure detectors.
- Paxos: Single-decree consensus algorithm using two phases: prepare/promise establishes a ballot, accept/accepted commits a value. Multi-Paxos reduces to one round trip per slot after leader establishment.
- Raft: Consensus algorithm designed for understandability; separates leader election (randomized timeouts), log replication (AppendEntries RPC), and safety (leader completeness, log matching).
- Zab (ZooKeeper Atomic Broadcast): Consensus protocol underlying ZooKeeper; distinguishes between crash-recovery mode and broadcast mode.
- Linearizability: The strongest single-object consistency model; every operation appears to take effect atomically at some point between its invocation and response.
- Causal Consistency: Writes that are causally related appear in the same order on all nodes; concurrent writes may appear in different orders.
- Eventual Consistency: If no new updates are made, all replicas will converge to the same value; offers no bound on convergence time.
- Vector Clock: Per-node logical timestamp enabling causal ordering; a vector [t1, t2, …, tN] where ti is node i's local event count.
- CRDT (Conflict-free Replicated Data Type): Data structures whose operations commute and/or are idempotent, enabling automatic conflict resolution in AP systems.
- Consistent Hashing: Maps keys to a circular hash ring; adding/removing a node affects only 1/N of keys; virtual nodes improve balance.
- Quorum: For a system with N replicas, a write quorum W and read quorum R satisfying W+R > N ensures at least one node has the latest write.
- Gossip Protocol: Epidemic information dissemination; each node periodically contacts a random peer to exchange state; converges in O(log N) rounds.
- SWIM (Scalable Weakly-consistent Infection-style Membership): Failure detection using piggybacked gossip on periodic pings; O(1) message overhead per node.
- 2-Phase Commit (2PC): Distributed atomic commit protocol; coordinator sends prepare, collects votes, sends commit/abort; coordinator crash after prepare leaves participants blocked.
- Saga Pattern: Sequence of local transactions with compensating transactions for rollback; avoids distributed locks; requires idempotent compensations.
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1978 | Lamport logical clocks; causal ordering of events |
| 1982 | Byzantine Generals problem (Lamport, Shostak, Pease) |
| 1983 | Two-phase commit protocol described (Gray) |
| 1985 | FLP impossibility proof (Fischer, Lynch, Paterson) |
| 1989 | Lamport publishes Paxos Made Simple (circulated; published 1998) |
| 1990 | Bayou replicated storage; eventual consistency formalized |
| 1997 | Brewer's CAP conjecture (formalized as theorem by Gilbert & Lynch 2002) |
| 1998 | Google File System design begins; Chubby lock service |
| 2001 | Practical Byzantine Fault Tolerance (PBFT) — Castro & Liskov |
| 2003 | Amazon Dynamo architecture (published 2007); consistent hashing, vector clocks |
| 2003 | Google Chubby paper — Paxos-based distributed lock service |
| 2006 | Zookeeper development at Yahoo; ZAB protocol |
| 2007 | CRDTs formalized (Shapiro et al., INRIA) |
| 2010 | Google Spanner: TrueTime, externally consistent transactions |
| 2012 | PACELC framework (Abadi) |
| 2014 | Raft consensus algorithm (Ongaro & Ousterhout PhD thesis) |
| 2014 | etcd v1 released (Raft-based); becomes Kubernetes' backing store |
| 2017 | CockroachDB, TiDB: Raft-based geo-distributed SQL |
| 2019 | FoundationDB open-sourced; Paxos-based multi-model store |
| 2021 | Flexible Paxos: quorum intersection requirements relaxed |
Modern Relevance and Production Use Cases
Kubernetes uses etcd (Raft consensus) as its single source of truth; every API object write is serialized through Raft; understanding quorum behavior explains why a 3-node etcd cluster tolerates 1 failure but a 5-node cluster tolerates 2.
Apache Kafka uses Raft (KRaft mode, replacing ZooKeeper) for metadata; partition leadership election and ISR (in-sync replica) management are direct applications of consensus and quorum theory.
Cassandra and DynamoDB are AP systems using consistent hashing, gossip membership (SWIM-derived), and tunable quorums; understanding their consistency model is essential for designing applications that tolerate eventual consistency.
Google Spanner and CockroachDB achieve external consistency via TrueTime (bounded clock uncertainty) and hybrid logical clocks respectively; they demonstrate that linearizable geo-distributed transactions are possible but expensive.
Service discovery (Consul, etcd, ZooKeeper) is a direct application of distributed consensus; application developers interact with these systems without knowing Raft, but operational failures (quorum loss, network partitions) are directly explained by it.
File Map
| File | Description |
|---|---|
01-fundamentals.md |
Distributed system properties, failure modes, partial failure |
02-cap-theorem.md |
Formal statement, partition scenarios, C vs A choice |
03-pacelc.md |
Latency vs consistency in normal operation, real system examples |
04-flp-impossibility.md |
Proof sketch, failure detectors, ◇S and ◇W |
05-consistency-models.md |
Linearizability, sequential, causal, FIFO, eventual |
06-paxos.md |
Single-decree Paxos, Multi-Paxos, leader lease, Fast Paxos |
07-raft.md |
Leader election, log replication, log compaction, membership change |
08-zab.md |
ZooKeeper atomic broadcast, epoch, discovery/sync/broadcast phases |
09-leader-election.md |
Bully algorithm, ring-based, Raft randomized timeout |
10-distributed-locking.md |
Chubby, etcd distributed lock, fencing tokens |
11-replication-strategies.md |
Sync/async/semi-sync, chain replication, primary-backup |
12-sharding.md |
Range vs hash partitioning, hotspot avoidance, re-sharding |
13-consistent-hashing.md |
Ring model, virtual nodes, bounded load hashing |
14-vector-clocks.md |
Lamport clocks, vector clocks, dotted version vectors |
15-crdts.md |
Convergent vs commutative, G-Counter, OR-Set, LWW-Register |
16-gossip-protocols.md |
Push/pull/push-pull gossip, convergence, anti-entropy |
17-failure-detection.md |
SWIM, Phi accrual detector, Cassandra/Akka failure detection |
18-distributed-transactions.md |
ACID in distributed setting, 2PC, 3PC, paxos commit |
19-saga-pattern.md |
Choreography vs orchestration Saga, compensating transactions |
20-case-studies.md |
Spanner, DynamoDB, Kafka KRaft, etcd, TiKV internals |
Cross-References
- Section 10 (Synchronization): local synchronization primitives as building blocks; contrast with distributed locking
- Section 13 (Filesystems): distributed filesystems (CephFS, GlusterFS, NFS) as applications
- Section 15 (Networking): network partition behavior, TCP timeout as failure signal
- Section 16 (TCP/IP): connection semantics that distributed systems must reason about
- Section 18 (Database Internals): distributed database implementations of Raft, 2PC, MVCC
- Section 19 (Virtualization): live migration as a distributed systems problem; cluster schedulers