Section 17: Distributed Systems

Purpose and Scope

A distributed system is a collection of independent computers that appears to its users as a single coherent system. The field's central challenge is that the components fail independently, communicate over unreliable networks, and have no shared clock — yet applications demand strong correctness guarantees, high availability, and predictable performance simultaneously.

This section covers the theoretical foundations (CAP theorem, PACELC, FLP impossibility) and the practical algorithms that production systems are built on: Paxos, Raft, and Zab for consensus; gossip protocols and SWIM for failure detection; consistent hashing and sharding for data distribution; vector clocks and CRDTs for causality and conflict resolution; 2-phase commit and Saga for distributed transactions. The section deliberately bridges theory and implementation, treating the gap between a paper algorithm and a production deployment as a subject of study in its own right.

Prerequisites

Section 15 (Networking): TCP semantics, network partitions, latency distributions
Section 16 (TCP/IP Internals): timeouts, connection failures, partial failure modes
Section 10 (Synchronization): local synchronization as a baseline for comparison
Familiarity with basic probability and discrete math

Learning Objectives

Upon completing this section you will be able to:

State the CAP theorem precisely, explain why it is often misapplied, and use PACELC for more nuanced trade-off analysis.
Explain the FLP impossibility result and what it implies for practical consensus (timeouts, failure detectors).
Trace a Paxos execution: prepare/promise, accept/accepted, and how the algorithm handles concurrent proposers.
Explain Raft's decomposition into leader election, log replication, and safety, and how it simplifies Paxos for implementors.
Describe the consistency model spectrum from sequential consistency through linearizability, causal consistency, and eventual consistency.
Explain consistent hashing and virtual nodes, and analyze its behavior under node addition/removal.
Design a system using vector clocks or version vectors to detect and resolve concurrent writes.
Explain CRDTs: what makes a data type conflict-free, examples (G-Counter, LWW-Register, OR-Set).
Explain 2-phase commit's failure modes and why the coordinator is a single point of blocking.
Describe the Saga pattern and its compensating transaction model.

Architecture Overview

  Application Layer
  ┌──────────────────────────────────────────────────────────────────┐
  │   Client Request                                                 │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Coordination Layer                                   │
  │   ┌─────────────┐  ┌──────────────┐  ┌────────────────────────┐ │
  │   │  Consensus  │  │   Leader     │  │  Distributed Lock      │ │
  │   │ Raft / Paxos│  │  Election    │  │  (Chubby/etcd/ZK)      │ │
  │   └─────────────┘  └──────────────┘  └────────────────────────┘ │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Data Distribution Layer                              │
  │   ┌──────────────────┐  ┌───────────────────────────────────┐   │
  │   │ Consistent Hash  │  │  Replication (sync/async/semi)    │   │
  │   │ Virtual Nodes    │  │  Quorum (majority / W+R > N)      │   │
  │   └──────────────────┘  └───────────────────────────────────┘   │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Failure Detection Layer                              │
  │   ┌──────────────────────────────────────────────────────────┐   │
  │   │  Heartbeat / SWIM / Gossip / Phi Accrual Detector        │   │
  │   └──────────────────────────────────────────────────────────┘   │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Physical Layer (N nodes, unreliable network)         │
  │   Node A ────── Node B ────── Node C                             │
  │     │                            │                               │
  │   CRASH                    PARTITION ←─── Node D                 │
  └──────────────────────────────────────────────────────────────────┘

  Consistency Spectrum:
  Linearizable ──► Sequential ──► Causal ──► FIFO ──► Eventual
  (strongest)                                          (weakest)
  Spanner/etcd    Zookeeper     DynamoDB-style   DNS    Cassandra default

Key Concepts

CAP Theorem (Brewer, 2000): A distributed system cannot simultaneously guarantee Consistency (linearizability), Availability (every request receives a response), and Partition Tolerance (operation despite message loss). In practice, partitions occur, so the choice is C vs A during a partition.
PACELC: Extension of CAP: when there is no Partition (P), the trade-off is between latency (L) and consistency (C). More actionable for system design.
FLP Impossibility: In an asynchronous system with even one crash failure, there is no deterministic consensus algorithm that always terminates (Fisher, Lynch, Paterson 1985). Practical systems escape via timeouts and failure detectors.
Paxos: Single-decree consensus algorithm using two phases: prepare/promise establishes a ballot, accept/accepted commits a value. Multi-Paxos reduces to one round trip per slot after leader establishment.
Raft: Consensus algorithm designed for understandability; separates leader election (randomized timeouts), log replication (AppendEntries RPC), and safety (leader completeness, log matching).
Zab (ZooKeeper Atomic Broadcast): Consensus protocol underlying ZooKeeper; distinguishes between crash-recovery mode and broadcast mode.
Linearizability: The strongest single-object consistency model; every operation appears to take effect atomically at some point between its invocation and response.
Causal Consistency: Writes that are causally related appear in the same order on all nodes; concurrent writes may appear in different orders.
Eventual Consistency: If no new updates are made, all replicas will converge to the same value; offers no bound on convergence time.
Vector Clock: Per-node logical timestamp enabling causal ordering; a vector [t1, t2, …, tN] where ti is node i's local event count.
CRDT (Conflict-free Replicated Data Type): Data structures whose operations commute and/or are idempotent, enabling automatic conflict resolution in AP systems.
Consistent Hashing: Maps keys to a circular hash ring; adding/removing a node affects only 1/N of keys; virtual nodes improve balance.
Quorum: For a system with N replicas, a write quorum W and read quorum R satisfying W+R > N ensures at least one node has the latest write.
Gossip Protocol: Epidemic information dissemination; each node periodically contacts a random peer to exchange state; converges in O(log N) rounds.
SWIM (Scalable Weakly-consistent Infection-style Membership): Failure detection using piggybacked gossip on periodic pings; O(1) message overhead per node.
2-Phase Commit (2PC): Distributed atomic commit protocol; coordinator sends prepare, collects votes, sends commit/abort; coordinator crash after prepare leaves participants blocked.
Saga Pattern: Sequence of local transactions with compensating transactions for rollback; avoids distributed locks; requires idempotent compensations.

Major Historical Milestones

Year	Milestone
1978	Lamport logical clocks; causal ordering of events
1982	Byzantine Generals problem (Lamport, Shostak, Pease)
1983	Two-phase commit protocol described (Gray)
1985	FLP impossibility proof (Fischer, Lynch, Paterson)
1989	Lamport publishes Paxos Made Simple (circulated; published 1998)
1990	Bayou replicated storage; eventual consistency formalized
1997	Brewer's CAP conjecture (formalized as theorem by Gilbert & Lynch 2002)
1998	Google File System design begins; Chubby lock service
2001	Practical Byzantine Fault Tolerance (PBFT) — Castro & Liskov
2003	Amazon Dynamo architecture (published 2007); consistent hashing, vector clocks
2003	Google Chubby paper — Paxos-based distributed lock service
2006	Zookeeper development at Yahoo; ZAB protocol
2007	CRDTs formalized (Shapiro et al., INRIA)
2010	Google Spanner: TrueTime, externally consistent transactions
2012	PACELC framework (Abadi)
2014	Raft consensus algorithm (Ongaro & Ousterhout PhD thesis)
2014	etcd v1 released (Raft-based); becomes Kubernetes' backing store
2017	CockroachDB, TiDB: Raft-based geo-distributed SQL
2019	FoundationDB open-sourced; Paxos-based multi-model store
2021	Flexible Paxos: quorum intersection requirements relaxed

Modern Relevance and Production Use Cases

Kubernetes uses etcd (Raft consensus) as its single source of truth; every API object write is serialized through Raft; understanding quorum behavior explains why a 3-node etcd cluster tolerates 1 failure but a 5-node cluster tolerates 2.

Apache Kafka uses Raft (KRaft mode, replacing ZooKeeper) for metadata; partition leadership election and ISR (in-sync replica) management are direct applications of consensus and quorum theory.

Cassandra and DynamoDB are AP systems using consistent hashing, gossip membership (SWIM-derived), and tunable quorums; understanding their consistency model is essential for designing applications that tolerate eventual consistency.

Google Spanner and CockroachDB achieve external consistency via TrueTime (bounded clock uncertainty) and hybrid logical clocks respectively; they demonstrate that linearizable geo-distributed transactions are possible but expensive.

Service discovery (Consul, etcd, ZooKeeper) is a direct application of distributed consensus; application developers interact with these systems without knowing Raft, but operational failures (quorum loss, network partitions) are directly explained by it.

File Map

File	Description
`01-fundamentals.md`	Distributed system properties, failure modes, partial failure
`02-cap-theorem.md`	Formal statement, partition scenarios, C vs A choice
`03-pacelc.md`	Latency vs consistency in normal operation, real system examples
`04-flp-impossibility.md`	Proof sketch, failure detectors, ◇S and ◇W
`05-consistency-models.md`	Linearizability, sequential, causal, FIFO, eventual
`06-paxos.md`	Single-decree Paxos, Multi-Paxos, leader lease, Fast Paxos
`07-raft.md`	Leader election, log replication, log compaction, membership change
`08-zab.md`	ZooKeeper atomic broadcast, epoch, discovery/sync/broadcast phases
`09-leader-election.md`	Bully algorithm, ring-based, Raft randomized timeout
`10-distributed-locking.md`	Chubby, etcd distributed lock, fencing tokens
`11-replication-strategies.md`	Sync/async/semi-sync, chain replication, primary-backup
`12-sharding.md`	Range vs hash partitioning, hotspot avoidance, re-sharding
`13-consistent-hashing.md`	Ring model, virtual nodes, bounded load hashing
`14-vector-clocks.md`	Lamport clocks, vector clocks, dotted version vectors
`15-crdts.md`	Convergent vs commutative, G-Counter, OR-Set, LWW-Register
`16-gossip-protocols.md`	Push/pull/push-pull gossip, convergence, anti-entropy
`17-failure-detection.md`	SWIM, Phi accrual detector, Cassandra/Akka failure detection
`18-distributed-transactions.md`	ACID in distributed setting, 2PC, 3PC, paxos commit
`19-saga-pattern.md`	Choreography vs orchestration Saga, compensating transactions
`20-case-studies.md`	Spanner, DynamoDB, Kafka KRaft, etcd, TiKV internals

Cross-References

Section 10 (Synchronization): local synchronization primitives as building blocks; contrast with distributed locking
Section 13 (Filesystems): distributed filesystems (CephFS, GlusterFS, NFS) as applications
Section 15 (Networking): network partition behavior, TCP timeout as failure signal
Section 16 (TCP/IP): connection semantics that distributed systems must reason about
Section 18 (Database Internals): distributed database implementations of Raft, 2PC, MVCC
Section 19 (Virtualization): live migration as a distributed systems problem; cluster schedulers