Skip to content

Section 17: Distributed Systems

Purpose and Scope

A distributed system is a collection of independent computers that appears to its users as a single coherent system. The field's central challenge is that the components fail independently, communicate over unreliable networks, and have no shared clock — yet applications demand strong correctness guarantees, high availability, and predictable performance simultaneously.

This section covers the theoretical foundations (CAP theorem, PACELC, FLP impossibility) and the practical algorithms that production systems are built on: Paxos, Raft, and Zab for consensus; gossip protocols and SWIM for failure detection; consistent hashing and sharding for data distribution; vector clocks and CRDTs for causality and conflict resolution; 2-phase commit and Saga for distributed transactions. The section deliberately bridges theory and implementation, treating the gap between a paper algorithm and a production deployment as a subject of study in its own right.


Prerequisites

  • Section 15 (Networking): TCP semantics, network partitions, latency distributions
  • Section 16 (TCP/IP Internals): timeouts, connection failures, partial failure modes
  • Section 10 (Synchronization): local synchronization as a baseline for comparison
  • Familiarity with basic probability and discrete math

Learning Objectives

Upon completing this section you will be able to:

  1. State the CAP theorem precisely, explain why it is often misapplied, and use PACELC for more nuanced trade-off analysis.
  2. Explain the FLP impossibility result and what it implies for practical consensus (timeouts, failure detectors).
  3. Trace a Paxos execution: prepare/promise, accept/accepted, and how the algorithm handles concurrent proposers.
  4. Explain Raft's decomposition into leader election, log replication, and safety, and how it simplifies Paxos for implementors.
  5. Describe the consistency model spectrum from sequential consistency through linearizability, causal consistency, and eventual consistency.
  6. Explain consistent hashing and virtual nodes, and analyze its behavior under node addition/removal.
  7. Design a system using vector clocks or version vectors to detect and resolve concurrent writes.
  8. Explain CRDTs: what makes a data type conflict-free, examples (G-Counter, LWW-Register, OR-Set).
  9. Explain 2-phase commit's failure modes and why the coordinator is a single point of blocking.
  10. Describe the Saga pattern and its compensating transaction model.

Architecture Overview

  Application Layer
  ┌──────────────────────────────────────────────────────────────────┐
  │   Client Request                                                 │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Coordination Layer                                   │
  │   ┌─────────────┐  ┌──────────────┐  ┌────────────────────────┐ │
  │   │  Consensus  │  │   Leader     │  │  Distributed Lock      │ │
  │   │ Raft / Paxos│  │  Election    │  │  (Chubby/etcd/ZK)      │ │
  │   └─────────────┘  └──────────────┘  └────────────────────────┘ │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Data Distribution Layer                              │
  │   ┌──────────────────┐  ┌───────────────────────────────────┐   │
  │   │ Consistent Hash  │  │  Replication (sync/async/semi)    │   │
  │   │ Virtual Nodes    │  │  Quorum (majority / W+R > N)      │   │
  │   └──────────────────┘  └───────────────────────────────────┘   │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Failure Detection Layer                              │
  │   ┌──────────────────────────────────────────────────────────┐   │
  │   │  Heartbeat / SWIM / Gossip / Phi Accrual Detector        │   │
  │   └──────────────────────────────────────────────────────────┘   │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Physical Layer (N nodes, unreliable network)         │
  │   Node A ────── Node B ────── Node C                             │
  │     │                            │                               │
  │   CRASH                    PARTITION ←─── Node D                 │
  └──────────────────────────────────────────────────────────────────┘

  Consistency Spectrum:
  Linearizable ──► Sequential ──► Causal ──► FIFO ──► Eventual
  (strongest)                                          (weakest)
  Spanner/etcd    Zookeeper     DynamoDB-style   DNS    Cassandra default

Key Concepts

  • CAP Theorem (Brewer, 2000): A distributed system cannot simultaneously guarantee Consistency (linearizability), Availability (every request receives a response), and Partition Tolerance (operation despite message loss). In practice, partitions occur, so the choice is C vs A during a partition.
  • PACELC: Extension of CAP: when there is no Partition (P), the trade-off is between latency (L) and consistency (C). More actionable for system design.
  • FLP Impossibility: In an asynchronous system with even one crash failure, there is no deterministic consensus algorithm that always terminates (Fisher, Lynch, Paterson 1985). Practical systems escape via timeouts and failure detectors.
  • Paxos: Single-decree consensus algorithm using two phases: prepare/promise establishes a ballot, accept/accepted commits a value. Multi-Paxos reduces to one round trip per slot after leader establishment.
  • Raft: Consensus algorithm designed for understandability; separates leader election (randomized timeouts), log replication (AppendEntries RPC), and safety (leader completeness, log matching).
  • Zab (ZooKeeper Atomic Broadcast): Consensus protocol underlying ZooKeeper; distinguishes between crash-recovery mode and broadcast mode.
  • Linearizability: The strongest single-object consistency model; every operation appears to take effect atomically at some point between its invocation and response.
  • Causal Consistency: Writes that are causally related appear in the same order on all nodes; concurrent writes may appear in different orders.
  • Eventual Consistency: If no new updates are made, all replicas will converge to the same value; offers no bound on convergence time.
  • Vector Clock: Per-node logical timestamp enabling causal ordering; a vector [t1, t2, …, tN] where ti is node i's local event count.
  • CRDT (Conflict-free Replicated Data Type): Data structures whose operations commute and/or are idempotent, enabling automatic conflict resolution in AP systems.
  • Consistent Hashing: Maps keys to a circular hash ring; adding/removing a node affects only 1/N of keys; virtual nodes improve balance.
  • Quorum: For a system with N replicas, a write quorum W and read quorum R satisfying W+R > N ensures at least one node has the latest write.
  • Gossip Protocol: Epidemic information dissemination; each node periodically contacts a random peer to exchange state; converges in O(log N) rounds.
  • SWIM (Scalable Weakly-consistent Infection-style Membership): Failure detection using piggybacked gossip on periodic pings; O(1) message overhead per node.
  • 2-Phase Commit (2PC): Distributed atomic commit protocol; coordinator sends prepare, collects votes, sends commit/abort; coordinator crash after prepare leaves participants blocked.
  • Saga Pattern: Sequence of local transactions with compensating transactions for rollback; avoids distributed locks; requires idempotent compensations.

Major Historical Milestones

Year Milestone
1978 Lamport logical clocks; causal ordering of events
1982 Byzantine Generals problem (Lamport, Shostak, Pease)
1983 Two-phase commit protocol described (Gray)
1985 FLP impossibility proof (Fischer, Lynch, Paterson)
1989 Lamport publishes Paxos Made Simple (circulated; published 1998)
1990 Bayou replicated storage; eventual consistency formalized
1997 Brewer's CAP conjecture (formalized as theorem by Gilbert & Lynch 2002)
1998 Google File System design begins; Chubby lock service
2001 Practical Byzantine Fault Tolerance (PBFT) — Castro & Liskov
2003 Amazon Dynamo architecture (published 2007); consistent hashing, vector clocks
2003 Google Chubby paper — Paxos-based distributed lock service
2006 Zookeeper development at Yahoo; ZAB protocol
2007 CRDTs formalized (Shapiro et al., INRIA)
2010 Google Spanner: TrueTime, externally consistent transactions
2012 PACELC framework (Abadi)
2014 Raft consensus algorithm (Ongaro & Ousterhout PhD thesis)
2014 etcd v1 released (Raft-based); becomes Kubernetes' backing store
2017 CockroachDB, TiDB: Raft-based geo-distributed SQL
2019 FoundationDB open-sourced; Paxos-based multi-model store
2021 Flexible Paxos: quorum intersection requirements relaxed

Modern Relevance and Production Use Cases

Kubernetes uses etcd (Raft consensus) as its single source of truth; every API object write is serialized through Raft; understanding quorum behavior explains why a 3-node etcd cluster tolerates 1 failure but a 5-node cluster tolerates 2.

Apache Kafka uses Raft (KRaft mode, replacing ZooKeeper) for metadata; partition leadership election and ISR (in-sync replica) management are direct applications of consensus and quorum theory.

Cassandra and DynamoDB are AP systems using consistent hashing, gossip membership (SWIM-derived), and tunable quorums; understanding their consistency model is essential for designing applications that tolerate eventual consistency.

Google Spanner and CockroachDB achieve external consistency via TrueTime (bounded clock uncertainty) and hybrid logical clocks respectively; they demonstrate that linearizable geo-distributed transactions are possible but expensive.

Service discovery (Consul, etcd, ZooKeeper) is a direct application of distributed consensus; application developers interact with these systems without knowing Raft, but operational failures (quorum loss, network partitions) are directly explained by it.


File Map

File Description
01-fundamentals.md Distributed system properties, failure modes, partial failure
02-cap-theorem.md Formal statement, partition scenarios, C vs A choice
03-pacelc.md Latency vs consistency in normal operation, real system examples
04-flp-impossibility.md Proof sketch, failure detectors, ◇S and ◇W
05-consistency-models.md Linearizability, sequential, causal, FIFO, eventual
06-paxos.md Single-decree Paxos, Multi-Paxos, leader lease, Fast Paxos
07-raft.md Leader election, log replication, log compaction, membership change
08-zab.md ZooKeeper atomic broadcast, epoch, discovery/sync/broadcast phases
09-leader-election.md Bully algorithm, ring-based, Raft randomized timeout
10-distributed-locking.md Chubby, etcd distributed lock, fencing tokens
11-replication-strategies.md Sync/async/semi-sync, chain replication, primary-backup
12-sharding.md Range vs hash partitioning, hotspot avoidance, re-sharding
13-consistent-hashing.md Ring model, virtual nodes, bounded load hashing
14-vector-clocks.md Lamport clocks, vector clocks, dotted version vectors
15-crdts.md Convergent vs commutative, G-Counter, OR-Set, LWW-Register
16-gossip-protocols.md Push/pull/push-pull gossip, convergence, anti-entropy
17-failure-detection.md SWIM, Phi accrual detector, Cassandra/Akka failure detection
18-distributed-transactions.md ACID in distributed setting, 2PC, 3PC, paxos commit
19-saga-pattern.md Choreography vs orchestration Saga, compensating transactions
20-case-studies.md Spanner, DynamoDB, Kafka KRaft, etcd, TiKV internals

Cross-References

  • Section 10 (Synchronization): local synchronization primitives as building blocks; contrast with distributed locking
  • Section 13 (Filesystems): distributed filesystems (CephFS, GlusterFS, NFS) as applications
  • Section 15 (Networking): network partition behavior, TCP timeout as failure signal
  • Section 16 (TCP/IP): connection semantics that distributed systems must reason about
  • Section 18 (Database Internals): distributed database implementations of Raft, 2PC, MVCC
  • Section 19 (Virtualization): live migration as a distributed systems problem; cluster schedulers