05 — Paxos

Technical Overview

Paxos is the foundational consensus algorithm for distributed systems. It solves the problem of getting a group of processes to agree on a single value despite failures and message loss. Every production consensus system — whether it uses Paxos directly or descends from it — is built on the same core insight: to safely commit a value, you need responses from a quorum of acceptors in two phases, and the second phase must ensure that any previously chosen value is preserved.

Paxos was described by Leslie Lamport in 1989 (circulated informally) and finally published in 1998 as "The Part-Time Parliament." Its notoriously allegorical presentation — set in an ancient Greek parliament — made it difficult to understand, leading to the 2001 "Paxos Made Simple" paper. Even then, implementing Paxos correctly took Google's top engineers years.

Prerequisites

Distributed systems fundamentals (01-distributed-systems-fundamentals.md)
CAP theorem and consistency models (02-cap-theorem.md, 03-consistency-models.md)
Understanding of quorums and majority voting
Basic understanding of fault tolerance (crash-stop model)

Core Content

Why Consensus Is Hard: FLP Impossibility

Before understanding Paxos, you must understand why achieving consensus is fundamentally hard. The FLP Impossibility Theorem (Fischer, Lynch, Paterson, 1985) proves:

In a fully asynchronous distributed system, no deterministic consensus protocol can guarantee both safety (all nodes agree on the same value) and liveness (the protocol terminates) in the presence of even a single crash failure.

This is not a statement about bad implementations — it's a mathematical impossibility. In an asynchronous system, you cannot distinguish a crashed node from a slow node. If you wait for a response, you might wait forever (if the node crashed). If you give up after a timeout, you might give up on a node that's still running, violating safety if you proceed without it.

The implication: All practical consensus protocols (Paxos, Raft, Zab) solve this by assuming partial synchrony — there exists some unknown but finite time bound after which messages arrive. They sacrifice guaranteed liveness in completely asynchronous environments but provide safety always.

FLP Impossibility Intuition:

  System: processes P1, P2, P3. P3 may crash.
  Goal: agree on 0 or 1.

  Case 1: P3 doesn't crash. Must reach consensus.
  Case 2: P3 crashes before any message. 
    P1 and P2 must reach consensus without P3.

  The critical state: P1 and P2 are waiting for P3.
  Is P3 crashed (Case 2) or just slow (Case 1)?

  If we decide (to avoid blocking) → might violate safety if P3 was slow and
  chose a different value.
  If we wait forever → no liveness guarantee.

  There exists a decision state from which either choice can lead to
  a protocol execution that is "stuck" — cannot safely commit.

Paxos Roles

Paxos has three roles (a single process can play multiple roles):

Proposer: Proposes a value to be agreed upon. In practice, the designated leader.
Acceptor: Votes on proposals. A quorum of acceptors must accept a proposal for it to be chosen. Typically all nodes play acceptor.
Learner: Learns the chosen value after it's committed. Often the same as proposers/acceptors.

Single-Decree Paxos: The Core Algorithm

Single-Decree Paxos achieves agreement on a single value (one decree). This is the foundation; Multi-Paxos extends it to a sequence of values (a log).

The algorithm proceeds in two phases:

Phase 1: Prepare / Promise

  Proposer selects a unique proposal number n.

  Proposer → ALL Acceptors: Prepare(n)

  Acceptor, upon receiving Prepare(n):
    If n > highest_prepared_n:
      Set highest_prepared_n = n
      Reply: Promise(n, accepted_n, accepted_value)
        where accepted_n = highest proposal number acceptor has accepted
              accepted_value = value it accepted at that number (or null)
    Else:
      Ignore (or send Nack)

Phase 2: Accept / Accepted

  IF Proposer receives Promise from a quorum (majority):

    Determine value v:
      If any Promise contained an accepted_value:
        v = accepted_value with the highest accepted_n
        (MUST use this value — cannot use proposer's own)
      Else:
        v = proposer's own value (free to choose)

    Proposer → ALL Acceptors: Accept(n, v)

    Acceptor, upon receiving Accept(n, v):
      If n >= highest_prepared_n:
        Accept the proposal.
        Set accepted_n = n, accepted_value = v
        Reply: Accepted(n, v)
        (Also notify Learners)
      Else:
        Ignore (a newer proposal took over)

  IF Proposer receives Accepted from a quorum:
    Value v is CHOSEN.
    Notify learners.

Phase Diagram

Proposer             Acceptor A          Acceptor B          Acceptor C
   │                     │                   │                   │
   │──── Prepare(n=5) ──→│                   │                   │
   │──── Prepare(n=5) ──────────────────────→│                   │
   │──── Prepare(n=5) ─────────────────────────────────────────→│
   │                     │                   │                   │
   │←── Promise(5,─,─) ──│   (no prior accept)                   │
   │←─────────────── Promise(5,─,─) ─────────│                   │
   │←───────────────────────────── Promise(5,3,"v_old") ─────────│
   │                     │                   │                   │
   │   Quorum received. B had accepted "v_old" at n=3.           │
   │   MUST use "v_old" as the value.                            │
   │                     │                   │                   │
   │──── Accept(5,"v_old") ─────────────────→│                   │
   │──── Accept(5,"v_old") ──────────────────────────────────────→│
   │──── Accept(5,"v_old") ──→│              │                   │
   │                     │                   │                   │
   │←── Accepted(5,"v_old")──│               │                   │
   │←──────────────── Accepted(5,"v_old") ───│                   │
   │←───────────────────────────── Accepted(5,"v_old") ──────────│
   │                     │                   │                   │
   │   CHOSEN: "v_old"                       │                   │

Why the proposer must use a previously accepted value (the key safety insight):

If acceptor C had previously accepted "v_old", that means a quorum might have accepted "v_old" in an earlier round (which might have been chosen already). If our new proposer proposes a different value and it gets chosen, we'd have two different chosen values — a safety violation. By constraining the new proposer to use the highest previously-accepted value, Paxos ensures that once a value is chosen (even if the proposer crashed before learning the result), any future successful proposal will choose the same value.

Safety Intuition:

If value V was chosen at round n (quorum accepted it):
  Any later proposal with round n' > n that reaches quorum 
  will include at least one acceptor that accepted V at n.
  That acceptor reports V in its Promise.
  The new proposer sees V as the highest-accepted value.
  Therefore, the new proposer MUST propose V again.
  → Only V can ever be chosen. Safety holds.

Multi-Paxos: Stable Leader Optimization

Single-Decree Paxos agrees on one value. A distributed log needs agreement on a sequence of values (position 1, position 2, ...). The naive approach — run an independent Paxos instance per log entry — requires two round-trips per entry.

Multi-Paxos optimization: run Phase 1 once to establish a stable leader. The leader can then commit entries with Phase 2 only (one round-trip) until it fails.

Multi-Paxos:

  [Phase 1 - once per leader term]
  Leader → Acceptors: Prepare(ballot=B)
  Acceptors → Leader: Promise(B, log_entries_accepted_so_far)

  [Phase 2 - once per log entry while leader is stable]
  Leader → Acceptors: Accept(ballot=B, slot=i, value=v)
  Acceptors → Leader: Accepted(B, i, v)

  Leader commits to learners once quorum achieved.

This reduces steady-state consensus from 2 round-trips to 1 round-trip, at the cost of needing a new Phase 1 when the leader fails.

Paxos Liveness Problem

Paxos can livelock. Consider two proposers racing:

Proposer 1: Prepare(n=1) → gets quorum
Proposer 2: Prepare(n=2) → interrupts, gets quorum
  → Acceptors now have highest_prepared_n=2. P1's Accept(1,...) rejected.
Proposer 1: Prepare(n=3) → interrupts P2
  → Acceptors now have highest_prepared_n=3. P2's Accept(2,...) rejected.
... infinite loop

Solutions: 1. Stable leader (Multi-Paxos): Only one designated proposer. Others defer to it. 2. Random backoff: Each proposer waits a random time before retrying, reducing collision probability. 3. Leader leases: Leader holds a time-limited lease. Others cannot propose until the lease expires.

Byzantine Paxos

Standard Paxos assumes crash-stop failures: nodes either work correctly or stop. Byzantine Paxos tolerates arbitrary failures (Byzantine generals). It requires 3f+1 nodes to tolerate f Byzantine faults (vs 2f+1 for crash-stop).

PBFT (Practical Byzantine Fault Tolerance, Castro & Liskov, 1999) is the most cited Byzantine consensus protocol. It's used in permissioned blockchain systems (Hyperledger Fabric). The communication complexity is O(n²) per consensus round, limiting practical cluster sizes to ~100 nodes.

Historical Context

1989: Lamport writes "The Part-Time Parliament," describing the Paxos algorithm using an extended metaphor of an ancient Greek parliament. The paper is submitted to ACM Transactions on Computer Systems, where reviewers find the narrative style confusing. The paper sits rejected or unreviewed for years.

1996: After years of informal circulation, Lamport resubmits. The paper is finally published in 1998 in ACM Transactions on Computer Systems, 16(2), 133–169. It wins the 2013 SIGOPS Hall of Fame Award.

1999: Castro and Liskov publish "Practical Byzantine Fault Tolerance" at OSDI, making Byzantine consensus practical for the first time.

2001: Lamport publishes "Paxos Made Simple" in ACM SIGACT News as a 14-page simplification. This becomes the primary reference for implementers.

2006: Google publishes "Chubby: The Chubby Lock Service for Loosely-Coupled Distributed Systems" (Burrows). Chubby is built on Multi-Paxos and becomes the coordination backbone for GFS, Bigtable, and MapReduce.

2007: Google builds Megastore (Baker et al.) on Paxos groups per entity group. Later, Spanner (2012) uses Paxos per shard.

2008: Apache ZooKeeper is open-sourced. It uses ZAB (ZooKeeper Atomic Broadcast), a Paxos variant optimized for primary-backup replication rather than state machine replication.

2014: Ongaro and Ousterhout publish Raft as "Paxos Made Understandable," acknowledging that Paxos implementations diverge widely and are hard to teach.

Production Examples

Google Chubby: The canonical Paxos deployment. A 5-node lock service used by every Google infrastructure component. GFS uses Chubby for master election; Bigtable uses it for tablet server election and schema storage. Chubby implements Multi-Paxos with a stable master. Described in Burrows (2006).

Google Spanner: Each shard (Paxos group) runs Multi-Paxos independently. The leader of each group handles reads and writes for that shard. Leaders hold time-limited leases and can serve reads without a full consensus round. Spanner extends this to distributed transactions across shards using a two-phase commit on top of Paxos.

Apache ZooKeeper: Uses ZAB (ZooKeeper Atomic Broadcast) rather than vanilla Paxos. ZAB is a Paxos variant that uses a primary-backup model: a single leader broadcasts ordered updates; followers apply them in order. ZAB's key difference from Paxos: ZAB ensures all updates from the previous leader are committed before a new leader begins. Used by Kafka (until KRaft), Hadoop HDFS (NameNode high availability), and HBase.

Apache Kafka (KRaft): Kafka 2.8+ replaced ZooKeeper with KRaft (Kafka Raft), its own Raft implementation for metadata management. This eliminates the ZooKeeper dependency and allows Kafka controllers to manage metadata directly via Raft log replication.

Debugging Notes

Paxos implementations are notoriously subtle. Bugs typically fall into several categories:

Phase 1 quorum not enforced: A proposer proceeds to Phase 2 without hearing from a majority. This breaks safety — two proposers might choose different values.
Ballot number reuse: Ballot numbers must be globally unique and monotonically increasing. Using the same ballot number twice (e.g., after a crash) violates safety. Solutions: persistent ballot storage, or use a scheme like max_seen_ballot + 1 on restart.
Acceptor state loss: If an acceptor forgets its highest_prepared_n or accepted_value after a crash (non-persistent state), it might accept proposals that conflict with previously made promises. State must be persisted to durable storage before responding.
Learner notification races: The proposer might crash after a value is chosen but before notifying all learners. Learners must be able to re-query acceptors to determine the chosen value.

Testing: The Jepsen framework has been used to verify Paxos-based systems. FoundationDB tests its Paxos implementation with a deterministic simulation that can run millions of fault injection scenarios per second.

Security Implications

Quorum hijacking: If an attacker controls a majority of acceptors, they control consensus. Authentication between proposers and acceptors is essential (mTLS or equivalent).
Ballot number prediction: If ballot numbers are predictable (e.g., sequential), an attacker might be able to preemptively send higher-numbered Prepare messages, disrupting liveness. Use globally unique, large ballot numbers (e.g., UUID-based or (timestamp, node_id) pairs).
State persistence attacks: If an acceptor's persistent state can be corrupted (disk attack, hardware fault), the acceptor might make conflicting promises. Checksums and write-ahead logging with verification protect against this.
Denial of service: A single Byzantine proposer can continuously send high-ballot Prepare messages, preventing any value from being committed (the livelock attack). Multi-Paxos's stable leader and lease mechanisms mitigate this.

Performance Implications

Steady-state (Multi-Paxos with stable leader): - Single write: 1 round-trip from leader to quorum + fsync at each acceptor - In datacenter: ~5–15ms per commit - Google Chubby: handles ~10,000 operations/second - Google Spanner: ~1,000–10,000 commits/second per Paxos group (bounded by disk I/O)

Leader election (Phase 1): - 1 additional round-trip - Happens only on leader failure, so infrequent - ZooKeeper: ~200ms typical election time in a 5-node ensemble

Batching: Leaders should batch multiple client requests into a single Paxos round. This amortizes the round-trip cost. Google Spanner batches up to ~1000 mutations per Paxos round.

Read optimization: Leaders can serve reads without a Paxos round if they have a valid lease (they're guaranteed to be the leader for the lease duration). This is called "leader reads" or "local reads" and reduces read latency to near-zero network overhead.

Failure Modes and Real Incidents

2012, "Paxos Made Live" implementation bugs (Chandra et al.): Google's paper "Paxos Made Live — An Engineering Perspective" (2007) describes the challenges of implementing Paxos in production. Key bugs encountered: disk corruption causing acceptors to forget state, snapshot/restore interactions with Paxos log causing divergence, master lease expiration racing with master election.

2015, ZooKeeper data loss: A ZooKeeper bug (ZOOKEEPER-2355) caused data loss during leader election under specific failure scenarios. The bug was in ZAB's leader election: the new leader did not correctly recover all committed transactions from the previous epoch before starting a new epoch. Committed client operations were lost.

2019, etcd Raft bug: A bug in etcd's Raft implementation (a Paxos variant) caused a node that had been offline for an extended period to rejoin the cluster and replay stale entries. The entries had been committed and the log compacted (snapshotted), but the rejoining node's log diverged before the snapshot point. This was fixed in etcd 3.4.

Modern Usage

Paxos is the algorithmic backbone of most serious distributed systems, even if they use Raft (a Paxos variant) at the surface:

CockroachDB: Uses Raft (a Paxos variant) per range (shard). Each range is replicated across 3–7 nodes.
TiKV: RocksDB-backed key-value store using Raft for replication. Used as TiDB's storage layer.
ScyllaDB: Uses Raft for schema changes and strongly-consistent operations (replacing Cassandra's eventual consistency for DDL).
AWS DynamoDB: Uses Multi-Paxos for its strongly-consistent tier (not published, but referenced in internal documentation).
Hyperledger Fabric: Uses PBFT (Byzantine Paxos) for permissioned blockchain consensus.

Future Directions

Flexible Paxos (Howard et al., 2016): Shows that different phases of Paxos can use different quorum sizes. Phase 1 (Prepare) quorum + Phase 2 (Accept) quorum only needs to intersect, not be a majority each. This enables latency-optimized configurations (e.g., Phase 1 quorum = 1, Phase 2 quorum = n, for fast commits with slow elections).
Paxos in NVMe era: As NVM Express SSDs reduce fsync latency to <100µs, Paxos commit latency is dominated by network round-trips rather than disk I/O. New protocol designs optimize for this.
Geo-distributed Paxos: Systems like CockroachDB and Spanner run Paxos across multiple datacenters. Research on "geo-Paxos" (Elnikety et al.) optimizes for the case where the leader is geographically close to a majority of acceptors.

Exercises

Paxos Simulation: Implement single-decree Paxos in Python with 5 acceptors. Simulate two proposers racing: Proposer 1 with ballot n=1 and Proposer 2 with ballot n=2. Have them interleave Phase 1 and Phase 2 messages. Show that only one value is chosen, even if Proposer 1 completes Phase 1 first and Proposer 2 interrupts before Phase 2 completes.
Safety Proof Exercise: Walk through this scenario manually: Round 1, Proposer A gets quorum, acceptors A1/A2/A3 accept value "X". Proposer A crashes before notifying learners. Proposer B starts Round 2 with n=2. B gets promises from A2, A3, A4 (different quorum). B sees that A2 and A3 accepted "X" in round 1. Must B propose "X"? Why? What would happen if B proposed "Y"?
Multi-Paxos Log: Implement a Multi-Paxos log that can commit a sequence of string commands. Simulate a leader crash after committing 10 commands. Have a new leader take over via Phase 1. Verify that the new leader's Phase 1 responses allow it to recover the 10 already-committed commands before accepting new ones.
Ballot Number Design: Design a ballot number scheme for a 5-node cluster where each node can independently generate unique, monotonically increasing ballot numbers without coordination. Requirements: (a) ballots from the same node must be monotonically increasing, (b) ballots from different nodes must be totally ordered (no ties), (c) the scheme must survive node reboots. Hint: use (timestamp, node_id).
FLP Thought Experiment: Consider a 3-process system where P3 may crash. Construct a specific execution where, if the algorithm waits for P3 (assuming it's slow), safety is maintained but liveness is sacrificed. Then construct an alternative execution where, if the algorithm proceeds without P3 (assuming it's crashed), liveness is maintained but safety might be violated. Write down both executions explicitly with message sequences.

References

Lamport, L. (1998). "The Part-Time Parliament." ACM Transactions on Computer Systems, 16(2), 133–169.
Lamport, L. (2001). "Paxos Made Simple." ACM SIGACT News, 32(4), 18–25.
Fischer, M. J., Lynch, N. A., & Paterson, M. S. (1985). "Impossibility of Distributed Consensus with One Faulty Process." Journal of the ACM, 32(2), 374–382.
Chandra, T. D., Griesemer, R., & Redstone, J. (2007). "Paxos Made Live — An Engineering Perspective." PODC 2007.
Burrows, M. (2006). "The Chubby Lock Service for Loosely-Coupled Distributed Systems." OSDI 2006.
Castro, M., & Liskov, B. (1999). "Practical Byzantine Fault Tolerance." OSDI 1999.
Howard, H., Malkhi, D., & Spiegelman, A. (2016). "Flexible Paxos: Quorum Intersection Revisited." arXiv:1608.06696.
Corbett, J. C., et al. (2012). "Spanner: Google's Globally Distributed Database." OSDI 2012.