06 — Raft Consensus
Technical Overview
Raft is a consensus algorithm designed to be more understandable than Paxos. Published by Diego Ongaro and John Ousterhout at Stanford in 2014, Raft decomposes consensus into three largely independent subproblems — leader election, log replication, and safety — and solves each with clear, explicit mechanisms. While Paxos treats these concerns as interleaved and leaves many practical questions unspecified, Raft specifies a complete system with well-defined behaviors for every case.
Raft is now the most widely deployed consensus algorithm in production infrastructure. etcd (Kubernetes' backbone), CockroachDB, TiKV, Consul, and TiDB all use Raft.
Prerequisites
- Paxos consensus (
05-paxos.md) - Distributed systems fundamentals (
01-distributed-systems-fundamentals.md) - Understanding of replicated logs and state machines
- Familiarity with distributed failure models
Core Content
Raft Design Goals
Ongaro's dissertation "Consensus: Bridging Theory and Practice" makes Raft's primary goal explicit: understandability. Every design decision was evaluated by asking "is this easier to understand than the alternative?" This matters because:
- Implementors must understand the algorithm to implement it correctly.
- Operators must understand it to debug failures.
- A misunderstood algorithm leads to subtle bugs in edge cases.
The Ongaro & Ousterhout paper tested comprehension of Raft vs Paxos with Stanford graduate students. On comprehension questions, students who studied Raft significantly outperformed those who studied Paxos.
Raft's Decomposition
Raft solves consensus by decomposing into:
┌─────────────────────────────────────────────────────┐
│ 1. Leader Election │
│ — Select one server as leader │
│ — Randomized timeouts, majority vote │
│ │
│ 2. Log Replication │
│ — Leader accepts client requests │
│ — Replicates entries to followers │
│ — Entries committed when quorum acknowledges │
│ │
│ 3. Safety │
│ — Only servers with up-to-date logs can win │
│ elections │
│ — Committed entries are never lost │
└─────────────────────────────────────────────────────┘
Raft Terms
Time in Raft is divided into terms — consecutive integers. Each term begins with an election.
Terms:
Term 1: [Election: S1 wins] ──────────────[S1 is leader]──────→
Term 2: [Election: split vote] ──────────→ (no leader, term ends)
Term 3: [Election: S2 wins] ──────────────[S2 is leader]──────→
──────────────────────────────────────────────────────── time →
| Term 1 | Term 2 | Term 3 |
Terms serve as logical clocks in Raft. Each message includes the sender's current term. If a server receives a message with a higher term number, it updates its term and converts to follower (if it was leader or candidate). This ensures stale leaders immediately step down when they learn a newer election occurred.
Leader Election State Machine
Server States:
┌──────────┐ election ┌──────────────┐
│ Follower │─────timeout────→│ Candidate │
└──────────┘ └──────────────┘
↑ │ │
│ discovers │ │ receives
│ current wins │ │ votes from
│ leader or election │ │ majority
│ new term │ │ │
│ ↓ ↓ │
│ ┌──────────────────────┐ │
└───────────│ Leader │←┘
└──────────────────────┘
│
│ discovers server with
│ higher term
↓
┌──────────┐
│ Follower │
└──────────┘
Election Process:
1. Follower election timeout fires (150–300ms, randomized)
2. Server increments term, transitions to Candidate
3. Votes for itself, sends RequestVote RPCs to all others
4. RequestVote includes: (term, candidateId, lastLogIndex, lastLogTerm)
5. Server grants vote if:
a. Haven't voted in this term yet (or voted for this candidate)
b. Candidate's log is at least as up-to-date as voter's log
(lastLogTerm > myLastLogTerm, OR
lastLogTerm == myLastLogTerm AND lastLogIndex >= myLastLogIndex)
6. Candidate wins if it receives votes from majority
7. Winner immediately sends AppendEntries (heartbeat) to all,
asserting leadership and preventing new elections
Why randomized timeouts? If all servers had the same timeout, they would all start elections simultaneously, likely causing split votes repeatedly. Randomizing the timeout (150–300ms in the original Raft) means one server usually times out before others, starts an election, and wins before others can start their own.
Log Replication
Log Structure (each server maintains):
Index: 1 2 3 4 5
Term: 1 1 2 2 3
Entry: [cmd1][cmd2][cmd3][cmd4][cmd5]
↑ ↑
committed uncommitted
Leader Log Replication:
1. Client sends command to leader.
2. Leader appends entry to its log (in memory first).
3. Leader sends AppendEntries RPCs to all followers in parallel:
AppendEntries(term, leaderId,
prevLogIndex, prevLogTerm, ← consistency check
entries[], leaderCommit)
4. Followers append if consistency check passes:
- Their log contains entry at prevLogIndex with term prevLogTerm
5. Once majority acknowledged: leader commits the entry.
Sets commitIndex = new entry's index.
6. Leader applies committed entries to state machine.
7. Leader notifies followers of new commitIndex in next AppendEntries.
8. Followers apply entries up to commitIndex to their state machines.
9. Leader responds to client.
Log consistency check: The prevLogIndex and prevLogTerm in AppendEntries implement an inductive check. If a follower has a matching entry at prevLogIndex, it means their logs are identical up to that point (proven by induction on the log history). This ensures that followers never have "holes" in their logs.
Log repair: When a new leader is elected, followers might have fewer entries, or might have uncommitted entries from a previous term. The leader sends AppendEntries; if a follower rejects (consistency check fails), the leader decrements nextIndex for that follower and retries. Eventually the leader finds the point where they agree and overwrites the follower's diverging entries.
Log Repair Example:
Leader (S1): [1][1][2][2][3][3]
Follower (S2):[1][1][2] ← behind, needs entries 4,5,6
Follower (S3):[1][1][2][2][x][x] ← has stale entries from old term
S1 sends AppendEntries to S3 starting at index 5:
prevLogIndex=4, prevLogTerm=2
S3 has term x≠2 at index 5 → reject
S1 tries index 4: prevLogIndex=3, prevLogTerm=2
S3 agrees → sends entries for index 4,5,6
S3 overwrites its stale entries with the leader's entries
Log Commitment: The Quorum Write
An entry is committed once the leader has stored it on a majority of servers. Crucially, Raft's safety guarantee adds a constraint: a leader cannot commit entries from previous terms by counting replicas alone.
The Uncommitted Entry Problem:
S1 leader, term 1. Replicates entry at index 3 to S1 and S2.
S1 crashes before committing. S3 becomes leader (term 2).
S3 may not have index 3. S3 replicates different entry at index 3.
Now there are two different values at index 3!
Raft's Solution: A leader ONLY commits entries from its OWN term.
Entries from previous terms are "committed" only as a side effect
of committing a later entry from the current term.
Proof: By the time a leader commits an entry from term T (its term),
it must have replicated it to a majority. Any future leader must get
votes from a majority, and that majority overlaps with the committed
majority. At least one voter has the committed entry and has a log at
least as new → the new leader must have that entry.
Log Compaction: Snapshots
A Raft log that grows forever is impractical. Log compaction replaces the log prefix with a snapshot of the state machine at a specific index.
Before snapshot:
Log: [1][2][3][4][5][6][7][8][9][10]
↑ ↑
old entries commitIndex=7
After snapshot at index 7:
Snapshot: state_at_index_7
Log: [8][9][10]
Snapshot includes: last_included_index=7, last_included_term=3
When a follower is so far behind that the leader has already discarded the log entries it needs, the leader sends the snapshot via InstallSnapshot RPC. The follower replaces its log with the snapshot.
Cluster Membership Changes: Joint Consensus
Adding or removing servers from a Raft cluster is non-trivial. If you switch from a 3-node cluster to a 5-node cluster abruptly, there's a window where the old majority (2 of 3) and the new majority (3 of 5) disagree, potentially allowing split decisions.
Raft's solution: joint consensus — a two-phase transition through a combined configuration (C_old,new) where decisions require a majority of both the old and new configurations simultaneously.
Membership Change Protocol:
Phase 1: Leader appends joint config entry [C_old,new] to log.
- Any decision needs majority of C_old AND majority of C_new.
- Committed once replicated to majority in both.
Phase 2: Leader appends new config entry [C_new] to log.
- Once committed, old nodes are no longer needed.
- Nodes not in C_new shut themselves down.
Raft vs Paxos
| Property | Paxos | Raft |
|---|---|---|
| Leader designation | Not specified | Explicit leader |
| Log management | Not specified | Complete spec |
| Membership changes | Not specified | Joint consensus |
| Understandability | Hard | Designed for this |
| Equivalence | Same safety class | Same safety class |
| Phase 1 frequency | Per-value (naive) | Per-term |
| Implementation count | Few correct | Many (etcd, etc.) |
Both algorithms achieve the same safety guarantees under the same assumptions (crash-stop failures, partial synchrony). Raft is essentially Multi-Paxos with explicit specification of all the parts Paxos left to implementors.
Historical Context
2013: Diego Ongaro begins his PhD dissertation at Stanford under John Ousterhout, motivated by the difficulty of understanding Paxos implementations in practice.
2014: Ongaro and Ousterhout publish "In Search of an Understandable Consensus Algorithm (Extended Version)" at USENIX ATC. The paper includes the now-famous user study comparing Raft comprehension to Paxos comprehension.
2014–2016: etcd (CoreOS), CockroachDB, and TiKV adopt Raft. The implementations diverge in interesting ways, exposing underspecified aspects of the algorithm (snapshotting, pre-vote, leadership transfer).
2016: etcd's Raft library is open-sourced and becomes the most widely used Raft implementation. Kubernetes switches its state storage to etcd, making Raft indirectly the foundation of the Kubernetes control plane.
2019: TiKV (PingCAP) adopts Multi-Raft with placement driver, managing thousands of Raft groups across a large cluster.
2022: Apache Kafka replaces ZooKeeper with KRaft (Kafka Raft) in production deployments, a significant milestone for Raft adoption.
Production Examples
etcd: The Kubernetes control plane stores all cluster state in etcd — node registrations, pod specs, service endpoints, secrets, ConfigMaps. etcd uses Raft with a 3 or 5-node cluster. A failed etcd cluster means a broken Kubernetes control plane: no new pods can be scheduled, no configuration changes applied. etcd is among the most battle-tested Raft implementations.
CockroachDB: CockroachDB uses Raft at the per-range (shard) level. Each range (typically 64MB) has its own Raft group with 3 replicas. A write to a range goes through that range's Raft leader. CockroachDB runs thousands of Raft groups simultaneously per cluster. The "Raft command timeout" is a common issue when a Raft group's leader is on an overloaded node.
TiKV: The storage layer of TiDB (a distributed SQL database). Uses Multi-Raft to manage shards. PD (Placement Driver) assigns Raft leadership to balance load. TiKV's Raft implementation adds "Raft learner" nodes — replicas that receive log entries but don't vote — for read scaling and data migration.
Consul: HashiCorp Consul uses Raft for its key-value store and service catalog. The Raft implementation is in HashiCorp's hashicorp/raft Go library, widely used in the Go ecosystem. A 3-node Consul server cluster is the standard deployment; 5 nodes for higher availability.
Debugging Notes
Common Raft operational issues:
-
Election storms: Frequent leader elections indicate network instability or an overloaded leader. Check: (a) heartbeat interval vs follower timeout (heartbeat must be << timeout), (b) disk write latency (leader must persist entries before sending heartbeat), (c) GC pauses causing artificial timeouts.
-
Log replication backpressure: If a follower is far behind (network partition, disk issue), the leader batches entries for it. This can exhaust leader memory. Monitor:
leader_pending_bytes_per_follower. -
Snapshot transfer failures: Large snapshots (GBs) can fail to transfer within the
InstallSnapshottimeout. Increase snapshot timeout or implement snapshot chunking. -
Split votes: If you see repeated elections without a leader being elected, check if an odd number of voters is voting. A 4-node cluster with one voter abstaining can have ties. Use odd cluster sizes (3, 5, 7).
-
Pre-vote optimization: Without pre-vote, a partitioned follower that times out will increment its term and start an election. When it rejoins, it disrupts the current leader even though it couldn't win. etcd implements pre-vote: a server checks if it could win an election before incrementing its term.
Key Raft metrics to monitor:
- raft_leader_last_contact (duration since followers heard from leader — exceeding election timeout = danger)
- raft_commit_index vs raft_applied_index (lag indicates slow state machine application)
- raft_replication_lag (per-follower log lag in entries)
Security Implications
- Cluster membership authentication: Only authorized servers should join a Raft cluster. Use mTLS with cluster-specific certificates for inter-node communication. An unauthorized node joining as a voter can disrupt consensus.
- Log entry authentication: In a threat model with Byzantine nodes, Raft is insufficient. An adversary who compromises one node can propose arbitrary log entries. For Byzantine-tolerant consensus, use BFT protocols.
- Sniffing log contents: Raft logs are not encrypted at rest by default. In etcd (Kubernetes), the log contains secrets, ConfigMaps, and certificates in plaintext. Enable etcd encryption-at-rest using the
--encryption-provider-configflag. - Leader trust: The Raft leader has disproportionate power — it can delay log entries or prioritize specific clients. In federated systems, this is a trust concern. Raft assumes the leader is trusted and non-Byzantine.
Performance Implications
Write path latency:
Client → Leader: 0.5ms (network)
Leader log write (fsync): 0.1-2ms (SSD fsync)
Leader → Follower replication: 0.5ms (network)
Follower fsync: 0.1-2ms (SSD fsync)
Follower → Leader ack: 0.5ms (network)
Leader applies to state machine: ~0ms
Leader → Client response: 0.5ms (network)
─────────────────────────────────────────
Total: ~3-8ms (single datacenter)
Throughput optimizations: 1. Batching: The leader batches multiple client requests into one AppendEntries. This amortizes the fsync cost. etcd pipelines up to 64 requests per round. 2. Pipelining: The leader sends AppendEntries for slot N+1 before receiving acknowledgment for slot N. This fills the network pipe. 3. Read index: For linearizable reads without consensus, the leader records the commit index and waits for a full heartbeat round (to confirm it's still leader). This is cheaper than a full Raft round.
Read scaling: Raft reads go through the leader by default (for linearizability). For AP reads (follower reads), Raft follower reads are eventually consistent — the follower applies log entries asynchronously.
Failure Modes and Real Incidents
2018, etcd leader loss under load: At a large Kubernetes cluster (1000+ nodes), etcd leaders were experiencing frequent re-elections. Root cause: the etcd leader was serving 50,000 watch connections. The overhead of sending watch notifications was causing the leader to miss sending heartbeats within the election timeout. Fix: increase election timeout, reduce watch connection count, separate watch traffic from Raft replication.
2019, CockroachDB Raft command deadlock: A bug in CockroachDB's Raft implementation caused commands to deadlock if the Raft log was compacted (snapshotted) while the command was in flight. The command referenced a log index that no longer existed, and the error handling path held a lock that prevented the snapshot from completing. Fixed in CockroachDB 19.1.
2020, TiKV region split storm: When a TiKV cluster underwent rapid data ingestion, many regions hit the split threshold simultaneously. Each split required a Raft configuration change (membership change for the new region). Thousands of simultaneous membership changes overwhelmed the Placement Driver, causing Raft election timeouts to spike. TiKV now rate-limits region splits.
2021, etcd "learner stuck" bug: A Raft learner node (non-voting replica) that failed to receive a snapshot could get stuck in a state where it would never catch up. The learner would request a snapshot, the leader would begin sending it, the learner would restart, and the leader would start over — indefinitely. Fixed by adding a snapshot timeout and retry limit.
Modern Usage
Raft is now the default choice for new consensus implementations:
- FoundationDB uses a Paxos variant but the design is closest to Multi-Raft.
- Pulsar (Apache) uses Raft-based BookKeeper for durable log storage.
- Zookeeper's replacement in most systems is now etcd (Raft-based) — Kafka's KRaft, Kubernetes' native etcd, HDFS's move away from HDFS NameNode HA on ZooKeeper.
- CockroachDB, TiDB, YugabyteDB: All NewSQL databases use Raft per-shard as their consensus primitive.
The Raft paper's contribution is not just the algorithm — it's the specification clarity that enabled high-quality, verified implementations.
Future Directions
- Flexible quorums in Raft: Adapting Flexible Paxos principles to Raft would allow asymmetric quorums for read vs. write, reducing write latency.
- Geo-Raft: Running Raft across multiple datacenters introduces inter-datacenter round-trips into the critical path. Research into "local commit" with global ordering (similar to Spanner's approach) may enable sub-millisecond Raft commits within a region.
- Raft for large clusters: Raft scales to ~100 voting nodes before election message overhead becomes prohibitive. For very large clusters, hierarchical Raft (Raft groups managing other Raft groups) is being explored.
- Formal verification: TLA+ models of Raft (Ongaro's original TLA+ spec plus community extensions) are being used to verify implementations. AWS has published their formal verification of core Raft properties.
Exercises
-
Raft Leader Election: Implement a Raft leader election simulator in Python. Use 5 nodes with random election timeouts (150–300ms simulated time). Run 100 elections and measure: (a) average time to elect a leader after a failure, (b) frequency of split votes, (c) effect of reducing the timeout variance (e.g., 190–200ms vs 150–300ms).
-
Log Replication Trace: Trace a Raft write through the system by hand. 5 servers: S1 (leader), S2, S3, S4, S5. S4 and S5 are partitioned. Client writes "cmd1", "cmd2", "cmd3". Show the AppendEntries RPCs, which servers have which log entries, and when entries are committed. Then partition heals — show how S4 and S5 catch up.
-
Raft TLA+ Spec: Read the Raft TLA+ specification (available from Ongaro's website). Identify the TypeInvariant, and the LogMatching, ElectionSafety, and LeaderCompleteness invariants. Explain in English what each invariant says about the safety of the algorithm.
-
etcd Benchmark: Deploy a 3-node etcd cluster (Docker Compose or Kubernetes). Benchmark: (a) single-key write throughput without batching, (b) single-key write throughput with 100-request batches, (c) impact of fsync=false (unsafe mode). Measure p50 and p99 latency for each. Explain the results in terms of Raft's I/O requirements.
-
Split-Brain Prevention: etcd uses pre-vote to prevent disruption from partitioned followers. Implement pre-vote as an extension to your Raft simulator from Exercise 1. Show a scenario where without pre-vote, a partitioned follower disrupts the cluster on rejoin, and show that with pre-vote, it does not.
References
- Ongaro, D., & Ousterhout, J. (2014). "In Search of an Understandable Consensus Algorithm." USENIX ATC 2014.
- Ongaro, D. (2014). Consensus: Bridging Theory and Practice. Stanford PhD dissertation.
- Lamport, L. (2001). "Paxos Made Simple." ACM SIGACT News, 32(4), 18–25.
- Howard, H., Malkhi, D., & Spiegelman, A. (2016). "Flexible Paxos: Quorum Intersection Revisited." arXiv:1608.06696.
- Corbett, J. C., et al. (2012). "Spanner: Google's Globally Distributed Database." OSDI 2012.
- Apache Kafka KIP-500: "Replace ZooKeeper with a Self-Managed Metadata Quorum." (2020). Apache Kafka Improvement Proposals.
- Huang, C., et al. (2012). "Erasure Coding in Windows Azure Storage." USENIX ATC 2012. (Context for Raft use in Azure.)