07 — Leader Election

Technical Overview

Leader election is the process by which a group of distributed processes selects one process to act as a coordinator or primary. Many distributed algorithms (Paxos, Raft, primary-backup replication) require a single leader. Leader election determines which process takes that role, and re-elects when the current leader fails. The central challenge is preventing split-brain: two processes both believing they are the leader simultaneously, each accepting writes, leading to divergent state and data corruption.

Prerequisites

Distributed systems fundamentals (01-distributed-systems-fundamentals.md)
CAP theorem (02-cap-theorem.md)
Paxos and Raft consensus (05-paxos.md, 06-raft.md)
Understanding of distributed locks and leases

Core Content

Why Leader Election Is Needed

Many distributed algorithms assume one coordinator:

Single-leader replication: One primary accepts writes; replicas follow. Eliminates write conflicts.
Distributed coordination: A single ZooKeeper leader processes all transactions in order.
Kafka controller: One broker manages partition leadership assignments across the cluster.
Kubernetes controller manager: One process reconciles cluster state to avoid conflicting reconciliation.

Without leader election, either you have no coordinator (fully peer-to-peer, complex conflict resolution), or you hardcode a coordinator (single point of failure).

The Bully Algorithm (Garcia-Molina, 1982)

The Bully Algorithm is the simplest leader election algorithm, presented by Hector Garcia-Molina in 1982. Each process has a unique numeric ID. The process with the highest ID becomes leader.

Bully Algorithm:

Assumptions: synchronous system, processes know all other process IDs.

1. A process P detects the leader has failed (no heartbeat).
2. P sends Election(P.id) to all processes with higher IDs.
3. If no response within timeout: P declares itself leader,
   sends Coordinator(P.id) to all processes with lower IDs.
4. If any higher-ID process responds: that process takes over,
   repeating the election (bullies lower processes).

Example (5 processes, P5 is current leader, P5 fails):
  P4 detects failure, sends Election to P5. No response.
  P4 sends Coordinator to P1, P2, P3. P4 is new leader.

  If P3 detects failure first:
  P3 → P4, P5: Election
  P4 responds to P3: "I'm taking over"
  P4 → P5: Election (no response)
  P4 → P1, P2, P3: Coordinator

Problems with Bully: - Assumes synchrony (bounded message delays) — fails in async systems. - If a high-ID process is slow (not failed), it bullies lower processes and then fails to serve, causing oscillation. - O(n²) messages in the worst case. - Does not prevent split-brain in a network partition.

Ring Election Algorithm (LeLann 1977, Chang-Roberts 1979)

Processes are arranged in a logical ring. Each process sends a message around the ring, keeping the maximum ID seen. The process whose ID returns to it (highest in the ring) becomes leader.

Ring Election:

  Ring: P1 → P2 → P3 → P4 → P5 → P1

1. P3 detects leader failure. Sends Election(3) to P4.
2. P4: max(4, 3)=4, forwards Election(4) to P5.
3. P5: max(5, 4)=5, forwards Election(5) to P1.
4. P1: max(5, 1)=5, forwards Election(5) to P2.
5. P2: max(5, 2)=5, forwards Election(5) to P3.
6. P3 receives Election(5): 5≠3, forwards Election(5) to P4.
   ... continues around ring
7. When P5 receives Election(5): 5==my_id → I am leader!
8. P5 sends Coordinator(5) around the ring.

Properties: O(n) message rounds, O(n²) total messages (worst case). Works in asynchronous systems. Still vulnerable to partition: if the ring is split, each half might elect a leader.

Raft's Leader Election (Randomized Timeouts)

Raft's election mechanism (described in 06-raft.md) uses randomized timeouts to avoid split votes. Key properties: - Election timeouts are randomized (150–300ms) - Any server can start an election; the first to timeout usually wins - A server only votes for a candidate with an up-to-date log - Majority vote required

This is the most widely used mechanism in production systems because it's simple, well-analyzed, and resistant to common failure modes.

ZooKeeper's ZAB Leader Election

ZooKeeper uses the Fast Leader Election (FLE) algorithm, part of ZAB (ZooKeeper Atomic Broadcast):

ZooKeeper Fast Leader Election:

1. All servers start in LOOKING state.
2. Each server broadcasts a vote: (epoch, myZxid, myId)
   where epoch = current election epoch
         myZxid = most recent transaction ID in log
         myId = server ID
3. On receiving a vote (e, z, id):
   - If e > my_epoch: update epoch, revote for (e, z, id)
   - If e == my_epoch:
     - If z > myZxid, OR (z == myZxid AND id > myId):
         revote for (e, z, id)
     - Else: keep my vote
4. Count votes. If any server has quorum of votes: that server is leader.
5. Winner enters LEADING state; others enter FOLLOWING state.

Key: server with highest (epoch, zxid, id) wins.
This ensures the most up-to-date server becomes leader.

Distributed Lock-Based Election

A common pattern uses a distributed lock service for election:

Election via Distributed Lock (e.g., ZooKeeper ephemeral nodes):

1. All candidates try to create /election/leader (ephemeral, sequential)
2. ZooKeeper creates nodes like: /election/leader-0001, /election/leader-0002
3. The process with the LOWEST sequential number is the leader.
4. Others watch the node just below them:
   - /leader-0003 watches /leader-0002
   - /leader-0002 watches /leader-0001
5. If /leader-0001 disappears (leader fails):
   - /leader-0002 becomes leader (it has the lowest number now)
   - Only one watcher triggers → no "herd effect"
6. If /leader-0002's node disappears:
   - /leader-0003 watches the new lowest node (/leader-0001)

Ephemeral nodes in ZooKeeper are automatically deleted when the creating session times out. This ensures the lock is released if the lock-holder crashes without explicit cleanup.

etcd Leases: etcd provides an equivalent mechanism via leases. A process creates a key with a TTL (lease). The lease is maintained by periodic KeepAlive heartbeats. If the leader crashes, the lease expires, and another candidate creates the key.

# etcd leader election pseudocode
import etcd3

client = etcd3.client()
lease = client.lease(ttl=10)  # 10-second lease

def campaign():
    key = "/election/leader"
    value = my_node_id
    # Try to create; only succeeds if key doesn't exist
    success = client.put_if_not_exists(key, value, lease=lease)
    if success:
        print("I am the leader")
        while leader_still_running():
            lease.refresh()  # Keep the lease alive
    else:
        watch_for_leader_failure(key)

Split-Brain Prevention

Split-brain is the condition where two nodes both believe they are the leader. This is the most dangerous failure mode in leader election.

Causes of split-brain: 1. Network partition: Two halves each elect their own leader. 2. GC pause: Leader pauses for GC (minutes in Java), followers time out and elect new leader, then old "leader" resumes thinking it's still leader. 3. Slow disk: Leader's heartbeat loop blocks on a slow fsync, followers time out.

Prevention mechanisms:

1. Fencing tokens (Martin Kleppmann, "Designing Data-Intensive Applications", 2017):

Fencing Token Mechanism:

1. Lock service issues fencing token with each lock grant.
   Token is a monotonically increasing number.

   Leader election grant → token=15

2. All writes to storage must include the current fencing token.
3. Storage rejects writes with tokens older than the highest seen.

Example:
  Old leader (token=14) wakes from GC pause, tries to write: REJECTED
  New leader (token=15) writes: ACCEPTED
  Old leader retries: still REJECTED (storage has seen token=15)
  Old leader must step down (realizes it's not current leader)

  ┌──────────────────────────────────────────────────────────────┐
  │  Old Leader     Storage          New Leader                  │
  │  token=14                        token=15                    │
  │      │    write(data, token=14)  │          │               │
  │      │──────────────────────────→│          │               │
  │                    REJECT (have seen 15)                     │
  │      │←─────────────────────────│          │               │
  │                                 │          │               │
  │                                 │  write(data, token=15)    │
  │                                 │←──────────────────────────│
  │                                 │  OK                       │
  │                                 │──────────────────────────→│
  └──────────────────────────────────────────────────────────────┘

2. STONITH (Shoot The Other Node In The Head):

A cluster management protocol where a node that might be split-brained is physically powered off (via IPMI, PDU, or cloud API). If a node doesn't respond to heartbeats, the other nodes use an out-of-band channel to power it off. Used in pacemaker/corosync high-availability clusters for databases.

STONITH Flow:

  Node A (primary) stops responding.
  Node B and C detect A's failure.
  B sends STONITH command to A's IPMI controller: "power off A"
  A is physically powered off.
  B safely takes over as primary.

  (Without STONITH: A might be in a network partition, 
   still accepting writes, creating split-brain)

3. Lease-based exclusion:

A leader holds a time-limited lease. It MUST stop being leader before the lease expires. If it loses contact with the lock service, it voluntarily steps down when the lease expires, even without explicit failure detection. This prevents an isolated leader from continuing to act.

Leader Election in Practice

Kafka Controller Election:

Kafka has one controller broker that manages partition leadership, handles broker failures, and coordinates rebalancing. Controller election uses ZooKeeper (pre-KRaft) or Raft (KRaft):

Kafka ZooKeeper controller election:
1. All brokers try to create /controller (ephemeral node) with their broker ID.
2. First to create wins; ZooKeeper serializes creation.
3. Winner watches /controller for deletion.
4. When the controller dies (or ZK session expires), /controller is deleted.
5. All other brokers race to create /controller again.

Problems with this approach:
- "Controller resign storm" on network partition: ZK session timeout causes
  multiple resignations and elections in quick succession.
- Single ZooKeeper node for critical metadata.

Kafka KRaft (post-3.0):

Kafka's internal Raft consensus replaces ZooKeeper. The metadata cluster (3 or 5 controller nodes) runs Raft. The Raft leader is the active controller. Elections are handled by Raft's randomized timeout mechanism. The Raft log replaces ZooKeeper's znode tree for metadata storage.

Kubernetes Leader Election (client-go):

Kubernetes controllers (scheduler, controller manager, cloud controller) use the k8s.io/client-go/tools/leaderelection library:

// Kubernetes leader election pattern
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
    Lock: &resourcelock.LeaseLock{
        LeaseMeta: metav1.ObjectMeta{
            Name:      "my-controller",
            Namespace: "kube-system",
        },
        Client: k8sClient.CoordinationV1(),
        // ...
    },
    LeaseDuration: 15 * time.Second,  // How long leader holds lease
    RenewDeadline: 10 * time.Second,  // How long to retry renewing
    RetryPeriod:   2 * time.Second,   // How often to retry
    Callbacks: leaderelection.LeaderCallbacks{
        OnStartedLeading: func(ctx context.Context) {
            // I am leader, run controller logic
            runController(ctx)
        },
        OnStoppedLeading: func() {
            // I lost leadership, exit
            os.Exit(0)
        },
        OnNewLeader: func(identity string) {
            // Another node is the leader
        },
    },
})

Kubernetes stores the lease in a Lease object (in etcd). The LeaseDuration (15s default) is how long the lease lasts without renewal. If the leader fails to renew in RenewDeadline (10s), it stops considering itself leader and a new election occurs.

Redis Sentinel:

Redis Sentinel provides automatic failover for Redis primary-replica setups:

Redis Sentinel Architecture:

  ┌───────────┐  ┌───────────┐  ┌───────────┐
  │ Sentinel 1│  │ Sentinel 2│  │ Sentinel 3│
  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
        │         monitoring │           │
        └──────────┬──────────┘           │
                   │                       │
  ┌────────────────▼──────────────────────▼──┐
  │ Redis Primary    Redis Replica 1         │
  │ (master)         Redis Replica 2         │
  └──────────────────────────────────────────┘

Failover:
1. A sentinel marks primary as "subjectively down" (no PING response).
2. Sentinels gossip; if quorum (default=2) sentinels agree: "objectively down."
3. Sentinels elect a Sentinel leader (via a simplified Raft-like vote) to 
   orchestrate the failover.
4. Leader sentinel picks the best replica (most up-to-date) and promotes it.
5. Other replicas are reconfigured to replicate from the new primary.
6. If old primary recovers, it becomes a replica of the new primary.

Redis Sentinel does not prevent split-brain. If sentinels are partitioned from each other, a minority partition may promote a replica while the majority partition still has the original primary. Both are accepting writes. On partition heal, the old primary's writes (acknowledged to clients) are discarded. This is a known limitation; Redis Cluster with Raft provides stronger guarantees.

Historical Context

1977: Le Lann introduces ring-based election algorithms for distributed systems.

1979: Chang and Roberts improve ring election (the Chang-Roberts algorithm) to O(n log n) expected message complexity.

1982: Hector Garcia-Molina publishes the Bully Algorithm in "Elections in a Distributed Computing System" in IEEE Transactions on Computers.

1989: The concept of distributed leases introduced by Cary Gray and David Cheriton in "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency."

2006: Google Chubby (Burrows) publishes the production experience with Paxos-based leader election at Google scale, influencing ZooKeeper's design.

2010: ZooKeeper Fast Leader Election algorithm published as part of ZooKeeper's production deployment documentation.

2014: Raft's randomized timeout election provides a simpler, more understandable election mechanism.

Debugging Notes

Detecting split-brain: 1. Monitor metrics from both potential leaders. If two nodes both report is_leader=true, you have split-brain. 2. Compare write counters: if two nodes both show increasing write counts for the same data, split-brain is active. 3. Check for diverging state: compare the last N writes between the suspected dual-leaders.

Election oscillation (frequent leader changes): - Symptom: leader changes every 5–30 seconds. - Cause: election timeout shorter than max message delivery time (network jitter causes false timeouts). - Fix: increase election timeout, or investigate and fix the source of high latency.

Leader GC pause split-brain: - Java-based systems (ZooKeeper, Kafka, Elasticsearch) are prone to GC pauses causing leadership oscillation. - Fix: use G1 or ZGC for bounded pause times; set ZooKeeper tick time and session timeout to accommodate worst-case GC.

Security Implications

Election manipulation: An attacker who can forge messages from a specific node ID can impersonate that node in Bully or Ring elections. Use authenticated channels (mTLS) for all election messages.
Lease timing attacks: An attacker who can delay lease renewal messages (network manipulation) can cause a valid leader's lease to expire, triggering unnecessary re-election and availability disruption.
STONITH key security: STONITH uses IPMI or cloud APIs to power off nodes. These credentials must be protected; a compromised STONITH mechanism can destroy the entire cluster.
Fencing token forgery: If an attacker can forge a fencing token with a high number, they can prevent legitimate writes. Token generation must use authenticated, monotonically increasing sequences.

Performance Implications

Election latency: Raft elections take 1–3× the election timeout (150–300ms = 300ms–900ms recovery time). During this window, the cluster is unavailable for writes.
Heartbeat frequency: More frequent heartbeats reduce election time (faster failure detection) but increase network overhead. ZooKeeper's tick time (default 2s) and session timeout (default 10s) trade off these concerns.
ZooKeeper session timeout vs latency: A 10s session timeout means a leader failure takes up to 10s to detect. For databases with strict availability SLAs, this is a major concern. Setting session timeout to 3s improves recovery time but increases false failure detection under load.

Failure Modes and Real Incidents

2016, GitHub MySQL Split-Brain: GitHub's primary MySQL server became unreachable due to a network issue. Orchestrator (their automated failover tool) promoted a replica. When connectivity was restored, the old primary had ~30 seconds of writes that the new primary didn't have. Because Redis Sentinel was configured without fencing tokens, the old primary briefly served reads, returning data the rest of the system hadn't seen. 30 seconds of writes were lost after manual reconciliation.

2015, Apache Solr Split-Brain: A Solr SolrCloud cluster experienced a ZooKeeper session timeout during high load. Multiple nodes simultaneously believed they were leader for the same shard. Both accepted indexing requests. After the split was detected, the secondary leader's documents were deleted. Users saw search results that included, then excluded, the same documents depending on which shard replica served the query.

2017, Kubernetes Controller Manager Split-Brain: A bug in the Kubernetes leader election lease refresh mechanism caused two instances of the controller manager to both believe they held the lease during a brief network hiccup. Both instances started reconciling cluster state, creating duplicate objects. The fix was to strengthen the OnStoppedLeading callback to immediately terminate the process, ensuring a clean handoff.

Modern Usage

Leader election is now a solved problem for well-defined failure models, with mature libraries:

HashiCorp Raft (hashicorp/raft): Go library used by Consul, Vault, Nomad.
etcd client leader election: go.etcd.io/etcd/client/v3/concurrency provides election APIs on top of etcd.
Kubernetes leader election: k8s.io/client-go/tools/leaderelection used by all Kubernetes controllers.
Apache Curator (ZooKeeper client): Provides LeaderSelector recipe for ZooKeeper-based election.

Future Directions

Leaderless designs: CRDTs and leaderless replication reduce the need for leader election in some workloads. Amazon Aurora's leaderless replication handles failover without explicit election.
Adaptive timeouts: Machine-learning-based timeout adjustment that dynamically sets election timeouts based on observed network conditions, reducing unnecessary elections.
Serverless leader election: In a serverless environment with ephemeral function instances, traditional leader election doesn't map well. New patterns use atomic counters in databases or cloud-native lock primitives.

Exercises

Bully Algorithm Simulation: Implement the Bully Algorithm for 7 nodes. Simulate the highest-ID node (P7) failing. Trace all messages sent during the election. Count total messages. Now simulate P5 failing (not the highest). Compare message counts. Explain why the Bully Algorithm is inefficient in the second case.
ZooKeeper Election Recipe: Using the ZooKeeper Python client (or kazoo), implement the sequential-ephemeral-node election recipe described above. Deploy three competing "candidates" and verify only one is ever leader. Simulate the leader crashing by killing its process. Measure time to new leader election.
Split-Brain Simulation: Create two "leaders" in your Bully simulation that both believe they are leader due to a simulated network partition. Add a fencing token: each leader gets a monotonically increasing token from a simulated lock service. Show that when the "wrong" leader tries to write with its old token, the write is rejected. Measure how long split-brain persists without fencing tokens vs with them.
Kubernetes Leader Election: Deploy a 3-pod Kubernetes Job that uses the client-go leader election library. Have each pod log its status (leader/follower). Delete the leader pod and measure the time to new leader election. Change LeaseDuration and RenewDeadline and observe the impact on failover time and false election frequency.
STONITH Design: Design a STONITH mechanism for a 3-node cluster running on a cloud provider (AWS, GCP, or Azure). Your design must: (a) use the cloud provider's API to stop a node, (b) handle the case where STONITH itself fails (network issue reaching cloud API), (c) prevent STONITH from being used to create a split-brain rather than resolve one. Sketch the decision flowchart.

References

Garcia-Molina, H. (1982). "Elections in a Distributed Computing System." IEEE Transactions on Computers, C-31(1), 48–59. (Bully Algorithm)
Chang, E. J. H., & Roberts, R. (1979). "An Improved Algorithm for Decentralized Extrema-Finding in Circular Configurations of Processes." Communications of the ACM, 22(5), 281–283.
Gray, C., & Cheriton, D. (1989). "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency." SOSP 1989.
Burrows, M. (2006). "The Chubby Lock Service for Loosely-Coupled Distributed Systems." OSDI 2006.
Hunt, P., Konar, M., Junqueira, F. P., & Reed, B. (2010). "ZooKeeper: Wait-free Coordination for Internet-scale Systems." USENIX ATC 2010.
Ongaro, D., & Ousterhout, J. (2014). "In Search of an Understandable Consensus Algorithm." USENIX ATC 2014.
Kleppmann, M. (2017). Designing Data-Intensive Applications. O'Reilly. (Chapter 8 on split-brain and fencing tokens.)