etcd in Kubernetes: Consensus, MVCC, and Cluster State

Overview

etcd is Kubernetes's only persistent storage component. Every object in a Kubernetes cluster — every Pod specification, every Service endpoint, every ConfigMap, every Secret, every RBAC binding, every Lease object — lives in etcd. If etcd is lost without a backup, the cluster's state is permanently gone. Running pods will continue running until they fail or are restarted, but the control plane has no memory of their existence. etcd is, without exaggeration, the single most operationally critical component in a Kubernetes deployment.

Understanding etcd means understanding Raft consensus, the MVCC (Multi-Version Concurrency Control) data model, the watch mechanism that makes Kubernetes reactive, the performance characteristics that govern what hardware etcd requires, and the operational tasks (backup, compaction, sizing) that keep etcd healthy under sustained load.

Prerequisites

Understanding of distributed consensus and the CAP theorem
Familiarity with B-tree and B+ tree data structures
Basic understanding of MVCC (as used in databases)
Kubernetes architecture fundamentals (API server, controllers)
Familiarity with TLS mutual authentication

Historical Context

etcd was created by CoreOS (now part of Red Hat/IBM) in 2013, initially to solve the distributed configuration management problem for their fleet cluster manager. It was built on the Raft consensus algorithm, published by Ongaro and Ousterhout in 2014 (though implemented before the paper was published, based on an earlier draft).

Kubernetes adopted etcd as its backing store from the very beginning. The choice was deliberate: etcd's strong consistency guarantees (linearizable reads and writes by default) align with Kubernetes's need for a single consistent view of cluster state. An eventually-consistent store would make it possible for two controller instances to read conflicting state and make conflicting decisions.

etcd v3 (2016) was a significant redesign: the data model changed from a filesystem-like tree (v2) to a flat key-value store with range queries, MVCC was added, and the watch API was made significantly more efficient. Kubernetes migrated from v2 to v3 API in Kubernetes 1.6.

Raft Consensus in etcd

etcd uses Raft to ensure that all cluster members agree on the sequence of operations applied to the state machine. Raft is a consensus algorithm designed to be understandable; it is functionally equivalent to Paxos in fault tolerance but with a clearer separation of concerns.

Leader Election

In steady state a Raft cluster has a single leader, and Raft guarantees at most one leader per term (during an election there may briefly be none). The leader handles all write requests and orchestrates log replication. Election works as follows:

Normal operation (leader healthy):
  Leader ─── AppendEntries (heartbeat) ──→ Follower 1
  Leader ─── AppendEntries (heartbeat) ──→ Follower 2

Leader fails (no heartbeat received within election timeout):
  Follower 1: election timeout fires (randomized; the Raft paper suggests 150-300ms, etcd defaults to a 1000ms base via --election-timeout)
  Follower 1: increments term, votes for self, sends RequestVote to others
              ─── RequestVote(term=5, lastLogIndex=42) ──→ Follower 2
  Follower 2: grants vote (term 5 > last known term 4, log up to date)
              ─── VoteGranted ──→ Follower 1
  Follower 1: has majority (self + Follower 2) → becomes Leader for term 5

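Randomized election timeouts (each follower picks a random timeout independently) make split votes unlikely, because one follower usually times out and collects a majority before the others start their own elections. If two followers do start elections simultaneously and neither gets a majority, the term ends without a leader and a new round begins with fresh random timeouts.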
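The vote-granting rule used by Follower 2 above can be sketched in a few lines. This is a minimal illustration of the rule from the Raft paper, not etcd's implementation; the names FollowerState and handle_request_vote are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FollowerState:
    current_term: int
    voted_for: Optional[str]
    last_log_term: int
    last_log_index: int


def handle_request_vote(state, candidate_id, term, last_log_term, last_log_index):
    """Return True if the vote is granted, updating state in place."""
    if term < state.current_term:
        return False  # candidate is from a stale term
    if term > state.current_term:
        # Newer term: adopt it and forget any vote cast in the older term.
        state.current_term = term
        state.voted_for = None
    # The candidate's log must be at least as up to date as ours:
    # a higher last term wins, and with equal terms the longer log wins.
    log_ok = (last_log_term, last_log_index) >= (
        state.last_log_term,
        state.last_log_index,
    )
    if log_ok and state.voted_for in (None, candidate_id):
        state.voted_for = candidate_id
        return True
    return False
```

A follower grants at most one vote per term, which is why two candidates in the same term can never both reach a majority.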

Log Replication

Client writes: PUT /key "value"

1. Client sends request to Leader
2. Leader appends entry to its local log (not yet committed)
3. Leader sends AppendEntries to all followers in parallel
4. Followers append to their logs, respond with success
5. Once MAJORITY has appended (quorum):
   - Leader commits the entry (advances commitIndex)
   - Leader applies to state machine (writes to bbolt)
   - Leader responds to client with success
6. Leader notifies followers of new commitIndex in next AppendEntries
7. Followers commit and apply their copies

┌─────────┐    ┌─────────┐    ┌─────────┐
│ Leader  │    │Follower1│    │Follower2│
│         │    │         │    │         │
│ [1][2]  │    │ [1][2]  │    │ [1][2]  │
│ [3]─────┼────┼──[3]    │    │         │ ← 3 appended to 2/3 = committed
│ [4]─────┼────┼──[4]    │    │         │   even if Follower2 is slow
│         │    │         │    │         │
│ commit─────→ │ commit  │    │         │
│ Index=4 │    │ Index=4 │    │         │
└─────────┘    └─────────┘    └─────────┘

Term numbers provide a total ordering of leadership periods. A node receiving a message with a higher term immediately reverts to follower state. Log entries are identified by (term, index) pairs — if two entries have the same (term, index), they are guaranteed to have the same content.
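The "once a majority has appended" rule reduces to a small calculation over the highest index each member is known to hold. The sketch below assumes a hypothetical match_index list that includes the leader itself, and it omits Raft's extra restriction that a leader only commits entries from its own term this way.

```python
def commit_index(match_index):
    """Highest log index stored on a majority of members.

    match_index holds, for every member including the leader, the highest
    log index known to be replicated on that member.
    """
    quorum = len(match_index) // 2 + 1
    # Sorted descending, the quorum-th largest value is held by at least
    # `quorum` members, so every entry up to it is safely replicated.
    return sorted(match_index, reverse=True)[quorum - 1]
```

For the diagram above, the leader and Follower 1 hold index 4 while Follower 2 holds index 2, so commit_index([4, 4, 2]) is 4 even though one member is behind.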

Quorum and Fault Tolerance

Cluster size  Quorum  Tolerated failures
     1          1           0 (no fault tolerance)
     3          2           1 (tolerate 1 failure)
     5          3           2 (tolerate 2 failures)
     7          4           3 (tolerate 3 failures)

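5-node clusters are recommended for critical production clusters: they tolerate 2 simultaneous failures and spread read load across more members (any member can serve serializable reads locally, and followers can serve linearizable reads after confirming the current commit index with the leader). The cost is write latency, since every write must reach 3 of 5 members instead of 2 of 3. 3-node is common for clusters where the cost of 5 nodes is significant.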

Even-numbered clusters (2, 4, 6) offer no improvement over the next smaller odd number (2 can tolerate 0 failures, no better than 1; 4 tolerates 1 failure, same as 3) — avoid even-numbered etcd clusters.
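The table and the even-number rule both follow from one formula, quorum = floor(n/2) + 1. A short sketch, using hypothetical helper names:

```python
def quorum(members):
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members):
    """Members that can fail while a majority remains reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} members: quorum={quorum(n)}, tolerates={tolerated_failures(n)}")
```

Going from 3 to 4 members raises the quorum from 2 to 3 without raising the number of tolerated failures, so the fourth member adds replication cost and no resilience.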

etcd Data Model: MVCC Key-Value Store

etcd stores key-value pairs where both key and value are byte arrays. Keys are ordered lexicographically, enabling range queries (etcdctl get /registry/pods --prefix returns all keys starting with /registry/pods).
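A prefix query is a range query in disguise: the client turns the prefix into a half-open interval [prefix, end) over the lexicographic key order. The sketch below shows one way to compute that end key; prefix_range_end is a hypothetical name, and the handling of the all-0xff case follows the etcd API convention that a range end of "\0" means "no upper bound".

```python
def prefix_range_end(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key starting with prefix."""
    end = bytearray(prefix)
    # Increment the last byte that is not 0xff and drop everything after it.
    for i in reversed(range(len(end))):
        if end[i] < 0xFF:
            end[i] += 1
            return bytes(end[: i + 1])
    # Empty or all-0xff prefix: no finite upper bound exists.
    return b"\x00"


def in_prefix_range(key: bytes, prefix: bytes) -> bool:
    """True if key falls in [prefix, prefix_range_end(prefix))."""
    end = prefix_range_end(prefix)
    return key >= prefix and (end == b"\x00" or key < end)
```

Because the match is purely byte-wise, a prefix without a trailing slash also matches any sibling resource whose name begins with the same characters; querying /registry/pods/ avoids that.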

MVCC: Multi-Version Concurrency Control

etcd stores every version of every key. When a key is written, etcd creates a new version rather than overwriting the old one. Each write increments the global revision counter.

Timeline of operations:
  rev=1: PUT /foo "bar"      → /foo @ rev=1 = "bar"
  rev=2: PUT /baz "qux"      → /baz @ rev=2 = "qux"
  rev=3: PUT /foo "updated"  → /foo @ rev=3 = "updated"
                                /foo @ rev=1 = "bar" (still accessible)
  rev=4: DELETE /baz         → /baz @ rev=4 (tombstone)

GET /foo                     → "updated" (current revision)
GET /foo --rev=1             → "bar" (historical revision)
GET /foo --rev=2             → "bar" (rev=2 modification was to /baz)

The MVCC model enables several critical features:
- Point-in-time snapshots: reads at a specific revision are consistent
- Watch with history: "give me all changes since revision N" enables reliable event delivery even after a brief disconnection
- Optimistic concurrency control: a v3 transaction can compare a key's mod_revision against an expected value and write only if it still matches, which lets controllers safely update objects without holding locks (Kubernetes surfaces this as resourceVersion)
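The timeline above can be reproduced with a toy in-memory store. This is a simplified model for illustration only: MVCCStore and its methods are hypothetical, and real etcd indexes revisions in a B-tree and stores values in bbolt rather than in Python lists.

```python
class MVCCStore:
    """Toy multi-version store: every write appends a new revision."""

    def __init__(self):
        self.revision = 0
        self.history = {}  # key -> list of (mod_revision, value); None = tombstone

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def delete(self, key):
        if self.get(key) is None:
            return self.revision  # nothing to delete, revision unchanged
        return self.put(key, None)  # record a tombstone at a new revision

    def get(self, key, rev=None):
        """Value of key as of revision rev (default: latest)."""
        rev = self.revision if rev is None else rev
        value = None
        for mod_rev, v in self.history.get(key, []):
            if mod_rev > rev:
                break
            value = v
        return value

    def mod_revision(self, key):
        """Revision of the last write to key, or 0 if it does not exist."""
        versions = self.history.get(key, [])
        if not versions or versions[-1][1] is None:
            return 0
        return versions[-1][0]

    def put_if_unchanged(self, key, value, expected_mod_revision):
        """Compare-and-set: write only if nobody modified key since we read it."""
        if self.mod_revision(key) != expected_mod_revision:
            return False
        self.put(key, value)
        return True
```

A controller that read /foo at mod_revision 1 and tries put_if_unchanged after someone else wrote revision 3 gets False back and must re-read, which is the same conflict Kubernetes reports as a 409 on a stale resourceVersion.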

Kubernetes Keys in etcd

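Kubernetes stores objects under /registry/<resource>/<namespace>/<name> (cluster-scoped resources omit the namespace, and a few resources such as Services use a slightly different layout):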

/registry/pods/default/my-pod
/registry/pods/kube-system/coredns-abc123
/registry/services/specs/default/my-service
/registry/deployments/default/my-deployment
/registry/secrets/default/my-secret
/registry/configmaps/kube-system/kubeadm-config
/registry/leases/kube-system/kube-controller-manager  ← leader election
/registry/events/default/my-pod.abc123

Built-in objects are serialized as protocol buffers (JSON was used with the etcd v2 backend; protobuf was adopted with v3 for efficiency), while custom resources are stored as JSON. The API server handles serialization/deserialization; etcd only stores bytes.

Watch Mechanism

The watch API is what makes Kubernetes reactive. Instead of polling the API server (and therefore etcd), Kubernetes controllers open long-lived watch connections. When any watched object changes, the watch delivers an event immediately.

Controller start sequence:
  1. Controller performs initial LIST (full sync) and notes the revision R that the result reflects
  2. Controller processes all existing pods (reconcile initial state)
  3. Controller opens watch starting just after that revision: WATCH /registry/pods --rev=R+1
  4. New events arrive on watch stream as pods are created/updated/deleted
  5. Controller processes each event (reconcile changed state)

If the watch stream is interrupted (network partition, etcd restart):
  6. Controller records the last received revision
  7. Controller reopens watch from last revision: WATCH /registry/pods --rev=N
  8. etcd delivers all events since revision N (MVCC history)
  9. Controller processes missed events without full re-LIST

If revision N has been compacted (see below):
  10. etcd returns ErrCompacted
  11. Controller falls back to full LIST + re-establish watch

This design provides efficient, reliable change notification. The MVCC history guarantees no events are missed even during brief disconnections, as long as the watch is re-established before the relevant revisions are compacted.
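The list, watch, resume, and re-list cycle can be sketched against an in-memory stand-in for etcd. Everything here is a hypothetical model (FakeStore, Controller, Compacted are illustrative names); in particular, the fake compaction simply drops old events, which is enough to show when a resume is possible and when a full re-LIST is forced.

```python
class Compacted(Exception):
    """Raised when a watch asks for history that has been compacted away."""


class FakeStore:
    """In-memory stand-in for etcd: current state plus an ordered event log."""

    def __init__(self):
        self.revision = 0
        self.compacted = 0  # events with revision <= compacted are gone
        self.state = {}
        self.events = []    # list of (revision, key, value)

    def put(self, key, value):
        self.revision += 1
        self.state[key] = value
        self.events.append((self.revision, key, value))

    def compact(self, rev):
        self.compacted = rev
        self.events = [e for e in self.events if e[0] > rev]

    def list(self):
        """Full snapshot plus the revision it reflects."""
        return dict(self.state), self.revision

    def watch(self, after_rev):
        """Events with revision > after_rev, if that history still exists."""
        if after_rev < self.compacted:
            raise Compacted(after_rev)
        return [e for e in self.events if e[0] > after_rev]


class Controller:
    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.last_rev = 0
        self.relists = 0
        self.resync()  # initial LIST

    def resync(self):
        self.cache, self.last_rev = self.store.list()
        self.relists += 1

    def poll(self):
        """Resume the watch from the last seen revision, or re-LIST if compacted."""
        try:
            events = self.store.watch(self.last_rev)
        except Compacted:
            self.resync()
            return
        for rev, key, value in events:
            self.cache[key] = value
            self.last_rev = rev
```

The design choice worth noting is that the controller tracks only one number, the last revision it processed; that single value is what makes both the cheap resume and the compaction check possible.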

The API server implements a watch cache that buffers etcd events in memory (not to be confused with the informer caches that client-go maintains inside each controller). Controllers watch the API server's cache, not etcd directly. This means etcd serves roughly one watch per resource type per API server rather than one per controller, significantly reducing etcd load.

etcd Storage Engine: bbolt

etcd uses bbolt (a fork of boltdb), an embedded key-value store based on a B+ tree, as its on-disk storage engine. bbolt stores all data in a single memory-mapped file: /var/lib/etcd/member/snap/db.

Key bbolt properties:
- ACID transactions: writes are grouped into transactions; all writes in a transaction are atomic
- Copy-on-write B+ tree: writes create new tree nodes rather than modifying existing ones, enabling concurrent reads during writes
- Memory-mapped I/O: the OS page cache serves as the in-memory buffer; bbolt does not maintain its own cache

bbolt is paired with etcd's write-ahead log (WAL), which is a separate component rather than part of bbolt: etcd persists each Raft entry to the WAL (/var/lib/etcd/member/wal/) before applying it to bbolt, ensuring durability even if the process crashes mid-write.

The memory-mapped file grows as data is written. bbolt does not shrink the file when data is deleted — free pages are tracked internally and reused. This means the etcd data file size is monotonically non-decreasing unless a defrag operation is performed (etcdctl defrag).

Performance Requirements

etcd is extremely sensitive to disk write latency. Raft requires that log entries be durably written to disk before acknowledging writes. If disk writes take too long, followers fall behind, leader elections may trigger unnecessarily, and write throughput collapses.

etcd write latency requirements:
  p99 write latency < 10ms      ← absolute requirement
  p99 write latency < 1ms       ← optimal, achieved with NVMe SSD

Storage IOPS requirements (typical 3000-node cluster):
  Write IOPS: ~100-500/second
  Burst: higher during rolling deployments, node storms

etcd Performance on Different Storage:
  NVMe SSD (local)     p99: <1ms      ✓ Recommended
  SATA SSD (local)     p99: 1-5ms     ✓ Acceptable
  EBS gp3 (AWS)        p99: 3-10ms    ⚠ Borderline (use io2 for prod)
  EFS/NFS              p99: 10-50ms   ✗ Unsuitable
  HDD                  p99: 5-20ms    ✗ Unsuitable

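etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds are the two most important etcd Prometheus metrics for disk performance monitoring. etcd's own guidance is a p99 below 10ms for WAL fsync and below 25ms for backend commit; alert when either threshold is exceeded for a sustained period.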
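Both metrics are histograms, so a p99 has to be estimated from cumulative bucket counts. The sketch below shows the usual linear-interpolation estimate, similar in spirit to what a Prometheus histogram_quantile query does; the function name and the sample bucket values are made up for illustration.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets: list of (upper_bound_seconds, cumulative_count), sorted by bound.
    Interpolates linearly inside the bucket that contains the target rank.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if count == lower_count:
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]


# Hypothetical fsync samples: 100 observations across doubling buckets.
SAMPLE_BUCKETS = [(0.001, 50), (0.002, 80), (0.004, 95), (0.008, 99), (0.016, 100)]
P99_LIMIT_SECONDS = 0.010


def fsync_is_healthy(buckets, limit=P99_LIMIT_SECONDS):
    return histogram_quantile(0.99, buckets) < limit
```

The estimate is only as fine-grained as the bucket boundaries, so a reported p99 near a boundary should be read as "somewhere in that bucket" rather than as an exact latency.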

CPU requirements are modest: 2 cores is typically sufficient. Memory requirements depend on cluster size: 8GB for large clusters (etcd loads frequently-accessed B+ tree pages into the OS page cache via mmap).

Compaction and Defragmentation

MVCC Compaction

MVCC keeps all revisions of all keys. Without compaction, the bbolt data file grows without bound. Compaction discards old revisions, freeing bbolt pages.

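Before compaction (current revision 1000):
  /foo @ rev=1  "original"
  /foo @ rev=5  "updated"
  /foo @ rev=200 "latest"
  /bar @ rev=3  "value1"
  ...every historical revision of every key...

etcdctl compaction 950 --physical
  Removes revisions below 950 that have been superseded by a newer revision
  Keeps the newest revision of each live key, even if it is older than 950
  Marks the freed bbolt pages for reuse

After compaction:
  /foo @ rev=200 "latest"   (rev=1 and rev=5 removed)
  /bar @ rev=3  "value1"    (only revision, still current)
  Reads and watches at revisions below 950 now fail with ErrCompacted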
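The retention rule can be stated precisely as a function over per-key revision lists. This is a minimal sketch of the rule, not etcd's implementation; compact and the history layout (key mapped to a list of (mod_revision, value) pairs, with None as a tombstone) are assumptions for illustration.

```python
def compact(history, at_rev):
    """Drop superseded revisions at or below at_rev.

    history: key -> list of (mod_revision, value) in ascending revision
    order, where value None marks a deletion tombstone.
    """
    compacted = {}
    for key, versions in history.items():
        old = [v for v in versions if v[0] <= at_rev]
        new = [v for v in versions if v[0] > at_rev]
        kept = []
        # The newest revision at or below at_rev is still the key's value
        # as of at_rev, so it survives unless the key was deleted by then.
        if old and old[-1][1] is not None:
            kept.append(old[-1])
        kept.extend(new)
        if kept:
            compacted[key] = kept
    return compacted
```

Keys whose last revision before the compaction point is a tombstone disappear entirely, which is how deleted objects eventually stop occupying space.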

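Automatic compaction is configured with etcd flags: --auto-compaction-mode=periodic --auto-compaction-retention=1h. This retains 1 hour of history, compacting every hour. In Kubernetes clusters, kube-apiserver also requests compaction on its own schedule (--etcd-compaction-interval, 5 minutes by default), so the effective history window is set by whichever of the two is shorter.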

The appropriate retention depends on the longest possible watch reconnection time. If a controller can be disconnected for at most 15 minutes, a 30-minute retention provides a safety margin. Retaining more history gives controllers more time to reconnect before requiring a full re-LIST.

Defragmentation

Compaction frees bbolt pages internally but does not shrink the data file on disk. Defragmentation rebuilds the bbolt file, writing only live data to a new file.

etcdctl defrag should be run on follower nodes first (one at a time), then on the leader, ideally after transferring leadership to another member with etcdctl move-leader. During defrag, the member blocks reads and writes, which can last up to several minutes for large databases.

Schedule defrag when the ratio etcd_mvcc_db_total_size_in_bytes / etcd_mvcc_db_total_size_in_use_in_bytes exceeds about 2, meaning more than half of the file consists of free pages left behind by compaction.
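The threshold is easy to encode as a check over the two metric values. A small sketch, with hypothetical function names and the threshold of 2 taken from the guideline above:

```python
def fragmentation_ratio(total_bytes, in_use_bytes):
    """File size on disk divided by the bytes actually holding live data."""
    if in_use_bytes <= 0:
        return float("inf") if total_bytes > 0 else 1.0
    return total_bytes / in_use_bytes


def should_defrag(total_bytes, in_use_bytes, threshold=2.0):
    """True when more of the file is reclaimable than the threshold allows."""
    return fragmentation_ratio(total_bytes, in_use_bytes) > threshold
```

For example, a 6GB file with 2GB in use has a ratio of 3, so two thirds of the file would be reclaimed by a defrag.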

Backup and Restore

Backup

# Take a snapshot of etcd data
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/server.crt \
  --key=/etc/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db

Take snapshots every 15-60 minutes and store them off-cluster (S3, GCS). etcd snapshots are consistent point-in-time views of the entire key-value store — a single snapshot file contains everything needed to restore the cluster.
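For scheduled backups it helps to give every snapshot a timestamped name so older files are never overwritten. The sketch below only builds the command line, reusing the flags and certificate paths from the example above; snapshot_command and the file-name pattern are hypothetical, and the paths are assumptions that vary by installation.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def snapshot_command(backup_dir, now,
                     endpoint="https://127.0.0.1:2379",
                     cert_dir="/etc/etcd"):
    """Build the etcdctl argv for a snapshot file named after a UTC timestamp."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = PurePosixPath(backup_dir) / f"etcd-snapshot-{stamp}.db"
    return [
        "etcdctl", "snapshot", "save", str(target),
        f"--endpoints={endpoint}",
        f"--cacert={cert_dir}/ca.crt",
        f"--cert={cert_dir}/server.crt",
        f"--key={cert_dir}/server.key",
    ]


if __name__ == "__main__":
    # Print the command instead of running it; a scheduler would execute it
    # and then upload the resulting file to off-cluster storage.
    print(" ".join(snapshot_command("/backup", datetime.now(timezone.utc))))
```

Returning an argument list rather than a shell string keeps the paths from being reinterpreted by a shell when the command is eventually passed to a process runner.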

Restore

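Restoring etcd is an emergency procedure. It requires:
1. Stopping the kube-apiserver and all etcd instances
2. Running etcdctl snapshot restore (etcdutl snapshot restore in etcd 3.5+) on every member from the same snapshot file, giving each member its own --name and --initial-advertise-peer-urls, a fresh --data-dir, and the full --initial-cluster list
3. Starting etcd on every member against its restored data directory
4. Restarting kube-apiserver and the other control plane components so they discard cached state

The --force-new-cluster flag belongs to a different recovery path: rebuilding a cluster from one surviving member's existing data directory when quorum is permanently lost.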

After restore, any changes made after the snapshot (new pods created, config changes) are lost. The cluster must be manually inspected for divergence.

Sizing and Limits

etcd data size guidelines:
  < 2GB     Healthy, no performance concerns
  2-8GB     Requires raising the quota; monitor closely, increase compaction frequency
  > 8GB     Beyond the suggested maximum; expect degraded performance, reduce data and defrag
  Quota: 2GB by default, 8GB suggested maximum (set with --quota-backend-bytes)

Data volume drivers:
  - Pod count (each pod is a single key, but every status update writes a new revision of it)
  - Event objects (extremely high churn; keep the kube-apiserver --event-ttl short, default 1h)
  - CRD instances (custom resources count toward total)
  - Secret/ConfigMap count
  - Deployment rollout history (limit with revisionHistoryLimit)

Cluster size impacts etcd size: a 1000-node cluster with 10,000 pods has approximately 500MB-1GB of etcd data. Events are the highest-volume objects and should have a short TTL.

Debugging Notes

High write latency: check disk I/O (iostat -xd 1). If await is high for the etcd disk, the storage is the bottleneck. Also compare RAFT INDEX with RAFT APPLIED INDEX in etcdctl endpoint status -w table; a persistent gap means the member is falling behind on applying committed entries.
Compaction required: if etcd_mvcc_db_total_size_in_bytes is approaching etcd_server_quota_backend_bytes, etcd is close to raising a NOSPACE alarm, after which it rejects writes with "etcdserver: mvcc: database space exceeded". Compact and defrag immediately, then clear the alarm with etcdctl alarm disarm.
Watch closed: context deadline exceeded: this appears in API server logs when etcd is overloaded and cannot deliver watch events in time. Indicates etcd performance degradation.
Leader election instability: repeated leader elections appear in etcd logs as elected leader messages and as increases in etcd_server_leader_changes_seen_total. Caused by disk latency (a leader stalled on fsync sends heartbeats late, so followers time out) or by network latency between etcd nodes.
Inspecting etcd directly: etcdctl get /registry/ --prefix --keys-only | head -100 shows the first 100 Kubernetes keys. Never modify etcd directly; all changes must go through the API server for validation.

Security Implications

etcd stores Kubernetes Secrets unencrypted by default; the base64 encoding seen in Secret manifests is not encryption. Enable encryption at rest by configuring an EncryptionConfiguration with a local provider such as aescbc, aesgcm, or secretbox, or with the KMS provider (envelope encryption with a cloud KMS).
etcd endpoints must not be accessible from outside the control plane network. The default etcd ports (2379 for clients, 2380 for peers) should be protected by restrictive firewall rules.
Mutual TLS (mTLS) should be enforced for all etcd communication (--client-cert-auth and --peer-client-cert-auth). All clients (API server), peers (other etcd nodes), and management tools (etcdctl) must authenticate with certificates.
etcd backup files contain all cluster secrets in plaintext unless API server encryption at rest was enabled when the data was written. Backup files must be encrypted and access-controlled.

Performance Implications

The watch cache in kube-apiserver is the primary scalability mechanism. It is enabled by default (--watch-cache) and can be tuned per resource with --watch-cache-sizes. Without the watch cache, every controller's watch would become an etcd watch, and etcd would have to fan events out to thousands of watchers instead of a handful per API server.
Kubernetes API Priority and Fairness (APF) limits the number of concurrent in-flight requests per priority level, preventing a rogue controller from saturating the API server and thereby etcd. Configure FlowSchema and PriorityLevelConfiguration objects to adjust the limits.
etcd linearizable reads (default) are confirmed with the Raft leader and guarantee consistency. Serializable reads are answered by the local member without that confirmation and may return stale data. Kubernetes makes a similar trade-off one layer up: list and get requests with resourceVersion=0 are served from the API server's watch cache and may be slightly stale, which offloads etcd, while requests without a resourceVersion get a consistent read.

Failure Modes

etcd leader network partition: if the leader is partitioned from the majority, it cannot commit new entries and steps down. The majority elects a new leader after roughly one election timeout (1000ms by default in etcd, randomized, plus processing time). During this period, writes fail.
etcd member data corruption: if a member's data directory is corrupted (disk failure, hardware error), the member must be removed from the cluster and re-added as a new member (it will replicate current state from the leader).
etcd disk full: if the disk hosting etcd fills up, etcd crashes. Monitor disk usage closely. Reserve disk space using LVM or partition sizing.
Clock skew: Raft itself relies on local timers rather than synchronized wall clocks, but TLS certificate validation and cross-node log correlation depend on accurate time. Keep NTP (chronyd or ntpd) running on all etcd nodes and keep clock skew well under 1 second.

Modern Usage

In managed Kubernetes (EKS, GKE, AKS), the cloud provider manages etcd — you never directly interact with it. The provider handles backups, sizing, and hardware selection. This is a significant operational simplification for most users.

For self-managed clusters (kubeadm, k3s, RKE2), etcd operations are a critical skill. k3s defaults to SQLite for single-server setups and can use an external PostgreSQL or MySQL database, in both cases through kine (a shim that translates etcd API calls to SQL); it also supports embedded etcd for high availability. The SQL-backed options trade etcd's built-in replication and fault tolerance for simpler operations in smaller deployments.

Future Directions

etcd v3.6: improved server-side filtering for watch events (reduce network bandwidth for watches with field selectors)
Kubernetes etcd migration: the etcd gRPC API remains the standard, but the Kubernetes community is actively exploring alternatives for edge deployments where etcd's resource requirements are prohibitive
etcd over QUIC: experimental work on running etcd's peer protocol over QUIC for improved performance in high-latency environments (multi-region control planes)

Exercises

Set up a 3-node etcd cluster using kubeadm or manually. Simulate a leader failure by stopping the leader process. Measure time-to-new-leader by polling etcdctl endpoint status --cluster -w table and compare it to the configured election timeout range.
Write a program using the etcd Go client library that opens a watch on /registry/pods/default and prints every event. Create and delete pods in a Kubernetes cluster and verify all events arrive.
Fill an etcd cluster past its quota (--quota-backend-bytes=16777216, i.e. 16MB, for testing). Observe the error returned by the API server. Perform compaction and defrag, then clear the alarm with etcdctl alarm disarm to recover.
Benchmark etcd write latency on NVMe SSD vs EBS gp3 vs EBS io2 using the benchmark tool from the etcd repository (tools/benchmark). Compare etcd_disk_wal_fsync_duration_seconds histograms.
Take an etcd snapshot, delete a Deployment from the running cluster, restore the snapshot to a new etcd cluster, and verify the deleted Deployment reappears. Document the exact steps and validate the restored cluster is healthy.

References

etcd Documentation: https://etcd.io/docs/
Diego Ongaro and John Ousterhout: "In Search of an Understandable Consensus Algorithm (Extended Version)" (USENIX ATC 2014)
etcd v3 Design: https://etcd.io/docs/v3.5/learning/design-overview/
Kubernetes etcd operations: https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/
bbolt: https://github.com/etcd-io/bbolt
Brandon Philips: "etcd: Distributed key-value store for shared configurations and service discovery" (CoreOS blog)
Xiang Li: "etcd v3 MVCC" (GitHub etcd/etcd design doc)
kine (SQLite/PostgreSQL etcd shim): https://github.com/k3s-io/kine