02 — CAP Theorem

Technical Overview

The CAP theorem is the most cited and most misunderstood result in distributed systems. It states that a distributed data store can provide at most two of three guarantees simultaneously: Consistency, Availability, and Partition Tolerance. Because network partitions are unavoidable in any real distributed system, the practical choice is always between consistency and availability during a partition.

Understanding what CAP actually says — and crucially, what it does not say — is prerequisite to making sound architectural decisions for distributed systems.

Prerequisites

Basic distributed systems concepts (see 01-distributed-systems-fundamentals.md)
Understanding of what a network partition is
Familiarity with ACID database properties
Basic understanding of replication

Core Content

The CAP Theorem Statement

Eric Brewer presented the CAP conjecture at PODC 2000 (Principles of Distributed Computing). Gilbert and Lynch provided a formal proof in 2002.

Formal definitions:

Consistency (C): Every read receives the most recent write or an error. In formal terms: a distributed system is consistent if every read operation that begins after a write completes will return the value of that write (or a later write). This is equivalent to linearizability — the system appears as a single, atomic copy of the data.

Note: CAP's "C" is NOT the same as ACID's "C" (which means constraint enforcement). CAP-C is about read-write ordering, not database invariants.

Availability (A): Every request received by a non-failing node must result in a response. Not "a response eventually" — the system cannot return an error or wait indefinitely. The response might not reflect the most recent write, but a response must be returned.

Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.

Why You Always Have P

The critical insight that makes CAP practically useful: you cannot trade away partition tolerance. If you're running more than one machine, you will have partitions. Therefore the real choice is:

Given that partitions WILL happen:
  Do you prefer CONSISTENCY or AVAILABILITY?

  ┌─────────────────────────────────────────┐
  │  When partition occurs:                 │
  │                                         │
  │  CP: return error / block              │
  │      (sacrifice availability)          │
  │                                         │
  │  AP: return possibly stale data        │
  │      (sacrifice consistency)           │
  └─────────────────────────────────────────┘

The "CA" category (no partition tolerance) only makes sense for a single-machine system or systems where nodes are connected by a network so reliable it's effectively a single machine. A traditional single-node RDBMS is CA — but once you replicate it, you have a partition tolerance decision to make.

Partition Scenario Diagram

Normal Operation (all three properties achievable):
  ┌──────────┐     write x=5      ┌──────────┐
  │ Client   │─────────────────→  │ Node A   │
  └──────────┘                    │ x=5      │
                                  └────┬─────┘
                                       │ replicate
                                  ┌────▼─────┐
                                  │ Node B   │
                                  │ x=5      │
                                  └──────────┘

  Client reads from B → gets x=5 ✓

Network Partition Occurs:
  ┌──────────┐     write x=7      ┌──────────┐
  │ Client   │─────────────────→  │ Node A   │
  └──────────┘                    │ x=7      │
        │                         └──────────┘
        │                              ✗ (partition — B cannot receive replication)
        │                         ┌──────────┐
        │      read x=?           │ Node B   │
        └────────────────────────→│ x=5      │
                                  └──────────┘

  CP System: Node B returns ERROR (refuses to serve stale data)
  AP System: Node B returns x=5  (stale but available)

The Three Theoretical Categories

CA Systems (Consistency + Availability, no Partition Tolerance): - Single-node relational databases: PostgreSQL, MySQL, Oracle (single node) - These are not truly distributed; they fail completely during a network partition - Once you add replication, you exit this category

CP Systems (Consistency + Partition Tolerance, sacrifice Availability): - During a partition, the system refuses requests it cannot answer consistently - Returns errors or blocks until partition heals - ZooKeeper: Uses ZAB protocol (Zookeeper Atomic Broadcast). Requires quorum to serve reads and writes. During a partition where the quorum side is unavailable to a client, that client gets no service. A ZooKeeper follower that loses contact with the leader will refuse to serve reads after a configurable session timeout. - etcd: Similar to ZooKeeper. A minority partition will refuse writes and (in strict mode) reads. - HBase: Built on HDFS; chooses consistency over availability. A region server partition causes that region to be unavailable. - Google Chubby: CP by design — Chubby is a lock service; giving out two locks for the same resource is worse than being temporarily unavailable.

AP Systems (Availability + Partition Tolerance, sacrifice Consistency): - During a partition, all nodes continue serving requests with potentially stale data - Conflicts are resolved after partition heals (eventual consistency) - Apache Cassandra: With quorum settings lower than ALL, nodes continue accepting writes during partition. Different nodes may have divergent state temporarily. - Amazon DynamoDB: By default, reads may return stale data (eventually consistent reads). Strongly consistent reads are available at higher cost and reduced availability. - Couchbase, CouchDB: AP with conflict resolution via Multi-Version Concurrency Control - Amazon S3: AP — S3 eventually became strongly consistent (November 2020) for standard operations, but historically returned stale data after recent overwrites. - DNS: Classic AP system — DNS caches may return stale records during partitions

Real System Behavior During Partition

ZooKeeper (CP) — partition scenario:

ZooKeeper 5-node ensemble: [Z1, Z2, Z3, Z4, Z5]

Partition splits into:
  Majority side: [Z1, Z2, Z3] — can reach quorum
  Minority side: [Z4, Z5]   — cannot reach quorum

Z1, Z2, Z3: continue serving reads and writes ✓
Z4, Z5: refuse all writes, stop serving reads after leader timeout ✗

Client connected to Z4: receives "connection loss" — must reconnect to majority side

Cassandra (AP) — partition scenario:

Cassandra 5-node ring, RF=3, Quorum=2

Partition splits into:
  Side A: [N1, N2, N3]
  Side B: [N4, N5]

With Quorum writes (W=2):
  Writes to N1's token range → succeeds in Side A ✓
  Writes routed through N5 → N5 can write locally + N4, 
    but cannot replicate to N1-N3 → succeeds within B ✓

Both sides accept writes to overlapping key ranges → DIVERGENCE
After partition heals: anti-entropy + read repair reconcile state
  using last-write-wins (LWW) by default → newer timestamp wins

PACELC: The Extension

The CAP theorem describes behavior only during a partition. Daniel Abadi's 2012 paper "Consistency Tradeoffs in Modern Distributed Database System Design" extended it with the PACELC model:

If Partition:    C or A  (as in CAP)
Else (normal):   L or C  (Latency vs Consistency)

Even without a partition, a distributed system faces a fundamental tradeoff: if you want strong consistency on every read/write, you need coordination (adds latency); if you want low latency, you need to relax consistency.

PACELC categorizes systems as:

System	Partition	Normal	Category
DynamoDB	A	L	PA/EL
Cassandra	A	L	PA/EL
MongoDB	C	L	PC/EL
BigTable	C	C	PC/EC
Spanner	C	C	PC/EC
PNUTS/Yahoo	A	L	PA/EL
MySQL Cluster	A	L	PA/EL

Spanner, despite being globally distributed, achieves PC/EC by using TrueTime and Paxos — consistency is maintained at the cost of higher latency (10–100ms commit latency across regions).

CAP Misconceptions and Limitations

Misconception 1: CAP is a binary choice.

Real systems are not "consistent" or "available" — they exist on a spectrum. Cassandra's tunable consistency allows you to dial from eventual (W=1, R=1) to strong (W=ALL, R=ALL) and everywhere in between. The "choice" is made per-operation.

Misconception 2: CAP covers all distributed system properties.

CAP says nothing about: - Durability (can data be lost on node failure?) - Latency (how long does an operation take?) - Throughput - Ordering guarantees (does read B see all writes before write A?) - Byzantine fault tolerance

Misconception 3: Eventual consistency means "eventually consistent."

"Eventual consistency" without qualification is nearly meaningless. It says that if no new updates are made, all replicas will eventually converge. But it gives no bound on how long "eventually" is, no guarantee about which value converges to (without additional rules like LWW), and no guarantee about intermediate states.

Misconception 4: CP systems are always "more correct."

A CP system that becomes unavailable causes service outages. For many applications (product catalog, user preferences, social media feeds), serving slightly stale data is far better than returning errors.

Limitation: CAP is about a very specific failure model.

The CAP theorem applies to a system with at-most-once semantics on network messages and models only crash-stop failures. It does not model slow nodes, partial writes, clock skew, or Byzantine failures. Real systems face all of these simultaneously.

Limitation: CAP ignores latency during normal operation.

The PACELC model addresses this, but it's important to note that CAP's "availability" means "any response" — it does not require the response to be fast.

Historical Context

2000: Eric Brewer, then CTO of Inktomi (later Professor at UC Berkeley), presents the CAP conjecture as a keynote at PODC (Principles of Distributed Computing). His framing: you can get at most two of {consistency, availability, partition tolerance}.

2002: Seth Gilbert and Nancy Lynch at MIT publish "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" in SIGACT News, providing a rigorous proof. The proof uses a simple two-node scenario: one node receives a write, the partition fires before replication, a second client reads from the other node. A consistent system must block or error; an available system must return the old value.

2012: Eric Brewer himself publishes "CAP Twelve Years Later: How the 'Rules' Have Changed" in IEEE Computer, clarifying many misconceptions and introducing the concept of "CAP-latency tradeoffs" — forerunner of PACELC.

2012: Daniel Abadi publishes "Consistency Tradeoffs in Modern Distributed Database System Design," introducing PACELC as a more nuanced model.

Production Examples

ZooKeeper: Apache ZooKeeper is explicitly CP. Its documentation states that ZooKeeper guarantees sequential consistency — every client sees the same order of updates. In exchange, a ZooKeeper session connected to a minority partition becomes unusable. Kubernetes historically used ZooKeeper for configuration; it now uses etcd (also CP).

Cassandra: By default AP. Operators choose consistency level per query. Netflix runs Cassandra with local_quorum for most operations, accepting that cross-datacenter reads may be slightly stale. Their SLA tolerates this in exchange for sub-millisecond read latency at scale.

DynamoDB: AP with eventual consistency as the default. DynamoDB now offers strongly-consistent reads (at twice the read capacity cost). The original Dynamo paper (2007) described using vector clocks for conflict resolution; production DynamoDB simplified this to last-write-wins with a timestamp.

Google Spanner: Defies simple CAP categorization. It is CP (consistent, partition tolerant) but achieves external consistency (stronger than linearizability) globally by using TrueTime. During a partition, Spanner will stall rather than serve inconsistent data. Its availability SLA is 99.999% in part because TrueTime makes the "stall time" minimal.

MongoDB: Pre-4.0, MongoDB was effectively AP by default (eventual consistency with replica sets). MongoDB 4.0+ added multi-document transactions. With majority write concern and majority read concern, MongoDB is CP. With default settings, it remains AP.

Debugging Notes

When debugging consistency issues in production systems, the CAP theorem helps frame the investigation:

Stale reads: If you're seeing stale data from an AP system, check whether a partition recently healed. Most AP systems have anti-entropy processes that converge within seconds to minutes of partition resolution.
Unavailability during partition: If a CP system is returning errors, find which nodes are in the minority partition. The fix is usually to reconnect clients to the majority partition.
Split-brain: Both halves of a partition accepted writes. This is the worst case — you need conflict resolution. Check your system's convergence mechanism (LWW, vector clocks, CRDTs).
Quorum misconfiguration: A common mistake in Cassandra is setting RF=3, W=1, R=1, believing that because you have 3 replicas you have strong consistency. You don't — W+R=2 < RF=3, so reads can miss recent writes.

The correct formula for strong consistency in a quorum system: W + R > RF, where W is write quorum, R is read quorum, RF is replication factor.

Security Implications

Consistency vs. security: An AP system that serves stale data can serve stale ACLs. If a user's permissions are revoked and the system serves a cached (pre-revocation) ACL from a replica that hasn't received the update, the revocation is effectively bypassed until convergence. For security-critical data, use CP systems or synchronous replication.
Split-brain and authorization: Two primaries during a network partition can both grant access to the same resource. Design security systems to favor CP over AP.
Data integrity during partition healing: When an AP system heals a partition, the conflict resolution (e.g., LWW) may silently discard writes. An attacker who can control timestamps may be able to cause legitimate writes to be discarded in favor of their own.

Performance Implications

CP systems: Each write that requires quorum confirmation adds a full network round-trip before acknowledgment. For ZooKeeper or etcd in a datacenter, this is 1–5ms. Cross-datacenter, it's 50–100ms+. This is why coordination services like ZooKeeper are not used for hot data paths.
AP systems: Cassandra with W=1 can acknowledge writes in under 1ms on the same machine. This is the fundamental reason Netflix, Discord, and others choose Cassandra for high-throughput write workloads despite its weaker consistency.
Quorum tuning: In Cassandra, LOCAL_QUORUM (quorum within the local datacenter only) provides a middle ground: consistent reads/writes within a datacenter, with eventual consistency cross-datacenter. This is the most common production configuration.

Failure Modes and Real Incidents

2012, GitHub "split-brain" MySQL outage: During a network event, GitHub's MySQL primary became unreachable. An automated failover promoted a replica. When the network recovered, the original primary had divergent writes. Result: data loss, manual intervention, 2.5 hours of degraded service. This was a CP vs AP failure: the system chose availability (kept serving) over consistency (should have blocked on partition).

2017, Amazon S3 us-east-1 Outage: An S3 API availability issue caused cascading failures across services that depended on S3. Services that had chosen AP-style dependency handling degraded gracefully; services that assumed S3 availability (CP-style assumption) failed hard. The CAP choice in your dependencies matters as much as your own system's choice.

Cassandra "tombstone" storms: Cassandra's AP model means deletes are implemented as "tombstones" (markers) rather than actual deletes, since you can't be sure all replicas received the delete during a partition. If a system creates millions of tombstones, compaction becomes extremely slow, causing latency spikes hours or days later. This is a CAP tradeoff made concrete in performance behavior.

Modern Usage

Modern distributed databases have largely moved past the "pick 2-of-3" framing toward tunable consistency:

TiDB, CockroachDB, YugabyteDB: NewSQL systems that achieve strong consistency (CP) while providing SQL semantics at global scale. They use Raft for consensus per shard, accepting higher latency for correctness.
Azure Cosmos DB: Exposes 5 explicitly-named consistency levels (Strong, Bounded Staleness, Session, Consistent Prefix, Eventual) with documented SLAs for each level.
MongoDB Atlas Global Clusters: Allows per-region consistency configuration, trading some consistency for lower read latency in remote regions.

The industry trend is toward systems that make CAP tradeoffs explicit and tunable, rather than hiding the choice or making a single fixed choice.

Future Directions

Geo-distributed strong consistency: Spanner demonstrated that PC/EC is achievable globally with the right infrastructure. As TrueTime-equivalent hardware (GPS-disciplined clocks) becomes commoditized, more systems will offer global strong consistency.
Invariant-centric consistency: Rather than choosing C or A globally, future systems may allow specifying invariants (e.g., "account balance must never go negative") and choosing the minimum consistency level that preserves each invariant.
CRDT-first design: As CRDTs mature, more AP systems will be built with data types that converge correctly without coordination, making the "AP means wrong answers" criticism less valid.

Exercises

CAP Classification: For each of the following, determine whether it is CP or AP during a network partition, and justify your answer: (a) Redis Cluster with 3 masters, no replicas; (b) PostgreSQL primary-standby with synchronous replication; (c) Apache Kafka with acks=all and min.insync.replicas=2 in a 3-broker cluster. What happens to writes when one broker is partitioned?
Quorum Math: You have a Cassandra cluster with 6 nodes and RF=3. Calculate the minimum W and R values that guarantee strong consistency (no stale reads). Now simulate what happens when 2 nodes go down — can you still achieve strong consistency? What is the minimum RF that would allow strong consistency with 2 node failures?
Partition Simulation: Using Docker Compose with three containers running a replicated service (Redis Sentinel, ZooKeeper, or etcd), use iptables to partition one node from the others. Observe: (a) which partition side serves reads, (b) what errors appear, (c) how long recovery takes after the partition heals.
PACELC Analysis: For a system you work with or know well, determine its PACELC classification. Measure actual latency for a consistent read vs an eventually-consistent read. Quantify the latency difference — is the tradeoff worth it for your use case?
Security Audit: Review an access control system (real or designed). Identify all places where stale ACL data could be served due to eventual consistency. For each, determine: (a) what is the maximum staleness window, (b) what is the worst-case security impact, (c) would switching to CP at that point be worth the availability cost?

References

Brewer, E. A. (2000). "Towards Robust Distributed Systems." Keynote at PODC 2000.
Gilbert, S., & Lynch, N. (2002). "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services." ACM SIGACT News, 33(2), 51–59.
Brewer, E. (2012). "CAP Twelve Years Later: How the 'Rules' Have Changed." IEEE Computer, 45(2), 23–29.
Abadi, D. (2012). "Consistency Tradeoffs in Modern Distributed Database System Design." IEEE Data Engineering Bulletin, 35(1), 37–42. (Introduces PACELC.)
DeCandia, G., et al. (2007). "Dynamo: Amazon's Highly Available Key-value Store." SOSP 2007.
Corbett, J. C., et al. (2012). "Spanner: Google's Globally Distributed Database." OSDI 2012.
Lakshman, A., & Malik, P. (2010). "Cassandra: A Decentralized Structured Storage System." ACM SIGOPS Operating Systems Review, 44(2), 35–40.