03 — Consistency Models

Technical Overview

A consistency model is a contract between a distributed storage system and the applications that use it. It defines which values a read operation may return, given the history of preceding writes. Consistency models form a hierarchy from strongest (most restrictive, most expensive) to weakest (most permissive, cheapest). Choosing the wrong consistency model is a primary source of distributed systems bugs.

Understanding the spectrum — from linearizability at the top to eventual consistency at the bottom — is essential for building correct distributed applications.

Prerequisites

CAP theorem (02-cap-theorem.md)
Basic distributed systems concepts (01-distributed-systems-fundamentals.md)
Familiarity with database transactions (ACID)
Understanding of concurrent programming (happens-before, atomicity)

Core Content

The Consistency Model Spectrum

STRONGEST (most expensive, most correct)
    │
    ▼
┌─────────────────────────────────────────────────────┐
│  Linearizability (atomic consistency)               │
│  — reads always reflect most recent write globally  │
├─────────────────────────────────────────────────────┤
│  Sequential Consistency                             │
│  — operations appear in some global serial order   │
│    consistent with each process's local order       │
├─────────────────────────────────────────────────────┤
│  Causal Consistency                                 │
│  — causally related ops appear in causal order     │
│    concurrent ops can appear in any order           │
├─────────────────────────────────────────────────────┤
│  Monotonic Read Consistency                         │
│  — a process never reads older data after newer    │
├─────────────────────────────────────────────────────┤
│  Read-Your-Writes (Session Consistency)             │
│  — a process always reads its own writes            │
├─────────────────────────────────────────────────────┤
│  Monotonic Write Consistency                        │
│  — writes from a process are applied in order      │
├─────────────────────────────────────────────────────┤
│  Eventual Consistency                               │
│  — replicas converge if no new updates             │
│    (no ordering guarantees on intermediate reads)   │
└─────────────────────────────────────────────────────┘
    │
    ▼
WEAKEST (cheapest, most available)

Linearizability

Definition (Herlihy & Wing, 1990): A system is linearizable if every operation appears to take effect atomically at some point between its invocation and its response — a linearization point. The linearization point must be consistent with the real-time ordering of operations.

Intuitively: the system behaves as a single, atomic register (or object). Once a write is acknowledged, all subsequent reads (by anyone, anywhere) will return that value or a later one.

Example — Linearizable System:

Timeline:
  Process A:  ──write(x=5)──────────────────────read(x)→5──
  Process B:  ────────────────read(x)→5──────────────────
                              ↑
                    This is only valid if B's read
                    starts AFTER A's write completes.
                    If B's read overlaps with A's write,
                    it may return 5 or the previous value.

Non-linearizable (violates):
  Process A:  ──write(x=5)────────(ack received)────────────
  Process B:  ──────────────read(x)→3──read(x)→5───────────
                             ↑ After write was acked!

Linearizability is the strongest consistency model. It is not the same as serializability: - Serializability (databases): transactions appear to execute in some serial order — but that order need not match real time. - Linearizability (distributed registers): operations appear to execute at a single point in real time. - Strict serializability = Serializability + Linearizability — the combination used by systems like Spanner.

Cost: Achieving linearizability requires coordination across replicas before acknowledging writes. This is a synchronous consensus operation (Paxos, Raft, 2PC). Every write waits for quorum acknowledgment; every read must verify it's reading from the current leader or verify quorum.

Systems: Google Spanner (external consistency, which is stronger than linearizability), etcd, ZooKeeper, FoundationDB.

Sequential Consistency

Definition (Lamport, 1979): A system is sequentially consistent if the result of any execution is the same as if all operations were executed in some sequential order, and the operations of each process appear in that sequence in the order specified by its program.

Key difference from linearizability: sequential consistency does not require the global order to respect real-time. Operations from different processes can be interleaved in any way, as long as within each process the order is preserved.

Sequential Consistency — valid execution:

Process A writes: x=1, then x=2
Process B writes: y=1
Process C reads:  y=1, x=1, x=2  ← valid (sees A's writes in order)
Process D reads:  x=2, x=1, y=1  ← INVALID (sees A's writes out of order)
Process D reads:  x=2, y=1       ← valid (skips old x=1 value but maintains order)

Sequential consistency is weaker than linearizability because it allows the global order to "lag" real time — you could see an older state than what was true at the moment of your read, as long as you never see a process's writes in the wrong order.

Systems: Early Java Memory Model (pre-5.0), some GPU memory models, Volta memory consistency model.

Causal Consistency

Definition: If operation A causally precedes operation B, then A must appear before B in all replicas. Operations with no causal relationship (concurrent operations) may appear in any order.

Causal consistency is weaker than sequential consistency — it allows replicas to diverge for causally unrelated operations while maintaining causal order.

Example — Social Media:

Process A: posts "Going to the park" (post_id=1)
Process B: reads post_id=1, then replies "Have fun!" (reply_id=2)
Process C: must see post_id=1 before reply_id=2
           (reply is causally dependent on the post)

But:
Process X: posts "Hot dog weather!" (post_id=3, concurrent with post_id=1)
Process C: may see post_id=3 before or after post_id=1 
           (they are causally unrelated)

Implementation: Typically implemented using vector clocks or dependency tracking. Each operation carries its causal history; a replica only applies an operation after applying all of its causal dependencies.

Systems: MongoDB (causal consistency sessions), COPS (Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage, Lloyd et al. 2011), Bolt-on Causal Consistency for Cassandra.

Monotonic Read Consistency

Definition: If a process reads value v of object x, any subsequent read of x by that process will return v or a value written after v.

Informally: you never read older data after reading newer data. You won't "time travel backward."

Monotonic Read — violation example:

  (Without monotonic reads — possible in AP systems)
  Process A:  read(x)→5 ──── read(x)→3  ← VIOLATION

  This can happen if:
  - First read hits Replica 1 (has x=5)
  - Network partition, then second read hits Replica 2 (has x=3, stale)

  (With monotonic reads — guaranteed)
  Process A:  read(x)→5 ──── read(x)→5 (or newer)

Implementation: Pin a client to a single replica (sticky sessions), or use monotonically increasing read timestamps.

Systems: Cassandra with LOCAL_SERIAL, DynamoDB strongly-consistent reads (per-request, not across requests), most session-based systems.

Read-Your-Writes (Session Consistency)

Definition: After a process writes value v for object x, any subsequent read of x by the same process will return v or a value written after v.

This is the minimum consistency level most applications expect intuitively. "I just posted a comment, why can't I see it?" is a violation of read-your-writes.

Read-Your-Writes — violation:

  User action:     write("comment=hello")  →  read("comments")
                                                 ↑ doesn't see "hello"

  This can happen if:
  - Write goes to Replica A (primary)
  - Read goes to Replica B (replica that hasn't received the write yet)

  Fix options:
  1. Route reads after a write to the same node (sticky routing)
  2. Client tracks its write timestamps; read with min_timestamp constraint
  3. Use read-your-writes tokens (DynamoDB does this with session tokens)

Systems: MongoDB with session-level consistency, DynamoDB with session tokens, most databases in single-node configuration.

Eventual Consistency

Definition (Vogels, 2009, "Eventually Consistent"): If no new updates are made to an object, eventually all reads of that object will return the last updated value.

This is a liveness property, not a safety property — it says nothing about intermediate states. Without additional rules, "eventual consistency" says nothing about: - How long "eventually" takes - What value intermediate reads return - Whether concurrent writes converge to the same value across replicas

The term became a marketing term in the late 2000s, often used to mean "we made it fast and we'll deal with correctness later."

Eventual Consistency — what it guarantees:

  Replica A: x=5 (latest write)
  Replica B: x=3 (stale, hasn't received update yet)
  Replica C: x=5 (up to date)

  A client reading from B today:    gets x=3 (stale)
  Same client reading from B later: gets x=5 (eventually converged)

  Guarantee: B WILL eventually show x=5.
  No guarantee: when, or what B returns in between.

Strengthening eventual consistency: In practice, "eventual" systems add rules to make eventual consistency useful:

Last-Write-Wins (LWW): Each write has a timestamp; the highest timestamp wins. Requires clock synchronization (or at least clock ordering). Cassandra uses this by default.
Multi-Value Register: Keep all conflicting versions, return them to the client for manual resolution (original Dynamo approach with vector clocks).
CRDTs: Use data types that converge to a deterministic result regardless of merge order (see 11-crdts.md).

Systems: Amazon DynamoDB (default), Apache Cassandra (default), Amazon S3 (pre-November 2020), DNS, web caches.

BASE vs ACID

The ACID properties of traditional databases (Atomicity, Consistency, Isolation, Durability) assume transactions complete atomically and leave the database in a consistent state. BASE was coined by Eric Brewer as the philosophical counterpoint for distributed, eventually-consistent systems:

ACID:
  Atomicity   — transaction either fully completes or fully rolls back
  Consistency — transaction leaves database in a valid state
  Isolation   — concurrent transactions don't interfere
  Durability  — committed transactions persist through failures

BASE:
  Basically Available   — system is available (may serve stale data)
  Soft state            — state may change over time without new input
                          (convergence in progress)
  Eventual consistency  — system will converge to consistent state

BASE is not a formal model — it's a design philosophy that accepts weaker guarantees in exchange for higher availability and lower latency. Most NoSQL systems are BASE-oriented; most RDBMS systems are ACID-oriented.

Consistency Models in Production Systems

System	Default Consistency	Strongest Available
DynamoDB	Eventual	Strongly Consistent Reads
Apache Cassandra	Eventual (W=1,R=1)	Linearizable (LWT/Paxos)
MongoDB	Eventual (replica reads)	Linearizable (w:majority)
Google Spanner	External consistency	External consistency
CockroachDB	Serializable	Serializable
PostgreSQL	Serializable (single node)	Serializable
Redis	Sequential (single node)	N/A (single node)
Redis Cluster	Eventual (async repl)	No strong consistency
FoundationDB	Strict serializable	Strict serializable
ZooKeeper	Sequential consistency	Sequential consistency

DynamoDB in detail: DynamoDB's "eventually consistent reads" use any replica; "strongly consistent reads" use the leader. Strongly consistent reads cost twice the read capacity units and have higher latency (~1ms vs ~0.5ms), but are necessary for any operation that depends on seeing the latest write.

Cassandra's tunable consistency: Cassandra's LWT (Lightweight Transactions) use a Paxos-based compare-and-set to achieve linearizability for specific operations. They are 4–5x slower than regular writes. Used when you absolutely cannot have lost updates (e.g., unique username registration).

Google Spanner external consistency: Spanner provides external consistency, which is strictly stronger than linearizability. It guarantees that if transaction T1 commits before T2 begins (in real time), T2 will observe T1's writes. This uses TrueTime to bound uncertainty and commit-wait to ensure ordering.

Historical Context

1979: Leslie Lamport defines sequential consistency in "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs" in IEEE Transactions on Computers.

1990: Maurice Herlihy and Jeannette Wing define linearizability in "Linearizability: A Correctness Condition for Concurrent Objects" in ACM Transactions on Programming Languages and Systems. This paper wins the 2003 Dijkstra Prize.

1994: Ahamad et al. formally define causal consistency and monotonic read consistency in "Causal Memory: Definitions, Implementation, and Programming."

1997: Eric Brewer and colleagues at Inktomi build systems at scale that first expose the practical tradeoffs between consistency and availability.

2007: Werner Vogels (CTO of Amazon) publishes "Eventually Consistent" in ACM Queue, popularizing the term and the BASE philosophy. The Dynamo paper (DeCandia et al., SOSP 2007) provides the engineering details.

2010: The term "eventual consistency" enters mainstream engineering discourse as NoSQL databases (Cassandra, CouchDB, MongoDB) gain adoption.

2012: Google publishes the Spanner paper, demonstrating that global strong consistency is achievable with the right infrastructure investment.

Production Examples

Discord and Cassandra: Discord uses Cassandra for message storage with eventual consistency. They experienced the "stale read" problem when users would send a message and immediately refresh — the message wouldn't appear because the read hit a replica that hadn't received the write. Their solution: after a write, read from the same coordinator node for a short window (read-your-writes via sticky routing).

Amazon and DynamoDB session consistency: DynamoDB's SessionToken concept allows the client SDK to track write timestamps and ensure subsequent reads return data at least as new as the latest write. This provides read-your-writes without requiring synchronous replication.

LinkedIn and Espresso: LinkedIn's Espresso database (described in their 2012 blog posts) provides MySQL-backed storage with multi-master replication. They explicitly provide per-record consistency annotations, allowing high-value records (job postings) to use stronger consistency than low-value records (ad clicks).

Debugging Notes

Diagnosing consistency violations in production:

Stale reads: Add write timestamps to data. If you read a value with a timestamp significantly behind the current time, the replica is lagging. Monitor replication lag as a key metric.
Non-monotonic reads: Instrument your client to log the timestamps of values read. If you ever see a read return a value with an older timestamp than a previous read, you have a monotonic read violation.
Lost updates (lack of read-your-writes): After every write in a critical flow, re-read the value and assert it matches what you wrote. This adds latency but catches violations immediately.
Jepsen analysis: Kyle Kingsbury's Jepsen framework (jepsen.io) has found linearizability violations in Cassandra, MongoDB, Etcd, Riak, and dozens of other systems. It uses Knossos (a linearizability checker) to analyze operation histories.

Key metrics to monitor: - Replication lag (seconds behind primary) - Read repair frequency (indicates stale reads are happening) - CAS (compare-and-set) failure rate (high rate indicates contention or staleness)

Security Implications

ACL staleness: If permissions are stored in an eventually-consistent store, a permission revocation may take seconds to minutes to propagate. Design critical security checks to use linearizable reads.
Token validation: JWT tokens validated against a blacklist stored in an eventually-consistent cache can be used even after blacklisting until convergence occurs. Use a CP store for token blacklists, or design tokens to expire quickly.
Audit log integrity: If audit logs are written to an eventually-consistent store, logs from different replicas may not appear in causal order, making forensic analysis difficult. Use a causally-consistent or linearizable store for audit logs.

Performance Implications

The performance cost of each consistency level (approximate, in a single datacenter):

Consistency Level    | Latency (p50) | Throughput | Notes
---------------------|---------------|------------|------
Linearizable         | 5-15ms        | Low        | Requires consensus round-trip
Sequential           | 2-8ms         | Medium     | Leader reads + verification
Causal               | 1-3ms         | High       | Metadata tracking overhead
Read-Your-Writes     | 0.5-2ms       | High       | Sticky routing or tokens
Eventual             | 0.1-1ms       | Very High  | No coordination

Cross-datacenter linearizability (Spanner): 10–100ms per operation.

The choice of consistency level is one of the primary dials for tuning throughput/latency vs. correctness. In practice, most systems use eventual consistency for reads that don't require the latest data (product listings, social feeds) and stronger consistency for writes and reads that affect user-visible state (account balances, shopping carts).

Failure Modes and Real Incidents

2013, Riak and eventual consistency: A Riak cluster experienced a partition that lasted 90 seconds. During this time, both partitions accepted writes. After healing, LWW resolved conflicts by discarding the older write. The "lost" writes were shopping cart updates — customers saw their cart emptied. Eventual consistency with LWW is unsuitable for shopping cart data; CRDTs (which Riak later adopted) handle this correctly.

2016, MongoDB's default read concern: Prior to MongoDB 3.6, the default read concern allowed reading from secondary replicas that might be significantly behind the primary. Applications that relied on reading their own writes were broken by default in replica set configurations. MongoDB's defaults have since been improved, but many applications had to be retrofitted with explicit read/write concern specifications.

2018, Cosmos DB and bounded staleness: A team using Azure Cosmos DB with "Bounded Staleness" consistency (maximum K versions or T seconds behind) misconfigured T=3600 (one hour). During a datacenter issue, they were reading data that was an hour stale. The system was behaving correctly per its configuration; the team had not understood the implications of their consistency choice.

Modern Usage

The trend in modern systems is toward explicit, per-operation consistency levels rather than system-wide choices:

Azure Cosmos DB: Five named consistency levels with documented guarantees, selected at the database level but overridable per-request.
Google Cloud Spanner: Strong consistency as the only option (the price is paid in latency, not correctness).
Apache Cassandra 4.x: Lightweight transactions (LWT) for linearizable operations, regular operations for eventual. The programmer explicitly chooses per-operation.
DynamoDB Transactions: As of 2018, DynamoDB supports serializable transactions across multiple items, providing strong consistency for critical operations while maintaining eventual consistency elsewhere.

The "NoSQL = no consistency" era is ending. Modern practitioners choose specific consistency levels for specific data access patterns rather than applying one model globally.

Future Directions

Invariant-based consistency: Research systems like Indigo (Balegas et al., 2015) allow developers to specify application invariants; the system automatically selects the minimum consistency level that preserves them.
Geo-distributed strong consistency at low cost: As CXL (Compute Express Link) and faster interconnects reduce datacenter latency, the cost of strong consistency decreases. Global strong consistency may become economically viable for more use cases.
Consistency in serverless: AWS Lambda and similar platforms introduce new consistency challenges — function instances are stateless and may read from different replicas. Session-based consistency models are difficult to implement without persistent connections.

Exercises

Consistency Level Experiment: Using a Redis cluster (or Cassandra) with 3 replicas, write a value from one client. Immediately read from three different replicas. Measure how long it takes all replicas to return the latest value. Plot the "convergence curve" over 100 iterations.
Read-Your-Writes Implementation: Implement a middleware layer on top of an eventually-consistent key-value store (DynamoDB or Redis) that guarantees read-your-writes. The middleware should: (a) track the most recent write timestamp per key per session, (b) on read, if the returned value is older than the write timestamp, retry or redirect to a consistent source. Measure the overhead vs. direct reads.
Linearizability Checker: Write a simple linearizability checker in Python. Given a history of operations (start_time, end_time, operation, value_written, value_read), determine if there exists a linearization point assignment that is consistent. Test it against known-good and known-bad histories.
Consistency Bug Hunt: Take an open-source application that uses a distributed database (e.g., a Rails app using Redis for sessions, or a Node.js app using MongoDB). Find at least two places in the code where a consistency violation could cause incorrect application behavior. Describe the scenario and the fix.
ACID vs BASE Trade-off Analysis: For an e-commerce shopping cart, design two implementations: one using strict ACID transactions (PostgreSQL), one using BASE eventual consistency (DynamoDB). For each, describe: (a) how items are added/removed, (b) how "total count" is tracked, (c) what happens during a node failure, (d) performance characteristics under 10,000 concurrent users.

References

Herlihy, M., & Wing, J. M. (1990). "Linearizability: A Correctness Condition for Concurrent Objects." ACM Transactions on Programming Languages and Systems, 12(3), 463–492.
Lamport, L. (1979). "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs." IEEE Transactions on Computers, C-28(9), 690–691.
Vogels, W. (2009). "Eventually Consistent." Communications of the ACM, 52(1), 40–44.
Ahamad, M., et al. (1995). "Causal Memory: Definitions, Implementation, and Programming." Distributed Computing, 9(1), 37–49.
Lloyd, W., et al. (2011). "Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS." SOSP 2011.
Corbett, J. C., et al. (2012). "Spanner: Google's Globally Distributed Database." OSDI 2012.
Alvaro, P., et al. (2011). "Consistency Analysis in Bloom: A CALM and Collected Approach." CIDR 2011.
Bailis, P., et al. (2013). "Highly Available Transactions: Virtues and Limitations." VLDB 2014.