11 — CRDTs (Conflict-Free Replicated Data Types)
Technical Overview
A Conflict-free Replicated Data Type (CRDT) is a data structure that can be replicated across multiple nodes with independent updates, and all replicas will eventually converge to the same state without requiring coordination. CRDTs achieve strong eventual consistency — a stronger property than eventual consistency — by defining merge operations that are commutative, associative, and idempotent. The key insight: if the merge function is mathematically well-defined and conflict-free, no coordination protocol (consensus, locking) is needed for convergence.
CRDTs are the principled solution to the "eventual consistency is hard to reason about" problem: instead of ad-hoc conflict resolution, CRDTs provide proven convergence guarantees.
Prerequisites
- Consistency models, especially eventual consistency (
03-consistency-models.md) - Distributed systems fundamentals and partial failure (
01-distributed-systems-fundamentals.md) - CAP theorem (
02-cap-theorem.md) - Basic set theory and abstract algebra (helpful but not required)
Core Content
Why CRDTs Exist
Traditional distributed systems handle conflicts by: 1. Last-Write-Wins (LWW): discard older write. Risk: data loss. 2. Multi-value register: return all conflicts to the application. Risk: application complexity. 3. Consensus/transactions: coordinate before writing. Risk: latency, availability.
CRDTs take a different approach: design the data type so that conflicts cannot occur. All concurrent operations can always be merged, and the result is deterministic.
CRDT Design Philosophy:
Q: How can we allow concurrent updates without conflicts?
A: Only allow operations that compose deterministically.
Counter: you can increment concurrently, but never set.
Inc + Inc = 2 increments (order doesn't matter)
Set(5) + Set(7) = conflict (which wins?)
Set: you can add, but (in simple CRDTs) never remove.
Add(a) + Add(b) = {a, b} (commutative)
Add(a) + Remove(a) = conflict (was a removed before or after add?)
Mathematical Foundation: Join-Semilattice
The mathematical foundation of state-based CRDTs is the join-semilattice (also called a bounded lattice or monotone semilattice).
A join-semilattice is a partially ordered set (S, ≤) where every pair of elements has a least upper bound (join), written a ⊔ b:
Properties of join-semilattice:
Commutativity: a ⊔ b = b ⊔ a
Associativity: (a ⊔ b) ⊔ c = a ⊔ (b ⊔ c)
Idempotence: a ⊔ a = a
For CRDTs: the "state" of a replica is an element of the lattice.
The "merge" function is the join (⊔).
Example: Natural numbers under MAX
State: 0, 1, 2, 3, ...
Join: MAX(a, b)
MAX(3, 5) = 5 (commutativity: MAX(5, 3) = 5 ✓)
MAX(MAX(3,5), 7) = MAX(3, MAX(5,7)) = 7 (associativity ✓)
MAX(5, 5) = 5 (idempotence ✓)
A "last-write-wins register" using MAX on timestamps is a simple CRDT!
For a state-based CRDT to converge, its state must only move "up" in the lattice (monotone growth). This is why CRDTs typically only allow additions, not deletions (in their simplest forms).
State-Based vs Operation-Based CRDTs
State-Based CRDTs (CvRDT — Convergent Replicated Data Types): - Replicas share their full state periodically. - Merge function takes two states and returns the join (least upper bound). - Any replica can merge with any other; order doesn't matter.
State-based CRDT:
Replica A state: S_A
Replica B state: S_B
To merge: S_merged = join(S_A, S_B)
After merging, both replicas have S_merged.
Idempotent: join(S_A, S_A) = S_A — safe to re-send state.
Operation-Based CRDTs (CmRDT — Commutative Replicated Data Types): - Replicas share individual operations. - Operations must be commutative (order doesn't matter) or the delivery mechanism must ensure causal ordering. - More network-efficient for large states; operations are small.
Operation-based CRDT:
Replica A sends: [increment(), increment()]
Replica B sends: [decrement()]
For commutativity: increment, decrement, increment == decrement, increment, increment
Both result in net +1.
Requires: operations delivered exactly once (or idempotent + at-least-once).
CRDT Examples
G-Counter (Grow-Only Counter)
A counter that only increments. Deletion is not allowed.
State: vector of N integers, one per replica (initialized to 0)
[n_1, n_2, ..., n_N] where N = number of replicas
Increment at replica i:
state[i] = state[i] + 1
Query (total value):
sum(state)
Merge(A, B):
[max(A[1], B[1]), max(A[2], B[2]), ..., max(A[N], B[N])]
Example:
Replica 1 state: [3, 2, 0] (replica 1 incremented 3 times, etc.)
Replica 2 state: [3, 5, 1]
Merge: [max(3,3), max(2,5), max(0,1)] = [3, 5, 1]
Total value: 3 + 5 + 1 = 9
Why it works: Each replica only increments its own counter.
No replica decrements any counter. States only grow (monotone).
PN-Counter (Positive-Negative Counter)
Supports both increment and decrement using two G-Counters.
State: (P-vector, N-vector)
P = G-Counter for increments
N = G-Counter for decrements
Increment: P[i]++
Decrement: N[i]++
Value: sum(P) - sum(N)
Merge: (merge_P(A.P, B.P), merge_N(A.N, B.N))
Example:
Replica 1: P=[3,2], N=[1,0] → value = 5 - 1 = 4
Replica 2: P=[2,4], N=[0,2] → value = 6 - 2 = 4
Merge: P=[3,4], N=[1,2] → value = 7 - 3 = 4
Note: the merged value may differ from what each replica computes locally,
but after merging, all replicas compute the same value.
G-Set (Grow-Only Set)
A set that only allows additions.
State: a set S
Add(e): S = S ∪ {e}
Query: S
Merge(A, B): A ∪ B
Commutativity: A ∪ B = B ∪ A ✓
Associativity: (A ∪ B) ∪ C = A ∪ (B ∪ C) ✓
Idempotence: S ∪ S = S ✓
2P-Set (Two-Phase Set)
Supports add and remove, but once removed, an element cannot be re-added.
State: (A, R) — added set, removed set
Invariant: R ⊆ A (you can only remove what was added)
Add(e): A = A ∪ {e}
Remove(e): R = R ∪ {e} (only if e ∈ A)
Query: A \ R
Merge((A1,R1), (A2,R2)) = (A1 ∪ A2, R1 ∪ R2)
Limitation: element can never be re-added after removal.
OR-Set (Observed-Remove Set)
Supports add and remove with re-add capability. Each addition creates a unique tag. Removal only removes specific tagged instances.
State: set of (element, unique_tag) pairs
Add(e): create unique tag t, add (e, t) to set
Remove(e): remove all (e, *) pairs currently in the set
Query: {e | (e, t) ∈ state}
Example:
Replica A: Add("alice") → {("alice", tag1)}
Replica B: concurrent Add("alice") → {("alice", tag2)}
Replica B: Remove("alice") → removes {("alice", tag2)} locally
Before merge: A has {("alice", tag1)}, B has {}
After merge: {("alice", tag1)}
"alice" is present because A's independent addition (tag1) was not removed.
This is the "observed-remove" semantics: you remove what you observed,
not what might have been added concurrently.
The OR-Set is semantically cleaner than 2P-Set because it allows re-addition and handles the add-concurrent-with-remove case correctly. Used in Riak (as "Sets" in Riak Data Types) and Redis CRDT.
LWW-Element-Set (Last-Write-Wins Set)
Uses timestamps for conflict resolution. The element with the most recent timestamp wins.
State: map from element to (add_timestamp, remove_timestamp)
Add(e): set add_ts[e] = current_time
Remove(e): set rem_ts[e] = current_time
Query: {e | add_ts[e] > rem_ts[e]}
Merge: for each element, keep the higher timestamps
Problem: requires synchronized clocks. If two operations happen
at the exact same millisecond, tie-breaking is arbitrary.
Less principled than OR-Set but simpler to implement.
MV-Register (Multi-Value Register)
Rather than resolving conflicts, keep all concurrent versions.
State: set of (value, vector_clock) pairs
Write(v, vc): replace all entries causally dominated by vc, add (v, vc+1)
Read: return all concurrent values
Example:
Initial: {}
A writes "hello": {("hello", VC_A=[1,0])}
B writes "world" concurrently: {("world", VC_B=[0,1])}
After merge: {("hello",[1,0]), ("world",[0,1])} — two siblings
Application must resolve: pick one or merge.
If B then writes "hello world" after seeing both:
VC_B=[1,1]. New state: {("hello world",[1,1])} — dominates both.
This is what Riak used for plain key-value objects (before Riak DTs). The application received multiple values on concurrent writes and had to resolve them.
RGA (Replicated Growable Array) — Collaborative Text Editing
RGA (Roh et al., 2011) is a CRDT for sequences (lists), supporting insert and delete. Used for collaborative text editing.
RGA Concept:
Each character insertion is tagged with a unique identifier (replica, timestamp).
Ordering uses the identifier for deterministic conflict resolution.
Concurrent insertions at the same position:
User A types "a" at position 5 with tag (A, t=3)
User B types "b" at position 5 with tag (B, t=3)
RGA: Sort by tag (e.g., lexicographic on (timestamp, replica_id))
Both insertions are preserved; position is determined by tag ordering.
All replicas compute the same order.
Delete: mark as "tombstone" (logically deleted but physically kept for ordering).
RGA is the algorithm behind many real-time collaborative editors, including early versions of Google Docs' operational transform (OT) component and the Automerge library.
CRDTs in Production Systems
Redis CRDT (Redis Enterprise): Redis Enterprise's Active-Active geo-replication uses CRDTs. Data types supported as CRDTs: Strings (LWW), Counters (PN-Counter), Sets (OR-Set), Sorted Sets, Hashes, Lists. Writes at any datacenter are conflict-free; merges happen during replication.
Riak Datatypes (Basho, 2014): Riak 2.0 introduced CRDT-based data types (Counters, Sets, Maps, Flags, Registers). These replace the application-level conflict resolution that was required with Riak's plain key-value store. riak_dt library implements G-Counter, PN-Counter, OR-Set, and ORSWOT (OR-Set Without Tombstones).
Apache Cassandra (LWW as "almost CRDT"): Cassandra's LWW counter and set types are not true CRDTs (they use timestamps, not lattice algebra), but Cassandra's counters (using gossip-propagated increments) are close to a G-Counter. True Cassandra counters use a server-side increment mechanism that is CRDT-like.
Automerge (JavaScript/Rust library): Automerge is a JSON CRDT library. An Automerge document can be edited independently by multiple clients; changes are merged with no conflicts. Uses a variant of RGA for ordered sequences and a variant of OR-Map for objects. Used in local-first software tools (Ink & Switch research group).
Yjs (JavaScript): Another popular CRDT library for collaborative editing. Used by TipTap, BlockSuite, and other collaborative editing tools. Implements YATA (Yet Another Transformation Approach) — a sequence CRDT similar to RGA but more memory-efficient.
SoundCloud's Roshi: A CRDT-based distributed set implementation for SoundCloud's feed ("recently played") system. Uses LWW-Element-Set semantics on top of Redis.
Historical Context
1998–2000s: Early work on optimistic replication (Saito & Shapiro, 2005 survey) discusses the need for conflict-free operations but doesn't use the CRDT term.
2007: Shapiro, Preguiça, Baquero, and Zawirski at INRIA begin formalizing CRDTs. Marc Shapiro presents early work on "Designing Commutative Operations."
2011: The landmark paper "Conflict-free Replicated Data Types" by Shapiro, Preguiça, Baquero, and Zawirski is published at SSS 2011. This establishes the formal definition, the CvRDT/CmRDT distinction, and proves convergence. A companion technical report provides more detail.
2012: Riak (Basho) implements CRDTs in production. The Riak team (Russell Brown, et al.) publishes blog posts and papers on practical CRDT implementation.
2014: Riak 2.0 ships with full CRDT support. Redis Labs (now Redis) begins work on CRDT-based Active-Active replication.
2016–present: Automerge, Yjs, and other CRDT libraries bring CRDTs to frontend/client development. "Local-first software" movement (Kleppmann, Beresford, Brandal, et al., 2019) advocates for CRDTs as the foundation for collaborative, offline-capable applications.
Debugging Notes
CRDT-specific debugging challenges:
-
Tombstone accumulation: OR-Set and RGA keep tombstones (logically deleted elements) forever. In long-running systems, tombstones accumulate and degrade memory/performance. Solutions: tombstone garbage collection (requires coordination — a quorum must acknowledge the element is gone), ORSWOT (OR-Set Without Tombstones) uses a different encoding.
-
Causality tracking overhead: CmRDTs require causal delivery (or idempotent operations). This typically means attaching vector clocks to every operation. For large systems, vector clocks are O(N) per operation.
-
Merge divergence: A CRDT that appears to converge may have a bug in its merge function (violating commutativity, associativity, or idempotence). Test: generate two independent operation sequences, merge in all orders, verify all results are identical.
-
Observing CRDT state: Because CRDTs merge asynchronously, you may observe intermediate states that seem inconsistent (e.g., a counter reads 7 on one replica and 9 on another). This is expected; after replication, both will show the same value. Monitor
pending_replication_opsto understand convergence lag.
Security Implications
- CRDT manipulation: CRDTs allow any replica to modify state, and the modification is automatically merged. A compromised replica can inject arbitrary increments into a PN-Counter or add arbitrary elements to an OR-Set. CRDTs provide no authentication mechanism; use at the application layer.
- Tombstone exploitation: Delayed tombstone delivery can cause "ghost" elements. An adversary who can delay replication might cause a deleted element to "reappear" briefly during merge operations.
- Counter overflow: A PN-Counter that wraps around (e.g., a 32-bit counter incremented past INT_MAX) produces nonsensical values. Use 64-bit counters and monitor for anomalous counter values.
- OR-Set tag uniqueness: If the unique tags in an OR-Set are not truly unique (e.g., using a non-cryptographic random number with a small space), two different additions might use the same tag. This would cause incorrect merge behavior.
Performance Implications
- State-based CRDTs: Require sending full state on each sync. For a set with 1M elements, this is expensive. Solutions: delta-CRDTs (send only the "delta" since last sync) — Almeida et al., 2016.
- Vector clock overhead: CmRDTs with causal delivery need vector clocks. For N replicas, each operation carries O(N) metadata.
- Tombstone memory: Long-lived OR-Set or RGA with many deletes accumulates tombstones. A 1M-element set that has had 10M deletes over its lifetime has 10M tombstones in memory.
- Read performance: CRDT reads are typically local (no coordination needed). Write performance is also local. Only replication is asynchronous.
- Delta-CRDTs: Rather than send full state, send only the "delta" (the portion of state that changed since last sync). Delta-CRDTs have the same convergence guarantees with much lower bandwidth.
Failure Modes and Real Incidents
2015, Riak Sibling Explosion: Before Riak 2.0 (which introduced proper CRDT types), Riak's plain key-value store could generate "siblings" (multiple conflicting versions) if application-level conflict resolution was missing. A bug in a web service that stored user profiles in Riak caused each write to generate a new sibling rather than resolving the conflict. After 6 months, some keys had thousands of siblings, making reads extremely slow (Riak returned all siblings and the application had to merge them). Fix: switch to Riak CRDT data types or implement proper conflict resolution.
2017, Redis Active-Active CRDT Bug: A Redis Enterprise customer running Active-Active (CRDT-based) discovered that their counter was incrementing correctly but occasionally showing incorrect values during replication. Investigation revealed a bug in Redis's CRDT merge for counters when operations arrived out of causal order. Fix: ensure causal ordering of CRDT operations via vector clocks.
2019, Automerge Performance Regression: A collaborative text editor using Automerge experienced severe performance degradation as documents grew larger. Root cause: Automerge's internal representation stored all historical operations (including tombstones) to reconstruct state. A 50KB document with many edits could have an internal state of 10MB+. Fix: implement "garbage collection" of causally obsolete history.
Modern Usage
CRDTs are increasingly used in: - Collaborative editing: Automerge and Yjs are the standard building blocks for real-time collaborative applications. - Distributed databases: Redis Enterprise, Riak, and research systems like Lasp use CRDTs as their data model. - Edge computing: Applications at the network edge (CDN nodes, IoT devices) use CRDTs to maintain local state without coordinating with a central server. - Local-first software: Applications that work offline and sync when connected (Notion-style apps, offline maps, git-like data stores) use CRDTs for sync-without-conflict.
Future Directions
- Purity and expressiveness: Research into more expressive CRDTs — can we design a CRDT for a database with foreign key constraints? (Balegas et al., 2015 — "Putting Consistency Back into Eventual Consistency" shows this is possible for some invariants.)
- Hardware CRDTs: CRDTs could be implemented at the NIC level, merging state at line rate without involving the CPU.
- Mixed-consistency CRDTs: Systems like Antidote DB (SyncFree EU project) allow mixing CRDT operations with transactions. CRDT updates are always available; transactional reads/writes use a higher consistency level.
- Delta-CRDT standardization: Delta-CRDTs reduce bandwidth to O(delta_size) from O(state_size). Standardizing delta representations would enable efficient cross-library CRDT sync.
Exercises
-
G-Counter Implementation: Implement a G-Counter in Python for a 3-node cluster. Simulate: (a) each node incrementing its own counter 5 times, (b) merging all states, (c) verifying the total equals 15. Then simulate a network partition where N1 and N2 cannot communicate for 10 "ticks", each incrementing independently, then merging. Verify the final count is correct.
-
OR-Set Correctness: Implement an OR-Set in Python. Test the concurrent add-remove scenario: (a) Replica A adds "alice", (b) Replica B concurrently removes "alice" (its own copy), (c) merge. Is "alice" in the set? Now test: (a) Replica A adds "alice", replication happens, (b) Replica B removes "alice", (c) Replica A concurrently adds "alice" again, (d) merge. Is "alice" in the set? Explain both results.
-
PN-Counter Under Partitions: Build a PN-Counter for 5 nodes. Simulate a partition: nodes 1-3 in one group, nodes 4-5 in another. Each group runs 100 increments and 50 decrements independently. Merge the states. Verify the total equals: (sum of all increments) - (sum of all decrements) across all groups. Simulate node 3 failing permanently: what is the final value?
-
RGA Text Editing: Implement a simplified RGA list CRDT. Simulate two users editing a shared text document: User A types "Hello " at position 0; User B types "World" at position 0, concurrently. Merge both changes. What is the resulting text? Is it consistent across replicas A and B? Why?
-
Tombstone GC Design: Design a tombstone garbage collection mechanism for an OR-Set. Requirements: (a) an element can only be garbage-collected if all replicas have acknowledged the deletion, (b) the GC must not coordinate with replicas that are currently partitioned, (c) GC must be safe to apply even if messages arrive out of order. Describe the protocol and identify any remaining edge cases.
References
- Shapiro, M., Preguiça, N., Baquero, C., & Zawirski, M. (2011). "Conflict-free Replicated Data Types." SSS 2011. Also: INRIA Technical Report RR-7687.
- Shapiro, M., et al. (2011). "A Comprehensive Study of Convergent and Commutative Replicated Data Types." INRIA Technical Report RR-7506.
- Roh, H. G., Jeon, M., Kim, J. S., & Lee, J. (2011). "Replicated Abstract Data Types: Building Blocks for Collaborative Applications." Journal of Parallel and Distributed Computing, 71(3), 354–368. (RGA)
- Almeida, P. S., Shoker, A., & Baquero, C. (2016). "Delta State Replicated Data Types." Journal of Parallel and Distributed Computing, 111, 162–173.
- Kleppmann, M., et al. (2019). "Local-First Software: You Own Your Data, in spite of the Cloud." Onward! 2019. (Automerge background)
- Balegas, V., et al. (2015). "Putting Consistency Back into Eventual Consistency." EuroSys 2015.
- Saito, Y., & Shapiro, M. (2005). "Optimistic Replication." ACM Computing Surveys, 37(1), 42–81.