08 — Replication

Technical Overview

Replication is the practice of maintaining copies of the same data on multiple machines. Its goals are threefold: high availability (the system continues if a node fails), durability (data survives individual node failures), and read scalability (read traffic can be spread across replicas). Replication is not free — it introduces complexity around consistency, conflict resolution, and replication lag. Understanding the tradeoffs between different replication strategies is essential to designing systems that meet their availability, consistency, and performance requirements.

Prerequisites

CAP theorem and consistency models (02-cap-theorem.md, 03-consistency-models.md)
Leader election concepts (07-leader-election.md)
Logical clocks for conflict detection (04-logical-clocks.md)
Basic database concepts (write-ahead log, transactions)

Core Content

Replication Goals

┌─────────────────────────────────────────────────────────┐
│  Replication Goals                                      │
│                                                         │
│  1. High Availability                                   │
│     System keeps running despite individual failures    │
│                                                         │
│  2. Durability                                          │
│     Data not lost when nodes fail                       │
│     (different from availability — node can be down     │
│     but data still recoverable on restart)              │
│                                                         │
│  3. Read Scalability                                    │
│     Multiple replicas serve read traffic                │
│     (writes still bottleneck on primary)                │
│                                                         │
│  4. Geographic Distribution (advanced)                  │
│     Replicas in multiple datacenters/regions            │
│     Reduces latency for geographically distributed users│
└─────────────────────────────────────────────────────────┘

Synchronous vs Asynchronous Replication

Synchronous Replication:

  Client → Primary → Replica1 → Primary → Client (ack)
                   → Replica2 ↗

  Primary waits for ALL replicas to acknowledge before
  confirming to client.

  Pros: Strong durability guarantee (data on N copies before ack)
  Cons: Write latency = max(replica RTTs); any slow/failed replica blocks writes

Asynchronous Replication:

  Client → Primary → Client (immediate ack)
           ↓
         [replication queue]
           ↓
         Replica1, Replica2 (eventually)

  Pros: Low write latency; primary doesn't block on replica health
  Cons: If primary fails before replication, data is lost;
        replicas may be stale (replication lag)

Semi-synchronous (MySQL default):

  At least one replica must acknowledge before ack to client.
  Other replicas are asynchronous.

  Pros: Balance between durability and latency
  Cons: Adds latency of fastest replica; still risk of data loss
        if the one synchronous replica is also the one that fails

Primary-Backup (Single-Leader) Replication

Single-leader replication is the most common model:

Single-Leader Replication:

  ┌──────────────────────────────────────────────────────┐
  │  All writes → Primary                                │
  │  Reads → Primary or Replicas                         │
  │                                                      │
  │  Client ──write──→ [Primary]                         │
  │                         │                            │
  │                    replication                       │
  │                    log/stream                        │
  │                    │         │                       │
  │                    ↓         ↓                       │
  │               [Replica1] [Replica2]                  │
  │                                                      │
  │  Client ──read───→ [Primary] or [Replica1/2]         │
  └──────────────────────────────────────────────────────┘

Replication mechanism: Changes are propagated via a replication log — a sequence of changes in the order they were applied on the primary. Different databases use different log formats:

Statement-based replication (MySQL, pre-5.1): Primary sends SQL statements to replicas. Problem: non-deterministic functions (NOW(), RAND(), auto-increment) produce different results on replicas.
Row-based replication (MySQL row-based binlog, PostgreSQL logical replication): Primary sends before/after values for each changed row. More bandwidth but deterministic.
WAL-based replication (PostgreSQL streaming replication): Primary streams Write-Ahead Log (WAL) segments. Replicas replay the WAL. Low overhead; tightly coupled to storage format.

PostgreSQL Streaming Replication:

PostgreSQL Primary-Standby:

  Primary: WAL Writer → WAL files → WAL Sender
                                          │
                                    TCP connection
                                          │
  Standby: WAL Receiver → WAL files → Startup Process
                                          │
                                    Replay WAL entries

  Synchronous standby:
    primary_conninfo in postgresql.conf
    synchronous_commit = on (wait for WAL flush on standby)

  Asynchronous standby:
    synchronous_commit = off or remote_write

MySQL Binlog Replication:

MySQL Primary-Replica:

  Primary: InnoDB → Binary Log (binlog) → Binlog Dump Thread
                                                │
                                          TCP (replication protocol)
                                                │
  Replica:  I/O Thread → Relay Log → SQL Thread → InnoDB

  GTID-based replication (MySQL 5.6+):
    Each transaction assigned a Global Transaction ID (GTID).
    Replicas track which GTIDs they've applied.
    Enables automatic replica positioning after failover.

Multi-Leader Replication

In multi-leader (active-active) replication, multiple nodes can accept writes. Changes are replicated to all other leaders.

Multi-Leader Replication:

  ┌──────────┐         ┌──────────┐
  │ Leader A │←──────→│ Leader B │
  │ (DC1)    │         │ (DC2)    │
  └──────────┘         └──────────┘
       ↑                     ↑
  writes                  writes

  Advantage: Low-latency writes from any datacenter
  Problem:   Write conflicts when both leaders accept different
             writes to the same key concurrently

Conflict Resolution:

Last-Write-Wins (LWW): The write with the highest timestamp wins. Simple, but loses data if timestamps are equal or if clocks are skewed. Used by Cassandra, DynamoDB (original behavior).

Merge strategies: Application-defined merge logic. Example: a CRDT counter adds all increments from all replicas. This is conflict-free by design.

Custom conflict handlers: Application code examines both versions and decides. Used by CouchDB, which surfaces conflicts to the application as multiple "versions" of a document.

Operational Transformation (OT) and CRDTs (see 11-crdts.md): mathematically guaranteed convergence.

Multi-leader in production: MySQL Group Replication (Galera Cluster), CockroachDB (multi-region with Raft), Google Docs (OT for collaborative editing), and Lotus Notes (early multi-master email).

Leaderless Replication (Dynamo-Style)

In leaderless replication (popularized by Amazon Dynamo, 2007), any replica can accept writes. Consistency is achieved through quorums.

Leaderless Replication (Quorum):

  Configuration: N=3 (replication factor), W=2 (write quorum), R=2 (read quorum)

  Write "x=5":
    Client → Replica 1: x=5 ✓
    Client → Replica 2: x=5 ✓  ← W=2 acked, write succeeds
    Client → Replica 3: x=5 ✗ (network issue)

  State: Replica 3 has stale x=old_value

  Read "x":
    Client → Replica 1: x=5 (ts=100)
    Client → Replica 2: x=5 (ts=100)
    Client → Replica 3: x=old_value (ts=50)  
    ← R=2 responses with ts=100 → return x=5
    ← Read repair: send x=5,ts=100 to Replica 3

  Consistency guarantee: W + R > N ensures at least one responding
  reader has the latest write (overlap between W-write set and R-read set)

Read repair: When a read detects a stale replica (lower version), the coordinator updates the stale replica with the fresh value. This is an asynchronous repair mechanism.

Anti-entropy (background repair): A background process compares data between replicas and synchronizes any differences. Used in Cassandra, Riak, and DynamoDB. Does not provide read-time consistency guarantees.

Sloppy Quorum and Hinted Handoff: If not enough replicas are available for the standard quorum, a sloppy quorum accepts writes to any W nodes that are available (even "wrong" nodes). Those nodes store a "hint" that the data belongs to unavailable nodes and forward it when they recover.

Sloppy Quorum Example (Cassandra):

  Normal replicas for key K: [R1, R2, R3]
  R1 and R2 are down.

  Client writes K=5, W=2:
  Without sloppy quorum: write FAILS (only R3 available)
  With sloppy quorum: write to R3 + R4 (or R5)
    R4 stores: data={K:5}, hint={real_owner: R1}
    When R1 recovers, R4 "hands off" the write to R1.

  Sloppy quorum improves write availability at the cost of potentially
  making the strict quorum condition W+R>N insufficient.

Chain Replication

Introduced by Renesse and Schneider (2004), chain replication is an alternative to primary-backup that provides strong consistency with better throughput.

Chain Replication:

  ┌────────┐  write  ┌────────┐  write  ┌────────┐
  │ Head   │────────→│ Middle │────────→│  Tail  │
  └────────┘         └────────┘         └────────┘
       ↑                                     │
  writes from                          acks to client
  clients                              (after full chain commit)

  Read? Always from TAIL (tail has all committed writes).
  Write? Enter at HEAD, propagate through chain.
  Ack? Tail acknowledges after applying the write.

  Advantages:
  - Strong consistency: reads from tail always see committed writes
  - Split writes/reads: head handles write throughput, tail handles reads
  - Failure recovery: well-defined (head failure → next becomes head)

  Used by: CRAQ (Chain Replication with Apportioned Queries, 2009),
           Azure Storage (block blobs), 
           HBase (write-ahead log shipping)

Replication Lag Problems

The replica lag problem: Between the time a write is applied to the primary and the time it reaches a replica, that replica has stale data. All reads from that replica during this window return old data.

Replication Lag Scenarios:

1. Reading-your-own-writes (violated):
   User posts comment → write to primary
   User refreshes page → read from replica (lagging 1s)
   → Comment doesn't appear
   → Fix: route reads to primary after writes, or use causal consistency

2. Monotonic reads (violated):
   User reads post with 5 comments (from replica with small lag)
   User refreshes → reads from different replica (large lag)
   → Sees only 3 comments
   → Perceived as comment deletion
   → Fix: route same user to same replica (sticky routing)

3. Consistent prefix reads (violated):
   Table orders: [order1, order2, order3]
   Due to replication lag, replica shows: [order1, order3] (missing order2)
   → Fix: replicas must not skip log entries; apply in order

Kafka Replication

Kafka's replication model is unique:

Kafka Partition Replication:

  Partition P0: Leader=Broker1, Replicas=[Broker2, Broker3]
  ISR (In-Sync Replicas) = {Broker1, Broker2, Broker3}

  Producer sends message with acks=all:
    Message → Broker1 (leader), written to local log
    Broker2 and Broker3 fetch from leader (pull-based replication)
    Once both are in ISR and have fetched → ack to producer

  If Broker3 falls behind (or fails):
    ISR shrinks to {Broker1, Broker2}
    acks=all now only requires Broker1+Broker2

  min.insync.replicas=2: even with acks=all, require at least 2 in ISR
    If only leader is available → write FAILS (reject with error)
    This is the durability setting

  acks=1: producer ack after leader write only (fastest, least durable)
  acks=0: fire and forget (fastest, potentially lossy)
  acks=-1/all: wait for all ISR replicas (slowest, most durable)

Historical Context

1970s–1980s: Primary-backup replication is used in fault-tolerant systems. The IBM IMS database (1968) used journal-based replication. TANDEM NonStop (1970s) pioneered fault-tolerant replication for financial transactions.

1990: Oki and Liskov publish "Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems" — one of the earliest formal treatments of primary-backup replication, equivalent in power to Paxos.

1996: MySQL introduces statement-based binlog replication, making asynchronous replication mainstream.

2004: van Renesse and Schneider publish "Chain Replication for Supporting High Throughput and Availability" at OSDI.

2007: Amazon's Dynamo paper introduces leaderless replication with sloppy quorums and hinted handoff to the mainstream engineering audience.

2010: PostgreSQL 9.0 introduces built-in streaming replication, replacing complex third-party replication solutions.

Production Examples

PostgreSQL (Streaming Replication): GitHub runs PostgreSQL primary-replica with synchronous replication to one standby per datacenter. Failover is managed by Patroni (a Python-based HA solution using etcd or Consul for leader election). Their setup: primary + 2 synchronous standbys + async replicas for read scaling.

MySQL at Facebook: Facebook ran MySQL with a semi-synchronous replication variant. Writes acknowledged after at least one replica persisted the binlog. They developed MyRocks (RocksDB storage engine for MySQL) to improve write throughput on the replication path.

Cassandra (leaderless): Cassandra uses leaderless replication with configurable consistency. Spotify runs Cassandra for user playlist data with LOCAL_QUORUM consistency. Netflix uses Cassandra for billing data with QUORUM across two datacenters.

DynamoDB (leaderless, sort-of): DynamoDB's internal implementation (not the original Dynamo paper) uses Raft-based replication per partition. Externally, it appears leaderless because any endpoint can route to the correct partition leader.

Kafka (ISR-based): LinkedIn (Kafka's origin) runs Kafka with acks=all and min.insync.replicas=2 for critical event streams (member activity, ad clicks). This ensures no data loss even if one broker fails during a write.

Debugging Notes

Diagnosing replication lag:

# PostgreSQL - check replica lag
SELECT client_addr, state, 
       pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS bytes_behind
FROM pg_stat_replication;

# MySQL - check replica lag
SHOW REPLICA STATUS\G
# Look for: Seconds_Behind_Source

# Cassandra - check for pending repairs
nodetool status
nodetool netstats  # Shows streaming in/out progress

Common replication issues:

Replica far behind: Primary disk I/O bottleneck → writes faster than replicas can apply. Fix: throttle replication, add more replicas, or improve disk.
Replication deadlock (MySQL): A long-running transaction on the replica holds locks that a new replication event needs. Fix: set slave_parallel_workers > 1 for parallel replication.
Hinted handoff buildup (Cassandra): If a node is down for a long time, hints accumulate on other nodes. When the node returns, a "hinted handoff storm" sends all accumulated hints, overloading the recovering node.
Write-after-read inconsistency: Client reads x=5 from replica, then writes x=6 based on that read, expecting result=6. But if the replica was stale and x was actually 7, the write produces x=6 (going backward). Use read-modify-write transactions or CAS operations.

Security Implications

Replication credentials: MySQL replication uses a dedicated MySQL user with REPLICATION SLAVE privilege. This user's credentials must be protected; compromise enables reading all data and potentially injecting events into the binlog.
Encrypted replication streams: By default, MySQL and PostgreSQL replication streams are unencrypted. Require TLS for replication connections. For PostgreSQL, set ssl=on in pg_hba.conf for replication connections.
Data sovereignty: In multi-datacenter replication, data crosses geographic and legal boundaries. Ensure encryption in transit; audit what data is replicated to which region.
Replicas as attack targets: A lagging replica may be less monitored than the primary. An attacker who compromises a replica may have more time to exfiltrate data before detection.

Performance Implications

Synchronous replication cost:

Write latency = max(primary_write_latency, replica_write_latency + 2×network_RTT)

Single datacenter: +1-5ms (fsync on replica + network)
Cross-datacenter (us-east to us-west): +60-80ms (cross-country RTT)

Replication throughput: - PostgreSQL WAL streaming: can replicate at near-line-rate (multi-GB/s on good hardware) - MySQL binlog: limited by SQL thread applying speed. slave_parallel_workers=8 allows 8 threads applying to different schemas/tables. - Cassandra: gossip-based replication. ISR flush rate varies with write QPS.

Read scaling: Adding read replicas linearly scales read throughput. A primary + 3 async replicas can handle 4× the read throughput of a single node. Writes still bottleneck on the primary.

Failure Modes and Real Incidents

2012, GitHub MySQL Failover: A network partition caused GitHub's primary MySQL to be unreachable from one datacenter. Orchestrator promoted a 30-second-lagging replica. 30 seconds of writes were effectively lost. The root cause: asynchronous replication with no synchronous durability guarantee. Fix: switch to semi-synchronous replication.

2016, MongoDB Secondary "Going Wild": A MongoDB replica set secondary was far behind its primary (multi-hour lag) due to a slow disk. The replica's oplog ran out of space (oplog is a fixed-size capped collection). The secondary had to do a full initial sync, taking 48 hours. During those 48 hours, the cluster ran with N-1 replicas, violating its redundancy SLA.

2020, Cassandra Compaction Storm: A Cassandra cluster that had been running with hinted handoff enabled for a node that was down for 24 hours. When the node returned, it received 24 hours' worth of hints simultaneously. The resulting write storm consumed all disk I/O, causing the node to fail again. Fix: limit max_hint_window_in_ms and rate-limit hinted handoff delivery.

Modern Usage

Replication is so fundamental that every database includes it: - CockroachDB/TiDB: Use Raft per-range for synchronous replication. Every write is replicated before ack. No secondary lag. - PlanetScale (Vitess): MySQL-compatible distributed database with automatic shard management and MySQL Group Replication per shard. - Turso (libSQL): SQLite with Raft replication. Brings strong replication to the embedded database space. - Amazon Aurora: MySQL/PostgreSQL-compatible with a novel storage layer. The log is replicated to 6 copies across 3 AZs synchronously. The compute layer is separate from storage — replicas share the same storage layer, so there's no "replica lag" in the traditional sense.

Future Directions

Storage-disaggregated replication (Aurora, Socrates, Neon): Separate compute from storage. The storage layer handles replication; compute nodes don't need to replicate to each other. Enables near-zero replica lag for read replicas.
Conflict-free replication (CRDTs): As CRDTs mature, multi-leader replication without conflict resolution becomes feasible for more data types.
Geo-replicated strong consistency: Spanner and CockroachDB have demonstrated this is possible. As the latency cost decreases (better interconnects, edge locations), more applications will adopt it.

Exercises

Replication Lag Measurement: Set up a PostgreSQL primary-standby with async replication. Write 100,000 rows to the primary as fast as possible. On the standby, poll pg_stat_replication every 100ms and log bytes_behind. Plot the replication lag over time. At what write rate does lag exceed 1 second?
Quorum Math: For a Cassandra cluster with N=5 replicas: (a) Calculate all W, R pairs where W+R>N (strong consistency). (b) For each pair, calculate how many node failures each can survive while still allowing writes and reads. (c) For N=5, RF=3 with W=2, R=2: is this strongly consistent? What happens if 2 nodes fail?
Conflict Resolution: Implement a simple multi-leader key-value store with 3 "leaders." Simulate a network partition where leaders A and B accept concurrent writes to the same key. Implement three conflict resolution strategies: (a) LWW with physical timestamp, (b) LWW with Lamport clock, (c) multi-value register (keep both, return both to client). Compare results for 1000 random concurrent writes.
Chain Replication: Implement a 3-node chain replication system. Write a test that: (a) sends 100 writes to the head, (b) simulates the middle node failing during write #50, (c) verifies that the tail only has the writes that completed the full chain before the failure, (d) promotes the tail to the new middle after replacing the failed node.
Kafka Replication Deep Dive: Set up a 3-broker Kafka cluster with a topic having RF=3. Using kafka-consumer-groups and kafka-log-dirs tools: (a) measure the ISR shrink time when you kill a broker, (b) measure the time for the killed broker to rejoin ISR after restart, (c) compare end-to-end latency for acks=1 vs acks=all under normal operation and during broker failure.

References

Oki, B. M., & Liskov, B. H. (1988). "Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems." PODC 1988.
DeCandia, G., et al. (2007). "Dynamo: Amazon's Highly Available Key-value Store." SOSP 2007.
van Renesse, R., & Schneider, F. B. (2004). "Chain Replication for Supporting High Throughput and Availability." OSDI 2004.
Kreps, J., Narkhede, N., & Rao, J. (2011). "Kafka: A Distributed Messaging System for Log Processing." NetDB Workshop at VLDB 2011.
Ports, D. R. K., Clements, A. T., Zhang, I., Madden, S., & Liskov, B. (2010). "Transactional Consistency and Automatic Management in an Application Data Cache." OSDI 2010.
Bernstein, P. A., & Goodman, N. (1987). "Multiversion Concurrency Control — Theory and Algorithms." ACM Transactions on Database Systems, 8(4), 465–483.
Özsu, M. T., & Valduriez, P. (2011). Principles of Distributed Database Systems (3rd ed.). Springer.