Skip to content

10 — Distributed Database Architectures

Technical Overview

Distributed databases extend the DBMS beyond a single node to achieve scale-out throughput, geographic distribution, high availability, and fault tolerance. The architectural decisions made at design time — how data is partitioned, how transactions are coordinated, how consistency is maintained — determine the entire operational and capability profile of the system. There is no universally optimal distributed database architecture; each reflects a set of tradeoffs in the CAP theorem space, latency budgets, and operational complexity.

Google Spanner, CockroachDB, Amazon Aurora, TiDB/TiKV, Vitess, and Citus represent six distinct architectural families, each making different tradeoffs. Understanding these architectures at the component level is essential for choosing the right database for a given workload and for reasoning about the consistency, latency, and failure behavior of distributed SQL.

Prerequisites

  • Understanding of WAL, MVCC, and B+ tree storage
  • Familiarity with consensus protocols (Raft, Paxos at a conceptual level)
  • Basic networking concepts (latency, bandwidth, TCP)
  • Knowledge of the CAP theorem and ACID properties

Core Content

Shared-Nothing vs Shared-Disk vs Shared-Memory

The foundational taxonomy of distributed database architectures:

Shared-Memory:
  +--------+   +--------+   +--------+
  | CPU 0  |   | CPU 1  |   | CPU 2  |
  +---+----+   +---+----+   +---+----+
      |             |             |
      +------+------+------+------+
             |    SHARED RAM    |
             +------------------+

  - All CPUs share the same address space
  - Limited to a single server (NUMA boundaries)
  - Examples: PostgreSQL shared_buffers, SMP databases
  - Scaling: vertical only (add CPUs/RAM to one machine)

Shared-Disk:
  +--------+   +--------+   +--------+
  | Node 0 |   | Node 1 |   | Node 2 |  <- compute nodes
  +---+----+   +---+----+   +---+----+
      |             |             |
      +------+------+------+------+
             |  SAN / EBS / NFS  |  <- shared storage
             +-------------------+

  - Each compute node has its own buffer pool (different caches)
  - Disk is shared; buffer pool coherence is the challenge
  - Examples: Oracle RAC, Amazon Aurora
  - Scaling: add compute nodes (but write locking is complex)

Shared-Nothing:
  +----------+   +----------+   +----------+
  | Node 0   |   | Node 1   |   | Node 2   |
  | CPU+RAM  |   | CPU+RAM  |   | CPU+RAM  |
  | LOCAL    |   | LOCAL    |   | LOCAL    |
  | STORAGE  |   | STORAGE  |   | STORAGE  |
  +----------+   +----------+   +----------+
       Shard 0        Shard 1        Shard 2

  - Each node owns a disjoint partition of data
  - No shared resources; coordination via network
  - Examples: CockroachDB, TiKV, Cassandra, Vitess
  - Scaling: add nodes, rebalance shards

Shared-nothing dominates modern distributed SQL because it offers linear write scalability. Shared-disk remains relevant for read-scale-out (Amazon Aurora: one writer, multiple readers sharing the same storage).

Distributed Query Execution: Push vs Pull

Pull Model (Volcano/Iterator): Each operator calls getNext() on its child, pulling one tuple at a time up the operator tree. Used by traditional DBMS. Simple to implement but poor parallelism and high per-tuple overhead.

Push Model: Data is pushed from scan operators through the operator tree. Used by columnar/vectorized engines (Hyper, Umbra, Velox). Better pipeline execution and SIMD utilization.

Distributed execution (shuffle partitioning): For distributed joins, data must be co-located. The optimizer inserts exchange operators (shuffle, broadcast, gather) into the plan:

Distributed Hash Join:
  Query: SELECT * FROM orders o JOIN customers c ON o.cust_id = c.id

  Before execution:
    orders is distributed by hash(order_id) across nodes 0,1,2
    customers is distributed by hash(cust_id) across nodes 0,1,2
    o.cust_id != hash distribution of orders => must SHUFFLE

  Plan:
    Gather (node 0 collects results)
        |
    Hash Join (on each node)
       /          \
  Shuffle          Shuffle
  (orders by       (customers by
   cust_id)         id)
      |                 |
  Scan(orders)     Scan(customers)

  Network: each row is sent to the node responsible for its join key

Broadcast join: if one table is small, broadcast it to all nodes and do a local join, avoiding shuffle. Presto/Trino and CockroachDB use broadcast join for tables below a threshold size.

Google Spanner Architecture

Spanner (Corbett et al., 2012) is Google's globally-distributed relational database. It provides external consistency (a property stronger than serializability: transactions appear to execute at a globally consistent wall-clock time) without sacrificing ACID semantics.

Key architectural components:

TrueTime API: Spanner uses GPS receivers and atomic clocks in each datacenter to provide a bounded uncertainty interval on the current time: TT.now() returns [earliest, latest] instead of a point in time. The clock error is bounded by ε (typically 1-7ms).

TrueTime:
  TT.now() = [t-ε, t+ε]  (a confidence interval)
  TT.before(t) = true if tt.latest < t
  TT.after(t)  = true if tt.earliest > t

External Consistency Protocol:
  On commit of transaction T:
  1. Acquire timestamp s = TT.now().latest  (start of commit interval)
  2. Wait until TT.after(s) is TRUE         (commit wait: typically 2*ε ≈ 14ms)
  3. Apply transaction at timestamp s
  Result: any future transaction starting after this commit will have
          TT.now().earliest > s, so it will see the commit.

Paxos groups: Spanner partitions data into "tablets" (B+ tree shards). Each tablet is replicated across 5 Paxos replicas (typically 3 in one region, 2 in another). Writes go through Paxos consensus within the tablet. Cross-tablet transactions use 2-phase commit.

Spanner data model: Hierarchical tables with interleaved storage — child rows are co-located with parent rows in the same tablet. INTERLEAVE IN PARENT customers stores customer rows physically next to their orders.

CockroachDB Architecture

CockroachDB (Spencer Kimball, Peter Mattis, Ben Darnell; 2015) is an open-source distributed SQL database inspired by Spanner but without GPS/atomic clocks.

Raft consensus per range: Data is divided into 512MB ranges (similar to Spanner tablets). Each range is replicated to 3 Raft nodes. Writes go through Raft consensus for that range. Reads can be served from the Raft leader without additional consensus (leader lease).

HLC (Hybrid Logical Clock): Without atomic clocks, CockroachDB uses HLC (Kulkarni et al., 2014) — a hybrid of physical (wall clock) and logical (Lamport clock) timestamps. HLC preserves causality while using physical time for approximate ordering.

Write intents: CockroachDB's MVCC implementation uses "write intents" — a temporary record in RocksDB indicating an in-progress transaction has written a value. Intents are resolved (committed or aborted) when the transaction completes. Other transactions encountering an intent must determine if it's committed by contacting the transaction coordinator.

PostgreSQL wire protocol: CockroachDB speaks the PostgreSQL wire protocol, making it compatible with existing PostgreSQL clients and drivers.

Amazon Aurora Architecture

Aurora (Verbitski et al., 2017) is Amazon's shared-disk cloud database. It separates compute (the Aurora database process, running MySQL/PostgreSQL-compatible SQL) from storage (a distributed 6-way replicated storage service).

Aurora Architecture:

  +-----------+   +-----------+   +-----------+
  |  Writer   |   | Reader 0  |   | Reader 1  |  <- Compute nodes (SQL engine)
  |  (MySQL/  |   | (read-    |   | (read-    |
  |   PG)     |   |  only)    |   |  only)    |
  +-----+-----+   +-----+-----+   +-----+-----+
        |               |               |
        |    Only WAL records sent ->   |
        v               v               v
  +----------------------------------------------+
  |        Aurora Distributed Storage            |
  |  6 copies across 3 AZs (quorum = 4/6)       |
  |  Storage nodes apply WAL, serve page reads   |
  +----------------------------------------------+

Key insight: Aurora sends only WAL records to storage, not full data pages. Storage nodes apply the WAL to reconstruct pages on demand. This reduces write bandwidth by ~7x compared to a traditional setup. The writer's "InnoDB redo log" is effectively distributed to 6 storage nodes.

Aurora's storage provides: atomic durability (4/6 quorum write), fast crash recovery (no need to replay WAL on compute restart — storage is already current), and read scale-out (multiple reader instances share the same storage, each with an MVCC read-only view).

TiDB/TiKV Architecture

TiDB is an open-source HTAP (Hybrid Transactional/Analytical Processing) distributed database from PingCAP.

TiDB Architecture:

  +-------+ +-------+ +-------+
  | TiDB  | | TiDB  | | TiDB  |  <- SQL Layer (stateless, MySQL-compatible)
  | Node  | | Node  | | Node  |
  +-------+ +-------+ +-------+
      |           |          |
  +---+------+----+------+---+
  |  PD (Placement Driver)   |  <- Metadata + Raft leader orchestration
  |  TrueTime-like TSO       |     Global timestamp oracle
  +--------------------------+
      |
  +--------+ +--------+ +--------+
  | TiKV   | | TiKV   | | TiKV   |  <- Storage: RocksDB + Raft
  | Region | | Region | | Region |
  | (Raft) | | (Raft) | | (Raft) |
  +--------+ +--------+ +--------+
      |
  +--------+
  |TiFlash |  <- Columnar replica (TiKV data replicated to columnar format)
  |        |     Used for OLAP queries
  +--------+

TiKV uses RocksDB as its underlying storage engine, with Raft groups per 96MB "Region" (equivalent to Spanner tablets). The Placement Driver (PD) handles region scheduling (load balancing, split, merge), cluster metadata, and the global Timestamp Oracle (TSO) — a centralized service that assigns monotonically increasing transaction timestamps.

2-Phase Commit in TiKV: TiDB uses a Percolator-style 2PC (Peng & Dabek, 2010) with: 1. Prewrite phase: Write all "lock" records to TiKV (one primary + secondary locks) 2. Commit phase: Commit the primary lock (atomic); asynchronously clean up secondary locks

Vitess: MySQL Sharding

Vitess (Jain et al., YouTube, 2010; open-sourced) is a middleware layer that shards MySQL horizontally. Originally built to handle YouTube's MySQL scale, it is now a CNCF graduated project used by Slack, HubSpot, GitHub, and PlanetScale.

Vitess Architecture:

  Application
      |
  +----------+
  | VTGate   |  <- SQL router (understands shard topology, routes queries)
  +----+-----+
       |  (routes to correct shard based on sharding key)
       |
  +----+----+    +----------+    +----------+
  | VTTablet|    | VTTablet |    | VTTablet |  <- Vttablet per MySQL
  | (shard 0|    | (shard 1)|    | (shard 2)|
  +----+----+    +----------+    +----------+
       |              |               |
   MySQL 0         MySQL 1         MySQL 2

  Topo: ZooKeeper/etcd stores shard topology (which shard owns which key range)

Cross-shard queries are handled by VTGate via scatter-gather: broadcast the query to all shards, collect results, merge and sort at VTGate. Cross-shard transactions use 2PC with MySQL XA transactions.

Citus: PostgreSQL Extension Sharding

Citus (Ozgun Erdogan et al., acquired by Microsoft for Azure) is a PostgreSQL extension that transforms a single PostgreSQL server into a distributed database.

Citus Architecture:

  Coordinator Node (standard PostgreSQL with Citus extension)
  |
  | Distributed query planner inserts routing logic
  |
  +------+------+------+
  |      |      |      |
Worker  Worker  Worker  Worker  <- Regular PostgreSQL nodes
  0       1       2       3
(shards)(shards)(shards)(shards)

Citus distributes tables by a distribution column: SELECT create_distributed_table('orders', 'customer_id') partitions orders into 32 shards by hash of customer_id. Co-located tables (sharded on the same key) enable local joins on each worker.

Reference tables (create_reference_table('countries')) are replicated to all workers for efficient cross-shard joins.

Citus is particularly strong for multi-tenant SaaS architectures where the tenant ID is the shard key — all data for a tenant is co-located on one worker, enabling full ACID semantics per tenant.

Distributed JOIN Challenges

Cross-shard joins are the primary performance and complexity challenge in distributed databases:

  1. Broadcast join: Replicate the smaller table to all nodes. O(|small_table| * num_nodes) network.
  2. Shuffle join: Repartition both tables by join key. O(|both_tables|) network.
  3. Co-located join: Data is pre-sharded on the join key — zero network. Ideal but requires careful schema design.
  4. Push-down join: If the join can be pushed to a single shard (e.g., all data for customer_id=42), execute locally. Zero network.

Global Secondary Indexes

A global secondary index (GSI) allows lookups by non-shard-key columns across all shards:

Orders table sharded by customer_id:
  Shard 0: customer_id in [0, 1000000)
  Shard 1: customer_id in [1000000, 2000000)
  Shard 2: customer_id in [2000000, 3000000)

Query: SELECT * FROM orders WHERE order_id = 12345
  -> order_id is NOT the shard key -> must scatter to all shards (expensive)
  -> OR: maintain a global secondary index: (order_id -> shard + row location)

GSI:
  Index shard: order_id -> (shard_id, customer_id)
  Lookup: 1) Query index shard for order_id=12345 -> shard=1, cust_id=1500000
           2) Query shard 1 for cust_id=1500000, order_id=12345

GSI updates are expensive: they require a distributed write (update both the data shard and the index shard). DynamoDB, Spanner, and CockroachDB support GSIs with varying consistency guarantees.

Historical Context

The earliest distributed databases (like IBM's SDD-1, 1979) used shared-nothing architecture but required manual sharding. The 2000s saw the rise of NoSQL (Cassandra, HBase, DynamoDB) which sacrificed ACID for scale-out. Google's Spanner (2012) demonstrated that globally consistent, full-ACID SQL was achievable at planetary scale, reigniting interest in distributed relational databases.

CockroachDB and YugabyteDB were both founded in 2015 as open-source attempts to bring Spanner-like guarantees to the broader engineering community. Aurora (2014) took a different approach: scale out reads while keeping single-writer ACID semantics, optimizing for the common case.

Production Examples

Google Spanner: Powers F1 (Google AdWords backend), Google Cloud Spanner. The original paper used 2000+ servers in production, supporting millions of operations per second.

CockroachDB at Meta (Workplace): Used for distributed coordination tasks requiring global consistency. The Raft-per-range design means a single data center's failure does not require failover — Raft automatically promotes a new leader.

Amazon Aurora: Default database for many AWS-native applications. Aurora Serverless v2 autoscales compute independently of storage. Thousands of reader instances can share the same storage for read-heavy applications.

TiDB at ByteDance, Meituan: TiDB is extensively used in China for large-scale e-commerce workloads. TiFlash's columnar replica enables real-time analytics on operational data without ETL.

Vitess at PlanetScale: PlanetScale is a managed Vitess service. GitHub migrated from sharded MySQL to Vitess for their production workloads.

Debugging Notes

  • Spanner: INFORMATION_SCHEMA.SPANNER_STATISTICS provides query statistics. EXPLAIN shows distributed query plans. google.spanner.v1.Spanner.ExecuteSql spans in Cloud Trace show per-replica latency.
  • CockroachDB: EXPLAIN (DISTSQL) shows the distributed query plan graphically. SHOW JOBS shows background Raft rebalancing. cockroach node status shows per-node health. crdb_internal.ranges shows range distribution.
  • Aurora: aurora_replica_read_consistency parameter controls replica lag tolerance. SHOW REPLICA STATUS (MySQL-compatible) on readers shows replication lag.
  • TiDB: EXPLAIN ANALYZE shows both TiDB-side and TiKV-side execution. ADMIN SHOW SLOW QUERY shows slow query log. pd-ctl region --jq shows region distribution.
  • Vitess: vtctlclient ListAllTablets shows all tablet endpoints. vtctlclient GetShardReplication shows shard membership.

Security Implications

  • Distributed system attack surface: Each node is a network endpoint. mTLS between nodes is essential. Spanner, CockroachDB, and TiDB all support inter-node TLS.
  • Timestamp oracle as single point of failure/attack: PD (TiDB's TSO) and similar centralized timestamp services are high-value targets — compromising them allows arbitrary transaction ordering.
  • Cross-shard XA transactions: Two-phase commit leaves a participant in a "prepared" state if the coordinator crashes. In this state, the participant's data is locked indefinitely. An attacker exploiting coordinator failures can cause cross-shard deadlocks.
  • Data residency (GDPR): Distributed databases that span geographic regions must provide mechanisms to constrain which regions store which data. Spanner's "placement API" and CockroachDB's multi-region tables (REGIONAL BY ROW) provide this.

Performance Implications

  • Cross-shard write latency: Any write touching multiple shards requires 2PC: two network round trips + consensus. At typical datacenter RTT (~500µs), 2PC adds ~1ms. At cross-region RTT (~50ms), 2PC adds ~100ms. Design schemas to minimize cross-shard transactions.
  • Raft follower reads: Reading from a Raft follower (instead of leader) reduces read latency for geographically distributed reads. CockroachDB's "follower reads" (AS OF SYSTEM TIME follower_read_timestamp()) allow slightly stale reads from the nearest replica.
  • Aurora read scale: Aurora reader instances serve reads with at most aurora_replica_read_consistency lag. For most OLAP workloads, eventual consistency within 10-100ms is acceptable and allows massive read scale-out.
  • TiFlash columnar queries: TiDB automatically routes queries involving large aggregations and OLAP patterns to TiFlash rather than TiKV, providing 10-100x speedup for analytical queries.

Failure Modes

  1. Raft leader election storm: If network partitions cause repeated leader elections in many Raft groups simultaneously, write availability degrades. The Raft leader must be in the majority partition; minority partitions are read-only.
  2. Hotspot shards: If the shard key is poorly chosen (e.g., timestamp, sequential ID), all writes go to one shard (the "hot" recent range), causing that shard to become a bottleneck while others are idle. Use hash sharding or UUID keys.
  3. Two-phase commit coordinator failure: If the 2PC coordinator crashes after sending PREPARE but before sending COMMIT, participants are stuck in "prepared" state until a recovery protocol resolves the transaction. All distributed databases have a recovery mechanism, but it introduces latency.
  4. TrueTime clock uncertainty: If GPS or atomic clocks in a Spanner datacenter become unavailable, clock uncertainty increases. Spanner widens the commit wait interval proportionally, increasing write latency. Fallback to NTP (larger uncertainty) degrades performance.

Modern Usage

The distributed database landscape is converging on a few dominant patterns: - Serverless (Aurora Serverless, Neon, CockroachDB Serverless): Automatic compute scaling based on load. Storage is always on; compute scales to zero between queries. - HTAP: Systems that serve both OLTP (row-store) and OLAP (columnar) workloads from the same data. TiDB, SingleStore, SQL Server, Oracle (in-memory) are examples. - Global tables: CockroachDB's multi-region tables allow declaring that certain tables should be readable with low latency from multiple regions, with the DBMS automatically managing data placement and routing.

Future Directions

  • Disaggregated storage as universal substrate: Aurora's model (shared log as the storage layer) is becoming the default for cloud databases. Neon, AlloyDB, and many others have adopted it.
  • Geo-distributed HTAP: Systems that can run analytical queries close to data in any region, with OLTP writes automatically replicated. Spanner's GSQL and TiDB's TiFlash are early examples.
  • Declarative data placement: SQL extensions for specifying data placement constraints (REGIONAL BY ROW, GLOBAL) allow application developers to express geo-compliance requirements in the schema.

Exercises

  1. Set up a 3-node CockroachDB cluster locally (using Docker). Create a table, insert data, kill one node, verify the cluster remains available. Observe how Raft leader election proceeds.
  2. Benchmark cross-shard vs co-located joins in Citus: create two tables, one sharded by customer_id and one reference table. Compare query latency for a join on the sharded key vs a non-sharded join.
  3. Simulate network partition in TiDB using tc netem (Linux traffic control) to add latency between nodes. Observe how cross-region write latency changes.
  4. Read the Aurora paper (Verbitski et al., 2017) and identify: (a) what is sent to storage instead of full pages, (b) how crash recovery works without replaying the redo log on compute, (c) how read replicas share storage.
  5. Design a schema for a global e-commerce application using CockroachDB multi-region tables: identify which tables should be REGIONAL BY ROW, which should be GLOBAL, and which should remain local.

References

  • Corbett, J. C., et al. (2012). Spanner: Google's Globally Distributed Database. OSDI 2012.
  • Verbitski, A., et al. (2017). Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. SIGMOD 2017.
  • Peng, D., & Dabek, F. (2010). Large-Scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010. (Percolator — foundation of TiKV 2PC)
  • Kulkarni, S. S., et al. (2014). Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases. OPODIS 2014. (HLC)
  • CockroachDB design documents: https://github.com/cockroachdb/cockroach/tree/master/docs/design
  • TiDB Architecture documentation: https://docs.pingcap.com/tidb/stable/tidb-architecture
  • Vitess documentation: https://vitess.io/docs/