Skip to content

Section 18: Database Internals

Purpose and Scope

Database engines are among the most algorithmically rich components in systems software. This section dissects the machinery beneath the SQL or API surface: storage engine design (B-tree vs LSM-tree), buffer pool management, write-ahead logging (WAL), multi-version concurrency control (MVCC), query planning and cost-based optimization, index structures, and transaction isolation. It extends to the specialized engines that have proliferated in response to workload diversity: distributed SQL (NewSQL), columnar analytics (OLAP), time-series databases, and vector databases for ML embeddings.

The section treats correctness and performance as inseparable: a storage engine that is fast but violates ACID guarantees is worse than useless in production, while one that is correct but ignores physical I/O patterns will be outperformed by orders of magnitude.


Prerequisites

  • Section 11 (Memory Management): buffer pool == a managed memory pool with eviction policy
  • Section 12 (Storage Systems): I/O schedulers, fsync semantics, NVMe latency
  • Section 13 (Filesystems): journaling, fsync ordering, O_DIRECT
  • Section 10 (Synchronization): latching, MVCC relies on atomic operations
  • Section 17 (Distributed Systems): 2PC, Raft for distributed databases

Learning Objectives

Upon completing this section you will be able to:

  1. Explain B+-tree node structure, split/merge operations, and why fan-out determines tree height.
  2. Describe an LSM-tree: memtable, immutable memtable, SSTables, compaction strategies (leveled, tiered, FIFO).
  3. Explain write amplification and read amplification trade-offs between B-tree and LSM-tree.
  4. Describe WAL (write-ahead logging): log record format, ARIES recovery protocol (REDO/UNDO).
  5. Explain MVCC: how multiple versions enable readers to not block writers; version chain structure.
  6. Articulate the four standard isolation levels (RC, RR, SI, Serializable) and the anomalies each permits.
  7. Trace a query through the optimizer: parse → logical plan → physical plan → cost estimation → execution.
  8. Explain columnar storage: column groups, run-length encoding, dictionary encoding, vectorized execution.
  9. Describe the write path and read path in a distributed database like CockroachDB (Raft + MVCC).
  10. Explain what an HNSW index is and why approximate nearest neighbor search is required for vector databases.

Architecture Overview

  SQL / API Query
  ┌──────────────────────────────────────────────────────────────────┐
  │   Parser → Binder → Logical Planner → Optimizer → Executor      │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Transaction Manager                                  │
  │   Lock Manager │  MVCC Version Store │  Isolation Level Enforcer │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Storage Engine                                       │
  │  ┌─────────────────────────┐  ┌─────────────────────────────┐   │
  │  │   B+-Tree Engine        │  │   LSM-Tree Engine           │   │
  │  │   (InnoDB, PostgreSQL)  │  │   (RocksDB, LevelDB, Cassandra)│ │
  │  │   page(16KB) + WAL      │  │   memtable → SSTable levels │   │
  │  └─────────────────────────┘  └─────────────────────────────┘   │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Buffer Pool                                          │
  │   LRU/Clock eviction ─ dirty page tracking ─ prefetch           │
  │   Page latch (shared/exclusive) ─ pin count                      │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │  fsync / O_DIRECT
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              WAL (Write-Ahead Log)                                │
  │   Log record: LSN, prev-LSN, transaction-ID, undo/redo info      │
  │   ARIES: Analysis → REDO → UNDO phases on crash recovery         │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Storage Layer (OS + Block Device)                   │
  └──────────────────────────────────────────────────────────────────┘

  MVCC Version Chain:
  Tuple → [xmin=10, xmax=15, data=v1] → [xmin=15, xmax=∞, data=v2]
  Reader at snapshot 12 sees v1; reader at 16 sees v2; writer does not block reader

Key Concepts

  • B+-Tree: Self-balancing tree where all data resides in leaf nodes, linked in sorted order; internal nodes are keys only; fan-out of 100–1000 means 3-4 levels for billion-row tables.
  • LSM-Tree (Log-Structured Merge-Tree): All writes go to an in-memory memtable; flushed as immutable SSTables; background compaction merges levels; sequential write amplification traded for read amplification.
  • Write Amplification: Total bytes written to storage per byte written by the application; high in LSM-tree during compaction; critical for SSD endurance.
  • Buffer Pool: A managed cache of database pages (typically 8–16 KB) in memory; the single most important performance component; hit rate must exceed 99% for good OLTP performance.
  • WAL (Write-Ahead Logging): All changes recorded to a sequential log before being applied to data pages; enables atomic crash recovery and replication.
  • ARIES (Algorithm for Recovery and Isolation Exploiting Semantics): Standard WAL-based recovery algorithm; 3 phases: Analysis (reconstruct dirty page table), REDO (replay from last checkpoint), UNDO (roll back uncommitted transactions).
  • MVCC (Multi-Version Concurrency Control): Each write creates a new version of a tuple; readers access a consistent snapshot without locking writers; old versions garbage-collected by a vacuum/purge process.
  • Isolation Levels: Read Uncommitted (dirty reads), Read Committed (no dirty reads), Repeatable Read (no non-repeatable reads), Serializable (no phantoms); PostgreSQL adds Snapshot Isolation (SSI).
  • Page Latch: Physical synchronization on a buffer pool page (shared for read, exclusive for write); distinct from logical transaction locks.
  • Query Optimizer: Cost-based optimizer selects access paths and join orders; cost model estimates I/O and CPU based on table statistics (cardinality, histograms).
  • Vectorized Execution: Processes a batch of tuples per operator invocation; exploits SIMD and avoids per-tuple interpreter overhead; used by DuckDB, Velox, Databricks Photon.
  • Columnar Storage: Stores each column contiguously; enables compression (dictionary, RLE, delta), column pruning, and SIMD scan; Parquet, ORC, Feather are columnar file formats.
  • NewSQL: Distributed relational databases providing ACID transactions at scale; typically Raft replication + MVCC + distributed query execution (CockroachDB, TiDB, YugabyteDB, Google Spanner).
  • Time-Series Database: Optimized for high-ingest, time-ordered append workloads; compression exploits monotonic timestamps (delta-of-delta); InfluxDB, TimescaleDB, VictoriaMetrics.
  • Vector Database: Stores high-dimensional embeddings; approximate nearest neighbor (ANN) index (HNSW, IVF) enables similarity search; Pinecone, Weaviate, pgvector.
  • HNSW (Hierarchical Navigable Small World): Graph-based ANN index; layered graph provides O(log N) search at high recall; memory-intensive but fast.

Major Historical Milestones

Year Milestone
1970 Codd publishes relational model (IBM Research)
1976 IBM System R: first SQL prototype; shadow paging for recovery
1979 Oracle 2 — first commercial relational database
1981 IBM DB2 development begins
1986 SQL standardized (SQL-86)
1987 PostgreSQL (Postgres) development begins at Berkeley
1992 MySQL development begins (released 1995)
1992 Ramakrishnan & Gehrke: textbook on database systems internals
1992 ARIES recovery algorithm published (Mohan et al., IBM)
1996 LSM-Tree paper (O'Neil et al.)
2000 InnoDB becomes MySQL default storage engine
2005 Bigtable (Google) — precursor to wide-column NoSQL era
2006 Facebook develops Cassandra (open-sourced 2008)
2007 Amazon Dynamo paper — key-value store at scale
2010 LevelDB released (Jeff Dean, Sanjay Ghemawat)
2012 RocksDB forked from LevelDB at Facebook; LSM-tree optimized for SSD
2012 Google Spanner: globally distributed strongly consistent database
2015 CockroachDB founded; Raft + MVCC distributed SQL
2016 ClickHouse open-sourced by Yandex; OLAP columnar engine
2019 DuckDB released; in-process analytical database, vectorized
2021 Apache Arrow/Velox vectorized execution engines
2023 Vector databases (Pinecone, Weaviate) achieve production scale for LLM embeddings

Modern Relevance and Production Use Cases

OLTP systems (banking, e-commerce) run on InnoDB or PostgreSQL; buffer pool tuning, WAL durability settings (fsync, synchronous_commit), and MVCC vacuum configuration are primary DBA concerns.

RocksDB underlies more production systems than any other storage engine: Meta's social graph (MyRocks), Kafka's log storage (optionally), TiKV (TiDB's storage), CockroachDB's Pebble, MongoDB's WiredTiger shares conceptual roots. Understanding compaction behavior explains write stalls in high-ingest systems.

Analytical workloads have shifted from Hadoop/Hive (batch) to ClickHouse/DuckDB/Snowflake (interactive); vectorized execution with columnar formats achieves 10–100x speedup over row-oriented engines for aggregation queries.

LLM-backed applications use vector databases for RAG (Retrieval-Augmented Generation); understanding HNSW trade-offs (recall vs memory vs query speed) is now a production concern for any ML platform team.

NewSQL databases enable global ACID transactions; CockroachDB running geo-distributed Raft groups demonstrates that the P in CAP is manageable when latency is acceptable.


File Map

File Description
01-storage-engines-overview.md B-tree vs LSM-tree design space, use case mapping
02-btree-internals.md Node structure, split/merge, B+-tree leaf chain, concurrency
03-lsm-tree-internals.md Memtable, SSTable format, bloom filters, compaction
04-compaction-strategies.md Leveled, size-tiered, FIFO, universal compaction
05-mvcc.md Version chain, snapshot acquisition, vacuum/purge, visibility rules
06-wal-and-recovery.md Log record format, LSN, checkpointing, ARIES phases
07-buffer-pool-management.md LRU/Clock, dirty page eviction, prefetch, page pinning
08-query-planning.md Parser, binder, logical plan, join reordering, pushdown
09-query-optimization.md Cost model, statistics, cardinality estimation, plan cache
10-index-structures.md B-tree index, hash index, GiST, GIN, bloom filter index
11-transaction-isolation.md Isolation levels, anomalies (dirty/phantom/non-repeatable read)
12-locking-vs-mvcc.md 2PL vs MVCC, predicate locks, SSI (serializable snapshot isolation)
13-distributed-databases.md Raft replication, distributed transactions, geo-partitioning
14-oltp-vs-olap.md Row vs column storage, mixed HTAP workloads
15-columnar-storage.md Parquet/ORC, dictionary/RLE/delta encoding, zone maps
16-vectorized-execution.md Volcano vs vectorized model, SIMD, DuckDB/Velox architecture
17-vector-databases.md Embedding storage, HNSW, IVF-PQ, pgvector, ANN benchmarks
18-time-series-databases.md TSM engine, delta-of-delta, downsampling, InfluxDB/TimescaleDB
19-newsql-architecture.md CockroachDB/TiDB internals, Raft groups, distributed MVCC

Cross-References

  • Section 10 (Synchronization): page latches, lock manager, MVCC atomic operations
  • Section 11 (Memory Management): buffer pool as managed memory; huge pages for buffer pool
  • Section 12 (Storage Systems): fsync semantics, O_DIRECT, NVMe latency for WAL writes
  • Section 13 (Filesystems): WAL on ext4/XFS, O_DIRECT bypass, filesystem journaling interaction
  • Section 17 (Distributed Systems): Raft replication, 2PC for distributed transactions, quorum
  • Section 19 (Virtualization): database VMs, storage I/O path, live migration impact on transactions