Section 18: Database Internals

Purpose and Scope

Database engines are among the most algorithmically rich components in systems software. This section dissects the machinery beneath the SQL or API surface: storage engine design (B-tree vs LSM-tree), buffer pool management, write-ahead logging (WAL), multi-version concurrency control (MVCC), query planning and cost-based optimization, index structures, and transaction isolation. It extends to the specialized engines that have proliferated in response to workload diversity: distributed SQL (NewSQL), columnar analytics (OLAP), time-series databases, and vector databases for ML embeddings.

The section treats correctness and performance as inseparable: a storage engine that is fast but violates ACID guarantees is worse than useless in production, while one that is correct but ignores physical I/O patterns will be outperformed by orders of magnitude.

Prerequisites

Section 11 (Memory Management): buffer pool == a managed memory pool with eviction policy
Section 12 (Storage Systems): I/O schedulers, fsync semantics, NVMe latency
Section 13 (Filesystems): journaling, fsync ordering, O_DIRECT
Section 10 (Synchronization): latching, MVCC relies on atomic operations
Section 17 (Distributed Systems): 2PC, Raft for distributed databases

Learning Objectives

Upon completing this section you will be able to:

Explain B+-tree node structure, split/merge operations, and why fan-out determines tree height.
Describe an LSM-tree: memtable, immutable memtable, SSTables, compaction strategies (leveled, tiered, FIFO).
Explain write amplification and read amplification trade-offs between B-tree and LSM-tree.
Describe WAL (write-ahead logging): log record format, ARIES recovery protocol (REDO/UNDO).
Explain MVCC: how multiple versions enable readers to not block writers; version chain structure.
Articulate the four standard isolation levels (RC, RR, SI, Serializable) and the anomalies each permits.
Trace a query through the optimizer: parse → logical plan → physical plan → cost estimation → execution.
Explain columnar storage: column groups, run-length encoding, dictionary encoding, vectorized execution.
Describe the write path and read path in a distributed database like CockroachDB (Raft + MVCC).
Explain what an HNSW index is and why approximate nearest neighbor search is required for vector databases.

Architecture Overview

  SQL / API Query
  ┌──────────────────────────────────────────────────────────────────┐
  │   Parser → Binder → Logical Planner → Optimizer → Executor      │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Transaction Manager                                  │
  │   Lock Manager │  MVCC Version Store │  Isolation Level Enforcer │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Storage Engine                                       │
  │  ┌─────────────────────────┐  ┌─────────────────────────────┐   │
  │  │   B+-Tree Engine        │  │   LSM-Tree Engine           │   │
  │  │   (InnoDB, PostgreSQL)  │  │   (RocksDB, LevelDB, Cassandra)│ │
  │  │   page(16KB) + WAL      │  │   memtable → SSTable levels │   │
  │  └─────────────────────────┘  └─────────────────────────────┘   │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Buffer Pool                                          │
  │   LRU/Clock eviction ─ dirty page tracking ─ prefetch           │
  │   Page latch (shared/exclusive) ─ pin count                      │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │  fsync / O_DIRECT
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              WAL (Write-Ahead Log)                                │
  │   Log record: LSN, prev-LSN, transaction-ID, undo/redo info      │
  │   ARIES: Analysis → REDO → UNDO phases on crash recovery         │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │              Storage Layer (OS + Block Device)                   │
  └──────────────────────────────────────────────────────────────────┘

  MVCC Version Chain:
  Tuple → [xmin=10, xmax=15, data=v1] → [xmin=15, xmax=∞, data=v2]
  Reader at snapshot 12 sees v1; reader at 16 sees v2; writer does not block reader

Key Concepts

B+-Tree: Self-balancing tree where all data resides in leaf nodes, linked in sorted order; internal nodes are keys only; fan-out of 100–1000 means 3-4 levels for billion-row tables.
LSM-Tree (Log-Structured Merge-Tree): All writes go to an in-memory memtable; flushed as immutable SSTables; background compaction merges levels; sequential write amplification traded for read amplification.
Write Amplification: Total bytes written to storage per byte written by the application; high in LSM-tree during compaction; critical for SSD endurance.
Buffer Pool: A managed cache of database pages (typically 8–16 KB) in memory; the single most important performance component; hit rate must exceed 99% for good OLTP performance.
WAL (Write-Ahead Logging): All changes recorded to a sequential log before being applied to data pages; enables atomic crash recovery and replication.
ARIES (Algorithm for Recovery and Isolation Exploiting Semantics): Standard WAL-based recovery algorithm; 3 phases: Analysis (reconstruct dirty page table), REDO (replay from last checkpoint), UNDO (roll back uncommitted transactions).
MVCC (Multi-Version Concurrency Control): Each write creates a new version of a tuple; readers access a consistent snapshot without locking writers; old versions garbage-collected by a vacuum/purge process.
Isolation Levels: Read Uncommitted (dirty reads), Read Committed (no dirty reads), Repeatable Read (no non-repeatable reads), Serializable (no phantoms); PostgreSQL adds Snapshot Isolation (SSI).
Page Latch: Physical synchronization on a buffer pool page (shared for read, exclusive for write); distinct from logical transaction locks.
Query Optimizer: Cost-based optimizer selects access paths and join orders; cost model estimates I/O and CPU based on table statistics (cardinality, histograms).
Vectorized Execution: Processes a batch of tuples per operator invocation; exploits SIMD and avoids per-tuple interpreter overhead; used by DuckDB, Velox, Databricks Photon.
Columnar Storage: Stores each column contiguously; enables compression (dictionary, RLE, delta), column pruning, and SIMD scan; Parquet, ORC, Feather are columnar file formats.
NewSQL: Distributed relational databases providing ACID transactions at scale; typically Raft replication + MVCC + distributed query execution (CockroachDB, TiDB, YugabyteDB, Google Spanner).
Time-Series Database: Optimized for high-ingest, time-ordered append workloads; compression exploits monotonic timestamps (delta-of-delta); InfluxDB, TimescaleDB, VictoriaMetrics.
Vector Database: Stores high-dimensional embeddings; approximate nearest neighbor (ANN) index (HNSW, IVF) enables similarity search; Pinecone, Weaviate, pgvector.
HNSW (Hierarchical Navigable Small World): Graph-based ANN index; layered graph provides O(log N) search at high recall; memory-intensive but fast.

Major Historical Milestones

Year	Milestone
1970	Codd publishes relational model (IBM Research)
1976	IBM System R: first SQL prototype; shadow paging for recovery
1979	Oracle 2 — first commercial relational database
1981	IBM DB2 development begins
1986	SQL standardized (SQL-86)
1987	PostgreSQL (Postgres) development begins at Berkeley
1992	MySQL development begins (released 1995)
1992	Ramakrishnan & Gehrke: textbook on database systems internals
1992	ARIES recovery algorithm published (Mohan et al., IBM)
1996	LSM-Tree paper (O'Neil et al.)
2000	InnoDB becomes MySQL default storage engine
2005	Bigtable (Google) — precursor to wide-column NoSQL era
2006	Facebook develops Cassandra (open-sourced 2008)
2007	Amazon Dynamo paper — key-value store at scale
2010	LevelDB released (Jeff Dean, Sanjay Ghemawat)
2012	RocksDB forked from LevelDB at Facebook; LSM-tree optimized for SSD
2012	Google Spanner: globally distributed strongly consistent database
2015	CockroachDB founded; Raft + MVCC distributed SQL
2016	ClickHouse open-sourced by Yandex; OLAP columnar engine
2019	DuckDB released; in-process analytical database, vectorized
2021	Apache Arrow/Velox vectorized execution engines
2023	Vector databases (Pinecone, Weaviate) achieve production scale for LLM embeddings

Modern Relevance and Production Use Cases

OLTP systems (banking, e-commerce) run on InnoDB or PostgreSQL; buffer pool tuning, WAL durability settings (fsync, synchronous_commit), and MVCC vacuum configuration are primary DBA concerns.

RocksDB underlies more production systems than any other storage engine: Meta's social graph (MyRocks), Kafka's log storage (optionally), TiKV (TiDB's storage), CockroachDB's Pebble, MongoDB's WiredTiger shares conceptual roots. Understanding compaction behavior explains write stalls in high-ingest systems.

Analytical workloads have shifted from Hadoop/Hive (batch) to ClickHouse/DuckDB/Snowflake (interactive); vectorized execution with columnar formats achieves 10–100x speedup over row-oriented engines for aggregation queries.

LLM-backed applications use vector databases for RAG (Retrieval-Augmented Generation); understanding HNSW trade-offs (recall vs memory vs query speed) is now a production concern for any ML platform team.

NewSQL databases enable global ACID transactions; CockroachDB running geo-distributed Raft groups demonstrates that the P in CAP is manageable when latency is acceptable.

File Map

File	Description
`01-storage-engines-overview.md`	B-tree vs LSM-tree design space, use case mapping
`02-btree-internals.md`	Node structure, split/merge, B+-tree leaf chain, concurrency
`03-lsm-tree-internals.md`	Memtable, SSTable format, bloom filters, compaction
`04-compaction-strategies.md`	Leveled, size-tiered, FIFO, universal compaction
`05-mvcc.md`	Version chain, snapshot acquisition, vacuum/purge, visibility rules
`06-wal-and-recovery.md`	Log record format, LSN, checkpointing, ARIES phases
`07-buffer-pool-management.md`	LRU/Clock, dirty page eviction, prefetch, page pinning
`08-query-planning.md`	Parser, binder, logical plan, join reordering, pushdown
`09-query-optimization.md`	Cost model, statistics, cardinality estimation, plan cache
`10-index-structures.md`	B-tree index, hash index, GiST, GIN, bloom filter index
`11-transaction-isolation.md`	Isolation levels, anomalies (dirty/phantom/non-repeatable read)
`12-locking-vs-mvcc.md`	2PL vs MVCC, predicate locks, SSI (serializable snapshot isolation)
`13-distributed-databases.md`	Raft replication, distributed transactions, geo-partitioning
`14-oltp-vs-olap.md`	Row vs column storage, mixed HTAP workloads
`15-columnar-storage.md`	Parquet/ORC, dictionary/RLE/delta encoding, zone maps
`16-vectorized-execution.md`	Volcano vs vectorized model, SIMD, DuckDB/Velox architecture
`17-vector-databases.md`	Embedding storage, HNSW, IVF-PQ, pgvector, ANN benchmarks
`18-time-series-databases.md`	TSM engine, delta-of-delta, downsampling, InfluxDB/TimescaleDB
`19-newsql-architecture.md`	CockroachDB/TiDB internals, Raft groups, distributed MVCC

Cross-References

Section 10 (Synchronization): page latches, lock manager, MVCC atomic operations
Section 11 (Memory Management): buffer pool as managed memory; huge pages for buffer pool
Section 12 (Storage Systems): fsync semantics, O_DIRECT, NVMe latency for WAL writes
Section 13 (Filesystems): WAL on ext4/XFS, O_DIRECT bypass, filesystem journaling interaction
Section 17 (Distributed Systems): Raft replication, 2PC for distributed transactions, quorum
Section 19 (Virtualization): database VMs, storage I/O path, live migration impact on transactions