Section 18: Database Internals
Purpose and Scope
Database engines are among the most algorithmically rich components in systems software. This section dissects the machinery beneath the SQL or API surface: storage engine design (B-tree vs LSM-tree), buffer pool management, write-ahead logging (WAL), multi-version concurrency control (MVCC), query planning and cost-based optimization, index structures, and transaction isolation. It extends to the specialized engines that have proliferated in response to workload diversity: distributed SQL (NewSQL), columnar analytics (OLAP), time-series databases, and vector databases for ML embeddings.
The section treats correctness and performance as inseparable: a storage engine that is fast but violates ACID guarantees is worse than useless in production, while one that is correct but ignores physical I/O patterns will be outperformed by orders of magnitude.
Prerequisites
- Section 11 (Memory Management): buffer pool == a managed memory pool with eviction policy
- Section 12 (Storage Systems): I/O schedulers, fsync semantics, NVMe latency
- Section 13 (Filesystems): journaling, fsync ordering, O_DIRECT
- Section 10 (Synchronization): latching, MVCC relies on atomic operations
- Section 17 (Distributed Systems): 2PC, Raft for distributed databases
Learning Objectives
Upon completing this section you will be able to:
- Explain B+-tree node structure, split/merge operations, and why fan-out determines tree height.
- Describe an LSM-tree: memtable, immutable memtable, SSTables, compaction strategies (leveled, tiered, FIFO).
- Explain write amplification and read amplification trade-offs between B-tree and LSM-tree.
- Describe WAL (write-ahead logging): log record format, ARIES recovery protocol (REDO/UNDO).
- Explain MVCC: how multiple versions enable readers to not block writers; version chain structure.
- Articulate the four standard isolation levels (RC, RR, SI, Serializable) and the anomalies each permits.
- Trace a query through the optimizer: parse → logical plan → physical plan → cost estimation → execution.
- Explain columnar storage: column groups, run-length encoding, dictionary encoding, vectorized execution.
- Describe the write path and read path in a distributed database like CockroachDB (Raft + MVCC).
- Explain what an HNSW index is and why approximate nearest neighbor search is required for vector databases.
Architecture Overview
SQL / API Query
┌──────────────────────────────────────────────────────────────────┐
│ Parser → Binder → Logical Planner → Optimizer → Executor │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Transaction Manager │
│ Lock Manager │ MVCC Version Store │ Isolation Level Enforcer │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Storage Engine │
│ ┌─────────────────────────┐ ┌─────────────────────────────┐ │
│ │ B+-Tree Engine │ │ LSM-Tree Engine │ │
│ │ (InnoDB, PostgreSQL) │ │ (RocksDB, LevelDB, Cassandra)│ │
│ │ page(16KB) + WAL │ │ memtable → SSTable levels │ │
│ └─────────────────────────┘ └─────────────────────────────┘ │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Buffer Pool │
│ LRU/Clock eviction ─ dirty page tracking ─ prefetch │
│ Page latch (shared/exclusive) ─ pin count │
└──────────────────────────────┬───────────────────────────────────┘
│ fsync / O_DIRECT
┌──────────────────────────────▼───────────────────────────────────┐
│ WAL (Write-Ahead Log) │
│ Log record: LSN, prev-LSN, transaction-ID, undo/redo info │
│ ARIES: Analysis → REDO → UNDO phases on crash recovery │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Storage Layer (OS + Block Device) │
└──────────────────────────────────────────────────────────────────┘
MVCC Version Chain:
Tuple → [xmin=10, xmax=15, data=v1] → [xmin=15, xmax=∞, data=v2]
Reader at snapshot 12 sees v1; reader at 16 sees v2; writer does not block reader
Key Concepts
- B+-Tree: Self-balancing tree where all data resides in leaf nodes, linked in sorted order; internal nodes are keys only; fan-out of 100–1000 means 3-4 levels for billion-row tables.
- LSM-Tree (Log-Structured Merge-Tree): All writes go to an in-memory memtable; flushed as immutable SSTables; background compaction merges levels; sequential write amplification traded for read amplification.
- Write Amplification: Total bytes written to storage per byte written by the application; high in LSM-tree during compaction; critical for SSD endurance.
- Buffer Pool: A managed cache of database pages (typically 8–16 KB) in memory; the single most important performance component; hit rate must exceed 99% for good OLTP performance.
- WAL (Write-Ahead Logging): All changes recorded to a sequential log before being applied to data pages; enables atomic crash recovery and replication.
- ARIES (Algorithm for Recovery and Isolation Exploiting Semantics): Standard WAL-based recovery algorithm; 3 phases: Analysis (reconstruct dirty page table), REDO (replay from last checkpoint), UNDO (roll back uncommitted transactions).
- MVCC (Multi-Version Concurrency Control): Each write creates a new version of a tuple; readers access a consistent snapshot without locking writers; old versions garbage-collected by a vacuum/purge process.
- Isolation Levels: Read Uncommitted (dirty reads), Read Committed (no dirty reads), Repeatable Read (no non-repeatable reads), Serializable (no phantoms); PostgreSQL adds Snapshot Isolation (SSI).
- Page Latch: Physical synchronization on a buffer pool page (shared for read, exclusive for write); distinct from logical transaction locks.
- Query Optimizer: Cost-based optimizer selects access paths and join orders; cost model estimates I/O and CPU based on table statistics (cardinality, histograms).
- Vectorized Execution: Processes a batch of tuples per operator invocation; exploits SIMD and avoids per-tuple interpreter overhead; used by DuckDB, Velox, Databricks Photon.
- Columnar Storage: Stores each column contiguously; enables compression (dictionary, RLE, delta), column pruning, and SIMD scan; Parquet, ORC, Feather are columnar file formats.
- NewSQL: Distributed relational databases providing ACID transactions at scale; typically Raft replication + MVCC + distributed query execution (CockroachDB, TiDB, YugabyteDB, Google Spanner).
- Time-Series Database: Optimized for high-ingest, time-ordered append workloads; compression exploits monotonic timestamps (delta-of-delta); InfluxDB, TimescaleDB, VictoriaMetrics.
- Vector Database: Stores high-dimensional embeddings; approximate nearest neighbor (ANN) index (HNSW, IVF) enables similarity search; Pinecone, Weaviate, pgvector.
- HNSW (Hierarchical Navigable Small World): Graph-based ANN index; layered graph provides O(log N) search at high recall; memory-intensive but fast.
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1970 | Codd publishes relational model (IBM Research) |
| 1976 | IBM System R: first SQL prototype; shadow paging for recovery |
| 1979 | Oracle 2 — first commercial relational database |
| 1981 | IBM DB2 development begins |
| 1986 | SQL standardized (SQL-86) |
| 1987 | PostgreSQL (Postgres) development begins at Berkeley |
| 1992 | MySQL development begins (released 1995) |
| 1992 | Ramakrishnan & Gehrke: textbook on database systems internals |
| 1992 | ARIES recovery algorithm published (Mohan et al., IBM) |
| 1996 | LSM-Tree paper (O'Neil et al.) |
| 2000 | InnoDB becomes MySQL default storage engine |
| 2005 | Bigtable (Google) — precursor to wide-column NoSQL era |
| 2006 | Facebook develops Cassandra (open-sourced 2008) |
| 2007 | Amazon Dynamo paper — key-value store at scale |
| 2010 | LevelDB released (Jeff Dean, Sanjay Ghemawat) |
| 2012 | RocksDB forked from LevelDB at Facebook; LSM-tree optimized for SSD |
| 2012 | Google Spanner: globally distributed strongly consistent database |
| 2015 | CockroachDB founded; Raft + MVCC distributed SQL |
| 2016 | ClickHouse open-sourced by Yandex; OLAP columnar engine |
| 2019 | DuckDB released; in-process analytical database, vectorized |
| 2021 | Apache Arrow/Velox vectorized execution engines |
| 2023 | Vector databases (Pinecone, Weaviate) achieve production scale for LLM embeddings |
Modern Relevance and Production Use Cases
OLTP systems (banking, e-commerce) run on InnoDB or PostgreSQL; buffer pool tuning, WAL durability settings (fsync, synchronous_commit), and MVCC vacuum configuration are primary DBA concerns.
RocksDB underlies more production systems than any other storage engine: Meta's social graph (MyRocks), Kafka's log storage (optionally), TiKV (TiDB's storage), CockroachDB's Pebble, MongoDB's WiredTiger shares conceptual roots. Understanding compaction behavior explains write stalls in high-ingest systems.
Analytical workloads have shifted from Hadoop/Hive (batch) to ClickHouse/DuckDB/Snowflake (interactive); vectorized execution with columnar formats achieves 10–100x speedup over row-oriented engines for aggregation queries.
LLM-backed applications use vector databases for RAG (Retrieval-Augmented Generation); understanding HNSW trade-offs (recall vs memory vs query speed) is now a production concern for any ML platform team.
NewSQL databases enable global ACID transactions; CockroachDB running geo-distributed Raft groups demonstrates that the P in CAP is manageable when latency is acceptable.
File Map
| File | Description |
|---|---|
01-storage-engines-overview.md |
B-tree vs LSM-tree design space, use case mapping |
02-btree-internals.md |
Node structure, split/merge, B+-tree leaf chain, concurrency |
03-lsm-tree-internals.md |
Memtable, SSTable format, bloom filters, compaction |
04-compaction-strategies.md |
Leveled, size-tiered, FIFO, universal compaction |
05-mvcc.md |
Version chain, snapshot acquisition, vacuum/purge, visibility rules |
06-wal-and-recovery.md |
Log record format, LSN, checkpointing, ARIES phases |
07-buffer-pool-management.md |
LRU/Clock, dirty page eviction, prefetch, page pinning |
08-query-planning.md |
Parser, binder, logical plan, join reordering, pushdown |
09-query-optimization.md |
Cost model, statistics, cardinality estimation, plan cache |
10-index-structures.md |
B-tree index, hash index, GiST, GIN, bloom filter index |
11-transaction-isolation.md |
Isolation levels, anomalies (dirty/phantom/non-repeatable read) |
12-locking-vs-mvcc.md |
2PL vs MVCC, predicate locks, SSI (serializable snapshot isolation) |
13-distributed-databases.md |
Raft replication, distributed transactions, geo-partitioning |
14-oltp-vs-olap.md |
Row vs column storage, mixed HTAP workloads |
15-columnar-storage.md |
Parquet/ORC, dictionary/RLE/delta encoding, zone maps |
16-vectorized-execution.md |
Volcano vs vectorized model, SIMD, DuckDB/Velox architecture |
17-vector-databases.md |
Embedding storage, HNSW, IVF-PQ, pgvector, ANN benchmarks |
18-time-series-databases.md |
TSM engine, delta-of-delta, downsampling, InfluxDB/TimescaleDB |
19-newsql-architecture.md |
CockroachDB/TiDB internals, Raft groups, distributed MVCC |
Cross-References
- Section 10 (Synchronization): page latches, lock manager, MVCC atomic operations
- Section 11 (Memory Management): buffer pool as managed memory; huge pages for buffer pool
- Section 12 (Storage Systems): fsync semantics, O_DIRECT, NVMe latency for WAL writes
- Section 13 (Filesystems): WAL on ext4/XFS, O_DIRECT bypass, filesystem journaling interaction
- Section 17 (Distributed Systems): Raft replication, 2PC for distributed transactions, quorum
- Section 19 (Virtualization): database VMs, storage I/O path, live migration impact on transactions