01 — Storage Engines: Architecture and Design Space

Technical Overview

A storage engine is the subsystem of a database management system responsible for physically storing, retrieving, and managing data on durable media. It sits below the query engine and exposes an API for CRUD operations, transaction support, crash recovery, and space management. The separation of the storage engine from the query engine is one of the most consequential architectural decisions in DBMS design — it enables the same SQL frontend to operate over radically different physical representations.

The canonical interface a storage engine must expose includes: get(key), put(key, value), delete(key), scan(start, end), begin_txn(), commit(), abort(), plus callbacks for crash recovery. In MySQL's pluggable architecture this interface is formalized as the handlerton struct (handler + singleton), which defines function pointers for every storage operation. In SQLite this boundary is less explicit but still present in the Btree and Pager layers.

Prerequisites

Familiarity with file I/O syscalls (read, write, fsync, mmap)
Basic knowledge of B-trees and hash tables
Understanding of ACID transaction properties
Operating system concepts: page cache, direct I/O, virtual memory

Core Content

The Storage Engine vs. Query Engine Split

 SQL Query
     |
     v
+------------------+
| Parser           |
| Analyzer         |
| Rewriter         |
| Planner          |  <- Query Engine (SQL layer)
| Executor         |
+--------+---------+
         |
         | Storage Engine API
         |  get/put/scan/txn
         v
+------------------+
| Storage Engine   |
|  - Buffer Pool   |
|  - Access Methods|
|  - Concurrency   |  <- Storage Engine Layer
|  - Recovery      |
|  - Space Mgmt    |
+--------+---------+
         |
         v
   Disk / NVM / RAM

MySQL is the most prominent example of this split. The query layer (parsing, optimization, replication protocol) is independent of the storage layer. You can ALTER TABLE t ENGINE=InnoDB; or ALTER TABLE t ENGINE=RocksDB; and the SQL semantics remain identical. The tradeoffs shift entirely in the storage tier.

PostgreSQL, by contrast, uses a monolithic architecture where the storage engine is tightly coupled to the executor. The table access method API introduced in PostgreSQL 12 (CREATE ACCESS METHOD) is a step toward pluggability, but it remains far less decoupled than MySQL's model. Zheap (an alternative heap storage with in-place undo) was prototyped using this API.

Pluggable Storage Engines in MySQL

MySQL's handlerton (defined in sql/handler.h) requires implementations of roughly 80 virtual methods, though many have default no-ops. Key methods:

struct handlerton {
  slot_t slot;
  uint savepoint_offset;
  int (*close_connection)(handlerton *hton, THD *thd);
  int (*savepoint_set)(handlerton *hton, THD *thd, void *sv);
  int (*savepoint_rollback)(handlerton *hton, THD *thd, void *sv);
  int (*commit)(handlerton *hton, THD *thd, bool all);
  int (*rollback)(handlerton *hton, THD *thd, bool all);
  int (*prepare)(handlerton *hton, THD *thd, bool all);
  int (*recover)(handlerton *hton, XID *xid_list, uint len);
  ...
};

InnoDB: The default and most production-hardened MySQL engine. B+ tree primary key clustered index. MVCC via undo logs. Doublewrite buffer for torn-write protection. Adaptive hash index (AHI) as an automatic cache of recent B-tree lookups. Buffer pool with LRU-K eviction. ARIES-based recovery with redo/undo logging. Full support for foreign keys, row-level locking, and all isolation levels.

MyISAM: The original MySQL storage engine. No transaction support, no crash recovery, no row-level locking (table locks only). Separate .MYD (data) and .MYI (index) files per table. The lack of MVCC means reads acquire shared locks. Extremely fast for read-only workloads (no lock overhead, no txn state). Used by MySQL internal catalog tables until 8.0 when they migrated to InnoDB.

RocksDB (MyRocks): LSM-tree based storage, developed by Meta (Facebook) and integrated into MySQL as MyRocks. Dramatically better write amplification than InnoDB for write-heavy workloads. Facebook reported 2x storage compression improvement over InnoDB for their UDB (user database) workload. Weak on random reads due to LSM's multi-level read path.

TokuDB (now largely deprecated): Fractal tree index-based storage. Superior write amplification to B-trees, better compression than both InnoDB and MyISAM. Acquired by Percona but abandoned after MariaDB licensing disputes.

Read-Optimized vs Write-Optimized Storage

The fundamental tension in storage engine design:

Write-Optimized                        Read-Optimized
(LSM Trees)                            (B+ Trees)

Writes land in memory buffer           Writes go directly to
(MemTable), then flushed               the target B-tree page
sequentially                           (random I/O)

Read requires checking                 Read follows single
multiple levels (expensive)            tree path (fast)

Background compaction amortizes        No background work needed
write cost, causes read                for reads; merges happen
amplification spikes                   in-place during writes

Space amplification: multiple          Minimal space overhead
versions exist simultaneously          (one copy of data)

The RUM Conjecture (Idreos et al., 2016) formalizes this: you cannot simultaneously minimize Read overhead, Update overhead, and Memory/space overhead. Any storage structure optimizes two at the expense of the third.

In-Memory vs Disk-Based Storage

Disk-based engines (InnoDB, RocksDB) assume the working set exceeds RAM. They manage a buffer pool that caches disk pages. The page size (InnoDB default: 16KB, configurable 4KB–64KB) governs I/O granularity and tree fan-out.

In-memory engines (VoltDB, H-Store, Redis Modules, MySQL NDB Cluster with in-memory tables) assume the entire dataset fits in RAM. They can use pointer-based data structures (skip lists, Masstree, ART — Adaptive Radix Tree) instead of page-oriented trees. Lock-free structures become feasible. NVM (Optane DC PMem) blurs this line further — pmdk libraries expose persistent memory with DRAM-like latency.

SQLite is notable: it uses a Btree over a pager abstraction, where the pager manages a page cache in memory. The default page size is 4096 bytes. SQLite optionally uses mmap for reads (PRAGMA mmap_size) bypassing its own pager cache for sequential reads.

Column Stores vs Row Stores

Row Store (OLTP):

Row 1: [id=1][name="Alice"][age=30][salary=90000]
Row 2: [id=2][name="Bob"]  [age=25][salary=75000]
Row 3: [id=3][name="Carol"][age=35][salary=95000]

Column Store (OLAP):

id:     [1][2][3]
name:   ["Alice"]["Bob"]["Carol"]
age:    [30][25][35]
salary: [90000][75000][95000]

Row stores optimize for whole-row access (INSERT, UPDATE, point queries). Column stores optimize for aggregate queries over few columns in large tables — the I/O is proportional to columns accessed, not rows. Column stores also achieve far better compression: homogeneous data in a column compresses well with delta encoding, run-length encoding, or dictionary encoding.

PostgreSQL is a row store. ClickHouse is a column store. DuckDB is an in-process column store. Hybrid HTAP systems (TiDB, SQL Server columnstore indexes, Oracle In-Memory) maintain both layouts.

WAL-Based vs Shadow Paging

Two approaches to ensuring atomicity and durability:

WAL (Write-Ahead Logging): Log changes before applying them to data pages. On crash, replay the log. Used by PostgreSQL, InnoDB, SQLite. Requires sequential log writes (fast) + random data page writes (potentially deferred). The key invariant: log record for page X must be on durable storage before page X is evicted from buffer pool to disk.

Shadow Paging: Maintain two versions of pages — the current (shadow) and the new (modified). On commit, atomically flip the root pointer to point to the new version. No log needed. Used by CouchDB, LMDB, SQLite WAL mode (inverted — the "WAL" in SQLite WAL mode is actually a shadow page file for the reader snapshot). Shadow paging has poor locality (random writes for every modified page) and does not support fine-grained recovery. LMDB uses this approach and achieves extremely fast reads at the cost of writes being serialized (single-writer model).

Storage Engine Selection Criteria

Criterion	InnoDB	MyRocks	MyISAM	In-Memory
Write amplification	High	Low	Medium	None
Read performance	Excellent	Good (bloom)	Excellent	Excellent
ACID transactions	Full	Full	None	Varies
Compression	Page-level	Block-level	None	N/A
Range scans	Excellent	Good	Good	Good
Space efficiency	Moderate	High	Low	RAM-limited
Crash recovery	Fast	Fast	No	RAM volatile

Historical Context

The storage engine concept in its pluggable form was pioneered by MySQL AB around 1999-2001. Brian Aker and Monty Widenius designed the handler API to allow commercial storage engine vendors (Innobase Oy for InnoDB, Sleepycat for BerkeleyDB) to ship engines independently. Oracle acquired both Innobase (2005) and Sun/MySQL (2010), collapsing the commercial ecosystem.

SQLite (D. Richard Hipp, 2000) took a different approach: a self-contained, serverless storage engine targeting embedded use. Its pager/btree separation remains one of the cleanest storage engine designs in open-source.

MongoDB switched its storage engine from its custom MMAP-based engine to WiredTiger (acquired in 2014) in MongoDB 3.0 (2015). WiredTiger uses a B-tree with MVCC and supports both row and column storage. The switch was motivated by WiredTiger's document-level locking vs the MMAP engine's collection-level locking.

MariaDB's Aria engine was created as a crash-safe replacement for MyISAM, used internally for system tables. It supports both MVCC and non-transactional modes.

Production Examples

MongoDB WiredTiger (src/third_party/wiredtiger): Uses a B-tree with internal MVCC. Data files use .wt extension. The WT_SESSION and WT_CURSOR API maps cleanly onto the storage engine interface abstraction. WiredTiger supports both row-store and column-store engines, and uses a block-based allocator with a B-tree for free space management.

MariaDB Aria: Crash-safe replacement for MyISAM. Uses .MAI (index) and .MAD (data) files. Supports both transactional and non-transactional operation. Used by MariaDB for internal temporary tables and system tables.

SQLite: The btree.c and pager.c modules implement the storage layer. Pager handles the page cache, journaling, and WAL. Btree implements B+ trees on top of the pager. SQLite WAL mode uses a separate WAL file (database.db-wal) for writes, enabling concurrent reads during writes.

Debugging Notes

InnoDB: SHOW ENGINE INNODB STATUS\G dumps buffer pool stats, transaction list, deadlock info, I/O statistics, and semaphore waits. Extremely valuable for diagnosis.
RocksDB: db->GetProperty("rocksdb.stats", &stats) and the SHOW ENGINE ROCKSDB STATUS MyRocks extension expose compaction stats, LSM level sizes, bloom filter hit rates.
SQLite: PRAGMA integrity_check; walks the entire B-tree verifying structural consistency. PRAGMA page_count; and PRAGMA freelist_count; help diagnose space issues.
WiredTiger: wt dump -f dump.txt table:collection-xyz exports a table in a readable format. The wtperf tool benchmarks engine performance.
Verify fsync is not disabled in production: PRAGMA synchronous=OFF in SQLite, innodb_flush_method=O_DSYNC vs O_DIRECT in MySQL.

Security Implications

Engine-level encryption: InnoDB tablespace encryption (MySQL 8.0+) encrypts .ibd files at rest. Key management integrates with keyring plugins. Without engine-level encryption, disk-level access bypasses SQL access controls entirely.
MyISAM's lack of transactions means partial writes on crash can leave tables in inconsistent states, potentially exposing partial data to subsequent queries.
MMAP-based engines (older MongoDB, LMDB) are vulnerable to kernel-level attacks that corrupt shared memory regions. WiredTiger's switch away from MMAP improved MongoDB's security posture.
Pluggable engines increase attack surface: a maliciously crafted storage engine .so loaded via MySQL plugin interface has full process privileges.

Performance Implications

Buffer pool sizing is the single most impactful configuration: innodb_buffer_pool_size should be 70-80% of available RAM for a dedicated MySQL server.
InnoDB doublewrite buffer halves effective write throughput on HDDs (each page written twice). On SSDs with atomic 4K writes (most modern NVMe), innodb_doublewrite=OFF is safe.
RocksDB compaction can consume 100% of I/O bandwidth during heavy write periods ("write stalls"). Tune level0_slowdown_writes_trigger and level0_stop_writes_trigger to manage this.
MyISAM key cache (key_buffer_size) only caches index pages, not data pages. At high concurrency, table-level locking causes severe contention. Do not use MyISAM for write-concurrent workloads.

Failure Modes

Torn writes: InnoDB's doublewrite buffer exists to handle this. A 16KB page is written as two 8KB writes; if the machine crashes mid-write, the doublewrite buffer copy is used for recovery.
Compaction storms (RocksDB/LSM): If writes outpace compaction, L0 SSTable count grows, reads slow dramatically, and eventually writes are stalled entirely.
Buffer pool eviction under pressure: Dirty pages must be flushed before eviction; a sudden spike in writes can cause checkpoint stall — InnoDB's adaptive flushing (innodb_adaptive_flushing) mitigates this.
Corruption without checksums: Enable innodb_checksum_algorithm=crc32. SQLite's PRAGMA integrity_check detects but cannot repair corruption.
MMAP OOM killer interaction: Engines using mmap can be targeted by the Linux OOM killer in unexpected ways since mmap'd pages appear as anonymous memory.

Modern Usage

The storage engine abstraction has evolved significantly. PostgreSQL's Table Access Method API (since v12) allows custom heap implementations. Neon (serverless PostgreSQL) replaces PostgreSQL's storage with a remote page server, intercepting at the smgr (storage manager) interface. OrioleDB implements a new heap with undo-based MVCC for PostgreSQL, avoiding the need for VACUUM.

NVM-aware storage engines are an active research area. Intel's PMDK and PMEM-aware RocksDB variants exploit persistent memory's byte-addressability to eliminate the page-oriented access model for hot data.

Future Directions

Disaggregated storage: Aurora, Neon, and AlloyDB all demonstrate that the storage layer can be moved to a shared, replicated, cloud-native tier. The storage engine becomes a network I/O client.
NVM storage engines: Byte-addressable persistent memory enables in-place atomic updates at cache-line granularity, eliminating WAL overhead for small updates.
Learned storage structures: Projects like Bourbon (learned LSM), ALEX (adaptive learned index), and PGM-Index replace B-trees with ML models that learn the data distribution, achieving better space/time tradeoffs on static or slowly changing datasets.
HTAP unification: Systems like TiDB and SingleStore attempt to unify OLTP row storage and OLAP column storage in a single engine, eliminating ETL pipelines.

Exercises

Implement a minimal storage engine for MySQL (using the handlerton interface) that stores all data in a sorted in-memory std::map. Support get, put, delete, and scan.
Benchmark InnoDB vs MyRocks on a write-heavy workload (e.g., 10M sequential INSERTs) and measure write amplification using iostat.
Trace a SQLite write through the source code: from sqlite3_exec("INSERT ...") through sqlite3BtreeInsert down to the pager_write call.
Configure RocksDB compaction strategies (size-tiered vs leveled) and measure the effect on read latency under a mixed read/write benchmark using db_bench.
Examine SHOW ENGINE INNODB STATUS output on a loaded production server. Identify buffer pool hit rate, pending I/O operations, and any semaphore waits.

References

Gray, J., & Reuter, A. (1992). Transaction Processing: Concepts and Techniques. Morgan Kaufmann.
Idreos, S., Dayan, N., Qin, W., Akmanalp, M., & Fadika, S. (2016). Design Continuums and the Path Toward Self-Designing Key-Value Stores. CIDR 2016.
O'Neil, P., Cheng, E., Gawlick, D., & O'Neil, E. (1996). The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 33(4), 351–385.
MySQL Internals Manual: https://dev.mysql.com/doc/internals/en/
WiredTiger Architecture Guide: https://source.wiredtiger.com/develop/arch-index.html
SQLite Architecture: https://www.sqlite.org/arch.html
Lomet, D. (2009). The Evolution of Effective B-tree: Page Organization and Techniques. ACM SIGMOD Record.