01 — Storage Engines: Architecture and Design Space
Technical Overview
A storage engine is the subsystem of a database management system responsible for physically storing, retrieving, and managing data on durable media. It sits below the query engine and exposes an API for CRUD operations, transaction support, crash recovery, and space management. The separation of the storage engine from the query engine is one of the most consequential architectural decisions in DBMS design — it enables the same SQL frontend to operate over radically different physical representations.
The canonical interface a storage engine must expose includes: get(key), put(key, value), delete(key), scan(start, end), begin_txn(), commit(), abort(), plus callbacks for crash recovery. In MySQL's pluggable architecture this interface is formalized as the handlerton struct (handler + singleton), which defines function pointers for every storage operation. In SQLite this boundary is less explicit but still present in the Btree and Pager layers.
Prerequisites
- Familiarity with file I/O syscalls (
read,write,fsync,mmap) - Basic knowledge of B-trees and hash tables
- Understanding of ACID transaction properties
- Operating system concepts: page cache, direct I/O, virtual memory
Core Content
The Storage Engine vs. Query Engine Split
SQL Query
|
v
+------------------+
| Parser |
| Analyzer |
| Rewriter |
| Planner | <- Query Engine (SQL layer)
| Executor |
+--------+---------+
|
| Storage Engine API
| get/put/scan/txn
v
+------------------+
| Storage Engine |
| - Buffer Pool |
| - Access Methods|
| - Concurrency | <- Storage Engine Layer
| - Recovery |
| - Space Mgmt |
+--------+---------+
|
v
Disk / NVM / RAM
MySQL is the most prominent example of this split. The query layer (parsing, optimization, replication protocol) is independent of the storage layer. You can ALTER TABLE t ENGINE=InnoDB; or ALTER TABLE t ENGINE=RocksDB; and the SQL semantics remain identical. The tradeoffs shift entirely in the storage tier.
PostgreSQL, by contrast, uses a monolithic architecture where the storage engine is tightly coupled to the executor. The table access method API introduced in PostgreSQL 12 (CREATE ACCESS METHOD) is a step toward pluggability, but it remains far less decoupled than MySQL's model. Zheap (an alternative heap storage with in-place undo) was prototyped using this API.
Pluggable Storage Engines in MySQL
MySQL's handlerton (defined in sql/handler.h) requires implementations of roughly 80 virtual methods, though many have default no-ops. Key methods:
struct handlerton {
slot_t slot;
uint savepoint_offset;
int (*close_connection)(handlerton *hton, THD *thd);
int (*savepoint_set)(handlerton *hton, THD *thd, void *sv);
int (*savepoint_rollback)(handlerton *hton, THD *thd, void *sv);
int (*commit)(handlerton *hton, THD *thd, bool all);
int (*rollback)(handlerton *hton, THD *thd, bool all);
int (*prepare)(handlerton *hton, THD *thd, bool all);
int (*recover)(handlerton *hton, XID *xid_list, uint len);
...
};
InnoDB: The default and most production-hardened MySQL engine. B+ tree primary key clustered index. MVCC via undo logs. Doublewrite buffer for torn-write protection. Adaptive hash index (AHI) as an automatic cache of recent B-tree lookups. Buffer pool with LRU-K eviction. ARIES-based recovery with redo/undo logging. Full support for foreign keys, row-level locking, and all isolation levels.
MyISAM: The original MySQL storage engine. No transaction support, no crash recovery, no row-level locking (table locks only). Separate .MYD (data) and .MYI (index) files per table. The lack of MVCC means reads acquire shared locks. Extremely fast for read-only workloads (no lock overhead, no txn state). Used by MySQL internal catalog tables until 8.0 when they migrated to InnoDB.
RocksDB (MyRocks): LSM-tree based storage, developed by Meta (Facebook) and integrated into MySQL as MyRocks. Dramatically better write amplification than InnoDB for write-heavy workloads. Facebook reported 2x storage compression improvement over InnoDB for their UDB (user database) workload. Weak on random reads due to LSM's multi-level read path.
TokuDB (now largely deprecated): Fractal tree index-based storage. Superior write amplification to B-trees, better compression than both InnoDB and MyISAM. Acquired by Percona but abandoned after MariaDB licensing disputes.
Read-Optimized vs Write-Optimized Storage
The fundamental tension in storage engine design:
Write-Optimized Read-Optimized
(LSM Trees) (B+ Trees)
Writes land in memory buffer Writes go directly to
(MemTable), then flushed the target B-tree page
sequentially (random I/O)
Read requires checking Read follows single
multiple levels (expensive) tree path (fast)
Background compaction amortizes No background work needed
write cost, causes read for reads; merges happen
amplification spikes in-place during writes
Space amplification: multiple Minimal space overhead
versions exist simultaneously (one copy of data)
The RUM Conjecture (Idreos et al., 2016) formalizes this: you cannot simultaneously minimize Read overhead, Update overhead, and Memory/space overhead. Any storage structure optimizes two at the expense of the third.
In-Memory vs Disk-Based Storage
Disk-based engines (InnoDB, RocksDB) assume the working set exceeds RAM. They manage a buffer pool that caches disk pages. The page size (InnoDB default: 16KB, configurable 4KB–64KB) governs I/O granularity and tree fan-out.
In-memory engines (VoltDB, H-Store, Redis Modules, MySQL NDB Cluster with in-memory tables) assume the entire dataset fits in RAM. They can use pointer-based data structures (skip lists, Masstree, ART — Adaptive Radix Tree) instead of page-oriented trees. Lock-free structures become feasible. NVM (Optane DC PMem) blurs this line further — pmdk libraries expose persistent memory with DRAM-like latency.
SQLite is notable: it uses a Btree over a pager abstraction, where the pager manages a page cache in memory. The default page size is 4096 bytes. SQLite optionally uses mmap for reads (PRAGMA mmap_size) bypassing its own pager cache for sequential reads.
Column Stores vs Row Stores
Row Store (OLTP):
Row 1: [id=1][name="Alice"][age=30][salary=90000]
Row 2: [id=2][name="Bob"] [age=25][salary=75000]
Row 3: [id=3][name="Carol"][age=35][salary=95000]
Column Store (OLAP):
id: [1][2][3]
name: ["Alice"]["Bob"]["Carol"]
age: [30][25][35]
salary: [90000][75000][95000]
Row stores optimize for whole-row access (INSERT, UPDATE, point queries). Column stores optimize for aggregate queries over few columns in large tables — the I/O is proportional to columns accessed, not rows. Column stores also achieve far better compression: homogeneous data in a column compresses well with delta encoding, run-length encoding, or dictionary encoding.
PostgreSQL is a row store. ClickHouse is a column store. DuckDB is an in-process column store. Hybrid HTAP systems (TiDB, SQL Server columnstore indexes, Oracle In-Memory) maintain both layouts.
WAL-Based vs Shadow Paging
Two approaches to ensuring atomicity and durability:
WAL (Write-Ahead Logging): Log changes before applying them to data pages. On crash, replay the log. Used by PostgreSQL, InnoDB, SQLite. Requires sequential log writes (fast) + random data page writes (potentially deferred). The key invariant: log record for page X must be on durable storage before page X is evicted from buffer pool to disk.
Shadow Paging: Maintain two versions of pages — the current (shadow) and the new (modified). On commit, atomically flip the root pointer to point to the new version. No log needed. Used by CouchDB, LMDB, SQLite WAL mode (inverted — the "WAL" in SQLite WAL mode is actually a shadow page file for the reader snapshot). Shadow paging has poor locality (random writes for every modified page) and does not support fine-grained recovery. LMDB uses this approach and achieves extremely fast reads at the cost of writes being serialized (single-writer model).
Storage Engine Selection Criteria
| Criterion | InnoDB | MyRocks | MyISAM | In-Memory |
|---|---|---|---|---|
| Write amplification | High | Low | Medium | None |
| Read performance | Excellent | Good (bloom) | Excellent | Excellent |
| ACID transactions | Full | Full | None | Varies |
| Compression | Page-level | Block-level | None | N/A |
| Range scans | Excellent | Good | Good | Good |
| Space efficiency | Moderate | High | Low | RAM-limited |
| Crash recovery | Fast | Fast | No | RAM volatile |
Historical Context
The storage engine concept in its pluggable form was pioneered by MySQL AB around 1999-2001. Brian Aker and Monty Widenius designed the handler API to allow commercial storage engine vendors (Innobase Oy for InnoDB, Sleepycat for BerkeleyDB) to ship engines independently. Oracle acquired both Innobase (2005) and Sun/MySQL (2010), collapsing the commercial ecosystem.
SQLite (D. Richard Hipp, 2000) took a different approach: a self-contained, serverless storage engine targeting embedded use. Its pager/btree separation remains one of the cleanest storage engine designs in open-source.
MongoDB switched its storage engine from its custom MMAP-based engine to WiredTiger (acquired in 2014) in MongoDB 3.0 (2015). WiredTiger uses a B-tree with MVCC and supports both row and column storage. The switch was motivated by WiredTiger's document-level locking vs the MMAP engine's collection-level locking.
MariaDB's Aria engine was created as a crash-safe replacement for MyISAM, used internally for system tables. It supports both MVCC and non-transactional modes.
Production Examples
MongoDB WiredTiger (src/third_party/wiredtiger): Uses a B-tree with internal MVCC. Data files use .wt extension. The WT_SESSION and WT_CURSOR API maps cleanly onto the storage engine interface abstraction. WiredTiger supports both row-store and column-store engines, and uses a block-based allocator with a B-tree for free space management.
MariaDB Aria: Crash-safe replacement for MyISAM. Uses .MAI (index) and .MAD (data) files. Supports both transactional and non-transactional operation. Used by MariaDB for internal temporary tables and system tables.
SQLite: The btree.c and pager.c modules implement the storage layer. Pager handles the page cache, journaling, and WAL. Btree implements B+ trees on top of the pager. SQLite WAL mode uses a separate WAL file (database.db-wal) for writes, enabling concurrent reads during writes.
Debugging Notes
- InnoDB:
SHOW ENGINE INNODB STATUS\Gdumps buffer pool stats, transaction list, deadlock info, I/O statistics, and semaphore waits. Extremely valuable for diagnosis. - RocksDB:
db->GetProperty("rocksdb.stats", &stats)and theSHOW ENGINE ROCKSDB STATUSMyRocks extension expose compaction stats, LSM level sizes, bloom filter hit rates. - SQLite:
PRAGMA integrity_check;walks the entire B-tree verifying structural consistency.PRAGMA page_count;andPRAGMA freelist_count;help diagnose space issues. - WiredTiger:
wt dump -f dump.txt table:collection-xyzexports a table in a readable format. Thewtperftool benchmarks engine performance. - Verify
fsyncis not disabled in production:PRAGMA synchronous=OFFin SQLite,innodb_flush_method=O_DSYNCvsO_DIRECTin MySQL.
Security Implications
- Engine-level encryption: InnoDB tablespace encryption (MySQL 8.0+) encrypts
.ibdfiles at rest. Key management integrates with keyring plugins. Without engine-level encryption, disk-level access bypasses SQL access controls entirely. - MyISAM's lack of transactions means partial writes on crash can leave tables in inconsistent states, potentially exposing partial data to subsequent queries.
- MMAP-based engines (older MongoDB, LMDB) are vulnerable to kernel-level attacks that corrupt shared memory regions. WiredTiger's switch away from MMAP improved MongoDB's security posture.
- Pluggable engines increase attack surface: a maliciously crafted storage engine
.soloaded via MySQL plugin interface has full process privileges.
Performance Implications
- Buffer pool sizing is the single most impactful configuration:
innodb_buffer_pool_sizeshould be 70-80% of available RAM for a dedicated MySQL server. - InnoDB doublewrite buffer halves effective write throughput on HDDs (each page written twice). On SSDs with atomic 4K writes (most modern NVMe),
innodb_doublewrite=OFFis safe. - RocksDB compaction can consume 100% of I/O bandwidth during heavy write periods ("write stalls"). Tune
level0_slowdown_writes_triggerandlevel0_stop_writes_triggerto manage this. - MyISAM key cache (
key_buffer_size) only caches index pages, not data pages. At high concurrency, table-level locking causes severe contention. Do not use MyISAM for write-concurrent workloads.
Failure Modes
- Torn writes: InnoDB's doublewrite buffer exists to handle this. A 16KB page is written as two 8KB writes; if the machine crashes mid-write, the doublewrite buffer copy is used for recovery.
- Compaction storms (RocksDB/LSM): If writes outpace compaction, L0 SSTable count grows, reads slow dramatically, and eventually writes are stalled entirely.
- Buffer pool eviction under pressure: Dirty pages must be flushed before eviction; a sudden spike in writes can cause checkpoint stall — InnoDB's adaptive flushing (
innodb_adaptive_flushing) mitigates this. - Corruption without checksums: Enable
innodb_checksum_algorithm=crc32. SQLite'sPRAGMA integrity_checkdetects but cannot repair corruption. - MMAP OOM killer interaction: Engines using
mmapcan be targeted by the Linux OOM killer in unexpected ways since mmap'd pages appear as anonymous memory.
Modern Usage
The storage engine abstraction has evolved significantly. PostgreSQL's Table Access Method API (since v12) allows custom heap implementations. Neon (serverless PostgreSQL) replaces PostgreSQL's storage with a remote page server, intercepting at the smgr (storage manager) interface. OrioleDB implements a new heap with undo-based MVCC for PostgreSQL, avoiding the need for VACUUM.
NVM-aware storage engines are an active research area. Intel's PMDK and PMEM-aware RocksDB variants exploit persistent memory's byte-addressability to eliminate the page-oriented access model for hot data.
Future Directions
- Disaggregated storage: Aurora, Neon, and AlloyDB all demonstrate that the storage layer can be moved to a shared, replicated, cloud-native tier. The storage engine becomes a network I/O client.
- NVM storage engines: Byte-addressable persistent memory enables in-place atomic updates at cache-line granularity, eliminating WAL overhead for small updates.
- Learned storage structures: Projects like Bourbon (learned LSM), ALEX (adaptive learned index), and PGM-Index replace B-trees with ML models that learn the data distribution, achieving better space/time tradeoffs on static or slowly changing datasets.
- HTAP unification: Systems like TiDB and SingleStore attempt to unify OLTP row storage and OLAP column storage in a single engine, eliminating ETL pipelines.
Exercises
- Implement a minimal storage engine for MySQL (using the
handlertoninterface) that stores all data in a sorted in-memorystd::map. Supportget,put,delete, andscan. - Benchmark InnoDB vs MyRocks on a write-heavy workload (e.g., 10M sequential INSERTs) and measure write amplification using iostat.
- Trace a SQLite write through the source code: from
sqlite3_exec("INSERT ...")throughsqlite3BtreeInsertdown to thepager_writecall. - Configure RocksDB compaction strategies (size-tiered vs leveled) and measure the effect on read latency under a mixed read/write benchmark using
db_bench. - Examine
SHOW ENGINE INNODB STATUSoutput on a loaded production server. Identify buffer pool hit rate, pending I/O operations, and any semaphore waits.
References
- Gray, J., & Reuter, A. (1992). Transaction Processing: Concepts and Techniques. Morgan Kaufmann.
- Idreos, S., Dayan, N., Qin, W., Akmanalp, M., & Fadika, S. (2016). Design Continuums and the Path Toward Self-Designing Key-Value Stores. CIDR 2016.
- O'Neil, P., Cheng, E., Gawlick, D., & O'Neil, E. (1996). The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 33(4), 351–385.
- MySQL Internals Manual: https://dev.mysql.com/doc/internals/en/
- WiredTiger Architecture Guide: https://source.wiredtiger.com/develop/arch-index.html
- SQLite Architecture: https://www.sqlite.org/arch.html
- Lomet, D. (2009). The Evolution of Effective B-tree: Page Organization and Techniques. ACM SIGMOD Record.