03 — Log-Structured Merge Trees (LSM Trees)

Technical Overview

The Log-Structured Merge Tree (LSM Tree) is a data structure optimized for write-heavy workloads by converting random writes into sequential I/O. Introduced by O'Neil, Cheng, Gawlick, and O'Neil in 1996, LSM trees buffer writes in memory and periodically flush them to disk as immutable sorted files. Background compaction merges these files, maintaining the sorted property while reclaiming space from deleted and obsolete versions. The result is dramatically lower write amplification compared to B+ trees — at the cost of higher read amplification and space amplification.

LevelDB (Google, 2011), RocksDB (Meta, 2012), Apache Cassandra, HBase, and InfluxDB all use LSM trees as their core storage structure. Understanding LSM internals is essential for tuning write-heavy databases, diagnosing compaction issues, and understanding the tradeoff space at the heart of modern key-value storage.

Prerequisites

Understanding of B+ trees and why random I/O is expensive
Familiarity with sorting algorithms (specifically merge sort)
Basic understanding of bloom filters
Knowledge of write amplification, read amplification, space amplification concepts

Core Content

The Core Insight: Sequential Writes Are Fast

On HDDs, sequential I/O is ~100x faster than random I/O. On NVMe SSDs the ratio shrinks to ~5-10x, but sequential writes still have lower write amplification at the device level. LSM trees exploit this by:

Accepting all writes into a fast in-memory buffer (MemTable).
Flushing the MemTable to disk as a sorted, immutable file (SSTable — Sorted String Table) when it reaches a threshold.
Compacting SSTables in the background to maintain sorted order and remove obsolete data.

LSM Tree Architecture

Write Path:
  Client Write
       |
       v
  +----------+
  | WAL      |  <-- Sequential append for crash recovery
  | (append) |
  +----------+
       |
       v
  +----------+
  | MemTable |  <-- In-memory sorted structure (skip list or red-black tree)
  | (active) |      ~64MB default in RocksDB
  +----------+
       | (when full)
       v
  +-------------------+
  | Immutable MemTable|  <-- Being flushed, still readable
  +-------------------+
       | (flush)
       v
  +---------+  +---------+  +---------+
  | SST L0  |  | SST L0  |  | SST L0  |  <-- L0: Unsorted between SSTables!
  +---------+  +---------+  +---------+
       | (compaction)
       v
  +----------------------------------------------+
  |           L1: Sorted, non-overlapping        |  ~10MB (RocksDB default)
  +----------------------------------------------+
       | (compaction)
       v
  +----------------------------------------------+
  |           L2: Sorted, non-overlapping        |  ~100MB
  +----------------------------------------------+
       | (compaction)
       v
  +----------------------------------------------+
  |           L3: Sorted, non-overlapping        |  ~1GB
  +----------------------------------------------+

The MemTable

The MemTable is an in-memory sorted data structure. RocksDB defaults to a skip list (InlineSkipList in memtable/inlineskiplist.h), which provides O(log n) search, insert, and delete with excellent concurrent performance via optimistic CAS operations. Alternative MemTable implementations in RocksDB include hash skip list (for point lookups only), hash linked list, and vector (for bulk loading).

Skip List MemTable (conceptual):
Level 3: [---------------------> key=50 -------> NULL]
Level 2: [-> key=10 -----------> key=50 -> key=80 -> NULL]
Level 1: [-> key=10 -> key=30 -> key=50 -> key=70 -> key=80 -> NULL]
Level 0: [-> key=5 -> key=10 -> key=20 -> key=30 -> key=40 -> key=50 -> ...]

When the MemTable reaches its size limit (write_buffer_size, default 64MB in RocksDB), it becomes immutable and a new MemTable is created. The immutable MemTable is flushed to disk as an L0 SSTable by the flush thread.

The SSTable (Sorted String Table)

An SSTable is an immutable sorted file. RocksDB's SSTable format (table/block_based/):

SSTable File Layout:
+------------------------+
| Data Blocks            |  Sorted key-value pairs, ~4KB each
| Data Blocks            |
| ...                    |
+------------------------+
| Filter Block           |  Bloom filter (or Ribbon filter) for each block
+------------------------+
| Index Block            |  One entry per data block: (max_key, block_offset)
+------------------------+
| Meta Index Block       |  Locations of filter block, other metadata
+------------------------+
| Footer                 |  Magic number, offsets to index and meta blocks
+------------------------+

The index block allows binary search to find the right data block for a given key in O(log(num_blocks)) without reading the entire SSTable. The filter block (bloom filter) allows skipping the SSTable entirely for keys that are definitely not present.

The Read Path

Read(key):
  1. Check active MemTable
  2. Check immutable MemTable(s)
  3. Check L0 SSTables (all of them — L0 may have overlapping ranges!)
     For each L0 SSTable (newest first):
       a. Check bloom filter — if "definitely not", skip
       b. Binary search index block to find data block
       c. Binary search data block for key
  4. Check L1 SSTables (sorted, non-overlapping — binary search to find the one)
     a. Check bloom filter
     b. Binary search index block, then data block
  5. Repeat for L2, L3, ... until key found or all levels exhausted
  6. If key found as tombstone, return "not found"

In the worst case (key doesn't exist), every level must be checked. Bloom filters make this O(1) amortized by filtering 99%+ of SSTables. RocksDB's Ribbon filter (Dillinger & Walzer, 2021) achieves better false positive rates than standard Bloom filters at the same memory cost.

Bloom Filters for SSTable Skip

A Bloom filter is a probabilistic data structure that answers "is key X in this set?" with no false negatives and controllable false positives. Each SSTable has a Bloom filter covering all keys in the SSTable.

Bloom Filter for SSTable:
- Bit array of size m (e.g., 10 bits per key)
- k hash functions
- Set bits: for each key, compute k hashes, set those bits
- Query: for a query key, compute k hashes, check those bits
  - If any bit is 0: key DEFINITELY not in SSTable (skip it)
  - If all bits are 1: key PROBABLY in SSTable (check the SSTable)

RocksDB uses 10 bits per key in its default Bloom filter, giving a ~1% false positive rate. At bloom_bits_per_key=10, reads that miss benefit from ~99% SSTable skip rate.

Compaction Strategies

Compaction merges SSTables to reduce read amplification and reclaim space from deleted/updated keys. Three main strategies:

Leveled Compaction (RocksDB default, LevelDB default): - L0 SSTables are unsorted relative to each other (key ranges overlap) - L1 and below: key ranges within a level are non-overlapping - Compaction picks one SSTable from L(n) and all overlapping SSTables from L(n+1), merges them, and writes back to L(n+1) - Level size ratio: L(n+1) is 10x L(n) (configurable via max_bytes_for_level_multiplier) - Write amplification: O(level_ratio * num_levels) ≈ 30x for 3 levels with ratio 10 - Read amplification: O(num_levels) ≈ 3-7 I/Os - Space amplification: ~1.1x (10% overhead during compaction)

Size-Tiered Compaction (Cassandra default, HBase default): - SSTables are grouped into "tiers" by size - When enough SSTables of similar size accumulate, they are merged into a larger SSTable - No strict level structure; SSTables in a tier can have overlapping key ranges - Write amplification: Lower than leveled (data rewritten fewer times) - Read amplification: Higher than leveled (multiple overlapping SSTables per tier) - Space amplification: Higher than leveled (temporarily stores both old and new SSTables during compaction)

Size-Tiered Compaction:
Tier 0 (small): [SST-a][SST-b][SST-c][SST-d]  -> merge into SST-ABCD
Tier 1 (med):   [SST-1][SST-2][SST-ABCD]      -> merge into SST-12ABCD
Tier 2 (large): [SST-X][SST-12ABCD]           -> merge into one big SST

FIFO Compaction (RocksDB, for time-series): - SSTables are not merged; old SSTables are simply deleted when total size exceeds a threshold - Zero write amplification (no compaction), very high read amplification - Only valid for time-windowed data where old data is discarded

Universal Compaction (RocksDB): - A variant of size-tiered that tries to maintain a sorted run invariant - Minimizes write amplification for scenarios where data is written in time order

Write Amplification, Read Amplification, Space Amplification (The RUM Conjecture)

The fundamental tradeoff in LSM design, formalized by Idreos et al. (2016):

Leveled Compaction:
  Write Amp  ≈ 25-50x  (data rewritten many times through levels)
  Read Amp   ≈ 3-7 I/Os per point read
  Space Amp  ≈ 1.1x

Size-Tiered Compaction:
  Write Amp  ≈ 5-10x   (data rewritten fewer times)
  Read Amp   ≈ 10-50 I/Os per point read (without bloom filters)
  Space Amp  ≈ 2-3x

B+ Tree (for comparison):
  Write Amp  ≈ 5-150x  (random I/O, every update rewrites a page)
  Read Amp   ≈ 3-5 I/Os (O(log n) with high fan-out)
  Space Amp  ≈ 1.5-2x  (page fragmentation)

You can tune RocksDB toward lower write amplification (universal compaction, larger write_buffer_size) or lower read amplification (leveled compaction, more bloom filter bits) but not both simultaneously without increasing space amplification.

Tombstones and Deletion

Deleting a key in an LSM tree writes a tombstone — a special marker record with the key and a deletion flag. The key is not immediately removed because older versions of the key may exist in lower levels. The tombstone supersedes all older values during reads.

Read(key="foo"):
  L0: {key="foo", value=tombstone}  <- Found tombstone, return "not found"
  L2: {key="foo", value="bar"}      <- Older version, ignored

Tombstones are only removed during compaction when the tombstone has reached the lowest level (guaranteeing no older versions remain). This causes the tombstone accumulation problem: if deletes are frequent and compaction is slow, tombstones pile up and degrade read performance even for keys that were deleted long ago.

Compaction filter (RocksDB): A user-defined function called during compaction that can drop keys or tombstones. Used by MyRocks to implement TTL (time-to-live) expiration.

LSM Implementations

LevelDB (Google, 2011, Sanjay Ghemawat & Jeff Dean): The original production LSM implementation. Single-threaded compaction, no compression by default. ~20K lines of C++. The foundational code that RocksDB forked from.

RocksDB (Meta/Facebook, 2012): LevelDB fork with production enhancements: multi-threaded compaction, column families (separate LSM per logical namespace), write-ahead log group commit, Bloom/Ribbon filters, compression (Snappy, Zstd, LZ4), direct I/O, rate limiting for I/O. Used in MyRocks (MySQL), MongoRocks, TiKV (TiDB), Dgraph, YugabyteDB.

Apache Cassandra: Uses a hybrid approach — SSTables are LSM-structured, but Cassandra's CompactionManager supports both size-tiered (default) and leveled compaction. Cassandra's "SSTable" is actually a family of files (Data.db, Index.db, Filter.db, Statistics.db, etc.).

HBase: Uses HDFS as the underlying storage. MemStore (in-memory) → HFile on HDFS. HBase regions correspond to LSM "shards." Compaction is done by the region server.

Historical Context

O'Neil et al.'s 1996 paper was motivated by OLTP workloads with extreme write rates (transaction logging, audit trails) where B-tree random I/O was the bottleneck. The paper proposed a multi-component structure with two main components: a small in-memory C0 and a larger on-disk C1, with rolling merges between them.

Google's Bigtable (Chang et al., 2006) applied LSM ideas at scale, using GFS-backed SSTables with a "minor compaction" (MemTable to one SSTable) and "major compaction" (all SSTables into one) strategy. This directly influenced HBase.

LevelDB's contribution was the "leveled" compaction strategy (replacing Bigtable's two-component approach) and the clean API design. RocksDB's contribution was making the structure production-grade: multi-threaded, tunable, observable, and supporting column families.

Production Examples

RocksDB in TiKV: TiDB's storage node (TiKV) uses RocksDB with column families for MVCC data (default CF) and MVCC write records (write CF). Each key is stored as {user_key}{ts} where ts is a descending timestamp, placing the newest version first in sorted order.

Cassandra: The nodetool compactionstats command shows ongoing compaction. nodetool cfstats shows per-table SSTable counts. Cassandra's compaction_throughput_mb_per_sec limits I/O impact of compaction.

MyRocks at Facebook: Meta's UDB (user database) runs MyRocks at 300+ TB scale. The switch from InnoDB to MyRocks reduced storage by 2x and write I/O by 10x for their specific workload (mostly inserts and point reads). The savings come from RocksDB's compression (prefix compression + Zstd on L2+) and lower write amplification.

Debugging Notes

RocksDB stats: db->GetProperty("rocksdb.stats", &value) returns a human-readable stats table. Look for Stall and Stop indicators — these mean the write path is being throttled because compaction cannot keep up.
L0 file count: High L0 SSTable count (>= level0_slowdown_writes_trigger, default 20) indicates compaction falling behind. rocksdb.num-files-at-level0 property tracks this.
Write stalls: RocksDB writes [WARN] Stalling writes... to the info log when throttling. A CompactionEventListener can receive programmatic notifications.
Bloom filter effectiveness: rocksdb.bloom-filter-useful and rocksdb.bloom-filter-full-positive counters measure how often bloom filters skip SSTable reads.
Cassandra tombstone warnings: Cassandra logs tombstone_warn_threshold (default 1000) warnings when a read encounters many tombstones. This is a sign of problematic delete patterns.

Security Implications

LSM tombstones are not immediate deletion: After a DELETE operation, the data remains in SSTables until compaction removes the tombstone and all underlying versions. This is a compliance risk for regulations requiring immediate data deletion (GDPR "right to erasure").
Compaction filters as exfiltration vector: A malicious compaction filter plugin can read all keys during compaction and exfiltrate data. RocksDB does not sandbox compaction filters.
WAL exposure: The WAL is plain-text (or lightly structured binary) on disk. If WAL encryption is not enabled, an attacker with filesystem access can read recent uncommitted writes.
SSTable encryption: RocksDB supports an EncryptionProvider interface for at-rest encryption. RocksDB-Cloud (Rockset, CockroachDB, others) implements envelope encryption where each SSTable has its own data key, encrypted with a master key.

Performance Implications

Write buffer sizing: Larger write_buffer_size reduces flush frequency (fewer L0 files) but increases memory usage and recovery time after crashes. max_write_buffer_number controls how many immutable MemTables can queue before writes stall.
Compaction concurrency: max_background_compactions and max_background_flushes control parallelism. On storage with high I/O bandwidth (NVMe), increase these.
Compression: Enable compression for L2 and deeper (compression_per_level). Zstd provides better compression than Snappy at similar CPU cost. Compression reduces I/O, improving effective throughput.
Direct I/O: use_direct_reads=true and use_direct_io_for_flush_and_compaction=true bypass the OS page cache for SSTable reads/writes. Beneficial when RocksDB manages its own block cache (avoids double-caching).
Block cache sizing: NewLRUCache(8 * 1024 * 1024 * 1024) for an 8GB block cache. This caches uncompressed data block, index blocks, and filter blocks. cache_index_and_filter_blocks=true puts index and filters in the block cache (default: they are in the table's memory).

Failure Modes

Write stall/stop: When L0 file count reaches level0_slowdown_writes_trigger (default 20), writes are throttled. When it reaches level0_stop_writes_trigger (default 36), writes are stopped entirely. If compaction cannot keep up with writes, the database becomes effectively read-only.
Compaction debt: A burst of writes that creates many L0 files cannot be quickly recovered — compaction is bandwidth-limited. Recovery from a full write stall may take minutes to hours.
Tombstone storm: Millions of delete operations without compaction create massive tombstone accumulation. Range reads become O(tombstone_count) instead of O(result_count). Cassandra's SizeTieredCompactionStrategy is particularly vulnerable.
WAL corruption: If the WAL is corrupted (partial write on crash), RocksDB's WALRecoveryMode::kPointInTimeRecovery recovers to the last consistent point, discarding partially-written records.
Bloom filter false positives: A corrupted or misconfigured bloom filter returning too many false positives causes unnecessary SSTable reads. Monitor bloom-filter-full-positive / bloom-filter-useful ratio.

Modern Usage

RocksDB is now the de facto standard for embedded key-value storage in new database systems. TiKV (TiDB), CockroachDB (uses Pebble, a Go LSM based on RocksDB concepts), YugabyteDB, Dgraph, InfluxDB IOx, and many others use RocksDB or RocksDB-inspired LSM trees.

Pebble (CockroachDB): A Go reimplementation of RocksDB's concepts with a focus on CockroachDB's workload. Pebble introduces "range key" support (for CRDB's SQL range operations) and a simpler compaction algorithm.

WiscKey (University of Wisconsin, 2016): Separates keys from values — keys go in the LSM tree, values in a separate "value log." Reduces write amplification when values are large, at the cost of more complex garbage collection.

Future Directions

Tiered storage-aware LSM: Automatically placing hot SSTables on fast NVMe and cold SSTables on S3, with compaction moving data between tiers. Titan (TiKV's WiscKey-inspired vLog), RocksDB remote compaction, and CockroachDB's S3-backed storage explore this direction.
Hardware-accelerated compaction: FPGAs and DPUs (Data Processing Units) can offload compaction I/O and merge operations from the host CPU, as demonstrated in projects from Samsung, Alibaba, and USENIX papers from 2022-2024.
ML-guided compaction scheduling: Using workload prediction to proactively compact before read-latency spikes, rather than reactively. Dayan & Idreos (2018) formalized "Dostoevsky" — an optimal design point calculator for LSM compaction policies.

Exercises

Implement a minimal LSM tree in Python or Go with: MemTable (sorted dict), flush to SSTable (sorted file), and a two-level compaction. Test with 100K key-value pairs.
Use RocksDB's db_bench tool to measure write amplification under leveled vs universal compaction on a 10GB dataset. Compare total bytes written to storage vs total bytes inserted.
Reproduce the "tombstone storm": insert 1M keys into a Cassandra table, delete 500K of them, then run a range scan. Measure latency before and after running nodetool compact.
Trace a RocksDB Put() operation through the source code: from DB::Put() → WriteImpl() → WAL write → MemTable insert. Identify where the write is considered durable.
Experiment with bloom_bits_per_key in RocksDB: set it to 0 (no bloom filter), 5, 10, and 20. Measure the impact on point-read latency for missing keys.

References

O'Neil, P., Cheng, E., Gawlick, D., & O'Neil, E. (1996). The Log-Structured Merge-Tree. Acta Informatica, 33(4), 351–385.
Chang, F., et al. (2006). Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
Dayan, N., Athanassoulis, M., & Idreos, S. (2017). Monkey: Optimal Navigable Key-Value Store. SIGMOD 2017.
Dayan, N., & Idreos, S. (2018). Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores. SIGMOD 2018.
Lu, L., Pillai, T. S., Arpaci-Dusseau, A. C., & Arpaci-Dusseau, R. H. (2016). WiscKey: Separating Keys from Values in SSD-Conscious Storage. FAST 2016.
RocksDB Tuning Guide: https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide
Idreos, S., Dayan, N., et al. (2016). The Periodic Table of Data Structures. IEEE Data Engineering Bulletin, 41(3).