02 — B-Trees and B+ Trees
Technical Overview
The B-tree (Bayer & McCreight, 1972) is the dominant index structure in relational and key-value databases. Unlike binary search trees, B-trees are optimized for block-oriented storage: each node maps to one or more disk pages, and the tree's fan-out is engineered to minimize the number of I/O operations required for a lookup. The B+ tree variant — where all data records are stored in leaf nodes and internal nodes contain only keys for routing — is so universally adopted in databases that "B-tree" in database literature almost always refers to the B+ tree.
PostgreSQL's nbtree module, InnoDB's clustered and secondary indexes, SQLite's Btree module, and virtually every other disk-based DBMS use B+ trees as the primary access method. Understanding their internal mechanics — from node splits and merges to the interaction with the buffer pool — is prerequisite knowledge for any database internals engineer.
Prerequisites
- Understanding of binary search trees and tree traversal
- Block I/O concepts: disk pages, random vs sequential access costs
- Basic C struct layout and memory management
- Familiarity with buffer pool concepts (cache of disk pages in RAM)
Core Content
B+ Tree Structure
A B+ tree of order m satisfies:
- Every internal node has at most m children.
- Every internal node (except the root) has at least ⌈m/2⌉ children.
- The root has at least 2 children (unless it is a leaf).
- All leaves are at the same depth.
- All data records are stored at the leaf level; leaves are linked in a doubly-linked list.
Order-4 B+ Tree (max 3 keys per internal node, max 3 records per leaf):
Internal Nodes (keys only):
[ 20 | 40 ]
/ | \
[10|15] [25|30] [45|50]
/ | \ / | \ / | \
L0 L1 L2 L3 L4 L5 L6 L7 L8
Leaf Level (data + sibling pointers):
+-------+ +-------+ +-------+ +-------+ +-------+
|1,2,5 |<-->|10,12 |<-->|20,22 |<-->|30,35 |<-->|40,45 |
+-------+ +-------+ +-------+ +-------+ +-------+
L0 L1 L2 L3 L4
The sibling pointers enable efficient range scans: after finding the start key via tree traversal, follow the leaf chain until the range is exhausted, without returning to internal nodes.
Fan-Out and Height Analysis
For a B+ tree with n data records and order m, the height is O(log_m(n)). With a page size of 16KB and an 8-byte key + 6-byte pointer (typical InnoDB internal node entry), fan-out is:
16384 bytes / 14 bytes per entry ≈ 1170 entries per internal node
For a 1-billion-row table:
height = ceil(log_1170(1,000,000,000)) ≈ ceil(3.05) = 4 levels
Four I/Os to find any record in a billion-row table. In practice, the root and upper levels are pinned in the buffer pool, reducing this to 1-2 I/Os for the lower levels.
Search Algorithm
function search(node, key):
if node is leaf:
return binary_search(node.keys, key)
else:
i = largest index where node.keys[i] <= key
return search(node.children[i], key)
The binary search within a node is O(log m), total complexity O(log_m(n) * log m) = O(log n). For practical fan-outs (hundreds to thousands), the constant factor is small.
Insertion with Splits
Insertion always targets a leaf node. If the leaf is full (m-1 keys), it splits:
Before split (order 4, max 3 keys):
Leaf: [10 | 20 | 30] -- FULL, inserting 25
Step 1: Split leaf into two:
Left: [10 | 20]
Right: [25 | 30]
Step 2: Push up the split key (25) to parent:
Parent: [... | 25 | ...]
/ \
[10|20] [25|30]
If the parent is also full, the split propagates upward. In the worst case, a single insertion splits O(h) nodes (one per level), where h is the tree height. This is rare in practice — a probabilistic analysis shows that most insertions require no splits at all.
InnoDB uses a "pessimistic insert" path when splits are needed (btr_cur_pessimistic_insert in btr0cur.cc) and an "optimistic insert" path when there is free space (btr_cur_optimistic_insert). The optimistic path acquires only an SX latch on the leaf page; pessimistic must latch the entire path from root to leaf.
Deletion with Merges and Redistribution
Deletion may underflow a leaf (fewer than ⌈m/2⌉ keys). Two options:
Redistribution: If a sibling has more than the minimum, borrow a key:
Before (underflow after delete):
Parent: [... | 30 | ...]
/ \
[20] (underflow) [30 | 40 | 50] (has spare)
After redistribution:
Parent: [... | 40 | ...]
/ \
[20 | 30] [40 | 50]
Merge: If neither sibling has a spare key, merge two nodes and pull down the separator from the parent:
Before (both siblings at minimum):
Parent: [... | 30 | ...]
/ \
[20] [30] <- both at minimum
After merge:
Parent: [...] (30 removed from parent)
|
[20 | 30]
Merges can cascade upward, potentially shrinking the tree height. PostgreSQL nbtree (nbtsearch.c, nbtinsert.c) implements a "half-dead" mechanism where deletion is lazy: the page is marked half-dead, and actual merge/recycling occurs during subsequent tree operations.
B-Tree vs B+ Tree Tradeoffs
| Property | B-Tree | B+ Tree |
|---|---|---|
| Data location | Internal and leaf nodes | Leaf nodes only |
| Range scan efficiency | Poor (must traverse tree) | Excellent (linked leaf list) |
| Space per node | Higher (data + keys + ptrs) | Lower internal nodes |
| Fan-out | Lower (data reduces fan-out) | Higher (pure keys in internals) |
| Exact key lookup | Potentially faster (1 I/O) | Always O(h) I/Os |
The performance advantage of B+ trees for range scans (which dominate database workloads) and the higher fan-out from key-only internal nodes make B+ trees universally preferred in databases. Pure B-trees appear in some file system directory structures (HFS+, ext4 htree) where range scans are rare.
B+ Tree Page Format (InnoDB)
InnoDB B-tree pages are 16KB by default. The PAGE_HEADER at the start of each page contains:
// From storage/innobase/include/page0page.h
#define PAGE_N_DIR_SLOTS 0 /* number of slots in page directory */
#define PAGE_HEAP_TOP 2 /* pointer to record heap top */
#define PAGE_N_HEAP 4 /* number of records in heap */
#define PAGE_FREE 6 /* pointer to start of page free list */
#define PAGE_GARBAGE 8 /* number of bytes in deleted records */
#define PAGE_LAST_INSERT 10 /* pointer to last inserted record */
#define PAGE_DIRECTION 12 /* last insert direction */
#define PAGE_N_DIRECTION 14 /* number of consecutive inserts in same dir */
#define PAGE_N_RECS 16 /* number of user records */
#define PAGE_MAX_TRX_ID 18 /* highest id of a trx which may have modified */
#define PAGE_LEVEL 26 /* level of the node in an index tree; */
/* the leaf level is 0 */
#define PAGE_INDEX_ID 28 /* index id where the page belongs */
Records on a page are organized in a singly-linked list ordered by key. The page directory (at the tail of the page) is a sparse array of offsets into the record list, enabling binary search within the page without scanning all records.
Clustered vs Secondary Indexes in InnoDB
InnoDB stores data in the B+ tree index itself (clustered index). The leaf nodes of the primary key B+ tree contain the full row data.
Primary Key (Clustered) Index Leaf Node:
+------------------+------------------+
| PK=1 | col1=... | col2=... | ... |
+------------------+------------------+
| PK=2 | col1=... | col2=... | ... |
+------------------+------------------+
Secondary Index Leaf Node:
+----------------------+
| secondary_key=Alice | PK=42 |
+----------------------+
| secondary_key=Bob | PK=17 |
+----------------------+
A secondary index lookup requires two B-tree traversals: first the secondary index to get the PK, then the clustered index to get the row data (unless a covering index is used).
Covering Indexes
A covering index contains all columns needed to satisfy a query, avoiding the second B-tree lookup ("index-only scan"). PostgreSQL calls this a "index-only scan" and checks pg_visibility to determine if heap access can be skipped. InnoDB calls the secondary index's PK columns "included columns."
-- This can use a covering index on (last_name, first_name, salary)
-- because all projected columns are in the index
SELECT first_name, salary FROM employees WHERE last_name = 'Smith';
Buffer Pool Interaction
B-tree operations interact with the buffer pool through page fix/unfix (pin/unpin) operations. In InnoDB:
buf_page_get()fixes the page in the buffer pool, incrementing its pin count.- A latch (read or write) is acquired on the page.
- After the B-tree operation, the latch is released and the page is unfixed.
- The LRU algorithm can only evict unfixed (pin count = 0) pages.
The root page is frequently accessed and quickly bubbles to the "young" end of the LRU list, staying hot in the buffer pool. This is why the effective number of I/Os for a lookup is often 1-2, not the theoretical 3-4.
PostgreSQL uses ReadBuffer() / ReleaseBuffer() and the LockBuffer() mechanism for the equivalent functionality, implemented in src/backend/storage/buffer/bufmgr.c.
PostgreSQL nbtree Internals
PostgreSQL's nbtree (src/backend/access/nbtree/) uses a "concurrent B-tree" algorithm based on Lehman & Yao (1981) with modifications by Lanin & Shasha (1986) and Rao & Ross (2000). Key properties:
- High keys: Each non-rightmost page has a "high key" record — the maximum key value that could appear on the page. This enables concurrent traversal without holding parent latches.
- Page split without parent latch: When a page splits, the right sibling is linked first, then the parent is updated. Concurrent readers follow sibling pointers if they land on a page that has been split.
- Vacuum and page recycling: Deleted pages are placed on a free list (
FSM— Free Space Map). PostgreSQL tracks the oldest XID that could have accessed a deleted page before recycling it.
Historical Context
Rudolf Bayer and Edward McCreight invented the B-tree at Boeing Research Labs in 1972 while working on large database indexing. The paper "Organization and Maintenance of Large Ordered Indexes" appeared in Acta Informatica. The name "B-tree" is never explained in the paper — speculated meanings include "Bayer tree," "Boeing tree," "balanced tree," or "broad tree."
The B+ tree variant was not formally distinguished from the B-tree in Bayer & McCreight's original paper. The distinction was clarified and the "B+ tree" name popularized by Douglas Comer's survey paper "The Ubiquitous B-Tree" (ACM Computing Surveys, 1979), which remains an excellent introduction.
Jim Gray and Andreas Reuter's Transaction Processing (1992) formalized many of the concurrent B-tree access methods used in production systems.
Production Examples
- PostgreSQL nbtree:
src/backend/access/nbtree/nbtree.c. Every index type in PostgreSQL except GiST, GIN, BRIN, Hash, and SPGIST uses this implementation._bt_search()is the descent function;_bt_insertonpg()handles splits. - InnoDB:
storage/innobase/btr/btr0btr.ccandbtr0cur.cc.btr_root_get()fetches the root page;btr_page_split_and_insert()handles splits. - SQLite:
src/btree.c.sqlite3BtreeMovetoUnpacked()is the search function;insertCell()andbalance()handle insertion and rebalancing. - LevelDB/RocksDB: Use B+ trees only for SSTable internal block indexes; the primary data structure is an LSM tree. But
table/block.ccimplements a restart-point-based binary search that resembles a flat B-tree page.
Debugging Notes
- PostgreSQL index bloat: Dead tuples leave behind index entries that are not immediately reclaimed.
pgstattupleextension:SELECT * FROM pgstattuple('my_index');shows leaf fragmentation. - InnoDB B-tree statistics:
SELECT * FROM information_schema.INNODB_INDEXES;showsPAGE_NOof root pages.innodb_print_all_deadlockslogs all deadlock information. - Visualizing a B+ tree page: In PostgreSQL,
pageinspectextension:SELECT * FROM bt_page_stats('my_index', 1);andSELECT * FROM bt_page_items('my_index', 1);dump page contents. - Index size:
SELECT pg_size_pretty(pg_relation_size('my_index'));in PostgreSQL;SHOW TABLE STATUSin MySQL for approximate index size. - Corruption detection: PostgreSQL's
amcheckextension (SELECT bt_index_check('my_index', true)) performs a consistency check on the entire B-tree structure.
Security Implications
- Index inference attacks: Timing side channels in B-tree search can leak information about key distribution. Uniform query latency is not guaranteed — a lookup for a key near the middle of a page may be slightly faster than one near the end.
- B-tree page pinning: An attacker with access to
pg_classandpageinspectcan read raw page contents, potentially exposing data from dead tuples not yet vacuumed. - Index-only scan visibility: PostgreSQL's index-only scan relies on the visibility map; if an attacker can corrupt the visibility map (possible via a sufficiently privileged role), they could force heap access to avoid row security policies in edge cases.
Performance Implications
- Fill factor: PostgreSQL's
FILLFACTOR(default 90 for B-tree) leaves 10% of each page empty for future inserts, reducing page splits for append-near-random workloads. For monotonically increasing keys (timestamps, auto-increment IDs),FILLFACTOR=100is fine. - Hot root page: The root page is accessed on every tree traversal. In a highly concurrent OLTP system, latch contention on the root page can be a bottleneck. InnoDB and PostgreSQL both use SX (shared-exclusive) latches that allow concurrent readers with exclusive writes.
- Sequential vs random inserts: Inserting in primary key order (InnoDB auto_increment, PostgreSQL SERIAL) fills pages sequentially with minimal splits. Random UUID inserts cause O(n log n) splits and 50% average page fill, roughly doubling storage requirements.
- Index-only scans: Avoiding the heap fetch (second B-tree traversal for secondary indexes in InnoDB) can provide 2-10x speedup for read-heavy workloads. PostgreSQL's
INCLUDEclause adds non-key columns to the index leaf nodes for this purpose.
Failure Modes
- Index corruption from torn writes: InnoDB's doublewrite buffer protects data pages; index pages are also written through the doublewrite buffer. PostgreSQL uses WAL to redo any partially-written index page.
- Index bloat from dead tuples: Without regular
VACUUM(PostgreSQL), index pages fill with dead pointers, increasing tree height and scan costs. Monitor withpg_stat_user_indexes.idx_scandropping relative toseq_scan. - Split storms: A workload inserting random UUIDs into a large table can trigger continuous page splits, consuming significant I/O and CPU. Mitigate with UUID v7 (time-sorted) or ULID.
- Latch deadlocks: InnoDB's SMO (Structure Modification Operation) latch protocol acquires a global index latch during splits. Under very high write concurrency, this creates a bottleneck that cannot be parallelized.
Modern Usage
B+ trees remain dominant but face increasing competition from cache-oblivious structures (van Emde Boas trees, cache-oblivious B-trees by Bender et al.) and learned indexes. However, B+ trees' predictable performance characteristics, decades of engineering, and well-understood concurrency protocols keep them as the default choice.
Bw-Trees (Levandoski, Lomet & Sengupta, 2013), used in Microsoft SQL Server Hekaton, are a latch-free B-tree variant using a page mapping table (indirection layer) to enable lock-free updates via CAS. This allows extremely high concurrency on modern multi-core CPUs.
PostgreSQL's nbtree was significantly improved in version 14 with deduplication (duplicate key entries in leaf nodes are compressed) and in version 13 with bottom-up index deletion, reducing bloat without requiring VACUUM.
Future Directions
- Learned B-tree indexes: ALEX (Ding et al., 2020) and FITing-Tree (Galakatos et al., 2019) use learned models as approximations of the B-tree search function, achieving better cache efficiency.
- NVM B-trees: Persistent memory (Optane) allows in-place updates of B-tree nodes without WAL, since PM is byte-addressable and persistent. FAST+FAIR (Hwang et al., 2018) is a NVM-optimized B+ tree.
- RDMA B-trees: Disaggregated memory architectures allow B-tree traversal using one-sided RDMA operations (bypassing the remote CPU), as demonstrated in Sherman (Wei et al., 2022) and other RDMA database projects.
Exercises
- Implement a B+ tree in C or Go with order 4, supporting
insert,search, andrange_scan. Verify correctness by inserting 1000 random keys and iterating the leaf chain. - Use PostgreSQL
pageinspectto examine a B+ tree index page. Identify the high key, the live records, and the page-level statistics. Compare a leaf page to an internal page. - Measure the effect of insert order on index size: create two identical tables in PostgreSQL, insert 1M rows with sequential IDs into one and random UUIDs into the other. Compare
pg_relation_sizeof the indexes. - In InnoDB source code (
btr0btr.cc), find thebtr_page_split_and_insertfunction. Trace the SMO (structure modification operation) latch acquisition sequence. - Design a schema for a time-series workload (sensor readings with timestamp PK). Discuss the B+ tree fill factor, page split behavior, and whether a clustered or non-clustered design is preferable.
References
- Bayer, R., & McCreight, E. (1972). Organization and Maintenance of Large Ordered Indexes. Acta Informatica, 1(3), 173–189.
- Comer, D. (1979). The Ubiquitous B-Tree. ACM Computing Surveys, 11(2), 121–137.
- Lehman, P.L., & Yao, S.B. (1981). Efficient Locking for Concurrent Operations on B-Trees. ACM TODS, 6(4), 650–670.
- Graefe, G. (2011). Modern B-Tree Techniques. Foundations and Trends in Databases, 3(4), 203–402.
- Levandoski, J., Lomet, D., & Sengupta, S. (2013). The Bw-Tree: A B-tree for New Hardware Platforms. ICDE 2013.
- PostgreSQL nbtree README:
src/backend/access/nbtree/README - InnoDB source:
storage/innobase/btr/btr0btr.cc