Skip to content

02 — B-Trees and B+ Trees

Technical Overview

The B-tree (Bayer & McCreight, 1972) is the dominant index structure in relational and key-value databases. Unlike binary search trees, B-trees are optimized for block-oriented storage: each node maps to one or more disk pages, and the tree's fan-out is engineered to minimize the number of I/O operations required for a lookup. The B+ tree variant — where all data records are stored in leaf nodes and internal nodes contain only keys for routing — is so universally adopted in databases that "B-tree" in database literature almost always refers to the B+ tree.

PostgreSQL's nbtree module, InnoDB's clustered and secondary indexes, SQLite's Btree module, and virtually every other disk-based DBMS use B+ trees as the primary access method. Understanding their internal mechanics — from node splits and merges to the interaction with the buffer pool — is prerequisite knowledge for any database internals engineer.

Prerequisites

  • Understanding of binary search trees and tree traversal
  • Block I/O concepts: disk pages, random vs sequential access costs
  • Basic C struct layout and memory management
  • Familiarity with buffer pool concepts (cache of disk pages in RAM)

Core Content

B+ Tree Structure

A B+ tree of order m satisfies: - Every internal node has at most m children. - Every internal node (except the root) has at least ⌈m/2⌉ children. - The root has at least 2 children (unless it is a leaf). - All leaves are at the same depth. - All data records are stored at the leaf level; leaves are linked in a doubly-linked list.

Order-4 B+ Tree (max 3 keys per internal node, max 3 records per leaf):

Internal Nodes (keys only):
                 [ 20 | 40 ]
                /     |     \
        [10|15]    [25|30]    [45|50]
        /  |  \    /  |  \    /  |  \
       L0  L1  L2  L3  L4  L5  L6  L7  L8

Leaf Level (data + sibling pointers):
+-------+    +-------+    +-------+    +-------+    +-------+
|1,2,5  |<-->|10,12  |<-->|20,22  |<-->|30,35  |<-->|40,45  |
+-------+    +-------+    +-------+    +-------+    +-------+
   L0            L1            L2           L3            L4

The sibling pointers enable efficient range scans: after finding the start key via tree traversal, follow the leaf chain until the range is exhausted, without returning to internal nodes.

Fan-Out and Height Analysis

For a B+ tree with n data records and order m, the height is O(log_m(n)). With a page size of 16KB and an 8-byte key + 6-byte pointer (typical InnoDB internal node entry), fan-out is:

16384 bytes / 14 bytes per entry ≈ 1170 entries per internal node

For a 1-billion-row table:

height = ceil(log_1170(1,000,000,000)) ≈ ceil(3.05) = 4 levels

Four I/Os to find any record in a billion-row table. In practice, the root and upper levels are pinned in the buffer pool, reducing this to 1-2 I/Os for the lower levels.

Search Algorithm

function search(node, key):
    if node is leaf:
        return binary_search(node.keys, key)
    else:
        i = largest index where node.keys[i] <= key
        return search(node.children[i], key)

The binary search within a node is O(log m), total complexity O(log_m(n) * log m) = O(log n). For practical fan-outs (hundreds to thousands), the constant factor is small.

Insertion with Splits

Insertion always targets a leaf node. If the leaf is full (m-1 keys), it splits:

Before split (order 4, max 3 keys):
Leaf: [10 | 20 | 30]   -- FULL, inserting 25

Step 1: Split leaf into two:
Left:  [10 | 20]
Right: [25 | 30]

Step 2: Push up the split key (25) to parent:
Parent: [... | 25 | ...]
        /           \
  [10|20]           [25|30]

If the parent is also full, the split propagates upward. In the worst case, a single insertion splits O(h) nodes (one per level), where h is the tree height. This is rare in practice — a probabilistic analysis shows that most insertions require no splits at all.

InnoDB uses a "pessimistic insert" path when splits are needed (btr_cur_pessimistic_insert in btr0cur.cc) and an "optimistic insert" path when there is free space (btr_cur_optimistic_insert). The optimistic path acquires only an SX latch on the leaf page; pessimistic must latch the entire path from root to leaf.

Deletion with Merges and Redistribution

Deletion may underflow a leaf (fewer than ⌈m/2⌉ keys). Two options:

Redistribution: If a sibling has more than the minimum, borrow a key:

Before (underflow after delete):
Parent: [... | 30 | ...]
        /              \
 [20]    (underflow)    [30 | 40 | 50]   (has spare)

After redistribution:
Parent: [... | 40 | ...]
        /              \
 [20 | 30]              [40 | 50]

Merge: If neither sibling has a spare key, merge two nodes and pull down the separator from the parent:

Before (both siblings at minimum):
Parent: [... | 30 | ...]
         /           \
  [20]              [30]    <- both at minimum

After merge:
Parent: [...]   (30 removed from parent)
         |
  [20 | 30]

Merges can cascade upward, potentially shrinking the tree height. PostgreSQL nbtree (nbtsearch.c, nbtinsert.c) implements a "half-dead" mechanism where deletion is lazy: the page is marked half-dead, and actual merge/recycling occurs during subsequent tree operations.

B-Tree vs B+ Tree Tradeoffs

Property B-Tree B+ Tree
Data location Internal and leaf nodes Leaf nodes only
Range scan efficiency Poor (must traverse tree) Excellent (linked leaf list)
Space per node Higher (data + keys + ptrs) Lower internal nodes
Fan-out Lower (data reduces fan-out) Higher (pure keys in internals)
Exact key lookup Potentially faster (1 I/O) Always O(h) I/Os

The performance advantage of B+ trees for range scans (which dominate database workloads) and the higher fan-out from key-only internal nodes make B+ trees universally preferred in databases. Pure B-trees appear in some file system directory structures (HFS+, ext4 htree) where range scans are rare.

B+ Tree Page Format (InnoDB)

InnoDB B-tree pages are 16KB by default. The PAGE_HEADER at the start of each page contains:

// From storage/innobase/include/page0page.h
#define PAGE_N_DIR_SLOTS  0   /* number of slots in page directory */
#define PAGE_HEAP_TOP     2   /* pointer to record heap top */
#define PAGE_N_HEAP       4   /* number of records in heap */
#define PAGE_FREE         6   /* pointer to start of page free list */
#define PAGE_GARBAGE      8   /* number of bytes in deleted records */
#define PAGE_LAST_INSERT 10   /* pointer to last inserted record */
#define PAGE_DIRECTION   12   /* last insert direction */
#define PAGE_N_DIRECTION 14   /* number of consecutive inserts in same dir */
#define PAGE_N_RECS      16   /* number of user records */
#define PAGE_MAX_TRX_ID  18   /* highest id of a trx which may have modified */
#define PAGE_LEVEL       26   /* level of the node in an index tree; */
                              /* the leaf level is 0 */
#define PAGE_INDEX_ID    28   /* index id where the page belongs */

Records on a page are organized in a singly-linked list ordered by key. The page directory (at the tail of the page) is a sparse array of offsets into the record list, enabling binary search within the page without scanning all records.

Clustered vs Secondary Indexes in InnoDB

InnoDB stores data in the B+ tree index itself (clustered index). The leaf nodes of the primary key B+ tree contain the full row data.

Primary Key (Clustered) Index Leaf Node:
+------------------+------------------+
| PK=1 | col1=... | col2=... | ...   |
+------------------+------------------+
| PK=2 | col1=... | col2=... | ...   |
+------------------+------------------+

Secondary Index Leaf Node:
+----------------------+
| secondary_key=Alice  |  PK=42  |
+----------------------+
| secondary_key=Bob    |  PK=17  |
+----------------------+

A secondary index lookup requires two B-tree traversals: first the secondary index to get the PK, then the clustered index to get the row data (unless a covering index is used).

Covering Indexes

A covering index contains all columns needed to satisfy a query, avoiding the second B-tree lookup ("index-only scan"). PostgreSQL calls this a "index-only scan" and checks pg_visibility to determine if heap access can be skipped. InnoDB calls the secondary index's PK columns "included columns."

-- This can use a covering index on (last_name, first_name, salary)
-- because all projected columns are in the index
SELECT first_name, salary FROM employees WHERE last_name = 'Smith';

Buffer Pool Interaction

B-tree operations interact with the buffer pool through page fix/unfix (pin/unpin) operations. In InnoDB:

  1. buf_page_get() fixes the page in the buffer pool, incrementing its pin count.
  2. A latch (read or write) is acquired on the page.
  3. After the B-tree operation, the latch is released and the page is unfixed.
  4. The LRU algorithm can only evict unfixed (pin count = 0) pages.

The root page is frequently accessed and quickly bubbles to the "young" end of the LRU list, staying hot in the buffer pool. This is why the effective number of I/Os for a lookup is often 1-2, not the theoretical 3-4.

PostgreSQL uses ReadBuffer() / ReleaseBuffer() and the LockBuffer() mechanism for the equivalent functionality, implemented in src/backend/storage/buffer/bufmgr.c.

PostgreSQL nbtree Internals

PostgreSQL's nbtree (src/backend/access/nbtree/) uses a "concurrent B-tree" algorithm based on Lehman & Yao (1981) with modifications by Lanin & Shasha (1986) and Rao & Ross (2000). Key properties:

  • High keys: Each non-rightmost page has a "high key" record — the maximum key value that could appear on the page. This enables concurrent traversal without holding parent latches.
  • Page split without parent latch: When a page splits, the right sibling is linked first, then the parent is updated. Concurrent readers follow sibling pointers if they land on a page that has been split.
  • Vacuum and page recycling: Deleted pages are placed on a free list (FSM — Free Space Map). PostgreSQL tracks the oldest XID that could have accessed a deleted page before recycling it.

Historical Context

Rudolf Bayer and Edward McCreight invented the B-tree at Boeing Research Labs in 1972 while working on large database indexing. The paper "Organization and Maintenance of Large Ordered Indexes" appeared in Acta Informatica. The name "B-tree" is never explained in the paper — speculated meanings include "Bayer tree," "Boeing tree," "balanced tree," or "broad tree."

The B+ tree variant was not formally distinguished from the B-tree in Bayer & McCreight's original paper. The distinction was clarified and the "B+ tree" name popularized by Douglas Comer's survey paper "The Ubiquitous B-Tree" (ACM Computing Surveys, 1979), which remains an excellent introduction.

Jim Gray and Andreas Reuter's Transaction Processing (1992) formalized many of the concurrent B-tree access methods used in production systems.

Production Examples

  • PostgreSQL nbtree: src/backend/access/nbtree/nbtree.c. Every index type in PostgreSQL except GiST, GIN, BRIN, Hash, and SPGIST uses this implementation. _bt_search() is the descent function; _bt_insertonpg() handles splits.
  • InnoDB: storage/innobase/btr/btr0btr.cc and btr0cur.cc. btr_root_get() fetches the root page; btr_page_split_and_insert() handles splits.
  • SQLite: src/btree.c. sqlite3BtreeMovetoUnpacked() is the search function; insertCell() and balance() handle insertion and rebalancing.
  • LevelDB/RocksDB: Use B+ trees only for SSTable internal block indexes; the primary data structure is an LSM tree. But table/block.cc implements a restart-point-based binary search that resembles a flat B-tree page.

Debugging Notes

  • PostgreSQL index bloat: Dead tuples leave behind index entries that are not immediately reclaimed. pgstattuple extension: SELECT * FROM pgstattuple('my_index'); shows leaf fragmentation.
  • InnoDB B-tree statistics: SELECT * FROM information_schema.INNODB_INDEXES; shows PAGE_NO of root pages. innodb_print_all_deadlocks logs all deadlock information.
  • Visualizing a B+ tree page: In PostgreSQL, pageinspect extension: SELECT * FROM bt_page_stats('my_index', 1); and SELECT * FROM bt_page_items('my_index', 1); dump page contents.
  • Index size: SELECT pg_size_pretty(pg_relation_size('my_index')); in PostgreSQL; SHOW TABLE STATUS in MySQL for approximate index size.
  • Corruption detection: PostgreSQL's amcheck extension (SELECT bt_index_check('my_index', true)) performs a consistency check on the entire B-tree structure.

Security Implications

  • Index inference attacks: Timing side channels in B-tree search can leak information about key distribution. Uniform query latency is not guaranteed — a lookup for a key near the middle of a page may be slightly faster than one near the end.
  • B-tree page pinning: An attacker with access to pg_class and pageinspect can read raw page contents, potentially exposing data from dead tuples not yet vacuumed.
  • Index-only scan visibility: PostgreSQL's index-only scan relies on the visibility map; if an attacker can corrupt the visibility map (possible via a sufficiently privileged role), they could force heap access to avoid row security policies in edge cases.

Performance Implications

  • Fill factor: PostgreSQL's FILLFACTOR (default 90 for B-tree) leaves 10% of each page empty for future inserts, reducing page splits for append-near-random workloads. For monotonically increasing keys (timestamps, auto-increment IDs), FILLFACTOR=100 is fine.
  • Hot root page: The root page is accessed on every tree traversal. In a highly concurrent OLTP system, latch contention on the root page can be a bottleneck. InnoDB and PostgreSQL both use SX (shared-exclusive) latches that allow concurrent readers with exclusive writes.
  • Sequential vs random inserts: Inserting in primary key order (InnoDB auto_increment, PostgreSQL SERIAL) fills pages sequentially with minimal splits. Random UUID inserts cause O(n log n) splits and 50% average page fill, roughly doubling storage requirements.
  • Index-only scans: Avoiding the heap fetch (second B-tree traversal for secondary indexes in InnoDB) can provide 2-10x speedup for read-heavy workloads. PostgreSQL's INCLUDE clause adds non-key columns to the index leaf nodes for this purpose.

Failure Modes

  1. Index corruption from torn writes: InnoDB's doublewrite buffer protects data pages; index pages are also written through the doublewrite buffer. PostgreSQL uses WAL to redo any partially-written index page.
  2. Index bloat from dead tuples: Without regular VACUUM (PostgreSQL), index pages fill with dead pointers, increasing tree height and scan costs. Monitor with pg_stat_user_indexes.idx_scan dropping relative to seq_scan.
  3. Split storms: A workload inserting random UUIDs into a large table can trigger continuous page splits, consuming significant I/O and CPU. Mitigate with UUID v7 (time-sorted) or ULID.
  4. Latch deadlocks: InnoDB's SMO (Structure Modification Operation) latch protocol acquires a global index latch during splits. Under very high write concurrency, this creates a bottleneck that cannot be parallelized.

Modern Usage

B+ trees remain dominant but face increasing competition from cache-oblivious structures (van Emde Boas trees, cache-oblivious B-trees by Bender et al.) and learned indexes. However, B+ trees' predictable performance characteristics, decades of engineering, and well-understood concurrency protocols keep them as the default choice.

Bw-Trees (Levandoski, Lomet & Sengupta, 2013), used in Microsoft SQL Server Hekaton, are a latch-free B-tree variant using a page mapping table (indirection layer) to enable lock-free updates via CAS. This allows extremely high concurrency on modern multi-core CPUs.

PostgreSQL's nbtree was significantly improved in version 14 with deduplication (duplicate key entries in leaf nodes are compressed) and in version 13 with bottom-up index deletion, reducing bloat without requiring VACUUM.

Future Directions

  • Learned B-tree indexes: ALEX (Ding et al., 2020) and FITing-Tree (Galakatos et al., 2019) use learned models as approximations of the B-tree search function, achieving better cache efficiency.
  • NVM B-trees: Persistent memory (Optane) allows in-place updates of B-tree nodes without WAL, since PM is byte-addressable and persistent. FAST+FAIR (Hwang et al., 2018) is a NVM-optimized B+ tree.
  • RDMA B-trees: Disaggregated memory architectures allow B-tree traversal using one-sided RDMA operations (bypassing the remote CPU), as demonstrated in Sherman (Wei et al., 2022) and other RDMA database projects.

Exercises

  1. Implement a B+ tree in C or Go with order 4, supporting insert, search, and range_scan. Verify correctness by inserting 1000 random keys and iterating the leaf chain.
  2. Use PostgreSQL pageinspect to examine a B+ tree index page. Identify the high key, the live records, and the page-level statistics. Compare a leaf page to an internal page.
  3. Measure the effect of insert order on index size: create two identical tables in PostgreSQL, insert 1M rows with sequential IDs into one and random UUIDs into the other. Compare pg_relation_size of the indexes.
  4. In InnoDB source code (btr0btr.cc), find the btr_page_split_and_insert function. Trace the SMO (structure modification operation) latch acquisition sequence.
  5. Design a schema for a time-series workload (sensor readings with timestamp PK). Discuss the B+ tree fill factor, page split behavior, and whether a clustered or non-clustered design is preferable.

References

  • Bayer, R., & McCreight, E. (1972). Organization and Maintenance of Large Ordered Indexes. Acta Informatica, 1(3), 173–189.
  • Comer, D. (1979). The Ubiquitous B-Tree. ACM Computing Surveys, 11(2), 121–137.
  • Lehman, P.L., & Yao, S.B. (1981). Efficient Locking for Concurrent Operations on B-Trees. ACM TODS, 6(4), 650–670.
  • Graefe, G. (2011). Modern B-Tree Techniques. Foundations and Trends in Databases, 3(4), 203–402.
  • Levandoski, J., Lomet, D., & Sengupta, S. (2013). The Bw-Tree: A B-tree for New Hardware Platforms. ICDE 2013.
  • PostgreSQL nbtree README: src/backend/access/nbtree/README
  • InnoDB source: storage/innobase/btr/btr0btr.cc