06 — Buffer Pool Management
Technical Overview
The buffer pool is the in-memory cache of disk pages maintained by a database management system. Its purpose is to reduce the number of disk I/O operations by keeping frequently accessed pages in memory. Since disk I/O latency is orders of magnitude higher than memory access latency (even on NVMe SSDs: ~100µs vs ~100ns), the buffer pool is the primary determinant of OLTP database performance for most production workloads.
The buffer pool implementation is one of the most engineering-intensive components in a DBMS. It must handle: concurrent access by hundreds of threads, dirty page tracking and flush ordering (WAL before page), intelligent eviction under memory pressure, and bypass of the OS page cache (which would double-cache pages). InnoDB, PostgreSQL, and SQL Server all implement their own buffer pool from scratch rather than relying on the OS page cache.
Prerequisites
- Understanding of virtual memory, physical memory, and the OS page cache
- Familiarity with WAL and the force-log-at-commit invariant
- Knowledge of B+ tree page structures
- Basic concurrency primitives (mutexes, latches, condition variables)
Core Content
Buffer Pool Structure
A buffer pool consists of:
- Frame array: A large contiguous block of memory divided into fixed-size frames (each frame holds one page). InnoDB default: 16KB frames. PostgreSQL default: 8KB frames.
- Page table: A hash map from (tablespace_id, page_id) → frame_index. Allows O(1) lookup of whether a page is already in the buffer pool.
- Free list: List of frames that currently hold no page (available for allocation).
- LRU list: Ordered list of frames for eviction decisions.
- Flush list: List of dirty frames ordered by oldest_modification_LSN.
Buffer Pool Layout:
Frame Array (physical memory):
+--------+--------+--------+--------+--------+
| Frame 0| Frame 1| Frame 2| Frame 3| Frame 4|
| Page 42| FREE | Page 17| Page 99| FREE |
+--------+--------+--------+--------+--------+
| | |
v v v
Page Table (hash map):
(ts=1, pg=42) -> frame 0 (dirty=true, pageLSN=1050, pinCount=0)
(ts=1, pg=17) -> frame 2 (dirty=false, pageLSN=900, pinCount=1)
(ts=1, pg=99) -> frame 3 (dirty=true, pageLSN=1080, pinCount=0)
Free List: [frame 1, frame 4]
LRU List (MRU to LRU): [frame 2, frame 3, frame 0]
Flush List (oldest modification first): [frame 0 (LSN=1050), frame 3 (LSN=1080)]
Pin/Unpin Mechanism
Before accessing a buffer pool frame, a thread must pin it (increment its pin count). While pinned, the frame cannot be evicted. After access, the thread unpins the frame.
// PostgreSQL pin/unpin: src/backend/storage/buffer/bufmgr.c
Buffer buf = ReadBuffer(rel, blockNum); // pins the buffer
LockBuffer(buf, BUFFER_LOCK_SHARE); // acquires content latch
// ... read page contents ...
LockBuffer(buf, BUFFER_LOCK_UNLOCK); // release content latch
ReleaseBuffer(buf); // unpins the buffer
// InnoDB pin/unpin: storage/innobase/buf/buf0buf.cc
buf_page_t* bpage = buf_page_get(space_id, page_size, page_no, RW_S_LATCH, &mtr);
// ... access page ...
mtr_commit(&mtr); // releases all latches and unpins
In InnoDB, the mtr_t (mini-transaction) abstraction tracks pinned pages and held latches, releasing them atomically on mtr_commit().
Page Fault Path
When a page is requested that is not in the buffer pool:
1. Hash lookup in page table: MISS
2. Acquire a free frame:
a. If free list is non-empty: take a frame from the free list
b. Else: evict a victim frame (using eviction policy)
- If victim is dirty: flush it to disk (must wait for WAL flush first!)
3. Read the requested page from disk into the frame
4. Update page table: (tablespace, page) -> new frame
5. Pin the frame (increment pin count)
6. Return frame to caller
Steps 3-5 require a synchronization mechanism to prevent two threads from loading the same page simultaneously. PostgreSQL uses a per-buffer "IO in progress" flag protected by a lightweight lock. InnoDB uses a per-bucket latch on the buffer pool's page hash table.
Eviction Policies
LRU (Least Recently Used)
Maintains a doubly-linked list ordered by last access time. On access, move page to MRU (Most Recently Used) end. On eviction, choose the page at the LRU end.
Problem: sequential scan pollution. A large table scan brings in thousands of pages, evicting hot OLTP pages from the buffer pool. These scanned pages are accessed once and then useless, but they destroyed the cache.
LRU-K (IBM Research)
Instead of tracking the last access time, track the K-th most recent access time. A page that was accessed 1000 times but hasn't been accessed in 5 minutes has a better K-th access time than a page accessed once 1 second ago.
InnoDB uses LRU-2 (tracks the 2nd most recent access). The buffer pool is conceptually divided into a "young" sublist (recently accessed twice) and an "old" sublist (accessed once or infrequently). New pages enter the "old" sublist. After innodb_old_blocks_time (default 1000ms), if a page in the "old" sublist is accessed again, it is promoted to "young."
InnoDB LRU Structure:
MRU end LRU end
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
| | | | | | | | | | | | | | | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|<--- "Young" sublist (3/8) --->|<- "Old" (5/8)->|
^
midpoint (innodb_old_blocks_pct = 37%)
New pages inserted at midpoint (start of old sublist).
Sequential scan pages stay in old sublist, never promoted to young.
This is equivalent to LRU-2: a page must be accessed twice (once initial load, once access while still in the pool) before it enters the "young" sublist.
Clock Algorithm
The Clock (or "second chance") algorithm is an approximation of LRU that avoids the overhead of maintaining a fully-sorted linked list. Each frame has a "reference bit." On access, set the reference bit. On eviction, scan frames in a circle: - If reference bit is set: clear it and advance (give the page a "second chance") - If reference bit is clear: evict this frame
PostgreSQL uses the Clock-Sweep algorithm (StrategyGetBuffer() in src/backend/storage/buffer/freelist.c). Each buffer has a usage_count (0-5). On access, increment usage_count (up to 5). On eviction scan, decrement usage_count; evict when it reaches 0.
ARC (Adaptive Replacement Cache)
ARC (Megiddo & Modha, IBM Research, 2003) adapts between recency (LRU) and frequency (LFU) based on observed workload. Maintains four lists: T1 (recently used once), T2 (recently used more than once), B1 (ghost entries for recently evicted T1), B2 (ghost entries for recently evicted T2). A parameter p controls the balance between T1 and T2, and is adjusted based on hit/miss patterns in B1 and B2.
PostgreSQL 9+ uses ARC-inspired logic for its shared buffer management. ARC is patented by IBM (expired in 2023 in the US), which historically limited its adoption in open-source projects.
2Q (Two-Queue)
2Q maintains two queues: A1 (recently accessed once) and Am (accessed multiple times). New pages enter A1. Pages accessed again while in A1 are promoted to Am. Pages evicted from Am are demoted back to A1 (second chance). Pages evicted from A1 are discarded.
2Q is used in PostgreSQL's buffer management for relation data (with the Clock-Sweep as the underlying mechanism for the "Am" queue equivalent).
Dirty Page Tracking and Flush Ordering
When a page is modified, it is marked dirty and added to the flush list (InnoDB) or tracked in the dirty page list (PostgreSQL). The flush list is ordered by oldest_modification_LSN — the LSN of the first modification to this page since the last time it was clean.
The critical invariant: before a dirty page can be written to disk, all WAL records with LSN <= pageLSN of that page must be flushed to the WAL. This is the "WAL must precede data" invariant from ARIES.
Before flushing dirty page at frame F:
Assert: WAL.flushedLSN >= page_table[F].pageLSN
If not: flush WAL up to pageLSN first, then write the page
Background Flushing: Page Cleaner Threads
InnoDB's page cleaner thread (page_cleaner.cc) periodically flushes dirty pages to prevent checkpoints from requiring large I/O bursts. It runs as a background thread and uses adaptive flushing to balance I/O across time.
Adaptive flushing (innodb_adaptive_flushing=ON) uses a "fuzzy checkpoint" model: it tries to flush dirty pages at a rate that keeps the redo log from filling up. The flush rate is modulated by how full the redo log is (the "redo log space pressure").
PostgreSQL has a similar bgwriter process that flushes dirty pages before they are urgently needed for eviction. bgwriter_delay (default 200ms) controls the sleep interval; bgwriter_lru_maxpages limits the number of pages written per cycle.
Buffer Pool Sizing
The buffer pool should be sized to hold the "working set" — the set of pages accessed frequently enough that evicting them would cause cache misses. Rules of thumb:
- InnoDB:
innodb_buffer_pool_size= 70-80% of available RAM for a dedicated MySQL server. - PostgreSQL:
shared_buffers= 25% of RAM (intentionally conservative, relying on OS page cache for some caching). Effective cache size (effective_cache_size) should be set to 50-75% of RAM to inform the query planner. - SQL Server:
max server memory= total RAM - OS overhead (typically leave 10-20% for OS).
Multiple Buffer Pools (InnoDB)
InnoDB supports multiple buffer pool instances (innodb_buffer_pool_instances, default = 8 when pool > 1GB). Each instance has its own LRU list, free list, flush list, and mutex, reducing contention on the buffer pool mutex for high-concurrency workloads.
8 Buffer Pool Instances:
Pool 0: LRU, free list, flush list (mutex 0)
Pool 1: LRU, free list, flush list (mutex 1)
...
Pool 7: LRU, free list, flush list (mutex 7)
Page routing: pool_instance = (space_id * fold + page_no) % num_instances
Buffer Pool Warmup
After a restart, the buffer pool is cold — all requests miss the cache. Buffer pool dump/restore (innodb_buffer_pool_dump_at_shutdown=ON, innodb_buffer_pool_load_at_startup=ON) saves the list of pages in the buffer pool at shutdown and re-warms it at startup by reading those pages back in.
PostgreSQL does not have a built-in warmup mechanism. Third-party extensions like pg_prewarm (SELECT pg_prewarm('my_table')) read all pages of a relation into the shared buffer cache.
Prefetching
Read-ahead (InnoDB): When a sequential scan pattern is detected (pages accessed in order), InnoDB pre-fetches the next several pages asynchronously. innodb_read_ahead_threshold controls when linear read-ahead triggers.
Prefetch (PostgreSQL): PostgreSQL's sequential scan (seq_scan) uses asynchronous prefetch via PrefetchBuffer() which calls posix_fadvise(POSIX_FADV_WILLNEED) to hint the OS page cache. PostgreSQL 15 introduced explicit prefetch for index scans via the io_combine_limit and asynchronous I/O infrastructure.
OS Page Cache vs Database Buffer Pool
Why do databases implement their own buffer pool instead of relying on the OS page cache?
- Double caching: Without
O_DIRECT, disk pages are cached both in the database buffer pool and in the OS page cache, wasting 2x RAM. - Eviction control: The OS page cache uses a generic LRU algorithm that doesn't understand database access patterns (LRU-K, 2Q, ARC are database-specific optimizations).
- Flush ordering: The WAL-before-data invariant requires the database to control the order in which pages are written to disk. The OS page cache's
writebackmechanism does not guarantee this order. - Direct I/O: Using
O_DIRECT(Linux) orF_NOCACHE(macOS) bypasses the OS page cache entirely, allowing the database to manage all caching decisions.
InnoDB uses O_DIRECT by default on Linux (innodb_flush_method=O_DIRECT). PostgreSQL historically relied on the OS page cache (using read()/write() without O_DIRECT) but PostgreSQL 16 introduced async I/O groundwork, and future versions may add O_DIRECT support.
Historical Context
Early DBMS (IBM System R, 1970s) implemented their own buffer managers precisely because the OS virtual memory subsystem was not suitable for database workloads. The "80-20 rule" (80% of requests hit 20% of data) motivated LRU as the natural eviction policy.
The Clock algorithm was proposed for OS page replacement in the 1960s and adopted by early DBMS. LRU-K (O'Neil, O'Neil & Weikum, 1993) was specifically motivated by the sequential scan pollution problem in OLTP/analytical mixed workloads. ARC (Megiddo & Modha, 2003) was developed at IBM Research Almaden and represents the state of the art in single-server buffer pool management.
Production Examples
InnoDB Buffer Pool: buf_pool_t struct in storage/innobase/include/buf0buf.h. The LRU list is implemented as buf_pool->LRU (a UT_LIST). Dirty pages are tracked in buf_pool->flush_list. The page hash table is buf_pool->page_hash with BUF_PAGE_HASH_LOCK_COUNT = 256 read-write locks.
PostgreSQL Shared Buffers: src/backend/storage/buffer/bufmgr.c. BufferDescriptors is the array of BufferDesc structs. StrategyShmemSize() calculates the shared memory size. StrategyGetBuffer() implements Clock-Sweep. BufferAlloc() is the main buffer acquisition function.
SQL Server Buffer Pool: SQL Server uses a "steal" policy and a "no-force" policy. The CHECKPOINT command forces all dirty pages to disk. SQL Server's buffer pool is managed by the Buffer Manager component and integrates with Windows' AWE (Address Windowing Extensions) for >4GB RAM on 32-bit Windows.
Debugging Notes
- InnoDB buffer pool stats:
SHOW STATUS LIKE 'Innodb_buffer_pool%';shows hit rate (Innodb_buffer_pool_reads/Innodb_buffer_pool_read_requests). A hit rate below 99% indicates buffer pool pressure. - InnoDB dirty pages:
Innodb_buffer_pool_pages_dirty— if consistently high, page cleaner is falling behind. - PostgreSQL hit rate:
SELECT heap_blks_hit / (heap_blks_hit + heap_blks_read) AS hit_rate FROM pg_statio_user_tables WHERE relname='mytable'; - PostgreSQL bgwriter:
SELECT * FROM pg_stat_bgwriter;—buffers_clean(bgwriter proactive flushing) vsbuffers_backend(backends forced to flush before getting a free buffer, indicating bgwriter is too slow). - Free buffer shortage: In PostgreSQL,
buffers_allocinpg_stat_bgwriterincreasing rapidly andbuffers_backendbeing non-zero indicates the buffer pool is under pressure.
Security Implications
- Buffer pool inspection attacks: An attacker who can run arbitrary SQL on a shared PostgreSQL server can infer what data other users recently accessed by observing buffer pool hit rates for specific tables (
pg_statio_user_tables). This is a side-channel attack. - Data in swap: If the system runs low on RAM and begins swapping, buffer pool frames (containing potentially sensitive data) may be paged to disk swap, bypassing database-level encryption. Configure
vm.swappiness=0and use encrypted swap. - O_DIRECT and kernel bypasses: Using
O_DIRECT(InnoDB) means the kernel's page cache is not populated with database page data. This reduces the attack surface for kernel-level memory inspection of cached pages. - Multiple buffer pool instances: A subtle security consideration — if buffer pool instances are partitioned by tenant in a multi-tenant system, one tenant cannot evict another's hot pages from the cache (different pool instances). This is not typically implemented but is theoretically possible.
Performance Implications
- Buffer pool hit rate target: 99%+ for OLTP workloads. A 1% miss rate on a 10K QPS workload means 100 disk reads/second — acceptable on NVMe, catastrophic on HDD.
- Clock-Sweep usage_count: PostgreSQL's usage_count (max 5) means a hot page must be scanned past 5 times before eviction. This provides better protection for hot pages than pure LRU.
- Buffer pool mutex contention: With
innodb_buffer_pool_instances=1, the single buffer pool mutex is a scalability bottleneck. Increase instances to 8 or 16 for high-concurrency workloads. - Page cleaner I/O: InnoDB's
innodb_io_capacityandinnodb_io_capacity_maxparameters control how aggressively the page cleaner flushes dirty pages. On fast NVMe storage, increase these toinnodb_io_capacity=4000andinnodb_io_capacity_max=8000. - THP (Transparent Huge Pages): Huge pages reduce TLB misses for the buffer pool's large contiguous allocation. On Linux, enable
transparent_hugepages=alwaysfor buffer pools > 1GB, or explicitly allocate HugeTLB pages (innodb_use_native_aio=ON+large_pages=ON).
Failure Modes
- OOM kill of database process: The buffer pool is the largest consumer of memory in a database process. If the kernel OOM-kills the process, the buffer pool is lost. PostgreSQL will recover via WAL replay; InnoDB via redo log replay. Mitigation: set memory limits appropriately, use cgroups, configure
vm.overcommit_memory=2. - Buffer pool corruption: A hardware bit flip or kernel bug can corrupt buffer pool pages. InnoDB's page checksums (
innodb_checksum_algorithm=crc32) detect this before writing; PostgreSQL'sfull_page_writesanddata_checksumsprovide similar protection. - Buffer pool thrashing: When working set > buffer pool size, every page access is a miss. Performance degrades to disk I/O speed. Mitigation: increase buffer pool, add RAM, shard the database, or partition hot data to a separate instance.
- Clock-Sweep starvation: Under heavy concurrent load, the Clock-Sweep pointer may race around the buffer pool without finding free frames (all frames pinned by concurrent threads). PostgreSQL mitigates this with
BufferPincounting and an eviction waiter mechanism.
Modern Usage
The buffer pool model is being re-examined as storage latency decreases. At NVMe latencies (~100µs), the benefit of a multi-GB buffer pool over OS page cache is reduced (but still significant). Storage Class Memory (SCM / Optane) at ~1µs latency makes the buffer pool/disk boundary nearly invisible for small databases.
Neon (serverless PostgreSQL): The compute node has a local in-memory LRU cache of recently fetched pages, but the "true" storage is the remote Pageserver. The buffer pool is still needed as a working memory, but eviction pushes to a network call rather than a disk write.
SAP HANA: An in-memory DBMS where the "buffer pool" is the entire working set, kept in DRAM. HANA uses a separate "column store" layout in memory and lazily loads column segments. The eviction policy is size-aware column segment eviction, not page-level LRU.
Future Directions
- Machine learning-guided eviction: Netflix's EVA (2020), CMU's BPM-ML (2021), and others train online ML models to predict page access patterns and improve eviction decisions beyond what LRU-K can achieve.
- NUMA-aware buffer pools: On multi-socket servers, the buffer pool should preferentially use local NUMA memory. InnoDB supports NUMA awareness via
innodb_numa_interleave. PostgreSQL lacks native NUMA awareness butnumactl --interleave=allhelps. - Persistent memory buffer pools: Storing the buffer pool in Optane (byte-addressable persistent memory) eliminates buffer pool warmup time after restarts, while retaining the management benefits of an explicit buffer pool.
Exercises
- Implement a buffer pool in Python: fixed-size frame array, hash-based page table, Clock-Sweep eviction, pin/unpin mechanism. Verify that pinned frames are never evicted.
- Demonstrate sequential scan pollution in PostgreSQL: run
SELECT COUNT(*) FROM large_table(forces full scan) and observe buffer pool hit rate drop usingpg_statio_user_tables. Then check if small lookup table hit rates recovered. - Measure InnoDB buffer pool performance: using Sysbench
oltp_read_write, run withinnodb_buffer_pool_sizeset to 10%, 50%, and 90% of your dataset size. Plot QPS vs hit rate. - Trace a
ReadBuffer()call in PostgreSQL source code: frombufmgr.c:ReadBuffer()throughBufferAlloc()→StrategyGetBuffer()→StartBufferIO()→ actual page read from disk. Identify where the I/O wait occurs. - Configure InnoDB buffer pool dump/restore: set
innodb_buffer_pool_dump_at_shutdown=ON, run a workload, shut down MySQL, restart, and measure how long buffer pool warmup takes vs a cold start.
References
- O'Neil, E. J., O'Neil, P. E., & Weikum, G. (1993). The LRU-K Page Replacement Algorithm for Database Disk Buffering. SIGMOD 1993.
- Megiddo, N., & Modha, D. S. (2003). ARC: A Self-Tuning, Low Overhead Replacement Cache. FAST 2003.
- Johnson, T., & Shasha, D. (1994). 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. VLDB 1994.
- Graefe, G. (2011). Modern B-Tree Techniques. Foundations and Trends in Databases, 3(4). (Chapter on buffer pool interaction)
- Effelsberg, W., & Haerder, T. (1984). Principles of Database Buffer Management. ACM TODS, 9(4), 560–595.
- InnoDB Buffer Pool documentation: https://dev.mysql.com/doc/refman/8.0/en/innodb-buffer-pool.html
- PostgreSQL Buffer Manager README:
src/backend/storage/buffer/README