04 — Multi-Version Concurrency Control (MVCC)
Technical Overview
Multi-Version Concurrency Control (MVCC) is a concurrency scheme in which a database maintains multiple physical versions of a data item simultaneously, allowing readers to access a consistent snapshot of the database without blocking writers and vice versa. This is the critical property that makes modern OLTP databases tractable at scale: a long-running analytical query does not block concurrent updates, and concurrent updates do not block each other's reads.
MVCC is the foundation of PostgreSQL, MySQL InnoDB, Oracle, SQL Server (snapshot isolation mode), and CockroachDB. The variations among these implementations differ primarily in where versions are stored (heap vs undo log), how visibility is determined (transaction IDs vs timestamps), and how obsolete versions are reclaimed (VACUUM vs purge threads).
Prerequisites
- Understanding of transaction isolation levels (READ COMMITTED, REPEATABLE READ, SERIALIZABLE)
- Familiarity with transaction IDs and the concepts of commit and abort
- Basic understanding of B+ trees and row storage
- Knowledge of the anomalies MVCC is designed to prevent (dirty reads, non-repeatable reads)
Core Content
Why MVCC: The Reader-Writer Problem
Before MVCC, databases used lock-based concurrency. A reader needed a shared lock; a writer needed an exclusive lock. Readers and writers blocked each other. Under high concurrency, this created severe lock contention.
Lock-based (without MVCC):
T1: BEGIN; SELECT * FROM orders WHERE id=1; -- acquires shared lock
T2: BEGIN; UPDATE orders SET total=100 WHERE id=1; -- BLOCKED on T1's shared lock
T1: ... long computation ...
T1: COMMIT; -- releases shared lock
T2: (now can proceed)
MVCC:
T1: BEGIN; SELECT * FROM orders WHERE id=1; -- reads version at T1's snapshot time
T2: BEGIN; UPDATE orders SET total=100 WHERE id=1; -- creates NEW version, not blocked
Both proceed concurrently.
T1 sees old version; T2 creates new version.
MVCC's core guarantee: readers never block writers, writers never block readers.
Version Chain Structure
When a row is updated under MVCC, the old version is not immediately overwritten. Instead, the new version is created and linked to the old version via a version chain.
Version Chain for row (id=42):
[CURRENT]
Heap/Data: [id=42, val="C", xmin=300, xmax=INF]
|
(version chain pointer)
v
Undo Log: [id=42, val="B", xmin=200, xmax=300]
|
v
Undo Log: [id=42, val="A", xmin=100, xmax=200]
xmin = transaction that created this version
xmax = transaction that deleted/superseded this version (INF = "still current")
The implementation differs between databases:
PostgreSQL (heap-based MVCC): All versions live in the heap (the main table storage). Old versions are called "dead tuples." The heap can accumulate many dead tuples over time, requiring VACUUM to reclaim space. This is the "write amplification" cost of PostgreSQL's MVCC — every UPDATE writes a new heap tuple.
InnoDB (undo log-based MVCC): Only the current version lives in the clustered index (B+ tree). Old versions are stored in the undo log (a separate data structure in the system tablespace or undo tablespace). The undo log is a linked list of "undo records" per transaction. This means InnoDB's data pages are not bloated with dead versions, but the undo log must be traversed for old reads.
PostgreSQL MVCC: xmin/xmax Visibility
Every heap tuple in PostgreSQL has four system attributes:
- xmin: XID of the transaction that inserted this tuple version
- xmax: XID of the transaction that deleted (or updated, which is a delete+insert) this tuple
- ctid: Physical location (page, offset) of this tuple
- cmin/cmax: Command IDs for intra-transaction visibility
-- View system attributes
SELECT xmin, xmax, ctid, id, val FROM my_table;
Visibility rule: A tuple version is visible to transaction T if:
1. xmin is committed AND (xmax is 0 OR xmax is not yet committed OR xmax > T.snapshot_xmax)
2. More precisely, PostgreSQL checks xmin against the transaction's snapshot: {xmin_horizon, xmax_horizon, active_xids[]}
/* From src/backend/utils/time/tqual.c (older PostgreSQL) / src/backend/access/heap/heapam_visibility.c */
static bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot, Buffer buffer)
{
HeapTupleHeader tuple = htup->t_data;
if (!HeapTupleHeaderXminCommitted(tuple)) {
if (HeapTupleHeaderXminInvalid(tuple))
return false;
/* Check if xmin is in snapshot's active set */
if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
/* visible if cmin < snapshot->curcid */
else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
return false;
/* ... commit status checks ... */
}
/* Check xmax similarly */
}
PostgreSQL Snapshot: What Does "Consistent Read" Mean?
A PostgreSQL snapshot captures:
- xmin: oldest still-active XID when snapshot was taken (all XIDs below this are either committed or rolled back)
- xmax: next XID to be assigned (all XIDs above this are in the future — not visible)
- xip[]: list of active (in-progress) XIDs at snapshot time
Example:
Active transactions at snapshot time: {102, 105}
xmin = 101 (lowest active)
xmax = 108 (next to assign)
xip = [102, 105]
Visibility for tuple with xmin=104:
- 104 < xmax (108): might be visible
- 104 not in xip ([102, 105]): not active at snapshot time
- 104 > xmin (101): could be active... but it's not in xip
- Therefore: 104 committed before snapshot, VISIBLE.
VACUUM: Dead Tuple Cleanup in PostgreSQL
Because PostgreSQL stores all tuple versions in the heap, dead tuples (versions with xmax set and the deleting transaction committed) accumulate over time. VACUUM reclaims this space.
Before VACUUM:
Heap page:
+------------------------+
| Tuple 1: (live, xmin=100, xmax=0) |
| Tuple 2: (dead, xmin=50, xmax=80) |
| Tuple 3: (dead, xmin=60, xmax=90) |
| Tuple 4: (live, xmin=110, xmax=0) |
+------------------------+
After VACUUM:
+------------------------+
| Tuple 1: (live, xmin=100, xmax=0) |
| FREE SPACE |
| FREE SPACE |
| Tuple 4: (live, xmin=110, xmax=0) |
+------------------------+
VACUUM also cleans up dead index entries and updates the free space map (FSM). VACUUM FULL rewrites the entire table, reclaiming space back to the OS (at the cost of an AccessExclusiveLock).
autovacuum: PostgreSQL's background autovacuum daemon (autovacuum_vacuum_scale_factor=0.2 triggers VACUUM when 20% of rows are dead). Critical for health; if autovacuum falls behind, table and index bloat accumulates, and worse, XID wraparound can occur.
XID Wraparound: PostgreSQL transaction IDs are 32-bit, wrapping at ~2 billion. If a table has not been VACUUMed for ~2 billion transactions, its xmin values appear to be "in the future," making all rows invisible. PostgreSQL monitors wraparound risk with pg_database.datfrozenxid and will aggressively autovacuum (or refuse new transactions) when wraparound is imminent.
InnoDB MVCC: Undo Log and Read View
InnoDB stores only the current version in the clustered index B+ tree. Old versions are stored in the undo log as difference records.
Clustered Index (current version only):
+---------------------------+
| PK=42 | val="C" | trx_id=300 | roll_ptr=0xABCDEF |
+---------------------------+
|
(roll_ptr)
v
Undo Log Segment:
+----------------------------------+
| Undo record: trx_id=200, val="B" |
|
(prev pointer)
v
+----------------------------------+
| Undo record: trx_id=100, val="A" |
roll_ptr (rollback pointer) in each row header points to the undo record that can undo the current write, which in turn points to the previous undo record. This forms the version chain.
InnoDB Read View: At the start of a consistent read, InnoDB creates a ReadView struct:
// storage/innobase/include/read0types.h
struct ReadView {
trx_id_t m_low_limit_id; // Transactions >= this are invisible
trx_id_t m_up_limit_id; // Transactions < this are visible
trx_ids_t m_ids; // Snapshot of active transaction IDs
trx_id_t m_creator_trx_id; // Creator of this ReadView
};
Visibility check: a row version is visible to a ReadView if:
- trx_id < m_up_limit_id: committed before snapshot — visible
- trx_id >= m_low_limit_id: started after snapshot — not visible
- trx_id in m_ids: active at snapshot time — not visible
- Otherwise: visible
CockroachDB MVCC: Timestamp-Based
CockroachDB stores all versions directly in RocksDB, keyed by {user_key}{timestamp} (descending timestamp so the latest version comes first in sorted order). MVCC here is timestamp-based rather than XID-based.
RocksDB keys for MVCC (conceptual):
{foo}{ts=1700000005} -> "version C" <- latest
{foo}{ts=1700000003} -> "version B"
{foo}{ts=1700000001} -> "version A" <- oldest
Read at timestamp T:
Scan from {foo}{ts=T} forward to find the latest version with ts <= T.
CockroachDB uses Hybrid Logical Clocks (HLC) — a combination of physical (wall clock) and logical (Lamport clock) timestamps — to maintain causality across nodes without GPS hardware. This simplifies MVCC semantics: "read as of time T" is directly expressible, enabling true time-travel queries.
MVCC Overhead
Version bloat (PostgreSQL): Each UPDATE creates a new heap tuple. A table with heavy UPDATEs can grow 10x its logical size before autovacuum catches up. Use pgstattuple to measure dead tuple percentage.
Undo log pressure (InnoDB): Long-running transactions hold undo log space for all versions they might need to read. A single long-running query can prevent undo log purge, causing the undo tablespace to grow unboundedly. Monitor SHOW ENGINE INNODB STATUS for History list length — if this exceeds tens of thousands, a long-running transaction is blocking purge.
Read view overhead: Creating a read view requires acquiring the trx_sys mutex (InnoDB) or latch (PostgreSQL procarray) to snapshot active transactions. At very high concurrency (thousands of short transactions per second), this mutex becomes a hot spot.
Write Skew Under Snapshot Isolation
Snapshot Isolation (SI) prevents dirty reads, non-repeatable reads, and phantom reads. However, it allows write skew:
Example: Two doctors on-call rule (at least one doctor must be on-call)
Initial state: Alice on-call=true, Bob on-call=true
T1 (Alice going off-call):
SELECT count(*) FROM doctors WHERE on_call=true; -- returns 2
UPDATE doctors SET on_call=false WHERE name='Alice';
COMMIT;
T2 (Bob going off-call, concurrent with T1):
SELECT count(*) FROM doctors WHERE on_call=true; -- also returns 2 (snapshot!)
UPDATE doctors SET on_call=false WHERE name='Bob';
COMMIT;
Final state: Both doctors off-call. Constraint violated!
Both transactions read a consistent snapshot that showed 2 doctors, but they updated disjoint rows. Neither transaction saw the other's write. This is the write skew anomaly.
Write skew is allowed under PostgreSQL's REPEATABLE READ and InnoDB's REPEATABLE READ (the ANSI SQL default for InnoDB). Prevention requires either SELECT FOR UPDATE (converting to pessimistic locking) or SERIALIZABLE isolation with Serializable Snapshot Isolation (SSI).
Historical Context
Reed (1978) introduced the concept of multi-version timestamps for concurrency control in his MIT dissertation, though the term "MVCC" was not yet in use. The first practical implementations appeared in InterBase (Jim Starkey, late 1970s/early 1980s — later became Firebird) and PostgreSQL (Michael Stonebraker's Postgres project at Berkeley, 1986-1994).
Oracle implemented MVCC via undo log segments (similar to InnoDB's approach) in the early 1980s. Oracle's implementation influenced MySQL/InnoDB designer Heikki Tuuri when he designed InnoDB in the mid-1990s.
PostgreSQL's heap-based MVCC is often criticized for its need for VACUUM, which has no equivalent in Oracle or InnoDB. The Zheap project (2018-present) and OrioleDB (2021-present) explore in-place MVCC for PostgreSQL that would eliminate heap bloat.
Production Examples
PostgreSQL: src/backend/access/heap/heapam_visibility.c contains HeapTupleSatisfiesMVCC(). src/backend/storage/ipc/procarray.c manages the active transaction snapshot array. The pg_stat_activity view shows long-running transactions that may be blocking VACUUM.
InnoDB: storage/innobase/row/row0sel.cc contains row_sel_get_clust_rec_for_mysql() which traverses the version chain for consistent reads. storage/innobase/trx/trx0purge.cc implements the purge thread that deletes undo records no longer needed.
CockroachDB: pkg/storage/mvcc.go contains the core MVCC read/write logic. MVCCScan and MVCCGet are the fundamental read operations. MVCCPut writes a new version. The garbage collection of old versions is triggered by gc.ttl zone configuration.
Debugging Notes
- PostgreSQL dead tuples:
SELECT relname, n_dead_tup, n_live_tup, last_vacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC; - PostgreSQL XID age:
SELECT datname, age(datfrozenxid) FROM pg_database;— values approaching 2 billion need immediate attention. - InnoDB undo log pressure:
SHOW ENGINE INNODB STATUS\G— look forHistory list length. Values > 10000 indicate purge is blocked. - InnoDB read view staleness:
SELECT * FROM information_schema.INNODB_TRX ORDER BY trx_started;— find the oldest transaction holding a read view. - CockroachDB MVCC GC lag:
SHOW ZONE CONFIGURATION FOR TABLE t;— checkgc.ttlseconds.SELECT * FROM crdb_internal.kv_node_status;for per-node MVCC stats.
Security Implications
- Data exposure via version chains: An attacker with direct file system access (but not SQL access) can read old versions from the undo log (InnoDB) or dead heap tuples (PostgreSQL). This is a risk if database files are not encrypted at rest.
- MVCC and row-level security: PostgreSQL's row security policies (RLS) are applied to all visible versions of rows. Since MVCC can expose old versions, ensure RLS policies account for historical data sensitivity.
- Long-running transactions as DoS: An attacker (or errant query) holding a long-running transaction can prevent undo log purge (InnoDB) or VACUUM (PostgreSQL), causing storage exhaustion. Set
idle_in_transaction_session_timeoutin PostgreSQL andinnodb_lock_wait_timeoutin MySQL to mitigate. - Write skew exploitation: Applications relying on check-then-act patterns (check constraint in application, then insert) are vulnerable to write skew under snapshot isolation. Always use database-level constraints, not just application-level checks.
Performance Implications
- Read-heavy workloads: MVCC provides excellent performance for mixed read/write workloads. Pure read-only workloads may prefer lock-free snapshot reads (PostgreSQL's
SET TRANSACTION ISOLATION LEVEL REPEATABLE READavoids row-level locks entirely). - autovacuum tuning: Aggressive autovacuum (
autovacuum_vacuum_cost_delay=0,autovacuum_vacuum_scale_factor=0.01) improves health at the cost of more I/O. For write-heavy tables, useALTER TABLE t SET (autovacuum_vacuum_scale_factor=0.01)for per-table tuning. - InnoDB undo log size:
innodb_max_undo_log_sizelimits undo log growth. After truncation, the freed space is reused. Ensure undo logs are on fast storage. - Index vacuuming: PostgreSQL index entries for dead tuples are not cleaned by lightweight vacuum scans. A full index vacuum pass is required, which can be expensive on large indexes. Monitor
pg_stat_user_indexes.idx_tup_fetchvsseq_scanratios.
Failure Modes
- XID wraparound (PostgreSQL): If a table's
relfrozenxidfalls 2 billion XIDs behind the current XID, all its rows become invisible. PostgreSQL aggressively autovacuums when withinautovacuum_freeze_max_age(default 200M) of wraparound. An emergencyVACUUM FREEZEmay be needed. - Undo log runaway (InnoDB): A long-running transaction prevents undo purge. The undo tablespace grows indefinitely. InnoDB has no configurable limit on undo log growth (unlike PostgreSQL's
temp_file_limit). Monitor and kill long-running transactions. - Heap bloat (PostgreSQL): Heavy UPDATE workloads on a table with delayed autovacuum can cause the heap to grow 10x. A
VACUUM FULL(full table lock) may be needed, or (preferred) schedule regularVACUUMduring off-peak hours. - Snapshot too old (PostgreSQL): If a transaction holds a snapshot for too long while the undo data it needs gets VACUUMed away, PostgreSQL will raise "ERROR: snapshot too old." Mitigated by
old_snapshot_threshold.
Modern Usage
MVCC is universal in production RDBMS. The key modern developments are:
- OrioleDB: A PostgreSQL storage engine with in-place MVCC using undo log, eliminating heap bloat and VACUUM. Uses a checkpoint-based recovery model instead of full ARIES WAL.
- Distributed MVCC: CockroachDB, YugabyteDB, and Google Spanner implement distributed MVCC where versions are replicated across nodes. The "timestamp oracle" or TrueTime API ensures global consistency.
- HTAP MVCC: TiDB uses the same MVCC data in TiKV for both OLTP and OLAP queries, with TiFlash as a columnar replica that maintains its own MVCC state for long-running analytical reads.
Future Directions
- NVM-aware MVCC: With persistent memory, undo log traversal can use byte-addressable PM reads instead of disk I/O. Projects like nvm-MVCC (VLDB 2020) explore in-place update with PM-resident undo.
- Garbage-collection-free MVCC: Academic proposals (Leanstore's version chain design, Umbra's MVCC) aim to eliminate the need for a separate GC pass by using reference counting or epoch-based reclamation.
- MVCC for AI workloads: As databases store model weights and training data, MVCC semantics for "read the model version as of training time T" become important for reproducibility and audit trails.
Exercises
- In PostgreSQL, demonstrate MVCC in action: open two psql sessions, begin a transaction in session 1 that reads a row, update the row in session 2, and commit. In session 1, select the row — it should still show the old value. Use
SELECT xmin, xmax, * FROM tto observe version metadata. - Create a table in PostgreSQL with 1M rows. Run 100K UPDATEs. Measure table size before and after, and after VACUUM. Use
pgstattupleto measure dead tuple counts. - Reproduce the write skew anomaly: using PostgreSQL REPEATABLE READ isolation, implement the doctor on-call example and verify both doctors can simultaneously go off-call. Then switch to SERIALIZABLE and observe the difference.
- In InnoDB, hold a long-running transaction and observe the
History list lengthinSHOW ENGINE INNODB STATUSgrow. Kill the transaction and observe the purge thread catch up. - Implement a simple MVCC in-memory key-value store in Python: keys map to version chains (lists of
(write_ts, value)tuples). Implementread(key, ts)andwrite(key, value, ts)with snapshot isolation semantics.
References
- Reed, D. P. (1978). Naming and Synchronization in a Decentralized Computer System. MIT PhD Dissertation.
- Stonebraker, M., et al. (1987). The Design of the Postgres Storage System. VLDB 1987.
- Bernstein, P. A., & Goodman, N. (1983). Multiversion Concurrency Control — Theory and Algorithms. ACM TODS, 8(4), 465–483.
- Cahill, M. J., Rohm, U., & Fekete, A. D. (2008). Serializable Isolation for Snapshot Databases. SIGMOD 2008.
- Neumann, T., Mühlbauer, T., & Kemper, A. (2015). Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems. SIGMOD 2015.
- PostgreSQL MVCC documentation: https://www.postgresql.org/docs/current/mvcc.html
- InnoDB MVCC: https://dev.mysql.com/doc/refman/8.0/en/innodb-multi-versioning.html