05 — Write-Ahead Logging and Database Recovery
Technical Overview
Write-Ahead Logging (WAL) is the fundamental mechanism by which databases guarantee atomicity and durability without synchronously writing data pages to disk on every transaction commit. The core invariant is: log records describing a change must reach durable storage before the corresponding data page change can be considered committed. On crash, the log provides enough information to reconstruct the exact database state at the moment of failure.
The ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) recovery algorithm, published by Mohan, Haderle, Lindsay, Pirahesh, and Schwarz in 1992, is the theoretical foundation underlying PostgreSQL's WAL recovery, InnoDB's redo/undo log, and most other production RDBMS recovery systems. Understanding WAL and ARIES is prerequisite for understanding streaming replication, point-in-time recovery, and crash safety guarantees.
Prerequisites
- Understanding of buffer pool management (dirty pages, eviction)
- Familiarity with transaction isolation levels and ACID properties
- Basic knowledge of file I/O and
fsyncsemantics - Understanding of B+ tree page structure
Core Content
The Core Invariant: Force-Log-at-Commit
The WAL invariant has two halves:
-
Write-Ahead (Steal policy): Before a dirty page can be written from the buffer pool to disk, all log records for changes to that page must first be written to the log (on disk). This ensures redo is possible if the dirty page is on disk but the transaction later aborts.
-
Force-at-Commit (Force policy): Before a transaction can be considered committed, all its log records must be forced to durable storage. This ensures the commit survives a crash.
Buffer Pool: WAL Log (disk):
+-----------+ +------------------+
| Page 42 | (dirty) | LSN 1001: UPDATE |
| LSN=1005 | | LSN 1002: UPDATE |
+-----------+ | LSN 1003: UPDATE |
| LSN 1004: COMMIT |<-- must be flushed
| LSN 1005: UPDATE | before page 42 written
+------------------+
flushedLSN = 1004
Rule: can evict page 42 only if flushedLSN >= pageLSN(42) == 1005
The "No-Force, Steal" policy (used by InnoDB and PostgreSQL) means: - No-Force: Dirty pages do not need to be flushed on commit (only the log does). This allows batching of many transactions' data writes. - Steal: Dirty pages can be evicted from the buffer pool even before the transaction commits. This avoids pinning all dirty pages in the buffer pool.
Log Sequence Numbers (LSN)
Every WAL record has a monotonically increasing Log Sequence Number (LSN). In PostgreSQL, LSNs are byte offsets into the WAL stream (a 64-bit integer). In InnoDB, LSNs are also byte counts.
Key LSN tracking points:
For each WAL record:
prevLSN: LSN of previous record for the same transaction (undo chain)
pageLSN: LSN of last log record that modified this page (stored in page header)
flushedLSN: Highest LSN flushed to durable storage
Recovery invariants:
If pageLSN > flushedLSN: page modification is not yet logged -- BUG (WAL violation)
If pageLSN <= flushedLSN: safe to evict page
PostgreSQL stores pd_lsn (page LSN) in the page header struct PageHeaderData. The WAL writer tracks LogwrtResult.Flush as the flushedLSN.
InnoDB stores FIL_PAGE_LSN in the 16KB page header and tracks the global log_sys->lsn and log_sys->flushed_to_disk_lsn.
WAL Record Format (PostgreSQL)
PostgreSQL WAL records have a standard header (XLogRecord) followed by resource manager-specific data:
/* src/include/access/xlogrecord.h */
typedef struct XLogRecord {
uint32 xl_tot_len; /* total len of entire record */
TransactionId xl_xid; /* xact id */
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, resource manager specific */
RmgrId xl_rmid; /* resource manager for this record */
pg_crc32c xl_crc; /* CRC for this record */
/* followed by XLogRecordBlockHeader(s) and data */
} XLogRecord;
Resource managers (rmgr) handle specific subsystems: XLOG, Transaction, Storage, CLOG, Database, Tablespace, MultiXact, RelMap, Standby, Heap, Heap2, Btree, Hash, Gin, Gist, Sequence, SPGist, BRIN, CommitTs, ReplicationOrigin, Generic, LogicalMessage. Each has its own record types and redo functions.
Checkpoints
A checkpoint ensures that all dirty pages are flushed to disk up to a certain point, limiting how far back recovery must replay the log. After a checkpoint, log records before the checkpoint's LSN can be recycled.
Sharp Checkpoint (simple): Stop all writes, flush all dirty pages, write a checkpoint record. Safe but causes I/O spikes and unavailability. Used by some simple databases.
Fuzzy Checkpoint (PostgreSQL, InnoDB): Begin writing dirty pages to disk without stopping writes. Record the checkpoint_begin LSN, flush all pages that were dirty at checkpoint_begin, then write the checkpoint_end record. Transactions can continue during the checkpoint. Recovery must replay from checkpoint_begin, not checkpoint_end.
PostgreSQL checkpoint process (src/backend/postmaster/checkpointer.c):
1. RequestCheckpoint() signals the checkpointer process
2. checkpointer calls CreateCheckPoint():
a. Acquire CheckpointLock
b. Record checkpoint_begin LSN (RedoRecPtr)
c. Iterate over all dirty buffer pool pages, flush them via bgwriter
d. fsync the data directory
e. Write checkpoint WAL record
f. Update pg_control (stable file with checkpoint LSN + system state)
ARIES Recovery Algorithm
ARIES (Mohan et al., 1992) defines the three-phase recovery process. It is used (with variations) by PostgreSQL, InnoDB, and DB2.
ARIES Recovery Timeline:
[crash point]
|
v
Last Checkpoint Crash
| |
v v
+---+--+---+--+---+--+---+--+--+
| C | | T1| | T2| | T1| | |
+---+--+---+--+---+--+---+--+--+
^ ^
| |
RedoLSN flushedLSN
(from checkpoint)
Phase 1: ANALYSIS (forward from checkpoint)
- Reconstruct Active Transaction Table (ATT)
- Reconstruct Dirty Page Table (DPT)
- Find RedoLSN = min(recLSN for all pages in DPT)
Phase 2: REDO (forward from RedoLSN)
- Replay ALL log records from RedoLSN
- Even for committed transactions (they may have dirty pages not yet flushed)
- Skip redo if pageLSN >= record LSN (page already has this update)
Phase 3: UNDO (backward through log)
- For each transaction in ATT (that did not commit by crash point)
- Traverse backward via prevLSN chain
- Undo each operation using the CLR (Compensation Log Record)
Compensation Log Records (CLR): During undo, ARIES writes CLRs that describe the undo operation. CLRs have a special undoNextLSN field pointing to the next LSN to undo, allowing the undo pass to skip already-undone operations on repeated crash/recovery cycles.
Active Transaction Table (ATT) and Dirty Page Table (DPT)
During the analysis phase, ARIES reconstructs:
ATT (Active Transaction Table): Maps transaction ID → {lastLSN, status}. Status is COMMITTED, ABORTED, or IN-PROGRESS. Transactions in ATT that are IN-PROGRESS at crash time are "losers" — they must be undone in phase 3.
DPT (Dirty Page Table): Maps page ID → recLSN (Recovery LSN — the LSN of the first log record that dirtied this page after the last checkpoint). During redo, if a page's pageLSN >= the current record's LSN, the redo is skipped (the page already has this update).
ATT at crash:
TxID=100: {lastLSN=1050, status=IN-PROGRESS} <- must UNDO
TxID=101: {lastLSN=1045, status=COMMITTED} <- already committed, skip
TxID=102: {lastLSN=1060, status=IN-PROGRESS} <- must UNDO
DPT at crash:
Page 42: {recLSN=1020} <- first dirty after checkpoint
Page 55: {recLSN=1035}
RedoLSN = min(1020, 1035) = 1020
PostgreSQL WAL Internals
PostgreSQL WAL is organized as 16MB (configurable: wal_segment_size) segment files in $PGDATA/pg_wal/. Files are named with a 24-hex-digit timeline+LSN name (e.g., 000000010000000000000001).
WAL writer: Background process (walwriter) that periodically flushes WAL to disk. Configurable via wal_writer_delay (default 200ms). synchronous_commit=on forces a WAL flush at every COMMIT.
WAL archiver: Copies completed WAL segments to an archive location (archive_command). Used for Point-in-Time Recovery (PITR). The archive provides the base backup + WAL stream to replay to any point in time.
Streaming replication via WAL: PostgreSQL's primary-standby replication works by shipping WAL records to standby servers. The standby's startup process applies WAL records via the same recovery path used for crash recovery. wal_level=replica or wal_level=logical is required.
PostgreSQL WAL flow:
Backend WAL Buffer WAL Files Standby
------- ---------- --------- -------
XLogInsert -> WAL buffer -> (walwriter) -> pg_wal/*.seg -> (walreceiver)
-> (archiver) -> archive/
(restore_command)
InnoDB Redo Log
InnoDB uses a circular redo log buffer that wraps around. The redo log is stored in ib_logfile0, ib_logfile1 (MySQL 5.7 and earlier; MySQL 8.0 uses a single file with auto-extending). The log is written in 512-byte blocks (matching disk sector size for atomicity).
InnoDB Redo Log (circular):
+---+---+---+---+---+---+---+---+
| LSN 1000 .. 1100 | ... | LSN 5000..5100 |
+---+---+---+---+---+---+---+---+
^ ^
| |
oldest dirty current
page's LSN log position
(must not be overwritten)
If the redo log wraps around and would overwrite the oldest-needed LSN, InnoDB triggers a checkpoint to flush dirty pages (called "sharp" checkpoint in InnoDB, though it is actually fuzzy). This is why innodb_log_file_size is critical: too small causes frequent forced checkpoints.
Group commit: InnoDB's trx_commit_in_memory() does not call fsync per transaction. Instead, transactions queue for a group commit: one fsync call covers all recently committed transactions. This dramatically improves throughput under high concurrency. PostgreSQL has similar group commit via synchronous_commit=on with WAL writer batching.
Point-in-Time Recovery (PITR)
PITR uses a base backup + WAL archive to restore a database to any point in time:
PITR Process:
1. Restore base backup to data directory
2. Create recovery.conf (PostgreSQL < 12) or postgresql.conf (>= 12):
restore_command = 'cp /archive/%f %p'
recovery_target_time = '2026-03-15 14:30:00'
3. Start PostgreSQL
4. PostgreSQL enters recovery mode:
- Reads checkpoint from pg_control
- Applies WAL records from archive until recovery_target_time
- Enters read/write mode after recovery
Historical Context
The concept of write-ahead logging predates ARIES. Jim Gray's work at IBM in the 1970s formalized the notion of logging for transaction recovery. The System R project (1974-1979) implemented an early WAL.
ARIES (Mohan et al., 1992, IBM Almaden Research Center) synthesized decades of recovery research into a clean, implementable algorithm. The 1992 paper, "ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging," published in ACM TODS, remains a foundational paper and is required reading for any database implementer. Mohan's work on ARIES also introduced physiological logging (log physical pages, not logical operations) and the CLR concept.
PostgreSQL's WAL was introduced in PostgreSQL 7.1 (2001). Before that, PostgreSQL used shadow page recovery.
Production Examples
PostgreSQL: src/backend/access/transam/xlog.c contains the WAL writing infrastructure. XLogInsert() inserts a WAL record into the WAL buffer. XLogFlush() forces WAL to disk. StartupXLOG() in src/backend/access/transam/xlog.c implements all three ARIES phases.
InnoDB: storage/innobase/log/log0log.cc manages the redo log buffer and flushing. recv_recovery_from_checkpoint_start() in storage/innobase/log/log0recv.cc implements recovery. The mtr_t (mini-transaction) abstraction in InnoDB ensures WAL-before-page atomicity at the sub-transaction level.
SQLite: WAL mode in SQLite uses a write-ahead log file (database.db-wal). Readers read from the database file plus any applicable WAL entries. Writers append to the WAL. A "WAL checkpoint" copies WAL entries back to the main database file. This is actually shadow-paging semantics (readers see a snapshot), not the traditional WAL used for recovery.
Debugging Notes
- PostgreSQL WAL inspection:
pg_waldumptool reads WAL files:pg_waldump -p $PGDATA/pg_wal -s 0/1000000 -e 0/2000000prints all records in that range. - PostgreSQL recovery state:
SELECT pg_is_in_recovery();returns true if in recovery mode.SELECT pg_last_wal_replay_lsn();shows progress. - InnoDB redo log:
SHOW ENGINE INNODB STATUS\GshowsLog sequence number,Log flushed up to, andPages flushed up to— the gap between the first and last indicates how much WAL needs to be replayed on crash. - PostgreSQL checkpoint stats:
SELECT * FROM pg_stat_bgwriter;shows checkpoint frequency, buffers written per checkpoint, and whether checkpoints are happening too frequently (checkpoints_req > checkpoints_timed indicates log space pressure). - WAL lag in replication:
SELECT sent_lsn, write_lsn, flush_lsn, replay_lsn FROM pg_stat_replication;shows how far behind each standby is.
Security Implications
- WAL contains full data changes: The WAL is essentially a copy of all data changes in plaintext. Anyone with filesystem access can read the WAL and reconstruct any data modification. WAL archiving over the network (streaming replication) must use TLS.
- Physical replication leaks schema changes: Physical WAL streaming replication ships the entire WAL, including changes to
pg_authid(passwords), system catalogs, and all tables — the standby has a full copy of everything. - WAL and encryption at rest: Even with tablespace encryption (InnoDB) or data-at-rest encryption, the WAL may contain unencrypted versions of recently written data in older systems. Ensure WAL archiving destinations are also encrypted.
- Recovery mode vulnerabilities: A PostgreSQL server in recovery mode (
pg_is_in_recovery() = true) applies all WAL records it receives from the primary. A man-in-the-middle on the replication stream could inject arbitrary WAL, modifying data. Always usessl=onandssl_cert_filefor replication connections.
Performance Implications
- WAL segment size:
wal_segment_size(PostgreSQL, default 16MB) affects how frequently WAL segments are archived and the granularity of WAL position. Larger segments reduce archiving overhead but increase minimum checkpoint WAL size. - synchronous_commit: Setting
synchronous_commit=offallows the WAL flush to happen asynchronously (up towal_writer_delay, default 200ms, after commit). This risks losing the last ~200ms of commits on crash but provides ~2x throughput improvement for small transactions. - innodb_flush_log_at_trx_commit:
=1(default): fsync WAL on every commit — fully durable=2: write to OS buffer on every commit, fsync every second — lose up to 1s on OS crash=0: write to InnoDB buffer only, flush every second — lose up to 1s on MySQL crash- WAL compression: PostgreSQL 15+ supports
wal_compression=lz4|zstd. Reduces WAL volume (and replication bandwidth) by 50-70% for typical workloads, at slight CPU cost. - Full-page writes: On the first modification of a page after a checkpoint, PostgreSQL writes the entire page (not just the diff) to WAL. This protects against torn writes but increases WAL volume. Disable with
full_page_writes=offonly on storage with guaranteed atomic 8KB writes.
Failure Modes
- WAL corruption: A corrupted WAL record halts recovery. PostgreSQL will refuse to start and print the corrupt record's LSN. Recovery requires restoring from a backup before the corruption.
- Log wrap-around (InnoDB): If writes outpace checkpoints, InnoDB's circular log fills up and InnoDB stalls all writes until a checkpoint completes. Monitor
log_sequence_number - last_checkpoint_atinSHOW ENGINE INNODB STATUS. - Standby replication lag: If a standby falls too far behind the primary, the primary may recycle WAL segments the standby still needs. This causes the standby to drop its replication slot and require a full resync. Set
wal_keep_size(PostgreSQL) to retain WAL for slow standbys. - fsync errors: Linux kernels before 4.13 silently swallowed
fsyncerrors on some file systems. PostgreSQL 12+ re-reads pages afterfsyncto detect silent failures. InnoDB has similar verification. - Half-applied checkpoint: If the system crashes during a checkpoint, some pages may be on disk with newer data than the checkpoint record indicates. ARIES handles this correctly: the redo pass re-applies all records from
RedoLSN, which skips records already reflected in on-disk pages (viapageLSNcheck).
Modern Usage
WAL is the enabling technology for streaming replication, logical replication, change data capture (CDC), and PITR. Modern cloud databases extend WAL in interesting ways:
- Amazon Aurora: The database layer ships only WAL records to the storage layer (a distributed, replicated storage service). The storage nodes apply WAL records directly, eliminating the need to ship data pages between database instances.
- Neon: A serverless PostgreSQL service where the PostgreSQL compute node ships WAL to a remote "Safekeeper" cluster (for durability) and a "Pageserver" (for storage). The compute node has no local disk — it fetches pages on demand from the Pageserver.
- Logical replication: PostgreSQL's logical replication decodes WAL records into logical row changes (INSERT/UPDATE/DELETE) that can be replicated to heterogeneous systems (Kafka, other databases, etc.) via the
pg_logicalplugin andpgoutputprotocol.
Future Directions
- NVM-aware WAL: On Optane DC PMem, WAL records can be written with cache-line granularity without
fsync(usingclwb+sfenceinstructions). Projects like FOEDUS and SpanDB exploit this to reduce WAL latency to microseconds. - Distributed WAL: Multi-primary RDBMS (CockroachDB, YugabyteDB, Google Spanner) require a distributed agreement protocol (Raft or Paxos) for WAL ordering. The Raft log becomes the distributed WAL.
- Log-as-the-system: Aurora's insight that the WAL is the ground truth of the database has spawned a new class of "log-structured databases" where the storage tier is purely a WAL subscriber.
Exercises
- Run PostgreSQL with
fsync=off, crash it mid-workload usingkill -9 $(pg_ctl -D $PGDATA status | grep PID), and restart. Observe that data was lost. Then repeat withfsync=on. Verify data integrity. - Use
pg_waldumpto inspect a WAL segment. Identify at least one heap INSERT record, one heap UPDATE record, and one transaction commit record. Note their resource manager IDs. - Set up PostgreSQL streaming replication. Monitor
pg_stat_replicationto observe WAL shipping lag. Pause the standby and observesent_lsnvsreplay_lsndiverge. - Implement a simplified WAL in Python: a crash-safe key-value store using a sequential log file. Implement
put(key, value)that appends to log before updating an in-memory dict, andrecover()that replays the log on startup. - Measure the impact of
synchronous_commitsettings in PostgreSQL using pgbench: compare TPS forsynchronous_commit=onvsoffon a 10-second benchmark with-c 20clients.
References
- Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H., & Schwarz, P. (1992). ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging. ACM TODS, 17(1), 94–162.
- Gray, J., & Reuter, A. (1992). Transaction Processing: Concepts and Techniques. Morgan Kaufmann.
- Hellerstein, J. M., Stonebraker, M., & Hamilton, J. (2007). Architecture of a Database System. Foundations and Trends in Databases, 1(2).
- Antonopoulos, P., et al. (2019). Socrates: The New SQL Server in the Cloud. SIGMOD 2019. (Covers Aurora-style log shipping.)
- PostgreSQL WAL documentation: https://www.postgresql.org/docs/current/wal-internals.html
- InnoDB Redo Log documentation: https://dev.mysql.com/doc/refman/8.0/en/innodb-redo-log.html