03 - Journaling

Technical Overview

Journaling (also called write-ahead logging in the database context) solves the crash consistency problem in filesystems: how do you guarantee that a filesystem remains consistent after a sudden power loss or kernel panic that interrupts a multi-step write sequence?

Without journaling, a crash during an unlink() operation could leave an orphan data block (the inode was zeroed but the block was not returned to the free block bitmap) or an orphan inode (the directory entry was removed but the inode still has i_nlink > 0). Finding and fixing these inconsistencies requires a full filesystem check (fsck), which on a 10 TB filesystem takes hours.

Journaling solves this by recording intended changes to a journal (write-ahead log) before applying them to the filesystem. If a crash occurs, the journal is replayed on next mount — either rolling forward (completing committed transactions) or doing nothing (uncommitted transactions are simply discarded).

Prerequisites

Filesystem on-disk layout (inodes, block groups, bitmaps)
ext4 architecture basics
Block layer I/O ordering and write barriers
Transaction concepts from database systems

Core Content

The Crash Consistency Problem

Consider an appending write() to a file that allocates one new block. At the block layer, this requires three independent writes:
1. Update the data block (new content)
2. Update the inode (new i_size, new i_mtime, updated block pointer)
3. Update the block bitmap (mark the new block as allocated)

These three writes are not atomic. Any subset can complete before a crash:

Possible crash states after an appending write (8 combinations):

(D=data, I=inode, B=bitmap)  Written? Consistent?  Problem
(D=no,   I=no,   B=no)       No       YES           Nothing happened, file unchanged
(D=yes,  I=no,   B=no)       No       YES           Data landed in a block that is still free; the write is lost but nothing is damaged
(D=no,   I=yes,  B=no)       No       NO            Inode points to a block that holds garbage and is still marked free
(D=no,   I=no,   B=yes)      No       NO            Bitmap marks block used but no inode owns it (space leak)
(D=yes,  I=yes,  B=no)       YES      NO            Data+inode updated, but the bitmap still marks the block free (it can be allocated again)
(D=yes,  I=no,   B=yes)      No       NO            Block allocated and filled, but no inode references it (lost write, space leak)
(D=no,   I=yes,  B=yes)      No       NO            Most dangerous: metadata agrees, so nothing flags that the file now exposes the block's old contents
(D=yes,  I=yes,  B=yes)      YES      YES           All three on disk: fully consistent

Without journaling, a crash leaves the filesystem in one of 8 states, 5 of which are inconsistent.
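
The enumeration above can be reproduced mechanically. The following is a minimal Python sketch, not filesystem code: classify() and crash_states() are hypothetical helper names, and the consistency rule (inode and bitmap must agree, and a referenced block must hold real data) is an assumption chosen to match the Consistent? column.

```python
from itertools import product

def classify(d: bool, i: bool, b: bool) -> tuple[bool, bool]:
    """Return (written, consistent) for one crash state of an appending write.

    written: the new data is reachable through the inode.
    consistent: inode and bitmap agree, and a referenced block holds real data.
    """
    written = d and i
    consistent = (i == b) and (d or not i)
    return written, consistent

def crash_states() -> dict:
    """Enumerate all 2**3 combinations of (data, inode, bitmap) reaching disk."""
    return {state: classify(*state) for state in product([False, True], repeat=3)}

states = crash_states()
inconsistent = [s for s, (_, ok) in states.items() if not ok]
print(len(states), "states,", len(inconsistent), "inconsistent")
```

Only three states pass the check: nothing written, data only, and all three writes on disk.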

fsck: The Pre-Journaling Solution

fsck (filesystem check) was the original solution. It scans the entire filesystem after a crash to detect and repair inconsistencies:

Check superblock (magic number, block size, inode count)
Check block group descriptors
Scan every inode: validate i_mode, i_size, i_blocks
Rebuild the block reference count by scanning all inode block pointers
Compare computed reference count to block bitmap → fix mismatches
Check directory structure (ensure all inodes referenced from at least one directory)
Collect orphan inodes (non-zero i_nlink with no directory entry) into lost+found

fsck complexity: O(disk size), because every inode and block pointer must be visited. For a 1 TB filesystem, fsck can take 10-30 minutes; for a 100 TB filesystem, hours. Paying that cost after every unclean shutdown was operationally unacceptable in data centers, motivating journaling as an alternative.
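
The bitmap cross-check at the heart of that scan can be sketched in a few lines of Python. This is an illustrative model only: check_bitmap() and its dictionary-based "inode table" are hypothetical, and real fsck works on on-disk structures across several passes.

```python
def check_bitmap(inodes: dict[int, list[int]], bitmap: set[int]) -> dict[str, set[int]]:
    """Cross-check block ownership against the allocation bitmap.

    inodes maps inode number -> block numbers it points to.
    bitmap is the set of block numbers marked allocated.
    """
    owners: dict[int, list[int]] = {}
    for ino, blocks in inodes.items():      # visits every block pointer once
        for blk in blocks:
            owners.setdefault(blk, []).append(ino)
    referenced = set(owners)
    return {
        "leaked": bitmap - referenced,      # marked used, owned by no inode
        "unmarked": referenced - bitmap,    # owned, but the bitmap says free
        "shared": {b for b, o in owners.items() if len(o) > 1},  # claimed twice
    }

report = check_bitmap({12: [100, 101], 13: [101, 102]}, bitmap={100, 101, 103})
print(report)
```

The loop touches every block pointer of every inode, which is why the cost grows with the size of the filesystem rather than with the amount of recent activity.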

Journaling Concept: Write-Ahead Log

Journaling Model:

Before journaling:
  Filesystem writes → Data blocks + Inode + Bitmap (3 separate writes, non-atomic)
  Crash → inconsistent state → run fsck (hours)

With journaling:
  Transaction begins
  Journal writes → [TxBegin | Data | Inode | Bitmap | TxCommit] (sequential)
  Filesystem writes → Data blocks + Inode + Bitmap (can be reordered by kernel)
  Transaction complete

  Crash during journal write → journal incomplete → replay skips this transaction
  Crash after TxCommit, before FS writes → replay journal → FS writes reapplied
  Crash after FS writes → replay redoes the same writes (harmless) or, once checkpointed, finds nothing to do

Recovery is O(journal size) — typically 128 MB. Takes seconds, not hours.

The journal is a circular log written to a dedicated area of the filesystem (or a separate device). Transactions are committed atomically by writing a TxCommit block with a checksum. Recovery only replays complete (committed) transactions.
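
A toy write-ahead log makes the commit rule concrete. The sketch below is an assumption-laden model, not jbd2's on-disk format: the record dictionaries, commit_record(), log_transaction() and recover() are all hypothetical, and zlib.crc32 (plain CRC32) stands in for the CRC32c that ext4 uses.

```python
import zlib

def commit_record(txid: int, blocks: dict[int, bytes]) -> dict:
    """Build a commit record whose checksum covers every logged block."""
    crc = 0
    for blkno in sorted(blocks):
        crc = zlib.crc32(blkno.to_bytes(8, "little") + blocks[blkno], crc)
    return {"type": "commit", "txid": txid, "crc": crc}

def log_transaction(journal: list, txid: int, blocks: dict[int, bytes]) -> None:
    """Append begin, block copies, then commit, in that order."""
    journal.append({"type": "begin", "txid": txid})
    for blkno, data in blocks.items():
        journal.append({"type": "block", "txid": txid, "blkno": blkno, "data": data})
    journal.append(commit_record(txid, blocks))

def recover(journal: list, disk: dict[int, bytes]) -> list[int]:
    """Replay only transactions with a valid commit record; return their ids."""
    replayed, pending = [], {}
    for rec in journal:
        if rec["type"] == "begin":
            pending = {}
        elif rec["type"] == "block":
            pending[rec["blkno"]] = rec["data"]
        elif rec["type"] == "commit":
            if rec == commit_record(rec["txid"], pending):
                disk.update(pending)        # redo is idempotent
                replayed.append(rec["txid"])
            pending = {}
    return replayed   # a trailing transaction without a commit record is ignored

journal, disk = [], {7: b"old"}
log_transaction(journal, 1, {7: b"new-inode", 9: b"new-bitmap"})
log_transaction(journal, 2, {7: b"newer-inode"})
journal.pop()  # simulate a crash before transaction 2's commit record hit disk
print(recover(journal, disk), disk)
```

Transaction 1 is rolled forward and transaction 2 is discarded, and running recover() a second time changes nothing.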

Journal Modes in ext4

ext4 supports three journaling modes controlled by the data= mount option:

data=writeback (fastest, least safe):

Journal: [TxBegin | Inode | Bitmap | TxCommit]  (metadata only)
Data:     written to disk at any time (before OR after metadata commit)

No data blocks in journal — only metadata (inodes, bitmaps, directory entries)
Data and metadata writes can be arbitrarily reordered
After crash+replay: metadata is consistent, but a file's blocks may hold stale content (whatever was in the block before the new data reached disk)
Risk: a file can appear to have been written yet return old data, possibly left over from a deleted file that previously owned the block
Performance: highest (journal is smaller, no data ordering)

data=ordered (default, balanced):

1. Write data blocks to filesystem (D)
2. Write-barrier (flush, ensuring D is persistent)
3. Journal: [TxBegin | Inode | Bitmap | TxCommit]  (metadata only)

Metadata only in journal, but data is guaranteed to be on-disk BEFORE metadata transaction commits
After crash+replay: data blocks are consistent with committed metadata
No old data exposed from partially-overwritten files
Performance: moderate — one extra barrier per transaction (~10% overhead vs writeback)
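
Why the ordering matters can be checked by brute force over crash points. This Python sketch is a simplified model under stated assumptions: each step name is hypothetical, a crash is modelled as "only the first n steps reached stable storage", and WRITEBACK is just one of the reorderings that data=writeback permits.

```python
ORDERED = ["data", "flush", "journal_metadata", "journal_commit"]
WRITEBACK = ["journal_metadata", "journal_commit", "data"]  # one legal reordering

def exposes_stale_data(steps: list[str], crash_after: int) -> bool:
    """True if, after a crash, committed metadata references unwritten data.

    crash_after is the number of steps that reached stable storage.
    """
    done = set(steps[:crash_after])
    return "journal_commit" in done and "data" not in done

def unsafe_crash_points(steps: list[str]) -> list[int]:
    """Crash points at which a committed inode points at stale block contents."""
    return [n for n in range(len(steps) + 1) if exposes_stale_data(steps, n)]

print("ordered:", unsafe_crash_points(ORDERED))
print("writeback:", unsafe_crash_points(WRITEBACK))
```

In the ordered sequence no crash point leaves a committed inode pointing at unwritten data; in the writeback reordering, a crash right after the commit does.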

data=journal (safest, slowest):

Journal: [TxBegin | Data | Inode | Bitmap | TxCommit]  (everything)
No direct filesystem writes until after journal commit

Both data and metadata written to journal first, then to filesystem
Effectively writes every byte twice (journal + filesystem)
After crash+replay: complete data+metadata integrity
Performance: roughly 2x I/O for writes; worth it only when data integrity matters more than throughput
Rarely used (databases keep their own WAL, so journaling their data a second time in the filesystem adds cost without adding safety)
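
The "every byte twice" cost can be put into numbers. The following Python sketch is back-of-the-envelope arithmetic only: bytes_written() is a hypothetical helper, journal framing (descriptor and commit blocks) is ignored, and the 100 MiB / 1 MiB split between data and metadata is an arbitrary example.

```python
def bytes_written(mode: str, data_bytes: int, metadata_bytes: int) -> int:
    """Total bytes sent to disk for one transaction, ignoring journal framing.

    Metadata is always written twice (journal copy + final location).
    Data is written twice only in data=journal mode.
    """
    if mode not in ("writeback", "ordered", "journal"):
        raise ValueError(f"unknown data= mode: {mode}")
    data_copies = 2 if mode == "journal" else 1
    return data_copies * data_bytes + 2 * metadata_bytes

MiB = 1024 * 1024
for mode in ("writeback", "ordered", "journal"):
    print(mode, bytes_written(mode, 100 * MiB, 1 * MiB) // MiB, "MiB")
```

By this measure writeback and ordered move the same number of bytes; the extra cost of ordered mode comes from the ordering constraint, not from additional writes.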

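# Check the default journal mode stored in the superblock
tune2fs -l /dev/sda1 | grep "Default mount options"
# "Default mount options: user_xattr acl"  (no journal_data* entry means data=ordered)

# Check the mode of a mounted filesystem
grep data= /proc/fs/ext4/sda1/options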

# Set journal mode at mount
mount -o data=ordered /dev/sda1 /mnt
mount -o data=writeback /dev/sda1 /mnt  # dangerous for general use

# View journal parameters
tune2fs -l /dev/sda1 | grep -i journal
# Journal UUID: ...
# Journal size: 134217728 (128 MB here; the default scales with filesystem size)
# Journal inode: 8 (special inode 8 holds journal data)

ext4 JBD2 (Journaling Block Device 2)

ext4 uses the jbd2 (Journaling Block Device 2) library, a generic journaling layer that can in theory be used by any filesystem (ext3 used jbd; ext4 uses jbd2 with 64-bit block addresses and checksums).

JBD2 Journal Layout:

Journal (circular, fits in inode 8's blocks):
+----------+-------+------+------+--------+-------+------+------+--------+--
| Journal  | TX    | TX   | TX   | TX     | TX    | TX   | TX   | TX     | ...
| Header   | Begin | desc | data | Commit | Begin | desc | data | Commit |
+----------+-------+------+------+--------+-------+------+------+--------+--
           |<------- Transaction 1 ------>|<------- Transaction 2 ------>|

Journal Block Types:
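- JBD2_DESCRIPTOR_BLOCK: lists which filesystem blocks are in this transaction; jbd2 has no separate begin block on disk, so the first descriptor block plays the role of TxBegin
- JBD2_COMMIT_BLOCK: marks transaction as committed (carries a checksum when the journal checksum feature is enabled; CRC32c with metadata_csum)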
- JBD2_REVOKE_BLOCK: list of blocks whose journal copies should be ignored
- JBD2_SUPERBLOCK_V2: journal superblock (head, tail, sequence numbers)

Transaction commit cycle:
1. jbd2_journal_start(): begin a handle (transaction participation)
2. jbd2_journal_get_write_access(): register a buffer for modification
3. Modify the buffer (in memory only)
4. jbd2_journal_dirty_metadata(): mark buffer for journal write
5. jbd2_journal_stop(): release handle; if transaction is complete, trigger commit
6. jbd2 commit thread (jbd2/<device>-8) writes journal and then sets TxCommit
7. Checkpoint: after blocks are written to their final filesystem locations, journal space is freed
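
The lifecycle above (handles, commit, checkpoint) can be modelled with a small state machine. This Python sketch is a toy, not jbd2's API: ToyJournal and its methods are hypothetical names, and the "+ 2" journal blocks per transaction (one descriptor, one commit block) is a simplifying assumption.

```python
class ToyJournal:
    """Toy model of the handle -> commit -> checkpoint lifecycle."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.used = 0                        # journal blocks not yet checkpointed
        self.open_handles = 0
        self.running: set[int] = set()       # blocks dirtied in the running tx
        self.committed: list[set[int]] = []  # committed, awaiting checkpoint

    def start(self) -> None:                 # cf. jbd2_journal_start()
        self.open_handles += 1

    def dirty(self, blkno: int) -> None:     # cf. get_write_access + dirty_metadata
        if self.open_handles == 0:
            raise RuntimeError("no open handle")
        self.running.add(blkno)

    def stop(self) -> None:                  # cf. jbd2_journal_stop()
        self.open_handles -= 1

    def commit(self) -> bool:
        """Commit the running transaction once no handle is still open."""
        if self.open_handles or not self.running:
            return False
        needed = len(self.running) + 2       # + descriptor and commit block
        if self.used + needed > self.capacity:
            raise RuntimeError("journal full: checkpoint first")
        self.used += needed
        self.committed.append(self.running)
        self.running = set()
        return True

    def checkpoint(self) -> int:
        """Oldest tx reached its final location: release its journal space."""
        tx = self.committed.pop(0)
        self.used -= len(tx) + 2
        return self.used

j = ToyJournal(capacity_blocks=8)
j.start(); j.dirty(5); j.dirty(6)
print(j.commit())          # a handle is still open, so nothing is committed
j.stop()
print(j.commit(), j.used)
print(j.checkpoint())
```

The model shows the two waits that matter in practice: a commit waits for open handles to finish, and a full journal waits for a checkpoint to release space.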

# Monitor jbd2 commit thread
ps aux | grep jbd2
# jbd2/sda1-8 or jbd2/dm-0-8 — one per journaled device

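# JBD2 statistics
cat /proc/fs/jbd2/sda1-8/info
# Reports the number of transactions and their maximum size in blocks,
# followed by averages: time spent waiting for, running, locking,
# flushing (ordered-mode data) and logging a transaction, commit time,
# and handles, blocks and logged blocks per transaction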

Journal Transaction Commit Diagram

ext4 write path with ordered mode journaling:

Application write() syscall
         |
         v
  Page cache (dirty pages)
         |
   [periodic or forced commit — every 5s or on fsync()]
         |
         v
  jbd2 commit thread:
    1. Freeze current transaction (new writes go to next transaction)
    2. Write the transaction's dirty data blocks to their final locations and wait (ordered mode)
    3. Write descriptor blocks (list of modified metadata blocks)
    4. Write modified metadata blocks to journal
    5. Flush journal to disk (write barrier)
    6. Write TxCommit block with checksum → journal entry is durable
    7. Allow journaled metadata to be written to its final location (in background)
         |
         v
  kworker/flush writes the journaled metadata buffers to their final locations
         |
         v
  Checkpoint: jbd2 verifies all blocks from this transaction
              are in their final locations → marks journal space free

Journal commit interval: Default 5 seconds (commit=5 mount option). ext4 flushes the journal every 5 seconds even without fsync(). Reducing to commit=1 improves durability at the cost of more frequent (and smaller) journal transactions → more overhead.

Journal Overhead

Typical journaling overhead measurements:
- data=ordered: ~5-15% write overhead vs no journal (mostly from ordering constraint and journal write)
- data=writeback: ~2-5% overhead (journal write only)
- data=journal: ~60-100% overhead (double writes)

The overhead is primarily from:
1. Extra writes (journal + filesystem for metadata)
2. Write barriers (expensive on HDDs, where a cache flush costs ~5 ms; cheap on NVMe, ~20 µs)
3. Serialization (only one transaction can be committing at a time, so the next commit waits for the current one to finish)

Alternatives to Journaling: Copy-on-Write

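CoW filesystems (Btrfs, ZFS) do not rely on a journal for metadata consistency (both keep a small intent log, the Btrfs log tree and the ZFS ZIL, only to speed up synchronous writes). Instead, they never overwrite data in place: every write goes to a new location, and the tree of pointers is updated atomically by updating the root pointer last.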

CoW write vs journaling:

Journaling (ext4 data=ordered):
1. Write data to old location (replacing old data)
2. Update journal with new metadata
3. Flush journal
4. Update metadata in place

CoW (Btrfs):
1. Allocate new blocks for both data and metadata copy
2. Write new data to new location
3. Write new metadata (inode, tree nodes) to new locations
4. Update root pointer atomically (a single superblock write)
→ Old data blocks released to free pool

After crash:
  Journaling: replay journal → at worst, redo metadata writes
  CoW: root pointer still points to last committed state → no recovery needed

CoW advantages over journaling: no journal replay needed, snapshots are free (just snapshot the root pointer), no write hole. Disadvantages: fragmentation over time (random write pattern), higher write amplification (write new metadata for every data write).

Historical Context

The fsck approach dates to early UNIX (1970s). As disk sizes grew, fsck times became impractical. Several research groups independently developed journaling solutions in the 1980s-1990s.

One of the first journaling filesystems for Linux was ext3, merged into the mainline kernel in 2001 (kernel 2.4.15). Ext3 added jbd (journaling block device) on top of ext2. The transition from ext2 to ext3 was trivial (just add a journal with tune2fs -j), a critical factor in its adoption.

ReiserFS (2001) was also an early Linux journaling filesystem, notable for its tree-based directory structure (fast small-file performance). It gained significant market share but development stalled after Reiser's imprisonment in 2008, and the filesystem is now largely abandoned.

XFS (Silicon Graphics, 1993; in mainline Linux since the 2.5 series, with a later backport to 2.4) brought a production-quality journaling filesystem with scalability features. XFS's journal is called the "log" and uses a different architecture from jbd2: it records metadata changes in a circular on-disk log and can optionally place that log on a separate log device for high-performance journal placement.

ext4 (2008, Theodore Ts'o and others) improved on ext3: larger file/filesystem limits, extents (replacing indirect blocks), delayed allocation, journal checksums, and online defragmentation.

Production Examples

MySQL on ext4 ordered mode: InnoDB calls fsync() on its redo log at every transaction commit (when innodb_flush_log_at_trx_commit=1). With data=ordered, fsync() forces the file's data to disk and commits the ext4 journal transaction that covers its metadata. InnoDB's own WAL plus the ext4 journal therefore let a committed transaction survive power loss, at a commit latency on the order of ~1 ms on NVMe.
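
The application side of this contract is a write followed by fsync(). Below is a minimal Python sketch of a durable append on Linux; durable_append() and the file name redo.log are hypothetical, the code is not taken from InnoDB, and it assumes a POSIX system where a directory can be opened and fsync()ed.

```python
import os
import tempfile

def durable_append(path: str, record: bytes) -> None:
    """Append a record and return only after fsync() has completed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)          # forces the data and the inode update to disk
    finally:
        os.close(fd)
    # A newly created file also needs its directory entry persisted.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "redo.log")
    durable_append(log, b"commit 1\n")
    durable_append(log, b"commit 2\n")
    with open(log, "rb") as f:
        print(f.read())
```

Each call pays at least one journal commit and one device cache flush, which is why databases batch several transactions into one fsync() where they can.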

PostgreSQL WAL + filesystem journal double logging: PostgreSQL writes its WAL (write-ahead log) using O_SYNC or fsync(). If the underlying filesystem is ext4 with data=journal, every WAL record is written to disk twice by the filesystem (journal, then final location) on top of PostgreSQL's own logging. This is unnecessary overhead. The PostgreSQL documentation notes that data journaling is not needed because the WAL already protects file contents, so data=writeback or simply data=ordered (default) is sufficient. Disabling barriers (barrier=0) is safe only when the write cache is truly non-volatile. Some deployments use separate ext4 filesystems with data=writeback for WAL volumes on NVMe with PLP (power-loss protection).

ZFS at Twitter: Twitter reportedly migrated parts of its storage from RAID5 + ext4 to ZFS in the mid-2010s because ZFS's CoW-based consistency eliminated journal recovery time during rolling reboots. ext4 journal replay was reportedly adding 30-60 seconds to reboot time on large filesystems with active writes.

Debugging Notes

# Check journal health on ext4
tune2fs -l /dev/sda1 | grep -i journal
# Journal inode: 8
# Journal backup: inode blocks

# Check if journal has errors
dmesg | grep -i "ext4\|jbd2\|journal"
# Look for: "journal commit I/O error", "recovery failed"

# Replay journal manually (on unmounted filesystem)
fsck.ext4 -n /dev/sda1   # -n = no changes, just check
fsck.ext4 /dev/sda1       # replay journal and fix

# Show jbd2 transaction statistics
cat /proc/fs/jbd2/*/info

# Force journal commit (flush dirty metadata)
sync

# Examine journal content (advanced)
debugfs -R "logdump" /dev/sda1      # list the transactions found in the journal
debugfs -R "logdump -a" /dev/sda1   # also dump the contents of descriptor blocks

# Trace journal handle starts issued by a process (commits themselves run in the jbd2 thread)
perf trace --event 'ext4:ext4_journal_start' -p $(pgrep mysqld)

Security Implications

Journal replay attack: If an attacker can write to the journal area on disk, they can inject fake committed transactions that the filesystem will apply on the next mount. ext4 journal checksums (jbd2 with metadata_csum) do not stop this: CRC32c is an unkeyed checksum that detects accidental corruption, and an attacker who can rewrite journal blocks can recompute it. Defending against deliberate tampering requires an authenticated layer below the filesystem, such as dm-integrity with a keyed MAC.
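
Whether a checksum resists deliberate tampering depends on whether it is keyed, which a short experiment shows. This Python sketch is an illustration only: crc_tag() and mac_tag() are hypothetical helpers, zlib.crc32 (CRC32) stands in for CRC32c, the key is made up, and nothing here reflects how jbd2 or dm-integrity lay out their tags.

```python
import hashlib
import hmac
import zlib

def crc_tag(blocks: list[bytes]) -> int:
    """Unkeyed checksum: anyone who can write the journal can recompute it."""
    crc = 0
    for b in blocks:
        crc = zlib.crc32(b, crc)
    return crc

def mac_tag(key: bytes, blocks: list[bytes]) -> bytes:
    """Keyed MAC: recomputing it requires a secret the attacker lacks."""
    return hmac.new(key, b"".join(blocks), hashlib.sha256).digest()

key = b"hypothetical-secret-key"
original = [b"inode 12: uid=1000", b"bitmap: 0b1010"]
stored_crc = crc_tag(original)

# Accidental bit flip: the stored CRC no longer matches, so replay rejects it.
flipped = [b"inode 12: uid=1001", original[1]]
print("bit flip detected:", crc_tag(flipped) != stored_crc)

# Deliberate tampering: the attacker rewrites the block AND its checksum.
forged = [b"inode 12: uid=0", original[1]]
forged_crc = crc_tag(forged)                    # recomputed without any secret
print("forged CRC verifies:", crc_tag(forged) == forged_crc)

# Without the key, the attacker cannot produce a tag the verifier accepts.
guess = mac_tag(b"wrong-key", forged)
print("forged MAC accepted:", hmac.compare_digest(guess, mac_tag(key, forged)))
```

The CRC catches the random flip but verifies the forged transaction, while the keyed tag rejects it.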

Sensitive data in journal: data=journal mode writes all file data to the journal before writing to its final location. If a file contains sensitive data that is later deleted, remnants may persist in the journal until the journal wraps around. For secure deletion, avoid data=journal and overwrite the data blocks explicitly, or encrypt the underlying block device (LUKS/dm-crypt).

Journal device as single point of failure: If the journal device fails (when using an external journal, created with mke2fs -O journal_dev and attached with tune2fs -J device=<journal-device>), the filesystem cannot be mounted normally even if all data is intact. Production systems using external journals should use redundant storage for the journal device.

Performance Implications

Write barriers and NVMe: A write barrier (a cache flush via REQ_PREFLUSH, plus REQ_FUA on the commit block, in the block layer) ensures that all journal blocks are durable before the commit block is written and that the commit block itself reaches stable storage. On HDDs, a write barrier flushes the HDD's volatile write cache (~5 ms). On NVMe with power-loss protection (PLP), the write cache is effectively non-volatile, so a flush completes almost immediately (~20 µs). This is why such drives dramatically improve database commit latency.
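
The flush latencies quoted above translate directly into a ceiling on serial commit rate. This Python sketch is simple arithmetic under one assumption: every commit waits for exactly one cache flush and commits are not batched. The helper name is hypothetical.

```python
def max_serial_commits_per_second(flush_latency_us: int) -> int:
    """Upper bound on back-to-back commits if each one waits for one cache flush."""
    return 1_000_000 // flush_latency_us

for device, latency_us in (("HDD", 5_000), ("NVMe with PLP", 20)):
    print(device, max_serial_commits_per_second(latency_us), "commits/s")
```

Group commit (many transactions sharing one flush) raises the achievable transaction rate above this bound, but the per-flush latency still sets the floor for a single fsync().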

Journal contention: With many concurrent writers, all journal writes must serialize through jbd2. Under heavy write load, jbd2's commit can become a bottleneck. Signs: jbd2 thread at 100% CPU in top, high w_await on the journal device. Solutions: increase journal size (tune2fs -J size=1024), use data=writeback, or consider XFS (better journal scalability).

Commit interval tuning: The default 5-second commit interval means up to 5 seconds of metadata can be lost on power failure (data loss bounded by dirty page writeback settings). For lower risk:

mount -o remount,commit=1 /dev/sda1  # commit every 1 second

For lower overhead (batch more per commit):

mount -o remount,commit=30 /dev/sda1  # commit every 30 seconds (risky)

Failure Modes and Real Incidents

Journal corruption from hardware error: A single bit flip in the journal area (from bad RAM or a buggy drive) can corrupt a journal commit record. Without journal checksums, the recovery code could apply a partially valid transaction, corrupting the filesystem. With metadata_csum enabled (the mke2fs default since e2fsprogs 1.43), the corrupted transaction is detected and discarded, so the filesystem may lose recently committed metadata but remains consistent.

Ext3 data loss bug (2008): A bug in ext3 (jbd1) under specific conditions of data=ordered mode allowed the ordering guarantee to be violated — metadata could be written to the journal before the corresponding data blocks were written to disk. After a crash, the filesystem appeared consistent (metadata committed) but data blocks pointed to uninitialized/old content. Fixed in jbd2 for ext4. This incident reinforced the case for upgrading to ext4.

XFS log recovery hanging (Linux 4.17-4.19): A regression in XFS log recovery code caused mount to hang indefinitely when replaying a log on a filesystem with specific dirty state. Affected production servers that rebooted uncleanly. Fix: kernel 4.20. Mitigation: run xfs_repair before mounting; with a dirty log this requires xfs_repair -L, which zeroes the log and discards whatever metadata updates were still in it.

ext4 barrier=0 disaster: An older recommendation was to disable write barriers (mount -o barrier=0) for performance on hardware RAID with BBU. When BBU batteries died (silently), the system continued operating without barriers but without persistence guarantees. After power failures, filesystems were corrupt because the journal was not actually durable. Modern recommendation: always keep barriers enabled, monitor BBU status separately.

Modern Usage

F2FS (Flash-Friendly Filesystem, Linux 3.8+): Uses a log-structured approach for flash media (SSDs/eMMC in Android). The journal is replaced by a node address table (NAT) and a checkpoint mechanism. Optimized for the write patterns of NAND flash.
NOVA (NOn-Volatile memory Accelerated log-structured file system): Research filesystem for persistent memory (Optane). Uses per-inode logs in persistent memory, so there is no global journal bottleneck. Orders and persists updates with CPU CLWB + SFENCE instructions rather than block-layer barriers.
Ext4 fast commit: Linux 5.10+. Appends a compact "fast commit" record for fsync-intensive workloads (small metadata changes). Reported to reduce fsync overhead by 40-60% for workloads with many small metadata updates (git operations, email servers).

Future Directions

DAX + hardware persistence instructions: For persistent memory (SCM), journaling as a block-level concept is replaced by CPU instructions (CLWB, SFENCE) that ensure cache lines are flushed to persistent media. No journaling overhead — persistence is at instruction granularity.
CoW filesystem maturity: As Btrfs and OpenZFS mature and gain wider adoption, the relevance of journaling-based recovery decreases. CoW effectively makes per-transaction journaling obsolete at the filesystem level.
Atomic write (NVMe): NVMe drives advertise atomic write unit sizes (AWUN, AWUPF): a write no larger than the advertised unit is either completely written or not written at all on power failure. This could allow filesystems to eliminate certain journal writes for metadata updates that fit within the atomic write size.

Exercises

Demonstrate the ordered vs writeback difference. Mount ext4 with data=writeback and data=ordered. For each, write a large file, kill power (use a VM and yank the power), and run fsck. Observe whether the data is consistent.
Measure journal overhead. Benchmark fio --name=journal-test --rw=write --sync=1 --bs=4k --size=1G (simulating database fsync workload; the job name and size are arbitrary) on ext4 with data=ordered, data=writeback, and data=journal. Plot IOPS and P99 latency for each.
Use debugfs to find the journal inode (inode 8) on an ext4 filesystem. Dump the first few journal blocks and identify the descriptor block structure. Find the descriptor block that opens a transaction (TxBegin) and its commit block (TxCommit).
Simulate a journal replay. Create an ext4 filesystem on a loop device, write data, and copy the backing image while the filesystem is still mounted and dirty (simulating a crash). Run fsck.ext4 on the copy, or mount it, and observe that recovery replays the journal without a full scan.
Compare jbd2 and XFS log in terms of commit mechanics. Read fs/jbd2/commit.c and fs/xfs/xfs_log.c. How does each detect incomplete (uncommitted) transactions during recovery?

References

Tweedie, S. "Journaling the Linux ext2fs Filesystem." LinuxExpo 1998.
Ts'o, T. et al. "Ext4 disk layout." Linux kernel documentation: Documentation/filesystems/ext4/
Prabhakaran, V. et al. "IRON File Systems." SOSP 2005. (analysis of fsck and journaling bugs)
Xu, J. and Swanson, S. "NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories." FAST 2016.
Lu, L. et al. "Failure-Atomic Updates of Application Data with an Optimistic Hybrid Logging Scheme." FAST 2013.
Linux kernel source: fs/jbd2/, fs/ext4/, fs/xfs/
ext4 fast commit: https://lwn.net/Articles/842964/
Bovet, D. and Cesati, M. Understanding the Linux Kernel, 3rd ed., Chapter 18 (Ext2/Ext3)