05 - ext4

Technical Overview

ext4 (fourth extended filesystem) is the default filesystem for many Linux distributions and remains one of the most widely deployed filesystems in the world. It is a mature, production-hardened journaling filesystem built on decades of UNIX filesystem design. ext4's design philosophy emphasizes backward compatibility, reliability, and incremental improvement: an existing ext2/ext3 filesystem can be upgraded in place by enabling the ext4 features with tune2fs and then running e2fsck.

While newer filesystems (Btrfs, ZFS) offer more advanced features, ext4 remains a sound choice for most Linux deployments due to its excellent performance, low CPU overhead, extensive production testing, and fast fsck (uninit_bg and flex_bg let e2fsck skip unused inode tables and read metadata more sequentially).

Prerequisites

Inode and dentry concepts (see 02-inodes-and-dentries.md)
Journaling mechanics (see 03-journaling.md)
Block device concepts
RAID basics (for understanding RAID-optimized mkfs parameters)

Core Content

ext Family Evolution

ext (1992, Rémy Card):
  - First Linux filesystem
  - 2 GB max file size, 2 GB filesystem limit
  - No separate access/modify/inode-change timestamps
  - Replaced almost immediately by ext2

ext2 (1993, Rémy Card):
  - Standard Linux FS for a decade
  - 16 TB max filesystem (with 4K blocks)
  - Multi-block group structure
  - No journaling → fsck required after unclean shutdown
  - Still used on flash media (no journal wear)

ext3 (2001, Stephen Tweedie):
  - ext2 + jbd (journaling block device)
  - Fully backward compatible with ext2
  - Three journal modes: writeback, ordered, data
  - Online filesystem growth
  - dir_index: HTree for large directories
  - Still limited to ~32,000 subdirectories per directory

ext4 (2008, Theodore Ts'o et al.):
  - ext3 + extents + larger limits + delayed allocation + more
  - 1 EB filesystem limit, 16 TB file limit
  - 64,000 subdirectories per directory
  - Journal checksums
  - Online defragmentation
  - Transparent compression (not in mainline, was proposed)

On-Disk Layout

ext4 filesystem layout (3 block groups shown):

+==============+==============+==============+
|  Block       |  Block       |  Block       |
|  Group 0     |  Group 1     |  Group 2     |
+==============+==============+==============+

Block Group 0 detail:
+------+------+------+------+------+------+------+------+------+
|Super |GDT   |Rsvd  |Block |Inode |Inode |Inode |Data  |Data  |
|block |      |GDT   |Bitmap|Bitmap|Table |Table |Block |Block |
|      |      |      |      |      |      |      | ...  | ...  |
+------+------+------+------+------+------+------+------+------+
  1 blk  >=1   varies 1 blk  1 blk  N blks N blks
(Rsvd GDT: blocks reserved so the GDT can grow during online resize, resize_inode feature)

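Superblock: byte offset 1024, inside block 0 of group 0 with 4KB blocks (+ backup copies in groups 1, 3, 5, 7, 9, 25, 27, 49, ...: group 1 and powers of 3, 5, and 7, with sparse_super)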
GDT (Group Descriptor Table): one 32-byte (or 64-byte with 64bit feature) entry per block group
Block Bitmap: 1 bit per block in group (1=used, 0=free), fills exactly one 4KB block
              → max 32,768 blocks per group with 4KB blocks
Inode Bitmap: 1 bit per inode in group
Inode Table: inodes_per_group × inode_size bytes
Data Blocks: all remaining blocks

Block group size: 8 × block_size blocks = 8 × 4096 = 32,768 blocks; 32,768 × 4 KiB = 128 MiB
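The geometry arithmetic above can be checked with a short Python sketch. It assumes the sparse_super placement rule (backup superblocks in groups 0, 1, and powers of 3, 5, and 7); the function names are illustrative and not part of any ext4 tool.

```python
def blocks_per_group(block_size):
    """One bitmap block holds block_size * 8 bits, one bit per block."""
    return block_size * 8


def group_size_bytes(block_size):
    """Size of one block group in bytes."""
    return blocks_per_group(block_size) * block_size


def backup_superblock_groups(group_count):
    """Groups holding a superblock copy under sparse_super (assumed rule)."""
    groups = {0}
    if group_count > 1:
        groups.add(1)
    for base in (3, 5, 7):
        power = base
        while power < group_count:
            groups.add(power)
            power *= base
    return sorted(groups)


if __name__ == "__main__":
    print(blocks_per_group(4096))            # blocks in one group
    print(group_size_bytes(4096) // 2**20)   # group size in MiB
    print(backup_superblock_groups(80))      # 80 groups = 10 GiB filesystem
```

For the 10 GiB example used in this chapter (80 groups), the last call lists every group that carries a superblock copy.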

Flexible Block Groups (flex_bg feature, ext4 default): Each run of flex_bg_size consecutive block groups (default 16) forms a "flex group". The bitmaps and inode tables of all member groups are packed into the first group of the flex group, leaving the other groups almost entirely for data. This improves large-file allocation locality and lets metadata be read sequentially.

Superblock Fields

# View superblock details
dumpe2fs -h /dev/sda1

Key superblock fields (stored at byte offset 1024 in the filesystem):

Block count:              2621440
Free blocks:              1432567
Inode count:              655360
Free inodes:              532145
Block size:               4096
Inode size:               256          ← includes extra nanosecond timestamps
First block:              0
Blocks per group:         32768
Inodes per group:         8192
Journal inode:            8
Default mount options:    user_xattr acl
Features:                 has_journal,ext_attr,resize_inode,dir_index,
                          filetype,needs_recovery,extent,flex_bg,
                          sparse_super,large_file,huge_file,
                          dir_nlink,extra_isize,metadata_csum
Journal backup:           inode blocks
Lifetime writes:          1234 GB
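A few of these fields can be read straight from the raw superblock. The sketch below decodes them with Python's struct module, assuming the standard layout (eleven consecutive little-endian 32-bit fields at the start of the superblock, s_magic at offset 0x38). make_demo_image() is a hypothetical helper that synthesizes an image matching the sample values above, so the example runs without root or a real device.

```python
import struct

SUPERBLOCK_OFFSET = 1024   # primary superblock starts 1024 bytes into the device
EXT4_MAGIC = 0xEF53        # s_magic, at offset 0x38 within the superblock


def parse_superblock(image):
    """Decode a few fixed-offset fields from a raw filesystem image (bytes)."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    # The first eleven fields are consecutive little-endian 32-bit integers.
    (inodes, blocks_lo, _reserved, free_blocks_lo, free_inodes, first_block,
     log_block_size, _log_cluster, blocks_per_group, _clusters_per_group,
     inodes_per_group) = struct.unpack_from("<11I", sb, 0)
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT4_MAGIC:
        raise ValueError("bad magic 0x%04X: not an ext2/3/4 superblock" % magic)
    return {
        "inode_count": inodes,
        "block_count_lo": blocks_lo,          # low 32 bits only
        "free_blocks_lo": free_blocks_lo,
        "free_inodes": free_inodes,
        "first_block": first_block,
        "block_size": 1024 << log_block_size,
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
    }


def make_demo_image():
    """Hypothetical helper: build an in-memory image with the sample values."""
    image = bytearray(4096)
    struct.pack_into("<11I", image, SUPERBLOCK_OFFSET,
                     655360, 2621440, 131072, 1432567, 532145,
                     0, 2, 2, 32768, 32768, 8192)
    struct.pack_into("<H", image, SUPERBLOCK_OFFSET + 0x38, EXT4_MAGIC)
    return bytes(image)


if __name__ == "__main__":
    for name, value in parse_superblock(make_demo_image()).items():
        print(f"{name:18} {value}")
```

Against a real filesystem, the same function could be fed the first 2048 bytes of the block device (read as root); dumpe2fs -h remains the authoritative view.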

ext4 Key Features

Extents (extent feature): The traditional indirect block scheme in ext2/ext3 used up to 15 block pointers per inode (12 direct + 1 single-indirect + 1 double + 1 triple). For large files, this created deep trees of block pointers. ext4 replaces block pointers with extents:

ext4 extent in i_block[]:

struct ext4_extent_header {
    __le16 eh_magic;    // 0xF30A
    __le16 eh_entries;  // number of valid entries
    __le16 eh_max;      // max entries in this node
    __le16 eh_depth;    // depth of tree (0 = leaf)
    __le32 eh_generation;
};

struct ext4_extent {    // leaf node: maps logical blocks to physical
    __le32 ee_block;    // first logical block
    __le16 ee_len;      // number of blocks
    __le16 ee_start_hi; // high 16 bits of physical block
    __le32 ee_start_lo; // low 32 bits of physical block
};

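Up to four extents fit inline in the inode's 60-byte i_block[] area (a 12-byte header plus four 12-byte entries), covering four contiguous physical ranges. Because ee_len is 16 bits, one initialized extent covers at most 32,768 blocks (128 MiB with 4 KiB blocks), so even a perfectly contiguous 1 TB file needs several thousand extents. Once a file needs more than four extents, the tree grows deeper, backed by additional blocks.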
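The two structs above can be decoded with a few lines of Python. This is a minimal sketch that handles only a depth-0 (leaf) root stored inline in i_block; it assumes the on-disk convention that an ee_len value above 32768 marks an unwritten (preallocated) extent. make_demo_i_block() is a hypothetical helper that builds sample bytes.

```python
import struct

EXT4_EXT_MAGIC = 0xF30A
EXT_INIT_MAX_LEN = 1 << 15   # 32768: largest length of an initialized extent


def parse_inline_extents(i_block):
    """Decode the extent header and leaf extents from a 60-byte i_block area."""
    magic, entries, _max, depth, _generation = struct.unpack_from("<HHHHI", i_block, 0)
    if magic != EXT4_EXT_MAGIC:
        raise ValueError("not an extent-mapped inode")
    if depth != 0:
        raise ValueError("root is an index node; leaves live in other blocks")
    extents = []
    for i in range(entries):
        ee_block, ee_len, start_hi, start_lo = struct.unpack_from(
            "<IHHI", i_block, 12 + 12 * i)
        unwritten = ee_len > EXT_INIT_MAX_LEN
        extents.append({
            "logical": ee_block,
            "length": ee_len - EXT_INIT_MAX_LEN if unwritten else ee_len,
            "physical": (start_hi << 32) | start_lo,   # 48-bit block number
            "unwritten": unwritten,
        })
    return extents


def min_extents(file_bytes, block_size=4096):
    """Fewest initialized extents that can map a fully contiguous file."""
    blocks = -(-file_bytes // block_size)       # ceiling division
    return -(-blocks // EXT_INIT_MAX_LEN)


def make_demo_i_block():
    """Hypothetical helper: header plus two extents, padded to 60 bytes."""
    raw = struct.pack("<HHHHI", EXT4_EXT_MAGIC, 2, 4, 0, 0)
    raw += struct.pack("<IHHI", 0, 32768, 0, 34816)            # full-size extent
    raw += struct.pack("<IHHI", 32768, 32768 + 100, 1, 5)      # unwritten, 100 blocks
    return raw.ljust(60, b"\0")


if __name__ == "__main__":
    for extent in parse_inline_extents(make_demo_i_block()):
        print(extent)
    print(min_extents(1 << 40))   # contiguous 1 TiB file
```

On a real filesystem, filefrag -v or debugfs stat shows the same information without manual decoding.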

Delayed Allocation (delalloc mount option, enabled by default): ext4 defers block allocation until data is flushed from the page cache. This allows the allocator to see the full extent of a write before choosing physical blocks:
  - Better contiguous allocation: when a file is written in 4K chunks, ext4 waits until writeback and allocates one contiguous extent for all dirty pages together.
  - Elimination of unnecessary allocation: if a file is written and then deleted before flushing, no blocks are ever allocated.

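Tradeoff: at write() time blocks are only reserved, not placed. ext4 accounts for the reservation, so ENOSPC is normally still returned by write(), but allocation failures during writeback (I/O errors, or rare reservation shortfalls) are reported late and are visible only through fsync() or close(). Data also stays unallocated in the page cache for longer, so applications that replace a file via truncate or rename without calling fsync() can find it empty or truncated after a crash.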

uninit_bg (uninitialized block groups): Block groups are lazily initialized. On mkfs.ext4, only group 0's metadata is written. Other groups are initialized on first use. This makes mkfs.ext4 on a 10 TB filesystem take seconds instead of minutes, and enables fast fsck (skip uninitialized groups).

dir_index (HTree for directories): Once a directory outgrows a single block, ext4 indexes it with an HTree (a B-tree variant keyed by filename hash, whose leaves hold the directory entries). This changes directory lookup from an O(n) linear scan to an O(log n) hash lookup through a very shallow tree. For a directory with 100,000 entries, this can be the difference between scanning hundreds of directory blocks and reading two or three per lookup.

# Check if dir_index is enabled (it's on by default for ext4)
dumpe2fs -h /dev/sda1 | grep dir_index

# Retroactively add htree to existing large directory
e2fsck -fD /dev/sda1  # -D: optimize all directories

fallocate() (persistent preallocation): fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1<<30) reserves 1 GiB of blocks for fd without writing data; the allocator tries to make them contiguous but does not guarantee it. The inode's i_size remains unchanged (KEEP_SIZE flag), but physical blocks are allocated and marked unwritten. Used by databases (MySQL, PostgreSQL) and multimedia editors to preallocate space and limit fragmentation.
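The call can be tried from Python through ctypes, since os.posix_fallocate() does not expose the KEEP_SIZE flag. This is a sketch under assumptions: 64-bit Linux, a libc exporting fallocate(2), and FALLOC_FL_KEEP_SIZE = 0x01 as in linux/falloc.h. The wrapper name preallocate_keep_size is illustrative.

```python
import ctypes
import errno
import os
import tempfile

FALLOC_FL_KEEP_SIZE = 0x01   # assumed value, from <linux/falloc.h>


def preallocate_keep_size(fd, length, offset=0):
    """Reserve `length` bytes for fd without changing the file size."""
    libc = ctypes.CDLL(None, use_errno=True)   # symbols of the running process
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    if libc.fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        try:
            preallocate_keep_size(f.fileno(), 1 << 20)   # 1 MiB
        except OSError as exc:
            # Some filesystems do not implement fallocate at all.
            print("fallocate unavailable here:", exc)
        else:
            st = os.fstat(f.fileno())
            print("size:", st.st_size, "allocated 512-byte sectors:", st.st_blocks)
```

When the call succeeds, st_size stays at 0 while st_blocks reflects the reserved space, which is the KEEP_SIZE behavior described above.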

Inline data (inline_data feature, not enabled by default): Files of up to 60 bytes can store their data directly in the inode's i_block[] area. A 20-byte file needs no separate data block because its content lives inside the inode itself. This saves significant space for small files (configuration, metadata).

Block Group Bitmap Layout (ASCII)

Block Group Bitmap Analysis:

Group 0 descriptor (from dumpe2fs, "Group 0:" section; non-flex_bg layout shown for simplicity):
  Block bitmap at block 1025
  Inode bitmap at block 1026
  Inode table at blocks 1027 through 1538 (8192 inodes × 256 bytes = 512 blocks)

Block bitmap (block 1025, 4096 bytes = 32768 bits):
  Bit 0: block 0 (superblock) → 1 (used)
  Bit 1: block 1 (GDT) → 1 (used)
  Bits 2-1024: reserved GDT blocks → 1 (used)
  Bit 1025: block bitmap → 1 (used)
  Bit 1026: inode bitmap → 1 (used)
  Bits 1027-1538: inode table blocks → 1 (used)
  Bits 1539-32767: data blocks → 0 (free) or 1 (used)

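Inode bitmap (block 1026; only the first 8192 bits, one per inode in the group, are meaningful):
  Bit 0: inode 1 (bad blocks inode) → 1
  Bit 1: inode 2 (root directory "/") → 1
  Bits 2-9: inodes 3-10 (reserved system inodes, including the journal at inode 8) → 1
  Bits 10+: inodes 11 and up (user inodes) → 0 (free) or 1 (allocated)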
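The bit-to-inode mapping is easy to get off by one, because inode numbers start at 1 while bit indexes start at 0. A small Python sketch, assuming the little-endian bit order ext4 uses within each bitmap byte (function names are illustrative):

```python
def bit_is_set(bitmap, bit):
    """Bit n lives in byte n // 8, at bit position n % 8 (little-endian order)."""
    return bool(bitmap[bit // 8] & (1 << (bit % 8)))


def inode_number(group, bit, inodes_per_group):
    """Inode numbers start at 1, so bit 0 of group 0 is inode 1."""
    return group * inodes_per_group + bit + 1


def first_free_bit(bitmap, valid_bits):
    """Index of the first clear bit among the first valid_bits, or None."""
    for bit in range(valid_bits):
        if not bit_is_set(bitmap, bit):
            return bit
    return None


if __name__ == "__main__":
    # Fresh group 0: only the ten reserved inodes (bits 0-9) are allocated.
    bitmap = bytearray(4096)
    bitmap[0] = 0xFF   # bits 0-7
    bitmap[1] = 0x03   # bits 8-9
    free_bit = first_free_bit(bitmap, 8192)
    print("first free bit:", free_bit)
    print("first free inode:", inode_number(0, free_bit, 8192))
```

The same helpers apply to block bitmaps, where bit n of group g corresponds to block g × blocks_per_group + n (plus the first data block on 1 KiB-block filesystems).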

ext4 Tuning and Tools

mkfs.ext4 options:

# Standard filesystem for general use
mkfs.ext4 /dev/sda1

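# Database host (explicit block size, 1% reserve, no lazy init)
mkfs.ext4 -b 4096 -i 4096 -m 1 -E lazy_itable_init=0,lazy_journal_init=0 /dev/sda1
# -b 4096: 4 KiB blocks (already the default on all but very small filesystems)
# -i 4096: one inode per 4096 bytes (default 16384; worthwhile only for small-file
#          workloads, a volume holding a few large database files needs far fewer inodes)
# -m 1: 1% reserved blocks (default 5%, wasteful on large disks)
# -E lazy_itable_init=0: initialize inode tables immediately (slower mkfs, no background
#          initialization I/O after the first mount)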

# Stripe-aware for RAID10 (chunk 512KB, 2 data disks)
mkfs.ext4 -E stride=128,stripe-width=256 /dev/md0
# stride = chunk_size / block_size = 512KB / 4KB = 128
# stripe-width = stride × data_disks = 128 × 2 = 256

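# Mail server (many small files)
mkfs.ext4 -i 4096 -m 0 /dev/sda1  # one inode per 4 KiB block, no reserved blocks
# (a bytes-per-inode value below the block size, e.g. -i 1024 with 4 KiB blocks,
#  creates more inodes than can ever be used)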
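The stride and bytes-per-inode arithmetic behind these commands can be sketched in Python. The inode figures are approximations: mke2fs rounds the count per block group, and the 256-byte inode size is the default assumed here. Function names are illustrative.

```python
def raid_geometry(chunk_kib, data_disks, block_size=4096):
    """Return (stride, stripe_width) in filesystem blocks."""
    stride = chunk_kib * 1024 // block_size
    return stride, stride * data_disks


def inode_budget(fs_bytes, bytes_per_inode, inode_size=256):
    """Approximate inode count and fraction of the disk spent on inode tables."""
    inodes = fs_bytes // bytes_per_inode
    return inodes, inodes * inode_size / fs_bytes


if __name__ == "__main__":
    print(raid_geometry(512, 2))                 # RAID10, 512 KiB chunk, 2 data disks
    for ratio in (16384, 4096, 1024):            # values passed to mkfs.ext4 -i
        inodes, overhead = inode_budget(10 * 2**30, ratio)
        print(f"-i {ratio:5}: {inodes:9} inodes, {overhead:.1%} of disk in inode tables")
```

The loop makes the cost of dense inode allocation visible: every halving of bytes-per-inode doubles the space permanently set aside for inode tables.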

tune2fs: Modify filesystem parameters on existing ext4:

# View current parameters
tune2fs -l /dev/sda1

# Change reserved block percentage
tune2fs -m 2 /dev/sda1

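# Add journal (converts ext2 to ext3; ext4 additionally needs features such as extent enabled)
tune2fs -j /dev/sda2

# Enable metadata checksums (filesystem must be unmounted; run e2fsck -f first)
tune2fs -O metadata_csum /dev/sda1

# Enable/disable features
tune2fs -O ^has_journal /dev/sda1  # disable journaling (^=remove feature; unmount first)

# Set journal size (applies only when a journal is created, so remove an existing journal first)
tune2fs -J size=512 /dev/sda1  # 512 MiB journal (the default scales with filesystem size)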

dumpe2fs: Dump filesystem metadata:

# Full dump (all block groups)
dumpe2fs /dev/sda1 2>/dev/null | less

# Superblock only
dumpe2fs -h /dev/sda1

# Show block group details
dumpe2fs /dev/sda1 | grep -A 20 "Group 0:"

debugfs: Interactive filesystem debugger:

debugfs /dev/sda1  # opens filesystem read-only

# Useful debugfs commands:
# > stat /etc/passwd     - dump inode details
# > ls -l /              - list directory entries with inodes
# > cat /etc/passwd      - dump file content
# > logdump              - dump the journal (add -a to print all descriptor blocks)
# > imap /etc/passwd     - show where a file's inode sits in the inode table
# > bmap <file> <block>  - map logical block to physical block

# Recover deleted file (rarely works on ext4, which clears the extent tree on delete)
debugfs> lsdel          # list deleted inodes whose block maps are still intact
debugfs> undel <inode_num> /recovered_file   # needs debugfs -w; path is inside the filesystem

e2fsck: Check and repair ext4:

# Check filesystem (must be unmounted or read-only)
e2fsck -n /dev/sda1      # read-only check, no modifications
e2fsck /dev/sda1         # interactive: prompts for each fix
e2fsck -y /dev/sda1      # non-interactive: auto-yes to all fixes
e2fsck -f /dev/sda1      # force check even if clean flag set

# After kernel crash (journal may need replay):
e2fsck -pv /dev/sda1     # -p: auto-fix, -v: verbose progress

ext4 Scalability Limits

Parameter	ext4 Limit	Notes
Max file size	16 TiB (4KB blocks)	32-bit logical block numbers in extents
Max filesystem size	1 EiB (4KB blocks)	48-bit block addresses
Max filename length	255 bytes	Bytes, not characters; encoding is up to userspace
Max path length	4096 bytes	VFS limit (PATH_MAX)
Max directory entries	~10 million (2-level HTree, 4KB blocks)	Far more with the `large_dir` feature (3-level HTree)
Max hard links	65,000 per file	`i_links_count` is 16-bit
Max subdirs	64,998 per directory without `dir_nlink`	Unlimited with `dir_nlink` (link count then reads as 1)
Inodes per filesystem	Set at mkfs time	Cannot change without recreating FS
Block size	1024, 2048, 4096 bytes	4096 matches page size (standard)
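The two size limits in the table follow directly from field widths. A minimal Python check, assuming 32-bit logical block numbers (ee_block) and 48-bit physical block numbers (ee_start_hi plus ee_start_lo):

```python
TiB = 1 << 40
EiB = 1 << 60


def max_file_size(block_size=4096, logical_block_bits=32):
    """Largest file: every logical block number times the block size."""
    return (1 << logical_block_bits) * block_size


def max_fs_size(block_size=4096, physical_block_bits=48):
    """Largest filesystem: every physical block number times the block size."""
    return (1 << physical_block_bits) * block_size


if __name__ == "__main__":
    print(max_file_size() // TiB, "TiB max file")
    print(max_fs_size() // EiB, "EiB max filesystem")
    # Without the 64bit feature, block numbers are 32 bits wide:
    print(max_fs_size(physical_block_bits=32) // TiB, "TiB max filesystem (32-bit)")
```

The last line reproduces the 16 TB ceiling listed for ext2 earlier in this chapter.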

Mount Options Reference

# /etc/fstab entry for database filesystem (one line; fstab has no line continuation):
/dev/nvme0n1p1  /var/lib/mysql  ext4  defaults,data=ordered,noatime,nodiratime,discard,barrier=1  0  2

# Key options:
# noatime: don't update atime on reads (saves an inode write per read; gain is workload-dependent)
# nodiratime: don't update directory atime (already implied by noatime)
# discard: issue TRIM/UNMAP as blocks are freed (SSD or thin-provisioned storage only;
#          periodic fstrim is a common alternative)
# barrier=1: default, required for data integrity with non-PLP storage
# data=ordered: default journal mode
# commit=1: journal commit interval (seconds); 5=default, 1=more durable
# errors=remount-ro: remount read-only on error (vs continue or panic)

# For NVMe SSD with PLP (power-loss protection):
# barrier=0 can be acceptable, since PLP hardware makes acknowledged writes durable
# without cache flushes. data=writeback is a separate tradeoff: after a crash,
# recently written files may expose stale block contents, with or without PLP.
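Option strings like the one above are easy to audit mechanically. The following Python sketch parses a single-line fstab entry and flags the combinations discussed in the comments; the checks and the function names are illustrative, not an official linter.

```python
def parse_fstab_line(line):
    """Split one fstab entry into its six fields and an options dict."""
    device, mountpoint, fstype, options, dump, passno = line.split()
    opts = {}
    for item in options.split(","):
        key, _, value = item.partition("=")
        opts[key] = value if value else True   # flag options map to True
    return {"device": device, "mountpoint": mountpoint, "fstype": fstype,
            "options": opts, "dump": int(dump), "passno": int(passno)}


def review_options(opts):
    """Return human-readable notes about risky or redundant ext4 options."""
    notes = []
    if "noatime" in opts and "nodiratime" in opts:
        notes.append("nodiratime is redundant: noatime already covers directories")
    if opts.get("barrier") == "0" or "nobarrier" in opts:
        notes.append("barriers disabled: only reasonable on storage with PLP")
    if opts.get("data") == "writeback":
        notes.append("data=writeback: stale data may appear in files after a crash")
    return notes


if __name__ == "__main__":
    entry = parse_fstab_line(
        "/dev/nvme0n1p1 /var/lib/mysql ext4 "
        "defaults,data=ordered,noatime,nodiratime,discard,barrier=1 0 2")
    print(entry["options"])
    for note in review_options(entry["options"]):
        print("note:", note)
```

Run against the database entry above, it reports only the redundant nodiratime.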

Historical Context

ext was written by Rémy Card in 1992 as the first non-Minix filesystem for Linux. ext2 followed in 1993 with expanded limits and the classic multi-block-group layout. Stephen Tweedie's work on ext3 (1999-2001) added journaling while maintaining binary compatibility with ext2.

ext4 began as a series of patches to ext3 that broke backward compatibility — originally called "ext4dev" and marked experimental. Theodore Ts'o, Andrew Morton, and others stabilized it for production use by 2008 (Linux 2.6.28). The major improvements were extents (eliminating the indirect block tree for large files), delayed allocation (improving contiguous allocation for modern workloads), and 48-bit block addresses (enabling >16 TB filesystems).

Google switched Android from YAFFS2 to ext4 in 2010 (Android 2.3), bringing ext4 to hundreds of millions of devices. As of 2024, ext4 remains the default filesystem on Ubuntu and Debian and is widely used on Android. Fedora Workstation now defaults to Btrfs and RHEL/CentOS to XFS, with ext4 still fully supported as an install-time choice.

Production Examples

Google file system metrics: Google has extensive ext4 deployments for Linux server system disks. Failure classes that ext4 monitoring at this scale is reported to track include i_nlink overflow (from very wide directories in build caches), delayed allocation latency spikes (from writeback storms), and metadata checksum error rates that trigger drive replacement.

Dropbox inode exhaustion (2017): Before migrating to their own storage, Dropbox experienced multiple inode exhaustion events on ext4 filesystems backing their NFS servers. Each user's home directory had tens of thousands of files. Fix: mkfs.ext4 -i 1024 on new volumes. This was documented as a standard part of their capacity planning after the incidents.

Canonical Ubuntu LTS: Ubuntu defaults to ext4 for all installs, so its CI/CD infrastructure exercises ext4 behavior extensively, including the delayed allocation bug (below). metadata_csum is not mandatory, but mkfs.ext4 has enabled it by default since e2fsprogs 1.43.

Debugging Notes

# Monitor ext4 events in real time
perf trace --event 'ext4:*' -p $PID
# Or via ftrace:
echo 'ext4:ext4_journal_start ext4:ext4_sync_file_enter ext4:ext4_writepages' \
  > /sys/kernel/debug/tracing/set_event

# Check for metadata corruption
dmesg | grep -i 'ext4\|jbd2\|buffer i/o error'

# Find which inode corresponds to a specific block number
debugfs /dev/sda1 -R "icheck $BLOCK_NUM"
debugfs /dev/sda1 -R "ncheck $INODE_NUM"

# Check filesystem consistency stats
e2fsck -n /dev/sda1 2>&1 | tail -5
# "pass 1: checking inodes, blocks, and sizes"
# ...
# "/dev/sda1: 123456/655360 files, 1432567/2621440 blocks"

# Fragmentation analysis
e2fsck -fn /dev/sda1          # summary line reports the percentage of non-contiguous files
e4defrag -c /var/lib/mysql    # check fragmentation score (0=good)
e4defrag /var/lib/mysql       # defragment

# Monitor journal activity
iostat -x sda 1
# High w/s with small wareq-sz often indicates journal commits
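For monitoring scripts, the e2fsck summary line can be turned into numbers. The Python sketch below assumes the line format shown above, with the "(N% non-contiguous)" part treated as optional because it is missing from the sample; the function name is illustrative.

```python
import re

SUMMARY_RE = re.compile(
    r"^(?P<device>\S+): (?P<files_used>\d+)/(?P<files_total>\d+) files"
    r"(?: \((?P<noncontig>[\d.]+)% non-contiguous\))?"
    r", (?P<blocks_used>\d+)/(?P<blocks_total>\d+) blocks$"
)


def parse_e2fsck_summary(line):
    """Extract inode and block usage from an e2fsck summary line."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized e2fsck summary line")
    fields = match.groupdict()
    result = {"device": fields["device"]}
    for key in ("files_used", "files_total", "blocks_used", "blocks_total"):
        result[key] = int(fields[key])
    noncontig = fields["noncontig"]
    result["noncontig_pct"] = float(noncontig) if noncontig is not None else None
    result["inode_use_pct"] = 100.0 * result["files_used"] / result["files_total"]
    result["block_use_pct"] = 100.0 * result["blocks_used"] / result["blocks_total"]
    return result


if __name__ == "__main__":
    info = parse_e2fsck_summary(
        "/dev/sda1: 123456/655360 files, 1432567/2621440 blocks")
    print(f"inodes used: {info['inode_use_pct']:.1f}%")
    print(f"blocks used: {info['block_use_pct']:.1f}%")
```

Tracking inode_use_pct alongside block usage catches inode exhaustion before it shows up as ENOSPC on a disk that still has free blocks.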

Security Implications

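metadata_csum protection: ext4 with metadata_csum checksums its metadata (superblock, group descriptors, bitmaps, inodes, extent tree blocks, directory blocks, journal) using crc32c. A corrupted metadata block is detected and reported rather than silently causing data loss. The checksum is not cryptographic, so it guards against accidental corruption, not deliberate tampering. Recent mkfs.ext4 versions enable it by default; with older ones, enable it explicitly on all new filesystems: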

mkfs.ext4 -O metadata_csum /dev/sda1

Filesystem immutable flag: chattr +i /path/to/file sets the immutable attribute — even root cannot delete, rename, or modify the file until the flag is cleared with chattr -i. Used to protect critical files (system binaries, configuration) from rootkit modification.

chattr +i /etc/resolv.conf   # prevent modification by DHCP client or malware
lsattr /etc/resolv.conf      # verify: ----i-----------

Reserved blocks: ext4 reserves 5% of blocks for root by default. This prevents non-root users from completely filling the filesystem and causing root's processes to fail. Reduce to 1-2% on very large disks: tune2fs -m 1 /dev/sda1.

Performance Implications

Delayed allocation and write ordering: Delayed allocation lets ext4 allocate one large contiguous extent for a file written in many small write() calls. Benchmark: a sequential file write with fio --rw=write --bs=4k on ext4 shows far less fragmentation than on ext2 (which allocates per-write); filefrag -v on the resulting file makes the difference visible.

noatime mount option: atime updates on file reads require an inode write, turning read-only workloads into mixed read/write ones. For workloads with heavy sequential reads (log processing, backup), noatime eliminates this overhead. relatime (default since Linux 2.6.30) is a middle ground: atime is updated only if it is not newer than mtime or ctime, or if it is more than 24 hours old.
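The relatime decision can be written as a small predicate. This Python sketch mirrors the kernel's rule as understood here, including its 24-hour clause; timestamps are plain seconds and the function name is illustrative.

```python
DAY = 24 * 60 * 60   # seconds


def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if a read at time `now` should update atime under relatime."""
    if mtime >= atime:        # file modified since it was last read
        return True
    if ctime >= atime:        # inode changed since it was last read
        return True
    if now - atime >= DAY:    # atime is at least a day stale
        return True
    return False


if __name__ == "__main__":
    now = 1_700_000_000
    # Read again shortly after the previous read, no changes in between:
    print(relatime_needs_update(now - 60, now - 3600, now - 3600, now))
    # File was modified after the last read:
    print(relatime_needs_update(now - 3600, now - 60, now - 60, now))
    # Untouched file last read two days ago:
    print(relatime_needs_update(now - 2 * DAY, now - 3 * DAY, now - 3 * DAY, now))
```

The first case is the one that saves I/O: repeated reads of an unchanged file trigger at most one atime write per day.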

Large file fallocate performance: For databases creating a 100 GB file, fallocate(FALLOC_FL_KEEP_SIZE) is orders of magnitude faster than writing zeros via dd. ext4's fallocate allocates extents without zeroing them: each extent is marked unwritten (in the on-disk format, an ee_len value above 32768), and reads return zeros until the range is written.

Failure Modes and Real Incidents

ext4 data loss bug ("delayed allocation orphan inode" bug): A race condition in ext4's delayed allocation + journal path could cause file data to be lost after a crash if the file was written to and truncated in the same journal transaction. Affected Linux 3.10-4.0. Fixed in 4.1. Production impact: automated systems doing write-then-truncate patterns (log rotation, temporary files) on ext4 with delayed allocation could silently lose data after a crash. Widely reported in enterprise Linux support channels in 2014-2015.

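ext4 journal replay hang on very large filesystems (reported 2018-2020): With journal size 128 MB and a very large (>32 TB) filesystem with high metadata write rate, journal replay on boot could take 5-30 minutes. This caused unexpected "disk hang" during VM startup in cloud environments. Fix as reported: increase journal size (recreate the journal with tune2fs -J size=1024 after removing the old one) and enable fast_commit (Linux 5.10+). Replay time scales with the amount of journal data that has not been checkpointed, so measure the effect of a larger journal before rolling it out.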

Block bitmap corruption from misbehaving storage (common): Production ext4 filesystems sometimes show "block bitmap inconsistency" errors detected by e2fsck after a bad shutdown on drives with intermittent write failures. With metadata_csum, a corrupted bitmap is detected online as soon as the block is read, and the filesystem reacts according to its errors= setting (commonly a read-only remount). Without checksums, corruption can propagate silently.

Modern Usage

Default on Android: ext4, with f2fs as an alternative. Many Android devices since 2010 have used ext4 for userdata, system, and vendor partitions; newer devices often use f2fs for userdata.
Container base filesystems: Docker's default storage driver on Debian/Ubuntu uses overlayfs on top of ext4. The ext4 filesystem's mature dir_index and extent support makes it efficient for the many-small-file access patterns of container layers.
NVMe system disks in cloud VMs: Many Linux images on GCP, Azure, and AWS (Ubuntu and Debian among them) boot from ext4 on NVMe disks. With metadata_csum and noatime, ext4 adds little overhead over the raw device on mixed workloads.

Future Directions

ext5 / Next-gen ext: Ted Ts'o has mentioned post-ext4 features: larger inline data, better large-directory performance, improved extent tree. But Btrfs and XFS have largely captured the "next generation ext4" market segment.
Fast commit (Linux 5.10+): EXT4_FEATURE_COMPAT_FAST_COMMIT enables a compact, logical commit record for fsync() calls that only need to persist small metadata changes. This reduces fsync overhead noticeably for many workloads (figures around 40% are workload-dependent).
ext4 case-folding (Linux 5.2+): casefold feature enables case-insensitive filenames per directory (for Android compat mode and Windows-interop NFS shares).
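Verity (fs-verity): Merkle-tree-based integrity verification for individual read-only files. Merged in Linux 5.4. Used by Android to verify individual files such as APKs; whole-partition verification (Android system images, ChromeOS verified boot) relies on the block-level dm-verity instead.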

Exercises

Create two ext4 filesystems: one with default settings and one with -i 1024 (dense inode allocation). Create 100,000 small files (1 byte each) on each. Measure: time to create files, space used (df -i), and time to ls -l the directory.
Demonstrate delayed allocation by writing a 1 GB file in 4K chunks via fio. While writing, observe block allocation with filefrag -v /path/to/file both during and after. How many extents does the file have during writing vs. after completion?
Use debugfs to examine a mounted ext4 filesystem (debugfs opens it read-only by default, which is safe on a live filesystem). Find inode 2 (root directory). List its extent tree. Dump the directory entries. Find the journal inode (8) and note its size and blocks.
Force an unclean shutdown (kill the VM/container). Observe e2fsck running at next boot. Identify in dmesg the journal replay messages. How long does it take? Compare to e2fsck -n on the same filesystem after a clean unmount.
Measure the metadata_csum overhead. Create two identical ext4 filesystems (one with, one without metadata_csum). Run fio --rw=randwrite --bs=4k on each. Measure IOPS difference. Is the checksum overhead significant for this workload?

References

Card, R. et al. "Design and Implementation of the Second Extended Filesystem." Dutch Linux Symposium 1994.
Ts'o, T. and Tweedie, S. "Planned Extensions to the Linux Ext2/Ext3 Filesystem." USENIX ATC 2002.
Linux kernel documentation: Documentation/filesystems/ext4/
ext4 on-disk format: fs/ext4/ext4.h in kernel source
man 5 ext4 — ext4 filesystem overview
man 8 mkfs.ext4, man 8 tune2fs, man 8 debugfs, man 8 dumpe2fs
Ts'o, T. "ext4 Data Loss Bug (delayed allocation)." LWN 2015. https://lwn.net/Articles/638252/
Mathur, A. et al. "The new ext4 filesystem: current status and future plans." Ottawa Linux Symposium 2007.