05 - ext4
Technical Overview
ext4 (fourth extended filesystem) is the default filesystem for many Linux distributions and remains one of the most widely deployed filesystems in the world. It is a mature, production-hardened journaling filesystem built on decades of UNIX filesystem design. Its design philosophy emphasizes backward compatibility, reliability, and incremental improvement: an existing ext2/ext3 filesystem can be converted in place by enabling ext4 features with tune2fs and then running e2fsck.
While newer filesystems (Btrfs, ZFS) offer more advanced features, ext4 remains the right choice for most Linux deployments due to its excellent performance, low CPU overhead, extensive production testing, fast fsck, and metadata checksums.
Prerequisites
- Inode and dentry concepts (see 02-inodes-and-dentries.md)
- Journaling mechanics (see 03-journaling.md)
- Block device concepts
- RAID basics (for understanding RAID-optimized mkfs parameters)
Core Content
ext Family Evolution
ext (1992, Rémy Card):
- First Linux filesystem
- 2 GB max filesystem size, 255-character filenames
- No separate access, inode-change, and data-modification timestamps
- Replaced almost immediately by ext2
ext2 (1993, Rémy Card):
- Standard Linux FS for a decade
- 16 TB max filesystem (with 4K blocks)
- Multi-block group structure
- No journaling → fsck required after unclean shutdown
- Still used on flash media (no journal wear)
ext3 (2001, Stephen Tweedie):
- ext2 + jbd (journaling block device)
- Fully backward compatible with ext2
- Three journal modes: writeback, ordered, data
- Online filesystem growth
- dir_index: HTree for large directories
- Still limited to ~32,000 subdirectories per directory
ext4 (2008, Theodore Ts'o et al.):
- ext3 + extents + larger limits + delayed allocation + more
- 1 EB filesystem limit, 16 TB file limit
- 64,998 subdirectories per directory, with the limit lifted entirely by the dir_nlink feature
- Journal checksums
- Online defragmentation
- Transparent compression was proposed but never merged into mainline
On-Disk Layout
ext4 filesystem layout (3 block groups shown):
+==============+==============+==============+
|    Block     |    Block     |    Block     |
|   Group 0    |   Group 1    |   Group 2    |
+==============+==============+==============+
Block Group 0 detail:
+-------+-------+----------+--------+--------+--------+--------+
| Super | GDT   | Reserved | Block  | Inode  | Inode  | Data   |
| block |       | GDT      | Bitmap | Bitmap | Table  | Blocks |
+-------+-------+----------+--------+--------+--------+--------+
  1 blk   >=1     varies     1 blk    1 blk    N blks   N blks
Superblock: block 0 in group 0 (+ backup copies in groups 1, 3, 5, 7, 9, 25, 27, 49, ...: group 1 and the powers of 3, 5, and 7, per sparse_super)
GDT (Group Descriptor Table): one 32-byte (or 64-byte with 64bit feature) entry per block group
Block Bitmap: 1 bit per block in group (1=used, 0=free), fills exactly one 4KB block
→ max 32,768 blocks per group with 4KB blocks
Inode Bitmap: 1 bit per inode in group
Inode Table: inodes_per_group × inode_size bytes
Data Blocks: all remaining blocks
Block group size: 8 × block_size blocks = 8 × 4096 = 32,768 blocks; 32,768 × 4 KiB = 128 MiB
Flexible Block Groups (flex_bg feature, ext4 default): Consecutive block groups are combined into "flex groups" of flex_bg_size groups (default 16), so groups 0 through 15 form the first one. The bitmaps and inode tables of every group in a flex group are packed together in its first group, leaving the remaining groups almost entirely for data. This improves large-file allocation locality.
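The layout arithmetic can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative and not part of any ext4 tool, and it assumes the sparse_super rule that backups live in group 1 and in groups whose number is a power of 3, 5, or 7.

```python
def blocks_per_group(block_size):
    """A one-block bitmap has block_size * 8 bits, one bit per block."""
    return block_size * 8


def group_size_bytes(block_size):
    """Bytes covered by one block group."""
    return blocks_per_group(block_size) * block_size


def is_power_of(n, base):
    """True if n == base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def backup_superblock_groups(group_count):
    """Groups carrying a superblock backup under sparse_super (group 0 holds the primary)."""
    return [g for g in range(1, group_count)
            if any(is_power_of(g, b) for b in (3, 5, 7))]


if __name__ == "__main__":
    print(blocks_per_group(4096))             # blocks tracked by one 4 KiB bitmap
    print(group_size_bytes(4096) // 2**20)    # group size in MiB
    print(backup_superblock_groups(100))      # backup groups among the first 100
```

With 4 KiB blocks this gives 32,768 blocks and 128 MiB per group, which is why a 10 GiB filesystem ends up with 80 block groups.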
Superblock Fields
# View superblock details
dumpe2fs -h /dev/sda1
Key superblock fields (stored at byte offset 1024 in the filesystem):
Block count: 2621440
Free blocks: 1432567
Inode count: 655360
Free inodes: 532145
Block size: 4096
Inode size: 256 ← includes extra nanosecond timestamps
First block: 0
Blocks per group: 32768
Inodes per group: 8192
Journal inode: 8
Default mount options: user_xattr acl
Features: has_journal,ext_attr,resize_inode,dir_index,
filetype,needs_recovery,extent,flex_bg,
sparse_super2,large_file,huge_file,
uninit_bg,dir_nlink,extra_isize,
metadata_csum
Journal backup: inode blocks
Lifetime writes: 1234 GB
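To make the "byte offset 1024" detail concrete, here is a hedged sketch that decodes a handful of superblock fields from raw bytes. The field offsets are taken from struct ext4_super_block as I understand it (check them against fs/ext4/ext4.h before relying on them), and make_demo_image() is a hypothetical helper that fabricates a tiny in-memory image so the example runs without a real device.

```python
import struct

EXT4_SUPER_MAGIC = 0xEF53
SUPERBLOCK_OFFSET = 1024  # the superblock starts 1024 bytes into the filesystem


def parse_superblock(image):
    """Decode a few little-endian fields of the primary superblock."""
    sb = image[SUPERBLOCK_OFFSET:SUPERBLOCK_OFFSET + 1024]
    inodes_count, blocks_count_lo = struct.unpack_from("<II", sb, 0x00)
    (log_block_size,) = struct.unpack_from("<I", sb, 0x18)
    (blocks_per_group,) = struct.unpack_from("<I", sb, 0x20)
    (inodes_per_group,) = struct.unpack_from("<I", sb, 0x28)
    (magic,) = struct.unpack_from("<H", sb, 0x38)
    if magic != EXT4_SUPER_MAGIC:
        raise ValueError("not an ext2/3/4 superblock (magic 0x%04X)" % magic)
    return {
        "inode_count": inodes_count,
        "block_count_lo": blocks_count_lo,       # low 32 bits only
        "block_size": 1024 << log_block_size,    # stored as log2(size) - 10
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
    }


def make_demo_image():
    """Hypothetical helper: build a 2 KiB fake image matching the dump above."""
    img = bytearray(2048)
    base = SUPERBLOCK_OFFSET
    struct.pack_into("<II", img, base + 0x00, 655360, 2621440)
    struct.pack_into("<I", img, base + 0x18, 2)        # 1024 << 2 = 4096
    struct.pack_into("<I", img, base + 0x20, 32768)
    struct.pack_into("<I", img, base + 0x28, 8192)
    struct.pack_into("<H", img, base + 0x38, EXT4_SUPER_MAGIC)
    return bytes(img)


if __name__ == "__main__":
    print(parse_superblock(make_demo_image()))
```

On a real system the same function could be fed the first 2 KiB read from a block device or image file; dumpe2fs -h remains the authoritative view.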
ext4 Key Features
Extents (extent feature):
The traditional indirect block scheme in ext2/ext3 used up to 15 block pointers per inode (12 direct + 1 single-indirect + 1 double + 1 triple). For large files, this created deep trees of block pointers. ext4 replaces block pointers with extents:
ext4 extent in i_block[]:
struct ext4_extent_header {
__le16 eh_magic; // 0xF30A
__le16 eh_entries; // number of valid entries
__le16 eh_max; // max entries in this node
__le16 eh_depth; // depth of tree (0 = leaf)
__le32 eh_generation;
};
struct ext4_extent { // leaf node: maps logical blocks to physical
__le32 ee_block; // first logical block
__le16 ee_len; // number of blocks
__le16 ee_start_hi; // high 16 bits of physical block
__le32 ee_start_lo; // low 32 bits of physical block
};
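The inode's i_block[15] area (60 bytes) holds a 12-byte extent header plus up to four 12-byte extents, so up to four contiguous physical ranges can be mapped without any extra blocks. A single initialized extent covers at most 32,768 blocks (128 MiB with 4 KB blocks), so even a perfectly contiguous 1 TB file needs thousands of extents and a tree that extends beyond the inode; that is still far more compact than one pointer per block. For a heavily fragmented file, the extent tree grows deeper, backed by additional index and leaf blocks.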
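The two structs map directly onto struct format strings, which makes it easy to decode an inline extent tree by hand. The sketch below is illustrative only: it handles depth-0 (leaf) trees, build_demo_i_block() is a hypothetical helper producing sample bytes, and the rule that an ee_len above 32,768 marks an unwritten extent is my reading of the kernel's EXT_INIT_MAX_LEN convention.

```python
import struct

EXT4_EXT_MAGIC = 0xF30A
EXT_INIT_MAX_LEN = 1 << 15  # 32768 blocks: longest initialized extent


def parse_inline_extents(i_block):
    """Decode the extent header and leaf extents stored in a 60-byte i_block area."""
    magic, entries, _max, depth, _gen = struct.unpack_from("<HHHHI", i_block, 0)
    if magic != EXT4_EXT_MAGIC:
        raise ValueError("bad extent magic 0x%04X" % magic)
    if depth != 0:
        raise NotImplementedError("index nodes are not handled in this sketch")
    extents = []
    for i in range(entries):
        ee_block, ee_len, start_hi, start_lo = struct.unpack_from(
            "<IHHI", i_block, 12 + 12 * i)
        unwritten = ee_len > EXT_INIT_MAX_LEN
        extents.append({
            "logical": ee_block,
            "length": ee_len - EXT_INIT_MAX_LEN if unwritten else ee_len,
            "physical": (start_hi << 32) | start_lo,  # 48-bit block number
            "unwritten": unwritten,
        })
    return extents


def min_extents(file_bytes, block_size=4096):
    """Fewest initialized extents that can map a fully contiguous file."""
    blocks = -(-file_bytes // block_size)        # ceiling division
    return -(-blocks // EXT_INIT_MAX_LEN)


def build_demo_i_block():
    """Hypothetical sample: two extents, the second one unwritten."""
    raw = struct.pack("<HHHHI", EXT4_EXT_MAGIC, 2, 4, 0, 0)
    raw += struct.pack("<IHHI", 0, 32768, 0, 34816)
    raw += struct.pack("<IHHI", 32768, EXT_INIT_MAX_LEN + 100, 1, 0)
    return raw.ljust(60, b"\0")


if __name__ == "__main__":
    for extent in parse_inline_extents(build_demo_i_block()):
        print(extent)
    print(min_extents(2**40))  # extents needed for a contiguous 1 TiB file
```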
Delayed Allocation (delalloc mount option, enabled by default):
ext4 defers block allocation until data is flushed from the page cache. This allows the allocator to see the full extent of a write before choosing physical blocks:
- Better contiguous allocation: Writing a file in 4K chunks — ext4 waits until writeback to allocate, allocating a contiguous extent for all dirty pages together.
- Elimination of unnecessary allocation: If a file is written then deleted before flushing, no blocks are ever allocated.
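Tradeoff: dirty data stays in the page cache longer before it has blocks on disk, so a crash can lose more recently written data than on ext3, and applications that replace files without calling fsync() are the most exposed. ext4 reserves space when write() is called, so ENOSPC is normally still reported by write(), but allocation failures and I/O errors that surface at writeback time are only visible to applications that call fsync() and check its result.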
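A toy model shows why deferring allocation helps when several files are written at once. This is a deliberately simplified sketch, not the ext4 allocator: it hands out block numbers from a single counter and ignores reservation windows, block groups, and the multiblock allocator.

```python
def count_extents(blocks):
    """Number of contiguous physical runs in a file's block list."""
    return sum(1 for i, b in enumerate(blocks)
               if i == 0 or b != blocks[i - 1] + 1)


def allocate_immediately(writes):
    """Give each 4K write the next free block as soon as it arrives."""
    next_free, files = 0, {}
    for name in writes:
        files.setdefault(name, []).append(next_free)
        next_free += 1
    return files


def allocate_at_writeback(writes):
    """Count dirty pages per file first, then allocate each file in one run."""
    pending = {}
    for name in writes:
        pending[name] = pending.get(name, 0) + 1
    next_free, files = 0, {}
    for name, count in pending.items():
        files[name] = list(range(next_free, next_free + count))
        next_free += count
    return files


if __name__ == "__main__":
    writes = ["a", "b"] * 8  # two files written in interleaved 4K chunks
    print(count_extents(allocate_immediately(writes)["a"]))   # one run per write
    print(count_extents(allocate_at_writeback(writes)["a"]))  # a single run
```

In the interleaved case the immediate allocator leaves each file in eight single-block runs, while the deferred one produces one run per file.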
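uninit_bg (uninitialized block groups): Block groups are lazily initialized. mkfs.ext4 flags the bitmaps and inode tables of unused groups as uninitialized in their group descriptors instead of writing them out, and they are initialized on first use (or by a background kernel thread when lazy_itable_init is active). This makes mkfs.ext4 on a 10 TB filesystem take seconds instead of minutes, and enables fast fsck, because e2fsck can skip the unused portions.
dir_index (HTree for directories): Once a directory grows beyond a single block, ext4 indexes it with an HTree (a B-tree variant keyed by a hash of the filename, with leaf blocks holding ordinary directory entries). This changes directory lookup from an O(n) linear scan to a hash lookup that touches only a few blocks. For a directory with 100,000 entries, that is the difference between scanning hundreds of directory blocks and reading two or three.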
# Check if dir_index is enabled (it's on by default for ext4)
dumpe2fs -h /dev/sda1 | grep dir_index
# Retroactively add htree to existing large directory
e2fsck -fD /dev/sda1 # -D: optimize all directories
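A rough back-of-the-envelope estimate illustrates the dir_index benefit. The sketch below assumes fully packed directory blocks, an 8-byte entry header with names padded to 4 bytes, and about block_size / 8 index entries per HTree index block; real directories are less densely packed and metadata_csum adds a small tail per block, so treat the numbers as order-of-magnitude only.

```python
import math


def dirent_size(name_len):
    """Directory entry size: 8-byte header plus the name, padded to 4 bytes."""
    return 8 + (name_len + 3) // 4 * 4


def linear_dir_blocks(entries, name_len, block_size=4096):
    """Blocks a linear scan may have to read in the worst case."""
    per_block = block_size // dirent_size(name_len)
    return math.ceil(entries / per_block)


def htree_blocks_read(entries, name_len, block_size=4096):
    """Estimated blocks read per lookup: index levels plus one leaf block."""
    leaves = linear_dir_blocks(entries, name_len, block_size)
    fanout = block_size // 8          # assumed: 8 bytes per (hash, block) entry
    levels = 1
    while fanout ** levels < leaves:
        levels += 1
    return levels + 1


if __name__ == "__main__":
    print(linear_dir_blocks(100_000, 16))   # worst-case blocks for a linear scan
    print(htree_blocks_read(100_000, 16))   # blocks touched by a hashed lookup
```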
fallocate() (persistent preallocation): fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 1<<30) reserves 1 GiB of blocks for fd without writing data; the allocator tries to keep them contiguous but does not guarantee it. The inode's i_size remains unchanged (KEEP_SIZE flag), but physical blocks are allocated. Used by databases (MySQL, PostgreSQL) and multimedia editors to preallocate space and avoid fragmentation.
Inline data (inline_data feature, not enabled by default): Files ≤ 60 bytes can store their data directly in the inode's i_block[] area. A 20-byte file needs no separate data block because its content is inside the inode itself. Significant space saving for small files (configuration, metadata).
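Python's os.posix_fallocate() has no KEEP_SIZE mode, so the sketch below calls the Linux fallocate(2) wrapper in libc through ctypes. It assumes a 64-bit Linux system where off_t is 64 bits wide; preallocate_keep_size() is an illustrative name, and filesystems that do not support the call will raise OSError.

```python
import ctypes
import os
import tempfile

FALLOC_FL_KEEP_SIZE = 0x01  # value from <linux/falloc.h>


def preallocate_keep_size(fd, length):
    """Reserve `length` bytes of blocks for fd without changing its size."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.fallocate.argtypes = [ctypes.c_int, ctypes.c_int,
                               ctypes.c_int64, ctypes.c_int64]
    libc.fallocate.restype = ctypes.c_int
    if libc.fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, length) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        try:
            preallocate_keep_size(tmp.fileno(), 1 << 20)  # 1 MiB
        except OSError as exc:
            print("fallocate not supported here:", exc)
        st = os.fstat(tmp.fileno())
        # st_size stays 0; st_blocks (512-byte units) grows if blocks were reserved
        print(st.st_size, st.st_blocks)
```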
Block Group Bitmap Layout (ASCII)
Block Group Bitmap Analysis:
Group 0 descriptor (dumpe2fs → "Group 0:"):
Block bitmap at block 1025
Inode bitmap at block 1026
Inode table at block 1027 through 1282
Block bitmap (block 1025, 4096 bytes = 32768 bits):
Bit 0: block 0 (superblock) → 1 (used)
Bit 1: block 1 (GDT) → 1 (used)
Bits 2-1024: reserved GDT blocks → 1 (used)
Bit 1025: block bitmap → 1 (used)
Bit 1026: inode bitmap → 1 (used)
Bits 1027-1282: inode table blocks → 1 (used)
Bits 1283-32767: data blocks → 0 (free) or 1 (used)
Inode bitmap (block 1026; one bit per inode, only the first inodes_per_group bits are meaningful):
Bit 0: inode 1 (bad blocks inode) → 1
Bit 1: inode 2 (root directory "/") → 1
Bits 2-9: inodes 3-10 (reserved system inodes, including journal inode 8) → 1
Bits 10+: inodes 11 and up (user inodes; 11 is normally lost+found) → 0 (free) or 1 (allocated)
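The mapping between object numbers and bitmap bits is simple enough to express directly. The helper names below are illustrative; the sketch assumes the little-endian bit order used by ext4 bitmaps (bit n lives in byte n // 8 at position n % 8) and that inode numbers start at 1.

```python
def bit_is_set(bitmap, index):
    """Test bit `index` of an ext4 bitmap block (little-endian bit order)."""
    return bool(bitmap[index // 8] & (1 << (index % 8)))


def inode_location(ino, inodes_per_group):
    """Return (group, bit index) for an inode number; inode 1 is bit 0 of group 0."""
    return (ino - 1) // inodes_per_group, (ino - 1) % inodes_per_group


def block_location(block, blocks_per_group, first_data_block=0):
    """Return (group, bit index) for a physical block number."""
    rel = block - first_data_block
    return rel // blocks_per_group, rel % blocks_per_group


if __name__ == "__main__":
    # Hypothetical fresh inode bitmap: inodes 1-11 in use, everything else free.
    bitmap = bytes([0xFF, 0x07]) + bytes(1022)
    print(inode_location(11, 8192), bit_is_set(bitmap, 10))
    print(inode_location(12, 8192), bit_is_set(bitmap, 11))
    print(block_location(40000, 32768))
```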
ext4 Tuning and Tools
mkfs.ext4 options:
# Standard filesystem for general use
mkfs.ext4 /dev/sda1
# Database-optimized (explicit 4 KiB blocks, denser inodes, 1% reserve, no lazy init)
mkfs.ext4 -b 4096 -i 4096 -m 1 -E lazy_itable_init=0,lazy_journal_init=0 /dev/sda1
# -b 4096: 4 KiB blocks (already the default on all but very small filesystems)
# -i 4096: one inode per 4096 bytes (default is 16384; more inodes for small-file workloads)
# -m 1: 1% reserved blocks (default 5%, wasteful on large disks)
# -E lazy_itable_init=0: initialize inode tables at mkfs time (slower mkfs, no background initialization I/O after the first mount)
# Stripe-aware for RAID10 (chunk 512KB, 2 data disks)
mkfs.ext4 -E stride=128,stripe_width=256 /dev/md0
# stride = chunk_size / block_size = 512KB / 4KB = 128
# stripe_width = stride × data_disks = 128 × 2 = 256
# Mail server (many small files)
mkfs.ext4 -i 1024 -m 0 /dev/sda1 # one inode per 1KB, no reserved blocks
# Note: with 4 KiB blocks, a ratio below the block size creates more inodes than can ever be used; -i 4096 is the densest useful setting
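The mkfs parameters above are plain arithmetic, sketched here so the values can be recomputed for other geometries. Function names are illustrative, and the inode count is approximate because mke2fs rounds it per block group.

```python
def raid_geometry(chunk_kib, block_size, data_disks):
    """Return (stride, stripe_width) in filesystem blocks."""
    stride = chunk_kib * 1024 // block_size
    return stride, stride * data_disks


def inode_count(fs_bytes, bytes_per_inode):
    """Approximate inodes created for a given -i ratio."""
    return fs_bytes // bytes_per_inode


def reserved_bytes(fs_bytes, percent):
    """Space held back for root by the -m setting."""
    return fs_bytes * percent // 100


if __name__ == "__main__":
    GIB, TIB = 2**30, 2**40
    print(raid_geometry(512, 4096, 2))        # stride and stripe width
    print(inode_count(10 * GIB, 16384))       # default ratio on a 10 GiB volume
    print(inode_count(10 * GIB, 4096))        # denser -i 4096 ratio
    print(reserved_bytes(10 * TIB, 5) // GIB) # GiB reserved at the 5% default
```

The default ratio on a 10 GiB volume yields 655,360 inodes, matching the inode count in the superblock dump earlier in this section.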
tune2fs: Modify filesystem parameters on existing ext4:
# View current parameters
tune2fs -l /dev/sda1
# Change reserved block percentage
tune2fs -m 2 /dev/sda1
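# Add a journal (ext2 → ext3; enabling ext4 features such as extent is a separate tune2fs -O step)
tune2fs -j /dev/sda2
# Enable metadata checksums
tune2fs -O metadata_csum /dev/sda1 # filesystem must be unmounted and freshly checked with e2fsck -f
# Set journal size
tune2fs -J size=512 /dev/sda1 # 512 MB journal; applies only when a journal is being created, so remove an existing one first with -O ^has_journal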
# Enable/disable features
tune2fs -O ^has_journal /dev/sda1 # disable journaling (^=remove feature)
dumpe2fs: Dump filesystem metadata:
# Full dump (all block groups)
dumpe2fs /dev/sda1 2>/dev/null | less
# Superblock only
dumpe2fs -h /dev/sda1
# Show block group details
dumpe2fs /dev/sda1 | grep -A 20 "Group 0:"
debugfs: Interactive filesystem debugger:
debugfs /dev/sda1 # opens filesystem read-only
# Useful debugfs commands:
# > stat /etc/passwd - dump inode details
# > ls -l / - list directory entries with inodes
# > cat /etc/passwd - dump file content
# > logdump -a - dump the journal, including logged block contents
# > imap /etc/passwd - show where the file's inode sits in the inode table
# > bmap <inode> <block> - map logical block to physical block
# Recover deleted file (needs debugfs -w; rarely works on ext4, which clears the extent data when a file is deleted)
debugfs> lsdel # list recently deleted inodes
debugfs> undel <inode_num> /tmp/recovered_file # the path is inside the filesystem being debugged
e2fsck: Check and repair ext4:
# Check filesystem (must be unmounted or read-only)
e2fsck -n /dev/sda1 # read-only check, no modifications
e2fsck /dev/sda1 # interactive: prompts for each fix
e2fsck -y /dev/sda1 # non-interactive: auto-yes to all fixes
e2fsck -f /dev/sda1 # force check even if clean flag set
# After kernel crash (journal may need replay):
e2fsck -pv /dev/sda1 # -p: auto-fix, -v: verbose progress
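When e2fsck runs from scripts, its exit status is a bitmask rather than a simple pass/fail value. The sketch below decodes it using the values listed in the e2fsck(8) man page; decode_e2fsck_status() is an illustrative helper, not part of e2fsprogs.

```python
E2FSCK_EXIT_BITS = {
    1: "file system errors corrected",
    2: "file system errors corrected, system should be rebooted",
    4: "file system errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "e2fsck canceled by user request",
    128: "shared library error",
}


def decode_e2fsck_status(status):
    """Turn an e2fsck exit status into a list of human-readable conditions."""
    if status == 0:
        return ["no errors"]
    return [msg for bit, msg in E2FSCK_EXIT_BITS.items() if status & bit]


if __name__ == "__main__":
    for code in (0, 1, 4, 12):
        print(code, decode_e2fsck_status(code))
```

A wrapper script would typically treat statuses 0 and 1 as success and anything with bit 4 set as requiring manual attention.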
ext4 Scalability Limits
| Parameter | ext4 Limit | Notes |
|---|---|---|
| Max file size | 16 TiB (4KB blocks) | Extents use 32-bit logical block numbers |
| Max filesystem size | 1 EiB (4KB blocks) | 48-bit block addresses |
| Max filename length | 255 bytes | UTF-8 encoded |
| Max path length | 4096 bytes | VFS limit |
| Max directory entries | 2 billion (HTree) | No practical limit |
| Max hard links | 65,000 per file | i_links_count is 16-bit |
| Max subdirs | 64,998 per directory | Limit lifted when dir_nlink is enabled (directory link count is then reported as 1) |
| Inodes per filesystem | Set at mkfs time | Cannot change without recreating FS |
| Block size | 1024, 2048, 4096 bytes | 4096 matches page size (standard) |
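The two size limits in the table follow from field widths in the on-disk format: extents address file contents with 32-bit logical block numbers and the device with 48-bit physical block numbers. A small sketch of that arithmetic, with illustrative function names:

```python
def max_file_size(block_size, logical_block_bits=32):
    """Largest file addressable with 32-bit logical block numbers."""
    return (1 << logical_block_bits) * block_size


def max_fs_size(block_size, physical_block_bits=48):
    """Largest filesystem addressable with 48-bit physical block numbers."""
    return (1 << physical_block_bits) * block_size


if __name__ == "__main__":
    TIB, EIB = 2**40, 2**60
    print(max_file_size(4096) // TIB, "TiB")   # per-file limit with 4 KiB blocks
    print(max_fs_size(4096) // EIB, "EiB")     # filesystem limit with 4 KiB blocks
    print(max_file_size(1024) // TIB, "TiB")   # smaller blocks shrink the limit
```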
Mount Options Reference
# /etc/fstab entry for database filesystem (fstab has no line continuation; keep each entry on one line):
/dev/nvme0n1p1 /var/lib/mysql ext4 defaults,data=ordered,noatime,nodiratime,discard,barrier=1 0 2
# Key options:
# noatime: don't update atime on reads (removes an inode update from every read)
# nodiratime: don't update directory atime (already implied by noatime)
# discard: issue TRIM/UNMAP as blocks are freed (for SSDs and thin-provisioned volumes; periodic fstrim is a common alternative)
# barrier=1: default, required for data integrity with non-PLP storage
# data=ordered: default journal mode
# commit=1: journal commit interval (seconds); 5=default, 1=more durable
# errors=remount-ro: remount read-only on error (vs continue or panic)
# For NVMe SSD with PLP (power-loss protection):
# barrier=0 is sometimes used for max performance, since PLP hardware preserves
# the device write cache without software barriers. data=writeback is a separate
# tradeoff: it can expose stale data in recently written files after a crash, with or without PLP.
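Option strings like the one above are easy to lint before they reach /etc/fstab. The following is a small illustrative checker, not an existing tool; the three rules it applies are assumptions drawn from the option notes in this section.

```python
def parse_mount_options(opts):
    """Split 'a,b=1,c' into {'a': True, 'b': '1', 'c': True}."""
    parsed = {}
    for item in opts.split(","):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def review_options(opts):
    """Return advisory notes for a comma-separated ext4 option string."""
    p = parse_mount_options(opts)
    notes = []
    if p.get("noatime") and p.get("nodiratime"):
        notes.append("nodiratime is redundant: noatime already covers directories")
    if p.get("barrier") == "0" or p.get("nobarrier"):
        notes.append("barriers disabled: only safe with power-loss-protected storage")
    if p.get("data") == "writeback":
        notes.append("data=writeback: stale data may appear in files after a crash")
    return notes


if __name__ == "__main__":
    sample = "defaults,data=ordered,noatime,nodiratime,discard,barrier=1"
    for note in review_options(sample):
        print(note)
```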
Historical Context
ext was written by Rémy Card in 1992 as the first non-Minix filesystem for Linux. ext2 followed in 1993 with expanded limits and the classic multi-block-group layout. Stephen Tweedie's work on ext3 (1999-2001) added journaling while maintaining binary compatibility with ext2.
ext4 began as a series of patches to ext3 that broke backward compatibility — originally called "ext4dev" and marked experimental. Theodore Ts'o, Andrew Morton, and others stabilized it for production use by 2008 (Linux 2.6.28). The major improvements were extents (eliminating the indirect block tree for large files), delayed allocation (improving contiguous allocation for modern workloads), and 48-bit block addresses (enabling >16 TB filesystems).
Google switched Android from YAFFS2 to ext4 in 2010 (Android 2.3), bringing ext4 to hundreds of millions of devices. As of 2024, ext4 remains the default filesystem on Ubuntu and Debian, is widely used on Android, and is fully supported on Fedora and RHEL-family distributions, whose installers default to Btrfs or XFS.
Production Examples
Google file system metrics: Google has extensive ext4 deployments for Linux server system disks. Their internal ext4 monitoring identifies i_nlink overflow (from deep directory trees in build caches), delayed allocation latency spikes (from writeback storms), and metadata checksum error rates that trigger drive replacement.
Dropbox inode exhaustion (2017): Before migrating to their own storage, Dropbox experienced multiple inode exhaustion events on ext4 filesystems backing their NFS servers. Each user's home directory had tens of thousands of files. Fix: mkfs.ext4 -i 1024 on new volumes. This was documented as a standard part of their capacity planning after the incidents.
Canonical Ubuntu LTS: Ubuntu defaults to ext4 for all installs. Their CI/CD infrastructure has tested ext4 behavior extensively, including the delayed allocation bug (below). Recent e2fsprogs releases (1.43 and later) enable metadata_csum by default on new filesystems.
Debugging Notes
# Monitor ext4 events in real time
perf trace --event 'ext4:*' -p $PID
# Or via ftrace:
echo 'ext4:ext4_journal_start ext4:ext4_sync_file_enter ext4:ext4_writepages' \
> /sys/kernel/debug/tracing/set_event
# Check for metadata corruption
dmesg | grep -i 'ext4\|jbd2\|buffer_io_error'
# Find which inode owns a specific block number, then map that inode to a path
debugfs -R "icheck $BLOCK_NUM" /dev/sda1
debugfs -R "ncheck $INODE_NUM" /dev/sda1
# Check filesystem consistency stats
e2fsck -n /dev/sda1 2>&1 | tail -5
# "pass 1: checking inodes, blocks, and sizes"
# ...
# "/dev/sda1: 123456/655360 files, 1432567/2621440 blocks"
# Fragmentation analysis
e2fsck -fn /dev/sda1 # the summary line reports the percentage of non-contiguous files
e4defrag -c /var/lib/mysql # check fragmentation score (0=good)
e4defrag /var/lib/mysql # defragment
# Monitor journal activity
iostat -x 1 sda
# High w/s with small wareq-sz = journal writes
Security Implications
metadata_csum protection: ext4 with metadata_csum checksums every metadata block (superblock, group descriptors, bitmaps, inode table, directory blocks, journal). A corrupted metadata block is detected and reported rather than silently causing data loss. Enable on all new filesystems:
mkfs.ext4 -O metadata_csum /dev/sda1
Filesystem immutable flag: chattr +i /path/to/file sets the immutable attribute — even root cannot delete, rename, or modify the file until the flag is cleared with chattr -i. Used to protect critical files (system binaries, configuration) from rootkit modification.
chattr +i /etc/resolv.conf # prevent modification by DHCP client or malware
lsattr /etc/resolv.conf # verify: ----i-----------
Reserved blocks: ext4 reserves 5% of blocks for root by default. This prevents non-root users from completely filling the filesystem and causing root's processes to fail. Reduce to 1-2% on very large disks: tune2fs -m 1 /dev/sda1.
Performance Implications
Delayed allocation and write ordering: Delayed allocation lets ext4 allocate one large contiguous extent for a file written in many small write() calls. Benchmark: sequential file write with fio --rw=write --bs=4k on ext4 shows far less fragmentation than on ext2 (which allocates blocks at write() time).
noatime mount option: atime updates on every file read require an inode write, converting every read into a mixed read/write. For workloads with heavy sequential reads (log processing, backup), noatime eliminates this overhead. Use relatime (default since Linux 2.6.30) as a middle ground: atime is only updated if it is not newer than mtime or ctime, or if it is more than 24 hours old.
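The relatime decision can be written as a three-line predicate. This sketch mirrors my understanding of the kernel's rule (update when atime is not newer than mtime or ctime, or when it is at least a day old); timestamps are plain seconds and the function name is illustrative.

```python
DAY = 24 * 60 * 60


def relatime_needs_update(atime, mtime, ctime, now):
    """Return True if a read at `now` should write a new atime under relatime."""
    if mtime >= atime:        # file modified since it was last read
        return True
    if ctime >= atime:        # inode changed since it was last read
        return True
    if now - atime >= DAY:    # recorded atime is at least 24 hours old
        return True
    return False


if __name__ == "__main__":
    now = 1_000_000
    print(relatime_needs_update(atime=now - 60, mtime=now - 30, ctime=now - 30, now=now))
    print(relatime_needs_update(atime=now - 60, mtime=now - 90, ctime=now - 90, now=now))
    print(relatime_needs_update(atime=now - 2 * DAY, mtime=now - 3 * DAY, ctime=now - 3 * DAY, now=now))
```

Only the second call skips the update: the file was read more recently than it was changed, and the recorded atime is less than a day old.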
Large file fallocate performance: For databases creating a 100 GB file, fallocate(FALLOC_FL_KEEP_SIZE) is orders of magnitude faster than writing zeros via dd. ext4's fallocate allocates extents without zeroing them: each extent is marked unwritten (encoded as an ee_len value above 32,768), and reads return zeros until the range is written.
Failure Modes and Real Incidents
ext4 data loss bug ("delayed allocation orphan inode" bug): A race condition in ext4's delayed allocation + journal path could cause file data to be lost after a crash if the file was written to and truncated in the same journal transaction. Affected Linux 3.10-4.0. Fixed in 4.1. Production impact: automated systems doing write-then-truncate patterns (log rotation, temporary files) on ext4 with delayed allocation could silently lose data after a crash. Widely reported in enterprise Linux support channels in 2014-2015.
ext4 journal replay hang on very large filesystems (reported 2018-2020): With journal size 128 MB and a very large (>32 TB) filesystem with high metadata write rate, journal replay on boot could take 5-30 minutes. This caused unexpected "disk hang" during VM startup in cloud environments. Fix: increase journal size (tune2fs -J size=1024) and enable fast_commit (Linux 5.10+).
Block bitmap corruption from misbehaving storage (common): Production ext4 filesystems sometimes show "block bitmap inconsistency" errors detected by e2fsck after a bad shutdown on drives with intermittent write failures. With metadata_csum, the damage is detected online as soon as the affected block is read, and the filesystem reacts according to its errors= setting (typically remounting read-only). Without checksums, corruption can propagate silently.
Modern Usage
- Default on Android: ext4 with f2fs as an alternative. Most Android devices since 2010 use ext4 for userdata, system, and vendor partitions.
- Container base filesystems: Docker's default storage driver on Debian/Ubuntu uses overlayfs on top of ext4. The ext4 filesystem's mature dir_index and extent support makes it efficient for the many-small-file access patterns of container layers.
- NVMe system disks in cloud VMs: GCP, Azure, and AWS Linux VMs typically boot from ext4 on NVMe disks. With metadata_csum and noatime, ext4 performs within 5% of raw NVMe on mixed workloads.
Future Directions
- ext5 / Next-gen ext: Ted Ts'o has mentioned post-ext4 features: larger inline data, better large-directory performance, improved extent tree. But Btrfs and XFS have largely captured the "next generation ext4" market segment.
- Fast commit (Linux 5.10+): EXT4_FEATURE_COMPAT_FAST_COMMIT enables a compact commit record for fsync-only-metadata changes. Reduces fsync overhead by 40% for many workloads.
- ext4 case-folding (Linux 5.2+): the casefold feature enables case-insensitive filenames per directory (for Android compat mode and Windows-interop NFS shares).
- Verity (fs-verity): Cryptographic read-only content hashing for individual files. Merged in Linux 5.4. Used by Android to verify individual files such as APKs; whole-partition verified boot on Android and ChromeOS relies on the separate dm-verity mechanism.
Exercises
- Create two ext4 filesystems: one with default settings and one with -i 1024 (dense inode allocation). Create 100,000 small files (1 byte each) on each. Measure: time to create files, space used (df -i), and time to ls -l the directory.
- Demonstrate delayed allocation by writing a 1 GB file in 4K chunks via fio. While writing, observe block allocation with filefrag -v /path/to/file both during and after. How many extents does the file have during writing vs. after completion?
- Use debugfs to examine a live (mounted, read-only opening allowed) ext4 filesystem. Find inode 2 (root directory). List its extent tree. Dump the directory entries. Find the journal inode (8) and note its size and blocks.
- Force an unclean shutdown (kill the VM/container). Observe e2fsck running at next boot. Identify in dmesg the journal replay messages. How long does it take? Compare to e2fsck -n on the same filesystem after a clean unmount.
- Measure the metadata_csum overhead. Create two identical ext4 filesystems (one with, one without metadata_csum). Run fio --rw=randwrite --bs=4k on each. Measure IOPS difference. Is the checksum overhead significant for this workload?
References
- Card, R. et al. "Design and Implementation of the Second Extended Filesystem." Dutch Linux Symposium 1994.
- Ts'o, T. and Tweedie, S. "Planned Extensions to the Linux Ext2/Ext3 Filesystem." USENIX ATC 2002.
- Linux kernel documentation: Documentation/filesystems/ext4/
- ext4 on-disk format: fs/ext4/ext4.h in kernel source
- man 5 ext4: ext4 filesystem overview
- man 8 mkfs.ext4, man 8 tune2fs, man 8 debugfs, man 8 dumpe2fs
- Ts'o, T. "ext4 Data Loss Bug (delayed allocation)." LWN 2015. https://lwn.net/Articles/638252/
- Mathur, A. et al. "The new ext4 filesystem: current status and future plans." Ottawa Linux Symposium 2007.