Section 13: Filesystems
Purpose and Scope
A filesystem is the layer that imposes structure, naming, and durability semantics onto a raw storage medium. This section examines the full spectrum: from the abstract VFS (Virtual Filesystem Switch) that gives Linux its filesystem-agnostic POSIX interface, through the internal data structures (inodes, dentries, superblocks), to the crash-consistency mechanisms (journaling, copy-on-write, log-structured layouts) that distinguish modern filesystems from their predecessors.
Concrete filesystems covered include the Linux mainstream (ext4, XFS, Btrfs), filesystems that originated on other platforms (ZFS, NTFS, APFS), the specialized (F2FS for NAND flash, tmpfs for memory, procfs/sysfs for kernel exposition, FUSE for userspace), and the distributed tier (NFS, CephFS, GlusterFS). The section treats crash consistency, POSIX semantics, and performance trade-offs as first-class subjects.
Prerequisites
- Section 11 (Memory Management): page cache, mmap, demand paging
- Section 12 (Storage Systems): block devices, block layer, I/O schedulers
- Section 03 (OS Fundamentals): file descriptors, syscalls, VFS overview
- Familiarity with POSIX file API (open, read, write, fsync, rename)
Learning Objectives
Upon completing this section you will be able to:
- Explain VFS abstractions (superblock, inode, dentry, file) and how they enable filesystem-independent system calls.
- Describe ext4 journaling modes (journal, ordered, writeback) and their crash consistency guarantees.
- Explain how Btrfs and ZFS use copy-on-write trees to achieve atomic multi-block updates without journals.
- Describe the on-disk format of an ext4 inode and how extents differ from block maps.
- Explain why rename() is atomic on POSIX filesystems and the conditions under which fsync() is required.
- Compare log-structured (F2FS, LFS) vs update-in-place (ext4) approaches for NAND flash workloads.
- Describe how FUSE works and its performance implications (context switches per I/O operation).
- Explain distributed filesystem consistency models (NFS close-to-open, CephFS capabilities).
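The extent objective above comes down to how many mapping records a file needs. The following is a toy model under stated assumptions, not ext4's on-disk format: `block_map_entries` and `coalesce` are hypothetical helper names, and the 4 KB block size is assumed.

```python
from __future__ import annotations


def block_map_entries(file_size: int, block_size: int = 4096) -> int:
    """A per-block map needs one pointer for every block of the file."""
    return -(-file_size // block_size)  # ceiling division


def coalesce(blocks: list[int]) -> list[tuple[int, int]]:
    """Turn physical block numbers into (start_block, length) extents."""
    extents: list[tuple[int, int]] = []
    for b in blocks:
        if extents and extents[-1][0] + extents[-1][1] == b:
            # Block continues the previous contiguous run: grow that extent.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Discontinuity: start a new extent.
            extents.append((b, 1))
    return extents


# A 2 MiB file stored as two contiguous 256-block runs (hypothetical layout).
layout = list(range(1000, 1256)) + list(range(5000, 5256))
extents = coalesce(layout)
```

For this layout the block map needs 512 pointers while the extent list needs only 2 records; fragmentation pushes the extent count back toward the block count.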
Architecture Overview
                            User Space
┌────────────────────────────────────────────────────────────────┐
│         open() / read() / write() / fsync() / rename()         │
└───────────────────────────────┬────────────────────────────────┘
                                │ syscall
┌───────────────────────────────▼────────────────────────────────┐
│                           VFS Layer                            │
│  ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐   │
│  │ superblock │ │   inode    │ │   dentry   │ │ file (fd)  │   │
│  │ operations │ │ operations │ │   cache    │ │ operations │   │
│  └────────────┘ └────────────┘ └────────────┘ └────────────┘   │
└───────────────────────────────┬────────────────────────────────┘
              ┌─────────────────┼─────────────────┐
       ┌──────▼─────┐    ┌──────▼─────┐    ┌──────▼─────┐
       │    ext4    │    │    XFS     │    │   Btrfs    │
       │ journaling │    │ journaling │    │ CoW b-tree │
       └──────┬─────┘    └──────┬─────┘    └──────┬─────┘
              └─────────────────┼─────────────────┘
                                │
┌───────────────────────────────▼────────────────────────────────┐
│                           Page Cache                           │
│   address_space: radix tree of struct page (4 KB)              │
│   writeback: pdflush → per-BDI flusher threads                 │
└───────────────────────────────┬────────────────────────────────┘
                                │
┌───────────────────────────────▼────────────────────────────────┐
│                          Block Layer                           │
└────────────────────────────────────────────────────────────────┘
Special Filesystems:
procfs (kernel state) ─ sysfs (device model) ─ tmpfs (RAM) ─ FUSE (user)
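The split between dentries (names) and inodes (metadata) shown in the VFS layer is observable from user space. A minimal sketch, assuming a POSIX system whose temporary directory supports hard links; `link_demo` is a hypothetical helper name.

```python
import os
import tempfile


def link_demo() -> tuple:
    """Create one file with two names and report whether they share an inode."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        with open(a, "w") as f:
            f.write("hello")
        os.link(a, b)  # second directory entry pointing at the same inode
        sa, sb = os.stat(a), os.stat(b)
        same_inode = (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
        # st_nlink counts the directory entries that reference the inode.
        return same_inode, sa.st_nlink
```

Both names resolve to the same (device, inode) pair and the link count is 2, which is why the filename cannot live in the inode itself.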
Key Concepts
- VFS (Virtual Filesystem Switch): An abstraction layer that provides a uniform interface for all filesystem types, enabling system calls like open() to work without knowing the underlying filesystem.
- Inode: The fundamental metadata object: file type, permissions, timestamps, size, and block/extent pointers. Does not store the filename.
- Dentry (Directory Entry): Maps a name to an inode; cached in the dentry cache (dcache) for fast path lookups.
- Superblock: Per-filesystem metadata: block size, block count, free block bitmap, inode table location, journal location.
- Extent: A contiguous run of blocks described by (start_block, length); more efficient than per-block maps for large files.
- Journaling: Write-ahead log that records pending metadata changes; ensures that after a crash, the filesystem can replay or discard incomplete operations (never leaving it in an inconsistent state).
- Copy-on-Write (CoW): Instead of modifying data in place, write new data to free blocks and atomically update the tree root; enables snapshots and atomic multi-block updates.
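- Log-Structured Filesystem (LFS): All writes (data and metadata) are appended sequentially to a log; old versions accumulate and must be reclaimed by a cleaner. Well suited to NAND flash, which favors large sequential writes over in-place updates.
- fsync(): Forces a file's dirty pages and associated metadata (including any pending journal entries) to durable media before returning; required whenever an application needs its writes to survive a crash. Making a newly created or renamed name durable additionally requires an fsync() on the containing directory.
- rename() atomicity: On POSIX, a rename() that replaces an existing file is atomic at the directory entry level: observers see either the old file or the new one, never a missing or partial entry. This underpins the widely used write-temp-then-rename pattern for atomic file replacement.
- Page Cache: Kernel-managed cache of file data, indexed per inode (historically a radix tree of 4 KB pages, an XArray in current kernels); read()/write() operate on the cache, and writeback flushes dirty pages.
- Overlay Filesystem: Stacks a read-write upper layer on one or more read-only lower layers; used by Docker (via OverlayFS) and by live CDs.
- FUSE (Filesystem in Userspace): Allows filesystem implementations in user space via /dev/fuse; each request makes a round trip from the kernel to the userspace daemon and back, adding at least 2 context switches (kernel→user→kernel) per I/O.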
- Distributed Filesystem: Filesystem whose data and metadata span multiple servers; consistency model defines what readers observe after a write.
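The fsync() and rename() entries combine into the common atomic file replacement pattern. The sketch below assumes a POSIX system; `atomic_write` and the `.tmp` suffix are hypothetical choices for illustration, and real code would usually pick a unique temporary name.

```python
import os


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` so readers see either the old or the new contents."""
    dirname = os.path.dirname(os.path.abspath(path))
    tmp = path + ".tmp"  # hypothetical temp name in the same directory
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # may be a short write
            view = view[written:]
        os.fsync(fd)  # file contents durable before the name changes
    finally:
        os.close(fd)
    os.replace(tmp, path)  # rename(2): atomic switch of the directory entry
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)  # make the new directory entry itself durable
    finally:
        os.close(dfd)
```

The temporary file lives in the same directory as the target because rename() is only atomic within a single filesystem. Skipping the first fsync() risks a crash leaving the new name pointing at empty or partial data.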
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1975 | Unix Version 6 filesystem: 512-byte blocks, direct/indirect block maps |
| 1983 | Berkeley Fast Filesystem (FFS): cylinder groups, 4 KB blocks, reduced seeks |
| 1984 | Sun NFS v2: network filesystem, stateless server model |
| 1992 | Linux ext (first Linux filesystem); ext2 in 1993 |
| 1992 | Log-Structured Filesystem design published (Rosenblum & Ousterhout) |
| 1994 | Journaling enters mainstream (IBM JFS concept) |
| 1994 | SGI XFS ships in IRIX: B+ tree directories and extent allocation (open-sourced around 2000) |
| 2001 | ReiserFS merged into Linux 2.4.1: first journaling filesystem in the mainline kernel |
| 2001 | ext3: journaling added to ext2 |
| 2001 | ZFS development begins at Sun Microsystems |
| 2003 | NTFS reverse-engineered; ntfs-3g FUSE driver follows |
| 2005 | FUSE merged into mainline Linux (2.6.14) |
| 2005 | ZFS first released, via OpenSolaris; it ships in Solaris 10 the following year |
| 2007 | Btrfs development begins at Oracle (Chris Mason) |
| 2008 | ext4 declared stable in Linux 2.6.28 (extents, delayed allocation) |
| 2012 | Btrfs declared stable for basic usage in Linux 3.x |
| 2016 | Apple announces APFS (CoW, snapshots, clones) |
| 2017 | F2FS flash-friendly filesystem production-ready on Android |
| 2019 | ZFS on Linux (OpenZFS) 0.8: native encryption, sequential resilver |
| 2020 | bcachefs development; io_uring filesystem integration begins |
Modern Relevance and Production Use Cases
Container runtimes (Docker, containerd) use OverlayFS to stack image layers; because OverlayFS copies a lower-layer file into the upper layer in full on its first modification (copy-up), a container that changes even a few bytes of a large image file pays for copying the whole file.
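The copy-up cost can be illustrated with a toy model. This is a sketch of file-granular copy-up only, not the kernel's OverlayFS implementation; `ToyOverlay` and its methods are hypothetical names.

```python
from __future__ import annotations


class ToyOverlay:
    """Toy two-layer overlay: read-only lower layer, writable upper layer."""

    def __init__(self, lower: dict[str, bytes]) -> None:
        self.lower = lower                      # image layer, never modified
        self.upper: dict[str, bytearray] = {}   # container layer
        self.bytes_copied = 0                   # total copy-up traffic

    def read(self, name: str) -> bytes:
        # The upper layer shadows the lower layer.
        if name in self.upper:
            return bytes(self.upper[name])
        return self.lower[name]

    def write_at(self, name: str, offset: int, data: bytes) -> None:
        if name not in self.upper:
            # First write: copy the whole lower file up, regardless of write size.
            original = self.lower.get(name, b"")
            self.upper[name] = bytearray(original)
            self.bytes_copied += len(original)
        buf = self.upper[name]
        end = offset + len(data)
        if end > len(buf):
            buf.extend(b"\0" * (end - len(buf)))
        buf[offset:end] = data
```

Writing a single byte into a 1 MB lower-layer file raises `bytes_copied` by 1 MB; later writes to the same file cost nothing extra because the file already lives in the upper layer.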
Kubernetes persistent volumes backed by XFS or ext4 are formatted with specific parameters (inode ratio, block size) that affect metadata scalability. etcd fsyncs its write-ahead log on every commit, so a /var/lib/etcd volume with high fsync latency causes etcd timeouts and cluster instability.
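Block and inode headroom on a mounted volume can be checked with `os.statvfs`. A minimal sketch, assuming a POSIX system; `fs_usage` is a hypothetical helper, and some filesystems report zero total inodes because they allocate them dynamically.

```python
import os


def fs_usage(path: str = "/") -> dict:
    """Summarize block and inode utilization for the filesystem holding `path`."""
    st = os.statvfs(path)

    def pct(used: int, total: int) -> float:
        # Guard against filesystems that report a total of zero.
        return 100.0 * used / total if total else 0.0

    return {
        "block_size": st.f_frsize,
        "blocks_used_pct": pct(st.f_blocks - st.f_bfree, st.f_blocks),
        "inodes_total": st.f_files,
        "inodes_used_pct": pct(st.f_files - st.f_ffree, st.f_files),
    }
```

A volume can run out of inodes while blocks remain free, which is the failure mode a poorly chosen inode ratio produces on workloads with many small files.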
ZFS and Btrfs are used in NAS appliances (TrueNAS, Synology) and Linux production servers for their snapshot, send/receive, and scrub capabilities; understanding the CoW tree is essential for capacity planning.
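Why snapshots are cheap at creation but grow with churn can be seen in a toy copy-on-write structure. This is a one-level sketch under simplifying assumptions (real Btrfs and ZFS trees are multi-level and copy the whole path to the root); `Leaf`, `Root`, `cow_write`, and `shared_leaves` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    data: bytes


@dataclass(frozen=True)
class Root:
    leaves: tuple[Leaf, ...]


def cow_write(root: Root, index: int, data: bytes) -> Root:
    """Return a new root; the old root and its leaves are left untouched."""
    new_leaves = list(root.leaves)
    new_leaves[index] = Leaf(data)  # new block written to free space
    return Root(tuple(new_leaves))  # switching roots publishes the update


def shared_leaves(a: Root, b: Root) -> int:
    """Count leaves physically shared (same object) between two roots."""
    return sum(1 for x, y in zip(a.leaves, b.leaves) if x is y)


base = Root(tuple(Leaf(bytes([i])) for i in range(8)))
snapshot = base                     # a snapshot is just a retained root
live = cow_write(base, 3, b"new")
```

After one write, `live` and `snapshot` still share 7 of 8 leaves, so the snapshot consumes only the space of the blocks that have since diverged.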
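F2FS is used for the userdata partition on many Android devices, including Google Pixel phones from the Android 9 generation onward, although the choice is made per device rather than mandated by Android; its multi-head logging and segment-based layout reduce write amplification on eMMC and UFS storage.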
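The append-and-clean cycle behind log-structured designs can be sketched with a single in-memory log. This toy deliberately omits F2FS's multi-head logging and segment layout; `ToyLog` and its methods are hypothetical names.

```python
from __future__ import annotations


class ToyLog:
    """Append-only log with an index of the latest record per key."""

    def __init__(self) -> None:
        self.log: list[tuple[str, bytes]] = []
        self.index: dict[str, int] = {}   # key -> position of live record
        self.total_appends = 0            # includes relocations by the cleaner

    def write(self, key: str, value: bytes) -> None:
        # Never overwrite in place: append, then repoint the index.
        self.index[key] = len(self.log)
        self.log.append((key, value))
        self.total_appends += 1

    def read(self, key: str) -> bytes:
        return self.log[self.index[key]][1]

    def live_fraction(self) -> float:
        return len(self.index) / len(self.log) if self.log else 1.0

    def clean(self) -> int:
        """Rewrite only live records into a fresh log; return how many moved."""
        live = [(k, self.log[pos][1])
                for k, pos in sorted(self.index.items(), key=lambda kv: kv[1])]
        self.log, self.index = [], {}
        for key, value in live:
            self.write(key, value)
        return len(live)
```

Every record the cleaner relocates is an extra write the application never asked for, which is the source of write amplification in log-structured layouts when live data is scattered among stale records.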
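Database installations (PostgreSQL, MySQL) require careful fsync() and O_DIRECT interaction with the filesystem. PostgreSQL WAL durability depends on fsync() actually reaching stable media, so the filesystem and every layer below it must honor flushes; ext4's default data=ordered mode is the usual baseline for such deployments.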
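The write-ahead idea that both filesystem journals and database WALs rely on can be sketched in a few lines. This is a toy format invented for illustration, not PostgreSQL's WAL or ext4's journal; `append_txn`, `replay`, and the JSON record layout are assumptions.

```python
import json
import os


def append_txn(log_path: str, txn_id: int, updates: dict) -> None:
    """Append update records followed by a commit record, then fsync."""
    with open(log_path, "a", encoding="utf-8") as f:
        for key, value in updates.items():
            rec = {"type": "update", "txn": txn_id, "key": key, "value": value}
            f.write(json.dumps(rec) + "\n")
        f.write(json.dumps({"type": "commit", "txn": txn_id}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # the transaction is durable only after this returns


def replay(log_path: str) -> dict:
    """Rebuild state, applying only transactions that have a commit record."""
    state: dict = {}
    pending: dict = {}
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            try:
                rec = json.loads(line)
            except ValueError:
                break  # torn record at the tail: stop replaying
            if rec["type"] == "update":
                pending.setdefault(rec["txn"], []).append((rec["key"], rec["value"]))
            elif rec["type"] == "commit":
                for key, value in pending.pop(rec["txn"], []):
                    state[key] = value
    return state
```

Updates without a matching commit record are discarded on replay, which is how a crash in the middle of a transaction leaves the recovered state consistent.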
FUSE-based filesystems (s3fs, gcsfuse, sshfs) are widely deployed in cloud environments; the per-request round trip through a userspace daemon (at least 2 context switches per I/O) makes them unsuitable for metadata-heavy workloads but acceptable for sequential large-object access.
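A back-of-the-envelope model shows why the same fixed cost matters for small operations and not for large ones. The numbers are illustrative assumptions, not measurements: the 5 µs per context switch default and the function name `fuse_overhead_fraction` are hypothetical.

```python
def fuse_overhead_fraction(service_us: float,
                           switch_us: float = 5.0,
                           switches_per_request: int = 2) -> float:
    """Fraction of per-request time spent on context switches.

    service_us: time the request would take without the FUSE round trip.
    switch_us:  assumed cost of one context switch (illustrative only).
    """
    overhead_us = switch_us * switches_per_request
    return overhead_us / (service_us + overhead_us)


# Large sequential read that takes about 1 ms to service (assumed).
large_read = fuse_overhead_fraction(1000.0)
# Cached metadata lookup that takes about 2 µs to service (assumed).
metadata_op = fuse_overhead_fraction(2.0)
```

Under these assumptions the overhead is about 1% of a large read but about 83% of a small metadata operation, which matches the qualitative guidance above.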
File Map
| File | Description |
|---|---|
| 01-vfs-layer.md | superblock/inode/dentry/file objects, VFS operations tables |
| 02-inodes-and-dentries.md | inode on-disk format, dentry cache, dcache lookup |
| 03-page-cache.md | address_space, radix tree, writeback, dirty page accounting |
| 04-journaling.md | journal commit, ordered vs writeback vs data journaling |
| 05-ext4.md | extent trees, block groups, htree directories, delayed allocation |
| 06-xfs.md | AG structure, B+ tree allocator, reflink, realtime subvolume |
| 07-btrfs.md | CoW B-tree forest, subvolumes, snapshots, RAID, scrub |
| 08-zfs.md | ZPL/DMU/SPA stack, ZIL, ARC, dedup, vdev types |
| 09-cow-filesystems.md | CoW semantics, snapshot internals, clone efficiency |
| 10-log-structured-fs.md | LFS concepts, cleaner, F2FS multi-head logging, segment types |
| 11-distributed-fs.md | NFS v4 delegation, CephFS MDS, GlusterFS translators |
| 12-overlay-fs.md | OverlayFS mechanics, Docker layer merging, whiteout files |
| 13-fuse.md | /dev/fuse, request routing, virtiofs, performance analysis |
| 14-ntfs-apfs.md | NTFS MFT, alternate data streams, APFS container/volume model |
| 15-f2fs.md | Flash-aware zones, node/data segments, adaptive logging |
| 16-tmpfs-procfs-sysfs.md | Memory-backed VFS filesystems, kernel data exposition |
| 17-flash-aware-fs.md | JFFS2, UBIFS, MTD layer, wear leveling at filesystem level |
| 18-crash-consistency.md | Ordering guarantees, example bugs, testing with dm-flakey |
Cross-References
- Section 11 (Memory Management): page cache is physically part of memory management; mmap of files; DAX bypass
- Section 12 (Storage Systems): block layer below VFS; I/O scheduler interaction; NVMe namespaces as block devices
- Section 10 (Synchronization): inode locks, page cache spinlocks, journal locks
- Section 14 (Device Drivers): MTD layer for raw flash, block device driver interface
- Section 15 (Networking): NFS/SMB over network stack, RDMA for distributed filesystems
- Section 17 (Distributed Systems): distributed consistency models as applied to filesystem semantics
- Section 19 (Virtualization): virtiofs shared directories, 9P protocol