Section 13: Filesystems

Purpose and Scope

A filesystem is the layer that imposes structure, naming, and durability semantics onto a raw storage medium. This section examines the full spectrum: from the abstract VFS (Virtual Filesystem Switch) that gives Linux its filesystem-agnostic POSIX interface, through the internal data structures (inodes, dentries, superblocks), to the crash-consistency mechanisms (journaling, copy-on-write, log-structured layouts) that distinguish modern filesystems from their predecessors.

Concrete filesystems covered include the Linux mainstream (ext4, XFS, Btrfs), filesystems that originated on other platforms (ZFS, NTFS, APFS), the specialized (F2FS for NAND flash, tmpfs for memory, procfs/sysfs for kernel exposition, FUSE for userspace), and the distributed tier (NFS, CephFS, GlusterFS). The section treats crash consistency, POSIX semantics, and performance trade-offs as first-class subjects.

Prerequisites

Section 11 (Memory Management): page cache, mmap, demand paging
Section 12 (Storage Systems): block devices, block layer, I/O schedulers
Section 03 (OS Fundamentals): file descriptors, syscalls, VFS overview
Familiarity with POSIX file API (open, read, write, fsync, rename)

Learning Objectives

Upon completing this section you will be able to:

Explain VFS abstractions (superblock, inode, dentry, file) and how they enable filesystem-independent system calls.
Describe ext4 journaling modes (journal, ordered, writeback) and their crash consistency guarantees.
Explain how Btrfs and ZFS use copy-on-write trees to achieve atomic multi-block updates without a conventional metadata journal.
Describe the on-disk format of an ext4 inode and how extents differ from block maps.
Explain why rename() is atomic on POSIX filesystems and the conditions under which fsync() is required.
Compare log-structured (F2FS, LFS) vs update-in-place (ext4) approaches for NAND flash workloads.
Describe how FUSE works and its performance implications (context switches per I/O operation).
Explain distributed filesystem consistency models (NFS close-to-open, CephFS capabilities).

Architecture Overview

  User Space
  ┌────────────────────────────────────────────────────────────────┐
  │  open() / read() / write() / fsync() / rename()               │
  └─────────────────────────────┬──────────────────────────────────┘
                                │  syscall
  ┌─────────────────────────────▼──────────────────────────────────┐
  │                         VFS Layer                               │
  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐  │
  │  │ superblock│  │  inode   │  │  dentry  │  │  file (fd)   │  │
  │  │ operations│  │ operations│  │  cache   │  │  operations  │  │
  │  └──────────┘  └──────────┘  └──────────┘  └──────────────┘  │
  └──────────────────────────────────────────────────────────────┬─┘
                      ┌──────────────┬──────────┴─────────┐
              ┌───────▼────┐  ┌──────▼──────┐  ┌─────────▼─────┐
              │   ext4     │  │    XFS      │  │    Btrfs      │
              │ journaling │  │ journaling  │  │  CoW b-tree   │
              └───────┬────┘  └──────┬──────┘  └─────────┬─────┘
                      └──────────────┴──────────────────┬─┘
                                                         │
  ┌──────────────────────────────────────────────────────▼─────────┐
  │                     Page Cache                                   │
  │  address_space: radix tree of struct page (4 KB)                │
  │  writeback: pdflush → per-BDI flusher threads                   │
  └──────────────────────────────────────────────────────┬──────────┘
                                                         │
  ┌──────────────────────────────────────────────────────▼──────────┐
  │                      Block Layer                                  │
  └─────────────────────────────────────────────────────────────────┘

  Special Filesystems:
  procfs (kernel state) ─ sysfs (device model) ─ tmpfs (RAM) ─ FUSE (user)

Key Concepts

VFS (Virtual Filesystem Switch): An abstraction layer that provides a uniform interface for all filesystem types, enabling system calls like open() to work without knowing the underlying filesystem.
Inode: The fundamental metadata object: file type, permissions, timestamps, size, and block/extent pointers. Does not store the filename.
Dentry (Directory Entry): Maps a name to an inode; cached in the dentry cache (dcache) for fast path lookups.
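Superblock: Per-filesystem metadata: block size, block and inode counts, feature flags, and the location of structures such as the journal. In ext4, free-block bitmaps and inode tables are kept per block group and located through the group descriptors.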
Extent: A contiguous run of blocks described by (start_block, length); more efficient than per-block maps for large files.
Journaling: Write-ahead log that records pending changes (metadata, and optionally data) before they are applied in place; after a crash, the filesystem replays committed transactions and discards incomplete ones, so its structures are not left in an inconsistent state.
Copy-on-Write (CoW): Instead of modifying data in place, write new data to free blocks and atomically update the tree root; enables snapshots and atomic multi-block updates.
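Log-Structured Filesystem (LFS): All writes (data and metadata) are appended sequentially to a log; old versions accumulate and require cleaning. Well matched to NAND flash, which favors sequential writes and cannot overwrite in place.
fsync(): Forces all dirty pages and journal entries for a file to durable media before returning; required for crash-consistent application writes. For a newly created or renamed file, the parent directory may also need an fsync() so that the directory entry itself is durable.
rename() atomicity: On POSIX, rename() replacing an existing file is atomic at the directory entry level: other processes see either the old file or the new one, never a missing or partial entry. Atomic visibility does not imply durability, so the pattern is usually paired with fsync().
Page Cache: Kernel-managed cache of file data (4 KB pages in a per-inode radix tree, an XArray since Linux 4.20); buffered read()/write() operate on the cache; writeback flushes dirty pages.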
Overlay Filesystem: Stacks a read-write upper layer on a read-only lower layer; used by Docker (OverlayFS), live CDs.
FUSE (Filesystem in Userspace): Allows filesystem implementations in user space via /dev/fuse; each request that reaches the daemon adds at least 2 context switches (kernel→user→kernel) on top of the original system call.
Distributed Filesystem: Filesystem whose data and metadata span multiple servers; consistency model defines what readers observe after a write.
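The fsync() and rename() entries above combine into the widely used atomic-replace pattern. The following is a minimal sketch in Python, assuming a POSIX system where a directory can be opened read-only and passed to fsync(); the helper name `atomic_write`, the `.tmp-` prefix, and the file name `config.json` are illustrative choices, not part of any library.

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory: rename() cannot cross filesystems.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()               # drain Python's userspace buffer
            os.fsync(f.fileno())    # push the file's data to stable storage
        os.replace(tmp_path, path)  # atomic swap of the directory entry
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Make the rename itself durable by syncing the parent directory.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

# Usage: two successive replacements of a hypothetical config file.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "config.json")
    atomic_write(target, b'{"v": 1}')
    atomic_write(target, b'{"v": 2}')
    with open(target, "rb") as f:
        final = f.read()
    leftovers = [n for n in os.listdir(d) if n.startswith(".tmp-")]
```

Syncing the temporary file before the rename matters: without it, some filesystems and mount options can leave the new name pointing at a file whose data never reached disk after a crash.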

Major Historical Milestones

Year	Milestone
1975	Unix Version 6 filesystem: 512-byte blocks, direct/indirect block maps
1983	Berkeley Fast Filesystem (FFS): cylinder groups, 4 KB blocks, reduced seeks
1984	Sun NFS v2: network filesystem, stateless server model
1992	Linux ext (first Linux filesystem); ext2 in 1993
1992	Log-Structured Filesystem (Rosenblum & Ousterhout, ACM TOCS; first presented at SOSP 1991)
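1994	Journaling enters mainstream: SGI XFS ships in IRIX (IBM JFS had shipped with AIX in 1990)
1996	SGI XFS design published at USENIX: B+ tree directories and extent allocation (released as open source for Linux around 2000)
2001	ReiserFS merged into Linux 2.4.1: first journaling filesystem in the mainline kernel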
2001	ext3: journaling added to ext2
2001	ZFS development begins at Sun Microsystems
2003	NTFS reverse-engineered; ntfs-3g FUSE driver follows (2006)
2004	ZFS announced by Sun; code lands in OpenSolaris in 2005 and ships in Solaris 10 6/06 (2006)
2005	FUSE merged into mainline Linux 2.6.14
2007	Btrfs development begins at Oracle (Chris Mason)
2008	ext4 declared stable in Linux 2.6.28 (extents, delayed allocation)
2012	Btrfs declared stable for basic usage in Linux 3.x
2016	Apple announces APFS (CoW, snapshots, clones)
2017	F2FS flash-friendly filesystem production-ready on Android
2019	ZFS on Linux (OpenZFS) 0.8: native encryption, sequential resilver
2020	bcachefs development; io_uring filesystem integration begins

Modern Relevance and Production Use Cases

Container runtimes (Docker, containerd) use OverlayFS to stack image layers; understanding its copy-up semantics explains why a container that modifies even a few bytes of a large file from a lower image layer first pays for copying the entire file into the upper layer.
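The copy-up cost can be illustrated with a toy model. This sketch is not OverlayFS code: the class `ToyOverlay`, its dictionaries standing in for directories, and the file name `big.db` are all hypothetical, and the model assumes whole-file copy-up on first modification.

```python
class ToyOverlay:
    """Toy overlay mount: dicts stand in for the lower and upper directories."""

    def __init__(self, lower):
        self.lower = dict(lower)   # read-only layer: name -> bytes
        self.upper = {}            # writable layer, initially empty
        self.copied_up_bytes = 0   # bytes copied from lower to upper so far

    def read(self, name):
        # A file in the upper layer shadows the same name in the lower layer.
        if name in self.upper:
            return self.upper[name]
        return self.lower[name]

    def write_at(self, name, offset, data):
        # The first modification of a lower-layer file copies the whole file up.
        if name not in self.upper:
            if name in self.lower:
                self.upper[name] = self.lower[name]
                self.copied_up_bytes += len(self.lower[name])
            else:
                self.upper[name] = b""
        buf = bytearray(self.upper[name])
        if offset > len(buf):
            buf.extend(b"\0" * (offset - len(buf)))  # zero-fill any gap
        buf[offset:offset + len(data)] = data
        self.upper[name] = bytes(buf)

# Usage: a one-byte write to a 1 MB file from the image layer.
fs = ToyOverlay({"big.db": b"x" * 1_000_000})
fs.write_at("big.db", 0, b"Y")
first_copy = fs.copied_up_bytes
fs.write_at("big.db", 1, b"Z")      # already in upper: no second copy-up
second_copy = fs.copied_up_bytes
```

In this model the one-byte write moves the full megabyte once, and later writes to the same file are cheap, which is why write-heavy data is usually placed on a volume rather than in the container layer.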

Kubernetes persistent volumes backed by XFS or ext4 are formatted with specific parameters (inode ratio, block size) that affect metadata scalability; a /var/lib/etcd volume on a slow filesystem causes etcd timeouts and cluster instability.
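For ext4, the inode ratio fixes the inode count at format time, so the arithmetic is worth doing before a volume fills with small files. The sketch below assumes the mke2fs `-i` (bytes-per-inode) model with a 16384-byte ratio and 256-byte inodes, which are commonly shipped defaults but should be checked against the local mke2fs.conf; the real tool rounds per block group, so treat the results as estimates. The function names are hypothetical.

```python
GIB = 1024 ** 3

def estimated_inodes(volume_bytes, bytes_per_inode=16384):
    """Approximate inode count: one inode per `bytes_per_inode` bytes of volume."""
    return volume_bytes // bytes_per_inode

def inode_table_bytes(inode_count, inode_size=256):
    """Space reserved for inode tables, assuming fixed-size on-disk inodes."""
    return inode_count * inode_size

volume = 100 * GIB
default_count = estimated_inodes(volume)        # assumed default ratio
dense_count = estimated_inodes(volume, 4096)    # tuned for many small files
default_table = inode_table_bytes(default_count)

print(default_count, dense_count, default_table / GIB)
```

Under these assumptions a 100 GiB volume holds roughly 6.5 million inodes at the default ratio and four times as many at 4096 bytes per inode, at the cost of proportionally more space reserved for inode tables. XFS allocates inodes dynamically, so this calculation applies to ext4 only.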

ZFS and Btrfs are used in NAS appliances (TrueNAS, Synology) and Linux production servers for their snapshot, send/receive, and scrub capabilities; understanding the CoW tree is essential for capacity planning.
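The capacity implications follow from how a CoW tree shares unchanged nodes between a snapshot and the live filesystem. The following toy uses a path-copying binary search tree as a stand-in for the real B-trees; `Node`, `cow_insert`, and `lookup` are hypothetical names, and no on-disk format is implied.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Node:
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def cow_insert(root, key, value):
    """Return a new root; nodes off the search path are shared, never modified."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, cow_insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, cow_insert(root.right, key, value))
    return Node(key, value, root.left, root.right)

def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

live = None
for k, v in [(50, "a"), (30, "b"), (70, "c")]:
    live = cow_insert(live, k, v)

snapshot = live                      # a snapshot is just a retained root pointer
live = cow_insert(live, 30, "b2")    # rewrites only the path from the root to key 30

old_value, new_value = lookup(snapshot, 30), lookup(live, 30)
shared_subtree = live.right is snapshot.right
```

Taking the snapshot costs nothing up front; space is consumed only as the live tree diverges, because every rewritten path keeps the snapshot's old nodes alive. That is why deleting files on a snapshotted volume may free little or no space.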

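Android supports F2FS for the userdata partition, and many devices (including Google Pixel phones since the Pixel 3, which launched with Android 9) use it; its multi-head logging and segment-based layout are designed to reduce write amplification on eMMC and UFS storage.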
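The trade-off behind any log-structured layout is that overwrites leave dead records that a cleaner must later reclaim by copying live data. The sketch below is a single-log toy, not F2FS's multi-head design; `ToyLog` and its whole-log cleaner are hypothetical simplifications.

```python
class ToyLog:
    """Append-only log with an index of live records and a whole-log cleaner."""

    def __init__(self):
        self.log = []        # append-only list of (key, value) records
        self.index = {}      # key -> position of the live version in the log
        self.appended = 0    # total records written, including cleaner copies

    def put(self, key, value):
        # An update never overwrites: it appends and repoints the index.
        self.log.append((key, value))
        self.index[key] = len(self.log) - 1
        self.appended += 1

    def get(self, key):
        return self.log[self.index[key]][1]

    def live_fraction(self):
        return len(self.index) / len(self.log) if self.log else 1.0

    def clean(self):
        """Copy only live records into a fresh log, preserving their order."""
        live = sorted(self.index.items(), key=lambda item: item[1])
        records = [(key, self.log[pos][1]) for key, pos in live]
        self.log, self.index = [], {}
        for key, value in records:
            self.put(key, value)   # cleaner copies count as writes too

store = ToyLog()
user_writes = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
for key, value in user_writes:
    store.put(key, value)

before = store.live_fraction()     # half of the log is dead versions of "a"
store.clean()
after = store.live_fraction()
write_amplification = store.appended / len(user_writes)
```

Here four user writes end up costing six record writes once cleaning is included. Real designs limit this by cleaning segment by segment and by separating hot and cold data so that segments tend to die as a whole.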

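Database installations (PostgreSQL, MySQL) require careful fsync() and O_DIRECT interaction with the filesystem; PostgreSQL's WAL durability depends on fsync() (or an equivalent sync method) reaching stable storage rather than on a particular ext4 journaling mode, with data=ordered being the ext4 default.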
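The dependence on fsync() can be sketched with a toy write-ahead log. This is a generic illustration, not PostgreSQL's or MySQL's on-disk format: the length-plus-CRC framing, the function names, and the file name `toy.wal` are assumptions made for the example.

```python
import os
import struct
import tempfile
import zlib

def wal_append(fd, payload):
    """Append one length+CRC framed record and fsync before acknowledging it."""
    record = struct.pack("<II", len(payload), zlib.crc32(payload)) + payload
    view = memoryview(record)
    while len(view) > 0:             # os.write may write fewer bytes than asked
        written = os.write(fd, view)
        view = view[written:]
    os.fsync(fd)                     # the commit is durable only after this returns

def wal_replay(path):
    """Return every intact record, stopping at the first torn or corrupt one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, crc = struct.unpack("<II", header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break                # incomplete tail from a simulated crash
            records.append(payload)
    return records

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "toy.wal")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        wal_append(fd, b"BEGIN")
        wal_append(fd, b"COMMIT")
        # Simulate a crash mid-write: a header promising 100 bytes, then only 7.
        os.write(fd, struct.pack("<II", 100, 0) + b"partial")
    finally:
        os.close(fd)
    replayed = wal_replay(path)
```

Recovery keeps the two acknowledged records and discards the torn tail. The checksum lets replay detect an incomplete record without relying on the filesystem to order data writes.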

FUSE-based filesystems (s3fs, gcsfuse, sshfs) are widely deployed in cloud environments; the extra kernel→user→kernel round trip on each request makes them a poor fit for metadata-heavy workloads but acceptable for sequential large-object access, where the cost is amortized over large transfers.

File Map

File	Description
`01-vfs-layer.md`	superblock/inode/dentry/file objects, VFS operations tables
`02-inodes-and-dentries.md`	inode on-disk format, dentry cache, dcache lookup
`03-page-cache.md`	address_space, radix tree, writeback, dirty page accounting
`04-journaling.md`	journal commit, ordered vs writeback vs data journaling
`05-ext4.md`	extent trees, block groups, htree directories, delayed allocation
`06-xfs.md`	AG structure, B+ tree allocator, reflink, realtime subvolume
`07-btrfs.md`	CoW B-tree forest, subvolumes, snapshots, RAID, scrub
`08-zfs.md`	ZPL/DMU/SPA stack, ZIL, ARC, dedup, vdev types
`09-cow-filesystems.md`	CoW semantics, snapshot internals, clone efficiency
`10-log-structured-fs.md`	LFS concepts, cleaner, F2FS multi-head logging, segment types
`11-distributed-fs.md`	NFS v4 delegation, CephFS MDS, GlusterFS translators
`12-overlay-fs.md`	OverlayFS mechanics, Docker layer merging, whiteout files
`13-fuse.md`	/dev/fuse, request routing, virtiofs, performance analysis
`14-ntfs-apfs.md`	NTFS MFT, alternate data streams, APFS container/volume model
`15-f2fs.md`	Flash-aware zones, node/data segments, adaptive logging
`16-tmpfs-procfs-sysfs.md`	Memory-backed VFS filesystems, kernel data exposition
`17-flash-aware-fs.md`	JFFS2, UBIFS, MTD layer, wear leveling at filesystem level
`18-crash-consistency.md`	Ordering guarantees, example bugs, testing with dm-flakey

Cross-References

Section 11 (Memory Management): page cache is physically part of memory management; mmap of files; DAX bypass
Section 12 (Storage Systems): block layer below VFS; I/O scheduler interaction; NVMe namespaces as block devices
Section 10 (Synchronization): inode locks, page cache spinlocks, journal locks
Section 14 (Device Drivers): MTD layer for raw flash, block device driver interface
Section 15 (Networking): NFS/SMB over network stack, RDMA for distributed filesystems
Section 17 (Distributed Systems): distributed consistency models as applied to filesystem semantics
Section 19 (Virtualization): virtio-fs (virtiofs) shared directories, 9P protocol