Section 13: Filesystems

Purpose and Scope

A filesystem is the layer that imposes structure, naming, and durability semantics onto a raw storage medium. This section examines the full spectrum: from the abstract VFS (Virtual Filesystem Switch) that gives Linux its filesystem-agnostic POSIX interface, through the internal data structures (inodes, dentries, superblocks), to the crash-consistency mechanisms (journaling, copy-on-write, log-structured layouts) that distinguish modern filesystems from their predecessors.

Concrete filesystems covered include the Linux mainstream (ext4, XFS, Btrfs), filesystems that originated on other platforms (ZFS, NTFS, APFS), the specialized (F2FS for NAND flash, tmpfs for memory, procfs/sysfs for exposing kernel state, FUSE for userspace implementations), and the distributed tier (NFS, CephFS, GlusterFS). The section treats crash consistency, POSIX semantics, and performance trade-offs as first-class subjects.


Prerequisites

  • Section 11 (Memory Management): page cache, mmap, demand paging
  • Section 12 (Storage Systems): block devices, block layer, I/O schedulers
  • Section 03 (OS Fundamentals): file descriptors, syscalls, VFS overview
  • Familiarity with POSIX file API (open, read, write, fsync, rename)

Learning Objectives

Upon completing this section you will be able to:

  1. Explain VFS abstractions (superblock, inode, dentry, file) and how they enable filesystem-independent system calls.
  2. Describe ext4 journaling modes (journal, ordered, writeback) and their crash consistency guarantees.
  3. Explain how Btrfs and ZFS use copy-on-write trees to achieve atomic multi-block updates without journals.
  4. Describe the on-disk format of an ext4 inode and how extents differ from block maps.
  5. Explain why rename() is atomic on POSIX filesystems and the conditions under which fsync() is required.
  6. Compare log-structured (F2FS, LFS) vs update-in-place (ext4) approaches for NAND flash workloads.
  7. Describe how FUSE works and its performance implications (context switches per I/O operation).
  8. Explain distributed filesystem consistency models (NFS close-to-open, CephFS capabilities).
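
Objective 5 is easiest to motivate with the atomic-replace pattern it refers to. The following is a minimal sketch, assuming a POSIX system where a directory can be opened and passed to fsync(); the helper name `atomic_write` and the `.tmp-` prefix are illustrative choices, not part of any standard API.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` so readers see the old or the new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (hence the same
    # filesystem) as the target: rename() cannot cross mount points.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # drain Python's userspace buffer
            os.fsync(tmp.fileno())  # push the file contents to stable storage
        os.replace(tmp_path, path)  # rename(2): atomically switch the name
    except BaseException:
        # Best-effort cleanup of the temporary file if anything failed.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the parent directory so the new directory entry is durable too.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The ordering matters: the data is synced before the rename, so a crash can never expose the new name pointing at incomplete contents, and the directory is synced afterwards so the rename itself is not lost.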

Architecture Overview

  User Space
  ┌────────────────────────────────────────────────────────────────┐
  │  open() / read() / write() / fsync() / rename()               │
  └─────────────────────────────┬──────────────────────────────────┘
                                │  syscall
  ┌─────────────────────────────▼──────────────────────────────────┐
  │                         VFS Layer                               │
  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐  │
  │  │ superblock│  │  inode   │  │  dentry  │  │  file (fd)   │  │
  │  │ operations│  │ operations│  │  cache   │  │  operations  │  │
  │  └──────────┘  └──────────┘  └──────────┘  └──────────────┘  │
  └──────────────────────────────────────────────────────────────┬─┘
                      ┌──────────────┬──────────┴─────────┐
              ┌───────▼────┐  ┌──────▼──────┐  ┌─────────▼─────┐
              │   ext4     │  │    XFS      │  │    Btrfs      │
              │ journaling │  │ journaling  │  │  CoW b-tree   │
              └───────┬────┘  └──────┬──────┘  └─────────┬─────┘
                      └──────────────┴──────────────────┬─┘
                                                         │
  ┌──────────────────────────────────────────────────────▼─────────┐
  │                     Page Cache                                   │
  │  address_space: radix tree of struct page (4 KB)                │
  │  writeback: pdflush → per-BDI flusher threads                   │
  └──────────────────────────────────────────────────────┬──────────┘
                                                         │
  ┌──────────────────────────────────────────────────────▼──────────┐
  │                      Block Layer                                  │
  └─────────────────────────────────────────────────────────────────┘

  Special Filesystems:
  procfs (kernel state) ─ sysfs (device model) ─ tmpfs (RAM) ─ FUSE (user)
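
The split between names (dentries) and file metadata (inodes) in the VFS layer can be observed from user space. The sketch below, which assumes a filesystem that supports hard links, creates two names for one inode and shows that removing one name leaves the file intact. The function name `inode_demo` and the file names are hypothetical.

```python
import os
import tempfile


def inode_demo(directory):
    """Create two names for one inode and report what stat() sees."""
    original = os.path.join(directory, "report.txt")
    alias = os.path.join(directory, "alias.txt")
    with open(original, "w") as f:
        f.write("hello")
    os.link(original, alias)  # second directory entry, same inode

    a, b = os.stat(original), os.stat(alias)
    same_inode = (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
    links_before = a.st_nlink

    os.unlink(original)  # removes one name; the inode survives
    with open(alias) as f:
        content = f.read()
    links_after = os.stat(alias).st_nlink
    return same_inode, links_before, links_after, content


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(inode_demo(d))
```

The link count lives in the inode, not in either directory entry, which is why unlinking a name only frees the file once the count reaches zero and no process still holds it open.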

Key Concepts

  • VFS (Virtual Filesystem Switch): An abstraction layer that provides a uniform interface for all filesystem types, enabling system calls like open() to work without knowing the underlying filesystem.
  • Inode: The fundamental metadata object: file type, permissions, timestamps, size, and block/extent pointers. Does not store the filename.
  • Dentry (Directory Entry): Maps a name to an inode; cached in the dentry cache (dcache) for fast path lookups.
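  • Superblock: Per-filesystem metadata: block size, block and inode counts, free-space summary, feature flags, and the location of key structures such as the journal. In ext4 the block bitmaps and inode tables themselves live in per-block-group metadata rather than in the superblock.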
  • Extent: A contiguous run of blocks described by (start_block, length); more efficient than per-block maps for large files.
  • Journaling: Write-ahead log that records pending metadata changes (and, in data-journaling mode, file data) before they are applied in place; after a crash the filesystem replays committed transactions and discards incomplete ones, so its metadata is never left in an inconsistent state.
  • Copy-on-Write (CoW): Instead of modifying data in place, write new data to free blocks and atomically update the tree root; enables snapshots and atomic multi-block updates.
  • Log-Structured Filesystem (LFS): All writes (data and metadata) are appended sequentially to a log; old versions accumulate and require cleaning. Ideal for NAND flash write patterns.
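  • fsync(): Blocks until a file's dirty pages and the associated metadata (including any journal commit) have reached durable media; required whenever an application needs its writes to survive a crash. Making a newly created or renamed name durable additionally requires fsync() on the parent directory.
  • rename() atomicity: On POSIX, rename() replacing an existing file is atomic at the directory entry level: observers see either the old file or the new one, never a missing name. Atomicity is not durability, so atomic file replacement patterns fsync() the new contents before the rename.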
  • Page Cache: Kernel-managed cache of file data (4 KB pages in a per-inode radix tree, an XArray in current kernels); read()/write() operate on the cache, and writeback flushes dirty pages to the filesystem.
  • Overlay Filesystem: Stacks a read-write upper layer on a read-only lower layer; used by Docker (OverlayFS), live CDs.
  • FUSE (Filesystem in Userspace): Allows filesystem implementations in user space via /dev/fuse; each request is handed to a userspace daemon and back, costing at least two context switches (kernel→user→kernel).
  • Distributed Filesystem: Filesystem whose data and metadata span multiple servers; consistency model defines what readers observe after a write.
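
To make the extent concept concrete, here is a minimal sketch of mapping a logical file block to a physical block through a sorted extent list. It is an in-memory model only: the `Extent` fields and the `lookup` helper are illustrative names and do not mirror the ext4 on-disk layout.

```python
from bisect import bisect_right
from typing import NamedTuple


class Extent(NamedTuple):
    logical_start: int   # first file block covered by this extent
    physical_start: int  # first disk block backing it
    length: int          # number of contiguous blocks


def lookup(extents, logical_block):
    """Return the physical block for `logical_block`, or None for a hole.

    `extents` must be sorted by logical_start and non-overlapping.
    """
    starts = [e.logical_start for e in extents]
    i = bisect_right(starts, logical_block) - 1
    if i < 0:
        return None
    extent = extents[i]
    offset = logical_block - extent.logical_start
    if offset >= extent.length:
        return None  # falls in a gap between extents (sparse file)
    return extent.physical_start + offset


# Three extents describe 1296 blocks; a per-block map would need 1296 entries.
extents = [
    Extent(0, 1000, 256),
    Extent(256, 5000, 1024),
    Extent(2000, 9000, 16),
]
print(lookup(extents, 300))   # 5044
print(lookup(extents, 1500))  # None (hole)
```

The saving grows with contiguity: a well-laid-out multi-gigabyte file may need only a handful of extents, whereas a block map grows linearly with file size.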

Major Historical Milestones

Year Milestone
1975 Unix Version 6 filesystem: 512-byte blocks, direct/indirect block maps
1983 Berkeley Fast Filesystem (FFS) in 4.2BSD: cylinder groups, 4 KB blocks, reduced seeks
1984 Sun NFS v2: network filesystem, stateless server model
1991 Log-Structured Filesystem paper (Rosenblum & Ousterhout)
1992 Linux ext (first filesystem written for Linux); ext2 in 1993
1994 Journaling enters mainstream (IBM JFS concept)
1994 SGI XFS ships in IRIX: B+ tree directories and extent allocation (open-sourced around 2000)
2001 ReiserFS merged into Linux 2.4.1: first journaling filesystem in the mainline kernel
2001 ext3: journaling added to ext2
2001 ZFS development begins at Sun Microsystems
2003 NTFS reverse-engineered; ntfs-3g FUSE driver follows
2005 FUSE merged into mainline Linux 2.6.14
2005 ZFS source released through OpenSolaris; ships in Solaris 10 6/06 the following year
2007 Btrfs development announced at Oracle (Chris Mason)
2008 ext4 declared stable in Linux 2.6.28 (extents, delayed allocation)
2012 Btrfs declared stable for basic usage in Linux 3.x
2016 Apple announces APFS (CoW, snapshots, clones)
2017 F2FS flash-friendly filesystem production-ready on Android
2019 ZFS on Linux (OpenZFS) 0.8: native encryption, sequential resilver
2020 bcachefs development; io_uring filesystem integration begins

Modern Relevance and Production Use Cases

Container runtimes (Docker, containerd) use OverlayFS to stack image layers; understanding its copy-up semantics explains why a container that modifies a large file from a lower image layer pays for copying the whole file into the upper layer before the first write completes.
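
The copy-up cost can be illustrated with a toy model of an overlay: a read-only lower layer, a writable upper layer, and whiteout markers for deletions. This is a simplified sketch under the assumption of whole-file copy-up; the class `OverlayModel` and its methods are hypothetical and ignore directories, metadata, and partial copies.

```python
class OverlayModel:
    """Toy overlay: dict-backed lower (read-only) and upper (writable) layers."""

    WHITEOUT = object()  # marker hiding a lower-layer file after deletion

    def __init__(self, lower):
        self.lower = dict(lower)
        self.upper = {}
        self.bytes_copied_up = 0

    def read(self, name):
        if name in self.upper:
            value = self.upper[name]
            if value is self.WHITEOUT:
                raise FileNotFoundError(name)
            return value
        if name in self.lower:
            return self.lower[name]
        raise FileNotFoundError(name)

    def append(self, name, data):
        current = self.upper.get(name)
        if current is None and name in self.lower:
            # First write to a lower-layer file: copy the whole file up.
            current = self.lower[name]
            self.bytes_copied_up += len(current)
        elif current is None or current is self.WHITEOUT:
            current = b""  # brand-new file, or re-created after deletion
        self.upper[name] = current + data

    def delete(self, name):
        self.read(name)  # raises FileNotFoundError if not visible
        if name in self.lower:
            self.upper[name] = self.WHITEOUT
        else:
            del self.upper[name]


overlay = OverlayModel({"big.log": b"x" * 1_000_000})
overlay.append("big.log", b"y")  # a 1-byte append triggers a 1 MB copy-up
print(overlay.bytes_copied_up)   # 1000000
```

Subsequent writes to the same file hit the upper copy directly, so the cost is paid once per file per container rather than per write.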

Kubernetes persistent volumes backed by XFS or ext4 are formatted with specific parameters (inode ratio, block size) that affect metadata scalability; etcd is sensitive to fsync() latency, so a /var/lib/etcd volume on slow storage causes etcd timeouts and cluster instability.
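
Format-time parameters such as block size and inode count can be inspected on a mounted volume through statvfs(). The sketch below uses Python's `os.statvfs` (Unix only); the helper name `fs_summary` and the dictionary keys are illustrative.

```python
import os


def fs_summary(path):
    """Summarize capacity and inode headroom for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "block_size": st.f_frsize,                       # fundamental block size
        "total_bytes": st.f_frsize * st.f_blocks,
        "free_bytes_unprivileged": st.f_frsize * st.f_bavail,
        "inodes_total": st.f_files,                      # may be 0 on some filesystems
        "inodes_free": st.f_ffree,
    }


if __name__ == "__main__":
    for key, value in fs_summary("/").items():
        print(f"{key}: {value}")
```

A volume can run out of inodes while still reporting free bytes, which is why the inode ratio chosen at format time matters for workloads with many small files. Filesystems that allocate inodes dynamically may report the inode fields as zero.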

ZFS and Btrfs are used in NAS appliances (TrueNAS, Synology) and Linux production servers for their snapshot, send/receive, and scrub capabilities; understanding the CoW tree is essential for capacity planning.

F2FS is widely used as the userdata filesystem on Android devices (Google's Pixel phones adopted it around Android 9); its multi-head logging and segment-based layout reduce write amplification on eMMC and UFS storage.

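Database installations (PostgreSQL, MySQL) require careful fsync() and O_DIRECT interaction with the filesystem; PostgreSQL's WAL durability rests on fsync() (or an equivalent sync method) actually reaching stable storage, and ext4's default data=ordered mode is the usual baseline for such deployments.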
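
The role of fsync() in a write-ahead log can be sketched in a few lines. This is a simplified illustration, not how PostgreSQL or MySQL implement their logs: the record format (length plus CRC32 header) and the names `wal_append` and `wal_replay` are assumptions made for the example.

```python
import os
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def wal_append(fd, payload):
    """Append one record and return only after it is on stable storage."""
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    view = memoryview(record)
    while view:                      # handle short writes
        view = view[os.write(fd, view):]
    os.fsync(fd)                     # the commit point: durable after this returns


def wal_replay(path):
    """Return all intact records, stopping at the first torn or corrupt one."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(HEADER.size)
            if len(header) < HEADER.size:
                break
            length, crc = HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break                # crash mid-write: ignore the tail
            records.append(payload)
    return records
```

A caller would open the log with `os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)` and acknowledge a transaction only after `wal_append` returns. The checksum lets recovery distinguish a committed record from a partially written tail.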

FUSE-based filesystems (s3fs, gcsfuse, sshfs) are widely deployed in cloud environments; the userspace round trip on every request (at least two context switches) makes them a poor fit for metadata-heavy workloads but acceptable for sequential large-object access.
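
Why a fixed per-request cost hurts small operations far more than large ones can be seen with a back-of-the-envelope model. The numbers below (1 GB/s backing throughput, 10 µs of per-request overhead) are hypothetical placeholders chosen for illustration, not measurements of any FUSE implementation.

```python
def efficiency(op_bytes, device_bytes_per_s, per_op_overhead_s):
    """Fraction of wall time spent moving data rather than on fixed overhead."""
    transfer_s = op_bytes / device_bytes_per_s
    return transfer_s / (transfer_s + per_op_overhead_s)


# Hypothetical parameters, for illustration only.
DEVICE_BPS = 1e9    # assumed backing-store throughput
OVERHEAD_S = 1e-5   # assumed per-request userspace round-trip cost

for size in (4096, 65536, 1 << 20):
    share = efficiency(size, DEVICE_BPS, OVERHEAD_S)
    print(f"{size:>8} bytes per request: {share:.0%} of time spent on data")
```

Under these assumptions a 4 KB request spends most of its time on overhead while a 1 MB request is almost entirely transfer, which matches the qualitative claim about metadata-heavy versus sequential workloads.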


File Map

File Description
01-vfs-layer.md superblock/inode/dentry/file objects, VFS operations tables
02-inodes-and-dentries.md inode on-disk format, dentry cache, dcache lookup
03-page-cache.md address_space, radix tree, writeback, dirty page accounting
04-journaling.md journal commit, ordered vs writeback vs data journaling
05-ext4.md extent trees, block groups, htree directories, delayed allocation
06-xfs.md AG structure, B+ tree allocator, reflink, realtime subvolume
07-btrfs.md CoW B-tree forest, subvolumes, snapshots, RAID, scrub
08-zfs.md ZPL/DMU/SPA stack, ZIL, ARC, dedup, vdev types
09-cow-filesystems.md CoW semantics, snapshot internals, clone efficiency
10-log-structured-fs.md LFS concepts, cleaner, F2FS multi-head logging, segment types
11-distributed-fs.md NFS v4 delegation, CephFS MDS, GlusterFS translators
12-overlay-fs.md OverlayFS mechanics, Docker layer merging, whiteout files
13-fuse.md /dev/fuse, request routing, virtiofs, performance analysis
14-ntfs-apfs.md NTFS MFT, alternate data streams, APFS container/volume model
15-f2fs.md Flash-aware zones, node/data segments, adaptive logging
16-tmpfs-procfs-sysfs.md Memory-backed VFS filesystems, kernel data exposition
17-flash-aware-fs.md JFFS2, UBIFS, MTD layer, wear leveling at filesystem level
18-crash-consistency.md Ordering guarantees, example bugs, testing with dm-flakey

Cross-References

  • Section 11 (Memory Management): page cache is physically part of memory management; mmap of files; DAX bypass
  • Section 12 (Storage Systems): block layer below VFS; I/O scheduler interaction; NVMe namespaces as block devices
  • Section 10 (Synchronization): inode locks, page cache spinlocks, journal locks
  • Section 14 (Device Drivers): MTD layer for raw flash, block device driver interface
  • Section 15 (Networking): NFS/SMB over network stack, RDMA for distributed filesystems
  • Section 17 (Distributed Systems): distributed consistency models as applied to filesystem semantics
  • Section 19 (Virtualization): virtio-fs (virtiofs) shared directories, 9P protocol