01 - VFS Layer

Technical Overview

The Virtual Filesystem Switch (VFS) is the kernel subsystem that provides a uniform interface to all filesystem types. Without VFS, every application would need to know whether it is accessing ext4, Btrfs, NFS, tmpfs, or procfs. VFS presents a single abstract API (open(), read(), write(), stat(), mkdir()) that behaves the same way regardless of the underlying filesystem. Each filesystem supplies its own implementations of the abstract VFS operations, and the VFS dispatches to them through tables of function pointers.
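
The dispatch idea can be sketched in userspace C. This is a toy model under stated assumptions, not kernel code: struct toy_fs_ops, hello_read, empty_read, and toy_vfs_read are hypothetical names invented for this illustration, and the table has a single operation where the real VFS tables have dozens.

```c
#include <stddef.h>
#include <string.h>

/* Toy operation table: one function pointer per abstract operation. */
struct toy_fs_ops {
    const char *name;
    /* Fill buf with up to len bytes; return the number of bytes produced. */
    size_t (*read)(char *buf, size_t len);
};

/* Hypothetical "filesystem" 1: every read yields the bytes "hello". */
static size_t hello_read(char *buf, size_t len)
{
    const char *msg = "hello";
    size_t n = strlen(msg);

    if (n > len)
        n = len;
    memcpy(buf, msg, n);
    return n;
}

/* Hypothetical "filesystem" 2: every read yields zero bytes (empty file). */
static size_t empty_read(char *buf, size_t len)
{
    (void)buf;
    (void)len;
    return 0;
}

const struct toy_fs_ops hello_ops = { "hellofs", hello_read };
const struct toy_fs_ops empty_ops = { "emptyfs", empty_read };

/* Generic layer: the caller never knows which implementation runs. */
size_t toy_vfs_read(const struct toy_fs_ops *ops, char *buf, size_t len)
{
    if (!ops || !ops->read)
        return 0;
    return ops->read(buf, len);
}
```

Calling toy_vfs_read(&hello_ops, buf, sizeof buf) and toy_vfs_read(&empty_ops, buf, sizeof buf) runs different code behind the same call site, which is the property the VFS relies on.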

VFS is one of the canonical examples of the abstraction layer design pattern in OS kernels. It was introduced in SunOS 2.0 (1985) and adopted by Linux in 0.96 (1992).

Prerequisites

Linux kernel architecture (system calls, kernel/user boundary)
Memory management basics (pages, page cache)
Block layer concepts (see 12-storage-systems/05-linux-block-layer.md)
C function pointers and struct-based polymorphism

Core Content

VFS Object Model

VFS defines four primary kernel objects, each with associated operation tables:

VFS Object Hierarchy:

        superblock (mounted FS metadata)
              |
        +-----+------+
        |            |
     inode        inode      (file metadata: permissions, size, blocks)
        |            |
     dentry       dentry     (name -> inode mapping: directory tree)
        |
      file                   (open file instance: one per open() call, referenced by fds)

struct super_block: Represents a mounted filesystem instance.

struct super_block {
    dev_t              s_dev;        // underlying block device
    unsigned long      s_blocksize;  // block size in bytes
    unsigned char      s_blocksize_bits;
    loff_t             s_maxbytes;   // max file size
    struct file_system_type *s_type; // filesystem type (ext4, btrfs, ...)
    const struct super_operations *s_op; // superblock operations
    struct dentry      *s_root;      // root dentry of this mount
    struct list_head   s_inodes;     // all inodes on this superblock
    void               *s_fs_info;   // filesystem-specific data (ext4_sb_info, etc.)
    // ... many more fields
};

struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);
    void (*dirty_inode)(struct inode *, int flags);
    int  (*write_inode)(struct inode *, struct writeback_control *);
    void (*evict_inode)(struct inode *);
    void (*put_super)(struct super_block *);
    int  (*sync_fs)(struct super_block *, int wait);
    int  (*statfs)(struct dentry *, struct kstatfs *);
    int  (*remount_fs)(struct super_block *, int *, char *);
    // ...
};
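
From userspace, the statfs entry of super_operations is reached through the statfs(2) system call. The sketch below is a minimal illustration, assuming a Linux system with the kernel's linux/magic.h header installed; fs_type_name is a hypothetical helper, and the list of magic numbers it recognizes is deliberately incomplete.

```c
#include <stddef.h>
#include <sys/vfs.h>      /* statfs(2), struct statfs */
#include <linux/magic.h>  /* EXT4_SUPER_MAGIC, TMPFS_MAGIC, ... */

/* Return a short name for the filesystem holding `path`, or NULL on error. */
const char *fs_type_name(const char *path)
{
    struct statfs st;

    /* The kernel answers this by calling the filesystem's statfs method. */
    if (statfs(path, &st) != 0)
        return NULL;

    switch ((unsigned long)st.f_type) {
    case EXT4_SUPER_MAGIC:      return "ext2/3/4";  /* all three share one magic */
    case TMPFS_MAGIC:           return "tmpfs";
    case PROC_SUPER_MAGIC:      return "proc";
    case SYSFS_MAGIC:           return "sysfs";
    case OVERLAYFS_SUPER_MAGIC: return "overlayfs";
    case NFS_SUPER_MAGIC:       return "nfs";
    default:                    return "other";
    }
}
```

For example, fs_type_name("/proc") would typically report "proc" on a system where procfs is mounted there.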

struct inode: Represents file metadata (NOT filename — filename is in dentry).

struct inode {
    umode_t            i_mode;    // permissions + file type (S_IFREG, S_IFDIR, ...)
    unsigned short     i_opflags;
    kuid_t             i_uid;     // owner UID
    kgid_t             i_gid;     // owner GID
    unsigned int       i_flags;
    struct timespec64  i_atime;   // access time
    struct timespec64  i_mtime;   // modification time
    struct timespec64  i_ctime;   // inode change time
    loff_t             i_size;    // file size in bytes
    blkcnt_t           i_blocks;  // 512-byte block count
    unsigned long      i_ino;     // inode number (unique within filesystem)
    atomic_t           i_count;   // reference count
    const struct inode_operations *i_op;   // inode operations
    const struct file_operations  *i_fop;  // file operations
    struct address_space          *i_mapping; // page cache for this file
    void *i_private;  // fs or device private pointer (ext4 instead embeds struct inode inside ext4_inode_info)
    // ...
};
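
The split between names (dentries) and metadata (inodes) is visible from userspace through hard links: two names, one inode. The following sketch assumes /tmp is writable and supports hard links; names_share_inode is a hypothetical helper written for this illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a temp file, give it a second name with link(2), and report
 * whether both names resolve to the same inode.
 * Returns 1 if they do, 0 if not, -1 on error. *nlink_out receives the
 * link count observed through the second name.
 */
int names_share_inode(unsigned long *nlink_out)
{
    char first[] = "/tmp/vfs_demo_XXXXXX";
    char second[sizeof(first) + 8];
    struct stat a, b;
    int fd, result = -1;

    fd = mkstemp(first);
    if (fd < 0)
        return -1;
    close(fd);

    snprintf(second, sizeof(second), "%s.link", first);
    if (link(first, second) == 0) {
        if (stat(first, &a) == 0 && stat(second, &b) == 0) {
            /* Same device and same inode number: one inode, two names. */
            result = (a.st_dev == b.st_dev && a.st_ino == b.st_ino);
            if (nlink_out)
                *nlink_out = (unsigned long)b.st_nlink;
        }
        unlink(second);
    }
    unlink(first);
    return result;
}
```

The link count of 2 lives in the inode, not in either name, which is why removing one name leaves the file reachable through the other.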

struct dentry: Maps a filename to an inode within a directory. Represents one component of a path.

struct dentry {
    unsigned int      d_flags;        // DCACHE_MOUNTED, DCACHE_REFERENCED, ...
    seqcount_spinlock_t d_seq;
    struct hlist_bl_node d_hash;      // hash table entry in dcache
    struct dentry     *d_parent;      // parent dentry
    struct qstr        d_name;        // component name (string + hash)
    struct inode      *d_inode;       // NULL for negative dentry
    unsigned char      d_iname[DNAME_INLINE_LEN]; // short name inline storage
    const struct dentry_operations *d_op; // dentry operations
    struct super_block *d_sb;         // owning superblock
    void              *d_fsdata;      // filesystem-specific data
    struct list_head   d_child;       // sibling list under parent
    struct list_head   d_subdirs;     // child dentry list
    // ...
};

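struct file: Represents one open file description, created by each open() call. Multiple file structs can reference the same inode (e.g., two processes opening the same file), and multiple file descriptors can reference the same file struct (after dup() or fork()).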

struct file {
    struct path        f_path;        // dentry + vfsmount (identifies file + mount)
    struct inode      *f_inode;       // inode (cached from f_path)
    const struct file_operations *f_op; // file operations
    loff_t             f_pos;         // current file offset (shared by every fd that refers to this struct file)
    struct fown_struct f_owner;       // for async I/O notification
    unsigned int       f_flags;       // O_RDONLY, O_WRONLY, O_NONBLOCK, etc.
    // ...
};
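
Because the offset lives in struct file, descriptors that share a struct file share an offset, while separate open() calls do not. A minimal userspace sketch, assuming /tmp is writable; compare_offsets is a hypothetical helper written for this illustration.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns 0 on success and reports the offsets observed after reading
 * 4 bytes through the first descriptor:
 *   *dup_off   offset seen via a dup()'ed descriptor (same struct file)
 *   *open_off  offset seen via a second open() (a separate struct file)
 */
int compare_offsets(off_t *dup_off, off_t *open_off)
{
    char path[] = "/tmp/vfs_off_XXXXXX";
    char buf[4];
    int fd1, fd2 = -1, fd3 = -1, rc = -1;

    fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;
    if (write(fd1, "0123456789", 10) != 10 || lseek(fd1, 0, SEEK_SET) != 0)
        goto out;

    fd2 = dup(fd1);              /* new fd, same open file description */
    fd3 = open(path, O_RDONLY);  /* new open file description */
    if (fd2 < 0 || fd3 < 0)
        goto out;

    if (read(fd1, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
        goto out;

    *dup_off  = lseek(fd2, 0, SEEK_CUR);  /* moved together with fd1 */
    *open_off = lseek(fd3, 0, SEEK_CUR);  /* unaffected by the read */
    rc = 0;
out:
    if (fd3 >= 0)
        close(fd3);
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return rc;
}
```

After the call, the dup()'ed descriptor reports offset 4 and the separately opened one reports offset 0.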

VFS Operation Tables

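Each VFS object has an associated operations struct: a table of function pointers. A filesystem defines these tables statically and attaches them to each object it instantiates (for example, setting i_op and i_fop when it reads an inode).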

inode_operations — operations on inodes (metadata):

struct inode_operations {
    struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);
    // lookup: given parent dir inode + dentry name, find child inode
    int (*create)(struct user_namespace *, struct inode *, struct dentry *,
                  umode_t, bool);   // create regular file
    int (*link)(struct dentry *, struct inode *, struct dentry *);    // hard link
    int (*unlink)(struct inode *, struct dentry *);                   // delete file
    int (*symlink)(struct user_namespace *, struct inode *, struct dentry *, const char *);
    int (*mkdir)(struct user_namespace *, struct inode *, struct dentry *, umode_t);
    int (*rmdir)(struct inode *, struct dentry *);
    int (*rename)(struct user_namespace *, struct inode *, struct dentry *,
                  struct inode *, struct dentry *, unsigned int);
    int (*getattr)(struct user_namespace *, const struct path *, struct kstat *,
                   u32, unsigned int);  // stat()
    int (*setattr)(struct user_namespace *, struct dentry *, struct iattr *);
    // ...
};

file_operations — operations on open file descriptors:

struct file_operations {
    loff_t   (*llseek)(struct file *, loff_t, int);
    ssize_t  (*read)(struct file *, char __user *, size_t, loff_t *);
    ssize_t  (*write)(struct file *, const char __user *, size_t, loff_t *);
    ssize_t  (*read_iter)(struct kiocb *, struct iov_iter *);    // iov_iter-based read (sync or async)
    ssize_t  (*write_iter)(struct kiocb *, struct iov_iter *);   // iov_iter-based write (sync or async)
    int      (*mmap)(struct file *, struct vm_area_struct *);
    int      (*open)(struct inode *, struct file *);
    int      (*flush)(struct file *, fl_owner_t id);
    int      (*release)(struct inode *, struct file *);
    int      (*fsync)(struct file *, loff_t, loff_t, int datasync);
    long     (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
    // ...
};

address_space_operations — operations bridging the page cache to the filesystem:

struct address_space_operations {
    int (*writepage)(struct page *, struct writeback_control *);
    // writepage: write a dirty page to the block device
    void (*readahead)(struct readahead_control *);
    // readahead: submit I/O for a range of pages (new interface)
    int (*write_begin)(struct file *, struct address_space *, loff_t, unsigned,
                       struct page **, void **);
    // write_begin: prepare a page for writing (allocate, lock, bring up-to-date)
    int (*write_end)(struct file *, struct address_space *, loff_t, unsigned,
                     unsigned, struct page *, void *);
    // write_end: finish write, dirty the page
    sector_t (*bmap)(struct address_space *, sector_t);
    // bmap: logical block within the file -> physical block number
    bool (*dirty_folio)(struct address_space *, struct folio *);
    int (*launder_folio)(struct folio *);  // write back a dirty folio before it is freed
    // ...
};

Path Resolution (namei)

When you call open("/home/user/file.txt", O_RDONLY), the kernel must resolve the path to an inode. This is called namei (name resolution), implemented in fs/namei.c.

Path resolution for "/home/user/file.txt":

Start: process root (current->fs->root) for absolute paths, or process CWD for relative paths
              |
    Lookup "home" in dcache (dentry cache)
    Found? → use cached dentry
    Not found? → call parent_inode->i_op->lookup(parent_inode, dentry, 0)
    → filesystem resolves "home" → creates dentry, sets d_inode
              |
    Lookup "user" in "home" directory...
              |
    Lookup "file.txt" in "user" directory...
              |
    Return final dentry → final inode → open(inode) → allocate struct file

Key optimizations:

- dcache hash lookup: d_lookup(parent, &name) finds a cached child through a hash table, O(1) on average.
- RCU path walk: Since Linux 2.6.38, path resolution uses RCU (Read-Copy-Update) for lockless traversal of already-cached dentries. It falls back to the locked, reference-counted walk only when it cannot proceed locklessly, for example at an uncached component.
- Symlink following: If a component is a symlink, the target path replaces that component in the walk. At most MAXSYMLINKS (40) symlinks are followed per resolution; beyond that the lookup fails with ELOOP, which prevents loops.
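
The component-by-component walk with a cache in front of the filesystem lookup can be modeled in a few lines of userspace C. This is a toy sketch, not the kernel algorithm: every toy_* name is hypothetical, the cache is a direct-mapped table that simply overwrites on collision, and the stand-in for i_op->lookup pretends every name exists.

```c
#include <stdio.h>
#include <string.h>

#define TOY_CACHE_SLOTS 64
#define TOY_NAME_MAX    32

/* One cached (parent, name) -> child mapping, standing in for a dentry. */
struct toy_dentry {
    int  used;
    int  parent;               /* id of the parent directory */
    int  child;                /* id this name resolves to */
    char name[TOY_NAME_MAX];
};

struct toy_walk_stats { int hits; int misses; };

static struct toy_dentry toy_cache[TOY_CACHE_SLOTS];
static int toy_next_id = 1;    /* id 0 is the root directory */

static unsigned toy_hash(int parent, const char *name)
{
    unsigned h = (unsigned)parent * 31u;

    while (*name)
        h = h * 31u + (unsigned char)*name++;
    return h % TOY_CACHE_SLOTS;
}

/* Stand-in for i_op->lookup(): pretend the name exists, hand out a new id. */
static int toy_fs_lookup(int parent, const char *name)
{
    (void)parent;
    (void)name;
    return toy_next_id++;
}

/* Resolve one component: cache first, "filesystem" on a miss. */
static int toy_lookup_one(int parent, const char *name, struct toy_walk_stats *st)
{
    struct toy_dentry *d = &toy_cache[toy_hash(parent, name)];

    if (d->used && d->parent == parent && strcmp(d->name, name) == 0) {
        st->hits++;
        return d->child;
    }
    st->misses++;
    d->used = 1;
    d->parent = parent;
    d->child = toy_fs_lookup(parent, name);
    snprintf(d->name, sizeof(d->name), "%s", name);
    return d->child;
}

/* Walk an absolute path one component at a time, starting from the root. */
int toy_path_walk(const char *path, struct toy_walk_stats *st)
{
    char comp[TOY_NAME_MAX];
    int cur = 0;

    while (*path) {
        size_t n = 0;

        while (*path == '/')
            path++;                         /* skip separators */
        while (*path && *path != '/') {
            if (n < sizeof(comp) - 1)
                comp[n++] = *path;          /* overlong names are truncated */
            path++;
        }
        comp[n] = '\0';
        if (n > 0)
            cur = toy_lookup_one(cur, comp, st);
    }
    return cur;
}
```

Walking "/home/user/file.txt" twice on a fresh cache records three misses on the first pass and three hits on the second, mirroring the cold and warm dcache cases.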

Mount and Mount Namespaces

Mount Internals:

struct vfsmount {
    struct dentry *mnt_root;         // root dentry of this mount
    struct super_block *mnt_sb;      // superblock
    int mnt_flags;                   // MNT_READONLY, MNT_NOSUID, etc.
};

struct mount {                       // kernel-internal (not exposed via vfsmount)
    struct vfsmount mnt;
    struct mount *mnt_parent;        // parent mount (mountpoint)
    struct dentry *mnt_mountpoint;   // dentry where this FS is mounted
    struct list_head mnt_mounts;     // list of child mounts
    // ...
};

When you run mount /dev/sda1 /mnt/data, the kernel:

1. Calls the filesystem type's mount() method, which creates (or reuses) a super_block for the device
2. Creates a mount struct linking the new super_block to the /mnt/data dentry
3. Inserts the mount into the mount tree

During path resolution, the kernel checks whether the current dentry is a mountpoint (d_flags & DCACHE_MOUNTED). If it is, follow_mount() replaces the current dentry/mount pair with the child mount's root.

Mount namespaces (Linux 2.4.19+, CLONE_NEWNS): Each process belongs to a mount namespace, which gives it its own view of the filesystem hierarchy. Docker and other container runtimes use mount namespaces to isolate container filesystems. clone(CLONE_NEWNS) creates a new mount namespace initialized with a copy of the caller's mount tree; later mounts and unmounts in one namespace are invisible in the other unless mount propagation is set to shared.

# View mount namespaces
ls -la /proc/self/ns/mnt
readlink /proc/*/ns/mnt 2>/dev/null | sort -u  # all unique mount namespaces (run as root to see every process)

# Create a new mount namespace (unshare)
unshare --mount bash
# Now you can mount/unmount without affecting the host
mount tmpfs /tmp -t tmpfs
# Host /tmp is unaffected
exit  # namespace destroyed, mounts released
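
The same namespace identity can be read programmatically. A minimal sketch, assuming /proc is mounted and the kernel exposes /proc/self/ns/mnt; current_mnt_namespace is a hypothetical helper written for this illustration.

```c
#include <stdio.h>
#include <unistd.h>

/*
 * Copy the caller's mount-namespace identifier (a string of the form
 * "mnt:[<inode number>]") into buf. Returns 0 on success, -1 on error.
 */
int current_mnt_namespace(char *buf, size_t len)
{
    ssize_t n;

    if (len == 0)
        return -1;
    /* The ns/mnt entry is a symlink whose target names the namespace. */
    n = readlink("/proc/self/ns/mnt", buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}
```

Two processes are in the same mount namespace exactly when this string matches; a process started under unshare --mount would report a different value than its parent.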

Virtual Filesystems: /proc and /sys

/proc and /sys are virtual filesystems — they have no backing block device. Their inodes and data are generated on-demand by the kernel.

/proc (procfs): Per-process information + kernel parameters.

- Mounting procfs goes through proc_fs_type (proc_mount() in older kernels), which creates a super_block with no backing device
- Each PID directory is a dynamically created dentry
- Reading /proc/1234/status calls the file's read operation, which runs a kernel function that formats data from the process's task_struct

/sys (sysfs): Kernel object model (kobjects) exported as a filesystem.

- Each kobject registered with the kernel gets a sysfs directory
- Reading /sys/block/sda/size invokes the attribute's show() callback, which reports the device capacity in 512-byte sectors

tmpfs: In-memory filesystem backed by the page cache. Files live in RAM, or in swap under memory pressure. Example: mount -t tmpfs tmpfs /tmp.

debugfs: Filesystem for kernel debugging — mounted at /sys/kernel/debug/. Used by ftrace, blktrace, KVM debug info.

configfs: User-space-driven kernel configuration via filesystem. Used to configure NVMe-oF targets (see 04-nvme.md).

VFS Object Relationship Diagram

Process
  |
  | fd table
  |
  v
+----------+           +----------+           +----------+
| struct   |---------->| struct   |---------->| struct   |
| files_   |   fd[n]   | file     |  f_inode  | inode    |
| struct   |           | per open |           |          |
+----------+           +----------+           +----------+
                           |                       |
                       f_path.dentry            i_op
                           |                   i_fop
                           v                   i_mapping
                       +----------+               |
                       | struct   |               v
                       | dentry   |           +----------+
                       | (cached  |           | address_ |
                       |  name)   |           | space    |
                       +----------+           | (page    |
                           |                  |  cache)  |
                       d_parent               +----------+
                           |
                       +----------+
                       | parent   |
                       | dentry   |
                       +----------+
                           |
                       super_block
                           |
                       +----------+
                       | on-disk  |
                       | inode    |
                       | table    |
                       +----------+

Historical Context

The concept of a filesystem switch was introduced by Peter Weinberger at Bell Labs in Research UNIX in the early 1980s. Sun Microsystems implemented its vnode/VFS architecture in SunOS 2.0 to support both the local UFS filesystem and NFS simultaneously behind a clean abstraction, and described it in Kleiman's 1986 USENIX paper "Vnodes: An Architecture for Multiple File System Types in Sun UNIX".

Linus Torvalds imported a VFS into Linux 0.96 (1992), allowing both the Minix filesystem and later ext2 to coexist. Rick Faith and others extended the VFS design through the 2.0 and 2.4 kernel eras. The current VFS design (with dcache, page cache, address_space) was substantially shaped by Alexander Viro, who has maintained fs/namei.c and the VFS layer since the late 1990s.

The dcache (dentry cache) RCU path walk optimization (Al Viro, Nick Piggin, 2011) was a significant performance improvement: pathname resolution dropped from requiring spinlocks to being fully lockless for the common case of cached dentries.

Production Examples

Docker overlay filesystem: Docker uses overlayfs (OverlayFS, merged in Linux 3.18) — a VFS-level stacking filesystem. It presents a union of an "upper" (writable container layer) and "lower" (read-only image layers) directory. All VFS operations go through the overlay shim before reaching the underlying ext4 or xfs layers. This is how Docker images achieve copy-on-write semantics without copying entire filesystems at container start.

NFS dentry caching: NFS implements its own dentry validity checking (nfs_lookup_revalidate). Whenever the dcache returns a cached dentry for an NFS-mounted path, the NFS client decides whether that dentry can still be trusted, asking the server once its cached attributes have expired. Using a file handle that the server no longer recognizes fails with ESTALE. The actimeo mount option controls how long NFS trusts cached attribute metadata.

Debugging Notes

# inode cache statistics
cat /proc/sys/fs/inode-nr
# nr_inodes: total allocated inodes
# nr_unused: inodes not currently referenced (available for eviction)

# dentry cache statistics
cat /proc/sys/fs/dentry-state
# nr_dentry: total dcache entries
# nr_unused: entries not currently in use

# View file descriptor usage system-wide
cat /proc/sys/fs/file-nr
# (allocated, free, max) — if allocated approaches max, increase fs.file-max

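# Track VFS operations with ftrace (function tracer, filtered to vfs_* functions)
echo 'vfs_*' > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace_pipe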

# strace a process to see VFS syscalls
strace -e trace=file,desc -p $PID

# Check which filesystem a path is on
stat -f /path/to/file
df -T /path/to/file

# Verify inode operations for a filesystem
# (kernel source: look at fs/<fstype>/<fstype>.c for inode_operations initialization)
# e.g., fs/ext4/namei.c: ext4_dir_inode_operations
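
For monitoring scripts that need the dcache counters as numbers, the file can be parsed directly. This sketch assumes the documented field order of /proc/sys/fs/dentry-state (nr_dentry, nr_unused, age_limit, want_pages, followed by further fields that it ignores); struct dentry_state, parse_dentry_state, and read_dentry_state are hypothetical names.

```c
#include <stdio.h>

/* First four fields of /proc/sys/fs/dentry-state, in file order. */
struct dentry_state {
    long nr_dentry;   /* total dentries allocated */
    long nr_unused;   /* dentries not actively referenced */
    long age_limit;   /* age in seconds used by reclaim */
    long want_pages;  /* pages requested by the system */
};

/* Parse one line in dentry-state format. Returns 0 on success, -1 on error. */
int parse_dentry_state(const char *line, struct dentry_state *out)
{
    int n = sscanf(line, "%ld %ld %ld %ld",
                   &out->nr_dentry, &out->nr_unused,
                   &out->age_limit, &out->want_pages);

    return n == 4 ? 0 : -1;
}

/* Read and parse the live file. Returns 0 on success, -1 on error. */
int read_dentry_state(struct dentry_state *out)
{
    char line[256];
    FILE *f = fopen("/proc/sys/fs/dentry-state", "r");
    int rc = -1;

    if (!f)
        return -1;
    if (fgets(line, sizeof(line), f))
        rc = parse_dentry_state(line, out);
    fclose(f);
    return rc;
}
```

Sampling read_dentry_state() periodically and watching nr_unused relative to nr_dentry gives a rough view of how much of the dcache is reclaimable.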

Security Implications

Mount namespace attacks: A process in a user namespace can create mount namespaces and bind-mount suid binaries onto locations that override system binaries. The nosuid mount flag prevents setuid execution from the mounted filesystem. Containers should always use nosuid,nodev,noexec on non-system mounts.

Symlink attacks (TOCTOU): Between checking a path and using it, a symlink can be swapped to point to a sensitive file. The kernel introduced openat2() (Linux 5.6) with RESOLVE_NO_SYMLINKS and RESOLVE_BENEATH flags to prevent symlink attacks in security-sensitive applications.

dentry negative caching: Negative dentries (caching "this file does not exist") avoid repeated filesystem lookups for non-existent names, but on network filesystems they can briefly hide files recently created by another client, and large numbers of them consume memory. Dropping reclaimable slab caches (echo 2 > /proc/sys/vm/drop_caches) frees unused dentries, including negative ones.

/proc filesystem exposure: /proc/<pid>/ exposes process memory maps, file descriptors, and environment variables. Mounting /proc with hidepid=2 prevents unprivileged users from seeing other users' /proc/<pid> entries; containers should combine a private /proc mount with a PID namespace so that host processes are not visible at all.

Performance Implications

dcache hit rate: For path-heavy workloads (many file opens), dcache hit rate determines performance. A full ls -lR traversal on a large filesystem will warm the dcache; subsequent traversals are much faster. Monitor dcache pressure with cat /proc/sys/fs/dentry-state.

inode cache pressure: When the kernel evicts inodes from the inode cache (icache), it must re-read them from disk on next access. For workloads with many small files (email servers, build systems), inode cache pressure is a major bottleneck. Check cat /proc/sys/fs/inode-nr and set vm.vfs_cache_pressure to tune aggressiveness of inode/dentry eviction.

vm.vfs_cache_pressure (default: 100):

- 0: never evict inode/dentry caches (dangerous: can OOM)
- 100: reclaim inode/dentry cache at same rate as page cache
- 200: aggressively reclaim (for systems with very large filesystems and limited RAM)

Failure Modes and Real Incidents

dcache memory exhaustion: Systems with millions of files and long-lived processes can accumulate millions of dentries in memory. If vm.vfs_cache_pressure=0 (or very low), the dcache grows unbounded → OOM. Resolution: set vm.vfs_cache_pressure=100 (default) and monitor /proc/sys/fs/dentry-state.

NFS stale dentry storm: When an NFS server reboots, clients that reuse cached dentries whose file handles the server no longer recognizes get ESTALE. In large environments (thousands of clients), the simultaneous revalidation after a server reboot can overwhelm the NFS server with lookup RPCs. Mitigation: NFS server keepalive monitoring plus client actimeo tuning.

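/proc/sys/fs/file-max exhaustion: file-max is the system-wide limit on open file handles (struct file instances). When it is hit, new open() calls fail with ENFILE; hitting the per-process limit fails with EMFILE instead. Common in high-connection-count web servers. Fix: raise the limit (echo 1000000 > /proc/sys/fs/file-max, or sysctl -w fs.file-max=1000000) and increase per-process limits (ulimit -n 1048576).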
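
When diagnosing this in an application, it helps to distinguish the system-wide limit (ENFILE) from the per-process limit (EMFILE, governed by RLIMIT_NOFILE). A small sketch; fd_limits and explain_open_errno are hypothetical helper names written for this illustration.

```c
#include <errno.h>
#include <sys/resource.h>

/* Report the per-process descriptor limits. Returns 0 on success. */
int fd_limits(unsigned long long *soft, unsigned long long *hard)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *soft = (unsigned long long)rl.rlim_cur;
    *hard = (unsigned long long)rl.rlim_max;
    return 0;
}

/* Classify an open() failure by the errno value it left behind. */
const char *explain_open_errno(int err)
{
    switch (err) {
    case EMFILE:
        return "per-process descriptor limit reached (RLIMIT_NOFILE)";
    case ENFILE:
        return "system-wide open file limit reached (fs.file-max)";
    default:
        return "unrelated to descriptor limits";
    }
}
```

A server would typically call explain_open_errno(errno) right after a failed open() or accept() and log the result, so that operators know whether to raise ulimit -n or fs.file-max.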

Modern Usage

io_uring VFS integration: io_uring (Linux 5.1+) uses VFS read_iter/write_iter operations for file I/O, bypassing the traditional read()/write() syscall path. This enables batching of many VFS operations in a single syscall.
FUSE (Filesystem in Userspace): The VFS layer routes filesystem operations to a userspace daemon via /dev/fuse. Enables filesystems like SSHFS, NTFS-3G, and Google Drive mounts without filesystem-specific kernel modules (only the generic fuse module is needed). Latency overhead: ~2-10x vs native kernel filesystem due to context switches.
overlayfs in containers: OverlayFS is now the default Docker/containerd storage driver on most Linux distributions. VFS's abstraction makes this possible without modifying ext4 or any underlying filesystem.
erofs (Enhanced Read-Only File System): Read-only filesystem with optional compression, used in Android system images and container images. It implements the standard VFS operations but stores data in a compact read-only layout, compressed with LZ4 or, on newer kernels, other algorithms such as LZMA or Zstandard.

Future Directions

VFS scalability work: Ongoing work aims to reduce locking in VFS path resolution, continuing Alexander Viro's RCU walk improvements. Locking around the inode lock (i_rwsem) is progressively being made finer-grained.
Rust VFS bindings: The Linux kernel's Rust integration project includes early work on safe Rust wrappers for VFS operations, enabling filesystem modules written in Rust.
Cachefiles / ondemand read: Kernel feature for demand-loading filesystem data through a userspace cache daemon (used with EROFS for lazy loading of container images).

Exercises

Write a C program that opens a file, forks, and has both parent and child read from the same fd. Observe that the file offset is shared (it's one struct file). Then open the same file in both parent and child separately and observe independent offsets. Explain using the VFS object model.
Use strace to trace ls /etc/ and count the number of getdents64, newfstatat, and open/openat calls. Map each syscall to its corresponding VFS operation (e.g., getdents64 → file_operations.iterate_shared).
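Create a 1-million-file directory on a disk-backed filesystem such as ext4 (tmpfs keeps its dentries pinned in memory, so dropping caches would not evict them). Measure ls time. Then clear the dcache (echo 3 > /proc/sys/vm/drop_caches) and re-run ls. Measure the difference. This quantifies dcache impact.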
Read fs/namei.c:path_openat(). Identify where the dcache lookup happens and where the filesystem's i_op->lookup() is called on a miss. Trace the code path for opening /etc/passwd.
Create a mount namespace with unshare --mount. Inside the namespace, bind-mount /tmp to /opt. Verify that /opt shows /tmp contents inside the namespace but not outside. Explain using struct mount and mnt_mountpoint.

References

Viro, A. (maintainer): fs/namei.c, fs/dcache.c, fs/inode.c — Linux kernel source
Kleiman, S.R. "Vnodes: An Architecture for Multiple File System Types in Sun UNIX." USENIX 1986. (original VFS paper)
Corbet, J. et al. Linux Device Drivers, 3rd ed., Chapter 14 (The Linux Device Model)
Love, R. Linux Kernel Development, 3rd ed., Chapter 13 (The Virtual Filesystem)
LWN.net: "Dcache locking scalability" — https://lwn.net/Articles/360814/
LWN.net: "RCU path walk" — https://lwn.net/Articles/419811/
Linux kernel documentation: Documentation/filesystems/vfs.rst
Bovet, D. and Cesati, M. Understanding the Linux Kernel, 3rd ed., Chapter 12