02 - Inodes and Dentries

Technical Overview

The inode (index node) is the on-disk and in-memory structure that stores all metadata about a file — except its name. The dissociation of name from metadata is fundamental to how UNIX filesystems work: names live in directory entries (dentries), and multiple names can point to the same inode (hard links). When the last name is removed and all open file descriptors are closed, the inode is freed and its data blocks reclaimed.

The dentry cache (dcache) is the in-memory representation of the directory tree. It maps path components to inodes and is the primary caching layer for metadata lookups. Both inodes and dentries have cache management, reference counting, and lifecycle semantics that every filesystem engineer must understand.

Prerequisites

VFS layer architecture (see 01-vfs-layer.md)
Basic filesystem structure (superblock, block groups)
Linux memory management (page cache, LRU)
stat() syscall familiarity

Core Content

The Inode: File Metadata Without a Name

An inode stores everything about a file except its name:

stat() output mapped to inode fields:

$ stat /etc/passwd
  File: /etc/passwd
  Size: 2890            Blocks: 8          IO Block: 4096   regular file
Device: 802h/2050d      Inode: 660234      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-01-15 08:23:41.123456789 +0000
Modify: 2024-01-10 12:34:56.789012345 +0000
Change: 2024-01-10 12:34:56.789012345 +0000
 Birth: -

Mapped to struct inode fields:
  i_size     = 2890
  i_blocks   = 8 (8 × 512-byte units)
  i_mode     = 0100644 (S_IFREG | 0644)
  i_ino      = 660234
  i_nlink    = 1          (hard link count)
  i_uid      = 0
  i_gid      = 0
  i_atime    = last access time
  i_mtime    = last modification of file data
  i_ctime    = last change of inode (metadata or data)
  Birth      = creation time; ext4 keeps it as i_crtime in the extended part of
               256-byte inodes, and stat prints "-" when it cannot obtain it via statx()

What the inode does NOT store:
- Filename (stored in the directory entry that points to this inode)
- Full path (derived dynamically via the dcache)
- Access control lists (stored in extended attributes, which the inode references)

Inode number: Unique within a filesystem (superblock). Two files on different filesystems can have the same inode number. This is why cp -a across filesystems cannot preserve inode numbers.
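
The same fields can be read from user space. The following is a minimal sketch, assuming a Linux or other POSIX system; describe_inode and file_type are illustrative helper names, not part of any library. It uses os.lstat so that a symlink reports its own inode rather than its target's.

```python
import os
import stat
import tempfile


def file_type(mode):
    """Translate the type bits of i_mode into a short label."""
    if stat.S_ISREG(mode):
        return "regular"
    if stat.S_ISDIR(mode):
        return "directory"
    if stat.S_ISLNK(mode):
        return "symlink"
    return "other"


def describe_inode(path):
    """Return the inode metadata that stat() exposes for path (symlinks not followed)."""
    st = os.lstat(path)
    return {
        "ino": st.st_ino,                         # i_ino
        "dev": st.st_dev,                         # identifies the filesystem
        "nlink": st.st_nlink,                     # i_nlink
        "size": st.st_size,                       # i_size
        "blocks_512": st.st_blocks,               # i_blocks, in 512-byte units
        "type": file_type(st.st_mode),            # type bits of i_mode
        "perms": oct(stat.S_IMODE(st.st_mode)),   # permission bits of i_mode
        "uid": st.st_uid,
        "gid": st.st_gid,
    }


# Demo on a scratch file so the example does not depend on /etc/passwd.
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * 2890)
    tmp.flush()
    os.chmod(tmp.name, 0o644)
    print(describe_inode(tmp.name))
```

Note that the pair (dev, ino), not ino alone, identifies a file system-wide, which matches the per-filesystem uniqueness described above.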

On-Disk Inode: ext4 Example

ext4 stores inodes in an inode table — a fixed-size array at the beginning of each block group:

ext4 inode table layout:

Block Group 0:
+------------------+
| Superblock       |  Primary copy (backups in group 1 and in groups that are powers of 3, 5, and 7)
+------------------+
| Group Descriptors|  One entry per block group
+------------------+
| Block Bitmap     |  1 bit per block (allocated/free)
+------------------+
| Inode Bitmap     |  1 bit per inode (allocated/free)
+------------------+
| Inode Table      |  Fixed array: inodes_per_group × inode_size (256 bytes by default in ext4)
| [inode 1]        |  inode 1 = bad blocks inode
| [inode 2]        |  inode 2 = root directory "/"
| [inode 3-10]     |  reserved by ext4
| [inode 11+]      |  normal file inodes
+------------------+
| Data Blocks      |  File and directory data
+------------------+

ext4 disk inode structure (256 bytes, struct ext4_inode in fs/ext4/ext4.h):

Bytes    Field
0-1      i_mode: file type + permissions
2-3      i_uid (lower 16 bits)
4-7      i_size_lo: lower 32 bits of file size
8-11     i_atime
12-15    i_ctime
16-19    i_mtime
20-23    i_dtime: deletion time
24-25    i_gid (lower 16 bits)
26-27    i_links_count: hard link count
28-31    i_blocks_lo: lower 32 bits of block count
32-35    i_flags: EXT4_EXTENTS_FL, EXT4_INLINE_DATA_FL, etc.
36-39    osd1: OS-dependent field
40-99    i_block[15]: block pointers OR extent tree root (if EXTENTS_FL set)
100-103  i_generation: NFS generation number
104-107  i_file_acl_lo: extended attribute block (lower 32 bits)
108-111  i_size_high: upper 32 bits of file size (for files of 4 GiB or more)
...
116-127  osd2: OS-dependent field, including the upper 16 bits of uid and gid
         (for uid/gid > 65535)
128+     Extended fields (inode_size=256): nanosecond timestamps, crtime
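
To make the layout concrete, here is a hedged sketch that decodes the first 128 bytes of an on-disk inode with Python's struct module. The field order and offsets follow my reading of struct ext4_inode in fs/ext4/ext4.h (i_block occupying bytes 40-99) and should be checked against the kernel tree you are working with. parse_inode and the synthetic sample buffer are illustrative; nothing here reads a real device.

```python
import struct

# First 128 bytes of struct ext4_inode, little-endian. osd1 and osd2 stay opaque.
EXT4_INODE_CORE = struct.Struct("<HHIIIIIHHIII60sIIII12s")
FIELDS = (
    "i_mode", "i_uid", "i_size_lo", "i_atime", "i_ctime", "i_mtime",
    "i_dtime", "i_gid", "i_links_count", "i_blocks_lo", "i_flags",
    "osd1", "i_block", "i_generation", "i_file_acl_lo", "i_size_high",
    "i_obso_faddr", "osd2",
)
EXT4_EXTENTS_FL = 0x00080000  # inode uses an extent tree in i_block


def parse_inode(raw):
    """Decode the fixed 128-byte core of an ext4 inode into a dict."""
    fields = dict(zip(FIELDS, EXT4_INODE_CORE.unpack_from(raw)))
    # The 64-bit file size is split across two 32-bit fields.
    fields["size"] = (fields["i_size_high"] << 32) | fields["i_size_lo"]
    fields["uses_extents"] = bool(fields["i_flags"] & EXT4_EXTENTS_FL)
    return fields


# Synthetic inode: regular file, mode 0644, size 4 GiB + 2890 bytes, one link.
sample = EXT4_INODE_CORE.pack(
    0o100644, 0, 2890, 0, 0, 0, 0, 0, 1, 8,
    EXT4_EXTENTS_FL, 0, bytes(60), 0, 0, 1, 0, bytes(12),
)
decoded = parse_inode(sample)
print(EXT4_INODE_CORE.size, decoded["size"], decoded["uses_extents"])
```

On a real filesystem the raw bytes would come from the inode table (for example via debugfs), which this sketch deliberately does not attempt.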

Block Pointers vs. Extents

Legacy block pointers (ext2/ext3): i_block[15] contains:
- i_block[0..11]: direct pointers to data blocks
- i_block[12]: single indirect (points to a block of pointers)
- i_block[13]: double indirect
- i_block[14]: triple indirect
- Maximum file size: limited by triple indirect depth

ext4 extents (the default for files created on ext4): the EXT4_EXTENTS_FL flag is set in i_flags, and the i_block field stores the root node of an extent tree. An extent maps a range of logical blocks to a contiguous range of physical blocks:

struct ext4_extent {
    __le32 ee_block;    // first logical block
    __le16 ee_len;      // number of blocks in extent (max 32768)
    __le16 ee_start_hi; // upper 16 bits of physical block
    __le32 ee_start_lo; // lower 32 bits of physical block
};
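
As an illustration of how the two halves of the physical block number combine, the sketch below decodes one 12-byte extent record. decode_extent is a hypothetical helper; the treatment of ee_len values above 32768 as unwritten (preallocated) extents reflects my understanding of the ext4 convention and is labeled as an assumption.

```python
import struct

EXT4_EXTENT = struct.Struct("<IHHI")  # ee_block, ee_len, ee_start_hi, ee_start_lo
EXT_INIT_MAX_LEN = 1 << 15            # 32768 blocks; assumed cap for initialized extents


def decode_extent(raw):
    """Decode one on-disk ext4_extent record (12 bytes, little-endian)."""
    ee_block, ee_len, start_hi, start_lo = EXT4_EXTENT.unpack(raw)
    # Assumption: a length above the cap marks an unwritten extent.
    unwritten = ee_len > EXT_INIT_MAX_LEN
    length = ee_len - EXT_INIT_MAX_LEN if unwritten else ee_len
    return {
        "logical": ee_block,
        "length": length,
        "physical": (start_hi << 32) | start_lo,  # 48-bit physical block number
        "unwritten": unwritten,
    }


# Synthetic record: logical block 0, 25600 blocks, physical block 2**32 + 16.
record = EXT4_EXTENT.pack(0, 25600, 1, 16)
print(decode_extent(record))
```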

For a 100 MB sequential file with 4 KiB blocks, one extent can cover the entire file if the blocks are allocated contiguously (1 entry versus 25,600 block pointers, plus indirect blocks, with the legacy method). Extents greatly reduce metadata for large files and speed up block mapping.
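
The arithmetic behind that comparison can be checked with a short sketch. It assumes 4 KiB blocks, 4-byte block pointers, 12 direct pointers, and a maximum of 32768 blocks per extent; the function names are illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def legacy_indirect_blocks(data_blocks, block_size=4096):
    """Count indirect (metadata) blocks the ext2/ext3 pointer scheme needs."""
    ptrs = block_size // 4               # pointers per indirect block
    remaining = max(0, data_blocks - 12)  # first 12 blocks use direct pointers
    meta = 0
    if remaining > 0:                    # single indirect
        meta += 1
        remaining -= min(remaining, ptrs)
    if remaining > 0:                    # double indirect
        covered = min(remaining, ptrs * ptrs)
        meta += 1 + ceil_div(covered, ptrs)
        remaining -= covered
    if remaining > 0:                    # triple indirect
        covered = min(remaining, ptrs ** 3)
        leaves = ceil_div(covered, ptrs)
        meta += 1 + ceil_div(leaves, ptrs) + leaves
        remaining -= covered
    if remaining > 0:
        raise ValueError("file too large for triple indirection")
    return meta


def min_extents(data_blocks, max_len=32768):
    """Fewest extents that can map data_blocks contiguous blocks."""
    return ceil_div(data_blocks, max_len)


blocks = 100 * 1024 * 1024 // 4096       # 100 MiB in 4 KiB blocks
print(blocks, legacy_indirect_blocks(blocks), min_extents(blocks))
```

For the 100 MiB case this gives 25600 data blocks, 26 indirect blocks under the legacy scheme, and a single extent in the best case. Real files may need more extents when free space is fragmented.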

Hard Links

Hard links create multiple directory entries pointing to the same inode:

Hard link example:

$ ln /etc/passwd /tmp/passwd_copy

Directory "/etc":
  "passwd" ---------> inode 660234 (i_nlink=2)
                           |
Directory "/tmp":           |
  "passwd_copy" --------> (same inode 660234)

After: ln -s /etc/passwd /tmp/passwd_symlink  (symlink — different!)

Directory "/tmp":
  "passwd_symlink" --> inode 660245 (i_nlink=1, i_mode=S_IFLNK)
                           |
                       target: "/etc/passwd" (stored in inode data or in inode itself for short paths)

Hard link invariants:
- i_nlink counts hard links
- File data is freed only when i_nlink == 0 AND i_count == 0 (no open file descriptors)
- Hard links cannot span filesystems (inode numbers are per-filesystem)
- Hard links to directories are forbidden in most filesystems (they would create cycles and break tree traversal; only . and .. are exceptions)

Symlinks (symbolic links):
- Separate inode with S_IFLNK type
- Contains the target path string
- Short targets (fewer than 60 bytes in ext4) are stored inline in the i_block[] area ("fast symlink")
- Longer targets are stored in a data block
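
These invariants can be observed from user space. The sketch below is a small experiment, assuming a POSIX filesystem that supports hard links and symlinks; link_demo and the file names inside it are illustrative.

```python
import os
import tempfile


def link_demo(directory):
    """Create a file, a hard link, and a symlink, then unlink while a descriptor is open."""
    original = os.path.join(directory, "original")
    hard = os.path.join(directory, "hard")
    soft = os.path.join(directory, "soft")
    with open(original, "w") as f:
        f.write("data\n")
    os.link(original, hard)      # second name for the same inode
    os.symlink(original, soft)   # separate inode that stores a path

    st_orig, st_hard, st_soft = os.stat(original), os.stat(hard), os.lstat(soft)
    result = {
        "same_inode": st_orig.st_ino == st_hard.st_ino,
        "nlink_after_link": st_orig.st_nlink,
        "symlink_has_own_inode": st_soft.st_ino != st_orig.st_ino,
        "symlink_target": os.readlink(soft),
    }

    fd = os.open(hard, os.O_RDONLY)
    try:
        os.unlink(original)
        os.unlink(hard)          # no names left, but the descriptor pins the inode
        result["readable_after_unlink"] = os.read(fd, 16)
        result["dangling_symlink"] = not os.path.exists(soft)
    finally:
        os.close(fd)             # the data can be reclaimed only after this point
    return result


with tempfile.TemporaryDirectory() as scratch:
    print(link_demo(scratch))
```

The symlink ends up dangling because it refers to a name, while the open descriptor keeps working because it refers to the inode.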

Dentry: Directory Entry and Name Cache

A dentry represents one component of a filesystem path. In memory (dcache), dentries form a tree:

Dentry tree for "/home/user/file.txt":

     "/" (root dentry, d_parent = self)
      |
    "home" (dentry, d_parent = "/")
      |
    "user" (dentry, d_parent = "/home")
      |
   "file.txt" (dentry, d_parent = "/home/user")
              d_inode = inode 123456
              d_name = "file.txt"

Negative dentries: If a path component does not exist, the kernel caches a "negative dentry" with d_inode = NULL. The next lookup of the same name returns ENOENT from the cache without asking the filesystem. This matters for workloads that repeatedly probe names that do not exist: a shell searching every $PATH directory for a command, the dynamic linker trying each library search directory, or a compiler walking its include path.

# Check dentry cache size
cat /proc/sys/fs/dentry-state
# nr_dentry  nr_unused  age_limit  want_pages  nr_negative  dummy
# 123456     100000     45         0           5000         0
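
For monitoring scripts, the six fields can be parsed directly. A minimal sketch, assuming the field order shown above; parse_dentry_state and read_dentry_state are illustrative names, and on kernels that predate the nr_negative counter the fifth field is expected to read 0.

```python
DENTRY_STATE_FIELDS = (
    "nr_dentry", "nr_unused", "age_limit", "want_pages", "nr_negative", "dummy",
)


def parse_dentry_state(text):
    """Parse the whitespace-separated contents of /proc/sys/fs/dentry-state."""
    values = [int(v) for v in text.split()]
    return dict(zip(DENTRY_STATE_FIELDS, values))


def read_dentry_state(path="/proc/sys/fs/dentry-state"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_dentry_state(f.read())


# Parse the sample line from above rather than the live file.
state = parse_dentry_state("123456\t100000\t45\t0\t5000\t0\n")
negative_share = state["nr_negative"] / state["nr_dentry"]
print(state, f"{negative_share:.1%} negative")
```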

Negative dentries can cause confusion:

# File created on the NFS server; a negative dentry cached on the client still hides it
stat /nfsmount/newfile  # ENOENT, although the file exists on the server
# Force the client to drop cached dentries (harsh; 3 would also drop the page cache):
echo 2 > /proc/sys/vm/drop_caches
# Or: shorten the cache TTL with the NFS actimeo option, or mount with lookupcache=positive

Dentry Cache (dcache) Internals

The dcache is an in-memory LRU cache of recently used dentries:

// Lookup: hash table indexed by (parent_dentry, name_hash)
struct dentry *d_lookup(const struct dentry *parent, const struct qstr *name)
{
    // 1. Compute hash: hash(parent, name) -> bucket
    // 2. Walk the bucket's hlist looking for a dentry with matching parent and name
    // 3. Return the found dentry with its reference count raised, or NULL
    // Revalidation (dentries flagged DCACHE_OP_REVALIDATE) is left to the caller
}

Dentry lifecycle:
1. Allocated: d_alloc() takes a dentry from the dentry slab cache
2. Positive: d_inode != NULL, a valid path component with a backing inode
3. Negative: d_inode == NULL, a cached "not found" result
4. Unhashed: removed from the hash table but still referenced (e.g., file being deleted while still open)
5. Freed: dentry_free() returns it to the slab allocator after all references are dropped

LRU eviction: When memory is short, the kernel evicts unused dentries (those not referenced by any process). shrink_dcache_sb() and shrink_dentry_list() implement dentry eviction. Controlled by vm.vfs_cache_pressure.
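
The interplay of hashed lookup, negative entries, and LRU eviction can be modeled in a few lines. This is a toy model, not the kernel's data structure: ToyDcache, its backing dict (standing in for on-disk directories), and the capacity limit are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyDcache:
    """Toy model: (parent, name) -> inode number, or None for a negative entry."""

    def __init__(self, backing, capacity=4):
        self.backing = backing        # {(parent, name): ino}, plays the filesystem
        self.capacity = capacity
        self.cache = OrderedDict()    # insertion order doubles as LRU order
        self.disk_lookups = 0

    def lookup(self, parent, name):
        key = (parent, name)
        if key in self.cache:
            self.cache.move_to_end(key)       # mark as most recently used
            return self.cache[key]
        self.disk_lookups += 1                # miss: ask the "filesystem"
        ino = self.backing.get(key)           # None is cached as a negative entry
        self.cache[key] = ino
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict the least recently used entry
        return ino


dcache = ToyDcache({("/", "home"): 2, ("/", "etc"): 3}, capacity=2)
dcache.lookup("/", "home")
dcache.lookup("/", "missing")
dcache.lookup("/", "missing")                 # served by the negative entry
print(dcache.disk_lookups)
```

The second lookup of the missing name does not increase disk_lookups, which is the effect negative dentries provide.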

Inode Cache (icache)

The inode cache (fs/inode.c) is an in-memory LRU of recently accessed inodes. Inode cache entries are allocated from inode_cachep (a kmem_cache slab allocator).

# View inode cache stats
cat /proc/sys/fs/inode-nr
# nr_inodes (allocated) nr_unused (evictable)

# View slab usage for inodes
grep -E "ext4_inode|btrfs_inode|inode_cache" /proc/slabinfo   # usually requires root

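When the last reference to an in-memory inode is dropped:
1. iput() decrements the reference count (i_count)
2. When the count hits 0, the kernel checks i_nlink
3. If i_nlink == 0 (file was deleted while open): the inode is evicted immediately and the filesystem's s_op->evict_inode() frees the data blocks and the on-disk inode
4. If i_nlink > 0 (file still has names): the inode normally stays in the cache as unused; under memory pressure it is written back if dirty and then freed from the cache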
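
The decision made at the final iput() can be sketched as a small state machine. This is a conceptual model only: ToyInode, its state names, and the reclaim() method are hypothetical, and it assumes that an inode which still has names stays cached until memory pressure reclaims it.

```python
class ToyInode:
    """Conceptual model of the in-memory inode lifecycle; not kernel code."""

    def __init__(self, ino, nlink=1):
        self.ino = ino
        self.nlink = nlink      # names on disk (i_nlink)
        self.count = 0          # in-memory references (i_count)
        self.dirty = False
        self.state = "cached"

    def iget(self):
        self.count += 1

    def iput(self):
        self.count -= 1
        if self.count > 0:
            return self.state
        if self.nlink == 0:
            # Last reference to an unlinked file: blocks and on-disk inode are released.
            self.state = "freed_on_disk"
        else:
            # Still has names: keep it around as an unused, reclaimable entry.
            self.state = "unused_on_lru"
        return self.state

    def reclaim(self):
        """Simulate memory pressure; returns the actions taken."""
        if self.state != "unused_on_lru":
            return []
        actions = ["writeback"] if self.dirty else []
        self.dirty = False
        self.state = "dropped_from_cache"
        return actions


deleted_while_open = ToyInode(660234)
deleted_while_open.iget()
deleted_while_open.nlink = 0            # unlink() while a descriptor is open
print(deleted_while_open.iput())
```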

Relationship to Mount Points

Dentry tree navigation must handle mount points:

Dentry tree with mounts:

"/" (root, on disk /dev/sda1)
 |
"home" (dentry on sda1)
 |
"user" (dentry on sda1)
     |
     +-- This dentry is a MOUNTPOINT (DCACHE_MOUNTED flag set)
         |
         [mount: /dev/sdb1 mounted at "/home/user"]
         |
         "/" (root of sdb1) ← path traversal crosses here
              |
           "projects"
           "documents"

During path walk: when a dentry has DCACHE_MOUNTED, follow_mount() traverses to the child mount's root dentry.
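
A toy path walk shows where the crossing happens. The sketch assumes a simple mapping from mount-point path to device and ignores bind mounts, stacked mounts, symlinks, and "..": walk and the mounts table are illustrative, not a model of the kernel's mount hash.

```python
def walk(path, mounts):
    """Return [(component_path, device)] for each step of an absolute path walk.

    mounts maps a mount-point path to the device mounted there and must contain "/".
    """
    current = "/"
    device = mounts["/"]
    steps = [(current, device)]
    for part in (p for p in path.split("/") if p):
        current = current.rstrip("/") + "/" + part
        if current in mounts:
            # The dentry is a mount point: continue from the mounted filesystem's root.
            device = mounts[current]
        steps.append((current, device))
    return steps


mounts = {"/": "/dev/sda1", "/home/user": "/dev/sdb1"}
for component, dev in walk("/home/user/projects", mounts):
    print(f"{component:22} {dev}")
```

In this example, /home is served by /dev/sda1, while /home/user and everything beneath it are served by /dev/sdb1.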

/proc/mounts or mount command shows the current mount tree. findmnt provides a tree view:

findmnt --tree
# TARGET                 SOURCE     FSTYPE
# /                      /dev/sda1  ext4
# ├─/proc                proc       proc
# ├─/sys                 sysfs      sysfs
# │ ├─/sys/fs/cgroup     cgroup2    cgroup2
# │ └─/sys/kernel/debug  debugfs    debugfs
# ├─/home                /dev/sdb1  ext4
# └─/tmp                 tmpfs      tmpfs

Inode Exhaustion

Every filesystem has a fixed number of inodes (or dynamic inode allocation for some filesystems). When inodes are exhausted, new file creation fails with ENOSPC even if there is free space.

# Check inode usage
df -i
# Filesystem       Inodes  IUsed  IFree IUse% Mounted on
# /dev/sda1       1310720 943021 367699   72% /
# /dev/sdb1       4194304 4194304      0  100% /data  <- FULL

# Count files in a directory
find /data -maxdepth 1 -type f | wc -l

# Find directories with most inodes consumed
find /data -xdev -type d -exec sh -c 'echo "$(ls -A "$1" | wc -l) $1"' sh {} \; | sort -rn | head -n 20

Filesystems that allocate inodes dynamically (Btrfs, XFS, ZFS) avoid inode exhaustion as a fixed-limit problem, although XFS still caps the share of space that inodes may occupy (the imaxpct setting). ext4's inode count is set at mkfs time (-N inodes or -i bytes-per-inode).
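
The numbers that df -i prints come from statvfs(). A minimal sketch for use in monitoring scripts, assuming a POSIX system; inode_usage is an illustrative name, and filesystems without a fixed inode table may report a total of 0, which the function maps to a percentage of None.

```python
import os


def inode_usage(path):
    """Report inode totals for the filesystem containing path, like df -i."""
    vfs = os.statvfs(path)
    total, free = vfs.f_files, vfs.f_ffree
    used = total - free
    percent = None if total == 0 else round(100 * used / total, 1)
    return {"total": total, "used": used, "free": free, "percent_used": percent}


print(inode_usage("/"))
```

Alerting on percent_used alongside block usage catches the case where df -h still shows free space but file creation fails with ENOSPC.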

Historical Context

The inode concept dates to the original UNIX filesystem (Ken Thompson, Dennis Ritchie, Bell Labs, circa 1971). The design was influenced by the Multics filesystem. The decision to separate name from metadata was deliberate: it allowed hard links, enabled rename to be atomic (just updating a directory entry, not copying data), and made the filesystem structure simpler.

Early UNIX inodes already combined a small array of direct block addresses with indirect blocks; Version 7 used ten direct addresses plus single, double, and triple indirect pointers. The Fast File System (McKusick et al., 1984) kept that scheme and added cylinder groups, larger blocks, and fragments. ext2 and ext3 used the same direct/indirect layout until ext4 introduced extents (2008, Theodore Ts'o et al.), which greatly improved performance for large files.

Linux gained a directory name cache in the 1.x series, and the dentry-based dcache in its current form arrived during 2.1 development (1997). Without such a cache, every path component requires searching directory blocks on each lookup, which is very slow for deep paths on busy systems.

Production Examples

inode exhaustion in email servers: Mail servers (Postfix, Dovecot with Maildir format) store each email as a separate file. A mail server handling millions of small messages can exhaust inodes on an ext4 filesystem created with the default inode density (one inode per 16 KB of space). Fix: choose a denser ratio at mkfs time, such as -i 4096 (one inode per 4 KB; a ratio below the block size creates inodes that can never be used), or migrate to XFS/Btrfs with dynamic inode allocation.
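
A quick capacity estimate helps when choosing the ratio. The sketch below is back-of-the-envelope arithmetic, assuming 256-byte inodes; estimated_inodes and inode_table_bytes are illustrative names, and mke2fs rounds the real count per block group, so treat the results as approximations.

```python
TIB = 2 ** 40


def estimated_inodes(fs_bytes, bytes_per_inode=16384):
    """Approximate inode count for a filesystem size and a bytes-per-inode ratio."""
    return fs_bytes // bytes_per_inode


def inode_table_bytes(fs_bytes, bytes_per_inode=16384, inode_size=256):
    """Approximate space consumed by the inode tables themselves."""
    return estimated_inodes(fs_bytes, bytes_per_inode) * inode_size


for ratio in (16384, 4096):
    count = estimated_inodes(TIB, ratio)
    overhead = inode_table_bytes(TIB, ratio) / TIB
    print(f"-i {ratio}: ~{count:,} inodes, {overhead:.2%} of space in inode tables")
```

For 1 TiB this yields roughly 67 million inodes at the default ratio and roughly 268 million at -i 4096, at the cost of about 6% of the space going to inode tables instead of about 1.6%.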

Build systems and dcache: Large builds (kernel compilation, LLVM) open hundreds of thousands of files, so a warm dcache makes a make -j$(nproc) rebuild much faster than the first build. On some systems with large source trees, keeping the caches warm between builds (via vmtouch or equivalent) has been reported to improve rebuild time by 40-60%; measure on your own workload before relying on that figure.

NFS negative dentry race: A common NFS debugging scenario: process A creates a file /nfs/mount/newfile. Process B (on a different client) runs stat /nfs/mount/newfile and gets ENOENT, because B's client cached a negative dentry from before the file was created. The NFS attribute cache timeouts (actimeo) bound how long such stale entries persist, and mounting with lookupcache=positive disables negative lookup caching altogether. Lower timeouts reduce the consistency delay at the cost of more NFS metadata RPCs.

Debugging Notes

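# Find all hard links to a file (names sharing the same inode on one filesystem)
stat -c %i /etc/passwd                         # print the inode number
find / -xdev -inum 660234 2>/dev/null          # -xdev: inode numbers are unique only per filesystem
find / -xdev -samefile /etc/passwd 2>/dev/null # same search without looking up the number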

# Check if inode has extended attributes
getfattr -d -m - /path/to/file   # -m - matches every namespace, not only user.*
lsattr /path/to/file   # ext4 attributes (immutable, append-only, etc.)

# View raw ext4 inode data
debugfs /dev/sda1
> stat <660234>    # angle brackets are literal: they tell debugfs this is an inode number
> stat /etc/passwd

# Dump directory entries
debugfs /dev/sda1
> ls -l /  # list root directory entries with inode numbers

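# Trace dentry allocation and deletion with the ftrace function tracer
# (symbol names vary by kernel version; check available_filter_functions first)
echo 'd_alloc d_delete' > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace_pipe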

# Monitor inode allocation/free rates
perf stat -e 'ext4:ext4_allocate_inode,ext4:ext4_free_inode' -p $PID sleep 10

# Force inode/dentry cache flush (use carefully: performance drops until the caches refill)
sync; echo 2 > /proc/sys/vm/drop_caches  # drop reclaimable dentries and inodes only
sync; echo 3 > /proc/sys/vm/drop_caches  # drop page cache + dentries + inodes

Security Implications

Hard link attacks: Without restrictions, an unprivileged user can hard-link a setuid-root binary into a directory they control. The link keeps the old binary reachable after an upgrade replaces the original, so a vulnerability in it (such as a TOCTOU bug) remains exploitable for privilege escalation. With fs.protected_hardlinks=1 (available since Linux 3.6 and enabled by most distributions), users can create hard links only to files they own or can already both read and write.

Symlink races: /tmp is world-writable. A program that creates a file in /tmp using open("/tmp/tempfile", O_CREAT|O_WRONLY) is vulnerable if an attacker symlinks /tmp/tempfile to /etc/passwd before the open(). Mitigations: always use O_CREAT|O_EXCL (an atomic create that fails if the name exists, even as a symlink), use mkstemp(), and enable fs.protected_symlinks=1, which lets a symlink in a sticky world-writable directory be followed only when its owner matches the follower or the directory owner.
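
The O_EXCL behavior is easy to verify. The sketch below plants a symlink where a program intends to create its file and shows that an exclusive create refuses to follow it. create_exclusive is an illustrative wrapper around os.open; O_NOFOLLOW is added as an extra safeguard and is assumed to be available (Linux and other modern POSIX systems).

```python
import os
import tempfile


def create_exclusive(path, mode=0o600):
    """Create path atomically; fail if any name, including a symlink, already exists."""
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY | os.O_NOFOLLOW
    return os.open(path, flags, mode)


with tempfile.TemporaryDirectory() as scratch:
    victim = os.path.join(scratch, "victim")      # stands in for /etc/passwd
    with open(victim, "w") as f:
        f.write("secret")
    trap = os.path.join(scratch, "tempfile")
    os.symlink(victim, trap)                      # the attacker's planted symlink

    try:
        create_exclusive(trap)
        outcome = "created"
    except FileExistsError:
        outcome = "refused"                       # EEXIST: the symlink was not followed
    print(outcome)

    # For real temporary files, prefer mkstemp(), which picks an unpredictable name.
    fd, name = tempfile.mkstemp(dir=scratch)
    os.close(fd)
```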

Inode timing attacks: atime (access time) records when a file was last read, leaking information about file access patterns. The default relatime option updates atime only when it is older than mtime or ctime, or more than a day old, which limits the leak; security-sensitive systems mount with noatime to stop the updates entirely.

Performance Implications

stat() overhead in large directories: ls -l calls stat() on each file, reading its inode. For directories with millions of files this is slow because the inode reads do not follow a sequential disk access pattern. When only names are needed, plain ls avoids the per-file stat(), and ls -f also skips sorting. --color=auto can bring some of the cost back, because certain color classes (executables, setuid files) require stat(), while the basic file type comes from d_type in the directory entry.
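
Scripts can take the same shortcut. The sketch below classifies directory entries with os.scandir, which on Linux typically answers is_dir() and is_file() from the d_type returned by readdir and falls back to a stat call only when the filesystem does not supply it. count_by_type is an illustrative name.

```python
import os
import tempfile


def count_by_type(directory):
    """Count entries by type, relying on d_type where the filesystem provides it."""
    counts = {"file": 0, "dir": 0, "symlink": 0, "other": 0}
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_symlink():
                counts["symlink"] += 1
            elif entry.is_dir(follow_symlinks=False):
                counts["dir"] += 1
            elif entry.is_file(follow_symlinks=False):
                counts["file"] += 1
            else:
                counts["other"] += 1
    return counts


with tempfile.TemporaryDirectory() as scratch:
    for name in ("a.txt", "b.txt"):
        open(os.path.join(scratch, name), "w").close()
    os.mkdir(os.path.join(scratch, "sub"))
    os.symlink("a.txt", os.path.join(scratch, "link"))
    summary = count_by_type(scratch)
    print(summary)
```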

Negative dentry memory usage: Systems with many transient paths (containers starting/stopping, build systems, CI/CD) can accumulate millions of negative dentries. These consume kernel memory. Monitor with nr_negative in /proc/sys/fs/dentry-state. Tune vm.vfs_cache_pressure up (e.g., 200) to encourage eviction.

Large directory lookup: A directory with 100,000 entries stored as a linear list would require an O(n) scan per lookup. With the dir_index feature, ext4 builds an HTree (a hash-keyed B-tree variant) once a directory outgrows a single block, giving O(log n) lookup. Check: dumpe2fs -h /dev/sda1 | grep dir_index. Enable retroactively on an unmounted filesystem: tune2fs -O dir_index /dev/sda1 && e2fsck -fD /dev/sda1.

Failure Modes and Real Incidents

inode table corruption: A power failure during an inode write can corrupt the on-disk inode table. With the metadata_csum feature (kernel support since Linux 3.5), ext4 checksums each inode, so the kernel and fsck.ext4 detect and report damaged inodes. Without checksums, a corrupted inode (e.g., wrong i_mode or garbage i_block pointers) can cause filesystem errors or kernel crashes when it is loaded.

Hard link count overflow: ext4 caps i_nlink at 65,000 (EXT4_LINK_MAX). For a regular file, link() fails with EMLINK once the cap is reached. For directories, each subdirectory's ".." counts as a link to the parent; with the dir_nlink feature, a directory that exceeds the cap has its i_nlink set to 1, meaning "link count not tracked". ext4 handles this gracefully; some older tools may misreport such a directory as having 1 link.

Dentry race condition (CVE-2014-0203 et al.): Multiple kernel CVEs have resulted from race conditions in dentry lifecycle management (use-after-free in dcache). These are typically triggered by concurrent rename/unlink operations. Modern kernels have addressed most known races with careful locking; lockdep annotations help detect new races in testing.

Maildir inode exhaustion at scale (real pattern): A hosting provider with 50,000 email accounts on a single server using Maildir format (one file per email) hit ext4 inode exhaustion at ~16 million files despite having 2 TB of free space. The server accepted no new email for 8 hours. Recovery is not trivial: ext4 cannot add inodes to an existing filesystem in place, so the inode count grows only when the filesystem itself is grown with resize2fs (new block groups bring new inodes) or when it is recreated with mkfs and the data restored. Lesson: use XFS or Btrfs for mail storage, or design with multiple filesystems per server.

Modern Usage

XFS dynamic inode allocation: XFS allocates inodes dynamically from free blocks, with no fixed inode count. Inodes are grouped near the files they describe (inode64 layout). Preferred for high-file-count workloads.
Btrfs dynamic inodes: Btrfs allocates inodes from its B-tree structure with no fixed table. Inode numbers in Btrfs are 64-bit and not reused (monotonically increasing), which helps NFS generation number uniqueness.
NFS inode generation numbers: NFS uses (filesystem UUID, inode number, inode generation number) as a persistent file handle. Generation numbers prevent reuse of inode numbers after file deletion from causing stale file handles. ext4 stores i_generation in the inode; Btrfs uses transaction IDs.
Inode number stability for monitoring: System monitoring tools often track files by inode number. File rotation (log rotation) creates a new file with a new inode number, so inotify-based tools must handle IN_MOVED_FROM/IN_MOVED_TO events correctly.
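
A log follower that does not use inotify can detect rotation by comparing inode identity. This is a minimal sketch of that polling approach, assuming rotation by rename followed by creation of a new file; was_rotated is an illustrative name.

```python
import os
import tempfile


def was_rotated(open_file, path):
    """Return True if path no longer refers to the inode behind open_file."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        return True                       # renamed away and not yet recreated
    held = os.fstat(open_file.fileno())
    return (held.st_dev, held.st_ino) != (on_disk.st_dev, on_disk.st_ino)


with tempfile.TemporaryDirectory() as scratch:
    log = os.path.join(scratch, "app.log")
    open(log, "w").close()
    with open(log) as follower:
        before = was_rotated(follower, log)
        os.rename(log, log + ".1")        # logrotate-style rename
        open(log, "w").close()            # the application reopens a fresh file
        after = was_rotated(follower, log)
    print(before, after)
```

When the check returns True, the follower should finish reading the old descriptor and then reopen the path.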

Future Directions

Case-insensitive dentries: Ongoing work in VFS and ext4/F2FS on case-insensitive directory lookup (useful for compatibility layers such as Samba and Wine that must emulate case-insensitive filesystems). Linux 5.2+ supports case-insensitive directories in ext4 through the casefold feature.
Atomic file attribute updates: setattr() modifying multiple inode fields (size + timestamps) is not atomic in most filesystems. Research on atomic compound metadata updates for consistent snapshots.
Inode versioning: Some proposals for persistent inode change counters (beyond i_ctime) to enable more efficient cache coherence in distributed systems.

Exercises

Create a file, hard-link it 3 times, and verify stat() shows i_nlink=4. Delete the links one by one and observe when the file data is actually freed (use ls -la and watch the size). At what point does the data disappear?
Write a C program using openat(AT_FDCWD, path, O_CREAT|O_EXCL|O_WRONLY, 0600) for safe temp file creation. Compare it to open(path, O_CREAT|O_WRONLY). Explain the race condition in the second form.
Exhaust inodes on a tmpfs. Create a tmpfs with -o nr_inodes=100, then create 101 files and observe the exact error. How does this differ from disk-space exhaustion?
Use debugfs /dev/sda1 to examine an ext4 inode directly. Find the extent tree entries for a large file (>128 MB). How many extents does the file have? Is it fragmented? Compare to a freshly-written file vs an old, repeatedly-modified file.
Measure the impact of negative dentry caching. Write a script that accesses 10,000 non-existent files. Time the first run vs the second run (warmed negative cache). Then flush the dcache (echo 2 > /proc/sys/vm/drop_caches) and time the third run.

References

McKusick, M.K. et al. "A Fast File System for UNIX." ACM Trans. Comput. Syst. 2(3), 1984.
Ts'o, T. and Tweedie, S. "Planned Extensions to the Linux Ext2/Ext3 Filesystem." USENIX 2002.
Linux kernel source: fs/inode.c, fs/dcache.c, fs/ext4/inode.c, fs/ext4/extents.c
ext4 disk layout: fs/ext4/ext4.h — struct ext4_inode
Love, R. Linux Kernel Development, 3rd ed., Chapter 13
Viro, A. and others. Dcache design discussions: https://lwn.net/Articles/419811/
man 7 inode — Linux inode documentation
debugfs man page: man 8 debugfs