04 - Copy-on-Write Filesystems

Technical Overview

Copy-on-write (CoW) filesystems take a fundamentally different approach to crash consistency than journaling. Instead of writing changes to a log before applying them in-place, CoW filesystems never overwrite existing data. Every modification writes to a new, unallocated location on disk, then updates a tree of pointers to reflect the new location. The root of the tree is updated last — atomically — making the entire operation appear instantaneous.

The consequences of this design are profound: snapshots become nearly free (just preserve the old root pointer), checksums can protect every block, full-stripe writes to new locations can avoid the RAID write hole (as ZFS RAID-Z does), and restoring a consistent state after a crash requires no journal replay. The tradeoffs are increased fragmentation over time and higher write amplification for small random updates.

Prerequisites

Filesystem basics (inodes, block pointers, B-trees)
Journaling concepts (see 03-journaling.md)
B-tree data structures
RAID concepts for CoW RAID modes

Core Content

CoW Write Path

In-place update (ext4 and other journaling filesystems):
  Before: [Tree root] -> [Inode] -> [Data Block 42: "hello"]
  Write "world" to block 42:
    1. Write "world" directly to block 42 (destroying "hello")
    2. Log the inode update (mtime, size if changed) in the journal
    3. Journal commit
  After: [Tree root] -> [Inode] -> [Data Block 42: "world"]
  Problem: with metadata-only journaling, a crash during step 1 can leave block 42
  holding a mix of old and new data; the journal protects metadata, not file contents

CoW update (Btrfs/ZFS):
  Before: [Tree root @generation 5] -> [Inode @gen5] -> [Data Block 42: "hello"]
  Write "world":
    1. Allocate new Data Block 100 (free space)
    2. Write "world" to block 100
    3. Allocate new inode copy (Block 200)
    4. Write new inode (mtime updated, block pointer to 100)
    5. Update B-tree node pointing to inode (allocate new B-tree node block)
    6. Update parent B-tree nodes (cascade up to root)
    7. Write new root pointer (generation 6) — ATOMIC
  After: [Tree root @generation 6] -> [Inode @gen6] -> [Data Block 100: "world"]
  Block 42 and the old inode block are freed once generation 6 commits
  (unless a snapshot still references them)

  Crash at any point before step 7:
    Root still points to generation 5 state, which is perfectly consistent
    Partially written new blocks were never referenced by a committed tree,
    so their space is simply still free after the crash

The key invariant: the root pointer write is the commit point. Everything written before the root pointer update can be abandoned without inconsistency. This is similar to a database transaction where the commit record is the "point of no return."
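
The write path above can be sketched as a toy model in Python. This is a minimal illustration under stated assumptions, not how Btrfs or ZFS are implemented: the CowStore class, its bump allocator, and the crash_before_commit flag are hypothetical names introduced here, and the tree is collapsed to a single root -> inode -> data chain.

```python
class CowStore:
    """Toy CoW store: blocks are never overwritten; the root pointer flips last."""

    def __init__(self):
        self.blocks = {}      # block number -> contents (written once)
        self.next_free = 0    # trivial bump allocator (hypothetical)
        self.root = None      # block number of the committed root
        self.generation = 0

    def _alloc(self, contents):
        """Write contents into a fresh block and return its number."""
        blockno = self.next_free
        self.next_free += 1
        self.blocks[blockno] = contents
        return blockno

    def format(self, data):
        """Create the initial root -> inode -> data chain."""
        data_blk = self._alloc(data)
        inode_blk = self._alloc({"data": data_blk})
        self.root = self._alloc({"inode": inode_blk})
        self.generation = 1

    def read(self):
        """Follow pointers from the committed root down to the data."""
        inode = self.blocks[self.blocks[self.root]["inode"]]
        return self.blocks[inode["data"]]

    def write(self, data, crash_before_commit=False):
        # Steps 1-6: build the new path bottom-up in unallocated blocks.
        data_blk = self._alloc(data)
        inode_blk = self._alloc({"data": data_blk})
        new_root = self._alloc({"inode": inode_blk})
        if crash_before_commit:
            return  # new blocks exist, but no committed pointer reaches them
        # Step 7: the single pointer update that commits the change.
        self.root = new_root
        self.generation += 1


store = CowStore()
store.format("hello")
store.write("world", crash_before_commit=True)
after_crash = store.read()    # the old generation is still what readers see
store.write("world")
after_commit = store.read()
```

In this sketch the simulated crash leaves the reader on the old generation because nothing was modified in place; only the final assignment to root changes what a reader can reach.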

Advantages of CoW

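No write hole (given a CoW-aware RAID layer): Full stripes are written to new locations and become visible only at commit, so no live stripe is ever partially overwritten. ZFS RAID-Z works this way; Btrfs RAID5/6 does not yet (see the RAID 5/6 warning below).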
Instant snapshots: Create a snapshot by preserving a copy of the current root pointer. Zero copying required. Snapshot shares all unchanged blocks with the live filesystem via reference counting.
Per-block checksums: Since blocks are written once and never modified, checksums can be stored separately and validated on every read.
Consistent on-disk state at all times: Any live block pointer points to valid data. No "pending" states.
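No fsck required after a crash: Mount starts from the last committed root, which describes a consistent tree. Blocks written by an uncommitted transaction were never recorded as allocated, so no garbage collection pass is needed to reclaim them.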

Btrfs Architecture

Btrfs ("B-tree filesystem") is designed around a collection of B-trees stored in a single logical address space. All trees share one node format, and items in every tree are indexed by the same kind of key:

Btrfs Key Space:

Every object in Btrfs is identified by a (objectid, type, offset) triple.
Examples:
  (256, INODE_ITEM, 0)         - inode for object 256
  (256, EXTENT_DATA, 0)        - file data extent at offset 0 for object 256
  (256, DIR_ITEM, name_hash)   - directory entry in directory 256
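
Because keys compare as tuples, all items belonging to one object sit next to each other in a leaf. The sketch below illustrates that ordering in Python. The numeric key types (1, 84, 108) are the values from the Btrfs on-disk format; the leaf contents, the name hash, and the items_for_object helper are hypothetical and exist only for this example.

```python
import bisect

# Key type values from the Btrfs on-disk format.
INODE_ITEM, DIR_ITEM, EXTENT_DATA = 1, 84, 108

# A leaf keeps its items sorted by the (objectid, type, offset) tuple.
# Object 256 is a directory, object 257 a regular file (illustrative data).
leaf = sorted([
    (257, EXTENT_DATA, 131072),
    (256, DIR_ITEM, 0x8F3A21),   # hypothetical name hash
    (257, INODE_ITEM, 0),
    (256, INODE_ITEM, 0),
    (257, EXTENT_DATA, 0),
])


def items_for_object(keys, objectid):
    """Return every key of one object using two binary searches."""
    lo = bisect.bisect_left(keys, (objectid, 0, 0))
    hi = bisect.bisect_left(keys, (objectid + 1, 0, 0))
    return keys[lo:hi]


file_items = items_for_object(leaf, 257)
```

The inode item sorts first for each object, followed by its other items in type order, which is why a single range scan retrieves everything about a file.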

Btrfs Tree Hierarchy:

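                      Root Tree (tree 1)
            /              |               \
     FS Tree          Extent Tree       Device Tree
  (per-subvolume      (allocated        (device extent ->
   file/dir tree)      extents and       chunk mapping)
                       back-references)

Chunk Tree (referenced from the superblock, not from the Root Tree):
  logical -> physical mapping, device info, RAID profile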

FS Tree contains all inodes and file data for one subvolume.
Each subvolume (including snapshots) has its own FS Tree root.

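Btrfs uses extents as the allocation unit: an extent is a contiguous range of bytes. An EXTENT_DATA item describes one extent of a file. Its key carries the file offset, and the item body records the extent's address in the Btrfs logical address space, its length, and the compression type. The chunk tree then maps that logical address to a physical device offset.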

# Examine Btrfs tree structure
btrfs inspect-internal dump-tree /dev/sda1
btrfs inspect-internal dump-super /dev/sda1  # superblock details

# Dump the extent tree (allocation and back-reference analysis)
btrfs inspect-internal dump-tree -t extent /dev/sda1

# Show detailed file extent layout
filefrag -v /path/to/file  # shows extent locations (on Btrfs these are logical addresses, not raw device offsets)

Btrfs Subvolumes and Snapshots

A subvolume is an independent copy-on-write namespace within a Btrfs filesystem. Each subvolume has its own root inode and its own FS Tree. Subvolumes appear as independent filesystem roots, are mountable individually, and can be snapshotted independently.

Btrfs subvolume and snapshot:

Initial state:
  Subvolume 5 (default): A B C D E (files/blocks)

Create snapshot:
  btrfs subvolume snapshot /data /data/.snapshot-2024-01-01

Snapshot state:
  Subvolume 5 (default): A B C D E
  Subvolume 256 (snapshot): A B C D E  ← shares all blocks with subvol 5

After modifying file D in subvolume 5:
  CoW write: allocate new block D', update subvol 5 tree
  Subvolume 5: A B C D' E  ← D' is new, all others shared with snapshot
  Subvolume 256 (snapshot): A B C D E  ← D still points to original block

Space usage:
  sizeof(A+B+C+D+E) + sizeof(D')  ← space for snapshot is only the new data
  Not: 2 × sizeof(A+B+C+D+E)
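
The space accounting in the illustration above can be checked with a few lines of Python. This is a sketch that models each subvolume as a mapping from file name to physical block number; the block numbers and the unique_blocks_used helper are hypothetical.

```python
def unique_blocks_used(*subvolumes):
    """Count distinct physical blocks referenced by any of the subvolumes."""
    return len(set().union(*(sv.values() for sv in subvolumes)))


# file name -> physical block number (illustrative numbers)
live = {"A": 10, "B": 11, "C": 12, "D": 13, "E": 14}

snapshot = dict(live)            # snapshot: same pointers, no data copied
before = unique_blocks_used(live, snapshot)

live["D"] = 99                   # CoW write: D' lands in a new block
after = unique_blocks_used(live, snapshot)
```

Here before counts five blocks and after counts six: the snapshot costs one extra block for D', not a second copy of everything.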

# Create subvolume
btrfs subvolume create /data/mysubvol

# Create read-only snapshot (for backup)
btrfs subvolume snapshot -r /data/mysubvol /data/snap-$(date +%Y%m%d)

# Create writable snapshot (for testing/cloning)
btrfs subvolume snapshot /data/mysubvol /data/snap-test

# List subvolumes
btrfs subvolume list /data

# Delete snapshot (releases shared blocks not needed by others)
btrfs subvolume delete /data/snap-old

# Get subvolume usage (the "excl" column is the space consumed uniquely by each subvolume)
btrfs qgroup show /data   # requires quotas enabled first: btrfs quota enable /data

Btrfs send/receive

btrfs send serializes a read-only snapshot to a byte stream. btrfs receive applies it on a different filesystem. Incremental sends transmit only the difference between two snapshots:

# Full send (initial backup)
btrfs send /data/snap-day1 | ssh backup-server 'btrfs receive /backup/'

# Incremental send (only delta since snap-day1)
btrfs send -p /data/snap-day1 /data/snap-day2 | ssh backup-server 'btrfs receive /backup/'
# Only transmits blocks that changed between day1 and day2

This enables efficient, network-level incremental backups — analogous to ZFS send/receive.
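
The idea behind an incremental send can be sketched as a comparison of two snapshots. This is a simplified model, not the real stream format: each snapshot is assumed to map a path to a (block_id, generation) pair, and the incremental_stream function and its operation tuples are hypothetical.

```python
def incremental_stream(parent, child):
    """Return the operations needed to turn `parent` into `child`.

    Each snapshot maps path -> (block_id, generation). An entry whose pair is
    unchanged is shared between the snapshots and is not transmitted.
    """
    ops = []
    for path, extent in sorted(child.items()):
        if parent.get(path) != extent:
            ops.append(("write", path, extent[0]))
    for path in sorted(parent.keys() - child.keys()):
        ops.append(("unlink", path))
    return ops


snap_day1 = {"a.txt": (10, 5), "b.txt": (11, 5), "c.txt": (12, 5)}
snap_day2 = {"a.txt": (10, 5), "b.txt": (40, 6), "d.txt": (41, 6)}

ops = incremental_stream(snap_day1, snap_day2)
```

Only b.txt (rewritten), d.txt (new), and the removal of c.txt appear in the result; a.txt is shared and costs nothing to send.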

Btrfs Checksums

Btrfs computes checksums for every data and metadata block. Data checksums are stored in a dedicated checksum tree; each metadata block carries its checksum in its own header. Default checksum: CRC32C. Available algorithms: crc32c, xxhash, sha256, blake2b (selectable at mkfs.btrfs time).

# Create Btrfs with SHA256 checksums (more secure, slower)
mkfs.btrfs --checksum sha256 /dev/sdb

# Default (CRC32c)
mkfs.btrfs /dev/sdb

# Verify checksums (scrub reads all data and verifies checksums)
btrfs scrub start /data
btrfs scrub status /data

When a checksum mismatch is detected during a read:
  1. If RAID1 or RAID10: Btrfs reads the mirror and validates its checksum. If the mirror is good, it returns the mirror data and repairs the bad block transparently.
  2. If RAID5/6: Btrfs attempts parity-based recovery.
  3. If there is no redundancy: the read returns EIO (I/O error) to the reader, and the corrupted block is reported via kernel messages.
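
The mirrored case can be sketched as follows. This is an illustrative model only: devices are plain Python lists, read_block and ChecksumError are hypothetical names, and zlib.crc32 stands in for CRC32C (which uses a different polynomial and is not in the Python standard library).

```python
import zlib


class ChecksumError(OSError):
    """Raised when no copy of a block matches its stored checksum (like EIO)."""


def checksum(data: bytes) -> int:
    # Stand-in checksum; Btrfs defaults to CRC32C, not zlib's CRC-32.
    return zlib.crc32(data)


def read_block(mirrors, index, expected):
    """Return a verified copy of block `index`, repairing bad mirrors.

    mirrors: one list of blocks per device copy.
    expected: the checksum stored separately from the data.
    """
    good = None
    for copy in mirrors:
        if checksum(copy[index]) == expected:
            good = copy[index]
            break
    if good is None:
        raise ChecksumError(f"block {index}: no copy matches its checksum")
    for copy in mirrors:
        if checksum(copy[index]) != expected:
            copy[index] = good   # rewrite the bad copy from the good one
    return good


dev_a, dev_b = [b"hello"], [b"hello"]
stored = checksum(b"hello")
dev_a[0] = b"hellp"              # simulate silent corruption on one mirror
data = read_block([dev_a, dev_b], 0, stored)
```

With a single copy and no redundancy, the same function raises ChecksumError, which corresponds to the EIO case in the list above.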

This detects silent data corruption (bit rot): cases where a drive returns data that differs from what was written without reporting any error, which SMART checks do not catch.

Btrfs RAID Modes

Btrfs implements RAID internally (no mdadm needed):

# Create Btrfs RAID1 (mirror) across two devices
mkfs.btrfs -m raid1 -d raid1 /dev/sdb /dev/sdc
# -m: metadata RAID level, -d: data RAID level

# Create RAID10
mkfs.btrfs -m raid10 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Add device to existing pool
btrfs device add /dev/sdd /data

# Balance: rewrite existing chunks so they spread across all devices;
# the convert filters additionally change the RAID profile (here to raid10)
btrfs balance start -dconvert=raid10 -mconvert=raid10 /data

# Check device usage
btrfs device usage /data
btrfs fi show /data

Btrfs RAID 5/6 warning: As of Linux 6.x (2024), Btrfs RAID5/6 is still marked as unstable. Known issues include write hole under specific conditions and incorrect parity computation under certain workloads. Do not use Btrfs RAID5/6 for production data. Use RAID1 or RAID10 instead, or use mdadm RAID5 with an ext4/XFS filesystem on top.

Btrfs Limitations and Bug History

RAID 5/6 still not production-ready (as of 2024): Multiple parity calculation bugs have been found and partially fixed. The Btrfs developers themselves advise against it.
Balance can be slow: Rebalancing large amounts of data across devices takes hours-days and is I/O intensive.
btrfs check --repair danger: The repair mode of btrfs check has historically caused more problems than it fixed. Only run with explicit guidance from btrfs developers.
Free space cache corruption: Early versions had bugs in the free space cache that could cause allocation errors. Mostly fixed in kernels 4.x+.
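Performance penalty on small random writes: CoW write amplification for small random writes (e.g., database without O_DIRECT) can be severe. Disable CoW for database files by setting the NOCOW flag on the still-empty data directory so that new files inherit it: chattr +C /var/lib/mysql. The flag has no reliable effect on files that already contain data.

# Disable CoW for specific directory (useful for VM disk images, databases)
# Set it while the directory is empty; only files created afterwards inherit NOCOW
chattr +C /var/lib/mysql
# Or disable data CoW for the whole filesystem with the nodatacow mount option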

# Check if CoW is disabled
lsattr /var/lib/mysql/
# ---------------C-- (C = NOCOW)

ZFS Overview

ZFS was developed at Sun Microsystems (2001-2005) by a team led by Jeff Bonwick, with Matthew Ahrens and Bill Moore among its principal developers. It embodies the same CoW principles as Btrfs but with a more mature implementation, end-to-end integrity verification, and a combined volume manager + filesystem (the "pooled storage" model).

Covered in depth in 07-zfs.md. Summary of CoW-relevant features:

ZFS always-consistent state: The ZFS "uberblock" is the filesystem root. Each transaction group commit produces one new uberblock. Every vdev label holds an array of uberblock slots (up to 128) that is used as a ring buffer: each commit overwrites the oldest slot, and on import ZFS selects the valid uberblock with the highest transaction group number.

ZFS CoW write path:

Transaction Group (TxG):
  - All writes in one TxG are batched (default 5-second commit interval)
  - New blocks are allocated from the free space map
  - Modified tree nodes are copied to new locations
  - At TxG commit: write all new data, flush, then write new uberblock
  - If a crash occurs mid-commit: the old uberblock is still valid and the partially written new blocks are unreferenced
  - No cleanup pass is needed: the committed space maps never recorded those blocks as allocated
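
How a ring of uberblock slots yields an atomic commit can be sketched as follows. This is a simplified model under stated assumptions: the Uberblock dataclass, the seal/commit/active_uberblock helpers, and the use of SHA-256 over a text payload are illustrative, and the real on-disk layout has more fields and several label copies per device.

```python
import hashlib
from dataclasses import dataclass

SLOTS = 128  # slot count taken from the text above


@dataclass
class Uberblock:
    txg: int          # transaction group number
    root_ptr: int     # pointer to the root of the tree for this txg
    digest: bytes     # checksum over the fields above


def _digest(txg, root_ptr):
    return hashlib.sha256(f"{txg}:{root_ptr}".encode()).digest()


def seal(txg, root_ptr):
    """Build an uberblock with a checksum covering its contents."""
    return Uberblock(txg, root_ptr, _digest(txg, root_ptr))


def is_valid(ub):
    return ub.digest == _digest(ub.txg, ub.root_ptr)


def commit(ring, txg, root_ptr):
    """Write the new uberblock into its ring slot, leaving older slots intact."""
    ring[txg % SLOTS] = seal(txg, root_ptr)


def active_uberblock(ring):
    """Pick the valid uberblock with the highest txg; torn writes are skipped."""
    candidates = [ub for ub in ring if ub is not None and is_valid(ub)]
    return max(candidates, key=lambda ub: ub.txg, default=None)


ring = [None] * SLOTS
commit(ring, 100, 7000)
commit(ring, 101, 7100)
# Simulate a crash in the middle of writing txg 102: the slot holds garbage.
ring[102 % SLOTS] = Uberblock(102, 7200, b"torn")
current = active_uberblock(ring)
```

The torn slot fails its checksum, so the sketch falls back to txg 101, whose tree was fully written before its uberblock was.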

CoW and Fragmentation

The primary downside of CoW is fragmentation. Consider writing a large file sequentially on an initially empty filesystem:
  - Allocation: contiguous blocks 0, 1, 2, 3... → excellent sequential I/O
  - After modifying scattered 4K regions over time:
      - Each modification allocates new blocks from wherever free space exists
      - Over time, the file is scattered across the disk
      - Sequential read becomes scattered → performance degrades

CoW fragmentation example (Btrfs):

Initial write: [Block 0][Block 1][Block 2][Block 3] (contiguous)

After 100 random 4K updates:
[new-block-5234][Block 1][new-block-8901][Block 3] (scattered)
  ^ Block 0       ^still  ^ Block 2         ^still
    replaced       orig     replaced          orig

Fragmentation visible via:
filefrag -v /path/to/file | tail
# ext:     logical_offset:        physical_offset: length:
#    0:        0..       0:      23452..     23452:      1
#    1:        1..       1:        100..       100:      1
#    2:        2..       2:      89012..     89012:      1
# (many extents = fragmented)

CoW filesystems need periodic defragmentation or rely on workload patterns that minimize random updates (e.g., append-only logs, object storage).
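
The effect of scattered updates on extent count can be simulated in a few lines. This is a toy model, assuming one block per update and a bump allocator that places every rewritten block after the end of the original file; count_extents and simulate are hypothetical helpers.

```python
import random


def count_extents(physical):
    """Number of contiguous physical runs backing the logical block sequence."""
    extents = 1
    for prev, cur in zip(physical, physical[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def simulate(file_blocks=1024, updates=100, seed=0):
    """Return (extents before, extents after) for random single-block rewrites."""
    rng = random.Random(seed)
    physical = list(range(file_blocks))   # initial contiguous layout
    next_free = file_blocks               # free space starts after the file (assumption)
    initial = count_extents(physical)
    for _ in range(updates):
        logical = rng.randrange(file_blocks)
        physical[logical] = next_free     # CoW: the rewritten block moves
        next_free += 1
    return initial, count_extents(physical)


initial_extents, final_extents = simulate()
```

Every rewritten block in the middle of a run splits that run, so the extent count grows with the number of distinct blocks updated.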

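# Btrfs defragmentation
# Caution: defragmenting rewrites extents, which unshares them from snapshots and can increase space usage
btrfs fi defragment -r /data  # recursive defrag of directory
btrfs fi defragment -r -clzo /data  # defrag + compress with LZO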

# ZFS has no explicit defrag tool; relies on workload patterns
# ZFS sequential write patterns naturally produce contiguous extents

Historical Context

The copy-on-write principle in storage systems was described in the Plan 9 operating system research (Bell Labs, 1990s). Sun announced ZFS in 2004; it first shipped in OpenSolaris in 2005 and reached Solaris 10 with the 6/06 update in 2006. The "ARC: A Self-Tuning, Low Overhead Replacement Cache" paper is not a ZFS paper: it was published by Megiddo and Modha of IBM at FAST 2003, and the ZFS cache (also called the ARC) is derived from that algorithm.

The decision to use CoW as the foundation of both ZFS and Btrfs was driven in large part by the problem of silent data corruption ("bit rot") on commodity storage. Later field studies, such as CERN's data integrity study (2007), confirmed that silent errors were more common than the industry acknowledged and that RAID alone did not protect against them (parity could be silently corrupted too). Google's drive study (Pinheiro et al., 2007) addressed a related question, showing that SMART data is a weak predictor of drive failure. CoW + checksums was the solution.

Btrfs development was started by Chris Mason (Oracle) in 2007 and merged into Linux 2.6.29 (2009). It was intended as the eventual successor to ext4 and has been the default root filesystem on SUSE Linux Enterprise Desktop/Server and openSUSE since 2014. Red Hat deprecated Btrfs in RHEL 7.4 (2017) and removed it in RHEL 8; RHEL 9 does not include it either.

Meta (Facebook) adopted Btrfs for their fleet of Linux servers around 2014-2015, contributing significantly to its stability improvements. Meta specifically uses Btrfs on high-density storage servers for cold data, benefiting from Btrfs's built-in compression and efficient snapshot semantics.

Production Examples

macOS APFS: Apple's APFS (Apple File System, 2017) is a CoW filesystem designed for SSDs. Like ZFS, it features instant snapshots (used by Time Machine) and clones (zero-copy file copies), along with atomic safe-save and per-file encryption. Unlike ZFS, it checksums only metadata, not user data. APFS demonstrates CoW as the modern standard for new filesystems.

Docker volume storage with Btrfs: Docker's btrfs storage driver uses Btrfs subvolumes for container layers and snapshots for instantiating containers. Each docker run creates a snapshot of the image subvolume — O(1) regardless of image size. Container writes only allocate space for changed blocks.

ZFS at Oracle, Joyent (SmartOS), TrueNAS: ZFS is the production standard for enterprise NAS appliances (iXsystems TrueNAS CORE), cloud infrastructure (Joyent's SmartOS), and Oracle's Solaris/SPARC infrastructure. iXsystems has deployed ZFS at enterprise scale since 2011.

Debugging Notes

# Btrfs health check
btrfs fi show /data          # device and usage summary
btrfs fi df /data            # data/metadata space usage by RAID type
btrfs device stats /data     # per-device error counters (read/write errors)

# Scrub (verify all checksums)
btrfs scrub start /data
btrfs scrub status /data
btrfs scrub cancel /data     # stop if causing I/O impact

# Check for errors
btrfs check /dev/sda1        # READ-ONLY check (safe); run it on an unmounted filesystem
# WARNING: btrfs check --repair is DANGEROUS

# Balance status and progress
btrfs balance status /data

# Resolve an inode number (e.g., from a scrub or checksum error message) to a file path
btrfs inspect-internal inode-resolve <inode-number> /data

# ZFS health
zpool status                  # pool health, device failures
zpool status -v               # verbose with per-device errors
zfs list                      # list datasets and space usage
zfs get all pool/dataset      # all properties
zpool scrub pool              # verify all checksums

# ZFS I/O statistics
zpool iostat 1               # per-pool I/O stats (like iostat for ZFS)
zpool iostat -v 1            # per-vdev stats

# Check ARC (ZFS cache) hit rate
arc_summary                  # Linux: shipped with the ZFS userland tools (zfsutils-linux on Debian/Ubuntu)
# or via /proc/spl/kstat/zfs/arcstats

Security Implications

Snapshot data retention: Snapshots preserve data that was "deleted" from the live filesystem. If sensitive data is deleted from a Btrfs or ZFS dataset, it remains accessible from snapshots until those snapshots are explicitly deleted. In regulated environments, snapshot retention policies must account for data residency requirements.

CoW and encrypted filesystems: CoW filesystems on top of a block-level encrypted device (LUKS/dm-crypt) provide full-disk encryption. CoW's block allocation pattern can leak metadata about when and where writes occurred: an observer who can compare images of the encrypted device over time sees which blocks changed. Native encryption (ZFS native encryption, APFS encryption) is an alternative that allows per-dataset or per-file keys, but it is not strictly stronger. ZFS native encryption, for example, leaves dataset names and properties unencrypted.

Btrfs NOCOW and data integrity: Setting chattr +C (NOCOW) on a file disables checksums (and compression) for that file. Without checksum protection, silent corruption of that file goes undetected. Use with awareness: it is only appropriate when the application provides its own data integrity (like databases with internal checksums).

Performance Implications

CoW small random write amplification: Each 4K random write in Btrfs requires allocating a new data block (4K) + updating metadata (which may cascade CoW through multiple B-tree levels: leaf node, internal node, possibly root). On a cold, fragmented filesystem, one logical 4K write can cause 5-10 physical 4K writes. This is why Btrfs is not recommended for database storage without NOCOW.
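
The arithmetic behind that estimate can be made explicit. The function below is a back-of-the-envelope sketch, assuming that one data block is rewritten and that every node on the path to the root is copied exactly once with nothing else batched alongside it; cow_write_amplification and its parameters are hypothetical names.

```python
def cow_write_amplification(data_bytes, tree_depth, node_bytes):
    """Physical bytes written per logical byte for one isolated CoW update.

    data_bytes: size of the rewritten data block
    tree_depth: number of metadata nodes copied on the path to the root
    node_bytes: size of each metadata node
    """
    physical = data_bytes + tree_depth * node_bytes
    return physical / data_bytes


# 4 KiB write, three metadata levels (leaf, internal, root), 4 KiB nodes
amp_4k_nodes = cow_write_amplification(4096, 3, 4096)

# Same write with 16 KiB nodes, the Btrfs default node size
amp_16k_nodes = cow_write_amplification(4096, 3, 16384)
```

Under these assumptions the ratio is 4x with 4 KiB nodes and 13x with 16 KiB nodes. Updates to additional trees (extent and checksum trees) push the figure higher, while batching many updates into one transaction pulls it lower.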

Snapshot accumulation and GC: Every outstanding snapshot holds references to old blocks. As files change, old versions accumulate. When a snapshot is deleted, the reference counts for all its blocks are decremented, potentially triggering cascading frees. Deleting a large, old snapshot can cause a significant I/O burst (the free space accounting work). Plan snapshot deletion carefully.

ZFS transaction group coalescing: ZFS batches all writes within a 5-second transaction group, amortizing the CoW tree update overhead. During a bulk write workload, many file modifications within one TxG share tree updates — write amplification is lower than for random single-block updates.
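
The benefit of batching can be sketched by counting how many tree nodes must be rewritten with and without coalescing. This is a toy model of a complete tree with a fixed fanout; nodes_rewritten and its numbering scheme are assumptions made for the example.

```python
def nodes_rewritten(leaf_ids, fanout, depth, batched):
    """Count tree-node writes caused by updates that touch the given leaves.

    Nodes at each level are numbered left to right; the parent of node n is
    n // fanout. Level 0 is the leaf level and level depth-1 is the root.
    """
    def path(leaf):
        nodes, n = [], leaf
        for level in range(depth):
            nodes.append((level, n))
            n //= fanout
        return nodes

    if batched:
        # One transaction group: each dirty node is written once.
        return len({node for leaf in leaf_ids for node in path(leaf)})
    # One commit per update: every update rewrites its full path.
    return sum(len(path(leaf)) for leaf in leaf_ids)


touched = [0, 1, 2, 5]
separate = nodes_rewritten(touched, fanout=4, depth=3, batched=False)
coalesced = nodes_rewritten(touched, fanout=4, depth=3, batched=True)
```

Four separate commits rewrite 12 nodes in this model, while one batched commit rewrites 7, because the shared internal nodes and the root are written once.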

Failure Modes and Real Incidents

Btrfs RAID5 data loss (reported 2015-2022): Multiple users reported losing data when using Btrfs RAID5 with unclean shutdowns. The root cause was incomplete stripe writes that parity could not recover. As of 2024, Btrfs RAID5/6 has a "write hole" fix in progress but is still not considered production-ready. Users who deployed it and lost data had to recover from backups.

ZFS silent corruption detection success: ZFS's end-to-end checksums have been responsible for detecting numerous silent data corruption events that would have been invisible to other filesystems. The CERN data integrity study (2007) documents the underlying problem: using a purpose-built probe program rather than ZFS, it found silent corruption on RAID-backed storage that the RAID layer itself never reported. RAID parity alone cannot detect this class of error; block-level checksums verified on every read can.

Btrfs free space exhaustion (CoW overhead): A common user complaint: df shows 200 GB free, but all writes fail with ENOSPC. This happens because Btrfs allocates raw space in chunks dedicated to either data or metadata. Once all raw space has been assigned to chunks, full metadata chunks cannot grow even if the data chunks are mostly empty. Fix: compact under-used data chunks so their space returns to the unallocated pool (for example, btrfs balance start -dusage=50 /data), or add another device. This was significantly improved in Linux 5.14+ with more aggressive metadata chunk reclaim.
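
The failure can be reproduced in a small model of chunk allocation. This is a deliberately simplified sketch: sizes are in MiB, chunks are a fixed 1 GiB, and the ChunkAllocator class with its write, release, and balance_data methods is hypothetical rather than a description of the real allocator.

```python
import errno


class ChunkAllocator:
    """Toy model: raw space is handed out in chunks dedicated to one type."""

    def __init__(self, raw_mib, chunk_mib=1024):
        self.unallocated = raw_mib
        self.chunk_mib = chunk_mib
        self.free = {"data": 0, "metadata": 0}   # free space inside chunks

    def _grow(self, kind):
        if self.unallocated < self.chunk_mib:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.unallocated -= self.chunk_mib
        self.free[kind] += self.chunk_mib

    def write(self, kind, mib):
        while self.free[kind] < mib:
            self._grow(kind)
        self.free[kind] -= mib

    def release(self, kind, mib):
        """Delete files: space becomes free inside chunks, not unallocated."""
        self.free[kind] += mib

    def balance_data(self):
        """Compact data chunks and return wholly empty ones to the pool."""
        empty_chunks = self.free["data"] // self.chunk_mib
        self.free["data"] -= empty_chunks * self.chunk_mib
        self.unallocated += empty_chunks * self.chunk_mib


fs = ChunkAllocator(raw_mib=10240)
fs.write("metadata", 1024)       # metadata chunk is now full
fs.write("data", 9216)           # all remaining raw space becomes data chunks
fs.release("data", 5120)         # 5 GiB "free", but locked inside data chunks

try:
    fs.write("metadata", 100)
    metadata_write_failed = False
except OSError as exc:
    metadata_write_failed = exc.errno == errno.ENOSPC

fs.balance_data()                # analogous to balancing under-used data chunks
fs.write("metadata", 100)        # succeeds: a new metadata chunk can be allocated
```

The model shows why the free space reported for the filesystem and the ability to write can disagree: the space exists, but it is committed to the wrong chunk type until a balance releases it.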

Modern Usage

APFS (Apple, 2017): CoW filesystem as the universal standard on all Apple devices (iPhone, Mac). Demonstrates CoW + metadata checksums + native encryption at consumer scale.
Stratis (Red Hat's Fedora/RHEL volume manager): Uses XFS on top of device-mapper thin provisioning. Not a native CoW filesystem, but it provides snapshots and thin provisioning through the CoW behavior of the device-mapper thin pool.
Btrfs in container environments: Btrfs is supported as a container storage driver in Docker and Podman, enabling efficient image layering and instant container creation.

Future Directions

Btrfs RAID5/6 repair: Active development to fix the write hole and make RAID5/6 production-ready. Expected to land in 2024-2025 kernel releases.
Btrfs extent tree v2: A new extent tree format to reduce fragmentation in the extent tree itself (the tree grows and fragments over time with many small extents).
ZFS dRAID production hardening: dRAID (distributed spare RAID in ZFS 2.1) enables faster resilver by distributing spare capacity across all vdev disks rather than one dedicated spare. Production use increasing.
bcachefs: A newer CoW filesystem developed by Kent Overstreet that pairs a journal (for fast small writes) with CoW B-trees (for large writes and snapshots). Merged into Linux 6.7 (released January 2024). Designed to combine the best of ext4 (journaling speed) and Btrfs/ZFS (CoW integrity, snapshots).

Exercises

Create a Btrfs filesystem on a loop device. Write a 100 MB file. Create a snapshot. Modify 10 MB of the file (random 4K writes). Use btrfs fi df to measure space usage before and after modification. How much space does the snapshot consume after the modification?
Demonstrate silent data corruption detection. Write a file on Btrfs. Unmount. Use dd to corrupt a random block (knowing the physical offset from filefrag). Remount and read the file. Observe the checksum error in dmesg.
Compare Btrfs CoW write performance vs in-place write performance. Create two files: one on Btrfs with NOCOW (chattr +C), one with CoW. Run fio --rw=randwrite --bs=4k on each. Measure IOPS difference. What does this reveal about CoW write amplification?
Use btrfs send/receive to perform an incremental backup. Create a snapshot, modify some files, create another snapshot. Send incrementally. Measure the data transferred vs the total file data. What is the efficiency ratio?
Research ZFS's "always-consistent on-disk state" guarantee in detail. Read the ZFS source (module/zfs/dsl_pool.c, module/zfs/uberblock.c). How does ZFS ensure the uberblock write is atomic? What prevents a partially-written uberblock from being mistaken for a committed transaction?

References

Bonwick, J. and Moore, B. "ZFS: The Last Word in File Systems." OpenSolaris, 2004. (presentation)
Bonwick, J. et al. "The Zettabyte File System." FAST 2003 (early ZFS paper).
Rodeh, O., Bacik, J., and Mason, C. "BTRFS: The Linux B-Tree Filesystem." ACM Trans. Storage, 2013.
Rodeh, O. "B-Trees, Shadowing, and Clones." ACM Trans. Storage, 2008.
CERN silent corruption study: Panzer-Steindel, B. "Data Integrity." CHEP 2007.
Linux kernel source: fs/btrfs/, fs/btrfs/ctree.c, fs/btrfs/extent_io.c
OpenZFS source: https://github.com/openzfs/zfs
Btrfs documentation: https://btrfs.readthedocs.io/
APFS Reference: https://developer.apple.com/support/downloads/Apple-File-System-Reference.pdf
bcachefs: https://bcachefs.org/