07 - ZFS
Technical Overview
ZFS (Zettabyte File System) is a combined filesystem and volume manager originally developed at Sun Microsystems by a team led by Jeff Bonwick and Matt Ahrens, with Bill Moore, Mark Shellenbaum, and others (2001-2005). It is one of the most complete implementations of the "end-to-end data integrity" philosophy: every block of data and metadata has a checksum, every write is copy-on-write, every read verifies its checksum, and the pool is always in a consistent on-disk state.
ZFS's design principles (pooled storage, always-consistent state, and end-to-end checksums) influenced most copy-on-write filesystems designed after it. OpenZFS, the open-source continuation after Oracle's acquisition of Sun, is widely used in production for NAS systems (TrueNAS), illumos distributions (OmniOS), FreeBSD, and Linux (as an out-of-tree kernel module, because the CDDL is incompatible with the GPL).
Prerequisites
- Copy-on-write filesystem concepts (see 04-copy-on-write-filesystems.md)
- RAID levels and erasure coding (see 12-storage-systems/06-raid.md)
- Storage hierarchy and caching (see 12-storage-systems/01-storage-hierarchy.md)
- Btrfs background useful for comparison
Core Content
ZFS Design Principles
Pooled storage: ZFS eliminates the distinction between volume management and filesystem. A zpool aggregates physical disks (or files, or partitions) into a storage pool. Datasets (filesystems, zvols, snapshots) are carved from the pool's free space. This replaces the traditional: disk → partition → LVM volume → filesystem stack with disk → zpool → dataset.
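Always-consistent on-disk state: Every transaction group (TxG) commit writes all new blocks before atomically updating the "uberblock" (root pointer). If power fails, the last committed uberblock is used on the next import, so the pool is always in the state it was in after the last successful TxG commit. No fsck or journal replay is needed to restore consistency; the only thing replayed is the ZIL, which re-applies synchronous writes that were acknowledged after that commit (see the ZIL section).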
End-to-end checksums: Every block (data and metadata) has a checksum stored in its parent pointer (NOT in the block itself — this allows detecting corruption of the block from a reliable parent). Checksum algorithms: fletcher4 (default, fast), sha256, sha512, skein, edonr.
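The parent-held checksum can be illustrated with a minimal sketch. The `BlockPointer` class and the dictionary standing in for a disk are hypothetical simplifications, and SHA-256 is used only because it is available in the standard library; this is not how OpenZFS lays out block pointers.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class BlockPointer:
    """Hypothetical, simplified stand-in for a ZFS block pointer."""
    address: int     # where the child block lives on the "disk"
    checksum: bytes  # checksum of the child's contents, kept in the parent


def write_block(disk: dict, address: int, payload: bytes) -> BlockPointer:
    """Store payload and return a pointer that carries its checksum."""
    disk[address] = payload
    return BlockPointer(address, hashlib.sha256(payload).digest())


def read_block(disk: dict, ptr: BlockPointer) -> bytes:
    """Read a block and verify it against the checksum held by the parent."""
    payload = disk[ptr.address]
    if hashlib.sha256(payload).digest() != ptr.checksum:
        raise IOError(f"checksum mismatch at address {ptr.address}")
    return payload


disk = {}
ptr = write_block(disk, 7, b"hello zfs")
clean_read = read_block(disk, ptr)

disk[7] = b"hellp zfs"  # simulate silent corruption on the device
try:
    read_block(disk, ptr)
    detected = False
except IOError:
    detected = True
```

Because the checksum lives outside the block it protects, a device that returns the wrong block, or a stale one, is caught as well: the data no longer matches what the parent expects.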
Pooled redundancy (RAIDZ): RAIDZ stripes eliminate the write hole (no partial stripe writes due to CoW). The stripe width varies per I/O, so each write is atomic.
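To make the variable stripe width concrete, here is a sketch of how much raw space one logical block consumes on a RAIDZ vdev. It assumes the allocation rule commonly described for classic RAIDZ (parity sectors added per row, total rounded up to a multiple of parity + 1); the function name and defaults are illustrative, and the result may not match a vdev that has been through RAIDZ expansion.

```python
def raidz_allocated_bytes(block_bytes: int, ndisks: int, nparity: int,
                          ashift: int = 12) -> int:
    """Estimate raw bytes allocated for one block on a RAIDZ vdev.

    Assumption: classic RAIDZ allocation, with 2**ashift-byte sectors.
    """
    sector = 1 << ashift
    data_sectors = (block_bytes + sector - 1) // sector
    data_cols = ndisks - nparity
    # Each row of the stripe holds up to data_cols data sectors and
    # needs its own nparity parity sectors.
    rows = (data_sectors + data_cols - 1) // data_cols
    total = data_sectors + rows * nparity
    # Round up to a multiple of (nparity + 1) so freed space stays usable.
    unit = nparity + 1
    total = ((total + unit - 1) // unit) * unit
    return total * sector


# 5-disk RAIDZ2 with 4 KiB sectors (ashift=12)
small = raidz_allocated_bytes(4 * 1024, ndisks=5, nparity=2)    # 1 data + 2 parity
medium = raidz_allocated_bytes(8 * 1024, ndisks=5, nparity=2)   # padded to 6 sectors
large = raidz_allocated_bytes(128 * 1024, ndisks=5, nparity=2)  # 32 data + 22 parity
```

Under these assumptions a 4 KiB block costs three sectors (the same overhead as a 3-way mirror), which is why small-block workloads on wide RAIDZ vdevs use more space than the nominal data-to-parity ratio suggests.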
ZFS Pool and Vdev Model
ZFS Pool Architecture:
zpool "tank"
|
+-- vdev 0: mirror
| +-- /dev/sda (disk 0)
| +-- /dev/sdb (mirror of disk 0)
|
+-- vdev 1: raidz2
| +-- /dev/sdc
| +-- /dev/sdd
| +-- /dev/sde
| +-- /dev/sdf (2 parity disks)
| +-- /dev/sdg
|
+-- vdev 2: log (ZIL — SLOG device)
| +-- /dev/nvme0n1 (fast NVMe for sync writes)
|
+-- vdev 3: cache (L2ARC)
+-- /dev/nvme1n1 (SSD cache for read acceleration)
Vdev types:
- disk (or file): single disk, no redundancy
- mirror: RAID1 between N disks
- raidz1/2/3: single-, double-, or triple-parity RAID (comparable to RAID5/RAID6, plus a triple-parity level) with CoW, so no write hole
- log (SLOG): ZIL device for synchronous write acceleration
- cache (L2ARC): L2 read cache
- spare: hot spare, automatically used if a disk fails
- draid: distributed RAID with distributed spare (ZFS 2.1+)
# Create a pool with RAIDZ2 (2 parity) across 5 drives
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
# Alternative: create a pool striped across two mirrors (RAID10-like)
zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd
# Add log device (SLOG) for synchronous write acceleration
zpool add tank log /dev/nvme0n1
# Add L2ARC cache
zpool add tank cache /dev/nvme1n1
# View pool status
zpool status tank
# pool: tank
# state: ONLINE
# config:
# NAME STATE READ WRITE CKSUM
# tank ONLINE 0 0 0
# raidz2-0 ONLINE 0 0 0
# sda ONLINE 0 0 0
# ...
# Pool I/O statistics
zpool iostat -v tank 1
ZFS Dataset Types
Filesystem: A mountable POSIX filesystem within the pool.
zfs create tank/data
zfs set mountpoint=/data tank/data
zfs mount tank/data
zvol (ZFS Volume): A block device exposed from the pool. Used for iSCSI targets, VM disk images, swap.
zfs create -V 100G tank/vm-disk # creates /dev/zvol/tank/vm-disk
# Useful for: QEMU/KVM VM disks, iSCSI LUNs
Snapshot: Read-only point-in-time copy of a dataset.
zfs snapshot tank/data@2024-01-01
zfs list -t snapshot
zfs rollback tank/data@2024-01-01 # revert to the most recent snapshot (-r is required to roll back past newer snapshots, and destroys them)
Clone: Writable copy of a snapshot (starts as exact copy, diverges via CoW).
zfs clone tank/data@2024-01-01 tank/data-clone
Bookmark: A lightweight reference to a snapshot's transaction group number (does not preserve data, only enables incremental send).
zfs bookmark tank/data@2024-01-01 tank/data#bookmark-jan01
# After deleting the snapshot, the bookmark can still be used as parent for incremental send
ARC: Adaptive Replacement Cache
ZFS has its own caching layer — the ARC (Adaptive Replacement Cache) — that lives in kernel memory. It does NOT use the Linux page cache. This is why ZFS on Linux requires careful memory tuning: ZFS ARC and the Linux page cache compete for the same DRAM.
ARC Architecture:
L1: ARC (kernel DRAM, managed by ZFS)
|
+-- MRU (Most Recently Used) list
| Recently accessed blocks; evicted first if needed
|
+-- MFU (Most Frequently Used) list
| Blocks accessed multiple times; protected from single-scan eviction
| (this is the key improvement over simple LRU)
|
Adaptive: ARC dynamically adjusts the MRU/MFU boundary based on hit patterns
"scan resistance" — prevents a full table scan from evicting hot cache entries
ARC is smarter than OS page cache LRU:
- Ghost lists (recently evicted items) track whether eviction was correct
- Adjusts MRU/MFU split based on ghost list hit rates
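The ghost-list mechanism can be sketched with the textbook ARC algorithm. This is a simplified model for fixed-size entries; the class and attribute names are illustrative, and the OpenZFS ARC differs in many ways (variable block sizes, metadata/data accounting, locking).

```python
from collections import OrderedDict


class SimpleARC:
    """Textbook ARC for fixed-size entries (a sketch, not the OpenZFS code)."""

    def __init__(self, capacity: int):
        self.c = capacity
        self.p = 0                # adaptive target size for the MRU side
        self.t1 = OrderedDict()   # MRU: cached, seen once recently
        self.t2 = OrderedDict()   # MFU: cached, seen at least twice
        self.b1 = OrderedDict()   # ghosts of entries evicted from t1
        self.b2 = OrderedDict()   # ghosts of entries evicted from t2

    def _replace(self, hit_in_b2: bool) -> None:
        """Evict one cached entry, remembering its key in a ghost list."""
        n1 = len(self.t1)
        if n1 > 0 and (n1 > self.p or (hit_in_b2 and n1 == self.p)):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None

    def access(self, key) -> bool:
        """Touch key; return True on a cache hit."""
        if key in self.t1:                      # second access: promote to MFU
            del self.t1[key]
            self.t2[key] = None
            return True
        if key in self.t2:
            self.t2.move_to_end(key)
            return True
        if key in self.b1:                      # MRU side was too small: grow it
            self.p = min(self.c, self.p + max(1, len(self.b2) // len(self.b1)))
            self._replace(False)
            del self.b1[key]
            self.t2[key] = None
            return False
        if key in self.b2:                      # MFU side was too small: grow it
            self.p = max(0, self.p - max(1, len(self.b1) // len(self.b2)))
            self._replace(True)
            del self.b2[key]
            self.t2[key] = None
            return False
        l1 = len(self.t1) + len(self.b1)
        total = l1 + len(self.t2) + len(self.b2)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(False)
            else:
                self.t1.popitem(last=False)
        elif total >= self.c:
            if total == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(False)
        self.t1[key] = None
        return False


arc = SimpleARC(capacity=4)
for key in ["a", "b", "a", "b"]:      # "a" and "b" become frequently used
    arc.access(key)
for i in range(100):                  # one-pass scan over 100 cold keys
    arc.access(f"scan-{i}")
hot_survived = arc.access("a") and arc.access("b")
```

In this run the scan only churns the MRU side, so the two hot keys are still cached afterwards; a plain 4-entry LRU would have evicted both.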
# Monitor ARC stats (Linux)
grep -E '^(size|hits|misses|c_max) ' /proc/spl/kstat/zfs/arcstats
# size: current ARC size in bytes
# hits: total cache hits
# misses: total cache misses
# c_max: maximum ARC size (Linux default: about 50% of RAM)
# Tune ARC size (persistent via /etc/modprobe.d/zfs.conf)
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
# 17179869184 = 16 GB ARC max
# arc_summary script (part of the zfsutils-linux package on Debian/Ubuntu)
arc_summary
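ARC sizing guidelines:
- Default ARC max on Linux: about 50% of RAM (other platforms and newer releases use different defaults)
- For databases that keep their own buffer cache: a large ARC caches the same data twice, so set it smaller (O_DIRECT bypasses the ARC only on OpenZFS 2.3+)
- For NAS/file server: ARC should be as large as possible (70-80% of RAM)
- The ARC competes with the Linux page cache and applications for DRAM: balance based on workload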
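When judging whether the ARC is sized well, the raw counters are easier to read as a hit ratio. The sketch below parses text in the kstat layout used by /proc/spl/kstat/zfs/arcstats (name, type, value columns). The sample values are invented for illustration; on a real system the text would come from reading that file.

```python
# Invented sample in the arcstats layout; real values come from
# /proc/spl/kstat/zfs/arcstats on a Linux host with ZFS loaded.
SAMPLE_ARCSTATS = """\
13 1 0x01 123 33456 8400829901 9032167483
name                            type data
hits                            4    900
misses                          4    100
size                            4    8589934592
c_max                           4    17179869184
"""


def parse_kstat(text: str) -> dict:
    """Return {name: value} for every 'name type value' line."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Data lines have exactly three columns with a numeric value;
        # this skips the two header lines.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_hit_ratio(stats: dict) -> float:
    """Fraction of ARC lookups that were hits (0.0 if no lookups yet)."""
    lookups = stats["hits"] + stats["misses"]
    return stats["hits"] / lookups if lookups else 0.0


stats = parse_kstat(SAMPLE_ARCSTATS)
ratio = arc_hit_ratio(stats)
fill = stats["size"] / stats["c_max"]  # how full the ARC is relative to its cap
```

The counters are cumulative since module load, so for a live system it is usually more informative to sample twice and compute the ratio over the difference.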
L2ARC (Level 2 ARC)
L2ARC is an optional SSD cache tier that sits between ARC (DRAM) and the vdev storage:
Read path:
ARC (DRAM hit): fastest
↓ miss
L2ARC (NVMe SSD hit): ~100 µs
↓ miss
vdev (HDD): 5-10 ms
Write path:
ALL writes go directly to vdev pool (L2ARC is READ cache only)
L2ARC is populated asynchronously from blocks evicted from ARC
# Add L2ARC device
zpool add tank cache /dev/nvme1n1
# Monitor L2ARC efficiency
arc_summary | grep -A 10 L2
# L2ARC stats in arcstats
grep l2 /proc/spl/kstat/zfs/arcstats
# l2_hits: L2ARC hit count
# l2_misses: L2ARC miss count
# l2_size: current L2ARC usage
L2ARC is most effective for: workloads that access a large "warm" dataset that doesn't fit in ARC but is accessed repeatedly; HDD-backed pools where the working set exceeds ARC.
L2ARC is NOT effective for: write-heavy workloads (L2ARC only helps reads); already SSD-backed pools (vdev latency already low); uniformly random access over huge dataset (no hot set to cache).
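A rough way to decide whether an L2ARC is worth adding is to estimate the average read latency from the hit ratios of each tier. The sketch below uses the latencies quoted in the read path above (about 100 µs for L2ARC, 5-10 ms for HDD, taken here as 7.5 ms); the 1 µs ARC hit cost is an assumed placeholder, and the function name is illustrative.

```python
def effective_read_latency_us(arc_hit: float, l2arc_hit: float,
                              arc_us: float = 1.0,      # assumed DRAM hit cost
                              l2arc_us: float = 100.0,  # from the read path above
                              hdd_us: float = 7500.0):  # midpoint of 5-10 ms
    """Average read latency in microseconds for a three-tier read path.

    arc_hit:   fraction of all reads served from ARC
    l2arc_hit: fraction of ARC misses served from L2ARC
    """
    miss_cost = l2arc_hit * l2arc_us + (1.0 - l2arc_hit) * hdd_us
    return arc_hit * arc_us + (1.0 - arc_hit) * miss_cost


without_l2arc = effective_read_latency_us(arc_hit=0.8, l2arc_hit=0.0)
with_l2arc = effective_read_latency_us(arc_hit=0.8, l2arc_hit=0.5)
```

With an 80% ARC hit ratio, catching half of the remaining misses in L2ARC roughly halves the average latency in this model, because HDD misses dominate the total. If the ARC hit ratio is already very high, or the L2ARC hit ratio stays low, the benefit shrinks accordingly.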
ZIL (ZFS Intent Log) and SLOG
Applications that call fsync() or use O_SYNC/O_DSYNC require synchronous writes: the write must be durable before the call returns. Without the ZIL, ZFS would have to wait for the next TxG commit (up to 5 seconds by default) before it could confirm durability and return.
The ZIL provides synchronous write semantics with lower latency:
Synchronous write path:
Application write() + fsync()
|
v
1. Write data to ZIL (sequential log, fast commit)
(if SLOG: write to dedicated NVMe SLOG device)
(if no SLOG: write to pool devices' ZIL area)
|
Return to application: write is durable
|
2. (background) TxG commit: write the same data from memory to its final pool location
(the ZIL is only read back during crash recovery)
3. Mark ZIL entries as committed: free ZIL space
Without SLOG: ZIL writes go to the main pool vdev. On HDDs, this means the synchronous write latency is ~5-10 ms (HDD seek + rotational latency). On NVMe vdevs, ZIL writes are already fast.
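With SLOG (separate log device): ZIL writes go to a dedicated low-latency device, typically NVMe with power-loss protection. Synchronous write latency drops to ~100 µs. The SLOG should be mirrored (use two NVMe drives): after a crash or power loss, the ZIL is the only copy of synchronous writes that were acknowledged but not yet committed in a TxG, so losing the SLOG at that moment loses those writes. If the SLOG fails while the system keeps running, the data is still in memory and is written by the next TxG commit.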
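The relationship between in-memory data, the ZIL, and the TxG commit can be modeled in a few lines. `TinyPool` is a hypothetical toy, not the OpenZFS design: it only shows which copy of the data is used in normal operation and which after a crash.

```python
class TinyPool:
    """Toy model of sync-write durability (illustrative only)."""

    def __init__(self):
        self.pool = {}   # state as of the last committed TxG (on disk)
        self.dirty = {}  # changes held in RAM for the open TxG
        self.zil = []    # intent log records on stable storage (pool or SLOG)

    def write(self, key, value, sync=False):
        self.dirty[key] = value
        if sync:
            # A sync write returns only after its log record is durable.
            self.zil.append((key, value))

    def txg_commit(self):
        # Data is written from RAM; the log is not read here.
        self.pool.update(self.dirty)
        self.dirty.clear()
        self.zil.clear()  # records are now redundant and can be freed

    def crash_and_import(self, log_device_lost=False):
        self.dirty.clear()          # RAM contents are gone
        if log_device_lost:
            self.zil.clear()        # e.g. an unmirrored SLOG died too
        for key, value in self.zil:  # replay acknowledged sync writes
            self.pool[key] = value
        self.zil.clear()


p = TinyPool()
p.write("synced", 1, sync=True)
p.write("async", 2)
p.crash_and_import()
after_crash = dict(p.pool)           # only the sync write survives

q = TinyPool()
q.write("synced", 1, sync=True)
q.crash_and_import(log_device_lost=True)
after_double_failure = dict(q.pool)  # acknowledged write is lost
```

The second scenario is the case that mirroring the SLOG protects against: the log device and the in-memory copy are lost at the same time.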
# Add mirrored SLOG
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
# Monitor ZIL write rate
zpool iostat -v tank 1 # check log device I/O
# ZIL stats
grep zil /proc/spl/kstat/zfs/zil
RAIDZ vs MD RAID Comparison
| Feature | ZFS RAIDZ | Linux MD RAID5/6 |
|---|---|---|
| Write hole | None (CoW) | Exists (requires BBU/journal) |
| Checksums | Per-block (detects and repairs bit rot) | None per block (a check can find parity mismatches but cannot tell which block is wrong) |
| Stripe width | Variable per I/O | Fixed (all disks) |
| Snapshot support | Native, instant | Not in RAID (filesystem level) |
| Adding disks | RAIDZ expansion (OpenZFS 2.3+) | Reshape with mdadm --grow |
| Read performance | (N - parity) × disk_throughput for streaming reads | Same |
| Recovery speed | dRAID: faster; regular RAIDZ: similar | Similar |
| Software quality | Mature (20+ years) | Mature |
| WAL/journal needed | No | Recommended for RAID5 |
dRAID (Distributed RAID)
A traditional RAIDZ vdev with a hot spare keeps the spare disk idle until a failure. dRAID (OpenZFS 2.1+) distributes the spare capacity across all disks:
Traditional RAIDZ2 with hot spare (8 drives: 5 data + 2 parity + 1 spare):
[D0][D1][D2][D3][D4][P1][P2][SPARE]
active idle
dRAID2 (the same 8 drives):
Each drive holds a mix of data, parity, and "distributed spare" (DS) slices:
[D+P+DS][D+P+DS][D+P+DS][D+P+DS][D+P+DS][D+P+DS][D+P+DS][D+P+DS]
Advantage: rebuild after failure reads from/writes to ALL drives simultaneously
→ faster resilver than traditional RAIDZ
# Create dRAID2 with 8 drives, 5 data drives per group
zpool create tank draid2:5d:8c:1s /dev/sd{a..h}
# 5d = 5 data per group, 8c = 8 children, 1s = 1 distributed spare
# (data + parity must fit in children - spares: 5 + 2 <= 8 - 1)
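A dRAID specification string is easy to get wrong, so a small checker can help before running zpool create. The sketch below handles only the fully explicit form draidP:Nd:Nc:Ns and assumes the constraint that data plus parity disks must fit within the children that remain after subtracting distributed spares; the function name is hypothetical.

```python
import re


def check_draid_spec(spec: str) -> dict:
    """Parse 'draid<parity>:<N>d:<N>c:<N>s' and validate the layout.

    Assumption: data + parity <= children - spares must hold.
    Only the fully explicit form is handled in this sketch.
    """
    m = re.fullmatch(r"draid([123]):(\d+)d:(\d+)c:(\d+)s", spec)
    if m is None:
        raise ValueError(f"unsupported dRAID spec: {spec!r}")
    parity, data, children, spares = (int(g) for g in m.groups())
    available = children - spares
    if data + parity > available:
        raise ValueError(
            f"{data} data + {parity} parity exceeds the "
            f"{available} children left after {spares} spare(s)"
        )
    return {"parity": parity, "data": data,
            "children": children, "spares": spares}


layout = check_draid_spec("draid2:5d:8c:1s")

try:
    check_draid_spec("draid2:5d:7c:1s")  # 5 + 2 > 7 - 1
    too_narrow_rejected = False
except ValueError:
    too_narrow_rejected = True
```

A group of 5 data plus 2 parity disks with one distributed spare therefore needs at least 8 children.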
ZFS Snapshots and Rollback
# Snapshot workflow
zfs snapshot tank/data@before-update
# ... do work ...
# If something goes wrong:
zfs rollback tank/data@before-update # immediately reverts all changes
# Recursive snapshot (filesystem + all children)
zfs snapshot -r tank@nightly-2024-01-01
# Clone a snapshot (writable copy for testing)
zfs clone tank/data@before-update tank/data-test
# Automated snapshot with zfs-auto-snapshot or sanoid
# sanoid: a ZFS snapshot management tool
ZFS send/receive
# Full send (initial replication)
zfs snapshot tank/data@base
zfs send tank/data@base | zfs receive backup/data
# Incremental send
zfs snapshot tank/data@current
zfs send -i tank/data@base tank/data@current | zfs receive backup/data
# Compressed send (blocks are sent compressed, exactly as stored on disk)
zfs send -c tank/data@current | ssh remote 'zfs receive backup/data'
# -c: send compressed as stored (avoids decompress+recompress)
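# Replicate an entire pool (recursive snapshot, then send with -R)
zfs snapshot -r tank@snap
zfs send -R tank@snap | zfs receive -F backup_pool
# -F is needed because the root dataset of backup_pool already exists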
zrepl is a popular third-party replication daemon for ZFS that manages snapshot lifecycle and incremental replication with configurable policies.
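The full-then-incremental pattern above generalizes to any ordered list of snapshots. The sketch below only builds the command lines; it does not run them. The function name is hypothetical, and the dataset and snapshot names are the examples used above.

```python
import shlex


def replication_commands(source: str, target: str, snapshots: list) -> list:
    """Build 'zfs send | zfs receive' pipelines for an ordered snapshot list.

    The first snapshot is sent in full; each later one is sent as an
    increment from its predecessor (zfs send -i).
    """
    commands = []
    previous = None
    for snap in snapshots:
        if previous is None:
            send = ["zfs", "send", f"{source}@{snap}"]
        else:
            send = ["zfs", "send", "-i",
                    f"{source}@{previous}", f"{source}@{snap}"]
        receive = ["zfs", "receive", target]
        commands.append(f"{shlex.join(send)} | {shlex.join(receive)}")
        previous = snap
    return commands


plan = replication_commands("tank/data", "backup/data", ["base", "current"])
for line in plan:
    print(line)
```

Generating the plan separately from executing it makes it easy to review or log what a replication run would do. Tools such as zrepl and sanoid/syncoid additionally track which snapshots already exist on the target.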
ZFS vs Btrfs Comparison
| Feature | ZFS (OpenZFS) | Btrfs |
|---|---|---|
| Maturity | ~20 years, very stable | ~15 years, mostly stable |
| RAID5/6 | RAIDZ (no write hole, stable) | RAID5/6 (write hole, unstable) |
| RAID1/10 | Mirror/mirror+stripe (stable) | RAID1/10 (stable) |
| Native encryption | Yes (dataset-level, keystore) | No (use dm-crypt) |
| Deduplication | Yes (block-level, RAM-heavy) | Partial (file-level clones) |
| Linux integration | Out-of-tree module (CDDL) | In-tree (GPL) |
| Compression | lz4, gzip, zstd, lzjb | zlib, lzo, zstd |
| Send/receive | Stable, widely used | Stable |
| Memory management | Own ARC (no page cache) | Uses Linux page cache |
| Nested datasets | Yes (datasets/zvols) | Yes (subvolumes) |
| Online expansion | Add vdevs; widen RAIDZ via expansion | Add devices + balance |
| RAIDZ expansion | OpenZFS 2.3+ (online reflow) | N/A |
| Enterprise NAS use | TrueNAS, Nexenta, Oracle | SUSE and Fedora (default filesystem), Synology NAS |
ZFS Licensing: CDDL vs GPL Controversy
Sun open-sourced ZFS under the CDDL (Common Development and Distribution License) in 2005. The CDDL is an OSI-approved open-source license but is incompatible with the GPL v2. This means ZFS cannot be legally included in the Linux kernel mainline (which is GPL v2).
The legal status is disputed. Linus Torvalds has said he will not merge ZFS into the kernel without a clear relicensing statement from Oracle. Canonical ships a prebuilt ZFS kernel module with Ubuntu kernels (since 16.04), arguing that a separately loadable module is not a derived work of the kernel; the Software Freedom Conservancy disagrees. Debian distributes ZFS only in source form (the zfs-dkms package in contrib), so the module is compiled on the user's machine and the user, not the distributor, combines it with the kernel. Red Hat does not ship ZFS.
Oracle has not relicensed ZFS to GPL. The OpenZFS project (illumos, FreeBSD, macOS ZFS port, Linux ZFS) operates under the same CDDL license.
# On Ubuntu: ZFS is available out-of-box
apt install zfsutils-linux # userland tools; the kernel module ships with Ubuntu's kernel packages
# Check ZFS version
zfs --version
zpool --version
Historical Context
Jeff Bonwick began the ZFS design in 2001 at Sun, with the stated goal of "the last word in filesystems" — a design so scalable (zettabyte capacity) and correct (end-to-end integrity) that it would not need replacement for decades. Matt Ahrens was the primary developer alongside Bonwick.
ZFS source code was released through OpenSolaris in late 2005, and ZFS shipped in Solaris 10 6/06 (2006). After Oracle acquired Sun in 2010, OpenSolaris development was closed and the illumos fork continued the open code. The OpenZFS project was formed in 2013 to coordinate development across the illumos, FreeBSD, and Linux communities, with Matt Ahrens as a founding leader.
Apple experimented with ZFS for Mac OS X (2007-2009) but dropped it and later introduced APFS (2017), which shares several ZFS concepts (copy-on-write, snapshots, clones) with a different implementation.
The "ZFS on Linux" (ZoL) project, now "OpenZFS," provides a Linux kernel module. Ubuntu became the first major Linux distribution to officially support ZFS on Linux (16.04, 2016).
Production Examples
TrueNAS CORE/SCALE: iXsystems TrueNAS is the industry-standard NAS operating system, built on ZFS. Used in enterprise NAS deployments from 4-drive desktop units to petabyte-scale clustered storage. TrueNAS CORE (FreeBSD + ZFS) and TrueNAS SCALE (Linux + OpenZFS + k8s).
OmniOS / illumos: UNIX-derivative server OS with ZFS as the root filesystem and primary storage. Used in specialized high-reliability environments where ZFS's 20-year production track record is valued.
PostgreSQL on ZFS at Etsy: Etsy's infrastructure team wrote extensively about using ZFS for PostgreSQL storage. ZFS's synchronous write with SLOG (NVMe) provided sub-millisecond commit latency. ARC sizing required careful tuning to not compete with PostgreSQL's shared_buffers.
Debugging Notes
# Pool health and events
zpool status -v # verbose: shows recent errors per device
zpool history # audit log of all ZFS commands executed
zpool events # recent pool events (scrub, errors, resilver)
# Dataset statistics
zfs list -t all # all datasets, snapshots, volumes
zfs get all tank/data # all properties
# ARC performance
arc_summary # human-readable ARC stats
cat /proc/spl/kstat/zfs/arcstats # raw counters
# ZIL stats (important for sync write latency diagnosis)
grep zil /proc/spl/kstat/zfs/zil
# RAIDZ checksum error analysis
zpool status -v # shows per-device read/write/checksum errors
# Diagnose I/O latency per vdev
zpool iostat -vq tank 1
# Identify files affected by unrecoverable errors
# (in RAIDZ every file is spread over all disks, so there is no per-disk file list;
# zpool status -v prints the paths of files with permanent errors)
zpool status -v tank
# Memory tuning
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
echo "options zfs zfs_arc_min=2147483648" >> /etc/modprobe.d/zfs.conf
update-initramfs -u # Debian/Ubuntu: rebuild the initramfs so the options apply at boot
Security Implications
ZFS native encryption (OpenZFS 0.8+ on Linux, FreeBSD 13+):
# Create encrypted dataset
zfs create -o encryption=aes-256-gcm -o keylocation=prompt -o keyformat=passphrase \
tank/encrypted
# Load key (on mount)
zfs load-key tank/encrypted
zfs mount tank/encrypted
# Per-dataset encryption: different keys for different datasets
# A compromised host can read datasets whose keys are loaded, but not datasets whose keys are unloaded
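Key handling: ZFS keeps the wrapping keys of unlocked datasets in kernel memory. The keylocation property selects where a key comes from: prompt (typed interactively or supplied on standard input) or a file:// URI (for automated unlock). A key released by a hardware token can be fed to zfs load-key by a helper script. Dataset-level encryption allows different tenants to have separate, independent keys on a shared pool.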
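Pool export on disconnect: zpool export tank unmounts all datasets and unloads their keys from memory. After a later zpool import, encrypted datasets stay locked, even for root, until zfs load-key is run again. This is useful for removable pools and secure server shutdown. To lock a single dataset without exporting the pool, use zfs unmount followed by zfs unload-key.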
Performance Implications
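TxG commit interval: Default 5 seconds (the zfs_txg_timeout module parameter on Linux). All asynchronous writes in one TxG are committed together. Increasing it (e.g., 10-30 seconds) can improve throughput (larger TxGs allow better write aggregation and block allocation) at the cost of losing more un-synced data on power failure. Decreasing it (1-2 seconds) narrows that loss window for asynchronous writes. Synchronous writes are unaffected because the ZIL makes them durable immediately.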
zfs set sync=standard tank/data # uses TxG commit for async, ZIL for sync (default)
zfs set sync=disabled tank/data # all writes async (fastest, dangerous)
zfs set sync=always tank/data # all writes synchronous (safest, slowest)
Deduplication warning: ZFS block-level deduplication requires a dedup table (DDT), ideally held in RAM, at roughly 320 bytes per unique block. A pool with 10 TB of unique data in 4 KB blocks needs about 2.4 billion entries × 320 bytes ≈ 780 GB of DDT; with the default 128 KB recordsize the same data needs about 76 million entries ≈ 24 GB. Dedup with insufficient RAM causes severe performance degradation because DDT lookups go to disk on every write. Only enable dedup if the DDT fits in ARC.
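The DDT estimate is simple arithmetic and worth redoing for the actual block size of a pool. The sketch below uses the roughly 320 bytes per unique block quoted above as its default; the real per-entry cost varies by OpenZFS version, so treat the output as an order-of-magnitude figure.

```python
def ddt_ram_bytes(unique_data_bytes: int, block_size: int,
                  bytes_per_entry: int = 320) -> int:
    """Estimate DDT size: one entry per unique block."""
    entries = -(-unique_data_bytes // block_size)  # ceiling division
    return entries * bytes_per_entry


TB = 10 ** 12
GB = 10 ** 9

small_blocks = ddt_ram_bytes(10 * TB, 4 * 1024)      # 4 KB blocks
large_blocks = ddt_ram_bytes(10 * TB, 128 * 1024)    # 128 KB records

print(f"4 KB blocks:   {small_blocks / GB:.0f} GB")
print(f"128 KB blocks: {large_blocks / GB:.1f} GB")
```

For 10 TB of unique data this gives about 781 GB with 4 KB blocks and about 24.4 GB with 128 KB records. The block size dominates the result, which is why dedup on zvols or small-record datasets is so much more expensive than on large-record file storage.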
Failure Modes and Real Incidents
Silent data corruption at scale (CERN 2007): A CERN IT study presented by Peter Kelemen wrote known data patterns to thousands of storage nodes and read them back, and found corrupted blocks that neither the disks nor the RAID controllers had reported. Reported causes included firmware bugs and faulty memory and controller hardware. The study did not use ZFS, but it is widely cited as evidence for the class of error that end-to-end checksums catch: the corruption happens below the filesystem, so only a checksum verified above the whole I/O path can detect it.
ZFS ARC eating all RAM (common misconfiguration): On Linux, ZFS's ARC competes with the OS page cache and application memory. The ARC shrinks under memory pressure, but not always fast enough for applications that allocate in bursts, which can end in OOM kills. If zfs_arc_max is left at its default or raised without a plan, the ARC may hold memory the applications need. On a 256 GB RAM server: set zfs_arc_max so that at least 32 GB remains for the OS and applications.
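The zfs_arc_max value has to be given in bytes, which is a common source of typos. A small helper, sketched below with hypothetical function names, derives the value from total RAM and the amount to reserve, and formats the modprobe line used elsewhere in this document.

```python
GIB = 1024 ** 3


def arc_max_bytes(total_ram_gib: int, reserve_gib: int) -> int:
    """ARC cap in bytes that leaves reserve_gib for the OS and applications."""
    if reserve_gib >= total_ram_gib:
        raise ValueError("reserve must be smaller than total RAM")
    return (total_ram_gib - reserve_gib) * GIB


def modprobe_line(arc_max: int) -> str:
    """Line for /etc/modprobe.d/zfs.conf."""
    return f"options zfs zfs_arc_max={arc_max}"


# 256 GB server, 32 GB reserved for the OS and applications
limit = arc_max_bytes(256, 32)
line = modprobe_line(limit)
print(line)
```

For the 256 GB example this yields a 224 GiB cap. The same helper reproduces the 16 GiB value (17179869184) used in the tuning examples when called with a 16 GiB difference.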
SLOG device failure causing data loss: If the SLOG device is lost together with a crash or power failure while it holds ZIL records that are not yet part of a committed TxG, those writes are permanently lost. The application received "success" from fsync(), but the data never made it to the pool. (If only the SLOG fails and the system stays up, the next TxG commit still writes the data from memory.) Mitigation: mirror the SLOG (zpool add tank log mirror nvme0 nvme1) and use devices with power-loss protection.
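RAIDZ expansion taking days (OpenZFS 2.3 reflow): RAIDZ expansion (adding one disk to an existing RAIDZ vdev) triggers a full online reflow in which every allocated block is read and rewritten across the wider vdev. On a RAIDZ2 vdev holding 100 TB, even an optimistic 2 GB/s takes 50,000 seconds ≈ 14 hours; at a few hundred MB/s, which is more typical for HDDs under load, it takes several days. The pool stays online and keeps its redundancy throughout, but performance is reduced, and blocks written before the expansion keep their old data-to-parity ratio until they are rewritten. For very large pools, plan expansions for maintenance windows.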
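The duration estimate depends almost entirely on the sustained rate the vdev can deliver, so it helps to compute it for a few rates. This is a back-of-the-envelope sketch: it assumes the whole allocated capacity is copied once at a constant rate, and the 200 MB/s figure is an illustrative assumption rather than a measurement.

```python
def restripe_hours(allocated_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours needed to copy allocated_bytes once at a constant rate."""
    return allocated_bytes / rate_bytes_per_s / 3600


TB = 10 ** 12

optimistic = restripe_hours(100 * TB, 2e9)    # 2 GB/s, as in the text
pessimistic = restripe_hours(100 * TB, 200e6)  # 200 MB/s, assumed busy HDD vdev

print(f"at 2 GB/s:   {optimistic:.1f} hours")
print(f"at 200 MB/s: {pessimistic / 24:.1f} days")
```

The two results are about 13.9 hours and about 5.8 days, which shows why the same operation can be an overnight job on one pool and a week-long one on another.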
Modern Usage
- TrueNAS SCALE: OpenZFS on Linux with Kubernetes + containerized apps. Enterprise NAS from home labs to large deployments.
- Ubuntu ZFS on root: The Ubuntu desktop installer (20.04+) offers a ZFS-on-root option. Canonical ships the ZFS kernel module with its kernel packages.
- PostgreSQL + ZFS: ZFS is a common storage choice for PostgreSQL on FreeBSD and Linux. Tuning the recordsize property (set to 8k or 16k to match PostgreSQL's 8 KB page) and adding a SLOG reduce PostgreSQL write latency.
- ZFS on FreeBSD: ZFS is part of FreeBSD's base system (no licensing conflict). Used as the native boot filesystem on FreeBSD servers.
Future Directions
- dRAID production hardening: dRAID continues to mature in OpenZFS. Expected to become the recommended RAIDZ replacement for large pools.
- RAIDZ expansion stabilization: RAIDZ expansion (online reflow) shipped in OpenZFS 2.3. It enables growing RAIDZ vdevs without recreating the pool, a long-requested feature.
- Block cloning: OpenZFS 2.2+ supports block cloning at the ZFS level (zero-copy cp --reflink). Used for efficient VM disk cloning and for sharing identical blocks without a DDT.
- ZFS + CXL: Research on CXL-attached memory pools as ZFS L2ARC or DRAM-class ARC expansion.
Exercises
- Create a ZFS pool with RAIDZ2 across 5 loop device files. Write 10 GB of data. Simulate a drive failure by destroying one loop device file. Verify the pool remains online (degraded). Measure the scrub repair rate as it reconstructs the missing data.
- Benchmark synchronous write latency with and without a SLOG. Create a ZFS pool on HDDs. Benchmark fio --rw=write --sync=1 --bs=4k (simulating database commits). Add an NVMe SLOG. Re-benchmark. What is the latency improvement?
- Measure ARC efficiency. Create a dataset larger than ARC. Read the dataset twice. Use arc_summary or /proc/spl/kstat/zfs/arcstats to observe hit rate improvement on the second read. How does ARC compare to the page cache for sequential vs random access?
- Implement a ZFS replication setup using zfs send/receive. Create a source pool, take hourly snapshots for a simulated week (manually), and replicate to a destination pool incrementally. Measure total data transferred vs total data changed.
- Investigate ZFS's CDDL vs GPL licensing issue. Read the OpenZFS project's legal FAQ and Linus Torvalds's statements on the matter. Draft a 1-page policy decision: should a new Linux-based product use ZFS or Btrfs, and why? Consider: maturity, features, legal risk, support ecosystem.
References
- Megiddo, N. and Modha, D. S. "ARC: A Self-Tuning, Low Overhead Replacement Cache." USENIX FAST 2003.
- Bonwick, J. and Moore, B. "ZFS: The Last Word in File Systems." OpenSolaris presentation, 2004.
- Ahrens, M. et al. "OpenZFS." USENIX ATC 2013.
- OpenZFS documentation: https://openzfs.github.io/openzfs-docs/
- OpenZFS GitHub: https://github.com/openzfs/zfs
- TrueNAS ZFS documentation: https://www.truenas.com/docs/references/zfsprimer/
- CERN data integrity study: https://indico.cern.ch/event/13893/
- ZFS on Linux wiki: https://wiki.debian.org/ZFS
- Gregg, B. Systems Performance, 2nd ed., Chapter 8 (File Systems)
- man 8 zpool, man 8 zfs: primary reference documentation