06 - Btrfs

Technical Overview

Btrfs (B-tree filesystem, pronounced "butter FS" or "better FS"; the developers are deliberately casual about pronunciation) is a copy-on-write filesystem for Linux. It was designed as a successor to ext4 with a fundamentally different architecture: everything is built around B-trees, with native support for snapshots, RAID, checksums, and online operations. Unlike ext4, which evolved incrementally from ext2 and ext3, Btrfs was designed from scratch to address modern storage challenges: large capacities, multiple devices, silent data corruption, and the need for space-efficient backups.

Btrfs is the default filesystem on openSUSE, SUSE Linux Enterprise, and Fedora Workstation (since Fedora 33). Meta uses it at massive scale for their cold storage infrastructure.

Prerequisites

Copy-on-write filesystem concepts (see 04-copy-on-write-filesystems.md)
B-tree data structures
RAID concepts (see 12-storage-systems/06-raid.md)
Btrfs subvolume and snapshot model (introduced in 04-copy-on-write-filesystems.md)

Core Content

Btrfs Design Goals

Chris Mason (Oracle, later Facebook/Meta) started Btrfs development in 2007 with explicit goals:
- Writable snapshots with minimal overhead
- Data and metadata checksums (silent corruption detection)
- Built-in RAID (no mdadm needed)
- Online filesystem extension and balancing
- Efficient incremental backup (send/receive)
- Self-healing with RAID (checksum mismatch → read from mirror)
- Subvolumes (independent CoW namespaces, mountable separately)

B-Tree Structure Overview

Nearly all Btrfs metadata is stored in B-trees (the superblock and file data extents are the exceptions). All trees share the same node format:

Btrfs B-tree Node (typical 16KB node size):

+------------------+
| Header (101 bytes)|
|  - csum          |  32 bytes at the start; checksum of the rest of the node (CRC32C by default)
|  - fsid          |  filesystem UUID
|  - bytenr        |  logical address of this node
|  - flags         |  (written, reloc, ...)
|  - chunk_tree_uuid
|  - generation    |  transaction generation when written
|  - owner         |  tree identifier (FS_TREE, EXTENT_TREE, etc.)
|  - nritems       |  number of key-pointer or key-item pairs
|  - level         |  0 = leaf, >0 = internal node
+------------------+
| Items (if leaf):  |
|  [key0 | data0]  |  key=(objectid, type, offset), data=actual content
|  [key1 | data1]  |  (item headers grow from the front of the node,
|  [key2 | data2]  |   item data is packed from the end)
|  ...             |
+------------------+
| Pointers (if internal):
|  [key0 | child_block_ptr | generation]
|  [key1 | child_block_ptr | generation]
+------------------+
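
The header size and the resulting fan-out can be checked with a little arithmetic. The sketch below assumes the field sizes of the upstream struct btrfs_header (a 32-byte checksum followed by the fields listed in the diagram) and a 33-byte key pointer (17-byte key, 8-byte block pointer, 8-byte generation). The function names are made up for this example.

```shell
# Sum of the on-disk header fields, in order:
# csum fsid bytenr flags chunk_tree_uuid generation owner nritems level
header_size() {
  echo $((32 + 16 + 8 + 8 + 16 + 8 + 8 + 4 + 1))
}

# Maximum number of key pointers in an internal node of the given size.
# Each key pointer is a 17-byte key + 8-byte block pointer + 8-byte generation.
max_key_ptrs() {
  nodesize=$1
  echo $(( (nodesize - $(header_size)) / (17 + 8 + 8) ))
}

header_size          # prints 101
max_key_ptrs 16384   # prints 493
```

With roughly 500 children per internal node, a tree three levels deep can already address tens of millions of leaves, which is why Btrfs trees stay shallow.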

Btrfs Tree Hierarchy

Btrfs Tree Forest:

     Root Tree (tree id = 1)
       Maps tree id -> root block pointer for all other trees
       |
       +--- FS Tree (id=5, default subvolume)
       |     Contains: INODE_ITEMs, DIR_ITEMs, DIR_INDEXes, EXTENT_DATAs
       |     One per subvolume + one per snapshot
       |
       +--- Extent Tree (id=2)
       |     EXTENT_ITEMs: block address -> reference count (who owns this block)
       |     BLOCK_GROUP_ITEMs: allocation groups with free space
       |
       +--- Chunk Tree (id=3)
       |     CHUNK_ITEMs: logical address -> physical address + RAID stripe info
       |     DEV_ITEMs: list of devices in the filesystem
       |
       +--- Device Tree (id=4)
       |     DEV_EXTENTs: tracks which physical ranges of devices are allocated
       |
       +--- Checksum Tree (id=7)
       |     EXTENT_CSUM items: checksum for every data block
       |
       +--- Free Space Tree (id=10, replaces free space cache)
             FREE_SPACE_EXTENT and FREE_SPACE_BITMAP items

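Log Tree (objectid -6, created per subvolume when needed):
       Temporary tree that records fsync()ed changes so fsync() does not have to wait for a full transaction commit; replayed at mount after a crash (see Btrfs ordered writes)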
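
Within any of these trees, items are ordered by their (objectid, type, offset) key, compared field by field. A minimal sketch of that ordering follows, using sort on hypothetical keys written as three decimal numbers per line. The type values 1 (INODE_ITEM), 12 (INODE_REF) and 108 (EXTENT_DATA) are the upstream constants; the helper name and sample data are made up for this example.

```shell
# Sort "objectid type offset" triples the way Btrfs orders keys:
# numerically by objectid, then by type, then by offset.
sort_btrfs_keys() {
  sort -n -k1,1 -k2,2 -k3,3
}

# Hypothetical items for inodes 257 and 258, deliberately out of order.
printf '%s\n' \
  '258 1 0' \
  '257 108 4096' \
  '257 1 0' \
  '257 108 0' \
  '257 12 256' | sort_btrfs_keys
```

After sorting, everything belonging to inode 257 is contiguous (inode item, then its name reference, then its file extents in offset order), which is what lets a single tree search find all metadata for one file.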

The Chunk Tree is the indirection layer between logical addresses (what the filesystem uses) and physical addresses (actual disk offsets). RAID striping and mirroring are implemented here: one CHUNK_ITEM maps a range of logical addresses to one or more stripes, each on a specific device.
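
For the non-striped profiles (single, DUP, RAID1) the translation is a simple linear offset within the chunk. The sketch below illustrates only that case; striped profiles (RAID0/10/5/6) also need the stripe length and stripe index and are not covered. The function name and the sample addresses are hypothetical.

```shell
# Translate a logical address to a physical offset on one stripe of a chunk,
# assuming a linear (single, DUP or RAID1) profile.
# Arguments: chunk logical start, chunk length, stripe physical start, logical address.
logical_to_physical() {
  chunk_start=$1; chunk_len=$2; stripe_start=$3; logical=$4
  if [ "$logical" -lt "$chunk_start" ] || [ "$logical" -ge $((chunk_start + chunk_len)) ]; then
    echo "logical address $logical is outside this chunk" >&2
    return 1
  fi
  echo $((stripe_start + logical - chunk_start))
}

# Hypothetical 1 GiB chunk at logical 13631488, stored at device offset 22020096.
logical_to_physical 13631488 1073741824 22020096 13635584
```

For RAID1 the same calculation is repeated once per mirror, each with its own stripe start on a different device.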

Subvolumes

A subvolume is an independent filesystem tree within a Btrfs filesystem:

# Create subvolumes (recommended layout for system)
btrfs subvolume create /btrfs_pool/@           # root subvolume
btrfs subvolume create /btrfs_pool/@home       # home subvolume
btrfs subvolume create /btrfs_pool/@var_log    # logs subvolume
btrfs subvolume create /btrfs_pool/@snapshots  # snapshot container

# Mount specific subvolumes
mount -o subvol=@ /dev/sda1 /
mount -o subvol=@home /dev/sda1 /home
mount -o subvol=@var_log /dev/sda1 /var/log

# List all subvolumes
btrfs subvolume list /

# Get subvolume ID for a path
btrfs subvolume show /home

# Delete a subvolume and all files in it (nested subvolumes are not removed implicitly)
btrfs subvolume delete /old_subvolume

Subvolume ID 5 is the top-level subvolume (the filesystem root). The subvolid=5 mount option mounts the top-level subvolume, so every other subvolume is visible beneath it; subvol=@ mounts only the @ subvolume. The @-prefixed names are a convention used by openSUSE and Ubuntu; Fedora names its default subvolumes root and home.
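
Scripts often need the numeric ID behind a subvolume name, for example to mount by subvolid. A small sketch that extracts it from btrfs subvolume list output follows. It assumes the usual line format "ID <id> gen <gen> top level <id> path <path>"; the helper name is made up for this example.

```shell
# Print the subvolume ID for an exact subvolume path, reading
# `btrfs subvolume list` output on stdin. Expected line format:
#   ID 257 gen 12 top level 5 path @home
# Returns non-zero if the path is not listed.
subvol_id() {
  awk -v want="$1" '$1 == "ID" && $8 == "path" {
      path = $0
      sub(/^.* path /, "", path)      # keep everything after " path "
      if (path == want) { print $2; found = 1 }
  } END { exit !found }'
}
```

Typical use would be `btrfs subvolume list / | subvol_id @home`, followed by `mount -o subvolid=<id>` with the number that comes back.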

Snapshots

Snapshots in Btrfs are instant and space-efficient:

# Read-only snapshot (for backups — cannot be modified)
btrfs subvolume snapshot -r /home /snapshots/home-2024-01-01

# Writable snapshot (for testing or cloning)
btrfs subvolume snapshot /home /home_test

# Automated snapshot with snapper (default on openSUSE, packaged for Fedora)
snapper -c home create --description "before update"
snapper -c home list

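# Manual snapshot rotation (keep 7 daily)
SNAPDIR=/snapshots
DATE=$(date +%Y%m%d)
btrfs subvolume snapshot -r / "${SNAPDIR}/root-${DATE}"
# Delete all but the 7 newest. Select by name (YYYYMMDD sorts chronologically):
# a snapshot directory inherits its mtime from the source subvolume, so
# find -mtime does not reflect when the snapshot was taken.
ls -1d "${SNAPDIR}"/root-* | sort | head -n -7 | while read -r snap; do
  btrfs subvolume delete "$snap"
done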
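
The retention decision can be separated from the deletion so that it can be tested without touching a filesystem. The sketch below assumes snapshot names that sort chronologically (such as root-YYYYMMDD); the function name is made up for this example.

```shell
# Read snapshot names (one per line) on stdin and print the ones that fall
# outside the newest KEEP entries. Names must sort chronologically.
snapshots_to_prune() {
  keep=$1
  sort | awk -v keep="$keep" '
    { name[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print name[i] }'
}

# Intended use (not executed here):
#   ls -1 /snapshots | grep '^root-' | snapshots_to_prune 7 |
#     while read -r s; do btrfs subvolume delete "/snapshots/$s"; done
```

Running the selection on its own first is a cheap dry run: it shows exactly which snapshots would be removed.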

Snapshot space accounting: btrfs qgroup show / shows per-subvolume space usage. Without qgroups, du on a snapshot reports the full size (counts shared blocks). With qgroups, "exclusive" size shows only space unique to that snapshot:

# Enable qgroups (required for accurate snapshot accounting)
btrfs quota enable /
btrfs qgroup show -re --gbytes /
# qgroupid         rfer      excl    max_rfer    max_excl
# 0/5             10.5GiB   8.2GiB     none        none
# 0/256 (snap1)   10.5GiB   0.0GiB     none        none  ← no exclusive data
# 0/257 (snap2)   10.5GiB   1.2GiB     none        none  ← 1.2 GiB unique after changes

Btrfs send/receive for Incremental Backups

# Initial full backup
btrfs subvolume snapshot -r /home /home/.snap-base
btrfs send /home/.snap-base | ssh backup "btrfs receive /backup/"

# Incremental backup (only changes since snap-base)
SNAP=/home/.snap-$(date +%Y%m%d)   # evaluate the date once so both commands agree
btrfs subvolume snapshot -r /home "$SNAP"
btrfs send -p /home/.snap-base "$SNAP" \
  | ssh backup "btrfs receive /backup/"

# The send stream encodes:
# - New or modified extents (cloned from parent where possible)
# - Deleted files/directories
# - Metadata changes (permissions, timestamps)
# Wire format: binary stream with opcode+length+data records

btrfs send is the basis of snapshot-based backup tools such as btrbk. An incremental stream carries only the differences between the parent and the new snapshot, so it can be far smaller than a full send when little has changed (metadata updates, small writes).
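
An incremental send only works if the parent snapshot exists on both sides, so a backup script first has to find the newest snapshot the source and the destination have in common. A minimal sketch follows, assuming each side's snapshot names have been written to a file, one per line, and that names sort chronologically. The function name is hypothetical.

```shell
# Print the newest snapshot name present in both lists.
# $1 and $2 are files with one snapshot name per line.
# Returns non-zero if the lists have nothing in common (a full send is needed).
latest_common_snapshot() {
  awk 'NR == FNR { have[$0] = 1; next }                       # first file: remember names
       ($0 in have) && (($0 "") > (best "")) { best = $0 }    # second file: track the maximum
       END { if (best == "") exit 1; print best }' "$1" "$2"
}
```

The result would be passed to btrfs send -p as the parent; a non-zero status signals that the script must fall back to a full send.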

Btrfs Scrub

btrfs scrub reads every data and metadata block on all devices, verifies checksums, and repairs errors where possible (using RAID redundancy):

# Start scrub (runs in background)
btrfs scrub start /

# Monitor progress
btrfs scrub status /

# Cancel if causing too much I/O impact
btrfs scrub cancel /

# Example output after completion:
btrfs scrub status /
# scrub status for ebc9a3bd-...
# scrub started at Tue Jan  1 00:00:00 2024 and finished after 02:34:15
# total bytes scrubbed: 1.23TiB with 0 errors
# (2 read errors corrected from mirror)

Schedule weekly scrub for data integrity (cron or systemd timer):

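# systemd: /etc/systemd/system/btrfs-scrub.service
[Unit]
Description=Btrfs scrub

[Service]
Type=oneshot
ExecStart=btrfs scrub start -B /

# systemd: /etc/systemd/system/btrfs-scrub.timer
[Timer]
OnCalendar=weekly

[Install]
WantedBy=timers.target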

Btrfs Balance

Balance redistributes data and metadata chunks across all devices in the pool. Used to:
- Redistribute data after adding a new device
- Convert RAID levels (e.g., change RAID1 to RAID10 after adding drives)
- Reclaim space from over-allocated metadata chunks

# Check balance need
btrfs device usage /
btrfs fi df /

# Start a full balance (very I/O intensive; run off-peak)
btrfs balance start --full-balance /

# Filter: only rebalance chunks that are less than 80% full
btrfs balance start -dusage=80 -musage=80 /

# Convert metadata from single to RAID1 (after adding second device)
btrfs balance start -mconvert=raid1 /

# Convert data to RAID1
btrfs balance start -dconvert=raid1 /

# Run in the background (balance has no built-in rate limiting)
btrfs balance start --bg --full-balance /

# Check balance progress
btrfs balance status /
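
The usage filter is easy to misread: usage=N selects chunks that are filled below N percent, because those are the cheap ones to relocate and free. The sketch below mimics that selection on hypothetical "chunk_id used_percent" pairs. It ignores the special case usage=0, which selects only empty chunks, and the function name is made up for this example.

```shell
# Read "chunk_id used_percent" lines on stdin and print the chunks that a
# usage=LIMIT balance filter would relocate: those filled below LIMIT percent.
chunks_selected_by_usage() {
  awk -v limit="$1" '$2 + 0 < limit + 0 { print $1 }'
}

# Hypothetical chunk fill levels.
printf '%s\n' 'chunk-a 10' 'chunk-b 79' 'chunk-c 80' 'chunk-d 95' |
  chunks_selected_by_usage 80
```

Starting with a low limit such as 10 or 20 and raising it step by step keeps each balance run short.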

Btrfs Compression

Btrfs supports transparent compression: zlib, LZO, ZSTD (default recommendation):

# Mount with compression
mount -o compress=zstd /dev/sda1 /data

# Compress new files written under a specific directory
btrfs property set /data/logs compression zstd

# Check the compression ratio of a file or directory tree
compsize /data/logs  # requires the compsize package

# Example output:
# Type       Perc     Disk Usage   Uncompressed Referenced
# TOTAL       39%       3.1GiB        7.8GiB       7.8GiB
# none       100%       2.0GiB        2.0GiB       2.0GiB
# zstd       19%       1.1GiB        5.8GiB       5.8GiB

ZSTD compression is recommended for most workloads: faster than zlib at similar ratios, better than LZO for compressible data.
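
The Perc column in compsize-style reports is disk usage as a percentage of the uncompressed size, so lower means better compression. A minimal sketch of that arithmetic, using integer division and hypothetical byte counts (the function name is made up for this example):

```shell
# Print disk usage as an integer percentage of the uncompressed size.
# Arguments: bytes on disk, uncompressed bytes. Lower is better.
compression_pct() {
  disk=$1; uncompressed=$2
  if [ "$uncompressed" -eq 0 ]; then
    echo "n/a"
    return 1
  fi
  echo $(( disk * 100 / uncompressed ))
}

compression_pct 268435456 1073741824   # 256 MiB on disk for 1 GiB of data
```

Uncompressed extents always come out at 100, which is a quick sanity check for any report.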

Btrfs Limitations and Known Issues (2024)

Feature	Status	Notes
RAID 1, 10	Stable	Production ready
RAID 5/6	UNSAFE	Write hole, parity bugs. Do NOT use in production
Scrub	Stable	Essential for data integrity
send/receive	Stable	Minor edge cases with certain ioctls
Compression	Stable	ZSTD preferred
Encryption	No	No native encryption; use dm-crypt beneath
Online balance	Stable	Can be slow; I/O intensive
Subvolumes	Stable	Core feature, widely used
qgroups	Buggy	Performance issues with large numbers of snapshots
NOCOW	Stable	Recommended for databases and VM images; disables checksums and compression for those files
Large directories	OK	Slower than ext4 for very large dirs (>1M entries)

# Comprehensive Btrfs health commands
btrfs check /dev/sda1         # read-only by default; run on an unmounted filesystem
btrfs check --readonly /dev/sda1  # explicit read-only

# WARNING: never run btrfs check --repair without developer guidance
# It has caused more data loss than it has prevented

# Get filesystem statistics and error counts
btrfs device stats /
# [/dev/sda].write_io_errs    0
# [/dev/sda].read_io_errs     0
# [/dev/sda].flush_io_errs    0
# [/dev/sda].corruption_errs  0  ← checksum failures
# [/dev/sda].generation_errs  0  ← tree generation mismatch errors
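
These counters are cumulative, so monitoring usually means alerting on any value that is not zero. A small sketch of such a filter follows, reading the "[device].counter value" lines shown above on stdin. The function name is made up for this example.

```shell
# Read `btrfs device stats` output on stdin, print every counter that is
# not zero, and return a non-zero status if at least one was found.
nonzero_device_stats() {
  awk '$2 + 0 != 0 { print; bad = 1 } END { exit bad + 0 }'
}

# Intended use (not executed here):
#   btrfs device stats / | nonzero_device_stats || echo "device errors recorded"
```

The exit status makes the function usable directly in a cron job or a systemd unit that should fail when errors appear.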

Production: Meta's Btrfs Usage

Meta has used Btrfs on their cold storage servers since approximately 2014. Their primary benefits:
- Compression: zstd compression on cold data blocks reduces storage costs by 20-40%
- Snapshots: efficient rolling snapshots of large cold datasets
- NOCOW for hot data: metadata and frequently-updated index files use chattr +C
- Custom kernel patches: Meta maintains private Btrfs patches for their specific workloads (object storage access patterns)

Meta has contributed significantly to Btrfs upstream development, including many of the stability fixes in the 4.x-6.x kernel era.

Historical Context

Chris Mason began Btrfs development at Oracle in 2007, inspired by the work of Ohad Rodeh ("B-Trees, Shadowing, and Clones," ACM Transactions on Storage, 2008), which showed how copy-on-write B-trees could support efficient snapshots and clones. The first in-kernel version appeared in Linux 2.6.29 (2009) as a beta-quality implementation.

The development history is marked by long periods of "not quite production-ready": RAID5/6 was marked experimental for years, and various fsck and corruption bugs appeared in the 3.x kernel era. Stability improved markedly during the 4.x series (roughly 2016-2018), as the free space tree and better error handling landed.

Red Hat's 2017 decision to deprecate Btrfs in RHEL 7.4, where it had only been a Technology Preview, and to omit it from RHEL 8 in favor of XFS was a significant setback. It drove enterprise users toward XFS for new deployments.

However, SUSE, openSUSE, and Fedora maintained their commitment to Btrfs. Fedora's adoption of Btrfs as the desktop default in Fedora 33 (2020), with a simple two-subvolume layout (root and home), was a major legitimizing step.

Production Examples

Fedora system layout: Fedora's default Btrfs layout creates root and home subvolumes on a single filesystem. Fedora does not take snapshots automatically, but with snapper and its dnf plugin installed, snapshots are recorded around each dnf transaction. If an update breaks the system, rolling back to the pre-update snapshot restores the previous state in seconds.

SUSE Linux Enterprise: SUSE uses Btrfs with snapper for system snapshots, taking automatic pre/post snapshots around system updates and offering rollback to any of them. This is standard in SUSE Enterprise deployments and is a key differentiating feature.

Nextcloud on Btrfs: Nextcloud (self-hosted Google Drive alternative) benefits from Btrfs compression and snapshots for storage efficiency. The gain depends heavily on the data mix: office documents and text compress well, while already-compressed photos and video barely shrink, so measure with compsize rather than assuming a fixed ratio.

Debugging Notes

# Full diagnostic dump
btrfs check --readonly /dev/sda1 2>&1 | head -50

# Show tree keys (extremely verbose, useful for low-level debugging)
btrfs inspect-internal dump-tree /dev/sda1 > /tmp/btrfs_dump.txt

# Find which files reference a logical address (add -P to print inode numbers instead of paths)
btrfs inspect-internal logical-resolve "$LOGICAL_ADDR" /

# Inode to path resolution
btrfs inspect-internal inode-resolve $INODE_NUM /

# Show chunk tree (RAID stripe layout)
btrfs inspect-internal dump-tree -t chunk /dev/sda1

# Monitor I/O stats
btrfs device stats /
btrfs fi show

# Find space usage anomalies
btrfs fi du -s /          # recursive disk usage (CoW-aware)
# Reports: "total   exclusive  set shared  filename"
# total: all blocks referenced by this path (may include shared snapshot blocks)
# exclusive: blocks ONLY referenced by this path

# Look for orphan items in the check output (usually cleaned up on mount)
btrfs check --readonly /dev/sda1 2>&1 | grep -i orphan

Security Implications

No native encryption: Btrfs has no built-in encryption. Use dm-crypt (LUKS) beneath Btrfs for encrypted storage. The combination LUKS → Btrfs provides both encryption and checksums/snapshots with a slight performance cost.

# Encrypted Btrfs
cryptsetup luksFormat /dev/sda1
cryptsetup open /dev/sda1 encrypted_btrfs
mkfs.btrfs /dev/mapper/encrypted_btrfs
mount /dev/mapper/encrypted_btrfs /data

Snapshot data retention: See 04-copy-on-write-filesystems.md. Deleted files remain in snapshots. Audit all snapshots before decommissioning storage containing sensitive data.

Subvolume send stream trust: btrfs receive can apply arbitrary operations to the target filesystem (create files, set permissions, etc.). Never receive from an untrusted source without sandboxing.

Performance Implications

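NOCOW for databases: As emphasized in 04-copy-on-write-filesystems.md, databases (MySQL, PostgreSQL, SQLite) on Btrfs without NOCOW can suffer heavy fragmentation and write amplification from small random overwrites. Set chattr +C on database directories before any data files exist, because the attribute only takes effect on new or empty files. The trade-off is that NOCOW files lose checksumming and compression: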

# Before creating the database directory:
mkdir /var/lib/mysql
chattr +C /var/lib/mysql
# Then initialize the database normally

Compression and CPU: ZSTD compression adds CPU overhead. On modern CPUs (Skylake+), ZSTD at level 1 (compress=zstd:1; plain compress=zstd uses level 3) decompresses at roughly 1-3 GB/s and compresses at roughly 300-500 MB/s, which in most cases is fast enough not to bottleneck sequential I/O. For CPU-constrained systems, use compress=lzo (faster) or no compression.

Large snapshot trees and fragmentation: After months of daily snapshots with file modifications, files can become heavily fragmented (many small extents from CoW updates), and performance degrades gradually. Defragmentation (btrfs fi defragment -r) restores contiguity, but it breaks extent sharing with snapshots and reflinked copies, so space usage can grow sharply. Balance (btrfs balance start -dusage=80) compacts partially filled chunks; it does not defragment files.

Failure Modes and Real Incidents

Btrfs RAID5 data loss (multiple user reports 2016-2020): The Btrfs RAID5 implementation has a known write hole: if power fails during a partial stripe write, the parity is inconsistent and recovery can produce incorrect data. Multiple users reported confirmed data loss. The Btrfs developers have marked RAID5/6 as "known-broken" since 2015. As of Linux 6.7, there are in-progress patches but the feature is still marked experimental and discouraged.

Btrfs corruption after kernel upgrades: Several 5.x kernel releases, including some stable updates, introduced Btrfs regressions that could corrupt metadata trees. These were typically subtle races that only manifested under specific workload patterns, sometimes combined with power loss. Fixes were usually backported within weeks, but users on kernels that lagged behind stable updates could remain exposed for months.

btrfs check --repair caused data loss (multiple reports): The --repair option of btrfs check has a history of incorrectly "fixing" structures that were not actually broken, causing permanent data loss. The Btrfs documentation itself warns against running it without guidance, and the tool was essentially untrusted for repair purposes for several years. The btrfs rescue subcommands in btrfs-progs and the rescue= mount options (for read-only recovery) are safer first steps.

ENOSPC on unbalanced metadata: A filesystem at 90% of listed capacity can show "no space left on device" if metadata chunks are full while data chunks still have room. The Btrfs online free space tree (kernel 4.9+) reduced this problem, but it still occurs on heavily-fragmented systems. Monitor with btrfs fi df / and proactively balance when metadata used approaches total.
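
The metadata check described above can be automated. The sketch below parses the raw-byte output of btrfs filesystem df -b, assuming lines of the form "Metadata, DUP: total=1073741824, used=147456". If several Metadata lines are present (for example during a profile conversion), it reports only the last one. The function name is made up for this example.

```shell
# Read `btrfs filesystem df -b` output on stdin and print the percentage of
# allocated metadata space that is in use (integer, rounded down).
# Returns non-zero if no Metadata line with a non-zero total is found.
metadata_used_pct() {
  awk -F'[=,]' '
    /^Metadata/ { total = $3; used = $5 }   # fields: name, " PROFILE: total", N, " used", N
    END { if (total + 0 == 0) exit 1; printf "%d\n", used * 100 / total }'
}

# Intended use (not executed here):
#   pct=$(btrfs filesystem df -b / | metadata_used_pct)
#   [ "$pct" -ge 90 ] && echo "metadata nearly full: consider a filtered balance"
```

The 90 percent threshold in the usage comment is an arbitrary example value, not a documented limit.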

Modern Usage

Fedora default: Fedora Workstation uses Btrfs by default, with ZSTD compression enabled since Fedora 34; snapper can be added for system rollback but is not set up by default
SUSE Linux Enterprise and openSUSE: Btrfs is the default for system (not data) volumes, with snapper integration
Docker btrfs driver: Less common than overlayfs but used in some enterprise configurations for efficient container layer management
NAS (TrueNAS SCALE): TrueNAS SCALE uses OpenZFS (not Btrfs) but the Btrfs vs ZFS comparison is relevant for NAS evaluation

Future Directions

RAID5/6 fix: The Btrfs developers are working on a complete RAID5/6 rewrite using a per-stripe journal to eliminate the write hole. Expected in 2024-2025 upstream.
Extent tree v2: A new format for the extent tree to reduce its size and improve performance for filesystems with many millions of extents.
Kernel I/O passthrough for NVMe ZNS: Research integration of Btrfs with NVMe ZNS to map CoW extents to zones, reducing write amplification by aligning CoW allocation with zone boundaries.
subvolume encryption (long-term): Native per-subvolume encryption is on the roadmap but has no confirmed timeline.

Exercises

Create a Btrfs filesystem on two loop devices with RAID1. Write 100 MB of data. Simulate a drive failure by dd-corrupting one device. Verify Btrfs detects the error during scrub and serves data from the mirror.
Implement a daily snapshot rotation script: create a timestamped read-only snapshot, keep 7 daily snapshots, and delete the oldest. Verify with btrfs subvolume list that exactly 7 snapshots exist after 10 days.
Measure ZSTD compression ratio on real-world data. Enable compress=zstd on a Btrfs filesystem and copy your home directory to it. Use compsize /path to measure compression ratio. What types of files compress best?
Set up an incremental backup workflow using btrfs send/receive. Create a source Btrfs pool with daily snapshots. Perform a full send to a backup destination, then 5 incremental sends. Verify the backup destination can mount the latest snapshot and reads correctly.
Investigate the RAID5 write hole. Read the current status in the Btrfs wiki and fs/btrfs/raid56.c. Identify where the parity update and data update operations can be non-atomic. What would a journal for this look like? Compare to ZFS RAIDZ's approach (variable stripe width CoW).

References

Rodeh, O., Bacik, J., and Mason, C. "BTRFS: The Linux B-Tree Filesystem." ACM Transactions on Storage 9(3), 2013.
Rodeh, O. "B-Trees, Shadowing, and Clones." ACM Transactions on Storage 3(4), 2008.
Btrfs wiki: https://btrfs.readthedocs.io/ and https://btrfs.wiki.kernel.org/
Linux kernel source: fs/btrfs/
Btrfs status page: https://btrfs.readthedocs.io/en/latest/Status.html
Meta Btrfs usage: Watanabe, N. et al. "Usage of Btrfs at Facebook scale." (talk at Linux Storage and Filesystems Conference 2018)
Snapper documentation: https://en.opensuse.org/openSUSE:Snapper_Tutorial
compsize tool: https://github.com/kilobyte/compsize