06 - RAID

Technical Overview

RAID (Redundant Array of Inexpensive/Independent Disks) is a data storage virtualization technology that combines multiple physical disks into one logical unit for redundancy, performance, or both. The core RAID levels were first described by Patterson, Gibson, and Katz at UC Berkeley in 1988. Today, RAID concepts underpin everything from desktop NAS devices to the erasure coding systems in hyperscale data centers.

RAID has two fundamental implementations: hardware RAID (a dedicated controller with its own processor and battery-backed write cache, which presents the array to the OS as a single opaque device) and software RAID (managed by the OS kernel on the host CPU). Modern CPUs make software RAID competitive or superior for most workloads.

Prerequisites

Block device concepts (sector, LBA)
XOR bit arithmetic (critical for parity calculations)
Linux block layer basics (see 05-linux-block-layer.md)
Basic probability (for rebuild risk calculations)

Core Content

RAID 0: Striping

RAID 0 (Striping, No Redundancy)

Disk 0      Disk 1      Disk 2      Disk 3
+------+    +------+    +------+    +------+
|  A0  |    |  A1  |    |  A2  |    |  A3  |   <- Stripe 0 (chunk A)
+------+    +------+    +------+    +------+
|  B0  |    |  B1  |    |  B2  |    |  B3  |   <- Stripe 1 (chunk B)
+------+    +------+    +------+    +------+

Capacity: N × disk_size
Throughput: N × disk_throughput (sequential), N × IOPS (random, uncorrelated)
Redundancy: None — any single disk failure loses all data
Write penalty: None
Use case: Scratch space, temporary data, applications where performance matters and data is expendable

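A 4-disk RAID 0 of PCIe 4.0 NVMe SSDs (about 7 GB/s each) can approach 4 × 7 GB/s = 28 GB/s sequential, which is close to the usable bandwidth of a PCIe 4.0 x16 link (about 32 GB/s). At that point the host's PCIe lanes, not the drives, tend to be the limit.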
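The striping layout in the diagram can be sketched as a simple address mapping. This is a minimal illustration, assuming chunk-granular addressing and a hypothetical helper name (raid0_locate); it is not how any particular driver is written.

```python
def raid0_locate(logical_chunk, n_disks):
    """Map a logical chunk number to (disk index, stripe index) under plain striping."""
    disk = logical_chunk % n_disks      # chunks rotate across the disks
    stripe = logical_chunk // n_disks   # one stripe is filled per full rotation
    return disk, stripe


# Chunks 0-3 fill stripe 0 (A0..A3 in the diagram), chunks 4-7 fill stripe 1 (B0..B3).
layout = [raid0_locate(c, 4) for c in range(8)]
for chunk, (disk, stripe) in enumerate(layout):
    print(f"chunk {chunk} -> disk {disk}, stripe {stripe}")
```

Because consecutive chunks land on different disks, a large sequential transfer keeps all N disks busy at once, which is where the N × throughput figure comes from.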

RAID 1: Mirroring

RAID 1 (Mirroring)

Disk 0 (primary)    Disk 1 (mirror)
+----------------+  +----------------+
|      A         |  |      A         |
+----------------+  +----------------+
|      B         |  |      B         |
+----------------+  +----------------+

Capacity: 1 × disk_size (regardless of number of mirrors)
Read throughput: Can read from either disk — 2× IOPS for random reads if load-balanced across both
Write throughput: Must write to all mirrors — limited to slowest disk. Writes happen in parallel so latency is max(d0_write_latency, d1_write_latency)
Redundancy: Survives any (N-1) disk failures (all mirrors except one)
Write penalty: No parity calculation, but bandwidth consumed on all mirrors

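RAID 1 with three disks (a triple mirror) keeps three copies of every block, the single-host analogue of the 3× replication used by distributed file systems such as HDFS. It should not be confused with RAID 1E, which stripes mirrored chunks across an odd number of disks.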

RAID 5: Distributed Parity

RAID 5 (4 disks shown, distributed parity)

Disk 0   Disk 1   Disk 2   Disk 3
+------+  +------+  +------+  +------+
|  A0  |  |  A1  |  |  A2  |  | AP   |   <- AP = A0 XOR A1 XOR A2
+------+  +------+  +------+  +------+
|  B0  |  |  B1  |  | BP   |  |  B3  |   <- BP = B0 XOR B1 XOR B3
+------+  +------+  +------+  +------+
|  C0  |  | CP   |  |  C2  |  |  C3  |   <- CP = C0 XOR C2 XOR C3
+------+  +------+  +------+  +------+
| DP   |  |  D1  |  |  D2  |  |  D3  |   <- DP = D1 XOR D2 XOR D3
+------+  +------+  +------+  +------+

Parity rotates across disks each stripe (distributed)

Capacity: (N-1) × disk_size
Read throughput: (N-1) × disk_throughput (parity stripe not useful for reads)
Write throughput: See write penalty below
Redundancy: Survives exactly 1 disk failure
Minimum disks: 3
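The parity relation AP = A0 XOR A1 XOR A2, and the recovery of a lost chunk from the survivors, can be sketched in a few lines. The chunk contents and the helper name xor_blocks are illustrative assumptions.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data chunks of one stripe (arbitrary example contents).
a0 = b"\x11\x22\x33\x44"
a1 = b"\xaa\xbb\xcc\xdd"
a2 = b"\x0f\x0e\x0d\x0c"
ap = xor_blocks(a0, a1, a2)          # parity chunk

# Disk 1 is lost: XOR the surviving data chunks with the parity to rebuild A1.
recovered_a1 = xor_blocks(a0, a2, ap)
print(recovered_a1 == a1)
```

XOR is its own inverse, so the same routine serves both for computing parity and for reconstructing any single missing chunk.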

RAID 5 Write Penalty (Read-Modify-Write):

To update a single stripe chunk (e.g., update A0):
1. Read old A0 (to compute delta)
2. Read old parity AP (to update parity)
3. Compute new_AP = AP XOR old_A0 XOR new_A0
4. Write new A0
5. Write new AP

4 I/Os for every logical write = write penalty of 4×. For small random writes on a RAID 5 with HDDs, this is catastrophic — 4 seeks per write.
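The read-modify-write steps can be checked with a short sketch: the parity produced from only the old chunk, the old parity, and the new chunk matches a full recomputation over the stripe. Names such as rmw_update and the chunk values are assumptions made for illustration.

```python
def xor_bytes(x, y):
    """Byte-wise XOR of two equally sized chunks."""
    return bytes(a ^ b for a, b in zip(x, y))


def rmw_update(old_data, old_parity, new_data):
    """new_AP = AP XOR old_A0 XOR new_A0, touching no other data chunk."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


a0, a1, a2 = b"\x10\x20", b"\x03\x04", b"\x50\x60"
ap = xor_bytes(xor_bytes(a0, a1), a2)

new_a0 = b"\xff\x00"
new_ap = rmw_update(a0, ap, new_a0)                        # 2 reads + 2 writes on disk
full_recompute = xor_bytes(xor_bytes(new_a0, a1), a2)      # would need A1 and A2 as well
print(new_ap == full_recompute)
```

The shortcut avoids reading the untouched chunks A1 and A2, which is why the cost stays at four I/Os regardless of array width.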

Mitigation: RAID 5 with a write-back cache protected by a BBU (Battery Backup Unit) absorbs random writes in DRAM and flushes them as larger, more sequential writes. Without such a cache, RAID 5 is poorly suited to write-intensive workloads on HDDs.

RAID 6: Double Parity

RAID 6 (two independent parity calculations per stripe)

Disk 0   Disk 1   Disk 2   Disk 3   Disk 4
+------+  +------+  +------+  +------+  +------+
|  A0  |  |  A1  |  |  A2  |  | AP   |  | AQ   |
+------+  +------+  +------+  +------+  +------+

AP = XOR parity (simple)
AQ = Reed-Solomon / Galois Field parity (a second, independent equation, so two missing chunks can be solved for)

Capacity: (N-2) × disk_size
Redundancy: Survives up to 2 disk failures
Write penalty: 6× (6 I/Os for a small random write: read old data, old P, and old Q; write new data, new P, and new Q)
Minimum disks: 4
Use case: Large HDD arrays where double failure during rebuild is a real risk

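Q is computed with GF(2^8) (Galois Field) arithmetic. In RAID the positions of the failed disks are already known, so recovery means solving for missing values, not locating them. P (simple XOR) supplies one equation and can therefore rebuild one missing chunk per stripe. Q weights each data chunk by a distinct power of a field generator, which supplies a second independent equation; together P and Q form a solvable system in two unknowns when two disks are lost.

In Linux, RAID 4/5/6 is handled by drivers/md/raid5.c, while the P/Q syndrome and recovery routines (for example raid6_2data_recov() and raid6_datap_recov()) live in lib/raid6/, with SIMD-accelerated variants (SSE2, AVX2, AVX-512) chosen by a benchmark when the module loads.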
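A byte-at-a-time sketch of P/Q generation and two-disk recovery follows. It assumes the field polynomial 0x11d with generator 2 (the convention used by the Linux raid6 library) and uses hypothetical helper names; real implementations use lookup tables and SIMD instead of the slow loops shown here.

```python
def gf_mul(a, b):
    """Multiply two elements of GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1 (0x11d)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return r


def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r


def gf_inv(a):
    """a^254 is the inverse of a non-zero a, because a^255 == 1."""
    return gf_pow(a, 254)


def syndromes(data):
    """Return (P, Q) for a list of equal-length chunks, one per data disk."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for i, chunk in enumerate(data):
        coeff = gf_pow(2, i)                 # g^i weights disk i in Q
        for j, byte in enumerate(chunk):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def recover_two(data, x, y, p, q):
    """Rebuild data chunks x and y (positions known) from the survivors plus P and Q."""
    n = len(p)
    survivors = [bytes(n) if i in (x, y) else c for i, c in enumerate(data)]
    pxy, qxy = syndromes(survivors)
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    inv = gf_inv(gx ^ gy)
    dx, dy = bytearray(n), bytearray(n)
    for j in range(n):
        pd = p[j] ^ pxy[j]                   # Dx + Dy
        qd = q[j] ^ qxy[j]                   # g^x*Dx + g^y*Dy
        dx[j] = gf_mul(inv, gf_mul(gy, pd) ^ qd)
        dy[j] = pd ^ dx[j]
    return bytes(dx), bytes(dy)


data = [b"RAID", b"six!", b"demo"]
p, q = syndromes(data)
damaged = [None, data[1], None]              # disks 0 and 2 have failed
print(recover_two(damaged, 0, 2, p, q) == (data[0], data[2]))
```

The two lines marked Dx + Dy and g^x*Dx + g^y*Dy are the two equations that P and Q contribute; the rest is ordinary elimination carried out in the finite field.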

RAID 10: Stripe of Mirrors

RAID 10 (1+0: mirror first, then stripe)

Mirror 0              Mirror 1
+------+ +------+    +------+ +------+
| Disk0| |Disk1 |    | Disk2| |Disk3 |
|  A0  | |  A0  |    |  A1  | |  A1  |  <- A striped across mirrors
|  B0  | |  B0  |    |  B1  | |  B1  |
+------+ +------+    +------+ +------+

Capacity: N/2 × disk_size
Read throughput: N × IOPS (reads from any mirror)
Write throughput: N/2 × IOPS (writes to both disks in each mirror)
Redundancy: Can survive losing one disk per mirror group (minimum 1 failure guaranteed, up to N/2 failures if distributed one per mirror)
Write penalty: 2× (write to both mirrors, no parity recalculation)
Minimum disks: 4
Use case: High I/O databases (MySQL, PostgreSQL production), latency-sensitive workloads

RAID 10 combines good write performance (no parity overhead), good read performance, and reasonable redundancy. It is a common default choice for database storage.
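The redundancy statement (one failure always survivable, more only if they fall in different mirror groups) can be made concrete by enumerating failure combinations. This sketch assumes disks 2k and 2k+1 form mirror pair k, as in the diagram; the function names are illustrative.

```python
from itertools import combinations


def raid10_survives(failed, n_disks):
    """The array survives unless both disks of some mirror pair have failed."""
    return all(not {2 * k, 2 * k + 1} <= failed for k in range(n_disks // 2))


def survivable_fraction(n_disks, n_failures):
    """Fraction of all n_failures-disk combinations that leave the array readable."""
    combos = list(combinations(range(n_disks), n_failures))
    ok = sum(raid10_survives(set(c), n_disks) for c in combos)
    return ok / len(combos)


for failures in (1, 2, 3):
    print(failures, "failed:", round(survivable_fraction(4, failures), 3))
```

For 4 disks, every single failure is survivable, two of the six possible double failures hit the same pair and are fatal, and any triple failure necessarily takes out a whole pair.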

RAID 5 Write Hole Problem

The RAID 5 write hole is a data corruption risk during power failure. A stripe write is non-atomic:

Stripe write in progress:
Step 1: Write new_A0 to disk 0  [DONE]
Step 2: Write new_AP to disk 3  [NOT DONE — power failure here!]

State after power-on:
Disk 0 has: new_A0  (new data)
Disk 3 has: old_AP  (old parity, computed from old_A0)

Inconsistency: new_A0 XOR A1 XOR A2 ≠ old_AP
               (parity is "wrong" for the new data)

If disk 1 fails now and we attempt to recover A1:
A1_recovered = new_A0 XOR A2 XOR old_AP = GARBAGE

The write hole means that after a power failure during a RAID 5 write, the affected stripe is inconsistent: its parity no longer matches its data. Reads still succeed while all disks are present, but if a disk fails before the parity is resynchronized, the stale parity is used to reconstruct data → silent data corruption.
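The scenario above can be reproduced in memory. This is a toy model with made-up chunk contents and a dictionary standing in for the disks; it only illustrates why stale parity yields wrong data.

```python
def xor_bytes(*chunks):
    """Byte-wise XOR of any number of equally sized chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)


old_a0, a1, a2 = b"old0", b"dat1", b"dat2"
old_ap = xor_bytes(old_a0, a1, a2)

# Power fails after the data write reached disk 0 but before parity reached disk 3.
new_a0 = b"new0"
on_disk = {"disk0": new_a0, "disk1": a1, "disk2": a2, "disk3": old_ap}

# Disk 1 later fails; reconstruction trusts the stale parity.
rebuilt_a1 = xor_bytes(on_disk["disk0"], on_disk["disk2"], on_disk["disk3"])
print("rebuilt A1 matches original:", rebuilt_a1 == a1)
```

The rebuilt chunk differs from the original A1 by exactly old_A0 XOR new_A0, and nothing in the array flags it as wrong.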

Mitigations:
- Journal (write-ahead log): Linux MD can use a dedicated journal device (mdadm --write-journal) to log stripe changes before they are written to data/parity. If power fails, the journal is replayed.
- BBU (Battery Backup Unit): Hardware RAID controllers with a battery can preserve write-back cache contents and complete interrupted stripe writes on the next power-on.
- ZFS RAIDZ: Uses variable-stripe-width CoW. There is no write hole because each stripe is written to a new location and only referenced once complete (CoW semantics).
- Avoid RAID 5/6 for critical data: Many storage engineers simply use RAID 10 or ZFS RAIDZ to avoid the write hole entirely.

RAID Rebuild Risk

During rebuild after a disk failure, the array is degraded. A second failure during rebuild causes data loss. For large modern HDDs, rebuild time is:

Rebuild time = Disk capacity / Sequential read speed
             = 14 TB / 200 MB/s
             = 70,000 seconds
             ≈ 19.4 hours

Annual Failure Rate (AFR) for common HDDs: 1-3%
Hourly failure rate: AFR / 8760

P(second failure during rebuild) =
  1 - e^(-λ × rebuild_hours)
  where λ = (N-1 remaining disks) × hourly_failure_rate

For RAID 5 with 5 drives after first failure:
  λ = 4 × (0.02 / 8760) = 9.1 × 10^-6 /hour
  P = 1 - e^(-9.1e-6 × 19.4) ≈ 0.018%

Small per array, but not negligible across a large fleet. If 1000 such RAID groups each go through one rebuild:
  Expected dual failures: 1000 × 0.00018 ≈ 0.18

For QLC NVMe at 1% AFR but 6 GB/s rebuild speed: rebuild_time = 4 TB / 6 GB/s = 667 seconds ≈ 11 minutes → negligible risk.

This demonstrates why SSD-based RAID is safer during rebuild than HDD-based RAID.
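The arithmetic above is easy to parameterize. The sketch below uses the same exponential failure model and the same inputs as the text (decimal TB, MB/s, GB/s); the function names are assumptions.

```python
import math


def rebuild_hours(capacity_bytes, rate_bytes_per_s):
    """Time to read or write one whole disk at a sustained rate."""
    return capacity_bytes / rate_bytes_per_s / 3600


def p_second_failure(remaining_disks, afr, hours):
    """P(at least one of the remaining disks fails within the rebuild window)."""
    lam = remaining_disks * afr / 8760      # combined failures per hour
    return 1 - math.exp(-lam * hours)


hdd_hours = rebuild_hours(14e12, 200e6)     # 14 TB at 200 MB/s
hdd_risk = p_second_failure(4, 0.02, hdd_hours)

ssd_hours = rebuild_hours(4e12, 6e9)        # 4 TB at 6 GB/s
ssd_risk = p_second_failure(4, 0.01, ssd_hours)

print(f"HDD rebuild: {hdd_hours:.1f} h, P(second failure) = {hdd_risk:.4%}")
print(f"SSD rebuild: {ssd_hours * 60:.1f} min, P(second failure) = {ssd_risk:.6%}")
```

The model assumes independent failures at a constant rate and a rebuild that runs at full sequential speed, so real-world risk under production load is likely higher than these figures.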

Linux MD (Software RAID)

mdadm is the user-space tool for the Linux MD (Multiple Devices) driver:

# Create RAID 5 with 4 drives
mdadm --create /dev/md0 --level=5 --raid-devices=4 \
    /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Create RAID 10 with 4 drives
mdadm --create /dev/md1 --level=10 --raid-devices=4 \
    /dev/sdf /dev/sdg /dev/sdh /dev/sdi

# Monitor array status
cat /proc/mdstat
mdadm --detail /dev/md0

# Replace a failed drive
mdadm /dev/md0 --fail /dev/sdb --remove /dev/sdb    # mark as failed, then remove
mdadm /dev/md0 --add /dev/sdj       # hot add replacement -> rebuild begins

# Set chunk size (default 512KB, tune for workload)
mdadm --create /dev/md0 --chunk=64 ...  # 64KB chunks for small random I/O
mdadm --create /dev/md0 --chunk=1024 ... # 1MB chunks for sequential

# Tune rebuild speed (balance rebuild vs production I/O)
echo 50000 > /proc/sys/dev/raid/speed_limit_min  # min rebuild speed (KB/s)
echo 200000 > /proc/sys/dev/raid/speed_limit_max # max rebuild speed (KB/s)

/proc/mdstat output during rebuild:

md0 : active raid5 sde[3] sdd[2] sdc[1] sdj[4]
      29297024 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
      [============>........]  recovery = 62.1% (6080064/9765008)
                               finish=14.5min speed=67248K/sec
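For monitoring, the [expected/active] counter in this output is the part worth parsing. The following is a minimal sketch that works on text shaped like the sample above; the function name and the embedded sample are assumptions, and production monitoring would more commonly rely on mdadm --monitor.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active)} for arrays with fewer active members than expected."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current:
            expected, active = int(counts.group(1)), int(counts.group(2))
            if active < expected:
                result[current] = (expected, active)
    return result


SAMPLE = """\
md0 : active raid5 sde[3] sdd[2] sdc[1] sdj[4]
      29297024 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [_UUU]
      [============>........]  recovery = 62.1% (6080064/9765008)
md1 : active raid10 sdi[3] sdh[2] sdg[1] sdf[0]
      19531008 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
"""

print(degraded_arrays(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mdstat, for example with open("/proc/mdstat").read().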

ZFS RAIDZ

ZFS RAIDZ avoids the write hole through copy-on-write semantics:

RAIDZ write:
1. Compute full stripe (all data + parity) in memory
2. Write complete stripe to new location atomically (CoW)
3. Update the block-pointer tree (ultimately the uberblock) to reference the new stripe location

If power fails at step 2:
- New stripe is partially written but tree still points to OLD stripe
- Old stripe is intact: no data corruption, no inconsistency
- Partially written new stripe was never referenced, so its space is still free and is simply reused

RAIDZ variants:
- RAIDZ1: 1 parity disk equivalent (like RAID 5), variable stripe width
- RAIDZ2: 2 parity disks (like RAID 6)
- RAIDZ3: 3 parity disks
- dRAID (Distributed RAID): OpenZFS 2.1+. Spare capacity is distributed across the vdev, enabling faster rebuild (partial rebuild vs full disk rebuild)

RAIDZ vs RAID 5 key difference: variable stripe width. In RAIDZ, a single write to a 4K file uses a 2-disk stripe (1 data + 1 parity). A 1 MB write uses all N disks. This eliminates partial stripe writes → no write hole.
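A rough model of how many sectors a RAIDZ block occupies makes the variable stripe width concrete. The sketch is modeled on the allocation-size calculation in OpenZFS's vdev_raidz.c, but it is a simplification under stated assumptions (ashift=12, so 4 KiB sectors; hypothetical function name), not a copy of that code.

```python
def raidz_alloc_sectors(psize, n_disks, nparity, ashift=12):
    """Return (data, parity, padding) sector counts for one block of psize bytes."""
    sector = 1 << ashift
    data = -(-psize // sector)                    # ceil(psize / sector)
    rows = -(-data // (n_disks - nparity))        # rows needed across the data columns
    parity = nparity * rows                       # one parity sector per row per parity level
    padding = -(data + parity) % (nparity + 1)    # round up to a multiple of nparity + 1
    return data, parity, padding


for size in (4096, 8192, 1 << 20):
    d, p, pad = raidz_alloc_sectors(size, n_disks=4, nparity=1)
    print(f"{size:>8} bytes -> {d} data + {p} parity + {pad} padding sectors")
```

Under this model a 4 KiB block on a 4-disk RAIDZ1 occupies one data and one parity sector, matching the 2-disk stripe described above, while a 1 MiB block spreads over every disk. The padding term shows why small blocks can cost more space on RAIDZ than the nominal parity ratio suggests.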

RAIDZ limitation: Traditionally a RAIDZ vdev could not be widened, so adding a disk meant rebuilding the pool. OpenZFS 2.3 (released January 2025) adds RAIDZ expansion: one disk at a time can be attached to an existing RAIDZ vdev, which triggers an online reflow of existing data (slow, but enables capacity growth).

Erasure Coding at Scale

For large-scale distributed storage (petabytes+), traditional RAID is replaced by erasure coding:

Erasure code (e.g., 10+4 Reed-Solomon):

Original data: split into 10 data shards (D0-D9) + 4 parity shards (P0-P3)
Each shard stored on different node/disk

         D0  D1  D2  D3  D4  D5  D6  D7  D8  D9  P0  P1  P2  P3
Node 0:  [D0]
Node 1:      [D1]
...
Node 9:                                            [D9]
Node10:                                                [P0]
Node11:                                                    [P1]
Node12:                                                        [P2]
Node13:                                                            [P3]

Survives any 4 node/disk failures (lose any 4 of 14 shards and still recover all 10 data shards)
Storage overhead: 14/10 = 1.4× (vs RAID 6 overhead of N/(N-2), e.g., 5/3 = 1.67× for a 5-disk array)
Used by: HDFS, Ceph, Google Colossus, Azure Blob Storage, Facebook f4
Compute cost: encoding/decoding requires matrix multiply over GF(2^8), usually done in software with SIMD instructions (for example PSHUFB-based table lookups in SSSE3/AVX2/AVX-512, or Intel GFNI)

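Facebook's f4 (warm blob storage built on HDFS): a 10+4 Reed-Solomon code gives 1.4× overhead within a data center. The f4 paper reports an effective replication factor of 2.8 or 2.1 once cross-data-center redundancy is included, down from 3.6 for Haystack, across more than 65 PB of logical blobs.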
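The overhead figures quoted in this section all follow from the same ratio, which a small comparison makes explicit. The scheme list below is illustrative; k is the number of data units and m the number of redundancy units per group.

```python
def storage_overhead(k, m):
    """Raw bytes stored per byte of user data for k data units plus m redundancy units."""
    return (k + m) / k


schemes = [
    ("3x replication",      1, 2),   # one copy of the data plus two extra copies
    ("RAID 6, 5 disks",     3, 2),
    ("Reed-Solomon 10+4",  10, 4),
]

for name, k, m in schemes:
    print(f"{name:<20} overhead {storage_overhead(k, m):.2f}x, tolerates {m} losses")
```

Wider codes buy the same or better failure tolerance for less raw capacity, at the price of touching more nodes when a shard has to be reconstructed.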

Historical Context

RAID was formalized in the 1988 UC Berkeley paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)" by David Patterson, Garth Gibson, and Randy Katz. The paper proposed 5 RAID levels (RAID 1-5). RAID 6 and RAID 10 were later standardized by the industry.

The original "I" in RAID stood for "Inexpensive" — the thesis was that many cheap disks could outperform one expensive mainframe disk while providing redundancy. The industry later softened it to "Independent" when RAID became used with expensive enterprise drives.

Hardware RAID controllers (Adaptec, LSI/Broadcom, Areca) dominated through the 2000s. Software RAID gained preference in the 2010s with multi-core CPUs making XOR computation cheap, and with the understanding that hardware RAID's opaque nature could lead to data loss when batteries died silently.

Production Examples

Backblaze storage pods: Early Backblaze storage pods used software RAID 6 (Linux md) within each pod. The later Backblaze Vault architecture instead spreads each file across 20 pods with a 17+3 Reed-Solomon erasure code, so no RAID is needed inside a pod. Storage Pod 6.0 holds 480 TB in 4U with 60 drives, at a hardware cost of roughly $0.036/GB.

Percona MySQL RAID 10 recommendation: Percona's MySQL best practices explicitly recommend RAID 10 for database storage over RAID 5/6 due to write penalty impact on MySQL binary log and InnoDB writes.

Google Colossus: Uses custom erasure codes (Reed-Solomon variants) across disks and racks. Unlike traditional RAID, the erasure group spans physical racks — tolerating rack-level failures, not just disk failures.

Debugging Notes

# Check array status and rebuild progress
cat /proc/mdstat
mdadm --detail /dev/md0

# Check for bitmap (write intent log — accelerates rebuild after unclean shutdown)
mdadm --detail /dev/md0 | grep Bitmap

# Add bitmap to existing array
mdadm --grow /dev/md0 --bitmap=internal

# Start a consistency check (scrub), then read the mismatch count once it finishes
mdadm --action=check /dev/md0
cat /sys/block/md0/md/mismatch_cnt  # should be 0

# Fix mismatches (dangerous — only if you know which copy is correct)
mdadm --action=repair /dev/md0

# Examine superblock
mdadm --examine /dev/sdb

# Monitor events
mdadm --monitor --daemonize --mail=admin@example.com --delay=60 /dev/md0

Security Implications

RAID is not a backup: RAID protects against disk failure but not against accidental deletion, ransomware, filesystem corruption, or operator error. All of these affect all mirrors/parity simultaneously.

Data exposure during failed drive disposal: A failed RAID member drive contains real data (not encrypted unless dm-crypt is used under MD). Drives retired from RAID arrays must be securely erased before disposal. Each RAID 5 member holds plaintext data chunks interleaved with parity chunks, so partial data recovery is possible from a single RAID 5 member.

BBU battery failure: A dead BBU on a hardware RAID controller with write-back cache silently downgrades to write-through mode (or the controller refuses to start with write-back cache). Unmonitored dead BBUs are a common source of unexpected I/O performance degradation discovered only during incidents.

Performance Implications

RAID 5 vs RAID 10 for databases: Assume the database issues 100K 4K random write IOPS:
- RAID 5 (4 disk): write penalty 4× → needs 400K physical IOPS
- RAID 10 (4 disk): write penalty 2× → needs 200K physical IOPS
Neither is possible on 4 HDDs (~600 total IOPS). On 4 NVMe SSDs both are typically within reach, but RAID 10 consumes half the device IOPS for the same workload.
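The sizing rule behind these numbers is simply logical writes multiplied by the write penalty, plus reads. A small calculator, using the penalties given earlier in this document and an assumed function name, looks like this:

```python
# Small-random-write penalties as described in the sections above.
WRITE_PENALTY = {"raid0": 1, "raid1": 2, "raid10": 2, "raid5": 4, "raid6": 6}


def backend_iops(read_iops, write_iops, level):
    """Physical IOPS the member disks must deliver for a given logical workload."""
    return read_iops + write_iops * WRITE_PENALTY[level]


for level in ("raid10", "raid5", "raid6"):
    print(level, backend_iops(read_iops=0, write_iops=100_000, level=level))
```

The figures apply to small random writes without a write-back cache; full-stripe sequential writes avoid the read-modify-write cycle and incur far less overhead.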

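Chunk size tuning: MD RAID chunk size affects performance significantly:
- Small chunks (16-64 KB): Spread a given address range over more disks, so nearby small random I/Os are more likely to land on different disks. Keep the chunk at least as large as the typical request so that one request is not split across disks.
- Large chunks (512 KB - 1 MB): Better for sequential I/Os (each disk receives fewer, larger requests, with less coordination overhead)
- For databases with 8 KB page size, a 64 KB chunk on a 4-disk RAID 10 keeps each aligned 8 KB page inside one chunk, so a page read is served by a single disk (either member of one mirror pair) while other pages are read from the remaining disks in parallel.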
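Which stripe members a request touches for a given chunk size can be worked out directly. The sketch below assumes plain round-robin striping over the stripe members (for a 4-disk RAID 10, the two mirror pairs) and a hypothetical function name.

```python
def members_touched(offset, length, chunk_size, n_members):
    """Return the sorted stripe-member indices that a byte range [offset, offset+length) hits."""
    first_chunk = offset // chunk_size
    last_chunk = (offset + length - 1) // chunk_size
    return sorted({c % n_members for c in range(first_chunk, last_chunk + 1)})


KB = 1024
# An aligned 8 KB page with 64 KB chunks stays on one mirror pair.
print(members_touched(128 * KB, 8 * KB, 64 * KB, n_members=2))
# A 1 MB sequential read with 64 KB chunks uses both pairs.
print(members_touched(0, 1024 * KB, 64 * KB, n_members=2))
# The same 8 KB page with a 4 KB chunk is split across both pairs.
print(members_touched(128 * KB, 8 * KB, 4 * KB, n_members=2))
```

The last case shows the cost of a chunk smaller than the request: a single page read turns into two device I/Os that must both complete.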

Rebuild impact on production: A rebuild competes with production I/O for disk bandwidth and, on a busy system, can saturate the disks. Use speed_limit_min and speed_limit_max in /proc/sys/dev/raid/ to rate-limit MD rebuilds.

Failure Modes and Real Incidents

The RAID 5 + DM-SMR disaster (2020): Multiple users in NAS forums and Ars Technica reported ZFS RAIDZ and MD RAID 5 arrays becoming completely unresponsive during resilver/rebuild when using WD Red (non-Plus) drives, which turned out to be DM-SMR internally. The RAID rebuild sequential write pattern triggered DM-SMR's band management, reducing writes to 1-2 MB/s and causing rebuild to take weeks instead of hours. ZFS reported checksum errors and suspended the pool.

Areca RAID card BBU silent failure (common): Multiple reports of production MySQL servers experiencing sudden I/O performance degradation traced to silent BBU death on Areca hardware RAID controllers. The controller switched from write-back to write-through cache, turning what were 200 µs writes (hitting cache) into 5 ms writes (hitting HDD). Unmonitored BBU status meant this went undiscovered until an incident.

CERN CASTOR RAID 6 rebuild during disk shortage (2019): CERN's tape-disk hybrid archive system (CASTOR) ran degraded RAID 6 groups for months due to disk procurement delays. Multiple arrays entered double-degraded state (one disk away from data loss). Resolved by emergency procurement and 24/7 rebuild operations.

Modern Usage

Software RAID via mdadm and ZFS is preferred over hardware RAID in most modern deployments:
- Modern CPUs make parity computation nearly free (vectorized XOR and GF(2^8) multiply using SSE, AVX2, or AVX-512)
- A hardware RAID controller and its BBU are additional single points of failure
- ZFS RAIDZ provides better integrity guarantees (end-to-end checksums, no write hole)
- Cloud environments use erasure coding at the distributed storage layer, which removes the need for local RAID

For all-NVMe systems, the I/O bottleneck is no longer disk seek time. Typical choices are RAID 10 for databases, or ZFS RAIDZ2 for general-purpose storage with integrity requirements.

Future Directions

Erasure coding in filesystems: Btrfs RAID 5/6 parity has long-standing bugs (not production-ready as of 2024). ZFS RAIDZ is mature but lacks distributed erasure across hosts. The gap between local RAID and distributed erasure coding is narrowing via CephFS.
ZFS RAIDZ expansion (OpenZFS 2.3+): Adding disks to existing RAIDZ vdevs online (in-place reflow). Eliminates the need to recreate pools for capacity expansion.
NVMe ZNS RAID: Exposing zone-sequential NAND blocks directly to RAIDZ-like software — eliminating FTL write amplification in the RAID layer. Ongoing research.

Exercises

Create a RAID 5 array with mdadm using loop devices (no real disks needed). Simulate a disk failure (mdadm --fail), add a replacement, and observe the rebuild process in /proc/mdstat. Time the rebuild.
Measure the RAID 5 write penalty empirically. Create an fio test with --rw=randwrite --bs=4k and run it on a 3-disk RAID 5 and on a 3-disk RAID 0 (same hardware). Compute the write IOPS ratio. Is it close to the theoretical 4× penalty?
Calculate the RAID 5 vs RAID 6 rebuild risk for a 12-drive array using 16 TB HDDs at 2% AFR and 200 MB/s rebuild speed. How does the risk change with 6 TB HDDs at 1% AFR?
Read the ZFS RAIDZ source code in OpenZFS (module/zfs/vdev_raidz.c). Find where variable stripe width is computed. What determines the number of data disks in a given stripe?
Research Facebook's f4 erasure coding paper. What erasure code parameters does f4 use, and what is the storage overhead compared to f4's predecessor (Haystack, which relies on replication)? What are the read/write performance tradeoffs?

References

Patterson, D., Gibson, G., Katz, R. "A Case for Redundant Arrays of Inexpensive Disks (RAID)." ACM SIGMOD 1988.
Plank, J. "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems." Software: Practice and Experience, 1997.
Seltzer, M. et al. "The Case for Application-Specific Logging." USENIX ATC 1992.
Linux MD documentation: Documentation/admin-guide/md.rst
Linux MD source: drivers/md/
Huang, C. et al. "Erasure Coding in Windows Azure Storage." USENIX ATC 2012.
Muralidhar, S. et al. "f4: Facebook's Warm BLOB Storage System." OSDI 2014.
ZFS RAIDZ: https://openzfs.github.io/openzfs-docs/Basic%20Concepts/RAIDZ.html
Backblaze Pod: https://www.backblaze.com/cloud-storage/resources/storage-pod