Section 12: Storage Systems
Purpose and Scope
Storage systems sit at the boundary between volatile computation and durable state. This section traces the complete path from an application's write() call to bits committed on persistent media, covering the storage hierarchy, physical device characteristics (HDD, SSD, NVMe), protocols (SATA, SAS, NVMe/PCIe), the Linux block layer and I/O scheduler stack, RAID architectures, DMA for storage, storage caching strategies, and storage networking (iSCSI, Fibre Channel, NFS). It extends to modern cloud object storage and its design implications.
Understanding storage requires reasoning simultaneously about latency (from nanoseconds to milliseconds, a span of six orders of magnitude), bandwidth, IOPS, durability semantics, and the economic pressures that shaped each layer of the hierarchy.
Prerequisites
- Section 02 (CPU Architecture): PCI Express bus, DMA, cache hierarchy
- Section 03 (OS Fundamentals): block devices, file descriptors, syscall path
- Section 11 (Memory Management): DMA API, page cache, mmap
- Basic familiarity with the Linux device model (/dev/sdX, /dev/nvmeXnY)
Learning Objectives
Upon completing this section you will be able to:
- Explain the mechanical and electronic physics that determine HDD latency (seek + rotational latency + transfer).
- Describe the internal architecture of a NAND SSD: flash translation layer, wear leveling, garbage collection, write amplification.
- Compare NVMe, SATA, and SAS protocols across latency, queue depth, and CPU overhead dimensions.
- Trace a write request through the Linux block layer from submit_bio() to the DMA completion interrupt.
- Choose an appropriate I/O scheduler (none/mq-deadline/BFQ/Kyber) for a given workload.
- Explain RAID 0/1/5/6/10 in terms of rebuild time, write penalty, and fault tolerance.
- Describe iSCSI and Fibre Channel architectures and their failure domains.
- Explain how object storage (S3-compatible) differs architecturally from block and file storage.
Architecture Overview
Application
│ read() / write() / io_uring
▼
VFS Layer
│ page cache lookup / writeback
▼
Filesystem (ext4, XFS, Btrfs …)
│ bio construction
▼
┌────────────────────────────────────────────────────────┐
│ Linux Block Layer │
│ ┌─────────────────────────────────────────────────┐ │
│ │ I/O Scheduler │ │
│ │ mq-deadline │ BFQ │ Kyber │ none (NVMe) │ │
│ └───────────────────────┬─────────────────────────┘ │
│ Multi-Queue (blk-mq): per-CPU software queues → hardware queues │
└───────────────────────────┬────────────────────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌─────▼──────┐ ┌───────▼──────┐ ┌──────▼──────┐
│ NVMe │ │ SATA/AHCI │ │ SAS HBA │
│ (PCIe 4/5)│ │ (6 Gbps) │ │ (12 Gbps) │
└─────┬──────┘ └───────┬──────┘ └──────┬──────┘
│ │ │
┌─────▼──────┐ ┌───────▼──────┐ ┌──────▼──────┐
│ NVMe SSD │ │ SATA SSD │ │ SAS HDD │
│ ~100 µs │ │ ~100 µs │ │ ~5–10 ms │
└────────────┘ └──────────────┘ └─────────────┘
Storage Networking:
Host ──(iSCSI/TCP)──► iSCSI Target ──► Block Device
Host ──(FC 16/32G)──► FC Switch ──► Storage Array
Host ──(NFS/TCP)────► NFS Server ──► Filesystem
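The scheduler row in the diagram corresponds to a per-device setting that Linux exposes through sysfs at /sys/block/<device>/queue/scheduler, where the active scheduler is wrapped in square brackets. The following is a minimal sketch of reading that file; the function names are illustrative, and the device name passed in (for example "nvme0n1") is an assumption about the host.

```python
from pathlib import Path


def parse_scheduler_line(line: str) -> tuple[str, list[str]]:
    """Split a sysfs scheduler line into (active, available).

    The kernel lists the available schedulers and marks the active one
    with square brackets, e.g. "[mq-deadline] kyber bfq none".
    Returns an empty string for `active` if no entry is bracketed.
    """
    active = ""
    available = []
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_scheduler(device: str) -> tuple[str, list[str]]:
    """Read the scheduler setting for a block device such as "nvme0n1"."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler_line(path.read_text())


if __name__ == "__main__":
    # Print the scheduler of every block device visible on this host.
    for dev in sorted(Path("/sys/block").glob("*")):
        sched_file = dev / "queue" / "scheduler"
        if sched_file.exists():
            print(dev.name, parse_scheduler_line(sched_file.read_text()))
```

Writing one of the listed names back to the same file (as root) switches the scheduler for that device; the sketch only reads.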
Key Concepts
- Storage Hierarchy: Registers → L1/L2/L3 cache → DRAM → NVMe SSD → SATA SSD → HDD → Tape; each step down the hierarchy accepts higher latency in exchange for more capacity at a lower cost per byte.
- HDD Internals: Seek time (actuator arm movement), rotational latency (platter spin to sector), transfer rate; random I/O is limited to roughly 75–200 IOPS per drive (the low end typical of 7200 RPM, the high end of 15K RPM).
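- NAND Flash: SLC/MLC/TLC/QLC cells store 1/2/3/4 bits; program-erase cycles are limited (roughly 100K for SLC down to ~1K for QLC); pages are programmed individually but erased only at block granularity (256 KB–1 MB, often several MB on recent 3D NAND), so live data must be relocated before an erase, which creates write amplification.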
- Flash Translation Layer (FTL): Maps logical block addresses to physical flash pages; implements wear leveling, garbage collection, over-provisioning.
- Write Amplification Factor (WAF): Ratio of bytes written to flash vs bytes written by host; high WAF reduces SSD endurance.
- NVMe Protocol: Low-latency PCIe-attached interface; supports up to 65,535 I/O queues, each up to 65,536 entries deep; eliminates the SATA/AHCI bottleneck of a single 32-entry command queue.
- Block Layer (blk-mq): Linux multi-queue block layer; maps software queues to hardware queues for SMP scalability.
- I/O Scheduler: Algorithms that reorder, merge, and throttle I/O requests; mq-deadline bounds per-request wait time; BFQ provides per-process fairness; Kyber targets latency; none (no reordering) is usually the best choice for fast NVMe devices.
- RAID: Redundant Array of Independent Disks; RAID 0 (striping), RAID 1 (mirroring), RAID 5/6 (parity), RAID 10 (mirror+stripe).
- Write Penalty: For RAID 5, each small (partial-stripe) write requires 4 I/Os (read old data, read old parity, write new data, write new parity); RAID 6 requires 6 because it maintains two parity blocks.
- DMA (Direct Memory Access): Storage controllers transfer data directly to/from host DRAM without the CPU copying it; uses scatter-gather lists to describe non-contiguous buffers.
- Storage Caching: Write-back (data in cache, acknowledged before media write), write-through (acknowledged after media write), read-ahead.
- iSCSI: SCSI commands encapsulated in TCP/IP; software initiator on host, hardware offload via iSCSI HBA (TOE).
- Fibre Channel: Purpose-built lossless SAN protocol; Fibre Channel over Ethernet (FCoE) carries FC frames over converged Ethernet infrastructure.
- Object Storage: Flat namespace of immutable objects identified by key; S3-compatible API; no filesystem semantics; scales horizontally.
- Persistent Memory (PMEM): NVDIMMs or Optane DIMMs; byte-addressable like DRAM, persistent like SSD; accessed via DAX, bypassing page cache.
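Several of the concepts above reduce to short arithmetic. The sketch below works through three of them: the HDD random-IOPS estimate (seek plus half a rotation plus transfer), the write amplification ratio, and the RAID 5 read-modify-write parity update that accounts for the four I/Os. All function names are illustrative, and the 8.5 ms average seek used in the demo is an assumed figure, not a measurement of any particular drive.

```python
def hdd_random_iops(rpm: float, avg_seek_ms: float, transfer_ms: float = 0.0) -> float:
    """Estimate random IOPS from seek + rotational latency + transfer time."""
    rotation_ms = 60_000.0 / rpm              # time for one full revolution
    rotational_latency_ms = rotation_ms / 2   # on average, wait half a turn
    service_ms = avg_seek_ms + rotational_latency_ms + transfer_ms
    return 1000.0 / service_ms


def write_amplification(flash_bytes_written: int, host_bytes_written: int) -> float:
    """WAF = bytes written to flash / bytes written by the host."""
    return flash_bytes_written / host_bytes_written


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def raid5_small_write(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """Return the new parity for a read-modify-write update.

    old_data and old_parity stand for the two reads; the caller then
    performs the two writes (new_data and the returned parity).
    """
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


if __name__ == "__main__":
    # Assumed inputs: 7200 RPM drive with an 8.5 ms average seek.
    print(f"HDD estimate: {hdd_random_iops(7200, 8.5):.0f} IOPS")
    # Assumed inputs: 3 GiB hit the flash for 1 GiB of host writes.
    print(f"WAF: {write_amplification(3 * 2**30, 2**30):.1f}")

    # Three data blocks plus parity; overwrite block 1 and update parity.
    stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
    parity = xor_blocks(xor_blocks(stripe[0], stripe[1]), stripe[2])
    new_block = b"\xaa\x55"
    parity = raid5_small_write(stripe[1], parity, new_block)
    stripe[1] = new_block
    # The incrementally updated parity matches a full recomputation.
    assert parity == xor_blocks(xor_blocks(stripe[0], stripe[1]), stripe[2])
```

The parity update never touches the other data blocks in the stripe, which is why the cost stays at four I/Os regardless of how many disks the array has.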
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1956 | IBM RAMAC: first hard disk drive (3.75 MB, 50 platters) |
| 1973 | IBM Winchester disk: sealed head/disk assembly, modern HDD architecture |
| 1988 | Patterson, Gibson, Katz: "A Case for Redundant Arrays" (RAID coined) |
| 1994 | ATA/IDE becomes dominant PC interface |
| 1994 | Fibre Channel (FC-PH) approved as an ANSI standard |
| 2000 | iSCSI specification development begins (published as RFC 3720 in 2004) |
| 2003 | SATA 1.0 ratified (1.5 Gbps); replaces parallel ATA |
| 2004 | Linux CFQ (Complete Fair Queuing) I/O scheduler |
| 2007 | SanDisk ships first consumer NAND SSD |
| 2008 | Intel X25-M and X25-E SSDs; SSDs enter the mainstream client and enterprise markets |
| 2011 | NVMe 1.0 specification published |
| 2012 | Samsung 840: TLC NAND reaches mainstream consumer SSDs |
| 2013 | Linux blk-mq (multi-queue block layer) merged |
| 2016 | NVMe over Fabrics (NVMe-oF) 1.0 specification published |
| 2017 | Linux BFQ I/O scheduler merged (mainline 4.12) |
| 2019 | First PCIe 4.0 NVMe SSDs ship; ~7 GB/s sequential read follows in 2020 (Samsung 980 PRO) |
| 2022 | PCIe 5.0 NVMe SSDs arrive, with sequential reads approaching 14 GB/s |
| 2023 | io_uring (merged in Linux 5.1, 2019) matures as a low-syscall-overhead async I/O interface |
Modern Relevance and Production Use Cases
Cloud storage services (AWS EBS, GCP Persistent Disk) abstract block storage over a network; understanding IOPS provisioning, burst credits, and multi-attach semantics requires solid block layer knowledge.
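Burst credits are commonly explained as a token bucket: credits accrue at the baseline rate, and the volume may run at the burst rate while credits remain. The sketch below is a toy model of that idea only; the class name, fields, and all numbers are hypothetical and do not reproduce any provider's actual accounting.

```python
from dataclasses import dataclass


@dataclass
class BurstBucket:
    """Toy token-bucket model of provisioned IOPS with burst credits."""
    baseline_iops: float   # credits earned per second
    burst_iops: float      # maximum IOPS while credits remain
    capacity: float        # maximum number of stored credits
    credits: float         # current credit balance

    def step(self, demand_iops: float, seconds: float = 1.0) -> float:
        """Advance time by `seconds` and return the IOPS actually delivered."""
        # Refill at the baseline rate, capped at the bucket capacity.
        self.credits = min(self.capacity,
                           self.credits + self.baseline_iops * seconds)
        # The workload can ask for at most the burst rate.
        wanted = min(demand_iops, self.burst_iops) * seconds
        # Each I/O consumes one credit; serve what the balance allows.
        served = min(wanted, self.credits)
        self.credits -= served
        return served / seconds


if __name__ == "__main__":
    # Hypothetical volume: 100 baseline IOPS, 3000 burst, 10,000 credits.
    volume = BurstBucket(baseline_iops=100, burst_iops=3000,
                         capacity=10_000, credits=10_000)
    for second in range(6):
        print(second, volume.step(demand_iops=3000))
```

Under sustained demand above the baseline, delivered IOPS collapse to the baseline once the bucket drains, which is the behavior that surprises workloads sized against the burst figure.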
Database storage engines (InnoDB, RocksDB, PostgreSQL) make explicit assumptions about fsync durability, O_DIRECT vs buffered I/O, and write ordering; a misconfigured storage stack silently violates these assumptions.
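The durability assumptions mentioned above can be made concrete with the common write-to-temp, fsync, rename, fsync-directory pattern. This is a minimal sketch assuming a POSIX system and a storage stack that honors flush requests; the function name and the ".tmp" suffix are illustrative choices, not taken from any particular database engine.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Atomically replace `path` with `data` and flush it to stable storage.

    Durability still depends on every layer below (filesystem, controller
    cache, device) honoring the flush.
    """
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"   # hypothetical temp-file naming convention

    # 1. Write the new contents to a temporary file and flush it.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)   # may be a partial write
            view = view[written:]
        os.fsync(fd)                       # data + metadata to the device
    finally:
        os.close(fd)

    # 2. Atomically swap the new file into place.
    os.replace(tmp_path, path)

    # 3. fsync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping step 3 is a frequent source of lost updates after power failure: the file contents are on media, but the directory entry pointing at them may not be.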
Video streaming and CDN workloads saturate sequential read bandwidth; understanding read-ahead, page cache, and sendfile() is essential for maximizing cache hit throughput.
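As a rough illustration of the sendfile() path, the sketch below streams a file to a connected socket using Python's socket.sendfile(), which uses os.sendfile() where the platform supports it and otherwise falls back to an ordinary read/send loop. The function name serve_file is hypothetical, and the read-ahead hint is applied only where os.posix_fadvise is available.

```python
import os
import socket


def serve_file(conn: socket.socket, path: str) -> int:
    """Send the whole file at `path` over `conn`; return the bytes sent.

    `conn` must be a blocking SOCK_STREAM socket.
    """
    with open(path, "rb") as f:
        # Hint that access is sequential so the kernel can read ahead
        # more aggressively (not available on every platform).
        if hasattr(os, "posix_fadvise"):
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        # With os.sendfile(), data moves from the page cache to the
        # socket inside the kernel, avoiding a copy through user space.
        return conn.sendfile(f)
```

When the file is already resident in the page cache, this path serves it without touching the storage device at all, which is why cache hit ratio dominates throughput for this kind of workload.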
NVMe-oF enables disaggregated storage with remote latency that can stay below 100 µs over RDMA transports such as RoCE, and somewhat higher over TCP; hyperscaler platforms (AWS Nitro, Azure Stack) follow a similar disaggregated design.
All-flash arrays (NetApp AFF, Pure Storage) use proprietary storage layouts (for example NetApp's WAFL and Pure Storage's Purity) with aggressive deduplication and compression; WAF and garbage collection are primary design axes.
File Map
| File | Description |
|---|---|
| 01-storage-hierarchy.md | Latency/bandwidth/capacity ladder, economic drivers |
| 02-hdd-internals.md | Platters, heads, seek/rotational latency, ZBR, SMR |
| 03-ssd-nand-flash.md | NAND cell types, page/block structure, P/E cycles, retention |
| 04-flash-translation-layer.md | FTL design, wear leveling, garbage collection, WAF |
| 05-nvme-protocol.md | NVMe queues, submission/completion rings, namespace model |
| 06-sata-sas-protocols.md | AHCI, SAS expanders, protocol overhead comparison |
| 07-block-layer.md | bio structure, blk-mq, request merging, plugging |
| 08-io-schedulers.md | CFQ/BFQ/mq-deadline/Kyber/none, latency vs throughput |
| 09-raid-levels.md | RAID 0/1/5/6/10, write penalty, rebuild time, mdadm |
| 10-storage-controllers-dma.md | HBA architecture, scatter-gather DMA, interrupt coalescing |
| 11-storage-caching.md | Page cache, write-back/through, bcache, dm-cache, ZFS ARC |
| 12-persistent-memory.md | Optane DCPMM, DAX mode, fsdax, devdax, PMDK |
| 13-iscsi-fc.md | iSCSI initiator/target stack, FC zoning, FCoE |
| 14-nfs-cifs.md | NFS v3/v4/v4.1 (pNFS), CIFS/SMB3, locking semantics |
| 15-nvme-of.md | NVMe-oF over RoCE/TCP, host/target driver architecture |
| 16-object-storage.md | S3 API, erasure coding, consistency model, Ceph RADOS |
| 17-cloud-storage.md | EBS, GCP PD, Azure Disk, IOPS/throughput provisioning |
| 18-io-uring.md | io_uring design, SQE/CQE rings, fixed buffers, registered files |
Cross-References
- Section 02 (CPU Architecture): PCIe topology, DMA, IOMMU
- Section 11 (Memory Management): page cache, DMA API, huge pages for I/O buffers
- Section 13 (Filesystems): VFS, page cache writeback, journaling on top of block layer
- Section 14 (Device Drivers): storage driver stack, SCSI mid-layer, libata
- Section 17 (Distributed Systems): distributed storage consistency, replication
- Section 19 (Virtualization): virtio-blk, VirtIO-SCSI, NVMe passthrough