05 - Linux Block Layer
Technical Overview
The Linux block layer is the kernel subsystem that provides a uniform interface between filesystems (and other I/O consumers) and block storage devices. It abstracts the differences between HDDs, SSDs, RAID arrays, loop devices, and network storage behind a common API. The block layer handles request merging, I/O scheduling, request queue management, and the handoff to device drivers.
The block layer underwent a fundamental redesign with the introduction of blk-mq (multi-queue block layer) in Linux 3.13 (2014). The old single-queue block layer, designed for HDD latencies and single-core CPUs, became a bottleneck once NVMe SSDs could sustain a million or more IOPS. blk-mq replaced it with a two-level queue model optimized for NUMA-aware, multi-core submission.
Prerequisites
- Linux kernel architecture basics
- VFS and filesystem concepts
- PCIe and NVMe basics (see 04-nvme.md)
- Understanding of NUMA topology
Core Content
The Full I/O Path
User Space
|
| read()/write()/pread()/pwrite() syscalls
v
+---------------------------+
| VFS (Virtual Filesystem) | path resolution, file descriptor lookup
| vfs_read() / vfs_write() |
+---------------------------+
|
| file->f_op->read_iter()
v
+---------------------------+
| Page Cache | address_space, struct page
| (check/update cache) | generic_file_read_iter()
+---------------------------+
| cache miss: submit_bio()
v
+---------------------------+
| Filesystem | ext4_readpages(), btrfs_submit_bio()
| (block mapping) | Maps logical file offset -> LBA
+---------------------------+
| submit_bio()
v
+---------------------------+
| Block Layer |
| +-----------------------+|
| | bio submission || struct bio (the I/O descriptor)
| | blk_mq_submit_bio() ||
| +-----------------------+|
| | Software queues || per-CPU staging queues (blk-mq)
| | (blk_mq_ctx) ||
| +-----------------------+|
| | I/O Scheduler || mq-deadline, kyber, none
| | (reorder/prioritize) ||
| +-----------------------+|
| | Hardware dispatch || per-hardware-queue (blk_mq_hw_ctx)
| | queues || dispatched to driver
| +-----------------------+|
+---------------------------+
| blk_mq_ops->queue_rq()
v
+---------------------------+
| Device Driver | nvme_queue_rq(), sd_init_command()
| (translate to hw cmds) |
+---------------------------+
| PCIe / SATA / FC / ...
v
+---------------------------+
| Storage Device | NVMe SSD, HDD, RAID controller
+---------------------------+
Key Data Structures
struct bio — the fundamental I/O unit in the block layer:
struct bio {
struct bio *bi_next; // linked list for bio chaining
struct block_device *bi_bdev; // target block device
blk_opf_t bi_opf; // operation flags (READ/WRITE/FLUSH/FUA)
unsigned short bi_vcnt; // number of bvecs (scatter-gather entries)
struct bvec_iter bi_iter; // current position within bio
bio_end_io_t *bi_end_io; // completion callback
struct bio_vec bi_inline_vecs[]; // embedded bvecs for small I/Os
};
struct bio_vec {
struct page *bv_page; // physical page containing data
unsigned int bv_len; // byte count in this segment
unsigned int bv_offset; // byte offset within page
};
A bio represents a single contiguous-on-disk I/O request, potentially from non-contiguous physical memory pages (scatter-gather). A read of a 1 MB file might produce one bio with 256 bvecs (256 × 4KB pages).
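The 256-bvec figure is a ceiling division of the I/O size by the page size. A minimal sketch of that arithmetic, assuming one bvec per page (real kernels can coalesce physically contiguous pages into a single multi-page bvec, so this is an upper bound); `bvec_count` is a hypothetical helper name:

```shell
# Upper bound on the number of page-sized segments needed for an I/O.
#   $1 = I/O size in bytes
#   $2 = page size in bytes (default 4096)
#   $3 = byte offset of the buffer within its first page (default 0)
bvec_count() {
    local bytes=$1 page_size=${2:-4096} offset=${3:-0}
    # Ceiling division over the byte span [offset, offset + bytes)
    echo $(( (offset + bytes + page_size - 1) / page_size ))
}

bvec_count $((1024 * 1024))            # prints 256
bvec_count $((1024 * 1024)) 4096 512   # prints 257 (unaligned buffer)
```

An unaligned buffer spills into one extra page, which is why the second call reports 257 segments.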
struct request — wraps one or more bios after merging:
// In blk-mq, a request is allocated from a per-hardware-queue pool
struct request {
struct request_queue *q;
struct blk_mq_ctx *mq_ctx; // software queue context
struct blk_mq_hw_ctx *mq_hctx; // hardware queue context
blk_opf_t cmd_flags; // operation and flags (REQ_OP_* | REQ_*)
req_flags_t rq_flags; // request-internal state flags (RQF_*)
blk_status_t error;
sector_t __sector; // start sector (LBA)
unsigned int __data_len; // total byte count
struct bio *bio; // first bio in chain
struct bio *biotail; // last bio
// ... driver-private area follows
};
struct request_queue — the queue associated with a block device:
struct request_queue {
struct blk_mq_tag_set *tag_set; // hardware queue configuration
struct elevator_queue *elevator; // I/O scheduler
struct queue_limits limits; // device capabilities (max_sectors, etc.)
// ... many tuning parameters
};
blk_mq_hw_ctx — hardware dispatch queue:
- Corresponds to one NVMe submission queue / SCSI tagged-command queue
- Has a tag set (pre-allocated struct request pool)
- Managed by the driver (nvme_queue maps 1:1 to blk_mq_hw_ctx)
blk-mq Two-Level Queue Architecture
blk-mq Queue Architecture:
CPU 0 CPU 1 CPU 2 CPU 3
| | | |
v v v v
[SWQ 0] [SWQ 1] [SWQ 2] [SWQ 3]
(Software (Software (Software (Software
Queue 0) Queue 1) Queue 2) Queue 3)
| | | |
+------+--------+------+------+
| |
v v
[HWQ 0] [HWQ 1]
(Hardware Queue 0) (Hardware Queue 1)
(maps to NVMe (maps to NVMe
SQ #1) SQ #2)
| |
+-------+--------+
|
v
NVMe Controller
- Software queues (blk_mq_ctx): one per CPU for each request queue. Staging area before the I/O scheduler. Each CPU submits to its own SWQ, so cross-CPU lock contention is avoided. Batches I/Os for efficient dispatch.
- Hardware queues (blk_mq_hw_ctx): one per NVMe submission queue (or SCSI host bus adapter queue). May map to multiple SWQs. I/O scheduler operates at this level.
For NVMe with 32 hardware queues on a 32-core system: 1:1 mapping — each CPU has its own dedicated NVMe submission queue with zero contention.
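The software-to-hardware queue mapping can be inspected through the per-queue `cpu_list` files under `/sys/block/<dev>/mq/`. The following is a sketch rather than a polished tool; the function name `show_hwq_map` and its optional second argument (an alternate sysfs root, useful for trying it against a fake directory tree) are assumptions made for illustration.

```shell
# Print "hwq <n>: cpus <list>" for every hardware queue of a device.
#   $1 = device name, e.g. nvme0n1
#   $2 = sysfs block root (hypothetical override, default /sys/block)
show_hwq_map() {
    local dev=$1 root=${2:-/sys/block}
    local dir
    for dir in "$root/$dev"/mq/*/; do
        # Skip anything that is not a hardware queue directory
        [ -r "${dir}cpu_list" ] || continue
        printf 'hwq %s: cpus %s\n' "$(basename "$dir")" "$(cat "${dir}cpu_list")"
    done
}

# Example: show_hwq_map nvme0n1
```

With a 1:1 mapping every line lists exactly one CPU; when the device exposes fewer hardware queues than there are CPUs, several CPUs appear on the same line.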
Request Merging
Before dispatch, the block layer attempts to merge adjacent I/O requests:
Front merge: New bio covers sectors just before an existing request → prepend to request
Back merge: New bio covers sectors just after an existing request → append to request
Plug/unplug: blk_start_plug() holds I/Os in a per-task list to allow batching before submitting to the scheduler
// Kernel: request merging in blk-mq
// drivers/../blk-merge.c: blk_attempt_plug_merge()
// If current task has an active plug, try to merge with pending requests
// before submitting — reduces queue contention and improves merging
Merging is critical for HDD performance (it turns 100 separate 4K writes into one 400 KB sequential write if their addresses are adjacent) and less important for NVMe (parallelism matters more than merging).
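The effect of back merging can be modeled outside the kernel. This is a toy sketch, not the kernel's algorithm: it only handles back merges in submission order and ignores front merges, `max_sectors_kb`, and segment limits. The function name `merge_adjacent` and the input format (one `start_sector nr_sectors` pair per line) are assumptions for illustration.

```shell
# Coalesce requests that start exactly where the previous one ends.
# Input (stdin): "start_sector nr_sectors" per line, in submission order.
# Output: one line per resulting request, same format.
merge_adjacent() {
    awk '
        NR == 1           { start = $1; len = $2; next }
        $1 == start + len { len += $2; next }    # back merge: extend current request
                          { print start, len; start = $1; len = $2 }
        END               { if (NR > 0) print start, len }
    '
}

# 100 adjacent 8-sector (4 KB) writes collapse into a single request
i=0
while [ "$i" -lt 100 ]; do
    echo "$((i * 8)) 8"
    i=$((i + 1))
done | merge_adjacent    # prints "0 800"
```

The single 800-sector result corresponds to 800 × 512 bytes = 400 KB, one request instead of one hundred.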
I/O Schedulers
The I/O scheduler sits between the software queues and hardware queues:
# View/set scheduler for a device
cat /sys/block/nvme0n1/queue/scheduler
# Output: [none] mq-deadline kyber
cat /sys/block/sda/queue/scheduler
# Output: [mq-deadline] none kyber
echo mq-deadline > /sys/block/sda/queue/scheduler
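The active scheduler is the bracketed entry in that file. For scripts that need just that value, a small sketch follows; `active_scheduler` is a hypothetical helper name, and it takes the file contents as an argument so it can be tried without a real device.

```shell
# Extract the bracketed (active) entry from a scheduler line such as
# "[none] mq-deadline kyber".
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

active_scheduler "[none] mq-deadline kyber"    # prints none
# On a live system:
#   active_scheduler "$(cat /sys/block/sda/queue/scheduler)"
```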
none (passthrough):
- No reordering; I/Os dispatched in submission order
- Zero overhead
- Best choice for NVMe SSDs (device-side parallelism makes software ordering irrelevant)
- Also a good choice for RAM-backed devices (RAM disks) and for loop devices backed by an SSD
mq-deadline:
- Maintains sorted red-black tree of pending requests (sorted by LBA)
- Two queues: read and write, each with a FIFO expiry queue
- Deadline parameters: read_expire (500ms default), write_expire (5s default)
- Front dispatch: service requests in LBA order (like elevator) until a request deadline expires
- Best choice for rotating media (HDDs) — minimizes seek distance while preventing starvation
# mq-deadline tuning
cat /sys/block/sda/queue/iosched/read_expire # ms before read is "urgent"
cat /sys/block/sda/queue/iosched/write_expire # ms before write is "urgent"
cat /sys/block/sda/queue/iosched/fifo_batch # requests to service from sorted queue
cat /sys/block/sda/queue/iosched/front_merges # enable/disable front merges
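The dispatch rule described for mq-deadline (sweep in LBA order unless the oldest request has waited past its deadline) can be modeled in a few lines. This is a deliberately simplified sketch: it handles a single data direction, ignores `fifo_batch` and read-over-write preference, and assumes that when no request lies ahead of the current position the sweep restarts from the oldest request. The name `next_dispatch` and the input format are hypothetical.

```shell
# Pick the LBA of the next request to dispatch.
#   $1 = current time (ms), $2 = expiry (ms), $3 = LBA where the last dispatch ended
# Input (stdin): "arrival_ms lba" per pending request.
next_dispatch() {
    awk -v now="$1" -v expire="$2" -v pos="$3" '
        {
            t = $1 + 0; lba = $2 + 0
            # Track the FIFO head: the request that has waited longest
            if (NR == 1 || t < oldest_t) { oldest_t = t; oldest_lba = lba }
            # Track the closest request at or beyond the current position
            if (lba >= pos && (!have_next || lba < next_lba)) {
                next_lba = lba; have_next = 1
            }
        }
        END {
            if (NR == 0) exit
            if (now - oldest_t >= expire) print oldest_lba   # deadline expired
            else if (have_next)           print next_lba     # continue the sweep
            else                          print oldest_lba   # nothing ahead: restart
        }
    '
}

printf '0 900\n100 100\n200 500\n' | next_dispatch 300 500 400   # prints 500
printf '0 900\n100 100\n200 500\n' | next_dispatch 600 500 400   # prints 900
```

In the first call no deadline has expired, so the request nearest ahead of LBA 400 wins; in the second, the request that arrived at t=0 has waited 600 ms and is served regardless of position.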
kyber:
- Latency-based scheduler for fast block devices
- Maintains separate queues for reads and synchronous writes
- Throttles queue depth to hit latency targets (configurable)
- Less useful than none for most NVMe workloads; useful when multiple concurrent workloads compete for a fast SSD
# kyber latency targets (nanoseconds)
cat /sys/block/nvme0n1/queue/iosched/read_lat_nsec # default: 2ms
cat /sys/block/nvme0n1/queue/iosched/write_lat_nsec # default: 10ms
Legacy schedulers (pre-blk-mq, removed in modern kernels):
- CFQ (Completely Fair Queuing): per-process I/O fairness with bandwidth slices. Removed in 5.0.
- Deadline: Single-queue version of mq-deadline. Removed in 5.0.
- NOOP: Single-queue passthrough. Replaced by none.
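The guidance above (mq-deadline for rotating media, none for NVMe and other non-rotational devices) can be encoded as a lookup on the `rotational` flag. This is only a starting heuristic under that assumption; mixed or latency-sensitive workloads may still prefer kyber. `suggest_scheduler` is a hypothetical helper name.

```shell
# Map the value of /sys/block/<dev>/queue/rotational to a scheduler name.
suggest_scheduler() {
    case $1 in
        1) echo mq-deadline ;;   # rotating media: LBA ordering reduces seeks
        0) echo none ;;          # non-rotational: leave ordering to the device
        *) echo "unexpected rotational value: $1" >&2; return 1 ;;
    esac
}

suggest_scheduler 1    # prints mq-deadline
suggest_scheduler 0    # prints none
# On a live system (as root):
#   suggest_scheduler "$(cat /sys/block/sda/queue/rotational)" \
#       > /sys/block/sda/queue/scheduler
```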
Key /sys/block/ Tuning Parameters
# For /dev/sda (HDD):
/sys/block/sda/queue/scheduler # I/O scheduler
/sys/block/sda/queue/nr_requests # queue depth (default 128)
/sys/block/sda/queue/read_ahead_kb # readahead size (default 128)
/sys/block/sda/queue/rotational # 1=HDD, 0=SSD
/sys/block/sda/queue/hw_sector_size # hardware sector size in bytes (same value as logical_block_size; see physical_block_size for the physical sector)
/sys/block/sda/queue/logical_block_size # logical block size
/sys/block/sda/queue/max_sectors_kb # max I/O size
/sys/block/sda/queue/discard_granularity # TRIM/DISCARD granularity
/sys/block/sda/queue/write_cache # write-back or write-through
# For /dev/nvme0n1 (NVMe SSD):
/sys/block/nvme0n1/queue/scheduler # typically "none"
/sys/block/nvme0n1/queue/nr_requests # per-HWQ depth
/sys/block/nvme0n1/queue/numa_node # NUMA node of NVMe controller
/sys/block/nvme0n1/queue/nomerges # 0=merging on, 2=off
iostat Output Interpretation
iostat -xz 1
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm
r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 0.00 500.00 0.00 2000.00 0.00 100.00 0.00 16.67
0.00 12.00 6.00 0.00 4.00 1.80 90.00
Key fields:
- r/s, w/s: read/write operations per second
- rkB/s, wkB/s: throughput in KB/s
- rrqm/s, wrqm/s: read/write requests merged per second (higher = better sequential pattern)
- r_await, w_await: average read/write request latency in ms, including queue wait; most useful for HDD
- aqu-sz: average queue length (was avgqu-sz). On an HDD, a value above 1 means the device is saturated; large values are expected for NVMe
- %util: percentage of time the device had at least one request. For HDDs, >80% = saturated. For NVMe, this metric is misleading: NVMe can handle many parallel requests, so %util saturates at 100% even if the device has headroom
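Two of the derived columns in the sample output can be cross-checked by hand. The sketch below assumes aqu-sz is consistent with Little's law (request rate times mean latency) and that the merge percentage is merged requests over merged plus issued requests; `aqu_sz` and `pct_merged` are hypothetical helper names, not iostat options.

```shell
# Average queue size: (r/s * r_await + w/s * w_await) / 1000, awaits in ms.
aqu_sz() {
    awk -v rs="$1" -v ra="$2" -v ws="$3" -v wa="$4" \
        'BEGIN { printf "%.2f\n", (rs * ra + ws * wa) / 1000 }'
}

# Merge percentage: merged / (merged + issued) * 100.
pct_merged() {
    awk -v merged="$1" -v issued="$2" \
        'BEGIN { t = merged + issued; printf "%.2f\n", (t ? 100 * merged / t : 0) }'
}

aqu_sz 0 0 500 12     # prints 6.00, matching aqu-sz in the sample
pct_merged 100 500    # prints 16.67, matching %wrqm in the sample
```

In the sample row, 500 writes per second at 12 ms each keep six requests outstanding on average, which is what the aqu-sz column reports.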
blktrace and blkparse
blktrace captures detailed per-I/O events from the kernel trace framework:
# Capture trace on /dev/sda for 10 seconds
blktrace -d /dev/sda -o /tmp/sda_trace -w 10
# Parse and display events
blkparse -i /tmp/sda_trace.blktrace.0
# Single-command trace and parse
blktrace -d /dev/sda -o - | blkparse -i -
# Output columns:
# CPU Seq# Time(s) PID Action RW Sector+Size Process
# 0 1 0.000001 1234 Q R 1234+8 [fio] <- Queued
# 0 1 0.000002 1234 G R 1234+8 [fio] <- Get (alloc request)
# 0 1 0.000003 1234 I R 1234+8 [fio] <- Inserted to scheduler
# 0 1 0.000005 1234 D R 1234+8 [fio] <- Dispatched to driver
# 0 1 0.002000 0 C R 1234+8 0 <- Completed
# btt: block trace timeline analysis (reads the binary dump written by blkparse -d)
blkparse -i /tmp/sda_trace -d /tmp/sda_trace.bin -O
btt -i /tmp/sda_trace.bin
# Shows: D2C (dispatch to completion = actual device service time)
# Q2C (queue to completion = total latency including scheduler delay)
# Q2D (queue to dispatch = scheduler delay)
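The Q2D, D2C, and Q2C intervals that btt reports can also be derived directly from the Q, D, and C events. The sketch below assumes the simplified column layout shown in the comments above (CPU, Seq#, Time, PID, Action, RW, Sector+Size, Process); real blkparse output has a leading device column and splits sector and size, so the field numbers would need adjusting. `io_phases` is a hypothetical helper name.

```shell
# Per-I/O latency phases in microseconds, keyed by the Sector+Size column.
# Input (stdin): lines in the simplified format
#   CPU Seq Time(s) PID Action RW Sector+Size Process
io_phases() {
    awk '
        $5 == "Q" { q[$7] = $3 }    # queued
        $5 == "D" { d[$7] = $3 }    # dispatched to driver
        $5 == "C" && ($7 in q) && ($7 in d) {
            printf "%s Q2D=%.0fus D2C=%.0fus Q2C=%.0fus\n", $7,
                   (d[$7] - q[$7]) * 1e6, ($3 - d[$7]) * 1e6, ($3 - q[$7]) * 1e6
        }
    '
}

printf '%s\n' \
    '0 1 0.000001 1234 Q R 1234+8 [fio]' \
    '0 1 0.000005 1234 D R 1234+8 [fio]' \
    '0 1 0.002000 0 C R 1234+8 0' | io_phases
# prints: 1234+8 Q2D=4us D2C=1995us Q2C=1999us
```

For the sample trace almost all of the 2 ms is device service time (D2C); the scheduler contributes only a few microseconds.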
bio/request Stack Diagram
Filesystem Block Layer Driver
| | |
| submit_bio(bio) | |
+-------------------------->| |
| blk_mq_submit_bio() |
| [try plug merge] |
| [try elv merge] |
| alloc request |
| copy bio to request |
| elevator_add() |
| |
[I/O Scheduler] |
(mq-deadline: sort by LBA) |
| |
| blk_mq_dispatch_rq_list() |
+---------------------->|
| nvme_queue_rq()
| build NVMe cmd
| write doorbell
|
[NVMe SSD]
|
| MSI-X interrupt
|
| blk_mq_complete_request() |
|<----------------------+
| call bio->bi_end_io() |
+-------------------------->
Filesystem
page cache update
Historical Context
The Linux block layer was originally written by Linus Torvalds in the early 1990s as a simple queue with elevator scheduling for HDDs. The I/O scheduler framework was significantly redesigned in Linux 2.5 by Jens Axboe (the primary block layer maintainer), introducing the pluggable elevator abstraction.
CFQ was the default scheduler for over a decade due to its fairness properties on multi-user systems. The transition away from CFQ in 2019 (Linux 5.0) was a recognition that modern NVMe SSDs made fairness-at-the-scheduler-level unnecessary (hardware queues provide sufficient parallelism).
blk-mq was developed by Jens Axboe and Christoph Hellwig at Fusion-io (later acquired by SanDisk/Western Digital), motivated by NVMe SSDs capable of more IOPS than the single-queue block layer could submit. It first appeared in Linux 3.13 (2014), and the transition from single-queue to multi-queue was completed in Linux 5.0.
Production Examples
Database O_DIRECT pattern: MySQL InnoDB, PostgreSQL with O_DIRECT, and Oracle bypass the page cache with O_DIRECT. This means I/Os go: filesystem → block layer → device driver, skipping the page cache. The block layer scheduler still applies. For databases on NVMe, set scheduler to none and tune nr_requests high (e.g., 1024-2048).
Netflix Hollow compression: Netflix's in-memory dataset serving uses large sequential reads from NVMe. They tune read_ahead_kb to 4096 KB for their workload (large sequential dataset reads), maximizing per-I/O efficiency.
Debugging Notes
# Check if a device is using blk-mq
test -d /sys/block/sda/mq && echo "blk-mq in use" # directory exists if using blk-mq
ls /sys/block/sda/mq/ # lists hardware queue directories (0/, 1/, ...)
# See hardware queue depth
cat /sys/block/nvme0n1/mq/0/nr_reserved_tags
cat /sys/block/nvme0n1/mq/0/nr_tags
# Enable I/O error injection (testing)
echo 1 > /sys/block/sda/make-it-fail # requires CONFIG_FAIL_MAKE_REQUEST; configure rates under /sys/kernel/debug/fail_make_request/
# Trace bio events with ftrace
echo 1 > /sys/kernel/debug/tracing/events/block/block_bio_queue/enable
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
cat /sys/kernel/debug/tracing/trace_pipe | grep ' 8,0 ' # block events print major,minor; 8,0 is typically sda
# Check block device stats
cat /sys/block/sda/stat
# Fields: reads completed, reads merged, sectors read, time reading,
# writes completed, writes merged, sectors written, time writing,
# I/Os in progress, time doing I/Os, weighted time doing I/Os
# Long latency I/O detection
echo 1 > /sys/block/sda/queue/latency_hist # if supported
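The cumulative counters in `/sys/block/<dev>/stat` can be turned into mean per-request latencies by dividing the time fields by the completion counts. A minimal sketch, assuming the field order listed above (field 1 reads completed, field 4 ms spent reading, field 5 writes completed, field 8 ms spent writing); `stat_latency` is a hypothetical helper name.

```shell
# Mean read and write latency (ms per request) from a stat line on stdin.
stat_latency() {
    awk '{
        r = ($1 ? $4 / $1 : 0)    # ms reading / reads completed
        w = ($5 ? $8 / $5 : 0)    # ms writing / writes completed
        printf "read_ms=%.2f write_ms=%.2f\n", r, w
    }'
}

echo "1000 10 80000 2500 500 100 4000 6000 0 3000 8500" | stat_latency
# prints: read_ms=2.50 write_ms=12.00
# On a live system: stat_latency < /sys/block/sda/stat
```

These are averages since boot; for a per-interval view, sample the file twice and apply the same division to the differences.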
Security Implications
I/O Isolation: Without cgroup blkio (block I/O controller), one process can monopolize a block device. blkio cgroup allows limiting I/O bandwidth and IOPS per cgroup. In Docker/Kubernetes, use --blkio-weight or resource limits.
# Set blkio weight for a cgroup
echo 500 > /sys/fs/cgroup/blkio/mygroup/blkio.weight
# Throttle write IOPS: major:minor iops_value
echo "8:0 1000" > /sys/fs/cgroup/blkio/mygroup/blkio.throttle.write_iops_device
DMA Security: Block I/O uses DMA — the device writes directly to host DRAM via PCIe. An IOMMU (Intel VT-d, AMD-Vi) is essential to prevent a compromised device driver or DMA attack from writing to arbitrary physical memory. Ensure intel_iommu=on in kernel parameters for production systems.
Performance Implications
Plug batching: The kernel plug mechanism (blk_start_plug() / blk_finish_plug()) holds I/Os in a per-task list for batching. Filesystems and DM layers use plugging. Userspace I/O via io_uring also benefits from batching (submit ring) before flushing to the block layer.
NUMA-aware queuing: blk-mq assigns hardware queues to NUMA nodes. I/Os submitted from CPU on node 0 should go to a hardware queue also on node 0 to avoid remote DRAM access for the request structure. Verify with /sys/block/nvme0n1/mq/0/cpu_list.
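Write barriers and FUA: REQ_PREFLUSH flushes the device's volatile write cache before the write is issued. REQ_FUA (Force Unit Access) requires the write itself to reach persistent media before completion is reported, bypassing the write cache. Both ensure durability but add latency. Databases rely on them for WAL writes: fsync() and O_DSYNC writes are translated by the filesystem into flush and FUA requests.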
Failure Modes and Real Incidents
SCSI error handling and EH threads: When a SCSI/SATA command times out (default 30s for SCSI), the error handler (EH) thread attempts recovery (abort → device reset → bus reset → host reset). During EH, the block layer pauses the request queue. If EH takes too long, the system appears hung. Tune the per-device command timeout via /sys/block/<dev>/device/timeout, commonly set through udev rules.
blk-mq tag exhaustion: With many concurrent I/Os, the hardware queue's pre-allocated tag pool can exhaust. New I/Os are blocked until tags free up. Monitor with: cat /sys/block/nvme0n1/mq/0/nr_tags and ensure nr_requests is tuned appropriately.
Linux 4.x deadline scheduler regression: In Linux 4.14-4.17, the transition from legacy deadline to mq-deadline exposed a bug where high queue depths on NVMe + mq-deadline caused unnecessary I/O serialization, reducing throughput by 30-40%. Fixed in 4.18. Affected early Kubernetes deployments using NVMe persistent volumes.
Modern Usage
- io_uring (Linux 5.1+): Asynchronous I/O interface built on shared submission and completion rings. With IORING_SETUP_SQPOLL, a kernel thread polls the submission ring, so the application can submit I/Os to the block layer without making a system call; polled configurations have been reported to reach ~10M IOPS from a single CPU core.
- BPF-based I/O schedulers: Experimental sched_ext equivalent for the block layer, with custom BPF programs implementing I/O scheduling policy.
- dm-multipath: Device mapper target that load-balances across multiple block device paths to the same storage. Relevant for NVMe-oF with multiple fabric paths for redundancy.
Future Directions
- io_uring and SPDK convergence: Intel SPDK (Storage Performance Development Kit) runs entirely in userspace, bypassing the kernel block layer entirely. io_uring approaches SPDK performance while remaining kernel-managed.
- BPF block layer observability: bpftrace programs can hook the block:block_rq_insert and block:block_rq_complete tracepoints for production-safe I/O analysis without blktrace overhead.
- Per-namespace I/O accounting: NVMe namespace-level I/O statistics in /sys/block/nvme0n*/stat enable fine-grained cgroup-like accounting without cgroup overhead.
Exercises
- Use blktrace + blkparse + btt to analyze the I/O pattern from fio running 4K random reads on an HDD. Measure the D2C (device service time) and Q2D (scheduler delay) separately. What fraction of total latency is scheduler overhead?
- Write a bpftrace one-liner that prints the I/O latency histogram for all block completions on /dev/sda for 10 seconds. (Hint: block:block_rq_insert and block:block_rq_complete tracepoints, use hist aggregation.)
- Change the I/O scheduler on an NVMe device from none to mq-deadline. Run a mixed read/write workload with fio. Does mq-deadline hurt or help throughput and latency? Why?
- Examine struct bio in include/linux/blk_types.h. Count how many flags exist for bi_opf. What do REQ_PREFLUSH and REQ_FUA do, and when would a filesystem use them together?
- Run iostat -xz 1 during a database checkpoint (e.g., PostgreSQL CHECKPOINT). Observe how aqu-sz and w_await change. What does this reveal about checkpoint I/O behavior?
References
- Axboe, J. "Linux Block I/O: Present and Future." Ottawa Linux Symposium 2004.
- Linux kernel block layer documentation: Documentation/block/
- Linux kernel source: block/blk-mq.c, block/elevator.c, block/mq-deadline.c
- blktrace man pages and source: https://git.kernel.dk/blktrace
- io_uring documentation: Documentation/block/io-uring.rst + https://kernel.dk/io_uring.pdf
- Gregg, B. Systems Performance, 2nd ed., Chapter 9 (Disks)
- Bjørling, M. et al. "Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems." SYSTOR 2013.