05 - Linux Block Layer

Technical Overview

The Linux block layer is the kernel subsystem that provides a uniform interface between filesystems (and other I/O consumers) and block storage devices. It abstracts the differences between HDDs, SSDs, RAID arrays, loop devices, and network storage behind a common API. The block layer handles request merging, I/O scheduling, request queue management, and the handoff to device drivers.

The block layer underwent a fundamental redesign with the introduction of blk-mq (the multi-queue block layer) in Linux 3.13, released in January 2014. The old single-queue block layer, designed around HDD latencies and few CPU cores, became a bottleneck once PCIe flash and NVMe SSDs offered a million or more IOPS per device. blk-mq replaced it with a two-level queue model (per-CPU software queues feeding hardware dispatch queues) built for NUMA-aware, multi-core submission.

Prerequisites

Linux kernel architecture basics
VFS and filesystem concepts
PCIe and NVMe basics (see 04-nvme.md)
Understanding of NUMA topology

Core Content

The Full I/O Path

User Space
  |
  | read()/write()/pread()/pwrite() syscalls
  v
+---------------------------+
|  VFS (Virtual Filesystem) |  path resolution, file descriptor lookup
|  vfs_read() / vfs_write() |
+---------------------------+
  |
  | file->f_op->read_iter()
  v
+---------------------------+
|  Page Cache               |  address_space, struct page
|  (check/update cache)     |  generic_file_read_iter()
+---------------------------+
  |  cache miss: filesystem read/readahead callbacks
  v
+---------------------------+
|  Filesystem               |  ext4_readpages(), btrfs_submit_bio()
|  (block mapping)          |  Maps logical file offset -> LBA
+---------------------------+
  |  submit_bio()
  v
+---------------------------+
|  Block Layer              |
|  +-----------------------+|
|  | bio submission        ||  struct bio (the I/O descriptor)
|  | blk_mq_submit_bio()   ||
|  +-----------------------+|
|  | Software queues       ||  per-CPU staging queues (blk-mq)
|  | (blk_mq_ctx)          ||
|  +-----------------------+|
|  | I/O Scheduler         ||  mq-deadline, kyber, none
|  | (reorder/prioritize)  ||
|  +-----------------------+|
|  | Hardware dispatch     ||  per-hardware-queue (blk_mq_hw_ctx)
|  | queues                ||  dispatched to driver
|  +-----------------------+|
+---------------------------+
  |  blk_mq_ops->queue_rq()
  v
+---------------------------+
|  Device Driver            |  nvme_queue_rq(), sd_init_command()
|  (translate to hw cmds)   |
+---------------------------+
  |  PCIe / SATA / FC / ...
  v
+---------------------------+
|  Storage Device           |  NVMe SSD, HDD, RAID controller
+---------------------------+

Key Data Structures

struct bio — the fundamental I/O unit in the block layer:

struct bio {
    struct bio          *bi_next;     // linked list for bio chaining
    struct block_device *bi_bdev;     // target block device
    blk_opf_t           bi_opf;      // operation flags (READ/WRITE/FLUSH/FUA)
    unsigned short      bi_vcnt;     // number of bvecs (scatter-gather entries)
    struct bvec_iter    bi_iter;     // current position within bio
    bio_end_io_t        *bi_end_io;  // completion callback
    struct bio_vec      bi_inline_vecs[]; // embedded bvecs for small I/Os
};

struct bio_vec {
    struct page   *bv_page;    // physical page containing data
    unsigned int   bv_len;    // byte count in this segment
    unsigned int   bv_offset; // byte offset within page
};

A bio represents a single contiguous-on-disk I/O request, potentially from non-contiguous physical memory pages (scatter-gather). A read of a 1 MB file might produce one bio with 256 bvecs (256 × 4KB pages).
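The 256-bvec figure is simply the I/O size divided by the page size. A minimal Python sketch of that arithmetic, assuming 4 KiB pages and a page-aligned buffer; BioVec, Bio, and build_bio() are illustrative stand-ins invented for this example, not kernel APIs:

```python
from dataclasses import dataclass, field
from typing import List

PAGE_SIZE = 4096    # assumption: 4 KiB pages
SECTOR_SIZE = 512   # block layer sector unit


@dataclass
class BioVec:
    page_index: int  # stand-in for struct page *
    length: int      # bytes used in this page
    offset: int      # byte offset within the page


@dataclass
class Bio:
    start_sector: int
    vecs: List[BioVec] = field(default_factory=list)

    @property
    def size_bytes(self) -> int:
        return sum(v.length for v in self.vecs)


def build_bio(start_sector: int, nbytes: int) -> Bio:
    """Split a page-aligned buffer of nbytes into one bvec per page."""
    bio = Bio(start_sector)
    page = 0
    remaining = nbytes
    while remaining > 0:
        chunk = min(PAGE_SIZE, remaining)
        bio.vecs.append(BioVec(page_index=page, length=chunk, offset=0))
        remaining -= chunk
        page += 1
    return bio


bio = build_bio(start_sector=2048, nbytes=1024 * 1024)
print(len(bio.vecs), bio.size_bytes // SECTOR_SIZE)
```

Kernels with multi-page bvecs can describe physically contiguous pages in a single entry, so the real count for a 1 MB read can be lower than this one-page-per-bvec model suggests.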

struct request — wraps one or more bios after merging:

// In blk-mq, a request is allocated from a per-hardware-queue pool
struct request {
    struct request_queue *q;
    struct blk_mq_ctx   *mq_ctx;    // software queue context
    struct blk_mq_hw_ctx *mq_hctx;  // hardware queue context
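    blk_opf_t            cmd_flags;  // operation (REQ_OP_*) plus REQ_* flags
    req_flags_t          rq_flags;   // request-internal state flags (RQF_*)
    int                  tag;        // driver tag from the hardware queue's tag set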
    sector_t             __sector;  // start sector (LBA)
    unsigned int         __data_len; // total byte count
    struct bio           *bio;       // first bio in chain
    struct bio           *biotail;   // last bio
    // ... driver-private area follows
};

struct request_queue — the queue associated with a block device:

struct request_queue {
    struct blk_mq_tag_set  *tag_set;    // hardware queue configuration
    struct elevator_queue  *elevator;   // I/O scheduler
    struct queue_limits     limits;     // device capabilities (max_sectors, etc.)
    // ... many tuning parameters
};

blk_mq_hw_ctx, the hardware dispatch queue:
- Corresponds to one NVMe submission queue / SCSI tagged-command queue
- Has a tag set (pre-allocated struct request pool)
- Managed by the driver (nvme_queue maps 1:1 to blk_mq_hw_ctx)

blk-mq Two-Level Queue Architecture

blk-mq Queue Architecture:

CPU 0          CPU 1          CPU 2          CPU 3
  |              |              |              |
  v              v              v              v
[SWQ 0]       [SWQ 1]        [SWQ 2]        [SWQ 3]
(Software      (Software     (Software      (Software
 Queue 0)       Queue 1)      Queue 2)       Queue 3)
  |              |              |              |
  +------+--------+------+------+
         |                |
         v                v
    [HWQ 0]           [HWQ 1]
(Hardware Queue 0) (Hardware Queue 1)
  (maps to NVMe     (maps to NVMe
   SQ #1)            SQ #2)
         |                |
         +-------+--------+
                 |
                 v
           NVMe Controller

Software queues (blk_mq_ctx): one per CPU for each request queue. They stage I/O before the scheduler and hardware dispatch. Each CPU submits through its own context, so the context lock is rarely contended, and I/Os can be batched for efficient dispatch.
Hardware queues (blk_mq_hw_ctx): one per NVMe submission queue (or SCSI host bus adapter queue). Several SWQs may feed one HWQ. The I/O scheduler operates at this level.

For NVMe with 32 hardware queues on a 32-core system: 1:1 mapping — each CPU has its own dedicated NVMe submission queue with zero contention.
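The software-to-hardware queue mapping can be pictured with a small Python sketch. It assumes CPUs are assigned to hardware queues in contiguous groups; the kernel and driver build the real map from interrupt affinity and CPU topology, so map_cpus_to_hw_queues() is a hypothetical simplification, not the kernel's algorithm:

```python
def map_cpus_to_hw_queues(nr_cpus: int, nr_hw_queues: int) -> dict:
    """Assign each CPU's software queue to a hardware queue in contiguous groups."""
    if nr_cpus <= 0 or nr_hw_queues <= 0:
        raise ValueError("nr_cpus and nr_hw_queues must be positive")
    # In this sketch, hardware queues beyond the CPU count would stay unused.
    nr_hw_queues = min(nr_hw_queues, nr_cpus)
    return {cpu: cpu * nr_hw_queues // nr_cpus for cpu in range(nr_cpus)}


# Four CPUs sharing two hardware queues, as in the diagram above.
print(map_cpus_to_hw_queues(4, 2))
# 32 CPUs and 32 hardware queues: every CPU gets its own queue.
print(len(set(map_cpus_to_hw_queues(32, 32).values())))
```

On a live system the actual assignment is visible in the cpu_list file under each /sys/block/<dev>/mq/<n>/ directory.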

Request Merging

Before dispatch, the block layer attempts to merge adjacent I/O requests:

- Front merge: the new bio covers the sectors immediately before an existing request, so it is prepended to that request.
- Back merge: the new bio covers the sectors immediately after an existing request, so it is appended to that request.
- Plug/unplug: blk_start_plug() holds I/Os in a per-task list so they can be batched before being submitted to the scheduler.

// Kernel: request merging in blk-mq
// drivers/../blk-merge.c: blk_attempt_plug_merge()
// If current task has an active plug, try to merge with pending requests
// before submitting — reduces queue contention and improves merging

Merging is critical for HDD performance (it turns 100 small 4K writes into one 400 KB sequential write if their addresses are adjacent) and less important for NVMe, where parallelism matters more than merging.
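Front and back merging reduce to comparing sector ranges. The sketch below is a simplified Python model, assuming every I/O has the same direction and ignoring the size and segment limits the kernel also checks; Request and submit() are hypothetical names for this example:

```python
from dataclasses import dataclass
from typing import List

SECTOR_SIZE = 512


@dataclass
class Request:
    start: int     # first sector
    nsectors: int  # length in sectors

    @property
    def end(self) -> int:
        return self.start + self.nsectors  # one past the last sector


def submit(queue: List[Request], start: int, nsectors: int) -> str:
    """Try a back merge, then a front merge; otherwise queue a new request."""
    for rq in queue:
        if rq.end == start:               # bio begins where rq ends
            rq.nsectors += nsectors
            return "back"
        if start + nsectors == rq.start:  # bio ends where rq begins
            rq.start = start
            rq.nsectors += nsectors
            return "front"
    queue.append(Request(start, nsectors))
    return "new"


queue: List[Request] = []
for i in range(100):                      # 100 adjacent 4 KiB writes
    submit(queue, start=i * 8, nsectors=8)
print(len(queue), queue[0].nsectors * SECTOR_SIZE)
```

In this model the 100 adjacent 4 KiB writes collapse into a single request of 800 sectors (400 KiB), which is the effect that matters for rotating media.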

I/O Schedulers

The I/O scheduler sits between the software queues and hardware queues:

# View/set scheduler for a device
cat /sys/block/nvme0n1/queue/scheduler
# Output: [none] mq-deadline kyber

cat /sys/block/sda/queue/scheduler
# Output: [mq-deadline] none kyber

echo mq-deadline > /sys/block/sda/queue/scheduler
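The scheduler file lists every available scheduler and marks the active one with square brackets. A short Python sketch for reading that format, using the sample outputs above as input; parse_scheduler_line() is a hypothetical helper name:

```python
def parse_scheduler_line(line: str):
    """Return (active, available) from the contents of queue/scheduler."""
    active = None
    available = []
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]   # strip the brackets around the active entry
            active = name
        available.append(name)
    return active, available


print(parse_scheduler_line("[none] mq-deadline kyber"))
print(parse_scheduler_line("[mq-deadline] none kyber"))
```

On a real system the input would come from reading /sys/block/<dev>/queue/scheduler.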

none (passthrough):
- No reordering; I/Os are dispatched in submission order
- Negligible scheduler overhead
- Best choice for NVMe SSDs (device-side parallelism makes software ordering irrelevant)
- Also a good fit for RAM disks and for loop devices whose backing storage is an SSD

mq-deadline:
- Maintains a sorted red-black tree of pending requests (sorted by LBA)
- Two queues: read and write, each with a FIFO expiry queue
- Deadline parameters: read_expire (500 ms default), write_expire (5 s default)
- Dispatch: services requests in LBA order (like an elevator) until a request deadline expires
- Best choice for rotating media (HDDs): minimizes seek distance while preventing starvation

# mq-deadline tuning
cat /sys/block/sda/queue/iosched/read_expire    # ms before read is "urgent"
cat /sys/block/sda/queue/iosched/write_expire   # ms before write is "urgent"
cat /sys/block/sda/queue/iosched/fifo_batch     # requests to service from sorted queue
cat /sys/block/sda/queue/iosched/front_merges   # enable/disable front merges
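The interplay between LBA-sorted dispatch and FIFO expiry can be modeled in a few lines. This Python sketch covers one direction only (say, reads) and leaves out fifo_batch and read/write arbitration; DeadlineQueue is a hypothetical class for illustration, not a port of mq-deadline:

```python
import bisect
from collections import deque


class DeadlineQueue:
    """One direction (e.g. reads) of a deadline-style scheduler."""

    def __init__(self, expire_ms: int):
        self.expire_ms = expire_ms
        self.sorted_sectors = []   # pending sectors in LBA order
        self.fifo = deque()        # (deadline_ms, sector) in arrival order
        self.last_sector = 0       # position of the last dispatched request

    def add(self, sector: int, now_ms: int) -> None:
        bisect.insort(self.sorted_sectors, sector)
        self.fifo.append((now_ms + self.expire_ms, sector))

    def dispatch(self, now_ms: int):
        if not self.fifo:
            return None
        deadline, oldest = self.fifo[0]
        if deadline <= now_ms:
            sector = oldest        # expired: serve the oldest request first
        else:
            # Otherwise continue the sweep upward from the last position,
            # wrapping to the lowest sector when nothing lies ahead.
            i = bisect.bisect_left(self.sorted_sectors, self.last_sector)
            sector = self.sorted_sectors[i % len(self.sorted_sectors)]
        self.sorted_sectors.remove(sector)
        self.fifo.remove(next(e for e in self.fifo if e[1] == sector))
        self.last_sector = sector
        return sector


q = DeadlineQueue(expire_ms=500)
for t, sector in enumerate([900, 100, 500]):
    q.add(sector, now_ms=t)
print([q.dispatch(now_ms=10) for _ in range(3)])
```

With no deadline expired, requests leave in ascending LBA order regardless of arrival order; once the oldest request passes its expiry, it jumps the queue.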

kyber:
- Latency-based scheduler for fast block devices
- Maintains separate queues for reads and synchronous writes
- Throttles queue depth to hit configurable latency targets
- Less useful than none for most NVMe workloads; useful when multiple concurrent workloads compete for a fast SSD

# kyber latency targets (nanoseconds)
cat /sys/block/nvme0n1/queue/iosched/read_lat_nsec    # default: 2ms
cat /sys/block/nvme0n1/queue/iosched/write_lat_nsec   # default: 10ms

Legacy schedulers (pre-blk-mq, removed in modern kernels):
- CFQ (Completely Fair Queuing): per-process I/O fairness with bandwidth slices. Removed in 5.0.
- Deadline: single-queue version of mq-deadline. Removed in 5.0.
- NOOP: single-queue passthrough. Replaced by none.

Key /sys/block/ Tuning Parameters

# For /dev/sda (HDD):
/sys/block/sda/queue/scheduler          # I/O scheduler
/sys/block/sda/queue/nr_requests        # queue depth (default 128)
/sys/block/sda/queue/read_ahead_kb      # readahead size (default 128)
/sys/block/sda/queue/rotational         # 1=HDD, 0=SSD
/sys/block/sda/queue/hw_sector_size     # hardware sector size in bytes (mirrors logical_block_size; see physical_block_size for the physical size)
/sys/block/sda/queue/logical_block_size # logical block size
/sys/block/sda/queue/max_sectors_kb     # max I/O size
/sys/block/sda/queue/discard_granularity # TRIM/DISCARD granularity
/sys/block/sda/queue/write_cache        # write-back or write-through

# For /dev/nvme0n1 (NVMe SSD):
/sys/block/nvme0n1/queue/scheduler      # typically "none"
/sys/block/nvme0n1/queue/nr_requests    # per-HWQ depth
/sys/class/nvme/nvme0/device/numa_node  # NUMA node of the NVMe controller's PCI device
/sys/block/nvme0n1/queue/nomerges       # 0=all merging on, 1=simple merges only, 2=off

iostat Output Interpretation

iostat -xz 1

Device    r/s    w/s  rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm
         r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util

sda       0.00  500.00   0.00 2000.00    0.00  100.00  0.00  16.67
          0.00   12.00  6.00     0.00     4.00  1.80  90.00

Key fields:
- r/s, w/s: read/write operations per second
- rkB/s, wkB/s: throughput in KB/s
- rrqm/s, wrqm/s: read/write requests merged per second (higher = better sequential pattern)
- r_await, w_await: average read/write request latency in ms, including queue wait; most useful for HDD
- aqu-sz: average queue length (was avgqu-sz). >1 means the device is saturated for HDD; expected to be large for NVMe
- %util: percentage of time the device had at least one request in flight. For HDDs, >80% = saturated. For NVMe this metric is misleading: the device services many requests in parallel, so %util can read 100% while the device still has headroom
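Several iostat columns are derived from the others, which makes a sample line easy to sanity-check. The Python sketch below recomputes the write-side columns of the sda sample above, assuming a non-zero write rate; derived_iostat() is a hypothetical helper, and the formulas are the usual relationships between the columns rather than a copy of the sysstat source:

```python
def derived_iostat(w_per_s, wkb_per_s, wrqm_per_s, w_await_ms, svctm_ms):
    """Recompute iostat's derived write columns from the primary ones."""
    return {
        # average request size in KB
        "wareq_sz_kb": wkb_per_s / w_per_s,
        # share of write requests that were merged before being issued
        "pct_wrqm": 100.0 * wrqm_per_s / (wrqm_per_s + w_per_s),
        # Little's law: queue length = arrival rate x time in system
        "aqu_sz": w_per_s * w_await_ms / 1000.0,
        # fraction of each second the device was busy
        "pct_util": min(100.0, w_per_s * svctm_ms / 10.0),
    }


metrics = derived_iostat(w_per_s=500.0, wkb_per_s=2000.0, wrqm_per_s=100.0,
                         w_await_ms=12.0, svctm_ms=1.8)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The results match the sample row: 4.00 KB average request size, 16.67% merged, a queue length of 6.00, and 90% utilization.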

blktrace and blkparse

blktrace captures detailed per-I/O events from the kernel trace framework:

# Capture trace on /dev/sda for 10 seconds
blktrace -d /dev/sda -o /tmp/sda_trace -w 10

# Parse and display events
blkparse -i /tmp/sda_trace.blktrace.0

# Single-command trace and parse
blktrace -d /dev/sda -o - | blkparse -i -

# Output columns:
# CPU  Seq#  Time(s)  PID  Action  RW  Sector+Size  Process
#  0   1   0.000001  1234  Q  R  1234+8   [fio]    <- Queued
#  0   1   0.000002  1234  G  R  1234+8   [fio]    <- Get (alloc request)
#  0   1   0.000003  1234  I  R  1234+8   [fio]    <- Inserted to scheduler
#  0   1   0.000005  1234  D  R  1234+8   [fio]    <- Dispatched to driver
#  0   1   0.002000     0  C  R  1234+8      0     <- Completed

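# btt: block trace timeline analysis
# btt reads the merged binary dump that blkparse writes with -d, not the raw per-CPU files
blkparse -i /tmp/sda_trace -d /tmp/sda_trace.bin > /dev/null
btt -i /tmp/sda_trace.bin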
# Shows: D2C (dispatch to completion = actual device service time)
#        Q2C (queue to completion = total latency including scheduler delay)
#        Q2D (queue to dispatch = scheduler delay)
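The Q2D, D2C, and Q2C intervals are differences between event timestamps. A minimal Python sketch using the sample events above, assuming events have already been parsed into (time, action, sector) tuples and that at most one I/O per start sector is in flight; latency_breakdown() is a hypothetical helper, not part of the blktrace tools:

```python
def latency_breakdown(events):
    """events: iterable of (time_s, action, sector) tuples from a trace."""
    queued, dispatched, result = {}, {}, {}
    for time_s, action, sector in events:
        if action == "Q":
            queued[sector] = time_s
        elif action == "D":
            dispatched[sector] = time_s
        elif action == "C" and sector in queued and sector in dispatched:
            q, d = queued.pop(sector), dispatched.pop(sector)
            result[sector] = {
                "Q2D": d - q,        # scheduler and queueing delay
                "D2C": time_s - d,   # device service time
                "Q2C": time_s - q,   # total latency
            }
    return result


sample = [
    (0.000001, "Q", 1234),
    (0.000002, "G", 1234),
    (0.000003, "I", 1234),
    (0.000005, "D", 1234),
    (0.002000, "C", 1234),
]
for sector, times in latency_breakdown(sample).items():
    print(sector, {k: round(v * 1e6, 1) for k, v in times.items()})  # microseconds
```

For this sample nearly all of the roughly 2 ms total is D2C, meaning the time is spent in the device rather than in the scheduler.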

bio/request Stack Diagram

Filesystem                  Block Layer                Driver
     |                           |                       |
     | submit_bio(bio)            |                       |
     +-------------------------->|                       |
                                 | blk_mq_submit_bio()   |
                                 | [try plug merge]      |
                                 | [try elv merge]       |
                                 | alloc request         |
                                 | copy bio to request   |
                                 | elevator_add()        |
                                 |                       |
                           [I/O Scheduler]               |
                           (mq-deadline: sort by LBA)    |
                                 |                       |
                                 | blk_mq_dispatch_rq()  |
                                 +---------------------->|
                                                         | nvme_queue_rq()
                                                         | build NVMe cmd
                                                         | write doorbell
                                                         |
                                                    [NVMe SSD]
                                                         |
                                                         | MSI-X interrupt
                                                         |
                                 | blk_mq_complete_rq()  |
                                 |<----------------------+
                                 | call bio->bi_end_io() |
                                 +-------------------------->
                                                    Filesystem
                                                    page cache update

Historical Context

The Linux block layer was originally written by Linus Torvalds in the early 1990s as a simple queue with elevator scheduling for HDDs. The I/O scheduler framework was significantly redesigned in Linux 2.5 by Jens Axboe (the primary block layer maintainer), introducing the pluggable elevator abstraction.

CFQ was the default scheduler for over a decade due to its fairness properties on multi-user systems. The transition away from CFQ in 2019 (Linux 5.0) was a recognition that modern NVMe SSDs made fairness-at-the-scheduler-level unnecessary (hardware queues provide sufficient parallelism).

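blk-mq was developed primarily by Jens Axboe, then at Fusion-io (later acquired by SanDisk, now part of Western Digital), together with Matias Bjørling and others (see the SYSTOR 2013 paper in References). The motivation was flash devices that could complete more IOPS than the single-queue block layer could submit. It first appeared in Linux 3.13 (2014). Christoph Hellwig's scsi-mq work later brought SCSI drivers onto it, and the transition from single-queue to multi-queue was completed in Linux 5.0.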

Production Examples

Database O_DIRECT pattern: MySQL InnoDB (with innodb_flush_method=O_DIRECT), Oracle, and PostgreSQL when configured for direct I/O open their files with O_DIRECT to bypass the page cache. I/O then goes filesystem → block layer → device driver, skipping the page cache. The block layer scheduler still applies. For databases on NVMe, set the scheduler to none and raise nr_requests where the device allows it (e.g., 1024-2048); with none, nr_requests cannot exceed the hardware queue's tag depth.

Netflix Hollow compression: Netflix's in-memory dataset serving uses large sequential reads from NVMe. They tune read_ahead_kb to 4096 KB for their workload (large sequential dataset reads), maximizing per-I/O efficiency.

Debugging Notes

# Check if a device is using blk-mq
test -d /sys/block/sda/mq && echo "blk-mq"  # directory exists if using blk-mq
ls /sys/block/sda/mq/   # lists hardware queue directories (0/, 1/, ...)

# See hardware queue depth
cat /sys/block/nvme0n1/mq/0/nr_reserved_tags
cat /sys/block/nvme0n1/mq/0/nr_tags

# Enable I/O error injection (testing)
echo 1 > /sys/block/sda/make-it-fail  # requires CONFIG_FAIL_MAKE_REQUEST; tune via /sys/kernel/debug/fail_make_request/

# Trace bio events with ftrace
echo 1 > /sys/kernel/debug/tracing/events/block/block_bio_queue/enable
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
cat /sys/kernel/debug/tracing/trace_pipe | grep sda

# Check block device stats
cat /sys/block/sda/stat
# Fields: reads completed, reads merged, sectors read, time reading,
#         writes completed, writes merged, sectors written, time writing,
#         I/Os in progress, time doing I/Os, weighted time doing I/Os

# Long latency I/O detection
echo 1 > /sys/block/sda/queue/latency_hist  # if supported
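The stat fields listed above arrive as a single whitespace-separated line. A small Python sketch that names the first eleven counters, assuming the field order given above; parse_block_stat() and the field names are hypothetical labels chosen for this example:

```python
STAT_FIELDS = [
    "reads_completed", "reads_merged", "sectors_read", "ms_reading",
    "writes_completed", "writes_merged", "sectors_written", "ms_writing",
    "ios_in_progress", "ms_doing_io", "weighted_ms_doing_io",
]


def parse_block_stat(text: str) -> dict:
    """Map the first 11 counters of a /sys/block/<dev>/stat line to names."""
    values = [int(v) for v in text.split()]
    if len(values) < len(STAT_FIELDS):
        raise ValueError("expected at least 11 fields")
    # Newer kernels append discard and flush counters; they are ignored here.
    return dict(zip(STAT_FIELDS, values))


# Made-up sample line for illustration only.
sample = "  1200  30  96000  4500  800  120  64000  9000  2  7000  13500"
stats = parse_block_stat(sample)
print(stats["writes_merged"], stats["ios_in_progress"])
```

Sampling the file twice and subtracting the counters gives per-interval rates, which is how tools such as iostat derive their columns.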

Security Implications

I/O Isolation: Without cgroup blkio (block I/O controller), one process can monopolize a block device. blkio cgroup allows limiting I/O bandwidth and IOPS per cgroup. In Docker/Kubernetes, use --blkio-weight or resource limits.

# Set blkio weight for a cgroup
echo 500 > /sys/fs/cgroup/blkio/mygroup/blkio.weight
# Throttle write IOPS: major:minor iops_value
echo "8:0 1000" > /sys/fs/cgroup/blkio/mygroup/blkio.throttle.write_iops_device
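The major:minor pair in the throttle rule identifies the block device. A short Python sketch for building that string, assuming the device number is taken from os.stat(path).st_rdev on the device node; throttle_rule() is a hypothetical helper name:

```python
import os


def throttle_rule(st_rdev: int, iops: int) -> str:
    """Format a 'major:minor iops' line for a blkio throttle file."""
    if iops < 0:
        raise ValueError("iops must be non-negative")
    return f"{os.major(st_rdev)}:{os.minor(st_rdev)} {iops}"


# os.makedev(8, 0) stands in for os.stat("/dev/sda").st_rdev in this example.
print(throttle_rule(os.makedev(8, 0), 1000))
```

Deriving the numbers from the device node avoids hard-coding 8:0, which only corresponds to /dev/sda on typical SCSI/SATA setups.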

DMA Security: Block I/O uses DMA — the device writes directly to host DRAM via PCIe. An IOMMU (Intel VT-d, AMD-Vi) is essential to prevent a compromised device driver or DMA attack from writing to arbitrary physical memory. Ensure intel_iommu=on in kernel parameters for production systems.

Performance Implications

Plug batching: The kernel plug mechanism (blk_start_plug() / blk_finish_plug()) holds I/Os in a per-task list for batching. Filesystems and DM layers use plugging. Userspace I/O via io_uring also benefits from batching (submit ring) before flushing to the block layer.

NUMA-aware queuing: blk-mq assigns hardware queues to NUMA nodes. I/Os submitted from CPU on node 0 should go to a hardware queue also on node 0 to avoid remote DRAM access for the request structure. Verify with /sys/block/nvme0n1/mq/0/cpu_list.
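Checking queue locality means comparing a hardware queue's cpu_list with the CPUs of the device's NUMA node. A minimal Python sketch, assuming cpu_list holds comma-separated CPU numbers (ranges such as 0-3 are also accepted here for convenience); both function names are hypothetical:

```python
def parse_cpu_list(text: str) -> list:
    """Parse '0, 4, 8' or '0-3,8' style CPU lists into sorted CPU numbers."""
    cpus = []
    for part in text.replace(",", " ").split():
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(set(cpus))


def remote_cpus(queue_cpus: list, node_cpus: list) -> list:
    """CPUs mapped to a hardware queue that are not on the device's NUMA node."""
    local = set(node_cpus)
    return [cpu for cpu in queue_cpus if cpu not in local]


queue_cpus = parse_cpu_list("0, 1, 16, 17")   # e.g. contents of mq/0/cpu_list
node_cpus = parse_cpu_list("0-15")            # e.g. CPUs of the device's node
print(remote_cpus(queue_cpus, node_cpus))
```

A non-empty result indicates CPUs that submit through this queue from a different node than the device.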

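Write barriers and FUA: REQ_PREFLUSH flushes the device's volatile write cache before the write is issued. REQ_FUA (Force Unit Access) makes the write itself reach persistent media before completion is signalled, regardless of the write cache. Both ensure durability but add latency. Databases depend on them for WAL writes: fsync() and O_DSYNC writes reach the device as cache flushes and, where the filesystem and device support it, FUA writes.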
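From userspace, these durability requests are made through O_DSYNC or fsync(). The Python sketch below shows both variants for a WAL-style append, assuming a Linux system; durable_append() is a hypothetical helper, and whether the kernel issues a FUA write or a separate cache flush depends on the filesystem and device:

```python
import os


def durable_append(path: str, record: bytes, use_dsync: bool = True) -> int:
    """Append record to path durably and return the number of bytes written."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND
    if use_dsync:
        flags |= os.O_DSYNC   # each write returns only once the data is durable
    fd = os.open(path, flags, 0o644)
    try:
        written = os.write(fd, record)
        if not use_dsync:
            os.fsync(fd)      # explicit flush of data and metadata
        return written
    finally:
        os.close(fd)
```

A caller might use durable_append("wal.log", b"commit 42\n") for each commit record. The O_DSYNC variant pays the durability cost on every write, while the fsync() variant lets an application batch several writes before one flush.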

Failure Modes and Real Incidents

SCSI error handling and EH threads: When a SCSI/SATA command times out (default 30s for SCSI), the error handler (EH) thread attempts recovery, escalating from abort to device reset, bus reset, and host reset. During EH, the block layer pauses the request queue. If EH takes too long, the system appears hung. Tune the command timeout per device via /sys/block/sdX/device/timeout, commonly set from udev rules.

blk-mq tag exhaustion: With many concurrent I/Os, the hardware queue's pre-allocated tag pool can run out. New I/Os then block until tags are freed. /sys/block/nvme0n1/mq/0/nr_tags shows the size of the pool (not its current usage); compare it with the workload's queue depth and tune nr_requests accordingly.
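Tag exhaustion is easy to see in a toy model of a fixed-size tag pool. This Python sketch is illustrative only: TagSet is a hypothetical class, and where it returns None the kernel would instead put the submitter to sleep until a tag is released:

```python
class TagSet:
    """Fixed pool of request tags, standing in for one hardware queue's tags."""

    def __init__(self, nr_tags: int):
        self.free = list(range(nr_tags))
        self.in_flight = set()

    def get_tag(self):
        """Return a free tag, or None when the pool is exhausted."""
        if not self.free:
            return None  # the kernel would block the submitter here
        tag = self.free.pop()
        self.in_flight.add(tag)
        return tag

    def put_tag(self, tag: int) -> None:
        """Release a tag when its I/O completes."""
        self.in_flight.remove(tag)
        self.free.append(tag)


tags = TagSet(nr_tags=2)
first, second = tags.get_tag(), tags.get_tag()
print(tags.get_tag())   # pool exhausted
tags.put_tag(first)     # a completion frees a tag
print(tags.get_tag() == first)
```

The number of in-flight I/Os per hardware queue can never exceed the pool size, which is why a deep application queue gains nothing once the tags are used up.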

Linux 4.x deadline scheduler regression: In Linux 4.14-4.17, the transition from legacy deadline to mq-deadline exposed a bug where high queue depths on NVMe + mq-deadline caused unnecessary I/O serialization, reducing throughput by 30-40%. Fixed in 4.18. Affected early Kubernetes deployments using NVMe persistent volumes.

Modern Usage

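io_uring (Linux 5.1+): Asynchronous I/O interface built on submission and completion rings shared between userspace and the kernel. With IORING_SETUP_SQPOLL, a kernel thread polls the io_uring submission ring, so applications can submit I/O without a system call per request. IORING_SETUP_IOPOLL additionally polls the device for completions instead of waiting for interrupts. Polled configurations have been reported to reach roughly 10M IOPS from a single CPU core.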
BPF-based I/O schedulers: Experimental sched_ext equivalent for block layer — custom BPF programs implementing I/O scheduling policy.
dm-multipath: Device mapper target that load-balances across multiple block device paths to the same storage. Relevant for NVMe-oF with multiple fabric paths for redundancy.

Future Directions

io_uring and SPDK convergence: Intel SPDK (Storage Performance Development Kit) runs in userspace and bypasses the kernel block layer entirely. io_uring approaches SPDK performance while remaining kernel-managed.
BPF block layer observability: bpftrace programs can hook block:block_rq_insert, block:block_rq_complete tracepoints for production-safe I/O analysis without blktrace overhead.
Per-namespace I/O accounting: NVMe namespace-level I/O statistics in /sys/block/nvme0n*/stat enable fine-grained cgroup-like accounting without cgroup overhead.

Exercises

Use blktrace + blkparse + btt to analyze the I/O pattern from fio running 4K random reads on an HDD. Measure the D2C (device service time) and Q2D (scheduler delay) separately. What fraction of total latency is scheduler overhead?
Write a bpftrace one-liner that prints the I/O latency histogram for all block completions on /dev/sda for 10 seconds. (Hint: block:block_rq_insert and block:block_rq_complete tracepoints, use hist aggregation.)
Change the I/O scheduler on an NVMe device from none to mq-deadline. Run a mixed read/write workload with fio. Does mq-deadline hurt or help throughput and latency? Why?
Examine struct bio in include/linux/blk_types.h. Count how many flags exist for bi_opf. What do REQ_PREFLUSH and REQ_FUA do, and when would a filesystem use them together?
Run iostat -xz 1 during a database checkpoint (e.g., PostgreSQL CHECKPOINT). Observe how aqu-sz and w_await change. What does this reveal about checkpoint I/O behavior?

References

Axboe, J. "Linux Block I/O: Present and Future." Ottawa Linux Symposium 2004.
Linux kernel block layer documentation: Documentation/block/
Linux kernel source: block/blk-mq.c, block/elevator.c, block/mq-deadline.c
blktrace man pages and source: https://git.kernel.dk/blktrace
io_uring documentation: Documentation/block/io-uring.rst + https://kernel.dk/io_uring.pdf
Gregg, B. Systems Performance, 2nd ed., Chapter 9 (Disks)
Bjørling, M. et al. "Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems." SYSTOR 2013.