08 - I/O Schedulers and Tuning

Technical Overview

I/O scheduling is the kernel policy for determining the order and priority in which I/O requests are dispatched to block devices. For HDDs, scheduling is critical — reordering random I/Os to minimize seek distance can improve throughput 10x. For NVMe SSDs with hardware-level parallelism and ~100 µs latency, software scheduling overhead itself becomes a bottleneck, and passthrough (none) is often optimal.

Beyond scheduler selection, I/O performance depends on a range of kernel and hardware tuning parameters: writeback thresholds (controlling dirty page flush behavior), read-ahead (prefetching sequential data), queue depth, and I/O priority (allowing latency-sensitive processes to preempt background I/O).

Prerequisites

Linux block layer architecture (see 05-linux-block-layer.md)
blk-mq queue model
HDD mechanical operation and seek cost (see 02-hdd-internals.md)
NVMe queue model (see 04-nvme.md)

Core Content

I/O Scheduling Goals

The two competing goals:

Minimize latency: Dispatch I/Os as quickly as possible in arrival order. Critical for interactive workloads (databases, web servers).
Maximize throughput: Reorder and batch I/Os to maximize device utilization. Critical for bulk workloads (backups, analytics, HDFS).

These goals conflict. Reordering to improve throughput introduces latency for individual requests. The scheduler must balance throughput vs fairness vs deadline guarantees.

For HDDs: seek minimization (elevator) dominates — reordering reduces mean seek distance. For SSDs: the device handles parallelism internally; software reordering adds overhead without benefit.
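To make the HDD case concrete, here is a minimal sketch that compares total head travel for one request stream serviced in arrival order versus ascending LBA order. The seek_distance helper and the LBA values are illustrative assumptions, and head travel measured in LBAs is only a rough proxy for seek time.

```shell
#!/bin/sh
# seek_distance [START_LBA]
# Reads one LBA per line on stdin and prints the total head travel (in LBAs)
# when the requests are serviced in exactly that order.
seek_distance() {
    awk -v pos="${1:-0}" '
        { d = $1 - pos; if (d < 0) d = -d; total += d; pos = $1 }
        END { print total + 0 }'
}

# Hypothetical request stream, in arrival order.
requests="900
100
800
200
700"

fifo=$(printf '%s\n' "$requests" | seek_distance 0)
sorted=$(printf '%s\n' "$requests" | sort -n | seek_distance 0)
echo "Arrival order:  $fifo LBAs of head travel"
echo "Elevator order: $sorted LBAs of head travel"
```

For this stream the sorted pass covers roughly a quarter of the distance of the arrival-order pass. On an SSD the same reordering buys nothing because there is no head to move.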

Legacy Single-Queue Schedulers (Pre-blk-mq, Removed in Linux 5.0)

CFQ (Completely Fair Queuing):
- Per-process I/O queues; each process gets a "time slice" of I/O bandwidth
- Synchronous reads prioritized over asynchronous writes
- Idle detection: waits briefly after a synchronous request for more requests from the same process (exploiting temporal locality)
- Default scheduler for desktop Linux for about 12 years (2.6.18 in 2006 to 5.0 in 2019)
- Poor fit for SSDs (idle waiting introduces unnecessary latency) and for databases using O_DIRECT (no benefit from per-process queues)

Deadline:
- Maintains two sorted queues (read, write) ordered by LBA (for elevator ordering)
- Additionally maintains two FIFO queues ordered by deadline (arrival time + timeout)
- Dispatches from sorted queue unless a deadline is expiring
- Read timeout default: 500 ms; Write timeout default: 5 seconds
- Prevents starvation while still reducing seek distance
- Preferred for database storage on HDDs

NOOP:
- Simple FIFO with basic request merging
- No reordering
- Used for SSD and RAM disk (where seek order is irrelevant)

blk-mq Era Schedulers (blk-mq since Linux 3.13; I/O schedulers since 4.11)

none (passthrough):

Submission: CPU X → SWQ X → direct dispatch → HWQ
No sorting, no reordering, minimal overhead

Best for NVMe SSDs: the spec allows up to 64K queues (real devices usually expose far fewer), and the device does its own internal command reordering
Best for RAM-backed devices
Context: echo none > /sys/block/nvme0n1/queue/scheduler

mq-deadline:

Submission: CPU X → SWQ X → red-black tree (sorted by LBA)
                            + FIFO expiry list (sorted by deadline)
Dispatch: elevator order unless deadline expiring

Prevents starvation: requests with expired deadlines jump to front
Good for HDD RAID arrays and SATA SSDs with mixed workloads
Read/write deadlines independently tunable
Default for single-queue devices (most SATA/SAS HDDs) in modern kernels
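The dispatch rule above can be sketched as a toy model. This is a simplified illustration, not the kernel's implementation: it uses a single queue instead of separate read and write queues, ignores fifo_batch and writes_starved, and assumes a fixed service time per request. The deadline_sim function, its arguments, and the sample requests are hypothetical.

```shell
#!/bin/sh
# deadline_sim EXPIRE_MS SERVICE_MS START_MS
# Reads "id arrival_ms lba" lines on stdin and prints "id via" per dispatch,
# where via is "sorted" (elevator order) or "fifo" (deadline expired).
deadline_sim() {
    awk -v expire="$1" -v service="$2" -v now="$3" '
        { id[NR] = $1; arr[NR] = $2 + 0; lba[NR] = $3 + 0; pending[NR] = 1 }
        END {
            head = 0
            for (n = 0; n < NR; n++) {
                # Oldest pending request (head of the FIFO list).
                oldest = 0
                for (i = 1; i <= NR; i++)
                    if (pending[i] && (!oldest || arr[i] < arr[oldest])) oldest = i
                if (now >= arr[oldest] + expire) {
                    pick = oldest; via = "fifo"
                } else {
                    # Next request at or beyond the head position, else wrap
                    # around to the lowest LBA (one-way elevator).
                    pick = 0; lowest = 0
                    for (i = 1; i <= NR; i++) if (pending[i]) {
                        if (lba[i] >= head && (!pick || lba[i] < lba[pick])) pick = i
                        if (!lowest || lba[i] < lba[lowest]) lowest = i
                    }
                    if (!pick) pick = lowest
                    via = "sorted"
                }
                print id[pick], via
                pending[pick] = 0
                head = lba[pick]
                now += service
            }
        }'
}

# Request A is old and far away; B..E arrive later at low LBAs.
deadline_sim 300 50 200 <<'EOF'
A 0 900
B 200 100
C 200 200
D 200 300
E 200 400
EOF
```

A pure elevator would service A last because it sits at the highest LBA. In this model A is pulled forward as soon as its deadline passes, which is the starvation protection described above.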

kyber:
- Token bucket: limits queue depth for reads/sync writes to maintain latency targets
- Separate latency targets for reads and synchronous writes
- Throttles high-latency ops to protect latency-sensitive ops
- Useful for fast SSDs with mixed interactive + background I/O load
- Context: adjust targets via /sys/block/nvme0n1/queue/iosched/

# Kyber targets (nanoseconds, default: read=2ms, write=10ms)
cat /sys/block/nvme0n1/queue/iosched/read_lat_nsec
cat /sys/block/nvme0n1/queue/iosched/write_lat_nsec
echo 500000 > /sys/block/nvme0n1/queue/iosched/read_lat_nsec   # 500µs target

Scheduler Selection Heuristics

Is the device a rotating disk (HDD)?
  YES:
    Is the workload latency-sensitive (database, transactional)?
      YES: mq-deadline (with appropriate read/write expiry settings)
      NO: mq-deadline or none (if large sequential throughput is only goal)
  NO (SSD):
    Is it an NVMe?
      YES: none (passthrough — device handles reordering)
      NO (SATA SSD):
        Latency-sensitive or mixed: mq-deadline or kyber
        Sequential bulk only: none

Special cases:
    Virtual disk (virtio, VMware): none (hypervisor handles scheduling)
    Network block device (iSCSI, NBD): mq-deadline (adds ordering on top of variable network latency)
    RAM disk: none
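The decision tree above can be encoded as a small shell function. This is a sketch of the heuristic as written here, not a universal rule; the pick_scheduler name and its device-kind and workload labels are made up for illustration.

```shell
#!/bin/sh
# pick_scheduler DEVICE_KIND WORKLOAD
#   DEVICE_KIND: hdd | nvme | sata-ssd | virtual | network | ram
#   WORKLOAD:    latency | mixed | sequential
# Prints the first-choice scheduler from the heuristic.
pick_scheduler() {
    case "$1" in
        nvme|virtual|ram)
            echo none ;;
        network)
            echo mq-deadline ;;
        hdd)
            # none is the listed alternative for sequential-only workloads.
            echo mq-deadline ;;
        sata-ssd)
            case "$2" in
                sequential) echo none ;;
                *)          echo mq-deadline ;;  # kyber is the listed alternative
            esac ;;
        *)
            echo "unknown device kind: $1" >&2
            return 1 ;;
    esac
}

pick_scheduler nvme latency
pick_scheduler sata-ssd mixed
pick_scheduler sata-ssd sequential
```

Treat the result as a starting point and confirm it with a benchmark of the real workload.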

Check and set scheduler:

# Check current scheduler (brackets indicate active)
cat /sys/block/sda/queue/scheduler
# [mq-deadline] none kyber

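# Permanently set scheduler (udev rule)
# /etc/udev/rules.d/60-scheduler.rules
# ENV{DEVTYPE}=="disk" skips partitions, which have no queue/ attributes
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="1", \
  ATTR{queue/scheduler}="mq-deadline"
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="nvme[0-9]*n[0-9]*", \
  ATTR{queue/scheduler}="none"
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="sd[a-z]*", ATTR{queue/rotational}=="0", \
  ATTR{queue/scheduler}="none"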
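When scripting checks across many devices, it helps to extract the active scheduler from the bracketed sysfs format. The two helpers below are a small sketch; their names are arbitrary, and they read the scheduler line from stdin so they can be tried without touching sysfs.

```shell
#!/bin/sh
# Print the active scheduler (the bracketed entry) from a sysfs scheduler line.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Print every scheduler offered for the device, one per line.
available_schedulers() {
    tr -d '[]' | tr ' ' '\n' | sed '/^$/d'
}

echo '[mq-deadline] none kyber' | active_scheduler
echo '[mq-deadline] none kyber' | available_schedulers
```

On a live system the input would come from the device, for example active_scheduler < /sys/block/sda/queue/scheduler.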

I/O Priority: ionice and ioprio_set

Linux supports per-process I/O scheduling classes and priorities, distinct from CPU priority:

I/O scheduling classes:
- RT (Real-Time, class 1): preempts all other I/O. Reserved for time-critical applications. 8 priority levels (0 highest).
- BE (Best-Effort, class 2): default class. 8 priority levels. Priority 4 is default.
- IDLE (class 3): only receives I/O when no other process needs the disk. Good for background jobs.
- None (class 0): inherit from CPU scheduler (best-effort, priority based on CPU nice value)

# Check a process's I/O class/priority
ionice -p <PID>

# Run a command with idle I/O priority (backup, low-priority jobs)
ionice -c 3 tar czf /backup/home.tar.gz /home

# Run database process with real-time I/O priority (dangerous — can starve other I/O)
ionice -c 1 -n 0 -p $(pgrep postgres)

# Set via syscall
# ioprio_set(IOPRIO_WHO_PROCESS, pid,
#            IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 7));  # low best-effort

I/O priority is only honored by schedulers that implement it: bfq, and mq-deadline since Linux 5.14. With none, requests go straight to the hardware queues, so the I/O class has no effect.
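The value passed to ioprio_set packs the class and the level into one integer: the class sits above a 13-bit data field. The helpers below are a sketch of that encoding as plain shell arithmetic; the function names are invented here, and the nice-to-level mapping reflects how the kernel derives a best-effort level for class 0.

```shell
#!/bin/sh
IOPRIO_CLASS_SHIFT=13

# ioprio_value CLASS LEVEL -> packed value, like IOPRIO_PRIO_VALUE(class, level)
ioprio_value() { echo $(( ($1 << IOPRIO_CLASS_SHIFT) | $2 )); }

# Unpack a value back into its class and level.
ioprio_class() { echo $(( $1 >> IOPRIO_CLASS_SHIFT )); }
ioprio_level() { echo $(( $1 & ((1 << IOPRIO_CLASS_SHIFT) - 1) )); }

# Best-effort level implied by a CPU nice value when the class is 0 (none).
nice_to_ioprio_level() { echo $(( ($1 + 20) / 5 )); }

ioprio_value 2 7          # best-effort, lowest level
ioprio_value 1 0          # real-time, highest level
nice_to_ioprio_level 0    # default nice maps to the default level 4
```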

bfq (Budget Fair Queueing): In mainline since Linux 4.12, but usually built as a module, so it may not appear in the scheduler list until loaded (modprobe bfq). Provides per-process bandwidth slices + latency guarantees. Recommended for desktop environments with mixed interactive/background I/O.

Writeback Tuning

The kernel keeps "dirty" (modified) pages in memory before writing them to disk. This improves throughput (coalesces small writes) but creates risk (power loss = data loss for dirty pages). Key parameters in /proc/sys/vm/:

# View current writeback settings
sysctl vm.dirty_ratio
sysctl vm.dirty_background_ratio
sysctl vm.dirty_expire_centisecs
sysctl vm.dirty_writeback_centisecs

dirty_ratio (default: 20%):
- When dirty pages exceed this percentage of available (free plus reclaimable) memory, writing processes are throttled until writeback catches up. Hard limit.
- Raise for write-heavy streaming applications; lower for latency-sensitive systems.

dirty_background_ratio (default: 10%):
- When dirty pages exceed this percentage, background flusher threads (kworker; pdflush before 2.6.32) begin writing to disk. Soft limit.
- Lower this to start writeback sooner (keeps the dirty set small, reduces flush latency spikes).

dirty_expire_centisecs (default: 3000 = 30 seconds):
- Pages that have been dirty longer than this are written at the next flusher wakeup. Prevents indefinitely delayed writes.
- Lower for journaling filesystems or databases that need bounded staleness.

dirty_writeback_centisecs (default: 500 = 5 seconds):
- Interval at which the kworker flusher thread wakes to check for expired dirty pages.
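Because the ratio limits scale with memory size, it is worth converting them to bytes before choosing values. The sketch below does that arithmetic; dirty_threshold_bytes is a hypothetical helper, and passing the memory size as an argument is a simplification, since the kernel applies the ratio to dirtyable memory rather than to installed RAM.

```shell
#!/bin/sh
# dirty_threshold_bytes MEM_KIB RATIO_PERCENT
# Approximate number of dirty bytes at which a vm.dirty_*ratio limit triggers.
dirty_threshold_bytes() {
    echo $(( $1 * 1024 * $2 / 100 ))
}

# Hypothetical 16 GiB and 512 GiB machines with dirty_background_ratio=10.
dirty_threshold_bytes 16777216 10     # about 1.6 GiB
dirty_threshold_bytes 536870912 10    # about 51 GiB
```

On a live system, MemAvailable from /proc/meminfo is a reasonable stand-in for the first argument, for example dirty_threshold_bytes "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)" 20.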

Tuning examples:

# For a write-heavy batch server (wants maximum throughput):
sysctl -w vm.dirty_ratio=40
sysctl -w vm.dirty_background_ratio=20
sysctl -w vm.dirty_expire_centisecs=3000

# For a latency-sensitive database server (minimize write spikes):
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_expire_centisecs=500

# For an application that uses O_DIRECT (dirty pages irrelevant for db data):
# Keep defaults — O_DIRECT bypasses page cache anyway
# BUT: journal/WAL writes are often still buffered

Filesystem-level dirty control: Most journaling filesystems commit their journal every 5 seconds (ext4: commit=5 mount option). For applications requiring durability, use fsync() or O_SYNC rather than relying on system-wide dirty ratios.

Read-Ahead

The kernel's read-ahead mechanism pre-fetches pages sequentially from block devices, anticipating sequential access patterns:

# Check current read-ahead for a device
blockdev --getra /dev/sda   # in 512-byte sectors
cat /sys/block/sda/queue/read_ahead_kb  # in KB

# Set read-ahead
blockdev --setra 2048 /dev/sda    # 1 MB (2048 × 512 bytes)
echo 4096 > /sys/block/sda/queue/read_ahead_kb  # 4 MB

Read-ahead recommendations:
- HDD sequential streaming (backup, analytics): 4096-8192 KB; keeps drive's buffer full, eliminates rotational latency for sequential reads
- NVMe random I/O database: 0-128 KB; read-ahead wastes bandwidth on non-sequential patterns; some read-ahead helps for index scans
- NFS mounted filesystems: read-ahead critical for performance; default often too small (128 KB); raise to 1024-4096 KB
- VM disk images: 512-1024 KB; mix of random and sequential patterns

NCQ (Native Command Queuing)

NCQ is a SATA feature (SAS drives provide the comparable Tagged Command Queuing) that allows the drive's firmware to reorder up to 32 queued commands for optimal seek performance. It is analogous to the elevator algorithm, but implemented in the drive firmware with knowledge of the actual head position.

# Verify NCQ is enabled and queue depth
cat /sys/block/sda/device/queue_depth  # should be 31 or 32 for NCQ

# Disable NCQ (for debugging or specific HDDs with NCQ bugs)
hdparm -Q 1 /dev/sda   # set queue depth to 1 (disables NCQ effectively)

NCQ advantages: drive-firmware elevator algorithm has better seek information than host OS; reduces seek time 10-30% for random I/O workloads on HDDs.

NCQ is irrelevant for NVMe (NVMe's multi-queue model supersedes it).

Production: Database I/O Tuning

For MySQL InnoDB or PostgreSQL on NVMe SSDs:

# 1. Set I/O scheduler to none (NVMe passthrough)
echo none > /sys/block/nvme0n1/queue/scheduler

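# 2. Set nr_requests high for NVMe
# (with scheduler "none" the value cannot exceed the hardware queue depth;
#  check the current value first and expect EINVAL if 1024 is too large)
cat /sys/block/nvme0n1/queue/nr_requests
echo 1024 > /sys/block/nvme0n1/queue/nr_requests

# 3. Disable APST (power state transitions add latency)
# (affects controllers initialized after the write; to cover boot devices,
#  set nvme_core.default_ps_max_latency_us=0 on the kernel command line)
echo 0 > /sys/module/nvme_core/parameters/default_ps_max_latency_us

# 4. If the database uses O_DIRECT (e.g. InnoDB), disable read-ahead
# (PostgreSQL uses buffered I/O by default and benefits from some read-ahead)
blockdev --setra 0 /dev/nvme0n1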

# 5. Disable transparent huge pages (database manages its own pages)
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# 6. For journaled filesystem with database: tune dirty ratios
# (DB uses O_DIRECT, but system writes like journal still matter)
sysctl -w vm.dirty_ratio=5
sysctl -w vm.dirty_background_ratio=2

# 7. PostgreSQL-specific: filesystem mount options
# For ext4 with PostgreSQL: disabling write barriers is safe only if the drive
# has power-loss protection (PLP capacitors, typically enterprise NVMe)
# mount -o nobarrier /dev/nvme0n1p1 /var/lib/postgresql

For MySQL InnoDB on HDD RAID 10:

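# 1. Use mq-deadline on each member disk (md devices are stacked and expose
#    no scheduler of their own; sda..sdd are placeholder member names)
# 2. Tune deadline to prioritize reads (database often latency-sensitive on reads)
for d in sda sdb sdc sdd; do
  echo mq-deadline > /sys/block/$d/queue/scheduler
  echo 100 > /sys/block/$d/queue/iosched/read_expire    # 100ms
  echo 1000 > /sys/block/$d/queue/iosched/write_expire  # 1s
done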

# 3. RAID read-ahead (covers stride pattern)
# chunk_size * (N_data_disks): for RAID10 with 4 disks, 2 data disks, 512KB chunk
# optimal read-ahead = 512KB * 2 = 1MB
blockdev --setra 2048 /dev/md0

# 4. InnoDB: ensure O_DIRECT is set
# my.cnf: innodb_flush_method = O_DIRECT
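The stripe-width arithmetic from step 3 generalizes to other chunk sizes and disk counts. The helpers below are a sketch with invented names that convert a full stripe into the two units the tools expect: 512-byte sectors for blockdev --setra and KB for read_ahead_kb.

```shell
#!/bin/sh
# stripe_readahead_kib CHUNK_KIB DATA_DISKS -> full stripe width in KiB
stripe_readahead_kib() { echo $(( $1 * $2 )); }

# kib_to_sectors KIB -> 512-byte sectors, the unit used by blockdev --setra
kib_to_sectors() { echo $(( $1 * 1024 / 512 )); }

# RAID10 with 4 disks (2 data disks) and a 512 KiB chunk, as in step 3.
stripe_kib=$(stripe_readahead_kib 512 2)
echo "read_ahead_kb:    $stripe_kib"
echo "blockdev --setra: $(kib_to_sectors "$stripe_kib")"
```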

Cgroup-Based I/O Throttling

Linux cgroups v2 provide the io controller for per-cgroup I/O throttling:

# Enable io controller (if not already in cgroup hierarchy)
echo "+io" >> /sys/fs/cgroup/cgroup.subtree_control

# Create a cgroup for a workload
mkdir /sys/fs/cgroup/myapp
echo $PID > /sys/fs/cgroup/myapp/cgroup.procs

# Set I/O weight (proportional scheduling)
echo "default 100" > /sys/fs/cgroup/myapp/io.weight

# Hard throttle: max 1000 read IOPS on /dev/nvme0n1 (major:minor = 259:0)
echo "259:0 riops=1000" > /sys/fs/cgroup/myapp/io.max

# Hard throttle: max 100 MB/s write bandwidth
echo "259:0 wbps=104857600" > /sys/fs/cgroup/myapp/io.max

# Monitor cgroup I/O stats
cat /sys/fs/cgroup/myapp/io.stat
# 259:0 rbytes=1234567 wbytes=89012345 rios=123 wios=456 dbytes=0 dios=0
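Two small helpers make these files easier to script: one builds an io.max line from a limit in MiB/s, the other pulls a single counter out of io.stat text. Both names are hypothetical, and both work on plain strings so they can be tested without a cgroup hierarchy.

```shell
#!/bin/sh
# iomax_bps_line MAJ:MIN KEY MIB_PER_SEC -> line suitable for writing to io.max
#   KEY is rbps or wbps
iomax_bps_line() {
    echo "$1 $2=$(( $3 * 1024 * 1024 ))"
}

# iostat_field MAJ:MIN KEY -> value of KEY for that device, from io.stat on stdin
iostat_field() {
    awk -v dev="$1" -v key="$2" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }'
}

iomax_bps_line 259:0 wbps 100
echo '259:0 rbytes=1234567 wbytes=89012345 rios=123 wios=456 dbytes=0 dios=0' |
    iostat_field 259:0 wbytes
```

On a live system the second helper would read the real file, for example iostat_field 259:0 wbytes < /sys/fs/cgroup/myapp/io.stat.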

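With Docker, the --device-read-bps and --device-write-bps options (BlkioDeviceReadBps and BlkioDeviceWriteBps in the container's HostConfig) translate to these cgroup settings. Kubernetes does not expose block I/O limits in the pod spec, so they are usually applied through the container runtime or node-level tooling.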

Historical Context

The I/O scheduling history in Linux mirrors the evolution of storage hardware:

2.4 kernel (2001): Simple single elevator algorithm. One queue per device, no preemption.
2.5/2.6 (2002-2003): Jens Axboe redesigned the block layer with pluggable schedulers: AS (Anticipatory Scheduling), Deadline, CFQ, NOOP.
CFQ as default (2.6.18, 2006-2019): CFQ became default for its balance of fairness and throughput on the HDD-dominated storage landscape.
blk-mq (3.13, 2014): Multi-queue block layer for SSD-class devices. Initially optional, became mandatory by 5.0.
CFQ removal (5.0, 2019): CFQ, Deadline, and NOOP removed along with the legacy block layer (AS had already been removed in 2.6.33). Only none, mq-deadline, kyber, and bfq remain.

The BFQ (Budget Fair Queuing) scheduler was developed by Paolo Valente (University of Modena) and merged in 4.12. It provides per-process latency guarantees and is default in some desktop distributions (Fedora Workstation).

Production Examples

Percona MySQL I/O tuning guide: Recommends mq-deadline for HDD-backed databases with read_expire=500ms, write_expire=5000ms, and innodb_flush_method=O_DIRECT_NO_FSYNC for NVMe. Also recommends disabling NUMA interleaving (numactl) and Transparent Hugepages.

MongoDB on NVMe: MongoDB documentation recommends deadline (now mq-deadline) even for SSDs, citing performance improvements in their benchmarks due to request batching. This contradicts the theoretical case for none — real workloads with mixed read/write patterns sometimes benefit from scheduler batching even on NVMe.

Elasticsearch disk I/O: Elasticsearch recommends noop (now none) for SSDs and minimizing swapping (vm.swappiness=1). Also recommends vm.max_map_count=262144 for mmap indexing.

Debugging Notes

# Monitor I/O scheduler behavior with iostat
iostat -xdz 1
# Look for: r_await, w_await (latency), aqu-sz (queue depth), %util

# Trace I/O request lifecycle with blktrace
blktrace -d /dev/sda -o - | blkparse -i -
# Key events:
# Q: queued to OS, G: get request, I: insert into scheduler,
# D: dispatched to driver, C: completed

# Check queue stats
cat /sys/block/sda/stat    # cumulative counters
# Counters are cumulative since boot and cannot be reset;
# take two samples and subtract to get rates

# Check writeback progress
grep -E '^(Dirty|Writeback):' /proc/meminfo
# Dirty: amount of dirty cache (waiting to be written)
# Writeback: currently being written

# Monitor kworker flush threads
pidstat -d 1   # per-process disk I/O stats (includes kworker)

# Check if a process is doing O_DIRECT (bypasses page cache)
strace -e trace=open,openat -p $PID 2>&1 | grep O_DIRECT

# Identify processes doing the most I/O
iotop -o -a -d 1  # -o: only show processes with I/O, -a: accumulated, -d: refresh interval in seconds

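# Check in-flight requests per device (compare against queue_depth for NCQ utilization)
cat /sys/block/sda/inflight            # reads and writes currently in flight
cat /sys/block/sda/device/iodone_cnt   # I/Os completed (hexadecimal)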
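The counters in /sys/block/sda/stat only become useful as deltas between two samples. The sketch below derives per-second I/O counts and average latency from two samples, using the documented field order (1: read I/Os, 4: read ticks in ms, 5: write I/Os, 8: write ticks in ms). The stat_delta name and the sample values are made up for illustration.

```shell
#!/bin/sh
# stat_delta BEFORE AFTER INTERVAL_SECONDS
# BEFORE and AFTER are the contents of /sys/block/<dev>/stat at two points in time.
stat_delta() {
    printf '%s\n%s\n' "$1" "$2" | awk -v secs="$3" '
        NR == 1 { for (i = 1; i <= 8; i++) prev[i] = $i }
        NR == 2 {
            rios = $1 - prev[1]; rticks = $4 - prev[4]
            wios = $5 - prev[5]; wticks = $8 - prev[8]
            printf "r_iops=%d r_await_ms=%.2f w_iops=%d w_await_ms=%.2f\n",
                rios / secs, (rios ? rticks / rios : 0),
                wios / secs, (wios ? wticks / wios : 0)
        }'
}

# Hypothetical samples taken 5 seconds apart.
before="1000 0 8000 500 2000 0 16000 4000 0 0 0"
after="1500 0 12000 1500 2200 0 17600 5000 0 0 0"
stat_delta "$before" "$after" 5
```

On a live system the samples would come from two reads of the stat file separated by a sleep, for example before=$(cat /sys/block/sda/stat); sleep 5; after=$(cat /sys/block/sda/stat).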

Security Implications

I/O covert channels: By observing disk I/O timing (via /proc/diskstats or block layer timing), a process can infer when other processes are doing I/O, potentially leaking information about workload patterns. In multi-tenant environments, this is a low-bandwidth side channel. Mitigation: use separate physical disks per tenant, or ensure kernel I/O isolation via cgroups.

ionice real-time abuse: ionice -c 1 -n 0 (real-time I/O priority, which requires root or CAP_SYS_ADMIN) allows a process to starve all other disk I/O. Without proper cgroup I/O limits, a rogue or misbehaving privileged process can cause all other I/O to time out. Production systems should use cgroup io.max limits on untrusted workloads.

Performance Implications

dirty_ratio too high → write stalls: When dirty pages hit dirty_ratio, writing processes are throttled inside the kernel until enough dirty pages have been written back. This causes "write stalls": application threads sleep in uninterruptible (D) state, using no CPU, while writes complete at the speed of the backing device. Visible as latency spikes correlated with dirty page high-water marks. Solution: lower dirty_background_ratio to start flushing earlier.

Read-ahead thrashing: Aggressive read-ahead on random-access workloads (like database random reads) can pollute the page cache with prefetched pages that are never read. This evicts actually-needed pages and reduces effective cache size. For random I/O, reduce read_ahead_kb to 0-16 KB.

mq-deadline read starvation: Under heavy random write load, reads can still complete after their deadline, because expiry times are soft targets rather than guarantees. Keep read_expire at a conservative value (200-500 ms) so that reads are promoted promptly. Monitor r_await in iostat: if it regularly exceeds read_expire, the device itself is saturated and scheduler tuning alone will not help.

Failure Modes and Real Incidents

Dirty page flush storm: A common incident pattern: system runs normally for hours, then experiences a 10-60 second I/O storm where all write throughput goes to flushing dirty pages. Root cause: dirty_background_ratio and dirty_ratio too high → dirty pages accumulate → background flush falls behind → ratio hits hard limit → all processes stall for synchronous flush. Fix: lower dirty_background_ratio to 5-10% and monitor /proc/meminfo Dirty field.
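The length of such a storm can be estimated by dividing the dirty backlog by the device's write throughput. The sketch below does that for a hypothetical server; flush_storm_seconds is an invented helper, and it assumes writeback proceeds at the full sequential rate, which random writeback on an HDD will not reach.

```shell
#!/bin/sh
# flush_storm_seconds MEM_GIB DIRTY_RATIO_PERCENT DEVICE_MB_PER_SEC
# Rough time to drain a dirty backlog that has reached the dirty_ratio limit.
flush_storm_seconds() {
    dirty_bytes=$(( $1 * 1073741824 * $2 / 100 ))
    echo $(( dirty_bytes / ($3 * 1000000) ))
}

# Hypothetical 32 GiB server writing to a 150 MB/s HDD.
flush_storm_seconds 32 20 150   # default dirty_ratio
flush_storm_seconds 32 5 150    # lowered dirty_ratio
```

With these assumptions the default ratio yields a stall of about 45 seconds, and lowering the ratio to 5% cuts it to about 11 seconds. Both figures are optimistic lower bounds for seek-bound writeback.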

Linux 3.x CFQ idle wait on SSDs: CFQ's idling (slice_idle, 8 ms by default), which waits after a synchronous read for more reads from the same process, was designed for HDD seek optimization. On SSDs it could add milliseconds of artificial latency to single-threaded random reads. This caused widespread complaints about slow SSD performance on Linux 3.x. Fix: use noop (now none) scheduler on SSDs. Modern kernels pick the default by hardware queue count rather than by the rotational flag (multi-queue devices such as NVMe get none), and distributions often add udev rules keyed on queue/rotational.

Writeback stall during ext4 journal commit: On busy systems, ext4's 5-second journal commit interval can cause a writeback stall if the journal is large and on a slow device. The fix is to mount with journal_async_commit (ext4) or use a dedicated fast journal device (SSD for journal, HDD for data).

Modern Usage

io_uring with sqpoll: For NVMe, io_uring's submission-side polling (IORING_SETUP_SQPOLL) has a kernel thread poll the submission ring, so the application submits I/O without system calls. Requests still pass through blk-mq, so it is normally paired with the none scheduler. Combined with completion polling, this has been shown to reach roughly 10M IOPS from userspace, at the cost of a dedicated CPU core for polling.
BPF-based I/O tracing: bpftrace provides production-safe I/O tracing at scheduler level:
bpftrace -e 'tracepoint:block:block_rq_insert { @[args->rwbs] = hist(args->bytes); }'
Kubernetes I/O limits: Kubernetes has no first-class block I/O limit in the pod spec. On cgroup v2 nodes, limits are applied by writing io.max for the container's cgroup, typically through the container runtime or node-level tooling.

Future Directions

BFQ as default: BFQ (Budget Fair Queuing) provides per-process latency guarantees while maximizing throughput. Some distributions already select it for rotational and slow devices; whether it replaces mq-deadline as the kernel default for HDDs remains open.
Hardware-accelerated I/O scheduling: Emerging SmartNICs and DPUs (Data Processing Units — Nvidia BlueField, Intel Mount Evans) can offload I/O scheduling to ARM cores on the DPU, freeing host CPUs.
ML-based adaptive scheduling: Research proposals for ML models that predict I/O patterns and dynamically adjust scheduler parameters. Some proof-of-concept systems report 15-30% throughput improvements on mixed workloads.

Exercises

On a system with an HDD, set the scheduler to mq-deadline. Run fio --rw=randread --iodepth=1 and capture blktrace output. Use btt to compute D2C (device service time) and Q2D (scheduler wait). How much latency does the scheduler add/save vs FIFO order?
Write a script that monitors /proc/meminfo Dirty and alerts when it exceeds 5% of total RAM. Trigger a large write burst (dd if=/dev/zero of=/tmp/bigfile bs=1M count=2048) and observe how long before background writeback kicks in and how long to clear the dirty pages.
Use ionice -c 3 to run a backup job (tar czf). Simultaneously run a latency-sensitive workload (sysbench io). Measure the latency impact of the backup with and without ionice. How effective is idle I/O priority at protecting interactive workloads?
Configure cgroup v2 I/O throttling to limit a process to 10 MB/s write bandwidth. Run dd if=/dev/zero of=/test/file bs=1M count=1024 inside the cgroup. Verify the limit via cat /sys/fs/cgroup/mygroup/io.stat.
Research the vm.dirty_bytes and vm.dirty_background_bytes sysctl parameters (absolute byte values instead of percentages). Why would you prefer absolute values over percentages on a very large RAM server (e.g., 512 GB)? Calculate the appropriate bytes values for a 512 GB server where you want writeback to start at 4 GB dirty.

References

Axboe, J. "Linux Block I/O: Present and Future." Ottawa Linux Symposium 2004.
Valente, P. and Checconi, F. "High Throughput Disk Scheduling with Deterministic Guarantees on Bandwidth Distribution." IEEE TPDS 2010 (BFQ paper).
Linux kernel documentation: Documentation/block/, Documentation/admin-guide/sysctl/vm.rst
Linux kernel source: block/mq-deadline.c, block/kyber-iosched.c, block/bfq-iosched.c
Percona MySQL I/O tuning: https://www.percona.com/blog/linux-io-scheduler-tuning/
MongoDB tuning: https://www.mongodb.com/docs/manual/administration/production-notes/
Gregg, B. Systems Performance, 2nd ed., Chapter 9 (Disks)
io_uring documentation: https://kernel.dk/io_uring.pdf