04 — I/O Performance

Technical Overview

I/O performance sits at the intersection of hardware capabilities, kernel I/O subsystems, and application I/O patterns. A modern NVMe SSD delivers 1 million IOPS and 7 GB/s sequential throughput—but an application issuing synchronous single-threaded reads will extract only a fraction of that because it serializes requests that the device could service in parallel.

The Linux I/O stack has evolved from blocking syscalls through POSIX AIO to io_uring, which fundamentally restructures the kernel-user boundary. Understanding that evolution—and the performance characteristics at each layer—is essential for building storage-intensive services.

Prerequisites

File descriptor model: open/read/write/close.
Linux VFS layer basics.
Understanding of kernel vs. user space boundary.
Block layer and NVMe protocol basics.

Core Content

I/O Performance Metrics

Metric	Definition	Tool
IOPS	I/O operations per second	`iostat`, `fio`
Throughput	MB/s or GB/s transferred	`iostat`, `fio`, `dd`
Latency	Time per individual I/O operation	`fio --lat_percentiles=1`, `biolatency`
Queue depth	Outstanding I/Os at any instant	`iostat -x` (`aqu-sz`; `avgqu-sz` in older versions)
Utilization	% of time device is busy	`iostat %util`

IOPS and throughput are related by I/O size: Throughput = IOPS × Block_Size. A device doing 100K IOPS at 4 KB has 400 MB/s throughput—the same as doing 50K IOPS at 8 KB.

Disk I/O Patterns

Sequential vs. random: the most important I/O characteristic for spinning disks; less so for SSDs, but still relevant for NVMe due to internal parallelism and cache behavior.

Sequential read (1 MB block, queue depth 1):
  HDD:   ~150 MB/s    (head doesn't seek)
  SATA SSD: ~550 MB/s
  NVMe: ~7,000 MB/s

Random read (4 KB, queue depth 1):
  HDD:   ~0.5 MB/s    (seek time + rotational latency)
  SATA SSD: ~40 MB/s
  NVMe: ~60 MB/s (queue depth 1 limited)

Random read (4 KB, queue depth 32):
  HDD:   ~0.7 MB/s    (mechanical can't parallelize seeks)
  SATA SSD: ~400 MB/s
  NVMe: ~3,000 MB/s   (parallelism fully exploited)

Queue depth impact on NVMe: NVMe protocol supports up to 65,535 queues with 65,535 entries each. The device's internal NAND channels and die interleaving can service many commands in parallel. Queue depth 1 serializes commands; high queue depth exposes device parallelism:

NVMe IOPS vs Queue Depth (Samsung 990 Pro):
  QD1:    15,000 IOPS
  QD4:    200,000 IOPS
  QD32:   700,000 IOPS
  QD128:  1,000,000 IOPS (peak)

For HDD, queue depth above 2–4 is rarely beneficial—the mechanical arm can only physically execute one seek at a time (though NCQ elevator scheduling helps slightly).

iostat Output Interpretation

iostat -x 1

Device   r/s    w/s    rkB/s   wkB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm  %util
nvme0n1  125.0  80.0   8000.0  5120.0   0.0     0.0     0.0    0.0    0.8      1.2      0.18    64.0      64.0      0.9    18.5

Key columns:

r_await / w_await: average latency (ms) from request submission to completion, including time spent queued.
svctm: service time per I/O (ms). This column is deprecated, unreliable, and removed from recent sysstat releases; use r_await/w_await instead.
aqu-sz (avgqu-sz in older versions): average number of requests queued or in flight. On an HDD, a sustained value above 1 means requests are waiting behind one another. On NVMe, values well above 1 are normal and are what lets the device work in parallel.
%util: percentage of time the device had at least one I/O pending. For SSDs, 100% util does not mean saturated, because the device can service many I/Os concurrently. For HDDs, 100% util typically is saturation.

A well-tuned NVMe under moderate load:

r_await: 0.1–0.5 ms
w_await: 0.05–0.3 ms
aqu-sz:  1–16
%util:   20–80%

O_DIRECT vs. Buffered I/O

Buffered I/O (default): reads/writes go through the page cache. The OS caches blocks in DRAM; repeated reads are served from cache. Write-back means writes complete quickly in DRAM and are flushed to disk asynchronously.

O_DIRECT: bypasses the page cache. Every read/write goes directly to the block device. Typical uses:

Databases that manage their own buffer pool and want to avoid double-caching (for example, MySQL InnoDB with innodb_flush_method=O_DIRECT; PostgreSQL, by contrast, defaults to buffered I/O).
Avoiding cache pollution from bulk data ingestion (log archiving, backup).

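Constraints: the buffer address, the transfer size, and the file offset must all be aligned to the device's logical block size (typically 512 or 4096 bytes); misaligned requests usually fail with EINVAL: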

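#define _GNU_SOURCE   // exposes O_DIRECT in <fcntl.h> on glibc
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

// O_DIRECT requires an aligned buffer, length, and offset
void *buf;
if (posix_memalign(&buf, 4096, 4096) != 0) { /* handle allocation failure */ }
int fd = open("/data/file", O_RDONLY | O_DIRECT);  // -1 on error (errno set)
ssize_t n = read(fd, buf, 4096);                   // bytes read, or -1 on error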

Performance comparison for a workload fitting in RAM:

Buffered I/O (cache warm):  ~50 GB/s (reads from DRAM page cache)
O_DIRECT:                   ~7 GB/s  (reads from NVMe each time)

For a workload exceeding RAM:

Buffered I/O (cache cold): ~7 GB/s  (no cache benefit)
O_DIRECT:                  ~7 GB/s  (equivalent, no double-copy overhead)

io_uring: Architecture and Performance

io_uring (Linux 5.1, 2019, authored by Jens Axboe) is the most significant I/O subsystem addition to Linux in decades. It solves fundamental scalability problems with POSIX AIO and epoll.

Problems with pre-io_uring I/O:

read/write: blocking syscalls, one syscall per I/O.
POSIX AIO (aio_read): poorly designed; glibc implements it with a pool of user-space threads that issue ordinary blocking I/O, so it scales badly.
epoll + non-blocking I/O: good for network sockets, but regular files always report "ready", so a read of a page that is not cached still blocks.

io_uring architecture: two lock-free ring buffers shared between kernel and user space.

User Space                          Kernel Space
┌──────────────────────────────────────────────────────────┐
│                                                          │
│  ┌──────────────────┐         ┌──────────────────────┐  │
│  │ Submission Queue │ ──mmap──► SQ Ring Buffer         │  │
│  │ (SQ Ring)        │         │                      │  │
│  │ [sqe][sqe][sqe]  │         │  io_uring kernel     │  │
│  └──────────────────┘         │  worker              │  │
│                               │                      │  │
│  ┌──────────────────┐         │  ┌────────────────┐  │  │
│  │ Completion Queue │ ◄──mmap─│  │ CQ Ring Buffer │  │  │
│  │ (CQ Ring)        │         │  │[cqe][cqe][cqe] │  │  │
│  │                  │         │  └────────────────┘  │  │
│  └──────────────────┘         └──────────────────────┘  │
│                                                          │
│  io_uring_enter(1 syscall) to submit batch               │
│  or SQPOLL mode: 0 syscalls                              │
└──────────────────────────────────────────────────────────┘

SQE (Submission Queue Entry): describes an I/O operation (opcode, file descriptor, buffer, offset, flags).

CQE (Completion Queue Entry): result of a completed operation (res = return value, user_data for correlation).

Key features:

Fixed buffers (IORING_REGISTER_BUFFERS via io_uring_register): register buffers once; the kernel pins and maps them up front instead of on every I/O.
Fixed files (IORING_REGISTER_FILES via io_uring_register): register an fd table once; avoid per-I/O fd table lookup and reference counting.
SQPOLL (IORING_SETUP_SQPOLL): a dedicated kernel thread polls the SQ ring. The application submits by writing to the SQ ring with no syscall, as long as the poller thread has not gone idle.
Linked operations: an SQE with the IOSQE_IO_LINK flag makes the next SQE start only after it completes, forming a dependency chain (for example, write then fsync, queued as one unit).
Timeout operations: IORING_OP_TIMEOUT enables I/O timeouts without extra threads.
Vectored I/O: IORING_OP_READV/IORING_OP_WRITEV support scatter-gather.

Performance numbers (Axboe's 2019 benchmarks):

Baseline: 4 KB random read, 1 CPU core

libaio (Linux native AIO): 640,000 IOPS
io_uring (basic):        1,700,000 IOPS
io_uring + fixed bufs:   1,900,000 IOPS
io_uring + SQPOLL:       2,000,000 IOPS

That is roughly a 3x improvement over the previous best Linux async I/O interface, and it approaches the performance of SPDK-style userspace polled I/O.

Basic io_uring usage:

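#include <liburing.h>

// Read len bytes at offset from fd through io_uring.
// Returns bytes read or -errno. IORING_OP_READ needs Linux 5.6 or later.
int read_once(int fd, void *buf, unsigned len, off_t offset, void *my_context)
{
    struct io_uring ring;
    int ret = io_uring_queue_init(32, &ring, 0);  // depth=32, no special flags
    if (ret < 0)
        return ret;                               // liburing returns -errno

    // Submit a read
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);  // NULL if the SQ ring is full
    io_uring_prep_read(sqe, fd, buf, len, offset);
    io_uring_sqe_set_data(sqe, my_context);       // echoed back in the CQE
    io_uring_submit(&ring);

    // Reap completions
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&ring, &cqe);
    if (ret == 0) {
        ret = cqe->res;                           // bytes read or -errno
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return ret;
}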

Applications using io_uring: RocksDB (Facebook), QEMU, Nginx (experimental), Rust async runtimes such as tokio-uring and glommio, and numerous database engines.

SPDK: Storage Performance Development Kit

SPDK (Intel, 2016) takes the DPDK approach to storage: a userspace NVMe driver that bypasses the kernel entirely.

Architecture:

Application
    │
    ▼
SPDK NVMe Driver (userspace)
    │  MMIO doorbell writes to NVMe BAR registers; data moves by DMA
    ▼
NVMe PCIe Device
    (no kernel interrupts — poll-mode driver)

No kernel block layer, no VFS, no page cache.
Application busy-polls a completion queue on a dedicated CPU core.
Achieves near-theoretical NVMe IOPS: ~4 million IOPS per core on high-end NVMe.
Used by: Ceph SPDK backend, Samsung's storage appliances, distributed storage systems.
Trade-off: dedicated CPU core (spinning on poll loop), requires root/hugepages, complex integration.

Historical Context

Linux block I/O has a long history. The original ll_rw_blk (pre-2.6) was simple but didn't scale. The block layer rewrite by Jens Axboe (2.6.0, 2003) introduced the I/O elevator (CFQ, deadline, noop schedulers). The multi-queue block layer (blk-mq, 3.13, 2014) enabled per-CPU submission queues matching NVMe's hardware parallelism. io_uring (5.1, 2019) addressed the user-kernel interface bottleneck.

The POSIX AIO standard was written in the 1990s for magnetic disks; its thread-per-operation glibc implementation was always a kludge. Linux kernel AIO (io_submit) was faster but had severe limitations: it was reliably asynchronous only for O_DIRECT file I/O (buffered I/O could silently block inside io_submit), and it had no real network support. io_uring superseded all prior approaches.

Production Examples

Case: PostgreSQL IOPS saturation. A PostgreSQL database on an AWS instance using gp2 EBS volumes (3,000 IOPS baseline) saturated at peak load. iostat -x 1 showed aqu-sz=28 and r_await=45 ms (normal: 2 ms). The fix was three-pronged: migrate to gp3 provisioned at 16,000 IOPS, set effective_io_concurrency=64 (the number of concurrent prefetch requests PostgreSQL issues during bitmap heap scans), and load pg_prewarm via shared_preload_libraries so that its autoprewarm worker repopulates the buffer pool after a restart, reducing cold I/O.

Case: io_uring in production at Cloudflare. Cloudflare's quiche (QUIC/HTTP3 library) adopted io_uring for sending UDP packets. Eliminating per-packet sendmsg syscalls via io_uring batching reduced kernel time by 40% for their highest-throughput servers (blog.cloudflare.com, 2022).

Debugging Notes

# Block I/O latency histogram (BCC tool)
/usr/share/bcc/tools/biolatency -D 10

# Block I/O per-process breakdown
/usr/share/bcc/tools/biotop 1

# Block I/O trace (verbose)
blktrace -d /dev/nvme0n1 -o - | blkparse -i -

# Check I/O scheduler
cat /sys/block/nvme0n1/queue/scheduler
# For NVMe: use 'none' (no scheduler — device handles queuing)
echo none > /sys/block/nvme0n1/queue/scheduler

# Check queue depth setting
cat /sys/block/nvme0n1/queue/nr_requests
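# For NVMe: raise toward 1024 if the device allows it; with scheduler 'none'
# the value is capped by the hardware queue size and larger writes are rejected
echo 1024 > /sys/block/nvme0n1/queue/nr_requests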

fio Configuration Examples

# Random read, 4K, queue depth 32 — measures IOPS
[global]
bs=4k
iodepth=32
direct=1
ioengine=io_uring
group_reporting=1

[randread]
rw=randread
size=10G
filename=/dev/nvme0n1

# Sequential write, 1M blocks — measures throughput
[seqwrite]
rw=write
bs=1M
iodepth=8
direct=1
ioengine=io_uring
size=50G
filename=/data/testfile

# Latency-focused: queue depth 1, small blocks
[latency]
rw=randread
bs=4k
iodepth=1
direct=1
ioengine=io_uring
lat_percentiles=1
percentile_list=95:99:99.9
numjobs=1

Run and parse:

fio --output-format=json randread.fio | jq '.jobs[0].read | {iops, bw, lat_ns}'

Security Implications

io_uring has had significant CVEs. The syscall's flexibility (it can represent many operations including IORING_OP_OPENAT, IORING_OP_SOCKET, IORING_OP_CONNECT) creates a large attack surface reachable from within seccomp sandboxes that only filter traditional syscall numbers. Notable issues:

CVE-2022-29582: Use-after-free in io_uring timeout handling. Exploited for local privilege escalation.
CVE-2023-2598: Out-of-bounds access to physical memory via io_uring fixed-buffer registration. LPE.
Android and Chrome OS disable io_uring in their kernels (Google, 2023) due to the volume of exploitable bugs.
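RHEL 9 and some Ubuntu LTS versions restrict io_uring for untrusted workloads. On kernels that provide the kernel.io_uring_disabled sysctl (Linux 6.6 and later), a value of 1 limits ring creation to privileged processes and members of kernel.io_uring_group, and 2 disables it for all processes.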

O_DIRECT file I/O can be used to bypass mandatory locking on some filesystems—always audit O_DIRECT use in security-sensitive contexts.

Performance Implications

The kernel page cache provides an enormous speedup for reads of hot data (DRAM speeds for disk data). Bypassing it with O_DIRECT eliminates this benefit. Databases (PostgreSQL, MySQL) manage their own buffer pools to recoup this with application-specific cache replacement policies (e.g., using MRU for sequential scans to avoid thrashing the buffer pool).

Write coalescing: the page cache merges adjacent writes before flushing to disk. O_DIRECT writes each go to disk individually (unless the application implements its own write coalescing). This can reduce write efficiency for small, scattered writes.

Failure Modes and Real Incidents

io_uring LPE exploits (2022–2023): Several local privilege escalation exploits targeting io_uring vulnerabilities were published. CISA and Google recommended disabling io_uring in environments running untrusted code. The kernel community has invested heavily in io_uring hardening (codebase audits, fuzzing with syzkaller), but the complexity of the io_uring codebase remains a concern.

EBS volume stall during snapshot (AWS, 2019): AWS EBS gp2 volumes occasionally had their IOPS burst credits exhausted during automated snapshot operations, causing production I/O latency to spike from 2 ms to 500 ms. Applications that relied on burst IOPS for normal operation failed. Lesson: provision baseline IOPS (gp3 with explicit IOPS, or io1/io2) for latency-sensitive production databases.

Modern Usage

io_uring has become the preferred I/O engine in Rust async runtimes (tokio-uring, glommio), and is the default in some versions of RocksDB. SPDK continues to be used in high-performance storage appliances. The combination of io_uring + O_DIRECT + fixed buffers can approach raw NVMe throughput from userspace without the complexity and privilege requirements of SPDK.

Future Directions

io_uring with IOMMU passthrough: enabling io_uring operations to DMA directly from NVMe to user memory without kernel copies.
Persistent memory (CXL) integration with io_uring: treating CXL-attached memory as a storage tier with byte-addressable semantics.
NVMe-oF (NVMe over Fabrics): extending NVMe queue model over RDMA or TCP, bringing low-latency NVMe semantics to disaggregated storage.

Exercises

Run fio with ioengine=io_uring and ioengine=libaio on the same NVMe device for random 4 KB reads at QD=32. Compare IOPS and latency distributions. Explain the difference.
Use iostat -x 1 during a fio run. Identify the point at which increasing queue depth no longer improves IOPS (the device's saturation point). Plot IOPS vs. queue depth from QD1 to QD128.
Write a simple C program that opens a file with O_DIRECT and O_RDONLY, reads 4 KB aligned blocks, and measures latency per read using clock_gettime(CLOCK_MONOTONIC). Compare average latency with and without O_DIRECT.
Compare iostat output for a sequential read workload vs. a random read workload (both using fio). Explain the difference in aqu-sz, await, and %util.
Investigate io_uring security: look up CVE-2022-29582. Describe the vulnerability class, how it was exploited, and what kernel version introduced the fix.

References

Axboe, J. "Efficient IO with io_uring." https://kernel.dk/io_uring.pdf (2019).
Axboe, J. "io_uring and the new frontier of I/O performance." KernelTLK 2019.
Intel SPDK documentation: https://spdk.io/doc/
fio documentation: https://fio.readthedocs.io/
Gregg, B. Systems Performance (2nd ed., 2020). Chapter 9: Disks.
CVE-2022-29582: https://nvd.nist.gov/vuln/detail/CVE-2022-29582
Google Security Blog: "The Security Case Against io_uring in the Linux Kernel." https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-regarding-io.html
Cloudflare: "Missing Manuals - io_uring worker pool." https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/