04 — I/O Performance
Technical Overview
I/O performance sits at the intersection of hardware capabilities, kernel I/O subsystems, and application I/O patterns. A modern NVMe SSD delivers 1 million IOPS and 7 GB/s sequential throughput—but an application issuing synchronous single-threaded reads will extract only a fraction of that because it serializes requests that the device could service in parallel.
The Linux I/O stack has evolved from blocking syscalls through POSIX AIO to io_uring, which fundamentally restructures the kernel-user boundary. Understanding that evolution—and the performance characteristics at each layer—is essential for building storage-intensive services.
Prerequisites
- File descriptor model: open/read/write/close.
- Linux VFS layer basics.
- Understanding of kernel vs. user space boundary.
- Block layer and NVMe protocol basics.
Core Content
I/O Performance Metrics
| Metric | Definition | Tool |
|---|---|---|
| IOPS | I/O operations per second | iostat, fio |
| Throughput | MB/s or GB/s transferred | iostat, fio, dd |
| Latency | Time per individual I/O operation | fio (lat_percentiles=1), biolatency |
| Queue depth | Outstanding I/Os at any instant | iostat aqu-sz (avgqu-sz in older versions) |
| Utilization | % of time device is busy | iostat %util |
IOPS and throughput are related by I/O size: Throughput = IOPS × Block_Size. A device doing 100K IOPS at 4 KB has 400 MB/s throughput—the same as doing 50K IOPS at 8 KB.
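The relationship above can be sketched as two small C helpers. The function names are hypothetical, and decimal units (1 MB = 1000 KB) are assumed so the results match the arithmetic in the text.

```c
/* Throughput in MB/s from IOPS and block size in KB.
 * Assumes decimal units: 1 MB = 1000 KB. */
double throughput_mb_s(double iops, double block_kb)
{
    return iops * block_kb / 1000.0;
}

/* Inverse relation: IOPS needed to sustain a target throughput (MB/s)
 * at a given block size in KB. */
double iops_for_throughput(double mb_s, double block_kb)
{
    return mb_s * 1000.0 / block_kb;
}
```

Calling throughput_mb_s(100000, 4) and throughput_mb_s(50000, 8) both give 400 MB/s, which is why IOPS figures are only comparable when the block size is stated.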
Disk I/O Patterns
Sequential vs. random: this is the most important I/O characteristic for spinning disks. It matters less for SSDs, but the access pattern still affects NVMe performance through internal parallelism and cache behavior.
Sequential read (1 MB block, queue depth 1):
HDD: ~150 MB/s (head doesn't seek)
SATA SSD: ~550 MB/s
NVMe: ~7,000 MB/s
Random read (4 KB, queue depth 1):
HDD: ~0.5 MB/s (seek time + rotational latency)
SATA SSD: ~40 MB/s
NVMe: ~60 MB/s (queue depth 1 limited)
Random read (4 KB, queue depth 32):
HDD: ~0.7 MB/s (mechanical can't parallelize seeks)
SATA SSD: ~400 MB/s
NVMe: ~3,000 MB/s (parallelism fully exploited)
Queue depth impact on NVMe: the NVMe protocol supports up to 65,535 I/O queues with up to 65,536 entries each. The device's internal NAND channels and die interleaving can service many commands in parallel. Queue depth 1 serializes commands; high queue depth exposes device parallelism:
NVMe IOPS vs Queue Depth (Samsung 990 Pro):
QD1: 15,000 IOPS
QD4: 200,000 IOPS
QD32: 700,000 IOPS
QD128: 1,000,000 IOPS (peak)
For HDD, queue depth above 2–4 is rarely beneficial—the mechanical arm can only physically execute one seek at a time (though NCQ elevator scheduling helps slightly).
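The link between queue depth and IOPS follows from Little's law (outstanding I/Os = IOPS × latency). A minimal sketch, with hypothetical function names, and assuming per-I/O latency stays roughly constant, which only holds until the device saturates:

```c
/* Little's law: outstanding I/Os = IOPS x latency, so
 * IOPS = queue_depth / latency. Latency is given in microseconds. */
double iops_from_queue_depth(unsigned queue_depth, double latency_us)
{
    return (double)queue_depth * 1e6 / latency_us;
}

/* The same relation solved for latency: the mean per-I/O latency (in
 * microseconds) implied by a measured (queue depth, IOPS) pair. */
double implied_latency_us(unsigned queue_depth, double iops)
{
    return (double)queue_depth * 1e6 / iops;
}
```

With an assumed 100 µs latency, queue depth 1 cannot exceed 10,000 IOPS no matter how fast the device is, while queue depth 32 allows up to 320,000. Going the other way, 1,000,000 IOPS at QD128 implies a mean latency of 128 µs per I/O.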
iostat Output Interpretation
iostat -x 1
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
nvme0n1 125.0 80.0 8000.0 5120.0 0.0 0.0 0.0 0.0 0.8 1.2 0.18 64.0 64.0 0.9 18.5
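Key columns:
- r_await / w_await: average latency (ms) from request submission to completion, including time spent queued.
- svctm: service time per I/O (ms). This column is deprecated and unreliable, and recent sysstat releases drop it; use r_await/w_await instead.
- aqu-sz (avgqu-sz in older versions): average number of requests queued or in flight. On a single HDD, a sustained value above 1 means requests are waiting. On NVMe, which services many requests in parallel, values well above 1 are normal and do not indicate saturation by themselves.
- %util: percent of time the device had at least one I/O pending. For SSDs, 100% util does not mean saturated, because the device can still accept more I/Os in parallel. For HDDs, 100% util is typically saturation.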
A well-tuned NVMe under moderate load:
r_await: 0.1–0.5 ms
w_await: 0.05–0.3 ms
aqu-sz: 1–16
%util: 20–80%
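Several iostat columns can be cross-checked against each other. The sketch below uses hypothetical helper names and assumes the usual relationships: request size is kB/s divided by requests/s, and average queue size follows Little's law. iostat itself works from kernel counters, so the second value is only an approximation.

```c
/* Average request size in KB: kB/s divided by requests/s.
 * Corresponds to rareq-sz (reads) or wareq-sz (writes). */
double avg_request_kb(double kb_per_s, double req_per_s)
{
    return req_per_s > 0.0 ? kb_per_s / req_per_s : 0.0;
}

/* Approximate aqu-sz via Little's law: requests/s times await (ms),
 * summed over reads and writes, converted from ms to seconds. */
double approx_aqu_sz(double r_s, double r_await_ms,
                     double w_s, double w_await_ms)
{
    return (r_s * r_await_ms + w_s * w_await_ms) / 1000.0;
}
```

For the sample row above (r/s=125, rkB/s=8000, r_await=0.8, w/s=80, w_await=1.2), this gives 64 KB per read request and a queue size of about 0.2, close to the reported 0.18.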
O_DIRECT vs. Buffered I/O
Buffered I/O (default): reads/writes go through the page cache. The OS caches blocks in DRAM; repeated reads are served from cache. Write-back means writes complete quickly in DRAM and are flushed to disk asynchronously.
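O_DIRECT: bypasses the page cache. Every read/write goes directly to the block device. Typical uses:
- Databases managing their own buffer pool (for example MySQL InnoDB with innodb_flush_method=O_DIRECT) to avoid double-caching. PostgreSQL, by contrast, uses buffered I/O by default and relies on the page cache alongside shared_buffers.
- Avoiding cache pollution from bulk data ingestion (log archiving, backup).
Constraints: the buffer address, file offset, and transfer length must all be aligned to the logical block size (typically 512 or 4096 bytes):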
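#define _GNU_SOURCE               // exposes O_DIRECT in <fcntl.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
// O_DIRECT requires an aligned buffer, file offset, and length
void *buf;
if (posix_memalign(&buf, 4096, 4096) != 0) abort();  // 4096-byte aligned, 4096 bytes long
int fd = open("/data/file", O_RDONLY | O_DIRECT);    // fd < 0 on failure
ssize_t n = read(fd, buf, 4096);                     // n < 0 on error (EINVAL if misaligned)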
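A request can be validated before it is issued. The helpers below are a minimal sketch with hypothetical names, assuming the alignment is a power of two such as 512 or 4096:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* True when the buffer address, file offset, and length are all multiples
 * of `align`. `align` must be a power of two (e.g. 512 or 4096). */
bool direct_io_aligned(const void *buf, off_t offset, size_t len, size_t align)
{
    uint64_t mask = (uint64_t)align - 1;
    return ((uint64_t)(uintptr_t)buf & mask) == 0 &&
           ((uint64_t)offset & mask) == 0 &&
           ((uint64_t)len & mask) == 0;
}

/* Rounds a length up to the next multiple of `align` (power of two). */
size_t round_up_to_align(size_t len, size_t align)
{
    return (len + align - 1) & ~(align - 1);
}
```

A read of 5,000 bytes, for example, would have to be issued as round_up_to_align(5000, 4096) = 8192 bytes into an aligned buffer, with the caller ignoring the tail.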
Performance comparison for a workload fitting in RAM:
Buffered I/O (cache warm): ~50 GB/s (reads from DRAM page cache)
O_DIRECT: ~7 GB/s (reads from NVMe each time)
For a workload exceeding RAM:
Buffered I/O (cache cold): ~7 GB/s (no cache benefit)
O_DIRECT: ~7 GB/s (equivalent, no double-copy overhead)
io_uring: Architecture and Performance
io_uring (Linux 5.1, 2019, authored by Jens Axboe) is the most significant I/O subsystem addition to Linux in decades. It solves fundamental scalability problems with POSIX AIO and epoll.
Problems with pre-io_uring I/O:
- read/write: blocking syscalls, one syscall per I/O.
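- POSIX AIO (aio_read): glibc implements it in user space with a pool of threads issuing blocking calls, so it adds thread overhead instead of true kernel asynchrony.
- Linux native AIO (io_submit, libaio): reliably asynchronous only for O_DIRECT files; many operations still block during submission.
- epoll + non-blocking I/O: good for sockets, but readiness notification does not work for regular files. poll and select always report them as ready, epoll_ctl rejects them with EPERM, and a read of a cold page still blocks.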
io_uring architecture: two lock-free ring buffers shared between kernel and user space.
        User Space                          Kernel Space
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  ┌──────────────────┐            ┌──────────────────────────┐  │
│  │ Submission Queue │ ──mmap──►  │ SQ Ring Buffer           │  │
│  │ (SQ Ring)        │            │                          │  │
│  │ [sqe][sqe][sqe]  │            │ io_uring kernel worker   │  │
│  └──────────────────┘            │                          │  │
│  ┌──────────────────┐            │                          │  │
│  │ Completion Queue │ ◄──mmap──  │ CQ Ring Buffer           │  │
│  │ (CQ Ring)        │            │ [cqe][cqe][cqe]          │  │
│  └──────────────────┘            └──────────────────────────┘  │
│                                                                │
│  io_uring_enter (1 syscall) to submit a batch                  │
│  or SQPOLL mode: 0 syscalls                                    │
└────────────────────────────────────────────────────────────────┘
SQE (Submission Queue Entry): describes an I/O operation (opcode, file descriptor, buffer, offset, flags).
CQE (Completion Queue Entry): result of a completed operation (res = return value, user_data for correlation).
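The head/tail mechanics behind the SQ and CQ rings can be modeled with a single-producer, single-consumer ring. This is a simplified teaching model with hypothetical names (toy_ring and its functions), not the kernel's actual memory layout: the real rings hold SQE and CQE structs and live in mmap'd memory shared with the kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

struct toy_ring {
    _Atomic uint32_t head;              /* advanced only by the consumer */
    _Atomic uint32_t tail;              /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];       /* stand-in for SQEs or CQEs */
};

/* Producer side: returns false when the ring is full. */
bool toy_ring_push(struct toy_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & RING_MASK] = value;
    /* Release: the slot contents become visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool toy_ring_pop(struct toy_ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & RING_MASK];
    /* Release: the slot is free for reuse once the new head is visible. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, no lock is needed. In io_uring the application is the producer on the SQ ring and the consumer on the CQ ring, with the kernel in the opposite roles.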
Key features:
- Fixed buffers (IORING_REGISTER_BUFFERS): register buffers once through io_uring_register; avoids per-I/O buffer mapping in the kernel.
- Fixed files (IORING_REGISTER_FILES): register a file descriptor table once; avoids per-I/O fd table lookup.
- SQPOLL (IORING_SETUP_SQPOLL): a dedicated kernel thread polls the SQ ring. The application submits by writing to the SQ ring with no syscall while the poller thread is awake.
- Linked operations: an SQE with the IOSQE_IO_LINK flag creates a dependency chain (for example a write followed by an fsync, queued as one unit).
- Timeout operations: IORING_OP_TIMEOUT enables I/O timeouts without extra threads.
- Vectored I/O: IORING_OP_READV / IORING_OP_WRITEV support scatter-gather.
Performance numbers (Axboe's 2019 benchmarks):
Baseline: 4 KB random read, 1 CPU core
libaio (Linux native AIO): 640,000 IOPS
io_uring (basic): 1,700,000 IOPS
io_uring + fixed bufs: 1,900,000 IOPS
io_uring + SQPOLL: 2,000,000 IOPS
That is roughly a 3x improvement over libaio, the previous best Linux async I/O interface, and it approaches the performance of userspace polling stacks such as SPDK.
Basic io_uring usage:
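#include <liburing.h>
#include <stdio.h>
#include <string.h>
struct io_uring ring;
int ret = io_uring_queue_init(32, &ring, 0); // 32 SQ entries, no special flags
if (ret < 0) { fprintf(stderr, "queue_init: %s\n", strerror(-ret)); return 1; }
// Submit a read (io_uring_prep_read needs Linux 5.6+; fd, buf, len, offset come from the caller)
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring); // NULL if the SQ ring is full
io_uring_prep_read(sqe, fd, buf, len, offset);
io_uring_sqe_set_data(sqe, my_context); // opaque pointer echoed back in the CQE
io_uring_submit(&ring); // one io_uring_enter syscall for everything queued
// Reap completions
struct io_uring_cqe *cqe;
ret = io_uring_wait_cqe(&ring, &cqe); // blocks until a completion arrives
if (ret == 0) {
    int res = cqe->res; // bytes read, or -errno on failure
    void *ctx = io_uring_cqe_get_data(cqe); // the pointer set at submission
    io_uring_cqe_seen(&ring, cqe); // mark the CQE as consumed
}
io_uring_queue_exit(&ring);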
Applications using io_uring: RocksDB (Facebook), QEMU, Nginx (experimental), Rust async runtimes such as tokio-uring, and numerous database engines.
SPDK: Storage Performance Development Kit
SPDK (Intel, 2016) takes the DPDK approach to storage: a userspace NVMe driver that bypasses the kernel entirely.
Architecture:
Application
│
▼
SPDK NVMe Driver (userspace)
│ MMIO doorbell writes to NVMe BAR registers; data moves by DMA
▼
NVMe PCIe Device
(no kernel interrupts — poll-mode driver)
- No kernel block layer, no VFS, no page cache.
- Application busy-polls a completion queue on a dedicated CPU core.
- Achieves near-theoretical NVMe IOPS: ~4 million IOPS per core on high-end NVMe.
- Used by: Ceph SPDK backend, Samsung's storage appliances, distributed storage systems.
- Trade-off: dedicated CPU core (spinning on poll loop), requires root/hugepages, complex integration.
Historical Context
Linux block I/O has a long history. The original ll_rw_blk (pre-2.6) was simple but didn't scale. The block layer rewrite by Jens Axboe (2.6.0, 2003) introduced the I/O elevator (CFQ, deadline, noop schedulers). The multi-queue block layer (blk-mq, 3.13, 2014) enabled per-CPU submission queues matching NVMe's hardware parallelism. io_uring (5.1, 2019) addressed the user-kernel interface bottleneck.
The POSIX AIO standard was written in the 1990s for magnetic disks; its thread-based glibc implementation was always a kludge. Linux kernel AIO (io_submit) was faster but had severe limitations: it was reliably asynchronous only with O_DIRECT (buffered file I/O could silently block during submission), and it had no general network support. io_uring was designed to replace both.
Production Examples
Case: PostgreSQL IOPS saturation. A PostgreSQL database on an AWS instance using gp2 EBS volumes (capped at about 3,000 IOPS) saturated at peak load. iostat -x 1 showed aqu-sz=28 and r_await=45 ms (normal: 2 ms). The fix was three-pronged: upgrade to gp3 provisioned at 16,000 IOPS, set effective_io_concurrency=64 (which controls how many prefetch requests PostgreSQL issues during bitmap heap scans), and add pg_prewarm to shared_preload_libraries so the buffer pool is repopulated on startup, reducing cold I/O.
Case: io_uring in production at Cloudflare. Cloudflare's quiche (QUIC/HTTP3 library) adopted io_uring for sending UDP packets. Eliminating per-packet sendmsg syscalls via io_uring batching reduced kernel time by 40% for their highest-throughput servers (blog.cloudflare.com, 2022).
Debugging Notes
# Block I/O latency histogram (BCC tool)
/usr/share/bcc/tools/biolatency -D 10
# Block I/O per-process breakdown
/usr/share/bcc/tools/biotop 1
# Block I/O trace (verbose)
blktrace -d /dev/nvme0n1 -o - | blkparse -i -
# Check I/O scheduler
cat /sys/block/nvme0n1/queue/scheduler
# For NVMe: use 'none' (no scheduler — device handles queuing)
echo none > /sys/block/nvme0n1/queue/scheduler
# Check queue depth setting
cat /sys/block/nvme0n1/queue/nr_requests
# For NVMe: try 1024+ (with scheduler 'none' the write fails if it exceeds the hardware queue depth)
echo 1024 > /sys/block/nvme0n1/queue/nr_requests
fio Configuration Examples
# Random read, 4K, queue depth 32 — measures IOPS
[global]
bs=4k
iodepth=32
direct=1
ioengine=io_uring
group_reporting=1
[randread]
rw=randread
size=10G
filename=/dev/nvme0n1
# Sequential write, 1M blocks — measures throughput
[seqwrite]
rw=write
bs=1M
iodepth=8
direct=1
ioengine=io_uring
size=50G
filename=/data/testfile
# Latency-focused: queue depth 1, small blocks
[latency]
rw=randread
bs=4k
iodepth=1
direct=1
ioengine=io_uring
lat_percentiles=1
percentile_list=95:99:99.9
numjobs=1
Run and parse:
fio --output-format=json randread.fio | jq '.jobs[0].read | {iops, bw, lat_ns}'
Security Implications
io_uring has had significant CVEs. The syscall's flexibility (it can represent many operations including IORING_OP_OPENAT, IORING_OP_SOCKET, IORING_OP_CONNECT) creates a large attack surface reachable from within seccomp sandboxes that only filter traditional syscall numbers. Notable issues:
- CVE-2022-29582: Use-after-free in io_uring timeout handling. Exploited for local privilege escalation.
- CVE-2023-2598: Out-of-bounds access to physical memory through io_uring fixed buffer registration. LPE.
- ChromeOS disables io_uring, and Android blocks app access to it with seccomp filtering (Google, 2023), due to the volume of exploitable bugs.
- RHEL 9 and some Ubuntu LTS versions restrict io_uring via sysctl kernel.io_uring_disabled=1 for untrusted workloads (1 limits io_uring to privileged processes; 2 disables it for all processes).
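A service can check the sysctl before it attempts to set up a ring. The sketch below uses hypothetical function names and assumes the setting is exposed at /proc/sys/kernel/io_uring_disabled, which exists only on kernels that carry the sysctl (mainline 6.6 and later, or vendor backports). The value meanings follow the kernel sysctl documentation.

```c
#include <stdio.h>

/* Reads an integer from a sysctl-style file. Returns the value, or -1 if
 * the file is missing or unparsable (older kernels lack this sysctl). */
int read_io_uring_disabled(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int value = -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Maps the sysctl value to a human-readable policy description. */
const char *io_uring_policy(int value)
{
    switch (value) {
    case 0:  return "enabled for all processes";
    case 1:  return "restricted to privileged processes or io_uring_group";
    case 2:  return "disabled for all processes";
    default: return "unknown (sysctl absent or unreadable)";
    }
}
```

A typical call is io_uring_policy(read_io_uring_disabled("/proc/sys/kernel/io_uring_disabled")). A result of -1 does not prove io_uring is usable, since seccomp or a kernel built without io_uring can still block it, so a fallback to another I/O engine is still needed.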
O_DIRECT file I/O can be used to bypass mandatory locking on some filesystems—always audit O_DIRECT use in security-sensitive contexts.
Performance Implications
The kernel page cache greatly accelerates reads of hot data (DRAM speeds for disk data). Bypassing it with O_DIRECT eliminates this benefit. Databases that use O_DIRECT (for example MySQL InnoDB) manage their own buffer pools to recoup it, with application-specific cache replacement policies (such as MRU-style eviction for sequential scans, so that a large scan does not thrash the buffer pool).
Write coalescing: the page cache merges adjacent writes before flushing to disk. O_DIRECT writes each go to disk individually (unless the application implements its own write coalescing). This can reduce write efficiency for small, scattered writes.
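An application using O_DIRECT can do its own coalescing before submitting writes. A minimal sketch, with a hypothetical extent struct and function name, assuming the pending writes are already sorted by offset:

```c
#include <stddef.h>
#include <stdint.h>

struct extent {
    uint64_t offset;   /* byte offset in the file */
    uint64_t length;   /* byte length of the write */
};

/* Merges adjacent or overlapping extents in place. The input must already
 * be sorted by offset. Returns the number of extents after merging. */
size_t coalesce_extents(struct extent *ext, size_t count)
{
    if (count == 0)
        return 0;
    size_t out = 0;                         /* index of the last merged extent */
    for (size_t i = 1; i < count; i++) {
        uint64_t end = ext[out].offset + ext[out].length;
        if (ext[i].offset <= end) {         /* touches or overlaps: extend */
            uint64_t new_end = ext[i].offset + ext[i].length;
            if (new_end > end)
                ext[out].length = new_end - ext[out].offset;
        } else {
            ext[++out] = ext[i];            /* gap: start a new extent */
        }
    }
    return out + 1;
}
```

Three back-to-back 4 KB writes at offsets 0, 4096, and 8192 collapse into a single 12 KB write, while a write at offset 65536 stays separate. The sketch only merges the ranges; a real implementation would also have to gather the data into one aligned buffer or an iovec.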
Failure Modes and Real Incidents
io_uring LPE exploits (2022–2023): Several local privilege escalation exploits targeting io_uring vulnerabilities were published. CISA and Google recommended disabling io_uring in environments running untrusted code. The kernel community has invested heavily in io_uring hardening (codebase audits, fuzzing with syzkaller), but the complexity of the io_uring codebase remains a concern.
EBS volume stall during snapshot (AWS, 2019): AWS EBS gp2 volumes occasionally had their IOPS burst credits exhausted during automated snapshot operations, causing production I/O latency to spike from 2 ms to 500 ms. Applications that relied on burst IOPS for normal operation failed. Lesson: provision baseline IOPS (gp3 with explicit IOPS, or io1/io2) for latency-sensitive production databases.
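How long a gp2 volume can sustain burst I/O can be estimated ahead of time. The sketch below assumes the credit model AWS documents for gp2 (a 5.4 million credit bucket, a 3,000 IOPS burst ceiling, and a baseline of 3 IOPS per GiB clamped between 100 and 16,000); those constants and the function names are assumptions to verify against current AWS documentation.

```c
#define GP2_CREDIT_BUCKET 5400000.0   /* I/O credits in a full bucket (assumed) */
#define GP2_BURST_IOPS    3000.0      /* burst ceiling (assumed) */
#define GP2_MIN_BASELINE  100.0
#define GP2_MAX_BASELINE  16000.0

/* Baseline IOPS for a gp2 volume: 3 IOPS per GiB, clamped to [100, 16000]. */
double gp2_baseline_iops(double size_gib)
{
    double iops = 3.0 * size_gib;
    if (iops < GP2_MIN_BASELINE) iops = GP2_MIN_BASELINE;
    if (iops > GP2_MAX_BASELINE) iops = GP2_MAX_BASELINE;
    return iops;
}

/* Seconds a full credit bucket lasts under a sustained demand, which is
 * capped at the burst ceiling. Returns -1.0 when demand does not exceed
 * the baseline, because credits then never drain. */
double gp2_burst_seconds(double size_gib, double demand_iops)
{
    double baseline = gp2_baseline_iops(size_gib);
    if (demand_iops > GP2_BURST_IOPS)
        demand_iops = GP2_BURST_IOPS;
    if (demand_iops <= baseline)
        return -1.0;
    return GP2_CREDIT_BUCKET / (demand_iops - baseline);
}
```

Under these assumptions a 100 GiB volume driven at 3,000 IOPS drains its bucket in 2,000 seconds (about 33 minutes), after which it falls back to 300 IOPS. A snapshot or backup job that runs longer than that can push a burst-dependent database into the stall described above.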
Modern Usage
io_uring underpins several Rust async runtimes (tokio-uring, glommio) and is available as an optional I/O path in RocksDB. SPDK continues to be used in high-performance storage appliances. The combination of io_uring + O_DIRECT + fixed buffers can approach raw NVMe throughput from userspace without the complexity and privilege requirements of SPDK.
Future Directions
- io_uring with IOMMU passthrough: enabling io_uring operations to DMA directly from NVMe to user memory without kernel copies.
- Persistent memory (CXL) integration with io_uring: treating CXL-attached memory as a storage tier with byte-addressable semantics.
- NVMe-oF (NVMe over Fabrics): extending NVMe queue model over RDMA or TCP, bringing low-latency NVMe semantics to disaggregated storage.
Exercises
- Run fio with ioengine=io_uring and ioengine=libaio on the same NVMe device for random 4 KB reads at QD=32. Compare IOPS and latency distributions. Explain the difference.
- Use iostat -x 1 during a fio run. Identify the point at which increasing queue depth no longer improves IOPS (the device's saturation point). Plot IOPS vs. queue depth from QD1 to QD128.
- Write a simple C program that opens a file with O_DIRECT and O_RDONLY, reads 4 KB aligned blocks, and measures latency per read using clock_gettime(CLOCK_MONOTONIC). Compare average latency with and without O_DIRECT.
- Compare iostat output for a sequential read workload vs. a random read workload (both using fio). Explain the difference in aqu-sz, await, and %util.
- Investigate io_uring security: look up CVE-2022-29582. Describe the vulnerability class, how it was exploited, and what kernel version introduced the fix.
References
- Axboe, J. "Efficient IO with io_uring." https://kernel.dk/io_uring.pdf (2019).
- Axboe, J. "io_uring and the new frontier of I/O performance." KernelTLK 2019.
- Intel SPDK documentation: https://spdk.io/doc/
- fio documentation: https://fio.readthedocs.io/
- Gregg, B. Systems Performance (2nd ed., 2020). Chapter 9: Disks.
- CVE-2022-29582: https://nvd.nist.gov/vuln/detail/CVE-2022-29582
- Google Security Blog: "Learnings from kCTF VRP's 42 Linux kernel exploits submissions." https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-regarding-io.html
- Cloudflare: "Missing Manuals - io_uring worker pool." https://blog.cloudflare.com/missing-manuals-io_uring-worker-pool/