io_uring: High-Performance Async I/O for Linux

Overview

io_uring is a Linux kernel interface for asynchronous I/O, introduced by Jens Axboe and merged in Linux 5.1 (May 2019). It represents the most significant redesign of Linux I/O primitives in decades, solving fundamental problems that plagued the older POSIX AIO interface, and enabling I/O throughput that approaches the theoretical hardware limits of modern NVMe storage and high-speed networking.

At its core, io_uring uses two memory-mapped ring buffers shared between kernel and userspace to eliminate system call overhead on the common I/O path. In its most aggressive configuration, a process can submit and reap I/O completions with zero system calls.

Prerequisites

Understanding of POSIX I/O (read, write, open, close)
Familiarity with Linux system calls and file descriptors
Basic understanding of ring buffers and memory-mapped I/O
Awareness of the difference between blocking and non-blocking I/O
Knowledge of epoll/select/poll for context on prior async approaches

The Problem io_uring Solves

The History of Async I/O Attempts

POSIX AIO (aio_read, aio_write) was the first standardized async I/O interface. On Linux it has fundamental limitations:

glibc implements it entirely in userspace with a pool of threads that issue ordinary blocking reads and writes, so every operation pays for thread handoff and context switches.
Submitting an operation and collecting its result take separate calls (aio_error, aio_return, aio_suspend, or signal delivery).
No support for non-file operations: you cannot asynchronously wait on a network accept(), connect(), or send().

epoll handles readiness notification for network sockets and pipes but does not perform the I/O itself. It only reports that a file descriptor is ready; the application must then issue the read() or write() call itself. epoll also cannot be used with regular files, so it does not help with storage.

libaio (Linux's native io_submit/io_getevents) moved async I/O into the kernel, but it is reliably asynchronous only for files opened with O_DIRECT (bypassing the page cache); buffered I/O can block inside io_submit(). It still needs a system call to submit and another to reap completions, and it has almost no support for networking.

What Applications Actually Need

A high-performance server needs to:

1. Issue many I/O operations (reads, writes, sends, receives, accepts) concurrently
2. Be notified when any of them complete
3. Chain operations: receive data, parse it, write it to disk, then acknowledge the network write
4. Do all this with minimal CPU overhead

Before io_uring, achieving this required either a thread-per-connection model (which collapses under high concurrency due to context switch overhead) or a complex epoll-based event loop that still issued at least one system call per I/O operation on top of each epoll_wait().

io_uring Design

Two Ring Buffers

io_uring exposes two ring buffers, both memory-mapped into the userspace process:

Submission Queue (SQ): The application writes I/O operation descriptors here. The kernel reads from it.
Completion Queue (CQ): The kernel writes completion events here. The application reads from it.

The buffers are lock-free ring buffers with producer/consumer head and tail pointers. The application only calls into the kernel (io_uring_enter()) to notify it that new submissions are available—or, with SQPOLL mode, not at all.

io_uring RING DIAGRAM

USERSPACE                         KERNEL
+---------------------------------+----------------------------------+
|                                 |                                  |
|   Application                   |   io_uring subsystem             |
|                                 |                                  |
|  +--SQ Ring Buffer-----------+  |  +--SQ Ring Buffer-----------+  |
|  | [SQE][SQE][SQE][ ][ ][ ] |  |  | [SQE][SQE][SQE][ ][ ][ ] |  |
|  |  ^head         ^tail      |  |  |  ^head         ^tail      |  |
|  +--------mmap shared--------+  |  +--------mmap shared--------+  |
|       App writes SQEs here      |       Kernel reads SQEs here     |
|                                 |                                  |
|  +--CQ Ring Buffer-----------+  |  +--CQ Ring Buffer-----------+  |
|  | [CQE][CQE][ ][ ][ ][ ]   |  |  | [CQE][CQE][ ][ ][ ][ ]   |  |
|  |  ^head         ^tail      |  |  |  ^head         ^tail      |  |
|  +--------mmap shared--------+  |  +--------mmap shared--------+  |
|       App reads CQEs here       |       Kernel writes CQEs here    |
|                                 |                                  |
|   io_uring_enter() syscall  --> |   Processes queued submissions   |
|   (only needed if not SQPOLL)   |   Issues I/O asynchronously      |
+---------------------------------+----------------------------------+

SQE = Submission Queue Entry (describes the operation)
CQE = Completion Queue Entry (describes the result)
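
The head/tail mechanics described above can be sketched in a few lines of C. This is a simplified single-producer/single-consumer model, not the kernel's actual ring layout: the names demo_ring, demo_ring_push, and demo_ring_pop are hypothetical, and the ring holds plain 64-bit values rather than SQEs or CQEs. It assumes, as io_uring does, a power-of-two ring size with free-running indices that are masked on access.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

static_assert((RING_ENTRIES & RING_MASK) == 0, "ring size must be a power of two");

/* Hypothetical single-producer/single-consumer ring (not the kernel layout). */
struct demo_ring {
    _Atomic uint32_t head;              /* advanced only by the consumer */
    _Atomic uint32_t tail;              /* advanced only by the producer */
    uint64_t slots[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
bool demo_ring_push(struct demo_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    /* Indices run freely; unsigned wraparound keeps the difference correct. */
    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & RING_MASK] = value;
    /* Release: the slot contents become visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool demo_ring_pop(struct demo_ring *r, uint64_t *out)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;
    *out = r->slots[head & RING_MASK];
    /* Release: the producer may reuse the slot only after this store. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

For the SQ the application plays the producer role and the kernel the consumer; for the CQ the roles are reversed. Because each index has exactly one writer, no lock is needed.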

The Submission Queue Entry (SQE)

Each SQE describes one I/O operation:

struct io_uring_sqe {
    __u8  opcode;        // IORING_OP_READ, WRITE, ACCEPT, etc.
    __u8  flags;         // IOSQE_FIXED_FILE, IOSQE_IO_LINK, etc.
    __u16 ioprio;        // I/O priority
    __s32 fd;            // file descriptor (or fixed file index)
    union {
        __u64 off;       // offset in file
        __u64 addr2;     // for some ops
    };
    __u64 addr;          // pointer to buffer
    __u32 len;           // buffer length
    union {
        __kernel_rwf_t rw_flags;
        __u32 fsync_flags;
        __u16 poll_events;
        __u32 sync_range_flags;
        __u32 msg_flags;
        __u32 timeout_flags;
        __u32 accept_flags;
        __u32 cancel_flags;
        __u32 open_flags;
        __u32 statx_flags;
        __u32 fadvise_advice;
        __u32 splice_flags;
    };
    __u64 user_data;     // returned unchanged in CQE — for correlation
    union {
        __u16 buf_index; // for registered buffer pools
        __u64 __pad2[3];
    };
};

The user_data field is critical: it is an opaque 64-bit value the kernel returns unchanged in the completion event, allowing the application to correlate completions with the original requests (typically a pointer to a request context struct).
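
One common way to use user_data is to store a pointer to a per-request context and cast it back when the completion arrives. The sketch below assumes a hypothetical request_ctx struct and helper names; only the 64-bit round trip itself reflects how the field behaves.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(uintptr_t) <= sizeof(uint64_t), "pointer must fit in user_data");

/* Hypothetical per-request context; the field names are illustrative. */
struct request_ctx {
    int fd;
    int kind;          /* e.g. 0 = read, 1 = write */
    char *buf;
};

/* Pack a context pointer into the 64-bit user_data value of an SQE. */
uint64_t ctx_to_user_data(struct request_ctx *ctx)
{
    return (uint64_t)(uintptr_t)ctx;
}

/* Recover the context pointer from the user_data of a CQE. */
struct request_ctx *user_data_to_ctx(uint64_t user_data)
{
    return (struct request_ctx *)(uintptr_t)user_data;
}
```

The context must stay alive until its completion has been processed, since the kernel only carries the value and knows nothing about what it points to.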

The Completion Queue Entry (CQE)

struct io_uring_cqe {
    __u64 user_data; // same value from SQE
    __s32 res;       // result (errno negated on error, or bytes transferred)
    __u32 flags;     // IORING_CQE_F_MORE for multishot, etc.
};
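
Unlike a classic system call, an io_uring operation does not report failure through the errno variable; the negated error code arrives in res. A minimal sketch of decoding it follows, using a local demo_cqe struct that mirrors the three fields above (an assumption made so the example does not depend on kernel headers) and a hypothetical helper name.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the three CQE fields shown above (not the UAPI header). */
struct demo_cqe {
    uint64_t user_data;
    int32_t  res;
    uint32_t flags;
};

/*
 * Returns 0 and stores the transferred byte count in *bytes on success.
 * Returns the positive errno value on failure and leaves *bytes untouched.
 */
int cqe_result(const struct demo_cqe *cqe, uint32_t *bytes)
{
    assert(cqe != NULL && bytes != NULL);
    if (cqe->res < 0)
        return -cqe->res;           /* e.g. -EAGAIN in res becomes EAGAIN */
    *bytes = (uint32_t)cqe->res;
    return 0;
}
```

The returned value can be passed to strerror() for logging, exactly like an errno value from a blocking call.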

Operation Types

io_uring has grown to cover a large share of the I/O-related syscall surface:

Operation	Description
`IORING_OP_NOP`	No-op, useful for testing and wakeups
`IORING_OP_READV` / `IORING_OP_WRITEV`	Vectored read/write (scatter-gather)
`IORING_OP_READ` / `IORING_OP_WRITE`	Simple read/write
`IORING_OP_FSYNC`	Flush file data to storage
`IORING_OP_POLL_ADD`	Wait for fd readiness (like epoll)
`IORING_OP_ACCEPT`	Accept incoming connection
`IORING_OP_CONNECT`	Connect to remote host
`IORING_OP_SEND` / `IORING_OP_RECV`	Network send/receive
`IORING_OP_SENDMSG` / `IORING_OP_RECVMSG`	Full sendmsg/recvmsg
`IORING_OP_OPENAT`	Open file
`IORING_OP_CLOSE`	Close file descriptor
`IORING_OP_STATX`	Extended file stat
`IORING_OP_SPLICE`	Zero-copy data transfer between fds
`IORING_OP_TEE`	Duplicate data between pipe fds
`IORING_OP_TIMEOUT`	Timer with completion event
`IORING_OP_ASYNC_CANCEL`	Cancel a pending operation
`IORING_OP_SOCKET`	Create a socket (5.19+)
`IORING_OP_URING_CMD`	Passthrough to device-specific handler (NVMe, etc.)

Submission Modes

Standard Mode: io_uring_enter()

In the default mode, the application fills SQEs into the ring, then calls io_uring_enter(ring_fd, to_submit, min_complete, flags):

to_submit: how many SQEs to hand to the kernel
min_complete: if non-zero and IORING_ENTER_GETEVENTS is set in flags, block until at least this many completions are available

This batches multiple operations into a single system call, amortizing syscall overhead across all submitted operations.
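
As a rough illustration, the call can be wrapped as below without liburing. The helper names enter_flags and ring_enter are hypothetical; the syscall number and the IORING_ENTER_GETEVENTS flag come from the kernel headers. The sketch assumes ring_fd was obtained from io_uring_setup() and that the SQ tail has already been published.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/io_uring.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* GETEVENTS is what makes the kernel wait for min_complete completions. */
unsigned enter_flags(unsigned min_complete)
{
    return min_complete > 0 ? IORING_ENTER_GETEVENTS : 0;
}

/*
 * Thin wrapper over the raw syscall (hypothetical helper, not liburing).
 * Returns the number of SQEs consumed, or -1 with errno set.
 */
int ring_enter(int ring_fd, unsigned to_submit, unsigned min_complete)
{
    assert(ring_fd >= 0);
    return (int)syscall(__NR_io_uring_enter, ring_fd, to_submit, min_complete,
                        enter_flags(min_complete), NULL, 0);
}
```

A call such as ring_enter(ring_fd, 32, 1) would hand 32 queued SQEs to the kernel and wait for at least one completion, all in a single system call.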

SQPOLL Mode: Zero-Syscall I/O

With IORING_SETUP_SQPOLL, the kernel creates a dedicated kernel thread that continuously polls the SQ ring for new entries. The application simply writes SQEs to the shared memory and the kernel thread picks them up—no system call required.

SQPOLL MODE

Application                     Kernel SQPOLL thread
    |                                    |
    | write SQE to ring (no syscall)     | while(true) {
    |                                    |   if (sq_ring has entries) {
    |                                    |     process_sqe();
    |                                    |   }
    | write SQE to ring                  |   if (idle too long) {
    |                                    |     sleep(); // app must
    |                                    |   }         // io_uring_enter to wake
    | read CQE from ring (no syscall)    |   write_cqe(result);
    |                                    | }

SQPOLL is a busy-poll mode: the kernel thread spins on the SQ. It is appropriate for extremely latency-sensitive storage or network applications where the CPU cost of polling is acceptable. The kernel thread idles (and must be explicitly woken via io_uring_enter) if no new SQEs arrive within a configurable time window.
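
The wake-up rule can be sketched as follows. When the polling thread goes to sleep, the kernel sets IORING_SQ_NEED_WAKEUP in the flags word of the shared SQ ring, and the application must then call io_uring_enter() with IORING_ENTER_SQ_WAKEUP. The helper names are hypothetical, and modelling the flags word as a C11 atomic is an assumption for the example; a real program reads it from the mmap'd ring at the offset given by io_uring_setup().

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * sq_flags stands in for the flags word of the mmap'd SQ ring.
 * Call this after publishing the new SQ tail.
 */
bool sqpoll_needs_wakeup(_Atomic unsigned *sq_flags)
{
    assert(sq_flags != NULL);
    /* Full fence: the tail store must be ordered before the flags load. */
    atomic_thread_fence(memory_order_seq_cst);
    return (atomic_load_explicit(sq_flags, memory_order_relaxed)
            & IORING_SQ_NEED_WAKEUP) != 0;
}

/* Flags for io_uring_enter(); a result of 0 means no syscall is needed. */
unsigned sqpoll_enter_flags(_Atomic unsigned *sq_flags)
{
    return sqpoll_needs_wakeup(sq_flags) ? IORING_ENTER_SQ_WAKEUP : 0;
}
```

On the fast path the flag is clear and the submission costs only the memory writes to the ring; the system call is paid only after an idle period.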

Linked Requests

A fundamental limitation of epoll-based designs is the need to return to the event loop between I/O stages: receive data, process, send response. Each hop through the event loop adds latency and scheduling overhead.

io_uring's linked requests allow chaining operations so that the completion of one automatically triggers submission of the next, without returning to userspace:

LINKED REQUEST CHAIN

SQE[0]: RECV from fd 5                (flags: IOSQE_IO_LINK)
   |
   +-> on completion, auto-submit:
SQE[1]: WRITE to fd 7 (same buffer)   (flags: IOSQE_IO_LINK)
   |
   +-> on completion, auto-submit:
SQE[2]: FSYNC fd 7                    (no link flag — end of chain)

All three operations complete without a single userspace round-trip.
If any operation in the chain fails, subsequent linked operations
are cancelled with -ECANCELED.

Hard links (IOSQE_IO_HARDLINK) continue the chain even if a linked operation fails, useful for cleanup chains.
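
The difference between the two link types can be modelled in userspace. The sketch below is a simplified simulation, not kernel code: chain_op, link_kind, and resolve_chain are hypothetical names, and the model ignores details such as short reads (which also break a normal link). Given the result each operation would produce if it ran, it computes what each CQE would report.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model types; LINK_SOFT stands for IOSQE_IO_LINK,
 * LINK_HARD for IOSQE_IO_HARDLINK, LINK_NONE for the end of the chain. */
enum link_kind { LINK_NONE, LINK_SOFT, LINK_HARD };

struct chain_op {
    int raw_res;            /* result the op would produce if it ran */
    enum link_kind link;    /* how this op is linked to the NEXT one */
};

/* Fills cqe_res[i] with the result op i would report in its CQE. */
void resolve_chain(const struct chain_op *ops, size_t n, int *cqe_res)
{
    int severed = 0;

    assert(ops != NULL && cqe_res != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Once the chain is severed, every remaining op is cancelled. */
        cqe_res[i] = severed ? -ECANCELED : ops[i].raw_res;
        /* A failure severs the chain unless the link to the next op is hard. */
        if (cqe_res[i] < 0 && ops[i].link != LINK_HARD)
            severed = 1;
    }
}
```

With a failing first operation, a normal chain reports the error followed by -ECANCELED for every later entry, while a hard-linked chain still runs the later operations.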

Registered Buffers and Files

Every time the kernel processes an IORING_OP_READ with a user pointer, it must:

1. Validate the pointer
2. Pin the memory pages (prevent them from being swapped)
3. Build a scatter-gather list

For hot I/O paths, this per-operation overhead is significant.

Registered buffers (io_uring_register(IORING_REGISTER_BUFFERS)) allow pre-registering a set of buffers with the kernel ring. The pages are pinned once at registration time and mapped into kernel address space. Subsequent I/O operations (IORING_OP_READ_FIXED, IORING_OP_WRITE_FIXED) name the registered buffer by index in buf_index, with addr pointing inside that buffer, eliminating the per-operation pinning cost.

Registered files similarly allow pre-registering a set of file descriptors. Operations reference files by their index in the registered table rather than the actual fd, eliminating the per-operation fdget() reference counting overhead.
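
To make the two index-based mechanisms concrete, here is a sketch that fills an SQE for a read using both a registered buffer and a registered file. The helper name prep_read_fixed is hypothetical (liburing provides its own prep helpers); the struct, opcode, and flag come from <linux/io_uring.h>. It assumes the buffers and files were registered beforehand and only prepares the entry without submitting it.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill an SQE that reads `len` bytes at `offset` from registered file slot
 * `file_index` into memory inside registered buffer `buf_index`.
 */
void prep_read_fixed(struct io_uring_sqe *sqe, int file_index,
                     void *buf, unsigned len, uint64_t offset,
                     uint16_t buf_index, uint64_t user_data)
{
    assert(sqe != NULL && buf != NULL);
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_READ_FIXED;
    sqe->flags = IOSQE_FIXED_FILE;          /* fd is a registered-file index */
    sqe->fd = file_index;
    sqe->off = offset;
    sqe->addr = (uint64_t)(uintptr_t)buf;   /* must lie inside buffer buf_index */
    sqe->len = len;
    sqe->buf_index = buf_index;             /* which registered buffer to use */
    sqe->user_data = user_data;
}
```

Note that fd here is not a real file descriptor; with IOSQE_FIXED_FILE set the kernel interprets it as a slot in the registered table.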

io_uring Performance

NVMe Storage

With registered buffers and SQPOLL, io_uring can drive modern NVMe SSDs to their hardware limits:

IOPS COMPARISON (Samsung 980 Pro NVMe, 4KB random read, single thread)

Interface                  IOPS         Latency (p99)
---------------------------------------------------------
synchronous read()         ~100K        ~10 µs
libaio (O_DIRECT)          ~700K        ~8 µs
io_uring (standard mode)   ~800K        ~6 µs
io_uring (registered buf)  ~900K        ~5 µs
io_uring (SQPOLL)          ~1.2M        ~3 µs

NVMe hardware limit:       ~1.3M IOPS   ~2-3 µs

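The asynchronous rows assume many requests in flight. At queue depth 1, throughput is bounded by 1/latency (roughly 100K IOPS at 10 µs, as in the synchronous row), so the gains come from keeping the device queue full while spending less CPU per request. In that regime io_uring with SQPOLL approaches the hardware limit, something previously achievable only with SPDK-style kernel-bypass storage drivers.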

Network I/O

For network workloads, io_uring removes the need to call epoll_wait() and then a separate recv(). With IORING_OP_RECV in multishot mode (IORING_RECV_MULTISHOT, 6.0+), a single submission generates a completion event each time data arrives, so the application neither re-arms the receive nor makes extra system calls. Each such CQE carries IORING_CQE_F_MORE for as long as the request remains armed.
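
The application side of multishot receive comes down to checking IORING_CQE_F_MORE on every completion. The sketch below processes a batch of completions for one multishot request; recv_cqe and drain_multishot are hypothetical names, the struct is a local mirror of the CQE fields, and the fallback define is only there in case the installed kernel headers predate the flag.

```c
#include <assert.h>
#include <linux/io_uring.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#ifndef IORING_CQE_F_MORE
#define IORING_CQE_F_MORE (1U << 1)   /* fallback for older UAPI headers */
#endif

/* Local mirror of the CQE fields used here (not the UAPI struct). */
struct recv_cqe {
    uint64_t user_data;
    int32_t  res;
    uint32_t flags;
};

/*
 * Sums the bytes delivered by a batch of completions from one multishot
 * RECV. Sets *rearm when a CQE arrives without IORING_CQE_F_MORE, which
 * means the kernel has retired the request and a new SQE is required.
 */
long drain_multishot(const struct recv_cqe *cqes, size_t n, bool *rearm)
{
    long total = 0;

    assert(cqes != NULL && rearm != NULL);
    *rearm = false;
    for (size_t i = 0; i < n; i++) {
        if (cqes[i].res > 0)
            total += cqes[i].res;
        if (!(cqes[i].flags & IORING_CQE_F_MORE)) {
            *rearm = true;
            break;
        }
    }
    return total;
}
```

A multishot request typically ends on an error or when no buffer is available, so the final CQE often carries a negative res together with a cleared MORE flag.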

Security Issues

io_uring's rapid development and its deep integration with the kernel made it a fertile ground for security bugs:

CVE-2022-29582 (fixed in Linux 5.17.3): A use-after-free in the io_uring timeout handling code. A race condition between a timeout expiry and a cancellation could cause the kernel to access a freed io_kiocb structure. Exploitable for local privilege escalation.

CVE-2023-2598 (Linux 6.3): An out-of-bounds access in io_uring's fixed buffer registration (io_sqe_buffer_register). A registered buffer could be given access to physical memory beyond its end. Exploitable for local privilege escalation to root.

CVE-2022-1786 (Linux 5.10): Type confusion in io_uring's linked timeout handling.

Responses to the volume of io_uring CVEs:

Android: Disabled io_uring entirely (2023) for Android apps and GKI kernels; only re-enabling for specific privileged components
Ubuntu: Restricted unprivileged io_uring via seccomp in default profiles
Google gVisor: io_uring not supported in the gVisor sandbox kernel
Cloudflare: Disabled io_uring in their serverless sandbox environment

This represents a recurring tension in Linux development: groundbreaking interfaces that are complex enough to harbor subtle memory safety bugs, deployed to production before the security surface is fully understood.

Production Examples

Databases

PostgreSQL: version 18 added an optional io_uring I/O method (io_method = io_uring) for asynchronous reads such as sequential scans
RocksDB: Facebook's key-value store has experimental io_uring support for background compaction I/O
TiKV: Rust-based distributed KV store uses Tokio's io-uring backend for storage

Web Servers

NGINX: Experimental io_uring support for file reads and network operations
Hyper (Rust HTTP library): io-uring backend via Tokio
liburing: Jens Axboe's reference library; used by numerous server applications

Runtimes

Tokio (Rust async runtime): tokio-uring crate provides io_uring backend
Seastar (C++ high-performance framework): io_uring backend for storage I/O
Node.js: Experimental libuv backend using io_uring for file I/O

Failure Modes

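CQ ring overflow: If the application does not drain the CQ ring fast enough and it fills, kernels with IORING_FEAT_NODROP (5.5+) hold further completions in an internal overflow list and reject new submissions with -EBUSY until the ring is drained; older kernels drop the completions. Monitor io_uring_cq_has_overflow().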
SQPOLL thread starvation: The SQPOLL kernel thread competes for CPU like any other thread. On a heavily loaded system, it may not get scheduled promptly, negating the latency benefit.
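Fixed buffer exhaustion: The kernel does not manage the registered buffer pool. If more I/Os are in flight than there are registered buffers, the application must either wait for a buffer to free up or fall back to the slower unregistered path, which degrades performance silently unless it is instrumented.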
Linked request cancellation cascade: A failure in the first operation of a long chain cancels all subsequent operations. Applications must handle -ECANCELED in CQEs gracefully.
Memory pressure and page pinning: Registered buffers pin memory, preventing it from being reclaimed or swapped. On memory-constrained systems, large registrations can fail against RLIMIT_MEMLOCK or contribute to OOM conditions.

Debugging Notes

# Check io_uring support
uname -r  # need 5.1+; 5.10+ for most production features

# Use liburing's io_uring_queue_init_params with IORING_SETUP_SQPOLL
# Set sq_thread_idle (ms) to control SQPOLL idle sleep time

# Monitor io_uring operations via tracepoints
perf trace -e 'io_uring:*' -p <PID>

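# BPF-based io_uring tracing
# (list the available tracepoints with: bpftrace -l 'tracepoint:io_uring:*';
#  newer kernels name this one io_uring_submit_req)
bpftrace -e 'tracepoint:io_uring:io_uring_submit_sqe {
    printf("op=%d user_data=0x%llx\n", args->opcode, args->user_data);
}'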

# Check for CQ overflow
# Application should call io_uring_cq_has_overflow() periodically
# (io_uring_cq_eventfd_enabled() only reports eventfd notification state)

# Inspect ring state and registered buffers/files
# (fdinfo lists SQ/CQ head and tail plus the UserFiles and UserBufs tables)
cat /proc/<PID>/fdinfo/<RING_FD>

# strace io_uring (will show io_uring_enter calls)
strace -e io_uring_setup,io_uring_register,io_uring_enter ./app

Debugging CVE-2022-29582 Pattern

The use-after-free class of bugs in io_uring typically involves:

An operation in flight with a req (io_kiocb) pointer
A concurrent cancellation or timeout that frees req
The original operation's completion handler accessing the now-freed req

Timeline of UAF:

t1: app submits TIMEOUT SQE -> kernel allocates req_A
t2: app cancels TIMEOUT     -> cancellation path frees req_A
t3: timeout fires           -> timer callback accesses req_A  <- UAF

Modern kernels address this with reference counting (the req_ref_* helpers) and careful ordering of cancellation and completion paths.
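
The reference-counting idea can be illustrated in userspace. This is a deliberately simplified sketch, not kernel code: demo_req, demo_req_new, and demo_req_put are hypothetical names, and the freed_flag field exists only so the example can observe when the object is released. Each path that may touch the request (here, the cancellation path and the timer path) holds its own reference, and only the last one to drop its reference frees the object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical request object; not the kernel's io_kiocb. */
struct demo_req {
    atomic_int refs;
    int *freed_flag;    /* set on release, for demonstration only */
};

/* Allocate a request with one reference per path that will touch it. */
struct demo_req *demo_req_new(int initial_refs, int *freed_flag)
{
    struct demo_req *req = malloc(sizeof(*req));

    if (req == NULL)
        return NULL;
    atomic_init(&req->refs, initial_refs);
    req->freed_flag = freed_flag;
    return req;
}

/* Drop one reference; frees the object and returns true only for the last holder. */
bool demo_req_put(struct demo_req *req)
{
    assert(atomic_load(&req->refs) > 0);
    /* fetch_sub returns the previous value; 1 means this was the last reference. */
    if (atomic_fetch_sub_explicit(&req->refs, 1, memory_order_acq_rel) != 1)
        return false;
    *req->freed_flag = 1;
    free(req);
    return true;
}
```

In the timeline above, the cancellation at t2 would merely drop one reference, leaving req_A valid until the timer callback at t3 drops the other.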

Performance Implications

For storage workloads with many concurrent I/Os, io_uring provides 2–4x throughput improvement over pread()/pwrite() by eliminating per-operation system call overhead
For latency-sensitive single-operation workloads, io_uring with SQPOLL reduces storage latency from ~10 µs (synchronous) to ~3 µs
For network workloads, the benefit is smaller because epoll + recv() is already reasonably efficient; io_uring primarily helps when combining storage and network I/O in the same event loop
Registered files alone reduce CPU overhead per I/O by ~8–12% according to Axboe's benchmarks

Modern Usage

Linux 6.x: IORING_OP_URING_CMD for NVMe passthrough (zero-copy from NVMe to userspace), multishot ACCEPT/RECV for high-connection-rate servers, io_uring_buf_ring for buffer selection by the kernel
io_uring + BPF: Proposals for BPF programs attached to io_uring completions for custom dispatch logic
Userspace storage drivers via io_uring: IORING_OP_URING_CMD allows NVMe driver to deliver completions directly to io_uring, enabling SPDK-like performance without kernel bypass

Future Directions

Zero-copy networking: IORING_OP_SEND_ZC (zero-copy send, 6.0) and IORING_OP_RECV_ZC (zero-copy receive) eliminate data copying for large network transfers
io_uring for everything: Proposals to route epoll, futex, and even signals through io_uring to create a unified async event interface
Security hardening: Ongoing effort to formally audit io_uring's locking and reference counting; potential use of Rust for new io_uring subsystem code
io_uring in containers: Seccomp profiles and container runtimes adapting to allow safe io_uring use without full CAP_SYS_ADMIN

Exercises

Write a simple file copy program using liburing that reads from source and writes to destination using linked SQEs (IOSQE_IO_LINK). Compare performance with a cp-like synchronous implementation using pread()/pwrite().
Implement a minimal HTTP/1.1 echo server using io_uring that handles ACCEPT, RECV, and SEND all via io_uring. Measure requests per second vs an epoll-based implementation.
Enable SQPOLL mode and measure latency for 4KB random reads on an NVMe device. At what concurrency level does SQPOLL outperform standard mode?
Study the io_uring source code in io_uring/ (Linux kernel tree). Trace the lifecycle of a single IORING_OP_READ: from SQE intake to CQE delivery. Identify where reference counting happens.
Reproduce the mitigation for io_uring in Android: write a seccomp-bpf policy that blocks io_uring_setup and io_uring_enter syscalls. Verify it works with a test program.

References

Axboe, J. "Efficient IO with io_uring" (kernel.dk, 2019) — the primary design document
Hussain, S. "Lord of the io_uring" (online guide with liburing tutorials and examples)
liburing GitHub: https://github.com/axboe/liburing — reference library with examples
Linux kernel documentation: Documentation/block/uring.rst
LWN.net: "The rapid growth of io_uring" (Corbet, 2020)
ScyllaDB blog: "How io_uring and eBPF Will Revolutionize Programming in Linux" (2020)
Google Project Zero: Analysis of io_uring security vulnerabilities (2022)
Linux kernel io_uring/ directory — the implementation (Axboe, et al.)