
07 — Zero-Copy Networking

Technical Overview

Conventional data transfer between a file or application buffer and the network involves multiple memory copies that consume CPU cycles and memory bandwidth without performing any computation. Zero-copy networking techniques eliminate or minimize these redundant copies by allowing data to flow from its source directly to the NIC's DMA engine without transiting through user-space buffers. At 25 Gbps and above, the memory subsystem — not computation — becomes the bottleneck, making zero-copy techniques critical infrastructure for high-performance storage, CDN, and distributed systems.


Prerequisites

  • Virtual memory: page tables, physical vs virtual addresses, page pinning
  • DMA (Direct Memory Access): how NICs read/write host memory independently
  • Linux system calls: read(), write(), sendfile(), splice(), mmap()
  • Socket buffer internals (see 02-sockets.md)
  • Linux network stack data path (see 01-linux-network-stack.md)

Core Content

The Copy Problem

A naive file-to-network transfer with standard system calls (read() + write()):

Disk                    Kernel                    User              NIC
 |                        |                        |                 |
 |--- DMA read ---------> |                        |                 |
 |                    page cache                   |                 |
 |                        |--- copy to user buf -->|                 |
 |                        |                    user buffer           |
 |                        |<-- copy to sock buf ---|                 |
 |                        |                        |                 |
 |                        |--- DMA write ----------------------------->|

That is 4 data movements: 2 DMA transfers and 2 CPU memory copies (kernel→user, user→kernel). The two system calls cost 4 user/kernel mode transitions, and the data passes through the L1/L2 caches twice, evicting hot application data.
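
For reference, the copy-based path in the diagram can be sketched as a plain read()/write() loop. This is a minimal sketch, not code from any particular server; the function name copy_read_write and the 64 KiB buffer size are illustrative assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Copy everything from in_fd to out_fd through a user-space buffer.
 * Returns the number of bytes copied, or -1 on error (errno is set).
 * copy_read_write is a hypothetical helper name used for illustration. */
ssize_t copy_read_write(int in_fd, int out_fd) {
    char buf[64 * 1024];   /* the user buffer both CPU copies pass through */
    ssize_t total = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));   /* copy 1: kernel -> user */
        if (n == 0)
            return total;                            /* end of input */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        for (ssize_t off = 0; off < n; ) {
            /* copy 2: user -> kernel socket buffer */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        total += n;
    }
}
```

Every byte crosses the user/kernel boundary twice here, which is exactly the work the techniques in the following sections avoid.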

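At 10 Gbps the payload rate is 10 Gbps / 8 = 1.25 GB/s. Each CPU copy reads and writes every byte, so the two copies generate about 4 × 1.25 = 5 GB/s of memory traffic. Against roughly 50 GB/s of available memory bandwidth that is about 10% (2.5% if only the payload is counted once), and the cache pollution effect multiplies the effective cost 5–10x in practice.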
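
The arithmetic above can be parameterized. The sketch below is illustrative only: the function names are hypothetical, and it assumes each CPU copy reads and writes every payload byte once, so a copy costs twice the payload rate in memory traffic (the 1.25 GB/s figure counts the payload only once).

```c
#include <assert.h>

/* Payload rate in GB/s for a link speed given in Gbps (8 bits per byte). */
double payload_gbs(double link_gbps) {
    return link_gbps / 8.0;
}

/* Memory traffic in GB/s caused by CPU copies, assuming every copy
 * reads each payload byte once and writes it once. */
double copy_traffic_gbs(double link_gbps, int cpu_copies) {
    assert(cpu_copies >= 0);
    return payload_gbs(link_gbps) * 2.0 * cpu_copies;
}

/* Fraction of an assumed memory bandwidth budget (GB/s) used by copies. */
double copy_bandwidth_fraction(double link_gbps, int cpu_copies, double mem_bw_gbs) {
    assert(mem_bw_gbs > 0.0);
    return copy_traffic_gbs(link_gbps, cpu_copies) / mem_bw_gbs;
}
```

For a 10 Gbps link, 2 CPU copies, and an assumed 50 GB/s budget, copy_traffic_gbs gives 5 GB/s and copy_bandwidth_fraction gives about 0.1. With 0 CPU copies (the sendfile case) both are zero.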


sendfile(): Kernel-to-Kernel Zero Copy

sendfile(out_fd, in_fd, offset, count) was introduced in Linux 2.2. It transfers data from an open file descriptor directly to a socket descriptor entirely within the kernel, without copying to user space:

Disk                    Kernel                              NIC
 |                        |                                  |
 |--- DMA read ---------> |                                  |
 |                    page cache                             |
 |                        |--- DMA gather (scatter/gather) ->|
 |                        |    (file pages referenced,       |
 |                        |     no memory copy)              |

2 DMA operations, 0 memory copies (with sendfile + DMA scatter-gather NIC support). The kernel passes page references to the NIC DMA engine rather than copying data.

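#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the whole file at `path` to `sockfd`; returns bytes sent, or -1 on error. */
ssize_t serve_file(int sockfd, const char *path) {
    int filefd = open(path, O_RDONLY);
    if (filefd < 0)
        return -1;
    struct stat st;
    if (fstat(filefd, &st) < 0) {
        close(filefd);
        return -1;
    }
    off_t offset = 0;                  /* advanced by each sendfile() call */
    while (offset < st.st_size) {      /* sendfile() may send less than requested */
        ssize_t n = sendfile(sockfd, filefd, &offset, (size_t)(st.st_size - offset));
        if (n <= 0)
            break;                     /* error, or file shrank underneath us */
    }
    close(filefd);
    return offset == st.st_size ? (ssize_t)offset : -1;
}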

Real performance impact: nginx uses sendfile for static file serving. Benchmark comparison (serving 1MB files at 10 Gbps):

Method              CPU utilization   Throughput
read() + write()    85%               4.2 Gbps
sendfile()          45%               9.8 Gbps

The 40-percentage-point CPU reduction (85% → 45%, roughly half the CPU cost) and near line-rate throughput are typical. The gain is larger for large files and smaller for small files, where per-request syscall overhead dominates.

Enable in nginx:

sendfile on;
sendfile_max_chunk 512k;  # limit to prevent worker starvation
tcp_nopush on;             # TCP_CORK — batch TCP segments

Limitation: sendfile cannot apply SSL/TLS encryption to the data stream — the file data must be in plaintext. HTTPS serving requires a copy for encryption. This is what kTLS (Kernel TLS) solves.


splice(): Pipe-Based Zero Copy

splice(fd_in, off_in, fd_out, off_out, len, flags) moves data between file descriptors via a kernel pipe buffer, without a user-space copy:

File fd  -->  pipe buffer  -->  socket fd
              (kernel pages,
               no copy —
               page references
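               transferred)

int pipe_fds[2];
pipe(pipe_fds);

/* A pipe holds 64 KiB by default, so loop: fill the pipe, then drain it. */
while (file_size > 0) {
    /* File -> pipe (page references queued, no copy) */
    ssize_t n = splice(filefd, NULL, pipe_fds[1], NULL, file_size, SPLICE_F_MOVE);
    if (n <= 0)
        break;                       /* 0 = end of file, -1 = error */
    file_size -= n;
    while (n > 0) {                  /* Pipe -> socket (no copy) */
        ssize_t m = splice(pipe_fds[0], NULL, sockfd, NULL, n, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (m <= 0)
            return -1;               /* socket error */
        n -= m;
    }
}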

splice is more flexible than sendfile: one end must be a pipe, but the other can be almost any file descriptor (file, socket, or another pipe), which enables zero-copy pipelines such as socket→pipe→socket proxying. Note that Kafka does not call splice directly. Its log transfer path uses FileChannel.transferTo(), which maps to sendfile() on Linux, reading from a topic partition file and writing to a network socket without touching the data in user space.
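
Because every splice() call is bounded by the capacity of the pipe in the middle, the pipe size sets the chunk size of the pipeline. The sketch below queries and optionally grows that capacity with fcntl(). It is a hedged example: the helper name pipe_capacity is hypothetical, and the fallback constants are the Linux values of F_SETPIPE_SZ and F_GETPIPE_SZ for headers that hide them without _GNU_SOURCE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Linux-specific fcntl commands; fallbacks in case the headers hide them. */
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031
#endif
#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032
#endif

/* Try to grow the pipe to want_bytes, then report its actual capacity.
 * Returns the capacity in bytes, or -1 if it cannot be queried.
 * pipe_capacity is a hypothetical helper name used for illustration. */
int pipe_capacity(int pipe_fd, int want_bytes) {
    assert(want_bytes > 0);
    /* Best effort: unprivileged processes are capped by
     * /proc/sys/fs/pipe-max-size, so a failure here is not fatal. */
    (void)fcntl(pipe_fd, F_SETPIPE_SZ, want_bytes);
    return fcntl(pipe_fd, F_GETPIPE_SZ);
}
```

A larger pipe means fewer splice() calls per megabyte, at the cost of more kernel memory held per connection.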


MSG_ZEROCOPY: Socket Send Without Copying

MSG_ZEROCOPY (Linux 4.14) pins user pages and passes them directly to the NIC's scatter-gather DMA engine, bypassing the kernel socket buffer copy:

User buffer (pinned pages)
     |
     | (DMA reference, not copy)
     v
NIC DMA engine
     |
     v
Wire

The catch: the application must not modify or free the buffer until the kernel is done with it (for TCP, until the data has been acknowledged, since it may need to be retransmitted). The application receives a completion notification via recvmsg() with MSG_ERRQUEUE when the pages are safe to reuse:

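#include <errno.h>
#include <linux/errqueue.h>
#include <sys/socket.h>

/* Send with zero-copy (the socket must already have SO_ZEROCOPY set) */
ssize_t sent = send(sockfd, buf, len, MSG_ZEROCOPY);

/* Read one completion notification; buf may be reused once its send is covered */
struct msghdr msg = {0};
char cbuf[100];
msg.msg_control = cbuf;
msg.msg_controllen = sizeof(cbuf);

if (recvmsg(sockfd, &msg, MSG_ERRQUEUE) == -1) {
    /* EAGAIN: no completion yet; wait for POLLERR with poll() and retry */
} else {
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    struct sock_extended_err *serr = (struct sock_extended_err *)CMSG_DATA(cm);
    if (serr->ee_errno == 0 && serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
        /* serr->ee_info .. serr->ee_data → inclusive range of completed send numbers */
        /* serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED → fallback copy occurred */
    }
}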

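Important nuance: MSG_ZEROCOPY is a hint, not a guarantee. The kernel falls back to copying when zero-copy is not feasible, for example when the NIC doesn't support scatter-gather or when the packet is looped back to a local socket (loopback, packet sockets such as tcpdump, tun devices). The SO_EE_CODE_ZEROCOPY_COPIED flag in ee_code signals this fallback. Applications must handle both paths, and can stop passing MSG_ZEROCOPY on a socket that consistently reports copies.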

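Enable before the first MSG_ZEROCOPY send, and check the result:

int one = 1;
if (setsockopt(sockfd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
    perror("setsockopt(SO_ZEROCOPY)");  /* e.g. kernel older than 4.14 */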

Overhead vs sendfile: MSG_ZEROCOPY has higher per-message overhead due to page pinning and completion notification tracking. It is most effective for large writes (the kernel documentation suggests roughly 10 KB and up), where copy cost exceeds the pinning and notification overhead, and for data that is produced in user space (for example after encryption or compression) and therefore cannot be sent with sendfile.
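
One way to apply this is to choose the send flags per call based on payload size. The sketch below is an assumption-laden illustration: ZC_MIN_BYTES is a tunable threshold to be measured on the target hardware, and send_flags_for and send_maybe_zc are hypothetical helper names. The retry on ENOBUFS reflects the documented failure when pinning or optmem limits are exceeded.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000   /* Linux value, for older libc headers */
#endif

/* Assumed break-even size; tune by measurement. */
#define ZC_MIN_BYTES ((size_t)10 * 1024)

/* Pick send flags: zero-copy only for payloads at or above the threshold. */
int send_flags_for(size_t len) {
    return len >= ZC_MIN_BYTES ? MSG_ZEROCOPY : 0;
}

/* Send once, preferring zero-copy for large payloads. If the kernel
 * refuses to pin more memory (ENOBUFS), retry as an ordinary copying send.
 * The socket is assumed to have SO_ZEROCOPY enabled already, and the caller
 * must still reap completions for sends that used MSG_ZEROCOPY. */
ssize_t send_maybe_zc(int fd, const void *buf, size_t len) {
    int flags = send_flags_for(len);
    ssize_t n = send(fd, buf, len, flags);
    if (n < 0 && errno == ENOBUFS && (flags & MSG_ZEROCOPY))
        n = send(fd, buf, len, 0);
    return n;
}
```

Small control messages then take the cheap copying path, while bulk payloads get copy avoidance.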


io_uring Zero-Copy Send

io_uring (Linux 5.1+) introduced fixed buffers: user memory that is registered once, pinned and mapped by the kernel for the lifetime of the registration, and available to all subsequent I/O operations without per-operation page pinning:

#include <liburing.h>

/* Register fixed buffers once; buffer index 0 refers to iov[0] */
struct iovec iov = { .iov_base = buf, .iov_len = BUF_SIZE };
io_uring_register_buffers(&ring, &iov, 1);

/* Use fixed buffer for send (no page pin per operation) */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_send_zc_fixed(sqe, sockfd, buf, len, 0, 0, /* buf_index */ 0);
io_uring_submit(&ring);

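io_uring_prep_send_zc (Linux 6.0) is the zero-copy send variant, combining io_uring's async model with MSG_ZEROCOPY semantics. Completions are delivered via the CQE ring without a separate recvmsg() call. A successful request normally posts two CQEs: the first carries the send result and has IORING_CQE_F_MORE set, and a later notification flagged IORING_CQE_F_NOTIF signals that the buffer can be reused.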
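
A zero-copy send can post two CQEs, a result and a later buffer-release notification, so the completion handler has to tell them apart by their flags. The sketch below shows only that classification. The enum and function names are hypothetical, and the two flag constants are defined locally with the same values as IORING_CQE_F_MORE and IORING_CQE_F_NOTIF in <linux/io_uring.h>, so the example compiles without io_uring headers.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirrors of IORING_CQE_F_MORE and IORING_CQE_F_NOTIF (assumption:
 * values as in the Linux UAPI header <linux/io_uring.h>). */
#define CQE_F_MORE  (1U << 1)
#define CQE_F_NOTIF (1U << 3)

/* Hypothetical classification of a CQE belonging to a send_zc request. */
enum zc_cqe_kind {
    ZC_CQE_RESULT_NOTIF_PENDING, /* send result; buffer still owned by kernel */
    ZC_CQE_RESULT_FINAL,         /* send result; no notification will follow  */
    ZC_CQE_BUFFER_RELEASED       /* notification; buffer may be reused        */
};

enum zc_cqe_kind classify_send_zc_cqe(uint32_t cqe_flags) {
    if (cqe_flags & CQE_F_NOTIF)
        return ZC_CQE_BUFFER_RELEASED;
    if (cqe_flags & CQE_F_MORE)
        return ZC_CQE_RESULT_NOTIF_PENDING;
    return ZC_CQE_RESULT_FINAL;
}
```

In a real event loop the argument would be cqe->flags, and the buffer would be returned to the application's pool only on ZC_CQE_BUFFER_RELEASED or ZC_CQE_RESULT_FINAL.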


NIC DMA from User Buffer (Registered Buffers)

When a user buffer is registered with the NIC via ibv_reg_mr() (RDMA verbs) or AF_XDP UMEM registration, the NIC can DMA directly from/to user memory:

AF_XDP with ZEROCOPY flag:

/* Register user memory region */
struct xdp_umem_reg umem = {
    .addr = (uint64_t)umem_buf,
    .len  = UMEM_SIZE,
    .chunk_size = FRAME_SIZE,
    .headroom   = HEADROOM,
    .flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
};
setsockopt(xdp_fd, SOL_XDP, XDP_UMEM_REG, &umem, sizeof(umem));

/* Bind with ZEROCOPY flag */
struct sockaddr_xdp sxdp = { .sxdp_flags = XDP_ZEROCOPY, ... };
bind(xdp_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));

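The NIC DMAs received packets directly into UMEM frames. The kernel passes only descriptors (address and length) to the application through the RX ring, so packet data is never copied into kernel socket buffers.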
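
With AF_XDP, a descriptor's address is a byte offset into the UMEM area, so locating a packet is pure pointer arithmetic. The sketch below illustrates this for aligned-chunk mode. The helper names are hypothetical, and the FRAME_SIZE value of 2048 is an assumed chunk size (it must be a power of two in aligned mode).

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_SIZE 2048u   /* assumed chunk_size; power of two */

/* Pointer to the packet bytes for a descriptor address. */
void *umem_data(void *umem_base, uint64_t desc_addr) {
    return (uint8_t *)umem_base + desc_addr;
}

/* Index of the frame that contains desc_addr. */
uint64_t umem_frame_index(uint64_t desc_addr) {
    return desc_addr / FRAME_SIZE;
}

/* Start of the containing frame: the address to hand back to the
 * fill ring when the frame is recycled. */
uint64_t umem_frame_base(uint64_t desc_addr) {
    return desc_addr & ~((uint64_t)FRAME_SIZE - 1);
}
```

For example, a descriptor address of 2304 (frame 1 plus 256 bytes of headroom) maps to frame index 1 and frame base 2048.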


RDMA: Zero Copy to Remote Memory

RDMA (Remote Direct Memory Access) extends zero-copy across the network. An RDMA write:

  1. Local NIC DMAs from application memory → wire
  2. Remote NIC DMAs from wire → remote application memory
  3. No CPU involvement on either host for the data path
  4. One-sided operation: remote CPU is not interrupted

Host A                               Host B
App memory                           App memory
     |                                    ^
     | DMA                            DMA |
     v                                    |
   RNIC  ====== InfiniBand/RoCE =====>  RNIC
     (no kernel, no CPU, no copy)

This enables:

  • Throughput: >100 Gbps per link (InfiniBand HDR)
  • Latency: <2 µs end-to-end (vs 20–50 µs for TCP)
  • CPU offload: ~0% CPU for data transfer (vs 100% CPU at line rate for TCP)

RDMA is used in:

  • HPC MPI (Message Passing Interface): collective operations across thousands of nodes
  • Distributed storage: Ceph RADOS with RDMA, DAOS, PMDK
  • ML training: NCCL (NVIDIA Collective Communications Library) uses RoCE v2 for all-reduce operations across GPU nodes

# Check RDMA devices
ibv_devices
ibstat mlx5_0

# Bandwidth test
ib_send_bw -d mlx5_0        # server
ib_send_bw -d mlx5_0 <host> # client

# Latency test
ib_send_lat -d mlx5_0

kTLS: Kernel TLS for sendfile() over HTTPS

kTLS (Linux 4.13+) moves TLS record layer processing into the kernel, enabling sendfile() to work for HTTPS serving:

# nginx with kTLS (Linux kernel 5.2+, OpenSSL 3.0+)
# nginx.conf:
ssl_conf_command Options KTLS;
sendfile on;

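With kTLS:

  1. The OpenSSL handshake completes in user space; the session keys are then installed in the kernel via setsockopt(SOL_TLS, TLS_TX / TLS_RX)
  2. sendfile() passes file pages to the kTLS module
  3. kTLS builds and encrypts the TLS records in the kernel (AES-NI in software mode, or on the NIC with TLS offload) and hands them to NIC DMA. In software mode the ciphertext is written to separate kernel pages, because page cache pages must not be overwritten
  4. No user-space copy and no user-space buffer allocation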
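
Step 1 can be sketched with the kernel's TLS socket options. This is a hedged sketch, assuming TLS 1.2 with AES-128-GCM and key material already extracted from the handshake by the TLS library. The helper names fill_tls12_aes128gcm and enable_ktls_tx are hypothetical, and in practice OpenSSL performs these calls itself when kTLS is enabled.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <linux/tls.h>

/* Fallbacks for older libc headers (Linux values). */
#ifndef SOL_TLS
#define SOL_TLS 282
#endif
#ifndef TCP_ULP
#define TCP_ULP 31
#endif

/* Fill the kernel's TLS 1.2 AES-128-GCM descriptor from handshake output. */
void fill_tls12_aes128gcm(struct tls12_crypto_info_aes_gcm_128 *ci,
                          const unsigned char key[16],
                          const unsigned char salt[4],
                          const unsigned char iv[8],
                          const unsigned char rec_seq[8]) {
    assert(ci != NULL);
    memset(ci, 0, sizeof(*ci));
    ci->info.version = TLS_1_2_VERSION;
    ci->info.cipher_type = TLS_CIPHER_AES_GCM_128;
    memcpy(ci->key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
    memcpy(ci->salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
    memcpy(ci->iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
    memcpy(ci->rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);
}

/* Attach the "tls" upper-layer protocol to a connected TCP socket and
 * install the transmit key. Returns 0 on success, -1 on error. */
int enable_ktls_tx(int sockfd, const struct tls12_crypto_info_aes_gcm_128 *ci) {
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -1;
    return setsockopt(sockfd, SOL_TLS, TLS_TX, ci, sizeof(*ci));
}
```

After enable_ktls_tx succeeds, plain write() and sendfile() calls on the socket produce encrypted TLS records.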

Performance: kTLS + sendfile achieves ~8–9 Gbps on a 10G link with AES-128-GCM, using 15% CPU (vs 60%+ for userspace TLS + read/write).


Historical Context

sendfile() was introduced in Linux 2.2 (1999) and Solaris independently developed a similar call at the same time. Linus Torvalds was skeptical about the design, but Andi Kleen convinced the community of its importance for web server performance.

The "zero-copy" framing became popular with a 2008 IBM developerWorks article by Sathish K. Palaniappan and Pramod B. Nagaraja, "Efficient data transfer through zero copy", that quantified the CPU savings. It became the canonical reference for explaining why Kafka, nginx, and similar systems are fast.

MSG_ZEROCOPY was contributed by Willem de Bruijn (Google) in 2017 after profiling showed that large-payload services like YouTube streaming were spending significant CPU on socket buffer copies at 10Gbps+.


Production Examples

Kafka zero-copy for log replication:

// Kafka uses FileChannel.transferTo() which calls sendfile() under the hood
fileChannel.transferTo(position, count, socketChannel);
// No JVM heap allocation, no byte[] copy — page cache direct to socket

nginx static file serving tuning:

# Verify sendfile in use (-f follows the worker processes; on 32-bit systems trace sendfile64)
strace -f -e trace=sendfile nginx -g 'daemon off;'

# Monitor sendfile efficiency
perf stat -e cache-references,cache-misses nginx &
# cache-misses should be low relative to cache-references

Debugging Notes

# Verify sendfile support on filesystem (EINVAL means it is not supported for this fd)
strace -f -e trace=sendfile ./your_app 2>&1 | head

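# Check for MSG_ZEROCOPY fallback copies (costly: pinning was paid for and the data was copied anyway)
# skb_copy_ubufs() performs the deferred copy; the symbol may differ between kernel versions
bpftrace -e 'kprobe:skb_copy_ubufs { @fallback_copies = count(); }'
# The application sees the same event as SO_EE_CODE_ZEROCOPY_COPIED in ee_code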

# Monitor kernel TLS
cat /proc/net/tls_stat
# TlsCurrTxSw = sockets using SW kTLS TX
# TlsCurrTxDevice = sockets using NIC-offloaded kTLS TX

# Check sendfile call count and bytes (on x86_64 the tracepoint is named sys_exit_sendfile64)
bpftrace -e 'tracepoint:syscalls:sys_exit_sendfile64 /args->ret > 0/ { @bytes = hist(args->ret); }'

Security Implications

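  • Page pinning and page reclaim: MSG_ZEROCOPY pins user pages until the kernel no longer needs them (for TCP, until the data has been acknowledged), so slow or stalled peers keep memory pinned and unreclaimable. The kernel limits the damage by charging pinned pages against the user's locked-memory limit (RLIMIT_MEMLOCK) and failing further zero-copy sends with ENOBUFS. Raising that limit carelessly lets a misbehaving or malicious workload pin large amounts of memory.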
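  • RDMA and privilege: RDMA lets a remote peer read or write local memory without involving the local CPU. Access is limited to registered memory regions and gated by their remote keys (rkeys), so over-broad registrations or leaked rkeys expose application memory to any peer that can reach the queue pair. Multi-tenant RDMA additionally requires IOMMU protection so that device DMA is confined to the pages that were actually registered.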
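  • sendfile and file permissions: sendfile operates on an already-open descriptor, so access checks happen at open() time and are not repeated per transfer. A server that passes attacker-influenced paths or inherited descriptors to sendfile exposes whatever that fd can read. Always ensure the source fd is opened with appropriate permissions.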
  • kTLS session keys in kernel: the record-layer traffic keys (not the master secret) are transferred to the kernel via setsockopt. A kernel exploit that reads kTLS socket state gains those session keys. This is the accepted trade-off for kTLS performance.

Performance Implications

Technique               Copies   CPU for 10Gbps   Latency impact
read() + write()        2        ~80%             Baseline
sendfile()              0        ~45%             -20–30%
splice()                0        ~45%             -20–30%
MSG_ZEROCOPY (>32KB)    0        ~30%             +µs (completion)
io_uring fixed buffer   0        ~20%             Lowest
RDMA write              0        ~0%              <2 µs

The break-even for MSG_ZEROCOPY vs regular send is approximately 10KB payload — below that, page pinning overhead exceeds copy cost.


Failure Modes and Real Incidents

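Incident: sendfile with encrypted filesystem (2018, object storage)
As reported, sendfile() did not take the zero-copy path on an encrypted filesystem (eCryptFS), and the application silently fell back to read()+write(), causing a 3x CPU increase at scale. Note that dm-crypt by itself should not cause this: it encrypts at the block layer, below the page cache, so cached pages are already plaintext. Fix: detect and alert on the fallback (for example sendfile() returning EINVAL), and for hot data paths prefer TLS at the socket layer (kTLS) over filesystem-level encryption that interferes with sendfile().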

Failure Mode: MSG_ZEROCOPY completion queue flooding
An application sending at 40Gbps with MSG_ZEROCOPY generates completions faster than its MSG_ERRQUEUE drain loop consumes them. Undrained notifications are charged against the socket's optmem limit, and once that limit is exceeded send() fails with ENOBUFS. Fix: drain the completion queue in a separate thread, or use io_uring zero-copy, which integrates completions into the normal CQE ring.

Incident: splice() and NFS regression
A kernel update changed NFS page cache behavior, causing splice() to fall back to a copy for NFS-backed files. Throughput dropped 40% for an NFS-backed video streaming server. Fix: file system type check before choosing splice vs read+write.


Modern Usage

  • io_uring zero-copy send (IORING_OP_SEND_ZC, Linux 6.0) is now used by production applications including DPDK-AF_XDP hybrids and database WAL writers
  • kTLS hardware offload: Mellanox/Nvidia ConnectX-6+ can offload kTLS crypto to NIC, achieving TLS encryption at full 100G line rate with near-zero CPU
  • RDMA in ML training: PyTorch Distributed / NCCL uses RoCEv2 for all-reduce collectives, which is essential for scaling beyond ~32 GPUs where PCIe bandwidth becomes the bottleneck

Future Directions

  • Unified zero-copy API: there is ongoing work to unify MSG_ZEROCOPY, io_uring, and AF_XDP into a consistent zero-copy network I/O model with completion semantics
  • CXL (Compute Express Link): provides cache-coherent load/store access to device-attached and expander memory over the PCIe physical layer, blurring the line between RDMA and local zero-copy
  • NIC-based TLS with zero-copy read: current kTLS zerocopy works well for send; receive zero-copy (kernel TLS receive bypass) is under active development

Exercises

  1. Implement a file server in C using sendfile(). Measure throughput for 1KB, 100KB, and 10MB files. Compare to a read()+write() baseline. At what file size does sendfile() break even?

  2. Implement a MSG_ZEROCOPY sender that correctly drains the completion queue via MSG_ERRQUEUE. Deliberately omit the drain loop in a second version. Measure the stall behavior under sustained load and explain the mechanism.

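  3. Write a pipeline using splice() that reads from a file, passes through a pipe (for potential transformation), and writes to a socket, all without a user-space data copy. Verify zero-copy by adding a bpftrace kprobe on the kernel's user-copy routines (for example _copy_to_user and _copy_from_user; the exact symbols depend on kernel version and architecture), filtered to your process, that should not fire for the payload.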

  4. Set up kTLS in a test environment (Linux 5.2+, OpenSSL 3.0+ built with the enable-ktls option, curl with ktls support). Compare CPU utilization for HTTPS file transfer with and without kTLS. Verify the kTLS TX socket count via /proc/net/tls_stat.

  5. Using ib_send_bw and ib_send_lat on an RDMA-capable testbed (or RXE software RDMA over Ethernet: modprobe rdma_rxe), measure RDMA throughput and latency. Compare to TCP iperf3 on the same link. Explain the performance difference in terms of CPU path length.


References

  • net/core/skbuff.c — MSG_ZEROCOPY implementation (msg_zerocopy_* functions)
  • fs/splice.c — splice() / sendfile() implementation
  • net/tls/ — kernel TLS implementation (tls_device.c for NIC offload)
  • include/linux/skbuff.h — skb_zerocopy helpers
  • de Bruijn, W. and Dumazet, E. sendmsg copy avoidance with MSG_ZEROCOPY. Netdev 2.1 conference, 2017.
  • Palaniappan, S. K. and Nagaraja, P. B. Efficient data transfer through zero copy. IBM developerWorks, 2008.
  • InfiniBand Trade Association. InfiniBand Architecture Specification. infinibandta.org
  • man 2 sendfile, man 2 splice, man 7 socket (MSG_ZEROCOPY section)
  • Documentation/networking/msg_zerocopy.rst
  • Axboe, J. Efficient IO with io_uring. kernel.dk, 2019.