05 — Network Performance

Technical Overview

Network performance is multi-dimensional: raw throughput (bytes/second), latency (round-trip time), packets per second (PPS), and CPU cost per packet. Modern NICs deliver 100 Gbps, which is roughly 150 million PPS at minimum frame size; the regular kernel stack typically sustains only on the order of 1 to 2 million PPS per CPU core, and even the fastest in-kernel paths reach roughly 10 to 20 million PPS per core. For high-throughput services, the network hardware is rarely the bottleneck; the kernel network stack is.

Optimizing network performance requires understanding which layer of the stack is the bottleneck and applying the appropriate technique: interrupt coalescing (reduce hardware IRQ rate), NAPI polling (switch from interrupt to polling at high load), offloading (GSO/GRO batch segments in software; TSO and checksum offload push work to the NIC), multiqueue (distribute work across CPUs), and kernel bypass (DPDK skips the kernel entirely; XDP runs in the driver ahead of the rest of the stack).

Prerequisites

Linux network stack basics: socket layer, protocol processing, device driver model.
TCP/UDP mechanics.
IRQ and interrupt handling.
Basic understanding of DMA and NIC operation.

Core Content

Network Performance Metrics

Metric	Definition	Tool
Throughput	Bytes/second transferred	`iperf3`, `netperf`, `ethtool`
PPS	Packets per second	`nstat`, `ip -s link`, `/proc/net/dev`
Latency	RTT or one-way delay	`ping`, `hping3`, `sockperf`
CPU/packet	CPU cycles consumed per packet	`perf stat` + packet counter
TCP retransmits	Retransmissions / total sends	`ss -s`, `netstat -s`
Drop rate	Packets dropped at receive	`ethtool -S` (NIC drops)

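CPU per packet is the least-discussed but most important metric for high-PPS workloads. At 10 million PPS, a 1,000-cycle-per-packet cost requires 10 billion cycles per second, more than three 3 GHz cores; a single 3 GHz core has a budget of only 300 cycles per packet at that rate.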
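The cycle budget is simple arithmetic, and a small helper makes it easy to try other packet rates and clock speeds. This is a sketch only; the function names are illustrative and not taken from any tool.

```python
def cores_needed(pps: float, cycles_per_packet: float, clock_hz: float) -> float:
    """Number of cores consumed by a packet rate at a given per-packet cost."""
    return pps * cycles_per_packet / clock_hz


def cycle_budget(pps: float, clock_hz: float) -> float:
    """Cycles available per packet if one core must absorb the whole rate."""
    return clock_hz / pps


def max_pps_per_core(cycles_per_packet: float, clock_hz: float) -> float:
    """Highest packet rate one core can sustain at a given per-packet cost."""
    return clock_hz / cycles_per_packet
```

For example, `cores_needed(10e6, 1000, 3e9)` evaluates to about 3.33 cores, and `cycle_budget(10e6, 3e9)` gives 300 cycles per packet.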

Kernel Network Stack Bottlenecks

The Linux network stack for a received packet:

NIC Hardware
    │  DMA packet to ring buffer
    │  Raise hardware interrupt (IRQ)
    ▼
Interrupt Handler (driver, hard IRQ context)
    │  Acknowledge IRQ
    │  Schedule NAPI poll
    ▼
Softirq (NET_RX_SOFTIRQ)
    │  NAPI poll: drain RX ring
    │  Allocate sk_buff per packet
    │  GRO: merge TCP segments
    ▼
Network Protocol Processing
    │  IP layer: routing, validation
    │  TCP/UDP layer: sequence numbers, checksums
    │  Socket lookup (established-connection hash table)
    ▼
Socket Receive Buffer
    │  Copy data to socket buffer
    ▼
Application (read/recv syscall)
    │  Copy from socket buffer to user space
    ▼
User Application

Bottlenecks at each stage:

- Interrupt rate: too many IRQs per second starve the CPU of time to process packets.
- sk_buff allocation: each packet requires kernel heap allocation. At 10 Mpps, this is 10 million kmalloc/kfree calls/second.
- Protocol processing: checksum computation, sequence number tracking, ACK generation.
- Socket layer: finding the right socket, copying to socket buffer.
- Memory copy: kernel→user copy on each read() syscall.

Interrupt Coalescing

Interrupt coalescing (also called interrupt moderation) batches NIC completions into a single interrupt. Instead of one IRQ per packet (1 Mpps = 1 million IRQs/second), the NIC fires one IRQ per N packets or per T microseconds, whichever comes first.

# View current coalescing settings
ethtool -c eth0

# Set coalescing: interrupt every 50 µs or 64 packets
ethtool -C eth0 rx-usecs 50 rx-frames 64

Trade-off: higher coalescing reduces IRQ rate (lower CPU overhead) but increases latency (packets sit in the NIC longer before IRQ fires). Latency-sensitive workloads (HFT, gaming) prefer low coalescing or disable it entirely (rx-usecs 0).
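The trade-off can be estimated with a simplified model. The sketch below assumes packets arrive at a steady rate and that the NIC raises an interrupt when either the timer or the frame count is reached, whichever comes first; real NICs differ in detail, and the function name is hypothetical.

```python
def coalescing_model(pps: float, rx_usecs: float, rx_frames: int):
    """Return (irq_rate_per_second, worst_case_added_latency_us).

    Simplified model: steady arrivals, interrupt on timer OR frame count.
    rx_usecs == 0 is treated as "coalescing disabled" (one IRQ per packet).
    """
    if rx_frames < 1:
        raise ValueError("rx_frames must be at least 1 in this model")
    if rx_usecs == 0:
        return pps, 0.0
    timer_s = rx_usecs * 1e-6          # time limit after the first packet
    fill_s = rx_frames / pps           # time to accumulate rx_frames packets
    interval_s = min(timer_s, fill_s)  # whichever trigger fires first
    irq_rate = min(pps, 1.0 / interval_s)
    return irq_rate, interval_s * 1e6
```

Under these assumptions, `coalescing_model(1_000_000, 50, 64)` yields about 20,000 IRQs/second with up to 50 µs of added delay, compared with 1,000,000 IRQs/second and no added delay when `rx_usecs` is 0.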

NAPI Polling

NAPI (New API) is the Linux mechanism for switching between interrupt and polling mode dynamically.

NIC raises interrupt → driver disables the IRQ, schedules NAPI poll.
Softirq drains the RX ring (polls) until the ring is empty or the budget is exhausted.
If ring empty → re-enable IRQ (return to interrupt mode).
If budget exhausted → yield and reschedule.

This prevents interrupt storms at high load while reverting to interrupt-driven mode at low load (saving CPU when idle).
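The poll cycle can be modelled in a few lines. This is a toy simulation, not driver code: the ring is a plain deque, and the rule "re-enable the IRQ only when a poll used less than its full budget" follows the usual driver convention, so a ring that empties exactly at the budget gets one more (empty) poll.

```python
from collections import deque


def napi_poll(rx_ring: deque, budget: int):
    """Process up to `budget` packets; return (processed, irq_reenabled)."""
    processed = 0
    while rx_ring and processed < budget:
        rx_ring.popleft()              # stand-in for building and passing up an skb
        processed += 1
    # Using less than the full budget means the ring ran dry: back to IRQ mode.
    return processed, processed < budget


def drain(rx_ring: deque, budget: int):
    """Poll repeatedly until the IRQ is re-enabled; return packets per poll."""
    per_poll = []
    while True:
        processed, irq_reenabled = napi_poll(rx_ring, budget)
        per_poll.append(processed)
        if irq_reenabled:
            return per_poll
```

With 150 queued packets and a budget of 64, `drain` performs three polls (64, 64, 22) before returning to interrupt mode.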

Tune NAPI budget:

# Budget: max packets processed per NAPI poll cycle
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_budget_usecs=8000

GRO / GSO

GRO (Generic Receive Offload): in the receive path, the kernel merges multiple TCP segments with consecutive sequence numbers into one large segment before passing to the protocol stack. Reduces the number of protocol-processing operations.

Without GRO: 100 × 1,500-byte packets → 100 sk_buff operations
With GRO:    100 × 1,500-byte packets → 3 super-segments (each capped at about 64 KB by default)
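The merging rule can be illustrated with a toy model. The sketch assumes all segments belong to one flow, represents each as a `(seq, length)` pair, and caps a merged segment at 65,535 bytes as an approximation of the default limit; real GRO also checks headers, flags, and timing.

```python
def gro_merge(segments, max_size=65535):
    """Coalesce in-order TCP segments of a single flow.

    segments: iterable of (seq, payload_length) in arrival order.
    A segment is merged into the previous one only if it starts exactly
    where the previous one ended and the result stays within max_size.
    """
    merged = []
    for seq, length in segments:
        if merged:
            last_seq, last_len = merged[-1]
            contiguous = seq == last_seq + last_len
            if contiguous and last_len + length <= max_size:
                merged[-1] = (last_seq, last_len + length)
                continue
        merged.append((seq, length))
    return merged
```

For 100 back-to-back segments of 1,448 payload bytes each, this model produces three merged segments (45, 45, and 10 packets), and a gap in the sequence space always starts a new segment.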

GSO (Generic Segmentation Offload): in the transmit path, the application writes a large buffer (e.g., 64 KB). GSO defers segmentation into MTU-sized packets to the latest possible stage (NIC hardware if it supports TSO, or the kernel). Reduces per-packet overhead.

ethtool -k eth0 | grep -E "generic|tcp|udp|scatter"
ethtool -K eth0 gro on gso on tso on

RSS: Multi-Queue NIC

RSS (Receive Side Scaling) uses the NIC's hardware to distribute incoming packets across multiple receive queues based on a hash of (src IP, dst IP, src port, dst port). Each queue has its own interrupt, which is handled by a different CPU.

NIC Hardware
  RX Queue 0 → CPU 0 (IRQ 80)
  RX Queue 1 → CPU 1 (IRQ 81)
  RX Queue 2 → CPU 2 (IRQ 82)
  RX Queue 3 → CPU 3 (IRQ 83)
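The mapping from flow to queue can be sketched as a hash followed by an indirection-table lookup. Note the assumptions: real NICs typically compute a Toeplitz hash with a secret key, whereas this sketch substitutes `zlib.crc32` purely for illustration, and the 128-entry table size is an assumed value.

```python
import ipaddress
import struct
import zlib


def flow_hash(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Hash the 4-tuple. crc32 is a stand-in for the NIC's Toeplitz hash."""
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )
    return zlib.crc32(key)


def select_queue(hash_value: int, n_queues: int, table_size: int = 128) -> int:
    """Look up the RX queue through a round-robin indirection table."""
    indirection = [i % n_queues for i in range(table_size)]
    return indirection[hash_value % table_size]
```

Every packet of a given flow hashes to the same value and therefore lands on the same queue and CPU, which preserves in-order delivery within the flow.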

Set queue count:

ethtool -L eth0 combined 8  # 8 combined TX+RX queues

Set IRQ affinity (match queues to CPUs):

# Pin NIC IRQ 80 to CPU 0
echo 1 > /proc/irq/80/smp_affinity  # CPU 0 bitmask
# Or use the set_irq_affinity script shipped with some NIC vendor driver packages
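The value written to smp_affinity is a hexadecimal CPU bitmask, with bit N set for CPU N. The helper below is a small sketch (the name `cpu_mask` is illustrative) that builds the mask from a list of CPU numbers and inserts the comma separators the kernel expects between 32-bit groups on machines with more than 32 CPUs.

```python
def cpu_mask(cpus) -> str:
    """Build a hex CPU bitmask string from an iterable of CPU numbers."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    digits = format(mask, "x")
    # Split into 8-hex-digit (32-bit) groups, starting from the right.
    groups = []
    while digits:
        groups.append(digits[-8:])
        digits = digits[:-8]
    return ",".join(reversed(groups))
```

`cpu_mask([0])` returns `"1"`, and `cpu_mask(range(8))` returns `"ff"`; the same mask format applies to rps_cpus.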

RSS requires the NIC to support hardware hashing. For UDP-only or custom protocols that don't match the hash tuple, use RPS.

RPS / RFS

RPS (Receive Packet Steering): software equivalent of RSS when the NIC has only one queue or the hardware hash is insufficient. The softirq on the CPU that took the interrupt hashes the packet and enqueues it to another CPU's backlog.

# Enable RPS: spread processing across all 8 CPUs
echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus

RFS (Receive Flow Steering): extends RPS to steer packets to the CPU where the application receiving that flow is running. Reduces cross-CPU cache migrations for received data.

sysctl -w net.core.rps_sock_flow_entries=32768
echo 2048 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

SO_REUSEPORT

Traditional: one listening socket, one accept queue. Multiple threads calling accept() contend on the kernel's accept queue lock.

SO_REUSEPORT (Linux 3.9): multiple sockets bound to the same port. The kernel hash-assigns incoming connections to one of the listening sockets, with no lock contention between accept threads.

int opt = 1;
setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt));
bind(fd, ...);
listen(fd, 1024);
// Repeat in each thread — each gets its own queue
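The same pattern is available from Python's standard socket module on Linux. The sketch below is minimal and assumes a Linux host where `socket.SO_REUSEPORT` is defined; the helper name and the loopback address are illustrative choices.

```python
import socket


def reuseport_listener(port: int, host: str = "127.0.0.1", backlog: int = 1024):
    """Create a TCP listener that shares its port via SO_REUSEPORT."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() on every socket that joins the group.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock


def listener_group(count: int, host: str = "127.0.0.1"):
    """Open `count` listeners on one kernel-chosen port; return the sockets."""
    first = reuseport_listener(0, host)      # port 0: let the kernel pick
    port = first.getsockname()[1]
    return [first] + [reuseport_listener(port, host) for _ in range(count - 1)]
```

In a real server each worker process or thread would create its own listener and run its own accept loop; `listener_group` only demonstrates that the second and later binds succeed.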

Used by: nginx (opt-in via the reuseport parameter of the listen directive, available since 1.9.1), HAProxy, Envoy, and other high-performance servers.

Busy Polling (SO_BUSY_POLL)

When a socket is waiting for data, the application normally goes to sleep (via poll/epoll), and the kernel wakes it when data arrives. This context-switch introduces scheduling latency.

Busy polling (SO_BUSY_POLL): the kernel's poll() implementation spins on the NIC RX queue for N microseconds before sleeping. Eliminates context-switch latency at the cost of burning a CPU.

# System-wide default busy poll time (microseconds)
sysctl -w net.core.busy_poll=50
sysctl -w net.core.busy_read=50

# Per-socket, from C (usecs is an int holding the busy-poll time in microseconds)
setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));

Kernel Bypass: DPDK and XDP

DPDK (Data Plane Development Kit): moves the NIC driver entirely to user space. A dedicated CPU core busy-polls the NIC ring buffer. The kernel sees no packets—they go directly from NIC to the DPDK application via IOMMU-mapped DMA.

Traditional:  NIC → kernel → socket → app
DPDK:         NIC → DPDK userspace driver → app
              (no kernel involvement, no syscall)

Achieved performance: 80+ Gbps wire rate, 60+ million PPS from a single core with modern NICs.

DPDK applications: Open vSwitch (OVS-DPDK), VPP (Vector Packet Processing), Snort IDS, NFV functions.

XDP (eXpress Data Path): an eBPF program attached to the NIC driver's RX path, before sk_buff allocation. Can drop, forward, or redirect packets in ~50 ns with zero copy.

NIC DMA → XDP program (eBPF, runs before sk_buff) → action:
  XDP_DROP: discard (fastest, ~20 ns)
  XDP_PASS: continue to kernel stack
  XDP_TX: transmit the packet back out the interface it arrived on
  XDP_REDIRECT: forward to another interface or CPU

XDP is used in: Cloudflare's DDoS mitigation (drops millions of PPS from a single core), Facebook's load balancer (Katran), Linux firewall acceleration.

Zero-Copy Networking

sendfile(2): transfers file data from a file descriptor to a socket without copying to user space.

// Traditional: read → user buffer → send
read(file_fd, buf, len);
send(sock_fd, buf, len, 0);

// sendfile: no user-space copy
sendfile(sock_fd, file_fd, &offset, len);
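Python exposes the same mechanism through `socket.sendfile()`, which uses `os.sendfile()` where the platform supports it and otherwise falls back to ordinary `send()` calls. A minimal sketch, with a hypothetical helper name:

```python
import socket


def send_file_no_user_copy(sock: socket.socket, path: str) -> int:
    """Send a whole file over a connected stream socket.

    socket.sendfile() requires a blocking SOCK_STREAM socket and a file
    opened in binary mode; it returns the total number of bytes sent.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)
```

The caller passes an already-connected socket; the file contents never pass through a Python-level buffer when the `os.sendfile()` path is taken.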

Used by: nginx for static file serving (reduces CPU by 30–50% for file serving workloads).

MSG_ZEROCOPY: for user-space data, the kernel pins the user pages and sends directly. Completion is notified via error queue (MSG_ERRQUEUE).

setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
send(fd, buf, len, MSG_ZEROCOPY);
// Wait for completion notification before reusing buf

Overhead warning: MSG_ZEROCOPY has per-message overhead (~1 µs) that makes it slower than memcpy for small messages (< ~10 KB). Only beneficial for large transfers.

RDMA

RDMA (Remote Direct Memory Access): one machine writes directly to another machine's memory without involving the target's CPU. Used in HPC and high-performance storage.

Machine A                          Machine B
┌───────────────┐                  ┌───────────────┐
│ App writes to │                  │ RDMA adapter   │
│ RDMA buf      │ ─── InfiniBand ──► writes to app  │
│               │   (no CPU on B)  │ memory         │
└───────────────┘                  └───────────────┘

Technologies: InfiniBand (IB), RoCE (RDMA over Converged Ethernet), iWARP (RDMA over TCP).

Latency: ~1–2 µs for RDMA vs. ~50–100 µs for TCP loopback. Used in: Lustre/GPFS parallel filesystems, MPI (Message Passing Interface) in HPC, distributed databases (RAMCloud, Pilaf, DrTM).

Sysctl Tuning Guide

Parameter	Default	Recommended	Effect
`net.core.rmem_max`	212992	134217728	Max socket receive buffer
`net.core.wmem_max`	212992	134217728	Max socket send buffer
`net.core.netdev_max_backlog`	1000	5000	Per-CPU softirq queue depth
`net.core.somaxconn`	4096 (128 before Linux 5.4)	65535	Max listen() backlog
`net.ipv4.tcp_rmem`	4096 87380 6291456	4096 87380 134217728	TCP recv buffer range
`net.ipv4.tcp_wmem`	4096 16384 4194304	4096 65536 134217728	TCP send buffer range
`net.ipv4.tcp_tw_reuse`	2 on recent kernels (loopback only); 0 on older ones	1	Reuse TIME_WAIT sockets for new outgoing connections
`net.ipv4.ip_local_port_range`	32768 60999	1024 65535	Ephemeral port range
`net.ipv4.tcp_max_syn_backlog`	128	8192	SYN queue depth (DDoS resistance)
`net.core.busy_poll`	0	50 (if latency-critical)	Busy poll time (µs)
`net.ipv4.tcp_fastopen`	1 (client only)	3 (client and server)	TCP Fast Open (save one RTT)
`net.ipv4.tcp_congestion_control`	cubic	bbr	BBR improves throughput in lossy/high-BDP

Apply permanently in /etc/sysctl.d/99-network-perf.conf.

Historical Context

The NAPI framework was merged in Linux 2.5.44 (2002) to address the interrupt storm problem on fast NICs (100 Mbps was fast then). 10 Gbps NICs (2005) revealed that the kernel protocol stack couldn't keep up. GSO arrived in Linux 2.6.18 (2006), GRO in 2.6.29 (2009), and RPS/RFS, the software counterparts of hardware RSS, in 2.6.35 (2010).

The DPDK project started at Intel in 2010 to enable telco NFV workloads. The dpdk.org open-source community was established in 2013, and the project moved under the Linux Foundation in 2017.

XDP was merged in Linux 4.8 (2016) based on work from Netronome, Suricata, and Facebook. It enabled in-kernel wire-speed packet processing with programmable eBPF logic.

Production Examples

Case: Cloudflare DDoS mitigation with XDP. During a DDoS attack generating 300 million PPS, Cloudflare used XDP programs to drop attack traffic before sk_buff allocation. A single server dropped 10+ million PPS with negligible CPU load. The same traffic would have overwhelmed the kernel stack entirely. Reference: Cloudflare blog, "L4Drop: XDP DDoS Mitigations."

Case: Facebook Katran load balancer. Facebook's L4 load balancer (Katran) uses XDP+eBPF for consistent hashing and health checking. Replacing the previous IPVS-based solution reduced CPU overhead from ~10% to ~3% per server. Reference: Facebook Engineering blog, 2018.

Debugging Notes

# Socket statistics (TCP state, buffer usage)
ss -s            # summary
ss -mti dst :443 # detailed per-socket info

# Network interface errors and drops
ip -s link show eth0
ethtool -S eth0 | grep -i drop

# Softirq backlog drops
cat /proc/net/softnet_stat
# Column 2: drops (backlog queue full)

# TCP retransmit rate
nstat -az | grep Retrans

# BCC tools for network profiling
/usr/share/bcc/tools/tcplife    # TCP connection durations
/usr/share/bcc/tools/tcpretrans # retransmit events with addresses and TCP state
/usr/share/bcc/tools/tcptop     # per-connection TCP throughput (top-talkers)
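The softnet_stat file is easier to read once its hexadecimal columns are decoded. The parser below is a sketch that extracts only the first three columns (packets processed, backlog drops, and time_squeeze, i.e. polls that ran out of budget or time); later columns vary between kernel versions and are ignored. Rows are reported by position, which matches the CPU number only when all CPUs are online.

```python
def parse_softnet_stat(text: str):
    """Decode the first three hex columns of /proc/net/softnet_stat."""
    rows = []
    lines = [line for line in text.splitlines() if line.strip()]
    for row, line in enumerate(lines):
        fields = [int(value, 16) for value in line.split()]
        if len(fields) < 3:
            continue                      # skip malformed lines
        rows.append({
            "row": row,                   # one row per online CPU
            "processed": fields[0],
            "dropped": fields[1],         # backlog queue was full
            "time_squeeze": fields[2],    # budget or time ran out mid-poll
        })
    return rows
```

On a live system the input would come from `open("/proc/net/softnet_stat").read()`; sampling twice and subtracting gives drop rates rather than totals since boot.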

Security Implications

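SO_REUSEPORT lets another process bind to the same port and receive a share of incoming connections. Linux restricts this to processes with the same effective UID as the first binder, so any compromised or misconfigured process running as that user can steal connections from the server. Socket selection within the group can be controlled with an eBPF program attached via SO_ATTACH_REUSEPORT_EBPF (Linux 4.5+).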

DPDK applications typically run with elevated privileges and direct NIC access. A vulnerability in the DPDK application has no kernel sandbox; the entire NIC is compromised. Use SR-IOV to limit DPDK to a Virtual Function (VF) rather than the Physical Function (PF).

XDP programs are eBPF-verified before loading—they cannot loop infinitely or access arbitrary memory. However, bugs in XDP programs can drop legitimate traffic. Production XDP programs should be tested with xdp-tools and monitored via perf counters.

Performance Implications

TCP congestion control algorithm choice matters significantly over long RTT links (> 10 ms) or lossy links. CUBIC (default) reduces rate on any loss; BBR models bandwidth and RTT independently, sustaining higher throughput on the modern Internet. Enabling BBR:

sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.default_qdisc=fq  # fq pacing: required before Linux 4.13, still recommended

Socket buffer sizes must be large enough to fill the bandwidth-delay product (BDP): BDP = bandwidth × RTT. For 10 Gbps with 50 ms RTT: BDP = 10 Gbps × 0.05 s = 500 Mb = 62.5 MB. Default rmem_max of 212 KB cannot fill this pipe.
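The BDP calculation and its inverse (the throughput ceiling a given buffer imposes) fit in two small functions. This is a sketch with illustrative names; it treats the whole buffer as usable window, so real throughput will be somewhat lower.

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product in bytes."""
    return bandwidth_bps * rtt_seconds / 8


def window_limited_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on throughput when at most one window is in flight per RTT."""
    return window_bytes * 8 / rtt_seconds
```

`bdp_bytes(10e9, 0.05)` gives 62.5 million bytes, while `window_limited_bps(212992, 0.05)` comes to roughly 34 Mbit/s, which shows how far the default buffer limit falls short on a 10 Gbps, 50 ms path.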

Failure Modes and Real Incidents

Softnet drop storm. A cloud provider experienced a 30% packet drop rate on their hypervisor hosts. cat /proc/net/softnet_stat showed column 2 consistently non-zero. Root cause: netdev_max_backlog was left at the kernel default (1000), and bursts from multiple VMs saturated the per-CPU backlog queue. Fix: sysctl -w net.core.netdev_max_backlog=5000.

TIME_WAIT exhaustion. A high-connection-rate API server exhausted ephemeral ports (the default range 32768–60999 provides about 28,000 ports per destination address and port). New outgoing connections failed with EADDRNOTAVAIL. Fix: net.ipv4.tcp_tw_reuse=1 and expand ip_local_port_range to 1024 65535. Lesson: with a 60-second TIME_WAIT, the default range is exhausted at roughly 470 short-lived connections/second to a single destination, and even the expanded range at roughly 1,075.
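The port arithmetic behind this failure mode is easy to reproduce. The sketch below assumes every closed connection holds its local port for the full TIME_WAIT period and that all connections go to the same destination address and port; the function names are illustrative.

```python
def ephemeral_port_count(low: int, high: int) -> int:
    """Number of ports in an inclusive ip_local_port_range."""
    return high - low + 1


def sustainable_connection_rate(ports: int, time_wait_seconds: float) -> float:
    """Connections/second to one destination before ports run out."""
    return ports / time_wait_seconds
```

The default range 32768 to 60999 contains 28,232 ports, which supports about 470 new connections per second with a 60-second TIME_WAIT; the expanded range 1024 to 65535 contains 64,512 ports, or about 1,075 per second.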

Modern Usage

XDP is now standard in Linux-based network appliances. Cilium (Kubernetes CNI) uses XDP for load balancing and network policy enforcement. The Linux kernel's sch_cake and fq_codel qdiscs bring active queue management to reduce bufferbloat—a long-standing problem in home and enterprise routers.

AF_XDP (Linux 4.18) allows XDP programs to forward packets to user space via a shared memory ring, combining XDP's kernel-bypass speed with the flexibility of user-space processing (without DPDK's complexity).

Future Directions

QUIC and HTTP/3: move transport-layer reliability to user space (UDP + QUIC library). Reduced kernel involvement, better multiplexing. Require efficient UDP send/recv (GSO for UDP: UDP_SEGMENT).
SmartNICs (DPU — Data Processing Units): programmable NICs that run Linux and execute network functions (encryption, load balancing) on-NIC, offloading host CPUs. NVIDIA BlueField, Intel IPU.
eBPF TC (Traffic Control): eBPF programs in the TC layer for per-packet manipulation, gradually replacing iptables for complex policies.

Exercises

Use iperf3 -c <server> -P 4 -t 30 to measure TCP throughput with 4 parallel streams. Then apply BBR congestion control on both client and server and re-measure. Explain any difference.
Configure SO_REUSEPORT in a Python or Go server and run multiple worker processes. Use ss -tlp to verify multiple sockets are bound to the same port. Load-test with wrk and compare throughput vs. a single-process server.
Write an XDP program (using libbpf or bpf-helpers) that counts packets by source IP and drops packets from a specific IP. Load it with ip link set dev eth0 xdp obj prog.bpf.o. Verify drops with ip -s link.
Tune socket buffer sizes for a high-throughput, high-RTT link (simulate with tc qdisc add dev lo root netem delay 100ms). Measure throughput with and without large rmem/wmem settings.
Run cat /proc/net/softnet_stat before and after a high-PPS load test. Identify and explain each column. What tuning would reduce the drop count in column 2?

References

Gregg, B. Systems Performance (2nd ed., 2020). Chapter 10: Network.
Hemminger, S. "Network Tuning for 10GbE." Netconf 2005.
DPDK documentation: https://doc.dpdk.org/
XDP tutorial: https://github.com/xdp-project/xdp-tutorial
Cloudflare: "L4Drop: XDP DDoS Mitigations." https://blog.cloudflare.com/l4drop-xdp-ebpf-based-ddos-mitigations/
Facebook Katran: https://engineering.fb.com/2018/05/22/open-source/open-sourcing-katran-a-scalable-network-load-balancer/
Linux kernel network sysctl: https://www.kernel.org/doc/html/latest/networking/ip-sysctl.html
Cardwell, N. et al. "BBR: Congestion-Based Congestion Control." ACM Queue, 2016.