08 — Network Performance Tuning

Technical Overview

Network performance tuning is the systematic identification and elimination of bottlenecks in the packet processing path, from NIC hardware through kernel subsystems to application code. A production system that falls short of line rate at 10 Gbps typically has several interacting bottlenecks: high interrupt rates, unnecessary memory copies, CPU context switches, lock contention, and bufferbloat in the transmit path. Effective tuning requires measurement first, then targeted intervention. This document provides a framework with specific kernel knobs, tool commands, and the reasoning behind each setting.

Prerequisites

Linux network stack internals: NAPI, softirqs, qdisc (see 01-linux-network-stack.md)
TCP socket buffers and flow control (see 02-sockets.md)
Zero-copy techniques (see 07-zero-copy-networking.md)
Familiarity with ethtool, ss, tc, perf, netperf, iperf3

Core Content

Bottleneck Taxonomy

Network performance bottlenecks organize into four categories:

| Category | Root Cause | Symptoms |
|---|---|---|
| Interrupt rate | NIC fires too many IRQs | High `si` CPU%, `NET_RX_SOFTIRQ` time |
| Memory copies | Data copied between kernel/user buffers | High `sys` CPU%, cache miss rate |
| Lock contention | Shared socket lock, qdisc lock | CPU cycles wasted in `spin_lock` |
| Context switches | Blocking I/O, thread wake/sleep | High voluntary context switch rate |

Additionally, bufferbloat — excessive queuing in the transmit path — inflates latency without improving throughput, and is addressed separately.

Interrupt Coalescing

Problem: A NIC generating one interrupt per packet at 1 Gbps with 64-byte frames fires 1.5 million interrupts/second. Each interrupt costs ~2–5 µs of CPU time → 3–7.5 seconds of CPU time per second of network time. Impossible.
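
The arithmetic behind these figures can be reproduced with a short shell sketch. It assumes each frame occupies an extra 20 bytes on the wire (preamble and inter-frame gap), which is how 64-byte frames at 1 Gbps come to roughly 1.49 million packets per second. The function names are illustrative, and the per-interrupt cost is simply the 2–5 µs estimate quoted above.

```shell
#!/bin/sh
# Packets per second at line rate for a given frame size.
# Assumption: 20 bytes of per-frame wire overhead (preamble, SFD, inter-frame gap).
wire_pps() (
    link_bps=$1
    frame_bytes=$2
    echo $(( link_bps / ((frame_bytes + 20) * 8) ))
)

# CPU microseconds consumed per second if every packet raises one interrupt.
irq_cpu_us_per_sec() (
    pps=$1
    cost_us=$2
    echo $(( pps * cost_us ))
)

pps=$(wire_pps 1000000000 64)
echo "packets/s at 1 Gbps, 64-byte frames: $pps"
echo "CPU us per second at 2 us per IRQ: $(irq_cpu_us_per_sec "$pps" 2)"
echo "CPU us per second at 5 us per IRQ: $(irq_cpu_us_per_sec "$pps" 5)"
```

Any result above 1,000,000 µs per second means a single core cannot keep up with one interrupt per packet.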

Solution: Interrupt coalescing — the NIC fires an interrupt after receiving N packets OR waiting X microseconds, whichever comes first.

# View current coalescing settings
ethtool -c eth0

# Set coalescing parameters
ethtool -C eth0 rx-usecs 50 rx-frames 0
# rx-usecs: delay the RX interrupt by up to 50µs after a packet arrives
# rx-frames: 0 = disable frame count trigger (usecs only)

# Adaptive coalescing (NIC adjusts based on load)
ethtool -C eth0 adaptive-rx on adaptive-tx on

Trade-off: coalescing reduces interrupt rate (better throughput) at the cost of increased latency (packets wait up to rx-usecs before processing). For latency-sensitive workloads (HFT, real-time), reduce or disable coalescing. For throughput workloads (bulk transfer, CDN), increase coalescing.

Typical production values:

- Throughput-optimized: rx-usecs 200 rx-frames 256
- Latency-optimized: rx-usecs 10 rx-frames 0
- Default adaptive usually works well for mixed traffic
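
To get a feel for the trade-off, the sketch below estimates how many packets are batched per interrupt for a given rx-usecs value. It assumes only the timer fires (rx-frames 0) and uses the roughly 1.5 million packets per second figure from the small-frame example above. The function names are illustrative.

```shell
#!/bin/sh
# Upper bound on interrupts per second when only the rx-usecs timer triggers.
max_irq_per_sec() (
    rx_usecs=$1
    echo $(( 1000000 / rx_usecs ))
)

# Packets delivered per interrupt at a given packet rate.
pkts_per_irq() (
    pps=$1
    rx_usecs=$2
    echo $(( pps * rx_usecs / 1000000 ))
)

for usecs in 10 50 200; do
    echo "rx-usecs=$usecs: up to $(max_irq_per_sec "$usecs") IRQ/s," \
         "$(pkts_per_irq 1500000 "$usecs") packets per IRQ at 1.5 Mpps"
done
```

With rx-frames set to a non-zero value, the interrupt fires earlier whenever the frame count is reached first, so these numbers are an upper bound on batching.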

RX Ring Buffer Size

The NIC has a hardware ring buffer of descriptors pointing to pre-allocated kernel memory. If the ring fills before the driver (NAPI) drains it, the NIC hardware drops packets — these drops are invisible to the kernel stack and hard to diagnose:

# Check current and maximum ring buffer sizes
ethtool -g eth0
# Ring parameters for eth0:
# Pre-set maximums:
# RX:    4096
# Current hardware settings:
# RX:    256    <-- often the default, often too small

# Increase RX ring buffer
ethtool -G eth0 rx 4096

# Verify NIC hardware drops
ethtool -S eth0 | grep -i 'drop\|miss'
# e.g., rx_missed_errors, rx_no_buffer_count

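Hardware drops show up in ethtool -S but not in netstat -s, which only covers the protocol layers. Depending on the driver, they may also be reported as missed in ip -s -s link. This is the most commonly overlooked drop category.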
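
A rough way to judge whether a ring size is adequate is to ask how long the driver can be stalled before the ring overflows. The sketch below assumes a worst case in which nothing drains the ring during the stall, and uses the 1.5 million packets per second small-frame rate from earlier. The function name is illustrative.

```shell
#!/bin/sh
# Microseconds until a ring of the given size fills at a given packet rate,
# assuming the driver drains nothing in the meantime.
ring_fill_us() (
    ring=$1
    pps=$2
    echo $(( ring * 1000000 / pps ))
)

echo "256 descriptors:  $(ring_fill_us 256 1500000) us of slack"
echo "4096 descriptors: $(ring_fill_us 4096 1500000) us of slack"
```

If softirq processing can be delayed longer than the computed slack (for example by a long-running interrupt handler on the same CPU), drops at the NIC are likely.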

RPS and RFS: Software RSS

RPS (Receive Packet Steering): for NICs with fewer RX queues than CPUs (in the extreme, a single queue), RPS distributes incoming packets across multiple CPUs by computing a flow hash in software and enqueuing packets to the target CPU's backlog:

# Enable RPS on all CPUs (hex bitmask = 0xff for 8 CPUs)
echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Set RFS flow table sizes (used by RFS, described below; RPS itself only needs rps_cpus)
echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

RFS (Receive Flow Steering): extends RPS by steering packets to the CPU where the application thread that owns the socket is running, improving cache locality. Without RFS, a packet arrives on CPU 0 (RPS), but the application wakes up on CPU 7 and must fetch socket data from CPU 0's cache:

# RFS is active once rps_cpus, rps_sock_flow_entries, and the per-queue rps_flow_cnt are all non-zero
sysctl -w net.core.rps_sock_flow_entries=32768

# Per-queue flow table (should sum to rps_sock_flow_entries)
echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt

XPS (Transmit Packet Steering): steers outgoing packets from a CPU to the TX queue associated with that CPU, reducing cross-CPU TX locking:

# Pin TX queue 0 to CPU 0, TX queue 1 to CPU 1, etc.
echo 01 > /sys/class/net/eth0/queues/tx-0/xps_cpus
echo 02 > /sys/class/net/eth0/queues/tx-1/xps_cpus
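
The rps_cpus and xps_cpus files take hexadecimal CPU bitmasks, which are easy to get wrong by hand. The helper below is a minimal sketch that builds the mask from a list of CPU numbers. The function name is illustrative, and it is only intended for CPUs 0–31; larger systems need the comma-separated multi-word mask format.

```shell
#!/bin/sh
# Build a hexadecimal CPU bitmask from a list of CPU numbers (0-31 only).
cpu_mask() (
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
)

cpu_mask 0 1 2 3 4 5 6 7   # prints ff
cpu_mask 1                 # prints 2
cpu_mask 4 5 6 7           # prints f0
```

The output can be written directly to a queue's rps_cpus or xps_cpus file, for example to steer RX queue 0 to CPUs 4–7.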

GRO: Generic Receive Offload

GRO coalesces multiple incoming TCP/UDP segments into a single larger sk_buff before passing to the network stack. It simulates hardware Large Receive Offload (LRO) in software:

# Check and enable GRO
ethtool -k eth0 | grep generic-receive
ethtool -K eth0 generic-receive-offload on

# GRO runs after the driver has counted packets, so /proc/net/dev does not show it.
# A capture does: tcpdump sees the coalesced segments, which are larger than the MTU
tcpdump -ni eth0 -c 20 tcp

GRO reduces the number of sk_buffs processed (each coalesced super-segment replaces 10–44 individual segments), cutting protocol processing overhead proportionally. For 1 Gbps TCP, GRO typically reduces CPU usage 30–50%.

GRO is enabled by default on most distributions. Disable it only when diagnosing timing issues (GRO normally flushes at the end of each NAPI poll, but a non-zero gro_flush_timeout holds segments longer) or when it cannot help, as with packets dropped by native XDP, which runs before GRO.

GSO and TSO: Segmentation Offload

TSO (TCP Segmentation Offload): the NIC splits a large TCP buffer into MSS-sized segments, adding TCP/IP headers to each. The kernel sends one large virtual segment; the NIC does the work:

ethtool -k eth0 | grep segmentation
ethtool -K eth0 tcp-segmentation-offload on
ethtool -K eth0 generic-segmentation-offload on  # GSO: software fallback

GSO (Generic Segmentation Offload): software TSO. When the NIC doesn't support TSO (or for tunneled traffic the NIC cannot parse), GSO performs segmentation in the kernel's dev layer, just before the driver. The TCP layer still handles one large segment instead of many small ones.

Impact: without TSO/GSO, TCP sends 1500-byte segments (1 MTU). With TSO/GSO, TCP hands 64KB (or larger) buffers to the NIC layer — 40x fewer protocol processing iterations.
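
The reduction factor is just the number of MSS-sized segments that fit in one large buffer. The sketch below computes it with ceiling division. The MSS values are assumptions for a 1500-byte MTU (1460 bytes without TCP timestamps, 1448 with), and the function name is illustrative.

```shell
#!/bin/sh
# Number of MSS-sized wire segments produced from one large send buffer
# (ceiling division).
segments_per_buffer() (
    buf_bytes=$1
    mss=$2
    echo $(( (buf_bytes + mss - 1) / mss ))
)

echo "64KB buffer, MSS 1460: $(segments_per_buffer 65536 1460) segments"
echo "64KB buffer, MSS 1448: $(segments_per_buffer 65536 1448) segments"
```

Each of those segments would otherwise be a separate trip through the TCP and IP output path, which is where the saving comes from.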

Socket Buffer Tuning

Default TCP socket buffer limits are often too small for high-bandwidth, high-latency links:

# Current defaults
sysctl net.core.rmem_default    # 212992 (208KB)
sysctl net.core.rmem_max        # 212992 (208KB — far too small)
sysctl net.ipv4.tcp_rmem        # 4096 131072 6291456 (min/default/max)
sysctl net.ipv4.tcp_wmem        # 4096 16384 4194304

# For a 10 Gbps, 100ms RTT link, bandwidth-delay product:
# BDP = 10e9 bits/s * 0.1 s / 8 bits/byte = 125 MB
# Socket buffer must be >= BDP to fill the pipe

# Tuning for high-BDP paths
sysctl -w net.core.rmem_max=134217728      # 128MB max
sysctl -w net.core.wmem_max=134217728      # 128MB max
sysctl -w net.ipv4.tcp_rmem='4096 87380 134217728'
sysctl -w net.ipv4.tcp_wmem='4096 65536 134217728'

# Enable TCP buffer auto-tuning (default on, verify)
sysctl net.ipv4.tcp_moderate_rcvbuf   # should be 1
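
The bandwidth-delay product calculation generalizes to any link. The helper below is a minimal sketch using integer arithmetic. The function name is illustrative, and the second example (1 Gbps at 20 ms) is a made-up path used only to show the scaling.

```shell
#!/bin/sh
# Bandwidth-delay product in bytes for a link rate (bits/s) and RTT (ms).
bdp_bytes() (
    rate_bps=$1
    rtt_ms=$2
    echo $(( rate_bps / 8 * rtt_ms / 1000 ))
)

echo "10 Gbps, 100 ms RTT: $(bdp_bytes 10000000000 100) bytes"
echo "1 Gbps, 20 ms RTT:   $(bdp_bytes 1000000000 20) bytes"
```

The maximum field of tcp_rmem and tcp_wmem should be at least this large for a single flow to fill the path.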

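For UDP, auto-tuning doesn't exist, so set the buffer explicitly. Raise the system-wide ceiling, then request the size in the application:

sysctl -w net.core.rmem_max=26214400   # 25MB ceiling for UDP multicast
# In the application (C); the kernel caps the request at rmem_max:
setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &buf_size, sizeof(buf_size));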

Bufferbloat and BQL

Bufferbloat: excessive buffering in the transmit path masks congestion signals, causing queue lengths to grow to hundreds of milliseconds of delay before packets are dropped. This makes TCP's congestion control blind to congestion.

BQL (Byte Queue Limits): introduced in Linux 3.3, limits the number of bytes queued in the NIC's hardware TX ring to the minimum needed to keep the NIC busy:

# View current BQL limits
cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit
cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max
cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min

# BQL auto-adjusts; limit_max can be set to constrain maximum queue
echo 10000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max

BQL keeps the hardware ring shallow so that the standing queue forms in the qdisc, where fq or fq_codel can schedule flows fairly and apply AQM. Without BQL, a single large TCP transfer can fill the hardware ring, introducing 10–100ms of queuing delay for interactive traffic.
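
The delay added by a full hardware ring follows directly from the bytes queued and the link rate. The sketch below shows the calculation. The ring contents (4096 descriptors of 1500 bytes) and the 100000-byte BQL limit are assumed example values, and the function name is illustrative.

```shell
#!/bin/sh
# Queuing delay in microseconds for a number of bytes waiting on a link.
queue_delay_us() (
    queued_bytes=$1
    rate_bps=$2
    echo $(( queued_bytes * 8 * 1000000 / rate_bps ))
)

full_ring=$(( 4096 * 1500 ))   # assumed: 4096 descriptors, 1500-byte frames
echo "full ring at 1 Gbps:        $(queue_delay_us "$full_ring" 1000000000) us"
echo "full ring at 10 Gbps:       $(queue_delay_us "$full_ring" 10000000000) us"
echo "100000-byte limit at 1 Gbps: $(queue_delay_us 100000 1000000000) us"
```

At 1 Gbps the full ring adds about 49 ms, which matches the range quoted above, while a byte limit of 100000 keeps the added delay under a millisecond.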

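TCP small queues (TSQ): another anti-bufferbloat mechanism. It limits how many bytes each TCP socket may have queued in the qdisc and device layers, to roughly 1 ms of data at the flow's pacing rate (and at least two packets), capped by tcp_limit_output_bytes. This prevents a single flow from monopolizing the qdisc: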

sysctl net.ipv4.tcp_limit_output_bytes  # default 1048576 (1MB)

Qdisc: Queue Discipline Selection

The qdisc is the Linux traffic scheduler — it determines the order packets leave the device.

| Qdisc | Algorithm | Best for |
|---|---|---|
| `pfifo_fast` | 3 priority bands, FIFO | Simple, low overhead |
| `fq_codel` | Fair queue + CoDel AQM | Default (Debian/Ubuntu); reduces bufferbloat |
| `fq` | Fair queue, per-flow pacing | High-throughput servers (nginx, databases) |
| `mq` | Multi-queue wrapper | NICs with multiple TX queues |
| `tbf` | Token bucket | Rate limiting |
| `htb` | Hierarchical token bucket | Traffic shaping with classes |

# Check current qdisc
tc qdisc show dev eth0

# Set fq for a high-throughput server
tc qdisc replace dev eth0 root fq

# Set fq_codel for a mixed workload server
tc qdisc replace dev eth0 root fq_codel

# Inspect qdisc statistics
tc -s qdisc show dev eth0

fq (Fair Queue, Linux 3.12, Eric Dumazet) gives each flow its own queue and serves the queues round-robin, preventing one large flow from starving small flows (HTTP requests, DNS). It also paces each flow: packets are released at the pacing rate TCP sets for the socket rather than in window-sized bursts.
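
Pacing amounts to spacing packets by the time each one takes to send at the flow's target rate. The sketch below computes that gap. The rates and the 1500-byte packet size are example values, and the function name is illustrative rather than anything the kernel exposes.

```shell
#!/bin/sh
# Gap between packet departures, in nanoseconds, for a flow paced at rate_bps.
pacing_gap_ns() (
    pkt_bytes=$1
    rate_bps=$2
    echo $(( pkt_bytes * 8 * 1000000000 / rate_bps ))
)

echo "1500 bytes at 1 Gbps:   $(pacing_gap_ns 1500 1000000000) ns"
echo "1500 bytes at 100 Mbps: $(pacing_gap_ns 1500 100000000) ns"
```

A flow limited to 100 Mbps on a 10 Gbps NIC therefore sends one packet every 120 µs instead of a line-rate burst followed by silence.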

Complete Tuning Workflow

#!/bin/bash
# Network performance tuning script for a 10G server.
# Values are starting points; measure before and after each change.

IFACE=eth0

# 1. Interrupt coalescing (adaptive for mixed workload)
ethtool -C "$IFACE" adaptive-rx on adaptive-tx on

# 2. Maximize NIC ring buffers (check the maximums with ethtool -g first)
ethtool -G "$IFACE" rx 4096 tx 4096

# 3. Enable offloads (rx/tx = checksum offload)
ethtool -K "$IFACE" gro on tso on gso on rx on tx on

# 4. RSS: one combined queue per core (fails if the NIC supports fewer; see ethtool -l)
ethtool -L "$IFACE" combined "$(nproc)"

# 5. IRQ affinity (one queue per core). Stop irqbalance first or it will undo this.
#    The "<iface>-TxRx-<n>" IRQ name is driver-specific; check /proc/interrupts.
for i in $(seq 0 $(($(nproc) - 1))); do
    # -w keeps queue 1 from also matching queues 10-19
    IRQ=$(grep -w "${IFACE}-TxRx-${i}" /proc/interrupts | cut -d: -f1 | tr -d ' ')
    [ -n "$IRQ" ] && echo "$i" > "/proc/irq/$IRQ/smp_affinity_list"
done

# 6. Socket buffers
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem='4096 87380 134217728'
sysctl -w net.ipv4.tcp_wmem='4096 65536 134217728'

# 7. NAPI budget
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_budget_usecs=3000

# 8. qdisc: fq under every TX queue. Set the default first, then re-create mq
#    so its per-queue children pick it up (on a single-queue NIC use "root fq").
sysctl -w net.core.default_qdisc=fq  # fq provides pacing for BBR (mandatory before Linux 4.13)
tc qdisc replace dev "$IFACE" root mq

# 9. TCP congestion control (requires the tcp_bbr module)
sysctl -w net.ipv4.tcp_congestion_control=bbr

echo "Done. Verify with: ethtool -S $IFACE && ss -tm"

Benchmarking with iperf3 and netperf

# iperf3: TCP throughput (single stream)
iperf3 -s                    # server
iperf3 -c <server> -t 30     # client: 30-second test

# iperf3: parallel streams (fill the pipe)
iperf3 -c <server> -P 8 -t 30

# iperf3: UDP packet loss test
iperf3 -c <server> -u -b 1G  # UDP at 1Gbps target

# iperf3: include the server-side report (the client already shows retransmits in the Retr column)
iperf3 -c <server> --get-server-output

# netperf: more precise; -c/-C add local/remote CPU utilization to the report
netserver &                    # server
netperf -H <server> -t TCP_STREAM -l 30 -c -C  # throughput + CPU
netperf -H <server> -t TCP_RR -l 30      # request/response latency
netperf -H <server> -t UDP_RR -l 30      # UDP round-trip

# qperf: TCP bandwidth and latency (qperf also offers RDMA tests for InfiniBand)
qperf <server> tcp_bw tcp_lat

Historical Context

TCP performance tuning was largely a research area until the late 1990s when web servers began saturating 100 Mbps links. The seminal work by W. Richard Stevens (TCP/IP Illustrated) and by Sally Floyd on TCP improvements provided the theoretical basis.

The fq_codel algorithm (Dave Täht, Jim Gettys, Eric Dumazet, 2012) emerged from the CoDel (Controlled Delay) AQM research, directly addressing bufferbloat that had been accumulating in home routers and enterprise switches since the 1990s. The Bufferbloat Project (bufferbloat.net) documented widespread RTT inflation caused by oversized hardware buffers.

BQL was developed by Tom Herbert (Google) as a practical kernel-level fix to hardware queue depth bloat, landing in Linux 3.3 (2012). fq_codel followed in Linux 3.5 and fq in Linux 3.12.

Production Examples

CDN edge server (10G, high QPS):

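# Many short connections: let outbound connections (e.g. to origin) reuse
# TIME_WAIT sockets; this does not affect inbound client connections
sysctl -w net.ipv4.tcp_tw_reuse=1
# Shorten FIN_WAIT_2 for orphaned sockets (does not change TIME_WAIT duration)
sysctl -w net.ipv4.tcp_fin_timeout=15

# Hash multipath (ECMP) routes on L4 fields so flows spread across uplinks
sysctl -w net.ipv4.fib_multipath_hash_policy=1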

# Verify throughput with iperf3 (should see >9.5 Gbps)
iperf3 -c edge-test.example.com -P 4

Database server (latency-sensitive, 25G):

# Minimal coalescing for low latency
ethtool -C eth0 rx-usecs 10 tx-usecs 10 adaptive-rx off

# TCP_NODELAY is a per-socket option set by the application, not a sysctl
# (tcp_keepalives_interval in postgresql.conf only controls keepalive probes)

# Verify latency target (<1ms round trip for same-rack)
netperf -H db-peer -t TCP_RR -- -r 4096,4096

Debugging Notes

# Full bottleneck diagnosis workflow

# 1. Check hardware drops (NIC ring overflow)
ethtool -S eth0 | grep -i 'miss\|drop\|error'

# 2. Check softnet drops (backlog overflow) and time_squeeze (NAPI budget exhausted); values are hexadecimal
cat /proc/net/softnet_stat | awk 'BEGIN{cpu=0} {print "CPU"cpu++": pkts="$1" drops="$2" squeezed="$3}'

# 3. Check socket drops (recv buffer full)
ss -uam  # UDP: a growing Recv-Q means the app is falling behind; the d<N> field in skmem counts drops
netstat -su | grep 'errors\|overflow'

# 4. Check TCP retransmits (congestion or loss)
ss -ti dst :443 | grep retrans

# 5. Check qdisc drops
tc -s qdisc show dev eth0 | grep dropped

# 6. CPU bottleneck identification
perf top -e cpu-cycles --sort cpu,comm,symbol
# Look for: __napi_poll, net_rx_action, tcp_rcv_established

# 7. Interrupt distribution
cat /proc/interrupts | grep eth0
# Should be spread across CPUs; if all on CPU0, RSS misconfigured

# One-shot bottleneck snapshot
ss -s && cat /proc/net/softnet_stat && ethtool -S eth0 | grep drop
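
Because /proc/net/softnet_stat prints hexadecimal columns, a small decoder makes the counters easier to read. The sketch below reads rows from standard input, so it can be tried on sample data as well as on a live system. It assumes the first three columns are packets processed, packets dropped, and time_squeeze; the function name and the sample row are illustrative.

```shell
#!/bin/sh
# Decode softnet_stat rows (one per CPU, hexadecimal columns) from stdin.
# Column 1 = packets processed, 2 = dropped (backlog full), 3 = time_squeeze.
decode_softnet() {
    cpu=0
    while read -r processed dropped squeezed _rest; do
        printf 'cpu%d processed=%d dropped=%d squeezed=%d\n' \
            "$cpu" "$((0x$processed))" "$((0x$dropped))" "$((0x$squeezed))"
        cpu=$((cpu + 1))
    done
}

# On a live system:
#   decode_softnet < /proc/net/softnet_stat
# Sample row, made up for illustration:
echo "0000272d 00000000 00000001 00000000 00000000" | decode_softnet
```

A non-zero dropped count points at the per-CPU backlog limit, while a rising squeezed count suggests the NAPI budget is running out before the queues are drained.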

Security Implications

Large socket buffers + per-socket memory: increasing rmem_max to 128MB means a single socket can consume 128MB of kernel memory. With many connections, this enables memory exhaustion. Mitigate with per-process socket limits and connection rate limiting.
Disabling offloads for IDS/IPS: TSO and GRO reconstruct large virtual segments — an IDS/IPS appliance may need to see individual TCP segments to detect fragmentation attacks. Disable offloads on monitoring interfaces.
tc qdisc and DoS: an attacker sending traffic that matches a high-priority qdisc class can starve lower-priority classes. Implement ingress rate limiting (tc police) before qdisc prioritization.

Performance Implications

Key principle: measure before tuning. A common mistake is applying a complete tuning script without profiling, masking the actual bottleneck. The right order:

Establish baseline with iperf3 + netperf
Identify bottleneck category (interrupt, copy, lock, context switch)
Apply one change at a time
Measure impact
Document final configuration in /etc/sysctl.d/99-network.conf

Failure Modes and Real Incidents

Incident: GRO interaction with XDP (2020, financial firm) XDP was attached in generic (skb) mode on an interface that still had GRO enabled. The generic hook runs after GRO, so the XDP program saw coalesced super-segments rather than the individual wire packets it expected. Fix: ethtool -K eth0 gro off when XDP runs in generic mode, or attach in native mode, where the program runs in the driver before GRO.

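Incident: Ring buffer increase causing OOM (e-commerce, 2021) Increasing the RX ring from 256 to 4096 descriptors on 40G NICs across 100 network interfaces multiplied buffer memory sixteenfold. At 2KB per buffer, each queue pins 4096 × 2KB = 8MB, and that figure is multiplied by the number of queues per interface and by the number of interfaces; the reported total was 320GB of kernel memory. Host OOMed. Fix: calculate total memory impact before scaling: ring_size × frame_size × num_queues × num_interfaces.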
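
The memory formula from the ring buffer incident is simple enough to check before any change is rolled out. The sketch below multiplies it out. The queue and interface counts in the examples are hypothetical, since the real values depend on the NIC and driver, and the function name is illustrative.

```shell
#!/bin/sh
# Kernel memory pinned by RX rings:
# ring_size x buffer_bytes x queues_per_interface x interfaces.
ring_mem_bytes() (
    ring=$1
    buf_bytes=$2
    queues=$3
    ifaces=$4
    echo $(( ring * buf_bytes * queues * ifaces ))
)

echo "one queue, one interface:    $(ring_mem_bytes 4096 2048 1 1) bytes"
echo "16 queues, 2 interfaces:     $(ring_mem_bytes 4096 2048 16 2) bytes"
```

Comparing the result with free memory on the host before running ethtool -G is a cheap safeguard against the failure described above.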

Failure Mode: BBR without fq qdisc BBR depends on pacing. Before Linux 4.13 that pacing was only available from the fq qdisc, so BBR over pfifo_fast sent unpaced bursts and misbehaved. Newer kernels fall back to pacing inside the TCP stack when fq is absent, which works but costs more CPU (one high-resolution timer per socket). Pairing net.ipv4.tcp_congestion_control=bbr with net.core.default_qdisc=fq remains the recommended setup.

Modern Usage

Linux 5.18+: the SO_TXREHASH socket option (and the net.core.txrehash sysctl) controls whether a socket recomputes its transmit hash after repeated retransmission timeouts, which lets a stalled flow move to a different TX queue or ECMP path
ethtool netlink API: replaces ioctl-based ethtool with a kernel netlink interface supporting atomic multi-parameter changes
NIC hardware timestamping: ethtool -T eth0 — hardware RX/TX timestamps enable sub-microsecond latency measurement without software instrumentation overhead
BPF qdisc offload: tc BPF programs can implement custom queueing disciplines that run in kernel space without the overhead of user-space qdisc daemons

Future Directions

eBPF-based congestion control and pacing at the qdisc level, enabling per-flow policies without kernel patches
Hardware pacing in NICs (Mellanox ConnectX-7+): the NIC enforces pacing directly in hardware, removing the kernel fq qdisc overhead for BBR
Kernel-bypass qdisc via AF_XDP: the Linux community is exploring allowing AF_XDP to participate in the kernel's qdisc pipeline, enabling userspace qdiscs with kernel-managed rate enforcement

Exercises

Benchmark interrupt coalescing: set rx-usecs to 0 (minimum latency), 50 (balanced), and 500 (maximum throughput). For each, measure throughput with iperf3 and P99 latency with netperf TCP_RR. Plot the latency-throughput curve.
Demonstrate bufferbloat: set tc qdisc replace dev eth0 root pfifo limit 10000. Run iperf3 for 10 seconds while simultaneously running ping. Record the ping RTT inflation. Then switch to fq_codel and repeat.
Tune socket buffers for a 1 Gbps, 100ms RTT WAN link. Calculate the required buffer size from BDP. Apply the sysctl settings. Verify with iperf3 that throughput improves, and use ss -tm to observe buffer auto-tuning in action.
Write a performance regression test that captures 15 key network metrics (softnet_stat, ethtool -S, ss -s, tc -s qdisc) before and after a configuration change. Output a diff and flag any metric that changed by more than 5%.
Profile a high-QPS HTTP server under load using perf record -g. Identify the top 3 kernel functions consuming CPU in the network receive path. For each, explain what optimization (GRO, interrupt coalescing, RSS) reduces its overhead.

References

net/core/dev.c — net_rx_action, netdev_budget, NAPI budget logic
net/sched/sch_fq.c — Fair Queue qdisc implementation
net/sched/sch_fq_codel.c — FQ-CoDel (CoDel AQM)
lib/dynamic_queue_limits.c — BQL algorithm (netdev_tx_completed_queue is defined in include/linux/netdevice.h)
Herbert, T. & de Bruijn, W. Scaling in the Linux Networking Stack. kernel.org docs, 2014.
Gettys, J. & Nichols, K. Bufferbloat: Dark Buffers in the Internet. ACM Queue, 2011.
Dumazet, E. TCP Small Queues. LWN.net, 2012.
Documentation/networking/scaling.rst — RSS, RPS, RFS, XPS
Documentation/networking/ip-sysctl.rst — all tcp/ip sysctls
man 8 ethtool, man 8 tc, man 1 iperf3, man 1 netperf