01 — Linux Network Stack
Technical Overview
The Linux network stack is a layered software system that moves data between user-space applications and physical hardware. Unlike a textbook OSI model, the Linux implementation collapses layers pragmatically: the socket layer serves both session and transport concerns, the IP layer handles network-layer routing, and the device driver layer encompasses both data-link and physical concerns. Understanding the stack in depth is a prerequisite for diagnosing latency spikes, packet drops, and throughput ceilings in production systems.
The stack is invoked on two paths — receive (Rx) and transmit (Tx) — and each path crosses the kernel/user boundary, traverses several subsystems, and interacts with hardware DMA engines.
Prerequisites
- C language familiarity (struct layouts, pointer arithmetic)
- Basic OS concepts: interrupts, softirqs, kernel threads
- Network fundamentals: Ethernet frames, IP packets, TCP/UDP segments
- Familiarity with the ss, ip, ethtool, and tc command-line tools
Core Content
The sk_buff Structure
Every packet in the Linux kernel is represented by a struct sk_buff (socket buffer), defined in include/linux/skbuff.h. It is the single most important data structure in the network stack: allocated on both the receive and transmit paths, cloned for multicast and retransmission, freed once the packet has been delivered or sent, and annotated by every layer it traverses.
Key fields:
/* Simplified, illustrative view. The h/nh/mac unions follow the layout of
 * older (2.4 and early 2.6) kernels; current kernels store transport_header,
 * network_header, and mac_header offsets instead. */
struct sk_buff {
    /* Transport layer pointers */
    union {
        struct tcphdr  *th;
        struct udphdr  *uh;
        struct icmphdr *icmph;
    } h;
    /* Network layer pointer */
    union {
        struct iphdr   *iph;
        struct ipv6hdr *ipv6h;
    } nh;
    /* Link layer pointer */
    union {
        struct ethhdr  *ethernet;
        unsigned char  *raw;
    } mac;
    struct net_device *dev;      /* ingress/egress device */
    unsigned int      len;       /* total packet length */
    unsigned int      data_len;  /* length of paged data */
    __be16            protocol;  /* L3 protocol (ETH_P_IP, etc.), network byte order */
    __u32             priority;
    sk_buff_data_t    tail;      /* end of linear data */
    sk_buff_data_t    end;       /* end of buffer */
    unsigned char     *head;     /* start of buffer */
    unsigned char     *data;     /* start of packet data */
    unsigned char     cb[48];    /* control block, per-layer */
    struct sock       *sk;       /* owning socket (or NULL) */
    ktime_t           tstamp;    /* receive timestamp */
    /* ... many more fields ... */
};
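The cb[] array (control block) is a 48-byte scratch space that each layer uses for its own metadata without polluting the main structure. TCP uses it (through the TCP_SKB_CB() macro) to store sequence numbers and flags for each segment; the IPv4 layer keeps parsed IP options and fragmentation flags there (IPCB()). Because each layer overwrites the area, its contents are only meaningful inside the layer that wrote them. Connection tracking state is not kept in cb[]; Netfilter attaches it through a separate sk_buff field.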
Memory layout around packet data:
                                tail  end
                                |     |
                                v     v
+----------+--------------------+-----+
| headroom |    packet data     |tail |
| (pushed  | (L2 hdr)(L3 hdr)   |room |
| hdrs)    | (L4 hdr)(payload)  |     |
+----------+--------------------+-----+
^          ^
|          |
head       data
skb_push() decrements data to prepend a header (used going down the stack). skb_put() advances tail to append data. skb_pull() increments data to consume a header (used going up the stack on receive).
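The pointer arithmetic described above can be tried outside the kernel. The following is a minimal userspace sketch, not kernel code: struct mini_skb and the mini_skb_* functions are hypothetical stand-ins that only mimic how skb_reserve(), skb_put(), skb_push(), and skb_pull() move data and tail between head and end. The assertions play the role of the kernel's over/underrun checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the four buffer pointers of sk_buff. */
struct mini_skb {
    unsigned char *head;  /* start of allocated buffer */
    unsigned char *data;  /* start of packet data */
    unsigned char *tail;  /* end of packet data */
    unsigned char *end;   /* end of allocated buffer */
    size_t len;           /* bytes between data and tail */
};

/* Allocate `size` bytes; the buffer starts empty with data == tail == head. */
static int mini_skb_alloc(struct mini_skb *skb, size_t size)
{
    skb->head = malloc(size);
    if (!skb->head)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return 0;
}

/* Reserve headroom on an empty buffer (mirrors skb_reserve). */
static void mini_skb_reserve(struct mini_skb *skb, size_t n)
{
    assert(skb->len == 0 && (size_t)(skb->end - skb->tail) >= n);
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns where they start (mirrors skb_put). */
static unsigned char *mini_skb_put(struct mini_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert((size_t)(skb->end - skb->tail) >= n);
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend n bytes by moving data backwards (mirrors skb_push). */
static unsigned char *mini_skb_push(struct mini_skb *skb, size_t n)
{
    assert((size_t)(skb->data - skb->head) >= n);
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Strip n bytes from the front (mirrors skb_pull). */
static unsigned char *mini_skb_pull(struct mini_skb *skb, size_t n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

A transmit-style sequence would reserve headroom, put the payload, then push the UDP, IP, and Ethernet headers in that order; a receive-style sequence pulls them off in the reverse order.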
Packet Receive Path
NIC hardware
|
| DMA ring descriptor filled, interrupt fired
v
Hardware ISR (hard IRQ — minimal work)
|
| raise NET_RX_SOFTIRQ
v
NAPI poll() [net_rx_action in softirq context]
|
| skb allocated, headers checked
v
netif_receive_skb()
|
+---> ptype_all handlers (AF_PACKET sniffers, tcpdump)
|
+---> ptype_base[protocol] (ETH_P_IP → ip_rcv)
|
v
ip_rcv() [net/ipv4/ip_input.c]
|
+---> Netfilter: NF_INET_PRE_ROUTING
|
v
ip_rcv_finish()
|
v
ip_route_input() — routing decision
|
+--- local delivery ---> ip_local_deliver()
| |
| NF_INET_LOCAL_IN
| |
| tcp_v4_rcv() / udp_rcv()
| |
| sock receive queue
| |
| application recv()
|
+--- forward ----------> ip_forward()
|
NF_INET_FORWARD
|
ip_output()
NAPI (New API) was introduced to solve the interrupt storm problem. Under high packet rates, a NIC firing an interrupt per packet consumes all CPU time in interrupt context. NAPI switches the NIC to polling mode after the first interrupt: the driver disables interrupts for that queue and schedules a poll() function to run in softirq context, draining up to budget packets (default 64) before re-enabling interrupts.
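/* Driver NAPI poll callback example (sketch: struct mydriver_ring and the
 * ring_has_packets(), build_skb_from_ring(), and mydriver_enable_rx_irq()
 * helpers are hypothetical driver-specific code) */
static int mydriver_poll(struct napi_struct *napi, int budget)
{
    struct mydriver_ring *rx_ring = container_of(napi, struct mydriver_ring, napi);
    struct sk_buff *skb;
    int work_done = 0;

    while (work_done < budget && ring_has_packets(rx_ring)) {
        skb = build_skb_from_ring(rx_ring);
        napi_gro_receive(napi, skb);    /* GRO coalescing */
        work_done++;
    }

    /* Ring drained before the budget ran out: leave polling mode and
     * re-arm the device's RX interrupt. napi_complete_done() returns
     * false if the poll must stay scheduled, so check it first. */
    if (work_done < budget && napi_complete_done(napi, work_done))
        mydriver_enable_rx_irq(rx_ring);

    return work_done;
}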
RSS (Receive Side Scaling) distributes packets across multiple CPU cores using a hash of the 4-tuple (src IP, src port, dst IP, dst port) computed in NIC hardware. The hash indexes an indirection table whose entries name an RX queue, and each queue has its own IRQ that can be affined to a specific CPU. Because all packets of a flow hash to the same queue, per-flow ordering is preserved while the single-CPU bottleneck is removed for high-throughput servers.
Check RSS queue count and indirection table:
ethtool -l eth0 # show current/maximum channels
ethtool -x eth0 # show RSS hash indirection table
ethtool -X eth0 equal 8 # distribute evenly across 8 queues
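To make the hash-to-queue mapping concrete, here is a small userspace sketch. It assumes a Toeplitz hash (a common choice in NIC hardware, though the hash function is device-specific), a 40-byte key, and a 128-entry indirection table; those sizes and all function names are illustrative assumptions, not taken from any particular driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RSS_KEY_LEN   40   /* bytes; assumed key size */
#define RSS_INDIR_LEN 128  /* entries; assumed indirection table size */

/* Toeplitz hash: for every set bit of the input, XOR in the 32-bit window
 * of the key that starts at the same bit position. */
static uint32_t toeplitz_hash(const uint8_t key[RSS_KEY_LEN],
                              const uint8_t *input, size_t len)
{
    uint32_t hash = 0;
    /* Window initially holds key bits 0..31. */
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8) | key[3];
    size_t next_key_byte = 4;

    assert(len + 4 <= RSS_KEY_LEN);
    for (size_t i = 0; i < len; i++) {
        uint8_t next = key[next_key_byte++];

        for (int bit = 7; bit >= 0; bit--) {
            if (input[i] & (1u << bit))
                hash ^= window;
            /* Slide the window left one bit, pulling in the next key bit. */
            window = (window << 1) | ((next >> bit) & 1u);
        }
    }
    return hash;
}

/* Fill the table round-robin over the first nqueues queues, in the spirit
 * of `ethtool -X eth0 equal N`. */
static void rss_indir_equal(uint8_t table[RSS_INDIR_LEN], unsigned nqueues)
{
    assert(nqueues > 0 && nqueues <= 256);
    for (unsigned i = 0; i < RSS_INDIR_LEN; i++)
        table[i] = (uint8_t)(i % nqueues);
}

/* Pick the RX queue for an IPv4 flow. The tuple is src IP, dst IP,
 * src port, dst port, concatenated in network byte order (12 bytes). */
static unsigned rss_queue_for_flow(const uint8_t key[RSS_KEY_LEN],
                                   const uint8_t table[RSS_INDIR_LEN],
                                   const uint8_t tuple[12])
{
    uint32_t hash = toeplitz_hash(key, tuple, 12);

    return table[hash % RSS_INDIR_LEN];  /* low-order bits index the table */
}
```

The same tuple always produces the same queue, which is why a single elephant flow cannot be spread across cores by RSS alone.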
Packet Transmit Path
Application write() / sendmsg()
|
v
Socket layer (sock_sendmsg)
|
v
TCP / UDP layer
| TCP: segment, add headers, manage cwnd
v
ip_queue_xmit() / ip_output()
|
+---> NF_INET_LOCAL_OUT (Netfilter OUTPUT chain)
|
v
ip_finish_output()
|
+---> Fragmentation if needed
|
v
dev_queue_xmit()
|
v
Traffic Control (qdisc) — pfifo_fast, fq, fq_codel, etc.
|
| if the driver queue is busy, NET_TX_SOFTIRQ is raised to retry later
v
dev_hard_start_xmit()
|
v
Driver ndo_start_xmit()
|
v
NIC DMA: descriptor posted to the TX ring; the NIC reads the packet from host memory
|
v
Wire
The qdisc layer is where QoS, traffic shaping, and fair queuing operate. Even the "default" pfifo_fast qdisc implements three priority bands. The kernel's built-in default remains pfifo_fast, but many modern distributions set net.core.default_qdisc to fq_codel (or fq) for egress to combat bufferbloat.
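The three-band behaviour of pfifo_fast can be sketched in a few lines of userspace C. The priority-to-band map below matches the one pfifo_fast uses; everything else (struct pfifo_sim, the tiny fixed-size bands, packet ids as plain integers) is a hypothetical simplification for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define PFIFO_BANDS 3
#define BAND_CAP    16  /* tiny per-band queue, for illustration only */

/* Priority-to-band map (index = packet priority & 0xf). Band 0 is served first. */
static const unsigned char prio2band[16] = {
    1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

struct band {
    int pkts[BAND_CAP];
    size_t head, count;
};

struct pfifo_sim {
    struct band bands[PFIFO_BANDS];
};

/* Enqueue a packet id; returns 0 on success, -1 if the band is full (drop). */
static int pfifo_enqueue(struct pfifo_sim *q, int pkt_id, unsigned priority)
{
    struct band *b = &q->bands[prio2band[priority & 0xf]];

    if (b->count == BAND_CAP)
        return -1;
    b->pkts[(b->head + b->count) % BAND_CAP] = pkt_id;
    b->count++;
    return 0;
}

/* Dequeue from the lowest-numbered non-empty band; returns -1 if all are empty. */
static int pfifo_dequeue(struct pfifo_sim *q)
{
    for (int i = 0; i < PFIFO_BANDS; i++) {
        struct band *b = &q->bands[i];

        if (b->count > 0) {
            int pkt = b->pkts[b->head];

            b->head = (b->head + 1) % BAND_CAP;
            b->count--;
            return pkt;
        }
    }
    return -1;
}
```

Strict band ordering means a steady stream of band 0 traffic can starve bands 1 and 2, which is one reason fair-queuing qdiscs are preferred on busy hosts.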
Softirq Processing
Softirqs are a deferred interrupt mechanism. Two softirq vectors handle networking:
| Vector | Purpose |
|---|---|
| NET_RX_SOFTIRQ (3) | Process received packets from NAPI poll |
| NET_TX_SOFTIRQ (2) | Transmit packets queued by upper layers |
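Softirqs run inline on return from a hard IRQ or, under sustained load, in the per-CPU ksoftirqd/<cpu> kernel threads. Each invocation of the receive softirq is bounded by time (netdev_budget_usecs, default about 2 ms; the exact value depends on the kernel's HZ setting) and by packet count (netdev_budget, default 300).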
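The interaction between the global budget and the per-poll weight is easier to see in a toy model. The sketch below is a deliberately simplified userspace imitation of the receive softirq loop: it models only the packet budget (not the time limit), polls queues round-robin, and uses hypothetical names (napi_sim, net_rx_action_sim) throughout.

```c
#include <assert.h>
#include <stddef.h>

struct napi_sim {
    unsigned pending;       /* packets waiting in this queue's RX ring */
};

struct rx_result {
    unsigned processed;     /* packets handled in this softirq run */
    unsigned time_squeeze;  /* 1 if the run stopped because the budget ran out */
};

/* One simulated softirq run: each poll handles at most `weight` packets,
 * and the run ends once `budget` packets have been charged in total. */
static struct rx_result net_rx_action_sim(struct napi_sim *napis, size_t n,
                                          int budget, unsigned weight)
{
    struct rx_result r = { 0, 0 };
    size_t idle_streak = 0;  /* consecutive queues found empty */
    size_t i = 0;

    while (n > 0 && idle_streak < n) {
        struct napi_sim *napi = &napis[i];

        i = (i + 1) % n;
        if (napi->pending == 0) {
            idle_streak++;
            continue;
        }
        idle_streak = 0;

        unsigned work = napi->pending < weight ? napi->pending : weight;

        napi->pending -= work;
        r.processed += work;
        budget -= (int)work;

        if (budget <= 0) {   /* budget exhausted: stop and count a squeeze */
            r.time_squeeze = 1;
            break;
        }
    }
    return r;
}
```

With a budget of 300 and a weight of 64, a single busy queue is polled five times before the run is cut short, so slightly more than 300 packets are processed; the remainder waits for the next softirq run.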
Monitor softirq load:
watch -n1 'cat /proc/net/softnet_stat'
# One row per CPU; first three columns (hexadecimal): packets processed, dropped, time_squeeze (budget or time limit exhausted)
# Per-CPU softirq counts
cat /proc/softirqs | grep -E 'NET_RX|NET_TX'
If time_squeeze is increasing rapidly, the NAPI budget is too small or CPU is overloaded:
sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_budget_usecs=4000
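Because /proc/net/softnet_stat prints its counters in hexadecimal, it is easy to misread them by eye. A minimal parsing sketch follows; struct softnet_row and parse_softnet_line are hypothetical names, and only the first three columns are decoded.

```c
#include <assert.h>
#include <stdio.h>

struct softnet_row {
    unsigned int processed;     /* column 1: packets processed */
    unsigned int dropped;       /* column 2: dropped, backlog queue full */
    unsigned int time_squeeze;  /* column 3: softirq runs cut short */
};

/* Parse the first three hexadecimal columns of one softnet_stat line.
 * Returns 0 on success, -1 if the line does not match. */
static int parse_softnet_line(const char *line, struct softnet_row *row)
{
    if (sscanf(line, "%x %x %x",
               &row->processed, &row->dropped, &row->time_squeeze) != 3)
        return -1;
    return 0;
}
```

A monitoring tool would read the file line by line with fgets(), call this function on each line, and compare time_squeeze between two samples to decide whether the budget sysctls need raising.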
Historical Context
The original Linux network stack (pre-2.2) used a simple interrupt-per-packet model that worked for 10 Mbps Ethernet. As 100 Mbps and then Gigabit NICs arrived, interrupt rates hit hundreds of thousands per second, causing receive livelock where the CPU spent all time handling interrupts with no time left for application processing.
Alexey Kuznetsov, Jamal Hadi Salim, and Robert Olsson designed NAPI (landed in 2.5.x, ~2002) to solve this. The key insight: under high load, polling is more efficient than interrupting.
The sk_buff structure has grown from ~100 bytes in early kernels to over 200 bytes today as more features (GSO, GRO, XDP, timestamps, security labels) were added. This growth has memory implications: millions of concurrent connections can mean millions of sk_buffs in flight.
Production Examples
nginx high-throughput tuning — A 10G nginx server receiving HTTP traffic needs RSS queues matched to CPU count, NAPI budget tuned, and interrupt affinity pinned:
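# Pin NIC IRQs to CPUs 0-7 (assumes an 8-queue NIC whose IRQ numbers are
# contiguous from IRQ_BASE; look them up in /proc/interrupts and set IRQ_BASE first)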
for i in $(seq 0 7); do
echo $i > /proc/irq/$((IRQ_BASE + i))/smp_affinity_list
done
# Set RPS for single-queue NICs (software RSS)
echo "ff" > /sys/class/net/eth0/queues/rx-0/rps_cpus
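The value written to rps_cpus is a hexadecimal CPU bitmask, so "ff" selects CPUs 0 through 7. The following sketch shows how such a mask can be derived from a CPU range; the function names are hypothetical, and it deliberately handles only CPUs 0 to 31 (wider masks are written as comma-separated 32-bit groups, which this example does not cover).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Build a bitmask with one bit set per CPU in [first, last], CPUs 0..31 only. */
static uint32_t cpu_range_mask(unsigned first, unsigned last)
{
    uint32_t mask = 0;

    assert(first <= last && last < 32);
    for (unsigned cpu = first; cpu <= last; cpu++)
        mask |= (uint32_t)1 << cpu;
    return mask;
}

/* Format the mask as lowercase hex, the form written to rps_cpus.
 * Returns the number of characters snprintf would have written. */
static int format_rps_mask(uint32_t mask, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%x", (unsigned int)mask);
}
```

For example, restricting RPS to CPUs 4 through 7 (leaving 0 through 3 for IRQ handling) yields the mask f0.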
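Packet capture overhead: tcpdump attaches a ptype_all handler via AF_PACKET. Every received packet is handed to this handler, and the BPF filter runs inside it, so even packets the filter rejects pay for the handler call and the filter run; packets that match are additionally queued to the capture socket. On a 10 Gbps link this adds measurable overhead to the main data path. Use -B 65536 (a larger capture buffer; the value is in KiB) and a restrictive BPF filter to minimize the impact.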
Debugging Notes
# Global packet drop counters
ip -s link show eth0
netstat -s | grep -i drop
# Detailed NIC statistics (hardware-level drops)
ethtool -S eth0 | grep -i drop
# Socket buffer overruns (UDP drops due to full recv buffer)
ss -u -a -n -m    # per-socket drop count is the d<N> field inside skmem:(...)
netstat -su | grep -i 'buffer errors'
# RX ring buffer size (increase if drops at high pps)
ethtool -g eth0
ethtool -G eth0 rx 4096
# Kernel Rx drop tracking (one row per CPU; values are hexadecimal)
cat /proc/net/softnet_stat | awk '{print "cpu"NR-1": processed="$1" dropped="$2" time_squeeze="$3}'
bpftrace one-liners:
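# Count sk_buff allocation failures (NULL return from __alloc_skb) per process
bpftrace -e 'kretprobe:__alloc_skb /retval == 0/ { @[comm] = count(); }'
# Approximate receive latency from netif_receive_skb() to TCP input, keyed by skb pointer
# (sketch: with GRO or some drivers packets enter through other functions, so check which entry point your kernel uses)
bpftrace -e 'kprobe:netif_receive_skb { @start[arg0] = nsecs; }
kprobe:tcp_v4_rcv /@start[arg0]/ { @lat = hist(nsecs - @start[arg0]); delete(@start[arg0]); }'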
Security Implications
- IP spoofing: Without rp_filter (reverse path filtering), a host will accept packets with source IPs that are not reachable via the interface they arrived on. Enable strict mode with: sysctl -w net.ipv4.conf.all.rp_filter=1
- Packet sniffing: Any process with CAP_NET_RAW (or root) can open an AF_PACKET socket and receive all packets on an interface. Network namespaces provide isolation.
- sk_buff information leak: Historically, uninitialized cb[] bytes have leaked kernel stack data to userspace via AF_PACKET. Separately, CVE-2016-8655 exploited a race condition in AF_PACKET ring setup (a use-after-free that allowed local privilege escalation).
- NAPI budget starvation: A hostile VM or container generating floods can exhaust NAPI budget on shared host NICs; use hardware rate limiting or SR-IOV VF rate caps.
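The strict reverse-path check in the first bullet can be illustrated with a toy routing table. This is a conceptual sketch only: the kernel performs the check against its real FIB, whereas struct route, route_lookup, and rp_filter_strict_ok here are hypothetical names operating on a linear array with host-byte-order addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct route {
    uint32_t prefix;      /* network address, host byte order */
    unsigned prefix_len;  /* 0..32 */
    int ifindex;          /* interface the route points out of */
};

/* Longest-prefix match over a small linear table; NULL if nothing matches. */
static const struct route *route_lookup(const struct route *table, size_t n,
                                        uint32_t addr)
{
    const struct route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        uint32_t mask = table[i].prefix_len
                        ? ~(uint32_t)0 << (32 - table[i].prefix_len)
                        : 0;

        if ((addr & mask) == (table[i].prefix & mask) &&
            (!best || table[i].prefix_len > best->prefix_len))
            best = &table[i];
    }
    return best;
}

/* Strict reverse-path check: accept only if the best route back to the
 * packet's source address leaves through the interface it arrived on. */
static int rp_filter_strict_ok(const struct route *table, size_t n,
                               uint32_t src_addr, int in_ifindex)
{
    const struct route *r = route_lookup(table, n, src_addr);

    return r != NULL && r->ifindex == in_ifindex;
}
```

A packet claiming a source inside 10.1.0.0/16 but arriving on the interface that only serves the default route fails the check and would be dropped.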
Performance Implications
| Optimization | Mechanism | Typical Gain |
|---|---|---|
| RSS | Hardware hash → multiple queues | Linear with core count |
| NAPI | Poll instead of interrupt | Eliminates interrupt livelock |
| GRO | Coalesce segments before stack | 30–50% CPU reduction |
| RPS/RFS | Software RSS + flow steering | Spreads load on single-queue NICs |
| Interrupt coalescing | ethtool -C eth0 rx-usecs 50 | Reduces IRQ rate ~10x |
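The effect of interrupt coalescing can be estimated with simple arithmetic. The sketch below compares the worst-case packet rate of a link with the interrupt ceiling implied by an rx-usecs setting. It assumes the coalescing timer is the only interrupt trigger for a queue (frame-count thresholds and adaptive moderation are ignored), and the function names are hypothetical.

```c
#include <assert.h>

/* Ethernet adds 20 bytes per frame on the wire: 7-byte preamble,
 * 1-byte start delimiter, and 12-byte inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20

/* Maximum packets per second for a link speed and frame size (FCS included). */
static double line_rate_pps(double link_bps, unsigned frame_bytes)
{
    assert(frame_bytes > 0);
    return link_bps / ((frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8.0);
}

/* Upper bound on interrupts per second for one queue when the NIC waits
 * rx_usecs microseconds after a packet before raising an interrupt. */
static double coalesced_irq_ceiling(unsigned rx_usecs)
{
    assert(rx_usecs > 0);
    return 1e6 / rx_usecs;
}
```

At 10 Gbps with minimum-size 64-byte frames the link can deliver roughly 14.88 million packets per second, while rx-usecs 50 caps a queue at 20,000 interrupts per second; the actual reduction on a given workload depends on how bursty the traffic is.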
The fastest path through the Linux network stack (loopback, AF_UNIX) incurs ~1–3 µs. A packet crossing a 10G NIC typically sees 10–30 µs end-to-end kernel overhead under light load, dominated by memory allocation and cache misses.
Failure Modes and Real Incidents
Incident: Receive livelock under UDP flood (2014, major CDN)
A UDP amplification DDoS saturated all CPU time in NET_RX_SOFTIRQ — no application packets processed. Fix: NAPI budget reduction + early BPF drop at driver level (predates XDP).
Incident: sk_buff slab exhaustion
A misconfigured application opened millions of sockets without closing them. Each socket with unread data pins sk_buffs in its receive queue, up to its receive buffer limit. The kernel OOM killer then killed unrelated processes. Fix: net.core.rmem_default reduction + per-process socket limit via ulimit -n.
Failure Mode: CPU affinity mismatch
RSS queue IRQs on CPU 0, application on CPU 7. Every received packet is processed on one core and consumed on another, crossing NUMA nodes if the two CPUs sit on different sockets. Symptom: high sys CPU on CPU 0, high latency. Diagnose with perf top -C 0 showing __copy_skb_header.
Modern Usage
- XDP (eXpress Data Path) hooks before skb allocation for wire-speed processing; see 04-ebpf-networking.md
- io_uring provides an asynchronous interface to the socket layer, reducing syscall overhead for high-rate applications
- AF_XDP with the XDP_ZEROCOPY flag lets the NIC DMA packets directly into memory registered by userspace, bypassing the kernel stack entirely at the cost of application complexity
- Multi-queue NICs with 32–128 queues are now common; tools like irqbalance auto-manage affinity, but manual tuning often outperforms it for latency-sensitive workloads
Future Directions
- Kernel TLS (kTLS) pushes encryption into the network stack so sendfile() can work with TLS streams, eliminating an extra copy
- BPF TCP header option parser (Linux 5.10) allows eBPF programs to parse and inject custom TCP options, enabling application-aware congestion control from userspace
- Hardware-accelerated GRO (Intel E800 series): the NIC itself coalesces segments before DMA, reducing software GRO overhead
- Memory-mapped packet sockets (PACKET_MMAP, AF_XDP) are converging toward a unified zero-copy API
Exercises
1. Use bpftrace to instrument netif_receive_skb and measure the distribution of packet sizes received on your primary NIC over 10 seconds. Explain what the bimodal distribution (if present) means.
2. Enable a second RSS queue on a test NIC (ethtool -L eth0 combined 2), then generate TCP traffic with iperf3 -P 4. Use cat /proc/net/softnet_stat before and after to observe how the load distribution across CPUs changes.
3. Write a minimal character device driver stub that allocates an sk_buff, populates a UDP header using skb_push(), and calls dev_queue_xmit(). Explain each step.
4. Reproduce time_squeeze by setting net.core.netdev_budget=1 and running iperf3. Observe the packet drop rate in ip -s link. Then restore the default and explain the trade-off between latency and throughput in budget selection.
5. Using perf record -g -e net:netif_receive_skb -- iperf3 -s, generate a flame graph and identify the three most common call paths that reach the tracepoint during packet receive. Explain what each function on those paths does.
References
- include/linux/skbuff.h: sk_buff definition
- net/core/dev.c: netif_receive_skb, dev_queue_xmit, NAPI core
- net/ipv4/ip_input.c: ip_rcv, ip_local_deliver
- net/ipv4/tcp_ipv4.c: tcp_v4_rcv
- net/ipv4/tcp_input.c: tcp_rcv_established
- Corbet, J. et al. Linux Device Drivers, 3rd Ed. Chapter 17: Network Drivers
- Benvenuti, C. Understanding Linux Network Internals. O'Reilly, 2006.
- Salim, J.H. Beyond Softnet. Ottawa Linux Symposium 2001. (NAPI design rationale)
- Documentation/networking/scaling.rst: RSS, RPS, RFS, XPS kernel docs
- man 7 packet: AF_PACKET socket interface