Section 15: Networking
Purpose and Scope
The Linux network stack is one of the most performance-critical and architecturally complex subsystems in any operating system. This section covers the entire journey of a packet: from the NIC ring buffer through the kernel's receive path to a userspace socket, and symmetrically the transmit path. It examines the socket layer, protocol family abstractions, network namespaces, virtual networking (veth pairs, bridges, tun/tap), packet filtering and manipulation (iptables, nftables, tc/qdisc, netfilter), and the modern kernel-bypass and acceleration techniques (XDP, eBPF networking, DPDK, SR-IOV, RDMA, zero-copy).
The scope bridges kernel internals and production systems engineering: understanding why a container network exhibits 50 µs baseline latency, or why a misconfigured qdisc causes packet loss, requires literacy at every layer.
Prerequisites
- Section 02 (CPU Architecture): DMA, cache, NUMA, PCIe for NICs
- Section 03 (OS Fundamentals): sockets, file descriptors, process model
- Section 14 (Device Drivers): NIC driver (NAPI), interrupt handling, DMA
- Section 11 (Memory Management): sk_buff page pinning, zero-copy mechanics
- Familiarity with IPv4/IPv6 addressing and basic TCP/UDP
Learning Objectives
Upon completing this section you will be able to:
- Trace a received packet from NIC DMA into sk_buff through the netif_receive_skb() path to a socket receive queue.
- Explain why NAPI exists and how it balances interrupt rate against latency.
- Describe the netfilter hook points and how iptables/nftables hook into them.
- Explain how network namespaces isolate networking state and how veth pairs interconnect them.
- Describe XDP: where it runs, what it can do (drop, pass, redirect, TX), and its performance envelope.
- Explain DPDK's kernel-bypass model, PMD polling, and huge page requirements.
- Describe SR-IOV: how a physical function creates virtual functions and how they map to guest VMs or containers.
- Explain RDMA semantics (send/recv, read, write, atomic) and the verbs API.
- Tune a Linux TCP socket: send/receive buffer sizes, TCP_NODELAY, SO_REUSEPORT, GRO/GSO/TSO.
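
As a taste of the last objective, here is a minimal sketch of per-socket tuning using Python's standard socket module. The helper name tune_tcp_socket and the 1 MiB buffer size are illustrative assumptions, not recommendations. GRO/GSO/TSO are per-device offloads (normally toggled with ethtool), so they do not appear as socket options here.

```python
import socket

def tune_tcp_socket(sock, bufsize=1 << 20):
    """Apply a few common tuning options and return the effective values.

    `bufsize` is an illustrative request; the kernel clamps it to
    net.core.rmem_max / net.core.wmem_max and may report a different value.
    """
    # Let several sockets bind the same port so the kernel can spread
    # incoming connections across them (guarded: not every platform has it).
    if hasattr(socket, "SO_REUSEPORT"):
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    # Request larger receive and send buffers.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
    # Disable Nagle's algorithm so small writes are sent without delay.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return {
        "nodelay": sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY),
        "rcvbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF),
        "sndbuf": sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF),
    }

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(tune_tcp_socket(s))
    s.close()
```

Reading the values back with getsockopt is deliberate: the buffer sizes the kernel actually grants often differ from the ones requested.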
Architecture Overview
Userspace
┌──────────────────────────────────────────────────────────────────┐
│ Application: send() / recv() / sendmsg() / recvmsg() │
└──────────────────────────────┬───────────────────────────────────┘
│ syscall
┌──────────────────────────────▼───────────────────────────────────┐
│ Socket Layer (SOCK_STREAM / SOCK_DGRAM) │
│ sock → sk_buff receive queue → socket buffer (rmem) │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Protocol Layer (TCP / UDP / ICMP / raw) │
│ tcp_rcv_established() → skb_queue → socket │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Network Layer (IPv4 / IPv6) │
│ ip_rcv() → routing → netfilter FORWARD/INPUT hook │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Netfilter / iptables / nftables / tc │
│ PRE_ROUTING │ INPUT │ FORWARD │ OUTPUT │ POST_ROUTING (5 hooks) │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ Link Layer / Traffic Control (qdisc) │
│ HTB / fq_codel / CAKE / pfifo_fast queuing disciplines │
└──────────────────────────────┬───────────────────────────────────┘
│
┌──────────────────────────────▼───────────────────────────────────┐
│ NIC Driver (NAPI) / XDP │
│ XDP_DROP │ XDP_PASS │ XDP_TX │ XDP_REDIRECT (before skb alloc) │
└──────────────────────────────┬───────────────────────────────────┘
│ DMA ring buffer
┌──────────────────────────────▼───────────────────────────────────┐
│ NIC Hardware │
│ RSS (multi-queue) ─ TX/RX descriptor rings ─ PCIe DMA │
└──────────────────────────────────────────────────────────────────┘
Virtual Networking:
Container NS ──veth──► Linux Bridge ──► Physical NIC (or VXLAN/OVS)
VM NIC ──────tap──────► Bridge / OVS ──► SR-IOV VF ──► PCIe switch
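
Which netfilter hooks a packet traverses depends on where it originates and where it is headed. The toy model below sketches the three common cases. The NF_INET_* names are the kernel's hook identifiers; the function hooks_traversed is hypothetical, and loopback traffic (which passes through both the output and the input side) is ignored for simplicity.

```python
from enum import Enum

class Hook(Enum):
    PRE_ROUTING = "NF_INET_PRE_ROUTING"
    LOCAL_IN = "NF_INET_LOCAL_IN"        # iptables INPUT chain
    FORWARD = "NF_INET_FORWARD"
    LOCAL_OUT = "NF_INET_LOCAL_OUT"      # iptables OUTPUT chain
    POST_ROUTING = "NF_INET_POST_ROUTING"

def hooks_traversed(origin, destination):
    """Return the hooks a packet visits, in order.

    `origin` and `destination` are each "local" or "remote".
    Simplification: a local-to-local (loopback) packet is treated
    like any other locally generated packet.
    """
    if origin == "local":
        # Locally generated traffic leaves through the output side.
        return [Hook.LOCAL_OUT, Hook.POST_ROUTING]
    if destination == "local":
        # Received traffic addressed to this host.
        return [Hook.PRE_ROUTING, Hook.LOCAL_IN]
    # Received traffic that the host routes onward.
    return [Hook.PRE_ROUTING, Hook.FORWARD, Hook.POST_ROUTING]

if __name__ == "__main__":
    for case in [("remote", "local"), ("remote", "remote"), ("local", "remote")]:
        print(case, [h.name for h in hooks_traversed(*case)])
```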
Key Concepts
- sk_buff (socket buffer): The kernel's central packet representation; a struct sk_buff wraps packet data with metadata (headers, timestamps, marks, conntrack state).
- NAPI (New API): A receive abstraction in which the first RX interrupt disables further RX interrupts and schedules a poll; the driver then processes up to a budget of packets per poll and re-enables interrupts only once the queue drains, capping interrupt rate and coalescing work under load.
- netfilter: In-kernel packet filtering framework; hooks at 5 points in the packet path; nf_tables/iptables/conntrack build on it.
- Network Namespace: Complete isolation of network stack state (interfaces, routes, iptables, sockets); the foundation of container networking.
- veth pair: Virtual Ethernet pair; a packet transmitted on one end is received on the other; connects namespaces to each other or to a bridge.
- Linux Bridge: Layer-2 software switch; learns MAC addresses, forwards frames; used heavily by Docker and KVM.
- tun/tap: Virtual network devices; tun operates at L3 (IP packets), tap at L2 (Ethernet frames); used by VPNs and hypervisors.
- iptables/nftables: Packet filter/NAT/mangling tools; iptables evaluates the rules of each table/chain linearly; nftables adds native sets and maps for more efficient lookups.
- tc/qdisc: Traffic control and queueing discipline; shapes, polices, and classifies outbound traffic; HTB for hierarchical bandwidth allocation.
- XDP (eXpress Data Path): eBPF program attached at the earliest RX point (in driver, before skb allocation); can drop/pass/redirect/transmit with sub-100 ns latency.
- eBPF networking: BPF programs attached to tc ingress/egress, socket, cgroup, and XDP hooks; enables programmable packet processing without kernel modification.
- DPDK (Data Plane Development Kit): User-space poll-mode driver framework; bypasses kernel entirely; requires dedicated cores and huge pages; achieves sub-1 µs latency.
- SR-IOV (Single Root I/O Virtualization): PCIe capability to present one physical NIC as multiple virtual functions (VFs); each VF is a lightweight PCIe device with its own DMA and interrupt resources.
- RDMA (Remote Direct Memory Access): Zero-copy, kernel-bypass network I/O; the NIC moves data directly between registered application buffers, so the CPU stays out of the data path; roughly 1–2 µs latency over InfiniBand or RoCE.
- GRO/GSO/TSO: Generic Receive Offload (coalesce received segments), Generic Segmentation Offload (defer segmentation), TCP Segmentation Offload (NIC does segmentation); reduce CPU overhead per byte.
- SO_REUSEPORT: Allows multiple sockets to bind the same port; kernel load-balances incoming connections; used by multi-process servers (NGINX, HAProxy).
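
The NAPI entry above is easier to see with numbers. The following toy model is only a sketch under simplifying assumptions: one RX queue, an unbounded ring, and one poll per time step. It counts how many interrupts fire under light and heavy load. The function simulate_napi and its parameters are hypothetical and do not correspond to any kernel API.

```python
def simulate_napi(arrivals, budget=64):
    """Toy NAPI model.

    `arrivals[i]` is the number of packets landing in the RX ring at step i.
    Returns (interrupts, polls, packets_processed).
    """
    ring = interrupts = polls = processed = 0
    polling = False  # True while RX interrupts are masked and a poll is scheduled
    step = 0
    while step < len(arrivals) or polling:
        if step < len(arrivals):
            ring += arrivals[step]
        step += 1
        if ring and not polling:
            # First packet after an idle period: take one interrupt,
            # mask further RX interrupts, and schedule polling.
            interrupts += 1
            polling = True
        if polling:
            polls += 1
            done = min(ring, budget)  # process at most `budget` packets
            ring -= done
            processed += done
            if done < budget:
                # Ring drained before the budget ran out:
                # leave polling mode and unmask interrupts.
                polling = False
    return interrupts, polls, processed

if __name__ == "__main__":
    print("light load:", simulate_napi([1, 0, 0, 1]))
    print("heavy load:", simulate_napi([100] * 10))
```

Under light load every packet costs one interrupt, which keeps latency low. Under sustained load a single interrupt covers the whole burst because the driver stays in polling mode until the ring drains.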
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1983 | BSD socket API introduced (4.2BSD) |
| 1992 | Linux 0.98: first TCP/IP networking code |
| 1995 | Linux 1.2: first serious networking stack; early IP firewalling (iptables precursor) |
| 1998 | Netfilter/iptables designed (Harald Welte, Rusty Russell) |
| 2001 | Linux 2.4 released with netfilter; iptables replaces ipchains |
| 2001 | TCP segmentation offload (TSO) in NIC hardware |
| 2002 | NAPI merged into Linux 2.4 |
| 2004 | iproute2 and tc mature as Linux traffic control tools |
| 2008 | Linux network namespaces (initial support in 2.6.24) |
| 2008 | GRO (Generic Receive Offload) merged |
| 2011 | Open vSwitch (OVS) reaches production quality |
| 2014 | nftables merged into Linux 3.13 |
| 2016 | XDP merged into Linux 4.8 |
| 2017 | AF_XDP (zero-copy userspace socket) development begins |
| 2018 | AF_XDP merged into Linux 4.18 |
| 2019 | eBPF tc programs for programmable packet processing mature |
| 2020 | Cilium (eBPF-based Kubernetes CNI) reaches production at scale |
| 2020 | MPTCP (Multipath TCP) mainlined in Linux 5.6 |
| 2023 | io_uring network (IORING_OP_RECV/SEND) reaches production quality |
Modern Relevance and Production Use Cases
Kubernetes CNI plugins (Calico, Cilium, Flannel) implement pod networking by combining veth pairs, Linux bridge or direct routing, iptables/eBPF, and optional VXLAN overlay; a packet from pod A to pod B on a remote node traverses 6–10 netfilter hooks with iptables, or 1–2 eBPF tail calls with Cilium.
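
The veth-plus-namespace building block that CNI plugins rely on can be reproduced by hand with iproute2. The sketch below only generates the command lines (a dry run) and does not execute them, since running them requires root. The namespace name, interface names, and addresses are arbitrary placeholders, and veth_netns_commands is a hypothetical helper.

```python
import shlex

def veth_netns_commands(ns="demo-ns", host_if="veth-host", ns_if="veth-ns",
                        host_ip="10.200.0.1/24", ns_ip="10.200.0.2/24"):
    """Return iproute2 commands that wire a new namespace to the host via veth."""
    in_ns = ["ip", "netns", "exec", ns]  # prefix: run a command inside the namespace
    return [
        ["ip", "netns", "add", ns],
        # Create both ends of the pair in the host namespace.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        # Move one end into the new namespace.
        ["ip", "link", "set", ns_if, "netns", ns],
        ["ip", "addr", "add", host_ip, "dev", host_if],
        ["ip", "link", "set", host_if, "up"],
        in_ns + ["ip", "addr", "add", ns_ip, "dev", ns_if],
        in_ns + ["ip", "link", "set", ns_if, "up"],
        in_ns + ["ip", "link", "set", "lo", "up"],
    ]

if __name__ == "__main__":
    for cmd in veth_netns_commands():
        print(shlex.join(cmd))
```

To apply the commands for real, each list could be passed to subprocess.run(cmd, check=True) as root; a CNI plugin would additionally attach the host end to a bridge or install routes.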
Cloud load balancers (AWS NLB, GCP Cloud Load Balancing) are typically built on XDP, DPDK, or comparable kernel-bypass data planes; Google Maglev reportedly processes millions of flows per second per core using consistent hashing implemented in a data plane comparable to XDP programs.
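
To make the consistent-hashing idea concrete, here is a minimal sketch of a Maglev-style lookup table, following the table-population scheme described in the published Maglev paper. It is not Google's implementation: the helper names (_h, build_maglev_table, pick_backend) are hypothetical, and the use of SHA-256 with string salts is an assumption made for determinism.

```python
import hashlib

def _h(name, salt):
    """Deterministic 64-bit hash (assumption: SHA-256 with a string salt)."""
    digest = hashlib.sha256((salt + name).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_maglev_table(backends, size=65537):
    """Fill a lookup table of `size` slots; `size` must be prime."""
    if not backends:
        raise ValueError("need at least one backend")
    # Each backend gets its own permutation of the slots:
    # slot_j = (offset + j * skip) mod size, which visits every slot
    # exactly once because size is prime and 1 <= skip < size.
    offsets = [_h(b, "offset") % size for b in backends]
    skips = [_h(b, "skip") % (size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * size
    filled = 0
    while True:
        # Backends take turns claiming their next preferred free slot,
        # so slot counts differ by at most one.
        for i, backend in enumerate(backends):
            while True:
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
                next_idx[i] += 1
                if table[slot] is None:
                    break
            table[slot] = backend
            filled += 1
            if filled == size:
                return table

def pick_backend(table, flow_key):
    """Map a flow identifier (e.g. a 5-tuple string) to a backend."""
    return table[_h(flow_key, "flow") % len(table)]

if __name__ == "__main__":
    table = build_maglev_table(["be-1", "be-2", "be-3"], size=251)
    print(pick_backend(table, "10.0.0.7:51234->192.0.2.10:443/tcp"))
```

Because each backend keeps the same slot preferences when the backend set changes, most slots keep their assignment after a backend is added or removed, which is what keeps existing flows on the same backend.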
High-frequency trading infrastructure uses NICs driven by DPDK-style poll-mode drivers with kernel bypass; median round-trip times of about 600 ns end-to-end are achievable between co-located machines.
Service mesh (Envoy, Istio) injects sidecar proxies that intercept traffic via iptables REDIRECT or eBPF sockops; understanding the netfilter hook costs explains the ~100 µs overhead of iptables-based interception.
5G UPF (User Plane Function) implementations use DPDK or AF_XDP to process GTP-U tunneled traffic at 100 Gbps line rate; the standard Linux kernel path generally cannot sustain the required packets per second (PPS) at these speeds.
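
The PPS pressure at 100 Gbps follows from simple arithmetic. The sketch below computes the worst case for minimum-size Ethernet frames; real GTP-U traffic carries larger packets, so treat the result as an upper bound on packet rate. The function name line_rate_pps is hypothetical.

```python
def line_rate_pps(link_bps, frame_bytes=64):
    """Packets per second at line rate for a given Ethernet frame size.

    On the wire each frame also occupies 8 bytes of preamble/SFD and a
    12-byte inter-frame gap, hence the extra 20 bytes.
    """
    wire_bits = (frame_bytes + 20) * 8
    return link_bps / wire_bits

if __name__ == "__main__":
    pps = line_rate_pps(100e9)   # 100 Gbps, 64-byte frames
    budget_ns = 1e9 / pps        # time available per packet
    print(f"{pps / 1e6:.1f} Mpps, {budget_ns:.2f} ns per packet")
    # prints "148.8 Mpps, 6.72 ns per packet"
```

A budget of a few nanoseconds per packet across the whole machine leaves no room for per-packet sk_buff allocation and multiple netfilter hook traversals, which is why these workloads spread queues across dedicated cores and bypass the regular stack.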
File Map
| File | Description |
|---|---|
| 01-network-stack-overview.md | Packet lifecycle: RX from wire to socket, TX from socket to wire |
| 02-socket-layer.md | sock/socket structures, protocol ops, socket buffer management |
| 03-sk-buff.md | sk_buff layout, headroom/tailroom, clone vs copy, page frags |
| 04-protocol-families.md | AF_INET, AF_INET6, AF_UNIX, AF_PACKET, AF_XDP |
| 05-napi-and-rx-path.md | NAPI polling, GRO, RPS/RFS, RSS multi-queue |
| 06-tx-path-and-gso.md | GSO, TSO, tx queues, qdisc integration |
| 07-network-namespaces.md | Namespace isolation, /proc/net, setns, ip-netns |
| 08-veth-bridge-tuntap.md | veth pair, Linux bridge, tun/tap, macvlan, ipvlan |
| 09-iptables-nftables.md | Netfilter hooks, table/chain model, conntrack, NAT |
| 10-tc-qdisc.md | Traffic shaping, HTB, fq_codel, CAKE, tc mirred |
| 11-netfilter-internals.md | nf_hook_ops, hook registration, conntrack state machine |
| 12-xdp.md | XDP modes (native/generic/offload), BPF maps, AF_XDP |
| 13-ebpf-networking.md | BPF prog types: XDP, tc, sockops, sk_lookup, cgroup sock |
| 14-dpdk.md | PMD poll mode, hugepages, memory pools, multi-core pipelines |
| 15-sr-iov.md | PF/VF model, VF MAC/VLAN assignment, guest passthrough |
| 16-rdma.md | InfiniBand verbs, QP/CQ/MR, RoCE v2, RDMA_CM |
| 17-zero-copy-networking.md | sendfile, splice, MSG_ZEROCOPY, io_uring zero-copy |
| 18-socket-tuning.md | tcp_rmem/wmem, SO_SNDBUF, TCP_NODELAY, SO_REUSEPORT |
| 19-network-performance.md | PPS analysis, CPU affinity, IRQ balancing, perf profiling |
Cross-References
- Section 02 (CPU Architecture): NUMA effects on NIC interrupt affinity, PCIe bandwidth
- Section 11 (Memory Management): sk_buff page management, zero-copy page pinning
- Section 14 (Device Drivers): NIC driver (NAPI), MSI-X interrupt vectors, DMA ring
- Section 16 (TCP/IP Internals): TCP/UDP protocol implementation built on this stack
- Section 17 (Distributed Systems): network partition detection, overlay networks
- Section 19 (Virtualization): virtio-net, SR-IOV passthrough, OVS-DPDK