Section 15: Networking

Purpose and Scope

The Linux network stack is one of the most performance-critical and architecturally complex subsystems of the kernel. This section covers the entire journey of a packet: from the NIC ring buffer through the kernel's receive path to a userspace socket, and back along the transmit path. It examines the socket layer, protocol family abstractions, network namespaces, virtual networking (veth pairs, bridges, tun/tap), packet filtering and manipulation (netfilter, iptables, nftables, tc/qdisc), and modern kernel-bypass and acceleration techniques (XDP, eBPF networking, DPDK, SR-IOV, RDMA, zero-copy).

The scope bridges kernel internals and production systems engineering: understanding why a container network exhibits 50 µs baseline latency, or why a misconfigured qdisc causes packet loss, requires literacy at every layer.


Prerequisites

  • Section 02 (CPU Architecture): DMA, cache, NUMA, PCIe for NICs
  • Section 03 (OS Fundamentals): sockets, file descriptors, process model
  • Section 14 (Device Drivers): NIC driver (NAPI), interrupt handling, DMA
  • Section 11 (Memory Management): sk_buff page pinning, zero-copy mechanics
  • Familiarity with IPv4/IPv6 addressing and basic TCP/UDP

Learning Objectives

Upon completing this section you will be able to:

  1. Trace a received packet from NIC DMA into sk_buff through the netif_receive_skb() path to a socket receive queue.
  2. Explain why NAPI exists and how it balances interrupt rate against latency.
  3. Describe the netfilter hook points and how iptables/nftables hook into them.
  4. Explain how network namespaces isolate networking state and how veth pairs interconnect them.
  5. Describe XDP: where it runs, what it can do (drop, pass, redirect, TX), and its performance envelope.
  6. Explain DPDK's kernel-bypass model, PMD polling, and huge page requirements.
  7. Describe SR-IOV: how a physical function creates virtual functions and how they map to guest VMs or containers.
  8. Explain RDMA semantics (send/recv, read, write, atomic) and the verbs API.
  9. Tune a Linux TCP socket: send/receive buffer sizes, TCP_NODELAY, SO_REUSEPORT, GRO/GSO/TSO.

Architecture Overview

  Userspace
  ┌──────────────────────────────────────────────────────────────────┐
  │  Application: send() / recv() / sendmsg() / recvmsg()            │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │  syscall
  ┌──────────────────────────────▼───────────────────────────────────┐
  │                       Socket Layer (SOCK_STREAM / SOCK_DGRAM)    │
  │  sock → sk_buff receive queue → socket buffer (rmem)             │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │               Protocol Layer  (TCP / UDP / ICMP / raw)           │
  │  tcp_rcv_established() → skb_queue → socket                      │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │               Network Layer  (IPv4 / IPv6)                       │
  │  ip_rcv() → routing → netfilter FORWARD/INPUT hook               │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │               Netfilter / iptables / nftables / tc               │
  │  Hooks: PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING          │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │               Link Layer / Traffic Control (qdisc)               │
  │  HTB / fq_codel / CAKE / pfifo_fast queuing disciplines          │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │
  ┌──────────────────────────────▼───────────────────────────────────┐
  │               NIC Driver (NAPI) / XDP                            │
  │  XDP_DROP │ XDP_PASS │ XDP_TX │ XDP_REDIRECT (before skb alloc)  │
  └──────────────────────────────┬───────────────────────────────────┘
                                 │  DMA ring buffer
  ┌──────────────────────────────▼───────────────────────────────────┐
  │                         NIC Hardware                             │
  │  RSS (multi-queue) ─ TX/RX descriptor rings ─ PCIe DMA           │
  └──────────────────────────────────────────────────────────────────┘

  Virtual Networking:
  Container NS ──veth──► Linux Bridge ──► Physical NIC (or VXLAN/OVS)
  VM NIC ──────tap──────► Bridge / OVS ──► SR-IOV VF ──► PCIe switch
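
The NIC Driver (NAPI) stage in the diagram can be illustrated with a toy model. The sketch below is not kernel code: the class name `NapiQueueModel`, its fields, and the two driver functions are hypothetical, and the default budget of 64 is an assumption chosen for illustration. It only shows the control flow that keeps the interrupt count low under load: the first interrupt masks further RX interrupts, polling drains the ring in budget-sized passes, and interrupts are re-enabled once a pass finds less than a full budget of work.

```python
from collections import deque


class NapiQueueModel:
    """Toy model of one RX queue under NAPI-style polling (illustrative only)."""

    def __init__(self, budget=64):
        self.budget = budget        # max packets handled per poll pass (assumed value)
        self.ring = deque()         # stands in for the RX descriptor ring
        self.irq_enabled = True
        self.poll_scheduled = False
        self.interrupts = 0
        self.polls = 0
        self.delivered = 0

    def packet_arrives(self):
        self.ring.append(object())
        if self.irq_enabled:
            # "Hard IRQ": count it, mask further RX interrupts, schedule polling.
            self.interrupts += 1
            self.irq_enabled = False
            self.poll_scheduled = True

    def poll(self):
        """One poll pass: process at most `budget` packets."""
        if not self.poll_scheduled:
            return 0
        self.polls += 1
        done = 0
        while self.ring and done < self.budget:
            self.ring.popleft()
            done += 1
        self.delivered += done
        if done < self.budget:
            # Less than a full budget of work: stop polling, unmask interrupts.
            self.poll_scheduled = False
            self.irq_enabled = True
        return done


def run_burst(n, budget=64):
    """All n packets arrive before the poll routine gets to run."""
    q = NapiQueueModel(budget)
    for _ in range(n):
        q.packet_arrives()
    while q.poll_scheduled:
        q.poll()
    return q


def run_trickle(n, budget=64):
    """Each packet is fully processed before the next one arrives."""
    q = NapiQueueModel(budget)
    for _ in range(n):
        q.packet_arrives()
        q.poll()
    return q


if __name__ == "__main__":
    burst, trickle = run_burst(1000), run_trickle(1000)
    print("burst:  ", burst.interrupts, "interrupts,", burst.polls, "poll passes")
    print("trickle:", trickle.interrupts, "interrupts,", trickle.polls, "poll passes")
```

In this model a 1000-packet burst costs a single interrupt and 16 poll passes, while the same 1000 packets arriving one at a time cost one interrupt each. That is the trade-off NAPI is designed around: interrupt-driven latency at low load, polling throughput at high load.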

Key Concepts

  • sk_buff (socket buffer): The kernel's central packet representation; a struct sk_buff wraps packet data with metadata (headers, timestamps, marks, conntrack state).
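  • NAPI (New API): A receive mechanism that combines interrupts with polling: the first RX interrupt masks further interrupts and schedules a poll routine, which processes up to a fixed budget of packets per pass and re-enables interrupts only once the queue is drained. This caps the interrupt rate under load while keeping latency low when traffic is light.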
  • netfilter: In-kernel packet filtering framework; hooks at 5 points in the packet path; nf_tables/iptables/conntrack build on it.
  • Network Namespace: Complete isolation of network stack state (interfaces, routes, iptables, sockets); the foundation of container networking.
  • veth pair: Virtual Ethernet pair; packet written to one end appears on the other; connects namespaces or VM to bridge.
  • Linux Bridge: Layer-2 software switch; learns MAC addresses, forwards frames; used heavily by Docker and KVM.
  • tun/tap: Virtual network devices; tun operates at L3 (IP packets), tap at L2 (Ethernet frames); used by VPNs and hypervisors.
  • iptables/nftables: Packet filter/NAT/mangling tools; iptables evaluates the rules of each table/chain sequentially, so cost grows with rule count; nftables adds native sets and maps that can replace long rule chains with a single lookup.
  • tc/qdisc: Traffic control and queueing discipline; shapes, polices, and classifies outbound traffic; HTB for hierarchical bandwidth allocation.
  • XDP (eXpress Data Path): eBPF program attached at the earliest RX point (in the driver, before skb allocation); can drop, pass, redirect, or transmit a packet, with per-packet costs that can stay below 100 ns for simple programs.
  • eBPF networking: BPF programs attached to tc ingress/egress, socket, cgroup, and XDP hooks; enables programmable packet processing without kernel modification.
  • DPDK (Data Plane Development Kit): User-space poll-mode driver framework; bypasses the kernel network stack on the data path; requires dedicated cores and huge pages; can achieve sub-microsecond per-packet latency.
  • SR-IOV (Single Root I/O Virtualization): PCIe capability to present one physical NIC as multiple virtual functions (VFs); each VF is a lightweight PCIe device with its own DMA and interrupt resources.
  • RDMA (Remote Direct Memory Access): Zero-copy, kernel-bypass network I/O; CPU not involved in data transfer; 1–2 µs latency over InfiniBand or RoCE.
  • GRO/GSO/TSO: Generic Receive Offload (coalesce received segments), Generic Segmentation Offload (defer segmentation), TCP Segmentation Offload (NIC does segmentation); reduce CPU overhead per byte.
  • SO_REUSEPORT: Allows multiple sockets to bind the same port; kernel load-balances incoming connections; used by multi-process servers (NGINX, HAProxy).
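
As a small illustration of the socket-level knobs named above (SO_REUSEPORT, TCP_NODELAY, receive buffer size), here is a minimal sketch using Python's standard `socket` module. The helper names `make_listener` and `disable_nagle` and the 64 KiB buffer request are hypothetical choices for this example. It assumes a platform that exposes `socket.SO_REUSEPORT`, such as Linux; the kernel may round or cap the requested buffer size.

```python
import socket


def make_listener(port, host="127.0.0.1", rcvbuf=1 << 16):
    """Listening TCP socket that shares its port with other SO_REUSEPORT sockets."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() on every socket that will share the port.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    # Request a receive buffer size; the kernel treats this as a hint.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
    s.bind((host, port))
    s.listen(128)
    return s


def disable_nagle(sock):
    """Send small writes immediately instead of coalescing them (TCP_NODELAY)."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


if __name__ == "__main__":
    first = make_listener(0)                 # port 0: let the kernel pick a port
    port = first.getsockname()[1]
    second = make_listener(port)             # second socket bound to the same port
    print("both listeners on port", port)
    print("effective SO_RCVBUF:",
          first.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    first.close()
    second.close()
```

In a multi-process server each worker would typically create its own listener this way, so the kernel distributes incoming connections across workers instead of all workers contending on one accept queue.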

Major Historical Milestones

Year Milestone
1983 BSD socket API introduced (4.2BSD)
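1992 Early Linux 0.9x kernels gain a first, minimal TCP/IP implementation (Linux 0.01, released in 1991, had no networking)
1995 Linux 1.2: first serious networking stack; includes an ipfw-derived packet filter, a precursor of iptables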
1998 Netfilter/iptables designed (Harald Welte, Rusty Russell)
2001 Linux 2.4 released with netfilter; iptables replaces ipchains
2001 TCP segmentation offload (TSO) in NIC hardware
2002 NAPI merged into Linux 2.4
2004 iproute2 and tc mature as Linux traffic control tools
2008 Linux network namespaces (initial support in 2.6.24)
2008 GRO (Generic Receive Offload) merged
2011 Open vSwitch (OVS) reaches production quality
2014 nftables merged into Linux 3.13
2016 XDP merged into Linux 4.8
2017 AF_XDP (zero-copy userspace socket) development begins
2018 AF_XDP merged into Linux 4.18
2019 eBPF tc programs for programmable packet processing mature
2020 Cilium (eBPF-based Kubernetes CNI) reaches production at scale
2020 MPTCP (Multipath TCP) mainlined in Linux 5.6; production deployments follow
2023 io_uring network (IORING_OP_RECV/SEND) reaches production quality

Modern Relevance and Production Use Cases

Kubernetes CNI plugins (Calico, Cilium, Flannel) implement pod networking by combining veth pairs, Linux bridge or direct routing, iptables/eBPF, and optional VXLAN overlay; a packet from pod A to pod B on a remote node traverses 6–10 netfilter hooks with iptables, or 1–2 eBPF tail calls with Cilium.
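
To see how hook counts of that order can arise, the following sketch enumerates the netfilter hooks a packet crosses in each network stack on a pod-to-pod path. It is a simplified model, not a trace of a real cluster: it assumes plain routed pod networking with no overlay encapsulation and no bridge netfilter, and the stack names and the `trace` helper are hypothetical. Each network namespace has its own netfilter state, so the pod namespaces count as separate stacks.

```python
# Hooks a packet crosses inside ONE network stack, by the kind of path it takes.
HOOKS = {
    "local_out": ["OUTPUT", "POSTROUTING"],               # locally generated
    "forward": ["PREROUTING", "FORWARD", "POSTROUTING"],  # routed through
    "local_in": ["PREROUTING", "INPUT"],                  # locally delivered
}


def trace(path):
    """path: list of (stack_name, path_kind). Returns [(stack_name, hook), ...]."""
    return [(stack, hook) for stack, kind in path for hook in HOOKS[kind]]


# Assumed topology: pod A -> node A (routes) -> node B (routes) -> pod B.
POD_TO_REMOTE_POD = [
    ("pod A netns", "local_out"),
    ("node A host netns", "forward"),
    ("node B host netns", "forward"),
    ("pod B netns", "local_in"),
]

if __name__ == "__main__":
    hops = trace(POD_TO_REMOTE_POD)
    for stack, hook in hops:
        print(f"{stack:18} {hook}")
    host_only = [h for h in hops if "host" in h[0]]
    print("total hooks:", len(hops), "| in host namespaces only:", len(host_only))
```

Under these assumptions the packet crosses 10 hook points in total, 6 of them in the two host namespaces. Every rule registered on those hooks adds to the per-packet cost, which is why long iptables chains on the node are expensive for pod traffic.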

Cloud load balancers (AWS NLB, GCP Cloud Load Balancing) rely on kernel-bypass or XDP-style fast paths rather than the regular socket path; Google's Maglev, for example, uses a kernel-bypass data path and consistent hashing to spread flows across backends at line rate.
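
The consistent-hashing idea can be sketched with a generic hash ring. This is not Maglev's lookup-table algorithm and not any vendor's implementation; the `HashRing` class, the `flow_key` format, the backend names, and the choice of 100 virtual nodes per backend are all assumptions for illustration. The property it demonstrates is the one load balancers care about: when a backend is removed, only the flows that were mapped to that backend move.

```python
import bisect
import hashlib


def stable_hash(key: str) -> int:
    """64-bit hash that is stable across runs (unlike the built-in hash())."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        # Each backend owns `vnodes` points on the ring to even out the load.
        self._points = sorted(
            (stable_hash(f"{name}#{i}"), name)
            for name in backends
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def pick(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping at the end.
        i = bisect.bisect_right(self._hashes, stable_hash(key))
        return self._points[i % len(self._points)][1]


def flow_key(src_ip, src_port, dst_ip, dst_port, proto="tcp"):
    """Flatten a 5-tuple into a string key (format is arbitrary)."""
    return f"{proto}:{src_ip}:{src_port}->{dst_ip}:{dst_port}"


if __name__ == "__main__":
    backends = ["be0", "be1", "be2", "be3"]
    before = HashRing(backends)
    after = HashRing(backends[:3])           # "be3" has been removed
    flows = [flow_key(f"10.0.0.{i % 250}", 1024 + i, "192.0.2.1", 443)
             for i in range(2000)]
    moved = sum(before.pick(f) != after.pick(f) for f in flows)
    on_be3 = sum(before.pick(f) == "be3" for f in flows)
    print(f"{moved} flows moved, {on_be3} were on the removed backend")
```

The two printed counts are equal: flows on surviving backends keep their assignment, so established connections to those backends are not disturbed by the change.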

High-frequency trading infrastructure uses NICs driven by DPDK poll-mode drivers with kernel bypass; median round-trip times between co-located machines can be as low as about 600 ns end-to-end.

Service mesh (Envoy, Istio) injects sidecar proxies that intercept traffic via iptables REDIRECT or eBPF sockops; understanding the netfilter hook costs explains the ~100 µs overhead of iptables-based interception.

5G UPF (User Plane Function) implementations use DPDK or AF_XDP to process GTP-U tunneled traffic at 100 Gbps line rate; the standard Linux kernel path cannot sustain the required packets per second (PPS) at these speeds.
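
The PPS requirement follows from simple arithmetic on Ethernet framing. The sketch below computes the theoretical maximum packet rate and the resulting per-packet time budget; the function names are hypothetical, and the 20 bytes of per-frame overhead (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap) are the standard Ethernet values. Real UPF traffic uses larger packets than the 64-byte worst case, so treat the small-frame figure as an upper bound on load.

```python
# Per-frame wire overhead on Ethernet that is not part of the frame itself:
# 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def line_rate_pps(link_bps, frame_bytes):
    """Maximum frames per second for a given link speed and frame size."""
    return link_bps / ((frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8)


def per_packet_budget_ns(link_bps, frame_bytes):
    """Time available per packet if one core must keep up with line rate."""
    return 1e9 / line_rate_pps(link_bps, frame_bytes)


if __name__ == "__main__":
    for frame in (64, 512, 1518):
        pps = line_rate_pps(100e9, frame)
        ns = per_packet_budget_ns(100e9, frame)
        print(f"{frame:5d} B frames: {pps / 1e6:8.2f} Mpps, {ns:7.2f} ns per packet")
```

At 100 Gbps with minimum-size 64-byte frames this gives about 148.8 Mpps, or 6.72 ns per packet. A single sk_buff allocation and a walk through the protocol layers cost far more than that, which is why such workloads spread load across many cores and avoid the regular kernel path.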


File Map

File                          Description
01-network-stack-overview.md  Packet lifecycle: RX from wire to socket, TX from socket to wire
02-socket-layer.md            sock/socket structures, protocol ops, socket buffer management
03-sk-buff.md                 sk_buff layout, headroom/tailroom, clone vs copy, page frags
04-protocol-families.md       AF_INET, AF_INET6, AF_UNIX, AF_PACKET, AF_XDP
05-napi-and-rx-path.md        NAPI polling, GRO, RPS/RFS, RSS multi-queue
06-tx-path-and-gso.md         GSO, TSO, tx queues, qdisc integration
07-network-namespaces.md      Namespace isolation, /proc/net, setns, ip-netns
08-veth-bridge-tuntap.md      veth pair, Linux bridge, tun/tap, macvlan, ipvlan
09-iptables-nftables.md       Netfilter hooks, table/chain model, conntrack, NAT
10-tc-qdisc.md                Traffic shaping, HTB, fq_codel, CAKE, tc mirred
11-netfilter-internals.md     nf_hook_ops, hook registration, conntrack state machine
12-xdp.md                     XDP modes (native/generic/offload), BPF maps, AF_XDP
13-ebpf-networking.md         BPF prog types: XDP, tc, sockops, sk_lookup, cgroup sock
14-dpdk.md                    PMD poll mode, hugepages, memory pools, multi-core pipelines
15-sr-iov.md                  PF/VF model, VF MAC/VLAN assignment, guest passthrough
16-rdma.md                    InfiniBand verbs, QP/CQ/MR, RoCE v2, RDMA_CM
17-zero-copy-networking.md    sendfile, splice, MSG_ZEROCOPY, io_uring zero-copy
18-socket-tuning.md           tcp_rmem/wmem, SO_SNDBUF, TCP_NODELAY, SO_REUSEPORT
19-network-performance.md     PPS analysis, CPU affinity, IRQ balancing, perf profiling

Cross-References

  • Section 02 (CPU Architecture): NUMA effects on NIC interrupt affinity, PCIe bandwidth
  • Section 11 (Memory Management): sk_buff page management, zero-copy page pinning
  • Section 14 (Device Drivers): NIC driver (NAPI), MSI-X interrupt vectors, DMA ring
  • Section 16 (TCP/IP Internals): TCP/UDP protocol implementation built on this stack
  • Section 17 (Distributed Systems): network partition detection, overlay networks
  • Section 19 (Virtualization): virtio-net, SR-IOV passthrough, OVS-DPDK