01 — Linux Network Stack

Technical Overview

The Linux network stack is a layered software system that moves data between user-space applications and physical hardware. Unlike a textbook OSI model, the Linux implementation collapses layers pragmatically: the socket layer serves both session and transport concerns, the IP layer handles network-layer routing, and the device driver layer encompasses both data-link and physical concerns. Understanding the stack in depth is a prerequisite for diagnosing latency spikes, packet drops, and throughput ceilings in production systems.

The stack is invoked on two paths — receive (Rx) and transmit (Tx) — and each path crosses the kernel/user boundary, traverses several subsystems, and interacts with hardware DMA engines.


Prerequisites

  • C language familiarity (struct layouts, pointer arithmetic)
  • Basic OS concepts: interrupts, softirqs, kernel threads
  • Network fundamentals: Ethernet frames, IP packets, TCP/UDP segments
  • Familiarity with ss, ip, ethtool, tc command-line tools

Core Content

The sk_buff Structure

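Every packet in the Linux kernel is represented by a struct sk_buff (socket buffer), defined in include/linux/skbuff.h. It is the single most important data structure in the network stack: the driver allocates one for each received frame, the protocol layers allocate one for each outgoing segment, and every layer the packet traverses annotates it. When a packet must reach several consumers (multicast, packet taps), the sk_buff is cloned with skb_clone() so the payload is shared rather than copied.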

Key fields:

/* Simplified excerpt; field order and exact types vary by kernel version. */
struct sk_buff {
    /* Header positions, stored as offsets from head.
     * Read them through the accessors, never directly. */
    __u16               transport_header;  /* L4: skb_transport_header(), tcp_hdr(), udp_hdr() */
    __u16               network_header;    /* L3: skb_network_header(), ip_hdr(), ipv6_hdr()   */
    __u16               mac_header;        /* L2: skb_mac_header(), eth_hdr()                  */

    struct net_device  *dev;       /* ingress/egress device       */
    unsigned int        len;       /* total packet length          */
    unsigned int        data_len;  /* length of paged data         */
    __be16              protocol;  /* L3 protocol (ETH_P_IP, etc.) */
    __u32               priority;
    sk_buff_data_t      tail;      /* end of linear data           */
    sk_buff_data_t      end;       /* end of buffer                */
    unsigned char      *head;      /* start of buffer              */
    unsigned char      *data;      /* start of packet data         */
    char                cb[48];    /* control block, per-layer     */
    struct sock        *sk;        /* owning socket (or NULL)      */
    ktime_t             tstamp;    /* receive timestamp            */
    /* ... many more fields ... */
};

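The cb[] array (control block) is a 48-byte scratch space that the layer currently owning the packet uses for its own metadata without polluting the main structure. Its contents are only meaningful inside that layer: TCP overlays struct tcp_skb_cb on it to hold sequence numbers and flags for each segment, and the IP layer overlays struct inet_skb_parm. Netfilter connection tracking state is not kept in cb[]; it is attached through a dedicated sk_buff field so that it survives the hand-off between layers.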
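The overlay idea can be sketched in userspace C. This is a minimal illustration, not kernel code: struct toy_skb, struct toy_tcp_cb, and the toy_* helpers are hypothetical names invented here, and the sketch copies with memcpy() where the kernel uses cast macros such as TCP_SKB_CB().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_CB_SIZE 48  /* same size as sk_buff's cb[] */

/* Stand-in for sk_buff: only the control block is modelled. */
struct toy_skb {
    unsigned char cb[TOY_CB_SIZE];
};

/* Hypothetical per-segment metadata a transport layer might keep. */
struct toy_tcp_cb {
    uint32_t seq;        /* first sequence number in this segment */
    uint32_t end_seq;    /* seq + payload length                  */
    uint8_t  tcp_flags;  /* flags copied from the TCP header      */
};

/* A layer's private struct must fit in the scratch area. */
_Static_assert(sizeof(struct toy_tcp_cb) <= TOY_CB_SIZE,
               "toy_tcp_cb does not fit in cb[]");

/* Called when the packet changes owner: the next layer must not
 * interpret whatever the previous layer left behind. */
static void toy_skb_reset_cb(struct toy_skb *skb)
{
    memset(skb->cb, 0, sizeof(skb->cb));
}

/* Store the layer's metadata in the scratch area. */
static void toy_tcp_cb_store(struct toy_skb *skb, const struct toy_tcp_cb *v)
{
    memcpy(skb->cb, v, sizeof(*v));
}

/* Read the layer's metadata back out of the scratch area. */
static struct toy_tcp_cb toy_tcp_cb_load(const struct toy_skb *skb)
{
    struct toy_tcp_cb v;
    memcpy(&v, skb->cb, sizeof(v));
    return v;
}
```

The compile-time size check is the important part: a layer that grows its private struct past 48 bytes should fail the build rather than silently overwrite the neighbouring sk_buff fields.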

Memory layout around packet data:

  head      data                tail  end
   |         |                   |     |
   v         v                   v     v
   +---------+-------------------+-----+
   | headroom|  packet data      |tail |
   | (for new|  (L2 hdr)(L3 hdr) |room |
   |  hdrs)  |  (L4 hdr)(payload)|     |
   +---------+-------------------+-----+

skb_push() decrements data to prepend a header (used going down the stack). skb_put() advances tail to append data. skb_pull() increments data to consume a header (used going up the stack on receive).
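The pointer arithmetic behind these three helpers can be modelled in a few lines of userspace C. This is a sketch under simplifying assumptions: struct toy_skb and the toy_skb_*() functions are hypothetical stand-ins, only linear data is modelled, and bounds violations trip an assert() where the kernel would panic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model of the four sk_buff buffer pointers. */
struct toy_skb {
    unsigned char *head;  /* start of buffer      */
    unsigned char *data;  /* start of packet data */
    unsigned char *tail;  /* end of packet data   */
    unsigned char *end;   /* end of buffer        */
    size_t len;           /* bytes between data and tail */
};

static int toy_skb_alloc(struct toy_skb *skb, size_t size)
{
    skb->head = malloc(size);
    if (!skb->head)
        return -1;
    skb->data = skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return 0;
}

static void toy_skb_free(struct toy_skb *skb)
{
    free(skb->head);
    skb->head = skb->data = skb->tail = skb->end = NULL;
    skb->len = 0;
}

static size_t toy_skb_headroom(const struct toy_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

static size_t toy_skb_tailroom(const struct toy_skb *skb)
{
    return (size_t)(skb->end - skb->tail);
}

/* Reserve headroom in an empty buffer so headers can be pushed later. */
static void toy_skb_reserve(struct toy_skb *skb, size_t n)
{
    assert(skb->len == 0 && n <= toy_skb_tailroom(skb));
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns where the new bytes start. */
static unsigned char *toy_skb_put(struct toy_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;
    assert(n <= toy_skb_tailroom(skb));
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend n bytes by moving data backwards into the headroom. */
static unsigned char *toy_skb_push(struct toy_skb *skb, size_t n)
{
    assert(n <= toy_skb_headroom(skb));
    skb->data -= n;
    skb->len += n;
    return skb->data;
}

/* Strip n bytes from the front, as each layer does on receive. */
static unsigned char *toy_skb_pull(struct toy_skb *skb, size_t n)
{
    assert(n <= skb->len);
    skb->data += n;
    skb->len -= n;
    return skb->data;
}
```

A transmit-style sequence would reserve headroom, put the payload, then push the UDP (8 bytes), IPv4 (20 bytes), and Ethernet (14 bytes) headers in that order. A receive-style sequence pulls them off in the opposite order.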


Packet Receive Path

NIC hardware
     |
     | DMA ring descriptor filled, interrupt fired
     v
Hardware ISR (hard IRQ — minimal work)
     |
     | raise NET_RX_SOFTIRQ
     v
NAPI poll() [net_rx_action in softirq context]
     |
     | skb allocated, headers checked
     v
netif_receive_skb()
     |
     +---> ptype_all handlers (AF_PACKET sniffers, tcpdump)
     |
     +---> ptype_base[protocol] (ETH_P_IP → ip_rcv)
               |
               v
          ip_rcv()  [net/ipv4/ip_input.c]
               |
               +---> Netfilter: NF_INET_PRE_ROUTING
               |
               v
          ip_rcv_finish()
               |
               v
          ip_route_input() — routing decision
               |
               +--- local delivery ---> ip_local_deliver()
               |                              |
               |                        NF_INET_LOCAL_IN
               |                              |
               |                        tcp_v4_rcv() / udp_rcv()
               |                              |
               |                        sock receive queue
               |                              |
               |                        application recv()
               |
               +--- forward ----------> ip_forward()
                                             |
                                        NF_INET_FORWARD
                                             |
                                        ip_output()

NAPI (New API) was introduced to solve the interrupt storm problem. Under high packet rates, a NIC firing an interrupt per packet consumes all CPU time in interrupt context. NAPI switches the NIC to polling mode after the first interrupt: the driver disables interrupts for that queue and schedules a poll() function to run in softirq context. Each poll() call drains at most budget packets (the per-NAPI weight, 64 by default), and interrupts are re-enabled only once the ring has been emptied.

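/* Driver NAPI poll callback example.
 * struct mydriver_priv, the ring helpers, and mydriver_enable_rx_irq()
 * are hypothetical driver-private names. */
static int mydriver_poll(struct napi_struct *napi, int budget)
{
    struct mydriver_priv *priv = container_of(napi, struct mydriver_priv, napi);
    struct sk_buff *skb;
    int work_done = 0;

    while (work_done < budget && ring_has_packets(priv->rx_ring)) {
        skb = build_skb_from_ring(priv->rx_ring);
        napi_gro_receive(napi, skb);   /* GRO coalescing */
        work_done++;
    }

    /* Ring drained before the budget ran out: leave polling mode.
     * napi_complete_done() returns false if NAPI was rescheduled,
     * in which case the queue interrupt must stay masked. */
    if (work_done < budget && napi_complete_done(napi, work_done))
        mydriver_enable_rx_irq(priv);  /* unmask this queue's interrupt in the NIC */

    return work_done;
}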

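RSS (Receive Side Scaling) distributes packets across multiple CPU cores using a hash (typically Toeplitz) of the 4-tuple (src IP, src port, dst IP, dst port) computed in NIC hardware. The low-order bits of the hash index an indirection table whose entries name RX queues, and each queue has its own IRQ that can be affined to a specific CPU. Because all packets of a flow hash to the same queue, per-flow ordering is preserved while the single-CPU bottleneck on high-throughput servers is removed.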

Check RSS queue count and indirection table:

ethtool -l eth0          # show current/maximum channels
ethtool -x eth0          # show RSS hash indirection table
ethtool -X eth0 equal 8  # distribute evenly across 8 queues
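The hash-then-table lookup can be sketched in userspace C. This is an illustration under stated assumptions, not any particular NIC's implementation: the 40-byte key is an example value (real keys are set by the driver or with ethtool), the 128-entry table size is assumed, and every rss_* name is hypothetical. The "equal" fill is modelled as a round-robin spread across the queues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RSS_INDIR_SIZE 128  /* assumption: real table sizes are device-specific */

/* Example 40-byte hash key, for illustration only. */
static const uint8_t toy_rss_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, 0x41, 0x67,
    0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0, 0xd0, 0xca, 0x2b, 0xcb,
    0xae, 0x7b, 0x30, 0xb4, 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30,
    0xf2, 0x0c, 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Toeplitz hash: for every set bit of the input, XOR in the 32-bit
 * window of the key that starts at that bit position. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t key_len,
                              const uint8_t *data, size_t data_len)
{
    assert(key_len >= data_len + 4);
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    size_t next_bit = 32;   /* next key bit to shift into the window */
    uint32_t hash = 0;

    for (size_t i = 0; i < data_len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= window;
            window <<= 1;
            if (key[next_bit / 8] & (0x80u >> (next_bit % 8)))
                window |= 1;
            next_bit++;
        }
    }
    return hash;
}

/* Serialise a TCP/IPv4 4-tuple (host-order integers) into the
 * 12-byte big-endian hash input: saddr, daddr, sport, dport. */
static void rss_tuple_bytes(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport, uint8_t out[12])
{
    for (int i = 0; i < 4; i++) {
        out[i]     = (uint8_t)(saddr >> (24 - 8 * i));
        out[4 + i] = (uint8_t)(daddr >> (24 - 8 * i));
    }
    out[8]  = (uint8_t)(sport >> 8);
    out[9]  = (uint8_t)sport;
    out[10] = (uint8_t)(dport >> 8);
    out[11] = (uint8_t)dport;
}

/* Round-robin fill, the model used here for "ethtool -X ... equal N". */
static void rss_fill_equal(uint8_t indir[RSS_INDIR_SIZE], unsigned nqueues)
{
    assert(nqueues > 0 && nqueues <= 256);
    for (unsigned i = 0; i < RSS_INDIR_SIZE; i++)
        indir[i] = (uint8_t)(i % nqueues);
}

/* Low-order hash bits select the table entry; the entry names the queue. */
static unsigned rss_pick_queue(const uint8_t indir[RSS_INDIR_SIZE], uint32_t hash)
{
    return indir[hash % RSS_INDIR_SIZE];
}
```

Because the table sits between the hash and the queue, rebalancing traffic only requires rewriting table entries; the hash function and key stay untouched.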

Packet Transmit Path

Application write() / sendmsg()
     |
     v
Socket layer (sock_sendmsg)
     |
     v
TCP / UDP layer
     |  TCP: segment, add headers, manage cwnd
     v
ip_queue_xmit() → ip_local_out()
     |
     +---> NF_INET_LOCAL_OUT (Netfilter OUTPUT chain)
     |
     v
ip_output()
     |
     +---> NF_INET_POST_ROUTING (Netfilter POSTROUTING chain)
     |
     v
ip_finish_output()
     |
     +---> Fragmentation if needed
     |
     v
dev_queue_xmit()
     |
     v
Traffic Control (qdisc) — pfifo_fast, fq, fq_codel, etc.
     |
     | NET_TX_SOFTIRQ raised only if transmission must be deferred (driver queue stopped or quota spent)
     v
dev_hard_start_xmit()
     |
     v
Driver ndo_start_xmit()
     |
     v
NIC DMA: descriptor posted to the TX ring, NIC reads the packet from host memory
     |
     v
Wire

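The qdisc layer is where QoS, traffic shaping, and fair queuing operate. Even the kernel's built-in default, pfifo_fast, implements three priority bands selected from skb->priority. Many distributions override that default through net.core.default_qdisc, usually with fq_codel to combat bufferbloat; fq is a common alternative when TCP pacing matters.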
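A toy model of the three-band behaviour, in userspace C. The prio2band table is reproduced from memory of net/sched/sch_generic.c and should be treated as an assumption; struct toy_pfifo_fast, the queue depth, and the toy_* functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BANDS    3
#define TOY_BAND_CAP 16   /* hypothetical per-band depth */

/* Maps the low 4 bits of skb->priority to a band; band 0 is served first.
 * Assumed to match the kernel's table, not verified against a given release. */
static const uint8_t prio2band[16] = {
    1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

struct toy_pfifo_fast {
    int      q[TOY_BANDS][TOY_BAND_CAP];  /* packet ids, one ring per band */
    unsigned head[TOY_BANDS];
    unsigned count[TOY_BANDS];
};

/* Enqueue into the band chosen by priority; tail-drop when the band is full. */
static int toy_enqueue(struct toy_pfifo_fast *s, int pkt_id, unsigned priority)
{
    unsigned band = prio2band[priority & 15];

    if (s->count[band] == TOY_BAND_CAP)
        return -1;
    s->q[band][(s->head[band] + s->count[band]) % TOY_BAND_CAP] = pkt_id;
    s->count[band]++;
    return 0;
}

/* Strict priority: a lower band is only served when all higher
 * bands are empty. Returns -1 when nothing is queued. */
static int toy_dequeue(struct toy_pfifo_fast *s)
{
    for (unsigned band = 0; band < TOY_BANDS; band++) {
        if (s->count[band] == 0)
            continue;
        int pkt_id = s->q[band][s->head[band]];
        s->head[band] = (s->head[band] + 1) % TOY_BAND_CAP;
        s->count[band]--;
        return pkt_id;
    }
    return -1;
}
```

Strict priority means a steady stream of band-0 traffic can starve bands 1 and 2, which is one reason fair-queuing qdiscs are preferred on busy hosts.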


Softirq Processing

Softirqs are a deferred interrupt mechanism. Two softirq vectors handle networking:

Vector              Purpose
NET_RX_SOFTIRQ (3)  Process received packets from NAPI poll
NET_TX_SOFTIRQ (2)  Transmit packets queued by upper layers

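Softirqs run inline on return from a hard IRQ or, under sustained load, in the per-CPU ksoftirqd/<cpu> kernel threads. Each run of net_rx_action is bounded by time (netdev_budget_usecs, default 2 ms) and by packet count (netdev_budget, default 300). When either limit is hit, the time_squeeze counter is incremented and the remaining work waits for the next run.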
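How the two budgets interact can be seen in a small simulation. This is a userspace sketch with hypothetical names (struct toy_napi, toy_net_rx_action) that models only the packet-count limit; the time limit is omitted to keep the example deterministic.

```c
#include <assert.h>
#include <stddef.h>

/* One NAPI instance, reduced to the number of packets waiting in its ring. */
struct toy_napi {
    unsigned pending;
};

struct toy_rx_stats {
    unsigned processed;
    unsigned time_squeeze;
};

/* One softirq run: poll instances round-robin, each for at most
 * `weight` packets, until every ring is empty or `budget` is spent. */
static void toy_net_rx_action(struct toy_napi *napi, size_t n, int budget,
                              unsigned weight, struct toy_rx_stats *st)
{
    size_t next = 0;

    if (n == 0)
        return;

    for (;;) {
        /* Find the next instance with work; stop if there is none. */
        size_t scanned = 0;
        while (scanned < n && napi[next].pending == 0) {
            next = (next + 1) % n;
            scanned++;
        }
        if (scanned == n)
            break;

        unsigned work = napi[next].pending < weight ? napi[next].pending : weight;
        napi[next].pending -= work;
        st->processed += work;
        budget -= (int)work;
        next = (next + 1) % n;

        /* Budget spent: record the squeeze and defer the rest. */
        if (budget <= 0) {
            st->time_squeeze++;
            break;
        }
    }
}
```

With two busy queues, a budget of 300, and a weight of 64, the run stops after the fifth poll because the budget is only checked between polls. A single queue holding 100 packets drains completely without a squeeze.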

Monitor softirq load:

watch -n1 'cat /proc/net/softnet_stat'
# One row per CPU, values in hexadecimal.
# First three columns: packets processed, dropped, time_squeeze (budget or time limit exhausted)

# Per-CPU softirq counts
cat /proc/softirqs | grep -E 'NET_RX|NET_TX'
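Because the counters are hexadecimal, reading them by eye is error-prone. A minimal C sketch of a parser follows; struct softnet_row and both function names are hypothetical, and only the first three columns are decoded because the meaning of later columns differs between kernel versions.

```c
#include <assert.h>
#include <stdio.h>

/* First three columns of one /proc/net/softnet_stat row. */
struct softnet_row {
    unsigned int processed;
    unsigned int dropped;
    unsigned int time_squeeze;
};

/* Parse one row of hexadecimal counters. Returns 0 on success, -1 otherwise. */
static int parse_softnet_line(const char *line, struct softnet_row *row)
{
    int n = sscanf(line, "%x %x %x",
                   &row->processed, &row->dropped, &row->time_squeeze);
    return n == 3 ? 0 : -1;
}

/* Read up to `max` rows from `path`. Returns the number of rows
 * parsed, or -1 if the file cannot be opened. */
static int read_softnet_stat(const char *path, struct softnet_row *rows, int max)
{
    char line[512];
    int n = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (n < max && fgets(line, sizeof(line), f)) {
        if (parse_softnet_line(line, &rows[n]) == 0)
            n++;
    }
    fclose(f);
    return n;
}
```

Sampling the file twice and subtracting the rows gives per-interval rates, which is usually more useful than the absolute totals since boot.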

If time_squeeze is increasing rapidly, the NAPI budget is too small or the CPU is overloaded. Raising the budget lets each softirq run do more work at the cost of holding the CPU longer:

sysctl -w net.core.netdev_budget=600
sysctl -w net.core.netdev_budget_usecs=4000

Historical Context

The original Linux network stack (pre-2.2) used a simple interrupt-per-packet model that worked for 10 Mbps Ethernet. As 100 Mbps and then Gigabit NICs arrived, interrupt rates hit hundreds of thousands per second, causing receive livelock where the CPU spent all time handling interrupts with no time left for application processing.

Jamal Hadi Salim, Robert Olsson, and Alexey Kuznetsov designed NAPI (landed in 2.5.x, ~2002) to solve this. The key insight: under high load, polling is more efficient than interrupting.

The sk_buff structure has grown from ~100 bytes in early kernels to over 200 bytes today as more features (GSO, GRO, XDP, timestamps, security labels) were added. This growth has memory implications: millions of concurrent connections means millions of sk_buffs in flight.


Production Examples

nginx high-throughput tuning — A 10G nginx server receiving HTTP traffic needs RSS queues matched to CPU count, NAPI budget tuned, and interrupt affinity pinned:

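# Pin NIC IRQs to CPUs 0-7 (assuming 8-queue NIC).
# IRQ_BASE is a placeholder for the NIC's first queue IRQ number (see /proc/interrupts);
# the loop assumes the queue IRQs are numbered contiguously.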
for i in $(seq 0 7); do
    echo $i > /proc/irq/$((IRQ_BASE + i))/smp_affinity_list
done

# Set RPS for single-queue NICs (software RSS)
echo "ff" > /sys/class/net/eth0/queues/rx-0/rps_cpus
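The value written to rps_cpus is a hexadecimal CPU bitmask, with bit N standing for CPU N. A small sketch of how such a mask is built follows; the function names are hypothetical and the sketch is limited to CPUs 0 through 31, since larger systems need the multi-word mask format that it does not cover.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Build a bitmask with bits first..last set (CPU numbers, inclusive).
 * Returns 0 on success, -1 if the range is invalid or exceeds CPU 31. */
static int rps_mask_from_range(unsigned first, unsigned last, uint32_t *mask)
{
    if (first > last || last > 31)
        return -1;
    *mask = 0;
    for (unsigned cpu = first; cpu <= last; cpu++)
        *mask |= UINT32_C(1) << cpu;
    return 0;
}

/* Format the mask as the lowercase hex string used in the echo command.
 * Returns 0 on success, -1 if the buffer is too small. */
static int rps_mask_format(uint32_t mask, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%x", (unsigned int)mask);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

CPUs 0 through 7 give the mask ff used above. Steering only to CPUs 2 and 3 would give c.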

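Packet capture overhead: tcpdump attaches a ptype_all handler via AF_PACKET. Every received packet is passed to this handler, which runs the socket's BPF filter in-kernel; packets that do not match are discarded there, while matching packets are cloned and queued to the capture socket. On a 10 Gbps link, this adds measurable overhead to the main data path. Use a tight BPF filter so that few packets are queued, and a larger capture buffer (-B takes KiB, so -B 65536 requests 64 MiB) to avoid capture drops.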


Debugging Notes

# Global packet drop counters
ip -s link show eth0
netstat -s | grep -i drop

# Detailed NIC statistics (hardware-level drops)
ethtool -S eth0 | grep -i drop

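# Socket buffer overruns (UDP drops due to full recv buffer)
ss -u -a -n -m                 # d<N> at the end of skmem:(...) is the per-socket drop count
nstat -az UdpRcvbufErrors      # system-wide count of UDP receive buffer drops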

# RX ring buffer size (increase if drops at high pps)
ethtool -g eth0
ethtool -G eth0 rx 4096

# Kernel Rx drop tracking (one row per CPU; values are hexadecimal)
awk '{print "cpu" (NR-1) ": total=0x" $1 " dropped=0x" $2 " time_squeeze=0x" $3}' /proc/net/softnet_stat

bpftrace one-liners:

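# Count sk_buff allocation failures (__alloc_skb returning NULL) per process
bpftrace -e 'kretprobe:__alloc_skb /retval == 0/ { @[comm] = count(); }'

# Latency from netif_receive_skb to TCP receive queueing, keyed by skb pointer.
# Assumes tcp_data_queue is probeable on your kernel (check with bpftrace -l);
# skbs merged by GRO or delivered by another path will not be matched.
bpftrace -e 'kprobe:netif_receive_skb { @start[arg0] = nsecs; }
             kprobe:tcp_data_queue /@start[arg1]/ { @lat = hist(nsecs - @start[arg1]); delete(@start[arg1]); }'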

Security Implications

  • IP spoofing: Without rp_filter (reverse path filtering), a host will accept packets with source IPs that are not reachable via the interface they arrived on. Enable: sysctl -w net.ipv4.conf.all.rp_filter=1
  • Packet sniffing: Any process with CAP_NET_RAW (or root) can open AF_PACKET and receive all packets on an interface. Network namespaces provide isolation.
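  • sk_buff information leak: Historically, uninitialized cb[] bytes have leaked stale kernel memory to userspace via AF_PACKET. AF_PACKET has also had memory-safety bugs of its own: CVE-2016-8655 was a race condition in packet_set_ring() that led to a use-after-free and local privilege escalation.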
  • NAPI budget starvation: A hostile VM or container generating floods can exhaust NAPI budget on shared host NICs — use hardware rate limiting or SR-IOV VF rate caps.

Performance Implications

Optimization          Mechanism                         Typical Gain
RSS                   Hardware hash → multiple queues   Linear with core count
NAPI                  Poll instead of interrupt         Eliminates interrupt livelock
GRO                   Coalesce segments before stack    30–50% CPU reduction
RPS/RFS               Software RSS + flow steering      For single-queue NICs
Interrupt coalescing  ethtool -C eth0 rx-usecs 50       Reduces IRQ rate 10x

The fastest path through the Linux network stack (loopback, AF_UNIX) incurs ~1–3 µs. A packet crossing a 10G NIC typically sees 10–30 µs end-to-end kernel overhead under light load, dominated by memory allocation and cache misses.


Failure Modes and Real Incidents

Incident: Receive livelock under UDP flood (2014, major CDN). A UDP amplification DDoS saturated all CPU time in NET_RX_SOFTIRQ, so no application packets were processed. Fix: NAPI budget reduction plus early BPF drop at driver level (predates XDP).

Incident: sk_buff slab exhaustion. A misconfigured application opened millions of sockets without closing them. Every socket with unread data pins one or more sk_buffs in its receive queue, up to its receive buffer limit. The kernel OOM killer terminated unrelated processes. Fix: reduce net.core.rmem_default and cap sockets per process with ulimit -n.

Failure Mode: CPU affinity mismatch. RSS queue IRQs land on CPU 0 while the application runs on CPU 7. Every received packet is processed on one CPU and consumed on another, crossing caches and, if the two CPUs sit on different sockets, NUMA nodes. Symptom: high sys CPU on CPU 0, high latency. Diagnose with perf top -C 0 showing __copy_skb_header.


Modern Usage

  • XDP (eXpress Data Path) hooks before skb allocation for wire-speed processing — see 04-ebpf-networking.md
  • io_uring provides an asynchronous interface to the socket layer, reducing syscall overhead for high-rate applications
  • AF_XDP with the XDP_ZEROCOPY bind flag lets the NIC DMA packets directly into userspace-registered memory (UMEM), bypassing the kernel stack entirely at the cost of application complexity
  • Multi-queue NICs with 32–128 queues are now common; tools like irqbalance auto-manage affinity but manual tuning outperforms it for latency-sensitive workloads

Future Directions

  • Kernel TLS (kTLS) pushes encryption into the network stack so sendfile() can work with TLS streams, eliminating an extra copy
  • BPF TCP header option parser (Linux 5.10) allows eBPF programs to parse and inject custom TCP options — enabling application-aware congestion control from userspace
  • Hardware-accelerated GRO (Intel E800 series): the NIC itself coalesces segments before DMA, reducing software GRO overhead
  • Memory-mapped packet sockets (PACKET_MMAP, AF_XDP) are converging toward a unified zero-copy API

Exercises

  1. Use bpftrace to instrument netif_receive_skb and measure the distribution of packet sizes received on your primary NIC over 10 seconds. Explain what the bimodal distribution (if present) means.

  2. Enable a second RSS queue on a test NIC (ethtool -L eth0 combined 2), then generate TCP traffic with iperf3 -P 4. Use cat /proc/net/softnet_stat before and after to observe CPU load distribution changes.

  3. Write a minimal character device driver stub that allocates an sk_buff, populates a UDP header using skb_push(), and calls dev_queue_xmit(). Explain each step.

  4. Reproduce time_squeeze by setting net.core.netdev_budget=1 and running iperf3. Observe the packet drop rate in ip -s link. Then restore the default and explain the trade-off between latency and throughput in budget selection.

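  5. Using perf record -a -g -e net:netif_receive_skb -- iperf3 -s (-g records call stacks, -a captures softirq work that is not attributed to the iperf3 process), generate a flame graph and identify the top three callees consuming CPU during packet receive. Explain what each does.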


References

  • include/linux/skbuff.h: sk_buff definition
  • net/core/dev.c: netif_receive_skb, dev_queue_xmit, NAPI core
  • net/ipv4/ip_input.c: ip_rcv, ip_local_deliver
  • net/ipv4/tcp_ipv4.c: tcp_v4_rcv; net/ipv4/tcp_input.c: tcp_rcv_established
  • Corbet, J. et al. Linux Device Drivers, 3rd Ed. Chapter 17: Network Drivers.
  • Benvenuti, C. Understanding Linux Network Internals. O'Reilly, 2006.
  • Salim, J.H., Olsson, R., Kuznetsov, A. Beyond Softnet. 5th Annual Linux Showcase & Conference, 2001. (NAPI design rationale)
  • Documentation/networking/scaling.rst: RSS, RPS, RFS, XPS kernel docs
  • man 7 packet: AF_PACKET socket interface