05 — DPDK: Data Plane Development Kit

Technical Overview

DPDK (Data Plane Development Kit) is a set of libraries and drivers that let user-space applications transmit and receive network packets at or near wire speed by bypassing the Linux kernel network stack. Created at Intel in 2010 and open-sourced in 2013, DPDK achieves performance that the kernel stack cannot match for specialized use cases: it replaces interrupts with polling, removes per-packet system calls and context switches, avoids copying packet data between kernel and user buffers, and is designed to keep packet data cache-resident through the processing pipeline. The tradeoff is that the application gives up the kernel's abstractions on the fast path: no sockets, no iptables, no kernel TCP/IP.

Prerequisites

Linux network stack and interrupt model (see 01-linux-network-stack.md)
NUMA architecture, huge pages, CPU affinity
C programming, pointer arithmetic, cache-line alignment
Understanding of DMA, PCI device I/O

Core Content

Why Bypass the Kernel?

The Linux network stack imposes several unavoidable overheads per packet:

Overhead	Source	Approximate cost
Interrupt handling	NIC fires IRQ per batch	1–5 µs
Context switch (kernel→user)	`recv()` syscall	0.5–2 µs
sk_buff allocation	`kmem_cache_alloc`	100–300 ns
Memory copy	Kernel buffer → user buffer	1 ns/byte
Lock contention	Socket lock, netdev queue lock	Variable
Cache pollution	Stack code evicts application data	100–500 ns

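At 10 Gbps with 64-byte frames, line rate is 14.88 Mpps, because each frame also occupies 20 bytes of preamble and inter-frame gap on the wire. That leaves a budget of about 67 ns per packet. The kernel's per-packet overhead alone exceeds this budget. DPDK removes most of it by polling instead of taking interrupts, staying in user space, and handing the application the same buffer the NIC wrote into.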
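The line-rate figures can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; the function names are made up for this example, and the 20-byte wire overhead is the standard Ethernet preamble, start-of-frame delimiter, and inter-frame gap.

```c
#include <stdint.h>

/* Per-frame overhead on the wire that is not part of the Ethernet frame:
 * 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap. */
#define WIRE_OVERHEAD_BYTES 20u

/* Packets per second at line rate for a given link speed and frame size. */
static double line_rate_pps(double link_bps, uint32_t frame_bytes)
{
    double bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8.0;
    return link_bps / bits_on_wire;
}

/* Time budget per packet, in nanoseconds, at line rate. */
static double per_packet_budget_ns(double link_bps, uint32_t frame_bytes)
{
    return 1e9 / line_rate_pps(link_bps, frame_bytes);
}
```

Calling line_rate_pps(10e9, 64) gives roughly 14.88 million packets per second, and per_packet_budget_ns(10e9, 64) gives roughly 67.2 ns. On a core assumed to run at 3 GHz, that is on the order of 200 cycles per packet.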

Poll Mode Driver (PMD)

DPDK's core innovation is the Poll Mode Driver. Instead of the NIC raising an interrupt when packets arrive, a DPDK lcore (logical core) runs a tight busy-poll loop reading from the NIC's RX descriptor ring:

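struct rte_mbuf *mbufs[MAX_BURST];

while (running) {
    /* Poll the RX ring; returns immediately with 0..MAX_BURST packets */
    uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, mbufs, MAX_BURST);
    if (nb_rx == 0)
        continue;               /* ring empty, keep polling */

    for (uint16_t i = 0; i < nb_rx; i++)
        process_packet(mbufs[i]);

    /* Transmit the processed burst; the PMD may accept fewer than requested */
    uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, mbufs, nb_rx);

    /* The caller still owns any mbufs the PMD did not accept: free them */
    for (uint16_t i = nb_tx; i < nb_rx; i++)
        rte_pktmbuf_free(mbufs[i]);
}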

This loop runs on a dedicated lcore. The EAL pins each lcore thread to a CPU at startup, and the application starts the loop on a worker lcore with rte_eal_remote_launch(). The core runs at 100% utilization at all times: the cycles are "wasted" on polling even when no packets arrive. This is acceptable in telecom and HFT, where deterministic latency matters more than CPU efficiency.

Before the PMD can drive a NIC, DPDK must take ownership of it. For most NICs this means unbinding the device from its kernel driver and binding it to a userspace I/O driver (vfio-pci or uio_pci_generic):

# Load the VFIO driver, then unbind the NIC from its kernel driver
modprobe vfio-pci
dpdk-devbind.py --unbind 0000:01:00.0

# Bind to vfio-pci (IOMMU-safe, recommended)
dpdk-devbind.py --bind=vfio-pci 0000:01:00.0

# Verify
dpdk-devbind.py --status

After this, the NIC is invisible to the kernel network stack — no ip link show entries, no ethtool access.

Huge Pages

DPDK backs its packet memory pools with huge pages by default. With standard 4KB pages, a pool spans many small, physically scattered pages, which fragments DMA mappings and puts pressure on the TLB: at 10 Gbps packet rates, TLB misses on packet memory become a measurable per-packet cost.

DPDK uses 2MB or 1GB huge pages:
2MB huge pages: each TLB entry covers 512x more memory than a 4KB page
1GB huge pages: each TLB entry covers 262144x more memory, which in practice removes TLB misses for packet pools
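To make the coverage ratios concrete, the following sketch counts how many pages (and so how many distinct TLB entries) are needed to map a memory region at each page size. The function name and the 1 GiB pool size in the usage note are assumptions chosen for illustration.

```c
#include <stdint.h>

/* Number of pages (and therefore TLB entries) needed to map region_bytes
 * with pages of page_bytes each, rounding up. */
static uint64_t pages_needed(uint64_t region_bytes, uint64_t page_bytes)
{
    return (region_bytes + page_bytes - 1) / page_bytes;
}
```

For a hypothetical 1 GiB packet pool, pages_needed() returns 262144 with 4KB pages, 512 with 2MB pages, and 1 with a 1GB page. Only the last two fit comfortably in a typical TLB.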

# Reserve 2MB huge pages
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# For NUMA systems, reserve per node
echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 512 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

# Mount hugepages filesystem (create the mount point first)
mkdir -p /mnt/huge
mount -t hugetlbfs nodev /mnt/huge

# In DPDK app (EAL arguments)
./dpdk_app -l 0-3 -n 4 --huge-dir /mnt/huge --file-prefix dpdk

Huge pages are reserved in the kernel ahead of time, mapped by the DPDK EAL (Environment Abstraction Layer) at initialization, and divided into memory pools (rte_mempool) from which mbufs are allocated.

The mbuf Structure

struct rte_mbuf is DPDK's equivalent of the kernel's sk_buff: a packet buffer with metadata. It is designed for cache efficiency. The structure spans two 64-byte cache lines, and the fields touched on the receive path sit in the first one:

rte_mbuf layout:
+----------------------------------------------------------+
| cache line 0 (64 bytes): RX-side hot metadata            |
|  buf_addr (8B) | data_off | refcnt | nb_segs | port      |
|  ol_flags | packet_type | pkt_len (4B) | data_len (2B)   |
|  vlan_tci | hash.rss | buf_len | ...                     |
+----------------------------------------------------------+
| cache line 1 (64 bytes): TX offload and slow path        |
|  next | tx_offload (l2_len, l3_len, tso_segsz) | ...     |
+----------------------------------------------------------+
| headroom (128B default) — space for prepending headers   |
+----------------------------------------------------------+
| packet data                                              |
|  (up to mbuf_data_room_size — RTE_MBUF_DEFAULT_DATAROOM) |
+----------------------------------------------------------+

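The buf_addr field points at the start of the buffer in huge page memory, and the packet data the NIC wrote by DMA begins data_off bytes after it. The application reads that memory in place, with no copy.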
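The relationship between buf_addr, the headroom, and the packet data can be sketched with a toy structure. This is a hypothetical, heavily simplified stand-in for struct rte_mbuf (toy_mbuf, toy_mtod, and toy_prepend are names invented here); it only mirrors the idea behind rte_pktmbuf_mtod() and rte_pktmbuf_prepend(), not their implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for struct rte_mbuf. */
struct toy_mbuf {
    uint8_t *buf_addr;   /* start of the buffer (headroom begins here) */
    uint16_t data_off;   /* offset of the first packet byte from buf_addr */
    uint16_t data_len;   /* bytes of packet data in this buffer */
};

/* Pointer to the first byte of packet data (the idea behind rte_pktmbuf_mtod). */
static uint8_t *toy_mtod(const struct toy_mbuf *m)
{
    return m->buf_addr + m->data_off;
}

/* Grow the packet at the front by len bytes, consuming headroom
 * (the idea behind rte_pktmbuf_prepend). Returns NULL if headroom is short. */
static uint8_t *toy_prepend(struct toy_mbuf *m, uint16_t len)
{
    if (len > m->data_off)
        return NULL;
    m->data_off -= len;
    m->data_len += len;
    return toy_mtod(m);
}
```

With a 128-byte headroom, prepending a 14-byte Ethernet header moves the data pointer back to offset 114 without touching the payload. This is why encapsulation (VLAN, GTP-U, VXLAN) can be done without copying the packet.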

mbuf pool creation:

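struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
    "MBUF_POOL",
    NUM_MBUFS,              /* number of mbufs */
    MBUF_CACHE_SIZE,        /* per-lcore cache size (reduces contention) */
    0,                      /* private data size */
    RTE_MBUF_DEFAULT_BUF_SIZE, /* data room per mbuf, headroom included */
    rte_socket_id()         /* NUMA node of the calling lcore; must match the NIC's node */
);
if (mbuf_pool == NULL)
    rte_exit(EXIT_FAILURE, "Cannot create mbuf pool: %s\n", rte_strerror(rte_errno));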
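When choosing NUM_MBUFS it helps to estimate how much huge page memory the pool will consume. The sketch below computes a lower bound under stated assumptions: a 128-byte struct rte_mbuf (two cache lines) and the data room passed at pool creation. It ignores the mempool's per-object headers, alignment padding, and per-lcore caches, so the real footprint is somewhat larger. The helper names and constants are hypothetical.

```c
#include <stdint.h>

#define TOY_MBUF_STRUCT_BYTES 128u   /* assumed: two 64-byte cache lines of metadata */
#define TOY_HUGEPAGE_BYTES (2u * 1024u * 1024u)

/* Lower bound on the memory one pool needs: per-object mempool headers,
 * alignment padding, and per-lcore caches are ignored. */
static uint64_t pool_bytes_lower_bound(uint32_t num_mbufs, uint32_t priv_bytes,
                                       uint32_t data_room_bytes)
{
    uint64_t per_mbuf = (uint64_t)TOY_MBUF_STRUCT_BYTES + priv_bytes + data_room_bytes;
    return per_mbuf * num_mbufs;
}

/* Number of 2MB huge pages needed to hold that many bytes, rounding up. */
static uint64_t hugepages_2mb_needed(uint64_t bytes)
{
    return (bytes + TOY_HUGEPAGE_BYTES - 1) / TOY_HUGEPAGE_BYTES;
}
```

For example, 8192 mbufs with a 2176-byte data room (2048 bytes of data plus 128 bytes of headroom) need at least 18 MiB, or nine 2MB huge pages, on the NUMA node that hosts the pool.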

DPDK Ring: Lock-Free Queue

rte_ring is DPDK's inter-lcore communication primitive — a lock-free circular buffer supporting single-producer/single-consumer (SPSC) and multi-producer/multi-consumer (MPMC) modes:

        enqueue head                dequeue tail
              |                         |
              v                         v
+---+---+---+---+---+---+---+---+---+---+---+---+
|   |   | p3| p2| p1| p0|   |   |   | c2| c1| c0|
+---+---+---+---+---+---+---+---+---+---+---+---+
              ^                         ^
              |                         |
          prod head                 cons tail

MPMC mode uses CAS (compare-and-swap) to reserve slots by advancing the shared head index, with no mutex. In SPSC mode (most common in pipeline designs), even the CAS is unnecessary, because each index has exactly one writer.

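/* Create an SPSC ring; the size must be a power of two */
struct rte_ring *ring = rte_ring_create("PKT_RING", 1024,
                                         rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);

/* Enqueue from producer lcore; returns how many objects actually fit */
unsigned int sent = rte_ring_enqueue_burst(ring, (void **)mbufs, nb_pkts, NULL);
for (unsigned int i = sent; i < nb_pkts; i++)
    rte_pktmbuf_free(mbufs[i]);   /* ring full: drop the remainder */

/* Dequeue from consumer lcore */
unsigned int n = rte_ring_dequeue_burst(ring, (void **)mbufs, MAX_BURST, NULL);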
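To show why the SPSC case needs no CAS, here is a minimal single-producer/single-consumer ring written with C11 atomics. It is a sketch of the general technique, not the rte_ring implementation: toy_ring and its functions are hypothetical names, it moves one object at a time rather than bursts, and it uses all slots where rte_ring by default keeps one free.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_RING_SIZE 8u                 /* must be a power of two */
#define TOY_RING_MASK (TOY_RING_SIZE - 1u)

/* Hypothetical minimal SPSC ring; not the rte_ring implementation. */
struct toy_ring {
    _Atomic uint32_t head;               /* next slot to write; producer-owned */
    _Atomic uint32_t tail;               /* next slot to read; consumer-owned */
    void *slots[TOY_RING_SIZE];
};

/* Producer side: returns 1 on success, 0 if the ring is full. */
static int toy_ring_enqueue(struct toy_ring *r, void *obj)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == TOY_RING_SIZE)
        return 0;
    r->slots[head & TOY_RING_MASK] = obj;
    /* Release: the slot write must be visible before the new head is. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer side: returns 1 and stores the object in *obj, or 0 if empty. */
static int toy_ring_dequeue(struct toy_ring *r, void **obj)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return 0;
    *obj = r->slots[tail & TOY_RING_MASK];
    /* Release: the slot read must complete before the slot is handed back. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```

Each index is written by exactly one thread, so plain acquire/release loads and stores are enough. The indices are free-running 32-bit counters; masking them on access keeps the full/empty test correct across wraparound.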

DPDK Architecture Diagram

                    DPDK Application Architecture
                    ================================

 NIC (bound to vfio-pci)
  |          |
  | RX rings | TX rings
  |          |
  v          ^
  |          |
  +----------+----------+----------+
  |          |          |          |
  | lcore 1  | lcore 2  | lcore 3  |
  | (RX/PMD) | (worker) | (TX/PMD) |
  |          |          |          |
  | rte_eth  | process  | rte_eth  |
  | rx_burst | packet   | tx_burst |
  |          |          |          |
  +--ring--->+--ring--->+----------+
     (mbufs)    (mbufs)

  Memory:   /mnt/huge (2MB huge pages)
            rte_mempool (mbuf pool, per NUMA node)

  Control:  EAL (Environment Abstraction Layer)
            - huge page mapping
            - lcore pinning (pthread_setaffinity_np)
            - PCI device init
            - RTE_LOG

DPDK Multi-core and NUMA

DPDK's lcore model maps directly to hardware threads. The EAL creates one thread per lcore and pins it to a logical CPU with pthread_setaffinity_np(). Memory pools should be allocated on the local NUMA node: a packet processed by lcore 3 on NUMA node 1 should have its mbuf from NUMA node 1's mempool to avoid cross-NUMA memory traffic.

# Run DPDK app on cores 2-5, using 4 memory channels
./l3fwd -l 2-5 -n 4 -- -p 0x3 --config="(0,0,2),(0,1,3),(1,0,4),(1,1,5)"

# Monitor per-port stats from a secondary process
./dpdk-procinfo -- --stats

DPDK Use Cases

Telecom VNFs (Virtual Network Functions): 5G User Plane Function (UPF) implementations (Open5GS, free5GC with DPDK UPF) process GTP-U tunneled packets at >10 Gbps with sub-100 µs latency. The stock kernel stack struggles to meet 3GPP latency targets for URLLC.

Software Load Balancers: L4 load balancers at Alibaba and ByteDance use DPDK for the fast path. A DPDK-based LB achieves 40–100 Gbps on 2-socket servers. Not every high-performance load balancer takes this route: Facebook's Katran and Cloudflare's Unimog use XDP, not DPDK.

OVS-DPDK: Open vSwitch with DPDK PMD replaces the kernel OVS datapath for high-performance hypervisors. Red Hat OpenStack and Canonical OpenStack both support OVS-DPDK for NFV workloads.

HFT (High-Frequency Trading): Market data feeds at exchanges deliver 1–5 Mpps of UDP multicast. DPDK is used to capture every packet with sub-1 µs jitter. Solarflare Onload (OpenOnload) uses a similar userspace bypass model.

DPDK vs XDP Performance Comparison

Metric	DPDK	Native XDP	AF_XDP zerocopy
Throughput (64B pkts)	80–150 Mpps	20 Mpps	20 Mpps
Latency (pkt processing)	<1 µs	1–3 µs	2–5 µs
CPU utilization	100% always	Only under load	Only under load
Kernel compatibility	None (NIC unbound)	Full	Partial
Driver support requirement	DPDK PMD	XDP-capable driver	XDP + zero-copy support
Setup complexity	High	Medium	High
Best for	Dedicated appliances	In-kernel fast path	Userspace + kernel hybrid

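DPDK wins on throughput and latency but requires dedicating cores and NICs. XDP is simpler and kernel-integrated, a good fit when a large fraction of DPDK's performance is enough and the operational complexity is not justified. The figures in the table are indicative only: they depend heavily on the NIC, the driver, and the number of cores used.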

Historical Context

Intel created DPDK in 2010 as a performance proof-of-concept for its 10G NICs (the 82599/X520 family, later the X710). It was initially internal, then released as open source in 2013 under a BSD license. The DPDK project moved to the Linux Foundation in 2017.

The fundamental tension DPDK represents — kernel bypass vs kernel integration — has driven most of the XDP and AF_XDP development in the Linux kernel. The kernel community's response was: "if userspace needs that performance, let's add hooks instead of bypassing us entirely."

By 2020, many new DPDK deployments were using SR-IOV virtual functions rather than physical functions, which lets DPDK and the kernel stack coexist on the same physical NIC.

Debugging Notes

# Check DPDK app ports and stats
./dpdk-procinfo -- --stats
./dpdk-procinfo -- --stats-reset

# Verify huge pages are properly allocated
grep Huge /proc/meminfo
ls /mnt/huge/   # or /dev/hugepages/ if the default systemd mount is used

# Check IOMMU is enabled (required for vfio-pci)
dmesg | grep -i iommu
ls /sys/kernel/iommu_groups/

# Verify NIC binding
dpdk-devbind.py --status | grep -A2 "Network devices"

# DPDK app logging
./dpdk_app --log-level='pmd.net.ixgbe.*:debug'  # enable PMD debug log (glob matches all ixgbe log types)

# Monitor PMD statistics
ethtool -S eth0  # does NOT work after DPDK bind; use rte_eth_stats_get()

Security Implications

No kernel security enforcement: packets received by a DPDK application never traverse Netfilter, conntrack, or SELinux network controls, and seccomp filters see no per-packet system calls to act on. All filtering must be implemented in userspace.
VFIO and IOMMU: binding via vfio-pci uses the IOMMU to restrict DMA to the application's memory — prevents rogue DMA attacks. uio_pci_generic has no IOMMU protection.
Huge page confidentiality: packet data in huge pages is not zeroed between mbuf reuse unless explicitly cleared — a bug in the free-path can expose previous packet contents.
Privilege: DPDK requires root (or CAP_SYS_RAWIO + CAP_NET_ADMIN + VFIO group access). There is no privilege separation within a DPDK application.

Performance Implications

Critical DPDK tuning checklist:

# 1. Isolate CPUs from kernel scheduler
# In /etc/default/grub: GRUB_CMDLINE_LINUX="isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7"

# 2. Pin DPDK cores to the performance governor (no frequency scaling)
cpupower -c 2-7 frequency-set -g performance

# 3. Disable hyperthreading (DPDK cores should not share physical core)
# BIOS setting or: echo 0 > /sys/devices/system/cpu/cpuN/online for sibling CPUs

# 4. NUMA-local memory allocation
# Pass --socket-mem argument matching packet pool size per socket

# 5. Prefetch and cache-line alignment in packet processing loop
if (i + 4 < nb_rx)  /* prefetch 4 ahead, staying inside the burst */
    rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + 4], void *));
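The prefetch pattern in step 5 can be tried outside DPDK. The sketch below is a standalone approximation: it uses the GCC/Clang builtin __builtin_prefetch in place of rte_prefetch0(), plain byte pointers in place of mbufs, and a hypothetical function name. The prefetch distance of 4 is the same illustrative value used in the checklist and would need tuning per workload.

```c
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_AHEAD 4

/* Sum the first byte of every packet while prefetching a few packets ahead.
 * pkts[] holds pointers to packet data; __builtin_prefetch is a GCC/Clang
 * builtin used here in place of rte_prefetch0(). */
static uint32_t sum_first_bytes(uint8_t *const pkts[], size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)           /* stay inside the burst */
            __builtin_prefetch(pkts[i + PREFETCH_AHEAD], 0, 3);
        sum += pkts[i][0];
    }
    return sum;
}
```

The bounds check matters: without it the loop would read pointers past the end of the burst array before handing them to the prefetch instruction.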

Failure Modes and Real Incidents

Incident: mbuf pool exhaustion (telecom UPF, 2021)
A traffic spike caused mbufs to be allocated faster than they were freed. Allocation from the pool started failing (rte_pktmbuf_alloc() returned NULL), the PMD could not refill its RX ring, and packets were dropped with no error surfaced to the application. Fix: monitor pool utilization (rte_mempool_avail_count()) and the rx_nombuf counter in rte_eth_stats, implement backpressure, and increase the pool size.
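A pool-utilization monitor of the kind described in this incident can be reduced to a small classification function. The sketch below is hypothetical: the 25% and 5% watermarks are illustrative assumptions rather than DPDK defaults, and in a real application avail would come from rte_mempool_avail_count() and total from the pool size chosen at creation.

```c
#include <stdint.h>

enum pool_state { POOL_OK, POOL_LOW, POOL_CRITICAL };

/* Classify pool pressure from the number of free objects.
 * The 25% and 5% watermarks are illustrative assumptions, not DPDK defaults. */
static enum pool_state classify_pool(uint32_t avail, uint32_t total)
{
    if (total == 0 || (uint64_t)avail * 100 < (uint64_t)total * 5)
        return POOL_CRITICAL;
    if ((uint64_t)avail * 100 < (uint64_t)total * 25)
        return POOL_LOW;
    return POOL_OK;
}
```

One possible policy is to raise an alert at POOL_LOW and start shedding low-priority traffic at POOL_CRITICAL, so that exhaustion becomes a visible, controlled event rather than a silent drop.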

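Failure Mode: NUMA mismatch
The mbuf pool was created on NUMA node 0 while the NIC sat on NUMA node 1. Cross-NUMA memory accesses added ~100 ns per packet, halving throughput. Diagnosis: numastat -m, and perf stat on a remote-DRAM event (the event name is CPU-specific; check perf list). Fix: create each mempool on the NIC's node, for example by passing rte_eth_dev_socket_id(port_id), and poll the port from an lcore on that same node.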

Failure Mode: Lcore contention on shared ring
Two producer lcores sharing an MPMC ring contend on the CAS that advances the producer head; under high load, throughput dropped by about 40%. Fix: design the pipeline so each ring has a single producer and a single consumer (SPSC mode, no CAS).
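The contention comes from the retry loop that multi-producer rings use to reserve slots. The sketch below shows that loop in isolation with C11 atomics. It is a simplified illustration, not rte_ring code: reserve_slots is a hypothetical name, and the free-space check and the tail update that a real ring also performs are omitted.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Reserve n consecutive slots by advancing a shared producer head.
 * Returns the first reserved index. *retries counts failed CAS attempts,
 * which is the wasted work that grows with the number of producers. */
static uint32_t reserve_slots(_Atomic uint32_t *prod_head, uint32_t n,
                              uint32_t *retries)
{
    uint32_t old = atomic_load_explicit(prod_head, memory_order_relaxed);
    /* On failure the CAS reloads 'old' with the current head, so just retry. */
    while (!atomic_compare_exchange_weak_explicit(prod_head, &old, old + n,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        (*retries)++;
    return old;
}
```

Every failed attempt costs a cache-line transfer between cores. With a single producer the loop never fails because of another writer, which is why the SPSC variant can replace it with a plain store.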

Modern Usage

DPDK 23.11: supports Arm SVE vector paths for packet processing on SVE-capable cores such as AWS Graviton3, alongside the NEON paths used on cores without SVE such as Ampere Altra, with the aim of near-x86 performance
DPDK + eBPF: hybrid deployments use XDP to classify traffic and redirect bulk flows to AF_XDP/DPDK, while control traffic (BGP, ARP) stays in the kernel stack
SmartNIC offload: Nvidia BlueField-3 runs DPDK applications on the DPU's ARM cores, offloading packet processing from the host CPU entirely

Future Directions

DPDK graph framework (rte_graph): structured pipeline model in which packets move through a graph of nodes in batches, improving instruction-cache and data-cache behavior and replacing hand-written packet loops
P4 on DPDK: p4c-dpdk compiles P4 programs to DPDK pipeline stages — declarative packet processing without C
Convergence with XDP: the performance gap between XDP and DPDK is closing; future kernels may offer near parity for most use cases, leaving fewer workloads where DPDK's operational complexity is justified

Exercises

Build and run the dpdk-skeleton example (from DPDK source examples/skeleton). Bind a physical or virtual (virtio) NIC to vfio-pci. Measure maximum throughput with pktgen-dpdk and compare to the same NIC running in kernel mode with iperf3.
Implement a simple L2 forwarding application that reads packets from one DPDK port and writes them to a second port. Add per-lcore packet counters using rte_atomic64_t. Cross-check them against the port statistics reported by dpdk-procinfo.
Simulate mbuf pool exhaustion by setting NUM_MBUFS=128 and sending a traffic burst. Instrument rte_pktmbuf_alloc() to count NULL returns. Implement a drop-and-log path and measure the packet loss rate.
Profile a DPDK forwarding application with perf record -g -- ./dpdk_app. Generate a flame graph. Identify what percentage of cycles are spent in the PMD's rx_burst function vs actual packet processing. What does this imply about the workload?
Configure OVS-DPDK on a hypervisor and connect two virtual machines via OVS bridges. Measure VM-to-VM latency with qperf and compare it to kernel OVS. Analyze the difference using ovs-appctl dpif-netdev/pmd-stats-show.

References

DPDK documentation. DPDK Programmer's Guide. doc.dpdk.org
lib/mbuf/rte_mbuf.h — mbuf structure and API
lib/ring/rte_ring.h — lock-free ring implementation
drivers/net/ixgbe/ — Intel 82599 PMD (reference implementation)
Intel. Data Plane Development Kit (DPDK): Getting Started Guide for Linux.
Emmerich, P. et al. MoonGen: A Scriptable High-Speed Packet Generator. IMC 2015.
Rizzo, L. netmap: a novel framework for fast packet I/O. USENIX ATC 2012. (DPDK contemporary/competitor)
man dpdk-devbind, DPDK release notes for 23.11