05 — DPDK: Data Plane Development Kit
Technical Overview
DPDK (Data Plane Development Kit) is a set of libraries and drivers that allow user-space applications to transmit and receive network packets at wire speed by bypassing the Linux kernel network stack entirely. Introduced by Intel in 2010 and open-sourced in 2013, DPDK achieves performance that the kernel stack cannot match for specialized use cases: it eliminates interrupt overhead, removes context switches, avoids memory copies, and keeps packet data in CPU cache across the entire processing pipeline. The tradeoff is complete abandonment of the kernel's abstractions: no sockets, no iptables, no kernel TCP/IP.
Prerequisites
- Linux network stack and interrupt model (see 01-linux-network-stack.md)
- NUMA architecture, huge pages, CPU affinity
- C programming, pointer arithmetic, cache-line alignment
- Understanding of DMA, PCI device I/O
Core Content
Why Bypass the Kernel?
The Linux network stack imposes several unavoidable overheads per packet:
| Overhead | Source | Approximate cost |
|---|---|---|
| Interrupt handling | NIC fires IRQ per batch | 1–5 µs |
| Context switch (kernel→user) | recv() syscall | 0.5–2 µs |
| sk_buff allocation | kmem_cache_alloc | 100–300 ns |
| Memory copy | Kernel buffer → user buffer | 1 ns/byte |
| Lock contention | Socket lock, netdev queue lock | Variable |
| Cache pollution | Stack code evicts application data | 100–500 ns |
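At 10 Gbps with 64-byte frames, line rate is 14.88 Mpps: each frame occupies 84 bytes on the wire once the preamble, start-of-frame delimiter, and inter-frame gap are counted. That leaves a budget of about 67 ns per packet. The kernel's per-packet overheads listed above can exceed this budget on their own, even when the interrupt cost is amortized over a batch. DPDK is designed to remove these overheads from the data path.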
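The 14.88 Mpps and 67 ns figures follow from simple arithmetic on the Ethernet wire format. The sketch below reproduces them in plain C; the function names are illustrative and not part of DPDK.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame bytes on the wire that are not part of the frame itself:
 * 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap. */
#define ETH_WIRE_OVERHEAD_BYTES 20u

/* Bits one frame occupies on the wire. frame_bytes includes the 4-byte FCS,
 * so a minimum-size Ethernet frame is 64. */
static uint64_t bits_on_wire(uint32_t frame_bytes)
{
    assert(frame_bytes > 0);
    return (uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD_BYTES) * 8u;
}

/* Packets per second at line rate (integer division, rounds down). */
static uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
    return link_bps / bits_on_wire(frame_bytes);
}

/* Time budget per packet in picoseconds; integer math avoids float rounding. */
static uint64_t per_packet_budget_ps(uint64_t link_bps, uint32_t frame_bytes)
{
    return bits_on_wire(frame_bytes) * 1000000000000ull / link_bps;
}
```

Calling line_rate_pps(10000000000ull, 64) gives 14,880,952 packets per second, and per_packet_budget_ps() for the same inputs gives 67,200 ps, which is the 67 ns budget quoted above. With 1518-byte frames the rate drops to about 0.81 Mpps, which is why small-packet workloads are the hard case.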
Poll Mode Driver (PMD)
DPDK's core innovation is the Poll Mode Driver. Instead of the NIC raising an interrupt when packets arrive, a DPDK lcore (logical core) runs a tight busy-poll loop reading from the NIC's RX descriptor ring:
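while (running) {
    /* Poll the RX ring; returns immediately with 0..MAX_BURST packets */
    uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, mbufs, MAX_BURST);
    if (nb_rx == 0)
        continue; /* empty, keep polling */
    for (uint16_t i = 0; i < nb_rx; i++)
        process_packet(mbufs[i]);
    /* Transmit the burst; the PMD may accept fewer packets than requested */
    uint16_t nb_tx = rte_eth_tx_burst(port_id, queue_id, mbufs, nb_rx);
    /* Free the mbufs the TX ring did not accept, or they leak from the pool */
    for (uint16_t i = nb_tx; i < nb_rx; i++)
        rte_pktmbuf_free(mbufs[i]);
}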
This loop runs on a dedicated lcore, typically started with rte_eal_remote_launch(); the EAL pins each lcore thread to its own CPU core at initialization. That core runs at 100% utilization at all times: it is "wasted" on polling even when no packets arrive. This is acceptable in telecom and HFT, where deterministic latency matters more than CPU efficiency.
The NIC is configured with PMD-specific initialization: DPDK takes ownership of the NIC by unbinding it from the kernel driver and binding to a DPDK-compatible driver (vfio-pci or uio_pci_generic):
# Unbind NIC from kernel driver
dpdk-devbind.py --unbind 0000:01:00.0
# Bind to vfio-pci (IOMMU-safe, recommended)
dpdk-devbind.py --bind=vfio-pci 0000:01:00.0
# Verify
dpdk-devbind.py --status
After this, the NIC is invisible to the kernel network stack — no ip link show entries, no ethtool access.
Huge Pages
In its default configuration, DPDK backs all packet memory pools with huge pages. Standard 4KB pages fragment DMA mappings and cause TLB pressure: a packet pool spans far more 4KB pages than the TLB has entries, so at 10 Gbps packet rates a large share of buffer accesses miss the TLB.
DPDK uses 2MB or 1GB huge pages:
- 2MB huge pages: 512x larger TLB coverage per entry than 4KB pages
- 1GB huge pages: 262144x larger coverage, which in practice removes nearly all TLB misses for packet pools
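The coverage factors are just ratios of page sizes, and the same arithmetic shows how many TLB entries a pool needs. A small sketch in plain C, with illustrative names that are not DPDK APIs:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4K (4ull * 1024)
#define PAGE_2M (2ull * 1024 * 1024)
#define PAGE_1G (1024ull * 1024 * 1024)

/* How many times more memory one TLB entry covers compared with a 4 KB page. */
static uint64_t tlb_coverage_factor(uint64_t page_bytes)
{
    return page_bytes / PAGE_4K;
}

/* Number of pages (and so distinct TLB entries) needed to map a pool. */
static uint64_t pages_needed(uint64_t pool_bytes, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    return (pool_bytes + page_bytes - 1) / page_bytes; /* round up */
}
```

For a 2 GB pool, pages_needed() returns 524,288 with 4KB pages, 1,024 with 2MB pages, and 2 with 1GB pages. Typical TLBs hold on the order of a few thousand entries at most, so only the huge-page cases have a chance of staying resident.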
# Reserve 2MB huge pages
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# For NUMA systems, reserve per node
echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 512 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
# Mount hugepages filesystem
mount -t hugetlbfs nodev /mnt/huge
# In DPDK app (EAL arguments)
./dpdk_app -l 0-3 -n 4 --huge-dir /mnt/huge --file-prefix dpdk
Huge pages are allocated at DPDK EAL (Environment Abstraction Layer) initialization and divided into memory pools (rte_mempool) from which mbufs are allocated.
The mbuf Structure
struct rte_mbuf is DPDK's equivalent of the kernel's sk_buff — a packet buffer with metadata. It is designed for cache efficiency: the most-accessed fields fit in the first cache line (64 bytes):
rte_mbuf layout:
+----------------------------------------------------------+
| cache line 0 (64 bytes): hot RX metadata                 |
| buf_addr (8B) | data_off | refcnt | nb_segs | port       |
| ol_flags | packet_type | pkt_len (4B) | data_len (2B)    |
| vlan_tci | hash.rss | ...                                |
+----------------------------------------------------------+
| cache line 1 (64 bytes): TX offload, slow-path fields    |
| tx_offload (l2_len, l3_len, l4_len, tso_segsz) | ...     |
+----------------------------------------------------------+
| headroom (128B default): space for prepending headers    |
+----------------------------------------------------------+
| packet data                                              |
| (up to the data room, RTE_MBUF_DEFAULT_DATAROOM)         |
+----------------------------------------------------------+
The buf_addr field points to the buffer in huge-page memory into which the NIC wrote the packet by DMA, so the application reads the packet in place with no copy.
mbuf pool creation:
struct rte_mempool *mbuf_pool = rte_pktmbuf_pool_create(
"MBUF_POOL",
NUM_MBUFS, /* number of mbufs */
MBUF_CACHE_SIZE, /* per-lcore cache size (reduces contention) */
0, /* private data size */
RTE_MBUF_DEFAULT_BUF_SIZE,
rte_socket_id() /* NUMA node — critical */
);
DPDK Ring: Lock-Free Queue
rte_ring is DPDK's inter-lcore communication primitive — a lock-free circular buffer supporting single-producer/single-consumer (SPSC) and multi-producer/multi-consumer (MPMC) modes:
enqueue head dequeue tail
| |
v v
+---+---+---+---+---+---+---+---+---+---+---+---+
| | | p3| p2| p1| p0| | | | c2| c1| c0|
+---+---+---+---+---+---+---+---+---+---+---+---+
^ ^
| |
prod head cons tail
MPMC mode uses CAS (compare-and-swap) for head/tail updates — no mutex. In SPSC mode (most common in pipeline designs), even the CAS is unnecessary.
/* Create ring */
struct rte_ring *ring = rte_ring_create("PKT_RING", 1024,
rte_socket_id(), RING_F_SP_ENQ | RING_F_SC_DEQ);
/* Enqueue from producer lcore */
rte_ring_enqueue_burst(ring, (void **)mbufs, nb_pkts, NULL);
/* Dequeue from consumer lcore */
uint16_t n = rte_ring_dequeue_burst(ring, (void **)mbufs, MAX_BURST, NULL);
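To see why SPSC mode needs no CAS, consider a minimal single-producer/single-consumer ring written with C11 atomics. This is an illustrative sketch, not rte_ring's implementation: the names are hypothetical, it moves one pointer at a time instead of bursts, and it uses the full ring capacity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SPSC_RING_SIZE 8u /* illustrative size; must be a power of two */
static_assert((SPSC_RING_SIZE & (SPSC_RING_SIZE - 1)) == 0, "size must be a power of two");

struct spsc_ring {
    _Atomic unsigned int head; /* next slot to write; stored only by the producer */
    _Atomic unsigned int tail; /* next slot to read; stored only by the consumer */
    void *slots[SPSC_RING_SIZE];
};

static void spsc_init(struct spsc_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false if the ring is full. */
static bool spsc_enqueue(struct spsc_ring *r, void *obj)
{
    unsigned int head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == SPSC_RING_SIZE)
        return false;
    r->slots[head & (SPSC_RING_SIZE - 1)] = obj;
    /* Release: the slot write is visible before the consumer sees the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false if the ring is empty. */
static bool spsc_dequeue(struct spsc_ring *r, void **obj)
{
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned int head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *obj = r->slots[tail & (SPSC_RING_SIZE - 1)];
    /* Release: the slot read completes before the producer may reuse the slot. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Each index has exactly one writer, so plain loads and stores with acquire/release ordering are enough. With multiple producers, two threads could read the same head and claim the same slot, which is the race that the CAS in MPMC mode resolves.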
DPDK Architecture Diagram
DPDK Application Architecture
================================
NIC (bound to vfio-pci)
| |
| RX rings | TX rings
| |
v ^
| |
+----------+----------+----------+
| | | |
| lcore 1 | lcore 2 | lcore 3 |
| (RX/PMD) | (worker) | (TX/PMD) |
| | | |
| rte_eth | process | rte_eth |
| rx_burst | packet | tx_burst |
| | | |
+--ring--->+--ring--->+----------+
(mbufs) (mbufs)
Memory: /mnt/huge (2MB huge pages)
rte_mempool (mbuf pool, per NUMA node)
Control: EAL (Environment Abstraction Layer)
- huge page mapping
- lcore pinning (sched_setaffinity)
- PCI device init
- RTE_LOG
DPDK Multi-core and NUMA
DPDK's lcore model maps directly to hardware threads. The EAL initializes each lcore and pins it to a physical CPU via sched_setaffinity(). Memory pools are allocated on the local NUMA node — a packet processed by lcore 3 on NUMA node 1 should have its mbuf from NUMA node 1's mempool to avoid cross-NUMA memory traffic.
# Run DPDK app on cores 2-5, using 4 memory channels
./l3fwd -l 2-5 -n 4 -- -p 0x3 --config="(0,0,2),(0,1,3),(1,0,4),(1,1,5)"
# Show per-port statistics of the running DPDK app (procinfo attaches as a secondary process)
./dpdk-procinfo -- --stats
DPDK Use Cases
Telecom VNFs (Virtual Network Functions): 5G User Plane Function (UPF) implementations (Open5GS, free5GC with DPDK UPF) process GTP-U tunneled packets at >10 Gbps with sub-100 µs latency. The kernel stack generally struggles to meet the latency targets that 3GPP sets for URLLC.
Software Load Balancers: L4 load balancers at large operators such as Alibaba and ByteDance use DPDK for the fast path. Not every large-scale balancer does: Facebook's Katran and Cloudflare's Unimog are built on XDP rather than DPDK. A DPDK-based LB can reach 40–100 Gbps on 2-socket servers.
OVS-DPDK: Open vSwitch with DPDK PMD replaces the kernel OVS datapath for high-performance hypervisors. Red Hat OpenStack and Canonical OpenStack both support OVS-DPDK for NFV workloads.
HFT (High-Frequency Trading): Market data feeds at exchanges deliver 1–5 Mpps of UDP multicast. DPDK is used to capture every packet with sub-1 µs jitter. Solarflare Onload (OpenOnload) uses a similar userspace bypass model.
DPDK vs XDP Performance Comparison
| Metric | DPDK | Native XDP | AF_XDP zerocopy |
|---|---|---|---|
| Throughput (64B pkts) | 80–150 Mpps | 20 Mpps | 20 Mpps |
| Latency (pkt processing) | <1 µs | 1–3 µs | 2–5 µs |
| CPU utilization | 100% always | Only under load | Only under load |
| Kernel compatibility | None (NIC unbound) | Full | Partial |
| Driver support requirement | DPDK PMD | XDP-capable driver | XDP + zero-copy support |
| Setup complexity | High | Medium | High |
| Best for | Dedicated appliances | In-kernel fast path | Userspace + kernel hybrid |
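DPDK wins on throughput and latency but requires dedicating cores and NICs. The throughput figures above are indicative only: they depend heavily on core count, NIC, and workload, so compare them per core on your own hardware. XDP is simpler and kernel-integrated, a good fit when a large share of DPDK's performance is enough and the operational complexity is not justified.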
Historical Context
Intel released DPDK in 2010 as a performance proof-of-concept for its 10G NICs (82599, later X520/X710). It was initially internal, then released as open source in 2013 under a BSD license. The DPDK community moved to the Linux Foundation in 2017.
The fundamental tension DPDK represents — kernel bypass vs kernel integration — has driven most of the XDP and AF_XDP development in the Linux kernel. The kernel community's response was: "if userspace needs that performance, let's add hooks instead of bypassing us entirely."
By 2020, many new DPDK deployments were using SR-IOV virtual functions rather than physical functions, which lets DPDK and the kernel stack coexist on the same physical NIC.
Debugging Notes
# Check DPDK app ports and stats
./dpdk-procinfo -- --stats
./dpdk-procinfo -- --stats-reset
# Verify huge pages are properly allocated
cat /proc/meminfo | grep Huge
ls /dev/hugepages/
# Check IOMMU is enabled (required for vfio-pci)
dmesg | grep -i iommu
ls /sys/kernel/iommu_groups/
# Verify NIC binding
dpdk-devbind.py --status | grep -A2 "Network devices"
# DPDK app logging
./dpdk_app --log-level=pmd.net.ixgbe:debug # enable PMD debug log
# Monitor PMD statistics
ethtool -S eth0 # does NOT work after DPDK bind; use rte_eth_stats_get()
Security Implications
- No kernel packet filtering: packets received by a DPDK application never traverse Netfilter, conntrack, or SELinux packet labeling, and seccomp syscall filters see none of the packet I/O. All filtering must be implemented in userspace.
- VFIO and IOMMU: binding via vfio-pci uses the IOMMU to restrict DMA to the application's memory, which prevents rogue DMA attacks. uio_pci_generic has no IOMMU protection.
- Huge page confidentiality: packet data in huge pages is not zeroed between mbuf reuse unless explicitly cleared, so a bug in the free path can expose previous packet contents.
- Privilege: DPDK requires root (or CAP_SYS_RAWIO + CAP_NET_ADMIN + VFIO group access). There is no privilege separation within a DPDK application.
Performance Implications
Critical DPDK tuning checklist:
# 1. Isolate CPUs from kernel scheduler
# In /etc/default/grub: GRUB_CMDLINE_LINUX="isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7"
# 2. Disable CPU frequency scaling on DPDK cores
cpupower frequency-set -c 2-7 -g performance
# 3. Disable hyperthreading (DPDK cores should not share physical core)
# BIOS setting or: echo 0 > /sys/devices/system/cpu/cpuN/online for sibling CPUs
# 4. NUMA-local memory allocation
# Pass --socket-mem argument matching packet pool size per socket
# 5. Prefetch and cache-line alignment in packet processing loop
if (i + 4 < nb_rx) rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + 4], void *)); /* prefetch 4 ahead, staying inside the burst */
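The prefetch pattern in item 5 can be sketched without DPDK headers. The example below assumes GCC or Clang and uses __builtin_prefetch in place of rte_prefetch0; struct pkt is a hypothetical stand-in for an mbuf's packet data, and the "processing" is a trivial read of the first byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_OFFSET 4 /* how many packets ahead to prefetch; tune per workload */

struct pkt {
    uint8_t data[64]; /* hypothetical stand-in for an mbuf's packet data */
};

/* Process a burst while prefetching the data of a later packet, so that its
 * cache line is likely resident by the time the loop reaches it. */
static uint64_t process_burst(struct pkt **pkts, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        /* Guard the index: prefetching past the end would read pkts[] out of bounds. */
        if (i + PREFETCH_OFFSET < n)
            __builtin_prefetch(pkts[i + PREFETCH_OFFSET]->data, 0, 3);
        sum += pkts[i]->data[0]; /* placeholder for real header parsing */
    }
    return sum;
}
```

The prefetch distance is a tuning knob: too short and the line has not arrived yet, too long and it may be evicted before use. Four packets ahead is the value from the checklist, not a universal optimum.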
Failure Modes and Real Incidents
Incident: mbuf pool exhaustion (telecom UPF, 2021)
A traffic spike caused mbufs to be allocated faster than they were freed. With the pool empty, rte_pktmbuf_alloc() returned NULL and the PMD could not refill its RX ring, so incoming packets were dropped silently. Fix: monitor pool utilization (rte_mempool_avail_count()) and the rx_nombuf counter in struct rte_eth_stats, implement backpressure, and increase the pool size.
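A pool-utilization monitor of the kind suggested in the fix can be as simple as a watermark check. The sketch below is plain C with hypothetical names and thresholds; in a real application the avail value would likely come from rte_mempool_avail_count() and total from the pool size passed at creation, sampled from a control thread rather than the data path.

```c
#include <assert.h>

enum pool_state { POOL_OK, POOL_WARN, POOL_CRITICAL };

/* Percentage of the pool currently in use (0..100, rounded down). */
static unsigned int pool_in_use_pct(unsigned int avail, unsigned int total)
{
    assert(total > 0 && avail <= total);
    return (unsigned int)(((unsigned long long)(total - avail) * 100u) / total);
}

/* Classify the pool against two watermarks, both given as percent in use.
 * The thresholds are assumptions to be tuned per deployment. */
static enum pool_state classify_pool(unsigned int avail, unsigned int total,
                                     unsigned int warn_pct, unsigned int crit_pct)
{
    unsigned int used = pool_in_use_pct(avail, total);
    if (used >= crit_pct)
        return POOL_CRITICAL; /* e.g. start shedding load or apply backpressure */
    if (used >= warn_pct)
        return POOL_WARN;     /* e.g. raise an alert before drops begin */
    return POOL_OK;
}
```

For an 8192-mbuf pool with watermarks at 80% and 95%, 1000 free mbufs classifies as POOL_WARN and 100 free mbufs as POOL_CRITICAL. Alerting on the warning level gives operators time to react before allocation starts failing.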
Failure Mode: NUMA mismatch
mbuf pool created on NUMA node 0, NIC on NUMA node 1. Cross-NUMA memory accesses added ~100 ns per packet, halving throughput. Diagnosis: numastat -m, plus perf stat with the CPU's remote-DRAM access events (event names vary by microarchitecture). Fix: create each port's mempool on the NIC's own node, for example with rte_eth_dev_socket_id(port_id), and poll that port from lcores on the same node so that rte_socket_id() matches.
Failure Mode: Lcore contention on shared ring
Two producer lcores sharing an MPMC ring cause CAS contention under high load, and throughput drops 40%. Fix: design the pipeline so each ring has a single producer (SPSC mode, no CAS).
Modern Usage
- DPDK 23.11: supports ARM SVE for vector packet processing, enabling DPDK on Graviton3 and Ampere Altra at near-x86 performance
- DPDK + eBPF: hybrid deployments use XDP to classify traffic and redirect bulk flows to AF_XDP/DPDK, while control traffic (BGP, ARP) stays in the kernel stack
- SmartNIC offload: Nvidia BlueField-3 runs DPDK applications on the DPU's ARM cores, offloading packet processing from the host CPU entirely
Future Directions
- DPDK graph framework (rte_graph): structured pipeline model with automatic vectorization and cache optimization, replacing hand-written packet loops
- P4 on DPDK: p4c-dpdk compiles P4 programs to DPDK pipeline stages, giving declarative packet processing without C
- Convergence with XDP: the performance gap between XDP and DPDK is closing; future kernels may bring XDP close to parity for most use cases, which would leave fewer workloads where DPDK's operational complexity is justified
Exercises
- Build and run the dpdk-skeleton example (from DPDK source examples/skeleton). Bind a physical or virtual (virtio) NIC to vfio-pci. Measure maximum throughput with pktgen-dpdk and compare to the same NIC running in kernel mode with iperf3.
- Implement a simple L2 forwarding application that reads packets from one DPDK port and writes them to a second port. Add per-lcore packet counters using rte_atomic64_t. Monitor them with dpdk-procinfo.
- Simulate mbuf pool exhaustion by setting NUM_MBUFS=128 and sending a traffic burst. Instrument rte_pktmbuf_alloc() to count NULL returns. Implement a drop-and-log path and measure the packet loss rate.
- Profile a DPDK forwarding application with perf record -g -- ./dpdk_app. Generate a flame graph. Identify what percentage of cycles are spent in the PMD's rx_burst function vs actual packet processing. What does this imply about the workload?
- Configure OVS-DPDK on a hypervisor and connect two virtual machines via OVS bridges. Measure VM-to-VM latency with qperf and compare it to kernel OVS. Analyze the difference using ovs-appctl dpif-netdev/pmd-stats-show.
References
- DPDK documentation. DPDK Programmer's Guide. doc.dpdk.org
- lib/mbuf/rte_mbuf.h: mbuf structure and API
- lib/ring/rte_ring.h: lock-free ring implementation
- drivers/net/ixgbe/: Intel 82599 PMD (reference implementation)
- Intel. Data Plane Development Kit (DPDK): Getting Started Guide for Linux.
- Emmerich, P. et al. MoonGen: A Scriptable High-Speed Packet Generator. IMC 2015.
- Rizzo, L. netmap: a novel framework for fast packet I/O. USENIX ATC 2012. (DPDK contemporary/competitor)
- man dpdk-devbind, DPDK release notes for 23.11