04 — eBPF Networking
Technical Overview
eBPF (extended Berkeley Packet Filter) transforms the Linux kernel into a programmable platform. Rather than modifying kernel source and shipping a new kernel version, operators and developers load verified bytecode programs at runtime that execute in privileged kernel context with safety guarantees. For networking, eBPF programs can intercept, inspect, modify, and drop packets at multiple points — from before the network stack allocates memory all the way down to individual socket calls. Combined with the BPF map infrastructure for shared state, eBPF has become the foundation of production-grade DDoS mitigation, Kubernetes networking, and service mesh observability.
Prerequisites
- Linux network stack layers (see 01-linux-network-stack.md)
- Netfilter and iptables concepts (see 03-netfilter-and-iptables.md)
- Basic C programming; familiarity with bpftool, bpftrace, ip link, tc
- Conceptual understanding of LLVM/BPF compiler toolchain
Core Content
eBPF Program Types for Networking
eBPF programs are typed — the kernel verifier enforces which helper functions and context fields each type can access.
| Program type | Hook location | Use case |
|---|---|---|
| XDP | Before sk_buff, at NIC driver | DDoS mitigation, load balancing |
| TC ingress/egress | After GRO, before/after routing | Deep packet inspection, marking |
| sock_ops | Per-TCP-connection events | RTT measurement, buffer tuning |
| sk_skb | sockmap stream parsing | Transparent proxy |
| sk_msg | sendmsg path | Application-layer inspection |
| cgroup_skb | Per-cgroup ingress/egress | Container policy |
| socket filter | Per-socket | tcpdump-style filtering |
| flow_dissector | Packet header parsing | RSS hash customization |
XDP: eXpress Data Path
XDP is the earliest possible hook in the receive path — it runs inside the driver's NAPI poll function, before an sk_buff is allocated. This makes it the fastest hook:
NIC DMA ring
|
| (NIC fills descriptor with packet pointer)
v
NAPI poll()
|
v
[XDP program executes here — raw packet data, no sk_buff]
|
+--- XDP_DROP → discard (packet never reaches the network stack)
|
+--- XDP_TX → transmit back out same NIC (hairpin)
|
+--- XDP_REDIRECT → send to another NIC/CPU/AF_XDP socket
|
+--- XDP_PASS → continue to normal stack (sk_buff allocated)
|
v
netif_receive_skb() → normal stack
XDP operates on raw packet memory via struct xdp_md:
struct xdp_md {
    __u32 data;            /* pointer to start of packet data */
    __u32 data_end;        /* pointer to end of packet data */
    __u32 data_meta;       /* metadata area before packet */
    __u32 ingress_ifindex; /* ifindex of the receiving device */
    __u32 rx_queue_index;  /* RX queue the packet arrived on */
    __u32 egress_ifindex;  /* egress device; only valid for programs attached to a devmap entry */
};
A minimal XDP DDoS filter:
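#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Blocklist keyed by IPv4 source address (network byte order), filled from
 * userspace. The map size is an illustrative assumption. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);
    __type(value, __u32);
} blocklist_map SEC(".maps");

SEC("xdp")
int xdp_drop_syn(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;
    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;
    if (ip->ihl < 5) return XDP_PASS; /* malformed header length */
    if (ip->protocol != IPPROTO_TCP) return XDP_PASS;
    struct tcphdr *tcp = (void *)ip + (ip->ihl * 4);
    if ((void *)(tcp + 1) > data_end) return XDP_PASS;
    /* Drop SYN packets from blocked IPs */
    if (tcp->syn && !tcp->ack) {
        __u32 src = ip->saddr;
        __u32 *blocked = bpf_map_lookup_elem(&blocklist_map, &src);
        if (blocked) return XDP_DROP;
    }
    return XDP_PASS;
}
char LICENSE[] SEC("license") = "GPL";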
XDP operation modes:
- Native XDP (xdpdrv): runs inside the driver's NAPI poll. Requires driver support (mlx4, mlx5, i40e, ixgbe, veth, tun). Fastest of the host-CPU modes.
- Generic XDP (xdpgeneric): runs in the kernel after sk_buff allocation, so no driver support is needed. Several times slower than native mode (see the throughput table under Performance Implications) but universally available.
- Offloaded XDP (xdpoffload): runs on SmartNIC firmware (Netronome). Offloaded packets consume no host CPU cycles.
Load XDP program:
ip link set dev eth0 xdp obj xdp_prog.o sec xdp
ip link set dev eth0 xdpgeneric obj xdp_prog.o sec xdp # generic mode
ip link show eth0 | grep xdp # verify attachment
# Using bpftool
bpftool prog load xdp_prog.o /sys/fs/bpf/myprog
bpftool net attach xdp id <prog_id> dev eth0
Cloudflare XDP production numbers: roughly 20 million packets per second dropped per CPU core on commodity x86 hardware using XDP. Before XDP, the same filtering through iptables saturated a core at about 1.5 Mpps.
TC eBPF: Traffic Control
TC (Traffic Control) programs attach at the tc qdisc layer, after sk_buff allocation, which gives them access to the full packet metadata. The clsact qdisc provides both an ingress and an egress hook, and the program's return value is a TC_ACT_* verdict:
Ingress path:
netif_receive_skb()
|
v
[TC ingress BPF — access to sk_buff]
|
+--- TC_ACT_SHOT → drop
+--- TC_ACT_PASS → continue to stack
+--- TC_ACT_REDIRECT → redirect to another interface/socket
Egress path:
ip_output() → qdisc
|
v
[TC egress BPF]
|
v
dev_hard_start_xmit()
Attach TC programs:
# Create clsact qdisc (required for TC BPF)
tc qdisc add dev eth0 clsact
# Attach BPF program to ingress
tc filter add dev eth0 ingress bpf da obj tc_prog.o sec classifier
# List TC programs
tc filter show dev eth0 ingress
TC eBPF is used by Cilium for Kubernetes pod-to-pod traffic: every pod's veth interface has TC ingress/egress programs that enforce NetworkPolicy, load-balance services, and collect flow metrics.
BPF Maps: Shared State
BPF maps are kernel data structures accessible from both BPF programs and userspace. They are the communication channel between the fast path (BPF program) and the control plane (userspace daemon):
| Map type | Lookup | Use case |
|---|---|---|
| BPF_MAP_TYPE_HASH | O(1) average | Connection tracking, blocklist |
| BPF_MAP_TYPE_ARRAY | O(1) | Counters, config |
| BPF_MAP_TYPE_LPM_TRIE | O(key bits), independent of entry count | IP prefix routing, GeoIP |
| BPF_MAP_TYPE_PERCPU_HASH | O(1), lock-free | Per-CPU counters |
| BPF_MAP_TYPE_DEVMAP | O(1) | XDP_REDIRECT target list |
| BPF_MAP_TYPE_SOCKMAP | O(1) | Socket redirection |
| BPF_MAP_TYPE_RINGBUF | N/A | High-perf event logging |
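/* Key layout for IPv4 prefixes: prefix length, then the address.
 * Assumed to be defined by the program itself; the kernel UAPI header
 * only provides the generic struct bpf_lpm_trie_key. */
struct bpf_lpm_trie_key_ipv4 {
    __u32 prefixlen; /* number of significant bits, 0..32 */
    __u32 addr;      /* IPv4 address in network byte order */
};
/* Map definition in BPF program */
struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 1024);
    __type(key, struct bpf_lpm_trie_key_ipv4);
    __type(value, __u32);
    __uint(map_flags, BPF_F_NO_PREALLOC); /* mandatory for LPM tries */
} ip_prefix_map SEC(".maps");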
Userspace map interaction:
bpftool map list
bpftool map dump id <map_id>
bpftool map update id <map_id> key 1 2 3 4 value 0 0 0 1
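The longest-prefix-match semantics behind BPF_MAP_TYPE_LPM_TRIE can be illustrated in plain userspace C. The sketch below is not the kernel's trie: it is a linear scan over a small table that returns the same answer a trie lookup would, namely the most specific prefix covering the address. The names `lpm_entry`, `lpm_mask`, and `lpm_lookup` are hypothetical, and addresses are held in host byte order for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: IPv4 prefix (host byte order), length, value. */
struct lpm_entry {
    uint32_t prefix;
    uint8_t  prefixlen;   /* must be 0..32 */
    uint32_t value;
};

/* Netmask for a prefix length. /0 is special-cased because shifting a
 * 32-bit value by 32 bits is undefined behaviour in C. */
static uint32_t lpm_mask(uint8_t prefixlen)
{
    assert(prefixlen <= 32);
    return prefixlen == 0 ? 0u : 0xFFFFFFFFu << (32 - prefixlen);
}

/* Return the entry with the longest prefix that covers addr, or NULL. */
static const struct lpm_entry *lpm_lookup(const struct lpm_entry *tbl,
                                          size_t n, uint32_t addr)
{
    const struct lpm_entry *best = NULL;

    for (size_t i = 0; i < n; i++) {
        uint32_t mask = lpm_mask(tbl[i].prefixlen);

        if ((addr & mask) != (tbl[i].prefix & mask))
            continue;                       /* prefix does not cover addr */
        if (best == NULL || tbl[i].prefixlen > best->prefixlen)
            best = &tbl[i];                 /* more specific match wins */
    }
    return best;
}
```

With 10.0.0.0/8 and 10.1.0.0/16 both present, a lookup for 10.1.2.3 selects the /16 entry, while 10.2.3.4 falls back to the /8. The kernel map reaches the same result by walking the key bit by bit instead of scanning every entry.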
Cilium: Kubernetes Networking with eBPF
Cilium replaces kube-proxy's iptables-based service load balancing with BPF maps:
- Service lookup: cgroup socket-address BPF programs (the connect4/connect6 hooks) intercept connect() calls and rewrite the destination to a healthy backend pod, doing DNAT at the socket level before any packet is sent. No iptables, no conntrack for most traffic.
- NetworkPolicy: XDP and TC programs enforce L3/L4 policy at wire speed, with L7 (HTTP, gRPC) filtering delegated to an Envoy proxy.
- Observability: BPF event maps (BPF_MAP_TYPE_RINGBUF or the older BPF_MAP_TYPE_PERF_EVENT_ARRAY) feed per-flow telemetry to the Hubble observability layer without kernel modifications.
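At 10,000 services, kube-proxy in iptables mode requires ~100,000 rules, and matching walks rule chains sequentially. Cilium resolves the same service with an O(1) BPF map lookup, so per-connection cost stays flat as the number of services grows.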
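The socket-level service translation described above can be sketched in userspace C. This is a simplified model under stated assumptions, not Cilium's implementation: the table size, the `endpoint` and `service` types, the hash function, and the `svc_add` / `svc_translate` helpers are all hypothetical. It shows the shape of the operation: hash the destination, find the service entry, and overwrite the destination with a backend before any packet exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SVC_SLOTS    64u   /* hypothetical table size; must be a power of two */
#define MAX_BACKENDS 4u

struct endpoint { uint32_t ip; uint16_t port; };

struct service {
    bool            used;
    struct endpoint frontend;                /* virtual IP and port */
    uint32_t        n_backends;
    struct endpoint backends[MAX_BACKENDS];
};

static struct service svc_table[SVC_SLOTS];  /* zero-initialised: all slots free */

/* Simple integer mixer; any well-distributed hash would do here. */
static uint32_t svc_hash(struct endpoint e)
{
    uint32_t h = e.ip ^ ((uint32_t)e.port << 16);
    h ^= h >> 16;
    h *= 0x45d9f3bu;
    h ^= h >> 16;
    return h & (SVC_SLOTS - 1);
}

/* Insert a service using linear probing. Returns false if the table is full. */
static bool svc_add(struct endpoint fe, const struct endpoint *be, uint32_t n)
{
    assert(n > 0 && n <= MAX_BACKENDS);
    for (uint32_t i = 0; i < SVC_SLOTS; i++) {
        struct service *s = &svc_table[(svc_hash(fe) + i) & (SVC_SLOTS - 1)];
        if (s->used)
            continue;
        s->used = true;
        s->frontend = fe;
        s->n_backends = n;
        for (uint32_t j = 0; j < n; j++)
            s->backends[j] = be[j];
        return true;
    }
    return false;
}

/* Rewrite *dst in place if it names a known service, the way a connect()
 * hook would. flow_seed picks the backend. Returns false if no service
 * matches, in which case *dst is left untouched. */
static bool svc_translate(struct endpoint *dst, uint32_t flow_seed)
{
    uint32_t start = svc_hash(*dst);

    for (uint32_t i = 0; i < SVC_SLOTS; i++) {
        const struct service *s = &svc_table[(start + i) & (SVC_SLOTS - 1)];
        if (!s->used)
            return false;                    /* empty slot ends the probe */
        if (s->frontend.ip == dst->ip && s->frontend.port == dst->port) {
            *dst = s->backends[flow_seed % s->n_backends];
            return true;
        }
    }
    return false;
}
```

The lookup cost depends on the probe length, not on how many services are registered, which is the property the iptables comparison turns on. A real datapath would also track backend health and keep the selection stable per connection.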
cgroup BPF
cgroup BPF programs (BPF_PROG_TYPE_CGROUP_SKB, BPF_PROG_TYPE_CGROUP_SOCK_ADDR) attach to cgroup v2 hierarchies, enabling per-container network policy without network namespaces or iptables:
# Attach ingress filter to container cgroup
bpftool cgroup attach /sys/fs/cgroup/docker/<id> ingress id <prog_id>
# List attached programs
bpftool cgroup tree /sys/fs/cgroup/docker/<id>
Kubernetes networking components can use this for per-pod bandwidth limits and L4 policy enforcement that is keyed on the cgroup rather than on the network namespace.
AF_XDP: Kernel Bypass
AF_XDP sockets allow userspace applications to receive and send packets via XDP without going through the kernel network stack. The XDP program redirects packets directly to an AF_XDP socket's receive ring, which is mapped into userspace:
NIC DMA ring ←→ XDP program ←→ AF_XDP socket (UMEM ring)
↑
mmap'd into userspace
(zero copy — no sk_buff)
/* AF_XDP socket creation (needs <sys/socket.h>, <linux/if_xdp.h>, <stdint.h>) */
int fd = socket(AF_XDP, SOCK_RAW, 0);
struct xdp_umem_reg umem = {
    .addr = (__u64)(uintptr_t)buf, /* addr is a __u64, so the pointer needs a cast */
    .len = buf_size,
    .chunk_size = 4096,
};
setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &umem, sizeof(umem));
/* ... setup fill/completion/rx/tx rings ... */
With the XDP_ZEROCOPY bind flag, the NIC DMAs packets directly into the UMEM buffer, with no intermediate kernel copy; this requires driver support. Used by:
- DPDK PMD via veth+AF_XDP (driver-independent kernel bypass)
- OVS-AF_XDP (Open vSwitch zero-copy)
- Snort/Suricata IDS for 10Gbps packet inspection
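The fill, completion, RX, and TX rings mentioned above all follow the same single-producer/single-consumer pattern: two free-running indices and a power-of-two array of descriptors. The sketch below models that pattern in ordinary single-threaded C. It is an illustration under simplifying assumptions, not the kernel's ring layout: `desc_ring`, `RING_SIZE`, and the helper names are hypothetical, and the memory barriers a real kernel/userspace shared ring needs are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u   /* hypothetical; must be a power of two */

/* Simplified single-producer/single-consumer descriptor ring. */
struct desc_ring {
    uint32_t producer;          /* free-running write index */
    uint32_t consumer;          /* free-running read index */
    uint64_t desc[RING_SIZE];   /* UMEM frame addresses */
};

static void ring_init(struct desc_ring *r)
{
    /* Masking with RING_SIZE - 1 only works for power-of-two sizes. */
    assert((RING_SIZE & (RING_SIZE - 1)) == 0);
    r->producer = 0;
    r->consumer = 0;
}

/* Enqueue one frame address. Returns false if the ring is full. */
static bool ring_produce(struct desc_ring *r, uint64_t addr)
{
    if (r->producer - r->consumer == RING_SIZE)
        return false;
    r->desc[r->producer & (RING_SIZE - 1)] = addr;
    r->producer++;              /* unsigned wrap-around is intentional */
    return true;
}

/* Dequeue one frame address. Returns false if the ring is empty. */
static bool ring_consume(struct desc_ring *r, uint64_t *addr)
{
    if (r->consumer == r->producer)
        return false;
    *addr = r->desc[r->consumer & (RING_SIZE - 1)];
    r->consumer++;
    return true;
}
```

Because only frame addresses move through the ring, handing a packet between kernel and application is an index update rather than a copy. In a real AF_XDP setup, libraries such as libxdp wrap this bookkeeping and add the required acquire/release ordering.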
Historical Context
BPF was invented by McCanne and Jacobson (1992) for tcpdump packet filtering — a simple register machine that evaluated filter programs in kernel context. Linux adopted it in 1997.
eBPF (extended BPF) was introduced in Linux 3.15 (2014) by Alexei Starovoitov, expanding the register set and adding a verifier, JIT compiler, and map infrastructure. XDP was added in Linux 4.8 (2016) after a collaborative design by Red Hat's Jesper Dangaard Brouer and others.
Cilium launched in 2017 using eBPF to replace iptables in Kubernetes, and was among the first production deployments to demonstrate that eBPF could subsume Netfilter for a real workload.
Production Examples
DDoS mitigation (Cloudflare, 2017+):
# Load XDP program on all bonded interfaces
for iface in bond0 eth0 eth1; do
    ip link set dev "$iface" xdp obj ddos_filter.o sec xdp
done
# Monitor drop counters via BPF map (bpftool prints IDs as "12:", so split on the colon)
bpftool map dump id "$(bpftool map list | grep drop_count | awk -F: '{print $1}')"
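Per-thread send-to-receive latency with bpftrace (a rough proxy for per-flow latency when each thread serves one connection at a time):
bpftrace -e '
kprobe:tcp_sendmsg { @send_time[tid] = nsecs; }
kretprobe:tcp_recvmsg /(int64)retval > 0 && @send_time[tid]/ {
    $lat = nsecs - @send_time[tid];
    @latency = hist($lat);
    delete(@send_time[tid]);
}'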
Debugging Notes
# List all loaded BPF programs
bpftool prog list
bpftool prog show id <id>
bpftool prog dump xlated id <id> # dump BPF instructions
bpftool prog dump jited id <id> # dump JIT-compiled x86
# Show verifier errors at load time (add -d for the full libbpf and verifier log)
bpftool prog load bad.o /sys/fs/bpf/bad 2>&1 | head -30
# Monitor XDP drop rate (counter names are driver-specific; adjust the pattern)
watch -n1 'ethtool -S eth0 | grep xdp_drop'
# XDP-enabled interface status
ip link show dev eth0 | grep xdp
# TC filter status
tc filter show dev eth0 ingress
tc filter show dev eth0 egress
# BPF map statistics
bpftool map dump id <id> | wc -l # rough entry count; output layout varies with BTF and JSON modes
Security Implications
- BPF verifier: prevents unbounded loops, out-of-bounds memory access, and uninitialized reads. Programs that fail verification are rejected. The verifier itself has had bugs (CVE-2021-3490, CVE-2021-31440) that allowed privilege escalation.
- CAP_BPF (Linux 5.8): a new capability that separates BPF permissions from CAP_SYS_ADMIN. Attaching XDP/TC programs still requires CAP_NET_ADMIN in addition.
- Unprivileged BPF: commonly disabled on production systems (kernel.unprivileged_bpf_disabled=1) because the attack surface of the verifier is large.
- Side channels: BPF programs can serve as gadgets for speculative-execution side-channel attacks (Spectre). The verifier adds speculation barriers where needed.
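The verifier's out-of-bounds rule comes down to one discipline: prove that a range lies inside the packet before reading from it. The userspace sketch below applies the same discipline to a raw Ethernet frame so it can be compiled and tested without a kernel. The function name `ipv4_protocol` and the two length constants are hypothetical; a BPF program would express the same checks as comparisons against data_end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN  14u   /* dst MAC + src MAC + EtherType */
#define IPV4_MIN_LEN 20u   /* IPv4 header without options */

/* Return the IPv4 protocol number of an Ethernet frame, or -1 if the frame
 * is truncated or is not IPv4. Every read is preceded by a length check. */
static int ipv4_protocol(const uint8_t *data, const uint8_t *data_end)
{
    assert(data != NULL && data_end != NULL && data <= data_end);
    size_t len = (size_t)(data_end - data);

    if (len < ETH_HDR_LEN)
        return -1;                              /* Ethernet header incomplete */
    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);
    if (ethertype != 0x0800)
        return -1;                              /* not IPv4 */

    if (len < ETH_HDR_LEN + IPV4_MIN_LEN)
        return -1;                              /* IPv4 header incomplete */
    const uint8_t *ip = data + ETH_HDR_LEN;
    if ((ip[0] >> 4) != 4 || (ip[0] & 0x0F) < 5)
        return -1;                              /* bad version or header length */
    return ip[9];                               /* protocol field */
}
```

Removing either length check would leave a read that is only safe for well-formed input. In userspace that is a latent bug; in BPF the verifier refuses to load the program at all.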
Performance Implications
| Path | Throughput (single core, x86) |
|---|---|
| Native XDP drop | ~20 Mpps |
| Generic XDP drop | ~5 Mpps |
| iptables DROP | ~1.5 Mpps |
| TC BPF forward | ~5 Mpps |
| AF_XDP zerocopy | ~20 Mpps userspace |
XDP eliminates sk_buff allocation (~200 ns per packet), cache pollution from LRU flushing, and softirq overhead for dropped packets. For DDoS mitigation where most packets should be dropped, this is decisive.
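The throughput figures translate directly into a per-packet time budget, which makes the ~200 ns allocation cost easier to put in context. The helpers below are a small arithmetic sketch; the function names are hypothetical, and the 10 GbE minimum-frame case is an added illustration rather than a figure from the table.

```c
#include <assert.h>
#include <stdint.h>

/* Time available per packet, in nanoseconds, at a given packet rate. */
static double ns_per_packet(double packets_per_second)
{
    assert(packets_per_second > 0.0);
    return 1e9 / packets_per_second;
}

/* Maximum packet rate of an Ethernet link for a given frame size.
 * Each frame occupies 20 extra bytes on the wire: 7-byte preamble,
 * 1-byte start-of-frame delimiter, and 12-byte inter-frame gap. */
static double line_rate_pps(double link_bits_per_second, uint32_t frame_bytes)
{
    assert(frame_bytes > 0);
    return link_bits_per_second / ((frame_bytes + 20u) * 8.0);
}
```

At 20 Mpps, ns_per_packet gives 50 ns per packet, a quarter of the quoted sk_buff allocation cost alone; at 1.5 Mpps the budget is about 667 ns. For 64-byte frames on a 10 Gbit/s link, line_rate_pps gives roughly 14.88 Mpps, so a single core dropping at native XDP rates can keep up with a minimum-size-frame flood on such a link.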
Failure Modes and Real Incidents
Incident: XDP program crash loop (2020, financial exchange)
A bug in a custom XDP program dereferenced past data_end. The verifier should have caught it, but a verifier escape in an older kernel version allowed the load. Packet processing halted on affected CPUs until the program was unloaded. Fix: pin programs via bpftool prog pin and test with xdp_rxq_info before production rollout.
Failure Mode: map size exhaustion
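A conntrack BPF map with max_entries=100000 fills when a DDoS presents 100K distinct source IPs. Further inserts fail with E2BIG; the XDP program returns XDP_PASS for unknown IPs (fail-open). Monitor: bpftool map show reports max_entries and memory use, but occupancy has to be counted from a dump or tracked by the program, for example with an insert-failure counter. Consider BPF_MAP_TYPE_LRU_HASH so that stale entries are evicted instead of new ones being rejected; BPF_F_NO_PREALLOC lowers the memory cost of large sparse maps but does not raise the entry limit.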
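The difference between fail-open and fail-closed behaviour when a tracking map is full can be made concrete with a small userspace model. The table below is a hypothetical stand-in for a BPF hash map, deliberately tiny so that exhaustion is easy to trigger; it returns -E2BIG on a full table, which is what the kernel's hash map update returns once max_entries is reached. The names `flow_table`, `flow_insert`, and `track_packet` are assumptions for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ENTRIES 4u   /* hypothetical; tiny so exhaustion is easy to trigger */

enum verdict { VERDICT_PASS, VERDICT_DROP };

struct flow_table {
    uint32_t keys[MAX_ENTRIES];
    uint32_t count;
};

/* Mimics hash-map update semantics: 0 on success (or if the key is already
 * present), -E2BIG when the table is full. */
static int flow_insert(struct flow_table *t, uint32_t src_ip)
{
    assert(t->count <= MAX_ENTRIES);
    for (uint32_t i = 0; i < t->count; i++)
        if (t->keys[i] == src_ip)
            return 0;
    if (t->count == MAX_ENTRIES)
        return -E2BIG;
    t->keys[t->count++] = src_ip;
    return 0;
}

/* fail_open decides what happens to packets that cannot be tracked. */
static enum verdict track_packet(struct flow_table *t, uint32_t src_ip,
                                 int fail_open)
{
    if (flow_insert(t, src_ip) == 0)
        return VERDICT_PASS;
    return fail_open ? VERDICT_PASS : VERDICT_DROP;
}
```

Fail-open keeps legitimate new clients working during exhaustion but also lets untracked attack traffic through; fail-closed does the opposite. Either way the choice should be explicit in the program rather than a side effect of an unchecked return value.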
Modern Usage
- Cilium 1.14+: fully eBPF-based service mesh, replacing Envoy sidecar for L4 with BPF and using Envoy only for L7
- Meta (Facebook): uses XDP for L4 load balancing in Katran, handling all ingress traffic at dozens of datacenters
- Cloudflare: entire DDoS mitigation pipeline (SYN flood, UDP flood, application-layer rate limiting) runs on XDP
- Kernel 6.x (since 6.4): BPF_PROG_TYPE_NETFILTER lets eBPF programs attach as Netfilter hooks, blurring the line between eBPF and nftables
Future Directions
- BPF struct_ops for TCP congestion control: custom CC algorithms loadable as BPF programs without kernel recompilation (Linux 5.6)
- XDP multi-buffer: support for jumbo frames and header-data split (Linux 5.18)
- BPF token: delegated permissions that let a privileged manager grant containers the ability to load BPF programs without holding CAP_BPF themselves; the programs are still verified at load time
- HW offload expansion: more NICs supporting XDP offload; FPGA-based SmartNICs running arbitrary BPF programs at line rate
Exercises
- Write an XDP program that counts packets per source IP and drops any IP that exceeds 10,000 packets in a 1-second window. Use a BPF_MAP_TYPE_LRU_HASH for the counter. Load it in generic mode and test with hping3.
- Attach a TC BPF program to a veth pair connecting two network namespaces. The program should add a 10ms delay to every 10th packet (latency simulation) without using tc netem. Observe the effect with ping and explain the implementation challenge.
- Create an AF_XDP socket in C. Receive 1 million packets from a loopback XDP program, measure throughput, and compare it to the same workload using a regular SOCK_DGRAM socket.
- Using Cilium's Hubble CLI (hubble observe), identify the top 5 flows by byte volume in a Kubernetes cluster. Trace how Cilium's TC BPF programs generate this telemetry from the BPF ringbuf map.
- Study the BPF verifier output for a program that accesses a packet field without bounds checking. Add the necessary bounds check, reload, and verify the verifier accepts it. Explain exactly which verifier rule was violated.
References
- net/core/filter.c: BPF socket filter, XDP execution
- net/sched/cls_bpf.c: TC BPF classifier
- kernel/bpf/verifier.c: BPF verifier
- include/uapi/linux/bpf.h: BPF program types, map types, helper functions
- Starovoitov, A. & Borkmann, D. eBPF: in-kernel virtual machine. Linux Plumbers 2014.
- Høiland-Jørgensen, T., Brouer, J.D., et al. The eXpress Data Path. ACM CoNEXT 2018.
- Cilium documentation. eBPF & XDP Reference Guide. docs.cilium.io
- Cloudflare blog. L4Drop: XDP DDoS Mitigations. 2018.
- man 2 bpf, bpftool-prog(8), bpftool-map(8), bpftool-net(8)
- Documentation/bpf/: kernel BPF documentation tree