04 — eBPF Networking

Technical Overview

eBPF (extended Berkeley Packet Filter) turns the Linux kernel into a programmable platform. Rather than modifying kernel source and shipping a new kernel, operators and developers load bytecode programs at runtime; the kernel verifies each program before it runs in privileged kernel context. For networking, eBPF programs can intercept, inspect, modify, and drop packets at multiple points, from the driver receive path (before the network stack allocates an sk_buff) up to individual socket calls. Combined with BPF maps for shared state, eBPF has become the foundation of production-grade DDoS mitigation, Kubernetes networking, and service mesh observability.

Prerequisites

Linux network stack layers (see 01-linux-network-stack.md)
Netfilter and iptables concepts (see 03-netfilter-and-iptables.md)
Basic C programming; familiarity with bpftool, bpftrace, ip link, tc
Conceptual understanding of the LLVM/BPF compiler toolchain

Core Content

eBPF Program Types for Networking

eBPF programs are typed — the kernel verifier enforces which helper functions and context fields each type can access.

Program type	Hook location	Use case
`XDP`	Before sk_buff, at NIC driver	DDoS mitigation, load balancing
`TC ingress/egress`	After GRO, before/after routing	Deep packet inspection, marking
`sock_ops`	Per-TCP-connection events	RTT measurement, buffer tuning
`sk_skb`	sockmap stream parsing	Transparent proxy
`sk_msg`	sendmsg path	Application-layer inspection
`cgroup_skb`	Per-cgroup ingress/egress	Container policy
`socket filter`	Per-socket	tcpdump-style filtering
`flow_dissector`	Packet header parsing	Custom protocol parsing for software flow hashing

XDP: eXpress Data Path

XDP is the earliest possible hook in the receive path — it runs inside the driver's NAPI poll function, before an sk_buff is allocated. This makes it the fastest hook:

NIC DMA ring
     |
     | (NIC fills descriptor with packet pointer)
     v
NAPI poll()
     |
     v
[XDP program executes here — raw packet data, no sk_buff]
     |
     +--- XDP_DROP    → discard (packet never reaches the network stack)
     |
     +--- XDP_TX      → transmit back out same NIC (hairpin)
     |
     +--- XDP_REDIRECT → send to another NIC/CPU/AF_XDP socket
     |
     +--- XDP_PASS    → continue to normal stack (sk_buff allocated)
     |
     v
netif_receive_skb() → normal stack

XDP operates on raw packet memory via struct xdp_md:

struct xdp_md {
    __u32 data;           /* start of packet data (verifier turns it into a pointer) */
    __u32 data_end;       /* end of packet data                                      */
    __u32 data_meta;      /* metadata area before packet                             */
    __u32 ingress_ifindex;
    __u32 rx_queue_index;
    __u32 egress_ifindex; /* only valid in devmap (egress) programs                  */
};

A minimal XDP DDoS filter:

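#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* Blocklist keyed by IPv4 source address (network byte order).
 * The map type and size are assumptions; the control plane fills it. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, __u32);
    __type(value, __u32);
} blocklist_map SEC(".maps");

SEC("xdp")
int xdp_drop_syn(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Every header must be bounds-checked before the verifier allows access */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end) return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP)) return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end) return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP) return XDP_PASS;
    if (ip->ihl < 5) return XDP_PASS;   /* malformed header length */

    /* Skip IP options: ihl counts 32-bit words */
    struct tcphdr *tcp = (void *)ip + (ip->ihl * 4);
    if ((void *)(tcp + 1) > data_end) return XDP_PASS;

    /* Drop SYN packets from blocked IPs */
    if (tcp->syn && !tcp->ack) {
        __u32 src = ip->saddr;
        __u32 *blocked = bpf_map_lookup_elem(&blocklist_map, &src);
        if (blocked) return XDP_DROP;
    }
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";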
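The bounds-check discipline in the filter above can be exercised in ordinary userspace C, which is handy for unit-testing parsing logic before it goes near a kernel. The following is a minimal sketch, not the BPF program itself: `classify_frame`, the `VERDICT_*` names, and the single-address blocklist are hypothetical stand-ins for the XDP return codes and the map lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical verdicts for this sketch; they mirror XDP_PASS / XDP_DROP. */
enum verdict { VERDICT_PASS = 0, VERDICT_DROP = 1 };

#define ETH_HDR_LEN     14u      /* Ethernet header without VLAN tags */
#define MIN_IP_HDR_LEN  20u
#define MIN_TCP_HDR_LEN 20u
#define ETHERTYPE_IPV4  0x0800u
#define PROTO_TCP       6u
#define TCP_FLAG_SYN    0x02u
#define TCP_FLAG_ACK    0x10u

/* Read a big-endian 16-bit value without alignment assumptions. */
static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Classify one frame. Each header length is checked against the frame
 * length before any of its fields is read, as the verifier would demand.
 * blocked_saddr holds one blocked IPv4 source address in wire order;
 * a real program would do a map lookup instead.
 */
enum verdict classify_frame(const uint8_t *data, const uint8_t *data_end,
                            const uint8_t blocked_saddr[4])
{
    assert(data <= data_end);
    size_t len = (size_t)(data_end - data);

    if (len < ETH_HDR_LEN + MIN_IP_HDR_LEN) return VERDICT_PASS;
    if (rd16(data + 12) != ETHERTYPE_IPV4) return VERDICT_PASS;

    const uint8_t *ip = data + ETH_HDR_LEN;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;     /* header length in bytes */
    if (ihl < MIN_IP_HDR_LEN) return VERDICT_PASS;
    if (ip[9] != PROTO_TCP) return VERDICT_PASS;

    if (len < ETH_HDR_LEN + ihl + MIN_TCP_HDR_LEN) return VERDICT_PASS;
    const uint8_t *tcp = ip + ihl;

    uint8_t flags = tcp[13];
    if ((flags & TCP_FLAG_SYN) && !(flags & TCP_FLAG_ACK) &&
        memcmp(ip + 12, blocked_saddr, 4) == 0)
        return VERDICT_DROP;
    return VERDICT_PASS;
}
```

Like the XDP version, the sketch fails open: anything truncated or malformed is passed on rather than dropped, which leaves the decision to the regular stack.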

XDP operation modes:

Native XDP (xdpdrv): runs inside the driver's NAPI poll. Requires driver support (mlx4, mlx5, i40e, ixgbe, veth, tun, among others). Fastest mode that uses the host CPU.
Generic XDP (xdpgeneric): runs after sk_buff allocation in the core kernel, so no driver support is needed. Roughly 3-4x slower than native mode but universally available.
Offloaded XDP (xdpoffload): runs on SmartNIC hardware (Netronome). Consumes no host CPU cycles.

Load XDP program:

ip link set dev eth0 xdp obj xdp_prog.o sec xdp
ip link set dev eth0 xdpgeneric obj xdp_prog.o sec xdp  # generic mode
ip link show eth0 | grep xdp   # verify attachment

# Using bpftool
bpftool prog load xdp_prog.o /sys/fs/bpf/myprog
bpftool net attach xdp id <prog_id> dev eth0

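Reported drop rates: published benchmarks from Cloudflare and from the XDP paper put native XDP at roughly 10 to 20 million packets per second dropped per CPU core on commodity x86 hardware; the exact figure depends on the NIC, driver, and program. Filtering the same traffic with iptables saturates a core at roughly 1.5 Mpps.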

TC eBPF: Traffic Control

TC (Traffic Control) programs attach at the tc qdisc layer, after sk_buff allocation, so they have access to the full packet metadata. The clsact qdisc provides both an ingress and an egress hook, and programs return TC_ACT_* verdicts:

Ingress path:
  netif_receive_skb()
       |
       v
  [TC ingress BPF — access to sk_buff]
       |
       +--- TC_ACT_SHOT     → drop
       +--- TC_ACT_PASS     → continue to stack
       +--- TC_ACT_REDIRECT → redirect to another interface/socket

Egress path:
  ip_output() → __dev_queue_xmit()
       |
       v
  [TC egress BPF, before qdisc enqueue]
       |
       v
  qdisc → dev_hard_start_xmit()

Attach TC programs:

# Create clsact qdisc (required for TC BPF)
tc qdisc add dev eth0 clsact

# Attach BPF program to ingress
tc filter add dev eth0 ingress bpf da obj tc_prog.o sec classifier

# List TC programs
tc filter show dev eth0 ingress

TC eBPF is used by Cilium for Kubernetes pod-to-pod traffic: every pod's veth interface has TC ingress/egress programs that enforce NetworkPolicy, load-balance services, and collect flow metrics.

BPF Maps: Shared State

BPF maps are kernel data structures accessible from both BPF programs and userspace. They are the communication channel between the fast path (BPF program) and the control plane (userspace daemon):

Map type	Lookup	Use case
`BPF_MAP_TYPE_HASH`	O(1) average	Connection tracking, blocklist
`BPF_MAP_TYPE_ARRAY`	O(1)	Counters, config
`BPF_MAP_TYPE_LPM_TRIE`	O(key length in bits)	IP prefix routing, GeoIP
`BPF_MAP_TYPE_PERCPU_HASH`	O(1), lock-free	Per-CPU counters
`BPF_MAP_TYPE_DEVMAP`	O(1)	XDP_REDIRECT target list
`BPF_MAP_TYPE_SOCKMAP`	O(1)	Socket redirection
`BPF_MAP_TYPE_RINGBUF`	N/A	High-perf event logging

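/* Map definition in BPF program */
struct ipv4_lpm_key {
    __u32 prefixlen;   /* number of significant bits; must be the first field */
    __u32 addr;        /* IPv4 address in network byte order                  */
};

struct {
    __uint(type, BPF_MAP_TYPE_LPM_TRIE);
    __uint(max_entries, 1024);
    __type(key, struct ipv4_lpm_key);
    __type(value, __u32);
    __uint(map_flags, BPF_F_NO_PREALLOC);  /* mandatory for LPM tries */
} ip_prefix_map SEC(".maps");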
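To make the lookup semantics of an LPM trie concrete, here is a small userspace model of longest-prefix matching. It assumes a key made of a prefix length followed by an IPv4 address, and it scans a flat array instead of walking a trie, so it shows what the map returns rather than how the kernel finds it. `lpm_entry` and `lpm_lookup` are hypothetical names, and addresses are kept in host byte order for readability (the kernel map compares the address bytes in network order).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One prefix in the model table. */
struct lpm_entry {
    uint32_t prefixlen;  /* 0..32 significant bits */
    uint32_t addr;       /* IPv4 address, host byte order in this sketch */
    uint32_t value;
};

/* Mask with the top prefixlen bits set; a /0 prefix matches everything. */
static uint32_t prefix_mask(uint32_t prefixlen)
{
    assert(prefixlen <= 32);
    return prefixlen == 0 ? 0 : 0xffffffffu << (32 - prefixlen);
}

/*
 * Return the matching entry with the longest prefix, or NULL if none
 * matches. A lookup in the real map passes prefixlen = 32 in the key
 * and gets the value of the most specific covering prefix.
 */
const struct lpm_entry *lpm_lookup(const struct lpm_entry *tbl, size_t n,
                                   uint32_t addr)
{
    const struct lpm_entry *best = NULL;

    for (size_t i = 0; i < n; i++) {
        uint32_t mask = prefix_mask(tbl[i].prefixlen);
        if ((addr & mask) != (tbl[i].addr & mask))
            continue;
        if (best == NULL || tbl[i].prefixlen > best->prefixlen)
            best = &tbl[i];
    }
    return best;
}
```

With 10.0.0.0/8 and 10.1.2.0/24 both present, a lookup for 10.1.2.77 returns the /24 entry, while 10.9.9.9 falls back to the /8.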

Userspace map interaction:

bpftool map list
bpftool map dump id <map_id>
bpftool map update id <map_id> key 1 2 3 4 value 0 0 0 1
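One detail worth knowing when reading per-CPU maps from userspace: a single lookup returns one value slot per possible CPU, and the control plane has to aggregate them itself. The sketch below shows only that aggregation step for 64-bit counters; `percpu_sum` is a hypothetical helper, and fetching the slots (for example through libbpf) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Sum the per-CPU slots returned by one lookup on a per-CPU map.
 * slots has one element per possible CPU; CPUs that never touched
 * the entry simply contribute zero.
 */
uint64_t percpu_sum(const uint64_t *slots, size_t ncpus)
{
    assert(slots != NULL || ncpus == 0);

    uint64_t total = 0;
    for (size_t i = 0; i < ncpus; i++)
        total += slots[i];
    return total;
}
```

Because each CPU writes only its own slot, the BPF side needs no atomic operations; the cost is that userspace sees a sum that may be slightly stale while CPUs keep counting.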

Cilium: Kubernetes Networking with eBPF

Cilium replaces kube-proxy's iptables-based service load balancing with BPF maps:

Service lookup: cgroup socket-address programs (BPF_PROG_TYPE_CGROUP_SOCK_ADDR, attached at the connect4/connect6 hooks) intercept connect() calls and rewrite the destination to a healthy backend pod, doing DNAT at the socket level before any packet is sent. No iptables, no conntrack for most traffic.
NetworkPolicy: TC programs, with XDP used for early prefiltering, enforce L3/L4 policy in the kernel datapath; L7 (HTTP, gRPC) filtering is delegated to a node-local Envoy proxy.
Observability: BPF event buffers (perf event arrays or BPF_MAP_TYPE_RINGBUF maps) feed per-flow telemetry to the Hubble observability layer without kernel modifications.

At 10,000 services, iptables-mode kube-proxy needs on the order of 100,000 rules, and those rules are evaluated sequentially. Cilium resolves the same services with O(1) BPF hash map lookups.

cgroup BPF

cgroup BPF programs (BPF_PROG_TYPE_CGROUP_SKB, BPF_PROG_TYPE_CGROUP_SOCK_ADDR) attach to cgroup v2 hierarchies, enabling per-container network policy without network namespaces or iptables:

# Attach ingress filter to container cgroup
bpftool cgroup attach /sys/fs/cgroup/docker/<id> ingress id <prog_id>

# List attached programs
bpftool cgroup tree /sys/fs/cgroup/docker/<id>

Kubernetes networking components such as Cilium use these hooks for socket-level load balancing and L4 policy enforcement that is cgroup-aware rather than namespace-aware.

AF_XDP: Kernel Bypass

AF_XDP sockets allow userspace applications to receive and send packets via XDP without going through the kernel network stack. The XDP program redirects packets directly to an AF_XDP socket's receive ring, which is mapped into userspace:

NIC DMA ring  ←→  XDP program  ←→  AF_XDP socket (UMEM ring)
                                           ↑
                                    mmap'd into userspace
                                    (zero copy — no sk_buff)

/* AF_XDP socket creation; buf must be page-aligned and buf_size bytes long */
int fd = socket(AF_XDP, SOCK_RAW, 0);
struct xdp_umem_reg umem = {
    .addr = (__u64)(uintptr_t)buf,   /* addr is a 64-bit integer, not a pointer */
    .len = buf_size,
    .chunk_size = 4096,
};
setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &umem, sizeof(umem));
/* ... setup fill/completion/rx/tx rings ... */

With the XDP_ZEROCOPY bind flag, the NIC DMAs packets directly into the UMEM buffer, with no kernel memory copies. Used by:

DPDK AF_XDP PMD (driver-independent kernel bypass)
OVS-AF_XDP (Open vSwitch zero-copy)
Snort/Suricata IDS for 10 Gbps packet inspection

Historical Context

BPF was invented by McCanne and Jacobson (1992) for tcpdump packet filtering — a simple register machine that evaluated filter programs in kernel context. Linux adopted it in 1997.

eBPF (extended BPF) was introduced in Linux 3.15 (2014) by Alexei Starovoitov, expanding the register set and adding a verifier, JIT compiler, and map infrastructure. XDP was added in Linux 4.8 (2016) after a collaborative design effort by Red Hat's Jesper Dangaard Brouer and others.

Cilium launched in 2017 using eBPF to replace iptables in Kubernetes. It was among the first production deployments to demonstrate that eBPF could take over Netfilter's role for a real workload.

Production Examples

DDoS mitigation (Cloudflare, 2017+):

# Load XDP program on all bonded interfaces
for iface in bond0 eth0 eth1; do
    ip link set dev $iface xdp obj ddos_filter.o sec xdp
done

# Monitor drop counters via BPF map
# bpftool prints "<id>: <type> name <name> ...", so split on ':' to get a bare id
bpftool map dump id $(bpftool map list | awk -F: '/drop_count/ {print $1; exit}')

Per-thread send-to-receive latency with bpftrace (keyed by thread ID, not by flow):

bpftrace -e '
kprobe:tcp_sendmsg { @send_time[tid] = nsecs; }
kretprobe:tcp_recvmsg /retval > 0 && @send_time[tid]/ {
    $lat = nsecs - @send_time[tid];
    @latency = hist($lat);
    delete(@send_time[tid]);
}'

Debugging Notes

# List all loaded BPF programs
bpftool prog list
bpftool prog show id <id>
bpftool prog dump xlated id <id>     # dump BPF instructions
bpftool prog dump jited id <id>      # dump JIT-compiled x86

# Show verifier errors at load time (the verifier log is printed on failure)
bpftool prog load bad.o /sys/fs/bpf/bad 2>&1 | head -30

# Monitor XDP drop rate
watch -n1 'ethtool -S eth0 | grep xdp_drop'

# XDP-enabled interface status
ip link show dev eth0 | grep xdp

# TC filter status
tc filter show dev eth0 ingress
tc filter show dev eth0 egress

# BPF map statistics
bpftool -j map dump id <id> | jq length   # entry count (JSON output is one array element per entry)

Security Implications

BPF verifier: prevents unbounded loops, out-of-bounds memory access, uninitialized reads. Programs failing verification are rejected. The verifier has had bugs (CVE-2021-3490, CVE-2021-31440) that allowed privilege escalation.
CAP_BPF (Linux 5.8): new capability that separates BPF permissions from CAP_SYS_ADMIN. Networking program types additionally require CAP_NET_ADMIN, for example to load and attach XDP or TC programs.
Unprivileged BPF: historically disabled on production systems (kernel.unprivileged_bpf_disabled=1). The attack surface of the verifier is large.
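Side channels: BPF programs have been used as gadgets for speculative execution side-channel attacks (Spectre), for example by speculatively reading beyond the bounds of a map value. The verifier mitigates this by sanitizing pointer arithmetic and inserting speculation barriers where needed.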

Performance Implications

Path	Throughput (single core, x86)
Native XDP drop	~20 Mpps
Generic XDP drop	~5 Mpps
iptables DROP	~1.5 Mpps
TC BPF forward	~5 Mpps
AF_XDP zerocopy	~20 Mpps userspace

XDP avoids sk_buff allocation and initialization, the cache misses that come with touching sk_buff metadata, and the remaining per-packet stack processing in softirq context for packets that are dropped. For DDoS mitigation, where most packets should be dropped, this is decisive.
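The throughput figures translate directly into a per-packet CPU budget, which is the useful number when deciding how much work a program can afford. The sketch below does the arithmetic; the 3 GHz clock used in the note afterwards is an assumption for illustration, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Per-frame overhead on the wire: 7 B preamble + 1 B SFD + 12 B inter-frame gap. */
#define ETH_WIRE_OVERHEAD 20u

/* Maximum frames per second for a link speed (bits/s) and frame size (bytes, including FCS). */
uint64_t line_rate_pps(uint64_t link_bps, uint32_t frame_bytes)
{
    assert(frame_bytes > 0);
    return link_bps / ((uint64_t)(frame_bytes + ETH_WIRE_OVERHEAD) * 8u);
}

/* CPU cycles available per packet on one core at a given packet rate. */
uint64_t cycles_per_packet(uint64_t cpu_hz, uint64_t pps)
{
    assert(pps > 0);
    return cpu_hz / pps;
}
```

For 10 GbE with minimum-size 64-byte frames, `line_rate_pps(10000000000, 64)` gives 14,880,952 packets per second. On an assumed 3 GHz core, 20 Mpps leaves 150 cycles per packet, whereas 1.5 Mpps corresponds to 2,000 cycles per packet.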

Failure Modes and Real Incidents

Incident: XDP program crash loop (2020, financial exchange)

A bug in a custom XDP program dereferenced past data_end. The verifier should have caught it, but a verifier escape in an older kernel version allowed the load. Packet processing halted on affected CPUs until the program was unloaded. Fix: run a kernel with the verifier bug patched, pin programs via bpftool prog pin, and test with xdp_rxq_info before production rollout.

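Failure mode: map size exhaustion

A conntrack BPF hash map with max_entries=100000 fills when a DDoS presents 100K distinct source IPs. Further inserts fail with E2BIG, and an XDP program that returns XDP_PASS for unknown IPs then fails open. Mitigation: track the entry count from the control plane (bpftool map show reports max_entries but not current usage), and consider BPF_MAP_TYPE_LRU_HASH so that old entries are evicted instead of new ones being rejected. BPF_F_NO_PREALLOC reduces memory use for large sparse maps but does not raise the limit.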
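The difference between a table that rejects inserts when full and one that evicts old entries can be seen in a small userspace model. This is a sketch under simplifying assumptions: a four-slot array stands in for max_entries, and eviction is exact least-recently-used, whereas the kernel's LRU hash map is only approximately LRU. All names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_CAP 4u   /* stands in for max_entries */

struct entry { uint32_t key; uint64_t last_used; int in_use; };

struct table {
    struct entry e[TABLE_CAP];
    uint64_t clock;   /* logical time, bumped on every touch */
    int lru;          /* 0: reject when full, 1: evict least recently used */
};

/* Look up a key and refresh its recency; returns 1 if present, 0 otherwise. */
int table_lookup(struct table *t, uint32_t key)
{
    assert(t != NULL);
    for (size_t i = 0; i < TABLE_CAP; i++) {
        if (t->e[i].in_use && t->e[i].key == key) {
            t->e[i].last_used = ++t->clock;
            return 1;
        }
    }
    return 0;
}

/* Insert a key. Returns 0 on success, -E2BIG if full and eviction is off. */
int table_insert(struct table *t, uint32_t key)
{
    struct entry *slot = NULL, *oldest = NULL;

    if (table_lookup(t, key))
        return 0;                     /* already present, recency refreshed */

    for (size_t i = 0; i < TABLE_CAP && slot == NULL; i++) {
        if (!t->e[i].in_use)
            slot = &t->e[i];
        else if (oldest == NULL || t->e[i].last_used < oldest->last_used)
            oldest = &t->e[i];
    }
    if (slot == NULL) {
        if (!t->lru)
            return -E2BIG;            /* full: the new flow is not tracked */
        slot = oldest;                /* full: the stalest flow is replaced */
    }
    slot->key = key;
    slot->in_use = 1;
    slot->last_used = ++t->clock;
    return 0;
}
```

Eviction is not free of risk either: under a flood of random source addresses, an LRU table can push out legitimate long-lived flows, so the choice depends on which failure is cheaper for the workload.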

Modern Usage

Cilium 1.14+: fully eBPF-based service mesh, replacing Envoy sidecar for L4 with BPF and using Envoy only for L7
Meta (Facebook): uses XDP for L4 load balancing in Katran, handling all ingress traffic at dozens of datacenters
Cloudflare: runs much of its DDoS mitigation pipeline (SYN flood and UDP flood filtering, packet-level rate limiting) on XDP
Kernel 6.x: BPF_PROG_TYPE_NETFILTER — eBPF programs can now be attached as Netfilter hooks, blurring the line between eBPF and nftables

Future Directions

BPF struct_ops for TCP congestion control: custom CC algorithms loadable as BPF programs without kernel recompilation (Linux 5.6)
XDP multi-buffer: support for jumbo frames and header-data split (Linux 5.18)
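BPF token: delegated permissions that let processes in containers (user namespaces) load BPF programs without holding CAP_BPF in the initial namespace; programs are still verified at load time (Linux 6.9)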
HW offload expansion: more NICs supporting XDP offload; FPGA-based SmartNICs running arbitrary BPF programs at line rate

Exercises

Write an XDP program that counts packets per source IP and drops any IP that exceeds 10,000 packets in a 1-second window. Use a BPF_MAP_TYPE_LRU_HASH for the counter. Load it in generic mode and test with hping3.
Attach a TC BPF program to a veth pair connecting two network namespaces. The program should add a 10ms delay to every 10th packet (latency simulation) without using tc netem. Observe the effect with ping and explain the implementation challenge.
Create an AF_XDP socket in C. Receive 1 million packets from a loopback XDP program, measure throughput, and compare it to the same workload using a regular SOCK_DGRAM socket.
Using Cilium's Hubble CLI (hubble observe), identify the top 5 flows by byte volume in a Kubernetes cluster. Trace how Cilium's TC BPF programs generate this telemetry from the BPF ringbuf map.
Study the BPF verifier output for a program that accesses a packet field without bounds checking. Add the necessary bounds check, reload, and verify the verifier accepts it. Explain exactly which verifier rule was violated.

References

net/core/filter.c — BPF socket filter, XDP execution
net/sched/cls_bpf.c — TC BPF classifier
kernel/bpf/verifier.c — BPF verifier
include/uapi/linux/bpf.h — BPF program types, map types, helper functions
Starovoitov, A. & Borkmann, D. eBPF — in-kernel virtual machine. Linux Plumbers 2014.
Høiland-Jørgensen, T., Brouer, J.D., et al. The eXpress Data Path: Fast Programmable Packet Processing in the Operating System Kernel. ACM CoNEXT 2018.
Cilium documentation. eBPF & XDP Reference Guide. docs.cilium.io
Cloudflare blog. L4Drop: XDP DDoS Mitigations. 2018.
man 2 bpf, bpftool-prog(8), bpftool-map(8), bpftool-net(8)
Documentation/bpf/ — kernel BPF documentation tree