The eBPF Revolution
Overview
Extended Berkeley Packet Filter (eBPF) has fundamentally changed how the Linux kernel is extended. What began in 1992 as a simple packet filtering mechanism has evolved into a general-purpose, safe, in-kernel virtual machine that allows userspace-authored programs to run in kernel context—without modifying kernel source code, without loading kernel modules, and without rebooting.
eBPF is now the foundation of the most significant advances in Linux networking, observability, and security over the past decade. It is not an exaggeration to say that eBPF has made the Linux kernel programmable in a way that was previously impossible without accepting the reliability risks of kernel modules.
Prerequisites
- Understanding of Linux system calls and kernel/user space boundary
- Familiarity with network packet processing (Ethernet frames, IP packets, TCP/UDP)
- Basic knowledge of C programming and Linux file descriptors
- Awareness of kernel tracing concepts (kprobes, tracepoints, perf)
Historical Context
BPF 1992: The Origin
Steven McCanne and Van Jacobson at Lawrence Berkeley National Laboratory published "The BSD Packet Filter: A New Architecture for User-level Packet Capture" in 1992. The original BPF was a register-based VM with a simple two-register (accumulator + index), 32-bit instruction set designed for one purpose: efficiently filtering packets in the kernel before copying them to userspace for tools like tcpdump.
The key insight was that evaluating filter expressions in the kernel eliminated the need to copy every packet to userspace for inspection—only matching packets crossed the kernel/user boundary.
ORIGINAL BPF (1992)
NIC receives packet
|
v
[Kernel Network Stack]
|
v
[BPF Filter VM] <-- runs filter bytecode: "is this TCP port 80?"
| \
YES NO
| \
[Copy to [Drop, no copy]
userspace]
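The filtering idea above can be sketched as a toy interpreter. This is an illustrative model in Python, not kernel code: the opcode names (`ldh_abs`, `ldxb_msh`, and so on) and the tuple layout are invented for this sketch, loosely following classic BPF's accumulator (A) and index (X) registers. It assumes an untagged Ethernet frame carrying IPv4.

```python
import struct

# Toy instruction format: (op, k, jt, jf). Jump offsets are relative to the
# next instruction, as in classic BPF. Opcode names are hypothetical.
TCP_DST_PORT_80 = [
    ("ldh_abs", 12, 0, 0),     # A = EtherType
    ("jeq", 0x0800, 0, 8),     # not IPv4 -> drop
    ("ldb_abs", 23, 0, 0),     # A = IP protocol
    ("jeq", 6, 0, 6),          # not TCP -> drop
    ("ldh_abs", 20, 0, 0),     # A = flags + fragment offset
    ("jset", 0x1FFF, 4, 0),    # non-first fragment -> drop
    ("ldxb_msh", 14, 0, 0),    # X = IPv4 header length in bytes
    ("ldh_ind", 16, 0, 0),     # A = TCP destination port
    ("jeq", 80, 0, 1),         # port 80 -> accept, else drop
    ("ret", 262144, 0, 0),     # accept: copy up to 262144 bytes
    ("ret", 0, 0, 0),          # drop: copy nothing
]


def run_filter(program, packet):
    """Run a toy filter; return how many bytes to copy (0 means drop)."""
    a = 0  # accumulator
    x = 0  # index register
    pc = 0
    while pc < len(program):
        op, k, jt, jf = program[pc]
        pc += 1
        try:
            if op == "ldh_abs":
                a = struct.unpack_from("!H", packet, k)[0]
            elif op == "ldb_abs":
                a = packet[k]
            elif op == "ldxb_msh":
                x = 4 * (packet[k] & 0x0F)
            elif op == "ldh_ind":
                a = struct.unpack_from("!H", packet, x + k)[0]
            elif op == "jeq":
                pc += jt if a == k else jf
            elif op == "jset":
                pc += jt if a & k else jf
            elif op == "ret":
                return k
            else:
                raise ValueError(f"unknown opcode {op!r}")
        except (struct.error, IndexError):
            return 0  # a load past the end of the packet rejects it
    return 0


def make_tcp_packet(dst_port):
    """Build a minimal Ethernet + IPv4 + TCP packet for the demo."""
    eth = bytes(12) + struct.pack("!H", 0x0800)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, 6, 0,
                     bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    tcp = struct.pack("!HH", 12345, dst_port) + bytes(16)
    return eth + ip + tcp
```

Calling `run_filter(TCP_DST_PORT_80, make_tcp_packet(80))` returns a nonzero length, so only that packet would be copied to userspace; a packet to port 443 yields 0 and is never copied.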
eBPF 2014: Alexei Starovoitov's Transformation
In 2014, Alexei Starovoitov (then at PLUMgrid, later at Facebook/Meta) submitted a patch series that fundamentally redesigned BPF. The result, exposed to userspace through the bpf() system call in Linux 3.18, was extended BPF (eBPF):
- Register count increased from 2 to 11 (10 general-purpose registers, r0 through r9, plus the read-only frame pointer r10)
- Register width increased from 32-bit to 64-bit
- A new verifier that proves program safety at load time
- A JIT compiler that translates eBPF bytecode to native machine code
- Maps: kernel-resident data structures accessible from both eBPF programs and userspace
- Helper functions: a controlled API into kernel subsystems
- New hook points: not just packet filters, but kprobes, tracepoints, system call entry/exit, and more
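The wider register file is visible in the instruction encoding: each basic eBPF instruction is 8 bytes, with two 4-bit register fields. The sketch below packs and unpacks that layout in Python. The helper names `encode` and `decode` are made up for this example, and it assumes a little-endian host.

```python
import struct

BPF_ALU64_MOV_IMM = 0xB7  # BPF_ALU64 | BPF_MOV | BPF_K
BPF_JMP_EXIT = 0x95       # BPF_JMP | BPF_EXIT


def encode(opcode, dst=0, src=0, off=0, imm=0):
    """Pack one 8-byte instruction: opcode, registers, offset, immediate."""
    if not (0 <= dst <= 10 and 0 <= src <= 10):
        raise ValueError("eBPF only has registers r0..r10")
    return struct.pack("<BBhi", opcode, (src << 4) | dst, off, imm)


def decode(raw):
    """Unpack an 8-byte instruction into its fields."""
    opcode, regs, off, imm = struct.unpack("<BBhi", raw)
    return {"opcode": opcode, "dst": regs & 0x0F, "src": regs >> 4,
            "off": off, "imm": imm}


# "r0 = 2; exit" is a complete (if trivial) program body.
PROGRAM = encode(BPF_ALU64_MOV_IMM, dst=0, imm=2) + encode(BPF_JMP_EXIT)
```

`PROGRAM.hex()` gives `b7000000020000009500000000000000`, the same byte layout that disassemblers such as `llvm-objdump` display for these two instructions.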
eBPF Design
The Verifier
The eBPF verifier is the security and safety foundation of the entire system. Before any eBPF program runs, the kernel's verifier performs a static analysis of the bytecode to prove:
- No unbounded loops: The program must terminate. Bounded loops, whose iteration count the verifier can prove finite, have been allowed since Linux 5.3.
- No invalid memory access: Every pointer dereference is checked; accessing out-of-bounds stack or map memory is rejected.
- No uninitialized reads: All register reads must follow writes; reading an uninitialized register is rejected.
- Type safety: Pointers to kernel structures must be used as the correct type; pointer arithmetic is restricted.
- Bounded runtime: Because every path is explored and every loop must be bounded, an accepted program has a finite worst-case execution length.
eBPF PROGRAM LIFECYCLE
Developer writes eBPF C code
|
v
Clang compiles to eBPF bytecode (.o file)
|
v
Userspace loader calls bpf(BPF_PROG_LOAD, ...)
|
v
+-----------------------+
| KERNEL VERIFIER |
| - CFG construction |
| - Type checking |
| - Bounds checking |
| - Loop bound check |
+-----------+-----------+
| PASS (or REJECT with error)
v
+-----------------------+
| JIT COMPILER |
| eBPF -> x86/ARM/etc |
+-----------+-----------+
|
v
Program attached to hook point
(XDP, kprobe, cgroup, etc.)
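Two of the verifier checks described in this section, uninitialized reads and loop rejection, can be sketched with a toy static analyzer. This is a simplified Python model, not the kernel's algorithm: the instruction tuples are invented for the example, every backward jump is rejected (the real verifier accepts provably bounded loops), and no pointer or bounds tracking is attempted.

```python
def verify(prog):
    """Return None if the toy program is accepted, else a rejection message."""

    def walk(pc, init):
        while True:
            if not 0 <= pc < len(prog):
                return f"pc {pc}: execution runs past the end of the program"
            insn = prog[pc]
            op = insn[0]
            if op == "exit":
                if 0 not in init:
                    return f"pc {pc}: r0 is uninitialized at exit"
                return None
            if op == "mov_imm":                  # ("mov_imm", dst, imm)
                init = init | {insn[1]}
            elif op == "add_reg":                # ("add_reg", dst, src)
                for reg in (insn[1], insn[2]):
                    if reg not in init:
                        return f"pc {pc}: read of uninitialized r{reg}"
            elif op == "jeq_imm":                # ("jeq_imm", reg, imm, off)
                reg, off = insn[1], insn[3]
                if reg not in init:
                    return f"pc {pc}: read of uninitialized r{reg}"
                if off < 0:
                    return f"pc {pc}: backward jump rejected"
                error = walk(pc + 1 + off, init)  # explore the taken branch
                if error:
                    return error
            else:
                return f"pc {pc}: unknown opcode {op!r}"
            pc += 1                               # continue on the fall-through path

    # At entry only r1 (context pointer) and r10 (frame pointer) hold values.
    return walk(0, frozenset({1, 10}))
```

For instance, `verify([("jeq_imm", 1, 0, 1), ("mov_imm", 0, 1), ("exit",)])` is rejected because the taken branch reaches `exit` without ever writing r0, which is the kind of path-sensitive reasoning the real verifier performs at much larger scale.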
JIT Compilation
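After verification, eBPF bytecode is compiled by the kernel's JIT compiler to native machine code. The JIT is architecture-specific (x86-64, ARM64, s390, PowerPC, and RISC-V are among the supported targets). Once compiled, the program runs as native code with no bytecode interpretation at runtime, so its speed is close to that of compiled kernel code. Where the JIT is disabled or unavailable, the kernel can fall back to an interpreter, which is noticeably slower.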
Maps
Maps are the persistent data store of the eBPF ecosystem. They are kernel-resident data structures that survive between eBPF program invocations and can be read and written from both eBPF programs and userspace via the bpf() syscall.
Key map types:
| Type | Description | Use Case |
|---|---|---|
| BPF_MAP_TYPE_HASH | Hash table | IP → action lookups |
| BPF_MAP_TYPE_ARRAY | Fixed-size array | Global counters, configuration values |
| BPF_MAP_TYPE_PERCPU_HASH | Per-CPU hash | Lock-free statistics |
| BPF_MAP_TYPE_LRU_HASH | LRU eviction hash | Connection tracking |
| BPF_MAP_TYPE_RINGBUF | Multi-producer, single-consumer ring buffer | Event streaming to userspace |
| BPF_MAP_TYPE_PROG_ARRAY | Array of eBPF programs | Tail calls / dispatch tables |
| BPF_MAP_TYPE_SOCKMAP | Map of sockets | Socket redirection |
| BPF_MAP_TYPE_XSKMAP | Map of AF_XDP sockets | Zero-copy packet delivery |
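The practical difference between the plain hash map and the LRU hash map shows up when the map is full. The following Python sketch models only that behavior; the class and method names are illustrative rather than the kernel or libbpf API, and the kernel's LRU eviction is approximate rather than the strict ordering used here.

```python
from collections import OrderedDict


class HashMap:
    """Toy BPF_MAP_TYPE_HASH: inserting a new key fails once the map is full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.data = {}

    def update(self, key, value):
        if key not in self.data and len(self.data) >= self.max_entries:
            return False  # the kernel would return an error to the caller
        self.data[key] = value
        return True

    def lookup(self, key):
        return self.data.get(key)  # None models a NULL lookup result


class LruHashMap(HashMap):
    """Toy BPF_MAP_TYPE_LRU_HASH: a full map evicts the least recently used key."""

    def __init__(self, max_entries):
        super().__init__(max_entries)
        self.data = OrderedDict()

    def update(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        elif len(self.data) >= self.max_entries:
            self.data.popitem(last=False)  # evict the oldest entry
        self.data[key] = value
        return True

    def lookup(self, key):
        if key in self.data:
            self.data.move_to_end(key)  # a hit refreshes recency
        return self.data.get(key)
```

With `max_entries=2`, a third distinct key is refused by `HashMap` but silently displaces an older entry in `LruHashMap`, which is why LRU maps suit connection tracking where stale flows should age out.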
Helper Functions
eBPF programs cannot call arbitrary kernel functions (this would break the safety model). Instead, a curated set of helper functions provides access to kernel services:
- bpf_map_lookup_elem() / bpf_map_update_elem(): Map access
- bpf_ktime_get_ns(): High-resolution clock
- bpf_get_current_pid_tgid(): Process identity
- bpf_probe_read_kernel(): Safe memory read from arbitrary kernel address
- bpf_skb_store_bytes(): Modify packet data
- bpf_redirect(): Redirect packet to another interface
- bpf_perf_event_output(): Send data to userspace via perf ring buffer
- bpf_ringbuf_submit(): Send data to userspace via ring buffer map
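Since Linux 5.13, kfuncs allow eBPF programs to call a wider set of kernel functions in a controlled manner. A kfunc is an ordinary kernel function that a subsystem explicitly registers for BPF use through a BTF ID set; unlike helpers, kfuncs are not a stable ABI.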
eBPF Architecture Diagram
USERSPACE KERNEL
+-----------+ +--------------------------------+
| eBPF C | | |
| source | bpf() syscall | +----------+ +-----------+ |
+-----------+ BPF_PROG_LOAD | | | | | |
| Clang/ +----------------->+ | Verifier +->| JIT | |
| LLVM | | | (safety | | Compiler | |
| compiler | | | proof) | | | |
+-----------+ | +----------+ +-----+-----+ |
| libbpf / | | | |
| bpftool | bpf() MAP ops | v |
| loader +<---------------->+ +---------+ +------+------+ |
+-----------+ | | Maps | | Attached | |
| Userspace | | | (shared | | Program | |
| reader | | | data) | | runs at: | |
| (metrics, | | +---------+ | - XDP | |
| alerts) | | | - TC | |
+-----------+ | | - kprobe | |
| | - tracepoint| |
| | - LSM hook | |
| | - syscall | |
| +-------------+ |
+--------------------------------+
eBPF Use Cases
Networking
XDP (eXpress Data Path) attaches eBPF programs at the earliest possible point in the network stack—in the NIC driver, before any sk_buff allocation. This enables packet processing at line rate with minimal CPU overhead.
XDP actions:
- XDP_DROP: Drop packet immediately (DDoS mitigation)
- XDP_PASS: Pass to normal network stack
- XDP_TX: Retransmit out the same interface (echo server, load balancer)
- XDP_REDIRECT: Send to another NIC, CPU, or AF_XDP socket
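The decision an XDP program makes can be modeled as a function from a raw frame to one of these verdicts. The sketch below is a Python model of a source-IP blocklist, not loadable XDP code: the function name `xdp_verdict` and the blocklist argument are assumptions for illustration, while the numeric verdict values match the kernel's `enum xdp_action`. The length check mirrors the bounds check a real XDP program must perform before reading packet data.

```python
import ipaddress
import struct

# Values of enum xdp_action in the kernel UAPI headers.
XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT = range(5)

ETH_P_IP = 0x0800
ETH_HLEN = 14
IPV4_MIN_HLEN = 20


def xdp_verdict(frame, blocklist):
    """Drop IPv4 frames whose source address is blocklisted; pass the rest."""
    if len(frame) < ETH_HLEN + IPV4_MIN_HLEN:
        return XDP_PASS  # too short to parse: leave it to the normal stack
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS
    src = ipaddress.IPv4Address(frame[ETH_HLEN + 12:ETH_HLEN + 16])
    return XDP_DROP if src in blocklist else XDP_PASS


def ipv4_frame(src, ethertype=ETH_P_IP):
    """Build a minimal Ethernet + IPv4 frame for the demo."""
    eth = bytes(12) + struct.pack("!H", ethertype)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 17, 0,
                     ipaddress.IPv4Address(src).packed, bytes(4))
    return eth + ip
```

In a real deployment the blocklist would typically live in a BPF hash map populated from userspace, so the set can change without reloading the program.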
TC (Traffic Control) eBPF hooks (at ingress and egress of network devices) can classify, redirect, and modify packets after sk_buff allocation, enabling more complex manipulation including connection tracking and NAT.
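Socket filtering attaches an eBPF program to an individual socket (via setsockopt() with SO_ATTACH_BPF) to decide which incoming packets that socket receives, replacing the more limited classic BPF filters attached with SO_ATTACH_FILTER.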
Observability
eBPF-based observability tools attach to kernel events without modifying application code:
- kprobes: Dynamic instrumentation of almost any kernel function entry (inlined and blacklisted functions cannot be probed)
- kretprobes: Capture return values from kernel functions
- tracepoints: Stable kernel instrumentation points (preferred over kprobes)
- uprobes: Dynamic instrumentation of userspace functions
- USDT (User Statically Defined Tracing): Pre-placed probe points in userspace binaries
Production observability tools using eBPF:
- Pixie: Auto-instruments Kubernetes pods using uprobes, capturing HTTP/gRPC/SQL traffic without code changes
- Parca: Continuous CPU profiling using eBPF perf_event sampling, with low overhead
- Cilium Hubble: Network flow visibility across an entire Kubernetes cluster
- bpftrace: One-liner observability scripts (similar to DTrace on Solaris)
Security
LSM BPF (Linux Security Module via BPF, merged in 5.7) allows implementing MAC security policies in eBPF programs attached to LSM hooks—the same hooks used by SELinux and AppArmor.
seccomp-bpf: Uses classic BPF (not eBPF) to filter syscalls; eBPF-based approaches provide richer context.
Cilium Tetragon: Runtime security tool that uses kprobes and LSM BPF to detect and enforce policies on process execution, file access, network connections, and privilege escalation—all with eBPF.
eBPF at Companies
Cloudflare
Cloudflare processes 50+ million HTTP requests per second and has been at the forefront of eBPF-based DDoS mitigation:
- Entire DDoS mitigation pipeline implemented in XDP eBPF: SYN flood mitigation, UDP amplification blocking, IP reputation checks—all at wire speed before packets reach the Linux network stack
- Published "L4Drop: XDP DDoS Mitigations" explaining how XDP eBPF drops millions of packets per second per CPU core
- Moved packet filtering that previously relied on iptables onto eBPF-based programs across their global network
Meta
Meta's network team undertook a multi-year project to replace iptables with eBPF:
- iptables rules are O(N) traversal; BPF hash map lookups are O(1)
- On servers with thousands of container firewall rules, iptables rule evaluation consumed significant CPU
- Replaced with Katran (Meta's open-source L4 load balancer in XDP eBPF) and custom BPF-based firewall maps
- Meta's Magma project uses eBPF for 5G cellular network packet processing
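The O(N) versus O(1) contrast above can be made concrete with a small Python model. It is deliberately simplified: each rule matches on a single source address, whereas real iptables rules match on many fields, and the function names are illustrative only.

```python
def linear_match(rules, src_ip):
    """Walk an ordered rule list like a chain; return (action, rules_examined)."""
    for examined, (rule_ip, action) in enumerate(rules, start=1):
        if rule_ip == src_ip:
            return action, examined
    return "ACCEPT", len(rules)  # default policy after a full traversal


def map_match(rule_map, src_ip):
    """Look the address up in a hash map; cost does not grow with rule count."""
    return rule_map.get(src_ip, "ACCEPT")


# 5,000 single-address DROP rules, as a stand-in for per-container rules.
RULES = [(f"10.0.{i // 256}.{i % 256}", "DROP") for i in range(5000)]
RULE_MAP = dict(RULES)
```

Matching the last rule costs `linear_match` 5,000 comparisons, and an unmatched address always costs the full traversal, while `map_match` performs a single hashed lookup either way.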
Cilium
Cilium replaced kube-proxy (iptables-based Kubernetes service routing) with eBPF:
- Service load balancing via BPF maps: O(1) lookup regardless of service count
- Network policy enforcement at the socket layer, not just packet level
- Transparent encryption using WireGuard or IPsec with eBPF orchestration
- Used by Google GKE Dataplane V2, AWS EKS, and deployed at companies including Adobe, Capital One, and Datadog
Falco + Tetragon (CNCF)
Both use eBPF for cloud-native runtime security:
- Falco uses kprobes/tracepoints to detect anomalous syscall patterns
- Tetragon uses LSM BPF for policy enforcement (not just detection), capable of killing processes that violate policy
BTF and CO-RE: Compile Once, Run Everywhere
A historical pain point of eBPF programs was kernel header dependency. An eBPF program compiled against kernel 5.10's headers might fail on 5.15 if internal struct layouts changed. This required either embedding all relevant kernel headers or recompiling for each target kernel version.
BTF (BPF Type Format) is a compact debug info format (similar to DWARF but much smaller) that the kernel embeds in itself (/sys/kernel/btf/vmlinux). It describes all kernel types: struct layouts, function signatures, enum values.
CO-RE (Compile Once, Run Everywhere), implemented in libbpf, uses BTF to:
1. At compile time: record the field offsets the eBPF program uses, based on the compile-time kernel headers
2. At load time: compare against the running kernel's BTF and rewrite field accesses if structs have moved
CO-RE RELOCATION AT LOAD TIME
Compiled eBPF accesses: task_struct->pid at offset 1234
Running kernel BTF says: task_struct->pid is at offset 1238
libbpf rewrites the eBPF bytecode:
r0 = *(u32 *)(r1 + 1234) --> r0 = *(u32 *)(r1 + 1238)
Same binary, different kernels.
Combined with the vmlinux.h header (auto-generated from BTF, containing all kernel type definitions), eBPF programs can now be compiled once and distributed as a single binary that runs correctly across a range of kernel versions.
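The load-time rewrite can be sketched as a small Python model. This is not libbpf's implementation: the data shapes (instruction dicts, relocation tuples, and a nested dict standing in for BTF) are invented for illustration, and the offsets are the example values used above.

```python
def relocate(insns, relocs, target_btf):
    """Return a patched copy of insns using field offsets from target_btf.

    insns:      list of dicts, each with an "off" key
    relocs:     list of (insn_index, struct_name, field_name) records
    target_btf: {struct_name: {field_name: offset}} for the running kernel
    """
    patched = [dict(insn) for insn in insns]
    for index, struct_name, field in relocs:
        fields = target_btf.get(struct_name, {})
        if field not in fields:
            raise LookupError(
                f"CO-RE relocation failed: {struct_name}.{field} "
                "not found in target BTF")
        patched[index]["off"] = fields[field]
    return patched


# Compiled against headers where task_struct->pid sat at offset 1234.
COMPILED = [{"op": "ldx_w", "dst": 0, "src": 1, "off": 1234}]
RELOCS = [(0, "task_struct", "pid")]
RUNNING_KERNEL_BTF = {"task_struct": {"pid": 1238}}
```

`relocate(COMPILED, RELOCS, RUNNING_KERNEL_BTF)` yields the same instruction with `off` set to 1238, and a target BTF without the field raises an error, which corresponds to the load failure seen when a field has been renamed or removed.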
Failure Modes
- Verifier rejection: Programs are rejected at load time, often with cryptic error messages, if they exceed the verifier's complexity limit (1 million processed instructions since Linux 5.2), contain unreachable instructions, or perform unsafe pointer arithmetic.
- Map exhaustion: A leaking program that fills an LRU hash map will start evicting legitimate entries, causing lookup failures; a full non-LRU hash map rejects new insertions instead. Monitor map usage via bpftool map and set appropriate max_entries.
- Tail call limit: eBPF tail calls reuse the current stack frame rather than growing the stack; however, the maximum chain length is 33. Once the limit is reached, further tail calls fail silently: execution falls through to the instruction after the call, and the remaining programs in the chain never run.
- JIT spraying: Early eBPF had a vulnerability class where an attacker could craft eBPF bytecode whose constants, after JIT compilation, ended up as attacker-controlled byte sequences in executable kernel memory. Mitigated by JIT hardening, which blinds constants (/proc/sys/net/core/bpf_jit_harden).
- CO-RE relocation failure: If a field is renamed or removed between kernel versions, CO-RE relocation fails and the program cannot load. This requires fallback probing (for example, checking whether a field exists before using it) or multi-version conditionals.
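The tail call limit is easy to misjudge, so here is a toy Python model of a chain in which every program tail-calls the next. It assumes the limit of 33 counts tail calls rather than programs, so the entry program plus 33 callees can run; treat that off-by-one detail as an assumption and confirm it against the target kernel. The function name `run_chain` is made up for this sketch.

```python
MAX_TAIL_CALL_CNT = 33  # limit quoted in the text; assumed to count tail calls


def run_chain(chain_length, limit=MAX_TAIL_CALL_CNT):
    """Model programs 0..chain_length-1, each tail-calling the next one.

    Returns the indices of the programs that actually ran. When the limit is
    hit, the tail call fails and the current program simply finishes.
    """
    if chain_length < 1:
        raise ValueError("a chain needs at least one program")
    executed = [0]
    tail_calls = 0
    while executed[-1] + 1 < chain_length:
        if tail_calls >= limit:
            break  # call fails silently; later programs never run
        tail_calls += 1
        executed.append(executed[-1] + 1)
    return executed
```

Under these assumptions a 100-stage chain stops after program 33, with no error surfaced to the caller, so long pipelines need an explicit stage counter or a different dispatch design.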
Security Implications
- Privilege requirement: By default, loading eBPF programs requires CAP_BPF (Linux 5.8+), usually combined with CAP_PERFMON or CAP_NET_ADMIN depending on the program type, or CAP_SYS_ADMIN. Unprivileged eBPF (socket filtering only) has historically had numerous privilege escalation CVEs and is typically disabled through the kernel.unprivileged_bpf_disabled sysctl.
- Verifier bugs: The verifier itself has had bugs that allowed malicious eBPF programs to escape to arbitrary kernel execution (CVE-2021-3490, CVE-2022-23222). The verifier's complexity makes it a persistent attack surface.
- Speculative execution attacks: eBPF programs can be used to craft Spectre gadgets. The verifier and JIT apply mitigations such as array index masking, sanitized pointer arithmetic, and speculation barriers, all of which add overhead to eBPF programs.
- Supply chain: A compromised eBPF program loaded by a privileged process can read arbitrary kernel memory, exfiltrate secrets, or modify packet data. Treat eBPF programs as privileged code.
Performance Implications
- XDP performance: Cloudflare measures 14–24 Mpps (million packets per second) per core for XDP DROP, compared to ~4 Mpps for iptables DROP, roughly a 3.5–6x improvement.
- kprobe overhead: Attaching to a high-frequency kernel function (e.g., tcp_sendmsg) can add 50–200 ns per call. At millions of calls per second, this is measurable. Use tracepoints where available (lower overhead, more stable).
- Map access latency: BPF hash map lookup is ~50–100 ns for a warmed-up entry in L3 cache. Per-CPU maps give each CPU its own copy of the value, which removes cross-CPU contention and reduces this to ~10–20 ns.
- Tail call overhead: Each tail call is approximately equivalent to a function call; chains of 10+ tail calls add up.
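To judge whether a per-call cost matters, multiply it by the call rate. The short Python sketch below does that arithmetic for the kprobe figures quoted above; the function name and the example call rate of one million calls per second are illustrative assumptions.

```python
def probe_cpu_fraction(calls_per_second, overhead_ns):
    """Fraction of one CPU core consumed by per-call probe overhead."""
    return calls_per_second * overhead_ns / 1e9


def speedup(new_rate, old_rate):
    """Ratio between two throughput figures in the same unit."""
    return new_rate / old_rate


# One million probed calls per second at 50 ns and at 200 ns per call.
LOW = probe_cpu_fraction(1_000_000, 50)
HIGH = probe_cpu_fraction(1_000_000, 200)
```

At that rate the probe alone costs between 5% and 20% of a core, and `speedup(14, 4)` and `speedup(24, 4)` reproduce the 3.5x to 6x range for the XDP versus iptables figures.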
Modern Usage
- Linux 6.x: expanded kfunc coverage for calling kernel functions from eBPF, BPF exceptions (structured error handling), and BPF arenas (large mmap-backed memory), building on sleepable BPF programs (available since Linux 5.10) for LSM hooks
- bpftool: Production-grade tool for loading, inspecting, and managing eBPF programs and maps
- Wasm + eBPF: Proposals to use WebAssembly as an alternative eBPF VM for kernel extension (less mature)
Future Directions
- BPF token: Allow delegated, fine-grained eBPF permissions without full CAP_BPF, enabling safer multi-tenant eBPF deployment
- Verified eBPF: Research into using formal methods to prove eBPF program correctness beyond what the verifier currently checks
- eBPF for storage: Experimental work on BPF programs attached to block layer and NVMe driver for custom I/O scheduling
- eBPF for HPC: Offloading collective communication primitives (MPI reduce) to eBPF programs on smart NICs
Debugging Notes
# List all loaded eBPF programs
bpftool prog list
# Dump eBPF program instructions (JIT-compiled)
bpftool prog dump jited id <ID>
# Inspect a map
bpftool map dump id <MAP_ID>
# Trace eBPF program execution with debug output
# (requires program to call bpf_trace_printk)
cat /sys/kernel/debug/tracing/trace_pipe
# Check verifier log on load failure
# (use libbpf's verbose mode or bpftool --debug)
bpftool --debug prog load prog.o /sys/fs/bpf/prog
# Show interface counters and any attached XDP program
ip -s link show dev eth0
# Use bpftrace for one-liner tracing
bpftrace -e 'kprobe:do_sys_open { printf("%s opened %s\n", comm, str(arg1)); }'
Exercises
- Write an XDP program in C that counts incoming packets per source IP address, stores counts in a BPF hash map, and exposes them to a userspace polling program. Use libbpf and CO-RE.
- Use bpftrace to write a one-liner that measures the distribution of read() syscall latencies, broken down by process name. What is the p99 for your shell vs a database?
- Study the Cilium source code for its kube-proxy replacement. Map out which BPF program handles service VIP → backend translation and what map types it uses.
- Compile and run the kernel's eBPF sample programs in samples/bpf/. Modify xdp_drop_kern.c to drop only UDP packets and measure the throughput difference with iperf3.
- Read CVE-2022-23222 (verifier type confusion). Explain at a high level what invariant the verifier failed to maintain and how the exploit leveraged it.
References
- McCanne, S. & Jacobson, V. "The BSD Packet Filter" (USENIX 1993)
- Starovoitov, A. eBPF patch series (LKML 2014): "net: filter: rework/optimize internal BPF interpreter's instruction set"
- Gregg, B. "BPF Performance Tools" (Addison-Wesley 2019) — comprehensive reference
- Fleming, M. "A thorough introduction to eBPF" (LWN.net 2017)
- Nakryiko, A. "BPF CO-RE reference guide" (nakryiko.com 2021)
- Cloudflare blog: "L4Drop: XDP DDoS Mitigations" (2018)
- Meta Engineering blog: "Open-sourcing Katran, a scalable network load balancer" (2018)
- Linux kernel documentation: Documentation/bpf/
- eBPF.io: official eBPF documentation and tutorials