The eBPF Revolution

Overview

Extended Berkeley Packet Filter (eBPF) has fundamentally changed how the Linux kernel is extended. What began in 1992 as a simple packet filtering mechanism has evolved into a general-purpose, safe, in-kernel virtual machine that allows userspace-authored programs to run in kernel context—without modifying kernel source code, without loading kernel modules, and without rebooting.

eBPF is now the foundation of the most significant advances in Linux networking, observability, and security over the past decade. It is not an exaggeration to say that eBPF has made the Linux kernel programmable in a way that was previously impossible without accepting the reliability risks of kernel modules.

Prerequisites

Understanding of Linux system calls and kernel/user space boundary
Familiarity with network packet processing (Ethernet frames, IP packets, TCP/UDP)
Basic knowledge of C programming and Linux file descriptors
Awareness of kernel tracing concepts (kprobes, tracepoints, perf)

Historical Context

BPF 1992: The Origin

Steven McCanne and Van Jacobson at Lawrence Berkeley National Laboratory wrote "The BSD Packet Filter: A New Architecture for User-level Packet Capture" in December 1992; it was presented at the Winter 1993 USENIX conference. The original BPF was a register-based VM with two 32-bit registers (an accumulator and an index register) and a small instruction set designed for one purpose: efficiently filtering packets in the kernel before copying them to userspace for tools like tcpdump.

The key insight was that evaluating filter expressions in the kernel eliminated the need to copy every packet to userspace for inspection—only matching packets crossed the kernel/user boundary.

ORIGINAL BPF (1992)

  NIC receives packet
        |
        v
  [Kernel Network Stack]
        |
        v
  [BPF Filter VM] <-- runs filter bytecode: "is this TCP port 80?"
        |         \
      YES          NO
        |           \
  [Copy to           [Drop, no copy]
   userspace]
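
The filtering idea in the diagram can be sketched as a toy accumulator machine. This is a userspace Python model, not the real classic BPF encoding: the opcode names, the tuple layout, and the make_packet helper are assumptions made for illustration, and the filter skips details a real tcpdump filter handles (such as IP fragments).

```python
import struct

# Toy opcodes in the spirit of classic BPF (names invented for this sketch).
LDH_ABS, LDB_ABS, LDX_MSH, LDH_IND, JEQ, RET = range(6)

def run_filter(program, packet):
    """Return how many bytes to copy to userspace (0 means drop)."""
    a = 0    # accumulator
    x = 0    # index register
    pc = 0
    while True:
        op, k, jt, jf = program[pc]
        pc += 1
        try:
            if op == LDH_ABS:        # A = 16-bit word at absolute offset k
                a = struct.unpack_from("!H", packet, k)[0]
            elif op == LDB_ABS:      # A = byte at absolute offset k
                a = packet[k]
            elif op == LDX_MSH:      # X = IPv4 header length in bytes
                x = (packet[k] & 0x0F) * 4
            elif op == LDH_IND:      # A = 16-bit word at offset X + k
                a = struct.unpack_from("!H", packet, x + k)[0]
            elif op == JEQ:          # forward-only relative jump
                pc += jt if a == k else jf
            elif op == RET:
                return k
        except (IndexError, struct.error):
            return 0                 # out-of-bounds load: drop

# "IPv4, TCP, destination port 80"
TCP_DST_PORT_80 = [
    (LDH_ABS, 12, 0, 0),      # A = EtherType
    (JEQ, 0x0800, 0, 6),      # not IPv4 -> drop
    (LDB_ABS, 23, 0, 0),      # A = IP protocol
    (JEQ, 6, 0, 4),           # not TCP -> drop
    (LDX_MSH, 14, 0, 0),      # X = IP header length
    (LDH_IND, 16, 0, 0),      # A = TCP destination port
    (JEQ, 80, 0, 1),          # port 80 -> accept, otherwise drop
    (RET, 0xFFFF, 0, 0),      # accept: copy up to 65535 bytes
    (RET, 0, 0, 0),           # drop: nothing crosses to userspace
]

def make_packet(proto, dst_port):
    """Build a minimal Ethernet + IPv4 + L4 frame for the demo."""
    eth = b"\x00" * 12 + struct.pack("!H", 0x0800)
    ip = bytes([0x45, 0]) + b"\x00" * 7 + bytes([proto]) + b"\x00" * 10
    l4 = struct.pack("!HH", 12345, dst_port) + b"\x00" * 16
    return eth + ip + l4

print(run_filter(TCP_DST_PORT_80, make_packet(6, 80)))
print(run_filter(TCP_DST_PORT_80, make_packet(17, 80)))
```

Only the first packet yields a non-zero copy length; the UDP packet is rejected inside the "kernel" loop and would never be copied.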

eBPF 2014: Alexei Starovoitov's Transformation

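In 2014, Alexei Starovoitov (then at PLUMgrid, later at Facebook/Meta) submitted a patch series that fundamentally redesigned BPF. The new instruction set first landed as the kernel's internal BPF interpreter in Linux 3.15, and the bpf() syscall that exposed extended BPF (eBPF) to userspace was merged in Linux 3.18. The redesign and the work that followed brought: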

Register count increased from 2 to 11 (r0–r9 general-purpose, plus r10 as a read-only frame pointer)
Register width increased from 32-bit to 64-bit
A new verifier that proves program safety at load time
A JIT compiler that translates eBPF bytecode to native machine code
Maps: kernel-resident data structures accessible from both eBPF programs and userspace
Helper functions: a controlled API into kernel subsystems
New hook points: not just packet filters, but kprobes, tracepoints, system call entry/exit, and more
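
To make the register and width changes concrete, here is a small sketch of how a single eBPF instruction is laid out: 8 bytes holding an opcode, two 4-bit register numbers, a 16-bit offset, and a 32-bit immediate. The encode helper is a name invented for this example, and the sketch assumes a little-endian host; the three opcode values are taken from the eBPF instruction set.

```python
import struct

# Opcode bytes (instruction class | operation | operand source).
BPF_ALU64_MOV_IMM = 0xB7   # dst = imm
BPF_ALU64_ADD_IMM = 0x07   # dst += imm
BPF_EXIT = 0x95            # return r0 to the caller

def encode(opcode, dst=0, src=0, off=0, imm=0):
    """Pack one 8-byte eBPF instruction (little-endian layout)."""
    if not (0 <= dst <= 10 and 0 <= src <= 10):
        raise ValueError("eBPF only has registers r0..r10")
    # The register byte holds dst in the low nibble and src in the high nibble.
    return struct.pack("<BBhi", opcode, (src << 4) | dst, off, imm)

# r0 = 40; r0 += 2; exit
program = b"".join([
    encode(BPF_ALU64_MOV_IMM, dst=0, imm=40),
    encode(BPF_ALU64_ADD_IMM, dst=0, imm=2),
    encode(BPF_EXIT),
])

for i in range(0, len(program), 8):
    print(program[i:i + 8].hex(" "))
```

The 4-bit register fields are why the register file stops at r10: there is room for 16 numbers, of which the instruction set defines 11.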

eBPF Design

The Verifier

The eBPF verifier is the security and safety foundation of the entire system. Before any eBPF program runs, the kernel's verifier performs a static analysis of the bytecode to prove:

No unbounded loops: The program must terminate. Before Linux 5.3 every loop had to be unrolled; since 5.3 the verifier accepts loops whose iteration count it can prove to be finite.
No invalid memory access: Every pointer dereference is checked; accessing out-of-bounds stack or map memory is rejected.
No uninitialized reads: All register reads must follow writes; reading an uninitialized register is rejected.
Type safety: Pointers to kernel structures must be used as the correct type; pointer arithmetic is restricted.
Bounded runtime: The verifier also caps the number of instructions it is willing to analyze, so every accepted program has a finite worst-case execution path.
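
Two of these checks can be illustrated with a toy path-walking verifier. This is a deliberately tiny model under stated assumptions: the four-instruction mini language and the verify function are invented for the sketch, loops are rejected outright (the pre-5.3 rule), and the real verifier tracks far more state (types, value ranges, stack contents).

```python
def verify(program):
    """Return None if the toy checks pass, otherwise a rejection message."""
    # At entry only r1 (context pointer) and r10 (frame pointer) hold values.
    worklist = [(0, frozenset({1, 10}))]
    seen = set()
    while worklist:
        pc, init = worklist.pop()
        if (pc, init) in seen:
            continue
        seen.add((pc, init))
        if not 0 <= pc < len(program):
            return f"insn {pc}: control flow leaves the program"
        op, *args = program[pc]
        if op == "mov_imm":                 # dst = constant
            dst, _imm = args
            worklist.append((pc + 1, init | {dst}))
        elif op == "add_reg":               # dst += src (reads both)
            for reg in args:
                if reg not in init:
                    return f"insn {pc}: r{reg} read before write"
            worklist.append((pc + 1, init))
        elif op == "jeq_imm":               # if reg == imm: pc += off
            reg, _imm, off = args
            if reg not in init:
                return f"insn {pc}: r{reg} read before write"
            if off < 0:
                return f"insn {pc}: back-edge, possible unbounded loop"
            worklist.append((pc + 1, init))        # branch not taken
            worklist.append((pc + 1 + off, init))  # branch taken
        elif op == "exit":                  # returns r0
            if 0 not in init:
                return f"insn {pc}: r0 not set at exit"
        else:
            return f"insn {pc}: unknown opcode {op!r}"
    return None

good = [("mov_imm", 0, 0), ("exit",)]
uninit = [("add_reg", 0, 2), ("exit",)]
loop = [("mov_imm", 0, 0), ("jeq_imm", 0, 0, -2), ("exit",)]
one_bad_path = [("jeq_imm", 1, 0, 1), ("mov_imm", 0, 1), ("exit",)]

for name, prog in [("good", good), ("uninit", uninit),
                   ("loop", loop), ("one_bad_path", one_bad_path)]:
    print(name, "->", verify(prog) or "accepted")
```

The last program shows why the analysis has to follow every path: the fall-through path sets r0, but the taken branch reaches exit without it, so the whole program is rejected.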

eBPF PROGRAM LIFECYCLE

  Developer writes eBPF C code
           |
           v
  Clang compiles to eBPF bytecode (.o file)
           |
           v
  Userspace loader calls bpf(BPF_PROG_LOAD, ...)
           |
           v
  +-----------------------+
  |   KERNEL VERIFIER     |
  |  - CFG construction   |
  |  - Type checking      |
  |  - Bounds checking    |
  |  - Loop bound check   |
  +-----------+-----------+
              | PASS (or REJECT with error)
              v
  +-----------------------+
  |   JIT COMPILER        |
  |  eBPF -> x86/ARM/etc  |
  +-----------+-----------+
              |
              v
  Program attached to hook point
  (XDP, kprobe, cgroup, etc.)

JIT Compilation

After verification, eBPF bytecode is compiled by the kernel's JIT compiler to native machine code. The JIT is architecture-specific (x86-64, ARM64, s390, PowerPC, RISC-V are all supported). This means eBPF programs run at near-native speed: once compiled, there is no bytecode interpretation overhead at runtime.

Maps

Maps are the persistent data store of the eBPF ecosystem. They are kernel-resident data structures that survive between eBPF program invocations and can be read and written from both eBPF programs and userspace via the bpf() syscall.

Key map types:

Type	Description	Use Case
BPF_MAP_TYPE_HASH	Hash table	IP → action lookups
BPF_MAP_TYPE_ARRAY	Fixed-size array	Global counters, configuration values
BPF_MAP_TYPE_PERCPU_HASH	Per-CPU hash	Lock-free statistics
BPF_MAP_TYPE_LRU_HASH	LRU eviction hash	Connection tracking
BPF_MAP_TYPE_RINGBUF	Lock-free ring buffer	Event streaming to userspace
BPF_MAP_TYPE_PROG_ARRAY	Array of eBPF programs	Tail calls / dispatch tables
BPF_MAP_TYPE_SOCKMAP	Map of sockets	Socket redirection
BPF_MAP_TYPE_XSKMAP	Map of AF_XDP sockets	Zero-copy packet delivery
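
The practical difference between the plain hash and LRU hash types shows up when the map is full. The sketch below models that behaviour in userspace Python; the class names are invented for illustration, and the LRU here is exact, whereas the kernel's LRU maps only approximate least-recently-used order.

```python
from collections import OrderedDict

class ToyHashMap:
    """Models BPF_MAP_TYPE_HASH: new keys are refused once the map is full."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = {}

    def update(self, key, value):
        if key not in self.entries and len(self.entries) >= self.max_entries:
            return False          # the kernel reports an error (E2BIG) here
        self.entries[key] = value
        return True

    def lookup(self, key):
        return self.entries.get(key)

class ToyLruHashMap:
    """Models BPF_MAP_TYPE_LRU_HASH: a full map evicts its oldest entry."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def update(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.max_entries:
            self.entries.popitem(last=False)   # evict least recently used
        self.entries[key] = value
        return True

    def lookup(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)          # a lookup refreshes the entry
        return self.entries[key]

plain = ToyHashMap(max_entries=2)
plain_results = [plain.update(k, 1) for k in ("flow-a", "flow-b", "flow-c")]

conntrack = ToyLruHashMap(max_entries=2)
conntrack.update("flow-a", 1)
conntrack.update("flow-b", 2)
conntrack.lookup("flow-a")        # flow-a is now the most recently used
conntrack.update("flow-c", 3)     # evicts flow-b, the least recently used
print(plain_results, list(conntrack.entries))
```

This is why the table lists connection tracking under the LRU type: new flows always get a slot, at the cost of forgetting idle ones.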

Helper Functions

eBPF programs cannot call arbitrary kernel functions (this would break the safety model). Instead, a curated set of helper functions provides access to kernel services:

bpf_map_lookup_elem() / bpf_map_update_elem(): Map access
bpf_ktime_get_ns(): High-resolution clock
bpf_get_current_pid_tgid(): Process identity
bpf_probe_read_kernel(): Safe memory read from arbitrary kernel address
bpf_skb_store_bytes(): Modify packet data
bpf_redirect(): Redirect packet to another interface
bpf_perf_event_output(): Send data to userspace via perf ring buffer
bpf_ringbuf_submit(): Send data to userspace via ring buffer map

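Since Linux 5.13, kfuncs (kernel functions explicitly exposed to BPF through BTF ID sets, and marked __bpf_kfunc in recent kernels) allow eBPF programs to call a wider set of kernel functions in a controlled manner. Unlike helpers, kfuncs are not a stable API.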

eBPF Architecture Diagram

  USERSPACE                      KERNEL
  +-----------+                  +--------------------------------+
  | eBPF C    |                  |                                |
  | source    |   bpf() syscall  |  +----------+  +-----------+  |
  +-----------+  BPF_PROG_LOAD   |  |          |  |           |  |
  | Clang/    +----------------->+  | Verifier +->| JIT       |  |
  | LLVM      |                  |  | (safety  |  | Compiler  |  |
  | compiler  |                  |  |  proof)  |  |           |  |
  +-----------+                  |  +----------+  +-----+-----+  |
  | libbpf /  |                  |                      |         |
  | bpftool   |  bpf() MAP ops   |                      v         |
  | loader    +<---------------->+  +---------+  +------+------+  |
  +-----------+                  |  |  Maps   |  | Attached    |  |
  | Userspace |                  |  | (shared |  | Program     |  |
  | reader    |                  |  |  data)  |  | runs at:    |  |
  | (metrics, |                  |  +---------+  | - XDP       |  |
  |  alerts)  |                  |               | - TC        |  |
  +-----------+                  |               | - kprobe    |  |
                                 |               | - tracepoint|  |
                                 |               | - LSM hook  |  |
                                 |               | - syscall   |  |
                                 |               +-------------+  |
                                 +--------------------------------+

eBPF Use Cases

Networking

XDP (eXpress Data Path) attaches eBPF programs at the earliest possible point in the network stack—in the NIC driver, before any sk_buff allocation. This enables packet processing at line rate with minimal CPU overhead.

XDP actions:

XDP_DROP: Drop the packet immediately (DDoS mitigation)
XDP_PASS: Pass the packet to the normal network stack
XDP_TX: Transmit the packet back out the interface it arrived on (echo server, load balancer)
XDP_REDIRECT: Send the packet to another NIC, CPU, or AF_XDP socket
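
The decision an XDP program makes can be modelled as a function from a raw frame to one of these action codes. The sketch below is a Python stand-in, not eBPF C: xdp_blocklist and make_frame are hypothetical names, and the blocklist set plays the role a BPF hash map would play in a real program. The numeric action values match the kernel's enum xdp_action.

```python
import ipaddress
import struct

# Action codes as defined in the kernel's enum xdp_action.
XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT = range(5)

ETH_HLEN = 14
ETH_P_IP = 0x0800

def xdp_blocklist(frame, blocked_sources):
    """Toy XDP program: drop IPv4 frames whose source address is blocked."""
    # Bounds check before touching headers, as the verifier would demand.
    if len(frame) < ETH_HLEN + 20:
        return XDP_PASS
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS
    # The IPv4 source address sits 12 bytes into the IP header.
    src = ipaddress.IPv4Address(frame[ETH_HLEN + 12:ETH_HLEN + 16])
    return XDP_DROP if src in blocked_sources else XDP_PASS

def make_frame(src_ip):
    """Build a minimal Ethernet + IPv4 frame for the demo."""
    eth = b"\x00" * 12 + struct.pack("!H", ETH_P_IP)
    ip = bytes([0x45]) + b"\x00" * 11
    ip += ipaddress.IPv4Address(src_ip).packed + b"\x00" * 4
    return eth + ip

blocked = {ipaddress.IPv4Address("203.0.113.7")}
print(xdp_blocklist(make_frame("203.0.113.7"), blocked))
print(xdp_blocklist(make_frame("198.51.100.1"), blocked))
```

Passing anything the program does not understand, rather than dropping it, is a common defensive default: a parsing mistake then degrades to "no filtering" instead of an outage.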

TC (Traffic Control) eBPF hooks (at ingress and egress of network devices) can classify, redirect, and modify packets after sk_buff allocation, enabling more complex manipulation including connection tracking and NAT.

Socket filtering attaches an eBPF program to an individual socket with setsockopt(SO_ATTACH_BPF) to filter incoming data, replacing the more limited classic BPF filters attached with SO_ATTACH_FILTER.

Observability

eBPF-based observability tools attach to kernel events without modifying application code:

kprobes: Dynamic instrumentation of almost any kernel function (inlined and blacklisted functions cannot be probed)
kretprobes: Capture return values from kernel functions
tracepoints: Stable kernel instrumentation points (preferred over kprobes)
uprobes: Dynamic instrumentation of userspace functions
USDT (User Statically Defined Tracing): Pre-placed probe points in userspace binaries

Production observability tools using eBPF:

Pixie: Auto-instruments Kubernetes pods using uprobes, captures HTTP/gRPC/SQL traffic without code changes
Parca: Continuous profiling using eBPF perf_event sampling; a low-overhead CPU profiler
Cilium Hubble: Network flow visibility across an entire Kubernetes cluster
bpftrace: One-liner observability scripts (similar to DTrace on Solaris)

Security

LSM BPF (Linux Security Module via BPF, merged in 5.7) allows implementing MAC security policies in eBPF programs attached to LSM hooks—the same hooks used by SELinux and AppArmor.

seccomp-bpf: Uses classic BPF (not eBPF) to filter syscalls; eBPF-based approaches provide richer context.

Cilium Tetragon: Runtime security tool that uses kprobes and LSM BPF to detect and enforce policies on process execution, file access, network connections, and privilege escalation—all with eBPF.

eBPF at Companies

Cloudflare

Cloudflare processes 50+ million HTTP requests per second and has been at the forefront of eBPF-based DDoS mitigation:

Entire DDoS mitigation pipeline implemented in XDP eBPF: SYN flood mitigation, UDP amplification blocking, IP reputation checks—all at wire speed before packets reach the Linux network stack
Published "L4Drop: XDP DDoS Mitigations" explaining how XDP eBPF drops millions of packets per second per CPU core
Moved DDoS mitigation rules that previously ran in iptables to eBPF-based packet filtering at the XDP layer across its global network

Cilium

Cilium replaced kube-proxy (iptables-based Kubernetes service routing) with eBPF:

Service load balancing via BPF maps: O(1) lookup regardless of service count
Network policy enforcement at the socket layer, not just packet level
Transparent encryption using WireGuard or IPsec with eBPF orchestration
Used by Google GKE Dataplane V2 and AWS EKS Anywhere, and deployed at companies including Adobe, Capital One, and Datadog
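
The O(1) claim for map-based service lookup is easiest to see next to the rule-list approach it replaces. The following is a rough Python sketch, not Cilium's actual map layout: the function names, the (VIP, port) key, and the CRC-based backend choice are all assumptions made for illustration.

```python
import zlib

def lookup_linear(rules, vip, port):
    """iptables-style: scan an ordered rule list until one matches, O(n)."""
    for (rule_vip, rule_port), backends in rules:
        if (rule_vip, rule_port) == (vip, port):
            return backends
    return None

def lookup_hashed(service_map, vip, port):
    """Map-style: one hash lookup, independent of the number of services."""
    return service_map.get((vip, port))

def pick_backend(backends, flow_key):
    """Choose a backend deterministically per flow (hypothetical policy)."""
    if not backends:
        return None
    return backends[zlib.crc32(flow_key) % len(backends)]

# 200 made-up services with two backends each.
rules = [
    ((f"10.96.0.{i}", 80), [f"10.0.{i}.1:8080", f"10.0.{i}.2:8080"])
    for i in range(1, 201)
]
service_map = dict(rules)

backends = lookup_hashed(service_map, "10.96.0.200", 80)
print(pick_backend(backends, b"192.0.2.1:51000->10.96.0.200:80"))
```

Both lookups return the same backends, but the linear version inspects up to 200 rules for the last service, while the hashed version does the same amount of work for any service.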

Falco + Tetragon (CNCF)

Both use eBPF for cloud-native runtime security:

Falco uses kprobes/tracepoints to detect anomalous syscall patterns
Tetragon uses kprobes, tracepoints, and LSM BPF hooks for policy enforcement (not just detection), and can kill processes that violate policy

BTF and CO-RE: Compile Once, Run Everywhere

A historical pain point of eBPF programs was kernel header dependency. An eBPF program compiled against kernel 5.10's headers might fail on 5.15 if internal struct layouts changed. This required either embedding all relevant kernel headers or recompiling for each target kernel version.

BTF (BPF Type Format) is a compact debug info format (similar to DWARF but much smaller) that the kernel embeds in itself (/sys/kernel/btf/vmlinux). It describes all kernel types: struct layouts, function signatures, enum values.

CO-RE (Compile Once, Run Everywhere), implemented in Clang and libbpf, uses BTF in two steps:

1. At compile time: Clang records, for each kernel struct field the program accesses, a relocation describing that field, along with the offset taken from the compile-time type definitions
2. At load time: libbpf compares these records against the running kernel's BTF and rewrites the field accesses if struct layouts have changed

CO-RE RELOCATION AT LOAD TIME

Compiled eBPF accesses: task_struct->pid at offset 1234

Running kernel BTF says: task_struct->pid is at offset 1238

libbpf rewrites the eBPF bytecode:
  r0 = *(u32 *)(r1 + 1234)  -->  r0 = *(u32 *)(r1 + 1238)

Same binary, different kernels.
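
The load-time rewrite above can be sketched in a few lines. This is a toy model only: the Load dataclass, the relocate function, and the dictionary standing in for the running kernel's BTF are all invented for illustration, and the offsets are the made-up values from the diagram.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Load:
    """Toy 'dst = *(src + off)' instruction with its relocation record."""
    dst: int
    src: int
    off: int
    reloc: tuple   # (struct name, member name) the load refers to

def relocate(program, target_btf):
    """Patch every load's offset to match the target kernel's layout."""
    patched = []
    for insn in program:
        struct_name, member = insn.reloc
        try:
            new_off = target_btf[struct_name][member]
        except KeyError:
            raise LookupError(
                f"{struct_name}.{member} not found in target BTF") from None
        patched.append(replace(insn, off=new_off))
    return patched

# Offset recorded at compile time vs. layout reported by the running kernel.
compiled = [Load(dst=0, src=1, off=1234, reloc=("task_struct", "pid"))]
running_kernel_btf = {"task_struct": {"pid": 1238, "tgid": 1242}}

patched = relocate(compiled, running_kernel_btf)
print(compiled[0].off, "->", patched[0].off)
```

The key design point is that the instruction carries the name of the field, not just a number, so the loader can look the field up again on whatever kernel it finds itself on. A field that no longer exists has no entry to look up, which is the relocation failure case.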

Combined with the vmlinux.h header (auto-generated from BTF, containing all kernel type definitions), eBPF programs can now be compiled once and distributed as a single binary that runs correctly across a range of kernel versions.

Failure Modes

Verifier rejection: Programs that exceed the verifier's complexity limit (1 million analyzed instructions since Linux 5.2), contain unreachable instructions, or perform unsafe pointer arithmetic are rejected at load time, often with cryptic error messages.
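Map exhaustion: A full BPF_MAP_TYPE_HASH rejects new insertions (the update fails with E2BIG), while a full LRU hash map silently evicts older entries, so a program that leaks keys can push out legitimate entries and cause lookup misses. Monitor map usage via bpftool map and size max_entries for the expected working set.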
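Tail call limit: eBPF tail calls reuse the current stack frame instead of growing the stack, but a chain is limited to 33 tail calls. Once the limit is reached, bpf_tail_call() fails and execution falls through to the instruction after the call, so the remaining programs in the chain are silently skipped.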
JIT spraying: Early eBPF had a vulnerability class where an attacker could craft eBPF bytecode that, after JIT compilation, placed attacker-controlled values at predictable kernel addresses. Mitigated by JIT hardening (/proc/sys/net/core/bpf_jit_harden).
CO-RE relocation failure: If a field is renamed or removed between kernel versions, CO-RE relocation fails and the program cannot load. Guard such accesses with libbpf's bpf_core_field_exists() or provide per-version fallbacks.
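
The tail call limit from the list above is easy to misjudge by one, so a small model helps. This sketch is a userspace approximation: the run function, the dictionary standing in for a BPF_MAP_TYPE_PROG_ARRAY, and the ctx counters are assumptions, and the limit of 33 is taken from the text above.

```python
MAX_TAIL_CALL_CNT = 33   # chain limit as stated above

def run(prog_array, start, ctx):
    """Run programs from a toy program array, following tail calls."""
    tail_calls = 0
    current = start
    while True:
        # Each toy program returns the slot it wants to tail-call, or None.
        next_slot = prog_array[current](ctx)
        if next_slot is None:
            return ctx
        if next_slot not in prog_array or tail_calls >= MAX_TAIL_CALL_CNT:
            # The tail call fails; the caller simply carries on and returns.
            ctx["failed_tail_calls"] += 1
            return ctx
        tail_calls += 1
        current = next_slot

def self_caller(ctx):
    """A program that always tries to tail-call itself."""
    ctx["runs"] += 1
    return 0

result = run({0: self_caller}, 0, {"runs": 0, "failed_tail_calls": 0})
print(result)
```

In this model the program body executes 34 times: the initial invocation plus 33 successful tail calls. The 34th attempt fails without any error surfacing to the caller, which is why long dispatch chains should count their own depth or be restructured.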

Security Implications

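Privilege requirement: Loading most eBPF programs requires CAP_BPF (Linux 5.8+), usually together with CAP_NET_ADMIN for networking programs or CAP_PERFMON for tracing programs, or the older catch-all CAP_SYS_ADMIN. Unprivileged eBPF (limited to a few program types such as socket filters) has historically had numerous privilege escalation CVEs and is typically disabled via the kernel.unprivileged_bpf_disabled sysctl.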
Verifier bugs: The verifier itself has had bugs that let malicious eBPF programs gain arbitrary kernel memory read/write and escalate privileges (CVE-2021-3490, CVE-2022-23222). The verifier's complexity makes it a persistent attack surface.
Speculative execution attacks: eBPF programs can be used to craft Spectre gadgets. The kernel counters this with mitigations such as index and pointer masking and speculation barriers inserted by the verifier and JIT, which add some overhead to affected eBPF programs.
Supply chain: A compromised eBPF program loaded by a privileged process can read arbitrary kernel memory, exfiltrate secrets, or modify packet data. Treat eBPF programs as privileged code.

Performance Implications

XDP performance: Cloudflare measures 14–24 Mpps (million packets per second) per core for XDP_DROP, compared to ~4 Mpps for an iptables DROP rule, roughly a 3.5–6x improvement. Exact figures depend on the NIC, driver, and CPU.
kprobe overhead: Attaching to a high-frequency kernel function (e.g., tcp_sendmsg) can add 50–200 ns per call. At millions of calls per second, this is measurable. Use tracepoints where available (lower overhead, more stable).
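Map access latency: A BPF hash map lookup costs roughly 50–100 ns when the entry is warm in the L3 cache. Per-CPU maps avoid cross-CPU contention on shared values and bring this down to roughly 10–20 ns. Treat these as order-of-magnitude figures and measure on your own hardware.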
Tail call overhead: Each tail call is approximately equivalent to a function call; chains of 10+ tail calls add up.
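
To judge whether a per-call cost like the kprobe figure above matters, multiply it by the call rate. The short calculation below does that; the function name and the rate of 2 million calls per second are illustrative assumptions, and the 50 and 200 ns values are the bounds quoted above.

```python
def probe_cpu_fraction(calls_per_second, overhead_ns):
    """Fraction of one CPU core spent on per-call probe overhead."""
    return calls_per_second * overhead_ns * 1e-9

for overhead_ns in (50, 200):
    fraction = probe_cpu_fraction(2_000_000, overhead_ns)
    print(f"{overhead_ns} ns/call at 2M calls/s -> {fraction:.0%} of one core")
```

At that rate the probe alone consumes between about 10% and 40% of a core, which is why high-frequency functions deserve the cheapest available hook.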

Modern Usage

Recent kernels: kfuncs for calling kernel functions from eBPF (introduced in 5.13 and greatly expanded in 6.x), BPF exceptions (structured error handling), BPF arenas (large mmap-backed memory), and sleepable BPF programs (available since 5.10 for LSM and tracing hooks)
bpftool: Production-grade tool for loading, inspecting, and managing eBPF programs and maps
Wasm + eBPF: Proposals to use WebAssembly as an alternative eBPF VM for kernel extension (less mature)

Future Directions

BPF token (initial support merged in Linux 6.9): Allow delegated, fine-grained eBPF permissions without full CAP_BPF, enabling safer multi-tenant eBPF deployment
Verified eBPF: Research into using formal methods to prove eBPF program correctness beyond what the verifier currently checks
eBPF for storage: Experimental work on BPF programs attached to block layer and NVMe driver for custom I/O scheduling
eBPF for HPC: Offloading collective communication primitives (MPI reduce) to eBPF programs on smart NICs

Debugging Notes

# List all loaded eBPF programs
bpftool prog list

# Dump eBPF program instructions (JIT-compiled)
bpftool prog dump jited id <ID>

# Inspect a map
bpftool map dump id <MAP_ID>

# Trace eBPF program execution with debug output
# (requires program to call bpf_trace_printk)
cat /sys/kernel/debug/tracing/trace_pipe

# Check verifier log on load failure
# (use libbpf's verbose mode or bpftool --debug)
bpftool --debug prog load prog.o /sys/fs/bpf/prog

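# Check whether an XDP program is attached and view interface counters
# (driver-level XDP counters, where the driver exposes them: ethtool -S eth0)
ip -s link show dev eth0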

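# Use bpftrace for one-liner tracing
# (if do_sys_open is inlined on your kernel this probe cannot attach or never
#  fires; use the syscalls:sys_enter_openat tracepoint instead)
bpftrace -e 'kprobe:do_sys_open { printf("%s opened %s\n", comm, str(arg1)); }'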

Exercises

Write an XDP program in C that counts incoming packets per source IP address, stores counts in a BPF hash map, and exposes them to a userspace polling program. Use libbpf and CO-RE.
Use bpftrace to write a one-liner that measures the distribution of read() syscall latencies, broken down by process name. What is the p99 for your shell vs a database?
Study the Cilium source code for its kube-proxy replacement. Map out which BPF program handles service VIP → backend translation and what map types it uses.
Compile and run the kernel's eBPF sample programs in samples/bpf/. Modify xdp_drop_kern.c to drop only UDP packets and measure the throughput difference with iperf3.
Read CVE-2022-23222 (verifier type confusion). Explain at a high level what invariant the verifier failed to maintain and how the exploit leveraged it.

References

McCanne, S. & Jacobson, V. "The BSD Packet Filter" (USENIX 1993)
Starovoitov, A. eBPF patch series (LKML 2014): "net: filter: rework/optimize internal BPF interpreter's instruction set"
Gregg, B. "BPF Performance Tools" (Addison-Wesley 2019) — comprehensive reference
Fleming, M. "A thorough introduction to eBPF" (LWN.net 2017)
Nakryiko, A. "BPF CO-RE reference guide" (nakryiko.com 2021)
Cloudflare blog: "L4Drop: XDP DDoS Mitigations" (2018)
Facebook Engineering blog: "Open-sourcing Katran, a scalable network load balancer" (2018)
Linux kernel documentation: Documentation/bpf/
eBPF.io: official eBPF documentation and tutorials