06 — eBPF Observability
Technical Overview
eBPF (extended Berkeley Packet Filter) is a Linux kernel subsystem that allows safe, sandboxed programs to run inside the kernel in response to events — system calls, kernel functions, network packets, hardware performance counters. For observability, eBPF is transformative: you can attach a tracing program to any kernel or userspace function without modifying application code, without recompiling the kernel, and without restarting services. The overhead is typically below 1% for most tracing workloads, making production-safe always-on instrumentation achievable.
eBPF programs are verified by the kernel's in-kernel verifier before execution, ensuring they cannot crash the kernel, loop infinitely, or access arbitrary memory. This safety guarantee is what makes eBPF deployable in production — it is not a raw kernel module.
Prerequisites
- Linux kernel fundamentals (system calls, kernel/userspace boundary)
- Familiarity with C programming at a basic level
- Understanding of network packets and TCP/IP basics
- Comfort with Linux perf tools and profiling concepts
Core Content
eBPF Architecture for Observability
eBPF OBSERVABILITY ARCHITECTURE
User Space Kernel Space
┌────────────────────┐ ┌─────────────────────────────────────────┐
│ │ │ │
│ bpftrace script │ load │ eBPF Program (verified, JIT compiled) │
│ or BCC Python │───────→│ │
│ or libbpf C app │ │ Attach points: │
│ │ │ ┌─────────────────────────────────────┐│
│ Read maps/ │ │ │ kprobe: kernel function entry/exit ││
│ perf buffers │←───────│ │ uprobe: userspace function ││
│ │ ring │ │ tracepoint: static kernel hooks ││
│ Output / │ buffer│ │ perf_event: hardware PMU events ││
│ histogram / │ │ │ socket/XDP: network packets ││
│ visualization │ │ └─────────────────────────────────────┘│
└────────────────────┘ │ │
│ eBPF Maps (shared state): │
│ ┌─────────────────────────────────────┐│
│ │ hash map: per-key accumulation ││
│ │ array: fixed-size slot storage ││
│ │ perf_buf: streaming events to user ││
│ │ ring_buf: low-overhead event stream ││
│ │ histogram: power-of-2 buckets ││
│ └─────────────────────────────────────┘│
└─────────────────────────────────────────┘
The key insight for observability: eBPF programs aggregate data inside the kernel. Instead of sending every raw event to userspace (which would require copying gigabytes of data), the eBPF program maintains histograms and counters in kernel maps and userspace reads only the summaries. This is why eBPF overhead is so low — the data reduction happens before leaving the kernel.
bpftrace: DTrace-like Scripting for eBPF
bpftrace provides a high-level scripting language for ad-hoc eBPF programs, analogous to DTrace on Solaris/macOS. It compiles bpftrace scripts to eBPF bytecode and loads them into the kernel.
bpftrace probe types:
kprobe:function_name # kernel function entry
kretprobe:function_name # kernel function return
tracepoint:subsys:name # static kernel tracepoint
uprobe:/path/to/bin:func # userspace function entry
uretprobe:/path/to/bin:func # userspace function return
usdt:/path/to/bin:probe # USDT static probe
profile:hz:99 # timer-based profiling at 99Hz
interval:s:5 # interval timer
Essential one-liners:
# Trace all execve() calls (what is being executed?)
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%-6d %-16s %s\n", pid, comm, str(args->filename)); }'
# Count system calls by process
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
# Latency histogram for read() syscall
bpftrace -e '
tracepoint:syscalls:sys_enter_read { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_read /@start[tid]/
{
@latency_us = hist((nsecs - @start[tid]) / 1000);
delete(@start[tid]);
}'
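# Trace outbound TCP connections (source address → destination address and port)
bpftrace -e 'kprobe:tcp_connect {
  $sk = (struct sock *)arg0;
  // skc_dport is stored in network byte order; bswap() converts it to host order
  $dport = $sk->__sk_common.skc_dport;
  printf("%-6d %-16s %-16s → %-16s %d\n",
    pid, comm,
    ntop($sk->__sk_common.skc_rcv_saddr),
    ntop($sk->__sk_common.skc_daddr),
    bswap($dport));
}'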
# Trace file opens (which files is this process opening?)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat {
printf("%-6d %-16s %s\n", pid, comm, str(args->filename));
}'
# Measure off-CPU time (time spent waiting, not on CPU)
bpftrace -e '
tracepoint:sched:sched_switch {
if (args->prev_state) {
@offcpu_start[args->prev_pid] = nsecs;
}
if (@offcpu_start[args->next_pid]) {
@offcpu_us[args->next_comm] = hist(
(nsecs - @offcpu_start[args->next_pid]) / 1000
);
delete(@offcpu_start[args->next_pid]);
}
}'
A full bpftrace script for request latency per service (example for TCP servers such as HTTP, measured from accept to close):
#!/usr/bin/bpftrace
// Trace time between accept() and close() as a proxy for request latency.
// The map is keyed by the struct sock pointer value on both sides.
kretprobe:inet_csk_accept
/retval != 0/
{
  @accept_time[retval] = nsecs;
}
kprobe:tcp_close
/@accept_time[arg0]/
{
  // arg0 is the struct sock * being closed; only accepted sockets are in the map
  @latency_us = hist((nsecs - @accept_time[arg0]) / 1000);
  delete(@accept_time[arg0]);
}
interval:s:10 { print(@latency_us); clear(@latency_us); }
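The entry/exit pairing used by the scripts above (store a timestamp in a map on entry, look it up and delete it on exit) can be sketched in plain Python. This is a userspace illustration of the pattern only, not how bpftrace implements it; the event tuples and the function name `pair_latencies` are made up for the example.

```python
def pair_latencies(events):
    """Pair 'enter'/'exit' events by thread id, like @start[tid] in bpftrace.

    events: iterable of (timestamp_ns, tid, kind), kind is "enter" or "exit".
    Returns the list of measured latencies in microseconds.
    """
    start = {}            # plays the role of the @start map
    latencies_us = []
    for ts_ns, tid, kind in events:
        if kind == "enter":
            start[tid] = ts_ns
        elif kind == "exit" and tid in start:   # same role as the /@start[tid]/ filter
            # pop() reads and deletes the entry, like delete(@start[tid])
            latencies_us.append((ts_ns - start.pop(tid)) // 1000)
    return latencies_us


events = [
    (1_000, 7, "enter"),
    (2_000, 8, "exit"),       # exit with no recorded entry: ignored
    (5_000, 9, "enter"),
    (26_000, 7, "exit"),      # 25 us after tid 7 entered
    (9_005_000, 9, "exit"),   # 9000 us after tid 9 entered
]
print(pair_latencies(events))
```

Deleting the entry on exit matters in the real scripts too: without it, the start map grows with every thread or socket ever seen.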
BCC Tools
The BCC (BPF Compiler Collection) toolkit provides production-ready eBPF tools written in Python (with BCC kernel-side C compilation). The BCC/bcc-tools package includes dozens of tools:
# execsnoop: trace every new process exec
execsnoop-bpfcc
# Output:
# PCOMM PID PPID RET ARGS
# python 12345 1001 0 /usr/bin/python3 /usr/local/bin/gunicorn ...
# opensnoop: trace file opens with latency
opensnoop-bpfcc -T
# Output:
# TIME(s) PID COMM FD ERR PATH
# 0.000 1234 nginx 4 0 /etc/nginx/nginx.conf
# biolatency: block I/O latency histogram (kernel-level aggregation)
biolatency-bpfcc 10 # 10 second interval
# Output:
# usecs : count distribution
# 0 -> 1 : 0 | |
# 2 -> 3 : 0 | |
# 4 -> 7 : 1 | |
# 8 -> 15 : 52 |** |
# 16 -> 31 : 2164 |********************** |
# 32 -> 63 : 3123 |********************************|
# 64 -> 127 : 280 |** |
# tcplife: show TCP connection lifetimes with throughput
tcplife-bpfcc
# Output:
# PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
# 1234 nginx 10.0.1.5 59832 10.0.2.3 80 5 143 432
# profile: CPU profiling with stack traces (99Hz sampling)
profile-bpfcc -F 99 -a 30 # 99Hz for 30 seconds; -a annotates kernel frames with _[k]
# offcputime: show time processes spent off-CPU (I/O, lock, sleep)
offcputime-bpfcc -p $(pgrep -f myapp) 10
# cpudist: histogram of on-CPU time per scheduling run (add -P for one histogram per process)
cpudist-bpfcc 10
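Power-of-2 histograms like the biolatency output above only give bucket ranges, but that is usually enough to bound a percentile. The following is a rough sketch that parses such text and reports which bucket contains a given percentile. The helper names are hypothetical, and the regular expression assumes the `low -> high : count` layout shown in the sample output.

```python
import re

BUCKET_LINE = re.compile(r"(\d+)\s*->\s*(\d+)\s*:\s*(\d+)")

def parse_hist(text):
    """Extract (low, high, count) tuples from log2 histogram text."""
    buckets = []
    for line in text.splitlines():
        m = BUCKET_LINE.search(line)
        if m:
            low, high, count = map(int, m.groups())
            buckets.append((low, high, count))
    return buckets

def percentile_bucket(buckets, p):
    """Return the (low, high) bucket that contains the p-th percentile."""
    total = sum(count for _, _, count in buckets)
    if total == 0:
        return None
    target = total * p / 100.0
    seen = 0
    for low, high, count in buckets:
        seen += count
        if count and seen >= target:
            return (low, high)
    return None

sample = """
# 0 -> 1 : 0 | |
# 2 -> 3 : 0 | |
# 4 -> 7 : 1 | |
# 8 -> 15 : 52 |** |
# 16 -> 31 : 2164 |********************** |
# 32 -> 63 : 3123 |********************************|
# 64 -> 127 : 280 |** |
"""
buckets = parse_hist(sample)
print("p50 bucket (usecs):", percentile_bucket(buckets, 50))
print("p99 bucket (usecs):", percentile_bucket(buckets, 99))
```

The answer is a range rather than a point value; that loss of precision is the price of keeping only a few counters in the kernel.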
eBPF Maps for In-Kernel Aggregation
The power of eBPF observability comes from aggregating in the kernel. Consider measuring block I/O latency:
Without eBPF aggregation: every I/O event (potentially millions/second) is sent to userspace as a raw event. Userspace must buffer, aggregate, and compute histograms from millions of records.
With eBPF maps: the eBPF program maintains a histogram in a BPF array map in kernel memory. Each I/O event updates a single bucket counter (one atomic increment). Userspace reads the complete histogram once per second — hundreds of integers instead of millions of events.
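// eBPF kernel-side C (simplified biolatency example, libbpf style)
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 64);
    __type(key, u32);
    __type(value, u64);
} hist SEC(".maps");

// The probed symbol varies across kernel versions; check /proc/kallsyms first.
SEC("kprobe/blk_account_io_completion")
int trace_req_done(struct pt_regs *ctx) {
    u64 delta_us = /* compute latency */;
    u32 slot = log2l(delta_us); // log2l(): bucket helper, e.g. from bcc libbpf-tools bits.bpf.h
    if (slot >= 64) slot = 63;
    // A map cannot be indexed like a C array: look up the slot, then increment atomically.
    u64 *count = bpf_map_lookup_elem(&hist, &slot);
    if (count)
        __sync_fetch_and_add(count, 1);
    return 0;
}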
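To make the bucket arithmetic in the kernel-side example concrete, here is a small userspace simulation in Python. It assumes the bucket helper computes floor(log2(value)) with 0 mapped to slot 0, which matches the `0 -> 1`, `2 -> 3`, `4 -> 7` ranges in the biolatency output; the function names are illustrative and nothing here runs in the kernel.

```python
def log2_slot(value, max_slots=64):
    """Slot index for a power-of-2 histogram: floor(log2(value)), clamped."""
    slot = max(value.bit_length() - 1, 0)   # 0 and 1 both land in slot 0
    return min(slot, max_slots - 1)         # same clamp as `if (slot >= 64) slot = 63`

def build_histogram(values, max_slots=64):
    """Fold any number of samples into a fixed array of counters."""
    hist = [0] * max_slots
    for v in values:
        hist[log2_slot(v, max_slots)] += 1
    return hist

latencies_us = [0, 1, 3, 20, 25, 40, 63, 64, 100]
hist = build_histogram(latencies_us)
for slot, count in enumerate(hist):
    if count:
        low = 0 if slot == 0 else 1 << slot
        high = (1 << (slot + 1)) - 1
        print(f"{low:>6} -> {high:<6} : {count}")
```

However many samples are folded in, the structure read by userspace stays at 64 counters.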
eBPF Overhead
eBPF overhead is remarkably low:
- Event-driven probes (kprobe, tracepoint): overhead proportional to event frequency. Tracing read() at 1M syscalls/sec adds ~1-3% CPU. Tracing a rarely-called function: negligible.
- Profiling probes (99Hz timer): 99 stack captures/second per CPU. ~0.5% overhead regardless of application activity.
- In-kernel aggregation (histograms, counters): a single atomic increment per event. ~10ns overhead per event.
- Memory: BPF maps are preallocated. A 64-bucket histogram array = 64 * 8 bytes = 512 bytes of kernel memory.
Compare with: Valgrind (~20x slowdown), strace (~20-100x slowdown for traced processes), DTrace (low overhead but kernel-version dependent), SystemTap (low overhead but requires kernel-devel packages).
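The figures above follow from simple arithmetic: probe cost is roughly event rate times per-event cost, and sampling cost depends only on frequency and CPU count. A back-of-the-envelope sketch follows; the 10 ns per-event cost comes from the text, while the 20 ns figure and the 16-CPU host are assumptions chosen for illustration.

```python
def probe_cpu_fraction(events_per_sec, cost_ns_per_event):
    """Fraction of one CPU core spent executing the probe."""
    return events_per_sec * cost_ns_per_event / 1e9

def profiler_samples(hz, n_cpus, seconds):
    """Number of stack samples a timer-based profiler collects."""
    return hz * n_cpus * seconds

# 1M events/sec at ~10 ns each (in-kernel counter update only)
print(f"{probe_cpu_fraction(1_000_000, 10):.1%} of one core")
# Same rate with an assumed 20 ns per event (probe entry plus map update)
print(f"{probe_cpu_fraction(1_000_000, 20):.1%} of one core")
# 99 Hz profiling on an assumed 16-CPU host for 30 seconds
print(profiler_samples(99, 16, 30), "stack samples")
```

The second function shows why profiling is a fixed cost: the application's own activity does not appear anywhere in the formula.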
Continuous Profiling with eBPF
Continuous profiling means always-on CPU (and memory) profiling in production — not just during incidents. This surfaces performance regressions before they become user-visible incidents.
Tools:
- Parca (open source): stores profiles as time-series in a purpose-built profile database. Query like Prometheus: "show me the CPU profile for this service at 14:30 UTC yesterday." Written in Go, uses libbpf for eBPF collection.
- Pyroscope (acquired by Grafana Labs in 2023): similar to Parca. Supports eBPF, Go, Java (async-profiler), Python, Ruby. Storage: Parquet files in object storage.
- Polar Signals Cloud: commercial offering based on Parca.
- Grafana Pyroscope: Grafana's integrated profiling solution, built on Pyroscope, supporting flame graph visualization within the Grafana UI alongside traces and metrics.
# Install Parca agent (eBPF-based, runs as DaemonSet)
kubectl apply -f https://github.com/parca-dev/parca-agent/releases/latest/download/kubernetes-manifest.yaml
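# Query a CPU profile for a specific service (illustrative command; check the Parca docs for the query interface your version provides, typically the web UI or API)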
parcactl profile query \
--node="$(kubectl get node -o jsonpath='{.items[0].metadata.name}')" \
--from="$(date -d '10 minutes ago' --utc +%s)000000000" \
--to="$(date --utc +%s)000000000" \
--selector='{comm="myservice"}' \
--output flamegraph
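The regression detection that continuous profiling enables comes down to comparing two profiles of the same service. The sketch below diffs two profiles in the folded-stack text format (`frame;frame;frame count`) by each function's share of self time. It is a simplified illustration, not Parca's or Pyroscope's algorithm, and the stack names are invented.

```python
from collections import Counter

def parse_folded(text):
    """Parse 'a;b;c 42' lines into a Counter of stack -> sample count."""
    samples = Counter()
    for line in text.strip().splitlines():
        stack, _, count = line.rpartition(" ")
        samples[stack] += int(count)
    return samples

def self_time_by_function(samples):
    """Attribute each sample to the leaf frame (the function on CPU)."""
    per_func = Counter()
    for stack, count in samples.items():
        per_func[stack.split(";")[-1]] += count
    return per_func

def diff_profiles(before, after):
    """Change in each function's share of samples, largest increase first."""
    b, a = self_time_by_function(before), self_time_by_function(after)
    total_b = sum(b.values()) or 1
    total_a = sum(a.values()) or 1
    deltas = {f: a[f] / total_a - b[f] / total_b for f in set(a) | set(b)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

before = parse_folded("main;handle;parse 50\nmain;handle;query 50")
after = parse_folded("main;handle;parse 50\nmain;handle;query 150")
for func, delta in diff_profiles(before, after):
    print(f"{func:<8} {delta:+.0%}")
```

Comparing shares rather than raw counts keeps the diff meaningful when the two windows contain different numbers of samples.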
CPU Profiling: Frame Pointer vs DWARF Unwinding
Stack unwinding is the process of reconstructing the call stack from the current instruction pointer. Two approaches:
Frame pointers (-fno-omit-frame-pointer): each function preserves the frame pointer register (RBP on x86-64), creating a linked list of stack frames. Unwinding is fast: follow the chain of frame pointers. Overhead: ~1-2% (one register that could otherwise hold data is reserved). The eBPF stack helpers use frame pointer unwinding, and so does Linux perf by default.
DWARF unwinding: uses DWARF debug info (.eh_frame section) to unwind stacks. Works for code compiled without frame pointers. But DWARF unwinding in eBPF is complex (DWARF is complex), slow (requires large tables), and eBPF verifier constraints make it hard to implement. The eBPF bpf_get_stackid() helper uses frame pointers.
Linux distributions (Ubuntu 24.04+, Fedora 38+) are shipping production binaries compiled with frame pointers enabled (at ~1% cost) to improve eBPF-based profiling. Previously, perf and Parca agents would produce broken stack traces for many processes due to missing frame pointers.
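Frame-pointer unwinding is a linked-list walk, which the following Python simulation illustrates. It models x86-64 conventions (the saved caller RBP sits at `[rbp]`, the return address at `[rbp + 8]`) over a fake memory dictionary. The addresses are invented and the depth limit of 127 is an assumption borrowed from perf's default maximum stack depth.

```python
def unwind(memory, rip, rbp, max_depth=127):
    """Walk the frame-pointer chain and return the list of code addresses.

    memory: dict mapping stack addresses to 8-byte values (simulated stack).
    rip/rbp: current instruction pointer and frame pointer.
    """
    frames = [rip]
    while rbp != 0 and len(frames) < max_depth:
        saved_rbp = memory.get(rbp)        # caller's frame pointer
        ret_addr = memory.get(rbp + 8)     # return address into the caller
        if saved_rbp is None or ret_addr is None:
            break   # chain is broken, e.g. a frame compiled without frame pointers
        frames.append(ret_addr)
        rbp = saved_rbp
    return frames

# Simulated stack: leaf function -> handle() -> main()
memory = {
    0x7000: 0x7100, 0x7008: 0x401200,   # leaf frame: caller rbp, return into handle
    0x7100: 0x0,    0x7108: 0x401000,   # handle frame: end of chain, return into main
}
stack = unwind(memory, rip=0x401350, rbp=0x7000)
print([hex(addr) for addr in stack])
```

If any function in the chain was compiled with the frame pointer omitted, RBP holds arbitrary data and the walk stops early or goes astray, which is the source of the truncated stacks described above.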
Network Observability via eBPF: Cilium Hubble
Cilium is a Kubernetes CNI (Container Network Interface) plugin built on eBPF. Hubble is Cilium's network observability component.
What Hubble provides:
- L3/L4 flow visibility: every TCP connection, its source/destination, latency, bytes transferred
- L7 application visibility: HTTP method/URL/status, gRPC method/status, DNS queries/responses, without any application instrumentation. When L7 visibility is enabled, the eBPF datapath redirects the selected traffic to Cilium's node-local proxy (Envoy, or the DNS proxy), which parses the protocol
- Service dependency graph from observed flows
- Network policy enforcement audit log
- Flow-based alerting (network-level anomaly detection)
# Observe all HTTP flows in a Kubernetes namespace
hubble observe --namespace payments --protocol http --last 100
# Sample output:
# TIMESTAMP SOURCE DESTINATION TYPE VERDICT
# 2024-01-15T14:32:01Z payments/checkout-pod-abc payments/postgres-svc:5432 TCP FORWARDED
# 2024-01-15T14:32:01Z payments/checkout-pod-abc payments/redis-svc:6379 TCP FORWARDED
# 2024-01-15T14:32:02Z payments/api-pod-xyz payments/checkout-svc:8080 HTTP GET FORWARDED
# http: path=/api/checkout status=200 duration=123ms
# DNS observation (what is this pod resolving?)
hubble observe --namespace payments --protocol dns
# Drop reason analysis
hubble observe --verdict DROPPED --last 200
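For drop analysis beyond eyeballing the terminal, the flows can be exported as JSON (for example with `hubble observe --verdict DROPPED -o json`) and summarized. The sketch below counts drops per source, destination, and reason. The field names (`flow`, `verdict`, `source.pod_name`, `drop_reason_desc`) are assumptions about the JSON layout and should be checked against the output of your Hubble version; the sample records are synthetic.

```python
import json
from collections import Counter

def count_drops(json_lines):
    """Count DROPPED flows grouped by (source, destination, reason)."""
    drops = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        flow = record.get("flow", record)   # accept wrapped or bare flow objects
        if flow.get("verdict") != "DROPPED":
            continue
        src = flow.get("source", {})
        dst = flow.get("destination", {})
        key = (
            f'{src.get("namespace", "?")}/{src.get("pod_name", "?")}',
            f'{dst.get("namespace", "?")}/{dst.get("pod_name", "?")}',
            flow.get("drop_reason_desc", "UNKNOWN"),   # assumed field name
        )
        drops[key] += 1
    return drops

def make_flow(verdict, src_pod, dst_pod, reason=None):
    """Build a synthetic record in the assumed layout (for the demo only)."""
    flow = {
        "verdict": verdict,
        "source": {"namespace": "payments", "pod_name": src_pod},
        "destination": {"namespace": "payments", "pod_name": dst_pod},
    }
    if reason:
        flow["drop_reason_desc"] = reason
    return json.dumps({"flow": flow})

sample = [
    make_flow("DROPPED", "api-pod-xyz", "postgres-0", "POLICY_DENIED"),
    make_flow("FORWARDED", "api-pod-xyz", "redis-0"),
    make_flow("DROPPED", "api-pod-xyz", "postgres-0", "POLICY_DENIED"),
]
for (src, dst, reason), n in count_drops(sample).most_common():
    print(f"{n:>4}  {src} -> {dst}  {reason}")
```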
Production eBPF Observability at Scale
Cloudflare: uses eBPF extensively for network monitoring at 100Gbps+ scale. They wrote multiple blog posts about using bpftrace and custom eBPF programs to debug DDoS patterns, connection rate limiting, and TCP stack issues. Their post "Programmable Packet Filtering at the Linux kernel level" (2017) was influential.
Meta (Facebook): uses continuous eBPF profiling fleet-wide. Their Strobelight system profiles all production services at 10Hz, continuously. This data feeds into their performance regression detection system: every production change is compared against the eBPF profile before/after.
Netflix: Brendan Gregg (author of "BPF Performance Tools") developed many of the canonical BCC tools while at Netflix. Their production use cases include: disk I/O latency analysis, TCP retransmit analysis, CPU flame graph generation for any process on demand.
Historical Context
BPF (Berkeley Packet Filter) was created by Steven McCanne and Van Jacobson in 1992 for efficient network packet filtering in tcpdump. The original BPF was a simple register machine for matching packets.
eBPF (extended BPF) was introduced in Linux 3.18 (2014) by Alexei Starovoitov and Daniel Borkmann. The "extended" refers to a completely redesigned ISA: 64-bit registers, 10 general-purpose registers, more memory, and a verifier that proves safety. eBPF evolved rapidly: socket filters (3.19), kprobes (4.1), tracepoints (4.7), XDP (4.8), perf events (4.9), and socket operations (4.13).
The BCC toolkit (Brenden Blanco, Brendan Gregg et al., 2015) made eBPF accessible to sysadmins and SREs who didn't want to write raw C eBPF programs. bpftrace (Alastair Robertson, 2018) further simplified ad-hoc eBPF tracing with DTrace-inspired syntax. Brendan Gregg's book "BPF Performance Tools" (2019) systematized the entire field.
Debugging Notes
"cannot attach kprobe" errors: The function name doesn't exist in the running kernel (for example, it was renamed between kernel versions). Use grep -w function_name /proc/kallsyms, or bpftrace -l 'kprobe:function_name', to verify the symbol exists. Kernel functions may also be inlined (invisible to kprobes); use tracepoints instead when available.
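The /proc/kallsyms check can be scripted, for instance as a pre-flight step before loading probes. This sketch parses kallsyms-style text (`address type name [module]`) and distinguishes an exact match from compiler-generated variants such as `.isra.0` or `.cold`, which usually mean the function was partially inlined or split. The function name is hypothetical and the sample text is synthetic.

```python
def find_symbols(kallsyms_text, name):
    """Return (exact_matches, suffixed_variants) among text (code) symbols."""
    exact, variants = [], []
    for line in kallsyms_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        sym_type, sym = fields[1], fields[2]
        if sym_type not in ("t", "T"):      # keep code symbols only
            continue
        if sym == name:
            exact.append(sym)
        elif sym.startswith(name + "."):    # e.g. foo.isra.0, foo.cold
            variants.append(sym)
    return exact, variants

sample = """\
ffffffff81a3b2c0 T tcp_connect
ffffffff81a3c100 t tcp_close_state.isra.0
ffffffff82345678 D tcp_hashinfo
ffffffffc0123000 t nf_hook_entry\t[nf_tables]
"""
print(find_symbols(sample, "tcp_connect"))
print(find_symbols(sample, "tcp_close_state"))
print(find_symbols(sample, "blk_account_io_done"))

# On a live system (reading real addresses may require root):
# with open("/proc/kallsyms") as f:
#     print(find_symbols(f.read(), "tcp_connect"))
```

A symbol that appears only as a suffixed variant is a hint to prefer a tracepoint, since such names can change between builds.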
Verifier rejections: The eBPF verifier is strict. Common rejections:
- Unbounded loops (use bounded loops or bpf_loop() helper)
- Null pointer dereference without null check (verifier requires explicit null checks before pointer dereference)
- Stack too deep (512 bytes max per eBPF program)
# Check verifier log for a failed load
bpftool prog load myprogram.o /sys/fs/bpf/myprogram type kprobe 2>&1
Missing stack frames: Compiled without frame pointers. Recompile with -fno-omit-frame-pointer or use DWARF-based unwinding (if your tool supports it). Alternatively, run parca-agent with DWARF unwinding mode.
Security Implications
- eBPF programs require CAP_BPF (Linux 5.8+, typically combined with CAP_PERFMON for tracing programs) or CAP_SYS_ADMIN. In Kubernetes, running privileged containers for eBPF agents is a significant security risk: a compromised eBPF agent can read all kernel memory.
- eBPF programs can observe all network traffic (including traffic from other containers on the same host). In multi-tenant environments, ensure eBPF observability agents are scoped per-namespace or that multi-tenancy is enforced at the Kubernetes level.
- The eBPF verifier prevents malicious programs from crashing the kernel, but cannot prevent programs from reading sensitive data (passwords in syscall arguments, encryption keys in memory). Privileged eBPF = kernel-level access.
- Unprivileged eBPF (/proc/sys/kernel/unprivileged_bpf_disabled) is disabled on most distributions by default (and should remain disabled) due to historical privilege escalation vulnerabilities in the verifier.
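The unprivileged_bpf_disabled sysctl takes three values, and a host audit can interpret them as sketched below. The value meanings follow the kernel's sysctl documentation as I understand it; the helper names and wording of the descriptions are my own.

```python
from pathlib import Path

SYSCTL = Path("/proc/sys/kernel/unprivileged_bpf_disabled")

MEANINGS = {
    0: "unprivileged bpf() allowed (risky on shared hosts)",
    1: "unprivileged bpf() disabled, locked until reboot",
    2: "unprivileged bpf() disabled, an administrator may still change it",
}

def describe(value):
    """Human-readable meaning of a sysctl value."""
    return MEANINGS.get(value, f"unrecognized value {value}")

def read_setting(path=SYSCTL):
    """Return the current value, or None if the file is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

if __name__ == "__main__":
    value = read_setting()
    print("sysctl not available" if value is None else describe(value))
```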
Performance Implications
- Kprobe overhead scales linearly with event rate. At 1M events/second, expect 1-5% CPU. At 100k events/second, less than 0.5%.
- Profile-based eBPF (99Hz) is fixed-cost: ~0.5-1% CPU for stack capture + symbolization.
- eBPF map reads from userspace require a bpf() syscall, which is not zero cost. For real-time dashboards, read maps every second; for profiling data, read every 10-30 seconds.
- bpf_ringbuf_output() is more efficient than bpf_perf_event_output() for high-frequency event streaming. Ring buffers have lower per-event overhead and better memory efficiency.
Failure Modes and Real Incidents
kprobe instability on kernel upgrade: A team pinned their bpftrace scripts to specific kernel function names. After a kernel upgrade, several functions were renamed or inlined, causing all kprobe-based monitors to silently fail. Fix: use tracepoints (stable API) over kprobes when available; version-test eBPF scripts in CI against all supported kernel versions.
eBPF agent OOM: A Parca agent with DWARF unwinding mode exhausted memory on a node with many processes. DWARF table loading for every binary on the system consumed several GB. Fix: configure --profiling-normalizer-addr-normalizer-cache-size limits; use frame-pointer mode for nodes with limited memory.
Modern Usage
- Cilium + Hubble: standard eBPF CNI for production Kubernetes. Provides network policy enforcement and L7 observability simultaneously.
- Grafana Beyla (2023): eBPF-based automatic instrumentation for HTTP, gRPC, and database calls, with no code changes and no language-specific agent. Exports spans and metrics in OTel format.
- Tetragon (Isovalent): eBPF-based security observability and runtime enforcement (detect cryptomining, privilege escalation, data exfiltration at the kernel level).
Future Directions
- eBPF beyond Linux: eBPF is being ported to Windows (Microsoft's eBPF-for-Windows). When mature, this brings zero-instrumentation observability to Windows workloads.
- eBPF CO-RE (Compile Once, Run Everywhere): libbpf + BTF (BPF Type Format) allows a single compiled eBPF program to run across kernel versions without recompilation, provided the kernel exposes BTF (available from roughly 5.2 onward when built with CONFIG_DEBUG_INFO_BTF). This solves the kernel-version fragmentation problem.
- Userspace eBPF: running eBPF programs in user space via uBPF for testing and portable execution.
Exercises
- bpftrace exploration: On a Linux system with bpftrace installed, trace all open()/openat() system calls made by a running nginx or Apache process. Capture the filenames and access times. Write the results to a file every 10 seconds using interval:s:10.
- Histogram vs raw events: Write two versions of a program to measure block I/O latency: one that prints every raw I/O event using bpf_perf_event_output(), and one that builds a histogram in an eBPF map. Generate 10,000 I/Os and compare: CPU overhead, memory usage, data volume transferred to userspace, and the latency distribution accuracy.
- Off-CPU analysis: Use offcputime-bpfcc (or a bpftrace equivalent) to profile a Java application. Identify the top-3 reasons the application goes off-CPU (waiting on I/O, locks, sleep). Compare with the CPU flame graph from profile-bpfcc. What percentage of total time is the application on-CPU vs off-CPU?
- Hubble network tracing: In a Kubernetes cluster with Cilium, use hubble observe to trace HTTP flows between two services. Identify any failed connections (DROPPED verdict). Write a Hubble policy that alerts (via Prometheus metric) when any service receives more than 100 connection resets per minute.
- Continuous profiling comparison: Deploy Parca agent as a DaemonSet. Run a CPU-intensive application for 30 minutes with a known performance regression introduced at t=15min (e.g., an N+1 database query). Using Parca's differential flame graph feature, identify exactly which function's CPU time increased at t=15min.
References
- Gregg, Brendan. BPF Performance Tools. Addison-Wesley, 2019. (The definitive reference)
- Gregg, Brendan. Systems Performance. 2nd ed. Addison-Wesley, 2020. Chapter 15 (eBPF).
- Corbet, Jonathan. "BPF: the universal in-kernel virtual machine." LWN.net, 2014.
- Cloudflare Blog. "BPF: the next frontier." 2018. https://blog.cloudflare.com/
- bpftrace Reference Guide: https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
- BCC Tools: https://github.com/iovisor/bcc
- Cilium/Hubble: https://docs.cilium.io/en/stable/observability/hubble/
- Parca Documentation: https://www.parca.dev/docs/
- libbpf CO-RE: https://nakryiko.com/posts/bpf-core-reference-guide/