04 — eBPF Debugging Tools
Technical Overview
eBPF transforms kernel debugging from a specialist skill requiring kernel patches or reboots into an on-demand, production-safe capability. The tooling ecosystem — bpftrace for ad-hoc scripts, BCC for Python-driven programs, and libbpf for compiled production-grade tools — covers nearly every observable kernel event. For a production SRE, eBPF tools answer questions that previously required deep kernel expertise, custom patches, or strace (with its 20-100x slowdown): which processes are making which syscalls, which files are being opened, what is the latency of every disk I/O, which TCP connections are dying and why.
This chapter covers the practical taxonomy of eBPF debugging tools, the bpftrace language, key BCC tools and their use in production debugging, and the USDT (User Statically Defined Traces) mechanism that allows zero-instrumentation debugging of language runtimes.
Prerequisites
- Linux kernel 4.9+ (most features require 4.15+; bpftrace requires 4.9+, best at 5.3+)
- CONFIG_BPF=y, CONFIG_BPF_SYSCALL=y, CONFIG_BPF_JIT=y (most distros ship with these)
- Root or CAP_BPF + CAP_PERFMON (Linux 5.8+)
- Familiarity with Linux system calls, kernel functions, and C structs
- Basic understanding of eBPF concepts (see 06-ebpf-observability.md)
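The version and BTF prerequisites above can be checked with a short script. The following is a minimal Python sketch, not part of any eBPF toolchain: the names parse_kernel_release and check_prerequisites are hypothetical, and it looks only at the kernel release string and the BTF file, not at kernel config options or capabilities.

```python
import os
import platform
import re


def parse_kernel_release(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def check_prerequisites(release: str, btf_path: str = "/sys/kernel/btf/vmlinux") -> dict:
    """Compare the kernel version with the thresholds listed in the prerequisites."""
    version = parse_kernel_release(release)
    return {
        "kernel_4_9_or_newer": version >= (4, 9),
        "kernel_5_8_or_newer": version >= (5, 8),  # CAP_BPF + CAP_PERFMON exist from 5.8
        "btf_available": os.path.exists(btf_path),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites(platform.release()).items():
        print(f"{name}: {'yes' if ok else 'no'}")
```

A "no" for btf_available does not rule out bpftrace or BCC; it only means struct access will need kernel headers instead of BTF.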
Core Content
eBPF Tool Taxonomy
EBPF DEBUGGING TOOL ECOSYSTEM
Abstraction Level
(High)
│
│ bpftrace Shell-like scripting language. Ad-hoc investigation.
│ ───────────── One-liners to multi-page scripts. Compiled to eBPF at runtime.
│ Best for: exploration, one-off questions
│
│ BCC tools Python frontend + kernel-side C compiled at runtime.
│ ───────────── Pre-built tools for common use cases.
│ Best for: standard debugging tasks (execsnoop, biolatency, etc.)
│
│ libbpf (CO-RE) C/C++ eBPF programs compiled ahead-of-time.
│ ───────────── Portable via BTF/CO-RE. Production-grade agents.
│ Best for: always-on tools, production deployment
│
│ Raw bpf() Direct syscall programming.
│ ───────────── No framework overhead.
(Low) Best for: extreme performance needs (rarely needed)
bpftrace Language
bpftrace programs have the structure: probe /filter/ { action }. Multiple probes can coexist in one program.
Probe types:
kprobe:function # kernel function entry (dynamic, fragile)
kretprobe:function # kernel function return, accesses retval
tracepoint:category:name # static kernel tracepoint (stable)
uprobe:binary:function # user-space function entry
uretprobe:binary:function # user-space function return
usdt:binary:probe_name # USDT static probe in binary
hardware:event_name # hardware PMU event
software:event_name # software event (page-faults, etc.)
profile:hz:N # timer sampling at N Hz (CPU profiling)
interval:s:N # periodic interval, N seconds
BEGIN # runs once at program start
END # runs once at program exit (Ctrl+C)
Built-in variables:
pid # process ID of current process
tid # thread ID
uid # user ID
gid # group ID
comm # process name (comm string, max 16 chars)
nsecs # current timestamp in nanoseconds
cpu # CPU number
curtask # pointer to current task_struct
args # tracepoint arguments struct (kprobe/uprobe arguments are arg0..argN)
retval # return value (kretprobe/uretprobe only)
func # current function name
probe # full probe specifier
Maps (aggregation structures):
# Named map: @map_name[key] = value
@latency[pid] = nsecs # assign
@count[comm]++ # increment
delete(@latency[pid]) # remove key
clear(@count) # clear all keys
# Histogram: auto-binning into power-of-2 buckets
@hist = hist(value) # power-of-2 histogram
@lhist = lhist(value, 0, 100, 10) # linear histogram: min=0, max=100, step=10
# Aggregation functions
@sum = sum(value)
@count = count()
@min = min(value)
@max = max(value)
@avg = avg(value)
# Print all maps at program exit
// By default, all maps are printed at END
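To make the power-of-2 binning of hist() concrete, here is a small Python sketch of how a value maps to a bucket. It illustrates the bucketing rule only and is not bpftrace's implementation; hist_bucket and summarize are hypothetical names, the ranges are shown inclusively (as in the BCC histograms later in this chapter), and negative values are left out.

```python
from collections import Counter


def hist_bucket(value: int) -> tuple:
    """Return the inclusive (low, high) range of the power-of-2 bucket holding value."""
    if value < 0:
        raise ValueError("negative values are not handled in this sketch")
    if value == 0:
        return (0, 0)
    low = 1 << (value.bit_length() - 1)  # largest power of two <= value
    return (low, 2 * low - 1)


def summarize(values) -> dict:
    """Count how many values fall into each bucket, ordered by bucket start."""
    counts = Counter(hist_bucket(v) for v in values)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    for (low, high), count in summarize([0, 1, 3, 5, 6, 7, 900, 1500]).items():
        print(f"{low:>6} -> {high:<6} : {count}")
```

Because bucket width doubles each step, a histogram covers nanoseconds to seconds in a few dozen rows, at the cost of coarse resolution for large values.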
Complete bpftrace programs:
#!/usr/bin/bpftrace
// Program 1: Trace all execve() calls with command line arguments
tracepoint:syscalls:sys_enter_execve
{
printf("%-6d %-16s ", pid, comm);
join(args->argv); // prints the argument vector, space-separated, then a newline
}
// Program 2: Measure read() system call latency per process
tracepoint:syscalls:sys_enter_read
{
@start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_read
/@start[tid]/ // filter: only process if we saw the entry
{
$duration_us = (nsecs - @start[tid]) / 1000;
@latency_us[comm] = hist($duration_us);
delete(@start[tid]);
}
END
{
printf("\nRead() latency distribution per process:\n");
print(@latency_us);
clear(@latency_us);
}
#!/usr/bin/bpftrace
// Program 3: TCP connection lifecycle (established connections only)
// Note: state changes often run in softirq context, so pid/comm may not be the socket owner
kprobe:tcp_set_state
{
$sk = (struct sock *)arg0;
$state = arg1;
// 1 = TCP_ESTABLISHED
if ($state == 1) {
printf("ESTABLISHED pid=%-6d comm=%-16s %s:%d -> %s:%d\n",
pid, comm,
ntop(AF_INET, $sk->__sk_common.skc_rcv_saddr),
$sk->__sk_common.skc_num,
ntop(AF_INET, $sk->__sk_common.skc_daddr),
bswap($sk->__sk_common.skc_dport)); // skc_dport is in network byte order
}
// 7 = TCP_CLOSE
if ($state == 7) {
printf("CLOSE pid=%-6d comm=%-16s %s:%d -> %s:%d\n",
pid, comm,
ntop(AF_INET, $sk->__sk_common.skc_rcv_saddr),
$sk->__sk_common.skc_num,
ntop(AF_INET, $sk->__sk_common.skc_daddr),
bswap($sk->__sk_common.skc_dport));
}
}
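The byte swap in Program 3 is needed because the kernel stores skc_dport and the IPv4 addresses in network byte order, while skc_num is already in host order. The Python sketch below reproduces the conversion for values as a little-endian host would read them from memory. The helper names swap16 and ipv4_from_le_u32 are hypothetical, and the sample values are made up for illustration.

```python
import socket
import struct


def swap16(raw: int) -> int:
    """Swap the two bytes of a 16-bit value (network order <-> little-endian host order)."""
    return ((raw & 0xFF) << 8) | ((raw >> 8) & 0xFF)


def ipv4_from_le_u32(raw: int) -> str:
    """Format a network-order IPv4 address that was read as a little-endian 32-bit integer."""
    return socket.inet_ntoa(struct.pack("<I", raw))


if __name__ == "__main__":
    # Port 443 is 0x01BB; its network-order bytes 01 BB read as little-endian give 0xBB01.
    print(swap16(0xBB01))
    # 10.0.1.5 is bytes 0A 00 01 05; read as little-endian that is 0x0501000A.
    print(ipv4_from_le_u32(0x0501000A))
```

Forgetting the swap is a common source of nonsense port numbers in ad-hoc scripts: 443 shows up as 47873.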
#!/usr/bin/bpftrace
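// Program 4: Detect mutex contention (off-CPU waiting on mutex)
// Traces the time processes spend waiting to acquire a mutex
// __mutex_lock_slowpath is a static kernel function; confirm it can be probed with:
// bpftrace -l 'kprobe:__mutex_lock_slowpath'
kprobe:__mutex_lock_slowpath
{
@waiting_since[tid] = nsecs;
@waiting_comm[tid] = comm;
}
kretprobe:__mutex_lock_slowpath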
/@waiting_since[tid]/
{
$wait_us = (nsecs - @waiting_since[tid]) / 1000;
if ($wait_us > 100) { // only show waits > 100 microseconds
printf("CONTENTION comm=%-16s pid=%-6d waited=%d us\n",
@waiting_comm[tid], pid, $wait_us);
}
delete(@waiting_since[tid]);
delete(@waiting_comm[tid]);
}
kprobe vs Tracepoint vs USDT
kprobe (kernel probe): attaches to almost any kernel function by name. Dynamic: added at runtime without kernel recompilation. Fragile: if the function is renamed or removed in a kernel update, the kprobe fails to attach, and if the compiler inlines it at some call sites, the probe silently misses those calls. Use when no tracepoint exists.
# Check if a kprobe target exists
bpftrace -l 'kprobe:tcp_connect'
# Output: kprobe:tcp_connect (if the function exists)
# No output = function doesn't exist in this kernel
# Tracepoints are stable API; kprobes are not
bpftrace -l 'tracepoint:*' | grep "net:" # list all net: tracepoints
Tracepoint: static, defined in kernel source with TRACE_EVENT() macro. Stable API maintained across kernel versions. Preferred over kprobes when available. Has a defined argument structure (args->) accessible without struct casting.
# List all syscall tracepoints
bpftrace -l 'tracepoint:syscalls:*' | wc -l
# ~700 syscall tracepoints available
# List all available tracepoints
bpftrace -l 'tracepoint:*' | wc -l
# ~2000+ tracepoints in a typical production kernel
USDT (User Statically Defined Traces): static probes compiled into user-space binaries and language runtimes. Stable API for probing language internals.
Available in:
- CPython (when built with --with-dtrace): python:function__entry, python:function__return, python:gc__start, python:gc__done, python:import__find__load__start
- Node.js: node:http__server__request, node:http__client__request, node:gc__start, node:gc__done
- Ruby: ruby:method__entry, ruby:gc__mark__begin
- JVM (HotSpot builds with DTrace probes enabled; the probes live in libjvm.so, and method probes additionally need -XX:+ExtendedDTraceProbes): hotspot:method__entry, hotspot:gc__begin
- PostgreSQL: postgresql:query__start, postgresql:query__done, postgresql:lock__wait__start
- MySQL: mysql:query__start, mysql:query__done
# List available USDT probes in a binary
bpftrace -l 'usdt:/usr/bin/python3:*'
# Output:
# usdt:/usr/bin/python3:python:function__entry
# usdt:/usr/bin/python3:python:function__return
# usdt:/usr/bin/python3:python:gc__start
# ...
# Trace Python function calls
bpftrace -e '
usdt:/usr/bin/python3:python:function__entry
{
printf("%-6d %-10s %s:%d %s\n",
pid, comm,
str(arg0), // filename
arg2, // line number
str(arg1)); // function name
}'
# Trace PostgreSQL slow queries (> 100ms)
bpftrace -e '
usdt:/usr/lib/postgresql/14/bin/postgres:postgresql:query__start
{
@query_start[pid] = nsecs;
@query[pid] = str(arg0); // SQL query string
}
usdt:/usr/lib/postgresql/14/bin/postgres:postgresql:query__done
/@query_start[pid]/
{
$duration_ms = (nsecs - @query_start[pid]) / 1000000;
if ($duration_ms > 100) {
printf("SLOW QUERY: %dms: %s\n", $duration_ms, @query[pid]);
}
delete(@query_start[pid]);
delete(@query[pid]);
}'
BCC Tools Reference
BCC (BPF Compiler Collection) provides dozens of pre-built tools. These are in /usr/share/bcc/tools/ or invoked as toolname-bpfcc on Debian/Ubuntu:
Process and execution:
# execsnoop: every new process execution
execsnoop-bpfcc
# PID PPID RET COMM ARGS
# 12345 1001 0 curl curl https://example.com
# opensnoop: file opens with the resulting FD and errno
opensnoop-bpfcc -T # -T: timestamps; add -x to show only failed opens
# TIME(s) PID COMM FD ERR PATH
# 0.000 1234 nginx 4 0 /etc/nginx/nginx.conf
# 0.001 1234 nginx -1 2 /etc/nginx/nonexistent (errno 2 = ENOENT)
# filetop: top files by I/O
filetop-bpfcc 1 # 1-second interval
# statsnoop: trace stat() calls
statsnoop-bpfcc -p 12345 # only show for one process
CPU and scheduling:
# cpudist: distribution of on-CPU time per scheduling interval (add -P for per-process histograms)
cpudist-bpfcc 5 # 5-second interval
# usecs : count distribution
# 0 -> 1 : 1234 |**** |
# 2 -> 3 : 4567 |******************* |
# 4 -> 7 : 2345 |********** |
# runqlat: run queue latency (time waiting to be scheduled)
runqlat-bpfcc
# usecs : count distribution
# 0 -> 1 : 8234 |*********************** |
# 2 -> 3 : 1234 |*** |
# 4 -> 7 : 234 | |
# 1024 -> 2047: 12 | | ← scheduling latency spikes
# profile: CPU flame graph data (stack samples at 49 Hz)
profile-bpfcc -F 49 -af 30 > /tmp/profile.txt # -a: annotate kernel frames, -f: folded output
# (then generate flame graph with FlameGraph scripts)
# offcputime: off-CPU time analysis
offcputime-bpfcc -K 10 # kernel stacks, 10 seconds
offcputime-bpfcc -U 10 # user stacks, 10 seconds
Memory:
# memleak: memory allocation tracking (leak detection)
memleak-bpfcc -p 12345 --older 5000 # show allocations older than 5 s (value is in ms) that were not freed
# shmsnoop: shared memory operations
shmsnoop-bpfcc
# drsnoop: direct reclaim snoop (memory pressure causing reclaim)
drsnoop-bpfcc
Block I/O:
# biolatency: block I/O latency histogram
biolatency-bpfcc -D 10 # -D: per-disk breakdown
# Tracing block device I/O... Hit Ctrl-C to end.
#
# disk = sda
# usecs : count distribution
# 0 -> 1 : 0 | |
# 2 -> 3 : 0 | |
# 4 -> 7 : 23 |*** |
# 8 -> 15 : 456 |******************** |
# 16 -> 31 : 789 |********************** | ← peak latency 16-31us (SSD)
# biotop: top block I/O sources
biotop-bpfcc 1 # 1-second interval, sort by bytes
# biosnoop: trace every block I/O with latency
biosnoop-bpfcc
# TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
# 0.000004 java 1234 sda R 12345678 4096 0.58
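Histograms such as the biolatency output above can be reduced to approximate percentiles, which is often what an incident report needs. The following Python sketch parses the "low -> high : count" rows and reports which bucket contains a given percentile. It is an assumption-laden helper, not part of BCC: parse_histogram, percentile_bucket, and SAMPLE are hypothetical names, and the answer is only as precise as the bucket width.

```python
import re

# Matches rows such as "  16 -> 31 : 789 |*****|"; header lines do not match.
BUCKET_LINE = re.compile(r"^\s*(\d+)\s*->\s*(\d+)\s*:\s*(\d+)")

SAMPLE = """\
     usecs : count distribution
  0 -> 1 : 0 | |
  2 -> 3 : 0 | |
  4 -> 7 : 23 |*** |
  8 -> 15 : 456 |******************** |
  16 -> 31 : 789 |********************** |
"""


def parse_histogram(text: str) -> list:
    """Return a list of (low, high, count) tuples from histogram text."""
    buckets = []
    for line in text.splitlines():
        match = BUCKET_LINE.match(line)
        if match:
            buckets.append(tuple(int(group) for group in match.groups()))
    return buckets


def percentile_bucket(buckets: list, pct: float) -> tuple:
    """Return the (low, high) range of the bucket containing the pct-th percentile."""
    total = sum(count for _, _, count in buckets)
    if total == 0:
        raise ValueError("histogram is empty")
    threshold = total * pct / 100.0
    running = 0
    for low, high, count in buckets:
        running += count
        if count and running >= threshold:
            return (low, high)
    return buckets[-1][:2]


if __name__ == "__main__":
    parsed = parse_histogram(SAMPLE)
    print("p50 bucket:", percentile_bucket(parsed, 50))
    print("p99 bucket:", percentile_bucket(parsed, 99))
```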
Network:
# tcpconnect: trace TCP connection attempts
tcpconnect-bpfcc
# PID COMM IP SADDR DADDR DPORT
# 1234 curl 4 10.0.1.5 93.184.216.34 443
# tcpretrans: trace TCP retransmissions (network issues)
tcpretrans-bpfcc
# TIME PID IP LADDR:LPORT T> RADDR:RPORT STATE
# 14:32:01 0 4 10.0.1.5:52000 R> 10.0.2.3:8080 ESTABLISHED
# tcplife: TCP connection lifetime with throughput
tcplife-bpfcc
# PID COMM LADDR LPORT RADDR RPORT TX_KB RX_KB MS
# 1234 nginx 10.0.1.5 80 10.0.0.1 54321 0 45 843
# tcpdrop: trace kernel TCP packet drops with socket details and the kernel stack that dropped them
tcpdrop-bpfcc
# tcptop: top TCP throughput by process and connection
tcptop-bpfcc 1
Kernel:
# hardirqs: hard interrupt handler latencies
hardirqs-bpfcc
# HARDIRQ TOTAL_usecs COUNT AVG_usecs
# ata_sff 234 12 19
# eth0 12345 1234 10
# softirqs: soft interrupt handler latencies
softirqs-bpfcc
# wakeuptime: off-CPU time by wakeup stack (who woke me up)
wakeuptime-bpfcc -p 12345
# criticalstat: report long atomic critical sections (code running with IRQs or preemption disabled)
criticalstat-bpfcc # typical culprits: long spinlock-held or RCU read-side sections
uprobe: User-Space Function Probing
uprobe attaches to user-space functions in running binaries without modifying them:
# Trace malloc() calls from a specific process
bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc
/pid == 12345/
{
@alloc_sizes = hist(arg0); // arg0 = size parameter to malloc()
}
END { print(@alloc_sizes); }
'
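# Trace a Go function by symbol name
# Go uses its own calling convention (stack-based before Go 1.17, register-based since on amd64),
# so argN may not line up with the function's parameters. Avoid uretprobes on Go binaries:
# goroutine stack growth can relocate the stack and crash the traced process.
# First confirm the symbol exists in the binary
nm /proc/12345/exe | grep "main.processRequest"
# Then attach to it by name
bpftrace -e '
uprobe:/proc/12345/exe:main.processRequest
{
printf("processRequest called by pid %d\n", pid);
}'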
# Trace OpenSSL SSL_write to capture plaintext before it is encrypted
bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_write
{
printf("SSL_write pid=%-6d len=%-6d data=%r\n",
pid, arg2, buf(arg1, arg2 > 64 ? 64 : arg2));
}'
Off-CPU Analysis with offcputime
# Full off-CPU analysis workflow (Brendan Gregg methodology)
# 1. Capture off-CPU stacks for 30 seconds (-f emits folded stacks, one per line)
offcputime-bpfcc -f 30 > /tmp/offcpu.folded
# 2. Generate the off-CPU flame graph directly from the folded output
./flamegraph.pl --title "Off-CPU Time" --color=io --countname=us \
--bgcolor=grey /tmp/offcpu.folded > /tmp/offcpu.svg
# Using bpftrace for off-CPU analysis manually:
bpftrace -e '
tracepoint:sched:sched_switch
/args->prev_state/ // only when the task blocked; prev_state == 0 means it was preempted while still runnable
{
@sleep_start[args->prev_pid] = nsecs;
}
tracepoint:sched:sched_switch
/@sleep_start[args->next_pid]/
{
$t = (nsecs - @sleep_start[args->next_pid]) / 1000;
@offcpu_us[args->next_comm] = hist($t); // blocked time plus run queue wait, in microseconds
delete(@sleep_start[args->next_pid]);
}'
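Folded stacks are also easy to rank without a flame graph. In the folded format each line is one stack, with frames separated by semicolons, followed by a space and the total for that stack (microseconds for offcputime). The Python sketch below sums totals per stack and returns the largest; top_offcpu_stacks is a hypothetical helper and the sample stacks in the usage section are invented for illustration.

```python
from collections import Counter


def top_offcpu_stacks(folded_lines, n: int = 3) -> list:
    """Sum the value of each folded stack and return the n largest as (stack, total)."""
    totals = Counter()
    for line in folded_lines:
        line = line.strip()
        # The value is the last space-separated field; frames may contain spaces.
        stack, _, value = line.rpartition(" ")
        if not stack or not value.isdigit():
            continue  # skip blank or malformed lines
        totals[stack] += int(value)
    return totals.most_common(n)


if __name__ == "__main__":
    sample = [
        "java;do_syscall_64;__x64_sys_futex;futex_wait;schedule 1200",
        "java;do_syscall_64;__x64_sys_read;schedule 300",
        "java;do_syscall_64;__x64_sys_futex;futex_wait;schedule 800",
    ]
    for stack, total_us in top_offcpu_stacks(sample):
        print(f"{total_us:>8} us  {stack}")
```

Reading the frames nearest the end of each stack (the ones just before schedule) usually tells you whether the wait is a lock, I/O, or a sleep.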
Historical Context
The BCC (BPF Compiler Collection) project was started by Brenden Blanco at PLUMgrid around 2014-2015. Brendan Gregg joined the project and contributed most of the production-oriented tools (biolatency, offcputime, tcpretrans, etc.) while at Netflix. The project moved to GitHub (iovisor/bcc) under the IO Visor Project, a Linux Foundation collaborative project.
bpftrace was created by Alastair Robertson (a UK-based engineer), who began work on it in late 2016; it was announced publicly in 2018. Inspired by DTrace's scripting model, it was designed to make eBPF accessible to engineers who knew DTrace but found BCC's Python/C dual-language model cumbersome. Brendan Gregg adopted bpftrace heavily and contributed numerous one-liners and scripts.
USDT probes for language runtimes originated in DTrace on Solaris/macOS, where they provided language-level visibility. When eBPF gained uprobe support, USDT probes became available on Linux for the same languages (Python, Ruby, Node.js added USDT probes around 2012-2016).
Production Examples
# Production incident: "why is this service suddenly making more syscalls than yesterday?"
# Step 1: Count syscalls by name for 30 seconds
bpftrace -e '
tracepoint:syscalls:sys_enter_* /pid == 12345/ { @[probe] = count(); }
interval:s:30 { exit(); }'
# Output: @[tracepoint:syscalls:sys_enter_read]: 45234, @[tracepoint:syscalls:sys_enter_write]: 12345, @[tracepoint:syscalls:sys_enter_futex]: 234567 ← futex spikes indicate lock contention
# Step 2: Find the futex call sites (where in the code are futexes being called?)
bpftrace -e '
tracepoint:syscalls:sys_enter_futex
/pid == 12345/
{
@[ustack] = count();
}
interval:s:10 { exit(); }'
# Output shows user-space call stacks leading to futex → identifies the contended lock
# Production incident: "file open errors returning ENOENT for /tmp files"
opensnoop-bpfcc -x -T
# -x shows only failed opens, -T adds timestamps; ENOENT appears as ERR 2
# TIME(s) PID COMM FD ERR PATH
# 14:32:01 1234 java -1 2 /tmp/config-12345.json ← race condition in temp file handling
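Raw counts like those from step 1 of the syscall incident are easier to compare across days as percentages. The Python sketch below parses bpftrace's map output, which prints one `@[key]: count` entry per line, and ranks keys by share of the total. syscall_shares and MAP_LINE are hypothetical names, and the parser assumes single-key maps with plain integer values.

```python
import re

# Matches lines such as "@[read]: 45234" or "@counts[tracepoint:syscalls:sys_enter_read]: 7".
MAP_LINE = re.compile(r"^@\w*\[(?P<key>.+)\]:\s*(?P<count>\d+)$")


def syscall_shares(output: str, top: int = 20) -> list:
    """Return up to `top` rows of (key, count, percent of total), largest first."""
    counts = {}
    for line in output.splitlines():
        match = MAP_LINE.match(line.strip())
        if match:
            key = match.group("key")
            counts[key] = counts.get(key, 0) + int(match.group("count"))
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:top]
    return [(key, count, 100.0 * count / total) for key, count in ranked]


if __name__ == "__main__":
    captured = "@[read]: 45234\n@[write]: 12345\n@[futex]: 234567\n"
    for key, count, pct in syscall_shares(captured):
        print(f"{key:<12} {count:>8} {pct:6.1f}%")
```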
Debugging Notes
"Could not attach probe: No such file or directory" on uprobe: The binary path is wrong or the function is inlined. Use nm binary | grep funcname to verify the symbol exists. If inlined, there is no symbol to probe — use a higher-level function or USDT if available.
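bpftrace map not printing at exit: bpftrace prints every non-empty map automatically when the program ends, whether through Ctrl+C or exit(), after the END block has run. A map that does not appear was usually emptied with clear() or delete() (for example in an END block after an explicit print()), or bpftrace was killed with SIGKILL and never had the chance to print.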
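Kernel struct access in bpftrace requires type information: On kernels built with BTF (CONFIG_DEBUG_INFO_BTF, available from 5.2), bpftrace can resolve kernel structs by name. Without BTF, either #include the relevant kernel headers in the script (which requires the headers to be installed) or fall back to raw pointer arithmetic, reading the field with a cast at a byte offset from arg0 instead of writing $sk->__sk_common.skc_dport. The offset must be looked up for the running kernel build. Check BTF availability: ls /sys/kernel/btf/vmlinux.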
BCC tools failing with "cannot open BPF object file": BCC compiles eBPF code at runtime using the kernel headers. On some systems (minimal containers), kernel headers may not be installed. Install linux-headers-$(uname -r). For containers, prefer tools that do not need headers: libbpf CO-RE builds, or bpftrace on a BTF-enabled kernel.
Security Implications
- bpftrace can capture SSL_write/SSL_read plaintext before encryption. This is legitimate for debugging but is also a significant capability for an attacker with root. Audit who has CAP_BPF on production systems.
- uprobe on a shared library (libc.so) attaches to that function for ALL processes using the library. If you accidentally trace malloc() for all PIDs, you will get enormous volume and may affect other services.
- uretprobes have a known performance issue: they require a trampolined return, adding overhead per call. At >1M calls/second, uretprobe overhead is measurable. Prefer tracepoints when available.
- USDT probes in PostgreSQL can expose query text (including potentially user data in WHERE clauses). Only use query tracing in development or with appropriate data handling controls.
Performance Implications
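- bpftrace startup: every invocation compiles the script to eBPF bytecode before attaching, which can take several seconds for large scripts or wildcard probes. bpftrace does not cache compiled programs, so subsequent runs pay the same cost.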
- kprobe on a heavily-called function: ~100ns overhead per call. At 1M calls/sec: ~100ms of CPU time per second, or ~10% of one core.
- uprobe overhead: noticeably higher than kprobe, because each hit traps from user space into the kernel. Expect on the order of a microsecond per call rather than ~100ns. At high call rates, be careful.
- USDT: ~50ns when disabled (semaphore check). ~100-200ns when enabled.
- Map operations (histogram update): ~10-50ns per event (atomic operation in kernel memory).
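The per-call figures above turn into CPU cost by multiplication: cost per call times calls per second gives probe time spent per second of wall time. The Python sketch below works that out as a fraction of one core. probe_cpu_overhead is a hypothetical helper, and the 100 ns input is the rough kprobe estimate from this section, not a measurement.

```python
def probe_cpu_overhead(per_call_ns: float, calls_per_sec: float) -> float:
    """Return probe cost as a fraction of one CPU core (1.0 means a full core)."""
    return per_call_ns * calls_per_sec / 1e9  # 1e9 ns of CPU time per core-second


if __name__ == "__main__":
    for rate in (10_000, 100_000, 1_000_000):
        share = probe_cpu_overhead(100, rate)
        print(f"100 ns/call at {rate:>9,} calls/sec = {share:.1%} of one core")
```

Running the numbers before attaching to a hot function is a cheap way to decide whether to sample, filter by PID, or pick a less frequent probe point.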
Failure Modes and Real Incidents
bpftrace map exhaustion from a conditional delete: A script that tracked TCP state transitions with @ts[skaddr] = nsecs and delete(@ts[skaddr]) had a bug: the delete sat behind a filter that sometimes didn't match, so entries accumulated without bound. The BPF map reached its 10,000 entry limit (set through BPFTRACE_MAP_KEYS_MAX; the default is 4096) and bpftrace started logging "failed to update map" errors every second. Fix: always use delete() in the return probe unconditionally, or set explicit map size limits.
offcputime OOM on a busy scheduler: Running offcputime-bpfcc 60 on a host with 2000+ threads context-switching frequently caused the BCC user-space Python process to allocate 4GB+ of memory storing stack trace strings. Fix: use -m 1000 (minimum off-CPU duration of 1000 microseconds, i.e. 1ms) to reduce output volume; or use bpftrace with a histogram instead of full stacks.
uprobe crash from mismatch between binary and running process: An engineer ran uprobe:/usr/bin/myservice:main.handleRequest against PID 12345, but PID 12345 was a different version of myservice (hot-deployed 5 minutes earlier). The symbol existed at a different offset in the new binary, causing the probe to fire at the wrong location. Fix: always verify the binary matches the running process: md5sum /usr/bin/myservice /proc/12345/exe.
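The md5sum check from the last incident can be automated before attaching a uprobe. The following Python sketch compares the on-disk binary with the executable a PID is actually running, using SHA-256 instead of MD5. The function names are hypothetical; reading /proc/PID/exe requires Linux and sufficient privileges for the target process.

```python
import hashlib
import sys


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a: str, path_b: str) -> bool:
    """True when both files have identical contents."""
    return file_sha256(path_a) == file_sha256(path_b)


def binary_matches_process(binary_path: str, pid: int) -> bool:
    """True when the on-disk binary matches the executable mapped by the process."""
    return same_contents(binary_path, f"/proc/{pid}/exe")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_binary.py /usr/bin/myservice 12345
    matches = binary_matches_process(sys.argv[1], int(sys.argv[2]))
    print("match" if matches else "MISMATCH: probe /proc/PID/exe or redeploy first")
```

When the check fails, pointing the uprobe at /proc/PID/exe (as in the Go example earlier) targets the binary the process was actually started from.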
Modern Usage
- bpftrace 0.19+ (2023): adds support for iterators (iterate kernel data structures without event triggers), improved BTF support, new built-in functions.
- Inspektor Gadget: a Kubernetes-native eBPF debugging toolkit that wraps BCC/bpftrace tools in a Kubernetes-aware interface. Deploy as a DaemonSet, then run kubectl gadget trace exec to trace execve across the cluster.
- Beyla (Grafana): eBPF-based automatic HTTP/gRPC instrumentation without code changes. Uses uprobes + USDT for zero-instrumentation distributed tracing.
- Tetragon (Isovalent/Cilium): eBPF-based security observability — detects privilege escalation, file access violations, network anomalies.
Future Directions
- libbpf CO-RE as default: the trend is moving from BCC (compile at runtime, requires kernel headers) to libbpf CO-RE (compile once, run everywhere via BTF type info). bpftrace is also adopting CO-RE backend.
- bpftrace iterators: iter probes let bpftrace programs walk kernel data structures without waiting for events. Tasks and open files are covered today; further structures such as the TCP socket table are a likely direction.
- eBPF for Windows: Microsoft's eBPF-for-Windows project will bring bpftrace-like tooling to Windows environments.
Exercises
- bpftrace system call profiler: Write a bpftrace program that counts system calls by call name for all processes for 30 seconds and prints the top-20 most frequent, with their percentages. Add a filter to exclude PID 0 and init (pid > 1).
- Latency-spiking connection finder: Write a bpftrace program that traces all TCP connections and flags any connection where the time between tcp_connect and tcp_close is less than 10ms (probably a connection reset or refused). Print the destination IP/port and the process name.
- Python hot function tracer: Use USDT to trace all Python function entries in a running Python web service. Record the function name and call count over 60 seconds. Produce a frequency-ordered list of the top-10 most-called Python functions. Compare with the CPU flame graph from profile-bpfcc. Are the hot Python functions also the hot CPU functions?
- Off-CPU investigation: Use offcputime-bpfcc to analyze a Java service. Capture kernel stacks for 30 seconds. Identify the top-3 kernel call paths where the service spends time off-CPU. For each, explain what it is waiting for (I/O, lock, sleep?) by reading the kernel function names in the stack.
- uprobe malloc size histogram: Use uprobe on malloc() in libc to trace allocation sizes for a specific process (e.g., a running Java process or a Go service). Build a histogram of allocation sizes. What is the most common allocation size? Are there any unusually large single allocations that might indicate memory leaks?
References
- Gregg, Brendan. BPF Performance Tools. Addison-Wesley, 2019. (Complete reference; chapters 1-5 cover tools taxonomy)
- bpftrace Reference Guide: https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
- BCC Tools README: https://github.com/iovisor/bcc/blob/master/docs/tutorial.md
- Gregg, Brendan. "Linux eBPF Superpowers." http://brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html
- Gregg, Brendan. "Off-CPU Analysis." http://brendangregg.com/offcpuanalysis.html
- libbpf CO-RE guide: https://nakryiko.com/posts/bpf-core-reference-guide/
- USDT probe tutorial: https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html
- Inspektor Gadget: https://www.inspektor-gadget.io/