04 — eBPF Debugging Tools

Technical Overview

eBPF transforms kernel debugging from a specialist skill requiring kernel patches or reboots into an on-demand, production-safe capability. The tooling ecosystem — bpftrace for ad-hoc scripts, BCC for Python-driven programs, and libbpf for compiled production-grade tools — covers nearly every observable kernel event. For a production SRE, eBPF tools answer questions that previously required deep kernel expertise, custom patches, or strace (with its 20-100x slowdown): which processes are making which syscalls, which files are being opened, what is the latency of every disk I/O, which TCP connections are dying and why.

This chapter covers the practical taxonomy of eBPF debugging tools, the bpftrace language, key BCC tools and their use in production debugging, and the USDT (User Statically-Defined Tracing) mechanism, which lets you debug language runtimes through probes already compiled into them, without adding your own instrumentation.

Prerequisites

Linux kernel 4.9+ (most features require 4.15+; bpftrace requires 4.9+, best at 5.3+)
CONFIG_BPF=y, CONFIG_BPF_SYSCALL=y, CONFIG_BPF_JIT=y (most distros ship with these)
Root or CAP_BPF + CAP_PERFMON (Linux 5.8+)
Familiarity with Linux system calls, kernel functions, and C structs
Basic understanding of eBPF concepts (see 06-ebpf-observability.md)
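
The kernel-version prerequisites above can be checked from a shell before installing any tooling. The following is a minimal sketch under stated assumptions: the helper name kernel_at_least is hypothetical, and it only compares the major.minor part of a release string such as the output of uname -r. It does not inspect the kernel config or capabilities.

```shell
# Hypothetical helper: succeed if kernel release $1 (e.g. "5.15.0-91-generic")
# is at least version $2.$3 (major.minor).
kernel_at_least() {
    release=$1
    want_major=$2
    want_minor=$3
    major=${release%%.*}        # text before the first dot
    rest=${release#*.}          # text after the first dot
    minor=${rest%%[!0-9]*}      # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}
```

For example, kernel_at_least "$(uname -r)" 5 8 succeeds on kernels new enough to have CAP_BPF and CAP_PERFMON. BTF availability is a separate check: test -r /sys/kernel/btf/vmlinux.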

Core Content

eBPF Tool Taxonomy

EBPF DEBUGGING TOOL ECOSYSTEM

  Abstraction Level
  (High)
   │
   │  bpftrace          Shell-like scripting language. Ad-hoc investigation.
   │  ─────────────     One-liners to multi-page scripts. Compiled to eBPF at runtime.
   │                    Best for: exploration, one-off questions
   │
   │  BCC tools         Python frontend + kernel-side C compiled at runtime.
   │  ─────────────     Pre-built tools for common use cases.
   │                    Best for: standard debugging tasks (execsnoop, biolatency, etc.)
   │
   │  libbpf (CO-RE)    C/C++ eBPF programs compiled ahead-of-time.
   │  ─────────────     Portable via BTF/CO-RE. Production-grade agents.
   │                    Best for: always-on tools, production deployment
   │
   │  Raw bpf()         Direct syscall programming.
   │  ─────────────     No framework overhead.
  (Low)                 Best for: extreme performance needs (rarely needed)

bpftrace Language

bpftrace programs have the structure: probe /filter/ { action }. Multiple probes can coexist in one program.

Probe types:

kprobe:function            # kernel function entry (dynamic, fragile)
kretprobe:function         # kernel function return, accesses retval
tracepoint:category:name   # static kernel tracepoint (stable)
uprobe:binary:function     # user-space function entry
uretprobe:binary:function  # user-space function return
usdt:binary:probe_name     # USDT static probe in binary
hardware:event_name        # hardware PMU event
software:event_name        # software event (page-faults, etc.)
profile:hz:N               # timer sampling at N Hz (CPU profiling)
interval:s:N               # periodic interval, N seconds
BEGIN                       # runs once at program start
END                         # runs once at program exit (Ctrl+C)

Built-in variables:

pid         # process ID of current process
tid         # thread ID
uid         # user ID
gid         # group ID
comm        # process name (16-byte kernel buffer, so at most 15 visible chars)
nsecs       # current timestamp in nanoseconds
cpu         # CPU number
curtask     # pointer to current task_struct
args        # typed arguments struct (tracepoints; kfunc/fentry when BTF is available)
arg0..argN  # raw positional arguments (kprobe, uprobe, usdt)
retval      # return value (kretprobe/uretprobe only)
func        # current function name
probe       # full probe specifier

Maps (aggregation structures):

# Named map: @map_name[key] = value
@latency[pid] = nsecs     # assign
@count[comm]++            # increment
delete(@latency[pid])     # remove key
clear(@count)             # clear all keys

# Histogram: auto-binning into power-of-2 buckets
@hist = hist(value)               # power-of-2 histogram
@lhist = lhist(value, 0, 100, 10) # linear histogram: min=0, max=100, step=10

# Aggregation functions
@sum = sum(value)
@count = count()
@min = min(value)
@max = max(value)
@avg = avg(value)

// Printing: when the program exits (Ctrl+C or exit()), END runs first and
// then every map that still has entries is printed automatically
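
To make the power-of-2 binning of hist() concrete, the sketch below computes which bucket a non-negative integer falls into. It is only an illustration of the bucketing rule written in shell, not bpftrace's implementation. The helper name log2_bucket is hypothetical, negative values are not handled, and bpftrace additionally abbreviates large bounds in its output (for example 1K for 1024).

```shell
# Hypothetical helper: print the power-of-2 bucket for a non-negative integer,
# using the same labels as the small buckets in bpftrace hist() output.
log2_bucket() {
    value=$1
    if [ "$value" -lt 2 ]; then
        echo "[$value]"              # 0 and 1 each get their own bucket
        return 0
    fi
    lo=2
    while [ $((lo * 2)) -le "$value" ]; do
        lo=$((lo * 2))               # grow the lower bound until value < 2*lo
    done
    hi=$((lo * 2))
    echo "[$lo, $hi)"                # half-open interval: lo <= value < hi
}
```

A 100 microsecond latency lands in the bucket printed by log2_bucket 100, so two samples of 70 us and 120 us are counted in the same histogram row.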

Complete bpftrace programs:

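#!/usr/bin/bpftrace
// Program 1: Trace all execve() calls with command line arguments
// (pid and comm are those of the caller, before the new image replaces it)
tracepoint:syscalls:sys_enter_execve
{
    printf("%-6d %-16s ", pid, comm);
    join(args->argv);   // prints argv[] separated by spaces, then a newline
}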

// Program 2: Measure read() system call latency per process
tracepoint:syscalls:sys_enter_read
{
    @start[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_read
/@start[tid]/  // filter: only process if we saw the entry
{
    $duration_us = (nsecs - @start[tid]) / 1000;
    @latency_us[comm] = hist($duration_us);
    delete(@start[tid]);
}

END
{
    printf("\nread() latency distribution per process (us):\n");
    print(@latency_us);
    clear(@latency_us);
    clear(@start);   // drop in-flight entries so they are not dumped at exit
}

#!/usr/bin/bpftrace
// Program 3: TCP connection lifecycle (IPv4 ESTABLISHED and CLOSE transitions)
// Note: tcp_set_state() often runs in softirq context, so pid/comm may belong
// to whichever task was interrupted rather than to the socket owner.
#ifndef BPFTRACE_HAVE_BTF
#include <linux/socket.h>
#include <net/sock.h>
#else
#include <sys/socket.h>
#endif

kprobe:tcp_set_state
{
    $sk = (struct sock *)arg0;
    $state = arg1;

    // 1 = TCP_ESTABLISHED
    if ($state == 1) {
        printf("ESTABLISHED pid=%-6d comm=%-16s %s:%d -> %s:%d\n",
            pid, comm,
            ntop(AF_INET, $sk->__sk_common.skc_rcv_saddr),
            $sk->__sk_common.skc_num,
            ntop(AF_INET, $sk->__sk_common.skc_daddr),
            bswap($sk->__sk_common.skc_dport));   // dport is in network byte order
    }

    // 7 = TCP_CLOSE
    if ($state == 7) {
        printf("CLOSE      pid=%-6d comm=%-16s %s:%d -> %s:%d\n",
            pid, comm,
            ntop(AF_INET, $sk->__sk_common.skc_rcv_saddr),
            $sk->__sk_common.skc_num,
            ntop(AF_INET, $sk->__sk_common.skc_daddr),
            bswap($sk->__sk_common.skc_dport));
    }
}

#!/usr/bin/bpftrace
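// Program 4: Detect kernel mutex contention (off-CPU waiting on a mutex)
// Traces the time threads spend in the kernel mutex slow path. User-space
// pthread mutexes do not appear here; they show up as futex syscalls.
// __mutex_lock_slowpath is a static kernel function: confirm it is probeable
// on your kernel with: bpftrace -l 'kprobe:__mutex_lock_slowpath'
kprobe:__mutex_lock_slowpath
{
    @waiting_since[tid] = nsecs;
    @waiting_comm[tid] = comm;
}

kretprobe:__mutex_lock_slowpath
/@waiting_since[tid]/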
{
    $wait_us = (nsecs - @waiting_since[tid]) / 1000;
    if ($wait_us > 100) {  // only show waits > 100 microseconds
        printf("CONTENTION comm=%-16s pid=%-6d waited=%d us\n",
            @waiting_comm[tid], pid, $wait_us);
    }
    delete(@waiting_since[tid]);
    delete(@waiting_comm[tid]);
}

kprobe vs Tracepoint vs USDT

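kprobe (kernel probe): attaches to almost any kernel function by name. Dynamic: added at runtime without kernel recompilation. Fragile: kernel functions are not a stable interface. If the function is renamed, inlined, or removed in a kernel update, the probe fails to attach; if only its arguments change, the probe still attaches and silently reports wrong data. Use when no tracepoint exists.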

# Check if a kprobe target exists
bpftrace -l 'kprobe:tcp_connect'
# Output: kprobe:tcp_connect   (if the function exists)
# No output = function doesn't exist in this kernel

# Tracepoints are a far more stable interface than kprobes
bpftrace -l 'tracepoint:net:*'   # list all net: tracepoints

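Tracepoint: static, defined in kernel source with the TRACE_EVENT() macro. Kernel maintainers treat tracepoints as a stable interface, so they change far less often than function names, although fields are occasionally added or renamed. Preferred over kprobes when available. Has a defined argument structure (args->) accessible without struct casting.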

# List all syscall tracepoints
bpftrace -l 'tracepoint:syscalls:*' | wc -l
# ~700 syscall tracepoints available

# List all available tracepoints
bpftrace -l 'tracepoint:*' | wc -l
# ~2000+ tracepoints in a typical production kernel
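
The "prefer a tracepoint, fall back to a kprobe" rule can be automated with bpftrace -l. The following is a sketch, not an established tool: first_available_probe is a hypothetical helper name, it relies only on bpftrace -l printing each matching probe on its own line, and it usually needs to run as root.

```shell
# Hypothetical helper: print the first candidate probe that this kernel offers.
# Pass candidates in order of preference (tracepoint first, kprobe last).
first_available_probe() {
    for candidate in "$@"; do
        # bpftrace -l prints the matching probe names, or nothing at all
        if bpftrace -l "$candidate" 2>/dev/null | grep -qxF "$candidate"; then
            echo "$candidate"
            return 0
        fi
    done
    return 1    # none of the candidates exist on this kernel
}
```

For example, first_available_probe 'tracepoint:sock:inet_sock_set_state' 'kprobe:tcp_set_state' prints the tracepoint on kernels that provide it and the kprobe otherwise. The two probes expose their arguments differently (args-> fields versus arg0/arg1), so the script body still has to match whichever probe is chosen.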

USDT (User Statically-Defined Tracing): static probes compiled into user-space binaries and language runtimes. They provide a stable interface for probing language internals, but they exist only if the binary was built with them (for example CPython's --with-dtrace or PostgreSQL's --enable-dtrace configure options).

Available in:

CPython: python:function__entry, python:function__return, python:gc__start, python:gc__done, python:import__find__load__start
Node.js: node:http__server__request, node:http__client__request, node:gc__start, node:gc__done
Ruby: ruby:method__entry, ruby:gc__mark__begin
JVM (HotSpot built with DTrace support; method probes also need -XX:+ExtendedDTraceProbes): hotspot:method__entry, hotspot:gc__begin
PostgreSQL: postgresql:query__start, postgresql:query__done, postgresql:lock__wait__start
MySQL (5.x; the probes were removed in 8.0): mysql:query__start, mysql:query__done

# List available USDT probes in a binary
bpftrace -l 'usdt:/usr/bin/python3:*'
# Output:
# usdt:/usr/bin/python3:python:function__entry
# usdt:/usr/bin/python3:python:function__return
# usdt:/usr/bin/python3:python:gc__start
# ...

# Trace Python function calls
# (probe arguments: arg0 = filename, arg1 = function name, arg2 = line number)
bpftrace -e '
usdt:/usr/bin/python3:python:function__entry
{
    printf("%-6d %-10s %s:%d %s\n",
        pid, comm,
        str(arg0),    // filename
        arg2,         // line number
        str(arg1));   // function name
}'

# Trace PostgreSQL slow queries (> 100ms)
bpftrace -e '
usdt:/usr/lib/postgresql/14/bin/postgres:postgresql:query__start
{
    @query_start[pid] = nsecs;
    @query[pid] = str(arg0);  // SQL query string
}

usdt:/usr/lib/postgresql/14/bin/postgres:postgresql:query__done
/@query_start[pid]/
{
    $duration_ms = (nsecs - @query_start[pid]) / 1000000;
    if ($duration_ms > 100) {
        printf("SLOW QUERY: %dms: %s\n", $duration_ms, @query[pid]);
    }
    delete(@query_start[pid]);
    delete(@query[pid]);
}'

BCC Tools Reference

BCC (BPF Compiler Collection) provides dozens of pre-built tools. These are in /usr/share/bcc/tools/ or invoked as toolname-bpfcc on Debian/Ubuntu:

Process and execution:

# execsnoop: every new process execution
execsnoop-bpfcc
# PID   PPID  RET COMM            ARGS
# 12345 1001   0  curl            curl https://example.com

# opensnoop: file opens with returned FD and error codes
opensnoop-bpfcc -T      # -T: timestamps; add -x to show only failed opens
# TIME(s)   PID    COMM  FD ERR PATH
# 0.000     1234   nginx  4  0  /etc/nginx/nginx.conf
# 0.001     1234   nginx -1  2  /etc/nginx/nonexistent  (errno 2 = ENOENT)

# filetop: top files by I/O
filetop-bpfcc 1    # 1-second interval

# statsnoop: trace stat() calls
statsnoop-bpfcc -p 12345  # only show for one process

CPU and scheduling:

# cpudist: distribution of on-CPU time per scheduling stint (add -P for per-process histograms)
cpudist-bpfcc 5    # 5-second interval
# usecs      : count  distribution
# 0 -> 1     : 1234  |****                    |
# 2 -> 3     : 4567  |*******************     |
# 4 -> 7     : 2345  |**********              |

# runqlat: run queue latency (time waiting to be scheduled)
runqlat-bpfcc
# usecs      : count  distribution
# 0 -> 1     : 8234  |***********************  |
# 2 -> 3     : 1234  |***                      |
# 4 -> 7     : 234   |                         |
# 1000 -> 2047:  12   |                         |  ← scheduling latency spikes

# profile: CPU flame graph data (stack samples at 49Hz for 30 seconds)
profile-bpfcc -F 49 -af 30 > /tmp/profile.txt   # -a: annotate kernel frames, -f: folded output
# (then generate flame graph with FlameGraph scripts)

# offcputime: off-CPU time analysis
offcputime-bpfcc -K 10   # kernel stacks, 10 seconds
offcputime-bpfcc -U 10   # user stacks, 10 seconds

Memory:

# memleak: memory allocation tracking (leak detection)
memleak-bpfcc -p 12345 --older 5000   # --older is in milliseconds: show allocs older than 5s not freed

# shmsnoop: shared memory operations
shmsnoop-bpfcc

# drsnoop: direct reclaim snoop (memory pressure causing reclaim)
drsnoop-bpfcc

Block I/O:

# biolatency: block I/O latency histogram
biolatency-bpfcc -D 10   # -D: per-disk breakdown
# Tracing block device I/O... Hit Ctrl-C to end.
# 
# disk = sda
# usecs      : count  distribution
# 0 -> 1     : 0     |                        |
# 2 -> 3     : 0     |                        |
# 4 -> 7     : 23    |***                     |
# 8 -> 15    : 456   |********************    |
# 16 -> 31   : 789   |**********************  |  ← peak latency 16-31us (SSD)

# biotop: top block I/O sources
biotop-bpfcc 1    # 1-second interval, sort by bytes

# biosnoop: trace every block I/O with latency
biosnoop-bpfcc
# TIME(s)     COMM         PID    DISK    T SECTOR     BYTES  LAT(ms)
# 0.000004    java         1234   sda     R 12345678   4096   0.58

Network:

# tcpconnect: trace TCP connection attempts
tcpconnect-bpfcc
# PID    COMM    IP SADDR           DADDR           DPORT
# 1234   curl     4 10.0.1.5        93.184.216.34   443

# tcpretrans: trace TCP retransmissions (network issues)
tcpretrans-bpfcc
# TIME     PID    IP LADDR:LPORT          T> RADDR:RPORT          STATE
# 14:32:01 0      4  10.0.1.5:52000       R> 10.0.2.3:8080        ESTABLISHED

# tcplife: TCP connection lifetime with throughput
tcplife-bpfcc
# PID   COMM     LADDR    LPORT RADDR     RPORT TX_KB RX_KB MS
# 1234  nginx    10.0.1.5 80    10.0.0.1  54321  0     45   843

# tcpdrop: trace TCP packet drops with socket state and the kernel stack that dropped them
tcpdrop-bpfcc

# tcptop: top TCP traffic by process and connection
tcptop-bpfcc 1

Kernel:

# hardirqs: total time spent in hard interrupt handlers (-C for counts, -d for histograms)
hardirqs-bpfcc
# HARDIRQ                    TOTAL_usecs
# ata_sff                         234
# eth0                          12345

# softirqs: total time spent in soft interrupt handlers
softirqs-bpfcc

# wakeuptime: off-CPU time by wakeup stack (who woke me up)
wakeuptime-bpfcc -p 12345

# criticalstat: detect long atomic critical sections (preemption or IRQs disabled)
criticalstat-bpfcc  # requires a kernel built with the preempt/IRQ-off tracing options

uprobe: User-Space Function Probing

uprobe attaches to user-space functions in running binaries without modifying them:

# Trace malloc() calls from a specific process
bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc
/pid == 12345/
{
    @alloc_sizes = hist(arg0);  // arg0 = size parameter to malloc()
}
END { print(@alloc_sizes); }
'

# Trace a Go function. Caveats: Go's calling convention differs from C (and is
# register-based since Go 1.17), so arg0..argN may not line up with parameters,
# and uretprobes can crash Go programs because the runtime moves goroutine stacks.
# First confirm the symbol exists in the binary
nm /proc/12345/exe | grep "main.processRequest"
# Then attach an entry probe by symbol name
bpftrace -e '
uprobe:/proc/12345/exe:main.processRequest
{
    printf("processRequest called by pid %d\n", pid);
}'

# Trace OpenSSL to capture plaintext before encryption
bpftrace -e '
uprobe:/lib/x86_64-linux-gnu/libssl.so.3:SSL_write
{
    printf("SSL_write pid=%-6d len=%-6d data=%r\n",
        pid, arg2, buf(arg1, arg2 > 64 ? 64 : arg2));
}'

Off-CPU Analysis with offcputime

# Full off-CPU analysis workflow (Brendan Gregg methodology)

# 1. Capture off-CPU stacks for 30 seconds (-f emits folded stacks directly)
offcputime-bpfcc -f 30 > /tmp/offcpu.folded

# 2. Generate off-CPU flame graph (no stackcollapse step is needed for -f output)
./flamegraph.pl --title "Off-CPU Time" --color=io --countname=us \
    --bgcolors=grey /tmp/offcpu.folded > /tmp/offcpu.svg

# Using bpftrace for off-CPU analysis manually:
bpftrace -e '
tracepoint:sched:sched_switch
/args->prev_state/   // filter: task is blocking (prev_state != 0); 0 means preempted while still runnable
{
    @sleep_start[args->prev_pid] = nsecs;
}

tracepoint:sched:sched_switch
/@sleep_start[args->next_pid]/
{
    $t = (nsecs - @sleep_start[args->next_pid]) / 1000;
    @offcpu_us[args->next_comm] = hist($t);   // blocked time plus run-queue wait
    delete(@sleep_start[args->next_pid]);
}'
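
Before rendering a flame graph, it is often useful to see which processes account for most of the off-CPU time. The sketch below totals the folded output of offcputime-bpfcc -f per process. It assumes each line has the form name;frame;...;frame followed by a space and a microsecond count, with the process name as the first frame. The helper name offcpu_by_process is hypothetical.

```shell
# Hypothetical helper: sum off-CPU microseconds per process from folded stacks.
#   $1 = file with folded stacks, $2 = number of rows to print (default 10)
offcpu_by_process() {
    awk '{
        value = $NF                        # last field: off-CPU time in microseconds
        line = $0
        sub(/ [0-9]+$/, "", line)          # strip the trailing count
        split(line, frames, ";")           # frames[1] is the process name
        total[frames[1]] += value
    }
    END { for (name in total) printf "%d %s\n", total[name], name }' "$1" |
        sort -rn | head -n "${2:-10}"
}
```

Run it as offcpu_by_process /path/to/folded-stacks 5 to list the five processes with the most blocked time, then generate the flame graph only for the process of interest (offcputime-bpfcc accepts -p PID).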

Historical Context

The BCC (BPF Compiler Collection) project was started by Brenden Blanco at PLUMgrid around 2014-2015. Brendan Gregg joined the project and contributed most of the production-oriented tools (biolatency, offcputime, tcpretrans, etc.) while at Netflix. The project is developed on GitHub (iovisor/bcc) under the IO Visor Project, a Linux Foundation collaborative project.

bpftrace was created by Alastair Robertson (a UK-based engineer), who began work on it in late 2016; it was announced publicly in 2018. It was inspired by DTrace's scripting model and designed to make eBPF accessible to engineers who knew DTrace but found BCC's Python/C dual-language model cumbersome. Brendan Gregg adopted bpftrace heavily and contributed numerous one-liners and scripts.

USDT probes for language runtimes originated in DTrace on Solaris/macOS, where they provided language-level visibility. When eBPF gained uprobe support, USDT probes became available on Linux for the same languages (Python, Ruby, Node.js added USDT probes around 2012-2016).

Production Examples

# Production incident: "why is this service suddenly making more syscalls than yesterday?"

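# Step 1: Find what syscalls are being made (count for 30 seconds, then exit)
bpftrace -e '
tracepoint:syscalls:sys_enter_* /pid == 12345/ { @[probe] = count(); }
interval:s:30 { exit(); }'
# Output is one line per syscall tracepoint, for example:
# @[tracepoint:syscalls:sys_enter_write]: 12345
# @[tracepoint:syscalls:sys_enter_read]: 45234
# @[tracepoint:syscalls:sys_enter_futex]: 234567  ← a futex spike often indicates lock contention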

# Step 2: Find the futex call sites (where in the code are futexes being called?)
bpftrace -e '
tracepoint:syscalls:sys_enter_futex
/pid == 12345/
{
    @[ustack] = count();
}
interval:s:10 { exit(); }' -p 12345
# Output shows user-space call stacks leading to futex → identifies the contended lock

# Production incident: "file open errors returning ENOENT for /tmp files"
opensnoop-bpfcc -x -T
# -x shows only failed opens; look for ERR 2 (ENOENT) in the output
# TIME(s)   PID    COMM  FD ERR PATH
# 4.512     1234   java  -1   2 /tmp/config-12345.json   ← race condition in temp file handling
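
Raw syscall counts are easier to compare between "today" and "yesterday" when expressed as percentages. The sketch below post-processes saved bpftrace output. It assumes the counts were collected per syscall tracepoint and keyed by the probe builtin, so each line looks like @[tracepoint:syscalls:sys_enter_NAME]: COUNT. The helper name syscall_percentages is hypothetical.

```shell
# Hypothetical helper: rank syscalls by count and show each one's share.
#   $1 = file containing bpftrace map output
syscall_percentages() {
    # Keep only "@[tracepoint:syscalls:sys_enter_NAME]: COUNT" lines as "COUNT NAME"
    sed -n 's/^@\[tracepoint:syscalls:sys_enter_\(.*\)\]: \([0-9][0-9]*\)$/\2 \1/p' "$1" |
        sort -rn |
        awk '{ count[NR] = $1; name[NR] = $2; total += $1 }
             END {
                 for (i = 1; i <= NR; i++)
                     printf "%-24s %10d %5.1f%%\n", name[i], count[i], 100 * count[i] / total
             }'
}
```

Redirect the bpftrace output to a file for each day, run the helper on both, and compare the percentage column: a syscall whose share jumps is a better lead than one whose absolute count grew with overall traffic.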

Debugging Notes

"Could not attach probe: No such file or directory" on uprobe: The binary path is wrong or the function is inlined. Use nm binary | grep funcname to verify the symbol exists. If inlined, there is no symbol to probe — use a higher-level function or USDT if available.

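bpftrace map not printing at exit: On Ctrl+C or exit(), bpftrace runs the END block and then prints every map that still has entries. A map is missing from the output only if it was emptied first (clear() or delete() in END) or if bpftrace was killed with SIGKILL before it could clean up. To control the output order or format, print() the map explicitly in END and then clear() it to avoid a second, automatic dump.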

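Kernel struct access in bpftrace requires type information: On kernels built with BTF (CONFIG_DEBUG_INFO_BTF=y, available since 5.2 and exposed at /sys/kernel/btf/vmlinux since 5.4), bpftrace can access kernel structs by name. Without BTF, either #include the relevant kernel headers (for example <net/sock.h>, which requires the kernel headers package) or fall back to pointer arithmetic with a hand-computed offset: *(uint32_t *)(arg0 + 0x18) instead of $sk->__sk_common.skc_dport. Check BTF availability: ls /sys/kernel/btf/vmlinux.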

BCC tools failing with "cannot open BPF object file": BCC compiles eBPF code at runtime using the kernel headers. On some systems (minimal containers), kernel headers may not be installed. Install linux-headers-$(uname -r). For containers, prefer tools that do not need headers at runtime: the libbpf CO-RE builds of these tools, or bpftrace on a BTF-enabled kernel.

Security Implications

bpftrace can capture SSL_write/SSL_read plaintext before encryption. This is legitimate for debugging but is also a significant capability for an attacker with root. Audit who has CAP_BPF on production systems.
uprobe on a shared library (libc.so) attaches to that function for ALL processes using the library. If you accidentally trace malloc() for all PIDs, you will get enormous volume and may affect other services.
uretprobes have a known performance issue: they require a trampolined return, adding overhead per call. At >1M calls/second, uretprobe overhead is measurable. Prefer tracepoints when available.
USDT probes in PostgreSQL can expose query text (including potentially user data in WHERE clauses). Only use query tracing in development or with appropriate data handling controls.

Performance Implications

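bpftrace compilation: each run parses the script and compiles it to eBPF bytecode, which can take several seconds (longer when attaching hundreds of probes). bpftrace does not cache compiled programs, so this cost is paid on every invocation.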
kprobe on a heavily-called function: ~100ns overhead per call. At 1M calls/sec: ~100ms of CPU time per second, or ~10% of one core.
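uprobe overhead: substantially higher than kprobe, because each hit traps from user space into the kernel; commonly on the order of a microsecond or more per call. At high call rates (malloc, hot request paths), measure the impact before leaving a uprobe attached.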
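USDT: near zero when disabled (the probe site is a NOP; semaphore-guarded probes add a cheap load and branch). When enabled, a USDT probe is implemented as a uprobe and costs about the same.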
Map operations (histogram update): ~10-50ns per event (atomic operation in kernel memory).
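
The per-call figures above turn into CPU load by multiplication: cost per hit times hits per second gives probe time spent per second, and that, as a fraction of one second, is the share of one core. The sketch below does the arithmetic; probe_overhead_pct is a hypothetical helper name, and the inputs are whatever per-hit cost and event rate you have measured or estimated.

```shell
# Hypothetical helper: percent of one CPU core consumed by probe overhead.
#   $1 = cost per probe hit in nanoseconds
#   $2 = probe hits per second
probe_overhead_pct() {
    awk -v ns="$1" -v rate="$2" \
        'BEGIN { printf "%.1f\n", ns * rate / 1000000000 * 100 }'
}
```

With the kprobe figure above, probe_overhead_pct 100 1000000 prints 10.0, meaning 10% of one core. Dropping the rate to 100,000 calls/sec brings the same probe down to 1.0.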

Failure Modes and Real Incidents

bpftrace map exhaustion from a missing delete: A script that tracked TCP state transitions with @ts[skaddr] = nsecs and delete(@ts[skaddr]) had a bug: the delete was inside a filter that sometimes didn't match, so entries accumulated without bound. The BPF map reached its 10,000-entry limit (the cap comes from BPFTRACE_MAP_KEYS_MAX, which defaults to 4096) and bpftrace started logging "failed to update map" errors every second. Fix: always call delete() in the return probe unconditionally, and set the map size limit deliberately.
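
One way to follow that advice is to make delete() the last statement of the return probe, outside any extra condition. The sketch below prints such a bpftrace program for a given kernel function. latency_program is a hypothetical helper, and the map names are illustrative. The /@start[tid]/ filter is kept because it only skips returns that have no recorded entry, so it cannot leave entries behind.

```shell
# Hypothetical generator: print a bpftrace program that times kernel function $1
# and always removes the per-thread start timestamp on return.
latency_program() {
    func=$1
    cat <<EOF
kprobe:$func { @start[tid] = nsecs; }

kretprobe:$func
/@start[tid]/
{
    @latency_us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);   // not guarded by any further condition
}

END { clear(@start); }     // drop entries whose return was never seen
EOF
}
```

Usage, with vfs_read as an example target only: bpftrace -e "$(latency_program vfs_read)". Any additional filtering (for example on duration) belongs around the hist() line, not around the delete().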

offcputime OOM on a busy scheduler: Running offcputime-bpfcc 60 system-wide on a host with 2000+ threads context-switching frequently caused the BCC user-space Python process to allocate 4GB+ of memory storing stack trace strings. Fix: use -m 1000 (the value is in microseconds, so this is a minimum off-CPU duration of 1ms) to reduce output volume; or use bpftrace with a histogram instead of full stacks.

uprobe against the wrong binary version: An engineer ran uprobe:/usr/bin/myservice:main.handleRequest against PID 12345, but PID 12345 was running a different version of myservice than the file on disk (a new build had been hot-deployed 5 minutes earlier). The symbol lived at a different offset in the new binary, so the probe did not correspond to the function actually executing in the process. Fix: always verify the binary matches the running process: md5sum /usr/bin/myservice /proc/12345/exe. If they differ, attach through /proc/12345/exe instead.
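
The verification step can be folded into a small helper that picks the probe target. This is a sketch under stated assumptions: uprobe_target is a hypothetical name, it compares file contents with cmp rather than md5sum, and the optional third argument exists only so the logic can be exercised against a directory other than /proc.

```shell
# Hypothetical helper: print the path a uprobe should use for a given PID.
#   $1 = PID, $2 = on-disk binary path, $3 = proc mount (default /proc)
uprobe_target() {
    target_pid=$1
    binary=$2
    proc_root=${3:-/proc}
    if cmp -s "$binary" "$proc_root/$target_pid/exe"; then
        echo "$binary"                        # disk file matches the running image
    else
        echo "$proc_root/$target_pid/exe"     # mismatch or unreadable: use the live image
    fi
}
```

Usage: bpftrace -e "uprobe:$(uprobe_target 12345 /usr/bin/myservice):main.handleRequest { @calls = count(); }". Reading /proc/PID/exe for another user's process requires sufficient privileges, which bpftrace needs anyway.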

Modern Usage

bpftrace 0.19+ (2023): adds support for iterators (iterate kernel data structures without event triggers), improved BTF support, new built-in functions.
Inspektor Gadget: a Kubernetes-native eBPF debugging toolkit that wraps BCC/bpftrace tools in a Kubernetes-aware interface. Deploy as a DaemonSet, then run kubectl gadget trace exec to trace execve across the cluster.
Beyla (Grafana): eBPF-based automatic HTTP/gRPC instrumentation without code changes. Uses uprobes + USDT for zero-instrumentation distributed tracing.
Tetragon (Isovalent/Cilium): eBPF-based security observability — detects privilege escalation, file access violations, network anomalies.

Future Directions

libbpf CO-RE as default: the trend is moving from BCC (compile at runtime, requires kernel headers) to libbpf CO-RE (compile once, run everywhere via BTF type info). bpftrace is also adopting a CO-RE backend.
bpftrace iterators: broader iterator coverage (TCP socket table, file descriptor table) would let bpftrace programs walk more kernel data structures without waiting for events.
eBPF for Windows: Microsoft's eBPF-for-Windows project aims to bring eBPF programs, and potentially bpftrace-like tooling, to Windows environments.

Exercises

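bpftrace system call profiler: Write a bpftrace program that counts system calls by call name for all processes for 30 seconds and prints the top-20 most frequent, with their percentages. Add a filter to exclude kernel threads (hint: kernel threads have no user address space, so the mm pointer in their task_struct is null; pid > 1 alone does not exclude them).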
Short-lived connection finder: Write a bpftrace program that traces all TCP connections and flags any connection where the time between tcp_connect and tcp_close is less than 10ms (probably a connection reset or refused). Print the destination IP/port and the process name.
Python hot function tracer: Use USDT to trace all Python function entries in a running Python web service. Record the function name and call count over 60 seconds. Produce a frequency-ordered list of the top-10 most-called Python functions. Compare with the CPU flame graph from profile-bpfcc. Are the hot Python functions also the hot CPU functions?
Off-CPU investigation: Use offcputime-bpfcc to analyze a Java service. Capture kernel stacks for 30 seconds. Identify the top-3 kernel call paths where the service spends time off-CPU. For each, explain what it is waiting for (I/O, lock, sleep?) by reading the kernel function names in the stack.
uprobe malloc size histogram: Use uprobe on malloc() in libc to trace allocation sizes for a specific process (e.g., a running Java process or a Go service). Build a histogram of allocation sizes. What is the most common allocation size? Are there any unusually large single allocations that might indicate memory leaks?

References

Gregg, Brendan. BPF Performance Tools. Addison-Wesley, 2019. (Complete reference; chapters 1-5 cover tools taxonomy)
bpftrace Reference Guide: https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md
BCC Tutorial: https://github.com/iovisor/bcc/blob/master/docs/tutorial.md
Gregg, Brendan. "Linux eBPF Superpowers." http://brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html
Gregg, Brendan. "Off-CPU Analysis." http://brendangregg.com/offcpuanalysis.html
libbpf CO-RE guide: https://nakryiko.com/posts/bpf-core-reference-guide/
USDT probe tutorial: https://leezhenghui.github.io/linux/2019/03/05/exploring-usdt-on-linux.html
Inspektor Gadget: https://www.inspektor-gadget.io/