07 — Profiling and Flame Graphs
Technical Overview
Profiling answers the question: "Where does the program spend its time?" It is the act of attributing execution time or resource consumption to program locations—functions, lines, or instructions. Without profiling, optimization is speculation; with profiling, the bottleneck is visible.
Flame graphs, invented by Brendan Gregg in 2011 and presented at USENIX LISA in 2013, are among the most information-dense visualizations available for CPU profiles. They represent an entire profile, potentially millions of stack samples, in a single image where the hot path is visually obvious and every call chain is navigable.
Prerequisites
- Stack frame mechanics on x86-64 (rbp, rsp, return addresses).
- ELF binary format: symbols, DWARF debug sections.
- perf tool basics.
- Understanding of the sampling vs. tracing distinction.
Core Content
Profiling Types
| Type | What It Measures | Primary Tool | Overhead |
|---|---|---|---|
| CPU on-CPU | Time executing (not blocked) | perf record, gprof | 1–5% |
| CPU off-CPU | Time blocked (I/O, lock, sleep) | offcputime (BCC), perf sched | 1–5% |
| Memory allocation | Heap allocation sites, sizes | heaptrack, valgrind massif | 10–100% |
| Memory access | Cache miss locations (PEBS) | perf mem | ~5% |
| I/O | Disk I/O time per call site | fileslower (BCC) | < 2% |
| Mutex / lock | Lock contention per site | perf lock, mutrace | 2–10% |
| System call | Syscall duration and frequency | perf trace, strace -c | 1–50% |
Sample-based (statistical) profiling: the profiler interrupts the program at regular intervals (N Hz), captures the program counter and call stack, and accumulates a histogram. Each sample is a snapshot; the histogram converges to the true time distribution. Overhead is proportional to sample rate—typically 1–5% at 99–999 Hz.
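The convergence claim can be illustrated with a small simulation. This is a minimal sketch, not a real profiler: the workload, its function names (parse, hash, log) and their time shares are hypothetical, and snapshots are taken at random instants rather than at a fixed interval.

```python
import random
from collections import Counter

# Hypothetical workload: fraction of run time spent in each function.
TRUE_SHARE = {"parse": 0.6, "hash": 0.3, "log": 0.1}

def function_at(t: float) -> str:
    """Return the function running at time t in [0, 1) for the toy workload."""
    edge = 0.0
    for name, share in TRUE_SHARE.items():
        edge += share
        if t < edge:
            return name
    return name  # guards against floating-point rounding when t is close to 1

def sample_profile(n_samples: int, seed: int = 0) -> dict[str, float]:
    """Take n_samples snapshots and return the estimated time share per function."""
    rng = random.Random(seed)
    hist = Counter(function_at(rng.random()) for _ in range(n_samples))
    return {name: hist[name] / n_samples for name in TRUE_SHARE}

if __name__ == "__main__":
    for n in (100, 10_000):
        print(n, sample_profile(n))
```

With more samples the estimated shares settle closer to the true ones, which is why a 30-second profile at 99 Hz (about 3,000 samples per CPU) is usually enough to rank hot functions.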
Instrumentation-based profiling: the profiler injects code at every function entry/exit (gcc -pg for gprof, JVM TI method entry/exit events for Java). Every call is counted and timed. Overhead can reach 10–100x and perturbs the very behavior being measured. Use it for exact call counts; avoid it for latency-sensitive production profiling.
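As a rough illustration of entry/exit instrumentation, the sketch below wraps functions with a decorator that counts calls and accumulates inclusive time. The names instrumented, checksum and process are hypothetical; real tools inject the hooks at compile time or in the runtime rather than through a decorator.

```python
import functools
import time
from collections import defaultdict

CALLS = defaultdict(int)      # function name -> number of calls
SECONDS = defaultdict(float)  # function name -> inclusive wall time

def instrumented(fn):
    """Wrap fn so that every call is counted and timed (entry/exit hooks)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()          # entry hook
        try:
            return fn(*args, **kwargs)
        finally:                             # exit hook, runs even on exceptions
            CALLS[fn.__name__] += 1
            SECONDS[fn.__name__] += time.perf_counter() - start
    return wrapper

@instrumented
def checksum(data: bytes) -> int:
    return sum(data) % 65521

@instrumented
def process(chunks: list[bytes]) -> list[int]:
    return [checksum(c) for c in chunks]

if __name__ == "__main__":
    process([b"abc", b"de"])
    for name in CALLS:
        print(name, CALLS[name], f"{SECONDS[name]:.6f}s")
```

The two clock reads and dictionary updates per call are the source of the perturbation: for a function as small as checksum, the hooks can cost more than the work being measured.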
Sample-Based CPU Profiling Mechanics
On Linux, a sampling profiler uses one of two mechanisms:
- SIGPROF (timer-based): the OS sends SIGPROF to the process every N microseconds of CPU time (setitimer with ITIMER_PROF). The signal handler captures the current PC and user stack. Works entirely in user space; cannot capture kernel frames. User-space profilers such as gperftools work this way.
- PMU interrupt (PMI, Performance Monitoring Interrupt): the PMU is configured to interrupt after N CPU cycles (e.g., about 30,300,000 cycles ≈ 10.1 ms at 3 GHz for a 99 Hz rate; a period of 1,000,000 cycles would instead give ~3,000 Hz). The interrupt fires in kernel context and can capture the full kernel + user stack. This is the mechanism perf record uses.
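The relation between clock frequency, counter period and sampling rate is simple arithmetic. The helper names below are hypothetical, and the sketch assumes a fixed clock (no frequency scaling).

```python
def cycles_per_sample(clock_hz: float, sample_hz: float) -> int:
    """Counter period, in cycles, that yields sample_hz on a core running at clock_hz."""
    return round(clock_hz / sample_hz)

def effective_rate_hz(clock_hz: float, period_cycles: int) -> float:
    """Sampling rate obtained when the counter overflows every period_cycles cycles."""
    return clock_hz / period_cycles

if __name__ == "__main__":
    clock = 3e9  # 3 GHz
    period = cycles_per_sample(clock, 99)
    print(f"99 Hz at 3 GHz: one sample every {period:,} cycles "
          f"({1e3 / 99:.1f} ms)")
    print(f"1,000,000-cycle period at 3 GHz: {effective_rate_hz(clock, 1_000_000):,.0f} Hz")
```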
# Sample at 99 Hz, all CPUs, capture call graphs
perf record -F 99 -a -g -- sleep 30
# Sample at 999 Hz for specific PID
perf record -F 999 -g -p <pid> -- sleep 10
# Sample with DWARF unwinding (more accurate but higher overhead)
perf record -F 99 -g --call-graph dwarf -p <pid> -- sleep 30
Stack Unwinding Methods
When the PMI fires, the kernel must capture the call stack. How it does so depends on the configuration:
Frame pointer unwinding (fast, requires -fno-omit-frame-pointer):
Stack frame layout (x86-64 with frame pointers):
Higher address
┌─────────────────────────┐
│ Return address │
│ Caller's frame pointer │ ← rbp (frame pointer register)
│ Local variables │
│ ... │
└─────────────────────────┘
Lower address (current rsp)
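Unwinding: read rbp; the return address is at [rbp+8] and the caller's saved rbp is at [rbp]; load the saved rbp and repeat until it is zero or invalid. Very fast (no unwind-table lookup needed), but GCC and Clang omit frame pointers by default when optimizing (-fomit-frame-pointer), which is usually credited with a ~1–3% speedup.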
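The walk can be sketched over a toy memory image. This is an illustration only: the addresses and the dict-based memory are hypothetical, and the depth limit of 127 is an assumed default chosen for the sketch.

```python
def unwind(memory: dict[int, int], rbp: int, rip: int, max_depth: int = 127) -> list[int]:
    """Follow the saved-rbp chain and return code addresses, innermost frame first."""
    pcs = [rip]
    while rbp != 0 and len(pcs) < max_depth:
        pcs.append(memory[rbp + 8])  # return address sits just above the saved rbp
        rbp = memory[rbp]            # saved rbp of the caller
    return pcs

# Toy stack for main() -> bar() -> foo(); all addresses are made up.
MEMORY = {
    0x7000: 0x7020, 0x7008: 0x401150,  # foo's frame: saved rbp, return into bar
    0x7020: 0x7040, 0x7028: 0x401250,  # bar's frame: saved rbp, return into main
    0x7040: 0x0,    0x7048: 0x401350,  # main's frame: chain ends, return into startup code
}

if __name__ == "__main__":
    for pc in unwind(MEMORY, rbp=0x7000, rip=0x401050):
        print(hex(pc))
```

Each step costs two memory reads, which is why this method is cheap enough to run inside the sampling interrupt.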
# Compile with frame pointers
gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c
# Or for an entire system (Fedora's approach since Fedora 38):
# rpm packages compiled with frame pointers
DWARF unwinding (accurate, high overhead):
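DWARF (commonly expanded as Debugging With Attributed Record Formats) stores unwind tables (the .eh_frame or .debug_frame section) that describe how to reconstruct the caller's registers at any PC. With --call-graph dwarf the kernel does not interpret these tables itself: it copies the sampled registers and a chunk of the user stack (8 KB by default) into each sample, and perf unwinds them later in user space. Accurate even without frame pointers, but the per-sample stack copy makes perf.data large and raises overhead.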
perf record -F 99 --call-graph dwarf -p <pid>
# Overhead: 3–10% (higher than frame pointer)
ORC (Oops Rewind Capability) (Linux kernel only):
A simplified, faster alternative to DWARF for the kernel itself. ORC tables are generated by objtool during kernel build and stored in a compact format. Enables reliable kernel stack unwinding at PMI time.
Symbol Resolution
A profile is useful only when sample PCs are translated to function names.
- Kernel symbols: /proc/kallsyms maps kernel function addresses to names. Requires perf_event_paranoid < 1 or root.
- DWARF debug info: stripped binaries ship their debug info separately (debuginfo packages). perf report uses these automatically if installed.
- JIT/dynamic code: JVM, V8, and similar runtimes JIT-compile code at runtime. perf can't resolve these without help.
  - JVM: -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints + perf-map-agent, which writes /tmp/perf-<pid>.map with JIT mappings.
  - Node.js: node --perf-basic-prof generates /tmp/perf-<pid>.map.
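A perf map file is plain text with one symbol per line: start address and size in hexadecimal, then the symbol name. The sketch below shows how a sampled address could be resolved against such a file; the sample entries and function names are hypothetical.

```python
import bisect

def parse_perf_map(text: str) -> list[tuple[int, int, str]]:
    """Parse 'START SIZE NAME' lines (hex, hex, free text) into a sorted list."""
    entries = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # the symbol name may itself contain spaces
        if len(parts) != 3:
            continue
        entries.append((int(parts[0], 16), int(parts[1], 16), parts[2]))
    entries.sort()
    return entries

def resolve(entries: list[tuple[int, int, str]], addr: int) -> str:
    """Return 'symbol+0xoffset' for addr, or the raw hex address if unmapped."""
    starts = [start for start, _, _ in entries]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = entries[i]
        if addr < start + size:
            return f"{name}+0x{addr - start:x}"
    return f"0x{addr:x}"

# Hypothetical contents of /tmp/perf-<pid>.map
SAMPLE_MAP = """\
7f3a10000000 40 Interpreter
7f3a10000100 1a0 LFoo;::bar
7f3a10000400 80 LazyCompile:*handler /app/server.js:42
"""

if __name__ == "__main__":
    table = parse_perf_map(SAMPLE_MAP)
    for addr in (0x7F3A10000110, 0x7F3A100002A0, 0x7F3A10000400):
        print(hex(addr), "->", resolve(table, addr))
```

An address that falls between two mapped ranges stays as a hex value, which is how stale or missing map entries show up in a flame graph.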
Flame Graph Methodology
Brendan Gregg invented the flame graph in 2011 after struggling to read the voluminous text output of a stack-sampling profile for a MySQL performance problem. The insight: represent thousands of stack samples as a single stacked bar chart where wider = more time.
Reading a flame graph:
Width of frame = proportion of total samples with this frame on stack
Height (y-axis) = call depth (bottom = thread/process, top = hot functions)
Color = no meaning in the default palette (random warm hues); some palettes encode code type (not hot/cold; use differential flame graphs for that)
Order = alphabetical within a level (NOT time order)
Example:
        ┌────────────────┐
        │     foo()      │  ← foo() is hot: wide at the top
    ┌───┴────────────────┴───┐
    │         bar()          │
  ┌─┴────────────────────────┴─┐
  │        main_loop()         │
┌─┴────────────────────────────┴─┐
│         start_thread()         │
└────────────────────────────────┘
→ main_loop() calls bar() which calls foo() most of the time
→ foo() is the bottleneck (wide frame at top of call stack)
What to look for:
- Wide frames near the top of the stack: hottest functions, most optimization potential.
- Flat tops (wide frame with nothing above it): leaf functions spending all their time here, i.e. compute bound.
- Towers (tall narrow columns): many function calls with little time per frame; call overhead or call chains to investigate.
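Frame width and flat tops correspond to two numbers that can be computed directly from folded stacks: the inclusive share (frame anywhere on the stack) and the self share (frame at the top). The sketch below aggregates by function name, whereas a flame graph keeps a separate frame per call path; the sample data is hypothetical.

```python
from collections import Counter

def frame_stats(folded: dict[str, int]) -> dict[str, tuple[float, float]]:
    """Map each function to (inclusive share, self share) of all samples."""
    total = sum(folded.values())
    inclusive, self_time = Counter(), Counter()
    for stack, count in folded.items():
        frames = stack.split(";")          # root first, leaf last
        for frame in set(frames):          # count a recursive frame once per sample
            inclusive[frame] += count
        self_time[frames[-1]] += count     # the leaf is where the CPU actually was
    return {f: (inclusive[f] / total, self_time[f] / total) for f in inclusive}

# Hypothetical folded profile: "root;...;leaf" -> sample count
FOLDED = {
    "start_thread;main_loop;bar;foo": 70,
    "start_thread;main_loop;bar": 20,
    "start_thread;main_loop": 10,
}

if __name__ == "__main__":
    for name, (incl, own) in sorted(frame_stats(FOLDED).items()):
        print(f"{name:14s} inclusive={incl:.0%} self={own:.0%}")
```

A function with a high self share is a flat top; one with a high inclusive share but low self share is wide only because of what it calls.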
Flame Graph Generation
# Step 1: Collect profile
perf record -F 99 -a -g -- sleep 30
# Step 2: Export to text
perf script > out.perf
# Step 3: Fold stacks (collapse identical stacks)
stackcollapse-perf.pl out.perf > out.folded
# Step 4: Generate SVG
flamegraph.pl out.folded > flame.svg
# View in browser (interactive SVG: click to zoom, hover for percentages)
xdg-open flame.svg  # on macOS: open flame.svg
Tools:
- stackcollapse-perf.pl, flamegraph.pl: https://github.com/brendangregg/FlameGraph
- inferno (Rust implementation, faster): https://github.com/jonhoo/inferno
- parca / pyroscope: continuous production flame graphs with web UI
One-liner for quick profiling:
perf record -F 99 -a -g -- sleep 30; \
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
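The folding step is conceptually small: identical stacks are merged into one line with a count. The sketch below is a simplified stand-in for stackcollapse-perf.pl, not a replacement; it assumes each sample is already a list of frame names with the leaf first, and it omits the process name that the real script prepends.

```python
from collections import Counter

def fold(samples: list[list[str]]) -> list[str]:
    """Collapse leaf-first stack samples into 'root;...;leaf count' lines."""
    counts = Counter(";".join(reversed(stack)) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

if __name__ == "__main__":
    # Hypothetical samples, innermost frame first.
    samples = [
        ["foo", "bar", "main"],
        ["foo", "bar", "main"],
        ["baz", "main"],
    ]
    print("\n".join(fold(samples)))
```

Because the output has one line per unique stack, millions of samples typically fold down to a few thousand lines, which is what keeps the SVG small.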
Annotated Flame Graph Reading Guide
FLAME GRAPH READING GUIDE
═══════════════════════════
Y-AXIS (vertical):
Top = leaf functions (where time is spent)
Bottom = root frames (thread start, main())
X-AXIS (horizontal):
Width = % of total samples
Order = alphabetical (NOT chronological)
Narrow = infrequently on-stack
COLORS (flamegraph.pl):
Default ("hot") = random warm hues, no meaning
--colors=java palette:
Green = Java, Yellow = C++
Orange = kernel, Red = other user-space native code
(Custom palettes common in modern tools)
PATTERNS TO RECOGNIZE:
┌────────────────────┐
│ compute_hash() │ ← Flat top + wide = hottest function
└────────────────────┘ Fix: optimize this function
┌──┐
│b │ ← Narrow = few samples; usually safe to ignore
└──┘
┌──────┐ ┌──────┐ ┌──────┐
│func1 │ │func2 │ │func3 │ ← Wide frame split into
└──────┘ └──────┘ └──────┘ many sub-frames =
coarse framing; drill down
┌──────────────────────────┐
│ lock_wait() │ ← Wide "blocked" frame
└──────────────────────────┘ = off-CPU time if using off-CPU FG
= lock contention hotspot
ZOOM: In interactive SVG, click any frame to zoom into that subtree.
RESET: Click "Reset Zoom" or press Esc.
SEARCH: Use the search box to highlight all frames matching a regex.
Off-CPU Flame Graphs
On-CPU flame graphs show where the CPU is busy. Off-CPU flame graphs show where threads are blocked—waiting on I/O, locks, sleep, or scheduling.
Off-CPU time is captured by tracing schedule() calls in the kernel (when a thread is descheduled) and timestamping wake-up events.
# BCC offcputime: trace off-CPU time for a PID (-f emits folded stacks, -d separates kernel and user stacks)
/usr/share/bcc/tools/offcputime -df -p <pid> 30 > out.offcpu
flamegraph.pl --color=io --title="Off-CPU" --countname=us < out.offcpu > offcpu.svg
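# bpftrace version (sketch; on some kernels the symbol is finish_task_switch.isra.0)
bpftrace -e '
#include <linux/sched.h>
kprobe:finish_task_switch {
  // arg0 is the task that just left the CPU: record when it went off-CPU
  $prev = (struct task_struct *)arg0;
  @start[$prev->pid] = nsecs;
  // the current task is being switched back in; its stacks show where it blocked
  $last = @start[tid];
  if ($last != 0) {
    @offcpu_ns[kstack, ustack, comm] = sum(nsecs - $last);
    delete(@start[tid]);
  }
}
'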
The resulting flame graph shows the call stack when the thread went to sleep, sorted by total blocked time. A wide frame in an off-CPU flame graph at epoll_wait means the thread is I/O bound; at futex_wait means lock contention; at nanosleep means intentional sleeps.
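The accounting behind an off-CPU profile can be sketched over a list of scheduler events. The event format here is hypothetical (timestamp in ns, "out" or "in", thread id, folded stack at switch-out); real tools collect the same information in the kernel.

```python
from collections import defaultdict

def off_cpu_time(events: list[tuple[int, str, int, str]]) -> dict[str, int]:
    """Sum blocked nanoseconds per stack from switch-out / switch-in event pairs."""
    went_off: dict[int, tuple[int, str]] = {}  # tid -> (switch-out time, stack)
    totals: dict[str, int] = defaultdict(int)
    for ts, kind, tid, stack in sorted(events, key=lambda e: e[0]):
        if kind == "out":
            went_off[tid] = (ts, stack)
        elif kind == "in" and tid in went_off:
            start, blocked_stack = went_off.pop(tid)
            totals[blocked_stack] += ts - start
    return dict(totals)

if __name__ == "__main__":
    # Hypothetical trace for two threads.
    events = [
        (1000, "out", 1, "main;read_config;read"),
        (5000, "in", 1, ""),
        (6000, "out", 1, "main;worker;futex_wait"),
        (6500, "out", 2, "main;read_config;read"),
        (9000, "in", 1, ""),
        (9500, "in", 2, ""),
    ]
    for stack, ns in off_cpu_time(events).items():
        print(f"{stack} {ns}")
```

The output is already in folded form, with blocked nanoseconds in place of sample counts, so it can be fed to the same rendering step as an on-CPU profile.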
Memory Flame Graphs
Memory flame graphs show which call stacks are responsible for heap allocations.
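# heaptrack: track all allocations (attach to a running process)
heaptrack -p <pid>
# Write folded allocation stacks, then render them
heaptrack_print -f heaptrack.<app>.<pid>.gz -F stacks.txt
flamegraph.pl --title "heaptrack: allocations" --colors mem --countname allocations < stacks.txt > memory.svg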
# valgrind massif
valgrind --tool=massif --pages-as-heap=yes ./program
ms_print massif.out.<pid> > massif_report.txt
# perf mem (sample memory access locations with PEBS)
perf mem record -p <pid> -- sleep 10
perf mem report --sort=mem,sym
Memory flame graphs reveal: which code paths allocate the most (total bytes), allocation hotspots (high allocation rate causing GC pressure), memory leaks (growing allocations not freed).
Historical Context
gprof (1982, Berkeley Unix; later reimplemented by GNU) was the first widely used profiler on Unix. It used compile-time instrumentation and was inaccurate due to measurement perturbation. oprofile (2002) introduced PMU-based profiling on Linux. perf (2009, merged into Linux 2.6.31) replaced oprofile with a more complete and actively developed tool.
Brendan Gregg's flame graph was born from a MySQL performance problem he was investigating at Joyent in 2011. He was staring at page after page of text profiler output and found it impossible to see the hot path across a recursive call chain. The first flame graph was drawn, the problem was immediately visible, and the visualization was presented at USENIX LISA in 2013 and described in a 2016 ACM Queue article.
The introduction of eBPF (2014) and BCC (2015) enabled continuous production profiling without recompilation—profiling tools became safe to run against production servers with < 2% overhead. This catalyzed the continuous profiling industry (Polar Signals, Pyroscope, Parca, Datadog Continuous Profiler).
Production Examples
Case: MySQL bottleneck found in minutes with flame graph. Brendan Gregg's original use case: a MySQL server was slow. perf record + flame graph showed 40% of CPU in String_Copy() inside the join optimization code. A quick code change reduced this to < 5%. Total time from problem report to fix: 4 hours, most of which was waiting for the flame graph to generate. Without flame graphs, the same investigation would have taken days of reading gprof tables.
Case: Node.js memory leak via allocation flame graph. A Node.js service at Uber slowly grew its heap from 100 MB to 2 GB over 24 hours. Using heaptrack against a development replica, the memory flame graph showed 80% of allocations from a specific route handler that was capturing a closure with a reference to the full request object. The closure outlived the request, preventing GC. Fix: extract only needed fields from the request.
Debugging Notes
- perf report in text mode: use perf report --stdio --no-children for a flat profile (self time only, without time accumulated from callees). Sort by self to find leaf hotspots.
- If symbol names appear as hex addresses: install debuginfo packages (debuginfo-install <pkg> on RHEL/Fedora, apt install <pkg>-dbg or <pkg>-dbgsym on Debian).
- JVM profiling: use async-profiler instead of perf for Java. It handles Java frames correctly, resolves JIT symbols, and supports both CPU and allocation profiling.
- perf record creates perf.data, which can be copied to a developer machine for offline analysis. Use perf archive to bundle debug symbols.
- FlameScope (Netflix): visualizes a profile as a subsecond-offset heatmap over time; selecting a time range renders a flame graph for it, which helps identify intermittent hot paths.
Security Implications
Flame graphs expose function names, file paths, and call chains. In production, they may reveal security-sensitive logic (crypto operations, authentication flows). Treat flame graph SVG files as sensitive and restrict access appropriately.
Symbol resolution via /proc/kallsyms requires perf_event_paranoid < 1 or kernel.kptr_restrict = 0. In production:
sysctl -w kernel.perf_event_paranoid=2 # Restrict perf events
sysctl -w kernel.kptr_restrict=2 # Hide kernel pointers
eBPF-based profilers (parca, pyroscope) typically need CAP_PERFMON and CAP_BPF (Linux 5.8+), or CAP_SYS_ADMIN on older kernels. Audit who can deploy them in multi-tenant environments.
Performance Implications
Sample-based profiling at 99 Hz adds 1–3% CPU overhead. The PMU interrupt is handled in kernel context and requires saving/restoring registers + unwinding the stack. At 999 Hz, overhead is 5–10%. For production continuous profiling, 19 Hz is common (< 0.5% overhead).
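Stack unwinding overhead: frame-pointer unwinding is ~1 µs per sample; DWARF unwinding is ~10 µs per sample. For a 10-CPU system at 99 Hz, frame-pointer profiling performs ~1,000 unwind operations/second = ~1 ms/s of kernel time ≈ 0.01% of the 10 CPU-seconds available each second. This suggests that unwinding itself is a small part of the total cost; the rest comes from interrupt handling, copying samples, and writing perf.data.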
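The estimate can be reproduced with a few lines of arithmetic. The per-sample costs are the rough figures quoted above, not measurements, and the helper name is hypothetical.

```python
def unwind_overhead(cpus: int, sample_hz: float, unwind_us: float) -> tuple[float, float]:
    """Return (samples per second, fraction of total CPU capacity spent unwinding)."""
    samples_per_sec = cpus * sample_hz
    busy_us_per_sec = samples_per_sec * unwind_us
    capacity_us_per_sec = cpus * 1_000_000  # each CPU offers 1 s of time per second
    return samples_per_sec, busy_us_per_sec / capacity_us_per_sec

if __name__ == "__main__":
    for label, cost_us in (("frame pointer", 1.0), ("DWARF", 10.0)):
        rate, fraction = unwind_overhead(cpus=10, sample_hz=99, unwind_us=cost_us)
        print(f"{label}: {rate:.0f} unwinds/s, {fraction:.4%} of CPU capacity")
```

Note that the fraction does not depend on the CPU count: each CPU takes the same number of samples per second, so the per-CPU cost is simply sample rate times unwind time.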
Failure Modes and Real Incidents
Broken stacks / missing frames. The most common complaint: flame graphs showing [unknown] frames or stacks that terminate early. Causes:
1. Missing frame pointers: recompile with -fno-omit-frame-pointer, or use DWARF unwinding.
2. Stack grew beyond perf's max stack depth: sysctl -w kernel.perf_event_max_stack=256.
3. JIT code without symbol maps: configure the runtime to emit perf maps.
Profiler heisenbug. A Python service showed a "performance problem" only when profiled. The profiler's signal delivery (SIGPROF) was interfering with a signal-sensitive critical section, causing the profiler to create the problem it was measuring. Fix: switched to PMU-based profiling, which doesn't use SIGPROF.
Modern Usage
Continuous profiling is now standard at hyperscale: Google-Wide Profiling, Datadog Continuous Profiler, Grafana Pyroscope, and Polar Signals Parca all offer always-on CPU profiling, the latter two via eBPF. Overhead is < 1% at 19 Hz sampling.
async-profiler (Java) uses the AsyncGetCallTrace API to sample Java stacks safely from signal handlers, bypassing the safepoint bias problem that afflicts JVM-TI profilers. It is a strong default for Java CPU profiling.
Differential flame graphs compare two profiles (before/after a change) with red indicating functions that got more time and blue indicating less. Essential for confirming that an optimization actually worked.
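The comparison underlying a differential flame graph is a per-stack subtraction of sample counts. The sketch below works on two folded profiles held as dicts; the stacks and counts are hypothetical, and it assumes both profiles cover the same duration so that raw counts are comparable.

```python
def diff_folded(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return after-minus-before sample counts for every stack in either profile."""
    stacks = sorted(set(before) | set(after))
    return {s: after.get(s, 0) - before.get(s, 0) for s in stacks}

def color(delta: int) -> str:
    """Red for stacks that gained samples, blue for stacks that lost them."""
    if delta > 0:
        return "red"
    if delta < 0:
        return "blue"
    return "unchanged"

if __name__ == "__main__":
    before = {"main;search;linear_scan": 80, "main;parse": 20}
    after = {"main;search;binary_search": 10, "main;parse": 20}
    for stack, delta in diff_folded(before, after).items():
        print(f"{stack:28s} {delta:+4d} {color(delta)}")
```

When the two runs differ in length or load, the counts should be normalized to the same total before subtracting, otherwise every frame appears to change.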
Future Directions
- Continuous profiling with CI integration: automatically compare flame graphs for each commit, flagging regressions.
- AI annotation of flame graphs: LLMs identifying optimization opportunities from flame graph structure.
- eBPF-based allocation profiling: tracing malloc/free with stack unwinding in eBPF, approaching zero overhead for production memory profiling.
- Distributed flame graphs: showing call stacks that span microservices via distributed tracing + profiling correlation (Grafana Tempo + Pyroscope integration).
Exercises
- Profile a compute-intensive program (./stress --cpu 1) with perf record -F 99 -g for 30 seconds. Generate a flame graph. Identify the top-3 widest frames and explain what they represent.
- Profile a program with frequent blocking I/O using offcputime (BCC). Generate an off-CPU flame graph. Compare the off-CPU hot paths with the on-CPU hot paths. What does the difference reveal about the program's bottleneck?
- Introduce a deliberate hot path in a C program (a tight loop calling a hash function repeatedly). Profile with and without -fno-omit-frame-pointer. Compare the flame graph quality; specifically, look for broken stacks.
- Profile a Java application using async-profiler (./profiler.sh -d 30 -f profile.html <pid>). Open the HTML flame graph and identify the most CPU-intensive method. Explain how the JIT heuristic might affect profiling accuracy.
- Generate a differential flame graph: profile a program before and after changing an algorithm (e.g., linear search to binary search). Use difffolded.pl on the two folded profiles and pipe the result to flamegraph.pl to generate the diff (flamegraph.pl --negate swaps the red and blue hues). Explain what the red and blue regions mean.
References
- Gregg, B. "The Flame Graph." ACM Queue, 2016. https://queue.acm.org/detail.cfm?id=2927301
- Gregg, B. Systems Performance (2nd ed., 2020). Chapter 2: Methodologies. Chapter 5: Applications.
- FlameGraph repository: https://github.com/brendangregg/FlameGraph
- Gregg, B. BPF Performance Tools. Pearson, 2019.
- async-profiler: https://github.com/async-profiler/async-profiler
- Inferno (Rust flamegraph): https://github.com/jonhoo/inferno
- Polar Signals Parca: https://www.parca.dev/
- Grafana Pyroscope: https://grafana.com/oss/pyroscope/
- DWARF standard: https://dwarfstd.org/
- ORC unwinder: https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html