07 — Profiling and Flame Graphs

Technical Overview

Profiling answers the question: "Where does the program spend its time?" It is the act of attributing execution time or resource consumption to program locations—functions, lines, or instructions. Without profiling, optimization is speculation; with profiling, the bottleneck is visible.

Flame graphs, invented by Brendan Gregg in 2011 and later described in a 2016 ACM Queue article, are among the most information-dense visualizations of CPU profiles. They represent an entire profile, potentially millions of stack samples, in a single image where the hot path is visually obvious and every call chain is navigable.

Prerequisites

Stack frame mechanics on x86-64 (rbp, rsp, return addresses).
ELF binary format: symbols, DWARF debug sections.
perf tool basics.
Understanding of sampling vs. tracing distinction.

Core Content

Profiling Types

Type	What It Measures	Primary Tool	Overhead
CPU on-CPU	Time executing (not blocked)	`perf record`, `gprof`	1–5%
CPU off-CPU	Time blocked (I/O, lock, sleep)	`offcputime` (BCC), `perf sched`	1–5%
Memory allocation	Heap allocation sites, sizes	`heaptrack`, `valgrind massif`	10–100%
Memory access	Cache miss locations (PEBS)	`perf mem`	~5%
I/O	Disk I/O time per call site	`fileslower` (BCC)	< 2%
Mutex / lock	Lock contention per site	`perf lock`, `mutrace`	2–10%
System call	Syscall duration and frequency	`perf trace`, `strace -c`	1–50%

Sample-based (statistical) profiling: the profiler interrupts the program at regular intervals (N Hz), captures the program counter and call stack, and accumulates a histogram. Each sample is a snapshot; the histogram converges to the true time distribution. Overhead is proportional to sample rate—typically 1–5% at 99–999 Hz.
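The convergence claim can be illustrated with a small simulation. The sketch below samples a made-up timeline at a fixed period of roughly 99 Hz; the function names and durations are hypothetical, and the code illustrates the statistics only, not how perf is implemented.

```python
from collections import Counter

def sample_timeline(timeline, period_us):
    """Sample a synthetic timeline of (function, duration_us) segments.

    Returns the estimated share of time per function, based only on
    which function was running at each sample instant.
    """
    counts = Counter()
    next_sample = period_us // 2  # arbitrary phase for the first sample
    clock = 0
    for func, duration in timeline:
        end = clock + duration
        # Every sample instant inside this segment is attributed to
        # the function that was running at that moment.
        while next_sample < end:
            counts[func] += 1
            next_sample += period_us
        clock = end
    total = sum(counts.values())
    return {func: n / total for func, n in counts.items()}

# Hypothetical workload: one second of a loop that spends
# 20% in parse, 70% in hash, and 10% in write.
timeline = [("parse", 200), ("hash", 700), ("write", 100)] * 1000

estimate = sample_timeline(timeline, period_us=10_101)  # ~99 Hz
for func, share in sorted(estimate.items()):
    print(f"{func:6s} {share:.1%}")
```

With only 99 samples the estimate lands within about one percentage point of the true 20/70/10 split. Changing the period to exactly 10_000 µs makes every sample hit the same point of the loop, so the whole second is attributed to parse; this lockstep effect is why odd rates such as 99 Hz are preferred over round ones.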

Instrumentation-based profiling: the profiler injects code at every function entry/exit (for example, gcc -pg for gprof, or JVM TI method entry/exit events in Java). Every call is counted and timed. Overhead can reach 10–100x for call-heavy code and perturbs the very behavior being measured. Use it for exact call counts; avoid it for latency-sensitive production profiling.
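A minimal sketch of the instrumentation approach, using a Python decorator as the injected entry/exit code. The names instrument, call_counts, and total_time are hypothetical helpers for this illustration; real tools inject equivalent hooks at compile or load time.

```python
import time
from collections import defaultdict
from functools import wraps

call_counts = defaultdict(int)    # exact number of calls per function
total_time = defaultdict(float)   # inclusive seconds per function

def instrument(func):
    """Wrap func so every call is counted and timed."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            call_counts[func.__name__] += 1
            # Inclusive time: recursive calls are counted again in
            # their callers, so this total over-counts for recursion.
            total_time[func.__name__] += time.perf_counter() - start
    return wrapper

@instrument
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

result = fib(10)
print(result, call_counts["fib"])
```

Every one of the 177 calls made by fib(10) pays for two clock reads and two dictionary updates, which is far more work than the function body itself. That is the perturbation problem in miniature.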

Sample-Based CPU Profiling Mechanics

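On Linux, sampling profilers use one of two mechanisms:

SIGPROF (timer-based): the OS delivers SIGPROF to the process at a fixed interval (set with setitimer). The signal handler captures the current PC and user stack. Works entirely in user space, so it cannot capture kernel frames. Used by gprof, gperftools, and many language-level profilers.
PMU interrupt (PMI, Performance Monitoring Interrupt): the PMU is configured to interrupt after N CPU cycles. At 3 GHz, a period of 1,000,000 cycles is ~333 µs, or ~3,000 samples/s; a 99 Hz rate needs a period of ~30 million cycles (~10.1 ms). The interrupt fires in kernel context and can capture the full kernel + user stack. This is what perf record uses by default (with -F, perf adjusts the period dynamically to hold the requested rate); where no PMU is available, as in many virtual machines, it falls back to the timer-driven cpu-clock software event.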
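The relationship between clock rate, overflow period, and sampling rate is simple arithmetic. The sketch below assumes a fixed 3 GHz clock and a CPU that is busy the whole time; both helper names are hypothetical.

```python
def cycles_per_sample(clock_hz, sample_hz):
    """Cycle count between PMU overflow interrupts for a target rate."""
    return round(clock_hz / sample_hz)

def sample_rate(clock_hz, period_cycles):
    """Samples per second produced by a fixed overflow period."""
    return clock_hz / period_cycles

CLOCK_HZ = 3_000_000_000  # assumed constant 3 GHz core clock

period_99hz = cycles_per_sample(CLOCK_HZ, 99)
rate_1m_cycles = sample_rate(CLOCK_HZ, 1_000_000)

print(period_99hz)      # cycles between samples at 99 Hz
print(rate_1m_cycles)   # samples per second with a 1,000,000-cycle period
```

A 99 Hz rate therefore corresponds to a period of about 30 million cycles, while a 1,000,000-cycle period yields about 3,000 samples per second.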

# Sample at 99 Hz, all CPUs, capture call graphs
perf record -F 99 -a -g -- sleep 30

# Sample at 999 Hz for specific PID
perf record -F 999 -g -p <pid> -- sleep 10

# Sample with DWARF unwinding (more accurate but higher overhead)
perf record -F 99 -g --call-graph dwarf -p <pid> -- sleep 30

Stack Unwinding Methods

When the PMI fires, the kernel must capture the call stack. How it does so depends on the configuration:

Frame pointer unwinding (fast, requires -fno-omit-frame-pointer):

Stack frame layout (x86-64 with frame pointers):
        Higher address
┌─────────────────────────┐
│  Return address         │ ← rbp + 8
│  Caller's frame pointer │ ← rbp (frame pointer register)
│  Local variables        │
│  ...                    │
└─────────────────────────┘
        Lower address (current rsp)

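Unwinding: read the return address at rbp + 8, then dereference rbp to get the caller's rbp, and repeat until the chain ends. Very fast (no unwind-table lookup needed), but GCC and Clang omit frame pointers by default when optimizing (-fomit-frame-pointer), freeing rbp as a general-purpose register for a ~1–3% speedup.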
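The walk can be sketched over a simulated stack. In this Python illustration, memory is a dictionary from address to 8-byte value, and all addresses are invented; it assumes the standard x86-64 layout in which the saved rbp sits at [rbp] and the return address at [rbp + 8].

```python
def unwind(memory, rip, rbp, max_depth=127):
    """Walk the frame-pointer chain; return code addresses, innermost first."""
    stack = [rip]  # the sampled instruction pointer is the leaf frame
    while rbp != 0 and len(stack) < max_depth:
        stack.append(memory[rbp + 8])  # return address into the caller
        rbp = memory[rbp]              # follow the link to the caller's frame
    return stack

# Synthetic stack for main -> bar -> foo (callee frames at lower addresses).
memory = {
    0x7000: 0x7020, 0x7008: 0x401200,  # foo's frame: bar's rbp, return into bar
    0x7020: 0x7040, 0x7028: 0x401100,  # bar's frame: main's rbp, return into main
    0x7040: 0x0,    0x7048: 0x401000,  # main's frame: chain ends, return into startup code
}

frames = unwind(memory, rip=0x401300, rbp=0x7000)
print([hex(addr) for addr in frames])
```

Each step costs two memory reads, which is why this method is cheap enough to run inside the sampling interrupt. If any function in the chain was compiled without frame pointers, rbp holds arbitrary data and the walk either stops early or produces garbage frames.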

# Compile with frame pointers
gcc -O2 -fno-omit-frame-pointer -o myapp myapp.c

# Or for an entire system (Fedora's approach since Fedora 38):
# rpm packages compiled with frame pointers

DWARF unwinding (accurate, high overhead):

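DWARF (Debugging With Arbitrary Record Formats) stores unwind tables (the .eh_frame or .debug_frame section) that describe how to reconstruct the call stack at any PC. The kernel does not interpret these tables: with --call-graph dwarf, it copies the registers and a chunk of the user stack (8 KB by default) into each sample, and perf unwinds them in user space afterwards. Accurate even without frame pointers, but the per-sample stack copy makes recording slower and perf.data much larger.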

perf record -F 99 --call-graph dwarf -p <pid>
# Overhead: 3–10% (higher than frame pointer)

ORC (Oops Rewind Capability) (Linux kernel only):

A simplified, faster alternative to DWARF for the kernel itself. ORC tables are generated by objtool during kernel build and stored in a compact format. Enables reliable kernel stack unwinding at PMI time.

Symbol Resolution

A profile is useful only when sample PCs are translated to function names.

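Kernel symbols: /proc/kallsyms maps kernel function addresses to names. Reading real addresses requires root or kernel.kptr_restrict = 0; unprivileged kernel profiling additionally requires perf_event_paranoid ≤ 1.
DWARF debug info: distribution binaries are stripped, and their debug info ships separately (debuginfo packages). perf report uses these automatically if installed.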
JIT/dynamic code: JVM, V8, and similar runtimes JIT-compile code at runtime. perf can't resolve these without help.
JVM: -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints + perf-map-agent which writes /tmp/perf-<pid>.map with JIT mappings.
Node.js: node --perf-basic-prof generates /tmp/perf-<pid>.map.
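The perf map file is plain text with one JIT-compiled region per line: start address and size in hexadecimal, followed by the symbol name. A minimal Python sketch of how a tool might resolve a sampled address against such a file follows; the sample entries and helper names are hypothetical.

```python
import bisect

def load_perf_map(text):
    """Parse perf map lines of the form '<hex start> <hex size> <name>'."""
    entries = []
    for line in text.splitlines():
        parts = line.split(maxsplit=2)  # symbol names may contain spaces
        if len(parts) != 3:
            continue
        entries.append((int(parts[0], 16), int(parts[1], 16), parts[2]))
    entries.sort()
    return entries

def resolve(entries, addr):
    """Return the symbol whose [start, start + size) range contains addr."""
    starts = [start for start, _, _ in entries]
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, size, name = entries[i]
        if addr < start + size:
            return name
    return None  # address falls outside every mapped region

# Hypothetical contents of /tmp/perf-<pid>.map
SAMPLE_MAP = """\
7f3a10000000 40 Interpreter
7f3a10000400 1a0 java.lang.String::hashCode
7f3a10000800 80 com.example.Handler::handle
"""

entries = load_perf_map(SAMPLE_MAP)
print(resolve(entries, 0x7F3A10000420))
print(resolve(entries, 0x7F3A10000700))
```

The second lookup falls in a gap between regions and returns None, which is what shows up as an unresolved hex address or [unknown] frame in a flame graph.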

Flame Graph Methodology

Brendan Gregg invented the flame graph in 2011 after struggling to read pages of text-based profiler output (stack samples collected with DTrace) for a MySQL performance problem. The insight: represent thousands of stack samples as a single stacked bar chart where wider = more time.

Reading a flame graph:

Width of frame = proportion of total samples with this frame on stack
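Height (y-axis) = call depth (bottom = thread/process entry point, top = leaf functions that were on-CPU)
Color = not significant by default (flamegraph.pl picks random warm hues); some palettes color by code type, and differential flame graphs use red/blue for change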
Order = alphabetical within a level (NOT time order)

Example:
  ┌──────────────────────────┐
  │          foo()           │       ← foo() is hot: wide, with nothing above it
  ├──────────────────────────┴──┐
  │            bar()            │
  ├─────────────────────────────┴──┐
  │          main_loop()           │
  ├────────────────────────────────┴─┐
  │          start_thread()          │
  └──────────────────────────────────┘

→ main_loop() calls bar(), which spends most of its time in foo()
→ foo() is the bottleneck (wide frame at the top of the stack)

What to look for:

Wide frames near the top of the stack: hottest functions, most optimization potential.
Flat tops (wide frame with nothing above it): leaf functions spending all their time here; compute bound.
Towers (tall narrow columns): deep call chains with little time per frame; call overhead or call chains to investigate.

Flame Graph Generation

# Step 1: Collect profile
perf record -F 99 -a -g -- sleep 30

# Step 2: Export to text
perf script > out.perf

# Step 3: Fold stacks (collapse identical stacks)
stackcollapse-perf.pl out.perf > out.folded

# Step 4: Generate SVG
flamegraph.pl out.folded > flame.svg

# View in browser (interactive SVG: click to zoom, hover for percentages)
xdg-open flame.svg   # on macOS: open flame.svg

Tools:

stackcollapse-perf.pl, flamegraph.pl: https://github.com/brendangregg/FlameGraph
inferno (Rust implementation, faster): https://github.com/jonhoo/inferno
parca / pyroscope: continuous production flame graphs with web UI

One-liner for quick profiling:

perf record -F 99 -a -g -- sleep 30; \
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
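The folding step is conceptually small: identical stacks are merged and counted, and each unique stack becomes one line of the form frame;frame;frame count. The Python sketch below illustrates the idea and is not the stackcollapse-perf.pl implementation. It assumes the stacks have already been parsed into root-first lists, and the function names are made up.

```python
from collections import Counter

def fold(stacks):
    """Collapse root-first stacks into sorted 'a;b;c count' lines."""
    counts = Counter(";".join(stack) for stack in stacks)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Five hypothetical samples, each a call stack from root to leaf.
samples = [
    ["main", "parse"],
    ["main", "hash", "sha256"],
    ["main", "hash", "sha256"],
    ["main", "parse"],
    ["main", "hash", "sha256"],
]

folded = fold(samples)
print("\n".join(folded))
```

Folding is what makes very large profiles tractable: millions of samples usually collapse to a few thousand unique stacks, and the renderer only ever sees those.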

Annotated Flame Graph Reading Guide

FLAME GRAPH READING GUIDE
═══════════════════════════

Y-AXIS (vertical):
  Top    = leaf functions (where time is spent)
  Bottom = root frames (thread start, main())

X-AXIS (horizontal):
  Width  = % of total samples
  Order  = alphabetical (NOT chronological)
  Narrow = infrequently on-stack

COLORS (flamegraph.pl):
  Default ("hot") = random warm hues; no meaning
  With --color=java:
    Green  = Java (JIT-compiled)
    Yellow = C++
    Orange = kernel
    Red    = other user-level native code
  Blue (--color=io) = conventional for off-CPU graphs
  (Custom palettes common in modern tools)

PATTERNS TO RECOGNIZE:
  ┌────────────────────┐
  │   compute_hash()   │  ← Flat top + wide = hottest function
  └────────────────────┘    Fix: optimize this function

  ┌──┐
  │b │  ← Narrow = rarely on-CPU (few samples); low priority
  └──┘

  ┌──────┐   ┌──────┐   ┌──────┐
  │func1 │   │func2 │   │func3 │  ← Wide parent split across
  └──────┘   └──────┘   └──────┘    many children = no single
                                      hotspot; zoom in

  ┌──────────────────────────┐
  │     lock_wait()          │  ← Wide "blocked" frame
  └──────────────────────────┘    = off-CPU time if using off-CPU FG
                                   = lock contention hotspot

ZOOM: In interactive SVG, click any frame to zoom into that subtree.
RESET: Click "Reset Zoom".
SEARCH: Click "Search" (or press Ctrl-F) to highlight all frames matching a regex.
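The width and flat-top rules in the guide can be computed directly from folded stacks (lines of the form a;b;c count). The Python sketch below is a simplification: it aggregates by function name rather than by position in the graph, and the input profile is hypothetical.

```python
from collections import Counter

def frame_widths(folded_lines):
    """Return (total samples, inclusive counts, self counts) per function."""
    inclusive, self_samples = Counter(), Counter()
    samples = 0
    for line in folded_lines:
        stack, count = line.rsplit(" ", 1)
        count = int(count)
        frames = stack.split(";")
        samples += count
        for func in set(frames):           # count recursion once per stack
            inclusive[func] += count       # width: on-stack anywhere
        self_samples[frames[-1]] += count  # flat top: leaf of the stack
    return samples, inclusive, self_samples

# Hypothetical folded profile of 100 samples.
folded = [
    "start_thread;main_loop;bar;foo 70",
    "start_thread;main_loop;bar 10",
    "start_thread;main_loop;log 15",
    "start_thread;idle 5",
]

samples, inclusive, self_samples = frame_widths(folded)
hottest = max(self_samples, key=self_samples.get)
print(f"bar width: {inclusive['bar'] / samples:.0%}, self: {self_samples['bar'] / samples:.0%}")
print(f"hottest leaf: {hottest}")
```

Here bar is wide (80% of samples) but has little self time (10%), so the time is in its children. foo has the largest self time and is the flat top to optimize.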

Off-CPU Flame Graphs

On-CPU flame graphs show where the CPU is busy. Off-CPU flame graphs show where threads are blocked—waiting on I/O, locks, sleep, or scheduling.

Off-CPU time is captured by tracing schedule() calls in the kernel (when a thread is descheduled) and timestamping wake-up events.

# BCC offcputime: trace off-CPU time for a PID (-f emits folded stacks for flamegraph.pl)
/usr/share/bcc/tools/offcputime -df -p <pid> 30 > out.offcpu
flamegraph.pl --color=io --title="Off-CPU" --countname=us < out.offcpu > offcpu.svg

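# bpftrace version (same pattern as offcputime.bt; on some kernels the
# probe symbol is finish_task_switch.isra.0)
bpftrace -e '
  #include <linux/sched.h>
  kprobe:finish_task_switch {
    // arg0 is the thread that just left the CPU: record when it went off
    $prev = (struct task_struct *)arg0;
    @start[$prev->pid] = nsecs;

    // The current thread is coming back on-CPU; its stacks show where it blocked
    $last = @start[tid];
    if ($last != 0) {
      @offcpu_ns[kstack, ustack, comm] = sum(nsecs - $last);
      delete(@start[tid]);
    }
  }
'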

The resulting flame graph shows the call stack at the point each thread went to sleep, with frame width proportional to total blocked time. A wide frame at epoll_wait means the thread is mostly waiting for I/O events (which may simply be idle time); at futex_wait, lock or condition-variable contention; at nanosleep, intentional sleeps.
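The bookkeeping behind off-CPU measurement can be sketched over a list of context-switch events: note the time a thread leaves the CPU, and charge the elapsed interval when it next runs again. This Python illustration uses invented timestamps and thread IDs; a real tool would also record the stack at switch-out and key the totals by that stack.

```python
from collections import defaultdict

def off_cpu_time(switches):
    """Sum off-CPU nanoseconds per thread.

    switches: time-ordered (timestamp_ns, prev_tid, next_tid) tuples,
    one per context switch on a CPU.
    """
    went_off = {}               # tid -> time it was switched out
    blocked = defaultdict(int)  # tid -> accumulated off-CPU time
    for ts, prev_tid, next_tid in switches:
        went_off[prev_tid] = ts
        start = went_off.pop(next_tid, None)
        if start is not None:   # skip threads never seen leaving the CPU
            blocked[next_tid] += ts - start
    return dict(blocked)

# Hypothetical trace: threads 1 and 2 alternating on one CPU.
events = [
    (100, 1, 2),   # thread 1 blocks, thread 2 runs
    (250, 2, 1),   # thread 1 is back after 150 ns off-CPU
    (400, 1, 2),   # thread 2 is back after 150 ns off-CPU
    (900, 2, 1),   # thread 1 is back after another 500 ns
]

totals = off_cpu_time(events)
print(totals)
```

The interval ends when the thread is switched back in, not when it is woken, so the total includes time spent runnable but waiting in the scheduler queue.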

Memory Flame Graphs

Memory flame graphs show which call stacks are responsible for heap allocations.

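# heaptrack: track all allocations (attach to a running process)
heaptrack -p <pid>
# -F writes flame-graph-compatible stacks to a file; use the data file name heaptrack prints on exit
heaptrack_print heaptrack.<pid>.gz -F heaptrack.stacks
flamegraph.pl --colors=mem --countname=allocations < heaptrack.stacks > memory.svg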

# valgrind massif
valgrind --tool=massif --pages-as-heap=yes ./program
ms_print massif.out.<pid> > massif_report.txt

# perf mem (sample memory access locations with PEBS)
perf mem record -p <pid> -- sleep 10
perf mem report --sort=mem,sym

Memory flame graphs reveal: which code paths allocate the most (total bytes), allocation hotspots (high allocation rate causing GC pressure), memory leaks (growing allocations not freed).
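The leak-detection view reduces to tracking which allocations are still live and summing their sizes per allocating stack. The Python sketch below works on a made-up event stream; the event format and stack names are assumptions for illustration, not the output format of heaptrack or any other tool.

```python
from collections import defaultdict

def outstanding_by_stack(events):
    """Sum bytes still allocated, grouped by the allocating call stack.

    events: ('alloc', ptr, size, stack) or ('free', ptr) tuples in order.
    """
    live = {}  # ptr -> (size, stack) for allocations not yet freed
    for event in events:
        if event[0] == "alloc":
            _, ptr, size, stack = event
            live[ptr] = (size, stack)
        else:
            live.pop(event[1], None)  # ignore frees of unknown pointers
    totals = defaultdict(int)
    for size, stack in live.values():
        totals[stack] += size
    return dict(totals)

# Hypothetical trace: the parse buffer is freed, the cache entries are not.
events = [
    ("alloc", 0x10, 64, "main;handle;parse"),
    ("alloc", 0x20, 4096, "main;handle;cache_put"),
    ("free", 0x10),
    ("alloc", 0x30, 4096, "main;handle;cache_put"),
]

leaks = outstanding_by_stack(events)
folded = [f"{stack} {size}" for stack, size in sorted(leaks.items())]
print(folded)
```

Writing the result as stack-and-bytes lines gives the same folded format used for CPU profiles, so the same renderer can draw it with bytes as the count unit.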

Historical Context

gprof (1982, from Berkeley; GNU gprof is a later reimplementation) was the first widely used call-graph profiler on Unix. It relied on compile-time instrumentation, which perturbs the program being measured and distorts the results. oprofile (2002) introduced PMU-based profiling on Linux. perf (2009, merged into Linux 2.6.31) replaced oprofile with a more complete and actively developed tool.

Brendan Gregg's flame graph was born from a MySQL performance problem he was investigating at Joyent in 2011. He was staring at thousands of lines of text-based stack sample output from DTrace and found it impossible to see the hot path. The first flame graph was drawn, the problem was immediately visible, and the visualization was later written up in a 2016 ACM Queue article.

The introduction of eBPF (2014) and BCC (2015) enabled continuous production profiling without recompilation—profiling tools became safe to run against production servers with < 2% overhead. This catalyzed the continuous profiling industry (Polar Signals, Pyroscope, Parca, Datadog Continuous Profiler).

Production Examples

Case: MySQL bottleneck found in minutes with flame graph. Brendan Gregg's original use case: a MySQL server was slow. A sampled CPU profile rendered as a flame graph showed 40% of CPU in String_Copy() inside the join optimization code. A quick code change reduced this to < 5%. Total time from problem report to fix: 4 hours, most of which was waiting for the flame graph to generate. Without flame graphs, the same investigation would have taken days of reading text profiler output.

Case: Node.js memory leak via allocation flame graph. A Node.js service at Uber slowly grew its heap from 100 MB to 2 GB over 24 hours. Using heaptrack against a development replica, the memory flame graph showed 80% of allocations from a specific route handler that was capturing a closure with a reference to the full request object. The closure outlived the request, preventing GC. Fix: extract only needed fields from the request.

Debugging Notes

perf report in text mode: use perf report --stdio --no-children to sort by self time (no inclusive, children-accumulated percentages). This is the quickest way to find leaf hotspots.
If symbol names appear as hex addresses: install debug symbol packages (dnf debuginfo-install <pkg> on RHEL/Fedora, apt install <pkg>-dbgsym on Debian/Ubuntu; older releases use <pkg>-dbg).
JVM profiling: use async-profiler instead of perf for Java. It handles Java frames correctly, resolves JIT symbols, and supports both CPU and allocation profiling.
perf record creates perf.data which can be copied to a developer machine for analysis offline. Use perf archive to bundle debug symbols.
Flamescope (Netflix): visualizes a profile as a subsecond-offset heat map; selecting a time range generates a flame graph for just that range, which helps isolate intermittent hot paths.

Security Implications

Flame graphs expose function names, file paths, and call chains. In production, they may reveal security-sensitive logic (crypto operations, authentication flows). Treat flame graph SVG files as sensitive and restrict access appropriately.

Resolving kernel symbols via /proc/kallsyms requires root or kernel.kptr_restrict = 0, and unprivileged kernel profiling requires perf_event_paranoid ≤ 1. In production, keep both restrictive:

sysctl -w kernel.perf_event_paranoid=2  # Restrict perf events
sysctl -w kernel.kptr_restrict=2        # Hide kernel pointers

eBPF-based profilers (parca, pyroscope) need CAP_BPF and CAP_PERFMON (or CAP_SYS_ADMIN on kernels before 5.8); audit who can deploy them in multi-tenant environments.

Performance Implications

Sample-based profiling at 99 Hz adds 1–3% CPU overhead. The PMU interrupt is handled in kernel context and requires saving/restoring registers + unwinding the stack. At 999 Hz, overhead is 5–10%. For production continuous profiling, 19 Hz is common (< 0.5% overhead).

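Stack unwinding overhead: frame-pointer unwinding is ~1 µs per sample; DWARF unwinding is ~10 µs per sample. For a 10-CPU system at 99 Hz, frame-pointer profiling performs ~1,000 unwinds per second (10 × 99), or ~1 ms of CPU time per second spread across 10 CPUs: roughly 0.01% of capacity. Unwinding is therefore only a small part of the total; the rest of the overhead quoted above comes from interrupt handling, copying sample data, and writing perf.data.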
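Estimates like this are easy to sanity-check with a few lines of arithmetic. The Python sketch below takes the per-sample costs quoted above (~1 µs for frame pointers, ~10 µs for DWARF) as given; they are rough figures, and the helper name is hypothetical.

```python
def unwind_overhead(cpus, sample_hz, unwind_us):
    """Return (unwinds per second, CPU-µs spent per second, fraction of capacity)."""
    unwinds_per_s = cpus * sample_hz           # every CPU is sampled independently
    busy_us_per_s = unwinds_per_s * unwind_us  # total CPU time spent unwinding
    capacity_us_per_s = cpus * 1_000_000       # CPU time available per second
    return unwinds_per_s, busy_us_per_s, busy_us_per_s / capacity_us_per_s

fp = unwind_overhead(cpus=10, sample_hz=99, unwind_us=1.0)
dwarf = unwind_overhead(cpus=10, sample_hz=99, unwind_us=10.0)

print(f"frame pointer: {fp[0]} unwinds/s, {fp[2]:.4%} of capacity")
print(f"DWARF:         {dwarf[0]} unwinds/s, {dwarf[2]:.4%} of capacity")
```

The fraction does not depend on the CPU count, since both the number of samples and the available capacity scale with it. Only the sample rate and the per-sample cost matter.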

Failure Modes and Real Incidents

Broken stacks / missing frames. The most common complaint: flame graphs showing [unknown] frames or stacks that terminate early. Causes:

1. Missing frame pointers: recompile with -fno-omit-frame-pointer, or use DWARF unwinding.
2. Stack grew beyond perf's max stack depth: sysctl -w kernel.perf_event_max_stack=256.
3. JIT code without symbol maps: configure the runtime to emit perf maps.

Profiler heisenbug. A Python service showed a "performance problem" only when profiled. The profiler's signal delivery (SIGPROF) was interfering with a signal-sensitive critical section, causing the profiler to create the problem it was measuring. Fix: switched to PMU-based profiling, which doesn't use SIGPROF.

Modern Usage

Continuous profiling is now standard at hyperscale: Google runs fleet-wide profiling internally (Google-Wide Profiling), and Datadog Continuous Profiler, Grafana Pyroscope, and Polar Signals Parca offer always-on CPU profiling, in several cases via eBPF. Overhead is typically < 1% at 19 Hz sampling.

async-profiler (Java) uses AsyncGetCallTrace API to sample Java stacks safely from signal handlers, bypassing the safepoint bias problem that afflicts JVM-TI profilers. It is the correct tool for any Java CPU profiling.

Differential flame graphs compare two profiles (before/after a change) with red indicating functions that got more time and blue indicating less. Essential for confirming that an optimization actually worked.
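The comparison underneath a differential flame graph is a per-stack subtraction of two folded profiles. The Python sketch below shows that step on made-up data; it is an illustration only, not the implementation of difffolded.pl from the FlameGraph repository, which fills this role in practice.

```python
def parse_folded(lines):
    """Turn 'a;b;c count' lines into a {stack: count} dictionary."""
    profile = {}
    for line in lines:
        stack, count = line.rsplit(" ", 1)
        profile[stack] = profile.get(stack, 0) + int(count)
    return profile

def diff_folded(before_lines, after_lines):
    """Return (stack, before, after, delta) rows for every stack seen."""
    before, after = parse_folded(before_lines), parse_folded(after_lines)
    rows = []
    for stack in sorted(set(before) | set(after)):
        b, a = before.get(stack, 0), after.get(stack, 0)
        rows.append((stack, b, a, a - b))  # delta > 0: more samples after
    return rows

# Hypothetical profiles before and after replacing a linear scan.
before = ["main;search;linear_scan 80", "main;render 20"]
after = ["main;search;binary_search 10", "main;render 20"]

rows = diff_folded(before, after)
for stack, b, a, delta in rows:
    print(f"{delta:+4d}  {stack}")
```

A renderer would color positive deltas red and negative deltas blue. If the two profiles have different total sample counts, the counts should be normalized before comparing them, otherwise a longer run looks like a regression everywhere.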

Future Directions

Continuous profiling with CI integration: automatically compare flame graphs for each commit, flagging regressions.
AI annotation of flame graphs: LLMs identifying optimization opportunities from flame graph structure.
eBPF-based allocation profiling: tracing malloc/free with stack unwinding in eBPF, approaching zero overhead for production memory profiling.
Distributed flame graphs: showing call stacks that span microservices via distributed tracing + profiling correlation (Grafana Tempo + Pyroscope integration).

Exercises

Profile a compute-intensive program (./stress --cpu 1) with perf record -F 99 -g for 30 seconds. Generate a flame graph. Identify the top-3 widest frames and explain what they represent.
Profile a program with frequent blocking I/O using offcputime (BCC). Generate an off-CPU flame graph. Compare the off-CPU hot paths with the on-CPU hot paths. What does the difference reveal about the program's bottleneck?
Introduce a deliberate hot path in a C program (a tight loop calling a hash function repeatedly). Profile with and without -fno-omit-frame-pointer. Compare the flame graph quality—specifically, look for broken stacks.
Profile a Java application using async-profiler (./profiler.sh -d 30 -f profile.html <pid>). Open the HTML flame graph and identify the most CPU-intensive method. Explain how the JIT heuristic might affect profiling accuracy.
Generate a differential flame graph: profile a program before and after changing an algorithm (e.g., linear search to binary search). Fold both profiles, then run difffolded.pl before.folded after.folded | flamegraph.pl > diff.svg. Explain what the red and blue regions mean.

References

Gregg, B. "The Flame Graph." ACM Queue, 2016. https://queue.acm.org/detail.cfm?id=2927301
Gregg, B. Systems Performance (2nd ed., 2020). Chapter 2: Methodologies. Chapter 5: Applications.
FlameGraph repository: https://github.com/brendangregg/FlameGraph
Gregg, B. BPF Performance Tools. Pearson, 2019.
async-profiler: https://github.com/async-profiler/async-profiler
Inferno (Rust flamegraph): https://github.com/jonhoo/inferno
Polar Signals Parca: https://www.parca.dev/
Grafana Pyroscope: https://grafana.com/oss/pyroscope/
DWARF standard: https://dwarfstd.org/
ORC unwinder: https://www.kernel.org/doc/html/latest/x86/orc-unwinder.html