Performance Profiling Tools
Overview
Performance profiling is the systematic process of measuring where a program spends its time and resources. Without profiling, optimization is guesswork — "fast enough" is not a measurement, and the 80/20 rule means that most programs spend 80% of their time in 20% of their code. The wrong optimizations waste engineering time and can introduce bugs. The right profiling tools find the actual bottleneck, which is often surprising.
This document covers the taxonomy of profiling tools from Linux system-level (perf, eBPF) to language-specific (async-profiler for JVM), with emphasis on flame graphs as the universal visualization. It also covers advanced techniques: off-CPU profiling, differential flame graphs, and hardware counter analysis for microarchitecture bottlenecks.
Prerequisites
- Linux kernel fundamentals: scheduler, system calls, memory subsystem
- Basic familiarity with x86-64 architecture: CPU pipeline, cache hierarchy
- Understanding of stack frames and call graphs
- Working knowledge of at least one compiled language (C/C++/Go/Java/Rust)
- Kernel symbol access (/proc/kallsyms, debug symbols)
Historical Context
Before modern profilers, engineers used gprof (GNU profiler, 1988), which required recompiling with the -pg flag and added significant overhead. gprof combined statistical sampling of the program counter with compiler-inserted call counting, so it could not profile an unmodified binary: a fundamental compromise.
The modern era of Linux profiling began with the perf subsystem, introduced in Linux 2.6.31 (2009) by Ingo Molnar and others. perf unified hardware performance counter access, software event tracing, and statistical sampling into a single kernel subsystem. It replaced OProfile and dozens of ad-hoc tools.
Flame graphs were invented by Brendan Gregg in 2011, while he was working at Joyent (he joined Netflix later). He was investigating a production MySQL CPU performance issue and needed a way to visualize thousands of stack traces as a single comprehensible image. The result became one of the most widely used performance visualizations.
eBPF entered the profiling space around 2015-2016 when BCC (BPF Compiler Collection) tools matured, enabling in-kernel aggregation of profiling data without the overhead of copying all data to userspace.
async-profiler (2017) solved the JVM profiling problem: conventional JVMTI-based samplers suffer from safepoint bias (they can only capture stacks at JVM safepoints, so hot code between safepoints is misattributed or missed). async-profiler combines HotSpot's AsyncGetCallTrace API with Linux perf events to get unbiased JVM profiles.
Profiling Tool Taxonomy
Profiling Dimensions:
+-----------------+-----------------------------------------------+
| Dimension       | Tools                                         |
+-----------------+-----------------------------------------------+
| CPU on-CPU      | perf, async-profiler, pprof (Go), VTune       |
| CPU off-CPU     | perf sched, offcputime (eBPF), async-profiler |
| Memory alloc    | heaptrack, Valgrind massif, jemalloc prof     |
| Memory leaks    | Valgrind memcheck, AddressSanitizer           |
| I/O latency     | biolatency (eBPF), iostat, blktrace           |
| Network I/O     | tcpdump, Wireshark, nethogs, eBPF             |
| Lock contention | perf lock, mutrace, async-profiler            |
| System calls    | perf trace, strace (high overhead)            |
| Hardware cache  | perf stat + PMU events, VTune topdown         |
+-----------------+-----------------------------------------------+
Collection method:
Sampling (statistical):
- Take a stack snapshot at a fixed rate (e.g., 99 Hz, roughly every 10 ms)
- Low overhead (1-5% CPU), statistical approximation
- Cannot find every function call, only hot paths
Tracing (deterministic):
- Record EVERY function entry/exit or event
- Exact counts, high overhead for frequent events
- Best for relatively infrequent events (e.g., specific syscalls, page faults, scheduler switches); tracing very frequent events distorts the workload
Counting (hardware counters):
- Hardware PMU increments counters on specific microarchitecture events
- Near-zero overhead in counting mode; in sampling mode a counter overflow raises an interrupt (often an NMI) to capture a sample
- Reports aggregate counts per period
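The odd-looking 99 Hz rate is chosen so the sampler does not run in lockstep with periodic activity. The sketch below illustrates the effect under assumed conditions: a hypothetical task that wakes every 10 ms and runs for the first 2 ms of each period (true CPU share 20%). The function name and workload are illustrative, not part of any tool.

```python
from fractions import Fraction

def sampled_busy_share(sample_hz, seconds=10):
    """Estimate the CPU share of a periodic task by sampling at sample_hz.

    Hypothetical workload: the task wakes every 10 ms (a 100 Hz timer) and is
    on-CPU for the first 2 ms of each period, so its true share is 20%.
    """
    period = Fraction(1, 100)   # task period: 10 ms
    busy = Fraction(2, 1000)    # on-CPU for the first 2 ms of each period
    total = sample_hz * seconds
    hits = 0
    for k in range(total):
        t = Fraction(k, sample_hz)   # exact time of the k-th sample
        if t % period < busy:        # did the sample land in the busy window?
            hits += 1
    return hits / total

lockstep = sampled_busy_share(100)  # sampler period equals task period
offset = sampled_busy_share(99)     # sampler drifts across the task period
print(f"100 Hz estimate: {lockstep:.1%}, 99 Hz estimate: {offset:.1%}")
```

At 100 Hz every sample lands at the same point in the task's period, so the estimate is badly wrong; at 99 Hz the sample points drift through the period and the estimate approaches the true 20%.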
The perf Tool
perf is the Swiss Army knife of Linux performance analysis. It accesses the kernel's perf_event subsystem.
perf stat: Counter Summary
# Basic hardware counter summary for a command
perf stat ./my-program
# Output:
Performance counter stats for './my-program':
1,234.56 msec task-clock # 0.999 CPUs utilized
5 context-switches # 4.047 /sec
0 cpu-migrations # 0.000 /sec
247 page-faults # 200.079 /sec
3,456,789,012 cycles # 2.800 GHz
2,100,000,000 instructions # 0.61 insn per cycle ← IPC
450,000,000 branches # 364.465 M/sec
22,500,000 branch-misses # 5.00% of all branches
120,000,000 cache-references # 97.195 M/sec
18,000,000 cache-misses # 15.00% of cache refs ← HIGH
# Key metrics:
# IPC (instructions per cycle):
# >3: excellent (out-of-order execution working well)
# 1-3: normal
# <1: likely memory-bound or branch misprediction heavy
#
# Cache miss rate:
# <1%: cache-friendly code
# 5-15%: significant cache pressure
# >15%: memory bandwidth bottleneck
#
# Branch miss rate:
# <1%: predictor works well
# >5%: consider restructuring branches or using branchless code
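The derived figures in the right-hand column are plain ratios of the raw counters. As a minimal sketch, the following recomputes them from the sample output above; the dictionary keys mirror the event names, and the helper name is illustrative.

```python
def derived_metrics(counters):
    """Recompute the ratios perf stat prints next to the raw counts."""
    return {
        "ipc": counters["instructions"] / counters["cycles"],
        "branch_miss_rate": counters["branch-misses"] / counters["branches"],
        "cache_miss_rate": counters["cache-misses"] / counters["cache-references"],
    }

# Raw counts copied from the sample perf stat output above
sample = {
    "cycles": 3_456_789_012,
    "instructions": 2_100_000_000,
    "branches": 450_000_000,
    "branch-misses": 22_500_000,
    "cache-references": 120_000_000,
    "cache-misses": 18_000_000,
}

metrics = derived_metrics(sample)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here IPC comes out near 0.61, with a 5% branch miss rate and a 15% cache miss rate, matching the annotated output.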
perf record and report: CPU Profiling
# Record CPU samples with call graphs, 99Hz sampling, for 30 seconds
perf record -g -F 99 -p <PID> -- sleep 30
# Or record for a specific command:
perf record -g -F 99 -- ./my-program
# This produces perf.data in the current directory
# Interactive TUI report:
perf report
# Output (simplified):
# Overhead Command Shared Object Symbol
# 35.23% myapp myapp compute_hash
# 22.17% myapp libc.so malloc
# 18.54% myapp myapp parse_request
# 8.33% myapp [kernel] copy_to_user
# ...
# Report as flat text:
perf report --stdio
# Show call graph (callers):
perf report -g caller
perf top: Live CPU View
# Live top-like view of hot functions (refreshes every 2s)
perf top -g
# Filter to specific process:
perf top -p <PID>
# Kernel symbols are shown by default when permitted (root or /proc/sys/kernel/perf_event_paranoid <= 1);
# -U hides user-space symbols so only kernel symbols remain:
perf top -g -U
perf trace: System Call Tracing
# Trace all syscalls for a process (lower overhead than strace)
perf trace -p <PID>
# Trace specific syscalls only:
perf trace -e read,write,epoll_wait -p <PID>
# Summary mode (like strace -c):
perf trace --summary -p <PID>
# Output:
# syscall calls total min avg max
# epoll_wait 1234 35.023 ms 0.020 ms 0.028 ms 1.245 ms
# read 5678 12.456 ms 0.001 ms 0.002 ms 0.456 ms
PMU Event Counting
# List available hardware events
perf list hardware
# Common useful hardware events:
perf stat -e \
cache-references,\
cache-misses,\
L1-dcache-load-misses,\
L1-dcache-loads,\
LLC-load-misses,\
LLC-loads,\
branch-instructions,\
branch-misses,\
instructions,\
cycles \
-- ./my-program
# L1 miss rate: L1-dcache-load-misses / L1-dcache-loads
# LLC (L3) miss rate: LLC-load-misses / LLC-loads → measures DRAM pressure
# CPU-specific events (Intel Sandy Bridge and later):
perf stat -e \
cpu/event=0xD1,umask=0x20,name=MEM_LOAD_UOPS_RETIRED.LLC_MISS/ \
-- ./my-program
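The miss-rate formulas above are easy to script once the counters are in machine-readable form. The sketch below assumes the CSV layout produced by `perf stat -x,` (count in the first field, event name in the third); the sample text is hypothetical data, and the field layout should be checked against your perf version.

```python
import csv
import io

def parse_perf_csv(text):
    """Map event name -> count from `perf stat -x,` style CSV text.

    Assumed layout per row: value,unit,event,... (later fields are ignored).
    """
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or row[0].startswith("#"):
            continue
        try:
            counts[row[2]] = int(row[0])
        except ValueError:
            continue  # skip rows such as "<not supported>" or "<not counted>"
    return counts

def miss_rate(counts, misses, accesses):
    """Return misses/accesses, or None if either counter is unavailable."""
    if misses not in counts or not counts.get(accesses):
        return None
    return counts[misses] / counts[accesses]

# Hypothetical sample output, for illustration only
SAMPLE = """\
40000000,,L1-dcache-load-misses,1234560000,100.00,,
800000000,,L1-dcache-loads,1234560000,100.00,,
6000000,,LLC-load-misses,1234560000,100.00,,
20000000,,LLC-loads,1234560000,100.00,,
<not supported>,,some-unsupported-event,0,100.00,,
"""

counts = parse_perf_csv(SAMPLE)
l1_rate = miss_rate(counts, "L1-dcache-load-misses", "L1-dcache-loads")
llc_rate = miss_rate(counts, "LLC-load-misses", "LLC-loads")
print(f"L1 miss rate: {l1_rate:.1%}, LLC miss rate: {llc_rate:.1%}")
```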
perf sched: Scheduler Analysis
# Record scheduler events
perf sched record -- sleep 10
# Show per-task latency statistics
perf sched latency
# Output (simplified, illustrative values):
#  Task           | Runtime ms | Switches | Avg delay ms | Max delay ms
#  my-server:1234 |   5123.000 |    24567 |        0.045 |        0.123
# Replay schedule events (for debugging RT behavior):
perf sched replay
Flame Graph Generation
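Flame graphs are the most widely used way to visualize profiling data collected via sampling. The X-axis does not represent the passage of time: stacks are merged and sorted alphabetically, and each frame's width is proportional to the number of samples in which it appears (its CPU share). The Y-axis is call stack depth. In regular flame graphs the color carries no meaning (hues are picked at random from a warm palette to tell adjacent frames apart). The key skill is reading the WIDTH of each frame.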
Reading a Flame Graph:
┌─────────────────────────────────────────────────────────────────────┐
│ main (100%) │
├──────────────────────────────────────┬──────────────────────────────┤
│ handle_request (58%) │ background_work (42%) │
├──────────────┬───────────────────────┤──────────┬───────────────────┤
│ parse (15%) │ process (43%) │ gc (12%) │ serialize (30%) │
├──────────────┼────────┬──────────────┼──────────┼───────────────────┤
│ json (15%) │db (20%)│ compute (23%)│ │ compress (30%) │
│ │ ├──────────────┤ ├───────────────── │
│ │ │ hash (15%) │ │ zlib (30%) │
└──────────────┴────────┴──────────────┴──────────┴───────────────────┘
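Wide leaf frames (the lowest frame in each column of this root-at-top sketch; the top edge in flamegraph.pl's default root-at-bottom layout): code that is on-CPU most often, so prioritize these for optimization.
Narrow frames: few samples and therefore little CPU time, regardless of how often the function is called; usually safe to ignore for CPU optimization.
Frames with no callees (plateaus): leaf functions doing the actual work.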
Insight from above:
- compress/zlib takes 30% — is compression necessary? Can it be async?
- hash takes 15% — is this a crypto hash? Can it be replaced with xxHash?
- db takes 20% — are these N+1 queries? Connection pool exhausted?
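Frame widths can be checked numerically. The sketch below assumes the "folded" text format used by the FlameGraph scripts (semicolon-separated frames, root first, followed by a sample count) and encodes the diagram above as hypothetical sample counts. It reports each function's inclusive share (its width) and self share (samples where it is the leaf).

```python
from collections import Counter

# The diagram above as folded stacks; the counts are illustrative samples
FOLDED = """\
main;handle_request;parse;json 15
main;handle_request;process;db 20
main;handle_request;process;compute 8
main;handle_request;process;compute;hash 15
main;background_work;gc 12
main;background_work;serialize;compress;zlib 30
"""

def frame_shares(folded_text):
    """Return (inclusive, self) percentage dicts keyed by function name."""
    inclusive, self_samples, total = Counter(), Counter(), 0
    for line in folded_text.splitlines():
        if not line.strip():
            continue
        stack, count = line.rsplit(" ", 1)
        n = int(count)
        frames = stack.split(";")
        total += n
        for frame in set(frames):       # count a recursive frame once per stack
            inclusive[frame] += n
        self_samples[frames[-1]] += n   # the leaf was actually on-CPU
    to_pct = lambda c: {name: 100 * v / total for name, v in c.items()}
    return to_pct(inclusive), to_pct(self_samples)

inclusive, self_pct = frame_shares(FOLDED)
for name in ("handle_request", "process", "compute", "hash"):
    print(f"{name}: width {inclusive[name]:.0f}%, self {self_pct.get(name, 0):.0f}%")
```

Note that this keys on function name only, so a function reached through several call paths is merged; a flame graph keeps those paths separate.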
Generating Flame Graphs from perf
# 1. Record with call graphs (frame pointers must be enabled)
perf record -g -F 99 -p <PID> -- sleep 60
# If frame pointers are missing (compiled without -fno-omit-frame-pointer):
# Use DWARF-based unwinding (slower but doesn't require recompilation):
perf record --call-graph dwarf -F 99 -p <PID> -- sleep 60
# 2. Export to text format
perf script > perf.out
# 3. Stack collapse (from Brendan Gregg's FlameGraph repo)
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./stackcollapse-perf.pl < /path/to/perf.out > collapsed.txt
# 4. Generate SVG
./flamegraph.pl collapsed.txt > flamegraph.svg
# Open in browser: file:///path/to/flamegraph.svg
# Interactive: click to zoom, Ctrl+F to search
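# Search is built into the SVG: Ctrl+F (or the Search button) highlights matching frames in magenta.
# To render only the stacks that contain a function, filter the folded file first:
grep malloc collapsed.txt | ./flamegraph.pl > malloc.svg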
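The stack-collapse step is conceptually simple: identical stacks are merged and counted. The following is a minimal sketch of that idea, not a replacement for stackcollapse-perf.pl (which also parses perf script output and handles process names). The input stacks are hypothetical and listed leaf first, the order in which perf script prints frames.

```python
from collections import Counter

def fold_stacks(samples):
    """Merge identical stacks into 'root;...;leaf count' lines, sorted by stack."""
    counts = Counter(";".join(reversed(stack)) for stack in samples)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Hypothetical sampled stacks, leaf frame first
samples = [
    ["compute_hash", "handle_request", "main"],
    ["malloc", "parse_request", "handle_request", "main"],
    ["compute_hash", "handle_request", "main"],
]

folded = fold_stacks(samples)
for line in folded:
    print(line)
```

Sorting the folded lines is what places frames alphabetically along the X-axis and lets adjacent identical frames merge into one wide box.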
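Ensuring frame pointers: On x86-64, compilers omit the frame pointer by default at -O2 (equivalent to -fomit-frame-pointer), and many distributions ship binaries built this way for a slight speed improvement. Without frame pointers, the default unwinder used by perf record -g produces truncated or [unknown] stacks. Solutions: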
- Recompile with -fno-omit-frame-pointer (add to CFLAGS)
- Use --call-graph dwarf (slower collection, larger perf.data)
- Use --call-graph lbr (Last Branch Record: Intel CPU hardware, fast, but stack depth is limited by the LBR size, typically 16 or 32 entries)
Off-CPU Flame Graphs
Standard CPU profiling only captures where threads are running on CPU (on-CPU time). Threads blocked waiting for I/O, locks, or sleep are invisible. Off-CPU profiling captures this waiting time.
Off-CPU time = time spent NOT on CPU:
- Blocking on disk I/O (read/write)
- Waiting for network data (recv)
- Waiting for a mutex (futex)
- Sleeping (nanosleep, poll timeout)
- Waiting for page fault (page in from disk)
Tool 1: perf with sched:sched_switch
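# Note: folding sched_switch samples directly sizes frames by the NUMBER of
# context switches, not by time blocked; durations need timestamp post-processing.
perf record -e sched:sched_switch -a -g -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl \
--color=io --title="Off-CPU Flame Graph" > offcpu.svg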
Tool 2: eBPF offcputime (more accurate, less overhead)
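# -f emits folded stacks with blocked time in microseconds; -d inserts a delimiter between kernel and user stacks
/usr/share/bcc/tools/offcputime -df -p <PID> 30 > offcpu.folded
./flamegraph.pl --color=io --countname=us \
--title="Off-CPU Flame Graph" < offcpu.folded > offcpu.svg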
Reading off-CPU graphs:
- X-axis represents total TIME BLOCKED (not CPU cycles)
- Wide frames = long blocking = latency source
- Look for: futex_wait (lock contention), sys_read/sys_write (I/O),
poll (waiting for network), do_page_fault (memory pressure)
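The difference between counting context switches and measuring blocked time can be made concrete. The sketch below works on a hypothetical, simplified event stream of (timestamp in microseconds, thread id, "out" or "in") tuples; real sched_switch records carry more fields and need to be paired per CPU and per thread.

```python
from collections import defaultdict

def off_cpu_time(events):
    """Sum microseconds each thread spent between switch-out and switch-in."""
    switched_out_at = {}
    blocked_us = defaultdict(int)
    for ts, tid, kind in sorted(events):
        if kind == "out":
            switched_out_at[tid] = ts
        elif kind == "in" and tid in switched_out_at:
            blocked_us[tid] += ts - switched_out_at.pop(tid)
    return dict(blocked_us)

# Hypothetical trace: thread 101 blocks briefly three times,
# thread 202 blocks once for 50 ms
events = [
    (0, 101, "out"), (100, 101, "in"),
    (200, 101, "out"), (300, 101, "in"),
    (400, 101, "out"), (500, 101, "in"),
    (1000, 202, "out"), (51000, 202, "in"),
]

blocked = off_cpu_time(events)
switches = {tid: sum(1 for _, t, k in events if t == tid and k == "out")
            for tid in blocked}
print("blocked us:", blocked)
print("switch-outs:", switches)
```

Thread 101 has three times as many switches but thread 202 accounts for almost all of the blocked time, which is why an off-CPU graph should be weighted by duration.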
Differential Flame Graphs
Differential flame graphs compare two profiles (before vs after a change) to highlight regressions and improvements.
Use case: A deploy caused p99 latency regression.
Capture: perf record baseline (before deploy), then regression (after).
Generation:
# Fold each profile:
./stackcollapse-perf.pl baseline_perf.out > baseline.txt
./stackcollapse-perf.pl regression_perf.out > regression.txt
# Generate diff; -n normalizes the baseline sample counts to the regression total
# (critical for a fair comparison when the two captures differ in length or load):
./difffolded.pl -n baseline.txt regression.txt > diff.txt
# Generate differential flame graph (red = growth, blue = reduction):
./flamegraph.pl diff.txt > diff_flamegraph.svg
Color coding:
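- RED frames: more samples in regression than in baseline → regression introduced here (this includes code that is new in the regression)
- BLUE frames: fewer samples in regression → improvement (or code moved)
- Frame widths come from the regression profile, so code that disappeared entirely is not drawn; to see it, pass the profiles to difffolded.pl in reverse order and add --negate to flamegraph.pl so the colors keep their meaning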
Common findings:
- Red malloc/free → new memory allocation hot path introduced
- Red kernel path → lock contention or syscall regression
- Blue path → code was optimized, or its work moved elsewhere
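Normalization matters because two captures rarely contain the same number of samples. The following is a simplified sketch of the comparison a differential flame graph is built on, not the difffolded.pl implementation itself. The profiles are hypothetical folded stacks, and the baseline is scaled to the regression's total before subtracting.

```python
def parse_folded(text):
    """Parse 'stack count' lines into a dict of stack -> samples."""
    counts = {}
    for line in text.splitlines():
        if line.strip():
            stack, n = line.rsplit(" ", 1)
            counts[stack] = counts.get(stack, 0) + int(n)
    return counts

def diff_folded(before, after, normalize=True):
    """Return (stack, scaled_before, after, delta) rows, sorted by stack."""
    total_before, total_after = sum(before.values()), sum(after.values())
    scale = total_after / total_before if normalize and total_before else 1.0
    rows = []
    for stack in sorted(set(before) | set(after)):
        b = round(before.get(stack, 0) * scale)
        a = after.get(stack, 0)
        rows.append((stack, b, a, a - b))
    return rows

# Hypothetical profiles: the second capture ran longer and adds a malloc path
baseline = parse_folded("main;parse 50\nmain;compute 50\n")
regression = parse_folded("main;parse 100\nmain;compute 100\nmain;parse;malloc 50\n")

raw = diff_folded(baseline, regression, normalize=False)
normalized = diff_folded(baseline, regression, normalize=True)
for stack, b, a, delta in normalized:
    print(f"{stack}: {b} -> {a} ({delta:+d})")
```

Without normalization every stack appears to have grown simply because the second capture collected more samples; after normalization only the new malloc path stands out as growth.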
Intel VTune: Topdown Microarchitecture Analysis
VTune implements the Intel Topdown Microarchitecture Analysis (TMA) methodology, which categorizes CPU cycles into four buckets:
Topdown Analysis Tree:
100% of cycles
├── Frontend Bound (X%)
│ CPU cannot deliver instructions fast enough
│ Causes: instruction cache misses, branch misprediction stalls,
│ fetch bandwidth, iTLB misses
│ Fix: reduce code size, improve branch prediction, PGO
│
├── Backend Bound (Y%)
│ CPU has instructions but execution units are stalled
│ ├── Memory Bound: stalled waiting for cache/memory
│ │ Fix: improve data locality, reduce working set, prefetch
│ └── Core Bound: stalled on execution unit (ALU/FPU)
│ Fix: vectorize, reduce dependency chains, use SIMD
│
├── Bad Speculation (Z%)
│ Branch mispredictions causing pipeline flush and replay
│ Fix: reduce unpredictable branches, branchless algorithms
│
└── Retiring (W%)
Useful work — the only "good" category
Goal: maximize Retiring percentage
Rule of thumb:
- Memory Bound > 20%: optimize data structures (SoA vs AoS, cache lines)
- Bad Speculation > 10%: profile branches, use profile-guided optimization
- Frontend Bound > 20%: check i-cache, instruction bloat from templates/macros
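The rule of thumb above can be applied mechanically to a set of level-1 and level-2 percentages. This sketch only encodes the thresholds listed here; the function and argument names are illustrative, the input values are hypothetical, and real tools report the percentages for you.

```python
def topdown_hints(frontend, memory_bound, core_bound, bad_speculation, retiring):
    """Apply the rule-of-thumb thresholds to topdown percentages (0-100)."""
    total = frontend + memory_bound + core_bound + bad_speculation + retiring
    if abs(total - 100.0) > 1.0:
        raise ValueError(f"categories should sum to ~100%, got {total}")
    hints = []
    if memory_bound > 20:
        hints.append("Memory Bound: optimize data structures (SoA vs AoS, cache lines)")
    if bad_speculation > 10:
        hints.append("Bad Speculation: profile branches, consider PGO")
    if frontend > 20:
        hints.append("Frontend Bound: check i-cache and instruction bloat")
    if not hints:
        hints.append("No threshold exceeded; look at Core Bound or algorithmic work")
    return hints

# Hypothetical measurement
hints = topdown_hints(frontend=10, memory_bound=35, core_bound=10,
                      bad_speculation=12, retiring=33)
for hint in hints:
    print(hint)
```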
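# VTune CLI (requires Intel VTune installation); the target command follows "--":
vtune -collect hotspots -result-dir vtune_hotspots -- ./my-program
vtune -report hotspots -result-dir vtune_hotspots
# Topdown analysis (use a separate result directory per collection):
vtune -collect uarch-exploration -result-dir vtune_uarch -- ./my-program
vtune -report summary -result-dir vtune_uarch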
async-profiler for JVM
The JVM presents unique profiling challenges:
- Safepoint bias: traditional JVMTI profilers only sample at JVM safepoints, missing hot code between them
- JIT compilation: code changes shape at runtime; deoptimization can appear as hot frames
- Mixed mode: Java frames, native frames, and kernel frames all mixed
async-profiler avoids safepoint bias by using HotSpot's internal AsyncGetCallTrace API (which can capture a stack outside a safepoint) combined with Linux perf events:
# Download async-profiler
curl -L https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz | tar xz
# Profile for 60 seconds, output flame graph
./asprof -d 60 -f flamegraph.html <PID>
# Profile CPU and allocation combined (multiple events require JFR output):
./asprof -d 60 -e cpu,alloc -f profile.jfr <PID>
# Wall-clock profiling (includes off-CPU threads — blocked I/O, locks):
./asprof -d 60 -e wall -f flamegraph.html <PID>
# Lock profiling (find contended monitors):
./asprof -d 60 -e lock -f flamegraph.html <PID>
# Attach to running JVM (no restart needed):
./asprof start -e cpu <PID>
sleep 30
./asprof stop -f flamegraph.html <PID>
Safepoint bias example:
JVM safepoint profiling (biased):
GC frames dominate because GC creates safepoints.
Hot computation loop shows as "safe" — almost invisible.
async-profiler (unbiased):
Actual hot method visible: "HashMap.get" at 34% of cycles
GC shows realistic percentage: 8%
Profiling in Production
Low-overhead continuous profiling:
Tools: Parca, Pyroscope, Polar Signals, Grafana Continuous Profiling
Approach:
- Run profiler as sidecar or DaemonSet on every node
- Sample at 1-100 Hz (typical: 19 Hz, off from common timer frequencies)
- Aggregate samples in eBPF ring buffer (kernel-side)
- Upload symbolized profiles to central store
- Query: "What was the hottest function between 14:00 and 14:05?"
eBPF-based profiling overhead: <1% CPU at 99 Hz
(vs 5-15% for userspace sampling)
Example: Parca agent (eBPF):
kubectl apply -f https://github.com/parca-dev/parca-agent/releases/latest/.../
# DaemonSet deploys to all nodes
# Profiles all processes on host, including Kubernetes pods
# Kernel stacks unified with userspace stacks
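The time-window query mentioned above reduces to filtering timestamped samples and counting. A minimal sketch, assuming samples have already been symbolized into (timestamp, leaf function) pairs; the data and function name are hypothetical, and real profile stores index by labels and full stacks as well.

```python
from collections import Counter
from datetime import datetime

def hottest_function(samples, start, end):
    """Return (function, sample_count) for the most-sampled leaf in [start, end)."""
    counts = Counter(func for ts, func in samples if start <= ts < end)
    return counts.most_common(1)[0] if counts else None

# Hypothetical symbolized samples: (timestamp, leaf function)
samples = [
    (datetime(2024, 1, 1, 13, 59, 58), "parse_request"),
    (datetime(2024, 1, 1, 14, 0, 1), "compute_hash"),
    (datetime(2024, 1, 1, 14, 2, 30), "compute_hash"),
    (datetime(2024, 1, 1, 14, 3, 10), "malloc"),
    (datetime(2024, 1, 1, 14, 5, 0), "parse_request"),
]

window = hottest_function(
    samples, datetime(2024, 1, 1, 14, 0), datetime(2024, 1, 1, 14, 5)
)
print(window)
```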
Debugging Notes
# Verify perf is working:
perf stat ls
# If "Permission denied": /proc/sys/kernel/perf_event_paranoid is too high
echo 1 > /proc/sys/kernel/perf_event_paranoid # requires root, temporary
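# Missing kernel symbols in perf report:
# Check /proc/sys/kernel/kptr_restrict (a non-zero value hides kernel addresses in /proc/kallsyms from unprivileged users)
# For full symbol and source detail, install kernel debug symbols:
# Ubuntu: apt install linux-image-$(uname -r)-dbgsym (requires the ddebs repository); linux-tools-$(uname -r) provides perf itself
# RHEL: yum install kernel-debuginfo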
# Broken stack traces (all show as [unknown]):
# Missing frame pointers — recompile with -fno-omit-frame-pointer
# Or use DWARF: perf record --call-graph dwarf
# JVM: Java frames showing as hex addresses:
# Need perf-map-agent or async-profiler (generates /tmp/perf-<PID>.map)
# async-profiler does this automatically
# Flame graph has too many tiny frames:
# Use --minwidth to omit narrow frames (the value is in pixels; default 0.1)
./flamegraph.pl --minwidth 0.5 collapsed.txt > flamegraph.svg
# perf.data too large:
# Reduce frequency: -F 49 (49 Hz instead of 99)
# Limit time: -- sleep 10 instead of 30
# Limit processes: -p <PID> instead of system-wide
Security Implications
- perf with system-wide profiling can read kernel memory layouts, which can assist kernel exploitation (defeating KASLR). Hence /proc/sys/kernel/perf_event_paranoid defaults to 2 in production systems.
- Flame graphs can inadvertently expose sensitive information: function names may reveal encryption algorithms, data processing logic, or internal API names. Treat flame graphs as confidential in regulated environments.
- Attaching a profiler to a production process may violate change management policies. Establish pre-approved runbooks for profiling in production.
- eBPF-based profilers (Parca, Pyroscope) run with elevated kernel privileges; audit their RBAC permissions carefully in Kubernetes environments.
Performance Implications
- perf record -g -F 99: approximately 1-3% CPU overhead. Safe for brief production profiling (5-10 minutes).
- perf record --call-graph dwarf: 5-15% overhead due to stack copying. Use only in dev/staging.
- strace: 100-1000x overhead per syscall. Never use on production servers; use perf trace instead.
- Continuous eBPF profiling (Parca/Pyroscope) at 19 Hz: <0.5% overhead. Safe for permanent production deployment.
Modern Usage
- Continuous profiling as standard practice: Companies like Google (pprof), Datadog (Continuous Profiler), and Grafana (Pyroscope) have made always-on profiling a standard observability pillar alongside metrics, logs, and traces.
- Profile-guided optimization (PGO): Use flame graphs to identify hot paths, then feed the collected profile data (not the rendered graph) to clang -fprofile-use or Java's GraalVM PGO for 5-20% performance gains.
- eBPF profiling without root: New Linux capabilities (CAP_BPF, CAP_PERFMON) allow non-root profiling on Linux 5.8+, enabling profiling in hardened containers.
Future Directions
- Continuous profiling standardization: OpenTelemetry is adding profiling as a fourth signal alongside metrics/logs/traces (Profiling SIG, 2023-2024).
- Hardware topdown in eBPF: Projects like toplev, combined with eBPF, could bring TMA analysis to every process without VTune.
- ML-assisted bottleneck identification: Tools that automatically correlate profiling data with latency changes and suggest optimization strategies.
Exercises
- Record a CPU flame graph of a known-slow program (e.g., a sorting algorithm). Identify the top three functions by width. Verify your findings match a manual code review of hot paths.
- Use perf stat to compare two implementations of string hashing (e.g., FNV-1a vs SHA-256). Compare IPC, cache miss rate, and branch miss rate. Explain why one is faster in microarchitecture terms.
- Generate an off-CPU flame graph for a program that does file I/O. Identify where it spends most of its time waiting. Compare to the on-CPU flame graph and note which functions appear only off-CPU.
- Install async-profiler. Profile a Java web server under load. Find the top memory allocation hotspot. Suggest how to reduce allocation rate.
- Generate a differential flame graph between two versions of a program (introduce a deliberate regression, e.g., add unnecessary malloc/free in a hot path). Verify the differential graph correctly highlights the regression in red.
References
- Brendan Gregg, "Systems Performance" (2nd ed., 2020) — definitive reference; chapters on CPU and profiling
- Brendan Gregg, "BPF Performance Tools" (2019) — eBPF-based profiling tools
- FlameGraph repository: github.com/brendangregg/FlameGraph (Brendan Gregg, original scripts)
- async-profiler: github.com/async-profiler/async-profiler
- Intel Topdown Microarchitecture Analysis: intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference-type-topics/cpu-metrics-reference.html
- "The Flame Graph" — Brendan Gregg, ACMQ 2016 (original paper)
- Linux perf wiki: perf.wiki.kernel.org
- Parca continuous profiling: parca.dev
- "Stop Safepointing Everything" — JVM profiling talk, JVM Language Summit
- "Linux Profiling at Netflix" — Brendan Gregg, USENIX Lisa 2015