Performance Profiling Tools

Overview

Performance profiling is the systematic process of measuring where a program spends its time and resources. Without profiling, optimization is guesswork: "fast enough" is not a measurement, and the 80/20 rule of thumb holds that programs tend to spend roughly 80% of their time in 20% of their code. Optimizing the wrong code wastes engineering time and can introduce bugs. Profiling tools find the actual bottleneck, which is often surprising.

This document covers the taxonomy of profiling tools from Linux system-level (perf, eBPF) to language-specific (async-profiler for JVM), with emphasis on flame graphs as the universal visualization. It also covers advanced techniques: off-CPU profiling, differential flame graphs, and hardware counter analysis for microarchitecture bottlenecks.

Prerequisites

Linux kernel fundamentals: scheduler, system calls, memory subsystem
Basic familiarity with x86-64 architecture: CPU pipeline, cache hierarchy
Understanding of stack frames and call graphs
Working knowledge of at least one compiled language (C/C++/Go/Java/Rust)
Kernel symbol access (/proc/kallsyms, debug symbols)

Historical Context

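Before modern profilers, engineers used gprof (introduced with BSD Unix in 1982; the GNU implementation followed in 1988), which required recompiling with the -pg flag and added significant overhead. gprof combined statistical sampling of the program counter with compiler-inserted instrumentation for call counts, so it could not profile unmodified binaries.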

The modern era of Linux profiling began with the perf subsystem, introduced in Linux 2.6.31 (2009) by Ingo Molnar and others. perf unified hardware performance counter access, software event tracing, and statistical sampling into a single kernel subsystem. It has largely superseded OProfile and many ad-hoc tools.

Flame graphs were invented by Brendan Gregg in 2011, while he was working at Joyent (he joined Netflix in 2014). He was investigating a production MySQL CPU performance problem and needed a way to visualize thousands of stack traces as a single comprehensible image. The result became one of the most impactful performance visualization tools ever created.

eBPF entered the profiling space around 2015-2016 when BCC (BPF Compiler Collection) tools matured, enabling in-kernel aggregation of profiling data without the overhead of copying all data to userspace.

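async-profiler (2017) addressed the JVM profiling problem: conventional JVMTI-based sampling profilers suffer from safepoint bias, because they can only capture stacks when threads reach a safepoint (a point where the JVM can safely pause a thread, for example for garbage collection), so hot code between safepoints is misattributed or missed. async-profiler combines HotSpot's AsyncGetCallTrace API with Linux perf events to produce profiles without that bias.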

Profiling Tool Taxonomy

  Profiling Dimensions:

  +-----------------+------------------------------------------------+
  | Dimension       | Tools                                          |
  +-----------------+------------------------------------------------+
  | CPU on-CPU      | perf, async-profiler, pprof (Go), VTune        |
  | CPU off-CPU     | perf sched, offcputime (eBPF), async-profiler  |
  | Memory alloc    | heaptrack, Valgrind massif, jemalloc profiling |
  | Memory leaks    | Valgrind memcheck, AddressSanitizer            |
  | I/O latency     | biolatency (eBPF), iostat, blktrace            |
  | Network I/O     | tcpdump, Wireshark, nethogs, eBPF              |
  | Lock contention | perf lock, mutrace, async-profiler             |
  | System calls    | perf trace, strace (high overhead)             |
  | Hardware cache  | perf stat + PMU events, VTune topdown          |
  +-----------------+------------------------------------------------+

  Collection method:

  Sampling (statistical):
    - Take a stack snapshot at a fixed frequency (e.g., 99 Hz, about every 10 ms)
    - Low overhead (1-5% CPU), statistical approximation
    - Cannot find every function call, only hot paths

  Tracing (deterministic):
    - Record EVERY function entry/exit or event
    - Exact counts, high overhead for frequent events
    - Use for relatively infrequent events: syscalls, page faults, context switches

  Counting (hardware counters):
    - Hardware PMU increments counters on specific microarchitecture events
    - Near-zero overhead in counting mode; in sampling mode a counter overflow raises an interrupt (often an NMI)
    - Reports aggregate counts per period

The perf Tool

perf is the Swiss Army knife of Linux performance analysis. It accesses the kernel's perf_event subsystem.

perf stat: Counter Summary

# Basic hardware counter summary for a command
perf stat ./my-program

# Output:
  Performance counter stats for './my-program':

       1,234.56 msec task-clock                #    0.999 CPUs utilized
              5      context-switches          #    4.047 /sec
              0      cpu-migrations            #    0.000 /sec
            247      page-faults               #  200.079 /sec
    3,456,789,012    cycles                    #    2.800 GHz
    2,100,000,000    instructions              #    0.61  insn per cycle  ← IPC
      450,000,000    branches                  #  364.465 M/sec
       22,500,000    branch-misses             #    5.00% of all branches
      120,000,000    cache-references          #   97.195 M/sec
       18,000,000    cache-misses              #   15.00% of cache refs  ← HIGH

# Key metrics (rules of thumb; thresholds vary by CPU and workload):
# IPC (instructions per cycle):
#   >3: excellent (out-of-order execution working well)
#   1-3: normal
#   <1: likely memory-bound or branch-misprediction heavy
#
# Cache miss rate (cache-misses / cache-references, usually last-level cache):
#   <1%: cache-friendly code
#   5-15%: significant cache pressure
#   >=15%: likely memory-bound; check DRAM latency and bandwidth
#
# Branch miss rate:
#   <1%: predictor works well
#   >5%: consider restructuring branches or using branchless code
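
The ratios perf stat prints in its right-hand column are plain divisions of the raw counters. A minimal sketch, using the counter values from the sample output above (the dictionary layout and function name are illustrative assumptions):

```python
def derived_metrics(counters):
    """Compute the ratios discussed above from raw perf stat counter values."""
    return {
        "ipc": counters["instructions"] / counters["cycles"],
        "cache_miss_rate": counters["cache-misses"] / counters["cache-references"],
        "branch_miss_rate": counters["branch-misses"] / counters["branches"],
    }

# Counter values copied from the sample perf stat output.
SAMPLE = {
    "cycles": 3_456_789_012,
    "instructions": 2_100_000_000,
    "branches": 450_000_000,
    "branch-misses": 22_500_000,
    "cache-references": 120_000_000,
    "cache-misses": 18_000_000,
}

if __name__ == "__main__":
    for name, value in derived_metrics(SAMPLE).items():
        print(f"{name}: {value:.2f}")
```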

perf record and report: CPU Profiling

# Record CPU samples with call graphs, 99Hz sampling, for 30 seconds
perf record -g -F 99 -p <PID> -- sleep 30

# Or record for a specific command:
perf record -g -F 99 -- ./my-program

# This produces perf.data in the current directory

# Interactive TUI report:
perf report

# Output (simplified):
# Overhead  Command  Shared Object     Symbol
#    35.23%  myapp    myapp             compute_hash
#    22.17%  myapp    libc.so           malloc
#    18.54%  myapp    myapp             parse_request
#     8.33%  myapp    [kernel]          copy_to_user
#     ...

# Report as flat text:
perf report --stdio

# Show call graph (callers):
perf report -g caller

perf top: Live CPU View

# Live top-like view of hot functions (refreshes every 2s)
perf top -g

# Filter to specific process:
perf top -p <PID>

# Kernel symbols are shown by default when permitted
# (requires root or /proc/sys/kernel/perf_event_paranoid <= 1).
# To focus on the kernel, hide user-space symbols:
perf top -g -U

perf trace: System Call Tracing

# Trace all syscalls for a process (lower overhead than strace)
perf trace -p <PID>

# Trace specific syscalls only:
perf trace -e read,write,epoll_wait -p <PID>

# Summary mode (like strace -c):
perf trace --summary -p <PID>

# Output:
# syscall            calls    total       min       avg       max
# epoll_wait          1234  35.023 ms   0.020 ms  0.028 ms  1.245 ms
# read                5678  12.456 ms   0.001 ms  0.002 ms  0.456 ms

PMU Event Counting

# List available hardware events
perf list hardware

# Common useful hardware events:
perf stat -e \
  cache-references,\
  cache-misses,\
  L1-dcache-load-misses,\
  L1-dcache-loads,\
  LLC-load-misses,\
  LLC-loads,\
  branch-instructions,\
  branch-misses,\
  instructions,\
  cycles \
  -- ./my-program

# L1 miss rate:    L1-dcache-load-misses / L1-dcache-loads
# LLC (L3) miss rate: LLC-load-misses / LLC-loads  → measures DRAM pressure

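# CPU-specific raw events (encoding shown is for Intel Sandy Bridge; event codes
# and names vary by microarchitecture, so check the vendor's event list):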
perf stat -e \
  cpu/event=0xD1,umask=0x20,name=MEM_LOAD_UOPS_RETIRED.LLC_MISS/ \
  -- ./my-program
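
perf stat can emit machine-readable output with -x, (CSV), for example `perf stat -x, -o counters.csv -e ... -- ./my-program`, which makes the miss-rate ratios easy to compute. A minimal sketch follows; the field layout assumed here (value, unit, event name, ...) matches common perf versions but may differ on yours, and the sample numbers are invented for illustration.

```python
import csv
import io

def parse_perf_stat_csv(text):
    """Return {event_name: count} from `perf stat -x,` output.

    Assumed layout per line: value, unit, event name, then further fields.
    """
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or row[0].startswith("#"):
            continue                      # blank lines and comment headers
        value, event = row[0], row[2]
        if not value.isdigit():           # e.g. "<not counted>", "<not supported>"
            continue
        counts[event] = int(value)
    return counts

def miss_rate(counts, misses, accesses):
    """Ratio of two counters, e.g. L1-dcache-load-misses / L1-dcache-loads."""
    return counts[misses] / counts[accesses]

# Invented sample data in the assumed CSV layout.
SAMPLE = """\
1000000,,L1-dcache-loads,1234560000,100.00,,
50000,,L1-dcache-load-misses,1234560000,100.00,,
20000,,LLC-loads,1234560000,100.00,,
4000,,LLC-load-misses,1234560000,100.00,,
<not supported>,,some-unsupported-event,0,100.00,,
"""

if __name__ == "__main__":
    c = parse_perf_stat_csv(SAMPLE)
    print("L1 miss rate: ", miss_rate(c, "L1-dcache-load-misses", "L1-dcache-loads"))
    print("LLC miss rate:", miss_rate(c, "LLC-load-misses", "LLC-loads"))
```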

perf sched: Scheduler Analysis

# Record scheduler events
perf sched record -- sleep 10

# Show per-task latency statistics
perf sched latency

# Output (simplified; column names vary slightly across perf versions):
# Task               | Runtime ms | Switches | Avg delay ms | Max delay ms
# my-server:1234     |   5123.000 |    24567 | avg: 0.045   | max: 0.123

# Replay schedule events (for debugging RT behavior):
perf sched replay

Flame Graph Generation

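Flame graphs are an effective way to visualize stack samples collected by a sampling profiler. The X-axis does not show the passage of time: it spans the whole sample population, with sibling frames sorted alphabetically, and the width of each frame is the share of samples in which it appears. The Y-axis is call stack depth. In regular flame graphs the color carries no meaning (hues are picked at random to tell neighboring frames apart). The key skill is reading the WIDTH of each frame.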

  Reading a Flame Graph:

  ┌─────────────────────────────────────────────────────────────────────┐
  │                             main (100%)                             │
  ├──────────────────────────────────────┬──────────────────────────────┤
  │         handle_request (58%)         │    background_work (42%)     │
  ├──────────────┬───────────────────────┼──────────┬───────────────────┤
  │ parse (15%)  │     process (43%)     │ gc (12%) │  serialize (30%)  │
  ├──────────────┼────────┬──────────────┼──────────┼───────────────────┤
  │  json (15%)  │db (20%)│ compute (23%)│          │  compress (30%)   │
  │              │        ├──────────────┤          ├───────────────────┤
  │              │        │  hash (15%)  │          │    zlib (30%)     │
  └──────────────┴────────┴──────────────┴──────────┴───────────────────┘

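  The diagram above puts the root (main) at the top (icicle layout); flamegraph.pl
  draws the root at the bottom by default, so there the leaves form the top edge.
  Wide frames: code paths present in many samples. Root frames are always wide,
  so look for wide frames far from the root.
  Narrow frames: few samples (not necessarily few calls); low priority for CPU optimization.
  Plateaus at the leaf edge (no callees): functions that were running on-CPU when sampled.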

  Insight from above:
  - compress/zlib takes 30% — is compression necessary? Can it be async?
  - hash takes 15% — is this a crypto hash? Can it be replaced with xxHash?
  - db takes 20% — are these N+1 queries? Connection pool exhausted?

Generating Flame Graphs from perf

# 1. Record with call graphs (frame pointers must be enabled)
perf record -g -F 99 -p <PID> -- sleep 60

# If frame pointers are missing (compiled without -fno-omit-frame-pointer):
# Use DWARF-based unwinding (slower but doesn't require recompilation):
perf record --call-graph dwarf -F 99 -p <PID> -- sleep 60

# 2. Export to text format
perf script > perf.out

# 3. Stack collapse (from Brendan Gregg's FlameGraph repo)
git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
./stackcollapse-perf.pl < /path/to/perf.out > collapsed.txt

# 4. Generate SVG
./flamegraph.pl collapsed.txt > flamegraph.svg

# Open in browser: file:///path/to/flamegraph.svg
# Interactive: click to zoom, Ctrl+F to search

# Search for function names interactively in the SVG: press Ctrl+F (or click
# "Search") and enter a regex such as malloc. Matching frames are highlighted
# in magenta, and the matched share of samples is shown in the corner.
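
The stack-collapse step is conceptually simple: identical stacks are merged into one line of the form `frame1;frame2;...;frameN count`. The sketch below shows only that idea on pre-parsed stacks; it does not parse real perf script output, and unlike stackcollapse-perf.pl it does not prefix the command name.

```python
from collections import Counter

def collapse(samples):
    """Fold stacks into 'root;...;leaf count' lines.

    samples: iterable of stacks, each listed leaf-first (the order in which
    perf script prints frames), so each stack is reversed before joining.
    """
    folded = Counter()
    for stack in samples:
        folded[";".join(reversed(stack))] += 1
    return [f"{stack} {count}" for stack, count in sorted(folded.items())]

if __name__ == "__main__":
    stacks = [
        ["hash", "compute", "main"],
        ["hash", "compute", "main"],
        ["malloc", "main"],
    ]
    for line in collapse(stacks):
        print(line)
```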

Ensuring frame pointers: GCC and Clang omit frame pointers by default in optimized x86-64 builds, and many distributions have historically shipped packages built that way for a slight speed improvement. Without frame pointers, perf record -g produces truncated or incorrect stacks. Solutions:
  - Recompile with -fno-omit-frame-pointer (add to CFLAGS)
  - Use --call-graph dwarf (slower collection, larger perf.data)
  - Use --call-graph lbr (Last Branch Record: CPU hardware on supported Intel processors; fast, but stack depth is limited to the number of LBR entries, typically 16 or 32)

Off-CPU Flame Graphs

Standard CPU profiling only captures where threads are running on CPU (on-CPU time). Threads blocked waiting for I/O, locks, or sleep are invisible. Off-CPU profiling captures this waiting time.

  Off-CPU time = time spent NOT on CPU:
  - Blocking on disk I/O (read/write)
  - Waiting for network data (recv)
  - Waiting for a mutex (futex)
  - Sleeping (nanosleep, poll timeout)
  - Waiting for page fault (page in from disk)

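  Tool 1: perf with sched:sched_switch

  perf record -e sched:sched_switch -a -g -- sleep 30
  perf script | ./stackcollapse-perf.pl | ./flamegraph.pl \
    --color=io --title="Context Switch Flame Graph" > offcpu.svg
  # Note: this pipeline counts context-switch events per stack, so widths show
  # how often a path blocks, not for how long. Measuring blocked time requires
  # pairing each switch-out with the matching switch-in timestamp.

  Tool 2: eBPF offcputime (measures blocked time in-kernel, lower overhead)

  # -d separates kernel and user stacks, -f emits folded output for flamegraph.pl
  /usr/share/bcc/tools/offcputime -df -p <PID> 30 > offcpu.folded
  ./flamegraph.pl --color=io --countname=us \
    --title="Off-CPU Time Flame Graph" < offcpu.folded > offcpu.svg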

  Reading off-CPU graphs:
  - Frame width represents total TIME BLOCKED (not CPU samples)
  - Wide frames = long blocking = latency source
  - Look for: futex_wait (lock contention), sys_read/sys_write (I/O),
              poll (waiting for network), do_page_fault (memory pressure)
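
Off-CPU time is derived by pairing the moment a thread leaves the CPU with the moment it runs again, and charging the gap to the stack at switch-out. A simplified sketch of that bookkeeping; the event tuple format is a hypothetical stand-in for scheduler trace data, not the output of any tool:

```python
from collections import defaultdict

def off_cpu_time(events):
    """Sum blocked microseconds per stack.

    events: (timestamp_us, tid, kind, stack) tuples sorted by time, where kind
    is "out" when the thread leaves the CPU and "in" when it resumes.
    """
    switched_out = {}               # tid -> (timestamp, stack at switch-out)
    blocked_us = defaultdict(int)   # folded stack -> total microseconds off-CPU
    for ts, tid, kind, stack in events:
        if kind == "out":
            switched_out[tid] = (ts, stack)
        elif kind == "in" and tid in switched_out:
            start, out_stack = switched_out.pop(tid)
            blocked_us[out_stack] += ts - start
    return dict(blocked_us)

if __name__ == "__main__":
    trace = [
        (0,    1, "out", "main;read_file;read"),
        (500,  2, "out", "main;lock;futex_wait"),
        (1200, 1, "in",  ""),
        (1500, 2, "in",  ""),
        (2000, 1, "out", "main;read_file;read"),
        (2300, 1, "in",  ""),
    ]
    for stack, us in off_cpu_time(trace).items():
        print(stack, us)
```

The resulting `stack microseconds` pairs have the same shape as folded stacks, which is why they can be rendered with flamegraph.pl using microseconds as the count unit.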

Differential Flame Graphs

Differential flame graphs compare two profiles (before vs after a change) to highlight regressions and improvements.

  Use case: A deploy caused p99 latency regression.
  Capture: perf record baseline (before deploy), then regression (after).

  Generation:

  # Collapse each profile into folded stacks:
  ./stackcollapse-perf.pl baseline_perf.out > baseline.txt
  ./stackcollapse-perf.pl regression_perf.out > regression.txt

  # Generate diff; -n normalizes sample counts to the same total
  # (critical for a fair comparison):
  ./difffolded.pl -n baseline.txt regression.txt > diff.txt

  # Generate differential flame graph (widths reflect the second profile):
  ./flamegraph.pl diff.txt > diff_flamegraph.svg

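  Color coding:
  - RED frames:  more CPU in regression → regression introduced here
  - BLUE frames: less CPU in regression → improvement (or code moved)
  - Saturation:  stronger color means a larger change; unchanged frames are white
  - Code paths that exist only in the regression profile also show up as red

  Common findings:
  - Red malloc/free → new memory allocation hot path introduced
  - Red kernel path → lock contention or syscall regression
  - Path missing entirely → code that vanished from the regression profile is not
    drawn; swap the inputs and add --negate to see it:
    ./difffolded.pl -n regression.txt baseline.txt | ./flamegraph.pl --negate > diff_reverse.svg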
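
Normalization matters because two recordings rarely contain the same number of samples. The sketch below illustrates the idea behind a folded-stack diff: scale the first profile to the second profile's total, then subtract per stack. It is an illustration of the technique under that assumption, not a reimplementation of difffolded.pl.

```python
def diff_folded(before, after, normalize=True):
    """Return (stack, before_count, after_count, delta) rows.

    before/after: {folded_stack: sample_count}. With normalize=True the
    `before` counts are scaled so both profiles have the same total.
    """
    scale = sum(after.values()) / sum(before.values()) if normalize else 1.0
    rows = []
    for stack in sorted(set(before) | set(after)):
        b = int(before.get(stack, 0) * scale)
        a = after.get(stack, 0)
        rows.append((stack, b, a, a - b))
    return rows

if __name__ == "__main__":
    baseline = {"main;parse": 50, "main;hash": 50}                       # 100 samples
    regression = {"main;parse": 100, "main;hash": 60, "main;malloc": 40} # 200 samples
    for row in diff_folded(baseline, regression):
        print(row)
```

In this example the regression profile simply ran twice as long. Without normalization main;parse would look like a +50 regression; with it, the delta is zero and only main;malloc stands out as growth.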

Intel VTune: Topdown Microarchitecture Analysis

VTune implements the Intel Topdown Microarchitecture Analysis (TMA) methodology, which classifies every pipeline slot into one of four top-level buckets:

  Topdown Analysis Tree:

  100% of pipeline slots
  ├── Frontend Bound (X%)
  │   CPU cannot deliver instructions fast enough
  │   Causes: instruction cache misses, branch resteers,
  │           fetch bandwidth, iTLB misses
  │   Fix: reduce code size, improve code layout, PGO
  │
  ├── Backend Bound (Y%)
  │   CPU has instructions but execution units are stalled
  │   ├── Memory Bound: stalled waiting for cache/memory
  │   │   Fix: improve data locality, reduce working set, prefetch
  │   └── Core Bound: stalled on execution unit (ALU/FPU)
  │       Fix: vectorize (SIMD), reduce dependency chains
  │
  ├── Bad Speculation (Z%)
  │   Slots wasted on mispredicted branches and machine clears
  │   (pipeline flush and replay)
  │   Fix: reduce unpredictable branches, branchless algorithms
  │
  └── Retiring (W%)
      Useful work: the only "good" category
      Goal: raise the Retiring share, then check that the retired
      work itself is efficient (e.g., vectorized)

  Rule of thumb:
  - Memory Bound > 20%: optimize data structures (SoA vs AoS, cache lines)
  - Bad Speculation > 10%: profile branches, use profile-guided optimization
  - Frontend Bound > 20%: check i-cache, instruction bloat from templates/macros
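
The four top-level buckets can be computed from a handful of raw counters. The sketch below uses the level-1 formulas published for 4-wide Intel cores (Sandy Bridge through Skylake era); the pipeline width and event set are assumptions that do not hold for every CPU, and newer cores expose dedicated topdown counters instead.

```python
def tma_level1(cycles, idq_uops_not_delivered, uops_issued,
               uops_retired_slots, recovery_cycles, width=4):
    """Level-1 topdown breakdown as fractions of all pipeline slots.

    Assumed counters (Intel event names):
      cycles                 CPU_CLK_UNHALTED.THREAD
      idq_uops_not_delivered IDQ_UOPS_NOT_DELIVERED.CORE
      uops_issued            UOPS_ISSUED.ANY
      uops_retired_slots     UOPS_RETIRED.RETIRE_SLOTS
      recovery_cycles        INT_MISC.RECOVERY_CYCLES
    """
    slots = width * cycles
    frontend = idq_uops_not_delivered / slots
    bad_spec = (uops_issued - uops_retired_slots + width * recovery_cycles) / slots
    retiring = uops_retired_slots / slots
    backend = 1.0 - (frontend + bad_spec + retiring)   # whatever is left over
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }

if __name__ == "__main__":
    # Invented counter values for illustration.
    result = tma_level1(cycles=1000, idq_uops_not_delivered=600,
                        uops_issued=2000, uops_retired_slots=1600,
                        recovery_cycles=50)
    for name, share in result.items():
        print(f"{name}: {share:.0%}")
```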

# VTune CLI (requires Intel VTune installation):
vtune -collect hotspots -result-dir vtune_hotspots -- ./my-program
vtune -report hotspots -result-dir vtune_hotspots

# Topdown analysis:
vtune -collect uarch-exploration -result-dir vtune_uarch -- ./my-program
vtune -report summary -result-dir vtune_uarch

async-profiler for JVM

The JVM presents unique profiling challenges:
  - Safepoint bias: traditional JVMTI profilers only sample at safepoints, so hot code between safepoints is misattributed
  - JIT compilation: code changes shape at runtime; deoptimization can appear as hot frames
  - Mixed mode: Java frames, native frames, and kernel frames all mixed

async-profiler avoids safepoint bias by using HotSpot's AsyncGetCallTrace API (an internal JVM interface, not an OS facility) combined with Linux perf events:

# Download and unpack async-profiler (the asprof launcher lives in bin/)
curl -L https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz | tar xz
cd async-profiler-3.0-linux-x64/bin

# Profile for 60 seconds, output flame graph
./asprof -d 60 -f flamegraph.html <PID>

# Profile CPU and allocation together (multi-event profiles need JFR output):
./asprof -d 60 -e cpu,alloc -f profile.jfr <PID>

# Wall-clock profiling (includes off-CPU threads — blocked I/O, locks):
./asprof -d 60 -e wall -f flamegraph.html <PID>

# Lock profiling (find contended monitors):
./asprof -d 60 -e lock -f flamegraph.html <PID>

# Attach to running JVM (no restart needed):
./asprof start -e cpu <PID>
sleep 30
./asprof stop -f flamegraph.html <PID>

Safepoint bias example:

  Safepoint-biased sampling (illustrative):
  Samples land only where threads reach a safepoint poll, so time is
  attributed to whatever code happens to contain the next poll.
  A hot computation loop without polls is almost invisible.

  async-profiler (unbiased, illustrative numbers):
  Actual hot method visible: "HashMap.get" at 34% of samples
  GC shows realistic percentage: 8%
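
Safepoint bias can be reproduced with a toy model. In the sketch below a thread alternates between a long loop without safepoint polls and a short method that has one; a biased sampler can only record the stack once the thread reaches a poll. The method names, durations, and the model itself are invented for illustration and say nothing about real JVM internals.

```python
from collections import Counter

def profile(timeline, period_ms, biased):
    """Sample a timeline of (method, duration_ms, has_safepoint_poll) segments."""
    # Expand to one entry per millisecond to keep the model simple.
    ticks = [(method, poll) for method, dur, poll in timeline for _ in range(dur)]
    samples = Counter()
    for t in range(0, len(ticks), period_ms):
        i = t
        if biased:
            # The stack is only recorded once the thread reaches a poll.
            while i < len(ticks) and not ticks[i][1]:
                i += 1
            if i == len(ticks):
                continue
        samples[ticks[i][0]] += 1
    return samples

if __name__ == "__main__":
    # 90 ms of hot loop (no polls), then 10 ms of logging (has a poll), 5 times.
    timeline = [("hotLoop", 90, False), ("logResult", 10, True)] * 5
    print("unbiased:", dict(profile(timeline, 10, biased=False)))
    print("biased:  ", dict(profile(timeline, 10, biased=True)))
```

The unbiased sampler attributes 90% of samples to hotLoop, matching where the time goes, while the biased one charges every sample to logResult.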

Profiling in Production

  Low-overhead continuous profiling:

  Tools: Parca, Pyroscope, Polar Signals, Grafana Continuous Profiling

  Approach:
  - Run profiler as sidecar or DaemonSet on every node
  - Sample at 1-100 Hz (typical: 19 Hz, off from common timer frequencies)
  - Aggregate samples in eBPF maps (kernel-side)
  - Upload symbolized profiles to central store
  - Query: "What was the hottest function between 14:00 and 14:05?"

  eBPF-based profiling overhead: typically reported below 1% CPU at low sampling rates
  (in-kernel aggregation avoids copying every raw sample to userspace)

  Example: Parca agent (eBPF):
  kubectl apply -f https://github.com/parca-dev/parca-agent/releases/latest/.../
  # DaemonSet deploys to all nodes
  # Profiles all processes on host, including Kubernetes pods
  # Kernel stacks unified with userspace stacks
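
A query such as "what was the hottest function between 14:00 and 14:05?" amounts to merging the stored profiles in that window and ranking by sample count. A minimal sketch of that merge; the in-memory profile store and its layout are hypothetical and unrelated to how Parca or Pyroscope store data:

```python
from collections import Counter

def hottest(profiles, start, end):
    """Return (function, samples) for the hottest function in [start, end).

    profiles: list of (timestamp, {function: samples}); timestamps are
    zero-padded "HH:MM" strings, so plain string comparison orders them.
    """
    merged = Counter()
    for ts, counts in profiles:
        if start <= ts < end:
            merged.update(counts)
    return merged.most_common(1)[0] if merged else None

if __name__ == "__main__":
    store = [
        ("13:59", {"parse": 90}),
        ("14:00", {"parse": 10, "compress": 30}),
        ("14:03", {"compress": 25, "hash": 20}),
        ("14:05", {"hash": 99}),
    ]
    print(hottest(store, "14:00", "14:05"))
```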

Debugging Notes

# Verify perf is working:
perf stat ls
# If "Permission denied": /proc/sys/kernel/perf_event_paranoid is too high
echo 1 > /proc/sys/kernel/perf_event_paranoid  # requires root, temporary

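# Missing kernel symbols in perf report:
# Check that perf can read /proc/kallsyms (run as root, or set
# /proc/sys/kernel/kptr_restrict to 0).
# For source-level annotation, install kernel debug symbols:
# Ubuntu: apt install linux-image-$(uname -r)-dbgsym (requires the ddebs repository)
# RHEL: yum install kernel-debuginfo
# (linux-tools-$(uname -r) on Ubuntu provides the perf binary itself.)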

# Broken stack traces (all show as [unknown]):
# Missing frame pointers — recompile with -fno-omit-frame-pointer
# Or use DWARF: perf record --call-graph dwarf

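# JVM: Java frames showing as hex addresses in perf output:
# perf cannot symbolize JIT-compiled code on its own. Generate /tmp/perf-<PID>.map
# with perf-map-agent and run the JVM with -XX:+PreserveFramePointer,
# or use async-profiler, which resolves Java frames itself.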

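# Flame graph too cluttered / too many tiny frames:
# Use --minwidth to omit narrow frames. The value is in pixels by default;
# recent flamegraph.pl versions also accept a percentage such as 0.5%.
./flamegraph.pl --minwidth 0.5 collapsed.txt > flamegraph.svg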

# perf.data too large:
# Reduce frequency: -F 49 (49 Hz instead of 99)
# Limit time: -- sleep 10 instead of 30
# Limit processes: -p <PID> instead of system-wide
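
A rough size estimate shows why these knobs help. The sketch below multiplies sample count by an assumed per-sample size; the 64-byte fixed part and 30-frame stack depth are guesses for illustration, while 8192 bytes is the default stack copy size of --call-graph dwarf. Real files also contain metadata and mmap events, so treat the results as order-of-magnitude figures.

```python
def estimate_perf_data_bytes(freq_hz, seconds, busy_cpus, bytes_per_sample):
    """Approximate bytes of sample data written by perf record."""
    return int(freq_hz * seconds * busy_cpus * bytes_per_sample)

# Assumed per-sample sizes (illustrative, not measured):
FP_SAMPLE = 64 + 8 * 30      # fixed fields plus 30 frame addresses of 8 bytes
DWARF_SAMPLE = 64 + 8192     # fixed fields plus the default 8 KiB stack copy

if __name__ == "__main__":
    for label, size in (("frame pointers", FP_SAMPLE), ("dwarf", DWARF_SAMPLE)):
        total = estimate_perf_data_bytes(99, 30, 4, size)
        print(f"{label}: {total / 1e6:.1f} MB for 30 s at 99 Hz on 4 busy CPUs")
```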

Security Implications

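perf with system-wide profiling can expose kernel addresses and activity from other processes, which can assist kernel exploitation (for example, by defeating KASLR). For this reason /proc/sys/kernel/perf_event_paranoid has defaulted to 2 since Linux 4.6, and some distributions ship even stricter values.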
Flame graphs can inadvertently expose sensitive information: function names may reveal encryption algorithms, data processing logic, or internal API names. Treat flame graphs as confidential in regulated environments.
Attaching a profiler to a production process may violate change management policies. Establish pre-approved runbooks for profiling in production.
eBPF-based profilers (Parca, Pyroscope) run with elevated kernel privileges; audit their RBAC permissions carefully in Kubernetes environments.

Performance Implications

perf record -g -F 99: approximately 1-3% CPU overhead. Safe for brief production profiling (5-10 minutes).
perf record --call-graph dwarf: 5-15% overhead due to stack copying. Use only in dev/staging.
strace: ptrace-based tracing stops the process twice per syscall and can slow syscall-heavy workloads by 100x or more. Avoid it on production servers; prefer perf trace.
Continuous eBPF profiling (Parca/Pyroscope) at 19 Hz: <0.5% overhead. Safe for permanent production deployment.

Modern Usage

Continuous profiling as standard practice: Google (Google-Wide Profiling, pprof), Datadog (Continuous Profiler), and Grafana (Pyroscope) have helped make always-on profiling an observability pillar alongside metrics, logs, and traces.
Profile-guided optimization (PGO): Flame graphs show where the hot paths are, but compilers consume their own profile formats: collect an instrumented profile (clang -fprofile-generate, then -fprofile-use) or a sampled one (perf data converted for -fprofile-sample-use), or use GraalVM's PGO for Java. Gains of 5-20% are commonly reported.
eBPF profiling without root: New Linux capabilities (CAP_BPF, CAP_PERFMON) allow non-root profiling on Linux 5.8+, enabling profiling in hardened containers.

Future Directions

Continuous profiling standardization: OpenTelemetry is adding profiling as a fourth signal alongside metrics/logs/traces (Profiling SIG, 2023-2024).
Hardware topdown in eBPF: toplev (from pmu-tools) and perf stat --topdown already provide TMA without VTune; combining such counters with eBPF could extend TMA analysis to every process on a host.
ML-assisted bottleneck identification: Tools that automatically correlate profiling data with latency changes and suggest optimization strategies.

Exercises

Record a CPU flame graph of a known-slow program (e.g., a sorting algorithm). Identify the top three functions by width. Verify your findings match a manual code review of hot paths.
Use perf stat to compare two implementations of string hashing (e.g., FNV-1a vs SHA-256). Compare IPC, cache miss rate, and branch miss rate. Explain why one is faster in microarchitecture terms.
Generate an off-CPU flame graph for a program that does file I/O. Identify where it spends most of its time waiting. Compare to the on-CPU flame graph and note which functions appear only off-CPU.
Install async-profiler. Profile a Java web server under load. Find the top memory allocation hotspot. Suggest how to reduce allocation rate.
Generate a differential flame graph between two versions of a program (introduce a deliberate regression, e.g., add unnecessary malloc/free in a hot path). Verify the differential graph correctly highlights the regression in red.

References

Brendan Gregg, "Systems Performance" (2nd ed., 2020): definitive reference; chapters on CPU and profiling
Brendan Gregg, "BPF Performance Tools" (2019): eBPF-based profiling tools
FlameGraph repository: github.com/brendangregg/FlameGraph (Brendan Gregg, original scripts)
async-profiler: github.com/async-profiler/async-profiler
Intel Topdown Microarchitecture Analysis: intel.com/content/www/us/en/develop/documentation/vtune-help/top/reference-type-topics/cpu-metrics-reference.html
Brendan Gregg, "The Flame Graph", ACM Queue, 2016 (original paper)
Linux perf wiki: perf.wiki.kernel.org
Parca continuous profiling: parca.dev
"Stop Safepointing Everything": JVM profiling talk, JVM Language Summit
Brendan Gregg, "Linux Profiling at Netflix", SCaLE 13x, 2015