08 — Continuous Profiling

Technical Overview

Continuous profiling is the practice of collecting CPU and memory profiles from production services always-on — not just during incidents or on demand, but continuously, at low overhead, with results stored as time-series data. The key word is "continuous": traditional profiling was a reactive tool (start profiling when something is slow, stop when done). Continuous profiling makes profiling data available retroactively — you can answer "what was the CPU profile of this service at 14:30 UTC yesterday, before the regression?" without having started profiling in advance.

This transforms performance engineering from reactive to proactive: performance regressions are detected automatically by comparing profiles before and after a deployment, often before they become user-visible incidents or SLO violations.

Brendan Gregg's work, much of it done at Netflix and documented extensively in BPF Performance Tools and Systems Performance, established the technical foundations: eBPF-based stack sampling, flame graph visualization, and off-CPU profiling. Google's 2010 paper "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers" (Ren et al.) described continuous profiling across Google's fleet; it built on earlier work such as DEC's 1997 paper "Continuous Profiling: Where Have All the Cycles Gone?" (Anderson et al.). Fleet-wide profiling of this kind is credited with CPU capacity savings on the order of 10-13% through systematic identification and elimination of hot paths.

Prerequisites

Understanding of CPU architecture (cache hierarchy, branch prediction, IPC)
Familiarity with Linux perf and eBPF basics (see 06-ebpf-observability.md)
Basic knowledge of call stacks and function execution
Understanding of JVM or compiled language execution models (for language-specific profiling)

Core Content

What Continuous Profiling Measures

A profile is a statistical sample of execution state. The most common types:

CPU profiling: at regular intervals (99Hz = every ~10ms), interrupt execution and capture the current call stack. After N samples, you know which functions appear most frequently — these are where the CPU is spending time.

Off-CPU profiling: captures time spent waiting (I/O, lock contention, sleep, page faults). A thread that is blocking on a database query is off-CPU — CPU profiling misses this time entirely. Off-CPU profiling attaches to scheduler events (context switches) to record when threads go off-CPU and for how long.

Memory/heap profiling: samples allocations — records which call stacks are allocating memory, and how much. Identifies allocation-heavy code paths (GC pressure in managed languages).

Lock contention profiling: records time spent waiting to acquire locks (mutexes, semaphores). Identifies synchronization bottlenecks.

PROFILING TYPE COVERAGE:

Application time breakdown (hypothetical service):
  ┌─────────────────────────────────────────────────────┐
  │ Total wall-clock time for a request (500ms)          │
  ├─────────────┬──────────────────┬────────────────────┤
  │ On CPU      │ Waiting (off-CPU)│ I/O wait           │
  │ (CPU prof.) │ (off-CPU prof.)  │ (off-CPU prof.)    │
  │   40%       │   35%            │   25%              │
  │   200ms     │   175ms          │   125ms            │
  └─────────────┴──────────────────┴────────────────────┘

  CPU profiler shows: 200ms of CPU work → breaks down by function
  Off-CPU profiler shows: 175ms lock wait + 125ms I/O = 300ms sleeping
  Together: complete picture of where time goes

Sample-Based CPU Profiling

The dominant approach is sample-based (statistical) profiling: periodically interrupt execution and capture the call stack. This is not exact: with a 10ms sampling interval, a function that runs for 9ms and then completes can fall entirely between two samples and be missed. But over thousands of samples, the statistical distribution accurately reflects where CPU time is spent.
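A small simulation can make the statistical argument concrete. The sketch below assumes a hypothetical service whose true CPU shares are given in TRUE_SHARE (illustrative numbers, not measurements) and models each sample as landing in a function with probability equal to its share. Real profilers sample at fixed intervals rather than independently, so this is a simplification.

```python
import random
from collections import Counter

# Hypothetical ground truth: fraction of CPU time spent in each function.
TRUE_SHARE = {"parse_json": 0.20, "db_query": 0.30, "encode_response": 0.10, "other": 0.40}


def simulate_sampling(time_share, n_samples, seed=0):
    """Draw n_samples stack samples and return the estimated share per function.

    Each sample lands in a function with probability equal to that
    function's true share of CPU time.
    """
    rng = random.Random(seed)
    names = list(time_share)
    weights = [time_share[name] for name in names]
    hits = Counter(rng.choices(names, weights=weights, k=n_samples))
    return {name: hits[name] / n_samples for name in names}


if __name__ == "__main__":
    for n in (100, 3_000, 100_000):
        estimate = simulate_sampling(TRUE_SHARE, n)
        worst = max(abs(estimate[f] - TRUE_SHARE[f]) for f in TRUE_SHARE)
        print(f"{n:>7} samples: worst absolute error = {worst:.4f}")
```

The error shrinks roughly with the square root of the sample count, which is why a 99Hz profiler needs tens of seconds of data before narrow frames become trustworthy.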

Sampling mechanisms:
1. Signal-based (setitimer/SIGPROF, used by gprof-style profilers and by Go's runtime profiler): a timer delivers SIGPROF to the process at N Hz and the signal handler captures the stack. Problems: signal delivery is delayed and skewed across threads, the handler is limited to async-signal-safe operations, and kernel stacks are not visible.
2. Hardware PMU (Performance Monitoring Unit): configure a CPU cycle counter to overflow after a set number of cycles and raise a PMU interrupt. This is what perf record -F 99 uses when the hardware cycles event is available. Very accurate, with no signal delivery latency.
3. eBPF on perf events (Parca Agent, Pyroscope eBPF agent): open a sampling event with perf_event_open() and attach an eBPF program that captures stacks in-kernel and aggregates them in BPF maps.

Flame Graph Interpretation

The flame graph (Brendan Gregg, 2011) visualizes profiling samples:

FLAME GRAPH (CPU profiling, read bottom-up)

  main
  ├─ serve_request ────────────────────────────────────  (60% of samples)
  │  ├─ parse_json ──────────────────────────────────    (20%)
  │  │  └─ malloc ────────────────────────────────────   (15%) ← HOT PATH
  │  ├─ db_query ────────────────────────────────────    (30%)
  │  │  ├─ execute_sql ──────────────────────────────    (20%)
  │  │  │  └─ network_write ─────────────────────────   (18%)
  │  │  └─ serialize_result ──────────────────────────   (10%)
  │  └─ encode_response ─────────────────────────────    (10%)
  └─ health_check ────────                               (5%)

ASCII Flame Graph (horizontal = proportion of CPU time):
  ────────────────────────── serve_request ──────────────────── │ health │
  ──────── parse_json ──────── │ ───────────── db_query ─────── │
  ──── malloc ─────────────── │ execute_sql  │ serialize_result │
                               │ network_write│

Rules for reading:
  - Width = fraction of total CPU time (wider = more time)
  - y-axis = call stack depth (bottom = root frame / entry point, top = leaf frame that was on-CPU when sampled)
  - A wide frame at the TOP of a stack tower is a hot leaf function (this is where you optimize)
  - A wide frame in the MIDDLE of a stack tower means that function's children are slow
  - Flat plateaus at the top = CPU is doing work in that function

Identifying performance problems from flame graphs:
- Wide frames at the top of a tower: hot functions consuming CPU directly. Optimization target.
- Unexpectedly wide internal frames: the function's callees are collectively expensive. Check for N+1 patterns (a loop calling a function N times).
- Tall, narrow spikes: deep call stacks with little CPU at any level. Indicates overhead from abstraction layers rather than actual work.
- Missing frames (flat bottom, truncated stacks): frame pointer issues or inlined functions. Recompile with -fno-omit-frame-pointer.
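The distinction between a wide leaf and a wide internal frame corresponds to "self" versus "total" sample counts. The sketch below computes both from folded stacks (the "frame;frame;frame count" format produced by the stackcollapse scripts). The FOLDED data is an assumption: it is chosen so the totals match the percentages in the example tree above, treating each percentage point as one sample.

```python
from collections import Counter

# Hypothetical folded stacks matching the example tree (1 sample = 1%).
FOLDED = [
    "main;serve_request;parse_json;malloc 15",
    "main;serve_request;parse_json 5",
    "main;serve_request;db_query;execute_sql;network_write 18",
    "main;serve_request;db_query;execute_sql 2",
    "main;serve_request;db_query;serialize_result 10",
    "main;serve_request;encode_response 10",
    "main;health_check 5",
]


def self_and_total(folded_lines):
    """Return (self_counts, total_counts) per function name.

    self  = samples where the function was the leaf (actually on-CPU)
    total = samples where the function appeared anywhere in the stack
    """
    self_counts, total_counts = Counter(), Counter()
    for line in folded_lines:
        stack, _, count = line.rpartition(" ")
        frames = stack.split(";")
        n = int(count)
        self_counts[frames[-1]] += n
        for frame in set(frames):  # set() avoids double counting recursive frames
            total_counts[frame] += n
    return self_counts, total_counts


if __name__ == "__main__":
    self_counts, total_counts = self_and_total(FOLDED)
    for name, total in total_counts.most_common():
        print(f"{name:<18} total={total:>3} self={self_counts[name]:>3}")
```

In this data db_query has a total of 30 samples but no self samples: it is wide only because of its children, so the optimization target is inside execute_sql or serialize_result rather than db_query itself.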

Differential Flame Graphs

Differential flame graphs compare two profiles (before and after a deployment, or two time windows):

DIFFERENTIAL FLAME GRAPH

  Red  = MORE CPU time (regression) → regressed after change
  Blue = LESS CPU time (improvement) → faster after change
  Gray = No change
  Width = sample count in the second ("after") profile by default, so widths show current CPU consumption

  After code change:
  ──────────────────────────── serve_request ──────────────────
  ──── parse_json ──── │ ─────────────────── db_query ─────────
  [RED: malloc ──────] │ [RED: execute_sql ──────] │ serialize  
                       │ [RED: schema_validate ───]│

  Interpretation: malloc and execute_sql both got more expensive. The growth in
  execute_sql comes from its child schema_validate, which is NEW (didn't exist
  before) and was introduced by the change.

# Generate differential flame graph with Brendan Gregg's FlameGraph tools
perf record -F 99 -a --call-graph dwarf -- sleep 30   # before profile
mv perf.data perf_before.data

# Deploy change...

perf record -F 99 -a --call-graph dwarf -- sleep 30   # after profile
mv perf.data perf_after.data

# Generate stacks
perf script -i perf_before.data | ./stackcollapse-perf.pl > before.folded
perf script -i perf_after.data  | ./stackcollapse-perf.pl > after.folded

# Generate differential flame graph
./difffolded.pl before.folded after.folded | ./flamegraph.pl > diff.svg
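The core of a differential flame graph is a per-stack comparison of two folded profiles. The following is a minimal sketch of that comparison, not a reimplementation of difffolded.pl; the BEFORE and AFTER profiles are hypothetical and mirror the schema_validate example above.

```python
from collections import Counter

# Hypothetical folded profiles taken before and after a deployment.
BEFORE = """
serve_request;parse_json;malloc 10
serve_request;db_query;execute_sql 15
serve_request;db_query;serialize_result 10
"""
AFTER = """
serve_request;parse_json;malloc 15
serve_request;db_query;execute_sql 15
serve_request;db_query;execute_sql;schema_validate 8
serve_request;db_query;serialize_result 10
"""


def parse_folded(text):
    """Parse "stack count" lines into a Counter keyed by the folded stack."""
    counts = Counter()
    for line in text.strip().splitlines():
        stack, _, n = line.rpartition(" ")
        counts[stack] += int(n)
    return counts


def diff_folded(before, after):
    """Return {stack: (before_count, after_count, delta)} for stacks in either profile."""
    stacks = sorted(set(before) | set(after))
    return {s: (before[s], after[s], after[s] - before[s]) for s in stacks}


if __name__ == "__main__":
    result = diff_folded(parse_folded(BEFORE), parse_folded(AFTER))
    for stack, (b, a, delta) in result.items():
        marker = "RED " if delta > 0 else "BLUE" if delta < 0 else "    "
        print(f"{marker} {delta:+4d}  {stack}")
```

A stack that is absent from the first profile shows up with a before count of zero, which is how newly introduced functions become visible. When the two profiles have different total sample counts, the counts should be normalized before comparing.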

Profiling Storage and Query (Parca / Pyroscope)

Traditional profiling produces one-shot files (.prof, .pprof). Continuous profiling requires storing profiles as a time series — thousands of profiles per service per day.

pprof format (Google's standard): protobuf-encoded profile containing sample types (CPU, memory, goroutine), stack traces, and string tables. Used by Go natively; Java (async-profiler), Python, and Ruby export pprof via OTel or direct tooling.

Parca architecture:

Parca Agent (eBPF, per-node DaemonSet)
  │ collects stacks, symbolizes, exports as pprof
  ▼
Parca Server
  │ ingests pprof profiles, stores in local columnar storage
  │ (based on FrostDB — columnar, Parquet-like)
  │ indexes by: service, time, profile type
  ▼
Query API (PromQL-like selector syntax)
  │ serves profile data to UI and tooling
  ▼
Parca Web UI
  - flame graph for any time range
  - diff flame graphs between deployments
  - merge profiles across pod replicas
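The storage and query model can be pictured as label-indexed profiles that are selected by time range and then merged. The sketch below is an in-memory illustration only; StoredProfile and select_and_merge are hypothetical names, not Parca's API or storage format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StoredProfile:
    labels: dict      # e.g. {"service_name": "checkout-service", "pod": "checkout-1"}
    timestamp: int    # seconds since epoch
    samples: Counter  # folded stack -> sample count


def select_and_merge(store, matchers, start, end):
    """Merge all profiles whose labels match and whose timestamp is in [start, end)."""
    merged = Counter()
    for profile in store:
        in_range = start <= profile.timestamp < end
        matches = all(profile.labels.get(k) == v for k, v in matchers.items())
        if in_range and matches:
            merged.update(profile.samples)  # adds counts stack by stack
    return merged


if __name__ == "__main__":
    store = [
        StoredProfile({"service_name": "checkout-service", "pod": "checkout-1"}, 100,
                      Counter({"main;serve_request": 40, "main;health_check": 2})),
        StoredProfile({"service_name": "checkout-service", "pod": "checkout-2"}, 105,
                      Counter({"main;serve_request": 35})),
        StoredProfile({"service_name": "cart-service", "pod": "cart-1"}, 102,
                      Counter({"main;serve_request": 99})),
    ]
    merged = select_and_merge(store, {"service_name": "checkout-service"}, 100, 110)
    print(dict(merged))
```

Because the selector matches on service_name only, both checkout replicas contribute to one merged profile while the cart-service profile is excluded.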

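# Query Parca for CPU profile of checkout service at a specific time
# (parcactl and its flags are illustrative here; check the Parca documentation for the
# query interface available in your version, such as the web UI or the query API)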
parcactl profile query \
  --selector='{service_name="checkout-service", profile_type="process_cpu:cpu:nanoseconds:cpu:nanoseconds"}' \
  --from=1715869200000000000 \
  --to=1715869260000000000 \
  --output=flamegraph | open -f -a /Applications/Firefox.app

# Compare profiles before/after deployment
parcactl profile diff \
  --selector-a='{service_name="checkout-service"}' \
  --from-a=1715865600000000000 --to-a=1715869200000000000 \
  --selector-b='{service_name="checkout-service"}' \
  --from-b=1715869200000000000 --to-b=1715872800000000000 \
  --output=flamegraph

Java Profiling: async-profiler

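Java profiling has a historically difficult problem: traditional JVMTI-based profilers can only sample threads at safepoints, the specific points in compiled code where the JVM can bring threads to a known state (for GC, deoptimization, and similar operations). This creates safepoint bias: time spent in code between safepoint polls (for example, a hot JIT-compiled loop whose polls were optimized away) is attributed to the next safepoint rather than to the code that actually ran. Profilers that sample only at safepoints therefore produce skewed results.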

async-profiler solves this by combining OS-level sampling (perf_events or timer signals) with HotSpot's AsyncGetCallTrace API, which can walk a thread's Java stack at arbitrary points, even when the thread is not at a safepoint. It also captures native frames (C/C++ code in the JVM and JNI libraries) alongside Java frames, giving complete stack traces.

# Profile a JVM process for 30 seconds, output flamegraph
# (profiler.sh is the async-profiler 2.x launcher; 3.0 and later ship asprof instead)
./profiler.sh -d 30 -f /tmp/flamegraph.html $(pgrep -f myapp)

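# Profile and output pprof format (for Parca/Pyroscope ingestion)
# (pprof output support depends on the async-profiler version; if -o pprof is rejected,
# record JFR and convert it with the converter shipped in your release)
./profiler.sh -d 30 -f /tmp/profile.pprof -o pprof $(pgrep -f myapp)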

# Profile with allocation tracking (heap profiler)
./profiler.sh -e alloc -d 30 -f /tmp/alloc.html $(pgrep -f myapp)

# Profile with lock contention
./profiler.sh -e lock -d 30 -f /tmp/locks.html $(pgrep -f myapp)

# Profile at an explicit sampling interval (10ms = 100Hz, which is also async-profiler's default)
./profiler.sh -d 30 -i 10000000 -f /tmp/cpu.html $(pgrep -f myapp)
# -i 10000000 = 10,000,000ns = 10ms interval = 100Hz

Continuous profiling with Pyroscope + async-profiler (Java):

# Start application with Pyroscope Java agent
java -javaagent:pyroscope.jar \
     -Dpyroscope.server.address=http://pyroscope:4040 \
     -Dpyroscope.application.name=checkout-service \
     -Dpyroscope.profiling.interval=10ms \
     -Dpyroscope.profiler.event=itimer \
     -jar app.jar

Google's Continuous Profiling Research

The paper "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers" (Ren et al., 2010), often cited as the foundation of the continuous profiling concept, described:
- Always-on, sampling-based profiling across Google's fleet, with collection spread across machines and time to keep overhead low
- Profiles stored centrally and queryable through an internal UI
- Fleet-wide optimization: top CPU consumers across all services visible in one view
- Outcome: 10-13% fleet-wide CPU savings by identifying and fixing top consumers

The key insight: at fleet scale, even a 0.1% CPU saving in a commonly-called function (like a JSON serializer or a hash function) translates to significant infrastructure cost reduction. Without continuous profiling, these micro-inefficiencies are invisible.
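The arithmetic behind this insight is simple enough to write down. The sketch below uses a hypothetical fleet of one million cores; the fleet size, function share, and improvement factor are assumptions for illustration, not figures from the paper.

```python
def cores_saved(fleet_cores, function_share, improvement):
    """Cores freed when a function that uses `function_share` of fleet CPU
    becomes `improvement` (a fraction, 0..1) cheaper."""
    return fleet_cores * function_share * improvement


if __name__ == "__main__":
    # A serializer using 1% of fleet CPU, made 10% faster, saves 0.1% of the fleet.
    saved = cores_saved(fleet_cores=1_000_000, function_share=0.01, improvement=0.10)
    print(f"approximately {saved:.0f} cores freed")
```

Under these assumptions a 0.1% fleet-wide saving frees about a thousand cores, an amount no single service team would notice in its own dashboards.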

Off-CPU Profiling

Off-CPU profiling traces time spent not executing on a CPU core, that is, waiting for:
- Disk I/O (read/write syscalls)
- Network I/O (send/recv)
- Lock acquisition (mutex, semaphore, RWlock)
- Sleep (explicit sleeps, timers)
- Page faults (memory access to swapped/not-yet-loaded pages)

# off-CPU profiling with BCC tool (30 second window, all threads > 1ms off-CPU)
# -f = folded output for flamegraph.pl, -m 1000 = min 1000 microseconds blocked
offcputime-bpfcc -f -m 1000 30 > /tmp/offcpu.folded

# Convert to flame graph (folded output needs no stackcollapse step)
./flamegraph.pl --color=io --bgcolors=grey --countname=us < /tmp/offcpu.folded > offcpu.svg

# bpftrace one-liner for off-CPU analysis by process name
bpftrace -e '
tracepoint:sched:sched_switch {
  // prev_state != 0: the outgoing thread blocked (it was not merely preempted)
  if (args->prev_state) {
    @ts[args->prev_pid] = nsecs;
  }
  // the incoming thread is back on-CPU: record how long it was off, in ms
  if (@ts[args->next_pid]) {
    @offcpu_ms[args->next_comm] =
      hist((nsecs - @ts[args->next_pid]) / 1000000);
    delete(@ts[args->next_pid]);
  }
}
interval:s:10 { clear(@ts); print(@offcpu_ms); clear(@offcpu_ms); exit(); }
'

Off-CPU flame graphs use the same format as CPU flame graphs but the width represents time spent sleeping rather than CPU cycles.
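The bookkeeping behind off-CPU measurement is a timestamp recorded when a thread blocks and a subtraction when it is scheduled again. The sketch below replays that logic over a hypothetical list of context-switch events; the tuple layout is an assumption chosen for illustration and is not the kernel's tracepoint format.

```python
from collections import defaultdict


def offcpu_by_thread(events):
    """Total off-CPU nanoseconds per thread id.

    events: time-ordered (timestamp_ns, prev_tid, prev_blocked, next_tid) tuples,
    one per context switch. prev_blocked is True when the outgoing thread went to
    sleep (I/O, lock, timer) rather than being preempted while still runnable.
    """
    went_off = {}
    totals = defaultdict(int)
    for ts, prev_tid, prev_blocked, next_tid in events:
        if prev_blocked:
            went_off[prev_tid] = ts       # outgoing thread starts waiting
        start = went_off.pop(next_tid, None)
        if start is not None:
            totals[next_tid] += ts - start  # incoming thread finished waiting
    return dict(totals)


if __name__ == "__main__":
    events = [
        (0,         1, True,  2),  # thread 1 blocks, thread 2 runs
        (5_000_000, 2, False, 1),  # thread 2 is preempted, thread 1 resumes
        (6_000_000, 1, True,  2),  # thread 1 blocks again
        (9_000_000, 2, True,  1),  # thread 2 blocks, thread 1 resumes
    ]
    for tid, ns in offcpu_by_thread(events).items():
        print(f"tid {tid}: {ns / 1_000_000:.1f} ms off-CPU")
```

Only blocked threads are timed, so a thread that is merely preempted contributes nothing; run-queue latency is a separate measurement. A real profiler would also record the stack at the moment the thread blocks, so the waiting time can be attributed to a code path.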

Profiling in CI/CD: Continuous Performance Testing

Continuous profiling is most powerful when integrated with CI/CD:

CI/CD Integration:

  Code merge to main
        │
        ▼
  Deploy to staging
        │
        ▼
  Run load test (5 minutes at production traffic rate)
        │
        ▼
  Collect profiles during load test
        │
        ▼
  Compare with baseline profile (last known-good deploy)
        │
        ▼
  If CPU regression > 5% in any function → fail build, show diff flamegraph
  If memory allocation increase > 20% → fail build
        │
        ▼
  Deploy to production (with confidence profile hasn't regressed)

This requires a baseline profile store (what was "normal" for the last 10 deployments?) and automated comparison tooling.
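The comparison step can be as small as a function over two per-function cost tables. The sketch below is one possible gate, with several assumptions: costs are CPU seconds per function during the load test, "regression > 5%" is read as relative growth, and min_cpu_seconds is a hypothetical noise floor that ignores functions too small to measure reliably.

```python
def find_regressions(baseline, candidate, rel_threshold=0.05, min_cpu_seconds=0.5):
    """Return (function, before, after, growth) tuples, largest absolute increase first.

    baseline / candidate: {function_name: CPU seconds during the load test}.
    A function missing from the baseline is reported with infinite growth.
    """
    regressions = []
    for func, after in candidate.items():
        before = baseline.get(func, 0.0)
        if after < min_cpu_seconds:
            continue  # below the noise floor
        if before == 0.0:
            regressions.append((func, before, after, float("inf")))
            continue
        growth = (after - before) / before
        if growth > rel_threshold:
            regressions.append((func, before, after, growth))
    return sorted(regressions, key=lambda r: r[2] - r[1], reverse=True)


if __name__ == "__main__":
    baseline = {"parse_json": 20.0, "execute_sql": 20.0, "encode_response": 10.0}
    candidate = {"parse_json": 20.5, "execute_sql": 24.0,
                 "encode_response": 10.0, "schema_validate": 6.0}
    found = find_regressions(baseline, candidate)
    for func, before, after, growth in found:
        print(f"{func}: {before:.1f}s -> {after:.1f}s")
    raise SystemExit(1 if found else 0)  # non-zero exit fails the build
```

Averaging the baseline over several known-good deployments, rather than using a single run, makes a gate like this less sensitive to load-test noise.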

Historical Context

CPU profiling tools have existed since the 1970s (UNIX prof, gprof in the 1980s). These were sampling profilers but were designed for development use, not production. Instrumenting profilers (which added function entry/exit hooks) were available but had 10-50% overhead — too expensive for production.

The eBPF revolution (2014-2018) made sub-1% overhead profiling possible in production. Brendan Gregg's systematic development of the off-CPU analysis methodology and the flame graph visualization (created in 2011 and released as open source) created the mental models needed to use profiling data effectively.

Google's GWP paper (2010) demonstrated fleet-wide profiling at scale. Apple introduced Instruments (time profiler, allocations) for macOS/iOS profiling. The modern wave of continuous profiling products (Parca 2021, Pyroscope 2021, Polar Signals 2021, Grafana Pyroscope 2023) made fleet-scale continuous profiling accessible to companies without Google's engineering resources.

Production Examples

# Using perf for CPU profiling of a specific PID
# (--call-graph fp already enables stack capture, so a separate -g is not needed)
perf record -F 99 -p $(pgrep -f checkout-service) --call-graph fp -- sleep 30
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu_profile.svg

# Profile system-wide (all processes) for 10 seconds
perf record -F 99 -a --call-graph dwarf -- sleep 10
perf report --stdio | head -50

# perf stat for hardware counter summary
perf stat -p $(pgrep -f myservice) sleep 10
# Output:
#  Performance counter stats for process id '12345':
#      5,234,567,890   cycles
#      2,891,234,567   instructions  #    0.55  insns per cycle
#        123,456,789   cache-misses  #    4.23% of all cache refs
#         12,345,678   branch-misses #    2.11% of all branches

# Java: async-profiler start/dump cycle
# (repeat the sleep and dump steps in a loop for continuous collection)
./profiler.sh start -e cpu -i 10ms $(pgrep -f java)
sleep 60
./profiler.sh dump -f /tmp/profile-$(date +%s).jfr $(pgrep -f java)
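The derived figures in the perf stat output above follow directly from the raw counters. A short sketch of the arithmetic, using the sample numbers shown there (the function names are illustrative):

```python
def ipc(instructions, cycles):
    """Instructions retired per CPU cycle."""
    return instructions / cycles


def total_from_miss_rate(misses, miss_rate_percent):
    """Back out the total event count from a miss count and its percentage."""
    return misses / (miss_rate_percent / 100)


if __name__ == "__main__":
    print(f"IPC: {ipc(2_891_234_567, 5_234_567_890):.2f}")
    print(f"cache references: {total_from_miss_rate(123_456_789, 4.23):.3e}")
    print(f"branches: {total_from_miss_rate(12_345_678, 2.11):.3e}")
```

An IPC well below 1 on a modern out-of-order core usually points to stalls (memory latency, branch mispredictions) rather than a shortage of raw compute, which is a cue to look at cache behavior rather than instruction counts.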

Debugging Notes

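Missing or broken stacks in flame graph: Most common cause is frame pointers not being emitted. The .comment section (readelf -p .comment /proc/$(pgrep app)/exe) shows only the compiler version; the build flags are recorded only if the binary was built with -frecord-gcc-switches, in which case check readelf -p .GCC.command.line for -fomit-frame-pointer. If compiled with -O2 without -fno-omit-frame-pointer, rebuild with frame pointers or use DWARF-based unwinding (perf record --call-graph dwarf).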

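Java safepoint bias in sampled profiles: If you are using a JVMTI-based sampler (VisualVM's sampler, for example) and your flame graphs show code at safepoints (loop back-edges, method returns) over-represented, switch to async-profiler, which uses AsyncGetCallTrace. Java Flight Recorder (JFR) samples outside safepoints as well, but by default the JVM records debug info only at safepoints, so run with -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints for accurate frame attribution.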

Parca agent not symbolizing Go binaries: Ensure the Go binary is compiled without stripping symbols (-ldflags '-s -w' strips symbols and breaks symbolization). Use nm /proc/PID/exe to check symbol availability.

Profile data gaps in Pyroscope: Default scrape interval is 10s. Gaps in the time series may indicate the Pyroscope agent was unable to access the target process (permission denied, process restart). Check agent logs: kubectl logs -n monitoring -l app=pyroscope-agent.

Security Implications

Continuous profiling data reveals detailed function-level code paths, potentially exposing cryptographic algorithm choices, authentication mechanisms, and data handling code structure. Treat profiles as sensitive technical documentation.
eBPF-based profiling agents require CAP_SYS_ADMIN, or CAP_BPF plus CAP_PERFMON on Linux 5.8 and later. A compromised profiling agent has kernel-level observation capability. Use separate service accounts with minimal permissions; run agents in a dedicated monitoring namespace with network policies.
Profile data may inadvertently contain sensitive information: function names in stack traces can reveal cryptographic operations, secret handling, or internal business logic. Evaluate before exposing profile data to third-party vendors.

Performance Implications

eBPF-based continuous profiling at 99Hz: ~0.5-1% CPU overhead per core. Memory: ~50-100MB for Parca agent on a node with 50 active processes.
async-profiler at 100Hz: ~1-2% CPU overhead on the profiled JVM. At 10Hz: <0.5%.
Profile storage: a Go service generating profiles at 10-second intervals, 30-day retention: ~5-10 MB/day per service (after pprof compression). Fleet-wide continuous profiling of 1000 services: ~5-10 GB/day.
Differential flame graph computation: O(n) where n is profile size. Usually <100ms for profiles of ordinary size.
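The storage figures above can be checked with a few lines of arithmetic. The sketch below takes the 10-second interval, 5-10 MB/day per service, 1000 services, and 30-day retention from the text; the per-profile size and retained total are derived values, not measurements.

```python
def storage_estimate(services, mb_per_service_per_day, retention_days, interval_s=10):
    """Derive per-profile size and fleet storage from a per-service daily volume."""
    profiles_per_day = 86_400 // interval_s
    bytes_per_profile = mb_per_service_per_day * 1_000_000 / profiles_per_day
    fleet_gb_per_day = services * mb_per_service_per_day / 1_000
    return {
        "profiles_per_day": profiles_per_day,
        "bytes_per_profile": bytes_per_profile,
        "fleet_gb_per_day": fleet_gb_per_day,
        "retained_gb": fleet_gb_per_day * retention_days,
    }


if __name__ == "__main__":
    for mb in (5, 10):
        est = storage_estimate(services=1000, mb_per_service_per_day=mb, retention_days=30)
        print(f"{mb} MB/day/service: {est['profiles_per_day']} profiles/day, "
              f"{est['bytes_per_profile']:.0f} bytes/profile, "
              f"{est['fleet_gb_per_day']:.0f} GB/day, {est['retained_gb']:.0f} GB retained")
```

At these rates the retained volume for the whole fleet is in the low hundreds of gigabytes, which is why storage is rarely the limiting cost of continuous profiling.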

Failure Modes and Real Incidents

Performance regression detected by continuous profiling (Shopify, 2022 blog): A deployment that added input validation to a high-throughput API endpoint caused a 3% CPU regression fleet-wide. The regression was invisible in latency SLOs (too small) and error rates (zero errors introduced). Continuous profiling showed the new validate_input() function consuming 3% of total service CPU. The root cause: a regex being recompiled on every request instead of being compiled once at startup.

Off-CPU profiling reveals lock contention: A Go service was handling 50K req/s but p99 latency was 150ms (much higher than the 20ms expected from CPU work). CPU profiling showed only 15ms of CPU work. Off-CPU profiling revealed 130ms of time spent waiting on a sync.Mutex in the session cache. The lock was unnecessarily global; sharding it reduced p99 by 87%.

Memory leak found via allocation profiling: A Node.js service was growing by 50MB/hour. Heap profiling (via --heap-prof) showed EventEmitter listener allocations accumulating in a request handler: on() was being called without corresponding off() calls, creating listener leaks.

Modern Usage

Grafana Pyroscope 1.x (2023): Grafana Labs acquired Pyroscope and merged it with its own profiling project, Grafana Phlare. Fully integrated in Grafana UI alongside traces (Tempo) and metrics (Prometheus). Supports eBPF, Go, Java (async-profiler), Python, Ruby, .NET.
OTel Profiling signal (2024): OpenTelemetry is standardizing the profiling data model (as a fourth signal alongside traces, metrics, logs). Early experimental versions of the OTel profiles protocol defined ProfileContainer as the top-level message; the data model is still evolving, with the goal of unified profiling data collection via OTLP.
Continuous profiling in GitHub Actions: tools like Bencher and CodSpeed run micro-benchmarks in CI and track them over time, providing profiling-level regression detection for library code.

Future Directions

Profiling-informed autoscaling: instead of CPU utilization-based HPA (horizontal pod autoscaler), use profiling data to predict load before resource exhaustion, and scale based on application-specific performance signatures.
Cross-service profile correlation: linking a distributed trace (showing service A is slow) to the CPU profile of service A's processes during the slow trace, enabling single-click root cause from trace to flame graph.
AI-driven flame graph analysis: LLM-based tools that explain flame graphs in natural language ("The malloc calls are hot because parse_json is creating temporary []string slices on every request — consider using a sync.Pool") to democratize profiling analysis beyond expert performance engineers.

Exercises

Flame graph interpretation: Generate a CPU flame graph for a web server (e.g., nginx or a simple Go HTTP server) under load using perf record -F 99 -a --call-graph fp -- sleep 30. Identify: the top-3 hottest leaf functions, any functions appearing unexpectedly wide (potential performance issues), and any functions that should not be in the hot path.
Differential flame graph: Make a deliberate performance regression: add an expensive hash computation to a hot path in a test application. (A time.Sleep(time.Millisecond) call would not work here: sleeping is off-CPU time and does not appear in a CPU profile.) Generate CPU profiles before and after the change. Use FlameGraph's difffolded.pl to generate a differential flame graph. Verify the regression appears as a red frame.
Off-CPU vs on-CPU: Profile a Go service that makes a database query for each request. Generate both a CPU flame graph (profile-bpfcc) and an off-CPU profile (offcputime-bpfcc). Calculate the ratio of on-CPU time to off-CPU time. Explain what the off-CPU profile shows that the CPU profile cannot.
Java async-profiler vs JFR comparison: Profile the same Java application (a Spring Boot app under load) with both async-profiler and Java Flight Recorder. Compare the top-5 functions in each profile. Where do they agree? Where do they disagree? Explain any discrepancies in terms of safepoint bias.
Continuous profiling pipeline: Deploy Grafana Pyroscope in Kubernetes. Instrument a service with the Pyroscope Go SDK. Introduce a performance regression at a known time (a goroutine that runs a tight loop). Use Pyroscope's time-series flame graph to identify exactly when the regression was introduced, which function regressed, and by how much (in CPU seconds/second).

References

Gregg, Brendan. BPF Performance Tools. Addison-Wesley, 2019. Chapters 6-7 (CPUs and Memory).
Gregg, Brendan. Systems Performance. 2nd ed. Addison-Wesley, 2020. Chapter 6 (CPUs).
Gregg, Brendan. "Flame Graphs." http://brendangregg.com/flamegraphs.html
Gregg, Brendan. "Off-CPU Analysis." http://brendangregg.com/offcpuanalysis.html
Ren, Gang et al. "Google-Wide Profiling: A Continuous Profiling Infrastructure for Data Centers." IEEE Micro, 2010.
async-profiler: https://github.com/async-profiler/async-profiler
Parca Documentation: https://www.parca.dev/docs/
Grafana Pyroscope: https://grafana.com/docs/pyroscope/
OTel Profiling Signal spec: https://opentelemetry.io/docs/specs/otel/profiles/
Mytkowicz, Todd et al. "Evaluating the Accuracy of Java Profilers." PLDI 2010 (safepoint bias paper).