Performance Engineering Learning Roadmap

A complete roadmap for becoming a performance engineer — from benchmarking basics through eBPF tracing, kernel bypass, and HFT-grade latency optimization. This is a practical discipline; every concept must be verified on a running system.


Overview

| Phase | Duration | Focus | Target Outcome |
| --- | --- | --- | --- |
| Prerequisites | 1–2 months | Profiling tools, CPU architecture, benchmarking methodology | Can profile a process and interpret results without guidance |
| Core Skills | 2–6 months | Systems Performance book, USE method, Linux tuning | Can diagnose any Linux performance problem systematically |
| Intermediate | 6–12 months | eBPF, memory profiling, lock contention, network perf | Can instrument production workloads safely |
| Advanced | 12–24 months | Kernel bypass, NUMA, compiler optimization, sub-100µs latency | Can achieve HFT-grade latency in a network server |

Phase 1: Prerequisites (Months 1–2)

CPU Architecture Fundamentals

Understanding the hardware is non-negotiable for performance work. Without this, profiling numbers are uninterpretable.

| Concept | Key Facts | Why It Matters |
| --- | --- | --- |
| Cache hierarchy | L1: ~4 cycles, L2: ~12 cycles, L3: ~40 cycles, DRAM: ~200 cycles | Cache misses dominate latency profiles |
| Cache line size | 64 bytes on x86 | Explains false sharing, prefetch patterns |
| Branch predictor | Modern CPUs speculate past predicted branches; a misprediction flushes the pipeline (~15 cycles) | Explains why unpredictable branches are slow |
| Out-of-order execution | CPU reorders instructions within a window (~200 ops) | Explains the memory model and why barriers are needed |
| SIMD/AVX | 256-bit (AVX2) or 512-bit (AVX-512) vectors per instruction | Auto-vectorization and manual SIMD |
| Hyper-threading | Two hardware threads per core; share L1/L2 | Causes contention in latency-sensitive apps |
| NUMA | Each socket has local memory; cross-socket access adds ~40–100 ns | Critical for multi-socket servers |
| TLB | Covers e.g. 512 pages (4 KB) or 32 huge pages (2 MB); entry counts vary by microarchitecture | TLB misses for large working sets |
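
The cycle counts and TLB figures in the table become easier to reason about once converted to time and memory reach. The sketch below does that arithmetic; the 3.0 GHz clock is an assumption chosen only for illustration, and the function names are hypothetical.

```python
# Back-of-envelope conversions for the figures in the table.
# ASSUMPTION: a 3.0 GHz core clock, chosen only for illustration.
CLOCK_GHZ = 3.0
KIB, MIB = 1024, 1024 * 1024


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """Convert a latency in core cycles to nanoseconds."""
    return cycles / clock_ghz


def tlb_reach_bytes(entries, page_bytes):
    """Memory covered by a TLB holding `entries` pages of `page_bytes` each."""
    return entries * page_bytes


if __name__ == "__main__":
    for level, cycles in [("L1", 4), ("L2", 12), ("L3", 40), ("DRAM", 200)]:
        print(f"{level:>4}: {cycles:>3} cycles = {cycles_to_ns(cycles):5.1f} ns")
    print("4 KB pages:", tlb_reach_bytes(512, 4 * KIB) // MIB, "MiB of reach")
    print("2 MB pages:", tlb_reach_bytes(32, 2 * MIB) // MIB, "MiB of reach")
```

With the table's example entry counts, 4 KB pages cover 2 MiB while 2 MB huge pages cover 64 MiB, which is why huge pages help large working sets.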

Resource: "What Every Programmer Should Know About Memory" — Ulrich Drepper, 2007 (free PDF, ~100 pages). Read Sections 1–5 thoroughly.

Benchmarking Methodology

Before measuring anything, establish the methodology. Measuring wrong is worse than not measuring at all.

The Seven Questions Before Any Benchmark:

  1. What are you measuring? (Latency? Throughput? CPU efficiency?)
  2. What is the unit? (ns, µs, ops/sec, MB/s)
  3. Are you measuring the right thing? (Is the profiler itself distorting results?)
  4. What is the workload distribution? (Uniform? Zipfian? Bursty?)
  5. What percentile matters? (Mean is almost always wrong; use p50/p99/p99.9)
  6. Is the system warmed up? (JIT, CPU frequency scaling, page cache)
  7. Are results statistically significant? (Run at least 30 iterations; compute confidence intervals)
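
Questions 5 and 7 can be made concrete with a few lines of code. This is a minimal sketch, assuming latency samples are already collected in a list; it uses nearest-rank percentiles and a normal-approximation confidence interval, which is only one of several reasonable choices. The function names are hypothetical.

```python
import math
import statistics


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already-sorted list (pct in 0..100)."""
    n = len(sorted_samples)
    if n == 0:
        raise ValueError("no samples")
    # round() guards against float noise such as 999.0000000000001.
    rank = max(1, math.ceil(round(pct * n / 100.0, 6)))
    return sorted_samples[min(rank, n) - 1]


def summarize(samples):
    """Return p50/p99/p99.9 plus a 95% confidence interval for the mean."""
    ordered = sorted(samples)
    mean = statistics.fmean(ordered)
    # Normal approximation (z = 1.96); reasonable only for roughly n >= 30.
    half_width = 1.96 * statistics.stdev(ordered) / math.sqrt(len(ordered))
    return {
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
        "p99.9": percentile(ordered, 99.9),
        "mean_ci95": (mean - half_width, mean + half_width),
    }
```

Calling `summarize(latencies_ns)` on each run and comparing the intervals between two builds gives a quick check on whether a difference is larger than run-to-run noise.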

Tools Introduction:

| Tool | Install | Best For |
| --- | --- | --- |
| perf stat | linux-tools package matching the running kernel (e.g. apt install linux-tools-generic on Ubuntu) | Hardware counter overview |
| perf record + perf report | Same | Sampling profiler |
| flamegraph.pl | https://github.com/brendangregg/FlameGraph | Visualize perf output |
| bpftrace | apt install bpftrace | Dynamic tracing, histograms |
| strace | apt install strace | Syscall tracing (high overhead; never in prod) |
| ltrace | apt install ltrace | Library call tracing |
| pmap | Part of procps | Process memory map |
| numastat | apt install numactl | NUMA allocation statistics |

Month 1 Lab Exercise:

Instrument a simple HTTP server (for example one built with cpp-httplib, or nginx) with perf while it is under load:

perf record -F 99 -a -g -- sleep 30     # sample all CPUs at 99 Hz
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg

Open the flame graph in a browser. Identify the top 3 functions consuming CPU. This single exercise teaches more than any tutorial.
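
The same "top 3 functions" answer can be cross-checked from the folded-stack text that `stackcollapse-perf.pl` emits before it is rendered. The sketch below assumes that output was saved to a file and ranks leaf frames by self samples; the function name is hypothetical.

```python
from collections import Counter


def top_leaf_functions(folded_lines, n=3):
    """Rank functions by self samples from folded-stack lines.

    Each line looks like 'frame_a;frame_b;leaf 42': semicolon-separated
    frames (root first), then a space and the sample count.
    """
    self_samples = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        leaf = stack.rsplit(";", 1)[-1]
        self_samples[leaf] += int(count)
    total = sum(self_samples.values())
    return [(fn, cnt, 100.0 * cnt / total)
            for fn, cnt in self_samples.most_common(n)]
```

For example, after `perf script | stackcollapse-perf.pl > out.folded`, passing `open("out.folded")` to `top_leaf_functions` returns `(function, samples, percent)` tuples. This counts only time spent in the leaf itself, which corresponds to the widest plateaus at the top of the flame graph.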

Month 2 Lab Exercise: CPU Microbenchmarks

Write a benchmark that demonstrates each of the following at measurable cost:

  1. Cache miss: sequential array access vs. random pointer chasing
  2. False sharing: two threads incrementing adjacent variables vs. padded variables
  3. Branch misprediction: sorted array sum vs. random array sum
  4. TLB miss: access 1 GB array with 64-byte stride vs. 4096-byte stride

Use perf stat -e cache-misses,branch-misses,instructions,cycles to confirm your measurements.
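
Raw counter values are hard to compare across runs of different length, so it helps to normalize them. The sketch below parses the machine-readable output of `perf stat -x,` and derives IPC and misses per thousand instructions. It assumes the count is the first comma-separated field and the event name the third, which is my reading of the perf-stat documentation; verify against your perf version. Function names are hypothetical.

```python
def parse_perf_stat_csv(text):
    """Map event name -> count from `perf stat -x,` output lines."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        fields = line.split(",")
        if line.startswith("#") or len(fields) < 3:
            continue
        value, event = fields[0], fields[2]
        if value.isdigit():  # skips '<not counted>' and '<not supported>'
            # Drop modifiers such as ':u' that perf appends to event names.
            counters[event.split(":")[0]] = int(value)
    return counters


def derived_metrics(c):
    """Instructions per cycle and misses per thousand instructions (MPKI)."""
    kilo_instr = c["instructions"] / 1000.0
    return {
        "ipc": c["instructions"] / c["cycles"],
        "cache_mpki": c["cache-misses"] / kilo_instr,
        "branch_mpki": c["branch-misses"] / kilo_instr,
    }
```

Note that `perf stat` writes its report to stderr, so capture it with `2> stats.csv` (or `-o stats.csv`) before feeding it to the parser. Comparing `branch_mpki` for the sorted and random array runs should make the misprediction effect visible as a single number.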

Success Criteria for Phase 1:

  - Can produce a flame graph for any process within 5 minutes
  - Can explain what a cache miss costs in cycles, not just conceptually
  - Have reproduced all four microbenchmark effects with measured numbers


Phase 2: Core Skills (Months 2–6)

Primary Text

Systems Performance: Enterprise and the Cloud (2nd ed.) by Brendan Gregg

  - Publisher: Addison-Wesley, 2020
  - ISBN: 978-0136820154
  - This is the bible. Read every chapter. The examples use Linux but the methodology is universal.

USE Method

The USE (Utilization, Saturation, Errors) method is the systematic framework for performance analysis. Apply it before reaching for any profiling tool.

USE Method Checklist for Linux Systems:

| Resource | Utilization Metric | Saturation Metric | Errors Metric |
| --- | --- | --- | --- |
| CPU | mpstat 1: %idle | vmstat 1: r (run queue) | perf stat: stalled cycles |
| Memory | free -m: used/total | vmstat 1: si/so (swap in/out) | dmesg: OOM killer |
| Network interface | sar -n DEV 1: %ifutil | netstat -s: drop counters | ip -s link: errors |
| Disk I/O | iostat -xz 1: %util | iostat: await (ms) | smartctl -a |
| File descriptors | cat /proc/sys/fs/file-nr | (none listed) | dmesg: "too many open files" |
| Kernel locks | (none listed) | perf lock record/report | lockdep output |
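
Tools such as `mpstat` derive CPU utilization from the counters in `/proc/stat`. The sketch below shows that calculation from two snapshots of the file, assuming the usual field order on the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal). Function names are hypothetical.

```python
def cpu_times(proc_stat_text):
    """Return (busy, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in proc_stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            # user nice system idle iowait irq softirq steal; the guest fields
            # that follow are already counted in user/nice, so skip them.
            values = [int(v) for v in parts[1:9]]
            idle = values[3] + values[4]  # idle + iowait
            total = sum(values)
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def utilization(before, after):
    """Fraction of CPU time spent busy between two /proc/stat snapshots."""
    busy0, total0 = cpu_times(before)
    busy1, total1 = cpu_times(after)
    elapsed = total1 - total0
    return (busy1 - busy0) / elapsed if elapsed else 0.0
```

On a Linux host, read `/proc/stat` once, sleep for a second, read it again, and pass both strings to `utilization`. Counting iowait as idle is a design choice here; it matches the idea that the CPU itself was not doing work.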

Month 2–3 Reading Plan (Systems Performance):

  - Chapters 1–2: Methodology, tools overview
  - Chapter 3: Operating systems concepts (review)
  - Chapter 4: Observability tools (vmstat, iostat, netstat, sar)
  - Chapter 5: Applications
  - Chapter 6: CPUs (CPU profiling, scheduling, affinity)

Month 4 Reading Plan:

  - Chapter 7: Memory (virtual memory, paging, allocators)
  - Chapter 8: File systems (I/O latency, VFS, buffer cache)
  - Chapter 9: Disks

Month 5–6 Reading Plan:

  - Chapter 10: Network
  - Chapter 11: Cloud computing performance
  - Chapter 12: Benchmarking
  - Chapters 13–15: perf, Ftrace, BPF (foundation for next phase)

Flame Graph Mastery

Types of Flame Graphs and When to Use Each:

| Flame Graph Type | Command | Best For |
| --- | --- | --- |
| CPU on-CPU | perf record -F 99 -ag | CPU-bound code: find hot functions |
| Off-CPU | offcputime-bpfcc | I/O-bound and lock-wait analysis |
| Memory allocation | perf record -e probe_libc:malloc -ag (after creating the probe with perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc) | Heap allocation hot paths |
| Differential | difffolded.pl | Before/after comparison of two profiles |
| Package flamegraph | Language-specific (async-profiler for JVM) | JVM/Python/Node profiling |

Month 3 Lab Exercise: Optimize a Real Service

Take any open-source service (Redis, nginx, PostgreSQL). Run a load test with wrk or pgbench. Generate a flame graph. Identify the single hottest non-kernel function. Read its source. Propose (and implement if possible) an optimization. Measure improvement.

Linux Performance Tuning Reference

CPU Tuning:

| Knob | Where | Effect |
| --- | --- | --- |
| CPU frequency governor | /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | Set to performance for latency-sensitive workloads |
| C-state disable | /sys/devices/system/cpu/cpu*/cpuidle/state*/disable | Prevents deep sleep; reduces wake latency |
| IRQ affinity | /proc/irq/*/smp_affinity | Pin IRQs away from application cores |
| NUMA balancing | /proc/sys/kernel/numa_balancing | Disable for latency-sensitive; enable for throughput |
| Transparent huge pages | /sys/kernel/mm/transparent_hugepage/enabled | always for throughput; madvise for mixed |

Memory Tuning:

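| Knob | Default | Recommended (latency) | Effect |
| --- | --- | --- | --- |
| vm.swappiness | 60 | 1–10 | Reduce swap activity |
| vm.dirty_ratio | 20 | 5–10 | Flush dirty pages earlier |
| vm.dirty_background_ratio | 10 | 3–5 | Background flush threshold |
| vm.overcommit_memory | 0 | 1 | Allocations never fail up front, which avoids overcommit-heuristic surprises; it does not prevent the OOM killer, so size memory carefully |
| vm.min_free_kbytes | auto | 2× default | Keep emergency reserves |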
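
A quick way to audit a host against the table is to read the live values from `/proc/sys` and flag anything outside the recommended range. This is a minimal sketch covering only the three knobs with numeric ranges; the function names are hypothetical and the ranges are copied from the table, not independently validated.

```python
from pathlib import Path

# Latency-oriented ranges copied from the table (inclusive bounds).
RECOMMENDED = {
    "vm.swappiness": (1, 10),
    "vm.dirty_ratio": (5, 10),
    "vm.dirty_background_ratio": (3, 5),
}


def sysctl_path(name):
    """Map a sysctl name such as 'vm.swappiness' to its /proc/sys path."""
    return Path("/proc/sys") / name.replace(".", "/")


def out_of_range(current):
    """Return {name: (value, (lo, hi))} for values outside the table's range."""
    return {
        name: (current[name], bounds)
        for name, bounds in RECOMMENDED.items()
        if name in current and not bounds[0] <= current[name] <= bounds[1]
    }


def read_current():
    """Read live values on a Linux host, skipping entries that do not exist."""
    values = {}
    for name in RECOMMENDED:
        path = sysctl_path(name)
        if path.exists():
            values[name] = int(path.read_text().split()[0])
    return values
```

Running `out_of_range(read_current())` on a stock distribution will typically report `vm.swappiness` at its default of 60. Treat the output as a prompt to measure, not as a verdict: the right values depend on the workload.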

Success Criteria for Phase 2:

  - Can apply USE method to diagnose a contrived performance problem (cache thrash, lock contention, I/O saturation) in under 15 minutes
  - Have read Chapters 1–15 of Systems Performance
  - Can produce and interpret all five types of flame graphs


Phase 3: Intermediate (Months 6–12)

eBPF Tracing

eBPF has become the standard tool for production-safe, zero-modification tracing.

Primary Text: "BPF Performance Tools" — Brendan Gregg (ISBN: 978-0136554820, Addison-Wesley, 2019)

Key eBPF Concepts:

| Concept | Description |
| --- | --- |
| eBPF programs | Small C programs compiled to eBPF bytecode, verified before load |
| Probe types | kprobe (kernel function entry), kretprobe (return), tracepoint (static), uprobe (userspace) |
| Maps | Shared data structures between eBPF programs and userspace |
| BCC | Python frontend for eBPF programs |
| bpftrace | High-level awk-like language for one-liners |
| libbpf | C library for production eBPF programs (CO-RE: portable across kernel versions) |

bpftrace One-Liners Reference:

# Trace all execve() calls with the path being executed
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> %s\n", comm, str(args->filename)); }'

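# Syscall latency histogram for openat() (the syscall behind open() on modern glibc)
# The /@start[tid]/ filter skips exits whose entry was not seen
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @start[tid] = nsecs; }
             tracepoint:syscalls:sys_exit_openat /@start[tid]/ { @[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'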

# Off-CPU analysis: which kernel stacks lead to a context switch?
# (sampling cpu-clock would show on-CPU stacks instead)
bpftrace -e 'tracepoint:sched:sched_switch { @[kstack] = count(); }'

# Block I/O latency histogram (milliseconds)
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
             tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
               @io_ms = hist((nsecs - @start[args->dev, args->sector]) / 1000000);
               delete(@start[args->dev, args->sector]); }'

# TCP retransmit rate
bpftrace -e 'kprobe:tcp_retransmit_skb { @[comm] = count(); } interval:s:1 { print(@); clear(@); }'

# Memory allocation flame graph data
bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc { @[ustack, comm] = sum(arg0); }'
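
Several of the one-liners above aggregate with `hist()`, which groups values into power-of-two buckets. The sketch below illustrates that bucketing idea so the output is easier to read; it is not bpftrace's implementation and does not reproduce its exact output format. Function names are hypothetical.

```python
from collections import Counter


def log2_bucket(value):
    """Return the [low, high) power-of-two bucket containing a value."""
    if value < 1:
        return (0, 1)
    low = 1 << (int(value).bit_length() - 1)
    return (low, low * 2)


def log2_hist(values):
    """Count values per power-of-two bucket, sorted by bucket start."""
    counts = Counter(log2_bucket(v) for v in values)
    return sorted(counts.items())
```

Because each bucket is twice as wide as the previous one, a latency reported in the 512 to 1024 bucket could be anywhere in that range. When finer resolution matters, bpftrace's `lhist()` with explicit linear bounds is the usual alternative.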

Month 7–8 Lab Exercise: Instrument a Production Service

  1. Pick a service under load (nginx serving a static site at 50K RPS is fine)
  2. Identify top 5 syscalls by count: bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[args->id] = count(); }'
  3. Measure P99 latency of the most frequent syscall
  4. Find any syscall taking >1ms and identify which code path triggers it
  5. Write a bpftrace script that produces a per-second histogram of request latency

Memory Profiling

Tools:

| Tool | What It Finds | Overhead |
| --- | --- | --- |
| Valgrind massif | Heap allocation profile over time | 10–20x slowdown |
| heaptrack | Heap allocation with call stacks | 3–5x slowdown |
| perf mem | Memory access patterns, cache behavior | ~5% sampling |
| numastat -p <pid> | Per-process NUMA allocation | Negligible |
| /proc/<pid>/smaps_rollup | Virtual memory breakdown | None |
| vmtouch | File cache status | None |

Month 9 Lab Exercise: Memory Profile a JVM Application

  1. Use async-profiler (Java) with -e alloc to capture an allocation flame graph
  2. Identify top allocation sites
  3. Enable GC logging (-Xlog:gc* on JDK 9+; -XX:+PrintGCDetails -XX:+PrintGCDateStamps on JDK 8) to correlate GC pauses with allocation rate
  4. Reduce allocation rate by 30% through object pooling or by restructuring code so escape analysis can eliminate allocations
  5. Verify with JMH benchmark before and after

Lock Contention Analysis

Month 10 Lab Exercise:

# Find hot lock contention with perf lock
perf lock record -a -- sleep 10
perf lock report

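# With bpftrace: time spent in the kernel mutex slow path, per kernel stack
# (__mutex_lock_slowpath is the contended path; check /proc/kallsyms if the probe does not attach)
bpftrace -e '
kprobe:__mutex_lock_slowpath { @start[tid] = nsecs; }
kretprobe:__mutex_lock_slowpath /@start[tid]/ { @[kstack] = hist(nsecs - @start[tid]); delete(@start[tid]); }
'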

Common Lock Contention Patterns:

| Pattern | Symptom | Fix |
| --- | --- | --- |
| Global lock serialization | One core at 100%, others idle | Per-CPU data structures, sharding |
| Read-write lock imbalance | Readers starved by writers | RCU for read-heavy workloads |
| Lock-free queue head | High CAS retry rate visible in perf stat | Exponential backoff, elimination array |
| Convoy effect | Lock holder descheduled; all waiters block | Keep critical sections short; avoid blocking or being preempted while holding the lock |
| False sharing | Different threads modify distinct variables that share one cache line | Align and pad per-thread data to 64-byte cache lines |

Network Performance

Month 11–12 Lab Exercise: Scale a Network Server from 100K to 1M RPS

Baseline Setup:

# Run wrk for HTTP load testing: 12 threads, 400 connections, 30 seconds
wrk -t 12 -c 400 -d 30s http://localhost:8080/

# Baseline: 100K RPS on a single-core server

Optimization Steps (apply in order, measure each step):

| Step | Change | Expected Gain |
| --- | --- | --- |
| 1 | Increase socket backlog: net.core.somaxconn=65535 | Reduces connection errors |
| 2 | Enable TCP_NODELAY (disables Nagle) on the server | Reduces latency for small messages |
| 3 | Tune buffer sizes: net.core.rmem_max=134217728 | Reduces buffer exhaustion |
| 4 | Enable TCP BBR: net.ipv4.tcp_congestion_control=bbr | Better throughput under loss |
| 5 | Use SO_REUSEPORT (multiple listeners) | Linear scaling with threads |
| 6 | Move to io_uring for async I/O | Reduces syscall overhead |
| 7 | CPU affinity: pin workers to cores | Reduces context switch overhead |
| 8 | Set TCP_NODELAY on the client as well (step 2 covers the server) | Removes the ~40 ms Nagle/delayed-ACK stall on small writes |

Expected: After all optimizations, reach 800K–1.2M RPS (hardware-dependent).
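
The socket-level steps (TCP_NODELAY and SO_REUSEPORT) come down to a few `setsockopt` calls. The sketch below shows them on a listening socket using Python's standard `socket` module; `make_listener` is a hypothetical helper, and a real server would do the equivalent in its own language.

```python
import socket


def make_listener(port=0, backlog=1024):
    """Create a loopback TCP listener with SO_REUSEPORT and TCP_NODELAY set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):  # available on Linux 3.9+, not every OS
        # Lets several worker processes bind the same port; the kernel
        # spreads incoming connections across them.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    # Disable Nagle's algorithm. Setting it again on each accepted socket
    # is the explicit, portable way to be sure it applies.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(backlog)  # effective backlog is capped by net.core.somaxconn
    return sock
```

With `port=0` the kernel picks a free port, which is convenient for experiments; `sock.getsockname()[1]` returns the port that was chosen. Each worker process would call `make_listener` with the same fixed port to get its own accept queue.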


Phase 4: Advanced (Months 12–24)

Kernel Bypass — DPDK

What it is: Bypass the kernel network stack entirely. NIC driver runs in userspace. No syscalls. No context switches. No kernel scheduler involvement.

When to use: When you need < 5µs end-to-end latency or > 10 Mpps throughput.
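
The 10 Mpps threshold translates directly into a per-packet time budget, which shows why syscalls and interrupts become unaffordable. The sketch below does the arithmetic; the 3.0 GHz clock in the usage note is an assumption for illustration, and the function names are hypothetical.

```python
ETHERNET_OVERHEAD_BYTES = 20  # 7 preamble + 1 start delimiter + 12 inter-frame gap


def line_rate_pps(link_bps, frame_bytes):
    """Maximum packets per second for a given frame size at line rate."""
    return link_bps / ((frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8)


def per_packet_budget(pps, clock_ghz):
    """Return (nanoseconds, cycles) available per packet on one core."""
    ns = 1e9 / pps
    return ns, ns * clock_ghz
```

At 10 Mpps on a single core, `per_packet_budget(10e6, 3.0)` gives 100 ns, or about 300 cycles at the assumed clock. That is roughly the cost of one or two DRAM accesses, and 10 GbE with minimum-size 64-byte frames peaks at about 14.88 Mpps, which leaves even less.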

DPDK Setup Lab:

# Bind NIC to the vfio-pci driver (detaches it from the kernel network stack)
modprobe vfio-pci
dpdk-devbind.py --bind=vfio-pci 0000:01:00.0

# Run l3fwd sample app
./dpdk-l3fwd -l 0-3 -n 4 -- -p 0x1 --config="(0,0,0),(0,1,1)"

Key DPDK Concepts:

| Concept | Description |
| --- | --- |
| Huge pages | Required: 2 MB or 1 GB pages for DMA buffers |
| IOVA mode | Physical or virtual address mode for DMA |
| mempool | Pre-allocated packet buffer pools |
| rte_ring | Lock-free SPSC/MPSC/MPMC ring buffers |
| Poll mode driver (PMD) | Busy-polls NIC instead of using interrupts |
| RSS (Receive Side Scaling) | Hash-based NIC flow distribution across queues |

io_uring

Why io_uring matters: Reduces syscall count by submitting batches of I/O through shared memory rings. Can achieve zero-syscall steady-state operation.

io_uring Architecture:

Userspace                    Kernel
  SQ Ring     ─── submit ──→ io_uring_enter()
  (Submission)               (or auto via SQPOLL)

  CQ Ring     ←── complete── I/O completion
  (Completion)

Month 13 Lab Exercise:

Rewrite a file copy utility three ways:

  1. read()/write() in a loop
  2. preadv()/pwritev() with large buffers
  3. io_uring with fixed buffers and registered files

Measure: throughput (MB/s), syscall count (perf stat), CPU utilization.

Expected results: io_uring version should show ~5–10x fewer syscalls and 20–30% CPU reduction at high throughput.
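
As a reference point for the first variant, here is a minimal sketch of the read()/write() loop that also counts how many read and write calls it issues. It uses Python's thin `os.read`/`os.write` wrappers purely to show the structure; the lab itself would normally be written in C, and `copy_read_write` is a hypothetical name.

```python
import os


def copy_read_write(src, dst, bufsize=64 * 1024):
    """Copy src to dst with a read()/write() loop.

    Returns (bytes_copied, calls), where calls counts read and write
    invocations only (not open or close).
    """
    copied = calls = 0
    fd_in = os.open(src, os.O_RDONLY)
    try:
        fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                chunk = os.read(fd_in, bufsize)
                calls += 1
                if not chunk:  # empty read means end of file
                    break
                view = memoryview(chunk)
                while view:  # loop to handle short writes
                    written = os.write(fd_out, view)
                    calls += 1
                    view = view[written:]
                copied += len(chunk)
        finally:
            os.close(fd_out)
    finally:
        os.close(fd_in)
    return copied, calls
```

The call count scales as roughly twice the file size divided by the buffer size, so doubling `bufsize` halves it. That relationship is the baseline the io_uring variant is meant to beat by batching submissions.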

NUMA Optimization

Month 14 Lab Exercise:

# Observe NUMA allocation
numastat -p <pid>        # per-node allocation
numactl --hardware       # topology

# Bind process to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./myserver

/* libnuma in the application (C; link with -lnuma) */
#include <numa.h>
void *buf = numa_alloc_onnode(size, 0);  // allocate on node 0

NUMA Performance Rules:

| Rule | Reason |
| --- | --- |
| Allocate memory on the same node as the thread | Cross-node access adds 40–100 ns |
| Use mbind() with MPOL_BIND for critical buffers | Keeps allocations for those buffers on the chosen node |
| Avoid NUMA balancing for latency-sensitive apps | Automatic migration causes pauses |
| Per-NUMA-node data structures | Eliminate cross-node cache invalidation |
| Disable hyper-threading on latency-critical cores | Removes contention with the sibling thread for L1/L2 and execution units |

Compiler Optimization

Month 15 Lab Exercise: Profile-Guided Optimization (PGO)

# Step 1: Compile with instrumentation
gcc -O2 -fprofile-generate -o server_pgo server.c

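# Step 2: Run representative workload
# The .gcda profile is written only on a normal exit, so the server must
# handle SIGTERM and return from main() or call exit().
./server_pgo &
wrk -t 8 -c 200 -d 60s http://localhost:8080/
kill -TERM %1

# Step 3: Compile with profile data
gcc -O2 -fprofile-use -fprofile-correction -o server_opt server.c

# Expected: 10–20% throughput improvement, mainly from better code layout, inlining, and branch probabilities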

Key Compiler Flags for Performance:

| Flag | Effect | When to Use |
| --- | --- | --- |
| -O3 | Aggressive optimization including more auto-vectorization | Throughput-critical code |
| -march=native | Target-specific instructions (AVX2, etc.) | Only when the binary runs on the same CPU family it was built on |
| -flto | Link-time optimization (cross-file inlining) | Always for production builds |
| -fprofile-generate/-use | PGO | After representative workload profiling |
| -funroll-loops | Unroll small loops | Tight inner loops (measure first) |
| __builtin_expect | Branch prediction hint | When a branch is strongly biased one way |
| __attribute__((hot)) | Mark hot functions for layout | Functions in critical path |

HFT-Grade Latency — Sub-100µs Target

Latency Budget Analysis:

| Operation | Typical Cost | Optimization |
| --- | --- | --- |
| L1 cache hit | ~1–2 ns | Structure data for locality |
| L3 cache hit | ~10–20 ns | Prefetch with __builtin_prefetch |
| DRAM access | 60–100 ns | Minimize working set |
| Context switch | 1–10 µs | Dedicated CPU core, no sharing |
| Kernel network stack | 5–50 µs | DPDK bypass |
| IRQ latency (with C-states) | 10–200 µs | Disable C-states |
| TCP round trip (local) | 50–200 µs | UDP or shared memory |
| Mutex acquire (uncontended) | 20–100 ns | Lock-free or RCU |
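
A latency budget is only useful once the per-operation costs are multiplied by how often each occurs on the request path. The sketch below sums worst-case figures from the table for a request path that is entirely hypothetical, invented to show the arithmetic rather than to describe any real server.

```python
# Upper-bound costs in nanoseconds, taken from the table.
COST_NS = {
    "dram_access": 100,
    "mutex_uncontended": 100,
    "context_switch": 10_000,
    "kernel_network_stack": 50_000,
}


def budget_ns(path):
    """Sum worst-case cost for a {operation: count} description of a request."""
    return sum(COST_NS[op] * count for op, count in path.items())


# HYPOTHETICAL request path, invented purely for illustration.
request = {
    "kernel_network_stack": 1,
    "context_switch": 2,
    "dram_access": 20,
    "mutex_uncontended": 4,
}

if __name__ == "__main__":
    total = budget_ns(request)
    print(f"{total / 1000:.1f} us of a 100 us target")  # prints 72.4 us of a 100 us target
```

In this made-up example the kernel network stack and the two context switches account for 70 of the 72.4 µs, which is the usual argument for attacking those first and leaving cache-level tuning for later.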

Kernel Configuration for Minimum Latency:

CONFIG_PREEMPT=y          # Full kernel preemption
CONFIG_HZ=1000            # 1ms timer tick
CONFIG_NO_HZ_FULL=y       # Tickless for isolated CPUs
CONFIG_CPUSETS=y          # CPU isolation

Boot parameters:

isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 nosmt
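
Isolating CPUs at boot only helps if the latency-critical process is then pinned onto them. The sketch below parses a CPU list in the same `2-7` style as the boot parameters and pins the calling process with `os.sched_setaffinity` (Linux only). It handles plain numbers and ranges, not the kernel's stride syntax; function names are hypothetical.

```python
import os


def parse_cpulist(spec):
    """Expand a CPU list such as '2-7' or '0,2-3,8' into a sorted list."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_to(cpus):
    """Pin the calling process to `cpus` and return the resulting affinity set."""
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

For example, `pin_to(parse_cpulist("2-7"))` restricts the process to the isolated cores from the boot line above. On most kernels the currently isolated set can also be read from `/sys/devices/system/cpu/isolated` and passed through the same parser.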

Performance Bottleneck Cookbook

Systematic diagnosis guide for the five most common production bottlenecks:

| Symptom | First Check | Second Check | Likely Fix |
| --- | --- | --- | --- |
| High CPU, low throughput | perf top: which functions? | Flame graph: hot path | Algorithmic optimization or SIMD |
| Low CPU, high latency | bpftrace off-CPU analysis | strace (short runs only; high overhead): which syscall blocks? | Reduce lock contention or I/O |
| Memory grows unbounded | valgrind massif or heaptrack | /proc/<pid>/smaps | Fix memory leak or tune allocator |
| Network throughput plateau | sar -n DEV 1: %ifutil | ethtool -S: ring drops | Increase ring buffers, RSS, or DPDK |
| Disk I/O saturation | iostat -xz 1: %util near 100 | blktrace: queue depth | Async I/O, larger block size, NVMe |

Tools Reference Card

| Category | Tool | Key Command | Output |
| --- | --- | --- | --- |
| CPU profiling | perf record | perf record -F 99 -ag -- cmd | Sampling profile |
| CPU profiling | perf top | perf top -g | Live function view |
| CPU profiling | flamegraph | Script pipeline | SVG flame graph |
| Tracing | bpftrace | bpftrace -e 'kprobe:...' | Aggregated stats |
| Tracing | ftrace | echo function > current_tracer | Function call log |
| Memory | valgrind | valgrind --tool=massif | Heap timeline |
| Memory | perf mem | perf mem record/report | Memory access stats |
| Network | iperf3 | iperf3 -c host -P 4 | Throughput |
| Network | ss | ss -ntp | Socket states |
| Disk | iostat | iostat -xz 1 | I/O utilization |
| Disk | blktrace | blktrace /dev/sda | Block-level trace |
| Benchmarking | wrk | wrk -t 8 -c 200 -d 30s url | HTTP throughput |
| Benchmarking | fio | fio --name=test --rw=randread | Storage benchmark |

Success Criteria Summary

| Phase | Key Checkpoints |
| --- | --- |
| Phase 1 | Generated flame graph for a real process; reproduced all four microbenchmark effects with measured numbers |
| Phase 2 | Applied USE method to diagnose a contrived problem in <15 min; read Systems Performance cover to cover |
| Phase 3 | Instrumented a production service with bpftrace; scaled a server from 100K to 1M RPS |
| Phase 4 | Built a DPDK-based packet forwarder; measured sub-10µs P99 latency on an isolated core; achieved PGO improvement |

The core insight of performance engineering: measure first, hypothesize second, optimize third, measure again. Intuition is almost always wrong about where the bottleneck actually lives.