Performance Engineering Learning Roadmap
A complete roadmap for becoming a performance engineer — from benchmarking basics through eBPF tracing, kernel bypass, and HFT-grade latency optimization. This is a practical discipline; every concept must be verified on a running system.
Overview
| Phase | Duration | Focus | Target Outcome |
|---|---|---|---|
| Prerequisites | 1–2 months | Profiling tools, CPU architecture, benchmarking methodology | Can profile a process and interpret results without guidance |
| Core Skills | 2–6 months | Systems Performance book, USE method, Linux tuning | Can diagnose any Linux performance problem systematically |
| Intermediate | 6–12 months | eBPF, memory profiling, lock contention, network perf | Can instrument production workloads safely |
| Advanced | 12–24 months | Kernel bypass, NUMA, compiler optimization, sub-100µs latency | Can achieve HFT-grade latency in a network server |
Phase 1: Prerequisites (Months 1–2)
CPU Architecture Fundamentals
Understanding the hardware is non-negotiable for performance work. Without this, profiling numbers are uninterpretable.
| Concept | Key Facts | Why It Matters |
|---|---|---|
| Cache hierarchy | L1: ~4 cycles, L2: ~12 cycles, L3: ~40 cycles, DRAM: ~200 cycles | Cache misses dominate latency profiles |
| Cache line size | 64 bytes on x86 | Explains false sharing, prefetch patterns |
| Branch predictor | Predicts branch outcomes so the pipeline can keep fetching speculatively; misprediction: ~15–20 cycles | Explains why unpredictable branches are slow |
| Out-of-order execution | CPU reorders instructions within a window (~200 ops) | Explains memory model, barriers needed |
| SIMD/AVX | 256-bit (AVX2) or 512-bit (AVX-512) per cycle | Auto-vectorization and manual SIMD |
| Hyper-threading | Two hardware threads per core; share L1/L2 | Causes contention in latency-sensitive apps |
| NUMA | Each socket has local memory; cross-socket adds ~40–80ns | Critical for multi-socket servers |
| TLB | Limited entries, e.g. 512 pages (4 KB) or 32 huge pages (2 MB); exact counts vary by microarchitecture | Large working sets cause TLB misses; huge pages extend reach |
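The cycle counts and TLB entry counts in the table become easier to reason about once they are converted to nanoseconds and bytes. The sketch below does that arithmetic. The helper names are made up for this example, the 3.0 GHz clock is an assumption, and the entry counts are simply the table's figures, which differ between CPU models.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory covered by a fully populated TLB level."""
    return entries * page_size_bytes


def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core cycles to nanoseconds at a given clock."""
    return cycles / clock_ghz


if __name__ == "__main__":
    # Entry counts come from the table above; real values vary by microarchitecture.
    print(tlb_reach_bytes(512, 4 * KIB) // MIB, "MiB reach with 4 KiB pages")
    print(tlb_reach_bytes(32, 2 * MIB) // MIB, "MiB reach with 2 MiB pages")
    # 3.0 GHz is an assumed clock, chosen only for illustration.
    for level, cycles in [("L1", 4), ("L2", 12), ("L3", 40), ("DRAM", 200)]:
        print(f"{level}: {cycles_to_ns(cycles, 3.0):.1f} ns")
```

With 4 KB pages the reach is only a few megabytes, which is why a working set of gigabytes spends measurable time in page walks unless huge pages are used.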
Resource: "What Every Programmer Should Know About Memory" — Ulrich Drepper, 2007 (free PDF, ~100 pages). Read Sections 1–5 thoroughly.
Benchmarking Methodology
Before measuring anything, establish the methodology. Measuring wrong is worse than not measuring at all.
The Seven Questions Before Any Benchmark:
1. What are you measuring? (Latency? Throughput? CPU efficiency?)
2. What is the unit? (ns, µs, ops/sec, MB/s)
3. Are you measuring the right thing? (Is the profiler itself distorting results?)
4. What is the workload distribution? (Uniform? Zipfian? Bursty?)
5. What percentile matters? (Mean is almost always wrong; use p50/p99/p99.9)
6. Is the system warmed up? (JIT, CPU frequency scaling, page cache)
7. Are results statistically significant? (Run at least 30 iterations; compute confidence intervals)
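Questions 5 and 7 are easier to apply with the arithmetic in front of you. The sketch below computes nearest-rank percentiles and an approximate 95% confidence interval for the mean. The function names and sample data are invented for this example, and the normal approximation (the 1.96 factor) is an assumption that is only reasonable for roughly 30 or more samples.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def mean_ci95(samples):
    """Mean and approximate 95% confidence half-width (normal approximation)."""
    mean = statistics.fmean(samples)
    half_width = 1.96 * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical latencies in microseconds: mostly fast, with a slow tail.
    latencies_us = [100] * 97 + [5000] * 3
    mean, half = mean_ci95(latencies_us)
    print(f"mean = {mean:.1f} +/- {half:.1f} us")
    for p in (50, 99, 99.9):
        print(f"p{p} = {percentile(latencies_us, p)} us")
```

On data like this the mean sits between the two clusters and describes no real request, which is the reason to report percentiles instead.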
Tools Introduction:
| Tool | Install | Best For |
|---|---|---|
| perf stat | Distro package built from the kernel source tree (e.g. linux-tools on Ubuntu) | Hardware counter overview |
| perf record + perf report | Same | Sampling profiler |
| flamegraph.pl | https://github.com/brendangregg/FlameGraph | Visualize perf output |
| bpftrace | apt install bpftrace | Dynamic tracing, histograms |
| strace | apt install strace | Syscall tracing (high overhead; avoid in production) |
| ltrace | apt install ltrace | Library call tracing |
| pmap | Part of procps | Process memory map |
| numastat | apt install numactl | NUMA allocation statistics |
Month 1 Lab Exercise:
Profile a simple HTTP server (cpp-httplib in C++, or nginx) with perf while it is under load:
perf record -F 99 -a -g -- sleep 30 # sample all CPUs at 99 Hz
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg
Open the flame graph in a browser. Identify the top 3 functions consuming CPU. This single exercise teaches more than any tutorial.
Month 2 Lab Exercise: CPU Microbenchmarks
Write a benchmark that demonstrates each of the following at measurable cost:
1. Cache miss: sequential array access vs. random pointer chasing
2. False sharing: two threads incrementing adjacent variables vs. padded variables
3. Branch misprediction: sorted array sum vs. random array sum
4. TLB miss: access 1 GB array with 64-byte stride vs. 4096-byte stride
Use perf stat -e cache-misses,branch-misses,instructions,cycles to confirm your measurements.
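Raw counter values are hard to compare across runs, so it helps to normalize them. The sketch below assumes the counters were collected with `perf stat -x,` (CSV output, where the first field is the count and the third is the event name) and derives IPC and misses per thousand instructions. The sample text is hypothetical data, not a real measurement, and the function names are made up for this example.

```python
import csv
import io


def parse_perf_stat_csv(text):
    """Map event name -> count from `perf stat -x,` output; skips unparseable rows."""
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or row[0].startswith("#"):
            continue
        try:
            counts[row[2]] = int(row[0])
        except ValueError:
            continue  # e.g. "<not supported>" or "<not counted>"
    return counts


def derived_metrics(counts):
    """Normalize raw counters so that different runs can be compared."""
    per_kilo_instr = 1000 / counts["instructions"]
    return {
        "ipc": counts["instructions"] / counts["cycles"],
        "cache_misses_per_ki": counts["cache-misses"] * per_kilo_instr,
        "branch_misses_per_ki": counts["branch-misses"] * per_kilo_instr,
    }


# Hypothetical sample output, for illustration only.
SAMPLE = """\
1200000,,cache-misses,1000000000,100.00,,
4000000,,branch-misses,1000000000,100.00,,
2000000000,,instructions,1000000000,100.00,,
1000000000,,cycles,1000000000,100.00,,
"""

if __name__ == "__main__":
    for name, value in derived_metrics(parse_perf_stat_csv(SAMPLE)).items():
        print(f"{name}: {value:.2f}")
```

For the sorted vs. random array benchmark, the branch misses per thousand instructions should differ sharply between the two runs while the instruction count stays nearly the same.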
Success Criteria for Phase 1:
- Can produce a flame graph for any process within 5 minutes
- Can explain what a cache miss costs in cycles, not just conceptually
- Have reproduced all four microbenchmark effects with measured numbers
Phase 2: Core Skills (Months 2–6)
Primary Text
Systems Performance: Enterprise and the Cloud (2nd ed.), by Brendan Gregg
- Publisher: Addison-Wesley, 2020
- ISBN: 978-0136820154
- This is the bible. Read every chapter. The examples use Linux but the methodology is universal.
USE Method
The USE (Utilization, Saturation, Errors) method is the systematic framework for performance analysis. Apply it before reaching for any profiling tool.
USE Method Checklist for Linux Systems:
| Resource | Utilization Metric | Saturation Metric | Errors Metric |
|---|---|---|---|
| CPU | mpstat 1: 100 minus %idle | vmstat 1: r (run queue) above the CPU count | dmesg or mcelog: machine check exceptions |
| Memory | free -m: used/total | vmstat 1: si/so (swap in/out) | dmesg: OOM killer messages |
| Network interface | sar -n DEV 1: %ifutil | netstat -s: drop counters | ip -s link: errors |
| Disk I/O | iostat -xz 1: %util | iostat -xz 1: await (ms) | smartctl -a |
| File descriptors | cat /proc/sys/fs/file-nr | n/a | Application errors with EMFILE ("Too many open files"); dmesg: "VFS: file-max limit reached" |
| Kernel locks | perf lock record/report | n/a | lockdep output |
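Most cells in the checklist reduce to "read a counter, divide by a limit". As one worked case, the sketch below computes file-descriptor utilization from `/proc/sys/fs/file-nr`, whose three fields are allocated handles, unused handles, and the system-wide maximum. The function name is made up for this example, and the sample string in the fallback is hypothetical.

```python
from pathlib import Path


def fd_utilization(file_nr_text: str) -> float:
    """Fraction of the system-wide file handle limit currently allocated."""
    allocated, _unused, maximum = (int(field) for field in file_nr_text.split())
    return allocated / maximum


if __name__ == "__main__":
    proc_file = Path("/proc/sys/fs/file-nr")
    # Fall back to a hypothetical sample when not running on Linux.
    text = proc_file.read_text() if proc_file.exists() else "2048\t0\t8192\n"
    print(f"file handle utilization: {fd_utilization(text):.4%}")
```

This covers only the system-wide limit; per-process limits (ulimit -n) are checked separately.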
Month 2–3 Reading Plan (Systems Performance):
- Chapters 1–2: Methodology, tools overview
- Chapter 3: Operating systems concepts (review)
- Chapter 4: Observability tools — vmstat, iostat, netstat, sar
- Chapter 5: Applications
- Chapter 6: CPUs — CPU profiling, scheduling, affinity
Month 4 Reading Plan:
- Chapter 7: Memory (virtual memory, paging, allocators)
- Chapter 8: File systems (I/O latency, VFS, buffer cache)
- Chapter 9: Disks
Month 5–6 Reading Plan:
- Chapter 10: Network
- Chapter 11: Cloud computing performance
- Chapter 12: Benchmarking
- Chapters 13–15: perf, Ftrace, BPF (foundation for next phase)
Flame Graph Mastery
Types of Flame Graphs and When to Use Each:
| Flame Graph Type | Command | Best For |
|---|---|---|
| CPU on-CPU | perf record -F 99 -ag | CPU-bound code: find hot functions |
| Off-CPU | offcputime-bpfcc | I/O-bound and lock-wait analysis |
| Memory allocation | perf record -e probe_libc:malloc -ag (after perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc) | Heap allocation hot paths |
| Differential | difffolded.pl | Before/after comparison of two profiles |
| Package flamegraph | Language-specific (async-profiler for JVM) | JVM/Python/Node profiling |
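A differential flame graph starts from two folded-stack files, where each line is a semicolon-separated stack followed by a sample count (the format that stackcollapse-perf.pl emits). The sketch below shows the comparison at its simplest. It is not how difffolded.pl is implemented, and the function names and sample stacks are invented for this example.

```python
from collections import Counter


def parse_folded(text):
    """Parse 'frame;frame;frame count' lines into a Counter keyed by stack."""
    stacks = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        stacks[stack] += int(count)
    return stacks


def diff_folded(before_text, after_text):
    """Return {stack: after - before} for every stack whose sample count changed."""
    before, after = parse_folded(before_text), parse_folded(after_text)
    return {
        stack: after[stack] - before[stack]
        for stack in sorted(set(before) | set(after))
        if after[stack] != before[stack]
    }


if __name__ == "__main__":
    # Hypothetical profiles taken before and after an optimization.
    before = "main;parse 10\nmain;send 5\n"
    after = "main;parse 4\nmain;send 5\nmain;log 3\n"
    for stack, delta in diff_folded(before, after).items():
        print(f"{delta:+d} {stack}")
```

Compare profiles with similar total sample counts; otherwise a longer capture looks like a regression in every stack.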
Month 3 Lab Exercise: Optimize a Real Service
Take any open-source service (Redis, nginx, PostgreSQL). Run a load test with wrk or pgbench. Generate a flame graph. Identify the single hottest non-kernel function. Read its source. Propose (and implement if possible) an optimization. Measure improvement.
Linux Performance Tuning Reference
CPU Tuning:
| Knob | Where | Effect |
|---|---|---|
| CPU frequency governor | /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | Set to performance for latency-sensitive workloads |
| C-state disable | /sys/devices/system/cpu/cpu*/cpuidle/state*/disable | Prevents deep sleep; reduces wake latency |
| IRQ affinity | /proc/irq/*/smp_affinity | Pin IRQs away from application cores |
| NUMA balancing | /proc/sys/kernel/numa_balancing | Disable for latency-sensitive; enable for throughput |
| Transparent huge pages | /sys/kernel/mm/transparent_hugepage/enabled | always for throughput; madvise for mixed |
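The IRQ affinity knob takes a hexadecimal CPU bitmask rather than a list of CPUs, and on machines with more than 32 CPUs the mask is written as comma-separated 32-bit groups. The sketch below builds that string from a list of CPU numbers. The function name is made up for this example; writing the result to /proc/irq/*/smp_affinity requires root and is left out.

```python
def cpus_to_affinity_mask(cpus):
    """Hex CPU bitmask in comma-separated 32-bit groups, most significant group first."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    hex_digits = f"{value:x}"
    groups = []
    # Peel off 8 hex digits (32 bits) at a time, starting from the low end.
    while hex_digits:
        groups.append(hex_digits[-8:])
        hex_digits = hex_digits[:-8]
    return ",".join(reversed(groups))


if __name__ == "__main__":
    # Keep IRQs on CPUs 0-1 so that CPUs 2-7 stay free for the application.
    print(cpus_to_affinity_mask([0, 1]))
    print(cpus_to_affinity_mask(range(2, 8)))
    print(cpus_to_affinity_mask([32]))
```

Where the kernel exposes it, /proc/irq/*/smp_affinity_list accepts a plain CPU list such as 0-1 and avoids the mask arithmetic.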
Memory Tuning:
| Knob | Default | Recommended (latency) | Effect |
|---|---|---|---|
| vm.swappiness | 60 | 1–10 | Reduce swap activity |
| vm.dirty_ratio | 20 | 5–10 | Flush dirty pages earlier |
| vm.dirty_background_ratio | 10 | 3–5 | Background flush threshold |
| vm.overcommit_memory | 0 | 1 (latency) | Always overcommit, so allocations do not fail up front; does not prevent the OOM killer |
| vm.min_free_kbytes | auto | 2× default | Keep emergency reserves |
Success Criteria for Phase 2:
- Can apply the USE method to diagnose a contrived performance problem (cache thrash, lock contention, I/O saturation) in under 15 minutes
- Have read chapters 1–15 of Systems Performance
- Can produce and interpret all five types of flame graphs
Phase 3: Intermediate (Months 6–12)
eBPF Tracing
eBPF has become the standard tool for production-safe, zero-modification tracing.
Primary Text: "BPF Performance Tools" — Brendan Gregg (ISBN: 978-0136554820, Addison-Wesley, 2019)
Key eBPF Concepts:
| Concept | Description |
|---|---|
| eBPF programs | Small C programs compiled to eBPF bytecode, verified before load |
| Probe types | kprobe (kernel function entry), kretprobe (return), tracepoint (static), uprobe (userspace) |
| Maps | Shared data structures between eBPF programs and userspace |
| BCC | Python frontend for eBPF programs |
| bpftrace | High-level awk-like language for one-liners |
| libbpf | C library for production eBPF programs (CO-RE: portable across kernel versions) |
bpftrace One-Liners Reference:
# Trace all execve() calls with arguments
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> %s\n", comm, str(args->filename)); }'
# Syscall latency histogram for openat() (ns)
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_openat /@start[tid]/ { @[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'
# Off-CPU analysis: which kernel stacks lead to a context switch?
bpftrace -e 'tracepoint:sched:sched_switch { @[kstack] = count(); }'
# Block I/O latency histogram (ms)
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { @io_ms = hist((nsecs - @start[args->dev, args->sector]) / 1000000); delete(@start[args->dev, args->sector]); }'
# TCP retransmit rate
bpftrace -e 'kprobe:tcp_retransmit_skb { @[comm] = count(); } interval:s:1 { print(@); clear(@); }'
# Memory allocation flame graph data
bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc { @[ustack, comm] = sum(arg0); }'
Month 7–8 Lab Exercise: Instrument a Production Service
- Pick a service under load (nginx serving a static site at 50K RPS is fine)
- Identify top 5 syscalls by count:
bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[args->id] = count(); }'
- Measure P99 latency of the most frequent syscall
- Find any syscall taking >1ms and identify which code path triggers it
- Write a bpftrace script that produces a per-second histogram of request latency
Memory Profiling
Tools:
| Tool | What It Finds | Overhead |
|---|---|---|
| Valgrind massif | Heap allocation profile over time | 10–20x slowdown |
| heaptrack | Heap allocation with call stacks | 3–5x slowdown |
| perf mem | Memory access patterns, cache behavior | ~5% sampling |
| numastat -p <pid> | Per-process NUMA allocation | Negligible |
| /proc/<pid>/smaps_rollup | Virtual memory breakdown | None |
| vmtouch | File cache status | None |
Month 9 Lab Exercise: Memory Profile a JVM Application
- Use async-profiler (Java) in allocation mode (-e alloc) to capture an allocation flame graph
- Identify the top allocation sites
- Use -XX:+PrintGCDetails -XX:+PrintGCDateStamps (JDK 8; on JDK 9 and later use -Xlog:gc*) to correlate GC pauses with allocation rate
- Reduce allocation rate by 30% through object pooling or by restructuring code so that escape analysis can eliminate allocations
- Verify with a JMH benchmark before and after
Lock Contention Analysis
Month 10 Lab Exercise:
# Find hot lock contention with perf lock
perf lock record -a -- sleep 10
perf lock report
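# With bpftrace: kernel mutex slow-path wait time (ns) by kernel stack
# Symbol names vary by kernel; list candidates with: bpftrace -l 'kprobe:*mutex_lock*'
bpftrace -e '
kprobe:__mutex_lock_slowpath { @start[tid] = nsecs; }
kretprobe:__mutex_lock_slowpath /@start[tid]/ { @[kstack] = hist(nsecs - @start[tid]); delete(@start[tid]); }
'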
Common Lock Contention Patterns:
| Pattern | Symptom | Fix |
|---|---|---|
| Global lock serialization | One core at 100%, others idle | Per-CPU data structures, sharding |
| Read-write lock imbalance | Readers starved by writers | RCU for read-heavy workloads |
| Lock-free queue head | High CAS retry rate visible in perf stat | Exponential backoff, elimination array |
| Convoy effect | Lock holder descheduled; all waiters block | Keep critical sections short; avoid blocking or being preempted while holding the lock |
| False sharing | Different threads modify distinct variables that share one cache line | Pad or align per-thread data to 64 bytes |
Network Performance
Month 11–12 Lab Exercise: Scale a Network Server from 100K to 1M RPS
Baseline Setup:
# Run wrk (install it first) for HTTP load testing: 12 threads, 400 connections, 30 seconds
wrk -t 12 -c 400 -d 30s http://localhost:8080/
# Baseline: 100K RPS on a single-core server
Optimization Steps (apply in order, measure each step):
| Step | Change | Expected Gain |
|---|---|---|
| 1 | Increase socket backlog: net.core.somaxconn=65535 | Reduces connection errors |
| 2 | Enable TCP_NODELAY on the server (disables Nagle's algorithm) | Reduces latency for small messages |
| 3 | Tune buffer sizes: net.core.rmem_max=134217728 | Reduces buffer exhaustion |
| 4 | Enable TCP BBR: net.ipv4.tcp_congestion_control=bbr | Better throughput under loss |
| 5 | Use SO_REUSEPORT (multiple listeners) | Near-linear scaling with threads |
| 6 | Move to io_uring for async I/O | Reduces syscall overhead |
| 7 | CPU affinity: pin workers to cores | Reduces context switch overhead |
| 8 | Set TCP_NODELAY on the client as well | Removes the delay of up to ~40ms that Nagle plus delayed ACKs adds to small writes |
Expected: After all optimizations, reach 800K–1.2M RPS (hardware-dependent).
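The TCP_NODELAY and SO_REUSEPORT steps are socket options set in application code, not sysctls. The sketch below shows where they go when creating a listening socket. `make_listener` is a name made up for this example, SO_REUSEPORT is guarded because not every platform defines it, and re-applying TCP_NODELAY on each accepted connection is the portable choice if you do not want to rely on it being inherited from the listener.

```python
import socket


def make_listener(host="127.0.0.1", port=0, backlog=1024):
    """Create a TCP listener with Nagle disabled and port sharing enabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):
        # Lets several listeners (one per worker) bind the same port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    # Disable Nagle's algorithm so small writes are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.bind((host, port))
    # The effective backlog is capped by net.core.somaxconn.
    sock.listen(backlog)
    return sock


if __name__ == "__main__":
    listener = make_listener()
    print("listening on port", listener.getsockname()[1])
    print("TCP_NODELAY set:", listener.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
    listener.close()
```

With SO_REUSEPORT, each worker process or thread calls `make_listener` with the same port and runs its own accept loop, and the kernel spreads incoming connections across them.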
Phase 4: Advanced (Months 12–24)
Kernel Bypass — DPDK
What it is: Bypass the kernel network stack entirely. NIC driver runs in userspace. No syscalls. No context switches. No kernel scheduler involvement.
When to use: When you need < 5µs end-to-end latency or > 10 Mpps throughput.
DPDK Setup Lab:
# Bind NIC to vfio-pci driver (remove from kernel)
modprobe vfio-pci
dpdk-devbind.py --bind=vfio-pci 0000:01:00.0
# Run l3fwd sample app
./dpdk-l3fwd -l 0-3 -n 4 -- -p 0x1 --config="(0,0,0),(0,1,1)"
Key DPDK Concepts:
| Concept | Description |
|---|---|
| Huge pages | Required: 2 MB or 1 GB pages for DMA buffers |
| IOVA mode | Physical or virtual address mode for DMA |
| mempool | Pre-allocated packet buffer pools |
| rte_ring | Lock-free SPSC/MPSC/MPMC ring buffers |
| Poll mode driver (PMD) | Busy-polls NIC instead of using interrupts |
| RSS (Receive Side Scaling) | Hash-based NIC flow distribution across queues |
io_uring
Why io_uring matters: Reduces syscall count by submitting batches of I/O through shared memory rings. Can achieve zero-syscall steady-state operation.
io_uring Architecture:
Userspace Kernel
SQ Ring ─── submit ──→ io_uring_enter()
(Submission) (or auto via SQPOLL)
CQ Ring ←── complete── I/O completion
(Completion)
Month 13 Lab Exercise:
Rewrite a file copy utility three ways:
1. read()/write() in a loop
2. preadv()/pwritev() with large buffers
3. io_uring with fixed buffers and registered files
Measure: throughput (MB/s), syscall count (perf stat), CPU utilization.
Expected results: io_uring version should show ~5–10x fewer syscalls and 20–30% CPU reduction at high throughput.
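To make the syscall comparison concrete, it helps to have a baseline that counts its own calls. The sketch below is variant 1 only (the read()/write() loop), written with Python's thin wrappers around the syscalls. The function name and the 64 KiB buffer size are assumptions for this example. The io_uring variant needs a liburing binding, which is outside the standard library, so it is not shown.

```python
import os
import tempfile


def copy_read_write(src_path, dst_path, buf_size=64 * 1024):
    """Copy with a read()/write() loop; returns (bytes_copied, read_write_calls)."""
    calls = 0
    copied = 0
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while True:
            chunk = os.read(src, buf_size)
            calls += 1
            if not chunk:
                break  # a zero-length read means end of file
            view = memoryview(chunk)
            while view:
                written = os.write(dst, view)  # may be a partial write
                calls += 1
                view = view[written:]
            copied += len(chunk)
    finally:
        os.close(src)
        os.close(dst)
    return copied, calls


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src, dst = os.path.join(tmp, "src.bin"), os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(200 * 1024))
        copied, calls = copy_read_write(src, dst)
        print(f"copied {copied} bytes with {calls} read/write calls")
```

The count grows as roughly two calls per buffer, so the batching variants should be compared against it at the same buffer size, and the total should be confirmed with perf stat.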
NUMA Optimization
Month 14 Lab Exercise:
# Observe NUMA allocation
numastat -p <pid> # per-node allocation
numactl --hardware # topology
# Bind process to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./myserver
# libnuma in application (C; link with -lnuma and check numa_available() first)
#include <numa.h>
void *buf = numa_alloc_onnode(size, 0); // allocate on node 0
NUMA Performance Rules:
| Rule | Reason |
|---|---|
| Allocate memory on the same node as the thread | Cross-node access adds roughly 40–80 ns |
| Use mbind() with MPOL_BIND for critical buffers | Restricts allocation to the chosen node |
| Avoid NUMA balancing for latency-sensitive apps | Automatic migration causes pauses |
| Per-NUMA-node data structures | Eliminate cross-node cache invalidation |
| Disable hyper-threading on latency-critical cores | Removes contention with the sibling thread for L1/L2 and execution units |
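The first rule can be applied from inside a process by restricting it to the CPUs of one node. The sketch below parses the kernel's cpulist format (for example 0-3,8-11) from sysfs and sets the CPU affinity. The function names are made up for this example. It controls CPU placement only: memory then tends to land on the same node through the default first-touch behaviour, but this is not a substitute for mbind() or numactl --membind.

```python
import os


def parse_cpulist(text):
    """Parse a kernel cpulist string such as '0-3,8-11' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        low, _, high = part.partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus


def node_cpus(node):
    """CPUs belonging to a NUMA node, read from sysfs (Linux only)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def pin_to_node(node):
    """Restrict the calling process to the CPUs of one NUMA node (Linux only)."""
    os.sched_setaffinity(0, node_cpus(node))


if __name__ == "__main__":
    # Hypothetical cpulist for a node whose CPUs are split into two ranges.
    print(sorted(parse_cpulist("0-3,8-11\n")))
```

Call `pin_to_node(0)` early, before allocating large buffers, so that the pages are first touched from the intended node.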
Compiler Optimization
Month 15 Lab Exercise: Profile-Guided Optimization (PGO)
# Step 1: Compile with instrumentation
gcc -O2 -fprofile-generate -o server_pgo server.c
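# Step 2: Run representative workload
./server_pgo &
wrk -t 8 -c 200 -d 60s http://localhost:8080/
# Profile data (.gcda) is written at normal exit, so the server must handle SIGTERM and exit cleanly
kill -TERM %1
# Step 3: Compile with profile data
gcc -O2 -fprofile-use -fprofile-correction -o server_opt server.c
# Expected: 10–20% throughput improvement, mainly from better code layout, inlining, and branch ordering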
Key Compiler Flags for Performance:
| Flag | Effect | When to Use |
|---|---|---|
| -O3 | Full optimization including auto-vectorization | Throughput-critical code |
| -march=native | Target-specific instructions (AVX2, etc.) | When the binary runs on the same CPU model it was built on |
| -flto | Link-time optimization (cross-file inlining) | Production builds, if the longer link time is acceptable |
| -fprofile-generate/-use | PGO | After representative workload profiling |
| -funroll-loops | Unroll small loops | Tight inner loops (measure first) |
| __builtin_expect | Branch prediction hint | When a branch is strongly biased one way |
| __attribute__((hot)) | Mark hot functions for layout | Functions in critical path |
HFT-Grade Latency — Sub-100µs Target
Latency Budget Analysis:
| Operation | Typical Cost | Optimization |
|---|---|---|
| L1 cache hit | ~1 ns (about 4 cycles) | Structure data for locality |
| L3 cache hit | 30–40 ns | Prefetch with __builtin_prefetch |
| DRAM access | 60–100 ns | Minimize working set |
| Context switch | 1–10 µs | Dedicated CPU core, no sharing |
| Kernel network stack | 5–50 µs | DPDK bypass |
| IRQ latency (with C-states) | 10–200 µs | Disable C-states |
| TCP round trip (local) | 50–200 µs | UDP or shared memory |
| Mutex acquire (uncontended) | 20–100 ns | Lock-free or RCU |
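A budget table is most useful when the rows are added up for one concrete request path. The sketch below sums the low and high ends of four rows from the table against the sub-100 µs target. The choice of rows and the figure of 20 DRAM accesses per request are assumptions made for illustration, not measurements.

```python
TARGET_US = 100.0  # the sub-100 µs target of this section

DRAM_ACCESSES = 20  # assumed number of cache-missing loads per request


def total_us(components):
    """Sum a list of (name, cost in microseconds) pairs."""
    return sum(cost_us for _name, cost_us in components)


# Low and high ends of the ranges in the table above.
best_case = [
    ("IRQ latency (with C-states)", 10.0),
    ("Kernel network stack", 5.0),
    ("Context switch", 1.0),
    ("DRAM accesses", DRAM_ACCESSES * 0.060),
]
worst_case = [
    ("IRQ latency (with C-states)", 200.0),
    ("Kernel network stack", 50.0),
    ("Context switch", 10.0),
    ("DRAM accesses", DRAM_ACCESSES * 0.100),
]

if __name__ == "__main__":
    for label, path in (("best", best_case), ("worst", worst_case)):
        total = total_us(path)
        verdict = "within" if total <= TARGET_US else "over"
        print(f"{label} case: {total:.1f} us ({verdict} the {TARGET_US:.0f} us budget)")
```

The best case fits comfortably while the worst case is dominated by IRQ wake-up latency, which is why disabling C-states and isolating cores come before any micro-optimization of the code.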
Kernel Configuration for Minimum Latency:
CONFIG_PREEMPT=y # Preemptible kernel (PREEMPT_RT goes further for hard real-time)
CONFIG_HZ=1000 # 1ms timer tick
CONFIG_NO_HZ_FULL=y # Tickless for isolated CPUs
CONFIG_CPUSETS=y # cpuset support for partitioning CPUs
Boot parameters:
isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 nosmt
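Boot parameters are easy to lose after a bootloader update, so it is worth checking them on the running system before trusting a latency measurement. The sketch below parses a kernel command line and reports which of the four parameters above are missing. The function names are made up for this example, the sample command line in the fallback is hypothetical, and quoted values containing spaces are not handled.

```python
from pathlib import Path

REQUIRED = ("isolcpus", "nohz_full", "rcu_nocbs", "nosmt")


def parse_cmdline(text):
    """Map each boot parameter to its value, or to True for bare flags."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def missing_params(cmdline_text, required=REQUIRED):
    """Return the required parameters that are absent from the command line."""
    present = parse_cmdline(cmdline_text)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    proc_file = Path("/proc/cmdline")
    # Fall back to a hypothetical command line when not running on Linux.
    text = proc_file.read_text() if proc_file.exists() else "ro quiet isolcpus=2-7"
    print("missing:", missing_params(text) or "none")
```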
Performance Bottleneck Cookbook
Systematic diagnosis guide for the five most common production bottlenecks:
| Symptom | First Check | Second Check | Likely Fix |
|---|---|---|---|
| High CPU, low throughput | perf top: which functions? | Flame graph: hot path | Algorithmic optimization or SIMD |
| Low CPU, high latency | bpftrace off-CPU analysis | strace (outside production): which syscall blocks? | Reduce lock contention or I/O |
| Memory grows unbounded | valgrind massif or heaptrack | /proc/<pid>/smaps | Fix memory leak or tune allocator |
| Network throughput plateau | sar -n DEV 1: %ifutil | ethtool -S: ring drops | Increase ring buffers, RSS, or DPDK |
| Disk I/O saturation | iostat -xz 1: %util near 100 | blktrace: queue depth | Async I/O, larger block size, NVMe |
Tools Reference Card
| Category | Tool | Key Command | Output |
|---|---|---|---|
| CPU profiling | perf record | perf record -F 99 -ag -- cmd | Sampling profile |
| CPU profiling | perf top | perf top -g | Live function view |
| CPU profiling | flamegraph | Script pipeline | SVG flame graph |
| Tracing | bpftrace | bpftrace -e 'kprobe:...' | Aggregated stats |
| Tracing | ftrace | echo function > current_tracer | Function call log |
| Memory | valgrind | valgrind --tool=massif | Heap timeline |
| Memory | perf mem | perf mem record/report | Memory access stats |
| Network | iperf3 | iperf3 -c host -P 4 | Throughput |
| Network | ss | ss -ntp | Socket states |
| Disk | iostat | iostat -xz 1 | I/O utilization |
| Disk | blktrace | blktrace /dev/sda | Block-level trace |
| Benchmarking | wrk | wrk -t 8 -c 200 -d 30s url | HTTP throughput |
| Benchmarking | fio | fio --name=test --rw=randread | Storage benchmark |
Success Criteria Summary
| Phase | Key Checkpoints |
|---|---|
| Phase 1 | Generated flame graph for a real process; reproduced all four microbenchmark effects with measured numbers |
| Phase 2 | Applied USE method to diagnose a contrived problem in <15 min; read Systems Performance cover to cover |
| Phase 3 | Instrumented a production service with bpftrace; scaled a server from 100K to 1M RPS |
| Phase 4 | Built a DPDK-based packet forwarder; measured sub-10µs P99 latency on an isolated core; achieved PGO improvement |
The core insight of performance engineering: measure first, hypothesize second, optimize third, measure again. Intuition is almost always wrong about where the bottleneck actually lives.