Performance Engineering Learning Roadmap

A complete roadmap for becoming a performance engineer — from benchmarking basics through eBPF tracing, kernel bypass, and HFT-grade latency optimization. This is a practical discipline; every concept must be verified on a running system.

Overview

Phase	Duration	Focus	Target Outcome
Prerequisites	1–2 months	Profiling tools, CPU architecture, benchmarking methodology	Can profile a process and interpret results without guidance
Core Skills	2–6 months	Systems Performance book, USE method, Linux tuning	Can diagnose any Linux performance problem systematically
Intermediate	6–12 months	eBPF, memory profiling, lock contention, network perf	Can instrument production workloads safely
Advanced	12–24 months	Kernel bypass, NUMA, compiler optimization, sub-100µs latency	Can achieve HFT-grade latency in a network server

Phase 1: Prerequisites (Months 1–2)

CPU Architecture Fundamentals

Understanding the hardware is non-negotiable for performance work. Without this, profiling numbers are uninterpretable.

Concept	Key Facts	Why It Matters
Cache hierarchy	L1: ~4 cycles, L2: ~12 cycles, L3: ~40 cycles, DRAM: ~200 cycles	Cache misses dominate latency profiles
Cache line size	64 bytes on x86	Explains false sharing, prefetch patterns
Branch predictor	Predicts branch outcomes so the pipeline can run ahead speculatively; a misprediction costs ~15–20 cycles	Explains why code with unpredictable branches is slow
Out-of-order execution	CPU reorders instructions within a window (~200 ops)	Explains memory model, barriers needed
SIMD/AVX	256-bit (AVX2) or 512-bit (AVX-512) vector registers; one instruction operates on every lane	Auto-vectorization and manual SIMD
Hyper-threading	Two hardware threads per core; share L1/L2	Causes contention in latency-sensitive apps
NUMA	Each socket has local memory; cross-socket adds ~40–80ns	Critical for multi-socket servers
TLB	Caches virtual-to-physical translations; e.g. 512 entries for 4 KB pages or 32 entries for 2 MB huge pages (counts vary by CPU)	TLB misses become costly for large working sets
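
The cycle counts in the table are easier to reason about once converted to time. The sketch below does that arithmetic and also computes TLB reach (entries times page size) for the two example TLB sizes in the table. The 3.0 GHz clock is an assumption made for illustration; real cores change frequency under load, so treat the output as an order-of-magnitude guide.

```python
# Convert the table's cycle counts to nanoseconds and compute TLB reach.
# ASSUMPTION: a fixed 3.0 GHz core clock (not stated in the table).
CLOCK_GHZ = 3.0

# Approximate latencies in cycles, copied from the table above.
LATENCY_CYCLES = {"L1": 4, "L2": 12, "L3": 40, "DRAM": 200}


def cycles_to_ns(cycles: float, clock_ghz: float = CLOCK_GHZ) -> float:
    """At f GHz one cycle lasts 1/f nanoseconds."""
    return cycles / clock_ghz


def tlb_reach_bytes(entries: int, page_bytes: int) -> int:
    """Memory that can be touched without a TLB miss: entries x page size."""
    return entries * page_bytes


if __name__ == "__main__":
    for level, cycles in LATENCY_CYCLES.items():
        print(f"{level:>4}: {cycles:>3} cycles = {cycles_to_ns(cycles):5.1f} ns")
    print("TLB reach with 4 KB pages:", tlb_reach_bytes(512, 4096) // 2**20, "MiB")
    print("TLB reach with 2 MB pages:", tlb_reach_bytes(32, 2 * 2**20) // 2**20, "MiB")
```

The reach numbers show why huge pages matter: with these example sizes, 4 KB pages cover only 2 MiB before translations start missing, while 2 MB pages cover 64 MiB.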

Resource: "What Every Programmer Should Know About Memory" — Ulrich Drepper, 2007 (free PDF, ~100 pages). Read Sections 1–5 thoroughly.

Benchmarking Methodology

Before measuring anything, establish the methodology. Measuring wrong is worse than not measuring at all.

The Seven Questions Before Any Benchmark:

1. What are you measuring? (Latency? Throughput? CPU efficiency?)
2. What is the unit? (ns, µs, ops/sec, MB/s)
3. Are you measuring the right thing? (Is the profiler itself distorting results?)
4. What is the workload distribution? (Uniform? Zipfian? Bursty?)
5. What percentile matters? (Mean is almost always wrong; use p50/p99/p99.9)
6. Is the system warmed up? (JIT, CPU frequency scaling, page cache)
7. Are results statistically significant? (Run at least 30 iterations; compute confidence intervals)
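
Questions 5 to 7 can be made concrete with a small harness. This is a minimal sketch, not a replacement for a real benchmarking framework: the function names are hypothetical, the percentile uses the nearest-rank definition, and the confidence interval uses the normal approximation (z = 1.96), which is slightly optimistic at n = 30 compared with a t-distribution.

```python
import math
import statistics
import time


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def mean_ci95(samples):
    """Mean and half-width of a normal-approximation 95% confidence interval."""
    mean = statistics.fmean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, 1.96 * stderr


def run_benchmark(fn, iterations=30, warmup=5):
    """Call fn() `warmup` times untimed, then return `iterations` timings in ns."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        timings.append(time.perf_counter_ns() - start)
    return timings


if __name__ == "__main__":
    timings = run_benchmark(lambda: sum(range(100_000)))
    mean, half = mean_ci95(timings)
    print(f"p50={percentile(timings, 50)} ns  p99={percentile(timings, 99)} ns")
    print(f"mean={mean:.0f} ns +/- {half:.0f} ns (95% CI, n={len(timings)})")
```

With only 30 samples, p99 is effectively the maximum; a tail percentile such as p99.9 needs thousands of samples before it means anything.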

Tools Introduction:

Tool	Install	Best For
`perf stat`	`apt install linux-tools-common linux-tools-$(uname -r)` (Ubuntu) or `apt install linux-perf` (Debian)	Hardware counter overview
`perf record` + `perf report`	Same	Sampling profiler
`flamegraph.pl`	https://github.com/brendangregg/FlameGraph	Visualize perf output
`bpftrace`	`apt install bpftrace`	Dynamic tracing, histograms
`strace`	`apt install strace`	Syscall tracing (high overhead — never in prod)
`ltrace`	`apt install ltrace`	Library call tracing
`pmap`	Built into procps	Process memory map
`numastat`	`apt install numactl`	NUMA allocation statistics

Month 1 Lab Exercise:

Instrument a simple HTTP server (a small C++ server built on cpp-httplib, or nginx) with perf:

perf record -F 99 -a -g -- sleep 30     # sample all CPUs at 99 Hz
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg

Open the flame graph in a browser. Identify the top 3 functions consuming CPU. This single exercise teaches more than any tutorial.

Month 2 Lab Exercise: CPU Microbenchmarks

Write a benchmark that demonstrates each of the following at measurable cost:

1. Cache miss: sequential array access vs. random pointer chasing
2. False sharing: two threads incrementing adjacent variables vs. padded variables
3. Branch misprediction: sorted array sum vs. random array sum
4. TLB miss: access 1 GB array with 64-byte stride vs. 4096-byte stride

Use perf stat -e cache-misses,branch-misses,dTLB-load-misses,instructions,cycles to confirm your measurements (dTLB-load-misses covers the TLB case; event availability varies by CPU).
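
One detail of the pointer-chasing benchmark is easy to get wrong: a plain random permutation usually contains short cycles, so the walk revisits a small, cache-resident subset. A common fix is Sattolo's algorithm, which produces a permutation that is a single cycle. The sketch below shows the construction with hypothetical function names; interpreter overhead hides most of the cache effect in Python, so port the same index chain to C or C++ for the measured version.

```python
import random
import time


def single_cycle_permutation(n, seed=0):
    """Sattolo's algorithm: a random permutation forming one cycle of length n,
    so following next_index[i] visits every slot before returning to the start."""
    rng = random.Random(seed)
    next_index = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i (never i itself) is what forces a single cycle
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def chase(next_index, steps):
    """Dependent loads: each index is produced by the previous load."""
    i = 0
    for _ in range(steps):
        i = next_index[i]
    return i


if __name__ == "__main__":
    n = 1 << 18
    random_chain = single_cycle_permutation(n)
    sequential_chain = list(range(1, n)) + [0]  # 0 -> 1 -> ... -> n-1 -> 0
    for name, chain in (("sequential", sequential_chain), ("random", random_chain)):
        start = time.perf_counter()
        chase(chain, n)
        print(f"{name:>10}: {(time.perf_counter() - start) * 1e9 / n:.1f} ns/step")
```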

Success Criteria for Phase 1:
- Can produce a flame graph for any process within 5 minutes
- Can explain what a cache miss costs in cycles, not just conceptually
- Have reproduced all four microbenchmark effects with measured numbers

Phase 2: Core Skills (Months 2–6)

Primary Text

Systems Performance: Enterprise and the Cloud (2nd ed.), by Brendan Gregg
- Publisher: Addison-Wesley, 2020
- ISBN: 978-0136820154
- This is the bible. Read every chapter. The examples use Linux but the methodology is universal.

USE Method

The USE (Utilization, Saturation, Errors) method is the systematic framework for performance analysis. Apply it before reaching for any profiling tool.

USE Method Checklist for Linux Systems:

Resource	Utilization Metric	Saturation Metric	Errors Metric
CPU	`mpstat -P ALL 1`: 100 - `%idle` per CPU	`vmstat 1`: `r` (run queue) above the CPU count	`dmesg`: machine check exceptions (MCE)
Memory	`free -m`: used/total	`vmstat 1`: `si`/`so` (swap in/out)	`dmesg`: OOM killer
Network interface	`sar -n DEV 1`: `%ifutil`	`netstat -s`: drop counters	`ip -s link`: errors
Disk I/O	`iostat -xz 1`: `%util`	`iostat -xz 1`: `await` (ms)	`smartctl -a`
File descriptors	`cat /proc/sys/fs/file-nr`	-	Application logs: EMFILE ("Too many open files"); `dmesg`: "VFS: file-max limit reached"
Kernel locks	`perf lock record/report`	-	lockdep output
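
The CPU row of the checklist can be reproduced directly from /proc/stat, which is where tools like mpstat and vmstat get their numbers. The sketch below is a simplified illustration with hypothetical helper names: utilization is computed between two snapshots of the aggregate cpu line (treating idle plus iowait as not busy), and saturation is read from the procs_running field.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into (idle_ticks, total_ticks)."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Columns: user nice system idle iowait irq softirq steal
    # (guest time is already included in user, so it is not added again).
    ticks = [int(v) for v in fields[1:9]]
    idle = ticks[3] + ticks[4]
    return idle, sum(ticks)


def cpu_utilization(before, after):
    """Fraction of CPU time spent non-idle between two 'cpu' line snapshots."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    total = total1 - total0
    return 0.0 if total == 0 else 1.0 - (idle1 - idle0) / total


def procs_running(stat_text):
    """Runnable task count from the 'procs_running' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("procs_running"):
            return int(line.split()[1])
    raise ValueError("procs_running not found")


def read_stat(path="/proc/stat"):
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    first = read_stat()
    time.sleep(1)
    second = read_stat()
    util = cpu_utilization(first.splitlines()[0], second.splitlines()[0])
    print(f"CPU utilization: {util:.1%}  runnable tasks: {procs_running(second)}")
```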

Month 2–3 Reading Plan (Systems Performance):
- Chapters 1–2: Methodology, tools overview
- Chapter 3: Operating systems concepts (review)
- Chapter 4: Observability tools (vmstat, iostat, netstat, sar)
- Chapter 5: Applications
- Chapter 6: CPUs (CPU profiling, scheduling, affinity)

Month 4 Reading Plan:
- Chapter 7: Memory (virtual memory, paging, allocators)
- Chapter 8: File systems (I/O latency, VFS, buffer cache)
- Chapter 9: Disks

Month 5–6 Reading Plan:
- Chapter 10: Network
- Chapter 11: Cloud computing performance
- Chapter 12: Benchmarking
- Chapters 13–15: perf, Ftrace, BPF (foundation for next phase)

Flame Graph Mastery

Types of Flame Graphs and When to Use Each:

Flame Graph Type	Command	Best For
CPU on-CPU	`perf record -F 99 -ag`	CPU-bound code — find hot functions
Off-CPU	`offcputime-bpfcc`	I/O-bound and lock-wait analysis
Memory allocation	`perf probe -x /lib/x86_64-linux-gnu/libc.so.6 malloc`, then `perf record -e probe_libc:malloc -ag`	Heap allocation hot paths
Differential	`difffolded.pl`	Before/after comparison of two profiles
Language runtime (mixed-mode)	Language-specific (async-profiler for JVM)	JVM/Python/Node profiling
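
Differential flame graphs are built from two "folded" profiles, where each line is a semicolon-separated stack followed by a space and a sample count (the format stackcollapse-perf.pl emits). The sketch below is a simplified illustration of the comparison, not difffolded.pl's exact algorithm: it normalizes each profile by its total sample count and ranks stacks by how much their share of samples changed. Function names are hypothetical.

```python
from collections import Counter


def parse_folded(text):
    """Parse folded stacks ('frame;frame;frame count' per line) into a Counter."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated token; frames may contain spaces.
        stack, _, count = line.rpartition(" ")
        counts[stack] += int(count)
    return counts


def diff_folded(before, after):
    """Return [(stack, change_in_share)] sorted from largest growth to largest
    shrinkage. Shares are fractions of each profile's total samples, so the two
    profiles do not need the same duration or sample count."""
    total_before = sum(before.values()) or 1
    total_after = sum(after.values()) or 1
    stacks = set(before) | set(after)
    deltas = {
        s: after[s] / total_after - before[s] / total_before for s in stacks
    }
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    import sys

    if len(sys.argv) == 3:
        with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
            for stack, delta in diff_folded(parse_folded(f1.read()),
                                            parse_folded(f2.read()))[:10]:
                print(f"{delta:+.2%}  {stack}")
```

Run with two folded files as arguments, it prints the ten stacks whose share of samples grew the most between the two profiles.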

Month 3 Lab Exercise: Optimize a Real Service

Take any open-source service (Redis, nginx, PostgreSQL). Run a load test with wrk or pgbench. Generate a flame graph. Identify the single hottest non-kernel function. Read its source. Propose (and implement if possible) an optimization. Measure improvement.

Linux Performance Tuning Reference

CPU Tuning:

Knob	Where	Effect
CPU frequency governor	`/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor`	Set to `performance` for latency-sensitive workloads
C-state disable	`/sys/devices/system/cpu/cpu*/cpuidle/state*/disable`	Prevents deep sleep; reduces wake latency
IRQ affinity	`/proc/irq/*/smp_affinity`	Pin IRQs away from application cores
NUMA balancing	`/proc/sys/kernel/numa_balancing`	Disable for latency-sensitive; enable for throughput
Transparent huge pages	`/sys/kernel/mm/transparent_hugepage/enabled`	`always` for throughput; `madvise` for mixed
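
Before benchmarking it is worth confirming that these knobs are actually set, because a single core left on a power-saving governor skews tail latency. The sketch below reads the frequency governor of every CPU from the sysfs path in the table. The function names are hypothetical, and the `sysfs_root` parameter exists only so the code can be pointed at a test directory; many virtual machines do not expose cpufreq at all, in which case the result is empty.

```python
import glob
import os


def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Map 'cpuN' -> current scaling governor. Empty if cpufreq is not exposed."""
    governors = {}
    pattern = os.path.join(sysfs_root, "cpu[0-9]*", "cpufreq", "scaling_governor")
    for path in sorted(glob.glob(pattern)):
        cpu = path.split(os.sep)[-3]  # .../cpuN/cpufreq/scaling_governor
        with open(path) as f:
            governors[cpu] = f.read().strip()
    return governors


def not_in_performance_mode(governors):
    """CPUs whose governor is anything other than 'performance'."""
    return sorted(cpu for cpu, gov in governors.items() if gov != "performance")


if __name__ == "__main__":
    current = read_governors()
    if not current:
        print("cpufreq not exposed on this system")
    for cpu in not_in_performance_mode(current):
        print(f"{cpu}: governor is '{current[cpu]}', expected 'performance'")
```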

Memory Tuning:

Knob	Default	Recommended (latency)	Effect
`vm.swappiness`	60	1–10	Reduce swap activity
`vm.dirty_ratio`	20	5–10	Flush dirty pages earlier
`vm.dirty_background_ratio`	10	3–5	Background flush threshold
`vm.overcommit_memory`	0	1 (latency)	Never refuse allocations at the commit check; does not prevent OOM kills
`vm.min_free_kbytes`	auto	2× default	Keep emergency reserves

Success Criteria for Phase 2:
- Can apply the USE method to diagnose a contrived performance problem (cache thrash, lock contention, I/O saturation) in under 15 minutes
- Have read Systems Performance cover to cover
- Can produce and interpret all five types of flame graphs

Phase 3: Intermediate (Months 6–12)

eBPF Tracing

eBPF has become the standard tool for production-safe, zero-modification tracing.

Primary Text: "BPF Performance Tools" — Brendan Gregg (ISBN: 978-0136554820, Addison-Wesley, 2019)

Key eBPF Concepts:

Concept	Description
eBPF programs	Small C programs compiled to eBPF bytecode, verified before load
Probe types	kprobe (kernel function entry), kretprobe (return), tracepoint (static), uprobe (userspace)
Maps	Shared data structures between eBPF programs and userspace
BCC	Python frontend for eBPF programs
bpftrace	High-level awk-like language for one-liners
libbpf	C library for production eBPF programs (CO-RE: portable across kernel versions)

bpftrace One-Liners Reference:

# Trace all execve() calls with arguments
bpftrace -e 'tracepoint:syscalls:sys_enter_execve { printf("%s -> %s\n", comm, str(args->filename)); }'

# Syscall latency histogram (ns) for openat(), which glibc uses to implement open()
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @start[tid] = nsecs; }
             tracepoint:syscalls:sys_exit_openat /@start[tid]/ { @[comm] = hist(nsecs - @start[tid]); delete(@start[tid]); }'

# On-CPU sampling at 99 Hz: which kernel stacks are running?
# A timer sample only sees running code; for off-CPU (blocked) time use offcputime-bpfcc.
bpftrace -e 'profile:hz:99 { @[kstack] = count(); }'

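# Block I/O latency histogram (ms); the filter skips completions whose issue was not seen
bpftrace -e 'tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
             tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ { @io_ms = hist((nsecs - @start[args->dev, args->sector]) / 1000000); delete(@start[args->dev, args->sector]); }'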

# TCP retransmit rate
bpftrace -e 'kprobe:tcp_retransmit_skb { @[comm] = count(); } interval:s:1 { print(@); clear(@); }'

# Memory allocation flame graph data
bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libc.so.6:malloc { @[ustack, comm] = sum(arg0); }'

Month 7–8 Lab Exercise: Instrument a Production Service

Pick a service under load (nginx serving a static site at 50K RPS is fine)
Identify top 5 syscalls by count: bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[args->id] = count(); }'
Measure P99 latency of the most frequent syscall
Find any syscall taking >1ms and identify which code path triggers it
Write a bpftrace script that produces a per-second histogram of request latency
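
bpftrace's hist() groups values into power-of-two buckets, so a P99 read from its output is a bucket range rather than an exact value. The sketch below shows how to locate that bucket once the counts have been copied out of the histogram. The tuple layout and function name are assumptions for illustration; bpftrace itself prints text, not this structure.

```python
def percentile_bucket(buckets, pct):
    """Return the (low, high) bounds of the bucket containing the pct-th percentile.

    buckets: list of (low, high, count) tuples sorted by `low`, e.g. transcribed
    from bpftrace hist() output where a row reads '[1K, 2K)  900'.
    """
    total = sum(count for _, _, count in buckets)
    if total == 0:
        raise ValueError("empty histogram")
    seen = 0
    for low, high, count in buckets:
        seen += count
        # Integer-friendly form of: seen / total >= pct / 100
        if seen * 100 >= pct * total:
            return low, high
    return buckets[-1][0], buckets[-1][1]


if __name__ == "__main__":
    # Hypothetical counts, only to show the call shape.
    example = [(1024, 2048, 900), (2048, 4096, 90), (4096, 8192, 10)]
    low, high = percentile_bucket(example, 99)
    print(f"P99 falls in [{low}, {high}) ns")
```

Because the bucket width doubles each step, the upper bound can overstate the true percentile by up to 2x; use it as a conservative estimate, or record raw latencies when a tighter number is needed.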

Memory Profiling

Tools:

Tool	What It Finds	Overhead
Valgrind massif	Heap allocation profile over time	10–20x slowdown
heaptrack	Heap allocation with call stacks	3–5x slowdown
`perf mem`	Memory access patterns, cache behavior	~5% sampling
`numastat -p <pid>`	Per-process NUMA allocation	Negligible
`/proc/<pid>/smaps_rollup`	Virtual memory breakdown	None
`vmtouch`	File cache status	None

Month 9 Lab Exercise: Memory Profile a JVM Application

Use async-profiler (Java) in allocation mode (-e alloc) to capture an allocation flame graph
Identify top allocation sites
Enable GC logging (-Xlog:gc* on JDK 9+; -XX:+PrintGCDetails -XX:+PrintGCDateStamps on JDK 8) to correlate GC pauses with allocation rate
Reduce allocation rate by 30% through object pooling or escape analysis hints
Verify with JMH benchmark before and after

Lock Contention Analysis

Month 10 Lab Exercise:

# Find hot lock contention with perf lock
perf lock record -a -- sleep 10
perf lock report

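# With bpftrace: kernel mutex slow-path wait time per kernel stack
# (symbol names vary by kernel build; list candidates with: bpftrace -l 'kprobe:*mutex_lock*')
bpftrace -e '
kprobe:__mutex_lock_slowpath { @start[tid] = nsecs; }
kretprobe:__mutex_lock_slowpath /@start[tid]/ { @[kstack] = hist(nsecs - @start[tid]); delete(@start[tid]); }
'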

Common Lock Contention Patterns:

Pattern	Symptom	Fix
Global lock serialization	One core at 100%, others idle	Per-CPU data structures, sharding
Read-write lock imbalance	Readers starved by writers	RCU for read-heavy workloads
Lock-free queue head	High CAS retry rate visible in `perf stat`	Exponential backoff, elimination array
Convoy effect	Lock holder descheduled; all waiters block	Avoid kernel preemption with lock held
False sharing	Different variables in the same cache line modified by different threads	Pad or align per-thread data to 64-byte cache lines
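
The false-sharing fix in the last row is a layout change, and the layout can be checked without running any threads. The sketch below uses ctypes to model two C structs and verifies which fields share a cache line. The struct names are hypothetical, and the check assumes the struct's base address is itself cache-line aligned (in C or C++ that is what `alignas(64)` guarantees).

```python
import ctypes

CACHE_LINE = 64  # bytes on x86, as listed in the Phase 1 table


class PackedCounters(ctypes.Structure):
    """Two counters 8 bytes apart: both fall in the same cache line."""
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]


class PaddedCounter(ctypes.Structure):
    """One counter padded to a full cache line, so array elements never share a line."""
    _fields_ = [
        ("value", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
    ]


def same_cache_line(offset_a, offset_b, line=CACHE_LINE):
    """True if two byte offsets from a line-aligned base land in the same line."""
    return offset_a // line == offset_b // line


if __name__ == "__main__":
    print("packed a/b share a line:",
          same_cache_line(PackedCounters.a.offset, PackedCounters.b.offset))
    stride = ctypes.sizeof(PaddedCounter)  # distance between array elements
    print("padded element stride:", stride, "bytes")
    print("padded [0]/[1] share a line:", same_cache_line(0, stride))
```

The cost of padding is memory: each 8-byte counter now occupies 64 bytes, which is usually acceptable for a handful of per-thread hot counters but not for large arrays.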

Network Performance

Month 11–12 Lab Exercise: Scale a Network Server from 100K to 1M RPS

Baseline Setup:

# Run wrk (HTTP load generator): 12 threads, 400 connections, 30 seconds
wrk -t 12 -c 400 -d 30s http://localhost:8080/

# Baseline: 100K RPS on a single-core server

Optimization Steps (apply in order, measure each step):

Step	Change	Expected Gain
1	Increase socket backlog: `net.core.somaxconn=65535`	Reduces connection errors
2	Enable TCP_NODELAY	Reduces latency for small messages
3	Tune buffer sizes: `net.core.rmem_max=134217728`	Reduces buffer exhaustion
4	Enable TCP BBR: `net.ipv4.tcp_congestion_control=bbr`	Better throughput under loss
5	Use SO_REUSEPORT (multiple listeners)	Near-linear scaling across cores until another bottleneck appears
6	Move to `io_uring` for async I/O	Reduces syscall overhead
7	CPU affinity: pin workers to cores	Reduces context switch overhead
8	Set TCP_NODELAY on the load-generator client as well (step 2 covers the server)	Removes the up-to-40 ms Nagle/delayed-ACK stall on small writes

Expected: After all optimizations, reach 800K–1.2M RPS (hardware-dependent).
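
Steps 2 and 5 are per-socket options rather than sysctls, so they have to be set in the server code. The sketch below shows the calls on a listening socket using Python's standard socket module; `make_listener` is a hypothetical helper, and SO_REUSEPORT is assumed to be available (Linux 3.9 or later). On Linux, accepted connections inherit TCP_NODELAY from the listener; set it again on each accepted socket if portability matters.

```python
import socket


def make_listener(host="127.0.0.1", port=0, backlog=65535):
    """Create a TCP listener with the socket-level options from the table.

    port=0 lets the kernel pick a free port. The backlog is silently capped
    at net.core.somaxconn, which is why step 1 raises that sysctl.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Step 5: several sockets may bind the same address; the kernel
    # distributes incoming connections across them.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    # Step 2: disable Nagle's algorithm so small writes are sent immediately.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock


if __name__ == "__main__":
    first = make_listener()
    port = first.getsockname()[1]
    second = make_listener(port=port)  # succeeds only because of SO_REUSEPORT
    print(f"two listeners bound to port {port}")
    first.close()
    second.close()
```

In a real server each worker thread or process would own one such listener and run its own accept loop, which removes the shared accept queue as a contention point.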

Phase 4: Advanced (Months 12–24)

Kernel Bypass — DPDK

What it is: Bypass the kernel network stack entirely. NIC driver runs in userspace. No syscalls. No context switches. No kernel scheduler involvement.

When to use: When you need < 5µs end-to-end latency or > 10 Mpps throughput.

DPDK Setup Lab:

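# Reserve huge pages for DPDK buffers (1024 x 2 MB is an example size)
echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Bind NIC to vfio-pci driver (removes it from the kernel network stack)
modprobe vfio-pci
dpdk-devbind.py --bind=vfio-pci 0000:01:00.0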

# Run l3fwd sample app
./dpdk-l3fwd -l 0-3 -n 4 -- -p 0x1 --config="(0,0,0),(0,1,1)"

Key DPDK Concepts:

Concept	Description
Huge pages	Required: 2 MB or 1 GB pages for DMA buffers
IOVA mode	Physical or virtual address mode for DMA
mempool	Pre-allocated packet buffer pools
rte_ring	Lock-free SPSC/MPSC/MPMC ring buffers
Poll mode driver (PMD)	Busy-polls NIC instead of using interrupts
RSS (Receive Side Scaling)	Hash-based NIC flow distribution across queues

io_uring

Why io_uring matters: Reduces syscall count by submitting batches of I/O through shared memory rings. Can achieve zero-syscall steady-state operation.

io_uring Architecture:

Userspace                    Kernel
  SQ Ring     ─── submit ──→ io_uring_enter()
  (Submission)               (or auto via SQPOLL)

  CQ Ring     ←── complete── I/O completion
  (Completion)

Month 13 Lab Exercise:

Rewrite a file copy utility three ways:

1. read()/write() in a loop
2. preadv()/pwritev() with large buffers
3. io_uring with fixed buffers and registered files

Measure: throughput (MB/s), syscall count (perf stat), CPU utilization.

Expected results: io_uring version should show ~5–10x fewer syscalls and 20–30% CPU reduction at high throughput.
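
The first two variants of the lab can be sketched with Python's thin wrappers over the same syscalls, which makes the syscall-count difference visible before any io_uring code is written. This is an illustration under assumptions: the function names and buffer sizes are hypothetical, each function counts only the read/write calls made inside its loop, and the io_uring variant is left out because the standard library has no binding for it. os.preadv and os.pwritev require Linux and Python 3.7 or later.

```python
import os


def copy_read_write(src, dst, buf_size=4096):
    """Variant 1: one read() plus one write() per buffer. Returns loop syscalls."""
    calls = 0
    fd_in = os.open(src, os.O_RDONLY)
    fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while True:
            chunk = os.read(fd_in, buf_size)
            calls += 1
            if not chunk:  # empty read means end of file
                break
            view = memoryview(chunk)
            while len(view):  # write() may accept only part of the buffer
                written = os.write(fd_out, view)
                calls += 1
                view = view[written:]
    finally:
        os.close(fd_in)
        os.close(fd_out)
    return calls


def copy_vectored(src, dst, buf_size=1 << 20, vec=4):
    """Variant 2: preadv()/pwritev() moving up to vec * buf_size bytes per call."""
    calls = 0
    buffers = [bytearray(buf_size) for _ in range(vec)]
    fd_in = os.open(src, os.O_RDONLY)
    fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while True:
            n = os.preadv(fd_in, buffers, offset)
            calls += 1
            if n == 0:
                break
            # Build views over only the bytes that were actually filled.
            pending, remaining = [], n
            for buf in buffers:
                if remaining == 0:
                    break
                take = min(len(buf), remaining)
                pending.append(memoryview(buf)[:take])
                remaining -= take
            done = 0
            while pending:
                written = os.pwritev(fd_out, pending, offset + done)
                calls += 1
                done += written
                # Drop fully written views; trim a partially written one.
                while written and pending:
                    if written >= len(pending[0]):
                        written -= len(pending.pop(0))
                    else:
                        pending[0] = pending[0][written:]
                        written = 0
            offset += n
    finally:
        os.close(fd_in)
        os.close(fd_out)
    return calls
```

Cross-check the returned counts against `perf stat -e 'syscalls:sys_enter_*'` or `strace -c` on a test machine; the interpreter issues additional syscalls of its own, so the totals will not match exactly, but the ratio between the two variants should.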

NUMA Optimization

Month 14 Lab Exercise:

# Observe NUMA allocation
numastat -p <pid>        # per-node allocation
numactl --hardware       # topology

# Bind process to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./myserver

# libnuma in the application (C, link with -lnuma)
#include <numa.h>
void *buf = numa_alloc_onnode(size, 0);  // allocate on node 0; release with numa_free(buf, size)

NUMA Performance Rules:

Rule	Reason
Allocate memory on the same node as the thread	Cross-node access adds ~40–80 ns
Use `mbind()` with MPOL_BIND for critical buffers	Prevent kernel from migrating pages
Avoid NUMA balancing for latency-sensitive apps	Automatic migration causes pauses
Per-NUMA-node data structures	Eliminate cross-node cache invalidation
Disable hyper-threading on latency-critical cores	Sibling threads share L1/L2 caches and execution units
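
The first rule can also be applied from inside a process instead of through numactl. The sketch below reads a node's CPU list from sysfs and restricts the process to those CPUs; with Linux's default first-touch policy, pages the process touches afterwards are then normally allocated on that node. The helper names are hypothetical, and the code only sets CPU affinity: it does not set a memory policy, so it is a weaker guarantee than `numactl --membind` or `mbind()`.

```python
import os


def parse_cpulist(text):
    """Parse kernel cpulist syntax such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node, sysfs_root="/sys/devices/system/node"):
    """CPUs that belong to a NUMA node, read from sysfs."""
    with open(os.path.join(sysfs_root, f"node{node}", "cpulist")) as f:
        return parse_cpulist(f.read())


def pin_to_node(node, pid=0):
    """Restrict a process (pid 0 = the caller) to the CPUs of one NUMA node."""
    cpus = node_cpus(node)
    os.sched_setaffinity(pid, cpus)
    return cpus


if __name__ == "__main__":
    try:
        print("pinned to CPUs:", sorted(pin_to_node(0)))
    except FileNotFoundError:
        print("no NUMA topology exposed in sysfs on this system")
```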

Compiler Optimization

Month 15 Lab Exercise: Profile-Guided Optimization (PGO)

# Step 1: Compile with instrumentation
gcc -O2 -fprofile-generate -o server_pgo server.c

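# Step 2: Run representative workload
# The .gcda profile is written at normal process exit, so the server must
# handle SIGTERM and exit cleanly; otherwise no profile data is saved.
./server_pgo &
wrk -t 8 -c 200 -d 60s http://localhost:8080/
kill %1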

# Step 3: Compile with profile data
gcc -O2 -fprofile-use -fprofile-correction -o server_opt server.c

# Expected: 10–20% throughput improvement, mainly from better inlining, code layout, and branch ordering

Key Compiler Flags for Performance:

Flag	Effect	When to Use
`-O3`	Full optimization including auto-vectorization	Throughput-critical code
`-march=native`	Target-specific instructions (AVX2, etc.)	When binary runs on same CPU
`-flto`	Link-time optimization (cross-file inlining)	Always for production builds
`-fprofile-generate/-use`	PGO	After representative workload profiling
`-funroll-loops`	Unroll small loops	Tight inner loops (measure first)
`__builtin_expect`	Branch prediction hint (a compiler builtin, not a flag)	When a branch almost always goes the same way
`__attribute__((hot))`	Mark hot functions for layout	Functions in critical path

HFT-Grade Latency — Sub-100µs Target

Latency Budget Analysis:

Operation	Typical Cost	Optimization
L1 cache hit	1–4 ns	Structure data for locality
L3 cache hit	~10–20 ns (~40 cycles at 3 GHz; higher on large server parts)	Prefetch with `__builtin_prefetch`
DRAM access	60–100 ns	Minimize working set
Context switch	1–10 µs	Dedicated CPU core, no sharing
Kernel network stack	5–50 µs	DPDK bypass
IRQ latency (with C-states)	10–200 µs	Disable C-states
TCP round trip (local)	50–200 µs	UDP or shared memory
Mutex acquire (uncontended)	20–100 ns	Lock-free or RCU

Kernel Configuration for Minimum Latency:

CONFIG_PREEMPT=y          # Full kernel preemption
CONFIG_HZ=1000            # 1ms timer tick
CONFIG_NO_HZ_FULL=y       # Tickless for isolated CPUs
CONFIG_CPUSETS=y          # CPU isolation

Boot parameters:

isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 nosmt

Performance Bottleneck Cookbook

Systematic diagnosis guide for the five most common production bottlenecks:

Symptom	First Check	Second Check	Likely Fix
High CPU, low throughput	`perf top` — which functions?	Flame graph — hot path	Algorithmic optimization or SIMD
Low CPU, high latency	`bpftrace` off-CPU analysis	`strace` — which syscall blocks?	Reduce lock contention or I/O
Memory grows unbounded	`valgrind massif` or heaptrack	`/proc/<pid>/smaps`	Fix memory leak or tune allocator
Network throughput plateau	`sar -n DEV 1`: `%ifutil`	`ethtool -S`: ring drops	Increase ring buffers, RSS, or DPDK
Disk I/O saturation	`iostat -xz 1` — `%util=100`	`blktrace` — queue depth	Async I/O, larger block size, NVMe

Tools Reference Card

Category	Tool	Key Command	Output
CPU profiling	`perf record`	`perf record -F 99 -ag -- cmd`	Sampling profile
CPU profiling	`perf top`	`perf top -g`	Live function view
CPU profiling	`flamegraph`	Script pipeline	SVG flame graph
Tracing	`bpftrace`	`bpftrace -e 'kprobe:...'`	Aggregated stats
Tracing	`ftrace`	`echo function > /sys/kernel/tracing/current_tracer`	Function call log
Memory	`valgrind`	`valgrind --tool=massif`	Heap timeline
Memory	`perf mem`	`perf mem record/report`	Memory access stats
Network	`iperf3`	`iperf3 -c host -P 4`	Throughput
Network	`ss`	`ss -ntp`	Socket states
Disk	`iostat`	`iostat -xz 1`	I/O utilization
Disk	`blktrace`	`blktrace /dev/sda`	Block-level trace
Benchmarking	`wrk`	`wrk -t 8 -c 200 -d 30s url`	HTTP throughput
Benchmarking	`fio`	`fio --name=test --rw=randread --bs=4k --size=1G`	Storage benchmark

Success Criteria Summary

Phase	Key Checkpoints
Phase 1	Generated flame graph for a real process; reproduced all four microbenchmark effects with measured numbers
Phase 2	Applied USE method to diagnose a contrived problem in <15 min; read Systems Performance cover to cover
Phase 3	Instrumented a production service with bpftrace; scaled a server from 100K to 1M RPS
Phase 4	Built a DPDK-based packet forwarder; measured sub-10µs P99 latency on an isolated core; achieved PGO improvement

The core insight of performance engineering: measure first, hypothesize second, optimize third, measure again. Intuition is almost always wrong about where the bottleneck actually lives.