02 — CPU Performance
Technical Overview
CPU performance is not a single metric—it is the product of clock frequency, instructions per cycle (IPC), and how efficiently those instructions map onto the underlying microarchitecture. A program running at 3 GHz with IPC 1.0 executes 3 billion instructions per second; the same program on the same CPU achieving IPC 3.5 executes 10.5 billion. Modern x86 microarchitectures (Intel Golden Cove, AMD Zen 4) have theoretical peak IPC well above 4 due to superscalar out-of-order execution, yet most production workloads achieve IPC between 0.8 and 2.5. The gap represents optimization opportunity.
Understanding CPU performance requires reasoning simultaneously about:
- Frontend efficiency: fetching, decoding, and feeding instructions to execution units.
- Backend efficiency: execution units, latency vs. throughput, instruction-level parallelism.
- Memory subsystem: how cache misses stall the pipeline.
- Power and frequency: turbo boost, thermal throttling, and how they interact with benchmarks.
Prerequisites
- x86-64 instruction set basics (registers, addressing modes).
- Understanding of pipeline stages (fetch, decode, execute, retire).
- Linux perf tool basics.
- Concepts: cache hierarchy, NUMA, SMT/Hyperthreading.
Core Content
Key CPU Performance Metrics
IPC (Instructions Per Cycle) is the most important indicator of microarchitectural efficiency. Values:
| IPC Range | Interpretation |
|---|---|
| < 0.5 | Memory-bound or severe branch mispredictions |
| 0.5–1.5 | Typical mixed workload |
| 1.5–3.0 | Compute-efficient; good vectorization |
| > 3.0 | Excellent; likely SIMD-heavy |
CPI (Cycles Per Instruction) = 1/IPC. Useful for expressing stall cost: if each LLC miss adds ~200 cycles and misses do not overlap, a loop with one miss per 10 instructions sees its CPI grow by 200/10 = 20.
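The stall arithmetic can be written as a small model. This is a minimal sketch, assuming misses are fully serialized (no overlap from out-of-order execution or prefetching), so it gives an upper bound; the function and parameter names are illustrative, not taken from any tool.

```c
#include <assert.h>

/* Effective CPI when every miss stalls the pipeline for the full penalty.
 * Assumption: misses never overlap, so this is a worst-case estimate. */
double effective_cpi(double base_cpi, double misses_per_instr,
                     double miss_penalty_cycles)
{
    assert(base_cpi > 0.0 && misses_per_instr >= 0.0);
    return base_cpi + misses_per_instr * miss_penalty_cycles;
}

/* IPC is the reciprocal of CPI. */
double cpi_to_ipc(double cpi)
{
    assert(cpi > 0.0);
    return 1.0 / cpi;
}
```

For example, effective_cpi(0.5, 0.1, 200.0) models a loop with a base CPI of 0.5 and one 200-cycle miss per 10 instructions; the miss term alone contributes 20 cycles per instruction.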
FLOPS (Floating-Point Operations Per Second): relevant for HPC and ML. A single 512-bit AVX-512 FMA instruction performs 8 double-precision multiply-adds, i.e. 16 floating-point operations. A core with two 512-bit FMA units can issue two such instructions per cycle, yielding 32 FLOPs/cycle/core; at 3 GHz, that is a 96 GFLOPS/core theoretical peak.
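The peak figure follows from vector width, element size, and the number of FMA units. The sketch below assumes two 512-bit FMA units per core (true of some, not all, AVX-512 parts) and counts each fused multiply-add as two floating-point operations; the function names are illustrative.

```c
#include <assert.h>

/* Theoretical peak FLOPs per cycle for one core.
 * lanes = vector_bits / element_bits; each FMA lane counts as 2 FLOPs. */
double peak_flops_per_cycle(int vector_bits, int element_bits, int fma_units)
{
    assert(element_bits > 0 && vector_bits % element_bits == 0);
    int lanes = vector_bits / element_bits;
    return 2.0 * lanes * fma_units;
}

/* Peak GFLOPS for one core at a given clock (GHz). */
double peak_gflops(double ghz, int vector_bits, int element_bits, int fma_units)
{
    return ghz * peak_flops_per_cycle(vector_bits, element_bits, fma_units);
}
```

With 512-bit vectors, 64-bit doubles, and two FMA units, peak_flops_per_cycle returns 32, and peak_gflops(3.0, 512, 64, 2) returns 96. Sustained throughput is normally lower because of memory stalls and frequency limits.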
PMU: Performance Monitoring Unit
Every modern CPU exposes hardware performance counters via the PMU. On x86, the PMU includes:
- Fixed counters (always present): INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD, CPU_CLK_UNHALTED.REF_TSC.
- Programmable counters (4–8 per logical core): configurable to count thousands of microarchitectural events.
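On Linux these counters can also be read from inside a program through the perf_event_open(2) system call. The following is a minimal sketch modeled on the pattern in the man page; count_instructions and busy_loop are hypothetical helper names, and the call returns -1 wherever the counter is unavailable (restrictive perf_event_paranoid, containers, VMs without a virtual PMU).

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc provides no wrapper for perf_event_open, so call it via syscall(2). */
static int perf_open(struct perf_event_attr *attr)
{
    /* pid = 0, cpu = -1: count for the calling thread on any CPU. */
    return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0UL);
}

/* Hypothetical helper: run work() and return the user-space instructions
 * retired while it ran, or -1 if the counter could not be opened or read. */
long long count_instructions(void (*work)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;       /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1; /* count user-space only */
    attr.exclude_hv = 1;

    int fd = perf_open(&attr);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof(count));
    close(fd);
    return n == (ssize_t)sizeof(count) ? (long long)count : -1;
}

/* Example workload: a short arithmetic loop. */
void busy_loop(void)
{
    volatile unsigned long acc = 0;
    for (unsigned long i = 0; i < 1000000UL; i++)
        acc += i;
}
```

Calling count_instructions(busy_loop) brackets just that function, which is useful when perf stat over the whole process would be too coarse. Swapping the config value for PERF_COUNT_HW_CPU_CYCLES counts cycles instead.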
List available counters:
perf list | grep -E "cache|branch|cycle|instruction" | head -40
Count PMU events for a running process over a 10-second window:
perf stat -e cycles,instructions,cache-misses,cache-references,\
branch-misses,branches,LLC-loads,LLC-load-misses \
-p <pid> sleep 10
Typical output interpretation:
Performance counter stats for process 1234:
24,103,456,789 cycles
18,501,234,567 instructions # IPC: 0.77 — memory-bound
1,203,456 cache-misses # cache miss rate: 10.0% of cache-references, high
12,034,561 cache-references
891,234 branch-misses
42,103,456 branches # misprediction rate: 2.1%
5,201,345 LLC-loads
1,203,456 LLC-load-misses # LLC miss rate: 23% — severe
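The annotations in that output are simple ratios of the raw counts. A minimal sketch of the arithmetic, with illustrative struct and function names and the sample counts hard-coded:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Raw counts as reported by perf stat (field names are illustrative). */
struct perf_counts {
    uint64_t cycles, instructions;
    uint64_t cache_refs, cache_misses;
    uint64_t branches, branch_misses;
    uint64_t llc_loads, llc_load_misses;
};

/* Derived ratios; rates are fractions in the range 0..1. */
struct perf_rates {
    double ipc, cache_miss_rate, branch_miss_rate, llc_miss_rate;
};

/* Returns 0 for an empty denominator instead of dividing by zero. */
static double ratio(uint64_t num, uint64_t den)
{
    return den ? (double)num / (double)den : 0.0;
}

struct perf_rates derive_rates(const struct perf_counts *c)
{
    assert(c != NULL);
    struct perf_rates r = {
        .ipc = ratio(c->instructions, c->cycles),
        .cache_miss_rate = ratio(c->cache_misses, c->cache_refs),
        .branch_miss_rate = ratio(c->branch_misses, c->branches),
        .llc_miss_rate = ratio(c->llc_load_misses, c->llc_loads),
    };
    return r;
}

/* The counts from the sample output above. */
struct perf_counts sample_counts(void)
{
    struct perf_counts c = {
        .cycles = 24103456789ULL, .instructions = 18501234567ULL,
        .cache_refs = 12034561ULL, .cache_misses = 1203456ULL,
        .branches = 42103456ULL, .branch_misses = 891234ULL,
        .llc_loads = 5201345ULL, .llc_load_misses = 1203456ULL,
    };
    return c;
}
```

For the sample counts this gives an IPC of about 0.77, a branch misprediction rate of about 2.1%, and an LLC load miss rate of about 23%, matching the annotations.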
Cache Miss Analysis
Cache misses are often the dominant source of CPU stalls. A core can sustain ~3–4 instructions/cycle when its working set is hot in L1; throughput falls to a fraction of that while it waits on DRAM.
LLC miss rate = LLC-load-misses / LLC-loads. Above 10% is problematic.
DRAM bandwidth consumption can be measured with Intel PCM or perf stat -e uncore_imc/data_reads/,uncore_imc/data_writes/:
# Intel PCM bandwidth tool
pcm-memory 1 # print per-channel bandwidth every second
Perf performance counter cheat sheet:
| Counter Event | What It Measures | Concern Threshold |
|---|---|---|
| cycles | Core CPU cycles | n/a |
| instructions | Retired instructions | IPC < 0.5 |
| cache-misses | Cache misses (usually last-level cache) | > 1% miss rate |
| LLC-load-misses | Last-level cache load misses | > 10% LLC miss rate |
| dTLB-load-misses | Data TLB misses | > 0.5% |
| iTLB-load-misses | Instruction TLB misses | > 0.1% |
| branch-misses | Branch mispredictions | > 1% |
| branches | Total branches | n/a |
| mem-loads | Memory loads (with PEBS) | n/a |
| mem-stores | Memory stores (with PEBS) | n/a |
| cpu-cycles | Core cycles (alias of cycles; count reflects actual frequency) | Prefer over task-clock |
| task-clock | CPU time used by the task, in ms | Measures time, not cycles |
Branch Misprediction
Modern CPUs use multi-level, history-based predictors (reportedly from the TAGE family in recent Intel and AMD cores) to guess branch direction before the condition is resolved. A misprediction incurs a pipeline flush: ~15–20 cycles on modern microarchitectures.
perf stat -e branches,branch-misses ./program
# Misprediction rate = branch-misses / branches
# Rates above 2% are significant
Common sources:
- Data-dependent branches on unpredictable data (hash table probing, binary search over large key sets).
- Indirect calls (virtual dispatch, function pointers): use PGO or devirtualization.
- Conditional branches on random input, where branchless alternatives are worth considering:
// Branchy form (mispredicts often if x is random)
if (x > 0) result = a; else result = b;
// Select form: compilers often, but not always, emit cmov for this
result = (x > 0) ? a : b; // verify in the generated assembly
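When the compiler does not emit cmov on its own, the selection can be expressed as arithmetic so that the source contains no conditional jump. This is a sketch of one common idiom, not a guarantee about code generation: the compiler is still free to choose instructions, so the assembly should be checked. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns a if x > 0, else b. mask is all ones when x > 0 and all zeros
 * otherwise, so exactly one of the two operands survives the AND. */
int32_t select_positive(int32_t x, int32_t a, int32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(x > 0);
    return (int32_t)(((uint32_t)a & mask) | ((uint32_t)b & ~mask));
}

/* Counts positive elements by adding the comparison result (0 or 1)
 * instead of branching on it. */
size_t count_positive(const int32_t *v, size_t n)
{
    assert(v != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(v[i] > 0);
    return count;
}
```

The branchless form does a fixed amount of work per element, so it only wins when the branch really is unpredictable; on sorted or mostly uniform data the branchy version is usually as fast or faster.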
Instruction Mix Analysis: SIMD Utilization
Scalar throughput processes one element per instruction; AVX2 processes 8 floats (256-bit); AVX-512 processes 16 floats (512-bit). SIMD utilization measures how much of this width is used.
# Check SIMD events (Intel)
perf stat -e fp_arith_inst_retired.scalar_single,\
fp_arith_inst_retired.128b_packed_single,\
fp_arith_inst_retired.256b_packed_single,\
fp_arith_inst_retired.512b_packed_single \
./program
Caveat on AVX-512: On some Intel CPUs (Skylake-X, Ice Lake), AVX-512 instructions cause frequency downclocking (because of higher power draw). This can make SIMD slower if the vector computation doesn't sufficiently amortize the frequency penalty. Always benchmark, never assume.
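Whether the wider vectors amortize the frequency penalty can be estimated with an Amdahl-style model before benchmarking. The sketch below assumes the lower clock applies to all code on the core, not only to the vector instructions; the function name and parameters are illustrative.

```c
#include <assert.h>

/* Estimated whole-program speedup when part of the runtime is vectorized
 * more widely but the core runs at a lower clock.
 *   vec_fraction: share of baseline runtime in vectorizable code (0..1)
 *   vec_speedup:  per-cycle speedup of that code (e.g. 2.0)
 *   freq_ratio:   new frequency / old frequency (e.g. 2.8 / 3.8)
 * Assumption: the reduced clock slows every instruction equally. */
double net_speedup(double vec_fraction, double vec_speedup, double freq_ratio)
{
    assert(vec_fraction >= 0.0 && vec_fraction <= 1.0);
    assert(vec_speedup > 0.0 && freq_ratio > 0.0);
    double relative_cycles = (1.0 - vec_fraction) + vec_fraction / vec_speedup;
    return freq_ratio / relative_cycles;
}
```

With a 2x vector speedup and a drop from 3.8 GHz to 2.8 GHz, a program that spends 30% of its time in vector code comes out slower overall (result below 1.0). Under this model roughly half the runtime must be vectorized before the change breaks even.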
CPU Frequency Scaling
P-states (Performance states) are CPU frequency/voltage operating points managed by the OS governor or the CPU's Hardware P-state (HWP) mechanism.
Critical measurement trap: perf stat reports task-clock as CPU time in milliseconds, not as CPU cycles. If turbo boost is active and the CPU runs at 4.5 GHz instead of the nominal 3.0 GHz, the same work finishes in fewer milliseconds, so task-clock differences between runs can reflect frequency changes rather than code changes. Use cpu-cycles (hardware counter) for apples-to-apples comparisons:
# Bad: task-clock affected by frequency
perf stat -e task-clock ./program
# Good: cpu-cycles is the actual hardware counter
perf stat -e cpu-cycles,instructions ./program
# Pin frequency for reproducible benchmarks (3000MHz is an example value)
cpupower frequency-set -g performance
cpupower frequency-set -d 3000MHz -u 3000MHz   # clamp min and max to the same value
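Dividing cycles by task-clock gives the average frequency the core actually ran at, which is a quick check that two runs are comparable. A minimal sketch with an illustrative function name:

```c
#include <assert.h>
#include <stdint.h>

/* Average core frequency in GHz over a run.
 * task_clock_ms * 1e6 converts milliseconds to nanoseconds, and
 * cycles per nanosecond equals GHz. */
double effective_ghz(uint64_t cycles, double task_clock_ms)
{
    assert(task_clock_ms > 0.0);
    return (double)cycles / (task_clock_ms * 1e6);
}
```

For example, 9 billion cycles over 2000 ms of task-clock is an average of 4.5 GHz. If this value differs between two benchmark runs, compare cycle counts rather than elapsed time.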
Turbo boost behavior:
Nominal: 3.0 GHz (all cores loaded)
Turbo: 4.5 GHz (single core, thermal headroom allows)
AVX-512: 3.2 GHz (power limit reduces boost)
NUMA Effects on CPU Performance
Non-Uniform Memory Access: each CPU socket has local DRAM banks (fast, ~70 ns) and remote DRAM banks (slow, ~120–140 ns). On a 4-socket system, cross-socket latency can be 3x local.
# Identify NUMA topology
numactl --hardware
# Measure remote vs local bandwidth
numactl --cpunodebind=0 --membind=1 ./bandwidth_test # remote
numactl --cpunodebind=0 --membind=0 ./bandwidth_test # local
# Per-node memory placement of a process
numastat -p <pid>
The numa_miss counter in system-wide numastat output counts pages that were allocated on a node other than the one the process preferred, usually because the preferred node was short of memory. A steadily growing value is a strong signal to use numactl --membind or mbind(2) in the application.
SMT / Hyperthreading Performance Effects
SMT (Simultaneous Multi-Threading, marketed as HyperThreading by Intel) presents two logical CPUs per physical core. Both share:
- Execution units (ALUs, FPUs, load/store units)
- L1 and L2 caches
- Branch predictor
- TLB
When both logical CPUs are active on the same physical core, per-thread IPC drops because the threads contend for execution units and evict each other's cache lines. Typical IPC reduction: 20–40% per thread.
Physical Core (2 logical CPUs)
┌─────────────────────────────────┐
│ Thread 0 │ Thread 1 │
│ (HT0) │ (HT1) │
│─────────────────────────────────│
│ Shared Execution Units │
│ 4 ALUs, 2 FPUs, 2 Load/Store │
│─────────────────────────────────│
│ Shared L1 Cache (48 KB) │
│ Shared L2 Cache (1.25 MB) │
└─────────────────────────────────┘
For latency-sensitive workloads (HFT, real-time audio), disable SMT:
echo off > /sys/devices/system/cpu/smt/control
For throughput workloads with high cache miss rates (streaming analytics), SMT helps hide memory latency.
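The trade-off can be made concrete: a per-thread IPC loss still leaves the core as a whole ahead, as long as the loss is below 50%. A minimal sketch, assuming two identical threads that each lose the same fraction of their solo IPC (the function name is illustrative):

```c
#include <assert.h>

/* Combined throughput of two SMT siblings relative to one thread running
 * alone on the core, given the per-thread IPC loss (0.2 means 20%). */
double smt_core_throughput(double per_thread_ipc_loss)
{
    assert(per_thread_ipc_loss >= 0.0 && per_thread_ipc_loss <= 1.0);
    return 2.0 * (1.0 - per_thread_ipc_loss);
}
```

A 20–40% per-thread loss corresponds to roughly 1.6x down to 1.2x core throughput. This is why SMT tends to help throughput-oriented services while hurting per-request latency.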
Compiler Optimizations
| Flag | Effect | Use Case |
|---|---|---|
| -O2 | Inlining, loop optimization, no unsafe math | Default production |
| -O3 | Aggressive vectorization, loop unrolling (unsafe math still requires -ffast-math) | HPC, verified safe |
| -march=native | Use all CPU features of build host | Single-machine deployment |
| -march=x86-64-v3 | AVX2 baseline (Haswell+) | Cloud deployments |
| -fprofile-generate | Instrument binary for PGO | Phase 1 of PGO |
| -fprofile-use | Compile with PGO data | Phase 2 of PGO; 10–20% gain |
| -flto | Link-Time Optimization; cross-TU inlining | Full program optimization |
Profile-Guided Optimization (PGO) workflow:
# Phase 1: Build with instrumentation
gcc -O2 -fprofile-generate -o app_instr app.c
# Phase 2: Run with representative workload
./app_instr < production_trace.txt
# Phase 3: Compile with profile data
gcc -O2 -fprofile-use -fprofile-correction -o app_pgo app.c
PGO teaches the compiler which branches are hot (enabling better layout, inlining decisions) and which loops are executed frequently (enabling better vectorization).
SIMD Vectorization: AVX-512 Throughput
AVX-512 on Intel Skylake-X and later:
- 512-bit FMA (fused multiply-add): 16 floats or 8 doubles per instruction; cores with two FMA units can issue two per cycle.
- Gather/scatter instructions: load/store non-contiguous data.
- Mask registers (k0–k7): predicated execution, avoids branches in loops.
Auto-vectorization with GCC/Clang:
// Auto-vectorizable loop (no aliasing, known bounds)
void scale(float *restrict a, const float *restrict b, float s, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * s;
}
Check vectorization:
gcc -O3 -march=native -fopt-info-vec-optimized -c scale.c
# Look for: "loop vectorized" in output
# Inspect generated assembly
objdump -d scale.o | grep -E "vmul|vadd|vfma"
Historical Context
The CPU performance measurement discipline matured during the RISC wars of the 1990s (SPEC CPU92, SPEC CPU95) when marketing teams drove "benchmarketing" to absurdity—using specially hand-tuned library builds and proprietary compilers that bore no relation to production deployments. The SPEC committee responded with increasingly strict run rules.
Intel's Pentium 4 (NetBurst, 2000–2006) pushed clock frequency to 3.8 GHz at the cost of IPC: the later Prescott cores had a 31-stage pipeline, so a branch misprediction wasted roughly 31 cycles. The Core 2 (2006) returned to shorter pipelines and far better IPC, a lesson in multi-dimensional performance that still informs microarchitecture today.
The emergence of PMU-based profiling tools (oprofile 2002, perf 2009) democratized microarchitectural analysis.
Production Examples
Case: Unexplained 30% regression after kernel upgrade. A C++ real-time trading application regressed after upgrading from kernel 4.19 to 5.15. perf stat showed IPC dropped from 2.1 to 1.5. LLC-load-misses nearly doubled. Investigation: the new kernel changed the NUMA page migration policy; threads were now allocated memory on a remote NUMA node. Fix: set numactl --membind=0 in the service startup script. IPC restored to 2.0.
Case: AVX-512 frequency downclocking causing latency spikes. A video encoding service was deployed on Skylake-X instances. p99 latency spiked 40% after enabling AVX-512 in the encoder. Root cause: the ratio of cpu-cycles to task-clock in perf stat showed the CPU running at 2.8 GHz instead of the expected 3.8 GHz turbo. Fix: switched to the AVX2 code path. The up-to-2x throughput advantage of AVX-512 over AVX2 applied only to the vectorized kernels, while the 26% frequency reduction slowed everything running on those cores. Lesson: always verify clock frequency when switching SIMD widths.
Debugging Notes
- perf stat -r 5 runs the benchmark 5 times and reports mean ± stddev; always use this.
- Use perf annotate to see which specific assembly instructions are consuming cycles.
- toplev.py (from Intel's pmu-tools) implements the Top-Down Microarchitecture Analysis (TMA) methodology: it decomposes cycles into Frontend Bound, Backend Bound, Bad Speculation, and Retiring categories.
- When LLC miss rate is high, use perf mem record + perf mem report to identify the specific load instructions causing misses.
- perf c2c identifies cache lines with false sharing (contended cache lines between cores).
Security Implications
PMU access can leak information: performance counters expose cache and branch behavior that side-channel attacks otherwise infer through timing (FLUSH+RELOAD, PRIME+PROBE). The perf_event_paranoid sysctl controls access for users without CAP_PERFMON:
-1: Allow nearly all perf events for all users (dangerous)
0: Allow CPU-wide events, but not raw tracepoint access
1: Allow per-process events only, including kernel profiling
2: Allow user-space measurements only (mainline kernel default; recommended minimum for production)
3: Disallow all perf events for unprivileged users (non-mainline patch carried by some distributions, e.g. Debian)
SMT side-channels: when two tenants share a physical core via SMT, the branch predictor, cache, and execution ports can leak information. Cloud providers now offer dedicated cores (AWS bare metal, GCP sole-tenant nodes) to mitigate this.
Performance Implications
Branch misprediction overhead: 15–20 cycles per misprediction. At 1 billion branches/second with a 5% misprediction rate, this wastes 0.75–1.0 billion cycles per second, roughly 25–33% of a 3 GHz core. Branchless code, CMOV, and sorted data structures (enabling predictable branches) are remedies.
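The overhead estimate is a product of four numbers and is easy to recompute for a specific workload. A minimal sketch, assuming every misprediction costs the full flush penalty and that the core clock is known (the 3 GHz used in the example is an assumption); the function name is illustrative.

```c
#include <assert.h>

/* Fraction of a core's cycles lost to branch mispredictions.
 * Assumption: each misprediction costs the full penalty_cycles. */
double mispredict_cycle_fraction(double branches_per_sec, double miss_rate,
                                 double penalty_cycles, double core_hz)
{
    assert(core_hz > 0.0);
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return branches_per_sec * miss_rate * penalty_cycles / core_hz;
}
```

With 1 billion branches per second, a 5% misprediction rate, and an assumed 3 GHz core, the function returns 0.25 for a 15-cycle penalty and about 0.33 for a 20-cycle penalty.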
False sharing: two variables on the same 64-byte cache line written by different cores invalidate each other's L1 cache copies via MESI protocol. The symptom is high cache-misses even though each core writes a different variable. Pad to 64 bytes:
#include <stdint.h>

struct padded_counter {
    uint64_t value;
    char padding[56]; // 8 + 56 = 64 bytes = one cache line
} __attribute__((aligned(64)));
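The same layout can be written in standard C11 and checked at compile time. This sketch swaps the GCC attribute for alignas, assumes a 64-byte cache line, and uses illustrative names (line_counter, per_thread_counters, cache_line_of).

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed line size; typical for current x86-64 parts */

/* alignas raises the struct's alignment to a full line, and the compiler
 * pads its size to match, so no manual padding array is needed. */
struct line_counter {
    alignas(CACHE_LINE) uint64_t value;
};

static_assert(sizeof(struct line_counter) == CACHE_LINE,
              "each counter must occupy exactly one cache line");

/* One slot per thread: slot i is written only by thread i. */
struct line_counter per_thread_counters[4];

/* Cache-line index of an address; two objects can false-share only if
 * their indices are equal. */
uintptr_t cache_line_of(const void *p)
{
    return (uintptr_t)p / CACHE_LINE;
}
```

Because each array element starts on its own line, cache_line_of returns a different value for every slot, so writes from different threads no longer invalidate each other's lines. The cost is memory: 64 bytes per 8-byte counter.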
Failure Modes and Real Incidents
Intel Spectre v2 (2018): Branch Target Injection exploits the branch predictor shared between privilege levels. The Branch History Buffer and Branch Target Buffer are not flushed on privilege transitions (pre-mitigation), allowing ring-3 code to influence ring-0 speculation. Mitigation: retpoline replaces indirect jumps with a construct that defeats speculation; IBPB flushes the predictor at context switches. Performance cost: 5–15% on heavily syscall-bound workloads.
Skylake-SP AVX-512 downclocking (2017). Linus Torvalds famously called AVX-512 "a power virus" in 2020 because heavy AVX-512 use on Skylake-generation parts lowers the core's frequency, which also slows unrelated code scheduled on that core and harms co-located workloads. Later Intel generations reduced the penalty but did not eliminate it.
Modern Usage
Continuous CPU profiling with Pyroscope (Grafana), Polar Signals (Parca), or Datadog Continuous Profiler delivers always-on flame graphs in production with < 2% overhead via eBPF-based stack sampling.
BOLT (Binary Optimization and Layout Tool): post-link optimizer from Meta that uses profiling data to reorder functions and basic blocks in the binary for better i-cache utilization. Reported 5–10% CPU improvement on production services at Meta.
Hardware PMU in VMs: AWS Graviton3 and Intel Xeon in EC2 expose PMU counters to guests, enabling perf stat inside VMs.
Future Directions
- Chiplet architectures (AMD EPYC, Intel Sapphire Rapids) introduce additional latency tiers between chiplets—a new dimension in NUMA-like analysis.
- ARM Neoverse microarchitectures are increasingly common in cloud (AWS Graviton, Ampere Altra); their PMU events differ from x86, requiring tool adaptation.
- Intel AMX (Advanced Matrix Extensions): hardware accelerator tiles for matrix multiply, delivering TFLOPS on-die—blurring the line between CPU and accelerator profiling.
Exercises
- Run perf stat -e cycles,instructions,LLC-loads,LLC-load-misses,branches,branch-misses ./your_program and compute IPC, LLC miss rate, and misprediction rate. Interpret whether the program is frontend-bound, backend-bound, or memory-bound.
- Write two versions of a loop-heavy function: one with predictable branches (sorted input) and one with random branches (shuffled input). Use perf stat -e branch-misses to quantify the difference.
- Use toplev.py --level 2 -a sleep 5 (requires pmu-tools) on a running workload. Identify the primary bottleneck category and explain what hardware event corresponds to it.
- Demonstrate false sharing: create two threads each incrementing a counter in a shared struct. Measure with perf c2c. Then pad the struct to 64 bytes and measure again.
- Enable PGO for a compute-intensive C program. Profile with a representative workload, recompile with -fprofile-use, and benchmark the speedup. Explain which optimizations PGO enabled by diffing the assembly.
References
- Gregg, B. Systems Performance (2nd ed., 2020). Chapter 6: CPUs.
- Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual (2023). https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
- Akinshin, A. Pro .NET Benchmarking. Apress, 2019. (Methodology applies beyond .NET.)
- Yasin, A. "Top-Down Microarchitecture Analysis." https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/top-down-microarchitecture-analysis-method.html
- pmu-tools / toplev: https://github.com/andikleen/pmu-tools
- BOLT: Panchenko, M. et al. "BOLT: A Practical Binary Optimizer for Data Centers and Beyond." CGO 2019.
- Torvalds, L. "Re: [GIT PULL] x86/avx512 for v5.8." LKML, 2020.