02 — CPU Performance

Technical Overview

CPU performance is not a single metric—it is the product of clock frequency, instructions per cycle (IPC), and how efficiently those instructions map onto the underlying microarchitecture. A program running at 3 GHz with IPC 1.0 executes 3 billion instructions per second; the same program on the same CPU achieving IPC 3.5 executes 10.5 billion. Modern x86 microarchitectures (Intel Golden Cove, AMD Zen 4) have theoretical peak IPC well above 4 due to superscalar out-of-order execution, yet most production workloads achieve IPC between 0.8 and 2.5. The gap represents optimization opportunity.

Understanding CPU performance requires reasoning simultaneously about:

- Frontend efficiency: fetching, decoding, and feeding instructions to execution units.
- Backend efficiency: execution units, latency vs. throughput, instruction-level parallelism.
- Memory subsystem: how cache misses stall the pipeline.
- Power and frequency: turbo boost, thermal throttling, and how they interact with benchmarks.

Prerequisites

x86-64 instruction set basics (registers, addressing modes).
Understanding of pipeline stages (fetch, decode, execute, retire).
Linux perf tool basics.
Concepts: cache hierarchy, NUMA, SMT/Hyperthreading.

Core Content

Key CPU Performance Metrics

IPC (Instructions Per Cycle) is the most important indicator of microarchitectural efficiency. IPC counts instructions, not work, so well-vectorized code can show a lower IPC while finishing sooner. Rough interpretation bands:

IPC Range	Interpretation
< 0.5	Memory-bound or severe branch mispredictions
0.5–1.5	Typical mixed workload
1.5–3.0	Compute-efficient; few stalls
> 3.0	Excellent; close to the core's sustained issue width

CPI (Cycles Per Instruction) = 1/IPC. Useful for expressing stall cost: if each LLC miss adds ~200 cycles and a loop takes one miss per 10 instructions, CPI grows by about 20 (200 / 10), assuming the misses do not overlap.
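
The stall-cost arithmetic above can be sketched as a small model. It assumes misses are fully serialized (no memory-level parallelism), so it gives an upper bound on the added CPI; the function name and the base CPI of 0.5 are illustrative assumptions, not values from a measurement.

```python
def effective_cpi(base_cpi: float, misses_per_instruction: float,
                  miss_penalty_cycles: float) -> float:
    """CPI including stall cycles, assuming every miss stalls the pipeline
    for the full penalty (no overlap between misses)."""
    return base_cpi + misses_per_instruction * miss_penalty_cycles


if __name__ == "__main__":
    base = 0.5  # assumed CPI with no LLC misses (IPC 2.0), for illustration
    cpi = effective_cpi(base, misses_per_instruction=1 / 10,
                        miss_penalty_cycles=200)
    print(f"CPI {cpi:.1f}, IPC {1 / cpi:.3f}")
```

With one miss per 10 instructions the stall term (20 cycles) dwarfs the base CPI, which is why such a loop lands far below IPC 0.5.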

FLOPS (Floating-Point Operations Per Second): relevant for HPC and ML. A single AVX-512 FMA instruction performs 8 double-precision multiply-adds (16 FLOPs). A core with two 512-bit FMA units can issue two such instructions per cycle, yielding 32 FLOPs/cycle/core; at 3 GHz, that is 96 GFLOPS/core theoretical peak.
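
The peak-FLOPS figure can be reproduced with a few lines of arithmetic. This sketch assumes 64-bit elements in a 512-bit register (8 lanes), two FMA units per core, and a 3 GHz clock; the function names are illustrative.

```python
def peak_flops_per_cycle(vector_bits: int, element_bits: int,
                         fma_units: int) -> int:
    """Theoretical FLOPs per cycle per core: each FMA lane counts as
    two floating-point operations (one multiply, one add)."""
    lanes = vector_bits // element_bits
    return lanes * 2 * fma_units


def peak_gflops(freq_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak GFLOPS for one core at the given clock."""
    return freq_ghz * flops_per_cycle


if __name__ == "__main__":
    per_cycle = peak_flops_per_cycle(512, 64, fma_units=2)
    print(per_cycle, "FLOPs/cycle/core")
    print(peak_gflops(3.0, per_cycle), "GFLOPS/core at 3 GHz")
```

Real workloads rarely approach this number; it is a ceiling for roofline-style reasoning, not a target.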

PMU: Performance Monitoring Unit

Every modern CPU exposes hardware performance counters via the PMU. On x86, the PMU includes:

Fixed counters (always present): INST_RETIRED.ANY, CPU_CLK_UNHALTED.THREAD, CPU_CLK_UNHALTED.REF_TSC.
Programmable counters (4–8 per logical core): configurable to count thousands of microarchitectural events.

List available counters:

perf list | grep -E "cache|branch|cycle|instruction" | head -40

Count PMU events for a running process over 10 seconds:

perf stat -e cycles,instructions,cache-misses,cache-references,\
branch-misses,branches,LLC-loads,LLC-load-misses \
-p <pid> sleep 10

Typical output interpretation:

Performance counter stats for process 1234:

 24,103,456,789  cycles
 18,501,234,567  instructions    # IPC: 0.77 — memory-bound
      1,203,456  cache-misses    # cache miss rate: high
     12,034,561  cache-references
        891,234  branch-misses
     42,103,456  branches        # misprediction rate: 2.1%
      5,201,345  LLC-loads
      1,203,456  LLC-load-misses # LLC miss rate: 23% — severe
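
The ratios in the annotated output can be derived mechanically from the raw counts. The sketch below parses the simplified sample shown above and computes IPC and miss rates; real perf stat output may carry event suffixes such as :u or lines like "<not counted>", which this parser simply skips. The function names are illustrative.

```python
# Counts copied from the sample output above.
SAMPLE = """\
 24,103,456,789  cycles
 18,501,234,567  instructions    # IPC: 0.77
      1,203,456  cache-misses
     12,034,561  cache-references
        891,234  branch-misses
     42,103,456  branches
      5,201,345  LLC-loads
      1,203,456  LLC-load-misses
"""


def parse_counts(text: str) -> dict:
    """Map event name -> integer count for lines shaped 'COUNT  EVENT'."""
    counts = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing annotations
        if len(fields) >= 2 and fields[0].replace(",", "").isdigit():
            counts[fields[1]] = int(fields[0].replace(",", ""))
    return counts


def derive_metrics(c: dict) -> dict:
    """Derive the ratios discussed in the text from raw counter values."""
    return {
        "ipc": c["instructions"] / c["cycles"],
        "cache_miss_rate": c["cache-misses"] / c["cache-references"],
        "branch_miss_rate": c["branch-misses"] / c["branches"],
        "llc_load_miss_rate": c["LLC-load-misses"] / c["LLC-loads"],
    }


if __name__ == "__main__":
    for name, value in derive_metrics(parse_counts(SAMPLE)).items():
        print(f"{name}: {value:.3f}")
```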

Cache Miss Analysis

Cache misses are the dominant source of CPU stalls. The pipeline executes at ~3–4 instructions/cycle when hot in L1; it stalls to a fraction of that when waiting on DRAM.

LLC miss rate = LLC-load-misses / LLC-loads. Above 10% is problematic.

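DRAM bandwidth consumption can be measured with Intel PCM or with uncore memory-controller events, for example perf stat -a -e uncore_imc/data_reads/,uncore_imc/data_writes/ (uncore events are counted system-wide, and event names vary by CPU model; check perf list):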

# Intel PCM bandwidth tool
pcm-memory 1  # print per-channel bandwidth every second

Perf performance counter cheat sheet:

Counter Event	What It Measures	Concern Threshold
`cycles`	Raw CPU cycles	—
`instructions`	Retired instructions	IPC < 0.5
`cache-misses`	Last-level cache misses (generic event; exact mapping is CPU-specific)	> 1% miss rate
`LLC-load-misses`	Last-level cache load misses	> 10% LLC miss rate
`dTLB-load-misses`	Data TLB misses	> 0.5%
`iTLB-load-misses`	Instruction TLB misses	> 0.1%
`branch-misses`	Branch mispredictions	> 1%
`branches`	Total branches	—
`mem-loads`	Memory loads (with PEBS)	—
`mem-stores`	Memory stores (with PEBS)	—
`cpu-cycles`	Core cycles (same event as `cycles`; count depends on actual frequency)	Prefer over `task-clock`
`task-clock`	CPU time the task was running, in ms	Does NOT reflect frequency changes

Branch Misprediction

Modern CPUs use history-based branch predictors (TAGE-style designs are documented or widely reported for recent AMD and Intel cores) to guess branch direction before the condition is resolved. A misprediction incurs a pipeline flush: ~15–20 cycles on modern microarchitectures.

perf stat -e branches,branch-misses ./program
# Misprediction rate = branch-misses / branches
# Rates above roughly 1-2% are significant

Common sources:

- Data-dependent branches on unpredictable data (hash table probing, binary search over large key sets).
- Indirect calls (virtual dispatch, function pointers); use PGO or devirtualization.
- Conditional branches on random input; consider branchless alternatives:

// Branchy form (mispredicts often if x is random)
if (x > 0) result = a; else result = b;

// Ternary form: compilers often emit cmov for this, but may also do so
// for the if/else above, so inspect the generated assembly
result = (x > 0) ? a : b;

Instruction Mix Analysis: SIMD Utilization

Scalar code processes one element per instruction; an AVX2 instruction processes 8 floats (256-bit); an AVX-512 instruction processes 16 floats (512-bit). SIMD utilization measures how much of this width is used.

# Check SIMD events (Intel)
perf stat -e fp_arith_inst_retired.scalar_single,\
fp_arith_inst_retired.128b_packed_single,\
fp_arith_inst_retired.256b_packed_single,\
fp_arith_inst_retired.512b_packed_single \
./program

Caveat on AVX-512: On some Intel CPUs (Skylake-X, Ice Lake), AVX-512 instructions cause frequency downclocking (because of higher power draw). This can make SIMD slower if the vector computation doesn't sufficiently amortize the frequency penalty. Always benchmark, never assume.
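
The break-even point in the caveat above can be estimated with an Amdahl-style model. This is a sketch under two loud assumptions: runtime scales inversely with clock frequency, and the whole program runs at the reduced clock (pessimistic, since real hardware can return to a higher frequency after the vector code finishes). The function and parameter names are illustrative.

```python
def net_speedup(vector_fraction: float, vector_speedup: float,
                freq_ratio: float) -> float:
    """Overall speedup versus the baseline build.

    vector_fraction: share of baseline runtime that gets vectorized (0..1)
    vector_speedup:  speedup of that share at equal clock (e.g. 2.0)
    freq_ratio:      reduced clock / baseline clock (e.g. 0.85)
    """
    # Relative runtime at equal clock, then stretched by the slower clock.
    new_time = ((1 - vector_fraction) + vector_fraction / vector_speedup) / freq_ratio
    return 1 / new_time


if __name__ == "__main__":
    for fraction in (0.2, 0.8):
        s = net_speedup(fraction, vector_speedup=2.0, freq_ratio=0.85)
        print(f"vectorized share {fraction:.0%}: net speedup {s:.2f}x")
```

Under these assumptions a program that is only 20% vectorizable ends up slower (about 0.94x), while one that is 80% vectorizable still gains (about 1.42x). A model like this is a reason to measure, not a substitute for measuring.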

CPU Frequency Scaling

P-states (Performance states) are CPU frequency/voltage operating points managed by the OS governor or the CPU's Hardware P-state (HWP) mechanism.

Critical measurement trap: perf stat reports task-clock in milliseconds of CPU time, not CPU cycles. If turbo boost is active and the CPU runs at 4.5 GHz instead of the nominal 3.0 GHz, the same task-clock interval covers 50% more cycles, so task-clock underestimates work done. Always use cpu-cycles (hardware counter) for apples-to-apples comparisons:

# Bad: task-clock affected by frequency
perf stat -e task-clock ./program

# Good: cpu-cycles is the actual hardware counter
perf stat -e cpu-cycles,instructions ./program

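# Pin frequency for reproducible benchmarks (requires root)
cpupower frequency-set -g performance
# Clamp min and max to the same value; -f only works with the userspace governor
cpupower frequency-set -d 3000MHz -u 3000MHz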

Turbo boost behavior (illustrative values for one CPU model):

Nominal: 3.0 GHz (all cores loaded)
Turbo:   4.5 GHz (single core, thermal headroom allows)
AVX-512: 3.2 GHz (power limit reduces boost)
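
The relationship between cycle counts and task-clock gives the average clock the task actually ran at, which is a quick way to detect turbo or downclocking in a measurement. A minimal sketch, assuming both values come from the same perf stat run; the function name and the example counts are illustrative.

```python
def effective_ghz(cycles: int, task_clock_ms: float) -> float:
    """Average core frequency while the task was on-CPU.

    cycles / seconds gives Hz; with task-clock in milliseconds this is
    cycles / (ms * 1e-3) / 1e9 = cycles / (ms * 1e6) in GHz.
    """
    return cycles / (task_clock_ms * 1e6)


if __name__ == "__main__":
    # Two hypothetical runs with identical task-clock but different clocks.
    print(effective_ghz(9_000_000_000, 2000.0), "GHz")  # turbo
    print(effective_ghz(6_000_000_000, 2000.0), "GHz")  # nominal
```

perf stat prints the same ratio as a GHz annotation next to the cycles line when task-clock is also counted, so this calculation is mainly useful when post-processing saved counter data.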

NUMA Effects on CPU Performance

Non-Uniform Memory Access: each CPU socket has local DRAM (fast, ~70 ns), while DRAM attached to another socket is remote (slow, ~120–140 ns). On a 4-socket system, cross-socket latency can be 3x local.

# Identify NUMA topology
numactl --hardware

# Measure remote vs local bandwidth
numactl --cpunodebind=0 --membind=1 ./bandwidth_test  # remote
numactl --cpunodebind=0 --membind=0 ./bandwidth_test  # local

# Per-node memory placement of a process (shows how much sits on remote nodes)
numastat -p <pid>

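The numa_miss counter in system-wide numastat output (run without -p) counts pages that were allocated on a node even though the process preferred a different one, typically because the preferred node had no free memory. A growing value is a strong signal to use numactl --membind or mbind(2) in the application.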
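
The cost of remote placement can be estimated by blending the local and remote latencies quoted at the start of this section. This sketch assumes accesses that miss all caches, a remote latency of 130 ns (the midpoint of the quoted range), and no queuing effects; the function name and defaults are illustrative.

```python
def average_dram_latency_ns(remote_fraction: float,
                            local_ns: float = 70.0,
                            remote_ns: float = 130.0) -> float:
    """Mean DRAM access latency when remote_fraction of accesses
    go to memory attached to another socket."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(f"{fraction:.0%} remote: {average_dram_latency_ns(fraction):.0f} ns")
```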

SMT / Hyperthreading Performance Effects

SMT (Simultaneous Multi-Threading, marketed as HyperThreading by Intel) presents two logical CPUs per physical core. Both share:

- Execution units (ALUs, FPUs, load/store units)
- L1 and L2 caches
- Branch predictor
- TLB

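When both logical CPUs are active on the same physical core, per-thread IPC drops because the threads contend for execution units and evict each other's cache lines. Typical IPC reduction: 20–40% per thread, although the core's combined throughput is usually higher than with one thread.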

Physical Core (2 logical CPUs)
┌─────────────────────────────────┐
│  Thread 0  │  Thread 1          │
│  (HT0)     │  (HT1)             │
│─────────────────────────────────│
│  Shared Execution Units         │
│  4 ALUs, 2 FPUs, 2 Load/Store   │
│─────────────────────────────────│
│  Shared L1 Cache (48 KB)        │
│  Shared L2 Cache (1.25 MB)      │
└─────────────────────────────────┘

For latency-sensitive workloads (HFT, real-time audio), disable SMT:

echo off > /sys/devices/system/cpu/smt/control

For throughput workloads with high cache miss rates (streaming analytics), SMT helps hide memory latency.
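
The trade-off between per-thread slowdown and core throughput follows directly from the 20–40% per-thread figure. A minimal sketch, assuming both siblings run the same workload and lose the same share of their solo IPC; the names and the solo IPC of 2.0 are illustrative.

```python
def smt_core_ipc(solo_ipc: float, per_thread_drop: float,
                 threads: int = 2) -> float:
    """Combined IPC of one physical core when each SMT sibling retains
    (1 - per_thread_drop) of the IPC it would reach running alone."""
    return threads * solo_ipc * (1 - per_thread_drop)


def per_thread_slowdown(per_thread_drop: float) -> float:
    """Factor by which one thread's runtime stretches under SMT."""
    return 1 / (1 - per_thread_drop)


if __name__ == "__main__":
    for drop in (0.2, 0.4):
        core = smt_core_ipc(2.0, drop)
        print(f"drop {drop:.0%}: core IPC {core:.1f} "
              f"({core / 2.0:.1f}x one thread), "
              f"each thread {per_thread_slowdown(drop):.2f}x slower")
```

The same numbers explain both recommendations: throughput rises by 1.2x to 1.6x, while each individual thread takes 1.25x to 1.67x longer, which is what latency-sensitive workloads cannot afford.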

Compiler Optimizations

Flag	Effect	Use Case
`-O2`	Inlining, loop optimization, no unsafe math	Default production
`-O3`	More aggressive vectorization, loop transformations, and inlining; can grow code size	HPC; verify the gain over `-O2`
`-march=native`	Use all CPU features of build host	Single-machine deployment
`-march=x86-64-v3`	AVX2 baseline (Haswell+)	Cloud deployments
`-fprofile-generate`	Instrument binary for PGO	Phase 1 of PGO
`-fprofile-use`	Compile with PGO data	Phase 3 of PGO (after the training run); 10–20% gain
`-flto`	Link-Time Optimization; cross-TU inlining	Full program optimization

Profile-Guided Optimization (PGO) workflow:

# Phase 1: Build with instrumentation
gcc -O2 -fprofile-generate -o app_instr app.c

# Phase 2: Run with representative workload
./app_instr < production_trace.txt

# Phase 3: Compile with profile data
gcc -O2 -fprofile-use -fprofile-correction -o app_pgo app.c

PGO teaches the compiler which branches are hot (enabling better layout, inlining decisions) and which loops are executed frequently (enabling better vectorization).

SIMD Vectorization: AVX-512 Throughput

AVX-512 on Intel Skylake-X and later:

- 512-bit FMA (fused multiply-add): 16 floats or 8 doubles per instruction.
- Gather/scatter instructions: load/store non-contiguous data.
- Mask registers (k0–k7): predicated execution, avoids branches in loops.

Auto-vectorization with GCC/Clang:

// Auto-vectorizable loop (no aliasing, known bounds)
void scale(float *restrict a, const float *restrict b, float s, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * s;
}

Check vectorization:

gcc -O3 -march=native -fopt-info-vec-optimized -c scale.c
# Look for: "loop vectorized" in output

# Inspect generated assembly
objdump -d scale.o | grep -E "vmul|vadd|vfma"

Historical Context

The CPU performance measurement discipline matured during the RISC wars of the 1990s (SPEC CPU92, SPEC CPU95) when marketing teams drove "benchmarketing" to absurdity—using specially hand-tuned library builds and proprietary compilers that bore no relation to production deployments. The SPEC committee responded with increasingly strict run rules.

Intel's Pentium 4 (Netburst, 2000–2006) pushed clock frequency to 3.8 GHz at the cost of IPC (the late Prescott core's pipeline had 31 stages, so a branch misprediction could waste roughly 31 cycles). The Core 2 (2006) returned to shorter pipelines and far better IPC, a lesson in multi-dimensional performance that still informs microarchitecture today.

The emergence of PMU-based profiling tools (oprofile 2002, perf 2009) democratized microarchitectural analysis.

Production Examples

Case: Unexplained 30% regression after kernel upgrade. A C++ real-time trading application regressed after upgrading from kernel 4.19 to 5.15. perf stat showed IPC dropped from 2.1 to 1.5. LLC-load-misses nearly doubled. Investigation: the new kernel changed the NUMA page migration policy; threads were now allocated memory on a remote NUMA node. Fix: set numactl --membind=0 in the service startup script. IPC restored to 2.0.

Case: AVX-512 frequency downclocking causing latency spikes. A video encoding service deployed on Skylake-X instances. p99 latency spiked 40% after enabling AVX-512 in the encoder. Root cause: perf stat showed cpu-cycles diverging from task-clock—the CPU was running at 2.8 GHz instead of the expected 3.8 GHz turbo. Fix: switched to AVX2 code path. The 2x throughput improvement of AVX-512 over AVX2 was negated by the 26% frequency reduction. Lesson: always verify clock frequency when switching SIMD widths.

Debugging Notes

perf stat -r 5 runs the benchmark 5 times and reports mean ± stddev—always use this.
Use perf annotate to see which specific assembly instructions are consuming cycles.
toplev.py (from Intel's pmu-tools) implements the Top-Down Microarchitecture Analysis (TMA) methodology—it decomposes cycles into Frontend Bound, Backend Bound, Bad Speculation, and Retiring categories.
When LLC miss rate is high, use perf mem record + perf mem report to identify the specific load instructions causing misses.
perf c2c identifies cache lines with false sharing (contended cache lines between cores).

Security Implications

PMU access can leak information: performance counters give an unprivileged observer a fine-grained view of cache and branch behavior, which complements timing-based cache attacks such as FLUSH+RELOAD and PRIME+PROBE. The perf_event_paranoid sysctl controls access:

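-1: Allow use of almost all events by all users (dangerous)
 0: Disallow raw tracepoint and ftrace function tracepoint access for users without CAP_PERFMON
 1: Also disallow CPU-wide event access for users without CAP_PERFMON
 2: Also disallow kernel profiling for users without CAP_PERFMON; they can still measure their own user-space code (upstream kernel default)
 3: Disallow all perf events for unprivileged users (distribution patch, e.g. Debian and Ubuntu; not in upstream kernels)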

SMT side-channels: when two tenants share a physical core via SMT, the branch predictor, cache, and execution ports can leak information. Cloud providers now offer dedicated cores (AWS bare metal, GCP sole-tenant nodes) to mitigate this.

Performance Implications

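Branch misprediction overhead: 15–20 cycles per misprediction. At 1 billion branches/second with a 5% misprediction rate, that is 50 million flushes and 0.75–1.0 billion lost cycles per second, or 25–33% of the cycle budget of a 3 GHz core if none of the penalty overlaps with useful work. Branchless code, CMOV, and sorted data structures (enabling predictable branches) are remedies.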
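
The share of cycles lost to mispredictions depends heavily on the assumed clock rate and on how much of the flush penalty overlaps with other work. The sketch below makes those inputs explicit; it assumes no overlap, and the 3 GHz clock is an illustrative assumption rather than a figure from a measurement.

```python
def mispredict_cycle_fraction(branches_per_second: float,
                              mispredict_rate: float,
                              penalty_cycles: float,
                              freq_hz: float) -> float:
    """Fraction of available cycles spent on misprediction flushes,
    assuming each flush costs the full penalty."""
    lost_cycles_per_second = branches_per_second * mispredict_rate * penalty_cycles
    return lost_cycles_per_second / freq_hz


if __name__ == "__main__":
    for penalty in (15, 20):
        share = mispredict_cycle_fraction(1e9, 0.05, penalty, freq_hz=3e9)
        print(f"{penalty}-cycle penalty: {share:.0%} of cycles at 3 GHz")
```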

False sharing: two variables on the same 64-byte cache line written by different cores invalidate each other's L1 cache copies via MESI protocol. The symptom is high cache-misses even though each core writes a different variable. Pad to 64 bytes:

#include <stdint.h>

struct padded_counter {
    uint64_t value;
    char padding[56];  // total = 64 bytes = one cache line
} __attribute__((aligned(64)));
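
The layout arithmetic behind the padding can be checked without compiling anything. This sketch mirrors the struct with Python's ctypes and compares which 64-byte line each field falls on; it is a layout check only, not a false-sharing benchmark. The class names are hypothetical, and offsets are relative to the start of the array, so the base address is assumed to be 64-byte aligned as the C attribute guarantees.

```python
import ctypes

CACHE_LINE = 64  # bytes, as stated in the text


class PaddedCounter(ctypes.Structure):
    """Mirror of struct padded_counter: 8-byte value plus 56 bytes of padding."""
    _fields_ = [("value", ctypes.c_uint64),
                ("padding", ctypes.c_char * 56)]


class PackedCounters(ctypes.Structure):
    """Two unpadded counters side by side (the false-sharing layout)."""
    _fields_ = [("a", ctypes.c_uint64),
                ("b", ctypes.c_uint64)]


def cache_line_index(offset: int) -> int:
    """Index of the cache line containing a byte offset from an aligned base."""
    return offset // CACHE_LINE


if __name__ == "__main__":
    print("unpadded:", cache_line_index(PackedCounters.a.offset),
          cache_line_index(PackedCounters.b.offset))
    stride = ctypes.sizeof(PaddedCounter)
    print("padded:  ", cache_line_index(0 * stride), cache_line_index(1 * stride))
```

In the unpadded layout both counters share line 0; with padding, consecutive array elements land on different lines.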

Failure Modes and Real Incidents

Spectre v2 (2018): Branch Target Injection exploits the branch predictor shared between privilege levels. The Branch History Buffer and Branch Target Buffer are not flushed on privilege transitions (pre-mitigation), allowing ring-3 code to influence ring-0 speculation. Mitigation: retpoline replaces indirect jumps with a construct that defeats speculation; IBPB flushes the predictor at context switches. Performance cost: 5–15% on heavily syscall-bound workloads.

The Skylake-SP (2017) AVX-512 downclocking discovery. Linus Torvalds famously called AVX-512 "a power virus" in 2020, reflecting the experience that heavy AVX-512 instructions on Skylake-era parts forced the core to a lower frequency and consumed shared power budget, harming co-located workloads. Later generations (Ice Lake onward) reduced the penalty but did not eliminate it.

Modern Usage

Continuous CPU profiling with Pyroscope (Grafana), Polar Signals (Parca), or Datadog Continuous Profiler delivers always-on flame graphs in production with < 2% overhead via eBPF-based stack sampling.

BOLT (Binary Optimization and Layout Tool): post-link optimizer from Meta that uses profiling data to reorder functions and basic blocks in the binary for better i-cache utilization. Reported 5–10% CPU improvement on production services at Meta.

Hardware PMU in VMs: some EC2 instance types (for example AWS Graviton3 and recent Intel Xeon instances) expose PMU counters to guests, enabling perf stat inside VMs.

Future Directions

Chiplet architectures (AMD EPYC, Intel Sapphire Rapids) introduce additional latency tiers between chiplets—a new dimension in NUMA-like analysis.
ARM Neoverse microarchitectures are increasingly common in cloud (AWS Graviton, Ampere Altra); their PMU events differ from x86, requiring tool adaptation.
Intel AMX (Advanced Matrix Extensions): hardware accelerator tiles for matrix multiply, delivering TFLOPS on-die—blurring the line between CPU and accelerator profiling.

Exercises

Run perf stat -e cycles,instructions,LLC-loads,LLC-load-misses,branches,branch-misses ./your_program and compute IPC, LLC miss rate, and misprediction rate. Interpret whether the program is frontend-bound, backend-bound, or memory-bound.
Write two versions of a loop-heavy function: one with predictable branches (sorted input) and one with random branches (shuffled input). Use perf stat -e branch-misses to quantify the difference.
Use toplev.py --level 2 -a sleep 5 (requires pmu-tools) on a running workload. Identify the primary bottleneck category and explain what hardware event corresponds to it.
Demonstrate false sharing: create two threads each incrementing a counter in a shared struct. Measure with perf c2c. Then pad the struct to 64 bytes and measure again.
Enable PGO for a compute-intensive C program. Profile with a representative workload, recompile with -fprofile-use, and benchmark the speedup. Explain which optimizations PGO enabled by diffing the assembly.

References

Gregg, B. Systems Performance (2nd ed., 2020). Chapter 6: CPUs.
Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual (2023). https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
Akinshin, A. Pro .NET Benchmarking. Apress, 2019. (Methodology applies beyond .NET.)
Intel. "Top-down Microarchitecture Analysis Method." VTune Profiler Performance Analysis Cookbook. https://www.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/methodologies/top-down-microarchitecture-analysis-method.html
pmu-tools / toplev: https://github.com/andikleen/pmu-tools
BOLT: Panchenko, M. et al. "BOLT: A Practical Binary Optimizer for Data Centers and Beyond." CGO 2019.
Torvalds, L. "Re: [GIT PULL] x86/avx512 for v5.8." LKML, 2020.