CPU Pipeline Deep Dive: Superscalar Out-of-Order Execution

Technical Overview

Modern high-performance CPUs implement a superscalar out-of-order (OoO) execution engine that can issue roughly 4–6 instructions per cycle while dynamically reordering them around data hazards. Intel's Golden Cove microarchitecture (Alder Lake P-core, 2021) and AMD's Zen 4 (Ryzen 7000/EPYC Genoa, 2022) represent recent state-of-the-art x86-64 superscalar OoO designs. Understanding the pipeline at this depth requires following the entire instruction lifecycle: fetch → decode → rename → dispatch → issue → execute → writeback → retire. Each stage relies on hardware mechanisms refined over decades of engineering.

Prerequisites

Familiarity with x86-64 ISA: instruction encoding, registers, addressing modes
Understanding of data hazards: RAW, WAW, WAR
Basic knowledge of cache hierarchies (L1/L2/L3)
Understanding of branch prediction concepts
Familiarity with SIMD (SSE/AVX) instruction extensions

Core Content

Overview: Pipeline Stages

Modern superscalar OoO CPUs separate the pipeline into a front-end (fetching and decoding instructions) and a back-end (executing and retiring them). The back-end executes instructions out of program order, while retirement always proceeds in order, maintaining the architectural state's correctness.

Intel Golden Cove / AMD Zen 4 Pipeline:

FRONTEND:
  ┌───────────────────────────────────────────────────────────────────────┐
  │  L1-I Cache  →  Fetch  →  Pre-Decode  →  Instruction Queue / Buffer  │
  │                   │                                                   │
  │  Branch Predictor  ─────────────────────────────────────────────────▶ │
  │  (predicts next PC before decode)                                    │
  │                                                                      │
  │  Decode (up to 6 uops/cycle) → Decoded Stream Buffer (DSB) →        │
  │  Loop Stream Detector → Micro-Op Queue (IDQ)                         │
  └───────────────────────────────────────────────────────────────────────┘
                   ↓ (up to 6 uops/cycle)
BACKEND:
  ┌───────────────────────────────────────────────────────────────────────┐
  │  Register Alias Table (RAT) / Rename                                  │
  │  (eliminate false WAR/WAW dependencies by assigning physical regs)    │
  │                   ↓                                                   │
  │  Reorder Buffer (ROB)   [320 entries on Zen 4 / 512 on Golden Cove]   │
  │  (tracks all in-flight instructions in program order)                 │
  │                   ↓                                                   │
  │  Scheduler / Reservation Station (RS)  [~96 entries]                  │
  │  (waits for operands; issues when ready)                              │
  │                   ↓                                                   │
  │  Execution Ports / Units (Golden Cove):                               │
  │  ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐      │
  │  │P0  │P1  │P2  │P3  │P4  │P5  │P6  │P7  │P8  │P9  │P10 │P11 │      │
  │  │ALU │ALU │Load│Load│St  │ALU │ALU │St  │St  │St  │ALU │Load│      │
  │  │FMA │MUL │AGU │AGU │Data│Shuf│JMP │AGU │AGU │Data│LEA │AGU │      │
  │  └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘      │
  │                   ↓                                                   │
  │  Load/Store Queue (LSQ) — tracks memory ops, detects forwarding       │
  │  Store Buffer — holds uncommitted stores visible to younger loads     │
  │                   ↓                                                   │
  │  Retirement / ROB Head — commit in program order                     │
  └───────────────────────────────────────────────────────────────────────┘

Frontend: Instruction Fetch and Decode

L1 Instruction Cache: Intel Golden Cove: 32 KB 8-way L1-I. AMD Zen 4: 32 KB 8-way L1-I. Fetch bandwidth: 16–32 bytes per cycle (128–256 bits). The branch predictor runs alongside fetch, predicting the next PC before the current fetch bundle has even been decoded.

Pre-decode: Identifies instruction boundaries (x86 instructions are variable-length, 1–15 bytes). Pre-decoder finds instruction starts to enable parallel decoding. This is a significant complexity in x86 vs fixed-length RISC architectures.

Decoder: Golden Cove: 6-wide decode (up to 6 instructions per cycle). Zen 4: 4-wide decode (4 instructions/cycle), backed by an op cache that can supply more ops per cycle than the decoders. Each decoder translates x86 instructions into micro-ops (µops). Simple instructions (ADD, MOV) decode to 1 µop. Instructions that need a few µops (typically 2–4) are handled by the decoders themselves; longer sequences (for example REP MOVS or CPUID) are supplied by the microcode ROM.

Decoded Stream Buffer (DSB, the µop cache): Golden Cove caches roughly 4K decoded µops in the DSB (up from 1,536 on Skylake). When the same code repeats, µops are served from the DSB, bypassing the complex x86 decoders entirely. This is a major reason hot loops sustain high front-end throughput.

Loop Stream Detector (LSD): Detects small loops that fit in the µop queue (up to 64 µops on Skylake; Golden Cove has a comparable mechanism). While a loop is locked, its µops are replayed directly from the IDQ, so fetch, decode, and the DSB are not accessed at all.

Micro-op Queue (IDQ): 144 entries on Golden Cove (72 per thread when SMT is active). It buffers µops between frontend and backend, absorbing frontend bubbles.

Frontend bandwidth ceiling:
- Ideal: 6 µops/cycle (Golden Cove) × 3.9 GHz = 23.4 billion µops/second
- Practical: 4–5 µops/cycle on average (decode penalties, mispredictions)

Backend: Register Renaming

The false dependency problem: Consider:

ADD RAX, RBX    ; writes RAX (architectural register)
MOV RCX, RDX
ADD RAX, RCX    ; writes RAX — must wait for first ADD (true RAW)
MOV RAX, R8     ; writes RAX — false dependency on second ADD (WAW)

The last instruction has a Write-After-Write dependency on RAX, but semantically it doesn't need to wait—it's just writing a new value that overwrites the previous one. Register renaming eliminates false dependencies.

Physical Register File (PRF): The processor has many more physical registers than the 16 architectural x86 registers. Golden Cove: 280 integer physical registers, 332 vector (FP/SIMD) physical registers. Zen 4: 224 integer physical registers, 192 vector physical registers.

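Register Alias Table (RAT): Maps architectural register → physical register. At rename:
1. Look up each source register in the RAT to get its current physical register.
2. Allocate a new physical register for each destination and update the RAT.
3. Record the old mapping in the ROB so it can be freed at retirement (or restored on a flush).
4. From this point on, the µop carries only physical register IDs.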

Before rename → after rename:
  ADD RAX(arch), RBX(arch)   → ADD p100, p25  (p100 = new phys reg for RAX)
  MOV RAX(arch), R8(arch)    → MOV p101, p80  (p101 = new phys reg for RAX)

After rename, no false dependency exists between ADD and MOV.
The MOV can execute and complete without waiting for the ADD.
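The renaming steps can be sketched in a few lines of Python. This is a toy model, not how any real core implements its RAT: the function name `rename`, the unbounded free list, and the physical register names (`p0`, `p1`, ...) are assumptions made for illustration. It renames the four-instruction sequence from the false-dependency example.

```python
from itertools import count

def rename(program, arch_regs):
    """Rename (op, dest, sources) tuples; returns the renamed list and the final RAT."""
    fresh = count()                                    # hypothetical unbounded free list
    rat = {r: f"p{next(fresh)}" for r in arch_regs}    # initial arch -> phys mapping
    renamed = []
    for op, dest, srcs in program:
        phys_srcs = [rat[s] for s in srcs]   # sources read the current mapping
        old = rat[dest]                      # previous mapping, freed when this op retires
        rat[dest] = f"p{next(fresh)}"        # every write gets a fresh physical register
        renamed.append((op, rat[dest], phys_srcs, old))
    return renamed, rat

program = [
    ("ADD", "RAX", ["RAX", "RBX"]),
    ("MOV", "RCX", ["RDX"]),
    ("ADD", "RAX", ["RAX", "RCX"]),
    ("MOV", "RAX", ["R8"]),
]
renamed, rat = rename(program, ["RAX", "RBX", "RCX", "RDX", "R8"])
for op, dest, srcs, old in renamed:
    print(f"{op} {dest} <- {', '.join(srcs)}   (frees {old} at retirement)")
```

In the output, the final MOV reads only the physical register that holds R8 and writes a register no other instruction uses, so nothing ties it to the second ADD. The second ADD still reads the registers written by the first ADD and the first MOV, which are its true RAW dependencies.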

Reorder Buffer (ROB)

The ROB is a circular buffer tracking all in-flight µops in program order. Every renamed µop is assigned an ROB entry.

ROB sizes:
- Intel Golden Cove: 512 entries
- AMD Zen 4: 320 entries (entries track macro-ops, so the count is not directly comparable with Intel's µop-based ROB)
- Intel Skylake: 224 entries (for comparison)

ROB entry state:
- Instruction PC, µop type
- Physical destination register number
- Execution status (ready/complete/pending)
- Exception/fault flags

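Why larger ROB = better performance: A larger ROB lets the CPU look further ahead in the instruction stream for independent operations. While the oldest instruction is stalled, the window keeps filling at the sustained rate, so a 512-entry ROB at an IPC of 4 covers about 512/4 = 128 cycles of latency. That fully hides an L2 or L3 hit (tens of cycles) but only part of a DRAM access (300+ cycles on modern CPUs), which is why memory-level parallelism and prefetching still matter.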
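The window arithmetic is easy to reproduce. The sketch below is a first-order estimate only: it treats every ROB entry as one instruction and assumes a steady sustained IPC, and the helper name `window_cycles` is made up for this example. The ROB sizes are the ones listed above.

```python
def window_cycles(rob_entries, sustained_ipc):
    """Cycles of latency the ROB can cover before it fills behind a stalled head."""
    return rob_entries / sustained_ipc

for name, rob in [("Skylake", 224), ("Zen 4", 320), ("Golden Cove", 512)]:
    print(f"{name:12s} {window_cycles(rob, 4):6.0f} cycles at IPC 4")
```

Lowering the assumed IPC lengthens the covered window, since the ROB fills more slowly, but that is not a performance win: the machine is simply doing less work per cycle.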

Scheduler / Reservation Station (RS)

The RS holds µops that have been renamed but whose operands are not yet available.

Tomasulo's algorithm (1967, IBM 360/91): A µop is "ready" when all of its source physical registers have been written, i.e., the producing µops have completed execution (they need not have retired). The RS monitors result broadcasts on the Common Data Bus (CDB) and wakes up waiting µops when their sources arrive.

RS operation:

Cycle 1: µop enters RS. Source operands checked:
  - Source A available (register file read): ready
  - Source B unavailable (waiting for MUL latency): not ready

Cycle 2-4: MUL executes (3-cycle latency)

Cycle 4: MUL result broadcast on CDB. RS entry woken up.

Cycle 5: µop scheduled to an execution port (if port is free)

Cycle 6: µop begins execution

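RS sizes: Golden Cove: 97 entries in the main scheduler, with separate scheduling queues for loads and stores. Zen 4: four 24-entry integer schedulers (96 total), with separate schedulers for FP/SIMD. The RS is smaller than the ROB because a µop leaves the RS once it issues for execution but stays in the ROB until it retires.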

Age-priority scheduling: When multiple ready µops compete for the same port, oldest-first scheduling is preferred to maintain forward progress and prevent starvation.
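Wakeup and oldest-first select can be modeled with a short loop. The following is a simplified sketch, not a description of any shipping scheduler: it assumes registers are already renamed (each destination tag is written once), one µop per port per cycle, and results that become usable exactly `lat` cycles after issue. The dictionary keys and the tags `p1`...`p13` are hypothetical.

```python
def schedule(uops):
    """uops: list of dicts in program order (list index = age). Returns {name: issue cycle}."""
    produced = {u["dest"] for u in uops}   # tags written by some in-flight uop
    ready_at = {}                          # tag -> cycle its result is broadcast
    waiting = list(uops)                   # program order, so iteration is oldest-first
    issue_cycle = {}
    cycle = 0
    while waiting:
        busy_ports = set()
        still_waiting = []
        for u in waiting:
            # Wakeup: a source is ready if no in-flight uop produces it,
            # or if its producer's result has already been broadcast.
            ready = all(s not in produced or ready_at.get(s, float("inf")) <= cycle
                        for s in u["srcs"])
            if ready and u["port"] not in busy_ports:
                busy_ports.add(u["port"])                  # select: oldest ready uop wins the port
                issue_cycle[u["name"]] = cycle
                ready_at[u["dest"]] = cycle + u["lat"]     # schedule the result broadcast
            else:
                still_waiting.append(u)
        waiting = still_waiting
        cycle += 1
    return issue_cycle

uops = [
    {"name": "mul",  "dest": "p10", "srcs": ["p1", "p2"],  "port": 1, "lat": 3},
    {"name": "add1", "dest": "p11", "srcs": ["p10", "p3"], "port": 0, "lat": 1},
    {"name": "add2", "dest": "p12", "srcs": ["p4", "p5"],  "port": 0, "lat": 1},
    {"name": "add3", "dest": "p13", "srcs": ["p6", "p7"],  "port": 0, "lat": 1},
]
print(schedule(uops))
```

Here `add2` and `add3` are both ready in cycle 0 and compete for port 0; the older `add2` wins and `add3` issues a cycle later. `add1` is the oldest of the three adds but waits for the multiply result, so younger independent work overtakes it.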

Execution Units

Intel Golden Cove execution ports:

Port	Units
0	ALU, shift, branch, SIMD-int, FMA, DIV, crypto
1	ALU, MUL, SIMD-int, FMA, FADD
2	Load AGU
3	Load AGU
4	Store data
5	ALU, SIMD-shuffle, FADD
6	ALU, shift, branch
7	Store AGU
8	Store AGU
9	Store data
10	ALU, LEA
11	Load AGU

AMD Zen 4 execution units:
- 4× integer ALU (multiply and divide are each handled by a dedicated ALU)
- 3× AGU (up to 3 memory operations per cycle, at most 2 of them stores)
- FP/SIMD pipes with 256-bit datapaths: 2 FMA/FMUL, 2 FADD, plus store/convert pipes
- 2× branch units

Execution latencies (Zen 4 / Golden Cove, approximate):

| Operation | Latency (cycles) | Throughput (ops/cycle) |
|-----------|------------------|------------------------|
| INT ADD | 1 | 4 |
| INT MUL | 3 | 1 |
| FP ADD (F64) | 3 | 2 |
| FP MUL (F64) | 4 | 2 |
| FMA | 4 | 2 |
| DIV (64-bit) | 21–74 | 1/21 |
| SQRTSD | 15–16 | 1/15 |
| L1 load | 4–5 | 2 |
| L2 load | 12–14 | 1 |
| L3 load | 40–50 | n/a |
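Latency and throughput together determine how many independent dependency chains a loop needs before the execution units stay busy. The sketch below applies that rule of thumb to the FMA row of the table (4-cycle latency, 2 per cycle). The function names are illustrative, and the model ignores everything except the two table values.

```python
import math

def chains_needed(latency_cycles, per_cycle_throughput):
    """Independent dependency chains required to keep the units busy."""
    return math.ceil(latency_cycles * per_cycle_throughput)

def cycles_for(n_ops, latency_cycles, per_cycle_throughput, chains):
    """Approximate cycles for n_ops split evenly across `chains` independent chains."""
    latency_bound = (n_ops / chains) * latency_cycles    # each chain is serialized by latency
    throughput_bound = n_ops / per_cycle_throughput      # the units can do no better than this
    return max(latency_bound, throughput_bound)

fma_latency, fma_throughput = 4, 2    # values taken from the table above
print("accumulators needed:", chains_needed(fma_latency, fma_throughput))
print("1 accumulator :", cycles_for(1024, fma_latency, fma_throughput, 1), "cycles")
print("8 accumulators:", cycles_for(1024, fma_latency, fma_throughput, 8), "cycles")
```

This is why a reduction written with a single accumulator runs at the latency bound, while unrolling it across several accumulators approaches the throughput bound.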

Load/Store Subsystem

The load/store subsystem is perhaps the most complex part of the modern OoO CPU. It must appear to execute memory operations in program order while actually executing them out of order.

Load queue (LQ): Tracks all in-flight loads. When a load issues, it checks the store buffer for older stores to the same address (store-to-load forwarding). LQ size: Golden Cove 192 entries, Zen 4 88 entries.

Store buffer (SB): Stores are held here from execution until after retirement, when they are written to the L1D cache and become globally visible. A younger load on the same core can read from the SB before the store retires (forwarding), so a store is visible to that core's own later loads before it is committed to L1D. SB size: Golden Cove 114 entries, Zen 4 64 entries.

Memory disambiguation: At issue time, the CPU may not know the address of an earlier store (the address calculation may not have completed). The CPU must either stall the load (conservative) or speculatively execute the load assuming no conflict (aggressive). On a conflict (later detected by memory ordering violation checks), the load and all subsequent instructions must be replayed.

Store-to-load forwarding pipeline:

Store buffer:  [addr: 0x1000, data: 42, age: old]
Load  µop:     [addr: 0x1000, requesting data]
                     │
         LSU detects address match in SB
                     │
         Forward data=42 to load consumer
         (data comes from the SB, not from L1D)
         Latency: roughly that of an L1 hit (~4-5 cycles)
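The store buffer lookup can be sketched as a search from the youngest older store backwards. This is a deliberately simplified model under stated assumptions: all store addresses are already known, forwarding succeeds only when a single store fully covers the load, and any partial overlap stalls the load. Real cores have further, generation-specific restrictions. The function name `lookup` and the tuple layout are made up for this example.

```python
def lookup(store_buffer, load_addr, load_size):
    """store_buffer: list of (addr, size, data_bytes), oldest first, all older than the load.
    Returns ("forward", bytes), ("stall", None) or ("cache", None)."""
    load_end = load_addr + load_size
    for addr, size, data in reversed(store_buffer):       # youngest older store first
        store_end = addr + size
        if load_end <= addr or store_end <= load_addr:
            continue                                       # no overlap, keep searching
        if addr <= load_addr and load_end <= store_end:    # store fully covers the load
            offset = load_addr - addr
            return "forward", data[offset:offset + load_size]
        return "stall", None                               # partial overlap: forwarding fails
    return "cache", None                                   # no matching store: read L1D

sb = [(0x1000, 4, (42).to_bytes(4, "little"))]
print(lookup(sb, 0x1000, 4))   # same address and size
print(lookup(sb, 0x1000, 8))   # load is wider than the store
print(lookup(sb, 0x2000, 4))   # unrelated address
```

The second call models the case where a 4-byte store is followed by an 8-byte load from the same address: the store holds only half of the bytes the load needs, so the load cannot be satisfied from the buffer alone.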

Retirement

The ROB head instruction retires when:
1. It is at the head of the ROB (in-order retirement)
2. Its execution has completed
3. No exception/fault occurred

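Retirement bandwidth: Golden Cove: 6 µops/cycle retirement. Retirement makes results architecturally visible (the committed register mapping is updated), frees the physical registers that held overwritten values, and releases stores to drain from the store buffer into the cache.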

Exceptions at retirement: Even if an instruction executed out-of-order, exceptions are reported precisely (in order). The ROB ensures that no younger instruction is retired before the faulting instruction, and the architectural state is consistent at the point of the fault. This is the key contract of OoO execution: same as in-order from a software perspective.

Speculative instructions: Instructions beyond a predicted branch are executed speculatively. If the branch turns out to be mispredicted, all younger ROB entries are flushed, the RAT is restored to its state at the branch, the physical registers those entries allocated are freed, and fetch restarts from the correct target. Misprediction penalty on Golden Cove: ~18–20 cycles.
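In-order retirement with precise exceptions can be illustrated with a queue. The sketch below is a toy model with hypothetical field names (`done`, `fault`); it retires up to `width` completed entries per call and squashes the faulting entry and everything younger when a fault reaches the head.

```python
from collections import deque

def retire(rob, width):
    """Retire up to `width` completed entries from the ROB head, in program order.
    Returns (retired_names, faulting_name or None)."""
    retired = []
    while rob and len(retired) < width:
        head = rob[0]
        if not head["done"]:
            break                   # head still executing: nothing younger may retire
        if head["fault"]:
            rob.clear()             # precise exception: squash the faulting op and all younger ones
            return retired, head["name"]
        retired.append(rob.popleft()["name"])
    return retired, None

rob = deque([
    {"name": "A", "done": True,  "fault": False},
    {"name": "B", "done": False, "fault": False},   # e.g. a load still waiting on memory
    {"name": "C", "done": True,  "fault": False},   # finished out of order
])
print(retire(rob, 6))    # only A retires; C is done but stuck behind B
rob[0]["done"] = True    # B completes
print(retire(rob, 6))    # B and C retire together, in order
```

C finishing before B has no architectural effect until B retires, which is what keeps the software-visible behavior identical to in-order execution.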

Golden Cove Pipeline Block Diagram

          ┌─────────────────────────────────────────────────────┐
          │                   FRONTEND                          │
          │  ┌──────┐  ┌──────────┐  ┌──────────────────────┐  │
          │  │L1-I  │→ │ Branch   │→ │  Fetch (32B/cycle)   │  │
          │  │32KB  │  │Predictor │  │  TAGE+SC predictor   │  │
          │  └──────┘  └──────────┘  └──────────────────────┘  │
          │                                   ↓                 │
          │  ┌────────────────────────────────────────────────┐ │
          │  │  6-wide x86 Decoder (+ ucode for complex)      │ │
          │  └────────────────────────────────────────────────┘ │
          │                                   ↓                 │
          │  ┌─────────────────────────────────────────────┐    │
          │  │  Decoded Stream Buffer (DSB, 4K µops)       │    │
          │  │  + Loop Stream Detector                      │    │
          │  └─────────────────────────────────────────────┘    │
          │                                   ↓                 │
          │  ┌─────────────────────────────────────────────┐    │
          │  │  IDQ / Micro-op Queue (144 entries)         │    │
          └──┴─────────────────────────────────────────────┴────┘
                                               ↓ (6 µops/cycle)
          ┌─────────────────────────────────────────────────────┐
          │                   BACKEND                           │
          │  ┌─────────────────────────────────────────────┐    │
          │  │  Register Rename / RAT                       │    │
          │  │  280 int physical regs, 332 vector regs      │    │
          │  └─────────────────────────────────────────────┘    │
          │                      ↓              ↓               │
          │  ┌──────────────────────────────────────────────┐   │
          │  │  Reorder Buffer (ROB) — 512 entries          │   │
          │  └──────────────────────────────────────────────┘   │
          │                      ↓                              │
          │  ┌──────────────────────────────────────────────┐   │
          │  │  Scheduler / RS (97 entries)                 │   │
          │  │  Tomasulo-style wakeup + select              │   │
          │  └──────────────────────────────────────────────┘   │
          │                      ↓                              │
          │  ┌──────┬──────┬──────┬──────┬──────┬──────┐       │
          │  │ P0   │ P1   │ P2   │ P3   │ P4   │P5-11 │       │
          │  │ALU+  │ALU+  │AGU   │AGU   │Store │ALU/  │       │
          │  │FMA   │FMA   │Load  │Load  │Data  │Ld/St │       │
          │  └──────┴──────┴──────┴──────┴──────┴──────┘       │
          │                      ↓                              │
          │  ┌──────────────────────────────────────────────┐   │
          │  │  L1D (48KB 12-way) → L2 (1.25MB) → L3 slice │   │
          │  │  Store Buffer (114 entries)                   │   │
          │  │  Load Queue (192 entries)                     │   │
          │  └──────────────────────────────────────────────┘   │
          │                      ↓                              │
          │  ┌──────────────────────────────────────────────┐   │
          │  │  Retirement (6 µops/cycle, in-order)         │   │
          │  └──────────────────────────────────────────────┘   │
          └─────────────────────────────────────────────────────┘

Historical Context

Tomasulo's algorithm (IBM 360/91, 1967) introduced dynamic instruction scheduling with register renaming, but in-order pipelines (such as the IBM 360/85, 1968) remained the norm for decades afterwards. Among the first widely used OoO microprocessors were the MIPS R10000 and the Intel Pentium Pro (P6 microarchitecture, 1995); the P6 ROB had 40 entries. AMD's K6 (1997) and K7 (Athlon, 1999) competed with P6. Intel's NetBurst (Pentium 4, 2000) favored clock speed over IPC, reaching 3.8 GHz but with poor performance per clock. The Core microarchitecture (2006) returned to P6 principles with wider decode and a better OoO engine. Sandy Bridge (2011) introduced the DSB. Haswell (2013) added AVX2 and fused multiply-add. Skylake (2015) and its derivatives remained Intel's mainstream core for about five years. Golden Cove (2021, in Alder Lake) expanded the ROB to 512 entries; its AVX-512 support is enabled in server parts (Sapphire Rapids) but disabled in Alder Lake. AMD Zen (2017) ended Intel's roughly decade-long performance lead; Zen 4 (2022), built on TSMC 5nm, achieves near parity.

Production Examples

Intel Alder Lake (12th gen Core, 2021): Introduced a hybrid architecture with Performance cores (Golden Cove) and Efficiency cores (Gracemont), up to 8 P-cores + 8 E-cores at launch. P-cores have the 512-entry ROB and 6-wide decode pipeline described above. E-cores are also out-of-order but narrower and smaller (Gracemont uses two 3-wide decode clusters and a 256-entry ROB). Thread Director (hardware + OS scheduler) routes threads to the appropriate core type.

AMD Zen 4 (EPYC Genoa, 2022): Used in cloud servers (AWS c7a, Azure D-series v6). 96-core EPYC Genoa (12 CCDs × 8 cores per CCD). TSMC 5nm. Each core: 320-entry ROB, 4-wide decode. Desktop Ryzen 9 7950X: 4.5 GHz base / 5.7 GHz boost. DDR5 support.

Server pipeline utilization (illustrative): A 96-core Genoa server running a memory-bandwidth-bound workload may reach only ~60% of its peak IPC because many loads wait on DRAM. A compute-bound HPC workload (dense matmul with AVX-512) can exceed 90%, running the pipeline near its theoretical maximum.

Debugging Notes

Identifying pipeline bottlenecks with perf: Use Intel Top-Down Microarchitecture Analysis (TMA):

perf stat -M TopdownL1 -- ./benchmark
# Metric group names vary with perf version; `perf list metricgroup` shows what is available
# Look for: Frontend_Bound, Backend_Bound, Bad_Speculation, Retiring

High Backend_Bound → memory or execution port contention. High Bad_Speculation → branch mispredictions.
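The four level-1 categories are simple ratios over pipeline slots. The sketch below uses the formulas published for the top-down method on older 4-wide Intel cores (slots = width × cycles); newer cores such as Golden Cove expose dedicated top-down counters instead, so treat this as an illustration of the arithmetic rather than what perf computes on a given machine. The parameter names are stand-ins for the underlying events, and the sample counter values are hypothetical.

```python
def topdown_l1(cycles, uops_issued, uops_retired, uops_not_delivered,
               recovery_cycles, width=4):
    """Level-1 top-down breakdown as fractions of all pipeline slots."""
    slots = width * cycles                          # issue opportunities in the interval
    retiring = uops_retired / slots                 # slots that did useful work
    bad_speculation = (uops_issued - uops_retired + width * recovery_cycles) / slots
    frontend_bound = uops_not_delivered / slots     # slots the frontend failed to fill
    backend_bound = 1.0 - (retiring + bad_speculation + frontend_bound)
    return {
        "Retiring": retiring,
        "Bad_Speculation": bad_speculation,
        "Frontend_Bound": frontend_bound,
        "Backend_Bound": backend_bound,
    }

# Hypothetical counter readings, chosen only to make the arithmetic easy to follow.
sample = topdown_l1(cycles=1_000_000, uops_issued=1_800_000, uops_retired=1_600_000,
                    uops_not_delivered=400_000, recovery_cycles=50_000)
for name, fraction in sample.items():
    print(f"{name:16s} {fraction:6.1%}")
```

With these made-up numbers the workload would be classified as backend-bound, since Backend_Bound is the largest non-retiring category.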

Stall profiling with PEBS (Processor Event-Based Sampling):

perf record -e cycles:ppp,mem-loads:ppp -- ./workload
# -e mem-loads samples load instructions; identifies hot loads with high latency

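ROB full stall: If a ROB-full stall counter is high (RESOURCE_STALLS.ROB on older Intel cores; availability varies by generation), allocation is blocked because the ROB is full. This usually means a long-latency instruction, typically a cache-missing load, is sitting at the ROB head and blocking retirement. Reduce the miss latency (better locality, prefetching) or expose more independent misses so they overlap.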

Store forwarding stall: LD_BLOCKS.STORE_FORWARD indicates loads that tried to forward from a store but couldn't due to size/alignment mismatch. Common in code that writes 4 bytes then reads 8 bytes from the same address.

Security Implications

Spectre (2018): Exploits speculative execution. A mistrained branch predictor steers the CPU down a path where transient instructions read a secret and use it to touch a secret-dependent cache line. When the misprediction is detected the architectural state is rolled back, but the cache side effect remains, and the attacker measures cache access times to infer the secret. Mitigations: LFENCE after sensitive bounds checks, Indirect Branch Restricted Speculation (IBRS), and retpoline (a return-based trampoline for indirect branches).

Meltdown (2018): Exploits OoO execution past a privilege check. A user-space instruction reads kernel memory (which would normally fault), but the OoO engine executes several instructions past the fault speculatively, leaving kernel data in cache. Mitigation: KPTI (Kernel Page Table Isolation) — remove kernel mappings from user-space page tables, preventing the speculative access from touching kernel pages.

MDS attacks (Microarchitectural Data Sampling, 2019): RIDL, Fallout, ZombieLoad. These exploit stale data left in store buffers, line-fill buffers, and load ports during transient execution windows. Intel mitigation: a microcode update (MD_CLEAR) that makes the VERW instruction overwrite the affected buffers, which the OS executes on privilege transitions.

The fundamental problem: OoO CPUs speculatively execute instructions that should not have been executed. The architectural state is always correct (speculation is rolled back), but microarchitectural side effects (cache state, buffer contents) are not. The weakness is hard to remove because decades of performance optimization have been built on speculation.

Performance Implications

IPC (Instructions Per Clock): Modern OoO CPUs can sustain 4–6 IPC on well-optimized, highly parallel code; typical workloads land well below that. Limiting factors: branch mispredictions (18–20 cycle penalty), cache misses (L3 hit: 40–50 cycles, DRAM: 300+ cycles), and execution port contention (many µops competing for the same port).
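A first-order stall model shows how quickly these penalties erode peak IPC. The sketch below is an additive CPI estimate under simplifying assumptions: stalls do not overlap with each other, every last-level miss costs the full DRAM latency, and the `mlp` factor (average number of overlapping misses) is a crude stand-in for memory-level parallelism. The function name and the example rates are hypothetical.

```python
def effective_ipc(base_ipc, branch_mpki, mispredict_penalty,
                  llc_mpki, miss_penalty, mlp=1.0):
    """First-order model: CPI = base CPI + misprediction stalls + memory stalls.
    MPKI values are events per 1000 instructions; penalties are in cycles."""
    cpi = (1.0 / base_ipc
           + branch_mpki / 1000 * mispredict_penalty
           + llc_mpki / 1000 * miss_penalty / mlp)
    return 1.0 / cpi

# Hypothetical workload: 5 mispredictions and 2 DRAM misses per 1000 instructions.
ipc = effective_ipc(base_ipc=4, branch_mpki=5, mispredict_penalty=20,
                    llc_mpki=2, miss_penalty=300, mlp=4)
print(f"estimated IPC: {ipc:.2f}")
```

Even these modest event rates roughly halve the IPC of a core that could otherwise sustain 4, which is why branch prediction accuracy and cache behavior dominate real-world performance.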

Instruction-level parallelism limits: An Amdahl-style bound applies to ILP: if 50% of a program's instructions lie on a single chain of true RAW dependencies, the ILP cannot exceed 2 no matter how wide the machine is. In practice, real-world code has an ILP of 2–4. At an IPC of 4, a 512-entry ROB covers about 128 cycles of latency.
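The dependency-chain bound can be computed directly from a dependency graph: the best possible ILP is the instruction count divided by the length of the longest RAW chain. The sketch below assumes unit latency for every instruction and unlimited execution resources; the function name and the input encoding are made up for this example.

```python
def ilp_limit(deps):
    """deps[i] lists the earlier instructions that instruction i truly depends on (RAW).
    Returns instruction count / critical path length, assuming unit latencies."""
    depth = []
    for producers in deps:
        # An instruction can start one step after its deepest producer finishes.
        depth.append(1 + max((depth[p] for p in producers), default=0))
    return len(deps) / max(depth)

# 8 instructions: 0 -> 1 -> 2 -> 3 form a serial chain, 4..7 are independent.
half_serial = [[], [0], [1], [2], [], [], [], []]
print(ilp_limit(half_serial))
```

Half of the instructions sit on one chain of length 4, so the eight instructions need at least four steps, giving an ILP of 2 regardless of how many execution ports are available.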

Frequency vs IPC: Modern CPUs run at 3.5–5.5 GHz. Pushing much beyond ~5 GHz is increasingly costly because dynamic power grows with frequency times voltage squared, and higher frequency requires higher voltage. Future performance improvements focus on wider decode (7–8 wide?), larger ROBs (1024 entries?), and better branch prediction rather than frequency.

Failure Modes and Real Incidents

Incident: Spectre and Meltdown disclosure (2018): Reported independently by Google Project Zero, Cyberus Technology, Graz University of Technology, and other researchers. Meltdown required emergency OS patches (KPTI) that reduced system call performance by 5–30% on I/O-heavy workloads. Cloud providers saw 10–15% degradation on database workloads. Newer CPU generations (Ice Lake, Zen 2+) added hardware mitigations; AMD processors were not vulnerable to Meltdown itself.

Incident: Intel Skylake LSD (Loop Stream Detector) erratum (2017, SKL150): With Hyper-Threading enabled, short loops using the AH/BH/CH/DH registers could cause unpredictable behavior. The fix was a microcode update that disables the LSD, with reported regressions of 0–15% on loop-heavy workloads (e.g., Java JIT, JavaScript V8).

Incident: AMD Zen 1 TLB bug (2018, Erratum 1019): A specific sequence of instructions in certain configurations could cause the TLB to return stale mappings. Required microcode update. Affected primarily hypervisor-intensive workloads. 2–5% performance regression from mitigation.

Modern Usage

Golden Cove vs Gracemont (Alder Lake / Raptor Lake): Intel's Thread Director monitors each thread's instruction mix in hardware and passes per-core performance and efficiency hints to the OS scheduler through the Hardware Feedback Interface. Compute-bound threads are routed to P-cores; background and efficiency-class threads to E-cores. This heterogeneous architecture enables both peak performance and power efficiency.

AMD 3D V-Cache (2022): 64 MB of L3 SRAM stacked via TSV (Through-Silicon Via) on Zen 3 CCDs. Dramatically increases L3 hit rate for gaming and simulation workloads. Zen 4 3D V-Cache (X3D) in Ryzen 9 7950X3D follows the same approach on Zen 4.

Simultaneous Multi-Threading (SMT / Hyper-Threading): Each physical core runs 2 hardware threads. Architectural state is duplicated per thread, while the frontend, RS, execution ports, and caches are shared, and structures such as the ROB are shared or partitioned between the threads. SMT efficiency: typically 25–35% throughput gain for latency-bound workloads; 5–10% for compute-bound. Security note: because SMT threads share execution resources, they are exposed to side-channel attacks (port contention attacks, SMoTherSpectre).

Future Directions

8-wide decode / 1024-entry ROB: Next generation superscalar designs are pushing both frontend and backend wider; IC design tools and silicon area make this increasingly tractable
Data speculation with hardware load value prediction: Speculate on load values (not just addresses) to begin compute before memory returns data; studied in the research literature since the value-locality work of Lipasti, Wilkerson, and Shen (1996)
Hybrid OoO + vector processing: ARM SVE2, RISC-V V extension, and Intel AMX/AVX-512 blurring the line between superscalar OoO and SIMD
Near-threshold voltage computing: Operate CPUs at 200–300 mV (near-threshold) for 10× lower power at 30–50% of peak frequency; relevant for edge/mobile

Exercises

Pipeline simulation: Implement a 4-wide OoO CPU simulator in Python (toy-scale). Support: in-order fetch (4 instructions/cycle), 2-bit branch predictor, 64-entry ROB, 32-entry RS, 2 ALU ports (1-cycle latency), 1 MUL port (3-cycle), 1 load port (4-cycle). Simulate a 100-instruction benchmark. Compute IPC and visualize the pipeline utilization.
Tomasulo walkthrough: Given 10 instructions with specified data dependencies, manually trace register renaming through a 6-register physical file, show RAT state at each cycle, and determine execution order (identify which instructions execute in parallel vs serialize).
perf TMA analysis: On a Linux machine, run three workloads: (a) matrix multiply (compute-bound), (b) random memory access (memory-bound), (c) function pointer dispatch loop (branch-bound). Use perf stat -M TopdownL1 (or the equivalent top-down metric group in your perf version) to categorize each. Explain the dominant bottleneck for each.
Store-to-load forwarding: Write two C functions that access the same memory with different size patterns: (a) store 4 bytes, immediately load 4 bytes (forwarding works), (b) store 4 bytes, immediately load 8 bytes (forwarding fails, stall). Benchmark with perf stat -e LD_BLOCKS.STORE_FORWARD. Explain the latency difference.
Meltdown exploit (educational): Study the Meltdown proof-of-concept code. Identify the three key steps: (1) speculative access of kernel address, (2) cache-based side channel encoding, (3) Flush+Reload timing measurement. Explain why KPTI prevents this, and why its overhead is small for CPU-intensive workloads that make few system calls.

References

Intel 64 and IA-32 Architectures Optimization Reference Manual, Intel 2023
AMD Software Optimization Guide for AMD Family 19h Processors (Zen 4), AMD 2022
Hennessy, Patterson, "Computer Architecture: A Quantitative Approach," 6th ed., 2019
Intel Alder Lake (Golden Cove) Microarchitecture: https://en.wikichip.org/wiki/intel/microarchitectures/golden_cove
AMD Zen 4 Microarchitecture: https://en.wikichip.org/wiki/amd/microarchitectures/zen_4
Kocher et al., "Spectre Attacks: Exploiting Speculative Execution," IEEE S&P 2019
Lipp et al., "Meltdown: Reading Kernel Memory from User Space," USENIX Security 2018
Ahmad Yasin, "A Top-Down Method for Performance Analysis and Counters Architecture," ISPASS 2014 (TMAM)