CPU Pipeline: Classic 5-Stage to Modern Superscalar
Prerequisites
- Basic digital logic: flip-flops, registers, combinational circuits
- Assembly language fundamentals: instructions, operands, memory addressing
- Familiarity with instruction set architectures (x86, RISC-V, ARM)
- Number representation: two's complement, IEEE 754 floating point basics
- Clock cycles and frequency: understanding that work happens at clock edges
Technical Overview
A CPU pipeline is a hardware technique that overlaps the execution of multiple instructions, analogous to an assembly line in manufacturing. Rather than completing one instruction fully before beginning the next, a pipelined processor divides instruction execution into discrete stages, each handled by dedicated hardware. In any given clock cycle, each stage is working on a different instruction, increasing instruction throughput without necessarily increasing clock frequency.
The central insight is that sequential instructions are largely independent: while instruction N is executing in the ALU (Arithmetic Logic Unit), instruction N+1 can be decoded and instruction N+2 can be fetched from memory. This overlap in time within a single core is the foundation of modern CPU design.
Modern pipelines far exceed the classic 5-stage textbook model. Intel's Core architecture runs approximately 14-19 stages depending on the microarchitecture generation. AMD Zen 4 is approximately 15 stages from fetch to writeback. These deeper pipelines enable higher clock frequencies by reducing the amount of logic per pipeline stage (shorter critical path per stage = lower propagation delay = faster clock). The tradeoff: every branch misprediction wastes cycles proportional to pipeline depth.
Historical Context
1985 — MIPS R2000: John Hennessy at Stanford and David Patterson at UC Berkeley developed the RISC philosophy in parallel research projects, and the canonical 5-stage pipeline comes out of that work. The MIPS R2000, which grew out of Hennessy's Stanford MIPS project, was the first commercial embodiment of a clean 5-stage design. The simplicity of the RISC ISA was deliberately chosen to make pipelining tractable.
1989 — Intel i486: Intel's first pipelined x86 processor. A 5-stage pipeline on a CISC architecture required substantial decode complexity because x86 instructions are variable length with complex addressing modes. The i486 achieved 1 CPI (Cycles Per Instruction) on simple instructions.
1993 — Intel Pentium (P5): Two parallel integer pipelines (U and V pipes), marking Intel's entry into superscalar execution. The U pipe could handle any instruction; the V pipe could only handle simple instructions in parallel with U. Effective throughput: up to 2 integer instructions per cycle, but only under restrictive pairing rules.
1995 — Intel Pentium Pro (P6): The architectural watershed. First x86 processor with out-of-order (OoO) execution. The P6 microarchitecture introduced the decode→uop translation that all modern Intel CPUs still use: complex x86 instructions decoded into simpler internal micro-operations (uops). Introduced the ROB (ReOrder Buffer) and reservation stations. Pipeline depth: 14 stages. This architecture's descendants — Pentium II, III, Pentium M, Core 2, Sandy Bridge, Skylake — all trace lineage to P6.
2000 — Intel Pentium 4 (NetBurst): Aggressive pipeline deepening to 20 stages (Willamette) and eventually 31 stages (Prescott). The theory was that deeper pipelines enable higher frequencies. In practice, branch misprediction penalty grew to 20-30 cycles, frequency scaling stalled around 3.8 GHz due to power/heat, and the architecture was abandoned.
2006 — Intel Core 2 (Conroe): Return to P6-derived architecture with ~14 stage pipeline. Power-efficient, superscalar, wide decode.
2020 — Apple M1: 8-wide decode, the widest front-end decode of any shipping mainstream CPU at launch. Up to 8 instructions can be decoded per cycle, feeding a large out-of-order backend whose ROB is estimated at roughly 630 entries (a figure from third-party measurement, not an official specification).
Core Content: The Classic 5-Stage Pipeline
Stage Definitions
Stage 1: IF — Instruction Fetch
Stage 2: ID — Instruction Decode / Register Read
Stage 3: EX — Execute (ALU, address calculation)
Stage 4: MEM — Memory Access (load/store)
Stage 5: WB — Write Back (update register file)
Pipeline Diagram: 5 Instructions in Steady State
Clock:   1     2     3     4     5     6     7     8     9
───────────────────────────────────────────────────────────
I1:      IF    ID    EX    MEM   WB
I2:            IF    ID    EX    MEM   WB
I3:                  IF    ID    EX    MEM   WB
I4:                        IF    ID    EX    MEM   WB
I5:                              IF    ID    EX    MEM   WB
In steady state, one instruction completes per clock cycle (CPI=1), even though each instruction takes 5 cycles end-to-end (latency=5). Throughput ≠ latency.
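The throughput-versus-latency distinction can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming an ideal pipeline with no stalls and an unpipelined baseline that runs at the same clock; the function names are illustrative and not taken from any real tool.

```c
#include <assert.h>

/* Cycles to run n instructions through an ideal pipeline with no stalls:
 * `stages` cycles for the first instruction to drain, then one more
 * completion on every following cycle. */
unsigned long pipelined_cycles(unsigned long n, unsigned long stages)
{
    assert(stages > 0);
    return n == 0 ? 0 : stages + n - 1;
}

/* Same work with no overlap: each instruction occupies the whole
 * datapath for `stages` cycles before the next one starts. */
unsigned long unpipelined_cycles(unsigned long n, unsigned long stages)
{
    assert(stages > 0);
    return n * stages;
}

/* Average cycles per instruction over a run of n instructions. */
double average_cpi(unsigned long n, unsigned long stages)
{
    assert(n > 0);
    return (double)pipelined_cycles(n, stages) / (double)n;
}
```

For the five instructions in the diagram, pipelined_cycles(5, 5) gives 9 cycles against 25 for the unpipelined case. As n grows, average_cpi approaches 1 because the fill cost of the first instruction is amortized.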
Pipeline Registers
Between each stage sit pipeline registers (flip-flops clocked at the CPU's master clock). They hold the instruction and its associated data as it moves through:
┌────┐  ┌──────────┐  ┌────┐  ┌──────────┐  ┌────┐
│ IF │──│IF/ID reg │──│ ID │──│ID/EX reg │──│ EX │
└────┘  └──────────┘  └────┘  └──────────┘  └──┬─┘
                                               │
┌────┐  ┌──────────┐  ┌─────┐  ┌──────────┐    │
│ WB │──│MEM/WB reg│──│ MEM │──│EX/MEM reg│────┘
└────┘  └──────────┘  └─────┘  └──────────┘
Pipeline Hazards
Hazards are situations that prevent the next instruction from executing in its designated clock cycle. Three categories:
1. Structural Hazards
A required hardware resource is occupied by another instruction. Example: a single shared memory port serving both instruction fetch (IF stage) and data access (MEM stage) simultaneously.
Modern solution: separate instruction cache (I-cache) and data cache (D-cache) eliminate the most common structural hazard. Modern CPUs also have multiple execution ports (see superscalar section).
2. Data Hazards
An instruction depends on the result of a prior instruction that has not yet written back its result. Three sub-types:
RAW Read After Write — true dependency (I2 reads what I1 writes)
WAW Write After Write — output dependency (both write same register)
WAR Write After Read — anti-dependency (I2 writes before I1 reads)
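RAW is the most critical: it is a true data dependency, so register renaming cannot remove it. Renaming eliminates only WAW and WAR, which are name conflicts rather than real data flow. In a simple in-order 5-stage pipeline WAW and WAR cannot cause errors, because registers are read and written in program order; they matter once instructions execute out of order.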
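The three dependency types reduce to register-number comparisons between an earlier and a later instruction. The sketch below assumes a hypothetical three-operand format (one destination, two sources) and ignores special cases such as a hardwired zero register; the type and constant names are invented for illustration.

```c
#include <assert.h>

/* Hypothetical three-operand instruction: rd = rs1 op rs2. */
typedef struct {
    int rd;   /* destination register number */
    int rs1;  /* first source register number */
    int rs2;  /* second source register number */
} Instr;

enum { HAZ_RAW = 1, HAZ_WAW = 2, HAZ_WAR = 4 };

/* Returns a bitmask of the dependencies `later` has on `earlier`,
 * where `earlier` precedes `later` in program order. */
int classify_hazards(Instr earlier, Instr later)
{
    assert(earlier.rd >= 0 && later.rd >= 0);
    int mask = 0;
    /* RAW: later reads a register that earlier writes. */
    if (later.rs1 == earlier.rd || later.rs2 == earlier.rd)
        mask |= HAZ_RAW;
    /* WAW: both write the same register. */
    if (later.rd == earlier.rd)
        mask |= HAZ_WAW;
    /* WAR: later overwrites a register that earlier still has to read. */
    if (later.rd == earlier.rs1 || later.rd == earlier.rs2)
        mask |= HAZ_WAR;
    return mask;
}
```

For ADD R1, R2, R3 followed by SUB R4, R1, R5 the function returns HAZ_RAW only. A bitmask is used because one instruction pair can carry several dependencies at once.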
Example:
ADD R1, R2, R3 ; I1: R1 = R2 + R3 (writes R1 in WB, cycle 5)
SUB R4, R1, R5 ; I2: R4 = R1 - R5 (reads R1 in ID, cycle 3 — STALE!)
Without mitigation, I2 reads R1 before I1 has written it. The result is incorrect.
Solution 1: Pipeline Stall / Bubble
The hazard detection unit stalls I2 (and all subsequent instructions) for 2 cycles by inserting NOPs (bubbles) into the pipeline:
Clock:   1     2     3     4     5     6     7     8     9
I1:      IF    ID    EX    MEM   WB
I2:            IF    ID    [NOP] [NOP] EX    MEM   WB
I3:                  IF    [---] [---] ID    EX    ...
                           stall stall
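Cost: 2 wasted cycles for each back-to-back RAW hazard in a 5-stage pipeline without forwarding. This assumes the register file is written in the first half of a cycle and read in the second half, so I2 can read R1 in the same cycle that I1 writes it back; without that, the cost is 3 cycles.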
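The stall count follows from comparing the cycle in which the producer writes back with the cycle in which the consumer would normally read its registers. A minimal sketch, assuming the classic 5-stage timing above with no forwarding; the function name and the same_cycle_wb_read flag are illustrative assumptions.

```c
#include <assert.h>

/* Stall cycles for a RAW hazard in the classic 5-stage pipeline with no
 * forwarding. `distance` is how many instructions after the producer the
 * consumer sits (1 = immediately after). `same_cycle_wb_read` is nonzero
 * when the register file writes in the first half of a cycle and reads
 * in the second half. */
int raw_stalls_no_forwarding(int distance, int same_cycle_wb_read)
{
    assert(distance >= 1);
    /* A producer fetched in cycle 1 writes back in cycle 5. */
    int earliest_read = same_cycle_wb_read ? 5 : 6;
    /* The consumer is fetched in cycle 1 + distance and would normally
     * read registers (ID) one cycle later. */
    int normal_read = 2 + distance;
    int stalls = earliest_read - normal_read;
    return stalls > 0 ? stalls : 0;
}
```

With split-cycle register access, a consumer directly behind the producer needs 2 stalls, one instruction further back needs 1, and anything at distance 3 or more needs none.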
Solution 2: Operand Forwarding (Bypassing)
Hardware paths that short-circuit the normal register file read, delivering results directly from where they were computed to where they are needed — without waiting for WB.
FORWARDING PATHS
Clock:   1     2     3     4     5     6     7
───────────────────────────────────────────────
I1:      IF    ID    EX    MEM   WB
                      │     │
                      │     └──► MEM/WB → EX forward (used by I3 in cycle 5)
                      └────────► EX/MEM → EX forward (used by I2 in cycle 4)
I2:            IF    ID    EX    MEM   WB
I3:                  IF    ID    EX    MEM   WB
Detailed forwarding diagram with data paths:
┌──────────────────────────────────────────────────────────────────────────────────────────┐
│                                     5-STAGE PIPELINE                                     │
│                                                                                          │
│  PC ─► IF ─► [IF/ID] ─► ID ─► [ID/EX] ─► MUX ─► EX ─► [EX/MEM] ─► MEM ─► [MEM/WB] ─► WB  │
│                                          ▲ ▲             │                  │            │
│                                          │ └─────────────┘                  │            │
│                                          │   EX/MEM → EX forward            │            │
│                                          └──────────────────────────────────┘            │
│                                              MEM/WB → EX forward                         │
└──────────────────────────────────────────────────────────────────────────────────────────┘
Forward MUX at EX stage selects:
[0] Register file output (normal)
[1] EX/MEM pipeline register (forward from prior EX)
[2] MEM/WB pipeline register (forward from two cycles ago)
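The select logic for this MUX can be sketched as a pair of register-number comparisons, checking the newer result first. The struct layout and names below are assumptions made for illustration; the enum values follow the [0]/[1]/[2] numbering above. An ISA with a hardwired zero register would also need to exclude that register from forwarding.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    FWD_REGFILE = 0,  /* register file output (normal) */
    FWD_EX_MEM  = 1,  /* result of the instruction one ahead */
    FWD_MEM_WB  = 2   /* result of the instruction two ahead */
} FwdSel;

/* Destination info carried in a pipeline register (hypothetical layout). */
typedef struct {
    bool reg_write;  /* the instruction will write a register */
    int  rd;         /* its destination register number */
} StageDest;

/* Chooses the source for one EX-stage operand that reads `src_reg`. */
FwdSel forward_select(int src_reg, StageDest ex_mem, StageDest mem_wb)
{
    assert(src_reg >= 0);
    /* The newest value wins: EX/MEM is checked before MEM/WB. */
    if (ex_mem.reg_write && ex_mem.rd == src_reg)
        return FWD_EX_MEM;
    if (mem_wb.reg_write && mem_wb.rd == src_reg)
        return FWD_MEM_WB;
    return FWD_REGFILE;
}
```

The priority order matters when two in-flight instructions write the same register: the operand must come from the more recent writer. A real datapath runs this check once per source operand.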
With full forwarding, most RAW hazards require 0 stall cycles. The only unavoidable stall: load-use hazard — a load followed immediately by an instruction using the loaded value requires 1 stall cycle because the data is not available until end of MEM stage.
LDR R1, [R2] ; load from memory — result available after MEM
ADD R3, R1, R4 ; needs R1 — 1 cycle stall unavoidable
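Detecting the load-use case is a single comparison made while the consumer is in ID and the load is in EX. A minimal sketch, with a hypothetical struct standing in for the relevant ID/EX pipeline register fields:

```c
#include <assert.h>
#include <stdbool.h>

/* Fields of the ID/EX pipeline register that matter here
 * (hypothetical layout). */
typedef struct {
    bool mem_read;  /* the instruction now in EX is a load */
    int  rd;        /* register the load will write */
} IdExInfo;

/* True when the instruction in ID must stall for one cycle because it
 * reads a register that the load directly ahead of it has not yet
 * fetched from memory. */
bool load_use_stall(IdExInfo id_ex, int id_rs1, int id_rs2)
{
    assert(id_rs1 >= 0 && id_rs2 >= 0);
    return id_ex.mem_read &&
           (id_ex.rd == id_rs1 || id_ex.rd == id_rs2);
}
```

For LDR R1, [R2] followed by ADD R3, R1, R4 the check returns true. After the one-cycle bubble, the loaded value can reach the EX stage through the MEM/WB forwarding path.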
3. Control Hazards
Caused by branch instructions. The CPU doesn't know the next instruction's address until the branch is resolved in EX (or later). Meanwhile, instructions after the branch have already entered IF and ID stages — if the branch is taken, those instructions must be flushed (turned into bubbles).
Clock:      1     2     3     4     5
BEQ:        IF    ID    EX                ← branch resolved here
I_fall:           IF    ID                ← must flush if branch taken
I_fall+1:               IF                ← must flush if branch taken
With branch resolved in EX stage: 2-cycle branch penalty if taken.
Branch Prediction: Modern CPUs predict each branch's direction and target, then continue fetching speculatively along the predicted path. A correct prediction costs 0 cycles. A misprediction requires flushing every instruction fetched after the branch. The cost is roughly the number of stages between fetch and branch resolution, often approximated as pipeline depth - 1 (approximately 15-20 cycles on modern deep pipelines).
Superscalar Execution
A superscalar CPU issues more than one instruction per clock cycle. This requires:
- Wide fetch: Fetch multiple instructions per cycle (fetch width)
- Wide decode: Decode multiple instructions per cycle (decode width)
- Multiple execution units: Parallel ALUs, load/store units, FPUs
- Dependency checking: Hardware that identifies which instructions can issue in parallel
Fetch Width vs Decode Width
These are often different. Fetch retrieves a block of bytes from the I-cache. Decode identifies instruction boundaries (critical for x86 with variable-length encoding) and translates to uops.
                  ┌──────────────────────┐
I-Cache ─────────►│     Fetch Buffer     │  16-32 bytes / cycle
                  └──────────┬───────────┘
                             │
                  ┌──────────▼───────────┐
                  │      Pre-Decode      │  find instruction boundaries
                  └──────────┬───────────┘
                             │
                  ┌──────────▼───────────┐
                  │  Instruction Queue   │  buffer pre-decoded instrs
                  └──────────┬───────────┘
                             │ (decode width)
                  ┌──────────▼───────────┐
                  │  Decode (4-8 wide)   │  → uops
                  └──────────┬───────────┘
                             │
                  ┌──────────▼───────────┐
                  │  Rename / Dispatch   │  → OoO engine
                  └──────────────────────┘
Decode Width Comparison (2024)
| CPU | Decode Width (approx.) | Pipeline Depth (approx.) |
|---|---|---|
| Intel Pentium (P5) | 2 instructions/cycle (U+V pairing) | 5 |
| Intel Pentium Pro | 3 instructions/cycle | 14 |
| Intel Sandy Bridge | 4 uops/cycle | 14 |
| Intel Skylake | 4 uops/cycle | 14 (uop cache hit) to 19 (uop cache miss) |
| Intel Sunny Cove | 5 uops/cycle | ~14 |
| AMD Zen 2 | 4 uops/cycle | ~19 |
| AMD Zen 4 | 4 uops/cycle | ~15 |
| Apple M1 | 8 instructions/cycle | ~15 |
| Apple M4 | 8 instructions/cycle | ~15 |
| ARM Cortex-X4 | 5 instructions/cycle | ~13 |
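Apple M1's 8-wide decode is about twice the width of the x86 decoders in the table. Decoding that wide is easier for AArch64 because its instructions are a fixed 4 bytes, so instruction boundaries are known without a pre-decode scan. Most of the benefit comes when code has enough independent instructions to fill all decode slots, which is common in high-performance compute kernels but not in pointer-chasing workloads.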
Why Deep Pipelines Are Problematic
Pipeline depth directly multiplies branch misprediction penalty:
Misprediction penalty ≈ pipeline_depth - 1 cycles
Pentium (P5), depth 5: ~4 cycle penalty
Pentium 4 (Prescott), depth 31: ~30 cycle penalty
Intel Core (modern), depth ~15: ~15 cycle penalty
If a branch predictor has 95% accuracy (very good) and branches occur every 5 instructions:
Peak IPC with perfect prediction: 4.0 (4-wide decode)
Branch every 5 instrs, 5% miss rate → 1 misprediction per 100 instructions
Useful work: 100 instrs / 4.0 IPC = 25 cycles
Penalty: 15 cycles per misprediction
Total: 25 + 15 = 40 cycles per 100 instructions
Effective IPC: 100 / 40 = 2.5
Throughput loss: 1 - 2.5/4.0 = 37.5%
With Pentium 4's 30-cycle penalty under the same conditions: 100 / (25 + 30) ≈ 1.8 IPC, a throughput loss of about 55%.
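A small helper makes it easy to try other widths, miss rates, and penalties. This is a sketch under a deliberately simple model: time is counted in cycles per instruction, every misprediction adds a fixed flush penalty, and no other stalls exist. The function and parameter names are illustrative.

```c
#include <assert.h>

/* Effective IPC when each mispredicted branch adds a fixed number of
 * flush cycles on top of the machine's peak rate.
 *   peak_ipc          - IPC with perfect prediction
 *   instrs_per_branch - average instructions between branches
 *   miss_rate         - fraction of branches mispredicted (0.0 to 1.0)
 *   penalty_cycles    - cycles lost per misprediction */
double effective_ipc(double peak_ipc, double instrs_per_branch,
                     double miss_rate, double penalty_cycles)
{
    assert(peak_ipc > 0.0 && instrs_per_branch > 0.0);
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    double base_cpi = 1.0 / peak_ipc;
    double mispredicts_per_instr = miss_rate / instrs_per_branch;
    double cpi = base_cpi + mispredicts_per_instr * penalty_cycles;
    return 1.0 / cpi;
}
```

With a peak IPC of 4.0, a branch every 5 instructions, and a 5% miss rate, a 15-cycle penalty gives an effective IPC of 2.5 and a 30-cycle penalty gives about 1.8. The model works in cycles per instruction because penalties add in time, not in throughput.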
This is a major reason Intel abandoned NetBurst: much of the frequency gain from a 31-stage pipeline was cancelled out by the misprediction penalty, while power and heat capped further frequency scaling.
Production Examples and Debugging
Measuring Pipeline Effects with perf
# Count branch mispredictions
perf stat -e branches,branch-misses,cycles,instructions ./program
# Output example:
# 1,234,567,890 branches
# 12,345,678 branch-misses # 1.00% of all branches
# 4,000,000,000 cycles
# 3,600,000,000 instructions # 0.90 insn per cycle
# IPC of 0.90 on a 4-wide machine → significant pipeline waste
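The ratios in that output can be derived directly from the four raw counters. The sketch below is an illustration, not part of perf: the struct and function names are invented, and mispredictions per thousand instructions (MPKI) is added as a common companion metric that perf stat does not print by default.

```c
#include <assert.h>

typedef struct {
    double ipc;               /* instructions per cycle */
    double branch_miss_rate;  /* fraction of branches mispredicted */
    double mpki;              /* mispredictions per 1000 instructions */
} PipelineMetrics;

/* Derives summary ratios from raw counter values such as those
 * reported by `perf stat`. Ratios with a zero denominator stay 0. */
PipelineMetrics derive_metrics(unsigned long long cycles,
                               unsigned long long instructions,
                               unsigned long long branches,
                               unsigned long long branch_misses)
{
    assert(branch_misses <= branches);
    PipelineMetrics m = {0.0, 0.0, 0.0};
    if (cycles > 0)
        m.ipc = (double)instructions / (double)cycles;
    if (branches > 0)
        m.branch_miss_rate = (double)branch_misses / (double)branches;
    if (instructions > 0)
        m.mpki = 1000.0 * (double)branch_misses / (double)instructions;
    return m;
}
```

Fed the example counters, this yields an IPC of 0.90, a miss rate of 1.00%, and roughly 3.4 mispredictions per thousand instructions. MPKI is often more useful than the raw miss rate because it accounts for how frequent branches are in the instruction stream.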
Identifying Load-Use Hazards
A common source of pipeline stalls in tight loops is load-use dependency chains:
// Anti-pattern: pointer chasing (unavoidable load-use stalls)
while (node) {
sum += node->value;
node = node->next; // loads next pointer, then immediately dereferences
}
The hardware cannot hide this because each node->next load must complete before the next loop iteration's load address is known. Memory latency (~4 cycles L1, ~12 cycles L2, ~40 cycles L3) dominates.
Loop Unrolling to Hide Latency
// Compiler unrolls to allow overlapping of independent loads
sum1 += a[0]; sum2 += a[4];
sum1 += a[1]; sum2 += a[5];
sum1 += a[2]; sum2 += a[6];
sum1 += a[3]; sum2 += a[7];
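The array loads are already independent of one another; what the separate accumulators break is the serial dependency chain through the additions. With a single accumulator, each add must wait for the previous one to finish. With two, adds from the two chains can execute in the same cycle while later loads are still in flight.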
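A complete version of the idea, written as a loop rather than a fixed eight-element fragment, might look like the sketch below. It uses four accumulators instead of the two shown above, which is an arbitrary choice for illustration; the best count depends on the add latency and the number of ALU ports on the target core.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: every add depends on the previous one through `sum`. */
long sum_single(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four independent dependency chains; adds from different chains
 * have no data dependency on each other and can issue together. */
long sum_four_acc(const int *a, size_t n)
{
    assert(a != NULL || n == 0);
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    /* Handle the 0-3 leftover elements. */
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both functions return the same result for integers. For floating-point data the reordering changes rounding, which is why compilers generally do not apply this transformation to floats unless given a flag such as -ffast-math.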
Security Implications
Pipeline design has direct security consequences:
- Branch predictor state is shared between security contexts (processes, VMs). Spectre v2 exploits indirect branch prediction by poisoning the Branch Target Buffer from one process to misdirect speculation in another. (See 03-speculative-execution.md.)
- Forwarding paths and timing: The presence of forwarding can make certain instruction sequences execute in fewer cycles, potentially creating timing side-channels. This is generally not directly exploitable but contributes to the microarchitectural attack surface.
- Pipeline flushing on privilege changes: On syscall entry, the CPU must carefully manage pipeline state. Instructions that were speculatively fetched from user space must be flushed before kernel execution begins. A related failure to enforce privilege checks on in-flight instructions was the basis of Meltdown, where a user-mode load could transiently read kernel data before the permission fault took effect, allowing reads across the privilege boundary.
- Transient execution: Instructions that are in the pipeline but will be squashed (due to branch misprediction, exception, etc.) can still leave microarchitectural side effects in caches and other buffers. This is the root class of Spectre/Meltdown attacks.
Performance Implications
Key performance metrics and their pipeline interpretations:
| Metric | Formula | Typical Range |
|---|---|---|
| IPC | instructions / cycles | 0.5–5.0 |
| CPI | cycles / instructions | 0.2–2.0 |
| Branch miss rate | misses / branches | 0.5%–10% |
| Misprediction penalty | cycles wasted / miss | 10–30 cycles |
| Load-use stalls | stall_cycles / cycles | 0%–40% |
A well-tuned, compute-bound workload typically achieves IPC 2.0–3.5 on a modern 4-wide machine. IPC below 1.5 usually indicates memory latency, branch mispredictions, or both, and memory-bound server workloads often sit in that range.
Failure Modes
- Branch predictor thrashing: Pathological branch patterns (e.g., data-dependent branches on effectively random input) can defeat prediction heuristics, causing sustained 15-20% throughput loss. A strict taken/not-taken alternation defeats only simple 2-bit counters; history-based predictors learn it quickly.
- Pipeline replay: Some CPUs speculatively execute operations before memory disambiguation is confirmed. If a dependency is discovered late, the pipeline "replays" the dependent instructions. Intel's pre-Nehalem architectures had memory ordering replay storms on certain store-then-load patterns.
- Long-latency instruction in decode: x86 complex instructions (like certain REP MOVS variants or legacy x87 FP) can decode into many uops and stall the decode stage for multiple cycles.
- Front-end bottleneck: If the instruction stream is so dense with branches that the branch predictor falls behind, fetch bandwidth drops and the out-of-order backend starves despite its capacity.
Modern Usage and Current State
Modern CPUs from all major vendors (Intel, AMD, Apple, ARM) implement:
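- Multi-level branch prediction (see 04-branch-prediction.md)
- Out-of-order execution (see 02-out-of-order-execution.md)
- Speculative execution (see 03-speculative-execution.md)
- Loop stream detectors: Intel CPUs detect small loops (up to roughly 64 uops, depending on generation) and replay them from the uop queue, bypassing fetch and decode for subsequent iterations and saving power and front-end bandwidth. The separate decoded-uop cache (DSB) plays a similar role for larger code regions.
- Micro-fusion and macro-fusion: Micro-fusion keeps two uops from one instruction (for example, a load plus an ALU operation) together as a single fused uop through decode, rename, and retirement. Macro-fusion merges an adjacent compare and conditional branch into a single uop. Both effectively widen the pipeline.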
Apple's M-series cores, with 8-wide decode and a ROB estimated at roughly 630 entries on the M1, represent a bet that a wide front end and a large OoO window at moderate clock speeds are more valuable than a deep, high-frequency pipeline. This suits client workloads where single-threaded latency dominates and memory access patterns are cache-friendly.
Future Directions
- Wider decode: 8-wide (Apple M-series) may expand. 10-12 wide is theoretically possible but faces diminishing returns from instruction-level parallelism limits (Amdahl-style serial bottlenecks in real code).
- Dataflow execution: Academic designs (MIT Raw, Intel's research architectures) propose abandoning the von Neumann sequential fetch model in favor of pure dataflow graphs. Not shipping commercially.
- Disaggregated pipelines: RISC-V's open ISA enables custom pipeline stages for specific workloads (cryptographic accelerators inserted as pipeline stages).
- Thread-level speculation: Hardware support for speculatively parallelizing serial loop iterations across cores, with rollback on conflict. IBM and Intel have explored but not commercialized this.
- AI-assisted branch prediction: Apple and others have reported using ML models offline to generate better branch predictor initialization data, and proposals exist for small neural networks in hardware for branch prediction.
Exercises
- Hazard analysis: Given the following instruction sequence, identify all RAW hazards and determine how many stall cycles are needed with and without forwarding (assume a classic 5-stage pipeline with no forwarding, then with full EX/MEM→EX and MEM/WB→EX forwarding):
ADD R1, R2, R3
MUL R4, R1, R5
LDR R6, [R1]
ADD R7, R6, R4
- Pipeline depth tradeoff: A CPU designer can choose between a 10-stage pipeline at 3.0 GHz or a 20-stage pipeline at 4.5 GHz. Branch predictor accuracy is 97%, and branches occur every 8 instructions. Assuming a base CPI of 1 and a misprediction penalty of pipeline depth - 1 cycles, compute the sustained IPC of each design, then determine which delivers more instructions per second. Show your work.
- Superscalar dependency analysis: Write a short assembly function that computes sum = a[0]+a[1]+...+a[7]. Identify which version (using a single accumulator register versus 4 separate partial-sum registers) would better exploit a 4-wide superscalar pipeline and explain why.
- Forwarding path design: Draw the forwarding multiplexer logic needed at the input to the EX stage in a 5-stage pipeline. Include control signals from the hazard detection unit that select between register file output and the two forwarding paths.
- Profiling exercise: Run perf stat -e cycles,instructions,branches,branch-misses,stalled-cycles-frontend,stalled-cycles-backend on a known compute-heavy binary (e.g., gzip, openssl speed). Interpret the results: what is the IPC, branch miss rate, and where is the dominant pipeline bottleneck?
References
- Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann. Chapters 3–4.
- Intel 64 and IA-32 Architectures Optimization Reference Manual. Order Number: 248966-045+. Section 2 (Pipeline Overview).
- Fog, A. (2023). Microarchitecture of Intel, AMD and VIA CPUs. https://www.agner.org/optimize/microarchitecture.pdf
- Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE, 83(12), 1609–1624.
- Bhandarkar, D. P. (1995). Alpha Architecture and Implementation. Digital Press.
- Yeager, K. (1996). The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2), 28–41.
- Intel P6 Microarchitecture White Paper (1995). Intel Corporation.