CPU Pipeline: Classic 5-Stage to Modern Superscalar

Prerequisites

Basic digital logic: flip-flops, registers, combinational circuits
Assembly language fundamentals: instructions, operands, memory addressing
Familiarity with instruction set architectures (x86, RISC-V, ARM)
Number representation: two's complement, IEEE 754 floating point basics
Clock cycles and frequency: understanding that work happens at clock edges

Technical Overview

A CPU pipeline is a hardware technique that overlaps the execution of multiple instructions, analogous to an assembly line in manufacturing. Rather than completing one instruction fully before beginning the next, a pipelined processor divides instruction execution into discrete stages, each handled by dedicated hardware. In any given clock cycle, each stage works on a different instruction, which increases instruction throughput without necessarily increasing clock frequency.

The central insight is that consecutive instructions can occupy different hardware at the same time: while instruction N is executing in the ALU (Arithmetic Logic Unit), instruction N+1 can be decoded and instruction N+2 fetched from memory. This overlap, a form of instruction-level parallelism within a single core, is the foundation of all modern CPU design.

Modern pipelines far exceed the classic 5-stage textbook model. Intel's Core architecture runs approximately 14-19 stages depending on the microarchitecture generation. AMD Zen 4 is approximately 15 stages from fetch to writeback. These deeper pipelines enable higher clock frequencies by reducing the amount of logic per pipeline stage (shorter critical path per stage = lower propagation delay = faster clock). The tradeoff: every branch misprediction wastes cycles proportional to pipeline depth.
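The depth/frequency relationship can be sketched with a simple timing model in which each stage gets an equal share of the logic delay plus a fixed pipeline-register overhead. The figures used here (10 ns of total logic, 0.1 ns of register overhead) are illustrative assumptions, not measurements of any real CPU.

```python
def clock_period_ns(total_logic_ns: float, stages: int, register_overhead_ns: float) -> float:
    """Cycle time when the logic is split evenly across `stages` stages."""
    return total_logic_ns / stages + register_overhead_ns


def frequency_ghz(total_logic_ns: float, stages: int, register_overhead_ns: float) -> float:
    """Clock frequency in GHz (1 / period in nanoseconds)."""
    return 1.0 / clock_period_ns(total_logic_ns, stages, register_overhead_ns)


if __name__ == "__main__":
    # Hypothetical design: 10 ns of logic, 0.1 ns register overhead per stage.
    for depth in (5, 10, 20, 40):
        print(depth, "stages:", round(frequency_ghz(10.0, depth, 0.1), 2), "GHz")
```

Because the register overhead does not shrink with depth, doubling the stage count yields less than double the frequency in this model.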

Historical Context

1985 — MIPS R2000: John Hennessy at Stanford and David Patterson at UC Berkeley led the MIPS and RISC research projects that shaped the RISC philosophy and the canonical 5-stage pipeline. The MIPS R2000, which grew out of the Stanford work, was the first commercial embodiment of a clean 5-stage design. The simplicity of the RISC ISA was deliberately chosen to make pipelining tractable.

1989 — Intel i486: Intel's first pipelined x86 processor. A 5-stage pipeline on a CISC architecture required substantial decode complexity because x86 instructions are variable length with complex addressing modes. The i486 achieved 1 CPI (Cycles Per Instruction) on simple instructions.

1993 — Intel Pentium (P5): Two parallel integer pipelines (U and V pipes), marking Intel's entry into superscalar execution. The U pipe could handle any instruction; the V pipe could only handle simple instructions in parallel with U. Effective throughput: up to 2 integer instructions per cycle, but only under restrictive pairing rules.

1995 — Intel Pentium Pro (P6): The architectural watershed. First x86 processor with out-of-order (OoO) execution. The P6 microarchitecture introduced the decode→uop translation that all modern Intel CPUs still use: complex x86 instructions decoded into simpler internal micro-operations (uops). Introduced the ROB (ReOrder Buffer) and reservation stations. Pipeline depth: 14 stages. This architecture's descendants — Pentium II, III, Pentium M, Core 2, Sandy Bridge, Skylake — all trace lineage to P6.

2000 — Intel Pentium 4 (NetBurst): Aggressive pipeline deepening to 20 stages (Willamette) and eventually 31 stages (Prescott). The theory was that deeper pipelines enable higher frequencies. In practice, branch misprediction penalty grew to 20-30 cycles, frequency scaling stalled around 3.8 GHz due to power/heat, and the architecture was abandoned.

2006 — Intel Core 2 (Conroe): Return to P6-derived architecture with ~14 stage pipeline. Power-efficient, superscalar, wide decode.

2020 — Apple M1: 8-wide decode, the widest front-end decode of any shipping CPU at launch. Up to 8 instructions can be decoded per cycle, feeding a massive out-of-order backend with a ROB reported at roughly 630 entries (a figure from third-party measurement, not a published Apple specification).

Core Content: The Classic 5-Stage Pipeline

Stage Definitions

Stage 1: IF  — Instruction Fetch
Stage 2: ID  — Instruction Decode / Register Read
Stage 3: EX  — Execute (ALU, address calculation)
Stage 4: MEM — Memory Access (load/store)
Stage 5: WB  — Write Back (update register file)

Pipeline Diagram: 5 Instructions in Steady State

Clock:     1    2    3    4    5    6    7    8    9
           ─────────────────────────────────────────
I1:        IF   ID   EX   MEM  WB
I2:             IF   ID   EX   MEM  WB
I3:                  IF   ID   EX   MEM  WB
I4:                       IF   ID   EX   MEM  WB
I5:                            IF   ID   EX   MEM  WB

In steady state, one instruction completes per clock cycle (CPI=1), even though each instruction takes 5 cycles end-to-end (latency=5). Throughput ≠ latency.
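The throughput/latency distinction can be checked with a small calculation. This is a sketch for an ideal pipeline with no stalls; the function names are illustrative.

```python
def pipelined_cycles(num_instructions: int, stages: int = 5) -> int:
    """Cycles to run instructions through an ideal pipeline with no stalls.

    The first instruction needs `stages` cycles; each later one finishes
    one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions: int, stages: int = 5) -> int:
    """Cycles if each instruction must finish before the next one starts."""
    return num_instructions * stages


if __name__ == "__main__":
    print(pipelined_cycles(5))  # the 5-instruction diagram: 9 cycles
    n = 1000
    print(pipelined_cycles(n) / n)                       # CPI approaches 1
    print(unpipelined_cycles(n) / pipelined_cycles(n))   # speedup approaches 5
```

For long instruction streams the fill time of the pipeline becomes negligible, so CPI approaches 1 and the speedup over an unpipelined design approaches the stage count.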

Pipeline Registers

Between each stage sit pipeline registers (flip-flops clocked at the CPU's master clock). They hold the instruction and its associated data as it moves through:

  ┌────┐  ┌──────────┐  ┌────┐  ┌──────────┐  ┌────┐
  │ IF │──│IF/ID reg │──│ ID │──│ID/EX reg │──│ EX │
  └────┘  └──────────┘  └────┘  └──────────┘  └────┘
                                                   │
  ┌────┐  ┌──────────┐  ┌─────┐  ┌──────────┐     │
  │ WB │──│MEM/WB reg│──│ MEM │──│EX/MEM reg│─────┘
  └────┘  └──────────┘  └─────┘  └──────────┘
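A minimal model of this behaviour, assuming no stalls: on every clock edge the contents of each pipeline register move one stage forward. The function and variable names are illustrative only.

```python
from typing import List, Optional

STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def simulate(program: List[str]) -> List[List[Optional[str]]]:
    """Return, for each clock cycle, which instruction occupies each stage."""
    slots: List[Optional[str]] = [None] * len(STAGES)  # slots[i] is in STAGES[i]
    pending = list(program)
    history = []
    while pending or any(s is not None for s in slots):
        # Clock edge: every instruction advances one stage, the one in WB
        # retires, and the next instruction (if any) enters IF.
        slots = [pending.pop(0) if pending else None] + slots[:-1]
        if any(s is not None for s in slots):
            history.append(list(slots))
    return history


if __name__ == "__main__":
    for cycle, slots in enumerate(simulate(["I1", "I2", "I3", "I4", "I5"]), start=1):
        print(cycle, dict(zip(STAGES, slots)))
```

In cycle 5 every stage is occupied (I5 in IF through I1 in WB), which is the steady state shown in the pipeline diagram.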

Pipeline Hazards

Hazards are situations that prevent the next instruction from executing in its designated clock cycle. Three categories:

1. Structural Hazards

A required hardware resource is occupied by another instruction. Example: a single shared memory port serving both instruction fetch (IF stage) and data access (MEM stage) simultaneously.

Modern solution: separate instruction cache (I-cache) and data cache (D-cache) eliminate the most common structural hazard. Modern CPUs also have multiple execution ports (see superscalar section).

2. Data Hazards

An instruction depends on the result of a prior instruction that has not yet written back its result. Three sub-types:

RAW  Read After Write   — true dependency (I2 reads what I1 writes)
WAW  Write After Write  — output dependency (both write same register)
WAR  Write After Read   — anti-dependency (I2 writes before I1 reads)

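RAW is the most critical: it is a true data dependency, so register renaming cannot eliminate it. WAW and WAR are name dependencies (two instructions merely reuse the same register name), and renaming removes them.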

Example:

ADD  R1, R2, R3    ; I1: R1 = R2 + R3  (writes R1 in WB, cycle 5)
SUB  R4, R1, R5    ; I2: R4 = R1 - R5  (reads R1 in ID, cycle 3 — STALE!)

Without mitigation, I2 reads R1 before I1 has written it. The result is incorrect.
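The three hazard categories can be checked mechanically from the destination and source registers of two instructions. The sketch below uses a hypothetical `Instr` tuple; it is not how any particular hardware encodes instructions.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    text: str
    dest: str
    srcs: Tuple[str, ...]


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Dependencies of `second` on `first` (`first` is earlier in program order)."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # second reads what first writes
    if first.dest == second.dest:
        found.add("WAW")  # both write the same register
    if second.dest in first.srcs:
        found.add("WAR")  # second overwrites a register first still reads
    return found


if __name__ == "__main__":
    i1 = Instr("ADD R1, R2, R3", "R1", ("R2", "R3"))
    i2 = Instr("SUB R4, R1, R5", "R4", ("R1", "R5"))
    print(hazards(i1, i2))  # {'RAW'}
```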

Solution 1: Pipeline Stall / Bubble

The hazard detection unit stalls I2 (and all subsequent instructions) for 2 cycles by inserting NOPs (bubbles) into the pipeline:

Clock:     1    2    3    4    5    6    7    8    9
I1:        IF   ID   EX   MEM  WB
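I2:             IF   ID   ID   ID   EX   MEM  WB
I3:                  IF   IF   IF   ID   EX   MEM  WB
(I2 is held in ID and I3 is held in IF during cycles 4 and 5; EX receives bubbles.)

Cost: 2 wasted cycles for a RAW dependency between adjacent instructions in a 5-stage pipeline without forwarding. This assumes the register file is written in the first half of a cycle and read in the second half, so I2 can read R1 in cycle 5, the same cycle in which I1 writes it back. Without that, the cost is 3 cycles.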
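The stall count depends on how far the consumer is behind the producer. A sketch of the arithmetic for a 5-stage pipeline without forwarding, where `split_cycle_regfile` is an assumed parameter meaning that a register written in WB can be read in ID during the same cycle:

```python
def raw_stall_cycles(distance: int, split_cycle_regfile: bool = True) -> int:
    """Stall cycles for a RAW dependency with no forwarding.

    distance: 1 if the consumer immediately follows the producer, 2 if one
    unrelated instruction sits between them, and so on.
    """
    # The producer is in ID in cycle 2 and in WB in cycle 5. Unstalled, the
    # consumer is in ID in cycle 2 + distance. Its ID must fall in cycle 5
    # (split-cycle register file) or cycle 6 (otherwise).
    required_gap = 3 if split_cycle_regfile else 4
    return max(0, required_gap - distance)


if __name__ == "__main__":
    for d in (1, 2, 3):
        print("distance", d, "->", raw_stall_cycles(d), "stall cycles")
```

With the default setting an adjacent consumer stalls 2 cycles, and a consumer three or more instructions later needs no stall at all.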

Solution 2: Operand Forwarding (Bypassing)

Hardware paths that short-circuit the normal register file read, delivering results directly from where they were computed to where they are needed — without waiting for WB.

  Forwarding paths (I2 and I3 both read the register that I1 writes):

  Clock:    1    2    3    4    5    6    7
  I1:       IF   ID   EX   MEM  WB
                      │    │
                      │    └──► I3's EX in cycle 5 (from the MEM/WB register)
                      └──► I2's EX in cycle 4 (from the EX/MEM register)
  I2:            IF   ID   EX   MEM  WB
  I3:                 IF   ID   EX   MEM  WB

Detailed forwarding diagram with data paths:

  PC ──► IF ──► [IF/ID] ──► ID ──► [ID/EX] ──► MUX ──► EX ──► [EX/MEM] ──► MEM ──► [MEM/WB] ──► WB
                                               ▲ ▲                │                    │
                                               │ └────────────────┘                    │
                                               └───────────────────────────────────────┘

  Inner path: EX/MEM register → EX input MUX (result of the previous instruction)
  Outer path: MEM/WB register → EX input MUX (result from two instructions back)

  Forward MUX at EX stage selects:
    [0] Register file output (normal)
    [1] EX/MEM pipeline register (forward from prior EX)
    [2] MEM/WB pipeline register (forward from two cycles ago)
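The MUX select can be sketched as a small priority check. This follows the textbook forwarding-unit logic; the function signature is an assumption made for illustration, with `None` standing for "this instruction writes no register".

```python
from typing import Optional

REGFILE, FROM_EX_MEM, FROM_MEM_WB = 0, 1, 2


def forward_select(src: str,
                   ex_mem_dest: Optional[str],
                   mem_wb_dest: Optional[str]) -> int:
    """Choose the MUX input for one source operand of the instruction in EX."""
    # EX/MEM holds the younger result, so it takes priority when both match.
    if ex_mem_dest is not None and ex_mem_dest == src:
        return FROM_EX_MEM
    if mem_wb_dest is not None and mem_wb_dest == src:
        return FROM_MEM_WB
    return REGFILE


if __name__ == "__main__":
    # SUB R4, R1, R5 in EX while ADD R1, R2, R3 sits in the EX/MEM register.
    print(forward_select("R1", ex_mem_dest="R1", mem_wb_dest=None))  # 1
    print(forward_select("R5", ex_mem_dest="R1", mem_wb_dest=None))  # 0
```

Checking EX/MEM first matters when two earlier instructions wrote the same register: the consumer must see the most recent value.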

With full forwarding, most RAW hazards require 0 stall cycles. The only unavoidable stall: load-use hazard — a load followed immediately by an instruction using the loaded value requires 1 stall cycle because the data is not available until end of MEM stage.

LDR  R1, [R2]    ; load from memory — result available after MEM
ADD  R3, R1, R4  ; needs R1 — 1 cycle stall unavoidable
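The load-use check performed by the hazard detection unit can be sketched as follows. The parameter names mirror the pipeline registers but are otherwise illustrative.

```python
from typing import Iterable


def load_use_stall(id_ex_is_load: bool, id_ex_dest: str,
                   if_id_srcs: Iterable[str]) -> bool:
    """True if the instruction in ID must wait one cycle for a load in EX."""
    # Only a load forces the stall: an ALU result would already be available
    # for forwarding from the EX/MEM register in the next cycle.
    return id_ex_is_load and id_ex_dest in set(if_id_srcs)


if __name__ == "__main__":
    # LDR R1, [R2] in EX, ADD R3, R1, R4 in ID
    print(load_use_stall(True, "R1", ["R1", "R4"]))   # True
    # ADD R1, R2, R3 in EX instead of a load: forwarding covers it
    print(load_use_stall(False, "R1", ["R1", "R4"]))  # False
```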

3. Control Hazards

Caused by branch instructions. The CPU doesn't know the next instruction's address until the branch is resolved in EX (or later). Meanwhile, instructions after the branch have already entered IF and ID stages — if the branch is taken, those instructions must be flushed (turned into bubbles).

Clock:     1    2    3    4    5
BEQ:       IF   ID   EX ← branch resolved here
I_fall:         IF   ID ← must flush if branch taken
I_fall+1:            IF ← must flush if branch taken

With branch resolved in EX stage: 2-cycle branch penalty if taken.

Branch Prediction: Modern CPUs predict both the direction and the target of a branch and continue fetching speculatively along the predicted path. A correct prediction costs 0 cycles. A misprediction requires flushing every instruction fetched after the branch. The cost is roughly the number of stages between fetch and branch resolution, commonly approximated as pipeline-depth - 1 cycles (approximately 15-20 cycles on modern deep pipelines).
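As a minimal illustration of direction prediction, the sketch below implements a 2-bit saturating counter, a classic textbook scheme. It is not a description of what any modern CPU ships; real predictors combine much longer histories and multiple tables.

```python
from typing import Iterable


class TwoBitCounter:
    """Saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, initial: int = 2) -> None:
        self.state = initial

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes: Iterable[bool]) -> int:
    """Run one counter over a sequence of actual branch outcomes."""
    predictor = TwoBitCounter()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


if __name__ == "__main__":
    # A loop branch: taken 9 times, then falls through, repeated 10 times.
    loop_branch = ([True] * 9 + [False]) * 10
    print(count_mispredictions(loop_branch), "of", len(loop_branch))
```

For this loop pattern the counter mispredicts only the loop exit, once per 10 branches, because a single not-taken outcome does not flip the prediction.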

Superscalar Execution

A superscalar CPU issues more than one instruction per clock cycle. This requires:

Wide fetch: Fetch multiple instructions per cycle (fetch width)
Wide decode: Decode multiple instructions per cycle (decode width)
Multiple execution units: Parallel ALUs, load/store units, FPUs
Dependency checking: Hardware that identifies which instructions can issue in parallel
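The dependency-checking requirement can be sketched for a simple in-order machine: consecutive instructions issue together until the group is full or an instruction depends on one already in the group. This is a simplified model under stated assumptions (it ignores execution latency and execution-unit limits), and the `Instr` type is hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    text: str
    dest: str
    srcs: Tuple[str, ...]


def issue_groups(program: List[Instr], width: int) -> List[List[Instr]]:
    """Greedily pack consecutive instructions into per-cycle issue groups."""
    groups: List[List[Instr]] = []
    current: List[Instr] = []
    for ins in program:
        written = {earlier.dest for earlier in current}
        # RAW or WAW against an instruction already in this group.
        depends = ins.dest in written or any(s in written for s in ins.srcs)
        if current and (len(current) == width or depends):
            groups.append(current)
            current = []
        current.append(ins)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    chain = [Instr(f"ADD R1, R1, R{i}", "R1", ("R1", f"R{i}")) for i in range(2, 6)]
    indep = [Instr(f"ADD R{i}, R{i+10}, R{i+20}", f"R{i}", (f"R{i+10}", f"R{i+20}"))
             for i in range(1, 5)]
    print(len(issue_groups(chain, 4)), "cycles for the dependent chain")
    print(len(issue_groups(indep, 4)), "cycle for the independent adds")
```

A 4-wide machine gains nothing on the dependent chain, which still issues one instruction per group.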

Fetch Width vs Decode Width

These are often different. Fetch retrieves a block of bytes from the I-cache. Decode identifies instruction boundaries (critical for x86 with variable-length encoding) and translates to uops.

                    ┌──────────────────────┐
  I-Cache ─────────►│  Fetch Buffer        │ 16-32 bytes / cycle
                    └──────────┬───────────┘
                               │
                    ┌──────────▼───────────┐
                    │  Pre-Decode          │ find instruction boundaries
                    └──────────┬───────────┘
                               │
                    ┌──────────▼───────────┐
                    │  Instruction Queue   │ buffer pre-decoded instrs
                    └──────────┬───────────┘
                               │ (decode width)
                    ┌──────────▼───────────┐
                    │  Decode (4-8 wide)   │ → uops
                    └──────────┬───────────┘
                               │
                    ┌──────────▼───────────┐
                    │  Rename / Dispatch   │ → OoO engine
                    └──────────────────────┘
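Why boundary-finding is a separate step can be seen with a toy variable-length encoding. The format below is entirely hypothetical (the first byte of each instruction holds its total length in bytes) and is not x86; it only shows that the start of each instruction depends on the lengths of all the instructions before it.

```python
from typing import List


def find_boundaries(fetch_block: bytes) -> List[int]:
    """Return the start offset of each complete instruction in a fetch block.

    Toy encoding (an assumption for illustration): byte 0 of every
    instruction is its total length in bytes.
    """
    offsets = []
    pos = 0
    while pos < len(fetch_block):
        length = fetch_block[pos]
        if length == 0 or pos + length > len(fetch_block):
            break  # malformed, or the instruction spills into the next block
        offsets.append(pos)
        pos += length
    return offsets


if __name__ == "__main__":
    block = bytes([2, 0xAA, 3, 0xBB, 0xCC, 1, 4, 1, 2, 3])
    print(find_boundaries(block))  # [0, 2, 5, 6]
```

The loop is inherently serial, which is why hardware with variable-length instructions spends dedicated logic (and sometimes cached boundary marks) on this step before wide decode can start.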

Decode Width Comparison (2024)

CPU	Decode Width	Pipeline Depth (approx)
Intel Pentium (P5)	2 (U+V)	5
Intel Pentium Pro	3 uops/cycle	14
Intel Sandy Bridge	4 uops/cycle	14
Intel Skylake	4 uops/cycle	14 (19 on a uop-cache miss)
Intel Sunny Cove	5 uops/cycle	~14
AMD Zen 2	4 uops/cycle	~19
AMD Zen 4	4 uops/cycle	~15
Apple M1	8 uops/cycle	~15
Apple M4	8 uops/cycle	~15
ARM Cortex-X4	10 instrs/cycle	~13

Apple M1's 8-wide decode is remarkable. Most of the benefit comes when code has enough independent instructions to fill all decode slots — common in high-performance compute kernels but not in pointer-chasing workloads.

Why Deep Pipelines Are Problematic

Pipeline depth directly multiplies branch misprediction penalty:

Misprediction penalty ≈ pipeline_depth - 1 cycles

Pentium (P5), depth 5:      ~4 cycle penalty
Pentium 4 (Prescott), depth 31: ~30 cycle penalty
Intel Core (modern), depth ~15: ~15 cycle penalty

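If a branch predictor has 95% accuracy and branches occur every 5 instructions, consider 100 instructions on a 4-wide machine:

IPC with perfect prediction: 4.0 (4-wide decode)
Cycles for 100 instrs at IPC 4.0: 25
Branch every 5 instrs → 20 branches per 100 instrs
Miss rate: 5% → 1 misprediction per 100 instrs
Penalty: 15 cycles per misprediction
Total: 25 + 15 = 40 cycles per 100 instrs
Effective IPC: 100 / 40 = 2.5 (37.5% throughput loss)

With Pentium 4's 30-cycle penalty under the same conditions: 25 + 30 = 55 cycles per 100 instructions, an effective IPC of about 1.8 (roughly 55% throughput loss).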

This is a large part of why Intel abandoned NetBurst: the frequency gains from a 31-stage pipeline were largely offset by the branch misprediction penalty, while power and heat capped the clock near 3.8 GHz.
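The cost model behind this kind of estimate is cycles per instruction = 1 / peak IPC + (mispredictions per instruction × penalty). A small sketch, with illustrative parameter names, makes it easy to try other depths and accuracies. It ignores every other source of stalls.

```python
def effective_ipc(peak_ipc: float, instrs_per_branch: float,
                  accuracy: float, penalty_cycles: float) -> float:
    """IPC after charging a fixed penalty for each branch misprediction."""
    base_cpi = 1.0 / peak_ipc
    mispredicts_per_instr = (1.0 - accuracy) / instrs_per_branch
    return 1.0 / (base_cpi + mispredicts_per_instr * penalty_cycles)


if __name__ == "__main__":
    for penalty in (4, 15, 30):
        ipc = effective_ipc(peak_ipc=4.0, instrs_per_branch=5,
                            accuracy=0.95, penalty_cycles=penalty)
        print(f"penalty {penalty:2d} cycles -> IPC {ipc:.2f}")
```

Because the penalty term is added to a small base CPI, the wider the machine, the larger the share of time lost to each misprediction.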

Production Examples and Debugging

Measuring Pipeline Effects with `perf`

# Count branch mispredictions
perf stat -e branches,branch-misses,cycles,instructions ./program

# Output example:
#    1,234,567,890      branches
#       12,345,678      branch-misses   #    1.00% of all branches
#    4,000,000,000      cycles
#    3,600,000,000      instructions    #    0.90  insn per cycle

# IPC of 0.90 on a 4-wide machine → significant pipeline waste
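The derived metrics follow directly from the four counters. A short helper, using the sample numbers above (the function name and dictionary keys are illustrative):

```python
from typing import Dict


def derive_metrics(branches: int, branch_misses: int,
                   cycles: int, instructions: int) -> Dict[str, float]:
    """Compute standard ratios from raw perf counter values."""
    return {
        "ipc": instructions / cycles,
        "cpi": cycles / instructions,
        "branch_miss_rate": branch_misses / branches,
        # Mispredictions per 1000 instructions, often easier to compare
        # across programs with different branch densities.
        "branch_mpki": branch_misses * 1000 / instructions,
    }


if __name__ == "__main__":
    m = derive_metrics(branches=1_234_567_890, branch_misses=12_345_678,
                       cycles=4_000_000_000, instructions=3_600_000_000)
    for name, value in m.items():
        print(f"{name}: {value:.4f}")
```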

Identifying Load-Use Hazards

A common source of pipeline stalls in tight loops is load-use dependency chains:

// Anti-pattern: pointer chasing (unavoidable load-use stalls)
while (node) {
    sum += node->value;
    node = node->next;  // loads next pointer, then immediately dereferences
}

The hardware cannot hide this because each node->next load must complete before the next loop iteration's load address is known. Memory latency (~4 cycles L1, ~12 cycles L2, ~40 cycles L3) dominates.
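A rough latency model shows the size of the effect. It assumes every load hits the same cache level and, for the independent case, that a fixed number of loads can be in flight at once; `loads_in_flight` is a hypothetical parameter, not a figure for any specific CPU.

```python
import math


def dependent_chain_cycles(num_loads: int, load_latency: int) -> int:
    """Each load's address comes from the previous load: latencies add up."""
    return num_loads * load_latency


def independent_loads_cycles(num_loads: int, load_latency: int,
                             loads_in_flight: int) -> int:
    """Addresses known up front: loads overlap in batches."""
    return math.ceil(num_loads / loads_in_flight) * load_latency


if __name__ == "__main__":
    n = 1000
    for level, latency in (("L1", 4), ("L2", 12), ("L3", 40)):
        print(level,
              dependent_chain_cycles(n, latency),
              independent_loads_cycles(n, latency, loads_in_flight=10))
```

With the L3 figure from the text, 1000 chained loads cost about 40,000 cycles, against about 4,000 if ten independent loads can overlap.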

Loop Unrolling to Hide Latency

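// Unrolled with two accumulators: the sum1 and sum2 add chains are independent
sum1 += a[0]; sum2 += a[4];
sum1 += a[1]; sum2 += a[5];
sum1 += a[2]; sum2 += a[6];
sum1 += a[3]; sum2 += a[7];
sum = sum1 + sum2;  // combine the partial sums once at the end

Multiple independent accumulators break the single serial dependency chain on the add. The loads from a[] were already independent of each other; what limited the one-accumulator version was that each add had to wait for the previous add's result. With two chains, adds from different chains overlap in the pipeline.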
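The benefit can be estimated from the length of the longest dependency chain. This sketch assumes a pipelined adder with enough execution units to keep every chain busy, and a serial combine of the partial sums at the end; the 4-cycle add latency in the example is an assumption.

```python
import math


def reduction_critical_path(n: int, accumulators: int, add_latency: int) -> int:
    """Cycles on the longest dependency chain when summing n elements."""
    adds_per_chain = math.ceil(n / accumulators)
    combine_adds = accumulators - 1  # merge partial sums one after another
    return (adds_per_chain + combine_adds) * add_latency


if __name__ == "__main__":
    for k in (1, 2, 4):
        print(k, "accumulator(s):", reduction_critical_path(8, k, add_latency=4), "cycles")
```

For only 8 elements the serial combine step eats most of the gain beyond two accumulators; on longer arrays the per-chain term dominates and more accumulators keep paying off until execution units run out.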

Security Implications

Pipeline design has direct security consequences:

Branch predictor state is shared between security contexts (processes, VMs). Spectre v2 exploits indirect branch prediction by poisoning the Branch Target Buffer from one process to misdirect speculation in another. (See 03-speculative-execution.md.)
Forwarding paths and timing: The presence of forwarding can make certain instruction sequences execute in fewer cycles, potentially creating timing side-channels. This is generally not directly exploitable but contributes to the microarchitectural attack surface.
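Pipeline flushing on privilege changes: On syscall entry, the CPU must carefully manage pipeline state. Instructions that were speculatively fetched from user space must be flushed before kernel execution begins. Meltdown exposed a related weakness: on affected CPUs the permission check for a user-mode load of kernel memory took effect only at retirement, so dependent transient instructions could encode the loaded data in the cache before the fault was raised, letting user code read kernel memory.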
Transient execution: Instructions that are in the pipeline but will be squashed (due to branch misprediction, exception, etc.) can still leave microarchitectural side effects in caches and other buffers. This is the root class of Spectre/Meltdown attacks.

Performance Implications

Key performance metrics and their pipeline interpretations:

Metric	Formula	Typical Range
IPC	instructions / cycles	0.5–5.0
CPI	cycles / instructions	0.2–2.0
Branch miss rate	misses / branches	0.5%–10%
Misprediction penalty	cycles wasted / miss	10–30 cycles
Load-use stalls	stall_cycles / cycles	0%–40%

A well-tuned, compute-bound workload can reach IPC 2.0–3.5 on a modern 4-wide machine; memory-bound server workloads often run well below that. IPC below 1.5 usually indicates memory latency, branch mispredictions, or both.

Failure Modes

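Branch predictor thrashing: Data-dependent branches with effectively random outcomes, or more hot branches than the predictor tables can track, can defeat prediction and cause sustained 15-20% throughput loss. (A strictly alternating taken/not-taken pattern, by contrast, is easy for history-based predictors.)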
Pipeline replay: Some CPUs speculatively execute operations before memory disambiguation is confirmed. If a dependency is discovered late, the pipeline "replays" the dependent instructions. Intel's pre-Nehalem architectures had memory ordering replay storms on certain store-then-load patterns.
Long-latency instruction in decode: x86 complex instructions (like certain REP MOVS variants or legacy x87 FP) can decode into many uops and stall the decode stage for multiple cycles.
Front-end bottleneck: If the instruction stream is so dense with branches that the branch predictor falls behind, fetch bandwidth drops and the out-of-order backend starves despite its capacity.

Modern Usage and Current State

Modern CPUs from all major vendors (Intel, AMD, Apple, ARM) implement:

Multi-level branch prediction (see 04-branch-prediction.md)
Out-of-order execution (see 02-out-of-order-execution.md)
Speculative execution (see 03-speculative-execution.md)
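Loop stream detectors: Intel CPUs detect small loops (up to roughly 64 uops on recent cores; the limit varies by generation) and replay them from the micro-op queue, bypassing fetch and decode for subsequent iterations and saving power and front-end bandwidth. This is separate from the DSB (decoded stream buffer, the uop cache), which caches decoded uops for a much larger code footprint.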
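Micro-fusion and macro-fusion: Micro-fusion keeps two uops from the same instruction (for example, a load and the ALU operation that consumes it) together as a single entry through decode, rename, and retirement. Macro-fusion merges an adjacent instruction pair (typically a compare or test followed by a conditional branch) into a single uop. Both effectively widen the pipeline.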

Apple's M-series design, with 8-wide decode and a ROB reported at roughly 630 entries on M1, represents a bet that wide OoO windows are more valuable than deep pipelines. This is well-suited for macOS workloads where single-threaded latency dominates and memory access patterns are cache-friendly.

Future Directions

Wider decode: 8-wide (Apple M-series) may expand. 10-12 wide is theoretically possible but faces diminishing returns from instruction-level parallelism limits (Amdahl-style serial bottlenecks in real code).
Dataflow execution: Academic designs (MIT Raw, Intel's research architectures) propose abandoning the von Neumann sequential fetch model in favor of pure dataflow graphs. Not shipping commercially.
Disaggregated pipelines: RISC-V's open ISA enables custom pipeline stages for specific workloads (cryptographic accelerators inserted as pipeline stages).
Thread-level speculation: Hardware support for speculatively parallelizing serial loop iterations across cores, with rollback on conflict. IBM and Intel have explored but not commercialized this.
AI-assisted branch prediction: Apple and others have reported using ML models offline to generate better branch predictor initialization data, and proposals exist for small neural networks in hardware for branch prediction.

Exercises

Hazard analysis: Given the following instruction sequence, identify all RAW hazards and determine how many stall cycles are needed, first assuming a classic 5-stage pipeline with no forwarding, then with full forwarding into the EX stage (from the EX/MEM and MEM/WB pipeline registers):
    ADD  R1, R2, R3
    MUL  R4, R1, R5
    LDR  R6, [R1]
    ADD  R7, R6, R4
Pipeline depth tradeoff: A CPU designer can choose between a 10-stage pipeline at 3.0 GHz or a 20-stage pipeline at 4.5 GHz. Branch predictor accuracy is 97%, and branches occur every 8 instructions. Which design delivers higher sustained IPC? Show your work.
Superscalar dependency analysis: Write a short assembly function that computes sum = a[0]+a[1]+...+a[7]. Identify which version — using a single accumulator register versus 4 separate partial-sum registers — would better exploit a 4-wide superscalar pipeline and explain why.
Forwarding path design: Draw the forwarding multiplexer logic needed at the input to the EX stage in a 5-stage pipeline. Include control signals from the hazard detection unit that select between register file output and the two forwarding paths.
Profiling exercise: Run perf stat -e cycles,instructions,branches,branch-misses,stalled-cycles-frontend,stalled-cycles-backend on a known compute-heavy binary (e.g., gzip, openssl speed). Interpret the results: what is the IPC, branch miss rate, and where is the dominant pipeline bottleneck?

References

Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach (6th ed.). Morgan Kaufmann. Chapters 3–4.
Intel 64 and IA-32 Architectures Optimization Reference Manual. Order Number: 248966-045+. Section 2 (Pipeline Overview).
Fog, A. (2023). Microarchitecture of Intel, AMD and VIA CPUs. https://www.agner.org/optimize/microarchitecture.pdf
Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE, 83(12), 1609–1624.
Bhandarkar, D. P. (1995). Alpha Architecture and Implementation. Digital Press.
Yeager, K. (1996). The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2), 28–41.
Intel P6 Microarchitecture White Paper (1995). Intel Corporation.