Out-of-Order Execution: Tomasulo, ROB, and the OoO Engine

Prerequisites

CPU pipeline fundamentals (01-cpu-pipeline.md): stages, hazards, forwarding
Data hazard types: RAW, WAR, WAW dependencies
Register file architecture: architectural registers vs physical registers
Basic memory hierarchy: cache hit/miss concept
Assembly language: instruction encoding, operands, flags

Technical Overview

Out-of-order (OoO) execution is the hardware technique by which a CPU executes instructions in a different order than the programmer wrote them, as long as the results are identical to in-order execution. The motivation is straightforward: a long-latency operation (cache miss, division, multi-cycle multiply) causes all subsequent in-order instructions to stall even if those instructions have no dependency on the slow operation.

An OoO processor dynamically schedules instructions at runtime, finding independent work to do while waiting for slow operations to complete. The key insight: preserve the architectural (programmer-visible) state as if execution were in-order, while the underlying microarchitectural execution is free to reorder.

The practical impact is enormous. A modern CPU with a 200-entry ROB (ReOrder Buffer) can have 200 instructions "in flight" simultaneously — fetched, partially executed, awaiting results — while the programmer's sequential model remains preserved. This window of 200 instructions is searched for independent work at every cycle.

Historical Context

1967 — Tomasulo Algorithm (IBM 360/91): Robert Tomasulo at IBM designed the first dynamic instruction scheduling algorithm for the IBM System/360 Model 91 floating-point unit. The 360/91 had long-latency FP operations (addition: 2 cycles, multiplication: 3 cycles, division: 12 cycles) and Tomasulo's algorithm allowed the FPU to overlap these, issuing FP operations out-of-order while maintaining correct data flow via Common Data Bus (CDB) broadcast.

This algorithm introduced the two foundational concepts that all modern OoO processors use: reservation stations and register renaming via tagging.

1990 — IBM POWER1: First general-purpose processor with OoO execution across integer and FP operations.

1995 — Intel Pentium Pro (P6): First OoO x86 processor. The P6 microarchitecture introduced the decode→uop pipeline with a 40-entry ROB. This was architecturally audacious: the x86 ISA's complex instructions are decoded into simple, RISC-like micro-operations (uops) that then flow through a Tomasulo-style OoO engine. Every Intel Core processor since traces to this design.

1996 — MIPS R10000: Clean RISC OoO design, 32-entry ROB, 4-wide issue, influential in academic study due to its clean architecture.

2013 — Intel Haswell: 192-entry ROB, 4-wide decode, multiple load/store units.

2019 — Intel Sunny Cove (Ice Lake): 352-entry ROB (temporarily, later revised), 5-wide decode.

2021 — Apple M1: 630-entry ROB — the largest in any shipping CPU as of 2024. This massive window allows hiding very large memory latencies and finding ILP across hundreds of instructions.

2023 — AMD Zen 4: 320-entry ROB, 4-wide decode, 6 execution ports.

Core Content: The Tomasulo Algorithm

Original IBM 360/91 Structure

                     ┌─────────────────────┐
  Instruction Issue ─►│  Reservation        │
                     │  Stations           │
                     │  ┌───┬───┬───┬───┐  │
                     │  │RS1│RS2│RS3│RS4│  │
                     │  └─┬─┴─┬─┴─┬─┴─┬┘  │
                     └────┼───┼───┼───┼───┘
                          │   │   │   │
              ┌───────────▼───▼───▼───▼──────────┐
              │          Execution Units          │
              │  [ADD unit] [MUL unit] [DIV unit] │
              └─────────────────┬────────────────┘
                                │ result + tag
                     ┌──────────▼──────────┐
                     │   Common Data Bus   │ (broadcast to all RSs + RegFile)
                     └─────────────────────┘

Reservation Stations

A reservation station (RS) is a buffer that holds an instruction waiting to execute. Each RS entry contains:

┌─────────────────────────────────────────────────────────┐
│ Reservation Station Entry                                │
│                                                          │
│  Op    : ADD                                            │
│  Vj    : 42        (value of source operand 1, if ready)│
│  Vk    : <wait>    (value of source operand 2, not ready)│
│  Qj    : --        (tag of RS producing Vj, null=ready) │
│  Qk    : RS3       (waiting for result from RS3)        │
│  Dest  : R5        (destination architectural register) │
│  Busy  : YES                                            │
└─────────────────────────────────────────────────────────┘

An instruction in a RS is eligible to execute ("fire") when both Qj and Qk are null (both operands are ready). When any execution unit completes, it broadcasts its result on the CDB with a tag. All RSs that are waiting on that tag capture the value and clear their Qj/Qk field.

This is the fundamental elegance of Tomasulo: operand values flow directly from producer to consumer via the CDB, bypassing the register file entirely. The register file is only updated as a side effect.

Register Renaming

Register renaming is the mechanism that eliminates false dependencies (WAR and WAW hazards) by mapping architectural register names to a larger pool of physical registers.

The False Dependency Problem

MUL R1, R2, R3   ; I1: R1 = R2 * R3  (takes 4 cycles)
ADD R4, R1, R5   ; I2: R4 = R1 + R5  (TRUE dependency on I1 — must wait)
MOV R1, R6       ; I3: R1 = R6       (WAW with I1 — FALSE dependency)
SUB R7, R1, R8   ; I4: R7 = R1 - R8  (WAR with I1 via I3 — FALSE dependency)

I3 cannot execute before I1 in in-order execution (WAW: both write R1). But logically, I3's R1 is completely independent — it's a new value. Similarly, I4 must read the R1 that I3 writes, not I1's R1. These are naming conflicts, not data dependencies.

Rename Map Table

The register rename unit maintains a map from each architectural register to the most recently assigned physical register:

BEFORE I1:                    AFTER I1 issued:
  Arch → Phys                   Arch → Phys
  R1   → P10  (current value)   R1   → P22  (new allocation for I1's result)
  R2   → P14                    R2   → P14
  R3   → P07                    R3   → P07
  R4   → P18                    R4   → P18

I1 is renamed: MUL P22, P14, P07
              (P22 is fresh physical register for result)

After renaming I2 (ADD R4, R1, R5): - R1 lookup → P22 (the result I1 will produce) - R5 lookup → P19 - R4 allocates → P23 - Renamed: ADD P23, P22, P19 (P22 is a dependency, but it's the only TRUE one)

After renaming I3 (MOV R1, R6): - R1 gets a NEW physical register → P24 - R6 lookup → P11 - Renamed: MOV P24, P11 - I3 has NO dependency on P22 at all — the WAW is eliminated

After renaming I4 (SUB R7, R1, R8): - R1 lookup → P24 (I3's output, not I1's) - Renamed: SUB P25, P24, P09 - I4 correctly depends on I3's P24, not I1's P22 — the WAR is resolved

Physical register file:
┌────────────────────────────────────────────────────────┐
│  Architectural View    Physical View                    │
│                                                         │
│  R1: current=P10      P10: value=42  (prior R1)        │
│      in-flight=P22    P22: pending   (I1 result)       │
│      in-flight=P24    P24: pending   (I3 result)       │
│                                                         │
│  At any time, multiple physical regs map to R1         │
│  Only the most recently committed one is "R1"          │
└────────────────────────────────────────────────────────┘

Physical register file sizes: - Intel Skylake: 180 integer physical registers, 168 FP physical registers - AMD Zen 4: 192 integer physical registers, 192 FP physical registers - Apple M1: ~600+ physical registers (estimated from ROB size and superscalar width)

The ReOrder Buffer (ROB)

The ROB maintains program order for commit while allowing out-of-order execution. It is a circular buffer, FIFO (First In First Out), where instructions enter at the tail in program order and retire from the head in program order.

ROB Entry Structure

┌──────────────────────────────────────────────────┐
│ ROB Entry                                         │
│                                                   │
│  PC         : 0x401234                           │
│  Instruction: ADD P23, P22, P19                  │
│  Arch dest  : R4                                 │
│  Phys dest  : P23                                │
│  Value      : <not ready>                        │
│  Done       : NO                                 │
│  Exception  : NONE                               │
└──────────────────────────────────────────────────┘

ROB Lifecycle

         program order
         ────────────►
  ┌──────────────────────────────────────────────────────┐
  │                    ROB (circular)                     │
  │                                                        │
  │ HEAD                                              TAIL │
  │  │                                                  │  │
  │  ▼                                                  ▼  │
  │ [I1:done][I2:done][I3:wait][I4:done][I5:wait][I6:--]  │
  │                                                        │
  │  Commit: I1, I2 can retire (in order from head)       │
  │  I3 blocks commit (not yet done — stall retirement)   │
  └──────────────────────────────────────────────────────┘

Dispatch: new instructions enter at TAIL
Commit:   completed instructions retire from HEAD (in order)

ROB Size and ILP

The ROB size determines the OoO window: the maximum number of instructions that can be simultaneously in-flight. A larger ROB enables hiding more latency.

Consider a cache miss (L3 hit, ~40 cycle latency) in a 4-wide machine:

With 40-entry ROB:  only 40 instructions can be examined for ILP
                    → small chance of finding 40 independent instructions

With 320-entry ROB: 320 instructions examined
                    → much higher probability of finding independent work
                    → effectively hides the 40-cycle miss behind other work

ROB sizes over time:

Year  CPU                    ROB entries
1995  Intel P6 (Pentium Pro) 40
1999  Intel P6 (Pentium III) 40
2000  Intel P4 (Willamette)  126
2006  Intel Core 2 (Conroe)  96
2012  Intel Ivy Bridge       168
2015  Intel Skylake          224
2019  Intel Sunny Cove       352
2021  Apple M1               630
2021  Intel Alder Lake (P)   512 (efficiency core: 256)
2022  AMD Zen 4              320
2024  Apple M4               ~1000 (estimated)

The Load/Store Queue

Memory operations require special handling: loads and stores cannot simply be reordered arbitrarily because memory has global visibility (unlike registers). The load/store queue (LSQ) handles memory-ordering constraints.

Store-to-Load Forwarding

If a store and a subsequent load address the same memory location, the store's data can be forwarded directly to the load without going through the cache hierarchy:

STORE [R1], R2   → enters Store Queue: addr=0x1000, data=42
LOAD  R3, [R1]   → Load Queue checks Store Queue for addr=0x1000
                   → found! Forward 42 directly to R3
                   → avoids cache access entirely (saves 4+ cycles)

Memory Disambiguation

The processor may not know load/store addresses until late in execution. It can either:

Wait: Stall the load until all prior store addresses are known. Safe but slow.
Speculate: Assume no alias, execute the load early. If a prior store later turns out to alias, trigger a memory ordering replay — squash and re-execute the dependent instructions.

Intel uses speculative memory disambiguation with replay detection. AMD Zen uses a similar approach. The memory ordering machine (MOB) tracks all in-flight memory operations.

Retirement (In-Order Commit)

Despite out-of-order execution, the architectural state (registers, memory, exception state) is updated in program order. This is the commit stage:

An instruction at the ROB head is checked: is it "done" (result available, no exception)?
If yes: write result to architectural register file (the "committed" state), free the old physical register, remove ROB entry.
If no (exception): flush all younger instructions from ROB, set PC to exception handler.

This in-order commit is what makes OoO execution transparent to software. The programmer's sequential model is maintained at the commit boundary.

How OoO Affects Debugging

A key consequence: when a debugger inserts a breakpoint and halts execution, it sees the committed architectural state — the state as if instructions had executed in program order up to the breakpoint. In-flight (speculatively executing, not-yet-committed) instructions are invisible.

This is generally desirable: the programmer's view is clean. But it means:

Timing information is unreliable in a debugger: Single-stepping hides pipeline behavior. A single "step" causes a full pipeline flush and refill.
Performance counters require hardware support: Tools like perf use hardware performance monitoring units (PMU) that directly observe microarchitectural events (ROB occupancy, RS stalls, etc.) — not visible in software.
Speculative side effects are not visible to the debugger: Cache state modified by speculative execution before an exception is not reflected in architectural state. This is precisely the Meltdown exploit vector.

Tomasulo Data Flow (Modern CPU)

                    ┌──────────────────────────────────────────┐
  Fetch → Decode    │              FRONT END                   │
                    └────────────────────┬─────────────────────┘
                                         │ uops, in program order
                    ┌────────────────────▼─────────────────────┐
                    │           REGISTER RENAME                  │
                    │  Rename Map Table: arch reg → phys reg    │
                    │  Free List: available physical regs       │
                    └────────────────────┬─────────────────────┘
                                         │ renamed uops + ROB entry
              ┌──────────────────────────┼──────────────────┐
              │          ┌───────────────▼────────────────┐  │
              │          │      REORDER BUFFER (ROB)       │  │
              │          │  [maintains program order]      │  │
              │          └───────────────┬────────────────┘  │
              │                          │                    │
              │          ┌───────────────▼────────────────┐  │
              │          │   RESERVATION STATIONS / UOP Q  │  │
              │          │   [wait for operands, out-of-   │  │
              │          │    order issue to exec units]    │  │
              │          └───┬───────┬───────┬────────────┘  │
              │              │       │       │                │
              │         ┌────▼──┐ ┌──▼───┐ ┌▼──────┐        │
              │         │ INT   │ │ FP   │ │ Load/ │        │
              │         │ ALU   │ │ Unit │ │ Store │        │
              │         └────┬──┘ └──┬───┘ └┬──────┘        │
              │              │       │       │  results      │
              │              └───────┴───────┘               │
              │                      │ (Common Data Bus)     │
              │  ┌───────────────────▼────────────────────┐  │
              │  │        PHYSICAL REGISTER FILE          │  │
              │  └───────────────────┬────────────────────┘  │
              │                      │ commit (in order)      │
              │  ┌───────────────────▼────────────────────┐  │
              └──►   ARCHITECTURAL REGISTER FILE          │  │
                 │  (committed, program-order view)       │  │
                 └────────────────────────────────────────┘

OoO and Security: Spectre/Meltdown

The OoO engine is the enabler of the most severe CPU vulnerabilities of the modern era.

Meltdown (CVE-2017-5754): During speculative execution in the OoO window, a load from a kernel address can be issued before the permission check completes. The load data is briefly visible in the physical register file and can affect cache state (which register/cache line gets loaded). Even after the permission-violation exception causes the ROB to be flushed, the cache state change persists. An attacker can measure cache timing to reconstruct the transiently loaded data.

The fix (KPTI — Kernel Page Table Isolation) removes kernel mappings from user-space page tables, preventing the speculative load from even having a valid address to load from. Cost: ~5-30% overhead on syscall-heavy workloads due to TLB flushes.

Spectre (CVE-2017-5753, CVE-2017-5715): The branch predictor misdirects speculative execution. The victim process speculatively executes code that accesses out-of-bounds data. Even though this path is squashed, cache state is affected. See 03-speculative-execution.md for full analysis.

The root cause in both cases: the OoO commit boundary does not prevent microarchitectural state changes from speculative instructions. The architectural state is protected, but the cache, TLB, branch predictor, and prefetch buffers are not rolled back on squash.

Performance Implications

Instruction Throughput Metrics

# Observe OoO engine utilization
perf stat -e \
  cycles,instructions,\
  uops_issued.any,\
  uops_retired.total,\
  rs_events.empty_cycles,\
  rob_misc_events.lbr_inserts \
  ./program

# Key ratios:
# uops_issued / cycles  → issue rate (ideal: match execution unit count)
# rs_events.empty_cycles / cycles → fraction of time RS is empty (front-end bound)

Identifying OoO Bottlenecks

# Intel Top-Down Methodology (TMA) using perf
perf stat -M TopdownL1 ./program

# Output shows percentage of slots wasted in:
# - Frontend Bound: instruction supply bottleneck
# - Backend Bound: execution unit or memory bottleneck
# - Bad Speculation: branch mispredictions flushing ROB
# - Retiring: actual useful work

A "Backend Bound — Memory Bound" result means the OoO engine is finding work to do but is running out of independent instructions (ROB full of memory-waiting instructions). Larger ROB or software prefetching helps.

Failure Modes

ROB full stall: If all ROB entries are occupied by instructions waiting on a long-latency operation (e.g., a TLB miss taking 100+ cycles), no new instructions can be dispatched. The front end stalls completely. Symptoms: very low IPC despite no apparent computational work.
Store buffer exhaustion: The store buffer (where stores wait until they can retire to cache) can fill up. This causes the entire pipeline to stall until stores can drain. Common in write-heavy memcpy-type workloads with cache pressure.
Physical register file exhaustion: If the rename unit cannot allocate a new physical register (free list empty), dispatch stalls. Indicates excessive in-flight instructions with many live values. Rare in practice with large physical register files.
Memory ordering violation: Speculative load executed too early gets the wrong value when a prior store to the same address commits later. The pipeline must replay the load and all dependent instructions. Intel Sandy Bridge had a known performance issue (the "4K aliasing" problem) where loads and stores 4096 bytes apart were incorrectly assumed to alias, causing spurious replays.
Exception handling latency: An exception (page fault, divide by zero) causes all instructions younger than the faulting instruction in the ROB to be flushed. Deep OoO windows mean many instructions are flushed. This is generally not a performance problem (exceptions are rare) but affects worst-case interrupt latency in real-time systems.

Modern Usage

Intel Skylake OoO Engine Summary

ROB: 224 entries
Reservation Station (Unified Scheduler): 97 entries
Load buffer: 72 entries, Store buffer: 56 entries
Physical register file: 180 integer, 168 FP
Execution ports: 8 (ports 0–7), 4 of which handle ALU ops

AMD Zen 4 OoO Engine Summary

ROB: 320 entries
Op queue (before scheduler): 70 entries
Physical register file: 192 integer, 192 FP
Execution ports: 6 integer, 4 FP/vector

Apple M1 OoO Engine Summary

ROB: 630 entries
Fetch/decode: 8-wide
Load/store: 4 load pipes, 2 store pipes
Physical registers: estimated ~600+ integer, ~400+ FP
SMT: none (no Hyperthreading — all resources dedicated to single thread)

The M1's design philosophy: maximize single-thread performance with massive OoO window rather than SMT. This works extremely well for latency-sensitive interactive workloads.

Future Directions

Larger ROB sizes: Apple appears to be on a trajectory of continually increasing the OoO window. Apple M3 Pro is estimated at ~700+ entries. Fundamental limits come from power consumption and transistor budget.
Decoupled front-end/back-end: More aggressive fetch-ahead with deeper instruction queues between the front end (fetch/decode) and the OoO backend, tolerating longer front-end disruptions.
Runahead execution: When the ROB fills (blocked on a cache miss), some designs propose running a "runahead" thread that continues speculatively to generate prefetch addresses, then discards all results. IBM and academic research have explored this.
Memory-level parallelism (MLP) optimization: Better hardware and compiler techniques to issue independent memory operations simultaneously rather than serially, maximizing the overlap of cache miss latency.

Exercises

ROB sizing exercise: A program has an L3 cache miss every 500 instructions, each miss taking 50 cycles. The CPU has a 4-wide issue machine. How large does the ROB need to be to fully hide the L3 latency (i.e., keep the execution units busy during the miss)? What assumptions does this depend on?
Register renaming trace: Trace the register renaming for the following sequence. Show the rename map table state after each instruction and identify which physical register conflicts are eliminated: asm ADD R1, R2, R3 MUL R1, R1, R4 ; WAW: R1 ADD R5, R1, R6 ; RAW: R1 MOV R2, R7 ; WAR: R2 SUB R8, R2, R9 ; RAW: R2 (should get new R2)
OoO profiling: Use perf record -e cpu/mem-loads,ldlat=30/ -p <pid> to capture load latency data on a running process. What does the distribution of load latencies tell you about the OoO engine's effectiveness at hiding memory latency?
Meltdown mitigation analysis: Explain at the microarchitectural level why KPTI (Kernel Page Table Isolation) prevents Meltdown. What specific point in the OoO pipeline does it intervene at, and why does it impose a performance cost specifically on syscall-heavy workloads?
Store buffer design: Design a simplified store buffer (as a data structure) that supports: (a) storing a pending store entry, (b) checking if a load address matches any pending store (for forwarding), (c) draining stores to cache in order. What is the time complexity of the forwarding check as a function of store buffer size?

References

Tomasulo, R. M. (1967). An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development, 11(1), 25–33.
Smith, M. D., Johnson, M., & Horowitz, M. A. (1989). Limits on Multiple Issue. ASPLOS III, 290–302.
Kessler, R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro, 19(2), 24–36.
Intel Corporation. (2021). Intel 64 and IA-32 Architectures Optimization Reference Manual. Section 2.
Fog, A. (2023). Microarchitecture of Intel, AMD, and VIA CPUs. https://www.agner.org/optimize/microarchitecture.pdf
Lempel, O. (2010). 2nd Generation Intel Core Processor Family. Hot Chips 22.
Fisker, J. (2021). Apple M1 Microarchitecture Research. Dougall Johnson (dougallj). https://dougallj.github.io/applecpu/
Horn, J. (2018). Project Zero: Spectre and Meltdown. Google Project Zero Blog.