Branch Prediction Deep Dive

Technical Overview

Branch prediction is among the most impactful microarchitectural features in modern CPUs. Because an OoO pipeline fetches and speculatively executes instructions 20+ cycles ahead of the retirement of any branch, the predictor must determine—at fetch time, before the branch instruction is even decoded—what the next instruction address will be. A 99% accurate predictor still mispredicts 1 in 100 branches, and at 4 branches per 16 instructions with 20-cycle penalty, that reduces IPC from a theoretical 4.0 to ~3.2. Modern CPUs implement hybrid TAGE-SC-L predictors, returning address stacks, indirect branch predictors, and loop detectors—all operating in parallel within a tight power budget.

Prerequisites

Understanding of OoO CPU pipeline (see 01-cpu-pipeline-deep-dive.md)
Familiarity with conditional and indirect branches in x86-64 assembly
Basic understanding of saturating counters and state machines
Familiarity with cache concepts (sets, tags, indexing)
Knowledge of function call conventions (return addresses, call stacks)

Core Content

Branch Types and Prediction Challenges

Conditional branches (Jcc: JE, JNE, JL, JGE, etc.): - Predict: taken vs not-taken (direction) - Predict: target address (known from decode — PC-relative, always same) - Challenge: direction depends on runtime data (flags register)

Unconditional direct branches (JMP rel): - Always taken; target is fixed (compile-time known) - Easy to predict; main challenge is detecting quickly during fetch

Indirect branches (JMP reg, JMP [mem]): - Target address is in a register or memory at runtime - Requires BTB (Branch Target Buffer) to predict target - Extremely hard for function pointer dispatch, vtable calls, switch statements

Call/Ret: - CALL: unconditional, target from decode (direct) or register (indirect) - RET: pops return address from stack — target is the return address - RAS (Return Address Stack) hardware predicts ret accurately

Bimodal Predictor (2-bit Saturating Counter)

The simplest useful predictor. For each branch PC, maintain a 2-bit saturating counter:

States:
  00 Strongly Not Taken
  01 Weakly Not Taken
  10 Weakly Taken
  11 Strongly Taken

Transitions:
  Branch taken:     00→01, 01→10, 10→11, 11→11
  Branch not taken: 11→10, 10→01, 01→00, 00→00

Prediction: if state ≥ 10 → predict Taken, else predict Not Taken

Implementation: An array of 2-bit counters indexed by low-order bits of branch PC. For 2^K entries, uses K bits of PC. Aliasing problem: two branches mapping to the same entry will interfere, corrupting each other's prediction.

Accuracy: ~85–90% on typical workloads. An alternating TNTNTN... branch achieves 50% accuracy (no better than random). Loops with constant trip count achieve 100% accuracy except on the last iteration.

Two-Level Predictors

Two-level prediction (Yeh & Patt 1991) uses the history of past branch outcomes to predict the current branch.

Local Predictor (Per-PC history):

Branch History Table (BHT): indexed by PC, stores last K outcomes (shift register)
  e.g., BHT[PC] = 01101 (last 5 outcomes: 0=NT, 1=T)

Pattern History Table (PHT): indexed by BHT entry
  PHT[01101] = 2-bit counter → prediction

Intuition: the same pattern of prior outcomes tends to predict the same outcome
Example: a loop that runs 8 iterations:
  Pattern 1111111 0 → not taken on 8th iteration
  History 1111110 → state before 8th iteration; PHT[1111110] → strongly not taken → correct!

Global Predictor (Single shared history register):

Global History Register (GHR): K-bit shift register, shared across all branches
  Every branch outcome shifts GHR left and inserts the outcome

PHT[GHR XOR PC_hash] = 2-bit counter → prediction

Intuition: captures inter-branch correlations
Example: if-else chains — prior branch outcomes predict subsequent branch outcomes

Hybrid (Tournament) Predictor: Maintain both local and global predictors plus a meta-predictor that selects which to trust per-branch. DEC Alpha 21264 (1998) famously used this. Intel P6 / Core used a similar tournament scheme.

TAGE Predictor (Tagged Geometric History Length)

TAGE (TAgged GEometric history length) is the state-of-the-art conditional branch predictor, developed by André Seznec (IRISA, 2006). It is used in Intel Skylake through Golden Cove, AMD Zen 3/4, and ARM Cortex-X1 and later.

TAGE structure:

T0: Base predictor (bimodal, no history, indexed by PC)
T1: Tagged component (H1 = 2-bit history, indexed by PC XOR GHR[0:1])
T2: Tagged component (H2 = 4-bit history, indexed by PC XOR GHR[0:3])
T3: Tagged component (H3 = 8-bit history, indexed by PC XOR GHR[0:7])
T4: Tagged component (H4 = 16-bit history)
T5: Tagged component (H5 = 32-bit history)
T6: Tagged component (H6 = 64-bit history)

Each Tᵢ (i≥1) stores a (tag, counter, useful-bit) triple per entry: - Tag: partial hash of PC + GHR history — distinguishes different branches mapping to same entry - Counter: 3-bit saturating counter (prediction + confidence) - Useful bit (u): indicates whether this entry has been recently providing useful predictions (for replacement)

TAGE prediction logic:

For a branch at PC with GHR state:
  1. Query all 7 components simultaneously
  2. Find the LONGEST history component Tᵢ that HITS (tag matches)
  3. Use Tᵢ's counter as the prediction
  4. T0 provides default if no Tᵢ hits

Why longer history wins: a longer history distinguishes more execution contexts
  → a branch conditioned on a value computed 60 branches ago is perfectly predicted
  by T6 if the 64-bit GHR captures that value's influence

TAGE update (training):

On branch resolution:
  1. Update counter of the provider component (the one that provided prediction)
  2. If prediction was wrong:
     - Allocate a new entry in a component with longer history than provider
     - Tag = hash(PC, longer GHR prefix)
     - Counter initialized to weakly taken/not-taken (correct direction)
     - Useful bit = 0 (can be replaced if needed)
     - Periodic u-bit reset: avoids entries from becoming permanently useful

TAGE-SC-L (2011): Extends TAGE with Statistical Corrector (SC) and Loop predictor (L): - SC: Captures biases missed by TAGE (e.g., extremely biased branches where the TAGE counter is often wrong due to aliasing). Uses logistic regression on PC + history features. - L: Detects loops with constant trip count. Stores (tag, trip_count_prediction, iteration_counter). Predicts "taken" for all iterations except the trip_count-th.

TAGE accuracy: 95–99% on most workloads. The primary remaining mispredictions come from truly unpredictable branches (data-dependent, random), indirect branch targets, and aliasing in the PHT.

BTB (Branch Target Buffer)

The BTB predicts the target address of a branch before the branch is decoded (at fetch time, cycles before decode would reveal the target).

BTB structure (simplified):
  Array of (tag, target_address, branch_type) entries
  Indexed by low bits of branch PC
  Tag = upper bits of branch PC (disambiguates collisions)

BTB hit: PC matches a BTB entry tag
  → Speculatively fetch from target_address
  → Avoids 1 cycle of bubble waiting for decode to reveal target

BTB miss: PC not in BTB
  → Fetch stalls until decode determines branch target
  → 2-4 cycle penalty

BTB size (Golden Cove): ~12K entries (multi-level BTB: L1=256, L2=2K, L3=10K+)

Indirect BTB: For indirect branches (JMP reg), the target changes at runtime. The indirect BTB (IBTB) maps PC → last observed target. History-augmented IBTB: stores multiple targets per PC indexed by history, capturing vtable dispatch patterns.

Prefetch BTB: Some CPUs maintain a larger "L3 BTB" (~12K entries, shared) that feeds into a faster "L1 BTB" (~256 entries). The L3 BTB is consulted asynchronously to prefetch target instructions into the instruction cache.

RAS (Return Address Stack)

RET instructions pop a return address from the hardware stack—predicting RET targets without memory access.

RAS operation:
  CALL instruction:
    Push predicted return address (CALL-following-instruction PC) onto RAS
    RAS depth: 16–32 entries (Intel: 24, AMD: 32)

  RET instruction:
    Pop from RAS
    Use popped value as predicted target

Accuracy: ~95%+ for typical call stacks
Failure cases:
  - Stack depth > RAS size (recursive functions)
  - Non-standard ABI patterns (setjmp/longjmp, C++ exceptions)
  - Return address overwrite (stack smashing — also a security issue)

Shadow call stack (security note): Some CPUs maintain a second, hardware-protected shadow stack (Intel CET — Control-flow Enforcement Technology) for RAS-based control-flow integrity. If RET address from memory ≠ shadow stack value → fault.

Intel Loop Stream Detector (LSD)

Small loops (≤25 µops in the loop body) are detected by the LSD and served from a small buffer inside the IDQ without re-fetching from L1-I cache or DSB.

LSD detection heuristics:
  1. A taken back-edge branch seen repeatedly
  2. Loop µop count ≤ 25 (Skylake/Golden Cove threshold)
  3. No calls inside loop

Once locked into LSD mode:
  → Frontend pipeline stages (fetch, pre-decode, decode, DSB lookup) power-gated
  → Reduces dynamic power consumption by 20–30% in tight loops
  → Saves 1–2 cycles of frontend latency per iteration

LSD bug in Skylake (2017, errata SKL150): A specific pattern of AVX256 µops in a loop caused the LSD to produce incorrect µop counts, leading to wrong predictions and pipeline corruption. Intel disabled LSD in Skylake microcode update. Regression: 0–5% depending on workload.

Branch Predictor Area and Power

The branch predictor consumes a surprisingly large fraction of chip area and power: - TAGE tables: ~100–200 KB of SRAM each (multiple tables) - BTB: ~12K × (tag + target + meta) bits ≈ 500 KB SRAM - RAS: 32 × 64 bits = 256 bytes - PHT update logic, GHR register: minimal area

Total predictor SRAM: ~1–2 MB (comparable to a few L2 cache slices).

On a modern chip running at 5 GHz, the predictor must produce a new prediction every cycle (or multiple predictions for superscalar fetch). This requires the SRAM to have sub-200ps read latency, achieved via custom SRAM design and aggressive voltage scaling.

Power budget: Estimated 3–5% of total CPU die power for the entire prediction unit on modern chips.

Misprediction Penalty

Misprediction penalty:
  = stages from fetch to branch resolution
  ≈ 18–20 cycles on Golden Cove / Zen 4

Example: at 4.0 GHz, 99% accuracy, 10 branches per 100 instructions:
  Mispredictions per 100 instructions = 10 × 0.01 = 0.1
  Cycles wasted = 0.1 × 20 = 2 cycles per 100 instructions
  CPI increase: +0.02
  IPC reduction: 4.0 → ~3.8 (5% penalty)

At 95% accuracy (pathological case):
  Mispredictions = 10 × 0.05 = 0.5
  Cycles wasted = 0.5 × 20 = 10 per 100 instructions
  IPC from 4.0 → ~2.0 (50% penalty!)

Historical Context

The first branch predictor in a commercial CPU was the IBM 360/91 (1967), which used the simplest possible strategy: predict always not-taken. Smith's two-bit counter predictor (1981) was published as a conference paper and became standard in RISC processors. Yeh and Patt published two-level adaptive prediction (1991). Pan, So, and Rahmeh published the tournament/hybrid predictor (1992). McFarling published gshare (1993) — XOR of PC and GHR indexes PHT, reducing aliasing. The TAGE predictor was presented by Seznec and Michaud at the Championship Branch Prediction (CBP) competition in 2006. CBP-1 through CBP-7 have driven the state of the art in predictor accuracy over two decades. Modern predictors are proprietary, but academic reconstructions suggest all three major CPU vendors (Intel, AMD, ARM) use TAGE or TAGE-SC-L variants.

Production Examples

Intel Skylake (2015): TAGE-based predictor, 1 cycle per prediction update. LSD present (later patched out). BTB: 256-entry L1, 2K-entry L2. Misprediction penalty: 15–17 cycles.

AMD Zen 4 (2022): TAGE-like predictor, 32-entry RAS, larger BTB than Zen 3. Misprediction penalty: ~20 cycles. Zen 4 added a larger loop predictor for high-trip-count loops.

Apple M2 (2022): Reported 8-cycle misprediction penalty (much lower than x86, due to shorter pipeline), enabling aggressive speculation with lower miss cost. Apple's predictor details are confidential.

High-frequency trading (HFT) implications: HFT code is typically warm in cache and L1 BTB. A market data handler loop processing 10M messages/second with 1 branch per message at 99.9% accuracy: 10K mispredictions/second × 5 ns each = 50 µs/second overhead. Branch prediction is explicitly optimized in HFT codebase design (branch-free coding, sorted data for predictable branches).

Debugging Notes

Measuring branch mispredictions:

perf stat -e branches,branch-misses -- ./program
# branch-miss-rate = branch-misses / branches
# Target: <0.5% for well-optimized code, <2% for typical code

Identifying hot mispredicted branches:

perf record -e branch-misses:ppp -- ./program
perf report
# Look for branches with high miss rate in the hottest functions

Profile-guided optimization (PGO): Compiler collects branch frequency data and reorders branches so the most-taken path is "fall-through" (no branch = no prediction needed). GCC -fprofile-generate/-fprofile-use, Clang -fprofile-instr-generate. Typically 10–20% performance improvement on branch-heavy code.

Conditional move (CMOV) vs branch: Replacing a predictable branch with a data-dependent CMOV (conditional move) can hurt performance (CMOV serializes; branch with good predictor is faster). Replacing an unpredictable branch with CMOV helps. Use __builtin_expect to hint to GCC about branch likelihood.

Security Implications

Spectre v1 (bounds check bypass): Conditional branch predictor speculatively bypasses an array bounds check:

if (x < array_size)     // if(x=malicious) → speculatively executes next line
  value = array1[x];    // speculative read of out-of-bounds memory
  use = array2[value * 512]; // cache side channel: loads array2 at secret-derived offset

Mitigation: LFENCE after the bounds check fence (serializes speculative execution). Compiler -mspeculative-load-hardening (Clang).

Spectre v2 (branch target injection): Attacker trains the BTB (indirect branch predictor) to predict a target of the attacker's choice. When the victim executes an indirect branch (e.g., a kernel syscall handler table entry), the predictor speculatively executes the attacker's chosen gadget in the victim's address space.

Retpoline: Software mitigation for Spectre v2. Replaces jmp [target] with a trampoline:

call load_target       ; push return address (loop_forever) onto call stack
loop_forever:
  pause
  lfence
  jmp loop_forever     ; predictor sees "always loop back" → predicts loop_forever
load_target:
  mov [rsp], target    ; overwrite return address with actual target
  ret                  ; ret uses speculative RAS (predicts loop_forever) but actually jumps to target

The indirect branch is replaced by a ret whose RAS prediction (loop_forever) is never used (the ret actually redirects to the real target). IBRS and eIBRS (Enhanced IBRS, Skylake) are hardware alternatives.

Phantom speculation (BranchScope 2018): Side channel on PHT entries via aliasing — an attacker branch can alias a victim branch's PHT entry, inferring the victim's branch direction. Mitigation: flush branch predictor state on context switch (expensive) or use microarchitectural partitioning.

Performance Implications

Branch-free code transformation: Instead of:

if (a > b) max = a; else max = b;

Use:

max = a > b ? a : b;  // Compiler often generates CMOV: no branch at all

Sorted inputs for binary searches: Binary search over a sorted array causes every branch outcome to be 50% T / 50% NT (random walk down the tree), achieving 50% accuracy. Profiling binary searches almost always shows high miss rate. Consider B-trees (8-way branches, each with 87.5% likely direction) or branch-free eytzinger layout.

Virtual function dispatch: Each virtual call is an indirect branch. At a well-bounded polymorphic call site (only 2 concrete types appear), the IBTB achieves near-100% accuracy. With 10+ types: ~30% accuracy. Devirtualization (LTO/PGO) eliminates the indirect branch for single-type call sites.

Failure Modes and Real Incidents

Incident: JVM JIT branch predictor poisoning (reported by Azul Systems, 2019): A pathological pattern of JIT-compiled code with many rarely-taken guards trained the predictor to "never taken." When a guard finally triggered, the misprediction cascade caused 2ms GC pause spikes instead of the expected <1ms. Resolved by inserting explicit prediction hints (JDK 17 @Stable annotation + branch hint intrinsics).

Incident: Linux kernel performance regression from IBRS (2018): After Spectre v2 disclosure, IBRS (Indirect Branch Restricted Speculation) was enabled in the kernel. IBRS forces the predictor to not use history trained in user-mode for kernel-mode execution. On Intel Skylake, IBRS required MSR writes on every kernel entry/exit, adding ~100ns overhead. Syscall-heavy workloads (nginx, databases) saw 10–30% throughput regression. eIBRS (in Ice Lake+) does this in hardware at ~1% overhead.

Incident: Spectre v2 in Firefox JavaScript engine (2018): JIT-compiled JavaScript could train the BTB to jump to attacker-chosen gadgets in the browser process. Allowed reading arbitrary memory from the browser process (including other tabs' data). Mitigation: Spectre countermeasures in SpiderMonkey JIT (reduced timer precision, site isolation, JIT retpoline).

Modern Usage

LBR (Last Branch Record): Intel hardware feature recording last 32 branch PC+target pairs in MSRs. Essential for performance profiling (perf record -b), JIT compiler optimization, and stack unwinding without frame pointers.

BRBE (Branch Record Buffer Extension, ARMv9.2): ARM's equivalent of LBR, standardized in the ISA for more consistent profiling.

BOLT (Binary Optimization and Layout Tool, Meta 2021): Post-link optimizer that uses LBR profiles to reorder code for better BTB utilization and instruction cache density. Reported 5–15% throughput improvement for large applications (Meta's production services, LLVM itself).

Future Directions

Neural branch predictors: Research showing LSTM-based predictors achieving >99.5% accuracy; too expensive for hardware implementation currently but inspiring hybrid approaches
Deep learning JIT branch prediction hints: Compiler inserts hardware hint instructions (x86 has no official hint; proposals exist for future ISA) based on ML-trained models of branch behavior
Persistent predictor state across context switches: Current CPUs flush predictor on context switch (security). Future trusted-execution environments (TDX, SEV-SNP) could maintain per-VM predictor state for better guest performance
CBP-8 (Championship Branch Prediction 8, 2025): Academic competition driving the next generation of predictor accuracy — current leaders achieve >99.9% on SPEC INT

Exercises

2-bit predictor simulation: Implement a bimodal 2-bit predictor with 2^12 = 4096 entries. Test on: (a) a loop from 0 to 100 (trip count 100), (b) alternating T/NT sequence, (c) a sequence derived from a random source (MT19937). Report accuracy and analyze aliasing effects.
TAGE predictor implementation: Implement a simplified 4-component TAGE predictor (T0=bimodal, T1=H2, T2=H8, T3=H32) with 2^10 entries each. Use a 32-bit global history register. Evaluate on a PIN-instrumented trace of a branch-intensive program (e.g., compiled bzip2). Compare accuracy to bimodal baseline.
BTB miss measurement: Write a C program with a function pointer array of size N (varied from 2 to 1024), where each iteration calls a different function in random order. Measure MPKI (misses per thousand instructions) vs N using perf stat. Plot the BTB saturation point.
RAS depth experiment: Write a recursive function that recurses to depth D (varied from 1 to 64). Measure misprediction rate of the RET instruction vs depth. Identify the RAS depth of your CPU from the data.
Spectre v1 demonstration (sandboxed): In a controlled environment, implement the classic Spectre v1 gadget in C. Use cache flush+reload to measure which array index was accessed. Vary the LFENCE placement and observe how it prevents the side channel. Measure performance overhead of the mitigation.

References

André Seznec, Pierre Michaud, "A Case for (Partially) TAgged GEometric History Length Branch Prediction," JILP 2006
Yeh & Patt, "Two-Level Adaptive Training Branch Prediction," Micro 1991
Smith, "A Study of Branch Prediction Strategies," ISCA 1981
Intel 64 and IA-32 Architectures Optimization Reference Manual, Chapter 22 (Branch Prediction)
Championship Branch Prediction: https://jilp.org/cbp/
Kocher et al., "Spectre Attacks," IEEE S&P 2019
Turner et al., "Retpoline: A Software Construct for Preventing Branch-Target-Injection," Google 2018
AMD Branch Prediction Technical Brief (Zen 3): https://developer.amd.com/resources/zen-architecture/
Evtyushkin et al., "BranchScope: A New Side-Channel Attack on Directional Branch Predictor," ASPLOS 2018