Branch Prediction: Predictors, BTB, and Speculation Control

Prerequisites

CPU pipeline fundamentals (01-cpu-pipeline.md): pipeline stages, control hazards, branch penalty
Speculative execution (03-speculative-execution.md): BTB poisoning, Spectre V2
Basic probability: Markov chains help for predictor modeling
Assembly language: branch instruction encoding, conditional vs indirect branches

Technical Overview

Branch prediction is the mechanism by which a CPU guesses the outcome (and, for taken branches, the target) of a branch before the branch is resolved, allowing speculative execution to continue without stalling the pipeline. On a modern processor with a 15-20 stage pipeline, a branch's verdict (taken/not-taken, and target address) becomes available only when the branch executes, roughly 10-15 cycles after it was fetched. Without prediction, the CPU would stall for those cycles on every branch. Given that branches occur every 5-7 instructions in typical code, this would reduce IPC from ~4 to ~0.5 on a modern wide machine: an 8x throughput reduction.

Branch prediction is therefore not an optional optimization: it is a fundamental requirement for superscalar performance. Modern predictors achieve 97-99%+ accuracy on most workloads. The remaining 1-3% misprediction rate, at 15-20 cycles per misprediction and one branch every 5-7 instructions, still adds roughly 0.02-0.12 cycles to every instruction: a few percent of throughput on a machine sustaining about 1 IPC, and considerably more on a wide machine running near its peak IPC.

Historical Context

1960s: The CDC 6600 (Seymour Cray, 1964) had no dynamic branch predictor. Its scoreboard overlapped independent instructions, and a small instruction stack let short loops run without refetching from memory.

1981: James Smith published "A Study of Branch Prediction Strategies", the seminal paper on bimodal (2-bit saturating counter) prediction, demonstrating that simple hysteresis dramatically improves over 1-bit prediction.

1991: Yeh and Patt proposed the two-level adaptive predictor ("Two-Level Adaptive Training Branch Prediction", MICRO-24), laying the foundation for history-based predictors.

1992: Yeh and Patt published "Alternative Implementations of Two-Level Adaptive Branch Prediction", showing that per-address history + Pattern History Table (PHT) achieves >95% accuracy on SPEC benchmarks.

1993: McFarling describes combining predictors: tournament/hybrid predictors that use a selector to choose between two specialized predictors.

2006: André Seznec and Pierre Michaud propose the TAGE (TAgged GEometric history length) predictor, which remains the basis of state-of-the-art designs. Most modern high-performance CPUs are believed to use TAGE or TAGE-derived predictors.

2016: Perceptron-based branch predictors, first proposed by Jiménez and Lin (HPCA 2001), appear in shipping hardware (Samsung Exynos). Some AMD designs incorporate perceptron-inspired components.

Branch Misprediction Cost

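As a first approximation, the cost of a misprediction is pipeline_depth - 1 cycles, which is exact for a simple in-order pipeline that resolves branches in its last stage. On deeper out-of-order designs the penalty is set by the number of stages between fetch and branch resolution plus the redirect latency, so it can differ from the depth figure in either direction: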

Pipeline depth and misprediction penalty:

CPU                          Depth    Penalty (approx)
─────────────────────────────────────────────────────
MIPS R2000 (1985)            5 stages    4 cycles
Intel Pentium Pro (1995)     14 stages  10-14 cycles
Intel Pentium 4 (2000)       20 stages  17-20 cycles
Intel Pentium 4 Prescott     31 stages  25-30 cycles
Intel Core 2 (2006)          14 stages  12-14 cycles
Intel Skylake (2015)         14 stages  15-17 cycles (varies by branch type)
AMD Zen 2 (2019)             19 stages  14-18 cycles
Apple M1 (2020)              ~15 stages 14-16 cycles
ARM Cortex-X4 (2023)        ~13 stages  13-15 cycles

Note: "penalty" = cycles from misprediction detected to correct instruction
entering execute stage. Actual frontend penalty may be slightly higher
due to BTB update and redirect latency.

Throughput impact formula:

Cycles wasted per instruction = miss_rate × misprediction_penalty / instructions_per_branch
Effective IPC = 1 / (1 / base_IPC + cycles wasted per instruction)

Example: Skylake, 1% miss rate, 15-cycle penalty, branch every 6 instructions:
  Wasted = 0.01 × 15 / 6 = 0.025 cycles/instruction
  Effective IPC from a base of 4.0: 1 / (0.25 + 0.025) ≈ 3.64
  (~9% throughput loss from misprediction alone)

Pathological case: 10% miss rate, 20-cycle penalty, branch every 3 instructions:
  Wasted = 0.10 × 20 / 3 = 0.67 cycles/instruction
  Effective IPC: 1 / (0.25 + 0.67) ≈ 1.09 (~73% throughput loss!)
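
The throughput arithmetic can be sketched as a pair of helper functions. This is a minimal model, assuming mispredictions simply add stall cycles on top of a base CPI of 1 / base_ipc; the function names and the instr_per_branch parameter (average instructions per branch) are illustrative, not taken from any tool.

```c
#include <assert.h>

// Extra cycles per instruction caused by branch mispredictions.
// instr_per_branch is the average number of instructions per branch.
static double wasted_cpi(double miss_rate, double penalty_cycles,
                         double instr_per_branch) {
    assert(instr_per_branch > 0.0);
    return miss_rate * penalty_cycles / instr_per_branch;
}

// Effective IPC: add the wasted cycles to the base CPI (1 / base_ipc),
// then invert the sum to get instructions per cycle again.
static double effective_ipc(double base_ipc, double miss_rate,
                            double penalty_cycles, double instr_per_branch) {
    assert(base_ipc > 0.0);
    double cpi = 1.0 / base_ipc
               + wasted_cpi(miss_rate, penalty_cycles, instr_per_branch);
    return 1.0 / cpi;
}
```

Working in CPI rather than IPC matters here: stall cycles add to cycles per instruction, so the loss on a wide machine is larger than subtracting the wasted cycles from the IPC would suggest.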

Predictor Architectures

1. Bimodal Predictor (Smith, 1981)

The simplest effective predictor. One 2-bit saturating counter per branch PC (indexed by low bits of PC into a table).

   State machine for each 2-bit counter:

   Strongly     Weakly      Weakly      Strongly
   Not Taken    Not Taken    Taken        Taken
     (00) ──T──► (01) ──T──► (10) ──T──► (11)
          ◄──NT─      ◄──NT─      ◄──NT─

   Prediction: counter ≥ 2 → Taken, < 2 → Not Taken
   From a strong state, two consecutive wrong predictions are needed to change direction.
   → Hysteresis: a single anomalous outcome (e.g., a loop exit) does not flip the prediction.

Branch PC
─────────
[47:2]   PC bits (low bits used as table index)
         │
         ▼
 ┌──────────────────┐
 │  PHT (Pattern    │   2^k entries of 2-bit counters
 │  History Table)  │   k = 10 bits → 1024 entries → 2 Kbit (256 bytes)
 └────────┬─────────┘
          │
          ▼
    2-bit counter
    → Prediction: Taken/Not-Taken

Limitation: Two branches that map to the same PHT entry interfere with each other ("aliasing"). Programs with many static branches suffer significant aliasing in a 1K-entry table.
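
A bimodal table is small enough to simulate directly. The sketch below assumes a 1024-entry table indexed by PC bits [11:2] and counters that start at weakly not-taken; the type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BIMODAL_BITS 10u                       // k = 10 -> 1024 counters
#define BIMODAL_SIZE (1u << BIMODAL_BITS)

typedef struct {
    uint8_t ctr[BIMODAL_SIZE];                 // 2-bit counters, values 0..3
} bimodal_t;

// Start every counter at 1 (weakly not-taken).
static void bimodal_init(bimodal_t *p) {
    memset(p->ctr, 1, sizeof p->ctr);
}

// Drop the two low PC bits, then keep k bits as the table index.
static uint32_t bimodal_index(uint64_t pc) {
    return (uint32_t)((pc >> 2) & (BIMODAL_SIZE - 1u));
}

static int bimodal_predict(const bimodal_t *p, uint64_t pc) {
    return p->ctr[bimodal_index(pc)] >= 2;     // 1 = predict taken
}

// Saturating increment on taken, saturating decrement on not-taken.
static void bimodal_update(bimodal_t *p, uint64_t pc, int taken) {
    uint8_t *c = &p->ctr[bimodal_index(pc)];
    if (taken && *c < 3) (*c)++;
    else if (!taken && *c > 0) (*c)--;
    assert(*c <= 3);
}

// Replays a loop branch at `pc`: (trip - 1) taken outcomes followed by one
// not-taken exit, repeated `rounds` times. Returns the misprediction count.
static unsigned bimodal_loop_mispredicts(uint64_t pc, unsigned trip,
                                         unsigned rounds) {
    bimodal_t p;
    unsigned miss = 0;
    bimodal_init(&p);
    for (unsigned r = 0; r < rounds; r++) {
        for (unsigned i = 0; i < trip; i++) {
            int taken = (i + 1 < trip);
            miss += (bimodal_predict(&p, pc) != taken);
            bimodal_update(&p, pc, taken);
        }
    }
    return miss;
}
```

For a 7-iteration loop replayed 100 times this model mispredicts 101 times: twice while warming up, then once per loop exit. The index function also shows the aliasing problem: with this layout, PCs 0x1000 and 0x2000 share one counter.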

2. Two-Level Adaptive Predictor (Yeh & Patt, 1992)

Key insight: branches are not independent. A branch's outcome often depends on the outcomes of recent branches. Track history, use it as an index.

Local History (per-branch): Each branch has its own shift register (Branch History Register, BHR) recording its last N outcomes (T/NT). This BHR indexes into a local PHT for this branch.

    Branch PC → BHR[PC] (per-branch history register)
                 e.g., BHR = 010110 (last 6 outcomes)
                 │
                 ▼
           PHT[BHR] → 2-bit counter → prediction

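Example: A loop that runs 7 iterations produces the repeating outcome sequence TTTTTTN (6 taken, then 1 not-taken at loop exit). With a 6-bit BHR, the history 111111 (six consecutive taken outcomes) is always followed by the exit, so its PHT entry learns "Not Taken". Every history that still contains the previous exit (e.g., 011111) is followed by a taken outcome, so those entries learn "Taken". Once trained, the loop is predicted perfectly, including its exit.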
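
The loop example can be checked with a small simulation. This is a sketch under assumed parameters (6-bit histories, 256 history registers, one shared PHT of 2-bit counters initialised to weakly not-taken); none of the names come from a real design.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define HIST_BITS 6u
#define HIST_MASK ((1u << HIST_BITS) - 1u)
#define BHT_SIZE  256u                        // per-branch history registers

typedef struct {
    uint8_t bhr[BHT_SIZE];                    // last HIST_BITS outcomes per branch
    uint8_t pht[1u << HIST_BITS];             // 2-bit counters indexed by history
} local2l_t;

static void local2l_init(local2l_t *p) {
    memset(p->bhr, 0, sizeof p->bhr);
    memset(p->pht, 1, sizeof p->pht);         // weakly not-taken
}

// First level: select this branch's history register by PC.
static uint8_t *local2l_bhr(local2l_t *p, uint64_t pc) {
    return &p->bhr[(pc >> 2) & (BHT_SIZE - 1u)];
}

// Second level: the history pattern selects the counter.
static int local2l_predict(local2l_t *p, uint64_t pc) {
    return p->pht[*local2l_bhr(p, pc)] >= 2;
}

static void local2l_update(local2l_t *p, uint64_t pc, int taken) {
    uint8_t *h = local2l_bhr(p, pc);
    uint8_t *c = &p->pht[*h];
    assert(*h <= HIST_MASK);
    if (taken && *c < 3) (*c)++;
    else if (!taken && *c > 0) (*c)--;
    // Shift the newest outcome into bit 0 of the history.
    *h = (uint8_t)(((*h << 1) | (taken ? 1u : 0u)) & HIST_MASK);
}

// Replays a loop branch ((trip - 1) taken, then one not-taken) `rounds`
// times and returns the total number of mispredictions.
static unsigned local2l_loop_mispredicts(unsigned trip, unsigned rounds) {
    local2l_t p;
    unsigned miss = 0;
    local2l_init(&p);
    for (unsigned r = 0; r < rounds; r++) {
        for (unsigned i = 0; i < trip; i++) {
            int taken = (i + 1 < trip);
            miss += (local2l_predict(&p, 0x401000) != taken);
            local2l_update(&p, 0x401000, taken);
        }
    }
    return miss;
}
```

With a 7-iteration loop the model mispredicts 11 times in total, all during the first two rounds, and never again. An 8-iteration loop exceeds what 6 history bits can disambiguate (the history 111111 precedes both a taken outcome and the exit), so it keeps mispredicting at least once per round.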

Global History (GShare/GSelect): A single Global History Register (GHR) records the outcomes of the last N branches (any branch, regardless of PC). GHR is XORed with the branch PC to index the PHT — this "gsharing" spreads aliasing.

   Branch PC  ──XOR──► PHT index
   GHR ───────►
                        │
                        ▼
                  PHT[PC XOR GHR] → 2-bit counter → prediction

GShare excels at correlated branches (e.g., if (a && b) where first branch correlates with second).
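
A correlated pair of branches makes the benefit visible. The sketch below assumes a 1024-entry PHT, a configurable history length, and two hypothetical branch addresses chosen so that their table indices never collide; the pseudo-random outcomes come from a simple linear congruential generator.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PHT_BITS 10u
#define PHT_SIZE (1u << PHT_BITS)

typedef struct {
    uint8_t  pht[PHT_SIZE];   // 2-bit counters
    uint32_t ghr;             // global history, newest outcome in bit 0
    uint32_t hist_mask;       // keeps the low hist_bits of ghr
} gshare_t;

static void gshare_init(gshare_t *g, unsigned hist_bits) {
    assert(hist_bits <= PHT_BITS);
    memset(g->pht, 1, sizeof g->pht);         // weakly not-taken
    g->ghr = 0;
    g->hist_mask = (1u << hist_bits) - 1u;
}

// Index = PC bits XOR global history bits.
static uint32_t gshare_index(const gshare_t *g, uint64_t pc) {
    return ((uint32_t)(pc >> 2) ^ (g->ghr & g->hist_mask)) & (PHT_SIZE - 1u);
}

static int gshare_predict(const gshare_t *g, uint64_t pc) {
    return g->pht[gshare_index(g, pc)] >= 2;
}

static void gshare_update(gshare_t *g, uint64_t pc, int taken) {
    uint8_t *c = &g->pht[gshare_index(g, pc)];
    if (taken && *c < 3) (*c)++;
    else if (!taken && *c > 0) (*c)--;
    g->ghr = (g->ghr << 1) | (taken ? 1u : 0u);
}

// Branch A has pseudo-random outcomes; branch B always repeats A's outcome.
// Returns how often B is mispredicted over `rounds` executions of the pair.
static unsigned gshare_correlated_mispredicts(unsigned hist_bits,
                                              unsigned rounds) {
    const uint64_t pc_a = 0x0000, pc_b = 0x0800;  // indices differ in bit 9
    gshare_t g;
    uint32_t lcg = 12345u;
    unsigned miss_b = 0;
    gshare_init(&g, hist_bits);
    for (unsigned r = 0; r < rounds; r++) {
        lcg = lcg * 1664525u + 1013904223u;
        int a = (int)(lcg >> 31);             // top bit as the outcome of A
        gshare_update(&g, pc_a, a);           // A resolves first
        miss_b += (gshare_predict(&g, pc_b) != a);
        gshare_update(&g, pc_b, a);
    }
    return miss_b;
}
```

With 8 history bits, B's table entry is selected by a history whose newest bit is A's outcome, so each entry only ever sees one direction and B is mispredicted at most once per entry (no more than 128 times in this setup, regardless of run length). Passing hist_bits = 0 turns the same table into a bimodal predictor, and B should then be mispredicted about as often as a coin flip.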

3. Tournament / Hybrid Predictor (McFarling, 1993)

Combine a local history predictor and a global history predictor. A meta-predictor (another 2-bit counter table) selects which component predictor to trust:

   Branch PC
       │
       ├──────────────────► Local Predictor  → prediction_L
       │                                          │
       ├──────────────────► Global Predictor → prediction_G
       │                                          │
       └──────────────────► Meta-predictor   → select(L or G)?
                             (indexed by PC)       │
                                                   ▼
                                              Final Prediction

Alpha 21264 (1998) used this design: a local predictor with a 1024-entry BHT (10-bit history per entry), a global predictor with a 4096-entry counter table indexed by 12 bits of global history, and a 4096-entry meta-predictor. Achieved ~93-95% accuracy.

Intel Pentium 4 (Willamette) and the later Pentium M are reported to have used variants of this approach; Intel did not publish the details.
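
The meta-predictor itself is just another table of 2-bit counters. A minimal sketch, assuming a 4096-entry chooser indexed by PC where values 0..1 mean "trust local" and 2..3 mean "trust global" (the names and the initial bias toward local are illustrative choices):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHOOSER_SIZE 4096u

typedef struct {
    uint8_t choice[CHOOSER_SIZE];   // 0..1 = trust local, 2..3 = trust global
} chooser_t;

static void chooser_init(chooser_t *m) {
    memset(m->choice, 1, sizeof m->choice);   // weakly prefer local
}

static uint32_t chooser_index(uint64_t pc) {
    return (uint32_t)((pc >> 2) & (CHOOSER_SIZE - 1u));
}

// Pick the final prediction from the two component predictions.
static int chooser_select(const chooser_t *m, uint64_t pc,
                          int pred_local, int pred_global) {
    return m->choice[chooser_index(pc)] >= 2 ? pred_global : pred_local;
}

// Train only when the components disagree about correctness: move toward
// whichever one was right. If both were right or both wrong, do nothing.
static void chooser_update(chooser_t *m, uint64_t pc,
                           int pred_local, int pred_global, int taken) {
    uint8_t *c = &m->choice[chooser_index(pc)];
    int local_ok  = (pred_local == taken);
    int global_ok = (pred_global == taken);
    if (global_ok && !local_ok && *c < 3) (*c)++;
    else if (local_ok && !global_ok && *c > 0) (*c)--;
    assert(*c <= 3);
}
```

Updating only on disagreement keeps the chooser from drifting when both components behave the same, which is the usual case for easy branches.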

4. TAGE Predictor (Seznec & Michaud, 2006)

TAgged GEometric history length predictor. State of the art as of 2024.

Core idea: use multiple predictor tables with geometrically increasing history lengths. A branch is predicted by the component that has the longest matching history. This captures both short-term and long-term correlations.

                          Geometric history lengths:
   h0=0   h1=2   h2=4   h3=8   h4=16   h5=32   h6=64  (bits)
   ──────────────────────────────────────────────────────
   T0     T1     T2     T3     T4      T5      T6      (tables)

   Each entry in T[i>0]:
     ┌────────┬─────┬───────┐
     │  TAG   │ CTR │ USEFUL│
     │ (tag)  │(2b) │  (2b) │
     └────────┴─────┴───────┘
     TAG: partial PC hash XOR GHR[0..hi]
     CTR: 2-bit prediction counter
     USEFUL: 2-bit "usefulness" (prevent premature eviction)

   T0: base bimodal predictor (no tag, always hits)

   Prediction logic:
   1. Compute indices for all tables using hashed (PC XOR GHR[0..hi])
   2. Check each table for a tag match (hit)
   3. Use prediction from the LONGEST-history hitting table
   4. If only T0 hits: use T0 (default bimodal)

   ┌──────────────────────────────────────────────────────────────┐
   │                TAGE PREDICTOR                                 │
   │                                                               │
   │  GHR: [b63 b62 ... b1 b0]  (global history register)        │
   │                                                               │
   │  Branch PC ──────────────────────────────────────────────┐   │
   │                                                           │   │
   │  hash(PC,GHR[0:1])  → T1[idx] → hit? CTR → pred          │   │
   │  hash(PC,GHR[0:3])  → T2[idx] → hit? CTR → pred          │   │
   │  hash(PC,GHR[0:7])  → T3[idx] → hit? CTR → pred          │   │
   │  hash(PC,GHR[0:15]) → T4[idx] → hit? CTR → pred          │   │
   │  hash(PC,GHR[0:31]) → T5[idx] → hit? CTR → pred          │   │
   │  hash(PC,GHR[0:63]) → T6[idx] → hit? CTR → pred          │   │
   │  PC[k:0]            → T0[idx] → always  CTR → pred        │   │
   │                                                           │   │
   │  Final prediction: longest history with tag match ────────┘   │
   │  (provider component), altpred = second-longest hit           │
   └──────────────────────────────────────────────────────────────┘

TAGE achieves >99% accuracy on many SPEC CPU benchmarks. Variations:
- ITTAGE: TAGE for indirect branch target prediction
- LTAGE: TAGE + loop predictor (for loops with a constant iteration count)
- MTAGE: several TAGE components, each driven by a different form of branch history

AMD has stated that Zen 2 and later cores use a TAGE predictor, and Intel's modern microarchitectures are widely believed to use TAGE-like designs. Exact implementation details are proprietary.
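
The provider selection step can be sketched on its own, separate from hashing and allocation. The code below assumes the seven tables described above (T0 plus six tagged tables with history lengths 2 to 64) and takes the per-table lookup results as input; the struct and function names are hypothetical, and real TAGE implementations add entry allocation, usefulness counters, and rules for when to prefer altpred.

```c
#include <assert.h>

#define TAGE_TABLES 7   // T0 (bimodal) plus T1..T6 (tagged)

// Geometric history lengths: 0, 2, 4, 8, 16, 32, 64 bits.
static unsigned tage_history_length(unsigned i) {
    assert(i < TAGE_TABLES);
    return i == 0 ? 0u : 1u << i;
}

// Result of probing one table: did the tag match, and what does the
// entry's counter predict? For T0, hit is ignored (it always "hits").
typedef struct { int hit; int taken; } tage_probe_t;

typedef struct {
    int provider;   // index of the table that supplies the prediction
    int pred;       // prediction of the provider
    int altpred;    // prediction of the next-longest hitting table
} tage_choice_t;

// Scan from the longest history down: the first hit is the provider, the
// second hit is the alternate. T0 is the fallback for both.
static tage_choice_t tage_select(const tage_probe_t probe[TAGE_TABLES]) {
    tage_choice_t c = { 0, probe[0].taken, probe[0].taken };
    int found = 0;
    for (int i = TAGE_TABLES - 1; i >= 1; i--) {
        if (!probe[i].hit) continue;
        if (!found) {
            c.provider = i;
            c.pred = probe[i].taken;
            found = 1;
        } else {
            c.altpred = probe[i].taken;
            break;
        }
    }
    return c;
}
```

Keeping altpred around is what lets a real predictor fall back gracefully when the provider entry is newly allocated and not yet trustworthy.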

Branch Target Buffer (BTB)

The branch predictor tells us direction (taken/not-taken). The BTB tells us the target address of taken branches — needed for instruction fetch redirect.

   ┌─────────────────────────────────────────────────────────┐
   │                    BTB Structure                         │
   │                                                          │
   │  Indexed by branch PC (low bits)                        │
   │                                                          │
   │  Entry:                                                  │
   │  ┌─────────────┬──────────────┬──────────┬───────────┐  │
   │  │  Tag (PC)   │ Target Addr  │ Branch   │  Valid    │  │
   │  │ (partial)   │              │   Type   │           │  │
   │  └─────────────┴──────────────┴──────────┴───────────┘  │
   │                                                          │
   │  Branch types tracked:                                   │
   │  - Direct conditional (JE, JNE, JL, etc.)               │
   │  - Direct unconditional (JMP label)                      │
   │  - Indirect (JMP [RAX], CALL [table+RCX*8])             │
   │  - Near call (CALL target)                               │
   │  - Return (RET) — predicted by RSB, not BTB             │
   └─────────────────────────────────────────────────────────┘

BTB capacity: Modern Intel BTBs have roughly 4096-8192 entries or more (microarchitecture-dependent; the exact organization is not fully disclosed). AMD Zen 4 uses a two-level BTB: roughly 1.5K entries in the L1 BTB and roughly 7K in the L2 BTB.

BTB miss penalty: If the BTB doesn't have an entry for a branch (first time seen, or evicted), the CPU cannot redirect fetch until the branch is decoded — an additional delay beyond the prediction penalty.

Indirect branch prediction: For JMP [RAX] where RAX changes at runtime (e.g., virtual function dispatch, computed goto), a single stored target per branch is not enough: the same branch PC needs different targets in different contexts. Modern front ends therefore pair the BTB with an indirect branch predictor (ITTAGE or similar) that uses branch history to choose among targets.
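
The entry layout and lookup can be sketched as a direct-mapped table. This is an illustrative model only: the 4096-entry size, the 16-bit partial tag, and all names are assumptions, and real BTBs are set-associative with undisclosed hash functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BTB_INDEX_BITS 12u
#define BTB_ENTRIES    (1u << BTB_INDEX_BITS)
#define BTB_TAG_BITS   16u

typedef enum { BR_COND, BR_JUMP, BR_INDIRECT, BR_CALL, BR_RET } branch_type_t;

typedef struct {
    uint64_t      target;   // predicted target address
    uint16_t      tag;      // partial tag: PC bits above the index
    branch_type_t type;
    uint8_t       valid;
} btb_entry_t;

typedef struct { btb_entry_t e[BTB_ENTRIES]; } btb_t;

static uint32_t btb_index(uint64_t pc) {
    return (uint32_t)(pc & (BTB_ENTRIES - 1u));
}

static uint16_t btb_tag(uint64_t pc) {
    return (uint16_t)((pc >> BTB_INDEX_BITS) & ((1u << BTB_TAG_BITS) - 1u));
}

static void btb_clear(btb_t *b) { memset(b, 0, sizeof *b); }

// Install or overwrite the entry for this branch (direct-mapped: any
// branch with the same index evicts the previous occupant).
static void btb_insert(btb_t *b, uint64_t pc, uint64_t target,
                       branch_type_t type) {
    btb_entry_t *e = &b->e[btb_index(pc)];
    e->tag = btb_tag(pc);
    e->target = target;
    e->type = type;
    e->valid = 1;
}

// Returns 1 on a hit and stores the predicted target; returns 0 on a miss.
static int btb_lookup(const btb_t *b, uint64_t pc, uint64_t *target) {
    const btb_entry_t *e = &b->e[btb_index(pc)];
    assert(target != NULL);
    if (!e->valid || e->tag != btb_tag(pc)) return 0;
    *target = e->target;
    return 1;
}
```

Because the tag is partial, two PCs that differ only in bits above the tag (here, bit 28 and up) are indistinguishable to the lookup. That shortcut saves storage, and it is also the property that branch target injection attacks rely on.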

Return Address Stack (RAS)

RET instructions would seem impossible to predict — the return address is on the stack and changes every call site. The solution: a hardware Return Address Stack (RAS), a LIFO stack mirroring the call stack.

   ┌─────────────────────────────────────────────────────────┐
   │              Return Address Stack (RAS)                  │
   │                                                          │
   │  On CALL instruction:                                    │
   │    RAS.push(PC + sizeof(CALL))   ← predicted return addr│
   │                                                          │
   │  On RET instruction:                                     │
   │    predicted_target = RAS.pop()  ← use for speculation  │
   │                                                          │
   │  Hardware:                                               │
   │  ┌────┬────┬────┬────┬────┬────┬────┬────┐             │
   │  │RA7 │RA6 │RA5 │RA4 │RA3 │RA2 │RA1 │RA0 │  ← TOP     │
   │  └────┴────┴────┴────┴────┴────┴────┴────┘             │
   │  Size: Intel = 16 entries, AMD Zen = 32 entries,        │
   │        Apple M1 = 64 entries (estimated)                │
   │                                                          │
   │  CALL/RET pairs that stay within RAS depth: ~100%       │
   │  prediction accuracy for RET                            │
   └─────────────────────────────────────────────────────────┘

RAS overflow and underflow: If call depth exceeds RAS size (e.g., deeply recursive functions), the RAS overflows and the oldest entries are overwritten. When the stack later unwinds past the entries that survived (16 for Intel), the RAS underflows: the remaining RETs miss the RAS and fall back to BTB prediction.

Security relevance: The RSB (Intel's name for the RAS, "Return Stack Buffer") is per-thread and cannot be poisoned across SMT siblings. Retpoline builds on this: it replaces an indirect JMP (BTB prediction) with a CALL/RET sequence, so speculation follows the RSB entry pushed by that CALL (a harmless capture loop) instead of a possibly poisoned BTB entry.

BTB and RAS Diagram

  ┌─────────────────────────────────────────────────────────────┐
  │              BRANCH PREDICTION UNIT                          │
  │                                                              │
  │   Fetch PC ──────────────────────────────────────────────┐  │
  │                │                                          │  │
  │                ▼                                          │  │
  │         ┌─────────────┐                                  │  │
  │         │     BTB     │ ──► target address (if hit)      │  │
  │         │  (4-8K ent) │                                  │  │
  │         └─────────────┘                                  │  │
  │                │ BTB hit?                                 │  │
  │                │                                          │  │
  │                ▼                                          │  │
  │         ┌─────────────┐                                  │  │
  │         │  TAGE       │ ──► Taken / Not Taken            │  │
  │         │  Direction  │                                   │  │
  │         │  Predictor  │                                   │  │
  │         └─────────────┘                                   │  │
  │                                                            │  │
  │         ┌─────────────┐                                   │  │
  │         │     RSB     │ ──► return address (for RET)     │  │
  │         │  (16-64 ent)│                                   │  │
  │         └─────────────┘                                   │  │
  │                │                                           │  │
  │                ▼                                           │  │
  │    ┌─────────────────────────────────────────────────┐    │  │
  │    │  MUX: select next fetch PC                       │    │  │
  │    │  [0] BTB target (if taken branch predicted)     │    │  │
  │    │  [1] RSB top (if RET predicted)                 │    │  │
  │    │  [2] PC+4/+2 (sequential, not-taken predicted)  │    │  │
  │    └────────────────────┬────────────────────────────┘    │  │
  │                         │                                  │  │
  │                         └──────────────────────────────────┘  │
  │                           ▼                                    │
  │                    Next Fetch Address → I-Cache               │
  └─────────────────────────────────────────────────────────────┘

Intel Loop Stream Detector (LSD)

Intel CPUs (Sandy Bridge through Ice Lake) include a Loop Stream Detector that detects small loops (≤64 uops) and locks them in the uop queue (IDQ), replaying the already-decoded uops without re-fetching from the I-cache or the Decoded ICache (DSB) on each iteration.

Loop Condition:
  - Loop body ≤ 64 uops (Skylake)
  - Loop has a single back edge
  - No branch misprediction in loop body
  - No page crossing in the loop instructions

Effect:
  - Loop iterations execute with zero front-end overhead
  - Branch predictor still used for loop exit prediction
  - Power savings: I-cache not accessed per iteration

  Normal execution:                 LSD active:
  I-cache → Decode → Uop Queue      Uop Queue ↔ Uop Queue
  (every iteration)                 (loop just recirculates)

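Note: Intel disabled the LSD on Skylake and Kaby Lake parts through a 2017 microcode update because of an erratum (SKL150) affecting short loops that use the AH/BH/CH/DH registers with Hyper-Threading enabled. The LSD is reported to be enabled again on later microarchitectures such as Ice Lake.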

Branch Prediction as an Attack Surface

Spectre V2: BTB Poisoning (CVE-2017-5715)

Detailed mechanism: The BTB is indexed by a subset of the branch PC. Two branches at different virtual addresses with the same index alias to the same BTB entry. An attacker in one process trains the BTB to predict a specific target, then triggers the victim's indirect branch (which aliases the same BTB index). The victim's branch speculatively jumps to the attacker's chosen target (a "gadget" in the victim's address space).

  Attacker Address Space:          Victim Address Space:
  ─────────────────────            ─────────────────────
  addr = 0x00401234                addr = 0xFF401234
       └── BTB index: bits[11:0]        └── BTB index: bits[11:0]
            = 0x234 (same!)              = 0x234 (same alias!)

  Attacker trains BTB[0x234] → target: gadget_addr (in victim space)
  Victim's JMP [RAX] at 0xFF401234 → BTB[0x234] predicts gadget_addr
  Victim speculatively executes gadget_addr → data leak via cache
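
The aliasing condition in the diagram reduces to a mask comparison. A minimal sketch, assuming a hypothetical BTB that indexes purely on the low address bits (real hardware hashes more bits, and the hash is not public):

```c
#include <assert.h>
#include <stdint.h>

// Low `index_bits` bits of a branch address, as used to index the BTB
// in this simplified model.
static uint64_t btb_index_bits(uint64_t addr, unsigned index_bits) {
    assert(index_bits > 0 && index_bits < 64);
    return addr & ((UINT64_C(1) << index_bits) - 1);
}

// Two branches alias when they select the same BTB index.
static int btb_aliases(uint64_t a, uint64_t b, unsigned index_bits) {
    return btb_index_bits(a, index_bits) == btb_index_bits(b, index_bits);
}
```

With 12 index bits, both example addresses (0x00401234 and 0xFF401234) map to index 0x234, so training one branch changes the prediction for the other.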

Mitigation effectiveness:
- Retpoline: forces RET (RSB) instead of BTB for indirect branches → BTB poisoning has no effect
- eIBRS: hardware prevents user-space BTB entries from affecting kernel-mode prediction
- IBPB: flushes entire BTB on context switch between security domains

Spectre RSB: Underflow (CVE-2018-15572)

When the RSB is exhausted (deep recursion or a context switch between processes with different RSB states), RETs fall back to BTB prediction. An attacker can pre-fill the BTB with poisoned entries for the expected return addresses. Linux mitigates via RSB filling on context switch: kernel fills the RSB with kernel addresses using a sequence of CALL instructions before returning to user space.

Microbenchmark: Measuring Branch Predictor Performance

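#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define N 10000000

// Branch pattern: taken on every iteration except the last → near 100% accuracy
static int always_taken(int i) {
    return (i < N - 1) ? 1 : 0;  // taken 9999999/10000000 times
}

// Branch pattern: alternating → defeats a bimodal counter (50% accuracy or
// worse), but any history-based (2-level) predictor learns it
static int alternating(int i) {
    return i & 1;
}

// Branch pattern: taken once every 32 iterations → tests pattern length
static int every32(int i) {
    return (i % 32 == 0) ? 1 : 0;
}

// Times N iterations of `if (branch_fn(i)) sum++`.
// Note: an optimizing compiler may inline branch_fn or turn the `if` into
// branchless code; check the disassembly (objdump -d) before trusting numbers.
static int benchmark(const char *name, int (*branch_fn)(int)) {
    int sum = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        if (branch_fn(i))
            sum++;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long long ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
    // Floating-point division: per-iteration times are often below 1 ns
    printf("%-13s ns/iter: %.3f, sum: %d\n", name, (double)ns / N, sum);
    return sum;
}

int main(void) {
    benchmark("always_taken", always_taken);
    benchmark("alternating", alternating);
    benchmark("every32", every32);
    return 0;
}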

Expected results on a modern x86 CPU (indicative only; they depend on compiler, flags, and clock speed):
- always_taken: ~0.3 ns/iter (predictor nearly perfect, ~1 cycle/iter at 3-4 GHz)
- alternating: ~0.5-0.8 ns/iter (predictor can learn this with 2-level)
- every32: ~0.3 ns/iter (modern TAGE predictor learns period-32 patterns easily)

Truly unpredictable branches (e.g., random data-dependent outcomes) produce ~1.5-2.5 ns/iter due to misprediction penalties.

# Measure mispredictions directly:
perf stat -e branch-misses,branches ./branch_bench

# For detailed prediction breakdown:
perf stat -e \
  cpu/event=0xc5,umask=0x00,name=BR_MISP_RETIRED.ALL_BRANCHES/ \
  ./branch_bench

Debugging Notes

Identifying Branch Prediction Bottlenecks

# Linux perf — branch mispredictions
perf stat -e cycles,instructions,branch-misses,branches ./program

# TMA (Top-Down Microarchitecture Analysis)
perf stat -M TopdownL1 ./program
# Look for "Bad Speculation" metric > 5% → branch prediction bottleneck

# Per-function branch miss breakdown
perf record -e branch-misses:pp ./program
perf report --sort=sym,overhead

Software-Level Optimization

// Branchless code (avoids branch predictor entirely)
// Instead of:
if (a > b) max = a; else max = b;

// Use:
max = a ^ ((a ^ b) & -(a < b));  // branchless max via bit manipulation
// Or with GCC:
max = (a > b) ? a : b;  // GCC often compiles to CMOV (conditional move)
                         // CMOV: no branch, no misprediction possible

CMOV (Conditional Move): x86 instruction that performs a register move without branching. The CPU executes it unconditionally; it either updates the destination or not based on flags. No branch predictor involvement, no misprediction penalty — but both paths are computed.
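
The three variants can be placed side by side and cross-checked. This is a sketch for 32-bit signed integers; the function names are illustrative, and whether the ternary form actually compiles to CMOV depends on the compiler, flags, and target, so inspect the generated assembly rather than assuming it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Reference version with an explicit conditional branch.
static int32_t max_branch(int32_t a, int32_t b) {
    if (a > b) return a;
    return b;
}

// Branchless version: -(a < b) is all ones when a < b, else zero.
// When the mask is all ones, a ^ (a ^ b) yields b; otherwise a is kept.
static int32_t max_mask(int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(a < b);
    return a ^ ((a ^ b) & mask);
}

// Ternary version: compilers often (not always) emit CMOV for this.
static int32_t max_select(int32_t a, int32_t b) {
    return (a > b) ? a : b;
}

// Compares the branchless variants against the reference for every
// ordered pair in v[0..n-1] and returns the number of disagreements.
static size_t max_disagreements(const int32_t *v, size_t n) {
    size_t bad = 0;
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            int32_t m = max_branch(v[i], v[j]);
            bad += (max_mask(v[i], v[j]) != m);
            bad += (max_select(v[i], v[j]) != m);
        }
    }
    return bad;
}
```

The mask form uses only XOR, AND, and negation of 0 or 1, so it has no overflow hazards even at INT32_MIN and INT32_MAX.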

Security Implications

BTB is shared across processes: On the same physical core, BTB entries survive context switches, and with SMT they may also be visible to the sibling thread. An IBPB issued on a switch between untrusted domains keeps earlier entries from steering later indirect branches.
Predictor state as a covert channel: Two processes can communicate via branch predictor state — one process trains specific branches, the other observes prediction outcomes. This is a covert channel even without Spectre exploitation.
PHT history as a fingerprint: The pattern of branch outcomes from a process can fingerprint its execution (which code paths were taken). This can leak information about secret-dependent control flow even without cache timing.
RSB manipulation for control flow hijacking: If an attacker can manipulate the actual stack (via stack overflow), they can misalign the RAS relative to the real return address stack, causing speculative misprediction — but this typically requires code execution already.

Performance Implications

Optimization                  Technique                             Typical Gain
─────────────────────────────────────────────────────────────────────────────────────────────
Eliminate branch              Use CMOV, branchless arithmetic       0-30% on branch-heavy loops
Sort before branch            `if (arr[i] > 50)` with sorted arr    2-10x on inner loops
Profile-guided optimization   PGO: compiler uses branch profiles    5-20% overall
__builtin_expect              Hint to compiler for branch layout    1-5%
Reduce branch aliasing        Align hot functions                   1-5%
Loop unrolling                Reduce loop back-branch count         1-20%

The most impactful optimization for branch-heavy code is often sorting or partitioning data to make branches more predictable.

Modern Usage

Profile-Guided Optimization (PGO)

Compilers use branch profile data (from instrumentation or sampling) to:
1. Arrange hot code so the common path is the "fall-through" path (sequential, no redirect)
2. Clone functions that are always called from contexts with predictable branch outcomes
3. Inline functions at call sites where it eliminates indirect branches
4. Unroll loops with known iteration counts

# GCC PGO workflow:
gcc -fprofile-generate -O2 program.c -o program_instr
./program_instr < training_input
gcc -fprofile-use -O2 program.c -o program_optimized

BOLT (Binary Optimization and Layout Tool)

Meta's BOLT performs function layout optimization at the binary level using sampled perf data, achieving 5-15% improvement on large server binaries (e.g., folly, MySQL, Nginx).

Future Directions

Perceptron branch predictors: Neural predictors that compute a weighted sum of history bits plus a bias and compare it against a threshold. Competitive with TAGE for certain workloads. Samsung has shipped them in Exynos cores, and AMD has publicly described perceptron-based prediction in its designs.
Prefetch-aware prediction: Branch predictors that issue prefetches when they predict a taken branch, reducing BTB miss penalty.
AI-offline training: Using ML to generate better PHT initial states or to identify branches that benefit from custom history lengths — offline, baked into the microcode.
Branch target landing pads (Intel CET IBT): Indirect Branch Tracking requires valid branch targets to be marked with ENDBR64 instructions, limiting the set of valid indirect branch destinations. This reduces the usable gadget set for Spectre V2.
Larger RSB: As code complexity grows, deeper call stacks require larger RSBs. Apple M1's estimated 64-entry RSB vs Intel's 16-entry RSB reflects this trend.
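
The perceptron idea from the first item (weighted sum of history bits plus bias, compared against a threshold) fits in a few lines. This is a sketch of a single perceptron with an 8-bit history, using the training rule and threshold formula from Jiménez and Lin's original proposal; a real predictor would hold a table of such perceptrons selected by branch PC, and the names here are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

#define P_HIST  8                  // history bits per perceptron
#define P_THETA 29                 // floor(1.93 * P_HIST + 14)

typedef struct {
    int w[P_HIST + 1];             // w[0] is the bias weight
    int x[P_HIST + 1];             // x[0] = 1; x[i] = +1 taken, -1 not taken
} perceptron_t;

static void perceptron_init(perceptron_t *p) {
    for (int i = 0; i <= P_HIST; i++) { p->w[i] = 0; p->x[i] = -1; }
    p->x[0] = 1;                   // constant input for the bias
}

// Output y: the dot product of weights and inputs. y >= 0 predicts taken.
static int perceptron_output(const perceptron_t *p) {
    int y = 0;
    assert(p->x[0] == 1);
    for (int i = 0; i <= P_HIST; i++) y += p->w[i] * p->x[i];
    return y;
}

// Train on a misprediction or when confidence |y| is at most the
// threshold, then shift the outcome into the history.
static void perceptron_train(perceptron_t *p, int y, int taken) {
    int t = taken ? 1 : -1;
    if ((y >= 0) != (taken != 0) || abs(y) <= P_THETA) {
        for (int i = 0; i <= P_HIST; i++) {
            p->w[i] += t * p->x[i];
            if (p->w[i] > 127)  p->w[i] = 127;    // saturate like 8-bit weights
            if (p->w[i] < -128) p->w[i] = -128;
        }
    }
    for (int i = P_HIST; i >= 2; i--) p->x[i] = p->x[i - 1];
    p->x[1] = t;
}

// Feeds an alternating T/N/T/N... branch and counts mispredictions that
// occur after the first `warmup` outcomes.
static unsigned perceptron_alternating_mispredicts(unsigned warmup,
                                                   unsigned measured) {
    perceptron_t p;
    unsigned miss = 0;
    perceptron_init(&p);
    for (unsigned n = 0; n < warmup + measured; n++) {
        int taken = (n % 2 == 0);
        int y = perceptron_output(&p);
        if (n >= warmup && (y >= 0) != taken) miss++;
        perceptron_train(&p, y, taken);
    }
    return miss;
}
```

An alternating branch is linearly separable (the outcome is the opposite of the most recent history bit), so the weights converge within a handful of outcomes and the pattern is then predicted without error. Patterns that are not linearly separable, such as XOR of two history bits, are the known weak spot of a single perceptron.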

Exercises

Predictor accuracy experiment: Write a C program with an inner loop containing a branch on arr[i] > 50, where arr holds random values in 0..99. Run it on the unsorted array and on a sorted copy. Measure branch miss rate with perf stat -e branch-misses. Explain the difference in ns/iteration, and check the disassembly to confirm the compiler kept a conditional branch.
BTB capacity measurement: Write a program with N indirect branches (function pointer calls) where N varies from 1 to 16384. For each N, measure branch mispredictions per call. Plot the curve. Identify the "knee" — the BTB capacity of your CPU.
TAGE history tracing: Choose a loop branch that is taken exactly once every K iterations where K ∈ {3, 5, 7, 11, 13}. Measure prediction accuracy for each K using perf stat. Which K values are predicted accurately? Relate to TAGE history lengths (powers of 2: 2, 4, 8, 16, 32, 64).
RSB underflow experiment: Write a recursive function with depth controlled by a parameter. At each depth, measure perf stat -e branch-misses for the RET instructions as depth increases from 4 to 64. Identify the RSB depth of your CPU from the miss rate inflection point.
Retpoline verification: Compile a program with and without -mindirect-branch=thunk. Compare the branch-misses count for indirect call sites. Use objdump -d to verify retpoline thunks are present. Measure the overhead in cycles per indirect call.

References

Smith, J. E. (1981). A study of branch prediction strategies. ISCA 1981, 135–148.
Yeh, T. Y., & Patt, Y. N. (1992). Alternative Implementations of Two-Level Adaptive Branch Prediction. ISCA 1992, 124–134.
McFarling, S. (1993). Combining Branch Predictors. DEC WRL Technical Note TN-36.
Seznec, A., & Michaud, P. (2006). A Case for (Partially) TAgged GEometric History Length Branch Prediction. JILP, 8, 23.
Kocher, P., et al. (2019). Spectre Attacks: Exploiting Speculative Execution. IEEE S&P 2019.
Fog, A. (2023). The Microarchitecture of Intel, AMD, and VIA CPUs. https://www.agner.org/optimize/microarchitecture.pdf — Chapter on Branch Prediction.
Intel Corporation. (2019). Retpoline: A Branch Target Injection Mitigation. https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/retpoline-a-branch-target-injection-mitigation.html
Seznec, A. (2011). A 64-Kbytes ITTAGE Indirect Branch Predictor. JWAC-2: Championship Branch Prediction.