Speculative Execution: Hiding Latency, Exposing Secrets

Prerequisites

Out-of-order execution fundamentals (02-out-of-order-execution.md): ROB, register renaming, retirement
CPU cache hierarchy basics (05-cache-hierarchy.md): cache lines, sets, eviction
Branch prediction concepts (04-branch-prediction.md): BTB, PHT, predictor state
Virtual memory and privilege levels: kernel/user separation, page tables
x86-64 internals (09-x86-64-internals.md): CR3, SMEP, syscall mechanism

Technical Overview

Speculative execution is the mechanism by which a processor executes instructions before it is certain those instructions should be executed, in order to hide latency and keep the execution pipeline full. If the speculation was correct, results are committed. If incorrect, the speculatively executed instructions are squashed — their architectural effects (register writes, memory writes) are undone — and execution resumes from the correct path.

The property that was believed to make speculation safe is that architectural state is rolled back on a misprediction or exception. Until January 2018, it was widely assumed that this was sufficient. The Spectre and Meltdown disclosures proved it was not: microarchitectural state (cache occupancy, TLB state, branch predictor tables, port timing) is not rolled back, and an attacker with a timing primitive can measure it.

This section covers: the types of speculation, the specific mechanisms of the two paradigm-defining vulnerabilities (Spectre variants 1/2 and Meltdown), the hardware and software mitigations deployed, their performance costs, and the ongoing stream of related vulnerabilities.

Historical Context

Pre-2018: Speculative execution has been present in commercial CPUs since the IBM 360/91 (1967) and was widely deployed from the mid-1990s (Intel Pentium Pro, 1995). The security implications of microarchitectural side channels were known academically, with cache-timing papers dating to at least 2005 (Percival's "Cache Missing for Fun and Profit"), but exploiting speculative execution across privilege boundaries had not been demonstrated publicly.

June 2017: Jann Horn of Google Project Zero reports the Spectre and Meltdown vulnerabilities to Intel, AMD, and ARM under coordinated disclosure. Academic teams independently discover the same issues during the embargo period.

January 3, 2018: Public disclosure. CVE-2017-5753 (Spectre Variant 1), CVE-2017-5715 (Spectre Variant 2), CVE-2017-5754 (Meltdown / Spectre Variant 3). The security and operating system communities scramble to deploy mitigations for hundreds of millions of deployed systems.

2018–2024: A continuing stream of related vulnerabilities: Spectre-RSB, Spectre-v3a (Rogue System Register Read), Spectre-v4 (Speculative Store Bypass), L1TF (L1 Terminal Fault / Foreshadow), MDS (Microarchitectural Data Sampling including RIDL, Fallout, ZombieLoad), TAA (TSX Asynchronous Abort), LVI (Load Value Injection), SRBDS (Special Register Buffer Data Sampling). Each exploits a different transient execution path.

Types of Speculative Execution

Branch Speculation (Most Common)

The CPU predicts the direction (taken/not-taken) and target of a branch and begins fetching and executing along the predicted path. The prediction is resolved when the branch instruction completes in the EX stage.

  BEQ R1, R2, target          ; branch instruction
  ───────────────
  Predictor: "TAKEN → jump to target"
  CPU fetches instructions at [target] speculatively

  [speculation window: several cycles of speculative execution]

  EX stage: R1 == R2? YES → correct prediction → commit
            R1 == R2? NO  → misprediction → flush pipeline

Load Speculation (Memory Disambiguation)

The CPU issues a load before knowing whether a prior store aliases the same address. If it turns out the store did alias, the load is re-issued with the correct (forwarded) value, and dependent instructions are replayed.

Speculative Store Bypass (Variant 4, CVE-2018-3639)

The CPU speculates that a store and a subsequent load do NOT alias and executes the load with a stale value from cache. When the store's address later resolves and reveals the alias, the load is replayed, but the stale value may already have been used by transient instructions to influence cache state.

Return Stack Buffer (RSB) Speculation

The CPU uses a hardware Return Address Stack to predict the target of RET instructions. If the RSB underflows or is poisoned, RET may speculate to a stale or attacker-influenced address. On some CPUs an underflowing RSB falls back to the indirect branch predictor.

Spectre: Mechanism and Attack Flow

Spectre Variant 1 (CVE-2017-5753): Bounds Check Bypass

Core mechanism: Mistrain the conditional branch predictor, then speculatively execute past a bounds check to access out-of-bounds memory. Use a cache timing channel to observe what was accessed.

Canonical victim code pattern:

// Victim code in kernel or a victim process
#include <stddef.h>
#include <stdint.h>

uint8_t array1[8] = {1, 2, 3, 4, 5, 6, 7, 8};
uint64_t array1_size = 8;
uint8_t array2[256 * 512];  // 128 KB: one probe slot per byte value, 512 bytes apart
volatile uint8_t sink;      // keeps the compiler from deleting the probe load

void victim_function(size_t x) {
    if (x < array1_size) {          // bounds check ← BRANCH
        uint8_t val = array1[x];    // access with attacker-controlled x
        // the "gadget": pull array2[val * 512] into the cache
        sink = array2[val * 512];   // cache state now encodes val
    }
}

Attack sequence:

STEP 1: MISTRAIN the branch predictor
   Call victim_function(0), (1), (2), (3), (4), (5)
   with in-bounds values many times.
   Branch predictor learns: "this branch is usually TAKEN (in-bounds)"

STEP 2: FLUSH the probe array and EVICT array1_size
   Use clflush (or eviction sets) to remove every array2[i * 512] line and
   array1_size from the whole cache hierarchy.
   Now the bounds check load will be slow (cache miss, ~200 cycles), and
   any array2 line found in cache later was loaded by the victim.

STEP 3: CALL victim_function(malicious_x)
   where malicious_x is out-of-bounds, chosen so that array1 + malicious_x
   is the address of the secret byte

   Timeline (cycle counts are illustrative):
   cycle 1:  branch begins evaluating (x < array1_size)
   cycles 2-200: array1_size is fetched from DRAM (cache miss)
   cycles 2-10: speculative execution begins (predictor said TAKEN)
                → speculatively reads array1[malicious_x]
                   (reads the secret byte: val = secret_byte)
                → speculatively reads array2[val * 512]
                   (loads cache line for array2[secret_byte * 512])
   cycle 200: array1_size arrives: x >= array1_size → MISPREDICT
             → pipeline squashed, architectural effects discarded

STEP 4: OBSERVE cache state (Flush+Reload)
   For i = 0..255:
       t_start = rdtsc
       access array2[i * 512]
       t_end = rdtsc
       if (t_end - t_start) < cache_hit_threshold:
           secret_byte = i    ← only this one is in cache
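The decision logic in Step 4 can be sketched separately from the timing itself. The block below is a minimal sketch, assuming the 256 latencies have already been measured (for example with rdtscp around each access). The names probe_order and recover_byte are hypothetical, and the constants 167 and 13 follow the visiting order used in the published Spectre proof of concept to avoid a linear scan that a stride prefetcher could follow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PROBE_SLOTS 256

/* Order in which the probe slots are timed. Multiplying by an odd constant
 * modulo 256 is a bijection, so every slot is visited exactly once, but not
 * in a linear order. */
static inline unsigned probe_order(unsigned i) {
    return (i * 167u + 13u) & (PROBE_SLOTS - 1);
}

/* Given one measured latency per slot (in cycles), return the slot that
 * looks cached, or -1 if no slot is below the threshold. If several slots
 * are below the threshold, the fastest one wins. */
int recover_byte(const uint64_t latency[PROBE_SLOTS], uint64_t hit_threshold) {
    assert(latency != NULL);
    int best = -1;
    uint64_t best_latency = hit_threshold;
    for (unsigned i = 0; i < PROBE_SLOTS; i++) {
        unsigned slot = probe_order(i);
        if (latency[slot] < best_latency) {
            best_latency = latency[slot];
            best = (int)slot;
        }
    }
    return best;
}
```

In a real measurement loop the same probe_order(i) value would also select which array2 slot to time next. A single run is noisy, so practical attacks repeat the whole sequence and take a majority vote per byte.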

Spectre Variant 1 Attack Flow Diagram:

  Attacker Process                    Victim Process / Kernel
  ─────────────────                   ────────────────────────
  [1] Mistrain PHT for
      victim branch address
            │
            │  (kernel syscall boundary)
            │
  [2] Trigger victim function
      with malicious x
            │
            │                         [3] Branch: x < array1_size?
            │                              └─ array1_size: CACHE MISS
            │                              └─ Speculative: assume TAKEN
            │                                   ↓
            │                         [4] Speculative read: array1[x]
            │                              ← x is attacker-controlled OOB
            │                              ← reads secret_byte
            │                                   ↓
            │                         [5] Speculative read: array2[secret*512]
            │                              ← loads cache line into SHARED CACHE
            │                                   ↓
            │                         [6] array1_size arrives: MISPREDICT
            │                              ← squash! architectural state clean
            │                              ← BUT cache line [secret*512] remains
            │
  [7] Flush+Reload array2[]
      Measure timing for each index
      Cache hit at index K → secret_byte = K
            │
            ▼
  secret_byte RECOVERED

Key insight: The architectural state (register values) is rolled back after the misprediction. The microarchitectural state (which cache line is hot) is not. This persistent side effect is the information channel.
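The asymmetry between the two kinds of state can be made concrete with a toy software model. This is only an illustration, not a description of real hardware: struct toy_cpu, toy_memory, and toy_victim are hypothetical names, the "cache" is a plain boolean array, and the predictor is hard-wired to "taken".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 256

struct toy_cpu {
    uint8_t reg;            /* architectural register: rolled back on squash */
    bool    cached[SLOTS];  /* probe slots present in cache: never rolled back */
};

/* Memory as the victim sees it: 8 in-bounds bytes followed by a "secret". */
static const uint8_t toy_memory[16] =
    {1, 2, 3, 4, 5, 6, 7, 8, 'S', 'E', 'C', 'R', 'E', 'T', 0, 0};
static const size_t toy_array1_size = 8;

/* Runs the victim with the predictor forced to "taken".
 * Returns true if the work was committed, false if it was squashed. */
bool toy_victim(struct toy_cpu *cpu, size_t x) {
    assert(cpu != NULL && x < sizeof toy_memory);
    uint8_t saved_reg = cpu->reg;      /* checkpoint of architectural state */

    /* Transient execution: happens before the bounds check resolves. */
    cpu->reg = toy_memory[x];
    cpu->cached[cpu->reg] = true;      /* dependent probe load fills a line */

    if (x < toy_array1_size)
        return true;                   /* prediction was right: commit */

    cpu->reg = saved_reg;              /* squash: the register write is undone */
    return false;                      /* cpu->cached is left untouched */
}
```

After toy_victim(&cpu, 8) the register holds its old value, yet cpu.cached['S'] is set. That one surviving bit is what the Flush+Reload step reads out.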

Meltdown: Kernel Memory Leakage

CVE-2017-5754: Rogue Data Cache Load

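Meltdown exploits the fact that on some CPUs, a user-mode load from a kernel address forwards its data to dependent instructions before the permission fault is raised at retirement. Those dependent instructions run transiently and can leave a cache footprint that encodes the loaded value.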

Affected hardware: Most Intel CPUs released before the hardware fixes of 2018-2019 (for example, server parts before Cascade Lake) and some ARM cores (notably Cortex-A75). AMD CPUs are largely unaffected (AMD's design raises the fault earlier in the pipeline, before the load data is forwarded).

Attack pattern:

// User-space attack code
// attempt to read kernel address 0xFFFFFFFF80000000 (Linux kernel text)
uint8_t *kernel_addr = (uint8_t *)0xFFFFFFFF80000000UL;

// The core sequence:
// 1. Load the kernel byte (it will fault, but only after dependents have run transiently)
// 2. Use it as index into flush+reload array

uint8_t probe_array[256 * 512];

// On CPU without Meltdown fix:
asm volatile(
    "xor %%eax, %%eax           \n"  // zero rax so only the loaded byte is shifted
    "movb (%1), %%al            \n"  // transient load from kernel addr
                                     // fault pending, but not yet raised
    "shl $9, %%rax              \n"  // multiply by 512: one probe slot per value
    "movb (%0,%%rax), %%bl      \n"  // access probe_array[val * 512]
                                     // loads CACHE LINE encoding secret byte
    :
    : "r"(probe_array), "r"(kernel_addr)
    : "rax", "rbx", "cc", "memory"
);
// exception raised here — signal handler catches it
// but probe_array[secret * 512] is now in cache

Meltdown timeline at microarchitectural level:

cycle 1-3:   Load instruction dispatched speculatively
cycle 4:     MMU begins page walk / TLB check for kernel address
cycle 4-8:   Speculative execution continues downstream instructions
             (OoO engine issues the probe_array load with the secret value)
             SECRET BYTE IS IN PHYSICAL REGISTER FILE
             CACHE LINE PROBE_ARRAY[secret*512] IS LOADED
cycle 9:     Permission check completes: user ≠ kernel → #PF (page fault) raised at retirement
             ROB squashed: registers restored, but...
             ...CACHE LINE STAYS LOADED
cycle 10+:   Exception handler invoked
             Attacker's signal handler catches SIGSEGV, continues
             Flush+Reload recovers secret_byte from cache timing

KPTI (Kernel Page Table Isolation) Mitigation

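KPTI removes almost all kernel mappings from the user-space page tables, leaving only the minimal entry and exit code needed to switch to the full kernel page tables: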

BEFORE KPTI:                    AFTER KPTI:
User CR3 page table:            User CR3 page table:
  user pages: mapped              user pages: mapped
  kernel pages: mapped            kernel pages: NOT PRESENT
                                     └─ speculative load faults immediately
                                        at address translation stage,
                                        before data reaches physical register

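Cost: every syscall and interrupt requires a CR3 switch on entry and another on exit, and without PCID each switch flushes the TLB. Linux uses PCID (Process Context Identifiers) to tag TLB entries and avoid the full flush, but the CR3 write itself is still a serializing operation that typically costs on the order of hundreds of cycles rather than tens. On syscall-heavy workloads (Redis, PostgreSQL with many small queries), reported throughput reductions are in the 5-20% range.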
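A back-of-the-envelope estimate shows why the syscall rate dominates this cost. The function below is a rough sketch with a hypothetical name. It counts only the two extra CR3 writes per syscall and ignores the additional TLB misses that follow, so it is a lower bound, and the per-write cycle cost is an input to be measured rather than a known constant.

```c
#include <assert.h>

/* Fraction of one core's cycles spent on the extra CR3 writes alone.
 * Each syscall crosses the user/kernel boundary twice (entry and exit). */
double kpti_direct_overhead(double syscalls_per_sec,
                            double cycles_per_cr3_write,
                            double cpu_hz) {
    assert(cpu_hz > 0.0);
    return syscalls_per_sec * 2.0 * cycles_per_cr3_write / cpu_hz;
}
```

For example, kpti_direct_overhead(1e6, 200.0, 3e9) evaluates to about 0.133: one million syscalls per second at an assumed 200 cycles per write would consume roughly 13% of a 3 GHz core. The 200-cycle figure is an illustrative input, not a measurement.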

Spectre Variant 2: Branch Target Injection (CVE-2017-5715)

Variant 2 targets indirect branches — branches where the target address is in a register (e.g., JMP RAX, CALL [table + RCX*8]). The BTB (Branch Target Buffer) maps PC → predicted target.

Attack: Attacker poisons the BTB entry for a victim's indirect branch (by training the BTB from attacker's address space at the same aliased BTB index) to point to an attacker-chosen "gadget" in the victim's code. When the victim executes the indirect branch, it speculatively jumps to the gadget.

ATTACKER training BTB:
  [attacker PC aliasing victim's indirect branch PC]
  → trains BTB to predict target = gadget_addr (in victim code)

VICTIM executes indirect branch:
  BTB lookup: poisoned entry → speculate to gadget_addr
  [speculative execution of gadget reveals secret via cache]
  [actual branch target computed: different address → misprediction + squash]
  [but cache side channel is already set]
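The aliasing that makes cross-address-space training possible can be shown with a toy predictor table. This is a deliberately simplified sketch: the table size, the index function (low address bits only, no tag), and the names toy_btb, btb_train, and btb_predict are all assumptions for illustration. Real BTB index and tag functions are undocumented and considerably more complex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BTB_ENTRIES 512u   /* hypothetical size, power of two */

struct toy_btb {
    uint64_t target[BTB_ENTRIES];  /* predicted target; 0 means no prediction */
};

/* Hypothetical index function: low bits of the branch address only. */
static inline unsigned btb_index(uint64_t branch_pc) {
    return (unsigned)(branch_pc & (BTB_ENTRIES - 1u));
}

/* Record the target an indirect branch at branch_pc actually jumped to. */
void btb_train(struct toy_btb *btb, uint64_t branch_pc, uint64_t taken_target) {
    assert(btb != NULL);
    btb->target[btb_index(branch_pc)] = taken_target;
}

/* Predict the target for an indirect branch at branch_pc. */
uint64_t btb_predict(const struct toy_btb *btb, uint64_t branch_pc) {
    assert(btb != NULL);
    return btb->target[btb_index(branch_pc)];
}
```

Because the entry is selected without checking who trained it, a branch at an attacker address that shares the victim branch's index bits installs a prediction the victim will consume. IBRS, IBPB, and STIBP all attack this sharing in different ways.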

Retpoline Mitigation

Retpoline ("return trampoline") replaces indirect branches with a construct that prevents BTB speculation:

; Original indirect call:
;   CALL rax     ← target predicted from the BTB, can be poisoned

; Retpoline replacement (thunk body; the real target is in rax):
    call setup_target
capture_spec:
    pause             ; speculation after the RET lands here and spins
    lfence            ; stops speculation from running past this point
    jmp capture_spec  ; loop is reached only speculatively, never architecturally
setup_target:
    mov [rsp], rax    ; overwrite return address with real target
    ret               ; predicted from the RSB (not the BTB): the CALL above
                      ; just pushed capture_spec, so speculation goes there

The retpoline works because:

1. CALL setup_target pushes the return address (capture_spec) onto the RSB.
2. The RSB, not the BTB, is used to predict RET targets.
3. The RSB predicts a return to capture_spec, an infinite PAUSE loop.
4. Speculative execution is trapped harmlessly in the PAUSE loop.
5. The actual RET executes with the correct address already on the real stack.
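The divergence between the predicted and the actual return target can be traced with a small model. This is a sketch under simplifying assumptions: the RSB and the architectural stack are plain arrays of return addresses, and struct toy_core, toy_call, toy_ret, and toy_retpoline are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RSB_DEPTH 16

struct toy_core {
    uint64_t rsb[RSB_DEPTH];    /* hardware return predictor */
    int      rsb_top;
    uint64_t stack[RSB_DEPTH];  /* architectural stack (return addresses only) */
    int      sp;
};

/* CALL pushes the return address on both the real stack and the RSB. */
void toy_call(struct toy_core *c, uint64_t return_addr) {
    assert(c != NULL && c->sp < RSB_DEPTH && c->rsb_top < RSB_DEPTH);
    c->stack[c->sp++] = return_addr;
    c->rsb[c->rsb_top++] = return_addr;
}

/* RET: the front end speculates to the RSB entry; the back end later
 * resolves the real target from the stack. */
void toy_ret(struct toy_core *c, uint64_t *speculated, uint64_t *actual) {
    assert(c != NULL && c->sp > 0 && c->rsb_top > 0);
    *speculated = c->rsb[--c->rsb_top];
    *actual = c->stack[--c->sp];
}

/* Retpoline sequence for an indirect branch to `target`. */
void toy_retpoline(struct toy_core *c, uint64_t capture_spec, uint64_t target,
                   uint64_t *speculated, uint64_t *actual) {
    toy_call(c, capture_spec);       /* call setup_target     */
    c->stack[c->sp - 1] = target;    /* mov [rsp], rax        */
    toy_ret(c, speculated, actual);  /* ret                   */
}
```

Calling toy_retpoline with capture_spec = 0x1000 and any target yields speculated == 0x1000 and actual == target. Speculation is parked in the capture loop while the architectural path reaches the real destination, and the BTB is never consulted.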

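Cost: a retpoline adds ~5-10 cycles per indirect branch on the non-speculative path and, because the real target is never predicted, prevents the CPU from fetching the next basic block early. On CPUs where an underflowing RSB falls back to the BTB, kernels pair retpolines with RSB filling so that RET predictions stay under software control.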

Hardware Mitigations

IBRS (Indirect Branch Restricted Speculation) — Intel

IBRS prevents BTB speculation across privilege levels (user→kernel, guest→host). When IBRS=1, indirect branches in kernel mode cannot be influenced by prior indirect branches in user mode.

Mode: Software sets via MSR IA32_SPEC_CTRL[0]. Original IBRS: set on every kernel entry → very expensive (~3% overall, much worse on specific workloads). Enhanced IBRS (eIBRS, Cascade Lake+): set once at boot, persistent across kernel/user transitions → ~0.5% overhead.

IBPB (Indirect Branch Predictor Barrier)

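Acts as a barrier for the indirect branch predictor: predictions learned before the barrier cannot steer indirect branches executed after it. Software triggers it by writing bit 0 of MSR IA32_PRED_CMD, typically on a context switch between mutually untrusted processes. Cost: a single IBPB is expensive, ranging from hundreds to thousands of cycles depending on the microarchitecture. By default Linux issues IBPB conditionally, only for tasks that request protection via prctl or seccomp; booting with spectre_v2_user=on makes it unconditional.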

STIBP (Single Thread Indirect Branch Predictors)

Prevents an SMT sibling thread from influencing the BTB of the current thread. Required when two processes sharing a physical core (via Hyperthreading) have different security domains.

SSBD (Speculative Store Bypass Disable) — Variant 4 Fix

Disables speculative store bypass: loads always wait for all prior store addresses to be known before executing. Cost: ~3-8% on memory-bound workloads. By default Linux enables it only for processes that opt in via prctl(PR_SET_SPECULATION_CTRL, ...).
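The three per-thread controls above share one register. The sketch below decodes a raw IA32_SPEC_CTRL value using the architectural bit positions (IBRS is bit 0, STIBP bit 1, SSBD bit 2). The function name describe_spec_ctrl is hypothetical, and the code only formats a value it is given; it does not read the MSR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MSR_IA32_SPEC_CTRL 0x48u  /* read/write speculation controls */
#define MSR_IA32_PRED_CMD  0x49u  /* write-only; bit 0 triggers an IBPB */

#define SPEC_CTRL_IBRS  (1ull << 0)
#define SPEC_CTRL_STIBP (1ull << 1)
#define SPEC_CTRL_SSBD  (1ull << 2)
#define PRED_CMD_IBPB   (1ull << 0)

/* Writes the names of the set control bits into buf, separated by spaces,
 * e.g. "IBRS STIBP". Returns buf. */
const char *describe_spec_ctrl(uint64_t value, char *buf, size_t len) {
    assert(buf != NULL && len > 0);
    snprintf(buf, len, "%s%s%s",
             (value & SPEC_CTRL_IBRS)  ? "IBRS "  : "",
             (value & SPEC_CTRL_STIBP) ? "STIBP " : "",
             (value & SPEC_CTRL_SSBD)  ? "SSBD "  : "");
    size_t n = strlen(buf);
    if (n > 0 && buf[n - 1] == ' ')
        buf[n - 1] = '\0';           /* drop the trailing separator */
    return buf;
}
```

On Linux the raw value can be obtained as root with rdmsr 0x48 from msr-tools (this needs the msr kernel module), after which describe_spec_ctrl(value, buf, sizeof buf) names the active controls.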

L1TF (L1 Terminal Fault) — Foreshadow (CVE-2018-3620, CVE-2018-3646)

On affected Intel CPUs: when a page table entry has Present=0 (page not present), the CPU still speculatively loads data from the L1 cache using the physical address encoded in the PTE (bits 51:12 of the non-present PTE). This allows an SMT sibling or a VM guest to read L1 cache data belonging to the host kernel or other VMs.

Mitigation: Flush the L1D cache on every VM entry, and either use core scheduling (so that untrusted VMs never share a physical core with the host or with other tenants) or disable Hyperthreading entirely. Some cloud operators disabled HT after L1TF. The L1D flush together with the loss of SMT accounts for the 15-40% hypervisor overhead often quoted for this mitigation.
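The address the CPU speculates on is simply the frame field of the not-present entry, which is why operating systems also sanitize their own PTEs. The sketch below extracts that field and applies the idea behind Linux's "PTE inversion" mitigation, which flips the address bits of non-present entries so that they no longer name the original frame. The function names are hypothetical and the masks assume the standard x86-64 4 KB page format.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PRESENT   (1ull << 0)
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ull   /* bits 51:12 */

/* Physical address an affected CPU may use for the speculative L1D lookup,
 * even though the Present bit is clear. */
uint64_t l1tf_speculative_paddr(uint64_t pte) {
    return pte & PTE_ADDR_MASK;
}

/* Sketch of PTE inversion: flip the address bits of a non-present entry.
 * The flag bits are preserved, and applying the function twice restores
 * the original entry. */
uint64_t invert_nonpresent_pte(uint64_t pte) {
    assert((pte & PTE_PRESENT) == 0);
    return pte ^ PTE_ADDR_MASK;
}
```

For a not-present entry naming frame 0x12345000, the inverted entry names 0x000FFFFFEDCBA000, which lies far above installed RAM on typical machines, so a speculative lookup is unlikely to find a matching line in L1D. PTE inversion protects only the OS's own page tables; a malicious guest controls its own PTEs, which is why the VM-entry L1D flush is still needed.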

MDS: Microarchitectural Data Sampling

A class of attacks publicly disclosed in 2019 that extract data from CPU internal buffers rather than the cache:

Name	CVE	Buffer	Description
RIDL (Rogue In-Flight Data Load)	CVE-2018-12127, CVE-2018-12130	Load ports, line fill buffers	Sample in-flight data moving between memory, cache, and registers
Fallout	CVE-2018-12126	Store Buffer	Sample data from pending store buffer
ZombieLoad	CVE-2018-12130	Line Fill Buffer	Extract data being loaded from memory/IO
TAA (TSX Async Abort)	CVE-2019-11135	Same buffers as MDS	Asynchronous TSX abort lets transient loads read stale buffer data

These are particularly dangerous in multi-tenant environments: an SMT sibling thread (Hyperthreading) sharing a physical core's fill buffers can sample data in-flight from the other thread's memory accesses.

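Mitigation: MDS clear, which overwrites the affected buffers with the VERW instruction (given updated microcode) on every transition to a less privileged domain, such as return to user space and VM entry. Linux controls this with the mds= kernel parameter (mds=full, or mds=full,nosmt to also disable SMT). Cost: ~3% average overhead.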

Performance Cost Summary

Mitigation	Trigger	Overhead (typical)	Worst case
KPTI	Syscall/interrupt	3-8%	20-30% (syscall-heavy)
Retpoline	Indirect branch	1-3%	10-15% (JIT-heavy)
IBPB	Context switch	2-5%	10% (high-switch rate)
eIBRS (vs IBRS)	Once at boot	0.5%	1%
SSBD	Per-process opt-in	3-8%	15% (memcpy heavy)
L1TF/HT disable	Always	15-40%	50% (HPC workloads)
MDS clear	Context switch	1-3%	5%

Combined overhead on a typical mixed server workload: 10-20% compared to a fully-unmitigated pre-2018 system. Workloads with high syscall rates (databases, web servers with many short connections) are most affected.

Modern Intel CPUs (Ice Lake+, Alder Lake+) have hardware fixes for Meltdown, Spectre v2, and some MDS variants built into silicon, which recovers much of the lost performance.

Debugging Notes

Checking Active Mitigations

# Linux: read per-mitigation status
cat /sys/devices/system/cpu/vulnerabilities/spectre_v1
cat /sys/devices/system/cpu/vulnerabilities/spectre_v2
cat /sys/devices/system/cpu/vulnerabilities/meltdown
cat /sys/devices/system/cpu/vulnerabilities/mds
cat /sys/devices/system/cpu/vulnerabilities/l1tf

# Example output:
# spectre_v1:  Mitigation: usercopy/swapgs barriers and __user pointer sanitization
# spectre_v2:  Mitigation: Enhanced IBRS, IBPB: conditional, RSB filling
# meltdown:    Not affected
# mds:         Not affected

# Check kernel boot parameters for mitigation control
grep -o 'spectre_v2[^ ]*\|nopti\|mitigations=[^ ]*' /proc/cmdline

# Disable all mitigations (DANGEROUS, testing only):
# mitigations=off  in kernel cmdline

Measuring Mitigation Overhead

# Benchmark syscall rate with/without mitigations
# (requires two otherwise-identical systems or kernel cmdline change)
perf bench sched messaging -g 20 -l 1000   # -g takes the number of groups
perf bench syscall basic

# Or estimate exposure to KPTI cost by counting syscalls:
strace -c -p <pid>   # per-syscall counts and time for a running process
perf stat -e 'syscalls:sys_enter_*' ./syscall_heavy_program   # quote the glob

Failure Modes and Ongoing Research

New gadget discovery: Spectre V1 gadgets exist in virtually every software codebase. The kernel, JIT compilers (V8, SpiderMonkey), eBPF, and device drivers all contain patterns that can be turned into Spectre V1 gadgets. Static analyzers such as smatch are used to flag candidate gadgets in the Linux kernel; spectre-meltdown-checker, by contrast, reports which mitigations a system has enabled and does not search for gadgets.
Incomplete compiler mitigations: __builtin_speculation_safe_value() (GCC), speculative load hardening (Clang/LLVM, -mspeculative-load-hardening), and lfence insertion are imperfect: compilers cannot always identify every gadget site.
Spectre v1 is fundamentally unmitigated in hardware: No hardware fix exists for V1 (it requires fixing every vulnerable branch in software). The mitigation is to add lfence, another speculation barrier, or branch-free index masking (array_index_nospec() in Linux) after critical bounds checks in kernel code.
New microarchitectural buffers as attack surfaces: Each new CPU generation introduces new internal buffers (e.g., Gather Data Sampling / GDS, CVE-2022-40982, August 2023 — leaks data through AVX gather instructions). The attack class is alive and active.
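The software hardening of a bounds check is usually done with a branch-free mask rather than a fence. The sketch below follows the generic C fallback of the Linux kernel's array_index_mask_nospec(). The names index_mask_nospec and read_clamped are hypothetical, and the kernel additionally hides the value from the optimizer, which this sketch does not do.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* All-ones if index < size, zero otherwise, computed without a branch.
 * Valid for index and size up to LONG_MAX. Relies on arithmetic right shift
 * of negative values, which GCC and Clang provide. */
static inline unsigned long index_mask_nospec(unsigned long index,
                                              unsigned long size) {
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (sizeof(long) * CHAR_BIT - 1));
}

/* Bounds-checked read whose index is also clamped on the speculative path. */
uint8_t read_clamped(const uint8_t *array, size_t size, size_t index) {
    assert(array != NULL);
    if (index >= size)
        return 0;
    /* If the branch above was mispredicted, the mask is 0 and the transient
     * load reads array[0] instead of an attacker-chosen address. */
    index &= index_mask_nospec(index, size);
    return array[index];
}
```

The mask is a data dependency, not a control dependency, so the CPU cannot predict its way around it: the load address is not known until the mask has been computed from the real index and size.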

Modern Usage and Defense-in-Depth

Cloud providers have responded with layered defenses:

Hardware refresh: Deploy Ice Lake / Cascade Lake+ CPUs with built-in hardware mitigations
Core scheduling: Never schedule untrusted workloads on SMT siblings of trusted workloads
Hypervisor hardening: L1D cache flush on VM entry, IBPB on every VM switch
Process isolation: Use separate physical cores (not just VCPUs) for different tenants when security is paramount (AWS Nitro architecture with dedicated hardware)
Kernel hardening: CONFIG_INIT_ON_ALLOC_DEFAULT_ON, KPTI, CONFIG_RANDOMIZE_BASE (KASLR)

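Language-level mitigations: Memory-safe languages do not prevent Spectre V1 by themselves, because a bounds check in Rust or Go compiles to the same kind of conditional branch that the attack mistrains. Hardening is opt-in or runtime-specific: the Go toolchain offers a -spectre flag that enables index masking and retpolines, Clang/LLVM provides speculative load hardening, and some WebAssembly runtimes harden heap bounds checks with index masking or conditional selects in addition to process isolation.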

Future Directions

Hardware-enforced speculation limits: Proposals to track "taint" in hardware through speculative execution and prevent tainted values from affecting observable microarchitectural state (cache). An example is the academic "Speculative Taint Tracking" (STT) proposal (Yu et al., MICRO 2019).
Formal verification of microarchitecture: Academic work on verifying that a CPU design satisfies "speculative non-interference" properties — no information leakage via speculation.
Constant-time hardware primitives: New ISA extensions to mark memory operations as "non-speculative" (execute only after all prior branches are resolved). ARM has CSDB (Consumption of Speculative Data Barrier).
CHERI architectural capability security: Cambridge's CHERI architecture attaches bounds-carrying capabilities to pointers, which can constrain out-of-bounds access at the ISA level and shrink the Spectre V1 attack surface, provided the implementation also enforces capability bounds during speculation.
Software-based isolation (SFI): Compiler-enforced sandboxing (WebAssembly, Google's Native Client) that confines memory accesses with AND-masking, so that even a speculative access stays inside the sandbox region.
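The AND-masking used by SFI schemes is small enough to show in full. This is a minimal sketch with hypothetical names and a hypothetical 4 KB sandbox. Real SFI systems apply the mask in compiler-generated code for every access and usually combine it with guard regions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SANDBOX_SIZE 4096u                  /* must be a power of two */
#define SANDBOX_MASK (SANDBOX_SIZE - 1u)

struct sandbox {
    uint8_t mem[SANDBOX_SIZE];
};

/* Load from an untrusted offset. The mask wraps any offset into the sandbox,
 * so there is no bounds-check branch to mispredict. */
uint8_t sandbox_load(const struct sandbox *sb, size_t untrusted_offset) {
    assert(sb != NULL);
    return sb->mem[untrusted_offset & SANDBOX_MASK];
}
```

Unlike a bounds check, an out-of-range offset is not rejected but wrapped: sandbox_load(&sb, 4096 + 5) reads the same byte as sandbox_load(&sb, 5). That changes program semantics for buggy inputs, which is the price paid for removing the speculation window.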

Exercises

Spectre V1 gadget hunting: Examine the Linux kernel source arch/x86/kvm/x86.c. Find the pattern if (index < array_size) { access(array[index]); } and identify whether a Spectre V1 gadget exists. Look for array_index_nospec() usage as the correct mitigation.
Flush+Reload implementation: Implement a Flush+Reload timing channel in C:
Allocate a 256 × 512 byte probe array
Flush all entries with clflush
Access probe_array[secret_byte * 512]
Time each of the 256 possible accesses using rdtscp
Verify you can recover secret_byte with >95% accuracy
Mitigation overhead measurement: On a Linux VM, disable mitigations (mitigations=off in GRUB), benchmark Redis redis-benchmark -n 1000000 -c 50, re-enable mitigations, repeat. Quantify the overhead. Which operation types (GET vs SET vs LPUSH) are most affected?
Retpoline analysis: Compile a program with -mindirect-branch=thunk (GCC retpoline flag). Use objdump -d to find the retpoline thunk. Trace through the assembly and explain why speculative execution is trapped in the PAUSE loop.
Variant 4 (Speculative Store Bypass) analysis: Write a C program with a store followed by a load to the same stack address. Use perf stat -e ld_blocks.store_forward (an Intel event that counts loads blocked because store forwarding failed) to check whether store-to-load forwarding is succeeding. Enable SSBD (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, PR_SPEC_DISABLE, 0, 0)) and re-measure to observe the difference.

References

Kocher, P., et al. (2019). Spectre Attacks: Exploiting Speculative Execution. IEEE S&P 2019.
Lipp, M., et al. (2020). Meltdown: Reading Kernel Memory from User Space. Communications of the ACM, 63(6).
Horn, J. (2018). Reading privileged memory with a side-channel. Google Project Zero.
Van Bulck, J., et al. (2018). Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. USENIX Security 2018.
Van Schaik, S., et al. (2019). RIDL: Rogue In-flight Data Load. IEEE S&P 2019.
Intel Corporation. (2022). Deep Dive: Intel Analysis of Microarchitectural Data Sampling. https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/intel-analysis-microarchitectural-data-sampling.html
ARM Limited. (2018). Cache Speculation Side-channels whitepaper. https://developer.arm.com/support/arm-security-updates/speculative-processor-vulnerability
Canella, C., et al. (2019). A Systematic Evaluation of Transient Execution Attacks and Defenses. USENIX Security 2019.