SMT and Hyperthreading: Shared Execution, Hidden Dangers

Prerequisites

CPU pipeline fundamentals (01-cpu-pipeline.md): pipeline stages, execution units
Out-of-order execution (02-out-of-order-execution.md): ROB, RS, physical register file
Cache hierarchy (05-cache-hierarchy.md): L1/L2 sharing, bandwidth
Cache coherence (06-cache-coherence.md): false sharing
Speculative execution (03-speculative-execution.md): Spectre/Meltdown background

Technical Overview

Simultaneous Multi-Threading (SMT) is a microarchitectural technique that allows multiple hardware thread contexts to share a single physical CPU core's execution resources simultaneously. From the operating system's perspective, each SMT thread appears as a separate logical CPU. From the hardware perspective, two (or more) threads are interleaved in the out-of-order execution pipeline, sharing functional units.

The fundamental motivation: a single thread rarely saturates all of a superscalar CPU's execution units. Branch mispredictions, cache misses, and data dependencies leave execution ports idle. SMT presents a second thread's instructions to fill those otherwise-idle slots, increasing total throughput from the same silicon.

Intel's commercial implementation is called Hyperthreading (HT). It first shipped in NetBurst-based Xeon processors in early 2002 and reached the desktop Pentium 4 later that year. The marketing name is specific to Intel CPUs; the architectural concept, SMT, appears at several other vendors, including IBM and AMD.

The core tradeoff: SMT improves aggregate throughput for mixed or I/O-heavy workloads at the cost of: (a) increased contention for shared resources (cache, branch predictor), (b) potential performance regression for compute-bound workloads, and (c) significant security implications for shared microarchitectural state between security domains.

Historical Context

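1964 — CDC 6600: The ten peripheral processors were implemented as a "barrel" of ten contexts time-sharing one set of execution hardware. Not true SMT, but an early example of multiplexing a single execution resource across threads. (The 6600's scoreboard is a separate mechanism: it managed out-of-order issue in the central processor.)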

1969 — IBM System/370 Model 168: Introduced "Simultaneous co-routines" — hardware support for switching between two task states at high speed. Closest ancestor of SMT.

1970s — IBM Research: Jack Smotherman and others at IBM Research published theoretical work on simultaneous multi-threading as a method to utilize idle pipeline slots.

1995 — Dean Tullsen, Susan Eggers, and Henry Levy (University of Washington): Seminal paper "Simultaneous Multithreading: Maximizing On-Chip Parallelism" (ISCA 1995) describing the microarchitectural requirements and performance analysis of SMT. Its simulations suggested that SMT could reach up to 4 times the throughput of a single-threaded superscalar, and about double that of fine-grained multithreading, on a wide-issue core.

2002 — Intel Xeon (Prestonia) and Pentium 4 (Northwood, 3.06 GHz): Intel Hyperthreading. First x86 SMT. The NetBurst architecture (very deep pipeline, often stalled) was particularly well-suited to HT because the stall cycles left room for the second thread.

2004 — IBM POWER5: IBM's first processor with SMT (IBM called it "simultaneous multi-threading"). 2 threads per core. Its predecessor, POWER4, was dual-core but ran one thread per core.

2005 — Sun UltraSPARC T1 "Niagara": 4 threads per core × 8 cores = 32 hardware threads. Designed for high-throughput server workloads (web serving). Each core switched among its threads cycle by cycle (fine-grained multithreading rather than SMT proper), and a single FPU was shared across all 8 cores. Not suitable for compute-bound floating-point tasks.

2008 — Intel Nehalem (Core i7): Hyperthreading returns after being absent from the Core and Core 2 generations. 2 threads per core; in later Core i-series parts, HT availability varies by model.

2017 — IBM POWER9: Offered with either SMT4 cores (4 threads per core) or SMT8 cores (8 threads per core), depending on the chip variant. The operating system can lower the number of active threads per core at runtime.

2020 — IBM POWER10: 8 threads per core (SMT8). Each physical core has 8 hardware thread contexts.

What HT Shares vs What is Private

This is the central hardware design question for SMT: which resources are shared between threads (potentially creating contention), and which are private (per-thread, no contention)?

Shared Resources (Intel Hyperthreading, Skylake+)

  ┌───────────────────────────────────────────────────────────────┐
  │                    PHYSICAL CORE                              │
  │                                                               │
  │  SHARED BETWEEN THREAD 0 AND THREAD 1:                       │
  │  ─────────────────────────────────────                        │
  │  • Execution Units (ALU ports 0-7)                           │
  │    - Integer ALUs (ports 0,1,5,6)                           │
  │    - Load units (ports 2,3)  Store data (port 4)             │
  │    - Branch (ports 0,6)      AGU (ports 2,3,7)               │
  │    - Vector/FP (ports 0,1,5)                                  │
  │                                                               │
  │  • Unified Reservation Station (97 entries, Skylake)         │
  │    → both threads' instructions compete for RS entries        │
  │                                                               │
  │  • L1 Instruction Cache (32 KB, 8-way)                       │
  │  • L1 Data Cache (32 KB, 8-way on Skylake)                   │
  │  • L2 Cache (256 KB 4-way client, 1 MB 16-way server)       │
  │  • L1 DTLB / ITLB / STLB (TLB entries shared)               │
  │                                                               │
  │  • Branch Predictor (PHT, BTB, RSB — PARTIALLY shared)       │
  │    - Skylake: PHT is shared (tagged by thread ID)            │
  │    - RSB: private per-thread (critical for security)         │
  │                                                               │
  │  • L2 TLB (shared)                                           │
  │  • Pre-decode / Instruction queue / Uop cache (shared)       │
  │                                                               │
  └───────────────────────────────────────────────────────────────┘

Private Resources (Per Hardware Thread)

  ┌───────────────────────────────────────────────────────────────┐
  │                PRIVATE PER HARDWARE THREAD                    │
  │                                                               │
  │  • Architectural Register File                                │
  │    16 GPR (RAX-R15) × 64-bit = 128 bytes                    │
  │    + 16 YMM/ZMM vector registers                             │
  │    + RFLAGS, RIP, segment registers                          │
  │                                                               │
  │  • Physical Register File (PRF)                               │
  │    In Skylake: PRF is shared but each thread gets a pool     │
  │    of physical registers; PRF is logically partitioned        │
  │                                                               │
  │  • Reorder Buffer (ROB)                                       │
  │    Skylake: 224-entry ROB SPLIT between two threads          │
  │    → each thread gets ~112 entries                            │
  │    (NOT full 224 per thread — halved!)                        │
  │                                                               │
  │  • Load Buffer and Store Buffer                               │
  │    Split between threads                                      │
  │    Skylake: 72 load / 56 store entries → ~36/28 per thread   │
  │                                                               │
  │  • Return Stack Buffer / RSB                                  │
  │    FULLY private per thread (16 entries each)                │
  │    (Critical: prevents RSB cross-contamination)              │
  │                                                               │
  │  • Program Counter (RIP)                                      │
  │  • Page Table (CR3) — different processes have diff CR3      │
  │  • Machine-Specific Registers (most MSRs are private)        │
  └───────────────────────────────────────────────────────────────┘

HT Shared/Private Resource Summary Table

Resource	Private	Shared	Notes
Architectural registers	YES	-	Full register file per thread
ROB	SPLIT	-	~112 per thread in Skylake
Load/Store buffers	SPLIT	-	~36/28 per thread in Skylake
Physical register file	POOLED	-	Partitioned, not fully split
RSB (Return Stack Buffer)	YES	-	16 entries per thread
Execution units (ALUs)	-	YES	All ports shared
L1 I-Cache	-	YES	Both threads fetch from same L1
L1 D-Cache	-	YES	Both threads' data; the structure L1TF reads from
L2 Cache	-	YES	Both threads' data
Branch predictor (PHT)	TAGGED	-	Entries tagged by thread ID
BTB (Branch Target Buffer)	-	YES	Shared; the structure poisoned in Spectre V2
Fill buffers / LFBs	-	YES	Source of MDS attacks

SMT Performance Analysis

When HT Helps

SMT improves throughput when threads are execution-resource limited in different ways — one thread's idle cycles are filled by the other thread's instructions.

Best cases:
1. I/O-bound thread + compute-bound thread: The I/O-bound thread often stalls waiting for cache misses or system calls. The compute thread fills in.
2. Mixed workloads: Web server handling both CPU-intensive crypto + waiting-on-disk requests.
3. Integer + floating-point mix: Two threads where one is integer-heavy and the other FP-heavy can share different execution ports.
4. Memory-latency hiding: Two threads both doing pointer chasing. When one thread's load is in-flight to DRAM, the other thread's instructions execute.

Typical HT throughput improvement: 0-30% total core throughput. Often stated as "30% average" but highly workload-dependent.

When HT Hurts (or Provides No Benefit)

Compute-bound workloads: Two threads both saturating the same ALU ports. Neither thread runs faster; they merely share (and fight over) the execution units. Throughput per thread decreases by 30-50%.
Cache-working-set-sensitive workloads: Thread 0 has a 24 KB working set; Thread 1 has a 24 KB working set. Together: 48 KB > 32 KB L1 D-cache, so both threads' L1 hit rates drop significantly. Without HT, Thread 0 hit in L1 almost every time; with HT, both threads can see hit rates fall sharply (for example from ~99% to ~70%, depending on the access pattern).
Branch predictor thrashing: Thread 1's branches pollute the PHT entries Thread 0 was using for accurate prediction. Miss rate for both threads increases.
ROB starvation: Each thread's ROB is half-sized. An OoO window of 112 entries (instead of 224) means the CPU can keep fewer independent instructions in flight to overlap with a cache miss.
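
The L1 capacity argument above can be checked with simple arithmetic. The following is a minimal sketch, assuming 64-byte cache lines and a capacity-only view of a fully shared 32 KB L1D. The function and parameter names are illustrative, and real hit rates also depend on associativity, access pattern, and replacement policy.

```python
def l1_pressure(working_sets_kb, l1_kb=32, line_bytes=64):
    """Return (lines_needed, lines_available, fits) for threads sharing one L1D.

    Capacity-only model: ignores associativity conflicts, prefetching,
    and replacement policy (all assumptions of this sketch).
    """
    lines_available = l1_kb * 1024 // line_bytes
    lines_needed = sum(ws * 1024 // line_bytes for ws in working_sets_kb)
    return lines_needed, lines_available, lines_needed <= lines_available


# One thread with a 24 KB working set fits in a 32 KB L1D.
print(l1_pressure([24]))      # (384, 512, True)
# Two such threads on sibling hardware threads do not.
print(l1_pressure([24, 24]))  # (768, 512, False)
```

The second call shows why the combined working set matters: 768 lines are competing for 512 line slots, so some lines must be evicted and refetched even in the best case.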

HT Performance Model

Without HT (1 thread):
  Thread throughput = f(ROB_size=224, RS=97, L1=32KB, predictor=full)

With HT (2 threads, each half):
  Thread throughput = f(ROB_size=112, RS=~48, L1=16KB_effective, predictor=polluted)

HT is a net win when:
  (2 × reduced_throughput) > (1 × full_throughput)
  i.e., each thread retains more than 50% of its solo throughput
  (exactly 50% is the break-even point)

In practice:
  Memory-latency-bound: each thread ≈ 70-90% → HT wins
  Compute-bound:        each thread ≈ 40-60% → HT roughly breaks even
  Cache-thrashing:      each thread ≈ 20-40% → HT loses badly
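
The break-even rule in the model can be expressed directly in code. This is a small sketch under the model's own simplification that both threads retain the same fraction of their solo throughput; the function names and the sample fractions are illustrative, not measurements.

```python
def smt_speedup(per_thread_fraction, threads=2):
    """Aggregate core throughput with SMT, relative to one thread running alone.

    per_thread_fraction: throughput each thread retains under SMT as a
    fraction of its solo throughput (0.7 means 70%).
    """
    return threads * per_thread_fraction


def classify(per_thread_fraction, threads=2):
    """Label the outcome as 'win', 'break-even', or 'loss'."""
    speedup = smt_speedup(per_thread_fraction, threads)
    if speedup > 1.0:
        return "win"
    if speedup == 1.0:
        return "break-even"
    return "loss"


# Sample fractions picked from the middle of each range in the model.
for label, frac in [("memory-latency-bound", 0.8),
                    ("compute-bound", 0.5),
                    ("cache-thrashing", 0.3)]:
    print(f"{label}: {smt_speedup(frac):.2f}x aggregate -> {classify(frac)}")
```

With two threads the break-even fraction is 0.5; with `threads=8` (as on an SMT8 core) it drops to 0.125, which is one way to see why highly threaded designs tolerate large per-thread slowdowns.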

SMT Security Issues

SMT is the source of the most dangerous class of post-Spectre vulnerabilities, because the shared microarchitectural state between SMT threads provides cross-thread information leakage channels that are more efficient and harder to mitigate than cross-process channels.

MDS: Microarchitectural Data Sampling

RIDL (Rogue In-Flight Data Load; CVE-2018-12130 for fill buffers, CVE-2018-12127 for load ports): The Line Fill Buffer (LFB), the buffer that holds cache lines being filled from memory, is shared between SMT threads. When an attacker's load faults or needs a microcode assist, the CPU can transiently forward stale data from an LFB entry to that load, including data the sibling thread has in flight.

Thread 0 (victim):                Thread 1 (attacker):
  loads or stores secret at X       issues a load that faults or needs a
  → cache miss → LFB entry holds      microcode assist (e.g. unmapped page),
     secret data in flight            optionally inside a TSX transaction
                                      to suppress the fault
                                    → load transiently receives stale LFB data
                                    → encodes it in cache timing

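The LFB is shared because both threads share the load path from L2/L3/DRAM into L1. Giving each thread private LFBs would mean duplicating or partitioning the miss-handling hardware, which costs either silicon area or per-thread memory-level parallelism. Affected CPUs therefore rely on buffer-clearing mitigations rather than on private buffers.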

Fallout (CVE-2018-12126, MSBDS: Microarchitectural Store Buffer Data Sampling): The store buffer (which holds pending stores awaiting commit to cache) can leak data to transient loads. With HT enabled the store buffer is partitioned between the two threads, so cross-thread leakage is limited to specific conditions, such as one thread entering a sleep state and its entries being handed to the sibling.

ZombieLoad (CVE-2018-12130 — MFBDS: Microarchitectural Fill Buffer Data Sampling): A variant of RIDL targeting fill buffers more broadly. Can extract data that has been recently loaded by any thread on the same physical core, including kernel data.

L1TF (L1 Terminal Fault, CVE-2018-3620, CVE-2018-3646 — Foreshadow):

L1TF is the most dangerous of these for cloud environments. When the CPU encounters a PTE (Page Table Entry) with Present=0, it raises a fault, but it may first speculatively treat bits 51:12 of the non-present PTE as a physical address and look that address up in the L1 D-Cache.

With SMT: if an attacker controls a non-present PTE with a crafted physical address, the speculative L1 lookup from the attacker's thread can extract data placed in L1 by the victim sibling thread — including data from the kernel or another VM's address space.
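
To make the "bits 51:12" detail concrete, the sketch below extracts the frame address field from a PTE value and shows that the field is still populated when Present=0. It only illustrates the bit layout, not the attack; the constant names, helper names, and the sample PTE value are hypothetical.

```python
PRESENT_BIT = 1 << 0
# Bits 51:12 of an x86-64 PTE hold the physical page frame address.
FRAME_MASK = ((1 << 52) - 1) & ~((1 << 12) - 1)


def pte_is_present(pte):
    """True if the Present bit (bit 0) is set."""
    return bool(pte & PRESENT_BIT)


def pte_frame_address(pte):
    """Physical address bits encoded in the PTE, regardless of the Present bit."""
    return pte & FRAME_MASK


# Hypothetical non-present PTE whose frame bits point at physical 0x12345000.
pte = 0x12345000 | 0x6  # Present=0, two other flag bits set
print(pte_is_present(pte))          # False
print(hex(pte_frame_address(pte)))  # 0x12345000
```

On an affected CPU, the address produced by `pte_frame_address` is what the speculative L1D lookup uses, even though `pte_is_present` is false and the access ultimately faults.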

Why L1TF requires HT disable (not just KPTI):

KPTI defense against Meltdown:
  Removes kernel pages from user-mode page tables
  → A user-mode access to a kernel address has no valid translation
  → Meltdown has nothing to forward transiently

But L1TF with HT:
  The faulting load skips translation and uses the PTE's address bits directly
  → a guest controls its own page tables, so the host cannot sanitize them
  Victim thread (e.g., host kernel) places data in L1
  Attacker thread (e.g., VM guest) uses crafted non-present PTE
  → L1TF lookup hits victim's data in the SHARED L1
  → L1 is shared between SMT siblings, so KPTI cannot prevent this

Fix: Flush L1D on every VM entry (the cost is workload-dependent and can be
     substantial for exit-heavy workloads)
     AND keep untrusted guests off sibling threads (disable HT or use core
     scheduling), because a sibling can refill L1 right after the flush

Cross-HT Covert Channels

Even without speculative execution, SMT threads share the cache. Prime+Probe and Flush+Reload attacks work across SMT siblings with higher bandwidth and lower noise than across separate cores, because the shared L1/L2 provides more efficient eviction and observation.

Branch Predictor Pollution Across HT

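The BTB is shared between HT siblings in many implementations, which allows branch target injection analogous to Spectre V2 but at lower latency and higher bandwidth than cross-process attacks. STIBP (Single Thread Indirect Branch Predictors) is the control that stops one sibling's indirect-branch predictions from steering the other's.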

Disabling HT in Security-Sensitive Environments

Linux

# Check HT status
cat /sys/devices/system/cpu/smt/active  # 1 = HT active, 0 = disabled

# Disable HT at boot (kernel parameter)
# In /etc/default/grub: GRUB_CMDLINE_LINUX="nosmt"
# Or: GRUB_CMDLINE_LINUX="mitigations=auto,nosmt"  # default mitigations, plus SMT off where a vulnerability requires it

# Disable HT at runtime (without reboot; requires root)
echo off | sudo tee /sys/devices/system/cpu/smt/control

# Verify: logical CPUs disappear
nproc  # drops from 2N to N
cat /sys/devices/system/cpu/smt/active  # now 0
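
The same sysfs files can be read from a script, for example in a deployment check. This is a minimal sketch: the function name and return shape are my own, and it assumes only the two sysfs files used in the commands above.

```python
from pathlib import Path

SMT_DIR = Path("/sys/devices/system/cpu/smt")


def read_smt_state(smt_dir=SMT_DIR):
    """Return {'active': bool or None, 'control': str or None}.

    Values are None when the sysfs files are missing, e.g. on a
    non-Linux system or a kernel without the SMT control interface.
    """
    def read(name):
        try:
            return (Path(smt_dir) / name).read_text().strip()
        except OSError:
            return None

    active = read("active")
    return {
        "active": None if active is None else active == "1",
        "control": read("control"),
    }


if __name__ == "__main__":
    print(read_smt_state())
```

Taking the directory as a parameter keeps the function testable against a temporary directory instead of the live system.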

Who disables HT:
- Google: Chrome OS 74 disabled Hyperthreading by default in response to MDS (2019).
- Amazon AWS: EC2 lets customers run instances with one thread per core through the instance CPU options.
- OpenBSD: disabled SMT by default on Intel CPUs in 2018.
- Financial services firms: some reportedly disable HT in regulatory/compliance environments handling cardholder data or trading algorithms.
- Hardened Linux configurations: some boot with nosmt by default.

macOS

macOS Intel systems: HT is enabled by default. There is no runtime toggle; Apple's MDS guidance describes disabling HT by setting NVRAM variables from macOS Recovery, which takes effect after a restart.

Apple Silicon (M-series): No SMT at all. Each P-core and E-core is a single-thread hardware context. No HT sharing, no MDS attack surface via SMT. This is a security and performance simplification — Apple chose wider OoO windows over SMT throughput.

Windows

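# Windows client and Server: mitigations are controlled by registry values under
# HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
#   FeatureSettingsOverride     = 8264   (MDS/L1TF mitigations with HT disabled)
#   FeatureSettingsOverrideMask = 3
# Use FeatureSettingsOverride = 72 to enable the mitigations but keep HT on.
# A reboot is required. HT can also be turned off in firmware (BIOS/UEFI) setup.

# Hyper-V L1TF mitigation with HT still enabled: use the core scheduler
# bcdedit /set hypervisorschedulertype core
# so that the sibling threads of a core only run vCPUs from the same VM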

IBM POWER10: 8 Threads Per Core

IBM POWER10 takes SMT to the extreme: 8 hardware thread contexts per physical core.

POWER10 Core (7nm):
  Physical core: 1
  SMT modes: SMT1 (1 thread, max per-thread performance)
             SMT2 (2 threads)
             SMT4 (4 threads)
             SMT8 (8 threads, max throughput)

What is shared in SMT8:
  - Execution pipelines (an SMT8 core is built from two SMT4 halves)
  - 2 MB L2 (private to core)
  - Large OoO window shared by up to 8 threads

IBM design philosophy: server/mainframe workloads (many concurrent VMs,
database connections, transactions) benefit enormously from 8 threads.
Single-thread compute (rarely the bottleneck in IBM's markets) is not prioritized.

POWER10 cache hierarchy with SMT8:
  L1: 48 KB I + 32 KB D (private to the core, shared only by its own HW threads)
  L2: 2 MB per core (private to core; all 8 threads share)
  L3: 8 MB local region per core, part of a chip-wide shared L3

ARM big.LITTLE and Heterogeneous SMT

Most Arm cores do not implement SMT (the Cortex-A65/A65AE and Neoverse E1 are notable exceptions). Instead, Arm-based designs typically use asymmetric multi-core:

Apple M3 (example):
  P-cores (Performance): 4 cores × 1 thread/core = 4 logical CPUs
  E-cores (Efficiency):  4 cores × 1 thread/core = 4 logical CPUs

  No SMT anywhere: every core is 1 hardware thread only

Benefits of no-SMT:
  - Each core's full ROB (estimated at roughly 630 entries on M1) available to the single thread
  - No L1 sharing between "threads": full private L1 per core (L2 is shared per cluster)
  - No MDS/L1TF attack surface via SMT
  - Simpler scheduling: no thread-to-core affinity complexity

Arm Cortex-X4 generation (2023 client cores):
  Big cores (X4): wide out-of-order design, no SMT
  Medium (A720): out-of-order, no SMT
  Small (A520): in-order, no SMT

Samsung Exynos M1-M6 cores: No SMT. Qualcomm Oryon: No SMT.

In contrast to IBM/Intel's throughput-via-sharing strategy, ARM and Apple use throughput-via-more-cores strategy. With TSMC 3nm process (Apple M3) delivering high core density, the area cost of private resources per core is acceptable.

Debugging and Monitoring SMT

# Detect HT topology
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
# Output: "0,24" → CPU 0 and CPU 24 are HT siblings on same physical core

# List all core sibling pairs
for cpu in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list; do
    echo "$cpu: $(cat $cpu)"
done | sort -u -t: -k2

# NUMA + HT topology with lstopo
lstopo --no-io
# Shows: packages → cores → PUs (Processing Units = logical CPUs)
# Each core with 2 PUs = HT enabled

# hwloc-bind: bind a process to physical cores (not HT siblings)
hwloc-bind core:0.pu:0 -- ./compute_bound_program  # first PU of core 0 only
hwloc-bind core:all.pu:0 -- ./program              # one PU per core, siblings unused

# perf: compare single-thread vs HT performance
# (perf stat -C counts on a CPU but does not pin the workload, so pin with taskset)
perf stat -- taskset -c 0 ./single_thread     # CPU 0 alone
perf stat -- taskset -c 0,24 ./dual_thread    # CPU 0 and its HT sibling 24

# Measure cache interference with taskset
taskset -c 0 ./prog1 &   # logical CPU 0
taskset -c 24 ./prog2 &  # its HT sibling → competing for L1/L2
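
The sibling lists printed by the commands above can be turned into a list of physical cores, which is handy when building a taskset or affinity mask that avoids HT siblings. This is a sketch with illustrative function names; the sample topology is hypothetical, and on a real system the dictionary would be filled by reading each cpuN/topology/thread_siblings_list file.

```python
def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,24' or '0-3,8' into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def group_cores(siblings_by_cpu):
    """Collapse {cpu: siblings_text} into a sorted list of unique sibling groups."""
    groups = {tuple(parse_cpu_list(text)) for text in siblings_by_cpu.values()}
    return sorted(groups)


def one_cpu_per_core(siblings_by_cpu):
    """Pick the lowest-numbered logical CPU from each physical core."""
    return [group[0] for group in group_cores(siblings_by_cpu)]


# Hypothetical 2-core machine where CPUs 0/2 and 1/3 are sibling pairs.
sample = {0: "0,2", 1: "1,3", 2: "0,2", 3: "1,3"}
print(group_cores(sample))       # [(0, 2), (1, 3)]
print(one_cpu_per_core(sample))  # [0, 1]
```

The result of `one_cpu_per_core` can be passed to `os.sched_setaffinity` or joined into a `taskset -c` argument to run a compute-bound job on one hardware thread per core.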

Failure Modes

L1 cache thrashing with HT: Two threads each with ~24-32 KB working sets. Combined 48-64 KB exceeds L1. Hit rate drops from ~99% to ~70%. Both threads run slower WITH HT than the single thread alone.
ROB starvation under HT: Each thread only has ~112 ROB entries. A thread that frequently misses L2 (roughly 40 cycles to hit in L3, hundreds of cycles to reach DRAM) fills its ROB quickly. With a full ROB, allocation of new instructions stalls. The sibling thread can still run, but the stalled thread's throughput falls further than it would without HT, because the smaller ROB means back-pressure arrives sooner.
SMT-induced branch predictor pollution: A thread running through a large hash table (many unpredictable branches) fills the branch predictor with unpredictable entries, polluting the predictor for its HT sibling's more predictable but conflicting branch addresses.
MDS/L1TF attacks in production: Multi-tenant cloud hosts were exposed in principle to cross-VM data leakage via MDS and L1TF until HT-disable or core-scheduling mitigations and L1D flushing were deployed. No large-scale exfiltration has been publicly confirmed, but proof-of-concept attacks were demonstrated in research.

Performance Implications Summary

Workload Type              HT Benefit     Notes
──────────────────────────────────────────────────────────────────
Web server (nginx)         +20-30%        Many threads, I/O latency gaps
Database (OLTP)            +10-20%        Mixed read/write, lock contention
In-memory cache (Redis)    +5-15%         Mostly single-threaded, HT limited benefit
Scientific compute         -10 to -30%    Compute-bound → HT hurts, disable recommended
Video encoding (ffmpeg)    -5 to -20%     Compute-bound
Genomics/bioinformatics    0 to +10%      Depends on I/O vs compute ratio
Java application server    +15-25%        Mixed workload, GC + compute + I/O
Kafka (producer/consumer)  +10-20%        I/O + serialization mix

The conventional wisdom for HPC and scientific computing: disable HT for compute-bound workloads. Doing so lets each thread use the full ROB and cache without sharing. This is particularly important for benchmarks, where HT adds variability and often reduces peak throughput.

Modern Usage

Hyper-V Core Scheduler

Hyper-V's "Core Scheduler" (introduced in Windows Server 2016 and the default since Windows Server 2019) only schedules virtual CPUs from the same VM onto the SMT siblings of a physical core. This prevents cross-VM SMT-sibling attacks (MDS, L1TF) while retaining HT for intra-VM SMT benefit.

Classic scheduler (default pre-2018):
  pCPU 0  (thread 0): VM A vCPU
  pCPU 24 (thread 1): VM B vCPU   ← different VMs share core! Attack possible

Core scheduler (default post-L1TF):
  pCPU 0  (thread 0): VM A vCPU 0
  pCPU 24 (thread 1): VM A vCPU 1  ← same VM shares core, safe
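
The invariant the core scheduler maintains can be stated as a simple check: no physical core hosts vCPUs from more than one VM. The sketch below tests a placement against that rule. The function name, data shapes, and sample topology are hypothetical and are not Hyper-V's actual data structures.

```python
def core_sharing_violations(placement, core_of):
    """Return sorted core IDs that host vCPUs from more than one VM.

    placement: {logical_cpu: vm_name} for every busy logical CPU
    core_of:   {logical_cpu: physical_core_id}
    """
    vms_on_core = {}
    for cpu, vm in placement.items():
        vms_on_core.setdefault(core_of[cpu], set()).add(vm)
    return sorted(core for core, vms in vms_on_core.items() if len(vms) > 1)


# Hypothetical topology matching the diagram: CPUs 0 and 24 share core 0.
core_of = {0: 0, 24: 0, 1: 1, 25: 1}

classic = {0: "VM A", 24: "VM B"}
core_sched = {0: "VM A", 24: "VM A", 1: "VM B", 25: "VM B"}

print(core_sharing_violations(classic, core_of))     # [0]
print(core_sharing_violations(core_sched, core_of))  # []
```

An empty result means every pair of SMT siblings is either idle or running vCPUs from a single VM, which is the property that blocks cross-VM sibling attacks.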

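Linux KVM: the kvm_intel.vmentry_l1d_flush parameter controls L1D flushing on VM entry, and core scheduling (CONFIG_SCHED_CORE, Linux 5.14+) restricts SMT siblings to tasks that trust each other. Use nosmt on the host for maximum isolation.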

Future Directions

SMT security hardware fixes: Future Intel/AMD designs may include per-SMT-thread fill buffer partitioning or flushing on context switch, eliminating MDS without requiring HT disable. Intel has fixed MDS in hardware starting Ice Lake for most variants.
Asymmetric SMT: Different numbers of hardware threads per core based on core purpose (P-core: 2 HT, E-core: 4 HT). Intel's efficiency cores may benefit more from SMT due to their smaller ROB and cache.
SMT with hardware-enforced isolation: Research on "secure SMT" with hardware tracking of information flow between threads, preventing cross-thread microarchitectural leakage.
RISC-V SMT: Some RISC-V processor designs (e.g., the multi-threaded MIPS P8700) are beginning to include SMT. The open ISA enables experiments with novel SMT isolation mechanisms.
Dynamic SMT enable/disable: Linux kernel already supports echo off > /sys/devices/system/cpu/smt/control at runtime. Future systems may dynamically enable/disable HT based on workload type — compute-bound tasks auto-disable HT, I/O-bound tasks auto-enable.

Exercises

Measure HT cache interference: Run a matrix multiply benchmark (compute-bound) on:
1 thread, physical core 0 only (taskset -c 0)
2 threads, same physical core (taskset -c 0,24)
2 threads, different physical cores (taskset -c 0,1)
Measure throughput (GFLOPS) in each case. Explain the difference using the shared L1/L2 cache and ROB analysis.
MDS theoretical analysis: Read the RIDL paper (Van Schaik et al., IEEE S&P 2019). Identify which specific CPU buffers are exploitable by an SMT sibling. For each buffer: (a) what data can be extracted, (b) what rate (bytes/second) does the paper demonstrate, (c) what is the mitigation.
L1TF simulation: Set up a KVM VM on an Intel CPU (pre-Ice-Lake). Use perf kvm stat to measure L1D flush events (if available) with kvm_intel.vmentry_l1d_flush=always. Compare VM entry latency with and without L1D flushing. Calculate the overhead as a percentage of total execution time.
HT topology detection: Write a C program that identifies HT siblings by measuring cache interference. For each pair of logical CPUs (CPU i, CPU j), run competing threads and measure L1 miss rate. CPU pairs on the same physical core will show significantly higher mutual L1 interference than pairs on different cores.
Core Scheduler emulation: Implement a user-space scheduler (using sched_setaffinity) that assigns threads to physical cores such that no two threads from different security domains share an SMT-sibling relationship. Given N physical cores × 2 HT threads = 2N logical CPUs, and M processes of D distinct security domains, write the placement algorithm.

References

Tullsen, D. M., et al. (1995). Simultaneous Multithreading: Maximizing On-Chip Parallelism. ISCA 1995, 392–403.
Tullsen, D. M., & Brown, J. A. (2001). Handling Long-Latency Loads in a Simultaneous Multithreading Processor. MICRO 2001.
Intel Corporation. (2003). Intel Hyper-Threading Technology Technical User's Guide. https://www.intel.com/content/www/us/en/developer/articles/technical/hyper-threading-technology.html
Van Schaik, S., et al. (2019). RIDL: Rogue In-flight Data Load. IEEE S&P 2019.
Canella, C., et al. (2019). Fallout: Leaking Data on Meltdown-resistant CPUs. CCS 2019.
Van Bulck, J., et al. (2018). Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. USENIX Security 2018.
Fog, A. (2023). Microarchitecture of Intel, AMD, and VIA CPUs. Section on Hyperthreading. https://www.agner.org/optimize/microarchitecture.pdf
IBM Corporation. (2021). IBM POWER10 Processor Technical Overview. IBM Systems White Paper.
Microsoft. (2019). Windows Server guidance to protect against L1TF / Speculative Store Bypass. https://support.microsoft.com/en-us/topic/windows-server-guidance-to-protect-against-l1-terminal-fault