Section 06: CPU Architecture — Overview
Section Purpose and Scope
The CPU is the computational substrate upon which all systems software executes. To reason rigorously about performance, correctness, and security in operating systems, you must understand what the CPU is actually doing — not just the abstract "execute instructions" model, but the real hardware pipeline with its speculative execution, out-of-order completion, branch prediction, cache hierarchies, and NUMA topologies.
This section bridges computer architecture and operating systems. It covers the mechanisms that OS designers must account for: CPU privilege modes (rings), cache coherence (critical for SMP kernels), TLBs (fundamental to virtual memory), NUMA (essential for the scheduler), and SMT/hyperthreading (with its security implications). It covers all three major ISAs relevant to production systems: x86-64, AArch64 (ARM64), and RISC-V.
Prerequisites
- Section 00 (Foundations): CPU privilege rings, interrupts
- Section 03 (Kernel Fundamentals): kernel memory layout, system calls
- Basic familiarity with binary/hex notation and assembly-level thinking (you will not need to write assembly, but you must be able to read it)
Learning Objectives
After completing this section you will be able to:
- Describe the classical 5-stage pipeline and explain how superscalar and OoO execution extend it
- Explain branch prediction, its accuracy metrics, and its performance and security implications
- Describe the cache hierarchy (L1/L2/L3, coherence protocols MESI/MOESI) and reason about cache effects on kernel code
- Explain TLB structure, TLB shootdowns, and their cost in multi-core kernels
- Describe NUMA topology and explain why the scheduler must be NUMA-aware
- Compare x86-64, AArch64, and RISC-V at the ISA design level
- Explain SMT (hyperthreading), its performance benefits, and its security implications (L1TF, MDS, etc.)
- Describe CPU privilege modes for x86-64 (rings + VMX root/non-root) and AArch64 (EL0–EL3)
Architecture Overview
CPU INTERNAL ARCHITECTURE (Modern Superscalar OoO Processor)
┌────────────────────────────────────────────────────────────────┐
│                            CPU DIE                             │
│                                                                │
│  FRONTEND                                                      │
│  ┌──────────┐  ┌──────────┐  ┌─────────────┐  ┌───────────┐    │
│  │ Branch   │  │  Fetch   │  │   Decode    │  │  Rename/  │    │
│  │Predictor │→ │   Unit   │→ │  (µop gen)  │→ │   Alloc   │    │
│  │ (BHT,BTB)│  │ (I-cache)│  │  x86→µops   │  │   (ROB)   │    │
│  └──────────┘  └──────────┘  └─────────────┘  └─────┬─────┘    │
│                                                     │          │
│  BACKEND (Out-of-Order Engine)                      │          │
│                                               ┌─────▼─────┐    │
│                                               │ Dispatch  │    │
│                                               │  (Sched)  │    │
│                                               └─────┬─────┘    │
│  ┌────────────┬────────────┬────────────┬───────────┴─────┐    │
│  │   ALU 0    │   ALU 1    │  FPU/SIMD  │ Load/Store Unit │    │
│  │  (port 0)  │  (port 1)  │  (port 5)  │ (D-cache, TLB)  │    │
│  └────────────┴────────────┴────────────┴───────────┬─────┘    │
│                                               ┌─────▼─────┐    │
│                                               │  Reorder  │    │
│                                               │  Buffer   │    │
│                                               │ (Commit)  │    │
│                                               └───────────┘    │
│                                                                │
│  MEMORY HIERARCHY                                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐    │
│  │  L1-I $  │  │  L1-D $  │  │   L2 $   │  │     L3 $     │    │
│  │ 32-64KB  │  │ 32-64KB  │  │256KB-1MB │  │  4MB–128MB   │    │
│  │ 4-cycle  │  │ 4-cycle  │  │ 12-cycle │  │ 30-50 cycle  │    │
│  └──────────┘  └──────────┘  └──────────┘  └───────┬──────┘    │
└────────────────────────────────────────────────────┼───────────┘
                                                     │
                                             ┌───────▼──────┐
                                             │ DRAM (DIMM)  │
                                             │    ~100ns    │
                                             └──────────────┘
NUMA TOPOLOGY (2-socket server):
┌──────────────────────┐    ┌──────────────────────┐
│       Socket 0       │    │       Socket 1       │
│ ┌──┐ ┌──┐ ┌──┐ ┌──┐  │    │ ┌──┐ ┌──┐ ┌──┐ ┌──┐  │
│ │C0│ │C1│ │C2│ │C3│  │    │ │C4│ │C5│ │C6│ │C7│  │
│ └──┘ └──┘ └──┘ └──┘  │    │ └──┘ └──┘ └──┘ └──┘  │
│     Shared L3 $      │    │     Shared L3 $      │
│      Local DRAM      │    │      Local DRAM      │
└──────────┬───────────┘    └───────────┬──────────┘
           └─── QPI/UPI interconnect ───┘
Local access: ~100ns        Remote access: ~300ns
CACHE COHERENCE STATE MACHINE (MESI):
Modified  ──── write-back ────►  Shared
   ▲                               │
   │ local write         bus snoop │
   │                               ▼
Exclusive ◄─── cache fill ─────  Invalid
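The diagram shows only one cycle through the four states. As a rough illustration, the fuller set of transitions for a single cache line, seen from one core's cache, can be written as a lookup table. This is a simplified textbook sketch, not a model of any specific processor: the event names (`local_read_exclusive`, `remote_write`, and so on) are made up for this example, and real implementations add transient states and protocol messages that are omitted here.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


# (current state, event) -> next state. Event names are illustrative only.
# "local_*" events come from this core; "remote_*" events are snooped from others.
TRANSITIONS = {
    (State.I, "local_read_exclusive"): State.E,  # fill; no other cache holds the line
    (State.I, "local_read_shared"): State.S,     # fill; another cache also holds it
    (State.I, "local_write"): State.M,           # read-for-ownership, then write
    (State.E, "local_write"): State.M,           # silent upgrade, no bus traffic
    (State.E, "remote_read"): State.S,
    (State.E, "remote_write"): State.I,
    (State.S, "local_write"): State.M,           # other copies are invalidated first
    (State.S, "remote_write"): State.I,
    (State.M, "remote_read"): State.S,           # dirty data is written back
    (State.M, "remote_write"): State.I,          # dirty data is written back
}


def step(state, event):
    """Return the next state; unlisted (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, start=State.I):
    """Apply a sequence of events to one cache line, starting from `start`."""
    state = start
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    trace = ["local_read_exclusive", "local_write", "remote_read", "remote_write"]
    print(run(trace))  # walks I -> E -> M -> S -> I, the cycle drawn above
```

The trace in the `__main__` block follows the same loop as the diagram: a cache fill, a local write, a write-back caused by another core's read, and finally an invalidation caused by another core's write.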
Key Concepts
- Pipeline: The mechanism by which a CPU overlaps execution of multiple instructions in different stages (Fetch → Decode → Execute → Memory → Write-back). A 5-stage pipeline can have 5 instructions in flight simultaneously.
- Superscalar: A CPU that issues multiple instructions per clock cycle by replicating execution units. Modern CPUs issue 4–6 µops per cycle.
- Out-of-Order Execution (OoO): Instructions execute in data-dependency order rather than program order. A Reorder Buffer (ROB) tracks in-flight instructions and retires them in order to maintain architectural state.
- Speculative Execution: The CPU speculatively executes instructions along a predicted branch path before knowing if the prediction is correct. If wrong, the CPU squashes the speculative results. Meltdown and Spectre showed that speculative execution can leak privileged data.
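- Branch Prediction: Logic that predicts the outcome of conditional branches to keep the pipeline fed. Modern predictors typically achieve accuracy well above 95%, and often above 99% on regular, loop-heavy code. Each misprediction costs roughly 15–20 cycles because the pipeline must be flushed and refilled from the correct path.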
- Register File: The CPU's fastest storage. x86-64 has 16 general-purpose registers (rax–r15); AArch64 has 31 (x0–x30). With register renaming, the physical register file is much larger (~180–224 physical registers in Intel CPUs).
- ISA (Instruction Set Architecture): The contract between hardware and software defining the instruction encoding, register names, and semantics. x86-64 (CISC, complex, heavily backward-compatible), AArch64 (RISC, clean, fixed-width 32-bit instructions), RISC-V (open, modular, extensions via ISA letters: I, M, A, F, D, C, V...).
- Cache Hierarchy: L1 (per-core, ~4-cycle latency, 32–64KB), L2 (per-core or shared pair, ~12-cycle, 256KB–1MB), L3 (shared, ~30–50 cycles, 4–128MB), DRAM (~100ns, GBs).
- Cache Coherence: In multi-core systems, each core has its own L1/L2. The coherence protocol (MESI: Modified, Exclusive, Shared, Invalid) ensures all cores see a consistent view of memory. False sharing (two cores modifying different variables on the same cache line) causes significant performance degradation.
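- TLB (Translation Lookaside Buffer): A cache for virtual-to-physical address translations. A TLB miss is expensive: the resulting page table walk can cost hundreds of cycles when the page-table entries are not already cached. The OS must invalidate stale TLB entries when it modifies mappings, and when it switches page tables on a context switch unless entries are tagged with an address-space identifier (PCID on x86-64, ASID on AArch64). TLB shootdowns on SMP (sending IPIs to other cores so that they flush their TLBs) are a significant kernel bottleneck.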
- NUMA (Non-Uniform Memory Access): In multi-socket systems, each socket has local DRAM accessible at low latency (~100ns) and remote DRAM accessible via the socket interconnect (QPI/UPI/Infinity Fabric) at higher latency (~300ns). The kernel scheduler, memory allocator, and applications must be NUMA-aware to avoid remote memory accesses.
- SMT / Hyperthreading: A single physical core presents two (or more) logical CPUs to the OS, which share execution units and caches. SMT increases throughput on mixed workloads, but it also enables attacks such as L1TF and MDS, because sibling threads share the L1 data cache and internal buffers (fill buffers, store buffers).
- CPU Privilege Modes: x86-64: Ring 0 (kernel) and Ring 3 (user), with Rings 1–2 present but unused by mainstream operating systems, plus VMX root/non-root mode for hypervisors. AArch64: EL0 (user), EL1 (kernel), EL2 (hypervisor), EL3 (secure monitor/TrustZone). RISC-V: U-mode (user), S-mode (supervisor/kernel), M-mode (machine/firmware).
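To make the cache hierarchy figures above concrete, here is a minimal back-of-envelope sketch of average memory access time. It uses the per-level latencies quoted in the list (4, 12, and about 40 cycles, plus ~100 ns for DRAM). Everything else is an assumption chosen for illustration: the 3 GHz clock and both sets of hit rates are hypothetical, and the model ignores overlap between outstanding misses, prefetching, and TLB effects.

```python
CLOCK_GHZ = 3.0  # assumed core clock, used only to convert nanoseconds to cycles


def ns_to_cycles(ns, clock_ghz=CLOCK_GHZ):
    """Convert a latency in nanoseconds to core cycles at the given clock."""
    return ns * clock_ghz


def average_access_cycles(levels, dram_cycles):
    """Expected cycles per access.

    `levels` is a list of (total_hit_latency_cycles, hit_rate) pairs ordered
    from L1 outward. An access is charged the latency of the first level in
    which it hits; accesses that miss everywhere are charged `dram_cycles`.
    """
    expected = 0.0
    reach = 1.0  # fraction of accesses that get as far as this level
    for latency, hit_rate in levels:
        expected += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return expected + reach * dram_cycles


DRAM_CYCLES = ns_to_cycles(100)  # ~100 ns DRAM latency from the list above

# Hypothetical hit rates for L1, L2, L3 (not measurements).
CACHE_FRIENDLY = [(4, 0.95), (12, 0.80), (40, 0.80)]
CACHE_HOSTILE = [(4, 0.50), (12, 0.50), (40, 0.50)]

friendly = average_access_cycles(CACHE_FRIENDLY, DRAM_CYCLES)
hostile = average_access_cycles(CACHE_HOSTILE, DRAM_CYCLES)

if __name__ == "__main__":
    print(f"cache-friendly: {friendly:.1f} cycles per access")
    print(f"cache-hostile:  {hostile:.1f} cycles per access")
    print(f"ratio:          {hostile / friendly:.1f}x")
```

Under these assumed hit rates the two access patterns differ by roughly 9x in average latency. Most of that gap comes from the small fraction of accesses that reach DRAM.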
ISA Comparison Matrix
┌───────────────────┬────────────────┬────────────────┬────────────────┐
│ Property          │ x86-64         │ AArch64        │ RISC-V         │
├───────────────────┼────────────────┼────────────────┼────────────────┤
│ Design            │ CISC evolved   │ RISC, clean    │ RISC, open     │
│ Instruction width │ Variable 1–15B │ Fixed 32-bit   │ 32-bit, 16 w/C │
│ General regs      │ 16 (+ many FP) │ 31 + SP        │ 32             │
│ Calling conv.     │ System V ABI   │ AAPCS64        │ RISC-V psABI   │
│ Privilege levels  │ 4 rings + VMX  │ EL0–EL3        │ U/S/M modes    │
│ Virtualization    │ VT-x / AMD-V   │ EL2 (+ VHE)    │ H extension    │
│ Atomic ops        │ LOCK prefix    │ LDXR/STXR, LSE │ A extension    │
│ Memory model      │ TSO (strong)   │ Weakly ordered │ RVWMO (weak)   │
│ Dominant use      │ Servers, PCs   │ Mobile, Apple  │ Embedded, IoT  │
│ License           │ Proprietary    │ Proprietary    │ Open (CC BY)   │
└───────────────────┴────────────────┴────────────────┴────────────────┘
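The "Instruction width" row hides a practical difference for anything that has to walk an instruction stream (disassemblers, emulators, kernel probes). AArch64 instructions are always 4 bytes. A RISC-V instruction announces its own length in its lowest bits: with the C extension, an instruction is 16 bits wide unless its two lowest bits are `11`. x86-64 has no such shortcut and needs a full prefix and opcode decode. The sketch below covers only the RISC-V rule for the standard 16-bit and 32-bit encodings; the function name is made up for this example, and longer encodings are rejected rather than handled.

```python
def riscv_insn_length(first_parcel: int) -> int:
    """Return the instruction length in bytes, given its first 16-bit parcel.

    `first_parcel` is the low halfword of the instruction (RISC-V instruction
    fetch is little-endian, so this is the halfword at the lowest address).
    """
    if not 0 <= first_parcel <= 0xFFFF:
        raise ValueError("expected a 16-bit value")
    if (first_parcel & 0b11) != 0b11:
        return 2  # compressed (C extension) encoding
    if ((first_parcel >> 2) & 0b111) != 0b111:
        return 4  # standard 32-bit encoding
    raise ValueError("encoding longer than 32 bits; not handled in this sketch")


if __name__ == "__main__":
    # 0x00000013 is `addi x0, x0, 0` (the canonical NOP); its low halfword is 0x0013.
    print(riscv_insn_length(0x0013))
    # 0x0001 is `c.nop`, the compressed NOP.
    print(riscv_insn_length(0x0001))
```

Because the length is known after reading two bytes, a RISC-V decoder can find instruction boundaries cheaply, whereas an x86-64 decoder must parse each instruction in order to find the start of the next.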
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1978 | Intel 8086: 16-bit, the x86 lineage begins |
| 1985 | Intel 80386: 32-bit protected mode, paging |
| 1989 | Intel 80486: on-chip FPU, 8KB on-chip cache, first tightly pipelined x86 |
| 1991 | MIPS R4000: 64-bit RISC microprocessor; RISC design validated at scale |
| 1993 | Intel Pentium: superscalar (2-wide), 64-bit data bus |
| 1995 | Pentium Pro: out-of-order execution, µop translation |
| 1997 | Intel Pentium II: SIMD (MMX); AMD K6 |
| 2000 | AMD Athlon: first 1GHz CPU; Intel Pentium 4 (NetBurst, deep pipeline) |
| 2003 | AMD Athlon 64: x86-64 (64-bit extension); Intel forced to follow |
| 2005 | Intel Pentium D and AMD Athlon 64 X2: first mainstream dual-core x86 CPUs; end of frequency scaling |
| 2006 | Intel Core Duo and Core 2: return to a P6-derived microarchitecture, efficiency over frequency |
| 2008 | Intel Nehalem: integrated memory controller, QPI, SMT revival |
| 2011 | Sandy Bridge: ring bus, shared L3, AVX |
| 2013 | Haswell: TSX (transactional memory), AVX2 |
| 2016 | ARM Cortex-A72: AArch64 competitive with x86 for server workloads |
| 2018 | Meltdown and Spectre publicly disclosed (reported privately to vendors in 2017): OoO + caches create side channels |
| 2018 | KPTI, Retpoline deployed globally to mitigate Meltdown/Spectre |
| 2019 | MDS vulnerabilities (ZombieLoad, Fallout, RIDL) expose SMT risks |
| 2020 | Apple M1: ARM64 with unified memory, outperforms x86 per-watt |
| 2021 | RISC-V International ratifies the vector extension (RVV 1.0); commercial RISC-V SoCs ship at scale |
| 2023 | Intel Meteor Lake: tiled chiplet design; DDR5/LPDDR5 |
| 2024 | Armv9.2-A cores; Apple M4 |
Modern Relevance and Production Use Cases
Performance engineering: Cache-line alignment, NUMA-aware memory allocation, avoiding false sharing, prefetching, branch hints — all require understanding this section. The difference between a cache-friendly and cache-hostile data structure can be 10x in throughput.
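False sharing is ultimately a question of byte offsets: two fields interfere when their offsets fall into the same cache line. The sketch below uses `ctypes` only to compute C-style field offsets; it does not measure any performance. The 64-byte line size is an assumption (typical for current x86-64 parts, but not universal), the struct names are hypothetical, and the check assumes the struct itself starts on a cache-line boundary.

```python
import ctypes

CACHE_LINE = 64  # bytes; assumed line size, check your target CPU


class PackedCounters(ctypes.Structure):
    """Two per-thread counters laid out back to back."""
    _fields_ = [
        ("a", ctypes.c_uint64),
        ("b", ctypes.c_uint64),
    ]


class PaddedCounters(ctypes.Structure):
    """Same counters, with padding so that `b` starts on the next line."""
    _fields_ = [
        ("a", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("b", ctypes.c_uint64),
    ]


def line_index(struct_type, field_name):
    """Cache-line index of a field, assuming the struct is line-aligned."""
    return getattr(struct_type, field_name).offset // CACHE_LINE


def shares_line(struct_type, first, second):
    """True if both fields start in the same cache line."""
    return line_index(struct_type, first) == line_index(struct_type, second)


if __name__ == "__main__":
    for layout in (PackedCounters, PaddedCounters):
        print(
            layout.__name__,
            "a@", layout.a.offset,
            "b@", layout.b.offset,
            "same line:", shares_line(layout, "a", "b"),
        )
```

If one thread updates `a` while another updates `b`, the packed layout forces the line to bounce between cores on every write. The padded layout trades 56 bytes of memory for independent lines, which is the same idea as `alignas(64)` in C or C++.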
Security: Spectre, Meltdown, L1TF, MDS, Speculative Store Bypass, and related attacks (dozens of CVEs since 2017) all exploit CPU microarchitecture features described here. Kernel mitigations (KPTI, IBPB, STIBP, SRBDS, MDS_CLEAR) all impose real performance costs that production engineers must budget.
Scheduler design: The Linux scheduler (CFS, succeeded by EEVDF in Linux 6.6), NUMA balancing, and SMT-aware scheduling (core scheduling, introduced in Linux 5.14) all depend on the CPU topology described here.
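On Linux, the topology that these scheduler features rely on is also visible from user space. Each NUMA node directory under `/sys/devices/system/node/` contains a `cpulist` file in a compact range format such as `0-3,8-11`. The following is a minimal sketch of parsing that format; the function names are made up for this example, and `read_node_cpus` simply returns an empty mapping on systems without that sysfs directory.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input, or a node with no CPUs
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_node_cpus(root="/sys/devices/system/node"):
    """Map NUMA node id -> list of CPU ids, read from sysfs when available."""
    nodes = {}
    for node_dir in sorted(Path(root).glob("node[0-9]*")):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            nodes[int(node_dir.name[len("node"):])] = parse_cpulist(cpulist.read_text())
    return nodes


if __name__ == "__main__":
    print(parse_cpulist("0-3,8-11"))
    print(read_node_cpus())  # empty dict on machines without NUMA sysfs entries
```

A NUMA-aware program can use a mapping like this to pin worker threads to the CPUs of the node that holds their data.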
Compiler and JIT design: CPUs have evolved to work best with specific code patterns. Understanding OoO, branch prediction, and SIMD units is essential for writing compilers, JIT engines, and hand-optimized hot paths.
Cloud instance selection: Choosing between Intel Xeon, AMD EPYC, and AWS Graviton (ARM64) instances requires understanding ISA and microarchitecture tradeoffs for specific workloads.
File Map
06-cpu-architecture/
├── 00-overview.md ← This file
├── 01-pipeline-fundamentals.md ← 5-stage pipeline, hazards, forwarding
├── 02-superscalar-execution.md ← Multiple issue, instruction-level parallelism
├── 03-out-of-order-execution.md ← ROB, Tomasulo algorithm, register renaming
├── 04-speculative-execution.md ← Branch prediction, BHT, BTB, TAGE predictor
├── 05-register-files.md ← Architectural vs physical registers, ABI
├── 06-x86-64-isa.md ← Instruction encoding, addressing modes, CPUID
├── 07-aarch64-isa.md ← ARM64 design philosophy, EL levels, NEON
├── 08-risc-v-isa.md ← Open ISA structure, extensions, privilege spec
├── 09-cache-hierarchy.md ← L1/L2/L3, eviction policies, cache geometry
├── 10-cache-coherence.md ← MESI/MOESI, false sharing, coherence cost
├── 11-tlb-and-address-translation.md ← TLB structure, shootdowns, huge pages
├── 12-numa-architecture.md ← Socket topology, NUMA distances, libnuma
├── 13-smt-hyperthreading.md ← HT mechanics, security implications, tuning
├── 14-cpu-power-management.md ← P-states, C-states, frequency scaling, turbo
└── 15-hardware-security-features.md ← CET, MPX, MTE, memory tagging
Cross-References
- Section 00 (Foundations): CPU rings and privilege modes are introduced there
- Section 05 (Boot Process): CPU mode transitions (real → protected → long mode)
- Section 09 (Scheduling): NUMA-aware scheduling and SMT topology use
- Section 11 (Memory Management): TLBs and page tables connect here
- Section 26 (Security): Spectre/Meltdown and hardware mitigations
- Section 33 (Hardware Architecture): Chiplet design, memory subsystems, I/O
Recommended Depth of Study
Essential: Files 01–05, 09–12. Every systems programmer benefits from understanding the pipeline, caches, TLBs, and NUMA.
Deep dive recommended: Files 06–08 (ISAs) if you do cross-architecture work. Files 13–14 for performance tuning on real hardware. File 15 for security engineers.
Hands-on: Use perf stat -e cache-misses,branch-misses,dTLB-load-misses,iTLB-load-misses on a workload. Use lstopo (from hwloc) to visualize NUMA topology. Read /proc/cpuinfo and understand every relevant field.
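As a starting point for reading /proc/cpuinfo, the sketch below groups logical CPUs by physical core, which reveals the SMT siblings. It assumes the x86 Linux layout, where each logical CPU is a blank-line-separated block containing `processor`, `physical id`, and `core id` fields; other architectures (AArch64, for example) expose different fields. The embedded sample is a trimmed, hypothetical excerpt rather than output from a real machine.

```python
from collections import defaultdict

# Hypothetical, trimmed excerpt in the x86 Linux /proc/cpuinfo layout.
SAMPLE_CPUINFO = """\
processor   : 0
physical id : 0
core id     : 0

processor   : 1
physical id : 0
core id     : 1

processor   : 2
physical id : 0
core id     : 0

processor   : 3
physical id : 0
core id     : 1
"""


def smt_siblings(cpuinfo_text):
    """Map (socket id, core id) -> logical CPU numbers that share that core."""
    groups = defaultdict(list)
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        # Skip blocks that lack the x86-style topology fields.
        if all(name in fields for name in ("processor", "physical id", "core id")):
            core = (int(fields["physical id"]), int(fields["core id"]))
            groups[core].append(int(fields["processor"]))
    return dict(groups)


if __name__ == "__main__":
    for core, cpus in sorted(smt_siblings(SAMPLE_CPUINFO).items()):
        print(f"socket {core[0]} core {core[1]}: logical CPUs {cpus}")
    # On a real x86 Linux machine, pass open("/proc/cpuinfo").read() instead.
```

In the sample, logical CPUs 0 and 2 share one core and 1 and 3 share another. This interleaved numbering is common on Intel systems and is easy to misread as four independent cores.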
Estimated study time: 20–25 hours for full coverage. This is a dense section; the ISA references can be consulted as needed rather than memorized.