Section 06: CPU Architecture — Overview

Section Purpose and Scope

The CPU is the computational substrate upon which all systems software executes. To reason rigorously about performance, correctness, and security in operating systems, you must understand what the CPU is actually doing — not just the abstract "execute instructions" model, but the real hardware pipeline with its speculative execution, out-of-order completion, branch prediction, cache hierarchies, and NUMA topologies.

This section bridges computer architecture and operating systems. It covers the mechanisms that OS designers must account for: CPU privilege modes (rings), cache coherence (critical for SMP kernels), TLBs (fundamental to virtual memory), NUMA (essential for the scheduler), and SMT/hyperthreading (with its security implications). It covers all three major ISAs relevant to production systems: x86-64, AArch64 (ARM64), and RISC-V.

Prerequisites

Section 00 (Foundations): CPU privilege rings, interrupts
Section 03 (Kernel Fundamentals): kernel memory layout, system calls
Basic familiarity with binary/hex notation and assembly-level thinking (you do not need to write assembly, but you must be able to read it)

Learning Objectives

After completing this section you will be able to:

Describe the classical 5-stage pipeline and explain how superscalar and OoO execution extend it
Explain branch prediction, its accuracy metrics, and its performance and security implications
Describe the cache hierarchy (L1/L2/L3, coherence protocols MESI/MOESI) and reason about cache effects on kernel code
Explain TLB structure, TLB shootdowns, and their cost in multi-core kernels
Describe NUMA topology and explain why the scheduler must be NUMA-aware
Compare x86-64, AArch64, and RISC-V at the ISA design level
Explain SMT (hyperthreading), its performance benefits, and its security implications (L1TF, MDS, etc.)
Describe CPU privilege modes for x86-64 (rings + VMX root/non-root) and AArch64 (EL0–EL3)

Architecture Overview

CPU INTERNAL ARCHITECTURE (Modern Superscalar OoO Processor)

  ┌────────────────────────────────────────────────────────────────┐
  │                        CPU DIE                                 │
  │                                                                │
  │  FRONTEND                                                      │
  │  ┌──────────┐  ┌──────────┐  ┌─────────────┐  ┌───────────┐  │
  │  │  Branch  │  │  Fetch   │  │   Decode    │  │  Rename/  │  │
  │  │Predictor │→ │  Unit    │→ │  (µop gen)  │→ │  Alloc    │  │
  │  │ (BHT,BTB)│  │ (I-cache)│  │ x86→µops   │  │  (ROB)    │  │
  │  └──────────┘  └──────────┘  └─────────────┘  └─────┬─────┘  │
  │                                                       │        │
  │  BACKEND (Out-of-Order Engine)                        │        │
  │                                                 ┌─────▼─────┐ │
  │                                                 │  Dispatch  │ │
  │                                                 │  (Sched)   │ │
  │                                                 └─────┬─────┘ │
  │                                                       │        │
  │  ┌────────────┬────────────┬────────────┬─────────────┘        │
  │  │  ALU 0     │  ALU 1     │  FPU/SIMD  │  Load/Store Unit    │
  │  │  (port 0)  │  (port 1)  │  (port 5)  │  (D-cache, TLB)    │
  │  └────────────┴────────────┴────────────┴──────────────────── │
  │                                                       │        │
  │                                                 ┌─────▼─────┐ │
  │                                                 │  Reorder   │ │
  │                                                 │  Buffer    │ │
  │                                                 │  (Commit)  │ │
  │                                                 └────────────┘ │
  │                                                                │
  │  MEMORY HIERARCHY                                              │
  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────┐  │
  │  │  L1-I $  │  │  L1-D $  │  │   L2 $   │  │    L3 $      │  │
  │  │ 32-64KB  │  │ 32-64KB  │  │256KB-1MB │  │  4MB–128MB   │  │
  │  │ 4-cycle  │  │ 4-cycle  │  │ 12-cycle │  │  30-50 cycle │  │
  │  └──────────┘  └──────────┘  └──────────┘  └──────┬───────┘  │
  └───────────────────────────────────────────────────┼────────────┘
                                                      │
                                              ┌───────▼──────┐
                                              │  DRAM (DIMM) │
                                              │  ~100ns       │
                                              └──────────────┘
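
The latency figures in the diagram can be folded into a single average memory access time (AMAT). The sketch below is a back-of-envelope model, not a measurement: the per-level hit rates and the 3 GHz clock are illustrative assumptions, and `amat_cycles` is a hypothetical helper name.

```python
# Back-of-envelope AMAT using the diagram's latencies.
# Hit rates and clock speed are assumed values for illustration only.

CLOCK_GHZ = 3.0                   # assumption: 3 GHz core clock
DRAM_CYCLES = 100 * CLOCK_GHZ     # ~100 ns DRAM is ~300 cycles at 3 GHz

# (hit rate among accesses that reach this level, load-to-use latency in cycles)
LEVELS = [
    (0.95, 4),    # L1-D
    (0.80, 12),   # L2
    (0.50, 40),   # L3 (midpoint of the 30-50 cycle range)
]

def amat_cycles(levels, dram_cycles):
    """Weight each level's latency by the fraction of accesses it serves."""
    reach = 1.0    # fraction of accesses that missed every level above
    total = 0.0
    for hit_rate, latency in levels:
        total += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    return total + reach * dram_cycles   # whatever is left goes to DRAM

print(f"AMAT ~ {amat_cycles(LEVELS, DRAM_CYCLES):.2f} cycles")
```

With these assumed rates only 0.5% of accesses reach DRAM, yet they account for roughly a quarter of the average, which is why the last-level miss rate tends to dominate memory-bound code.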

NUMA TOPOLOGY (2-socket server):

  ┌──────────────────────┐    ┌──────────────────────┐
  │   Socket 0           │    │   Socket 1           │
  │  ┌──┐ ┌──┐ ┌──┐ ┌──┐│    │┌──┐ ┌──┐ ┌──┐ ┌──┐ │
  │  │C0│ │C1│ │C2│ │C3││    ││C4│ │C5│ │C6│ │C7│ │
  │  └──┘ └──┘ └──┘ └──┘│    │└──┘ └──┘ └──┘ └──┘ │
  │      Shared L3 $     │    │     Shared L3 $      │
  │      Local DRAM      │    │     Local DRAM        │
  └──────────┬───────────┘    └────────────┬─────────┘
             └───── QPI/UPI interconnect ──┘
             Local access: ~100ns    Remote access: ~300ns
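
To see why placement matters, the local and remote figures above can be blended by the fraction of accesses that cross the interconnect. This is a simple linear model under the diagram's ~100ns/~300ns numbers; `mean_latency_ns` is a hypothetical helper, and real systems also suffer interconnect bandwidth contention that this ignores.

```python
# Linear blend of local and remote DRAM latency (figures from the diagram).
LOCAL_NS = 100.0
REMOTE_NS = 300.0

def mean_latency_ns(remote_fraction, local_ns=LOCAL_NS, remote_ns=REMOTE_NS):
    """Average DRAM latency when `remote_fraction` of accesses hit the other socket."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns

for r in (0.0, 0.25, 0.5, 1.0):
    print(f"{r:.0%} remote -> {mean_latency_ns(r):.0f} ns")
```

Under this model, a thread whose memory is spread evenly across both sockets already pays double the local latency on average.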

CACHE COHERENCE STATE MACHINE (MESI):

  Modified ──── write-back ────► Shared
     ▲                              │
     │ local write            bus snoop
     │                              ▼
  Exclusive ◄── cache fill ──── Invalid
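
The diagram shows one path through the state machine. A slightly fuller transition function for a single cache line, seen from one core's cache, might look like the sketch below. The event names and the `others_have_copy` flag are illustrative choices, not part of any hardware specification, and write-back and bus traffic are reduced to comments.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

def next_state(state, event, others_have_copy=False):
    """Return the next MESI state of one line in one cache (simplified)."""
    if event == "local_read":
        if state is State.I:
            # Cache fill: Shared if another cache holds the line, else Exclusive.
            return State.S if others_have_copy else State.E
        return state
    if event == "local_write":
        # From S or I the cache must first invalidate other copies.
        return State.M
    if event == "remote_read":
        # A Modified line is written back (or forwarded) before being shared.
        return State.S if state in (State.M, State.E) else state
    if event == "remote_write":
        return State.I
    raise ValueError(f"unknown event: {event}")

state = State.I
for event in ("local_read", "local_write", "remote_read", "remote_write"):
    state = next_state(state, event)
    print(f"{event:13s} -> {state.value}")
```

The loop walks the same cycle as the diagram: Invalid, Exclusive, Modified, Shared, and back to Invalid.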

Key Concepts

Pipeline: The mechanism by which a CPU overlaps execution of multiple instructions in different stages (Fetch → Decode → Execute → Memory → Write-back). A 5-stage pipeline can have 5 instructions in flight simultaneously.
Superscalar: A CPU that issues multiple instructions per clock cycle by replicating execution units. Modern CPUs issue 4–6 µops per cycle.
Out-of-Order Execution (OoO): Instructions execute in data-dependency order rather than program order. A Reorder Buffer (ROB) tracks in-flight instructions and retires them in order to maintain architectural state.
Speculative Execution: The CPU executes instructions along a predicted branch path before it knows whether the prediction is correct. If the prediction is wrong, the CPU squashes the architectural results, but microarchitectural side effects such as cache fills remain. Meltdown and Spectre showed that these side effects can leak privileged data.
Branch Prediction: Logic that predicts the outcome of conditional branches to keep the pipeline fed. Modern predictors achieve >99% accuracy on typical workloads. Mispredictions cost ~15–20 cycles.
Register File: The CPU's fastest storage. x86-64 has 16 general-purpose registers (rax–r15); AArch64 has 31 (x0–x30). With register renaming, the physical register file is much larger (~180–224 physical registers in Intel CPUs).
ISA (Instruction Set Architecture): The contract between hardware and software defining the instruction encoding, register names, and semantics. x86-64 (CISC, complex, heavily backward-compatible), AArch64 (RISC, clean, fixed-width 32-bit instructions), RISC-V (open, modular, extensions via ISA letters: I, M, A, F, D, C, V...).
Cache Hierarchy: L1 (per-core, ~4-cycle latency, 32–64KB), L2 (per-core or shared pair, ~12-cycle, 256KB–1MB), L3 (shared, ~30–50 cycles, 4–128MB), DRAM (~100ns, GBs).
Cache Coherence: In multi-core systems, each core has its own L1/L2. The coherence protocol (MESI: Modified, Exclusive, Shared, Invalid) ensures all cores see a consistent view of memory. False sharing (two cores modifying different variables on the same cache line) causes significant performance degradation.
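TLB (Translation Lookaside Buffer): A cache for virtual-to-physical address translations. A TLB miss is expensive: the page table walk can cost tens to hundreds of cycles, depending on how much of the page table is already cached. The OS must invalidate stale TLB entries when it modifies mappings, and when it switches page tables on a context switch unless entries are tagged with an address-space identifier (PCID on x86-64, ASID on AArch64 and RISC-V). TLB shootdowns on SMP (sending IPIs to other cores so that they invalidate their own TLB entries) are a significant kernel bottleneck.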
NUMA (Non-Uniform Memory Access): In multi-socket systems, each socket has local DRAM accessible at low latency (~100ns) and remote DRAM accessible via the socket interconnect (QPI/UPI/Infinity Fabric) at higher latency (~300ns). The kernel scheduler, memory allocator, and applications must be NUMA-aware to avoid remote memory accesses.
SMT / Hyperthreading: A single physical core presents two (or more) logical CPUs to the OS, sharing execution units and caches. Increases throughput on mixed workloads; creates security vulnerabilities because sibling threads share core-private structures: the L1 data cache (L1TF) and internal buffers such as fill buffers, store buffers, and load ports (MDS).
CPU Privilege Modes: x86-64: Ring 0 (kernel), Ring 3 (user), VMX root/non-root (hypervisor). AArch64: EL0 (user), EL1 (kernel), EL2 (hypervisor), EL3 (secure monitor/TrustZone). RISC-V: U-mode (user), S-mode (supervisor/kernel), M-mode (machine/firmware).
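
The Pipeline and Branch Prediction entries above lend themselves to quick arithmetic. The sketch below models an ideal, stall-free pipeline and then adds the average misprediction stall to a base CPI. The base CPI of 0.25 and the 20% branch fraction are assumptions chosen for illustration; the accuracy and penalty figures come from the entries above. Both function names are hypothetical.

```python
def pipeline_cycles(n_instructions, stages=5):
    """Cycles for an ideal pipeline with no stalls: fill it once, then
    complete one instruction per cycle."""
    if n_instructions <= 0:
        return 0
    return stages + n_instructions - 1

def effective_cpi(base_cpi, branch_fraction, accuracy, penalty_cycles):
    """Add the average misprediction stall to a base cycles-per-instruction."""
    return base_cpi + branch_fraction * (1.0 - accuracy) * penalty_cycles

n = 1000
print(pipeline_cycles(n), "cycles pipelined vs", 5 * n, "unpipelined")

# Assumptions: base CPI of 0.25 (4-wide issue), 20% of instructions are branches.
for accuracy in (0.99, 0.95):
    cpi = effective_cpi(0.25, 0.20, accuracy, penalty_cycles=20)
    print(f"accuracy {accuracy:.0%}: CPI ~ {cpi:.2f}")
```

Under these assumptions, dropping predictor accuracy from 99% to 95% raises CPI from about 0.29 to 0.45, which illustrates why a few percentage points of accuracy matter so much on a wide core.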

ISA Comparison Matrix

┌───────────────────┬────────────────┬────────────────┬────────────────┐
│ Property          │    x86-64      │   AArch64      │    RISC-V      │
├───────────────────┼────────────────┼────────────────┼────────────────┤
│ Design            │ CISC evolved   │ RISC, clean    │ RISC, open     │
│ Instruction width │ Variable 1–15B │ Fixed 32-bit   │ 32b (+16b RVC) │
│ General regs      │ 16 (+ many FP) │ 31 + SP        │ 32             │
│ Calling conv.     │ System V ABI   │ AAPCS64        │ RISC-V psABI   │
│ Privilege levels  │ 4 rings + VMX  │ EL0–EL3        │ U/S/M modes    │
│ Virtualization    │ VT-x / AMD-V   │ EL2 (+ VHE)    │ H-extension    │
│ Atomic ops        │ LOCK prefix    │ LDXR/STXR      │ A extension    │
│ Memory model      │ TSO (strong)   │ Weakly ordered │ RVWMO (weak)   │
│ Dominant use      │ Servers, PCs   │ Mobile, Apple  │ Embedded, IoT  │
│ License           │ Proprietary    │ Proprietary    │ Open (CC BY)   │
└───────────────────┴────────────────┴────────────────┴────────────────┘
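
RISC-V's modularity shows up directly in its ISA strings, such as `rv64imafdc`. The sketch below splits such a string into base width and extensions. It is deliberately limited: it assumes single-letter extensions come first and that multi-letter extensions are separated by underscores, and it expands `g` only to `i, m, a, f, d`. `parse_isa_string` is a hypothetical helper, not part of any toolchain.

```python
import re

def parse_isa_string(isa):
    """Split a RISC-V ISA string into (xlen, single-letter exts, multi-letter exts).

    Assumed input shape: rv<32|64|128><letters>[_<name>]...
    """
    match = re.fullmatch(r"rv(32|64|128)([a-z]+)((?:_[a-z0-9]+)*)", isa.lower())
    if match is None:
        raise ValueError(f"unrecognized ISA string: {isa!r}")
    xlen = int(match.group(1))
    single = []
    for letter in match.group(2):
        # "g" is shorthand for the general-purpose set; expanded here to imafd only.
        for ext in ("imafd" if letter == "g" else letter):
            if ext not in single:
                single.append(ext)
    multi = [name for name in match.group(3).split("_") if name]
    return xlen, single, multi

print(parse_isa_string("rv64gc"))
print(parse_isa_string("rv32imac_zicsr_zifencei"))
```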

Major Historical Milestones

Year	Milestone
1978	Intel 8086: 16-bit, the x86 lineage begins
1985	Intel 80386: 32-bit protected mode, paging
1989	Intel 80486: on-chip FPU, 8KB cache, first pipelined x86 integer core
1991	MIPS R4000: 64-bit RISC microprocessor from MIPS Computer Systems; RISC design validated at scale
1993	Intel Pentium: superscalar (2-wide), 64-bit data bus
1995	Pentium Pro: out-of-order execution, µop translation
1997	Intel Pentium II: SIMD (MMX); AMD K6
2000	AMD Athlon: first 1GHz CPU; Intel Pentium 4 (NetBurst, deep pipeline)
2003	AMD Athlon 64: x86-64 (64-bit extension); Intel forced to follow
2005	Intel Pentium D and AMD Athlon 64 X2: first mainstream dual-core x86 CPUs; end of frequency scaling
2006	Intel Core 2: return to P6 microarchitecture, efficiency over frequency
2008	Intel Nehalem: integrated memory controller, QPI, SMT revival
2011	Sandy Bridge: ring bus, shared L3, AVX
2013	Haswell: TSX (transactional memory), AVX2
2016	ARM Cortex-A72: AArch64 competitive with x86 for server workloads
2017	Meltdown and Spectre discovered and reported to vendors (public disclosure in January 2018): OoO + caches create side channels
2018	KPTI, Retpoline deployed globally to mitigate Meltdown/Spectre
2019	MDS vulnerabilities (ZombieLoad, Fallout, RIDL) expose SMT risks
2020	Apple M1: ARM64 with unified memory, outperforms x86 per-watt
2021	RISC-V vector extension (RVV 1.0) ratified; commercial RISC-V SoCs ship at scale
2023	Intel Meteor Lake: tiled chiplet design; DDR5/LPDDR5
2024	Armv9.2; Apple M4

Modern Relevance and Production Use Cases

Performance engineering: Cache-line alignment, NUMA-aware memory allocation, avoiding false sharing, prefetching, branch hints — all require understanding this section. The difference between a cache-friendly and cache-hostile data structure can be 10x in throughput.
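
False sharing is ultimately a layout question: do two independently written fields land on the same cache line? The sketch below uses `ctypes` only to inspect field offsets. It assumes a 64-byte line and a structure whose base address is line-aligned, and the structure and helper names are made up for illustration. The same padding idea applies to C or Rust structs, where the performance effect is actually observable.

```python
import ctypes

CACHE_LINE = 64  # bytes; assumed, typical for x86-64 and many AArch64 cores

class Packed(ctypes.Structure):
    # Two per-thread counters side by side: both fit in one cache line.
    _fields_ = [("a", ctypes.c_uint64),
                ("b", ctypes.c_uint64)]

class Padded(ctypes.Structure):
    # Padding pushes the second counter onto the next cache line.
    _fields_ = [("a", ctypes.c_uint64),
                ("_pad", ctypes.c_char * (CACHE_LINE - 8)),
                ("b", ctypes.c_uint64)]

def share_line(struct, first, second, line=CACHE_LINE):
    """True if both fields fall in the same cache line (base assumed line-aligned)."""
    off1 = getattr(struct, first).offset
    off2 = getattr(struct, second).offset
    return off1 // line == off2 // line

print("Packed shares a line:", share_line(Packed, "a", "b"))
print("Padded shares a line:", share_line(Padded, "a", "b"))
```

The padded layout trades memory (128 bytes instead of 16) for the guarantee that writes to one counter do not invalidate the line holding the other.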

Security: Spectre, Meltdown, L1TF, MDS, Speculative Store Bypass, and related attacks (dozens of CVEs since 2017) all exploit CPU microarchitecture features described here. Mitigations (KPTI, IBPB, STIBP, the SRBDS microcode update, and VERW-based buffer clearing advertised by the MD_CLEAR CPUID bit) all impose real performance costs that production engineers must budget for.
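
On Linux, the kernel reports the status of each known issue as a small text file under `/sys/devices/system/cpu/vulnerabilities/`. The sketch below collects those files into a dictionary; it returns an empty result on systems where the directory does not exist. `read_mitigations` is a hypothetical helper name, and the exact set of files depends on kernel version and CPU.

```python
from pathlib import Path

VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")

def read_mitigations(directory=VULN_DIR):
    """Map each vulnerability file name to the kernel's one-line status."""
    directory = Path(directory)
    if not directory.is_dir():
        return {}          # not Linux, or sysfs not mounted
    status = {}
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            try:
                status[entry.name] = entry.read_text().strip()
            except OSError:
                status[entry.name] = "<unreadable>"
    return status

if __name__ == "__main__":
    for name, state in read_mitigations().items():
        print(f"{name:28s} {state}")
```

Comparing this output before and after changing boot parameters is a quick way to confirm which mitigations are actually active before measuring their cost.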

Scheduler design: The Linux CFS scheduler (superseded by EEVDF in Linux 6.6), NUMA balancing, and SMT-aware scheduling (core scheduling, introduced in Linux 5.14) all depend on the CPU topology described here.

Compiler and JIT design: CPUs have evolved to work best with specific code patterns. Understanding OoO, branch prediction, and SIMD units is essential for writing compilers, JIT engines, and hand-optimized hot paths.

Cloud instance selection: Choosing between Intel Xeon, AMD EPYC, and AWS Graviton (ARM64) instances requires understanding ISA and microarchitecture tradeoffs for specific workloads.

File Map

06-cpu-architecture/
├── 00-overview.md                      ← This file
├── 01-pipeline-fundamentals.md         ← 5-stage pipeline, hazards, forwarding
├── 02-superscalar-execution.md         ← Multiple issue, instruction-level parallelism
├── 03-out-of-order-execution.md        ← ROB, Tomasulo algorithm, register renaming
├── 04-speculative-execution.md         ← Branch prediction, BHT, BTB, TAGE predictor
├── 05-register-files.md                ← Architectural vs physical registers, ABI
├── 06-x86-64-isa.md                    ← Instruction encoding, addressing modes, CPUID
├── 07-aarch64-isa.md                   ← ARM64 design philosophy, EL levels, NEON
├── 08-risc-v-isa.md                    ← Open ISA structure, extensions, privilege spec
├── 09-cache-hierarchy.md               ← L1/L2/L3, eviction policies, cache geometry
├── 10-cache-coherence.md               ← MESI/MOESI, false sharing, coherence cost
├── 11-tlb-and-address-translation.md   ← TLB structure, shootdowns, huge pages
├── 12-numa-architecture.md             ← Socket topology, NUMA distances, libnuma
├── 13-smt-hyperthreading.md            ← HT mechanics, security implications, tuning
├── 14-cpu-power-management.md          ← P-states, C-states, frequency scaling, turbo
└── 15-hardware-security-features.md    ← CET, MPX, MTE (memory tagging)

Cross-References

Section 00 (Foundations): introduces CPU rings and privilege modes
Section 05 (Boot Process): CPU mode transitions (real → protected → long mode)
Section 09 (Scheduling): NUMA-aware scheduling and SMT topology use
Section 11 (Memory Management): TLBs and page tables connect here
Section 26 (Security): Spectre/Meltdown and hardware mitigations
Section 33 (Hardware Architecture): Chiplet design, memory subsystems, I/O

Recommended Depth of Study

Essential: Files 01–05, 09–12. Every systems programmer benefits from understanding the pipeline, caches, TLBs, and NUMA.

Deep dive recommended: Files 06–08 (ISAs) if you do cross-architecture work. Files 13–14 for performance tuning on real hardware. File 15 for security engineers.

Hands-on: Use perf stat -e cache-misses,branch-misses,dTLB-load-misses,iTLB-load-misses on a workload (event names vary by CPU; check perf list for what your machine supports). Use lstopo to visualize NUMA topology. Read /proc/cpuinfo and understand every relevant field.
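
As a starting point for reading `/proc/cpuinfo`, the sketch below groups logical CPUs by physical core to reveal SMT siblings. It relies on the `processor`, `physical id`, and `core id` fields that x86 Linux kernels print; on architectures that omit the last two, it falls back to treating every logical CPU as its own core. `smt_siblings` and `SAMPLE` are illustrative names, and the sample text is made up.

```python
from collections import defaultdict

# Made-up sample: one socket, two cores, two hardware threads per core.
SAMPLE = (
    "processor\t: 0\nphysical id\t: 0\ncore id\t\t: 0\n\n"
    "processor\t: 1\nphysical id\t: 0\ncore id\t\t: 1\n\n"
    "processor\t: 2\nphysical id\t: 0\ncore id\t\t: 0\n\n"
    "processor\t: 3\nphysical id\t: 0\ncore id\t\t: 1\n"
)

def smt_siblings(cpuinfo_text):
    """Group logical CPU numbers by (physical id, core id)."""
    cores = defaultdict(list)
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue   # skip trailing blocks that do not describe a CPU
        cpu = int(fields["processor"])
        socket = int(fields.get("physical id", 0))
        core = int(fields.get("core id", cpu))   # fallback: one core per CPU
        cores[(socket, core)].append(cpu)
    return dict(cores)

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE      # not on Linux: fall back to the sample
    for (socket, core), cpus in sorted(smt_siblings(text).items()):
        print(f"socket {socket} core {core}: logical CPUs {cpus}")
```

Any core listed with more than one logical CPU has SMT enabled; those sibling pairs are the ones that share L1 and execution resources.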

Estimated study time: 20–25 hours for full coverage. This is a dense section; consult the ISA manuals as references rather than trying to memorize them.