Section 33: Hardware Architecture
Purpose and Scope
Hardware architecture is the foundation on which every systems software decision rests. Understanding why memory latencies, cache hierarchies, branch mispredictions, and NUMA topology exist — and their exact costs — is what separates performance engineering from guesswork. This section covers the CPU pipeline in depth: instruction fetch, decode, execution units, out-of-order scheduling, speculation, and retirement. It covers the memory system: SRAM cache hierarchy, coherency protocols (MESI/MOESI), NUMA, and memory controllers. It includes deep dives on ARM, RISC-V, and x86-64 ISAs, as well as modern packaging trends (chiplets, 3D stacking, HBM). PCIe and NVMe internals connect CPU architecture to the broader platform.
Prerequisites
- Assembly language basics (any ISA)
- Operating system memory management (Section 11)
- Basic digital logic (flip-flops, adders, multiplexers)
- C programming with awareness of memory layout
Learning Objectives
By the end of this section, you will be able to:
- Trace an instruction through each stage of a modern OoO superscalar pipeline
- Explain Tomasulo's algorithm and how the Reorder Buffer enforces in-order retirement
- Describe TAGE branch prediction and quantify the performance cost of mispredictions
- Explain MESI cache coherency and describe the protocol messages for a read-modify-write across NUMA nodes
- Calculate the effective memory access time using the cache hierarchy formula
- Compare ARM (AArch64), RISC-V, and x86-64 ISA design philosophies
- Describe chiplet packaging and explain why it has become the dominant architecture
- Explain how PCIe topology affects DMA latency and bandwidth
Architecture Overview
Modern Superscalar OoO Pipeline
Instruction Cache (L1-I)
|
+----v----+
| Fetch | Branch predictor (TAGE) predicts next PC
| Unit | ~4-6 instructions/cycle wide
+----+----+
|
+----v----+
| Decode | x86-64: variable-length -> fixed-length micro-ops (uops)
| Unit | ARM: fixed 32-bit instructions, simpler decode
+----+----+
|
+----v----+
| Rename | Register renaming via RAT (Register Alias Table)
| / RAT | Physical register file >> architectural registers
+----+----+ Eliminates WAR and WAW hazards (false dependencies)
|
+----v------+
| Dispatch | Uops written into Reorder Buffer (ROB) and
| / ROB | Reservation Stations / Scheduler
+----+------+
|
+----v---------+
| Scheduler | OoO issue: uops execute when operands ready
| (Issue Queue)| Dynamically scheduled, bypassing original order
+----+---------+
|
+----v-------------+
| Execution Units | Multiple ports: ALU, FPU, SIMD, Load, Store, Branch
+----+-------------+
|
+----v----+
| Writeback| Results written to physical register file
| / CDB | Common Data Bus broadcasts results to waiting uops
+----+----+
|
+----v----+
| Retire | In-order retirement from ROB head
| / ROB | Handles exceptions, precise interrupts
+---------+ Commits architectural state
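The rename stage above can be sketched as a toy model. This is a minimal illustration, not how any particular core implements its RAT: the `Renamer` class, the register counts, and the four-instruction program are all hypothetical, and reclaiming physical registers at retirement is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Renamer:
    """Toy register renamer: maps architectural registers to physical ones."""
    num_arch: int = 4    # assumed architectural register count (r0..r3)
    num_phys: int = 16   # assumed physical register count (p0..p15)
    rat: dict = field(default_factory=dict)
    free_list: list = field(default_factory=list)

    def __post_init__(self):
        # Initially architectural register rN lives in physical register pN.
        self.rat = {f"r{i}": f"p{i}" for i in range(self.num_arch)}
        self.free_list = [f"p{i}" for i in range(self.num_arch, self.num_phys)]

    def rename(self, dst, srcs):
        # Sources read the current mapping, so true (RAW) dependencies survive.
        phys_srcs = [self.rat[s] for s in srcs]
        # The destination always gets a fresh physical register, so a later
        # writer can never clobber a value an earlier reader still needs.
        phys_dst = self.free_list.pop(0)
        self.rat[dst] = phys_dst
        return phys_dst, phys_srcs


def rename_program(program):
    renamer = Renamer()
    return [renamer.rename(dst, srcs) for dst, srcs in program]


program = [
    ("r1", ["r2", "r3"]),  # i0: r1 = r2 + r3
    ("r2", ["r1", "r0"]),  # i1: r2 = r1 + r0  (WAR on r2 against i0)
    ("r1", ["r0", "r0"]),  # i2: r1 = r0 + r0  (WAW on r1 against i0)
    ("r3", ["r1", "r2"]),  # i3: r3 = r1 + r2  (RAW on i2 and i1)
]
renamed = rename_program(program)
for (dst, srcs), (pdst, psrcs) in zip(program, renamed):
    print(f"{dst} <- {srcs}  becomes  {pdst} <- {psrcs}")
```

After renaming, i2 writes `p6` while i0 wrote `p4`, so the two writes to `r1` no longer conflict and only the real data dependencies constrain the scheduler.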
Cache Hierarchy and Coherency
Core 0 Core 1 Core N
+-------+ +-------+ +-------+
|L1-I 32K| |L1-I 32K| |L1-I 32K|
|L1-D 48K| |L1-D 48K| |L1-D 48K|
+---+---+ +---+---+ +---+---+
| | |
+---v---+ +---v---+ +---v---+
| L2 1MB| | L2 1MB| | L2 1MB|
+---+---+ +---+---+ +---+---+
| | |
+---+----------------------+---------------------+---+
| LLC / L3 (shared) |
| 32MB - 192MB (typical) |
+---+----------------------------------------------------+
| |
+---v---+ +---v---+
| Memory| | Memory| NUMA node 0 NUMA node 1
|Channel| |Channel| separate DDR5 controllers
+-------+ +-------+
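The effective access time of a hierarchy like the one drawn above is usually computed by nesting miss penalties: AMAT = L1 hit time + L1 miss rate x (L2 hit time + L2 miss rate x (...)). The sketch below assumes local (per-level) miss rates; the latencies and miss rates in the example are illustrative placeholders, not measurements of any CPU.

```python
def amat(levels, memory_latency):
    """Average memory access time in cycles.

    levels: list of (hit_latency_cycles, local_miss_rate), ordered L1 outward.
    memory_latency: DRAM access latency in cycles.
    """
    total = memory_latency
    # Fold from the last-level cache back toward L1.
    for hit_latency, miss_rate in reversed(levels):
        total = hit_latency + miss_rate * total
    return total


# Assumed example numbers: (hit latency, local miss rate) for L1, L2, L3.
example_levels = [(4, 0.05), (14, 0.30), (50, 0.40)]
example_amat = amat(example_levels, memory_latency=250)
print(f"AMAT = {example_amat:.2f} cycles")
```

With these assumed inputs the result is about 6.95 cycles, which shows how a 5% L1 miss rate already adds roughly three cycles to an otherwise 4-cycle access.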
MESI Protocol states per cache line:
M (Modified) - dirty, sole owner, must writeback on eviction
E (Exclusive) - clean, sole owner, can promote to M silently
S (Shared) - clean, multiple sharers, read-only
I (Invalid) - not present
RFO (Read-For-Ownership): I -> M requires broadcast invalidation
to all sharers, then exclusive grant. ~100-300ns cross-socket.
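The state transitions listed above can be traced with a small simulator for a single cache line. This is a simplified sketch: the `MesiLine` class is hypothetical, it models only the resulting states (not bus messages, writebacks, or timing), and real implementations add further states such as O or F.

```python
class MesiLine:
    """Tracks the MESI state of one cache line in each core's private cache."""

    def __init__(self, num_cores):
        self.state = ["I"] * num_cores

    def read(self, core):
        if self.state[core] != "I":
            return  # read hit in M, E, or S: no state change
        holders = [c for c, s in enumerate(self.state) if c != core and s != "I"]
        for c in holders:
            # An M or E holder downgrades to S (an M holder supplies dirty data).
            self.state[c] = "S"
        # No other copy anywhere means the reader may take the line Exclusive.
        self.state[core] = "S" if holders else "E"

    def write(self, core):
        if self.state[core] in ("M", "E"):
            self.state[core] = "M"  # E -> M is silent: no other copies exist
            return
        # From S or I the writer needs ownership (RFO): invalidate other copies.
        for c in range(len(self.state)):
            if c != core:
                self.state[c] = "I"
        self.state[core] = "M"


line = MesiLine(num_cores=2)
trace = []
for op, core in [("read", 0), ("write", 0), ("read", 1), ("write", 1)]:
    getattr(line, op)(core)
    trace.append("".join(line.state))
    print(f"core {core} {op:5s} -> states {line.state}")
```

The last step is the expensive one in practice: core 1 holds the line in S, so its write must invalidate core 0's copy before it can proceed, which is the cross-socket RFO cost quoted above.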
Branch Prediction: TAGE
TAGE (TAgged GEometric history length predictor):
T0: Bimodal base predictor (indexed by PC)
|
T1: Tagged table, history length h1 (e.g., 5 bits)
|
T2: Tagged table, history length h2 (e.g., 11 bits)
|
T3: Tagged table, history length h3 (e.g., 22 bits)
|
T4: Tagged table, history length h4 (e.g., 44 bits)
Longest matching tagged component wins.
Tags prevent aliasing between unrelated branches.
Misprediction cost:
Modern CPUs: 15-20 pipeline stages flushed on mispredict
Cost: ~15-20 cycles per mispredict, i.e. up to (15-20 x IPC) instructions of lost work; significant for tight loops
Branch predictor accuracy: >99% for typical workloads
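A common first-order way to quantify the misprediction cost above is to add the expected flush penalty per instruction to the baseline CPI. The sketch below uses that textbook model; the branch fraction, baseline CPI, and 18-cycle penalty are assumed example values.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Baseline CPI plus the average flush penalty paid per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: IPC of 4 when never mispredicting, 20% branches,
# 18-cycle flush penalty.
base_cpi = 0.25
for rate in (0.01, 0.05):
    cpi = effective_cpi(base_cpi, branch_fraction=0.20,
                        mispredict_rate=rate, penalty_cycles=18)
    print(f"mispredict rate {rate:.0%}: CPI {cpi:.3f}, IPC {1 / cpi:.2f}")
```

Under these assumptions a 1% misprediction rate drops IPC from 4 to about 3.5, and a 5% rate drops it to about 2.3, which is why even small accuracy losses matter on wide cores.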
x86-64 vs ARM vs RISC-V ISA Comparison
x86-64 (CISC) AArch64 (RISC) RISC-V (RISC)
+--------------+ +--------------+ +--------------+
| Variable-len | | Fixed 32-bit | | Fixed 32-bit |
| 1-15 bytes | | instructions | | (+ 16-bit C) |
| Complex decode| | Simple decode| | Simple decode|
| Implicit mem | | Load/store | | Load/store |
| operands | | only | | only |
| 16 GPRs | | 31 GPRs | | 32 GPRs |
| Complex ABI | | Clean ABI | | Clean ABI |
| Huge legacy | | Growing eco | | Open ISA |
| Dominant | | Apple, AWS | | Embedded+ |
| server/desktop| | Graviton3 | | RISC-V chips |
+--------------+ +--------------+ +--------------+
Chiplet Architecture
Traditional monolithic die:
+------------------------------------------+
| CPU Cores | L3 Cache | Memory Controller |
| I/O | PCIe | USB | SATA |
+------------------------------------------+
Problem: yield loss, diverse process nodes, large reticle
Chiplet disaggregation (AMD EPYC / Intel Meteor Lake):
+--------+ +--------+ +--------+ +--------+
| CCD 0 | | CCD 1 | | CCD 2 | | CCD 3 |
| 8 cores| | 8 cores| | 8 cores| | 8 cores|
| 5nm | | 5nm | | 5nm | | 5nm |
+---+----+ +---+----+ +---+----+ +---+----+
| | | |
+---+------------+------------+------------+---+
| cIOD |
| I/O Die: PCIe, DDR, Infinity Fabric |
| (older/cheaper process node, e.g. 6nm) |
+-----------------------------------------------+
Benefits: mix process nodes, better yield, modular design
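The yield benefit can be made concrete with the simple Poisson yield model, Y = exp(-A x D0), where A is die area and D0 is defect density. This is a rough sketch under stated assumptions: the die areas and defect density are made-up example values, and the model ignores the I/O die, packaging cost, and partially good dies sold as lower-core-count parts.

```python
import math


def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of dies with zero defects under the Poisson yield model."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


def silicon_per_good_product(die_area_mm2, dies_per_product, defects_per_cm2):
    """Average wafer area (mm^2) consumed per product built from good dies."""
    return dies_per_product * die_area_mm2 / poisson_yield(die_area_mm2, defects_per_cm2)


D0 = 0.1  # assumed defect density, defects per cm^2
monolithic = silicon_per_good_product(600, 1, D0)  # one hypothetical 600 mm^2 die
chiplets = silicon_per_good_product(75, 8, D0)     # eight hypothetical 75 mm^2 dies

print(f"monolithic die yield: {poisson_yield(600, D0):.1%}")
print(f"chiplet die yield:    {poisson_yield(75, D0):.1%}")
print(f"silicon per good product: {monolithic:.0f} vs {chiplets:.0f} mm^2")
```

With these inputs the large die yields about 55% and each small die about 93%, so the same 600 mm^2 of logic costs far less wafer area when split into chiplets.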
Key Concepts
- Tomasulo's Algorithm: OoO execution via reservation stations and a common data bus. Instructions issue to a reservation station (RS) as soon as one is free, wait there until their operands are available, then execute; results are broadcast on the bus to all waiting stations. Eliminates WAR/WAW hazards through register renaming.
- Reorder Buffer (ROB): FIFO structure holding in-flight uops. Ensures in-order retirement and precise exceptions even when uops execute out-of-order.
- Register Renaming: Mapping architectural registers to a large physical register file. Eliminates false (WAR/WAW) dependencies. Intel Golden Cove: roughly 280 physical integer registers (behind a 512-entry ROB) for 16 architectural.
- Speculation: Executing instructions past unresolved branches (control speculation) or loads past unresolved stores (memory disambiguation). Mis-speculation causes pipeline flushes, and the microarchitectural side effects it leaves behind raise security concerns (Spectre/Meltdown).
- NUMA (Non-Uniform Memory Access): Multi-socket systems where each socket has local DRAM attached. Remote memory access traverses the inter-socket interconnect (UPI/Infinity Fabric), typically adding ~70-100ns on top of the ~70ns local access latency.
- PCIe Topology: Hierarchical tree of root complexes, switches, and endpoints. Each PCIe link is a point-to-point connection made of one or more serial lanes (x1 to x16), each lane carrying one transmit and one receive differential pair. DMA from NVMe or GPU traverses this tree; NUMA-awareness affects which CPU socket handles the interrupt and memory.
- NVMe Internals: Block device protocol over PCIe. Supports up to 65,535 I/O queues with up to 65,536 entries each (vs 1 queue of 32 commands for AHCI). Leverages PCIe DMA for zero-copy transfers. Latency ~20-100 microseconds on flash SSDs (vs several milliseconds for an HDD).
- Chiplets: Disaggregated silicon dies interconnected via high-density die-to-die interconnect (EMIB, UCIe, Infinity Fabric). AMD EPYC, Intel Sapphire Rapids, and Apple M2 Ultra all use chiplet architectures.
- Simultaneous Multithreading (SMT): Sharing execution units between two logical threads (HyperThreading). Improves throughput when a thread stalls on memory, but can increase contention and L1/L2 cache pressure.
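The Reorder Buffer behaviour described in the list above (out-of-order completion, in-order retirement) can be illustrated with a small simulation. It is a deliberately simplified sketch: all uops are assumed independent and dispatched at cycle 0, execution units and retire width are unlimited, and the latencies are hypothetical.

```python
from collections import deque


def simulate_rob(latencies):
    """latencies[i] is the execution latency in cycles (>= 1) of uop i.

    Returns (completion_order, retirement_order, retire_cycle), where
    retire_cycle[i] is the cycle in which uop i leaves the ROB.
    """
    n = len(latencies)
    # Uops finish as soon as their latency elapses, regardless of program order.
    completion_order = sorted(range(n), key=lambda i: (latencies[i], i))

    rob = deque(range(n))  # FIFO in program order; head is the oldest uop
    retire_cycle = [None] * n
    retirement_order = []
    cycle = 0
    while rob:
        cycle += 1
        # Only the head may retire, and only once it has finished executing.
        while rob and latencies[rob[0]] <= cycle:
            uop = rob.popleft()
            retire_cycle[uop] = cycle
            retirement_order.append(uop)
    return completion_order, retirement_order, retire_cycle


# uop 1 models a slow load; uops 2 and 3 finish early but must wait behind it.
completed, retired, cycles = simulate_rob([1, 5, 1, 2])
print("completion order:", completed)
print("retirement order:", retired)
print("retire cycles:   ", cycles)
```

Uops 2 and 3 finish by cycle 2 but cannot retire until cycle 5, when the slow uop at the ROB head completes. That head-of-line blocking is what keeps exceptions precise, and it is also why a long-latency miss can fill the ROB and stall dispatch.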
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1945 | von Neumann architecture paper — stored-program computer |
| 1964 | IBM System/360 — first ISA designed for multiple implementations |
| 1967 | Tomasulo's algorithm (IBM 360/91) — OoO execution |
| 1971 | Intel 4004 — first microprocessor |
| 1978 | Intel 8086 — x86 ISA origin; 16-bit |
| 1985 | MIPS R2000 — RISC architecture demonstrated in silicon |
| 1985 | Intel 80386 — x86 goes 32-bit (IA-32) |
| 1993 | Intel Pentium — superscalar x86 |
| 1995 | Intel P6 (Pentium Pro) — OoO execution in x86, ROB, uop decode |
| 1999 | AMD Athlon (K7) — microarchitectural precursor to the K8 (x86-64) core; Intel Pentium III adds 128-bit SSE |
| 2003 | AMD Opteron — x86-64 (AMD64), modern 64-bit x86 |
| 2003 | AMD Opteron — NUMA with HyperTransport |
| 2005 | Intel dual-core — multicore era begins |
| 2010 | ARM Cortex-A9 — out-of-order ARM; mobile superscalar |
| 2011 | AMD Bulldozer — clustered integer units, contested "core" definition |
| 2017 | AMD Zen — clean-slate x86 microarchitecture with SMT and a micro-op cache |
| 2018 | Spectre/Meltdown — speculation attacks reshape CPU security model |
| 2019 | RISC-V base ISA and privileged specifications ratified as an open standard |
| 2019 | AMD EPYC Rome — chiplet architecture, 64 cores from 8 CCDs |
| 2020 | Apple M1 — unified memory, out-of-order ARM, industry disruption |
| 2021 | Intel Alder Lake — hybrid P-core + E-core design (big.LITTLE-style) in x86 |
| 2022 | AMD EPYC Genoa — 96 cores, 12 CCDs, DDR5, PCIe 5.0 |
Modern Relevance
CPU architecture directly determines software performance characteristics. Cache-line-aware data structure design (Section 25), NUMA-aware memory allocation, and lock-free algorithm design all require internalized knowledge of the hardware. The Spectre/Meltdown fallout permanently altered OS kernel design (KPTI, retpoline), demonstrating that CPU architecture affects security. The shift to chiplet designs is changing how hardware is specified and procured. ARM's rise in servers (AWS Graviton, Ampere Altra) makes ISA portability non-optional. RISC-V is gaining traction in embedded and specialty accelerators, making ISA literacy across all three dominant families a core competency.
File Map
33-hardware-architecture/
├── 00-overview.md <- This file
├── 01-von-neumann-vs-harvard.md
├── 02-cpu-pipeline-stages.md
├── 03-superscalar-and-ooo.md
├── 04-speculation-and-prediction.md
├── 05-branch-prediction-tage.md
├── 06-register-renaming-tomasulo.md
├── 07-cache-hierarchy-design.md
├── 08-cache-coherency-mesi.md
├── 09-numa-topology.md
├── 10-pcie-topology.md
├── 11-nvme-internals.md
├── 12-arm-architecture.md
├── 13-risc-v-isa.md
├── 14-x86-64-internals.md
├── 15-memory-controllers.md
└── 16-chiplets-and-packaging.md
Cross-References
- Section 06 (CPU Architecture, intro): Foundational register/memory model, calling conventions
- Section 10 (Synchronization): Cache coherency and memory ordering are why atomics have acquire/release semantics
- Section 11 (Memory Management): TLB, page tables, NUMA-aware allocation (mmap, mbind)
- Section 25 (Performance Engineering): CPU performance counters, cache-miss profiling, perf stat
- Section 31 (GPU Systems): PCIe topology for GPUs; memory hierarchy comparison
- Section 34 (Embedded Systems): ARM Cortex-M (in-order, often without caches), MCU memory maps