Section 33: Hardware Architecture

Purpose and Scope

Hardware architecture is the foundation on which every systems software decision rests. Knowing why cache hierarchies, branch predictors, and NUMA topologies exist, and what a cache miss, a misprediction, or a remote memory access actually costs, is what separates performance engineering from guesswork. This section covers the CPU pipeline in depth: instruction fetch, decode, execution units, out-of-order scheduling, speculation, and retirement. It covers the memory system: SRAM cache hierarchy, coherency protocols (MESI/MOESI), NUMA, and memory controllers. It includes deep dives on ARM, RISC-V, and x86-64 ISAs, as well as modern packaging trends (chiplets, 3D stacking, HBM). PCIe and NVMe internals connect CPU architecture to the broader platform.

Prerequisites

Assembly language basics (any ISA)
Operating system memory management (Section 11)
Basic digital logic (flip-flops, adders, multiplexers)
C programming with awareness of memory layout

Learning Objectives

By the end of this section, you will be able to:

Trace an instruction through each stage of a modern OoO superscalar pipeline
Explain Tomasulo's algorithm and how the Reorder Buffer enforces in-order retirement
Describe TAGE branch prediction and quantify the performance cost of mispredictions
Explain MESI cache coherency and describe the protocol messages for a read-modify-write across NUMA nodes
Calculate the effective memory access time using the cache hierarchy formula
Compare ARM (AArch64), RISC-V, and x86-64 ISA design philosophies
Describe chiplet packaging and explain why it has become the dominant approach for high-core-count processors
Explain how PCIe topology affects DMA latency and bandwidth

Architecture Overview

Modern Superscalar OoO Pipeline

  Instruction Cache (L1-I)
       |
  +----v----+
  |  Fetch  |   Branch predictor (TAGE) predicts next PC
  |  Unit   |   ~4-6 instructions/cycle wide
  +----+----+
       |
  +----v----+
  | Decode  |   x86-64: variable-length -> fixed-length micro-ops (uops)
  |  Unit   |   ARM: fixed 32-bit instructions, simpler decode
  +----+----+
       |
  +----v----+
  | Rename  |   Register renaming via RAT (Register Alias Table)
  |  / RAT  |   Physical register file >> architectural registers
  +----+----+   Eliminates WAR and WAW hazards (false dependencies)
       |
  +----v------+
  | Dispatch  |   Uops written into Reorder Buffer (ROB) and
  |  / ROB    |   Reservation Stations / Scheduler
  +----+------+
       |
  +----v---------+
  |  Scheduler   |   OoO issue: uops execute when operands ready
  | (Issue Queue)|   Dynamically scheduled, bypassing original order
  +----+---------+
       |
  +----v-------------+
  | Execution Units  |   Multiple ports: ALU, FPU, SIMD, Load, Store, Branch
  +----+-------------+
       |
  +----v----+
  |Writeback|   Results written to physical register file
  |  / CDB  |   Common Data Bus broadcasts results to waiting uops
  +----+----+
       |
  +----v----+
  | Retire  |   In-order retirement from ROB head
  |  / ROB  |   Handles exceptions, precise interrupts
  +---------+   Commits architectural state
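
The split between out-of-order completion and in-order retirement at the bottom of the diagram can be sketched in a few lines. This is a toy model, not a description of any real core: the uop names, latencies, and retire width are made up, and dependencies between uops are deliberately ignored so that only the ROB behavior is visible.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Uop:
    name: str
    latency: int          # execution latency in cycles (hypothetical values)
    done_cycle: int = -1  # cycle at which the result is available

def simulate(uops, retire_width=2):
    """Return (completion order, retirement order) as lists of uop names."""
    rob = deque(uops)  # program order; the head is the oldest uop
    # Assumption: every uop starts executing at cycle 0 with operands ready,
    # so completion time depends only on latency.
    for u in rob:
        u.done_cycle = u.latency
    completed = [u.name for u in sorted(uops, key=lambda u: u.done_cycle)]

    retired, cycle = [], 0
    while rob:
        cycle += 1
        for _ in range(retire_width):
            # Only the ROB head may retire, and only once it has finished.
            if rob and rob[0].done_cycle <= cycle:
                retired.append(rob.popleft().name)
            else:
                break
    return completed, retired

program = [Uop("load", 5), Uop("add", 1), Uop("mul", 3), Uop("sub", 1)]
completed, retired = simulate(program)
print("completion order:", completed)
print("retirement order:", retired)
```

The short uops finish first, but nothing retires until the slow load at the ROB head completes, which is what keeps architectural state precise if that load faults.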

Cache Hierarchy and Coherency

  Core 0                 Core 1                Core N
  +-------+              +-------+             +-------+
  |L1-I 32K|             |L1-I 32K|            |L1-I 32K|
  |L1-D 48K|             |L1-D 48K|            |L1-D 48K|
  +---+---+              +---+---+             +---+---+
      |                      |                     |
  +---v---+              +---v---+             +---v---+
  | L2 1MB|              | L2 1MB|             | L2 1MB|
  +---+---+              +---+---+             +---+---+
      |                      |                     |
  +---+----------------------+---------------------+---+
  |                 LLC / L3 (shared)                  |
  |               32MB - 192MB (typical)               |
  +---+----------------------+-------------------------+
      |                      |
  +---v---+              +---v---+
  | Memory|              | Memory|     NUMA node 0  NUMA node 1
  |Channel|              |Channel|     separate DDR5 controllers
  +-------+              +-------+
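
The effective (average) memory access time for a hierarchy like the one above is usually computed recursively: each level costs its hit latency plus its miss rate times the cost of the next level. The sketch below applies that formula. The latencies and local miss rates are illustrative assumptions, not measurements of any particular CPU.

```python
def amat(levels, memory_latency):
    """Average memory access time in cycles.

    levels: list of (hit_latency_cycles, local_miss_rate), ordered L1 outward.
    AMAT = hit_L1 + miss_L1 * (hit_L2 + miss_L2 * (hit_L3 + miss_L3 * memory))
    """
    total = memory_latency
    for hit_latency, miss_rate in reversed(levels):
        total = hit_latency + miss_rate * total
    return total

# Assumed numbers for illustration only.
levels = [
    (4, 0.05),   # L1-D: 4-cycle hit, 5% of accesses miss
    (14, 0.30),  # L2: 14-cycle hit, 30% of L1 misses also miss here
    (50, 0.40),  # L3: 50-cycle hit, 40% of L2 misses go to DRAM
]
result = amat(levels, memory_latency=300)
print(f"AMAT: {result:.2f} cycles")
```

With these assumed inputs the result is about 7.25 cycles, dominated by the L1 hit time. Raising the L1 miss rate from 5% to 10% adds roughly 3 cycles to every access on average.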

  MESI Protocol states per cache line:
  M (Modified) - dirty, sole owner, must writeback on eviction
  E (Exclusive) - clean, sole owner, can promote to M silently
  S (Shared)   - clean, multiple sharers, read-only
  I (Invalid)  - not present

  RFO (Read-For-Ownership): I -> M requires broadcast invalidation
  to all sharers, then exclusive grant. ~100-300ns cross-socket.
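
The state transitions listed above can be traced with a small model of one cache line shared by several private caches. This is a simplified snooping sketch under stated assumptions: the bus message names (`BusRd`, `BusUpgr`, `RFO`) are illustrative labels, and directory-based filtering, timing, and cross-socket effects are not modeled.

```python
from enum import Enum

class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"

class CacheLine:
    """One cache line tracked across several private caches (toy snooping model)."""

    def __init__(self, n_caches):
        self.states = [State.I] * n_caches
        self.log = []  # bus messages; the names are illustrative

    def _snoop(self, requester, new_state):
        """Move every other valid copy to new_state, writing back dirty data."""
        found = False
        for i, s in enumerate(self.states):
            if i != requester and s is not State.I:
                found = True
                if s is State.M:
                    self.log.append(f"cache {i}: writeback")
                self.states[i] = new_state
        return found

    def read(self, c):
        if self.states[c] is not State.I:
            return  # read hit in M, E, or S: no bus traffic
        self.log.append(f"cache {c}: BusRd")
        shared = self._snoop(c, State.S)
        self.states[c] = State.S if shared else State.E

    def write(self, c):
        s = self.states[c]
        if s is State.E:
            self.states[c] = State.M  # silent E -> M upgrade
        elif s is not State.M:
            # S needs an upgrade; I needs a full read-for-ownership.
            self.log.append(f"cache {c}: " + ("BusUpgr" if s is State.S else "RFO"))
            self._snoop(c, State.I)
            self.states[c] = State.M

line = CacheLine(2)
line.read(0)    # cache 0: I -> E (no other copy exists)
line.write(0)   # cache 0: E -> M, silent
line.read(1)    # cache 0 writes back; both caches end in S
line.write(1)   # cache 1: S -> M, cache 0 invalidated
line.write(0)   # cache 0: I -> M via RFO, cache 1 writes back
print([s.name for s in line.states])
for msg in line.log:
    print(msg)
```

The last two writes are the ping-pong pattern behind false sharing: each write by one core forces an invalidation and, if the line was dirty, a writeback in the other.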

Branch Prediction: TAGE

  TAGE (TAgged GEometric history length predictor):

  T0: Bimodal base predictor (indexed by PC)
       |
  T1: Tagged table, history length h1 (e.g., 5 bits)
       |
  T2: Tagged table, history length h2 (e.g., 11 bits)
       |
  T3: Tagged table, history length h3 (e.g., 22 bits)
       |
  T4: Tagged table, history length h4 (e.g., 44 bits)

  Longest matching tagged component wins.
  Tags prevent aliasing between unrelated branches.
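
The longest-match rule can be illustrated with a heavily simplified predictor. The sketch below is a toy under stated assumptions, not a faithful TAGE implementation: it omits useful counters, the alternate prediction, and the real allocation policy. The class name `TinyTage`, the hash functions, and the table sizes are all hypothetical.

```python
class TinyTage:
    """Toy TAGE-style predictor: bimodal base plus tagged tables."""

    def __init__(self, history_lengths=(5, 11, 22, 44), index_bits=10):
        self.hist_lens = history_lengths
        self.index_bits = index_bits
        self.base = {}                               # pc -> 2-bit counter (0..3)
        self.tables = [{} for _ in history_lengths]  # index -> [tag, signed counter]
        self.ghist = 0                               # global history, newest outcome in bit 0

    @staticmethod
    def _fold(value, bits):
        """XOR-fold an arbitrarily long history into `bits` bits."""
        out = 0
        while value:
            out ^= value & ((1 << bits) - 1)
            value >>= bits
        return out

    def _slot(self, pc, t):
        hist = self.ghist & ((1 << self.hist_lens[t]) - 1)
        index = (pc ^ self._fold(hist, self.index_bits)) & ((1 << self.index_bits) - 1)
        tag = ((pc >> self.index_bits) ^ self._fold(hist, 8)) & 0xFF
        return index, tag

    def _provider(self, pc):
        """Longest-history table whose entry tag matches, or (None, None)."""
        for t in reversed(range(len(self.tables))):
            index, tag = self._slot(pc, t)
            entry = self.tables[t].get(index)
            if entry is not None and entry[0] == tag:
                return t, entry
        return None, None

    def predict(self, pc):
        _, entry = self._provider(pc)
        if entry is not None:
            return entry[1] >= 0              # tagged hit: sign of the counter
        return self.base.get(pc, 2) >= 2      # fall back to the bimodal base

    def update(self, pc, taken):
        t, entry = self._provider(pc)
        predicted = self.predict(pc)
        step = 1 if taken else -1
        if entry is not None:
            entry[1] = max(-4, min(3, entry[1] + step))
        else:
            self.base[pc] = max(0, min(3, self.base.get(pc, 2) + step))
        # On a mispredict, allocate an entry in the next longer-history table.
        if predicted != taken and (t is None or t + 1 < len(self.tables)):
            nxt = 0 if t is None else t + 1
            index, tag = self._slot(pc, nxt)
            self.tables[nxt][index] = [tag, 0 if taken else -1]
        self.ghist = ((self.ghist << 1) | int(taken)) & ((1 << 64) - 1)

def loop_branch_accuracy(iterations=2000, warmup=1000, pc=0x400123):
    """Accuracy on a branch that is taken three times, then not taken, repeating."""
    bp, correct = TinyTage(), 0
    for i in range(iterations):
        taken = (i % 4) != 3
        if i >= warmup:
            correct += bp.predict(pc) == taken
        bp.update(pc, taken)
    return correct / (iterations - warmup)

print(f"accuracy after warm-up: {loop_branch_accuracy():.1%}")
```

A bimodal counter alone would keep predicting taken and miss every fourth iteration of this pattern. Once the shortest tagged table has an entry for the history that precedes the loop exit, this model should predict essentially every outcome.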

  Misprediction cost:
  Modern CPUs: 15-20 pipeline stages flushed on mispredict
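  Cost: ~15-20 cycles of lost work per mispredict, i.e. roughly
  15-20 * issue width wasted instruction slots; significant in tight loops
  Branch predictor accuracy: often >95%, and >99% on regular, loop-heavy code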
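
To quantify the cost, a common first-order model adds the misprediction stalls to the base CPI: effective CPI = base CPI + (branch fraction x mispredict rate x penalty). The sketch below evaluates it. The base CPI and branch fraction are assumed values chosen for illustration, and the penalty is taken from the upper end of the range above.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average cycles per instruction once misprediction stalls are included."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles

BASE_CPI = 0.25         # assumed: 4-wide core sustaining its peak IPC of 4
BRANCH_FRACTION = 0.20  # assumed: one instruction in five is a branch
PENALTY = 20            # cycles per mispredict

for accuracy in (0.99, 0.95, 0.90):
    cpi = effective_cpi(BASE_CPI, BRANCH_FRACTION, 1 - accuracy, PENALTY)
    print(f"accuracy {accuracy:.0%}: CPI {cpi:.2f}, IPC {1 / cpi:.2f}")
```

Under these assumptions, dropping from 99% to 95% accuracy takes the core from roughly 3.4 to roughly 2.2 instructions per cycle, which is why a few hard-to-predict branches in a hot loop matter so much on a wide machine.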

x86-64 vs ARM vs RISC-V ISA Comparison

  x86-64 (CISC)          AArch64 (RISC)         RISC-V (RISC)
  +--------------+        +--------------+        +--------------+
  | Variable-len |        | Fixed 32-bit |        | Fixed 32-bit |
  | 1-15 bytes   |        | instructions |        | (+ 16-bit C) |
  |Complex decode|        | Simple decode|        | Simple decode|
  | Reg-mem ALU  |        | Load/store   |        | Load/store   |
  | operands     |        | only         |        | only         |
  | 16 GPRs      |        | 31 GPRs      |        | 32 (x0 = 0)  |
  | Complex ABI  |        | Clean ABI    |        | Clean ABI    |
  | Huge legacy  |        | Growing eco  |        | Open ISA     |
  | Dominant     |        | Apple, AWS   |        | Embedded+    |
  |server/desktop|        | Graviton3    |        | RISC-V chips |
  +--------------+        +--------------+        +--------------+

Chiplet Architecture

  Traditional monolithic die:
  +------------------------------------------+
  |  CPU Cores | L3 Cache | Memory Controller |
  |  I/O       | PCIe     | USB  | SATA       |
  +------------------------------------------+
  Problem: yield loss, diverse process nodes, large reticle

  Chiplet disaggregation (AMD EPYC / Intel Meteor Lake):

  +--------+  +--------+  +--------+  +--------+
  | CCD 0  |  | CCD 1  |  | CCD 2  |  | CCD 3  |
  | 8 cores|  | 8 cores|  | 8 cores|  | 8 cores|
  | 3nm    |  | 3nm    |  | 3nm    |  | 3nm    |
  +---+----+  +---+----+  +---+----+  +---+----+
      |            |            |            |
  +---+------------+------------+------------+---+
  |                   cIOD                        |
  |     I/O Die: PCIe, DDR, Infinity Fabric       |
  |     (older/cheaper process node, e.g. 6nm)    |
  +-----------------------------------------------+

  Benefits: mix process nodes, better yield, modular design
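
The yield benefit can be estimated with the simple Poisson defect model, where the fraction of defect-free dies is exp(-defect density x die area). The sketch below compares one large die with eight small ones of the same total area. The defect density and die sizes are assumptions for illustration, and the model ignores the I/O die, packaging cost, and assembly yield.

```python
import math

def poisson_yield(area_mm2, defects_per_cm2):
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-defects_per_cm2 * area_mm2 / 100.0)

def silicon_per_good_product(die_area_mm2, dies_per_product, defects_per_cm2):
    """Wafer area (mm^2) consumed per working product.

    Assumes each die is tested before assembly, so only good dies are packaged.
    """
    good_fraction = poisson_yield(die_area_mm2, defects_per_cm2)
    return dies_per_product * die_area_mm2 / good_fraction

D0 = 0.1  # defects per cm^2 (assumed)
mono = silicon_per_good_product(600, 1, D0)     # one 600 mm^2 monolithic die
chiplet = silicon_per_good_product(75, 8, D0)   # eight 75 mm^2 compute dies

print(f"monolithic yield: {poisson_yield(600, D0):.1%}, silicon per good part: {mono:.0f} mm^2")
print(f"chiplet yield:    {poisson_yield(75, D0):.1%}, silicon per good part: {chiplet:.0f} mm^2")
```

With these inputs the large die yields a little over half the time while each small die yields above 90%, so the chiplet design spends far less wafer area per working product. Known-good-die testing is what makes this work, since one bad chiplet no longer scraps the whole package.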

Key Concepts

Tomasulo's Algorithm: OoO execution via reservation stations (RS) and a common data bus (CDB). Instructions issue to a free RS in program order and begin executing once their operands are available; results are broadcast on the CDB to all waiting stations. Register renaming through RS tags eliminates WAR/WAW hazards.
Reorder Buffer (ROB): FIFO structure holding in-flight uops. Ensures in-order retirement and precise exceptions even when uops execute out-of-order.
Register Renaming: Mapping architectural registers to a large physical register file. Eliminates false (WAR/WAW) dependencies. Intel Golden Cove, for example, is reported to back its 16 architectural integer registers with roughly 280 physical ones, alongside a 512-entry ROB.
Speculation: Executing instructions past unresolved branches (control speculation) or loads past unresolved stores (memory disambiguation). Mis-speculation causes pipeline flushes, and its microarchitectural side effects raise security concerns (Spectre/Meltdown).
NUMA (Non-Uniform Memory Access): Multi-socket systems where each socket has local DRAM attached. Remote memory access traverses the inter-socket interconnect (UPI/Infinity Fabric), adding roughly 70-100ns on top of the ~70ns local access latency.
PCIe Topology: Hierarchical tree of root complexes, switches, and endpoints. Each PCIe link is a point-to-point connection made of one or more serial lanes (commonly x1 to x16), each lane consisting of one transmit and one receive differential pair. DMA from NVMe or GPU traverses this tree; NUMA-awareness affects which CPU socket handles the interrupt and memory.
NVMe Internals: Block storage protocol over PCIe. Supports up to 65,535 I/O queues, each up to 65,536 entries deep (vs a single 32-command queue for AHCI). Leverages PCIe DMA for zero-copy transfers. Latency ~20-100 microseconds (vs several milliseconds for an HDD).
Chiplets: Disaggregated silicon dies interconnected via high-density die-to-die interconnect (EMIB, UCIe, Infinity Fabric). AMD EPYC, Intel Sapphire Rapids, and Apple M2 Ultra all use chiplet architectures.
Simultaneous Multithreading (SMT): Sharing execution units between two logical threads (HyperThreading). Improves throughput when a thread stalls on memory, but can increase contention and L1/L2 cache pressure.
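
The register renaming entry above can be made concrete with a minimal Register Alias Table sketch. It assumes an unbounded supply of physical registers (real hardware recycles them at retirement) and a three-operand instruction format. The register names and the `rename` helper are hypothetical.

```python
from itertools import count

def rename(program):
    """Rename (dest, src1, src2) architectural registers to physical ones."""
    rat = {}         # Register Alias Table: architectural -> physical
    fresh = count()  # unbounded free list (assumption; hardware recycles registers)

    def phys_of(reg):
        if reg not in rat:
            rat[reg] = f"p{next(fresh)}"
        return rat[reg]

    renamed = []
    for dest, src1, src2 in program:
        s1, s2 = phys_of(src1), phys_of(src2)  # sources read the current mapping
        rat[dest] = f"p{next(fresh)}"          # every write gets a new physical register
        renamed.append((rat[dest], s1, s2))
    return renamed

program = [
    ("r1", "r2", "r3"),  # r1 = r2 op r3
    ("r4", "r1", "r5"),  # RAW on r1: a true dependency, must be preserved
    ("r2", "r6", "r7"),  # WAR on r2 against the first instruction
    ("r1", "r8", "r9"),  # WAW on r1 against the first instruction
]
renamed = rename(program)
for arch, phys in zip(program, renamed):
    print(arch, "->", phys)
```

After renaming, the second instruction still reads the physical register written by the first, so the true dependency survives. The third and fourth instructions write fresh physical registers, so they no longer conflict with the first and are free to execute in any order.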

Major Historical Milestones

Year	Milestone
1945	von Neumann's EDVAC report: stored-program computer
1964	IBM System/360: first ISA designed for multiple implementations
1967	Tomasulo's algorithm (IBM 360/91): OoO execution
1971	Intel 4004: first microprocessor
1978	Intel 8086: x86 ISA origin; 16-bit
1985	MIPS R2000: RISC architecture demonstrated in silicon
1985	Intel 80386: x86 goes 32-bit (IA-32)
1993	Intel Pentium: superscalar x86
1995	Intel P6 (Pentium Pro): OoO execution in x86, ROB, uop decode
1999	AMD Athlon (K7): x86-64 precursor; 128-bit SSE arrives the same year with Intel Pentium III
2003	AMD Opteron: x86-64 (AMD64), modern 64-bit x86; NUMA with HyperTransport
2005	Dual-core x86 (Intel Pentium D, AMD Athlon 64 X2): multicore era begins
2010	ARM Cortex-A9: out-of-order ARM; mobile superscalar
2011	AMD Bulldozer: clustered integer units, contested "core" definition
2017	AMD Zen: clean-slate x86 core design with SMT
2018	Spectre/Meltdown: speculation attacks reshape CPU security model
2019	RISC-V base ISA ratified as an open standard
2019	AMD EPYC Rome: chiplet architecture, 64 cores from 8 CCDs
2020	Apple M1: unified memory, out-of-order ARM, industry disruption
2021	Intel Alder Lake: hybrid P-core + E-core design (big.LITTLE-style) in x86
2022	AMD EPYC Genoa: 96 cores, 12 CCDs, DDR5, PCIe 5.0

Modern Relevance

CPU architecture directly determines software performance characteristics. Cache-line-aware data structure design (Section 25), NUMA-aware memory allocation, and lock-free algorithm design all require internalized knowledge of the hardware. The Spectre/Meltdown fallout permanently altered OS kernel design (KPTI, retpoline), demonstrating that CPU architecture affects security. The shift to chiplet designs is changing how hardware is specified and procured. ARM's rise in servers (AWS Graviton, Ampere Altra) makes ISA portability non-optional. RISC-V is gaining traction in embedded and specialty accelerators, making ISA literacy across all three dominant families a core competency.

File Map

33-hardware-architecture/
├── 00-overview.md              <- This file
├── 01-von-neumann-vs-harvard.md
├── 02-cpu-pipeline-stages.md
├── 03-superscalar-and-ooo.md
├── 04-speculation-and-prediction.md
├── 05-branch-prediction-tage.md
├── 06-register-renaming-tomasulo.md
├── 07-cache-hierarchy-design.md
├── 08-cache-coherency-mesi.md
├── 09-numa-topology.md
├── 10-pcie-topology.md
├── 11-nvme-internals.md
├── 12-arm-architecture.md
├── 13-risc-v-isa.md
├── 14-x86-64-internals.md
├── 15-memory-controllers.md
└── 16-chiplets-and-packaging.md

Cross-References

Section 06 (CPU Architecture, intro): Foundational register/memory model, calling conventions
Section 10 (Synchronization): Cache coherency and memory ordering are why atomics have acquire/release semantics
Section 11 (Memory Management): TLB, page tables, NUMA-aware allocation (mmap, mbind)
Section 25 (Performance Engineering): CPU performance counters, cache-miss profiling, perf stat
Section 31 (GPU Systems): PCIe topology for GPUs; memory hierarchy comparison
Section 34 (Embedded Systems): ARM Cortex-M (in-order, often without caches), MCU memory maps