Section 31: GPU Systems

Purpose and Scope

This section covers the architecture and systems software of Graphics Processing Units from first principles. GPUs are no longer peripheral accelerators — they are first-class compute infrastructure for AI training, scientific simulation, graphics, and video processing. Understanding GPU systems requires grasping the hardware architecture (SIMT execution, memory hierarchy, interconnects), the driver stack (kernel and userspace), programming models (CUDA, OpenCL, Metal, Vulkan Compute), and operational concerns (multi-GPU, virtualization, scheduling). This section bridges the hardware specification and the software abstractions that sit above it.

Prerequisites

CPU architecture and cache hierarchy (Section 33)
Operating system kernel fundamentals, device drivers (Section 14)
Virtual memory and DMA concepts (Section 11)
PCIe bus basics
C/C++ programming; optional: CUDA or OpenCL exposure

Learning Objectives

By the end of this section, you will be able to:

Explain SIMT execution and why it differs from SIMD on CPUs
Describe the full GPU memory hierarchy from registers to HBM/GDDR and quantify typical bandwidths
Trace a CUDA kernel launch through the driver stack from cudaLaunchKernel() to hardware dispatch
Explain how the GPU scheduler manages warps and hides memory latency through context switching
Describe NVLink topology and compare it to PCIe for multi-GPU workloads
Explain GPU virtualization approaches (time-slicing, MIG, vGPU/SR-IOV) and their tradeoffs
Compare CUDA, ROCm, Metal, and Vulkan Compute programming models
Diagnose GPU performance bottlenecks (compute-bound vs memory-bound vs latency-bound)

Architecture Overview

GPU Compute Hierarchy

  GPU Die
  +---------------------------------------------------------------+
  |                                                               |
  |  GPC (Graphics Processing Cluster)  x N                      |
  |  +----------------------------------------------------------+ |
  |  |                                                          | |
  |  |  TPC (Texture Processing Cluster) x M                   | |
  |  |  +----------------------------------------------------+  | |
  |  |  |                                                    |  | |
  |  |  |  SM (Streaming Multiprocessor)                     |  | |
  |  |  |  +----------------------------------------------+  |  | |
  |  |  |  |  Warp Schedulers (x4)                        |  |  | |
  |  |  |  |  INT32 Units (x32) | FP32 Units (x64)        |  |  | |
  |  |  |  |  Tensor Cores (x4, per SM on Ampere+)         |  |  | |
  |  |  |  |  Register File (256KB per SM)                 |  |  | |
  |  |  |  |  L1 Cache / Shared Memory (128KB configurable)|  |  | |
  |  |  |  +----------------------------------------------+  |  | |
  |  |  +----------------------------------------------------+  | |
  |  +----------------------------------------------------------+ |
  |                                                               |
  |  L2 Cache (6MB - 50MB depending on GPU)                      |
  |  Memory Controllers                                           |
  |  HBM2e / GDDR6X stacks (A100: 80GB HBM2e, 2TB/s BW)        |
  |  NVLink fabric interface (A100: 600 GB/s bidirectional)      |
  |  PCIe interface (Gen4 x16: 32 GB/s bidirectional)            |
  +---------------------------------------------------------------+

SIMT Execution Model

  CPU SIMD                          GPU SIMT
  +-----------+                     +---------------------------+
  | 1 thread  |                     | 1 Warp = 32 threads       |
  | executes  |                     | all threads execute       |
  | vectorized|                     | same instruction          |
  | SIMD instr|                     | different data lanes      |
  +-----------+                     +---------------------------+

  Divergence Handling:

  if (threadIdx.x % 2 == 0) {      Warp execution:
    path_A();                       Cycle 1: mask=0b...01010101 -> run path_A
  } else {                          Cycle 2: mask=0b...10101010 -> run path_B
    path_B();                       (serialized, both paths pay cost)
  }

  Thread Block -> assigned to one SM, shares shared memory
  Warp         -> 32 threads, hardware scheduling unit
  Grid         -> all thread blocks for a kernel launch

GPU Memory Hierarchy

  Registers         ~256KB/SM    ~1 cycle     per-thread, fastest
  Shared Memory     ~48-100KB/SM ~5 cycles    per-block, explicit management
  L1 Cache          ~32-48KB/SM  ~30 cycles   unified with shared memory
  L2 Cache          ~6-50MB      ~200 cycles  on-die, shared across SMs
  Global Memory     ~24-80GB     ~600 cycles  GDDR6X or HBM, off-die
  CPU System Memory ~TBs         ~1000+ cycles via PCIe, Unified Memory

  Bandwidth comparison (A100):
    HBM2e Global:  2,039 GB/s
    NVLink:          600 GB/s bidirectional
    PCIe Gen4 x16:    32 GB/s bidirectional
    CPU DDR4:          50 GB/s

Multi-GPU with NVLink

  +--------+   NVLink   +--------+
  |  GPU 0 |<---------->|  GPU 1 |
  +---+----+            +----+---+
      |  NVLink              | NVLink
  +---v----+            +----v---+
  |  GPU 2 |<---------->|  GPU 3 |
  +--------+   NVLink   +--------+

  NVSwitch-connected DGX (NVLink 3.0):
  All-to-all 600 GB/s between 8 GPUs via NVSwitch fabric
  Appears as NUMA node in software

CUDA Software Stack

  User Application
       |
  CUDA Runtime API (libcudart.so)
       |
  CUDA Driver API (libcuda.so)
       |
  Kernel-Mode GPU Driver (nvidia.ko)
       |  (ioctl interface)
  +----v---------+
  |  GPU Hardware |
  |  Dispatcher   |
  +---------------+

  Compilation path:
  .cu -> nvcc -> PTX (virtual ISA) -> SASS (native ISA, per GPU arch)
  PTX is re-JIT compiled by driver for target SM version

Key Concepts

SIMT (Single Instruction Multiple Threads): GPU execution model where a warp of 32 threads executes the same instruction simultaneously. Differs from CPU SIMD in that threads have independent program counters (enabling divergence, at a cost).
Streaming Multiprocessor (SM): The fundamental GPU compute unit. Contains warp schedulers, execution units, register file, and L1/shared memory. A100 has 108 SMs.
Warp: 32 threads scheduled together as a unit. The hardware switches between warps in zero cycles to hide memory latency (latency hiding via occupancy).
Occupancy: Ratio of active warps to maximum warps on an SM. Higher occupancy helps hide memory latency but requires careful register and shared memory management.
Memory Coalescing: When 32 threads in a warp access contiguous aligned memory, the hardware issues a single wide transaction. Non-coalesced accesses cause multiple transactions and drastically reduce effective bandwidth.
Shared Memory: On-chip SRAM within an SM, explicitly managed by the programmer. Acts as a programmer-managed L1 cache. Banks of 4-byte width; bank conflicts serialize access.
Unified Memory: CUDA virtual memory abstraction where CPU and GPU share an address space; the driver handles page migration via PCIe. Convenient but has performance overhead.
PCIe vs NVLink: PCIe Gen4 x16 delivers ~32 GB/s; NVLink 3.0 delivers 600 GB/s. For multi-GPU communication, NVLink is essential for training large models.
MIG (Multi-Instance GPU): NVIDIA A100/H100 feature allowing a single GPU to be partitioned into up to 7 isolated GPU instances with dedicated memory and compute. Hard isolation suitable for cloud multi-tenancy.
vGPU: Virtualization via SR-IOV or mediated passthrough; time-sliced or spatially partitioned. Used in VMware vSphere + NVIDIA GRID deployments.

Major Historical Milestones

Year	Milestone
1999	NVIDIA GeForce 256 — first GPU with hardware T&L, coined "GPU" term
2001	Pixel and vertex shaders become programmable (DirectX 8)
2006	NVIDIA G80 (GeForce 8800) — unified shader architecture, CUDA-capable
2007	CUDA 1.0 released — general-purpose GPU computing API
2008	OpenCL 1.0 — open standard GPU compute (Khronos Group)
2009	NVIDIA Fermi architecture — first GPU with ECC memory, L1/L2 cache hierarchy
2012	NVIDIA Kepler — hyper-Q, dynamic parallelism
2014	AMD R9 290X — first GDDR5 with high-bandwidth HBM roadmap
2016	NVIDIA Pascal P100 — first HBM2, NVLink 1.0, fp16 for deep learning
2017	NVIDIA Volta V100 — Tensor Cores for matrix multiply, NVLink 2.0
2018	AMD ROCm open-source GPU compute stack matures
2020	NVIDIA Ampere A100 — MIG, NVLink 3.0, sparsity acceleration, 80GB HBM2e
2021	Apple M1 — unified memory architecture eliminating discrete GPU DRAM
2022	NVIDIA Hopper H100 — Transformer Engine, NVLink 4.0, 80GB HBM3, 700W TDP
2023	AMD MI300X — 192GB HBM3 unified memory, challenging H100 for inference
2024	NVIDIA Blackwell B200 — 208B transistors, 20 petaFLOPS FP4, NVLink 5.0

Modern Relevance

GPUs define the compute substrate for the current AI era. Training a GPT-class model requires thousands of H100s running for months, making GPU cluster architecture a core infrastructure discipline. CUDA ecosystem lock-in drives billion-dollar hardware decisions. GPU memory bandwidth — not compute throughput — is typically the binding constraint for large language model inference, making HBM generations critical. The shift toward unified memory (Apple Silicon, AMD MI300X) is reshaping system architecture. GPU kernel optimization (via Triton, CUTLASS, or hand-written CUDA) is a high-value skill as inference costs dominate AI infrastructure budgets.

File Map

31-gpu-systems/
├── 00-overview.md              <- This file
├── 01-gpu-architecture.md
├── 02-simt-execution-model.md
├── 03-gpu-memory-hierarchy.md
├── 04-cuda-programming-model.md
├── 05-opencl-and-rocm.md
├── 06-metal-and-vulkan-compute.md
├── 07-gpu-driver-stack.md
├── 08-gpu-scheduler.md
├── 09-pcie-interconnect.md
├── 10-nvlink-and-nvswitch.md
├── 11-multi-gpu-setups.md
├── 12-gpu-virtualization.md
└── 13-gpu-performance-analysis.md

Cross-References

Section 14 (Device Drivers): GPU kernel driver architecture, DMA, IOMMU, interrupt handling
Section 19 (Virtualization): SR-IOV, mediated passthrough, vGPU in hypervisors
Section 32 (AI Infrastructure): GPU clusters, collective communication, training frameworks
Section 33 (Hardware Architecture): PCIe topology, NUMA, memory controllers, cache coherency
Section 37 (Browser Architecture): WebGPU API, GPU process in Chromium, GPU scheduling