Section 31: GPU Systems
Purpose and Scope
This section covers the architecture and systems software of Graphics Processing Units from first principles. GPUs are no longer peripheral accelerators — they are first-class compute infrastructure for AI training, scientific simulation, graphics, and video processing. Understanding GPU systems requires grasping the hardware architecture (SIMT execution, memory hierarchy, interconnects), the driver stack (kernel and userspace), programming models (CUDA, OpenCL, Metal, Vulkan Compute), and operational concerns (multi-GPU, virtualization, scheduling). This section bridges the hardware specification and the software abstractions that sit above it.
Prerequisites
- CPU architecture and cache hierarchy (Section 33)
- Operating system kernel fundamentals, device drivers (Section 14)
- Virtual memory and DMA concepts (Section 11)
- PCIe bus basics
- C/C++ programming; optional: CUDA or OpenCL exposure
Learning Objectives
By the end of this section, you will be able to:
- Explain SIMT execution and why it differs from SIMD on CPUs
- Describe the full GPU memory hierarchy from registers to HBM/GDDR and quantify typical bandwidths
- Trace a CUDA kernel launch through the driver stack from cudaLaunchKernel() to hardware dispatch
- Explain how the GPU scheduler manages warps and hides memory latency through context switching
- Describe NVLink topology and compare it to PCIe for multi-GPU workloads
- Explain GPU virtualization approaches (time-slicing, MIG, vGPU/SR-IOV) and their tradeoffs
- Compare CUDA, ROCm, Metal, and Vulkan Compute programming models
- Diagnose GPU performance bottlenecks (compute-bound vs memory-bound vs latency-bound)
Architecture Overview
GPU Compute Hierarchy
GPU Die
+---------------------------------------------------------------+
| |
| GPC (Graphics Processing Cluster) x N |
| +----------------------------------------------------------+ |
| | | |
| | TPC (Texture Processing Cluster) x M | |
| | +----------------------------------------------------+ | |
| | | | | |
| | | SM (Streaming Multiprocessor) | | |
| | | +----------------------------------------------+ | | |
| | | | Warp Schedulers (x4) | | | |
| | | | INT32 Units (x32) | FP32 Units (x64) | | | |
| | | | Tensor Cores (x4, per SM on Ampere+) | | | |
| | | | Register File (256KB per SM) | | | |
| | | | L1 Cache / Shared Memory (128KB configurable)| | | |
| | | +----------------------------------------------+ | | |
| | +----------------------------------------------------+ | |
| +----------------------------------------------------------+ |
| |
| L2 Cache (6MB - 50MB depending on GPU) |
| Memory Controllers |
| HBM2e / GDDR6X stacks (A100: 80GB HBM2e, 2TB/s BW) |
| NVLink fabric interface (A100: 600 GB/s bidirectional) |
| PCIe interface (Gen4 x16: 32 GB/s bidirectional) |
+---------------------------------------------------------------+
SIMT Execution Model
CPU SIMD GPU SIMT
+-----------+ +---------------------------+
| 1 thread | | 1 Warp = 32 threads |
| executes | | all threads execute |
| vectorized| | same instruction |
| SIMD instr| | different data lanes |
+-----------+ +---------------------------+
Divergence Handling:
if (threadIdx.x % 2 == 0) { Warp execution:
path_A(); Cycle 1: mask=0b...01010101 -> run path_A
} else { Cycle 2: mask=0b...10101010 -> run path_B
path_B(); (serialized, both paths pay cost)
}
Thread Block -> assigned to one SM, shares shared memory
Warp -> 32 threads, hardware scheduling unit
Grid -> all thread blocks for a kernel launch
GPU Memory Hierarchy
Registers ~256KB/SM ~1 cycle per-thread, fastest
Shared Memory ~48-100KB/SM ~5 cycles per-block, explicit management
L1 Cache ~32-48KB/SM ~30 cycles unified with shared memory
L2 Cache ~6-50MB ~200 cycles on-die, shared across SMs
Global Memory ~24-80GB ~600 cycles GDDR6X or HBM, off-die
CPU System Memory ~TBs ~1000+ cycles via PCIe, Unified Memory
Bandwidth comparison (A100):
HBM2e Global: 2,039 GB/s
NVLink: 600 GB/s bidirectional
PCIe Gen4 x16: 32 GB/s bidirectional
CPU DDR4: 50 GB/s
Multi-GPU with NVLink
+--------+ NVLink +--------+
| GPU 0 |<---------->| GPU 1 |
+---+----+ +----+---+
| NVLink | NVLink
+---v----+ +----v---+
| GPU 2 |<---------->| GPU 3 |
+--------+ NVLink +--------+
NVSwitch-connected DGX (NVLink 3.0):
All-to-all 600 GB/s between 8 GPUs via NVSwitch fabric
Appears as NUMA node in software
CUDA Software Stack
User Application
|
CUDA Runtime API (libcudart.so)
|
CUDA Driver API (libcuda.so)
|
Kernel-Mode GPU Driver (nvidia.ko)
| (ioctl interface)
+----v---------+
| GPU Hardware |
| Dispatcher |
+---------------+
Compilation path:
.cu -> nvcc -> PTX (virtual ISA) -> SASS (native ISA, per GPU arch)
PTX is re-JIT compiled by driver for target SM version
Key Concepts
- SIMT (Single Instruction Multiple Threads): GPU execution model where a warp of 32 threads executes the same instruction simultaneously. Differs from CPU SIMD in that threads have independent program counters (enabling divergence, at a cost).
- Streaming Multiprocessor (SM): The fundamental GPU compute unit. Contains warp schedulers, execution units, register file, and L1/shared memory. A100 has 108 SMs.
- Warp: 32 threads scheduled together as a unit. The hardware switches between warps in zero cycles to hide memory latency (latency hiding via occupancy).
- Occupancy: Ratio of active warps to maximum warps on an SM. Higher occupancy helps hide memory latency but requires careful register and shared memory management.
- Memory Coalescing: When 32 threads in a warp access contiguous aligned memory, the hardware issues a single wide transaction. Non-coalesced accesses cause multiple transactions and drastically reduce effective bandwidth.
- Shared Memory: On-chip SRAM within an SM, explicitly managed by the programmer. Acts as a programmer-managed L1 cache. Banks of 4-byte width; bank conflicts serialize access.
- Unified Memory: CUDA virtual memory abstraction where CPU and GPU share an address space; the driver handles page migration via PCIe. Convenient but has performance overhead.
- PCIe vs NVLink: PCIe Gen4 x16 delivers ~32 GB/s; NVLink 3.0 delivers 600 GB/s. For multi-GPU communication, NVLink is essential for training large models.
- MIG (Multi-Instance GPU): NVIDIA A100/H100 feature allowing a single GPU to be partitioned into up to 7 isolated GPU instances with dedicated memory and compute. Hard isolation suitable for cloud multi-tenancy.
- vGPU: Virtualization via SR-IOV or mediated passthrough; time-sliced or spatially partitioned. Used in VMware vSphere + NVIDIA GRID deployments.
Major Historical Milestones
| Year | Milestone |
|---|---|
| 1999 | NVIDIA GeForce 256 — first GPU with hardware T&L, coined "GPU" term |
| 2001 | Pixel and vertex shaders become programmable (DirectX 8) |
| 2006 | NVIDIA G80 (GeForce 8800) — unified shader architecture, CUDA-capable |
| 2007 | CUDA 1.0 released — general-purpose GPU computing API |
| 2008 | OpenCL 1.0 — open standard GPU compute (Khronos Group) |
| 2009 | NVIDIA Fermi architecture — first GPU with ECC memory, L1/L2 cache hierarchy |
| 2012 | NVIDIA Kepler — hyper-Q, dynamic parallelism |
| 2014 | AMD R9 290X — first GDDR5 with high-bandwidth HBM roadmap |
| 2016 | NVIDIA Pascal P100 — first HBM2, NVLink 1.0, fp16 for deep learning |
| 2017 | NVIDIA Volta V100 — Tensor Cores for matrix multiply, NVLink 2.0 |
| 2018 | AMD ROCm open-source GPU compute stack matures |
| 2020 | NVIDIA Ampere A100 — MIG, NVLink 3.0, sparsity acceleration, 80GB HBM2e |
| 2021 | Apple M1 — unified memory architecture eliminating discrete GPU DRAM |
| 2022 | NVIDIA Hopper H100 — Transformer Engine, NVLink 4.0, 80GB HBM3, 700W TDP |
| 2023 | AMD MI300X — 192GB HBM3 unified memory, challenging H100 for inference |
| 2024 | NVIDIA Blackwell B200 — 208B transistors, 20 petaFLOPS FP4, NVLink 5.0 |
Modern Relevance
GPUs define the compute substrate for the current AI era. Training a GPT-class model requires thousands of H100s running for months, making GPU cluster architecture a core infrastructure discipline. CUDA ecosystem lock-in drives billion-dollar hardware decisions. GPU memory bandwidth — not compute throughput — is typically the binding constraint for large language model inference, making HBM generations critical. The shift toward unified memory (Apple Silicon, AMD MI300X) is reshaping system architecture. GPU kernel optimization (via Triton, CUTLASS, or hand-written CUDA) is a high-value skill as inference costs dominate AI infrastructure budgets.
File Map
31-gpu-systems/
├── 00-overview.md <- This file
├── 01-gpu-architecture.md
├── 02-simt-execution-model.md
├── 03-gpu-memory-hierarchy.md
├── 04-cuda-programming-model.md
├── 05-opencl-and-rocm.md
├── 06-metal-and-vulkan-compute.md
├── 07-gpu-driver-stack.md
├── 08-gpu-scheduler.md
├── 09-pcie-interconnect.md
├── 10-nvlink-and-nvswitch.md
├── 11-multi-gpu-setups.md
├── 12-gpu-virtualization.md
└── 13-gpu-performance-analysis.md
Cross-References
- Section 14 (Device Drivers): GPU kernel driver architecture, DMA, IOMMU, interrupt handling
- Section 19 (Virtualization): SR-IOV, mediated passthrough, vGPU in hypervisors
- Section 32 (AI Infrastructure): GPU clusters, collective communication, training frameworks
- Section 33 (Hardware Architecture): PCIe topology, NUMA, memory controllers, cache coherency
- Section 37 (Browser Architecture): WebGPU API, GPU process in Chromium, GPU scheduling