GPU Architecture
Technical Overview
The GPU and CPU represent two fundamentally different answers to the question of how to maximize computation throughput. The CPU optimizes for latency — it uses a small number of powerful cores (8–128), deep branch predictors, large out-of-order execution windows, and large multi-level caches to minimize the time to complete any single task. The GPU optimizes for throughput — it uses thousands of simpler cores with minimal per-thread resources and hides memory latency by rapidly switching between thousands of concurrent threads, keeping the execution units busy at all times.
NVIDIA's Ampere A100 GPU (2020) contains 6,912 CUDA cores across 108 Streaming Multiprocessors, 432 Tensor cores, 80 GB of HBM2e memory with 2 TB/s bandwidth, and delivers 312 TFLOPS of FP16 throughput. A high-end CPU (AMD EPYC 9654, 96 cores) delivers ~10 TFLOPS FP64 theoretical peak. For the right workload (dense matrix multiplication), the GPU is 30–100x faster.
Prerequisites
- Computer architecture fundamentals: pipelines, caches, DRAM, instruction-level parallelism
- Basic parallel programming concepts: threads, data parallelism
- Understanding of memory hierarchies and bandwidth vs latency tradeoffs
- Floating-point arithmetic (FP32, FP16, BF16 formats)
GPU SM Architecture Diagram (NVIDIA Ampere)
+------------------------------------------------------------------+
| Streaming Multiprocessor (SM) — Ampere Architecture |
| |
| +------------------------------------------------------------+ |
| | Register File (256KB — 65536 × 32-bit registers) | |
| +------------------------------------------------------------+ |
| |
| +----------------+ +----------------+ +--------------------+ |
| | CUDA Cores (x4)| | Tensor Cores | | Special Function | |
| | partition: | | (FP16/BF16/ | | Units (SFUs) | |
| | 32 FP32 + | | INT8/TF32) | | sin, cos, sqrt, | |
| | 32 INT32 | | 4 × 4 × 4 MMA | | reciprocal | |
| +----------------+ +----------------+ +--------------------+ |
| |
| +------------------------------------------------------------+ |
| | LD/ST Units (16 per SM) — Load/Store to global/shared mem | |
| +------------------------------------------------------------+ |
| |
| +--------------------+ +----------------------------------+ |
| | L1 Cache / | | Warp Schedulers (4 per SM) | |
| | Shared Memory | | Each selects 1 warp/cycle from | |
| | (192KB, split 128/ | | resident warps to issue | |
| | 64 configurable) | +----------------------------------+ |
| +--------------------+ |
| |
| Resident warps: up to 64 per SM (2048 threads per SM) |
| Max blocks per SM: 32 |
+------------------------------------------------------------------+
108 SMs total in A100 = 108 × 64 warps = 6,912 warps = 221,184 threads
Core Content
GPU vs CPU Design Philosophy
CPU Design: - 8–128 cores (AMD EPYC 9754: 128 cores) - Large branch predictors (thousands of entries), deep speculative execution (hundreds of in-flight instructions) - Large L3 cache (64–512 MB) to hide DRAM latency - Out-of-order execution window: 500–1000 in-flight instructions - Optimized for latency: minimize time-to-completion for a single thread - Hyperthreading/SMT: 2 hardware threads per core
GPU Design: - Thousands of small cores grouped into SMs - Minimal per-thread state: small register file, no branch predictor - Latency hiding via massive multithreading: when one warp stalls on memory access, the warp scheduler instantly switches to another ready warp - Optimized for throughput: maximize total work done per second - Each SM is a small SIMD processor plus a hardware scheduler that context-switches between 64 warps at zero cost (register state is held in the register file, not swapped out)
When to use a GPU: Workloads with massive data parallelism, minimal branching, arithmetic intensity (many operations per byte loaded). Neural network training/inference, scientific simulation (FDTD, molecular dynamics), image/video processing, cryptographic mining, graphics rendering.
When the CPU wins: Workloads with complex branching, irregular memory access patterns, low data parallelism, or where the overhead of GPU data transfer exceeds the computation gain.
NVIDIA GPU Architecture: Streaming Multiprocessor
The Streaming Multiprocessor (SM) is the fundamental building block. All of CUDA's abstractions ultimately run on SMs.
Ampere SM composition (as in A100): - CUDA cores: 128 per SM. Each CUDA core executes one FP32 or INT32 operation per clock. At 1.41 GHz, one SM does 128 × 1.41 × 10^9 = 180 GFLOPS FP32. - Tensor cores: 4 per SM partition (16 total per SM in Ampere for 4 partitions). Each Tensor core executes a 4×4×4 matrix multiply-accumulate (MMA) operation per clock, processing 256 FP16 multiplications and 256 FP32 additions per clock. This is NVIDIA's workhorse for deep learning. A100 delivers 312 TFLOPS with sparsity exploitation. - Special Function Units (SFUs): 4 per SM partition. Execute transcendental operations (sin, cos, exp, log, reciprocal, square root) in 1/4 the throughput of CUDA cores (one per 4 cycles instead of one per cycle). - Load/Store Units: 16 per SM. Transfer data between registers and all memory spaces (global, shared, local, constant). - Register file: 256KB per SM (65,536 registers × 32-bit). Largest on-SM resource. Registers hold the state of all resident warps simultaneously — this is why GPU context switching is "free": no register save/restore. - L1 cache / shared memory: 192KB per SM in Ampere (configurable: 128KB shared + 64KB L1, or other splits). Shared memory is explicitly managed by the programmer; L1 is hardware-managed.
Warp Execution: SIMT Model
NVIDIA's execution model is SIMT (Single Instruction, Multiple Threads), a GPU-specific variant of SIMD.
A warp is a group of 32 threads that execute in lock-step — the same instruction is issued to all 32 threads simultaneously, operating on different data (different thread indices). This is analogous to a 32-wide SIMD instruction on a CPU, but with hardware management of the thread identities.
Each SM has 4 warp schedulers. On each clock cycle, each warp scheduler selects one eligible warp from its pool and issues the next instruction for that warp to the execution units. A warp is eligible if: - It has no outstanding data dependency (prior instruction's result is available) - It has no pending memory access
When a warp stalls (waiting for global memory — typically 200–800 cycles), it is set to "not eligible" and another warp is immediately issued. This is latency hiding through multithreading: the SM never sits idle waiting for memory if there are sufficient resident warps.
Warp occupancy: The fraction of maximum resident warps actually resident on an SM. Limited by: - Register usage: if a kernel uses 64 registers/thread, 32 threads/warp × 64 regs = 2048 registers/warp. 65536 regs / 2048 = 32 resident warps max (out of 64 max). Occupancy = 50%. - Shared memory: if a kernel uses 48KB shared memory/block, and each SM has 96KB, only 2 blocks fit → limited occupancy. - Block count limit: Max 32 blocks per SM (Ampere).
High occupancy is generally desirable but not always — a kernel with low occupancy but high ILP (instruction-level parallelism) within each warp can still fully utilize the SM.
SIMT Divergence
When threads in a warp execute a conditional branch (if (condition)), two cases:
1. All 32 threads take the same path (uniform warp): No divergence. Full throughput.
2. Some threads take one path, others take another (divergent warp): Both paths must be executed. The GPU serializes the two paths:
- Execute the if branch for threads where condition == true (other threads are masked, their writes are suppressed)
- Execute the else branch for threads where condition == false
- Both paths take sequential clock cycles even though only half the threads are active each time
Divergence cost: In the worst case (alternate threads take different paths), performance is halved (50% SIMD utilization). Divergent code patterns to avoid in inner loops:
- if (threadIdx.x % 2 == 0) — alternating threads diverge worst case
- if (data[tid] > threshold) — data-dependent branch with non-uniform data
Reconvergence: After a divergent branch, threads reconverge at the nearest common post-dominator. Modern NVIDIA architectures (Volta+) implement Independent Thread Scheduling — each thread maintains its own program counter, allowing full divergence with correct lock-step reconvergence.
GPU Memory Hierarchy
+------------------------------------------+
| Latency / Size / Bandwidth Trade-off |
| |
| Registers 1 cycle 256 KB/SM | Fastest, private per thread
| | |
| Shared Memory ~32 cycles 192 KB/SM | Explicitly managed, per-block
| | |
| L1 Cache ~32 cycles 192 KB/SM | Hardware managed (same SRAM)
| | |
| L2 Cache ~200 cycles 40 MB (A100)| Shared across all SMs
| | |
| HBM2e (Global) ~600 cycles 80 GB (A100)| 2 TB/s bandwidth
| |
+------------------------------------------+
(Latencies approximate for A100 at 1.41 GHz)
Registers: The fastest memory. Private to each thread. No latency if result is available (bypass path). Limited: 255 registers/thread max. Using too many registers reduces occupancy.
Shared memory: On-chip SRAM, same physical SRAM as L1 cache (configurable split). ~100x faster than global memory for reads, ~50x faster for bank-conflict-free access. Must be explicitly allocated (__shared__) and loaded from global memory. Critical for tiling algorithms (GEMM, convolution) where data blocks are reused across threads.
L1 / L2 cache: Hardware-managed. L1 is per-SM. L2 is global (across all SMs). On A100, the L2 is 40MB and bisectional bandwidth is 3.2 TB/s (L2↔SMs). The L2 acts as a bandwidth multiplier for workloads with reusable data.
HBM2e (Global memory): 80GB at 2 TB/s on A100. GDDR6 (consumer GPUs) provides less: RTX 4090 has 24GB GDDR6X at 1 TB/s. For comparison, CPU DDR5 bandwidth: ~100–200 GB/s total. HBM achieves 10–20x CPU DRAM bandwidth by stacking DRAM dies vertically with thousands of wires connecting them to the interposer.
NVIDIA Ampere A100 Specifications
| Specification | A100 (SXM) | H100 (SXM) |
|---|---|---|
| Architecture | Ampere | Hopper |
| SMs | 108 | 132 |
| CUDA Cores | 6,912 | 16,896 |
| Tensor Cores | 432 (3rd gen) | 528 (4th gen) |
| FP32 TFLOPS | 19.5 | 66.9 |
| FP16 TFLOPS (no sparsity) | 77.97 | 267.6 |
| FP16 TFLOPS (with sparsity) | 312 | 835 |
| INT8 TOPS (with sparsity) | 624 | 3,958 |
| Memory | 80 GB HBM2e | 80 GB HBM3 |
| Memory Bandwidth | 2,000 GB/s | 3,350 GB/s |
| NVLink Bandwidth | 600 GB/s | 900 GB/s |
| TDP | 400W | 700W |
| Process node | TSMC 7nm | TSMC 4nm |
H100 Hopper additions over A100: - Transformer Engine: FP8 training with automatic scaling, doubling throughput for LLM training - NVLink 4.0: 900 GB/s vs 600 GB/s, supporting 256 GPUs in an NVSwitch fabric - Thread Block Clusters: Groups of cooperative thread blocks spanning multiple SMs, sharing data via distributed shared memory - DPX instructions: Accelerated dynamic programming (Smith-Waterman, Needleman-Wunsch for genomics)
AMD RDNA/CDNA Architecture
AMD's GPU architecture divides into: - RDNA (Radeon DNA): Gaming GPUs. RDNA 3 (RX 7900 XTX): 24GB GDDR6 at 960 GB/s, ~123 TFLOPS FP16. - CDNA (Compute DNA): Data center GPUs, competition to NVIDIA A100/H100. AMD Instinct MI300X (2023): 192GB HBM3, 5.3 TB/s bandwidth, 2,614 TFLOPS FP8. The memory subsystem surpasses H100 in total capacity.
AMD's compute unit (CU, equivalent to NVIDIA SM) contains 64 shader processors (stream processors), organized as 4 × 16-wide SIMD units. AMD's warp equivalent is called a wavefront (64 threads vs NVIDIA's 32). Larger wavefront = larger SIMD width = better utilization for well-structured code, but worse for irregular branches.
Historical Context
NVIDIA's first programmable GPU was the GeForce 3 (2001), which allowed custom vertex shaders. The launch of CUDA (Compute Unified Device Architecture) in 2006 with the G80 architecture was transformative — it exposed the GPU as a general-purpose parallel processor with a C-like programming model. The breakthrough paper "Scalable Parallel Programming with CUDA" (Lindholm et al., IEEE Micro 2008) established the SIMT model. The 2012 AlexNet breakthrough (Krizhevsky, Sutskever, Hinton) proved that GPUs were essential for deep learning, triggering the explosive growth of GPU computing.
Production Examples
# Query GPU properties with nvidia-smi
nvidia-smi -q | grep -A 20 "GPU 00000"
nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu \
--format=csv,noheader
# Detailed SM utilization
dcgmi dmon -s u # DCGM (Data Center GPU Manager)
# Profile GPU utilization per SM
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed \
./my_kernel
# Check NVLink topology
nvidia-smi topo -m
Debugging Notes
nvidia-smi -q -d CLOCKshows current clock frequencies. Thermal throttling reduces clocks — watch forCLOCK_THROTTLE_REASON_GPU_IDLEorHW_THERMALflags.- Kernel launch failures (invalid launch parameters) often show only in
cudaGetLastError()calls — always check error codes. cuda-memcheck(deprecated, replaced bycompute-sanitizer) detects out-of-bounds global memory accesses, illegal memory accesses, and race conditions in shared memory.Nsight Systemsfor timeline profiling (kernel launches, PCIe transfers, CPU-GPU synchronization).Nsight Computefor kernel-level metrics (occupancy, memory throughput, warp efficiency).
Security Implications
- GPU memory isolation: GPUs do not have IOMMU-level isolation by default — on some configurations, one process can DMA-read another process's GPU memory. Linux kernel 5.6+ and NVIDIA drivers with MIG (Multi-Instance GPU) enforce memory isolation.
- GPU-based malware: Malware running on the GPU is invisible to CPU-based antivirus. The GPU has persistent DMA access to CPU memory. Research has shown GPU-resident rootkits (JellyFish, 2015).
- Speculative execution on GPUs: NVIDIA GPUs do not implement speculative execution, so Spectre/Meltdown class attacks don't apply directly. However, timing side-channels exist (cache timing, warp timing).
- GPU memory scrubbing: When a process releases GPU memory (cudaFree), it may not be zeroed before being reused. A malicious process can allocate and read previously used memory. NVIDIA drivers can be configured to zero-scrub GPU memory between process assignments.
Performance Implications
- Arithmetic Intensity: The GPU's memory bandwidth limits throughput when there are fewer than ~N FLOPs per byte loaded (N = peak_FLOPS / peak_bandwidth = 312e12 / 2e12 ≈ 156 FLOP/byte for A100 FP16). Kernels below this threshold are memory-bandwidth bound; above are compute-bound.
- Occupancy vs ILP: Low-occupancy kernels can still fully utilize the SM if each warp has high instruction-level parallelism (many independent instructions queued). Tuning occupancy without profiling is premature optimization.
- PCIe bottleneck: Data transfer between CPU and GPU via PCIe Gen4 x16 (64 GB/s peak bidirectional). Keeping data on the GPU across multiple kernels is essential for performance.
Failure Modes
- CUDA OOM (Out of Memory):
cudaErrorMemoryAllocation— GPU has insufficient memory. Reduce batch size, use gradient checkpointing, or use a larger GPU. - Kernel launch timeout: On display GPUs (consumer cards), WDDM (Windows Display Driver Model) or X server watchdog kills kernels running >2 seconds. Use Tesla/datacenter GPUs or disable watchdog (Linux:
nvidia-modprobe -uvm; unsafe for display GPUs). - SM divergence stall: Heavily divergent code can reduce effective utilization to 3–6%, making the GPU slower than a CPU. Profile with
warp_execution_efficiencymetric in Nsight Compute. - ECC errors: HBM DRAM uses ECC (Error Correcting Code). Single-bit errors are corrected; double-bit errors cause kernel crashes. Monitor with
nvidia-smi -q -d ECC.
Modern Usage
NVIDIA's Multi-Instance GPU (MIG) (A100, H100) partitions a single GPU into up to 7 independent GPU instances, each with its own SMs, L2 cache, and HBM memory fraction. Used in cloud platforms (AWS p4dn, Azure ND A100) to serve multiple tenants on one GPU with hardware-enforced isolation.
NVSwitch fabric: 8 A100s connected via NVLink 3.0 (A100 NVSwitch system) provide 600 GB/s GPU-GPU bandwidth. 256 H100s in an NVSwitch 3 fabric achieve 900 GB/s all-pairs bandwidth — enabling the training of trillion-parameter LLMs without PCIe bottlenecks.
Future Directions
- NVIDIA GB200 (Blackwell): 2× B200 GPUs + Grace CPU in one package (NVLink-C2C at 900 GB/s). 1,000 TFLOPS FP4, 192GB HBM3e per GPU. Available 2025.
- GPU-native ML frameworks: NVIDIA's Warp (Python-first GPU programming), PyTorch 2.x torch.compile (JIT kernel fusion), and Triton (ML-specific GPU kernel language) are moving kernel programming higher up the abstraction stack.
- Chiplet-based GPUs: AMD MI300X is the first production chiplet GPU (13 chiplets: 3 CPU dies + 10 GPU dies). NVIDIA Rubin (expected 2026) will also use chiplets, enabling larger die area than monolithic limits allow.
- In-memory compute (PIM): Samsung HBM-PIM embeds AI processing elements inside HBM DRAM stacks, computing directly where data lives and eliminating the bandwidth wall for bandwidth-bound kernels.
Exercises
- Use
nvidia-smito collect real-time SM utilization, memory usage, and memory bandwidth utilization while running a matrix multiplication benchmark (e.g.,cublas_sgemm). Identify whether the kernel is compute-bound or memory-bandwidth bound using the roofline model. - Write a CUDA program that measures shared memory latency vs global memory latency by accessing array elements in a controlled pattern. Compare the measured latency to the theoretical values (32 cycles shared, 600+ cycles global).
- Implement a warp divergence experiment: write two CUDA kernels that perform identical arithmetic work. In Kernel A, all threads in a warp take the same branch. In Kernel B, alternate threads take different branches. Measure the throughput difference. Profile with
warp_execution_efficiency. - Use Nsight Compute to profile a naively written matrix transpose kernel. Identify the bottleneck (likely uncoalesced global memory access). Implement the shared-memory tiled version and compare the
l2_global_load_transactionsmetric before and after. - Profile the occupancy of a kernel using the CUDA Occupancy Calculator (CUDA SDK tool). Experiment with changing
--maxrregcount(maximum register count per thread). Observe the occupancy-vs-register tradeoff and measure actual throughput at different register limits.
References
- Lindholm, E., Nickolls, J., Oberman, S., Montrym, J. "NVIDIA Tesla: A Unified Graphics and Computing Architecture." IEEE Micro, 2008.
- NVIDIA A100 GPU Architecture Whitepaper. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
- NVIDIA H100 Tensor Core GPU Architecture. https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper
- CUDA Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
- NVIDIA Nsight Compute documentation: https://docs.nvidia.com/nsight-compute/
- Patterson, D., Hennessy, J. Computer Organization and Design: ARM Edition. Morgan Kaufmann, 2016. Chapter 6 (GPU parallelism).