Section 32: AI Infrastructure
Purpose and Scope
AI infrastructure is the systems engineering discipline underlying large-scale machine learning. Training frontier models like GPT-4 or Gemini Ultra requires clusters of thousands of accelerators, custom high-speed networks, distributed training frameworks, and specialized storage systems — all operating reliably for weeks or months. Inference serving introduces a separate set of constraints: latency, throughput, KV cache memory management, and continuous batching. This section covers the full stack from GPU cluster networking and collective communication primitives through distributed training frameworks, inference serving systems, SmartNICs/DPUs, and the MLOps infrastructure that ties it all together.
Prerequisites
- GPU architecture and CUDA programming model (Section 31)
- Distributed systems fundamentals (Section 17)
- Network internals, RDMA concepts (Section 15, 16)
- Operating system and Linux kernel basics (Section 03, 04)
- Basic machine learning concepts (gradient descent, model parameters)
Learning Objectives
By the end of this section, you will be able to:
- Design a GPU cluster network for a training workload and justify topology choices
- Explain Ring-AllReduce and its alternatives, and describe when each collective is optimal
- Distinguish tensor parallelism, pipeline parallelism, and data parallelism, and describe how they are combined
- Explain how FSDP and DeepSpeed ZeRO reduce per-GPU memory requirements
- Describe continuous batching and paged attention and explain why they improve GPU utilization in inference
- Explain how KV cache memory is managed in vLLM and why it matters for throughput
- Describe how SmartNICs and DPUs offload work from host CPUs in AI clusters
- Design an MLOps pipeline for training, versioning, and serving a large model
Architecture Overview
AI Training Cluster Topology
Compute Nodes (x1000s)
+------------------+
| 8x H100 GPUs | <-- NVLink 4.0 fabric within node (900 GB/s)
| 2x AMD EPYC CPUs |
| 1TB DDR5 RAM |
| 8x 400GbE / HDR | <-- one NIC per GPU for network offload
| InfiniBand ports |
+------------------+
|
| InfiniBand HDR200 / NDR400
|
+-------v----------+
| Spine Switches | Rail-optimized topology:
| (per GPU rail) | GPU 0 of every node -> same spine switch
+-------------------+ (reduces hops for collective operations)
|
+-------v----------+
| Core Switches | Fat-tree or 3-stage Clos
+-------------------+
Storage:
+------------------+
| Parallel FS | Lustre / GPFS / WekaIO
| (NFS over RDMA) | 1TB/s aggregate throughput for checkpoint I/O
+------------------+
Collective Communication in Ring-AllReduce
Data Parallel Training: each GPU holds full model, different data shards
N GPUs forming a logical ring for gradient aggregation:
GPU0 -> GPU1 -> GPU2 -> GPU3 -> GPU0 (ring)
Phase 1: Scatter-Reduce (N-1 steps)
Each GPU sends 1/N of its gradient buffer to the next GPU,
receiving and accumulating from the previous.
Phase 2: AllGather (N-1 steps)
Each GPU now has the final sum for 1/N of the gradient.
Broadcast the fully reduced chunks around the ring.
Result: All GPUs have the fully reduced (summed) gradient.
Communication volume: 2*(N-1)/N * gradient_size per GPU
(approaches 2x gradient_size, independent of N)
NCCL handles this using RDMA (RoCE v2 or InfiniBand verbs)
to bypass the CPU kernel entirely.
Model Parallelism Strategies
Data Parallelism (DP):
+----+ +----+ +----+ +----+
| M | | M | | M | | M | M = full model copy
| B1 | | B2 | | B3 | | B4 | B = different data batch
+----+ +----+ +----+ +----+
AllReduce gradients after each backward pass
Tensor Parallelism (TP) - Megatron-style:
Single large matrix multiplication split across GPUs:
+--GPU0--+ +--GPU1--+
| A[:,:N/2]| |A[:,N/2:]| A = weight matrix
| col shard| | col shard|
+----------+ +----------+
Each GPU computes partial result; AllReduce to combine
Pipeline Parallelism (PP):
Layers distributed across GPUs, micro-batches flow through:
GPU0: layers 0-7
GPU1: layers 8-15
GPU2: layers 16-23
GPU3: layers 24-31
Micro-batch 1: GPU0->GPU1->GPU2->GPU3
Micro-batch 2: fills pipeline while micro-batch 1 is in GPU3
3D Parallelism (Megatron + DeepSpeed):
DP x TP x PP = total GPUs
e.g., 8 DP * 8 TP * 16 PP = 1024 GPUs
Inference Serving: vLLM / Continuous Batching
Naive static batching:
Request arrives -> wait for batch to fill -> process -> respond
Problem: short requests wait for long ones (head-of-line blocking)
Continuous Batching (iteration-level scheduling):
Time ->
Batch: [req1] [req1] [req1+req2] [req2+req3] [req3] ...
^ req2 added ^ req1 done, req3 added
Each forward pass ("iteration") has a dynamic batch size.
New requests inserted as soon as a slot frees.
Paged Attention (vLLM):
KV cache split into fixed-size "pages" (blocks) of token slots.
Physical pages allocated on demand; virtual page table maps
logical token positions to physical memory.
Logical KV: [tok0..tok15][tok16..tok31][tok32..tok47]
Physical mem: page7 page2 page19
Benefit: no fragmentation, KV cache shared for common prefixes
(prefix caching), enables >90% GPU memory utilization for KV.
Key Concepts
- NCCL (NVIDIA Collective Communications Library): Optimized collective ops (AllReduce, AllGather, ReduceScatter, Broadcast) that use RDMA and NVLink to bypass the CPU. The communication backbone of all major training frameworks.
- InfiniBand vs RoCE: InfiniBand (IB) is a dedicated lossless fabric with hardware-based congestion control. RoCE v2 (RDMA over Converged Ethernet) runs on standard Ethernet but requires PFC/ECN for losslessness. IB dominates HPC/AI clusters; RoCE is common in hyperscale (Meta, Microsoft).
- RDMA (Remote Direct Memory Access): Network operations that read/write remote host memory without involving the remote CPU. Eliminates kernel crossing on both send and receive paths. Zero-copy, low-latency.
- FSDP (Fully Sharded Data Parallel): PyTorch's ZeRO-3 equivalent. Model parameters, gradients, and optimizer states are sharded across data-parallel ranks. Each GPU holds only 1/N of the model. AllGather parameters before each layer's forward pass; ReduceScatter gradients after backward.
- DeepSpeed ZeRO: Zero Redundancy Optimizer. Three stages: ZeRO-1 shards optimizer states, ZeRO-2 adds gradient sharding, ZeRO-3 adds parameter sharding. Enables training models larger than a single GPU's VRAM.
- Tensor Parallelism: Intra-layer parallelism splitting weight matrices across GPUs. Requires AllReduce or AllGather at each layer boundary. High NVLink bandwidth essential (intra-node only in practice).
- Pipeline Parallelism: Inter-layer parallelism assigning consecutive model layers to different GPU groups. Introduces pipeline bubbles; mitigated by micro-batching (Gpipe, PipeDream, Interleaved 1F1B).
- KV Cache: During autoregressive LLM inference, attention key/value projections from all previous tokens are cached to avoid recomputation. Scales as O(batch * seq_len * layers * d_model) in GPU memory.
- Continuous Batching: Inference scheduling where the batch composition changes at each forward pass step, allowing new requests to join and completed requests to exit without waiting for the full batch.
- SmartNIC / DPU: Network cards with embedded ARM cores and programmable data-plane engines (NVIDIA BlueField, Marvell OCTEON). Offload storage I/O, encryption, telemetry aggregation, and RDMA operations from host CPUs.
Major Historical Milestones
| Year | Milestone |
|---|---|
| 2012 | AlexNet wins ImageNet — GPU training demonstrated at scale |
| 2014 | Google DistBelief — first large-scale distributed training system described |
| 2016 | NCCL 1.0 released; Ring-AllReduce popularized (Baidu research) |
| 2016 | Uber Horovod — Ring-AllReduce for distributed TensorFlow |
| 2017 | Megatron-LM — tensor parallelism for GPT-scale transformer training |
| 2018 | PyTorch DDP (DistributedDataParallel) — gradient bucketing, overlap with backward pass |
| 2019 | DeepSpeed ZeRO — enables training models 10x larger than GPU VRAM |
| 2019 | NVIDIA DGX A100 system — 8 A100s with NVSwitch full mesh |
| 2020 | GPT-3 trained — 175B params, hundreds of A100s for months |
| 2020 | Microsoft/NVIDIA 530B parameter model trained with 3D parallelism |
| 2021 | PyTorch FSDP merged; vLLM paged attention paper |
| 2022 | NVIDIA H100 with Transformer Engine (FP8 training) |
| 2022 | Triton language (OpenAI) for GPU kernel programming without CUDA C |
| 2023 | vLLM 0.1 released — paged attention + continuous batching |
| 2023 | AMD MI300X — 192GB HBM3, competitive inference performance |
| 2024 | NVIDIA Blackwell NVLink 5.0, 1.8 TB/s per GPU; GB200 NVL72 rack systems |
Modern Relevance
AI infrastructure has become the fastest-growing segment of systems engineering. Frontier model training consumes tens of thousands of GPU-months and requires cluster reliability that approaches carrier-grade (a single node failure in a 10,000-GPU training run can waste hours of compute). Inference efficiency directly determines unit economics for AI products — a 2x improvement in tokens/sec/GPU halves serving costs. The discipline draws on networking (RDMA, congestion control), distributed systems (fault tolerance, checkpointing), storage (checkpoint I/O at PB scale), and hardware architecture simultaneously. Engineers who understand this stack end-to-end are among the most sought-after in the industry.
File Map
32-ai-infrastructure/
├── 00-overview.md <- This file
├── 01-ai-training-clusters.md
├── 02-infiniband-and-roce.md
├── 03-rdma-internals.md
├── 04-nccl-and-collectives.md
├── 05-ring-allreduce.md
├── 06-data-parallelism.md
├── 07-tensor-parallelism.md
├── 08-pipeline-parallelism.md
├── 09-pytorch-ddp-and-fsdp.md
├── 10-deepspeed-zero.md
├── 11-megatron-lm.md
├── 12-inference-systems.md
├── 13-kv-cache-management.md
├── 14-continuous-batching.md
├── 15-vllm-internals.md
├── 16-tensorrt-and-triton.md
├── 17-smartnics-and-dpus.md
├── 18-ai-storage-systems.md
└── 19-mlops-infrastructure.md
Cross-References
- Section 15 (Networking): RDMA, InfiniBand, RoCE, congestion control fundamentals
- Section 17 (Distributed Systems): Fault tolerance, consistency, coordination in training runs
- Section 31 (GPU Systems): GPU architecture, NVLink, CUDA programming model
- Section 33 (Hardware Architecture): PCIe topology, NUMA, memory bandwidth constraints
- Section 39 (Large-Scale Case Studies): Meta, Google, and Microsoft training cluster architectures