Section 32: AI Infrastructure

Purpose and Scope

AI infrastructure is the systems engineering discipline underlying large-scale machine learning. Training frontier models like GPT-4 or Gemini Ultra requires clusters of thousands of accelerators, custom high-speed networks, distributed training frameworks, and specialized storage systems, all operating reliably for weeks or months. Inference serving introduces a separate set of constraints (latency, throughput, and KV cache memory capacity) and its own scheduling techniques, such as continuous batching. This section covers the full stack from GPU cluster networking and collective communication primitives through distributed training frameworks, inference serving systems, SmartNICs/DPUs, and the MLOps infrastructure that ties it all together.

Prerequisites

GPU architecture and CUDA programming model (Section 31)
Distributed systems fundamentals (Section 17)
Network internals, RDMA concepts (Sections 15, 16)
Operating system and Linux kernel basics (Sections 03, 04)
Basic machine learning concepts (gradient descent, model parameters)

Learning Objectives

By the end of this section, you will be able to:

Design a GPU cluster network for a training workload and justify topology choices
Explain Ring-AllReduce and its alternatives, and describe when each collective is optimal
Distinguish tensor parallelism, pipeline parallelism, and data parallelism, and describe how they are combined
Explain how FSDP and DeepSpeed ZeRO reduce per-GPU memory requirements
Describe continuous batching and paged attention and explain why they improve GPU utilization in inference
Explain how KV cache memory is managed in vLLM and why it matters for throughput
Describe how SmartNICs and DPUs offload work from host CPUs in AI clusters
Design an MLOps pipeline for training, versioning, and serving a large model

Architecture Overview

AI Training Cluster Topology

  Compute Nodes (x1000s)
  +------------------+
  | 8x H100 GPUs     |  <-- NVLink 4.0 fabric within node (900 GB/s)
  | 2x AMD EPYC CPUs |
  | 1TB DDR5 RAM     |
  | 8x 400GbE / NDR  |  <-- one NIC per GPU for network offload
  | InfiniBand ports |
  +------------------+
          |
          | InfiniBand HDR200 / NDR400
          |
  +-------v-----------+
  | Leaf Switches     |  Rail-optimized topology:
  | (one per GPU rail)|  GPU 0 of every node -> same leaf switch
  +-------------------+  (reduces hops for collective operations)
          |
  +-------v-----------+
  | Spine Switches    |  Fat-tree or 3-stage Clos
  +-------------------+

  Storage:
  +------------------+
  | Parallel FS      |  Lustre / GPFS / WekaIO
  | (RDMA-attached)  |  1 TB/s aggregate throughput for checkpoint I/O
  +------------------+

Collective Communication: Ring-AllReduce

  Data Parallel Training: each GPU holds full model, different data shards

  N GPUs forming a logical ring for gradient aggregation:

  GPU0 -> GPU1 -> GPU2 -> GPU3 -> GPU0 (ring)

  Phase 1: Reduce-Scatter (N-1 steps)
  The gradient buffer is split into N chunks. In each step, every GPU
  sends one chunk to the next GPU and accumulates the chunk it
  receives from the previous GPU.

  Phase 2: AllGather (N-1 steps)
  Each GPU now holds the final sum for one chunk (1/N of the gradient).
  The fully reduced chunks are passed around the ring until every GPU
  has all of them.

  Result: All GPUs have the fully reduced (summed) gradient.

  Communication volume: 2*(N-1)/N * gradient_size per GPU
  (approaches 2x gradient_size, independent of N)

  NCCL implements this (along with tree-based variants) over NVLink
  within a node and RDMA (RoCE v2 or InfiniBand verbs) between nodes,
  keeping the host kernel's network stack off the data path.
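
The two phases can be simulated in a few lines of plain Python. This is a minimal sketch, not NCCL's implementation: ranks are modelled as Python lists, each "send" is a list copy, and the function name ring_allreduce and the requirement that the buffer length be divisible by N are assumptions made for illustration.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over a logical ring.

    buffers: one equal-length list of numbers per rank.
    Returns (per-rank result buffers, elements sent by each rank).
    """
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "sketch assumes the buffer splits evenly into N chunks"
    chunk = size // n
    data = [list(b) for b in buffers]  # work on copies
    sent_per_rank = 0

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In step s, rank r sends chunk (r - s) mod N
    # to rank r + 1, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        outgoing = [data[r][span((r - s) % n)] for r in range(n)]  # snapshot
        for r in range(n):
            dst, seg = (r + 1) % n, span((r - s) % n)
            data[dst][seg] = [a + b for a, b in zip(data[dst][seg], outgoing[r])]
        sent_per_rank += chunk
    # Rank r now holds the fully reduced chunk (r + 1) mod N.

    # Phase 2: all-gather. In step s, rank r forwards chunk (r + 1 - s) mod N
    # to rank r + 1, which overwrites its stale copy.
    for s in range(n - 1):
        outgoing = [data[r][span((r + 1 - s) % n)] for r in range(n)]
        for r in range(n):
            data[(r + 1) % n][span((r + 1 - s) % n)] = outgoing[r]
        sent_per_rank += chunk

    return data, sent_per_rank


grads = [[rank * 10 + i for i in range(8)] for rank in range(4)]
reduced, sent = ring_allreduce(grads)
print(reduced[0], sent)
```

With 4 ranks and 8-element buffers, each rank sends 12 elements, which matches 2*(N-1)/N * gradient_size = 2 * 3/4 * 8.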

Model Parallelism Strategies

  Data Parallelism (DP):
  +----+  +----+  +----+  +----+
  | M  |  | M  |  | M  |  | M  |   M = full model copy
  | B1 |  | B2 |  | B3 |  | B4 |   B = different data batch
  +----+  +----+  +----+  +----+
  AllReduce gradients after each backward pass

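  Tensor Parallelism (TP) - Megatron-style:
  Single large matrix multiplication Y = X*A split across GPUs:
  +---GPU0----+  +---GPU1----+
  | A[:,:N/2] |  | A[:,N/2:] |   A = weight matrix with N columns
  | col shard |  | col shard |
  +-----------+  +-----------+
  Each GPU computes a column slice of Y; AllGather concatenates the
  slices. Megatron pairs this with a row-sharded second matrix so each
  pair of layers needs only one AllReduce (sum of partial outputs).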
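
Which collective combines the shards depends on how the weight matrix is split. The sketch below uses plain Python lists as matrices to show both cases: a column-sharded layer on its own produces output slices that are concatenated (AllGather), while a column-sharded layer followed by a row-sharded layer produces full-shape partial outputs that are summed (AllReduce). All function names are hypothetical and the "GPUs" are loop iterations; this illustrates the arithmetic only, not Megatron-LM's code.

```python
def matmul(x, w):
    """Plain-Python matrix product of x (m x k) and w (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def relu(m):
    return [[max(0, v) for v in row] for row in m]

def col_shards(w, parts):
    """Split w into `parts` groups of columns."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def row_shards(w, parts):
    """Split w into `parts` groups of rows."""
    step = len(w) // parts
    return [w[p * step:(p + 1) * step] for p in range(parts)]

def column_parallel(x, a, parts):
    """Y = X*A with A column-sharded: each shard yields a column slice of Y."""
    slices = [matmul(x, a_p) for a_p in col_shards(a, parts)]
    # AllGather: concatenate the column slices row by row.
    return [sum((s[i] for s in slices), []) for i in range(len(x))]

def mlp_tensor_parallel(x, a, b, parts):
    """relu(X*A)*B with A column-sharded and B row-sharded."""
    partials = []
    for a_p, b_p in zip(col_shards(a, parts), row_shards(b, parts)):
        h_p = relu(matmul(x, a_p))         # local slice of the hidden activations
        partials.append(matmul(h_p, b_p))  # full-shape partial sum of the output
    # AllReduce: element-wise sum of the partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

x = [[1, -2], [3, 4]]
a = [[1, 0, -1, 2], [2, 1, 0, -1]]
b = [[1, 2], [0, 1], [-1, 0], [2, 1]]
print(column_parallel(x, a, 2) == matmul(x, a))
print(mlp_tensor_parallel(x, a, b, 2) == matmul(relu(matmul(x, a)), b))
```

The element-wise nonlinearity can be applied locally because each shard holds complete columns of the hidden activations, which is why no communication is needed between the two matrix multiplications.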

  Pipeline Parallelism (PP):
  Layers distributed across GPUs, micro-batches flow through:
  GPU0: layers 0-7
  GPU1: layers 8-15
  GPU2: layers 16-23
  GPU3: layers 24-31
  Micro-batch 1: GPU0->GPU1->GPU2->GPU3
  Micro-batch 2: enters GPU0 as soon as micro-batch 1 moves on to GPU1

  3D Parallelism (Megatron + DeepSpeed):
  DP x TP x PP = total GPUs
  e.g., 8 DP * 8 TP * 16 PP = 1024 GPUs
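
To make the DP x TP x PP factorization concrete, the sketch below maps a global rank to its (dp, pp, tp) coordinates for the 8 x 8 x 16 example. The layout is an assumption chosen for illustration: TP varies fastest so that each TP group is a run of consecutive ranks, which lands on one node when a node has 8 GPUs. Real frameworks pick their own ordering, and the function names here are hypothetical.

```python
from collections import namedtuple

Coord = namedtuple("Coord", ["dp", "pp", "tp"])

def rank_to_coord(rank, dp, pp, tp):
    """Map a global rank to (dp, pp, tp) indices, with tp varying fastest."""
    if not 0 <= rank < dp * pp * tp:
        raise ValueError("rank out of range")
    return Coord(dp=rank // (pp * tp), pp=(rank // tp) % pp, tp=rank % tp)

def tp_group(rank, tp):
    """Ranks that jointly hold one layer's sharded weights."""
    base = rank - rank % tp
    return list(range(base, base + tp))

def dp_group(rank, dp, pp, tp):
    """Ranks that hold the same model shard in each data-parallel replica."""
    offset = rank % (pp * tp)
    return [replica * pp * tp + offset for replica in range(dp)]

DP, PP, TP = 8, 16, 8          # 8 * 16 * 8 = 1024 GPUs
GPUS_PER_NODE = 8              # assumed node size

print(rank_to_coord(13, DP, PP, TP))
print(tp_group(13, TP))
print({r // GPUS_PER_NODE for r in tp_group(13, TP)})  # nodes spanned by the TP group
print(dp_group(13, DP, PP, TP))
```

Under this layout the bandwidth-hungry TP collectives stay on NVLink inside one node, while the DP gradient AllReduce runs between ranks that are 128 apart and therefore crosses the network fabric.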

Inference Serving: vLLM / Continuous Batching

  Naive static batching:
  Request arrives -> wait for batch to fill -> process -> respond
  Problem: short requests wait for long ones (head-of-line blocking)

  Continuous Batching (iteration-level scheduling):

  Time ->
  Batch:  [req1] [req1] [req1+req2] [req2+req3] [req3] ...
                         ^ req2 added        ^ req1 done, req3 added

  Each forward pass ("iteration") has a dynamic batch size.
  New requests inserted as soon as a slot frees.
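
The difference between the two schedulers can be seen in a toy simulation. This sketch makes several simplifying assumptions: all requests are already queued at time zero, each forward pass produces exactly one token per running request, prefill cost is ignored, and every request needs at least one token. The function names are hypothetical.

```python
from collections import deque

def static_batching(lengths, max_batch):
    """Completion iteration per request when a batch runs until its longest member ends."""
    done = [0] * len(lengths)
    t = 0
    for start in range(0, len(lengths), max_batch):
        batch = range(start, min(start + max_batch, len(lengths)))
        t += max(lengths[i] for i in batch)
        for i in batch:
            done[i] = t  # responses are returned when the whole batch finishes
    return done

def continuous_batching(lengths, max_batch):
    """Completion iteration per request with iteration-level scheduling."""
    waiting = deque(range(len(lengths)))
    running = {}  # request id -> tokens still to generate
    done = [0] * len(lengths)
    t = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next pass.
        while waiting and len(running) < max_batch:
            i = waiting.popleft()
            running[i] = lengths[i]
        t += 1  # one forward pass: one new token for every running request
        for i in list(running):
            running[i] -= 1
            if running[i] == 0:
                done[i] = t
                del running[i]
    return done

lengths = [2, 8, 2, 2]  # tokens to generate per request
print(static_batching(lengths, max_batch=2))      # [8, 8, 10, 10]
print(continuous_batching(lengths, max_batch=2))  # [2, 8, 4, 6]
```

In this example the three short requests finish at iterations 2, 4, and 6 instead of 8, 10, and 10, and the whole workload completes in 8 forward passes instead of 10.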

  Paged Attention (vLLM):
  KV cache split into fixed-size "pages" (blocks) of token slots.
  Physical pages allocated on demand; virtual page table maps
  logical token positions to physical memory.

  Logical KV:   [tok0..tok15][tok16..tok31][tok32..tok47]
  Physical mem: page7        page2         page19

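  Benefit: no external fragmentation (waste is limited to the last,
  partially filled block of each sequence), KV blocks can be shared
  across requests with a common prefix (prefix caching), and >90% of
  the GPU memory reserved for KV cache holds live tokens.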
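
The logical-to-physical mapping can be sketched as a small allocator. This is an illustration under stated assumptions, not vLLM's actual data structures or API: the class name BlockAllocator and its methods are hypothetical, the block size of 16 matches the diagram, and prefix sharing (which would need reference counts per block) is left out.

```python
class BlockAllocator:
    """Toy paged KV cache: maps (sequence, token position) to (physical block, offset)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # seq id -> list of physical block ids
        self.lengths = {}                    # seq id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for the next token, allocating a block only when needed."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # no blocks yet, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks; a real scheduler would preempt or swap")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return self.lookup(seq_id, n)

    def lookup(self, seq_id, pos):
        """Translate a logical token position into (physical block id, slot in block)."""
        return self.tables[seq_id][pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id))
        del self.lengths[seq_id]


alloc = BlockAllocator(num_blocks=4)
for _ in range(33):                # 33 tokens need 3 blocks of 16
    alloc.append_token("req-a")
print(alloc.tables["req-a"], alloc.lookup("req-a", 17))
alloc.release("req-a")
print(len(alloc.free))
```

Because blocks are allocated one at a time as a sequence grows, memory is never reserved for tokens that are not generated, and the only unused slots are those at the tail of each sequence's last block.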

Key Concepts

NCCL (NVIDIA Collective Communications Library): Optimized collective ops (AllReduce, AllGather, ReduceScatter, Broadcast) that use RDMA and NVLink to bypass the CPU. The default communication backend for the major training frameworks on NVIDIA GPUs.
InfiniBand vs RoCE: InfiniBand (IB) is a dedicated lossless fabric with hardware-based congestion control. RoCE v2 (RDMA over Converged Ethernet) runs on standard Ethernet but requires PFC/ECN for losslessness. IB dominates HPC/AI clusters; RoCE is common in hyperscale (Meta, Microsoft).
RDMA (Remote Direct Memory Access): Network operations that read/write remote host memory without involving the remote CPU. Eliminates kernel crossing on both send and receive paths. Zero-copy, low-latency.
FSDP (Fully Sharded Data Parallel): PyTorch's ZeRO-3 equivalent. Model parameters, gradients, and optimizer states are sharded across data-parallel ranks. Each GPU holds only 1/N of the model. AllGather parameters before each layer's forward pass; ReduceScatter gradients after backward.
DeepSpeed ZeRO: Zero Redundancy Optimizer. Three stages: ZeRO-1 shards optimizer states, ZeRO-2 adds gradient sharding, ZeRO-3 adds parameter sharding. Enables training models larger than a single GPU's VRAM.
Tensor Parallelism: Intra-layer parallelism splitting weight matrices across GPUs. Requires AllReduce or AllGather at each layer boundary. High NVLink bandwidth is essential, so TP groups are usually kept within a single node or NVLink domain.
Pipeline Parallelism: Inter-layer parallelism assigning consecutive model layers to different GPU groups. Introduces pipeline bubbles; mitigated by micro-batching and schedules such as GPipe, PipeDream (1F1B), and interleaved 1F1B.
KV Cache: During autoregressive LLM inference, attention key/value projections from all previous tokens are cached to avoid recomputation. Scales as O(batch * seq_len * layers * d_model) in GPU memory; grouped-query and multi-query attention shrink the per-token key/value width below d_model.
Continuous Batching: Inference scheduling where the batch composition changes at each forward pass step, allowing new requests to join and completed requests to exit without waiting for the full batch.
SmartNIC / DPU: Network cards with embedded ARM cores and programmable data-plane engines (NVIDIA BlueField, Marvell OCTEON). Offload storage I/O, encryption, telemetry aggregation, and RDMA operations from host CPUs.
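
The memory claims in the ZeRO and KV Cache entries can be checked with back-of-envelope arithmetic. The sketch below assumes mixed-precision Adam training with the accounting used in the ZeRO paper (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 weights plus Adam momentum and variance) and ignores activations, buffers, and fragmentation. The function names and the 32-layer example model are hypothetical.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU for ZeRO stage 0 (plain DP) through 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights + Adam momentum + Adam variance
    if stage >= 1:
        optim /= num_gpus     # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus     # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= num_gpus    # ZeRO-3 / FSDP: also shard parameters
    return params + grads + optim

def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache: one key and one value vector per token, layer, and KV head."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem

# 7.5B parameters sharded over 64 data-parallel GPUs.
for stage in range(4):
    gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
    print(f"ZeRO-{stage}: {gb:.1f} GB per GPU")

# Hypothetical model: 32 layers, 32 KV heads of width 128 (d_model = 4096), fp16.
gib = kv_cache_bytes(batch=1, seq_len=4096, layers=32, kv_heads=32, head_dim=128) / 2**30
print(f"KV cache for one 4096-token sequence: {gib:.1f} GiB")
```

Under these assumptions the per-GPU model state falls from 120 GB with plain data parallelism to about 1.9 GB with ZeRO-3, and a single 4096-token sequence occupies 2 GiB of KV cache, which is why batch size during inference is usually limited by KV memory rather than compute.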

Major Historical Milestones

Year	Milestone
2012	AlexNet wins ImageNet: deep CNN training on GPUs demonstrated
2012	Google DistBelief: first large-scale distributed training system described
2016	NCCL 1.0 released
2017	Ring-AllReduce popularized for deep learning (Baidu research)
2017	Uber Horovod: Ring-AllReduce for distributed TensorFlow
2018	PyTorch DDP (DistributedDataParallel): gradient bucketing, overlap with backward pass
2019	Megatron-LM: tensor parallelism for GPT-scale transformer training
2019	ZeRO paper (DeepSpeed library released 2020): sharded training state for models far larger than one GPU's memory
2020	NVIDIA DGX A100 system: 8 A100s fully connected through NVSwitch
2020	GPT-3 trained: 175B params on a large V100 cluster
2021	Triton language (OpenAI) 1.0 for GPU kernel programming without CUDA C
2021	Microsoft/NVIDIA 530B parameter model (Megatron-Turing NLG) trained with 3D parallelism
2021	FSDP introduced in FairScale; native PyTorch FSDP ships in PyTorch 1.11 (2022)
2022	NVIDIA H100 with Transformer Engine (FP8 training)
2023	vLLM 0.1 released alongside the PagedAttention paper: paged attention + continuous batching
2023	AMD MI300X: 192GB HBM3, competitive inference performance
2024	NVIDIA Blackwell NVLink 5.0, 1.8 TB/s per GPU; GB200 NVL72 rack systems

Modern Relevance

AI infrastructure has become one of the fastest-growing areas of systems engineering. Frontier model training consumes tens of thousands of GPU-months and demands cluster reliability approaching carrier grade: in a synchronous 10,000-GPU training run, a single node failure stalls every rank and can discard all work since the last checkpoint. Inference efficiency directly determines unit economics for AI products: a 2x improvement in tokens/sec/GPU roughly halves the GPU cost per token served. The discipline draws on networking (RDMA, congestion control), distributed systems (fault tolerance, checkpointing), storage (checkpoint I/O at PB scale), and hardware architecture simultaneously. Engineers who understand this stack end-to-end are among the most sought-after in the industry.

File Map

32-ai-infrastructure/
├── 00-overview.md              <- This file
├── 01-ai-training-clusters.md
├── 02-infiniband-and-roce.md
├── 03-rdma-internals.md
├── 04-nccl-and-collectives.md
├── 05-ring-allreduce.md
├── 06-data-parallelism.md
├── 07-tensor-parallelism.md
├── 08-pipeline-parallelism.md
├── 09-pytorch-ddp-and-fsdp.md
├── 10-deepspeed-zero.md
├── 11-megatron-lm.md
├── 12-inference-systems.md
├── 13-kv-cache-management.md
├── 14-continuous-batching.md
├── 15-vllm-internals.md
├── 16-tensorrt-and-triton.md
├── 17-smartnics-and-dpus.md
├── 18-ai-storage-systems.md
└── 19-mlops-infrastructure.md

Cross-References

Section 15 (Networking): RDMA, InfiniBand, RoCE, congestion control fundamentals
Section 17 (Distributed Systems): Fault tolerance, consistency, coordination in training runs
Section 31 (GPU Systems): GPU architecture, NVLink, CUDA programming model
Section 33 (Hardware Architecture): PCIe topology, NUMA, memory bandwidth constraints
Section 39 (Large-Scale Case Studies): Meta, Google, and Microsoft training cluster architectures