AI Storage Systems
Technical Overview
Storage is frequently the silent bottleneck in AI workloads. A cluster of 1,000 H100 GPUs can process data faster than almost any storage system can supply it. Training pipelines must read hundreds of terabytes of training data per day while simultaneously writing multi-hundred-gigabyte checkpoints every few minutes. AI storage requirements diverge sharply from traditional enterprise storage: workloads are predominantly sequential reads at very high concurrency, file sizes range from small images (JPEG, few KB) to massive sharded datasets (Parquet, 10s of GB), and checkpoint write patterns require high sustained write throughput with immediate consistency. This document covers the storage stack for AI training from dataset format design through to NVMe-direct GPU transfer.
Prerequisites
- Familiarity with Linux I/O stack (VFS, page cache, block layer, scheduler)
- Basic understanding of distributed filesystems (NFS, HDFS concepts)
- Knowledge of storage hardware (NVMe SSD, HDD, HBM)
- Understanding of DMA, PCIe, and CUDA memory model
- Familiarity with Python data loading libraries (PyTorch DataLoader, tf.data)
Core Content
The Data Loading Bottleneck
GPU compute capacity vs storage throughput:
H100 GPU: ~2,000 TFLOP/s (BF16)
Training ResNet-50 on ImageNet (100GB):
- GPU processes 1,000 images/second
- Each image ≈ 100KB (JPEG)
- Required I/O bandwidth: 100 MB/s
Training LLaMA 3 70B on 1T tokens:
- Tokenized data: ~2TB (2 bytes/token × 1T)
- Training time: ~30 days on 2000 H100s
- Required I/O: 2TB / (30d × 86400s) = 0.77 MB/s per GPU
- With 2000 GPUs: 1.5 TB/s aggregate I/O — exceeds most storage systems
→ Preprocessed sharded data with local SSD caching is essential
Why data loading stalls matter: PyTorch DataLoader prefetches batches in background worker processes. If the prefetch queue runs empty, the GPU waits. GPU utilization drops to 0 until the next batch arrives. With 1,000-GPU jobs at $30/GPU-hour, 10% data loading idle time costs $3,000/hour.
DataLoader worker model:
Main Process (GPU training):
batch = next(dataloader) # blocks if prefetch queue empty
loss = model(batch)
loss.backward()
Worker Process 0: reader → transform → prefetch_queue
Worker Process 1: reader → transform → prefetch_queue
Worker Process N: reader → transform → prefetch_queue
↑
num_workers (typically 4–16 per GPU)
Setting num_workers too low: I/O doesn't keep up with GPU. Too high: context switching overhead and memory pressure from prefetched batches.
Dataset Formats for AI
Raw files (JPEG/PNG directories):
- Simple, human-readable
- Problem: millions of small files → metadata bottleneck on any filesystem
- ls on 1M-file directory takes minutes on POSIX, seconds on Lustre with dir striping
- Inode contention: EXT4/XFS serializes on directory inode for concurrent reads
TFRecord (Google TensorFlow format): - Sequential binary format: length-prefixed protobuf records - One file per shard (10,000 records = 1 shard typical) - Enables tf.data sequential reads at full NVMe throughput - Not efficient for random access; good for streaming
WebDataset (open format, webdataset library):
- TAR archives containing matched image + label files
- image_000001.jpg + image_000001.json in same TAR
- Sequential reads at full bandwidth; compatible with cloud storage streaming
- Used by LAION-5B (5B image-text pairs) training
Parquet (Apache, columnar): - Columnar binary format with schema - Excellent for NLP datasets (text + metadata) - Row groups (typically 100–500MB each) enable parallel reads - Predicate pushdown: read only relevant rows/columns - Used by The Pile, C4, ROOTS, and most modern text datasets
MosaicML Streaming (2022): - Sharded binary format optimized for ML training - Random access within shards via index - Built-in download-as-you-train from S3 (avoids full dataset pre-download) - Used by MPT-7B, MPT-30B training
Petastorm (Uber, 2018): - Parquet + Unischema for ML-specific metadata - Read into PyTorch/TensorFlow native batches directly
POSIX vs Object Storage for ML
S3 (and compatible object stores: GCS, Azure Blob): - Cheap ($0.023/GB/month vs $0.10 for EBS NVMe) - High latency per object: 5–50ms first byte vs <1ms for NVMe - High per-request overhead: HTTPS + TLS + TCP - Limited IOPS: 3,500 PUT/s, 5,500 GET/s per prefix (with prefix partitioning: unlimited) - No POSIX semantics: no rename atomicity, no append
For ML training from S3: - Never read individual files; always large sequential objects (>64MB) - Parallel prefetch: 32+ concurrent S3 GET threads per worker - Use byte-range requests to read specific shards without downloading full files - S3 Express One Zone (2023): 10× lower latency, S3-compatible, same region only
POSIX parallel filesystems:
| Filesystem | Type | Throughput | Latency | Notes |
|---|---|---|---|---|
| Lustre | Parallel POSIX | 1–10 TB/s | 100–500 µs | HPC standard, used by Top500 |
| GPFS (IBM Spectrum Scale) | Parallel POSIX | 1–30 TB/s | 200 µs | Enterprise, more features |
| WekaFS | Scale-out NVMe | 10+ TB/s | 50 µs | Cloud-native, GPU-aware |
| BeeGFS | Parallel POSIX | 100 GB/s | 200 µs | Open-source, simpler |
| EFS (AWS) | Managed NFS | 10+ GB/s | 1–10 ms | Low ops overhead |
Lustre architecture for AI:
Lustre Cluster for 500-GPU Training Job:
┌─────────────────────┐
│ MGS / MDS │ (Metadata Server)
│ Dir entries, inodes │
│ Lustre metadata │
└─────────┬───────────┘
│ Lustre LNET (IB/Ethernet)
┌───────────────────┬─────────┘─────────────────────────┐
│ │ │
┌──────┴──────┐ ┌────────┴──────┐ ┌───────┴──────┐
│ OSS/OST 0 │ │ OSS/OST 1 │ ... │ OSS/OST N │
│ (NVMe RAID) │ │ (NVMe RAID) │ │ (NVMe RAID) │
│ 100 TB │ │ 100 TB │ │ 100 TB │
└──────────────┘ └───────────────┘ └───────────────┘
│ │ │
└────────────────────┴────────────────────────────────────┘
│ Lustre client (each GPU node)
/mnt/training-data/ (striped across all OSTs)
Lustre striping for ML: Each file is striped across multiple OSTs (Object Storage Targets). For large dataset shards, set stripe count = all OSTs to maximize read bandwidth:
lfs setstripe -c -1 /mnt/training-data/dataset/ # -1 means all OSTs
Per-file bandwidth: aggregate bandwidth / (N × clients). With 40 OSTs at 4 GB/s each and 100 clients, each client gets ~1.6 GB/s.
NVIDIA GPUDirect Storage (GDS)
Traditional GPU data loading: NVMe → kernel buffer → CPU DRAM (page cache) → GPU HBM. Each hop costs 10–20 µs and requires CPU involvement.
GPUDirect Storage: NVMe (or Lustre) DMA directly to GPU HBM, bypassing CPU DRAM entirely.
Traditional path:
NVMe SSD ──DMA──▶ CPU DRAM (page cache)
│ cuMemcpyH2D
▼
GPU HBM (~4 GB/s per PCIe lane)
GDS path:
NVMe SSD ──RDMA──▶ GPU HBM directly via DMA
(PCIe peer-to-peer)
Throughput: up to 25 GB/s (H100 PCIe Gen5)
CPU involvement: zero (async DMA completion)
GDS software stack:
- libcufile: user-space library that submits GDS I/O operations
- nvidia-fs: kernel driver that coordinates PCIe peer-to-peer DMA
- Storage backend: NVMe (local) or NVMe-oF (remote), Lustre with lnet GDS support
When GDS helps: Large sequential reads of training data where data preprocessing is minimal (tokenized text, pre-extracted features). For image augmentation workloads (JPEG decode + resize), CPU is needed anyway—GDS provides less benefit.
GDS benchmark (NVIDIA internal): Reading a 10 GB file to GPU: traditional 4.2 GB/s vs GDS 11.8 GB/s on H100. For pre-tokenized text datasets, 2–3× improvement in DataLoader throughput.
Checkpoint Storage
Model checkpoints are the most demanding write workload in AI training. GPT-3-scale checkpoints are 700 GB (FP16 weights) to 2.8 TB (FP32 + optimizer states). Checkpoints must be written quickly (to minimize training interruption during failure recovery) and reliably (corruption = restart from previous good checkpoint).
Checkpoint frequency vs storage cost:
GPT-3 175B training:
Checkpoint size: ~1.4 TB (BF16 weights + FP32 optimizer states, ZeRO-1)
Target checkpoint frequency: every 500 iterations
Iteration time: ~200ms → checkpoint every 100 seconds
Required write throughput: 1.4 TB / 100s = 14 GB/s
→ Requires parallel filesystem or distributed checkpoint write
PyTorch Distributed Checkpoint (DCP, 2023): Each DP/TP/PP rank writes its own shard in parallel to a directory. Reconstruction at load time uses the same rank topology. 16× faster than serializing full model on rank 0.
Asynchronous checkpointing: Write checkpoint to CPU DRAM asynchronously while GPU continues training. Checkpoint is then flushed to storage in the background. Libraries: tensorstore (Google), orbax (JAX). PyTorch 2.1 added torch.distributed.checkpoint.async_save.
DAOS (Distributed Asynchronous Object Storage, Intel): Key-value store optimized for HPC checkpoint workloads. Uses Intel Optane (3D XPoint) or NVMe, RDMA-native. Sustains 1.4 TB/s for checkpoint writes on a large cluster. Used at Argonne National Laboratory's Aurora supercomputer.
Checkpoint storage hierarchy:
GPU HBM ──async copy──▶ CPU DRAM (in-memory)
│ background flush
▼
Local NVMe SSD (fast, temporary)
│ async replication
▼
Lustre / S3 / DAOS (permanent)
MLflow and Weights & Biases for Experiment Tracking
The experiment tracking problem: Training LLMs involves thousands of hyperparameter configurations, ablation studies, and architecture variants. Without systematic tracking, researchers lose results, duplicate experiments, and cannot reproduce findings.
MLflow:
- Open-source, self-hosted or Databricks-managed
- mlflow.log_metric(), mlflow.log_artifact(), mlflow.log_param()
- Model registry: versioned model artifacts with stage transitions (Staging → Production)
- Backed by any S3-compatible store for artifacts; SQLite/MySQL/PostgreSQL for metrics
Weights & Biases (wandb): - Cloud-hosted or on-premises - System metrics: GPU utilization, memory, network via DCGM integration - Gradient/weight histograms: detect vanishing/exploding gradients automatically - Artifacts: dataset and model versioning with lineage tracking - wandb Sweeps: distributed hyperparameter search (Bayesian optimization or random)
Storage requirements for experiment tracking (large-scale): - Training curve metrics at 1-second resolution for 1000 runs: ~10 GB/day - System metrics (GPU stats): ~50 GB/day - Gradient histograms (every 100 steps): ~1 TB per training run - Checkpoint artifacts: ~1 TB per model per checkpoint - Total for a 100-run hyperparameter sweep: 10–50 TB over weeks
Historical Context
Early deep learning data loading used HDD-backed NFS, which caused severe I/O bottlenecks on large ImageNet training runs (2012–2015). The introduction of SSDs and TFRecord format (2016) largely solved the ResNet/ImageNet case. The transformer era (2017+) brought text datasets where sequential streaming was natural and small-file problems less severe. The LLM era (2020+) moved the bottleneck to checkpoint storage (GB→TB scale) and created demand for disaggregated storage at 10+ TB/s. NVIDIA GDS was introduced in 2020 but adoption was slow until 2023 when H100 + NVMe-oF deployment patterns became common.
Production Examples
LAION-5B preprocessing pipeline (2022): 5 billion image-text pairs, ~240 TB. Stored as WebDataset TAR archives in S3. Training used img2dataset + WebDataset streaming from S3 at 250K images/second peak throughput.
Google's C4 dataset and MosaicML StreamingDataset: C4 is 806 GB of Common Crawl text. MosaicML trained MPT-7B using their StreamingDataset on AWS S3 without pre-downloading the full dataset, using concurrent S3 GET streaming.
Meta LLaMA training storage: Reported use of Ceph object storage (self-hosted) and Lustre for checkpoint storage. Checkpoint frequency: every 1,000 iterations; asynchronous copy to Ceph.
Argonne Aurora (2023, 166 Petaflop exascale system): Uses DAOS (Intel) for checkpoint storage. 230 PB total DAOS capacity, 31 TB/s peak I/O bandwidth. Used for AI + scientific computing workloads.
Debugging Notes
DataLoader worker crash: RuntimeError: DataLoader worker is killed usually means worker ran out of shared memory (/dev/shm exhausted). Increase system kernel.shmmax or reduce num_workers. Set prefetch_factor=2 to limit pre-fetched batches.
Checkpoint write failure mid-write: Partial checkpoints are worse than no checkpoint (corrupted weights). Always write to a temp path and atomic rename to final path:
torch.save(state, '/mnt/checkpoints/ckpt_1000.tmp')
os.rename('/mnt/checkpoints/ckpt_1000.tmp', '/mnt/checkpoints/ckpt_1000.pt')
On distributed filesystems, rename() is not always atomic (NFS is not). Use Lustre or GPFS which provide atomic rename.
Lustre MDS overload: Training jobs with millions of small files (e.g., ImageNet as raw JPEGs) saturate the MDS (Metadata Server). Symptoms: ls hangs, stat() takes seconds. Fix: pack small files into large archives (WebDataset TAR) to reduce MDS IOPS.
GDS compatibility issues: GDS requires NVMe drives connected via PCIe (not SATA). Verify with nvidia-fs-test -d /dev/nvme0n1. If GDS silently falls back to CPU path (check cufile.log), you lose the speedup but no errors are reported.
Security Implications
Training data poisoning: Storage access controls must prevent unauthorized modification of training datasets. A compromised storage system can inject adversarial examples into training data, causing model backdoors. Use checksums on training shards (SHA-256 manifest) and verify before training.
Checkpoint exfiltration: Model checkpoint files contain the full model weights—the most valuable IP in an AI company. Storage encryption (AES-256 at rest), access logging, and network egress monitoring are essential. S3 bucket policies must restrict to known training instances only.
Side-channel timing on shared storage: In multi-tenant shared Lustre deployments, file access patterns are visible to the Lustre MDS. A competing tenant can infer what datasets another tenant is reading by monitoring MDS operation rates. Private dedicated Lustre for sensitive AI workloads is recommended.
DAOS security model: DAOS uses CART (Collective and Asynchronous Runtime) with Mercury RPC. Authentication is via certificate-based (x.509). An attacker on the DAOS network can intercept checkpoint data in transit unless TLS is configured for DAOS pool replication.
Performance Implications
Storage bandwidth vs GPU throughput ratio: Rule of thumb: provision at least 1 GB/s of storage read bandwidth per 100 TFLOP/s of GPU compute for typical transformer training. An 8×H100 node (16,000 TFLOP/s) needs 160 GB/s from storage—this exceeds any single NVMe drive (7 GB/s) and requires 8× local NVMe stripes or a parallel filesystem.
Checkpoint write batching: Writing 1 TB checkpoint synchronously stalls training for 1.4 TB / 14 GB/s = 100 seconds. Async checkpoint + CPU buffer reduces this to zero overhead.
Prefetch aggressiveness: PyTorch DataLoader with prefetch_factor=4 and num_workers=16 keeps 64 batches in-flight. At 100 MB/batch this is 6.4 GB of RAM. Tune based on available CPU DRAM.
Failure Modes and Real Incidents
Incident: Corrupted checkpoint causing 3-day training restart: A Lustre OST failure mid-write produced a partial checkpoint that passed size validation (same file size) but had zero bytes in the middle (sparse file). Training resumed from the corrupted checkpoint and produced NaN loss immediately. Mitigation: validate checkpoint integrity using gradient norm check on first iteration after resuming, or store a reference loss value in the checkpoint metadata.
Incident: S3 metadata throttling at training scale: A team training on 50M WebDataset files stored as individual S3 objects (not TAR archives) encountered S3 throttling at 5,000 GET/s per prefix. Training throughput dropped 90%. Fix: repack into 10GB TAR archives (50K files per archive → 1,000 total objects).
Incident: /dev/shm exhaustion crashing DataLoader workers: With num_workers=32 and large batch tensors, shared memory exhausted within hours. Docker containers default to 64 MB /dev/shm. Fix: mount a larger tmpfs (--shm-size=32g in Docker run) or switch to file system-backed shared memory.
Modern Usage
MosaicML Streaming (Databricks, 2024): Dominant format for LLM pretraining data. Supports arbitrary shard sizes, random access, and download-during-training from cloud storage.
fs-spec (fsspec) abstraction: Python library providing a unified filesystem interface over local, S3, GCS, Azure, HDFS, FTP. All major ML frameworks use fsspec, enabling the same DataLoader code to read from local NVMe or S3 transparently.
NVMe-oF (NVMe over Fabrics) for checkpoint storage: NVMe-oF/RDMA (RoCEv2) enables shared NVMe arrays to appear as local NVMe devices across the network, with <100 µs latency. Used for fast checkpoint storage without local SSD.
Future Directions
- Active storage / storage-side preprocessing: Push JPEG decoding and tokenization to storage nodes (Lustre HSM hooks, WekaFS weka.io functions), reducing CPU preprocessing overhead on compute nodes
- CXL memory pooling: CXL (Compute Express Link) enables memory-semantic access to remote DRAM/NVMe over PCIe 5/6 with <1 µs latency; could replace Lustre for random-access dataset serving
- Smart storage for training: Purpose-built storage controllers (Seagate Kinetic, Samsung SmartSSD) with on-drive FPGA for data preprocessing
- Differential checkpointing: Persist only changed model parameters per checkpoint (sparse diff), reducing checkpoint size by 10–100× for incremental saves
Exercises
-
Throughput calculation: Design the storage architecture for a training cluster with 512 H100 GPUs training on a 20TB text dataset. The model processes 100K tokens/second per GPU (BF16). Each token is 2 bytes. Calculate: (a) required aggregate storage read bandwidth, (b) required checkpoint write bandwidth (checkpoint every 200 iterations at 50ms/iteration, checkpoint size 2TB), (c) minimum number of Lustre OSTs at 4 GB/s each.
-
DataLoader optimization: Profile a PyTorch DataLoader reading from local NVMe with
num_workers=1. Identify where time is spent (I/O vs CPU augmentation vs GPU transfer). Progressively increasenum_workers,prefetch_factor, and addpin_memory=True. Plot GPU utilization vs DataLoader configuration. -
WebDataset vs raw files: Create two versions of a 1M-image dataset: (a) raw JPEG files in directories, (b) packed as 1,000 TAR archives (1,000 images each). Benchmark sequential read throughput for each. Measure MDS IOPS if using Lustre (or inode lookup rate if POSIX).
-
GDS micro-benchmark: Using
cufilelibrary, benchmark file-to-GPU read throughput for: (a) standardread()+cudaMemcpy(), (b)cuFileRead()with GDS enabled. Test with 1 GB, 10 GB, and 100 GB files. Compare to theoretical PCIe Gen4 bandwidth. -
Checkpoint integrity checker: Write a Python script that loads a PyTorch checkpoint file, computes a SHA-256 hash of all parameter tensors, and stores it as metadata. Extend to verify integrity on load and detect partial writes or corruption.
References
- NVIDIA GPUDirect Storage Documentation: https://docs.nvidia.com/gpudirect-storage/
- Intel DAOS Documentation: https://docs.daos.io/
- Lustre Architecture Guide: https://wiki.lustre.org/Architecture
- WebDataset documentation: https://github.com/webdataset/webdataset
- MosaicML Streaming documentation: https://docs.mosaicml.com/projects/streaming/
- Zhao et al., "PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel," VLDB 2023
- Paszke et al., "PyTorch: An Imperative Style, High-Performance Deep Learning Library," NeurIPS 2019
- Schuhmann et al., "LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models," NeurIPS 2022