Model Parallelism Deep Dive

Technical Overview

Model parallelism encompasses a family of techniques for distributing a single neural network across multiple accelerators. While data parallelism replicates the entire model and splits the data, model parallelism is necessary when a single model does not fit on a single device, or when fitting it produces suboptimal compute efficiency. The three main dimensions are: Tensor Parallelism (splitting individual weight matrices), Pipeline Parallelism (splitting layers across devices), and Expert Parallelism (routing tokens to different expert sub-networks). For frontier models at 100B+ parameters, all three are combined with data parallelism into "4D parallelism."

Prerequisites

Understanding of transformer architecture (self-attention, MLP, layer norm)
Linear algebra foundations (matrix partitioning, block matrix multiplication)
Familiarity with PyTorch nn.Module and CUDA streams
Knowledge of collective communication primitives (AllReduce, AllGather, ReduceScatter, All-to-All)
Understanding of distributed training from 02-distributed-training.md

Core Content

Tensor Parallelism: Megatron-LM Approach

Tensor Parallelism (TP) distributes weight matrices across devices so that each device computes a fraction of each layer's output.

Mathematical foundation: For a linear layer Y = XA + b: - X: input [B×S, H_in] - A: weight [H_in, H_out] - Y: output [B×S, H_out]

Column Parallel: Partition A along columns: A = [A₁|A₂|...|Aₚ] where each Aᵢ is [H_in, H_out/p]. Each device i computes Yᵢ = X × Aᵢ locally, giving partial output [B×S, H_out/p]. No communication needed here (all devices have full X).

Row Parallel: Partition A along rows: A = [A₁ᵀ; A₂ᵀ; ...; Aₚᵀ]. Also partition input X = [X₁|X₂|...|Xₚ] where each device i gets Xᵢ = [B×S, H_in/p]. Each device i computes Yᵢ = Xᵢ × Aᵢ (partial sum). Final Y = ΣYᵢ requires AllReduce.

Full transformer block with TP=4:

Input X [B×S, H]  ──────── broadcast to all 4 GPUs ────────▶

                    GPU0     GPU1     GPU2     GPU3
Q proj (col-par): [H,H/4] [H,H/4] [H,H/4] [H,H/4]
K proj (col-par): [H,H/4] [H,H/4] [H,H/4] [H,H/4]
V proj (col-par): [H,H/4] [H,H/4] [H,H/4] [H,H/4]
                    │        │        │        │
                Heads 0-7 Heads 8-15 Heads 16-23 Heads 24-31
                    (each GPU owns 1/4 of attention heads)
                    │        │        │        │
Output proj (row):  Partial sums ──▶ AllReduce ──▶ Full [B×S, H]

MLP FC1 (col-par): [H, 4H/4] on each GPU
MLP Act (local):   GELU applied locally
MLP FC2 (row-par): AllReduce → Full [B×S, H]

Communication count: 2 AllReduce per transformer layer (one for attention output, one for MLP output). For backward pass: 2 more AllReduce (for gradients). Total: 4 AllReduce per layer.

Sequence Parallelism (Megatron-LM v3, 2023): The activations in the non-TP regions (LayerNorm, Dropout) are replicated across TP workers. For long sequences, this wastes memory. Sequence Parallelism partitions the sequence dimension across TP workers for these operations, further reducing activation memory.

Non-TP region (e.g., LayerNorm):
  With SP: each GPU holds [B, S/p, H] (sequence partitioned)
  Before TP region: AllGather to reconstruct [B, S, H]
  After TP region: ReduceScatter instead of AllReduce
  → Same communication volume, but sequence is never fully assembled

SP reduces activation memory by an additional factor of TP degree (up to 8×).

Pipeline Parallelism: Schedules and Bubble Analysis

Pipeline Parallelism assigns non-overlapping sets of model layers to different GPUs. A "pipeline stage" processes the activations flowing from the previous stage.

Bubble overhead formula:

Bubble fraction = (p - 1) / (p - 1 + M)

where:
  p = number of pipeline stages
  M = number of micro-batches per global batch

For p=8, M=8:  7/15 = 46.7% wasted
For p=8, M=32: 7/39 = 17.9% wasted
For p=8, M=64: 7/71 = 9.9%  wasted

GPipe (Google, NeurIPS 2019): F-then-B schedule. All micro-batches run forward, then all run backward. Memory: must store activations for all M micro-batches simultaneously = p×M memory in worst case.

GPipe (p=4, M=4):
S0: F0 F1 F2 F3 B3 B2 B1 B0
S1:    F0 F1 F2 F3 B3 B2 B1 B0
S2:       F0 F1 F2 F3 B3 B2 B1 B0
S3:          F0 F1 F2 F3 B3 B2 B1 B0
             ▲ pipeline fill bubble ▲      ▲ drain bubble ▲

PipeDream 1F1B (CMU, SOSP 2019): After pipeline fill, each stage alternates one forward and one backward micro-batch. Memory: only p copies of activations (one per in-flight micro-batch).

PipeDream 1F1B (p=4, M=8):
S0: F0 F1 F2 F3 B0 F4 B1 F5 B2 F6 B3 F7 B4 B5 B6 B7
S1:    F0 F1 F2 F3 B0 F4 B1 F5 B2 F6 B3 F7 B4 B5 B6 B7
S2:       F0 F1 F2 F3 B0 F4 B1 F5 B2 F6 B3 B4 B5 B6 B7
S3:          F0 F1 F2 F3 B0 F1 B1 F2 B2 F3 B3 B4 B5 B6 B7

Virtual Pipeline Stages (Megatron-LM v2): Instead of assigning consecutive layers to each device, assign non-contiguous "chunks" of layers. If TP=4 and PP=4 without virtual stages, GPU 0 gets layers 0–15, GPU 1 gets layers 16–31, etc. With 2 virtual stages, GPU 0 gets layers 0–7 and layers 32–39. This doubles the number of "pipeline stages" from 4 to 8, reducing the bubble by roughly 2×, at the cost of doubling inter-stage communication.

Interleaved 1F1B schedule: Combines virtual stages with 1F1B, achieving near-zero bubble for large M.

Expert Parallelism (EP) for Mixture of Experts

Mixture of Experts (MoE) replaces some or all dense FFN layers with a set of expert FFN sub-networks, where each input token is routed to K (typically 2) of E total experts.

MoE layer:

Input: x [B×S, H]
Router: G(x) = softmax(x × W_g) → [B×S, E]   (expert probabilities)
Top-K selection: select top-K experts per token
Expert FFN: each expert Eᵢ is a standard FFN [H → 4H → H]

Output: y = Σᵢ G(x)ᵢ × Eᵢ(x)  (weighted sum of selected experts)

Expert Parallelism: With E=64 experts and 64 GPUs in the EP group, place one expert per GPU. Token routing requires All-to-All communication: each GPU sends tokens to the GPU holding the expert they need, and receives tokens from GPUs whose tokens need its expert.

Expert Dispatch (All-to-All):
GPU0 (Expert 0): receives tokens from all GPUs destined for Expert 0
GPU1 (Expert 1): receives tokens from all GPUs destined for Expert 1
...
GPU63 (Expert 63): receives tokens from all GPUs destined for Expert 63

Each GPU:
1. Route local tokens → compute destination GPUs
2. All-to-All send: dispatch tokens to correct expert GPUs
3. Expert computation: run FFN on received tokens
4. All-to-All recv: return outputs to originating GPUs
5. Combine: weighted sum of expert outputs

Expert load imbalance: Routing networks tend to concentrate token flow on a few popular experts ("expert collapse"). Auxiliary load-balancing loss penalizes uneven expert utilization during training. At inference, "token dropping" limits each expert to a maximum capacity (1.0–1.5× average tokens per expert); excess tokens are dropped or passed through residual.

Mixtral 8×7B (Mistral AI, 2023): 8 experts, 2 activated per token, 45B total parameters but 12B active per token. Expert Parallelism=8 for distributed inference.

Switch Transformer (Google, 2022): k=1 (single expert selection), simplifying routing. E=128 or 2048 experts. Training to 1.6T parameters was demonstrated.

Tensor Parallel Communication Volume Analysis

For a transformer with hidden_dim=H, sequence_len=S, batch_size=B, TP_degree=p:

Per-layer communication: - Forward pass AllReduce: 2 × [B×S×H] bytes (attention + MLP) - Backward pass AllReduce: 2 × [B×S×H] bytes (same, for gradient reduction) - Total per layer: 4 × B×S×H bytes

For GPT-3 (H=12288, S=2048, B=1, TP=8): - Per AllReduce: 2 × 1 × 2048 × 12288 × 2 bytes (BF16) = 100 MB - Per transformer block (4 AllReduces): 400 MB - 96 transformer blocks: 38.4 GB total communication per iteration

H100 NVLink bandwidth per GPU: 900 GB/s (NVSwitch 3.0) - Time for one AllReduce (100 MB at 900 GB/s): 0.11 ms - Total AllReduce overhead per iteration (384 ops): 42 ms - Typical forward+backward time for one layer: ~5 ms per layer × 96 = 480 ms - Communication fraction: 42/480 ≈ 8.8% (acceptable)

If TP were inter-node (200 Gbps IB): - Time for one AllReduce (100 MB at 200 Gbps effective): 4 ms - Total AllReduce overhead: 384 × 4 ms = 1,536 ms >> compute - TP must be confined to NVLink-connected GPUs

Optimal Parallelism Strategy Selection

Decision tree:

Does model fit on one GPU?
  YES → Data Parallel only
  NO ↓

TP degree selection:
  Set TP = min(GPUs_per_node, model_size / single_gpu_memory)
  Typically TP=8 for node with 8 NVLink GPUs

PP degree selection:
  PP = total_GPUs / (TP × DP)
  Large PP (>8) requires many micro-batches (M > 4×p) for acceptable bubble

DP degree selection:
  DP = total_GPUs / (TP × PP)
  DP throughput scales linearly; larger DP = larger global batch

ZeRO stage selection:
  If model_per_DP_rank > GPU_memory: enable ZeRO-2 or ZeRO-3
  ZeRO-3 with TP conflicts: use FSDP within DP group only

Rule of thumb for transformer models: | Model Size | TP | PP | DP | GPUs | |------------|----|----|----|----| | 7B | 1–2 | 1 | 8 | 8 | | 13B | 2 | 1 | 8 | 16 | | 70B | 4–8 | 4 | 8 | 128 | | 175B | 8 | 16 | 16 | 2,048 | | 530B | 8 | 35 | 35 | 9,800 |

Megatron-LM Code Structure

# megatron/core/tensor_parallel/layers.py (simplified)

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, input_size, output_size, ...):
        super().__init__()
        world_size = get_tensor_model_parallel_world_size()
        self.output_size_per_partition = divide(output_size, world_size)
        # Each rank only holds its partition of the weight
        self.weight = Parameter(torch.empty(
            self.output_size_per_partition, input_size, ...))

    def forward(self, input_):
        # input_ is full tensor [B, S, H] (no need to split input for column-parallel)
        output_parallel = F.linear(input_, self.weight, self.bias)
        # output_parallel is [B, S, H/p] — partial on each rank
        if self.gather_output:
            output = gather_from_tensor_model_parallel_region(output_parallel)
        else:
            output = output_parallel  # pass partial to next row-parallel layer
        return output

class RowParallelLinear(torch.nn.Module):
    def forward(self, input_):
        # input_ is [B, S, H/p] — already partitioned from column-parallel layer
        output_parallel = F.linear(input_, self.weight)
        # Reduce sum across tensor parallel ranks
        output = reduce_from_tensor_model_parallel_region(output_parallel)
        return output + self.bias  # bias added after reduction

The key insight: reduce_from_tensor_model_parallel_region wraps torch.distributed.all_reduce in both forward (sum) and backward (identity, since backward through AllReduce is just pass-through).

Historical Context

The first model-parallel neural network training at scale was Google's 8.5B-parameter LaMDA predecessor using a combination of model and data parallelism (2019). Megatron-LM (NVIDIA 2019) formalized tensor parallelism and achieved record language model training at the time (8.3B GPT-2 variant). The combination of TP + PP was published in the Megatron-LM v2 paper (SC 2021). Mixture of Experts had been proposed much earlier (Jacobs et al. 1991, Jordan & Jacobs 1994) but was adapted for transformers in GShard (Google 2021) and Switch Transformer (Google 2022). The 4D parallelism framework (DP+TP+PP+EP) was first implemented in Megatron-DeepSpeed for the Megatron-Turing NLG 530B project (2021).

Production Examples

Megatron-Turing NLG 530B (Microsoft + NVIDIA, 2021): 530B parameters, TP=8, PP=35, DP=35, trained on 4,480 A100 80GB GPUs. Required custom pipeline schedule with careful micro-batch count tuning.

GPT-NeMo (NVIDIA, 2023): Production training framework based on Megatron-Core + NeMo. Used by Cohere, AdeptAI, and others for custom LLM training.

Mixtral 8×7B (Mistral AI, 2023): Sparse MoE deployed with EP=8 for inference, TP=1. Open weights model with industry-leading quality-per-FLOP ratio.

DeepSeek-V2 (2024): 236B total parameters, 21B active, MoE with 160 experts. Uses Multi-head Latent Attention (MLA) for KV cache reduction + Expert Parallelism for distributed inference.

Debugging Notes

TP communication deadlock: TP AllReduce must be synchronous and called by all ranks in the TP group simultaneously. If one rank exits the model forward pass early (e.g., due to an exception), the remaining ranks hang waiting for AllReduce. Always use try/finally to ensure collective participation or implement timeout-based abort.

PP activation storage bug: With 1F1B, the pipeline stage must keep activations from micro-batches 0..p-1 alive. A common bug is releasing activations too early, causing incorrect backward pass. Verify with gradient checksum comparison against single-GPU reference.

Expert routing instability: If auxiliary loss weight is too high, the model collapses to uniform routing (all experts equally loaded but all experts learning the same function). Typical range: 0.01–0.1. Monitor expert utilization entropy during training.

False All-to-All deadlock (MoE): All-to-All requires all ranks to call simultaneously. If batch sizes are inconsistent across DP ranks (e.g., due to data loading imbalance), All-to-All hangs. Pad all batches to the same maximum size.

Security Implications

Model parallelism distributes model weights across multiple nodes/GPUs. Each node now holds only a fraction of the model, which may seem like security-by-distribution, but in practice, an adversary with access to any single pipeline stage node can reconstruct the other stages via repeated inference (model inversion). For proprietary models, all pipeline stage machines must be equally trusted and secured. Expert weights on different GPUs each represent partial model capability; collectively they must be protected as one model asset.

Performance Implications

TP vs PP communication latency: - TP AllReduce (NVLink): ~0.1ms for 100MB - PP activation send (IB 200 Gbps): 100MB × 8 bits / 200 Gbps = 4ms - TP is 40× faster for the same data volume; prefer TP over PP when both could solve the memory problem

Optimal micro-batch count for PP: Setting M = 4×p gives ~20% bubble. Beyond M = 8×p, bubble reduction is marginal but memory pressure from in-flight micro-batches increases. Profile pipeline efficiency with nvtx_profile marks in Megatron's timers.py.

Expert parallel overhead: All-to-All for MoE routing adds ~2ms per MoE layer at E=64. For a model with 40 MoE layers, this is 80ms per iteration. With 200ms compute time per iteration, MoE overhead is ~40%. Reduce by using expert capacity factor < 1.5 and implementing async All-to-All.

Failure Modes and Real Incidents

Incident: PP weight corruption from activation overflow (OPT training, 2022): The OPT-175B training log (publicly released by Meta) shows multiple loss spikes attributed to activation overflow in early layers causing numerical instability that propagated through the pipeline. PP made debugging harder because the spike appeared in stage 3 but originated in stage 0 activations. Resolution: gradient clipping + lower learning rate.

Incident: TP rank desynchronization: A misconfigured environment variable caused 2 of 8 TP ranks to use a different model parallelism degree. AllReduce with mismatched participants caused NCCL to hang indefinitely. Detection: validate TP group configuration at startup with a test AllReduce before loading model weights.

Incident: MoE expert load collapse during fine-tuning: A pre-trained MoE model fine-tuned with high learning rate caused expert utilization to collapse to 3 of 64 experts handling 99% of tokens. The load-balancing auxiliary loss was zero during fine-tuning (mistakenly disabled). Fix: always enable auxiliary loss during fine-tuning; add explicit expert utilization monitoring.

Modern Usage

Megatron-Core (2023): Modular library extraction from Megatron-LM. Provides TransformerLayer blocks composable with arbitrary parallelism strategies. Used by NeMo, MegaBlocks (efficient MoE), and custom training stacks.

FairScale/PyTorch FSDP + Megatron TP: Meta's LLaMA training combines FSDP (ZeRO-3 equivalent) for DP and Megatron-style TP within nodes. Avoids the complexity of PP for models up to ~70B.

MegaBlocks (Stanford, 2022): Efficient MoE training using block-sparse operations (dMoE) that express expert dispatch as a single triton kernel, eliminating All-to-All latency for training on single-node setups.

Future Directions

Automated parallelism search: Alpa (ICML 2022), UNITY (MLSys 2023) — treat parallelism as a compiler optimization problem; ILP solves for optimal TP/PP/DP degrees
Inter-layer gradient compression: Reduce PP activation transmission size via low-rank approximation of activation tensors
Expert specialization analysis: Understanding what linguistic phenomena different experts specialize in (early evidence shows syntactic vs semantic specialization)
Heterogeneous parallelism: Different pipeline stages on different GPU types (memory-bandwidth-optimized for FFN stages, compute-optimized for attention stages)

Exercises

Column-parallel correctness proof: Prove mathematically that column-parallel + row-parallel with AllReduce produces identical results to the full [H × H] weight matrix multiplication. Extend the proof to the case where input to column-parallel is itself partitioned (sequence parallelism).
PP bubble simulator: Implement the GPipe, 1F1B, and interleaved-1F1B schedules in Python. For p=8 stages and varying M (1 to 64 micro-batches), plot: (a) bubble fraction, (b) peak memory in terms of stored activations, (c) total communication volume.
MoE routing implementation: Implement top-2 expert routing with an auxiliary load-balancing loss in PyTorch. Train a small MoE MLP on CIFAR-10. Vary the auxiliary loss coefficient (0, 0.001, 0.01, 0.1) and plot test accuracy vs expert utilization entropy.
Megatron-LM profiling: Set up a 2-GPU Megatron-LM training run for a small GPT model (125M parameters). Use nsys profile to capture a timeline. Identify: (a) AllReduce CUDA kernels, (b) forward vs backward time ratio, (c) pipeline bubble fraction.
Expert parallel All-to-All analysis: For an MoE model with E=64 experts, TP=1, PP=1, DP=64 (64 GPUs total), given an average of T=32 tokens per expert per forward pass, 400 Gbps inter-node bandwidth, and hidden_dim=H=4096, calculate: (a) bytes transferred per All-to-All call, (b) time per All-to-All at 50% link utilization, (c) fraction of iteration time spent in All-to-All if compute per MoE layer takes 10ms.

References

Shoeybi et al., "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism," arXiv 2019
Narayanan et al., "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM," SC 2021
Korthikanti et al., "Reducing Activation Recomputation in Large Transformer Models" (Sequence Parallelism), MLSys 2023
Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity," JMLR 2022
Lepikhin et al., "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding," ICLR 2021
Zheng et al., "Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning," OSDI 2022
Gale et al., "MegaBlocks: Efficient Sparse Training with Mixture-of-Experts," MLSys 2023
Huang et al., "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism," NeurIPS 2019