LLM Inference Systems

Technical Overview

LLM inference is fundamentally different from training: there is no backward pass, no gradient accumulation, and no optimizer state. Instead, the bottleneck shifts to memory bandwidth (loading multi-billion-parameter weight matrices from HBM to compute units) and KV cache management (storing growing key-value tensors for every generated token). The goal is to serve as many concurrent users as possible while meeting latency SLOs (e.g., first token in <500ms, subsequent tokens at >20 tokens/second). Modern inference systems like vLLM, TensorRT-LLM, and SGLang have made batching, memory management, and quantization the central design axes.

Prerequisites

Understanding of the transformer attention mechanism (Q, K, V matrices, causal masking)
Familiarity with CUDA memory model (HBM, L2 cache, registers, shared memory)
Knowledge of quantization concepts (INT8, INT4, weight-only quantization)
Understanding of operating system paging and virtual memory concepts
Basic proficiency with GPU profiling tools (nsys, nvtop)

Core Content

Training vs Inference: Fundamental Differences

Dimension	Training	Inference
Pass type	Forward + Backward	Forward only
Bottleneck	Compute (FLOP-bound)	Memory bandwidth (weight loading)
Batch size	Large (256–8192)	Variable (1–1000s)
Sequence len	Fixed (context length)	Variable (grows per token)
Precision	BF16/FP16	INT8/INT4/FP16
Hardware util	50–55% MFU	30–50% MBU (memory bandwidth utilization)
Duration	Days–months	Milliseconds–seconds

The memory bandwidth bottleneck: A 70B-parameter model in FP16 stores 140 GB of weights. At each decoding step, every weight must be loaded from HBM to compute a single output token per sequence. An H100 has 3.35 TB/s of HBM3 bandwidth, so streaming 140 GB takes about 42ms. That is the theoretical minimum time per decode step, and at batch=1 it is also the minimum time per token. (This is an idealization: 140 GB does not fit in one H100's 80 GB, so in practice the weights are sharded across GPUs and each GPU streams only its shard.) Increasing batch size amortizes weight loading across requests: at batch=32, the weight load time per step stays at 42ms but 32 tokens are generated in that step, yielding 32 / 42ms ≈ 762 tokens/second in aggregate.
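The arithmetic above can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes decode time is set purely by streaming the weights once per step; the function names are illustrative, and KV cache traffic and compute time are ignored.

```python
def decode_step_seconds(weight_bytes: float, hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step: every weight is read from HBM once."""
    return weight_bytes / hbm_bytes_per_s


def decode_tokens_per_second(weight_bytes: float, hbm_bytes_per_s: float, batch: int) -> float:
    """Each step emits one token per sequence, so throughput scales with batch."""
    return batch / decode_step_seconds(weight_bytes, hbm_bytes_per_s)


WEIGHTS = 140e9      # 70B parameters x 2 bytes (FP16)
BANDWIDTH = 3.35e12  # H100 HBM3, bytes per second

print(f"step time: {decode_step_seconds(WEIGHTS, BANDWIDTH) * 1e3:.1f} ms")
for batch in (1, 32):
    print(batch, round(decode_tokens_per_second(WEIGHTS, BANDWIDTH, batch)))
```

With the unrounded step time (about 41.8ms), batch=32 comes out near 766 tokens/second; the 762 figure in the text uses the rounded 42ms.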

Batching Strategies

Static Batching

The server waits until a fixed batch size is reached (or a timeout expires), then processes all requests together.

Time ────────────────────────────────────────────────────────▶
Request 1 (5 tokens): [Prompt] [T1][T2][T3][T4][T5]
Request 2 (3 tokens): [Prompt] [T1][T2][T3] DONE ── WAIT ──
Request 3 (8 tokens): [Prompt] [T1][T2][T3][T4][T5][T6][T7][T8]
                       ◀─── Batch must complete together ──────▶

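Problem: Short requests must wait for the longest request in the batch. Slots freed by finished requests sit idle until the whole batch completes, and newly arrived requests cannot start until then. The result is low utilization and high latency variance.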

Continuous Batching (Iteration-Level Scheduling)

Pioneered by Orca (OSDI 2022) and popularized by vLLM. New requests are inserted into the batch at every decode iteration, not at batch boundaries.

Time ──────────────────────────────────────────────────────▶
Iteration: 1  2  3  4  5  6  7  8  9  10  11  12

Req A:     [P][T1][T2][T3][T4][T5] DONE
Req B:     [P][T1][T2] DONE
Req C:     -- [P][T1][T2][T3][T4][T5][T6][T7] DONE
Req D:     -- --  [P][T1][T2][T3] DONE
Req E:     -- --  -- --  [P][T1][T2] ....

Batch at each iteration: {active requests only}
When a request finishes, new request immediately fills its slot

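Continuous batching vs static batching (reported results):
- Up to 23× higher throughput than naive static batching in Anyscale's published benchmark of continuous batching with vLLM (OPT-13B); the vLLM paper itself reports 2–4× over FasterTransformer and Orca at similar latency
- Enables serving hundreds of concurrent users on a single A100 node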
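The difference between the two schedules can be made concrete with a toy step counter. This is a minimal sketch, assuming every request needs only decode iterations (prefill is ignored) and that each iteration costs the same regardless of how many slots are filled; the function names are illustrative.

```python
from collections import deque


def static_batching_steps(gen_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(gen_lens), batch_size):
        steps += max(gen_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(gen_lens, batch_size):
    """Freed slots are refilled from the queue at every iteration."""
    waiting = deque(gen_lens)
    running = []  # remaining tokens for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration for all active requests
        running = [r - 1 for r in running if r > 1]
    return steps


lens = [5, 3, 8, 2]
print(static_batching_steps(lens, batch_size=2))      # 13
print(continuous_batching_steps(lens, batch_size=2))  # 11
```

The gap widens as the spread of generation lengths grows, because static batching pays for the longest request in every batch.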

Prefill vs Decode Phase

Every LLM request has two phases:

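Prefill: Process the entire input prompt in one forward pass. All prompt tokens are computed in parallel under the causal mask. FLOP-intensive and usually compute-bound rather than memory-bandwidth-bound. Time grows roughly linearly with prompt length, with an additional quadratic attention term that matters for long prompts.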

Decode: Generate tokens one at a time. Each decode step appends one token to the KV cache and reads the entire cache. Memory-bandwidth intensive. Time per token ≈ constant (dominated by weight loading + KV cache loading).

Prefill and decode have very different resource profiles. Some systems (DistServe, DejaVu) disaggregate prefill and decode onto different GPU pools for efficiency.
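One way to see why the two phases differ is to compare arithmetic intensity (FLOPs per byte of weight traffic) against the hardware ridge point. The sketch below is a simplified roofline-style estimate: it counts only weight matmuls, assumes about 2 FLOPs per parameter per token, and uses an assumed peak of roughly 989 TFLOP/s dense BF16 for an H100-class GPU. All names are illustrative.

```python
def weight_arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Each parameter contributes about 2 FLOPs (multiply + add) per token and is
    read once per pass, so intensity = 2 * tokens / bytes_per_param.
    """
    return 2 * tokens_per_pass / bytes_per_param


def is_compute_bound(tokens_per_pass, bytes_per_param, peak_flops, hbm_bytes_per_s):
    """Compare intensity against the hardware ridge point (FLOPs per byte)."""
    ridge = peak_flops / hbm_bytes_per_s
    return weight_arithmetic_intensity(tokens_per_pass, bytes_per_param) >= ridge


# Assumed H100-class numbers: ~989 TFLOP/s dense BF16, 3.35 TB/s HBM.
PEAK, BW = 989e12, 3.35e12
print(is_compute_bound(2048, 2, PEAK, BW))  # prefill over a 2048-token prompt
print(is_compute_bound(1, 2, PEAK, BW))     # single-sequence decode step
```

Under these assumptions the 2048-token prefill lands on the compute-bound side and the single-token decode step on the bandwidth-bound side, which is the asymmetry that motivates disaggregation.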

KV Cache Architecture

The attention mechanism computes:

Attention(Q, K, V) = softmax(Q × Kᵀ / √d_k) × V

For autoregressive generation, Keys and Values for all past tokens are reused at each new step. Storing them avoids recomputing them.

KV cache memory formula:

KV_cache_bytes = 2 × num_layers × num_kv_heads × head_dim × max_seq_len × batch_size × bytes_per_element

Example: Llama 2 70B, FP16, batch=1, seq=4096:
= 2 × 80 layers × 8 GQA heads × 128 head_dim × 4096 tokens × 1 × 2 bytes
= 2 × 80 × 8 × 128 × 4096 × 2
= 1,342,177,280 bytes ≈ 1.34 GB per request

At batch=100: ≈134 GB, more than a single A100's 80 GB
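The formula is easy to evaluate directly, which helps catch slips such as dropping the bytes-per-element factor. The sketch below uses the Llama 2 70B configuration from the example; `max_concurrent_sequences` is a hypothetical helper that ignores weights and activations and simply divides a memory budget by the per-sequence cache size.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_element):
    """Keys and values (factor 2) for every layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element


def max_concurrent_sequences(budget_bytes, **model):
    """How many full-length sequences fit in a given KV cache budget."""
    per_sequence = kv_cache_bytes(batch_size=1, **model)
    return int(budget_bytes // per_sequence)


llama2_70b = dict(num_layers=80, num_kv_heads=8, head_dim=128,
                  seq_len=4096, bytes_per_element=2)

per_request = kv_cache_bytes(batch_size=1, **llama2_70b)
print(per_request)                                   # 1342177280
print(f"{per_request / 1e9:.2f} GB per request")
print(max_concurrent_sequences(80e9, **llama2_70b))  # 59
```

Even with all 80 GB devoted to KV cache, fewer than 60 full 4096-token sequences fit, which is why GQA, paging, and quantized caches matter.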

Grouped Query Attention (GQA): Llama 2 70B, Llama 3, Mistral, and others use GQA, where multiple query heads share each key/value head (e.g., 32 Q heads, 8 KV heads). In that configuration the KV cache is 4× smaller than with full multi-head attention, with minimal accuracy loss.

Multi-Query Attention (MQA): The extreme case: a single K and V head shared by all Q heads. Used in PaLM and Falcon-7B.

PagedAttention (vLLM)

Traditional KV caches pre-allocate max_seq_len × batch_size × ... memory upfront. This causes fragmentation: requests that complete early leave gaps; maximum length reservation wastes memory for short requests.

PagedAttention (Kwon et al., SOSP 2023): Inspired by OS virtual memory paging. KV cache is divided into fixed-size "blocks" (analogous to pages, typically 16 tokens per block). Each block is allocated dynamically as tokens arrive.

Physical KV Cache Memory (H100 HBM):
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Block 0 │ Block 1 │ Block 2 │ Block 3 │ Block 4 │ Block 5 │
│ Req A   │ Req B   │ Req A   │  FREE   │ Req C   │ Req B   │
│ tok 0-15│ tok 0-15│tok 16-31│         │ tok 0-15│tok 16-31│
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

Block Table (virtual → physical mapping):
Req A: [Block 0, Block 2, Block 7, ...]  (scattered physical blocks)
Req B: [Block 1, Block 5, ...]
Req C: [Block 4, ...]

The attention kernel uses block tables (like page tables) to translate virtual token positions to physical memory addresses. Attention is computed as scatter-gather across non-contiguous physical blocks.

Benefits:
- Near-zero internal fragmentation (the last block of each sequence wastes at most block_size-1 token slots)
- External fragmentation eliminated (all free blocks are usable)
- Copy-on-write sharing for beam search: multiple beam hypotheses share prefix KV blocks until they diverge. Before writing to a shared block, the writer copies it first (like OS fork CoW).
- Prefix caching: if multiple requests share the same system prompt, their KV cache blocks can be shared. vLLM's "prefix cache" automatically deduplicates shared prefixes.

Memory utilization improvement: The vLLM paper measures that earlier pre-allocating systems use only about 20–38% of their KV cache memory for actual token state, while PagedAttention keeps waste below roughly 4%.
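The block-table idea can be illustrated with a toy allocator. This is a minimal sketch, not vLLM's implementation: the class and method names are hypothetical, there is no copy-on-write or prefix sharing, and exhaustion is signalled with an exception that a real scheduler would turn into preemption or queuing.

```python
class BlockAllocator:
    """Toy paged KV cache allocator: fixed-size blocks handed out from a free list."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request_id -> list of physical block ids
        self.num_tokens = {}    # request_id -> tokens stored so far

    def append_token(self, request_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        pos = self.num_tokens.get(request_id, 0)
        if pos % self.block_size == 0:  # first token, or current block is full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; caller should preempt or queue")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = pos + 1
        return table[pos // self.block_size], pos % self.block_size

    def lookup(self, request_id, token_pos):
        """Translate a logical token position to (physical_block, offset)."""
        table = self.block_tables[request_id]
        return table[token_pos // self.block_size], token_pos % self.block_size

    def free(self, request_id):
        """Return all of a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("A")   # A now spans two blocks
alloc.append_token("B")
print(alloc.block_tables)     # physical blocks need not be contiguous
alloc.free("A")
print(len(alloc.free_blocks)) # 3
```

Because blocks are interchangeable, any free block can serve any request, which is what removes external fragmentation.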

Speculative Decoding

The decode bottleneck is that each step generates exactly one token. Speculative decoding generates multiple tokens per step using a small "draft" model, then verifies with the large "target" model.

Step 1: Draft model generates γ=4 candidate tokens fast
   Draft: [... input] → [tok1, tok2, tok3, tok4]

Step 2: Target model verifies all 4 tokens in parallel (one forward pass)
   Target evaluates: P(tok1|ctx), P(tok2|ctx,tok1), P(tok3|ctx,tok1,tok2), P(tok4|ctx,tok1,tok2,tok3)

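Step 3: Accept tokens using rejection sampling:
   Accept tok_i if rand() < min(1, P_target(tok_i) / P_draft(tok_i))
   On first rejection, resample that position from the normalized residual
   max(0, P_target − P_draft) and discard subsequent draft tokens
   If all γ tokens are accepted, sample one extra token from the target's next-position distribution

Example: γ=4, acceptance rate α=0.8 (assuming independent acceptances)
  Expected tokens per step = (1 − α^(γ+1)) / (1 − α) = (1 − 0.8^5) / 0.2 ≈ 3.36  (vs 1.0 without speculative)
  Speedup ≈ 3.4× at most (wall clock, before subtracting the draft model's cost)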
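The verification step can be sketched in plain Python. This is an illustrative implementation of the standard speculative sampling rule (accept with probability min(1, p_target/p_draft), resample from the normalized residual on rejection), assuming the distributions are available as plain lists of probabilities; the function names are hypothetical and no real model is involved.

```python
import random


def sample(weights, rng):
    """Draw an index in proportion to the given non-negative weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def verify(draft_tokens, p_draft, p_target, rng):
    """Return the tokens emitted by one speculative step.

    draft_tokens: gamma token ids proposed by the draft model
    p_draft[i]:   draft distribution that draft_tokens[i] was sampled from
    p_target[i]:  target distribution at the same position; p_target has
                  gamma + 1 entries (the last one is for the bonus token)
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        accept_prob = min(1.0, p_target[i][tok] / p_draft[i][tok])
        if rng.random() < accept_prob:
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p_target - p_draft).
        residual = [max(0.0, t - d) for t, d in zip(p_target[i], p_draft[i])]
        out.append(sample(residual, rng))
        return out  # everything after the first rejection is discarded
    # All drafts accepted: take one bonus token from the target.
    out.append(sample(p_target[len(draft_tokens)], rng))
    return out


rng = random.Random(0)
p = [0.5, 0.3, 0.2]
print(verify([0, 1], [p, p], [p, p, [0.0, 0.0, 1.0]], rng))  # [0, 1, 2]
```

When draft and target agree exactly, every draft token is accepted and the step yields γ+1 tokens; the rule is constructed so that the emitted tokens follow the target distribution regardless of how good the draft is.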

Practical speedup: roughly 2–3× is typical in published results (e.g., a 70B target with a 7B draft). Works best when the draft and target models share a tokenizer and produce similar output distributions (e.g., fine-tuned variants of the same base model).
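A simple analytical model explains why practical speedups sit below the ideal. Assuming each draft token is accepted independently with probability α, the expected number of tokens per target forward pass is (1 − α^(γ+1)) / (1 − α), and the draft model's own cost must be paid γ times per step. The sketch below encodes that model; `draft_cost` (draft step time as a fraction of target step time) is an illustrative parameter.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass, assuming each draft
    token is accepted independently with probability alpha."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Wall-clock speedup over plain decoding.

    One speculative step costs gamma draft passes plus one target pass,
    measured in units of a target pass.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * draft_cost + 1)


for alpha in (0.8, 0.3):
    tokens = expected_tokens_per_step(alpha, gamma=4)
    speedup = expected_speedup(alpha, gamma=4, draft_cost=0.1)
    print(f"alpha={alpha}: {tokens:.2f} tokens/step, {speedup:.2f}x speedup")
```

With a draft that costs 10% of a target pass, α=0.8 gives about 2.4× while α=0.3 gives barely more than 1×, so a poorly matched draft can erase the benefit entirely.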

Medusa (Cai et al., 2023): Adds multiple lightweight decoding "heads" to the target model that predict several future positions in parallel, with candidates verified via tree attention. No separate draft model needed. Roughly 2–3× speedup reported.

EAGLE (ICML 2024): A more sophisticated draft architecture that autoregresses over the target model's hidden states (features) rather than tokens, achieving higher acceptance rates.

Quantization for Inference

INT8 (W8A8): Both weights and activations quantized to 8-bit. LLM.int8() (Dettmers 2022) uses mixed-precision: identifies "outlier" channels in activations (which resist quantization) and keeps them in FP16, quantizes remaining 99.9% to INT8.

Weight-only INT4 (W4A16): Weights quantized to 4-bit, activations remain FP16. GPTQ (2022): minimizes layer-by-layer quantization error using approximate second-order (Hessian) information, building on the Optimal Brain Quantization framework. AWQ (MIT 2023): observes that roughly 1% of weight channels are "salient" as judged by activation magnitude, and protects them by per-channel scaling before quantization rather than by keeping them in higher precision. The AWQ paper reports better accuracy than GPTQ at INT4 on its benchmarks.
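For intuition about what weight-only quantization does to a tensor, the sketch below applies plain symmetric round-to-nearest quantization to one group of weights. This is the naive baseline that GPTQ and AWQ improve on, not an implementation of either method; the function names and the per-group scale are illustrative choices.

```python
def quantize_absmax(weights, bits=4):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate FP values; the matmul then runs in FP16 (W4A16)."""
    return [v * scale for v in q]


group = [3.5, -1.5, 0.5, 0.0, 0.8]
q, scale = quantize_absmax(group, bits=4)
restored = dequantize(q, scale)
print(q, scale)
print([round(abs(w - r), 3) for w, r in zip(group, restored)])  # per-weight error
```

The rounding error of each weight is bounded by half the scale, and the scale is set by the largest value in the group, which is why outliers hurt and why smaller groups or salient-channel scaling help.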

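GGUF/GGML (llama.cpp): k-quant formats from 2 to 8 bits, originally aimed at CPU inference and also usable with GPU offload. Q4_K_M is a popular choice: mostly 4-bit with 6-bit for some sensitive tensors, roughly 4.5–5 bits per weight on average. Enables 70B inference on a MacBook Pro with enough unified memory to hold the roughly 40 GB of weights.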

FP8 (H100+ native): H100 hardware supports FP8 (E4M3 and E5M2 formats). TensorRT-LLM FP8 achieves near-FP16 accuracy with up to ~2× throughput. FP8 still incurs quantization error, but its floating-point exponent gives more dynamic range than INT8, so it tolerates activation outliers better with simpler calibration.

Memory footprint comparison (Llama 2 70B):

Precision	Weight memory	Relative H100 throughput (approximate)
FP32	280 GB	Baseline
BF16	140 GB	1.8×
INT8	70 GB	2.5×
INT4	35 GB	3.5×
GGUF Q4_K_M	~40 GB	(CPU)
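The memory column follows directly from parameter count times bits per weight. A minimal sketch, assuming 70e9 parameters and ignoring quantization metadata such as scales and zero points (which is why real INT4 and GGUF files are slightly larger):

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (10^9 bytes), excluding scales and other metadata."""
    return num_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP32", 32), ("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(70e9, bits):.0f} GB")
```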

Continuous Batching Architecture

Client Requests
      │
      ▼
┌─────────────────────────────────┐
│         API Server              │
│  (FastAPI / uvicorn / asyncio)  │
└──────────────┬──────────────────┘
               │ HTTP/gRPC
               ▼
┌─────────────────────────────────┐
│        Scheduler                │
│  - Request queue (priority)     │  ◀── Admission control
│  - KV cache block allocator     │      (reject if OOM)
│  - Iteration-level batching     │
└──────────────┬──────────────────┘
               │ Batch of tokens
               ▼
┌─────────────────────────────────┐
│      LLM Engine (GPU Worker)    │
│  ┌────────────────────────────┐ │
│  │  CUDA Kernels              │ │
│  │  - Prefill attention       │ │
│  │  - PagedAttention decode   │ │
│  │  - GEMM (weight matmul)    │ │
│  │  - Sampler (top-p/top-k)   │ │
│  └────────────────────────────┘ │
└──────────────┬──────────────────┘
               │ Generated tokens
               ▼
        Streaming Response

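vLLM implementation notes:
- Python asyncio event loop for the HTTP server
- Separate GPU worker processes (one per GPU under tensor parallelism)
- Shared-memory ring buffer for passing scheduler outputs between the driver and worker processes
- Custom CUDA kernels for PagedAttention, written to follow block-table indirection over non-contiguous KV blocks during decode; newer versions can also use FlashAttention or FlashInfer backends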

Historical Context

Pre-2022, most LLM serving used naive batching (for example, the Hugging Face pipeline API). The seminal work was Orca (OSDI 2022, Seoul National University and FriendliAI), which introduced iteration-level scheduling. vLLM (UC Berkeley, SOSP 2023) introduced PagedAttention, with its authors reporting up to 24× higher throughput than Hugging Face Transformers. TensorRT-LLM (NVIDIA, late 2023) provided a production-optimized alternative with FP8 support. Text Generation Inference (TGI, Hugging Face) became a widely used open-source deployment option. The field moved extremely fast: speculative decoding (Leviathan et al., 2023; Chen et al., 2023), Medusa, EAGLE, continuous batching improvements, and disaggregated prefill/decode all appeared within about 18 months.

Production Examples

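OpenAI GPT-4 serving: OpenAI has not published its inference stack. Unconfirmed reports describe a custom system that uses speculative decoding with a smaller draft model. The throughput and latency figures sometimes quoted (~60 tokens/second per user, 500ms first-token latency) are likewise unofficial.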

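Anthropic Claude serving: Uses a proprietary serving system. Architecture and serving details have not been published, so claims about specific attention variants (such as MLA, which was introduced by DeepSeek-V2) should be treated as unverified. Claude models are offered through AWS, among other clouds.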

Anyscale (vLLM production): Reportedly serves Llama 2 70B on 4× H100, reaching about 2,000 tokens/second total throughput per node via continuous batching + PagedAttention.

Mistral AI deployment: Reportedly uses TGI with FP8 quantization on H100, with 1,000+ concurrent users per 8×H100 node.

Debugging Notes

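CUDA OOM during serving: With PagedAttention, running out of KV cache blocks normally shows up as preempted or queued requests rather than a CUDA OOM. To give the cache more room, raise gpu_memory_utilization in vLLM (default 0.9), lower max_model_len or max_num_seqs (maximum concurrent requests), or shrink the weights with INT8/INT4 quantization. A genuine CUDA OOM usually means too little headroom was left for activations or other processes on the GPU; in that case lower gpu_memory_utilization instead.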

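High first-token latency (TTFT): Long prompts cause long prefill time. If many requests share a prefix, enable automatic prefix caching (--enable-prefix-caching in vLLM) so cached blocks skip prefill, and check the reported cache hit rate. Consider prefill/decode disaggregation for latency-sensitive workloads.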

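Token throughput below expectation: Check GPU utilization (nvtop, nvidia-smi -q -d UTILIZATION), keeping in mind that this metric reports the fraction of time any kernel was running, not how fully the SMs were occupied. If utilization is low (<70%), the GPU is likely waiting on the scheduler or on undersized batches: raise the batch token budget (--max-batch-total-tokens in TGI, max_num_batched_tokens in vLLM).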

Incorrect sampling (top-p/top-k): Batched sampling with different parameters per request requires careful per-sequence sampling. Verify that temperature/top-p settings are correctly isolated per request in vLLM's SamplingParams.

Security Implications

Prompt injection: LLM inference systems that accept arbitrary user prompts are vulnerable to prompt injection attacks. The inference server itself cannot prevent this—application-layer filtering (content moderation classifiers) is required.

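Model extraction via inference: Repeated targeted queries can be used to train a functionally similar model (distillation), and published attacks have recovered parts of production models, such as the final projection layer, from API outputs. Mitigations: rate limiting, restricting logprob and logit-bias access, and adding noise to returned logits.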

Side-channel via timing: Request batching means a long request from one user can increase latency for all co-batched users. An adversary can infer approximate prompt lengths from TTFT timing. Not typically exploitable but relevant for sensitive deployments.

GPU memory isolation: On shared GPU inference servers (multi-tenant), CUDA contexts are isolated by the GPU driver. However, hardware side-channels (cache timing, memory bus contention) have been demonstrated to leak information between co-located processes (RowHammer-style attacks on shared GPU memory are a research concern).

Performance Implications

Throughput vs latency Pareto: Higher batch sizes improve throughput but increase latency. Production systems set latency SLOs (P99 TTFT < 1s, P99 TPOT < 50ms) and maximize throughput subject to SLOs.
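The SLO-constrained optimization can be sketched with a bandwidth-only model of a decode step: each step streams the weights once plus every active sequence's KV cache. The sketch below finds the largest batch whose step time stays under a TPOT target. It is a simplification that ignores compute time, memory capacity, and prefill interference, and the 8B-class numbers in the example are illustrative assumptions.

```python
def decode_step_seconds(batch, weight_bytes, kv_bytes_per_seq, hbm_bytes_per_s):
    """Bandwidth-only step time: weights once, plus each sequence's KV cache."""
    return (weight_bytes + batch * kv_bytes_per_seq) / hbm_bytes_per_s


def max_batch_under_slo(tpot_slo_s, weight_bytes, kv_bytes_per_seq, hbm_bytes_per_s):
    """Largest batch whose decode step still meets the time-per-token SLO."""
    budget = tpot_slo_s * hbm_bytes_per_s - weight_bytes  # bytes left for KV reads
    return max(0, int(budget // kv_bytes_per_seq))


# Assumed: 16 GB of FP16 weights, 0.4 GB of KV cache per sequence, 3.35 TB/s HBM.
batch = max_batch_under_slo(0.050, 16e9, 0.4e9, 3.35e12)
step = decode_step_seconds(batch, 16e9, 0.4e9, 3.35e12)
print(batch, f"{step * 1e3:.1f} ms/step", f"{batch / step:.0f} tokens/s")
```

Tightening the SLO shrinks the admissible batch and therefore the throughput, which is the Pareto trade-off in its simplest form.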

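FlashAttention for prefill: FlashAttention-2 (Dao 2023) tiles the attention computation so the full N×N score matrix is never written to HBM, which removes the memory traffic that dominates naive attention. Roughly 3× faster than naive attention for long sequences, depending on hardware and sequence length. Effectively standard in production systems.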

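Tensor parallelism for inference: TP reduces per-GPU model memory and per-GPU weight traffic but adds AllReduce latency in every layer. Megatron-style TP needs two AllReduce operations per transformer layer (after attention and after the MLP), so at ~20µs each an 80-layer model pays about 2 × 80 × 20µs = 3.2ms per decode step. For real-time serving, raising TP beyond 4 often gives diminishing returns, because this fixed communication cost becomes a large share of step time at small batch sizes.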
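The diminishing returns can be estimated by comparing per-GPU weight streaming time with the fixed communication cost. This sketch assumes a constant ~20µs per AllReduce regardless of TP degree and two AllReduce operations per transformer layer (one after attention and one after the MLP, as in Megatron-style tensor parallelism); both are simplifications, and the function name is illustrative.

```python
def tp_step_ms(weight_bytes, tp, hbm_bytes_per_s, num_layers,
               allreduce_us=20.0, allreduces_per_layer=2):
    """Return (weight_load_ms, communication_ms) for one decode step."""
    load_ms = weight_bytes / tp / hbm_bytes_per_s * 1e3
    comm_ms = num_layers * allreduces_per_layer * allreduce_us / 1e3 if tp > 1 else 0.0
    return load_ms, comm_ms


for tp in (2, 4, 8):
    load, comm = tp_step_ms(140e9, tp, 3.35e12, num_layers=80)
    share = comm / (load + comm)
    print(f"TP={tp}: load {load:.1f} ms, comm {comm:.1f} ms, comm share {share:.0%}")
```

Each doubling of TP halves the load time but leaves the communication term unchanged, so communication takes a growing share of the step.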

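Flash-Decoding: For long contexts (>16K tokens) at small batch sizes, decode attention parallelizes only across batch and heads, which leaves most of the GPU idle while a few thread blocks stream a long KV cache. Flash-Decoding additionally splits the KV sequence dimension across thread blocks and then reduces the partial results, with up to 8× faster generation reported for very long (64K+) contexts.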

Failure Modes and Real Incidents

Incident: KV cache thrashing under load burst: A serving system sized its KV cache for max_seq_len=8192 and saw low memory pressure at normal load. During a traffic burst (2× normal), the KV cache allocator ran out of blocks mid-sequence. The system started aborting requests and returning HTTP 503. A paged allocator should preempt running sequences (swap their blocks to CPU DRAM or recompute them later) rather than abort, but this was not implemented. Resolution: implement preemption + CPU offload.

Incident: Speculative decoding accuracy regression: A deployment replaced Llama-70B-Chat as the target with a fine-tuned variant but kept Llama-7B-base as the draft model. The acceptance rate dropped from 0.8 to 0.3 because the fine-tuned model's output distribution differed significantly. Throughput degraded 40%. Fix: fine-tune the draft model to match the target model's distribution.

Incident: Stale prefix cache returning wrong context (vLLM 0.2.x): A bug in vLLM's prefix cache caused blocks from a previous user's request to be returned for a new request with the same hash prefix (hash collision). The new user saw context from a prior conversation. Severity: high privacy incident. Root cause: SHA-256 was truncated to 64-bit prefix hash; fixed by using full 256-bit hash.
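The failure mode above comes down to how cache keys are derived. The sketch below shows one way to compute prefix-cache keys by chaining a full SHA-256 digest over each block of token ids, so that a block's key depends on the entire prefix before it. It is a hypothetical illustration, not vLLM's implementation; a production cache might additionally mix in a per-tenant salt or compare token contents on a hit.

```python
import hashlib


def block_hashes(token_ids, block_size=16):
    """Return one SHA-256 digest per full block.

    Each digest covers the parent digest plus the block's tokens, so two
    sequences share a key only if their whole prefixes match.
    """
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore a partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()  # full 32-byte digest, not truncated
        hashes.append(parent)
    return hashes


shared = list(range(32))
a = block_hashes(shared + [7, 8, 9])
b = block_hashes(shared + [100])
print(a == b)     # same full blocks, so the cached prefix can be shared
print(len(a[0]))  # 32 bytes
```

Chaining means a difference anywhere in the prefix changes every later key, which prevents a block from being reused under a different context.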

Modern Usage

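vLLM (v0.4+, 2024): Added speculative decoding and chunked prefill (splitting long prefills into chunks that are batched with decodes, so one long prompt does not stall token generation for other requests), with experimental disaggregated prefill/decode (separate GPU pools) arriving in later releases. A common default serving stack for open-source LLMs.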

TensorRT-LLM (NVIDIA, 2024): Often the highest absolute throughput on NVIDIA GPUs. FP8 on H100 achieves up to ~2× vs FP16. Integrates with Triton Inference Server for production deployment.

SGLang (2024): A structured generation language and runtime for complex multi-turn LLM programs. Adds RadixAttention (automatic KV cache reuse across requests via a radix tree over token prefixes), with up to 5× higher throughput reported on workloads with heavy prefix sharing.

LMDeploy (Shanghai AI Lab, 2024): Specializes in 4-bit inference with TurboMind engine. Strong performance for smaller-scale deployments.

Future Directions

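Disaggregated prefill/decode (DistServe, OSDI 2024): Run prefill and decode on separate GPU pools so each can be sized and parallelized for its own bottleneck (compute for prefill, memory bandwidth and capacity for decode); heterogeneous hardware for the two pools could reduce cost further, with savings on the order of 2× suggested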
MLA (Multi-head Latent Attention): DeepSeek-V2 architecture reduces KV cache by 93% by projecting KV into low-rank latent vectors; inference servers must implement specialized MLA kernels
Continuous speculative decoding: Run draft model in parallel with target model on separate streams, accepting tokens greedily as they arrive
Mixture-of-Experts inference: Sparse expert routing means only a few experts activate per token (e.g., 2 of 8 FFN experts in Mixtral 8x7B). This is a dramatic FLOP reduction, but all expert weights must stay resident, which complicates memory management and batching
KV cache compression: Learned compression of KV caches using distillation (reduces memory without quantization)

Exercises

Throughput modeling: Given a Llama 2 70B model in FP16 (140 GB weights), an H100 with 3.35 TB/s HBM bandwidth, and a KV cache budget of 40 GB (50% of 80 GB reserved), calculate: (a) the maximum decode throughput in tokens/second at batch=1, (b) the minimum batch size to achieve 1,000 tokens/second, (c) the maximum batch size given 40 GB KV cache for 4096-token sequences. Treat the single-GPU figures as an idealization, since 140 GB of weights does not fit in 80 GB.
PagedAttention implementation: Implement a simplified PagedAttention block allocator in Python. Support allocate(request_id, num_tokens), free(request_id), and get_block_table(request_id). Use a free list for blocks. Measure fragmentation under a simulated workload of 100 requests with lengths drawn from Poisson(500).
Speculative decoding simulation: Simulate speculative decoding with a draft acceptance rate of α=0.75 and a draft model latency of 10ms and target model latency of 40ms. For γ=1,2,3,4,5,6, calculate the expected tokens per second. Find the optimal γ.
Quantization accuracy benchmarking: Using a 7–8B model (Llama-3-8B or similar), compare perplexity on WikiText-103 for FP16, INT8 (LLM.int8()), GPTQ INT4, and AWQ INT4. Report memory footprint and inference latency on a single GPU.
Continuous batching simulator: Implement a continuous batching scheduler in Python. Requests arrive according to a Poisson process (λ=10 req/s). Each request has a random prompt length (50–500 tokens) and generation length (50–2000 tokens). Simulate with max_batch_tokens=32768. Plot request queue depth, GPU utilization, and P95 latency over 60 seconds.

References

Yu et al., "Orca: A Distributed Serving System for Transformer-Based Generative Models," OSDI 2022
Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP 2023
Chen et al., "Accelerating Large Language Model Decoding with Speculative Sampling," 2023
Dao, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning," ICLR 2024
Dettmers et al., "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale," NeurIPS 2022
Lin et al., "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration," 2023
Zheng et al., "SGLang: Efficient Execution of Structured Language Model Programs," 2024
Zhong et al., "DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving," OSDI 2024
NVIDIA TensorRT-LLM documentation: https://nvidia.github.io/TensorRT-LLM/