GPU Memory Management

Technical Overview

GPU performance is almost always limited by memory: either the bandwidth to load and store data from global memory (HBM), or the latency of non-coalesced or bank-conflicted accesses. The A100's compute peak (312 TFLOPS FP16) requires feeding data at 2 TB/s; at 4 bytes/FP32 that's 500 billion floats per second. At 128 bytes per cache line, that is 15.6 billion cache lines per second — one every 64 picoseconds. Achieving this requires that nearly every memory access be a perfect, coalesced transaction with zero cache misses and zero bank conflicts.

Understanding GPU memory management at this level — the difference between a coalesced and uncoalesced access, how shared memory bank conflicts serialize operations, how to use PCIe and NVLink to move data efficiently — is what separates a GPU programmer from a GPU systems engineer. This document covers the GPU memory hierarchy in sufficient detail to understand, diagnose, and fix memory throughput bottlenecks.

Prerequisites

CUDA programming model (see 02-cuda-programming-model.md)
GPU architecture fundamentals: SM, warp, L1/L2/HBM hierarchy (see 01-gpu-architecture.md)
Understanding of cache line granularity and memory controller behavior
Basic DRAM technology (DRAM vs SRAM)

GPU Memory Hierarchy (A100)

+--------------------------------------------------------------+
| Memory Type  | Location  | Size      | Bandwidth | Latency  |
+--------------+-----------+-----------+-----------+----------+
| Registers    | On-SM     | 256 KB/SM | ~infinite | 1 cycle  |
| Shared mem   | On-SM SRAM| 192 KB/SM | ~20 TB/s  | 32 cycles|
| L1 cache     | On-SM SRAM| (shared   | ~20 TB/s  | 32 cycles|
|              |           |  with shm)|           |          |
| L2 cache     | On-chip   | 40 MB     | ~3.2 TB/s | ~200 cy  |
| HBM2e VRAM   | Off-chip  | 80 GB     | 2.0 TB/s  | ~600 cy  |
+--------------------------------------------------------------+
| PCIe 4.0 x16 | Host-Dev  | -         | 64 GB/s   | ~10 µs   |
| NVLink 3.0   | GPU-GPU   | -         | 600 GB/s  | ~1 µs    |
+--------------------------------------------------------------+

Roofline Analysis (A100 FP32):
Peak compute: 19.5 TFLOPS
Peak memory bandwidth: 2.0 TB/s
Ridge point: 19.5e12 / 2.0e12 = 9.75 FLOP/byte

Kernels with <9.75 FLOP/byte are memory-bandwidth bound.
Kernels with >9.75 FLOP/byte are compute-bound.
Example: GEMM (1024×1024×1024 FP32) = 2×1024^3 FLOPs / (3×1024^2×4 bytes)
       = 2 GFLOPs / 12 MB = ~170 FLOP/byte → compute-bound.

Core Content

Coalesced Memory Access

Coalesced access is the most critical performance concept in CUDA global memory. When 32 threads in a warp execute a load instruction, the GPU memory controller attempts to serve the 32 addresses in as few transactions as possible.

Fully coalesced (ideal): 32 threads access 32 consecutive 4-byte elements aligned to a 128-byte boundary. The memory controller issues ONE 128-byte transaction. All 32 elements arrive in one request.

Warp threads:  T0  T1  T2  ...  T31
Addresses:     0   4   8   ...  124  (bytes, base-aligned to 128B)
               └─────────────────┘
               ONE 128-byte transaction

Strided access (common anti-pattern): 32 threads access addresses with stride-2 (every other element):

Warp threads:  T0  T1  T2  ...  T31
Addresses:     0   8   16  ...  248  (stride-2, bytes)
               |   |   |        |
               Multiple 128-byte cache lines
               → 2 cache line transactions (factor 2 waste)

Stride-32 (threads access same cache line offset, different rows of matrix):
               → 32 separate transactions (32× bandwidth waste)

Practical coalescing patterns:

// COALESCED: Thread tid reads data[tid] — adjacent threads, adjacent addresses
__global__ void coalesced_read(const float *data, float *result, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) result[tid] = data[tid] * 2.0f;
}

// UNCOALESCED: Thread tid reads data[tid * stride] — stride > 1
__global__ void strided_read(const float *data, float *result, int n, int stride) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * stride < n) result[tid] = data[tid * stride] * 2.0f;
    // With stride=32: each thread accesses a different cache line → 32 transactions
}

// MATRIX COLUMN ACCESS (classic uncoalesced pattern):
// Accessing column j of row-major matrix A[M][N]
__global__ void column_access_bad(const float *A, float *col, int M, int N, int j) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M) col[row] = A[row * N + j];
    // Thread 0 accesses A[0*N+j], Thread 1 accesses A[1*N+j]
    // These addresses are N*4 bytes apart → completely uncoalesced if N > 32
}

// MATRIX TRANSPOSE (solution: use shared memory for coalescing):
#define TILE 32
__global__ void transpose(const float *A, float *AT, int M, int N) {
    __shared__ float tile[TILE][TILE + 1];  // +1 to avoid bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;  // col index
    int y = blockIdx.y * TILE + threadIdx.y;  // row index

    if (x < N && y < M)
        tile[threadIdx.y][threadIdx.x] = A[y * N + x];  // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;  // transposed col
    y = blockIdx.x * TILE + threadIdx.y;  // transposed row

    if (x < M && y < N)
        AT[y * M + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}

Shared Memory Bank Conflicts

Shared memory is divided into 32 banks, each 4 bytes wide. Banks cycle across addresses: address 0 is bank 0, address 4 is bank 1, ..., address 124 is bank 31, address 128 is bank 0 again.

Conflict-free access: If 32 threads in a warp access 32 different banks simultaneously, all 32 accesses are served in parallel — one shared memory clock cycle.

Bank conflict: If 2 or more threads in a warp access different addresses within the same bank, the accesses are serialized. A k-way conflict takes k cycles.

Conflict-free (each thread accesses a unique bank):
Thread 0: address 0  → bank 0   ✓ parallel
Thread 1: address 4  → bank 1   ✓ parallel
Thread 2: address 8  → bank 2   ✓ parallel
...
Thread 31: address 124 → bank 31  ✓ parallel

2-way conflict (stride-2 access):
Thread 0: address 0  → bank 0   ×  2-way conflict
Thread 1: address 8  → bank 2   ×
Thread 2: address 16 → bank 4   ×
...
(Threads 0 and 16 both access bank 0 — conflict)

Worst case: stride-32 (all threads hit bank 0):
Thread 0: address 0   → bank 0   ×  32-way conflict → 32 cycles
Thread 1: address 128 → bank 0   ×
...
Thread 31: address 128*31 → bank 0  ×

Padding to avoid bank conflicts (the canonical fix): When a 2D shared memory tile has dimensions that create stride patterns hitting the same bank, add 1 element of padding:

// Without padding: column access has 32-way bank conflict
__shared__ float tile[32][32];  // stride-32 column access → bank conflict

// With padding: column access is conflict-free
__shared__ float tile[32][33];  // +1 padding shifts column accesses to unique banks
// tile[row][col]: col access now has stride 33, no power-of-2 alignment → no conflicts

Broadcast exception: If all threads in a warp access the same address in shared memory, it is served as a single read broadcast to all threads — no conflict. Useful for scalar coefficients: __shared__ float coeff; if (tid == 0) coeff = ...; (with __syncthreads()).

L1/L2 Cache Behavior

L1 cache (shared with shared memory on SRAM): Hardware-managed. By default on Ampere, the configurable partition is 128 KB shared memory + 64 KB L1. The programmer can change this with:

// Request more shared memory (less L1)
cudaFuncSetAttribute(myKernel,
    cudaFuncAttributePreferredSharedMemoryCarveout, 100); // 100% shared

// Or request more L1 (less shared memory)
cudaFuncSetAttribute(myKernel,
    cudaFuncAttributePreferredSharedMemoryCarveout, 0);   // 0% preferred for shared

L1 hit latency: ~30–40 cycles. L1 miss → L2: ~200 cycles. L2 miss → HBM: ~600 cycles.

L2 cache (A100: 40 MB): Shared across all 108 SMs. All global memory accesses that miss L1 go to L2. A100's L2 is large enough to hold a significant working set — for problems with 10–30 MB active data, L2 hit rates can be high, giving effective bandwidth close to L2's 3.2 TB/s.

Cache control: CUDA provides load modifiers to control caching behavior:

// Load into L1 and L2 (default)
float val = data[tid];

// Load bypassing L1 (goes directly to L2 — useful when data won't be reused)
float val = __ldg(data + tid);  // __ldg = load through texture/read-only cache

// Non-temporal store (bypass L1 and L2 — useful for write-only outputs)
// Via PTX inline assembly:
asm("st.global.cs.f32 [%0], %1;" : : "l"(ptr), "f"(val));
// .cs = cache streaming (evict on next access)

L2 residency control (Ampere+):

// Advise that a memory range should be kept in L2
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = (void*)data;
attr.accessPolicyWindow.num_bytes = n * sizeof(float);
attr.accessPolicyWindow.hitRatio  = 0.6f;  // try to keep 60% in L2
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);

This is valuable for workloads with a hot dataset (e.g., model weights in inference) that fits in L2 — eliminates redundant global memory reads.

HBM: High Bandwidth Memory

HBM (High Bandwidth Memory) is a stacked DRAM technology where multiple DRAM dies are stacked vertically and connected via a silicon interposer with thousands of parallel wires (much wider than the narrow PCB traces of GDDR6).

HBM2e on A100: - 5 HBM2e stacks (each with 4 DRAM dies) - 80 GB total capacity - 5 × 1024-bit bus = 5120-bit total width (vs GDDR6's 384-bit for RTX 3090) - 2,000 GB/s peak bandwidth - ~1.8 TB/s achievable bandwidth on bandwidth-bound kernels

Bandwidth measurement (achievable vs theoretical):

// Measure achieved memory bandwidth
// Copy N floats from d_in to d_out, measure GB/s

cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start);
cudaMemcpy(d_out, d_in, N * sizeof(float), cudaMemcpyDeviceToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms; cudaEventElapsedTime(&ms, start, stop);
float gb_s = (2.0f * N * sizeof(float)) / (ms * 1e6);
printf("Bandwidth: %.1f GB/s (of 2000 GB/s peak)\n", gb_s);
// Expect: ~1800-1900 GB/s for large N (copy kernel is bandwidth-bound)

PCIe Bottleneck for Host-Device Transfers

PCIe Gen4 x16 provides ~64 GB/s bidirectional (32 GB/s each direction). This is 30× lower than HBM bandwidth. Minimizing host-device transfers is paramount.

Pinned (page-locked) host memory:

// Pageable (regular) malloc: cudaMemcpy must stage through a pinned bounce buffer
// Effective bandwidth: ~15-20 GB/s

float *h_pageable = (float*)malloc(bytes);  // pageable
cudaMemcpy(d_data, h_pageable, bytes, cudaMemcpyHostToDevice);

// Pinned memory: DMA transfers directly without bounce buffer
// Effective bandwidth: ~25-32 GB/s (PCIe gen4 x16 limit)
float *h_pinned;
cudaMallocHost(&h_pinned, bytes);  // pinned
cudaMemcpyAsync(d_data, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

Use cudaHostAlloc with cudaHostAllocWriteCombined for write-only host buffers (DMA write-combining eliminates cache pollution on the CPU side):

cudaHostAlloc(&h_input, bytes, cudaHostAllocWriteCombined);
// CPU writes into this, GPU DMAs it out
// CPU reads are slow (no cache), so don't read from WC memory

Overlap transfers with computation (pipeline pattern):

// Divide N elements into CHUNKS chunks
// Use 3 buffers, 2 streams for overlap
cudaStream_t stream_compute, stream_transfer;
cudaStreamCreate(&stream_compute);
cudaStreamCreate(&stream_transfer);

float *d_buf[2];  // double-buffered device memory
cudaMalloc(&d_buf[0], chunk_bytes);
cudaMalloc(&d_buf[1], chunk_bytes);

for (int i = 0; i < CHUNKS; i++) {
    int cur = i % 2, prev = (i-1) % 2;
    // Transfer current chunk
    cudaMemcpyAsync(d_buf[cur], h_data + i*chunk_n, chunk_bytes,
                    cudaMemcpyHostToDevice, stream_transfer);
    // Compute previous chunk (while current transfers)
    if (i > 0) {
        myKernel<<<grid, block, 0, stream_compute>>>(d_buf[prev], d_out+prev*chunk_n, chunk_n);
    }
    // Sync before reusing buffer
    cudaStreamSynchronize(stream_transfer);
    cudaStreamSynchronize(stream_compute);
}
// Process last chunk
myKernel<<<grid, block, 0, stream_compute>>>(d_buf[(CHUNKS-1)%2], ...);
cudaStreamSynchronize(stream_compute);

On A100 SXM (where GPU and CPU connect via NVLink/C2C instead of PCIe), CPU-GPU bandwidth is ~600 GB/s, eliminating this bottleneck entirely. DGX A100 uses NVLink rather than PCIe for CPU-GPU communication.

NVLink: GPU-GPU Communication

NVLink is NVIDIA's high-speed interconnect for GPU-to-GPU communication, providing ~20× the bandwidth of PCIe.

NVLink 3.0 (A100): 600 GB/s total GPU-GPU bandwidth (12 NVLink links × 50 GB/s each). Used in DGX A100 and HGX A100 nodes.

NVLink 4.0 (H100): 900 GB/s total GPU-GPU bandwidth. NVSwitch 3 fabric connects up to 256 H100s with full all-pairs bandwidth.

NVSwitch fabric: In a DGX H100 system, 8 H100 GPUs are connected to 4 NVSwitch chips. Every GPU can communicate with every other GPU at 900 GB/s — not a bisectional reduction as in PCIe topologies.

P2P memory access via NVLink:

// Enable peer access (required before P2P transfers)
int peer_access;
cudaDeviceCanAccessPeer(&peer_access, 0, 1);  // GPU 0 to GPU 1
if (peer_access) cudaDeviceEnablePeerAccess(1, 0);

// Now GPU 0 can directly DMA into GPU 1's memory
// Or a kernel on GPU 0 can access GPU 1's d_ptr
cudaMemcpyPeerAsync(d_gpu1, 1, d_gpu0, 0, bytes, stream);

// NCCL (NVIDIA Collective Communications Library) uses NVLink for:
// AllReduce, AllGather, Reduce, Broadcast across multiple GPUs
// (Used in PyTorch DDP, Megatron-LM, etc.)

NCCL performance on DGX A100 (8× A100 with NVLink 3.0): - AllReduce on 1 GB: ~1 ms (vs ~10 ms over PCIe) - AllGather on 1 GB: ~0.5 ms - These numbers determine gradient synchronization overhead in multi-GPU training

Unified Memory Deep Dive

CUDA Unified Memory (cudaMallocManaged) allows a single pointer to be dereferenced on both CPU and GPU. The runtime handles page migration automatically via CPU/GPU page fault mechanisms.

Page fault mechanics: 1. GPU kernel accesses a managed memory address not resident on GPU 2. GPU page fault exception → interrupt the kernel (or stall the warp) 3. CUDA runtime migrates the page from CPU to GPU (PCIe DMA) 4. GPU retries the access — now succeeds

Page fault cost: ~1–10 µs per fault (PCIe transfer + TLB shootdown). For a kernel that faults on every access, this dominates computation time.

Prefetching: Eliminate faults by prefetching pages before the kernel runs:

cudaMemPrefetchAsync(data, bytes, deviceId, NULL);  // move to GPU asynchronously
// ... kernel launch ...
cudaDeviceSynchronize();
cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, NULL);  // move back to CPU

Memory advice:

// Hint: data is primarily GPU-resident (read-mostly by GPU)
cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, deviceId);
// Creates a read-only copy on GPU — CPU modifications invalidate it

// Hint: preferred location is GPU
cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, deviceId);
// Pages start on GPU; if CPU faults, a copy is made on CPU

// Hint: data will be accessed by GPU (schedule pre-migration)
cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, deviceId);

When to use Unified Memory: - Datasets larger than GPU VRAM (automated page migration handles spilling) - Rapid prototyping where transfer management is not yet optimized - HPC workloads with irregular access patterns that are hard to batch explicitly

When not to use Unified Memory: - High-throughput inference/training — explicit transfers with pinned memory and streams are 2–5× faster - Latency-sensitive applications — page fault latency is unpredictable

GPU Memory OOM

When cudaMalloc fails with cudaErrorMemoryAllocation:

# Check GPU memory usage
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv

# In code: check memory before allocating large buffers
size_t free_mem, total_mem;
cudaMemGetInfo(&free_mem, &total_mem);
if (bytes > free_mem * 0.9) {
    fprintf(stderr, "Insufficient GPU memory: need %zu, have %zu\n",
            bytes, free_mem);
    exit(1);
}

GPU memory fragmentation: Unlike CPU allocators, the CUDA allocator manages GPU VRAM. Repeated alloc/free of varied sizes can fragment GPU memory — a large contiguous allocation fails even with sufficient total free memory. Use memory pools (cudaMemPool) to reduce fragmentation:

cudaMemPool_t pool;
cudaDeviceGetDefaultMemPool(&pool, deviceId);
cudaMallocFromPoolAsync(&d_ptr, bytes, pool, stream);
cudaFreeAsync(d_ptr, stream);  // Returns to pool, not to OS

Historical Context

GPU memory architecture has evolved dramatically from the 150 GB/s of GDDR3 on the GeForce 8800 GTX (2006) to 2,000 GB/s on the A100 (2020). The introduction of HBM technology (NVIDIA Pascal P100, 2016) increased bandwidth by ~4× compared to GDDR5 at the same generation, enabling a new class of memory-bandwidth-bound applications. The PCIe bottleneck between CPU and GPU — identified as a primary bottleneck in multi-GPU distributed training — led to NVLink's development (first generation in Pascal, 2016) and eventually to NVSwitch fabrics (Volta, 2017). The H100's NVLink 4.0 (900 GB/s) represents the current state of GPU-GPU interconnect, while AMD's MI300X (2024) with 5.3 TB/s HBM3 bandwidth represents the alternative architectural direction: prioritizing memory bandwidth over GPU-GPU interconnect.

Production Examples

# Profile memory bandwidth of a kernel with Nsight Compute
ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum,\
l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum,\
sm__cycles_active.sum \
./my_kernel_benchmark

# Compute achieved bandwidth:
# achieved_bw (GB/s) = (ld_bytes + st_bytes) / elapsed_time_ns

# Full memory hierarchy metrics
ncu --metrics \
  l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,\
  lts__t_requests_srcunit_tex_op_read.sum,\
  dram__bytes_read.sum,\
  dram__bytes_write.sum \
  ./kernel

# Memory access pattern visualization
ncu --section MemoryWorkloadAnalysis ./kernel

Debugging Notes

compute-sanitizer --tool racecheck detects shared memory hazards: read-after-write within a warp without __syncthreads().
High l2_global_load_transactions to warp_l2_load_requests ratio indicates many cache misses — improve locality or reuse via shared memory.
A bandwidth-bound kernel's performance should scale linearly with global memory bandwidth. If it doesn't, the bottleneck is elsewhere (compute, latency, synchronization).
Nsight Compute's "Roofline Analysis" section places the kernel on the roofline chart — immediately shows whether it is memory-bound or compute-bound and how far from peak.

Security Implications

GPU memory remnants: cudaFree returns memory to the CUDA allocator but does not zero it. A subsequent cudaMalloc may receive the same physical pages, allowing one kernel to read another's data. On multi-tenant systems (cloud GPUs without MIG), this is a serious concern. NVIDIA's CUDA_DISABLE_MEMORY_REUSE environment variable forces page return to the OS (which zeroes pages), but at significant performance cost.
PCIe DMA attacks: PCIe enables DMA between CPU and GPU. A malicious PCIe device (Thunderbolt, rogue NIC) with DMA access could read GPU memory buffers mapped at known physical addresses. IOMMU enables protection but must be configured correctly (VFIO, amd_iommu=on).
Side-channel via memory timing: GPU memory timing channels (cache hit/miss timing) can leak information about concurrent kernels' access patterns in shared GPU contexts (time-sharing without MIG).

Performance Implications

Roofline model: Every kernel operates at a specific arithmetic intensity. Kernels below the ridge point (A100 FP32: ~10 FLOP/byte) benefit primarily from memory bandwidth improvements. Kernels above benefit from compute throughput improvements.
Tensor cores require specific formats: Tensor core operations (wmma or cuBLAS Level-3) require data in specific shapes (multiple of 16 for Ampere Tensor Cores). Padding to the nearest multiple is necessary.
Memory allocation cost: cudaMalloc can take 100–500 µs for large allocations. Pre-allocate GPU memory at application startup; use memory pools for frequent alloc/free cycles.
Atomic operations on global memory: atomicAdd on global memory has high latency and serialization. Use warp-level reductions (__reduce_add_sync) or shared memory atomics first, then global atomic once per block.

Failure Modes

Silent data corruption from uncoalesced access: The kernel produces wrong results when accessed stride doesn't match expected. No error — just wrong numbers. Use compute-sanitizer --tool initcheck to detect uninitialized reads.
Bank conflict performance cliff: A tiled kernel is fast on square tile sizes but slow on non-square (because the padding trick doesn't apply the same way). Profile shared_load_replay metric — > 1.0 indicates conflicts.
Unified memory thrash: A kernel that alternates CPU and GPU access without cudaMemPrefetchAsync causes continuous page migration. Monitor with unified_cache_global_hit_rate and pagefaults metrics.
NVLink bandwidth saturation in AllReduce: With 8 GPUs and large gradients, AllReduce can saturate NVLink, causing gradient sync to become the training bottleneck. Use gradient compression (FP16 or FP8 gradients, structured sparsity) to reduce bandwidth demand.

Modern Usage

Flash Attention (Dao et al., 2022) demonstrates the importance of shared memory and memory hierarchy awareness in production DL kernels. Standard attention computes S = QK^T (N×N matrix) and stores it in global memory before computing softmax(S)V. For N=4096 and d=64, this is 4096^2 × 4 bytes = 64 MB. Flash Attention tiles Q, K, V into shared memory, computing and consuming each tile without materializing the full N×N attention matrix in global memory. Result: attention for long sequences fits in VRAM when it previously didn't, and achieves 2–4× speedup due to HBM bandwidth savings.

cuDNN's Workspace Memory: cuDNN (NVIDIA's DNN library) requires a "workspace" — a scratch buffer in GPU global memory used for intermediate computations in convolution algorithms. The workspace size varies by algorithm (FFT-based convolution uses large workspace; implicit GEMM uses small). cudnnGetConvolutionForwardWorkspaceSize queries the required size. Applications must allocate this workspace and pass it to cuDNN.

Future Directions

Chiplet memory (HBM in package): AMD MI300X integrates 192 GB HBM3 with compute dies on a single interposer. NVIDIA GB200 integrates 192 GB HBM3e per GPU. Future GPUs will have 384+ GB per device.
Compute-in-memory: Samsung AXDIMM and HBM-PIM embed processing elements in DRAM modules, computing directly in memory and eliminating the memory bandwidth wall for certain operations.
CXL (Compute Express Link): A new interconnect standard for CPU-GPU and GPU-accelerator memory coherence. CXL 3.0 enables cache-coherent shared memory between CPUs and GPU-like accelerators, potentially replacing the "copy data to GPU, compute, copy back" model with a shared memory model at speeds approaching NVLink.
Memory compression: Hardware-level lossless compression in the GPU memory controller (NVIDIA's delta color compression for framebuffers) extended to compute workloads. Can effectively increase bandwidth by 2× for compressible data patterns.

Exercises

Write two CUDA kernels: one that reads a 2D matrix row-by-row (coalesced) and one that reads it column-by-column (uncoalesced). For both, measure achieved bandwidth via CUDA events and compare to the theoretical peak. Use Nsight Compute's l2_global_load_transactions metric to quantify the transaction count difference.
Implement a shared memory bank conflict experiment: write a kernel where all 32 threads in a warp access different rows of a float tile[32][32] (equivalent to stride-32 column access). Measure the kernel time. Then add the +1 padding (float tile[32][33]). Measure again. The speedup should be ~32× for the conflict-free version.
Implement a ping-pong memory benchmark using Unified Memory: allocate a 1 GB managed array, have the CPU write to it, then launch a GPU kernel to read it, then have the CPU read the results. Measure total time including page migration. Compare to explicit cudaMalloc + cudaMemcpy + kernel + cudaMemcpy back.
Implement the Flash Attention SRAM-tiling concept (simplified): write a kernel that computes a matrix-vector product y = A * x where A is so large it doesn't fit in shared memory. Tile A into shared memory blocks, computing partial results that are accumulated. Compare achieved HBM bandwidth vs a naive kernel that loads A into registers directly.
Profile a custom kernel using all four Nsight Compute sections: Memory Workload Analysis, Compute Workload Analysis, Warp State Statistics, and the Roofline chart. Write a 1-page analysis identifying the primary bottleneck, the secondary bottleneck, and two specific code changes that would improve performance.

References

Dao, T. et al. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. https://arxiv.org/abs/2205.14135
NVIDIA A100 GPU Memory Subsystem White Paper. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf
Volkov, V. & Demmel, J. "Benchmarking GPUs to Tune Dense Linear Algebra." SC 2008. (Classic roofline model for GPUs)
Harris, M. "How to Access Global Memory Efficiently in CUDA C/C++ Kernels." NVIDIA Developer Blog. https://developer.nvidia.com/blog/how-access-global-memory-efficiently-cuda-c-kernels/
Luitjens, J. "GPU Pro Tip: CUDA 7 Streams Simplify Concurrency." NVIDIA Developer Blog.
NVIDIA H100 Tensor Core GPU Architecture (HBM3 specs): https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper
Muyan-Ozcelik, P. et al. "High-Performance Shared Memory Programming in CUDA." NVIDIA whitepaper.