Skip to content

CUDA Programming Model

Technical Overview

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel programming platform and execution model. It extends C/C++ with a small set of keywords and intrinsics that allow a CPU program (the host) to launch parallel functions called kernels on the GPU (the device). The programming model maps CUDA's hierarchical thread organization (grid → block → thread) onto the GPU's physical hierarchy (GPU → SM → CUDA cores / warp).

CUDA's execution model achieves performance through a combination of: massive parallelism (tens of thousands of threads active), hardware warp scheduling to hide memory latency, and explicit memory management that allows programmers to control data placement in the memory hierarchy. A well-written CUDA kernel can achieve 80–90% of theoretical peak throughput on the A100.

Prerequisites

  • GPU architecture fundamentals (see 01-gpu-architecture.md)
  • C/C++ programming, including pointers and memory management
  • Understanding of parallel algorithms (reductions, scans, sorts)
  • Basic knowledge of floating-point arithmetic

Grid/Block/Thread Hierarchy Diagram

CUDA Execution Hierarchy:

  Kernel Launch: myKernel<<<gridDim, blockDim>>>()

  +------------------------------------------+
  |              Grid                        |
  |  (gridDim = {4, 2} = 8 blocks total)    |
  |                                          |
  |  +-------+-------+-------+-------+      |
  |  | Block | Block | Block | Block |      |
  |  | (0,0) | (1,0) | (2,0) | (3,0)|      |
  |  +-------+-------+-------+-------+      |
  |  | Block | Block | Block | Block |      |
  |  | (0,1) | (1,1) | (2,1) | (3,1)|      |
  |  +-------+-------+-------+-------+      |
  +------------------------------------------+

  Each Block (blockDim = {32, 4} = 128 threads):
  +-------------------------------------------+
  |  Thread  Thread  Thread  ...  Thread       |
  | (0,0,0) (1,0,0) (2,0,0)     (31,0,0)     |
  | (0,1,0) (1,1,0) (2,1,0)     (31,1,0)     |
  | (0,2,0) (1,2,0) (2,2,0)     (31,2,0)     |
  | (0,3,0) (1,3,0) (2,3,0)     (31,3,0)     |
  +-------------------------------------------+
  Threads in same block share: __shared__ memory,
  can synchronize with __syncthreads()

  Warp = 32 consecutive threads in x-dimension
  (threadIdx.x % 32 == warp lane)

  Global Thread ID (1D example):
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

Core Content

Host/Device Execution Model

CUDA's execution is heterogeneous: the CPU controls the program flow, manages memory, and launches kernels on the GPU. Key vocabulary:

  • Host: CPU and its memory
  • Device: GPU and its memory (VRAM/HBM)
  • Kernel: A CUDA function executed by thousands of GPU threads in parallel
  • Global memory: GPU DRAM, accessible by all threads but slow (~600 cycles)
  • Shared memory: On-SM SRAM, shared within a thread block, fast (~30 cycles)

Basic CUDA program structure:

#include <cuda_runtime.h>
#include <stdio.h>

// Kernel function — executed on GPU
// __global__ keyword makes it a kernel (callable from host, runs on device)
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    // Each thread computes one output element
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        c[tid] = a[tid] + b[tid];
    }
}

int main() {
    int n = 1 << 20;  // 1M elements
    size_t bytes = n * sizeof(float);

    // Allocate host memory
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);

    // Initialize host arrays
    for (int i = 0; i < n; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }

    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy host → device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch kernel: 256 threads/block
    int threads_per_block = 256;
    int blocks = (n + threads_per_block - 1) / threads_per_block;
    vector_add<<<blocks, threads_per_block>>>(d_a, d_b, d_c, n);

    // Check for launch errors
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Kernel launch error: %s\n", cudaGetErrorString(err));
    }

    // Copy device → host (also synchronizes)
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Verify
    for (int i = 0; i < 10; i++) {
        printf("c[%d] = %.1f (expected %.1f)\n", i, h_c[i], h_a[i] + h_b[i]);
    }

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Compile: nvcc -O2 -arch=sm_80 vector_add.cu -o vector_add

Kernel Launch Syntax and Thread Indexing

Kernel launch: kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args...)

  • gridDim: dim3 (up to 3D) specifying the number of blocks in each dimension. Max per dimension: 2^31-1 (x), 65535 (y, z).
  • blockDim: dim3 specifying threads per block. Max 1024 threads total per block (in any dimension combination summing to ≤1024). Must be a multiple of 32 for full warp utilization.
  • sharedMemBytes: Dynamic shared memory allocated per block (optional, default 0)
  • stream: CUDA stream for async execution (optional, default stream)

Thread indexing built-ins (read-only within a kernel):

// Thread position within its block (0-based, 3D)
threadIdx.x, threadIdx.y, threadIdx.z

// Block position within the grid (0-based, 3D)
blockIdx.x, blockIdx.y, blockIdx.z

// Block dimensions (threads per block)
blockDim.x, blockDim.y, blockDim.z

// Grid dimensions (blocks per grid)
gridDim.x, gridDim.y, gridDim.z

// 1D global thread ID:
int tid = blockIdx.x * blockDim.x + threadIdx.x;

// 2D global thread ID:
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
int gid = row * gridDim.x * blockDim.x + col;

Choosing block size: Use multiples of 32 (warp size). Common choices: 128, 256, 512 threads/block. Profile for occupancy vs register pressure tradeoff. The CUDA Occupancy Calculator API (cudaOccupancyMaxActiveBlocksPerMultiprocessor) computes theoretically optimal block size.

Warp Execution

A warp is 32 consecutive threads (in the x-dimension, for a 1D block: threads 0–31 form warp 0, threads 32–63 form warp 1, etc.). All 32 threads in a warp execute the same instruction simultaneously.

Warp scheduler: Each SM has 4 warp schedulers (Ampere). Each cycle, each scheduler can issue one instruction from one eligible warp to the execution units. With 64 resident warps per SM and 4 schedulers, the SM can issue 4 instructions per cycle (to 4 different warps). Latencies are hidden because while warp A is waiting for its memory access (600 cycles), warps B, C, D, E, ... are executing their instructions.

Dual-issue: Some CUDA architectures can dual-issue instructions from the same warp if they're independent (issue two instructions to the same warp in the same cycle). This provides ILP exploitation at the warp level.

Occupancy

Occupancy = (Active warps per SM) / (Maximum warps per SM)

Maximum warps per SM on A100: 64 (= 2048 threads / 32 threads per warp).

Occupancy is limited by three resources: 1. Registers: Register file = 65,536 × 32-bit per SM. Threads × registers_per_thread = register demand. If a kernel uses 32 regs/thread: 64 warps × 32 threads × 32 regs = 65,536 — exactly fills the register file → 100% theoretical occupancy from register perspective. 2. Shared memory: If the kernel allocates 48KB/block and the SM has 96KB shared memory, only 2 blocks fit → 2 × (blockDim.x / 32) warps. 3. Block count limit: Max 32 blocks per SM (Ampere). Small blocks with few warps may hit this limit before exhausting registers/shared memory.

// Query occupancy at runtime
int minGridSize, blockSize;
cudaOccupancyMaxPotentialBlockSize(
    &minGridSize,
    &blockSize,
    (void*)myKernel,
    0,    // dynamic shared mem per block
    0     // block size limit (0 = no limit)
);
// blockSize now contains the optimal block size for occupancy

CUDA Memory Types

Global memory: Off-chip GDDR6/HBM. 80GB on A100. Accessible by all threads. High latency (~600 cycles), but high bandwidth (2 TB/s). All cudaMalloc'd memory is global. Coalesced access is critical (see 03-gpu-memory-management.md).

Shared memory: On-chip SRAM per SM. Declared with __shared__. Lifetime: one kernel call, one thread block. Threads in the same block can read/write shared memory and synchronize via __syncthreads(). Used for data reuse (tiles in GEMM, stencil computations).

__global__ void tiled_gemm(float *A, float *B, float *C, int N) {
    __shared__ float tileA[TILE_SIZE][TILE_SIZE];
    __shared__ float tileB[TILE_SIZE][TILE_SIZE];

    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_SIZE; t++) {
        // Cooperatively load tiles into shared memory
        tileA[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE_SIZE + threadIdx.x)];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        __syncthreads();  // Wait for all threads to finish loading

        // Compute partial dot product
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();  // Wait before loading next tile
    }
    if (row < N && col < N) C[row * N + col] = sum;
}

Constant memory: 64KB read-only memory, cached per SM. Broadcast: if all threads in a warp read the same constant memory address, it is served from cache in one transaction — zero additional transactions. If threads read different addresses, it serializes (one transaction per unique address). Best for read-only parameters.

Texture memory: Read-only, cached. 2D spatial locality caching. Useful when access patterns have 2D locality (image processing). Available as __ldg() intrinsic for load via texture cache path.

Local memory: Register spills go here — physically in global memory, per-thread. Avoid at all costs (high latency, high bandwidth pressure). Use --maxrregcount to force register limits (trading off spills vs occupancy).

Register: Per-thread, private, implicit. Fastest. Avoid arrays in registers if they must be indexed dynamically (compiler may spill them to local memory).

CUDA Streams and Concurrency

A CUDA stream is a sequence of GPU operations that execute in order. Operations in different streams may execute concurrently (if hardware resources allow).

Default stream (stream 0): All operations in the default stream are serialized. Equivalent to placing all GPU work on a single queue.

Concurrent execution with multiple streams:

// Create streams
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// Overlap computation on stream1 with data transfer on stream2
// This requires PINNED (page-locked) host memory for overlap to work
cudaMallocHost(&h_data, bytes);  // pinned host memory

// Async copy on stream2
cudaMemcpyAsync(d_data2, h_data2, bytes, cudaMemcpyHostToDevice, stream2);

// Kernel on stream1 (can run concurrently with the memcpy)
kernel1<<<blocks, threads, 0, stream1>>>(d_data1, d_result1, n);

// Synchronize specific stream
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);

cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);

For peak GPU utilization on data-parallel workloads, use the compute-transfer overlap pattern: while the GPU processes batch N, transfer batch N+1 from host to GPU, and transfer batch N-1's results from GPU to host. This requires at least 3 buffers (triple buffering) and 2 non-default streams.

CUDA Events for Timing

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);  // Record start event in default stream

myKernel<<<blocks, threads>>>(d_data, n);

cudaEventRecord(stop, 0);   // Record stop event in default stream
cudaEventSynchronize(stop); // Wait for stop event to complete

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel time: %.3f ms\n", milliseconds);
printf("Throughput: %.2f GB/s\n",
    (2.0 * n * sizeof(float)) / (milliseconds * 1e6));

cudaEventDestroy(start);
cudaEventDestroy(stop);

Always use CUDA events for GPU timing, not CPU clock() or gettimeofday() — those measure host time and include CUDA API call overhead, not actual kernel execution time.

Unified Memory (cudaMallocManaged)

Unified Memory creates a managed memory space accessible from both CPU and GPU. The CUDA runtime handles page migration:

float *data;
cudaMallocManaged(&data, bytes);  // Allocate in managed memory

// CPU can initialize it directly
for (int i = 0; i < n; i++) data[i] = i;

// GPU can access it directly (runtime migrates pages to GPU)
myKernel<<<blocks, threads>>>(data, n);
cudaDeviceSynchronize();

// CPU can access after sync (pages migrated back if needed)
printf("data[0] = %f\n", data[0]);

cudaFree(data);

Performance note: Page migration has significant latency (~microseconds per page fault). For latency-sensitive workloads, prefer explicit cudaMalloc + cudaMemcpy. Unified Memory is primarily useful for: - Simplifying code that intermixes CPU and GPU access - Managing datasets larger than GPU memory (pages only on GPU when needed) - Prefetching with cudaMemPrefetchAsync to control migration timing

// Prefetch to GPU before kernel launch
cudaMemPrefetchAsync(data, bytes, deviceId, NULL);
myKernel<<<blocks, threads>>>(data, n);
// Prefetch back to CPU after kernel
cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, NULL);

CUDA Debugging

compute-sanitizer (replacement for cuda-memcheck):

# Check for memory errors
compute-sanitizer --tool memcheck ./myapp

# Check for race conditions in shared memory
compute-sanitizer --tool racecheck ./myapp

# Check for uninitialized memory usage
compute-sanitizer --tool initcheck ./myapp

# Check for synchronization hazards
compute-sanitizer --tool synccheck ./myapp

cuda-gdb (CUDA-aware GDB):

cuda-gdb ./myapp
(cuda-gdb) break myKernel
(cuda-gdb) run
# At breakpoint:
(cuda-gdb) cuda thread  # shows current CUDA thread position
(cuda-gdb) info cuda threads  # all active threads
(cuda-gdb) print threadIdx.x  # CUDA built-in variables accessible
(cuda-gdb) print d_data[0]    # inspect device memory

printf debugging in kernels:

__global__ void debug_kernel(float *data, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        // Conditional printf to avoid flooding output
        if (tid < 5 || isnan(data[tid])) {
            printf("Thread %d: data[%d] = %f\n", tid, tid, data[tid]);
        }
    }
}
// Note: printf in kernels uses a circular buffer; output appears after
// cudaDeviceSynchronize()

CUDA Error Handling

Every CUDA API call returns cudaError_t. Never ignore errors in production code:

// Error checking macro
#define CUDA_CHECK(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error at %s:%d: %s\n", \
            __FILE__, __LINE__, cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    } \
} while(0)

// Kernel launch errors are asynchronous — check separately
myKernel<<<blocks, threads>>>(data, n);
CUDA_CHECK(cudaGetLastError());       // Check for launch config error
CUDA_CHECK(cudaDeviceSynchronize());  // Wait + check for runtime error

Historical Context

CUDA was announced by NVIDIA at SC06 (Supercomputing 2006) alongside the Tesla GPU architecture. The CUDA programming model emerged from NVIDIA's observation that graphics shaders were becoming increasingly general-purpose computation. The key innovation was abstracting the shader programming model into a general parallel computation model with explicit memory hierarchy. Early CUDA required programming in shader-like constructs; CUDA 1.0 introduced the C language extension that made it accessible to non-graphics programmers. CUDA 6.0 (2014) introduced Unified Memory. CUDA 10.0 (2018) introduced CUDA Cooperative Groups for flexible thread synchronization. As of 2025, CUDA 12.x is current.

Production Examples

A high-performance GEMM kernel using cuBLAS (NVIDIA's optimized BLAS library, not hand-written CUDA):

#include <cublas_v2.h>

void gemm_example(cublasHandle_t handle,
                  const float *A, const float *B, float *C,
                  int M, int N, int K) {
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C
    // Note: cuBLAS uses column-major order
    // For row-major C++ arrays, swap A and B and transpose flags
    cublasSgemm(handle,
        CUBLAS_OP_N, CUBLAS_OP_N,
        N, M, K,    // m, n, k (note column-major swap)
        &alpha,
        B, N,       // B (column-major lda)
        A, K,       // A (column-major lda)
        &beta,
        C, N);      // C (column-major lda)
}

A100 cuBLAS SGEMM throughput: ~19.5 TFLOPS (close to theoretical peak for large matrices).

Debugging Notes

  • cudaDeviceSynchronize() forces the CPU to wait for all GPU work to complete. Always call before reading device results and before timing measurements.
  • Kernel launch returns immediately (async) — errors in kernel execution (e.g., illegal memory access) don't surface until the next synchronizing CUDA API call. This makes debugging async errors confusing. Use CUDA_LAUNCH_BLOCKING=1 to force synchronous launches during debugging.
  • __device__ functions are inlined by default in CUDA (unlike regular C inline). Use __noinline__ to prevent inlining for profiling.
  • assert() in kernels: CUDA supports assert(condition) in device code. Failed asserts cause the kernel to abort and report via cudaGetLastError().

Security Implications

  • Kernel code runs in GPU ring 0 (privileged) context relative to other CUDA kernels but not relative to the host CPU — it cannot directly escape to the host.
  • cudaMallocManaged pages can be accessed from both CPU and GPU; a GPU OOB access can corrupt host-mapped memory pages, potentially escalating to a host crash.
  • CUDA kernels have access to the entire GPU address space of their context. A bug that computes a bad pointer can corrupt any GPU allocation in the same context.
  • Multi-tenant GPU sharing (without MIG): without MIG partitioning, CUDA kernels from different processes can time-share the GPU but have separate address spaces. Address translation isolates them at the hardware level.

Performance Implications

  • Launching many small kernels has overhead (~5–10 µs per launch for PCIe, ~2 µs for NVLink). Fuse small kernels when possible, or use CUDA Graphs (capture and replay a sequence of kernels without re-launching overhead).
  • cudaMemcpy is synchronous (blocks the CPU thread). Use cudaMemcpyAsync with streams to overlap transfers with computation.
  • Shared memory __syncthreads() adds a barrier — all threads must reach it before any proceeds. Deadlocks occur if __syncthreads() is called conditionally (different threads take different code paths).
  • Block size of 32 (one warp) is legal but usually suboptimal — many SMs allow 32 resident blocks, and small blocks fill this count limit before exhausting SM resources.

Failure Modes

  • Illegal memory access (CUDA_ERROR_ILLEGAL_ADDRESS): Thread accesses out-of-bounds global memory. Check index computation — the if (tid < n) guard is critical.
  • Deadlock in __syncthreads(): Threads in a block diverge (some enter an if block, others don't) and __syncthreads() is inside the if. Threads outside the if never reach __syncthreads(). All threads must call __syncthreads() on the same code path.
  • Warp divergence killing performance: A kernel that runs fine on small, uniform test data is slow on real data with varied values because branches diverge. Profile with warp_execution_efficiency < 50%.
  • Register spill performance cliff: Adding one parameter to a kernel causes the compiler to use one more register, tipping the register count past a boundary that halves occupancy. Profile with Nsight Compute's "register_spills" metric.

Modern Usage

CUDA Graphs (CUDA 10+): Capture a sequence of kernel launches, memcpys, and synchronizations into a replayable graph. Replay overhead is ~1–2 µs vs 5–10 µs per individual launch:

cudaGraph_t graph;
cudaGraphExec_t instance;
cudaStream_t captureStream;
cudaStreamCreate(&captureStream);

// Capture a graph
cudaStreamBeginCapture(captureStream, cudaStreamCaptureModeGlobal);
kernel1<<<grid, block, 0, captureStream>>>(d_data, n);
kernel2<<<grid, block, 0, captureStream>>>(d_data, n);
cudaStreamEndCapture(captureStream, &graph);
cudaGraphInstantiate(&instance, graph, NULL, NULL, 0);

// Replay the graph (fast, ~2µs overhead)
for (int iter = 0; iter < 1000; iter++) {
    cudaGraphLaunch(instance, captureStream);
    cudaStreamSynchronize(captureStream);
}

NVIDIA Triton (PyTorch triton): A Python-level GPU kernel language that generates optimized CUDA kernels. Used by PyTorch 2.x's torch.compile to fuse operations and generate efficient custom kernels without writing CUDA directly.

Future Directions

  • CUDA 12 Thread Block Clusters (Hopper only): Groups of blocks (up to 8) that can share data via distributed shared memory across SMs — enabling larger tile sizes for GEMM and attention kernels.
  • PTX ISA evolution: NVIDIA's PTX (Parallel Thread eXecution) virtual ISA allows writing GPU-agnostic assembly that is JIT-compiled by the GPU driver to the actual microcode. Future SM generations add new PTX instructions (e.g., wgmma for Hopper's warpgroup-level MMA).
  • Cooperative Groups evolution: More flexible synchronization primitives beyond block and grid-level barriers.
  • CUDA Python native: Native CUDA kernel writing from Python without C++ compilation, enabling rapid GPU prototyping.

Exercises

  1. Implement a parallel reduction kernel that sums N floats. Start with the naive atomicAdd approach, then implement a warp-shuffle-based reduction (__shfl_down_sync), then a two-pass reduction (block-level + final block). Benchmark all three. Expected speedup of warp shuffle over atomicAdd: 10–20x.
  2. Implement a matrix transpose kernel without shared memory (naive) and with shared memory (tiled, avoiding bank conflicts with TILE_SIZE+1 padding). Profile both with l2_global_load_transactions and shared_load_transactions. Measure achieved memory bandwidth vs theoretical.
  3. Implement a stream-based pipeline that transfers and processes N batches: use 3 streams (transfer N-1, compute N, transfer-back N+1 simultaneously). Compare throughput to a single-stream sequential version. Quantify the overlap achieved with CUDA events.
  4. Write a kernel that intentionally causes warp divergence (e.g., if (threadIdx.x % 2) { /* expensive path */ } else { /* cheap path */ }). Profile with Nsight Compute's warp_execution_efficiency metric. Then restructure to group similar-work threads in the same warp and remeasure.
  5. Use CUDA Unified Memory to implement a histogram on a dataset too large to fit in GPU memory (simulate by allocating 90% of GPU VRAM). Use cudaMemAdvise with cudaMemAdviseSetPreferredLocation(data, deviceId) to hint page placement. Measure page fault rate and throughput vs explicit cudaMalloc + batched processing.

References

  • CUDA C++ Programming Guide: https://docs.nvidia.com/cuda/cuda-c-programming-guide/
  • CUDA Best Practices Guide: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
  • NVIDIA Deep Learning Performance Guide: https://docs.nvidia.com/deeplearning/performance/
  • Volkov, V. "Better Performance at Lower Occupancy." GTC 2010. — Classic talk on occupancy myths.
  • Luitjens, J. "CUDA Streams: Best Practices and Common Pitfalls." GTC 2014.
  • Harris, M. "Optimizing Parallel Reduction in CUDA." NVIDIA whitepaper.
  • Nath, R. et al. "An Improved MAGMA GEMM For Fermi Graphics Processing Units." Int. J. High Performance Computing Apps, 2010.