04 — GPU Scheduling and Drivers

Overview

The GPU driver stack is one of the most complex software systems on a modern Linux host. It spans from kernel-space hardware management through firmware communication to user-space API libraries, across a hardware boundary that has no standard ISA and changes with every GPU generation. Unlike the CPU, where the OS scheduler has fine-grained control of process execution, the GPU historically ran submitted work to completion — the OS could not preempt a compute shader mid-execution. This document traces the full path from a CUDA API call to hardware execution, explains how modern GPUs handle multi-process sharing, and covers the mechanisms — preemption, MPS, MIG — that make multi-tenant GPU usage viable in cloud environments.

Prerequisites

Understanding of GPU architecture and SMs (see 31-gpu-systems/01-gpu-architecture.md)
Familiarity with CUDA programming model (see 31-gpu-systems/02-cuda-programming-model.md)
Understanding of Linux kernel module architecture and DMA (see 03-kernel-fundamentals/)
Familiarity with virtual memory and IOMMU concepts

Historical Context

Early GPUs (pre-2007) had no concept of a software-managed command queue. The GPU driver wrote directly to memory-mapped hardware registers to program the fixed-function pipeline. With the introduction of general-purpose shader programming (NVIDIA GeForce 8800 GT, 2006), GPUs needed a more flexible command submission model. NVIDIA introduced the CUDA driver model in 2007, establishing the pattern used today: the CPU builds a command buffer, the GPU's hardware command processor reads it, and the GPU executes work described in those commands.

The problem of scheduling and multi-tenancy has grown more acute as GPUs became primary compute assets. In 2006, one GPU per workstation was the assumption. By 2024, a single H100 SXM5 costs $30,000–$40,000. Efficient multi-process sharing — without losing isolation or performance — is a pressing operational and economic problem.

GPU Work Submission Model

The fundamental model for submitting work to a GPU:

CPU side:                          GPU side:

User application                   GPU Command Processor
    |                                     ^
    | 1. Build command buffer             |
    v                                     |
libcuda.so / libGL.so                     |
    |                                     |
    | 2. Write cmds to ring buffer        |
    v                                     |
Ring buffer (in pinned CPU/GPU memory) ---+
    |                                     |
    | 3. Write to doorbell register       |
    v                                     |
Doorbell MMIO register -----> GPU wakes, reads ring buffer
                                          |
                                          v
                                 Execute work (kernel launches,
                                 memory copies, pipeline draws)
                                          |
                                          v
                                 Completion: write to fence memory
                                 Optionally: MSI-X interrupt to CPU

Ring Buffer / Command Stream

The ring buffer is a circular queue of commands. The CPU maintains a write pointer (head); the GPU maintains a read pointer (tail). Commands are:

Kernel launches (LAUNCH packets in NVIDIA's command stream)
Memory copy operations
Pipeline state changes (shader binds, descriptor set updates in Vulkan)
Fence/semaphore operations (synchronization with other queues or the CPU)
Compute dispatch commands

When the CPU writes new commands, it updates the write pointer and writes to a doorbell register — a memory-mapped I/O register in the GPU's PCI BAR (Base Address Register) space. Writing the doorbell wakes the GPU's command processor from an idle state.

Ring buffer state:

  +-----+-----+-----+-----+-----+-----+-----+-----+
  | CMD | CMD | CMD | CMD |     |     |     | CMD |
  +-----+-----+-----+-----+-----+-----+-----+-----+
                      ^                        ^
                   GPU read ptr             CPU write ptr
                   (GPU consuming)          (CPU producing)

  Free space: from write ptr to read ptr (wrapping)
  In-flight: from read ptr to write ptr

NVIDIA GPU Driver Architecture

Full NVIDIA Driver Stack:

User space:
  +--------------------------------------------------+
  | Application (CUDA program, OpenGL app, PyTorch) |
  +--------------------------------------------------+
       |                           |
  libcuda.so                  libGL.so / libEGL.so
  (CUDA Driver API)          (OpenGL/Vulkan ICD)
       |                           |
  +--------------------------------------------------+
  | CUDA Runtime (libcudart.so)  — higher-level API  |
  +--------------------------------------------------+
       |
  ioctl() system calls
       |
Kernel space:
  +--------------------------------------------------+
  | nvidia.ko  (NVIDIA kernel module)                |
  |   - Device file: /dev/nvidia0, /dev/nvidiactl    |
  |   - RM (Resource Manager): allocates VRAM,       |
  |     creates channels, manages contexts           |
  |   - GSP (GPU System Processor) firmware loader  |
  |   - IOMMU / DMA management                       |
  |   - Interrupt handler (MSI-X)                    |
  +--------------------------------------------------+
       |
  PCI Express bus
       |
  +--------------------------------------------------+
  | GPU Hardware                                     |
  |   - GSP (firmware CPU on GPU)                    |
  |   - CE (Copy Engines for DMA)                    |
  |   - GR (Graphics Engine / SMs)                   |
  |   - MMU (GPU virtual address translation)        |
  +--------------------------------------------------+

nvidia.ko Responsibilities

The NVIDIA kernel module (nvidia.ko) is a closed-source monolithic driver with a large attack surface. Its responsibilities:

Device enumeration: Detects GPU PCI devices, maps BAR registers, reads GPU firmware.
Memory management: Allocates VRAM for GPU buffers. Manages the GPU's page table (separate from CPU page table). Handles GPU memory eviction when VRAM is full (spills to system RAM via PCIe).
Channel management: A channel is the kernel object representing a GPU context's command submission path — essentially one ring buffer + associated GPU context state. Each CUDA context gets at least one channel.
GSP firmware: Since Ampere (A100), NVIDIA offloaded most RM (Resource Manager) functions to a firmware CPU called GSP (GPU System Processor), a RISC-V core on the GPU die. The kernel driver communicates with GSP via a message queue in shared memory.
Interrupt handling: GPU raises MSI-X interrupts for workload completion, page faults (ATS/SVM — Shared Virtual Memory), and error conditions (ECC, NVLink errors).

User-Space: libcuda.so

libcuda.so is the CUDA Driver API — the lowest-level user-space interface. It communicates with nvidia.ko via ioctl() on /dev/nvidia* devices. Key operations:

cuCtxCreate(): allocates a GPU context (virtual address space + channel)
cuMemAlloc(): allocates device memory (calls into driver RM via ioctl)
cuLaunchKernel(): submits a kernel launch command to the ring buffer
cuStreamSynchronize(): waits for all GPU work in a stream to complete (inserts fence, waits on CPU)

The CUDA Runtime (libcudart.so) is built on top of the Driver API, adding lazy initialization, stream management, and the <<<grid, block>>> kernel launch syntax.

AMD ROCm Driver Stack

AMD's open-source ROCm (Radeon Open Compute) stack uses a different architecture:

AMD ROCm Stack:

User space:
  HIP runtime (libamdhip64.so)  — CUDA-compatible API
       |
  ROCr runtime (libhsa-runtime64.so)  — HSA (Heterogeneous System Architecture) API
       |
  ioctl() on /dev/kfd (kernel fusion driver) and /dev/dri/renderD*
       |
Kernel space:
  amdgpu.ko   — DRM kernel driver (open source, upstream Linux)
    - KFD (Kernel Fusion Driver): compute workload path
    - KMS (Kernel Mode Setting): display path
    - TTM (Translation Table Manager): GPU memory management
    - AMDGPU IOMMU/PASID management
       |
  GPU hardware (RDNA / CDNA architecture)

The key difference from NVIDIA: amdgpu.ko is fully open-source and upstreamed into the Linux kernel. This enables system integrators, hypervisors, and security researchers to audit and modify the GPU driver.

GPU Context and Context Switch

A GPU context is analogous to a CPU process context: it contains:

A GPU virtual address space (separate GPU MMU page tables)
A set of channels (ring buffers for command submission)
Context-save area: GPU register state (shader register file state, etc.)

Context switching on the GPU is far more expensive than on the CPU:

CPU context switch:   ~1–10 μs   (save/restore ~1 KB of register state)
GPU context switch:   ~100–500 μs (save/restore up to 256 MB of SM register file state,
                                   potentially GB of scratchpad memory,
                                   plus TLB flush of GPU MMU)

This cost has a fundamental consequence: unlike the CPU, where context switches happen thousands of times per second across processes, the GPU historically did not preempt running work at all. Work ran to completion before the next submission was executed.

GPU Preemption

Historical Baseline: Run to Completion

On NVIDIA GPUs before Maxwell (2014), once a kernel (shader program) started executing on a SM, it ran until completion. There was no mechanism for the GPU hardware scheduler to pause it mid-execution. If a malformed shader looped infinitely, it would hang the GPU until the driver's watchdog timer (typically 2 seconds) reset the device.

Maxwell: Thread Block Preemption

NVIDIA Maxwell (GTX 900 series, 2014) introduced preemption at thread block granularity. The GPU can pause execution at a thread block boundary: complete all currently executing warps in the CTA (Cooperative Thread Array), save the CTA register state, and switch to a higher-priority context.

This is still coarse-grained: a thread block containing a slow shader must finish before preemption occurs. With 1024 threads per block and 1000+ SMs on a data-center GPU, preemption latency can still be tens of milliseconds.

Pascal and Volta: Instruction-Level Preemption

NVIDIA Pascal (P100, 2016) and Volta (V100, 2017) added instruction-level preemption for compute: the GPU can preempt at any instruction boundary within a warp, saving the full register file state of the preempted warp. This reduced preemption latency to ~10 μs.

H100: Green Contexts

NVIDIA Hopper (H100, 2022) introduces Green Contexts — a new abstraction allowing dynamic partitioning of SM resources between simultaneously executing contexts with hardware enforcement. Green Contexts enable true time-sliced multi-tenancy with predictable isolation guarantees, replacing the software-only MPS model for certain use cases.

Baseline: Context Switching

Without special mechanisms, multiple processes sharing a GPU use time-slicing: the driver switches between processes, one at a time, paying context-switch overhead for each transition. On a heavily loaded system, this overhead dominates for latency-sensitive workloads.

MPS: Multi-Process Service

NVIDIA MPS (Multi-Process Service) allows multiple processes to share GPU SM resources spatially within a single GPU context:

Without MPS:
  Process A (context switch) → Process B (context switch) → Process A
  Only one process uses GPU at a time
  Context switch overhead: ~100-500 μs each

With MPS:
  Process A kernel  ]
  Process B kernel  ] → Executing simultaneously on different SMs
  Process C kernel  ]

  Single GPU context serves all three processes
  No context switch overhead between them
  SM occupancy: 3x higher potential utilization

MPS works by routing all process command streams through a single MPS server process, which serializes submission but allows simultaneous SM execution when SM resources are available. Trade-off: no isolation — a bug in Process A that corrupts GPU memory can affect Process B. MPS is appropriate for trusted multi-tenant workloads (multiple models from the same organization), not for security isolation.

MPS architecture:
  Process A → mps-client A → \
  Process B → mps-client B → MPS server → GPU context → GPU HW
  Process C → mps-client C → /

  nvidia-cuda-mps-control daemon: manages the server lifecycle
  Enable: nvidia-smi -i 0 -c EXCLUSIVE_PROCESS && nvidia-cuda-mps-control -d

MIG: Multi-Instance GPU

NVIDIA Ampere (A100) and Hopper (H100) introduced MIG (Multi-Instance GPU), which partitions a GPU into up to 7 hardware-isolated GPU instances, each with:

Dedicated SM slices (compute)
Dedicated L2 cache partition
Dedicated DRAM partition (HBM slice)
Dedicated memory bandwidth
Hardware error isolation between instances

A100 SXM4 MIG partitioning example:

Full A100: 108 SMs, 80 GB HBM2e, 2 TB/s bandwidth
                 |
     +-----------+-----------+
     |           |           |
  MIG 3g.40gb  MIG 2g.20gb  MIG 2g.20gb
  (42 SMs,     (28 SMs,     (28 SMs,
   40 GB HBM)   20 GB HBM)   20 GB HBM)

Each MIG instance is a fully independent GPU from the software perspective:
  - Separate /dev/nvidia-caps/nvidia-cap<N>
  - Separate CUDA_VISIBLE_DEVICES assignment
  - Hardware enforced memory isolation
  - Cannot observe other instances' memory or compute activity

Other valid partitions: 1g.10gb (x7), 2g.20gb (x3 + 1g.10gb), etc.

MIG is the primary cloud isolation mechanism. AWS EC2 p4d instances expose individual MIG instances as separate virtual GPUs. GCP A2 instances use MIG for multi-tenant ML workloads. Unlike MPS, MIG provides hardware-enforced isolation: a bug or security vulnerability in one MIG instance cannot affect another.

GPU Scheduling in Kubernetes

Kubernetes GPU scheduling uses the device plugin framework:

GPU device plugin workflow:

  NVIDIA device plugin daemonset (runs on each GPU node)
       |
       v
  Queries nvidia-smi for GPU inventory
  Registers GPUs as Extended Resources:
    nvidia.com/gpu: 8  (for a DGX A100 node)
       |
       v
  Pod spec requests GPU:
    resources:
      limits:
        nvidia.com/gpu: 1
       |
       v
  Kubernetes scheduler assigns pod to node with available GPU
       |
       v
  Device plugin injects CUDA_VISIBLE_DEVICES env var into pod
  (specifies which GPU index the pod may use)
       |
       v
  Pod's CUDA code only sees the assigned GPU

MIG-aware scheduling requires nvidia.com/mig-1g.10gb, nvidia.com/mig-3g.40gb etc. as distinct resource types. The NVIDIA GPU Operator automates MIG configuration and device plugin deployment across Kubernetes nodes.

Debugging Notes

GPU hang investigation: nvidia-smi showing a process consuming 100% GPU utilization indefinitely indicates a stuck kernel. nvidia-bug-report.sh collects GPU state. The Xid error code in dmesg (Xid 43: GPU-side exception, Xid 45: preemption timeout) identifies the failure class.
Context leak: If CUDA applications exit abnormally, GPU contexts may not be cleaned up. nvidia-smi showing high VRAM used with no active processes indicates leaked contexts. Reloading nvidia.ko (rmmod nvidia_uvm; rmmod nvidia; modprobe nvidia) clears them, requiring a GPU reset.
MPS debugging: nvidia-smi --query-compute-apps=pid,used_memory --format=csv shows per-process GPU memory. MPS-shared processes appear as a single entry. Use CUDA_VISIBLE_DEVICES=0 nvidia-cuda-mps-control -d to start MPS on GPU 0.
MIG configuration: nvidia-smi mig -lgip lists available GPU instance profiles. nvidia-smi mig -cgi 3g.40gb,2g.20gb,2g.20gb -C creates compute instances. Changes survive reboot only if applied at boot via nvidia-smi --persistence-mode=1.
amdgpu debugging: dmesg | grep amdgpu for kernel-level errors. umr -R mmGRBM_STATUS (UMR tool) reads live GPU register state. ROCm provides rocminfo (device info) and rocm-smi (nvidia-smi equivalent).

Security Implications

nvidia.ko attack surface: The NVIDIA kernel module is closed-source and runs with full kernel privileges. A vulnerability in its ioctl handlers (which accept untrusted data from user space) is a kernel privilege escalation. CVE-2021-1076 (NVIDIA driver privilege escalation via race condition) and similar CVEs demonstrate this risk. Keep the NVIDIA driver updated.
MIG isolation boundary: MIG provides hardware isolation but not complete security isolation — a compromised MIG compute instance still shares the same physical GPU with other instances. Side-channel attacks (power measurement via RAPL, electromagnetic emissions) could theoretically leak information across MIG boundaries. This is an active research area.
MPS security: MPS explicitly does NOT provide security isolation. Never run untrusted workloads with MPS enabled.
IOMMU bypass: Without IOMMU (Input-Output Memory Management Unit) enabled for the GPU, a compromised GPU driver can DMA to any host physical address. Enable IOMMU in BIOS and configure intel_iommu=on or amd_iommu=on in kernel parameters. Verify with dmesg | grep -i iommu.
GPU memory scrubbing: When a VM or container releases GPU memory, that memory is not automatically zeroed. A subsequent tenant could read previous tenant's VRAM content. Require GPU memory scrubbing on deallocation (CUDA_DEVICE_RESET at process exit, or MIG which isolates DRAM hardware).

Performance Implications

Command submission latency: The overhead of cuLaunchKernel() calling into the kernel driver via ioctl is ~5–20 μs on a modern system. For short kernels (<10 μs), submission overhead dominates. Use CUDA Graphs to pre-record command sequences and replay them with a single ioctl.
Context switch impact: GPU context switches are expensive. Avoid patterns where multiple heavy compute processes time-share a GPU. Use MPS for co-located inference workloads.
PCIe bandwidth bottleneck: PCIe 4.0 x16 provides ~32 GB/s bidirectional bandwidth between CPU and GPU. NVLink (H100 SXM5: 900 GB/s GPU-to-GPU) vastly exceeds this. Structure computation to minimize PCIe transfers; keep tensors in VRAM across operations.
MIG performance isolation: MIG instances have guaranteed dedicated bandwidth. An overloaded MIG instance cannot steal bandwidth from another — unlike MPS where all processes share L2 and memory bandwidth.

Failure Modes

Xid 79: GPU has fallen off the bus: The GPU stopped responding to PCI transactions. Causes: overheating, power delivery issue, PCIe link training failure, hardware defect. Recovery requires machine reboot or physical GPU power cycle. Check dmesg for temperature warnings (nvidia-smi --query-gpu=temperature.gpu).
Double-bit ECC error (Xid 48): Uncorrectable DRAM error in VRAM. The affected memory page is retired. Monitor nvidia-smi --query-gpu=ecc.errors.uncorrected.aggregate.total. If accumulating, the GPU needs replacement.
MIG reconfiguration failure: Changing MIG partition topology requires all compute instances to be idle. If a process holds a MIG instance open, nvidia-smi mig -dci fails. Use fuser /dev/nvidia-caps/nvidia-cap* to identify the holding process.
CUDA out-of-memory during kernel launch: CUDA kernels require scratch memory (register spill, shared memory). If VRAM is fragmented, a launch may fail even with theoretically sufficient free memory. Use cudaMemGetInfo to monitor fragmentation. Restart the CUDA context to compact.

Modern Usage (2024–2025)

NVSwitch and NVLink fabric: In DGX H100 systems, 8 GPUs are connected via an NVSwitch ASIC providing all-to-all NVLink connectivity. The NVSwitch has its own driver (nvidia-nvswitch.ko) and appears as additional devices in nvidia-smi. All-reduce operations across 8 GPUs complete entirely within NVLink, not PCIe.
Confidential computing on GPU: NVIDIA H100 Confidential Computing mode encrypts GPU memory and provides attestation of the GPU firmware/driver state. TEE (Trusted Execution Environment) for GPU workloads enables processing sensitive data (patient records, financial data) on cloud GPUs without trusting the cloud provider.
GDR (GPUDirect RDMA): Enables remote GPU memory to be accessed directly over InfiniBand/RoCE, bypassing the CPU. Used in distributed training to eliminate PCIe copies in AllReduce. Requires GDR support in the network adapter and nvidia-peermem kernel module.

Future Directions

Hardware GPU preemption at warp level: Future GPU generations will likely support sub-CTA preemption with minimal overhead, enabling true fine-grained time-sharing.
Unified Virtual Memory (UVM) maturity: CUDA UVM (Unified Virtual Memory) allows CPU and GPU to share a single virtual address space with hardware page migration. As PCIe 5.0 and CXL increase interconnect bandwidth, UVM without explicit memory management becomes practical for more workloads.
Open-source NVIDIA drivers: NVIDIA released the GPU kernel module as open source (GPL/MIT dual-license) for Turing+ (RTX 2000+) GPUs in 2022. The open kernel module (nvidia-open-*) is now the recommended path on modern hardware, enabling upstream Linux integration.
GPU OS: Projects like Mosaic OS (MIT) explore preemptive, fine-grained GPU scheduling at the OS level, treating GPU scheduling as an OS resource management problem rather than a driver-level concern.

Exercises

Command queue inspection: Write a CUDA program that launches 100 sequential kernels with cudaEventRecord around each. Profile with Nsight Systems. Identify the time breakdown between: kernel execution, command submission overhead, event synchronization.
MPS performance: Run the same two CUDA inference workloads simultaneously: (a) with default GPU scheduling (time-sliced), (b) with MPS enabled. Compare total throughput and individual latency distributions. When does MPS help and when does it hurt?
MIG setup: If you have access to an A100 or H100, configure two MIG instances (e.g., 2g.20gb + 2g.20gb). Run memory-bandwidth-bound workloads on each simultaneously. Verify that one instance's bandwidth consumption does not impact the other (MIG isolation guarantee).
Driver stack tracing: Use strace -e ioctl nvidia_app 2>&1 | grep -c NR_ioctl to count ioctl calls for a simple matrix multiplication. Identify the dominant ioctl operations. Compare ioctl count for a plain cuLaunchKernel vs. a CUDA Graph replay.
Preemption latency: Write a CUDA kernel with a configurable spin loop. Launch it from process A, then immediately launch a high-priority kernel from process B. Measure the time from process B's kernel launch to its first instruction executing on the GPU (via nvperf counters). Repeat for different spin loop lengths to characterize preemption latency vs. workload granularity.

References

NVIDIA CUDA Driver Documentation. https://docs.nvidia.com/cuda/cuda-driver-api/
NVIDIA MPS Documentation. https://docs.nvidia.com/deploy/mps/index.html
NVIDIA MIG User Guide. https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
Luk, C., et al. (2009). "A Performance Analysis of GPU-Based Direct Rendering." IEEE Micro.
Tanasic, I., et al. (2014). "Enabling Preemptive Multiprogramming on GPUs." ISCA 2014.
Park, J., et al. (2017). "Chimera: Collaborative Preemption for Multitasking on a Shared GPU." ASPLOS 2015.
NVIDIA Open GPU Kernel Modules. https://github.com/NVIDIA/open-gpu-kernel-modules
AMD ROCm Documentation. https://rocm.docs.amd.com/
Otterness, N., Anderson, J. (2020). "AMD GPU Scheduling Characteristics for Real-Time Systems." RTAS 2020.