Heterogeneous Computing

Overview

Heterogeneous computing refers to systems that use more than one kind of processor or core, each optimized for a different type of workload. Where a homogeneous system runs all code on identical general-purpose CPU cores, a heterogeneous system might route a matrix multiplication to a GPU, a network packet processing task to a DPU, an AI inference request to an NPU, and a data compression task to an FPGA accelerator — all within the same application, potentially on the same host. The driving insight is that a general-purpose CPU is a compromise machine: it executes any instruction reasonably well but excels at none. Specialization delivers orders of magnitude better energy efficiency and throughput for specific workloads.

This transition is already well under way. A modern smartphone SoC (Apple A18, Qualcomm Snapdragon 8 Gen 3) contains big CPU cores, little CPU cores, a GPU, an NPU, an ISP, a DSP, and a security enclave, all on a single die and all managed by the same OS. Cloud infrastructure is following the same path: Nitro-based AWS EC2 instances (including Graviton) offload network and storage processing to Nitro Cards (a DPU implementation); AWS Trainium and Inferentia handle ML workloads; AWS F1 instances expose FPGAs directly to users. The operating system, programming model, and system software must adapt to manage this menagerie of compute resources coherently.

Prerequisites

Understanding of CPU memory hierarchy (L1/L2/L3 cache, DRAM)
Familiarity with PCIe architecture (bus enumeration, BAR registers, DMA)
Basic understanding of virtual memory and page tables
Exposure to parallel programming concepts (threads, SIMD, data parallelism)
Understanding of OS scheduler fundamentals (task states, priority, preemption)

Historical Context

The Single-Core Plateau

From 1970 to ~2004, Moore's Law roughly doubled transistor counts every two years, and processor architects converted those transistors into faster single-thread execution: deeper pipelines, wider out-of-order windows, larger caches. A single application ran faster on each new chip generation without modification.

Around 2004, power density (watts per mm²) hit a wall. Running a single core faster required voltages and clock frequencies that generated more heat than package cooling could dissipate. Intel cancelled its Tejas and Jayhawk projects in 2004, largely because of their power consumption and heat output. The industry pivoted to multi-core: use the transistor budget for more cores at lower clock speeds rather than one very fast core.

Multi-core was still homogeneous — multiple identical general-purpose cores. The heterogeneous shift began when it became apparent that certain workloads could not be efficiently parallelized on general-purpose cores at any count. Image processing, signal processing, neural network inference, and cryptography all have structural properties (massive data parallelism, regular memory access patterns, low branch complexity) that map far better to specialized hardware than to out-of-order superscalar CPUs.

GPU Emergence

Graphics accelerators became common in the 1990s, and NVIDIA popularized the term GPU in 1999. The insight that their massively parallel architecture could accelerate general computation (GPGPU) crystallized around 2006-2007, when NVIDIA released CUDA. A GPU has thousands of simple cores executing in lockstep on wide data vectors, which is exactly the profile needed for matrix multiplication, convolution, and other linear algebra operations at the heart of machine learning. This was not the plan when GPUs were designed; it was a fortuitous architectural match.

Heterogeneous System Topology

Modern Heterogeneous Server Node (2025)
=========================================

                        CPU Package (e.g., AMD EPYC 9654)
                       ┌─────────────────────────────────┐
                       │  Core  Core  Core  Core ... (96)│
                       │  ████  ████  ████  ████         │
                       │                                 │
                       │  L3 Cache (384MB)               │
                       │  Memory Controllers             │
                       │  PCIe 5.0 Root Complex          │
                       └──────┬───────────────┬──────────┘
                              │               │
              ════════════════╪═══════════════╪═════════════ PCIe Fabric
                              │               │
              ┌───────────────┼───────────────┼──────────────────┐
              │               │               │                  │
    ┌─────────▼───┐  ┌────────▼──┐  ┌─────────▼────┐  ┌─────────▼────┐
    │ GPU Cluster │  │  DPU/IPU  │  │   FPGA Card  │  │   NVMe SSDs  │
    │ (H100 x8)   │  │ (BlueField│  │ (AWS F1-like)│  │ (P-state opt)│
    │             │  │  -3 DPU)  │  │              │  │              │
    │ 80GB HBM3   │  │ ARM cores │  │ Reconfigurable│  │              │
    │ NVLink mesh │  │ RDMA engine│  │ logic fabric │  │              │
    └──────┬──────┘  └─────┬─────┘  └──────────────┘  └──────────────┘
           │               │
    NVLink │         SmartNIC offloads:
    mesh   │         - Packet processing
           │         - Storage encryption
    ┌──────▼──────┐  - VM networking (OVS)
    │ GPU-GPU     │  - Key management
    │ Direct Mem  │
    │ (RDMA-like) │
    └─────────────┘

On-chip heterogeneous (Apple M4 Pro SoC):
┌─────────────────────────────────────────────────────┐
│  10x P-cores │  4x E-cores  │  GPU (20 cores)       │
│  (high perf) │  (efficiency)│  (unified mem access)  │
│                                                     │
│  Neural Engine (38 TOPS)  │  Media Engine (encode/  │
│                           │  decode H.265/AV1)      │
│                                                     │
│  Unified Memory Architecture: 48GB LPDDR5X          │
│  CPU and GPU share same physical DRAM, coherent     │
└─────────────────────────────────────────────────────┘

CPU-GPU Collaboration Model

The canonical GPU programming model divides work between a CPU "host" and a GPU "device":

CPU allocates memory, prepares data, and manages application logic
CPU copies data from host DRAM to GPU DRAM over PCIe (or uses unified memory)
CPU launches a "kernel" (GPU program) by submitting it to a command queue
GPU executes the kernel across thousands of parallel threads
CPU waits (synchronously or via callback) for GPU completion
CPU copies results back from GPU DRAM (or reads unified memory)

CPU-GPU Data Flow (Discrete GPU, PCIe)
========================================

CPU DRAM                    PCIe Bus              GPU DRAM
[input tensor]──────────────────────────────►[input tensor copy]
                          12-32 GB/s             │
                                                 │ GPU compute
                                                 │ (2000-4000 GB/s
                                                 │  internal bandwidth)
                                                 ▼
[output tensor]◄──────────────────────────────[output tensor]
                          12-32 GB/s

PCIe copy is often the bottleneck for small kernels.
For large models: GPU-resident data, CPU only sends commands.

CUDA execution model:
  Grid ──► Blocks (up to 2^31-1 x 65535 x 65535 in x, y, z)
  Block ──► Threads (up to 1024 per block)
  Threads in a warp (32) execute in lockstep (SIMT)
  Warp divergence (if/else) serializes branches
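
The grid, block, and warp hierarchy above comes down to a little integer arithmetic when sizing a launch. The following is a minimal Python sketch of that arithmetic, not CUDA code; the function names `launch_geometry` and `global_thread_index` are illustrative, and the 256-thread default is just a common choice. The 1024-thread and 32-lane limits are the ones quoted above.

```python
WARP_SIZE = 32                # threads per warp (from the execution model above)
MAX_THREADS_PER_BLOCK = 1024  # per-block limit (from the execution model above)


def launch_geometry(n_elements: int, threads_per_block: int = 256) -> dict:
    """Return 1-D launch parameters that cover n_elements, one thread per element."""
    if not 0 < threads_per_block <= MAX_THREADS_PER_BLOCK:
        raise ValueError("threads_per_block must be in 1..1024")
    if n_elements <= 0:
        raise ValueError("n_elements must be positive")
    # Ceiling division: the last block may be only partly used.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    total_threads = blocks * threads_per_block
    return {
        "blocks": blocks,
        "warps_per_block": warps_per_block,
        "total_threads": total_threads,
        # Threads past the end of the data; a kernel masks them with a bounds check.
        "idle_threads": total_threads - n_elements,
    }


def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """Python equivalent of blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


if __name__ == "__main__":
    print(launch_geometry(1000, 256))
    print(global_thread_index(3, 256, 231))
```

For 1000 elements and 256 threads per block this yields 4 blocks and 24 idle threads, which is why kernels usually start with an `if (i < n)` guard.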

Unified Memory and NVLink

NVIDIA NVLink (and the Apple Unified Memory Architecture) eliminate or reduce the PCIe copy overhead:

NVLink: High-bandwidth GPU-GPU interconnect (900 GB/s total bidirectional bandwidth per H100 with NVLink 4). Also enables GPU-to-CPU coherent access via NVLink-C2C (in the Grace Hopper superchip, 900 GB/s CPU-GPU).
Apple UMA: CPU and GPU share a single LPDDR5X pool. A GPU kernel can read tensor data that the CPU just wrote without a copy. This eliminates the PCIe bottleneck entirely for Apple Silicon workloads.
CUDA Unified Memory (cudaMallocManaged): Presents a single virtual address space; the runtime migrates pages between CPU and GPU DRAM on demand. Useful for development; production code typically uses explicit management for predictability.
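
Whether a copy over the interconnect is worth paying for can be estimated with a simple cost model. The sketch below is illustrative only: `transfer_time_s` and `offload_pays_off` are hypothetical helper names, the 10 µs default setup latency is an assumed placeholder, and real transfers also depend on pinned memory, overlap with compute, and link direction.

```python
def transfer_time_s(n_bytes: int, bandwidth_gb_s: float,
                    setup_latency_s: float = 10e-6) -> float:
    """Fixed setup latency plus size / bandwidth (1 GB = 1e9 bytes).

    The default latency is an assumption for illustration, not a measured value.
    """
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return setup_latency_s + n_bytes / (bandwidth_gb_s * 1e9)


def offload_pays_off(bytes_in: int, bytes_out: int, cpu_time_s: float,
                     gpu_time_s: float, bandwidth_gb_s: float) -> bool:
    """True if copy-in + GPU compute + copy-out beats running on the CPU."""
    total = (transfer_time_s(bytes_in, bandwidth_gb_s)
             + gpu_time_s
             + transfer_time_s(bytes_out, bandwidth_gb_s))
    return total < cpu_time_s


if __name__ == "__main__":
    one_gb = 1_000_000_000
    for name, bw in [("PCIe (32 GB/s)", 32.0), ("NVLink-C2C (450 GB/s per direction)", 450.0)]:
        print(f"{name}: {transfer_time_s(one_gb, bw) * 1e3:.2f} ms per GB")
```

With unified physical memory (the Apple UMA case) both transfer terms drop out, so the decision reduces to comparing compute times.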

DPU/IPU: Data Processing Units

A Data Processing Unit (DPU), also called Infrastructure Processing Unit (IPU), is a specialized processor that offloads infrastructure tasks from the CPU: network processing, storage virtualization, security (encryption, key management), and hypervisor overhead.

AWS Nitro Architecture

AWS's Nitro system is among the most widely deployed DPU-style architectures in the world. The Nitro Card handles:
- All VPC networking (security groups, routing, NAT), implemented in custom ASICs
- EBS (Elastic Block Store) I/O: remote block storage over encrypted connections, exposed to the instance as NVMe devices
- Instance monitoring and hypervisor management

The guest OS sees ordinary ENA network devices and NVMe block devices; the cloud infrastructure logic runs on the Nitro Cards' custom silicon, isolated from the guest, while a lightweight KVM-based hypervisor handles only CPU and memory partitioning. This means Nitro instances can dedicate close to 100% of CPU cores to customer workloads: the hypervisor overhead that consumed an estimated 10-30% of CPU on earlier EC2 generations is largely gone.

NVIDIA BlueField DPU

The BlueField-3 DPU combines:
- 16 Arm Cortex-A78 cores (running a full Linux independently of the host)
- ConnectX-7 400Gbps SmartNIC
- NVMe storage controller
- Security processors (hardware AES-256, SHA, PKI)
- DOCA software framework for programming the DPU

A BlueField-3 in a server means that Open vSwitch (OVS) packet processing, IPsec encryption, and SR-IOV management run on the DPU, on its Arm cores and hardware offload engines, so the x86 host CPUs see only the results.

NPU: Neural Processing Units

Neural Processing Units are accelerators specialized for the core operations of neural network inference: matrix and vector multiplication, convolution, and activation functions such as ReLU and softmax. They sacrifice general programmability for better energy efficiency (TOPS, or tera-operations per second, per watt) than a GPU, often by an order of magnitude on the quantized workloads they target.

NPU Architecture Pattern
==========================

Systolic Array (core computation engine):
┌──┬──┬──┬──┐
│PE│PE│PE│PE│  Each Processing Element (PE) performs
├──┼──┼──┼──┤  MAC (multiply-accumulate) operations.
│PE│PE│PE│PE│  Weights flow in from one direction,
├──┼──┼──┼──┤  activations flow through,
│PE│PE│PE│PE│  partial sums accumulate along the other.
├──┼──┼──┼──┤  
│PE│PE│PE│PE│  256x256 systolic array = 65536 MACs/cycle
└──┴──┴──┴──┘  Google TPUv1: 256x256 INT8, 92 TOPS

Examples:
- Apple Neural Engine (A18): 35 TOPS, within a smartphone power budget
- Google TPU v5p: 459 TFLOPS BF16, datacenter-scale
- Qualcomm Hexagon NPU: used for on-device LLM inference
- Intel Movidius / Myriad X: edge inference NPU
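
The systolic dataflow and the 92 TOPS figure above can both be checked with a short simulation. This is a hedged sketch rather than a model of any real NPU: for simplicity it uses an output-stationary variant (each PE keeps one output element) instead of the weight-stationary flow in the diagram, and `systolic_matmul` and `peak_tops` are illustrative names. The 700 MHz clock is the TPUv1 figure reported by Jouppi et al. (2017).

```python
def systolic_matmul(a, b):
    """Cycle-by-cycle output-stationary systolic multiply, returning (C, cycles).

    PE (i, j) accumulates C[i][j]. Row i of A enters from the left delayed by
    i cycles and column j of B enters from the top delayed by j cycles, so
    A[i][k] and B[k][j] meet in PE (i, j) at cycle t = i + j + k.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    if any(len(row) != k_dim for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0] * n for _ in range(m)]
    cycles = m + n + k_dim - 2  # last operands meet at t = (m-1)+(n-1)+(k_dim-1)
    for t in range(cycles):
        for i in range(m):
            for j in range(n):
                k = t - i - j
                if 0 <= k < k_dim:
                    c[i][j] += a[i][k] * b[k][j]  # one MAC in this PE this cycle
    return c, cycles


def peak_tops(rows: int, cols: int, clock_hz: float) -> float:
    """Peak tera-operations per second, counting each MAC as two operations."""
    return rows * cols * 2 * clock_hz / 1e12


if __name__ == "__main__":
    product, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, cycles)
    print(round(peak_tops(256, 256, 700e6), 2))
```

The skewed arrival times are the point of the design: every PE does useful work on most cycles without any PE fetching operands from memory itself.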

NPUs typically use quantized arithmetic (INT8, INT4, even INT2 for weights) to pack more operations per cycle and reduce memory bandwidth requirements. The tradeoff is reduced precision, which is usually acceptable for inference (the model is already trained) but not for training, which typically needs FP32, BF16, or FP16 with loss scaling to keep gradients accurate.
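
To make the precision tradeoff concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is one common scheme chosen for illustration, not the method of any particular NPU toolchain; `quantize_int8` and `dequantize` are hypothetical helper names.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: value ~= scale * q with q in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate real values."""
    return [scale * x for x in q]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(codes, scale, worst)
```

Each weight now occupies one byte instead of four, and the rounding error per value is bounded by half the scale, which is why a single outlier weight (which inflates the scale) hurts accuracy for the whole tensor.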

FPGA in Cloud Computing

FPGAs (Field-Programmable Gate Arrays) are reconfigurable silicon: a sea of logic elements (LUTs, flip-flops, DSPs, BRAMs) that can be wired together after manufacture, by loading a bitstream, to implement any digital circuit. Unlike CPUs or GPUs, FPGAs can implement custom pipelines with no instruction fetch or dispatch overhead: the computation is laid out directly as hardware logic and registers.

AWS F1 and Microsoft Catapult

AWS F1 instances: Expose Xilinx UltraScale+ FPGAs via PCIe to EC2 instances. Used for genomics (FPGA-accelerated BWA-MEM), financial trading (custom order matching logic), and video transcoding.
Microsoft Catapult: FPGAs deployed at scale across Bing and Azure servers. Initially used for Bing search ranking (feature computation and machine-learned scoring in hardware); evolved to accelerate Azure networking (SmartNIC function) and DNN inference (Project Brainwave).

The key advantage of FPGAs for specific workloads: a custom FPGA pipeline can stream data at close to memory or network line rate with deterministic, very low latency, which is hard to match on a CPU or GPU that must fetch and dispatch instructions.

OS Challenges for Heterogeneous Systems

Unified Memory Management

The fundamental OS problem is that each compute device has its own memory with its own address space, access permissions, and coherency domain. A pointer valid on the CPU may be invalid on the GPU; memory allocated for the GPU may be inaccessible from the CPU.

Modern approaches:
- Heterogeneous Memory Management (HMM): Linux kernel feature (merged in 4.14) that lets a device driver mirror the CPU's page tables on the device. The GPU can page-fault and have pages migrated on demand. NVIDIA's open GPU kernel modules use HMM for unified memory on systems without hardware address translation.
- Shared Virtual Memory (SVM): PCIe ATS, PRI, and PASID (and CXL) allow accelerators to use the same virtual addresses as the CPU, with the IOMMU translating those addresses for DMA.
- Device memory in kernel mm: Linux's ZONE_DEVICE infrastructure gives device-local memory such as GPU HBM struct pages so the kernel can migrate data between CPU DRAM and device memory; on cache-coherent systems the device memory can also appear as a NUMA node and take part in memory tiering.

Heterogeneity-Aware Scheduling

Linux's scheduler was designed for homogeneous cores. ARM's big.LITTLE (and Intel's Performance/Efficiency hybrid architecture) introduced heterogeneous cores to the mainstream. Intel hybrid parts rely on hardware feedback (Thread Director) to guide placement. On Arm systems, the Energy-Aware Scheduling (EAS) framework in Linux handles this:

Linux EAS for big.LITTLE
==========================

System: 4x Cortex-A76 (big) + 4x Cortex-A55 (LITTLE)

EAS decisions:
1. Task wakeup: place task on core with sufficient capacity
   to complete work before next wakeup
2. Load balance: migrate tasks to minimize energy consumption
   (EM energy model predicts cost)
3. Frequency scaling: raise frequency only as needed (schedutil governor)

Energy Model (EM) per OPP (Operating Performance Point):
  big core @ 2.8GHz: 1800mW, capacity 1024
  big core @ 1.8GHz: 900mW, capacity 660
  LITTLE @ 1.8GHz: 200mW, capacity 380
  LITTLE @ 1.2GHz: 100mW, capacity 250

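A task with utilization 300 fits on LITTLE @ 1.8GHz (capacity 380) at 200mW.
On a big core it would need big @ 1.8GHz (900mW) or big @ 2.8GHz (1800mW),
which is 4.5x to 9x the active power. EAS compares estimated energy
(power x utilization / capacity), where the gap is roughly 2.6x to 3.3x.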
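
The placement decision above can be reproduced with a few lines of arithmetic. This is a simplified sketch of the idea behind the energy model, not the kernel's implementation (which also applies capacity margins and sums utilization across a performance domain); the OPP numbers are the ones in the table above, and `lowest_sufficient_opp`, `estimated_energy`, and `place_task` are illustrative names.

```python
# (cluster, freq_ghz, power_mw, capacity) copied from the EM table above
OPPS = [
    ("LITTLE", 1.2, 100, 250),
    ("LITTLE", 1.8, 200, 380),
    ("big", 1.8, 900, 660),
    ("big", 2.8, 1800, 1024),
]


def lowest_sufficient_opp(cluster, util):
    """Lowest-capacity OPP of a cluster that still fits util, or None."""
    fits = [opp for opp in OPPS if opp[0] == cluster and opp[3] >= util]
    return min(fits, key=lambda opp: opp[3]) if fits else None


def estimated_energy(opp, util):
    """Power at the OPP scaled by the busy fraction util / capacity."""
    _, _, power_mw, capacity = opp
    return power_mw * util / capacity


def place_task(util):
    """Pick the cluster whose cheapest sufficient OPP has the lowest estimate."""
    candidates = []
    for cluster in ("LITTLE", "big"):
        opp = lowest_sufficient_opp(cluster, util)
        if opp is not None:
            candidates.append((estimated_energy(opp, util), opp))
    if not candidates:
        raise ValueError("no OPP has enough capacity for this task")
    return min(candidates)[1]


if __name__ == "__main__":
    for util in (200, 300, 500):
        opp = place_task(util)
        print(util, opp[0], opp[1], round(estimated_energy(opp, util), 1))
```

For utilization 300 the estimate is about 158 on LITTLE @ 1.8GHz versus about 409 on big @ 1.8GHz, so the task stays on the LITTLE cluster; at utilization 500 no LITTLE OPP fits and the task moves to a big core.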

For GPU and NPU scheduling, the OS typically delegates to userspace runtimes (CUDA, Metal, CoreML) that manage device queues. Kernel involvement is primarily in:
- IOMMU mapping and unmapping for DMA isolation
- Interrupt handling for completion notifications
- Memory reclaim when device memory pressure requires eviction

Heterogeneous Task Graph Scheduling

ML frameworks (PyTorch, TensorFlow) and media pipelines must schedule operations across CPUs, GPUs, and NPUs in a dependency-respecting order. This is the "task graph scheduling" problem:

Task Graph Example (LLM Inference Pipeline)
=============================================

[Tokenize input]  (CPU)
        │
        ▼
[Embedding lookup] (GPU/NPU)
        │
        ▼
[Attention layers] (GPU — tensor parallel across N GPUs)
        ├─ Head 0 ──► GPU0
        ├─ Head 1 ──► GPU1
        └─ Head N ──► GPUN
                │
                ▼ (all-reduce synchronization via NVLink)
[FFN layers] (GPU)
        │
        ▼
[Detokenize output] (CPU)

Tools such as CUDA Graphs, TensorRT, and XLA capture or compile these graphs into optimized execution plans with pre-allocated memory and pre-recorded command streams to reduce per-inference overhead.
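
The dependency-respecting order in the pipeline diagram is a topological sort. The sketch below uses Python's standard `graphlib` module to group the tasks into waves that could run concurrently; the task names and device labels are illustrative stand-ins for the diagram's stages (with two attention heads), and real runtimes add memory planning and stream assignment on top of this ordering.

```python
from graphlib import TopologicalSorter

# task -> (device, prerequisites); names mirror the diagram and are illustrative
PIPELINE = {
    "tokenize": ("CPU", set()),
    "embedding": ("GPU", {"tokenize"}),
    "attn_head_0": ("GPU0", {"embedding"}),
    "attn_head_1": ("GPU1", {"embedding"}),
    "all_reduce": ("NVLink", {"attn_head_0", "attn_head_1"}),
    "ffn": ("GPU", {"all_reduce"}),
    "detokenize": ("CPU", {"ffn"}),
}


def schedule_waves(pipeline):
    """Group tasks into waves; tasks within a wave do not depend on each other."""
    sorter = TopologicalSorter({task: deps for task, (_, deps) in pipeline.items()})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append([(task, pipeline[task][0]) for task in ready])
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(schedule_waves(PIPELINE)):
        print(number, wave)
```

The two attention heads land in the same wave, which is the parallelism the tensor-parallel step exploits; every other stage forms a wave of its own.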

CXL: Compute Express Link

CXL (Compute Express Link) is an open interconnect standard (based on PCIe physical layer) that provides cache-coherent shared memory between CPUs and accelerators. It was developed by Intel and is now governed by the CXL Consortium.

CXL Memory Sharing vs PCIe Copy
=================================

Traditional PCIe (non-coherent):
CPU DRAM ──copy──► PCIe DMA ──copy──► Accelerator DRAM
Latency: 1-10µs for DMA setup + transfer
CPU cannot see accelerator writes until explicit copy back

CXL.mem (Type 3 memory expander):
CPU ◄────────────────────► Device-attached Memory
    Coherent shared access     (Type 3 CXL expander)
    ~100-200ns latency
    CPU load/store instructions work directly
    Type 3 uses CXL.io + CXL.mem; the host keeps its own caches coherent
    Accelerators with their own memory are Type 2 and add CXL.cache

CXL Use Cases:
- CXL memory expanders: cheap DRAM capacity without adding DIMMs
- GPU/NPU with CXL: zero-copy access to CPU memory and vice versa
- Persistent memory (over CXL.mem): PMDK workloads
- Disaggregated memory pools: multiple hosts share a CXL memory pool

CXL 3.0 (2022) adds peer-to-peer access between devices and multi-level switching for fabric-scale disaggregated memory.

Programming Models: The Fragmented Ecosystem

Heterogeneous Programming Model Comparison
===========================================

Model    Vendor      Targets               Portability  Maturity
-------  ----------  --------------------  -----------  -----------------
CUDA     NVIDIA      NVIDIA GPUs only      None         Excellent
HIP      AMD         AMD + NVIDIA GPUs     CUDA subset  Good
OpenCL   Khronos     GPU, FPGA, CPU        High         Good, declining
SYCL     Khronos     Intel + AMD + NVIDIA  High         Growing (oneAPI)
oneAPI   Intel       Intel GPU/CPU/FPGA    Intel-focus  Growing
Metal    Apple       Apple GPU             Apple only   Excellent
OpenACC  PGI/NVIDIA  GPU (C/Fortran)       Medium       HPC-focused
Vulkan   Khronos     GPU compute + gfx     High         Good for graphics
ROCm     AMD         AMD GPU ecosystem     AMD-focus    Growing

The fragmentation is severe: CUDA code does not run on AMD GPUs without porting.
ML frameworks (PyTorch, TensorFlow) absorb the fragmentation by
providing a unified frontend that targets device-specific backends.

Debugging Notes

# NVIDIA GPU debugging
nvidia-smi                          # GPU utilization, memory, processes
nvidia-smi topo -m                  # NVLink/PCIe topology
nvtop                               # Interactive GPU monitor
compute-sanitizer ./application     # Memory error checker (replaces cuda-memcheck)
ncu --set full ./application        # NVIDIA Nsight Compute profiler

# AMD GPU debugging
rocm-smi                            # AMD equivalent of nvidia-smi
rocprof --stats ./application       # ROCm profiler

# PCIe bandwidth test (for accelerator bottleneck diagnosis)
./bandwidthTest --device=0          # CUDA samples bandwidth test (HtoD, DtoH, DtoD)

# CXL debugging (Linux 6.x+)
ls /sys/bus/cxl/devices/
cat /sys/bus/cxl/devices/mem0/ram/size

# IOMMU / DMA debugging
dmesg | grep -i iommu
ls -l /sys/kernel/iommu_groups/*/devices/   # entries are symlinks to PCI devices

# Check device topology
lspci -tv                           # PCIe tree
lstopo --of txt                     # hwloc topology (NUMA + PCIe + GPU)

Security Implications

GPU memory isolation: If device memory is not scrubbed between tenants and DMA is not confined by the IOMMU, one VM's GPU allocations may be readable by another VM on the same host after deallocation. NVIDIA's MIG (Multi-Instance GPU) and SR-IOV based GPU virtualization (as used by AMD) address this for cloud multi-tenancy.
DPU privilege: A DPU with access to the host's PCIe bus can potentially DMA into arbitrary host memory if the IOMMU is misconfigured. The DPU has more privilege than the host OS in some architectures — its firmware must be trusted and secured.
Accelerator side channels: GPU memory access patterns leak information about computations. Timing attacks against AES or RSA implementations on GPUs are a demonstrated threat. NPUs may similarly leak model weights through power or timing side channels.
Supply chain: Custom FPGA bitstreams and NPU model files must be signed and verified; a malicious bitstream loaded into a cloud FPGA could attack the host's PCIe bus.

Performance Implications

Heterogeneous systems require careful workload placement: sending a small vector addition to a GPU wastes time on PCIe transfer overhead that exceeds the computation time.
The "roofline model" analysis determines whether a kernel is compute-bound or memory-bandwidth-bound, which guides whether CPU or GPU is the right target.
NVLink vs PCIe bandwidth differences are 10-70x; architectures should minimize cross-PCIe data movement.
For ML inference, NPUs offer 5-20x better energy efficiency than GPUs for INT8 workloads, but GPU flexibility is needed for experimental models.
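
The roofline analysis mentioned above fits in a few lines. In this sketch the device figures in the usage example (60,000 GFLOPS peak and 3,000 GB/s memory bandwidth) are assumed round numbers for illustration, not the specification of any product, and the intensity formulas assume ideal data reuse; the function names are hypothetical.

```python
def attainable_gflops(intensity: float, peak_gflops: float, mem_bw_gb_s: float) -> float:
    """Roofline: performance is capped by compute or by bandwidth * intensity."""
    return min(peak_gflops, intensity * mem_bw_gb_s)


def classify(intensity: float, peak_gflops: float, mem_bw_gb_s: float) -> str:
    """Compare a kernel's FLOP/byte against the ridge point peak / bandwidth."""
    ridge = peak_gflops / mem_bw_gb_s
    return "compute-bound" if intensity >= ridge else "memory-bound"


def vector_add_intensity(bytes_per_element: int = 4) -> float:
    """c[i] = a[i] + b[i]: one FLOP per two loads and one store."""
    return 1 / (3 * bytes_per_element)


def matmul_intensity(n: int, bytes_per_element: int = 4) -> float:
    """n x n matmul: 2*n^3 FLOPs over 3*n^2 elements moved (ideal reuse)."""
    return (2 * n ** 3) / (3 * n * n * bytes_per_element)


if __name__ == "__main__":
    peak, bandwidth = 60_000.0, 3_000.0  # assumed illustrative device
    for name, ai in [("vector add", vector_add_intensity()),
                     ("matmul 4096", matmul_intensity(4096))]:
        print(name, round(ai, 3), classify(ai, peak, bandwidth),
              round(attainable_gflops(ai, peak, bandwidth), 1))
```

Vector addition sits far below the ridge point and is limited by memory bandwidth on any device, which is why shipping it across PCIe rarely pays off, while a large matrix multiplication sits well above the ridge and can use the accelerator's full compute rate.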

Failure Modes

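CUDA out-of-memory: GPU DRAM is smaller than CPU DRAM (80GB HBM vs terabytes of system RAM). When a large model or batch size exceeds it, the allocation fails (cudaErrorMemoryAllocation, surfaced by frameworks as an out-of-memory exception); unless the application handles the error, the job aborts rather than spilling gracefully to host memory.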
PCIe bandwidth exhaustion: Multiple GPUs sharing one PCIe root complex compete for bandwidth; a training job that exceeds PCIe bandwidth sees GPU utilization drop dramatically.
IOMMU DMA errors: A driver bug causes a DMA to an invalid address; the IOMMU raises a fault, the device stops, and the kernel logs IOMMU fault messages. Recovery requires device reset.
CXL initialization failure: CXL devices require both kernel and firmware support; mismatched CXL protocol versions between host and device cause the link to fall back to PCIe-only mode, losing coherency.
Thermal throttling: GPU and NPU produce extreme heat density; insufficient cooling causes clock throttling that degrades throughput by 30-70% under sustained load.
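
The out-of-memory case above can often be predicted before launch with back-of-the-envelope arithmetic. The sketch below only counts weight storage; the 20% reserve for activations, caches, and workspace is an assumed placeholder, and `weights_gb` and `fits_on_device` are hypothetical helper names. The 80GB default matches the HBM capacity quoted above.

```python
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weights_gb(n_params: float, dtype: str) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def fits_on_device(n_params: float, dtype: str, device_gb: float = 80.0,
                   reserve_fraction: float = 0.2) -> bool:
    """True if the weights fit after reserving a fraction for everything else.

    reserve_fraction is an assumption for illustration; real headroom depends
    on batch size, sequence length, and the runtime's workspace needs.
    """
    return weights_gb(n_params, dtype) <= device_gb * (1 - reserve_fraction)


if __name__ == "__main__":
    for params, dtype in [(7e9, "bf16"), (70e9, "bf16"), (70e9, "int8")]:
        print(f"{params:.0e} params in {dtype}: {weights_gb(params, dtype):.0f} GB,",
              "fits" if fits_on_device(params, dtype) else "does not fit")
```

A 70-billion-parameter model in BF16 needs about 140 GB for weights alone, so it has to be sharded across devices or quantized before it can run on a single 80GB GPU.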

Modern Usage (2025)

ML training: Large model training runs on multi-accelerator clusters: models such as GPT-4 and LLaMA on NVIDIA GPUs with NVLink intra-node and InfiniBand or Ethernet inter-node, and Gemini on Google TPU pods
On-device AI: Every flagship smartphone has a dedicated NPU running real-time AI features (image enhancement, voice processing, autocomplete)
Cloud CPU offload: AWS Nitro, Google Titanium, and Azure Boost represent large, multi-year infrastructure investments in DPU-style offload
Video transcoding: YouTube runs much of its transcoding on custom ASICs, and GPU- and FPGA-based hardware encoders are widely used elsewhere, alongside CPU software encoders where compression quality matters most
Genomics: FPGA-accelerated sequence analysis (Illumina DRAGEN) is widely used in clinical genomics

Future Directions

Disaggregated computing: Composable infrastructure (CXL fabric, optical interconnects) allows CPU, memory, GPU, and storage to be dynamically composed per workload rather than fixed in a single chassis.
Wafer-scale integration: Cerebras WSE-3 integrates 900,000 AI cores and 44GB SRAM on a single 300mm wafer — the extreme end of heterogeneous specialization.
Unified programming models: SYCL 2020 and oneAPI are gaining traction; the industry is slowly converging on portable heterogeneous programming.
Chiplet disaggregation: AMD's 3D V-Cache and Intel's Foveros allow mixing compute and memory chiplets from different process nodes on the same package — extending heterogeneity to the die level.
Neuromorphic computing: Intel Loihi 2 and IBM TrueNorth represent a different axis of heterogeneity: event-driven spiking neural networks that run certain AI workloads at milliwatt-scale power.

Exercises

Write a matrix multiplication kernel, compile it with nvcc, and profile it with ncu (Nsight Compute). Vary the matrix size from 64x64 to 8192x8192. Plot throughput (GFLOPS) and identify where the transition from latency-bound to throughput-bound occurs.
Using PyTorch, implement a simple inference pipeline that explicitly places specific layers on CPU and other layers on GPU. Measure the data transfer overhead using torch.cuda.Event timing. At what model size does GPU acceleration become net positive?
Research the AWS Nitro architecture documentation. Draw a diagram showing how a network packet from the external internet reaches a guest VM's application, identifying which components are handled on the Nitro Card vs the host CPU.
Read the CXL 3.0 specification overview. Explain the difference between CXL.io, CXL.cache, and CXL.mem sub-protocols. What hardware is needed on both ends of a CXL link to support cache coherency?
On a Linux system with a discrete GPU, use lstopo to visualize the NUMA topology including PCIe. Identify which CPU NUMA node the GPU is closest to. Write a CUDA/HIP program that benchmarks host-to-device bandwidth when pinning CPU allocations to the nearest vs furthest NUMA node.

References

Hennessy, J. L. & Patterson, D. A. (2019). A New Golden Age for Computer Architecture. CACM 62(2).
Abts, D. & Kim, J. (2019). A Billion Transistors for the People. IEEE Micro.
Jia, Z. et al. (2018). Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking. arXiv.
AWS re:Invent. (2017-2023). The Nitro Project: Next-Generation EC2 Infrastructure. (multiple years)
CXL Consortium. (2022). Compute Express Link Specification Rev 3.0. https://www.computeexpresslink.org/
NVIDIA. (2023). NVIDIA H100 Tensor Core GPU Architecture Whitepaper.
Intel. (2022). oneAPI Programming Guide. https://www.intel.com/content/www/us/en/developer/tools/oneapi/
Linux kernel heterogeneous memory management: Documentation/mm/hmm.rst
Jouppi, N. et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit. ISCA.
Markidis, S. et al. (2018). NVIDIA Tensor Core Programmability, Performance & Precision. IEEE IPDPS.