05 — DMA in Drivers
Technical Overview
Direct Memory Access (DMA) is the mechanism by which hardware devices read from or write to system memory without CPU involvement. Without DMA, every network packet received would require the CPU to copy it byte by byte from the NIC's internal buffer to system RAM — at 10Gbps, this would consume an entire CPU core just for memory copying. DMA transfers this work to a dedicated DMA engine on the device or the memory controller, freeing the CPU for computation.
From a driver writer's perspective, DMA is one of the most complex kernel APIs because it sits at the intersection of virtual memory, physical memory, cache coherence, and device addressing. The kernel's DMA API abstracts the differences between CPU architectures (x86's hardware coherence vs ARM's manual cache management) and system configurations (32-bit devices on 64-bit systems, IOMMU presence vs absence), providing a portable interface that handles the details.
Prerequisites
- Virtual vs physical memory addressing (MMU operation)
- Cache hierarchy and coherence concepts
- PCIe architecture basics (06-pcie-driver-model.md)
- Linux kernel memory allocation (kmalloc, vmalloc, pages)
The DMA Problem
A device needs to read or write data in RAM. The device's DMA engine works with physical addresses (or IOMMU-translated device-virtual addresses). The driver works with kernel virtual addresses (KVAs). The two are not the same:
System RAM (physical)
Physical: 0x0000000000000000 ─── 0xFFFFFFFFFFFFFFFF
│
[MMU translation]
│
Kernel Virtual: 0xFFFF880000000000 ─── 0xFFFFFFFFFFFFFFFF (x86-64)
(kernel direct-map: phys + PAGE_OFFSET)
Device DMA address:
├── Without IOMMU: same as physical address
└── With IOMMU: device-virtual (translated by IOMMU to physical)
The DMA API provides functions that return dma_addr_t — the address to program into the device's DMA registers. The driver must not use dma_addr_t as a CPU pointer; it is opaque and may be translated by an IOMMU.
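The direct-map relationship in the diagram can be sketched in plain userspace C. This is only a model: the `toy_` names are invented for illustration, and the `PAGE_OFFSET` value is simply the one shown in the diagram, which varies between kernel versions and configurations.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed direct-map base, copied from the diagram above (x86-64). */
#define TOY_PAGE_OFFSET 0xFFFF880000000000ULL

typedef uint64_t toy_phys_t;  /* physical address */
typedef uint64_t toy_kva_t;   /* kernel virtual address in the direct map */

/* Direct map: kernel virtual = physical + PAGE_OFFSET. */
static toy_kva_t toy_phys_to_virt(toy_phys_t phys)
{
    return phys + TOY_PAGE_OFFSET;
}

/* Inverse mapping; only meaningful for addresses inside the direct map. */
static toy_phys_t toy_virt_to_phys(toy_kva_t kva)
{
    assert(kva >= TOY_PAGE_OFFSET);
    return kva - TOY_PAGE_OFFSET;
}
```

Even in the simplest case (no IOMMU, so the DMA address equals the physical address), the value handed to the device differs from the CPU pointer by the direct-map offset, which is one reason a dma_addr_t must never be dereferenced.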
DMA Buffer Lifetime Diagram
COHERENT DMA (dma_alloc_coherent):
┌─────────────────────────────────────────────────────────┐
│ Driver: dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL)
│ Returns: cpu_addr (kernel virtual) + dma_handle (device address)
│ │
│ ┌─────────────┐ ┌──────────────────────────┐ │
│ │ CPU │ │ Device DMA Engine │ │
│ │ writes │◄───────►│ reads/writes │ │
│ │ via cpu_addr│ always │ via dma_handle │ │
│ └─────────────┘ coherent└──────────────────────────┘ │
│ │ Physical RAM: always consistent │
│ Driver: dma_free_coherent(dev, size, cpu_addr, dma_handle)
└─────────────────────────────────────────────────────────┘
STREAMING DMA (dma_map_single):
┌─────────────────────────────────────────────────────────┐
│ Driver has buffer: kbuf (kernel virtual) │
│ │
│ dma_map_single(dev, kbuf, size, DMA_TO_DEVICE) │
│ 1. On ARM: flush CPU caches for this range │
│ 2. Returns dma_handle for device │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ OWNERSHIP TRANSFERRED TO DEVICE │ │
│ │ CPU must NOT touch kbuf until unmapped! │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Program device: dev->dma_reg = dma_handle │
│ Trigger DMA transfer │
│ Wait for completion (interrupt) │
│ │
│ dma_unmap_single(dev, dma_handle, size, DMA_TO_DEVICE) │
│ 1. On ARM: invalidate CPU caches for the range when │
│ the direction is DMA_FROM_DEVICE (not needed here) │
│ 2. Ownership returns to CPU │
│ │
│ CPU may access kbuf again │
└─────────────────────────────────────────────────────────┘
Coherent vs Streaming DMA
Coherent (Consistent) DMA: dma_alloc_coherent
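Coherent DMA allocates a buffer that stays consistent between the CPU and the device: CPU writes become visible to the device, and DMA writes by the device become visible to the CPU, without explicit cache maintenance. Ordering still matters, so a driver may need a write barrier (wmb(), or the barrier implied by writel()) between filling a descriptor and telling the device about it.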
/* Allocate coherent DMA buffer.
 * dev is the struct device *; priv is hypothetical driver state whose
 * regs member holds the ioremapped register base. */
dma_addr_t dma_handle;
void *cpu_addr;
struct my_descriptor *desc;
size_t size = PAGE_SIZE;
cpu_addr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);
if (!cpu_addr) {
    dev_err(dev, "coherent DMA allocation failed\n");
    return -ENOMEM;
}
/* Program device with DMA address */
writel(lower_32_bits(dma_handle), priv->regs + DMA_ADDR_LO);
writel(upper_32_bits(dma_handle), priv->regs + DMA_ADDR_HI);
/* Use cpu_addr to access the buffer from CPU side */
desc = cpu_addr;
desc->command = CMD_READ;
desc->length = 512;
/* Device can now DMA-write results here; CPU reads them directly */
/* Cleanup */
dma_free_coherent(dev, size, cpu_addr, dma_handle);
On x86, coherent memory is ordinary cacheable memory: the hardware snoops the CPU caches on device accesses, so DMA stays coherent automatically. On ARM systems without hardware I/O coherence, the coherent buffer is instead mapped uncached (Normal Non-cacheable on arm64, not Device memory), which keeps it consistent but makes every CPU access slow. This is why coherent DMA is best reserved for small control structures such as descriptor rings, not bulk data.
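The coherent example above programs a 64-bit handle into two 32-bit registers. A minimal userspace sketch of that split, and of what the device would reconstruct, might look like this; the `toy_` helpers are stand-ins written for this example, not the kernel's lower_32_bits()/upper_32_bits() macros themselves.

```c
#include <stdint.h>

/* Low half of a 64-bit DMA address (what goes into the *_LO register). */
static uint32_t toy_lower_32(uint64_t dma_addr)
{
    return (uint32_t)(dma_addr & 0xFFFFFFFFu);
}

/* High half of a 64-bit DMA address (what goes into the *_HI register). */
static uint32_t toy_upper_32(uint64_t dma_addr)
{
    return (uint32_t)(dma_addr >> 32);
}

/* What the device reassembles from its two 32-bit address registers. */
static uint64_t toy_join(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

Writing only the low half silently truncates any handle above 4 GiB, which is why a device with a single 32-bit address register needs a 32-bit DMA mask.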
Streaming DMA: dma_map_single / dma_map_sg
Streaming DMA maps an existing buffer for a single DMA operation. The buffer remains cacheable, but the driver must transfer "ownership" explicitly:
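/* Map existing buffer for DMA-to-device (device reads from here).
 * kbuf must come from kmalloc() or the page allocator, not the stack or
 * vmalloc(). priv is hypothetical driver state (regs, dma_done,
 * transfer_complete). */
dma_addr_t dma_handle = dma_map_single(dev,
                                       kbuf,           /* kernel virtual addr */
                                       size,
                                       DMA_TO_DEVICE); /* direction */
if (dma_mapping_error(dev, dma_handle)) {
    dev_err(dev, "DMA mapping failed\n");
    return -EIO;
}
/* Program device, trigger transfer (this device takes a 32-bit address) */
writel(lower_32_bits(dma_handle), priv->regs + DMA_SRC_ADDR);
writel(size, priv->regs + DMA_LENGTH);
writel(DMA_START, priv->regs + DMA_CONTROL);
/* Wait for completion (usually via interrupt); 0 means the wait timed out.
 * A real driver must stop the DMA engine before unmapping after a timeout. */
if (!wait_event_timeout(priv->dma_done, priv->transfer_complete, HZ))
    dev_err(dev, "DMA transfer timed out\n");
/* MUST unmap before CPU can access the buffer again */
dma_unmap_single(dev, dma_handle, size, DMA_TO_DEVICE);
/* Now safe to read/free kbuf */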
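The ownership rule is critical: between dma_map_* and dma_unmap_*, the CPU must not read or write the mapped buffer unless it first reclaims it with dma_sync_single_for_cpu and hands it back with dma_sync_single_for_device. On hardware-coherent systems such as x86 a violation often goes unnoticed, but it is still a bug: when a bounce buffer is in use, the device and the CPU are working on different copies of the data. On ARM without hardware coherence, the CPU would see stale cached data, or the device would miss data still sitting in the cache.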
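The ownership rule can be pictured as a small state machine. The sketch below is a userspace model with invented `toy_` names; the comments name the real kernel calls each transition stands for (dma_sync_single_for_cpu() and dma_sync_single_for_device() let a driver take the buffer back temporarily without unmapping it).

```c
#include <assert.h>
#include <stdbool.h>

enum toy_owner { TOY_OWNER_CPU, TOY_OWNER_DEVICE };

struct toy_mapping {
    bool mapped;            /* true between map and unmap */
    enum toy_owner owner;   /* who may touch the buffer right now */
};

static void toy_map(struct toy_mapping *m)
{
    assert(!m->mapped);
    m->mapped = true;
    m->owner = TOY_OWNER_DEVICE;    /* dma_map_single(): device owns it */
}

static void toy_sync_for_cpu(struct toy_mapping *m)
{
    assert(m->mapped);
    m->owner = TOY_OWNER_CPU;       /* dma_sync_single_for_cpu() */
}

static void toy_sync_for_device(struct toy_mapping *m)
{
    assert(m->mapped);
    m->owner = TOY_OWNER_DEVICE;    /* dma_sync_single_for_device() */
}

static void toy_unmap(struct toy_mapping *m)
{
    assert(m->mapped);
    m->mapped = false;
    m->owner = TOY_OWNER_CPU;       /* dma_unmap_single(): back to the CPU */
}

/* True when a CPU load or store to the buffer obeys the ownership rule. */
static bool toy_cpu_may_access(const struct toy_mapping *m)
{
    return m->owner == TOY_OWNER_CPU;
}
```

A long-lived receive buffer would typically cycle through the two sync calls many times between a single map and unmap.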
DMA Directions
| Direction | Meaning | Cache operation (ARM) |
|---|---|---|
| DMA_TO_DEVICE | CPU writes, device reads | clean (flush) before map |
| DMA_FROM_DEVICE | Device writes, CPU reads | invalidate before unmap |
| DMA_BIDIRECTIONAL | Both read and write | clean+invalidate both ends |
| DMA_NONE | No direction (debugging only; never valid for a real mapping) | none |
Scatter-Gather DMA: dma_map_sg
Real-world I/O buffers are rarely physically contiguous. A 1MB buffer allocated with vmalloc may span up to 256 non-contiguous 4KB physical pages. Scatter-gather DMA allows mapping multiple non-contiguous pages as a single DMA operation, letting the device's DMA engine traverse the scatter list.
#include <linux/scatterlist.h>
/* Build a scatter-gather list from pages */
struct scatterlist *sgl, *sg;
int i, mapped;
sgl = kmalloc_array(npages, sizeof(*sgl), GFP_KERNEL);
if (!sgl)
    return -ENOMEM;
sg_init_table(sgl, npages);
for (i = 0; i < npages; i++)
    sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);
/* Map all segments for DMA; a return value of 0 means failure */
mapped = dma_map_sg(dev, sgl, npages, DMA_FROM_DEVICE);
if (!mapped) {
    dev_err(dev, "SG DMA mapping failed\n");
    kfree(sgl);
    return -EIO;
}
/* Iterate mapped segments (may differ from npages due to merging) */
for_each_sg(sgl, sg, mapped, i) {
    dma_addr_t addr = sg_dma_address(sg);
    unsigned int len = sg_dma_len(sg);
    /* Program device's scatter-gather descriptor with addr, len */
    set_sg_descriptor(dev, i, addr, len);
}
/* ... trigger DMA, wait for completion ... */
/* Unmap with the original npages, not the mapped count */
dma_unmap_sg(dev, sgl, npages, DMA_FROM_DEVICE);
kfree(sgl);
The kernel's DMA mapping layer may merge adjacent scatter-gather entries that end up contiguous in the device's address space, so mapped may be less than npages. Always use mapped for the device descriptor count, but pass the original npages to dma_unmap_sg.
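To see why the mapped count can be smaller than the number of pages, here is a small userspace model of segment merging. It is not the kernel's algorithm, only an illustration under the assumption that two segments merge when one ends exactly where the next begins; `toy_seg` and `toy_merge_segments` are invented names.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_seg {
    uint64_t addr;  /* bus address of the segment */
    uint32_t len;   /* length in bytes */
};

/*
 * Merge neighbouring segments in place when one ends exactly where the
 * next begins. Returns the number of segments left, which plays the role
 * of the dma_map_sg() return value.
 */
static size_t toy_merge_segments(struct toy_seg *segs, size_t nents)
{
    size_t out = 0;

    if (nents == 0)
        return 0;

    for (size_t i = 1; i < nents; i++) {
        if (segs[out].addr + segs[out].len == segs[i].addr)
            segs[out].len += segs[i].len;   /* extend the current segment */
        else
            segs[++out] = segs[i];          /* start a new segment */
    }
    return out + 1;
}
```

With four 4 KiB pages of which the first two are adjacent, the model returns three segments, and the first one is 8 KiB long.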
DMA Pool
For drivers that need many small DMA-safe allocations (e.g., command descriptors that are 64 bytes each, hundreds of them), dma_alloc_coherent per-descriptor is wasteful because each allocation has a minimum granularity of one page. DMA pools provide a slab-like allocator for small coherent DMA allocations:
/* Create pool: 64-byte objects, aligned to 64-byte boundaries.
 * priv is hypothetical driver state holding the register base. */
struct dma_pool *pool;
struct my_cmd *cmd;
dma_addr_t cmd_dma;
pool = dma_pool_create("my_cmd_pool",          /* name */
                       dev,                    /* device */
                       sizeof(struct my_cmd),  /* obj size */
                       64,                     /* alignment */
                       0);                     /* cross-page boundary limit (0=none) */
if (!pool)
    return -ENOMEM;
/* Allocate from pool */
cmd = dma_pool_alloc(pool, GFP_KERNEL, &cmd_dma);
if (!cmd) {
    dma_pool_destroy(pool);
    return -ENOMEM;
}
/* Use cmd (CPU virtual), cmd_dma (device address) */
cmd->opcode = OP_READ;
cmd->length = 4096;
writel(lower_32_bits(cmd_dma), priv->regs + SUBMIT_CMD); /* 32-bit register */
/* Free back to pool */
dma_pool_free(pool, cmd, cmd_dma);
/* Destroy pool (all allocations must be freed first) */
dma_pool_destroy(pool);
The Linux NVMe PCI driver uses DMA pools for its PRP lists and SGL descriptor tables. Its submission queues (64-byte entries) and completion queues (16-byte entries) are allocated separately with dma_alloc_coherent.
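The idea behind a DMA pool, carving one coherent page into fixed-size slots that each have a CPU address and a matching device address, can be sketched in userspace as follows. This is a simplified model and not how dma_pool is implemented; all `toy_` names and the single-page limit are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_OBJ_SIZE  64u
#define TOY_NOBJS     (TOY_PAGE_SIZE / TOY_OBJ_SIZE)   /* 64 slots per page */

struct toy_pool {
    uint8_t  *cpu_base;             /* CPU view of the backing page */
    uint64_t  dma_base;             /* device view of the same page */
    uint8_t   in_use[TOY_NOBJS];    /* 1 = slot allocated */
};

/* Hand out one 64-byte slot; returns NULL when the page is full. */
static void *toy_pool_alloc(struct toy_pool *p, uint64_t *dma)
{
    for (size_t i = 0; i < TOY_NOBJS; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            *dma = p->dma_base + i * TOY_OBJ_SIZE;
            return p->cpu_base + i * TOY_OBJ_SIZE;
        }
    }
    return NULL;
}

/* Return a slot to the pool. */
static void toy_pool_free(struct toy_pool *p, void *obj)
{
    size_t i = (size_t)((uint8_t *)obj - p->cpu_base) / TOY_OBJ_SIZE;

    assert(i < TOY_NOBJS && p->in_use[i]);
    p->in_use[i] = 0;
}
```

One page serves 64 descriptors here, whereas 64 separate page-granular coherent allocations would consume 64 pages.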
Cache Coherence for DMA
The cache coherence problem for DMA is one of the most platform-specific issues in driver writing:
x86 Architecture:
┌────────────────────────────────────────────────────────┐
│ Intel/AMD CPUs: hardware MESI protocol maintains │
│ coherence between all CPU caches and DMA. │
│ A DMA write from a device invalidates the CPU's │
│ cache line automatically via the memory controller. │
│ dma_map_single on x86 = virtually a no-op (just │
│ IOMMU mapping if present). │
└────────────────────────────────────────────────────────┘
ARM Architecture (without CCI/CCN cache interconnect):
┌────────────────────────────────────────────────────────┐
│ No hardware DMA coherence in older/embedded ARM. │
│ │
│ DMA_TO_DEVICE (device reads): │
│ CPU wrote to cached buffer → CPU cache has data │
│ Device's DMA reads physical RAM → sees old data! │
│ FIX: dma_map calls DC CVAC (Clean by VA to Point │
│ of Coherency, AArch64) for the range. │
│ │
│ DMA_FROM_DEVICE (device writes): │
│ Device DMA writes to physical RAM │
│ CPU reads from cached buffer → sees old cached data! │
│ FIX: dma_unmap calls DC IVAC (Invalidate by VA to │
│ PoC) to evict stale cache lines. │
└────────────────────────────────────────────────────────┘
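ARM systems whose devices sit behind a coherent interconnect (CCI-400, CCN-502, or CMN-600) behave like x86 for DMA. Coherence is a per-device property that firmware reports to the kernel (the dma-coherent property in the device tree, _CCA in ACPI). AWS Graviton (Cortex-A72/Neoverse) connects PCIe via a coherent interconnect, so DMA mapping there doesn't require explicit cache operations.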
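The two non-coherent failure cases described above can be reproduced with a toy model of a single write-back cache line. This is a userspace illustration only; the `toy_` names are invented, and real caches have many lines, eviction, and speculative fills that this sketch ignores.

```c
#include <stdbool.h>
#include <stdint.h>

/* One cache line of a non-coherent system: RAM copy plus a CPU-cache copy. */
struct toy_line {
    uint32_t ram;       /* what the device sees */
    uint32_t cache;     /* what the CPU sees while the line is cached */
    bool     cached;    /* line is present in the CPU cache */
    bool     dirty;     /* cached copy is newer than RAM */
};

static void toy_cpu_write(struct toy_line *l, uint32_t v)
{
    l->cache = v;           /* write-back cache: RAM is not updated yet */
    l->cached = true;
    l->dirty = true;
}

static uint32_t toy_cpu_read(struct toy_line *l)
{
    if (!l->cached) {       /* miss: fill the line from RAM */
        l->cache = l->ram;
        l->cached = true;
    }
    return l->cache;
}

static void toy_dev_write(struct toy_line *l, uint32_t v) { l->ram = v; }
static uint32_t toy_dev_read(const struct toy_line *l)    { return l->ram; }

/* Clean: write a dirty line back to RAM (needed before the device reads). */
static void toy_clean(struct toy_line *l)
{
    if (l->cached && l->dirty) {
        l->ram = l->cache;
        l->dirty = false;
    }
}

/* Invalidate: drop the cached copy (needed before the CPU reads DMA data). */
static void toy_invalidate(struct toy_line *l)
{
    l->cached = false;
    l->dirty = false;
}
```

In this model the device reads stale RAM until `toy_clean` runs, and the CPU reads a stale cached value until `toy_invalidate` runs, mirroring the DMA_TO_DEVICE and DMA_FROM_DEVICE cases.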
IOMMU Role in DMA
Without an IOMMU, a device's DMA engine has unrestricted access to all of physical memory. A compromised or malfunctioning device could overwrite arbitrary kernel memory. The IOMMU (Input-Output Memory Management Unit) interposes a translation layer:
Device DMA address (device-virtual)
│
▼
IOMMU (Intel VT-d / AMD-Vi / ARM SMMU)
├── Looks up device-virtual in IOMMU page table
├── IOMMU page table: device-virtual → physical
│ (configured by kernel during dma_map)
├── If translation exists: forward to memory
└── If no translation: fault → DMAR fault interrupt
│
▼
Physical memory (only allowed regions)
Benefits of IOMMU:
1. Security: limits DMA attacks from malicious hardware (Thunderclap, 2019, showed that a PCIe device attached through a Thunderbolt port can read and write memory wherever the IOMMU is absent or too permissive)
2. Address space: 32-bit devices that can only address 4GB can DMA into high memory via IOMMU remapping
3. Isolation: PCIe passthrough to VMs (VFIO, SR-IOV), where the device is confined to the VM's physical memory ranges
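The lookup-or-fault behaviour in the diagram above can be modelled with a tiny translation table. This is a userspace sketch with invented `toy_` names and a flat eight-entry table; real IOMMUs use multi-level page tables and per-device domains.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((1ULL << TOY_PAGE_SHIFT) - 1)
#define TOY_MAX_MAPS   8

struct toy_iommu {
    struct { uint64_t iova_pfn, phys_pfn; bool valid; } map[TOY_MAX_MAPS];
};

/* Install a one-page translation; returns false if the table is full. */
static bool toy_iommu_map(struct toy_iommu *mmu, uint64_t iova, uint64_t phys)
{
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        if (!mmu->map[i].valid) {
            mmu->map[i].iova_pfn = iova >> TOY_PAGE_SHIFT;
            mmu->map[i].phys_pfn = phys >> TOY_PAGE_SHIFT;
            mmu->map[i].valid = true;
            return true;
        }
    }
    return false;
}

/* Translate a device address; returns false to signal an IOMMU fault. */
static bool toy_iommu_translate(const struct toy_iommu *mmu, uint64_t iova,
                                uint64_t *phys)
{
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        if (mmu->map[i].valid &&
            mmu->map[i].iova_pfn == (iova >> TOY_PAGE_SHIFT)) {
            *phys = (mmu->map[i].phys_pfn << TOY_PAGE_SHIFT) |
                    (iova & TOY_PAGE_MASK);
            return true;
        }
    }
    return false;
}
```

A device address inside a mapped page translates to the corresponding physical address with the page offset preserved; any other address faults instead of reaching memory.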
Enable IOMMU on Linux:
# Intel: add to kernel command line
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"
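# AMD: the IOMMU driver is enabled by default when switched on in firmware
# (amd_iommu= has no "on" value), so only the mode needs setting:
GRUB_CMDLINE_LINUX="iommu=pt"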
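# iommu=pt = passthrough mode: host-owned devices get identity (1:1) mappings
# and skip translation, so the IOMMU mainly serves devices assigned to VMs.
# This trades DMA isolation of host devices for lower mapping overhead.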
Bounce Buffers
32-bit devices (with a 32-bit DMA address register) cannot directly DMA into physical memory at or above 4GB. On a 64-bit system with >4GB RAM, a DMA transfer from a 32-bit disk controller cannot directly reach a page at physical address 0x1_0000_0000 (exactly 4GB, the first address out of reach).
The kernel handles this with bounce buffers: temporary buffers taken from a pool reserved in low memory (the swiotlb pool, at physical addresses below 4GB). The device DMAs to/from the bounce buffer, and the kernel copies the data to/from the high-memory buffer:
Without bounce buffer (broken):
High-memory page (phys: 0x1_0000_0000) ← 32-bit DMA can't reach
With bounce buffer (correct):
High-memory page (phys: 0x1_0000_0000)
↑
[kernel memcpy]
↑
Bounce buffer (phys: 0x0_8000_0000) ← 32-bit DMA works here
↑
Device DMA
Bounce buffers are a performance penalty (extra copy). The driver sets the DMA mask to tell the kernel what address range the device supports:
/* Tell kernel this device only supports 32-bit DMA addresses */
ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
if (ret) {
dev_err(dev, "No suitable DMA available\n");
return ret;
}
/* For 64-bit capable devices: */
ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(64));
When the mask is 32-bit, no IOMMU is remapping addresses, and the target buffer lies above 4GB, dma_map_single takes a bounce buffer from the swiotlb pool automatically. This is transparent to the driver, but every transfer pays for an extra memcpy whose cost grows with the buffer size.
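The decision of whether a buffer is reachable under a given mask comes down to comparing the last byte's address with the mask. The following userspace sketch shows that check; `toy_dma_bit_mask` mirrors what the kernel's DMA_BIT_MASK(n) macro evaluates to, while `toy_needs_bounce` is an invented helper that assumes no IOMMU and no address wrap-around.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace equivalent of DMA_BIT_MASK(n), valid for 1 <= n <= 64. */
static uint64_t toy_dma_bit_mask(unsigned int bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? ~0ULL : (1ULL << bits) - 1;
}

/*
 * True when the range [addr, addr + size) is not fully addressable under
 * the mask, i.e. when a bounce buffer would be needed. Assumes the range
 * does not wrap around the 64-bit address space.
 */
static bool toy_needs_bounce(uint64_t addr, uint64_t size, uint64_t mask)
{
    assert(size > 0);
    return addr + size - 1 > mask;
}
```

Note that a buffer starting below 4 GiB still needs bouncing if it ends above the limit, so the check must use the end of the range, not its start.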
DMA Debugging
CONFIG_DMA_API_DEBUG enables comprehensive DMA API validation in the kernel:
- Detects unmapping or syncing of memory that was never mapped
- Detects unmap calls whose size or direction differs from the map call
- Detects mapping leaks (mapped buffers not unmapped on driver unload)
- Detects mappings used without checking dma_mapping_error, and mappings of unsuitable memory such as stack buffers
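# Enable DMA API debugging: build the kernel with CONFIG_DMA_API_DEBUG=y
# kernel cmdline: dma_debug=off disables it at boot;
# dma_debug_driver=<name> limits checks to one driver
# View DMA API debug errors
dmesg | grep "DMA-API"
# Dump all tracked DMA mappings (requires debugfs to be mounted)
cat /sys/kernel/debug/dma-api/dump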
Additionally, KASAN (Kernel Address Sanitizer) with CONFIG_KASAN=y detects CPU-side out-of-bounds accesses to DMA buffers; it cannot see writes performed by the device. For streaming DMA, ownership rule violations are harder to detect because KASAN tracks virtual addresses, not physical page ownership.
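The bookkeeping behind this kind of API checking can be sketched as a table of live mappings that every map and unmap call is checked against. The userspace model below is an illustration only, with invented `toy_` names and a fixed-size table; the kernel's implementation tracks far more state (device, direction, mapping type).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_TRACKED 16

struct toy_dma_debug {
    struct { uint64_t addr; size_t size; bool live; } ent[TOY_MAX_TRACKED];
    unsigned int errors;    /* number of API misuse reports */
};

/* Record a new mapping; a full table is counted as an error in this toy. */
static void toy_debug_map(struct toy_dma_debug *d, uint64_t addr, size_t size)
{
    for (size_t i = 0; i < TOY_MAX_TRACKED; i++) {
        if (!d->ent[i].live) {
            d->ent[i].addr = addr;
            d->ent[i].size = size;
            d->ent[i].live = true;
            return;
        }
    }
    d->errors++;
}

/* Check an unmap against the recorded mapping. */
static void toy_debug_unmap(struct toy_dma_debug *d, uint64_t addr, size_t size)
{
    for (size_t i = 0; i < TOY_MAX_TRACKED; i++) {
        if (d->ent[i].live && d->ent[i].addr == addr) {
            if (d->ent[i].size != size)
                d->errors++;        /* size differs from the map call */
            d->ent[i].live = false;
            return;
        }
    }
    d->errors++;                    /* unmapping something never mapped */
}

/* Mappings still live, for example at driver unload: each one is a leak. */
static unsigned int toy_debug_leaks(const struct toy_dma_debug *d)
{
    unsigned int n = 0;

    for (size_t i = 0; i < TOY_MAX_TRACKED; i++)
        n += d->ent[i].live;
    return n;
}
```

Nothing in this table observes CPU loads and stores, which is why such a checker catches API misuse but not a CPU touching a buffer the device currently owns.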
Production Examples
NVMe driver (drivers/nvme/host/pci.c): Uses coherent DMA for submission and completion queues (command descriptors). Uses per-request DMA mapping (dma_map_sg) for the actual data buffers. Queue depth of 1023 means up to 1023 concurrent streaming DMA mappings per queue. On a 32-core system with 32 queues, this can reach 32 × 1023 = 32,736 concurrent DMA mappings.
Intel E1000e NIC: Uses coherent DMA for TX/RX descriptor rings (256 entries × 16 bytes each), allocated with dma_alloc_coherent. Uses streaming DMA (dma_map_single) for each packet buffer. At 1Gbps line rate this means roughly 81,000 mappings/unmappings per second with full-size frames and up to about 1.49 million per second with minimum-size frames.
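To put numbers on the E1000e ring size and the per-packet mapping rate, the arithmetic can be written out as below. This is a back-of-the-envelope sketch: the helper names are invented, frame sizes include the 4-byte FCS, and the 20 bytes of per-frame wire overhead are the standard Ethernet preamble, start-of-frame delimiter, and inter-frame gap.

```c
#include <stdint.h>

/* Per-frame overhead on the wire: 7 preamble + 1 SFD + 12 inter-frame gap. */
#define TOY_ETH_WIRE_OVERHEAD 20u

/* Frames per second at line rate for a given frame size (including FCS). */
static uint64_t toy_frames_per_sec(uint64_t link_bps, uint32_t frame_bytes)
{
    uint64_t bits_per_frame =
        (uint64_t)(frame_bytes + TOY_ETH_WIRE_OVERHEAD) * 8;

    return link_bps / bits_per_frame;
}

/* Bytes needed for a descriptor ring. */
static uint32_t toy_ring_bytes(uint32_t entries, uint32_t desc_bytes)
{
    return entries * desc_bytes;
}
```

A 256-entry ring of 16-byte descriptors is 4096 bytes, exactly one 4 KiB page. At 1 Gbps the model gives 81,274 frames per second for 1518-byte frames and 1,488,095 for 64-byte frames, each needing one streaming map and unmap.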
USB Host Controller (xHCI): The Transfer Request Block (TRB) rings are coherent DMA. USB bulk transfers use streaming DMA. The entire USB protocol stack operates through DMA — no CPU intervention in data transfer, only in control flow.
Security Implications
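DMA attacks (Thunderclap, 2019): A malicious PCIe device (connected via Thunderbolt 3 without IOMMU) can DMA arbitrary physical memory. On macOS (pre-2016) and Windows (pre-Kernel DMA Protection), plugging in a malicious Thunderbolt device gave full system compromise in seconds. Linux with the IOMMU enforcing translation (intel_iommu=on without iommu=pt) blocks arbitrary-address DMA, although Thunderclap also showed that data sharing a page with a mapped buffer can still be exposed; Linux without an IOMMU has no protection.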
IOMMU bypass via firmware: Some systems have the IOMMU enabled, yet firmware-declared reserved memory regions (RMRR on Intel VT-d, IVMD unity mappings on AMD-Vi) stay identity-mapped and remain reachable by the associated device, regardless of what the driver maps.
Speculative DMA (Spectre-related): Without an IOMMU, the DMA addresses returned by dma_map_single are physical addresses, so leaking them reveals physical memory layout, which combined with Spectre gadgets could help defeat KASLR. With an IOMMU, DMA addresses are device-virtual, not physical, which partially mitigates this.
Performance Implications
IOMMU overhead: Each DMA mapping requires updating the IOMMU page table, and each unmap requires invalidating the IOMMU's own TLB (the IOTLB). On high-throughput workloads (NVMe at 1M IOPS), this overhead is measurable. Solutions: iommu=pt (passthrough, skip IOMMU translation for trusted devices), lazy (batched) IOTLB flushing, or keeping long-lived mappings.
Coherent DMA on ARM: Uncacheable memory access can be an order of magnitude or more slower than cached access. NVMe drivers on ARM must minimize coherent DMA accesses in hot paths. The completion queue poll path reads coherent memory per completion; at 1M IOPS, this is a bottleneck on ARM without cache-coherent interconnects.
Failure Modes
- dma_alloc_coherent returns NULL: DMA zone exhausted. Reduce allocation size or use dma_pool for small allocations.
- dma_map_single returns an address for which dma_mapping_error is true: IOMMU address space or bounce buffer pool (swiotlb) exhausted. Reduce concurrent DMA operations or enlarge the bounce buffer pool.
- Silent data corruption (ARM without hardware coherence): a DMA_TO_DEVICE transfer where the CPU wrote to the buffer after mapping it, or the driver bypassed the DMA API, so the cache was never cleaned and the device reads stale data. Extremely hard to debug; start with CONFIG_DMA_API_DEBUG to rule out API misuse.
- DMAR fault: device DMAs to unmapped address (IOMMU fault). Visible in dmesg as "DMAR: DRHD: handling fault status reg...". Caused by a driver bug (programming the wrong DMA address into the device) or a hardware bug.
Modern Usage
RDMA (Remote DMA): Network protocols like InfiniBand and RoCE use DMA not just within one machine but to read/write another machine's memory across the network. The driver registers memory regions with the HCA (Host Channel Adapter), which the remote host can then DMA into directly.
DMA-BUF: A kernel framework (merged in 3.3) for sharing DMA buffers between devices without copying. A camera driver can give a frame buffer directly to a GPU driver via DMA-BUF, allowing the GPU to encode the frame without a CPU copy. Used extensively in Android graphics pipeline (gralloc).
Future Directions
P2P DMA: PCIe peer-to-peer DMA (merged in Linux 4.20) allows two PCIe devices to DMA directly between each other, bypassing system RAM entirely. An NVMe SSD and a GPU can transfer data at PCIe line speed without CPU or memory controller involvement. Used in NVIDIA GPUDirect Storage.
CXL (Compute Express Link): CXL 3.0 introduces fabric memory sharing where multiple hosts can share a DMA-accessible memory pool. Driver models for CXL memory are still evolving in Linux 6.x.
Exercises
- Write a character device driver that allocates a coherent DMA buffer and exposes it to user space via mmap. Verify that the physical address falls inside a System RAM range listed in /proc/iomem.
- Measure the performance difference between dma_alloc_coherent (uncached) access and kmalloc (cached) access for a 1MB write loop. Use perf stat to count cache misses.
- Enable CONFIG_DMA_API_DEBUG on a test kernel and intentionally misuse the API (for example, unmap a streaming buffer with the wrong size, or skip the dma_mapping_error check). Capture the kernel warning.
- Inspect the DMA mapping code path in kernel/dma/swiotlb.c (lib/swiotlb.c in older kernels) for the bounce buffer implementation. Trace a DMA map call on a system with a 32-bit-DMA device.
- Set intel_iommu=on and observe the boot performance difference. Use dmesg | grep DMAR to verify IOMMU activation and group assignments.
References
- Documentation/core-api/dma-api.rst: complete DMA API reference
- Documentation/core-api/dma-api-howto.rst: DMA API usage guide
- include/linux/dma-mapping.h: DMA API declarations
- kernel/dma/swiotlb.c (lib/swiotlb.c in older kernels): bounce buffer implementation
- Thunderclap paper: Markettos et al., "Thunderclap: Exploring Vulnerabilities in Operating System IOMMU Protection via DMA from Untrustworthy Peripherals", NDSS 2019
- DMA-BUF: Documentation/driver-api/dma-buf.rst
- Linux Device Drivers, 3rd Edition, Chapter 15: Memory Mapping and DMA