DMA and IOMMU

Technical Overview

DMA (Direct Memory Access) is the mechanism by which I/O devices transfer data directly to and from system RAM without involving the CPU for each byte. Without DMA, the CPU would have to copy every received network packet from device registers to RAM, which is impractical at modern speeds (a 100 Gbps NIC moves 12.5 GB/s). With DMA, the NIC, storage controller, GPU, and other devices move the data themselves: a NIC's DMA engine, for example, writes packet data directly into a pre-allocated ring buffer in RAM and interrupts the CPU only when the transfer is complete.
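The line-rate figure above is plain unit conversion, and the same arithmetic gives the time budget per packet. The following is a minimal sketch; the helper names are illustrative and not part of any kernel API, and framing overhead (preamble, inter-frame gap) is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes per second for a link speed in Gbps (decimal: 1 Gbps = 10^9 bit/s). */
static inline uint64_t link_bytes_per_sec(uint64_t gbps)
{
    return gbps * 1000000000ULL / 8;
}

/*
 * Nanoseconds available per frame at line rate.
 * bits / (Gbit/s) yields nanoseconds directly; framing overhead is ignored.
 */
static inline uint64_t ns_per_frame(uint64_t gbps, uint64_t frame_bytes)
{
    assert(gbps > 0);
    return frame_bytes * 8 / gbps;
}
```

For a 100 Gbps link, `link_bytes_per_sec(100)` gives 12,500,000,000 bytes per second and `ns_per_frame(100, 1500)` gives 120 ns per 1500-byte frame, which is why per-byte CPU copies are not an option.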

The IOMMU (I/O Memory Management Unit) is the hardware equivalent of the CPU's MMU, but for DMA devices. It provides:

1. Address translation: Devices use virtual I/O addresses (IOVA); the IOMMU translates to physical addresses.
2. Protection: A device can only access physical pages it has been explicitly authorized to access. A misbehaving or compromised device cannot read/write arbitrary RAM.
3. Virtualization passthrough: A VM guest can be given direct, exclusive access to a PCIe device (VFIO passthrough) without the hypervisor needing to intercept DMA.

Intel's implementation is Intel VT-d (Intel Virtualization Technology for Directed I/O). AMD's equivalent is AMD-Vi (AMD I/O Virtualization Technology). DMAR (DMA Remapping) is Intel's name for the translation hardware and for the ACPI table that describes it; AMD's counterpart table is IVRS.

Prerequisites

Physical memory and buddy allocator (06-buddy-allocator.md)
PCIe device model and bus addressing
Virtual memory concepts (01-virtual-memory.md)
Linux device driver model (struct device)

Core Content

DMA Concept and Physical Addressing

DMA Without IOMMU
==================

CPU (virtual addresses)          Physical RAM
  ┌─────────────────┐            ┌──────────────────────────────┐
  │ process A        │            │ frame 0: [OS kernel code]    │
  │ kernel driver    │── writes ─►│ frame 1: [NIC DMA ring buf]  │◄─── NIC DMA
  └─────────────────┘  phys addr │ frame 2: [GPU framebuffer]   │◄─── GPU DMA
                                  │ frame 3: [process A heap]    │
                                  │ frame N: [page cache data]   │
                                  └──────────────────────────────┘

Problem: NIC uses 32-bit bus addresses (PCI devices).
  On a system with > 4 GB RAM, the DMA buffer must be in the
  first 4 GB (physical address < 0x100000000).

  → This is why ZONE_DMA32 exists.
  → Buffers above 4 GB require IOMMU address translation or
    bounce buffers.

DMA ring buffer allocation:
  Driver allocates: ptr = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL)
    ptr       = virtual address (kernel can write to it)
    dma_handle = DMA address (device uses this; = physical addr without IOMMU)

  Driver programs device: writel(lower_32_bits(dma_handle), device_reg_base_addr)
    (a 64-bit-capable device also takes upper_32_bits(dma_handle) in a second register)
  Device can now DMA to/from that DMA address.
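The 4 GB restriction comes down to comparing the last byte of the buffer against the device's DMA mask. Here is a small user-space sketch of that check; `dma_bit_mask()` mirrors the kernel's `DMA_BIT_MASK(n)` macro, while `dma_reachable()` is a hypothetical helper written for this illustration, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask with the low n bits set; same definition as the kernel's DMA_BIT_MASK(n). */
static inline uint64_t dma_bit_mask(unsigned int n)
{
    assert(n >= 1 && n <= 64);
    return (n == 64) ? ~0ULL : ((1ULL << n) - 1);
}

/* True if every byte of [addr, addr + size) is addressable under the mask. */
static inline bool dma_reachable(uint64_t addr, uint64_t size, uint64_t mask)
{
    assert(size > 0);
    uint64_t end = addr + size - 1;
    /* end >= addr rejects ranges that wrap around the 64-bit space. */
    return end >= addr && end <= mask;
}
```

A buffer at 0x100000 is reachable under `dma_bit_mask(32)`; one at 0x200000000 (8 GB) is not, so without an IOMMU it would need a bounce buffer.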

DMA API in Linux

The Linux DMA API (include/linux/dma-mapping.h, kernel/dma/) provides device-agnostic DMA buffer management:

/* Coherent DMA: CPU and device see same data immediately (no explicit sync) */
void *dma_alloc_coherent(struct device *dev, size_t size,
                          dma_addr_t *dma_handle, gfp_t flag);
void dma_free_coherent(struct device *dev, size_t size,
                        void *cpu_addr, dma_addr_t dma_handle);

/* Streaming DMA: CPU writes, then hands off to device (explicit sync required) */
dma_addr_t dma_map_single(struct device *dev, void *ptr,
                           size_t size, enum dma_data_direction dir);
void dma_unmap_single(struct device *dev, dma_addr_t addr,
                       size_t size, enum dma_data_direction dir);

/* Scatter-gather: map an array of non-contiguous buffers */
int dma_map_sg(struct device *dev, struct scatterlist *sg,
               int nents, enum dma_data_direction dir);
void dma_unmap_sg(struct device *dev, struct scatterlist *sg,
                   int nents, enum dma_data_direction dir);

/* Sync for CPU access (after device DMA write, before CPU read) */
void dma_sync_single_for_cpu(struct device *dev, dma_addr_t addr,
                               size_t size, enum dma_data_direction dir);
/* Sync for device access (before device DMA read) */
void dma_sync_single_for_device(struct device *dev, dma_addr_t addr,
                                  size_t size, enum dma_data_direction dir);

DMA direction flags:
- DMA_TO_DEVICE: CPU writes to buffer; device reads it (e.g., TX packet)
- DMA_FROM_DEVICE: Device writes to buffer; CPU reads it (e.g., RX packet)
- DMA_BIDIRECTIONAL: Both directions (expensive: flush + invalidate cache)

Coherent vs Streaming DMA

DMA Coherency Models
=====================

1. Coherent (Consistent) DMA:
   dma_alloc_coherent()

   Properties:
   - On non-coherent platforms: mapped uncached or write-combining for the CPU
     (on hardware-coherent platforms such as x86-64 it is ordinary cached memory)
   - Device writes are immediately visible to CPU (no explicit sync needed)
   - CPU writes are immediately visible to device

   Cost: where the mapping is uncached, CPU accesses bypass the cache → slow for large regions
   Use case: small control rings (RX/TX descriptors), doorbells

   Example:
     NIC RX descriptor ring: 256 entries × 16 bytes = 4 KB
     CPU writes: "put RX buffer at address X" into descriptor
     Device reads descriptor to know where to DMA packet data

2. Streaming (Non-coherent) DMA:
   dma_map_single() / dma_map_sg()

   Properties:
   - Normal cached CPU memory
   - Before handing to device: flush CPU cache (dma_map = implicit sync)
   - After device write: invalidate CPU cache (dma_unmap = implicit sync)
   - Or use dma_sync_*() for partial sync

   Cost: cache flush on every transfer (but data path is cached = fast)
   Use case: large data buffers (packet payloads, disk sectors)

   Example:
     NIC RX buffer pool: 2 KB buffers for packet data
     kmalloc (cached) → dma_map_single() → packet arrives → dma_sync_single_for_cpu() → driver reads

CPU cache and DMA interaction on non-coherent architectures (ARM, older MIPS):
  Device writes go to physical RAM, bypassing the CPU's L1/L2 caches
  (this is independent of whether an IOMMU is present).
  If CPU has a stale cached copy → reads stale data → data corruption.
  dma_unmap_single() forces cache invalidation on these architectures.
  On x86-64: DMA is cache-coherent in hardware (device accesses snoop the
  CPU caches), so no explicit cache flush is needed for DMA.
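The cache maintenance implied by each direction flag can be written down as a small decision table. This is a simplified model for non-coherent platforms only: the enum and function names are invented for this sketch, and the operations a real architecture performs differ in detail.

```c
/* Hypothetical stand-ins for the kernel's enum dma_data_direction values. */
enum dma_dir { DIR_BIDIRECTIONAL, DIR_TO_DEVICE, DIR_FROM_DEVICE };

#define CACHE_OP_NONE       0u
#define CACHE_OP_CLEAN      1u  /* write dirty lines back to RAM */
#define CACHE_OP_INVALIDATE 2u  /* discard cached lines */

/* Maintenance before the device touches the buffer (map / sync_for_device). */
static inline unsigned int ops_for_device(enum dma_dir dir)
{
    switch (dir) {
    case DIR_TO_DEVICE:     return CACHE_OP_CLEAN;
    case DIR_FROM_DEVICE:   return CACHE_OP_INVALIDATE;
    case DIR_BIDIRECTIONAL: return CACHE_OP_CLEAN | CACHE_OP_INVALIDATE;
    }
    return CACHE_OP_NONE;
}

/* Maintenance before the CPU reads the buffer again (unmap / sync_for_cpu). */
static inline unsigned int ops_for_cpu(enum dma_dir dir)
{
    switch (dir) {
    case DIR_TO_DEVICE:     return CACHE_OP_NONE;        /* device only read it */
    case DIR_FROM_DEVICE:
    case DIR_BIDIRECTIONAL: return CACHE_OP_INVALIDATE;  /* drop stale lines */
    }
    return CACHE_OP_NONE;
}
```

The table makes the cost of DMA_BIDIRECTIONAL visible: it pays for both a clean and an invalidate, which is why drivers are encouraged to pick the narrowest direction that fits.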

IOMMU Architecture

IOMMU Page Table Structure (Intel VT-d)
==========================================

PCIe topology:
  Host Bridge
    └─ Root Complex
          ├─ PCIe device: NIC (BDF 00:1f.0)
          ├─ PCIe device: NVMe (BDF 01:00.0)
          └─ PCIe device: GPU  (BDF 02:00.0)

IOMMU DMA Remapping Tables:
  Root Entry Table (one entry per PCIe bus):
  [Bus 00] → Context Entry Table
              [Device 1f, Fn 0 = NIC] → Domain #1 Page Table (IOMMU PTs)
              [Device 1f, Fn 1 = other] → Domain #2 Page Table
  [Bus 01] → Context Entry Table
              [Device 00, Fn 0 = NVMe] → Domain #3 Page Table
  [Bus 02] → Context Entry Table
              [Device 00, Fn 0 = GPU]  → Domain #4 Page Table

Each domain has its own IOMMU page table (similar to CPU page tables):
  IOVA (I/O Virtual Address) → Physical Address

NIC DMA request:
  NIC issues DMA write to IOVA 0x1000_0000
    │
    ├── IOMMU intercepts PCIe transaction
    ├── Looks up NIC's domain page table
    ├── IOVA 0x1000_0000 → Physical Address 0xAB00_0000
    │     (if mapped)
    ├── Allows DMA write to physical 0xAB00_0000
    │
    └── If IOVA not in page table: IOMMU blocks the access
        A blocked read completes with an error status (UR = Unsupported Request);
        a blocked write is dropped, since posted writes have no completion
        CPU receives DMAR fault interrupt → kernel logs error → driver may reset the device
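The two-step lookup in the diagram (root entry by bus, context entry by device and function) follows directly from the 16-bit PCIe requester ID layout. Below is a minimal sketch of that split; the function names are made up for this example and the tables themselves are not modeled.

```c
#include <stdint.h>
#include <stdio.h>

/* 16-bit PCIe requester ID: bus[15:8], device[7:3], function[2:0]. */
static inline uint16_t bdf_to_rid(unsigned bus, unsigned dev, unsigned fn)
{
    return (uint16_t)(((bus & 0xff) << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/* The root table is indexed by bus number. */
static inline unsigned root_index(uint16_t rid)    { return rid >> 8; }

/* The context table is indexed by the combined device/function byte. */
static inline unsigned context_index(uint16_t rid) { return rid & 0xff; }

/* Parse "bb:dd.f" with hex fields; returns 0 on success, -1 on bad input. */
static inline int parse_bdf(const char *s, uint16_t *rid)
{
    unsigned bus, dev, fn;

    if (sscanf(s, "%x:%x.%x", &bus, &dev, &fn) != 3)
        return -1;
    if (bus > 0xff || dev > 0x1f || fn > 7)
        return -1;
    *rid = bdf_to_rid(bus, dev, fn);
    return 0;
}
```

For the NIC at 00:1f.0 this yields requester ID 0x00f8, so the IOMMU consults root entry 0 and context entry 0xf8; the GPU at 02:00.0 uses root entry 2 and context entry 0.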

Intel VT-d / AMD-Vi

Both implement the same conceptual model with different register layouts:

Intel VT-d (enabled by CONFIG_INTEL_IOMMU=y, intel_iommu=on):
- One or more DMAR hardware units, each covering the set of devices listed for it in the ACPI DMAR table
- Multi-level page tables (3, 4, or 5 levels depending on the supported address width)
- Interrupt remapping (IR) in addition to DMA remapping
- Discovered via the DMAR ACPI table; driver in drivers/iommu/intel/iommu.c

AMD-Vi (enabled by CONFIG_AMD_IOMMU=y):
- One or more IOMMU units, depending on the platform
- IVRS ACPI table defines device-to-IOMMU mapping
- Multi-level page tables modeled on the CPU format (the v1 format allows up to 6 levels)
- Driver in drivers/iommu/amd/iommu.c

IOMMU Page Tables

Intel VT-d IOMMU Page Table Entry
====================================

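Second-stage entry layout (first-stage entries follow the x86 CPU page table
format instead, with ACCESSED at bit 5 and DIRTY at bit 6):

Bits [63:52]  Software use / reserved
Bits [51:12]  Physical Frame Number (PFN): address of next-level table or physical page
Bit  [11]     SNP (snoop behavior)
Bit  [9]      DIRTY (set by IOMMU on write, when A/D tracking is enabled)
Bit  [8]      ACCESSED (set by IOMMU on read or write, when A/D tracking is enabled)
Bit  [1]      W (write permission)
Bit  [0]      R (read permission)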

IOMMU fault conditions:
  - R=0 and device attempts read → DMAR fault
  - W=0 and device attempts write → DMAR fault
  - PFN not present → DMAR fault

DMAR fault handling (drivers/iommu/intel/dmar.c: dmar_fault()):
  1. DMAR unit interrupts CPU
  2. Kernel reads fault reason and faulting address from DMAR registers
  3. Logs: "DMAR:[DMA Read] Request device [02:00.0] fault addr ffee0000"
  4. The handler itself only logs (rate-limited); the blocked transfer usually
     surfaces in the driver as a timeout or -EIO, which may trigger a device reset
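The fault conditions listed above reduce to a few bit tests on the leaf entry. The sketch below models a single leaf entry with R at bit 0, W at bit 1, and the frame address in bits [51:12]; it is a user-space illustration with invented names, not the VT-d driver's code, and it skips the multi-level walk entirely.

```c
#include <stdint.h>

#define PTE_R            (1ULL << 0)
#define PTE_W            (1ULL << 1)
#define PTE_ADDR_MASK    0x000FFFFFFFFFF000ULL  /* bits [51:12] */
#define IOVA_OFFSET_MASK 0xFFFULL               /* offset within a 4 KB page */

enum dma_result {
    DMA_OK,
    DMA_FAULT_NOT_PRESENT,
    DMA_FAULT_READ,
    DMA_FAULT_WRITE,
};

/* Decide whether a device access is allowed by this leaf entry. */
static inline enum dma_result pte_check(uint64_t pte, int is_write)
{
    if (!(pte & (PTE_R | PTE_W)))
        return DMA_FAULT_NOT_PRESENT;   /* neither permission: not mapped */
    if (is_write && !(pte & PTE_W))
        return DMA_FAULT_WRITE;
    if (!is_write && !(pte & PTE_R))
        return DMA_FAULT_READ;
    return DMA_OK;
}

/* Combine the frame address from the entry with the page offset of the IOVA. */
static inline uint64_t pte_translate(uint64_t pte, uint64_t iova)
{
    return (pte & PTE_ADDR_MASK) | (iova & IOVA_OFFSET_MASK);
}
```

With an entry of `0xAB000000 | PTE_R | PTE_W`, a write to IOVA 0x10000123 passes the check and translates to physical 0xAB000123; clear the W bit and the same write yields `DMA_FAULT_WRITE`.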

IOMMU for Security: PCI Device Isolation

Without IOMMU, any PCIe device can DMA to any physical address — including kernel code, page tables, and other processes' memory. This is catastrophic from a security standpoint:

DMA Attack Without IOMMU:
  1. Attacker has physical access to a Thunderbolt port
  2. Connects a malicious Thunderbolt device (Thunderspy attack)
  3. Device issues DMA write to the physical address of kernel .text
  4. Overwrites kernel code → root-level code execution
  5. Device can also read any physical page: kernel credential structures,
     encryption keys, other processes' memory
  6. Result: credential theft and data disclosure without any software exploit

With IOMMU:
  1. Thunderbolt device assigned its own IOMMU domain
  2. Domain page table initially empty (no authorized physical addresses)
  3. Device DMA to any unauthorized address → DMAR fault → blocked
  4. Linux kernel: Thunderbolt driver explicitly maps only the Thunderbolt
     protocol buffers into the domain
  5. Device can only access exactly those protocol buffers

Kernel IOMMU security features:
- iommu.passthrough=0: DMA goes through IOMMU translation instead of identity (passthrough) mappings
- iommu.strict=1 (strict mode): IOTLB invalidation happens synchronously on unmap (safer, slower)
- intel_iommu=on,igfx_off: enable IOMMU except for integrated GPU

IOMMU for Virtualization: VFIO Passthrough

VFIO (Virtual Function I/O) Passthrough
==========================================

Host: KVM hypervisor, Guest: VM running database application

Setup:
  1. Unbind NIC from host driver (e.g., ixgbe):
     echo 0000:01:00.0 > /sys/bus/pci/devices/0000:01:00.0/driver/unbind

  2. Bind to vfio-pci driver:
     echo 8086 10fb > /sys/bus/pci/drivers/vfio-pci/new_id

  3. QEMU creates a VFIO device for the VM:
     qemu-system-x86_64 ... -device vfio-pci,host=01:00.0

Guest experience:
  - Guest VM sees the NIC as if it's directly attached (bare metal performance)
  - Guest driver programs NIC DMA addresses (guest physical = IOVA from guest's view)

IOMMU's role:
  - QEMU asks VFIO to program the IOMMU (VFIO_IOMMU_MAP_DMA): guest physical address = IOVA, maps to host physical
  - VFIO domain: {IOVA=guest_phys, phys=host_phys} for guest's RAM pages only
  - NIC DMA to any guest physical → IOMMU translates → correct host physical
  - NIC cannot access host kernel or other VMs (not in VFIO domain page table)
  - Guest OS can program NIC registers directly (memory-mapped via /dev/vfio/N)

Performance:
  DMA passthrough: near line-rate (100 Gbps NIC at full speed in VM)
  vs. virtio-net:  emulation overhead ~5-15% for high packet-rate workloads
  vs. SR-IOV:     NIC provides multiple Virtual Functions (VF), each gets VFIO
                  domain; multiple VMs share one physical NIC
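The isolation property of a VFIO domain can be modeled as a list of ranges: an access is allowed only if it falls entirely inside one registered mapping. The sketch below is a user-space model with hypothetical names. Real VFIO mappings are registered with the VFIO_IOMMU_MAP_DMA ioctl using a process virtual address that the kernel pins and resolves; here the host physical address is stored directly to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

/* One registered range: guest physical (used as IOVA) to host physical. */
struct dma_mapping {
    uint64_t iova;
    uint64_t size;
    uint64_t host_phys;
};

/*
 * Translate a device access of len bytes at iova.
 * Returns 0 and stores the host physical address in *hpa when the whole
 * access lies inside one mapping; returns -1 when the IOMMU would fault.
 */
static inline int domain_translate(const struct dma_mapping *maps, size_t n,
                                   uint64_t iova, uint64_t len, uint64_t *hpa)
{
    if (len == 0)
        return -1;

    for (size_t i = 0; i < n; i++) {
        const struct dma_mapping *m = &maps[i];

        if (len > m->size || iova < m->iova)
            continue;
        /* Written without iova + len so the check cannot overflow. */
        if (iova - m->iova <= m->size - len) {
            *hpa = m->host_phys + (iova - m->iova);
            return 0;
        }
    }
    return -1;
}
```

A guest with 1 GB of RAM mapped at IOVA 0 can reach any address inside that gigabyte; an access that starts inside it but runs past the end, or one aimed at an address that was never mapped, returns -1, which corresponds to a blocked DMA on real hardware.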

Bounce Buffers: 32-bit Device Limitation

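Legacy PCI devices (and some older PCIe devices) use 32-bit DMA addresses, so they cannot address physical memory above 4 GB; ZONE_DMA32 exists so the kernel can allocate below that limit for them. When a buffer on a system with > 4 GB RAM lies above the limit and no IOMMU can remap it, the OS must stage the transfer through a "bounce buffer":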

Bounce Buffer Operation
========================

High memory (> 4 GB):
  [kernel buffer at PA=0x200000000 (8GB)]  ← normal allocation

Low memory (< 4 GB):
  [bounce buffer at PA=0x100000   (1MB)]   ← DMA-accessible zone

DMA write (device → memory):
  1. Device DMA writes to bounce buffer at PA=0x100000
  2. ISR/softirq: memcpy(kernel_buffer, bounce_buffer, size)  ← extra copy
  3. Driver processes data from kernel_buffer

DMA read (memory → device):
  1. memcpy(bounce_buffer, kernel_buffer, size)  ← extra copy
  2. Program device DMA from bounce buffer at PA=0x100000

Cost: one extra memcpy per DMA transfer
  At 10 Gbps: 1.25 GB/s of payload, read once and written once by memcpy
    = 2.5 GB/s of memory traffic
  On a modern CPU with 50 GB/s of memory bandwidth: ~5% of that bandwidth

With IOMMU:
  IOMMU maps high physical address to a 32-bit IOVA
  Device sees IOVA < 4GB, IOMMU translates to physical > 4GB
  → NO bounce buffer needed, even for 32-bit devices

Driver code: without an IOMMU covering the device, dma_map_single() falls back
  to swiotlb_map() when the buffer's address is not reachable under the device's
  DMA mask (or when bouncing is forced, e.g. swiotlb=force).
  SWIOTLB (Software I/O TLB) implements bounce buffers in Linux.
  drivers/iommu/dma-iommu.c — IOMMU-aware DMA map
  kernel/dma/swiotlb.c — software bounce buffer fallback
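The choice between direct mapping, IOMMU remapping, and bouncing, together with the cost estimate above, can be captured in two small functions. This is a simplified sketch with invented names: it assumes an active IOMMU always remaps (identity or passthrough domains are not modeled), and the overhead figure counts only memory traffic, as in the worked example.

```c
#include <stdbool.h>
#include <stdint.h>

enum map_path { MAP_DIRECT, MAP_IOMMU, MAP_BOUNCE };

/* Pick the mapping path for a buffer of size bytes (size > 0) at phys. */
static inline enum map_path choose_map_path(uint64_t phys, uint64_t size,
                                            uint64_t dma_mask, bool has_iommu)
{
    uint64_t end = phys + size - 1;

    if (has_iommu)
        return MAP_IOMMU;           /* IOVA chosen to fit the device's mask */
    if (end >= phys && end <= dma_mask)
        return MAP_DIRECT;          /* device can address the buffer itself */
    return MAP_BOUNCE;              /* copy through low memory */
}

/*
 * Memory traffic caused by the extra copy, as a percentage of the given
 * memory bandwidth. memcpy reads and writes every payload byte once.
 */
static inline double bounce_overhead_pct(double link_gbps, double mem_bw_gbytes)
{
    double payload_gbytes = link_gbps / 8.0;

    return 100.0 * (2.0 * payload_gbytes) / mem_bw_gbytes;
}
```

With a 32-bit mask and no IOMMU, a buffer at 8 GB selects `MAP_BOUNCE`. `bounce_overhead_pct(10, 50)` reproduces the 5% figure, and the same call with 100 Gbps gives 50%, which shows how quickly bouncing stops being affordable.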

DMA Attacks

Thunderbolt DMA Attack (Thunderspy, CVE-2020-13783 through CVE-2020-13788): Thunderbolt 3/4 ports allow external PCIe devices with DMA access. An attacker with physical access for 5 minutes can:
1. Plug in a hardware tool
2. Reprogram the Thunderbolt controller firmware to disable security levels
3. Gain full DMA access to the host machine's RAM

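Mitigation: Enable Kernel DMA Protection in the firmware setup, which requires the IOMMU to be active before Thunderbolt devices are enumerated. On Linux the IOMMU must also be enabled by the kernel (for example with intel_iommu=on on the kernel command line).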

PCIe DMA Attack via ExpressCard / M.2: Similar to Thunderbolt, any slot providing PCIe lanes (ExpressCard, M.2, mini-PCIe) provides DMA to an inserted device. IOMMU is the only defense.

Rogue NIC firmware: A compromised NIC firmware can DMA to arbitrary physical addresses. Without IOMMU, NIC compromise = system compromise. With IOMMU, the NIC's IOMMU domain only contains the driver's registered DMA buffers.

Historical Context

DMA was present in early computers: IBM's 7090 (1959) had dedicated channels for I/O. The Intel 8237 DMA controller chip (1975) was standard in IBM PC-compatible machines, limited to 24-bit physical addressing (16 MB). As RAM grew beyond this, the ISA DMA limitation required bounce buffers. PCI (1993) brought 32-bit DMA addressing. PCI-X and PCIe added 64-bit DMA addressing. The IOMMU concept dates to IBM's mainframes (S/360 channel programs, 1964). Intel VT-d appeared in Xeon 5100 series (Woodcrest, 2006). AMD-Vi appeared in Barcelona (2007). Linux gained Intel VT-d support in 2.6.24; the shared drivers/iommu/ directory followed in Linux 3.1. VFIO (for device passthrough) was added in Linux 3.6 (2012). The Thunderspy disclosure (2020) brought widespread attention to pre-boot DMA attacks.

Production Examples

SR-IOV for cloud networking: Cloud providers (AWS, GCP, Azure) use SR-IOV (Single Root I/O Virtualization) with IOMMU for NIC passthrough. Each VM gets a Virtual Function (VF) directly attached. AWS ENA (Elastic Network Adapter) uses SR-IOV + IOMMU to deliver 100 Gbps networking to VMs with near-native performance.

GPU IOMMU and CUDA: When an IOMMU is active, host memory that the GPU accesses by DMA is mapped through it by the NVIDIA kernel driver. cudaHostAlloc() with cudaHostAllocMapped allocates pinned host memory and maps it into the GPU's address space; on IOMMU-enabled systems that involves an IOVA mapping for the GPU. This enables zero-copy access: the GPU can DMA directly to and from a CPU-accessible buffer without cudaMemcpy.

NVMe driver DMA optimization: NVMe commands describe their data buffers either with PRPs (Physical Region Pages) or with SGLs (Scatter Gather Lists). The NVMe driver builds a PRP list or SGL from the DMA addresses of the request's pages, and the device DMAs directly to them. With IOMMU, these addresses are IOVAs and every access is validated; without IOMMU, the device can DMA to whatever address a corrupted PRP or SGL entry contains.

Debugging Notes

# Check if IOMMU is enabled
dmesg | grep -E "IOMMU|DMAR|AMD-Vi|iommu"
# Should see: "IOMMU: hardware identity mapping for device"
#             "Intel(R) Virtualization Technology for Directed I/O"

# IOMMU groups (devices in the same IOMMU protection domain)
ls /sys/kernel/iommu_groups/
# Each group = one VFIO domain / isolation unit

# Check if a device is in its own IOMMU group (needed for passthrough)
ls /sys/kernel/iommu_groups/23/devices/

# DMAR fault log
dmesg | grep DMAR | grep fault
# Example: DMAR:[DMA Write] Request device [02:00.0] fault addr ffc00000 [fault reason 05]

# Check DMA addressing capability of a device
cat /sys/class/net/eth0/device/dma_mask_bits  # e.g., 64

# Bounce buffer (SWIOTLB) activity
dmesg | grep swiotlb
# "software IO TLB: mapped [mem 0x...] total 64 MB"
cat /sys/kernel/debug/swiotlb/io_tlb_used  # slabs currently in use (needs debugfs mounted)
cat /sys/kernel/iommu_groups/*/type  # DMA, DMA-FQ, or identity

# IOMMU units registered with the kernel
ls /sys/class/iommu/
cat /sys/class/iommu/dmar*/intel-iommu/cap  # Intel only: capability register of each DMAR unit

# VFIO device status
ls /dev/vfio/
lspci -nnk -s 01:00.0  # expect "Kernel driver in use: vfio-pci"

# Thunderbolt DMA protection status
cat /sys/bus/thunderbolt/devices/domain*/security
# Options: none (vulnerable), user, secure, dponly, usbonly

Security Implications

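IOMMU bypass via PCIe P2P (Peer-to-Peer) DMA: Two PCIe devices behind the same switch (or, on some platforms, the same root complex) can DMA to each other's BARs without the transaction ever reaching the IOMMU, unless ACS redirects it upstream. Linux enables ACS redirection where the hardware supports it, and its pci_p2pdma subsystem (added in Linux 4.20) only sets up P2P transfers that drivers explicitly request. Without such redirection, a compromised GPU could DMA into a NIC's registers via P2P.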

ACS (Access Control Services) and IOMMU grouping: PCIe ACS prevents peer-to-peer DMA between devices in the same switch. Without ACS, devices in the same PCIe switch share an IOMMU group, meaning you can't individually isolate them. Many PCIe switches and CPU root complexes lack ACS — making true per-device isolation impossible on those platforms.

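DMA coherency pitfalls: On x86-64, DMA is cache-coherent (no explicit cache flush needed), but coherence is not ordering. If a device DMA-writes a page and the CPU reads it without a synchronization point (for example, checking a completion flag before reading the payload, with the appropriate barrier), the CPU can observe partially written data. The Linux DMA API handles buffer ownership transfer with the dma_sync_*() functions, and drivers order descriptor accesses with dma_rmb()/dma_wmb().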

IOMMU off = system-level trust in devices: Without IOMMU active, every PCIe device is fully trusted — any driver vulnerability, any firmware compromise, or any physical access equals total system compromise. For security-critical systems (HSMs, classified workstations), IOMMU must be enabled and verified active.

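KASLR bypass via DMA: KASLR only hides where the kernel sits; it does not protect the memory itself. A device with unrestricted DMA read access can scan physical RAM for known kernel structures and recover the randomized layout, so KASLR offers no protection against a DMA-capable attacker unless the IOMMU restricts the device.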

Performance Implications

IOMMU TLB (IOTLB) overhead: The IOMMU has its own TLB (IOTLB). A DMA operation for a new IOVA misses the IOTLB → IOMMU page table walk → latency added to DMA. For high-frequency small DMA operations (10M DMA ops/sec), IOTLB thrashing can add 5–15% overhead.
dma_map_single() on x86-64 without bounce buffers: With IOMMU active and IOVA = PA (identity mapping), dma_map_single() is nearly free (just flush cache on non-coherent arches). On x86-64 with coherent DMA, it's ~20 ns.
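SWIOTLB (bounce buffer) overhead: Each DMA transfer requires an extra memcpy. At 100 Gbps that is 12.5 GB/s of payload to copy, ten times the 10 Gbps case worked through in the bounce buffer section, which makes it a large share of a CPU's memory bandwidth rather than a rounding error. High-end NVMe drives can be limited by bounce buffer bandwidth.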
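IOTLB invalidation: IOMMU domains belong to devices, not to vCPUs, so they are not switched when a guest is scheduled. The recurring cost is invalidating IOTLB entries when mappings are torn down (a write to the IOMMU's invalidation interface). With high map/unmap rates this can be measurable, which is why non-strict mode batches invalidations.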
DMA for huge pages: When DMA buffers are 2 MB aligned and physically contiguous, the IOMMU can map them with large pages and the IOTLB needs fewer entries (one 2 MB entry vs 512 × 4 KB entries), reducing IOTLB pressure. Whether a given allocation is actually mapped this way depends on the IOMMU driver and the page sizes the hardware supports; dma_alloc_coherent() offers no flag that guarantees it.
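The 512-to-1 ratio for huge pages is a page-count calculation that also shows why alignment matters. Below is a minimal sketch; `iotlb_entries()` is a hypothetical helper that counts how many translations of a given page size a buffer touches, assuming one IOTLB entry per page.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of pages of page_size bytes touched by [iova, iova + len).
 * page_size must be a power of two; one IOTLB entry per page is assumed.
 */
static inline uint64_t iotlb_entries(uint64_t iova, uint64_t len,
                                     uint64_t page_size)
{
    assert(len > 0);
    assert(page_size > 0 && (page_size & (page_size - 1)) == 0);

    uint64_t first = iova / page_size;
    uint64_t last = (iova + len - 1) / page_size;

    return last - first + 1;
}
```

A 2 MB buffer at IOVA 0x200000 needs 512 entries with 4 KB pages and one entry with 2 MB pages. Shift the same buffer to IOVA 0x1000 and it straddles two 2 MB pages, which is the reason the alignment requirement is not optional.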

Failure Modes and Real Incidents

NVMe hang due to IOMMU DMAR fault: A production Linux server (kernel 5.10) experienced NVMe drive hangs after a kernel update that enabled strict IOMMU mode. The NVMe driver's SGL (Scatter-Gather List) construction had a bug that created an unmapped IOVA entry. With IOMMU off, this caused silent data corruption (the device DMA'd to a wrong physical page). With IOMMU on, the DMAR fault interrupted the I/O, causing NVMe command timeout → device offline. The IOMMU turned a silent bug into a loud one — still painful, but better.

VMware guest-to-host escape (CVE-2009-1244): In early hypervisor deployments, a bug in an emulated display device let a guest VM reach host memory through the hypervisor's own device model. Passing a device through with IOMMU isolation removes the emulation layer for that device and confines its DMA to the guest's memory, although the emulated devices that remain are still attack surface.

Thunderspy (Thunderbolt DMA attack, 2020): Björn Ruytenberg demonstrated 7 vulnerabilities (CVE-2020-13783 through CVE-2020-13789) in Thunderbolt 3 that allowed completely bypassing BIOS/OS security levels. Physical access for 5 minutes allowed reading/writing arbitrary RAM, extracting encryption keys, and modifying kernel code — even on locked, sleeping machines. Only Kernel DMA Protection (IOMMU active before enumeration) fully mitigated this.

AWS Nitro DMA isolation: AWS's Nitro hypervisor uses IOMMU-based DMA isolation to ensure that the host Nitro card (which handles networking, storage, and security) cannot access VM memory, and vice versa. This was explicitly engineered as a security boundary. Each VM's PCIe virtual functions (NIC, NVMe) are assigned dedicated IOMMU domains with access only to that VM's physical memory pages.

Modern Usage

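VFIO mediated devices (mdev): GPU SR-IOV and Intel GVT-g allow a single physical GPU to be shared among multiple VMs. With SR-IOV, each virtual function has its own requester ID and therefore its own IOMMU domain, limited to that VM's memory. With mediated devices such as GVT-g, the host driver mediates access and enforces the isolation in software on top of the parent device's IOMMU domain.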
DMA-BUF: The DMA-BUF (DMA Buffer Sharing) framework allows DMA buffers to be shared between drivers (e.g., between a V4L2 camera driver and a DRM display driver) without copying. Each driver imports the DMA-BUF and maps it into its own IOMMU domain for access.
DPDK with VFIO: DPDK (Data Plane Development Kit) uses VFIO for user-space NIC drivers. The NIC is passed directly to a DPDK application, which programs it with its own ring buffers. IOMMU ensures the NIC can only access DPDK's registered buffers.
P2PDMA (Peer-to-Peer DMA): Storage and GPU can exchange data via P2P DMA (NVMe Direct to GPU). The pci_p2pdma API (drivers/pci/p2pdma.c, in mainline since Linux 4.20) enables this; when transfers are routed through the root complex, the IOMMU enforces that only explicitly mapped peer addresses are accessible.

Future Directions

IOMMU for ARM (SMMU): ARM System Memory Management Unit (SMMU v2, v3) is the ARM equivalent of Intel VT-d. Used in mobile SoCs, server ARM CPUs (Ampere, AWS Graviton), and embedded systems. Graviton3 uses SMMU for both DMA isolation and GPU/NIC passthrough.
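Confidential Computing (TDX, SEV-SNP): Intel TDX (Trust Domain Extensions) and AMD SEV-SNP protect confidential VMs from the hypervisor, and that protection has to cover DMA as well. Devices programmed by the untrusted host must not be able to DMA into private guest memory, so device accesses are checked against the guest's memory ownership in addition to the IOMMU page tables; the encrypted guest memory stays inaccessible even if the hypervisor programs the IOMMU directly.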
CXL and IOMMU: CXL (Compute Express Link) 2.0/3.0 devices appear as NUMA nodes with memory accessible via PCIe-derived protocols. The IOMMU must handle CXL device DMA to CXL-attached memory, which requires new IOMMU page table types.
Remote IOMMU operations: For disaggregated computing, future standards may allow IOMMU domains to span multiple physical machines connected over CXL or network fabrics.

Exercises

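Write a kernel module that calls dma_alloc_coherent() for 1 MB and logs the CPU virtual address, its physical address (virt_to_phys(), valid for direct-mapped memory on x86-64), and the DMA handle. On a system without IOMMU: confirm the physical address and the DMA handle may be equal. With IOMMU translation active: confirm they differ.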
Enable IOMMU (intel_iommu=on) in a VM. Observe dmesg | grep -i iommu. Note the IOMMU groups for each PCIe device. Try to understand why certain devices are in the same group (ACS-related).
Implement a streaming DMA cycle in a minimal kernel module: allocate a buffer, call dma_map_single(), simulate device access (just read the DMA address), call dma_sync_single_for_cpu(), read the buffer, call dma_unmap_single().
Set up VFIO passthrough for a PCIe device (e.g., a USB controller) to a QEMU VM. Verify inside the VM that the device works. Monitor IOMMU fault logs on the host during VM operation.
Use perf stat -e dTLB-loads,dTLB-load-misses on a DMA-intensive workload. These events count CPU TLB activity only, not IOTLB misses; check whether software-measured DMA latency increases correlate with them at all, and explain what you find.
Write a user-space program that reads /sys/kernel/iommu_groups/*/devices/ to enumerate all IOMMU groups on the system. Report: total groups, group sizes, devices with their own group (passthrough-capable) vs shared groups.

References

drivers/iommu/intel/iommu.c — Intel VT-d driver
drivers/iommu/amd/iommu.c — AMD-Vi driver
kernel/dma/mapping.c — DMA API implementation
kernel/dma/swiotlb.c — Software IOMMU (bounce buffers)
drivers/iommu/dma-iommu.c — IOMMU-backed DMA mapping
drivers/vfio/vfio.c — VFIO framework
include/linux/dma-mapping.h — DMA API declarations
include/linux/iommu.h — IOMMU framework API
Intel VT-d Specification: "Intel Virtualization Technology for Directed I/O Architecture Specification"
AMD IOMMU Architecture: "AMD I/O Virtualization Technology (IOMMU) Specification"
Linux DMA API documentation: Documentation/core-api/dma-api.rst
Thunderspy (CVE-2020-13783): https://thunderspy.io/
Kernel DMA Protection: Documentation/admin-guide/kernel-parameters.txt (iommu=)
LWN: "An introduction to IOMMU infrastructure in the Linux kernel" — https://lwn.net/Articles/747551/
LWN: "VFIO and userspace drivers" — https://lwn.net/Articles/474088/