06 — PCIe Driver Model

Technical Overview

PCI Express (PCIe) is the dominant high-speed interconnect for system expansion — GPUs, NVMe SSDs, network cards, FPGAs, and custom accelerators all connect via PCIe. Understanding PCIe from a driver writer's perspective means understanding the physical and logical architecture, how the kernel enumerates devices, how drivers claim and configure them, and how to extract maximum performance through MSI-X, DMA, and multi-queue architectures.

PCIe is a point-to-point serial link. Unlike its predecessor PCI (a shared parallel bus), each PCIe device connects to a root complex or switch via a dedicated link. The protocol is packet-based, with Transaction Layer Packets (TLPs) carrying data, configuration requests, completions, and messages.

Prerequisites

Linux driver model (01-driver-model.md)
DMA in drivers (05-dma-in-drivers.md)
Interrupt handling (04-interrupt-handling.md)
Basic understanding of memory-mapped I/O

PCIe Physical Architecture

CPU ← QPI/UPI/IF → Root Complex
                        │
                   PCIe Root Port
                        │ x16 link (Gen4 ≈ 32 GB/s per direction, ≈ 64 GB/s bidirectional)
                   PCIe Switch (optional)
                   ├─── Downstream Port ── GPU (x16)
                   ├─── Downstream Port ── NVMe SSD (x4)
                   └─── Downstream Port ── NIC (x8)

Link width and speed:
  x1  Gen3: 1 lane  × 8 GT/s  = ~1 GB/s
  x4  Gen3: 4 lanes × 8 GT/s  = ~4 GB/s
  x4  Gen4: 4 lanes × 16 GT/s = ~8 GB/s
  x16 Gen4: 16 lanes × 16 GT/s = ~32 GB/s  (high-end GPU)
  x16 Gen5: 16 lanes × 32 GT/s = ~64 GB/s  (PCIe 5.0 GPU/CXL)

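Transfer rate note: The GT/s figures are raw line rates. Gen3 and later use 128b/130b encoding (about 1.5% overhead); Gen1 and Gen2 use 8b/10b (20% overhead).
Effective bandwidth per direction (GB/s) = raw GT/s × (128/130) × lanes ÷ 8 bits/byte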
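The formula above can be checked with a few lines of user-space C. This is a minimal sketch, not kernel code: the function names are made up for illustration, and only line-encoding overhead is modelled (TLP and DLLP protocol overhead is ignored).

```c
#include <assert.h>

/* Per-lane raw signalling rate in GT/s for PCIe generations 1-5. */
static double pcie_raw_gts(int gen)
{
    static const double rate[] = { 0.0, 2.5, 5.0, 8.0, 16.0, 32.0 };

    assert(gen >= 1 && gen <= 5);
    return rate[gen];
}

/* Usable bandwidth per direction in GB/s after line encoding. */
static double pcie_effective_gbytes(int gen, int lanes)
{
    /* Gen1/Gen2 use 8b/10b; Gen3 through Gen5 use 128b/130b. */
    double encoding = (gen <= 2) ? 8.0 / 10.0 : 128.0 / 130.0;

    return pcie_raw_gts(gen) * encoding * lanes / 8.0;   /* 8 bits per byte */
}
```

Under these assumptions, pcie_effective_gbytes(3, 4) comes out just under 4 GB/s and pcie_effective_gbytes(4, 16) just under 32 GB/s, which is why the table above uses "~".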

PCIe Configuration Space

Every PCIe function has a configuration space — 256 bytes for legacy PCI, 4096 bytes for PCIe extended configuration. This space is used at boot to identify the device, allocate Base Address Registers (BARs), and configure capabilities.
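The 4096-byte figure shows up directly in how memory-mapped configuration access is addressed. The sketch below computes the byte offset of a register inside an ECAM (Enhanced Configuration Access Mechanism) window; ECAM is not discussed in the text above, and the function name is hypothetical. On ACPI systems the window's base address comes from the MCFG table.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of a config register inside an ECAM window:
 * bus[27:20] | device[19:15] | function[14:12] | register[11:0].
 * Each function therefore owns exactly 4096 bytes. */
static uint64_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t reg)
{
    assert(dev < 32 && fn < 8 && reg < 4096);
    return ((uint64_t)bus << 20) |
           ((uint64_t)dev << 15) |
           ((uint64_t)fn  << 12) |
           reg;
}
```

For example, the BAR0 register (offset 0x10) of device 03:01.1 sits at offset 0x309010 from the ECAM base.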

Offset 0x00:  Vendor ID (16-bit) / Device ID (16-bit)
Offset 0x04:  Command / Status
Offset 0x08:  Revision ID / Class Code (3 bytes)
Offset 0x0C:  Cache Line Size / Latency Timer / Header Type / BIST
Offset 0x10:  BAR0 (Base Address Register 0)
Offset 0x14:  BAR1
Offset 0x18:  BAR2
Offset 0x1C:  BAR3
Offset 0x20:  BAR4
Offset 0x24:  BAR5
Offset 0x28:  Cardbus CIS Pointer
Offset 0x2C:  Subsystem Vendor ID / Subsystem ID
Offset 0x30:  Expansion ROM Base Address
Offset 0x34:  Capabilities Pointer (linked list of capabilities)
Offset 0x3C:  Interrupt Line / Interrupt Pin / Min Grant / Max Latency

Capabilities linked list (starting at Capabilities Pointer, within the first 256 bytes):
  [PCIe Capability: type, version, link status, slot]
  → [MSI Capability: enable, vectors, address, data]
  → [MSI-X Capability: table size, table offset, pending bit array]
  → [Power Management: D0/D1/D2/D3 support]
  → NULL

Extended capabilities (separate linked list starting at offset 0x100):
  [AER (Advanced Error Reporting): correctable/uncorrectable error status]
  → NULL
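Walking the capability list is a short loop. The following is a user-space sketch that operates on a 256-byte snapshot of configuration space (for example, bytes read from the sysfs config file); inside the kernel, drivers would call pci_find_capability() instead. The macro names mirror the kernel's but are defined locally here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS           0x06  /* Status register (low byte) */
#define PCI_STATUS_CAP_LIST  0x10  /* bit 4: capability list present */
#define PCI_CAPABILITY_LIST  0x34  /* Capabilities Pointer */
#define PCI_CAP_ID_PM        0x01
#define PCI_CAP_ID_MSI       0x05
#define PCI_CAP_ID_EXP       0x10  /* PCI Express */
#define PCI_CAP_ID_MSIX      0x11

/* Return the offset of capability `id`, or 0 if it is absent. */
static uint8_t find_capability(const uint8_t *cfg, uint8_t id)
{
    uint8_t pos;
    int ttl = 48;   /* bound the walk in case the list is corrupt or loops */

    assert(cfg != NULL);
    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return 0;

    pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;   /* low two bits are reserved */
    while (pos >= 0x40 && ttl-- > 0) {
        if (cfg[pos] == id)                  /* byte 0: capability ID */
            return pos;
        pos = cfg[pos + 1] & 0xfc;           /* byte 1: next pointer */
    }
    return 0;
}
```

The loop stops at a next pointer of zero (the NULL in the diagram) or at any pointer below 0x40, which would point back into the standard header.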

BARs define regions of device memory (MMIO) that the BIOS/firmware allocates in the CPU's physical address space. A driver accesses device registers through these MMIO regions.
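The low bits of a BAR encode its type, and its size is discovered by writing all ones and reading back which address bits stick. A minimal sketch of that decoding, assuming a 32-bit BAR (struct bar_info and decode_bar32 are illustrative names, not kernel API):

```c
#include <stdbool.h>
#include <stdint.h>

struct bar_info {
    bool is_io;          /* I/O port BAR rather than memory */
    bool is_64bit;       /* memory BAR that also consumes the next BAR slot */
    bool prefetchable;
    uint32_t size;       /* bytes; 0 means unimplemented or not decodable here */
};

/* Decode a BAR from its original value and the value read back after
 * writing 0xFFFFFFFF. The upper dword of a 64-bit BAR is not handled. */
static struct bar_info decode_bar32(uint32_t orig, uint32_t readback)
{
    struct bar_info info = { 0 };
    uint32_t mask;

    info.is_io = (orig & 0x1) != 0;                  /* bit 0: space indicator */
    if (info.is_io) {
        mask = readback & ~0x3u;                     /* bits 1:0 are flags */
    } else {
        info.is_64bit = ((orig >> 1) & 0x3) == 0x2;  /* bits 2:1: type */
        info.prefetchable = (orig & 0x8) != 0;       /* bit 3 */
        mask = readback & ~0xfu;                     /* bits 3:0 are flags */
    }
    /* The lowest writable address bit equals the region size. */
    info.size = mask & (~mask + 1u);
    return info;
}
```

A read-back of 0xFFFF0000 therefore means a 64 KiB region. Real code must restore the original BAR value after probing and should disable decoding while the BAR holds all ones.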

Linux PCIe Driver Structure

A PCIe driver registers with the PCI bus using struct pci_driver:

#include <linux/pci.h>

/* Device ID table — used for matching and module autoloading */
static const struct pci_device_id my_pci_ids[] = {
    { PCI_DEVICE(0x1234, 0x5678) },          /* vendor 0x1234, device 0x5678 */
    { PCI_DEVICE(0x1234, 0x5679) },          /* same vendor, different model */
    { PCI_VDEVICE(INTEL, 0x1593) },          /* Intel device using PCI_VENDOR_ID_INTEL */
    { 0 }                                     /* terminating entry */
};
MODULE_DEVICE_TABLE(pci, my_pci_ids);

static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    struct my_device *dev;
    int ret;

    /* Allocate driver-private data */
    dev = devm_kzalloc(&pdev->dev, sizeof(*dev), GFP_KERNEL);
    if (!dev) return -ENOMEM;
    pci_set_drvdata(pdev, dev);
    dev->pdev = pdev;

    /* 1. Enable the device */
    ret = pci_enable_device(pdev);
    if (ret) { dev_err(&pdev->dev, "enable failed\n"); return ret; }

    /* 2. Set bus master (enable DMA) */
    pci_set_master(pdev);

    /* 3. Request exclusive access to BAR regions */
    ret = pci_request_regions(pdev, "my_device");
    if (ret) goto err_disable;

    /* 4. Map BAR0 into kernel virtual address space */
    dev->base = pci_ioremap_bar(pdev, 0);   /* BAR 0 */
    if (!dev->base) { ret = -EIO; goto err_release; }

    /* 5. Set DMA mask */
    ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
    if (ret) { ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32)); }
    if (ret) goto err_unmap;

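    /* 6. Set up MSI-X interrupts (may return fewer vectors than requested) */
    ret = pci_alloc_irq_vectors(pdev, 1, MY_MAX_VECTORS, PCI_IRQ_MSIX);
    if (ret < 0) goto err_unmap;
    dev->num_vectors = ret;

    for (int i = 0; i < dev->num_vectors; i++) {
        ret = request_irq(pci_irq_vector(pdev, i), my_irq_handler, 0,
                          "my_device", &dev->queues[i]);
        if (ret) {
            /* Release the IRQs requested so far, then the vectors */
            while (--i >= 0)
                free_irq(pci_irq_vector(pdev, i), &dev->queues[i]);
            goto err_vectors;
        }
    }

    /* 7. Initialize hardware */
    my_hw_init(dev);

    /* 8. Register with appropriate subsystem (block, net, etc.) */
    ret = my_register_device(dev);
    if (ret) goto err_irqs;

    return 0;

err_irqs:
    /* Every IRQ must be freed before its vector is released */
    for (int i = 0; i < dev->num_vectors; i++)
        free_irq(pci_irq_vector(pdev, i), &dev->queues[i]);
err_vectors:
    pci_free_irq_vectors(pdev);
err_unmap:
    iounmap(dev->base);
err_release:
    pci_release_regions(pdev);
err_disable:
    pci_disable_device(pdev);
    return ret;
}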

static void my_remove(struct pci_dev *pdev)
{
    struct my_device *dev = pci_get_drvdata(pdev);

    my_unregister_device(dev);

    for (int i = 0; i < dev->num_vectors; i++)
        free_irq(pci_irq_vector(pdev, i), &dev->queues[i]);
    pci_free_irq_vectors(pdev);

    iounmap(dev->base);
    pci_release_regions(pdev);
    pci_disable_device(pdev);
}

static struct pci_driver my_pci_driver = {
    .name      = "my_device",
    .id_table  = my_pci_ids,
    .probe     = my_probe,
    .remove    = my_remove,
    .shutdown  = my_shutdown,     /* called at system shutdown */
};

module_pci_driver(my_pci_driver);   /* replaces module_init/module_exit */

PCIe Device Probing at Boot

The x86 firmware (BIOS/UEFI) performs initial PCIe enumeration during POST:
1. Traverses the PCIe hierarchy depth-first from the root complex
2. Reads Vendor ID / Device ID from each device's configuration space
3. Assigns BARs: reads BAR size (write all-1s, read back masked value), allocates address range
4. Programs legacy interrupt routing; MSI/MSI-X is normally configured later by the OS
5. Describes host bridges, the configuration-space window, and interrupt routing in ACPI tables

The Linux kernel re-enumerates at boot (the exact entry point is architecture-specific; with ACPI the scan starts from each host bridge):
1. Reads configuration space to create struct pci_dev for each device
2. Sets up PCI domains, buses, and topology
3. Matches each device against the id_table of every registered pci_driver (pci_match_device())
4. Calls driver->probe() for each match
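The matching step can be sketched in user space as follows. The structures are simplified stand-ins for struct pci_device_id and the IDs read from a device, and ANY_ID plays the role of the kernel's PCI_ANY_ID; none of these names are kernel API.

```c
#include <stddef.h>
#include <stdint.h>

#define ANY_ID 0xffffffffu    /* wildcard, stand-in for PCI_ANY_ID */

struct id_entry {             /* simplified struct pci_device_id */
    uint32_t vendor, device;
    uint32_t subvendor, subdevice;
    uint32_t class_code, class_mask;
};

struct dev_ids {              /* identity read from config space */
    uint32_t vendor, device, subvendor, subdevice, class_code;
};

static int field_matches(uint32_t want, uint32_t have)
{
    return want == ANY_ID || want == have;
}

/* Return the first matching entry, or NULL. The table ends with an
 * all-zero entry, like the { 0 } terminator in a driver's id_table. */
static const struct id_entry *match_id(const struct id_entry *tbl,
                                       const struct dev_ids *d)
{
    for (; tbl->vendor || tbl->subvendor || tbl->class_mask; tbl++) {
        if (field_matches(tbl->vendor, d->vendor) &&
            field_matches(tbl->device, d->device) &&
            field_matches(tbl->subvendor, d->subvendor) &&
            field_matches(tbl->subdevice, d->subdevice) &&
            !((tbl->class_code ^ d->class_code) & tbl->class_mask))
            return tbl;
    }
    return NULL;
}
```

An entry built the way PCI_DEVICE(vendor, device) builds it would have both subsystem fields set to the wildcard and a class mask of zero, so only vendor and device are compared.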

# View PCIe topology
lspci -tv

# View full device info including BARs, capabilities, and IRQ
lspci -vvv -s 03:00.0

# Read raw configuration space
xxd /sys/bus/pci/devices/0000:03:00.0/config | head -16

MMIO Register Access

Once a BAR is mapped via pci_ioremap_bar, drivers read/write device registers through the returned kernel virtual address:

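/* readl()/writel() already include the ordering barriers most drivers need */

/* Read 32-bit register */
u32 status = readl(dev->base + STATUS_REG);

/* Write 32-bit register */
writel(CMD_START, dev->base + COMMAND_REG);

/* Ensure write has reached the device (read-back pattern) */
writel(value, dev->base + REG);
(void)readl(dev->base + REG);   /* flush: the read cannot complete until the posted write has arrived */

/* 64-bit value split across two 32-bit registers (two separate reads, so not atomic) */
u32 lo32 = readl(dev->base + ADDR_LO);
u32 hi32 = readl(dev->base + ADDR_HI);
u64 addr = ((u64)hi32 << 32) | lo32;
/* On 64-bit kernels readq() does this in one access; lo_hi_readq() from
 * <linux/io-64-nonatomic-lo-hi.h> is the portable split-read helper. */

readl/writel compile to MMIO load/store instructions wrapped in the compiler and memory barriers the architecture requires, and they convert from the device's little-endian layout. Never cast the BAR address to a pointer and dereference it directly: doing so bypasses these ordering guarantees and lets the compiler reorder, merge, or drop the accesses.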
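When a 64-bit value that changes underneath the driver (a free-running counter, for instance) is exposed as two 32-bit registers, a split read can tear. One common workaround is to re-read the high half until it is stable. The sketch below simulates this in user space: read32_fn stands in for readl(), and fake_counter is an invented device model that advances on every read. Whether a given device needs this depends on its register semantics.

```c
#include <stdint.h>

/* Hypothetical accessor: reads one 32-bit register at a byte offset. */
typedef uint32_t (*read32_fn)(void *ctx, uint32_t off);

/* Read a 64-bit value from two 32-bit registers without tearing. */
static uint64_t read64_stable(read32_fn rd, void *ctx,
                              uint32_t lo_off, uint32_t hi_off)
{
    uint32_t hi = rd(ctx, hi_off);
    uint32_t lo, hi2;

    for (;;) {
        lo = rd(ctx, lo_off);
        hi2 = rd(ctx, hi_off);
        if (hi2 == hi)
            break;
        hi = hi2;   /* low half wrapped between the reads; try again */
    }
    return ((uint64_t)hi << 32) | lo;
}

/* Simulated device: a counter that advances by `step` on every read.
 * Offset 0 returns the low half, any other offset the high half. */
struct fake_counter { uint64_t value; uint64_t step; };

static uint32_t fake_read32(void *ctx, uint32_t off)
{
    struct fake_counter *c = ctx;
    uint32_t v = (off == 0) ? (uint32_t)c->value : (uint32_t)(c->value >> 32);

    c->value += c->step;
    return v;
}
```

Starting the simulated counter just below a 32-bit wrap exercises the retry path: the function returns a value the counter actually held, never a mix of an old high half and a new low half.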

MSI-X for Multi-Queue Architectures

MSI-X allows up to 2048 interrupt vectors per function, each with its own message address and data, so each vector can be steered to a specific CPU. NVMe's architecture pairs one MSI-X vector with each submission/completion queue pair:

NVMe SSD (32 queues for 32-core system):

  CPU 0 ─── Queue 0 ─── MSI-X Vector 0 ─── CPU 0 interrupt
  CPU 1 ─── Queue 1 ─── MSI-X Vector 1 ─── CPU 1 interrupt
  ...
  CPU 31 ── Queue 31 ── MSI-X Vector 31 ── CPU 31 interrupt
  Any CPU ─ Admin Queue ─ MSI-X Vector 32 ─ CPU 0 interrupt

Result: No spinlock contention between CPUs (each has its own queue)
        CPU 0's I/O completions interrupt CPU 0 (data in L1/L2 cache)
        1M IOPS with minimal cross-CPU coordination

The MSI-X table lives in a BAR, not in configuration space. Each entry is 16 bytes: message address (64-bit), message data (32-bit), vector control (32-bit, bit 0 = masked):

/* MSI-X table entry as laid out in the device BAR (MMIO-mapped).
 * Not the kernel's struct msix_entry, which is a vector/entry index pair. */
struct msix_table_entry {
    __le32 msg_addr_lo;   /* message address, low 32 bits (local APIC address on x86) */
    __le32 msg_addr_hi;   /* message address, high 32 bits */
    __le32 msg_data;      /* message data (encodes the vector on x86) */
    __le32 vector_ctrl;   /* bit 0: mask this vector */
};
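Which BAR holds the table, and at what offset, is recorded in the MSI-X capability in configuration space. A user-space sketch of decoding it from a config-space snapshot follows; the struct and function names are illustrative, and `cap` is the capability's offset as found by walking the capability list.

```c
#include <stdint.h>

struct msix_layout {
    unsigned table_size;    /* number of vectors, 1..2048 */
    int      enabled;       /* MSI-X Enable bit */
    int      func_masked;   /* Function Mask bit */
    unsigned table_bir;     /* index of the BAR holding the vector table */
    uint32_t table_offset;  /* byte offset of the table inside that BAR */
    unsigned pba_bir;       /* BAR holding the pending bit array */
    uint32_t pba_offset;
};

static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the MSI-X capability located at offset `cap`. */
static struct msix_layout decode_msix(const uint8_t *cfg, uint8_t cap)
{
    uint16_t ctrl  = le16(cfg + cap + 2);   /* Message Control */
    uint32_t table = le32(cfg + cap + 4);   /* Table Offset / BIR */
    uint32_t pba   = le32(cfg + cap + 8);   /* PBA Offset / BIR */
    struct msix_layout m;

    m.table_size   = (ctrl & 0x07ffu) + 1;  /* field is encoded as N-1 */
    m.func_masked  = (ctrl & 0x4000) != 0;
    m.enabled      = (ctrl & 0x8000) != 0;
    m.table_bir    = table & 0x7;
    m.table_offset = table & ~0x7u;
    m.pba_bir      = pba & 0x7;
    m.pba_offset   = pba & ~0x7u;
    return m;
}
```

The 11-bit table size field is where the 2048-vector limit comes from. With 16-byte entries, vector i lives at table_offset + 16 * i inside the BAR selected by table_bir.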

SR-IOV: Single Root I/O Virtualization

SR-IOV allows a single physical PCIe device to appear as multiple separate functions to the OS or to VMs. The physical function (PF) is the full-featured device controlled by the host. Virtual functions (VFs) are lightweight instances that share hardware resources but have isolated data paths.

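Physical NIC (Single Physical Function):
  PF:    0000:03:00.0  host-managed, configures shared resources
  VF0:   0000:03:01.0  assigned to VM 1 (via VFIO passthrough)
  VF1:   0000:03:01.1  assigned to VM 2
  ...
  VF255: 0000:04:00.7  last VF if the device exposes 256 VFs; high-numbered VFs spill onto the next bus number

VF addresses are computed from the First VF Offset and VF Stride fields of the PF's SR-IOV capability (8 and 1 in this example). The number of VFs a device supports is reported in its TotalVFs field.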

Each VF appears as a separate PCIe device with its own:
  - Config space (limited: no BAR resizing)
  - MSI-X vectors
  - TX/RX queues
  - MAC address

/* Query maximum VFs supported before enabling */
int max_vfs = pci_sriov_get_totalvfs(pdev);
int numvfs = min(4, max_vfs);

/* Enable SR-IOV from PF driver (typically from its .sriov_configure callback) */
ret = pci_enable_sriov(pdev, numvfs);
if (ret) { dev_err(&pdev->dev, "SR-IOV failed: %d\n", ret); return ret; }

/* Disable SR-IOV */
pci_disable_sriov(pdev);

From userspace, VFs are assigned to VMs using VFIO (Virtual Function I/O):

# Bind VF to vfio-pci driver (detach from host driver)
echo "0000:03:01.0" > /sys/bus/pci/devices/0000:03:01.0/driver/unbind
echo "vfio-pci" > /sys/bus/pci/devices/0000:03:01.0/driver_override
echo "0000:03:01.0" > /sys/bus/pci/drivers/vfio-pci/bind

# In QEMU: pass through VF to VM
qemu-system-x86_64 ... -device vfio-pci,host=03:01.0

PCIe Power Management

PCIe defines device power states D0 (fully operational) through D3hot/D3cold (powered down):

State	Power	Resume latency (minimum delay before access in D0)
D0	Full	0
D1	Medium	None required by the spec (device-specific)
D2	Less	200 µs
D3hot	Minimal	10 ms (device may need re-initialization)
D3cold	Off	100 ms or more after power returns, plus full re-initialization
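The current D-state is visible in the Power Management capability: bits 1:0 of the PMCSR register, which sits 4 bytes into the capability. A minimal user-space sketch of reading and updating that field follows (the enum and function names are illustrative; kernel drivers use pci_set_power_state() instead). D3cold has no encoding because a device without power cannot be read.

```c
#include <stdint.h>

enum pci_dstate { D0 = 0, D1 = 1, D2 = 2, D3HOT = 3 };

/* PowerState field: bits 1:0 of PMCSR (PM capability offset + 4). */
static enum pci_dstate pmcsr_get_state(uint16_t pmcsr)
{
    return (enum pci_dstate)(pmcsr & 0x3);
}

/* Value to write back for a requested state. Other bits are kept as read;
 * note that PME_Status (bit 15) is write-1-to-clear, so real code should
 * decide deliberately whether to write it back. */
static uint16_t pmcsr_set_state(uint16_t pmcsr, enum pci_dstate s)
{
    return (uint16_t)((pmcsr & ~0x3u) | (unsigned)s);
}
```

For example, a PMCSR value of 0x0103 decodes to D3hot, and requesting D0 yields 0x0100.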

ASPM (Active State Power Management) reduces power during idle periods by placing PCIe links in lower power states (L0s, L1, L1 sub-states) when no transactions are pending.

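/* The PCI core already enables runtime PM for PCI devices, so the driver does
 * not call pm_runtime_enable(). In probe(), drop the usage reference the core
 * holds and allow autosuspend (userspace can also do this via power/control). */
pm_runtime_put_noidle(&pdev->dev);
pm_runtime_allow(&pdev->dev);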

/* Called when device goes idle */
static int my_runtime_suspend(struct device *dev)
{
    struct pci_dev *pdev = to_pci_dev(dev);
    /* Save device state */
    pci_save_state(pdev);
    pci_set_power_state(pdev, PCI_D3hot);
    return 0;
}

/* Called when device is needed again */
static int my_runtime_resume(struct device *dev)
{
    struct pci_dev *pdev = to_pci_dev(dev);
    pci_set_power_state(pdev, PCI_D0);
    pci_restore_state(pdev);
    /* Reinitialize hardware */
    my_hw_init(dev_get_drvdata(dev));
    return 0;
}

Historical Context

PCI (Peripheral Component Interconnect) was introduced by Intel in 1992 as a replacement for ISA. It used a 32-bit parallel bus at 33MHz (133 MB/s), later 64-bit at 66MHz (533 MB/s). PCI was a significant improvement over ISA's configuration nightmare but shared bandwidth across all devices on the bus.

PCI-X (2000) increased the bus speed to 133MHz/266MHz/533MHz for server workloads, but remained a shared parallel bus.

PCIe 1.0 was ratified in 2003, replacing the parallel bus with serial point-to-point links. The key insight: serial at higher frequency beats parallel at lower frequency once signal integrity challenges are managed. PCIe 1.0 x1 offered 250 MB/s per direction, roughly twice the 133 MB/s of 32-bit/33 MHz PCI and not shared with other devices, while x16 offered 4 GB/s with dedicated bandwidth per slot.

PCIe generation history (x16 bandwidth per direction):
- Gen1 (2003): 2.5 GT/s, x16 = 4 GB/s
- Gen2 (2007): 5.0 GT/s, x16 = 8 GB/s
- Gen3 (2010): 8.0 GT/s, x16 = 16 GB/s
- Gen4 (2017): 16.0 GT/s, x16 = 32 GB/s (AMD EPYC first among x86 server CPUs; NVMe Gen4 SSDs: 7 GB/s)
- Gen5 (2019/deployed 2022): 32.0 GT/s, x16 = 64 GB/s (Intel Sapphire Rapids, AMD Genoa)
- Gen6 (2022, deployed 2025): 64 GT/s, x16 = 128 GB/s (uses PAM4 signaling)

Production Examples

NVMe SSD driver (drivers/nvme/host/pci.c): The archetypal PCIe driver. Uses MSI-X for per-CPU queues, 64-bit DMA, blk-mq for queue management. At Gen4 x4: sequential read 7 GB/s (near the PCIe 4.0 x4 theoretical limit of 8 GB/s). High-end Gen4 drives such as the Samsung 980 Pro and WD Black SN850 come close to saturating the link.

NVIDIA A100 GPU (driver: nvidia.ko; open-source alternatives: nouveau, with nova in development): Uses PCIe Gen4 x16 (32 GB/s) for CPU↔GPU data transfer. Internally, HBM2e memory bandwidth is about 2 TB/s, roughly 60x faster than PCIe. This asymmetry makes PCIe the bottleneck whenever data must stream between host memory and the GPU; multi-GPU workloads use NVLink (600 GB/s between GPUs, bypassing PCIe) for the same reason.

Intel E810 100GbE NIC: Uses PCIe Gen4 x16. At 100Gbps, line rate is 12.5 GB/s — well within PCIe Gen4 x16's 32 GB/s. Uses 64 MSI-X vectors for 64 queue pairs, supporting RDMA (iWARP/RoCEv2). SR-IOV supports up to 64 VFs.

Debugging Notes

# Check PCIe link training status
lspci -vvv -s 03:00.0 | grep -E "(LnkSta|LnkCap)"
# LnkCap: Speed 16GT/s (Gen4), Width x4
# LnkSta: Speed 16GT/s (ok), Width x4 (ok)

# Check for PCIe errors (AER — Advanced Error Reporting)
dmesg | grep -i "aer\|pcie\|pci error"

# Access PCIe configuration space registers directly
setpci -s 03:00.0 0x04.w         # read Command register
setpci -s 03:00.0 0x04.w=0x0406  # set Memory Space Enable + Bus Master + INTx Disable (bit 10)

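# Estimate DMA traffic indirectly via memory-controller counters (Intel uncore;
# these count all DRAM accesses, not only PCIe, and event names vary by CPU)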
perf stat -e uncore_imc_0/cas_count_read/,uncore_imc_0/cas_count_write/ \
    -a sleep 1

# Monitor PCIe correctable/uncorrectable errors
cat /sys/bus/pci/devices/0000:03:00.0/aer_dev_correctable
cat /sys/bus/pci/devices/0000:03:00.0/aer_dev_fatal

Link speed negotiation failures: If LnkSta shows lower speed than LnkCap, check BIOS settings (ASPM, PCIe link speed forcing), signal integrity (try shorter cable/riser card), or driver issues (pcie_aspm=off kernel param for debugging).
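The LnkCap/LnkSta comparison can also be done on the raw register values. The sketch below decodes the two registers from the PCI Express capability (Link Capabilities at capability offset 0x0C, Link Status at 0x12) and flags a link that trained below what the port advertises. The names are illustrative, and treating the speed code as the generation number assumes the standard encodings (1 = 2.5 GT/s through 5 = 32 GT/s).

```c
#include <stdint.h>

struct link_info {
    unsigned speed_gen;   /* speed code: 1 = 2.5 GT/s ... 5 = 32 GT/s */
    unsigned width;       /* number of lanes */
};

/* Link Capabilities: bits 3:0 max speed, bits 9:4 max width. */
static struct link_info decode_lnkcap(uint32_t lnkcap)
{
    struct link_info l = { lnkcap & 0xf, (lnkcap >> 4) & 0x3f };
    return l;
}

/* Link Status: bits 3:0 current speed, bits 9:4 negotiated width. */
static struct link_info decode_lnksta(uint16_t lnksta)
{
    struct link_info l = { lnksta & 0xfu, (lnksta >> 4) & 0x3fu };
    return l;
}

/* Non-zero if the link runs slower or narrower than this port supports. */
static int link_is_degraded(uint32_t lnkcap, uint16_t lnksta)
{
    struct link_info cap = decode_lnkcap(lnkcap);
    struct link_info sta = decode_lnksta(lnksta);

    return sta.speed_gen < cap.speed_gen || sta.width < cap.width;
}
```

A link is limited by the weaker of its two ends, so a "degraded" result on one port may simply mean the link partner supports less; check LnkCap on both sides before suspecting signal integrity.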

Security Implications

DMA attacks without IOMMU: Without an IOMMU, any PCIe device (including Thunderbolt-connected peripherals) can DMA to arbitrary physical memory. Thunderclap (2019) demonstrated practical system compromise from malicious peripherals and showed that lax IOMMU use can leave a system exposed even when an IOMMU is present. Mitigation: intel_iommu=on in kernel command line.

PCIe fuzzing: Devices on PCIe can send malformed TLPs (Transaction Layer Packets) that may trigger bugs in the host's PCIe controller or kernel PCIe code. AER (Advanced Error Reporting) on enterprise servers detects and logs many of these as correctable or uncorrectable errors and can trigger link recovery.

Malicious PCIe devices in colocation: In shared data center environments, a tenant's PCIe device in a multi-tenant PCIe switch environment could potentially observe other tenants' DMA traffic in pathological switch configurations. SR-IOV's isolation guarantees require proper IOMMU configuration.

Performance Implications

NUMA topology: On multi-socket servers, PCIe Root Complexes are attached to specific CPU sockets. An NVMe drive connected to the PCIe root complex of socket 1 is reached with lower latency and higher bandwidth by CPUs on socket 1 than by CPUs on socket 0, because cross-socket PCIe access goes through the inter-socket link. lspci -vvv | grep "NUMA node" (or the numa_node file in the device's sysfs directory) shows the NUMA node for each device. Pin I/O threads to the matching NUMA node.

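PCIe TLP overhead: Each TLP carries a 12- or 16-byte header plus framing, a sequence number, and a link CRC, for roughly 20 to 28 bytes of overhead per packet. For small transfers this overhead is significant. NVMe command completions are only 16 bytes, so at 1M IOPS the completion traffic is dominated by packet overhead, not data.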
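The effect is easy to quantify. The sketch below computes the fraction of link bytes that carry payload for a single TLP; the function name is illustrative, and the 24-byte overhead used in the note afterwards is an assumed round figure, since the real value depends on header format and PCIe generation.

```c
/* Fraction of bytes on the wire that are payload for one TLP.
 * overhead_bytes covers the header and link-level framing. */
static double tlp_efficiency(unsigned payload_bytes, unsigned overhead_bytes)
{
    if (payload_bytes == 0)
        return 0.0;
    return (double)payload_bytes / (double)(payload_bytes + overhead_bytes);
}
```

Assuming 24 bytes of overhead, a 16-byte completion entry is only 40% efficient, while a 256-byte payload reaches about 91%. This is why large Max Payload Size settings matter for bulk DMA and why small-packet workloads fall well short of the headline link bandwidth.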

Failure Modes

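Probe never runs, or fails with -ENODEV: If the device is missing from lspci, PCI enumeration did not find it and no probe call happens; check the slot and check dmesg for PCIe link training failures. If the device is listed, check that its vendor/device ID appears in the driver's id_table.
BAR not mapped correctly: pci_ioremap_bar returns NULL. The BAR was not allocated (BIOS didn't enumerate the device or ran out of address space). Try pci_assign_resource manually.
MSI-X allocation fails: pci_alloc_irq_vectors may return fewer vectors than the maximum requested, and the driver must size its queues to the returned count. Fallback to MSI or legacy IRQ happens only if the flags allow it (for example PCI_IRQ_ALL_TYPES instead of PCI_IRQ_MSIX).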
Device disappears from lspci after error: PCIe link went down (cable fault, signal integrity, hardware failure). Visible in dmesg as "PCI Express Bus Error". AER may attempt hot-reset.

Modern Usage

CXL (Compute Express Link): PCIe Gen5-based protocol for memory expansion and cache coherence. CXL 1.0/1.1 (on PCIe 5.0) allows CPU-coherent access to device memory (GPU memory visible in CPU address space). CXL 2.0 adds memory pooling. CXL 3.0 enables peer-to-peer memory sharing between accelerators.

PCIe P2P (Peer-to-Peer): Two PCIe devices can DMA directly without routing through system RAM, if the PCIe switch or root complex supports it. Used in NVMe-oF target implementations (NIC reads directly from NVMe, bypassing CPU).

Future Directions

PCIe Gen6: Uses PAM4 (4-level Pulse Amplitude Modulation) instead of NRZ, doubling data rate to 64 GT/s. x16 Gen6 = 128 GB/s per direction, which narrows but does not close the gap to NVLink-class GPU interconnects. Key enabler for coherent memory pooling (CXL 3.0).

Open PCIe Ecosystem: The CXL Consortium and OCP are standardizing PCIe-based fabric topologies for disaggregated computing, where CPUs, memory, and accelerators in different physical nodes are connected via PCIe/CXL switches with sub-microsecond latency.

Exercises

Write a PCIe driver for a virtual device using QEMU's edu device (-device edu). Implement register read/write, DMA transfer, and interrupt handling.
Enable SR-IOV on a supported NIC (mlx5, i40e). Create 4 VFs, bind them to vfio-pci, and pass one through to a KVM VM. Verify the VM sees it as a PCI device.
Use setpci to read and parse the capability chain of a PCIe NVMe drive. Identify the MSI-X table BAR and offset.
Benchmark the PCIe bandwidth between CPU and GPU using nvidia-smi dmon or CUDA memcpy benchmarks across different transfer sizes. Plot bandwidth vs transfer size to see the TLP overhead effect at small sizes.
Enable AER on a PCIe device and inject a correctable error using the kernel's aer_inject module (CONFIG_PCIEAER_INJECT) together with the aer-inject userspace tool. Observe the kernel log entry.

References

PCI Express Base Specification 5.0 — PCI-SIG
Documentation/PCI/ — Linux PCIe documentation
drivers/pci/ — PCIe core implementation
include/linux/pci.h — PCIe driver API
Jonathan Corbet, "PCI in the Linux Kernel" — LWN series
NVMe Specification 2.0 — NVM Express, Inc.
SR-IOV: PCI-SIG SR-IOV Specification
Thunderclap paper: Markettos et al., NDSS 2019
CXL Specification 3.0 — Compute Express Link Consortium