Operating Systems for New Hardware: CXL, Persistent Memory, and Beyond

Overview

The hardware landscape underlying data center computing is undergoing its most significant transformation since the shift from single-core to multi-core processors. Three concurrent revolutions—CXL (Compute Express Link) enabling disaggregated memory, persistent memory (NVDIMM/Optane) blurring the storage/memory boundary, and the rise of accelerator-centric computing with GPUs, DPUs, and IPUs—each require fundamental changes to how operating systems manage resources.

Linux has been incrementally extended to support each of these technologies, but in each case the extensions reveal the limitations of abstractions designed for a world where every socket had uniform-latency DRAM. The deeper question this section addresses is not just "how does Linux support CXL today" but "what does an OS designed for this hardware look like, and does Linux's architecture allow it to become that?"

Prerequisites

Understanding of NUMA (Non-Uniform Memory Access) and Linux's NUMA support
Familiarity with PCIe architecture and device memory-mapped I/O
Basic knowledge of CPU cache coherence (MESI protocol)
Understanding of persistent storage and crash consistency
Awareness of Linux's page allocator and memory tiering concepts

Part 1: CXL — Compute Express Link

What CXL Is

CXL (Compute Express Link) is an interconnect standard layered on the PCIe physical layer (CXL 1.0 was published in 2019 on PCIe 5.0; CXL 3.0 followed in 2022 on PCIe 6.0). It targets a specific bottleneck in data center architecture: memory capacity and bandwidth are constrained by the number of memory channels and DIMM slots per CPU. A dual-socket server typically has at most 24–32 DIMM slots. CXL lets memory be attached over PCIe lanes instead, expanding capacity and, in later revisions, allowing memory to be pooled or shared between servers.

CXL defines three types of devices:

Type 1 (CXL.cache + CXL.io): CPU ↔ accelerator coherency for devices with no host-visible memory of their own. The accelerator (e.g., a smart NIC) can coherently cache, read, and write CPU DRAM. The CPU cannot load from or store to device memory; that requires CXL.mem (Type 2). Used for: coherent DMA, atomics on host memory.

Type 2 (CXL.cache + CXL.mem + CXL.io): CPU ↔ accelerator with device-managed memory. The accelerator has its own DRAM that participates in the coherence domain. Both CPU and accelerator can access both memories coherently. Used for: GPU-CPU shared memory without explicit copy, FPGA with local memory.

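Type 3 (CXL.mem + CXL.io): Memory expansion. A CXL card carries DRAM (or flash-backed or other storage-class memory) that the CPU accesses with ordinary loads and stores; Linux typically presents it as a CPU-less NUMA memory node. CXL.io is still present for enumeration and configuration. Used for: cheap DRAM expansion beyond DIMM slot limits.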
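The three device types can be summarized as sets of protocols. The following is a minimal sketch of that mapping; the names `Proto`, `DEVICE_TYPES`, and the two helper functions are illustrative and do not correspond to any kernel or library API. It assumes CXL.io is present on every device type, since it carries enumeration and configuration.

```python
from enum import Flag, auto


class Proto(Flag):
    IO = auto()     # CXL.io: discovery, configuration, non-coherent DMA
    CACHE = auto()  # CXL.cache: device coherently caches host memory
    MEM = auto()    # CXL.mem: host loads/stores reach device-attached memory


# Illustrative mapping of CXL device type -> protocols it speaks.
DEVICE_TYPES = {
    1: Proto.IO | Proto.CACHE,
    2: Proto.IO | Proto.CACHE | Proto.MEM,
    3: Proto.IO | Proto.MEM,
}


def host_can_map_device_memory(dev_type: int) -> bool:
    """True if the CPU can load/store device memory (needs CXL.mem)."""
    return bool(DEVICE_TYPES[dev_type] & Proto.MEM)


def device_can_cache_host_memory(dev_type: int) -> bool:
    """True if the device can coherently cache host DRAM (needs CXL.cache)."""
    return bool(DEVICE_TYPES[dev_type] & Proto.CACHE)
```

Read this way, a memory expander (Type 3) is the only type the OS can treat as plain memory without a device-specific driver on the data path.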

CXL Memory Hierarchy

CXL MEMORY HIERARCHY DIAGRAM

  CPU Socket 0                   CPU Socket 1
  +------------------+           +------------------+
  |  DRAM DIMMs      |           |  DRAM DIMMs      |
  |  (DDR5, ~80ns)   |           |  (DDR5, ~80ns)   |
  |  NUMA node 0     |           |  NUMA node 1     |
  +--------+---------+           +--------+---------+
           |                              |
           +----------+  QPI/UPI  +-------+
                      |           |
           +----------+-----------+----------+
           |      PCIe 5.0 Root Complex      |
           +--+-------+-------+-------+------+
              |       |       |       |
         [CXL]   [CXL]   [CXL]   [CXL]
         Type3   Type3   Type2   Type1
         DRAM    DRAM    GPU     Smart
         exp.    exp.    mem     NIC
         ~200ns  ~200ns  ~150ns  (coherent)
         NUMA    NUMA
         node 2  node 3

Memory latency hierarchy:
  Local DRAM (DDR5):            ~80ns
  Remote DRAM (cross-socket):   ~150ns
  CXL Type 3 DRAM:              ~180-250ns
  CXL Type 3 flash-backed:      ~1µs or more (e.g., Samsung's hybrid DRAM/NAND CXL modules)
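To see why page placement matters, the latency figures above can be combined into an average memory latency for a given access mix. This is a rough sketch: it uses the midpoint of the CXL DRAM range (215ns) as an assumption, ignores CPU caches and bandwidth contention, and the access fractions are made up for illustration.

```python
# Approximate per-tier latencies in nanoseconds, taken from the list above.
# 215 is an assumed midpoint of the 180-250ns CXL DRAM range.
LATENCY_NS = {"local_dram": 80, "remote_dram": 150, "cxl_dram": 215}


def average_latency_ns(access_fraction):
    """Weighted average latency for a dict of tier -> fraction of accesses."""
    total = sum(access_fraction.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(LATENCY_NS[tier] * frac for tier, frac in access_fraction.items())


# Hot data mostly in local DRAM: about 90ns on average.
well_tiered = average_latency_ns(
    {"local_dram": 0.90, "remote_dram": 0.05, "cxl_dram": 0.05}
)

# 40% of accesses land on CXL: about 141ns on average.
poorly_tiered = average_latency_ns(
    {"local_dram": 0.50, "remote_dram": 0.10, "cxl_dram": 0.40}
)
```

The gap between the two mixes is the budget a tiering policy is trying to recover.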

CXL Memory Tiering in Linux

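Linux's approach to CXL memory is to expose CXL-attached memory as NUMA nodes with higher latency. Support was merged incrementally: the CXL driver in 5.12, reclaim-time demotion in 5.15, NUMA-balancing promotion in 5.18, and explicit memory tiers in 6.1. The memory tiering infrastructure manages pages across tiers: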

Tier 0 (fast): CPU-local DRAM
Tier 1 (slow): CXL-attached DRAM or other capacity memory

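Linux's demotion path (5.15+): Under memory pressure on Tier 0, reclaim identifies cold pages on the inactive LRU lists (using the page-table accessed bits) and migrates them to a Tier 1 node instead of swapping or discarding them. Promotion (5.18+) is driven by NUMA balancing in memory-tiering mode: the kernel periodically unmaps ranges so that the next access raises a hinting fault, and Tier 1 pages that prove hot are migrated back to Tier 0.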

CXL PAGE DEMOTION/PROMOTION

Demotion (during reclaim, when Tier 0 is under pressure):
  For each page on Tier 0's inactive LRU:
    if page accessed recently: clear accessed bit, keep in Tier 0
    if page not accessed:      demote (migrate) to CXL Tier 1 node

Promotion (NUMA balancing scan, memory-tiering mode):
    scanner marks PTEs inaccessible -> next access takes a NUMA hinting fault
    if page is hot and Tier 0 has free pages: promote page to Tier 0
    else: leave page in Tier 1 until reclaim frees space on Tier 0

Linux config:
  /sys/kernel/mm/numa/demotion_enabled = 1                    (enable demotion)
  sysctl kernel.numa_balancing=2                              (enable promotion)
  /sys/devices/virtual/memory_tiering/memory_tierN/nodelist   (nodes in each tier, 6.1+)
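The demotion and promotion flow above can be modelled in a few lines. This is a toy simulation, not the kernel's algorithm: the class name `TwoTierMemory` is hypothetical, it uses a simple second-chance scan in place of the real LRU lists, and it promotes on the first access to a slow-tier page, whereas the kernel applies hotness thresholds and rate limits.

```python
from collections import OrderedDict


class TwoTierMemory:
    """Toy model: a small fast tier backed by an unbounded slow tier."""

    def __init__(self, fast_capacity):
        if fast_capacity < 1:
            raise ValueError("fast tier needs at least one page")
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # page -> accessed bit, oldest first
        self.slow = set()          # pages demoted to the slow (CXL) tier

    def _make_room(self):
        # Second-chance scan: recently accessed pages get their bit cleared
        # and move to the tail; the first cold page found is demoted.
        while len(self.fast) >= self.fast_capacity:
            page, accessed = self.fast.popitem(last=False)
            if accessed:
                self.fast[page] = False
            else:
                self.slow.add(page)

    def touch(self, page):
        """Access a page and report what the memory system had to do."""
        if page in self.fast:
            self.fast[page] = True
            return "fast-hit"
        if page in self.slow:
            self.slow.remove(page)
            self._make_room()
            self.fast[page] = True
            return "promoted"
        self._make_room()
        self.fast[page] = False  # new page; cold until touched again
        return "allocated"
```

With a two-page fast tier, touching `a`, `b`, `a`, `c` demotes `b` (the only page whose accessed bit is clear), and a later touch of `b` promotes it again at the cost of demoting another page. Real workloads thrash in exactly this way when the hot set exceeds Tier 0.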

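DAX and kmem drivers: The kernel's CXL subsystem (drivers/cxl/) implements PCI/CXL enumeration and region setup, then exposes each CXL memory region as a device-DAX device. The dax kmem driver can hot-add such a device as system RAM, at which point it appears as a NUMA node. The cxl userspace tool manages CXL region configuration, and daxctl switches a DAX device between devdax and system-ram modes.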

CXL for Memory Pooling

CXL 2.0 adds switching and memory pooling, and CXL 3.0 adds fabric topologies and true multi-host sharing: a CXL memory expander can be attached to a fabric switch and made accessible to multiple server nodes. This enables:

Memory disaggregation: Separate memory capacity from compute, provision memory to whichever server needs it
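Memory sharing: Multiple servers map the same CXL memory region (CXL 3.0). Hardware coherence between hosts depends on the device supporting back-invalidation; without it, software must coordinate access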
Fine-grained memory allocation: Cloud providers can allocate memory in increments not tied to DIMM slots

OS implications: Linux's memory allocator must understand which pages are shareable (and potentially coherent between hosts) vs. local. The kernel's NUMA abstractions need extension for multi-host CXL topologies—work that is in early stages as of 2025-2026.
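A pooled expander can be pictured as a set of fixed-size extents, each owned by at most one host at a time. The sketch below models only that bookkeeping; the `MemoryPool` class and its methods are hypothetical, and it deliberately covers pooling (exclusive ownership) rather than coherent sharing.

```python
class MemoryPool:
    """Toy model of a CXL memory pool carved into fixed-size extents."""

    def __init__(self, total_gib, extent_gib=1):
        self.extent_gib = extent_gib
        self.free_extents = list(range(total_gib // extent_gib))
        self.owner = {}  # extent index -> host name

    def allocate(self, host, gib):
        """Grant enough whole extents to cover `gib`; return their indices."""
        needed = -(-gib // self.extent_gib)  # ceiling division
        if needed > len(self.free_extents):
            raise MemoryError(f"pool cannot satisfy {gib} GiB for {host}")
        grant = [self.free_extents.pop() for _ in range(needed)]
        for extent in grant:
            self.owner[extent] = host
        return grant

    def release_host(self, host):
        """Return all of a host's extents to the pool (e.g., after it crashes)."""
        freed = [e for e, h in self.owner.items() if h == host]
        for extent in freed:
            del self.owner[extent]
            self.free_extents.append(extent)
        return len(freed)

    def host_gib(self, host):
        """Capacity currently assigned to `host`, in GiB."""
        return sum(1 for h in self.owner.values() if h == host) * self.extent_gib
```

Even this simple model surfaces the OS questions raised above: the host kernel must hot-add and hot-remove memory as extents are granted and reclaimed, and something outside any single host has to own the `owner` table.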

Part 2: Persistent Memory

What Persistent Memory Is

Intel Optane DCPMM (Data Center Persistent Memory Module) was Intel's implementation of 3D XPoint storage-class memory. Optane DCPMMs physically fit in DRAM slots (DIMM form factor) and are accessible via CPU memory bus at latencies of 300–400ns (vs ~80ns for DDR4) with capacities of 128–512GB per module.

Intel discontinued Optane DCPMM in 2022, but the OS infrastructure built to support it remains valuable and will apply to future persistent memory technologies (Samsung Z-NAND, CXL-attached SCM).

Persistent memory has two key properties that distinguish it from both DRAM and NVMe:

1. Byte-addressable: The CPU can load and store individual bytes (no block I/O required)
2. Persistent: Data survives power loss

DAX: Direct Access Without Page Cache

Traditional Linux I/O goes through the page cache: data is copied from storage into DRAM pages, applications read from those pages. For persistent memory, copying data into DRAM is wasteful—the data is already in a byte-addressable, CPU-accessible medium.

DAX (Direct Access) mode, supported by ext4 and XFS with the dax mount option, enables direct access to persistent memory without the page cache:

TRADITIONAL I/O vs. DAX

Traditional (ext4 without DAX):
  Application read() -> VFS -> page cache -> (on miss) block layer -> NVMe -> fill page -> copy to user buffer
  Application mmap() -> page cache backed mapping

DAX (ext4 -o dax or XFS with dax=always):
  Application read() -> VFS -> DAX -> CPU copies from pmem straight to user buffer
  Application mmap() -> direct mapping to pmem DIMM
                        (no page cache, no copy into DRAM)

  write() + fsync() without DAX: write to page cache, fsync flushes to disk
  store + CLWB + SFENCE with a DAX mmap: store goes to pmem, flush cache line,
                                      memory fence = durable immediately
                                      (map with MAP_SYNC so file metadata is durable too)

For a mmap()'d DAX file, a store instruction from application code writes directly to persistent memory. There is no intermediate buffering.
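As a rough illustration of this programming model, the sketch below stores and reloads an 8-byte counter through a shared file mapping. The function names are illustrative. Python cannot issue CLWB directly, so `mmap.flush()` (which calls `msync()`) stands in as the durability point; on an ordinary file the same code works but goes through the page cache. A path such as `/mnt/pmem/counter` on a DAX mount is an assumption, not something this code checks.

```python
import mmap
import os
import struct


def store_counter(path, value, size=4096):
    """Write an 8-byte little-endian counter at offset 0 via a shared mapping."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)  # make sure the file covers the mapping
        with mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE) as m:
            struct.pack_into("<Q", m, 0, value)  # plain store into the mapping
            m.flush()  # msync(): the point after which the data is durable
    finally:
        os.close(fd)


def load_counter(path):
    """Read the counter back through a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
            return struct.unpack_from("<Q", m, 0)[0]
```

In C, the `flush()` call would typically be replaced by a cache-line flush and fence (or PMDK's `pmem_persist()`), which avoids the system call entirely.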

Crash Consistency Without WAL

Traditional databases achieve durability via Write-Ahead Logging: write the WAL record to durable storage before applying changes. WAL requires at least one sequential write per transaction (or per batch of transactions under group commit), typically a 4KB or 8KB block write.

With persistent memory and DAX, the architecture changes:

WAL-BASED DURABILITY (without pmem):
  1. Write log record to page cache
  2. fsync() -> disk write (~50-200µs NVMe)
  3. Apply change to table in memory
  4. (checkpoint: flush dirty pages to disk periodically)

PMEM DURABILITY (with DAX):
  1. Write change directly to pmem table
  2. CLWB (Cache Line Write Back) to flush CPU cache to pmem
  3. SFENCE (Store Fence) to order the flush before later stores
  4. Done: durable in ~300ns end to end (CPU cache to pmem)
  No fsync and no block I/O; an aligned 8-byte update needs no log,
  but larger updates still need a small log or copy-on-write kept in pmem

CLWB + SFENCE cost: ~100-200ns of that total (vs 50,000-200,000ns for NVMe fsync)
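The latency gap translates directly into an upper bound on how many commits a single thread can make if it waits for durability after each one. The following back-of-the-envelope calculation uses the figures quoted above (50-200µs for an NVMe fsync, ~300ns for a pmem persist); the function name is illustrative, and group commit would raise the NVMe numbers by amortizing one fsync over many transactions.

```python
def max_serial_commits_per_sec(commit_latency_ns):
    """Upper bound on commits/second for one thread that waits on each commit."""
    return 1e9 / commit_latency_ns


nvme_slow = max_serial_commits_per_sec(200_000)  # 200µs fsync: 5,000 commits/s
nvme_fast = max_serial_commits_per_sec(50_000)   # 50µs fsync: 20,000 commits/s
pmem = max_serial_commits_per_sec(300)           # 300ns persist: ~3.3 million commits/s
```

The two to three orders of magnitude between these bounds is why designs built around batching log writes look different once the durable medium is byte-addressable.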

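However, achieving crash consistency without WAL on pmem requires careful attention to ordering. Stores sit in the CPU cache until their cache lines are flushed or evicted, and the hardware may write lines back to pmem in any order, regardless of program order. CLWB followed by SFENCE forces a given line to reach the persistence domain before any later store is issued. Without that ordering, a crash during a multi-word update can leave data in an inconsistent state that mixes old and new values.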

Libraries like libpmemobj (Intel PMDK) provide transaction primitives for pmem:

- TX_BEGIN / TX_END
- Snapshot-before-modify with undo logging in pmem itself
- pmem_persist() equivalent to CLWB + SFENCE
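The snapshot-before-modify idea can be illustrated with a small simulation. This is a sketch of the general undo-logging technique, not libpmemobj's actual log format or API: `SimPmem`, `tx_update`, `recover`, and the log layout are all invented for illustration, and `persist()` stands in for CLWB + SFENCE by moving simulated cache contents into the durable array.

```python
import struct

LOG_OFF = 0    # undo log: 1-byte valid flag, 8-byte offset, 8-byte length, old data
DATA_OFF = 64  # start of application data in the simulated device


class SimPmem:
    """Simulated pmem: stores land in a volatile cache until persist()."""

    def __init__(self, size):
        self.durable = bytearray(size)
        self.cache = {}  # byte offset -> value not yet flushed

    def store(self, off, data):
        for i, b in enumerate(data):
            self.cache[off + i] = b

    def load(self, off, n):
        return bytes(self.cache.get(off + i, self.durable[off + i]) for i in range(n))

    def persist(self):  # stands in for CLWB + SFENCE
        for off, b in self.cache.items():
            self.durable[off] = b
        self.cache.clear()

    def crash(self):  # power loss: unflushed cache contents vanish
        self.cache.clear()


def tx_update(pm, off, new, crash_after_bytes=None):
    """Undo-logged in-place update; optionally crash part-way through."""
    old = pm.load(off, len(new))
    pm.store(LOG_OFF + 1, struct.pack("<QQ", off, len(new)) + old)
    pm.persist()                   # 1. snapshot is durable...
    pm.store(LOG_OFF, b"\x01")
    pm.persist()                   # ...before the log is marked valid
    if crash_after_bytes is not None:
        pm.store(off, new[:crash_after_bytes])
        pm.persist()               # torn update reaches pmem
        pm.crash()
        return
    pm.store(off, new)
    pm.persist()                   # 2. modify in place
    pm.store(LOG_OFF, b"\x00")
    pm.persist()                   # 3. commit: invalidate the log


def recover(pm):
    """Roll back an interrupted transaction. Returns True if one was undone."""
    if pm.load(LOG_OFF, 1) != b"\x01":
        return False
    off, n = struct.unpack("<QQ", pm.load(LOG_OFF + 1, 16))
    pm.store(off, pm.load(LOG_OFF + 17, n))
    pm.persist()
    pm.store(LOG_OFF, b"\x00")
    pm.persist()
    return True
```

Each `persist()` marks a point where ordering matters: the snapshot must be durable before the valid flag, and the valid flag before the first in-place store. Dropping any one of them reintroduces the inconsistency window.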

OS Support for Persistent Memory

Linux's persistent memory support (libnvdimm, nd_pmem, dax drivers):

# Check for pmem devices
ls /dev/pmem*                    # raw pmem device
ls /dev/dax*                     # DAX device

# Create a DAX filesystem (ext4)
mkfs.ext4 -b 4096 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem

# FSDAX mode: files in this filesystem can be mmap'd directly to pmem.
# The mapping is made from C, not from the shell:
#   mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED_VALIDATE|MAP_SYNC, fd, 0)
# which returns a pointer directly into pmem, with no copy

# DevDAX mode: direct access without filesystem
ndctl create-namespace --mode=devdax --size=100g
# Results in /dev/dax0.0 — mmap this for byte-level pmem access
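Scripts that provision pmem often need to know which namespaces exist and in which mode. `ndctl list` prints JSON, so a small parser is enough. The sketch below assumes the field names `dev`, `mode`, `size`, `blockdev`, and `chardev` as they commonly appear in that output; the sample string is made up for illustration and is not captured from a real system.

```python
import json


def summarize_namespaces(ndctl_json):
    """Return (namespace, mode, device node, size in GiB) for each entry."""
    entries = json.loads(ndctl_json)
    if isinstance(entries, dict):  # tolerate a single object instead of a list
        entries = [entries]
    rows = []
    for ns in entries:
        # fsdax namespaces expose a block device, devdax a character device.
        node = ns.get("blockdev") or ns.get("chardev")
        rows.append((
            ns.get("dev"),
            ns.get("mode"),
            f"/dev/{node}" if node else None,
            ns.get("size", 0) / 2**30,
        ))
    return rows


# Illustrative sample only; real output carries more fields (uuid, map, ...).
SAMPLE_NDCTL_OUTPUT = """
[
  {"dev": "namespace0.0", "mode": "fsdax",  "size": 107374182400, "blockdev": "pmem0"},
  {"dev": "namespace1.0", "mode": "devdax", "size": 107374182400, "chardev": "dax1.0"}
]
"""
```

On a real machine the input would come from something like `subprocess.run(["ndctl", "list", "--namespaces"], capture_output=True, text=True).stdout`.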

Part 3: OS for Accelerator-Centric Computing

The Shift in Computing Paradigm

For the past 40 years, the CPU has been the primary compute unit, with accelerators (GPUs, FPGAs) as attached co-processors. This model is inverting: in modern data centers, GPUs perform the bulk of compute work (ML training/inference), and CPUs serve as orchestrators.

This inversion has OS implications:

GPU as compute, CPU as coordinator:

TRADITIONAL MODEL:
  CPU (compute) -> DMA -> GPU (graphics/co-processor)
  OS schedules CPU threads; GPU scheduling is secondary

MODERN AI TRAINING MODEL:
  GPU cluster (primary compute: matrix multiply, attention, FFT)
  CPU (coordinator: data loading, checkpoint, gradient aggregation)
  OS should be GPU-aware: avoid descheduling the CPU threads that feed the GPU

Linux's GPU scheduling support (drm/scheduler) has evolved but remains second-class relative to CPU scheduling. High-performance ML training requires:

GPU work preemption: Ability to preempt long-running GPU kernels for latency-sensitive workloads (still limited in most GPU drivers)
CPU-GPU memory coherence: NVLink (NVIDIA) and CXL Type 2 (in theory) allow CPU and GPU to share a coherent memory space, but the OS virtual memory system must be aware of this
GPU-native memory allocation: cudaMallocManaged() (CUDA unified memory) in user space and the kernel's HMM interface (hmm_range_fault()) let a GPU page fault map or migrate CPU-managed pages into the GPU's address space

DPUs and IPUs

DPU (Data Processing Unit): NVIDIA BlueField, AMD Pensando — a programmable ARM-based CPU embedded in a SmartNIC, capable of running a full OS alongside its network processing.

IPU (Infrastructure Processing Unit): Intel's name for the same class of device (e.g., Mount Evans). AWS Nitro cards play a comparable role: they offload hypervisor networking and storage from the host CPU.

OS implications:

DPU ARCHITECTURE (NVIDIA BlueField)

  Host Server:
  +-----------------------------------------------+
  | Application VMs / Containers                  |
  +-----------------------------------------------+
  | Host OS (Linux / KVM)                         |
  +---+-------------------------------------------+
      | PCIe
  +---+-------------------------------------------+
  | DPU (BlueField-3)                             |
  | +---ARM cores (16x Cortex-A78)                |
  | +---NVMe emulation (presents vNVMe to host)   |
  | +---vSwitch (Open vSwitch in DPU memory)      |
  | +---Crypto acceleration                       |
  | +---Up to 400GbE NIC                          |
  +-----------------------------------------------+

The DPU runs its own OS (typically a hardened Ubuntu/DOCA)
Host OS sees the DPU as a standard NVMe + NIC device
Security policy enforcement, encryption, vSwitch live in DPU
   — isolated from potentially compromised host OS

The OS challenge: how should the host OS and DPU OS coordinate? For AWS Nitro, the answer is strict separation: the host OS has no visibility into the Nitro card's software. For NVIDIA BlueField, the scalable function (SF) and separated-host operating models define the interface.

Chiplet OS Support

Modern CPUs are built from chiplets: separate dies bonded together. AMD's Genoa (EPYC 9004) uses up to 12 compute chiplets (CCDs) plus a central I/O die (IOD). The CCDs have different latencies to memory channels depending on which IOD port they route through.

This creates a sub-NUMA topology: within a NUMA node, some CPUs are closer to memory than others.

AMD EPYC CHIPLET TOPOLOGY (Genoa, 4 NUMA nodes)

NUMA node 0 (CPUs 0-23), latencies illustrative:
  CCD 0 (CPUs 0-7):   direct path to DRAM channels 0-2  (~80ns)
  CCD 1 (CPUs 8-15):  one hop to DRAM channels 0-2      (~100ns)
  CCD 2 (CPUs 16-23): two hops to DRAM channels 0-2     (~115ns)

"sNUMA" (sub-NUMA): some workloads benefit from binding to the closest CCDs

Linux does not model this intra-node variation natively (as of 6.x);
firmware settings such as NPS4 or "L3 as NUMA" only change which nodes exist.
Workaround: use cpuset to bind to specific CPU ranges and
            use numactl --membind to keep memory on the local NUMA node.

Future OS kernels will need a "NUMA++" model that represents chiplet-level latency variation within a NUMA node.
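Until the kernel models chiplets directly, applications can approximate the binding themselves. The sketch below pins the calling process to the CPUs of one CCD. It assumes the contiguous numbering used in the topology listing above (8 CPUs per CCD, SMT siblings ignored), which real systems may not follow; checking `/sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list` shows which CPUs actually share an L3. The helper names are hypothetical.

```python
import os


def ccd_cpus(ccd_index, cores_per_ccd=8, first_cpu=0):
    """CPU ids belonging to a CCD, assuming contiguous numbering."""
    start = first_cpu + ccd_index * cores_per_ccd
    return set(range(start, start + cores_per_ccd))


def pin_to_ccd(ccd_index, cores_per_ccd=8):
    """Restrict this process to one CCD; returns the CPUs actually used.

    Returns an empty set (and changes nothing) if the platform lacks
    sched_setaffinity or none of the CCD's CPUs are available to us.
    """
    if not hasattr(os, "sched_setaffinity"):
        return set()
    usable = ccd_cpus(ccd_index, cores_per_ccd) & os.sched_getaffinity(0)
    if usable:
        os.sched_setaffinity(0, usable)
    return usable
```

Calling `pin_to_ccd(0)` early in a worker process keeps its threads on the CCD with the shortest path to memory in the example topology; memory placement still has to be handled separately, for example with `numactl --membind`.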

Near-Memory Computing

Near-memory or in-memory processing (PIM — Processing In Memory) places compute logic inside DRAM modules, executing simple operations (search, sort, reduction) without moving data to the CPU.

Samsung's HBM-PIM and SK Hynix's AiM place processing elements alongside DRAM banks. The potential benefit: DRAM internal bandwidth is ~10x higher than the external DRAM bus bandwidth. For bandwidth-bound operations (matrix-vector multiply, graph traversal), this could provide up to a ~10x speedup on the offloaded portion.
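How much of that 10x reaches the application depends on how much of its runtime is actually bandwidth-bound. A simple Amdahl-style estimate makes this concrete; it assumes the bandwidth-bound fraction speeds up by the full internal-bandwidth gain and the rest of the program is unchanged, which ignores offload and coherence overheads.

```python
def pim_speedup(bandwidth_bound_fraction, pim_bandwidth_gain=10.0):
    """Amdahl-style whole-program speedup from offloading to PIM."""
    f = bandwidth_bound_fraction
    if not 0.0 <= f <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return 1.0 / ((1.0 - f) + f / pim_bandwidth_gain)


fully_bound = pim_speedup(1.0)  # entire runtime offloadable: 10x
half_bound = pim_speedup(0.5)   # half the runtime offloadable: about 1.8x
```

A workload that spends half its time in offloadable kernels gains less than 2x, which is why OS-level decisions about what to offload matter as much as the hardware's raw bandwidth.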

OS implications (early research stage):

- Scheduler must understand that some memory operations can be offloaded to PIM
- Virtual memory manager must know which pages are on PIM-capable banks
- The OS must expose an API for applications to trigger PIM operations
- Coherence between PIM results and CPU caches must be managed

The Linux kernel has no PIM support as of 2025. IBM, Samsung, and UIUC researchers have prototype driver implementations.

Open Problems

Unified memory management across CPU/GPU/CXL/PIM: Today these are managed by separate drivers with separate allocators. No single OS abstraction spans all of them.
NUMA++: Sub-NUMA chiplet topology requires finer-grained placement decisions than current NUMA-aware allocators support.
Crash consistency for shared CXL memory: If two servers share CXL memory, what happens when one server crashes? The memory may be in an inconsistent state relative to the other server's view.
DPU/IPU OS coordination: No standard interface exists for host OS and DPU OS to share scheduling information, memory mappings, or security policy.
Accelerator-aware scheduler: The Linux scheduler has no visibility into GPU or DPU queue depths. It cannot make CPU placement decisions that minimize GPU idle time.

Failure Modes

CXL link failure: A CXL link loss causes the associated NUMA node to become inaccessible. Linux's memory_failure() path handles DRAM ECC errors but CXL-specific link-down events require new driver handling.
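Persistent memory partial write during crash: If a power loss occurs during a non-atomic write to pmem, the write may be partially committed. DAX filesystems journal their own metadata, but file contents written through a DAX mapping are the application's responsibility: it must rely on aligned 8-byte atomic stores or on logging (as PMDK does) to handle this.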
CPU-less NUMA node: CXL Type 3 memory presented as a NUMA node has memory but no CPUs (the inverse of a "memoryless node", which has CPUs but no memory). Linux's memory allocator, scheduler, and NUMA balancing must all handle such nodes correctly.
PIM coherence bugs: If PIM modifications to DRAM are not properly reflected in CPU caches, reads after PIM writes may return stale data.
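When diagnosing placement problems on a CXL system, a useful first step is to find which NUMA nodes have no CPUs, since those are the candidates for CXL or other capacity memory. The sketch below reads the standard `cpulist` attribute under `/sys/devices/system/node`; the function name is illustrative, and a CPU-less node is not proof of CXL (pmem hot-added as RAM looks the same).

```python
import os
import re


def cpuless_nodes(sysfs_root="/sys/devices/system/node"):
    """Return sorted ids of NUMA nodes whose cpulist is empty."""
    result = []
    if not os.path.isdir(sysfs_root):
        return result
    for name in os.listdir(sysfs_root):
        match = re.fullmatch(r"node(\d+)", name)
        if not match:
            continue  # skip entries such as 'possible' or 'online'
        try:
            with open(os.path.join(sysfs_root, name, "cpulist")) as f:
                cpulist = f.read().strip()
        except OSError:
            continue
        if cpulist == "":
            result.append(int(match.group(1)))
    return sorted(result)
```

The `sysfs_root` parameter exists so the function can be pointed at a copied or mocked directory tree when testing on a machine without such nodes.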

Security Implications

CXL memory attack surface: If CXL memory is shared across VMs or tenants, cross-tenant memory access becomes a real threat. The CXL standard includes protection mechanisms (CXL IDE, Integrity and Data Encryption, for encryption and integrity on the link), but the OS/VMM must still configure them and enforce which tenant can map which region.
DPU privilege: A DPU running its own OS with access to host memory (via PCIe DMA) is a powerful target. A compromised DPU firmware can read or modify host memory and intercept network traffic.
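Persistent memory forensics: pmem retains data across power cycles. After decommissioning a server, pmem modules may contain sensitive data unless explicitly sanitized. ND_CMD_CLEAR_ERROR only clears poisoned ranges; sanitization goes through the DIMM's security commands (for example ndctl sanitize-dimm with crypto-erase or overwrite), whose availability depends on the vendor and firmware.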

Debugging Notes

# CXL device enumeration
ls /sys/bus/cxl/devices/
cxl list --memdevs            # list CXL memory devices
cxl list --regions            # list configured regions
daxctl list                   # list device-DAX devices

# CXL memory NUMA node inspection
numactl --hardware            # CXL nodes show up as CPU-less nodes; prints node distances

# PMEM inspection
ndctl list --namespaces
ndctl list --dimms
ndctl read-labels /dev/nmem0  # read NVDIMM labels

# DAX filesystem performance
fio --filename=/mnt/pmem/testfile --rw=randwrite --bs=4k --direct=1 \
    --numjobs=1 --iodepth=1 --name=pmem_test

# Check pmem persistence domain
cat /sys/bus/nd/devices/region0/persistence_domain  # memory_controller (ADR) or cpu_cache (eADR)

# GPU memory visibility
nvidia-smi topo --matrix    # GPU NVLink/PCIe topology
lsmod | grep nvidia_uvm     # managed (unified) memory requires the nvidia-uvm module

# PIM (no standard interface yet; vendor-specific)
# Samsung HBM-PIM uses libpim API

Exercises

On a server with NUMA support (numactl --hardware shows multiple nodes), run stream (memory bandwidth benchmark) on each node and compare. If CXL nodes are available, compare CXL vs. DRAM bandwidth.
Study the Linux CXL driver code in drivers/cxl/. Trace how a CXL Type 3 device is enumerated and how it appears as a NUMA node. What is the path from PCI probe to numa_add_memblk()?
If you have access to a pmem-capable system (or QEMU with nvdimm emulation), mount an ext4 filesystem with -o dax and use fio to compare random 4KB reads on DAX vs. non-DAX. Explain the performance difference.
Read the paper "Characterizing the Performance of Intel Optane Persistent Memory" (VLDB 2020). What did the authors find about the asymmetry between read and write performance? How does this affect database WAL design?
Design an OS interface for near-memory PIM. What syscalls or ioctl interface would allow an application to submit a "sort this range of physical memory" operation to PIM hardware? What cache coherence guarantees would the OS need to provide?

References

CXL Consortium specifications: https://www.computeexpresslink.org
Linux CXL driver documentation: Documentation/driver-api/cxl/
Lim, K. et al. "Disaggregated Memory for Expansion and Sharing in Blade Servers" (ISCA 2009) — foundational disaggregated memory paper
Yang, J. et al. "An Empirical Guide to the Behavior and Use of Scalable Persistent Memory" (FAST 2020), measured behavior of Optane DCPMM
Intel PMDK documentation: https://pmem.io/pmdk/
McSherry, F. et al. "Scalability! But at what COST?" (HotOS 2015) — honest look at when new hardware matters
NVIDIA BlueField DPU documentation and DOCA SDK
Intel Optane Technology Whitepaper (2019)
AMD EPYC 9004 Series Platform/NUMA topology whitepaper (2022)
Gouk, D. et al. "Direct Access, High-Performance Memory Disaggregation with DirectCXL" (USENIX ATC 2022)