NUMA Architecture: Non-Uniform Memory Access in Modern Servers

Prerequisites

Cache hierarchy (05-cache-hierarchy.md): L1/L2/L3 structure, cache miss latency
Cache coherence (06-cache-coherence.md): MESI, snooping vs directory protocols
Virtual memory: page tables, physical addresses, mmap
Linux process/memory management basics: pages, malloc, mmap, NUMA APIs
Multi-socket server hardware concepts: multiple physical CPUs per server

Technical Overview

NUMA (Non-Uniform Memory Access) describes a memory architecture in which different processors in the same system have different access latencies and bandwidths to different regions of physical memory. A processor's "local" memory — physically attached to the same socket or chiplet — is faster to access than "remote" memory attached to a different socket.

The alternative is UMA (Uniform Memory Access): every processor has equal latency to all memory, typically via a centralized memory controller. UMA is simple but does not scale: a single shared memory bus saturates at 4-8 processors. Beyond that, adding processors provides compute but no additional memory bandwidth — each new processor competes for the same bus.

NUMA solves the scaling problem by distributing the memory bus: each processor (or die, or cluster) has its own local memory controller and local DRAM. Processors are connected by a high-bandwidth, low-latency interconnect fabric. Memory is accessible from any processor but with latency that depends on the hop count to the owning memory controller.

Modern server CPUs are almost universally NUMA. A 2-socket Intel Xeon server is a 2-NUMA-node system. A single-socket AMD EPYC Genoa (Zen 4) with 12 CCDs can expose up to 12 NUMA nodes within one socket when each CCD's L3 is presented as its own node. Even some client processors built from multiple dies (for example, Apple's M-series Ultra parts) can show implicit NUMA-like effects from die-to-die distance and memory channel assignment.

Historical Context

Late 1980s — Need Established: As SMP (Symmetric Multi-Processor) systems scaled beyond 4-8 CPUs, the shared Frontside Bus became a bottleneck. All memory reads/writes competed for the same bus bandwidth.

1990 — BBN TC2000: One of the first commercial NUMA systems. NUMA coined by R. F. Rashid et al. at Carnegie Mellon (cited in early Mach kernel documentation).

1996 — Sequent NUMA-Q: Proprietary NUMA architecture using an "IQ-Link" interconnect.

1996 — SGI Origin 2000: One of the first large-scale commercial ccNUMA (cache-coherent NUMA) systems. Used the "CrayLink" interconnect (designed by Jim Laudon, et al.). 2-64 nodes per SGI chassis, scalable to 1024+ nodes via distributed memory. Each node: 2 MIPS R10000 CPUs + local DRAM + directory controller. The design became an influential reference for later ccNUMA systems.

2003 — AMD Opteron (K8): First NUMA in commodity x86. The Opteron integrated the memory controller directly into the CPU die (Intel still used an off-CPU Northbridge chipset). Two Opterons connected via HyperTransport: 2-node NUMA with ~80ns local memory, ~120ns remote (1-hop). HyperTransport peak on the first Opterons: 3.2 GB/s per direction (6.4 GB/s aggregate) per 16-bit link at 800 MHz.

2008 — Intel Nehalem (Intel QuickPath Interconnect): Intel abandoned the shared Frontside Bus, integrated the memory controller (IMC) into the CPU die. QPI (QuickPath Interconnect) replaced the Frontside Bus for CPU-CPU communication. 2-socket Nehalem: 2-NUMA-node system.

2017 — AMD EPYC Naples (Zen 1, 1st gen): Multi-chip module (MCM): 4 CPU dies (each 8 cores) on a single package. Each die has 2 DDR4 memory channels. Result: 4 NUMA nodes per socket for a single-socket server. Total: 8 NUMA nodes for 2-socket Naples system.

2019 — AMD EPYC Rome (Zen 2, 2nd gen): 8 compute chiplets (CCDs, 8 cores each) + 1 central I/O die. Memory controllers on I/O die. In "NPS4" (NUMA nodes Per Socket = 4) mode: 4 NUMA nodes per socket from 4 sets of 2 CCDs, so 8 NUMA nodes for a 2-socket system. Exposing each L3 as its own NUMA node raises the count further.

2022 — AMD EPYC Genoa (Zen 4, 4th gen): Up to 12 CCDs per socket (96 cores). Configurable NUMA modes (NPS1, NPS2, NPS4). With each CCD's L3 additionally exposed as a NUMA node: 12 NUMA nodes per socket, so a 2-socket Genoa server has up to 24 NUMA nodes.

2023 — Intel Xeon Sapphire Rapids: Intel extends Sub-NUMA Clustering (SNC, first shipped with Skylake-SP in 2017). A single socket can be divided into 2 or 4 NUMA domains (SNC2, SNC4), each with dedicated LLC slices and memory channels. This gives better locality for NUMA-aware workloads within a single socket.

NUMA Latency Numbers

These are typical values for Intel Xeon / AMD EPYC dual-socket systems, as measured with a latency tool such as Intel MLC (numactl --hardware reports only relative distances, not nanoseconds). Exact values vary by CPU generation and memory configuration.

NUMA Latency Matrix (2-socket AMD EPYC 7763, Zen 3, 2021):

         From\To    Node 0     Node 1
         ─────────────────────────────
         Node 0      80 ns     120 ns
         Node 1     120 ns      80 ns

Local access:  ~80 ns (CPU → L3 miss → memory controller → DRAM → back)
Remote access: ~120 ns (adds the cross-socket Infinity Fabric round trip)
Remote penalty: ~50% overhead

NUMA Latency Matrix (2-socket Intel Xeon Platinum 8380, Ice Lake, 2021):

         From\To    Node 0     Node 1
         ─────────────────────────────
         Node 0      83 ns     139 ns
         Node 1     139 ns      83 ns

Remote penalty: ~67% overhead via UPI (Ultra Path Interconnect)
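
The penalty percentages quoted for both systems are just the extra remote latency divided by the local latency. A minimal sketch of that arithmetic (the function name is hypothetical, chosen for illustration):

```c
#include <assert.h>

/* Remote-access overhead relative to local latency, in percent. */
static double remote_penalty_pct(double local_ns, double remote_ns)
{
    assert(local_ns > 0.0);
    return (remote_ns - local_ns) / local_ns * 100.0;
}
```

With the table values, remote_penalty_pct(80, 120) gives 50 and remote_penalty_pct(83, 139) gives roughly 67.5, matching the figures above.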

AMD EPYC Genoa (single socket, NPS4 mode: 4 NUMA nodes):

         From\To    Node 0   Node 1   Node 2   Node 3
         ───────────────────────────────────────────────
         Node 0       70 ns    90 ns    90 ns   110 ns
         Node 1       90 ns    70 ns   110 ns    90 ns
         Node 2       90 ns   110 ns    70 ns    90 ns
         Node 3      110 ns    90 ns    90 ns    70 ns

1-hop (adjacent node):  +20 ns
2-hop (diagonal node):  +40 ns
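
The 4×4 matrix above can be reproduced by a simple hop model. The sketch below assumes the four nodes sit on a 2×2 grid, so the hop count is the number of differing bits in the two node IDs; that layout is an assumption made to match the table, not a statement about the real die topology, and the names are hypothetical.

```c
#include <assert.h>

enum { BASE_NS = 70, PER_HOP_NS = 20 };  /* values taken from the table above */

/* Hops between two nodes on an assumed 2x2 grid: one hop per differing ID bit. */
static int nps4_hops(int from, int to)
{
    assert(from >= 0 && from < 4 && to >= 0 && to < 4);
    int diff = from ^ to;
    return (diff & 1) + ((diff >> 1) & 1);
}

/* Modeled latency: local latency plus a fixed cost per hop. */
static int nps4_latency_ns(int from, int to)
{
    return BASE_NS + PER_HOP_NS * nps4_hops(from, to);
}
```

For example, node 0 to node 3 differs in both bits, giving 70 + 2 × 20 = 110 ns, the diagonal entry in the matrix.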

AMD Infinity Fabric: Interconnect Architecture

AMD's Infinity Fabric is the on-package and cross-package interconnect used in all Zen-based CPUs and APUs.

Topology

AMD EPYC Genoa Single Socket (simplified):

  ┌──────────────────────────────────────────────────────────────────┐
  │                           EPYC Socket                            │
  │                                                                  │
  │  CCD0  CCD1  CCD2  CCD3  CCD4  CCD5  CCD6  CCD7  CCD8  CCD9  ... │
  │  8C    8C    8C    8C    8C    8C    8C    8C    8C    8C        │
  │  32M   32M   32M   32M   32M   32M   32M   32M   32M   32M       │
  │  L3    L3    L3    L3    L3    L3    L3    L3    L3    L3        │
  │   │     │     │     │     │     │     │     │     │     │        │
  │   └─────┴─────┴─────┴─────┴──┬──┴─────┴─────┴─────┴─────┘        │
  │                              │ IF                                │
  │    ┌─────────────────────────▼──────────────────────────┐        │
  │    │                   I/O Die (IOD)                    │        │
  │    │   ┌──────┐  ┌──────┐  ┌──────┐  ┌──────┐           │        │
  │    │   │ MCx0 │  │ MCx1 │  │ MCx2 │  │ MCx3 │  ...      │        │
  │    │   └──┬───┘  └──┬───┘  └──┬───┘  └──┬───┘           │        │
  │    └──────┼─────────┼─────────┼─────────┼───────────────┘        │
  │           │         │         │         │                        │
  │         DDR5      DDR5      DDR5      DDR5                       │
  │          CH0       CH1       CH2       CH3                       │
  │                                                                  │
  │  IF = Infinity Fabric (16 GT/s, ~460 GB/s bisection BW)          │
  └──────────────────────────────────────────────────────────────────┘

Infinity Fabric Bandwidth vs Latency

The Infinity Fabric clock (FCLK) is configurable (up to about 2000 MHz on Zen 4). Where it is coupled to memory, it runs at the memory clock, which is half the DDR transfer rate. The bandwidth is roughly:

Infinity Fabric bandwidth (intra-socket, Genoa):
  Per CCD-to-IOD link: tens of GB/s per direction (a fixed number of bytes per fabric clock, so it scales with FCLK)
  Total fabric bisection: ~460 GB/s

Cross-socket (2P Infinity Fabric, Genoa):
  G-IF bandwidth: up to 512 GB/s aggregate with four links (128 GB/s per link, counting both directions)
  Latency: +40-60 ns vs intra-socket

Key tradeoff: IF frequency should be matched to DRAM frequency.
IF @ 2000 MHz: best bandwidth, best latency
IF @ 1333 MHz (matching 2666 MT/s memory): lower bandwidth, higher latency
→ Mismatched IF/memory frequency is a common source of misconfiguration
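
Two pieces of arithmetic sit behind the numbers above: the memory clock is half the DDR transfer rate, and peak DRAM bandwidth is channels × transfer rate × 8 bytes. The sketch below assumes a 64-bit data path per channel; the function names are hypothetical.

```c
#include <assert.h>

/* Peak DRAM bandwidth in GB/s: channels x MT/s x 8 bytes per transfer. */
static double ddr_peak_gbs(int channels, int mega_transfers_per_s)
{
    assert(channels > 0 && mega_transfers_per_s > 0);
    return channels * (double)mega_transfers_per_s * 8.0 / 1000.0;
}

/* DDR memory clock in MHz: two transfers per clock cycle. */
static int ddr_memclk_mhz(int mega_transfers_per_s)
{
    assert(mega_transfers_per_s > 0);
    return mega_transfers_per_s / 2;
}
```

For scale, ddr_peak_gbs(12, 4800) is 460.8 GB/s for a 12-channel DDR5-4800 socket, the same magnitude as the ~460 GB/s figure quoted above, and ddr_memclk_mhz(2666) is 1333, which is where the 1333 MHz fabric setting comes from.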

numactl: Observing and Controlling NUMA

# Show NUMA topology and available memory
numactl --hardware
# Output:
# available: 2 nodes (0-1)
# node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
# node 0 size: 193413 MB
# node 0 free: 181234 MB
# node 1 cpus: 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
# node 1 size: 193528 MB
# node 1 free: 177456 MB
# node distances:
# node   0   1
#    0:  10  21
#    1:  21  10
# (values are relative latency ratios, not nanoseconds: 10=local, 21=21/10 × local)

# Bind a process to NUMA node 0 (both CPU and memory)
numactl --cpunodebind=0 --membind=0 ./program

# Bind memory to node 0 but allow CPUs from any node
numactl --membind=0 ./program

# Interleave memory allocations across all NUMA nodes (maximize bandwidth)
numactl --interleave=all ./program

# Show NUMA stats for a running process
cat /proc/$(pgrep program)/numa_maps | head -20
# Format: virtual_addr policy flags ... pages
# e.g.: 7f2a00000000 interleave:0-1 file=/usr/lib/x86_64-linux-gnu/libc.so.6
#       7ffe00000000 default stack anon=482
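
On a real system each numa_maps line also carries N<node>=<pages> fields that give the per-node page counts for that mapping. A minimal sketch of summing those fields per node follows; the function name and the MAX_NODES limit are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 64  /* illustrative upper bound, not a kernel constant */

/* Adds the N<node>=<pages> fields of one numa_maps line to pages[].
 * Returns the number of such fields found on the line. */
static int numa_maps_add_line(const char *line, unsigned long pages[MAX_NODES])
{
    assert(line != NULL && pages != NULL);
    int found = 0;
    const char *p = line;
    while ((p = strchr(p, ' ')) != NULL) {
        p++;  /* step past the space to the start of the next field */
        int node;
        unsigned long count;
        if (sscanf(p, "N%d=%lu", &node, &count) == 2 &&
            node >= 0 && node < MAX_NODES) {
            pages[node] += count;
            found++;
        }
    }
    return found;
}
```

Feeding every line of /proc/<pid>/numa_maps through this function gives a per-node page total, which is essentially what numastat -p reports in megabytes.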

Interpreting `numactl --hardware` Node Distances

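The distance matrix comes from the firmware (the ACPI SLIT table) and uses relative units where 10 = local memory access. A rough latency estimate is (distance / 10) × local latency, but the values are often nominal rather than measured, so treat them as an ordering of nodes rather than a prediction: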

Common configurations (firmware-reported; exact values vary by vendor and BIOS):
  2-socket Intel Xeon:        10, 21  (nominal 2.1× remote penalty)
  2-socket AMD EPYC (NPS1):   10, 32  (nominal; the measured penalty is closer to 1.5×)
  Single socket AMD NPS4:     10, 12, 12, 12  (within-socket NUMA)
  4-socket Intel (via UPI):   10, 21, 21, 21 when fully connected; 31 appears for 2-hop routes
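
The same distances are exposed per node in /sys/devices/system/node/node<N>/distance as a space-separated row. The sketch below parses one such row and applies the distance/10 scaling; the function names are hypothetical, and the estimate is only as good as the firmware table.

```c
#include <assert.h>
#include <stdlib.h>

/* Parses a distance row such as "10 21" into out[]; returns entries read. */
static int parse_distance_row(const char *row, int out[], int max)
{
    assert(row != NULL && out != NULL);
    int n = 0;
    while (n < max) {
        char *end;
        long v = strtol(row, &end, 10);
        if (end == row)
            break;  /* no more numbers on this row */
        out[n++] = (int)v;
        row = end;
    }
    return n;
}

/* Scales a measured local latency by a SLIT distance (10 = local). */
static double estimate_ns(int distance, double local_ns)
{
    assert(distance >= 10);
    return local_ns * distance / 10.0;
}
```

For the Xeon example, estimate_ns(21, 83.0) yields about 174 ns, noticeably higher than the 139 ns measured above, which illustrates why the distances are best read as relative rankings.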

NUMA Topology Diagram

2-socket server, AMD EPYC (NPS1 mode — 1 NUMA node per socket):

 ┌───────────────────────────────────┐     ┌───────────────────────────────────┐
 │         Socket 0 (NUMA Node 0)    │     │         Socket 1 (NUMA Node 1)    │
 │                                   │     │                                   │
 │  ┌─────────────────────────────┐  │     │  ┌─────────────────────────────┐  │
 │  │ Core 0-95  (96 cores)       │  │     │  │ Core 96-191 (96 cores)      │  │
 │  │  L1 (32KB) / L2 (1MB each)  │  │     │  │  L1 (32KB) / L2 (1MB each)  │  │
 │  └──────────────┬──────────────┘  │     │  └──────────────┬──────────────┘  │
 │                 │                 │     │                 │                 │
 │  ┌──────────────▼──────────────┐  │     │  ┌──────────────▼──────────────┐  │
 │  │  L3 (384MB total, 12 CCDs) │  │     │  │  L3 (384MB total, 12 CCDs) │  │
 │  └──────────────┬──────────────┘  │     │  └──────────────┬──────────────┘  │
 │                 │                 │     │                 │                 │
 │  ┌──────────────▼──────────────┐  │     │  ┌──────────────▼──────────────┐  │
 │  │    Memory Controller        │  │     │  │    Memory Controller        │  │
 │  │  12ch × DDR5 → 384 GB       │  │     │  │  12ch × DDR5 → 384 GB       │  │
 │  └─────────────────────────────┘  │     │  └─────────────────────────────┘  │
 │                                   │     │                                   │
 └──────────────────┬────────────────┘     └────────────────┬──────────────────┘
                    │                                        │
                    │      Infinity Fabric G-IF              │
                    │  (up to 4 links, 64 GB/s each way)     │
                    └────────────────────────────────────────┘

LOCAL access (same socket): ~80 ns
REMOTE access (cross-socket): ~120 ns (+40 ns = 50% penalty)

Single-socket AMD EPYC Genoa in NPS4 mode (4 NUMA nodes per socket):

  NUMA Node 0        NUMA Node 1        NUMA Node 2        NUMA Node 3
  ┌──────────┐       ┌──────────┐       ┌──────────┐       ┌──────────┐
  │ CCD 0-2  │       │ CCD 3-5  │       │ CCD 6-8  │       │ CCD 9-11 │
  │ 24 cores │       │ 24 cores │       │ 24 cores │       │ 24 cores │
  │          │       │          │       │          │       │          │
  │ MC Ch 0-2│       │ MC Ch 3-5│       │ MC Ch 6-8│       │MC Ch 9-11│
  │  96 GB   │       │  96 GB   │       │  96 GB   │       │  96 GB   │
  └────┬─────┘       └────┬─────┘       └────┬─────┘       └────┬─────┘
       │                  │                  │                  │
       └──────────────────┴──────────────────┴──────────────────┘
                          Infinity Fabric (intra-socket)

Cross-node latency:   0,1 adjacent: ~90 ns    0,3 diagonal: ~110 ns
NPS4 benefit: NUMA-aware workloads get better locality within a socket
NPS4 cost: OS must be NUMA-aware to benefit; naive allocations may cross nodes

Linux NUMA APIs

Kernel Memory Policy (`mbind`, `set_mempolicy`, `move_pages`)

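#include <numa.h>       // libnuma bitmask helpers; link with -lnuma
#include <numaif.h>     // mbind, set_mempolicy, move_pages
#include <sys/mman.h>   // mmap

// Prefer node 0 for a fresh anonymous mapping
struct bitmask *node0 = numa_allocate_nodemask();
numa_bitmask_setbit(node0, 0);  // node 0
void *buf = mmap(NULL, size, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
mbind(buf, size, MPOL_PREFERRED, node0->maskp, node0->size + 1,
      MPOL_MF_MOVE);

// Interleave the same range across all NUMA nodes;
// MPOL_MF_MOVE also migrates pages that are already resident
mbind(buf, size, MPOL_INTERLEAVE, numa_all_nodes_ptr->maskp,
      numa_all_nodes_ptr->size + 1, MPOL_MF_MOVE);

// Prefer node 0 for the calling thread's future allocations.
// MPOL_PREFERRED falls back to other nodes when node 0 is full;
// use MPOL_BIND for a strict binding.
set_mempolicy(MPOL_PREFERRED, node0->maskp, node0->size + 1);
numa_bitmask_free(node0);

// Move existing pages to a different NUMA node (EXPENSIVE!)
void *page_ptrs[num_pages]; // page-aligned addresses, filled in by the caller
int nodes[num_pages];       // target NUMA node for each page
int status[num_pages];      // written by the kernel
long rc = move_pages(0 /* calling process */, num_pages, page_ptrs,
                     nodes, status, MPOL_MF_MOVE);
// rc: 0 on success, -1 with errno set on failure
// status[i]: node now holding page i, or a negative errno
//            (-ENOENT if the page is not present, -EFAULT if unmapped, etc.)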
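
The nodemask argument that mbind and set_mempolicy take is a plain array of unsigned long used as a bit set, one bit per node. A minimal sketch of building such a mask by hand, without libnuma, is shown below; the type, the helper names and the 1024-node capacity are assumptions for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define MASK_NODES 1024  /* illustrative capacity, not a kernel constant */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS (MASK_NODES / BITS_PER_WORD)

typedef struct { unsigned long bits[MASK_WORDS]; } node_mask;

/* Clears every node bit. */
static void node_mask_clear(node_mask *m)
{
    memset(m, 0, sizeof *m);
}

/* Sets the bit for one node: word index = node / bits-per-word. */
static void node_mask_set(node_mask *m, int node)
{
    assert(node >= 0 && node < MASK_NODES);
    m->bits[node / BITS_PER_WORD] |= 1UL << (node % BITS_PER_WORD);
}

/* Returns nonzero if the node's bit is set. */
static int node_mask_test(const node_mask *m, int node)
{
    assert(node >= 0 && node < MASK_NODES);
    return (m->bits[node / BITS_PER_WORD] >> (node % BITS_PER_WORD)) & 1UL;
}
```

Such a mask would be passed to the system call as m.bits together with a maxnode of MASK_NODES + 1; the extra one mirrors what libnuma does, because the kernel interprets maxnode as one more than the number of usable bits.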

`numa.h` High-Level Library

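#define _GNU_SOURCE     // for cpu_set_t and sched_setaffinity
#include <numa.h>       // link with -lnuma
#include <sched.h>

// Check if NUMA is available (other numa_* calls are undefined if not)
if (numa_available() == -1) { /* no NUMA */ }

// Get number of nodes
int num_nodes = numa_max_node() + 1;

// Allocate on specific node (wrapper around mmap + mbind)
void *ptr = numa_alloc_onnode(size, 0);  // allocate on node 0
numa_free(ptr, size);

// Set the calling thread's NUMA memory policy (pick one)
numa_set_localalloc();       // default: allocate local to current CPU
numa_set_preferred(1);       // prefer node 1
numa_set_interleave_mask(numa_all_nodes_ptr);  // interleave

// Thread binding: run the calling thread on any CPU of node 0
numa_run_on_node(0);

// Or pin it to the first CPU of node 0
struct bitmask *cpus = numa_allocate_cpumask();
numa_node_to_cpus(0, cpus);
for (unsigned int cpu = 0; cpu < cpus->size; cpu++) {
    if (numa_bitmask_isbitset(cpus, cpu)) {
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(cpu, &cpuset);
        sched_setaffinity(0, sizeof(cpuset), &cpuset);
        break;
    }
}
numa_bitmask_free(cpus);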

Linux First-Touch Policy

The most important NUMA behavior in Linux: under the default policy, physical memory is allocated on the NUMA node of the CPU that first writes to it (not the node of the thread that called mmap or malloc).

// Thread 0 (on CPU 0, NUMA node 0) does:
void *buf = mmap(NULL, 4UL << 30, ...);  // 4 GiB; no physical pages allocated yet (lazy)

// Thread 1 (on CPU 24, NUMA node 1) does:
initialize(buf);  // FIRST TOUCH: writes every page
// All physical pages allocated on NUMA node 1 (Thread 1's node)

// Thread 0 (on CPU 0, node 0) now processes buf:
process_data(buf);  // ALL ACCESSES ARE REMOTE (+40 ns per cache miss)

This is the classic NUMA performance pitfall: initialize data from a thread pool (which may be on any NUMA node), then access from a specific node.

Fix strategies:

Initialize from the processing thread: Have Thread 0 initialize its own portion of the buffer.
numactl --membind: Force allocations to a specific node regardless of first touch.
MPOL_INTERLEAVE: Interleave pages across all nodes — reduces worst-case remote access at the cost of some local access.
move_pages(2): Migrate pages to the desired node after initialization. Expensive (each page is unmapped and copied, and threads touching it block until the copy finishes) but useful when the right placement is known only after the fact.
MADV_HUGEPAGE + transparent huge pages: Huge pages (2 MB) do not change which node holds the data, but they reduce TLB misses and with them the page-table walks that may themselves hit remote memory. Note that first-touch placement then happens in 2 MB units.
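
The first strategy, having each worker initialize the part of the buffer it will later process, can be sketched as follows. The 4 KiB page size, the slice type and the function names are assumptions for illustration; thread creation and CPU pinning are left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES 4096u  /* assumed base page size */

typedef struct { size_t offset; size_t length; } slice;

/* Page-aligned share of a total_bytes buffer for worker tid of nworkers.
 * Slices are contiguous and together cover the whole buffer. */
static slice worker_slice(size_t total_bytes, unsigned nworkers, unsigned tid)
{
    assert(nworkers > 0 && tid < nworkers);
    size_t pages = (total_bytes + PAGE_BYTES - 1) / PAGE_BYTES;
    size_t per = pages / nworkers, extra = pages % nworkers;
    size_t first = tid * per + (tid < extra ? tid : extra);
    size_t count = per + (tid < extra ? 1 : 0);
    size_t begin = first * PAGE_BYTES;
    size_t end = (first + count) * PAGE_BYTES;
    if (begin > total_bytes) begin = total_bytes;  /* clamp the tail */
    if (end > total_bytes) end = total_bytes;
    slice s = { begin, end - begin };
    return s;
}

/* Run by each worker on its own (pinned) CPU: the first write to each
 * page makes the kernel place that page on the worker's node. */
static void first_touch(char *buf, slice s)
{
    memset(buf + s.offset, 0, s.length);
}
```

Each worker then calls first_touch on its slice and later processes that same slice, so initialization and use happen on the same node. Splitting on page boundaries matters because placement is decided per page, not per byte.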

NUMA Performance Pitfalls

1. Unintended Remote Allocation (First-Touch)

As described above. Most common in thread-pool-initialized data structures.

2. Inter-NUMA Lock Contention

// Hot spinlock or mutex physically allocated on NUMA node 0.
// Threads on node 1 must cross the fabric to acquire the lock:
//   - Lock is in remote L3 → +40 ns latency per lock acquisition
//   - Lock line ping-pongs across the fabric: coherence traffic
//   - Every acquire/release = Infinity Fabric transaction

// Solution: use per-NUMA-node locks, reduce global lock usage
// MCS and CLH queue locks cut coherence traffic: each waiter spins on its
// own cache line (which can be node-local) instead of the shared lock word
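
For reference, a minimal MCS queue lock in C11 atomics might look like the sketch below. The type and function names are hypothetical; each thread supplies its own queue node, ideally allocated on its local NUMA node, and spins only on that node's flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* successor in the wait queue */
    atomic_bool locked;               /* true while this waiter must spin */
} mcs_node;

typedef struct { _Atomic(mcs_node *) tail; } mcs_lock;

static void mcs_acquire(mcs_lock *lock, mcs_node *me)
{
    assert(lock != NULL && me != NULL);
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Append ourselves to the queue; the previous tail is our predecessor. */
    mcs_node *prev = atomic_exchange_explicit(&lock->tail, me,
                                              memory_order_acq_rel);
    if (prev == NULL)
        return;  /* queue was empty: lock acquired */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    /* Spin on our own flag only, not on the shared lock word. */
    while (atomic_load_explicit(&me->locked, memory_order_acquire))
        ;
}

static void mcs_release(mcs_lock *lock, mcs_node *me)
{
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        mcs_node *expected = me;
        /* No visible successor: try to reset the queue to empty. */
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor is mid-enqueue; wait until it links itself in. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL)
            ;
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}
```

The lock handoff writes one remote cache line (the successor's flag) instead of making every waiter re-read a single contended line, which is what keeps cross-fabric traffic low under contention.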

3. Memory Bandwidth Asymmetry

On a 2-socket system with one socket lightly loaded:

Socket 0: 96 active threads, 192 GB/s memory bandwidth demand
Socket 1: 4 active threads, 5 GB/s memory bandwidth demand

If all data is on node 0 (due to first-touch from node 0 threads):
  Node 1 threads must cross the fabric to access data
  Node 0 memory controllers: 192 GB/s local + 5 GB/s remote demand, near saturation
  Node 1 memory controllers: idle
  Performance: every thread queues behind node 0's saturated memory controllers

Fix: numactl --interleave on the shared data → spreads load across both nodes
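
The effect of the fix can be estimated with a back-of-the-envelope model. The sketch below assumes uniformly random accesses over the data and only two nodes; the struct and function names are hypothetical, and the model ignores queuing effects.

```c
#include <assert.h>

typedef struct {
    double node0_dram_gbs;  /* load on node 0 memory controllers */
    double node1_dram_gbs;  /* load on node 1 memory controllers */
    double fabric_gbs;      /* traffic crossing the socket interconnect */
} numa_traffic;

/* frac_on_node0: share of the data placed on node 0
 * (1.0 = everything on node 0, 0.5 = evenly interleaved). */
static numa_traffic model_traffic(double demand0_gbs, double demand1_gbs,
                                  double frac_on_node0)
{
    assert(frac_on_node0 >= 0.0 && frac_on_node0 <= 1.0);
    double total = demand0_gbs + demand1_gbs;
    numa_traffic t;
    t.node0_dram_gbs = frac_on_node0 * total;
    t.node1_dram_gbs = (1.0 - frac_on_node0) * total;
    /* Socket 0 fetches node 1's share remotely, and vice versa. */
    t.fabric_gbs = (1.0 - frac_on_node0) * demand0_gbs
                 + frac_on_node0 * demand1_gbs;
    return t;
}
```

With the demands above, model_traffic(192, 5, 1.0) puts 197 GB/s on node 0 and 5 GB/s on the fabric, while model_traffic(192, 5, 0.5) puts 98.5 GB/s on each node but also sends 98.5 GB/s across the fabric. Under this model, interleaving pays off only when the interconnect has that much headroom.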

4. NUMA in Virtualization (vNUMA)

A virtual machine must see a NUMA topology that matches (or at least doesn't contradict) the physical hardware NUMA topology. If a VM spans multiple physical NUMA nodes but the guest OS thinks it has UMA, the guest will not use NUMA-optimized code paths, and performance will be poor.

VMware vSphere and KVM both expose NUMA topology to guests (vNUMA). QEMU/KVM parameters:

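# Expose 2-node NUMA topology to guest
# (recent QEMU versions require memdev= backends instead of the legacy mem= option)
qemu-system-x86_64 \
  -smp 64 -m 256G \
  -object memory-backend-ram,id=ram0,size=128G \
  -object memory-backend-ram,id=ram1,size=128G \
  -numa node,nodeid=0,cpus=0-31,memdev=ram0 \
  -numa node,nodeid=1,cpus=32-63,memdev=ram1 \
  -numa dist,src=0,dst=1,val=21 \
  ...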

Debugging NUMA Issues

# Per-NUMA-node memory usage
numastat -m

# Per-process NUMA memory statistics
numastat -p <pid>
# Output shows:
#                    Node 0    Node 1    Total
#  Huge             0.00      0.00      0.00
#  Heap         10234.23   1234.56  11468.79   ← node 0 heavy = first-touch on node 0
#  Stack            2.34      1.23      3.57
#  Private      50234.00   4234.00  54468.00   ← very lopsided!

# System-wide NUMA hit/miss statistics
grep numa /proc/vmstat
# numa_hit: pages allocated on the node the allocation intended
# numa_miss: pages allocated on this node although another node was preferred
# numa_foreign: pages intended for this node but allocated on another node
# numa_interleave: interleave-policy pages allocated on the intended node
# numa_local: pages allocated on this node by a process running on it
# numa_other: pages allocated on this node by a process running on another node
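
These counters are most useful as a ratio. A minimal sketch of pulling two of them out of vmstat-formatted text and computing the miss percentage is shown below; the function names are hypothetical, and reading /proc/vmstat into a buffer is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Finds a "<name> <value>" line in vmstat-formatted text.
 * Returns 1 and stores the value on success, 0 if the name is absent. */
static int vmstat_lookup(const char *text, const char *name,
                         unsigned long long *value)
{
    assert(text != NULL && name != NULL && value != NULL);
    size_t len = strlen(name);
    const char *line = text;
    while (line != NULL && *line != '\0') {
        /* Require a space after the name so "numa_hit" does not match
         * a longer counter that merely starts with it. */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return sscanf(line + len, "%llu", value) == 1;
        line = strchr(line, '\n');
        if (line != NULL)
            line++;  /* move to the start of the next line */
    }
    return 0;
}

/* Share of allocations that landed off the intended node, in percent. */
static double numa_miss_pct(unsigned long long hit, unsigned long long miss)
{
    unsigned long long total = hit + miss;
    return total ? 100.0 * (double)miss / (double)total : 0.0;
}
```

A miss percentage that keeps climbing while a workload runs suggests that the preferred node is out of free memory and allocations are spilling to other nodes.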

# Intel MLC latency/bandwidth measurement
mlc --latency_matrix   # full latency between all node pairs
mlc --bandwidth_matrix  # peak bandwidth between all pairs
mlc --idle_latency      # idle latency (no contention)
mlc --loaded_latency    # latency under bandwidth contention

Security Implications

NUMA-aware microarchitectural attacks: Cache timing attacks such as Flush+Reload work best when attacker and victim share a last-level cache, which in practice means the same socket or NUMA node (lower latency, cleaner timing). Cross-NUMA attacks are harder because of the higher noise floor.
Memory isolation in multi-tenant systems: Cloud providers must ensure that tenants do not share physical pages (for example through page deduplication), and should be aware that co-locating tenants on one NUMA node exposes shared caches and memory controllers whose timing can leak information.
Denial of service via bandwidth flooding: An attacker on NUMA node 1 who saturates cross-fabric bandwidth can degrade a victim on NUMA node 0 whose data happens to be on node 1 (remote DRAM accesses become even slower).
Page migration timing channel: The time to migrate a page from one NUMA node to another (move_pages) may depend on whether the page's contents are currently cached, which would leak a small amount of information. A minor channel at most, but potentially measurable.

Performance Implications

NUMA-Aware Data Structures

Linux kernel uses per-NUMA-node data structures extensively:

// Per-NUMA-node allocator (slab/slub)
// Each NUMA node has its own free lists — allocation from local node
// avoids cross-fabric access

// NUMA-aware scheduling
// Runqueues are per-CPU and grouped into scheduling domains that mirror
// the NUMA topology. With automatic NUMA balancing (CONFIG_NUMA_BALANCING)
// the scheduler also prefers to run a task on the node holding its memory

Automatic NUMA Balancing (`/proc/sys/kernel/numa_balancing`)

Linux's automatic NUMA balancing periodically scans page table mappings, marks pages as inaccessible, and on the next access determines which NUMA node the faulting CPU is on. If a page is consistently accessed from a remote node, the kernel migrates it to the local node.

# Enable/disable automatic NUMA balancing
echo 1 > /proc/sys/kernel/numa_balancing  # enable
echo 0 > /proc/sys/kernel/numa_balancing  # disable

# View scan tuning and migration statistics
# (newer kernels expose the scan tunables under /sys/kernel/debug/sched/numa_balancing/)
cat /proc/sys/kernel/numa_balancing_scan_period_min_ms
grep numa_pages_migrated /proc/vmstat

Auto-NUMA balancing is beneficial for steady-state workloads. It can add overhead for workloads with irregular access patterns (migration churn) or those that already pin memory explicitly.

Modern Usage

Kubernetes NUMA-Aware Scheduling

The Kubernetes topology-manager (GA in K8s 1.27+) coordinates CPU and memory affinity:

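# Kubelet configuration file for NUMA topology alignment
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static               # needed for exclusive, NUMA-aligned CPUs
topologyManagerPolicy: best-effort     # or restricted / single-numa-node
topologyManagerScope: container        # or pod

With the single-numa-node policy, the kubelet admits a pod only if all requested CPUs and devices (and memory, when the Memory Manager is enabled) can be satisfied from a single NUMA node; otherwise the pod is rejected on that node. This prevents cross-NUMA allocations for latency-sensitive workloads.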
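
The core of the single-numa-node check can be pictured as a search for one node that satisfies the whole request. The sketch below is a deliberately simplified model, not the kubelet's actual hint-merging algorithm; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int free_cpus;       /* exclusive CPUs still available on this node */
    long free_mem_mib;   /* allocatable memory on this node, in MiB */
} numa_node_free;

/* Returns the index of the first node that can satisfy the entire
 * request on its own, or -1 if the request would have to span nodes. */
static int single_node_fit(const numa_node_free *nodes, int n,
                           int cpus, long mem_mib)
{
    assert(nodes != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        if (nodes[i].free_cpus >= cpus && nodes[i].free_mem_mib >= mem_mib)
            return i;
    }
    return -1;
}
```

Note that a request can be rejected even when the machine as a whole has enough free resources: 20 CPUs do not fit if they are split 4 and 16 across two nodes.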

Database NUMA Configuration

PostgreSQL: huge_pages = on in postgresql.conf, combined with starting the server under numactl --interleave=all, is a common configuration for large shared_buffers. Interleaving keeps the shared memory segment from exhausting a single node.

MySQL/InnoDB: innodb_numa_interleave = ON enables interleaved allocation for the buffer pool.

Redis: numactl --interleave=all redis-server is the common deployment recommendation for large datasets.

Future Directions

CXL Memory Expansion as NUMA Nodes: CXL 2.0 Type 3 devices (memory expansion cards) appear to the OS as additional, CPU-less NUMA nodes. The latency is higher than local DRAM (~300-400 ns is quoted for some devices; direct-attached expanders can be lower) and they can be tiered: "hot" data stays in DRAM NUMA nodes while "warm" data is demoted to CXL nodes by the kernel's tiered memory management (Linux memory tiering, optionally guided by DAMON).
Sub-NUMA Clustering Universality: AMD's NPS4 and Intel's SNC4 are increasingly common in modern deployments. OS schedulers and NUMA-aware runtimes (for example the JVM, whose G1 collector supports NUMA-aware allocation) increasingly need to handle these within-socket NUMA configurations.
Computational Storage as NUMA: NVMe SSDs with in-drive compute (Samsung SmartSSD, NGD Catalina) can appear as NUMA nodes with extremely high-latency but extremely high-bandwidth local storage. Research into OS abstractions for this is active.
HBM + DDR hybrid NUMA (Intel Sapphire Rapids HBM): Intel Xeon Max (Sapphire Rapids) integrates HBM2e on-package (64 GB, ~900 GB/s bandwidth) + off-package DDR5. Configurable as: (a) flat mode (2 NUMA nodes: HBM node + DDR node, OS manages placement), (b) cache mode (HBM acts as L4 cache for DDR, transparent to OS). Flat mode gives applications explicit control over ultra-fast HBM vs large DDR.
RISC-V multi-socket NUMA: As RISC-V scales to server class (SiFive, Ventana, Sophgo), RISC-V NUMA topologies will emerge. The RISC-V platform specification is developing NUMA topology discovery standards (ACPI SRAT equivalent for RISC-V).

Exercises

First-touch measurement: Write a C program that allocates 1 GB via mmap, then benchmarks sequential reads. Run it three ways: (a) single-threaded (allocate and read on same thread), (b) initialize from a thread bound to a different NUMA node then read from node 0, (c) initialize with interleave policy. Use numastat to verify physical page placement in each case and mlc to confirm the latency difference.
NUMA balancing trace: Record page-migration tracepoints with perf (for example, perf record -e migrate:mm_migrate_pages -a). Run a NUMA-imbalanced workload (initialize on node 0, access from node 1). Observe automatic NUMA balancing migrating pages. Measure the performance before and after balancing converges (allow 30-60 seconds for the balancer to act).
Per-NUMA-node data structure design: Implement a counter that is incremented by all threads concurrently, optimized for NUMA. Use __thread (thread-local storage) for the per-core increment, a per-NUMA-node aggregator (padded to 64 bytes), and a global total. Benchmark vs a naive atomic counter. What is the speedup on a 2-socket system with 48 threads?
NUMA latency matrix measurement: Write a microbenchmark using pointer chasing (linked list traversal, defeats prefetcher) that can be bound to any combination of (compute node, memory node) via mbind + sched_setaffinity. Measure all 4 combinations on a 2-socket system. Build your own latency matrix and compare to numactl --hardware distances.
CXL tiered memory simulation: Using numactl and separate NUMA nodes (or a VM with multiple NUMA nodes configured in QEMU), simulate a 2-tier memory hierarchy (fast tier: node 0, slow tier: node 1). Implement a software tier-management policy that promotes "hot" pages (tracked via soft dirty bit in page tables) to the fast tier and demotes cold pages. Compare throughput to a naive all-node-0 baseline.
