NUMA Memory Architecture

Technical Overview

NUMA (Non-Uniform Memory Access) describes a memory architecture where multiple processors each have local memory attached directly to them, and all processors can access all memory — but remote memory accesses (to another processor's local memory) are significantly slower than local accesses. The "non-uniform" refers to the access time, not the address space — the system presents a single physical address space.

NUMA became important for servers once memory controllers moved from a shared northbridge onto the CPU die (Intel Nehalem, 2008). Today, a typical 2-socket server has two NUMA nodes; a 4-socket server has four. AMD's EPYC line is extreme: a single EPYC CPU with multiple CPU chiplets (CCDs) and a memory controller chiplet (IOD) forms a NUMA-like topology within a single socket — an AMD EPYC 7763 (64-core, 1 socket) may have 4 internal NUMA nodes.

NUMA latency differences: - Local DRAM: ~70–90 ns - Remote DRAM (1 hop): ~140–180 ns - Remote DRAM (2 hops, 4-socket): ~250–350 ns

At 64 bytes per cache line and 100 ns latency difference, the throughput impact of consistently accessing remote memory on a bandwidth-bound workload is 2–4x.

Prerequisites

Physical memory structure and NUMA nodes
Buddy allocator zones (06-buddy-allocator.md)
Virtual memory and mmap (01-virtual-memory.md, 09-mmap.md)
Process affinity and CPU topology (sched_setaffinity)

Core Content

NUMA Topology Detection

The kernel discovers NUMA topology from the ACPI SRAT (System Resource Affinity Table) and SLIT (System Locality Information Table):

NUMA Topology Detection Chain
================================

BIOS/UEFI firmware:
  Provides ACPI tables:
    SRAT: maps physical memory ranges to proximity domains (NUMA nodes)
    SLIT: provides distance matrix between nodes
    MSCT: maximum system characteristics (socket count, cores)

Linux boot:
  acpi_numa_init() [drivers/acpi/numa/srat.c]
    → Parses SRAT: builds node_to_pxm_map[], pxm_to_node_map[]
    → Registers memory ranges: memblock_set_node()

  setup_node_data() → initialize pg_data_t per node
  numa_init() → build zonelist, populate free_area per node

Runtime topology access:
  /sys/devices/system/node/node0/
    cpumap       → CPUs on this node
    distance     → distance to other nodes (SLIT values)
    meminfo      → memory info for this node
    numastat     → allocation statistics

  numactl --hardware
  → Shows nodes, CPUs per node, distances, and available memory

Example (2-socket Intel, 2 NUMA nodes):
  node 0: cpus 0-23, mem 192GB, distance from node0=10, from node1=21
  node 1: cpus 24-47, mem 192GB, distance from node0=21, from node1=10

Distance values: 10 = local (always), higher = less local. Distance 21 means accessing node 1 from node 0 takes 2.1x the local latency (SLIT is a relative scale).

Linux NUMA Nodes in sysfs

# Number of NUMA nodes
ls /sys/devices/system/node/ | grep node | wc -l

# CPUs on each node
cat /sys/devices/system/node/node0/cpulist  # e.g., 0-23
cat /sys/devices/system/node/node1/cpulist  # e.g., 24-47

# Distance matrix
numactl --hardware
# node distances:
# node   0   1
#   0:  10  21
#   1:  21  10

# Memory per node
cat /sys/devices/system/node/node0/meminfo | grep MemTotal

# Per-node allocation stats
cat /sys/devices/system/node/node0/numastat
# numa_hit:      1234567  ← allocations that got local memory
# numa_miss:     89012    ← allocations that got remote memory (bad)
# numa_foreign:  89012    ← allocations intended for this node but went remote
# local_node:    1234567  ← allocs by local CPU
# other_node:    89012    ← allocs by remote CPU

numactl Tool

# Run process with memory bound to node 0
numactl --membind=0 ./database_server

# Run process with CPUs on node 1, memory interleaved across nodes
numactl --cpunodebind=1 --interleave=all ./analytics_job

# Run process preferring node 0 (falls back to other nodes if needed)
numactl --preferred=0 ./app

# Query NUMA topology
numactl --hardware

# Show per-node memory stats
numactl --show

# Interleave memory for bandwidth-bound workloads
numactl --interleave=all ./benchmark
# Note: interleave increases bandwidth but loses locality for latency-sensitive ops

Interleaving (MPOL_INTERLEAVE) allocates pages round-robin across NUMA nodes. Useful for workloads that access all data equally and need maximum aggregate bandwidth (scientific computing, BLAS routines).

NUMA-Aware Allocation in the Kernel

/* Allocate pages on a specific NUMA node */
struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order);

/* Allocate from the current task's preferred node */
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);
// Equivalent to: alloc_pages_node(numa_node_id(), gfp_mask, order)

/* Allocate with a specific NUMA policy (from mempolicy) */
struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
                              struct vm_area_struct *vma, unsigned long addr,
                              int node, bool hugepage);

/* kmalloc on a specific node */
void *kmalloc_node(size_t size, gfp_t flags, int node);

The NUMA policy for a process is stored in struct task_struct->mempolicy and struct vm_area_struct->vm_policy. Available policies (include/uapi/linux/mempolicy.h):

MPOL_DEFAULT:       Allocate from the node of the requesting CPU (first-touch)
MPOL_BIND:          Must allocate from specified node(s) (fails if unavailable)
MPOL_PREFERRED:     Prefer specified node; fall back to others if needed
MPOL_INTERLEAVE:    Round-robin across specified nodes
MPOL_LOCAL:         Always allocate from the local node (like DEFAULT but more explicit)

First-Touch Policy

The default NUMA allocation policy is first-touch: the page is allocated on the NUMA node of the CPU that first touches (faults in) the page. This is an implicit locality optimization — if a thread initializes its own data, the data ends up local to that thread's CPU.

First-Touch in Practice:
  Thread on CPU 4 (node 0): mmap 1GB anonymous region
  → VMA created, no physical pages yet

  Thread on CPU 4 faults page at offset 0:
  → alloc_pages(GFP_HIGHUSER, 0) called from do_anonymous_page()
  → numa_node_id() = 0 (CPU 4 is on node 0)
  → Page allocated from node 0

  Thread on CPU 28 (node 1) reads same page:
  → PTE already installed → node 0 page → REMOTE ACCESS (slower)

Implication: always initialize data on the CPU that will use it.
OpenMP parallel loops must be written to initialize and compute on the same core:
  #pragma omp parallel for  // this distributes initialization across threads
  for (int i = 0; i < N; i++) array[i] = 0.0;
  // Now each page is local to the thread that first touched it

Memory Migration: migrate_pages()

NUMA balancing and explicit migration move pages from one NUMA node to another:

/* Migrate pages to a different node */
long migrate_pages(unsigned long start, unsigned long nr_pages,
                   int *status, int *nodes, int *sources);

/* Kernel internal: mm/migrate.c */
int migrate_pages(struct list_head *from, new_page_t get_new_page,
                  free_page_t put_new_page, unsigned long private,
                  enum migrate_mode mode, int reason, unsigned int *ret_succeeded);

Migration steps: 1. Isolate the page from LRU lists 2. Allocate a new page on the target node 3. Copy page content (or unmap and wait for page to be clean) 4. Update all PTEs pointing to old page → new page (requires mm_lock) 5. Free the old page

Cost: one page copy (~4 µs for 4 KB page) + TLB shootdown. For a process with a 10 GB working set on the wrong NUMA node, migration takes ~40 ms but is amortized over the lifetime of the process.

NUMA Balancing (Automatic Migration)

Linux's automatic NUMA balancing (CONFIG_NUMA_BALANCING, enabled by default) migrates pages toward the CPUs that access them most:

NUMA Balancing Mechanism
==========================

1. Periodically, kernel sets "NUMA hinting faults" for anonymous pages:
   PTEs are temporarily cleared (protection = PROT_NONE) for candidate pages
   → Next access generates a page fault

2. On fault: task_numa_fault() records which NUMA node faulted:
   - Which node did the fault occur on? (CPU's node)
   - Which node is the page currently on?

3. If mismatch (task on node X, page on node Y):
   a. Record this as a "NUMA miss"
   b. After threshold misses: schedule_numa_work() → migrate page to node X

4. task_numa_work() also migrates the task toward its data:
   sched_setnuma() moves the task to the node where its data lives
   (if moving task is cheaper than migrating all the data)

Control:
  /proc/sys/kernel/numa_balancing     # 0=off, 1=on
  /proc/sys/kernel/numa_balancing_scan_period_min_ms  # default 1000ms
  /proc/sys/kernel/numa_balancing_scan_period_max_ms  # default 60000ms
  /proc/sys/kernel/numa_balancing_scan_size_mb        # default 256MB

Monitor:
  /proc/PID/sched (numa_faults, numa_pages_migrated)
  perf stat -e numa:* ./myapp

NUMA balancing has overhead: the hinting faults cause extra page faults. The cost is typically 1–3% CPU overhead. Applications that are already NUMA-bound (numactl --membind) should disable NUMA balancing for them.

NUMA Effects on Performance

NUMA Performance Impact Analysis
==================================

Memory access patterns and NUMA:

1. All-local (ideal): task and data on same node
   Latency: ~80 ns
   Bandwidth: ~80 GB/s per node (dual-channel DDR5)

2. All-remote (worst case):
   Latency: ~160 ns (2x slower)
   Bandwidth: ~40 GB/s (limited by inter-socket QPI/UPI)
   CPU time increase: 20-50% for memory-bound workloads

3. Partially remote (typical with NUMA balancing inactive):
   Mixed impact; depends on hot:cold ratio

4. Interleaved (bandwidth-optimized):
   Latency: ~120 ns average (both local and remote)
   Bandwidth: ~160 GB/s aggregate (both nodes)
   Use case: single large array accessed sequentially (BLAS, memcpy)

Measurement tools:
  numastat -p $(pidof app)  # per-node allocations for process
  perf stat -e node-loads,node-load-misses ./app  # NUMA-aware PMU events
  Intel PCM (pcm-memory.x) — per-socket memory bandwidth measurement

NUMA in Databases

PostgreSQL NUMA issues: PostgreSQL's shared buffers are allocated at startup by one process (postmaster). With NUMA, all shared buffer pages end up on the node where postmaster runs, while worker processes on the other node access them remotely. Fix:

# PostgreSQL doesn't natively support NUMA for shared buffers.
# Workaround: interleave the shared buffer allocation:
numactl --interleave=all /usr/bin/postgres -D /var/lib/postgres/data

MySQL/InnoDB: Similar issue. InnoDB buffer pool is allocated at startup. numactl interleave is the standard recommendation.

Oracle RAC: Oracle Real Application Clusters is NUMA-aware. Uses NUMA_ARCHITECTURE=YES and allocates each SGA buffer on the local node for the instance running on that socket.

MongoDB: mongod uses --numa flag (deprecated; now numactl is recommended). WiredTiger (storage engine) is NUMA-aware and allocates caches on local nodes when run with proper NUMA binding.

NUMA in JVM

The JVM has built-in NUMA awareness (-XX:+UseNUMA):

JVM NUMA-aware allocation:
  -XX:+UseNUMA
  -XX:+UseParallelGC or -XX:+UseG1GC  # NUMA-aware GCs

With UseNUMA:
  Young generation is divided into per-node "regions"
  Each thread allocates from its local node's region
  After GC, survivor objects are migrated toward their accessing threads' nodes
  Old generation is interleaved (objects may be accessed from any node)

Result:
  30-40% improvement in young gen GC allocation throughput on NUMA systems
  Reduced false sharing in TLABs (Thread-Local Allocation Buffers)

NUMA Tuning for High-Performance Systems

# Disable NUMA balancing for latency-sensitive, pre-bound workloads
echo 0 > /proc/sys/kernel/numa_balancing

# Bind process to local memory and CPUs
numactl --cpunodebind=0 --membind=0 ./database

# Verify binding is working
numastat -p $(pidof database)  # check numa_hit >> numa_miss

# For HPC: interleave for all-memory-access workloads
numactl --interleave=all ./simulation

# Set default memory policy for all new processes (dangerous)
# via /proc/sys/kernel/numa_balancing_migrate_deferred

# Transparent huge pages with NUMA:
# THP allocation respects NUMA policy (numactl --membind)
# Check: grep AnonHugePages /sys/devices/system/node/node0/meminfo

# IRQ affinity for network NUMA:
# Bind NIC interrupts to CPUs on same node as NIC's PCIe slot
# This keeps DMA buffers and interrupt processing on the same NUMA node
cat /proc/interrupts | grep eth0
echo "0-11" > /proc/irq/23/smp_affinity_list  # bind eth0 irq to node0 CPUs

Historical Context

NUMA architectures became common in servers with the introduction of the NUMA-Q architecture (Sequent, 1996) and SGI Origin series (1996). Early NUMA workstations (DEC Alpha, 1994) had separate memory banks per CPU cluster. The inflection point for mainstream servers was Intel's Nehalem architecture (2008), which moved the memory controller onto the CPU die and connected sockets via QPI (QuickPath Interconnect). AMD's HyperTransport (2003) did the same for Opteron. Linux NUMA support (mm/mempolicy.c) was significantly improved for Nehalem-era systems in Linux 2.6.18–2.6.22. Automatic NUMA balancing was added in Linux 3.8 (2013).

Production Examples

30% PostgreSQL improvement on NUMA: A production database migration from 2-socket Intel E5-2690 to 4-socket Intel E7-4870 (more cores, more NUMA nodes) actually DECREASED performance. Root cause: PostgreSQL shared buffers allocated on node 0 by postmaster; all 4 nodes' worker processes accessed them remotely. Fix: numactl --interleave=all postgres restored performance to better than baseline.

Redis latency spikes on AMD EPYC: An AMD EPYC 7742 has 8 internal NUMA nodes (CCDs). Redis, running single-threaded, allocated its hash table pages on node 0 but its key-space was accessed by connections arriving on CPUs across all nodes. Latency was bimodal: 50µs for local access, 200µs for remote. Fix: pin Redis to a single CCD (numactl --cpunodebind=0 --membind=0).

Java GC slowdown on NUMA: A Java application with -Xmx64g on a 4-socket machine showed G1GC mixed collection pauses of 500ms. Profiling revealed that G1GC's evacuation phase was copying objects across NUMA nodes (old gen region happened to be on node 2, survivor region on node 0). Fix: -XX:+UseNUMA + dedicated per-node G1 regions.

Debugging Notes

# Overall NUMA stats per node
numastat

# Process-level NUMA stats
numastat -p $(pidof postgres)
# Shows per-node allocation counts: should have numa_hit >> numa_miss

# Kernel NUMA event counters
grep -E "numa|migration" /proc/vmstat
# numa_pte_updates: hinting fault PTEs modified
# numa_huge_pte_updates: hinting faults on huge pages
# numa_hint_faults: actual hint faults taken
# numa_hint_faults_local: faults where page was already local
# numa_pages_migrated: pages migrated by NUMA balancing

# Check if NUMA balancing is doing useful work
# (hint_faults_local / hint_faults should be trending toward 1.0)
awk '/numa_hint_faults/{f=$2} /numa_hint_faults_local/{l=$2} END{print l/f}' /proc/vmstat

# Hardware-level NUMA miss measurement (requires Intel PCM or AMD uProf)
pcm-memory.x 1 -csv=numa_stats.csv  # Intel PCM

# perf for NUMA events
perf stat -e LLC-load-misses,LLC-loads,\
             mem_load_retired.l3_miss,\
             mem_load_retired.local_pmm,\
             offcore_requests.all_requests \
          -- numastat -p $PID

# Check distance matrix
numactl --hardware | grep -A5 "node distances"

Security Implications

NUMA side channels: NUMA topology can be used as a side channel. An attacker on the same physical machine can infer which NUMA node a victim's data is on by measuring memory access latency. Combined with knowledge of NUMA node-to-CPU mapping, this narrows the physical location of cryptographic keys in memory.

NUMA and VM isolation: On cloud hosts with multiple VMs, VMs from different tenants may share a NUMA node. Cache timing attacks (Flush+Reload, Prime+Probe) are possible between VMs on the same NUMA node. NUMA-level tenant separation is a security boundary some providers enforce.

Row hammer across NUMA: DRAM row hammer attacks target physically adjacent DRAM rows. On NUMA systems, the attacker must be on the same NUMA node to hammer rows in that node's DRAM. Cross-NUMA hammering is ineffective.

Performance Implications

Band-limited workloads: A single NUMA node's memory bandwidth is typically 50–100 GB/s (4-channel DDR5). For applications exceeding this, NUMA interleaving doubles available bandwidth (but at the cost of 2x latency for all accesses).
NUMA-aware thread pools: Thread pool frameworks (Intel TBB, OpenMP) support NUMA domains. Workers are created on specific NUMA nodes, and work items are dispatched to workers whose local memory holds the relevant data.
Transparent page faults with NUMA: Every anonymous page fault is on a specific NUMA node. NUMA misses can be observed with perf mem record --type=load and perf mem report.

Failure Modes and Real Incidents

NUMA amnesia after fork: An application calls numactl --membind=0, then forks. The child inherits the NUMA policy. But if the parent then calls execve, the new binary starts with the process's mempolicy reset to MPOL_DEFAULT. Data allocated before exec stays on the original node; new data follows default policy. Mixed NUMA locality results.

NUMA imbalance in containers: Kubernetes doesn't natively support NUMA topology in its scheduler. A pod may be scheduled on CPUs spanning two NUMA nodes (e.g., CPUs 0–3 on node 0 and CPUs 24–27 on node 1 in a 4-CPU pod). The container then has split memory locality. Kubernetes 1.18+ supports the TopologyManager with single-numa-node policy to prevent this.

Memory migration storm during NUMA rebalancing: Linux's automatic NUMA balancing can trigger mass page migration when a process moves between NUMA nodes (after taskset or load balancing). Migrating 10 GB of working set takes ~40 seconds, during which performance is degraded (migration competes with allocation, TLB shootdowns occur). Mitigation: disable NUMA balancing for long-lived, carefully bound processes.

Modern Usage

CXL (Compute Express Link): CXL 2.0 memory expanders attach additional DRAM to CPUs via PCIe-derived interface. They appear as NUMA nodes with higher latency (300–500 ns). Linux 5.18+ has initial CXL NUMA node support. Applications must be NUMA-aware to use CXL memory effectively (for warm/cold data tiering).
Optane PMEM as NUMA node: Intel Optane Persistent Memory appears as a NUMA node with higher latency and lower bandwidth than DRAM. Linux memory_mode exposes it as a standard NUMA node; app_direct_mode exposes it as a DAX device. Per-VMA mbind() can place cold data on PMEM while keeping hot data on DRAM.
NUMA in eBPF: eBPF programs can query the NUMA node of a page via bpf_skb_get_nlattr() and related helpers, enabling NUMA-aware eBPF-based load balancers and memory monitors.

Future Directions

Memory tiering: Combining DRAM (NUMA node 0), HBM (High Bandwidth Memory, node 1), and CXL-attached memory (node 2) in a single system. The kernel's tiered memory management patches (Huang et al., 2022) automatically migrate pages between tiers based on access frequency.
ACPI HMAT (Heterogeneous Memory Attribute Table): Successor to SLIT for complex heterogeneous memory topologies. Provides bandwidth and latency for each memory type, not just distance.
Persistent NUMA topology: For CXL devices, NUMA topology may change dynamically (hot-plug CXL). Linux's memory hotplug + NUMA node addition/removal paths must handle this.

Exercises

On a 2-NUMA-node system, write a benchmark that allocates 8 GB and accesses it from threads pinned to node 0 vs node 1. Measure the bandwidth and latency difference.
Start PostgreSQL without numactl. Run a query. Check numastat -p $(pidof postgres). Then restart with numactl --interleave=all and repeat. Compare numa_hit and numa_miss ratios.
Implement a NUMA-aware memory allocator that wraps alloc_pages_node() to always allocate from the calling CPU's NUMA node. Test with a multi-threaded producer-consumer benchmark.
Enable NUMA balancing and observe numa_pages_migrated in /proc/vmstat while running a workload that moves threads between nodes (using taskset to force CPU migration).
On a multi-socket machine, measure the difference in memcpy throughput for: (a) src and dst on same node, (b) src on node 0, dst on node 1, (c) interleaved. Use numactl to control placement.
Write a tool that reads /proc/PID/pagemap for each page in a process and queries /sys/devices/system/node/node*/memmap to determine which NUMA node each physical page belongs to. Report the per-node distribution.

References

mm/mempolicy.c — NUMA policy implementation, mbind(), set_mempolicy()
mm/migrate.c — migrate_pages(), NUMA page migration
kernel/sched/fair.c — NUMA balancing, task_numa_work(), task_numa_fault()
include/linux/mempolicy.h — MPOL_* policy constants
drivers/acpi/numa/srat.c — ACPI SRAT parsing
include/uapi/linux/mempolicy.h — userspace NUMA policy API
Linux man pages: numactl(8), mbind(2), set_mempolicy(2), get_mempolicy(2), move_pages(2)
Lameter & Heim, "NUMA Memory Architecture and the Linux Kernel", OLS 2006
LWN: "NUMA in a hurry" — https://lwn.net/Articles/486858/
LWN: "Automatic NUMA balancing" — https://lwn.net/Articles/558579/
Intel Architecture Memory Resources reference manual