NUMA-Aware Scheduling

Technical Overview

Non-Uniform Memory Access (NUMA) fundamentally changes the performance model of multi-socket systems. On a NUMA machine, the time to access a byte of memory depends on which memory controller owns that byte and which CPU is requesting it. A 64-byte cache line fetch from local memory on a dual-socket Intel Xeon system takes approximately 80ns; the same fetch from the remote socket crosses the QPI/UPI interconnect and costs approximately 150-200ns, roughly a 2-2.5x penalty.
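To make the penalty concrete, the following sketch works through the arithmetic using the latency figures quoted above. The function name `blended_latency_ns` and the 50% remote fraction are illustrative assumptions, not measurements.

```python
def blended_latency_ns(local_ns: float, remote_ns: float, remote_fraction: float) -> float:
    """Average memory latency when remote_fraction of accesses hit the remote node."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


LOCAL_NS = 80.0  # local fetch latency quoted in the text

for remote_ns in (150.0, 200.0):  # remote latency range quoted in the text
    ratio = remote_ns / LOCAL_NS
    # Assumption for illustration: half of all accesses land on the remote node.
    average = blended_latency_ns(LOCAL_NS, remote_ns, 0.5)
    print(f"remote={remote_ns:.0f}ns penalty={ratio:.2f}x average at 50% remote={average:.0f}ns")
```

Even a workload that is only half remote pays a substantial average latency increase, which is why placement matters before any other tuning.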

The Linux scheduler's response to NUMA is a two-part system: automatic NUMA balancing (CONFIG_NUMA_BALANCING), which migrates tasks and pages to maximize locality, and the sched_domain hierarchy, which constrains load balancing to respect NUMA boundaries. Understanding both is critical for anyone tuning multi-socket database servers, HPC workloads, or any application that is memory-bandwidth-bound.

The core insight: scheduling and memory placement are inseparable on NUMA systems. The scheduler cannot just ask "which CPU is least loaded?" — it must ask "which CPU is least loaded and has the fastest access to this task's memory pages?"

Prerequisites

01-scheduling-fundamentals.md (multi-core scheduling challenges)
03-linux-cfs.md (CFS runqueue and load balancing)
Understanding of cache hierarchy (L1/L2/L3)
Basic virtual memory concepts (pages, page tables, TLB)

NUMA Hardware Overview

Dual-Socket NUMA System:

┌─────────────────────────────────────────────────────────────────────┐
│                         Socket 0 (Node 0)                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐           │
│  │  Core 0  │  │  Core 1  │  │  Core 2  │  │  Core 3  │           │
│  │  L1: 32K │  │  L1: 32K │  │  L1: 32K │  │  L1: 32K │           │
│  │  L2: 1M  │  │  L2: 1M  │  │  L2: 1M  │  │  L2: 1M  │           │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘           │
│       └──────────────┴─────────────┴─────────────┘                 │
│                          L3: 32MB (shared)                          │
│                    ┌─────────────────────┐                          │
│                    │ Memory Controller 0  │                          │
│                    │  Local DRAM: 128GB   │ ← ~80ns latency         │
│                    └──────────┬──────────┘                          │
└───────────────────────────────┼─────────────────────────────────────┘
                                │ UPI/QPI Interconnect (~100ns)
┌───────────────────────────────┼─────────────────────────────────────┐
│                         Socket 1 (Node 1)                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐           │
│  │  Core 4  │  │  Core 5  │  │  Core 6  │  │  Core 7  │           │
│  │  L1: 32K │  │  L1: 32K │  │  L1: 32K │  │  L1: 32K │           │
│  │  L2: 1M  │  │  L2: 1M  │  │  L2: 1M  │  │  L2: 1M  │           │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘           │
│       └──────────────┴─────────────┴─────────────┘                 │
│                          L3: 32MB (shared)                          │
│                    ┌─────────────────────┐                          │
│                    │ Memory Controller 1  │                          │
│                    │  Local DRAM: 128GB   │ ← ~80ns latency (local) │
│                    └─────────────────────┘   ~170ns from Socket 0  │
└─────────────────────────────────────────────────────────────────────┘

NUMA distances (from `numactl --hardware` on real systems):
  node 0 0: 10   (local, normalized)
  node 0 1: 21   (remote, 2.1x penalty)
  node 1 0: 21
  node 1 1: 10

On larger topologies, such as 4-socket Xeon systems or 2-socket EPYC systems that expose several NUMA nodes per socket, multi-hop distances can reach 32-38.
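The distance table can be turned into relative penalties with a few lines of code. This is a minimal sketch that assumes one whitespace-separated row per node, which is the layout of each /sys/devices/system/node/nodeN/distance file; the sample rows simply repeat the two-node values shown above, and the helper names are hypothetical.

```python
def parse_distance_rows(rows):
    """Turn rows such as "10 21" into a square matrix of integers."""
    matrix = [[int(tok) for tok in row.split()] for row in rows if row.strip()]
    if any(len(row) != len(matrix) for row in matrix):
        raise ValueError("distance matrix must be square")
    return matrix


def relative_penalty(matrix):
    """Divide every entry by the local (diagonal) distance of its row."""
    return [[dist / row[i] for dist in row] for i, row in enumerate(matrix)]


# Sample input: the two-node distances from the listing above.
SAMPLE_ROWS = ["10 21", "21 10"]

matrix = parse_distance_rows(SAMPLE_ROWS)
for node, row in enumerate(relative_penalty(matrix)):
    print(f"node {node}:", " ".join(f"{p:.1f}x" for p in row))
```

On a real system the rows could be read from the sysfs files instead of the hard-coded sample. The distances are firmware-provided hints, so they describe relative cost rather than measured latency.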

sched_domain Hierarchy

The scheduler models the machine topology as a hierarchy of scheduling domains. Each domain defines a set of CPUs that share a resource (cache, memory controller, interconnect) and has policies governing how aggressively to balance within it.

sched_domain Hierarchy (4-core, 2-socket, HT-enabled system):

System Domain (all CPUs 0-7)
├─ flags: SD_NUMA, load balance across NUMA nodes
├─ imbalance_pct: 125 (tolerate 25% imbalance before migrating)
├─ interval: 64ms (rebalance every 64ms if needed)
│
├─ NUMA Node 0 MC Domain (CPUs 0-3)   [cores sharing the LLC]
│   ├─ flags: SD_SHARE_PKG_RESOURCES (shared L3)
│   ├─ imbalance_pct: 117
│   ├─ interval: 8ms
│   │
│   ├─ Core 0 SMT Domain (CPUs 0, 1)   [HT siblings]
│   │   ├─ flags: SD_SHARE_CPUCAPACITY (shared core pipeline, L1/L2)
│   │   ├─ imbalance_pct: 110
│   │   ├─ CPU 0 (HT thread 0)
│   │   └─ CPU 1 (HT thread 1)
│   │
│   └─ Core 1 SMT Domain (CPUs 2, 3)   [HT siblings]
│       ├─ CPU 2
│       └─ CPU 3
│
└─ NUMA Node 1 MC Domain (CPUs 4-7)
    ├─ Core 2 SMT Domain (CPUs 4, 5)
    └─ Core 3 SMT Domain (CPUs 6, 7)

Domain flags (key ones):
- SD_BALANCE_NEWIDLE: balance when a CPU becomes idle
- SD_BALANCE_EXEC: balance on exec() system call
- SD_BALANCE_FORK: balance on fork()
- SD_NUMA: this is a cross-NUMA domain; be conservative about migration
- SD_SHARE_CPUCAPACITY: CPUs are SMT siblings that share one core's execution resources and L1/L2 caches; preferred for short tasks
- SD_SHARE_PKG_RESOURCES: CPUs share the LLC (same socket); next preference (renamed SD_SHARE_LLC in recent kernels)

Imbalance tolerance: Higher imbalance_pct means the scheduler accepts larger load imbalances before migrating. NUMA domains have higher imbalance tolerance (25%) than cache-sharing domains (10%) to avoid costly cross-NUMA migrations.
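The threshold can be read as a simple percentage comparison. The sketch below is a simplified model of that check, not the kernel's code: the function name is hypothetical, the loads are made-up numbers, and the per-level percentages are the example values from the hierarchy above.

```python
def exceeds_imbalance(busiest_load: int, local_load: int, imbalance_pct: int) -> bool:
    """True when the busiest load is more than imbalance_pct percent of the local load."""
    return 100 * busiest_load > imbalance_pct * local_load


# Example thresholds taken from the hierarchy shown earlier.
LEVELS = {"SMT": 110, "MC": 117, "NUMA": 125}

# Hypothetical loads: the busiest group carries 20% more load than the local one.
for level, pct in LEVELS.items():
    print(f"{level:4s} imbalance_pct={pct}: migrate={exceeds_imbalance(1200, 1000, pct)}")
```

With a 20% imbalance the inner levels would rebalance while the NUMA level would not, which is the conservative behavior the text describes.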

Load Balancing: rebalance_domains

Load balancing is driven by scheduler_tick(), which runs on every timer interrupt but raises the balancing softirq only once the runqueue's next balance time has passed. The actual balancing runs in softirq context:

scheduler_tick() on CPU X
  └─ trigger_load_balance(rq)
       └─ raise_softirq(SCHED_SOFTIRQ)
            └─ run_rebalance_domains() [softirq handler]
                 └─ rebalance_domains(this_rq, idle)
                      └─ for each sched_domain (innermost to outermost):
                           if (time to rebalance this domain?)
                             load_balance(this_rq, idle, sd, ...)

load_balance() compares the load on the busiest group in the domain with the load on the local group (the group containing the CPU that is running the balancer) and pulls tasks toward the local CPU:

load_balance():
  1. find_busiest_group(): scan all groups in domain, find busiest
  2. find_busiest_queue(): within busiest group, find busiest CPU's rq
  3. detach_tasks():       remove tasks from busiest queue
  4. attach_tasks():       add tasks to local queue

Constraints on task migration:
  - Task must be migratable (no CPU affinity restriction)
  - Task must not be "cache hot" (ran within sched_migration_cost_ns)
  - Load must actually be imbalanced (> imbalance_pct threshold)
  - NUMA domain: additional check for page locality

Cache hotness: A task is considered "cache hot" if it ran within sched_migration_cost_ns (default 500µs) on its current CPU. Migrating a hot task wastes the warm cache state, so the load balancer skips hot tasks unless earlier balancing attempts in the domain have repeatedly failed.
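The steps and constraints above can be sketched as a small simulation. This is a toy model under stated assumptions: it collapses find_busiest_group() and find_busiest_queue() into a single scan over run queues, omits the NUMA page-locality check, and uses hypothetical `Task` and `RunQueue` classes that only loosely resemble kernel structures.

```python
from dataclasses import dataclass, field

MIGRATION_COST_NS = 500_000  # sched_migration_cost_ns default quoted in the text


@dataclass
class Task:
    name: str
    load: int
    last_ran_ns: int
    allowed_cpus: frozenset


@dataclass
class RunQueue:
    cpu: int
    tasks: list = field(default_factory=list)

    @property
    def load(self) -> int:
        return sum(t.load for t in self.tasks)


def can_migrate(task: Task, dst_cpu: int, now_ns: int) -> bool:
    """Affinity must allow the destination and the task must not be cache hot."""
    if dst_cpu not in task.allowed_cpus:
        return False
    return now_ns - task.last_ran_ns >= MIGRATION_COST_NS


def load_balance(local: RunQueue, queues: list, now_ns: int, imbalance_pct: int = 117) -> list:
    """Pull tasks from the busiest queue to the local one; return moved task names."""
    busiest = max(queues, key=lambda rq: rq.load)
    if busiest is local or 100 * busiest.load <= imbalance_pct * local.load:
        return []  # within tolerance, nothing to do
    moved = []
    for task in list(busiest.tasks):
        if not can_migrate(task, local.cpu, now_ns):
            continue
        if local.load + task.load > busiest.load - task.load:
            continue  # moving this task would overshoot the balance point
        busiest.tasks.remove(task)  # detach from the busiest queue
        local.tasks.append(task)    # attach to the local queue
        moved.append(task.name)
    return moved


NOW = 10_000_000
everywhere = frozenset({0, 1})
cpu0 = RunQueue(0)
cpu1 = RunQueue(1, [
    Task("a", 100, NOW - 100_000, everywhere),       # ran 100µs ago: cache hot
    Task("b", 100, NOW - 2_000_000, everywhere),
    Task("c", 100, NOW - 2_000_000, frozenset({1})),  # pinned to CPU 1
    Task("d", 100, NOW - 2_000_000, everywhere),
])
moved = load_balance(cpu0, [cpu0, cpu1], NOW)
print(moved, cpu0.load, cpu1.load)
```

In this run the hot task and the pinned task stay put, and the two remaining tasks move until both queues carry equal load.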

NUMA Balancing: Automatic Page Migration

CONFIG_NUMA_BALANCING (enabled by default on most distribution kernels) implements automatic detection and correction of NUMA placement mismatches. The mechanism is elegant: the kernel uses page protection faults to sample which CPU accesses which pages, then migrates pages and/or tasks to maximize locality.

NUMA Hinting Faults

The key mechanism: NUMA balancing periodically scans a task's virtual address space and removes the present bit from page table entries (using PROT_NONE protection). The next access to any such page faults. The fault handler records:
1. Which physical page was accessed
2. Which CPU (therefore NUMA node) generated the fault
3. Whether the page is local or remote to that CPU

NUMA Balancing Mechanism:

1. NUMA scanner (runs in task context, amortized over time):
   for each VMA in task's address space:
     change_protection(VMA, PROT_NONE)  ← make pages faultable

2. Task accesses a page → page fault (NUMA hint fault)
   handle_mm_fault()
     → do_numa_page()
       → record: "page P was accessed by CPU C (node N)"
       → restore page access (remove PROT_NONE)
       → if page is remote to node N:
           numa_migrate_prep()  ← schedule page migration

3. Page migration:
   migrate_pages() moves physical page to node N's memory
   Update page table entries to point to new physical location

4. Task migration (if task repeatedly accesses pages on node N):
   numa_preferred_nid = N  ← record preferred NUMA node
   CFS considers moving task to a CPU on node N
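The bookkeeping behind steps 2 to 4 can be sketched with a small tracker. This is a simplified model, not the kernel's accounting: the class name, the page-to-node dictionary, and the rule of counting faults by the node that currently holds the page are assumptions made for illustration.

```python
from collections import Counter


class NumaFaultTracker:
    """Toy model of per-task NUMA hinting fault statistics."""

    def __init__(self, page_to_node):
        self.page_to_node = dict(page_to_node)  # which node holds each page
        self.faults_by_mem_node = Counter()     # faults grouped by the page's node
        self.migration_candidates = []          # remote pages seen so far

    def hint_fault(self, page, cpu_node):
        """Record one hinting fault; return True if the page was remote."""
        mem_node = self.page_to_node[page]
        self.faults_by_mem_node[mem_node] += 1
        remote = mem_node != cpu_node
        if remote:
            self.migration_candidates.append(page)
        return remote

    def preferred_node(self):
        """Node holding most of the memory this task touched, or None."""
        if not self.faults_by_mem_node:
            return None
        return self.faults_by_mem_node.most_common(1)[0][0]


# Hypothetical task running on node 0 whose pages mostly live on node 1.
tracker = NumaFaultTracker({0: 1, 1: 1, 2: 1, 3: 0})
for page in (0, 1, 2, 3):
    tracker.hint_fault(page, cpu_node=0)

print("preferred node:", tracker.preferred_node())
print("pages worth migrating:", tracker.migration_candidates)
```

The two outputs correspond to the two remedies in the text: either move the remote pages to the task's node, or move the task to the node where most of its memory already lives.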

Scanning Rate Control

The NUMA scanner adapts its scanning rate to avoid excessive overhead:

/proc/sys/kernel/numa_balancing_scan_period_min_ms = 1000  (1 second)
/proc/sys/kernel/numa_balancing_scan_period_max_ms = 60000 (60 seconds)
/proc/sys/kernel/numa_balancing_scan_size_mb       = 256   (scan 256MB/period)
(Kernels 5.13 and later expose these under /sys/kernel/debug/sched/numa_balancing/ instead.)

If many pages are remote: scan more aggressively (approach min period)
If locality is good: scan less frequently (approach max period)

Overhead: NUMA page faults typically add ~0.1-1% overhead.
          On workloads that benefit: 20-30% speedup from locality.
          On workloads that don't (already local): negligible.
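The adaptive behavior can be illustrated with a small feedback rule. This sketch is not the kernel's algorithm: the doubling and halving steps and the 70% locality threshold are assumptions chosen for clarity, and only the min and max bounds come from the defaults listed above.

```python
SCAN_PERIOD_MIN_MS = 1000   # numa_balancing_scan_period_min_ms default
SCAN_PERIOD_MAX_MS = 60000  # numa_balancing_scan_period_max_ms default


def next_scan_period_ms(current_ms, local_faults, remote_faults, local_threshold=0.7):
    """Back off when locality is good, scan faster when it is poor."""
    total = local_faults + remote_faults
    if total == 0 or local_faults / total >= local_threshold:
        proposed = current_ms * 2   # locality is fine (or no data): scan less often
    else:
        proposed = current_ms // 2  # many remote faults: scan more aggressively
    return max(SCAN_PERIOD_MIN_MS, min(SCAN_PERIOD_MAX_MS, proposed))


period = 4000
# Hypothetical fault samples: locality improves as pages get migrated.
for local, remote in [(10, 90), (30, 70), (80, 20), (95, 5)]:
    period = next_scan_period_ms(period, local, remote)
    print(f"local={local} remote={remote} -> next period {period} ms")
```

The clamping at both ends is the important part: it bounds the worst-case overhead at the minimum period and guarantees that placement is still re-checked occasionally at the maximum.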

Task Migration via NUMA Balancing

When a task's numa_faults statistics show that most accesses are to pages on node N ≠ current node, the scheduler sets numa_preferred_nid = N. During load balancing, the scheduler prefers to migrate this task to a CPU on node N, co-locating computation with data.

The task group problem: If a multi-threaded application has threads on both nodes, with shared memory, NUMA balancing sees conflicting signals (both nodes access the same pages). The numa_group mechanism handles this: threads that share a memory area are grouped, and the group's placement is optimized for the dominant access node.
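The group decision can be pictured as summing fault statistics across threads before choosing a node. The sketch below is a deliberately simple model: the real numa_group logic distinguishes shared from private faults and weighs other factors, and the thread names and counts here are made up.

```python
from collections import Counter


def group_preferred_node(per_thread_faults):
    """Sum per-node fault counts over all threads and pick the dominant node."""
    total = Counter()
    for faults in per_thread_faults.values():
        total.update(faults)  # adds counts node by node
    if not total:
        return None
    # Iterate nodes in sorted order so ties resolve to the lowest node id.
    return max(sorted(total), key=lambda node: total[node])


# Hypothetical fault counts per thread, keyed by NUMA node.
threads = {
    "worker-a": {0: 50, 1: 10},
    "worker-b": {0: 20, 1: 45},  # this thread alone would prefer node 1
    "worker-c": {0: 40, 1: 5},
}
print("group preferred node:", group_preferred_node(threads))
```

Judged individually, worker-b would be pulled toward node 1 while the shared pages are pulled toward node 0; deciding at group level avoids that tug of war.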

numastat: Observing NUMA Behavior

numastat                    # per-node memory stats (basic)
numastat -p [pid]           # per-process NUMA allocation breakdown
numastat -m                 # system-wide, meminfo-style memory usage per node

Example output (numastat -p postgres):
Per-node process memory usage (in MBs) for PID 1234 (postgres)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                       312.45          102.33          414.78
Stack                        0.23            0.00            0.23
Private                    280.12          156.78          436.90
----------------  --------------- --------------- ---------------
Total                      592.80          259.11          851.91

Interpretation: 70% of memory on Node 0 (local to CPUs 0-3)
                30% on Node 1 (remote) — NUMA imbalance

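# Watch NUMA allocation misses (pages allocated on a node other than the intended one):
numastat -n  # shows numa_hit, numa_miss, numa_foreign per node

# System-wide NUMA events via perf (-a counts all CPUs rather than only the sleep process;
# the offcore_* event names are Intel-specific and vary by microarchitecture):
perf stat -a -e cache-misses,LLC-load-misses,\
             offcore_requests.all_data_rd,\
             offcore_response.all_data_rd.llc_miss.remote_dram \
          -- sleep 10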

High numa_miss or high offcore_response.*.remote_dram indicates NUMA locality problems.
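The 70/30 interpretation of the numastat -p example can be reproduced by parsing the Total row. This is a minimal sketch that assumes the column layout shown in that example (one column per node followed by a Total column); the function name is hypothetical.

```python
SAMPLE = """\
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                       312.45          102.33          414.78
Stack                        0.23            0.00            0.23
Private                    280.12          156.78          436.90
----------------  --------------- --------------- ---------------
Total                      592.80          259.11          851.91
"""


def node_fractions(text):
    """Return each node's share of the process memory from the Total row."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "Total":
            # All values but the last are per-node; the last is the grand total.
            *per_node, total = (float(x) for x in fields[1:])
            return [value / total for value in per_node]
    raise ValueError("no Total row found")


for node, share in enumerate(node_fractions(SAMPLE)):
    print(f"Node {node}: {share:.1%}")
```

Whether a given share counts as local or remote depends on where the process's threads actually run, so the result should be read alongside the CPU affinity.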

Manual NUMA Binding

numactl

The primary tool for manual NUMA placement:

# Bind a process to CPU node 0 with memory from node 0:
numactl --cpunodebind=0 --membind=0 ./my_program

# Interleave memory across all nodes (for bandwidth-hungry, NUMA-unaware apps):
numactl --interleave=all ./my_program

# Bind to specific CPUs (not just nodes):
numactl --physcpubind=0,1,2,3 --membind=0 ./my_program

# Prefer node 0 but allow remote if node 0 is full:
numactl --preferred=0 ./my_program

# Show NUMA topology:
numactl --hardware

taskset

CPU affinity without NUMA memory binding:

# Pin process to CPUs 0-3 (Node 0):
taskset -c 0-3 ./my_program

# Change affinity of running process:
taskset -cp 0-3 [pid]

# Show current affinity:
taskset -p [pid]
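The same affinity change can be made from inside a program. The sketch below parses a CPU list in the format accepted by taskset -c and applies it with os.sched_setaffinity, which is available on Linux only. The helper names are hypothetical, and like taskset this sets CPU affinity without any memory policy.

```python
import os


def parse_cpu_list(spec: str) -> set:
    """Expand a list such as "0-3,8,10-11" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_current_process(spec: str) -> set:
    """Linux only: restrict this process to the requested CPUs that it may use."""
    wanted = parse_cpu_list(spec) & os.sched_getaffinity(0)
    if not wanted:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(0, wanted)  # pid 0 means the calling process
    return wanted


print(sorted(parse_cpu_list("0-3")))
```

Calling pin_current_process("0-3") early in a program has the same effect as launching it under taskset -c 0-3, provided those CPUs are in the process's allowed set.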

cgroup cpuset

For persistent, inherited NUMA binding in containers and services:

# cgroup v1:
mkdir /sys/fs/cgroup/cpuset/node0
echo 0-3 > /sys/fs/cgroup/cpuset/node0/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/node0/cpuset.mems
echo [pid] > /sys/fs/cgroup/cpuset/node0/tasks

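# cgroup v2 (the cpuset controller must be enabled in the parent first):
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
mkdir /sys/fs/cgroup/node0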
echo "0-3" > /sys/fs/cgroup/node0/cpuset.cpus
echo "0" > /sys/fs/cgroup/node0/cpuset.mems
echo [pid] > /sys/fs/cgroup/node0/cgroup.procs

cpuset.mems binds memory allocation to NUMA nodes, even for future allocations — critical for ensuring that data allocated after binding stays local.

cpuset.memory_migrate = 1 (cgroup v1): When set, migrates existing pages to the new node when cpuset.mems changes. Without it, existing remote pages stay remote.

NUMA Effects on PostgreSQL

PostgreSQL is a canonical example of NUMA-sensitive database behavior:

The problem: PostgreSQL's shared buffer pool (often 25-30% of RAM, e.g., 64GB on a 256GB system) is allocated once at startup. Without an explicit NUMA policy, Linux places each page on the node of the CPU that first touches it (first-touch allocation), so the buffer pool ends up wherever the backends that first touched each page happened to run. Database backends (connection handler processes) are then scheduled by the kernel across all CPUs. Result: on a 2-socket system, roughly 50% of buffer pool accesses are cross-NUMA, about doubling memory latency for those cache misses.

Solutions:

# Option 1: Bind PostgreSQL to a single NUMA node:
numactl --cpunodebind=0 --membind=0 postgres -D /var/lib/postgresql/data

# Option 2: Interleave shared memory for balanced bandwidth:
numactl --interleave=all postgres -D /var/lib/postgresql/data
# Good for: OLAP workloads that saturate memory bandwidth
# Bad for: OLTP workloads where latency matters more than bandwidth

# Option 3: Use multiple PostgreSQL instances, one per NUMA node:
# postgres1 bound to node 0, postgres2 bound to node 1
# Load balanced at application layer (PgBouncer, HAProxy)
# Best isolation, most complex operational setup

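Measurement: Run pgbench -c 16 -T 60 against a server started with default placement, then against a server started under numactl node binding as in Option 1 (the binding applies to the postgres server, not to pgbench). On a 2-socket system with 16 cores per socket, the improvement from node binding can be in the 15-40% range on OLTP benchmarks, depending on workload cache-fit and connection count.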

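huge_pages: PostgreSQL's huge_pages = on backs shared memory with explicitly reserved huge pages (2MB by default on x86-64, reserved via vm.nr_hugepages), not transparent huge pages. Huge pages reduce TLB misses but also make NUMA migration more expensive (moving a 2MB page takes longer than moving a 4KB page), and explicitly reserved huge pages are generally not moved by automatic NUMA balancing. With huge pages, getting the initial NUMA binding right is even more important, because placement cannot be cheaply corrected afterward.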

NUMA in the JVM

The JVM (HotSpot) has NUMA-aware heap allocation:

# Enable NUMA-aware garbage collection:
java -XX:+UseNUMA -XX:+UseParallelGC MyApp

# With G1GC (partial NUMA support added in JDK 14):
java -XX:+UseNUMA -XX:+UseG1GC MyApp

-XX:+UseNUMA enables "NUMA-aware" allocators in the JVM:
- Young generation memory is partitioned into per-NUMA-node spaces
- Java threads allocate from their local NUMA node's Eden space
- Objects allocated by a thread tend to stay local to that thread's CPU
- Reduces cross-NUMA accesses for object allocation (~30-60% of GC pressure)

Limitation: The JVM cannot control where OS threads are scheduled. Without numactl or JVM thread pinning (the NUMA behavior of JDK 21's virtual threads is still evolving), threads may migrate and access remote objects.

JVM NUMA tuning checklist:
1. Enable -XX:+UseNUMA with a parallel or G1 collector
2. Pin the JVM process to a NUMA node with numactl --cpunodebind=N --membind=N
3. Size the heap to fit within a single NUMA node if possible
4. Monitor with numastat -p [jvm_pid] and aim for >90% local allocation

Debugging Notes

# Check NUMA topology and distances:
numactl --hardware
lscpu | grep -i numa
cat /sys/devices/system/node/node*/distance

# Check if NUMA balancing is enabled:
cat /proc/sys/kernel/numa_balancing  # 1=enabled, 0=disabled

# Disable NUMA balancing (for benchmarks or if manually binding):
echo 0 > /proc/sys/kernel/numa_balancing

# Check per-mapping page placement for a process:
cat /proc/[pid]/numa_maps | head -20
# Format: address policy N0=pages N1=pages ... file=...
# Look for N1=nnnn on a node-0-bound process: remote pages

# System-wide NUMA statistics (these count page allocations, not memory accesses):
cat /proc/vmstat | grep numa
# numa_hit: pages allocated on the node the allocation intended
# numa_miss: pages allocated on this node although another node was preferred
# numa_foreign: pages intended for this node but allocated on another node
# numa_interleave: pages placed by interleave policy
# numa_local: pages allocated on the node the allocating task was running on
# numa_other: pages allocated on this node while the task ran on another node (keep low)

# System-wide performance counters for NUMA (generic perf cache events; availability depends on the CPU):
perf stat -a -e node-loads,node-load-misses,\
             node-stores,node-store-misses -- sleep 10

# Identify which processes cause cross-NUMA traffic:
perf record -e offcore_response.all_data_rd.llc_miss.remote_dram \
            -a -g -- sleep 10
perf report
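The N0=/N1= fields of /proc/[pid]/numa_maps can be totalled per node with a short script. This sketch assumes the field format described in the comments above; the sample lines are invented for illustration, and the counts are in pages of each mapping's own page size, so mappings backed by huge pages would need to be weighted separately.

```python
import re
from collections import Counter

NODE_FIELD = re.compile(r"^N(\d+)=(\d+)$")


def pages_per_node(lines):
    """Sum the N<node>=<pages> fields over all mappings."""
    totals = Counter()
    for line in lines:
        for token in line.split():
            match = NODE_FIELD.match(token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return totals


# Hypothetical numa_maps excerpt for a two-node system.
SAMPLE_LINES = [
    "00400000 default file=/usr/bin/example mapped=120 N0=120 kernelpagesize_kB=4",
    "7f5e8c000000 default anon=1024 dirty=1024 N0=512 N1=512 kernelpagesize_kB=4",
    "7ffd1a000000 default stack anon=8 dirty=8 kernelpagesize_kB=4",
]

totals = pages_per_node(SAMPLE_LINES)
grand_total = sum(totals.values())
for node in sorted(totals):
    print(f"node {node}: {totals[node]} pages ({totals[node] / grand_total:.0%})")
```

For a live process, an open file object for /proc/[pid]/numa_maps can be passed directly in place of the sample list.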

Security Implications

NUMA-based side-channel attacks: NUMA interconnect bandwidth is shared. An attacker running on socket 0 can saturate the QPI/UPI interconnect, degrading performance for a victim on socket 1 that accesses memory across the interconnect. This is both a covert channel and a denial-of-service vector. Mitigation: NUMA node isolation (cpuset cgroups that place tenants on different nodes).

Information leakage via NUMA timing: Memory access timing on NUMA systems leaks information about data locality. An attacker can use timing differences between local and remote accesses to infer whether a victim process has certain data cached locally. This is a more exotic side channel than cache attacks but real in cloud environments.

NUMA and live migration in VMs: When a VM is live-migrated between physical hosts, its memory placement and vCPU scheduling are reset. The new host may have a different NUMA topology. If the VM's workload had optimal NUMA placement, migration degrades performance until NUMA balancing re-optimizes. Hypervisors (KVM, VMware) must rebuild NUMA topology information on migration.

Performance Implications

Cross-NUMA memory access penalty: roughly 2x or more latency (150-200ns vs ~80ns in the dual-socket example), ~50% bandwidth reduction compared to local access
NUMA balancing overhead: ~0.1-1% CPU for page fault handling; scanning adds minor TLB pressure
Page migration cost: migrating one 4KB page takes ~10-50µs; a 1GB working set is 262,144 such pages, so migrating it takes roughly 2.6-13 seconds at that per-page cost. This is acceptable for steady-state workloads but disruptive for latency-sensitive ones
Load balancing across NUMA nodes is 10-100x more expensive than within-node balancing due to migration cost
Correct NUMA placement can improve memory-bandwidth-limited workloads by 40-80%
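The page migration figure can be checked by multiplying out the per-page cost. This sketch assumes pages are moved one at a time at the quoted 10-50µs each; batching in a real kernel may change the total, so treat the result as a rough bound rather than a measurement.

```python
PAGE_SIZE = 4096  # 4KB base pages
GIB = 1 << 30


def migration_time_s(working_set_bytes: int, per_page_us: float, page_size: int = PAGE_SIZE) -> float:
    """Serial migration time in seconds for a working set of the given size."""
    pages = -(-working_set_bytes // page_size)  # ceiling division
    return pages * per_page_us / 1_000_000


low = migration_time_s(GIB, 10)   # optimistic per-page cost from the text
high = migration_time_s(GIB, 50)  # pessimistic per-page cost from the text
print(f"{GIB // PAGE_SIZE} pages: {low:.1f}s to {high:.1f}s")
```

The same function shows why huge pages change the picture: far fewer pages need to move, but each individual move takes longer and stalls the faulting thread for more time.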

Failure Modes

NUMA imbalance with cpuset cgroups: If a cgroup's cpuset.cpus spans multiple NUMA nodes but cpuset.mems only lists one node, processes will run on CPUs across both nodes but allocate memory only from one — remote access guaranteed.
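NUMA balancing fighting manual placement: When placement is controlled manually (numactl, taskset, cpusets), automatic NUMA balancing can still scan address spaces and take hinting faults, and for memory without an explicit policy it may migrate pages or tasks against the intended layout. Disable with echo 0 > /proc/sys/kernel/numa_balancing when manually controlling placement.
Transparent huge pages + NUMA migration: Migrating a THP means either moving the full 2MB page or, when that is not possible, splitting it into 4KB pages, migrating them, and re-promoting them later. Either path can cause latency spikes during migration. If NUMA migration is frequent, restrict THP to regions that opt in with echo madvise > /sys/kernel/mm/transparent_hugepage/enabled, or disable it entirely by writing never.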
False NUMA sharing in JVM: Multiple JVM threads allocating into the same Eden region (before NUMA-aware GC separates them) leads to remote accesses. Symptom: high numa_other in vmstat despite -XX:+UseNUMA.
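The first failure mode, a cpuset whose CPUs span more nodes than its memory, can be caught with a simple consistency check. This is a sketch with hypothetical inputs: the node-to-CPU map matches the 8-core diagram earlier in the document, and on a real system it would be built from /sys/devices/system/node/node*/cpulist.

```python
def remote_only_cpus(cpuset_cpus, cpuset_mems, node_to_cpus):
    """Return CPUs in the cpuset whose own node is missing from cpuset.mems."""
    cpu_to_node = {cpu: node for node, cpus in node_to_cpus.items() for cpu in cpus}
    return sorted(cpu for cpu in cpuset_cpus if cpu_to_node[cpu] not in cpuset_mems)


# Topology from the dual-socket diagram: cores 0-3 on node 0, cores 4-7 on node 1.
TOPOLOGY = {0: {0, 1, 2, 3}, 1: {4, 5, 6, 7}}

# Misconfigured cpuset: CPUs from both nodes, memory from node 0 only.
bad = remote_only_cpus(cpuset_cpus=range(8), cpuset_mems={0}, node_to_cpus=TOPOLOGY)
print("CPUs that can only reach remote memory:", bad)
```

An empty result means every CPU in the cpuset has at least its own node available for allocation; a non-empty result lists the CPUs on which every allocation is guaranteed to be remote.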

Modern Usage and Future Directions

AMD EPYC NUMA complexity: Modern EPYC processors (Milan, Genoa) use a chiplet design. An EPYC Milan has up to 8 chiplets (CCDs) per socket, each with its own L3 cache. Depending on BIOS settings (NPS mode, L3-as-NUMA), the kernel sees a more complex topology: within a socket, some CPUs are "closer" (share a CCD) than others, and numactl --hardware may show multiple NUMA nodes per physical socket. For cache-sensitive workloads, binding to a specific CCD rather than to the whole socket can matter.

Intel Sapphire Rapids NUMA: Intel's Sapphire Rapids generation (2023) includes Xeon Max parts with on-package HBM (High Bandwidth Memory) alongside DDR5. In flat mode with sub-NUMA clustering (SNC4), a single socket appears as 8 NUMA nodes: 4 DDR nodes + 4 HBM nodes. Applications can explicitly bind to HBM nodes for bandwidth-intensive workloads (for example numactl --membind=4,5,6,7 if those are the HBM nodes; confirm the numbering with numactl --hardware).

NUMA-aware container scheduling: Kubernetes 1.26+ includes the kubelet's CPU Manager and Memory Manager, which can co-locate a pod's CPUs and memory on the same NUMA node. The Topology Manager policy single-numa-node admits a pod only if all of its requested resources can be satisfied from one NUMA node.

CXL (Compute Express Link) and memory pooling: CXL 2.0/3.0 allows memory expansion via PCIe-like links with NUMA-like access semantics. CXL memory appears as additional NUMA nodes with higher latency than local DRAM. Linux 6.1+ supports CXL as tiered memory (hot data in DRAM, cold data in CXL memory).

Exercises

Measure NUMA distance impact: On a multi-socket system, run the STREAM benchmark first with numactl --membind=0 --cpunodebind=0 (local) and then with numactl --membind=1 --cpunodebind=0 (remote). Measure bandwidth (MB/s) for each; STREAM does not report latency, so use a pointer-chasing latency tool for the latency (ns) comparison. What is the observed NUMA penalty?
numastat analysis: Start PostgreSQL without NUMA binding. After 5 minutes of pgbench activity, run numastat -p $(pgrep postgres). What fraction of memory is local vs remote? Repeat with numactl --membind=0 and compare.
NUMA balancing observation: Enable NUMA balancing tracing: echo 1 > /sys/kernel/debug/tracing/events/sched/sched_move_numa/enable. Run a NUMA-imbalanced workload and observe task migration events in the trace. To see page migrations as well, also enable events/migrate/mm_migrate_pages.
cgroup cpuset isolation: Create two cpuset cgroups, each bound to one NUMA node. Place competing processes in each cgroup. Verify with numastat that their memory is isolated and cross-NUMA traffic is reduced.
JVM NUMA measurement: Run a Java application (e.g., a Spring Boot service under load) with and without -XX:+UseNUMA. Use numastat -p [jvm_pid] to measure local vs remote allocation rates. Measure throughput difference.

References

Linux Kernel Documentation: Documentation/mm/numa.rst (Documentation/vm/numa.rst in older trees)
Linux Kernel Documentation: Documentation/scheduler/sched-domains.rst
Lameter, C., "NUMA (Non-Uniform Memory Access): An Overview", ACM Queue, 2013
Dashti, M. et al., "Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems", ASPLOS 2013
Lozi, J-P. et al., "The Linux Scheduler: a Decade of Wasted Cores", EuroSys 2016 — includes NUMA load balancing bugs
AMD EPYC NUMA Topology white paper (per CPU generation)
Intel Platform Optimization Playbook: NUMA section
numactl documentation: man 8 numactl, man 3 numa
Gorman, M., "Understanding the Linux Virtual Memory Manager", Chapter 2 (Describing Physical Memory, which covers NUMA nodes and zones)
Linux source: kernel/sched/topology.c, kernel/sched/fair.c (task_numa_* functions), mm/memory.c (do_numa_page), mm/migrate.c