Scalability to Many Cores
Overview
The Linux kernel's journey from supporting a handful of CPUs to running on systems with 1000+ logical cores represents one of the most sustained engineering efforts in operating systems history. What began as bolted-on SMP support in the 2.0 era has evolved through decades of contention removal, data structure redesign, and algorithmic improvement into a system that, while imperfect, scales to configurations the original authors never imagined.
This document traces that evolution, examines current bottlenecks at extreme core counts, explores how hyperscale companies work around the remaining limitations, and looks ahead at proposals for more radical architectural changes.
Prerequisites
- Familiarity with kernel locking primitives (spinlocks, mutexes, RCU)
- Understanding of NUMA (Non-Uniform Memory Access) topology
- Basic knowledge of CPU cache coherence protocols (MESI/MOESI)
- Awareness of the Linux scheduler's Completely Fair Scheduler (CFS)
Historical Context
The Big Kernel Lock Era (Linux 2.0–2.4)
Early Linux SMP support, introduced around version 2.0 (1996), took the expedient path: a single global spinlock called the Big Kernel Lock (BKL) protected nearly all kernel code. Any CPU entering the kernel grabbed the BKL; every other CPU waited. This made porting to SMP trivial but imposed a hard serialization ceiling.
LINUX 2.4 SMP MODEL
CPU 0 CPU 1 CPU 2 CPU 3
| | | |
v v v v
[BKL?] [BKL?] [BKL?] [BKL?]
| | | |
+----wait------+----wait------+----wait------+
|
[BKL HOLDER]
|
[Kernel code]
A system with 4 CPUs could, in the worst case, run kernel code on only 1 CPU at a time. Benchmarks showed that adding a third or fourth CPU often yielded diminishing or even negative returns for syscall-heavy workloads.
Per-Subsystem Locks (Linux 2.5–2.6)
The 2.5 development cycle (2002–2003), targeting the 2.6 release, began aggressive BKL decomposition. The strategy was to replace the single global lock with finer-grained locks scoped to individual subsystems:
- The VFS layer got its own inode locks (
i_lock,i_mutex) - The networking stack developed per-socket locks and per-device locks
- The scheduler got per-runqueue spinlocks
- Memory management moved toward per-zone and per-node locks
By the time 2.6.39 was released (2011), Arnd Bergmann and others had reduced BKL usage to near zero, and it was formally removed in 3.2 (2012).
RCU: Read-Copy-Update
The introduction and broad adoption of RCU (Read-Copy-Update) was arguably the most impactful scalability improvement of the 2.6 era. RCU allows multiple readers to proceed without any locking while updates occur via a copy-and-swap pattern that defers reclamation until all current readers have passed through a quiescent state.
RCU READ PATH (no lock, no atomic op)
Reader Writer
| |
| rcu_read_lock() | rcu_assign_pointer(p, new)
| (preempt disable) | // atomic pointer swap
| |
| dereference p | synchronize_rcu()
| (reads old data) | // wait for all readers
| | // in progress at swap time
| rcu_read_unlock() |
| | kfree(old) // safe now
RCU transformed hot read paths in routing tables, process credential checks, module lists, and dozens of other structures from O(lock contention) to effectively O(1) on read-heavy workloads.
Current Bottlenecks at 1000+ Cores
Modern systems have exposed a new generation of scaling limits. Below are the major bottlenecks that dominate at very high core counts.
Memory Allocator Contention
The slab/slub allocator moved to per-CPU caches (magazines/slabs) to eliminate cross-CPU lock contention on the common hot path. This works well up to ~64 cores. At 256+ cores:
- Per-CPU magazines drain and refill more often, increasing the rate of slow-path allocations that require the per-node slab lock
- The kmalloc lock for large objects becomes contended when many threads simultaneously handle network bursts
- Page allocator zone locks (
zone->lock) are hit when per-CPU page sets are exhausted
Google's internal patches and upstream work on the SLUB allocator's partial list have reduced but not eliminated this.
Scheduler Load Balancing: O(N) at High Core Count
The CFS scheduler organizes CPUs into hierarchical scheduling domains (SMT siblings → LLC domain → NUMA node → system). Load balancing runs periodically within each domain and, in the worst case, scans the runqueues of all CPUs in a domain.
SCHEDULING DOMAINS ON A 128-CORE SYSTEM
[System domain: 128 CPUs]
/ \
[NUMA node 0: 64] [NUMA node 1: 64]
/ \ / \
[LLC 0:32] [LLC 1:32] [LLC 2:32] [LLC 3:32]
...
[SMT pair] [SMT pair] ...
At 1000+ cores, load balancing becomes a meaningful fraction of CPU time. Each CPU periodically calls load_balance(), which may walk all other CPUs in the domain. At 1024 cores with a 10ms rebalance period, the system performs on the order of 100K runqueue inspections per second just for load balance—before doing any actual work.
Per-CPU Data: Effective but Memory-Expensive
DEFINE_PER_CPU variables are the standard solution for eliminating contention on frequently-updated counters (network statistics, page allocator state, scheduler data). They work: each CPU reads and writes its own cacheline, zero contention.
The cost is memory: a single per-CPU variable of 64 bytes on a 1024-core system consumes 64 KB of RAM. When multiplied by hundreds of per-CPU variables across subsystems, the total per-CPU data segment can reach hundreds of MB on very large systems—memory that cannot be used for workloads.
Cache Coherence Traffic: NUMA Effects Dominate
At high core counts, the primary performance limiter shifts from software locking to hardware cache coherence traffic. Even a lock-free RCU path requires that the pointer being dereferenced live in a cacheline local to the CPU. If a writer on NUMA node 3 modifies a structure, all readers on other nodes experience a cache miss until the new value propagates.
NUMA COHERENCE COST (AMD EPYC 9654, 96 cores, 8 NUMA nodes)
Node 0 Node 1 Node 2 Node 3
Node 0 read: ~4 ns ~20 ns ~30 ns ~38 ns
(approximate local vs remote DRAM latency)
Cross-node cache miss (MESI Invalid): ~80-120 ns
vs local cache hit: ~4 ns
On AMD EPYC systems with 8+ NUMA nodes, a single cross-node cache miss is 20–30x more expensive than a local one. Any shared kernel data structure—even if protected only by RCU—causes coherence traffic proportional to the number of readers across nodes.
Scalability Bottleneck Diagram
SCALABILITY LIMIT BY ERA AND MECHANISM
Throughput
(normalized)
^
| [future: sharding?]
| ****
| ******
| ***** <-- per-CPU data + RCU dominate
| *****
| ***** <-- per-subsystem locks replace BKL
| *****
| ***** <-- BKL removed, tickless kernel
| ****
|* <-- BKL era: hard ceiling ~4 CPUs useful
+----------------------------------------------------> Core count
1 4 8 16 32 64 128 256 512 1024
Linux at Hyperscale: Google, Meta, Amazon
Google runs on custom Linux kernels maintained by the ChromeOS/Android/Borg teams. Key patches include:
- Scheduler topology customization: Borglet (the per-machine agent) configures scheduling domains based on workload type. CPU-intensive batch jobs get wide domains; latency-sensitive services get narrow ones to avoid cross-NUMA migration.
- scx (sched_ext): Google contributed the
sched_extframework (merged in 6.12) that allows writing schedulers in BPF—enabling per-fleet custom scheduling policies without kernel recompilation. - Memory tiering: Hot/cold page classification and demotion to CXL or NUMA-far memory.
Meta
Meta's kernel team (contributing under meta-platforms on GitHub) has focused heavily on the networking stack:
- TC (Traffic Control) and sch_fq patches: Custom Fair Queue disciplines for handling multi-million-connection servers
- Network stack per-CPU queuing: Ensuring receive processing stays on the CPU that owns the socket, reducing cross-NUMA packet processing
- iptables → eBPF migration: Replacing O(N) iptables rule traversal with O(1) eBPF maps
Amazon (AWS)
AWS Nitro architecture offloads network and storage processing entirely to a dedicated hardware card, removing these interrupt loads from the guest kernel. Within guest kernels:
- Graviton (ARM-based) tuning: ARM's interconnect topology differs from x86; AWS engineers adjust NUMA balancing parameters
- ENA driver tuning: Elastic Network Adapter uses per-CPU queue pairs, ensuring interrupt affinity matches processing cores
NUMA at Hyperscale: AMD EPYC
AMD's EPYC "Genoa" (9004 series) places 96 physical cores across 12 Core Complexes (CCDs), organized into 4–8 NUMA nodes depending on BIOS configuration. Memory bandwidth is asymmetric: a thread on CCD 0 reading from memory attached to CCD 5 may cross two or three inter-die interconnects (Infinity Fabric links).
AMD EPYC 9654 SIMPLIFIED TOPOLOGY
[CCD 0: 8 cores] [CCD 1: 8 cores] -- Infinity Fabric -- [CCD 6: 8 cores] [CCD 7: 8 cores]
| | | |
[DRAM Ch 0] [DRAM Ch 1] [DRAM Ch 4] [DRAM Ch 5]
| | | |
[CCD 2: 8 cores] [CCD 3: 8 cores] [CCD 8: 8 cores] [CCD 9: 8 cores]
|
[DRAM Ch 2/3]
NUMA nodes (typical config): 4 nodes x 24 cores
Cross-NUMA bandwidth penalty: ~40% reduction vs local
Applications like Redis, memcached, and RocksDB have been tuned to use numactl --localalloc and CPU pinning to keep working sets local to a single NUMA node.
Scheduler Tuning at 256+ vCPU VMs
Google Borglet and similar systems expose a problem: a VM with 256 vCPUs running on a 256-core host has a flat scheduling domain—every CPU is "equidistant" from the scheduler's perspective, even if the underlying hardware has 4 NUMA nodes.
Key tuning levers:
# Reduce scheduler migration frequency (higher value = less aggressive)
echo 500000 > /proc/sys/kernel/sched_migration_cost_ns
# Reduce wake-up load balancing (critical for latency-sensitive workloads)
echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns
# Limit load balancing domain width via cpuset
# (confine a service to one NUMA node's CPUs)
cgcreate -g cpuset:/service_A
echo "0-63" > /sys/fs/cgroup/cpuset/service_A/cpuset.cpus
echo "0" > /sys/fs/cgroup/cpuset/service_A/cpuset.mems
The sched_ext framework allows even more radical control: a BPF program can implement a custom scheduler that consults external load data, container metadata, or power budgets when making placement decisions.
Failure Modes
- Lock convoy: A single hot spinlock turns into a thundering herd when many CPUs simultaneously reach it after a waitqueue release. Symptoms: periodic latency spikes visible in
perf stat -e lock:*orbpftrace. - NUMA thrashing: The kernel's automatic NUMA balancing (AutoNUMA) migrates pages toward the CPU that touches them most. On multi-tenant systems this can cause continuous page migration that wastes memory bandwidth. Disable with
kernel.numa_balancing=0for latency-critical services. - Scheduler starvation at high core count: With O(N) load balancing and many runqueues, a newly queued task on an idle CPU may not be found for multiple milliseconds if the wakeup path uses the wrong domain. This manifests as tail latency spikes.
- Per-CPU counter wrap: Some per-CPU statistics use 32-bit counters that wrap on very long-running high-throughput systems. Rare, but seen in production network counters.
Security Implications
Scalability patches often trade safety for performance in ways with security relevance:
- lockless fast paths: RCU's lockless reads require careful attention to memory ordering; missing
smp_rmb()or incorrectrcu_dereference()usage can lead to use-after-free in concurrent modification scenarios. - per-CPU data races: Code that reads its own per-CPU variable, gets preempted, migrates, and then writes back can introduce subtle data corruption. Kernel code uses
get_cpu()/put_cpu()guards, but bugs have shipped. - CPU hotplug interaction: Per-CPU data must be correctly initialized and torn down during CPU hotplug events. Missing teardown has caused information leaks.
Performance Implications
- Correct NUMA topology awareness can yield 2–4x throughput improvement for memory-bound workloads on large EPYC systems.
- Disabling AutoNUMA (
numa_balancing=0) and usingnumactlmanual pinning reduces worst-case tail latency by 30–60% for latency-sensitive services at Google and Meta scale. - The
sched_extframework's ability to implement work-stealing schedulers in BPF has demonstrated 5–15% CPU efficiency gains in preliminary Google benchmarks by reducing scheduler overhead at high core count. - RCU grace period batching (
synchronize_rcu()vscall_rcu()) is critical: synchronous grace periods serialize kernel threads and have caused multi-second stalls during memory pressure on systems with thousands of RCU callbacks.
Modern Usage
- Linux 6.x: Scheduler energy-awareness (
CONFIG_ENERGY_MODEL), sched_ext BPF schedulers, improved NUMA balancing heuristics - AMD EPYC + CXL: Memory tiering where hot pages live in DRAM and cold pages are demoted to CXL-attached memory pools; Linux's
demotion_pathinfrastructure (6.1+) manages this - Intel Xeon Scalable: Platform QoS (Cache Allocation Technology, Memory Bandwidth Monitoring) allows the OS to partition LLC capacity between VMs or cgroups, reducing noisy-neighbor effects
Future Directions
Kernel Sharding / Multi-Kernel
Research systems (Barrelfish, Popcorn Linux) explore running separate kernel instances on separate NUMA nodes, communicating via message passing rather than shared memory. This eliminates cross-node kernel lock contention entirely at the cost of complexity when processes migrate across node boundaries.
BPF Scheduler Proliferation
The sched_ext framework is expected to enable an ecosystem of domain-specific schedulers: latency-optimized, throughput-optimized, energy-optimized, and deadline-based variants all deployable without kernel patches.
Hardware-Assisted Coherence Reduction
Intel's Hybrid Cloud and AMD's XGMI/NVLink-like coherence fabrics expose new topologies. Future kernels will need scheduler primitives that understand chiplet-level latency variation within a single socket.
Debugging Notes
# Identify lock hotspots
perf lock record -- workload
perf lock report
# Measure NUMA locality of memory accesses
perf stat -e cache-misses,cache-references,\
mem_load_retired.l3_miss,mem_load_retired.local_pmm ./workload
# Observe scheduler migration frequency
perf stat -e sched:sched_migrate_task -- workload
# Check per-CPU slab allocator pressure
cat /proc/slabinfo | awk '{print $1, $3, $4}' | sort -k3 -rn | head -20
# Visualize scheduling domains
cat /sys/devices/system/cpu/cpu0/topology/core_cpus_list
numactl --hardware
Exercises
-
Boot a Linux VM with 8+ vCPUs, run
sysbench --threads=8 cpu runand compare with--threads=1. Useperf statto measure IPC (instructions per cycle). At what thread count does IPC start to drop? Why? -
Write a BPF program using
bpftracethat counts the number of timesload_balance()is called per CPU per second. Correlate with system load. -
On a NUMA system, run
numactl --cpunodebind=0 --membind=1 streamand compare tonumactl --localalloc stream. Measure bandwidth difference. -
Examine
/proc/sys/kernel/sched_*parameters on your system. Write a brief explanation of what each controls and which workload type each benefits. -
Study the
sched_extkernel documentation and write a minimal BPF scheduler that simply round-robins tasks across CPUs. Measure its overhead vs the default CFS scheduler.
References
- Corbet, J. et al. "Linux Kernel Development" (O'Reilly) — Chapters on locking and SMP
- McKenney, P. "Is Parallel Programming Hard, And If So, What Can You Do About It?" — Definitive RCU reference
- Lozi, J.P. et al. "The Linux Scheduler: a Decade of Wasted Cores" (EuroSys 2016) — Seminal paper on scheduler bugs at scale
- Amit, N. et al. "Optimizing the TLB Shootdown Algorithm with Page Access Tracking" (USENIX ATC 2017)
- AMD EPYC 9004 Series NUMA Topology Whitepaper (AMD, 2022)
- Linux kernel documentation:
Documentation/scheduler/,Documentation/mm/numa.rst - sched_ext kernel documentation:
Documentation/scheduler/sched-ext.rst - Axboe, J. "Linux Block IO: Present and Future" — covers per-CPU queue evolution in block layer