Scalability to Many Cores

Overview

The Linux kernel's journey from supporting a handful of CPUs to running on systems with 1000+ logical cores represents one of the most sustained engineering efforts in operating systems history. What began as bolted-on SMP support in the 2.0 era has evolved through decades of contention removal, data structure redesign, and algorithmic improvement into a system that, while imperfect, scales to configurations the original authors never imagined.

This document traces that evolution, examines current bottlenecks at extreme core counts, explores how hyperscale companies work around the remaining limitations, and looks ahead at proposals for more radical architectural changes.

Prerequisites

Familiarity with kernel locking primitives (spinlocks, mutexes, RCU)
Understanding of NUMA (Non-Uniform Memory Access) topology
Basic knowledge of CPU cache coherence protocols (MESI/MOESI)
Awareness of the Linux scheduler's Completely Fair Scheduler (CFS)

Historical Context

The Big Kernel Lock Era (Linux 2.0–2.4)

Early Linux SMP support, introduced around version 2.0 (1996), took the expedient path: a single global spinlock called the Big Kernel Lock (BKL) protected nearly all kernel code. Any CPU entering the kernel grabbed the BKL; every other CPU waited. This made porting to SMP trivial but imposed a hard serialization ceiling.

LINUX 2.4 SMP MODEL

CPU 0          CPU 1          CPU 2          CPU 3
  |              |              |              |
  v              v              v              v
[BKL?]         [BKL?]         [BKL?]         [BKL?]
  |              |              |              |
  +----wait------+----wait------+----wait------+
                 |
            [BKL HOLDER]
                 |
           [Kernel code]

A system with 4 CPUs could, in the worst case, run kernel code on only 1 CPU at a time. Benchmarks showed that adding a third or fourth CPU often yielded diminishing or even negative returns for syscall-heavy workloads.

Per-Subsystem Locks (Linux 2.5–2.6)

The 2.5 development cycle (2002–2003), targeting the 2.6 release, began aggressive BKL decomposition. The strategy was to replace the single global lock with finer-grained locks scoped to individual subsystems:

The VFS layer got its own inode locks (i_lock, i_mutex)
The networking stack developed per-socket locks and per-device locks
The scheduler got per-runqueue spinlocks
Memory management moved toward per-zone and per-node locks

By the time 2.6.39 was released (2011), Arnd Bergmann and others had reduced BKL usage to near zero, and it was formally removed in 3.2 (2012).

RCU: Read-Copy-Update

The introduction and broad adoption of RCU (Read-Copy-Update) was arguably the most impactful scalability improvement of the 2.6 era. RCU allows multiple readers to proceed without any locking while updates occur via a copy-and-swap pattern that defers reclamation until all current readers have passed through a quiescent state.

RCU READ PATH (no lock, no atomic op)

Reader                Writer
  |                     |
  | rcu_read_lock()      | rcu_assign_pointer(p, new)
  | (preempt disable)    |      // atomic pointer swap
  |                     |
  | dereference p        | synchronize_rcu()
  | (reads old data)     |  // wait for all readers
  |                     |      // in progress at swap time
  | rcu_read_unlock()    |
  |                     | kfree(old)   // safe now

RCU transformed hot read paths in routing tables, process credential checks, module lists, and dozens of other structures from O(lock contention) to effectively O(1) on read-heavy workloads.

Current Bottlenecks at 1000+ Cores

Modern systems have exposed a new generation of scaling limits. Below are the major bottlenecks that dominate at very high core counts.

Memory Allocator Contention

The slab/slub allocator moved to per-CPU caches (magazines/slabs) to eliminate cross-CPU lock contention on the common hot path. This works well up to ~64 cores. At 256+ cores:

Per-CPU magazines drain and refill more often, increasing the rate of slow-path allocations that require the per-node slab lock
The kmalloc lock for large objects becomes contended when many threads simultaneously handle network bursts
Page allocator zone locks (zone->lock) are hit when per-CPU page sets are exhausted

Google's internal patches and upstream work on the SLUB allocator's partial list have reduced but not eliminated this.

Scheduler Load Balancing: O(N) at High Core Count

The CFS scheduler organizes CPUs into hierarchical scheduling domains (SMT siblings → LLC domain → NUMA node → system). Load balancing runs periodically within each domain and, in the worst case, scans the runqueues of all CPUs in a domain.

SCHEDULING DOMAINS ON A 128-CORE SYSTEM

  [System domain: 128 CPUs]
       /              \
[NUMA node 0: 64]  [NUMA node 1: 64]
    /       \           /        \
[LLC 0:32] [LLC 1:32] [LLC 2:32] [LLC 3:32]
  ...
[SMT pair] [SMT pair] ...

At 1000+ cores, load balancing becomes a meaningful fraction of CPU time. Each CPU periodically calls load_balance(), which may walk all other CPUs in the domain. At 1024 cores with a 10ms rebalance period, the system performs on the order of 100K runqueue inspections per second just for load balance—before doing any actual work.

Per-CPU Data: Effective but Memory-Expensive

DEFINE_PER_CPU variables are the standard solution for eliminating contention on frequently-updated counters (network statistics, page allocator state, scheduler data). They work: each CPU reads and writes its own cacheline, zero contention.

The cost is memory: a single per-CPU variable of 64 bytes on a 1024-core system consumes 64 KB of RAM. When multiplied by hundreds of per-CPU variables across subsystems, the total per-CPU data segment can reach hundreds of MB on very large systems—memory that cannot be used for workloads.

Cache Coherence Traffic: NUMA Effects Dominate

At high core counts, the primary performance limiter shifts from software locking to hardware cache coherence traffic. Even a lock-free RCU path requires that the pointer being dereferenced live in a cacheline local to the CPU. If a writer on NUMA node 3 modifies a structure, all readers on other nodes experience a cache miss until the new value propagates.

NUMA COHERENCE COST (AMD EPYC 9654, 96 cores, 8 NUMA nodes)

               Node 0    Node 1    Node 2    Node 3
Node 0 read:   ~4 ns     ~20 ns    ~30 ns    ~38 ns
(approximate local vs remote DRAM latency)

Cross-node cache miss (MESI Invalid): ~80-120 ns
vs local cache hit: ~4 ns

On AMD EPYC systems with 8+ NUMA nodes, a single cross-node cache miss is 20–30x more expensive than a local one. Any shared kernel data structure—even if protected only by RCU—causes coherence traffic proportional to the number of readers across nodes.

Scalability Bottleneck Diagram

SCALABILITY LIMIT BY ERA AND MECHANISM

Throughput
(normalized)
  ^
  |                                         [future: sharding?]
  |                                    ****
  |                              ******
  |                         ***** <-- per-CPU data + RCU dominate
  |                    *****
  |               ***** <-- per-subsystem locks replace BKL
  |          *****
  |     ***** <-- BKL removed, tickless kernel
  | ****
  |*  <-- BKL era: hard ceiling ~4 CPUs useful
  +----------------------------------------------------> Core count
  1   4   8  16  32  64  128  256  512 1024

Linux at Hyperscale: Google, Meta, Amazon

Google

Google runs on custom Linux kernels maintained by the ChromeOS/Android/Borg teams. Key patches include:

Scheduler topology customization: Borglet (the per-machine agent) configures scheduling domains based on workload type. CPU-intensive batch jobs get wide domains; latency-sensitive services get narrow ones to avoid cross-NUMA migration.
scx (sched_ext): Google contributed the sched_ext framework (merged in 6.12) that allows writing schedulers in BPF—enabling per-fleet custom scheduling policies without kernel recompilation.
Memory tiering: Hot/cold page classification and demotion to CXL or NUMA-far memory.

Amazon (AWS)

AWS Nitro architecture offloads network and storage processing entirely to a dedicated hardware card, removing these interrupt loads from the guest kernel. Within guest kernels:

Graviton (ARM-based) tuning: ARM's interconnect topology differs from x86; AWS engineers adjust NUMA balancing parameters
ENA driver tuning: Elastic Network Adapter uses per-CPU queue pairs, ensuring interrupt affinity matches processing cores

NUMA at Hyperscale: AMD EPYC

AMD's EPYC "Genoa" (9004 series) places 96 physical cores across 12 Core Complexes (CCDs), organized into 4–8 NUMA nodes depending on BIOS configuration. Memory bandwidth is asymmetric: a thread on CCD 0 reading from memory attached to CCD 5 may cross two or three inter-die interconnects (Infinity Fabric links).

AMD EPYC 9654 SIMPLIFIED TOPOLOGY

[CCD 0: 8 cores] [CCD 1: 8 cores] -- Infinity Fabric -- [CCD 6: 8 cores] [CCD 7: 8 cores]
        |                 |                                       |                  |
   [DRAM Ch 0]       [DRAM Ch 1]                           [DRAM Ch 4]         [DRAM Ch 5]
        |                 |                                       |                  |
[CCD 2: 8 cores] [CCD 3: 8 cores]                         [CCD 8: 8 cores] [CCD 9: 8 cores]
        |
   [DRAM Ch 2/3]

NUMA nodes (typical config): 4 nodes x 24 cores
Cross-NUMA bandwidth penalty: ~40% reduction vs local

Applications like Redis, memcached, and RocksDB have been tuned to use numactl --localalloc and CPU pinning to keep working sets local to a single NUMA node.

Scheduler Tuning at 256+ vCPU VMs

Google Borglet and similar systems expose a problem: a VM with 256 vCPUs running on a 256-core host has a flat scheduling domain—every CPU is "equidistant" from the scheduler's perspective, even if the underlying hardware has 4 NUMA nodes.

Key tuning levers:

# Reduce scheduler migration frequency (higher value = less aggressive)
echo 500000 > /proc/sys/kernel/sched_migration_cost_ns

# Reduce wake-up load balancing (critical for latency-sensitive workloads)
echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns

# Limit load balancing domain width via cpuset
# (confine a service to one NUMA node's CPUs)
cgcreate -g cpuset:/service_A
echo "0-63" > /sys/fs/cgroup/cpuset/service_A/cpuset.cpus
echo "0" > /sys/fs/cgroup/cpuset/service_A/cpuset.mems

The sched_ext framework allows even more radical control: a BPF program can implement a custom scheduler that consults external load data, container metadata, or power budgets when making placement decisions.

Failure Modes

Lock convoy: A single hot spinlock turns into a thundering herd when many CPUs simultaneously reach it after a waitqueue release. Symptoms: periodic latency spikes visible in perf stat -e lock:* or bpftrace.
NUMA thrashing: The kernel's automatic NUMA balancing (AutoNUMA) migrates pages toward the CPU that touches them most. On multi-tenant systems this can cause continuous page migration that wastes memory bandwidth. Disable with kernel.numa_balancing=0 for latency-critical services.
Scheduler starvation at high core count: With O(N) load balancing and many runqueues, a newly queued task on an idle CPU may not be found for multiple milliseconds if the wakeup path uses the wrong domain. This manifests as tail latency spikes.
Per-CPU counter wrap: Some per-CPU statistics use 32-bit counters that wrap on very long-running high-throughput systems. Rare, but seen in production network counters.

Security Implications

Scalability patches often trade safety for performance in ways with security relevance:

lockless fast paths: RCU's lockless reads require careful attention to memory ordering; missing smp_rmb() or incorrect rcu_dereference() usage can lead to use-after-free in concurrent modification scenarios.
per-CPU data races: Code that reads its own per-CPU variable, gets preempted, migrates, and then writes back can introduce subtle data corruption. Kernel code uses get_cpu()/put_cpu() guards, but bugs have shipped.
CPU hotplug interaction: Per-CPU data must be correctly initialized and torn down during CPU hotplug events. Missing teardown has caused information leaks.

Performance Implications

Correct NUMA topology awareness can yield 2–4x throughput improvement for memory-bound workloads on large EPYC systems.
Disabling AutoNUMA (numa_balancing=0) and using numactl manual pinning reduces worst-case tail latency by 30–60% for latency-sensitive services at Google and Meta scale.
The sched_ext framework's ability to implement work-stealing schedulers in BPF has demonstrated 5–15% CPU efficiency gains in preliminary Google benchmarks by reducing scheduler overhead at high core count.
RCU grace period batching (synchronize_rcu() vs call_rcu()) is critical: synchronous grace periods serialize kernel threads and have caused multi-second stalls during memory pressure on systems with thousands of RCU callbacks.

Modern Usage

Linux 6.x: Scheduler energy-awareness (CONFIG_ENERGY_MODEL), sched_ext BPF schedulers, improved NUMA balancing heuristics
AMD EPYC + CXL: Memory tiering where hot pages live in DRAM and cold pages are demoted to CXL-attached memory pools; Linux's demotion_path infrastructure (6.1+) manages this
Intel Xeon Scalable: Platform QoS (Cache Allocation Technology, Memory Bandwidth Monitoring) allows the OS to partition LLC capacity between VMs or cgroups, reducing noisy-neighbor effects

Future Directions

Kernel Sharding / Multi-Kernel

Research systems (Barrelfish, Popcorn Linux) explore running separate kernel instances on separate NUMA nodes, communicating via message passing rather than shared memory. This eliminates cross-node kernel lock contention entirely at the cost of complexity when processes migrate across node boundaries.

BPF Scheduler Proliferation

The sched_ext framework is expected to enable an ecosystem of domain-specific schedulers: latency-optimized, throughput-optimized, energy-optimized, and deadline-based variants all deployable without kernel patches.

Hardware-Assisted Coherence Reduction

Intel's Hybrid Cloud and AMD's XGMI/NVLink-like coherence fabrics expose new topologies. Future kernels will need scheduler primitives that understand chiplet-level latency variation within a single socket.

Debugging Notes

# Identify lock hotspots
perf lock record -- workload
perf lock report

# Measure NUMA locality of memory accesses
perf stat -e cache-misses,cache-references,\
  mem_load_retired.l3_miss,mem_load_retired.local_pmm ./workload

# Observe scheduler migration frequency
perf stat -e sched:sched_migrate_task -- workload

# Check per-CPU slab allocator pressure
cat /proc/slabinfo | awk '{print $1, $3, $4}' | sort -k3 -rn | head -20

# Visualize scheduling domains
cat /sys/devices/system/cpu/cpu0/topology/core_cpus_list
numactl --hardware

Exercises

Boot a Linux VM with 8+ vCPUs, run sysbench --threads=8 cpu run and compare with --threads=1. Use perf stat to measure IPC (instructions per cycle). At what thread count does IPC start to drop? Why?
Write a BPF program using bpftrace that counts the number of times load_balance() is called per CPU per second. Correlate with system load.
On a NUMA system, run numactl --cpunodebind=0 --membind=1 stream and compare to numactl --localalloc stream. Measure bandwidth difference.
Examine /proc/sys/kernel/sched_* parameters on your system. Write a brief explanation of what each controls and which workload type each benefits.
Study the sched_ext kernel documentation and write a minimal BPF scheduler that simply round-robins tasks across CPUs. Measure its overhead vs the default CFS scheduler.

References

Corbet, J. et al. "Linux Kernel Development" (O'Reilly) — Chapters on locking and SMP
McKenney, P. "Is Parallel Programming Hard, And If So, What Can You Do About It?" — Definitive RCU reference
Lozi, J.P. et al. "The Linux Scheduler: a Decade of Wasted Cores" (EuroSys 2016) — Seminal paper on scheduler bugs at scale
Amit, N. et al. "Optimizing the TLB Shootdown Algorithm with Page Access Tracking" (USENIX ATC 2017)
AMD EPYC 9004 Series NUMA Topology Whitepaper (AMD, 2022)
Linux kernel documentation: Documentation/scheduler/, Documentation/mm/numa.rst
sched_ext kernel documentation: Documentation/scheduler/sched-ext.rst
Axboe, J. "Linux Block IO: Present and Future" — covers per-CPU queue evolution in block layer