03 — Memory Performance

Technical Overview

Memory performance is determined by the interplay of the cache hierarchy, the DRAM subsystem, and memory access patterns. Modern CPUs execute instructions in well under a nanosecond, but DRAM responds in roughly 60–100 ns, a mismatch of about two orders of magnitude. The hardware narrows this gap with caches, prefetchers, and out-of-order execution, which hides latency by doing other work while a load is outstanding.

When applications exceed cache capacity or exhibit irregular access patterns, they become memory-bound: the CPU's execution units are starved of data, and IPC collapses toward 0.5 or lower. Optimizing a memory-bound program requires reducing working set size, improving spatial and temporal locality, and leveraging hardware mechanisms like huge pages and NUMA-aware allocation.


Prerequisites

  • Cache hierarchy fundamentals (L1/L2/L3, inclusive vs. exclusive, set-associative layout).
  • Virtual memory and page tables.
  • Basic NUMA topology understanding.
  • Linux perf and numactl.

Core Content

Memory Latency Table

This table represents approximate hardware latencies on a modern x86-64 system (3 GHz clock, DDR5-4800). Absolute values vary by generation; relative values are stable.

Level                 Latency (cycles)   Latency (ns)   Bandwidth       Capacity
L1 cache              4–5                ~1.5           ~2 TB/s         32–64 KB
L2 cache              12–14              ~4             ~400 GB/s       256 KB – 1 MB
L3 cache (LLC)        30–50              ~10–16         ~200 GB/s       8–192 MB
DRAM (local)          180–220            ~60–75         ~50–100 GB/s    Hundreds of GB
DRAM (remote NUMA)    350–450            ~120–150       ~25–50 GB/s
NVMe SSD              ~300,000           ~100 µs        ~7 GB/s         TB
SATA SSD              ~600,000           ~200 µs        ~500 MB/s       TB
Spinning HDD          ~30,000,000        ~10 ms         ~150 MB/s       TB

The ratio between L1 latency (4 cycles) and DRAM latency (200 cycles) is 50x. A single cache miss in a tight loop can dominate runtime.
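
How much a miss costs on average can be estimated with a weighted sum over the levels. The following is a back-of-the-envelope sketch, not a measurement: the function name and the hit rates passed to it are illustrative assumptions, and the per-level cycle counts are taken from the table above.

```c
#include <assert.h>

/* Average load latency in cycles for a lookup that falls through
 * L1 -> L2 -> L3 -> DRAM. Each hit rate is the fraction of accesses
 * reaching that level which hit there (hypothetical inputs). */
double avg_load_cycles(double l1_hit, double l2_hit, double l3_hit)
{
    /* Representative latencies from the table above, in cycles. */
    const double l1 = 4.0, l2 = 12.0, l3 = 40.0, dram = 200.0;

    assert(l1_hit >= 0.0 && l1_hit <= 1.0);
    assert(l2_hit >= 0.0 && l2_hit <= 1.0);
    assert(l3_hit >= 0.0 && l3_hit <= 1.0);

    double l3_cost = l3_hit * l3 + (1.0 - l3_hit) * dram;
    double l2_cost = l2_hit * l2 + (1.0 - l2_hit) * l3_cost;
    return l1_hit * l1 + (1.0 - l1_hit) * l2_cost;
}
```

With a 95% L1 hit rate and every L1 miss going all the way to DRAM, avg_load_cycles(0.95, 0.0, 0.0) gives 13.8 cycles per load, more than three times the L1 latency.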


Cache Efficiency

A cache line is the fundamental unit of cache transfer: 64 bytes on all modern x86. Accessing any byte in a cache line pulls the entire 64 bytes from the next level.

Cache line utilization: if your access pattern reads only 8 of 64 bytes in a loaded cache line, you waste 87.5% of the bandwidth.

#include <stdint.h>

// Poor utilization: a scan over keys touches one cache line per record
// and uses only 8 of its 64 bytes.
struct record {
    uint64_t key;           // offset 0: the only field the scan reads
    char     payload[56];   // offsets 8-63: shares the key's line, unused by the scan
    char     metadata[192]; // offsets 64-255: three more lines between consecutive keys
};

// Better: hot/cold separation
struct record_hot {
    uint64_t key;
    uint32_t index_into_cold;
};  // 16 bytes after alignment padding: 4 per cache line
struct record_cold {
    char payload[56];
    char metadata[192];
};
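
The layout claims can be checked with sizeof rather than taken on trust. The sketch below repeats the two structs and computes how many cache lines each record spans and how many hot records share a line; the helper names are hypothetical, and the sizes assume a typical 64-bit ABI where uint64_t is 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

struct record {                 /* key stored together with cold fields */
    uint64_t key;
    char     payload[56];
    char     metadata[192];
};

struct record_hot {             /* only what the scan loop touches */
    uint64_t key;
    uint32_t index_into_cold;
};

static_assert(sizeof(struct record_hot) <= CACHE_LINE,
              "a hot record must fit in one cache line");

/* Cache lines spanned by one mixed record (rounded up). */
size_t lines_per_record(void)
{
    return (sizeof(struct record) + CACHE_LINE - 1) / CACHE_LINE;
}

/* Hot records that fit in a single cache line. */
size_t hot_records_per_line(void)
{
    return CACHE_LINE / sizeof(struct record_hot);
}
```

On x86-64 this reports 4 lines per mixed record and 4 hot records per line, so a key scan over the hot array loads a quarter as many lines as a scan over the mixed array.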

False sharing: two semantically independent variables reside on the same cache line and are written by different CPUs. Under the MESI protocol, each write invalidates the whole line in the other CPU's cache, causing coherence traffic even though the CPUs are writing different bytes.

CPU 0 writes counter_a (offset 0)  ──┐
CPU 1 writes counter_b (offset 8)  ──┴── Both on cache line [0..63]
                                        → Every write invalidates the other CPU's copy

Detect false sharing:

perf c2c record -ag -- sleep 30
perf c2c report --stdio
# Look for "HITM" (Hit In Modified) — indicates cache line bouncing
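
Once a bouncing line is identified, the usual remedy is to give each counter its own cache line. The sketch below shows one way to do that with C11 alignas and checks the resulting layout; the struct and function names are hypothetical, and the 64-byte line size is the x86 value quoted earlier.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Adjacent counters: both fall in the same 64-byte line. */
struct counters_shared {
    uint64_t a;
    uint64_t b;
};

/* Each counter aligned to its own line; the compiler adds the padding. */
struct counters_padded {
    alignas(CACHE_LINE) uint64_t a;
    alignas(CACHE_LINE) uint64_t b;
};

static_assert(sizeof(struct counters_padded) == 2 * CACHE_LINE,
              "padded counters should occupy exactly two lines");

/* Returns 1 if two member offsets land in the same cache line,
 * assuming the enclosing object starts on a line boundary. */
int same_line(size_t off_a, size_t off_b)
{
    return off_a / CACHE_LINE == off_b / CACHE_LINE;
}
```

The padded struct is eight times larger than the shared one (128 bytes instead of 16), which is the footprint cost paid for removing the coherence traffic.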

Cache thrashing: a working set larger than the cache causes continuous eviction and reload. Most accesses then pay the miss latency on top of the lookup cost, so the cache adds delay instead of hiding it. Profile with:

perf stat -e LLC-loads,LLC-load-misses -p <pid> sleep 10
# LLC miss rate = LLC-load-misses / LLC-loads
# > 30% miss rate = likely thrashing

DRAM Bandwidth

DDR5 (dual-channel, 4800 MT/s, 2× 64-bit wide): theoretical ~77 GB/s. In practice, efficiency losses (refresh cycles, row activation, ECC overhead) reduce this to ~60–70 GB/s.

Multi-channel: most workstation and server platforms use 4, 8, or 12 memory channels to multiply bandwidth. With DDR5-4800, each 64-bit channel peaks at ~38.4 GB/s:

Intel Xeon Sapphire Rapids: 8 DDR5 channels × ~38 GB/s = ~307 GB/s peak
AMD EPYC Genoa: 12 DDR5 channels × ~38 GB/s = ~460 GB/s peak

Applications must saturate all channels to achieve peak bandwidth. If all memory allocations land on NUMA node 0, only that node's channels are used. Interleaved allocation across nodes uses all channels.
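
The theoretical peak figures follow from transfer rate × bus width × channel count. A minimal sketch of that arithmetic, with a hypothetical function name and parameters chosen for illustration:

```c
#include <assert.h>

/* Theoretical peak DRAM bandwidth in GB/s (10^9 bytes per second).
 *   mt_per_s       transfer rate in mega-transfers/s (4800 for DDR5-4800)
 *   bytes_per_xfer data bus width per channel in bytes (8 for 64-bit)
 *   channels       number of populated memory channels */
double peak_dram_gbs(double mt_per_s, int bytes_per_xfer, int channels)
{
    assert(mt_per_s > 0.0 && bytes_per_xfer > 0 && channels > 0);
    return mt_per_s * 1e6 * bytes_per_xfer * channels / 1e9;
}
```

peak_dram_gbs(4800, 8, 2) gives 76.8 GB/s for the dual-channel case, and peak_dram_gbs(4800, 8, 12) gives 460.8 GB/s for a 12-channel socket. Sustained bandwidth is lower for the efficiency reasons listed above.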

Measure bandwidth:

# Intel PCM
pcm-memory 1  # per-channel bandwidth

# membw (LIKWID)
likwid-perfctr -C 0 -g MEM_DP ./stream_benchmark

# stream benchmark (industry standard)
./stream  # reports Triad bandwidth

Memory Access Patterns

Sequential access is prefetcher-friendly. The hardware prefetcher detects stride-1 patterns and issues prefetch requests before the CPU reaches the next cache line. Effective bandwidth: near theoretical peak.

Random access defeats the prefetcher. The CPU must wait for each load unless several independent misses can overlap. Approximate effective bandwidth for 64-byte reads on a single core:

Sequential 64-byte reads (prefetched):   ~50 GB/s
Random 64-byte reads (no pattern):       ~0.5–2 GB/s

This 25–100x gap motivates data structure choices: array-of-structs vs. struct-of-arrays, sorted vs. hash-table lookups, row-major vs. column-major matrix traversal.

Matrix traversal example:

// Assumes: float A[N][N]; float sum = 0.0f;

// Row-major traversal (cache-friendly, sequential):
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        sum += A[i][j];  // consecutive addresses: 16 floats per 64-byte line

// Column-major traversal (cache-unfriendly, stride-N):
for (int j = 0; j < N; j++)
    for (int i = 0; i < N; i++)
        sum += A[i][j];  // jumps N*sizeof(float) per access → cache miss

For large N (e.g., N=4096, matrix = 64 MB), the column-major version can be 10–50x slower.
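
For experimentation, the two loops can be wrapped as functions over a heap-allocated, flat row-major array, which avoids placing a 64 MB matrix on the stack. This is a minimal sketch with hypothetical function names; timing code is left to the reader, and the speed difference depends on N and the machine.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an n x n row-major matrix row by row: consecutive accesses are
 * 4 bytes apart, so each 64-byte line serves 16 elements. */
float sum_row_major(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}

/* Same matrix, column by column: consecutive accesses are n * 4 bytes
 * apart, so for large n each access lands on a different line. */
float sum_col_major(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    float sum = 0.0f;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];
    return sum;
}
```

Both functions visit every element exactly once; only the order differs. With float data the two sums can differ slightly in the last bits because the additions happen in a different order.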


TLB Performance

The TLB (Translation Lookaside Buffer) caches virtual-to-physical address translations. A TLB miss requires a page table walk (~100–300 cycles on a 4-level page table).

TLB Level          Entries   Hit Latency        Coverage (4 KB pages)
L1 DTLB            64        0 extra cycles     256 KB
L2 TLB (STLB)      1536      ~7 cycles          6 MB
Page table walk              ~100–300 cycles    Any

If the working set is 1 GB with 4 KB pages, it requires 262,144 TLB entries, far more than the STLB holds. Every access to a page whose translation is not cached incurs a page walk.
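
The entry counts and coverage figures are simple divisions and products. The sketch below reproduces them; the function names are hypothetical, and it assumes every TLB entry maps one page of the given size (real TLBs often have fewer entries for large pages).

```c
#include <assert.h>
#include <stdint.h>

/* Pages (and therefore TLB entries) needed to map a working set,
 * rounded up to whole pages. */
uint64_t pages_needed(uint64_t working_set_bytes, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    return (working_set_bytes + page_bytes - 1) / page_bytes;
}

/* Bytes of address space a TLB level can map without missing. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}
```

A 1 GB working set needs 262,144 entries with 4 KB pages but only 512 with 2 MB pages, which fits comfortably in a 1536-entry second-level TLB.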

Huge pages reduce TLB pressure by using 2 MB pages (512x coverage) or 1 GB pages (262,144x coverage):

# Transparent Huge Pages (THP) — automatic
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# Explicit huge pages via hugetlbfs
echo 1000 > /proc/sys/vm/nr_hugepages  # allocate 1000 × 2 MB = 2 GB

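// In C: mmap with MAP_HUGETLB (needs <sys/mman.h>; draws from the pool reserved above)
void *p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
               MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);
// p == MAP_FAILED if no huge pages are free; SIZE should be a multiple of 2 MB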

# Check TLB miss rate
perf stat -e dTLB-loads,dTLB-load-misses -p <pid> sleep 10

Memory hierarchy with TLB overlay:

Virtual Address Space
      │
      ▼
  L1 DTLB (64 entries, 4K pages)
      │ miss
      ▼
  L2/STLB (1536 entries)
      │ miss
      ▼
  Page Table Walk (4 levels: PGD→PUD→PMD→PTE)
      │ each level may miss LLC
      ▼
  Physical Address → Cache Lookup

NUMA Bandwidth

On multi-socket systems, accessing memory on a remote NUMA node incurs extra latency and competes with that socket's local traffic.

Socket 0 (CPUs 0-23)           Socket 1 (CPUs 24-47)
┌────────────────────┐         ┌────────────────────┐
│  CPU 0..23         │         │  CPU 24..47        │
│  L3 cache (48 MB)  │         │  L3 cache (48 MB)  │
│  Memory controller │         │  Memory controller │
│  DRAM 0 (256 GB)   │         │  DRAM 1 (256 GB)   │
└─────────┬──────────┘         └──────────┬─────────┘
          │◄──── QPI/UPI Interconnect ───►│
          │  ~25–50 GB/s, ~80 ns extra    │

Local access: ~70 ns. Remote access: ~150 ns. Use numastat -p <pid> to see how a process's memory is distributed across nodes; the system-wide numastat counters (numa_miss, other_node) show allocations that landed off the intended node.
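
The cost of poor placement can be estimated by weighting the two latencies by the share of accesses that go remote. This is a rough model with a hypothetical function name; it assumes accesses are spread in proportion to where the pages reside and ignores interconnect contention.

```c
#include <assert.h>

/* Expected DRAM latency in ns when remote_fraction of accesses
 * are served by the other socket's memory. */
double numa_avg_latency_ns(double remote_fraction,
                           double local_ns, double remote_ns)
{
    assert(remote_fraction >= 0.0 && remote_fraction <= 1.0);
    return (1.0 - remote_fraction) * local_ns
         + remote_fraction * remote_ns;
}
```

With the figures above, an even split between nodes (as interleaving produces) gives numa_avg_latency_ns(0.5, 70, 150) = 110 ns, about 57% more than fully local placement.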


Memory Prefetcher Effectiveness

The hardware prefetcher works for:

  • Stride-1 (sequential) access.
  • Constant-stride access (e.g., stride 128 bytes).
  • Streaming loads (one direction).

It does NOT work for:

  • Pointer chasing (linked lists, trees): each pointer dereference depends on the previous load.
  • Random-access patterns.
  • Strides that change per iteration.

For pointer chasing, the software prefetch intrinsic can help when the next pointer is known ahead of time:

// Prefetch next node while processing current
node_t *next = current->next;
__builtin_prefetch(next, 0, 1);  // prefetch for read, low locality
// ... process current ...
current = next;

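In practice, this helps only when the prefetch is issued at least one miss penalty before the data is used. At 3 GHz a DRAM miss costs ~70 ns (~200 cycles), so a loop that completes one iteration per cycle (~0.33 ns) would need to prefetch about 200 iterations ahead. Prefetching only the immediate next node pays off when processing the current node takes a comparable amount of time.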
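
Prefetching further ahead is straightforward when future addresses are available without chasing pointers, for example when an index array drives the accesses. The sketch below illustrates that case; gather_sum and its dist parameter are hypothetical names, __builtin_prefetch is the GCC/Clang builtin used in the snippet above, and a good value for dist has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Sum data[idx[i]] for i in [0, n), prefetching the element that will
 * be needed dist iterations from now. The index array is read
 * sequentially, so idx[i + dist] is cheap to obtain in advance. */
double gather_sum(const double *data, const size_t *idx,
                  size_t n, size_t dist)
{
    assert((data != NULL && idx != NULL) || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&data[idx[i + dist]], 0, 1); /* read, low locality */
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint and never changes the result, so the function returns the same sum for any dist.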


Memory Bandwidth Saturation

A program is memory-bound when the bottleneck is the memory system rather than compute; it is bandwidth-bound when DRAM traffic approaches the achievable peak. Signs:

  • IPC < 1.0 despite simple instructions.
  • Memory load stalls dominate perf stat output.
  • perf stat shows high cycle_activity.stalls_l3_miss (Intel).
  • DRAM bandwidth near theoretical peak (measured by PCM).

The Roofline model graphically identifies whether a kernel is compute-bound or memory-bound:

GFLOP/s (log scale)
│
│                Compute roof (peak FLOP/s)
│            ┌──────────────●───────────
│           /        (compute-bound kernel)
│          /
│         ●  (memory-bound kernel)
│        /
│       /  ← slope = DRAM bandwidth
│      /
└──────────────────────────── FLOP/byte (arithmetic intensity, log scale)
   Low                  High
   (memory-bound)       (compute-bound)

Arithmetic intensity = FLOP count / bytes transferred from DRAM.

  • Matrix-vector multiply: ~0.25 FLOP/byte → memory-bound.
  • Dense matrix-matrix multiply: ~16 FLOP/byte → compute-bound.
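
The Roofline bound itself is the smaller of the compute roof and arithmetic intensity times memory bandwidth. A minimal sketch, with a hypothetical function name and machine figures that are assumptions for illustration only:

```c
#include <assert.h>

/* Attainable GFLOP/s under the Roofline model:
 * min(peak compute, arithmetic intensity x memory bandwidth). */
double roofline_gflops(double peak_gflops, double bw_gbs,
                       double flop_per_byte)
{
    assert(peak_gflops > 0.0 && bw_gbs > 0.0 && flop_per_byte >= 0.0);
    double memory_roof = flop_per_byte * bw_gbs;
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}
```

On an assumed machine with 1000 GFLOP/s peak and 100 GB/s of DRAM bandwidth, the ridge point sits at 10 FLOP/byte. Matrix-vector multiply at 0.25 FLOP/byte is capped at 25 GFLOP/s, while matrix-matrix multiply at 16 FLOP/byte reaches the compute roof.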


Tools

# perf mem: sample memory loads with address + latency info
perf mem record -p <pid> -- sleep 30
perf mem report --sort=mem

# pcm-memory: Intel PCM DRAM bandwidth
pcm-memory 1

# valgrind massif: heap profiling (memory allocation patterns)
valgrind --tool=massif --pages-as-heap=yes ./program
ms_print massif.out.<pid> | head -80

# heaptrack: lightweight heap profiler (< 5% overhead)
heaptrack ./program
heaptrack_gui heaptrack.program.<pid>.gz

# BCC memleak: detect memory leaks using eBPF
/usr/share/bcc/tools/memleak -p <pid>

# Bandwidth test with stream
./stream | grep -E "Triad|Add"

Historical Context

The "Memory Wall" problem was named by Wulf and McKee in 1995: processor speed was improving by roughly 60% per year or more, while DRAM access time improved by less than 10% per year. The widening gap is why caches, prefetchers, and ultimately multi-level cache hierarchies came to dominate processor design.

NUMA architectures became standard with the introduction of AMD Opteron (2003) and Intel Nehalem (2008), which moved the memory controller on-die. The ccNUMA (cache-coherent NUMA) model used today requires software to be NUMA-aware to achieve peak performance.

The perf mem subcommand and the BCC toolkit (2015–2018), to which Brendan Gregg contributed many tools, made fine-grained memory profiling accessible without custom hardware.


Production Examples

Case: Redis cluster latency doubling after hardware upgrade. A Redis cluster moved from single-socket to dual-socket servers. p99 GET latency doubled from 0.5 ms to 1.1 ms. numastat -p showed that 60% of the Redis process's memory resided on the remote NUMA node. Root cause: Redis memory was spread across both nodes, but the CPU affinity was set to socket 0 only. Fix: numactl --cpunodebind=0 --membind=0 redis-server. Latency returned to 0.5 ms.

Case: JVM GC pauses from TLB misses. A Java service with a 32 GB heap running with 4 KB pages experienced GC pauses extending to 500 ms. perf stat showed extreme dTLB-load-misses. Enabling THP reduced page table entries from 8 million to 16,000. GC pauses dropped to 50 ms and TLB miss rate fell by 20x.


Debugging Notes

  • cat /proc/<pid>/smaps | grep -A 5 AnonHugePages — shows whether THP is active for a process's mappings.
  • cat /proc/buddyinfo — shows free page fragmentation (high fragmentation prevents THP allocation).
  • /sys/kernel/mm/transparent_hugepage/defrag — if set to always, the kernel will compact memory to satisfy THP allocations, which can cause latency spikes. Prefer defer+madvise.
  • perf stat -e cache-misses,cache-references reports on the last-level cache by default, not L1. Use specific event names for L1 (e.g., L1-dcache-load-misses).
  • MALLOC_MMAP_THRESHOLD_ environment variable: glibc will mmap allocations above this size. Large mmap'd regions benefit from THP; small heap allocations do not.

Security Implications

Memory side-channels exploit timing differences between cache hits and misses. FLUSH+RELOAD (used in Meltdown, Spectre, and numerous cryptographic attacks) works by:

  1. Flushing a cache line with CLFLUSH.
  2. Letting the victim access a secret-dependent address.
  3. Timing access to candidate lines: the one that loads fast was accessed by the victim.

Mitigations: constant-time algorithms (no data-dependent memory access patterns), cache partitioning (CAT — Intel Cache Allocation Technology), core isolation.
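
To illustrate what "no data-dependent memory access patterns" means, the sketch below looks up a table entry by a secret index while reading every entry, selecting the wanted one with a bit mask instead of an indexed load. The function name is hypothetical and this is an illustration of the idea only: compilers may transform such code, so production cryptography should rely on vetted constant-time libraries.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Return table[secret_idx] while touching every entry, so the sequence
 * of addresses accessed does not depend on secret_idx. Returns 0 if
 * secret_idx is out of range. */
uint8_t ct_lookup(const uint8_t *table, size_t n, size_t secret_idx)
{
    assert(table != NULL || n == 0);
    uint8_t result = 0;
    for (size_t i = 0; i < n; i++) {
        /* diff is 0 only when i == secret_idx. */
        size_t diff = i ^ secret_idx;
        /* (diff | -diff) has its top bit set exactly when diff != 0. */
        size_t nonzero = (diff | ((size_t)0 - diff))
                         >> (sizeof(size_t) * CHAR_BIT - 1);
        /* mask is 0xFF when i == secret_idx, 0x00 otherwise. */
        uint8_t mask = (uint8_t)(nonzero - 1);
        result |= (uint8_t)(table[i] & mask);
    }
    return result;
}
```

The cost is a full pass over the table per lookup, which is why this approach is practical only for small tables.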

Rowhammer: repeatedly activating a DRAM row can flip bits in adjacent rows. This has been demonstrated to escalate privileges (CVE-2015-0565, CVE-2016-6728). DRAM manufacturers have added Target Row Refresh (TRR) as mitigation, though it has been bypassed (TRRespass, 2020).


Performance Implications

Memory access patterns determine whether the hardware prefetcher fires. A linked-list traversal of the same data as a contiguous array can be 10x slower due to pointer-chasing cache misses. Data structure selection (array vs. linked list, open-addressing hash table vs. chained hash table) has more performance impact than algorithmic complexity for small N in cache-resident workloads.

Cache line padding (64-byte alignment, pad structs to cache-line boundaries) eliminates false sharing but increases memory footprint, which can worsen cache capacity pressure. Always measure.


Failure Modes and Real Incidents

THP-induced latency spikes. Transparent Huge Pages with defrag=always caused 1–2 second latency spikes at Monzo Bank (2019) during memory compaction for THP allocation. With defrag=always, a page fault that cannot find a free 2 MB region triggers direct reclaim and compaction in the faulting thread, which stalls until compaction finishes. Fix: set transparent_hugepage/defrag to madvise, so that only regions an application has explicitly marked with madvise(MADV_HUGEPAGE) can stall for compaction.
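
The madvise opt-in can look like the following on Linux. This is a minimal sketch under stated assumptions: alloc_thp_hinted is a hypothetical helper name, the call is treated as a hint whose failure is ignored, and the kernel only uses huge pages for the 2 MB-aligned portions of the region.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map an anonymous region and ask the kernel to back it with
 * transparent huge pages. The madvise call is only a hint: its
 * result is ignored (it fails, for example, when THP is compiled
 * out). Returns NULL if the mapping itself fails. */
void *alloc_thp_hinted(size_t bytes)
{
    assert(bytes > 0);
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_HUGEPAGE
    (void)madvise(p, bytes, MADV_HUGEPAGE);
#endif
    return p;
}
```

Release the region with munmap(p, bytes). Whether huge pages were actually used can be checked through the AnonHugePages field in /proc/<pid>/smaps.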

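OOM killer + NUMA. A workload on a dual-socket system was killed by the OOM killer even though 30% of total system memory was free. Root cause: allocations were confined to one NUMA node (as happens with a strict policy such as numactl --membind or a cpuset), so that node was exhausted while the other was 60% free. Fix: use a non-strict policy such as numactl --preferred or --interleave. Enabling automatic NUMA balancing (echo 1 > /proc/sys/kernel/numa_balancing) migrates pages toward the CPUs that use them but does not lift a strict binding.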


Modern Usage

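Memory tiering (CXL 1.0+, Linux 5.18+): CXL-attached memory appears as a separate, CPU-less NUMA node with ~300 ns latency (higher than DDR5, lower than NVMe). The kernel's tiering support demotes cold pages to the slower node during reclaim and promotes hot pages back through NUMA balancing, so hot data tends to stay in DDR5 and cold data in CXL memory. User-space libraries such as memkind's memtier offer explicit tier-aware allocation.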

Persistent memory (Intel Optane PMEM): latency ~300 ns, byte-addressable, non-volatile. Requires PMDK (Persistent Memory Development Kit) for correct use. ndctl for management. Largely discontinued after Intel exited the PMEM market in 2022, but architectural lessons inform CXL memory design.


Future Directions

  • CXL 3.0 memory pooling: multiple hosts share a CXL memory pool, enabling memory disaggregation. Latency target: < 200 ns. Software challenges: NUMA-aware allocators, coherence.
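  • HBM (High Bandwidth Memory) and stacked cache in CPUs: Intel Sapphire Rapids HBM (Xeon Max) offers 64 GB of on-package HBM2e at roughly 1 TB/s, and programs benefit without code changes in its HBM-only or cache modes. AMD Genoa-X takes a different route: it stacks additional SRAM (3D V-Cache) to reach up to 1152 MB of L3 rather than using HBM.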
  • DRAM-less servers with disaggregated CXL memory are being prototyped at hyperscalers for cost efficiency.

Exercises

  1. Write a benchmark that accesses a 512 MB array in two patterns: sequential and random (using a Fisher-Yates shuffle for index ordering). Derive bandwidth from elapsed time, count misses with perf stat -e LLC-load-misses, and explain the 10x+ difference.

  2. Demonstrate false sharing: create two threads incrementing adjacent uint64_t counters 1 billion times each. Measure wall-clock time. Pad each counter to 64 bytes and re-measure. Use perf c2c to confirm the cache line bouncing.

  3. Enable huge pages for a memory-intensive application using madvise(MADV_HUGEPAGE). Measure TLB miss rate before and after with perf stat -e dTLB-load-misses,dTLB-loads. Calculate the percentage improvement.

  4. Use numactl --hardware to examine a NUMA system's topology. Then run a bandwidth benchmark with memory bound to local vs. remote nodes and measure the bandwidth and latency difference.

  5. Profile a pointer-chasing linked list traversal vs. a contiguous array traversal of the same N integers using perf stat. Identify which PMU events explain the performance difference.


References

  • Wulf, W. A., McKee, S. A. "Hitting the Memory Wall: Implications of the Obvious." ACM SIGARCH, 1995.
  • Drepper, U. "What Every Programmer Should Know About Memory." Red Hat, 2007. https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
  • Gregg, B. Systems Performance (2nd ed., 2020). Chapter 7: Memory.
  • Williams, S. et al. "Roofline: An Insightful Visual Performance Model for Multicore Architectures." CACM, 2009.
  • van der Veen, V. et al. "Drammer: Deterministic Rowhammer Attacks on Mobile Platforms." CCS 2016.
  • Frigo, M. et al. "TRRespass: Exploiting TRR in Commodity DRAM." IEEE S&P 2020.
  • NUMA programming: https://www.kernel.org/doc/html/latest/admin-guide/mm/numa_memory_policy.html