Cache Hierarchy: L1 Through LLC, Coherence, and Prefetching

Prerequisites

CPU pipeline and OoO execution (01-cpu-pipeline.md, 02-out-of-order-execution.md): understanding why memory latency matters
Basic digital logic: SRAM vs DRAM, addressing concepts
Virtual memory basics: page tables, TLB (for VIPT understanding)
Assembly language: load/store instructions, memory access patterns
Big-O notation for understanding associativity tradeoffs

Technical Overview

The cache hierarchy exists because of a fundamental physical reality: the fastest memory is expensive and physically large, making it impossible to build a single fast memory system large enough for program data. DRAM (Dynamic RAM) is dense and cheap — commodity DDR5 delivers ~50 GB/s bandwidth — but its latency is approximately 50-100 ns (200+ CPU cycles on a 4 GHz processor). SRAM (Static RAM), used for caches, is 10-50x faster but requires 4-6 transistors per bit vs 1 transistor + 1 capacitor for DRAM. A full 64 MB SRAM at L1 cache latency would occupy most of a processor die.

The solution: a hierarchy of caches, each larger and slower as distance from the CPU core increases. The hierarchy exploits temporal locality (recently accessed data will be accessed again soon) and spatial locality (data near recently accessed data will be accessed soon).

Modern caches are one of the primary determinants of real-world CPU performance. Programs that fit in L1/L2 cache execute at near-peak IPC. Programs that generate frequent L3 or DRAM accesses see 5-50x performance degradation regardless of CPU compute capacity.

Historical Context

1965 — IBM System/360 Model 85: First production CPU cache, 16 KB with 64-byte lines. Engineers at IBM (L. J. Slotnik, others) demonstrated that a small fast buffer between the CPU and main memory could dramatically improve average memory access time.

1968 — Maurice Wilkes coins the term "cache" in his paper "Slave Memories and Dynamic Storage Allocation."

1970s-1980s: Cache research flourishes. Direct-mapped vs set-associative tradeoffs analyzed. Working set theory (Denning) provides the theoretical basis for understanding cache behavior.

1989 — Intel i486: First Intel CPU with an on-chip cache (8 KB unified L1).

1993 — Intel Pentium: Split 8 KB I-cache + 8 KB D-cache. External L2 cache on motherboard (256 KB–512 KB). L2 at half CPU speed.

2000 — Intel Pentium III Coppermine: On-die 256 KB L2 (full-speed). HUGE performance boost over off-die L2.

2008 — Intel Nehalem: First Intel CPU with an on-die L3 cache shared across all cores. L3 is inclusive of L1/L2 on Nehalem through Broadwell.

2019 — AMD Zen 2 (Rome, 7nm): Non-inclusive (exclusive) L3 policy with chiplet design — 8-core CCDs with 32 MB L3 each, up to 256 MB L3 total on EPYC Rome. This design maximizes effective cache capacity.

2022 — AMD Zen 4 with 3D V-Cache: AMD stacks an additional 64 MB SRAM die on top of the compute die ("3D V-Cache"), tripling L3 to 96 MB per CCD. Latency impact: ~3 cycles additional for stacked SRAM, but L3 hit rate dramatically improves for cache-sensitive workloads.

Memory Latency Reference Table

Memory Level          Latency         Bandwidth        Size (per core/socket)
─────────────────────────────────────────────────────────────────────────────
Register               1 cycle        unlimited        ~1 KB (register file)
L1 instruction cache   4 cycles       ~2 TB/s          32-64 KB
L1 data cache          4-5 cycles     ~2 TB/s          32-64 KB
L2 cache               12-15 cycles   ~500 GB/s        256 KB - 1 MB
L3 cache (on-die)      40-50 cycles   ~200 GB/s        4-96 MB (shared)
L4 (eDRAM, rare)       ~100 cycles    ~50 GB/s         64-128 MB (Broadwell-H)
DRAM (DDR5-4800)       70-100 ns      ~50 GB/s         16-2048 GB
                       (~280-400 cyc)
NVMe SSD               ~100 μs        ~7 GB/s          TB range
HDD                    ~10 ms         ~200 MB/s        TB range

Note: Latency figures for Intel Core/AMD Zen, 4-5 GHz clock, 2023.
Bandwidth figures are theoretical peak per core (L1/L2) or socket (L3/DRAM).
Actual available bandwidth depends on access pattern and concurrency.

Core Content: Cache Structure

Cache Line

The basic unit of caching is the cache line (also called "cache block"). All modern CPUs use 64-byte cache lines. This is universal across x86-64, ARM64, RISC-V, POWER — a hardware-software interface standard.

Cache line (64 bytes):
┌──────────────────────────────────────────────────────────────┐
│ byte 0 │ byte 1 │ ... │ byte 62 │ byte 63                   │
└──────────────────────────────────────────────────────────────┘
 ↑                                                              ↑
 addr & ~63 (aligned)                               addr | 63

A single load instruction fetches 8 bytes (uint64_t) but causes
the entire 64-byte line to be loaded into cache.

Spatial locality is why 64-byte lines work: accessing arr[0] loads arr[0] through arr[7] into cache simultaneously. If the program then accesses arr[1], arr[2], ... sequentially, all subsequent accesses are cache hits.

Cache Organization: Set-Associative

A direct-mapped cache maps each memory address to exactly one cache slot. Simple but suffers from conflict misses: two frequently used addresses that map to the same slot evict each other repeatedly.

A fully associative cache allows any memory block to reside in any cache slot. Optimal hit rate, but the hardware needed to search all slots in parallel grows prohibitively with cache size.

N-way set-associative is the universal compromise: the cache is divided into S sets, each containing N slots ("ways"). A memory address maps to exactly one set (determined by address bits), but can occupy any of the N ways in that set.

ADDRESS DECOMPOSITION:

  Physical Address (48 bits on modern x86-64):
  ┌─────────────────────┬─────────────────┬────────────────┐
  │      Tag            │   Set Index     │  Block Offset  │
  │    bits[47:12+s]    │  bits[12+s-1:6] │  bits[5:0]     │
  └─────────────────────┴─────────────────┴────────────────┘

  Block offset = log2(64) = 6 bits (select byte within cache line)
  Set index = log2(number_of_sets) = s bits
  Tag = remaining bits (identify which memory block is in this set)

Example: 32 KB 8-way L1 D-cache:
  Number of sets = 32KB / (8 × 64B) = 64 sets
  Set index = log2(64) = 6 bits
  Block offset = 6 bits
  Tag = 48 - 6 - 6 = 36 bits

  Address breakdown: [tag:36 | set:6 | offset:6]

Cache lookup:

  Physical Address
        │
        ├──────────────► [tag bits]
        │                     │
        ├──► [set index] ─►  Set S:    ┌──────────────────────────────┐
        │                              │ Way 0: tag0 | valid | dirty | data │
        │                              │ Way 1: tag1 | valid | dirty | data │
        │                              │ ...                                 │
        │                              │ Way 7: tag7 | valid | dirty | data │
        │                              └──────────────────────────────┘
        │                                    │
        └──────────────────────────────────► Compare all tags in parallel
                                             │
                                             ▼
                                    Hit? → return data[offset]
                                    Miss? → fetch from next level

Associativity tradeoff: - 1-way (direct-mapped): lowest cost, highest conflict miss rate - 4-way: good balance for L1 (Intel/AMD L1 D-cache is 8-way) - 8-way: standard for L1 D-cache - 16-way: common for L2, L3 - Fully associative: used for TLB (small), not feasible for large caches

Modern Cache Hierarchy Parameters

CPU                     L1I    L1D   L2       L3
───────────────────────────────────────────────────────────────
Intel Skylake           32 KB  32 KB 256 KB   depends on SKU
Intel Alder Lake (P)    32 KB  48 KB 1.25 MB  3 MB per 2P-cores
Intel Sapphire Rapids   32 KB  48 KB 2 MB     60 MB
AMD Zen 4               32 KB  32 KB 1 MB     32 MB per CCD
AMD Zen 4 + 3D V-Cache  32 KB  32 KB 1 MB     96 MB per CCD
Apple M1                192 KB 128 KB 12 MB   (no separate L3;
                                               SLC=shared 16 MB)
Apple M3                192 KB 128 KB 16 MB   (SLC)
ARM Cortex-X4           64 KB  64 KB  512 KB  per-cluster, varies
IBM POWER10             32 KB  32 KB  2 MB    120 MB

Note: Apple M-series uses "System Level Cache" (SLC) — shared across CPU,
GPU, Neural Engine. L1/L2 sizes per performance core.

Inclusion Policies

Intel Inclusive L3 (Sandy Bridge through Cascade Lake)

Every line present in L1 or L2 must also be present in L3. Advantage: snooping for cache coherence can check just the L3, simplifying coherence protocol. Disadvantage: L3 capacity is "wasted" storing data that already exists in L1/L2.

   Core 0 L1 ──┐
   Core 0 L2 ──┼──► All must be subsets of L3
   Core 1 L1 ──┤
   Core 1 L2 ──┘
   Core 2 L1 ──┐
               └──► Inclusive L3 must contain all

With 8 cores, each with 512 KB L2: up to 4 MB of L3 is occupied just for L1/L2 inclusions — wasted capacity on a 32 MB L3.

L3 inclusive capacity penalty: Typically 15-30% of L3 capacity is occupied by L1/L2 inclusions.

On invalidation in inclusive L3: when the L3 line is evicted, it must snoop and invalidate the corresponding L1/L2 copies. This creates "silent evictions" — data disappears from L1/L2 without the core requesting it, due to L3 eviction. This is the "back-invalidation" problem. Under high L3 pressure on many-core systems, L1/L2 hit rates degrade due to back-invalidations.

AMD Non-Inclusive (Victim Cache) L3 (Zen 2+)

L3 acts as a victim cache: it stores data that has been evicted from L2, not duplicates of active L1/L2 data.

   L1 miss → check L2
   L2 miss → check L3 (may have recently evicted data)
   L3 miss → fetch from DRAM → install in L2 (and L1)
            → when L2 evicts a line → install in L3

   L3 never has data that's also in L1/L2 (non-inclusive)

Advantage: full L3 capacity available for unique data. A 32 MB L3 in AMD Zen 4 holds 32 MB of unique data vs ~27 MB effective for inclusive designs.

Disadvantage: coherence is more complex — must check L1/L2 directly, not just L3.

Intel Post-Ice-Lake: NINE (Non-Inclusive Non-Exclusive)

Modern Intel Xeon (Sapphire Rapids) uses a hybrid: the L3 is non-inclusive (doesn't require all L2 data present) and non-exclusive (may contain data also in L2). Provides good capacity utilization without strict victim-cache constraints.

Write Policies

Write-Back (Standard for all modern CPUs)

On a cache write hit: update data in cache, mark the line "dirty." Do NOT write to memory immediately. Write to memory only when the dirty line is evicted.

   Write to addr X:
     L1 hit, dirty? → update L1, mark dirty
     L1 hit, clean? → update L1, mark dirty
     L1 miss?       → allocate line in L1 (fetch from L2)
                      → update L1, mark dirty

   Eviction of dirty line:
     L1 → L2: write dirty data to L2, L1 slot freed
     L2 → L3: write dirty data to L3 (or DRAM)
     L3 → DRAM: write dirty data to DRAM ("write-back")

Write-allocate: On a write miss, fetch the cache line first, then write. This allows subsequent reads to the same line to hit the cache. Standard behavior alongside write-back.

Write-Through (Less Common, Used for Some Peripheral Registers)

Every write to cache is simultaneously written to memory. Simpler coherence, but generates write traffic for every store.

   Write to addr X:
     L1 hit → update L1 AND write to memory (every write!)
     No dirty bits needed (cache always matches memory)

Write-through is used for memory-mapped I/O regions where devices must see writes immediately, and for some L1 designs in embedded processors. Not used for main data caches in general-purpose CPUs.

Cache Replacement Policies

When a set is full and a new line must be loaded, which way to evict?

LRU (Least Recently Used)

Evict the way that was accessed least recently. Optimal for temporal locality. True LRU requires tracking access order of all N ways per set.

True LRU in hardware: N-way requires tracking permutations. For 8-way LRU: 8! = 40320 states, requiring log2(40320) = 15.3 bits per set. For a 1024-set 8-way cache: ~2 KB just for replacement state. Tractable for L1.

Pseudo-LRU (PLRU)

Approximates LRU with a binary tree or bit vector per set.

8-way PLRU tree (one tree per set):

        [b0]
       /    \
    [b1]    [b2]
   /   \   /   \
 [b3] [b4][b5] [b6]
 / \  / \ / \  / \
 W0 W1 W2 W3 W4 W5 W6 W7

Each access updates tree bits along the path to the accessed way.
Eviction: follow path choosing 1-bits to find LRU approximation.
Cost: 7 bits per 8-way set (N-1 bits for N-way tree)

Intel uses PLRU variants in L1. AMD uses similar schemes.

RRIP (Re-Reference Interval Prediction, Jaleel et al. 2010)

Instead of tracking recency, track predicted re-reference interval (how soon will this line be accessed again?).

Each cache line has a 2-bit RRPV (Re-Reference Prediction Value): - 0: near-immediate re-reference (hot line) - 3: distant re-reference (cold line, candidate for eviction)

On insertion:     RRPV = 2 (distant, but not immediate eviction)
On cache hit:     RRPV = 0 (now predicted near-immediate)
On eviction need: find first RRPV == 3
                  if none found: increment all RRPVs by 1, retry

SRRIP (Static RRIP): always insert with RRPV = 2
DRRIP (Dynamic RRIP): OS or hardware monitors hit rate, switches
                       between SRRIP (workload fits in cache) and
                       BRRIP (bimodal: most inserts at RRPV=3,
                               occasional at RRPV=2) to resist
                       thrashing

RRIP is particularly effective for L3 where scan/streaming workloads cause cache thrashing. Intel Sandy Bridge and later use a variant of RRIP for L3.

VIPT (Virtually Indexed Physically Tagged)

L1 caches face a critical latency dilemma: to index into the cache, you need the physical address. But physical address translation (TLB lookup) takes 1-2 cycles. For a target L1 hit latency of 4 cycles, you can't afford to wait.

VIPT solution: Use virtual address bits for the set index (available immediately, no translation), but use physical address bits for the tag (for correctness after TLB lookup).

   Virtual Address (on load instruction issue):
   ┌───────────────┬──────────────┬──────────────┐
   │  (VA bits)    │  Set Index   │ Block Offset  │
   │               │  (VA bits)   │               │
   └───────────────┴──────────────┴───────────────┘
           │              │               │
           │       Used immediately        │
           │       to index L1 cache       │
           │       (no TLB needed)         │
           │                              │
   ┌───────▼──────────────────────────────▼───────────┐
   │  L1 D-Cache                                       │
   │  [sets indexed by VA, ways read in parallel]     │
   └───────────────────────────────────────────────────┘
                           │
                    Tags from all ways

   Physical Address (from TLB, 1-2 cycles later):
   ┌───────────────┬──────────────┬──────────────┐
   │      Tag      │              │               │
   │  (PA bits)    │              │               │
   └───────┬───────┴──────────────┴───────────────┘
           │
           Compare PA tag against cached way tags
           → determines which (if any) way is a hit

Aliasing constraint: For VIPT to be correct without aliasing, the set index bits must come from bits below the page offset (bits 11:0). This means: index_bits + offset_bits ≤ 12. With 6 offset bits: at most 6 index bits → max 64 sets for VIPT with no aliasing risk.

For an 8-way VIPT cache: 64 sets × 8 ways × 64 bytes = 32 KB maximum without alias risk at one set index bit per page alias. This is why Intel's L1 D-cache is 32 KB in Sandy Bridge through Raptor Lake — the constraint is not arbitrary, it is the VIPT aliasing limit.

Intel Alder Lake P-core has 48 KB L1 D-cache: it uses 7 index bits (bits[11:6]), exceeding the 6-bit limit. This creates a 2:1 alias mapping per physical page, which the hardware handles with synonym tracking (additional tag bits or probing). More complex but enables larger VIPT L1.

Hardware Prefetchers

Prefetchers predict future memory accesses and fetch cache lines before they are requested, hiding miss latency.

Stream Prefetcher

Detects sequential access patterns. When N consecutive cache misses to increasing (or decreasing) addresses are observed, the prefetcher issues ahead-of-demand requests for the next K lines.

   Access pattern: addr, addr+64, addr+128, addr+192 ...
   Prefetcher detects stream → issues requests for addr+256, addr+320, ...

   Intel has "Adjacent Cache Line Prefetch" (L1 → L2 on L1 miss,
   always fetches the adjacent line to the missing line)
   and a more aggressive L2 stream prefetcher (up to 8 streams).

Stride Prefetcher

Detects constant-stride patterns. If accesses follow address[i] = base + i × stride, issue prefetches for base + (i+1) × stride, base + (i+2) × stride, etc.

   Example: struct array access  struct { int a; double b; } arr[N];
   Access to arr[i].b: stride = sizeof(struct) = 16 bytes (not cache-line-aligned)
   Stride prefetcher detects 16-byte stride → issues ahead-of-time requests

Spatial Memory Streaming (SMS, Somogyi et al. 2006)

Observes the spatial access pattern within a physical page (which offsets within a 4 KB page are accessed) and replicates this pattern for future accesses to the same page-offset-aligned region.

Instruction Pointer (PC)-based Prefetcher

Correlates load instruction PCs with their access patterns. A load that always accesses an array sequentially is identified and prefetched aggressively; a load that accesses random addresses is suppressed (to avoid polluting the cache).

Prefetcher Aggressiveness and Pollution

Prefetchers must balance: - Timeliness: Issue prefetches early enough to arrive before the demand access (usually several hundred cycles ahead for DRAM) - Accuracy: Only prefetch data that will actually be used (inaccurate prefetches evict useful lines and consume memory bandwidth) - Coverage: Detect enough different patterns to improve hit rates

AMD Zen 4 has approximately 12 distinct prefetch engines across the L1/L2/L3 hierarchy. Intel Sapphire Rapids has hardware and microcode-managed prefetch engines.

Cache Performance Analysis

perf stat for Cache Events

# L1, L2, L3 miss rates
perf stat -e \
  L1-dcache-loads,L1-dcache-load-misses,\
  L2-loads,L2-load-misses,\
  LLC-loads,LLC-load-misses,\
  cache-misses,cache-references \
  ./program

# Example output:
#  5,000,000,000  L1-dcache-loads
#     50,000,000  L1-dcache-load-misses     # 1.0% L1 miss rate (good)
#     40,000,000  L2-load-misses            # 80% L2 miss rate (bad!)
#     30,000,000  LLC-load-misses           # 75% LLC miss rate (bad!)

Interpreting miss rates: - L1 miss rate > 5%: working set exceeds L1, likely needs data restructuring - LLC miss rate > 10%: frequent DRAM access, primary performance bottleneck - High LLC load miss + low cycles: memory latency is hidden well by OoO - High LLC load miss + high stalled-cycles-backend: memory latency is the bottleneck (ROB full, OoO can't hide it)

Cache Hierarchy Bandwidth Measurement

# Run stream benchmark to measure memory bandwidth
# https://www.cs.virginia.edu/stream/
./stream_benchmark
# Output: triad bandwidth (GB/s) is the relevant memory BW figure

# Intel MLC (Memory Latency Checker)
mlc --latency_matrix   # NUMA latency between nodes
mlc --bandwidth_matrix  # NUMA bandwidth between nodes

Visualizing Cache Behavior with `perf c2c`

# Detect false sharing (cache line ping-pong between cores)
perf c2c record ./multithreaded_program
perf c2c report
# Identifies hot cache lines with high "Hitm" (hit modified) events

Failure Modes

Cache thrashing: Working set slightly larger than cache. All sets are full of "hot" data. Every new access evicts a line that will be needed again soon. LRU or RRIP replacement degrades to near-DRAM performance. Fix: reduce working set (data structure optimization), pad to avoid set conflicts, or use software prefetching to amortize latency.
False sharing (see 06-cache-coherence.md): Two threads write different fields within the same cache line. The cache coherence protocol ping-pongs the line between cores at ~200 ns per write, despite there being no logical sharing. Classic example: adjacent elements of an array processed by different threads.
Cache set conflicts (direct-mapped or low-associativity): Two frequently accessed arrays whose sizes are exact multiples of the cache size map to the same sets. With an 8-way cache, up to 8 simultaneous aliases can co-exist; a 9th causes eviction. Manifests as unexpected cache miss rate spikes when array sizes are powers of 2.
Prefetcher misfiring: Random-access workloads (hash tables, tree traversals) defeat all prefetchers. The prefetcher issues speculative requests that bring in wrong data, consuming cache capacity and memory bandwidth without benefit. On such workloads, prefetcher can reduce performance. Some Intel CPUs allow disabling specific prefetch engines via MSR bits.
Non-temporal store cache bypass: MOVNT (non-temporal move) bypasses the cache hierarchy to avoid cache pollution. If used incorrectly on data that will be read soon, it prevents cache allocation and forces DRAM access on the subsequent read.

Security Implications

Cache timing side channels: Cache hit vs miss timing (typically 4 vs 200+ cycles) is the foundation of virtually all microarchitectural timing attacks. Flush+Reload and Prime+Probe are the two primary techniques:
Flush+Reload: attacker flushes target cache line, victim accesses it (or not), attacker measures reload time
Prime+Probe: attacker fills a cache set, victim accesses it (evicting attacker's data), attacker probes all ways and observes which is evicted
Spectre via cache covert channel: Spectre and Meltdown both use cache state as the information encoding for leaking secrets. The cache is the covert channel — without timing measurability, the attack fails.
Cache side channels in crypto: AES table-based implementations access S-boxes based on key material. Cache timing of these accesses leaks key bits. Mitigation: AES-NI hardware instructions (Intel: AESENC, AMD equivalent) perform AES in hardware with data-independent timing — no cache side channel.
FLUSH+FLUSH attack: Even the timing of CLFLUSH itself varies depending on whether the line was cached. Slower CLFLUSH on a hit → detects cache residence without even measuring reload time. Harder to defend against.

Performance Implications

Cache-Oblivious Algorithms

Some algorithms achieve optimal cache performance without knowing the cache parameters — they recursively divide the problem until it fits in whatever level of the hierarchy is available.

Naive matrix multiply: O(N^3) cache misses (row × column → N^3/B misses)
Cache-oblivious (recursive block): O(N^3 / (B × sqrt(M))) misses
  where B = cache line size, M = cache size

For N=1024, B=64B, M=32KB: ~8x fewer cache misses with recursive approach.

Memory Access Pattern Optimization

Row-major access (C-style, cache friendly):
  for i: for j: access arr[i][j]
  → accesses arr[i][0], arr[i][1], arr[i][2] ... sequentially
  → cache line loaded once, 8 accesses per line

Column-major access (cache unfriendly):
  for j: for i: access arr[i][j]
  → accesses arr[0][j], arr[1][j], arr[2][j] ... with stride N
  → cache line loaded, 1 access, evicted before row loops back
  → N^2 cache misses vs N^2/8 for row-major

Modern Usage

Linux `perf mem` for Cache-Level Attribution

# Record memory accesses with cache level attribution
perf mem record ./program
perf mem report

# Shows: % of loads hitting L1/L2/L3/DRAM + average latency

Intel VTune Amplifier

Provides detailed cache hierarchy analysis with source-level attribution. The "Memory Access" analysis in VTune identifies specific source lines responsible for LLC misses.

`valgrind --tool=cachegrind`

Simulates cache behavior and counts hits/misses at each level for each function and source line. Slower than hardware counters but provides instruction-level attribution:

valgrind --tool=cachegrind --branch-sim=yes ./program
cg_annotate cachegrind.out.12345 --auto=yes

Future Directions

3D-stacked SRAM cache (AMD 3D V-Cache): Vertically stacking SRAM dice using through-silicon vias (TSVs). AMD ships 64 MB additional L3 per CCD. Trade: ~4 cycle additional L3 latency but 3x capacity. Intel has similar technology (Foveros) available.
Near-data computing (Processing-in-Memory): Moving compute units closer to DRAM (HBM with embedded compute, UPMEM DPU). Bypasses the cache hierarchy entirely for memory-bound workloads.
Persistent memory as cache hierarchy level: Intel Optane DCPMM (discontinued 2022, but successors pending) inserted between DRAM and storage. ~10 μs latency, byte-addressable. Research continues on PM-aware cache hierarchy design.
Predictive prefetching with ML: Neural prefetchers (Bhatotia et al., "Voyager"; "Pythia" from Microsoft Research) use small neural networks in hardware to predict irregular access patterns. Not yet in shipping CPUs but competitive with state-of-art hardware prefetchers on pointer-chasing workloads.
Cache partitioning (Intel CAT — Cache Allocation Technology): Hardware mechanism to assign portions of LLC ways to specific cores/processes, preventing cache thrashing in multi-tenant systems. Deployed in Intel Xeon since Broadwell. AMD has similar UMSR-based mechanisms in Zen 4.

Exercises

Cache parameter measurement: Write a microbenchmark that measures average access latency for arrays of increasing size (512 B to 512 MB in 2x steps). Plot latency vs array size. Identify the L1/L2/L3/DRAM breakpoints and confirm they match your CPU's cache sizes (check with lscpu or cat /sys/devices/system/cpu/cpu0/cache/index*/size).
Set-associativity conflict measurement: Create two large arrays whose sizes are exact powers of 2. Access them interleaved. Vary the size from 8 KB to 64 KB. Observe the miss rate spike when sizes match the direct-mapped conflict threshold. Explain using set index arithmetic.
False sharing detection: Create a struct with two integer fields. Launch two threads, each incrementing one field 100 million times. Measure time. Then pad the struct so each field is on a separate cache line. Measure again. Use perf c2c to observe the "Hitm" (Hit Modified) count in both cases.
Prefetcher behavior: Write a benchmark accessing a 256 MB array with stride S (vary S from 1 to 1024 cache lines). For each stride, measure throughput. Identify: (a) where the L2 stream prefetcher stops being effective (beyond its stride detection range), (b) the stride at which throughput drops to DRAM bandwidth.
Cache replacement policy analysis: Implement a 4-way set-associative cache simulator with both LRU and RRIP replacement. Feed it the access traces from a linked-list traversal vs a sequential scan. Compare miss rates. Explain why RRIP improves over LRU for scan workloads.

References

Wilkes, M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers, EC-14(2), 270–271.
Hill, M. D. (1988). A Case for Direct-Mapped Caches. IEEE Computer, 21(12), 25–40.
Smith, A. J. (1982). Cache Memories. ACM Computing Surveys, 14(3), 473–530.
Jaleel, A., et al. (2010). High Performance Cache Replacement Using Re-Reference Interval Prediction. ISCA 2010.
Somogyi, S., et al. (2006). Spatial Memory Streaming. ISCA 2006.
Drepper, U. (2007). What Every Programmer Should Know About Memory. Red Hat. https://people.redhat.com/drepper/cpumemory.pdf
Intel Corporation. (2022). Intel 64 and IA-32 Architectures Optimization Reference Manual. Chapter 2 (Intel Microarchitecture Codename Skylake) and Chapter 3 (Optimization for Memory).
Fog, A. (2023). Optimizing Software in C++. Chapter 9: Memory Access. https://www.agner.org/optimize/optimizing_cpp.pdf
Kim, Y., et al. (2012). Case for Locality-Based Cache Management. USENIX ATC 2012.