Cache Coherence: MESI, Snooping, and Directory Protocols

Prerequisites

Cache hierarchy (05-cache-hierarchy.md): cache lines, write policies, L1/L2/L3 structure
Multi-core CPU concepts: multiple cores, shared memory
Basic bus protocols: how devices communicate over shared buses
Memory consistency models (briefly referenced): sequential consistency, TSO

Technical Overview

Cache coherence is the problem of maintaining a consistent view of shared memory when multiple caches each hold copies of the same memory location. Without coherence, Core 0 could read a stale value from its local L1 cache while Core 1 has already updated that memory location in its own L1.

A coherence protocol defines rules ensuring that at any moment:
1. Any read returns the most recently written value
2. All writes to a location are eventually visible to all processors
3. Writes to a single location appear to occur in the same order to all processors (write serialization)

This is distinct from memory consistency (which defines the allowed ordering of operations across different memory locations). Coherence is per-address; consistency is system-wide ordering.

Two fundamental approaches to coherence dominate modern hardware:

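Snooping: All caches monitor ("snoop") a shared interconnect for operations affecting lines they hold. Pure broadcast snooping scales to approximately 8-16 cores before interconnect bandwidth becomes the bottleneck; larger on-die designs add snoop filters to cut broadcast traffic. It is the traditional choice within a single die or socket.
Directory: A directory (logically central, often physically distributed) tracks which caches hold each memory line, so coherence messages go only to the caches involved instead of being broadcast. Scales to hundreds or thousands of nodes. Used in large NUMA and multi-socket systems, and wherever one interconnect must serve more caches than broadcast can sustain (roughly 16 or more).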

Historical Context

1974: The coherence problem was recognized in early multiprocessor designs. Initial solutions were brute-force: no private caches (all caches shared) or write-through with bus broadcasting.

1984: Illinois MESI Protocol. Papamarcos and Patel at the University of Illinois proposed the MESI protocol (ISCA 1984 paper). The "Illinois" protocol distinguished between "Exclusive" (private, clean) and "Shared" (multiple copies exist), allowing a write hit in the Exclusive state to proceed without a bus broadcast, a key performance optimization.

1986 — First snooping commercial system: Sequent Balance, using a modified bus-based snooping protocol.

1988 — Stanford DASH: First large-scale NUMA system with hardware directory coherence, developed at Stanford by Lenoski, Laudon, and others. Demonstrated that directory coherence could scale to 64+ processors.

1992 — SGI Challenge: Commercial snooping bus system scaling to 36 MIPS processors on a high-bandwidth bus. Pushed snooping to its practical limit.

1996 — SGI Origin 2000: First commercial ccNUMA (cache-coherent NUMA) system using directory coherence. 64 nodes, scalable to 1024. NUMA architecture is covered in 07-numa-architecture.md.

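2003: AMD Opteron (first x86 NUMA). HyperTransport interconnect and distributed memory, with an integrated memory controller per socket. Cross-socket coherence was originally broadcast-based (a miss probed all other sockets); a directory-like probe filter ("HT Assist") was added with the six-core Istanbul Opterons in 2009.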

2017 — AMD EPYC Naples (1st gen): 4 dies per socket (MCM: Multi-Chip Module), each die is a coherence domain. Infinity Fabric interconnects dies. Cross-die coherence via directory.

2019 — AMD EPYC Rome (2nd gen EPYC): 8 chiplets (CCDs), dedicated I/O die. Coherence managed across chiplets.

The Coherence Problem: Concrete Example

Initial state: x = 0 in DRAM

   Core 0 cache:            Core 1 cache:
   [invalid — no copy]      [invalid — no copy]

Step 1: Core 0 reads x
   Core 0 cache: x = 0 (fetched from DRAM)

Step 2: Core 1 reads x
   Core 1 cache: x = 0 (fetched from DRAM)

   Core 0 cache: x = 0      Core 1 cache: x = 0
                 ↑                         ↑
                 Both hold copies of x

Step 3: Core 0 writes x = 42
   Core 0 cache: x = 42 (write-back cache — not yet in DRAM)
   Core 1 cache: x = 0  ← STALE! Core 1 has wrong value

Step 4: Core 1 reads x
   Core 1 returns x = 0 ← INCORRECT (should be 42)

WITHOUT coherence: Core 1 can keep reading the stale x = 0
  indefinitely, so programs that share data silently break.
WITH coherence: the protocol ensures Step 3 invalidates
  Core 1's copy, so Step 4 fetches x = 42 correctly.
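
The four steps above can be replayed in a few lines of C++. This is a minimal sketch of a toy model, not a real cache: the names ToyCache, ToySystem, and replay_steps are illustrative, and "coherence" is reduced to a single rule (a write invalidates the other core's copy, and a later miss is served from the writer's cache).

```cpp
#include <cassert>

// Toy model (illustrative names): one variable x, two private write-back caches.
struct ToyCache {
    bool valid = false;  // false = line not present
    int  value = 0;
};

struct ToySystem {
    int      dram_x = 0;
    ToyCache cache[2];
    bool     coherent;  // true: a write invalidates the other core's copy

    explicit ToySystem(bool coherent_) : coherent(coherent_) {}

    int read(int core) {
        assert(core == 0 || core == 1);
        ToyCache& mine = cache[core];
        if (!mine.valid) {
            // Miss. With coherence, a valid copy in the other cache supplies
            // the data; without it, the (possibly stale) DRAM value is used.
            const ToyCache& other = cache[1 - core];
            mine.value = (coherent && other.valid) ? other.value : dram_x;
            mine.valid = true;
        }
        return mine.value;
    }

    void write(int core, int v) {
        assert(core == 0 || core == 1);
        cache[core].valid = true;
        cache[core].value = v;                        // write-back: DRAM untouched
        if (coherent) cache[1 - core].valid = false;  // invalidate the other copy
    }
};

// Replays Steps 1-4 and returns the value Core 1 reads in Step 4.
inline int replay_steps(bool coherent) {
    ToySystem sys(coherent);
    sys.read(0);         // Step 1
    sys.read(1);         // Step 2
    sys.write(0, 42);    // Step 3
    return sys.read(1);  // Step 4
}
```

Under these assumptions, replay_steps(false) returns the stale 0 and replay_steps(true) returns 42.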

Snooping-Based Coherence

How Snooping Works

All L1/L2 caches are connected to a shared coherence interconnect (originally a physical bus; modern CPUs use ring buses or mesh fabrics that support logical broadcast). Each cache monitors all transactions on this interconnect.

    Core 0    Core 1    Core 2    Core 3
      │          │         │         │
   ┌──▼──┐    ┌──▼──┐   ┌──▼──┐   ┌──▼──┐
   │ L1  │    │ L1  │   │ L1  │   │ L1  │
   │ L2  │    │ L2  │   │ L2  │   │ L2  │
   └──┬──┘    └──┬──┘   └──┬──┘   └──┬──┘
      │          │         │         │
   ═══╪══════════╪═════════╪═════════╪═══ Snooping Ring/Bus
      │          │         │         │
   ┌──▼──────────▼─────────▼─────────▼──┐
   │              Shared L3              │
   │    (all transactions visible here)  │
   └────────────────────────────────────┘

Each cache controller monitors all transactions on the ring.
When Core 0 reads a line, every other cache sees this request.
When Core 1 writes a line, every other cache sees the write.

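Interconnect bandwidth is the scaling bottleneck: every coherence transaction is broadcast. With 8 cores each generating frequent writes, a shared bus saturates. This limits pure broadcast snooping to approximately 8-16 cores per coherence domain; designs with more cores rely on snoop filtering (for example, tracking sharers alongside an inclusive L3) to avoid probing every cache.

Intel's Ring Bus (introduced with Sandy Bridge; used on Xeon server parts through Broadwell): a bidirectional ring connects all cores and the shared L3 cache slices. Transactions propagate around the ring, where agents can snoop them. The largest ring-based Xeons reached 24 cores by bridging two rings (Broadwell-EX, e.g., Xeon E7-8890 v4).

Intel Mesh Fabric (Skylake-SP Xeon and later): replaced the ring bus with an N×M mesh of cores and L3 slices. The coherence approach is the same in principle, but with shorter point-to-point distances and more aggregate bandwidth.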

MESI Protocol

MESI is the basis for most invalidation-based coherence protocols in use today. Each cache line in each cache is in one of four states:

  ┌──────────────────────────────────────────────────────────────┐
  │ State     │ Valid │ Dirty │ Shared │ Description             │
  ├───────────┼───────┼───────┼────────┼─────────────────────────┤
  │ Modified  │  YES  │  YES  │   NO   │ Only copy, dirty        │
  │ Exclusive │  YES  │   NO  │   NO   │ Only copy, clean        │
  │ Shared    │  YES  │   NO  │  YES   │ Multiple clean copies   │
  │ Invalid   │   NO  │  N/A  │  N/A   │ Not present / invalid   │
  └───────────┴───────┴───────┴────────┴─────────────────────────┘

MESI State Machine

                        ┌────────────────────────────────────┐
                        │             MESI                   │
                        │          State Machine              │
                        └────────────────────────────────────┘

  Events:
    PrRd  = Processor (local core) Read
    PrWr  = Processor Write
    BusRd = Snooped bus Read (another core is reading)
    BusRdX= Snooped bus ReadExclusive (another core is writing)
    BusUpgr= Snooped bus Upgrade (another core upgrades S→M; no data transfer)
    Flush = Write dirty line back to memory

  Processor-initiated edges (local event / bus action):

    I ──► E   PrRd miss / BusRd, no other cache holds the line (fetch from memory)
    I ──► S   PrRd miss / BusRd, another cache holds the line
    I ──► M   PrWr miss / BusRdX, other copies are invalidated
    E ──► M   PrWr hit / no bus transaction needed
    S ──► M   PrWr hit / BusUpgr, other sharers go S → I
    (PrRd hits in M, E, or S and PrWr hits in M leave the state unchanged)

  Snoop-initiated edges (snooped event / action):

    E ──► S   BusRd from another core / keep clean copy
    E ──► I   BusRdX from another core / drop copy
    S ──► I   BusRdX or BusUpgr from another core / drop copy
    M ──► S   BusRd from another core / Flush dirty data, keep Shared copy
    M ──► I   BusRdX from another core / Flush dirty data, go Invalid

MESI Transition Table

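┌───────────────┬───────────────┬────────────────────────────────┬────────────────────────────────────┐
│ Current State │ Event         │ Action                         │ New State                          │
├───────────────┼───────────────┼────────────────────────────────┼────────────────────────────────────┤
│ Invalid       │ PrRd miss     │ Fetch from mem/cache           │ Exclusive (if only copy) or Shared │
│ Invalid       │ PrWr miss     │ Fetch, invalidate others       │ Modified                           │
│ Exclusive     │ PrRd hit      │ Return data                    │ Exclusive                          │
│ Exclusive     │ PrWr hit      │ Update locally (free!)         │ Modified                           │
│ Exclusive     │ Snoop BusRd   │ Transition, keep clean copy    │ Shared                             │
│ Exclusive     │ Snoop BusRdX  │ Invalidate, supply data        │ Invalid                            │
│ Shared        │ PrRd hit      │ Return data                    │ Shared                             │
│ Shared        │ PrWr hit      │ Send BusUpgr, others → Invalid │ Modified                           │
│ Shared        │ Snoop BusRd   │ None (copy stays valid)        │ Shared                             │
│ Shared        │ Snoop BusRdX  │ Invalidate                     │ Invalid                            │
│ Shared        │ Snoop BusUpgr │ Invalidate                     │ Invalid                            │
│ Modified      │ PrRd hit      │ Return data                    │ Modified                           │
│ Modified      │ PrWr hit      │ Update locally                 │ Modified                           │
│ Modified      │ Snoop BusRd   │ Flush to mem/cache, supply     │ Shared                             │
│ Modified      │ Snoop BusRdX  │ Flush, invalidate              │ Invalid                            │
└───────────────┴───────────────┴────────────────────────────────┴────────────────────────────────────┘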

The Exclusive state is the key optimization: a line is exclusively held (no other cache has it) and clean (matches memory). A write to an Exclusive line transitions directly to Modified — no bus transaction required. This avoids the broadcast overhead for lines that are effectively private despite being technically shared-accessible.
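
The transition rules can be exercised with a small simulator. The following is a sketch under simplifying assumptions: it tracks a single line across n private caches, counts bus transactions, and ignores timing, data values, and evictions. The class name MesiLine and its counters are hypothetical and not taken from any real simulator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Mesi { I, S, E, M };

// One cache line tracked across n private caches on a snooping bus.
class MesiLine {
public:
    explicit MesiLine(std::size_t n) : state_(n, Mesi::I) {}

    void read(std::size_t core) {
        assert(core < state_.size());
        if (state_[core] != Mesi::I) return;       // PrRd hit: no bus traffic
        ++bus_rd_;                                  // PrRd miss: issue BusRd
        bool others_have_copy = false;
        for (std::size_t i = 0; i < state_.size(); ++i) {
            if (i == core || state_[i] == Mesi::I) continue;
            others_have_copy = true;
            if (state_[i] == Mesi::M) ++flushes_;   // snooped BusRd in M: Flush
            state_[i] = Mesi::S;                    // M, E, S all end in S
        }
        state_[core] = others_have_copy ? Mesi::S : Mesi::E;
    }

    void write(std::size_t core) {
        assert(core < state_.size());
        switch (state_[core]) {
            case Mesi::M: return;                          // hit, nothing to do
            case Mesi::E: state_[core] = Mesi::M; return;  // silent E -> M
            case Mesi::S: ++bus_upgr_; break;              // BusUpgr
            case Mesi::I: ++bus_rdx_;  break;              // BusRdX
        }
        for (std::size_t i = 0; i < state_.size(); ++i) {
            if (i == core) continue;
            if (state_[i] == Mesi::M) ++flushes_;   // dirty copy is flushed first
            state_[i] = Mesi::I;                    // every other copy is dropped
        }
        state_[core] = Mesi::M;
    }

    Mesi state(std::size_t core) const { return state_[core]; }
    int bus_rd() const   { return bus_rd_; }
    int bus_rdx() const  { return bus_rdx_; }
    int bus_upgr() const { return bus_upgr_; }
    int flushes() const  { return flushes_; }

private:
    std::vector<Mesi> state_;
    int bus_rd_ = 0, bus_rdx_ = 0, bus_upgr_ = 0, flushes_ = 0;
};
```

With two cores, the sequence read(0), write(0) leaves core 0 in Modified after a single BusRd and no further bus traffic, which is the Exclusive-state optimization described above. A following read(1) forces one flush and leaves both cores in Shared.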

MOESI (AMD Extension)

AMD CPUs extend MESI with an Owned state:

  Owned (O):  Line is dirty AND shared with other caches
              This cache is "owner" — responsible for supply
              on BusRd, and for write-back on eviction

  Without Owned (standard MESI):
    M → S transition (on snoop BusRd):
      Must flush dirty data to DRAM first, then all caches go to Shared.
      Bus transaction + DRAM write required.

  With Owned:
    M → O transition (on snoop BusRd):
      Cache supplies dirty data DIRECTLY to requesting cache (cache-to-cache transfer).
      No DRAM write needed.
      Memory is stale, but Owner is responsible for eventual writeback.

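MOESI advantage: eliminates the "write to DRAM, then read from DRAM" round trip for Modified-to-Shared transitions. The dirty line reaches the requester by cache-to-cache transfer, and the DRAM write is deferred until the owner evicts the line, so a line that is repeatedly written by one core and read by others generates no DRAM write traffic in the meantime. AMD's architecture manuals describe MOESI as the coherence model for its x86 processors (Opteron, Ryzen, EPYC).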
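
The difference shows up most clearly in a producer/consumer pattern. The sketch below counts DRAM write-backs when one core writes a line and another core then reads it, repeated for a number of rounds. It is a deliberately reduced model: only the producer's state is tracked, the eventual write-back on eviction is not counted, and the names Line, Traffic, and producer_consumer are hypothetical.

```cpp
#include <cassert>

enum class Line { I, S, E, O, M };

struct Traffic {
    int  dram_writes;   // write-backs that reach DRAM during the rounds
    Line producer_end;  // producer's state after the last round
};

// One producer core writes a line, one consumer core then reads it; repeat.
// has_owned = false models MESI, has_owned = true models MOESI.
inline Traffic producer_consumer(bool has_owned, int rounds) {
    Line producer = Line::I;
    int dram_writes = 0;
    for (int r = 0; r < rounds; ++r) {
        // Producer write: ends in M; any consumer copy is invalidated.
        producer = Line::M;
        // Consumer read: its BusRd is snooped by the producer, which holds M.
        if (has_owned) {
            producer = Line::O;  // supply data cache-to-cache, stay dirty
        } else {
            producer = Line::S;  // flush dirty data to memory, become clean
            ++dram_writes;
        }
    }
    return Traffic{dram_writes, producer};
}
```

In this model, ten rounds cost ten DRAM writes under MESI and none under MOESI; the Owned line still has to be written back once when it is finally evicted.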

Intel MESIF (Intel Extension)

Intel extends MESI with a Forward state:

  Forward (F): A Shared line that is designated as the "responder"
               — it will supply the data to other requesters
               rather than reading from memory.

  When multiple caches hold a Shared line and a third core requests it:
    Without Forward: any or all Shared holders might respond (conflict)
    With Forward: exactly ONE holder (the Forward holder) responds → clean protocol

MESIF was introduced with Intel QPI (QuickPath Interconnect) in Nehalem. It ensures exactly one cache responds for each BusRd transaction on a shared line, preventing duplicate responses.

Directory-Based Coherence

Snooping requires all caches to observe all transactions — fundamentally limited by broadcast bandwidth. For systems with many nodes (NUMA, server-class multi-socket, HPC), directory coherence scales better.

Directory Structure

A directory entry tracks sharing information for each cache line in memory:

Directory Entry per Cache Line (example: 4-node system):
┌─────────────────────────────────────────────────────────────┐
│  State   │  Owner  │  Presence Bits  │  (Other metadata)    │
│  (2 bits)│  (node) │  (1 bit/node)   │                      │
│  U/S/EM  │  0-3    │  [N0 N1 N2 N3]  │                      │
└─────────────────────────────────────────────────────────────┘

State: Uncached (U) / Shared (S) / Exclusive-Modified (EM)
Presence bits: which nodes/caches currently hold this line
Owner: (when Modified) which node has the dirty copy

Directory Protocol Example: Read Miss

Node 1 requests line X, which Node 2 has in Modified state:

Node 1 → Home (directory):  Read Request for line X
Home → Node 2:              Intervention (you have line X dirty, supply it)
Node 2 → Node 1:            Data (line X, dirty copy)
Node 2 → Home:              Sharing writeback (data + acknowledgment; Node 2 downgrades to Shared)
Home:                        Update directory: State=Shared, Presence={N1, N2}

Critical path: 3 hops (request + intervention + data); the writeback to Home proceeds off the critical path. Compare to snooping: one broadcast request followed by a direct response. Directory trades hop count for scalability.
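
A home node's handling of this read miss can be sketched as follows. Several things here are assumptions rather than part of any specific protocol: the message names are illustrative, the previous owner is assumed to keep a Shared copy after supplying the data (as in DASH-style protocols), and a read to an Uncached line is granted Shared rather than Exclusive to keep the code short.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kNodes = 4;  // matches the 4-node example entry

enum class DirState { Uncached, Shared, Modified };

struct DirEntry {
    DirState            state = DirState::Uncached;
    std::bitset<kNodes> presence;   // bit n set: node n holds the line
    std::size_t         owner = 0;  // meaningful only in Modified
};

inline std::string node(std::size_t n) { return "N" + std::to_string(n); }

// Home-node handling of a read miss. Returns the protocol messages in order
// and updates the directory entry.
inline std::vector<std::string> read_miss(DirEntry& e, std::size_t requester) {
    assert(requester < kNodes);
    std::vector<std::string> msgs;
    msgs.push_back(node(requester) + " -> Home: ReadReq");
    if (e.state == DirState::Modified) {
        msgs.push_back("Home -> " + node(e.owner) + ": Intervention");
        msgs.push_back(node(e.owner) + " -> " + node(requester) + ": Data");
        msgs.push_back(node(e.owner) + " -> Home: SharingWriteback");
        // Assumption: the old owner keeps a Shared copy, so its bit stays set.
    } else {
        msgs.push_back("Home -> " + node(requester) + ": Data");
    }
    e.state = DirState::Shared;
    e.presence.set(requester);
    return msgs;
}
```

Starting from an entry with state Modified, owner 2, and only bit 2 set, read_miss(entry, 1) produces four messages and leaves the entry Shared with bits 1 and 2 set.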

Scalability: Why Directory Wins at Scale

Snooping: every write = broadcast to ALL nodes
  N nodes, W writes/second:
  Traffic = W × N (every write goes to every node)
  Grows as O(N × W) — unsustainable for large N

Directory: every write = point-to-point to sharing nodes
  N nodes, typical sharing degree k << N:
  Traffic = W × k (only write to nodes that have the line)
  Grows as O(W × k) — k is typically 2-4 even on 1000-node systems
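
As a worked instance of the two formulas, the sketch below plugs in numbers. The function names are hypothetical, and the model counts one message per write per contacted node, exactly as in the text, with no allowance for acknowledgments or retries.

```cpp
#include <cassert>

// Coherence messages per second under the two models in the text.
constexpr double snoop_traffic(double writes_per_sec, double nodes) {
    return writes_per_sec * nodes;    // every write reaches every node
}

constexpr double directory_traffic(double writes_per_sec, double sharers) {
    return writes_per_sec * sharers;  // only current sharers are contacted
}
```

For example, at one million writes per second on 1024 nodes with a sharing degree of 4, the snooping model generates 256 times the traffic of the directory model (1024 / 4).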

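For systems with hundreds or thousands of nodes (SGI UV and other large ccNUMA machines), directory coherence is mandatory. Machines such as IBM Blue Gene sidestep the problem instead: they do not keep caches coherent across nodes in hardware and rely on message passing.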

Directory Scaling Challenge

The directory itself must have an entry for every cacheable memory line. For 1 TB of memory at 64 bytes/line: 16 billion directory entries. Even at 4 bytes each, that's 64 GB of directory storage — clearly impractical.
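
The arithmetic can be checked at compile time. This sketch assumes binary units (1 TB taken as 2^40 bytes), which is what makes the result come out to exactly 2^34 entries (about 17 billion, the "16 billion" above in units of 2^30) and 64 GiB. The helper names are illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Directory entries needed to cover mem_bytes with one entry per cache line.
constexpr std::uint64_t dir_entries(std::uint64_t mem_bytes,
                                    std::uint64_t line_bytes) {
    return mem_bytes / line_bytes;
}

// Total directory storage for a full (non-sparse) directory.
constexpr std::uint64_t dir_bytes(std::uint64_t mem_bytes,
                                  std::uint64_t line_bytes,
                                  std::uint64_t entry_bytes) {
    return dir_entries(mem_bytes, line_bytes) * entry_bytes;
}

constexpr std::uint64_t kTiB = 1ULL << 40;
constexpr std::uint64_t kGiB = 1ULL << 30;

static_assert(dir_entries(kTiB, 64) == (1ULL << 34), "2^34 entries");
static_assert(dir_bytes(kTiB, 64, 4) == 64 * kGiB, "64 GiB of directory");
```

That is a 6.25% storage overhead (4 bytes of directory per 64 bytes of data), before counting the bandwidth needed to read and update it.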

Solutions:
- Sparse directories: Only track lines that are actually cached
- Distributed directories: Partition directory across nodes (each node manages its local memory's directory)
- Limited pointer directories: Store only K pointers per entry (K=4), use broadcast for "overflow" cases
- Hash-based placement: home node for line X = hash(X) % N, distributes directory load
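
Two of these ideas are small enough to sketch directly. The code below is illustrative only: LimitedPtrEntry and home_node are hypothetical names, the entry falls back to "broadcast to everyone" once more than K sharers appear, and a plain modulo on the line number stands in for a real hash function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPointers  = 4;   // K in the text
constexpr std::size_t kLineBytes = 64;  // assumed cache line size

// Limited-pointer directory entry: exact sharer list up to K nodes.
struct LimitedPtrEntry {
    std::array<std::uint16_t, kPointers> sharer{};
    std::size_t count    = 0;
    bool        overflow = false;  // true: sharer set is no longer exact

    void add_sharer(std::uint16_t n) {
        if (overflow) return;
        for (std::size_t i = 0; i < count; ++i)
            if (sharer[i] == n) return;  // already recorded
        if (count == kPointers) { overflow = true; return; }
        sharer[count++] = n;
    }

    // Invalidations sent when a node that is not itself a sharer writes.
    std::size_t invalidation_targets(std::size_t total_nodes) const {
        return overflow ? total_nodes - 1 : count;
    }
};

// Home node for a physical address: line number modulo node count.
inline std::size_t home_node(std::uint64_t phys_addr, std::size_t nodes) {
    return static_cast<std::size_t>((phys_addr / kLineBytes) % nodes);
}
```

With four recorded sharers a write needs four invalidations; a fifth sharer sets the overflow flag, after which a write on a 16-node system must contact all 15 other nodes.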

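Modern NUMA systems use distributed directories: the directory state for an address lives with the memory controller (home agent) that owns that address range. On AMD EPYC, that logic sat on each die in Naples and moved to the central I/O die from Rome onward, because that is where the memory controllers are.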

Coherence vs Consistency Distinction

This is a common source of confusion. They are related but distinct:

Cache Coherence (per-address contract):
- Guarantees that all processors eventually see the same value for any single memory location
- A coherence protocol like MESI is the implementation
- Coherence is necessary but not sufficient for correct shared-memory programs

Memory Consistency (system-wide ordering):
- Defines the allowed orderings of memory operations across DIFFERENT addresses
- Examples:
  - Sequential Consistency (SC): all operations appear in some total order respecting per-thread order
  - Total Store Order (TSO / x86 memory model): stores can be delayed (store buffer), loads can bypass stores to different addresses; only store-load reordering allowed
  - Relaxed Consistency (ARM, POWER): many orderings allowed, must use explicit barriers

  Example showing coherence ≠ consistency:

  Thread 0:              Thread 1:
    store x = 1            store y = 1
    load r0 = y            load r1 = x

  All four outcomes are possible under TSO (x86):
    r0=1, r1=1   (both stores visible before both loads)
    r0=1, r1=0   (Thread 0's store delayed, T1's visible first)
    r0=0, r1=1   (Thread 1's store delayed, T0's visible first)
    r0=0, r1=0   (ALLOWED ON TSO! Both stores delayed in store buffers)

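  Note: r0=0, r1=0 is forbidden by Sequential Consistency but allowed by TSO.
  Coherence (MESI) ensures that x=1 and y=1 EVENTUALLY become visible everywhere.
  It says nothing about whether a load may complete while the same core's
  earlier store to a different address is still in its store buffer; that is
  the consistency model's job, and TSO permits it.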

x86 provides TSO semantics (Total Store Order). ARM/RISC-V provide a more relaxed model. Java and C++ memory models define happens-before relationships that map to hardware memory barriers.
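
The set of allowed outcomes can be derived mechanically by enumerating every interleaving of a tiny abstract machine. The sketch below is a single-threaded model, not a hardware test: under "TSO" each thread's store first enters a private store buffer and drains to memory at some later step, while under "SC" stores hit memory immediately. Machine, explore, and litmus are hypothetical names.

```cpp
#include <cassert>
#include <set>
#include <utility>

struct Machine {
    int  mem[2]      = {0, 0};          // mem[0] = x, mem[1] = y
    int  pc[2]       = {0, 0};          // 0: before store, 1: before load, 2: done
    bool buffered[2] = {false, false};  // thread's store is still in its buffer
    int  reg[2]      = {-1, -1};        // r0, r1
};

using Outcomes = std::set<std::pair<int, int>>;

// Depth-first search over every possible next step of either thread.
// Thread t stores 1 to mem[t], then loads mem[1 - t] into reg[t].
inline void explore(Machine m, bool tso, Outcomes& out) {
    if (m.pc[0] == 2 && m.pc[1] == 2) {
        out.insert(std::make_pair(m.reg[0], m.reg[1]));
        return;
    }
    for (int t = 0; t < 2; ++t) {
        if (m.buffered[t]) {           // step: drain the store buffer to memory
            Machine n = m;
            n.mem[t] = 1;
            n.buffered[t] = false;
            explore(n, tso, out);
        }
        if (m.pc[t] == 0) {            // step: execute the store
            Machine n = m;
            n.pc[t] = 1;
            if (tso) n.buffered[t] = true; else n.mem[t] = 1;
            explore(n, tso, out);
        } else if (m.pc[t] == 1) {     // step: execute the load from memory
            Machine n = m;             // (different address, so no forwarding)
            n.pc[t] = 2;
            n.reg[t] = n.mem[1 - t];
            explore(n, tso, out);
        }
    }
}

inline Outcomes litmus(bool tso) {
    Outcomes out;
    explore(Machine{}, tso, out);
    return out;
}
```

In this model litmus(false) yields three outcomes and never (0, 0), while litmus(true) yields all four. On real x86 hardware, placing a full fence (for example a sequentially consistent atomic fence) between each thread's store and load removes the (0, 0) outcome again.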

False Sharing

False sharing is a coherence performance pathology where two threads write to different data that happens to reside in the same cache line.

// Problematic: counter0 and counter1 are adjacent, so they almost always
// land on the same 64-byte cache line
#include <stdint.h>

struct {
    volatile uint64_t counter0;
    volatile uint64_t counter1;
} shared_counters;

// Thread 0 increments counter0 at high frequency
// Thread 1 increments counter1 at high frequency

Despite threads accessing logically independent data, every write to counter0 invalidates the cache line in Thread 1's core (because the cache line contains both), and vice versa. The cache line ping-pongs between cores via the coherence protocol.

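Typical cost: a coherence round trip for a contended line is on the order of tens of nanoseconds within a socket (through the L3 and on-die interconnect) and roughly 100-300 ns across sockets. If the loop body takes 2 ns and incurs false sharing on each iteration, a slowdown of 10-100x is possible.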

Diagnosis with `perf c2c`

# Record cache-to-cache traffic, sampling loads that take at least 30 cycles
perf c2c record -F 100 --ldlat=30 -a -- ./multithreaded_program
perf c2c report -NN --stdio

# Illustrative, heavily abridged output (the real report has many more columns):
# =================================================
# Trace Event Information
# =================================================
# HITM (Hit Modified):
#   Idx  Object                 Symbol             Shared Cache Line Data
#    0   libpthread             (unnamed)           0xffff880000000000 Hitm=95234
#
# Indicates: 95234 "hit modified" events on this cache line
# (Thread loaded a line that another thread had in Modified state)

Fix: Padding to Cache Line Boundary

// Solution 1: explicit padding
struct {
    volatile uint64_t counter0;
    uint8_t pad0[64 - sizeof(uint64_t)];  // pad to 64 bytes
    volatile uint64_t counter1;
    uint8_t pad1[64 - sizeof(uint64_t)];
} shared_counters;

// Solution 2: alignas (C++11)
struct alignas(64) Counter {
    volatile uint64_t value;
};
Counter counter0, counter1;

// Solution 3: per-NUMA-node or per-core private counters
// (reduce sharing entirely; best for uncontested hot counters)
// Each row is 8 x 8 bytes = one 64-byte line; core i uses only per_core_counter[i][0].
alignas(64) uint64_t per_core_counter[MAX_CORES][8];
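
Whether a layout actually separates two fields can be verified at compile time rather than by inspection. The sketch below assumes a 64-byte line (true on current x86 parts, not universal) and uses hypothetical names: same_line, Packed, and Padded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineBytes = 64;  // assumed cache line size

// True if byte offsets a and b, measured from a line-aligned base address,
// fall in the same cache line.
constexpr bool same_line(std::size_t a, std::size_t b) {
    return a / kLineBytes == b / kLineBytes;
}

struct Packed {
    std::uint64_t counter0;
    std::uint64_t counter1;
};

struct Padded {
    alignas(kLineBytes) std::uint64_t counter0;
    alignas(kLineBytes) std::uint64_t counter1;
};

static_assert(same_line(offsetof(Packed, counter0), offsetof(Packed, counter1)),
              "Packed counters share a line when the struct is line-aligned");
static_assert(!same_line(offsetof(Padded, counter0), offsetof(Padded, counter1)),
              "Padded counters must sit on different lines");
static_assert(sizeof(Padded) == 2 * kLineBytes, "one full line per counter");
```

Putting alignas on each member also raises the alignment of the struct itself to 64, so the compiler guarantees that every Padded object starts on a line boundary. That is something the explicit-padding variant does not guarantee on its own.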

Other common false-sharing patterns:

Lock + data in same line: A mutex placed adjacent to the data it protects, in the same struct. Every lock acquisition modifies the mutex and therefore invalidates the whole line, data included, in other caches.
Producer flag + consumer buffer pointer: struct { bool ready; uint8_t *buf; } puts the flag and the pointer on the same line. When the producer sets ready, it invalidates the line in the consumer's cache, so the consumer's subsequent read of buf misses.
Reference counting: shared_ptr's control block holds both the strong count and the weak count, typically on the same line. Threads that copy or destroy shared_ptr instances (updating the strong count) false-share with threads that copy or destroy weak_ptr instances (updating the weak count).

Debugging Notes

# Check coherence-related performance events (Intel)
perf stat -e \
  mem_load_retired.l1_hit,\
  mem_load_retired.l2_hit,\
  mem_load_retired.l3_hit,\
  mem_load_retired.l3_miss,\
  machine_clears.memory_ordering \
  ./program

# machine_clears.memory_ordering: counts pipeline clears due to
# memory ordering issues (store-to-load forwarding violations, coherence conflicts)

# Intel PCM (Processor Counter Monitor)
pcm-memory.x   # per-socket memory bandwidth including coherence traffic
pcm.x          # per-core coherence metrics

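# Note: the Linux kernel has no tracepoints for hardware coherence traffic
# (coherence is handled below the level software can trace); sample memory
# accesses through the PMU instead
perf mem record ./program
perf mem report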

Security Implications

Coherence traffic as a side channel: The timing of coherence invalidations is observable. If Core 0 writes to a shared line, Core 1's next access to that line will be slower (its copy was invalidated and must be re-fetched). This difference is measurable. Flush+Reload-style attacks rely on the same kind of timing difference on shared lines, and coherence-state attacks distinguish states such as Exclusive and Shared by access latency.
Coherence-traffic denial of service: An attacker with local code execution can deliberately generate large volumes of coherence traffic (by repeatedly writing to a shared cache line from multiple cores), degrading performance of co-located tenants. Interconnects generally provide no authentication or rate limiting for coherence traffic.
Rowhammer and cache flushing: Rowhammer attacks cause bit flips by repeatedly activating DRAM rows. Because caches would otherwise absorb repeated accesses, attackers flush the line (clflush) after each access so that the next access goes to DRAM. clflush is itself a coherence operation: it evicts the line from every cache in the coherence domain, writing it back first if it is dirty.

Performance Implications

Cache Line Ping-Pong

A cache line that is written by two cores alternately incurs a coherence round-trip on every write:

Core 0 writes → line in M state at Core 0
Core 1 writes → BusRdX: Core 0 flushes to L3 (or DRAM), Core 1 gets line in M
Core 0 writes → BusRdX: Core 1 flushes, Core 0 gets line in M
...

Each transition: tens of ns within a socket, up to a few hundred ns across sockets
If each write takes 1 ns computation but 100 ns coherence: 100x slowdown

Optimal case: data that is read by many cores but written by only one. Read-mostly data in Shared state costs no coherence overhead.

Lock-free data structures: carefully designed to minimize cache line sharing. Lock-free queues (e.g., Michael-Scott queue) use separate "head" and "tail" cache lines for producer and consumer to avoid ping-pong.

Modern Usage

Coherence in Intel Multi-Socket Systems

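Intel Xeon multi-socket systems use UPI (Ultra Path Interconnect, successor to QPI). Each socket's home agents snoop or track lines cached by the other sockets via UPI. The protocol is a variant of MESIF with Intel-specific extensions for cross-socket coherence, including directory state that reduces the number of cross-socket snoops.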

AMD Infinity Fabric Coherence

AMD Zen 3/4 EPYC uses Infinity Fabric for both intra-socket (CCD-to-CCD) and inter-socket coherence. The coherence granularity is at the CCD level — each CCD manages coherence for its attached cache lines. See 07-numa-architecture.md for Infinity Fabric topology.

ARM DSU (DynamIQ Shared Unit)

ARM DynamIQ clusters share a single L3 via the DSU, and one cluster can mix core types (e.g., Cortex-X4 and Cortex-A520 in the same cluster). Coherence within the cluster uses a variant of MOESI. Coherence beyond the cluster (other CPU clusters, GPU, NPU, DMA engines) uses AMBA ACE or AMBA CHI (Coherent Hub Interface); CHI is a directory-style protocol designed for SoC integration with many heterogeneous agents. Older big.LITTLE designs placed big and little cores in separate clusters joined by a coherent interconnect.

CXL (Compute Express Link) Coherence

CXL extends CPU cache coherence over a PCIe 5.0 (or later) physical layer to attached devices (FPGAs, smart NICs, memory expansion cards). Through the CXL.cache protocol, a Type 1 or Type 2 device can cache host memory coherently using MESI states, so the CPU and device share a coherent memory space. Through CXL.mem, the CPU can cache lines that live in device-attached memory (Type 2 devices and Type 3 memory expanders) like ordinary DRAM, without explicit software invalidation.

Future Directions

CXL pooled memory with coherence: CXL 3.0 allows multiple hosts to share a coherent memory pool — effectively extending cache coherence across servers over a fabric. First products shipping 2024-2025.
GPU-CPU unified coherence: Apple M-series achieves GPU-CPU cache coherence via the unified memory architecture (all dies share the SLC). AMD and Intel are expanding GPU-CPU coherence via their respective fabric interconnects.
Heterogeneous coherence (AMBA CHI): As SoCs integrate more accelerators, coherence protocols must handle heterogeneous agents with different cache topologies, access patterns, and consistency requirements. AMBA CHI targets this case, with node addressing sized for SoCs that contain very large numbers of agents.
Near-data coherence (PIM — Processing in Memory): With compute inside DRAM (Samsung HBM-PIM, SK Hynix GDDR6-AiM), coherence between in-DRAM compute and CPU caches must be managed. Current designs bypass CPU coherence entirely — future designs aim for transparent coherent PIM.
Persistent memory coherence: NVDIMM/Optane required extending coherence to ensure dirty cache lines are made persistent on power failure. CLWB (Cache Line Write Back) and CLFLUSHOPT instructions add explicit persistence points to the coherence protocol.

Exercises

MESI trace: Given 3 cores and the following sequence of reads/writes, trace the MESI state of the cache line containing variable x at each step. Identify all bus transactions generated.
Core 0: Write x = 1
Core 1: Read x
Core 2: Read x
Core 0: Write x = 2
Core 1: Write x = 3
Core 2: Read x
False sharing benchmark: Implement two versions of a parallel sum:
Version A: two threads share a struct { uint64_t sum0, sum1; } (same cache line)
Version B: two threads have struct { uint64_t sum0; uint8_t pad[56]; uint64_t sum1; } (separate lines)
Measure wall clock time for 1 billion iterations each. Use perf stat -e cache-misses,LLC-store-misses to quantify coherence traffic.
MESI state transitions in hardware: Write a multi-threaded program that exercises each MESI transition:
M→S: Write a line, then have another thread read it
E→M: Read a line (no other copies), then write it
S→I: Have three threads read a line, then one write it
Verify transitions with perf stat -e mem_load_retired.l1_miss before and after each step.
Directory protocol design: Design a directory entry for a 16-node NUMA system. What is the minimum number of bits required per directory entry? Consider: presence vector, state, owner. For 64 GB of main memory at 64-byte lines, what is the total directory storage requirement?
CXL coherence research: Read the CXL 2.0 specification sections on cache coherence (Section 3.4+). Describe how a CXL Type 2 device participates in CPU cache coherence. What MESI states can a CXL device hold? How does cache invalidation work when the CXL device modifies a cache line?

References

Papamarcos, M. S., & Patel, J. H. (1984). A Low Overhead Coherence Solution for Multiprocessors with Private Cache Memories. ISCA 1984, 348–354.
Censier, L. M., & Feautrier, P. (1978). A New Solution to Coherence Problems in Multicache Systems. IEEE Transactions on Computers, C-27(12), 1112–1118.
Lenoski, D., et al. (1992). The Stanford Dash Multiprocessor. IEEE Computer, 25(3), 63–79.
Sorin, D. J., Hill, M. D., & Wood, D. A. (2011). A Primer on Memory Consistency and Cache Coherence. Morgan & Claypool. (Free PDF available)
Martin, M. M. K., et al. (2005). Multifacet's General Execution-driven Multiprocessor Simulator (GEMS) Toolset. ACM SIGARCH Computer Architecture News, 33(4), 92–99.
AMD Corporation. (2021). AMD EPYC 7003 Series Processor Architecture. White Paper.
Intel Corporation. (2022). Intel Xeon Scalable Processor Family Technical Overview. https://www.intel.com/content/www/us/en/developer/articles/technical/xeon-processor-scalable-family-technical-overview.html
CXL Consortium. (2023). Compute Express Link Specification Revision 3.1. https://www.computeexpresslink.org/