Memory Controllers and DRAM
Technical Overview
DRAM (Dynamic Random Access Memory) remains the dominant main memory technology in servers, desktops, and mobile systems due to its combination of high density and relatively low cost. However, the "DRAM gap" (the widening difference between CPU compute rates and DRAM access latency) is a defining performance challenge of modern computing. DRAM latency has improved by only ~3× in 30 years while CPU clock speeds increased ~1000× and DRAM bandwidth ~100×. Modern memory controllers (integrated into the CPU die since AMD's Athlon 64 in 2003 and Intel's Nehalem in 2008) must extract maximum bandwidth from the physical DRAM while hiding latency through request pipelining, prefetching, and channel interleaving.
Prerequisites
- Understanding of CPU memory hierarchy (L1/L2/L3 caches)
- Basic digital design: flip-flops, row/column decoders, sense amplifiers
- Familiarity with DDR signaling and timing parameters
- Understanding of ECC (Error-Correcting Code) concepts
- Familiarity with Linux memory management (physical pages, NUMA)
Core Content
DRAM Architecture: From Cells to Banks
DRAM cell: A single bit is stored as charge in a capacitor, with a transistor acting as an access gate. Charge leaks over time (hence "Dynamic"—must be refreshed every 64 ms). Reading is destructive: reading the capacitor discharges it; the sense amplifier must rewrite the value.
DRAM organization hierarchy:
DIMM (Dual Inline Memory Module)
└── Rank (one or two per DIMM typical; the chips of a rank operate in lockstep, and ranks are selected independently)
    └── Bank Group (DDR5: 8 bank groups per x4/x8 device, 4 per x16 device)
        └── Bank (DDR5: up to 4 banks per bank group)
            └── Row (8192–65536 columns wide)
                └── Column (1 bit)
DRAM bank architecture:
DRAM Bank (simplified):
            ┌───────────────────────────────────────┐
 Rows:      │ c c c c c c c c c c c c c c c c c c c │  Row 0
 2^15       │ c c c c c c c c c c c c c c c c c c c │  Row 1
 rows       │ c c c c c c c c c c c c c c c c c c c │  Row 2
            │                  ...                  │  ...
            │ c c c c c c c c c c c c c c c c c c c │  Row 32767
            └───┬───────────────────────────────────┘
 Row Addr ──▶ Row Address Decoder (row select)
                │
            ┌───▼────────────────────┐
            │ Row Buffer (Sense Amp) │  (2^12 bits wide = 512 bytes)
            └────────────────┬───────┘
 Column Addr ──▶ Decoder     │
                             ▼
            Data output (4/8/16 bits wide)
DRAM timing parameters (critical for performance):
| Parameter | Symbol | Typical DDR5-6400 (cycles) | Description |
|-----------|--------|----------------------------|-------------|
| CAS Latency | CL | 30 | Cycles from RD command to first data |
| RAS-to-CAS Delay | tRCD | 30 | Cycles from ACT to RD/WR |
| Row Precharge | tRP | 30 | Cycles from PRE to the next ACT |
| Row Active Time | tRAS | 75 | Minimum cycles a row must stay open between ACT and PRE |
| CAS Write Latency | tCWL | 28 | Cycles from WR command to first write data |
DDR5-6400 at CL30-30-30:
- First-access latency (closed bank): tRCD + CL = 30 + 30 = 60 cycles
- At 6400 MT/s (3200 MHz clock, two transfers per clock), 1 cycle = 0.3125 ns
- First-byte latency at the DRAM: 60 × 0.3125 ns = 18.75 ns
- Actual CPU-observed latency is higher, typically 40–80 ns, because it adds memory controller and I/O buffer delays
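The cycle-to-nanosecond arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names `cycle_time_ns` and `first_access_ns` are illustrative and do not come from any DRAM tool.

```python
def cycle_time_ns(transfer_rate_mts):
    """Clock period in ns. DDR moves two transfers per clock, so clock MHz = MT/s / 2."""
    clock_mhz = transfer_rate_mts / 2
    return 1000.0 / clock_mhz


def first_access_ns(transfer_rate_mts, trcd_cycles, cl_cycles):
    """DRAM-internal latency for a read to a closed bank: ACT (tRCD) then RD (CL)."""
    return (trcd_cycles + cl_cycles) * cycle_time_ns(transfer_rate_mts)


if __name__ == "__main__":
    print("tCK at DDR5-6400 (ns):", cycle_time_ns(6400))
    print("First access, CL30-30 (ns):", first_access_ns(6400, 30, 30))
```

The same helpers work for other speed grades, for example `first_access_ns(4800, 40, 40)` for a hypothetical DDR5-4800 CL40 part.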
DRAM Command Sequence
Accessing a closed bank (no open row):
1. ACT (Activate): Open a row
   - Provide the row address with the ACT command (older SDRAM generations signaled this with a dedicated RAS# pin)
   - Row decoder drives the wordline; all capacitors in the row connect to their bitlines
   - Sense amplifiers latch the entire row (row buffer = sense amplifier latch)
   - Duration: tRCD cycles until the row is stable
2. RD/WR (Read/Write): Access a column
   - Provide the column address
   - Column decoder selects bits from the row buffer
   - Duration: CL cycles until data appears at the DRAM output
3. PRE (Precharge): Close the row, prepare for the next row access
   - Deactivates the wordline and equalizes the bitlines
   - Duration: tRP cycles until ready for the next ACT
Data returns tRCD + CL = 60 cycles after ACT. Summing all three steps: tRCD + CL + tRP = 30 + 30 + 30 = 90 cycles (the tRAS minimum of 75 cycles can stretch the time before PRE further)
At DDR5-6400: 90 × 0.3125 ns ≈ 28 ns DRAM-internal time for the full ACT → RD → PRE sequence
Row buffer management policies:
- Open page policy: Leave the row open after an access; subsequent accesses to the same row are "row hits" (skip ACT and PRE). Best for sequential/streaming access.
- Close page policy: Precharge the row immediately after the access, so the next access to that bank starts from a precharged state. Best for random access (avoids row conflicts).
- Adaptive: Modern memory controllers predict row-buffer locality dynamically.
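Row buffer hit vs miss latency:
- Row hit: just RD (CL cycles): 30 × 0.3125 ns ≈ 9.4 ns
- Row miss (same bank, different row open): PRE + ACT + RD = tRP + tRCD + CL = 90 cycles ≈ 28 ns, and the PRE cannot be issued until tRAS has elapsed since the previous ACT
- Bank conflict: the request must wait for the previous operation on that bank to complete before a new ACT can be issued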
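To make the policy trade-off concrete, the following sketch replays a trace of (bank, row) accesses under an open-page or close-page policy and reports the average DRAM-internal latency. It is a simplified model under stated assumptions: timings are the CL30-30-30 example values, the precharge under close-page is assumed to be hidden in the background, and tRAS, queuing, and bus contention are ignored. The names `Timings` and `simulate` are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Timings:
    cl: int = 30          # RD to data, cycles
    trcd: int = 30        # ACT to RD, cycles
    trp: int = 30         # PRE to ACT, cycles
    tck_ns: float = 0.3125


def access_cycles(open_row, row, t):
    """Latency of one access given the row currently open in the bank (None = precharged)."""
    if open_row == row:
        return t.cl                      # row hit
    if open_row is None:
        return t.trcd + t.cl             # bank already precharged
    return t.trp + t.trcd + t.cl         # row conflict: PRE, ACT, RD


def simulate(trace, policy, t=None):
    """Average latency in ns for a list of (bank, row) accesses; policy is 'open' or 'close'."""
    t = t or Timings()
    open_rows = {}
    total = 0
    for bank, row in trace:
        total += access_cycles(open_rows.get(bank), row, t)
        # Open-page keeps the row latched; close-page precharges right after the access.
        open_rows[bank] = row if policy == "open" else None
    return total * t.tck_ns / len(trace)


if __name__ == "__main__":
    streaming = [(0, 5)] * 8             # same row repeatedly
    ping_pong = [(0, 1), (0, 2)] * 2     # two rows fighting over one bank
    for name, trace in (("streaming", streaming), ("ping-pong", ping_pong)):
        print(name, "open:", simulate(trace, "open"), "close:", simulate(trace, "close"))
```

In this model the streaming trace favors open-page while the ping-pong trace favors close-page, which is the behavior an adaptive controller tries to predict.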
DDR5 Specification
DDR5 (JEDEC 2020; first supported by Intel Alder Lake in 2021 and AMD Ryzen 7000 in 2022) introduced significant architectural changes from DDR4:
DDR5 key improvements:
- Speed range: 4800–6400 MT/s in the common JEDEC speed bins; overclocked kits reach 7200–8000+ MT/s.
- Width and sub-channels: 64 data bits per DIMM (same as DDR4), but each DDR5 DIMM is split into two independent 32-bit sub-channels, each with its own command/address bus. This enables higher bank-level parallelism.
- Bank groups: 8 bank groups per x4/x8 device (up from 4 in DDR4), each with up to 4 banks, for 32 banks on 16 Gb and larger devices.
- On-Die ECC (ODECC): Error correction within the DRAM chip (corrects in-array errors before data leaves the chip). Not the same as system ECC (which corrects errors in the memory bus + DRAM).
- PMIC (Power Management IC) on DIMM: DDR5 includes a voltage regulator on the DIMM (fed from a 5 V/12 V supply), reducing motherboard complexity.
- Capacities: 16–64 GB per DIMM standard; 128 GB RDIMM (registered, for servers).
DDR5 bandwidth calculation:
Bandwidth (bytes/s) = transfer rate (transfers/s) × bus width (bits) / 8
DDR5-6400: 6400 MT/s × 64 bits / 8 = 51.2 GB/s per channel
Dual-channel (one DIMM on each of 2 channels): 2 × 51.2 = 102.4 GB/s
Quad-channel: 4 × 51.2 = 204.8 GB/s peak
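The bandwidth formula is easy to parameterize. The sketch below is illustrative (the function name `peak_bandwidth_gbs` is an assumption, not a standard API) and reproduces peak figures quoted in this document for DDR5 and HBM3.

```python
def peak_bandwidth_gbs(transfer_rate_mts, bus_width_bits, channels=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_second = transfer_rate_mts * 1e6 * bus_width_bits / 8 * channels
    return bytes_per_second / 1e9


if __name__ == "__main__":
    print("DDR5-6400, 1 channel :", peak_bandwidth_gbs(6400, 64))
    print("DDR5-6400, 2 channels:", peak_bandwidth_gbs(6400, 64, channels=2))
    print("DDR5-4800, 8 channels:", peak_bandwidth_gbs(4800, 64, channels=8))
    print("HBM3 stack, 1024-bit :", peak_bandwidth_gbs(6400, 1024))
```

These are upper bounds; sustained bandwidth is lower because of refresh, row misses, and read/write bus turnarounds.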
LPDDR5 (Low Power DDR5): Used in mobile devices (Apple M2 and Samsung S23 use LPDDR5 variants). LPDDR5 runs at up to 6400 MT/s with a 16-bit bus per channel, so 4 channels give 4 × 16 = 64 bits. Operating voltage is lower (1.05 V vs DDR5's 1.1 V). Apple M2: 8 channels × 16 bits = 128 bits of LPDDR5-6400, or 102.4 GB/s peak, quoted as 100 GB/s (unified memory, shared by CPU and GPU).
Memory Controller
Location evolution: Intel's memory controller was on the northbridge chip (discrete) until Nehalem (2008, "Integrated Memory Controller" — IMC on CPU die). AMD moved to on-die IMC with Athlon 64 (2003). Benefits: eliminates northbridge latency (10–15 ns), reduces power, enables lower voltage signaling.
Memory controller functions:
1. Command scheduling: Translate memory requests (physical addresses) into DRAM commands (ACT/RD/WR/PRE). The scheduler reorders requests to maximize bank-level parallelism and row-buffer hits.
2. Address mapping: Map each physical address to (rank, bank group, bank, row, column). Interleaving spreads consecutive addresses across banks and channels to maximize parallelism.
3. Refresh management: Issue REFRESH (REF) commands so that every row is refreshed within the retention window (64 ms at normal temperature, 32 ms in the extended temperature range). An all-bank refresh stalls accesses to the rank for roughly 350 ns (tRFC). Modern DRAM offers finer-grained refresh modes that spread this work over shorter, more frequent operations.
4. Power management: DIMM power states (power-down, self-refresh), CKE (Clock Enable) gating.
5. ECC (optional): Encode/decode ECC on writes/reads.
Channel interleaving:
Physical memory address → (channel, rank, bank, row, column) mapping
With 2 channels:
Byte 0–63: Channel 0, Bank 0, Row 0
Byte 64–127: Channel 1, Bank 0, Row 0
Byte 128–191: Channel 0, Bank 0, Row 0 (different column)
...
Sequential 256-byte read: interleaved across 2 channels → 2× bandwidth
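The mapping above can be sketched as plain integer arithmetic. The bit layout here is a hypothetical one chosen to match the example (cache line offset, then channel, then column, then bank, then row); real controllers use vendor-specific and often hashed mappings. The default sizes (128 cache lines per row, 16 banks) are assumptions for illustration.

```python
LINE_BYTES = 64  # one cache line per DRAM burst


def decode_address(addr, channels=2, lines_per_row=128, banks=16):
    """Split a physical address into DRAM coordinates under a simple interleaved layout."""
    line = addr // LINE_BYTES          # drop the byte offset within the cache line
    channel = line % channels          # consecutive lines alternate between channels
    line //= channels
    column = line % lines_per_row      # position of the line inside the open row
    line //= lines_per_row
    bank = line % banks
    row = line // banks
    return {"channel": channel, "bank": bank, "row": row, "column": column}


if __name__ == "__main__":
    for addr in (0, 64, 128, 192):
        print(addr, decode_address(addr))
```

Placing the channel bits just above the line offset is what lets a sequential stream use both channels at once, while keeping column bits below the bank bits preserves row-buffer hits for that stream.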
NUMA (Non-Uniform Memory Access): Multi-socket servers have multiple memory controllers, each attached to local DRAM. AMD EPYC Genoa (2022): up to 12 CCDs and 12 memory channels per socket; in NPS4 mode the socket is exposed as 4 NUMA domains with 3 memory channels each. Remote memory access (across UPI or Infinity Fabric): 150–200 ns vs 80–100 ns for local memory.
HBM (High Bandwidth Memory)
HBM (JEDEC 2013) stacks multiple DRAM dies vertically using TSV (Through-Silicon Via) and a silicon interposer. Used in discrete GPUs (HBM2e on A100, HBM3 on H100), AMD Instinct MI300X, and Intel Xeon Max.
HBM3 (JEDEC 2022):
- 8 stacked DRAM dies per stack (taller stacks are also defined)
- 1024-bit wide bus per stack (vs 64-bit for a DDR5 DIMM)
- Up to 6400 MT/s transfer rate
- Peak bandwidth: 1024 bits × 6400 MT/s / 8 = 819 GB/s per stack
- NVIDIA H100 SXM5: 5 active HBM3 stacks (5120-bit bus) running below the 6400 MT/s maximum, for 3.35 TB/s total bandwidth
- Latency: similar to LPDDR5 (~60–80 ns typical)
HBM architecture:
HBM3 Stack (side view):
┌──────────────────────────────┐
│ Base die (logic die)         │ ← sits on the silicon interposer
│ PHY, TSV interface, test     │
└──────────────────────────────┘
        │ TSV (Through-Silicon Vias, ~5 µm diameter)
┌──────────────────────────────┐
│ DRAM Die 0 (e.g., 16 Gb)     │
└──────────────────────────────┘
        │ TSV
┌──────────────────────────────┐
│ DRAM Die 1 (e.g., 16 Gb)     │
└──────────────────────────────┘
        ... (up to 8 stacked dies)
1024-bit wide interface between base die and DRAM stack; the memory controller sits in the host GPU/CPU and reaches the base die through the interposer
HBM vs GDDR6X vs LPDDR5 vs DDR5:
| Memory | Bandwidth | Latency | Capacity | Cost |
|--------|-----------|---------|----------|------|
| DDR5-6400 | 51 GB/s/channel | 80 ns | 64 GB/DIMM | $$ |
| LPDDR5 | 100 GB/s (Apple M2) | 60 ns | 24 GB (M2) | $$$ |
| GDDR6X | 96 GB/s per chip (32-bit at 24 Gb/s) | 80 ns | 16 GB | $$ |
| HBM2e | 461 GB/s/stack | 70 ns | 32 GB/stack | $$$$ |
| HBM3 | 819 GB/s/stack | 60 ns | 64 GB/stack | $$$$$ |
DRAM Refresh
Every DRAM cell must be refreshed periodically (typically every 64 ms per row at standard temperature). The JEDEC standard requires REF commands such that every row is refreshed within 64 ms.
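Refresh timing:
- tREFI: average interval between REF commands = refresh window / number of REF commands = 64 ms / 8192 ≈ 7.8 µs (the DDR3/DDR4 figure; DDR5 nominally halves both, to a 32 ms window and tREFI ≈ 3.9 µs)
- tRFC: time for one REFRESH to complete ≈ 350–600 ns (depends on DRAM density)
- During tRFC, all banks in the rank are blocked (for an all-bank refresh)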
Impact on latency: At 7.8 µs intervals, a request that arrives just as a REF command is issued must wait up to 600 ns (tRFC). P99 latency spikes of ~1 µs are normal and unavoidable without advanced DRAM refresh modes.
Temperature derating: At temperatures >85°C, refresh interval must halve (32 ms). Data center servers maintain DIMM temperatures <70°C to avoid performance degradation.
Fine Granularity Refresh (FGR): DDR5 optional feature: break each refresh into finer units spread more evenly, reducing peak stall time. Reduces worst-case stall from 600 ns to ~150 ns at the cost of slightly more command bus traffic.
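The refresh numbers in this section follow from two ratios, sketched below. The model assumes requests arrive uniformly at random and that an all-bank refresh blocks the rank for the full tRFC; function names are illustrative.

```python
def refresh_interval_us(window_ms, refresh_commands=8192):
    """tREFI: the refresh window divided evenly among the REF commands."""
    return window_ms * 1000.0 / refresh_commands


def refresh_overhead(trfc_ns, trefi_us):
    """Fraction of time the rank is unavailable because it is refreshing."""
    return trfc_ns / (trefi_us * 1000.0)


def mean_refresh_delay_ns(trfc_ns, trefi_us):
    """Average extra latency per request: P(hit a refresh) x average remaining wait (tRFC/2)."""
    return refresh_overhead(trfc_ns, trefi_us) * trfc_ns / 2


if __name__ == "__main__":
    trefi = refresh_interval_us(64)
    print("tREFI (us):", trefi)
    for trfc in (295, 350, 600):
        print(f"tRFC={trfc} ns: overhead={refresh_overhead(trfc, trefi):.2%}, "
              f"mean added delay={mean_refresh_delay_ns(trfc, trefi):.1f} ns")
```

The mean delay is small, which is why refresh shows up mostly in tail latency (a request that collides with a refresh waits up to tRFC) rather than in averages.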
ECC DRAM
SECDED (Single Error Correction, Double Error Detection): The standard ECC scheme for server DRAM. Adds 8 check bits to every 64-bit word, for a 72-bit bus. This is why ECC DIMMs built from x8 devices have 9 DRAM chips per rank instead of 8.
Operation:
Write: 64-bit data → Hamming encode → 72-bit write to DRAM
Read:  72-bit read → syndrome check → if syndrome ≠ 0:
         1-bit error: correct automatically (log error to MCE log)
         2-bit error: detect, cannot correct, report MCA (Machine Check Architecture)
         Chipkill: multi-bit error if DRAM chip fails entirely → SECDED insufficient
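The syndrome logic is easier to follow on a toy code. The sketch below implements an extended Hamming(8,4) code (4 data bits, 3 Hamming check bits, 1 overall parity bit), which has the same correct-one, detect-two behavior as the (72,64) code used on DIMMs but is not that code. The bit layout and the names `encode` and `decode` are choices made for this illustration.

```python
def encode(nibble):
    """Encode 4 data bits into 8 bits: index 0 = overall parity, 1..7 = Hamming(7,4)."""
    code = [0] * 8
    # Data bits live at the non-power-of-two positions.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        for i in range(1, 8):
            if i != p and (i & p):
                code[p] ^= code[i]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i              # XOR of set positions is 0 for a valid codeword
    parity_error = sum(code) % 2       # 1 means an odd number of bits flipped
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        code[syndrome] ^= 1            # single error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"   # syndrome set but parity even: two bits flipped
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[6] ^= 1
    two_flips = list(one_flip)
    two_flips[2] ^= 1
    print(decode(word), decode(one_flip), decode(two_flips))
```

Three or more flips can alias to a valid single-error syndrome and be miscorrected, which is the weakness that multi-bit Rowhammer attacks on ECC memory rely on.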
Chipkill / SDDC (Single Device Data Correction): A failed x8 DRAM chip can corrupt up to 8 bits of the same 72-bit codeword, far beyond the single-bit error that SECDED can correct. Chipkill extends ECC to tolerate the failure of an entire chip by using symbol-based codes spread across multiple chips. It requires a specific DRAM configuration (typically 18 chips per rank).
On-Die ECC (ODECC, DDR5): Corrects errors within the DRAM array before data reaches the bus. ODECC operates at a finer granularity (per 128-bit DRAM word). Does not provide protection against bus errors. Combined with system ECC, provides two layers of protection.
ECC performance overhead: Negligible (<1% bandwidth, near-zero added latency). ECC memory is standard practice for production server deployments, including the major public clouds (AWS, Azure, GCP).
Rowhammer Attack
Rowhammer (Kim et al., 2014) demonstrated that repeatedly activating a DRAM row (hammering) disturbs the cells in physically adjacent rows, accelerating their charge leakage until bits flip.
Mechanism:
DRAM bank layout (physical rows adjacent in silicon):
Row 1000 ← victim row (not accessed by attacker)
Row 1001 ← aggressor row (repeatedly ACT/PRE by attacker)
Row 1002 ← aggressor row
Hammering Row 1001 and 1002 alternately:
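#include <emmintrin.h>  /* _mm_clflush (x86 SSE2) */

/* row_1001 and row_1002 must map to different rows of the same bank */
void hammer(volatile char *row_1001, volatile char *row_1002) {
    for (long i = 0; i < 1000000; i++) {
        (void)*row_1001;                      // ACT+RD row 1001
        (void)*row_1002;                      // row conflict: PRE, then ACT+RD row 1002
        _mm_clflush((const void *)row_1001);  // evict from cache, force DRAM ACT next iteration
        _mm_clflush((const void *)row_1002);
    }
}
// After ~1M iterations: bit flip probability in row_1000 ≈ 10^-4 to 10^-3 per bit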
Impact: About 139,000 activations within one 64 ms refresh interval were sufficient to cause flips on vulnerable DIMMs. A bit flip in a page table entry can change a read-only page to writable, enabling privilege escalation.
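A quick calculation shows why the ~139,000 threshold is reachable. Each activation of the same bank must be separated by at least tRC = tRAS + tRP, so the refresh window bounds how many activations an attacker can issue. The sketch below uses the example timings from the table earlier in this document (tRAS = 75, tRP = 30 cycles at 0.3125 ns); note that this mixes DDR5 example timings with a flip threshold measured on older DIMMs, so it is an order-of-magnitude illustration only. The function name is hypothetical.

```python
def max_activations(window_ms, tras_cycles, trp_cycles, tck_ns):
    """Upper bound on ACT commands to one bank within a refresh window."""
    trc_ns = (tras_cycles + trp_cycles) * tck_ns   # minimum ACT-to-ACT time, same bank
    return int(window_ms * 1e6 / trc_ns)


if __name__ == "__main__":
    total = max_activations(64, 75, 30, 0.3125)
    per_aggressor = total // 2                     # two aggressor rows alternate
    print("Max activations per 64 ms window:", total)
    print("Per aggressor row when alternating:", per_aggressor)
    print("Exceeds ~139,000 threshold:", per_aggressor > 139_000)
```

Real attacks achieve fewer activations than this bound because of cache flushes and other memory traffic, but the margin explains why refresh alone does not prevent hammering.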
Exploits: Rowhammer privilege escalation demonstrated on Linux (2015), JavaScript remote exploit in browser (2015), iOS (2017). "Drammer" exploited DMA on ARM Android devices (2016).
Mitigations:
- TRR (Target Row Refresh, DDR4): The DRAM internally tracks frequently activated rows and proactively refreshes their neighbors. Not standardized; implementations vary and some are bypassable ("TRRespass" attack, 2020).
- RFM (Refresh Management, DDR5): Standardized in JEDEC DDR5. The memory controller counts activations and issues RFM commands that give the DRAM time to refresh the neighbors of heavily activated rows.
- ECC and Rowhammer: With system ECC, single-bit flips are corrected. The attacker needs to flip several bits in the same word simultaneously, which is harder but has been demonstrated ("ECCploit", 2018).
- LPDDR5 RFM: Mobile DRAM with Refresh Management addresses mobile Rowhammer exposure.
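The controller side of Refresh Management can be pictured as a per-bank activation counter. The sketch below is a deliberately simplified model and not the JEDEC state machine: the threshold value, the class name `RfmBankCounter`, and the rule that an RFM or REF command reduces the count by exactly one threshold are all assumptions made for illustration.

```python
class RfmBankCounter:
    """Simplified per-bank activation counter that requests RFM at a fixed threshold."""

    def __init__(self, threshold=32):
        self.threshold = threshold   # hypothetical value; real limits come from the DRAM
        self.count = 0               # activations since the last credit
        self.rfm_issued = 0

    def on_activate(self):
        """Record one ACT; return True if an RFM command should be issued now."""
        self.count += 1
        if self.count >= self.threshold:
            self.rfm_issued += 1
            self.count -= self.threshold   # RFM gives the DRAM time to refresh victims
            return True
        return False

    def on_refresh(self):
        """A normal REF also credits the counter, since it refreshes rows anyway."""
        self.count = max(0, self.count - self.threshold)


if __name__ == "__main__":
    bank = RfmBankCounter(threshold=32)
    rfm_points = [i for i in range(1, 101) if bank.on_activate()]
    print("RFM issued after activations:", rfm_points)
```

The design point is that the controller only counts, while the DRAM decides which rows to refresh during the RFM window, so the victim-tracking logic stays inside the device.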
Historical Context
The one-transistor DRAM cell was invented by Robert Dennard at IBM and patented in 1968 (one transistor per cell, vs earlier 3-transistor cells). SDRAM (Synchronous DRAM) was introduced in 1993, synchronizing to the system bus clock. DDR (Double Data Rate) SDRAM in 2000 transferred data on both rising and falling clock edges, doubling bandwidth. DDR2 (2003), DDR3 (2007), DDR4 (2014), and DDR5 (2020) each roughly doubled bandwidth while improving latency only modestly. Intel integrating the memory controller on-die (Nehalem, 2008) was a major architectural shift, eliminating the northbridge bottleneck. HBM was introduced for GPU applications in 2015 (AMD Fury X with HBM1). Rowhammer was characterized by researchers at CMU and Intel Labs and published at ISCA 2014.
Production Examples
Intel Sapphire Rapids (4th Gen Xeon, 2023): 8 DDR5 channels, HBM2e option (Xeon Max = 64 GB HBM2e on-package), PCIe Gen5. DRAM bandwidth: 307 GB/s (DDR5-4800, 8 channels).
AMD EPYC Genoa (2022): 12 DDR5 channels (DDR5-4800), 460.8 GB/s peak bandwidth. 96-core socket.
Apple M2 Ultra (2023): 800 GB/s unified memory bandwidth (1024-bit LPDDR5-6400 interface, 192 GB maximum). Unified memory is shared between CPU cores, GPU, and Neural Engine, so no copies to discrete GPU VRAM are needed.
NVIDIA H100 SXM5: 80 GB HBM3, 3.35 TB/s bandwidth (5 active stacks × ~670 GB/s each). Memory bandwidth is the dominant performance limiter for inference at batch=1.
Debugging Notes
Memory errors in /var/log/mcelog (Linux): mcelog decodes Machine Check Architecture (MCA) events; newer distributions typically use rasdaemon and the EDAC subsystem instead. Corrected ECC errors are reported as corrected machine-check events, with the failing address and DIMM location when available. Corrected errors that keep accumulating on the same DIMM slot indicate a failing DIMM.
DIMM identification: dmidecode -t 17 shows DIMM population and type. edac-util shows error counts per DIMM slot.
Memory bandwidth measurement:
# STREAM benchmark
./stream_omp # Look for "Triad" bandwidth (close to theoretical peak)
# Theoretical DDR5-6400 dual-channel: 102 GB/s
# Measured: 90-95 GB/s (streaming loops saturate ~92%)
Rowhammer testing:
# fliptester or rowhammer-test from Google Project Zero
# The invocation below is illustrative; check the tool's README for its actual options
./rowhammer --dimm-size=$(grep MemTotal /proc/meminfo | awk '{print $2}')
# If bit flips are found: replace the DIMM immediately
NUMA memory placement:
numactl --hardware # Show NUMA topology and memory per node
numastat # Show per-NUMA-node memory allocation
# For latency-sensitive apps: numactl --membind=0 --cpunodebind=0 ./app
Security Implications
Rowhammer as root privilege escalation: Among the most exploitable DRAM vulnerabilities. Google Project Zero demonstrated a Linux kernel privilege escalation in 2015; CVE-2015-0565 was assigned to the related Native Client sandbox escape. On systems without effective mitigations, any user-space process that can allocate large amounts of contiguous DRAM is a potential attacker.
ECC bypasses: SECDED corrects 1-bit errors silently. An attacker who can cause multiple-bit errors in the same word (with careful timing and adjacent row selection) can bypass SECDED. "ECCploit" demonstrated this on off-the-shelf ECC server memory (DDR3 in the published evaluation).
Memory safety and DRAM integrity: Rowhammer is a physical phenomenon, not a software bug, so software hardening can raise the cost of an attack but cannot remove the underlying fault. Hardware-level mitigations (TRR, RFM) address the root cause, although some TRR implementations have been bypassed.
Cross-VM Rowhammer in cloud: If two VMs share physical DRAM pages in the same bank rows, a malicious VM can flip bits in the other VM's memory. Cloud providers mitigate via physical memory isolation (balloon driver + large page reservations), memory encryption (AMD SME/SEV), and DRAM refresh control.
Performance Implications
Bandwidth-latency tradeoff: Higher bandwidth (more channels, faster DRAM) typically increases latency slightly (due to more complex routing). Apple M2's unified memory achieves high bandwidth with reasonable latency by co-locating DRAM on the same package via LPDDR5.
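Refresh overhead: For DDR5, tRFC ≈ 295–600 ns depending on device density. With tREFI = 7.8 µs, refresh occupies 295/7800 ≈ 3.8% of rank time; at DDR5's nominal tREFI of about 3.9 µs the same tRFC costs roughly 7.6%. Refresh-induced latency spikes are visible in P99 measurements.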
Cache line granularity: DRAM accesses are made in 64-byte units, matching the x86 cache line size. An 8-byte read still transfers 64 bytes from DRAM to cache. This granularity is also why false sharing is so harmful: when cores write different variables in the same cache line, the cache coherence protocol bounces the entire 64-byte line between their caches.
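The cost of line-granular transfers can be estimated for a strided scan. This is a rough sketch that assumes aligned elements, a stride at least as large as the element, and no cache reuse between passes; the function name `bus_efficiency` is illustrative.

```python
def bus_efficiency(element_bytes, stride_bytes, line_bytes=64):
    """Fraction of the bytes fetched from DRAM that the program actually uses."""
    if stride_bytes >= line_bytes:
        used_per_line = element_bytes                      # one element per fetched line
    else:
        used_per_line = (line_bytes // stride_bytes) * element_bytes
    return min(1.0, used_per_line / line_bytes)


if __name__ == "__main__":
    for stride in (8, 16, 64, 256):
        eff = bus_efficiency(8, stride)
        print(f"8-byte elements, stride {stride:3d} B: {eff:.1%} of fetched bytes used")
```

For example, reading one 8-byte field from each 64-byte record uses only 12.5% of the bandwidth consumed, which is the usual motivation for structure-of-arrays layouts in bandwidth-bound code.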
Failure Modes and Real Incidents
Incident: AWS gamma bit ECC storms (2019, reported by Werner Vogels): A batch of DRAM DIMMs from a specific manufacturing lot developed "gamma bit" — cells that reliably flip under specific refresh patterns. ECC corrected the errors, but the MCE storm logged millions of corrected errors per hour per server, triggering automated DIMM replacement at scale. Discovered via aggregated MCE monitoring across the fleet.
Incident: Google Rowhammer-based privilege escalation PoC (2015): Mark Seaborn of Google Project Zero published a working Linux root exploit that used /proc/self/pagemap to locate physically adjacent rows for hammering. The Linux kernel subsequently restricted unprivileged access to pagemap. The exploit had a ~40% success rate on unpatched DDR3 systems. Most users had vulnerable DIMMs and never knew.
Incident: Rowhammer on ECC memory (ECCploit, 2018): Researchers at VU Amsterdam demonstrated Rowhammer on ECC-protected server memory (DDR3 in the published evaluation) by inducing 3 simultaneous bit flips in one word, which bypasses SECDED. Success rate: low but non-zero. AWS, Google, Azure updated BIOS settings to use maximum refresh rates (1x refresh) and enabled TRR aggressively.
Modern Usage
AMD RDIMM/LRDIMM for large memory (2023): 256 GB RDIMM per DIMM slot (DDR5 3DS, 3D-stacked), enabling up to 6 TB per socket (12 channels × 2 DIMMs per channel). Used in in-memory database servers (SAP HANA, Oracle Exadata).
Intel Optane PMem (DC Persistent Memory, 2019–2022): 512 GB DIMMs using 3D XPoint phase-change memory, filling the gap between DRAM speed and SSD capacity. Discontinued in 2022 but demonstrated the "storage class memory" concept (persistent, byte-addressable, slower than DRAM but faster than NVMe).
CXL (Compute Express Link) Memory Expansion: CXL 2.0 (2020) enables memory-semantic access to DRAM attached over a PCIe 5.0 physical link with <500 ns latency. Used for memory expansion and disaggregation: attach 1–8 TB of extra DRAM to a server without consuming the CPU's own DIMM channels. Samsung CXL memory modules were available in 2023.
Future Directions
- DDR6 (JEDEC, expected 2025–2026): 12800 MT/s, 128-bit sub-channel, on-die ECC mandatory, RFM mandatory. Roughly 2× DDR5 bandwidth.
- LPCAMM2 (LPDDR5X module standard, 2024): The CAMM (Compression Attached Memory Module) format replaces SO-DIMM in laptops; it enables LPDDR5X speeds on a replaceable module with a smaller footprint, using a compression-mounted connector for better signal integrity.
- Processing-in-Memory (PIM): Samsung HBM-PIM (2021) adds simple ALUs inside HBM DRAM to perform operations (e.g., vector addition) without moving data to CPU. Reduces memory bus traffic by performing computation at memory.
- DRAM scaling limits: DRAM cells at 12 nm node struggle with charge retention time (shorter refresh intervals needed). Below 10 nm, DRAM may require FinFET transistors or alternative storage capacitor materials; density scaling slowing.
Exercises
- DRAM timing calculation: Given DDR5-6400 with CL32-34-34 timings, calculate (a) first-access latency in nanoseconds for a row miss, (b) row hit latency, (c) theoretical peak bandwidth for sequential reads, (d) expected sustained bandwidth assuming a 95% row hit rate.
- Rowhammer vulnerability assessment: Write a C program that (a) allocates a large block of memory, (b) finds two rows in the same DRAM bank using virtual-to-physical address analysis (/proc/self/pagemap), (c) hammers those rows 1 million times, (d) checks adjacent rows for bit flips. Run on a system without ECC and report results.
- Memory controller scheduling simulation: Implement a simple FR-FCFS (First Ready - First Come First Serve) memory controller scheduler in Python. Queue 100 random memory requests. Compare: (a) FCFS scheduling, (b) FR-FCFS (prioritize row hits), (c) closed-page policy. Measure average latency and bank-level parallelism.
- NUMA performance experiment: On a dual-socket server (or NUMA-aware VM), write a benchmark that allocates memory on both NUMA nodes and measures read bandwidth. Compare local vs remote NUMA node access latency and bandwidth. Use numactl to control placement. Calculate the NUMA factor (remote/local latency ratio).
- ECC syndrome analysis: Given the Hamming code for 64-bit SECDED, implement an encoder and decoder in Python. (a) Encode a test 64-bit word, (b) flip one bit and verify the decoder corrects it, (c) flip two bits and verify the decoder detects (but does not correct) the error, (d) generate the syndrome lookup table and verify all 1-bit error positions are uniquely identified.
References
- JEDEC DDR5 SDRAM Standard (JESD79-5B), 2022
- Kim et al., "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors," ISCA 2014
- Seaborn, M., "Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges," Google Project Zero Blog, 2015
- Mutlu, "Memory Scaling: A Systems Architecture Perspective," IMW 2013
- Intel Xeon Scalable Memory Reference (Sapphire Rapids), Intel 2023
- AMD EPYC 9004 Series Architecture Guide, AMD 2022
- Cojocar et al., "ECCploit: ECC Memory Vulnerable to Rowhammer Attacks After All," IEEE S&P 2019
- JEDEC HBM3 Standard (JESD238), 2022