01 - Storage Hierarchy

Technical Overview

The storage hierarchy is one of the foundational organizing principles of computer architecture. It describes a spectrum of memory technologies ordered by speed, cost per bit, and capacity. As you descend the hierarchy, storage becomes slower, cheaper, and larger. Every modern system exploits this hierarchy to provide the illusion of both abundant and fast storage simultaneously.

The key insight is that real programs exhibit locality: temporal locality (recently accessed data is likely to be accessed again soon) and spatial locality (data near recently accessed data is likely to be accessed next). Caches at every level exploit these patterns to hide the latency of slower tiers.
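Spatial locality can be illustrated with a toy cache model. The sketch below assumes 64-byte cache lines and a small fully associative LRU cache (both simplifications; the function name and sizes are illustrative, not a model of any specific CPU). It compares a sequential walk over 8-byte elements with a walk that touches one element per line.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache line size


def line_hit_rate(addresses, num_lines=512):
    """Fraction of accesses that hit in a fully associative LRU cache of lines."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        line = addr // LINE_BYTES
        if line in cache:
            hits += 1
            cache.move_to_end(line)  # mark as most recently used
        else:
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict least recently used line
    return hits / len(addresses)


# 8-byte elements laid out back to back: 8 elements share each line.
sequential = [i * 8 for i in range(4096)]
# One element per 64-byte line: every access touches a new line.
strided = [i * 64 for i in range(4096)]

sequential_rate = line_hit_rate(sequential)
strided_rate = line_hit_rate(strided)
print(sequential_rate, strided_rate)
```

In this model the sequential walk misses once per line and hits on the other seven elements, while the strided walk never reuses a line.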

Prerequisites

Basic computer architecture (registers, cache, RAM)
Operating systems fundamentals (virtual memory, page cache)
Familiarity with orders of magnitude in latency (ns, µs, ms)

Core Content

The Hierarchy

Fastest / Most Expensive / Smallest
+------------------------------------------+
|  CPU Registers                           |  ~0.3 ns   |   ~1 KB
|  L1 Cache (per core)                     |  ~1 ns     |  32-64 KB
|  L2 Cache (per core)                     |  ~4 ns     | 256-512 KB
|  L3 Cache (shared)                       |  ~20 ns    |   4-64 MB
|  DRAM (main memory)                      |  ~80 ns    |  16-512 GB
|  Storage-Class Memory (Optane / 3D XPoint)|  ~300 ns  |   64 GB-6 TB
|  NVMe SSD (PCIe Gen4)                    |  ~100 µs   | 250 GB-8 TB
|  SATA SSD                                |  ~100 µs   | 500 GB-4 TB
|  HDD (7200 RPM)                          |  5-10 ms   |   1-20 TB
|  Magnetic Tape (LTO-9)                   |  30-60 s   |  18-45 TB/cartridge
|  Archive / Cold Object Storage (Glacier) |  min-hours |  unlimited
+------------------------------------------+
Slowest / Cheapest / Largest

Latency and Bandwidth Reference Table

Level	Read Latency	Sequential BW	Random 4K IOPS	Cost/GB (2024)
L1 Cache	~1 ns	~1 TB/s	N/A	~$1000+
L2 Cache	~4 ns	~500 GB/s	N/A	~$500
L3 Cache	~20 ns	~200 GB/s	N/A	~$10
DRAM (DDR5)	~80 ns	50-100 GB/s	N/A	~$3-5
Optane PMEM	~300 ns	50 GB/s	~600K	~$10-20
NVMe SSD (Gen4)	~100 µs	7 GB/s	~1M	~$0.10-0.20
SATA SSD	~100 µs	550 MB/s	~100K	~$0.05-0.10
HDD (7200 RPM)	5-10 ms	150-250 MB/s	~150	~$0.02
LTO-9 Tape	30-60 s	400 MB/s	N/A	~$0.002
S3 Glacier Deep	hours	varies	N/A	~$0.001
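To see how the latencies above combine, the following sketch computes an expected access latency as a weighted average over the tiers that end up serving each access. The latencies come from the table; the per-level hit rates are hypothetical values chosen only for illustration, and the model ignores bandwidth limits and overlapping requests.

```python
def effective_latency_ns(levels, backing_latency_ns):
    """Expected latency when each access is served by the first tier that hits.

    levels: list of (latency_ns, hit_rate) ordered fastest to slowest.
    backing_latency_ns: latency of the tier that serves everything else.
    """
    expected = 0.0
    reach = 1.0  # probability that an access gets as far as this level
    for latency_ns, hit_rate in levels:
        expected += reach * hit_rate * latency_ns
        reach *= 1.0 - hit_rate
    return expected + reach * backing_latency_ns


# (latency from the table, hypothetical hit rate) for L1, L2, L3
cache_levels = [(1.0, 0.95), (4.0, 0.80), (20.0, 0.75)]
avg_ns = effective_latency_ns(cache_levels, backing_latency_ns=80.0)
print(f"{avg_ns:.2f} ns")
```

With these assumed hit rates only 0.25% of accesses reach DRAM, so the average stays close to L1 latency. Lowering any hit rate shifts weight toward the slower tiers quickly.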

Storage-Class Memory: Intel Optane / 3D XPoint

Intel Optane (based on 3D XPoint technology, co-developed with Micron, which branded it "QuantX") occupied a unique position in the hierarchy between DRAM and NAND flash. Like DRAM, it is byte-addressable; unlike DRAM, it is persistent. Unlike NAND, it has no erase-before-write requirement.

Key characteristics:
- Latency: ~300 ns read, ~100 ns write (vs DRAM ~80 ns, vs NVMe ~100 µs)
- Endurance: far superior to NAND, although the media still wears and the controller performs wear leveling internally
- Byte-addressable when used as PMEM (persistent memory) via DAX mode
- Block-addressable when used as Optane SSD

Two deployment modes:
1. App Direct mode: a filesystem mounted with DAX (Direct Access) bypasses the page cache, so CPU loads and stores on an mmap() region go directly to the persistent medium. Used with ext4 mounted with -o dax or with the PMDK library.
2. Memory mode: DRAM acts as a cache (effectively an L4) in front of Optane, transparent to the OS. Provides cheap, large memory, but the contents are treated as volatile.
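The App Direct access pattern (map a file, then use ordinary stores) can be sketched with Python's standard mmap module. This is only a stand-in: it runs against a regular temporary file, so the stores go through the page cache, and flush() issues an msync rather than the CLWB + SFENCE sequence that PMDK uses on real persistent memory. The helper name and file name are hypothetical.

```python
import mmap
import os
import tempfile


def store_bytes(path, offset, payload):
    """Write payload at a byte offset through a shared memory mapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:  # length 0 maps the whole file
            mm[offset:offset + len(payload)] = payload  # plain byte-granular store
            mm.flush()  # msync: push the dirty range to the backing file


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "pmem_stand_in.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(b"\x00" * 4096)  # mmap needs a non-empty file
    store_bytes(path, 100, b"hello")
    with open(path, "rb") as f:
        readback = f.read()

print(readback[100:105])
```

On a DAX-mounted filesystem the same mapping would address the persistent medium directly, which is what removes the page cache copy from the path.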

Intel discontinued Optane in 2022 due to business reasons, but the concept of storage-class memory (SCM) remains active research and commercial interest (Samsung Z-NAND, CXL-attached memory).

Working Set Size Concept

The working set of a process at time T with window delta is the set of pages referenced in the interval [T-delta, T]. If the working set fits in a cache tier, accesses are served at roughly that tier's latency. If it does not fit, the overflow incurs capacity misses that are served from the next tier down.

For databases, the working set is the "hot" portion of the dataset — the pages accessed frequently enough to remain in the buffer pool. A 100 GB database with a 20 GB working set can be served mostly from a 32 GB buffer pool, with cold pages fetching from NVMe only occasionally.

Working set overflow scenarios:
- DRAM working set exceeds physical RAM → kernel swaps to swap partition (NVMe/SSD) → latency for a swapped-out page jumps from ~80 ns to ~100 µs (roughly 1000x degradation)
- Database buffer pool smaller than working set → high I/O rate, read amplification, poor cache hit ratio
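The working set definition and the overflow cliff can both be shown on a synthetic page reference trace. The sketch below assumes time is measured in references (trace indices) and models the cache tier as a plain LRU; the trace itself is made up for illustration.

```python
from collections import OrderedDict


def working_set(trace, t, delta):
    """Pages referenced in the window [t - delta, t], with t an index into trace."""
    start = max(0, t - delta)
    return set(trace[start:t + 1])


def lru_hit_ratio(trace, capacity):
    """Hit ratio of an LRU cache holding `capacity` pages over the whole trace."""
    cache = OrderedDict()
    hits = 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used page
            cache[page] = True
    return hits / len(trace)


# Synthetic trace: a loop over 4 pages, 100 references in total.
trace = [0, 1, 2, 3] * 25

ws = working_set(trace, t=99, delta=10)
fits = lru_hit_ratio(trace, capacity=4)      # working set fits
overflow = lru_hit_ratio(trace, capacity=3)  # one page short
print(sorted(ws), fits, overflow)
```

With room for all 4 pages only the first pass misses. With room for 3, LRU evicts each page just before it is needed again and every reference misses, which is the thrashing behavior in miniature.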

Tiered Storage: Hot/Warm/Cold

Production systems classify data by access frequency:

+-------------------+--------------------+------------------------+
|   HOT DATA        |   WARM DATA        |   COLD / ARCHIVE       |
|   (active use)    |   (recent, infreq) |   (compliance, backup) |
|                   |                    |                        |
|  NVMe SSD         |  SATA SSD or HDD   |  Tape or Glacier       |
|  In-memory DB     |  Object store (S3  |  S3 Glacier Deep       |
|  Redis/Memcached  |   Standard-IA)     |  Azure Archive         |
+-------------------+--------------------+------------------------+

Automated tiering systems move data based on heat (access frequency):
- NetApp FabricPool: moves cold LUN data to object store automatically
- IBM Spectrum Storage: policy-based tiering across flash/HDD/tape
- Ceph CRUSH rules: can direct hot pools to NVMe OSDs, cold to HDD OSDs
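A heat-based placement policy reduces to a threshold function over access counts. The sketch below is a generic illustration, not the algorithm of any product listed above; the class, field names, and thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ObjectStats:
    name: str
    accesses_last_30_days: int  # hypothetical heat metric


def assign_tier(stats, hot_threshold=100, warm_threshold=1):
    """Map an object's recent access count to a tier (thresholds are illustrative)."""
    if stats.accesses_last_30_days >= hot_threshold:
        return "hot"
    if stats.accesses_last_30_days >= warm_threshold:
        return "warm"
    return "cold"


catalog = [
    ObjectStats("sessions.db", 50_000),
    ObjectStats("q3-report.pdf", 4),
    ObjectStats("2019-audit.tar", 0),
]
placement = {obj.name: assign_tier(obj) for obj in catalog}
print(placement)
```

Real tiering engines usually add hysteresis or a minimum residency time so that objects near a threshold do not bounce between tiers.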

Storage Hierarchy in Databases

Modern databases implement their own mini-hierarchy:

SQL Query
    |
    v
Query Executor
    |
    v
+----------------------------------+
|   Buffer Pool (in-process DRAM)  |  Page cache managed by DB
|   (e.g. InnoDB: innodb_buffer_   |
|    pool_size, typically 70% RAM) |
+----------------------------------+
    |  miss
    v
+----------------------------------+
|   OS Page Cache (kernel DRAM)    |  Second chance buffer
|   (relevant for non-O_DIRECT)    |
+----------------------------------+
    |  miss
    v
+----------------------------------+
|   NVMe / SATA SSD                |  Data files, WAL, indexes
+----------------------------------+

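PostgreSQL reads and writes data files through the OS page cache (no O_DIRECT by default), so the effective memory for DB pages is shared_buffers plus the OS page cache, at the cost of some pages being cached twice. MySQL InnoDB is usually run with innodb_flush_method=O_DIRECT on Linux to avoid this double-buffering; O_DIRECT is the default as of MySQL 8.4, while earlier releases defaulted to fsync.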

Cloud Storage Hierarchy

+------------------------------------------+
|  Instance Memory (EC2 instance RAM)       | AWS: 24 TB on u-24tb1 (x2iedn: 4 TB)
+------------------------------------------+
|  Instance Store (ephemeral NVMe)          | up to 8x 7.5 TB NVMe
|  (lost on stop/terminate)                 |
+------------------------------------------+
|  EBS (Elastic Block Store)                | io2 Block Express: 256K IOPS
|  (network-attached, persistent)           | gp3: 3K baseline, 16K provisioned
+------------------------------------------+
|  S3 Standard (object, 3-AZ)               | ~100-200 ms first byte
+------------------------------------------+
|  S3 Standard-Infrequent Access            | same latency, lower cost
+------------------------------------------+
|  S3 Glacier Instant Retrieval             | ms retrieval
+------------------------------------------+
|  S3 Glacier Flexible Retrieval            | minutes to hours
+------------------------------------------+
|  S3 Glacier Deep Archive                  | 12-48 hours, $0.00099/GB/mo
+------------------------------------------+

GCP equivalent: Local SSD → Persistent Disk (SSD/Balanced/HDD) → Cloud Storage (Standard/Nearline/Coldline/Archive)

Historical Context

The memory hierarchy concept was formalized in the 1960s at IBM and Manchester University. The Atlas computer (1962) introduced the concept of virtual memory — automatically managing the two-level hierarchy of core memory and drum storage. The principle that "a hierarchy of memories with different speeds and costs can appear as a single fast, large memory" is the foundation of all modern memory systems.

Peter Denning's working set model (1968) provided the theoretical basis for understanding what data needs to be in fast memory. Denning's seminal paper established that thrashing occurs when the sum of all process working sets exceeds physical memory — still directly observable in modern systems when kswapd pegs a CPU.

The introduction of flash memory (Toshiba, 1980s) and its eventual integration into the storage hierarchy as SSDs fundamentally changed cost/performance tradeoffs. The decade 2007-2017 saw SSDs displace HDDs for system disks due to their orders-of-magnitude better random I/O.

Production Examples

Netflix tiered storage: Hot content (top 1000 titles, 90% of traffic) served from CDN edge (essentially a cache layer). Warm content on S3 Standard. Full catalog in S3 Standard-IA. Original masters in Glacier. A single 4K master can be 400 GB — keeping it in Standard would cost 10x vs Glacier.

LinkedIn Voldemort: Key-value store that explicitly manages hot/warm/cold tiers in-process. Hot keys stay in off-heap memory, warm keys on local SSD, cold keys fetched from distributed storage.

Cloudflare Workers KV: Uses a read-through cache hierarchy: V8 isolate memory → regional PoP cache → central R2 object storage. Reads at the edge are sub-millisecond for cached keys, ~50ms for cold fetches.

Debugging Notes

Check effective cache hit rates:

# Linux page cache and swap stats (swap used = SwapTotal - SwapFree)
grep -E 'MemTotal|MemFree|Buffers|Cached|SwapTotal|SwapFree' /proc/meminfo

# Check if swap is being used: sustained nonzero si/so columns indicate memory pressure
vmstat 1 5

-- Database buffer pool hit rate (MySQL)
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';
-- hit ratio = 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)

-- PostgreSQL buffer hit rate (shared_buffers only; OS page cache hits still count as reads)
SELECT
  sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS hit_rate
FROM pg_statio_user_tables;

# Check storage tier on Linux block device
cat /sys/block/nvme0n1/queue/rotational  # 0 = SSD, 1 = HDD
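The same checks can be scripted. The sketch below parses /proc/meminfo-style text to derive swap usage and computes the InnoDB hit ratio from the two status counters. The sample text and counter values are made up so the example runs anywhere; on a real host you would read /proc/meminfo and the output of SHOW GLOBAL STATUS instead.

```python
# Made-up sample in /proc/meminfo format, for illustration only.
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
MemFree:         1024000 kB
Buffers:          204800 kB
Cached:          8192000 kB
SwapTotal:       2097152 kB
SwapFree:        1572864 kB
"""


def parse_meminfo(text):
    """Return {field: value} from 'Name:   value [unit]' lines."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key.strip()] = int(rest.split()[0])
    return fields


def swap_used_kb(fields):
    """Swap in use; /proc/meminfo reports total and free, not used."""
    return fields["SwapTotal"] - fields["SwapFree"]


def innodb_hit_ratio(read_requests, reads):
    """1 - (disk reads / logical read requests); None if there were no requests."""
    if read_requests == 0:
        return None
    return 1.0 - reads / read_requests


fields = parse_meminfo(SAMPLE_MEMINFO)
used_kb = swap_used_kb(fields)
ratio = innodb_hit_ratio(read_requests=1_000_000, reads=5_000)
print(used_kb, ratio)
```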

Security Implications

Data remanence: Data persists on lower storage tiers longer than expected. DRAM contents can be read after power loss via cold boot attack (data persists for seconds to minutes at room temperature, much longer when chilled). SSDs with FTL complicate secure erasure — logical TRIM does not guarantee physical erasure; use ATA Secure Erase or cryptographic erasure (encrypt then discard key).

Cache side-channels: Cache timing attacks such as Flush+Reload and Prime+Probe exploit the measurable difference between an L1 cache hit (~1 ns) and a DRAM access (~80 ns) to infer which cache lines a victim touched. Spectre and Meltdown use this same timing channel to leak data that was accessed during transient (speculative) execution.

Cloud storage class leakage: Glacier retrieval times create observable timing side-channels. An attacker who can trigger retrievals can infer storage tier of target objects.

Performance Implications

Memory bandwidth saturation: With many cores sharing L3 and DRAM, memory bandwidth often becomes the bottleneck before compute. NUMA topology means DRAM on a remote socket typically has 1.5-2x the latency of local DRAM. numactl --hardware shows NUMA distances.

Storage tail latency: P99 and P999 latencies often jump tiers. An NVMe device with median 100 µs latency may have P999 at 1 ms due to GC pauses or power-state transitions (NVMe APST — Autonomous Power State Transitions). Disable APST for latency-sensitive workloads: nvme set-feature /dev/nvme0 -f 0x0c -v 0.
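The gap between median and tail can be made concrete with a nearest-rank percentile over latency samples. The sample set below is synthetic (mostly 100 µs reads with a few 1 ms outliers, matching the figures above) and is not measured data.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


# Synthetic latencies in microseconds: 0.2% of requests hit a slow path.
latencies_us = [100] * 9980 + [1000] * 20

p50 = percentile(latencies_us, 50)
p99 = percentile(latencies_us, 99)
p999 = percentile(latencies_us, 99.9)
print(p50, p99, p999)
```

Here the median and P99 both report 100 µs and only P99.9 exposes the 1 ms outliers, which is why tail percentiles need to be tracked separately from averages.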

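Tiering amplification: Cold data access incurs not just fetch latency but also eviction of hot data. In a database with a warm buffer pool, a single cold table scan can evict hot pages (the "full table scan" problem). Engines mitigate this with scan-resistant replacement: PostgreSQL reads large sequential scans through a small ring buffer instead of the whole of shared_buffers, and InnoDB inserts newly read pages at the midpoint of its LRU list. Setting enable_seqscan = off only discourages the planner from choosing sequential scans and is intended for debugging, not cache protection; pg_prewarm can reload hot relations after an eviction.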
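The scan-eviction problem can be sketched with a toy LRU buffer pool. The optional ring_size parameter models one common mitigation, confining scan pages to a small recycled ring (a simplification of the buffer-ring idea PostgreSQL applies to large sequential scans). Page numbers, pool size, and ring size are hypothetical.

```python
from collections import OrderedDict, deque


def surviving_hot_pages(capacity, hot_pages, scan_pages, ring_size=None):
    """Count hot pages still cached after a one-pass scan through an LRU pool."""
    pool = OrderedDict()
    ring = deque()  # scan pages currently occupying pool slots

    def insert(page):
        if len(pool) >= capacity:
            pool.popitem(last=False)  # evict the least recently used page
        pool[page] = True

    for page in hot_pages:
        insert(page)
    for page in scan_pages:
        if ring_size is not None and len(ring) >= ring_size:
            # Reuse the oldest scan slot instead of evicting from the main LRU.
            pool.pop(ring.popleft(), None)
        insert(page)
        if ring_size is not None:
            ring.append(page)
    return sum(1 for page in hot_pages if page in pool)


hot = list(range(100))          # hot pages fill the pool
scan = list(range(1000, 1500))  # cold scan, 5x the pool size

plain_lru = surviving_hot_pages(100, hot, scan)
with_ring = surviving_hot_pages(100, hot, scan, ring_size=8)
print(plain_lru, with_ring)
```

Under plain LRU the scan flushes every hot page. With an 8-slot ring the scan costs only the 8 pages displaced while the ring fills.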

Failure Modes and Real Incidents

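2012 - Knight Capital Group: Often cited alongside infrastructure failures, but not a storage hierarchy or memory-pressure incident. According to the SEC's findings, new code was deployed to only seven of eight servers, and a repurposed flag activated dormant "Power Peg" logic on the eighth. $440M loss in 45 minutes. It is listed here only as a reminder of how quickly a partially deployed system can fail.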

Swap storms in production: When a Linux system under memory pressure starts swapping heavily to SSD/HDD, latency for all processes jumps. A common mitigation is vm.swappiness=1 rather than 0 (on kernels since 3.5, a value of 0 avoids swapping anonymous pages until free and file-backed memory is nearly exhausted, which can invoke the OOM killer first), together with provisioning adequate DRAM. Unexpected swap activity is a recurring root cause in production postmortems at companies of every size.

Optane PMEM failures (2021-2022): Several early adopters of Optane in App Direct mode encountered silent data corruption bugs in PMDK and kernel DAX code. Intel's abandonment of Optane left organizations with stranded investments.

Modern Usage

CXL (Compute Express Link): CXL 2.0 enables pooled memory — a CXL memory expander (DRAM or SCM) sits on PCIe 5.0, accessible as coherent memory by the CPU. Latency ~250-350 ns (vs DRAM 80 ns). Enables disaggregated memory pools in data centers.
Persistent Memory tiering in Linux: memtier and daxctl tools manage tiered DRAM+PMEM. Linux 5.15+ can demote cold DRAM pages to a PMEM tier during reclaim instead of discarding them (enabled through /sys/kernel/mm/numa/demotion_enabled).
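NVMe ZNS (Zoned Namespaces): exposes sequential-write-only zones to the host, the same zoned model used by SMR HDDs. Because the host controls data placement, the drive does far less internal garbage collection, which reduces write amplification (WAF) and improves latency predictability.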

Future Directions

CXL memory pooling: Disaggregate DRAM from compute, share across hosts in rack-scale memory pools. Meta's and Microsoft's rack designs are already planning for this.
Computational storage: SSDs with embedded CPUs that run compute (filtering, compression, encryption) near the data, reducing PCIe bandwidth requirements.
Photonic interconnects: Replace PCIe electrical signaling with optical links, extending low-latency storage access over longer distances.
DNA storage: Theoretical ~1 EB/gram density, but read/write latency measured in hours. Active research at Microsoft and Twist Bioscience.

Exercises

On a Linux system, use perf stat -e cache-misses,cache-references on a memory-bound program. Vary the working set size from 1 MB to 1 GB and plot the cache miss rate. At what size do you see L3 spill?
Write a benchmark that reads 4 KB random blocks from a file. Compare throughput when the file fits in page cache (file < RAM/2) vs. when it does not (file > RAM). Quantify the latency difference.
Explore SSD over-provisioning using fstrim and blkdiscard on an NVMe SSD (blkdiscard destroys all data on the device, so use a scratch drive). Measure sustained write performance before and after leaving 10% of the device unpartitioned as extra spare area.
Instrument a PostgreSQL instance with pg_statio_user_tables. Create a table larger than shared_buffers, run a full sequential scan, and observe the buffer hit rate drop. Restore it with pg_prewarm.
Research Intel Optane PMEM App Direct mode. Sketch the software stack from a pmdk pmem_persist() call down to the CPU store instruction and how persistence is guaranteed on power loss (hint: CPU store + CLWB + SFENCE).

References

Hennessy, J. and Patterson, D. Computer Architecture: A Quantitative Approach, 6th ed. Chapter 2 (Memory Hierarchy Design) and Appendix B (Review of Memory Hierarchy)
Denning, P.J. "The Working Set Model for Program Behavior." CACM, 1968.
Intel Optane Persistent Memory Architecture Guide: https://www.intel.com/content/www/us/en/developer/articles/technical/persistent-memory-architecture.html
Izraelevitz, J. et al. "Basic Performance Measurements of the Intel Optane DC Persistent Memory Module." arXiv:1903.05714
CXL Consortium Specification 2.0: https://www.computeexpresslink.org/
Linux kernel: Documentation/admin-guide/mm/ and Documentation/ABI/testing/sysfs-block
Gregg, B. Systems Performance: Enterprise and the Cloud, 2nd ed. — Chapter 7 (Memory), Chapter 9 (Disks)