03 - SSD Internals
Technical Overview
Solid-State Drives store data in NAND flash memory — arrays of floating-gate transistors that trap charge to represent bits. Unlike HDDs, SSDs have no moving parts, delivering random I/O latency 100-1000x lower than spinning disk and near-sequential performance for random reads. However, NAND's fundamental physics — cells can only be programmed (written) after a slow bulk erase operation, and cells wear out — impose constraints that require sophisticated firmware (the Flash Translation Layer) to manage transparently.
Understanding SSD internals is essential for interpreting write amplification, understanding why SSD performance degrades under sustained writes, diagnosing age-related performance cliffs, and making informed decisions about NVMe vs SATA vs storage-class memory.
Prerequisites
- Basic semiconductor physics concepts (useful but not required)
- Understanding of the storage hierarchy (see 01-storage-hierarchy.md)
- Familiarity with I/O benchmarking concepts
Core Content
NAND Flash Cell Types
NAND flash stores data by trapping electrons in a floating gate or charge trap layer within a transistor. The number of distinct charge levels per cell determines bits per cell:
SLC (Single Level Cell) — 1 bit/cell
|LOW (erased)|HIGH (programmed)|
1 0
Endurance: 50,000-100,000 P/E cycles
Latency: read ~25µs, write ~200µs
MLC (Multi Level Cell) — 2 bits/cell
|LOW|MID-LOW|MID-HIGH|HIGH|
11 10 00 01 (one common Gray-coded mapping; erased = 11)
Endurance: 3,000-10,000 P/E cycles
Latency: read ~50µs, write ~600µs
TLC (Triple Level Cell) — 3 bits/cell
8 discrete voltage levels
Endurance: 1,000-3,000 P/E cycles
Latency: read ~75µs, write ~900µs
QLC (Quad Level Cell) — 4 bits/cell
16 discrete voltage levels
Endurance: 100-1,000 P/E cycles
Latency: read ~100µs, write ~1500µs
As bits/cell increases, the margin between voltage levels narrows, requiring more precise sensing, increasing read/write latency, and reducing endurance (each erase widens threshold distributions).
PLC (Penta Level Cell): 5 bits/cell — announced by Kioxia/BiCS; targeted at cold/archive storage. Endurance <100 P/E cycles.
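The narrowing margin can be made concrete with a little arithmetic. The sketch below assumes a fixed threshold-voltage window shared evenly between levels; `VT_WINDOW_V` is a hypothetical, illustrative value and not a figure from any data sheet.

```python
# Illustrative only: VT_WINDOW_V is an assumed, hypothetical voltage window.
VT_WINDOW_V = 6.0

CELL_TYPES = {"SLC": 1, "MLC": 2, "TLC": 3, "QLC": 4, "PLC": 5}


def voltage_levels(bits_per_cell: int) -> int:
    """Number of distinct charge levels needed to store bits_per_cell bits."""
    return 2 ** bits_per_cell


def level_spacing(bits_per_cell: int, window_v: float = VT_WINDOW_V) -> float:
    """Spacing between adjacent level centers if levels share the window evenly."""
    return window_v / (voltage_levels(bits_per_cell) - 1)


if __name__ == "__main__":
    for name, bits in CELL_TYPES.items():
        print(f"{name}: {voltage_levels(bits):2d} levels, "
              f"~{level_spacing(bits):.2f} V between levels")
```

Each added bit doubles the number of levels, so the spacing roughly halves every step while the physical noise sources stay the same.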
NAND Flash Physics
Cross-section of a floating-gate NAND transistor:
Control Gate (Word Line)
|
+----+----+
| Oxide | <-- Inter-Poly Oxide (IPO)
+---------+
|Floating | <-- Floating Gate (stores charge)
| Gate |
+---------+
| Oxide | <-- Tunnel Oxide (~8nm thick)
+---------+
Channel (p-type silicon)
/ \
Source Drain
(Bit Line) (Bit Line)
Programming (write): Apply a high voltage to the control gate so that electrons tunnel through the tunnel oxide into the floating gate (Fowler-Nordheim tunneling). The trapped electrons raise the cell's threshold voltage, and the resulting threshold level encodes the stored bits.
Erasing: Apply a high voltage to the substrate so that electrons tunnel out of the floating gate, returning the cells to the low-charge, low-threshold state (erased = all 1s). Erase operates on an entire erase block, not on individual cells or pages, because all cells in a block share the substrate well that receives the erase voltage, so a single cell or page cannot be erased selectively.
Reading: Apply read voltage; if cell conducts at that voltage, it's one bit value; if not, the other. For MLC/TLC, multiple read voltages are applied to detect which of N voltage levels the cell holds.
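The multi-voltage read can be sketched as a search over reference voltages. This is a minimal illustration, assuming ideal cells with no noise; the reference voltages in `MLC_READ_REFS` and the function name `sense_level` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical read reference voltages for a 2-bit (MLC) cell, in volts.
MLC_READ_REFS = [1.0, 2.0, 3.0]


def sense_level(cell_vt: float, read_refs: list[float]) -> int:
    """Return the level index (0 = lowest threshold voltage) of a cell.

    Each reference voltage is one sensing step: the cell conducts only when
    the applied read voltage exceeds its threshold voltage, so the level is
    the number of references at which the cell does NOT conduct.
    """
    return bisect_right(read_refs, cell_vt)
```

An N-level cell needs N-1 reference voltages, which is one reason reads get slower as bits per cell increase.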
Wear: Each P/E (Program/Erase) cycle slightly widens the threshold voltage distribution due to charge trapping in tunnel oxide defects. After enough cycles, distributions overlap → uncorrectable bit errors.
3D NAND: Since ~2015, NAND manufacturers have stacked cell layers vertically (32→64→128→176→232 layers as of 2024) rather than shrinking planar cell size. Vertical NAND (V-NAND/BiCS) improves density and, counterintuitively, endurance (larger cells due to vertical geometry).
NAND Hierarchy and Parallelism
SSD Device
|
+-- Controller (ARM cores, DRAM, ECC engine, FTL)
|
+-- NAND Package 0 (one physical chip, e.g. Micron NAND BGA)
| |
| +-- Die 0 (independent array, own address space)
| | +-- Plane 0 (can run parallel operations within die)
| | | +-- Block 0 (erase unit, ~256KB-4MB)
| | | | +-- Page 0 (read/write unit, 4KB-16KB)
| | | | +-- Page 1
| | | | +-- ...
| | | | +-- Page N (e.g. 256 pages/block for 1MB block)
| | | +-- Block 1
| | | +-- ...
| | +-- Plane 1
| +-- Die 1
+-- NAND Package 1
+-- ...
Parallelism exploited by FTL:
- Package-level: Multiple packages on separate NAND buses (4-16 channels on high-end SSDs)
- Die-level: Multiple dies per package share a bus but can interleave operations
- Plane-level: Plane operations (multi-plane program/read) execute simultaneously within a die
- Full stripe: For a 4-channel SSD with 2 dies/channel and 2 planes/die: 4×2×2=16 parallel units → 16x throughput amplification over single NAND array
Consumer NVMe SSDs: typically 4 channels, 4-8 NAND packages, ~16-32 parallel units. Enterprise NVMe SSDs: 8-16 channels, up to 64+ parallel units.
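The full-stripe arithmetic can be written out as a short sketch. The function names are hypothetical, and the throughput figure is an upper bound that assumes every unit programs back to back and ignores channel bus and controller limits.

```python
def parallel_units(channels: int, dies_per_channel: int, planes_per_die: int) -> int:
    """Number of NAND units that can be busy at the same time."""
    return channels * dies_per_channel * planes_per_die


def peak_program_mib_s(units: int, page_kib: float, program_us: float) -> float:
    """Upper-bound program throughput: every unit writes one page per program time."""
    mib_per_round = units * page_kib / 1024
    return mib_per_round / (program_us / 1_000_000)


if __name__ == "__main__":
    units = parallel_units(channels=4, dies_per_channel=2, planes_per_die=2)
    print(units, "parallel units")
    # 16 KiB pages at ~900 µs TLC program latency (figures from this section)
    print(f"~{peak_program_mib_s(units, 16, 900):.0f} MiB/s upper bound")
```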
Flash Pages and Erase Blocks
Page (4 KB, 8 KB, or 16 KB): atomic unit of read/write. You cannot write less than one page. Writing a partial page wastes the remainder (or requires read-modify-write).
Erase block (256 KB to 4 MB): atomic unit of erase. Before a page can be written, its containing block must be erased. A block in "dirty" state (some pages valid, some invalid) cannot be partially re-used without an erase.
Why you cannot overwrite in place: A programmed page cannot be reprogrammed until its block is erased, so updating a 4K page in place would force the FTL to:
1. Read the original block's other valid pages into DRAM
2. Erase the entire block
3. Write back all valid pages plus the new page
This read-erase-rewrite cycle is write amplification at its worst. The FTL avoids it by writing the update to a fresh empty page (out-of-place write), marking the old page invalid, and deferring the erase to garbage collection time.
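The cost of the naive in-place update can be estimated with a one-line calculation. This is a sketch with a hypothetical function name; it assumes the whole block is rewritten for every update.

```python
def in_place_update_waf(block_bytes: int, page_bytes: int, pages_updated: int = 1) -> float:
    """NAND bytes written per host byte if each update rewrites the whole block."""
    host_bytes = pages_updated * page_bytes
    return block_bytes / host_bytes


if __name__ == "__main__":
    # One 4 KiB page updated inside a 1 MiB erase block
    print(in_place_update_waf(1024 * 1024, 4096))
```

With an out-of-place write the same update costs a single page program up front; the remaining cost is paid later, and in smaller amounts, by garbage collection.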
Flash Translation Layer (FTL)
The FTL is the firmware subsystem that makes NAND flash look like a block device:
Host LBA (Logical Block Address)
|
v
+-----------------------------+
| FTL: Logical-to-Physical | L2P mapping table
| Mapping Table (in DRAM) | e.g., LBA 0x1000 -> Physical: package 2,
| | die 1, plane 0, block 47, page 12
+-----------------------------+
|
v
+-----------------------------+
| Wear Leveling | Distribute writes across all blocks
| Garbage Collection | Reclaim blocks with invalid pages
| Read Disturb Management | Refresh blocks read too many times
| Bad Block Management | Skip factory-defect and worn blocks
+-----------------------------+
|
v
NAND Flash Array
Garbage Collection (GC): When free blocks run low, the FTL selects victim blocks with highest ratio of invalid (stale) pages, copies valid pages to a new block, and erases the victim. This creates write amplification even if the host isn't writing — GC reads and rewrites existing data.
GC trigger: Typically starts when free space falls below a threshold (e.g., 10% of capacity). Heavy GC degrades foreground I/O latency — this is the source of "SSD write cliff" behavior.
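The interaction between the L2P table, out-of-place writes, and greedy garbage collection can be sketched with a toy simulation. This is not how any real controller is implemented: `ToyFTL`, `measure_waf`, the geometry, and the "fewer than 2 free blocks" trigger are all hypothetical choices made to keep the example small.

```python
import random


class ToyFTL:
    """Toy page-mapped FTL: out-of-place writes plus greedy garbage collection."""

    def __init__(self, num_blocks, pages_per_block, logical_pages):
        # Keep a few blocks of spare area so GC can always find a victim.
        if logical_pages > (num_blocks - 3) * pages_per_block:
            raise ValueError("not enough spare blocks for garbage collection")
        self.ppb = pages_per_block
        self.l2p = {}  # logical page -> (block, page)
        # owner[b][p] is the logical page stored there, or None if free/invalid.
        self.owner = [[None] * pages_per_block for _ in range(num_blocks)]
        self.free_blocks = list(range(1, num_blocks))
        self.active, self.next_page = 0, 0  # block currently being filled
        self.host_writes = 0
        self.nand_writes = 0

    def _program(self, lba):
        if self.next_page == self.ppb:  # active block is full, open a fresh one
            self.active, self.next_page = self.free_blocks.pop(), 0
        old = self.l2p.get(lba)
        if old is not None:
            self.owner[old[0]][old[1]] = None  # old copy becomes invalid
        self.owner[self.active][self.next_page] = lba
        self.l2p[lba] = (self.active, self.next_page)
        self.next_page += 1
        self.nand_writes += 1

    def _valid_count(self, block):
        return sum(lba is not None for lba in self.owner[block])

    def _collect_garbage(self):
        candidates = [b for b in range(len(self.owner))
                      if b != self.active and b not in self.free_blocks]
        victim = min(candidates, key=self._valid_count)  # most invalid pages
        for lba in [x for x in self.owner[victim] if x is not None]:
            self._program(lba)  # copy still-valid data elsewhere
        self.owner[victim] = [None] * self.ppb  # erase the whole block
        self.free_blocks.append(victim)

    def write(self, lba):
        while len(self.free_blocks) < 2:  # GC trigger: free blocks running low
            self._collect_garbage()
        self.host_writes += 1
        self._program(lba)


def measure_waf(utilization, num_blocks=64, pages_per_block=32, writes=5000, seed=0):
    """Fill the drive once, then measure WAF under uniform random overwrites."""
    logical = int(num_blocks * pages_per_block * utilization)
    ftl = ToyFTL(num_blocks, pages_per_block, logical)
    for lba in range(logical):
        ftl.write(lba)
    ftl.host_writes = ftl.nand_writes = 0  # count only the overwrite phase
    rng = random.Random(seed)
    for _ in range(writes):
        ftl.write(rng.randrange(logical))
    return ftl.nand_writes / ftl.host_writes


if __name__ == "__main__":
    for u in (0.5, 0.7, 0.9):
        print(f"utilization {u:.0%}: WAF ~{measure_waf(u):.2f}")
```

Running it at several utilization levels shows the pattern described above: the fuller the drive, the more valid pages each victim block still holds, and the more copying GC has to do per reclaimed block.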
Wear Leveling
Dynamic wear leveling: New writes go to the least-worn free blocks, which spreads newly written data evenly. It does not help with "cold" data: blocks written once and never updated stay at a low erase count, while the remaining blocks absorb all subsequent P/E cycles and wear out early.
Static wear leveling: Periodically evict cold data from young blocks to old blocks, freeing young blocks for new writes. More complex, adds write amplification, but extends drive life.
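The two policies can be sketched as two small selection functions. This is a simplified illustration; the function names, the `erase_counts` dictionary, and the fixed `threshold` are hypothetical, and real firmware uses more elaborate heuristics.

```python
def pick_block_dynamic(free_blocks, erase_counts):
    """Dynamic wear leveling: write new data to the least-worn free block."""
    return min(free_blocks, key=lambda b: erase_counts[b])


def static_relocation_candidate(cold_blocks, erase_counts, threshold):
    """Static wear leveling: return a cold block worth relocating, or None.

    The least-worn cold block is chosen once it trails the most-worn block
    by at least `threshold` erases; its data would then be moved to a worn
    block so the young block can take new writes.
    """
    if not cold_blocks:
        return None
    youngest = min(cold_blocks, key=lambda b: erase_counts[b])
    if max(erase_counts.values()) - erase_counts[youngest] >= threshold:
        return youngest
    return None
```

The threshold is the knob that trades extra copying (write amplification) against evenness of wear.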
Write Amplification Factor (WAF)
WAF = (NAND bytes written) / (host bytes written)
Ideal WAF = 1.0. Real-world WAF depends on:
- Workload: Random small writes → high WAF (GC must copy many valid pages). Sequential large writes → WAF near 1.0
- Utilization: Near-full drive → more GC needed → higher WAF
- Over-provisioning: More spare area → GC can find emptier victim blocks → lower WAF
WAF formula approximation for random writes:
WAF ≈ (block size) / (average invalid pages × page size) when GC-bound
For a drive filled to 90% with 4K random writes, WAF can reach 3-10x.
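The approximation above can be evaluated directly. In the sketch below the function names are hypothetical; the second form is the same formula expressed through the fraction of still-valid pages in a typical victim block.

```python
def gc_bound_waf(pages_per_block: int, avg_invalid_pages: float) -> float:
    """WAF ~= block size / (invalid pages x page size); the page size cancels."""
    return pages_per_block / avg_invalid_pages


def waf_from_valid_fraction(valid_fraction: float) -> float:
    """Same estimate from the valid fraction of a victim block: 1 / (1 - u)."""
    return 1.0 / (1.0 - valid_fraction)


if __name__ == "__main__":
    # A 256-page victim block with 64 invalid pages: GC copies 192 pages
    # to make room for 64 new host pages.
    print(gc_bound_waf(256, 64))
    print(waf_from_valid_fraction(0.75))
```

Both calls describe the same situation: reclaiming a block that is 75% valid costs four page programs per host page.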
Measure WAF on Linux:
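# The standard NVMe SMART log reports host writes only
nvme smart-log /dev/nvme0 | grep -E 'data_units_written'
# NAND writes come from vendor-specific logs, e.g. the Intel plugin
# (field names and units vary by vendor; check the plugin documentation)
nvme intel smart-log-add /dev/nvme0 | grep -E 'nand_bytes_written|host_bytes_written'
# WAF = NAND bytes written / (data_units_written * 512,000 bytes per unit)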
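Once both counters are available, the division is simple but the units are easy to get wrong. The sketch below assumes the NAND counter has already been converted to bytes (vendor logs often report it in larger units); `waf_from_counters` is a hypothetical helper name.

```python
# NVMe defines one "data unit" as 1000 units of 512 bytes = 512,000 bytes.
NVME_DATA_UNIT_BYTES = 512 * 1000


def waf_from_counters(data_units_written: int, nand_bytes_written: int) -> float:
    """WAF from the SMART host-write counter and a NAND-write counter in bytes."""
    host_bytes = data_units_written * NVME_DATA_UNIT_BYTES
    if host_bytes == 0:
        raise ValueError("no host writes recorded yet")
    return nand_bytes_written / host_bytes


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only
    print(waf_from_counters(2_000_000, 2_048_000_000_000))
```

Taking the difference of both counters between two points in time gives the WAF of the workload in that interval rather than the lifetime average.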
Over-Provisioning (OP)
Over-provisioning is NAND capacity reserved by the SSD controller and not exposed to the host. It serves as:
- GC staging area (clean blocks for writing during GC)
- Bad block pool (replacement for worn/defective blocks)
- WAF reduction buffer
Standard OP levels:
- Consumer drives: ~7% (e.g., 240 GB drive with 256 GB NAND)
- Prosumer/high-endurance: 28% (e.g., 800 GB drive with 1,024 GB NAND)
- Enterprise: 28-100% (write-intensive models)
Additional host-level OP: Creating a smaller partition than full drive size adds effective OP. For a 1 TB NVMe, partitioning 900 GB and leaving 100 GB unpartitioned adds ~11% OP, reducing WAF and improving endurance.
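Over-provisioning is conventionally expressed relative to the user-visible capacity, which is where the ~11% figure comes from. A minimal sketch, with a hypothetical function name:

```python
def over_provisioning_pct(physical_gb: float, user_gb: float) -> float:
    """OP as a percentage of user-visible capacity."""
    return (physical_gb - user_gb) / user_gb * 100


if __name__ == "__main__":
    # 1 TB drive partitioned to 900 GB (host-level OP only)
    print(f"{over_provisioning_pct(1000, 900):.1f}%")
    # 240 GB consumer drive built from 256 GB of NAND
    print(f"{over_provisioning_pct(256, 240):.1f}%")
```

The host-level figure only holds if the unpartitioned region has never been written or has been trimmed; otherwise the drive still treats those pages as valid data.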
Endurance: TBW and DWPD
TBW (Terabytes Written): Total host writes the vendor rates the drive for over its warranty period, derived from NAND endurance under an assumed workload and WAF.
- Consumer 1 TB TLC NVMe (e.g., Samsung 990 Pro): ~600 TBW
- Enterprise 1 TB drive rated at about 2 DWPD over 5 years: ~3600 TBW
DWPD (Drive Writes Per Day): TBW / (drive capacity × warranty years × 365)
- Consumer: ~0.3 DWPD (write 300 GB/day over 5 years on 1 TB drive)
- Mixed-use enterprise: ~3 DWPD
- Write-intensive enterprise: ~10 DWPD
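The DWPD formula and its inverse can be checked with a few lines. The function names are hypothetical; capacities are in TB and the warranty period in years.

```python
def dwpd(tbw: float, capacity_tb: float, warranty_years: float) -> float:
    """Drive writes per day implied by a TBW rating."""
    return tbw / (capacity_tb * warranty_years * 365)


def tbw_from_dwpd(dwpd_rating: float, capacity_tb: float, warranty_years: float) -> float:
    """TBW implied by a DWPD rating."""
    return dwpd_rating * capacity_tb * warranty_years * 365


if __name__ == "__main__":
    print(f"600 TBW, 1 TB, 5 years: {dwpd(600, 1, 5):.2f} DWPD")
    print(f"10 DWPD, 1 TB, 5 years: {tbw_from_dwpd(10, 1, 5):.0f} TBW")
```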
Read/Write/Erase Latency Summary
| Cell Type | Read Latency | Program Latency | Erase Latency | P/E Endurance |
|---|---|---|---|---|
| SLC | ~25 µs | ~200 µs | ~2 ms | 50K-100K |
| MLC | ~50 µs | ~600 µs | ~3 ms | 3K-10K |
| TLC | ~75 µs | ~900 µs | ~5 ms | 1K-3K |
| QLC | ~100 µs | ~1500 µs | ~10 ms | 100-1K |
Note: Many consumer SSDs use an SLC write cache (pSLC — pseudo-SLC mode where TLC cells are temporarily programmed as SLC for speed). Until the cache fills, writes land at SLC speeds; after the cache is exhausted, writes go directly to TLC at 3-10x lower throughput. This is why SSD benchmarks show an initial burst followed by a sustained throughput cliff.
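The burst-then-cliff behavior can be approximated with a two-speed model. This is a rough sketch that assumes the cache is not drained while the write is in progress; the function names and the example figures are hypothetical.

```python
def write_seconds(total_gb: float, cache_gb: float,
                  cache_mb_s: float, native_mb_s: float) -> float:
    """Time for one sustained write: cache-speed first, native speed after."""
    fast_gb = min(total_gb, cache_gb)
    slow_gb = total_gb - fast_gb
    return fast_gb * 1000 / cache_mb_s + slow_gb * 1000 / native_mb_s


def average_mb_s(total_gb: float, cache_gb: float,
                 cache_mb_s: float, native_mb_s: float) -> float:
    """Average throughput over the whole write."""
    return total_gb * 1000 / write_seconds(total_gb, cache_gb, cache_mb_s, native_mb_s)


if __name__ == "__main__":
    # Hypothetical drive: 20 GB cache at 1000 MB/s, 250 MB/s once exhausted
    for size in (10, 40, 200):
        print(f"{size} GB write: ~{average_mb_s(size, 20, 1000, 250):.0f} MB/s average")
```

Short benchmark runs stay inside the cache and report the burst speed, which is why sustained-write tests need to write well beyond the cache size.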
Historical Context
Flash memory was invented by Fujio Masuoka at Toshiba in 1984. The first NAND flash (as distinct from NOR flash — used for firmware/bootloaders) was developed by Toshiba in 1987. SSDs appeared in enterprise storage in the early 2000s at extreme cost. The first consumer SSD to gain significant traction was the Intel X25-M (2008) — 80 GB, MLC, SATA, $595, 250 MB/s sequential read, 70 MB/s write — which embarrassed contemporary HDDs on random I/O.
The transition from 2D planar NAND to 3D NAND (Samsung V-NAND in 2013) was the pivotal technology shift that enabled continued density scaling without cell shrink. SSD adoption crossed 50% of PC shipments around 2018.
Production Examples
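Samsung SLC cache sizes: Samsung 870 EVO (SATA, TLC) uses TurboWrite, an SLC cache whose size scales with drive capacity. Samsung's published figures list a 42 GB region for the 1 TB model (6 GB static plus up to 36 GB dynamic) and 12 GB for the 250 GB model. Once a sustained sequential write exceeds the cache, the smaller models fall from ~530 MB/s to ~300 MB/s TLC write speed, while larger models can sustain close to the SATA limit. Relevant for large backup jobs.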
Cloudflare's NVMe fleet: Cloudflare moved their edge servers to NVMe SSDs for cache storage. They observed WAF of ~1.5-2.5 for their workload mix (mostly reads with periodic evictions), and extended SSD life by setting 10% over-provisioning via namespace size reduction.
Database WAL on SSD: PostgreSQL's WAL is sequential writes. WAL on a separate SSD with OP runs at near-ideal WAF (~1.0-1.2). Mixing WAL and data on one SSD with heavy random reads/writes increases WAF to 3-5x, shortening drive life.
Debugging Notes
# NVMe SMART log — comprehensive health data
nvme smart-log /dev/nvme0
# Key fields:
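# - percentage_used: vendor estimate of rated endurance consumed; 100% = rated
#   endurance reached (the value may exceed 100, up to 255)
# - available_spare: % of reserved spare blocks remaining
# - data_units_read/written: in units of 1000 × 512 bytes (512,000 bytes)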
# - media_errors: unrecoverable NAND errors (should be 0)
# - num_err_log_entries: error log count
# Get NVMe error log
nvme error-log /dev/nvme0
# Identify if SSD is using SLC write cache
# WARNING: destructive, overwrites the entire device; run only on a drive with no data you need
fio --filename=/dev/nvme0n1 --direct=1 --rw=write --bs=128k \
--ioengine=libaio --iodepth=32 --name=filltest --size=100%
# Watch for throughput drop — indicates SLC cache exhaustion
# Check for SSD over temperature (>70°C is concerning)
nvme smart-log /dev/nvme0 | grep temperature
# SATA SSD SMART via smartctl
smartctl -a /dev/sda
# Look for: Reallocated_Sector_Ct, Wear_Leveling_Count,
# Reported_Uncorrect, Host_Writes_32MiB
# Linux SSD flush behavior
cat /sys/block/nvme0n1/queue/write_cache
# "write back" = volatile write cache (may lose data on power loss)
# "write through" = no volatile write cache
Security Implications
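SSD secure erase complexity: On an HDD, overwriting a sector destroys the old data in place. On an SSD, an overwrite lands on a different physical page and TRIM only marks pages as invalid, so the NAND cells retain their charge until the block is erased. Secure erasure options: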
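1. NVMe Format or Sanitize: nvme format /dev/nvme0 --ses=1 performs a user-data erase and --ses=2 a cryptographic erase; the separate nvme sanitize command (block erase, overwrite, or crypto erase) is a different command from Format. Duration ranges from seconds (crypto erase) to minutes or longer (overwrite).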
2. ATA Secure Erase (SATA): hdparm --security-erase (after setting a temporary password with --security-set-pass) triggers drive firmware to erase all blocks including OP area.
3. Encryption-then-discard: If drive uses hardware encryption (TCG Opal), discarding the encryption key renders data unrecoverable instantly.
4. Physical destruction: Only guaranteed method for highly classified data.
FTL mapping table confidentiality: FTL maps are stored in DRAM and periodically checkpointed to NAND. After a "secure delete" of a file, the LBA→PBA mapping for those blocks is removed from the FTL table, but NAND cells are not immediately erased. A determined attacker with direct NAND access (chip-off forensics) could recover data until GC erases those physical blocks. For sensitive data, use block-level (full-disk) encryption such as LUKS/dm-crypt, so that only ciphertext ever reaches the NAND.
Power loss data integrity: Drives without power-loss protection (PLP capacitors) may lose up to 20 seconds of write-back cache data on sudden power loss. Consumer drives often lack PLP. Enterprise drives (Intel P4510, Micron 7450) include supercapacitors to flush DRAM to NAND on power loss.
Performance Implications
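Queue depth matters enormously for NAND: A single NAND die takes tens of microseconds per read and hundreds of microseconds to over a millisecond per program, but an SSD contains many dies that can work concurrently. With one outstanding request only one die is busy at a time; with many outstanding requests the controller spreads them across channels, dies, and planes. Illustrative figures for 4K random reads on NVMe: ~40K IOPS at queue depth 1 versus ~500K IOPS at queue depth 32.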
# Benchmark at different queue depths
fio --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k \
--ioengine=libaio --iodepth=1 --runtime=30 --name=qd1
fio --filename=/dev/nvme0n1 --direct=1 --rw=randread --bs=4k \
--ioengine=libaio --iodepth=32 --runtime=30 --name=qd32
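The relationship between queue depth, latency, and IOPS in results like these follows Little's Law: in steady state, IOPS ≈ outstanding requests / average latency. A small sketch with hypothetical function names:

```python
def iops(queue_depth: int, avg_latency_us: float) -> float:
    """Little's Law: throughput = requests in flight / time per request."""
    return queue_depth * 1_000_000 / avg_latency_us


def avg_latency_us(queue_depth: int, measured_iops: float) -> float:
    """Average per-request latency implied by a measured IOPS figure."""
    return queue_depth * 1_000_000 / measured_iops


if __name__ == "__main__":
    # The illustrative figures from this section
    print(f"QD1 at 40K IOPS   -> {avg_latency_us(1, 40_000):.0f} µs per request")
    print(f"QD32 at 500K IOPS -> {avg_latency_us(32, 500_000):.0f} µs per request")
```

Going from queue depth 1 to 32 raises per-request latency only modestly while throughput grows more than tenfold, which is the signature of internal parallelism being put to use.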
Write cliff detection: Monitor write throughput over time during sustained writes. A sudden drop of 50-80% indicates SLC cache exhaustion. Relevant for: OS backup tools, large database imports, video recording.
Read disturb: Reading a page applies a pass voltage to the unselected word lines in the same block, which weakly programs those neighboring cells and gradually shifts their threshold voltage. After on the order of 100K reads of the same block without an erase, bit errors increase. The FTL tracks per-block read counts and proactively moves data (read-disturb refresh). On read-heavy workloads such as CDN caches, read disturb is a real wear mechanism.
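Per-block read counting can be sketched as a small tracker. This is an illustration only; `ReadDisturbTracker` and its default threshold are hypothetical, and real firmware combines read counts with error-rate monitoring.

```python
from collections import defaultdict


class ReadDisturbTracker:
    """Counts reads per block since its last erase and flags refresh candidates."""

    def __init__(self, refresh_threshold: int = 100_000):
        self.refresh_threshold = refresh_threshold
        self.reads_since_erase = defaultdict(int)

    def record_read(self, block: int) -> bool:
        """Count one page read in `block`; return True when a refresh is due."""
        self.reads_since_erase[block] += 1
        return self.reads_since_erase[block] >= self.refresh_threshold

    def record_erase(self, block: int) -> None:
        """An erase resets the accumulated disturb for the block."""
        self.reads_since_erase[block] = 0
```

When `record_read` returns True, the FTL would copy the block's valid pages elsewhere and erase it, which is why a purely read-only workload still generates some NAND writes.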
Failure Modes and Real Incidents
SSD write cliff in cloud (2019, various providers): Multiple reports of application servers with SSDs provisioned at 90%+ capacity experiencing sudden I/O latency spikes (5-50x degradation) under sustained writes. Root cause: GC pressure when OP area exhausted. Mitigation: provision SSDs at 70-80% max capacity.
Samsung 840 EVO read-performance degradation (2014): Data in TLC cells that had been written and then left unmodified for months experienced threshold voltage drift, causing read errors that required multiple retries and sharply reduced read throughput on old data. Samsung issued firmware updates that performed background scanning to refresh vulnerable cells. Affected potentially millions of drives. One of the first widely publicized cases of long-term data retention problems in consumer TLC drives.
Micron M600 power loss data corruption: A specific interaction between the M600 SSD's power-loss handling and Linux's write-back cache caused silent data corruption when power was cut during writes. Affected certain Linux kernel versions (3.x era) that used specific I/O ordering. Fixed via kernel patches and drive firmware update.
Intel 320 Series 8MB bug (2011): A firmware bug in Intel 320 Series SSDs caused the drive to report only 8 MB of capacity after an unclean shutdown (power loss). Required firmware recovery tool. Affected enterprise users relying on consumer SSDs.
Modern Usage
- QLC for cold-tier SSDs: QLC drives (Samsung 870 QVO, Micron 5210 ION) used as HDD replacement in cold-access applications where sequential read performance matters but write endurance does not (read-mostly cold storage racks).
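- ZNS (Zoned Namespace) NVMe: Exposes NAND's zone-sequential-write semantics to the host, so the drive no longer runs its own FTL garbage collection and the host decides when zones are reset. Shipped in drives such as the Western Digital Ultrastar DC ZN540 and Samsung PM1731a for hyperscalers; RocksDB supports ZNS through the ZenFS plugin.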
- Open-Channel SSDs: Host controls NAND block management completely (FTL in software). Used in Facebook's Hyper-100 NVMe SSDs (CNEX Labs partnership) for custom FTL tuning. Provides lowest WAF for specific workloads.
- 3D NAND scaling: 200+ layer stacking (Micron 232-layer, Samsung 200+ layer). Diminishing returns — taller stacks increase program/erase latency due to longer NAND strings.
Future Directions
- CXL-attached NAND: NAND behind CXL interfaces, enabling byte-addressability and coherent memory semantics for flash
- Computational storage drives (CSD): FPGA or RISC-V cores within the SSD execute computation (filtering, compression, encryption) in the drive, reducing PCIe bandwidth pressure
- ReRAM / PCM / MRAM: Alternative non-volatile memory technologies with better endurance than NAND. PCM (Phase Change Memory) used in Optane. ReRAM under development at multiple vendors. None have achieved NAND cost/density parity yet
- PLC NAND: Penta-level cell (5 bits) being developed for archive applications where endurance is less important than density
Exercises
- Use nvme smart-log /dev/nvme0 to find your SSD's percentage_used and available_spare. Calculate how many terabytes have been written to the drive vs. its rated TBW to estimate remaining life.
- Write a fio script that writes sequentially until the SLC write cache is exhausted (visible as a throughput drop). What is the SLC cache size for your drive? How does it compare to the manufacturer's spec?
- Implement a simple log-structured file layout (append-only) in a script or program. Measure WAF by tracking bytes written to underlying storage vs. application logical bytes. Compare to an in-place update pattern.
- Examine the kernel's NVMe driver at drivers/nvme/host/core.c. Find where the NVM Express 1.4 spec command set commands are issued. Trace the code path from a bio submission to the NVMe doorbell write.
- Estimate the TBW you generate daily on a development machine by running nvme smart-log /dev/nvme0 before and after a full workday and recording the data_units_written delta. Extrapolate to years and compare against the drive's rated TBW.
References
- Cornwell, M. "Anatomy of a Solid-State Drive." ACM Queue, 2012.
- Grupp, L. et al. "Characterizing Flash Memory: Anomalies, Observations, and Applications." IEEE/ACM MICRO 2009.
- Agrawal, N. et al. "Design Tradeoffs for SSD Performance." USENIX ATC 2008.
- Kim, H. et al. "Revisiting Storage for Smartphones." USENIX FAST 2012.
- JEDEC JESD218B: Solid-State Drive (SSD) Requirements and Endurance Test Method
- NVM Express Base Specification 2.0: https://nvmexpress.org/specifications/
- Linux NVMe driver: drivers/nvme/ in kernel source
- Gregg, B. Systems Performance, 2nd ed., Chapter 9 (Disks)
- Samsung SSD 840 EVO performance fix: https://www.anandtech.com/show/8550/