04 - NVMe
Technical Overview
NVMe (Non-Volatile Memory Express) is a host controller interface specification designed from the ground up for flash-based storage. It replaces AHCI (Advanced Host Controller Interface), which was originally designed in 2004 for SATA HDDs. The fundamental problem NVMe solves: AHCI has one command queue with 32 slots — barely enough for HDD NCQ, completely inadequate for SSDs capable of 1 million IOPS. NVMe provides up to 65,535 queues with 65,535 commands each, removing the queuing bottleneck entirely and exploiting PCIe's full bandwidth.
NVMe is not merely a faster cable — it is a reimagined command protocol with dramatically reduced software overhead (20 µs vs 70 µs for AHCI), exposing the raw capability of modern NAND flash arrays.
Prerequisites
- PCIe architecture basics (lanes, generations, bandwidth)
- SSD internals (NAND parallelism, FTL) — see 03-ssd-internals.md
- Linux block layer concepts — see 05-linux-block-layer.md
- I/O queue concepts (producer/consumer queues)
Core Content
Why NVMe Over AHCI
AHCI (Advanced Host Controller Interface), standardized in 2004, was designed for SATA HDDs. Its constraints:
AHCI Model:
+--------+      1 Queue       +----------------+
|  CPU   | -----------------> |   HBA (AHCI)   |
|        | <----------------- |  32 slots max  |
+--------+                    +----------------+
                                      |
                                One interrupt
                                per I/O batch
Limitations when applied to NVMe-class SSDs:
- Single queue: Multiple CPUs must serialize through one queue, which causes lock contention.
- 32 command slots: By Little's law, throughput is bounded by the number of in-flight commands divided by per-command latency. With at most 32 commands in flight and ~70 µs per command, AHCI cannot exceed 32 / 70 µs ≈ 450K IOPS. An SSD capable of 1M IOPS at ~100 µs device latency needs roughly 100 commands in flight to stay saturated.
- Interrupt handling overhead: AHCI uses MSI/legacy interrupts with significant per-I/O software overhead (~70 µs vs NVMe's ~20 µs).
- No multi-core awareness: The queue sits in host memory at a fixed location, with no per-CPU queue affinity.
NVMe solution:
- Up to 65,535 I/O queue pairs per controller
- Per-CPU submission queues, so no cross-CPU locking
- MSI-X interrupts: one interrupt vector per completion queue, pinned to specific CPUs
- NVMe command overhead: ~20 µs (vs AHCI ~70 µs)
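The slot-count limit is an instance of Little's law (throughput = concurrency / latency). A minimal sketch, using the latency figures quoted in this section as inputs; the function names are illustrative and not part of any NVMe tooling:

```python
def max_iops(in_flight: int, latency_us: float) -> float:
    """Little's law: sustained throughput = commands in flight / per-command latency."""
    return in_flight / (latency_us * 1e-6)


def required_in_flight(target_iops: float, latency_us: float) -> float:
    """Commands that must be outstanding to sustain target_iops at the given latency."""
    return target_iops * latency_us * 1e-6


if __name__ == "__main__":
    # 32 AHCI slots at ~70 us per command
    print(f"AHCI ceiling: {max_iops(32, 70):,.0f} IOPS")
    # Queue depth needed for 1M IOPS at ~100 us device latency
    print(f"In-flight needed: {required_in_flight(1e6, 100):.0f}")
```

The same two functions can be used to sanity-check any queue-depth setting against a measured latency.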
PCIe Bandwidth at Each Generation
NVMe rides on PCIe lanes. Bandwidth per lane per direction:
| PCIe Gen | Per-lane BW | x4 BW | x8 BW | x16 BW |
|---|---|---|---|---|
| Gen 3 | 985 MB/s | 3.9 GB/s | 7.9 GB/s | 15.8 GB/s |
| Gen 4 | 1.97 GB/s | 7.9 GB/s | 15.8 GB/s | 31.5 GB/s |
| Gen 5 | 3.94 GB/s | 15.8 GB/s | 31.5 GB/s | 63 GB/s |
Consumer NVMe drives use an x4 PCIe interface:
- Gen 3 x4: ~3.5 GB/s peak (Samsung 970 EVO Plus: 3.5 GB/s seq read)
- Gen 4 x4: ~7 GB/s peak (Samsung 990 Pro: 7.45 GB/s seq read)
- Gen 5 x4: ~14 GB/s peak (Crucial T705: 14.5 GB/s seq read, 2024)
Enterprise SSDs: x4 or x8. U.2 drives often x4 (2.5" form factor). EDSFF E3.S drives up to x8 Gen 5 = 31 GB/s.
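The table values follow from the per-lane signaling rate and the 128b/130b line encoding used by PCIe Gen 3 through Gen 5. A small sketch that reproduces them; it ignores packet and protocol overhead, which is why real drives land below these numbers:

```python
# Raw signaling rate per lane in GT/s for the generations in the table.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}

# Gen 3-5 use 128b/130b encoding: 128 payload bits per 130 bits on the wire.
ENCODING_EFFICIENCY = 128 / 130


def lane_gb_per_s(gen: int) -> float:
    """Usable bandwidth of one lane, one direction, in GB/s."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY / 8  # 8 bits per byte


def link_gb_per_s(gen: int, lanes: int) -> float:
    """Usable bandwidth of a link with the given lane count, one direction."""
    return lane_gb_per_s(gen) * lanes


if __name__ == "__main__":
    for gen in sorted(GT_PER_S):
        row = ", ".join(f"x{n}: {link_gb_per_s(gen, n):.1f} GB/s" for n in (1, 4, 8, 16))
        print(f"Gen {gen}: {row}")
```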
NVMe Queue Architecture
Host (CPU side):
+--------------------------------------------------+
|                                                  |
|   CPU 0            CPU 1            CPU 2        |
|     |                |                |          |
|     v                v                v          |
|  SQ #1 (4KB)      SQ #2 (4KB)      SQ #3 (4KB)   |
|  (Submission      (Submission      (Submission   |
|   Queue in         Queue in         Queue in     |
|   host DRAM)       host DRAM)       host DRAM)   |
|                                                  |
|  CQ #1 (4KB)      CQ #2 (4KB)      CQ #3 (4KB)   |
|  (Completion      (Completion      (Completion   |
|   Queue in         Queue in         Queue in     |
|   host DRAM)       host DRAM)       host DRAM)   |
+--------------------------------------------------+
                      PCIe Bus
NVMe Controller (SSD side):
+--------------------------------------------------+
|                                                  |
|  Admin SQ/CQ (setup, identify, create queues)    |
|                                                  |
|  I/O Queue Pairs mapped 1:1 to host queues       |
|                                                  |
|  DMA Engine reads commands from host SQ          |
|  DMA Engine writes completions to host CQ        |
|                                                  |
|  NVMe Controller Internals (FTL, NAND arrays)    |
+--------------------------------------------------+
Doorbell Mechanism:
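1. CPU writes a 64-byte NVMe command to SQ[tail]
2. CPU advances its tail index and writes the new value to the SQ Tail Doorbell register (one MMIO write)
3. Controller DMAs the command from host memory and executes the I/O
4. Controller writes a 16-byte completion entry to the next free CQ slot (the CQ tail, which the controller owns)
5. Controller raises the MSI-X interrupt assigned to that completion queue
6. CPU reads the completion, advances its head index, and writes the new value to the CQ Head Doorbell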
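The six steps can be modeled as two ring buffers with producer and consumer indices. This is a toy, single-threaded simulation for intuition only, not driver code. The class and method names are hypothetical, doorbells are plain attributes rather than MMIO registers, and the host peeks at the controller's CQ tail, which real hardware does not expose (the phase bit serves that purpose).

```python
class QueuePair:
    """Toy model of one NVMe submission/completion queue pair."""

    def __init__(self, depth: int):
        self.depth = depth
        self.sq = [None] * depth
        self.cq = [None] * depth
        self.sq_tail = 0           # host-owned producer index
        self.sq_head = 0           # controller-owned consumer index
        self.cq_tail = 0           # controller-owned producer index
        self.cq_head = 0           # host-owned consumer index
        self.sq_tail_doorbell = 0  # stands in for the MMIO doorbell registers
        self.cq_head_doorbell = 0

    def submit(self, cmd: dict) -> None:
        """Steps 1-2: place the command at SQ[tail], then ring the tail doorbell."""
        next_tail = (self.sq_tail + 1) % self.depth
        if next_tail == self.sq_head:  # one slot stays empty to tell full from empty
            raise BufferError("submission queue full")
        self.sq[self.sq_tail] = cmd
        self.sq_tail = next_tail
        self.sq_tail_doorbell = self.sq_tail

    def controller_process(self) -> int:
        """Steps 3-4: consume commands up to the doorbell value, post completions."""
        processed = 0
        while self.sq_head != self.sq_tail_doorbell:
            if (self.cq_tail + 1) % self.depth == self.cq_head:
                break  # completion queue full; wait for the host to reap
            cmd = self.sq[self.sq_head]
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq[self.cq_tail] = {"cid": cmd["cid"], "status": 0}
            self.cq_tail = (self.cq_tail + 1) % self.depth
            processed += 1
        return processed

    def reap(self) -> list:
        """Step 6: read new completions, then ring the CQ head doorbell."""
        done = []
        while self.cq_head != self.cq_tail:  # simplification: see the lead-in
            done.append(self.cq[self.cq_head])
            self.cq_head = (self.cq_head + 1) % self.depth
        self.cq_head_doorbell = self.cq_head
        return done
```

Because every CPU owns its own `QueuePair`, none of these indices needs a lock shared with another CPU, which is the point of the per-CPU queue design.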
Command structure (64 bytes):
- Opcode (1 byte): Read (0x02), Write (0x01), Flush (0x00)
- Namespace ID (4 bytes): which namespace to address
- PRPs or SGLs: Physical Region Pages or Scatter-Gather Lists pointing to host DRAM buffers
- Starting LBA, length
- Command-specific fields (flags, dataset management hints)
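To make the 64-byte layout concrete, the sketch below packs a Read command with Python's `struct` module. The field order follows the common submission queue entry layout: Dword 0 holds opcode, flags and command ID, followed by NSID, two reserved Dwords, a metadata pointer, PRP1, PRP2, then Dwords 10-15. The helper name `build_read_command` is hypothetical, and the buffer address is an arbitrary example value.

```python
import struct

NVME_CMD_FLUSH = 0x00
NVME_CMD_WRITE = 0x01
NVME_CMD_READ = 0x02

# Little-endian, no padding:
#   opcode(1) flags(1) cid(2) nsid(4) cdw2-3(8) mptr(8) prp1(8) prp2(8)
#   slba(8, i.e. CDW10-11) cdw12(4) cdw13(4) cdw14(4) cdw15(4)  = 64 bytes
_SQE = struct.Struct("<BBHIQQQQQIIII")


def build_read_command(cid: int, nsid: int, prp1: int, slba: int,
                       num_blocks: int, prp2: int = 0) -> bytes:
    """Pack a 64-byte NVM Read command (illustrative helper, not a driver API)."""
    if not 1 <= num_blocks <= 0x10000:
        raise ValueError("num_blocks must be in 1..65536")
    cdw12 = num_blocks - 1  # the block count field is zero-based
    return _SQE.pack(NVME_CMD_READ, 0, cid, nsid, 0, 0, prp1, prp2,
                     slba, cdw12, 0, 0, 0)


if __name__ == "__main__":
    cmd = build_read_command(cid=7, nsid=1, prp1=0x1000, slba=2048, num_blocks=8)
    print(len(cmd), cmd.hex())
```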
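Completion entry (16 bytes):
- Command-specific result (Dword 0; its meaning depends on the command)
- SQ head pointer and SQ identifier (tell the host how far the controller has consumed which submission queue)
- Command identifier (echoes back the command ID for matching)
- Status field (success or error code)
- Phase bit (the controller inverts it on each wrap-around of the CQ, so the host can detect new completions without clearing memory)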
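The phase bit is easiest to understand by decoding a few entries. The sketch below assumes the usual 16-byte layout (Dword 0 result, a reserved Dword, SQ head and SQ ID, then command ID with the phase bit in bit 0 of the final 16-bit status word). `make_cqe` plays the controller's role for the demo; all three function names are hypothetical.

```python
import struct

# result(4) reserved(4) sq_head(2) sq_id(2) cid(2) status_and_phase(2) = 16 bytes
_CQE = struct.Struct("<IIHHHH")


def make_cqe(cid: int, phase: int, status: int = 0, sq_id: int = 1, sq_head: int = 0) -> bytes:
    """Controller side (demo only): build one completion entry."""
    return _CQE.pack(0, 0, sq_head, sq_id, cid, (status << 1) | (phase & 1))


def parse_completion(raw: bytes) -> dict:
    """Host side: split a 16-byte completion entry into its fields."""
    result, _reserved, sq_head, sq_id, cid, status_phase = _CQE.unpack(raw)
    return {"result": result, "sq_head": sq_head, "sq_id": sq_id, "cid": cid,
            "phase": status_phase & 1, "status": status_phase >> 1}


def reap_new(entries: list, head: int, expected_phase: int):
    """Consume entries whose phase matches; return (cids, new_head, new_expected_phase)."""
    depth = len(entries)
    cids = []
    for _ in range(depth):  # never scan more than one full lap
        cqe = parse_completion(entries[head])
        if cqe["phase"] != expected_phase:
            break  # stale entry from the previous lap: nothing new here
        cids.append(cqe["cid"])
        head += 1
        if head == depth:  # wrapped: the controller flips the phase on its next lap
            head = 0
            expected_phase ^= 1
    return cids, head, expected_phase
```

The queue memory starts zeroed, so the host expects phase 1 on the first lap, phase 0 on the second, and so on; it never has to clear an entry after reading it.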
NVMe Namespaces
A namespace is a logical partition of NVMe storage, analogous to a SCSI LUN. One NVMe controller can present many namespaces: the NSID field is 32 bits wide, and the number a given controller actually supports is reported in its Identify Controller data. Each namespace has:
- Namespace ID (NSID): 1-indexed
- Total capacity in LBAs
- Independent LBA format (sector size)
- Optional: shared access from multiple hosts
# List namespaces on an NVMe device
nvme list-ns /dev/nvme0
nvme id-ns /dev/nvme0n1 # identify namespace
nvme id-ctrl /dev/nvme0 # identify controller (firmware, model, capabilities)
Namespace use cases:
- Separate namespaces for different workloads (database data vs WAL vs temp)
- Multi-host shared namespaces (NVMe SR-IOV or Namespace Sharing for clustered storage)
- Namespace-level secure erase without affecting other namespaces
NVMe-oF (NVMe over Fabrics)
NVMe-oF extends the NVMe protocol over a network fabric, allowing NVMe namespaces to be accessed remotely with latency approaching local NVMe:
NVMe-oF Architecture:
Host (Initiator)                  Target (NVMe Controller)
+------------------+    Fabric    +----------------------+
| NVMe Driver      | <----------> | NVMe-oF Target       |
| (fabric client)  |              | Subsystem            |
|                  |    RDMA/     | (nvmetd or kernel    |
| /dev/nvme1n1     |    TCP/FC    | nvmet module)        |
+------------------+              +----------------------+
                                             |
                                      Local NVMe SSDs
Transport options:
- RDMA (InfiniBand / RoCE v2 / iWARP): Lowest latency, ~10-15 µs fabric RTT. Requires RDMA-capable NICs. Used for disaggregated storage where latency matters most.
- TCP: Runs over ordinary TCP/IP with no RDMA hardware, ~100 µs RTT, universal NIC support. Supported by the Linux kernel since 5.0. Standard for general-purpose shared NVMe.
- Fibre Channel (FC): Legacy SAN integration. Used in enterprises with an existing FC fabric.
NVMe-oF latency vs local NVMe:
- Local NVMe (PCIe): ~100 µs read latency
- NVMe-oF over RDMA: ~130-200 µs (adds ~30-100 µs for the fabric round trip plus target-side processing)
- NVMe-oF over TCP (optimized): ~200-500 µs
- iSCSI (for comparison): ~500 µs - 2 ms
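One way to read these latencies: at queue depth 1 a single thread can issue at most one I/O per round trip, so IOPS = 1 / latency. A short sketch using the ranges from the list as inputs; the dictionary and function names are illustrative:

```python
# (low, high) read latency in microseconds, taken from the list in this section
LATENCY_US = {
    "local NVMe": (100, 100),
    "NVMe-oF RDMA": (130, 200),
    "NVMe-oF TCP": (200, 500),
    "iSCSI": (500, 2000),
}


def qd1_iops_range(low_us: float, high_us: float) -> tuple:
    """Per-thread IOPS at queue depth 1 as (worst case, best case)."""
    return (1e6 / high_us, 1e6 / low_us)


if __name__ == "__main__":
    for name, (low, high) in LATENCY_US.items():
        worst, best = qd1_iops_range(low, high)
        print(f"{name:13s} {worst:8,.0f} - {best:8,.0f} IOPS per thread at QD1")
```

Higher queue depths hide this latency, which is why remote NVMe can still reach high aggregate throughput even though each individual I/O is slower.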
# Configure NVMe-oF TCP target (kernel nvmet)
modprobe nvmet
modprobe nvmet-tcp
# Create target subsystem
mkdir /sys/kernel/config/nvmet/subsystems/my-nvme-target
echo 1 > /sys/kernel/config/nvmet/subsystems/my-nvme-target/attr_allow_any_host
# Add a namespace
mkdir /sys/kernel/config/nvmet/subsystems/my-nvme-target/namespaces/1
echo /dev/nvme0n1 > \
/sys/kernel/config/nvmet/subsystems/my-nvme-target/namespaces/1/device_path
echo 1 > /sys/kernel/config/nvmet/subsystems/my-nvme-target/namespaces/1/enable
# Create TCP port
mkdir /sys/kernel/config/nvmet/ports/1
echo "tcp" > /sys/kernel/config/nvmet/ports/1/addr_trtype
echo "192.168.1.100" > /sys/kernel/config/nvmet/ports/1/addr_traddr
echo "4420" > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
echo "ipv4" > /sys/kernel/config/nvmet/ports/1/addr_adrfam
# Export the subsystem on the port (without this link the target is not reachable)
ln -s /sys/kernel/config/nvmet/subsystems/my-nvme-target \
/sys/kernel/config/nvmet/ports/1/subsystems/my-nvme-target
# Connect from initiator
modprobe nvme-tcp
nvme connect -t tcp -a 192.168.1.100 -s 4420 -n my-nvme-target
nvme list # shows the remote namespace as /dev/nvme1n1
Linux NVMe Driver
The Linux NVMe driver lives at drivers/nvme/:
drivers/nvme/
├── host/
│ ├── core.c # Core NVMe host driver — queue management, I/O submission
│ ├── pci.c # PCIe transport (local NVMe)
│ ├── tcp.c # NVMe-oF TCP transport
│ ├── rdma.c # NVMe-oF RDMA transport
│ ├── fc.c # NVMe-oF Fibre Channel
│ ├── fabrics.c # Common NVMe-oF framework
│ └── nvme.h # Core data structures
└── target/
├── core.c # NVMe target framework
├── tcp.c # TCP target transport
└── ...
Key data structures (simplified from drivers/nvme/host/pci.c; most fields omitted):
struct nvme_dev {                 // per-PCIe device
    struct nvme_queue *queues;    // array of queue pairs
    struct nvme_ctrl ctrl;        // transport-independent controller state
};
struct nvme_queue {               // one submission + completion queue pair
    dma_addr_t sq_dma_addr;       // host DRAM address of SQ (DMA-mapped)
    dma_addr_t cq_dma_addr;       // host DRAM address of CQ
    u32 __iomem *q_db;            // doorbell register address (MMIO)
    u16 sq_tail;                  // current submission queue tail
    u16 cq_head;                  // current completion queue head
};
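The `q_db` pointer refers to a fixed location in the controller's register BAR. As a sketch of where it points, the doorbell offsets can be computed from the queue ID and the doorbell stride (DSTRD) advertised in the controller's CAP register, assuming the standard layout in which doorbells start at offset 0x1000. The function name is illustrative.

```python
DOORBELL_BASE = 0x1000  # first doorbell register, relative to the start of BAR0


def doorbell_offset(qid: int, is_completion: bool, dstrd: int = 0) -> int:
    """BAR0 offset of a queue's doorbell.

    qid 0 is the admin queue. Each queue has an SQ tail doorbell immediately
    followed by a CQ head doorbell; the spacing is 4 << DSTRD bytes.
    """
    stride = 4 << dstrd
    index = 2 * qid + (1 if is_completion else 0)
    return DOORBELL_BASE + index * stride


if __name__ == "__main__":
    for qid in range(3):
        sq = doorbell_offset(qid, False)
        cq = doorbell_offset(qid, True)
        print(f"queue {qid}: SQ tail doorbell 0x{sq:04x}, CQ head doorbell 0x{cq:04x}")
```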
ZNS (Zoned Namespaces)
ZNS is an NVMe extension that exposes the SSD's zoned write model to the host:
- Sequential write zones: Each zone must be written sequentially from zone start to zone end
- Zone capacity: typically 256 MB - 2 GB
- Write pointer: tracks current append position within zone
- Zone states: Empty, Open (being written), Full, Closed, Read-Only, Offline
ZNS advantages:
- Eliminates FTL garbage collection overhead (host manages zones explicitly)
- Reduces write amplification to ~1.0x for sequential workloads
- Lower over-provisioning needed, so higher usable capacity
- More predictable latency (no GC interference)
ZNS use cases: RocksDB (SST file compaction output is sequential), Ceph BlueStore (object placement maps to zones), Lustre OST.
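The write-pointer rule is what turns a zone into an append-only log. A toy model of a single zone follows, with only three of the states and no Closed/Read-Only/Offline handling. The class is hypothetical and only illustrates the constraint that a write is accepted solely at the current write pointer.

```python
class Zone:
    """Toy model of one sequential-write zone (sizes in logical blocks)."""

    def __init__(self, start_lba: int, capacity: int):
        self.start_lba = start_lba
        self.capacity = capacity
        self.write_pointer = start_lba
        self.state = "EMPTY"

    def write(self, lba: int, nblocks: int) -> None:
        """Accept a write only if it starts exactly at the write pointer."""
        end = self.start_lba + self.capacity
        if nblocks <= 0:
            raise ValueError("nblocks must be positive")
        if self.state == "FULL":
            raise ValueError("zone is full; reset it before rewriting")
        if lba != self.write_pointer:
            raise ValueError("write must start at the zone's write pointer")
        if lba + nblocks > end:
            raise ValueError("write would cross the zone capacity")
        self.write_pointer += nblocks
        self.state = "FULL" if self.write_pointer == end else "OPEN"

    def append(self, nblocks: int) -> int:
        """Zone Append style: the device picks the LBA and reports it back."""
        lba = self.write_pointer
        self.write(lba, nblocks)
        return lba

    def reset(self) -> None:
        """Zone reset: rewind the write pointer; the zone's data is discarded."""
        self.write_pointer = self.start_lba
        self.state = "EMPTY"
```

A log-structured writer such as an LSM-tree compaction maps naturally onto this: one output file per zone, written front to back, then reset as a whole when the file is deleted.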
# Check whether a namespace is zoned (the block layer reports "host-managed" for ZNS, "none" otherwise)
cat /sys/block/nvme0n1/queue/zoned
nvme zns id-ns /dev/nvme0n1 # identify ZNS namespace
nvme zns report-zones /dev/nvme0n1 # list zones and their write pointers
NVMe in Cloud
AWS EC2 NVMe (Nitro):
- All current-generation EC2 instances use NVMe via the AWS Nitro card
- Instance store NVMe: physically local SSDs, very low latency (~100 µs), non-persistent (data lost on stop/terminate)
- EBS (Elastic Block Store) volumes also appear as NVMe devices (/dev/nvme0n1) via Nitro — network-attached but NVMe protocol
- AWS Nitro uses SR-IOV to present NVMe devices directly to VMs without hypervisor I/O overhead
# On EC2, distinguish EBS from instance store NVMe
nvme id-ctrl /dev/nvme0 | grep -E '^(mn|sn) '   # model number and serial number only
# AWS EBS volumes have model name "Amazon Elastic Block Store"
# Instance store volumes have model "Amazon EC2 NVMe Instance Storage"
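For scripting, the same check can be done by parsing the `key : value` lines that `nvme id-ctrl` prints. A minimal sketch that works on captured text, so it runs without an NVMe device; the function names are hypothetical, and the two model strings are the ones given in the comments above.

```python
def parse_id_ctrl(text: str) -> dict:
    """Turn 'key : value' lines from `nvme id-ctrl` text output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def classify_ec2_volume(id_ctrl_text: str) -> str:
    """Return 'ebs', 'instance-store', or 'unknown' based on the model number."""
    model = parse_id_ctrl(id_ctrl_text).get("mn", "")
    if model == "Amazon Elastic Block Store":
        return "ebs"
    if model == "Amazon EC2 NVMe Instance Storage":
        return "instance-store"
    return "unknown"


if __name__ == "__main__":
    sample = "vid       : 0x1d0f\nsn        : vol0123456789abcdef\nmn        : Amazon Elastic Block Store\n"
    print(classify_ec2_volume(sample))
```

In practice the input would come from something like `subprocess.run(["nvme", "id-ctrl", dev], capture_output=True, text=True).stdout`.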
Azure Ultra Disk: NVMe-based ultra-low-latency block storage with configurable IOPS/throughput. Up to 160K IOPS and 2000 MB/s per disk.
Google Hyperdisk Extreme: NVMe-based high-performance persistent storage, up to 350K IOPS.
Historical Context
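NVMe was developed by a working group of storage vendors (Intel, Samsung, SanDisk, Dell, NetApp) from 2008-2011. The NVMe 1.0 specification was published in March 2011. The first NVMe drives shipped in 2013 for the enterprise market (Samsung XS1715); consumer NVMe drives followed in 2015 (Intel 750, Samsung 950 Pro). The Samsung XP941 M.2 drive of 2013 used PCIe but still spoke AHCI.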
The initial form factors for NVMe were PCIe add-in card and M.2 (2280 form factor). The M.2 connector enabled NVMe to enter the laptop market, and by 2016, most premium laptops had switched from SATA to NVMe SSDs.
NVMe 1.2 (2014) introduced the Host Memory Buffer (HMB), which lets a controller use host DRAM for its mapping table and avoid expensive onboard DRAM. NVMe 1.3 (2017) added Directives (stream hints to the FTL) and the Sanitize command.
NVMe 2.0 (2021) restructured the specification into a base specification plus separate command set and transport specifications, and incorporated the ZNS (Zoned Namespaces) and KV (Key-Value) command sets. Two features sometimes attributed to 2.0 are older: Persistent Memory Region dates from NVMe 1.4 (2019) and Replay Protected Memory Block (RPMB) from NVMe 1.2. The NVM Express Management Interface (NVMe-MI) provides out-of-band management.
Production Examples
Netflix Open Connect: Netflix uses NVMe SSDs in its Open Connect Appliance (OCA) CDN nodes. Each OCA has multiple NVMe drives serving video content. The OCA software is tuned for high queue depth sequential reads, exploiting NVMe's multi-queue architecture.
Meta's Tectonic storage: Meta's next-gen storage system uses NVMe SSDs with ZNS to reduce write amplification in their custom RocksDB-based object storage. ZNS alignment with RocksDB's compaction output (sequential zone writes per SST file) achieves near-1.0x WAF.
LinkedIn Ambry: LinkedIn's blob storage uses NVMe SSDs exposed via NVMe-oF (TCP) to compute nodes, decoupling compute from storage in their media storage tier.
Debugging Notes
# Full NVMe device enumeration
nvme list
# Device information
nvme id-ctrl /dev/nvme0
nvme id-ns /dev/nvme0n1
# SMART health (critical for monitoring)
nvme smart-log /dev/nvme0
# Watch: percentage_used, available_spare, media_errors, unsafe_shutdowns
# Error log
nvme error-log /dev/nvme0
# Check PCIe link state and speed
lspci -vvv | grep -A 30 "Non-Volatile"
# Look for: LnkSta: Speed 16GT/s (gen4), Width x4
# Performance benchmark
fio --filename=/dev/nvme0n1 --direct=1 --rw=randread \
--bs=4k --ioengine=libaio --iodepth=32 --numjobs=4 \
--runtime=30 --group_reporting --name=nvme-test
# Monitor NVMe queue stats
cat /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/scheduler # should be "none" for NVMe
# Check NVMe power management (APST)
nvme get-feature /dev/nvme0 -f 0x0c -H # Autonomous Power State Transition
# Disable APST for latency-sensitive applications:
nvme set-feature /dev/nvme0 -f 0x0c -v 0
# PCIe error checking
dmesg | grep -i "nvme\|pcie\|aer"
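The SMART fields called out above lend themselves to a simple automated check. The sketch below assumes the caller has already parsed `nvme smart-log` output into a dict keyed by the field names used in the comment (`percentage_used`, `available_spare`, `media_errors`); the function name and the default thresholds are illustrative assumptions, not vendor guidance.

```python
def smart_warnings(smart: dict, max_used_pct: int = 80, min_spare_pct: int = 10) -> list:
    """Return human-readable warnings for an already-parsed SMART log.

    Expected keys (missing keys are treated as healthy):
      percentage_used  - rated endurance consumed, in percent
      available_spare  - remaining spare blocks, in percent
      media_errors     - count of unrecovered data integrity errors
    """
    warnings = []
    used = smart.get("percentage_used", 0)
    spare = smart.get("available_spare", 100)
    errors = smart.get("media_errors", 0)
    if used >= max_used_pct:
        warnings.append(f"percentage_used is {used}% (threshold {max_used_pct}%)")
    if spare <= min_spare_pct:
        warnings.append(f"available_spare is {spare}% (threshold {min_spare_pct}%)")
    if errors > 0:
        warnings.append(f"media_errors is {errors} (expected 0)")
    return warnings


if __name__ == "__main__":
    sample = {"percentage_used": 85, "available_spare": 100, "media_errors": 2}
    for line in smart_warnings(sample):
        print("WARN:", line)
```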
Security Implications
TCG Opal Self-Encrypting Drives (SED): Many NVMe drives support TCG Opal 2.0 — hardware AES-256 encryption with key stored in controller. The drive is locked until the pre-boot authentication (via UEFI or sedutil) unlocks it. Lock/unlock via sedutil-cli. If the drive is removed and power-cycled, it returns to locked state.
NVMe Namespace Isolation: Multiple namespaces on one drive provide logical isolation but NOT security isolation — all namespaces share the same controller and NAND. For multi-tenant environments, use separate physical drives.
NVMe-oF Authentication: The NVMe specifications define in-band authentication (DH-HMAC-CHAP) and TLS for the TCP transport. Without authentication, any host on the network segment can connect to an NVMe-oF target. Always configure authentication in production NVMe-oF deployments.
Firmware Attacks: NVMe controller firmware is updateable and has been a target. The NVMe-MI (Management Interface) spec includes firmware update commands. Ensure firmware comes from verified sources, and prefer drives that only accept vendor-signed images when updating with nvme fw-download followed by nvme fw-commit.
Performance Implications
APST (Autonomous Power State Transitions): NVMe drives can enter power-saving states automatically when idle. Wake-up from deep power states adds 5-100 ms latency. For always-on production servers, disable APST:
echo 0 > /sys/module/nvme_core/parameters/default_ps_max_latency_us
Or in grub: nvme_core.default_ps_max_latency_us=0
IRQ affinity: For maximum throughput, pin NVMe completion interrupts to specific CPUs:
# See NVMe IRQ vectors
cat /proc/interrupts | grep nvme
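# Set IRQ affinity (e.g., pin completion queue 0 to CPU 0); the value is a hexadecimal CPU bitmask
# Note: recent kernels manage NVMe queue IRQ affinity themselves, so this write may be rejected
echo 1 > /proc/irq/<irq_number>/smp_affinity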
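The value written to `smp_affinity` is a hexadecimal bitmask with one bit per CPU, split into comma-separated 32-bit groups on machines with more than 32 CPUs. A small helper (hypothetical name) that builds the mask from a list of CPU numbers:

```python
def smp_affinity_mask(cpus) -> str:
    """Build the hex bitmask string for /proc/irq/<n>/smp_affinity."""
    cpus = list(cpus)
    if not cpus:
        raise ValueError("at least one CPU is required")
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    digits = format(mask, "x")
    # Split into groups of 8 hex digits (32 bits), most significant group first.
    groups = []
    while digits:
        groups.insert(0, digits[-8:])
        digits = digits[:-8]
    return ",".join(groups)


if __name__ == "__main__":
    print(smp_affinity_mask([0]))        # CPU 0 only
    print(smp_affinity_mask([4, 5, 6]))  # CPUs 4-6
```

The sibling file `/proc/irq/<irq_number>/smp_affinity_list` accepts a plain CPU list such as `0-3`, which avoids the mask arithmetic entirely.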
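io_uring for NVMe: Linux io_uring (5.1+) with IORING_SETUP_SQPOLL starts a kernel thread that polls the io_uring submission ring, which removes the system call from the submission path. Combined with completion polling (IORING_SETUP_IOPOLL), this approaches SPDK-level performance while still going through the kernel's NVMe driver.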
Failure Modes and Real Incidents
NVMe controller firmware hangs: Multiple enterprise NVMe drives from major vendors (Intel P3700, Samsung PM9A3 early firmware) had bugs that caused the controller to hang under specific I/O patterns. This manifests as I/O timeout messages from the nvme driver in dmesg, followed by a controller reset. Mitigated by firmware updates and by the Linux NVMe driver's command timeout (the nvme_core.io_timeout module parameter), which aborts stuck commands and resets the controller.
PCIe bifurcation issues: Some motherboards/servers incorrectly configure PCIe bifurcation for M.2/U.2 slots, running Gen4 NVMe at Gen3 speeds or x2 instead of x4. Always verify with lspci -vv | grep LnkSta after installation.
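This check is easy to automate across a fleet by parsing the `LnkSta` line. A sketch that operates on a captured line of `lspci -vv` output; the function names are hypothetical, and the speed-to-generation table covers the standard per-lane rates.

```python
import re

# Per-lane signaling rate in GT/s -> PCIe generation
_GEN_BY_SPEED = {"2.5": 1, "5": 2, "8": 3, "16": 4, "32": 5, "64": 6}
_LNKSTA = re.compile(r"LnkSta:\s*Speed\s+([\d.]+)GT/s.*?Width\s+x(\d+)")


def parse_lnksta(line: str):
    """Return (generation, width) from an lspci LnkSta line, or None if absent."""
    match = _LNKSTA.search(line)
    if match is None:
        return None
    return _GEN_BY_SPEED.get(match.group(1)), int(match.group(2))


def is_degraded(line: str, expected_gen: int, expected_width: int) -> bool:
    """True if the negotiated link is slower or narrower than expected."""
    parsed = parse_lnksta(line)
    if parsed is None or parsed[0] is None:
        raise ValueError("no recognizable LnkSta entry in input")
    gen, width = parsed
    return gen < expected_gen or width < expected_width


if __name__ == "__main__":
    line = "LnkSta: Speed 8GT/s (downgraded), Width x2 (downgraded)"
    print(parse_lnksta(line), is_degraded(line, expected_gen=4, expected_width=4))
```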
Metadata integrity on power loss: Consumer NVMe drives without power-loss protection can corrupt the FTL mapping table on sudden power loss. This can brick the drive (controller cannot find its own mapping table). Enterprise drives use supercapacitors to complete in-flight FTL writes on power loss. A 2020 incident at a major European cloud provider saw dozens of consumer-grade NVMe drives simultaneously brick during a PDU failure.
High-temperature throttling: NVMe drives in M.2 slots without heatsinks can reach 80°C+ under sustained load. At 70°C+, most drives throttle performance by 50-80%. Monitor with nvme smart-log /dev/nvme0 | grep temp and ensure adequate airflow or add M.2 heatsinks.
Modern Usage
- CXL-attached flash: Emerging CXL 2.0 memory devices (typically Type 3) present flash-backed capacity as coherent memory, accessible via CPU load/store instructions
- NVMe SR-IOV: Virtual Functions allow a single NVMe controller to be shared among multiple VMs with hardware isolation
- NVMe CMB (Controller Memory Buffer): Controller exposes a portion of its onboard SRAM to the host as PCIe BAR memory, usable for I/O queue placement — reduces DMA overhead
- Persistent Memory Region (PMR): Introduced in NVMe 1.4. The controller exposes a persistent memory region accessible via PCIe MMIO. Used for crash-consistent queue storage
Future Directions
- PCIe 6.0: roughly 120 GB/s per direction at x16 (64 GT/s with PAM4 signaling). An x4 NVMe link would reach ~30 GB/s, exceeding current NAND bandwidth limits; multi-chip SSD arrays are needed to saturate it
- Computational Storage (NVMe CSD): computational storage devices embed processors next to the flash, and NVM Express has been standardizing command sets for them. First commercial CSDs are available (NGD Newport, Samsung SmartSSD)
- NVMe Key-Value store (KV): NVMe 2.0 includes a native KV command set (store, retrieve, delete, list), aiming to eliminate FTL overhead for KV workloads by mapping keys directly onto NAND structures
- Disaggregated NVMe pools: Rack-scale NVMe-oF fabric with centrally managed NVMe SSD pools, dynamically allocated to compute nodes. AWS Nitro SSD and Azure Local Storage are early forms of this
Exercises
- Install nvme-cli and identify all NVMe devices on a system. Examine nvme id-ctrl output and identify the supported commands (OACS and ONCS fields). Then find the number of supported queue pairs (nvme get-feature -f 0x07) and the maximum queue depth (MQES in the CAP register, shown by nvme show-regs).
- Use fio to benchmark a local NVMe drive at queue depths 1, 4, 8, 16, 32, and 64 with 4K random reads. Plot IOPS vs QD. At what QD does IOPS plateau? What does this reveal about the drive's internal parallelism?
- Set up an NVMe-oF TCP target and initiator on the same machine using loopback (127.0.0.1). Measure the overhead: compare fio latency on /dev/nvme0n1 (direct) vs /dev/nvme1n1 (NVMe-oF TCP loopback).
- Read drivers/nvme/host/pci.c in the Linux kernel. Find the function that writes the doorbell register after submitting a command. Trace the path from nvme_submit_cmd() to the actual MMIO write.
- Disable APST on an NVMe drive and measure the P99 latency improvement under a bursty I/O workload (idle periods followed by bursts). Compare nvme get-feature -f 0x0c before and after.
References
- NVM Express Base Specification 2.0: https://nvmexpress.org/specifications/
- NVMe-oF Specification 1.1: https://nvmexpress.org/specifications/
- Linux NVMe driver: drivers/nvme/ (kernel.org)
- NVMe Zoned Namespaces Command Set: https://nvmexpress.org/specifications/
- Xu, Q. et al. "Performance Analysis of NVMe SSDs and Their Implication on Real World Databases." SYSTOR 2015.
- Yang, J. et al. "Don't Stack your Log on my Log." INFLOW 2014.
- AWS EC2 NVMe: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/nvme-ebs-volumes.html
- Gregg, B. Systems Performance, 2nd ed., Chapter 9
- nvme-cli tool: https://github.com/linux-nvme/nvme-cli