02 - HDD Internals
Technical Overview
Hard Disk Drives (HDDs) are electromechanical devices that store data on magnetic platters. Despite solid-state storage displacing HDDs in most latency-sensitive applications, HDDs remain dominant for cold storage, backup, and bulk capacity due to their unmatched cost-per-gigabyte. Understanding HDD internals is essential for diagnosing I/O performance problems, understanding why random I/O is fundamentally expensive on spinning media, and appreciating why so much database and filesystem design exists specifically to cope with mechanical constraints.
Prerequisites
- Basic understanding of magnetic storage principles
- Familiarity with I/O latency concepts
- Linux block layer fundamentals (or will be covered alongside this topic)
Core Content
Mechanical Components
Side view of HDD:
Spindle Motor
|
+--------+--------+
| Platter 0 | <-- top of stack
| Platter 1 |
| Platter 2 |
+--------+--------+
|
(stacked platters spin together)
Top view of single platter:
Actuator Pivot
|
+------ Actuator Arm ------+
|
+==========================+ (voice coil on left)
| Track 0 (outermost)
Platter Track N (innermost)
|
Spindle
Read/Write Head at tip of actuator arm
(one head per platter surface — typically 2 per platter)
Platters: Aluminum or glass disks coated with a ferromagnetic material (cobalt alloy). Modern 3.5" HDDs have anywhere from 1 to about 10 platters. Data density: over 2 TB per platter in 2024 for conventional (non-HAMR) recording. The platters spin continuously at a constant RPM while powered.
Spindle Motor: Brushless DC motor drives all platters simultaneously on a common spindle. Consumer: 5400 or 7200 RPM. Enterprise: 10K or 15K RPM (rare, mostly replaced by SSDs).
Read/Write Heads: One head per platter surface (top and bottom). Heads do not touch the platter — they fly ~3-10 nm above the surface on an air bearing. Head crash (contact with platter) destroys both head and platter surface. Modern heads use Giant Magnetoresistance (GMR) or Tunneling Magnetoresistance (TMR) for reading; perpendicular recording for writing.
Actuator Arm: Voice coil actuator — a magnet and coil assembly that positions all heads simultaneously. The voice coil is driven by a servo control loop using embedded servo patterns on the disk to determine precise head position.
Voice Coil Motor (VCM): Not a stepper motor. It is a continuously variable rotary actuator that swings the arm about its pivot and can hold the heads at any radial position. A seek is therefore a controlled acceleration/deceleration trajectory rather than a series of discrete steps.
Data Organization
Platter surface layout:
Outer edge
+===========================================+ Track 0 (highest LBA density per track)
| Sector 0 | Sector 1 | ... | Sector N-1 |
+-------------------------------------------+ Track 1
| Sector 0 | Sector 1 | ... | Sector N-1 |
+-------------------------------------------+
...
+===========================================+ Track T (innermost, shortest circumference)
| Sector 0 | Sector 1 | ... | Sector M-1 | <- fewer sectors (ZBR)
Inner edge
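Cylinders: All tracks at the same radial position across all platters form a cylinder. Sequential reads within a cylinder require no long seek, only a head switch. The switch itself is electronic (~1 µs), but on modern high-density drives the newly selected head must re-settle on its own track using the servo patterns, so in practice a head switch costs about as much as a track-to-track seek.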
Sectors: Smallest addressable unit. Traditional: 512 bytes (512n format). Advanced Format: 4096 bytes physical (4Kn) or 512-byte emulation over 4K physical (512e). The 4K format improves ECC efficiency and reduces overhead at high densities. Linux sees 512e drives as 512-byte sector devices; misaligned partitions (not on 4K boundary) cause write amplification.
Check sector size:
cat /sys/block/sda/queue/physical_block_size # physical sector size (hw_sector_size reports the logical size)
cat /sys/block/sda/queue/logical_block_size # logical sector size
blockdev --getpbsz /dev/sda # physical block size
blockdev --getss /dev/sda # logical block size
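The alignment rule above can be checked with a little arithmetic. The following is a minimal sketch, assuming a 512e drive (512-byte logical, 4096-byte physical sectors); the function names are illustrative and not part of any tool.

```python
def is_aligned(start_lba: int, logical_bytes: int = 512, physical_bytes: int = 4096) -> bool:
    """True if a partition starting at start_lba begins on a physical sector boundary."""
    return (start_lba * logical_bytes) % physical_bytes == 0


def physical_sectors_touched(byte_offset: int, length: int, physical_bytes: int = 4096) -> int:
    """Number of physical sectors covered by a write of `length` bytes at `byte_offset`."""
    first = byte_offset // physical_bytes
    last = (byte_offset + length - 1) // physical_bytes
    return last - first + 1


if __name__ == "__main__":
    for start in (63, 2048):  # legacy MS-DOS start sector vs. the modern 1 MiB default
        touched = physical_sectors_touched(start * 512, 4096)
        print(f"start LBA {start}: aligned={is_aligned(start)}, "
              f"a 4 KiB write touches {touched} physical sector(s)")
```

With a start sector of 63, every 4 KiB filesystem block straddles two physical sectors, so the drive has to read, modify, and rewrite both. A start sector of 2048 keeps each block inside a single physical sector.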
Access Time Formula
Total HDD access time = Seek time + Rotational latency + Transfer time
Seek time: Time to move the head to the correct track.
- Full stroke (track 0 to track N): ~15-20 ms
- Average seek (random access): ~7-9 ms (7200 RPM desktop), ~4 ms (15K RPM enterprise)
- Track-to-track (adjacent track): ~0.3-1 ms
Rotational latency: Time for the target sector to rotate under the head.
- 7200 RPM → one revolution = 60/7200 s = 8.33 ms
- Average rotational latency = 8.33 / 2 = 4.17 ms
- 5400 RPM → 5.56 ms average
- 15000 RPM → 2 ms average
Transfer time: Time to read/write sectors once the head is positioned.
- Depends on linear density and RPM
- Outer tracks (higher linear velocity): 250-300 MB/s
- Inner tracks (lower linear velocity, ZBR): 150-200 MB/s
Random 4K I/O, typical case:
- 7 ms seek + 4 ms rotational + 0.016 ms transfer ≈ 11 ms → ~90 IOPS
- Actual random 4K IOPS for a 7200 RPM HDD: 75-150 IOPS, with deeper command queues reaching the upper end
- The true worst case is a full-stroke seek plus a full revolution: ~15-20 ms + 8.33 ms
Sequential throughput:
- 7200 RPM: 150-250 MB/s depending on zone
- Sequential >> random: ~90 IOPS × 4 KB ≈ 0.36 MB/s, so sequential is faster by a factor of roughly 500x
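The access time formula is easy to evaluate for other drives. This is a minimal sketch that plugs in the figures quoted above; the function names and the 250 MB/s transfer rate are illustrative assumptions, not measurements.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def access_time_ms(seek_ms: float, rpm: float, io_bytes: int, transfer_mb_s: float) -> float:
    """Seek time + average rotational latency + transfer time, in milliseconds."""
    transfer_ms = io_bytes / (transfer_mb_s * 1e6) * 1e3
    return seek_ms + avg_rotational_latency_ms(rpm) + transfer_ms


if __name__ == "__main__":
    # Typical 7200 RPM desktop drive, one random 4 KiB read
    t = access_time_ms(seek_ms=7.0, rpm=7200, io_bytes=4096, transfer_mb_s=250.0)
    print(f"access time: {t:.2f} ms, about {1000.0 / t:.0f} IOPS at queue depth 1")
```

Transfer time contributes well under 1% of the total, which is why random 4K performance is governed almost entirely by the mechanics.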
Zone Bit Recording (ZBR)
Outer tracks have larger circumference → more sectors per track at the same linear bit density. ZBR groups tracks into zones (typically 30-60 zones) and assigns a fixed number of sectors per track within each zone. Outer zones: more sectors/track, higher data rate. Inner zones: fewer sectors/track, lower data rate.
Zone 0 (outer): 1024 sectors/track, 250 MB/s
Zone 1: 980 sectors/track, 240 MB/s
...
Zone N (inner): 600 sectors/track, 150 MB/s
Performance drops ~40% from outer to inner tracks. Large sequential writes that span from outer to inner zones show decreasing throughput — visible in benchmarks like fio with --filename=/dev/sda.
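At constant RPM the sustained data rate of a zone scales with its sectors per track. A minimal sketch using the illustrative zone table above (the sector counts and the 250 MB/s outer rate are the document's example figures, not values from a real drive):

```python
def zone_rate_mb_s(outer_rate_mb_s: float, outer_sectors: int, zone_sectors: int) -> float:
    """Scale the outer-zone data rate by the ratio of sectors per track."""
    return outer_rate_mb_s * zone_sectors / outer_sectors


if __name__ == "__main__":
    outer_sectors, outer_rate = 1024, 250.0
    for name, sectors in [("Zone 0 (outer)", 1024), ("Zone 1", 980), ("Zone N (inner)", 600)]:
        rate = zone_rate_mb_s(outer_rate, outer_sectors, sectors)
        drop = 100.0 * (1.0 - sectors / outer_sectors)
        print(f"{name}: {rate:.0f} MB/s ({drop:.0f}% below outer zone)")
```

The inner zone works out to about 146 MB/s, roughly 41% below the outer zone, which is consistent with the ~40% figure above.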
Shingled Magnetic Recording (SMR)
Conventional Magnetic Recording (CMR) writes non-overlapping tracks. It is often loosely called PMR (Perpendicular Magnetic Recording), although SMR drives use perpendicular recording as well. SMR overlaps write tracks like roof shingles to increase areal density.
CMR tracks: |===Track 0===|===Track 1===|===Track 2===|
(non-overlapping)
SMR tracks: |===Track 0=====|
                    |===Track 1=====|
                            |===Track 2=====|
            (each track partially overlaps the previous one)
SMR consequences:
- Tracks are organized in SMR bands (typically 20-40 MB each)
- Within a band, tracks are written sequentially only: rewriting track 1 clobbers the overlapping part of track 2, so track 2 must be rewritten as well, and so on to the end of the band
- Random writes require: write to a buffer (cache) area → background read-modify-write of the entire band (band compaction)
SMR drives come in three variants:
- Drive-Managed SMR (DM-SMR): A translation layer in the drive firmware (analogous to an SSD's FTL) handles band management internally. Appears as a normal block device. Write performance degrades severely under sustained random write workloads (Seagate Archive series). Causes disasters when used for ZFS/RAID.
- Host-Managed SMR (HM-SMR): Host controls bands, which are exposed as zones. Requires zone-aware software using the ZBC (SCSI) / ZAC (ATA) command sets. Predictable performance.
- Host-Aware SMR (HA-SMR): Hybrid. Accepts random writes like a DM-SMR drive, but also exposes its zones so that zone-aware hosts can write sequentially.
# Detect SMR type
lsblk -d -o NAME,TYPE,ZONED /dev/sda
# ZONED column: none, host-aware, or host-managed
# DM-SMR drives also report "none": they are indistinguishable from CMR at this level
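The cost of an in-place update inside a shingled band can be estimated with a deliberately simplified model. The sketch below assumes a 30 MiB band (within the 20-40 MB range above) and ignores the drive's write cache; the function names are illustrative, and real firmware behavior is more complex.

```python
def tracks_to_rewrite(band_tracks: int, modified_track: int) -> int:
    """Tracks that must be rewritten when `modified_track` (0-based) changes:
    the track itself plus every later track in the band that overlaps downstream."""
    if not 0 <= modified_track < band_tracks:
        raise ValueError("modified_track must lie inside the band")
    return band_tracks - modified_track


def worst_case_write_amplification(band_bytes: int, write_bytes: int) -> float:
    """Bytes physically rewritten per byte requested if a small update
    forces a read-modify-write of the whole band."""
    return band_bytes / write_bytes


if __name__ == "__main__":
    band = 30 * 1024 * 1024  # assumed band size
    print("tracks rewritten for an update to track 1 of 20:", tracks_to_rewrite(20, 1))
    print("worst-case amplification for a 4 KiB write:",
          worst_case_write_amplification(band, 4096))
```

Under these assumptions a single 4 KiB update can cost thousands of times its own size in media writes, which is why DM-SMR drives buffer random writes and compact bands in the background.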
SMART Attributes and Failure Indicators
Self-Monitoring Analysis and Reporting Technology (SMART) exposes drive health metrics:
smartctl -a /dev/sda
Key attributes for HDDs:
| Attribute ID | Name | Critical? | Notes |
|---|---|---|---|
| 5 | Reallocated Sector Count | YES | Sectors remapped due to read errors |
| 7 | Seek Error Rate | No | Vendor-specific scale (Seagate) |
| 9 | Power-On Hours | No | Total usage time |
| 10 | Spin Retry Count | Yes | Bearing/spindle health |
| 187 | Reported Uncorrectable Errors | YES | ECC-uncorrectable reads |
| 188 | Command Timeout | Yes | I/O timeouts |
| 196 | Reallocation Event Count | Yes | Any reallocation is a warning |
| 197 | Current Pending Sector Count | YES | Sectors awaiting reallocation |
| 198 | Uncorrectable Sector Count | YES | Read errors without recovery |
Rule of thumb: Any non-zero value for attributes 5, 187, 196, 197, 198 should trigger immediate backup and disk replacement.
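The rule of thumb can be automated by scanning the attribute table that smartctl -A prints for ATA drives. This is a minimal sketch: the SAMPLE text is made-up illustrative output, not data from a real drive, and the parser assumes the usual ten-column layout with the raw value in the last column.

```python
CRITICAL_IDS = {5, 187, 196, 197, 198}

# Made-up sample in the layout of `smartctl -A` (illustrative values only)
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13412
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def failing_attributes(smartctl_output: str) -> dict[int, int]:
    """Return {attribute_id: raw_value} for critical attributes with a non-zero raw value."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, raw = int(fields[0]), fields[9]
        if attr_id in CRITICAL_IDS and raw.isdigit() and int(raw) > 0:
            flagged[attr_id] = int(raw)
    return flagged


if __name__ == "__main__":
    print(failing_attributes(SAMPLE))
```

Only the raw value is checked here. The normalized VALUE/WORST/THRESH columns use vendor-specific scales and usually react much later.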
HDD Failure Modes
Gradual sector degradation (bad sectors): Magnetic domains weaken over time. Drive's ECC can correct a number of bit errors, but when beyond ECC capability, sector is unreadable. Drive may remap to spare sectors (reallocated sector count increases). Performance impact: read retries cause latency spikes.
Bearing failure: Spindle or actuator bearings wear out. Symptoms: grinding noise, elevated spin retry count, eventual inability to spin up. Often precedes total failure within weeks.
Head crash: Read/write head contacts platter surface. Causes physical gouging of both head and platter. Can be triggered by shock/vibration while spinning. Often catastrophic — immediate data loss.
Stiction: Heads park on platter surface when powered off (older drives). Can stick to platter. Drive fails to spin up. Less common in modern drives with ramp parking.
PCB failure: Controller board failure. Drive is mechanically fine but electronically dead. Data recovery services can transplant PCB (requires matching firmware chip).
Thermal expansion: HDDs sensitive to temperature. High temp → head position drift → read errors. Operating range typically 0-60°C. Enterprise drives specify 5°C/minute thermal ramp rate.
Historical Context
IBM introduced the first HDD, the IBM 350 RAMAC, in 1956. It stored 5 MB on 50 24-inch platters, weighed over a ton, and rented for $3,200/month. By 1980, 5.25" form factor drives from Seagate (the ST-506) established the standard PC disk size. The 3.5" and 2.5" form factors followed during the 1980s.
Through the 1990s and 2000s, areal density grew rapidly, peaking at roughly 100% per year (the trend popularized as Kryder's Law), but this pace slowed dramatically around 2010. The 2011 Thailand floods, which inundated major HDD manufacturing facilities (notably Western Digital's plants) and the component suppliers that other vendors such as Seagate depended on, caused a global shortage and price spike that lasted about 18 months and accelerated SSD adoption.
The transition to 4K Advanced Format sectors (2010-2014) caused widespread compatibility problems with older operating systems that assumed 512-byte sectors, requiring 512e emulation layers. Windows XP on 512e drives with misaligned partitions caused severe performance degradation.
Production Examples
Backblaze Pod storage: Backblaze publishes quarterly HDD reliability reports using their fleet of 200,000+ consumer HDDs in Storage Pods. Key findings: annual failure rates of 0.5%-5% depending on model, with some Seagate models (notably ST3000DM001) hitting 13% annual failure rates. Essential reading for HDD selection: https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data
ZFS + DM-SMR disaster: Multiple reports in 2020-2021 of users unknowingly buying DM-SMR drives (Seagate Barracuda, WD Red without "Plus") for NAS/ZFS builds. A RAIDZ resilver produces a sustained, largely non-sequential write stream that overwhelms DM-SMR band management: drives would drop to 1-2 MB/s during the resilver. Western Digital was criticized for not labeling SMR drives clearly.
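Google Disk Failure Study (2007): Pinheiro et al. analyzed 100,000+ HDDs at Google. Found: drives that report scan errors or reallocations are far more likely to fail soon afterwards; temperature and utilization correlate with failure much more weakly than commonly assumed; and a large fraction of failed drives showed no SMART warning at all, so SMART alone is a poor predictor for an individual drive. Observed AFR ranged from roughly 2% for first-year drives to above 8% for three-year-old drives.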
Debugging Notes
# Watch I/O in real time
iostat -xz 1
# %util: % time device was busy
# await: average I/O wait time (ms) -- for HDD should be <20ms for light load
# svctm: service time (deprecated, unreliable)
# r_await/w_await: separate read/write latency
# Check for I/O errors in dmesg
dmesg | grep -i "error\|failed\|reset\|ata"
# Run SMART short/long test
smartctl -t short /dev/sda
smartctl -t long /dev/sda # 1-3 hours for full surface scan
smartctl -l selftest /dev/sda # view results
# Badblocks scan (read-only by default; -n selects the non-destructive read-write test, -w is destructive)
badblocks -sv /dev/sda
# Check HDD queue depth and scheduler
cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/nr_requests
cat /sys/block/sda/device/queue_depth # native command queue depth (NCQ)
# Check rotational flag (1 = HDD)
cat /sys/block/sda/queue/rotational
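The same sysfs files can be read from a script when many devices need checking. A minimal sketch, assuming the standard /sys/block/<dev>/queue/ layout used by the commands above; the helper names and the sysfs_root parameter (which exists only to make the code testable) are illustrative.

```python
from pathlib import Path


def read_queue_attr(device: str, attr: str, sysfs_root: str = "/sys/block") -> str:
    """Return the stripped contents of <sysfs_root>/<device>/queue/<attr>."""
    return (Path(sysfs_root) / device / "queue" / attr).read_text().strip()


def is_rotational(device: str, sysfs_root: str = "/sys/block") -> bool:
    """True if the kernel flags the device as rotational (spinning media)."""
    return read_queue_attr(device, "rotational", sysfs_root) == "1"
```

For example, is_rotational("sda") returns True for a typical HDD, and read_queue_attr("sda", "scheduler") returns the scheduler line with the active scheduler in brackets.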
Security Implications
Magnetic data remanence: When files are deleted on an HDD, the magnetic data persists until overwritten. Tools like shred or dd if=/dev/urandom are needed for secure erasure. However, reallocated sectors (spares) and bad sectors are not accessible to the host and cannot be overwritten by normal means — physical destruction is the only guarantee.
Acoustic attacks: Extremely sensitive microphones placed near HDDs can recover data through acoustic emanations from the voice coil actuator as it seeks. Demonstrated academically; not practically exploited but relevant for high-security environments.
Vibration attacks: Adversarial ultrasonic vibrations can cause HDD read errors by resonating the actuator arm. A 2017 paper ("Acoustic Denial of Service Attacks on HDDs") demonstrated this on commercially deployed systems. SSD adoption mitigates this attack vector.
Performance Implications
NCQ (Native Command Queuing): SATA feature (introduced in the SATA II generation) allowing the drive to reorder up to 32 queued commands to minimize seek distance and rotational delay (elevator-style scheduling). Enabled by default in the kernel. Without NCQ, or at queue depth 1, the drive sees only one command at a time, so no reordering is possible and a random workload pays the full seek cost on every I/O.
# Verify NCQ is active
cat /sys/block/sda/device/queue_depth
# Should show 31 or 32 for NCQ-enabled drives; 1 means NCQ is off
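The benefit of reordering can be illustrated with a toy model that counts tracks traversed. This sketch compares arrival-order service with a single elevator sweep; it is not the algorithm any drive firmware actually uses (real firmware also accounts for rotational position), and the track numbers are arbitrary.

```python
def total_seek_distance(start: int, tracks: list[int]) -> int:
    """Sum of track-to-track distances when servicing `tracks` in the given order."""
    position, distance = start, 0
    for track in tracks:
        distance += abs(track - position)
        position = track
    return distance


def elevator_order(start: int, tracks: list[int]) -> list[int]:
    """Sweep upward from `start`, then come back down for the remaining requests."""
    upward = sorted(t for t in tracks if t >= start)
    downward = sorted((t for t in tracks if t < start), reverse=True)
    return upward + downward


if __name__ == "__main__":
    queue = [95, 10, 60, 5, 80, 20]  # arbitrary queued target tracks
    head = 50
    print("arrival order:", total_seek_distance(head, queue))
    print("elevator order:", total_seek_distance(head, elevator_order(head, queue)))
```

For this queue the arrival order travels 370 tracks and the elevator order 135, which is the kind of saving a deeper queue makes possible.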
Read-ahead: The kernel reads ahead sequentially to fill page cache. Default read-ahead: 128 KB (/sys/block/sda/queue/read_ahead_kb). For sequential streaming workloads on HDD, increase to 1024-4096 KB to keep the disk's transfer buffer full.
RAID rebuild time: A 14 TB HDD at 200 MB/s sequential read → rebuild takes 14 TB / 200 MB/s = 70,000 seconds ≈ 19 hours. During this window, any second drive failure in a RAID-5 causes data loss. RAID-5 with large modern HDDs is widely considered too risky.
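The rebuild arithmetic generalizes to other capacities. A minimal sketch, assuming the rebuild is limited only by sustained sequential throughput with no competing I/O (real rebuilds on a busy array are usually slower); the function name is illustrative.

```python
def rebuild_hours(capacity_tb: float, rate_mb_s: float) -> float:
    """Hours needed to read or write one full drive at a sustained rate (decimal TB and MB)."""
    seconds = capacity_tb * 1e12 / (rate_mb_s * 1e6)
    return seconds / 3600.0


if __name__ == "__main__":
    for capacity in (4, 8, 14):
        print(f"{capacity} TB at 200 MB/s: {rebuild_hours(capacity, 200.0):.1f} h")
```

Because throughput has grown far more slowly than capacity, the rebuild window grows almost linearly with drive size.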
Failure Modes and Real Incidents
2011 Thailand floods: Flooding in Q4 2011 hit WD's plants in Bang Pa-in and Navanakorn directly and cut off component suppliers that Seagate and other vendors relied on, reducing global HDD output by roughly 30%. HDD prices rose sharply within weeks, in some cases tripling. The PC industry's shift to SSDs accelerated as a consequence.
CERN storage system degradation (2013): CERN's CASTOR tape/disk storage system experienced cascading failures when a batch of HDDs from a single firmware revision failed within weeks of each other. Batch failure due to firmware bugs in the same HDD model's power management — drives would spin down incorrectly and fail to spin back up. Lesson: avoid homogeneous disk batches.
Backblaze Pod Gen 2 vibration failures: Backblaze discovered that high-density storage pods with many drives caused resonant vibration that increased HDD error rates. Solution: rubber anti-vibration mounts. Enterprise HDDs specify vibration tolerance (RV sensors) for this reason.
Modern Usage
HDDs remain the dominant medium for:
- Cloud object storage cold tiers (AWS S3, GCS, Azure Blob back-end HDD JBOD)
- Backup and archive (Backblaze B2, tape+HDD hybrid NAS)
- Video surveillance (purpose-built WD Purple, Seagate SkyHawk with rotational vibration sensors)
- Bulk NAS (consumer: Synology, QNAP with WD Red Plus / Seagate IronWolf CMR drives)
HAMR (Heat-Assisted Magnetic Recording): Seagate's HAMR uses a laser to heat the platter surface during writing, enabling higher coercivity materials → higher areal density. Seagate shipped first HAMR drives (Mozaic 3+) to hyperscaler customers in 2023 at 30 TB capacity. Separately from HAMR, Seagate's MACH.2 (multi-actuator) design adds a second independent actuator serving half of the platter stack, allowing up to roughly double the random IOPS because two heads can seek at the same time.
MAMR (Microwave-Assisted Magnetic Recording): Western Digital's competing approach to HAMR — uses microwave energy instead of heat. Slightly lower density potential but less thermal stress on heads.
Future Directions
- HAMR density roadmap: Seagate targets 50+ TB per 3.5" drive by 2026 using HAMR
- Multi-actuator drives (Seagate MACH.2): Two actuators per drive → 2x IOPS for random workloads
- DNA/glass storage long-term research: Microsoft's Project Silica (fused silica glass) stores data in 3D voxels, read with femtosecond laser. 75 TB/cm² potential but R/W latency measured in days currently
- HDD market consolidation: By 2024, only Seagate, Western Digital, and Toshiba remain as HDD manufacturers
Exercises
- Use fio to measure random 4K IOPS vs sequential throughput on an HDD (or simulate with fio --name=randread --filename=/dev/sdb --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --runtime=30). Compare to the theoretical access time calculation using the formula in this document.
- Run iostat -xz 1 30 during an intensive HDD workload. Identify which metric (await, %util, r_await) best indicates saturation. When can %util be high but await be low?
- Using smartctl -a, examine a drive's SMART data. Identify the reallocated sector count and pending sector count. What is the raw vs normalized value scale?
- Research the WD Red SMR controversy (2020). List the specific models affected and the performance degradation reported under ZFS resilver. Why is ZFS particularly sensitive to DM-SMR behavior?
- Calculate the RAID-5 rebuild window for a 20 TB HDD array with 6 drives at 200 MB/s sustained sequential read. What is the probability of a second failure during rebuild if the AFR is 2%? (Hint: probability of failure in a given time window = 1 - e^(-λt) where λ = AFR/8760 and t is in hours.)
References
- Anderson, D. et al. "More Than an Interface: SCSI vs. ATA." USENIX FAST 2003.
- Pinheiro, E. et al. "Failure Trends in a Large Disk Drive Population." USENIX FAST 2007.
- Backblaze Hard Drive Stats: https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data
- WD SMR controversy: https://arstechnica.com/gadgets/2020/04/wd-nas-hard-drives-are-actually-slow-smr-drives/
- Seagate HAMR: https://www.seagate.com/innovation/hamr/
- ATA/ATAPI Command Set (ACS-4): INCITS 529-2018
- Linux kernel SMART interface: drivers/ata/libata-scsi.c
- Cornwell, M. "Anatomy of a Solid-State Drive." ACM Queue, 2012. (comparison context)