02 - HDD Internals

Technical Overview

Hard Disk Drives (HDDs) are electromechanical devices that store data on magnetic platters. Despite solid-state storage displacing HDDs in most latency-sensitive applications, HDDs remain dominant for cold storage, backup, and bulk capacity due to their unmatched cost-per-gigabyte. Understanding HDD internals is essential for diagnosing I/O performance problems, understanding why random I/O is fundamentally expensive on spinning media, and appreciating why so much database and filesystem design exists specifically to cope with mechanical constraints.

Prerequisites

Basic understanding of magnetic storage principles
Familiarity with I/O latency concepts
Linux block layer fundamentals (or will be covered alongside this topic)

Core Content

Mechanical Components

Side view of HDD:

         Spindle Motor
              |
     +--------+--------+
     |    Platter 0    |  <-- top of stack
     |    Platter 1    |
     |    Platter 2    |
     +--------+--------+
              |
         (stacked platters spin together)

Top view of single platter:

         Actuator Pivot
              |
              +------ Actuator Arm ------+
                                         |
              +==========================+  (voice coil on left)
              |         Track 0 (outermost)
         Platter    Track N (innermost)
              |
         Spindle

    Read/Write Head at tip of actuator arm
    (one head per platter surface — typically 2 per platter)

Platters: Aluminum or glass disks coated with a ferromagnetic material (cobalt alloy). Modern HDDs have 1 to 10 platters, with the highest counts in helium-filled drives. Data density: over 2 TB per platter for conventional (PMR) drives in 2024. The platters spin at a constant RPM whenever the drive is spun up.

Spindle Motor: Brushless DC motor drives all platters simultaneously on a common spindle. Consumer: 5400 or 7200 RPM. Enterprise: 10K or 15K RPM (rare, mostly replaced by SSDs).

Read/Write Heads: One head per platter surface (top and bottom). Heads do not touch the platter — they fly ~3-10 nm above the surface on an air bearing. Head crash (contact with platter) destroys both head and platter surface. Modern heads use Giant Magnetoresistance (GMR) or Tunneling Magnetoresistance (TMR) for reading; perpendicular recording for writing.

Actuator Arm: Voice coil actuator — a magnet and coil assembly that positions all heads simultaneously. The voice coil is driven by a servo control loop using embedded servo patterns on the disk to determine precise head position.

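Voice Coil Motor (VCM): Not a stepper motor. It is a continuously variable actuator (in modern drives a rotary one that swings the arm about its pivot) that can position the heads at any radius rather than at discrete steps. A seek is a controlled acceleration/deceleration trajectory.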

Data Organization

Platter surface layout:

    Outer edge
    +===========================================+ Track 0 (outermost, most sectors per track)
    |  Sector 0 | Sector 1 | ... | Sector N-1  |
    +-------------------------------------------+ Track 1
    |  Sector 0 | Sector 1 | ... | Sector N-1  |
    +-------------------------------------------+
    ...
    +===========================================+ Track T (innermost, shortest circumference)
    |  Sector 0 | Sector 1 | ... | Sector M-1  |  <- fewer sectors (ZBR)
    Inner edge

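Cylinders: All tracks at the same radial position across all platters form a cylinder. Sequential reads within a cylinder require no move to a new radius, only a head switch. Head selection itself is electronic, but at modern track densities the newly selected head must re-settle on its own track using servo data, so a head switch typically costs about as much as a track-to-track seek rather than ~1 µs.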

Sectors: Smallest addressable unit. Traditional: 512 bytes (512n format). Advanced Format: 4096-byte physical sectors, exposed either natively (4Kn, 4096-byte logical sectors) or through 512-byte emulation (512e). The 4K format improves ECC efficiency and reduces per-sector overhead at high densities. Linux sees a 512e drive as a device with 512-byte logical and 4096-byte physical sectors; a partition that does not start on a 4K boundary turns each 4K write into a read-modify-write across two physical sectors.

Check sector size:

cat /sys/block/sda/queue/physical_block_size # physical sector size (hw_sector_size is a legacy alias for the logical size)
cat /sys/block/sda/queue/logical_block_size # logical sector size
blockdev --getpbsz /dev/sda                 # physical block size
blockdev --getss /dev/sda                   # logical block size
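
The alignment rule for 512e drives comes down to one modulo operation. The following is a minimal sketch, assuming the partition start is given in 512-byte units (as reported in /sys/block/sda/sda1/start) and the physical block size was read with one of the commands above. The function name is hypothetical.

```python
def is_aligned(start_sector, physical_block_size=4096, unit=512):
    """Return True if a partition start falls on a physical-sector boundary.

    start_sector is counted in `unit`-byte sectors (512 in sysfs).
    """
    return (start_sector * unit) % physical_block_size == 0


print(is_aligned(2048))  # 1 MiB offset used by modern partitioners -> True
print(is_aligned(63))    # legacy DOS-style offset -> False
```

A start sector of 63 lands 3584 bytes into a 4K physical sector, so every 4K filesystem block straddles two physical sectors.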

Access Time Formula

Total HDD access time = Seek time + Rotational latency + Transfer time

Seek time: Time to move the head to the correct track.
- Full stroke (track 0 to track N): ~15-20 ms
- Average seek (random access): ~7-9 ms (7200 RPM desktop), ~4 ms (15K RPM enterprise)
- Track-to-track (adjacent track): ~0.3-1 ms

Rotational latency: Time for the target sector to rotate under the head.
- 7200 RPM → one revolution = 60/7200 s = 8.33 ms
- Average rotational latency = 8.33 / 2 = 4.17 ms
- 5400 RPM → 5.56 ms average
- 15000 RPM → 2 ms average

Transfer time: Time to read/write sectors once the head is positioned.
- Depends on linear density and RPM
- Outer tracks (higher linear velocity): 250-300 MB/s
- Inner tracks (lower linear velocity, ZBR): 150-200 MB/s

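Random 4K I/O, typical case:
- 7 ms seek + 4 ms rotational + 0.016 ms transfer ≈ 11 ms → ~90 IOPS
- The true worst case (full-stroke seek plus a full revolution) is closer to 25-30 ms
- Actual random 4K IOPS for a 7200 RPM HDD: 75-150 IOPS (the upper end typically reflects command queuing or access confined to part of the disk)

Sequential throughput:
- 7200 RPM: 150-250 MB/s depending on zone
- Sequential beats random by a factor of several hundred: ~200 MB/s versus ~0.4 MB/s (90 IOPS × 4 KB)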
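
The access time formula can be checked numerically. The sketch below uses the figures quoted in this section (7 ms average seek, 7200 RPM, 250 MB/s media rate); the function names are illustrative only.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half of one revolution, in milliseconds."""
    return (60_000.0 / rpm) / 2


def transfer_ms(nbytes, mb_per_s):
    """Time to move nbytes at a sustained media rate given in MB/s (10^6 bytes/s)."""
    return nbytes / (mb_per_s * 1_000_000) * 1000


def access_time_ms(seek_ms, rpm, nbytes, mb_per_s):
    """Seek time + rotational latency + transfer time."""
    return seek_ms + rotational_latency_ms(rpm) + transfer_ms(nbytes, mb_per_s)


t = access_time_ms(seek_ms=7.0, rpm=7200, nbytes=4096, mb_per_s=250)
print(f"{t:.2f} ms per I/O, about {1000 / t:.0f} IOPS")
# prints: 11.18 ms per I/O, about 89 IOPS
```

Note how the transfer term is negligible for a 4K request: almost all of the time goes to mechanical positioning, which is why request size barely affects random IOPS on an HDD.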

Zone Bit Recording (ZBR)

Outer tracks have larger circumference → more sectors per track at the same linear bit density. ZBR groups tracks into zones (typically 30-60 zones) and assigns a fixed number of sectors per track within each zone. Outer zones: more sectors/track, higher data rate. Inner zones: fewer sectors/track, lower data rate.

Zone 0 (outer): 1024 sectors/track, 250 MB/s
Zone 1:          980 sectors/track, 240 MB/s
...
Zone N (inner):  600 sectors/track, 150 MB/s

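Performance drops ~40% from outer to inner tracks. A sequential pass over the whole device, from outer to inner zones, shows steadily decreasing throughput. This is visible in whole-device benchmarks such as a fio sequential job with --filename=/dev/sda (use a read-only job on any disk that holds data).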
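
The ~40% figure follows directly from the sectors-per-track ratio, because at constant RPM the media rate of a track is proportional to the number of sectors on it. A small sketch, using the illustrative sector counts from the zone list above; the 7200 RPM and 4096-byte sector values are assumptions and cancel out of the ratio.

```python
def track_rate_mb_s(sectors_per_track, sector_bytes, rpm):
    """Media rate of one track: bytes per revolution times revolutions per second."""
    return sectors_per_track * sector_bytes * (rpm / 60) / 1e6


# Assumed values: 7200 RPM, 4096-byte sectors (both cancel in the ratio below).
outer = track_rate_mb_s(1024, 4096, 7200)
inner = track_rate_mb_s(600, 4096, 7200)
drop = 1 - inner / outer
print(f"inner zone is {drop:.0%} slower than the outer zone")
# prints: inner zone is 41% slower than the outer zone
```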

Shingled Magnetic Recording (SMR)

Conventional Magnetic Recording (CMR) writes non-overlapping tracks. It is often loosely called PMR, although Perpendicular Magnetic Recording describes the bit orientation and is used by shingled drives as well. SMR overlaps written tracks like roof shingles to increase areal density.

CMR tracks:          |===Track 0===|===Track 1===|===Track 2===|
(non-overlapping)

SMR tracks:    |===Track 0=====|
                          |===Track 1=====|
                                     |===Track 2=====|

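SMR consequences:
- Tracks are organized in SMR bands (typically 20-40 MB each).
- Within a band, tracks can only be written sequentially: rewriting track 1 clobbers the overlapping track 2, rewriting track 2 clobbers track 3, and so on, so every later track in the band must be rewritten too.
- Random writes are first staged in a buffer (cache) area; background cleaning later merges them into place with a read-modify-write of the entire band.
- SMR drives come in three variants:
  - Drive-Managed SMR (DM-SMR): A translation layer in the drive firmware handles band management internally. The drive appears as a normal block device. Write performance degrades severely under sustained random writes (Seagate Archive series), which has caused serious problems in ZFS/RAID arrays.
  - Host-Managed SMR (HM-SMR): The host controls the zones and must write each one sequentially. Requires zone-aware software (ZBC/ZAC commands, Linux zoned block device support). Predictable performance.
  - Host-Aware SMR (HA-SMR): Hybrid. Accepts random writes like a DM-SMR drive but also exposes its zones so that a zone-aware host can write them sequentially.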
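
The cost of an in-place update inside a shingled band can be sketched with a deliberately simplified model: each track overlaps only its successor, and nothing is cached. The function name and the 20-track band are hypothetical; real firmware behavior is more involved.

```python
def tracks_to_rewrite(band_tracks, target_track):
    """Tracks that must be written to update one track in a shingled band.

    Writing track i damages track i+1, so updating `target_track` forces a
    rewrite of it and every later track up to the end of the band.
    """
    if not 0 <= target_track < band_tracks:
        raise ValueError("target_track must lie inside the band")
    return band_tracks - target_track


band = 20  # assumed number of tracks per band, for illustration only
print(tracks_to_rewrite(band, 0))         # first track: whole band -> 20
print(tracks_to_rewrite(band, band - 1))  # last track: nothing overlaps it -> 1
```

Appending at the end of a band is as cheap as on a CMR drive, which is why host-managed drives expose a sequential-write constraint instead of hiding it.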

# Detect zoned (SMR) drives
lsblk -d -o NAME,TYPE,ZONED /dev/sda
cat /sys/block/sda/queue/zoned
# Values: none, host-aware, host-managed
# Drive-managed SMR reports "none": the host cannot distinguish it from CMR this way

SMART Attributes and Failure Indicators

Self-Monitoring Analysis and Reporting Technology (SMART) exposes drive health metrics:

smartctl -a /dev/sda

Key attributes for HDDs:

Attribute ID	Name	Critical?	Notes
5	Reallocated Sector Count	YES	Sectors remapped due to read errors
7	Seek Error Rate	No	Vendor-specific scale (Seagate)
9	Power-On Hours	No	Total usage time
10	Spin Retry Count	Yes	Bearing/spindle health
187	Reported Uncorrectable Errors	YES	ECC-uncorrectable reads
188	Command Timeout	Yes	I/O timeouts
196	Reallocation Event Count	Yes	Any reallocation is a warning
197	Current Pending Sector Count	YES	Sectors awaiting reallocation
198	Uncorrectable Sector Count	YES	Read errors without recovery

Rule of thumb: Any non-zero raw value for attributes 5, 187, 196, 197, or 198 should trigger an immediate backup and a plan to replace the disk.
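
The rule of thumb is easy to automate. The sketch below parses the attribute table printed by smartctl -A and reports non-zero raw values for the IDs listed above. The sample text is made up for illustration, and the parser assumes the usual ten-column layout with a plain integer in the RAW_VALUE column; vendor-specific raw formats would need extra handling.

```python
CRITICAL_IDS = {5, 187, 196, 197, 198}


def flag_critical(smartctl_output):
    """Return {attribute_id: raw_value} for critical attributes with raw > 0."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, raw = int(fields[0]), fields[9]
        if attr_id in CRITICAL_IDS and raw.isdigit() and int(raw) > 0:
            flagged[attr_id] = int(raw)
    return flagged


# Hypothetical sample in smartctl's attribute-table layout (not real drive data).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21034
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       2
"""

print(flag_critical(SAMPLE))  # {5: 8, 198: 2}
```

The check uses the raw value, not the normalized VALUE column, because normalized values often stay at 100 until many sectors have already been remapped.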

HDD Failure Modes

Gradual sector degradation (bad sectors): Magnetic domains weaken over time. The drive's ECC can correct a limited number of bit errors, but once the damage exceeds ECC capability the sector becomes unreadable. The drive may remap it to a spare sector (reallocated sector count increases). Performance impact: read retries cause latency spikes.

Bearing failure: Spindle or actuator bearings wear out. Symptoms: grinding noise, elevated spin retry count, eventual inability to spin up. Often precedes total failure within weeks.

Head crash: Read/write head contacts platter surface. Causes physical gouging of both head and platter. Can be triggered by shock/vibration while spinning. Often catastrophic — immediate data loss.

Stiction: Heads park on platter surface when powered off (older drives). Can stick to platter. Drive fails to spin up. Less common in modern drives with ramp parking.

PCB failure: Controller board failure. Drive is mechanically fine but electronically dead. Data recovery services can transplant PCB (requires matching firmware chip).

Thermal expansion: HDDs sensitive to temperature. High temp → head position drift → read errors. Operating range typically 0-60°C. Enterprise drives specify 5°C/minute thermal ramp rate.

Historical Context

IBM introduced the first HDD, the IBM 350 RAMAC, in 1956. It stored 5 MB on 50 24-inch platters, weighed over a ton, and rented for $3,200/month. By 1980, 5.25" form factor drives from Seagate (the ST-506) established the standard PC disk size. The 3.5" and 2.5" form factors followed later in the 1980s.

Through the 1990s and 2000s, areal density grew rapidly, at times approaching 100% per year (Kryder's Law), but this pace slowed dramatically around 2010. The 2011 Thailand floods inundated major HDD manufacturing facilities (notably Western Digital's plants) along with component suppliers that Seagate and others depended on, causing a global shortage and price spike that lasted about 18 months and accelerated SSD adoption.

The transition to 4K Advanced Format sectors (2010-2014) caused widespread compatibility problems with older operating systems that assumed 512-byte sectors, requiring 512e emulation layers. Windows XP on 512e drives with misaligned partitions caused severe performance degradation.

Production Examples

Backblaze Pod storage: Backblaze publishes quarterly HDD reliability reports using their fleet of 200,000+ consumer HDDs in Storage Pods. Key findings: annual failure rates of 0.5%-5% depending on model, with some Seagate models (notably ST3000DM001) hitting 13% annual failure rates. Essential reading for HDD selection: https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data

ZFS + DM-SMR disaster: Multiple reports in 2020-2021 of users unknowingly buying DM-SMR drives (Seagate Barracuda, WD Red without "Plus") for NAS/ZFS builds. A RAIDZ resilver issues sustained, largely non-sequential writes that exhaust the drive's cache area and force constant band rewrites. Drives would drop to 1-2 MB/s during resilver, and some timed out and were ejected from the pool. Western Digital was criticized for not labeling SMR drives clearly.

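Google Disk Failure Study (2007): Pinheiro et al. analyzed 100,000+ HDDs at Google. Found: failure rates rise with age (AFR from about 1.7% in the first year to over 8% by years two and three); scan errors and reallocation counts are strongly predictive, while temperature and utilization correlate far less with failure than previously assumed; a large fraction of failed drives showed no SMART warning beforehand, so SMART alone is a weak predictor; there is no single dominant failure mode.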

Debugging Notes

# Watch I/O in real time
iostat -xz 1
# %util: % time device was busy
# await: average I/O wait time (ms) -- for HDD should be <20ms for light load
# svctm: service time (deprecated, unreliable)
# r_await/w_await: separate read/write latency

# Check for I/O errors in dmesg
dmesg | grep -i "error\|failed\|reset\|ata"

# Run SMART short/long test
smartctl -t short /dev/sda
smartctl -t long /dev/sda   # 1-3 hours for full surface scan
smartctl -l selftest /dev/sda  # view results

# Badblocks scan (read-only mode, the default; non-destructive but takes hours on large drives)
badblocks -sv /dev/sda

# Check HDD queue depth and scheduler
cat /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/nr_requests
cat /sys/block/sda/device/queue_depth  # native command queue depth (NCQ)

# Check rotational flag (1 = HDD)
cat /sys/block/sda/queue/rotational

Security Implications

Magnetic data remanence: When files are deleted on an HDD, the magnetic data persists until overwritten. Tools like shred or dd if=/dev/urandom are needed for secure erasure. However, reallocated sectors (spares) and bad sectors are not accessible to the host and cannot be overwritten by normal means. The drive's own ATA Secure Erase or Sanitize command is meant to cover them, but that relies on trusting the firmware; physical destruction is the only guarantee.

Acoustic attacks: Seek noise from the voice coil actuator leaks information about disk activity, and academic work has shown that malware can deliberately modulate seeks to exfiltrate data acoustically from air-gapped machines to a nearby microphone. Demonstrated academically; not known to be exploited in practice, but relevant for high-security environments.

Vibration attacks: Adversarial ultrasonic vibrations can cause HDD read errors by resonating the actuator arm. A 2017 paper ("Acoustic Denial of Service Attacks on HDDs") demonstrated this on commercially deployed systems. SSD adoption mitigates this attack vector.

Performance Implications

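NCQ (Native Command Queuing): SATA feature, introduced with the SATA II generation, that lets the drive reorder up to 32 queued commands to minimize seek distance and rotational delay (similar to an elevator algorithm). Enabled by default in the Linux kernel. Without NCQ, or when the application keeps only one I/O in flight (queue depth 1), the drive has nothing to reorder and services requests in arrival order, which for a random workload means a full seek per request.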

# Verify NCQ is active
cat /sys/block/sda/device/queue_depth
# Should show 31 or 32 for NCQ-enabled drives; 1 means NCQ is off
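
The benefit of reordering can be illustrated by comparing total head travel for arrival-order service against a single elevator sweep. This is a toy model, not the drive's actual algorithm: it counts radial distance in tracks only, whereas real NCQ firmware also accounts for rotational position. The queue contents and starting track are made-up values.

```python
def head_travel(start, tracks):
    """Total radial distance (in tracks) when servicing requests in the given order."""
    pos, total = start, 0
    for t in tracks:
        total += abs(t - pos)
        pos = t
    return total


def elevator_order(start, tracks):
    """Sweep outward from the current position, then reverse for the remainder."""
    up = sorted(t for t in tracks if t >= start)
    down = sorted((t for t in tracks if t < start), reverse=True)
    return up + down


queue = [98, 183, 37, 122, 14, 124, 65, 67]  # hypothetical target tracks
start = 53                                    # hypothetical current head position

fifo = head_travel(start, queue)
reordered = head_travel(start, elevator_order(start, queue))
print(f"arrival order: {fifo} tracks, elevator order: {reordered} tracks")
# prints: arrival order: 640 tracks, elevator order: 299 tracks
```

With only one request outstanding there is nothing to sort, which is why queue depth 1 workloads see no benefit from NCQ.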

Read-ahead: The kernel reads ahead sequentially to fill page cache. Default read-ahead: 128 KB (/sys/block/sda/queue/read_ahead_kb). For sequential streaming workloads on HDD, increase to 1024-4096 KB to keep the disk's transfer buffer full.

RAID rebuild time: A 14 TB HDD at 200 MB/s sequential read → rebuild takes 14 TB / 200 MB/s = 70,000 seconds ≈ 19 hours. During this window, any second drive failure in a RAID-5 causes data loss. RAID-5 with large modern HDDs is widely considered too risky.
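
The rebuild-window arithmetic, plus a rough estimate of the chance of a second failure during that window, can be sketched as follows. It assumes independent failures at a constant rate λ = AFR/8760 per drive-hour and ignores unrecoverable read errors and production load, so treat the result as an optimistic lower bound. The 4-drive array and 2% AFR are example values.

```python
import math


def rebuild_hours(capacity_tb, mb_per_s):
    """Time to read or write one full drive sequentially, in hours."""
    seconds = capacity_tb * 1e12 / (mb_per_s * 1e6)
    return seconds / 3600


def p_second_failure(afr, hours, surviving_drives):
    """Probability that at least one surviving drive fails within the window."""
    lam = afr / 8760  # failures per drive-hour (8760 hours per year)
    return 1 - math.exp(-lam * hours * surviving_drives)


h = rebuild_hours(14, 200)
p = p_second_failure(afr=0.02, hours=h, surviving_drives=3)
print(f"rebuild window: {h:.1f} h")  # prints: rebuild window: 19.4 h
print(f"P(second failure during rebuild): {p:.4%}")
```

In practice rebuilds run far slower than the sequential maximum because they compete with foreground I/O, and drives from the same batch do not fail independently, so real-world risk is higher than this model suggests.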

Failure Modes and Real Incidents

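2011 Thailand floods: Flooding in Q4 2011 inundated Western Digital's plants (including Bang Pa-in) and many component suppliers. Seagate's own factories, such as Korat, stayed dry but were constrained by parts shortages. Roughly 30% of global HDD production capacity was taken offline, and HDD prices as much as tripled within weeks. The PC industry's shift to SSDs accelerated partly as a consequence.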

CERN storage system degradation (2013): CERN's CASTOR tape/disk storage system experienced cascading failures when a batch of HDDs from a single firmware revision failed within weeks of each other. Batch failure due to firmware bugs in the same HDD model's power management — drives would spin down incorrectly and fail to spin back up. Lesson: avoid homogeneous disk batches.

Backblaze Pod Gen 2 vibration failures: Backblaze discovered that high-density storage pods with many drives caused resonant vibration that increased HDD error rates. Solution: rubber anti-vibration mounts. Enterprise HDDs specify vibration tolerance (RV sensors) for this reason.

Modern Usage

HDDs remain the dominant medium for:
- Cloud object storage cold tiers (AWS S3, GCS, Azure Blob back-end HDD JBOD)
- Backup and archive (Backblaze B2, tape+HDD hybrid NAS)
- Video surveillance (purpose-built WD Purple, Seagate SkyHawk with rotational vibration sensors)
- Bulk NAS (consumer: Synology, QNAP with WD Red Plus / Seagate IronWolf CMR drives)

HAMR (Heat-Assisted Magnetic Recording): Seagate's HAMR uses a laser to heat the platter surface briefly during writing, which allows higher-coercivity media and therefore higher areal density. Seagate began shipping 30 TB-class HAMR drives to hyperscale customers in 2023 and announced the platform as Mozaic 3+ in January 2024. MACH.2 is a separate Seagate technology: a second independent actuator serves half of the platter surfaces, so the drive can service two requests at once and up to double its random IOPS.

MAMR (Microwave-Assisted Magnetic Recording): The competing energy-assisted approach pursued by Western Digital and Toshiba. A spin-torque oscillator in the write head generates a microwave field instead of heating the media. Slightly lower density potential but less thermal stress on heads.

Future Directions

HAMR density roadmap: Seagate targets 50+ TB per 3.5" drive by 2026 using HAMR
Multi-actuator drives (Seagate MACH.2): Two actuators per drive → 2x IOPS for random workloads
DNA/glass storage long-term research: Microsoft's Project Silica (fused silica glass) stores data in 3D voxels, read with femtosecond laser. 75 TB/cm² potential but R/W latency measured in days currently
HDD market consolidation: By 2024, only Seagate, Western Digital, and Toshiba remain as HDD manufacturers

Exercises

Use fio to measure random 4K IOPS vs sequential throughput on an HDD, for example with fio --name=randread --filename=/dev/sdb --readonly --direct=1 --rw=randread --bs=4k --ioengine=libaio --iodepth=1 --runtime=30 --time_based. Compare the result to the theoretical access time calculation using the formula in this document.
Run iostat -xz 1 30 during an intensive HDD workload. Identify which metric (await, %util, r_await) best indicates saturation. When can %util be high but await be low?
Using smartctl -a, examine a drive's SMART data. Identify the reallocated sector count and pending sector count. What is the raw vs normalized value scale?
Research the WD Red SMR controversy (2020). List the specific models affected and the performance degradation reported under ZFS resilver. Why is ZFS particularly sensitive to DM-SMR behavior?
Calculate the RAID-5 rebuild window for an array of six 20 TB HDDs at 200 MB/s sustained sequential read. What is the probability of a second failure during rebuild if the AFR is 2%? (Hint: probability of failure in a given time window = 1 - e^(-λt), where λ = AFR/8760 per hour and t is the window in hours.)

References

Anderson, D. et al. "More Than an Interface: SCSI vs. ATA." USENIX FAST 2003.
Pinheiro, E. et al. "Failure Trends in a Large Disk Drive Population." USENIX FAST 2007.
Backblaze Hard Drive Stats: https://www.backblaze.com/cloud-storage/resources/hard-drive-test-data
WD SMR controversy: https://arstechnica.com/gadgets/2020/04/wd-nas-hard-drives-are-actually-slow-smr-drives/
Seagate HAMR: https://www.seagate.com/innovation/hamr/
ATA/ATAPI Command Set (ACS-4): INCITS 529-2018
Linux kernel SMART interface: drivers/ata/libata-scsi.c
Cornwell, M. "Anatomy of a Solid-State Drive." ACM Queue, 2012. (comparison context)