
03 — Filesystem and Storage Failures

Technical Overview

Filesystems are one of the most demanding components in systems engineering: they must guarantee durability (data written should survive crashes), consistency (no partial operations visible), and performance, all simultaneously. The failure modes in this document span missing write barriers, RAID metadata race conditions, firmware bugs that bypass all software-level protections, and catastrophic cloud storage cascade failures. Each case reveals a different point of fragility in the storage stack and produced lasting changes in how production systems handle data durability.

Prerequisites

  • Filesystem journaling: ordered, writeback, and data modes
  • fsync, fdatasync, and write barriers
  • ext3/ext4 data journaling vs metadata journaling (ext2 has no journal)
  • RAID levels: RAID5, RAID6, write hole problem
  • ZFS architecture: ZIL, COW, checksums, import/export
  • HDFS NameNode architecture: edit logs, checkpointing
  • Distributed storage: quorum writes, erasure coding

Historical Context

The first lesson every filesystem developer learns is that a successful return from write() does not mean the data is on disk; it only means the kernel has accepted the bytes, usually into the page cache. The second lesson is that power failure at any point exposes every implicit ordering assumption. Filesystems have converged on journaling and COW (copy-on-write) designs as the primary durability mechanisms, but both have subtle failure modes that production experience has continued to reveal. Cloud-era storage adds distributed consistency to the already-hard problem of single-node durability.
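The gap between "write() returned" and "data is durable" can be made concrete with a minimal sketch of a crash-safe file replacement. The helper name durable_write is hypothetical, and the directory fsync assumes a POSIX system; this is one common pattern, not the only correct one.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Atomically replace `path` with `data`, fsyncing the file and its directory."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # push Python's userspace buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to push the file's data to stable storage
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself survives a crash (POSIX systems).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Skipping either fsync leaves a window in which a crash can lose the new contents or the rename, even though every call returned successfully.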


Case Study 1: ext3 Ordered Mode Data Loss Bug (2009)

What Happened

In 2009, a data loss bug was discovered in ext3's "ordered" journaling mode, the default mode for most Linux distributions from 2001 through 2010. The bug could cause data written before a crash to be silently lost or corrupted even if the application called fsync(). It affected Linux kernels from approximately 2.6.28 until the fix landed in 2.6.30 and was backported to various distribution kernels.

The trigger was the combination of: ext3 ordered mode + delayed allocation (a performance optimization) + a specific crash timing.

Technical Root Cause

ext3 operates in three journal modes:

  • data=journal: All data and metadata are written to the journal. Slowest, safest.
  • data=ordered (default): Data blocks are written to disk before the metadata that references them is committed to the journal. This prevents a crash from exposing uninitialized or stale blocks through freshly committed metadata.
  • data=writeback: Metadata is journaled; data is written asynchronously with no ordering. Fastest, but can expose stale data after a crash.

The ordering guarantee in data=ordered is enforced by tracking the transaction's dirty data buffers on a per-transaction ordered-data list (t_sync_datalist in the JBD layer that ext3 uses). Before the journal commits the transaction's metadata, it waits for all data blocks on that list to be written to disk.

The bug: delayed allocation interaction

Delayed allocation (delalloc) is a write-path optimization. In mainline Linux it is an ext4 feature; stock ext3 allocates blocks at write() time, which is the "traditional behavior" shown below. With delalloc, when an application writes data, the kernel does not immediately assign disk blocks. It keeps the data in the page cache and waits, hoping to batch contiguous writes for better block allocation.

Application:
  write(fd, data, len)   → data in page cache, NO disk blocks assigned yet
  fsync(fd)              → ...but this is where the bug was

Without delalloc (ext3 traditional behavior):
  write()  → page cache, block allocated immediately
  fsync()  → writes data block to disk, updates journal, commits
  Crash: data is safe (was on disk before journal committed)

With delalloc (ext3 bug):
  write()  → page cache only, block NOT yet allocated
  fsync()  → triggers block allocation (first allocation)
           → data block written to disk
           → BUT: the journal ordering for data=ordered mode requires
                  that data blocks be on the "ordered data" list
                  BEFORE the journal transaction is opened
           → if the block is allocated AFTER the journal transaction opened,
              it may not be on the ordered list for that commit
           → the journal can commit metadata (inode, directory entry)
              pointing to the new block BEFORE the data block is written
           → crash between metadata journal commit and data block write:
              inode points to a block that has NOT been written
              On recovery: inode points to uninitialized (or stale) data
              Data loss without any indication of error

The critical timing:

t0: Transaction T1 opens
t1: Application write() — delalloc queues in page cache
t2: Journal T1 starts to commit
t3: fsync() triggers block allocation
t4: Data block allocated, added to page cache as "dirty"
t5: Journal T1 commits metadata (inode update) — data block NOT in T1's ordered list
t6: CRASH — data block was dirty in page cache, never reached disk
Recovery: inode from T1 commit is valid, but the block holds stale or uninitialized data
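The timeline above can be replayed in a toy model. The sketch below is not kernel code: ToyDisk, ToyTransaction, and run are hypothetical names, and the model keeps only the one property that matters here, namely that an ordered-mode commit flushes the data blocks on its ordered list before making metadata durable.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDisk:
    blocks: dict = field(default_factory=dict)  # block number -> bytes on stable storage
    inodes: dict = field(default_factory=dict)  # inode number -> block number


@dataclass
class ToyTransaction:
    metadata: dict = field(default_factory=dict)      # inode -> block pointer to commit
    ordered_data: dict = field(default_factory=dict)  # block -> data that must hit disk first

    def commit(self, disk: ToyDisk) -> None:
        # Ordered mode: flush every data block on the ordered list, then the metadata.
        for block, data in self.ordered_data.items():
            disk.blocks[block] = data
        disk.inodes.update(self.metadata)


def run(track_in_ordered_list: bool) -> str:
    """Simulate write + commit + crash; return what recovery finds behind inode 1."""
    disk = ToyDisk(blocks={7: b"stale"})   # block 7 still holds old contents
    page_cache = {7: b"fresh"}             # dirty page, not yet on disk
    txn = ToyTransaction(metadata={1: 7})  # inode 1 will point at block 7
    if track_in_ordered_list:
        txn.ordered_data[7] = page_cache[7]
    txn.commit(disk)
    # Crash here: the page cache is lost, only `disk` survives.
    return disk.blocks[disk.inodes[1]].decode()
```

run(True) models the intended behavior and recovers the new data; run(False) models a block that was allocated too late to join the ordered list, so the committed inode points at old contents.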

Why it was hard to detect: Applications calling fsync() "should" be safe. The bug required a specific combination of delalloc + crash between journal commit and data write — a narrow window that was rarely triggered in testing.

Detection

  • Reported by Theodore Ts'o and tracked as an ext3 bug in the 2.6.28 timeframe
  • Discovered through filesystem stress testing with crash injection (dm-flakey, power cycle tests)
  • Andrea Arcangeli's analysis of the ext3/delalloc interaction in 2009
  • Users reported data corruption after unexpected power loss on ext3 with delalloc enabled

Recovery

  • No automated recovery possible for already-corrupted filesystems
  • Backups from before the crash were the only recovery path
  • For users without backups, tools such as extundelete or photorec could sometimes recover data

Fix

The fix involved ensuring that when fsync() triggers allocation via data=ordered mode, the newly-allocated data blocks are added to the journal's ordered data list before the transaction commits. The fix was complex because it had to handle the interaction between delalloc, writepage(), and the journal commit path.

In practice, many distributions disabled delalloc on ext3 by default after this bug. The ext4 implementation was written with this interaction in mind and handles it correctly.

ext4 as the real fix: Linux distributions migrated default filesystems from ext3 to ext4 (which handles ordered mode + delalloc correctly) starting around 2009-2010.

Architectural Lessons

  1. Performance optimizations in the write path must re-validate ordering guarantees. Delalloc was added to improve write performance without fully re-analyzing the ordering guarantees of ordered mode.
  2. fsync() semantics are hard. ext4, XFS, btrfs all have had fsync-related data loss bugs. Full data correctness requires testing with crash injection.
  3. Default mode matters. data=ordered being the default for years meant this bug affected the default configuration.

Case Study 2: Btrfs RAID5/6 Write Hole

What Happened

As of 2024, the btrfs RAID5 and RAID6 implementations have a known, unfixed write hole that makes them unsuitable for production use without external journaling or an upstream fix. The problem has been known since approximately 2015 and repeatedly documented in the btrfs wiki ("RAID5/6 is not recommended for production use").

Technical Root Cause

The RAID write hole is a classic problem for any RAID5/6 implementation that does not use full-stripe writes or a write journal.

RAID5 layout:

Stripe 1: [D1][D2][D3][P1]   P1 = D1 XOR D2 XOR D3
Stripe 2: [D4][D5][P2][D6]   P2 = D4 XOR D5 XOR D6
...

Write a partial stripe (only D1 changes):
  1. Read old D1, old P1
  2. Compute new P1 = new_D1 XOR old_D1 XOR old_P1  (read-modify-write)
  3. Write new_D1 to disk
  4. Write new_P1 to disk

Write hole:
  Crash between step 3 and step 4:
  new_D1 on disk, old_P1 on disk
  P1 is now inconsistent with D1
  RAID array is silently corrupted
  If another disk in the stripe fails now (say D2's), D2 is rebuilt from new_D1, D3, and old_P1 and comes back as garbage
  Data loss
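The arithmetic behind the write hole can be checked directly. The following sketch uses 4-byte chunks as stand-ins for disk blocks (the values are arbitrary) and walks through the read-modify-write, the crash, and a subsequent rebuild.

```python
def xor(*chunks: bytes) -> bytes:
    """Bytewise XOR of equally sized chunks."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)


# A consistent stripe: P1 = D1 XOR D2 XOR D3.
d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
p1 = xor(d1, d2, d3)
assert xor(d1, d3, p1) == d2  # any one chunk can be rebuilt from the others

# Partial-stripe update of D1 via read-modify-write.
new_d1 = b"ZZZZ"
new_p1 = xor(new_d1, d1, p1)  # step 2: new parity computed in memory
assert new_p1 == xor(new_d1, d2, d3)

# Crash after step 3 (new_D1 written) and before step 4 (new_P1 written).
on_disk = {"D1": new_d1, "D2": d2, "D3": d3, "P1": p1}

# If the disk holding D2 fails now, the rebuild mixes new data with stale parity.
rebuilt_d2 = xor(on_disk["D1"], on_disk["D3"], on_disk["P1"])
# If the disk holding D1 fails instead, the rebuild silently returns the old D1.
rebuilt_d1 = xor(on_disk["D2"], on_disk["D3"], on_disk["P1"])
```

Neither rebuild reports an error: rebuilt_d2 differs from the real D2, and rebuilt_d1 equals the pre-update contents rather than the data that was acknowledged as written.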

Why btrfs is particularly affected:

Btrfs is a COW filesystem. On a single disk, COW means writes always go to new blocks — the write hole doesn't apply because btrfs never does read-modify-write of existing blocks. However, btrfs's RAID5/6 implementation uses the RAID stripe model, which requires parity computation, and thus has the write hole.

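The conventional solution is a write journal: log the intended stripe write to stable storage before making it, and replay the log after a crash. Linux MD RAID (mdadm) offers this as an optional journal device or partial parity log; its write-intent bitmap, by contrast, only records which regions need resyncing after a crash and does not close the write hole. btrfs has not implemented a write journal for RAID5/6.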
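A minimal sketch of the journaling idea, assuming a toy stripe held in a dict and a list standing in for the journal device. The function names are hypothetical and the model ignores journal checksums, batching, and ordering of the journal writes themselves.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_ok(disk: dict) -> bool:
    return disk["P1"] == xor(xor(disk["D1"], disk["D2"]), disk["D3"])


def journaled_update(disk: dict, journal: list, new_d1: bytes, crash_after_data: bool) -> None:
    new_p1 = xor(xor(new_d1, disk["D1"]), disk["P1"])  # read-modify-write parity
    journal.append({"D1": new_d1, "P1": new_p1})       # 1. persist the intent first
    disk["D1"] = new_d1                                # 2. in-place data write
    if crash_after_data:
        return                                         # crash: parity is stale, journal survives
    disk["P1"] = new_p1                                # 3. in-place parity write
    journal.clear()                                    # 4. retire the journal entry


def replay(disk: dict, journal: list) -> None:
    """After a crash, reapply every logged stripe update, then empty the journal."""
    for entry in journal:
        disk.update(entry)
    journal.clear()


disk = {"D1": b"AAAA", "D2": b"BBBB", "D3": b"CCCC"}
disk["P1"] = xor(xor(disk["D1"], disk["D2"]), disk["D3"])
journal: list = []
journaled_update(disk, journal, b"ZZZZ", crash_after_data=True)
consistent_before_replay = parity_ok(disk)
replay(disk, journal)
consistent_after_replay = parity_ok(disk)
```

The cost is that every partial-stripe update is written twice, once to the journal and once in place, which is why the journal usually lives on a fast device.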

Additional btrfs RAID5/6 bugs:

Beyond the write hole, there have been additional corruption bugs:

  • Incorrect handling of incomplete writes during resilvering (reconstruction after disk failure)
  • Race conditions in the parity computation for partial stripe writes
  • Scrub not detecting all forms of parity inconsistency

Current Status (2024)

The btrfs development team added a rescue=ignoredatacsums mount option for recovery but has not fixed the fundamental write hole. The btrfs wiki states:

"RAID 5/6 is currently experimental and should not be used for important data."

Discussion of a fix (RAID write journal in btrfs) has been ongoing since 2017. As of kernel 6.x, no upstream fix exists.

Architectural Lessons

  1. Parity RAID that updates partial stripes in place needs a write journal, even underneath a copy-on-write filesystem. A COW filesystem implementing RAID5/6 must either journal stripe updates or always write full stripes before it is safe for production use.
  2. "Experimental" warnings in production storage must be taken literally. Multiple users have lost data using btrfs RAID5/6.
  3. ZFS RAID-Z uses a different architecture (variable-width stripes that are always written as complete stripes) to avoid the write hole. This is the correct design.

Case Study 3: ZFS on Linux Early Bugs (2012–2016)

What Happened

ZFS was ported to Linux (ZFS on Linux, ZoL) starting around 2010, with the project maturing significantly in the 2012-2016 timeframe. The early versions had several data correctness issues that burned early adopters, including import-related corruption and specific workload patterns that triggered pool damage.

Technical Root Cause

Pool import after unclean export:

ZFS uses an uberblock, the top-level pointer into the pool's entire data structure, which is written atomically into an uberblock array (up to 128 slots in each vdev label). On import, ZFS picks the valid uberblock with the highest transaction group number and mounts from there.
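The selection rule can be sketched as follows. This is an illustration only: ToyUberblock, its fields, and the SHA-256 digest are assumptions for the example and do not reflect the on-disk ZFS format.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToyUberblock:
    txg: int             # transaction group that wrote this slot
    root_pointer: bytes  # stand-in for the pointer to the pool's root block
    checksum: bytes      # digest over txg and root_pointer


def digest(txg: int, root_pointer: bytes) -> bytes:
    return hashlib.sha256(txg.to_bytes(8, "little") + root_pointer).digest()


def make_uberblock(txg: int, root_pointer: bytes) -> ToyUberblock:
    return ToyUberblock(txg, root_pointer, digest(txg, root_pointer))


def select_uberblock(slots: List[ToyUberblock]) -> Optional[ToyUberblock]:
    """Return the highest-txg slot whose checksum verifies, or None if none do."""
    valid = [ub for ub in slots if ub.checksum == digest(ub.txg, ub.root_pointer)]
    return max(valid, key=lambda ub: ub.txg, default=None)


slots = [make_uberblock(100, b"root-100"), make_uberblock(101, b"root-101")]
# A torn write: txg 102 reached its slot, but the payload does not match the checksum.
torn = ToyUberblock(102, b"root-1", digest(102, b"root-102"))
chosen = select_uberblock(slots + [torn])
```

Validating the checksum before comparing transaction group numbers is what keeps a torn uberblock from being chosen just because it carries the highest number.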

Early ZoL bug (2012-2013): On systems with multiple pools, under high memory pressure, the ZFS ARC (Adaptive Replacement Cache) could be reclaiming pages while a pool import was in progress. In rare cases, the uberblock selection code would pick a partially-written uberblock (from a previous transaction group) instead of the latest valid one. The pool would mount successfully but from an older transaction group — dropping recent writes.

The zpool import -f danger:

The -f (force) flag on zpool import ignores the "pool is already imported" status. In 2013-2015, several production incidents occurred when operators on systems with shared storage (iSCSI, FC) force-imported a pool that was already mounted on another host. ZFS in this period did not have adequate cross-host fencing. The result was two ZFS instances writing to the same pool simultaneously — catastrophic, unrecoverable corruption.

Scrub and resilver interaction (2014-2015):

On ZoL 0.6.x, if a resilver (reconstruction after disk failure) was interrupted at exactly the wrong time and then a scrub was run immediately, the scrub's repair code could incorrectly mark certain blocks as repaired when they were not. This required a specific combination of disk failure + interrupted resilver + immediate scrub, making it rare but not impossible.

Detection

  • Pool import issues detected by comparing the active uberblock's transaction group number (for example via zdb -u) against the expected value
  • Silent data loss detected only via application-level checksums or data comparison
  • The -f import corruption was typically undetectable until data access found corruption

Recovery

  • ZFS zpool import -o readonly=on allows importing a pool read-only for data recovery
  • zpool scrub after resilver to verify data integrity
  • ZFS send/receive snapshots for backup/restore

Fix

  • ZoL 0.6.4 (2014) fixed the ARC-under-pressure import bug
  • ZoL 0.6.5 (2015) added import locking improvements
  • The multihost pool property (multi-modifier protection, MMP), added later in ZoL 0.7, detects a pool that is still active on another host and refuses the import
  • SCSI PR (Persistent Reservations) recommended for shared storage to provide host-level fencing

Architectural Lessons

  1. ZFS's end-to-end checksumming detected many corruptions that other filesystems would have silently accepted. Even with the early bugs, ZFS's checksum infrastructure limited the blast radius.
  2. Shared storage with ZFS requires external fencing. ZFS trusts that only one host imports a pool. This must be enforced at the storage layer.
  3. Force flags on storage commands are extremely dangerous. Operators must understand what "force" overrides.

Case Study 4: Amazon EBS 2011 Outage — Storage Cascade Failure

What Happened

On April 21, 2011, AWS experienced a major outage in the us-east-1 region. The storage subsystem (EBS) experienced a cascade failure that took approximately 10 days to fully resolve for all affected volumes. This was the most significant public cloud outage to that date and exposed fundamental design gaps in multi-AZ architecture as understood by customers.

Technical Root Cause

Phase 1: Network configuration change triggers the cascade

AWS was executing a routine network upgrade to improve redundancy in us-east-1. The change involved shifting network traffic between backbone routers. Due to a miscommunication, the traffic was moved to a "lower-capacity" backup network path instead of the intended path. This caused a traffic surge on the backup network.

Phase 2: EBS replication triggers a storm

EBS volumes at the time were replicated within an Availability Zone. When the congestion cut EBS primary nodes off from their replicas, hundreds of EBS nodes simultaneously entered a "re-mirroring" state, attempting to re-sync their replicas across the now-congested network.

Normal state:
  EBS Primary Node → replicates to → EBS Replica Node
  (within AZ, low-latency path)

During network congestion:
  Primary loses contact with Replica
  Primary decides: "I need to find a new replica"
  Primary searches the cluster for available replica capacity
  HUNDREDS of primaries do this simultaneously

Re-mirroring storm:
  Each Primary requests replica space
  Each Primary begins copying its entire volume to new replica
  Network (already congested) now carries full EBS volume copy traffic
  Network congestion worsens
  More primaries lose contact with replicas
  More re-mirroring starts
  Positive feedback loop → cascade
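The feedback loop can be illustrated with a deliberately crude model. Everything here is an assumption made for the sketch: one unit of bandwidth per re-mirror, congestion that cuts off exactly as many healthy primaries as the link is overloaded by, and the parameter names. It does not describe how EBS actually schedules re-mirroring.

```python
from typing import List, Optional


def simulate(primaries: int, initially_cut_off: int, link_capacity: int,
             max_concurrent: Optional[int], steps: int = 8) -> List[int]:
    """Return how many primaries lack a replica after each step."""
    cut_off = initially_cut_off
    history = []
    for _ in range(steps):
        # Every cut-off primary wants to re-mirror; an optional cap throttles them.
        active = cut_off if max_concurrent is None else min(cut_off, max_concurrent)
        if active <= link_capacity:
            # The link copes: the active re-mirrors finish and those volumes recover.
            cut_off -= active
        else:
            # The link is overloaded: nothing finishes, and the excess load
            # cuts that many additional healthy primaries off from their replicas.
            overload = active - link_capacity
            healthy = primaries - cut_off
            cut_off += min(healthy, overload)
        history.append(cut_off)
    return history


uncapped = simulate(1000, 150, 100, max_concurrent=None)
capped = simulate(1000, 150, 100, max_concurrent=80)
```

In this model the uncapped run grows until every primary is re-mirroring, while the capped run drains the backlog in two steps, which is the intuition behind rate-limiting re-mirroring traffic.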

Phase 3: Control plane overload

The EBS control plane (the cluster management system responsible for assigning replica capacity) was overwhelmed by the simultaneous re-mirroring requests. It began responding slowly, then stopped responding. EBS volumes that were mid-replication were left in a "stuck" state — neither fully operational nor failed.

Phase 4: The "stuck" volume problem

Tens of thousands of EBS volumes ended up in a "stuck" state:

  • The primary data was intact
  • The replica was partially synchronized
  • The control plane could not process state transitions to resolve the situation

Recovering each stuck volume required manual intervention by AWS engineers: verifying that the primary data was intact, canceling the failed replication, and initiating a fresh replica sync. With tens of thousands of affected volumes, this took days.

Phase 5: Multi-AZ illusion

Many customers believed Multi-AZ deployments (e.g., RDS Multi-AZ) would protect them. They did not. The EBS volumes in both AZs in us-east-1 were part of the same EBS cluster infrastructure. The network congestion affected both. Customers whose RDS instances failed over to the secondary AZ found the secondary's EBS volume was also stuck.

The architectural flaw: multi-AZ redundancy at the EC2/RDS level was not backed by full storage-layer AZ isolation in the 2011 EBS design.

Detection

AWS internal monitoring detected the network anomaly quickly. The EBS re-mirroring storm was detected by the control plane request rate spike. Customer impact (volume unavailability) began appearing in CloudWatch metrics and customer tickets within 20 minutes.

Recovery

  • AWS engineers developed a batch process to verify and recover stuck volumes
  • Manual recovery at scale: 10+ days to resolve all affected volumes
  • Customers with EBS snapshots could restore; customers without were dependent on AWS recovery

Architectural Lessons

  1. Re-mirroring storms are a fundamental availability risk for replicated storage. Modern EBS uses rate-limiting and backoff for re-mirroring to prevent cascade.
  2. Multi-AZ must mean truly independent failure domains at every layer. Post-2011, EBS redesigned storage clusters to be fully AZ-isolated.
  3. Control planes must not be on the same failure path as data planes. The EBS control plane being overwhelmed by volume re-mirroring requests (a data-plane event) was a critical design flaw.
  4. Human-scale recovery processes do not work at cloud scale. AWS built automated volume recovery pipelines post-incident.
  5. This event accelerated adoption of S3 + ephemeral storage architectures (as championed by the "cattle vs pets" movement) — recognizing that EBS persistence guarantees were weaker than assumed.

Reference: AWS post-mortem "Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region" — available at aws.amazon.com/message/65648.


Case Study 5: Cloudflare SSD Firmware Silent Data Corruption (2015)

What Happened

In late 2015, Cloudflare engineers discovered that a specific model of SSD (Samsung 840 EVO) deployed in their edge network was silently corrupting data. The corruption was not immediately causing crashes — data was being written but later reading back differently. The first indication was fsck errors appearing on drives that had not experienced power loss or hardware failure events.

Technical Root Cause

The Samsung 840 EVO SSD had a firmware bug affecting NAND data retention: data that was written and then infrequently read could lose its charge state over time due to a bug in the NAND refresh logic. Specifically, the NAND cells storing data that had not been read in weeks or months could have their charge degrade in a pattern that the ECC logic misclassified — returning incorrect data rather than an ECC error.

The bug only manifested on data that was:

  1. Written and then not read for an extended period (weeks to months)
  2. In specific NAND blocks that had higher wear (more program/erase cycles)

For Cloudflare's use case, edge cache storage contained objects that were cached but infrequently accessed — exactly the pattern that triggered the bug.

Why "silent": The SSD returned data with incorrect content but with a valid ECC signature. The firmware computed the ECC over the (corrupted) data rather than detecting the corruption. From the host's perspective, every read succeeded with status 0.

Detection

  • Cloudflare's filesystem scrubbing (btrfs scrub or periodic read-verify) found mismatches between expected checksums and actual data
  • The pattern: corruption only on older data (written weeks+ ago), specific SSD firmware version
  • Samsung independently discovered the bug and released firmware 0DB3 in April 2015 fixing it
  • No customer data was lost: Cloudflare detected the corruption before serving corrupted data, and edge caches are non-authoritative (origin server holds the truth)
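The read-verify pattern described above can be sketched at the application level. The class and function names below are hypothetical, the in-memory dict stands in for the SSD, and nothing here is taken from Cloudflare's actual cache implementation.

```python
import hashlib
from typing import Callable, Dict, Tuple


class CorruptionError(Exception):
    """Raised when stored bytes no longer match the digest recorded at write time."""


class ChecksummedCache:
    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[bytes, bytearray]] = {}

    def put(self, key: str, payload: bytes) -> None:
        # The digest is computed by the host at write time, independent of the drive's ECC.
        self._entries[key] = (hashlib.sha256(payload).digest(), bytearray(payload))

    def get(self, key: str) -> bytes:
        digest, payload = self._entries[key]
        if hashlib.sha256(payload).digest() != digest:
            raise CorruptionError(key)
        return bytes(payload)

    def flip_bit_for_demo(self, key: str) -> None:
        # Simulate the drive returning altered data with a success status.
        self._entries[key][1][0] ^= 0x01


def read_through(cache: ChecksummedCache, key: str,
                 fetch_origin: Callable[[str], bytes]) -> bytes:
    """Serve from cache if it verifies; otherwise re-fetch from origin and repair."""
    try:
        return cache.get(key)
    except (KeyError, CorruptionError):
        payload = fetch_origin(key)
        cache.put(key, payload)
        return payload
```

Because the cache is non-authoritative, a checksum mismatch is handled the same way as a cache miss, so corruption costs one extra origin fetch rather than a wrong response.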

Recovery

  • Firmware update to Samsung 840 EVO 0DB3 resolved the NAND refresh issue
  • Cloudflare's edge architecture (caches are not authoritative) meant no data loss — corrupted cache entries were simply re-fetched from origin

Architectural Lessons

  1. End-to-end checksums are essential. Without filesystem-level or application-level checksums, this corruption would have been served silently to users for months.
  2. Hardware can fail in modes that return success status codes. The assumption that a read which completes without error returned correct data is wrong. Checksumming filesystems such as ZFS and btrfs, or explicit application-level checksums, are necessary to catch this.
  3. SSD firmware bugs are a real production risk. Firmware validation and monitoring should be part of storage qualification.
  4. Cloudflare's architecture saved them. Non-authoritative caching with origin fallback meant corruption was a performance issue, not a data loss incident.

Case Study 6: Facebook HDFS Birthday Bug in NameNode Checkpointing (2011)

What Happened

In 2011, Facebook's HDFS (Hadoop Distributed File System) cluster serving their data warehouse experienced a NameNode failure that threatened the loss of metadata for petabytes of data. The root cause was a wraparound of the 32-bit transaction ID counter used by the NameNode's checkpointing logic, the "birthday bug" of the title.

Technical Root Cause

HDFS stores filesystem metadata (all file paths, block locations, access times) on a single NameNode. The NameNode has two metadata structures:

  • fsimage: A full snapshot of the filesystem at a point in time
  • edits log: A transaction log of all changes since the last fsimage

Periodically, a Secondary NameNode (or Standby NameNode in later versions) merges the edits log into a new fsimage — a "checkpoint" operation.

Transaction ID collision:

HDFS transaction IDs in the version deployed at Facebook were represented as int32 values (32-bit signed integers). The maximum value is 2,147,483,647 (~2.1 billion).

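Facebook's HDFS cluster processed approximately 40 million metadata transactions per day. At that rate the counter reaches 2,147,483,647 after about 53 days (2,147,483,647 / 40,000,000 ≈ 53.7) and overflows. In two's-complement arithmetic, including Java's int, the next increment wraps to -2,147,483,648; the diagram below shows the simplified case of a counter that restarts at 0. Either way, every post-overflow ID compares as smaller than the last pre-overflow checkpoint ID.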

Transaction ID space: 0 to 2,147,483,647

After ~53 days at Facebook's transaction rate:
  Transaction ID counter wraps: 2,147,483,647 → 0 (overflow)

The checkpoint comparison logic:
  if (new_txid > checkpoint_txid) { take_checkpoint(); }

  After wrap: new_txid = 50,000 (post-overflow)
              checkpoint_txid = 2,147,483,000 (pre-overflow)
  Comparison: 50,000 > 2,147,483,000 is FALSE
  Result: No checkpoint taken (ever, after the wrap)

The edits log grows indefinitely
NameNode runs out of disk space for edits log
NameNode crashes
Recovery requires replaying 2.1B transactions from the beginning of the log
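The comparison failure is easy to reproduce. The sketch below emulates a signed 32-bit counter in Python (whose integers do not overflow on their own); to_int32 and should_checkpoint are illustrative names, and the comparison mirrors the pseudocode above rather than any actual HDFS source.

```python
INT32_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Wrap a Python int to signed 32-bit two's complement, as a Java int would."""
    return (value + 2**31) % 2**32 - 2**31


def should_checkpoint(new_txid: int, checkpoint_txid: int) -> bool:
    return new_txid > checkpoint_txid


# Time to exhaust the ID space at 40 million transactions per day.
days_to_overflow = INT32_MAX / 40_000_000

checkpoint_txid = 2_147_483_000
before_wrap = to_int32(checkpoint_txid + 500)  # still below INT32_MAX
after_wrap = to_int32(INT32_MAX + 50_000)      # 50,000 transactions past the limit

fires_before = should_checkpoint(before_wrap, checkpoint_txid)
fires_after = should_checkpoint(after_wrap, checkpoint_txid)
```

fires_before is True and fires_after is False: once the counter wraps, the condition stays false until the counter climbs all the way back past the stale checkpoint ID, which is another full pass through the ID space.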

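This is the "birthday" framing: given enough transactions, the ID space wraps around and reuses IDs that were already issued. Strictly speaking this is deterministic counter wraparound rather than the probabilistic birthday paradox, which concerns collisions among randomly chosen IDs.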

The actual failure: When the NameNode was restarted after running out of disk space, it needed to replay the entire edit log. With 2.1+ billion transactions in the log, this took many hours. During this time the cluster was completely unavailable, because every client operation, read or write, goes through the single NameNode.

Detection

  • Alert on NameNode disk usage growing continuously without bound
  • The checkpoint metric (checkpoints should occur every N minutes) showed no checkpoints being taken
  • HDFS health dashboard showed "last checkpoint: X days ago" (where X was growing)

Recovery

  • Emergency NameNode restart, 53+ hours of edits log replay
  • All data was intact (HDFS data is on DataNodes, not NameNode) but inaccessible during replay
  • The fix: change transaction IDs from int32 to int64 (maximum about 9.2 × 10^18, which at 40 million transactions per day lasts roughly 630 million years)

Architectural Lessons

  1. Integer overflow in infrastructure ID spaces is a latent time bomb. Any counter that grows monotonically in a production system must be sized for the expected operational lifetime with large margin.
  2. ID width must be checked against actual throughput. 32-bit IDs that seemed "large enough" in 2006 were exhausted in under two months at Facebook's 2011 transaction rate.
  3. HDFS single NameNode is a SPOF. The proper fix was HDFS NameNode HA (high availability), which was implemented in Hadoop 2.0.
  4. Checkpoint monitoring must be a first-class operational metric. The failure could have been detected and fixed without incident if "last successful checkpoint time" had an alert threshold.
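The "last successful checkpoint time" alert from lesson 4 amounts to a staleness check. A minimal sketch follows; the function name and the six-hour default threshold are assumptions chosen for illustration, not values from any HDFS deployment.

```python
from datetime import datetime, timedelta, timezone


def checkpoint_is_stale(last_checkpoint: datetime, now: datetime,
                        max_age: timedelta = timedelta(hours=6)) -> bool:
    """Return True when the most recent successful checkpoint is older than max_age."""
    return now - last_checkpoint > max_age


now = datetime(2011, 6, 1, 12, 0, tzinfo=timezone.utc)
recent = checkpoint_is_stale(now - timedelta(hours=1), now)
overdue = checkpoint_is_stale(now - timedelta(days=3), now)
```

Alerting on the age of the last success, rather than on checkpoint failures, also catches the case in this incident, where the checkpoint was never attempted and so never failed.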

ASCII Diagram: Storage Write Path Failure Points

Application
    │
    │ write()
    ▼
Page Cache (RAM)
    │
    │ [delayed allocation — block not yet assigned]
    │ [write barrier — ensures ordering]
    ▼
Block Layer / I/O Scheduler
    │
    │ [RAID parity computation — write hole here]
    │ [EBS replication — replication storm here]
    ▼
Storage Controller
    │
    │ [SSD firmware — silent corruption here]
    │ [HDD write cache — power loss vulnerability]
    ▼
Physical Media
    │
    │ [NAND charge decay — Samsung 840 EVO bug]
    │ [DRAM bit flip — Rowhammer]
    ▼
Verified on-disk data

Checksums must span the entire path:
  Application checksum → validates data at write
  Filesystem checksum (ZFS/btrfs) → validates block layer integrity
  RAID checksum → validates device-level integrity
  No single layer is sufficient

Debugging Notes

# Check ext3/ext4 journal mode
tune2fs -l /dev/sdX | grep "Default mount options"

# Force filesystem check
fsck -f /dev/sdX

# ZFS pool health
zpool status -v  # -v for verbose error counts
zpool scrub tank  # start scrub, verify all data

# Check SMART data for SSD/HDD health
smartctl -a /dev/sdX
smartctl -t long /dev/sdX  # run extended self-test

# EBS volume status from AWS CLI
aws ec2 describe-volumes --filters Name=status,Values=error

# Check filesystem mount options
cat /proc/mounts

# btrfs scrub status
btrfs scrub status /mount/point

# Identify SSD firmware version
smartctl -i /dev/sdX | grep "Firmware"
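
When checking many hosts, the journal mode can be pulled out of /proc/mounts programmatically. The sketch below parses the standard whitespace-separated mounts format; the function name is hypothetical, and a result of None only means no data= option is listed, which on many kernels is how the filesystem default is reported.

```python
from typing import Dict, Optional


def journal_modes(mounts_text: str) -> Dict[str, Optional[str]]:
    """Map mount point -> data= journaling mode for ext3/ext4 lines of /proc/mounts text."""
    modes: Dict[str, Optional[str]] = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("ext3", "ext4"):
            continue
        options = fields[3].split(",")
        mode = next((o.split("=", 1)[1] for o in options if o.startswith("data=")), None)
        modes[fields[1]] = mode
    return modes


sample = """\
/dev/sda1 / ext4 rw,relatime,errors=remount-ro,data=ordered 0 0
/dev/sdb1 /var/lib/db ext4 rw,noatime,data=journal 0 0
/dev/sdc1 /scratch ext4 rw,relatime 0 0
tmpfs /run tmpfs rw,nosuid,nodev 0 0
"""
modes = journal_modes(sample)
```

On a live system the same function can be fed the contents of /proc/mounts, for example journal_modes(open("/proc/mounts").read()).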

Security Implications

  • Silent data corruption can be used to inject malicious code into stored binaries (if an attacker can influence which blocks degrade)
  • Storage layer corruption bypasses application-level security if integrity is assumed
  • EBS outage exposed customer data access dependencies — forced disclosure of architectural assumptions to customers

Performance Implications

  • ext4 data=journal mode: 30-50% write throughput reduction vs data=ordered
  • ZFS checksumming: ~1-3% CPU overhead, ~5% throughput overhead
  • btrfs RAID5/6: avoid — performance and correctness are both inferior to ZFS RAID-Z
  • EBS re-mirroring: consumes significant network bandwidth; must be rate-limited

Failure Modes

| Component | Failure Mode | Detection | Recovery |
|---|---|---|---|
| ext3 ordered + delalloc | Metadata journaled before data written | Crash, fsck, data mismatch | Restore from backup |
| btrfs RAID5/6 | Write hole on partial stripe | Scrub finds parity inconsistency | Unrecoverable without backup |
| ZFS force import | Dual-mount corruption | ZFS checksum errors | ZFS send/receive from snapshot |
| EBS 2011 | Re-mirroring cascade, control plane overload | Volume stuck in error state | AWS manual recovery |
| Samsung 840 EVO | Charge decay → silent corruption | Filesystem scrub, checksum mismatch | Firmware update, data from origin |
| HDFS NameNode txid overflow | No checkpoint taken, disk exhaustion | Checkpoint metric alert | Edits log replay (hours) |

Modern Usage

  • ext4 remains the most widely deployed Linux filesystem (default on Ubuntu and Debian). Ordered mode with working write barriers is safe for most workloads.
  • XFS is preferred for large files and high-throughput workloads (default on RHEL 7+)
  • ZFS (via OpenZFS) is the gold standard for NAS/SAN with end-to-end checksumming and RAID-Z avoiding write hole
  • btrfs suitable for RAID1/RAID10; avoid RAID5/6 in production
  • io_uring (Linux 5.1+) provides new async I/O path — has had its own journaling interaction bugs in early kernels, now mature

Future Directions

  • NVMe over Fabrics (NVMe-oF): Distributed storage at NVMe latency — requires new thinking about write barriers in networked storage
  • ZNS (Zoned Namespace) SSDs: New SSD interface that eliminates the write hole at the hardware level for workloads that can use zone-sequential writes
  • Persistent Memory (pmem/DCPMM): Intel Optane DCPMM brought byte-addressable persistence. Filesystems (NOVA, ext4-dax) for pmem have different durability semantics — new failure modes are being discovered
  • f2fs (Flash-Friendly FS): Android default for /data on many devices, designed for NAND characteristics; ongoing work on power-loss correctness

Exercises

  1. Create an ext4 filesystem on a loopback device. Use dm-flakey to simulate a power failure at a specific point during a write. Compare data consistency between data=ordered and data=journal modes. Use debugfs to inspect journal state.

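  2. Set up a btrfs RAID5 volume on three loopback devices. Write data, then simulate a crash during a partial-stripe write. Killing the writer with kill -9 and unmounting cleanly is not enough, because the kernel still flushes pending writes; instead drop in-flight writes with dm-flakey (as in exercise 1) or by hard-resetting a test VM. Remount and run btrfs scrub. Observe whether btrfs detects the inconsistency.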

  3. Deploy ZFS on a test system. Create a zpool with mirrored vdevs. Use zpool scrub to verify data integrity. Then inject a disk error using zpool offline + manual dd of random data to the device file. Observe ZFS scrub detecting and correcting the error.

  4. Reproduce the HDFS transaction ID overflow scenario in a simplified simulation: write a Python program that counts transactions with an int32 counter and a checkpoint function that only fires when txid > last_checkpoint_txid. Observe when the condition permanently fails after overflow.

  5. Read the AWS post-mortem for the April 2011 outage (aws.amazon.com/message/65648). Identify the three distinct failure modes (network, re-mirroring cascade, control plane). Propose an architectural change for each that would have prevented or limited that phase of the failure.

References

  • Ts'o, Theodore. "The disk I'm not sure about." LWN.net discussion of ext3 ordered mode data loss, 2009.
  • btrfs Wiki. "BTRFS and multiple devices." kernel.org/doc/html/latest/filesystems/btrfs.html — RAID5/6 warning.
  • Bonwick, Jeff; Moore, Bill. "ZFS: The Last Word in Filesystems." OpenSolaris presentation, 2006. (Explains RAID-Z write hole avoidance)
  • AWS. "Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region." aws.amazon.com/message/65648, April 2011.
  • Samsung. "Samsung 840 EVO performance restoration." Samsung SSD firmware release notes, 2014.
  • Facebook Engineering. "Apache Hadoop and the Problem of Large-Scale Data." Various internal presentations referenced in Hadoop JIRA HDFS-1609 (txid overflow fix).
  • McKusick, Marshall Kirk; Neville-Neil, George V. "The Design and Implementation of the FreeBSD Operating System." 2nd ed., 2014. (Chapter 8: Local Filesystems — journaling semantics)
  • Bonwick, Jeff. "RAID-Z." blogs.oracle.com/bonwick, 2005.