Skip to content

08 - Distributed Filesystems

Technical Overview

Distributed filesystems allow multiple clients to access a shared storage namespace over a network. They face fundamentally harder problems than local filesystems: network partitions, server failures, clock skew, concurrent access from multiple clients, and the CAP theorem tradeoffs between consistency, availability, and partition tolerance.

This topic covers the major distributed filesystem architectures — from classical NFS to modern hyperscale object storage — with emphasis on the architectural decisions that drive their performance and failure characteristics.

Prerequisites

  • VFS layer concepts (see 01-vfs-layer.md)
  • NFS and SMB protocols (see 12-storage-systems/07-storage-networking.md)
  • Distributed systems fundamentals (CAP theorem, Paxos/Raft, consistent hashing)
  • Block layer and storage networking basics

Core Content

Distributed Filesystem Challenges

Local filesystem challenges:       Distributed filesystem adds:
- Crash consistency                 - Network partitions
- Concurrent access (locking)       - Server failures during operation
- Space management                  - Client clock drift
- Performance (I/O scheduling)      - Multiple concurrent writers
                                    - Network latency (ms vs µs local)
                                    - Cache coherence across clients
                                    - Consistency model choice
                                    - Metadata bottleneck scaling

CAP theorem applied to distributed filesystems: - POSIX semantics require strong consistency (C) — mutually exclusive writes, linearizable reads - POSIX consistency is incompatible with availability (A) under network partitions (P) - Therefore: most distributed filesystems either relax consistency (NFS close-to-open, HDFS no concurrent writes) OR sacrifice availability (strict POSIX with fencing)

NFS (Network File System)

Covered in storage networking (12-storage-systems/07). Summary of architecture:

NFSv3 design choices: - Stateless server: server keeps NO client state - All operations idempotent: WRITE with same args is safe to repeat - Crash recovery: clients retry; servers recover cleanly from reboot - Consistency: close-to-open (writes visible after close() on writer; readers see coherent view after open()) - Not POSIX-compliant: byte-range locking via separate lockd daemon (stateful, complicated)

NFSv4 design choices: - Stateful: server tracks client leases, open files, byte-range locks - Single port 2049 (firewall-friendly) - Delegation: server grants client ownership of file for local caching (no network round trips for reads/writes while delegation held) - Compound operations: pack multiple operations per RPC → reduces latency - Kerberos v5 security

pNFS (parallel NFS, NFSv4.1 extension):

pNFS Client                    pNFS Infrastructure
        |
        | 1. Open file, request layout
        v
+-------+--------+
|                |   Metadata Server (MDS):
|  MDS           |   - Manages namespace (inodes, directories)
|  (metadata     |   - Issues layout: "data is on DS0 blocks 0-100, DS1 blocks 100-200"
|  server)       |
+-------+--------+
  |       |
  |       | 2. Direct I/O to data servers using layout
  v       v
+----+  +----+  +----+
|DS0 |  |DS1 |  |DS2 |    Data Servers:
+----+  +----+  +----+    - Store actual file data
                           - Client contacts directly (bypassing MDS for I/O)

pNFS enables near-SAN performance while maintaining NFS file semantics. The layout types: files (SCSI/NVMe objects on DSs), blocks (raw block device), object (S3-like objects).

HDFS (Hadoop Distributed Filesystem)

HDFS was designed specifically for the batch-processing (MapReduce) workload: large files, streaming reads, write-once/append-only, no POSIX requirement.

HDFS Architecture:

      +------------------+
      |    NameNode      |  (Active)     +------------------+
      |    (Metadata)    +------------->+ Standby NameNode |
      |                  |  journal sync +------------------+
      | - namespace tree |              (for HA; uses shared journal — QJM)
      | - block->DN map  |
      | - all metadata   |
      | - STORED IN RAM  |
      +--------+---------+
               |
     Heartbeats + block reports every 3s
               |
    +----------+-----------+
    |          |           |
+--------+ +--------+ +--------+
|DataNode| |DataNode| |DataNode|  Each DataNode:
|  (data)| |  (data)| |  (data)|  - Stores blocks (128 MB default)
|        | |        | |        |  - Reports block inventory to NameNode
+--------+ +--------+ +--------+  - 3× replication: NameNode decides placement
                                  - Local filesystem: ext4 or XFS

HDFS design constraints: - NameNode in RAM: All filesystem metadata (namespace + block map) must fit in NameNode's heap. Practical limit: ~1 billion files with 64 GB NameNode heap. Each file/block consumes ~150 bytes of NameNode memory. - Large block size (128 MB default): Reduces NameNode memory usage per byte of stored data. One file of 1 TB requires 8,000 block entries vs 262,144 entries with 4 KB blocks. - 3× replication: Default replication factor. Placement is rack-aware: one replica local, one on different node in same rack, one on different rack. - Write-once, append-only: HDFS does not support random writes. truncate() was added only in HDFS 2.7. Designed for MapReduce output: write a file completely, then read it many times.

HDFS is NOT POSIX: No symlinks in early versions, no directory rename atomicity (was added), no mmap(), no byte-range locks. HDFS semantics match MapReduce better than POSIX.

# HDFS CLI
hdfs dfs -ls /user/hadoop/input
hdfs dfs -put localfile.txt /user/hadoop/input/
hdfs dfs -get /user/hadoop/output/result.txt .
hdfs dfs -mkdir /user/hadoop/newdir
hdfs dfs -rm -r /user/hadoop/olddir

# Block-level info
hdfs fsck /user/hadoop/data -files -blocks -locations
# Shows: each file's blocks, their DataNode locations, replication status

# NameNode health
hdfs dfsadmin -report    # cluster summary: node count, capacity, block counts
hdfs dfsadmin -safemode get  # is NameNode in safe mode? (startup or admin)

Ceph

Ceph is a unified distributed storage system providing block (RBD), file (CephFS), and object (RGW/S3) access on top of a single object store (RADOS). Its key architectural innovation is the CRUSH algorithm for deterministic data placement without a central lookup table.

Ceph Architecture:

+================================================+
|               Client Applications              |
|  +----------+  +----------+  +--------------+  |
|  | CephFS   |  |  RBD     |  |  RGW (S3/   |  |
|  | (POSIX   |  | (block   |  |   Swift API) |  |
|  |  file)   |  |  device) |  +--------------+  |
|  +----------+  +----------+                    |
+================================================+
                       |
              librados / librbd
                       |
+================================================+
|                    RADOS                       |
|  Reliable Autonomic Distributed Object Store   |
|                                                |
|  +----------+  +----------+  +----------+     |
|  |  OSD 0   |  |  OSD 1   |  |  OSD 2   |     |
|  | (NVMe)   |  | (NVMe)   |  | (NVMe)   |     |
|  +----------+  +----------+  +----------+     |
|  +----------+  +----------+  +----------+     |
|  |  OSD 3   |  |  OSD 4   |  |  OSD 5   |     |
|  | (HDD)    |  | (HDD)    |  | (HDD)    |     |
|  +----------+  +----------+  +----------+     |
|                                                |
|  +----------+  +----------+  +----------+     |
|  |  Mon 0   |  |  Mon 1   |  |  Mon 2   |     |
|  | (Paxos   |  | (Paxos   |  | (Paxos   |     |
|  | quorum)  |  | quorum)  |  | quorum)  |     |
|  +----------+  +----------+  +----------+     |
|                                                |
|  MGR (Prometheus metrics, dashboard, modules)  |
+================================================+

CRUSH (Controlled Replication Under Scalable Hashing): CRUSH is a pseudo-random placement algorithm that deterministically maps object IDs to OSD lists without a central lookup table:

CRUSH computation (every client can compute independently):

Input: object_name, pg_id, cluster_map (CRUSH map)
Output: [OSD_id_1, OSD_id_2, OSD_id_3]  (primary + replicas)

Algorithm:
1. Hash object_name → placement group (PG): pg_id = hash(object) % pg_count
2. CRUSH(pg_id, crush_map) → OSD list

CRUSH map defines:
- Hierarchy: datacenter → room → row → rack → host → disk
- Placement rules: "3 replicas, one per rack" etc.
- Weight: proportional to device capacity

Benefits:
- No central lookup table → no bottleneck
- Any client or OSD can compute placement independently
- Adding/removing OSDs causes minimal data migration (consistent-hashing property)

Ceph OSD internals: Each OSD daemon manages one physical disk. The OSD stores objects in a local filesystem (default: BlueStore, a custom key-value store on raw block device, replacing the earlier filestore which used ext4/XFS). BlueStore provides: - Direct control over disk layout (no filesystem overhead) - Checksums on all data and metadata - Efficient partial writes (no read-modify-write overhead) - Compression and deduplication hooks

# Ceph cluster management
ceph status                          # overall health
ceph -s                              # same as above
ceph osd tree                        # OSD hierarchy view
ceph df                              # pool-level space usage
ceph osd df                          # per-OSD space and PG count
ceph osd perf                        # per-OSD latency statistics

# Placement group (PG) health
ceph pg stat                         # PG state summary
ceph pg dump | grep -v "^pg"         # PG detail
ceph pg repair 1.2a3                 # repair a specific PG

# Pool management
ceph osd pool create mypool 128 128  # create pool with 128 PGs
ceph osd pool set mypool size 3       # set replication factor to 3
rados -p mypool bench 30 write       # write benchmark
rados -p mypool bench 30 seq         # sequential read benchmark
rados -p mypool bench 30 rand        # random read benchmark

# CephFS
ceph fs status                       # filesystem status
mount -t ceph :/ /mnt/cephfs -o name=admin,secret=...

CephFS MDS (Metadata Server): CephFS adds a metadata server layer on top of RADOS for POSIX filesystem semantics. The MDS handles namespace operations (stat, readdir, create, rename). Data is stored in RADOS objects directly — clients talk to the MDS for metadata and directly to OSDs for data (similar to pNFS). For large deployments, multiple active MDS instances can handle different directory subtrees (CephFS subtree partitioning).

GlusterFS

GlusterFS is a distributed filesystem with a different architecture: no dedicated metadata server. The namespace is distributed among storage bricks (servers) using consistent hashing on filenames:

GlusterFS Architecture:

Client → GlusterFS translator stack (in FUSE module) → Server bricks
         (no separate metadata server — namespace is distributed)

Volume types:
- Distributed: files striped across bricks by filename hash
- Replicated: each file on multiple bricks
- Distributed-Replicated: distribution across replica pairs
- Dispersed: erasure coding across bricks

GlusterFS is used by Red Hat Storage Server (formerly GlusterFS, now OpenShift Container Storage/ODF uses Ceph). It is simpler than Ceph but less scalable at very large (100+ PB) deployments.

Lustre

Lustre is a high-performance computing (HPC) parallel filesystem designed for maximum aggregate bandwidth across thousands of clients:

Lustre Architecture:

Clients (100s-1000s):              Servers:
  Mount /mnt/lustre              MGS: Management Server (config database)
  Access metadata via MDT        MDS: Metadata Server (namespace, MDT)
  Access data via OST            OSS: Object Storage Servers (hold OSTs)
        |                             OST0  OST1  OST2  ...  OST1000
        v
  liblustreapi → LNet (Lustre Network) → InfiniBand or Ethernet

Striping:
  Files are striped across OSTs.
  stripe_count=64 means a 64 GB file uses 64 OSTs (1 GB per OST)
  → 64 × 1 GB/s (per OST) = 64 GB/s aggregate bandwidth for that file

Real-world bandwidth: ORNL Summit → 1.6 TB/s aggregate on Lustre
# Lustre file striping
lfs setstripe -c 64 -S 1M /mnt/lustre/large_file  # 64 OSTs, 1 MB chunk
lfs getstripe /mnt/lustre/large_file                # view stripe info
lfs df /mnt/lustre                                  # per-OST usage
lctl get_param llite.*.stats                        # client statistics

Lustre is the filesystem of choice for TOP500 supercomputers. Aggregate bandwidth can exceed 1 TB/s on large deployments by having thousands of storage nodes serving thousands of compute nodes simultaneously.

Google Colossus

Google Colossus is the successor to the original Google File System (GFS), described in the 2003 Ghemawat et al. paper. GFS's single-master architecture (one metadata server for the entire filesystem) became a bottleneck at Google's scale.

Colossus improvements over GFS: - Multiple metadata masters: Bigtable-based metadata (itself distributed) eliminates the single NameNode bottleneck - Client-side caching of metadata: Clients cache metadata locally, reducing master load - Reed-Solomon erasure coding: 9+3 or 11+5 erasure codes replace 3× replication for cold data - Disaggregated storage: Storage and compute separated at the rack level

Google does not publish detailed Colossus internals. What is known comes from papers on Bigtable, Spanner, and Google's storage papers. Colossus backs Google Search index, Google Drive, BigQuery, and virtually all other Google services.

AWS S3 (Object Storage)

S3 is not a traditional filesystem — it provides an object store with an HTTP API:

S3 Architecture (simplified, not officially documented):

                 Client
                   |
                   | HTTP PUT/GET/DELETE
                   v
               Load Balancer
                   |
               S3 Front-end (request routing, auth, rate limiting)
                   |
               +-----------+-----------+
               |           |           |
           Metadata    Metadata    Metadata
           Shard 0     Shard 1     Shard N
           (bucket/key → storage node mapping)
               |
           Multiple object storage nodes (replication across AZs)
           (actual object data, erasure coded)

S3 semantics: - Not POSIX: no rename atomicity, no symlinks, no byte-range locks - Strongly consistent (since December 2020): read-after-write consistency for all new objects; list-after-write consistency - Multi-part upload: for objects >5 GB; individual parts uploaded independently and assembled by S3 - Object storage: objects are immutable (write once, full object replace). No in-place updates. - Versioning: each overwrite creates a new version; previous versions retained (with delete markers)

# AWS CLI S3 operations
aws s3 cp file.txt s3://mybucket/path/file.txt          # upload
aws s3 cp s3://mybucket/path/file.txt ./                # download
aws s3 sync ./local_dir s3://mybucket/prefix/           # sync directory
aws s3 ls s3://mybucket/prefix/                         # list objects
aws s3 rm s3://mybucket/path/file.txt                   # delete
aws s3 presign s3://mybucket/path/file.txt --expires 3600  # presigned URL

# Multipart upload for large files (automatic with aws s3 cp for >8MB)
# For manual control:
aws s3api create-multipart-upload --bucket mybucket --key large-file.tar.gz
aws s3api upload-part --bucket mybucket --key large-file.tar.gz \
    --part-number 1 --upload-id <upload-id> --body part1.tar.gz
aws s3api complete-multipart-upload --bucket mybucket --key large-file.tar.gz \
    --upload-id <upload-id> --multipart-upload file://parts.json

S3 performance: - Request rate: scales horizontally with key prefix distribution (use randomized prefixes for high-rate workloads) - GET latency: 50-200 ms (first byte), depending on object size and region - PUT latency: 200-500 ms (to durability confirmation) - Aggregate bandwidth: effectively unlimited at prefix partition level; rate limits per prefix at very high IOPS

Historical Context

The original distributed filesystem for UNIX was NFS, developed at Sun Microsystems in 1984. It was designed by Russel Sandberg et al. with the explicit goal of making the network transparent to applications — ideally, applications should not know whether they are accessing a local or remote file.

HDFS was inspired by the Google File System paper (Ghemawat, Gobioff, Leung; SOSP 2003). The GFS paper described Google's internal filesystem designed for batch-processing workloads — large files, streaming reads, write-once. Hadoop, developed at Yahoo in 2006, implemented an open-source version of GFS as HDFS.

Ceph was developed by Sage Weil as his PhD dissertation at UC Santa Cruz (2007). The CRUSH algorithm was his key contribution. Ceph was acquired by Red Hat in 2014 and is now the basis for Red Hat OpenShift Data Foundation.

S3 launched in March 2006 as one of the first AWS services, designed by Andy Jassy and a small team. Its HTTP-based API, flat namespace, and massive scalability fundamentally changed how applications store data. S3's API became an industry standard — MinIO, Ceph RADOS Gateway, OpenStack Swift, and dozens of others implement S3-compatible APIs.

Production Examples

Netflix on HDFS: Netflix uses HDFS (Apache Hadoop) for their content encoding pipeline. Raw video (multi-TB per title) is written to HDFS, processed by Spark, and the encoded output written back. HDFS's large block size and sequential-read optimization perfectly match this write-once, large-file workload.

Dropbox Magic Pocket: Dropbox built their own object storage (Magic Pocket) to replace S3 for their exabyte-scale user data. Built on custom hardware, Ceph-inspired but with proprietary modifications, 4+2 erasure coding across 6 failure zones. Published in 2016, serves hundreds of exabytes.

Pinterest on S3: Pinterest stores all user images in S3, serving 300+ million users. They store thumbnails in multiple sizes (pre-generated) to avoid on-the-fly image resizing. S3 lifecycle rules move older images to S3 Standard-IA, reducing storage costs for less-accessed content.

LinkedIn Kafka on HDFS: LinkedIn uses HDFS as the long-term retention tier for Kafka messages. Kafka brokers write to local SSDs; after configurable retention period, old segments are archived to HDFS via MirrorMaker. HDFS provides cheap petabyte-scale retention for compliance and analytics.

Debugging Notes

# NFS debugging
nfsstat -c                   # client RPC statistics
nfsstat -s                   # server RPC statistics
mountstats                   # per-mount operation latency breakdown
cat /proc/self/mountstats    # raw per-mount NFS stats

# HDFS debugging
hdfs dfsadmin -report        # cluster health and block status
hdfs fsck / -files -blocks   # full filesystem check
hdfs dfs -count -v /path     # file/dir count and size
hadoop checknative            # verify native library loading (LZ4, Snappy)

# Check for under-replicated blocks
hdfs dfsadmin -report | grep -E "Under|Missing"

# Ceph debugging
ceph status
ceph health detail           # verbose health messages
ceph log last 100            # last 100 cluster log entries
ceph osd blocked-by          # show slow OSD operations
ceph daemon osd.0 perf dump  # detailed OSD performance counters

# S3 debugging (aws CLI)
aws s3api head-object --bucket mybucket --key mykey  # check object metadata
aws s3api list-object-versions --bucket mybucket --prefix mykey  # version history
aws cloudtrail lookup-events --lookup-attributes AttributeKey=ResourceName,\
    AttributeValue=mybucket  # audit log for bucket operations

Security Implications

NFS security model weakness: NFSv3 trusts the client's UID/GID. A root user on any client can spoof any UID. The only protection is network-level access control (IP allowlist in /etc/exports). For untrusted clients, always use NFSv4 with Kerberos (sec=krb5p).

HDFS no encryption by default: HDFS traditionally stored data unencrypted. Sensitive data in HDFS at-rest requires Transparent Data Encryption (TDE) using HDFS key management server (KMS). Many production clusters lack TDE, relying on Kerberos for access control (which prevents unauthorized access but not physical disk theft).

Ceph CephX authentication: Ceph uses CephX (a Kerberos-like capability system) for client authentication. Each client gets a keyring with capabilities for specific pools/resources. Without CephX, any host on the Ceph network can access any pool. Enable CephX in all production deployments:

ceph auth get client.admin   # view admin keyring
ceph auth add client.webserver mon 'allow r' osd 'allow rw pool=webfiles'

S3 public access block: S3 buckets misconfigured as "public" have been the source of numerous data breaches (Capital One 2019, etc.). Always enable S3 Block Public Access at the account level:

aws s3api put-public-access-block --bucket mybucket \
    --public-access-block-configuration \
    "BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true"

Performance Implications

HDFS NameNode GC pauses: The NameNode runs in JVM and holds all metadata in heap. Frequent full GC pauses can cause NameNode unavailability (30-300 second pauses on a 60 GB heap). Mitigations: tune JVM GC (G1GC with -XX:MaxGCPauseMillis=300), use NameNode HA (standby takes over during pause), right-size heap (avoid over-provisioning heap → more frequent GC).

Ceph network I/O: Ceph OSDs generate significant network traffic (client writes + replication). A 3× replicated write of 100 MB generates 300 MB of OSD traffic. With 100 GB of writes/hour: ~300 GB of OSD replication traffic/hour. Ensure your Ceph cluster network is 2× the expected write bandwidth.

S3 prefix key distribution: S3 partitions request load by key prefix. If all keys start with 2024/01/01/, all requests go to one partition → rate limiting (3500 PUT/s, 5500 GET/s per prefix). Randomize key prefixes for high-throughput applications: hash(filename)[:4]/filename instead of date/filename.

Failure Modes and Real Incidents

HDFS NameNode single point of failure (pre-2.x): Until HDFS 2.0 (2012), there was a single NameNode. NameNode failure meant total HDFS unavailability for hours (recovery from fsimage + edit log). Major Yahoo outages were traced to NameNode issues. HDFS HA (active/standby NameNode with shared journal) was the fix.

Ceph OSD cascade failure: If multiple OSDs fail simultaneously (power loss affecting multiple nodes in same rack), RADOS may lose quorum for some Placement Groups → cluster enters "degraded" state. If enough replicas are lost, data becomes "unavailable" (PG marked "inactive"). Recovery requires bringing failed OSDs back online. Key mitigation: rack-aware CRUSH rules, diverse power paths.

S3 "slow list" problem: ListObjects (or ListObjectsV2) returns at most 1000 objects per call and requires multiple paginated calls for large buckets. For buckets with 100 million objects, a complete list takes ~100,000 API calls and 30+ minutes. S3 Inventory (daily/weekly) provides a pre-computed CSV listing for large-scale analysis without expensive ListObjects loops.

NFS "thundering herd" on server reboot: When an NFS server reboots, all mounted clients simultaneously attempt to reconnect and reestablish their file lock state. This "thundering herd" of reconnect + NLM (lock manager) state recovery RPCs can overwhelm the NFS server and cause a 30-60 second period of elevated latency for all clients. NFSv4 leases and session state simplify recovery but do not fully eliminate the thundering herd.

Modern Usage

  • Kubernetes persistent volumes: CSI (Container Storage Interface) drivers enable Kubernetes pods to use distributed storage: Ceph RBD, CephFS, NFS, EBS, GCP Persistent Disk. The CSI standard abstracts the distributed filesystem complexity from application developers.
  • S3 as the universal data bus: S3 (and S3-compatible stores like MinIO) is the standard input/output format for data pipelines (Spark, Flink), ML training (S3FS, s3a:// in Hadoop), and serverless functions (AWS Lambda reading from S3).
  • Apache Iceberg on S3: Iceberg is a table format for data lakes on object storage. It stores parquet data files in S3 + JSON metadata files. Combined with query engines (Spark, Trino, Flink), it provides ACID transactions on S3 without a traditional distributed filesystem. This represents the "lake house" architecture trend.
  • NFS over RDMA: Enterprise NAS (NetApp ONTAP, Dell PowerStore) uses NFS/RDMA (RDMA-capable NICs + NFS over RDMA transport) for <200 µs NFS file access — competitive with local SSD for many workloads.

Future Directions

  • Distributed filesystem convergence on object storage: Many distributed filesystems now use object storage as their backend (CephFS on RADOS, HDFS-on-S3 via S3A, GPFS/Spectrum Scale on S3). The "filesystem as metadata layer over object storage" pattern is gaining traction.
  • POSIX-less interfaces: New workloads (ML training, analytics, streaming) don't need POSIX — they work better with object APIs (S3) or streaming APIs (Kafka). The proportion of new applications using POSIX filesystems is decreasing.
  • CXL interconnect for distributed memory: CXL 3.0 (2023) enables memory sharing across servers in the same fabric — a new layer of distributed storage below traditional filesystems. CXL-shared memory pools could replace some NFS use cases with nanosecond-latency access.
  • Wasm-based distributed filesystem clients: As WebAssembly matures for server-side use (WASI), distributed filesystem client libraries are being ported to WASM for universal deployment in containers and edge environments.

Exercises

  1. Deploy a single-node Ceph cluster using cephadm. Create a pool and benchmark it with rados bench. Add a second OSD (another loop device). Observe how CRUSH distributes PGs across the new OSD. What percentage of PGs migrated?

  2. Implement an HDFS file processing pipeline: upload a large text file (1 GB), run a word count MapReduce job, and download the output. Measure the time for each phase (upload, process, download). What is the bottleneck?

  3. Compare NFS rsize/wsize impact. Mount an NFS share with rsize=65536 and rsize=1048576. Run sequential read benchmarks with fio. Plot throughput vs rsize. At what rsize does throughput saturate?

  4. Write a script that uses the S3 API (boto3 or aws CLI) to multipart-upload a 10 GB file and measure total upload time and per-part latency. Compare to a single PUT (up to 5 GB limit). Experiment with different part sizes (5 MB, 50 MB, 500 MB).

  5. Research the Ceph CRUSH algorithm in detail. Read the original CRUSH paper (Weil 2006). Implement a simplified CRUSH placement algorithm in Python that, given a list of OSDs with weights, deterministically maps an object ID to a list of N OSDs. Verify that adding one OSD causes ~1/N of objects to migrate.

References

  • Ghemawat, S., Gobioff, H., Leung, S-T. "The Google File System." SOSP 2003.
  • Shvachko, K. et al. "The Hadoop Distributed File System." MSST 2010.
  • Weil, S.A. et al. "Ceph: A Scalable, High-Performance Distributed File System." OSDI 2006.
  • Weil, S.A. et al. "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data." SC 2006.
  • NFS v4.1 (pNFS): RFC 5661, RFC 5662
  • Schmuck, F. and Haskin, R. "GPFS: A Shared-Disk File System for Large Computing Clusters." FAST 2002.
  • Beaver, D. et al. "Finding a Needle in Haystack: Facebook's Photo Storage." OSDI 2010.
  • Muralidhar, S. et al. "f4: Facebook's Warm BLOB Storage System." OSDI 2014.
  • Dropbox Magic Pocket: https://dropbox.tech/infrastructure/magic-pocket-infrastructure
  • Amazon S3 documentation: https://docs.aws.amazon.com/s3/
  • Lu, L. et al. "A Study of Linux File System Evolution." FAST 2013.