07 - Storage Networking
Technical Overview
Storage networking encompasses the protocols, topologies, and architectures that allow storage devices to be shared across a network. The fundamental distinction is access semantics: block storage (SAN) presents raw block devices to clients, who manage filesystems locally; file storage (NAS) presents filesystems directly over the network, managed by the storage server; object storage (S3-compatible) provides an HTTP-based key-value store for unstructured data.
Each model reflects different tradeoffs in consistency, performance, protocol complexity, and applicable workloads.
Prerequisites
- Block device and filesystem basics
- TCP/IP networking fundamentals
- SCSI protocol concepts (LUN, CDB)
- NVMe concepts (see 04-nvme.md)
Core Content
SAN vs NAS Architecture
SAN (Storage Area Network):

+-----------+   FC/iSCSI   +-----------+
| Host A    |<------------>| Storage   |
| (sees     |              | Array     |
| /dev/sdb) |              | (presents |
+-----------+              | LUNs)     |
                           |           |
+-----------+   FC/iSCSI   |           |
| Host B    |<------------>|           |
| (sees     |              |           |
| /dev/sdb) |              |           |
+-----------+              +-----------+

NAS (Network Attached Storage):

+-------------+   NFS/SMB   +------------+
| Host A      |<----------->| NAS Server |
| (mounts     |             | (runs      |
| /mnt/nfs)   |             | NFS/SMB    |
+-------------+             | server,    |
                            | manages    |
+-------------+   NFS/SMB   | FS)        |
| Host B      |<----------->|            |
| (same FS    |             |            |
| view)       |             |            |
+-------------+             +------------+

Block access (SAN):
- Client manages filesystem
- Low-level SCSI/NVMe commands
- Client sees raw block device
- High performance, complex setup

File access (NAS):
- Server manages filesystem
- VFS syscalls over network
- POSIX semantics (usually partial)
- Simpler, shared namespace
Fibre Channel (FC)
Fibre Channel is a high-speed serial protocol for SANs. Despite its name, FC can run on both optical fiber and copper.
FC characteristics:
- Speeds: 2/4/8/16/32/64 Gbps (32GFC is the current mainstream standard; 64GFC is being deployed)
- Lossless: FC uses credit-based (buffer-to-buffer) flow control, so frames are not dropped because of congestion (unlike standard Ethernet)
- Dedicated fabric: FC switches (Brocade, Cisco MDS) form an isolated fabric separate from Ethernet
- WWPN (World Wide Port Name): 64-bit unique identifier per FC port (like a MAC address)
- Zones: FC zoning restricts which initiators (hosts) can see which targets (storage arrays)
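The credit mechanism can be illustrated with a toy model. This is only a sketch of the idea, not an FC implementation: the class `CreditedLink`, the function `transfer`, and the credit count of 4 are hypothetical names and values chosen for illustration. The sender consumes one credit per frame and stalls at zero; the receiver returns a credit each time it frees a buffer (the role R_RDY plays in FC).

```python
from collections import deque


class CreditedLink:
    """Toy model of buffer-to-buffer credits (hypothetical, not real FC)."""

    def __init__(self, credits: int) -> None:
        if credits < 1:
            raise ValueError("need at least one credit")
        self.credits = credits        # credits granted by the receiver at login
        self.rx_buffers = deque()     # frames currently held in receiver buffers
        self.max_buffered = 0         # peak receiver buffer occupancy observed

    def try_send(self, frame: str) -> bool:
        """Send only if a credit is available; otherwise the sender waits."""
        if self.credits == 0:
            return False              # sender stalls; nothing is dropped
        self.credits -= 1
        self.rx_buffers.append(frame)
        self.max_buffered = max(self.max_buffered, len(self.rx_buffers))
        return True

    def receiver_drain(self) -> str:
        """Receiver frees one buffer and hands one credit back to the sender."""
        frame = self.rx_buffers.popleft()
        self.credits += 1
        return frame


def transfer(frames: list[str], credits: int) -> tuple[list[str], int]:
    """Deliver all frames; return (delivered frames, peak buffer occupancy)."""
    link = CreditedLink(credits)
    pending = deque(frames)
    delivered = []
    while pending or link.rx_buffers:
        # Sender pushes as many frames as its credits allow.
        while pending and link.try_send(pending[0]):
            pending.popleft()
        # Receiver processes one frame, which returns one credit.
        delivered.append(link.receiver_drain())
    return delivered, link.max_buffered


if __name__ == "__main__":
    delivered, peak = transfer([f"frame{i}" for i in range(10)], credits=4)
    print(len(delivered), "frames delivered; peak receiver occupancy:", peak)
```

Because the sender can never have more frames outstanding than the receiver has buffers, the receiver never needs to discard a frame; congestion shows up as sender back-pressure instead.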
FC Fabric Topology:
Host HBA <----FC link----> FC Switch 1 <----ISL----> FC Switch 2 <----FC link----> Storage Array
(WWPN: 10:00:...) (handles routing) (handles routing) (presents LUNs)
FC addressing:
- FCID: 24-bit address assigned by fabric (like IP address)
- WWPN: permanent identifier (like MAC)
- Domain / Area / Port: hierarchical FCID decomposition
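The hierarchical decomposition above can be sketched with a few bit operations, assuming the usual split of the 24-bit FCID into three 8-bit fields. The helper names `split_fcid` and `format_wwpn` and the sample values are made up for illustration.

```python
def split_fcid(fcid: int) -> tuple[int, int, int]:
    """Split a 24-bit FCID into (domain, area, port), 8 bits each."""
    if not 0 <= fcid <= 0xFFFFFF:
        raise ValueError("FCID must fit in 24 bits")
    domain = (fcid >> 16) & 0xFF   # identifies the switch in the fabric
    area = (fcid >> 8) & 0xFF      # typically a port or port group on that switch
    port = fcid & 0xFF             # the individual port behind that area
    return domain, area, port


def format_wwpn(wwpn: int) -> str:
    """Render a 64-bit WWPN as colon-separated hex bytes."""
    if not 0 <= wwpn < (1 << 64):
        raise ValueError("WWPN must fit in 64 bits")
    return ":".join(f"{byte:02x}" for byte in wwpn.to_bytes(8, "big"))


if __name__ == "__main__":
    print(split_fcid(0x010200))             # (1, 2, 0)
    print(format_wwpn(0x10000090FA123456))  # 10:00:00:90:fa:12:34:56
```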
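FC zoning (critical for security):
- Hard zoning: enforced by the switch hardware, which filters frames between ports that do not share a zone; a host cannot bypass it
- Soft zoning: enforced only through the name service, which hides out-of-zone devices from queries; a compromised HBA that already knows a target's address can still reach it, and WWPN-based membership can be spoofed
- Best practice: rely on hardware-enforced zoning and treat name-service filtering as an additional layer for defense in depth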
FC advantages: extremely low latency, deterministic performance, lossless, purpose-built for storage. FC disadvantages: expensive HBAs/switches, requires dedicated infrastructure, specialized expertise.
iSCSI
iSCSI encapsulates SCSI commands in TCP/IP packets, enabling block storage over standard Ethernet.
iSCSI Architecture:
Initiator (client host) Target (storage server)
+------------------+ +---------------------+
| Application | | Storage Array |
| SCSI Layer | TCP/IP | iSCSI Target (tgtd, |
| iSCSI Layer <---+---Ethernet---> | LIO, Starwind) |
| TCP/IP | | SCSI LUN |
| NIC | | NIC |
+------------------+ +---------------------+
iSCSI Naming:
IQN (iSCSI Qualified Name): iqn.2024-01.com.example:storage01:lun1
EUI (Extended Unique Identifier): eui.0123456789abcdef
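A small parser makes the IQN structure concrete: the literal `iqn.`, a year-month date, a reversed domain name (the naming authority), and an optional colon-separated suffix. This is a minimal sketch; the regular expression is a simplification rather than a full implementation of the iSCSI naming rules, and `parse_iqn` and `Iqn` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern: iqn.<yyyy>-<mm>.<reversed domain>[:<unique name>]
IQN_RE = re.compile(r"^iqn\.(\d{4})-(\d{2})\.([a-z0-9.-]+)(?::(.+))?$")


class Iqn(NamedTuple):
    year: int
    month: int
    naming_authority: str        # reversed domain name, e.g. "com.example"
    unique_name: Optional[str]   # everything after the first colon, if any


def parse_iqn(name: str) -> Iqn:
    """Parse an IQN string; the input is lowercased first (an assumption)."""
    match = IQN_RE.match(name.lower())
    if match is None:
        raise ValueError(f"not a valid IQN: {name!r}")
    year, month = int(match.group(1)), int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in IQN: {name!r}")
    return Iqn(year, month, match.group(3), match.group(4))


if __name__ == "__main__":
    print(parse_iqn("iqn.2024-01.com.example:storage01:lun1"))
```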
iSCSI performance considerations:
- Software iSCSI (on Linux the data path runs in the kernel, while iscsid handles login and session management in userspace): CPU-intensive at high I/O rates (~10-20% CPU per 10G)
- Offload: TOE (TCP Offload Engine) on the NIC reduces CPU load; iSER (iSCSI Extensions for RDMA) moves data with RDMA instead of TCP
- Jumbo frames (MTU 9000) recommended to reduce per-packet overhead for large sequential I/Os
- Separate dedicated network (iSCSI VLAN or dedicated switches) to avoid sharing links with traffic that can cause congestion
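The benefit of jumbo frames can be estimated with simple arithmetic. The sketch below assumes IPv4 and TCP headers without options (40 bytes) and standard Ethernet framing overhead (38 bytes including preamble and inter-frame gap); it ignores the 48-byte iSCSI header and VLAN tags, and the function names are illustrative.

```python
ETH_OVERHEAD = 14 + 4 + 8 + 12   # header + FCS + preamble/SFD + inter-frame gap
IP_TCP_HEADERS = 20 + 20          # IPv4 + TCP, no options (assumption)


def payload_efficiency(mtu: int) -> float:
    """Fraction of wire time that carries TCP payload at a given MTU."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def frames_needed(io_bytes: int, mtu: int) -> int:
    """Number of full-size frames required to carry one I/O of io_bytes."""
    payload = mtu - IP_TCP_HEADERS
    return -(-io_bytes // payload)   # ceiling division


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(mtu, round(payload_efficiency(mtu), 3), frames_needed(1024 * 1024, mtu))
```

Under these assumptions a 1 MiB I/O needs 719 frames at MTU 1500 but only 118 at MTU 9000, so most of the gain comes from handling about six times fewer packets per I/O rather than from the few percent of extra wire efficiency.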
# iSCSI initiator setup (Linux)
apt install open-iscsi
# Discover targets on a portal (target address)
iscsiadm -m discovery -t sendtargets -p 192.168.100.100
# Login to a target
iscsiadm -m node -T iqn.2024-01.com.storage:array1 -p 192.168.100.100 --login
# View connected sessions
iscsiadm -m session
# The LUN appears as /dev/sdb (or /dev/sdX)
lsblk
NFS: Network File System
NFS provides shared POSIX-like filesystem access over a network. The server runs an NFS daemon; clients mount the exported filesystem.
NFSv3 (stateless):
- Server keeps no per-client state: if the server reboots, clients retry and operations complete transparently
- All operations are idempotent where possible
- Consistency: close-to-open (a client's writes are flushed on close and seen by other clients on their next open, not immediately)
- Port: 2049 (UDP or TCP), plus additional ports for portmap/rpcbind, mountd, statd, and lockd
NFSv4 (stateful, single port):
- Single port 2049 (TCP mandatory)
- Stateful leases: server tracks open files and byte-range locks
- Delegation: server can delegate read or write caching authority over a file to a client, and recalls it when another client needs conflicting access
- Strong security: Kerberos V5 (krb5, krb5i, krb5p) via RPCSEC_GSS
- COMPOUND operations: multiple operations in one RPC round trip (reduces latency)
- NFSv4.1 adds: pNFS (parallel NFS: direct client-to-storage-device I/O, bypassing the metadata server for data) and sessions
- NFSv4.2 adds: server-side copy, sparse files, and labeled NFS (the features that rely on RPCSEC_GSSv3)
pNFS (parallel NFS):
pNFS Architecture:
Client
|
| 1. Open file, get layout (which storage nodes hold data)
v
+-------------+
| MDS | Metadata Server: handles metadata, issues layouts
| (Metadata |
| Server) |
+-------------+
|
| 2. Client contacts storage nodes directly for data
|
+------+-------+-------+
| | | |
DS0 DS1 DS2 DS3 Data Servers: hold file data chunks
pNFS achieves near-SAN performance while retaining NFS file semantics. Used in high-performance NAS (NetApp ONTAP, IBM GPFS/Spectrum Scale).
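To make the layout idea concrete, the sketch below maps file offsets to data servers using plain round-robin striping with densely packed per-server objects. Real pNFS layouts carry more detail (layout types, device IDs, sparse versus dense packing), so treat the stripe unit, the server names, and the functions `locate` and `split_read` as illustrative assumptions.

```python
def locate(offset: int, stripe_unit: int, data_servers: list[str]) -> tuple[str, int]:
    """Map a file byte offset to (data server, offset within that server's object)."""
    stripe_index = offset // stripe_unit
    server = data_servers[stripe_index % len(data_servers)]
    # Each full round across all servers adds one stripe unit per server.
    rounds = stripe_index // len(data_servers)
    local_offset = rounds * stripe_unit + offset % stripe_unit
    return server, local_offset


def split_read(offset: int, length: int, stripe_unit: int,
               data_servers: list[str]) -> list[tuple[str, int, int]]:
    """Break one client read into per-server (server, local_offset, length) requests."""
    requests = []
    end = offset + length
    while offset < end:
        # Never cross a stripe-unit boundary within a single request.
        chunk = min(end - offset, stripe_unit - offset % stripe_unit)
        server, local_offset = locate(offset, stripe_unit, data_servers)
        requests.append((server, local_offset, chunk))
        offset += chunk
    return requests


if __name__ == "__main__":
    MIB = 1024 * 1024
    servers = ["DS0", "DS1", "DS2", "DS3"]
    for request in split_read(MIB // 2, 2 * MIB, MIB, servers):
        print(request)
```

A single 2 MiB read that starts mid-stripe turns into three requests to three different data servers, which the client can issue in parallel without involving the metadata server.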
# NFS server configuration (/etc/exports)
# One line per exported path; each client spec carries its own options.
# root_squash (the default) is stated explicitly: client root is mapped to an anonymous user.
/mnt/data 192.168.1.0/24(rw,sync,root_squash) *(ro,sync,root_squash)
# NFS client mount (NFSv4.1)
mount -t nfs -o vers=4.1,rw,relatime,hard,rsize=1048576,wsize=1048576 \
192.168.1.100:/mnt/data /mnt/remote
# Mount options:
# hard: retry indefinitely if server unreachable (vs soft: return error)
# intr: accepted but ignored since Linux 2.6.25; a process stuck on a hard mount can still be killed with SIGKILL
# rsize/wsize: read/write block size (1MB is good for modern networks)
# vers=4.1 (alias nfsvers=4.1): use NFSv4.1
# Check NFS statistics
nfsstat -c # client stats
nfsstat -s # server stats
cat /proc/net/rpc/nfs # kernel RPC stats
SMB/CIFS
Server Message Block (SMB) is the native Windows file sharing protocol. SMB3 (Windows Server 2012+) adds significant improvements:
SMB3 features:
- SMB Multichannel: aggregate bandwidth across multiple NICs/paths simultaneously
- SMB Direct: RDMA-based transport for SMB over InfiniBand, RoCE, or iWARP (similar to NFS over RDMA)
- SMB Transparent Failover: client reconnects to a surviving cluster node after failover without visible disruption
- SMB Encryption: per-message encryption, using AES-128-CCM in SMB 3.0, with AES-128-GCM added in SMB 3.1.1 and AES-256 variants in newer Windows releases
- Persistent handles: survive server reboots / brief disconnections

On Linux, Samba implements an SMB3 server in userspace, and the kernel CIFS module implements the SMB client.
NVMe-oF
NVMe over Fabrics (NVMe-oF) extends NVMe protocol over a network fabric (TCP, RDMA, FC), achieving near-local NVMe latency for networked storage. Covered in detail in 04-nvme.md.
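Performance comparison (indicative figures that vary with hardware; latency is for 4K random reads at queue depth 1, bandwidth is peak sequential throughput for a typical link or device):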
| Protocol | Latency | Bandwidth | CPU overhead | Notes |
|---|---|---|---|---|
| Local NVMe | ~100 µs | 7 GB/s | Low | PCIe direct |
| NVMe-oF RDMA | ~130 µs | ~25 GB/s | Near-zero | RDMA NIC required |
| NVMe-oF TCP | ~200-500 µs | ~10 GB/s | Moderate | Standard NIC, kernel 5.0+ |
| iSCSI (10GbE) | ~500 µs | ~1 GB/s | High (CPU) | Jumbo frames help |
| NFS over RDMA | ~200 µs | ~12 GB/s | Low | Requires RDMA NIC |
| NFSv4.1 (TCP) | ~500 µs | ~1.5 GB/s | Moderate | pNFS can scale further |
| FC (32G) | ~200 µs | ~3 GB/s | Low (HBA) | Dedicated fabric |
Object Storage Protocols
Object storage treats data as objects (key-value pairs) rather than files or blocks:
Amazon S3 API (de facto standard):
- HTTP REST API: PUT, GET, DELETE, LIST
- Object addressed by bucket + key (path-like key, not a real directory tree)
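- Multipart upload: required for objects >5 GB (the single-PUT limit) and commonly used well below that; parts are 5 MiB to 5 GiB each (the last part may be smaller), at most 10,000 parts per upload, and parts can be uploaded in parallel
- Consistency: strong read-after-write consistency (since the December 2020 update; eventual consistency before that)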
- Operations: aws s3 cp, aws s3 sync, SDK calls
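Choosing a part size for a multipart upload is a small calculation. The sketch below assumes the published S3 limits of a 5 MiB minimum part size (except for the last part), a 5 GiB maximum part size, and at most 10,000 parts; the 64 MiB preferred part size and the name `plan_parts` are arbitrary choices for illustration, and no request is sent to any service.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
MIN_PART = 5 * MIB      # minimum size of every part except the last
MAX_PART = 5 * GIB      # maximum size of a single part
MAX_PARTS = 10_000      # maximum number of parts per upload


def plan_parts(object_size: int, preferred_part: int = 64 * MIB) -> tuple[int, int]:
    """Return (part_size, part_count) that fits within the multipart limits."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    part = max(preferred_part, MIN_PART)
    # Grow the part size if the object would otherwise need more than 10,000 parts.
    smallest_allowed = -(-object_size // MAX_PARTS)   # ceiling division
    part = max(part, smallest_allowed)
    if part > MAX_PART:
        raise ValueError("object too large for a single multipart upload")
    count = -(-object_size // part)
    return part, count


if __name__ == "__main__":
    print(plan_parts(10 * GIB))     # (67108864, 160)
    print(plan_parts(1024 * GIB))   # part size grows so the count stays at 10,000
```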
OpenStack Swift: Early open-source S3-like object storage. Ring-based placement (consistent hashing with replicas). Used in many private clouds.
MinIO: S3-compatible high-performance object store, deployed on commodity hardware. Can saturate 100 GbE links with erasure-coded object storage.
# S3 example with aws CLI
aws s3 cp largefile.tar.gz s3://mybucket/backup/ --storage-class STANDARD_IA
aws s3 sync /local/dir s3://mybucket/prefix/ --delete
# MinIO client (mc) — S3 compatible
mc cp largefile.tar.gz minio/mybucket/
mc ls minio/mybucket/
Distributed Filesystems
HDFS (Hadoop Distributed Filesystem):
HDFS Architecture:
+----------+
| NameNode | Single metadata server (active/standby in HA mode)
| (holds | Stores: namespace, block->datanode mappings
| namespace| Does NOT hold actual block data
| in DRAM) |
+----------+
/ | \
/ | \
+--------+ +--------+ +--------+
|DataNode| |DataNode| |DataNode| Store blocks (default 128 MB blocks)
| (data)| | (data)| | (data)| 3× replication by default
+--------+ +--------+ +--------+
Design constraints:
- Optimized for large files (100 MB+)
- Sequential read/write (batch processing)
- Write-once, append-only (no in-place updates)
- Not POSIX (no symlinks, limited permissions, no mmap)
- NameNode is single point of failure in non-HA config
HDFS block size (128 MB default) is intentionally huge to reduce NameNode memory usage (one entry per block) and to make large streaming reads efficient.
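The effect of block size and file size on NameNode load can be sketched by counting namespace objects (one per file plus one per block). The function names are illustrative, and the count deliberately ignores directories and replicas; the memory cost per object is left out because it depends on the Hadoop version.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def block_count(file_size: int, block_size: int = 128 * MIB) -> int:
    """Blocks needed for one file; even a tiny file occupies one block entry."""
    if file_size <= 0:
        return 0
    return -(-file_size // block_size)   # ceiling division


def namenode_objects(file_sizes: list[int], block_size: int = 128 * MIB) -> int:
    """Files plus blocks the NameNode must track in memory."""
    return len(file_sizes) + sum(block_count(size, block_size) for size in file_sizes)


if __name__ == "__main__":
    few_large = [1 * GIB] * 8        # 8 GiB stored as 8 large files
    many_small = [1 * MIB] * 8192    # the same 8 GiB stored as 8192 small files
    print(namenode_objects(few_large))    # 72
    print(namenode_objects(many_small))   # 16384
```

The same 8 GiB of data costs the NameNode more than 200 times as many entries when stored as small files, which is why HDFS is tuned for large files.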
Ceph:
Ceph Architecture:
+------------------------------------------+
| Client Layer |
| CephFS (POSIX FS) RBD (block) RGW (S3)|
+------------------------------------------+
| LIBRADOS |
| (object storage client library) |
+------------------------------------------+
| RADOS |
| (Reliable Autonomic Distributed Object |
| Store — the core of Ceph) |
| |
| +--------+ +--------+ +--------+ |
| | OSD 0 | | OSD 1 | | OSD 2 | | OSD = Object Storage Daemon
| | (NVMe) | | (NVMe) | | (HDD) | | One OSD per disk (recommended)
| +--------+ +--------+ +--------+ |
| |
| +--------+ +--------+ +--------+ |
| | Mon 0 | | Mon 1 | | Mon 2 | | Monitor: cluster map, quorum
| | (Paxos)| | (Paxos)| | (Paxos)| |
| +--------+ +--------+ +--------+ |
| |
| MGR (Ceph Manager: metrics, UI, modules) |
+------------------------------------------+
CRUSH Algorithm:
- Deterministic placement: CRUSH(object_id, cluster_map) -> OSD list
- No central lookup table: any client can compute placement independently
- Failure domains: rack-aware, zone-aware placement rules
- Rebalancing: adding/removing OSDs triggers minimal data migration
CRUSH is central to Ceph's scalability: there is no central metadata bottleneck for data placement decisions. Each client (librbd, CephFS, radosgw) hashes an object name to a placement group, then uses CRUSH and the cluster map distributed to all clients to compute the OSD list for that placement group.
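The key property, placement computed from the object ID and the cluster membership alone, can be demonstrated with a much simpler technique. The sketch below uses rendezvous (highest-random-weight) hashing as a stand-in; it is not the CRUSH algorithm and ignores placement groups, device weights, and failure domains. The names `place` and `_score` and the OSD labels are hypothetical.

```python
import hashlib


def _score(object_id: str, osd: str) -> int:
    """Pseudo-random but repeatable score for an (object, OSD) pair."""
    digest = hashlib.sha256(f"{object_id}/{osd}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_id: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` OSDs using only the object ID and the OSD list."""
    ranked = sorted(osds, key=lambda osd: _score(object_id, osd), reverse=True)
    return ranked[:replicas]


if __name__ == "__main__":
    osds = [f"osd.{i}" for i in range(8)]
    survivors = [osd for osd in osds if osd != "osd.7"]
    moved = sum(
        place(f"obj-{n}", osds) != place(f"obj-{n}", survivors) for n in range(1000)
    )
    print("placement of obj-42:", place("obj-42", osds))
    print("objects whose placement changed after removing osd.7:", moved)
```

Any client holding the same OSD list computes the same answer without asking a server, and removing one OSD only changes the placement of objects that had a replica on it, which mirrors the "no lookup table" and "minimal data migration" properties listed above.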
Lustre: High-Performance Computing (HPC) parallel filesystem. Architecture: MDT (Metadata Target, inode storage) + OST (Object Storage Targets, data storage). Clients communicate directly with OSTs for data I/Os. Can aggregate over 1 TB/s of bandwidth across thousands of OSTs. Used on many TOP500 supercomputers (for example ORNL Frontier).
Google Colossus: Successor to Google File System (GFS). Key differences from GFS: distributed metadata service (no single master), client-side metadata caching, Bigtable for metadata storage, Reed-Solomon erasure coding. Provides the storage backend for Google's entire data infrastructure.
Protocol Performance Comparison Table
| Protocol | Type | Transport | Latency | Max BW | Overhead | Best For |
|---|---|---|---|---|---|---|
| NVMe-oF | Block | RDMA/TCP | ~130-500 µs | 25+ GB/s | Very Low | Disaggregated NVMe |
| FC 32G | Block | FC fabric | ~200 µs | ~3.2 GB/s | Low (HBA) | Enterprise SAN |
| iSCSI | Block | TCP/IP | ~500 µs | ~1-10 GB/s | High CPU | General SAN on Ethernet |
| NFS/RDMA | File | RDMA | ~200 µs | ~12 GB/s | Low | HPC, high-perf NAS |
| NFSv4.1 | File | TCP/IP | ~500 µs | ~1-5 GB/s | Moderate | General file sharing |
| SMB3/RDMA | File | RDMA | ~200 µs | ~10 GB/s | Low | Windows HPC |
| S3/HTTP | Object | HTTP/TCP | ~50-200 ms | 50+ MB/s/req | Low | Cold/warm object store |
| HDFS | File | TCP/IP | ~1-10 ms | 100+ GB/s agg | Low | Batch analytics |
| Ceph RBD | Block | TCP/RDMA | ~300 µs | 10+ GB/s | Moderate | OpenStack volumes |
Historical Context
Network storage dates to the 1970s with file sharing in ARPANET. NFS was developed by Sun Microsystems in 1984 and became the dominant UNIX network filesystem. SMB originated at IBM in the 1980s and was adopted and extended by Microsoft for Windows; CIFS (Common Internet File System) was Microsoft's 1996 name for the SMB dialect of that era, later superseded by SMB2 and SMB3.
Fibre Channel emerged in the early 1990s as a replacement for parallel SCSI buses in data centers. The FC-AL (Arbitrated Loop) topology gave way to switched fabrics (FC-SW), enabling the SAN architecture that dominated enterprise storage through the 2000s.
iSCSI (RFC 3720, 2004) challenged FC by running block storage over commodity Ethernet. By 2010, iSCSI had become viable for performance-sensitive applications as 10GbE NICs became affordable.
The "cloud storage" era (2006 onward, S3 launch) fundamentally changed the paradigm: object storage with eventual consistency and HTTP API became the primary model for new applications, displacing both SAN and NAS for unstructured data.
Production Examples
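Netflix Open Connect: Deploys custom storage appliances (OCAs) in ISP networks. Each OCA runs FreeBSD and NGINX and serves video directly from its locally attached HDDs or flash rather than from a SAN; content is pushed to the appliances over the network during off-peak fill windows. It is a useful counter-example: for read-heavy, cacheable content, local storage plus HTTP distribution avoids the storage network entirely.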
Dropbox in-house storage (Magic Pocket): Moved from S3 to a custom-built, in-house object store (not based on an off-the-shelf system such as Ceph) with its own erasure coding. Stores exabytes of user data. 4+2 erasure code across 6 storage zones. Reduced cost by ~75% vs S3 at their scale.
LinkedIn Kafka on iSCSI: LinkedIn's early Kafka clusters used iSCSI SANs for persistent storage. As Kafka throughput grew, iSCSI became a bottleneck and they migrated to local NVMe SSDs on each broker. Latency dropped from ~2 ms to ~100 µs for sequential log writes.
Debugging Notes
# NFS debugging
showmount -e <server> # list exports
rpcinfo -p <server> # list RPC services on server
mountstats # detailed per-mount NFS statistics (op latencies)
nfsiostat 1 # live NFS I/O stats
cat /proc/self/mountstats # raw NFS mount stats
# iSCSI debugging
iscsiadm -m session -P 3 # verbose session info including negotiated params
iscsiadm -m node -P 3 # node configuration
dmesg | grep -i iscsi # kernel-level iSCSI messages
cat /sys/class/iscsi_session/session*/state
# Ceph debugging
ceph status # cluster health overview
ceph osd df # OSD capacity and PG distribution
ceph df # pool-level usage
ceph osd perf # OSD latency statistics
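rados bench -p mypool 30 write --no-cleanup # write benchmark (keeps objects for the read test)
rados bench -p mypool 30 rand # random read benchmark (reads the objects written above)
rados -p mypool cleanup # remove the benchmark objects afterwards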
# FC debugging
systool -c fc_host -v # view FC HBA info
cat /sys/class/fc_host/host0/port_state # HBA port state
cat /sys/class/fc_host/host0/port_name # WWPN
Security Implications
iSCSI security: iSCSI with no authentication (CHAP disabled) means any host on the storage network can access any exposed LUN. Always configure CHAP (at minimum) or IPsec for iSCSI. In a misconfigured iSCSI environment, one compromised host can read/overwrite any other host's storage — a critically dangerous misconfiguration seen in cloud environments.
NFS security: NFSv3 authentication is based on client IP + UID/GID — easily spoofed. NFSv4 with Kerberos (krb5p) provides authentication and encryption. root_squash (default) remaps client root (UID 0) to anonymous user — important to prevent privilege escalation.
FC zoning: Hard zoning ensures storage is isolated even if a host HBA is compromised (cannot enumerate storage outside its zone). Soft zoning alone is insufficient for hostile environments.
Performance Implications
NFS rsize/wsize: Default NFS read/write block sizes vary by client. For high-throughput applications, set rsize=1048576,wsize=1048576 (1 MB). This significantly reduces RPC call overhead for large sequential I/Os.
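The RPC overhead argument can be put into numbers with a deliberately pessimistic model: one READ RPC in flight at a time, each paying a full round trip. Real clients pipeline requests and use readahead, so actual differences are smaller. The round-trip time and bandwidth are assumed parameters, and the function names are illustrative.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def read_rpcs(file_size: int, rsize: int) -> int:
    """Number of READ RPCs needed to fetch file_size bytes."""
    return -(-file_size // rsize)   # ceiling division


def serial_read_seconds(file_size: int, rsize: int, rtt_s: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Worst-case estimate: per-RPC round trips plus raw transfer time."""
    return read_rpcs(file_size, rsize) * rtt_s + file_size / bandwidth_bytes_per_s


if __name__ == "__main__":
    for rsize in (65536, 1048576):
        rpcs = read_rpcs(1 * GIB, rsize)
        # Assumed: 200 microsecond round trip, 10 GbE (1.25e9 bytes/s).
        seconds = serial_read_seconds(1 * GIB, rsize, 200e-6, 1.25e9)
        print(f"rsize={rsize}: {rpcs} RPCs, about {seconds:.2f} s")
```

Reading 1 GiB takes 16,384 RPCs at rsize=65536 but only 1,024 at rsize=1048576, so the fixed per-RPC cost is paid 16 times less often.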
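NFS sync vs async: With the sync export option (the default), the NFS server commits each write to stable storage before acknowledging it. async lets the server acknowledge writes that are still held in cache: much faster, but acknowledged data is lost if the server crashes. A UPS protects only against power loss, not against kernel panics or hardware faults, so reserve async for data that can be regenerated.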
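iSCSI queue depth: The target advertises a command window through ExpCmdSN and MaxCmdSN; the initiator may have at most MaxCmdSN - ExpCmdSN + 1 commands outstanding. On the open-iscsi initiator the usual defaults are node.session.queue_depth = 32 (per LUN) and node.session.cmds_max = 128 (per session). For NVMe-backed iSCSI targets, raise the session limit to 256-1024, for example iscsiadm -m node --op update -n node.session.cmds_max -v 1024, then log in again for the change to take effect.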
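Why the number of commands in flight matters follows from Little's law: throughput equals concurrency divided by latency. The sketch below applies it with example figures (32 outstanding commands, 500 µs per I/O); the function names are illustrative.

```python
def max_iops(outstanding: int, latency_us: int) -> float:
    """Upper bound on IOPS for a given concurrency and per-I/O latency."""
    return outstanding * 1_000_000 / latency_us


def required_depth(target_iops: int, latency_us: int) -> int:
    """Smallest number of outstanding commands that can sustain target_iops."""
    return -(-target_iops * latency_us // 1_000_000)   # ceiling division


if __name__ == "__main__":
    print(max_iops(32, 500))             # 64000.0
    print(required_depth(500_000, 500))  # 250
```

With 32 commands in flight and 500 µs per I/O, a session tops out at 64,000 IOPS no matter how fast the backing media is; sustaining 500,000 IOPS at the same latency needs about 250 commands in flight.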
Failure Modes and Real Incidents
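NFS stale file handle: When a file or directory is deleted or replaced on the server (or an export is re-created with a different fsid) while a client still holds its handle, the client's next operation on it fails with ESTALE (stale file handle), and the application must reopen the path. This is separate from an unreachable server: with the hard mount option the client retries indefinitely and the process hangs until the server is available again. For applications that must fail rather than hang, use soft,timeo=600,retrans=3 (timeo is in tenths of a second, so lower it for faster failure), accepting that soft mounts can surface I/O errors and partial writes.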
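On the application side, ESTALE is usually handled by resolving the path again, since a fresh open obtains a new handle. The following is a minimal sketch of that pattern, not a complete recovery strategy; `read_retrying_estale` and `read_file` are hypothetical helper names, and retrying is only safe for idempotent operations such as whole-file reads.

```python
import errno
from typing import Callable


def read_file(path: str) -> bytes:
    """Open the path, read everything, and close it again."""
    with open(path, "rb") as handle:
        return handle.read()


def read_retrying_estale(path: str, reader: Callable[[str], bytes] = read_file,
                         attempts: int = 3) -> bytes:
    """Call reader(path); on ESTALE, reopen the path up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return reader(path)
        except OSError as exc:
            # Re-raise anything that is not ESTALE, and ESTALE on the last try.
            if exc.errno != errno.ESTALE or attempt == attempts - 1:
                raise
            # Otherwise loop: the next call looks the path up again.
    raise RuntimeError("unreachable")
```

A caller would use it as `data = read_retrying_estale("/mnt/remote/report.csv")`, where the path is a placeholder. If the file was genuinely deleted, the retry fails with ENOENT instead, which is the more accurate error to report.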
Split-brain SAN: In an FC SAN with active-active storage array and two paths to each LUN, a network partition can cause both hosts to believe they are the active owner of a LUN. Without a proper distributed lock manager or SCSI-3 persistent reservations (PR), both hosts write to the same LUN simultaneously → filesystem corruption. Solution: SCSI-3 PR or GFS2/OCFS2 cluster-aware filesystem.
NFS traffic on shared Ethernet (2018 incident at a financial firm): An NFS-over-Ethernet backup job competing with production iSCSI traffic on the same 10GbE network segment caused TCP congestion, iSCSI retransmits, and elevated storage latency. Result: 20-minute database slowdown during market hours. Root cause: failure to isolate storage traffic onto a dedicated VLAN/network. Lesson: storage traffic should always be on a dedicated network segment.
Modern Usage
- All-NVMe storage arrays (Pure Storage FlashArray, NetApp AFF A-Series, HPE Alletra): Expose LUNs via NVMe-oF (TCP or RDMA) with sub-200 µs latency to hosts. NVMe-oF is replacing iSCSI and FC for new deployments.
- Kubernetes storage (CSI — Container Storage Interface): k8s uses CSI drivers to provision block (NVMe-oF, iSCSI) or file (NFS, Ceph) storage for pods. Ceph RBD and NFS are the most common CSI backends.
- AWS EFS (Elastic File System): Managed NFS (NFSv4.1) at scale. Automatically scales from GB to PB, but latency is ~1-10 ms vs local SSD.
- Azure NetApp Files: Enterprise NFS/SMB-as-a-service, powered by NetApp ONTAP hardware. Provides low-latency (<1 ms) NFS to Azure VMs.
Future Directions
- NVMe-oF universal adoption: As 25/100 GbE becomes standard, NVMe-oF over TCP (and over RoCE as RDMA-capable NICs become cheaper) is expected to displace iSCSI and FC in many new deployments over the next several years
- S3 becoming a universal storage API: S3 compatibility is now expected from every object store (MinIO, Ceph RGW, ONTAP S3). S3 is the lingua franca of object storage.
- CXL memory sharing: CXL 2.0/3.0 enables shared memory pools across servers, blurring the line between network storage and local memory. CXL could replace certain NVMe-oF use cases where latency < 500 ns is required.
- Computational storage integration: NAS/SAN devices with embedded compute for inline compression, deduplication, and analytics (filter pushdown to storage node)
Exercises
- Set up an NFS server and client on two Linux machines (or two containers). Mount with rsize=65536 and then with rsize=1048576. Run a fio sequential read in each case and measure the difference. Explain the result using RPC call overhead.
- Configure an iSCSI target on Linux using targetcli (LIO kernel target). Connect from an initiator host. Measure latency with ioping -c 100 /dev/sdb. Compare to a locally attached disk.
- Use Ceph's rados bench to measure write and read latency/throughput on a single-node Ceph cluster (deployed via cephadm). Measure 4K random IOPS via rbd bench.
- Research the "NFS stale file handle" problem. Write a script that creates, writes, and reads an NFS-mounted file, then artificially causes a stale handle (delete the file from the server) and observe the client's behavior with hard vs soft mount options.
- Compare S3 API (boto3) latency for small (1 KB) vs large (100 MB) object PUT/GET operations to a local MinIO instance. Calculate effective throughput for each. What is the fixed overhead per request?
References
- Patterson, D.A. et al. "A Case for Redundant Arrays of Inexpensive Disks (RAID)." SIGMOD 1988.
- NFS v4.1 RFC 5661: https://www.rfc-editor.org/rfc/rfc5661
- iSCSI RFC 3720: https://www.rfc-editor.org/rfc/rfc3720
- NVMe over Fabrics Specification: https://nvmexpress.org/specifications/
- Weil, S.A. et al. "Ceph: A Scalable, High-Performance Distributed File System." OSDI 2006.
- Weil, S.A. et al. "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data." SC 2006.
- Shvachko, K. et al. "The Hadoop Distributed File System." MSST 2010.
- Amazon S3 documentation: https://docs.aws.amazon.com/s3/
- Schmuck, F. and Haskin, R. "GPFS: A Shared-Disk File System for Large Computing Clusters." FAST 2002.
- Brewer, E. "Kubernetes and the Path to Cloud Native." SoCC 2015.