01 — Distributed Systems Fundamentals
Technical Overview
A distributed system is a collection of autonomous computing elements that appears to its users as a single coherent system. That deceptively simple definition hides decades of hard-won lessons about what "coherent" really costs. Unlike a single machine where failure is total and predictable, a distributed system can fail in infinitely creative partial ways — a disk dies on node 3 while nodes 1, 2, and 4 hum along, or a network switch drops packets between two datacenter racks while everything else communicates fine.
The defining characteristic of a distributed system is not merely the presence of multiple machines. It is the presence of partial failure combined with no shared clock and no shared memory. These three constraints together make distributed systems qualitatively harder than concurrent programming on a single machine.
Prerequisites
- Operating systems fundamentals: processes, threads, IPC
- Networking: TCP/IP, UDP, sockets, latency concepts
- Basic data structures and algorithms
- Some familiarity with database concepts (ACID properties helps)
Core Content
What Makes a System Distributed?
Three structural properties define a distributed system:
- Spatial separation: Components run on separate hardware, connected by a network.
- Independent failure domains: A component can fail without affecting others (or while others cannot detect it).
- Concurrent execution: Components proceed independently, with no global synchronization point.
The consequence: you cannot distinguish a slow node from a dead node. If you send a message and hear nothing back, you cannot know whether the message was lost, the remote node crashed, the response was lost, or the remote node is simply slow. This is the fundamental observation driving most distributed systems design.
Deutsch's 8 Fallacies of Distributed Computing
Peter Deutsch at Sun Microsystems in 1994 (later extended by James Gosling) codified the assumptions programmers erroneously make when building distributed systems:
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Every one of these is false in production. They are false in your local datacenter, and catastrophically false across datacenters or continents. The interesting fallacies for correctness are 1 (reliability) and the implicit ninth fallacy Deutsch left unstated: time is globally consistent.
Partial Failure
Peter Deutsch's first fallacy leads to the concept of partial failure. On a single machine, failure is typically total — the OS crashes, the process segfaults. In a distributed system, some components can fail while others continue. This creates scenarios with no single-machine analog:
- Network partition: Two groups of nodes cannot communicate with each other but operate normally within each group.
- Node crash: A node stops responding but its state is intact (it will recover).
- Node fail-stop: A node detects its own failure and halts cleanly (rare and usually assumed for simplicity).
- Byzantine failure: A node behaves arbitrarily — sending conflicting messages to different peers, lying about its state, or acting maliciously.
- Slow node: A node responds, but after timeouts have fired, causing others to declare it dead.
The canonical thought experiment for partial failure is the Two Generals Problem (also called the Coordinated Attack Problem), published in ACM Computing Surveys in 1975 by Y. C. Hadzilacos based on earlier work by Jim Gray:
Two armies, commanded by Generals A and B, must attack simultaneously to win.
They can only communicate via messengers through enemy territory.
Any messenger may be captured (message loss).
General A sends: "Attack at dawn."
Messenger might be captured. A doesn't know if B received it.
B receives it and sends back confirmation.
That messenger might also be captured. B doesn't know if A got the confirmation.
A sends confirmation of confirmation...
No finite exchange of messages can guarantee both generals know
the other is committed to attacking.
This is not merely an academic puzzle. It proves that you cannot achieve perfect agreement over an unreliable channel. TCP handshakes, distributed commits, and protocol acknowledgments all live under this shadow.
Asynchronous Communication
Distributed systems are asynchronous: there is no bound on message delivery time. A message sent now might arrive in 1 millisecond or 10 seconds, depending on network congestion, OS scheduling, or garbage collection pauses. This has profound implications:
- Timeout-based failure detection is always approximate: A timeout declares a node dead, but the node might just be slow.
- Ordering of events is not naturally preserved: Message A sent before message B might arrive after it.
- You cannot use wall-clock time for ordering: Clocks on different machines differ and drift.
The FLP Impossibility Theorem (Fischer, Lynch, Paterson 1985) proved that in a fully asynchronous distributed system, no consensus protocol can tolerate even a single crash failure while guaranteeing termination. This is the theoretical bedrock explaining why all practical consensus protocols (Paxos, Raft) must make additional assumptions (partial synchrony, leader leases, timeouts).
Time and Clocks
Physical Clocks
Every computer has a hardware clock (quartz oscillator) that drifts. Two machines synchronized via NTP will diverge by microseconds to milliseconds over time. NTP in a datacenter typically achieves 1–10ms accuracy. GPS-disciplined clocks (used by Google Spanner) achieve ~100ns but cost hundreds of thousands of dollars per datacenter.
The problem with physical clocks for distributed systems: if node A's clock reads 10:00:00.500 and node B's clock reads 10:00:00.480, and both claim to have "written" a record at their respective times, who wrote first? You cannot know from the timestamps alone.
Google's Spanner introduced TrueTime: an API that returns an interval [earliest, latest] rather than a point, and guarantees the true time falls within that interval. Commit waits until the interval is in the past. This is expensive and requires atomic clocks + GPS receivers in every datacenter.
Logical Clocks
Leslie Lamport's 1978 paper "Time, Clocks, and the Ordering of Events in a Distributed System" introduced a simpler solution: logical clocks that track causality without requiring synchronized physical clocks.
Lamport's key insight was the happened-before relation (→):
- If a and b are events in the same process, and a comes before b: a → b
- If a is a send event and b is the corresponding receive: a → b
- Transitivity: if a → b and b → c, then a → c
Events that are not related by happened-before are concurrent (written a ∥ b). A Lamport clock assigns a number L(e) to each event such that if a → b then L(a) < L(b). The converse is not guaranteed — L(a) < L(b) does not imply a → b.
(Logical clocks are covered in depth in 04-logical-clocks.md.)
The Byzantine Generals Problem
Lamport, Shostak, and Pease published "The Byzantine Generals Problem" in 1982, generalizing the Two Generals problem to arbitrary failures. The scenario:
Several Byzantine army divisions surround an enemy city.
Each division is commanded by a general.
Generals communicate only by messenger.
Some generals may be traitors (Byzantine = arbitrary/malicious behavior).
Goal: loyal generals must agree on a plan (attack or retreat).
Traitors may send conflicting messages to different generals.
The key result: with m Byzantine traitors, you need at least 3m+1 total generals to reach agreement. With fewer than 3m+1, a Byzantine minority can prevent agreement or cause loyal generals to disagree.
This matters for practical systems because Byzantine failures occur in: - Multi-party systems where some participants may be adversarial (cryptocurrency nodes) - Hardware with bit-flip errors in safety-critical systems - Any system where you cannot trust all participants
Most datacenter distributed systems (Kafka, ZooKeeper, etcd) assume crash-stop failures only (nodes stop rather than lie) and use simpler protocols as a result.
Fundamental Problems in Distributed Systems
Three problems underlie most of distributed systems theory:
1. Agreement (Consensus): All non-faulty nodes must agree on the same value. Used in leader election, distributed commits, replicated state machines.
2. Reliable Delivery: Guarantee that a message sent by a correct process is eventually delivered to all correct processes. Simple in the two-process case; complex with failures and dynamic membership.
3. Total Ordering: Ensure all nodes process events in the same order. Required for replicated state machines (RSM) — the foundation of databases like Spanner and coordination services like ZooKeeper.
The Relationship:
[Agreement] ←→ [Total Order Broadcast]
↑
| (equivalent in power)
↓
[Atomic Broadcast]
|
↓ (solves)
[Replicated State Machine]
Solving total order broadcast in the presence of failures requires consensus, and solving consensus requires total order broadcast. They are equivalent. This is why Paxos/Raft, which solve consensus, also give you a replicated log.
Network Partitions: The Reality
A network partition is a scenario where a subset of nodes cannot communicate with another subset. Partitions are not theoretical:
- 2011, Amazon AWS: EBS outage caused by a network configuration error that partitioned the control plane.
- 2012, LinkedIn: Kafka cluster partition caused consumers to process messages out of order.
- 2016, GitHub: A 43-minute global outage caused by a routing change that partitioned their network, resulting in split-brain in their MySQL clusters.
In a modern datacenter, a single 48-port switch failing creates a partition. A misconfigured BGP route announcement can partition entire availability zones. The question is never "will we have a partition?" but "how does our system behave when we do?"
Normal Operation:
[Node A] ←────────→ [Node B]
↑ ↑
└──────────────────→[Node C]
During Partition:
[Node A] ✗────────✗ [Node B]
↑ ↑
└──────────────────→[Node C]
Node A and C form one partition.
Node B is isolated (or B and C form another partition).
Historical Context
The history of distributed systems tracks the history of networking itself.
1969: ARPANET connects four nodes (UCLA, Stanford Research Institute, UC Santa Barbara, University of Utah). The first inter-node message crashes the receiving machine after two letters.
1970s: Time-sharing systems experiment with networked file systems. Lamport publishes foundational work on logical clocks (1978) and mutual exclusion in distributed systems.
1982: Lamport, Shostak, and Pease publish "The Byzantine Generals Problem." Sun Microsystems introduces NFS, the first widely-deployed distributed file system.
1985: Fischer, Lynch, and Paterson publish the FLP impossibility result. This paper wins the Dijkstra Prize in 2001 as the most influential paper in distributed computing.
1987: Demers et al. at Xerox PARC publish gossip protocols for database replication.
1989/1998: Lamport circulates "The Part-Time Parliament" (published 1998), describing Paxos consensus. Its notoriously difficult prose leads Lamport to publish "Paxos Made Simple" in 2001.
1990s: The web era drives massive scale. Jeff Dean and Sanjay Ghemawat at Google build MapReduce, GFS, and Bigtable. Amazon builds Dynamo (Decanzo et al. 2007) to handle shopping cart state.
2000s: Eric Brewer proposes the CAP theorem (2000). Gilbert and Lynch provide a formal proof (2002). Google Chubby (Burrows 2006) and Apache ZooKeeper (Hunt et al. 2010) make consensus practical.
2012–present: Raft (Ongaro & Ousterhout 2014) makes consensus understandable. Google Spanner (Corbett et al. 2012) achieves global strong consistency using TrueTime. The cloud era makes distributed systems the default for all significant applications.
Production Examples
- Google GFS: Single-master distributed file system. The "chunk server" model trades POSIX compliance for throughput on large files. Master is a single point of coordination (and historically, failure).
- Amazon DynamoDB: Leaderless, eventually-consistent key-value store based on Dynamo paper. Vector clocks for conflict detection (switched to last-write-wins in practice).
- Apache Kafka: Distributed log with leader-per-partition replication. Relies on ZooKeeper (later KRaft) for controller election and metadata.
- Google Spanner: Globally distributed SQL database with external consistency (stronger than linearizability). Uses TrueTime API and Paxos groups per shard.
- etcd: Raft-based distributed key-value store. The backbone of Kubernetes control plane state.
Debugging Notes
Debugging distributed systems is fundamentally different from debugging single-machine systems:
- Non-determinism: The same bug may not reproduce because message ordering varies.
- Distributed tracing: You need correlation IDs propagated across service calls (Jaeger, Zipkin, OpenTelemetry).
- Log aggregation: Events on different nodes need to be correlated. Use structured logs with timestamps and request IDs.
- Clock skew: Comparing timestamps across nodes requires knowing the clock offset. NTP does not guarantee monotonicity —
clock_gettime(CLOCK_REALTIME)can go backwards. - Jepsen testing: Kyle Kingsbury's Jepsen framework (jepsen.io) injects network partitions and node failures while running concurrent workloads, then checks whether linearizability was violated. Has found bugs in Cassandra, MongoDB, Redis, etcd, PostgreSQL, and dozens more.
Key tools: Wireshark/tcpdump for packet inspection, tc netem for inducing latency/loss/partition in testing, Chaos Monkey (Netflix) for production resilience testing.
Security Implications
- Authentication across nodes: In a distributed system, every inter-node call must be authenticated. Assume the network is hostile (zero-trust). Use mTLS (mutual TLS) for service-to-service communication.
- Byzantine failures: Without Byzantine fault tolerance, a single compromised node can corrupt the entire system's state. Most datacenter systems accept this risk; financial systems and blockchains do not.
- Split-brain: A network partition can create two "primaries" that both accept writes, leading to divergent state. This is a correctness failure with security implications in access control systems (two primaries with different ACL state).
- Replay attacks: Distributed systems often use sequence numbers or logical timestamps. An attacker who replays an old message can cause incorrect state transitions.
- Information leakage via timing: Timing-based side channels are harder to exploit but still possible. A slow consensus round can leak information about contention.
Performance Implications
- Latency is multiplicative: A request requiring 3 serial RPCs at 10ms each takes 30ms minimum, not 10ms. Design for parallelism; use fan-out patterns where possible.
- The tail latency problem: In a system with 100 servers, the probability that at least one server is slow is high. Jeff Dean's work on "The Tail at Scale" (2013) shows that hedged requests (send to two, use first response) dramatically reduce p99 latency.
- Coordination is expensive: Anything requiring consensus (Paxos/Raft round-trip) adds a full network round-trip plus write-ahead log fsync. Google Spanner commits take 10–100ms globally. Design to minimize consensus path.
- Serialization overhead: Protobuf, Thrift, Avro, or JSON serialization CPU cost adds up at scale. Choose based on schema evolution needs and performance budget.
Failure Modes and Real Incidents
2011, Amazon AWS us-east-1 EBS Outage: A network change during capacity upgrade created a re-mirroring storm. EBS volumes that lost connectivity tried to re-mirror simultaneously, saturating the network. The partial failure (some nodes could talk, some couldn't) was harder to recover from than a complete outage would have been.
2012, GitHub MySQL Failover: A datacenter switch failure caused the primary MySQL to become unreachable from one datacenter. Automated failover promoted a replica that was 30 seconds behind. When the network recovered, the old primary had newer data. 30 seconds of writes were lost and had to be manually reconciled.
2020, Cloudflare Incident: A router software bug caused BGP route withdrawals that partitioned a portion of their network. Traffic was blackholed for 27 minutes globally.
2021, Facebook October Outage: A BGP configuration change removed all Facebook-owned routes from the global internet. Their distributed systems (which relied on BGP for health checks) declared all datacenters unreachable. DNS servers took themselves offline. The systems worked as designed — they correctly detected an "unreachable" condition — but the condition was caused by a configuration error, not a real failure.
Modern Usage
Every cloud-native application is a distributed system. Microservice architectures have pushed distributed systems concerns (service discovery, partial failure, eventual consistency) into every engineering team, not just infrastructure specialists.
Key modern patterns: - Service mesh (Istio, Linkerd): Offload mTLS, retries, circuit-breaking, observability to a sidecar proxy. - Event-driven architecture: Kafka or Kinesis as the durable distributed log; services consume and produce events asynchronously. - Serverless: Functions as a Service (AWS Lambda, Google Cloud Functions) hide most distributed systems complexity but expose new failure modes (cold starts, invocation timeouts, at-least-once delivery). - Global databases: Spanner, CockroachDB, YugabyteDB, and PlanetScale bring strong consistency to globally distributed deployments.
Future Directions
- Deterministic distributed systems: Research into making distributed systems reproduce bugs deterministically (FoundationDB's simulation testing, TigerBeetle's deterministic simulation).
- Formal verification: TLA+ (Lamport's specification language) is being adopted by AWS, Microsoft Azure, and others to verify distributed protocols before implementation. The "Weeks of coding can save hours of specification" joke runs in reverse.
- Post-quantum distributed security: Current mTLS uses elliptic curve cryptography. As quantum computing matures, key agreement protocols need upgrading.
- CXL and shared memory fabrics: CXL (Compute Express Link) allows memory sharing across machines. This may blur the line between distributed and shared-memory systems for tightly-coupled clusters.
Exercises
-
Two Generals Simulation: Write a simulation in Python where two "generals" communicate over a lossy channel (drop messages with probability p). Observe that even with arbitrarily many message exchanges and p=0.01, they cannot achieve certain agreement. What is the maximum number of rounds needed to achieve 99.9% probability of agreement?
-
Clock Skew Measurement: On a Linux system, compare
CLOCK_REALTIMEvsCLOCK_MONOTONICover 10 seconds with 1ms polling. Plot the drift. Then enable NTP correction and observeCLOCK_REALTIMEjumps backward. Explain why monotonic clocks are required for measuring intervals. -
Partial Failure Exploration: Using Docker Compose, create three nodes running a simple HTTP server that forwards requests to each other. Use
tc netemto introduce 30% packet loss between node 1 and node 2. Observe how node 3 behaves when forwarding requests from node 1 to node 2 and vice versa. Can you detect the partition from node 3's perspective? -
Fallacies in Code: Find a codebase (open source preferred) that assumes zero network latency (e.g., a loop that calls an HTTP endpoint and counts responses per second without batching). Measure the actual throughput vs the theoretical maximum if the latency were zero. What is the ratio?
-
Jepsen Toy: Without using the Jepsen framework, write a concurrent test for a distributed key-value store: 5 threads simultaneously do read-modify-write operations on the same key. Run this for 10 seconds. Compare the final value to what it should be if every operation was atomic. How many times does the result diverge from correct?
References
- Lamport, L. (1978). "Time, Clocks, and the Ordering of Events in a Distributed System." Communications of the ACM, 21(7), 558–565.
- Fischer, M. J., Lynch, N. A., & Paterson, M. S. (1985). "Impossibility of Distributed Consensus with One Faulty Process." Journal of the ACM, 32(2), 374–382.
- Lamport, L., Shostak, R., & Pease, M. (1982). "The Byzantine Generals Problem." ACM Transactions on Programming Languages and Systems, 4(3), 382–401.
- Deutsch, P. (1994). "The Eight Fallacies of Distributed Computing." Sun Microsystems technical note.
- Gray, J. (1978). "Notes on Database Operating Systems." Lecture Notes in Computer Science, 60, 393–481. (Two Generals discussed in the context of commit protocols.)
- Dean, J., & Ghemawat, S. (2004). "MapReduce: Simplified Data Processing on Large Clusters." OSDI.
- DeCandia, G., et al. (2007). "Dynamo: Amazon's Highly Available Key-value Store." SOSP.
- Dean, J., & Barroso, L. A. (2013). "The Tail at Scale." Communications of the ACM, 56(2), 74–80.