Distributed Systems Engineer Learning Roadmap
A complete phase-by-phase roadmap for building deep expertise in distributed systems — from foundations through production systems to research-grade understanding. Each phase includes papers, implementation projects, and explicit advancement criteria.
Overview
| Phase | Months | Focus | Target Outcome |
|---|---|---|---|
| Foundations | 0–3 | Networking, DDIA, MIT 6.824 labs | Can implement MapReduce and basic Raft |
| Core Algorithms | 3–9 | Consensus, replication, foundational papers | Can implement Raft from scratch with correctness tests |
| Production Systems | 9–18 | Kafka, Cassandra, CockroachDB, open-source contribution | Can operate and debug production distributed systems |
| Advanced | 18–36 | Transactions, CRDTs, causal consistency, research frontier | Can design novel distributed protocols |
Phase 1: Foundations (Months 0–3)
Primary Text
Designing Data-Intensive Applications (DDIA) — Martin Kleppmann - Publisher: O'Reilly, 2017 - ISBN: 978-1449373320 - This is the single most important book for the foundations phase. Read it cover to cover — it is the modern consensus starting point for the field.
Month 1: Networking Fundamentals and DDIA Part I
Networking Topics (self-study):
| Topic | Resource | Key Concepts |
|---|---|---|
| TCP/IP internals | Stevens "TCP/IP Illustrated Vol 1" (ISBN: 978-0321336316) | Handshake, flow control, congestion control, TIME_WAIT |
| UDP and why it matters | Same | Unreliable delivery, DNS, QUIC motivation |
| TLS and certificate chains | Cloudflare TLS guide (free online) | Handshake, ALPN, session resumption |
| Linux network sockets | man pages + beej.us/guide/bgnet |
select, epoll, non-blocking I/O |
DDIA Reading Schedule: - Week 1: Chapters 1–3 (reliable/scalable/maintainable; data models; storage engines — B-trees vs. LSM trees) - Week 2: Chapter 4 (encoding: JSON, Protobuf, Avro — schema evolution matters enormously) - Week 3: Chapters 5–6 (replication; partitioning — understand replication lag, read-your-writes) - Week 4: Review + write 2-page summary of every chapter
Lab Exercise: Implement a simple TCP echo server in C using epoll. Measure maximum connections per second. Add connection pooling. This makes the networking primitives concrete before they appear in distributed protocols.
Month 2: MIT 6.824 Labs — MapReduce and Infrastructure
Course: MIT 6.824 Distributed Systems (labs publicly available at https://pdos.csail.mit.edu/6.824/)
Lab 1: MapReduce
| Component | What to Implement | Key Design Decision |
|---|---|---|
| Coordinator | Task assignment, worker crash detection | Use heartbeat timeouts (10 seconds recommended by spec) |
| Worker | Map execution, Reduce execution, intermediate file handling | Write atomically: write to temp file, then rename |
| Fault tolerance | Re-assign tasks from dead workers | Workers are idempotent — safe to re-execute |
Success Criteria for Lab 1: - All provided test cases pass including crash tests - Understand why map outputs are written to local disk (not shared storage)
DDIA Chapters 10–12: Batch processing, stream processing, future of data systems — read alongside Lab 1 to see MapReduce in context.
Month 3: MIT 6.824 Labs — Raft Part 1 and Fault-Tolerant KV
Lab 2: Raft Consensus (Part A — leader election)
Raft paper reading first (Ongaro & Ousterhout, 2014 — full citation in Phase 2 paper list).
| Raft Component | Key Invariant | Common Bug |
|---|---|---|
| Leader election | At most one leader per term | Not resetting election timer on heartbeat |
| Log replication | Leader log is authoritative | Sending wrong prevLogIndex in AppendEntries |
| Safety | Election restriction prevents log loss | Accepting stale votes from smaller term |
| Persistence | currentTerm, votedFor, log must persist | Forgetting to persist before responding |
Lab 3: Fault-Tolerant Key-Value Service
Build a KV store on top of your Raft implementation: - Client retries on timeout/leader change - Duplicate request detection (use client ID + sequence number) - Linearizability: every operation appears instantaneous
Success Criteria for Phase 1: - MIT 6.824 Labs 1, 2A, 2B, 3 all pass their test suites - Can explain the difference between linearizability, sequential consistency, and eventual consistency - Can explain why Raft needs leader leases for stale-read avoidance
Phase 2: Core Algorithms (Months 3–9)
Paper Reading List — Consensus and Replication
Read papers in this order. Each builds on the previous.
| Paper | Full Citation | Key Contribution | Difficulty | Prereqs |
|---|---|---|---|---|
| Lamport Clocks | Leslie Lamport. "Time, Clocks, and the Ordering of Events in a Distributed System." CACM 21(7), 1978, pp. 558–565. | Happened-before relation, logical timestamps | Medium | None |
| FLP Impossibility | Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson. "Impossibility of Distributed Consensus with One Faulty Process." JACM 32(2), 1985, pp. 374–382. | No deterministic consensus in async model with one crash | Hard | Lamport Clocks |
| Paxos Made Simple | Leslie Lamport. "Paxos Made Simple." ACM SIGACT News 32(4), 2001, pp. 51–58. | Two-phase consensus protocol | Hard | FLP |
| Paxos Made Live | Tushar Chandra, Robert Griesemer, Joshua Redstone. "Paxos Made Live: An Engineering Perspective." PODC 2007, pp. 398–407. | What Multi-Paxos actually requires in practice | Hard | Paxos Made Simple |
| Raft | Diego Ongaro, John Ousterhout. "In Search of an Understandable Consensus Algorithm." USENIX ATC 2014, pp. 305–320. | Consensus designed for understandability | Medium | FLP, Paxos |
| ZooKeeper / ZAB | Flavio Junqueira, Benjamin Reed, Marco Serafini. "Zab: High-performance broadcast for primary-backup systems." DSN 2011, pp. 245–256. | Atomic broadcast vs. consensus distinction | Medium | Raft |
| Viewstamped Replication | Barbara Liskov, James Cowling. "Viewstamped Replication Revisited." MIT Tech Report MIT-CSAIL-TR-2012-021, 2012. | Alternative to Paxos, cleaner state machine framing | Medium | Raft |
| PBFT | Miguel Castro, Barbara Liskov. "Practical Byzantine Fault Tolerance." OSDI 1999, pp. 173–186. | Byzantine consensus with 3f+1 nodes | Very Hard | Paxos |
Month 4–5: Implement Raft From Scratch
This is the most important hands-on exercise in the roadmap. Do not skip.
Specification Checklist (from the Raft paper, Figure 2):
| State | Persistent | Notes |
|---|---|---|
currentTerm |
Yes | Update before any RPC response |
votedFor |
Yes | Null or candidateId |
log[] |
Yes | Each entry: term + command |
commitIndex |
No | Volatile — re-derived on restart |
lastApplied |
No | Volatile |
Test Framework Requirements:
- Network partition simulation (drop packets between arbitrary node sets)
- Leader crash and restart
- Concurrent client requests
- Verify linearizability with porcupine (Go library) or equivalent
Implementation Language Options:
| Language | Pros | Cons |
|---|---|---|
| Go | Native goroutines, channels — natural fit for Raft timers and RPCs | Garbage collection pauses can affect timing tests |
| Rust | Memory safety, no GC | More boilerplate, tokio async complexity |
| C++ | Maximum control, close to paper pseudocode | Manual memory management risk |
Recommended: Go. MIT 6.824 labs are in Go and you have tests ready.
Month 5–6: Storage System Papers
| Paper | Full Citation | Key Contribution | Difficulty |
|---|---|---|---|
| GFS | Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. "The Google File System." SOSP 2003, pp. 29–43. | Large-scale distributed storage; relaxed consistency for throughput | Medium |
| Dynamo | Giuseppe DeCandia et al. "Dynamo: Amazon's Highly Available Key-Value Store." SOSP 2007, pp. 205–220. | Eventual consistency, vector clocks, consistent hashing, sloppy quorum | Medium |
| Bigtable | Fay Chang et al. "Bigtable: A Distributed Storage System for Structured Data." OSDI 2006, pp. 205–218. | Wide-column model, tablet server architecture, compaction | Medium |
| Spanner | James C. Corbett et al. "Spanner: Google's Globally Distributed Database." OSDI 2012, pp. 251–264. | TrueTime for external consistency, distributed transactions globally | Hard |
| CRDT Survey | Marc Shapiro et al. "Conflict-Free Replicated Data Types." SSS 2011, pp. 386–400. | Mathematical framework for convergent replicated data | Hard |
Month 7–9: MIT 6.824 Lab 4 — Sharded KV
Lab 4: Sharded key-value service with dynamic resharding
| Component | Challenge |
|---|---|
| Shard controller | Raft-replicated configuration service |
| Key-value servers | One Raft group per shard |
| Reconfiguration | Move shards between groups atomically |
| Linearizability | Maintained across shard movement |
Success Criteria for Phase 2: - Raft implementation passes all tests including unreliable network + crashes + concurrent clients - Can explain FLP impossibility without notes and why Raft circumvents it (partial synchrony assumption) - Lab 4 (sharded KV) passes all tests - Have read all 8 papers in the consensus/replication list
Phase 3: Production Systems (Months 9–18)
Month 9–11: Apache Kafka Internals
Paper: Jay Kreps, Neha Narkhede, Jun Rao. "Kafka: A Distributed Messaging System for Log Processing." NetDB Workshop at VLDB, 2011.
Source Code Exploration Path:
| Component | Where in Source | Key Concept |
|---|---|---|
| Log segment | core/src/main/scala/kafka/log/ |
Append-only segments, index file, zero-copy sendfile |
| Producer | clients/src/main/java/org/apache/kafka/clients/producer/ |
Batching, linger.ms, acks=all |
| Consumer | clients/src/main/java/org/apache/kafka/clients/consumer/ |
Offset management, partition assignment, consumer groups |
| Replication | core/src/main/scala/kafka/server/ReplicaManager.scala |
ISR (in-sync replicas), high watermark, leader epoch |
| Controller | core/src/main/scala/kafka/controller/KafkaController.scala |
ZooKeeper → KRaft migration |
Lab Exercise: Set up a 3-broker Kafka cluster, produce 10 million messages, measure throughput. Then kill the leader broker mid-produce and verify zero message loss with acks=all. Observe leader election time.
Month 12–14: Apache Cassandra Internals
Paper: Avinash Lakshman, Prashant Malik. "Cassandra: A Decentralized Structured Storage System." ACM SIGOPS Operating Systems Review 44(2), 2010, pp. 35–40.
Key Design Decisions to Understand:
| Decision | Trade-off | Where in DDIA |
|---|---|---|
| Leaderless replication | Higher availability, tunable consistency | Chapter 5 |
| Consistent hashing + vnodes | Even data distribution, easy scaling | Chapter 6 |
| LSM tree storage (SSTable) | Fast writes, read amplification | Chapter 3 |
| Tunable consistency (ONE/QUORUM/ALL) | CAP trade-off made explicit | Chapter 9 |
| Anti-entropy with Merkle trees | Background repair without coordinator | Dynamo paper |
Lab Exercise: Deploy a 3-node Cassandra cluster. Simulate a network partition with iptables. Observe behavior at QUORUM vs. ONE consistency. Measure stale reads during partition.
Month 14–16: CockroachDB — Distributed SQL
Resources: - "CockroachDB: The Resilient Geo-Distributed SQL Database" — SIGMOD 2020 - CockroachDB architecture documentation: cockroachlabs.com/docs/stable/architecture
Key Components:
| Layer | Technology | Key Insight |
|---|---|---|
| SQL | PostgreSQL wire protocol | Apps need zero changes |
| Transactions | MVCC + 2PC over Raft | Serializability by default |
| Replication | Raft per range (64 MB by default) | Thousands of Raft groups |
| Storage | Pebble (RocksDB fork in Go) | LSM with compaction tuning |
| Distribution | Range-based sharding, automatic split/merge | No manual shard management |
Month 16–18: Open-Source Contribution
Recommended Projects for First Contribution:
| Project | Difficulty | Language | Good First Issue Label |
|---|---|---|---|
| TiKV (Rust distributed KV) | Medium | Rust | good-first-issue |
| etcd | Medium | Go | help-wanted |
| Apache Kafka | Medium | Scala/Java | newbie |
| Riak KV | Medium | Erlang | — |
| CockroachDB | Hard | Go | good-first-issue |
Contribution Strategy: 1. Start with documentation or test improvements to learn contribution workflow 2. Fix a bug you encountered while doing the lab exercises above 3. Propose a small performance improvement after profiling (bring data, not opinions)
Success Criteria for Phase 3: - Can explain Kafka's ISR protocol and when a replica falls out of ISR - Can explain CockroachDB's transaction protocol (read timestamp, write timestamp, uncertainty window) - Have at least one open-source contribution merged (doc or code)
Phase 4: Advanced (Months 18–36)
Consensus at Scale
Papers:
| Paper | Full Citation | Key Contribution |
|---|---|---|
| Multi-Paxos | Lamport. "The Part-Time Parliament." ACM TOCS 16(2), 1998, pp. 133–169. | Original Paxos (submitted 1989, published 1998) — leases, reconfiguration |
| EPaxos | Iulian Moraru, David G. Andersen, Michael Kaminsky. "There is More Consensus in Egalitarian Parliaments." SOSP 2013, pp. 358–372. | Leaderless Paxos, dependency graphs, commutative commands |
| Heidi Howard (FPaxos) | Heidi Howard, Dahlia Malkhi, Alexander Spiegelman. "Flexible Paxos: Quorum Intersection Revisited." arXiv 2016. | Quorum intersection requirement weaker than previously believed |
Distributed Transactions
Papers:
| Paper | Full Citation | Key Contribution |
|---|---|---|
| Two-Phase Commit | Jim Gray. "Notes on Database Operating Systems." Lecture Notes in Computer Science 60, 1978. | Original 2PC protocol and blocking problem |
| Percolator | Daniel Peng, Frank Dabek. "Large-scale Incremental Processing Using Distributed Transactions and Notifications." OSDI 2010. | Optimistic concurrency over Bigtable |
| Calvin | Alexander Thomson et al. "Calvin: Fast Distributed Transactions for Partitioned Database Systems." SIGMOD 2012. | Deterministic ordering pre-consensus to avoid 2PC |
| MVTO | Bernstein & Goodman. "Concurrency Control in Distributed Database Systems." ACM Computing Surveys 13(2), 1981. | Multiversion timestamp ordering |
CRDTs and Causal Consistency
Papers:
| Paper | Full Citation | Key Contribution |
|---|---|---|
| CRDTs | Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski. "A Comprehensive Study of Convergent and Commutative Replicated Data Types." INRIA Technical Report RR-7506, 2011. | Formal taxonomy of CvRDTs and CmRDTs |
| Vector clocks | Colin Fidge. "Timestamps in Message-Passing Systems That Preserve the Partial Ordering." Australian Computer Science Communications 10(1), 1988. | Vector timestamps for causal order |
| COPS | Wyatt Lloyd et al. "Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS." SOSP 2011. | Causal+ consistency in geo-distributed setting |
| Bolt-on causality | Peter Bailis et al. "Bolt-on Causal Consistency." SIGMOD 2013. | Add causal consistency to existing eventually consistent store |
| RAMP | Peter Bailis et al. "Scalable Atomic Visibility with RAMP Transactions." SIGMOD 2014. | Read atomic isolation without coordination |
Advanced System Design Practice Problems
Work through these in pairs — design, then critique the other person's design:
| Problem | Key Tensions | Expected Throughput Target |
|---|---|---|
| Design a distributed rate limiter | Accuracy vs. coordination overhead | 10M requests/sec globally |
| Design a distributed lock service | Availability vs. safety under partition | Sub-millisecond lock acquire |
| Design a global sequence number generator | Monotonicity vs. availability | 1B sequences/day |
| Design a distributed cache with consistency guarantees | Read performance vs. write propagation | 1M reads/sec, <100µs p99 |
| Design a multi-master CRDT-based document editor | Conflict resolution vs. ordering guarantees | Offline-capable |
| Design a distributed cron scheduler | Exactly-once execution vs. availability | 100K jobs/day |
| Design cross-region database replication | RPO/RTO vs. consistency | RPO < 5 seconds, RTO < 30 seconds |
Success Criteria for Phase 4: - Can implement a G-Counter CRDT and prove convergence - Can explain the difference between causal consistency and linearizability with concrete examples - Can design a distributed system at the level of a senior systems design interview with rigorous trade-off analysis - Have read all papers in the advanced section
Appendix: Complete Paper Reading Order
For someone starting from zero, read papers in this sequence:
- Lamport Clocks (1978) — logical time
- FLP Impossibility (1985) — understand the fundamental limit
- Gray "Notes on Database Operating Systems" (1978) — transactions
- GFS (2003) — distributed storage in practice
- MapReduce (2004) — batch computation model
- Bigtable (2006) — wide-column storage
- Dynamo (2007) — availability-first design
- Paxos Made Simple (2001) — consensus
- Raft (2014) — consensus made understandable
- Spanner (2012) — global consistency in practice
- CRDT comprehensive study (2011) — convergent replication
- COPS (2011) — causal consistency at scale
Appendix: Recommended Courses
| Course | Institution | Format | Focus |
|---|---|---|---|
| 6.824 Distributed Systems | MIT | Labs + lectures (free) | Raft, MapReduce, Spanner |
| CS 525 Advanced Distributed Systems | UIUC | Lecture notes free | Paper-heavy, broad coverage |
| Distributed Systems by Martin Kleppmann | Cambridge | YouTube | DDIA author, excellent lectures |
| Cloud Computing Specialization | Coursera/UIUC | Structured | Gossip, P2P, Paxos basics |
Appendix: Implementation Projects Summary
| Project | Language | Estimated Time | Skills Developed |
|---|---|---|---|
| MapReduce | Go | 1 week | Coordinator-worker, fault tolerance |
| Raft from scratch | Go or Rust | 3–4 weeks | Core consensus algorithm |
| Fault-tolerant KV on Raft | Go | 2 weeks | State machine replication, client retry |
| Sharded KV with Raft | Go | 3 weeks | Dynamic resharding, cross-shard operations |
| Kafka clone (write-ahead log + replication) | Go/Java | 4–6 weeks | Log storage, ISR, consumer groups |
| Distributed database with 2PC | Go/Rust | 6–8 weeks | Distributed transactions, MVCC |
The foundation of expertise in distributed systems is not reading papers — it is implementing protocols and then breaking them deliberately. Every implementation project should include a fault injection framework that can kill nodes, drop packets, and delay messages at will.