Distributed Systems Engineer Learning Roadmap

A complete phase-by-phase roadmap for building deep expertise in distributed systems — from foundations through production systems to research-grade understanding. Each phase includes papers, implementation projects, and explicit advancement criteria.

Overview

Phase	Months	Focus	Target Outcome
Foundations	0–3	Networking, DDIA, MIT 6.824 labs	Can implement MapReduce and basic Raft
Core Algorithms	3–9	Consensus, replication, foundational papers	Can implement Raft from scratch with correctness tests
Production Systems	9–18	Kafka, Cassandra, CockroachDB, open-source contribution	Can operate and debug production distributed systems
Advanced	18–36	Transactions, CRDTs, causal consistency, research frontier	Can design novel distributed protocols

Phase 1: Foundations (Months 0–3)

Primary Text

Designing Data-Intensive Applications (DDIA) — Martin Kleppmann - Publisher: O'Reilly, 2017 - ISBN: 978-1449373320 - This is the single most important book for the foundations phase. Read it cover to cover — it is the modern consensus starting point for the field.

Month 1: Networking Fundamentals and DDIA Part I

Networking Topics (self-study):

Topic	Resource	Key Concepts
TCP/IP internals	Stevens "TCP/IP Illustrated Vol 1" (ISBN: 978-0321336316)	Handshake, flow control, congestion control, TIME_WAIT
UDP and why it matters	Same	Unreliable delivery, DNS, QUIC motivation
TLS and certificate chains	Cloudflare TLS guide (free online)	Handshake, ALPN, session resumption
Linux network sockets	man pages + `beej.us/guide/bgnet`	`select`, `epoll`, non-blocking I/O

DDIA Reading Schedule: - Week 1: Chapters 1–3 (reliable/scalable/maintainable; data models; storage engines — B-trees vs. LSM trees) - Week 2: Chapter 4 (encoding: JSON, Protobuf, Avro — schema evolution matters enormously) - Week 3: Chapters 5–6 (replication; partitioning — understand replication lag, read-your-writes) - Week 4: Review + write 2-page summary of every chapter

Lab Exercise: Implement a simple TCP echo server in C using epoll. Measure maximum connections per second. Add connection pooling. This makes the networking primitives concrete before they appear in distributed protocols.

Month 2: MIT 6.824 Labs — MapReduce and Infrastructure

Course: MIT 6.824 Distributed Systems (labs publicly available at https://pdos.csail.mit.edu/6.824/)

Lab 1: MapReduce

Component	What to Implement	Key Design Decision
Coordinator	Task assignment, worker crash detection	Use heartbeat timeouts (10 seconds recommended by spec)
Worker	Map execution, Reduce execution, intermediate file handling	Write atomically: write to temp file, then rename
Fault tolerance	Re-assign tasks from dead workers	Workers are idempotent — safe to re-execute

Success Criteria for Lab 1: - All provided test cases pass including crash tests - Understand why map outputs are written to local disk (not shared storage)

DDIA Chapters 10–12: Batch processing, stream processing, future of data systems — read alongside Lab 1 to see MapReduce in context.

Month 3: MIT 6.824 Labs — Raft Part 1 and Fault-Tolerant KV

Lab 2: Raft Consensus (Part A — leader election)

Raft paper reading first (Ongaro & Ousterhout, 2014 — full citation in Phase 2 paper list).

Raft Component	Key Invariant	Common Bug
Leader election	At most one leader per term	Not resetting election timer on heartbeat
Log replication	Leader log is authoritative	Sending wrong prevLogIndex in AppendEntries
Safety	Election restriction prevents log loss	Accepting stale votes from smaller term
Persistence	currentTerm, votedFor, log must persist	Forgetting to persist before responding

Lab 3: Fault-Tolerant Key-Value Service

Build a KV store on top of your Raft implementation: - Client retries on timeout/leader change - Duplicate request detection (use client ID + sequence number) - Linearizability: every operation appears instantaneous

Success Criteria for Phase 1: - MIT 6.824 Labs 1, 2A, 2B, 3 all pass their test suites - Can explain the difference between linearizability, sequential consistency, and eventual consistency - Can explain why Raft needs leader leases for stale-read avoidance

Phase 2: Core Algorithms (Months 3–9)

Paper Reading List — Consensus and Replication

Read papers in this order. Each builds on the previous.

Paper	Full Citation	Key Contribution	Difficulty	Prereqs
Lamport Clocks	Leslie Lamport. "Time, Clocks, and the Ordering of Events in a Distributed System." CACM 21(7), 1978, pp. 558–565.	Happened-before relation, logical timestamps	Medium	None
FLP Impossibility	Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson. "Impossibility of Distributed Consensus with One Faulty Process." JACM 32(2), 1985, pp. 374–382.	No deterministic consensus in async model with one crash	Hard	Lamport Clocks
Paxos Made Simple	Leslie Lamport. "Paxos Made Simple." ACM SIGACT News 32(4), 2001, pp. 51–58.	Two-phase consensus protocol	Hard	FLP
Paxos Made Live	Tushar Chandra, Robert Griesemer, Joshua Redstone. "Paxos Made Live: An Engineering Perspective." PODC 2007, pp. 398–407.	What Multi-Paxos actually requires in practice	Hard	Paxos Made Simple
Raft	Diego Ongaro, John Ousterhout. "In Search of an Understandable Consensus Algorithm." USENIX ATC 2014, pp. 305–320.	Consensus designed for understandability	Medium	FLP, Paxos
ZooKeeper / ZAB	Flavio Junqueira, Benjamin Reed, Marco Serafini. "Zab: High-performance broadcast for primary-backup systems." DSN 2011, pp. 245–256.	Atomic broadcast vs. consensus distinction	Medium	Raft
Viewstamped Replication	Barbara Liskov, James Cowling. "Viewstamped Replication Revisited." MIT Tech Report MIT-CSAIL-TR-2012-021, 2012.	Alternative to Paxos, cleaner state machine framing	Medium	Raft
PBFT	Miguel Castro, Barbara Liskov. "Practical Byzantine Fault Tolerance." OSDI 1999, pp. 173–186.	Byzantine consensus with 3f+1 nodes	Very Hard	Paxos

Month 4–5: Implement Raft From Scratch

This is the most important hands-on exercise in the roadmap. Do not skip.

Specification Checklist (from the Raft paper, Figure 2):

State	Persistent	Notes
`currentTerm`	Yes	Update before any RPC response
`votedFor`	Yes	Null or candidateId
`log[]`	Yes	Each entry: term + command
`commitIndex`	No	Volatile — re-derived on restart
`lastApplied`	No	Volatile

Test Framework Requirements: - Network partition simulation (drop packets between arbitrary node sets) - Leader crash and restart - Concurrent client requests - Verify linearizability with porcupine (Go library) or equivalent

Implementation Language Options:

Language	Pros	Cons
Go	Native goroutines, channels — natural fit for Raft timers and RPCs	Garbage collection pauses can affect timing tests
Rust	Memory safety, no GC	More boilerplate, tokio async complexity
C++	Maximum control, close to paper pseudocode	Manual memory management risk

Recommended: Go. MIT 6.824 labs are in Go and you have tests ready.

Month 5–6: Storage System Papers

Paper	Full Citation	Key Contribution	Difficulty
GFS	Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. "The Google File System." SOSP 2003, pp. 29–43.	Large-scale distributed storage; relaxed consistency for throughput	Medium
Dynamo	Giuseppe DeCandia et al. "Dynamo: Amazon's Highly Available Key-Value Store." SOSP 2007, pp. 205–220.	Eventual consistency, vector clocks, consistent hashing, sloppy quorum	Medium
Bigtable	Fay Chang et al. "Bigtable: A Distributed Storage System for Structured Data." OSDI 2006, pp. 205–218.	Wide-column model, tablet server architecture, compaction	Medium
Spanner	James C. Corbett et al. "Spanner: Google's Globally Distributed Database." OSDI 2012, pp. 251–264.	TrueTime for external consistency, distributed transactions globally	Hard
CRDT Survey	Marc Shapiro et al. "Conflict-Free Replicated Data Types." SSS 2011, pp. 386–400.	Mathematical framework for convergent replicated data	Hard

Month 7–9: MIT 6.824 Lab 4 — Sharded KV

Lab 4: Sharded key-value service with dynamic resharding

Component	Challenge
Shard controller	Raft-replicated configuration service
Key-value servers	One Raft group per shard
Reconfiguration	Move shards between groups atomically
Linearizability	Maintained across shard movement

Success Criteria for Phase 2: - Raft implementation passes all tests including unreliable network + crashes + concurrent clients - Can explain FLP impossibility without notes and why Raft circumvents it (partial synchrony assumption) - Lab 4 (sharded KV) passes all tests - Have read all 8 papers in the consensus/replication list

Phase 3: Production Systems (Months 9–18)

Month 9–11: Apache Kafka Internals

Paper: Jay Kreps, Neha Narkhede, Jun Rao. "Kafka: A Distributed Messaging System for Log Processing." NetDB Workshop at VLDB, 2011.

Source Code Exploration Path:

Component	Where in Source	Key Concept
Log segment	`core/src/main/scala/kafka/log/`	Append-only segments, index file, zero-copy sendfile
Producer	`clients/src/main/java/org/apache/kafka/clients/producer/`	Batching, linger.ms, acks=all
Consumer	`clients/src/main/java/org/apache/kafka/clients/consumer/`	Offset management, partition assignment, consumer groups
Replication	`core/src/main/scala/kafka/server/ReplicaManager.scala`	ISR (in-sync replicas), high watermark, leader epoch
Controller	`core/src/main/scala/kafka/controller/KafkaController.scala`	ZooKeeper → KRaft migration

Lab Exercise: Set up a 3-broker Kafka cluster, produce 10 million messages, measure throughput. Then kill the leader broker mid-produce and verify zero message loss with acks=all. Observe leader election time.

Month 12–14: Apache Cassandra Internals

Paper: Avinash Lakshman, Prashant Malik. "Cassandra: A Decentralized Structured Storage System." ACM SIGOPS Operating Systems Review 44(2), 2010, pp. 35–40.

Key Design Decisions to Understand:

Decision	Trade-off	Where in DDIA
Leaderless replication	Higher availability, tunable consistency	Chapter 5
Consistent hashing + vnodes	Even data distribution, easy scaling	Chapter 6
LSM tree storage (SSTable)	Fast writes, read amplification	Chapter 3
Tunable consistency (ONE/QUORUM/ALL)	CAP trade-off made explicit	Chapter 9
Anti-entropy with Merkle trees	Background repair without coordinator	Dynamo paper

Lab Exercise: Deploy a 3-node Cassandra cluster. Simulate a network partition with iptables. Observe behavior at QUORUM vs. ONE consistency. Measure stale reads during partition.

Month 14–16: CockroachDB — Distributed SQL

Resources: - "CockroachDB: The Resilient Geo-Distributed SQL Database" — SIGMOD 2020 - CockroachDB architecture documentation: cockroachlabs.com/docs/stable/architecture

Key Components:

Layer	Technology	Key Insight
SQL	PostgreSQL wire protocol	Apps need zero changes
Transactions	MVCC + 2PC over Raft	Serializability by default
Replication	Raft per range (64 MB by default)	Thousands of Raft groups
Storage	Pebble (RocksDB fork in Go)	LSM with compaction tuning
Distribution	Range-based sharding, automatic split/merge	No manual shard management

Month 16–18: Open-Source Contribution

Recommended Projects for First Contribution:

Project	Difficulty	Language	Good First Issue Label
TiKV (Rust distributed KV)	Medium	Rust	`good-first-issue`
etcd	Medium	Go	`help-wanted`
Apache Kafka	Medium	Scala/Java	`newbie`
Riak KV	Medium	Erlang	—
CockroachDB	Hard	Go	`good-first-issue`

Contribution Strategy: 1. Start with documentation or test improvements to learn contribution workflow 2. Fix a bug you encountered while doing the lab exercises above 3. Propose a small performance improvement after profiling (bring data, not opinions)

Success Criteria for Phase 3: - Can explain Kafka's ISR protocol and when a replica falls out of ISR - Can explain CockroachDB's transaction protocol (read timestamp, write timestamp, uncertainty window) - Have at least one open-source contribution merged (doc or code)

Phase 4: Advanced (Months 18–36)

Consensus at Scale

Papers:

Paper	Full Citation	Key Contribution
Multi-Paxos	Lamport. "The Part-Time Parliament." ACM TOCS 16(2), 1998, pp. 133–169.	Original Paxos (submitted 1989, published 1998) — leases, reconfiguration
EPaxos	Iulian Moraru, David G. Andersen, Michael Kaminsky. "There is More Consensus in Egalitarian Parliaments." SOSP 2013, pp. 358–372.	Leaderless Paxos, dependency graphs, commutative commands
Heidi Howard (FPaxos)	Heidi Howard, Dahlia Malkhi, Alexander Spiegelman. "Flexible Paxos: Quorum Intersection Revisited." arXiv 2016.	Quorum intersection requirement weaker than previously believed

Distributed Transactions

Papers:

Paper	Full Citation	Key Contribution
Two-Phase Commit	Jim Gray. "Notes on Database Operating Systems." Lecture Notes in Computer Science 60, 1978.	Original 2PC protocol and blocking problem
Percolator	Daniel Peng, Frank Dabek. "Large-scale Incremental Processing Using Distributed Transactions and Notifications." OSDI 2010.	Optimistic concurrency over Bigtable
Calvin	Alexander Thomson et al. "Calvin: Fast Distributed Transactions for Partitioned Database Systems." SIGMOD 2012.	Deterministic ordering pre-consensus to avoid 2PC
MVTO	Bernstein & Goodman. "Concurrency Control in Distributed Database Systems." ACM Computing Surveys 13(2), 1981.	Multiversion timestamp ordering

CRDTs and Causal Consistency

Papers:

Paper	Full Citation	Key Contribution
CRDTs	Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski. "A Comprehensive Study of Convergent and Commutative Replicated Data Types." INRIA Technical Report RR-7506, 2011.	Formal taxonomy of CvRDTs and CmRDTs
Vector clocks	Colin Fidge. "Timestamps in Message-Passing Systems That Preserve the Partial Ordering." Australian Computer Science Communications 10(1), 1988.	Vector timestamps for causal order
COPS	Wyatt Lloyd et al. "Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS." SOSP 2011.	Causal+ consistency in geo-distributed setting
Bolt-on causality	Peter Bailis et al. "Bolt-on Causal Consistency." SIGMOD 2013.	Add causal consistency to existing eventually consistent store
RAMP	Peter Bailis et al. "Scalable Atomic Visibility with RAMP Transactions." SIGMOD 2014.	Read atomic isolation without coordination

Advanced System Design Practice Problems

Work through these in pairs — design, then critique the other person's design:

Problem	Key Tensions	Expected Throughput Target
Design a distributed rate limiter	Accuracy vs. coordination overhead	10M requests/sec globally
Design a distributed lock service	Availability vs. safety under partition	Sub-millisecond lock acquire
Design a global sequence number generator	Monotonicity vs. availability	1B sequences/day
Design a distributed cache with consistency guarantees	Read performance vs. write propagation	1M reads/sec, <100µs p99
Design a multi-master CRDT-based document editor	Conflict resolution vs. ordering guarantees	Offline-capable
Design a distributed cron scheduler	Exactly-once execution vs. availability	100K jobs/day
Design cross-region database replication	RPO/RTO vs. consistency	RPO < 5 seconds, RTO < 30 seconds

Success Criteria for Phase 4: - Can implement a G-Counter CRDT and prove convergence - Can explain the difference between causal consistency and linearizability with concrete examples - Can design a distributed system at the level of a senior systems design interview with rigorous trade-off analysis - Have read all papers in the advanced section

Appendix: Complete Paper Reading Order

For someone starting from zero, read papers in this sequence:

Lamport Clocks (1978) — logical time
FLP Impossibility (1985) — understand the fundamental limit
Gray "Notes on Database Operating Systems" (1978) — transactions
GFS (2003) — distributed storage in practice
MapReduce (2004) — batch computation model
Bigtable (2006) — wide-column storage
Dynamo (2007) — availability-first design
Paxos Made Simple (2001) — consensus
Raft (2014) — consensus made understandable
Spanner (2012) — global consistency in practice
CRDT comprehensive study (2011) — convergent replication
COPS (2011) — causal consistency at scale

Appendix: Recommended Courses

Course	Institution	Format	Focus
6.824 Distributed Systems	MIT	Labs + lectures (free)	Raft, MapReduce, Spanner
CS 525 Advanced Distributed Systems	UIUC	Lecture notes free	Paper-heavy, broad coverage
Distributed Systems by Martin Kleppmann	Cambridge	YouTube	DDIA author, excellent lectures
Cloud Computing Specialization	Coursera/UIUC	Structured	Gossip, P2P, Paxos basics

Appendix: Implementation Projects Summary

Project	Language	Estimated Time	Skills Developed
MapReduce	Go	1 week	Coordinator-worker, fault tolerance
Raft from scratch	Go or Rust	3–4 weeks	Core consensus algorithm
Fault-tolerant KV on Raft	Go	2 weeks	State machine replication, client retry
Sharded KV with Raft	Go	3 weeks	Dynamic resharding, cross-shard operations
Kafka clone (write-ahead log + replication)	Go/Java	4–6 weeks	Log storage, ISR, consumer groups
Distributed database with 2PC	Go/Rust	6–8 weeks	Distributed transactions, MVCC

The foundation of expertise in distributed systems is not reading papers — it is implementing protocols and then breaking them deliberately. Every implementation project should include a fault injection framework that can kill nodes, drop packets, and delay messages at will.