Skip to content

Distributed Systems Engineer Learning Roadmap

A complete phase-by-phase roadmap for building deep expertise in distributed systems — from foundations through production systems to research-grade understanding. Each phase includes papers, implementation projects, and explicit advancement criteria.


Overview

Phase Months Focus Target Outcome
Foundations 0–3 Networking, DDIA, MIT 6.824 labs Can implement MapReduce and basic Raft
Core Algorithms 3–9 Consensus, replication, foundational papers Can implement Raft from scratch with correctness tests
Production Systems 9–18 Kafka, Cassandra, CockroachDB, open-source contribution Can operate and debug production distributed systems
Advanced 18–36 Transactions, CRDTs, causal consistency, research frontier Can design novel distributed protocols

Phase 1: Foundations (Months 0–3)

Primary Text

Designing Data-Intensive Applications (DDIA) — Martin Kleppmann - Publisher: O'Reilly, 2017 - ISBN: 978-1449373320 - This is the single most important book for the foundations phase. Read it cover to cover — it is the modern consensus starting point for the field.

Month 1: Networking Fundamentals and DDIA Part I

Networking Topics (self-study):

Topic Resource Key Concepts
TCP/IP internals Stevens "TCP/IP Illustrated Vol 1" (ISBN: 978-0321336316) Handshake, flow control, congestion control, TIME_WAIT
UDP and why it matters Same Unreliable delivery, DNS, QUIC motivation
TLS and certificate chains Cloudflare TLS guide (free online) Handshake, ALPN, session resumption
Linux network sockets man pages + beej.us/guide/bgnet select, epoll, non-blocking I/O

DDIA Reading Schedule: - Week 1: Chapters 1–3 (reliable/scalable/maintainable; data models; storage engines — B-trees vs. LSM trees) - Week 2: Chapter 4 (encoding: JSON, Protobuf, Avro — schema evolution matters enormously) - Week 3: Chapters 5–6 (replication; partitioning — understand replication lag, read-your-writes) - Week 4: Review + write 2-page summary of every chapter

Lab Exercise: Implement a simple TCP echo server in C using epoll. Measure maximum connections per second. Add connection pooling. This makes the networking primitives concrete before they appear in distributed protocols.

Month 2: MIT 6.824 Labs — MapReduce and Infrastructure

Course: MIT 6.824 Distributed Systems (labs publicly available at https://pdos.csail.mit.edu/6.824/)

Lab 1: MapReduce

Component What to Implement Key Design Decision
Coordinator Task assignment, worker crash detection Use heartbeat timeouts (10 seconds recommended by spec)
Worker Map execution, Reduce execution, intermediate file handling Write atomically: write to temp file, then rename
Fault tolerance Re-assign tasks from dead workers Workers are idempotent — safe to re-execute

Success Criteria for Lab 1: - All provided test cases pass including crash tests - Understand why map outputs are written to local disk (not shared storage)

DDIA Chapters 10–12: Batch processing, stream processing, future of data systems — read alongside Lab 1 to see MapReduce in context.

Month 3: MIT 6.824 Labs — Raft Part 1 and Fault-Tolerant KV

Lab 2: Raft Consensus (Part A — leader election)

Raft paper reading first (Ongaro & Ousterhout, 2014 — full citation in Phase 2 paper list).

Raft Component Key Invariant Common Bug
Leader election At most one leader per term Not resetting election timer on heartbeat
Log replication Leader log is authoritative Sending wrong prevLogIndex in AppendEntries
Safety Election restriction prevents log loss Accepting stale votes from smaller term
Persistence currentTerm, votedFor, log must persist Forgetting to persist before responding

Lab 3: Fault-Tolerant Key-Value Service

Build a KV store on top of your Raft implementation: - Client retries on timeout/leader change - Duplicate request detection (use client ID + sequence number) - Linearizability: every operation appears instantaneous

Success Criteria for Phase 1: - MIT 6.824 Labs 1, 2A, 2B, 3 all pass their test suites - Can explain the difference between linearizability, sequential consistency, and eventual consistency - Can explain why Raft needs leader leases for stale-read avoidance


Phase 2: Core Algorithms (Months 3–9)

Paper Reading List — Consensus and Replication

Read papers in this order. Each builds on the previous.

Paper Full Citation Key Contribution Difficulty Prereqs
Lamport Clocks Leslie Lamport. "Time, Clocks, and the Ordering of Events in a Distributed System." CACM 21(7), 1978, pp. 558–565. Happened-before relation, logical timestamps Medium None
FLP Impossibility Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson. "Impossibility of Distributed Consensus with One Faulty Process." JACM 32(2), 1985, pp. 374–382. No deterministic consensus in async model with one crash Hard Lamport Clocks
Paxos Made Simple Leslie Lamport. "Paxos Made Simple." ACM SIGACT News 32(4), 2001, pp. 51–58. Two-phase consensus protocol Hard FLP
Paxos Made Live Tushar Chandra, Robert Griesemer, Joshua Redstone. "Paxos Made Live: An Engineering Perspective." PODC 2007, pp. 398–407. What Multi-Paxos actually requires in practice Hard Paxos Made Simple
Raft Diego Ongaro, John Ousterhout. "In Search of an Understandable Consensus Algorithm." USENIX ATC 2014, pp. 305–320. Consensus designed for understandability Medium FLP, Paxos
ZooKeeper / ZAB Flavio Junqueira, Benjamin Reed, Marco Serafini. "Zab: High-performance broadcast for primary-backup systems." DSN 2011, pp. 245–256. Atomic broadcast vs. consensus distinction Medium Raft
Viewstamped Replication Barbara Liskov, James Cowling. "Viewstamped Replication Revisited." MIT Tech Report MIT-CSAIL-TR-2012-021, 2012. Alternative to Paxos, cleaner state machine framing Medium Raft
PBFT Miguel Castro, Barbara Liskov. "Practical Byzantine Fault Tolerance." OSDI 1999, pp. 173–186. Byzantine consensus with 3f+1 nodes Very Hard Paxos

Month 4–5: Implement Raft From Scratch

This is the most important hands-on exercise in the roadmap. Do not skip.

Specification Checklist (from the Raft paper, Figure 2):

State Persistent Notes
currentTerm Yes Update before any RPC response
votedFor Yes Null or candidateId
log[] Yes Each entry: term + command
commitIndex No Volatile — re-derived on restart
lastApplied No Volatile

Test Framework Requirements: - Network partition simulation (drop packets between arbitrary node sets) - Leader crash and restart - Concurrent client requests - Verify linearizability with porcupine (Go library) or equivalent

Implementation Language Options:

Language Pros Cons
Go Native goroutines, channels — natural fit for Raft timers and RPCs Garbage collection pauses can affect timing tests
Rust Memory safety, no GC More boilerplate, tokio async complexity
C++ Maximum control, close to paper pseudocode Manual memory management risk

Recommended: Go. MIT 6.824 labs are in Go and you have tests ready.

Month 5–6: Storage System Papers

Paper Full Citation Key Contribution Difficulty
GFS Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. "The Google File System." SOSP 2003, pp. 29–43. Large-scale distributed storage; relaxed consistency for throughput Medium
Dynamo Giuseppe DeCandia et al. "Dynamo: Amazon's Highly Available Key-Value Store." SOSP 2007, pp. 205–220. Eventual consistency, vector clocks, consistent hashing, sloppy quorum Medium
Bigtable Fay Chang et al. "Bigtable: A Distributed Storage System for Structured Data." OSDI 2006, pp. 205–218. Wide-column model, tablet server architecture, compaction Medium
Spanner James C. Corbett et al. "Spanner: Google's Globally Distributed Database." OSDI 2012, pp. 251–264. TrueTime for external consistency, distributed transactions globally Hard
CRDT Survey Marc Shapiro et al. "Conflict-Free Replicated Data Types." SSS 2011, pp. 386–400. Mathematical framework for convergent replicated data Hard

Month 7–9: MIT 6.824 Lab 4 — Sharded KV

Lab 4: Sharded key-value service with dynamic resharding

Component Challenge
Shard controller Raft-replicated configuration service
Key-value servers One Raft group per shard
Reconfiguration Move shards between groups atomically
Linearizability Maintained across shard movement

Success Criteria for Phase 2: - Raft implementation passes all tests including unreliable network + crashes + concurrent clients - Can explain FLP impossibility without notes and why Raft circumvents it (partial synchrony assumption) - Lab 4 (sharded KV) passes all tests - Have read all 8 papers in the consensus/replication list


Phase 3: Production Systems (Months 9–18)

Month 9–11: Apache Kafka Internals

Paper: Jay Kreps, Neha Narkhede, Jun Rao. "Kafka: A Distributed Messaging System for Log Processing." NetDB Workshop at VLDB, 2011.

Source Code Exploration Path:

Component Where in Source Key Concept
Log segment core/src/main/scala/kafka/log/ Append-only segments, index file, zero-copy sendfile
Producer clients/src/main/java/org/apache/kafka/clients/producer/ Batching, linger.ms, acks=all
Consumer clients/src/main/java/org/apache/kafka/clients/consumer/ Offset management, partition assignment, consumer groups
Replication core/src/main/scala/kafka/server/ReplicaManager.scala ISR (in-sync replicas), high watermark, leader epoch
Controller core/src/main/scala/kafka/controller/KafkaController.scala ZooKeeper → KRaft migration

Lab Exercise: Set up a 3-broker Kafka cluster, produce 10 million messages, measure throughput. Then kill the leader broker mid-produce and verify zero message loss with acks=all. Observe leader election time.

Month 12–14: Apache Cassandra Internals

Paper: Avinash Lakshman, Prashant Malik. "Cassandra: A Decentralized Structured Storage System." ACM SIGOPS Operating Systems Review 44(2), 2010, pp. 35–40.

Key Design Decisions to Understand:

Decision Trade-off Where in DDIA
Leaderless replication Higher availability, tunable consistency Chapter 5
Consistent hashing + vnodes Even data distribution, easy scaling Chapter 6
LSM tree storage (SSTable) Fast writes, read amplification Chapter 3
Tunable consistency (ONE/QUORUM/ALL) CAP trade-off made explicit Chapter 9
Anti-entropy with Merkle trees Background repair without coordinator Dynamo paper

Lab Exercise: Deploy a 3-node Cassandra cluster. Simulate a network partition with iptables. Observe behavior at QUORUM vs. ONE consistency. Measure stale reads during partition.

Month 14–16: CockroachDB — Distributed SQL

Resources: - "CockroachDB: The Resilient Geo-Distributed SQL Database" — SIGMOD 2020 - CockroachDB architecture documentation: cockroachlabs.com/docs/stable/architecture

Key Components:

Layer Technology Key Insight
SQL PostgreSQL wire protocol Apps need zero changes
Transactions MVCC + 2PC over Raft Serializability by default
Replication Raft per range (64 MB by default) Thousands of Raft groups
Storage Pebble (RocksDB fork in Go) LSM with compaction tuning
Distribution Range-based sharding, automatic split/merge No manual shard management

Month 16–18: Open-Source Contribution

Recommended Projects for First Contribution:

Project Difficulty Language Good First Issue Label
TiKV (Rust distributed KV) Medium Rust good-first-issue
etcd Medium Go help-wanted
Apache Kafka Medium Scala/Java newbie
Riak KV Medium Erlang
CockroachDB Hard Go good-first-issue

Contribution Strategy: 1. Start with documentation or test improvements to learn contribution workflow 2. Fix a bug you encountered while doing the lab exercises above 3. Propose a small performance improvement after profiling (bring data, not opinions)

Success Criteria for Phase 3: - Can explain Kafka's ISR protocol and when a replica falls out of ISR - Can explain CockroachDB's transaction protocol (read timestamp, write timestamp, uncertainty window) - Have at least one open-source contribution merged (doc or code)


Phase 4: Advanced (Months 18–36)

Consensus at Scale

Papers:

Paper Full Citation Key Contribution
Multi-Paxos Lamport. "The Part-Time Parliament." ACM TOCS 16(2), 1998, pp. 133–169. Original Paxos (submitted 1989, published 1998) — leases, reconfiguration
EPaxos Iulian Moraru, David G. Andersen, Michael Kaminsky. "There is More Consensus in Egalitarian Parliaments." SOSP 2013, pp. 358–372. Leaderless Paxos, dependency graphs, commutative commands
Heidi Howard (FPaxos) Heidi Howard, Dahlia Malkhi, Alexander Spiegelman. "Flexible Paxos: Quorum Intersection Revisited." arXiv 2016. Quorum intersection requirement weaker than previously believed

Distributed Transactions

Papers:

Paper Full Citation Key Contribution
Two-Phase Commit Jim Gray. "Notes on Database Operating Systems." Lecture Notes in Computer Science 60, 1978. Original 2PC protocol and blocking problem
Percolator Daniel Peng, Frank Dabek. "Large-scale Incremental Processing Using Distributed Transactions and Notifications." OSDI 2010. Optimistic concurrency over Bigtable
Calvin Alexander Thomson et al. "Calvin: Fast Distributed Transactions for Partitioned Database Systems." SIGMOD 2012. Deterministic ordering pre-consensus to avoid 2PC
MVTO Bernstein & Goodman. "Concurrency Control in Distributed Database Systems." ACM Computing Surveys 13(2), 1981. Multiversion timestamp ordering

CRDTs and Causal Consistency

Papers:

Paper Full Citation Key Contribution
CRDTs Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski. "A Comprehensive Study of Convergent and Commutative Replicated Data Types." INRIA Technical Report RR-7506, 2011. Formal taxonomy of CvRDTs and CmRDTs
Vector clocks Colin Fidge. "Timestamps in Message-Passing Systems That Preserve the Partial Ordering." Australian Computer Science Communications 10(1), 1988. Vector timestamps for causal order
COPS Wyatt Lloyd et al. "Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS." SOSP 2011. Causal+ consistency in geo-distributed setting
Bolt-on causality Peter Bailis et al. "Bolt-on Causal Consistency." SIGMOD 2013. Add causal consistency to existing eventually consistent store
RAMP Peter Bailis et al. "Scalable Atomic Visibility with RAMP Transactions." SIGMOD 2014. Read atomic isolation without coordination

Advanced System Design Practice Problems

Work through these in pairs — design, then critique the other person's design:

Problem Key Tensions Expected Throughput Target
Design a distributed rate limiter Accuracy vs. coordination overhead 10M requests/sec globally
Design a distributed lock service Availability vs. safety under partition Sub-millisecond lock acquire
Design a global sequence number generator Monotonicity vs. availability 1B sequences/day
Design a distributed cache with consistency guarantees Read performance vs. write propagation 1M reads/sec, <100µs p99
Design a multi-master CRDT-based document editor Conflict resolution vs. ordering guarantees Offline-capable
Design a distributed cron scheduler Exactly-once execution vs. availability 100K jobs/day
Design cross-region database replication RPO/RTO vs. consistency RPO < 5 seconds, RTO < 30 seconds

Success Criteria for Phase 4: - Can implement a G-Counter CRDT and prove convergence - Can explain the difference between causal consistency and linearizability with concrete examples - Can design a distributed system at the level of a senior systems design interview with rigorous trade-off analysis - Have read all papers in the advanced section


Appendix: Complete Paper Reading Order

For someone starting from zero, read papers in this sequence:

  1. Lamport Clocks (1978) — logical time
  2. FLP Impossibility (1985) — understand the fundamental limit
  3. Gray "Notes on Database Operating Systems" (1978) — transactions
  4. GFS (2003) — distributed storage in practice
  5. MapReduce (2004) — batch computation model
  6. Bigtable (2006) — wide-column storage
  7. Dynamo (2007) — availability-first design
  8. Paxos Made Simple (2001) — consensus
  9. Raft (2014) — consensus made understandable
  10. Spanner (2012) — global consistency in practice
  11. CRDT comprehensive study (2011) — convergent replication
  12. COPS (2011) — causal consistency at scale

Course Institution Format Focus
6.824 Distributed Systems MIT Labs + lectures (free) Raft, MapReduce, Spanner
CS 525 Advanced Distributed Systems UIUC Lecture notes free Paper-heavy, broad coverage
Distributed Systems by Martin Kleppmann Cambridge YouTube DDIA author, excellent lectures
Cloud Computing Specialization Coursera/UIUC Structured Gossip, P2P, Paxos basics

Appendix: Implementation Projects Summary

Project Language Estimated Time Skills Developed
MapReduce Go 1 week Coordinator-worker, fault tolerance
Raft from scratch Go or Rust 3–4 weeks Core consensus algorithm
Fault-tolerant KV on Raft Go 2 weeks State machine replication, client retry
Sharded KV with Raft Go 3 weeks Dynamic resharding, cross-shard operations
Kafka clone (write-ahead log + replication) Go/Java 4–6 weeks Log storage, ISR, consumer groups
Distributed database with 2PC Go/Rust 6–8 weeks Distributed transactions, MVCC

The foundation of expertise in distributed systems is not reading papers — it is implementing protocols and then breaking them deliberately. Every implementation project should include a fault injection framework that can kill nodes, drop packets, and delay messages at will.