Distributed Systems Projects
Overview
These three projects escalate in complexity and teach the practical mechanics that separate theoretical knowledge from engineering fluency. Project 1 builds the consensus foundation. Project 2 layers a key-value store on top of it. Project 3 tackles a real-world distributed subsystem (a log-structured message queue) with its own distinct consistency and performance challenges. Each project is designed to take 3–6 weeks for an engineer with Go familiarity; the test suites are as important as the production code.
The MIT 6.824 (Distributed Systems) lab series covers similar ground. These projects are standalone alternatives you own and can extend freely — not autograder submissions.
Project 1 — Raft Consensus from Scratch
Goal
Implement the Raft consensus algorithm in Go. By the end, your cluster handles arbitrary leader failures, network partitions, and log divergence while maintaining linearizability.
Architecture
┌────────────────────────────────────────────────────────┐
│ Raft Cluster │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Node 0 │ │ Node 1 │ │ Node 2 │ │
│ │ (Leader) │──▶│(Follower)│ │(Follower)│ │
│ └────┬─────┘ └──────────┘ └──────────┘ │
│ │ AppendEntries RPC (log replication) │
│ │ RequestVote RPC (leader election) │
└───────┴────────────────────────────────────────────────┘
Each node is a goroutine (or OS thread) communicating with others over a simulated or real RPC layer. For testing, use an in-process simulated network that can drop, delay, and partition messages.
API
type Raft struct {
mu sync.Mutex
peers []*labrpc.ClientEnd // or net/rpc.Client
me int // this node's index
currentTerm int
votedFor int // -1 if none
log []LogEntry
commitIndex int
lastApplied int
// Volatile leader state (reinitialized after election)
nextIndex []int
matchIndex []int
state NodeState // Follower | Candidate | Leader
applyCh chan ApplyMsg
}
type LogEntry struct {
Term int
Command interface{}
}
// Start proposes a command to the Raft cluster.
// Returns (log index, term, isLeader).
// If not leader, returns false immediately.
func (rf *Raft) Start(command interface{}) (int, int, bool)
// GetState returns current term and whether this node believes it is the leader.
func (rf *Raft) GetState() (int, bool)
Implementation Checklist
Leader Election
- [ ] Initialize all nodes as Followers with
currentTerm = 0. - [ ] Start an election timer with a randomized timeout (150–300 ms is the Raft paper's recommendation; use 300–600 ms in a test environment to reduce spurious elections under load).
- [ ] On timer expiry: increment
currentTerm, transition to Candidate, vote for self, sendRequestVoteRPCs to all peers in parallel goroutines. - [ ] If quorum (majority) votes received before timeout: transition to Leader.
- [ ] On receiving
RequestVotewith higher term: revert to Follower, grant vote if log is at least as up-to-date (last log term/index comparison per §5.4.1 of the paper). - [ ] Leader sends heartbeat
AppendEntrieswith empty entries every 100 ms to prevent followers from timing out.
Log Replication
- [ ]
Start(): leader appends entry to local log, sendsAppendEntriesto all followers. - [ ]
AppendEntries: follower checksprevLogIndex/prevLogTerm; if consistent, appends entries; updatescommitIndexifleaderCommit > commitIndex. - [ ] Leader advances
commitIndexwhen a majority hasmatchIndex[i] >= Nfor some log index N in the current term. - [ ] Apply committed entries to
applyChin order.
Snapshots
- [ ] Add
InstallSnapshotRPC for when a follower's log falls too far behind. - [ ] Persist
currentTerm,votedFor, andlog(and snapshot) to stable storage after every mutation — simulate with aPersisterstruct that writes to a byte slice (swap for real disk in production).
Test Scenarios
go test -run TestBasicAgree # 3-node, normal operation
go test -run TestRPCBytes # verify no excessive data transfer
go test -run TestFollowerFailure # kill a follower, writes still commit
go test -run TestLeaderFailure # kill leader, new leader elected
go test -run TestRejoin # killed node rejoins, catches up
go test -run TestPartition # network partition, both halves try to make progress
go test -run TestFigure8 # the Figure 8 liveness scenario from the paper
go test -run TestUnreliableAgree # random dropped/delayed RPCs
go test -run TestLinearizability # use Porcupine linearizability checker
Linearizability Verification
Use the Porcupine Go library to verify that the operation history produced by your cluster is linearizable. Record every Start() call and every applied ApplyMsg, then pass the history to porcupine.CheckOperations. A non-linearizable history indicates a correctness bug.
Performance Targets
- Leader election after leader failure: < 500 ms (configurable timeout dependent).
- Throughput with a 3-node cluster on localhost: > 5,000 ops/sec.
- Recovery after a follower rejoins a partition: full log replication within 1 second.
Evaluation Rubric
| Criterion | Points |
|---|---|
| TestBasicAgree passes | 15 |
| TestLeaderFailure passes | 20 |
| TestPartition passes | 25 |
| TestLinearizability passes | 25 |
| TestUnreliableAgree passes (random drops) | 15 |
Project 2 — Distributed Key-Value Store on Raft
Goal
Build a linearizable key-value store with Get, Put, and Delete operations, backed by the Raft implementation from Project 1.
Architecture
Client ──▶ KVServer (leader) ──▶ Raft.Start(Op)
│
Raft commits log entry
│
KVServer.applier() reads ApplyCh
│
Apply to in-memory map, notify client RPC
Each KVServer wraps a Raft instance. Multiple KVServer instances form a replicated group. Clients discover the current leader via retries (any server that is not the leader returns ErrWrongLeader).
API
type KVServer struct {
mu sync.Mutex
rf *raft.Raft
applyCh chan raft.ApplyMsg
store map[string]string
// Track pending client RPCs: log index → response channel
pending map[int]chan OpResult
// Deduplication: last applied serial number per client
lastSeq map[int64]int64
}
// Get returns the current value for key (empty string if absent).
// Linearizable: reflects all Puts that completed before this Get starts.
func (kv *KVServer) Get(args *GetArgs, reply *GetReply) error
// Put sets key = value unconditionally.
func (kv *KVServer) Put(args *PutArgs, reply *PutReply) error
// Delete removes the key.
func (kv *KVServer) Delete(args *DeleteArgs, reply *DeleteReply) error
Linearizable Reads
The naive approach (propose a no-op to Raft for every read) doubles latency. Two optimizations:
- Read index: The leader records the current
commitIndexas the read index, sends heartbeats to confirm it is still the leader, then serves the read onceappliedIndex >= readIndex. No log write required. - Lease-based reads: The leader tracks the time of its last successful heartbeat quorum. If the current time is within the lease window (election timeout minus clock skew margin), it can serve reads directly without any network round trip. This is what etcd uses by default.
Implement the read index approach first; add lease-based reads as an extension.
Client Deduplication
Without deduplication, a client that retries a Put after a timeout may apply it twice. Each client request carries a (clientID, sequenceNumber) pair. The server tracks the last applied sequence number per client; if seq <= lastSeq[clientID], the server returns the cached result without re-applying.
Comparison to etcd
After your implementation works, run equivalent workloads against etcd and compare:
# etcd benchmark
etcdctl put key val
etcdctl get key
# Benchmark with etcd's built-in tool:
benchmark --endpoints=localhost:2379 put --key-size=8 --val-size=256 \
--total=100000 --conns=100 --clients=100
Document the differences: etcd uses a more sophisticated read lease implementation, supports watches, has a gRPC API, and is battle-hardened at Google/Kubernetes scale. Your implementation is ~1000 lines of Go and teaches you exactly how etcd works.
Test Scenarios
go test -run TestGet # basic reads
go test -run TestPut # basic writes
go test -run TestPutGet # put then get on same key
go test -run TestConcurrent # concurrent clients
go test -run TestOnePartition # minority partition cannot make progress
go test -run TestManyPartitionsOneClient
go test -run TestDedup # client retry doesn't double-apply
go test -run TestLinearizability # Porcupine check on full operation history
Evaluation Rubric
| Criterion | Points |
|---|---|
| Single-server correctness | 10 |
| Multi-server linearizability | 30 |
| Correct deduplication | 20 |
| Read index implementation | 20 |
| Partition tolerance (minority can't write) | 20 |
Project 3 — Simple Message Queue (Kafka-Inspired)
Goal
Build a persistent, partitioned message queue where producers write messages, consumers read them in order, and leadership for each partition is managed via a coordination service (your Raft KV store or a mock).
Architecture
Producer ──▶ Broker (Partition Leader) ──▶ Write-Ahead Log (append-only file)
│
Consumer Group ◀── Fetch(topic, partition, offset) ◀┘
Leader Election for partition ownership managed via Raft KV store
(key = "partition:<topic>:<id>/leader", value = broker ID)
Write-Ahead Log (WAL)
The core storage primitive is an append-only log file per partition. Each record is a fixed-size header + variable-length payload:
// (Go equivalent)
type Record struct {
Offset uint64 // monotonically increasing, never reused
Timestamp int64 // Unix nanoseconds
KeySize uint32
ValueSize uint32
Key []byte
Value []byte
CRC32 uint32 // covers all preceding fields
}
The WAL file is the partition's source of truth. It is never modified in place; records are appended. An index file (separate from the WAL) maps offsets to byte positions in the WAL for O(1) seeks.
type WAL struct {
dir string
dataFile *os.File // append-only
indexFile *os.File // offset → file position, written in 12-byte entries
nextOffset uint64
mu sync.Mutex
}
func (w *WAL) Append(key, value []byte) (offset uint64, err error)
func (w *WAL) ReadFrom(startOffset uint64, maxBytes int) ([]Record, error)
Consumer Groups and Offset Tracking
Kafka's key innovation is that offset tracking lives on the consumer, not the broker. This allows each consumer group to read the same partition independently.
Persist committed offsets in the Raft KV store (or a simple local file for the basic version):
key: "cg:<group>/topic:<topic>/partition:<id>"
value: "<committed_offset>"
A consumer reads a batch, processes it, and then commits the offset only after successful processing. If the consumer crashes before committing, it re-reads from the last committed offset, guaranteeing at-least-once delivery.
Leader Election for Partition Ownership
Use the Raft KV store from Project 2 as the coordination service. A broker claims a partition by performing a conditional put:
// Attempt to acquire lease
ok := kvClient.PutIfAbsent(
"partition:events:0/leader",
brokerID,
expirySeconds(30),
)
if ok {
// This broker is now the leader for partition events:0
go renewLease(...)
}
If the leader broker dies, its lease expires and another broker claims ownership. This is analogous to Kafka using ZooKeeper (historically) or KRaft (since 3.x) for controller election.
Benchmark Targets
- Write throughput: > 100,000 messages/sec (1 KB messages, single partition, local disk).
- End-to-end latency (producer write → consumer read): < 10 ms at p99.
- Recovery after leader failure: consumers experience < 5 seconds of unavailability.
Measure write throughput:
start := time.Now()
for i := 0; i < 1_000_000; i++ {
wal.Append(nil, make([]byte, 1024))
}
elapsed := time.Since(start)
fmt.Printf("%.0f msgs/sec\n", 1_000_000/elapsed.Seconds())
The bottleneck is usually fsync — measure with and without O_SYNC and with fsync() called every N records (configurable durability vs throughput tradeoff).
Test Scenarios
# WAL correctness
TestWALAppendRead -- write 1000 records, read back, verify CRC
TestWALRecovery -- truncate WAL mid-record, verify recovery loads partial records correctly
# Consumer group
TestConsumerOffset -- consume 100 msgs, commit, restart, verify re-reads from committed offset
TestAtLeastOnce -- kill consumer after read, before commit; verify replay
# Partition leadership
TestLeaderFailover -- kill partition leader, verify new leader elected, writes resume
TestSplitBrain -- partition network so two brokers think they are leader; verify WAL divergence is detected
Evaluation Rubric
| Criterion | Points |
|---|---|
| WAL append/read correctness | 20 |
| WAL crash recovery (CRC validation) | 15 |
| Consumer group offset tracking | 20 |
| At-least-once delivery guarantee | 15 |
| Partition leader election | 20 |
| Benchmark within 20% of throughput target | 10 |
Cross-Project Notes
Testing Philosophy
Each project's test suite is at least as important as the production code. The defining characteristic of distributed systems bugs is that they appear only under specific timing conditions (message delays, leader failures mid-operation, clock skew). Invest in:
- Fault injection: your simulated network must be able to drop specific messages, delay messages by a configurable duration, and partition arbitrary subsets of nodes.
- Deterministic replay: seed your random number generators so a failing test is reproducible.
- Linearizability checking: use Porcupine for Projects 1 and 2. A test that passes without linearizability checking may be masking real bugs.
Language Choice
Go is recommended because its goroutine model maps cleanly to the node-per-goroutine simulation pattern, its sync package provides production-quality primitives, and the MIT 6.824 test harness (a useful reference) is in Go. Rust is a valid alternative and teaches borrow-checker discipline alongside distributed systems. Avoid Python for these projects: the GIL makes true parallelism difficult in the same process, and Python's performance characteristics hide important bottlenecks.
Reference Material
- Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" (ATC 2014) — the Raft paper; read the extended version (not just the conference paper) for snapshot and membership change details.
- MIT 6.824 course notes and labs (ocw.mit.edu) — your projects parallel Labs 2A/2B/2C/2D (Raft) and Lab 3 (KV store).
- Kafka design documentation (kafka.apache.org/documentation) — covers the WAL, consumer group protocol, and leader election in detail.
- "Designing Data-Intensive Applications" Chapter 5 (replication) and Chapter 9 (consistency and consensus) — theoretical grounding for all three projects.