Skip to content

Distributed Systems Projects

Overview

These three projects escalate in complexity and teach the practical mechanics that separate theoretical knowledge from engineering fluency. Project 1 builds the consensus foundation. Project 2 layers a key-value store on top of it. Project 3 tackles a real-world distributed subsystem (a log-structured message queue) with its own distinct consistency and performance challenges. Each project is designed to take 3–6 weeks for an engineer with Go familiarity; the test suites are as important as the production code.

The MIT 6.824 (Distributed Systems) lab series covers similar ground. These projects are standalone alternatives you own and can extend freely — not autograder submissions.


Project 1 — Raft Consensus from Scratch

Goal

Implement the Raft consensus algorithm in Go. By the end, your cluster handles arbitrary leader failures, network partitions, and log divergence while maintaining linearizability.

Architecture

┌────────────────────────────────────────────────────────┐
│                      Raft Cluster                       │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐           │
│  │  Node 0  │   │  Node 1  │   │  Node 2  │           │
│  │ (Leader) │──▶│(Follower)│   │(Follower)│           │
│  └────┬─────┘   └──────────┘   └──────────┘           │
│       │ AppendEntries RPC (log replication)             │
│       │ RequestVote RPC (leader election)               │
└───────┴────────────────────────────────────────────────┘

Each node is a goroutine (or OS thread) communicating with others over a simulated or real RPC layer. For testing, use an in-process simulated network that can drop, delay, and partition messages.

API

type Raft struct {
    mu          sync.Mutex
    peers       []*labrpc.ClientEnd  // or net/rpc.Client
    me          int                   // this node's index
    currentTerm int
    votedFor    int                   // -1 if none
    log         []LogEntry
    commitIndex int
    lastApplied int
    // Volatile leader state (reinitialized after election)
    nextIndex   []int
    matchIndex  []int
    state       NodeState // Follower | Candidate | Leader
    applyCh     chan ApplyMsg
}

type LogEntry struct {
    Term    int
    Command interface{}
}

// Start proposes a command to the Raft cluster.
// Returns (log index, term, isLeader).
// If not leader, returns false immediately.
func (rf *Raft) Start(command interface{}) (int, int, bool)

// GetState returns current term and whether this node believes it is the leader.
func (rf *Raft) GetState() (int, bool)

Implementation Checklist

Leader Election

  • [ ] Initialize all nodes as Followers with currentTerm = 0.
  • [ ] Start an election timer with a randomized timeout (150–300 ms is the Raft paper's recommendation; use 300–600 ms in a test environment to reduce spurious elections under load).
  • [ ] On timer expiry: increment currentTerm, transition to Candidate, vote for self, send RequestVote RPCs to all peers in parallel goroutines.
  • [ ] If quorum (majority) votes received before timeout: transition to Leader.
  • [ ] On receiving RequestVote with higher term: revert to Follower, grant vote if log is at least as up-to-date (last log term/index comparison per §5.4.1 of the paper).
  • [ ] Leader sends heartbeat AppendEntries with empty entries every 100 ms to prevent followers from timing out.

Log Replication

  • [ ] Start(): leader appends entry to local log, sends AppendEntries to all followers.
  • [ ] AppendEntries: follower checks prevLogIndex/prevLogTerm; if consistent, appends entries; updates commitIndex if leaderCommit > commitIndex.
  • [ ] Leader advances commitIndex when a majority has matchIndex[i] >= N for some log index N in the current term.
  • [ ] Apply committed entries to applyCh in order.

Snapshots

  • [ ] Add InstallSnapshot RPC for when a follower's log falls too far behind.
  • [ ] Persist currentTerm, votedFor, and log (and snapshot) to stable storage after every mutation — simulate with a Persister struct that writes to a byte slice (swap for real disk in production).

Test Scenarios

go test -run TestBasicAgree         # 3-node, normal operation
go test -run TestRPCBytes           # verify no excessive data transfer
go test -run TestFollowerFailure    # kill a follower, writes still commit
go test -run TestLeaderFailure      # kill leader, new leader elected
go test -run TestRejoin             # killed node rejoins, catches up
go test -run TestPartition          # network partition, both halves try to make progress
go test -run TestFigure8            # the Figure 8 liveness scenario from the paper
go test -run TestUnreliableAgree    # random dropped/delayed RPCs
go test -run TestLinearizability    # use Porcupine linearizability checker

Linearizability Verification

Use the Porcupine Go library to verify that the operation history produced by your cluster is linearizable. Record every Start() call and every applied ApplyMsg, then pass the history to porcupine.CheckOperations. A non-linearizable history indicates a correctness bug.

Performance Targets

  • Leader election after leader failure: < 500 ms (configurable timeout dependent).
  • Throughput with a 3-node cluster on localhost: > 5,000 ops/sec.
  • Recovery after a follower rejoins a partition: full log replication within 1 second.

Evaluation Rubric

Criterion Points
TestBasicAgree passes 15
TestLeaderFailure passes 20
TestPartition passes 25
TestLinearizability passes 25
TestUnreliableAgree passes (random drops) 15

Project 2 — Distributed Key-Value Store on Raft

Goal

Build a linearizable key-value store with Get, Put, and Delete operations, backed by the Raft implementation from Project 1.

Architecture

Client ──▶ KVServer (leader) ──▶ Raft.Start(Op)
                                       │
                              Raft commits log entry
                                       │
                              KVServer.applier() reads ApplyCh
                                       │
                              Apply to in-memory map, notify client RPC

Each KVServer wraps a Raft instance. Multiple KVServer instances form a replicated group. Clients discover the current leader via retries (any server that is not the leader returns ErrWrongLeader).

API

type KVServer struct {
    mu      sync.Mutex
    rf      *raft.Raft
    applyCh chan raft.ApplyMsg
    store   map[string]string
    // Track pending client RPCs: log index → response channel
    pending map[int]chan OpResult
    // Deduplication: last applied serial number per client
    lastSeq map[int64]int64
}

// Get returns the current value for key (empty string if absent).
// Linearizable: reflects all Puts that completed before this Get starts.
func (kv *KVServer) Get(args *GetArgs, reply *GetReply) error

// Put sets key = value unconditionally.
func (kv *KVServer) Put(args *PutArgs, reply *PutReply) error

// Delete removes the key.
func (kv *KVServer) Delete(args *DeleteArgs, reply *DeleteReply) error

Linearizable Reads

The naive approach (propose a no-op to Raft for every read) doubles latency. Two optimizations:

  1. Read index: The leader records the current commitIndex as the read index, sends heartbeats to confirm it is still the leader, then serves the read once appliedIndex >= readIndex. No log write required.
  2. Lease-based reads: The leader tracks the time of its last successful heartbeat quorum. If the current time is within the lease window (election timeout minus clock skew margin), it can serve reads directly without any network round trip. This is what etcd uses by default.

Implement the read index approach first; add lease-based reads as an extension.

Client Deduplication

Without deduplication, a client that retries a Put after a timeout may apply it twice. Each client request carries a (clientID, sequenceNumber) pair. The server tracks the last applied sequence number per client; if seq <= lastSeq[clientID], the server returns the cached result without re-applying.

Comparison to etcd

After your implementation works, run equivalent workloads against etcd and compare:

# etcd benchmark
etcdctl put key val
etcdctl get key
# Benchmark with etcd's built-in tool:
benchmark --endpoints=localhost:2379 put --key-size=8 --val-size=256 \
          --total=100000 --conns=100 --clients=100

Document the differences: etcd uses a more sophisticated read lease implementation, supports watches, has a gRPC API, and is battle-hardened at Google/Kubernetes scale. Your implementation is ~1000 lines of Go and teaches you exactly how etcd works.

Test Scenarios

go test -run TestGet              # basic reads
go test -run TestPut              # basic writes  
go test -run TestPutGet           # put then get on same key
go test -run TestConcurrent       # concurrent clients
go test -run TestOnePartition     # minority partition cannot make progress
go test -run TestManyPartitionsOneClient
go test -run TestDedup            # client retry doesn't double-apply
go test -run TestLinearizability  # Porcupine check on full operation history

Evaluation Rubric

Criterion Points
Single-server correctness 10
Multi-server linearizability 30
Correct deduplication 20
Read index implementation 20
Partition tolerance (minority can't write) 20

Project 3 — Simple Message Queue (Kafka-Inspired)

Goal

Build a persistent, partitioned message queue where producers write messages, consumers read them in order, and leadership for each partition is managed via a coordination service (your Raft KV store or a mock).

Architecture

Producer ──▶ Broker (Partition Leader) ──▶ Write-Ahead Log (append-only file)
                                                    │
Consumer Group ◀── Fetch(topic, partition, offset) ◀┘

Leader Election for partition ownership managed via Raft KV store
  (key = "partition:<topic>:<id>/leader", value = broker ID)

Write-Ahead Log (WAL)

The core storage primitive is an append-only log file per partition. Each record is a fixed-size header + variable-length payload:

// (Go equivalent)
type Record struct {
    Offset    uint64   // monotonically increasing, never reused
    Timestamp int64    // Unix nanoseconds
    KeySize   uint32
    ValueSize uint32
    Key       []byte
    Value     []byte
    CRC32     uint32   // covers all preceding fields
}

The WAL file is the partition's source of truth. It is never modified in place; records are appended. An index file (separate from the WAL) maps offsets to byte positions in the WAL for O(1) seeks.

type WAL struct {
    dir       string
    dataFile  *os.File   // append-only
    indexFile *os.File   // offset → file position, written in 12-byte entries
    nextOffset uint64
    mu         sync.Mutex
}

func (w *WAL) Append(key, value []byte) (offset uint64, err error)
func (w *WAL) ReadFrom(startOffset uint64, maxBytes int) ([]Record, error)

Consumer Groups and Offset Tracking

Kafka's key innovation is that offset tracking lives on the consumer, not the broker. This allows each consumer group to read the same partition independently.

Persist committed offsets in the Raft KV store (or a simple local file for the basic version):

key: "cg:<group>/topic:<topic>/partition:<id>"
value: "<committed_offset>"

A consumer reads a batch, processes it, and then commits the offset only after successful processing. If the consumer crashes before committing, it re-reads from the last committed offset, guaranteeing at-least-once delivery.

Leader Election for Partition Ownership

Use the Raft KV store from Project 2 as the coordination service. A broker claims a partition by performing a conditional put:

// Attempt to acquire lease
ok := kvClient.PutIfAbsent(
    "partition:events:0/leader",
    brokerID,
    expirySeconds(30),
)
if ok {
    // This broker is now the leader for partition events:0
    go renewLease(...)
}

If the leader broker dies, its lease expires and another broker claims ownership. This is analogous to Kafka using ZooKeeper (historically) or KRaft (since 3.x) for controller election.

Benchmark Targets

  • Write throughput: > 100,000 messages/sec (1 KB messages, single partition, local disk).
  • End-to-end latency (producer write → consumer read): < 10 ms at p99.
  • Recovery after leader failure: consumers experience < 5 seconds of unavailability.

Measure write throughput:

start := time.Now()
for i := 0; i < 1_000_000; i++ {
    wal.Append(nil, make([]byte, 1024))
}
elapsed := time.Since(start)
fmt.Printf("%.0f msgs/sec\n", 1_000_000/elapsed.Seconds())

The bottleneck is usually fsync — measure with and without O_SYNC and with fsync() called every N records (configurable durability vs throughput tradeoff).

Test Scenarios

# WAL correctness
TestWALAppendRead        -- write 1000 records, read back, verify CRC
TestWALRecovery          -- truncate WAL mid-record, verify recovery loads partial records correctly

# Consumer group
TestConsumerOffset       -- consume 100 msgs, commit, restart, verify re-reads from committed offset
TestAtLeastOnce          -- kill consumer after read, before commit; verify replay

# Partition leadership
TestLeaderFailover       -- kill partition leader, verify new leader elected, writes resume
TestSplitBrain           -- partition network so two brokers think they are leader; verify WAL divergence is detected

Evaluation Rubric

Criterion Points
WAL append/read correctness 20
WAL crash recovery (CRC validation) 15
Consumer group offset tracking 20
At-least-once delivery guarantee 15
Partition leader election 20
Benchmark within 20% of throughput target 10

Cross-Project Notes

Testing Philosophy

Each project's test suite is at least as important as the production code. The defining characteristic of distributed systems bugs is that they appear only under specific timing conditions (message delays, leader failures mid-operation, clock skew). Invest in:

  • Fault injection: your simulated network must be able to drop specific messages, delay messages by a configurable duration, and partition arbitrary subsets of nodes.
  • Deterministic replay: seed your random number generators so a failing test is reproducible.
  • Linearizability checking: use Porcupine for Projects 1 and 2. A test that passes without linearizability checking may be masking real bugs.

Language Choice

Go is recommended because its goroutine model maps cleanly to the node-per-goroutine simulation pattern, its sync package provides production-quality primitives, and the MIT 6.824 test harness (a useful reference) is in Go. Rust is a valid alternative and teaches borrow-checker discipline alongside distributed systems. Avoid Python for these projects: the GIL makes true parallelism difficult in the same process, and Python's performance characteristics hide important bottlenecks.

Reference Material

  • Ongaro & Ousterhout, "In Search of an Understandable Consensus Algorithm" (ATC 2014) — the Raft paper; read the extended version (not just the conference paper) for snapshot and membership change details.
  • MIT 6.824 course notes and labs (ocw.mit.edu) — your projects parallel Labs 2A/2B/2C/2D (Raft) and Lab 3 (KV store).
  • Kafka design documentation (kafka.apache.org/documentation) — covers the WAL, consumer group protocol, and leader election in detail.
  • "Designing Data-Intensive Applications" Chapter 5 (replication) and Chapter 9 (consistency and consensus) — theoretical grounding for all three projects.