Section 38: System Design

Purpose and Scope

System design is the discipline of architecting large-scale distributed systems that meet functional requirements while satisfying non-functional constraints: availability, scalability, consistency, latency, throughput, and cost. Unlike algorithm design — where a provably correct answer exists — system design involves navigating fundamental tradeoffs, making explicit decisions about consistency models, failure domains, and operational complexity. This section covers the methodology for approaching system design problems, the canonical patterns and their tradeoffs, and worked examples of major system categories. It serves both as an interview preparation resource and as a practical reference for engineers designing production systems.

Prerequisites

Distributed systems fundamentals (Section 17): CAP theorem, consistency models, consensus
Database internals (Section 18): storage engines, replication, transactions
Networking and protocols (Section 15, 16): TCP, HTTP, load balancing
Operating system and Linux fundamentals (Section 03)
Basic understanding of cloud infrastructure primitives (Section 21)

Learning Objectives

By the end of this section, you will be able to:

Apply a structured methodology to decompose any system design problem
Select the appropriate consistency model (strong, eventual, causal) for a given use case
Design a scalable caching strategy with appropriate eviction, invalidation, and freshness policies
Design a horizontally scalable database layer using replication and sharding
Explain consistent hashing and design a routing layer using it
Apply the circuit breaker and retry patterns to build fault-tolerant service architectures
Design major systems from scratch: URL shortener, distributed cache, search engine, social feed
Reason about system tradeoffs quantitatively using back-of-envelope estimation

Architecture Overview

System Design Methodology

  Step 1: Requirements Clarification (5 min)
  +-----------------------------------------+
  | Functional: What does the system do?     |
  |   Users, operations, data flows          |
  | Non-functional: Scale, latency, avail.   |
  |   QPS, DAU, SLA, data volume, geo        |
  | Out of scope: explicitly state           |
  +-----------------------------------------+

  Step 2: Capacity Estimation (5 min)
  +-----------------------------------------+
  | Traffic: 100M DAU * 10 req/day = ~12K QPS|
  | Storage: 1M posts/day * 1KB = 1 GB/day  |
  | Bandwidth: 12K QPS * 10KB = 120 MB/s    |
  +-----------------------------------------+

  Step 3: High-Level Design (15 min)
  +-----------------------------------------+
  | Draw components: clients, LB, services,  |
  | caches, databases, queues                |
  | Identify read vs write paths             |
  | Define API contracts                     |
  +-----------------------------------------+

  Step 4: Deep Dive (20 min)
  +-----------------------------------------+
  | Bottleneck identification & resolution   |
  | Data model design                        |
  | Consistency and availability tradeoffs   |
  | Failure modes and mitigations            |
  +-----------------------------------------+

  Step 5: Wrap Up
  +-----------------------------------------+
  | Operational concerns, monitoring         |
  | Future scaling considerations            |
  | Known limitations                        |
  +-----------------------------------------+

Scalability Patterns

  Vertical Scaling:         Horizontal Scaling:
  +----------+              +------+  +------+  +------+
  | Server   |  ->          | Srv1 |  | Srv2 |  | Srv3 |
  | (bigger) |              +--+---+  +--+---+  +--+---+
  +----------+                 |         |         |
  Simple, bounded              +----+----+----+    |
                                    |              |
                               Load Balancer       |
                             (L4 or L7)            |

  Database Scaling Patterns:

  Read Replica:                    Sharding:
  +----------+                     +-------+  +-------+
  | Primary  | -> replication ->   | Shard |  | Shard |
  | (writes) |    +----------+     | 0     |  | 1     |
  +----------+    | Replica  |     | uid%2 |  | uid%2 |
                  | (reads)  |     | ==0   |  | ==1   |
                  +----------+     +-------+  +-------+

  Caching Hierarchy:

  Client <-> CDN <-> Load Balancer <-> App Server <-> Cache <-> DB
                     (edge cache)      (in-process)   (Redis)

  Read-through:  cache miss -> load from DB -> populate cache -> return
  Write-through: write DB and cache simultaneously
  Write-behind:  write cache first, async flush to DB
  Cache-aside:   application manages cache explicitly

Consistent Hashing

  Hash ring (0 to 2^32 - 1):

                 0
             /       \
          node_A   node_D
         /                 \
     node_B               node_C
         \                 /
          ----- ring -----

  Key lookup: hash(key) -> find nearest node clockwise

  Adding a node: only keys between predecessor and new node move
  Removing a node: only keys on that node move to successor

  Virtual nodes: each physical server gets K virtual positions on ring
  -> better load distribution, especially with heterogeneous nodes

  Used by: Amazon DynamoDB, Apache Cassandra, Akamai CDN

Message Queue Architecture

  Producer ---> +----------+ ---> Consumer Group A
  (service)     |  Kafka   |      (real-time processing)
  (writes)      | Partition|
                | P0: msgs |
                | P1: msgs | ---> Consumer Group B
                | P2: msgs |      (analytics / batch)
                +----------+

  Key properties:
  - Decouples producers from consumers (async)
  - Durable: messages persisted to disk (Kafka: 7-day retention)
  - Replay: consumers can re-read from any offset
  - Fan-out: multiple consumer groups independently consume

  Pattern selection:
  Request-reply (synchronous): direct HTTP/gRPC
  Fire-and-forget (async, durable): Kafka, SQS
  Work queue (competing consumers): RabbitMQ, SQS FIFO
  Event streaming (ordered log): Kafka, Kinesis

  User A posts a tweet (100 followers)

  Fan-out on Write (push model):
  Post -> for each follower: write post_id to follower's feed cache
  Read feed: O(1) - just read pre-computed cache
  Write cost: O(followers) - expensive for celebrities (100M followers)

  Fan-out on Read (pull model):
  Post -> write once to user's post store
  Read feed: O(following) - query each followed user's recent posts, merge
  Write cost: O(1)
  Read cost: expensive for users following thousands

  Hybrid (Twitter/X approach):
  Celebrities (>X followers): fan-out on read for their posts
  Regular users: fan-out on write
  Read path: merge pre-computed fan-out cache + celebrity posts

  +----------+         +-----------+
  | Post     | ------> | Fan-out   | -----> Redis feed cache
  | Service  |         | Service   |        per-user sorted set
  +----------+         | (async,   |        (score = timestamp)
                       |  Kafka)   |
                       +-----------+

Key Concepts

CAP Theorem: In a distributed system, you can guarantee at most two of: Consistency, Availability, Partition Tolerance. Since partitions are unavoidable in real networks, the practical choice is CP (consistent but may reject requests) vs AP (always responds but may return stale data).
Consistent Hashing: A technique for distributing load across nodes such that only O(1/n) keys need to move when a node is added or removed. Essential for distributed caches and key-value stores with dynamic membership.
Circuit Breaker: Pattern that stops calling a failing service after a threshold of failures, returning an error immediately. After a timeout, allows a probe request to test recovery. Prevents cascade failures.
Rate Limiting: Controlling request rates per user/service. Token bucket (steady state with burst capacity) and leaky bucket (strict rate) algorithms. Implemented at API gateway or in dedicated rate-limiting service (Redis INCR + TTL).
CQRS (Command Query Responsibility Segregation): Separate read and write models. Write path updates the event log or primary DB; read path maintains denormalized projections optimized for queries. Enables independent scaling of read and write paths.
Saga Pattern: Manages distributed transactions across microservices using a sequence of local transactions with compensating transactions on failure. Either orchestrator-based or choreography-based (event-driven).
Service Discovery: Dynamic registration and lookup of service instances. Client-side (Consul + client library) or server-side (load balancer queries registry). Required for auto-scaled microservice environments.
Back-of-Envelope Estimation: Quantitative sizing of system requirements using approximations. Key numbers: HDD 100 MB/s, SSD 500 MB/s-7 GB/s, RAM 50 GB/s, network 10 Gbps; L1 ~1ns, RAM ~100ns, SSD ~100 microseconds, network RTT 10-100ms.

Major Historical Milestones

Year	Milestone
1997	Amazon begins decomposing from monolith; eventually services
1997	Consistent hashing paper — Karger et al. (Akamai/MIT)
2003	Google MapReduce paper — batch distributed computation
2006	Amazon Dynamo paper — AP key-value store, consistent hashing, vector clocks
2007	CAP theorem formally proven (Gilbert & Lynch, building on Brewer 2000)
2007	Twitter architecture — fail whale era; sharding, caching lessons
2008	Facebook Memcached at scale — cache-aside at billion-user scale
2009	Netflix begins cloud migration from on-premise; chaos engineering origin
2011	Netflix Hystrix — circuit breaker pattern popularized
2011	CQRS and Event Sourcing — Greg Young's formalization
2012	Lambda architecture — Marz's batch + speed + serving layers
2014	Microservices term coined — Fowler & Lewis article
2014	Kafka at LinkedIn — append-only log as distributed system primitive
2015	Kubernetes 1.0 — container orchestration standard
2016	Kappa architecture — stream-only, replacing lambda
2018	Stripe distributed systems reading list — formalization of field
2020	Service mesh (Istio, Linkerd) mainstreams for east-west traffic

Modern Relevance

System design is the interview format for senior and staff engineering positions at virtually all major technology companies. But beyond interviews, the patterns are the vocabulary of how large systems are built and discussed. Every engineering design document at Google, Amazon, Meta, and Stripe invokes these concepts. The shift to cloud-native architectures (containers, Kubernetes, managed databases) has made horizontal scaling the default, not an exception. Rate limiting, circuit breaking, and consistent hashing are now implemented by platform teams and consumed via libraries and service meshes — but the engineers operating them must understand what these systems are doing to debug them effectively.

File Map

38-system-design/
├── 00-overview.md              <- This file
├── 01-methodology.md
├── 02-scalability-patterns.md
├── 03-load-balancing.md
├── 04-caching-strategies.md
├── 05-cdn.md
├── 06-database-scaling.md
├── 07-message-queues.md
├── 08-event-driven-architecture.md
├── 09-microservices-vs-monolith.md
├── 10-api-gateway-patterns.md
├── 11-rate-limiting.md
├── 12-circuit-breaking.md
├── 13-service-discovery.md
├── 14-consistent-hashing.md
├── 15-cap-theorem-applied.md
├── 16-design-url-shortener.md
├── 17-design-distributed-cache.md
├── 18-design-search-engine.md
└── 19-design-social-feed.md

Cross-References

Section 17 (Distributed Systems): Consensus (Raft, Paxos), consistency models, clock synchronization
Section 18 (Database Internals): B-trees, LSM trees, MVCC, replication logs — why database scaling behaves as it does
Section 21 (Cloud Infrastructure): AWS/GCP/Azure primitives that implement these patterns
Section 22 (Kubernetes): Container orchestration as infrastructure for scalable service deployment
Section 23 (Observability): Metrics, tracing, logging for production system health assessment
Section 39 (Large-Scale Case Studies): Real implementations of these patterns at Google, Meta, Amazon