Skip to content

01 — Google Infrastructure

Overview

Google's internal infrastructure is the most studied and influential set of large-scale systems in computing history. The papers Google published between 2003 and 2010 — GFS, Bigtable, MapReduce, Chubby, Dapper, Spanner — defined the field of distributed systems and became the direct ancestors of the entire modern data infrastructure industry. HDFS, HBase, Kafka, Cassandra, Spark, and Kubernetes all trace their intellectual lineage to Google research. Understanding these systems as a coherent ecosystem rather than isolated papers reveals a consistent engineering philosophy: optimize for the failure case, not the success case; prefer simplicity at each layer and handle complexity at the layer above; build for 10× more traffic than today.


Historical Context

By 2002, Google was running thousands of commodity x86 machines in its own data centers, with no off-the-shelf software that could scale to their workloads. Oracle RAC was too expensive and not designed for hundreds of servers. MPI (Message Passing Interface) was designed for supercomputer clusters, not unreliable commodity hardware. Google had to build everything.

The resulting systems were designed for one key constraint that commercial infrastructure vendors had not prioritized: hardware failure is the normal case, not the exceptional case. At Google's scale, multiple disks, multiple servers, and multiple network links fail every day. Systems must be correct and performant in the presence of continuous partial failure.


Google Infrastructure Dependency Diagram

┌────────────────────────────────────────────────────────────────────────┐
│                     USER-FACING SERVICES                               │
│         Search, Gmail, YouTube, Maps, Drive, Ads                       │
└───────────────────────────┬────────────────────────────────────────────┘
                            │
┌───────────────────────────▼────────────────────────────────────────────┐
│                     SERVING INFRASTRUCTURE                             │
│       Bigtable (structured serving)   Colossus (file serving)         │
│       Spanner (global SQL)                                             │
└───────────────┬────────────────────────────────────┬───────────────────┘
                │                                    │
┌───────────────▼──────────────┐   ┌────────────────▼──────────────────┐
│    PROCESSING LAYER          │   │    MONITORING / TRACING           │
│  MapReduce / Flume / Dremel  │   │    Dapper (traces)                │
│  Dataflow (Apache Beam)      │   │    Monarch (metrics)              │
└───────────────┬──────────────┘   └───────────────────────────────────┘
                │
┌───────────────▼──────────────────────────────────────────────────────┐
│                 RESOURCE MANAGEMENT                                   │
│                     BORG / Kubernetes                                 │
└───────────────────────────┬──────────────────────────────────────────┘
                            │
┌───────────────────────────▼──────────────────────────────────────────┐
│               DISTRIBUTED COORDINATION                               │
│                     Chubby (distributed lock)                        │
└──────────────────────────────────────────────────────────────────────┘

Borg: Google's Cluster OS

Borg (2003, paper published 2015) is Google's internal cluster management system — the direct predecessor of Kubernetes. At peak scale, Borg managed hundreds of thousands of machines and millions of jobs.

Core Concepts

Job: A Borg job is a named collection of tasks — roughly equivalent to a Kubernetes Deployment. Jobs have a priority (production > batch > best-effort), resource requests (CPU in millicores, RAM in bytes), and constraints (run on machines with SSD, run in specific cells).

Task: A single process (or container) within a job. Tasks are assigned to machines by the scheduler. A task may be killed and re-scheduled on a different machine at any time.

Alloc set: A reserved set of resources for a group of tasks that run together. Allocs enable containers that must co-locate (e.g., a web server + its log collection sidecar) — this became Kubernetes Pods.

Cell: A Borg cell is a cluster of ~10,000 machines. Each cell has a BorgMaster (the control plane, replicated 5× with Paxos) and a Borglet agent on every machine (reports health, executes task assignments).

Borg Architecture:
┌─────────────────────────────────────────────────────────────┐
│  Borg Cell                                                  │
│                                                             │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  BorgMaster (× 5, Paxos)                           │   │
│  │  scheduler thread + replicated state                │   │
│  └───────────────────┬─────────────────────────────────┘   │
│                      │ TaskAssignment RPCs                  │
│     ┌────────────────┼──────────────┐                       │
│     │                │              │                       │
│  ┌──▼──┐          ┌──▼──┐       ┌──▼──┐                    │
│  │Borglet│        │Borglet│     │Borglet│  (× 10,000)       │
│  │ m001 │        │ m002 │       │ m003 │                    │
│  │tasks │        │tasks │       │tasks │                    │
│  └──────┘        └──────┘       └──────┘                    │
└─────────────────────────────────────────────────────────────┘

Scheduler Design

Borg's scheduler is a two-phase approach: 1. Feasibility checking: Find all machines that meet the task's constraints. 2. Scoring: Rank feasible machines by a combination of: minimizing wasted resources (bin packing), minimizing task preemptions, preferring machines that already have the required packages cached, and spreading tasks across failure domains.

Borg uses optimistic concurrency control for scheduling: multiple scheduler threads propose assignments, and conflicts are resolved by re-running the scoring. This allows parallel scheduling without a central lock.

Borg vs Kubernetes

Kubernetes (2014) was built by Google engineers who worked on Borg, and it consciously fixed several Borg design decisions: - Borg task IDs were opaque integers; Kubernetes uses named resources with rich metadata. - Borg had no equivalent of Kubernetes Services (load balancing was external). - Borg's alloc sets inspired Pods, but Pods are first-class in Kubernetes. - Kubernetes uses a declarative reconciliation model more uniformly than Borg.


Spanner: Globally Distributed SQL

Spanner (2012 paper) is Google's globally distributed relational database. It provides: - Full SQL (including cross-shard JOINs). - External consistency (serializable, globally ordered transactions). - Horizontal sharding across thousands of servers. - Multi-region replication (configurable per table: from 2 regions to 30+).

TrueTime API

The fundamental challenge of distributed transactions is establishing a global ordering of events across machines with unsynchronized clocks. Spanner solves this with TrueTime — a hardware + software API that provides a bounded estimate of the current absolute time:

TrueTime.now() returns:
  TTInterval {
    earliest: time_t  // guaranteed lower bound on actual time
    latest:   time_t  // guaranteed upper bound on actual time
  }

Typical uncertainty (epsilon): 1–7 milliseconds

TrueTime uses two sources: GPS receivers and atomic clocks in each data center. The uncertainty (epsilon = latest - earliest) is kept tight (~1ms typical) by cross-validating multiple time masters.

External Consistency via Commit-Wait

Spanner uses TrueTime to guarantee external consistency: if transaction T1 commits before transaction T2 begins (in wall-clock time), then T1's commit timestamp is less than T2's.

The protocol: 1. The commit coordinator picks a timestamp sTT.now().latest. 2. It waits until TT.now().earliest > s (commit wait — typically 1–7ms). 3. Then commits with timestamp s.

The wait ensures that s is in the past (below all future TT.now().earliest values), making s a valid globally ordered timestamp. This wait is the source of Spanner's ~10ms write latency floor.

Paxos Groups and 2PC

Each Spanner shard (called a "directory" — a range of keys) is replicated via a Paxos group spanning multiple data centers. Within a shard, a leader performs writes; replicas serve reads.

Cross-shard transactions use two-phase commit (2PC) coordinated by a Paxos leader. To avoid the availability problem of 2PC (coordinator failure → blocked participants), Spanner runs 2PC with Paxos-replicated coordinators — the coordinator is a group, not a single machine.


Colossus: Distributed Filesystem

Colossus is the successor to GFS (Google File System, 2003 paper). GFS introduced the single-master + chunkserver design; Colossus redesigns the metadata layer to eliminate the single-master bottleneck.

GFS limitations: - Single metadata master: 64MB namespace RAM limit, single point of failure, hot spot. - 64MB chunk size: Poor for small files. - Strong consistency only for single-writer workloads.

Colossus improvements: - Distributed metadata via a curators layer (multiple metadata servers with range partitioning). - SSD-backed metadata for faster directory operations. - Reed-Solomon erasure coding for warm/cold storage (instead of 3× replication of everything). - Cell-based architecture: each Colossus cell serves a data center region.


Bigtable: Wide-Column Store

Bigtable (2006 paper) is a distributed storage system for structured data. It is a wide-column store: rows are identified by a row key, columns are grouped into column families, and each cell has a timestamp for multi-version access.

Bigtable Data Model:
Row key     │ family:qualifier │ timestamp │ value
────────────┼──────────────────┼───────────┼──────────
com.cnn/    │ anchor:cnnsi.com │ t3        │ "CNN"
com.cnn/    │ anchor:my.look.ca│ t5        │ "CNN.com"
com.cnn/    │ contents:html    │ t6        │ "<html>..."

Internals

Tablet: A contiguous range of row keys is served by a tablet server. Tablets are ~200MB; when a tablet grows too large it is split.

SSTable: The on-disk format. An SSTable is an immutable sorted key-value file with a block index. Multiple SSTables are merged by compaction.

Memtable: Writes go to an in-memory memtable (and the mutation log on GFS). When the memtable fills, it is frozen and written as a new SSTable (minor compaction). Background major compactions merge SSTables to keep read performance bounded.

Chubby: Bigtable uses Chubby (Google's distributed lock service, based on Paxos) for: master election, storing the root tablet location, and service discovery.

Legacy: Bigtable's data model directly inspired Apache HBase. Cassandra took the wide-column model but replaced the master-centric architecture with consistent hashing (borrowed from Dynamo).


MapReduce: Batch Processing

MapReduce (2004 paper) is a programming model for processing large datasets with a parallel, distributed algorithm. It abstracts fault-tolerance, parallelism, and data locality behind two user-supplied functions:

  • Map(key, value) → list of (key, value): Transform and filter input records.
  • Reduce(key, list of values) → list of (key, value): Aggregate all values for a key.
MapReduce Execution:
┌─────────┐       ┌───────────────────────┐      ┌──────────────┐
│  Input  │ split │  Map workers          │ sort │ Reduce       │
│  (GFS)  │──────►│  M1: map(chunk1)      │─────►│ workers      │
│         │       │  M2: map(chunk2)      │      │ R1: reduce(k1│
│         │       │  M3: map(chunk3)      │      │ R2: reduce(k2│
└─────────┘       └───────────────────────┘      └──────┬───────┘
                                                         │
                                                    ┌────▼────┐
                                                    │ Output  │
                                                    │ (GFS)   │
                                                    └─────────┘

Fault tolerance: If a map worker fails, the master reschedules its tasks on another worker. The output of map tasks is stored on the local disk of the map worker; if it fails, the reduce worker fetches the data from a rescheduled worker. This requires re-running the map task on failure — possible because map tasks are pure functions.

Locality optimization: The master tries to schedule map tasks on the machine that stores the input chunk on GFS, or on a machine in the same rack. Network transfer is the bottleneck; avoiding it is the primary optimization.

MapReduce's legacy: Apache Hadoop directly implemented the MapReduce model. Spark replaced MapReduce by keeping intermediate data in memory, enabling iterative algorithms.


Dapper: Distributed Tracing

Dapper (2010 paper) is Google's distributed tracing infrastructure. It solves the observability problem at scale: when a user's request traverses 100+ microservices, where is the latency? Which service failed?

Trace Model

Every request is tagged with a globally unique trace ID. As the request hops from service to service, each hop creates a span: a named, timed unit of work. Spans record: - Start time, end time (latency). - The service name and operation name. - Annotations (key-value metadata). - Parent span ID (establishing the causal tree).

Request trace (simplified):
  TraceID: 8f2a
  ┌─────────────────────────────────────────────────┐
  │ Span: frontend  0ms ─────────────────────── 80ms│
  │  ┌─────────────────────────────────────────┐    │
  │  │ Span: auth   2ms ────── 15ms            │    │
  │  └─────────────────────────────────────────┘    │
  │  ┌─────────────────────────────────────────┐    │
  │  │ Span: search 16ms ─────────────── 72ms  │    │
  │  │  ┌──────────────────────────────────┐   │    │
  │  │  │ Span: bigtable 17ms ────── 60ms  │   │    │
  │  │  └──────────────────────────────────┘   │    │
  │  └─────────────────────────────────────────┘    │
  └─────────────────────────────────────────────────┘

Sampling

At Google's scale, tracing every request would generate petabytes of trace data per day. Dapper uses head-based sampling: when a new trace is started, a coin is flipped (typically 1-in-1000). If sampled, all downstream services are also sampled (the trace decision is propagated in the RPC context).

This means rare events (latency outliers, errors) may be underrepresented. Modern systems like Jaeger and Honeycomb complement head-based sampling with tail-based sampling (collect all spans, then decide post-hoc to keep only the slow/error traces).

Influence: Dapper directly inspired OpenTracing, Zipkin (Twitter), Jaeger (Uber/CNCF), and ultimately OpenTelemetry — the current standard for distributed tracing.


Production Debugging and Operations Notes

Borg job failures: Production incidents at Google often involve Borg preemptions (production jobs preempting lower-priority jobs) cascading into quota exhaustion. The SRE book's chapter on "Practical Alerting" describes the monitoring model built on top of Borg.

Spanner latency tails: Commit-wait introduces a ~5–10ms floor on write latency. Applications that require lower write latency must use eventual-consistency patterns (e.g., Bigtable) rather than Spanner.

Bigtable hot tablets: A single row key receiving all writes causes a tablet to become a hot spot — one tablet server handles all the load. The fix: prefix keys with a salted hash to distribute load across tablets, at the cost of range scan efficiency.


Performance Implications

  • MapReduce vs Spark: MapReduce writes intermediate results to GFS after every stage; Spark keeps them in memory. For iterative algorithms (PageRank, ML training), Spark is 10–100× faster. For single-pass ETL, the difference is smaller.
  • Spanner vs Bigtable: Spanner provides external consistency at the cost of ~10ms write latency and cross-region 2PC overhead. Bigtable provides microsecond-range reads/writes but eventual consistency only.
  • Colossus erasure coding: Reed-Solomon (e.g., 6+3) reduces storage cost vs 3× replication but increases read latency on degraded reads (must reconstruct from 6 shards).

Failure Modes

System Critical Failure Behavior
BorgMaster Leader election 5–10s failover via Paxos; no task disruption
Spanner leader Paxos re-election Writes blocked for election duration (~1–5s)
Bigtable master Chubby lock lost New master recovers from GFS logs; ~30s
GFS master Crash Hot standby with Chubby lease; ~1min
MapReduce coordinator Failure Master restarts from checkpoint; tasks re-run

Future Directions

F1 / SPANNER → AlloyDB: Google continues to evolve Spanner's query processing layer. F1 is the query engine that sits above Spanner; AlloyDB is Google's PostgreSQL-compatible cloud database built on Spanner's storage foundation.

Omega (successor to Borg): A shared-state scheduler with optimistic concurrency — multiple schedulers operate concurrently with no serialization. Kubernetes takes ideas from both Borg and Omega.


Exercises

  1. Read the original Bigtable paper (OSDI 2006). Draw the full read path from a client request to a cell value, including the role of Chubby and the root tablet.
  2. Implement a toy MapReduce framework in Python. Run word count on a 1GB text file. Measure time vs single-threaded counting.
  3. Design a Spanner schema for a global e-commerce order table. What is the read/write cost for a single-region order vs a cross-region order?
  4. Explain why Bigtable compaction is necessary for read performance. What happens to read latency as the number of SSTables grows without compaction?
  5. Read the Dapper paper. Explain the tradeoffs between head-based and tail-based sampling. Design a sampling strategy for a system with 1000 RPS and 0.1% error rate.

References

  • Ghemawat, S. et al. "The Google File System." SOSP 2003.
  • Chang, F. et al. "Bigtable: A Distributed Storage System for Structured Data." OSDI 2006.
  • Dean, J., Ghemawat, S. "MapReduce: Simplified Data Processing on Large Clusters." OSDI 2004.
  • Corbett, J. et al. "Spanner: Google's Globally-Distributed Database." OSDI 2012.
  • Verma, A. et al. "Large-scale cluster management at Google with Borg." EuroSys 2015.
  • Sigelman, B. et al. "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure." Google Technical Report, 2010.