02 — Meta Infrastructure
Overview
Meta (formerly Facebook) built a distinct infrastructure tradition from Google's. Where Google favored building bespoke distributed systems and publishing them as papers, Meta more often contributed to open-source projects, patched the Linux kernel, and engineered extreme efficiency at hardware-software boundaries. Meta's infrastructure is characterized by: aggressive Linux kernel customization, massive Memcache deployments, a social graph data model as first-class concern, RPC-first service decomposition, and a relentless focus on reducing the cost per served byte at multi-billion-user scale.
Historical Context
Facebook launched in 2004 on commodity LAMP stack (Linux, Apache, MySQL, PHP). By 2006 it was at millions of users and hitting MySQL limits. The years 2007–2012 saw a series of architectural transformations driven by the social graph use case: massive fan-out reads (a single news feed load might require hundreds of database queries), write-heavy workloads (status updates, likes, comments at high frequency), and geographic distribution of a user base that expected real-time responsiveness.
Unlike Google (which had purpose-built hardware from early on), Meta operated standard rack servers until the Open Compute Project (OCP, 2011) — Meta's initiative to open-source data center hardware designs for efficiency. OCP designs reduced server cost by ~38% and power consumption by ~24%.
Linux at Meta Scale
Meta runs one of the largest fleets of Linux machines in the world, and it maintains custom kernel forks with patches that never make it upstream for various reasons — performance micro- optimizations, proprietary hardware drivers, or features in development.
Network Stack Optimizations
Meta was an early adopter and contributor to TCP BBR (Bottleneck Bandwidth and Round-trip propagation time), Google's congestion control algorithm that dramatically improves throughput on lossy wide-area links. Meta deployed BBR for inter-datacenter and CDN traffic before it landed in the Linux mainline (4.9, 2016).
Meta also modified the kernel's receive packet steering (RPS) and receive flow steering (RFS) to improve NIC interrupt distribution across CPU cores, reducing latency variance in their Thrift RPC servers.
io_uring Adoption
Meta was one of the earliest large-scale adopters of io_uring (Linux 5.1, 2019), the asynchronous I/O interface designed by Jens Axboe. io_uring uses a ring buffer shared between kernel and userspace — no syscall needed to submit I/O, no syscall needed to collect completions under load. For storage-bound services (Haystack, f4, Tao's persistence layer), io_uring eliminated the syscall overhead that limited throughput at high I/O rates.
Memory Management Modifications
Meta runs large Java services (including significant chunks of the data pipeline and serving stack). Linux's default Transparent Huge Pages (THP) behavior (khugepaged background defragmentation) caused periodic latency spikes in JVM-heavy services. Meta patched THP behavior and eventually contributed changes upstream to make THP more predictable.
Meta also developed TMO (Transparent Memory Offloading) patches for the kernel to move cold anonymous memory pages to CXL (Compute Express Link) memory tier or to zswap, reclaiming DRAM for hot data. Published in 2021, these patches represent Meta's approach to memory tiering at scale.
Tupperware: Container Orchestration
Tupperware is Meta's internal container orchestration system, predating broad Kubernetes adoption by several years. It was designed around Meta's specific requirements that differed from Kubernetes's original design:
- Stateful services as first-class: Meta runs large stateful services (Memcache, MySQL, TAO) that need persistent storage, not just stateless microservices.
- Persistent containers: A Tupperware container can be assigned to a machine semi-permanently (weeks or months), unlike Kubernetes pods which are expected to be ephemeral.
- Incremental updates: Tupperware supports very large services (tens of thousands of containers) with controlled rolling updates that preserve service availability.
- Weighted load balancing: Tupperware-native load balancing is aware of host health and performance variance (not just round-robin), which matters when tail latency of storage lookups dominates.
Meta has been migrating workloads to Kubernetes for developer-facing services while maintaining Tupperware for its internal infrastructure services.
TAO: The Association and Object Graph Cache
TAO (The Associations and Objects) is Meta's distributed graph cache, described in the 2013 USENIX ATC paper. It is the primary data serving system for Facebook's social graph.
Data Model
The social graph is modeled as: - Objects: Nodes with a 64-bit ID, type, and key-value properties (users, posts, photos, comments, pages). - Associations: Directed edges between objects with a type, timestamp, and key-value data (friendship, like, tagged-in, comment-on).
TAO Data Model:
Object: {id: 12345, type: USER, {name: "Alice", location: "NYC"}}
Association: {
id1: 12345, // Alice
type: FRIEND,
id2: 67890, // Bob
time: 1700000000,
data: {}
}
Architecture
TAO Architecture:
┌──────────────────────────────────────────────────────────────────┐
│ Web/Mobile Clients │
└──────────────────────┬───────────────────────────────────────────┘
│
┌──────────────────────▼───────────────────────────────────────────┐
│ TAO Leader Tier (per region: e.g., US-West) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Leader │ │ Leader │ │ Leader │ (Memcache-backed) │
│ │ shard 1 │ │ shard 2 │ │ shard N │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└──────────────────────┬───────────────────────────────────────────┘
│ cache miss → read from follower tier
┌──────────────────────▼───────────────────────────────────────────┐
│ TAO Follower Tier (multiple per region) │
│ (serves most reads; writes forwarded to Leader) │
└──────────────────────┬───────────────────────────────────────────┘
│ cache miss → read from DB
┌──────────────────────▼───────────────────────────────────────────┐
│ MySQL (sharded) │
│ (source of truth; WAL replicated to followers) │
└──────────────────────────────────────────────────────────────────┘
TAO provides read-your-writes consistency within a region and eventual consistency across regions (cross-region replication lag is typically sub-second). Writes to associations go to the leader tier in the user's "home region" and are propagated asynchronously to other regions.
Geographically Distributed Architecture
Meta operates TAO in 5+ global regions. Each region has its own leader and follower tiers plus a MySQL replica. For reads: serve from local follower cache (99%+ hit rate for hot social graph edges). For writes: forward to the home region leader, which writes to MySQL and invalidates caches.
Cross-region write latency is ~50–150ms (WAN RTT). This is acceptable because social graph updates (liking a post) are not time-critical at millisecond resolution.
Thrift: Meta's RPC Framework
Apache Thrift (2007) was developed at Facebook and open-sourced in 2007. It provides:
- IDL (Interface Definition Language): Type-safe service definitions.
- Code generation: Generate client and server stubs for C++, Java, Python, PHP, Ruby, Erlang, Go, and more.
- Protocol: Binary (compact and binary), JSON, and debug protocols.
- Transport: TCP sockets, HTTP, framed, buffered.
// Example Thrift IDL:
struct User {
1: required i64 id,
2: required string name,
3: optional string email
}
service UserService {
User getUser(1: i64 userId) throws (1: NotFoundException nfe),
void updateUser(1: User user)
}
Thrift is still widely used at Meta. Externally, gRPC (Google's similar system using Protocol Buffers + HTTP/2) has become dominant. Meta's internal infrastructure uses Thrift almost exclusively for service-to-service communication.
Haystack: Photo Storage
Haystack is Meta's purpose-built object store for photos, described in the OSDI 2010 paper.
The Problem
Facebook serves billions of photo requests per day. Each photo in a CDN miss requires reading the photo file from disk. Traditional filesystem metadata lookup requires multiple disk I/Os: - inode lookup in page cache (or disk). - Directory entry lookup. - File data read.
At Facebook's scale, even small metadata overhead becomes significant. The NFS-based system Meta used before Haystack had ~10 disk reads per photo (due to metadata layers). Metadata occupied network file system servers 24/7, preventing caching of the photo data.
Haystack Design
Haystack combines many small files into large store files (up to 100GB). Each photo is stored as a needle in a store file with a fixed-size header and the raw photo bytes.
Haystack Store File (physical layout):
┌────────────────────────────────────────────────────────┐
│ Superblock (64 bytes) │
├────────────────────────────────────────────────────────┤
│ Needle 1: {magic, cookie, key, alt_key, flags, │
│ size, data, checksum, padding} │
├────────────────────────────────────────────────────────┤
│ Needle 2: ... │
├────────────────────────────────────────────────────────┤
│ Needle N: ... │
└────────────────────────────────────────────────────────┘
The key insight: each Haystack machine keeps an in-memory index mapping photo ID → (volume
ID, offset, size). A photo read requires:
1. Index lookup (in-memory, microseconds).
2. One pread() to the correct offset in the store file.
This reduces disk reads from ~10 (NFS era) to exactly 1, regardless of how many photos the machine hosts. The memory overhead is small: ~10 bytes per photo × 4 needles (redundant copies) × billions of photos = manageable with commodity RAM.
The "4 hardware needles per byte" phrase in Facebook's paper refers to redundancy: each photo is stored on 4 physical Haystack machines (in 2 failure domains), providing durability and read load distribution.
f4: Warm Storage
f4 is described in the OSDI 2014 paper as Haystack's companion for warm storage — photos that are infrequently accessed (older content, photos of non-public profiles).
The key insight: warm photos consume the same hardware as hot photos, but 82% of Haystack's storage capacity is warm at any time. Providing 4× replication (as in Haystack) for data accessed once per week is inefficient.
f4 uses Reed-Solomon erasure coding (Facebook implemented RS(10, 4) — 10 data blocks + 4 parity blocks — any 4 can be lost and the data reconstructed). This reduces storage overhead from 3.6× (Haystack) to 1.4× for warm data — a ~10× reduction in effective cost per byte.
f4 Erasure Coding:
Original photo: [block 1][block 2]...[block 10]
RS encode: [block 1]...[block 10][parity 1]...[parity 4]
Store on 14 nodes (across 2 data centers, multiple racks)
Any 10 survive → reconstruct original
The reliability tradeoff: reading a degraded block (where a node is down) requires network I/O to reconstruct from 10 surviving fragments. This is acceptable for warm data where latency SLA is seconds, not milliseconds.
Scuba: Real-Time Performance Debugging
Scuba is Meta's internal tool for real-time, ad-hoc querying of large event tables. It is described in the VLDB 2013 paper.
Design
Scuba is optimized for one use case: engineers debugging production performance problems need to query billions of events per second with sub-second latency for arbitrary aggregations.
Key design decisions: - In-memory columnar storage: Data is stored column-by-column in RAM (never on disk in the hot path). Queries operate on columns, enabling vectorized scans. - Sampling: Scuba accepts sampled inputs (e.g., log 1% of RPC calls). The sampling factor is tracked per record; aggregations are scaled up accordingly. - No durability guarantee: Scuba data has a TTL (typically 1–30 days). When nodes restart, data is lost. This trades reliability for simplicity. - Scatter-gather query: A coordinator fan-outs a query to all leaf servers, each scanning its in-memory partition. Partial results are merged at the coordinator.
Scuba influenced Druid (a similar open-source real-time OLAP system) and internal tools at other large web companies.
Debugging and Operations Notes
TAO cache invalidation storms: When a popular entity (viral post) is updated, its cache entry is invalidated across all follower tiers simultaneously, causing a thundering herd against MySQL. Meta mitigates this with lease-based read-repair: a follower that misses issues a lease token; subsequent misses wait for the lease holder to refresh rather than all hitting the DB.
Thrift connection pooling: At Meta's service-to-service call volume, connection pool exhaustion is a common incident trigger. Pool size tuning is critical. Each service has its own Thrift client pool per downstream service.
Haystack volume corruption: If a needle is corrupted on disk (bitrot), the checksum fails and the other 3 replicas serve the request. Haystack background integrity checks detect and repair corruption proactively.
Security Implications
TAO access control: Because TAO stores the social graph, access control is enforced at the application layer (PHP/Hack code) rather than the storage layer. A bug in the access control logic can expose private social graph data without any infrastructure-level protection.
Thrift deserialization: Thrift's binary protocol deserialization is a common target for denial-of-service: sending deeply nested structs or very large collections can cause OOM in the server. Production Thrift servers set strict limits on deserialization depth and size.
Performance Implications
| System | Key Metric | Typical Value |
|---|---|---|
| TAO | Cache hit rate | ~99% for hot social graph reads |
| TAO | Read latency (cache hit) | ~1ms (LAN) |
| Haystack | Photo read I/Os | 1 disk read per photo |
| f4 | Storage efficiency | 1.4× overhead vs raw |
| Scuba | Query latency | <1s for ad-hoc aggregations over billions of rows |
Failure Modes
| System | Failure | Consequence |
|---|---|---|
| TAO Leader failure | Paxos re-election | Write unavailability ~1–5s for affected shards |
| TAO cross-region lag | Network partition | Followers serve stale data (eventual consistency) |
| Haystack machine failure | 1 of 4 replicas lost | Reads degrade; background re-replication starts |
| f4 erasure decode | >4 simultaneous node failures in stripe | Data loss (extremely rare) |
| Thrift thundering herd | Upstream restart | Fan-in causes downstream service overload |
Future Directions
Meta has been investing in: - PyTorch and AI infrastructure: The AI Training cluster (RSC — Research SuperCluster) uses custom NFS over RDMA. The storage and networking architecture for AI training represents Meta's latest infrastructure investment. - Babelfish / Velox: Velox is Meta's open-source vectorized query engine used across Presto, Spark, and internal analytics pipelines — a significant contribution to the query engine ecosystem. - CXL Memory Tiering: Meta is deploying CXL (Compute Express Link) memory expansion cards to add capacity to memory-bound services cheaply.
Exercises
- Read the TAO paper (USENIX ATC 2013). Trace the full path of a "getAssociations" query from a PHP web server through TAO to MySQL and back. Identify every cache tier involved.
- Implement a simplified Haystack needle reader in Python: given a store file and in-memory index, locate and read a needle by photo key.
- Compare Reed-Solomon(10,4) and 3× replication on the following metrics: storage overhead, reads required to serve a request, network overhead to reconstruct a lost block.
- Design a Scuba-like in-memory columnar store for 1 billion events with 10 integer columns. Estimate RAM usage and single-node scan throughput.
- Evaluate the TAO eventual consistency model: what happens if a user A unfriends user B, but user B's region hasn't received the invalidation yet? What content can B see?
References
- Nishtala, R. et al. "Scaling Memcache at Facebook." NSDI 2013.
- Bronson, N. et al. "TAO: Facebook's Distributed Data Store for the Social Graph." ATC 2013.
- Beaver, D. et al. "Finding a Needle in Haystack: Facebook's Photo Storage." OSDI 2010.
- Muralidhar, S. et al. "f4: Facebook's Warm BLOB Storage System." OSDI 2014.
- Wiener, J. et al. "Scuba: Diving into Data at Facebook." VLDB 2013.
- Thrift White Paper:
https://thrift.apache.org/static/files/thrift-20070401.pdf