Section 21: Cloud Infrastructure — Overview
Section Purpose and Scope
This section examines cloud infrastructure at the systems level: not the API surface advertised to application developers, but the physical and virtual machinery underneath. It covers how hyperscalers build programmable networks at planet scale, how object storage is designed for eleven nines (99.999999999%) of durability, how AWS Nitro offloads virtualization functions (networking, storage I/O, and security) to dedicated hardware, and how Google's Colossus filesystem underpins GCP's storage products. The goal is to build the mental model that lets an engineer reason about failure modes, latency sources, and cost structures in cloud environments.
Prerequisites
- Section 06: CPU Architecture (NUMA, SIMD, hardware virtualization extensions)
- Section 11: Memory Management (virtual memory, IOMMU)
- Section 13: Filesystems (distributed filesystem concepts)
- Section 15: Networking (TCP/IP, BGP, routing fundamentals)
- Section 17: Distributed Systems (consensus, replication, CAP)
- Section 18: Database Internals (storage engines, replication)
- Section 19: Virtualization (hypervisor types, KVM, SR-IOV)
Learning Objectives
- Compare IaaS, PaaS, SaaS, and FaaS responsibility models with concrete examples.
- Explain how VPC, SDN, and VXLAN compose to create tenant-isolated networks at scale.
- Describe the AWS Nitro system's architecture and why offloading to dedicated hardware matters.
- Articulate Google Colossus's design and its role in GCS, BigQuery, and Spanner.
- Analyze object storage consistency models and the erasure-coding schemes behind their durability.
- Explain BGP's role in cloud anycast and cross-region routing.
- Apply FinOps principles to identify cloud cost optimization opportunities.
- Enumerate cloud failure mode categories and design mitigations.
Architecture Overview
Cloud Infrastructure Stack:
┌──────────────────────────────────────────────────────────────────┐
│                    Customer Abstraction Layer                    │
│  SaaS            PaaS                IaaS          FaaS          │
│  (Gmail, etc)    (App Engine, etc)   (EC2, VMs)    (Lambda, etc) │
└───────────────────────────┬──────────────────────────────────────┘
                            │
┌───────────────────────────▼──────────────────────────────────────┐
│                    Control Plane (per region)                    │
│   ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐   │
│   │ VM Scheduler │  │ Network Ctrl │  │  Storage Placement   │   │
│   │ (fleet mgmt) │  │  (SDN ctrl)  │  │ (replica placement)  │   │
│   └──────────────┘  └──────────────┘  └──────────────────────┘   │
└───────────────────────────┬──────────────────────────────────────┘
                            │
┌───────────────────────────▼──────────────────────────────────────┐
│                         Physical Fabric                          │
│                                                                  │
│        Availability Zone A            Availability Zone B        │
│       ┌───────────────────┐          ┌───────────────────┐       │
│       │  ┌─────────────┐  │          │  ┌─────────────┐  │       │
│       │  │   Rack N    │  │          │  │   Rack M    │  │       │
│       │  │ ┌──────────┐│  │          │  │ ┌──────────┐│  │       │
│       │  │ │  Host    ││  │          │  │ │  Host    ││  │       │
│       │  │ │ ┌──────┐ ││  │          │  │ │ ┌──────┐ ││  │       │
│       │  │ │ │  VM  │ ││  │          │  │ │ │  VM  │ ││  │       │
│       │  │ │ │Nitro │ ││  │          │  │ │ │ KVM  │ ││  │       │
│       │  │ │ └──────┘ ││  │          │  │ │ └──────┘ ││  │       │
│       │  │ └──────────┘│  │          │  │ └──────────┘│  │       │
│       │  └─────────────┘  │          │  └─────────────┘  │       │
│       │    ToR Switch     │          │    ToR Switch     │       │
│       └────────┬──────────┘          └────────┬──────────┘       │
│                │                              │                  │
│       ┌────────▼──────────────────────────────▼──────────┐       │
│       │           Spine / Clos Network Fabric            │       │
│       │     VXLAN overlay, BGP-EVPN, 100G/400G links     │       │
│       └──────────────────────────────────────────────────┘       │
└──────────────────────────────────────────────────────────────────┘
AWS Nitro System:
┌──────────────────────────────────────────────┐
│                   EC2 Host                   │
│  ┌──────────────────┐  ┌──────────────────┐  │
│  │  Nitro Card      │  │  Nitro Security  │  │
│  │  (NVMe/ENA I/O)  │  │  Chip (root of   │  │
│  │  VPC networking  │  │  trust, firmware │  │
│  │  offloaded from  │  │  attestation)    │  │
│  │  host CPU        │  │                  │  │
│  └──────────────────┘  └──────────────────┘  │
│ ┌──────────────────────────────────────────┐ │
│ │  Nitro Hypervisor (KVM-based, minimal)   │ │
│ │  No host OS processes; near bare metal   │ │
│ └──────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
Key Concepts
- IaaS (Infrastructure as a Service): Provider manages physical hardware, hypervisor, and network fabric. Customer manages everything from the OS upward. EC2, Google Compute Engine (GCE), Azure VMs.
- PaaS (Platform as a Service): Provider manages runtime, middleware, and OS. Customer manages application and data. App Engine, Elastic Beanstalk, Heroku.
- SaaS (Software as a Service): Provider manages everything. Customer configures and uses. Gmail, Salesforce, Workday.
- FaaS / Serverless: Event-driven function execution. No persistent server. AWS Lambda, GCP Cloud Functions, Azure Functions. Billed per invocation and execution duration.
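- VPC (Virtual Private Cloud): Software-defined isolated network partition within a cloud provider. Subnet scope varies by provider: an AWS subnet lives in a single availability zone, while a GCP subnet is regional and spans zones. Route tables, security groups, and NACLs enforce policy.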
- SDN (Software-Defined Networking): Separation of control plane from data plane. Central controller programs forwarding rules into commodity switches. Enables programmable network policy.
- VXLAN: Layer-2 overlay over a Layer-3 underlay (RFC 7348). A 24-bit VNI allows about 16.7 million segments. Encapsulated in UDP, destination port 4789. Cloud providers use VXLAN or a functionally similar provider-specific encapsulation for tenant network isolation.
- BGP in Cloud: Used at spine layer for east-west routing, for anycast (routing traffic to nearest healthy endpoint), and for Direct Connect / ExpressRoute peering.
- Object Storage: Flat namespace, HTTP API (GET/PUT/DELETE), objects addressed by key (not by content hash). Durability through erasure coding (e.g., Reed-Solomon 6+3 across AZs). S3, GCS, Azure Blob.
- Erasure Coding: Data split into k data shards + m parity shards. Any k of the k+m shards are sufficient to reconstruct the data, so up to m shards can be lost. More storage-efficient than 3-way replication for comparable durability: a 6+3 code stores 1.5x the raw data versus 3x.
- AWS Nitro System: Custom Nitro Cards offload ENA (networking) and NVMe (storage) I/O from the host CPU to dedicated ASICs. Nitro Hypervisor is a minimal KVM variant. Nitro Security Chip provides a hardware root of trust.
- Google Colossus: Successor to GFS. Distributed filesystem providing the storage substrate for GCS, BigQuery, Spanner, and other Google services. Metadata managed by Curators; data stored on D servers (disk processes).
- Azure SmartNIC: FPGA-based network card (an outgrowth of Project Catapult) that offloads Azure SDN processing (the Virtual Filtering Platform, VFP), encryption, and RDMA. Frees the host CPU from network processing.
- Cloud-Native Design: Stateless compute, externalized state, horizontal scaling, failure as expected condition, automation over manual intervention. Twelve-Factor App as a practical guide.
- FinOps: Financial operations discipline applied to cloud spend. Rightsizing, reserved/spot instance strategy, storage class lifecycle, egress cost awareness, unit economics measurement.
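The erasure-coding entry above claims better storage efficiency than 3-way replication at comparable durability. The following is a minimal sketch of that arithmetic, not a provider's durability model. It assumes each shard fails independently with the same probability within one repair window, and the value of `P_FAIL` is an illustrative assumption rather than a measured figure. The function names are hypothetical.

```python
from math import comb


def storage_overhead(k: int, m: int) -> float:
    """Raw bytes stored per logical byte for a k+m erasure code."""
    return (k + m) / k


def loss_probability(k: int, m: int, p: float) -> float:
    """Probability that more than m of the k+m shards fail.

    Assumes independent shard failures, each with probability p
    within one repair window (a simplifying assumption).
    """
    n = k + m
    return sum(
        comb(n, f) * p**f * (1 - p) ** (n - f)
        for f in range(m + 1, n + 1)
    )


# Assumed per-shard failure probability per repair window (illustrative only).
P_FAIL = 0.001

schemes = {
    "RS 6+3": (6, 3),
    # 3-way replication modeled as 1 data shard plus 2 full copies.
    "3x replication": (1, 2),
}

for name, (k, m) in schemes.items():
    print(
        f"{name}: overhead {storage_overhead(k, m):.2f}x, "
        f"tolerates {m} lost shards, "
        f"P(loss) ~ {loss_probability(k, m, P_FAIL):.2e}"
    )
```

Under these assumptions the 6+3 code stores half as many raw bytes as 3-way replication while tolerating one more lost shard. Real systems also have to account for correlated failures (a rack or AZ outage), which is why shard placement across failure domains matters as much as the code parameters.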
Major Historical Milestones
| Year | Event |
|---|---|
| 2006 | AWS S3 and EC2 launch — commercial public cloud begins |
| 2008 | Google App Engine launched in preview (an early PaaS) |
| 2009 | Amazon VPC launched; VPC concepts formalized |
| 2010 | Azure generally available; OpenStack project started |
| 2012 | Google Compute Engine announced in limited preview; SDN/OpenFlow peaks in research |
| 2014 | VXLAN published as RFC 7348 |
| 2014 | AWS Lambda launches; serverless/FaaS paradigm reaches the mainstream |
| 2014 | Kubernetes open-sourced by Google |
| 2015 | AWS Direct Connect, Azure ExpressRoute mature; hybrid cloud practical |
| 2017 | AWS Nitro system revealed at re:Invent — hardware offload architecture |
| 2017 | Google Colossus architecture published in academic papers |
| 2018 | Azure FPGA SmartNIC (Project Catapult) published; RDMA at scale |
| 2019 | AWS Graviton2 (Arm-based) announced, following the first Graviton in 2018; hyperscaler custom silicon goes mainstream |
| 2019 | FinOps Foundation established; cloud cost engineering professionalized |
| 2019 | AWS Outposts GA; cloud hardware on-premises |
| 2020 | AWS Nitro Enclaves launched for isolated processing of sensitive data; AMD SEV-based confidential VMs begin to appear |
| 2023 | AWS Trainium2 announced, following Inferentia and Trainium; AI-specific cloud silicon |
| 2024 | Google Axion (Arm-based CPU) announced; 400G fabric upgrades; AI workload networking (RoCE, rail-optimized) |
Modern Relevance
Cloud infrastructure is the operating environment for the majority of new systems. Understanding the physical substrate transforms a cloud user into a cloud engineer capable of designing for performance, cost, and resilience: a VPC is an encapsulated overlay (VXLAN or a provider-specific equivalent) on a Clos fabric, S3 durability comes from erasure coding across physically separated failure domains, and a Lambda cold start is a microVM provisioning event.
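To make the overlay idea concrete, here is a minimal sketch of the 8-byte VXLAN header defined in RFC 7348, which carries the 24-bit VNI that separates tenant segments. The function names and the VNI value 5001 are hypothetical, and hyperscalers may use their own encapsulation formats rather than standard VXLAN; this only illustrates the standard layout.

```python
import struct

VXLAN_UDP_PORT = 4789          # IANA-assigned destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08    # "I" flag: the VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) + reserved (24 bits)
    # Word 1: VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header, checking the I flag."""
    word0, word1 = struct.unpack("!II", header[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return word1 >> 8


def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Return the UDP payload: VXLAN header followed by the inner Ethernet frame."""
    return pack_vxlan_header(vni) + inner_frame


# Hypothetical tenant segment ID, chosen only for illustration.
header = pack_vxlan_header(5001)
print(header.hex())            # 0800000000138900
print(unpack_vni(header))      # 5001
```

The outer IP and UDP headers (with destination port `VXLAN_UDP_PORT`) are added by the sending tunnel endpoint and are omitted here. The point of the sketch is that tenant isolation reduces to a 24-bit tag that the underlay never interprets.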
The AWS Nitro architecture illustrates the direction of the industry: pushing hypervisor and I/O functionality to dedicated silicon shrinks the trusted computing base, removes I/O virtualization work from the host CPUs that tenants share (reducing one source of noisy-neighbor contention), and enables near-bare-metal performance. Azure's SmartNIC and Google's Titanium are parallel approaches.
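Confidential computing is the next security frontier. AMD SEV and Intel TDX encrypt VM memory in hardware so that the hypervisor, and therefore the cloud provider, is not supposed to be able to read tenant workload memory. AWS Nitro Enclaves pursue a similar goal through isolated, hardened enclave VMs with cryptographic attestation.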
File Map
21-cloud-infrastructure/
├── 00-overview.md ← this file
├── 01-cloud-models.md ← IaaS/PaaS/SaaS/FaaS tradeoffs, shared responsibility
├── 02-hyperscaler-architectures.md ← AWS, GCP, Azure topology, AZ/region design
├── 03-virtualization-in-cloud.md ← KVM at scale, live migration, overcommit
├── 04-cloud-networking.md ← VPC, SDN, VXLAN, BGP-EVPN, security groups
├── 05-object-storage.md ← S3/GCS internals, erasure coding, consistency
├── 06-cloud-databases.md ← RDS, Spanner, DynamoDB, Aurora internals
├── 07-serverless-faas.md ← Lambda lifecycle, cold start, Firecracker MicroVM
├── 08-cloud-native-design.md ← 12-factor, stateless, circuit breakers, SLOs
├── 09-aws-nitro.md ← Nitro Cards, Nitro Hypervisor, Security Chip
├── 10-google-colossus.md ← GFS->Colossus evolution, Curator, D servers
├── 11-azure-smartnic.md ← Project Catapult, FPGA offload, Azure VFP
├── 12-finops.md ← cost models, rightsizing, reserved/spot, egress
└── 13-cloud-failure-modes.md ← AZ failures, gray failures, blast radius design
Cross-References
- Section 15 (Networking): BGP, TCP/IP underpinning cloud networking fabric
- Section 17 (Distributed Systems): Consensus (Paxos/Raft) in cloud control planes; CAP tradeoffs in cloud databases
- Section 18 (Database Internals): Spanner, DynamoDB, Aurora implementation details
- Section 19 (Virtualization): KVM, QEMU, SR-IOV — the hypervisor layer cloud is built on
- Section 20 (Containers): Fargate, Cloud Run, ACI — containers as the cloud compute unit
- Section 22 (Kubernetes Internals): EKS/GKE/AKS — managed Kubernetes on cloud infrastructure
- Section 25 (Performance Engineering): Network latency in cloud, RDMA, placement groups
- Section 26 (Security): Confidential computing, VPC security, IAM trust boundaries
- Section 28 (Reliability Engineering): Multi-region architecture, RTO/RPO, chaos on cloud