Section 21: Cloud Infrastructure — Overview

Section Purpose and Scope

This section examines cloud infrastructure at the systems level: not the API surface advertised to application developers, but the physical and virtual machinery underneath. It covers how hyperscalers build programmable networks at planet scale, how object storage is designed for eleven nines of durability, how AWS Nitro offloads hypervisor functions to dedicated silicon, and how Google's Colossus filesystem underpins GCP storage products. The goal is to build the mental model that lets an engineer reason about failure modes, latency sources, and cost structures in cloud environments.

Prerequisites

Section 06: CPU Architecture (NUMA, SIMD, hardware virtualization extensions)
Section 11: Memory Management (virtual memory, IOMMU)
Section 13: Filesystems (distributed filesystem concepts)
Section 15: Networking (TCP/IP, BGP, routing fundamentals)
Section 17: Distributed Systems (consensus, replication, CAP)
Section 18: Database Internals (storage engines, replication)
Section 19: Virtualization (hypervisor types, KVM, SR-IOV)

Learning Objectives

Compare IaaS, PaaS, SaaS, and FaaS responsibility models with concrete examples.
Explain how VPC, SDN, and VXLAN compose to create tenant-isolated networks at scale.
Describe the AWS Nitro system's architecture and why offloading to dedicated hardware matters.
Articulate Google Colossus's design and its role in GCS, BigQuery, and Spanner.
Analyze object storage consistency models and the erasure-coding schemes that provide durability.
Explain BGP's role in cloud anycast and cross-region routing.
Apply FinOps principles to identify cloud cost optimization opportunities.
Enumerate cloud failure mode categories and design mitigations.

Architecture Overview

  Cloud Infrastructure Stack:

  ┌──────────────────────────────────────────────────────────────────┐
  │                    Customer Abstraction Layer                    │
  │   SaaS              PaaS              IaaS            FaaS       │
  │ (Gmail, etc)   (App Engine, etc)   (EC2, VMs)    (Lambda, etc)  │
  └───────────────────────────┬──────────────────────────────────────┘
                              │
  ┌───────────────────────────▼──────────────────────────────────────┐
  │                    Control Plane (per region)                    │
  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────┐  │
  │  │  VM Scheduler│  │ Network Ctrl │  │  Storage Placement   │  │
  │  │  (fleet mgmt)│  │  (SDN ctrl)  │  │  (replica placement) │  │
  │  └──────────────┘  └──────────────┘  └──────────────────────┘  │
  └───────────────────────────┬──────────────────────────────────────┘
                              │
  ┌───────────────────────────▼──────────────────────────────────────┐
  │                    Physical Fabric                               │
  │                                                                  │
  │   Availability Zone A        Availability Zone B                 │
  │  ┌───────────────────┐      ┌───────────────────┐               │
  │  │  ┌─────────────┐  │      │  ┌─────────────┐  │               │
  │  │  │  Rack N     │  │      │  │  Rack M     │  │               │
  │  │  │ ┌──────────┐│  │      │  │ ┌──────────┐│  │               │
  │  │  │ │ Host     ││  │      │  │ │ Host     ││  │               │
  │  │  │ │ ┌──────┐ ││  │      │  │ │ ┌──────┐ ││  │               │
  │  │  │ │ │  VM  │ ││  │      │  │ │ │  VM  │ ││  │               │
  │  │  │ │ │Nitro │ ││  │      │  │ │ │ KVM  │ ││  │               │
  │  │  │ │ └──────┘ ││  │      │  │ │ └──────┘ ││  │               │
  │  │  │ └──────────┘│  │      │  │ └──────────┘│  │               │
  │  │  └─────────────┘  │      │  └─────────────┘  │               │
  │  │   ToR Switch       │      │   ToR Switch       │               │
  │  └────────┬──────────┘      └────────┬──────────┘               │
  │           │                          │                            │
  │  ┌────────▼──────────────────────────▼────────────┐             │
  │  │         Spine / Clos Network Fabric              │             │
  │  │   VXLAN overlay, BGP-EVPN, 100G/400G links      │             │
  │  └──────────────────────────────────────────────────┘             │
  └──────────────────────────────────────────────────────────────────┘

  AWS Nitro System:
  ┌──────────────────────────────────────────────┐
  │               EC2 Host                       │
  │  ┌──────────────────┐  ┌──────────────────┐  │
  │  │   Nitro Card     │  │  Nitro Security  │  │
  │  │  (NVMe/ENA I/O)  │  │  Chip (root of   │  │
  │  │  VPC networking  │  │  trust, firmware │  │
  │  │  offloaded from  │  │  attestation)    │  │
  │  │  host CPU        │  │                  │  │
  │  └──────────────────┘  └──────────────────┘  │
  │  ┌──────────────────────────────────────────┐ │
  │  │   Nitro Hypervisor (KVM-based, minimal)  │ │
  │  │   No host OS processes; near bare metal  │ │
  │  └──────────────────────────────────────────┘ │
  └──────────────────────────────────────────────┘

Key Concepts

IaaS (Infrastructure as a Service): Provider manages physical hardware, hypervisor, and network fabric. Customer manages the OS and everything above it. EC2, Google Compute Engine (GCE), Azure VMs.
PaaS (Platform as a Service): Provider manages runtime, middleware, and OS. Customer manages application and data. App Engine, Elastic Beanstalk, Heroku.
SaaS (Software as a Service): Provider manages everything. Customer configures and uses. Gmail, Salesforce, Workday.
FaaS / Serverless: Event-driven function execution. No persistent server. AWS Lambda, GCP Cloud Functions, Azure Functions. Billed per invocation and execution duration.
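VPC (Virtual Private Cloud): Software-defined isolated network partition within a cloud provider. Subnet scope varies by provider: an AWS subnet is confined to a single availability zone, while GCP and Azure subnets are regional and span zones. Route tables, security groups, and NACLs enforce policy.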
SDN (Software-Defined Networking): Separation of control plane from data plane. Central controller programs forwarding rules into commodity switches. Enables programmable network policy.
VXLAN: Layer-2 overlay over a Layer-3 underlay. A 24-bit VNI allows roughly 16 million segments. Frames are encapsulated in UDP with destination port 4789. Cloud providers use VXLAN or comparable (sometimes proprietary) encapsulations for tenant network isolation.
BGP in Cloud: Used inside the data center fabric (leaf and spine) for east-west routing, for anycast (routing traffic to the nearest healthy endpoint), and for Direct Connect / ExpressRoute peering.
Object Storage: Flat namespace, HTTP API (GET/PUT/DELETE), objects addressed by key (not by content hash). Durability through erasure coding (e.g., Reed-Solomon 6+3 across AZs). S3, GCS, Azure Blob.
Erasure Coding: Data split into k data shards + m parity shards. With a code such as Reed-Solomon, any k of the k+m shards are sufficient to reconstruct the data. More storage-efficient than 3-way replication for comparable durability.
AWS Nitro System: Custom silicon (Nitro Cards) offloads networking (ENA) and storage (NVMe) I/O from the host CPU to dedicated ASICs. The Nitro Hypervisor is a minimal KVM variant. The Nitro Security Chip provides a hardware root of trust.
Google Colossus: Successor to GFS. Distributed filesystem providing the storage substrate for GCS, BigQuery, Spanner, and other Google services. Metadata managed by Curators; data stored on D servers (disk processes).
Azure SmartNIC: FPGA-based network card (building on Project Catapult) that offloads Azure SDN policy processing from the host's Virtual Filtering Platform (VFP), along with encryption and RDMA. Frees the host CPU from network processing.
Cloud-Native Design: Stateless compute, externalized state, horizontal scaling, failure as expected condition, automation over manual intervention. Twelve-Factor App as a practical guide.
FinOps: Financial operations discipline applied to cloud spend. Rightsizing, reserved/spot instance strategy, storage class lifecycle, egress cost awareness, unit economics measurement.
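
The erasure-coding trade-off described above can be checked with a little arithmetic. The sketch below is illustrative only: the function names are made up for this example, the per-shard failure probability p is a hypothetical figure, and failures are assumed to be independent, which real systems only approximate. It compares the raw storage overhead and the probability of losing data for a 6+3 code against 3-way replication (modeled as k=1, m=2).

```python
from math import comb

def storage_overhead(k: int, m: int) -> float:
    """Raw bytes stored per byte of user data for a k+m scheme."""
    return (k + m) / k

def loss_probability(k: int, m: int, p: float) -> float:
    """Probability that more than m of the k+m shards are lost.

    Assumes each shard fails independently with probability p
    before repair completes (a simplifying assumption).
    """
    n = k + m
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i)
               for i in range(m + 1, n + 1))

p = 0.001  # hypothetical per-shard failure probability within one repair window

rs_overhead = storage_overhead(6, 3)       # Reed-Solomon 6+3
rep_overhead = storage_overhead(1, 2)      # 3-way replication: 1 copy + 2 extra
rs_loss = loss_probability(6, 3, p)
rep_loss = loss_probability(1, 2, p)

print(f"RS 6+3:        overhead {rs_overhead:.1f}x, P(loss) {rs_loss:.3e}")
print(f"3-way replica: overhead {rep_overhead:.1f}x, P(loss) {rep_loss:.3e}")
```

Under these assumptions the 6+3 code stores half as many raw bytes as 3-way replication while tolerating three lost shards instead of two. The comparison is sensitive to p: with more shards there are more ways to fail, so at high per-shard failure rates the advantage narrows, which is one reason fast repair matters as much as the code parameters.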

Major Historical Milestones

Year	Event
2006	AWS S3 and EC2 launch; commercial public cloud begins
2008	Google App Engine launched in preview (an early PaaS)
2009	Amazon VPC launched; isolated tenant networks become a cloud primitive
2010	Azure generally available; OpenStack project started
2012	Google Compute Engine announced in limited preview; SDN/OpenFlow peaks in research
2014	VXLAN published as RFC 7348 (circulating as an Internet-Draft since 2011)
2014	AWS Lambda launches; serverless/FaaS paradigm introduced
2014	Google open-sources Kubernetes
2015	AWS Direct Connect, Azure ExpressRoute mature; hybrid cloud practical
2017	AWS Nitro system revealed at re:Invent; hardware offload architecture
2017	Google Colossus architecture described in public conference talks
2018	Azure FPGA SmartNIC (Project Catapult) published; RDMA at scale
2018	First AWS Graviton (Arm-based) instances announced; Graviton2 follows in 2019 and hyperscaler custom silicon goes mainstream
2019	AWS Outposts GA; cloud hardware on-premises
2019	FinOps Foundation established (joins the Linux Foundation in 2020); cloud cost engineering professionalized
2020	AWS Nitro Enclaves GA; AMD SEV-based confidential VMs introduced on Google Cloud
2023	AWS Trainium2 announced, following earlier Inferentia and Trainium generations; AI-specific cloud silicon
2024	Google Axion (Arm-based CPU) announced; 400G fabric upgrades; AI workload networking (RoCE, rail-optimized)

Modern Relevance

Cloud infrastructure is the operating environment for the majority of new systems. Understanding the physical substrate — that a VPC is VXLAN over a Clos fabric, that S3 durability is erasure coding across geographically separated failure domains, that Lambda cold starts are VM provisioning events — transforms a cloud user into a cloud engineer capable of designing for performance, cost, and resilience.
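
To make the "VPC is VXLAN over a Clos fabric" idea concrete, here is a minimal sketch of the 8-byte VXLAN header that carries the 24-bit VNI. The layout follows RFC 7348; the helper names and the example VNI 5001 are made up for illustration, and a real provider's encapsulation may differ or be proprietary.

```python
import struct

VXLAN_UDP_PORT = 4789         # IANA-assigned UDP destination port
VXLAN_FLAG_VNI_VALID = 0x08   # "I" flag: the VNI field is valid

def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: 8 flag bits followed by 24 reserved bits.
    # Word 2: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def unpack_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header, checking the I flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return vni_word >> 8

header = pack_vxlan_header(5001)   # 5001 is an arbitrary example tenant segment
print(header.hex())                # 0800000000138900
print(unpack_vni(header))          # 5001
print(2**24)                       # 16777216 possible segments
```

On the wire this header sits between the outer UDP header and the tenant's original Ethernet frame, so the underlay fabric routes only on outer IP addresses while the VNI keeps tenants' Layer-2 segments separate.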

The AWS Nitro architecture represents the direction of the industry: pushing hypervisor and I/O functionality to dedicated silicon shrinks the trusted computing base, removes I/O processing as a source of noisy-neighbor CPU contention on the host, and enables near-bare-metal performance. Azure's SmartNIC and Google's Titanium are parallel approaches.

Confidential computing (AMD SEV, Intel TDX, AWS Nitro Enclaves) is the next security frontier: hardware-enforced isolation, combined with memory encryption in the case of SEV and TDX, is designed to keep even the cloud provider from reading tenant workload memory.

File Map

21-cloud-infrastructure/
├── 00-overview.md                  ← this file
├── 01-cloud-models.md              ← IaaS/PaaS/SaaS/FaaS tradeoffs, shared responsibility
├── 02-hyperscaler-architectures.md ← AWS, GCP, Azure topology, AZ/region design
├── 03-virtualization-in-cloud.md   ← KVM at scale, live migration, overcommit
├── 04-cloud-networking.md          ← VPC, SDN, VXLAN, BGP-EVPN, security groups
├── 05-object-storage.md            ← S3/GCS internals, erasure coding, consistency
├── 06-cloud-databases.md           ← RDS, Spanner, DynamoDB, Aurora internals
├── 07-serverless-faas.md           ← Lambda lifecycle, cold start, Firecracker MicroVM
├── 08-cloud-native-design.md       ← 12-factor, stateless, circuit breakers, SLOs
├── 09-aws-nitro.md                 ← Nitro Cards, Nitro Hypervisor, Security Chip
├── 10-google-colossus.md           ← GFS->Colossus evolution, Curator, D servers
├── 11-azure-smartnic.md            ← Project Catapult, FPGA offload, Azure VFP
├── 12-finops.md                    ← cost models, rightsizing, reserved/spot, egress
└── 13-cloud-failure-modes.md       ← AZ failures, gray failures, blast radius design

Cross-References

Section 15 (Networking): BGP, TCP/IP underpinning cloud networking fabric
Section 17 (Distributed Systems): Consensus (Paxos/Raft) in cloud control planes; CAP tradeoffs in cloud databases
Section 18 (Database Internals): Spanner, DynamoDB, Aurora implementation details
Section 19 (Virtualization): KVM, QEMU, SR-IOV — the hypervisor layer cloud is built on
Section 20 (Containers): Fargate, Cloud Run, ACI — containers as the cloud compute unit
Section 22 (Kubernetes Internals): EKS/GKE/AKS — managed Kubernetes on cloud infrastructure
Section 25 (Performance Engineering): Network latency in cloud, RDMA, placement groups
Section 26 (Security): Confidential computing, VPC security, IAM trust boundaries
Section 28 (Reliability Engineering): Multi-region architecture, RTO/RPO, chaos on cloud