Cloud Networking: VPCs, Overlays, Load Balancers, and Global Routing

Overview

Cloud networking is a software-defined abstraction layer over physical data center fabric. Every cloud provider operates massive physical networks (tens of thousands of switches, hundreds of thousands of servers) and presents customers with a logical view that is isolated, programmable, and independent of the physical topology. Understanding how VPCs, subnets, security groups, and load balancers actually work beneath their abstractions is essential for debugging, performance tuning, and security auditing. This file examines the implementation mechanisms — overlay networks, stateful hardware firewalls, ECMP load distribution, and anycast routing — that underpin cloud networking primitives.

Prerequisites

TCP/IP networking fundamentals (IP addressing, CIDR notation, routing tables)
Layer 2 switching and VLAN concepts
BGP fundamentals (AS paths, route advertisement, prefix length)
Basic understanding of iptables/netfilter
Familiarity with TLS termination and HTTP/2

VPC: Virtual Private Cloud

A VPC is a logically isolated virtual network within a cloud provider's infrastructure. The key word is "logical" — your VPC does not have its own physical switches. It is implemented as an overlay network on top of the provider's shared physical fabric.

Implementation: Overlay Networks

AWS VPC uses a VXLAN-like encapsulation protocol. Each VPC is assigned a unique VNI (VXLAN Network Identifier). When a packet leaves a VM's virtual NIC:

The packet travels from the guest OS through the VirtIO/ENA driver to the Nitro Card
The Nitro Card looks up the destination IP in a mapping table (maintained by the VPC control plane)
The Nitro Card encapsulates the original packet in a UDP header with the destination's physical host IP and VNI
The encapsulated packet traverses the physical underlay network
The destination Nitro Card decapsulates the packet and delivers it to the target VM

This encapsulation is completely invisible to the guest OS. VMs see an ethernet interface; the Nitro Card handles all encapsulation/decapsulation.

+--------------------+         Physical Network         +--------------------+
|   Host A           |                                  |   Host B           |
|  +------------+    |    UDP(src:10.0.0.1,             |    +------------+  |
|  | VM (guest) |    |    dst:10.0.0.2,                 |    | VM (guest) |  |
|  | 172.31.0.5 |    |    VNI:12345)                    |    | 172.31.0.9 |  |
|  +-----+------+    |  +---------------------+         |    +------+-----+  |
|        |           |  | Original IP packet  |         |           |        |
|  +-----+------+    |  | src: 172.31.0.5     |         |    +------+-----+  |
|  | Nitro Card |====+==| dst: 172.31.0.9     |==========+===| Nitro Card |  |
|  | (encap)    |    |  +---------------------+         |    | (decap)    |  |
|  +------------+    |                                  |    +------------+  |
+--------------------+                                  +--------------------+
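The encapsulation steps above can be sketched in a few lines. AWS's actual wire format is proprietary, so this is only an approximation that borrows the standard 8-byte VXLAN header layout from RFC 7348; the `MAPPING` table, the function names, and the sample payload are hypothetical stand-ins for the control-plane state described above.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag in RFC 7348: the VNI field is valid


def vxlan_encap(vni: int, inner_frame: bytes) -> bytes:
    """Prepend an 8-byte VXLAN header (RFC 7348 layout) to an inner frame."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + 24 reserved bits.
    # Word 2: VNI (24 bits) + 8 reserved bits.
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + inner_frame


def vxlan_decap(payload: bytes):
    """Return (vni, inner_frame) from the UDP payload of a VXLAN packet."""
    if len(payload) < 8:
        raise ValueError("payload shorter than a VXLAN header")
    word1, word2 = struct.unpack("!II", payload[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return word2 >> 8, payload[8:]


# Hypothetical mapping table: (VNI, guest IP) -> physical host IP.
MAPPING = {(12345, "172.31.0.9"): "10.0.0.2"}

inner = b"original packet 172.31.0.5 -> 172.31.0.9"
underlay_dst = MAPPING[(12345, "172.31.0.9")]  # where the outer UDP packet is sent
wire = vxlan_encap(12345, inner)
vni, recovered = vxlan_decap(wire)
print(underlay_dst, vni, recovered == inner)
```

The guest never sees the 8 extra bytes: the header is added after the packet leaves the virtual NIC and removed before delivery on the far side.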

Google's Andromeda SDN platform and Azure's AccelNet (SmartNIC-based) use analogous approaches with different encapsulation protocols (Geneve, proprietary).

VPC Address Space

Each VPC gets an IPv4 CIDR block (typically /16 to /28). AWS additionally supports IPv6 VPC CIDRs (provider-assigned /56 or customer-owned BYOIPv6). Secondary CIDR blocks can be attached to expand VPC address space without recreating the VPC.

Subnets

Subnets partition the VPC's address space and are scoped to a single Availability Zone. An AZ is one or more discrete data centers within a region, with power, cooling, and networking isolated from the region's other AZs. Subnets cannot span AZs; this is intentional, as AZ boundaries represent failure domains.
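As a rough illustration of carving a VPC block into per-AZ subnets, the sketch below uses Python's standard `ipaddress` module. The AZ names, tier names, and the choice of /24 subnets are assumptions made for the example, not a recommended layout.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")
azs = ["az-a", "az-b"]              # hypothetical AZ names
tiers = ["public", "app", "db"]     # hypothetical tier names

# Carve /24 subnets out of the VPC block and hand them out tier by tier.
blocks = vpc.subnets(new_prefix=24)
plan = {(tier, az): next(blocks) for tier in tiers for az in azs}

for (tier, az), net in plan.items():
    # AWS reserves 5 addresses in every subnet (the first four and the last).
    usable = net.num_addresses - 5
    print(f"{tier:6} {az}  {net}  usable={usable}")
```

Each subnet belongs to exactly one AZ, so a two-AZ, three-tier layout needs six subnets.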

Public vs Private Subnets

The distinction is entirely determined by the subnet's route table:

Public Subnet Route Table:        Private Subnet Route Table:
  Destination    Target              Destination    Target
  10.0.0.0/16   local               10.0.0.0/16   local
  0.0.0.0/0     igw-xxxxxxxx        0.0.0.0/0     nat-xxxxxxxx
                (Internet Gateway)               (NAT Gateway)
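The two route tables above can be read as a longest-prefix-match lookup: the most specific route containing the destination wins. The following is a minimal sketch of that rule using the standard `ipaddress` module; the `lookup` function and the list-of-tuples table representation are illustrative choices, not a provider API.

```python
import ipaddress


def lookup(route_table, dst):
    """Return the target of the most specific route that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no matching route: the packet is dropped
    return max(matches, key=lambda m: m[0].prefixlen)[1]


public_rt = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "igw-xxxxxxxx")]
private_rt = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-xxxxxxxx")]

print(lookup(public_rt, "10.0.3.7"))     # intra-VPC traffic
print(lookup(private_rt, "192.0.2.50"))  # internet-bound traffic
```

The only difference between the two tables is the target of the default route, which is the whole distinction between a public and a private subnet.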

A "public subnet" has a route to an Internet Gateway for 0.0.0.0/0. Instances in public subnets with public IPs can receive inbound connections from the internet.

A "private subnet" routes 0.0.0.0/0 through a NAT Gateway. Instances can initiate outbound connections to the internet (NAT translates their private IP to a public IP), but cannot receive unsolicited inbound connections.

                      Internet
                          |
                   +------+------+
                   | Internet    |
                   | Gateway     |
                   +------+------+
                          |
           +--------------+--------------+
           |                             |
+----------+-----------+      +----------+-----------+
| Public Subnet, AZ-a  |      | Public Subnet, AZ-b  |
|  Load Balancer       |      |  Load Balancer       |
|  Bastion Host        |      |                      |
+----------+-----------+      +----------+-----------+
           |                             |
+----------+-----------+      +----------+-----------+
| Private Subnet, AZ-a |      | Private Subnet, AZ-b |
|  App Servers         |      |  App Servers         |
+----------+-----------+      +----------+-----------+
           |                             |
+----------+-----------+      +----------+-----------+
| Private Subnet, AZ-a |      | Private Subnet, AZ-b |
|  Databases           |      |  Databases           |
+----------------------+      +----------------------+

Best practice: place databases and application servers in private subnets. Only load balancers and bastion hosts require public subnet placement.

Security Groups

Security Groups are stateful packet filters applied per Elastic Network Interface (ENI). Unlike traditional firewalls configured at the network boundary, Security Groups are enforced at the virtual NIC level — every packet in and out of every instance passes through its Security Group evaluation.

Implementation

On Nitro-based instances, Security Groups are implemented in the Nitro Card hardware. On older Xen instances, they were implemented as iptables rules in Dom0. The hardware implementation provides two critical properties:

1. Security Groups cannot be bypassed by a compromised guest OS
2. Evaluation has zero CPU overhead for the instance

Security Group rules are allow-only — there is no explicit deny rule type. Traffic not matching any allow rule is implicitly dropped. The stateful behavior means:

If you allow inbound TCP port 443, the response packets (ephemeral ports) are automatically allowed out
If you allow outbound TCP to a remote host, the connection's response packets are automatically allowed in
Connection tracking is maintained per-flow (5-tuple: src IP, dst IP, src port, dst port, protocol)

Inbound Rules (example web server SG):
  Type       Protocol  Port      Source
  HTTP       TCP       80        0.0.0.0/0
  HTTPS      TCP       443       0.0.0.0/0
  SSH        TCP       22        10.0.0.0/24 (bastion subnet only)

Outbound Rules:
  Type       Protocol  Port      Destination
  All traffic All      All       0.0.0.0/0  (default: allow all outbound)
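The stateful behavior described above can be modeled as a set of allow rules plus a connection-tracking table keyed by 5-tuple. The sketch below is a toy model under that assumption; the `Flow` and `SecurityGroup` names are hypothetical, and the model deliberately has no outbound rules so that the effect of tracking is visible.

```python
import ipaddress
from typing import NamedTuple


class Flow(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str


class SecurityGroup:
    """Toy model: allow-only inbound rules plus a connection-tracking table."""

    def __init__(self, inbound_rules):
        # Each rule is (proto, port, source CIDR); there is no deny rule type.
        self.rules = [
            (proto, port, ipaddress.ip_network(cidr))
            for proto, port, cidr in inbound_rules
        ]
        self.tracked = set()  # 5-tuples of accepted inbound flows

    def allow_inbound(self, flow: Flow) -> bool:
        src = ipaddress.ip_address(flow.src_ip)
        for proto, port, cidr in self.rules:
            if flow.proto == proto and flow.dst_port == port and src in cidr:
                self.tracked.add(flow)
                return True
        return False  # no matching allow rule: implicit drop

    def allow_outbound(self, flow: Flow) -> bool:
        # This model has no outbound rules, so only replies to tracked flows pass.
        reverse = Flow(flow.dst_ip, flow.src_ip, flow.dst_port, flow.src_port, flow.proto)
        return reverse in self.tracked


web_sg = SecurityGroup([("tcp", 80, "0.0.0.0/0"), ("tcp", 443, "0.0.0.0/0")])
request = Flow("198.51.100.7", "10.0.1.10", 51000, 443, "tcp")
reply = Flow("10.0.1.10", "198.51.100.7", 443, 51000, "tcp")
print(web_sg.allow_inbound(request), web_sg.allow_outbound(reply))
```

The reply is permitted only because its reversed 5-tuple is in the tracking table; an unrelated outbound connection from the same instance would be dropped in this model.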

Security Groups can reference other Security Groups as source/destination, not just CIDR blocks. This allows rules like "allow TCP 5432 from any instance in the app-server-sg" — membership in the SG determines access, not IP addresses. This is the correct pattern for application-tier firewalling.

Network ACLs (NACLs)

NACLs are stateless packet filters applied at the subnet level. Unlike Security Groups, NACLs:

- Are stateless: you must explicitly allow both inbound and outbound for a bidirectional connection, including ephemeral ports (TCP 1024-65535 for return traffic)
- Support explicit deny rules (useful for blocklisting specific IPs)
- Are evaluated before Security Groups (packets hitting NACL deny rules never reach Security Group evaluation)
- Rules are processed in order (lowest rule number first)

NACLs are rarely the right tool for fine-grained access control because their statelessness makes them operationally complex. Primary use case: emergency IP blocklisting (e.g., block a specific attacking IP across an entire subnet instantly).
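To make the ordered, stateless evaluation concrete, here is a small sketch in which the lowest-numbered matching rule decides and nothing is remembered between packets. The rule tuples, rule numbers, and the blocked address are hypothetical examples.

```python
import ipaddress


def nacl_evaluate(rules, remote_ip, port):
    """Return the action of the lowest-numbered matching rule, else "deny"."""
    addr = ipaddress.ip_address(remote_ip)
    for _number, cidr, (low, high), action in sorted(rules, key=lambda r: r[0]):
        if addr in ipaddress.ip_network(cidr) and low <= port <= high:
            return action
    return "deny"  # mirrors the implicit catch-all deny at the end of a NACL


# Hypothetical rules: (rule number, remote CIDR, (port range), action).
INBOUND = [
    (100, "0.0.0.0/0", (443, 443), "allow"),
    (90, "203.0.113.66/32", (0, 65535), "deny"),   # emergency blocklist entry
]
OUTBOUND = [
    (100, "0.0.0.0/0", (1024, 65535), "allow"),    # ephemeral ports for return traffic
]

print(nacl_evaluate(INBOUND, "198.51.100.7", 443))    # normal client
print(nacl_evaluate(INBOUND, "203.0.113.66", 443))    # blocklisted client
print(nacl_evaluate(OUTBOUND, "198.51.100.7", 51000)) # reply to the client's port
```

Because rule 90 is evaluated before rule 100, the blocklisted address is denied even though a later rule would allow it. Removing the outbound rule would silently break every connection, which is the operational cost of statelessness.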

Cloud Load Balancers

Layer 4: Network Load Balancer (NLB)

NLBs operate at TCP/UDP level. Key properties:

- Pass-through mode: NLB forwards packets without terminating TCP connections
- Preserves client source IP (no SNAT required, client IP visible to backends)
- Extreme performance: millions of requests per second, sub-millisecond added latency
- Static Elastic IPs: NLB front ends are assigned static IPs (unlike ALB which uses DNS with TTL)
- TLS passthrough, or TLS termination at the NLB with client connection details (source IP and port) passed to backends via PROXY protocol v2

Use cases: gaming (UDP), financial trading (TCP with strict latency requirements), IoT protocols (MQTT), anything that cannot tolerate HTTP abstraction overhead.
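A pass-through load balancer has to send every packet of a connection to the same backend without terminating it. One common way to get that property is to hash the flow's 5-tuple, which is also the idea behind ECMP. The sketch below illustrates the idea only: the choice of SHA-256, the modulo selection, and the target addresses are assumptions, not the NLB's actual algorithm.

```python
import hashlib


def pick_target(flow, targets):
    """Map a 5-tuple to one target so all packets of the flow share a backend."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    return targets[int.from_bytes(digest[:8], "big") % len(targets)]


TARGETS = ["10.0.2.11", "10.0.2.12", "10.0.3.11"]  # hypothetical backend IPs

# (protocol, src IP, src port, dst IP, dst port)
flow = ("tcp", "198.51.100.7", 51000, "10.0.0.20", 443)
print(pick_target(flow, TARGETS))
```

Plain modulo hashing remaps many flows when the target list changes, so production load balancers typically pair hashing with per-flow state or a consistent-hashing scheme.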

Layer 7: Application Load Balancer (ALB)

ALBs terminate TCP/TLS and speak HTTP/1.1 and HTTP/2 to backends. Key properties:

- Content-based routing: route by path (/api/* → backend A), host header (api.example.com → backend B), query string, HTTP method
- TLS termination: handles certificate management (ACM integration), HTTPS offload from backends
- WAF integration: AWS WAF rules evaluated at ALB for SQL injection, XSS, rate limiting
- Sticky sessions: route requests from same user to same backend (cookie-based affinity)
- WebSocket support: upgrades connections transparently

Client → ALB → [path /api/* → API target group (port 8080)]
              → [path /static/* → S3 bucket (redirect)]
              → [host app.example.com → App target group (port 3000)]
              → [default → 404 fixed response]
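The rule list above behaves like a priority-ordered match with a default action. A minimal sketch of that evaluation follows; the priorities, the action strings, and the use of shell-style wildcards via `fnmatchcase` are assumptions chosen for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical listener rules mirroring the diagram above.
RULES = [
    {"priority": 10, "path": "/api/*", "action": "forward:api-targets:8080"},
    {"priority": 20, "path": "/static/*", "action": "redirect:s3-bucket"},
    {"priority": 30, "host": "app.example.com", "action": "forward:app-targets:3000"},
]
DEFAULT_ACTION = "fixed-response:404"


def route(host, path):
    """Return the action of the first rule (by priority) whose conditions all match."""
    for rule in sorted(RULES, key=lambda r: r["priority"]):
        if "path" in rule and not fnmatchcase(path, rule["path"]):
            continue
        if "host" in rule and not fnmatchcase(host, rule["host"]):
            continue
        return rule["action"]
    return DEFAULT_ACTION


print(route("example.com", "/api/v1/users"))
print(route("app.example.com", "/dashboard"))
print(route("example.com", "/unknown"))
```

Rule order matters: a request to app.example.com/static/x.css hits the path rule before the host rule because of its lower priority number.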

NLB vs ALB Decision

Use NLB when: added latency must stay in the sub-millisecond range, you need static IPs, the protocol is non-HTTP, or you need the client IP without extra configuration.

Use ALB when: routing on HTTP attributes, TLS termination, WAF, WebSocket, gRPC (via HTTP/2).

VPC Connectivity: Peering, Transit Gateway, PrivateLink

VPC Peering

Direct connection between two VPCs. Routing is symmetric but must be explicitly configured in both route tables. Limitations: non-transitive (A↔B, B↔C does not mean A↔C), no overlapping CIDRs, maximum ~125 peering connections per VPC.
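Two of these limitations are easy to check mechanically. The sketch below tests CIDR overlap with the standard `ipaddress` module and models non-transitivity as a lookup over direct peering pairs; the VPC names and the `PEERINGS` set are hypothetical.

```python
import ipaddress


def can_peer(cidr_a, cidr_b):
    """Peering requires that the two VPC CIDR blocks do not overlap."""
    return not ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


# Hypothetical peering connections: A<->B and B<->C, but no A<->C.
PEERINGS = {frozenset({"A", "B"}), frozenset({"B", "C"})}


def reachable(src, dst):
    """Non-transitive: only a direct peering connection carries traffic."""
    return frozenset({src, dst}) in PEERINGS


print(can_peer("10.0.0.0/16", "10.1.0.0/16"))    # disjoint blocks
print(can_peer("10.0.0.0/16", "10.0.128.0/17"))  # second block sits inside the first
print(reachable("A", "B"), reachable("A", "C"))
```

Note that `reachable` never walks the graph: traffic from A cannot hop through B to reach C, no matter how the route tables are written.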

Transit Gateway

Managed hub-and-spoke router. All VPCs connect to the TGW; TGW routes between them. Supports thousands of VPCs, transitive routing, route tables per attachment for segmentation. Costs per attachment and per GB processed.

VPC A ─────┐
VPC B ─────┤
VPC C ─────┼──── Transit Gateway ──── On-premises (via Direct Connect)
VPC D ─────┘
(all VPCs can route to each other and on-prem)

AWS PrivateLink

Expose a service from VPC A to consumers in VPC B without VPC peering or internet routing. The consumer VPC gets an ENI with a private IP; traffic to that ENI is forwarded to the service in the provider VPC via AWS's internal network. Supported for both AWS-managed services (through interface endpoints; the S3 and DynamoDB gateway endpoints are a related but route-table-based mechanism) and customer-managed services. Benefit: no route table changes required, no CIDR overlap issues.

Direct Connect / ExpressRoute

Dedicated private network circuit from customer premises to cloud provider. Traffic bypasses the public internet entirely.

Properties:

- Consistent latency (dedicated circuit vs internet path variability)
- Higher throughput (1Gbps to 100Gbps port options)
- Lower per-GB cost vs internet-facing data transfer pricing
- Required for some regulatory workloads (ITAR, PCI DSS strict interpretations)
- Can carry private (VPC) traffic and public (S3, API endpoints) traffic on separate VIFs (Virtual Interfaces)

Direct Connect does not provide redundancy by itself — a single circuit is a single point of failure. Production use requires two circuits (ideally from separate Direct Connect locations) with BGP failover.

Anycast Routing for Global Services

Anycast advertises the same IP prefix from multiple points of presence (PoPs) simultaneously via BGP. End users' requests are routed to the nearest PoP by BGP path selection.

User in Tokyo ──────→ Route53 Anycast IP (203.0.113.0/24)
                      BGP sees:
                      Tokyo PoP: AS16509, path length 3
                      London PoP: AS16509, path length 8
                      Virginia PoP: AS16509, path length 9
                      → selects Tokyo PoP (shortest path)

User in London ─────→ Route53 Anycast IP (203.0.113.0/24)
                      → selects London PoP

AWS Route 53 uses Anycast for its authoritative name server IPs (205.251.196.1 etc.), so DNS queries are handled by the nearest PoP globally. CloudFront similarly serves content from the nearest edge location.

Anycast does not provide true geographic load balancing (BGP path selection does not consider server load), but significantly reduces latency by avoiding cross-ocean round trips for DNS resolution and CDN edge serving.
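The selection in the example above reduces to picking the advertisement with the shortest AS path. The sketch below shows only that one tie-breaker; real BGP evaluates other attributes (such as local preference) before path length, and the transit AS numbers here are private-use placeholders rather than real networks.

```python
AWS_ASN = 16509  # origin AS shown in the example above


def make_path(length):
    """Build an illustrative AS path of the given length ending at the origin AS."""
    transit = [64512 + i for i in range(length - 1)]  # private-use ASNs as placeholders
    return transit + [AWS_ASN]


def best_pop(paths_by_pop):
    """Pick the PoP whose advertisement has the shortest AS path."""
    return min(paths_by_pop, key=lambda pop: len(paths_by_pop[pop]))


seen_from_tokyo = {
    "tokyo": make_path(3),
    "london": make_path(8),
    "virginia": make_path(9),
}
print(best_pop(seen_from_tokyo))
```

Nothing in `best_pop` looks at server load or measured latency, which is why anycast alone is not a load-balancing mechanism.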

Debugging Notes

Intermittent connectivity between instances: First check Security Groups (stateful — check both inbound and outbound if you added a custom outbound rule). Second, check NACLs (stateless — verify return traffic is allowed on ephemeral port range). Third, check route tables. VPC Flow Logs are essential: enable on all VPCs, filter for REJECT action to see dropped traffic.
NLB target health failures: NLB performs health checks from its own private IP addresses. Target Security Groups must allow health check traffic from the NLB's ENI IPs or from the CIDRs of the subnets the NLB is deployed in.
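ALB 504 Gateway Timeout: the target did not respond before the ALB gave up. ALB idle timeout defaults to 60 seconds, so requests that legitimately run longer need a higher ALB idle timeout. A related mismatch produces 502s: if the backend's keep-alive timeout is shorter than the ALB idle timeout, the backend closes connections the ALB still considers usable. Set backend idle timeout > ALB idle timeout.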
VPC peering connectivity issues: verify routes in both directions (VPC A's route table must have route to VPC B's CIDR via peering, and vice versa). VPC peering does not support transitive routing — cannot reach a VPC's VPN attachment through peering.
Direct Connect BGP flapping: check BFD (Bidirectional Forwarding Detection) configuration. Without BFD, failure detection waits for the BGP hold timer to expire (typically 30-second keepalives with a 90-second hold time). With BFD, failover is subsecond.
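For the Flow Logs step in the first note, a short parser is often enough to find dropped traffic. The sketch below assumes the default (version 2) space-separated record format; the sample records, account ID, and ENI ID are made up for illustration.

```python
# Field order of the default flow log record format.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def rejected_flows(lines):
    """Yield the 5-tuple of every record whose action is REJECT."""
    for line in lines:
        parts = line.split()
        if len(parts) != len(FIELDS):
            continue  # skip header lines or custom-format records
        rec = dict(zip(FIELDS, parts))
        if rec["action"] == "REJECT":
            yield (
                rec["srcaddr"],
                rec["dstaddr"],
                int(rec["srcport"]),
                int(rec["dstport"]),
                int(rec["protocol"]),  # IANA protocol number, 6 = TCP
            )


# Hypothetical sample records.
SAMPLE = [
    "2 123456789012 eni-0abc1234 203.0.113.12 10.0.1.10 49761 22 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0abc1234 198.51.100.7 10.0.1.10 51000 443 6 10 5200 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0abc1234 - - - - - - - 1700000000 1700000060 - NODATA",
]

for flow in rejected_flows(SAMPLE):
    print(flow)
```

Flow logs do not say whether a Security Group or a NACL rejected the packet, so the 5-tuple still has to be checked against both.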

Security Implications

Security Groups are allow-only. The default SG (auto-created) allows all traffic between members and blocks all inbound from outside. Never use the default SG — create explicit, named SGs for each role.
VPC Flow Logs provide a raw traffic audit trail but do not log packet payloads. For payload inspection, deploy an IDS (Suricata via AWS Gateway Load Balancer, or a managed service like AWS Network Firewall).
NACLs as an emergency control: if a specific IP is actively attacking, a NACL deny rule takes effect in seconds and applies to all instances in the subnet. This is faster than modifying Security Groups on individual instances.
PrivateLink is preferred over VPC peering for exposing services — it provides minimal exposure (only the specific service endpoint, not the entire VPC's routing table).
Transit Gateway route tables: by default, all attached VPCs can route to each other. Segment using separate TGW route tables for production/staging/dev environments.

Performance Implications

Placement groups (cluster): instances in the same cluster placement group are placed on hardware in close physical proximity. Achieves RTTs on the order of 10μs between instances vs ~100μs otherwise. Commonly used for HPC (MPI) workloads. Limitation: a cluster placement group is confined to a single AZ, and capacity for a launch may be insufficient.
Enhanced Networking (ENA): all current-generation Nitro instances use ENA for multi-queue networking. Ensure ENA driver is installed (comes by default on Amazon Linux, may need manual install on other distros for optimal multi-queue behavior).
Jumbo frames: VPC supports 9001 MTU (jumbo frames) for traffic within the same VPC. Internet-facing traffic is limited to 1500 MTU. Enable jumbo frames for large-volume internal data transfers (backup, replication).
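NAT Gateway throughput: a NAT Gateway starts at 5Gbps and scales automatically, up to 100Gbps per current AWS documentation (older documentation listed 45Gbps). For high-egress workloads, distribute across multiple NAT Gateways in different AZs. Never route cross-AZ through a single NAT Gateway (latency + AZ dependency).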
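The benefit of jumbo frames noted above is mostly a matter of packet count. The arithmetic below is a back-of-the-envelope sketch that assumes TCP over IPv4 with no header options (40 bytes of headers per packet) and ignores Ethernet framing and encapsulation overhead.

```python
HEADERS = 40  # IPv4 header (20 bytes) + TCP header (20 bytes), no options


def packets_needed(total_bytes, mtu):
    """Number of full-size packets required to carry total_bytes of payload."""
    payload = mtu - HEADERS
    return -(-total_bytes // payload)  # ceiling division on integers


ONE_GIB = 2**30
standard = packets_needed(ONE_GIB, 1500)
jumbo = packets_needed(ONE_GIB, 9001)
print(standard, jumbo, round(standard / jumbo, 2))
```

Roughly six times fewer packets means fewer per-packet operations for both the instance and the network, which is where the gain for backup and replication traffic comes from.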

Failure Modes

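AZ affinity mismatch: if your ALB is enabled in AZ-a and AZ-b, but all healthy targets are in AZ-c, the ALB cannot route to them and returns 5xx errors (typically 503). Always register targets in the same AZs that are enabled for your ALB. Cross-zone load balancing (on by default for ALBs) evens out uneven target distributions across the enabled AZs.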
Security Group rule limit: default 60 inbound + 60 outbound rules per SG, 5 SGs per ENI. Applications with many distinct peers approach these limits. Prefix lists simplify managing IP sets, but a referenced prefix list counts against the rule quota by its maximum number of entries, not as a single rule.
Route propagation lag: VPC route table updates propagate to all ENIs within seconds, but Direct Connect BGP propagation through the network takes 30-60 seconds on average. Failover cutover times depend on this propagation.
Anycast hot potato routing: BGP Anycast routes users to the "nearest" PoP by BGP topology, not necessarily by actual latency. This sometimes sends users to a suboptimal PoP due to ISP routing decisions. Latency-based routing (Route 53 latency records) using measured RTTs provides better results for user-facing services.

Modern Usage

Service meshes (Istio, AWS App Mesh, Linkerd) have added L7 traffic management within VPCs: mTLS between services, circuit breaking, retries, canary routing. These operate alongside (not replacing) VPC-level Security Groups.

AWS Gateway Load Balancer (GWLB) enables transparent insertion of third-party network appliances (IDS/IPS, firewalls, deep packet inspection) into traffic paths without source NAT. All traffic flows through the appliance, which can drop or pass traffic, returning it to the original path.

Future Directions

IPv6-only VPCs are becoming viable as IPv6 adoption grows in transit networks. Reduces NAT Gateway dependency and egress costs.
AWS Network Access Analyzer: automated detection of unintended network access paths using formal verification of route tables and Security Group configurations.
Encrypted inter-VPC traffic using AWS KMS: Transit Gateway supports attachment-level traffic encryption, eliminating need to trust the AWS network fabric for sensitive internal traffic.

Exercises

Create a VPC with public and private subnets across two AZs. Deploy a web server in the private subnet, an ALB in the public subnet, and verify Security Group rules prevent direct SSH from the internet while allowing ALB health checks.
Enable VPC Flow Logs and simulate a Security Group rejection. Parse the flow log to identify the rejected connection's 5-tuple.
Configure VPC peering between two VPCs with non-overlapping CIDRs. Test connectivity, then verify that transitive routing does not work (i.e., a third VPC peered with the second cannot reach the first).
Benchmark the latency difference between instances in a cluster placement group vs. instances spread across an AZ using netperf or iperf3.
Design a multi-region architecture using Transit Gateway and Direct Connect that achieves active-active failover with RTO under 30 seconds if one region's Direct Connect circuit fails.

References

AWS VPC User Guide: https://docs.aws.amazon.com/vpc/latest/userguide/
James Hamilton: "AWS Networking Overview" (re:Invent 2014, CMP401)
Amin Vahdat et al.: "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network" (SIGCOMM 2015)
AWS re:Invent 2021: "Networking best practices for your VPC" (NET204)
Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization (NSDI 2018)
Cloudflare: "Cloudflare's Approach to BGP Anycast" (blog.cloudflare.com)
RFC 7348: Virtual eXtensible Local Area Network (VXLAN)