Container Networking

Technical Overview

Container networking is the set of mechanisms that give containers network connectivity — both between containers on the same host and across hosts — while maintaining appropriate isolation boundaries. It sits at the intersection of Linux kernel networking (network namespaces, virtual devices, iptables/nftables, eBPF), IP networking concepts (routing, NAT, overlay protocols), and container ecosystem standards (CNI).

The core challenge in container networking is bridging network namespaces: a new container starts with an isolated network namespace containing only a loopback interface. The runtime must connect this isolated namespace to the outside world. The mechanism used — from simple veth pairs with iptables NAT, to overlay VXLAN tunnels, to BGP routing, to eBPF-native forwarding — determines the performance, operational complexity, and policy capabilities of the resulting network.

Prerequisites

Linux network namespaces (section 01)
IP networking fundamentals (routing, NAT, ARP)
Linux network devices: bridge, veth, tun/tap, VXLAN
iptables/nftables basics
eBPF basics (for Cilium section)

Historical Context

Docker's original networking model (2013) used a simple Linux bridge (docker0) with iptables MASQUERADE for NAT. This worked well for single-host development but scaled poorly for production multi-host clusters.

As Kubernetes emerged (2014-2015), the network model became more complex: pods needed IP addresses routable across the cluster, pod-to-pod communication without NAT was required, and network policy (firewall rules at the pod level) was needed. This drove the creation of the CNI specification by CoreOS and others in 2015, later adopted by Kubernetes.

The evolution from overlay-based solutions (Flannel VXLAN) to BGP-routed networks (Calico) to eBPF-native implementations (Cilium) mirrors the broader industry trend toward programmable data planes.

Docker Bridge Networking

The default Docker network mode creates a Linux bridge device (docker0) on the host and connects each container to it via a virtual ethernet pair.

Setup Walkthrough

When Docker starts:

ip link add docker0 type bridge
ip addr add 172.17.0.1/16 dev docker0
ip link set docker0 up

When a container starts:

# Create a veth pair. The container end gets a temporary name so it
# cannot clash with the host's own eth0.
ip link add veth0 type veth peer name ceth0

# Move the container end into the container's network namespace
ip link set ceth0 netns <container_ns>

# Connect host end to docker0 bridge
ip link set veth0 master docker0
ip link set veth0 up

# Inside container namespace (e.g. via nsenter or ip netns exec):
ip link set ceth0 name eth0
ip addr add 172.17.0.2/16 dev eth0
ip link set eth0 up
ip route add default via 172.17.0.1
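
The walkthrough above can be expressed as a small command planner. This is a minimal sketch, not what Docker actually runs: the function name, the temporary peer name ceth0, and the use of a named network namespace (so that ip -n works) are assumptions made for illustration. It only builds the commands; executing them requires root.

```python
import shlex

def bridge_attach_commands(netns, container_ip, gateway, bridge="docker0",
                           host_if="veth0", tmp_peer="ceth0"):
    """Return the ip(8) commands that wire one container to a bridge.

    netns is a named network namespace (as created by `ip netns add`).
    Nothing is executed here; the caller decides whether to run the commands.
    """
    in_ns = ["ip", "-n", netns]
    return [
        # Create the pair with a temporary peer name to avoid clashing with host eth0.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", tmp_peer],
        ["ip", "link", "set", tmp_peer, "netns", netns],
        # Host side: attach to the bridge and bring it up.
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", host_if, "up"],
        # Container side: rename, address, bring up, add default route.
        in_ns + ["link", "set", tmp_peer, "name", "eth0"],
        in_ns + ["addr", "add", container_ip, "dev", "eth0"],
        in_ns + ["link", "set", "eth0", "up"],
        in_ns + ["route", "add", "default", "via", gateway],
    ]

for cmd in bridge_attach_commands("demo", "172.17.0.2/16", "172.17.0.1"):
    print(shlex.join(cmd))
```

Keeping planning separate from execution makes the sequence easy to review or log before anything touches the host's network configuration.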

NAT for Outbound Traffic

iptables rules (inserted by Docker):

# MASQUERADE (SNAT) packets leaving the host on non-docker interfaces
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE

This rewrites the source IP from the container's 172.17.x.x to the host's external IP before the packet leaves the host.
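
The match logic of that rule can be modeled in a few lines. This is a sketch of the decision only, with a hypothetical function name; the kernel additionally records the translation in conntrack so that replies are rewritten back to the container's address.

```python
import ipaddress

DOCKER_SUBNET = ipaddress.ip_network("172.17.0.0/16")

def masquerade(src_ip, out_iface, host_ip, bridge="docker0"):
    """Mimic the match in the POSTROUTING rule: -s 172.17.0.0/16 ! -o docker0.

    Returns the source address the packet carries after POSTROUTING.
    """
    from_container = ipaddress.ip_address(src_ip) in DOCKER_SUBNET
    leaving_bridge = out_iface != bridge
    if from_container and leaving_bridge:
        return host_ip  # source rewritten to the outgoing interface's address
    return src_ip       # rule does not match; packet is left alone

print(masquerade("172.17.0.2", "eth0", "10.0.0.1"))     # 10.0.0.1 (container to external)
print(masquerade("172.17.0.2", "docker0", "10.0.0.1"))  # 172.17.0.2 (container to container)
```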

Topology Diagram

Host
┌──────────────────────────────────────────────────────────────┐
│                                                              │
│  ┌─────────────────┐          ┌─────────────────┐            │
│  │  Container A    │          │  Container B    │            │
│  │  eth0           │          │  eth0           │            │
│  │  172.17.0.2/16  │          │  172.17.0.3/16  │            │
│  └────────┬────────┘          └────────┬────────┘            │
│           │ (veth pair)                │ (veth pair)         │
│      veth0a                       veth0b                     │
│           │                            │                     │
│     ┌─────▼────────────────────────────▼─────┐               │
│     │           docker0 bridge               │               │
│     │           172.17.0.1/16                │               │
│     └──────────────────┬─────────────────────┘               │
│                        │ routing + iptables MASQUERADE       │
│                      eth0 (host interface)                   │
│                      10.0.0.1/24                             │
└──────────────────────────────────────────────────────────────┘
                         │
                    External network

Host Network Mode

With docker run --network=host, the container shares the host's network namespace entirely:

No veth pair, no bridge, no NAT
Container sees all host network interfaces directly
ss -tlnp inside the container shows all host socket listeners
Ports bound inside the container are bound on the host's IP

This mode is used for maximum performance (no NAT overhead, no veth copy) and for containers that need to operate on the host's network stack (network monitoring tools, CNI plugins themselves).

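Security implication: there is no network isolation. A compromised container can connect to services bound to the host's loopback address, bind any free host port, and sniff host network traffic if it holds CAP_NET_RAW (part of Docker's default capability set).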

Overlay Networks for Multi-Host

Single-host bridge networking does not work across hosts because 172.17.0.0/16 addresses are not routable on the external network. Overlay networks tunnel container traffic through the host's external network.

VXLAN Encapsulation

VXLAN (Virtual Extensible LAN, RFC 7348) encapsulates Layer 2 frames in UDP packets:

Outer IP header  (host-to-host routing)
Outer UDP header (port 4789)
VXLAN header     (24-bit VNI — virtual network identifier)
Inner Ethernet frame (container-to-container)
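
To make the header layout concrete, the sketch below packs and parses the 8-byte VXLAN header described in RFC 7348: an 8-bit flags field with the VNI-valid bit set, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The function names are illustrative and the VNI value is arbitrary.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348

def pack_vxlan_header(vni):
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + 24 reserved bits.
    # Word 2: VNI (24 bits) + 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)

def unpack_vni(header):
    """Extract the VNI from an 8-byte VXLAN header."""
    flags_word, vni_word = struct.unpack("!II", header)
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag not set")
    return vni_word >> 8

header = pack_vxlan_header(1)
print(header.hex())        # 0800000000000100
print(unpack_vni(header))  # 1
```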

Each host has a VXLAN Tunnel Endpoint (VTEP) — a virtual network device that encapsulates outbound container traffic and decapsulates incoming VXLAN packets.

Host A                                        Host B
┌──────────────────────┐    VXLAN tunnel     ┌──────────────────────┐
│ Container 10.1.0.2   │ ──────────────────→ │ Container 10.1.0.3   │
│     │                │  UDP port 4789      │     │                │
│   bridge             │  encapsulated in    │   bridge             │
│     │                │  host IPs           │     │                │
│   VTEP flannel.1     │                     │   VTEP flannel.1     │
│   192.168.1.10       │                     │   192.168.1.11       │
└──────────────────────┘                     └──────────────────────┘

Docker Swarm Overlay

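Docker Swarm uses VXLAN overlays. Cluster state is kept in the managers' Raft-based store, while per-endpoint information (container MAC and IP to VTEP host mappings) is spread between nodes by a gossip protocol so that each host can populate its ARP and forwarding tables.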

CNI Specification

The Container Network Interface is a specification and a set of reference libraries/plugins for configuring network interfaces in Linux containers. The Kubernetes kubelet (via containerd/CRI-O) calls CNI plugins to set up container networking.

CNI Contract

The runtime sets environment variables and passes a JSON config via stdin to the plugin binary. The plugin configures the network and returns a JSON result via stdout.

Environment variables:
  CNI_COMMAND=ADD|DEL|CHECK|VERSION
  CNI_CONTAINERID=<container id>
  CNI_NETNS=/run/netns/<id>
  CNI_IFNAME=eth0
  CNI_PATH=/opt/cni/bin

Stdin: JSON config
{
  "cniVersion": "1.0.0",
  "name": "my-network",
  "type": "bridge",
  "bridge": "cni0",
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16"
  }
}

Stdout (on ADD): JSON result with assigned IP, routes
{
  "cniVersion": "1.0.0",
  "ips": [{"address": "10.22.0.5/16", "gateway": "10.22.0.1"}],
  "routes": [{"dst": "0.0.0.0/0"}]
}

CNI Call Flow Diagram

kubelet (container created, network namespace ready)
        │
        │  exec: /opt/cni/bin/calico with CNI_COMMAND=ADD
        │  stdin: JSON network config
        ▼
   CNI plugin binary
        │
        │  1. Parse config
        │  2. Call IPAM plugin to allocate IP
        │  3. Create/configure network device in container netns
        │  4. Add routes
        │  5. Program any host-side policy (iptables/eBPF)
        │
        │  stdout: JSON result (IP, routes, DNS)
        ▼
   kubelet receives IP, updates pod status

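CNI is called synchronously by the container runtime. If the CNI plugin hangs, container creation hangs. In Kubernetes, CNI errors do not appear as scheduling failures: the pod is already scheduled, stays in ContainerCreating, and the error surfaces as a FailedCreatePodSandBox event (visible in kubectl describe pod).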
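
The contract can be sketched as a dispatcher that takes the environment and the stdin payload and returns an exit code plus the text for stdout. This is a skeleton under stated assumptions, not a working plugin: it creates no interfaces, the returned address is hard-coded to match the example result above, and error code 4 is used here for a bad CNI_COMMAND (consult the specification for the authoritative list of codes).

```python
import json

def handle(env, stdin_text):
    """Dispatch one CNI invocation; returns (exit_code, stdout_text).

    env is a mapping such as os.environ; stdin_text is the JSON network config.
    """
    conf = json.loads(stdin_text)
    command = env.get("CNI_COMMAND")
    if command == "VERSION":
        return 0, json.dumps({"cniVersion": "1.0.0",
                              "supportedVersions": ["1.0.0"]})
    if command == "ADD":
        # A real plugin would create env["CNI_IFNAME"] inside env["CNI_NETNS"]
        # here and delegate address allocation to the configured IPAM plugin.
        result = {
            "cniVersion": conf["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"],
                            "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": "10.22.0.5/16", "gateway": "10.22.0.1",
                     "interface": 0}],
            "routes": [{"dst": "0.0.0.0/0"}],
        }
        return 0, json.dumps(result)
    if command in ("DEL", "CHECK"):
        return 0, ""  # nothing to tear down or verify in this sketch
    error = {"cniVersion": conf.get("cniVersion", "1.0.0"), "code": 4,
             "msg": f"unsupported CNI_COMMAND: {command}"}
    return 1, json.dumps(error)

env = {"CNI_COMMAND": "ADD", "CNI_CONTAINERID": "abc123",
       "CNI_NETNS": "/run/netns/test", "CNI_IFNAME": "eth0",
       "CNI_PATH": "/opt/cni/bin"}
config = '{"cniVersion": "1.0.0", "name": "my-network", "type": "bridge"}'
code, out = handle(env, config)
print(code, out)
```

In an actual plugin binary, the same function would be fed os.environ and sys.stdin.read(), with the returned text written to stdout and the code passed to sys.exit.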

IPAM Plugins

host-local: Allocates IPs from a subnet stored in files on the local host at /var/lib/cni/networks/<network-name>/. No coordination across hosts — each host manages its own subnet (requires per-host CIDR allocation at a higher level).

Whereabouts: Cluster-wide IPAM that uses etcd or Kubernetes CRDs to track IP allocation across all nodes. Ensures no IP is double-allocated across the cluster.
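
The file-per-address idea behind host-local can be illustrated as follows. This is a simplified sketch loosely modeled on that scheme, not the plugin's actual code: it assumes the first usable address is the gateway, omits locking across processes beyond exclusive file creation, and uses a temporary directory instead of /var/lib/cni/networks.

```python
import ipaddress
import os
import tempfile

def allocate(data_dir, subnet, container_id):
    """Reserve the first free address in subnet by creating a file named after it."""
    net = ipaddress.ip_network(subnet)
    hosts = iter(net.hosts())
    gateway = next(hosts)  # assumption: first usable address is the gateway
    for ip in hosts:
        path = os.path.join(data_dir, str(ip))
        try:
            # O_EXCL makes creation fail if the address is already reserved.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        with os.fdopen(fd, "w") as f:
            f.write(container_id)
        return f"{ip}/{net.prefixlen}", str(gateway)
    raise RuntimeError("subnet exhausted")

def release(data_dir, container_id):
    """Free every address reserved by container_id."""
    for name in os.listdir(data_dir):
        path = os.path.join(data_dir, name)
        with open(path) as f:
            owner = f.read()
        if owner == container_id:
            os.remove(path)

with tempfile.TemporaryDirectory() as d:
    print(allocate(d, "10.22.0.0/29", "ctr-a"))  # ('10.22.0.2/29', '10.22.0.1')
    print(allocate(d, "10.22.0.0/29", "ctr-b"))  # ('10.22.0.3/29', '10.22.0.1')
    release(d, "ctr-a")
    print(allocate(d, "10.22.0.0/29", "ctr-c"))  # ('10.22.0.2/29', '10.22.0.1')
```

Because the state lives only on the local filesystem, two hosts given the same subnet would hand out the same addresses, which is why per-host CIDRs or a cluster-wide IPAM are needed.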

CNI Plugin Comparison

Flannel

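Approach: VXLAN overlay (default) or host-gw (direct routes with no encapsulation; requires Layer 2 adjacency between hosts)
Installation: DaemonSet; configures VXLAN VTEP and FDB (forwarding database) entries
Routing: Cross-node pod-to-pod traffic goes through the VXLAN tunnel; same-node traffic stays on the local bridge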
Network Policy: None — Flannel handles connectivity only; use Calico or Cilium for policy
Performance: Lower than BGP/native; VXLAN encapsulation adds ~100-300ns overhead
Use case: Simple clusters where operational simplicity matters more than performance

Calico

Approach: BGP routing — publishes pod CIDR routes via BGP to the host network
No overlay required: If hosts are on the same L2 network, or if BGP peering reaches the upstream router, pods can communicate without encapsulation
Dataplane options: iptables (legacy), eBPF (introduced as a tech preview in Calico 3.13)
Network Policy: Full Kubernetes NetworkPolicy + Calico GlobalNetworkPolicy (cluster-wide, namespace-spanning)
Performance: Native routing approaches line rate; eBPF dataplane removes iptables traversal
Use case: Production clusters where performance and policy matter; especially on-premises or public cloud with BGP support

Cilium

Approach: eBPF-native — replaces iptables, kube-proxy, and traditional CNI plumbing with eBPF programs
Identity-based security: Pods get a numeric identity based on labels; policy rules reference identities, not IPs (IPs are ephemeral in Kubernetes)
L7 visibility: Can enforce HTTP, gRPC, Kafka policy without a sidecar proxy in each pod (L7 traffic is redirected to a per-node Envoy proxy)
kube-proxy replacement: Cilium's eBPF map-based load balancing replaces kube-proxy's iptables rules, with better performance and per-connection load balancing
Hubble observability: Built-in flow visibility (like a network traffic recorder) at the eBPF layer
Use case: Large-scale Kubernetes clusters, security-sensitive environments, microservices requiring L7 policy
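
The identity-based model can be pictured with a toy allocator. This is a conceptual sketch only; the class, the starting identity value, and the data structures are invented for illustration and do not reflect Cilium's internals, which keep this state in eBPF maps.

```python
import itertools

class IdentityAllocator:
    """Map each distinct label set to one stable numeric identity."""

    def __init__(self, first_id=1000):  # starting value is arbitrary
        self._ids = {}
        self._counter = itertools.count(first_id)

    def identity_for(self, labels):
        key = frozenset(labels.items())
        if key not in self._ids:
            self._ids[key] = next(self._counter)
        return self._ids[key]

alloc = IdentityAllocator()
frontend = alloc.identity_for({"app": "frontend"})
backend = alloc.identity_for({"app": "backend"})

# Policy is expressed over identities, so it survives pod IP churn.
allowed = {(frontend, backend)}  # frontend may talk to backend

# The datapath resolves an IP to an identity, then checks policy.
ip_to_identity = {"10.0.1.2": frontend, "10.0.2.2": backend}

def permitted(src_ip, dst_ip):
    return (ip_to_identity[src_ip], ip_to_identity[dst_ip]) in allowed

print(permitted("10.0.1.2", "10.0.2.2"))  # True
# A rescheduled frontend pod gets a new IP but keeps its identity.
ip_to_identity["10.0.1.9"] = alloc.identity_for({"app": "frontend"})
print(permitted("10.0.1.9", "10.0.2.2"))  # True
print(permitted("10.0.2.2", "10.0.1.9"))  # False
```

The policy set never changes when the pod moves; only the IP-to-identity mapping is updated.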

DNS in Containers

Docker Embedded DNS

Docker runs an embedded DNS server (listening on 127.0.0.11) within each container's network namespace. Container DNS queries for service names (on Docker networks) are answered by this server. External queries are forwarded to the host's configured DNS.

Kubernetes CoreDNS

Kubernetes runs CoreDNS as a cluster-level DNS server:

ClusterIP: 10.96.0.10 (default)
Configured in /etc/resolv.conf inside every pod: nameserver 10.96.0.10
Service A record: <service>.<namespace>.svc.cluster.local → ClusterIP
Pod A record: <pod-ip-dashes>.<namespace>.pod.cluster.local
External names: forwarded to upstream resolvers

CoreDNS plugins handle custom zones, rewrites, caching, and forwarding. In large clusters, DNS is a scaling bottleneck — NodeLocal DNSCache (a DaemonSet caching DNS agent) addresses this by caching responses on each node.
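
The Kubernetes record naming scheme is mechanical enough to write down. The helper names below are illustrative, and cluster.local is assumed as the cluster domain (it is configurable per cluster).

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Name of a Service's A record."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

def pod_fqdn(pod_ip, namespace, cluster_domain="cluster.local"):
    """Name of a pod's A record: dots in the IPv4 address become dashes."""
    return f"{pod_ip.replace('.', '-')}.{namespace}.pod.{cluster_domain}"

print(service_fqdn("my-service", "default"))  # my-service.default.svc.cluster.local
print(pod_fqdn("10.0.1.2", "default"))        # 10-0-1-2.default.pod.cluster.local
```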

Container Network Debugging

Tools Inside the Container

# Check interface configuration
ip addr
ip route

# DNS resolution
nslookup my-service
dig my-service.default.svc.cluster.local

# Check open connections
ss -tlnp

# Test connectivity
curl -v http://other-service:8080/health

Tools from the Host

# Find container PID
docker inspect --format '{{.State.Pid}}' mycontainer
# or for Kubernetes:
crictl inspect <container-id> | jq '.info.pid'

# Enter container network namespace for debugging
nsenter --target <PID> --net -- ip addr
nsenter --target <PID> --net -- ss -tlnp

# tcpdump on the container's veth (from host side)
# Find the host-side peer: the container's eth0 is listed as eth0@if<N>,
# where <N> is the interface index of its veth peer on the host
nsenter --target <PID> --net -- ip -o link show eth0
# Then find the interface with that index on the host
ip -o link | grep "^<N>:"
# Now tcpdump on it
tcpdump -i veth<X> -nn

# tcpdump inside the container namespace
nsenter --target <PID> --net -- tcpdump -i eth0 -nn
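
Matching a container interface to its host-side veth can be scripted. Inside the container namespace, ip -o link lists a veth as name@ifN, where N is the peer's interface index on the host. The sketch below parses that notation; the function names and the sample lines are illustrative and trimmed, not captured output.

```python
import re

def peer_ifindex(ip_link_line):
    """Extract N from an `ip -o link` line of the form '<idx>: <name>@ifN: ...'."""
    match = re.match(r"\d+:\s+[^@:\s]+@if(\d+):", ip_link_line)
    return int(match.group(1)) if match else None

def find_by_index(ip_link_output, index):
    """Return the interface name with the given ifindex in `ip -o link` output."""
    for line in ip_link_output.splitlines():
        match = re.match(r"(\d+):\s+([^@:\s]+)", line)
        if match and int(match.group(1)) == index:
            return match.group(2)
    return None

# Sample lines (illustrative, trimmed):
container_line = "2: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ..."
host_output = (
    "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 ...\n"
    "7: veth3a1b2c@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ..."
)

index = peer_ifindex(container_line)
print(index, find_by_index(host_output, index))  # 7 veth3a1b2c
```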

Network Diagram (Kubernetes Multi-Node with Calico BGP)

Node 1 (192.168.1.10)              Node 2 (192.168.1.11)
┌─────────────────────────┐        ┌─────────────────────────┐
│ Pod A: 10.0.1.2         │        │ Pod C: 10.0.2.2         │
│   eth0 ← veth pair      │        │   eth0 ← veth pair      │
│         │               │        │         │               │
│    cali<x> (host veth)  │        │    cali<y> (host veth)  │
│         │               │        │         │               │
│   ┌─────▼────────────┐  │        │   ┌─────▼────────────┐  │
│   │  kernel routing  │  │        │   │  kernel routing  │  │
│   │  10.0.1.0/24     │  │        │   │  10.0.2.0/24     │  │
│   │  via local       │  │        │   │  via local       │  │
│   └─────┬────────────┘  │        │   └─────┬────────────┘  │
│         │ BGP route     │        │         │               │
│      eth0 192.168.1.10  │        │      eth0 192.168.1.11  │
└─────────┼───────────────┘        └──────────┼──────────────┘
          │  BGP: "10.0.1.0/24 via 192.168.1.10"              │
          └──────────────────────────────────────────────────┘
          Physical L2 network / upstream router peering BGP

With Calico in BGP mode, packets from Pod A to Pod C travel without encapsulation: Pod A sends to its default route, the kernel on Node 1 routes based on BGP-learned routes that 10.0.2.0/24 is reachable via 192.168.1.11, and the packet goes directly over the physical network.
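
The forwarding decision on Node 1 is an ordinary longest-prefix match. The sketch below models it with the routes from the diagram; the default gateway 192.168.1.1 and the next-hop strings are assumptions added for illustration.

```python
import ipaddress

# Routing table as Node 1 might hold it (illustrative).
ROUTES = [
    ("10.0.1.0/24", "local (cali* veth)"),
    ("10.0.2.0/24", "via 192.168.1.11 dev eth0"),  # learned over BGP
    ("0.0.0.0/0", "via 192.168.1.1 dev eth0"),     # hypothetical default gateway
]

def lookup(dst):
    """Return the next hop of the most specific route containing dst."""
    addr = ipaddress.ip_address(dst)
    matches = [(ipaddress.ip_network(prefix), hop)
               for prefix, hop in ROUTES
               if addr in ipaddress.ip_network(prefix)]
    _, hop = max(matches, key=lambda m: m[0].prefixlen)
    return hop

print(lookup("10.0.2.2"))  # via 192.168.1.11 dev eth0
print(lookup("10.0.1.5"))  # local (cali* veth)
```

No tunnel device appears anywhere in the table, which is the point of the BGP mode: the pod packet leaves eth0 with its original source and destination addresses.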

Production Examples

Debugging a CrashLoopBackOff caused by DNS:

# Find the node the pod is running on
kubectl get pod mypod -o json | jq '.status.hostIP'
# Test resolution from an ephemeral debug container in the pod's network namespace
kubectl debug -it mypod --image=busybox -- nslookup my-service

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# Check /etc/resolv.conf inside pod
kubectl exec mypod -- cat /etc/resolv.conf

Calico network policy debugging:

# Check if a policy is blocking traffic
calicoctl get networkpolicy -n myns -o yaml

# Calico flow logs (if enabled)
kubectl logs -n calico-system calico-node-<node> | grep "denied"

Cilium flow visibility:

# Stream all flows in namespace (requires the Hubble CLI and access to Hubble Relay)
hubble observe --namespace myns --follow

# Check if policy is dropping packets (run inside a Cilium agent pod)
cilium monitor --type drop

Debugging Notes

Pod can't reach service: Check kube-proxy (or Cilium) rules. iptables -t nat -L KUBE-SERVICES -n | grep <service-ip> or cilium service list. The -n flag keeps addresses numeric so the grep can match.
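DNS timeout but eventual success: Usually the ndots setting. With the Kubernetes default of ndots:5, any name containing fewer than five dots is first tried with each search domain appended and only then as written, so an external name such as api.example.com costs several failed lookups first. Fix: lower ndots through the pod's dnsConfig.options (name: ndots, value: "2"), or use fully qualified names with a trailing dot.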
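VXLAN MTU issues: VXLAN adds 50 bytes of overhead over an IPv4 underlay. Container MTU must be at most host_MTU - 50. Misconfigured MTU causes silent packet drops for large packets. Check with ping -M do -s <MTU - 28> <destination> (the 28 bytes are the IPv4 and ICMP headers), for example -s 1422 for a 1450-byte MTU.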
iptables rules not applying: Verify kube-proxy is running and healthy. Check iptables -t nat -L | wc -l — in large clusters with thousands of services, iptables rule count can be hundreds of thousands.
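
The ndots behavior in the DNS note is easier to see when the query order is spelled out. The sketch below follows the glibc-style rule (search list first when the name has fewer dots than ndots); the function name is illustrative, and the search list is the one typically found in a pod in the default namespace.

```python
def candidate_names(name, search, ndots=5):
    """Return the query order a glibc-style stub resolver would try."""
    if name.endswith("."):
        return [name]                    # fully qualified: no search list
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + expanded     # enough dots: try as written first
    return expanded + [absolute]         # too few dots: search list first

SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for query in candidate_names("api.example.com", SEARCH):
    print(query)
```

With ndots=5 the external name is only tried as written on the fourth attempt; with ndots=2 it would be tried first.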

Security Implications

Container network namespace = isolation boundary: Containers cannot see each other's traffic by default (unlike VMs with promiscuous mode bridges). But on the same bridge, ARP spoofing is possible — mitigated by bridge ARP filtering.
NetworkPolicy needs an explicit default-deny: Kubernetes NetworkPolicy is additive (allow rules only). Without any policy, all pods can communicate; once a policy selects a pod, only the flows allowed by some policy are permitted. Apply a default-deny policy in every namespace and explicitly allow required flows.
iptables vs eBPF: iptables rules have ordering issues and can be bypassed by raw socket operations. eBPF-based policy (Cilium) is enforced at the socket and packet level, harder to bypass.
CNI plugin privilege: CNI plugins run as root with access to host network namespaces. Compromise of the CNI plugin or its configuration can affect all container networking on the node.
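
The additive semantics of NetworkPolicy can be captured in a short evaluator. This is a simplified model for ingress within one namespace: the ingress_from field is a made-up stand-in for the real API's ingress rules, and ports, namespaces, and IP blocks are ignored.

```python
def selects(selector, labels):
    """An empty selector matches every pod, as with Kubernetes label selectors."""
    return all(labels.get(k) == v for k, v in selector.items())

def ingress_allowed(policies, src_labels, dst_labels):
    """Evaluate additive ingress policy for one namespace.

    Each policy is {"podSelector": {...}, "ingress_from": [selector, ...]}.
    """
    applicable = [p for p in policies if selects(p["podSelector"], dst_labels)]
    if not applicable:
        return True  # pod is not isolated: everything is allowed
    # Isolated pod: traffic passes if any rule of any applicable policy allows it.
    return any(selects(peer, src_labels)
               for p in applicable for peer in p["ingress_from"])

default_deny = {"podSelector": {}, "ingress_from": []}
allow_frontend = {"podSelector": {"app": "backend"},
                  "ingress_from": [{"app": "frontend"}]}
policies = [default_deny, allow_frontend]

print(ingress_allowed([], {"app": "other"}, {"app": "backend"}))           # True
print(ingress_allowed(policies, {"app": "frontend"}, {"app": "backend"}))  # True
print(ingress_allowed(policies, {"app": "other"}, {"app": "backend"}))     # False
```

There is no deny rule anywhere: the default-deny policy works only by selecting every pod while allowing nothing.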

Performance Implications

VXLAN overhead: 50-byte header + UDP processing adds ~5-15% throughput reduction and 100-300ns latency on typical hardware. With hardware VXLAN offload (VTEP in NIC), overhead is near zero.
iptables at scale: In iptables mode, the first packet of each connection to a Kubernetes service is matched against the service rules sequentially (later packets of the connection are handled by conntrack). At 10,000 services, this is measurable. IPVS mode (kube-proxy --proxy-mode=ipvs) uses kernel hash tables instead, O(1) lookup.
Cilium eBPF: Replaces iptables with BPF maps, which removes per-packet iptables traversal, reduces conntrack overhead, and enables XDP-level processing for extreme performance.
DNS latency: In high-throughput microservices, DNS resolution latency compounds. NodeLocal DNSCache reduces p99 DNS latency from ~5ms to ~0.1ms by caching on-node.
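
The 50-byte VXLAN figure is the sum of the added headers over an IPv4 underlay, and the same arithmetic gives the overlay MTU and the largest ping payload that fits it. A small worked sketch, with illustrative function names:

```python
# Outer IPv4 (20) + outer UDP (8) + VXLAN (8) + inner Ethernet (14)
VXLAN_OVERHEAD_IPV4 = 20 + 8 + 8 + 14

def overlay_mtu(underlay_mtu, overhead=VXLAN_OVERHEAD_IPV4):
    """Largest inner IP packet that fits in one underlay packet."""
    return underlay_mtu - overhead

def ping_payload(mtu):
    """Largest `ping -s` value that fits in one IPv4 packet of size mtu."""
    return mtu - 20 - 8  # IPv4 header + ICMP echo header

mtu = overlay_mtu(1500)
print(VXLAN_OVERHEAD_IPV4, mtu, ping_payload(mtu))  # 50 1450 1422
```

An IPv6 underlay has a 40-byte outer IP header, so the overhead argument would be 70 in that case.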

Failure Modes

Failure	Symptom	Cause
veth pair missing	Container has no network	Runtime CNI call failed; check CNI plugin logs
VXLAN FDB stale	Cross-node pod comm drops	VTEP entry not updated after node restart; often self-heals
iptables corruption	Services unreachable	kube-proxy crash left partial rules; restart kube-proxy
MTU mismatch	Large packet drops, TCP hangs	VXLAN MTU not configured; check interface MTU end-to-end
IP address exhaustion	Pods stuck in ContainerCreating	IPAM subnet full; check host-local lease files or Whereabouts CRD
CoreDNS crash	All DNS fails, pods can't connect	CoreDNS pod OOM killed; check memory limits and query rate

Modern Usage

Service mesh integration: Istio, Linkerd sidecars intercept container traffic via iptables redirect rules (or Cilium native redirect) to the sidecar proxy. Understanding CNI is required to debug service mesh traffic interception.
eBPF-native mesh: Cilium Service Mesh removes per-pod sidecars: eBPF handles L3/L4 forwarding and policy in the kernel, and L7 policy is enforced by a shared per-node Envoy proxy.
IPv6 and dual-stack: Kubernetes supports dual-stack (IPv4 + IPv6) pods; CNI plugins increasingly support this.
NetworkPolicy v2 proposals: The current NetworkPolicy API is limited. SIG Network is working on richer NetworkPolicy APIs (AdminNetworkPolicy, policy for egress to external CIDRs).

Future Directions

eBPF-universal networking: As eBPF matures, more of the networking stack moves to BPF programs — from CNI plugins to service mesh data planes to observability
QUIC/HTTP3 awareness: Container network infrastructure needs to handle UDP-based protocols differently from TCP — stateful load balancing for QUIC streams
Multi-cluster networking: Service discovery and routing across Kubernetes clusters (Submariner, Cilium ClusterMesh) becoming standard
WireGuard encryption: Calico and Cilium both support WireGuard for transparent pod-to-pod encryption, as a simpler alternative to IPsec

Exercises

Create two network namespaces manually (ip netns add) and connect them with a veth pair. Configure IPs and verify ping works. Then delete one namespace and verify the veth disappears.
Inspect a running Docker container's iptables rules on the host. Trace the path of a packet from the container to the external internet through NAT.
Install Flannel on a 2-node Kubernetes cluster. Use tcpdump on the flannel.1 interface on one node while pinging a pod on the other node. Capture and decode the VXLAN-encapsulated packet.
Write a minimal CNI plugin in bash that assigns a fixed IP address and creates a veth pair. Test it with CNI_COMMAND=ADD CNI_NETNS=/run/netns/test CNI_IFNAME=eth0 ./my-cni < config.json.
In a cluster with Cilium, create a default-deny NetworkPolicy in a namespace. Verify pods can no longer communicate. Add a specific allow policy and verify communication is restored. Use hubble observe to see the policy decisions.
Benchmark Flannel VXLAN against Calico BGP mode on the same hardware using qperf or iperf3. Measure both throughput and latency.

References

CNI specification: github.com/containernetworking/cni
CNI plugins: github.com/containernetworking/plugins
Flannel: github.com/flannel-io/flannel
Calico documentation: docs.tigera.io
Cilium documentation: docs.cilium.io
VXLAN RFC 7348
Kubernetes networking model: kubernetes.io/docs/concepts/cluster-administration/networking/
Linux ip(8) man page, bridge(8) man page
Brendan Gregg, Linux networking performance tools