Container Networking
Technical Overview
Container networking is the set of mechanisms that give containers network connectivity — both between containers on the same host and across hosts — while maintaining appropriate isolation boundaries. It sits at the intersection of Linux kernel networking (network namespaces, virtual devices, iptables/nftables, eBPF), IP networking concepts (routing, NAT, overlay protocols), and container ecosystem standards (CNI).
The core challenge in container networking is bridging network namespaces: a new container starts with an isolated network namespace containing only a loopback interface. The runtime must connect this isolated namespace to the outside world. The mechanism used — from simple veth pairs with iptables NAT, to overlay VXLAN tunnels, to BGP routing, to eBPF-native forwarding — determines the performance, operational complexity, and policy capabilities of the resulting network.
Prerequisites
- Linux network namespaces (section 01)
- IP networking fundamentals (routing, NAT, ARP)
- Linux network devices: bridge, veth, tun/tap, VXLAN
- iptables/nftables basics
- eBPF basics (for Cilium section)
Historical Context
Docker's original networking model (2013) used a simple Linux bridge (docker0) with iptables MASQUERADE for NAT. This worked well for single-host development but scaled poorly for production multi-host clusters.
As Kubernetes emerged (2014-2015), the network model became more complex: pods needed IP addresses routable across the cluster, pod-to-pod communication without NAT was required, and network policy (firewall rules at the pod level) was needed. This drove the creation of the CNI specification by CoreOS and others in 2015, later adopted by Kubernetes.
The evolution from overlay-based solutions (Flannel VXLAN) to BGP-native routing (Calico) to eBPF-native implementations (Cilium) mirrors the broader industry trend toward programmable data planes.
Docker Bridge Networking
The default Docker network mode creates a Linux bridge device (docker0) on the host and connects each container to it via a virtual ethernet pair.
Setup Walkthrough
When Docker starts:
ip link add docker0 type bridge
ip addr add 172.17.0.1/16 dev docker0
ip link set docker0 up
When a container starts:
# Create a veth pair, placing the container end (eth0) directly in the
# container's network namespace so it cannot clash with the host's own eth0
ip link add veth0 type veth peer name eth0 netns <container_ns>
# Connect host end to docker0 bridge
ip link set veth0 master docker0
ip link set veth0 up
# Inside container namespace:
ip addr add 172.17.0.2/16 dev eth0
ip link set eth0 up
ip route add default via 172.17.0.1
NAT for Outbound Traffic
iptables rules (inserted by Docker):
# MASQUERADE (SNAT) packets leaving the host on non-docker interfaces
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
This rewrites the source IP from the container's 172.17.x.x to the host's external IP before the packet leaves the host.
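The effect of the MASQUERADE rule can be modelled in a few lines of Python. This is an illustrative sketch only, not how netfilter is implemented: the class name `Masquerade`, the starting port 40000, and the remote address are assumptions, and the real kernel tracks flows in conntrack and usually keeps the original source port when it is free. The host address 10.0.0.1 is taken from the topology diagram.

```python
import ipaddress
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    sport: int
    dst: str
    dport: int


class Masquerade:
    """Toy model of source NAT for packets leaving the container subnet."""

    def __init__(self, host_ip: str, container_subnet: str):
        self.host_ip = host_ip
        self.subnet = ipaddress.ip_network(container_subnet)
        self.by_flow = {}   # original (src, sport, dst, dport) -> host port
        self.by_reply = {}  # (host port, remote ip, remote port) -> (container ip, port)
        self.next_port = 40000  # hypothetical start of the NAT port range

    def outbound(self, pkt: Packet) -> Packet:
        if ipaddress.ip_address(pkt.src) not in self.subnet:
            return pkt  # rule does not match: source is not a container
        flow = (pkt.src, pkt.sport, pkt.dst, pkt.dport)
        if flow not in self.by_flow:
            self.by_flow[flow] = self.next_port
            self.by_reply[(self.next_port, pkt.dst, pkt.dport)] = (pkt.src, pkt.sport)
            self.next_port += 1
        return replace(pkt, src=self.host_ip, sport=self.by_flow[flow])

    def inbound(self, pkt: Packet) -> Packet:
        original = self.by_reply.get((pkt.dport, pkt.src, pkt.sport))
        if original is None:
            return pkt  # not a reply to a tracked flow
        return replace(pkt, dst=original[0], dport=original[1])


nat = Masquerade(host_ip="10.0.0.1", container_subnet="172.17.0.0/16")
out = nat.outbound(Packet("172.17.0.2", 51000, "203.0.113.10", 443))
reply = nat.inbound(Packet("203.0.113.10", 443, out.src, out.sport))
print(out)
print(reply)
```

The reply is translated back to the container address only because the outbound packet created a mapping first, which is why unsolicited inbound traffic does not reach containers without an explicit port-publishing rule.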
Topology Diagram
Host
┌──────────────────────────────────────────────────────────────┐
│ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Container A │ │ Container B │ │
│ │ eth0 │ │ eth0 │ │
│ │ 172.17.0.2/16 │ │ 172.17.0.3/16 │ │
│ └────────┬────────┘ └────────┬────────┘ │
│ │ (veth pair) │ (veth pair) │
│ veth0a veth0b │
│ │ │ │
│ ┌─────▼────────────────────────────▼─────┐ │
│ │ docker0 bridge │ │
│ │ 172.17.0.1/16 │ │
│ └──────────────────┬────────────────────┘ │
│ │ routing + iptables MASQUERADE │
│ eth0 (host interface) │
│ 10.0.0.1/24 │
└──────────────────────────────────────────────────────────────┘
│
External network
Host Network Mode
With docker run --network=host, the container shares the host's network namespace entirely:
- No veth pair, no bridge, no NAT
- Container sees all host network interfaces directly
- ss -tlnp inside the container shows all host socket listeners
- Ports bound inside the container are bound on the host's IP
This mode is used for maximum performance (no NAT overhead, no veth copy) and for containers that need to operate on the host's network stack (network monitoring tools, CNI plugins themselves).
Security implication: there is no network isolation. A compromised container can sniff all host network traffic.
Overlay Networks for Multi-Host
Single-host bridge networking does not work across hosts because 172.17.0.0/16 addresses are not routable on the external network. Overlay networks tunnel container traffic through the host's external network.
VXLAN Encapsulation
VXLAN (Virtual Extensible LAN, RFC 7348) encapsulates Layer 2 frames in UDP packets:
Outer IP header (host-to-host routing)
Outer UDP header (port 4789)
VXLAN header (24-bit VNI — virtual network identifier)
Inner Ethernet frame (container-to-container)
Each host has a VXLAN Tunnel Endpoint (VTEP) — a virtual network device that encapsulates outbound container traffic and decapsulates incoming VXLAN packets.
Host A Host B
┌──────────────────────┐ VXLAN tunnel ┌──────────────────────┐
│ Container 10.1.0.2 │ ──────────────────→ │ Container 10.1.0.3 │
│ │ │ UDP port 4789 │ │ │
│ bridge │ encapsulated in │ bridge │
│ │ │ host IPs │ │ │
│ VTEP flannel.1 │ │ VTEP flannel.1 │
│ 192.168.1.10 │ │ 192.168.1.11 │
└──────────────────────┘ └──────────────────────┘
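To make the header layout concrete, the following Python sketch packs and unpacks the 8-byte VXLAN header defined in RFC 7348 and adds up the encapsulation overhead for an IPv4 outer header. The function names and the example VNI are illustrative assumptions; a real VTEP does this in the kernel or in NIC hardware.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags (8 bits) + 24 reserved bits
    # Word 2: VNI (24 bits) + 8 reserved bits
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def unpack_vni(header: bytes) -> int:
    """Extract the VNI, rejecting headers without the VNI-valid flag."""
    word1, word2 = struct.unpack("!II", header[:8])
    if not (word1 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return word2 >> 8


# Bytes taken out of the host MTU by encapsulation (IPv4 outer header, no VLAN tag)
OUTER_IPV4, OUTER_UDP, VXLAN_HEADER, INNER_ETHERNET = 20, 8, 8, 14
overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET

header = pack_vxlan_header(1)  # VNI 1 is an arbitrary example value
print(header.hex(), unpack_vni(header), overhead)
```

With a 1500-byte host MTU, the 50 bytes of overhead leave 1450 bytes for the container interface.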
Docker Swarm Overlay
Docker Swarm uses VXLAN overlays with a distributed state store (Raft-based) to maintain ARP tables and VTEP MAC-to-host mappings across the cluster.
CNI Specification
The Container Network Interface is a specification and a set of reference libraries/plugins for configuring network interfaces in Linux containers. The Kubernetes kubelet (via containerd/CRI-O) calls CNI plugins to set up container networking.
CNI Contract
The runtime sets environment variables and passes a JSON config via stdin to the plugin binary. The plugin configures the network and returns a JSON result via stdout.
Environment variables:
CNI_COMMAND=ADD|DEL|CHECK|VERSION
CNI_CONTAINERID=<container id>
CNI_NETNS=/run/netns/<id>
CNI_IFNAME=eth0
CNI_PATH=/opt/cni/bin
Stdin: JSON config
{
"cniVersion": "1.0.0",
"name": "my-network",
"type": "bridge",
"bridge": "cni0",
"ipam": {
"type": "host-local",
"subnet": "10.22.0.0/16"
}
}
Stdout (on ADD): JSON result with assigned IP, routes
{
"cniVersion": "1.0.0",
"ips": [{"address": "10.22.0.5/16", "gateway": "10.22.0.1"}],
"routes": [{"dst": "0.0.0.0/0"}]
}
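The plugin side of this contract can be sketched as a single function that maps the environment and stdin to an exit code and stdout. This is a minimal illustration, not a working plugin: it returns a fixed, hypothetical address and does not create any interface. The names `handle`, `FIXED_ADDRESS`, and `FIXED_GATEWAY` are assumptions made for the example.

```python
import json

SUPPORTED_VERSIONS = ["0.4.0", "1.0.0"]
FIXED_ADDRESS = "10.22.0.5/16"  # hypothetical: a real plugin would ask its IPAM plugin
FIXED_GATEWAY = "10.22.0.1"


def handle(env: dict, stdin_text: str):
    """Return (exit_code, stdout_text) for one CNI invocation."""
    command = env.get("CNI_COMMAND", "")
    if command == "VERSION":
        return 0, json.dumps({"cniVersion": "1.0.0",
                              "supportedVersions": SUPPORTED_VERSIONS})
    if command not in ("ADD", "DEL", "CHECK"):
        # Error code 4 in the CNI spec: invalid necessary environment variables
        return 1, json.dumps({"cniVersion": "1.0.0", "code": 4,
                              "msg": f"unsupported CNI_COMMAND {command!r}"})
    config = json.loads(stdin_text)
    if command == "ADD":
        # A real plugin would create env["CNI_IFNAME"] inside env["CNI_NETNS"] here.
        result = {
            "cniVersion": config["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": FIXED_ADDRESS, "gateway": FIXED_GATEWAY, "interface": 0}],
            "routes": [{"dst": "0.0.0.0/0"}],
        }
        return 0, json.dumps(result)
    return 0, ""  # DEL and CHECK succeed silently in this sketch


sample_env = {"CNI_COMMAND": "ADD", "CNI_CONTAINERID": "abc123",
              "CNI_NETNS": "/run/netns/test", "CNI_IFNAME": "eth0"}
sample_config = '{"cniVersion": "1.0.0", "name": "my-network", "type": "demo"}'
exit_code, stdout_text = handle(sample_env, sample_config)
print(exit_code, stdout_text)
```

An executable plugin would call `handle(dict(os.environ), sys.stdin.read())`, write the returned text to stdout, and exit with the returned code.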
CNI Call Flow Diagram
kubelet (container created, network namespace ready)
│
│ exec: /opt/cni/bin/calico with CNI_COMMAND=ADD
│ stdin: JSON network config
▼
CNI plugin binary
│
│ 1. Parse config
│ 2. Call IPAM plugin to allocate IP
│ 3. Create/configure network device in container netns
│ 4. Add routes
│ 5. Program any host-side policy (iptables/eBPF)
│
│ stdout: JSON result (IP, routes, DNS)
▼
kubelet receives IP, updates pod status
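CNI is called synchronously by the container runtime. If the CNI plugin hangs, container creation hangs. CNI errors do not affect scheduling (the pod is already bound to a node); they surface as pod sandbox creation failures, with the pod stuck in ContainerCreating and FailedCreatePodSandBox events in kubectl describe pod.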
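The runtime side of the call flow can be sketched as follows. This is an assumption-laden illustration of the exec-based contract, not the code used by containerd or CRI-O: the function name `invoke_plugin` and the 10-second default timeout are made up for the example. The timeout shows one way a caller can bound the damage of a hung plugin.

```python
import json
import os
import subprocess


def invoke_plugin(plugin_path, command, container_id, netns, ifname, config,
                  cni_path="/opt/cni/bin", timeout_s=10.0):
    """Exec a CNI plugin binary and return its parsed JSON result.

    Raises subprocess.TimeoutExpired if the plugin does not finish in time,
    and RuntimeError if it exits with a non-zero status.
    """
    env = dict(os.environ,
               CNI_COMMAND=command,
               CNI_CONTAINERID=container_id,
               CNI_NETNS=netns,
               CNI_IFNAME=ifname,
               CNI_PATH=cni_path)
    proc = subprocess.run([plugin_path],
                          input=json.dumps(config).encode(),
                          env=env,
                          capture_output=True,
                          timeout=timeout_s)
    if proc.returncode != 0:
        # Plugins report errors as JSON on stdout; fall back to stderr if empty
        detail = proc.stdout.decode() or proc.stderr.decode()
        raise RuntimeError(f"CNI {command} failed: {detail}")
    return json.loads(proc.stdout) if proc.stdout.strip() else {}
```

A call such as `invoke_plugin("/opt/cni/bin/bridge", "ADD", "abc123", "/run/netns/test", "eth0", config)` would need root and an existing namespace, so it is not run here.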
IPAM Plugins
host-local: Allocates IPs from a subnet stored in files on the local host at /var/lib/cni/networks/<network-name>/. No coordination across hosts — each host manages its own subnet (requires per-host CIDR allocation at a higher level).
Whereabouts: Cluster-wide IPAM that uses etcd or Kubernetes CRDs to track IP allocation across all nodes. Ensures no IP is double-allocated across the cluster.
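The file-per-address idea behind host-local can be sketched as below. This is a simplified model under stated assumptions: the real plugin also uses a lock file and resumes after the last reserved address, and the functions `allocate` and `release`, the reserved gateway address, and the /29 subnet are illustrative choices. A temporary directory stands in for /var/lib/cni/networks/<network-name>/.

```python
import ipaddress
import os
import tempfile


def allocate(state_dir: str, subnet: str, container_id: str) -> str:
    """Claim the first free address by atomically creating a file named after it."""
    hosts = iter(ipaddress.ip_network(subnet).hosts())
    next(hosts)  # assumption: the first usable address is reserved for the gateway
    for ip in hosts:
        path = os.path.join(state_dir, str(ip))
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # address already leased
        with os.fdopen(fd, "w") as lease:
            lease.write(container_id)
        return str(ip)
    raise RuntimeError("subnet exhausted")


def release(state_dir: str, container_id: str) -> None:
    """Remove every lease file owned by the given container."""
    for name in os.listdir(state_dir):
        path = os.path.join(state_dir, name)
        with open(path) as lease:
            owner = lease.read()
        if owner == container_id:
            os.remove(path)


with tempfile.TemporaryDirectory() as state_dir:
    first = allocate(state_dir, "10.22.0.0/29", "container-a")
    second = allocate(state_dir, "10.22.0.0/29", "container-b")
    release(state_dir, "container-a")
    reused = allocate(state_dir, "10.22.0.0/29", "container-c")
    print(first, second, reused)
```

Because the state lives in local files, two hosts given the same subnet would hand out the same addresses, which is why host-local relies on a distinct per-host CIDR.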
CNI Plugin Comparison
Flannel
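- Approach: VXLAN overlay (default) or host-gw (plain routes via the peer host's IP, with no encapsulation; requires all hosts to share an L2 segment)
- Installation: DaemonSet; configures VXLAN VTEP and FDB (forwarding database) entries
- Routing: In VXLAN mode, all cross-node pod-to-pod traffic goes through the VXLAN tunnel; same-node traffic stays on the local bridge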
- Network Policy: None — Flannel handles connectivity only; use Calico or Cilium for policy
- Performance: Lower than BGP/native; VXLAN encapsulation adds ~100-300ns overhead
- Use case: Simple clusters where operational simplicity matters more than performance
Calico
- Approach: BGP routing — publishes pod CIDR routes via BGP to the host network
- No overlay required: If hosts are on the same L2 network, or if BGP peering reaches the upstream router, pods can communicate without encapsulation
- Dataplane options: iptables (legacy), eBPF (since Calico 3.13)
- Network Policy: Full Kubernetes NetworkPolicy + Calico GlobalNetworkPolicy (cluster-wide, namespace-spanning)
- Performance: Native routing approaches line rate; eBPF dataplane removes iptables traversal
- Use case: Production clusters where performance and policy matter; especially on-premises or public cloud with BGP support
Cilium
- Approach: eBPF-native — replaces iptables, kube-proxy, and traditional CNI plumbing with eBPF programs
- Identity-based security: Pods get a numeric identity based on labels; policy rules reference identities, not IPs (IPs are ephemeral in Kubernetes)
- L7 visibility: Can enforce HTTP, gRPC, Kafka policy without a sidecar proxy
- kube-proxy replacement: Cilium's eBPF map-based load balancing replaces kube-proxy's iptables rules, with better performance and per-connection load balancing
- Hubble observability: Built-in flow visibility (like a network traffic recorder) at the eBPF layer
- Use case: Large-scale Kubernetes clusters, security-sensitive environments, microservices requiring L7 policy
DNS in Containers
Docker Embedded DNS
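Docker runs an embedded DNS server (listening on 127.0.0.11) within the network namespace of each container attached to a user-defined network. Container DNS queries for container and service names on those networks are answered by this server. External queries are forwarded to the host's configured DNS. Containers on the default bridge network do not use the embedded server and receive a copy of the host's resolver configuration instead.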
Kubernetes CoreDNS
Kubernetes runs CoreDNS as a cluster-level DNS server:
- ClusterIP: 10.96.0.10 (default)
- Configured in /etc/resolv.conf inside every pod: nameserver 10.96.0.10
- Service A record: <service>.<namespace>.svc.cluster.local → ClusterIP
- Pod A record: <pod-ip-dashes>.<namespace>.pod.cluster.local
- External names: forwarded to upstream resolvers
CoreDNS plugins handle custom zones, rewrites, caching, and forwarding. In large clusters, DNS is a scaling bottleneck — NodeLocal DNSCache (a DaemonSet caching DNS agent) addresses this by caching responses on each node.
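The naming scheme above is mechanical enough to express as code. The sketch below builds the Service and Pod names and the resolver configuration a pod would typically see; the function names are illustrative, the defaults are the ones quoted in this section, and a real pod's search list may also include domains inherited from the node.

```python
import ipaddress


def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Name of the A record that resolves to a Service's ClusterIP."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def pod_fqdn(pod_ip, namespace, cluster_domain="cluster.local"):
    """Name of a Pod A record: the IPv4 address with dots replaced by dashes."""
    ip = ipaddress.IPv4Address(pod_ip)  # raises ValueError on a malformed address
    return f"{str(ip).replace('.', '-')}.{namespace}.pod.{cluster_domain}"


def pod_resolv_conf(namespace, dns_ip="10.96.0.10", cluster_domain="cluster.local"):
    """Approximate /etc/resolv.conf contents for a pod in the given namespace."""
    search = [f"{namespace}.svc.{cluster_domain}", f"svc.{cluster_domain}", cluster_domain]
    return f"search {' '.join(search)}\nnameserver {dns_ip}\noptions ndots:5\n"


print(service_fqdn("my-service", "default"))
print(pod_fqdn("10.0.1.2", "default"))
print(pod_resolv_conf("default"))
```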
Container Network Debugging
Tools Inside the Container
# Check interface configuration
ip addr
ip route
# DNS resolution
nslookup my-service
dig my-service.default.svc.cluster.local
# Check open connections
ss -tlnp
# Test connectivity
curl -v http://other-service:8080/health
Tools from the Host
# Find container PID
docker inspect --format '{{.State.Pid}}' mycontainer
# or for Kubernetes:
crictl inspect <container-id> | jq '.info.pid'
# Enter container network namespace for debugging
nsenter --target <PID> --net -- ip addr
nsenter --target <PID> --net -- ss -tlnp
# tcpdump on the container's veth (from host side)
# Find the host-side peer: inside the container namespace, eth0 is shown as
# eth0@ifN, where N is the interface index of the peer on the host
nsenter --target <PID> --net -- ip -o link show eth0
# Then find the interface with index N on the host
ip -o link | grep "^<N>:"
# Now tcpdump on it
tcpdump -i veth<X> -nn
# tcpdump inside the container namespace
nsenter --target <PID> --net -- tcpdump -i eth0 -nn
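The host-side lookup can also be scripted. Given the peer index (the N in eth0@ifN as reported by ip link inside the container namespace), the sketch below scans a sysfs-style directory for the interface whose ifindex matches. The function name `find_host_peer` and the fake directory tree used for the demonstration are assumptions; on a real host the default path /sys/class/net would be used.

```python
import os
import tempfile


def find_host_peer(peer_ifindex: int, sys_class_net: str = "/sys/class/net"):
    """Return the name of the interface whose ifindex equals peer_ifindex, or None."""
    for name in sorted(os.listdir(sys_class_net)):
        try:
            with open(os.path.join(sys_class_net, name, "ifindex")) as f:
                if int(f.read().strip()) == peer_ifindex:
                    return name
        except (OSError, ValueError):
            continue  # not an interface directory, or unreadable
    return None


# Demonstration against a fake tree so the example runs without a container
with tempfile.TemporaryDirectory() as fake_sys:
    for name, index in {"lo": 1, "eth0": 2, "veth1a2b3c": 7}.items():
        os.makedirs(os.path.join(fake_sys, name))
        with open(os.path.join(fake_sys, name, "ifindex"), "w") as f:
            f.write(f"{index}\n")
    peer = find_host_peer(7, fake_sys)
    missing = find_host_peer(99, fake_sys)
    print(peer, missing)
```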
Network Diagram (Kubernetes Multi-Node with Calico BGP)
Node 1 (192.168.1.10) Node 2 (192.168.1.11)
┌─────────────────────────┐ ┌─────────────────────────┐
│ Pod A: 10.0.1.2 │ │ Pod C: 10.0.2.2 │
│ eth0 ← veth pair │ │ eth0 ← veth pair │
│ │ │ │ │ │
│ cali<x> (host veth) │ │ cali<y> (host veth) │
│ │ │ │ │ │
│ ┌─────▼────────────┐ │ │ ┌─────▼────────────┐ │
│ │ kernel routing │ │ │ │ kernel routing │ │
│ │ 10.0.1.0/24 │ │ │ │ 10.0.2.0/24 │ │
│ │ via local │ │ │ │ via local │ │
│ └─────┬────────────┘ │ │ └─────┬────────────┘ │
│ │ BGP route │ │ │ │
│ eth0 192.168.1.10 │ │ eth0 192.168.1.11 │
└─────────┼───────────────┘ └──────────┼──────────────┘
│ BGP: "10.0.1.0/24 via 192.168.1.10" │
└──────────────────────────────────────────────────┘
Physical L2 network / upstream router peering BGP
With Calico in BGP mode, packets from Pod A to Pod C travel without encapsulation: Pod A sends to its default route, the kernel on Node 1 routes based on BGP-learned routes that 10.0.2.0/24 is reachable via 192.168.1.11, and the packet goes directly over the physical network.
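The forwarding decision on Node 1 is an ordinary longest-prefix match. The sketch below models it with a hypothetical routing table built from the addresses in the diagram; the interface name `cali-pod-a` and the default gateway 192.168.1.1 are assumptions, and a real kernel uses a trie rather than a linear scan.

```python
import ipaddress

# Hypothetical routing table for Node 1, using the addresses from the diagram
NODE1_ROUTES = [
    ("10.0.1.2/32", "dev cali-pod-a"),    # local pod, reached over its host-side veth
    ("10.0.2.0/24", "via 192.168.1.11"),  # learned over BGP from Node 2
    ("0.0.0.0/0", "via 192.168.1.1"),     # default gateway (assumed address)
]


def lookup(routes, destination):
    """Return the next hop of the most specific route that contains destination."""
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, next_hop)
    return best[1] if best else None


print(lookup(NODE1_ROUTES, "10.0.2.2"))      # Pod C on Node 2
print(lookup(NODE1_ROUTES, "10.0.1.2"))      # Pod A, local
print(lookup(NODE1_ROUTES, "198.51.100.7"))  # anything else
```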
Production Examples
Debugging a CrashLoopBackOff caused by DNS:
# Find the node the pod is scheduled on
kubectl get pod mypod -o json | jq '.status.hostIP'
# Test resolution from an ephemeral debug container that shares the pod's network namespace
kubectl debug -it mypod --image=busybox -- nslookup my-service
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
# Check /etc/resolv.conf inside pod
kubectl exec mypod -- cat /etc/resolv.conf
Calico network policy debugging:
# Check if a policy is blocking traffic
calicoctl get networkpolicy -n myns -o yaml
# Calico flow logs (if enabled)
kubectl logs -n calico-system calico-node-<node> | grep "denied"
Cilium flow visibility:
# Requires the Hubble CLI, with Hubble enabled in Cilium
# Stream all flows in namespace
hubble observe --namespace myns --follow
# Check if policy is dropping packets
cilium monitor --type drop
Debugging Notes
- Pod can't reach service: Check kube-proxy (or Cilium) rules. iptables -t nat -n -L KUBE-SERVICES | grep <service-ip> or cilium service list.
- DNS timeout but eventual success: ndots setting. The Kubernetes default ndots:5 means any name with fewer than five dots is tried with each search-domain suffix before it is tried as written, so an external name can cost several failed lookups first. Fix: set the ndots option to 2 under dnsConfig.options in the pod spec, or use a fully qualified name with a trailing dot.
- VXLAN MTU issues: VXLAN adds 50 bytes of overhead. Container MTU must be at most host_MTU - 50. Misconfigured MTU causes silent packet drops for large packets. Check with ping -M do -s 1422 <destination> (1422 bytes of payload plus 28 bytes of ICMP and IPv4 headers exactly fills a 1450-byte MTU).
- iptables rules not applying: Verify kube-proxy is running and healthy. Check iptables -t nat -L | wc -l; in large clusters with thousands of services, iptables rule count can be hundreds of thousands.
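The ndots behaviour is easier to see when the lookup order is written out. The sketch below reproduces the usual resolver rule (names with fewer dots than ndots try the search list first); the function name `candidates` and the search list for a pod in the default namespace are assumptions for illustration.

```python
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def candidates(name: str, search, ndots: int = 5):
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already fully qualified: the search list is skipped
    suffixed = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    if name.count(".") >= ndots:
        return [absolute] + suffixed  # enough dots: try the name as written first
    return suffixed + [absolute]      # too few dots: search domains first


for ndots in (5, 2):
    print(ndots, candidates("api.example.com", SEARCH, ndots))
```

With ndots:5, the external name is attempted only after three cluster-suffixed lookups return NXDOMAIN; with ndots:2 it is tried first.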
Security Implications
- Container network namespace = isolation boundary: Containers cannot see each other's traffic by default (unlike VMs with promiscuous mode bridges). But on the same bridge, ARP spoofing is possible — mitigated by bridge ARP filtering.
- NetworkPolicy is allow-list only, so default-deny must be explicit: Kubernetes NetworkPolicy is additive (allow rules only). A pod that no policy selects accepts all traffic. Apply a default-deny policy in every namespace and explicitly allow required flows.
- iptables vs eBPF: iptables rules have ordering issues and can be bypassed by raw socket operations. eBPF-based policy (Cilium) is enforced at the socket and packet level, harder to bypass.
- CNI plugin privilege: CNI plugins run as root with access to host network namespaces. Compromise of the CNI plugin or its configuration can affect all container networking on the node.
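The additive semantics of NetworkPolicy can be captured in a small model. This is a deliberately simplified sketch covering ingress within one namespace, using exact label matching only; the `Policy` class and `ingress_allowed` function are illustrative and not part of any Kubernetes API.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    pod_selector: dict                              # {} selects every pod in the namespace
    allow_from: list = field(default_factory=list)  # label selectors for permitted sources


def matches(selector: dict, labels: dict) -> bool:
    """True when every key/value in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(policies, src_labels: dict, dst_labels: dict) -> bool:
    selecting = [p for p in policies if matches(p.pod_selector, dst_labels)]
    if not selecting:
        return True  # no policy selects the destination: everything is allowed
    # Policies are additive: one matching allow rule in any selecting policy is enough
    return any(matches(rule, src_labels) for p in selecting for rule in p.allow_from)


web, db = {"app": "web"}, {"app": "db"}
default_deny = Policy(pod_selector={})
allow_web_to_db = Policy(pod_selector={"app": "db"}, allow_from=[{"app": "web"}])

print(ingress_allowed([], web, db))
print(ingress_allowed([default_deny], web, db))
print(ingress_allowed([default_deny, allow_web_to_db], web, db))
```

The default-deny policy works only because it selects every pod while allowing nothing; adding further policies can open flows but never close them.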
Performance Implications
- VXLAN overhead: 50-byte header + UDP processing adds ~5-15% throughput reduction and 100-300ns latency on typical hardware. With hardware VXLAN offload (VTEP in NIC), overhead is near zero.
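- iptables at scale: The first packet of each new connection to a Kubernetes service is matched against the service rules linearly (chain by chain); later packets of the same connection are handled by conntrack. At 10,000 services, the cost of connection setup and of rewriting the rule set is measurable. IPVS mode (kube-proxy --proxy-mode=ipvs) uses kernel hash tables instead, giving O(1) lookup.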
- Cilium eBPF: Replaces iptables with BPF maps, reducing conntrack overhead, per-packet iptables traversal, and enabling XDP-level processing for extreme performance.
- DNS latency: In high-throughput microservices, DNS resolution latency compounds. NodeLocal DNSCache reduces p99 DNS latency from ~5ms to ~0.1ms by caching on-node.
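The difference between a linear rule scan and a hash-table lookup can be illustrated with a toy model. This sketch counts rule comparisons rather than measuring kernel behaviour; the service addresses and backend names are generated placeholders, and one rule per service is an assumption.

```python
def linear_lookup(rules, dst):
    """Scan rules in order, like a chain of rules; return (backend, rules examined)."""
    for examined, (service, backend) in enumerate(rules, start=1):
        if service == dst:
            return backend, examined
    return None, len(rules)


# 10,000 placeholder services, one rule each
services = [(f"10.96.{i // 256}.{i % 256}:80", f"backend-{i}") for i in range(10_000)]
table = dict(services)  # hash table keyed by service address, as in IPVS or a BPF map

target = services[-1][0]  # worst case for the linear scan: the last rule
backend, examined = linear_lookup(services, target)
print(examined, table[target] == backend)
```

The scan examines every rule to reach the last service, while the dictionary lookup costs the same regardless of how many services exist.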
Failure Modes
| Failure | Symptom | Cause and first check |
|---|---|---|
| veth pair missing | Container has no network | Runtime CNI call failed; check CNI plugin logs |
| VXLAN FDB stale | Cross-node pod comm drops | VTEP entry not updated after node restart; often self-heals |
| iptables corruption | Services unreachable | kube-proxy crash left partial rules; restart kube-proxy |
| MTU mismatch | Large packet drops, TCP hangs | VXLAN MTU not configured; check interface MTU end-to-end |
| IP address exhaustion | Pods stuck in ContainerCreating | IPAM subnet full; check host-local lease files or Whereabouts CRD |
| CoreDNS crash | All DNS fails, pods can't connect | CoreDNS pod OOM killed; check memory limits and query rate |
Modern Usage
- Service mesh integration: Istio, Linkerd sidecars intercept container traffic via iptables redirect rules (or Cilium native redirect) to the sidecar proxy. Understanding CNI is required to debug service mesh traffic interception.
- eBPF-native mesh: Cilium Service Mesh replaces sidecar proxies with eBPF programs — L7 policy without process-level proxies.
- IPv6 and dual-stack: Kubernetes supports dual-stack (IPv4 + IPv6) pods; CNI plugins increasingly support this.
- NetworkPolicy v2 proposals: The current NetworkPolicy API is limited. SIG Network is working on richer NetworkPolicy APIs (AdminNetworkPolicy, policy for egress to external CIDRs).
Future Directions
- eBPF-universal networking: As eBPF matures, more of the networking stack moves to BPF programs — from CNI plugins to service mesh data planes to observability
- QUIC/HTTP3 awareness: Container network infrastructure needs to handle UDP-based protocols differently from TCP — stateful load balancing for QUIC streams
- Multi-cluster networking: Service discovery and routing across Kubernetes clusters (Submariner, Cilium ClusterMesh) becoming standard
- WireGuard encryption: Calico and Cilium both support WireGuard for transparent pod-to-pod encryption, replacing legacy IPsec overlays
Exercises
- Create two network namespaces manually (ip netns add) and connect them with a veth pair. Configure IPs and verify ping works. Then delete one namespace and verify the veth disappears.
- Inspect a running Docker container's iptables rules on the host. Trace the path of a packet from the container to the external internet through NAT.
- Install Flannel on a 2-node Kubernetes cluster. Use tcpdump on the flannel.1 interface on one node while pinging a pod on the other node to see the inner packets, then capture on the node's physical interface (Flannel's VXLAN backend defaults to UDP port 8472 on Linux) and decode the VXLAN-encapsulated packet.
- Write a minimal CNI plugin in bash that assigns a fixed IP address and creates a veth pair. Test it with CNI_COMMAND=ADD CNI_NETNS=/run/netns/test CNI_IFNAME=eth0 ./my-cni < config.json.
- In a cluster with Cilium, create a default-deny NetworkPolicy in a namespace. Verify pods can no longer communicate. Add a specific allow policy and verify communication is restored. Use hubble observe to see the policy decisions.
- Benchmark the latency difference between Flannel VXLAN and Calico BGP mode on the same hardware using qperf or iperf3. Measure both throughput and latency.
References
- CNI specification: github.com/containernetworking/cni
- CNI plugins: github.com/containernetworking/plugins
- Flannel: github.com/flannel-io/flannel
- Calico documentation: docs.tigera.io
- Cilium documentation: docs.cilium.io
- VXLAN RFC 7348
- Kubernetes networking model: kubernetes.io/docs/concepts/cluster-administration/networking/
- Linux ip(8) man page, bridge(8) man page
- Brendan Gregg, Linux networking performance tools