Pod Networking Deep Dive: CNI, kube-proxy, and eBPF Service Routing

Overview

Pod networking in Kubernetes is a multi-layer stack that translates high-level Service abstractions into concrete packet-forwarding rules. Understanding that stack, from how a pod gets an IP address to how a Service's virtual ClusterIP is translated to a real pod IP via iptables DNAT chains, is essential for diagnosing connectivity issues and reasoning about network performance. It also explains why kube-proxy's iptables implementation struggles at scale and why eBPF-based alternatives such as Cilium are widely adopted for large clusters.

This file traces the complete path from "kubelet starts a pod" to "that pod can receive traffic from a Service," examining every transformation the packet undergoes along the way.

Prerequisites

  • Linux networking: network namespaces, veth pairs, iptables, routing tables
  • TCP/IP fundamentals: IP routing, NAT, connection tracking
  • Kubernetes architecture: kubelet, CRI, kube-proxy, API server
  • Basic understanding of eBPF (Berkeley Packet Filter) concepts
  • Familiarity with kernel netfilter framework

How a Pod Gets an IP Address

When a pod is scheduled to a node, the following sequence occurs:

Scheduler writes pod.spec.nodeName = "node-1"
        │
        ▼
kubelet on node-1 detects pod assigned to it
(via API server watch on its own nodeName)
        │
        ▼
kubelet calls CRI: RunPodSandbox(pod_spec)
  (CRI = Container Runtime Interface: containerd or CRI-O)
        │
        ▼
CRI creates the "pause" container (infra container)
  - The pause container does nothing (just sleeps)
  - Its network namespace is the pod's shared namespace
  - All containers in the pod share this one network namespace
        │
        ▼
CRI calls CNI plugin: ADD(containerID, netns, ifname="eth0", config)
  (CNI = Container Network Interface)
        │
        ▼
CNI plugin executes:

  1. Allocate IP from IPAM (IP Address Management):
     - For Calico/Cilium/Flannel: allocate from a per-node IP pool
     - For AWS VPC CNI: take a secondary IP from the node's pre-allocated
       (warm) pool; the pool is refilled through the EC2 API by assigning
       secondary IPs to the node's ENIs

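  2. Create veth pair (host side veth-abc123, peer under a temporary name):
     ip link add veth-abc123 type veth peer name tmp-abc123
     (the peer cannot be created as eth0 in the host namespace, where that
      name is normally taken by the node's own NIC)

  3. Move the peer into pod's network namespace and rename it to eth0:
     ip link set tmp-abc123 netns <pod_netns>
     nsenter --net=<pod_netns> ip link set tmp-abc123 name eth0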

  4. Assign allocated IP to eth0 (inside pod):
     nsenter --net=<pod_netns> ip addr add 10.0.1.5/24 dev eth0
     nsenter --net=<pod_netns> ip link set eth0 up

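  5. Set up routing in pod namespace:
     nsenter --net=<pod_netns> ip route add 169.254.1.1 dev eth0 scope link
     nsenter --net=<pod_netns> ip route add default via 169.254.1.1 dev eth0
     (gateway is a link-local address answered by the host side of the veth
      through proxy ARP; the on-link route comes first because 169.254.1.1
      lies outside 10.0.1.0/24)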

  6. Configure host side (veth-abc123):
     ip link set veth-abc123 up
     echo 1 > /proc/sys/net/ipv4/conf/veth-abc123/proxy_arp
     ip route add 10.0.1.5/32 dev veth-abc123
     (route: traffic to 10.0.1.5 goes via this specific veth)
        │
        ▼
CRI reports pod IP to kubelet
        │
        ▼
kubelet reports pod IP to API server (pod.status.podIP)
        │
        ▼
EndpointSlice controller adds pod IP to Service endpoints

The result: the pod has an IP address, a default route, and the host has a route to reach the pod via its dedicated veth pair.
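The per-node pool allocation in step 1 can be sketched as follows. This is a toy model under stated assumptions, not the code of any real IPAM plugin: the class name `NodeIPPool`, the `/29` pool, and the container IDs are all hypothetical, and real plugins persist their state on disk or in a datastore.

```python
import ipaddress


class NodeIPPool:
    """Toy host-local IPAM: hands out addresses from one node's pod CIDR."""

    def __init__(self, pod_cidr: str):
        self.network = ipaddress.ip_network(pod_cidr)
        # hosts() skips the network and broadcast addresses.
        self._free = list(self.network.hosts())
        self._by_container = {}

    def add(self, container_id: str) -> str:
        """CNI ADD: allocate (or return the already allocated) address."""
        if container_id in self._by_container:
            return str(self._by_container[container_id])
        if not self._free:
            raise RuntimeError("IP pool exhausted")
        ip = self._free.pop(0)
        self._by_container[container_id] = ip
        return str(ip)

    def delete(self, container_id: str) -> None:
        """CNI DEL: release the address; unknown IDs are ignored."""
        ip = self._by_container.pop(container_id, None)
        if ip is not None:
            self._free.append(ip)


# Hypothetical pool small enough to exhaust by hand.
pool = NodeIPPool("10.0.1.0/29")
first = pool.add("abc123")
second = pool.add("def456")
print(first, second)
```

Making `add` idempotent per container ID mirrors the fact that a runtime may retry an ADD call; a pool that runs dry raises an error, which corresponds to pods failing to get a sandbox when the node's IP range is exhausted.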

Network Namespace and veth Architecture

HOST NETWORK NAMESPACE:
┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  eth0 (node's physical NIC)                                    │
│    └── bridge (cbr0) or direct routing                         │
│         ├── veth-abc123 ──────────────┐                        │
│         ├── veth-def456 ──────────────│────────┐               │
│         └── veth-ghi789 ──────────────│────────│────┐          │
│                                       │        │    │          │
│  routing table:                       │        │    │          │
│    10.0.1.5/32 dev veth-abc123        │        │    │          │
│    10.0.1.6/32 dev veth-def456        │        │    │          │
│    10.0.1.7/32 dev veth-ghi789        │        │    │          │
└───────────────────────────────────────│────────│────│──────────┘
                                        │        │    │
   POD A NETNS              ┌───────────┘        │    │
   ┌──────────────┐         │        POD B NETNS │    │
   │ eth0:        │  veth   │        ┌───────────┘    │
   │ 10.0.1.5/24  ├─────────┘        │ eth0:          │
   │              │                  │ 10.0.1.6/24    │
   └──────────────┘                  └────────────────┘

Each pod has its own isolated network namespace with one (or more, for multi-homed pods) virtual ethernet interface. All containers within the same pod share this single network namespace — they can communicate via localhost and see the same IP address.

The Flat Network Requirement

Kubernetes's networking model has one fundamental requirement: all pods must be able to communicate with each other without NAT. Pod A (IP 10.0.1.5 on node-1) can send a packet to Pod B (IP 10.0.2.8 on node-2), and Pod B receives it with source IP 10.0.1.5 — not NATted to node-1's IP.

This requirement must be satisfied by the CNI plugin. Different CNI plugins satisfy it differently:

  • Flannel (VXLAN mode): encapsulates pod-to-pod packets in VXLAN, delivering them to the destination node, which decapsulates and forwards to the destination pod's veth.
  • Calico (BGP mode): each node advertises its pod CIDR via BGP. Routers in the physical network learn that "to reach 10.0.1.0/24, forward to node-1." Direct routing, no encapsulation.
  • AWS VPC CNI: assigns real VPC IPs to pods. Pod IPs are secondary IPs on node ENIs. No overlay needed — VPC routing table handles inter-node pod routing natively.
  • Cilium (eBPF): uses eBPF programs to implement VXLAN encapsulation or native routing, depending on configuration.
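For the direct-routing case, the forwarding decision reduces to a longest-prefix match from destination pod IP to the node that owns the pod CIDR. The sketch below illustrates that lookup only; the route table contents and the function name `next_hop` are assumptions for illustration, and a real router performs this in its forwarding plane rather than in a dictionary scan.

```python
import ipaddress

# Hypothetical pod CIDR -> node assignments, as a router might learn via BGP.
ROUTES = {
    "10.0.1.0/24": "node-1",
    "10.0.2.0/24": "node-2",
}


def next_hop(dst_ip, routes=ROUTES):
    """Return the node owning dst_ip using longest-prefix match, or None."""
    dst = ipaddress.ip_address(dst_ip)
    best = None
    for cidr, node in routes.items():
        net = ipaddress.ip_network(cidr)
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, node)
    return best[1] if best else None


print(next_hop("10.0.2.8"))  # pod on node-2 in the example above
```

Because the packet is routed rather than NATted, the source address 10.0.1.5 is untouched when it arrives, which is what the flat network requirement demands.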

kube-proxy: Service Networking Implementation

A Kubernetes Service provides a stable virtual IP (ClusterIP) that load balances traffic to a dynamic set of pod IPs. In iptables mode, the ClusterIP is not assigned to any network interface on any machine; it exists only as a match in NAT rules. kube-proxy implements this by installing iptables rules that DNAT (Destination NAT) packets destined for the ClusterIP to one of the actual pod IPs.

iptables Mode Implementation

When a Service is created:

Service: my-service
  ClusterIP: 10.96.1.100
  Port: 80 → Pod port 8080
  Selector: app=my-app
  Endpoints: [10.0.1.5:8080, 10.0.1.6:8080, 10.0.1.7:8080]

kube-proxy installs the following iptables rules:

PREROUTING chain (nat table):
  -A PREROUTING -j KUBE-SERVICES

KUBE-SERVICES chain:
  -A KUBE-SERVICES -d 10.96.1.100/32 -p tcp --dport 80 -j KUBE-SVC-XYZ123

KUBE-SVC-XYZ123 chain (load balancing via probability):
  -A KUBE-SVC-XYZ123 -m statistic --mode random --probability 0.33333 \
     -j KUBE-SEP-POD1    ← 33.3% chance: forward to pod 1
  -A KUBE-SVC-XYZ123 -m statistic --mode random --probability 0.50000 \
     -j KUBE-SEP-POD2    ← 50% of remaining (= 33.3% overall): pod 2
  -A KUBE-SVC-XYZ123 -j KUBE-SEP-POD3    ← remaining 33.3%: pod 3

KUBE-SEP-POD1 chain (endpoint: DNAT to pod 1):
  -A KUBE-SEP-POD1 -s 10.0.1.5/32 -j KUBE-MARK-MASQ  ← hairpin SNAT
  -A KUBE-SEP-POD1 -p tcp -j DNAT --to-destination 10.0.1.5:8080

KUBE-SEP-POD2 chain:
  -A KUBE-SEP-POD2 -s 10.0.1.6/32 -j KUBE-MARK-MASQ
  -A KUBE-SEP-POD2 -p tcp -j DNAT --to-destination 10.0.1.6:8080

KUBE-SEP-POD3 chain:
  -A KUBE-SEP-POD3 -s 10.0.1.7/32 -j KUBE-MARK-MASQ
  -A KUBE-SEP-POD3 -p tcp -j DNAT --to-destination 10.0.1.7:8080

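OUTPUT chain (nat table) also jumps to KUBE-SERVICES:
  -A OUTPUT -j KUBE-SERVICES
  (catches traffic generated in the host network namespace: node processes
   and hostNetwork pods. Traffic from ordinary pods arrives over a veth and
   enters through PREROUTING.)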
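The probabilities in the KUBE-SVC chain look uneven (0.333, 0.5, then unconditional) but yield a uniform split, because each rule is evaluated only if the earlier ones did not match. The following sketch derives the per-rule probabilities and simulates the chain; the function names are hypothetical and the simulation is an illustration, not kube-proxy code.

```python
import random
from collections import Counter

ENDPOINTS = ["10.0.1.5:8080", "10.0.1.6:8080", "10.0.1.7:8080"]


def chain_probabilities(n):
    """Probability on the i-th rule so each of n endpoints gets 1/n overall."""
    # Rule i is reached only if rules 0..i-1 did not match, so it must pick
    # one of the (n - i) endpoints that remain. The last value is 1.0, which
    # corresponds to the final unconditional jump.
    return [1.0 / (n - i) for i in range(n)]


def pick_endpoint(endpoints, rng):
    """Walk the chain top to bottom, as iptables does for a new connection."""
    for endpoint, p in zip(endpoints, chain_probabilities(len(endpoints))):
        if rng.random() < p:
            return endpoint
    return endpoints[-1]  # not reached: the last probability is 1.0


rng = random.Random(42)
counts = Counter(pick_endpoint(ENDPOINTS, rng) for _ in range(30000))
print(chain_probabilities(3))
print(counts)
```

The same construction explains why removing one endpoint changes the probability on every remaining rule in the chain.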

The packet transformation for a client pod accessing my-service:80:

Client pod (10.0.2.3) → Service IP:Port (10.96.1.100:80)
  ↓ PREROUTING → KUBE-SERVICES → KUBE-SVC-XYZ123 → KUBE-SEP-POD2
  ↓ DNAT applied: destination changes from 10.96.1.100:80 → 10.0.1.6:8080
Client pod (10.0.2.3) → Pod IP:Port (10.0.1.6:8080)   [modified]
  ↓ Routed to node hosting 10.0.1.6
Pod B receives from 10.0.2.3:ephemeral_port
  ↓ Response: src=10.0.1.6:8080, dst=10.0.2.3:port
  ↓ Connection tracking reverse-NATes: src becomes 10.96.1.100:80
Client pod receives from 10.96.1.100:80 (service IP preserved)

Conntrack (connection tracking) handles the reverse NAT automatically: once the DNAT decision is made for a connection, the mapping is stored on the node that applied it (here, the client's node), and all subsequent packets of that connection (including responses) are translated consistently.
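A minimal model of that bookkeeping is shown below. It is a sketch under simplifying assumptions: packets are reduced to `(src_ip, src_port, dst_ip, dst_port)` tuples, the class name `Conntrack` is hypothetical, and the real kernel table also tracks protocol, state, and timeouts.

```python
class Conntrack:
    """Toy conntrack table: remembers DNAT decisions per connection."""

    def __init__(self):
        # (client_ip, client_port, backend_ip, backend_port) -> (svc_ip, svc_port)
        self._reverse = {}

    def dnat(self, pkt, service, backend):
        """Rewrite the destination of the first packet and store the mapping."""
        src_ip, src_port, dst_ip, dst_port = pkt
        if (dst_ip, dst_port) != service:
            raise ValueError("packet is not addressed to the service")
        self._reverse[(src_ip, src_port, *backend)] = service
        return (src_ip, src_port, *backend)

    def reverse_nat(self, reply):
        """Rewrite the source of a reply back to the service address."""
        src_ip, src_port, dst_ip, dst_port = reply
        service = self._reverse.get((dst_ip, dst_port, src_ip, src_port))
        if service is None:
            return reply  # not a tracked connection: leave untouched
        return (*service, dst_ip, dst_port)


ct = Conntrack()
service = ("10.96.1.100", 80)
backend = ("10.0.1.6", 8080)

# Client port 40000 is an arbitrary ephemeral port chosen for the example.
forwarded = ct.dnat(("10.0.2.3", 40000, *service), service, backend)
reply_seen_by_client = ct.reverse_nat(("10.0.1.6", 8080, "10.0.2.3", 40000))
print(forwarded)
print(reply_seen_by_client)
```

The client only ever sees the Service address as its peer, which is why the backend choice must stay fixed for the lifetime of the connection.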

iptables Scaling Problem

The iptables implementation has a severe scaling problem:

Rules per service:
  1 rule in KUBE-SERVICES (matching ClusterIP)
  1 rule per endpoint in KUBE-SVC-xxx (probability-based)
  2 rules per endpoint in KUBE-SEP-xxx (masquerade + DNAT)

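Total rules: approximately 1× services + 3× endpoints

Example cluster:
  10,000 services × 5 endpoints each = 50,000 endpoints
  iptables rules: 10,000 + 50,000 + 100,000 = 160,000 rules
  (a lower bound: NodePort, external IP, and bookkeeping rules add more)

iptables update = rewrite of the nat table:
  Historically, every time ANY endpoint changed (pod starts/dies):
    iptables-restore reloaded the full ruleset (all 160,000 rules)
    This can take seconds to tens of seconds on a busy node
    During the reload: iptables holds the xtables lock (/run/xtables.lock),
      blocking every other iptables writer on the node
    Endpoint changes do not take effect for traffic until the reload completes
  (recent kube-proxy releases reduce the cost by rewriting only changed chains)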

This is not a theoretical concern. Clusters with 10K+ services can experience latency spikes of 10-30 seconds on nodes, coinciding with iptables updates. The root cause is twofold: rule evaluation for the first packet of a connection is O(N) (a linear scan through the service rules), and iptables updates are O(N) table rewrites.
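The rule-count arithmetic can be checked with a few lines of Python. This sketch counts only the three rule types listed under "Rules per service" above, so it should be read as a lower bound; the function name is hypothetical.

```python
def iptables_rule_count(services, endpoints_per_service):
    """Lower-bound nat rule count for ClusterIP services in iptables mode."""
    endpoints = services * endpoints_per_service
    service_rules = services * 1      # KUBE-SERVICES: one ClusterIP match each
    svc_chain_rules = endpoints * 1   # KUBE-SVC-xxx: one probability rule each
    sep_chain_rules = endpoints * 2   # KUBE-SEP-xxx: masquerade mark + DNAT
    return service_rules + svc_chain_rules + sep_chain_rules


for services in (100, 1_000, 10_000):
    print(services, iptables_rule_count(services, 5))
```

Growth is linear in both services and endpoints, which is the property that makes full-table rewrites expensive.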

IPVS Mode

IPVS (IP Virtual Server) is an in-kernel Layer 4 load balancer using hash tables. kube-proxy's IPVS mode uses a dummy interface (kube-ipvs0) to hold all ClusterIPs, then uses IPVS to load-balance connections to endpoints:

ip addr show kube-ipvs0:
  10.96.1.100/32  ← Service 1 ClusterIP
  10.96.1.101/32  ← Service 2 ClusterIP
  ... (all ClusterIPs on this interface)

ipvsadm -Ln:
  TCP  10.96.1.100:80 rr              ← rr = round-robin scheduler
    -> 10.0.1.5:8080  Masq    1       ← forwarding method, weight
    -> 10.0.1.6:8080  Masq    1
    -> 10.0.1.7:8080  Masq    1

IPVS hash table lookup is O(1), not O(N). Adding or removing an endpoint is O(1): only the specific IPVS entry is modified, not the entire ruleset. Illustrative orders of magnitude for update time at scale:

Service count  | iptables update time | IPVS update time
---------------+---------------------+-----------------
    100        |       10ms           |    1ms
   1,000       |      100ms           |    1ms
  10,000       |     1000ms           |    1ms
  50,000       |    5000ms+           |    1ms

IPVS supports richer load-balancing algorithms: round-robin (default), least connections, source-hash, destination-hash, shortest-expected-delay. iptables supports only random selection.

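Limitation: IPVS mode still requires some iptables rules for masquerading and edge cases. kube-proxy matches those rules against ipset sets, so the iptables rule count stays roughly constant instead of growing with the number of services or endpoints.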
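Two of the schedulers named above can be sketched in a few lines to show how they differ. These are simplified models for illustration, not the kernel's implementation: the class names are hypothetical, weights are ignored, and the connection counts are tracked by the caller.

```python
import itertools

BACKENDS = ["10.0.1.5:8080", "10.0.1.6:8080", "10.0.1.7:8080"]


class RoundRobin:
    """Hands out backends in a fixed rotation, ignoring current load."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class LeastConn:
    """Picks the backend with the fewest active connections."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self):
        # Ties resolve to the earliest backend in insertion order.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1


rr = RoundRobin(BACKENDS)
rr_picks = [rr.pick() for _ in range(4)]

lc = LeastConn(BACKENDS)
a, b = lc.pick(), lc.pick()
lc.release(a)          # first connection finishes early
c = lc.pick()          # least-loaded backend is chosen again
print(rr_picks, [a, b, c])
```

Least-connections matters for long-lived connections with uneven duration, where a blind rotation or random choice can pile work onto one backend.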

Cilium eBPF: Replacing kube-proxy Entirely

Cilium replaces kube-proxy completely using eBPF programs attached to kernel hooks. This eliminates iptables/IPVS entirely for service routing.

eBPF Service Map

Cilium stores service-to-endpoint mappings in eBPF maps (hash maps living in kernel memory):

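BPF MAP: cilium_lb4_services_v2 (simplified)
  Key:   {ClusterIP=10.96.1.100, Port=80, Proto=TCP, slot=0}   ← master entry
  Value: {count=3, flags=SVC_FLAG_ROUTABLE}
  Key:   {ClusterIP=10.96.1.100, Port=80, Proto=TCP, slot=1}   ← one entry per backend slot
  Value: {backend_id=42}
  ... (slots 2 and 3 reference the other two backends)

BPF MAP: cilium_lb4_backends_v2 (simplified)
  Key:   {backend_id=42}
  Value: {IP=10.0.1.5, Port=8080}
  ... (one entry per backend: 10.0.1.6:8080, 10.0.1.7:8080)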

eBPF map lookups are O(1) hash table operations. Adding/removing an endpoint requires updating only the relevant map entry — no lock on the entire ruleset, no full rewrite.
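The two-step lookup can be modelled with ordinary dictionaries. The layout below is deliberately simplified for illustration and does not mirror Cilium's actual structs: the backend IDs, the `flow_hash` argument, and the function names are assumptions, and real backend selection depends on the configured algorithm.

```python
# Service address -> backend IDs (hypothetical IDs).
SERVICES = {("10.96.1.100", 80, "TCP"): [1, 2, 3]}

# Backend ID -> backend address.
BACKENDS = {
    1: ("10.0.1.5", 8080),
    2: ("10.0.1.6", 8080),
    3: ("10.0.1.7", 8080),
}


def lookup(dst_ip, dst_port, proto, flow_hash):
    """Two hash lookups: service key -> backend ID -> backend address."""
    ids = SERVICES.get((dst_ip, dst_port, proto))
    if not ids:
        return None  # not a service address: forward the packet unchanged
    return BACKENDS[ids[flow_hash % len(ids)]]


def remove_backend(service_key, backend_id):
    """Endpoint removal touches two entries, not the whole table."""
    SERVICES[service_key].remove(backend_id)
    del BACKENDS[backend_id]


chosen = lookup("10.96.1.100", 80, "TCP", flow_hash=7)
miss = lookup("10.0.9.9", 80, "TCP", flow_hash=7)
print(chosen, miss)
```

The cost of `lookup` does not depend on how many services exist, which is the contrast with the linear chain walk in iptables mode.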

XDP and tc-BPF Attachment Points

Packet path with Cilium:

[Network card] → XDP hook (before kernel networking stack)
  ↓ For service traffic: can DNAT at XDP level (earliest possible)
  ↓ XDP returns: XDP_DROP, XDP_PASS, XDP_TX, XDP_REDIRECT

[Kernel network stack: routing] → tc-BPF hook (traffic control)
  ↓ Cilium attaches its bpf_lxc program to the TC hook of each pod's veth
  ↓ Service lookup in eBPF map → DNAT → route to backend pod
  ↓ For local deliveries: sockmap shortcut (below)

[Socket BPF / sockmap]:
  ↓ For pod-to-service on same node: sockmap can bypass most of the network stack
  ↓ Direct socket-to-socket data redirection in kernel
  ↓ No per-packet IP processing, no iptables

The sockmap optimization is particularly powerful: when a pod on node-1 connects to a service whose backend is also on node-1, Cilium detects this at the socket level and connects the sockets directly. The payload is redirected from the sending socket to the receiving socket inside the kernel instead of traversing the network stack as packets. This avoids per-packet TCP/IP processing, IP routing, and DNAT for local connections. Whether it is available depends on the Cilium version and configuration.

Comparison: iptables vs IPVS vs Cilium eBPF

Metric                  | iptables          | IPVS           | Cilium eBPF
------------------------+-------------------+----------------+------------------
Lookup time             | O(N) rules        | O(1) hash      | O(1) hash
Update time             | O(N) full rewrite | O(1) per entry | O(1) per entry
Local socket bypass     | No                | No             | Yes (sockmap)
XDP early drop          | No                | No             | Yes
Network policy          | Basic iptables    | None built-in  | Full L3/L4/L7
Encryption              | External          | External       | WireGuard/IPSec
Observability           | Limited           | ipvsadm        | Hubble (eBPF-native)
Kernel version required | 2.6+              | 2.6+ (IPVS)    | 4.9+ (full: 5.10+)
Max services            | ~5K practical     | ~100K          | ~100K+
Connection tracking     | Conntrack         | Conntrack      | eBPF CT (optional)

Network Policies

Kubernetes NetworkPolicy objects define L3/L4 traffic rules for pods. Implementation depends entirely on the CNI plugin — without a NetworkPolicy-capable CNI, NetworkPolicy objects have no effect.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-api-from-frontend
spec:
  podSelector:
    matchLabels: {app: api}
  policyTypes: [Ingress, Egress]
  ingress:
  - from:
    - podSelector:
        matchLabels: {app: frontend}
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels: {app: database}
    ports:
    - protocol: TCP
      port: 5432

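This policy allows: frontend pods → api pods on port 8080 (ingress), and api pods → database pods on port 5432 (egress). All other traffic to/from api pods is dropped. Because Egress is listed in policyTypes, that includes DNS lookups from api pods unless another egress rule allows port 53.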
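The ingress half of that evaluation can be sketched as a small function. This is a simplified model under stated assumptions: only one policy exists, all pods are in the same namespace, only `matchLabels` selectors are supported, and the dictionary layout and function names are hypothetical rather than the Kubernetes API types.

```python
POLICY = {
    "podSelector": {"app": "api"},
    "ingress": [
        {"from": [{"app": "frontend"}], "ports": [("TCP", 8080)]},
    ],
}


def labels_match(selector, labels):
    """True if every key/value in the selector is present in the labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(policy, src_labels, dst_labels, proto, port):
    """Decide whether a connection into dst is allowed by this one policy."""
    if not labels_match(policy["podSelector"], dst_labels):
        return True  # pod is not selected, so it is not isolated by this policy
    for rule in policy["ingress"]:
        peer_ok = any(labels_match(sel, src_labels) for sel in rule["from"])
        if peer_ok and (proto, port) in rule["ports"]:
            return True
    return False  # selected pod, no rule matched: default deny


print(ingress_allowed(POLICY, {"app": "frontend"}, {"app": "api"}, "TCP", 8080))
print(ingress_allowed(POLICY, {"app": "batch"}, {"app": "api"}, "TCP", 8080))
```

The key point is the asymmetry: a pod that no policy selects accepts everything, while a selected pod accepts only what some rule explicitly allows.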

Calico implements NetworkPolicy via iptables or eBPF. Cilium implements it via eBPF maps, and additionally supports L7 policy (HTTP path-level rules, DNS-aware policy) that iptables cannot express.

NodePort and LoadBalancer Services

NodePort

A NodePort service opens the same port on every node's external IP. Traffic to <any-node-IP>:<nodePort> is forwarded to the service's backends.

External client → 192.168.1.10:30080 (any node)
  ↓ iptables PREROUTING:
  KUBE-NODEPORTS chain → match NodePort 30080
  → KUBE-SVC-xxx (same chain as ClusterIP service)
  → DNAT to one of the pod IPs

NodePort range: 30000-32767 by default. NodePort services are primarily used with external load balancers that distribute traffic across all node IPs.

LoadBalancer

LoadBalancer services integrate with cloud provider load balancers. The cloud-controller-manager creates an external load balancer (AWS NLB/ELB, GCP Network Load Balancer, Azure Load Balancer) and configures it to forward traffic to the NodePorts of all cluster nodes. The external LB address is reported in service.status.loadBalancer.ingress[0].ip, or in ingress[0].hostname for providers such as AWS that expose a DNS name instead of an IP.

Debugging Notes

  • Pod cannot reach Service (ClusterIP): check if kube-proxy is running on the node. Check iptables rules: iptables -t nat -L KUBE-SERVICES -n | grep <ClusterIP>. Check if endpoints are populated: kubectl get endpoints <service-name>. Verify pod readiness probes are passing (pods not Ready are excluded from endpoints).
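  • Intermittent connection failures from high iptables update rate: count the rules on a node with iptables-save -t nat | wc -l. If >100K rules, consider migrating to IPVS or Cilium. To check for slow rule syncs and xtables lock contention, look at kube-proxy's sync duration metric (kubeproxy_sync_proxy_rules_duration_seconds) and its logs, or use perf to observe where the kernel spends time during a sync.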
  • Cross-node pod communication failure: use ping and traceroute from inside the pod's network namespace (nsenter -n --target <containerPID>). Check if the node has the correct route to the destination pod's node CIDR. For overlay networks, check if the overlay tunnel interface is up.
  • Service IP not reachable from outside cluster: ClusterIPs are only routable within the cluster (they only exist in iptables/IPVS rules on cluster nodes). External access requires NodePort, LoadBalancer, or Ingress.
  • Cilium debugging: cilium endpoint list shows per-pod policy state. cilium service list shows all services in eBPF map. hubble observe --follow shows real-time network flows with policy verdicts (allow/drop).

Security Implications

  • Without NetworkPolicy, all pods can communicate with all other pods on any port. This is a flat network security risk. Implement a default-deny policy in every namespace and explicitly allow required communication paths.
  • iptables rules are visible to any process running as root on the node. An attacker with host-level access can modify iptables rules to redirect traffic. eBPF programs are harder to manipulate but still require preventing privilege escalation.
  • The EndpointSlice controller publishes pod IPs to all nodes via the API server. Any client with RBAC access to read EndpointSlices can discover the IP addresses of all pods running a given service. Restrict EndpointSlice read access for sensitive services.
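  • Topology-aware routing (preferring same-zone backends) keeps Service traffic within a zone, but it is a routing preference, not a security boundary: a compromised pod in AZ-a can still reach pods in AZ-b directly by pod IP. Use NetworkPolicy to restrict cross-zone access.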
  • Cilium's L7 network policy (based on eBPF plus a per-node Envoy proxy, not a per-pod sidecar) can enforce HTTP path-level access controls between services without modifying application code. This is more granular than L4 iptables rules.
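A default-deny policy of the kind recommended in the first bullet is small enough to generate per namespace. The sketch below builds the manifest as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML; the policy name `default-deny-all` and the namespace `payments` are hypothetical.

```python
import json


def default_deny(namespace):
    """Build a NetworkPolicy that isolates every pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Both directions are isolated; no rules are listed, so
            # nothing is allowed until further policies add exceptions.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


manifest = default_deny("payments")
print(json.dumps(manifest, indent=2))
```

The output can be piped to `kubectl apply -f -`. Since egress is denied as well, DNS access needs its own allow policy before workloads in the namespace can resolve names.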

Performance Implications

  • kube-proxy iptables mode: p99 connection establishment latency increases linearly with number of service rules. At 10K services, expect 1-5ms additional latency per new connection. Existing connections are unaffected.
  • IPVS mode reduces connection establishment overhead to O(1) but has slightly higher per-packet CPU overhead vs iptables for low service counts. For >1000 services, IPVS generally comes out ahead.
  • Cilium sockmap: pod-to-pod connections through a service on the same node bypass most of the network stack. Reported improvements are on the order of 2-5x throughput and 50% lower latency vs iptables for same-node service communication; actual gains depend on workload and kernel version.
  • AWS VPC CNI (no overlay): pod-to-pod traffic uses VPC routing directly. Eliminates VXLAN encapsulation overhead. But each node can host a limited number of pods determined by the number of ENIs and secondary IPs the instance type supports (e.g., c5.large: 3 ENIs with 10 IPv4 addresses each, which gives 29 pods with default settings).
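The pod limit for AWS VPC CNI in its default secondary-IP mode follows from the instance's ENI limits. The sketch below applies the commonly used formula; it assumes prefix delegation is disabled, and the function name is hypothetical.

```python
def max_pods(enis, ipv4_per_eni):
    """Pods per node for AWS VPC CNI in secondary-IP mode.

    Each ENI's primary address is used by the node itself, so only
    (ipv4_per_eni - 1) addresses per ENI are available to pods. The +2
    covers pods that run on the host network and need no pod IP.
    """
    return enis * (ipv4_per_eni - 1) + 2


# c5.large supports 3 ENIs with 10 IPv4 addresses each.
print(max_pods(3, 10))
```

With these limits the IP budget, not CPU or memory, is often what caps pod density on small instance types.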

Failure Modes

  • kube-proxy crash: if kube-proxy crashes on a node, iptables rules are not updated. New services are not reachable, terminated pods remain in iptables rules. Fix: restart kube-proxy. Existing iptables rules remain functional for existing services until kube-proxy restarts and refreshes.
  • EndpointSlice controller lag: if many pods are restarting simultaneously (rolling deploy of large deployment), endpoint updates may lag behind pod readiness by seconds. Load balancers will send requests to terminating pods that have not been removed from endpoints yet. Solution: configure preStop hook with a sleep to give kube-proxy time to update rules before the pod stops accepting connections.
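  • CNI plugin failure: if the CNI plugin crashes or cannot allocate IPs (pool exhausted), new pods stay in ContainerCreating with FailedCreatePodSandBox events. If the CNI configuration is missing entirely, the node reports NetworkPluginNotReady and becomes NotReady. Existing pods are unaffected. Check CNI plugin logs and IP pool utilization.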
  • Conntrack table exhaustion: at very high connection rates (hundreds of thousands of short-lived connections), the kernel conntrack table can fill. Check with conntrack -C or cat /proc/sys/net/netfilter/nf_conntrack_count. Increase nf_conntrack_max or reduce conntrack timeout values.
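The conntrack check from the last bullet is easy to automate. The following sketch reads the two counters and returns the fill ratio; the function name and the idea of alerting at a fixed ratio are assumptions, and the default paths exist only on Linux hosts with the conntrack module loaded.

```python
from pathlib import Path

COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"
MAX_PATH = "/proc/sys/net/netfilter/nf_conntrack_max"


def conntrack_usage(count_path=COUNT_PATH, max_path=MAX_PATH):
    """Return the conntrack table fill ratio as a float between 0 and 1."""
    count = int(Path(count_path).read_text().strip())
    limit = int(Path(max_path).read_text().strip())
    return count / limit
```

On a node, `conntrack_usage()` with no arguments reads the live counters. A ratio that stays close to 1.0 suggests raising nf_conntrack_max or shortening conntrack timeouts before the kernel starts dropping new connections.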

Modern Usage

Cilium with kube-proxy replacement is a common choice of CNI for new large-scale deployments. AWS EKS supports Cilium as an alternative CNI. GKE Dataplane V2 is based on Cilium. Azure uses Cilium for their CNI overlay mode (announced 2023).

Gateway API (v1.0 GA in October 2023, delivered as CRDs independent of the Kubernetes release cycle) is positioned as the successor to Ingress for L7 HTTP routing. It provides richer semantics (traffic splitting for canary deployments, header-based routing) implemented by API-compatible controllers (Cilium Gateway API, Envoy Gateway, Istio Gateway, NGINX Gateway Fabric).

Future Directions

  • eBPF for everything: continued shift from iptables/IPVS to eBPF programs for all network operations. eBPF's programmability, observability, and performance advantages are driving this transition.
  • QUIC as service transport: as application protocols migrate to HTTP/3 (QUIC), service mesh and ingress controllers must handle QUIC's connection migration features, which break assumptions of traditional 5-tuple connection tracking.
  • Multi-network pods (Multus NetworkAttachmentDefinition, API group k8s.cni.cncf.io): pods with multiple network interfaces for different traffic types (management vs data vs storage network). Native multi-network support is under discussion in Kubernetes SIG-Network.
  • Sidecarless service mesh: Cilium Service Mesh and Istio Ambient Mode eliminate per-pod Envoy sidecars. Cilium uses eBPF programs with a node-level L7 proxy; Istio Ambient uses a per-node ztunnel plus optional waypoint proxies. Reported reductions in sidecar resource overhead are in the range of 30-50%, depending on workload.

Exercises

  1. Start a pod and examine its network namespace manually: find the pod's container PID (docker inspect or crictl inspect), use nsenter --net --target <PID> ip addr and ip route to see the pod's network configuration. Find the corresponding veth on the host and trace the routing.
  2. In a cluster running iptables-mode kube-proxy, create a Service with 3 endpoints. Use iptables -t nat -L -n --line-numbers | grep KUBE to find all generated chains. Trace a packet manually through the iptables rules using iptables -t nat -L KUBE-SVC-xxx -n.
  3. Benchmark connection establishment latency with 100 services vs 10,000 services using a tool like wrk or hey targeting a ClusterIP. Quantify the latency difference.
  4. Deploy Cilium with kube-proxy replacement mode. Use cilium monitor to observe packet flows in real-time. Then deploy a NetworkPolicy and verify traffic is blocked/allowed as expected using hubble observe.
  5. Simulate EndpointSlice update lag: deploy a large Deployment (100 replicas), trigger a rolling update, and use tcpdump on a client pod to capture TCP RST packets that result from requests hitting terminating pods before they are removed from endpoints.

References

  • Kubernetes Networking Model: https://kubernetes.io/docs/concepts/cluster-administration/networking/
  • Container Network Interface (CNI) Specification: https://github.com/containernetworking/cni/blob/main/SPEC.md
  • Cilium documentation: https://docs.cilium.io/en/stable/
  • "Life of a Packet" — Michael Rubin (KubeCon 2017, highest-rated Kubernetes networking talk)
  • kube-proxy iptables deep dive: https://arthurchiao.art/blog/cracking-k8s-node-proxy/
  • Brendan Gregg: "BPF Performance Tools" (Addison-Wesley, 2019) — eBPF fundamentals
  • AWS VPC CNI documentation: https://github.com/aws/amazon-vpc-cni-k8s
  • "Kubernetes Networking: The Complete Guide" — Tigera (Calico blog series)
  • Cilium: BPF and XDP Reference Guide: https://docs.cilium.io/en/stable/bpf/
  • IPVS-Based In-Cluster Load Balancing Deep Dive: https://kubernetes.io/blog/2018/07/09/ipvs-based-in-cluster-load-balancing-deep-dive/