06 — Virtual Networking
Technical Overview
Virtual networking provides the fabric that connects containers, virtual machines, and cloud resources. Linux ships a comprehensive toolkit of virtual network devices, each with distinct properties and performance characteristics, that can be composed into increasingly sophisticated topologies. Understanding this toolkit is a prerequisite for debugging container networking, Kubernetes CNI plugins, OpenStack networking, and hypervisor connectivity. Much of the SDN (Software-Defined Networking) tooling used in cloud environments builds on these primitives or on close analogues of them.
Prerequisites
- Linux network namespace concepts (ip netns)
- Basic L2/L3 networking (Ethernet, IP forwarding, ARP)
- Linux network stack internals (see 01-linux-network-stack.md)
- Familiarity with ip link, bridge, ip route, ovs-vsctl
Core Content
veth Pair
A veth (virtual Ethernet) pair is a bi-directional pipe with two endpoints, each appearing as a network interface. Packets entering one end immediately appear at the other end. It is the fundamental building block for container networking:
# Create a veth pair
ip link add veth0 type veth peer name veth1
# Move one end into a network namespace (container)
ip link set veth1 netns <namespace>
# Configure both ends
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up
ip netns exec <namespace> ip addr add 10.0.0.2/24 dev veth1
ip netns exec <namespace> ip link set veth1 up
veth adds little overhead for intra-host communication: a packet transmitted on one end is handed to the peer through the netif_rx() path, a softirq-driven receive path that involves no hardware, DMA, or device interrupts. Each packet is still carried in an sk_buff and goes through the full stack processing of the receiving namespace.
Performance: veth throughput between namespaces on the same host can reach roughly 20–40 Gbps on modern hardware, depending on CPU, MTU, and GSO/GRO settings. Round-trip latency through a veth pair is typically ~5–10 µs.
Linux Bridge
A Linux bridge connects multiple network interfaces at Layer 2, learning MAC addresses and forwarding frames to the correct port. It behaves like a software Ethernet switch:
# Create a bridge
ip link add name br0 type bridge
ip link set br0 up
# Add interfaces to the bridge
ip link set eth0 master br0
ip link set veth0 master br0
# Assign IP to the bridge (host connectivity)
ip addr add 192.168.1.1/24 dev br0
The bridge maintains a forwarding database (FDB), a MAC address table built by learning the source MAC of every frame that arrives on a port. A frame is forwarded to the port whose learned MAC matches its destination, or flooded to all other ports if the destination is unknown or broadcast.
bridge fdb show br br0 # show the MAC table of bridge br0
bridge fdb add <MAC> dev <port> master static # static entry
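The learn-then-forward-or-flood behavior can be illustrated with a small model. This is a toy sketch, not the kernel's implementation: the class name, port names, and MAC addresses are made up for illustration, and entry aging and VLANs are left out.

```python
from dataclasses import dataclass, field

BROADCAST = "ff:ff:ff:ff:ff:ff"


@dataclass
class LearningBridge:
    """Toy learning bridge: the FDB maps a MAC address to a port name."""
    ports: list[str]
    fdb: dict[str, str] = field(default_factory=dict)

    def receive(self, in_port: str, src_mac: str, dst_mac: str) -> list[str]:
        """Process one frame and return the ports it is sent out of."""
        # Learn: the source MAC is reachable through the ingress port.
        self.fdb[src_mac] = in_port
        # Broadcast frames and unknown destinations are flooded.
        out_port = None if dst_mac == BROADCAST else self.fdb.get(dst_mac)
        if out_port is None:
            return [p for p in self.ports if p != in_port]
        # A frame is never sent back out of the port it arrived on.
        return [] if out_port == in_port else [out_port]


bridge = LearningBridge(ports=["eth0", "veth0A", "veth0B"])
first = bridge.receive("veth0A", "aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02")
reply = bridge.receive("veth0B", "aa:aa:aa:aa:aa:02", "aa:aa:aa:aa:aa:01")
print(first)  # flooded: the destination has not been learned yet
print(reply)  # directed: the first frame taught the bridge where ...:01 lives
```

The first frame is flooded to every port except the ingress port; the reply goes to a single port because the bridge learned the sender's location from the first frame.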
Container networking with veth + bridge:
Host network namespace                  Container namespace (netns A)

+--- br0 (192.168.1.1) ---------+       +--- veth1 (192.168.1.2) ---+
|     |            |            |       |                           |
|    eth0        veth0 <--------+-------+--> veth1                  |
|  (uplink)    (host end)       |       |   (container end)         |
+-------------------------------+       +---------------------------+

Container networking diagram
============================

             Internet
                |
         eth0 (192.0.2.1)
                |
         br0 (192.168.1.1) ---- [Linux bridge FDB]
           |            |
        veth0A       veth0B       <-- host ends of veth pairs
           |            |
       [veth1A]     [veth1B]      <-- container ends (in separate netns)
           |            |
      container A  container B
     (192.168.1.2) (192.168.1.3)
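Docker's default bridge (docker0) uses this topology, with the host routing and NATing between docker0 and eth0 rather than enslaving eth0 to the bridge. Several Kubernetes CNI plugins use the same layout on each node and add an overlay protocol for cross-node traffic.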
TUN/TAP Devices
TUN (network TUNnel) and TAP (network TAP) devices are virtual NICs whose other end is a file descriptor held by a userspace process rather than hardware:
- TUN: operates at Layer 3 (IP). Userspace opens /dev/net/tun, reads IP packets from the resulting file descriptor, and writes IP packets back. Used by VPN software (OpenVPN, WireGuard userspace implementations).
- TAP: operates at Layer 2 (Ethernet frames). Used by hypervisors (QEMU/KVM) to give VMs a virtual NIC.
# Create a TUN device
ip tuntap add dev tun0 mode tun user $(whoami)
ip link set tun0 up
ip addr add 10.8.0.1/24 dev tun0
# Create a TAP device (for VM)
ip tuntap add dev tap0 mode tap user qemu
ip link set tap0 master br0
ip link set tap0 up
A VPN application using TUN:
1. Kernel routes outgoing packets to tun0 (based on IP routing table)
2. VPN process reads the IP packet from the TUN file descriptor
3. VPN process encrypts it and sends via a UDP socket to the remote VPN server
4. Remote server decrypts, writes to its TUN device
5. Kernel delivers to destination
WireGuard implements this entirely in kernel space for performance; OpenVPN uses userspace TUN.
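Steps 2 and 3 of the flow above can be sketched from the userspace side. This is a minimal illustration under stated assumptions, not how OpenVPN is implemented: it assumes tun0 already exists (as created above), it needs root or ownership of the device, and the helper names are hypothetical. The constants come from linux/if_tun.h.

```python
import fcntl
import os
import socket
import struct

# Constants from <linux/if_tun.h>.
TUNSETIFF = 0x400454CA
IFF_TUN = 0x0001     # Layer 3 device: read()/write() carry raw IP packets
IFF_NO_PI = 0x1000   # do not prepend the 4-byte packet-info header


def build_ifreq(name: str, flags: int) -> bytes:
    """Pack a struct ifreq: 16-byte name, 16-bit flags, padding up to 40 bytes."""
    return struct.pack("16sH22x", name.encode(), flags)


def open_tun(name: str = "tun0") -> int:
    """Attach to a TUN device and return its file descriptor (privileged)."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    fcntl.ioctl(fd, TUNSETIFF, build_ifreq(name, IFF_TUN | IFF_NO_PI))
    return fd


def ipv4_endpoints(packet: bytes) -> tuple[str, str]:
    """Return (source, destination) of a raw IPv4 packet read from the TUN fd."""
    src, dst = struct.unpack("!4s4s", packet[12:20])
    return socket.inet_ntoa(src), socket.inet_ntoa(dst)


def forward_one(tun_fd: int, udp_sock: socket.socket, peer: tuple[str, int]) -> None:
    """Read one packet the kernel routed to the TUN device and send it over UDP."""
    packet = os.read(tun_fd, 65535)
    # A real VPN encrypts and authenticates here; this sketch sends plaintext.
    udp_sock.sendto(packet, peer)
```

The receiving side does the reverse: it takes the datagram from its UDP socket, decrypts it, and calls os.write() on its own TUN descriptor so the kernel can deliver the packet.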
VLAN (802.1Q)
VLANs partition a physical network into isolated L2 broadcast domains using 4-byte 802.1Q tags inserted into Ethernet frames. Linux creates VLAN subinterfaces:
# Create VLAN 100 subinterface on eth0
ip link add link eth0 name eth0.100 type vlan id 100
ip link set eth0.100 up
ip addr add 10.100.0.1/24 dev eth0.100
# VLANs on bridge ports (requires VLAN filtering to be enabled on the bridge)
ip link set br0 type bridge vlan_filtering 1
bridge vlan add vid 100 dev eth0 pvid untagged # access port
bridge vlan add vid 100-200 dev eth1 # trunk port
The VLAN ID (VID) field of the tag is 12 bits wide, supporting 4094 logical networks per physical link (VID 0 and 4095 are reserved). This limit was one of the motivations for VXLAN.
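To make the tag layout concrete, the following sketch inserts and parses an 802.1Q tag on a raw Ethernet frame. The function names and the sample frame are illustrative; only the tag format (TPID 0x8100 followed by a 16-bit TCI holding PCP, DEI, and VID) comes from the standard.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that marks an 802.1Q tag


def add_vlan_tag(frame: bytes, vid: int, pcp: int = 0) -> bytes:
    """Insert a 4-byte 802.1Q tag after the destination and source MACs."""
    if not 1 <= vid <= 4094:
        raise ValueError("usable VIDs are 1-4094 (0 and 4095 are reserved)")
    if not 0 <= pcp <= 7:
        raise ValueError("PCP is a 3-bit priority value")
    tci = (pcp << 13) | vid  # PCP (3 bits) | DEI (1 bit, left 0) | VID (12 bits)
    return frame[:12] + struct.pack("!HH", TPID_8021Q, tci) + frame[12:]


def vlan_id(frame: bytes):
    """Return the VID of a tagged frame, or None if the frame is untagged."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    return tci & 0x0FFF if tpid == TPID_8021Q else None


# Broadcast destination, made-up source MAC, EtherType 0x0800 (IPv4).
untagged = bytes.fromhex("ffffffffffff" "001122334455" "0800") + b"payload"
tagged = add_vlan_tag(untagged, vid=100)
print(len(tagged) - len(untagged))         # 4
print(vlan_id(tagged), vlan_id(untagged))  # 100 None
```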
VXLAN: Virtual Extensible LAN
VXLAN encapsulates Ethernet frames in UDP/IP, extending L2 networks across L3 boundaries. Key parameters:
- VNI (VXLAN Network Identifier): 24-bit, supporting ~16 million virtual networks (vs 4094 for VLAN)
- UDP port: 4789 (IANA assigned)
- Outer header: IP + UDP (enables L3 routing, ECMP across fabric)
VXLAN encapsulation:
+-----------+-----------+-----------+-----------+-----------+
| Outer IP  | Outer UDP | VXLAN hdr | Inner ETH | Inner IP  |
| (20B, v4) | (8B)      | (8B, VNI) | (14B)     | ...       |
+-----------+-----------+-----------+-----------+-----------+
^                                   ^
outer network addressing            original frame
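The 8-byte VXLAN header in the diagram is simple enough to build by hand. The sketch below packs and parses it as laid out in RFC 7348; the source-port helper is an assumption-laden illustration of how an outer UDP source port can be derived from the inner flow to spread traffic across ECMP paths, with CRC32 standing in for whatever hash a real VTEP uses.

```python
import struct
import zlib

VXLAN_PORT = 4789
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


def vxlan_header(vni: int) -> bytes:
    """Build the header: 8 flag bits, 24 reserved, 24-bit VNI, 8 reserved."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from the first 8 bytes of a VXLAN payload."""
    _flags, vni_field = struct.unpack("!II", header[:8])
    return vni_field >> 8


def source_port(inner_frame: bytes) -> int:
    """Pick an outer UDP source port in 49152-65535 from a hash of the inner headers.

    CRC32 over the first 34 bytes (Ethernet + IPv4 header) is an arbitrary
    stand-in; the point is that one inner flow always maps to one port.
    """
    return 49152 + zlib.crc32(inner_frame[:34]) % 16384


print(vxlan_header(100).hex())        # 0800000000006400
print(parse_vni(vxlan_header(100)))   # 100
```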
# Create VXLAN interface
ip link add vxlan100 type vxlan \
id 100 \
dstport 4789 \
dev eth0 \
local 10.0.0.1 \
remote 10.0.0.2 # unicast mode (point-to-point)
# Or multicast mode for BUM (Broadcast/Unknown/Multicast) flooding
ip link add vxlan100 type vxlan \
id 100 \
dstport 4789 \
dev eth0 \
group 239.1.1.1 # multicast group
ip link set vxlan100 up
ip addr add 192.168.100.1/24 dev vxlan100
VXLAN FDB: the VXLAN device keeps MAC-to-VTEP (VXLAN Tunnel Endpoint) mappings. When a destination MAC is unknown, the VTEP floods to the multicast group or (in controller-driven SDN) consults a central directory:
bridge fdb show dev vxlan100 # show VTEP mappings
bridge fdb add <MAC> dev vxlan100 dst <remote_vtep_ip> # add static entry
Cloud overlay: the major cloud virtual networks (AWS VPC, Azure VNet, GCP VPC) provide tenant isolation with VXLAN-like encapsulation, much of it proprietary. AWS Gateway Load Balancer, for example, uses Geneve.
Geneve
Geneve (Generic Network Virtualization Encapsulation, RFC 8926) is VXLAN's flexible successor. The key difference: Geneve has a variable-length options field that carries metadata (OVN uses this for security labels, load balancer state):
ip link add geneve0 type geneve id 42 remote 10.0.0.2
OVN (Open Virtual Network, used by Kubernetes OVN-Kubernetes CNI) uses Geneve to carry logical datapath and port binding information in the options header.
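The variable-length options are what distinguish the Geneve header from VXLAN's fixed one. The sketch below packs the 8-byte base header and one TLV option following the RFC 8926 layout; the option class 0xFF00, option type 0x01, and the payload are hypothetical values chosen for illustration, not OVN's actual encoding.

```python
import struct

ETH_P_TEB = 0x6558  # protocol type for an encapsulated Ethernet frame


def geneve_option(opt_class: int, opt_type: int, data: bytes) -> bytes:
    """Build one TLV option; its length field counts the data in 4-byte words."""
    if len(data) % 4 or len(data) > 124:
        raise ValueError("option data must be 0-124 bytes in 4-byte units")
    return struct.pack("!HBB", opt_class, opt_type, len(data) // 4) + data


def geneve_header(vni: int, options: bytes = b"") -> bytes:
    """Build the fixed 8-byte Geneve header followed by its options."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    if len(options) % 4 or len(options) > 252:
        raise ValueError("options must be 0-252 bytes in 4-byte units")
    ver_optlen = (0 << 6) | (len(options) // 4)  # version 0, option length in words
    flags = 0                                    # O (control) and C (critical) clear
    return struct.pack("!BBHI", ver_optlen, flags, ETH_P_TEB, vni << 8) + options


# Hypothetical metadata option carrying a 32-bit value.
meta = geneve_option(0xFF00, 0x01, struct.pack("!I", 7))
print(geneve_header(42).hex())         # 0000655800002a00
print(len(geneve_header(42, meta)))    # 16
```

Without options the header is the same size as VXLAN's; each option adds a 4-byte TLV header plus its data, which also has to be counted in the MTU budget.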
MACVLAN and IPVLAN
MACVLAN: creates virtual interfaces with distinct MAC addresses on top of a physical interface. Each MACVLAN interface has its own MAC, appears as a separate NIC to the network:
ip link add macvlan0 link eth0 type macvlan mode bridge
Modes: bridge (communicate between instances), vepa (must go via external switch), private (no direct communication), passthru (single MAC, VM-like). Used in Docker macvlan networks for containers needing direct L2 access.
IPVLAN: similar to MACVLAN, but all virtual interfaces share the parent's MAC address, so multiple IP addresses sit behind one MAC. Useful when the upstream device limits the number of MAC addresses per port (as some switches and wireless APs do):
ip link add ipvlan0 link eth0 type ipvlan mode l2
Open vSwitch (OVS)
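OVS is a production-quality L2/L3 software switch with OpenFlow support. Its main components are:
- ovsdb-server: configuration database
- ovs-vswitchd: userspace switch daemon that processes OpenFlow tables and programs the datapath
- kernel datapath module: in-kernel flow cache that forwards packets for flows ovs-vswitchd has already decided on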
# Create OVS bridge
ovs-vsctl add-br ovs-br0
# Add physical NIC and VM tap interface
ovs-vsctl add-port ovs-br0 eth0
ovs-vsctl add-port ovs-br0 tap-vm1
# Add VXLAN tunnel port
ovs-vsctl add-port ovs-br0 vxlan0 -- set Interface vxlan0 \
type=vxlan options:remote_ip=10.0.0.2 options:key=100
# OpenFlow rule: forward traffic from VM to VXLAN port
ovs-ofctl add-flow ovs-br0 "in_port=tap-vm1,actions=output:vxlan0"
# Show datapath flows (kernel cache)
ovs-dpctl dump-flows
# Show OpenFlow flows
ovs-ofctl dump-flows ovs-br0
OVS-DPDK replaces the kernel datapath module with a userspace datapath built on DPDK poll-mode drivers (PMDs), and can reach 40–100 Gbps for VM-to-VM traffic when given dedicated cores.
SR-IOV: Single Root I/O Virtualization
SR-IOV allows a physical NIC to present multiple Virtual Functions (VFs) — each appearing as a separate PCIe device that can be passed directly to a VM or container:
# Enable 4 VFs on physical function
echo 4 > /sys/class/net/eth0/device/sriov_numvfs
# Verify VFs created
ip link show eth0 # shows VF list
# Configure VF MAC/VLAN
ip link set eth0 vf 0 mac 00:11:22:33:44:55 vlan 100
# Pass VF to container using device plugin
# Or to VM: qemu -device vfio-pci,host=0000:01:00.1
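SR-IOV VFs bypass the host kernel network stack entirely for data plane traffic. VM-to-NIC latency can drop to ~5 µs (vs ~50 µs through OVS), though exact figures depend on hardware and load. The tradeoff: live migration requires detaching the VF on the source host and attaching a new one on the destination, which interrupts traffic and can drop established connections.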
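Before writing to sriov_numvfs it helps to check how many VFs the device supports. The sketch below reads the sriov_numvfs and sriov_totalvfs attributes from sysfs; the function names and the sysfs_root parameter (there so the code can be pointed at a test directory) are assumptions made for this example.

```python
from pathlib import Path


def vf_counts(iface: str, sysfs_root: str = "/sys/class/net") -> tuple[int, int]:
    """Return (currently enabled VFs, maximum VFs) for a physical function."""
    device = Path(sysfs_root) / iface / "device"
    enabled = int((device / "sriov_numvfs").read_text())
    maximum = int((device / "sriov_totalvfs").read_text())
    return enabled, maximum


def can_enable(iface: str, wanted: int, sysfs_root: str = "/sys/class/net") -> bool:
    """Check whether `wanted` VFs fit within the limit the device advertises."""
    _enabled, maximum = vf_counts(iface, sysfs_root)
    return 0 <= wanted <= maximum
```

On a real host, vf_counts("eth0") would report the current state. Changing from one non-zero VF count to another generally requires writing 0 to sriov_numvfs first.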
Historical Context
Linux namespaces began with mount namespaces in 2.4.19 (2002); network namespaces arrived in 2.6.24 (2008) and matured over the following releases. The veth driver was merged alongside them. This infrastructure sat mostly dormant until Docker (2013) popularized containerization and made "one container, one network namespace, one veth pair" the standard deployment model.
VXLAN was proposed by VMware, Cisco, and others in 2011 (published as RFC 7348 in 2014) to overcome the 4094-VLAN limit in cloud environments. Overlay encapsulations of this kind have since become standard in large cloud and data-center networks.
OVS originated at Nicira (founded 2007), which was acquired by VMware for $1.26B in 2012. The OVS community later started OVN as a virtual networking layer built on top of OVS.
Production Examples
Kubernetes pod networking (kubenet/flannel):
# Each node: a bridge cbr0 connecting pod veth pairs
# Cross-node: VXLAN overlay with flannel daemon managing FDB
ip link show type veth # pod veth pairs
ip link show type vxlan # flannel VXLAN interface
bridge fdb show dev flannel.1 # remote VTEP MAC entries (flannel's VXLAN device)
Diagnosis: packet drops in veth path
# Check veth drop/error counters; TX drops on veth0 usually mean the peer could not queue the packet
ip -s link show veth0
# RX drops or errors on the container side point at a full receive queue in the container namespace
# ethtool statistics for veth (limited, but includes peer_ifindex)
ethtool -S veth0
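A single counter reading says little; what matters is whether the counters are still increasing. The following sketch samples the per-interface statistics files in sysfs twice and reports what changed. The helper names and the sysfs_root parameter are hypothetical, and the choice of counters is an assumption about what is worth watching.

```python
import time
from pathlib import Path

COUNTERS = ("tx_dropped", "rx_dropped", "tx_errors", "rx_errors")


def read_counters(iface: str, sysfs_root: str = "/sys/class/net") -> dict[str, int]:
    """Read drop and error counters from <sysfs_root>/<iface>/statistics/."""
    stats = Path(sysfs_root) / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}


def changed(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return only the counters that increased, with the size of the increase."""
    return {k: after[k] - before[k] for k in before if after[k] > before[k]}


def watch(iface: str, interval: float = 1.0) -> dict[str, int]:
    """Sample the counters twice, `interval` seconds apart."""
    before = read_counters(iface)
    time.sleep(interval)
    return changed(before, read_counters(iface))
```

Calling watch("veth0") on the host covers the host end; for the container end, the same code has to run inside that namespace (for example under ip netns exec) so that sysfs shows the right interface.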
Debugging Notes
# Trace ARP across bridge/veth
ip netns exec <ns> arping -I veth1 10.0.0.1
# Check bridge FDB (MAC learning table)
bridge fdb show br br0
# Test overlay reachability
ping -c1 -W1 <remote_inner_ip> # VXLAN inner
# Then check outer UDP connectivity
tcpdump -i eth0 'udp port 4789'
# Namespace-aware packet capture
ip netns exec <ns> tcpdump -i veth1 -w /tmp/cap.pcap
# OVS troubleshooting
ovs-appctl ofproto/trace ovs-br0 in_port=1,dl_dst=ff:ff:ff:ff:ff:ff
ovs-dpctl show
Security Implications
- Bridge bypass: a container that holds CAP_NET_ADMIN in the host network namespace (for example with hostNetwork: true) can create a new network interface, attach it to the host bridge, and see all bridge traffic. Container runtimes drop CAP_NET_ADMIN by default; in Kubernetes, avoid granting it or hostNetwork, and enforce that with Pod Security Standards alongside allowPrivilegeEscalation: false and seccomp profiles.
- ARP spoofing on bridge: without ARP filtering (ebtables/arptables rules or OVS ARP inspection), a malicious container can ARP-poison its neighbors on the bridge and intercept their traffic. Mitigation: ebtables -A FORWARD --protocol ARP --arp-op Reply -j DROP for all but validated sources.
- VXLAN without encryption: VXLAN carries tenant traffic in plaintext UDP, so any host on the underlay can capture and inject traffic. Mitigation: IPsec or WireGuard on the underlay, or use overlay encryption (OVN + IPsec).
- SR-IOV VF isolation: VFs are isolated from each other by the IOMMU and the NIC's internal switch, but NIC firmware or driver bugs can weaken that isolation. Keep NIC firmware updated.
Performance Implications
| Virtual device | Throughput | Latency | CPU cost |
|---|---|---|---|
| veth | 20–40 Gbps | 5–10 µs | Low |
| Linux bridge | 10–20 Gbps | 10–20 µs | Medium |
| OVS (kernel) | 5–10 Gbps | 20–50 µs | Medium-High |
| OVS-DPDK | 40–100 Gbps | 5–15 µs | High (dedicated core) |
| VXLAN | 5–10 Gbps | +5–10 µs (encap) | Medium |
| VXLAN + GSO/GRO offload | ~line rate | — | Low with NIC offload |
| SR-IOV VF | Near physical NIC | ~5 µs | Very Low |
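The figures above are rough orders of magnitude that vary with hardware, kernel version, and packet size; treat them as relative guidance rather than benchmarks.
Enable VXLAN hardware offload (where supported):
ethtool -k eth0 | grep tx-udp_tnl # check offload capability
ethtool -K eth0 tx-udp_tnl-segmentation on tx-udp_tnl-csum-segmentation on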
Failure Modes and Real Incidents
Incident: Kubernetes pod DNS failure due to bridge loop
A misconfigured CNI plugin attached the same VXLAN segment to the bridged network twice, creating a forwarding loop. STP (Spanning Tree Protocol) was disabled, which is the default for a Linux bridge. BUM traffic amplified into a storm, saturating the CPU with bridge flooding. Diagnosis: bridge fdb show showed MAC entries flapping between ports; ip -s link show br0 showed a rapidly climbing RX rate. Fix: enable STP (ip link set br0 type bridge stp_state 1) or correct the CNI configuration.
Failure Mode: VXLAN MTU mismatch
VXLAN encapsulation adds 50 bytes (with an IPv4 underlay), so full-size inner frames exceed a 1500-byte physical MTU once encapsulated. The resulting oversized packets or fragments are dropped by the underlay. Symptom: large file transfers fail while ping works. Fix: reduce the container/VM MTU to 1450, or enable jumbo frames (MTU 9000) on the underlay.
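The 50-byte figure and the resulting MTU values follow from simple arithmetic, sketched below for an IPv4 underlay. The function names are illustrative; the ping payload helper assumes ICMP over IPv4 with no IP options.

```python
# Bytes that VXLAN adds inside the underlay's IP MTU (IPv4 underlay).
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HDR = 8
INNER_ETH = 14  # the inner Ethernet header travels as UDP payload
VXLAN_OVERHEAD = OUTER_IPV4 + OUTER_UDP + VXLAN_HDR + INNER_ETH  # 50

ICMP_HDR = 8
IPV4_HDR = 20


def inner_mtu(underlay_mtu: int) -> int:
    """Largest inner IP packet that fits in one underlay packet."""
    return underlay_mtu - VXLAN_OVERHEAD


def max_ping_payload(mtu: int) -> int:
    """Largest `ping -s` value that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HDR - ICMP_HDR


print(inner_mtu(1500))                    # 1450
print(max_ping_payload(inner_mtu(1500)))  # 1422
print(inner_mtu(9000))                    # 8950
```

With a correctly set inner MTU of 1450, ping -M do -s 1422 is the largest probe that should pass; anything larger is rejected locally instead of vanishing in the underlay.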
Incident: SR-IOV VF queue exhaustion (telecom)
A 5G gNB application consumed all 1024 queue descriptors on its VF, and other VFs on the same PF became starved. The root cause was a NIC firmware bug: descriptor exhaustion was not properly isolated between VFs. The fix required a NIC firmware update and a reduction in the VF queue count.
Modern Usage
- CNI plugins: Cilium (eBPF), Calico (BGP + eBPF), Flannel (VXLAN), Weave (overlay), AWS VPC CNI (ENI-based, native VPC routing) — all compose the virtual networking primitives above
- eBPF replaces bridges: Cilium avoids the Linux bridge by attaching TC eBPF programs directly to pod veth interfaces, saving a stack traversal per packet
- Geneve in place of VXLAN: OVN uses Geneve by default, and AWS Gateway Load Balancer uses it to carry extensible metadata
- SmartNIC offload: Nvidia BlueField-2/3 and AMD Pensando DPUs run the virtual switch datapath on the NIC itself, substantially reducing host CPU load for virtual switching
Future Directions
- eBPF-native CNI: as eBPF socket-level hooks mature, the overhead of veth+bridge for pod-to-pod communication may be eliminated entirely — packets redirected at socket level without entering the IP stack
- BESS (Berkeley Extensible Software Switch): academic work on modular software switching that influenced production systems
- P4-programmable fabrics: hardware switches programmable in P4 replace static VXLAN/EVPN with application-aware forwarding
Exercises
- Create two network namespaces, each connected by its own veth pair to a Linux bridge. Add a third namespace. Demonstrate L2 forwarding between all three by capturing ARP and ICMP traffic with tcpdump on the bridge interface. Explain when flooding vs directed forwarding occurs.
- Set up a VXLAN tunnel between two VMs (or network namespaces on different hosts). Introduce an MTU mismatch by setting the outer interface MTU to 1500 and the inner interface MTU to 1500 (no adjustment). Observe the failure mode using ping -M do -s 1472 (a 1500-byte IP packet). Fix by correcting the MTU.
- Implement a MACVLAN bridge-mode setup with three virtual NICs on the same physical interface. Verify that instances can communicate with each other and the external network. Then switch to MACVLAN private mode and explain the behavioral difference.
- Deploy OVS with two ports connected to network namespaces. Write an OpenFlow rule that mirrors traffic from port 1 to a monitoring port (port 3) while forwarding normally. Capture the mirrored traffic with tcpdump.
- On a host with an SR-IOV capable NIC, create 4 VFs. Assign each to a separate network namespace with ip link set eth0v0 netns ns1 (the VF interface name varies by driver). Run iperf3 between VF namespaces and measure throughput vs a veth+bridge baseline. Explain the performance difference.
References
- drivers/net/veth.c: veth driver implementation
- net/bridge/br_forward.c: bridge forwarding logic
- drivers/net/tun.c: TUN/TAP implementation
- net/ipv4/udp_tunnel.c: VXLAN/Geneve UDP tunnel infrastructure
- RFC 7348, Virtual eXtensible Local Area Network (VXLAN)
- RFC 8926, Geneve: Generic Network Virtualization Encapsulation
- Open vSwitch documentation. OVS Design Decisions. openvswitch.org
- Kubernetes documentation. Cluster Networking. kubernetes.io
- ip-link(8): virtual device types reference
- man 8 bridge: Linux bridge management