06 — Virtual Networking

Technical Overview

Virtual networking provides the fabric that connects containers, virtual machines, and cloud resources. Linux ships a comprehensive toolkit of virtual network devices, each with distinct properties and performance characteristics, that can be composed into increasingly sophisticated topologies. Understanding this toolkit is a prerequisite for debugging container networking, Kubernetes CNI plugins, OpenStack networking, and hypervisor connectivity. The SDN (Software-Defined Networking) layers of the major cloud providers are built on the same concepts, and in many cases on these same primitives.


Prerequisites

  • Linux network namespace concepts (ip netns)
  • Basic L2/L3 networking (Ethernet, IP forwarding, ARP)
  • Linux network stack internals (see 01-linux-network-stack.md)
  • Familiarity with ip link, bridge, ip route, ovs-vsctl

Core Content

veth Pair

A veth (virtual Ethernet) pair is a bi-directional pipe with two endpoints, each appearing as a network interface. Packets entering one end immediately appear at the other end. It is the fundamental building block for container networking:

# Create a veth pair
ip link add veth0 type veth peer name veth1

# Move one end into a network namespace (container)
ip link set veth1 netns <namespace>

# Configure both ends
ip addr add 10.0.0.1/24 dev veth0
ip link set veth0 up
ip netns exec <namespace> ip addr add 10.0.0.2/24 dev veth1
ip netns exec <namespace> ip link set veth1 up

veth has low overhead for intra-host communication: the transmitting side hands the sk_buff directly to the peer device, which by default queues it with netif_rx() for softirq processing, so no physical NIC or DMA is involved. The packet data is not copied, but each packet still goes through full network-stack processing in the receiving namespace.

Performance: as a rough guide, a veth pair can carry 20–40 Gbps between namespaces on modern hardware, with a round-trip latency of roughly 5–10 µs. Actual figures depend on CPU, packet size, and offload settings.


Linux Bridge

A Linux bridge connects multiple network interfaces at Layer 2, learning MAC addresses and forwarding frames to the correct port. It behaves like a software Ethernet switch:

# Create a bridge
ip link add name br0 type bridge
ip link set br0 up

# Add interfaces to the bridge
ip link set eth0 master br0
ip link set veth0 master br0

# Assign IP to the bridge (host connectivity)
ip addr add 192.168.1.1/24 dev br0

The bridge maintains a forwarding database (FDB), which is its MAC address table. It learns source MACs from incoming frames; a frame is then forwarded to the port on which its destination MAC was learned, or flooded to all other ports if the destination is unknown.

bridge fdb show br br0                          # show the MAC table for all ports of br0
bridge fdb add <MAC> dev <port> master static   # static entry on a bridge port
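
The learn-then-forward behaviour described above can be modelled in a few lines of shell. This is only a toy sketch, not the kernel implementation: the FDB and ACTION variables and the bridge_rx function are hypothetical names introduced for illustration, and entry ageing (300 seconds by default on a Linux bridge) is left out.

```shell
#!/bin/sh
# Toy model of bridge MAC learning and forwarding; not the kernel implementation.
# FDB holds one "mac port" entry per line. ACTION receives the forwarding decision.
FDB=""
ACTION=""

bridge_rx() {  # bridge_rx <src_mac> <dst_mac> <in_port>
    # Learn: (re)bind the source MAC to the port the frame arrived on.
    FDB=$(printf '%s\n' "$FDB" |
          awk -v m="$1" -v p="$3" 'NF && $1 != m { print } END { print m, p }')

    # Look up the destination MAC in the table.
    out=$(printf '%s\n' "$FDB" | awk -v m="$2" '$1 == m { print $2 }')

    if [ -z "$out" ]; then
        ACTION="flood"            # unknown or broadcast destination
    elif [ "$out" = "$3" ]; then
        ACTION="drop"             # destination sits behind the ingress port
    else
        ACTION="forward $out"     # known destination on another port
    fi
}
```

Calling bridge_rx for a frame from a new MAC sets ACTION to flood; once the peer has replied, frames in both directions resolve to a single port. The function writes its result to a variable rather than printing it so that the learned state is not lost in a subshell.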

Container networking with veth + bridge:

Host network namespace                  Container namespace (netns A)
+--- br0 (192.168.1.1) ---------+       +--------------------------+
|     |           |             |       |                          |
|   eth0        veth0 <---------+-------+--> veth1 (192.168.1.2)   |
|  (uplink)    (host end)       |       |    (container end)       |
+-------------------------------+       +--------------------------+

                        Container networking diagram
                        ============================

  Internet
     |
   eth0 (192.0.2.1)
     |
   br0 (192.168.1.1) ---- [Linux bridge FDB]
     |         |
   veth0A    veth0B       <-- host ends of veth pairs
     |         |
  [veth1A] [veth1B]       <-- container ends (in separate netns)
     |         |
  container A  container B
 (192.168.1.2) (192.168.1.3)

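Docker's default bridge (docker0) uses this veth-plus-bridge topology, except that the uplink is normally not enslaved to the bridge; traffic leaving the host is routed and NATed instead. Many Kubernetes CNI plugins use the same per-node pattern, and some add an overlay protocol for cross-node traffic.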
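
The two-container diagram above can be assembled by hand with the commands already shown. The following is a minimal sketch under stated assumptions: the namespace names ctrA and ctrB, the run helper, and the DRY_RUN switch are illustrative and do not come from any container runtime. By default the script only prints the commands; applying them needs root.

```shell
#!/bin/sh
# Sketch: build the bridge-plus-veth topology from the diagram.
# Prints the commands by default; set DRY_RUN=0 and run as root to apply them.
# Namespace names (ctrA, ctrB) are illustrative, not taken from any runtime.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

attach_container() {  # attach_container <suffix> <address/prefix>
    ns="ctr$1"
    run ip netns add "$ns"
    run ip link add "veth0$1" type veth peer name "veth1$1"
    run ip link set "veth1$1" netns "$ns"          # container end
    run ip link set "veth0$1" master br0           # host end joins the bridge
    run ip link set "veth0$1" up
    run ip netns exec "$ns" ip addr add "$2" dev "veth1$1"
    run ip netns exec "$ns" ip link set "veth1$1" up
    run ip netns exec "$ns" ip route add default via 192.168.1.1
}

build_topology() {
    run ip link add name br0 type bridge
    run ip addr add 192.168.1.1/24 dev br0
    run ip link set br0 up
    attach_container A 192.168.1.2/24
    attach_container B 192.168.1.3/24
}

build_topology
```

After a real run, ip netns exec ctrA ping 192.168.1.3 should exercise the bridge path. Deleting the namespaces with ip netns del and the bridge with ip link del br0 removes everything, since a veth pair disappears when either end is destroyed.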


TUN/TAP Devices

TUN (network TUNnel) and TAP (network TAP) devices are virtual NICs whose other end is a userspace process rather than a wire:

  • TUN: operates at Layer 3 (IP). Userspace opens /dev/net/tun, attaches the file descriptor to an interface, and then reads and writes raw IP packets on it. Used by VPN software (OpenVPN, userspace WireGuard implementations).
  • TAP: operates at Layer 2 (Ethernet frames). Used by hypervisors (QEMU/KVM) to give VMs a virtual NIC.

# Create a TUN device
ip tuntap add dev tun0 mode tun user $(whoami)
ip link set tun0 up
ip addr add 10.8.0.1/24 dev tun0

# Create a TAP device (for VM)
ip tuntap add dev tap0 mode tap user qemu
ip link set tap0 master br0
ip link set tap0 up

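A VPN application using TUN works as follows:

  1. The kernel routes outgoing packets to tun0 (based on the IP routing table).
  2. The VPN process reads the IP packet from the TUN file descriptor.
  3. The VPN process encrypts it and sends it via a UDP socket to the remote VPN server.
  4. The remote server decrypts the packet and writes it to its own TUN device.
  5. The remote kernel routes the packet on to its destination.

The in-kernel WireGuard implementation performs the equivalent steps entirely in kernel space, using its own device type instead of TUN, which avoids the userspace round trip; OpenVPN traditionally uses a userspace TUN device.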


VLAN (802.1Q)

VLANs partition a physical network into isolated L2 broadcast domains using 4-byte 802.1Q tags inserted into Ethernet frames. Linux creates VLAN subinterfaces:

# Create VLAN 100 subinterface on eth0
ip link add link eth0 name eth0.100 type vlan id 100
ip link set eth0.100 up
ip addr add 10.100.0.1/24 dev eth0.100

# VLANs on bridge ports (requires VLAN filtering on the bridge)
ip link set br0 type bridge vlan_filtering 1
bridge vlan add vid 100 dev eth0 pvid untagged  # access port
bridge vlan add vid 100-200 dev eth1             # trunk port

The VLAN ID (VID) field of the tag is 12 bits wide, which allows 4094 usable VLANs per L2 domain (VID 0 and 4095 are reserved). This limit drove the development of VXLAN.
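
To make the bit layout concrete: the 4-byte 802.1Q tag consists of a 2-byte TPID (0x8100) followed by a 2-byte Tag Control Information (TCI) field, which packs a 3-bit priority (PCP), a 1-bit drop-eligible flag (DEI), and the 12-bit VID. The sketch below decodes a TCI value with shell arithmetic; decode_tci and usable_vlans are hypothetical helper names used only for this illustration.

```shell
#!/bin/sh
# Decode the 16-bit Tag Control Information (TCI) field of an 802.1Q tag.
# Layout: PCP (3 bits) | DEI (1 bit) | VID (12 bits).
decode_tci() {  # decode_tci <tci, decimal or 0x-prefixed hex>
    tci=$(( $1 ))
    echo "pcp=$(( tci >> 13 )) dei=$(( (tci >> 12) & 1 )) vid=$(( tci & 0xFFF ))"
}

# Usable VLAN IDs: 2^12 values minus the two reserved ones (0 and 4095).
usable_vlans() {
    echo $(( (1 << 12) - 2 ))
}
```

For example, decode_tci 0x6064 reports priority 3 on VLAN 100, and usable_vlans yields the 4094 figure quoted above.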


VXLAN: Virtual Extensible LAN

VXLAN encapsulates Ethernet frames in UDP/IP, extending L2 networks across L3 boundaries. Key parameters:

  • VNI (VXLAN Network Identifier): 24-bit, supporting ~16 million virtual networks (vs 4094 for VLAN)
  • UDP port: 4789 (IANA assigned)
  • Outer header: IP + UDP (enables L3 routing, ECMP across fabric)

VXLAN encapsulation (IPv4 underlay; the outer Ethernet header is omitted):

+----------+-----------+-----------+-----------+----------+
| Outer IP | Outer UDP | VXLAN hdr | Inner ETH | Inner IP |
| (20B)    | (8B)      | (8B, VNI) | (14B)     | ...      |
+----------+-----------+-----------+-----------+----------+
 ^                                  ^
 outer network addressing           original frame

# Create VXLAN interface
ip link add vxlan100 type vxlan \
    id 100 \
    dstport 4789 \
    dev eth0 \
    local 10.0.0.1 \
    remote 10.0.0.2    # unicast mode (point-to-point)

# Or multicast mode for BUM (Broadcast/Unknown/Multicast) flooding
ip link add vxlan100 type vxlan \
    id 100 \
    dstport 4789 \
    dev eth0 \
    group 239.1.1.1    # multicast group

ip link set vxlan100 up
ip addr add 192.168.100.1/24 dev vxlan100
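
For reference when reading packet captures, the 8-byte VXLAN header defined in RFC 7348 is simple enough to write out by hand: a flags byte of 0x08 (VNI present), 24 reserved bits, the 24-bit VNI, and a final reserved byte. The helper below is a small sketch that prints the header as hex for a given VNI; vxlan_header_hex is a hypothetical name, not part of iproute2.

```shell
#!/bin/sh
# VXLAN header (RFC 7348): flags byte 0x08 (VNI present), 24 reserved bits,
# 24-bit VNI, 8 reserved bits. Printed here as 16 hex digits.
vxlan_header_hex() {  # vxlan_header_hex <vni, decimal>
    if [ "$1" -lt 0 ] || [ "$1" -ge $(( 1 << 24 )) ]; then
        echo "VNI out of range: $1" >&2
        return 1
    fi
    printf '08000000%06x00\n' "$1"
}
```

With the id 100 used in the commands above, the header is 0800000000006400, which is the byte sequence to look for directly after the outer UDP header in a tcpdump -X capture on port 4789.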

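VXLAN FDB: the FDB of a VXLAN device maps inner MAC addresses to remote VTEPs (VXLAN Tunnel Endpoints). When a destination MAC is unknown, the VTEP floods the frame to the multicast group or to its configured remote VTEPs; in controller-driven SDN, the mapping is instead supplied by a central directory: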

bridge fdb show dev vxlan100   # show VTEP mappings
bridge fdb add <MAC> dev vxlan100 dst <remote_vtep_ip>  # add static entry

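Cloud overlay: AWS VPC, Azure VNet, and GCP VPC all rely on overlay encapsulation in the same spirit as VXLAN to provide tenant network isolation, although the wire formats are provider-specific and not always publicly documented (AWS, for example, uses Geneve for its Gateway Load Balancer).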


Geneve

Geneve (Generic Network Virtualization Encapsulation, RFC 8926) is VXLAN's flexible successor. The key difference: Geneve has a variable-length options field that carries metadata (OVN uses this for security labels, load balancer state):

ip link add geneve0 type geneve id 42 remote 10.0.0.2

OVN (Open Virtual Network, used by Kubernetes OVN-Kubernetes CNI) uses Geneve to carry logical datapath and port binding information in the options header.


MACVLAN and IPVLAN

MACVLAN: creates virtual interfaces with distinct MAC addresses on top of a physical interface. Each MACVLAN interface has its own MAC and appears to the network as a separate NIC:

ip link add macvlan0 link eth0 type macvlan mode bridge

Modes: bridge (communicate between instances), vepa (must go via external switch), private (no direct communication), passthru (single MAC, VM-like). Used in Docker macvlan networks for containers needing direct L2 access.

IPVLAN: similar to MACVLAN, but all virtual interfaces share the parent's MAC address, so multiple IP addresses sit behind one MAC. Useful when the upstream device limits the number of MAC addresses per port (for example, some wireless APs):

ip link add ipvlan0 link eth0 type ipvlan mode l2

Open vSwitch (OVS)

OVS is a production-quality L2/L3 software switch with OpenFlow support. It has two main components:

  • ovsdb-server: configuration database
  • ovs-vswitchd: userspace switching daemon, which programs a kernel datapath module that acts as a fast-path flow cache

# Create OVS bridge
ovs-vsctl add-br ovs-br0

# Add physical NIC and VM tap interface
ovs-vsctl add-port ovs-br0 eth0
ovs-vsctl add-port ovs-br0 tap-vm1

# Add VXLAN tunnel port
ovs-vsctl add-port ovs-br0 vxlan0 -- set Interface vxlan0 \
    type=vxlan options:remote_ip=10.0.0.2 options:key=100

# OpenFlow rule: forward traffic from VM to VXLAN port
ovs-ofctl add-flow ovs-br0 "in_port=tap-vm1,actions=output:vxlan0"

# Show datapath flows (kernel cache)
ovs-dpctl dump-flows

# Show OpenFlow flows
ovs-ofctl dump-flows ovs-br0

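OVS-DPDK replaces the kernel datapath module with a userspace datapath built on DPDK poll-mode drivers (PMDs), which can reach 40–100 Gbps for VM-to-VM traffic at the cost of dedicating CPU cores to polling.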


SR-IOV: Single Root I/O Virtualization

SR-IOV allows a physical NIC to present multiple Virtual Functions (VFs) — each appearing as a separate PCIe device that can be passed directly to a VM or container:

# Enable 4 VFs on physical function
echo 4 > /sys/class/net/eth0/device/sriov_numvfs

# Verify VFs created
ip link show eth0   # shows VF list

# Configure VF MAC/VLAN
ip link set eth0 vf 0 mac 00:11:22:33:44:55 vlan 100

# Pass VF to container using device plugin
# Or to VM: qemu -device vfio-pci,host=0000:01:00.1
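
Two details of the sysfs interface are easy to trip over: the requested count must not exceed sriov_totalvfs, and the kernel rejects a change from one non-zero VF count to another, so the count has to be reset to 0 first. A defensive wrapper might look like the sketch below. The function name sriov_set_vfs and the SYSFS_ROOT override (which exists only so the logic can be tried against a scratch directory) are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: set the number of VFs on a physical function with basic checks.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
sriov_set_vfs() {  # sriov_set_vfs <pf_netdev> <count>
    dev_dir="${SYSFS_ROOT:-/sys}/class/net/$1/device"

    if [ ! -r "$dev_dir/sriov_totalvfs" ]; then
        echo "$1: no SR-IOV capability exposed" >&2
        return 1
    fi

    total=$(cat "$dev_dir/sriov_totalvfs")
    if [ "$2" -gt "$total" ]; then
        echo "$1: requested $2 VFs, device supports at most $total" >&2
        return 1
    fi

    # A non-zero count cannot be changed directly; drop to 0 first.
    echo 0 > "$dev_dir/sriov_numvfs" && echo "$2" > "$dev_dir/sriov_numvfs"
}
```

On a real host this would be invoked as root, for example sriov_set_vfs eth0 4. Note that resetting to 0 destroys the existing VFs, so any VM or container using them loses its device.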

SR-IOV VFs bypass the host kernel network stack entirely for data plane traffic. VM-to-NIC latency drops to ~5 µs (vs ~50 µs through OVS). The tradeoff is that live migration typically requires detaching the VF and attaching a new one on the destination host, which interrupts traffic and can drop established connections.


Historical Context

Linux namespaces began with mount namespaces in Linux 2.4.19 (2002). Network namespaces followed in 2.6.24 (2008), alongside the veth driver, and became fully usable over the next several releases. This infrastructure saw limited use until Docker (2013) popularized containerization and made "one container, one network namespace, one veth pair" the standard deployment model.

VXLAN was proposed by VMware, Cisco, and others in 2011 and published as RFC 7348 in 2014, to overcome the 4094-VLAN limit in cloud environments. VXLAN-style overlays have since become a standard building block of large cloud and data-center networks.

OVS originated at Nicira (founded in 2007), which VMware acquired for $1.26B in 2012. Development continued as an open-source project, and the OVS community later created OVN as the overlay control plane built on top of it.


Production Examples

Kubernetes pod networking (kubenet/flannel):

# Each node: a bridge cbr0 connecting pod veth pairs
# Cross-node: VXLAN overlay with flannel daemon managing FDB
ip link show type veth        # pod veth pairs
ip link show type vxlan       # flannel VXLAN interface
bridge fdb show dev flannel.1 # remote VTEP MAC entries (flannel names its device flannel.<VNI>)

Diagnosis: packet drops in veth path

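# Check drop counters; TX "dropped" on veth0 rises when the peer end cannot accept packets
ip -s link show veth0
# Socket receive-buffer overruns inside the container appear in protocol counters, not on the interface
ip netns exec <ns> nstat -az UdpRcvbufErrors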

# ethtool for veth (limited but shows statistics)
ethtool -S veth0

Debugging Notes

# Trace ARP across bridge/veth
ip netns exec <ns> arping -I veth1 10.0.0.1

# Check bridge FDB (MAC learning table)
bridge fdb show br br0

# Test overlay reachability
ping -c1 -W1 <remote_inner_ip>  # VXLAN inner
# Then check outer UDP connectivity
tcpdump -i eth0 'udp port 4789'

# Namespace-aware packet capture
ip netns exec <ns> tcpdump -i veth1 -w /tmp/cap.pcap

# OVS troubleshooting
ovs-appctl ofproto/trace ovs-br0 in_port=1,dl_dst=ff:ff:ff:ff:ff:ff
ovs-dpctl show

Security Implications

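  • Bridge bypass: a process with CAP_NET_ADMIN in the host network namespace (for example, a privileged or hostNetwork container) can create a new interface, attach it to the host bridge, and gain access to all bridged traffic. Kubernetes limits this by not granting CAP_NET_ADMIN by default; restricting privileged and hostNetwork pods, setting allowPrivilegeEscalation: false, and applying seccomp profiles narrow it further.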
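  • ARP spoofing on bridge: without bridge-level filtering (ebtables or nftables) or OVS-based ARP inspection, a malicious container can poison its neighbors' ARP caches and intercept traffic. Mitigation: drop ARP replies from all but validated sources, for example with a catch-all rule such as ebtables -A FORWARD -p ARP --arp-opcode Reply -j DROP placed after per-container accept rules.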
  • VXLAN without encryption: VXLAN carries tenant traffic in plaintext UDP — any host on the underlay can capture and inject traffic. Mitigation: IPsec or WireGuard on the underlay, or use overlay encryption (OVN + IPsec).
  • SR-IOV VF isolation: VF DMA is isolated by the IOMMU, but bugs in NIC firmware or hardware can still weaken isolation between VFs. Keep NIC firmware updated.

Performance Implications

Virtual device            Throughput          Latency            CPU cost
veth                      20–40 Gbps          5–10 µs            Low
Linux bridge              10–20 Gbps          10–20 µs           Medium
OVS (kernel)              5–10 Gbps           20–50 µs           Medium-High
OVS-DPDK                  40–100 Gbps         5–15 µs            High (dedicated core)
VXLAN                     5–10 Gbps           +5–10 µs (encap)   Medium
VXLAN + GSO/GRO offload   ~line rate          -                  Low with NIC offload
SR-IOV VF                 Near physical NIC   ~5 µs              Very Low

Enable VXLAN hardware offload (where supported):

ethtool -k eth0 | grep udp_tnl  # check offload capability (tx-udp_tnl-segmentation, tx-udp_tnl-csum-segmentation)
ethtool -K eth0 tx-udp_tnl-segmentation on tx-udp_tnl-csum-segmentation on

Failure Modes and Real Incidents

Incident: Kubernetes pod DNS failure due to bridge loop
A misconfigured CNI plugin added the same VXLAN interface to two bridges, creating a forwarding loop. STP (Spanning Tree Protocol) was disabled, which is the default for a Linux bridge, so BUM traffic amplified into a broadcast storm that saturated the CPU with bridge flooding. Diagnosis: bridge fdb show showed duplicate entries, and ip -s link show br0 showed rapidly growing RX counters. Fix: enable STP (ip link set br0 type bridge stp_state 1) or correct the CNI configuration.

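Failure Mode: VXLAN MTU mismatch
VXLAN encapsulation adds 50 bytes in front of the inner IP packet on an IPv4 underlay (inner Ethernet 14 + VXLAN 8 + UDP 8 + outer IPv4 20), so a full-size 1500-byte inner packet no longer fits in a 1500-byte physical MTU. The oversized encapsulated packets are dropped, or fragmented and then dropped, in the underlay. Symptom: large file transfers stall while ping and other small packets work. Fix: reduce the container/VM MTU to 1450, or enable jumbo frames (MTU 9000) on the underlay.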
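
The arithmetic behind this failure mode is worth spelling out, because it also gives the exact ping sizes for confirming it. The sketch below assumes an IPv4 underlay with untagged frames (50 bytes of overhead); inner_mtu and ping_payload are hypothetical helper names, and the remote address is left as a placeholder.

```shell
#!/bin/sh
# Assumes an IPv4 underlay and untagged frames: 50 bytes of VXLAN overhead.
VXLAN_OVERHEAD=50

# Largest inner IP MTU that still fits in the given underlay MTU.
inner_mtu() {  # inner_mtu <underlay_mtu>
    echo $(( $1 - VXLAN_OVERHEAD ))
}

# Largest ICMP echo payload that fits in one unfragmented IPv4 packet of <mtu>
# bytes: subtract the 20-byte IPv4 header and the 8-byte ICMP header.
ping_payload() {  # ping_payload <mtu>
    echo $(( $1 - 20 - 8 ))
}

underlay=1500
echo "inner MTU:          $(inner_mtu "$underlay")"
echo "expected to pass:   ping -M do -s $(ping_payload "$(inner_mtu "$underlay")") <remote_inner_ip>"
echo "expected to fail:   ping -M do -s $(ping_payload "$underlay") <remote_inner_ip>"
```

For a 1500-byte underlay this gives an inner MTU of 1450, a passing probe of 1422 bytes, and a failing probe of 1472 bytes. If the larger probe succeeds, the underlay MTU is higher than assumed.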

Incident: SR-IOV VF queue exhaustion (telecom)
A 5G gNB application consumed all 1024 queue descriptors on its VF, and other VFs on the same PF became starved. The root cause was a NIC firmware bug: descriptor exhaustion was not properly isolated between VFs. The fix required a NIC firmware update and a reduction of the VF queue count.


Modern Usage

  • CNI plugins: Cilium (eBPF), Calico (BGP + eBPF), Flannel (VXLAN), Weave (overlay), AWS VPC CNI (ENI-based, native VPC routing) — all compose the virtual networking primitives above
  • eBPF replaces bridges: Cilium avoids the Linux bridge by attaching TC eBPF programs directly to the pod veth interfaces, which removes one forwarding hop from the per-packet path
  • Geneve in place of VXLAN: OVN uses Geneve by default, and AWS Gateway Load Balancer uses it to reach appliances, because the option fields can carry extensible metadata
  • SmartNIC offload: Nvidia BlueField-2/3 and AMD Pensando DPUs can run or offload the OVS datapath on the NIC itself, greatly reducing host CPU load for virtual switching

Future Directions

  • eBPF-native CNI: as eBPF socket-level hooks mature, the overhead of veth+bridge for pod-to-pod communication may be eliminated entirely — packets redirected at socket level without entering the IP stack
  • BESS (Berkeley Extensible Software Switch): academic work on modular software switching that influenced production systems
  • P4-programmable fabrics: hardware switches programmable in P4 replace static VXLAN/EVPN with application-aware forwarding

Exercises

  1. Create three network namespaces, each attached to a Linux bridge by its own veth pair (start with two, then add the third). Demonstrate L2 forwarding between all three by capturing ARP and ICMP traffic with tcpdump on the bridge interface. Explain when flooding vs directed forwarding occurs.

  2. Set up a VXLAN tunnel between two VMs (or network namespaces on different hosts). Introduce an MTU mismatch by leaving both the outer interface MTU and the inner (VXLAN) interface MTU at 1500. Observe the failure mode using ping -M do -s 1472, which produces a 1500-byte inner IP packet. Fix by correcting the MTU.

  3. Implement a MACVLAN bridge-mode setup with three virtual NICs on the same physical interface. Verify that instances can communicate with each other and the external network. Then switch to MACVLAN private mode and explain the behavioral difference.

  4. Deploy OVS with two ports connected to network namespaces. Write an OpenFlow rule that mirrors traffic from port 1 to a monitoring port (port 3) while forwarding normally. Capture the mirrored traffic with tcpdump.

  5. On a host with an SR-IOV capable NIC, create 4 VFs. Assign each to a separate network namespace with ip link set eth0v0 netns ns1 (VF interface names vary by driver and udev naming). Run iperf3 between VF namespaces and measure throughput vs a veth+bridge baseline. Explain the performance difference.


References

  • drivers/net/veth.c — veth driver implementation
  • net/bridge/br_forward.c — bridge forwarding logic
  • drivers/net/tun.c — TUN/TAP implementation
  • net/ipv4/udp_tunnel.c — VXLAN/Geneve UDP tunnel infrastructure
  • RFC 7348 — Virtual eXtensible Local Area Network (VXLAN)
  • RFC 8926 — Geneve: Generic Network Virtualization Encapsulation
  • Open vSwitch documentation. OVS Design Decisions. openvswitch.org
  • Kubernetes documentation. Cluster Networking. kubernetes.io
  • ip-link(8) — virtual device types reference
  • bridge(8) — Linux bridge management