05 — IP Routing Internals
Technical Overview
IP routing is the process of determining the output interface and next-hop address for each packet based on the destination IP address. In Linux, routing is split between the control plane (routing daemons, user-space tools that populate routing tables) and the data plane (the kernel's fast-path lookup and forwarding code). Understanding routing internals explains why BGP route flaps can cause packet loss, how ECMP load balancing works, and how to debug asymmetric routing and unreachable destinations.
Prerequisites
- IPv4 addressing and CIDR notation
- IP protocol basics (TTL, fragmentation, forwarding)
- Linux networking fundamentals (see 01-linux-network-stack.md)
- Basic ip route, ip rule, traceroute, arping familiarity
Core Content
Control Plane vs Data Plane
Control Plane (userspace)
=========================
Routing daemons:
- FRR/Quagga (BGP, OSPF)
- BIRD
- static routes via 'ip route'
RIB (Routing Information Base):
- All candidate routes
- Multiple protocols' routes
- Best-path selection
- Administrative distance
Netlink:
- RTM_NEWROUTE / RTM_DELROUTE
- Routing socket (RTNETLINK)
  (control → kernel FIB)

Data Plane (kernel fast path)
==============================
FIB (Forwarding Information Base):
  struct fib_table
  struct fib_info
  LC-trie (level-compressed trie)
Forwarding path:
  ip_rcv()
  → ip_route_input()
  → fib_lookup()
  → ip_forward()
  → ip_output()
ARP/Neighbor cache:
  struct neighbour
  neigh_lookup()
  L2 header rewrite
The kernel's FIB is the authoritative routing table. Userspace daemons program it via RTM_NEWROUTE netlink messages. The kernel never reads the RIB.
FIB and fib_table
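Linux supports multiple routing tables, each identified by a numeric ID. Historically the range was 1–255; current kernels accept 32-bit IDs. Three IDs are reserved: 253 (RT_TABLE_DEFAULT), 254 (RT_TABLE_MAIN, the main table), and 255 (RT_TABLE_LOCAL, the local table):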
# List routing tables
ip route show table main
ip route show table local # local and broadcast addresses
ip route show table all # all tables
# Add route to specific table
ip route add 10.0.0.0/8 via 192.168.1.1 table 200
Each table is a struct fib_table containing an LPM trie.
Key kernel structures (include/net/ip_fib.h):
struct fib_table {
struct hlist_node tb_hlist;
u32 tb_id; /* table number */
int tb_num_default;
struct rcu_head rcu;
unsigned long tb_data[0]; /* trie data */
};
struct fib_info {
struct hlist_node fib_hash;
struct hlist_node fib_lhash;
struct net *fib_net;
int fib_treeref;
refcount_t fib_clntref;
unsigned int fib_flags;
unsigned char fib_dead;
unsigned char fib_protocol; /* routing protocol (RTPROT_*) */
unsigned char fib_scope;
unsigned char fib_type;
__be32 fib_prefsrc;
u32 fib_tb_id;
u32 fib_priority; /* metric */
struct dst_metrics *fib_metrics;
int fib_nhs;
bool fib_nh_is_v6;
struct fib_nh fib_nh[0]; /* next hops (ECMP) */
};
LPM: Longest Prefix Match
Routing lookup finds the most specific (longest prefix) matching route for a destination IP. Given a routing table:
10.0.0.0/8 via 192.168.1.1
10.1.0.0/16 via 192.168.1.2
10.1.1.0/24 via 192.168.1.3
0.0.0.0/0 via 192.168.1.254 (default)
For destination 10.1.1.5:
- 0.0.0.0/0 matches (all IPs, prefix length 0)
- 10.0.0.0/8 matches (prefix length 8)
- 10.1.0.0/16 matches (prefix length 16)
- 10.1.1.0/24 matches (prefix length 24) ← longest prefix wins
Result: forward via 192.168.1.3.
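The selection rule above can be sketched as a linear scan in Python. This is only an illustration of the matching rule using the sample table from the text; the function name lpm_lookup is made up for this sketch, and the kernel uses a trie rather than a scan.

```python
import ipaddress

# Sample table from the text: (prefix, next hop)
ROUTES = [
    ("10.0.0.0/8", "192.168.1.1"),
    ("10.1.0.0/16", "192.168.1.2"),
    ("10.1.1.0/24", "192.168.1.3"),
    ("0.0.0.0/0", "192.168.1.254"),
]


def lpm_lookup(dst, routes=ROUTES):
    """Return (prefix, next_hop) for the longest matching prefix, or None."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, next_hop in routes:
        net = ipaddress.ip_network(prefix)
        # Keep the candidate only if it matches and is more specific.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    if best is None:
        return None
    return str(best[0]), best[1]


if __name__ == "__main__":
    for dst in ("10.1.1.5", "10.1.2.5", "10.200.0.1", "8.8.8.8"):
        print(dst, "->", lpm_lookup(dst))
```

A destination such as 10.1.2.5 falls outside the /24 and therefore resolves through the /16, which shows that the order of entries in the table does not matter, only the prefix length.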
LC-Trie: The Linux FIB Lookup Algorithm
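Linux uses an LC-trie (level-compressed trie: a binary trie with both path compression, as in a Patricia trie, and level compression) for FIB lookups. The LC-trie appeared as an alternative to the older hash-based FIB in the 2.6 series and later became the only IPv4 FIB implementation. It predates the removal of the routing cache in Linux 3.6 (2012); since that removal, every route lookup goes through it.
Why LC-trie: a simple binary trie over 32-bit IPv4 addresses has depth up to 32 (one bit per level, so up to 32 node visits). The LC-trie applies two compressions: path compression skips chains of single-child nodes, and level compression replaces dense subtrees with a single node that branches on several bits at once. Together they substantially reduce lookup depth for typical routing tables: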
LC-Trie structure for sample IPv4 routing table
(nodes contain "skip" count and "branch" bits):
Root
├── [0...*]
│   ├── [10.0.0.*]
│   └── [10.1.*]
│       ├── [10.1.0.*]
│       └── [10.1.1.*]
│           └── /24 entry
└── [1...*]
For a full BGP table (800K+ routes), LC-trie typically requires 10–15 memory accesses per lookup, compared to 32 for a simple binary trie.
Source: net/ipv4/fib_trie.c
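To make the motivation for compression concrete, here is a minimal sketch of an uncompressed one-bit-per-level trie in Python. It is not the kernel's LC-trie: the Node class and helper names are invented for this illustration. It counts how many nodes a lookup visits and how many nodes have a single child and no route, which is what path compression would skip.

```python
import ipaddress


class Node:
    __slots__ = ("children", "next_hop")

    def __init__(self):
        self.children = [None, None]  # child for bit 0, child for bit 1
        self.next_hop = None          # set only where a prefix ends


def insert(root, prefix, next_hop):
    net = ipaddress.ip_network(prefix)
    bits = int(net.network_address)
    node = root
    for i in range(net.prefixlen):
        bit = (bits >> (31 - i)) & 1
        if node.children[bit] is None:
            node.children[bit] = Node()
        node = node.children[bit]
    node.next_hop = next_hop


def lookup(root, dst):
    """Return (next_hop, nodes_visited) using longest-prefix match."""
    bits = int(ipaddress.ip_address(dst))
    node, best, visited = root, root.next_hop, 1
    for i in range(32):
        node = node.children[(bits >> (31 - i)) & 1]
        if node is None:
            break
        visited += 1
        if node.next_hop is not None:
            best = node.next_hop  # remember the most specific match so far
    return best, visited


def count_nodes(node):
    """Return (total, skippable); skippable = one child and no route."""
    if node is None:
        return 0, 0
    kids = [c for c in node.children if c is not None]
    total = 1
    skippable = int(len(kids) == 1 and node.next_hop is None)
    for child in kids:
        t, s = count_nodes(child)
        total += t
        skippable += s
    return total, skippable


def build_sample():
    root = Node()
    for prefix, nh in [("10.0.0.0/8", "192.168.1.1"),
                       ("10.1.0.0/16", "192.168.1.2"),
                       ("10.1.1.0/24", "192.168.1.3"),
                       ("0.0.0.0/0", "192.168.1.254")]:
        insert(root, prefix, nh)
    return root


if __name__ == "__main__":
    root = build_sample()
    print(lookup(root, "10.1.1.5"))
    print(count_nodes(root))
```

For the four-route sample table, most nodes in this trie exist only to spell out bits between prefixes, so a path-compressed layout would need far fewer node visits for the same answer.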
Routing Cache Removal (Linux 3.6)
Before Linux 3.6 (2012), the kernel maintained a routing cache — a hash table of recently seen (src IP, dst IP) pairs with precomputed next-hops. This eliminated LPM lookup cost for repeated flows.
The routing cache was removed because it was a DoS vector: an attacker sending packets with random source IPs filled the cache, causing cache thrashing and O(n) cache lookups. With modern CPUs and the efficient LC-trie, the raw LPM lookup became faster than cache management for most workloads.
Impact: the first packet to a new destination is slightly slower (LPM lookup), but cache pollution attacks no longer work.
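The cache-fill problem can be illustrated with a toy model in Python. This is an assumption-laden sketch, not the old kernel data structure: the cache is a plain dictionary keyed by (source, destination), and the addresses are documentation-range placeholders.

```python
import ipaddress
import random


def simulate_cache(num_packets, spoof_sources, seed=1):
    """Return the number of cache entries created for a single destination."""
    rng = random.Random(seed)
    dst = "203.0.113.10"
    cache = {}  # (src, dst) -> precomputed next hop
    for _ in range(num_packets):
        if spoof_sources:
            # Attacker picks a fresh random source address per packet.
            src = str(ipaddress.IPv4Address(rng.getrandbits(32)))
        else:
            src = "198.51.100.7"
        cache.setdefault((src, dst), "via 192.168.1.1")
    return len(cache)


if __name__ == "__main__":
    print("legitimate flow:", simulate_cache(5000, spoof_sources=False))
    print("spoofed sources:", simulate_cache(5000, spoof_sources=True))
```

In this model the cache grows with the number of packets the attacker sends, while a destination-only FIB lookup needs a single route entry regardless of the source addresses seen.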
IPv4 Routing Path
ip_rcv() [net/ipv4/ip_input.c]
|
+─→ NF_INET_PRE_ROUTING
|
ip_rcv_finish()
|
ip_route_input_noref() [net/ipv4/route.c]
|
+─→ fib_lookup() ─→ LC-trie lookup
| |
| +── local delivery? ──→ ip_local_deliver()
| | |
| | NF_INET_LOCAL_IN
| | |
| | TCP/UDP/ICMP handlers
| |
| +── forward? ───────→ ip_forward()
| |
| NF_INET_FORWARD
| |
| ip_forward_finish()
| |
| ip_output() ──→ NIC
|
+─→ (route not found: ICMP destination unreachable, code "network unreachable", to sender)
ip_output() [net/ipv4/ip_output.c]
|
+─→ NF_INET_POST_ROUTING
|
ip_finish_output()
|
+─→ ip_fragment() (if packet > MTU and DF bit not set)
|
neigh_output() [neighbor table lookup, ARP if needed]
|
dev_queue_xmit() [qdisc + driver]
Key path for performance: fib_lookup() → neigh_lookup() → dev_queue_xmit(). The neighbor lookup (ARP cache check) is on the critical path for every forwarded packet.
ARP and the Neighbor Table
ARP resolves an IPv4 address to an Ethernet MAC address for L2 forwarding. The Linux neighbor table (struct neigh_table) caches these mappings:
# View ARP cache
ip neigh show
# States: REACHABLE, STALE, DELAY, PROBE, FAILED, INCOMPLETE, PERMANENT
# Add static ARP entry
ip neigh add 10.0.0.2 lladdr 00:11:22:33:44:55 dev eth0 nud permanent
# Flush ARP cache
ip neigh flush dev eth0
ARP entry states:
- REACHABLE: confirmed recently (within a randomized interval around base_reachable_time_ms, default 30 s)
- STALE: not confirmed recently, but known good
- DELAY: waiting to confirm via traffic
- PROBE: sending ARP requests to confirm
- FAILED: ARP resolution failed (host down or not responding)
- INCOMPLETE: ARP request sent, awaiting reply
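The states listed above form a small state machine. The following Python sketch encodes a simplified version of the transitions; the event names are invented for this illustration and are not kernel identifiers, and several real transitions (for example, an unsolicited update moving an entry straight to REACHABLE or STALE) are left out.

```python
# Simplified neighbor state transitions: (state, event) -> next state.
# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    ("INCOMPLETE", "reply_received"): "REACHABLE",
    ("INCOMPLETE", "retries_exhausted"): "FAILED",
    ("REACHABLE", "reachable_time_expired"): "STALE",
    ("STALE", "packet_sent"): "DELAY",
    ("DELAY", "confirmed_by_upper_layer"): "REACHABLE",
    ("DELAY", "delay_timer_expired"): "PROBE",
    ("PROBE", "reply_received"): "REACHABLE",
    ("PROBE", "retries_exhausted"): "FAILED",
}


def step(state, event):
    """Return the next state; unknown events leave the state unchanged."""
    if state == "PERMANENT":
        return state  # static entries never age out
    return TRANSITIONS.get((state, event), state)


def run(state, events):
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    # A neighbor that resolves, goes quiet, and then stops answering probes.
    events = ["reply_received", "reachable_time_expired", "packet_sent",
              "delay_timer_expired", "retries_exhausted"]
    print(run("INCOMPLETE", events))
```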
Neighbor table sizing:
sysctl net.ipv4.neigh.default.gc_thresh1 # 128 (below this, GC leaves entries alone)
sysctl net.ipv4.neigh.default.gc_thresh2 # 512 (soft limit: GC becomes aggressive above this)
sysctl net.ipv4.neigh.default.gc_thresh3 # 1024 (hard limit: new entries may be refused)
# For large /16 or /8 subnets with many hosts
sysctl -w net.ipv4.neigh.default.gc_thresh3=65536
IPv6 NDP (Neighbor Discovery Protocol) replaces ARP for IPv6. Uses ICMPv6 Neighbor Solicitation/Advertisement multicast instead of broadcast ARP.
ECMP: Equal-Cost Multipath
ECMP (Equal-Cost Multipath) routes traffic to one prefix across multiple next-hops that have the same cost. Used for L3 load balancing across multiple uplinks or spine switches in data centers:
# ECMP route with two next-hops
ip route add 0.0.0.0/0 nexthop via 192.168.1.1 dev eth0 weight 1 \
nexthop via 192.168.2.1 dev eth1 weight 1
# Verify ECMP
ip route show
# default
# nexthop via 192.168.1.1 dev eth0 weight 1
# nexthop via 192.168.2.1 dev eth1 weight 1
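Hash-based ECMP distributes flows across paths using a hash of selected header fields. By default (policy 0) the kernel hashes only the source and destination IP addresses; with policy 1 it hashes the 5-tuple (addresses, protocol, and ports). Either way, the same flow always takes the same path (per-flow load balancing), which avoids packet reordering within a TCP flow.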
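The effect of the hash inputs can be sketched in Python. This is a simplified model under stated assumptions: SHA-256 stands in for the kernel's hash function, next-hop selection uses a plain modulo rather than the kernel's weighted selection, and the function names are made up for the example.

```python
import hashlib

NEXTHOPS = ["192.168.1.1 dev eth0", "192.168.2.1 dev eth1"]


def flow_hash(src, dst, proto, sport, dport, policy):
    """Hash L3 fields only (policy 0) or the full 5-tuple (policy 1)."""
    if policy == 0:
        key = f"{src}|{dst}"
    else:
        key = f"{src}|{dst}|{proto}|{sport}|{dport}"
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big")


def select_nexthop(flow, policy, nexthops=NEXTHOPS):
    """flow is a (src, dst, proto, sport, dport) tuple."""
    return nexthops[flow_hash(*flow, policy) % len(nexthops)]


if __name__ == "__main__":
    # 64 TCP connections between the same two hosts, differing in source port.
    flows = [("10.0.0.1", "10.0.1.1", 6, port, 443)
             for port in range(40000, 40064)]
    for policy in (0, 1):
        used = {select_nexthop(f, policy) for f in flows}
        print(f"policy {policy}: {len(used)} next hop(s) used")
```

With L3-only hashing, every connection between one pair of hosts lands on the same next hop, so a single busy host pair cannot spread over both links; including ports lets individual connections spread while each one still stays on one path.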
# Hash policy: 0 = L3 (src/dst IP), 1 = L3+L4 (src/dst IP+port)
sysctl -w net.ipv4.fib_multipath_hash_policy=1
# Resilient ECMP (nexthop objects: Linux 5.3+; resilient groups: Linux 5.13+): avoids rehashing all flows on member add/remove
ip nexthop add id 1 via 192.168.1.1 dev eth0
ip nexthop add id 2 via 192.168.2.1 dev eth1
ip nexthop add id 10 group 1/2 type resilient buckets 64
ip route add default nhid 10
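ECMP and TCP: standard ECMP does per-flow hashing, so all packets in a TCP flow take the same path while the set of next-hops is stable. Resilient ECMP additionally limits disruption when a next-hop is added or removed: flows hashed to unaffected buckets keep their path, and only the buckets that belonged to the removed next-hop (or that are handed to a new one) are reassigned.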
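The difference between rehashing and bucket-based selection can be sketched as follows. This Python model is a simplification under stated assumptions: the non-resilient case uses hash modulo N (the kernel's actual selection differs in detail), the resilient case omits the idle timers that real resilient groups use when rebalancing buckets, and all names are invented for the example.

```python
import hashlib


def flow_hash(flow_id):
    digest = hashlib.sha256(str(flow_id).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def modulo_select(flow_id, nexthops):
    return nexthops[flow_hash(flow_id) % len(nexthops)]


def build_buckets(nexthops, num_buckets=64):
    """Assign buckets to next hops round-robin."""
    return [nexthops[i % len(nexthops)] for i in range(num_buckets)]


def remove_from_buckets(buckets, dead, survivors):
    """Reassign only the buckets owned by the removed next hop."""
    out, j = [], 0
    for owner in buckets:
        if owner == dead:
            out.append(survivors[j % len(survivors)])
            j += 1
        else:
            out.append(owner)
    return out


def bucket_select(flow_id, buckets):
    return buckets[flow_hash(flow_id) % len(buckets)]


def disrupted_survivors(num_flows=1000):
    """Count flows that were NOT on the removed next hop but moved anyway."""
    before = ["nh1", "nh2", "nh3", "nh4"]
    after = ["nh1", "nh2", "nh3"]
    old_buckets = build_buckets(before)
    new_buckets = remove_from_buckets(old_buckets, "nh4", after)
    moved_modulo = moved_resilient = 0
    for flow in range(num_flows):
        old = modulo_select(flow, before)
        if old != "nh4" and modulo_select(flow, after) != old:
            moved_modulo += 1
        old = bucket_select(flow, old_buckets)
        if old != "nh4" and bucket_select(flow, new_buckets) != old:
            moved_resilient += 1
    return moved_modulo, moved_resilient


if __name__ == "__main__":
    print(disrupted_survivors())
```

In this model, flows that were on the removed next hop have to move in both schemes; the bucket table only guarantees that flows on the surviving next hops stay where they were.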
Policy Routing (Multiple Tables)
Policy routing allows different routing decisions based on source address, interface, TOS, fwmark, or other attributes — beyond just destination IP:
# ip rule: evaluated in priority order (lower = first)
ip rule show
# 0: from all lookup local
# 32766: from all lookup main
# 32767: from all lookup default
# Route traffic from 10.0.0.0/24 via table 100 (separate uplink)
ip rule add from 10.0.0.0/24 table 100
ip route add default via 172.16.0.1 table 100
# Route traffic marked with fwmark 0x1 via table 200 (VPN)
ip rule add fwmark 0x1 table 200
ip route add default via 10.8.0.1 dev tun0 table 200
# Packet marking: iptables -t mangle -A OUTPUT -p tcp --dport 443 -j MARK --set-mark 0x1
Policy routing is used by:
- Multi-homed hosts with multiple ISPs (route return traffic via the same ISP it came in on)
- VPN routing (specific traffic through tunnel, rest through default)
- Kubernetes pod networking (each pod namespace has policy routes ensuring traffic exits via the correct veth)
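Rule evaluation can be sketched in Python as a walk over rules in priority order, where the first matching rule whose table yields a route wins. The rule priorities, selector keys, and table contents below are illustrative assumptions built from the commands shown earlier; this is a model of the lookup order, not of the kernel's fib_rules code.

```python
import ipaddress

# (priority, selector, table). Selector keys are hypothetical names.
RULES = [
    (0, {}, "local"),
    (32764, {"fwmark": 0x1}, "200"),
    (32765, {"from": "10.0.0.0/24"}, "100"),
    (32766, {}, "main"),
    (32767, {}, "default"),
]

# Table name -> list of (prefix, action). Contents are illustrative.
TABLES = {
    "local": [("192.168.1.10/32", "local")],
    "100": [("0.0.0.0/0", "via 172.16.0.1")],
    "200": [("0.0.0.0/0", "via 10.8.0.1 dev tun0")],
    "main": [("192.168.1.0/24", "dev eth0"), ("0.0.0.0/0", "via 192.168.1.1")],
    "default": [],
}


def rule_matches(selector, src, fwmark):
    if "from" in selector:
        if ipaddress.ip_address(src) not in ipaddress.ip_network(selector["from"]):
            return False
    if "fwmark" in selector and selector["fwmark"] != fwmark:
        return False
    return True


def table_lookup(table, dst):
    """Longest-prefix match inside one table; None if nothing matches."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, action in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, action)
    return best[1] if best else None


def route(dst, src="0.0.0.0", fwmark=0):
    """Return (table, action) from the first rule whose table has a route."""
    for _priority, selector, table in sorted(RULES, key=lambda r: r[0]):
        if not rule_matches(selector, src, fwmark):
            continue
        action = table_lookup(TABLES[table], dst)
        if action is not None:
            return table, action
    return None


if __name__ == "__main__":
    print(route("8.8.8.8", src="10.0.0.5"))
    print(route("8.8.8.8", src="192.168.1.10", fwmark=0x1))
    print(route("8.8.8.8", src="192.168.1.10"))
```

Note that a matching rule whose table has no route for the destination does not end the search; evaluation continues with the next rule, which is why the main table still acts as a fallback.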
BGP as Inter-Domain Routing
BGP (Border Gateway Protocol) is the routing protocol of the internet — it propagates reachability information between autonomous systems (ASes). In Linux, BGP is implemented by FRR (Free Range Routing) or BIRD, running in userspace and programming the FIB via netlink.
# FRR BGP status
vtysh -c 'show bgp summary'
vtysh -c 'show ip route bgp'
# Check BGP routes in Linux FIB
ip route show proto bgp
ip route show | grep -c bgp # count BGP routes
# ECMP via BGP multipath (this form applies to eBGP paths; use 'maximum-paths ibgp 8' for iBGP)
vtysh -c 'configure terminal' \
  -c 'router bgp 65001' \
  -c 'address-family ipv4 unicast' \
  -c 'maximum-paths 8'
Historical Context
The Linux IPv4 routing subsystem was largely written by Alexey Kuznetsov (LKML identity: kuznet@ms2.inr.ac.ru) in the late 1990s. The routing cache was a pragmatic optimization that worked well until the scale of the internet made it a liability.
The routing cache removal in 2012 was controversial — early benchmarks showed performance regression for single-stream flows. Subsequent optimization of the LC-trie and neighbor table code recovered the performance, and the security benefit (eliminating the cache fill attack) was decisive.
ECMP in Linux was basic (only 2 paths initially) and gained practical utility when the fib_multipath_hash_policy sysctl enabled L4 (5-tuple) hashing in Linux 4.12 (2017). Nexthop objects followed in 5.3 (2019), and resilient nexthop groups in 5.13 (2021).
Debugging Notes
# Trace routing decision for a destination
ip route get 8.8.8.8
# Shows: which table, which next-hop, which interface, src IP
# Policy routing trace
ip route get 8.8.8.8 from 10.0.0.1
ip route get 8.8.8.8 oif eth0
# ARP resolution failures
ip neigh show | grep FAILED
# Monitor routing table changes (netlink events)
ip monitor route
# FIB statistics
cat /proc/net/fib_triestat
# Reports trie depth (average and max), leaves, prefixes, internal nodes, and pointer counts
# ECMP next-hop resolution
ip route show exact 0.0.0.0/0 # show all ECMP members
# Kernel routing event tracing (fib_lookup() is usually an inline wrapper, so probe fib_table_lookup)
bpftrace -e 'kprobe:fib_table_lookup { @[comm] = count(); }'
# ARP storm detection
bpftrace -e 'kprobe:arp_rcv { @[comm] = count(); }'
Security Implications
- Source address validation (rp_filter): reverse path filtering ensures packets arrive on the expected interface. Prevents IP spoofing in forwarded traffic:
sysctl -w net.ipv4.conf.all.rp_filter=1
- ICMP redirect: a router can send ICMP redirects to update a host's routing table. This allows MITM attacks:
sysctl -w net.ipv4.conf.all.accept_redirects=0
- ARP spoofing: any host on the same L2 can poison the ARP cache with gratuitous ARPs. Mitigation: arpwatch, static ARP entries for critical hosts, 802.1X port authentication.
- BGP hijacking: a malicious AS can advertise more specific prefixes, attracting traffic. BGP origin validation (RPKI) signs IP prefix ownership; routers reject invalid announcements. Deployed by major ISPs.
- ECMP and NAT: ECMP over NAT can cause asymmetric routing — outbound via path A, return via path B through a different NAT box that doesn't have the connection state. Always ensure ECMP hash is consistent across NAT zones.
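For the BGP hijacking item above, the origin-validation decision can be sketched in Python. The sketch follows the valid / invalid / not-found classification of route origin validation; the ROA entries and AS numbers are made-up documentation values, and real routers obtain validated ROA data from an RPKI validator rather than a hard-coded list.

```python
import ipaddress

# Hypothetical ROAs: (prefix, max_length, authorized origin ASN)
ROAS = [
    ("192.0.2.0/24", 24, 64500),
    ("203.0.113.0/24", 24, 64501),
]


def validate(prefix, origin_asn, roas=ROAS):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_len, roa_asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue  # this ROA does not cover the announced prefix
        covered = True
        if roa_asn == origin_asn and route.prefixlen <= max_len:
            return "valid"
    # Covered by at least one ROA but matched none: reject-worthy.
    return "invalid" if covered else "not-found"


if __name__ == "__main__":
    print(validate("192.0.2.0/24", 64500))     # legitimate origin
    print(validate("192.0.2.0/25", 64666))     # more-specific hijack attempt
    print(validate("198.51.100.0/24", 64502))  # no ROA covers this prefix
```

The max_length field is what defeats the more-specific attack described above: even an announcement that forges the authorized origin AS is invalid if it is longer than the ROA allows.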
Performance Implications
| Operation | Typical latency | Notes |
|---|---|---|
| FIB lookup (LC-trie) | 50–100 ns | ~10 cache line accesses |
| ARP cache hit | 20–50 ns | neigh_lookup() hash |
| ARP cache miss → probe | 1–10 ms | Waits for ARP reply |
| Policy routing (1 rule) | +10–20 ns | Per rule evaluated |
| ECMP hash computation | +20 ns | 5-tuple hash (jhash) |
Performance regression profile: a host with 100,000 routes takes longer per lookup than one with 1,000 routes due to deeper trie traversal and worse cache behavior. Routers carrying a full Internet BGP table (~900K IPv4 routes) see the latencies above only while the frequently used part of the FIB stays in CPU cache; lookups that miss to DRAM are considerably slower.
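To put the per-operation figures in the table into context, the following Python sketch computes the time budget per packet at line rate. It assumes minimum-size 64-byte Ethernet frames plus 20 bytes of per-frame wire overhead (preamble and inter-frame gap); the function name is invented for the example.

```python
def per_packet_budget(link_bps, frame_bytes=64, overhead_bytes=20):
    """Return (packets_per_second, nanoseconds_per_packet) at line rate."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    pps = link_bps / bits_on_wire
    return pps, 1e9 / pps


if __name__ == "__main__":
    for gbps in (1, 10, 100):
        pps, ns = per_packet_budget(gbps * 1e9)
        print(f"{gbps:>3} Gbit/s: {pps / 1e6:.2f} Mpps, {ns:.1f} ns per packet")
```

At 10 Gbit/s with minimum-size frames the budget is about 67 ns per packet, which is in the same range as a single FIB lookup in the table. This is why forwarding small packets at line rate generally needs multiple cores, batching, or offload.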
Failure Modes and Real Incidents
Incident: ARP table exhaustion in /16 subnet (2016, cloud provider)
A host in a /16 subnet (65,536 hosts) sent pings to all IPs to scan for active hosts. Each ping caused an ARP request (INCOMPLETE state), quickly filling gc_thresh3 (default 1024). New ARP entries were dropped; host could only communicate with the 1024 currently-cached neighbors. Fix: gc_thresh3=65536, and rate-limit ARP scanning.
Incident: ECMP asymmetric routing under maintenance (multiple providers)
When a spine switch was removed from ECMP rotation during maintenance, existing flows were rehashed onto the remaining paths. For TCP connections, packets arrived out of order, triggering retransmits. With non-resilient ECMP, on the order of 50% of active flows can be moved to a different path by a single topology change. Fix: resilient ECMP with graceful drain (prepend the AS path for about 5 minutes to drain traffic, then withdraw the BGP prefix before removing the switch).
Failure Mode: rp_filter breaks asymmetric routing
A multi-homed host receives traffic on eth0 but routes return traffic via eth1 (asymmetric routing). rp_filter=1 drops the incoming packets (they don't match the route for the source IP). Symptom: one-way communication. Fix: rp_filter=2 (loose mode: accept if any route exists) or rp_filter=0 for trusted interfaces.
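The strict and loose checks can be modeled in a few lines of Python. This sketch assumes a hypothetical FIB for the multi-homed host in which the default route points out eth1; the function names are invented, and the real check in the kernel handles more cases (local addresses, multipath routes, per-interface settings).

```python
import ipaddress

# Hypothetical FIB: (prefix, outgoing interface)
FIB = [
    ("0.0.0.0/0", "eth1"),
    ("192.168.1.0/24", "eth0"),
    ("172.16.0.0/24", "eth1"),
]


def best_route_iface(addr, fib=FIB):
    """Return the interface of the longest-prefix route for addr, or None."""
    ip = ipaddress.ip_address(addr)
    best = None
    for prefix, iface in fib:
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, iface)
    return best[1] if best else None


def rp_filter_accepts(src, in_iface, mode, fib=FIB):
    """mode 0 = off, 1 = strict, 2 = loose."""
    if mode == 0:
        return True
    iface = best_route_iface(src, fib)
    if iface is None:
        return False          # no route back to the source at all
    if mode == 2:
        return True           # loose: reachable via any interface
    return iface == in_iface  # strict: must arrive on the return interface


if __name__ == "__main__":
    # Packet from 8.8.8.8 arrives on eth0, but the route back uses eth1.
    for mode in (0, 1, 2):
        print("rp_filter =", mode, "->", rp_filter_accepts("8.8.8.8", "eth0", mode))
```

In this model strict mode drops the packet because the best route back to 8.8.8.8 leaves via eth1, while loose mode accepts it because some route to the source exists.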
Modern Usage
- BPF routing: eBPF programs attached at XDP or tc hooks can call the bpf_fib_lookup() helper to query the kernel FIB and implement custom forwarding policies without modifying the FIB
- VRF (Virtual Routing and Forwarding): Linux 4.3+ supports network-namespace-like VRFs implemented as master devices: ip link add vrf0 type vrf table 100. Each VRF has its own FIB. Used for L3 VPN and network segmentation.
- MPLS in Linux: ip -M route and ip -M nexthop. Linux supports MPLS forwarding, used with FRR for MPLS/LDP/RSVP networks
Future Directions
- XDP routing offload: eBPF programs that perform FIB lookup in XDP can forward packets at wire speed without sk_buff allocation — essentially turning a Linux server into a line-rate router
- IPv6 segment routing (SRv6): extends IPv6 routing with a segment list in the header, enabling source routing and traffic engineering without MPLS state in the core
- Hardware FIB offload: switchdev API allows Linux to program FIB entries directly into SmartNIC or switch ASIC firmware, offloading routing from the host CPU
Exercises
- Set up a Linux router with three network namespaces: client, router, server. Configure ip_forward=1 and appropriate routes. Capture packet forwarding with bpftrace on ip_forward. Measure forwarding latency with hping3 -S.
- Create a policy routing setup where traffic from source 10.0.0.0/24 exits via eth0 and traffic from 172.16.0.0/24 exits via eth1. Verify with ip route get <dst> from <src>. Introduce rp_filter and observe which source/interface combinations cause drops.
- Populate the routing table with 10,000 synthetic routes (ip route add 10.X.Y.0/24 via ... in a loop). Measure FIB lookup latency before and after using bpftrace (probe fib_table_lookup, since fib_lookup() is usually inlined). Observe the FIB trie statistics in /proc/net/fib_triestat.
- Configure ECMP with 3 next-hops. Use tc qdisc add dev eth0 root netem loss 100% to kill one path. Check whether ECMP routes around it, with and without net.ipv4.fib_multipath_use_neigh=1. Then observe what happens with resilient ECMP vs standard ECMP for existing flows.
- Install FRR on a Linux VM. Configure BGP between two VMs using private ASN 65001 and 65002. Advertise a prefix from one VM and verify it appears in the other's ip route table as a BGP route. Then withdraw the prefix and observe the convergence time.
References
- net/ipv4/ip_input.c: ip_rcv(), ip_local_deliver()
- net/ipv4/ip_forward.c: ip_forward()
- net/ipv4/fib_trie.c: LC-trie implementation
- net/ipv4/route.c: ip_route_input(), ECMP
- net/core/neighbour.c: ARP/NDP neighbor table
- include/net/ip_fib.h: fib_table, fib_info, fib_nh
- Almesberger, W. Linux IP Networking: A Guide to the Implementation and Modification of the Linux Protocol Stack. 1999.
- RFC 4632: Classless Inter-Domain Routing (CIDR)
- RFC 1812: Requirements for IP Version 4 Routers
- ip-route(8), ip-rule(8), ip-neighbour(8) man pages
- FRR documentation. frrouting.org
- Documentation/networking/vrf.rst: Linux VRF documentation