05 — IP Routing Internals

Technical Overview

IP routing is the process of determining the output interface and next-hop address for each packet based on the destination IP address. In Linux, routing is split between the control plane (routing daemons, user-space tools that populate routing tables) and the data plane (the kernel's fast-path lookup and forwarding code). Understanding routing internals explains why BGP route flaps can cause packet loss, how ECMP load balancing works, and how to debug asymmetric routing and unreachable destinations.

Prerequisites

IPv4 addressing and CIDR notation
IP protocol basics (TTL, fragmentation, forwarding)
Linux networking fundamentals (see 01-linux-network-stack.md)
Basic ip route, ip rule, traceroute, arping familiarity

Core Content

Control Plane vs Data Plane

Control Plane (userspace)          Data Plane (kernel fast path)
=========================          ==============================

Routing daemons:                   FIB (Forwarding Information Base):
  - FRR/Quagga (BGP, OSPF)           struct fib_table
  - BIRD                             struct fib_info
  - static routes via 'ip route'     LC-trie (level-compressed trie)

RIB (Routing Information Base):    Forwarding path:
  - All candidate routes             ip_rcv()
  - Multiple protocols' routes         → ip_route_input()
  - Best-path selection                → fib_lookup()
  - Administrative distance             → ip_forward()
                                        → ip_output()
Netlink:
  - RTM_NEWROUTE / RTM_DELROUTE     ARP/Neighbor cache:
  - Routing socket (RTNETLINK)        struct neighbour
    (control → kernel FIB)            neigh_lookup()
                                      L2 header rewrite

The kernel's FIB is the authoritative routing table. Userspace daemons program it via RTM_NEWROUTE netlink messages. The kernel never reads the RIB.

FIB and fib_table

Linux maintains multiple routing tables (identified by number 0–255). The main table is table 254 (RT_TABLE_MAIN), the local table is 255 (RT_TABLE_LOCAL):

# List routing tables
ip route show table main
ip route show table local   # local and broadcast addresses
ip route show table all    # all tables

# Add route to specific table
ip route add 10.0.0.0/8 via 192.168.1.1 table 200

Each table is a struct fib_table containing an LPM trie.

Key kernel structures (include/net/ip_fib.h):

struct fib_table {
    struct hlist_node  tb_hlist;
    u32                tb_id;       /* table number */
    int                tb_num_default;
    struct rcu_head    rcu;
    unsigned long      tb_data[0];  /* trie data */
};

struct fib_info {
    struct hlist_node  fib_hash;
    struct hlist_node  fib_lhash;
    struct net        *fib_net;
    int                fib_treeref;
    refcount_t         fib_clntref;
    unsigned int       fib_flags;
    unsigned char      fib_dead;
    unsigned char      fib_protocol;  /* routing protocol (RTPROT_*) */
    unsigned char      fib_scope;
    unsigned char      fib_type;
    __be32             fib_prefsrc;
    u32                fib_tb_id;
    u32                fib_priority;  /* metric */
    struct dst_metrics *fib_metrics;
    int                fib_nhs;
    bool               fib_nh_is_v6;
    struct fib_nh      fib_nh[0];    /* next hops (ECMP) */
};

LPM: Longest Prefix Match

Routing lookup finds the most specific (longest prefix) matching route for a destination IP. Given a routing table:

10.0.0.0/8     via 192.168.1.1
10.1.0.0/16    via 192.168.1.2
10.1.1.0/24    via 192.168.1.3
0.0.0.0/0      via 192.168.1.254  (default)

For destination 10.1.1.5: - 0.0.0.0/0 matches (all IPs, prefix length 0) - 10.0.0.0/8 matches (prefix length 8) - 10.1.0.0/16 matches (prefix length 16) - 10.1.1.0/24 matches (prefix length 24) ← longest prefix wins

Result: forward via 192.168.1.3.

LC-Trie: The Linux FIB Lookup Algorithm

Linux uses an LC-trie (Level-Compressed trie, also called a Patricia trie with path compression) for FIB lookups. Introduced to replace the routing cache in Linux 3.6 (2012).

Why LC-Trie: a simple radix trie over 32-bit IPv4 addresses has depth 32 (one bit per level = 32 comparisons). The LC-trie compresses consecutive single-child nodes, dramatically reducing lookup depth for typical routing tables:

LC-Trie structure for sample IPv4 routing table
(nodes contain "skip" count and "branch" bits):

                 Root
                 /  \
            [0...*]  [1...*]
           /       \
      [10.0.0.*] [10.1.*]
                  /      \
           [10.1.0.*]  [10.1.1.*]
                           |
                       /24 entry

For a full BGP table (800K+ routes), LC-trie typically requires 10–15 memory accesses per lookup, compared to 32 for a simple binary trie.

Source: net/ipv4/fib_trie.c

Routing Cache Removal (Linux 3.6)

Before Linux 3.6 (2012), the kernel maintained a routing cache — a hash table of recently seen (src IP, dst IP) pairs with precomputed next-hops. This eliminated LPM lookup cost for repeated flows.

The routing cache was removed because it was a DoS vector: an attacker sending packets with random source IPs filled the cache, causing cache thrashing and O(n) cache lookups. With modern CPUs and the efficient LC-trie, the raw LPM lookup became faster than cache management for most workloads.

Impact: the first packet to a new destination is slightly slower (LPM lookup), but cache pollution attacks no longer work.

IPv4 Routing Path

ip_rcv()                 [net/ipv4/ip_input.c]
    |
    +─→ NF_INET_PRE_ROUTING
    |
ip_rcv_finish()
    |
ip_route_input_noref()   [net/ipv4/route.c]
    |
    +─→ fib_lookup() ─→ LC-trie lookup
    |       |
    |       +── local delivery? ──→ ip_local_deliver()
    |       |                            |
    |       |                       NF_INET_LOCAL_IN
    |       |                            |
    |       |                       TCP/UDP/ICMP handlers
    |       |
    |       +── forward? ───────→ ip_forward()
    |                                   |
    |                            NF_INET_FORWARD
    |                                   |
    |                            ip_forward_finish()
    |                                   |
    |                            ip_output() ──→ NIC
    |
    +─→ (route not found: ICMP host unreachable to sender)

ip_output()              [net/ipv4/ip_output.c]
    |
    +─→ NF_INET_POST_ROUTING
    |
ip_finish_output()
    |
    +─→ ip_fragment() (if packet > MTU and DF bit not set)
    |
neigh_output()           [neighbor table lookup, ARP if needed]
    |
dev_queue_xmit()         [qdisc + driver]

Key path for performance: fib_lookup() → neigh_lookup() → dev_queue_xmit(). The neighbor lookup (ARP cache check) is on the critical path for every forwarded packet.

ARP and the Neighbor Table

ARP resolves an IPv4 address to an Ethernet MAC address for L2 forwarding. The Linux neighbor table (struct neigh_table) caches these mappings:

# View ARP cache
ip neigh show
# States: REACHABLE, STALE, DELAY, PROBE, FAILED, INCOMPLETE, PERMANENT

# Add static ARP entry
ip neigh add 10.0.0.2 lladdr 00:11:22:33:44:55 dev eth0 nud permanent

# Flush ARP cache
ip neigh flush dev eth0

ARP entry states: - REACHABLE: confirmed recently (within base_reachable_time_ms, default 30s) - STALE: not confirmed recently, but known good - DELAY: waiting to confirm via traffic - PROBE: sending ARP requests to confirm - FAILED: ARP resolution failed (host down or not responding) - INCOMPLETE: ARP request sent, awaiting reply

Neighbor table sizing:

sysctl net.ipv4.neigh.default.gc_thresh1  # 128 (start GC)
sysctl net.ipv4.neigh.default.gc_thresh2  # 512 (hard work GC)
sysctl net.ipv4.neigh.default.gc_thresh3  # 1024 (immediate GC, may drop)

# For large /16 or /8 subnets with many hosts
sysctl -w net.ipv4.neigh.default.gc_thresh3=65536

IPv6 NDP (Neighbor Discovery Protocol) replaces ARP for IPv6. Uses ICMPv6 Neighbor Solicitation/Advertisement multicast instead of broadcast ARP.

ECMP: Equal-Cost Multipath

ECMP (Equal-Cost Multipath) routes traffic across multiple next-hops with the same metric. Used for L3 load balancing across bonded uplinks or spine switches in data centers:

# ECMP route with two next-hops
ip route add 0.0.0.0/0 nexthop via 192.168.1.1 dev eth0 weight 1 \
                        nexthop via 192.168.2.1 dev eth1 weight 1

# Verify ECMP
ip route show
# default
#     nexthop via 192.168.1.1 dev eth0 weight 1
#     nexthop via 192.168.2.1 dev eth1 weight 1

Hash-based ECMP distributes flows across paths using a hash of the packet's 5-tuple (or 3-tuple for ICMP). The same flow always takes the same path (per-flow load balancing), avoiding packet reordering within a TCP flow.

# Hash policy: 0 = L3 (src/dst IP), 1 = L3+L4 (src/dst IP+port)
sysctl -w net.ipv4.fib_multipath_hash_policy=1

# Resilient ECMP (Linux 5.4+): doesn't rehash all flows on member add/remove
ip nexthop add id 1 via 192.168.1.1 dev eth0
ip nexthop add id 2 via 192.168.2.1 dev eth1
ip nexthop add id 10 group 1/2 type resilient buckets 64
ip route add default nhid 10

ECMP and TCP: standard ECMP does per-flow hashing — all packets in a TCP flow take the same path. Resilient ECMP additionally ensures existing flows aren't redistributed when a nexthop is added or removed (only new flows are redistributed).

Policy Routing (Multiple Tables)

Policy routing allows different routing decisions based on source address, interface, TOS, fwmark, or other attributes — beyond just destination IP:

# ip rule: evaluated in priority order (lower = first)
ip rule show
# 0:      from all lookup local
# 32766:  from all lookup main
# 32767:  from all lookup default

# Route traffic from 10.0.0.0/24 via table 100 (separate uplink)
ip rule add from 10.0.0.0/24 table 100
ip route add default via 172.16.0.1 table 100

# Route traffic marked with fwmark 0x1 via table 200 (VPN)
ip rule add fwmark 0x1 table 200
ip route add default via 10.8.0.1 dev tun0 table 200
# Packet marking: iptables -t mangle -A OUTPUT -p tcp --dport 443 -j MARK --set-mark 0x1

Policy routing is used by: - Multi-homed hosts with multiple ISPs (route return traffic via the same ISP it came in on) - VPN routing (specific traffic through tunnel, rest through default) - Kubernetes pod networking (each pod namespace has policy routes ensuring traffic exits via the correct veth)

BGP as Inter-Domain Routing

BGP (Border Gateway Protocol) is the routing protocol of the internet — it propagates reachability information between autonomous systems (ASes). In Linux, BGP is implemented by FRR (Free Range Routing) or BIRD, running in userspace and programming the FIB via netlink.

# FRR BGP status
vtysh -c 'show bgp summary'
vtysh -c 'show ip route bgp'

# Check BGP routes in Linux FIB
ip route show proto bgp
ip route show | grep -c bgp  # count BGP routes

# ECMP via BGP (iBGP multipath)
vtysh -c 'router bgp 65001'
vtysh -c ' address-family ipv4 unicast'
vtysh -c '  maximum-paths 8'

Historical Context

The Linux IPv4 routing subsystem was largely written by Alexey Kuznetsov (LKML identity: kuznet@ms2.inr.ac.ru) in the late 1990s. The routing cache was a pragmatic optimization that worked well until the scale of the internet made it a liability.

The routing cache removal in 2012 was controversial — early benchmarks showed performance regression for single-stream flows. Subsequent optimization of the LC-trie and neighbor table code recovered the performance, and the security benefit (eliminating the cache fill attack) was decisive.

ECMP in Linux was basic (only 2 paths initially) and gained practical utility when the multipath_hash_policy sysctl enabled 4-tuple hashing in Linux 4.12 (2017) and resilient ECMP was added in 5.4 (2019).

Debugging Notes

# Trace routing decision for a destination
ip route get 8.8.8.8
# Shows: which table, which next-hop, which interface, src IP

# Policy routing trace
ip route get 8.8.8.8 from 10.0.0.1
ip route get 8.8.8.8 oif eth0

# ARP resolution failures
ip neigh show | grep FAILED

# Monitor routing table changes (netlink events)
ip monitor route

# FIB statistics
cat /proc/net/fib_triestat
# Total: entries, leaves, internal nodes, reuses

# ECMP next-hop resolution
ip route show exact 0.0.0.0/0   # show all ECMP members

# Kernel routing event tracing
bpftrace -e 'kprobe:fib_lookup { @[comm] = count(); }'

# ARP storm detection
bpftrace -e 'kprobe:arp_rcv { @[comm] = count(); }'

Security Implications

Source address validation (rp_filter): reverse path filtering ensures packets arrive on the expected interface. Prevents IP spoofing in forwarded traffic: sysctl -w net.ipv4.conf.all.rp_filter=1
ICMP redirect: a router can send ICMP redirects to update a host's routing table. This allows MITM attacks: sysctl -w net.ipv4.conf.all.accept_redirects=0
ARP spoofing: any host on the same L2 can poison the ARP cache with gratuitous ARPs. Mitigation: arpwatch, static ARP entries for critical hosts, 802.1X port authentication.
BGP hijacking: a malicious AS can advertise more specific prefixes, attracting traffic. BGP origin validation (RPKI) signs IP prefix ownership; routers reject invalid announcements. Deployed by major ISPs.
ECMP and NAT: ECMP over NAT can cause asymmetric routing — outbound via path A, return via path B through a different NAT box that doesn't have the connection state. Always ensure ECMP hash is consistent across NAT zones.

Performance Implications

Operation	Typical latency	Notes
FIB lookup (LC-trie)	50–100 ns	~10 cache line accesses
ARP cache hit	20–50 ns	neigh_lookup() hash
ARP cache miss → probe	1–10 ms	Waits for ARP reply
Policy routing (1 rule)	+10–20 ns	Per rule evaluated
ECMP hash computation	+20 ns	5-tuple hash (jhash)

Performance regression profile: a host with 100,000 routes takes longer per lookup than one with 1,000 routes due to deeper trie traversal and worse cache behavior. Internet full-table BGP routers (~900K IPv4 routes) require the FIB to fit in L3 cache.

Failure Modes and Real Incidents

Incident: ARP table exhaustion in /16 subnet (2016, cloud provider) A host in a /16 subnet (65,536 hosts) sent pings to all IPs to scan for active hosts. Each ping caused an ARP request (INCOMPLETE state), quickly filling gc_thresh3 (default 1024). New ARP entries were dropped; host could only communicate with the 1024 currently-cached neighbors. Fix: gc_thresh3=65536, and rate-limit ARP scanning.

Incident: ECMP asymmetric routing under maintenance (multiple providers) When removing a spine switch from ECMP rotation during maintenance, existing flows were rehashed to remaining paths. For TCP connections, packets arrived out of order, triggering retransmits. With non-resilient ECMP: ~50% of active flows experience reordering during any topology change. Fix: resilient ECMP with graceful drain (withdraw BGP prefix with 5-minute prepend before removing).

Failure Mode: rp_filter breaks asymmetric routing A multi-homed host receives traffic on eth0 but routes return traffic via eth1 (asymmetric routing). rp_filter=1 drops the incoming packets (they don't match the route for the source IP). Symptom: one-way communication. Fix: rp_filter=2 (loose mode: accept if any route exists) or rp_filter=0 for trusted interfaces.

Modern Usage

BPF routing: eBPF programs can be attached to routing table lookup hooks (bpf_fib_lookup) to implement custom routing policies without modifying the FIB
VRF (Virtual Routing and Forwarding): Linux 4.3+ supports network-namespace-like VRFs implemented as master devices: ip link add vrf0 type vrf table 100. Each VRF has its own FIB. Used for L3 VPN and network segmentation.
MPLS in Linux: ip -M route and ip -M nexthop — Linux supports MPLS forwarding, used with FRR for MPLS/LDP/RSVP networks

Future Directions

XDP routing offload: eBPF programs that perform FIB lookup in XDP can forward packets at wire speed without sk_buff allocation — essentially turning a Linux server into a line-rate router
IPv6 segment routing (SRv6): extends IPv6 routing with a segment list in the header, enabling source routing and traffic engineering without MPLS state in the core
Hardware FIB offload: switchdev API allows Linux to program FIB entries directly into SmartNIC or switch ASIC firmware, offloading routing from the host CPU

Exercises

Set up a Linux router with three network namespaces: client, router, server. Configure ip_forward=1 and appropriate routes. Capture packet forwarding with bpftrace on ip_forward. Measure forwarding latency with hping3 -S.
Create a policy routing setup where traffic from source 10.0.0.0/24 exits via eth0 and traffic from 172.16.0.0/24 exits via eth1. Verify with ip route get <dst> from <src>. Introduce rp_filter and observe which source/interface combinations cause drops.
Populate the routing table with 10,000 synthetic routes (ip route add 10.X.Y.0/24 via ... in a loop). Measure fib_lookup latency before and after using bpftrace. Observe the FIB trie statistics in /proc/net/fib_triestat.
Configure ECMP with 3 next-hops. Use tc qdisc add dev eth0 root netem loss 100% to kill one path. Verify that ECMP automatically routes around it. Then observe what happens with resilient ECMP vs standard ECMP for existing flows.
Install FRR on a Linux VM. Configure BGP between two VMs using private ASN 65001 and 65002. Advertise a prefix from one VM and verify it appears in the other's ip route table as a BGP route. Then withdraw the prefix and observe the convergence time.

References

net/ipv4/ip_input.c — ip_rcv(), ip_local_deliver()
net/ipv4/ip_forward.c — ip_forward()
net/ipv4/fib_trie.c — LC-trie implementation
net/ipv4/route.c — ip_route_input(), ECMP
net/core/neighbour.c — ARP/NDP neighbor table
include/net/ip_fib.h — fib_table, fib_info, fib_nh
Almesberger, W. Linux IP Networking: A Guide to the Implementation and Modification of the Linux Protocol Stack. 1999.
RFC 4632 — Classless Inter-Domain Routing (CIDR)
RFC 1812 — Requirements for IP Version 4 Routers
ip-route(8), ip-rule(8), ip-neighbour(8) man pages
FRR documentation. frrouting.org
Documentation/networking/vrf.rst — Linux VRF documentation