03 — Netfilter and iptables
Technical Overview
Netfilter is the framework inside the Linux kernel that provides hooks at well-defined points in the packet-processing path. User-space tools — iptables, nftables, conntrack, ipset — attach filtering, NAT, and mangling logic to these hooks without modifying kernel source. Netfilter underpins firewalls, NAT gateways, VPN masquerading, Kubernetes kube-proxy, and cloud provider security groups. Understanding its internals explains both its power and its notorious performance cliff at scale.
Prerequisites
- Linux network stack fundamentals (see 01-linux-network-stack.md)
- Basic IP networking (routing, NAT, firewalls)
- Familiarity with iptables, nft, conntrack, ipset CLI tools
- C struct reading ability for kernel data structures
Core Content
Netfilter Hook Points
Netfilter registers hooks at five points in the IPv4 (and IPv6) packet path. Every packet traversing the kernel passes through some or all of these hooks in order:
Incoming packet (from NIC)
        |
        v
[PREROUTING hook]             <-- NF_INET_PRE_ROUTING
        |
        v
 Routing decision
    /         \
   /           \
  v             v
[INPUT]     [FORWARD]         <-- NF_INET_LOCAL_IN / NF_INET_FORWARD
  |             |
  v             v
Local process  [POSTROUTING]  <-- NF_INET_POST_ROUTING
                |
                v
        Outgoing packet (to NIC)
Locally generated packet:
Local process
|
v
[OUTPUT hook] <-- NF_INET_LOCAL_OUT
|
v
[POSTROUTING hook] <-- NF_INET_POST_ROUTING
|
v
NIC / wire
Each hook has a priority (integer) that determines the order when multiple subsystems (iptables, nftables, conntrack, SELinux) register callbacks on the same hook. Callbacks run in ascending priority order, so the lowest value is called first. Connection tracking registers at NF_IP_PRI_CONNTRACK (-200), iptables NAT at NF_IP_PRI_NAT_DST (-100), iptables filter at NF_IP_PRI_FILTER (0).
Hook return values: NF_ACCEPT (continue), NF_DROP (discard packet), NF_STOLEN (handler owns packet), NF_QUEUE (send to userspace via NFQUEUE), NF_REPEAT (re-invoke hook).
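The priority ordering above can be sketched as a small user-space model. This is only an illustration, not kernel code: HookEntry, run_hooks and the port-23 rule are hypothetical names, and the raw (-300) and mangle (-150) values are the usual NF_IP_PRI_RAW and NF_IP_PRI_MANGLE constants, added here so the PREROUTING order is complete.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Verdict values as defined in include/uapi/linux/netfilter.h
NF_DROP, NF_ACCEPT = 0, 1


@dataclass
class HookEntry:
    name: str
    priority: int
    fn: Callable[[dict], int]


def run_hooks(entries: List[HookEntry], packet: dict) -> Tuple[int, List[str]]:
    """Call hooks in ascending priority order; stop at the first non-ACCEPT verdict."""
    called = []
    for entry in sorted(entries, key=lambda e: e.priority):
        called.append(entry.name)
        verdict = entry.fn(packet)
        if verdict != NF_ACCEPT:
            return verdict, called
    return NF_ACCEPT, called


def accept(packet: dict) -> int:
    return NF_ACCEPT


def drop_telnet(packet: dict) -> int:
    # Hypothetical rule, present only to show an early exit from the hook list.
    return NF_DROP if packet.get("dport") == 23 else NF_ACCEPT


# Registration order is deliberately shuffled; only the priority decides.
prerouting = [
    HookEntry("nat (DNAT)", -100, accept),
    HookEntry("conntrack", -200, accept),
    HookEntry("mangle", -150, drop_telnet),
    HookEntry("raw", -300, accept),
]

if __name__ == "__main__":
    print(run_hooks(prerouting, {"dport": 80}))
    print(run_hooks(prerouting, {"dport": 23}))
```

A packet to port 23 never reaches the NAT callback in this model, which mirrors how NF_DROP from an earlier hook ends processing for that packet.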
iptables: Tables, Chains, Rules
iptables is a Netfilter frontend. It organizes rules into tables containing chains of rules.
Tables and their hooks:
| Table | Purpose | Chains available |
|---|---|---|
| raw | Bypass conntrack (NOTRACK) | PREROUTING, OUTPUT |
| mangle | Modify packet fields (TOS, TTL) | All 5 chains |
| nat | Address translation | PREROUTING, INPUT, OUTPUT, POSTROUTING |
| filter | Accept/drop/reject | INPUT, FORWARD, OUTPUT |
| security | SELinux labels | INPUT, FORWARD, OUTPUT |
Chain traversal within a table:
Rule 1: match criteria → target (ACCEPT/DROP/JUMP/RETURN)
Rule 2: match criteria → target
Rule 3: match criteria → target
...
Default policy: ACCEPT or DROP
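Rules are evaluated top to bottom. The first matching rule with a terminating target (ACCEPT, DROP, REJECT) decides the packet's fate; non-terminating targets such as LOG or MARK act on the packet and let evaluation continue. If no rule decides, the chain's default policy applies. JUMP (-j to a user-defined chain) transfers evaluation to that chain. RETURN goes back to the calling chain.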
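The traversal logic can be sketched in a few lines. This is a simplified model under stated assumptions: rules are (predicate, target) pairs, only ACCEPT, DROP, RETURN and jumps are modelled, and the chain name tcp_checks and the sample addresses are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Rule = Tuple[Callable[[dict], bool], str]  # (match predicate, target)


def evaluate(chains: Dict[str, List[Rule]], policy: str, packet: dict,
             start: str = "INPUT") -> str:
    """Return the final verdict for a packet entering the built-in chain `start`."""

    def walk(name: str) -> Optional[str]:
        for match, target in chains[name]:
            if not match(packet):
                continue
            if target in ("ACCEPT", "DROP"):
                return target          # terminating target: stop everywhere
            if target == "RETURN":
                return None            # resume in the calling chain
            verdict = walk(target)     # anything else is a jump to a user chain
            if verdict is not None:
                return verdict
        return None                    # end of chain acts like RETURN

    verdict = walk(start)
    # RETURN or fall-through in a built-in chain hands the packet to the policy.
    return verdict if verdict is not None else policy


chains = {
    "INPUT": [
        (lambda p: p["proto"] == "tcp", "tcp_checks"),
        (lambda p: p["src"] == "10.0.0.5", "ACCEPT"),
    ],
    "tcp_checks": [
        (lambda p: p.get("dport") == 22, "ACCEPT"),
        (lambda p: p.get("dport") == 23, "DROP"),
    ],
}

if __name__ == "__main__":
    pkt = {"proto": "tcp", "src": "10.0.0.5", "dport": 80}
    print(evaluate(chains, "DROP", pkt))
```

The TCP packet to port 80 falls off the end of tcp_checks, returns to INPUT, and is then accepted by the source-address rule; a UDP packet from any other source reaches the default policy.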
Packet traversal order across tables at PREROUTING:
1. raw PREROUTING (priority -300, the lowest value, so it runs first)
2. Connection tracking (if not NOTRACK)
3. mangle PREROUTING
4. nat PREROUTING (DNAT)
5. Routing decision
Connection Tracking
Connection tracking (nf_conntrack) is a stateful packet inspection system that correlates packets belonging to the same connection. It lives in net/netfilter/nf_conntrack_core.c.
Core structure: struct nf_conn (defined in include/net/netfilter/nf_conntrack.h):
struct nf_conn {
    struct nf_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX]; /* original + reply */
    unsigned long status;    /* IPS_SEEN_REPLY, IPS_ASSURED, ... */
    u32 timeout;
    struct nf_ct_ext *ext;   /* extensions: helper, NAT, acct, etc. */
    spinlock_t lock;
    /* ... */
};
A connection is identified by a 5-tuple: (src IP, dst IP, src port, dst port, protocol). Two tuples are stored per connection: the original direction and the reply direction (with NAT addresses if applicable).
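A rough model of the two-tuple lookup may help. It assumes no NAT (so the reply tuple is simply the inverted original) and tracks only the NEW and ESTABLISHED states; Tuple5, invert and ConntrackTable are illustrative names, not kernel symbols.

```python
from typing import Dict, NamedTuple, Tuple


class Tuple5(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int
    proto: str


def invert(t: Tuple5) -> Tuple5:
    """Reply-direction tuple when no NAT is involved: swap source and destination."""
    return Tuple5(t.dst, t.dport, t.src, t.sport, t.proto)


class ConntrackTable:
    def __init__(self) -> None:
        # Both directions map to the same entry, like tuplehash[IP_CT_DIR_MAX].
        self._by_tuple: Dict[Tuple5, dict] = {}
        self.entries = 0

    def track(self, pkt: Tuple5) -> Tuple[str, str]:
        """Return (direction, state) for one packet."""
        entry = self._by_tuple.get(pkt)
        if entry is None:
            entry = {"original": pkt, "reply": invert(pkt), "seen_reply": False}
            self._by_tuple[entry["original"]] = entry
            self._by_tuple[entry["reply"]] = entry
            self.entries += 1
            return "original", "NEW"
        direction = "reply" if pkt == entry["reply"] else "original"
        if direction == "reply":
            entry["seen_reply"] = True
        state = "ESTABLISHED" if entry["seen_reply"] else "NEW"
        return direction, state


if __name__ == "__main__":
    ct = ConntrackTable()
    syn = Tuple5("192.0.2.10", 40000, "198.51.100.20", 80, "tcp")
    for pkt in (syn, invert(syn), syn):
        print(ct.track(pkt))
```

Indexing the same entry under both tuples is what lets a reply packet find its connection with a single hash lookup.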
Connection states visible in iptables (-m conntrack --ctstate, or the older -m state --state):
| State | Meaning |
|---|---|
| NEW | First packet of a new connection (no reply seen) |
| ESTABLISHED | Packets in both directions have been seen |
| RELATED | Associated with an existing connection (FTP data, ICMP error) |
| INVALID | Doesn't match any known connection; usually dropped |
| UNTRACKED | Explicitly bypassed conntrack via raw table NOTRACK |
Monitor live conntrack table:
conntrack -L # list all tracked connections
conntrack -L -p tcp --state ESTABLISHED | wc -l
conntrack -E # watch events in real-time
cat /proc/net/nf_conntrack # raw kernel table
cat /proc/sys/net/netfilter/nf_conntrack_count # current count
cat /proc/sys/net/netfilter/nf_conntrack_max # maximum
NAT Implementation
DNAT (Destination NAT, PREROUTING chain) rewrites the destination address — used for port forwarding and load balancing:
# Forward external port 8080 to internal host 192.168.1.10:80
iptables -t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 192.168.1.10:80
SNAT (Source NAT, POSTROUTING chain) rewrites the source address — used for internet sharing:
# Masquerade outgoing traffic on eth0 (dynamic source IP)
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
# Fixed source IP SNAT
iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to-source 1.2.3.4
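MASQUERADE looks up the outgoing interface's current IP for each new connection, which suits DHCP and other dynamic-address environments. SNAT with a fixed address skips that lookup, so it is slightly cheaper. In both cases the rule is consulted only for the first packet of a connection.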
Connection tracking makes NAT stateful: once the first packet is translated and recorded in nf_conn, subsequent packets in the same flow are translated by the conntrack extension (nf_nat_*), not by re-evaluating the iptables rules.
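The "rules once, mapping afterwards" behaviour can be illustrated with a toy DNAT gateway. It is a sketch under simplifying assumptions (exact-match rules, no timeouts, no port collisions); NatGateway and its fields are hypothetical names, and the addresses come from documentation ranges.

```python
class NatGateway:
    """Toy DNAT model: rules are consulted only for the first packet of a flow."""

    def __init__(self, dnat_rules):
        self.dnat_rules = dnat_rules   # {(dst_ip, dport): (new_ip, new_port)}
        self.forward = {}              # original-direction mapping per flow
        self.reverse = {}              # reply-direction mapping per flow
        self.rule_evaluations = 0

    def translate(self, src, sport, dst, dport):
        key = (src, sport, dst, dport)
        if key not in self.forward:
            # First packet: walk the rules and record the result in the flow table.
            self.rule_evaluations += 1
            new_dst, new_dport = self.dnat_rules.get((dst, dport), (dst, dport))
            self.forward[key] = (new_dst, new_dport)
            self.reverse[(new_dst, new_dport, src, sport)] = (dst, dport)
        new_dst, new_dport = self.forward[key]
        return (src, sport, new_dst, new_dport)

    def translate_reply(self, src, sport, dst, dport):
        # Reply packets get their source rewritten back to the original destination.
        orig_dst, orig_dport = self.reverse[(src, sport, dst, dport)]
        return (orig_dst, orig_dport, dst, dport)


if __name__ == "__main__":
    gw = NatGateway({("203.0.113.1", 8080): ("192.168.1.10", 80)})
    for _ in range(3):
        print(gw.translate("198.51.100.7", 40000, "203.0.113.1", 8080))
    print(gw.translate_reply("192.168.1.10", 80, "198.51.100.7", 40000))
    print("rule evaluations:", gw.rule_evaluations)
```

Three forward packets trigger a single rule evaluation here; the remaining translations come from the per-flow mapping, which is the role the conntrack NAT extension plays in the kernel.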
iptables Performance Limitations
iptables rules are evaluated sequentially: in the kernel, each table is a contiguous array of rule entries that is walked in order. Rule lookup is O(n) per packet. With 10,000 rules (common in cloud environments with per-pod security groups):
- Each packet evaluates up to 10,000 rules before matching or reaching default policy
- At 1 Mpps with 10K rules: 10 billion comparisons/second — CPU-bound
- Kubernetes kube-proxy in iptables mode creates at least one rule per Service endpoint, so clusters with 10K services hit this wall
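The worst-case arithmetic in the list above can be checked directly. The sketch assumes every packet walks the full rule list on a single core, which is the pessimistic bound rather than a measured figure; the function names are illustrative.

```python
def worst_case_comparisons_per_sec(packets_per_sec: int, rules: int) -> int:
    """Upper bound: every packet is compared against every rule."""
    return packets_per_sec * rules


def time_budget_per_rule_ns(packets_per_sec: int, rules: int) -> float:
    """Nanoseconds available per rule check if one core has to keep up."""
    return 1e9 / (packets_per_sec * rules)


if __name__ == "__main__":
    pps, rules = 1_000_000, 10_000
    print(worst_case_comparisons_per_sec(pps, rules))  # comparisons per second
    print(time_budget_per_rule_ns(pps, rules))         # ns per rule on one core
```

At 1 Mpps and 10K rules the budget is a tenth of a nanosecond per rule, less than one CPU cycle on common hardware, which is why the full-walk case cannot be sustained on a single core.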
Measure iptables rule count and overhead:
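iptables-save | grep -c '^-A' # rules across all tables (iptables -L covers only filter and also counts header lines)
perf stat -a -e cache-misses,instructions -- iperf3 -c <host> -t 5 # -a: system-wide, so kernel softirq work is included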
nftables: The Replacement
nftables (nft) was merged in Linux 3.13 (2014) to replace iptables/ip6tables/arptables/ebtables with a single consistent interface.
Key improvements:
- Set-based matching: a packet's source IP matched against a 10,000-IP set costs O(log n) or O(1) via hash sets, not O(n)
- Single kernel module (nf_tables) instead of separate modules per table
- Atomic rule updates: entire ruleset updated transactionally
- Native IPv4/IPv6 dual-stack rules
- Maps: a match can produce a value (e.g., route to specific VRF based on DSCP)
# nftables equivalent of an iptables default-drop INPUT policy that allows established traffic and loopback
nft add table inet filter
nft add chain inet filter input { type filter hook input priority 0 \; policy drop \; }
nft add rule inet filter input ct state established,related accept
nft add rule inet filter input iif lo accept
# Set-based: block a list of IPs atomically
nft add set inet filter blocklist { type ipv4_addr \; }
nft add element inet filter blocklist { 1.2.3.4, 5.6.7.8 }
nft add rule inet filter input ip saddr @blocklist drop
# Show ruleset
nft list ruleset
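The difference between one rule per address and a single set lookup can be shown with a short comparison. This is an analogy in user space, not a benchmark of the kernel: a Python list stands in for a chain of per-IP rules and a Python set for a hash set, and the 10.0.0.0-based block list is made up for the example.

```python
import ipaddress


def linear_match(src: int, rules: list) -> tuple:
    """One rule per address: returns (matched, comparisons performed)."""
    for comparisons, blocked in enumerate(rules, start=1):
        if src == blocked:
            return True, comparisons
    return False, len(rules)


def set_match(src: int, blockset: set) -> bool:
    """Single rule referencing a hash set: one lookup regardless of set size."""
    return src in blockset


# Hypothetical block list of 10,000 consecutive addresses.
base = int(ipaddress.ip_address("10.0.0.0"))
rules = [base + i for i in range(10_000)]
blockset = set(rules)

if __name__ == "__main__":
    last = rules[-1]
    outsider = int(ipaddress.ip_address("192.0.2.1"))
    print(linear_match(last, rules), set_match(last, blockset))
    print(linear_match(outsider, rules), set_match(outsider, blockset))
```

The worst cases for the list are the last entry and a miss, both of which cost the full 10,000 comparisons, while the set answers either query with one hashed lookup.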
ipset is the iptables-era workaround for the same problem — maintains kernel hash/bitmap sets that iptables rules reference:
ipset create myset hash:ip maxelem 1000000
ipset add myset 1.2.3.4
iptables -A INPUT -m set --match-set myset src -j DROP
Historical Context
Netfilter was designed by Paul Russell and Marc Boucher and merged in Linux 2.4.0 (January 2001), replacing the older ipchains (Linux 2.2) and ipfwadm (Linux 2.0). The hook-based architecture was deliberately extensible: the same hook points support stateless filtering, stateful tracking, NAT, and application-layer gateways.
Connection tracking was designed primarily for NAT — to map reply packets back through the translation. Its use as a stateful firewall mechanism (NEW/ESTABLISHED/RELATED) came later and turned out to be the killer feature for most deployments.
iptables' O(n) limitation was acceptable for the hardware of 2001. By 2015, cloud providers with hundreds of containers per host, each needing security rules, were running 50,000+ iptables rules per node. This drove the development of Cilium (eBPF-based) and, eventually, nftables-based dataplanes for Kubernetes.
Production Examples
Kubernetes kube-proxy (iptables mode) — connection tracking exhaustion:
# Check conntrack table fill level
echo "$(cat /proc/sys/net/netfilter/nf_conntrack_count) / $(cat /proc/sys/net/netfilter/nf_conntrack_max)"
# Common fix: increase max
sysctl -w net.netfilter.nf_conntrack_max=1048576
# Conntrack drops (table full → packets that would create new entries are dropped)
netstat -s | grep 'connection resets'
conntrack -S | grep drop
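The fill-level check above is easy to turn into a monitoring probe. The following is a minimal sketch: the 80% warning threshold is an assumption chosen for illustration, and read_fill returns None on hosts where the conntrack sysctl files are absent (for example when the module is not loaded).

```python
from pathlib import Path
from typing import Optional

PROC_DIR = Path("/proc/sys/net/netfilter")


def fill_ratio(count: int, maximum: int) -> float:
    """Fraction of the conntrack table currently in use."""
    return count / maximum


def classify(ratio: float, warn_at: float = 0.8) -> str:
    # warn_at is an illustrative threshold, not a kernel or distro default.
    if ratio >= 1.0:
        return "full"
    return "warning" if ratio >= warn_at else "ok"


def read_fill(proc_dir: Path = PROC_DIR) -> Optional[float]:
    """Read nf_conntrack_count / nf_conntrack_max; None if unavailable."""
    try:
        count = int((proc_dir / "nf_conntrack_count").read_text())
        maximum = int((proc_dir / "nf_conntrack_max").read_text())
    except (OSError, ValueError):
        return None
    return fill_ratio(count, maximum)


if __name__ == "__main__":
    ratio = read_fill()
    if ratio is None:
        print("conntrack sysctls not available on this host")
    else:
        print(f"conntrack table {ratio:.1%} used: {classify(ratio)}")
```

Alerting on the ratio rather than the raw count keeps the check valid after nf_conntrack_max is raised.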
Stateless ACL with nftables + sets (cloud firewall):
# 100K blocked ranges; flags interval selects a range-capable set backend (tree lookup, roughly O(log n)) rather than a hash
nft add set inet filter country_block { type ipv4_addr \; flags interval \; }
# ... populate set from GeoIP database ...
nft add rule inet filter input ip saddr @country_block counter drop
Debugging Notes
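# Trace iptables rule traversal (verbose, use sparingly)
iptables -t raw -A PREROUTING -p tcp --dport 80 -j TRACE
iptables -t raw -A OUTPUT -p tcp --sport 80 -j TRACE
# View trace output (iptables-legacy writes to the kernel log; the log path varies by distro)
tail -f /var/log/kern.log | grep 'TRACE:'
# With the iptables-nft backend, read trace events with xtables-monitor instead
xtables-monitor --trace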
# conntrack event monitoring
conntrack -E --event-mask NEW
# nftables rule hit counters
nft add rule inet filter input counter accept # counter is part of the rule; this one counts and accepts every packet that reaches it
nft list ruleset # counters shown inline
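# kernel conntrack errors
cat /proc/net/stat/nf_conntrack
# One row per CPU, values in hex; the header line names each column.
# drop = packets dropped (e.g. table full), early_drop = entries evicted to make room
# Check NAT translation table
conntrack -L -n # source-NAT connections only (-g: destination NAT, -j: any NAT)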
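Reading /proc/net/stat/nf_conntrack by eye is error-prone because the values are hexadecimal and split per CPU. The sketch below sums the drop-related columns by name; SAMPLE is a made-up two-CPU excerpt in the usual layout, and newer kernels may add or remove columns, which is why the parser keys on the header instead of on positions.

```python
SAMPLE = """\
entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete search_restart
000000a3 00000000 00000000 00000000 0000001f 00000004 00000000 00000000 00000000 00000003 0000000c 00000002 00000000 00000000 00000000 00000000 00000000
000000a3 00000000 00000000 00000000 00000002 00000000 00000000 00000000 00000000 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000
"""

WANTED = ("insert_failed", "drop", "early_drop")


def parse_conntrack_stat(text: str) -> dict:
    """Sum selected per-CPU counters, looked up by column name."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    totals = {name: 0 for name in WANTED}
    for row in lines[1:]:
        values = dict(zip(header, row.split()))
        for name in WANTED:
            if name in values:
                totals[name] += int(values[name], 16)  # counters are hex
    return totals


if __name__ == "__main__":
    print(parse_conntrack_stat(SAMPLE))
    # On a live host: parse_conntrack_stat(open("/proc/net/stat/nf_conntrack").read())
```

The entries column is left out on purpose: it reports the global table size on every row, so summing it across CPUs would overcount.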
Security Implications
- Conntrack table exhaustion (denial of service): a SYN flood can fill the table up to nf_conntrack_max. Once it is full, packets that would create new entries are dropped, so legitimate new connections fail. Mitigation: tune net.netfilter.nf_conntrack_max, enable SYN cookies (net.ipv4.tcp_syncookies=1), and use NOTRACK in the raw table for UDP services.
- INVALID state dropping: iptables -A INPUT -m conntrack --ctstate INVALID -j DROP is a defense against out-of-window segments and TCP sequence number attacks.
- iptables rule order bugs: a permissive rule placed before a restrictive rule makes the restriction unreachable. Automated testing of rule semantics is non-trivial.
- DNAT hairpinning: internal hosts cannot reach a DNAT target via the external IP without additional MASQUERADE on the internal interface. This common misconfiguration fails silently for local clients.
- ip_forward enablement: DNAT to another host and FORWARD chain rules only take effect if net.ipv4.ip_forward=1. Forgetting this is the #1 NAT misconfiguration.
Performance Implications
| Configuration | Throughput impact |
|---|---|
| 100 iptables rules | Negligible |
| 10,000 iptables rules | ~30% throughput reduction at 1Mpps |
| 100,000 iptables rules | Effectively CPU-bound |
| nftables + hash set (100K entries) | O(1), negligible |
| NOTRACK (raw table bypass) | Eliminates conntrack overhead (~10% gain for stateless workloads) |
| conntrack table miss → lookup | ~100–200 ns per packet |
For high-rate stateless services (DNS, NTP, DDoS scrubbers), use:
iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT -p udp --sport 53 -j NOTRACK
Failure Modes and Real Incidents
Incident: Kubernetes conntrack exhaustion (2019, multiple clusters)
kube-proxy creates DNAT rules for every Service endpoint. A cluster with 500 services × 10 pods = 5000 DNAT rules. Under high traffic, the conntrack table fills up to nf_conntrack_max (often 262,144; the default is derived from available memory). Packets that would create new entries are then dropped, so new pod connections fail silently. Symptoms: intermittent 5xx errors, no obvious application log entries. Diagnosis: dmesg | grep 'nf_conntrack: table full'. Fix: increase max + switch to Cilium/eBPF.
Incident: NAT port exhaustion
A NAT gateway masquerading 1000 clients that all connect to the same external IP:port can run out of source ports: with a single external address, every connection to that destination needs a distinct source port, and there are at most 65,535 of them. New connections from the gateway fail with ENOMEM in conntrack or RST from destination. Fix: add more external IPs to rotate SNAT across.
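A back-of-the-envelope capacity check makes the limit concrete. The sketch assumes the gateway may use source ports 1024 to 65535 for flows to one destination IP:port, and that each client holds a fixed number of concurrent connections; both figures are assumptions for illustration, and the function names are hypothetical.

```python
import math


def ports_per_external_ip(port_min: int = 1024, port_max: int = 65535) -> int:
    """Distinct source ports one external IP offers toward a single dst IP:port."""
    return port_max - port_min + 1


def external_ips_needed(clients: int, flows_per_client: int, ports_per_ip: int) -> int:
    """External addresses required so every concurrent flow gets a unique tuple."""
    return math.ceil(clients * flows_per_client / ports_per_ip)


if __name__ == "__main__":
    capacity = ports_per_external_ip()
    print("ports per external IP:", capacity)
    for flows in (10, 50, 100):
        needed = external_ips_needed(1000, flows, capacity)
        print(f"1000 clients x {flows} flows -> {needed} external IP(s)")
```

With these assumptions, 1000 clients stay within one address until each holds more than about 64 concurrent connections to the same destination; beyond that a second SNAT address is required.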
Failure Mode: iptables rule leak
A deploy script adds iptables rules but the cleanup on deploy failure doesn't run. Rules accumulate across deploys. Symptoms emerge hours later as rule count crosses the performance cliff. Track iptables-save | wc -l in monitoring and alert on unexpected growth.
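A rule-count probe for this failure mode could look like the following. It is a sketch that parses iptables-save style text; SAMPLE_DUMP is a hand-written example rather than output from a real host, and how the per-table counts are exported to monitoring is left open.

```python
SAMPLE_DUMP = """\
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A PREROUTING -p tcp -m tcp --dport 8080 -j DNAT --to-destination 192.168.1.10:80
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
*filter
:INPUT DROP [0:0]
-A INPUT -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate INVALID -j DROP
COMMIT
"""


def count_rules(dump: str) -> dict:
    """Count '-A' rule lines per table in iptables-save output."""
    counts = {}
    table = None
    for line in dump.splitlines():
        if line.startswith("*"):
            table = line[1:].strip()      # table header, e.g. "*filter"
            counts[table] = 0
        elif line.startswith("-A ") and table is not None:
            counts[table] += 1
    return counts


if __name__ == "__main__":
    per_table = count_rules(SAMPLE_DUMP)
    print(per_table, "total:", sum(per_table.values()))
    # On a live host (root required), feed it the real dump, for example:
    #   subprocess.run(["iptables-save"], capture_output=True, text=True).stdout
```

Counting only the rule lines, instead of every line of the dump, keeps chain declarations and COMMIT markers from inflating the metric.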
Modern Usage
- nftables is the default in RHEL 8+, Debian 10+, and Ubuntu 20.10+: iptables commands are redirected to iptables-nft (nftables backend) by default
- Cilium replaces kube-proxy entirely with eBPF programs, bypassing Netfilter for most Kubernetes traffic and giving O(1) service lookup via BPF maps instead of O(n) iptables chains
- Cloud provider security groups (AWS, GCP, Azure) implement ingress/egress filtering in hypervisor software, so that filtering no longer has to run in the guest VM's Netfilter
- Firewalld uses nftables as its backend in modern distributions, providing a higher-level zone/service abstraction
Future Directions
- eBPF completely replacing Netfilter for Kubernetes: projects like Cilium and Calico-eBPF route all service traffic through XDP/TC eBPF programs, eliminating conntrack overhead for east-west traffic
- Flowtable offload: an nftables flowtable (nft add flowtable) gives established connections a fast path that skips most of the forwarding stack; with the offload flag and a capable NIC, those flows can be handled in hardware
- Hardware NAT: modern SmartNICs (Mellanox BlueField, Netronome Agilio) can execute nftables/Netfilter rules in NIC firmware at line rate
Exercises
- On a Linux VM, create an iptables ruleset with 5,000 rules that all DROP (except the last, which ACCEPTs your SSH source IP). Measure the throughput impact of increasing rule counts from 100 to 1,000 to 5,000 using iperf3. Plot the throughput curve.
- Configure DNAT to forward port 8080 on your external interface to port 80 on 127.0.0.1. Verify with curl localhost:8080 and conntrack -L showing the translation. Explain why net.ipv4.ip_forward is not needed for local-to-local DNAT.
- Reproduce conntrack table exhaustion in a VM: set nf_conntrack_max=100, then use hping3 --syn -p 80 <host> to generate SYN packets. Observe dmesg and conntrack -S output. Implement NOTRACK in the raw table and verify conntrack count no longer grows.
- Rewrite a set of 10 iptables rules that match against individual IPs as a single nftables hash set rule. Benchmark lookup latency for the original iptables approach vs nftables set using perf stat.
- Write a Python script using iptc (python-iptables) that enumerates all iptables rules across all tables and identifies rules that are shadowed (never reached due to a prior matching rule).
References
- net/netfilter/nf_conntrack_core.c: connection tracking core
- net/netfilter/nf_tables_core.c: nftables evaluation engine
- include/linux/netfilter.h: hook infrastructure
- net/ipv4/netfilter/iptable_filter.c, iptable_nat.c: iptables table implementations
- Russell, P. Netfilter Hacking HOWTO. Netfilter.org, 2002.
- Neira Ayuso, P. The netfilter connection tracking system. ;login: USENIX Magazine, 2006.
- man 8 iptables, man 8 nft, man 8 conntrack
- Documentation/networking/nf_conntrack-sysctl.rst
- Cilium documentation: eBPF-based Networking, Security, and Observability. cilium.io