03 — Netfilter and iptables

Technical Overview

Netfilter is the framework inside the Linux kernel that provides hooks at well-defined points in the packet-processing path. User-space tools — iptables, nftables, conntrack, ipset — attach filtering, NAT, and mangling logic to these hooks without modifying kernel source. Netfilter underpins firewalls, NAT gateways, VPN masquerading, Kubernetes kube-proxy, and cloud provider security groups. Understanding its internals explains both its power and its notorious performance cliff at scale.

Prerequisites

Linux network stack fundamentals (see 01-linux-network-stack.md)
Basic IP networking (routing, NAT, firewalls)
Familiarity with iptables, nft, conntrack, ipset CLI tools
C struct reading ability for kernel data structures

Core Content

Netfilter Hook Points

Netfilter registers hooks at five points in the IPv4 (and IPv6) packet path. Every packet traversing the kernel passes through a subset of these hooks, in the order shown, depending on the routing decision:

Incoming packet (from NIC)
         |
         v
  [PREROUTING hook]        <-- NF_INET_PRE_ROUTING
         |
         v
  Routing decision
    /         \
   /           \
  v             v
[INPUT]      [FORWARD]     <-- NF_INET_LOCAL_IN / NF_INET_FORWARD
  |             |
  v             v
Local process  [POSTROUTING]  <-- NF_INET_POST_ROUTING
                    |
                    v
              Outgoing packet (to NIC)

Locally generated packet:
  Local process
       |
       v
  [OUTPUT hook]             <-- NF_INET_LOCAL_OUT
       |
       v
  [POSTROUTING hook]        <-- NF_INET_POST_ROUTING
       |
       v
  NIC / wire

Each hook callback has an integer priority that determines the order when multiple subsystems (iptables, nftables, conntrack, SELinux) register on the same hook; numerically lower values run first. For example, the raw table registers at NF_IP_PRI_RAW (-300), connection tracking at NF_IP_PRI_CONNTRACK (-200), the mangle table at NF_IP_PRI_MANGLE (-150), iptables destination NAT at NF_IP_PRI_NAT_DST (-100), and the iptables filter table at NF_IP_PRI_FILTER (0).
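
The ordering rule can be sketched in a few lines of Python. This is only a model: the constants mirror the kernel's priority values, while call_order and the registration list are hypothetical names introduced for illustration.

```python
# Priority values as defined in the kernel's enum nf_ip_hook_priorities.
NF_IP_PRI_RAW = -300
NF_IP_PRI_CONNTRACK = -200
NF_IP_PRI_MANGLE = -150
NF_IP_PRI_NAT_DST = -100


def call_order(registrations):
    """Return callback names in the order they would run (lowest value first)."""
    return [name for _prio, name in sorted(registrations, key=lambda r: r[0])]


# Registration order is deliberately scrambled; only the priority matters.
prerouting = [
    (NF_IP_PRI_NAT_DST, "nat"),
    (NF_IP_PRI_RAW, "raw"),
    (NF_IP_PRI_MANGLE, "mangle"),
    (NF_IP_PRI_CONNTRACK, "conntrack"),
]
order = call_order(prerouting)
print(order)
```

Sorting by the numeric value reproduces the PREROUTING sequence raw, conntrack, mangle, nat regardless of the order in which the subsystems registered.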

Hook return values: NF_ACCEPT (continue), NF_DROP (discard packet), NF_STOLEN (handler owns packet), NF_QUEUE (send to userspace via NFQUEUE), NF_REPEAT (re-invoke hook).

iptables: Tables, Chains, Rules

iptables is a Netfilter frontend. It organizes rules into tables containing chains of rules.

Tables and their hooks:

Table	Purpose	Chains available
`raw`	Bypass conntrack (NOTRACK)	PREROUTING, OUTPUT
`mangle`	Modify packet fields (TOS, TTL)	All 5 chains
`nat`	Address translation	PREROUTING, INPUT, OUTPUT, POSTROUTING
`filter`	Accept/drop/reject	INPUT, FORWARD, OUTPUT
`security`	SELinux labels	INPUT, FORWARD, OUTPUT

Chain traversal within a table:

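Rule 1: match criteria → target (ACCEPT, DROP, RETURN, or a jump to a user-defined chain)
Rule 2: match criteria → target
Rule 3: match criteria → target
...
Default policy: ACCEPT or DROP

Rules are evaluated top to bottom. The first matching rule with a terminating target (ACCEPT, DROP, REJECT) ends traversal; non-terminating targets such as LOG or MARK act on the packet and then let evaluation continue. If no rule decides the packet, the chain's default policy applies. A jump (-j <chain>) transfers evaluation to a user-defined chain, and RETURN goes back to the calling chain.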
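
The traversal rules can be modelled as a small interpreter. The sketch below is a simplification under stated assumptions: packets are plain dictionaries, matches are Python callables, and the chain name ssh-rules is hypothetical. It is not how the kernel represents rules.

```python
TERMINATING = {"ACCEPT", "DROP"}


def run_chain(chains, name, pkt):
    """Walk one chain; return a verdict, or None if the chain ends or RETURNs."""
    for match, target in chains[name]:
        if not match(pkt):
            continue
        if target in TERMINATING:
            return target
        if target == "RETURN":
            return None
        if target in chains:  # jump to a user-defined chain
            verdict = run_chain(chains, target, pkt)
            if verdict is not None:
                return verdict
        # Any other target (LOG here) is non-terminating: keep evaluating.
    return None


def evaluate(chains, policies, builtin, pkt):
    """Evaluate a built-in chain, falling back to its default policy."""
    verdict = run_chain(chains, builtin, pkt)
    return verdict if verdict is not None else policies[builtin]


chains = {
    "INPUT": [
        (lambda p: True, "LOG"),
        (lambda p: p.get("dport") == 22, "ssh-rules"),
        (lambda p: p["proto"] == "icmp", "ACCEPT"),
    ],
    "ssh-rules": [
        (lambda p: p["src"] == "10.0.0.5", "ACCEPT"),
        (lambda p: True, "RETURN"),
    ],
}
policies = {"INPUT": "DROP"}

allowed_ssh = evaluate(chains, policies, "INPUT", {"proto": "tcp", "dport": 22, "src": "10.0.0.5"})
other_ssh = evaluate(chains, policies, "INPUT", {"proto": "tcp", "dport": 22, "src": "10.0.0.9"})
ping = evaluate(chains, policies, "INPUT", {"proto": "icmp", "src": "10.0.0.9"})
print(allowed_ssh, other_ssh, ping)
```

The second packet shows the RETURN path: ssh-rules hands control back to INPUT, no later rule matches, and the default policy decides.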

Packet traversal order across tables at PREROUTING:

1. raw PREROUTING (lowest priority value, so it runs first)
2. Connection tracking (skipped for packets marked NOTRACK)
3. mangle PREROUTING
4. nat PREROUTING (DNAT)
5. Routing decision

Connection Tracking

Connection tracking (nf_conntrack) is a stateful packet inspection system that correlates packets belonging to the same connection. It lives in net/netfilter/nf_conntrack_core.c.

Core structure: struct nf_conn (defined in include/net/netfilter/nf_conntrack.h):

struct nf_conn {
    struct nf_conntrack_tuple_hash tuplehash[IP_CT_DIR_MAX]; /* original + reply */
    unsigned long       status;           /* IPS_SEEN_REPLY, IPS_ASSURED, ... */
    u32                 timeout;
    struct nf_ct_ext   *ext;             /* extensions: helper, NAT, acct, etc. */
    spinlock_t          lock;
    /* ... */
};

A connection is identified by a 5-tuple: (src IP, dst IP, src port, dst port, protocol). Two tuples are stored per connection: the original direction and the reply direction (with NAT addresses if applicable).
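
A minimal sketch of the two-tuple idea, assuming a simplified Tuple5 type and documentation-range addresses (none of these names come from the kernel): the reply tuple is the inverted original, except that with source NAT the reply is addressed to the NAT address.

```python
from collections import namedtuple

Tuple5 = namedtuple("Tuple5", "src dst sport dport proto")


def invert(t):
    """Swap endpoints: the tuple a reply packet would carry without NAT."""
    return Tuple5(t.dst, t.src, t.dport, t.sport, t.proto)


def reply_tuple(original, snat_to=None):
    """Expected reply tuple; with SNAT the reply targets the NAT address."""
    reply = invert(original)
    if snat_to is not None:
        reply = reply._replace(dst=snat_to)
    return reply


orig = Tuple5("192.168.1.10", "198.51.100.20", 40000, 443, "tcp")
plain = reply_tuple(orig)
natted = reply_tuple(orig, snat_to="203.0.113.7")

# Both directions point at the same connection entry, as tuplehash[] does.
table = {orig: "conn-1", natted: "conn-1"}
print(plain)
print(natted)
```

Looking up either tuple finds the same entry, which is what lets a reply packet be matched to its connection and translated back.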

Connection states visible in iptables (-m conntrack --ctstate, or the older -m state --state):

State	Meaning
`NEW`	First packet of a new connection (no reply seen)
`ESTABLISHED`	Packets in both directions have been seen
`RELATED`	Associated with an existing connection (FTP data, ICMP error)
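`INVALID`	Packet could not be associated with a connection or has no valid state (for example, an out-of-window TCP segment); usually dropped by an explicit rule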
`UNTRACKED`	Explicitly bypassed conntrack via `raw` table NOTRACK

Monitor live conntrack table:

conntrack -L                      # list all tracked connections
conntrack -L -p tcp --state ESTABLISHED | wc -l
conntrack -E                      # watch events in real-time
cat /proc/net/nf_conntrack         # raw kernel table
cat /proc/sys/net/netfilter/nf_conntrack_count   # current count
cat /proc/sys/net/netfilter/nf_conntrack_max     # maximum

NAT Implementation

DNAT (Destination NAT, PREROUTING chain) rewrites the destination address — used for port forwarding and load balancing:

# Forward external port 8080 to internal host 192.168.1.10:80
iptables -t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 192.168.1.10:80

SNAT (Source NAT, POSTROUTING chain) rewrites the source address — used for internet sharing:

# Masquerade outgoing traffic on eth0 (dynamic source IP)
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE

# Fixed source IP SNAT
iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to-source 1.2.3.4

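MASQUERADE looks up the outgoing interface's current address each time it sets up a new connection, which suits dynamically addressed links (DHCP, PPP). SNAT with a fixed --to-source skips that lookup and is slightly cheaper, so prefer it when the address is static.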

Connection tracking makes NAT stateful: once the first packet is translated and recorded in nf_conn, subsequent packets in the same flow are translated by the conntrack extension (nf_nat_*), not by re-evaluating the iptables rules.
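
The effect can be illustrated with a toy gateway that counts how often its NAT rules are consulted. The class and method names are hypothetical, and the model keeps the source port unchanged, which real NAT does not guarantee.

```python
class NatGateway:
    """Toy SNAT gateway: rules run once per flow, later packets reuse the mapping."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.conntrack = {}        # flow -> translated (source IP, source port)
        self.rule_evaluations = 0  # how many times the NAT rules were walked

    def _evaluate_nat_rules(self, flow):
        self.rule_evaluations += 1
        _src, sport, _dst, _dport = flow
        return (self.public_ip, sport)  # assumption: port is kept as-is

    def translate(self, src, sport, dst, dport):
        flow = (src, sport, dst, dport)
        if flow not in self.conntrack:  # only the first packet hits the rules
            self.conntrack[flow] = self._evaluate_nat_rules(flow)
        return self.conntrack[flow]


gw = NatGateway("203.0.113.7")
for _ in range(5):
    first_flow = gw.translate("192.168.1.10", 40000, "198.51.100.20", 443)
evals_after_one_flow = gw.rule_evaluations
gw.translate("192.168.1.11", 40001, "198.51.100.20", 443)
evals_after_two_flows = gw.rule_evaluations
print(first_flow, evals_after_one_flow, evals_after_two_flows)
```

Five packets of one flow cost a single rule evaluation; only a new flow triggers another.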

iptables Performance Limitations

iptables evaluates the rules of a chain sequentially (in the kernel, each table is a contiguous array of rule entries walked in order), so rule lookup is O(n) per packet. With 10,000 rules (common in cloud environments with per-pod security groups):

Each packet is compared against up to 10,000 rules before it matches or reaches the default policy
At 1 Mpps with 10K rules: up to 10 billion rule comparisons per second in the worst case, which is CPU-bound
Kubernetes kube-proxy in iptables mode creates at least one rule per Service endpoint, so clusters with 10K services hit this wall
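
The arithmetic behind these figures is worth spelling out. The sketch below assumes a single 3 GHz core as a hypothetical budget; the function names are illustrative.

```python
def rule_checks_per_second(pps, rules):
    """Worst case: every packet is compared against every rule."""
    return pps * rules


def cycles_per_rule(cpu_hz, pps, rules):
    """CPU cycles available per rule comparison on one core."""
    return cpu_hz / rule_checks_per_second(pps, rules)


checks = rule_checks_per_second(1_000_000, 10_000)
budget = cycles_per_rule(3_000_000_000, 1_000_000, 10_000)  # assumed 3 GHz core
print(checks, budget)
```

With less than one cycle available per comparison, a single core cannot sustain the worst case, which is why long chains become CPU-bound well before the link saturates.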

Measure iptables rule count and overhead:

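iptables-save | grep -c '^-A'      # rules across all tables (iptables -L lists only the filter table)
perf stat -a -e cache-misses,instructions -- iperf3 -c <host> -t 5   # -a: system-wide, so kernel packet processing is included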

nftables: The Replacement

nftables (nft) was merged in Linux 3.13 (2014) to replace iptables/ip6tables/arptables/ebtables with a single consistent interface.

Key improvements:

Set-based matching: a packet's source IP matched against a 10,000-IP set costs O(log n) or O(1) via hash sets, not O(n)
Single kernel module (nf_tables) instead of separate modules per table
Atomic rule updates: entire ruleset updated transactionally
Native IPv4/IPv6 dual-stack rules
Maps: a match can produce a value (e.g., route to specific VRF based on DSCP)

# nftables default-drop input policy (the counterpart of iptables -P INPUT DROP plus accept rules)
nft add table inet filter
nft add chain inet filter input { type filter hook input priority 0 \; policy drop \; }
nft add rule inet filter input ct state established,related accept
nft add rule inet filter input iif lo accept

# Set-based: block a list of IPs atomically
nft add set inet filter blocklist { type ipv4_addr \; }
nft add element inet filter blocklist { 1.2.3.4, 5.6.7.8 }
nft add rule inet filter input ip saddr @blocklist drop

# Show ruleset
nft list ruleset

ipset is the iptables-era workaround for the same problem — maintains kernel hash/bitmap sets that iptables rules reference:

ipset create myset hash:ip maxelem 1000000
ipset add myset 1.2.3.4
iptables -A INPUT -m set --match-set myset src -j DROP
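
Why a set beats one rule per address can be shown with a rough model. Here a Python list stands in for 10,000 individual rules and a frozenset for a hash-backed set; linear_match and set_match are hypothetical helpers, not nftables or ipset APIs.

```python
import ipaddress


def linear_match(rules, addr):
    """Scan one rule per address; return (matched, comparisons made)."""
    for i, blocked_addr in enumerate(rules, start=1):
        if blocked_addr == addr:
            return True, i
    return False, len(rules)


def set_match(blockset, addr):
    """Single hash probe, independent of the number of elements."""
    return addr in blockset


base = ipaddress.ip_address("10.0.0.0")
blocked = [base + i for i in range(10_000)]
blockset = frozenset(blocked)

probe = blocked[-1]  # worst case for the linear scan
hit, comparisons = linear_match(blocked, probe)
miss, miss_comparisons = linear_match(blocked, ipaddress.ip_address("192.0.2.1"))
print(hit, comparisons, set_match(blockset, probe))
```

The linear scan needs 10,000 comparisons for the last address and for every miss, while the set answers with one lookup in both cases.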

Historical Context

Netfilter was designed by Paul "Rusty" Russell and Marc Boucher and merged in Linux 2.4.0 (January 2001), replacing the older ipchains (Linux 2.2) and ipfwadm (Linux 2.0). The hook-based architecture was deliberately extensible: the same hook points support stateless filtering, stateful tracking, NAT, and application-layer gateways.

Connection tracking was designed primarily for NAT — to map reply packets back through the translation. Its use as a stateful firewall mechanism (NEW/ESTABLISHED/RELATED) came later and turned out to be the killer feature for most deployments.

iptables' O(n) limitation was acceptable for the hardware of 2001. By 2015, cloud providers with hundreds of containers per host each needing security rules were running 50,000+ iptables rules per node. This drove the development of eBPF-based dataplanes such as Cilium and, later, the use of nftables as a kube-proxy backend.

Production Examples

Kubernetes kube-proxy (iptables mode) — connection tracking exhaustion:

# Check conntrack table fill level
echo "$(cat /proc/sys/net/netfilter/nf_conntrack_count) / $(cat /proc/sys/net/netfilter/nf_conntrack_max)"

# Common fix: increase max
sysctl -w net.netfilter.nf_conntrack_max=1048576

# Conntrack drops (table full → packets that would create a new entry are dropped)
dmesg | grep 'nf_conntrack: table full'
conntrack -S | grep drop
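
For monitoring, the same fill-level check can be scripted. This is a sketch: the two file names are the sysctl entries shown above, while conntrack_fill, fill_level, and the 80% warning threshold are assumptions chosen for illustration.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/netfilter")


def conntrack_fill(proc_dir=PROC_DIR):
    """Return (count, max, ratio) read from the conntrack sysctl files."""
    count = int((Path(proc_dir) / "nf_conntrack_count").read_text())
    limit = int((Path(proc_dir) / "nf_conntrack_max").read_text())
    return count, limit, count / limit


def fill_level(ratio, warn=0.8):
    """Classify the fill ratio; the 0.8 threshold is an assumed example."""
    return "WARN" if ratio >= warn else "OK"
```

On a host with nf_conntrack loaded, fill_level(conntrack_fill()[2]) yields a status that can be exported to a metrics system. Passing a different directory to conntrack_fill makes the function testable without root.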

Stateless ACL with nftables + sets (cloud firewall):

# 100K blocked ranges; an interval set is backed by an ordered structure, so lookup is roughly O(log n) rather than a hash probe
nft add set inet filter country_block { type ipv4_addr \; flags interval \; }
# ... populate set from GeoIP database ...
nft add rule inet filter input ip saddr @country_block counter drop

Debugging Notes

# Trace iptables rule traversal (verbose, use sparingly)
iptables -t raw -A PREROUTING -p tcp --dport 80 -j TRACE
iptables -t raw -A OUTPUT -p tcp --sport 80 -j TRACE
# View trace output: kernel log with the legacy backend, xtables-monitor with iptables-nft
tail -f /var/log/kern.log | grep 'TRACE:'
xtables-monitor --trace

# conntrack event monitoring
conntrack -E --event-mask NEW

# nftables rule hit counters
nft add rule inet filter input tcp dport 22 counter accept  # counter on one specific rule (port 22 is only an example)
nft list ruleset  # counters shown inline

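# kernel conntrack errors (one row per CPU, values in hexadecimal)
cat /proc/net/stat/nf_conntrack
# Use the header row to locate columns: drop = packets dropped because no entry could be allocated, early_drop = entries evicted to make room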

# Check NAT translations
conntrack -L -j  # connections with any NAT applied (-n: source NAT only, -g: destination NAT only)
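
Because /proc/net/stat/nf_conntrack prints one row per CPU in hexadecimal, reading it by eye is error-prone. The sketch below sums the counters by header name. The SAMPLE text is fabricated and abridged for illustration (real kernels print more columns), and sum_conntrack_stats is a hypothetical helper.

```python
def sum_conntrack_stats(text):
    """Sum per-CPU hex counters by column name.

    The entries column is a global count repeated on every row,
    so it is taken as-is rather than summed.
    """
    lines = text.strip().splitlines()
    header = lines[0].split()
    totals = dict.fromkeys(header, 0)
    for row in lines[1:]:
        for name, value in zip(header, row.split()):
            if name == "entries":
                totals[name] = int(value, 16)
            else:
                totals[name] += int(value, 16)
    return totals


# Fabricated two-CPU sample with an abridged column set.
SAMPLE = """\
entries searched found new invalid ignore delete delete_list insert insert_failed drop early_drop
0000002a 00000000 00000000 00000000 00000005 00000000 00000000 00000000 00000000 00000001 0000000a 00000002
0000002a 00000000 00000000 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000010 00000000
"""

totals = sum_conntrack_stats(SAMPLE)
print(totals["drop"], totals["early_drop"], totals["insert_failed"])
```

Keying on the header row rather than a column position keeps the parser working across kernel versions that add or reorder columns.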

Security Implications

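conntrack table exhaustion: a SYN flood can fill the table up to nf_conntrack_max. Once it is full, packets that would create a new entry are dropped (the kernel logs "nf_conntrack: table full, dropping packet"), so legitimate new connections fail. Mitigation: tune net.netfilter.nf_conntrack_max, enable SYN cookies (net.ipv4.tcp_syncookies=1), and exempt high-rate stateless UDP services from tracking with NOTRACK in the raw table.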
INVALID state dropping: iptables -A INPUT -m conntrack --ctstate INVALID -j DROP — defense against out-of-window segments and TCP sequence number attacks.
iptables rule order bugs: a permissive rule before a restrictive rule makes the restriction unreachable. Automated testing of rule semantics is non-trivial.
DNAT hairpinning: internal hosts cannot reach a DNAT target via the external IP without additional MASQUERADE on the internal interface — a common misconfiguration that silently fails for local clients.
ip_forward enablement: DNAT to another host and FORWARD chain rules only take effect if net.ipv4.ip_forward=1, because the translated packet still has to be forwarded. Forgetting this is a very common NAT misconfiguration.
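
The rule-order problem can be checked mechanically on a simplified rule model. The sketch below assumes every rule has a terminating target and matches only on a source CIDR and an optional destination port; covers and shadowed are hypothetical helpers, not part of any iptables library.

```python
import ipaddress


def covers(a, b):
    """True if rule a matches every packet that rule b matches."""
    net_a = ipaddress.ip_network(a["src"])
    net_b = ipaddress.ip_network(b["src"])
    port_ok = a["dport"] is None or a["dport"] == b["dport"]
    return net_b.subnet_of(net_a) and port_ok


def shadowed(rules):
    """Indexes of rules that can never match because an earlier rule covers them."""
    hidden = []
    for i, rule in enumerate(rules):
        if any(covers(earlier, rule) for earlier in rules[:i]):
            hidden.append(i)
    return hidden


rules = [
    {"src": "10.0.0.0/8", "dport": None, "target": "ACCEPT"},  # permissive
    {"src": "10.1.2.3/32", "dport": 22, "target": "DROP"},     # never reached
    {"src": "192.0.2.0/24", "dport": 443, "target": "ACCEPT"},
]
hidden = shadowed(rules)
print(hidden)
```

The DROP for 10.1.2.3 is unreachable because the broader ACCEPT precedes it. Real rulesets add interfaces, protocols, states, and jumps, which is what makes full semantic testing non-trivial.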

Performance Implications

Configuration	Approximate throughput impact (illustrative; depends on hardware and rule complexity)
100 iptables rules	Negligible
10,000 iptables rules	~30% throughput reduction at 1Mpps
100,000 iptables rules	Effectively CPU-bound
nftables + hash set (100K entries)	O(1), negligible
NOTRACK (raw table bypass)	Eliminates conntrack overhead (~10% gain for stateless workloads)
conntrack table miss → lookup	~100–200 ns per packet

For high-rate stateless services (DNS, NTP, DDoS scrubbers), use:

iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT -p udp --sport 53 -j NOTRACK

Failure Modes and Real Incidents

Incident: Kubernetes conntrack exhaustion (2019, multiple clusters)

kube-proxy creates DNAT rules for every Service endpoint; a cluster with 500 services × 10 pods has 5,000 of them, and every connection through a Service occupies a conntrack entry. Under high traffic the table reaches nf_conntrack_max (often 262,144 by default; the value is derived from available memory). Packets that would create a new entry are then dropped, so new pod connections fail silently. Symptoms: intermittent 5xx errors with no obvious application log entries. Diagnosis: dmesg | grep 'nf_conntrack: table full'. Fix: raise nf_conntrack_max and consider moving to an eBPF dataplane such as Cilium.

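Incident: NAT port exhaustion

A NAT gateway masquerading 1,000 clients that all connect to the same external IP:port has at most 65,535 source ports per external address for that destination, so roughly 65 concurrent connections per client are enough to exhaust the pool. Once no unique source port is free, new connections typically fail at the gateway (visible as insert_failed in conntrack -S) and clients see timeouts or resets. Fix: add more external IPs to spread SNAT across.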
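
The capacity limit is simple arithmetic. The sketch below assumes the gateway allocates source ports from 1024 to 65535, a common but configurable range, and the function names are illustrative.

```python
import math


def snat_capacity(external_ips, port_min=1024, port_max=65535):
    """Max concurrent flows to ONE destination IP:port across the external IPs."""
    return external_ips * (port_max - port_min + 1)


def external_ips_needed(clients, flows_per_client, port_min=1024, port_max=65535):
    """External addresses required to carry the given concurrent flow count."""
    per_ip = port_max - port_min + 1
    return math.ceil(clients * flows_per_client / per_ip)


one_ip = snat_capacity(1)
light = external_ips_needed(1000, 64)
heavy = external_ips_needed(1000, 100)
print(one_ip, light, heavy)
```

Under these assumptions one address carries 64,512 concurrent flows to a single destination, so 1,000 clients at 100 flows each need a second external IP.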

Failure Mode: iptables rule leak

A deploy script adds iptables rules, but its cleanup does not run when a deploy fails. Rules accumulate across deploys, and symptoms emerge hours later as the rule count crosses the performance cliff. Track iptables-save | wc -l as a monitored metric and alert on unexpected growth.
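
A per-table rule count is more useful for alerting than a raw line count. The sketch below parses iptables-save output that has already been captured as text; rules_per_table is a hypothetical helper and the SAMPLE dump is fabricated for illustration.

```python
def rules_per_table(save_text):
    """Count '-A' rule lines under each '*table' section of iptables-save output."""
    counts, table = {}, None
    for line in save_text.splitlines():
        if line.startswith("*"):
            table = line[1:].strip()
            counts[table] = 0
        elif line.startswith("-A ") and table is not None:
            counts[table] += 1
    return counts


# Fabricated sample dump.
SAMPLE = """\
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
*filter
:INPUT DROP [0:0]
-A INPUT -i lo -j ACCEPT
-A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
COMMIT
"""

counts = rules_per_table(SAMPLE)
print(counts)
```

Exporting these counts after every deploy makes a leak show up as a steadily rising series rather than as a latency regression hours later.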

Modern Usage

nftables is the default in RHEL 8+, Debian 10+, and Ubuntu 20.10+; on these systems iptables commands are redirected to iptables-nft (nftables backend) by default
Cilium replaces kube-proxy entirely with eBPF programs, bypassing Netfilter for most Kubernetes traffic — O(1) service lookup via BPF maps instead of O(n) iptables chains
Cloud provider security groups (AWS, GCP, Azure) implement ingress/egress filtering in hypervisor software, effectively offloading Netfilter from the guest VM
Firewalld uses nftables as its backend in modern distributions, providing a higher-level zone/service abstraction

Future Directions

eBPF replacing Netfilter for Kubernetes: projects like Cilium and Calico-eBPF route service traffic through XDP/TC eBPF programs, avoiding iptables chain traversal and Netfilter conntrack overhead for east-west traffic
Flowtable offload: an nftables flowtable (nft add flowtable) gives established connections a fast path that bypasses most of the forwarding stack; with flags offload and a capable NIC, those flows can also be offloaded to hardware
Hardware NAT: modern SmartNICs (Mellanox BlueField, Netronome Agilio) can execute nftables/Netfilter rules in NIC firmware at line rate

Exercises

On a Linux VM, create an iptables ruleset with 5,000 DROP rules that each match a distinct, unused source address, followed by a final rule that ACCEPTs your SSH source IP. Measure throughput with iperf3 as the rule count grows from 100 to 1,000 to 5,000, and plot the throughput curve.
Configure DNAT so that port 8080 on the local host is redirected to port 80 on 127.0.0.1. Because curl localhost:8080 generates packets locally, place the rule in the nat OUTPUT chain rather than PREROUTING. Verify with curl localhost:8080 and with conntrack -L showing the translation. Explain why net.ipv4.ip_forward is not needed for local-to-local DNAT.
Reproduce conntrack table exhaustion in a VM: set nf_conntrack_max=100, then use hping3 --syn -p 80 <host> to generate SYN packets. Observe dmesg and conntrack -S output. Implement NOTRACK in the raw table and verify conntrack count no longer grows.
Rewrite a set of 10 iptables rules that match against individual IPs as a single nftables hash set rule. Benchmark lookup latency for the original iptables approach vs nftables set using perf stat.
Write a Python script using iptc (python-iptables) that enumerates all iptables rules across all tables and identifies rules that are shadowed (never reached due to a prior matching rule).

References

net/netfilter/nf_conntrack_core.c — connection tracking core
net/netfilter/nf_tables_core.c — nftables evaluation engine
include/linux/netfilter.h — hook infrastructure
net/ipv4/netfilter/iptable_filter.c, iptable_nat.c — iptables table implementations
Russell, P. Netfilter Hacking HOWTO. Netfilter.org, 2002.
Neira Ayuso, P. The netfilter connection tracking system. ;login: USENIX Magazine, 2006.
man 8 iptables, man 8 nft, man 8 conntrack
Documentation/networking/nf_conntrack-sysctl.rst
Cilium documentation: eBPF-based Networking, Security, and Observability. cilium.io