TLB (Translation Lookaside Buffer)

Technical Overview

The Translation Lookaside Buffer (TLB) is a fully associative or set-associative cache inside the CPU that stores recent virtual-to-physical address translations. Without a TLB, every memory access on x86-64 with 4-level paging would first require a page table walk of up to four memory reads, so in the worst case one access turns into five. A TLB hit replaces the walk with a lookup that typically takes about one cycle in the L1 TLB.

A TLB miss (the translation is not in the TLB) triggers a hardware or software page table walk. On x86-64 the walk is done in hardware: the CPU's page miss handler (PMH) walks the page table automatically. On some RISC architectures (MIPS, and UltraSPARC-class SPARC V9 implementations), TLB misses are handled in software: the miss raises a trap and the OS refills the TLB. The refilled translation is then cached in the TLB for subsequent accesses.
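The walk that a miss triggers can be sketched as follows. This is a toy model, assuming x86-64 4-level paging with 4KB pages (four 9-bit indices plus a 12-bit offset); the `page_tables` dictionary and the table IDs are hypothetical stand-ins for physical memory, not a real kernel structure.

```python
# Toy model of a 4-level x86-64 page table walk (4KB pages).
LEVELS = ("PML4", "PDPT", "PD", "PT")

def split_va(va: int) -> dict:
    """Split a virtual address into four 9-bit table indices and a 12-bit offset."""
    va &= (1 << 48) - 1              # 4-level paging translates 48 address bits
    parts = {"offset": va & 0xFFF}
    shift = 12
    for name in reversed(LEVELS):    # PT index sits lowest, PML4 index highest
        parts[name] = (va >> shift) & 0x1FF
        shift += 9
    return parts

def walk(page_tables: dict, root: int, va: int):
    """Follow one entry per level; return (physical_address, memory_reads).

    page_tables maps (table_id, index) -> next table_id, or the frame number
    at the last level. Both are hypothetical values for illustration.
    """
    idx = split_va(va)
    table, reads = root, 0
    for name in LEVELS:
        table = page_tables[(table, idx[name])]   # one memory read per level
        reads += 1
    return (table << 12) | idx["offset"], reads
```

Each level costs one memory read, which is where the four extra reads per untranslated access come from.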

TLB performance is critical: in memory-intensive data-center workloads, TLB misses can account for 10–30% of all cycles. Huge pages and PCID reduce TLB pressure; NUMA-aware allocation reduces the cost of the misses that remain.

Prerequisites

Virtual memory concepts (01-virtual-memory.md)
Paging and page table walk mechanics (02-paging.md, 03-page-tables.md)
CPU cache hierarchy basics
Context switch and process scheduling

Core Content

TLB Structure

TLB Organization (Typical Modern x86-64 CPU)
=============================================

L1 ITLB (Instruction TLB, fully associative):
  - 64 entries for 4KB pages
  - 8 entries for 2MB pages
  - 1-cycle lookup latency

L1 DTLB (Data TLB, 4-way set-associative):
  - 64 entries for 4KB pages
  - 32 entries for 2MB pages
  - 1 entry for 1GB pages
  - 1-cycle lookup latency

L2 STLB (Shared TLB, unified, 4-way set-associative):
  - 1536 entries for 4KB pages
  - 32 entries for 2MB pages
  - 7-cycle lookup latency

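Illustrative numbers for a recent x86-64 core (Intel Ice Lake / AMD Zen 3 class).
Exact counts, associativity, and page-size sharing vary per microarchitecture;
on several Intel cores the STLB array is shared between 4KB and 2MB pages.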

TLB Lookup Path:
  VA ──► L1 DTLB hit? ──► PA (1 cycle)
           │
           └─ miss ──► L2 STLB hit? ──► PA (7 cycles)
                          │
                          └─ miss ──► Hardware Page Table Walk ──► PA (100-500 cycles)
                                       │
                                       └─ fills L1 DTLB and L2 STLB

On AMD Zen 3: L1 DTLB = 64 entries (fully associative, all page sizes), L2 DTLB = 2048 entries; Zen 4 raises these to 72 and 3072. On Apple M2, reported figures are L1 TLB = 192 entries and L2 TLB = 3072 entries.
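To see how the hit rates at each level combine, here is a small sketch that computes the average translation cost. It assumes the path latencies from the lookup diagram above (1 cycle for an L1 hit, 7 for an STLB hit, 100 for a walk at the optimistic end of the range); the function name and default values are illustrative only.

```python
def expected_translation_cycles(l1_hit, l2_hit_given_l1_miss,
                                l1_cycles=1, l2_cycles=7, walk_cycles=100):
    """Average cycles per translation for a two-level TLB backed by a walk.

    l1_hit: fraction of accesses that hit the L1 TLB.
    l2_hit_given_l1_miss: fraction of L1 misses that hit the L2 STLB.
    The cycle counts are total latencies per path, taken from the diagram.
    """
    l1_miss = 1.0 - l1_hit
    l2_miss = 1.0 - l2_hit_given_l1_miss
    return (l1_hit * l1_cycles
            + l1_miss * l2_hit_given_l1_miss * l2_cycles
            + l1_miss * l2_miss * walk_cycles)

if __name__ == "__main__":
    for l1, l2 in [(0.99, 0.90), (0.90, 0.50), (0.50, 0.10)]:
        avg = expected_translation_cycles(l1, l2)
        print(f"L1 hit {l1:.0%}, STLB hit {l2:.0%}: {avg:.2f} cycles")
```

With a 99% L1 hit rate and a 90% STLB hit rate on L1 misses, the average stays near 1.15 cycles; the walk term dominates quickly as either hit rate falls.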

TLB Entry Structure

A TLB entry contains:

TLB Entry
==========
[ ASID (8-16 bits) | Virtual Page Number (VPN) | Physical Frame Number (PFN) | Flags ]

Flags: Present, Dirty, Accessed, NX, U/S, Global, Page size (4KB/2MB/1GB)

ASID allows multiple address spaces in the TLB simultaneously.
Without ASID: must flush entire TLB on every context switch.
With ASID:    only flush when ASID space exhausted (rare).

ASID: Address Space Identifier

On context switch without ASID support, the CPU must flush the entire TLB (all entries become invalid since the new process uses different virtual-to-physical mappings). This is costly — the new process starts with a cold TLB.

ASIDs solve this (x86-64 calls the mechanism PCID, Process Context Identifier; ARM64 uses the term ASID directly). Each address space is assigned a small numeric ID (12 bits on x86-64 = 4096 values; 8 or 16 bits on ARM64 depending on the implementation, so up to 65536 values). Each TLB entry is tagged with that ID. On context switch, the kernel loads CR3 with the new page table base and PCID, and entries belonging to other address spaces can stay in the TLB because their tags no longer match.

Without PCID (x86-64):
  Context switch: write CR3 → TLB flushed (MOV to CR3 without NOFLUSH bit)
  First N accesses: TLB misses → slow

With PCID (Intel Westmere+, Linux 4.14+):
  Context switch: write CR3 with bit 63=NOFLUSH
  Old PCID entries remain in TLB → warm TLB
  Only flush when PCID reassigned

Linux PCID management: arch/x86/mm/tlb.c
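  - Per-CPU cpu_tlbstate: ctxs[] records which mm owns each PCID slot; loaded_mm_asid is the slot in use
  - Only a few slots per CPU (TLB_NR_DYN_ASIDS); if no slot matches: recycle the next one round-robin, flush it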
  - KPTI uses separate PCIDs for user/kernel mode (2 PCIDs per process)

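ARM64 works similarly: the ASID is 8 or 16 bits wide depending on the implementation (up to 65536 distinct address spaces), it is carried in the translation table base register (TTBR0_EL1 or TTBR1_EL1, selected by TCR_EL1.A1), and TLB maintenance can be restricted to a single ASID (TLBI ASIDE1).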
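The effect of tagging can be illustrated with a toy model. This sketch assumes an unbounded TLB (no capacity evictions) and two hypothetical processes that each touch the same four virtual pages; the class and function names are made up for the example.

```python
class TaggedTLB:
    """Toy TLB with unlimited capacity, optionally tagged by ASID."""

    def __init__(self, use_asid: bool):
        self.use_asid = use_asid
        self.entries = {}          # (asid, vpn) -> pfn
        self.current_asid = 0
        self.misses = 0

    def context_switch(self, asid: int) -> None:
        if not self.use_asid:
            self.entries.clear()   # untagged entries would be ambiguous: flush all
        self.current_asid = asid

    def translate(self, vpn: int, page_table: dict) -> int:
        tag = self.current_asid if self.use_asid else 0
        key = (tag, vpn)
        if key not in self.entries:
            self.misses += 1                     # miss: refill from the page table
            self.entries[key] = page_table[vpn]
        return self.entries[key]

def run(use_asid: bool, rounds: int = 3) -> int:
    """Alternate between two address spaces and count TLB misses."""
    tlb = TaggedTLB(use_asid)
    tables = {1: {v: 100 + v for v in range(4)},
              2: {v: 200 + v for v in range(4)}}
    for _ in range(rounds):
        for asid, table in tables.items():
            tlb.context_switch(asid)
            for vpn in table:
                assert tlb.translate(vpn, table) == table[vpn]
    return tlb.misses

if __name__ == "__main__":
    print("misses with ASID tags:   ", run(True))
    print("misses without ASID tags:", run(False))
```

In this model the tagged TLB misses only on the first touch of each page (8 misses), while the untagged TLB re-misses on every page after every switch (24 misses over three rounds).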

TLB Shootdown

When a kernel modifies a PTE (e.g., during munmap, mprotect, CoW), it must ensure no CPU has a stale cached translation in its TLB. On a single CPU, writing to CR3 or executing INVLPG <addr> is sufficient. On an SMP system, other CPUs may have cached the old translation.

TLB shootdown sends an IPI (Inter-Processor Interrupt) to all CPUs that have the affected mm loaded. Each receiving CPU executes INVLPG (or flushes the full TLB) and acknowledges.

TLB Shootdown Flow (x86-64 SMP)
=================================

CPU 0 (initiator):                    CPU 1, 2, ... (targets):
  ptep_clear_flush(vma, addr, ptep)
    │
    ├── clear PTE in page table
    │
    └── flush_tlb_range(vma, start, end)
          │
          ├── determine which CPUs have mm loaded
          │   (mm_cpumask(mm))
          │
          ├── send IPI to each target CPU
          │   (native_send_call_func_ipi)
          │         │
          │         ▼
          │   IPI handler: flush_tlb_func() (smp_invalidate_interrupt() in older kernels)
          │         │
          │         ├── INVLPG for each addr in range
          │         │   OR: write CR3 (full flush)
          │         │
          │         └── acknowledge completion
          │
          └── wait for all CPUs to acknowledge
              (spin on cpumask)

Cost: 100-1000 cycles per CPU in the mm_cpumask.
On a 64-core machine doing munmap of a large region: microseconds.
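The handler's choice between per-page INVLPG and a full flush can be sketched as a simple threshold test. This is a model, not kernel code: the ceiling of 33 pages is assumed here as the x86 default of the tunable tlb_single_page_flush_ceiling, and `plan_flush` is a hypothetical helper name.

```python
PAGE_SIZE = 4096
# Assumed default of Linux's tlb_single_page_flush_ceiling on x86 (tunable).
FLUSH_CEILING_PAGES = 33

def plan_flush(start: int, end: int, ceiling: int = FLUSH_CEILING_PAGES):
    """Decide how a target CPU invalidates [start, end).

    Returns ("invlpg", [page addresses]) for small ranges and ("full", [])
    once per-page invalidation would cost more than refilling the TLB.
    """
    first = start & ~(PAGE_SIZE - 1)                      # round down to a page
    pages = (end - first + PAGE_SIZE - 1) // PAGE_SIZE    # round up to whole pages
    if pages > ceiling:
        return ("full", [])
    return ("invlpg", [first + i * PAGE_SIZE for i in range(pages)])

if __name__ == "__main__":
    print(plan_flush(0x1000, 0x3000))          # two pages: per-page invalidation
    print(plan_flush(0, 64 * PAGE_SIZE)[0])    # large range: full flush
```

The trade-off is that a full flush is cheap to issue but discards unrelated translations, so it only pays off when the range is large.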

Linux TLB flush functions (mm/mmu_gather.c, arch/x86/mm/tlb.c):
- flush_tlb_page(vma, addr): flush one page
- flush_tlb_range(vma, start, end): flush a range
- flush_tlb_mm(mm): flush an entire address space
- flush_tlb_kernel_range(start, end): flush kernel pages (no ASID)
- tlb_gather_mmu() / tlb_finish_mmu() (mmu_gather): deferred TLB flush aggregation for large unmap operations

mmu_gather: Linux defers TLB flushes during unmap_page_range() using an mmu_gather structure (include/asm-generic/tlb.h). It accumulates pages to free and address ranges to flush, then issues a single shootdown at the end. This batching is critical for performance of munmap on large regions.
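The batching idea can be sketched with a toy accumulator. This is not the kernel's mmu_gather API: `GatherModel` and its methods are hypothetical names, and the model only tracks one merged address range and a list of pages whose freeing is deferred until after the flush.

```python
class GatherModel:
    """Toy stand-in for the batching idea behind mmu_gather (not the kernel API)."""

    def __init__(self):
        self.start = None
        self.end = None
        self.pending_pages = []   # pages that must not be freed before the flush
        self.freed = []
        self.shootdowns = 0

    def remove_page(self, addr: int, page, size: int = 4096) -> None:
        """Record an unmapped page: widen the flush range, defer the free."""
        self.start = addr if self.start is None else min(self.start, addr)
        self.end = addr + size if self.end is None else max(self.end, addr + size)
        self.pending_pages.append(page)

    def finish(self):
        """Issue one range flush for everything gathered, then free the pages."""
        flushed = (self.start, self.end)
        if self.pending_pages:
            self.shootdowns += 1                    # single shootdown for the batch
            self.freed.extend(self.pending_pages)   # safe only after the flush
            self.pending_pages.clear()
        self.start = self.end = None
        return flushed

if __name__ == "__main__":
    g = GatherModel()
    for i in range(512):                            # unmap 2 MiB in 4KB pages
        g.remove_page(0x200000 + i * 4096, page=i)
    print(g.finish(), "shootdowns:", g.shootdowns)
```

Deferring the free until after the flush matters for correctness as well as speed: a page handed back to the allocator while another CPU still holds a stale translation could be reused and then corrupted.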

TLB and Huge Pages

A 2 MiB huge page uses a single TLB entry (in the L1 DTLB huge-page partition) and covers the same range as 512 4KB pages. A workload with a 4 GB working set needs:
- With 4KB pages: 4GB / 4KB = 1,048,576 translations (far beyond the ~1,500 entries of the L2 STLB)
- With 2MB pages: 4GB / 2MB = 2,048 translations (a 512x reduction; within reach of an STLB that holds 2MB entries in its main array, though not of a 32-entry huge-page partition)

This is the primary motivation for huge pages in memory-intensive applications.

TLB Coverage Comparison
========================

Working set: 4 GB

  4KB pages:   1,048,576 translations needed
               L2 STLB (1536 entries) covers only 6MB → ~99.85% miss rate for uniform random access

  2MB pages:   2,048 translations needed
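               L2 STLB (32 huge entries) covers only 64MB; covering all 4 GB takes
               an STLB that shares its main array between 4KB and 2MB pages
               Miss penalty is a 3-level walk (not 4) → roughly 25% fewer walk reads

  1GB pages:   4 translations needed
               Fits entirely in an L1 DTLB with 4 or more 1GB entries; with the
               single 1GB entry listed above, the 4 pages still compete for one slot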
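The figures in this comparison follow from simple arithmetic, sketched below. The miss-rate estimate assumes uniform random access over the whole working set in steady state, which is a worst case; real workloads with locality do better. The function name is illustrative.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

def coverage(working_set: int, page_size: int, tlb_entries: int):
    """Return (translations needed, TLB reach in bytes, estimated miss rate).

    Assumes uniform random access: the hit probability equals the fraction
    of the working set that the TLB can map at once.
    """
    translations = -(-working_set // page_size)        # ceiling division
    reach = tlb_entries * page_size
    hit_fraction = min(1.0, reach / working_set)
    return translations, reach, 1.0 - hit_fraction

if __name__ == "__main__":
    for label, page, entries in [("4KB", 4 * KB, 1536),
                                 ("2MB", 2 * MB, 32),
                                 ("1GB", 1 * GB, 4)]:
        n, reach, miss = coverage(4 * GB, page, entries)
        print(f"{label}: {n:>9,} translations, reach {reach // MB:>5} MB, "
              f"miss rate {miss:.2%}")
```

Note that a 32-entry huge-page partition still misses most of the time on a 4 GB random-access working set; the gain from 2MB pages then comes mainly from cheaper walks and from STLBs that hold many more 2MB entries.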

PCID (Process Context Identifier) in Detail

PCID is x86-64's implementation of ASID. Enabled by CR4.PCIDE=1. The lower 12 bits of CR3 become the PCID (4096 possible values). Key properties:

MOV to CR3 with bit 63 set: loads new page table WITHOUT TLB flush
MOV to CR3 with bit 63 clear: loads new page table WITH full TLB flush
INVLPG <addr>: flushes TLB entries for addr in the current PCID, plus global entries; entries for other PCIDs are untouched
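INVPCID type, descriptor: flush entries of a specific PCID, by address or wholesale (separate CPUID feature, Haswell and later; not limited to Intel, as newer AMD parts implement it too)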
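How the PCID and the no-flush bit are packed into the value written to CR3 can be sketched as bit arithmetic. This is an illustration of the layout described above (PCID in bits 0-11, no-flush request in bit 63 of the MOV operand); `make_cr3` and `decode_cr3` are hypothetical helper names.

```python
CR3_NOFLUSH = 1 << 63      # bit 63 of the MOV-to-CR3 operand: keep TLB entries
PCID_MASK = 0xFFF          # low 12 bits of CR3 when CR4.PCIDE=1

def make_cr3(pgd_phys: int, pcid: int, noflush: bool) -> int:
    """Build the 64-bit operand for MOV to CR3 with PCID enabled."""
    if pgd_phys & PCID_MASK:
        raise ValueError("page-table root must be 4KB aligned")
    if not 0 <= pcid <= PCID_MASK:
        raise ValueError("PCID is a 12-bit field")
    return pgd_phys | pcid | (CR3_NOFLUSH if noflush else 0)

def decode_cr3(value: int) -> dict:
    """Split a MOV-to-CR3 operand back into its fields."""
    return {
        "pgd_phys": value & ~CR3_NOFLUSH & ~PCID_MASK,
        "pcid": value & PCID_MASK,
        "noflush": bool(value & CR3_NOFLUSH),
    }

if __name__ == "__main__":
    operand = make_cr3(0x1234000, pcid=5, noflush=True)
    print(hex(operand), decode_cr3(operand))
```

Bit 63 is a request to the instruction rather than stored state: it controls whether the write invalidates entries for the PCID being loaded.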

Linux PCID implementation (arch/x86/mm/tlb.c):

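/* Abridged; the kernel's definitions carry additional fields. */
struct tlb_context {
    u64 ctx_id;       /* unique ID of the mm that owns this slot (mm->context.ctx_id) */
    u64 tlb_gen;      /* the mm's flush generation that this CPU has caught up to */
};

/* Per-CPU PCID state */
struct tlb_state {
    struct mm_struct *loaded_mm;
    u16 loaded_mm_asid;           /* ASID slot currently loaded on this CPU */
    u16 next_asid;                /* next slot to recycle when no slot matches */
    struct tlb_context ctxs[TLB_NR_DYN_ASIDS];  /* 6 slots per CPU; KPTI gives each a user and a kernel PCID */
};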
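How these fields interact on a context switch can be sketched in a simplified model. It follows the logic of the kernel's choose_new_asid() as I understand it (reuse a slot whose ctx_id matches, otherwise recycle slots round-robin), but merges bookkeeping that the kernel does in the caller; treat the class below, the starting value of next_asid, and the slot count of 6 as assumptions for illustration.

```python
TLB_NR_DYN_ASIDS = 6   # assumed per-CPU slot count

class CpuTlbState:
    """Simplified per-CPU model of PCID slot reuse (not kernel code)."""

    def __init__(self):
        # ctx_id 0 stands for "slot unused"; real mm ctx_ids are assumed nonzero.
        self.ctxs = [{"ctx_id": 0, "tlb_gen": 0} for _ in range(TLB_NR_DYN_ASIDS)]
        self.next_asid = 1

    def choose_new_asid(self, ctx_id: int, mm_tlb_gen: int):
        """Return (asid, need_flush) for the mm being switched in."""
        for asid, ctx in enumerate(self.ctxs):
            if ctx["ctx_id"] == ctx_id:
                # Slot still belongs to this mm: flush only if it is out of date.
                need_flush = ctx["tlb_gen"] < mm_tlb_gen
                ctx["tlb_gen"] = mm_tlb_gen
                return asid, need_flush
        # No slot matches: recycle the next one in round-robin order.
        asid = self.next_asid
        self.next_asid += 1
        if asid >= TLB_NR_DYN_ASIDS:
            asid = 0
            self.next_asid = 1
        self.ctxs[asid] = {"ctx_id": ctx_id, "tlb_gen": mm_tlb_gen}
        return asid, True       # stale entries of the previous owner must go

if __name__ == "__main__":
    cpu = CpuTlbState()
    for mm in (1, 2, 1, 3, 4, 5, 6, 7, 1):
        print("mm", mm, "->", cpu.choose_new_asid(mm, mm_tlb_gen=1))
```

The generation comparison is what lets a CPU skip shootdown work for an mm it is not running: it catches up lazily when that mm is switched in again.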

TLB Performance Analysis

# Measure TLB misses with perf (event names vary by CPU vendor)
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./program

# More detailed PMU events (Intel; names differ between generations, check `perf list`)
perf stat -e mem_load_retired.l1_miss,\
             mem_load_retired.l2_miss,\
             mem_load_retired.l3_miss,\
             dtlb_load_misses.miss_causes_a_walk,\
             dtlb_load_misses.walk_completed,\
             dtlb_load_misses.walk_duration \
          ./program

# AMD equivalent (event names vary by Zen generation, check `perf list`)
perf stat -e ls_tlb_miss_l1_tlb,ls_l1_dtlb_miss_l2_hit,ls_l1_dtlb_miss_l2_miss ./program

# TLB miss rate
# miss_rate = dtlb_load_misses.miss_causes_a_walk / mem_inst_retired.all_loads

# Check huge page usage (TLB impact)
grep -E "AnonHugePages|HugePages" /proc/meminfo

A TLB miss that hits in the L2 STLB costs ~7 cycles. A miss that requires a full page table walk costs 100–500 cycles, depending on how many of the walk's reads hit in the data caches. At 1 billion memory operations per second, even a 0.1% walk rate means 1 million walks per second; at 100 cycles per walk that is 100 million cycles per second, or about 33 ms of every second on a 3 GHz core, and up to five times that at 500 cycles per walk.
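The arithmetic behind this estimate is easy to reproduce for other rates. A minimal sketch, with an illustrative function name and the figures from the paragraph above as inputs:

```python
def walk_overhead(mem_ops_per_sec: float, walk_rate: float,
                  walk_cycles: float, clock_hz: float):
    """Return (walk cycles per second, fraction of each second spent walking)."""
    walks_per_sec = mem_ops_per_sec * walk_rate
    cycles_per_sec = walks_per_sec * walk_cycles
    return cycles_per_sec, cycles_per_sec / clock_hz

if __name__ == "__main__":
    for walk_cycles in (100, 500):
        cycles, fraction = walk_overhead(1e9, 0.001, walk_cycles, 3e9)
        print(f"{walk_cycles} cycles/walk: {cycles:.3g} cycles/s, "
              f"{fraction * 1000:.1f} ms per second")
```

The walk rate to plug in is the measured ratio of completed walks to retired loads, as computed from the perf counters listed earlier in this section.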

Historical Context

TLBs were first described in the Atlas Computer (Manchester, 1962) which called them "associative memory." The term "Translation Lookaside Buffer" was coined by Liptay (1968). Early TLBs were small (4–16 entries, fully associative). The VAX had a 128-entry TLB. The Intel 80386 had a 32-entry TLB with no ASID. The Pentium Pro added larger TLBs with separate instruction/data TLBs. ASID support was added to MIPS (1992) and ARM (ARMv6, 2002). PCID was added to x86-64 in Westmere (2010). Linux enabled PCID support in kernel 4.14 (2017), coincidentally just before Meltdown made KPTI+PCID a necessity.

Production Examples

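Redis and TLB pressure: Redis keeps its dataset in hash tables in anonymous memory. With a 50 GB dataset and 4KB pages, the working set spans about 12.5 million pages. A 1536-entry L2 STLB covers a negligible fraction of that, so random key lookups miss the TLB almost every time. 2MB pages cut the number of translations by 512x (to about 24,400), and memory latency improvements of 20–40% have been reported in benchmarks. Redis's own documentation nevertheless recommends disabling THP, because copy-on-write of 2MB pages during fork-based persistence causes latency spikes, so the trade-off is workload-dependent.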

PostgreSQL shared buffers: PostgreSQL manages its buffer pool in 8KB pages, but the pool itself is one large shared memory region, which makes it a good fit for HugeTLB. Backing it with 2MB pages (configured via huge_pages=on) has been reported to reduce CPU time by 10–20% on OLTP workloads by reducing TLB pressure.

Java GC pause inflation by TLB: During a full GC, the JVM sweeps the entire heap. With a 32 GB heap and 4KB pages, the sweep touches about 8 million pages, each needing its own translation. With 2MB THP, this drops to about 16,000. JVMs run with -XX:+UseTransparentHugePages are reported to show 15–30% shorter GC pauses.

Debugging Notes

# Quick TLB miss check
perf stat -e dTLB-load-misses,dTLB-loads ./myapp 2>&1 | grep -E "miss|load"

# Full TLB pressure profile
perf record -e dtlb_load_misses.miss_causes_a_walk -g ./myapp
perf report --sort=dso,sym

# Check if the CPU advertises PCID
grep -m1 -ow pcid /proc/cpuinfo
# Kernel log lines that mention PCID, if any
dmesg | grep -i pcid

# Observe TLB flush IPIs (shootdowns)
# /proc/interrupts has a "TLB" row (TLB shootdowns) with one counter per CPU
grep TLB /proc/interrupts
# High and growing TLB counts on multiple CPUs = heavy shootdown activity

# Identify shootdown-heavy workloads (frequent mprotect/munmap)
strace -f -c -e trace=mprotect,munmap,mmap ./myapp

Security Implications

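Meltdown and KPTI TLB cost: KPTI requires two separate CR3 values per process (user/kernel). Without PCID, every kernel entry and exit switches CR3 and flushes the non-global TLB entries. With PCID, each address space gets a pair of PCIDs inside the 12-bit space: Linux derives the user-mode PCID from the kernel-mode PCID by setting bit 11, so the two page-table views keep separate TLB entries and the CR3 switch needs no flush. On syscall-heavy workloads this reduces the reported KPTI overhead from as much as ~30% to roughly 1–5%.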
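Because a PCID is only 12 bits wide, the user/kernel split has to live inside the range 0–4095. A sketch of the mapping, modeled on the kern_pcid()/user_pcid() helpers in arch/x86/mm/tlb.c as I understand them; the offset of 1 and the use of bit 11 are assumptions that should be checked against the kernel version in use.

```python
PTI_USER_PCID_BIT = 11          # assumed: user PCID = kernel PCID with bit 11 set
MAX_PCID = (1 << 12) - 1        # PCIDs are 12 bits wide

def kern_pcid(asid: int) -> int:
    """Kernel-mode PCID for a dynamic ASID slot (assumed offset: PCID 0 is skipped)."""
    return asid + 1

def user_pcid(asid: int) -> int:
    """User-mode PCID paired with the same ASID slot under KPTI."""
    return kern_pcid(asid) | (1 << PTI_USER_PCID_BIT)

if __name__ == "__main__":
    for asid in range(6):
        print(f"asid {asid}: kernel PCID {kern_pcid(asid)}, "
              f"user PCID {user_pcid(asid)}")
```

Deriving one PCID from the other with a single bit lets the entry/exit code switch between the two views by flipping bits in CR3 instead of looking anything up.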

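ASID exhaustion as DoS: The hardware PCID space is small (4096 values on x86), and Linux uses only a handful of PCID slots per CPU. When more address spaces rotate through a CPU than it has slots, the kernel recycles a slot and flushes its TLB entries. An attacker who can keep many processes runnable on the victim's CPU can force constant recycling, increasing context switch overhead for the victim.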

TLB side channels: Like cache attacks such as Flush+Reload, timing attacks can target TLB state. If an attacker sharing a core (for example an SMT sibling, as in the published TLBleed attack) can observe TLB hits and misses for specific pages, they can infer secret-dependent access patterns (e.g., which lookup-table pages a cipher touches).

Spectre variant 2 and BTB: The Branch Target Buffer (BTB) is analogous to a TLB for indirect branches. Spectre exploits this to redirect speculative execution. The mitigations (retpoline, IBRS, IBPB) flush or fence the BTB, incurring costs similar to TLB flushes.

Performance Implications

Context switch cost: Without PCID: ~200–500 cycles for TLB flush + warm-up. With PCID: ~50 cycles for CR3 write (no flush).
munmap of large regions: Unmapping triggers TLB shootdowns to all CPUs with the mm loaded. On a 64-core machine, one shootdown sends IPIs to 63 CPUs; at ~500 cycles each that is ~63 × 500 = 31,500 cycles of aggregate work, although the targets run their handlers in parallel and the initiator waits only for the slowest one. Linux batches flushes with mmu_gather so that unmapping 1 GB needs far fewer shootdowns than one per page.
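Fork + exec pattern: fork() write-protects the parent's writable private pages to set up CoW, so the parent's stale writable TLB entries must be flushed. exec() replaces the mm, so the process starts with no usable translations for its new address space. If fork is not followed by exec (as in process pools), the child runs on its own mm (and its own PCID) and builds up TLB entries from the inherited page tables.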
NUMA and TLB: Remote NUMA page table walks are ~2–4x slower (DRAM latency is ~100 ns local, ~200–400 ns remote). Keeping page table pages on the local NUMA node reduces TLB miss penalties.
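The shootdown figures above can be turned into a small estimator that separates aggregate work from the initiator's wait. This is a rough model using the per-CPU handling cost quoted in this section; it ignores IPI delivery latency and interrupt-disabled windows on the targets, and the function name is illustrative.

```python
def shootdown_cost(target_cpus: int, per_cpu_cycles: int, flushes: int = 1):
    """Return (aggregate cycles across targets, initiator wait in cycles).

    Targets are assumed to handle the IPI in parallel, so the initiator
    waits roughly one handler's duration per flush.
    """
    aggregate = target_cpus * per_cpu_cycles * flushes
    initiator_wait = per_cpu_cycles * flushes
    return aggregate, initiator_wait

if __name__ == "__main__":
    pages_in_1gb = (1 << 30) // 4096
    print("one batched shootdown:  ", shootdown_cost(63, 500))
    print("one shootdown per page: ", shootdown_cost(63, 500, pages_in_1gb))
```

Comparing the two lines shows why batching matters: the per-flush cost is modest, but multiplying it by a quarter of a million pages is not.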

Failure Modes and Real Incidents

Missing TLB flush after PTE modification: A kernel developer forgets to call flush_tlb_page() after modifying a PTE. The stale TLB entry is used by another CPU, which reads from the wrong physical page. This is silent data corruption. CONFIG_DEBUG_VM enables extra checks.

TLB shootdown storm: A multithreaded application running on a many-core machine does thousands of mprotect() calls per second (e.g., a managed runtime that implements barriers or safepoint polls by toggling page protections). Each mprotect() that removes permissions triggers a shootdown IPI to every core running the process. The kernel's TLB shootdown path becomes a serialization bottleneck. Observed in production Java workloads with G1GC on 96-core machines.

mremap() TLB flush race: CVE-2018-18281. mremap() performed its TLB flush after dropping the page table locks, so a concurrent operation such as ftruncate() could free a page while a stale TLB entry for it still existed, allowing a process to keep accessing the physical page after it had been returned to the allocator.

Modern Usage

PCID in KPTI: Linux 4.15+ uses PCID to make KPTI practical on Intel CPUs that support it. Without PCID, KPTI overhead on syscall-heavy workloads was reported at 10–30%; with PCID it is typically a few percent or less.
Lazy TLB mode: When a CPU switches to a kernel thread (no user address space), it enters "lazy TLB" mode — it does not update CR3. Subsequent user-mode TLB flushes for the previous mm are deferred until the CPU switches back to a user process.
Guest TLB management in KVM: KVM uses EPT (Extended Page Tables) for hardware-assisted virtualization. Translating a guest virtual address requires a two-dimensional page table walk (guest VA → guest PA → host PA), and the TLB caches the combined guest VA → host PA result. VPID tags keep guest and host entries apart so VM entry and exit need not flush the TLB; KVM issues INVEPT/INVVPID when EPT or guest mappings change.
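The cost of the two-dimensional walk can be counted directly. A minimal sketch, assuming no paging-structure caches help: every guest table pointer is a guest-physical address that needs its own host walk before the guest entry can be read.

```python
def nested_walk_refs(guest_levels: int = 4, host_levels: int = 4) -> int:
    """Worst-case memory references for one guest VA -> host PA translation.

    For each guest level: host_levels reads to translate the guest table's
    address, plus 1 read of the guest entry itself. The final guest-physical
    address then needs one more host walk.
    """
    return guest_levels * (host_levels + 1) + host_levels

if __name__ == "__main__":
    print("4-level guest on 4-level host:", nested_walk_refs(4, 4))
    print("5-level guest on 5-level host:", nested_walk_refs(5, 5))
```

With 4-level tables on both sides this comes to 24 references, compared with 4 for a native walk, which is why TLB reach and huge pages matter even more under virtualization.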

Future Directions

Hardware TLB management: Future ISAs may allow the OS to hint the hardware about TLB entry lifetimes (like cache hints), enabling more intelligent eviction policies.
Software-managed TLBs (RISC-V Sv48/Sv57): RISC-V's page-based virtual memory allows future implementations to use software-managed TLBs, giving the OS full control over TLB replacement policy.
Predictive prefetching into TLB: Similar to data prefetching, hardware prefetch engines could speculatively walk page tables for "likely next" VPNs. AMD's DTLB prefetcher does limited versions of this.
TLB for persistent memory: NVDIMM/Optane direct-access (DAX) mappings require TLB entries for byte-addressable persistent memory. TLB flush semantics on crash recovery are an open research problem.

Exercises

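Write a benchmark that accesses memory in a stride pattern, varying stride from 4KB to 2MB and, for each stride, the number of distinct pages touched. Plot latency per access against the number of pages. The performance cliff that appears once the page count exceeds the L1 DTLB entry count confirms its capacity; a second step marks the L2 STLB.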
Use perf stat to measure dtlb_load_misses.miss_causes_a_walk for a matrix multiplication with row-major vs column-major access on a 10000×10000 matrix. Explain the difference.
Force TLB shootdowns by having one thread do mprotect() in a loop while another thread accesses the affected pages. Measure IPI rate via /proc/interrupts.
Compare the context switch overhead with and without PCID by writing a benchmark that context-switches between two threads in a tight loop, and measuring cache/TLB warm-up effects.
Implement a user-space TLB simulator: given a stream of virtual page accesses and a TLB of N entries with LRU replacement, compute the miss rate for real access traces (for example, addresses sampled with perf mem record; page-fault events alone only capture the first touch of each page).
Demonstrate TLB-induced NUMA effects: allocate a page table on NUMA node 0, and perform TLB walks from a thread on NUMA node 1. Compare latency vs local page table placement.

References

arch/x86/mm/tlb.c — flush_tlb_range(), switch_mm_irqs_off(), PCID management
arch/x86/include/asm/tlbflush.h — TLB flush API
include/asm-generic/tlb.h — mmu_gather structure for deferred TLB flush
mm/mmu_gather.c — tlb_gather_mmu(), tlb_finish_mmu()
arch/arm64/mm/tlb.c — ARM64 ASID management
Intel SDM Vol. 3A, Section 4.10 — Caching Translation Information (TLBs)
AMD64 APM Vol. 2, Section 5.5 — TLB Management
Ulrich Drepper, "What Every Programmer Should Know About Memory" (2007), Section 3
Bakhvalov, "Performance Analysis and Tuning on Modern CPUs", Chapter 5
LWN: "PCID is now a thing" — https://lwn.net/Articles/735473/
LWN: "TLB shootdown scalability" — https://lwn.net/Articles/329292/
CVE-2017-5754 + KPTI: https://www.kernel.org/doc/html/latest/x86/kaiser.html