Huge Pages
Technical Overview
A huge page is a page larger than the base 4 KiB page size. On x86-64, the hardware supports 2 MiB pages (using the Page Size bit in a PMD entry) and 1 GiB pages (using the Page Size bit in a PUD entry). ARM64 supports 2 MiB and 1 GiB pages with 4 KiB granules, 32 MiB pages with 16 KiB granules, and 512 MiB pages with 64 KiB granules.
The primary motivation is TLB coverage: a single TLB entry for a 2 MiB page covers the same address range as 512 entries for 4 KiB pages. For workloads with large, randomly-accessed working sets (in-memory databases, JVM heaps, analytics engines), TLB misses can become the dominant performance bottleneck, and huge pages have been reported to cut latency by 20–50% (see Production Examples).
Linux supports huge pages through two mechanisms:
1. HugeTLB (explicit, pre-allocated): The administrator reserves huge pages at boot/runtime; applications opt in via hugetlbfs or mmap(MAP_HUGETLB).
2. THP (Transparent Huge Pages): The kernel automatically promotes aligned 2 MiB regions to huge pages, with no application changes required.
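The huge page sizes listed in the overview follow from the page table geometry. The sketch below assumes 8-byte table entries and one table per granule, which holds for x86-64 and for ARM64 stage-1 tables; the function name block_size is illustrative and not a kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Size of a block mapping that sits `levels_above_pte` levels above the
 * last-level page table. Each table fills one granule and holds 8-byte
 * entries (an assumption stated in the text above). */
uint64_t block_size(uint64_t granule, unsigned levels_above_pte)
{
    assert(granule >= 8);
    uint64_t entries = granule / 8;   /* entries per table */
    uint64_t size = granule;          /* level 0 above PTE = base page */
    for (unsigned i = 0; i < levels_above_pte; i++)
        size *= entries;
    return size;
}
```

With a 4 KiB granule, block_size(4096, 1) is 2 MiB and block_size(4096, 2) is 1 GiB; with 16 KiB and 64 KiB granules, one level up gives 32 MiB and 512 MiB respectively.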
Prerequisites
- Paging and page table structure (02-paging.md, 03-page-tables.md)
- TLB architecture and pressure analysis (04-tlb.md)
- Buddy allocator (06-buddy-allocator.md — for contiguous memory requirements)
- NUMA topology basics (11-numa-memory.md)
Core Content
Why Huge Pages Exist: TLB Coverage
TLB Coverage with 4KB vs 2MB Pages
=====================================
Scenario: 32 GB in-memory database, random read workload
With 4KB pages:
Working set pages: 32GB / 4KB = 8,388,608 translations needed
L2 STLB capacity: ~1,536 entries (4KB partition)
Coverage: 1,536 * 4KB = 6 MB (0.02% of working set)
TLB miss rate: ~99.98% on random access
Each miss: ~150 cycles (page table walk, if cached)
At 500M reads/s: 499,900,000 misses/s * 150 cycles = ~75B cycles/s wasted
= 25 seconds of CPU time wasted per second (on 3GHz CPU)
(distributed across multiple cores)
With 2MB pages:
Working set pages: 32GB / 2MB = 16,384 translations needed
L1 DTLB capacity: ~32 entries (2MB huge page partition)
L1 DTLB coverage: 32 * 2MB = 64 MB (0.2% of working set)
TLB miss rate: ~99.8% (still high, but miss cost is 3-level walk, not 4)
Improvement: ~20-40% latency reduction observed in practice
With 1GB pages:
Working set pages: 32GB / 1GB = 32 translations needed
All 32 fit in a TLB with at least 32 entries for 1GB pages: ~0% miss rate (1GB entry counts vary by CPU)
Improvement: Up to 30-50% latency reduction
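The arithmetic in this scenario can be reproduced with a few helper functions. This is a minimal sketch that assumes uniformly random access over the working set; the function names are illustrative, and the TLB entry counts and cycle costs are the approximate figures quoted above, not measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct translations needed to map a working set. */
uint64_t translations_needed(uint64_t working_set, uint64_t page_size)
{
    assert(page_size > 0);
    return (working_set + page_size - 1) / page_size;   /* round up */
}

/* Bytes of address space covered by a TLB with `entries` entries. */
uint64_t tlb_coverage(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}

/* Expected miss fraction for uniformly random access: the chance that
 * a touched page is not one of the cached translations. */
double random_miss_fraction(uint64_t entries, uint64_t page_size,
                            uint64_t working_set)
{
    uint64_t covered = tlb_coverage(entries, page_size);
    if (covered >= working_set)
        return 0.0;
    return 1.0 - (double)covered / (double)working_set;
}

/* CPU-seconds spent on page walks per wall-clock second. */
double walk_cpu_seconds(double accesses_per_sec, double miss_fraction,
                        double cycles_per_miss, double cpu_hz)
{
    return accesses_per_sec * miss_fraction * cycles_per_miss / cpu_hz;
}
```

For the 4KB case, random_miss_fraction(1536, 4096, 32ULL << 30) is about 0.9998, and feeding that into walk_cpu_seconds(500e6, m, 150.0, 3e9) gives roughly 25 CPU-seconds per second, matching the figures above.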
HugeTLB: Explicit Huge Pages
HugeTLB pre-allocates huge pages from physically contiguous memory at system startup or via sysfs. Pages are managed in a separate pool (/proc/sys/vm/nr_hugepages).
# Allocate 1000 2MB huge pages at runtime (requires physically contiguous memory)
echo 1000 > /proc/sys/vm/nr_hugepages
# Or at boot: hugepages=1000 on kernel cmdline
# Verify allocation
grep HugePages /proc/meminfo
# HugePages_Total: 1000
# HugePages_Free: 1000
# HugePages_Rsvd: 0
# HugePages_Surp: 0
# Hugepagesize: 2048 kB
# Hugetlb: 2048000 kB
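# For 1GB pages (/proc/sys/vm/nr_hugepages only controls the default huge page size)
echo 4 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# 1GB pages are hard to find at runtime; prefer boot-time: hugepagesz=1G hugepages=4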
Applications use HugeTLB pages via:
1. hugetlbfs mount:
mount -t hugetlbfs nodev /mnt/hugetlbfs -o pagesize=2M,size=4G
# Application opens file on hugetlbfs, ftruncate, mmap
2. mmap with MAP_HUGETLB:
void *p = mmap(NULL, 2*1024*1024,
PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB,
-1, 0);
// For 1GB pages: MAP_HUGETLB | MAP_HUGE_1GB
// For 2MB pages: MAP_HUGETLB | MAP_HUGE_2MB
3. SysV shared memory with SHM_HUGETLB:
int shmid = shmget(IPC_PRIVATE, size, SHM_HUGETLB|IPC_CREAT|SHM_R|SHM_W);
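In practice the MAP_HUGETLB call needs error handling, because it fails when the pool has no free pages. The following is a minimal sketch of one common pattern, falling back to regular pages; the helper name map_huge_or_fallback is hypothetical, and whether a silent fallback is acceptable depends on the application.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Try to map `len` bytes from the HugeTLB pool; fall back to regular
 * pages if the pool is exhausted or HugeTLB is unavailable.
 * `len` should be a multiple of the huge page size.
 * Sets *used_huge to 1 or 0; returns NULL if both attempts fail. */
void *map_huge_or_fallback(size_t len, int *used_huge)
{
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   flags | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *used_huge = 1;
        return p;
    }

    /* The failure is typically ENOMEM (no free huge pages) or
     * EINVAL (huge pages not supported); retry with 4 KiB pages. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    *used_huge = 0;
    return p;
}
```

A caller would check the returned pointer, optionally log the value of used_huge, and release the region with munmap(p, len).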
HugeTLB internals: mm/hugetlb.c. The pool is managed via hstate structures (one per page size). alloc_huge_page() allocates from the pool; free_huge_page() returns to the pool without returning to the buddy allocator.
Transparent Huge Pages (THP)
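THP allows the kernel to use huge pages transparently, without application changes. Huge pages are created in two ways: the page fault handler can allocate a 2MB page directly on the first touch of an aligned range, and the khugepaged daemon scans anonymous VMAs in the background and attempts to collapse 512 virtually contiguous, 2MB-aligned 4KB pages into a single 2MB page. The diagram shows the khugepaged path.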
THP Architecture
================
User process:
malloc(200MB) ──► anonymous mmap (4KB pages initially)
|
| First touch: 4KB page faults
| (regular demand paging)
|
▼
khugepaged daemon (kernel thread)
|
| Scans for 2MB-aligned ranges of 512 PTEs
| where the 4KB pages are:
| - Mostly present (up to max_ptes_none
| empty PTEs are tolerated; default 511)
| - Anonymous (not file-backed)
| - Not locked/pinned
| (physical contiguity is not required:
| the data is copied into a new frame)
|
▼
collapse_huge_page()
|
| Allocate one 2MB contiguous frame
| Copy 512 * 4KB → one 2MB frame
| Install 2MB PMD entry
| Free 512 old frames + PT page
|
▼
Process now uses 2MB huge page
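Only 2MB-aligned, 2MB-sized chunks that lie fully inside a VMA can be mapped by a single PMD entry. The sketch below counts such chunks for an address range; the function name and macro are illustrative and not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_CHUNK ((uintptr_t)2 << 20)   /* 2 MiB, the x86-64 PMD size */

/* Count the 2 MiB-aligned, 2 MiB-sized chunks fully inside [start, end).
 * Only these chunks are candidates for a PMD-mapped huge page. */
uint64_t thp_candidate_chunks(uintptr_t start, uintptr_t end)
{
    assert(start <= end);
    uintptr_t first = (start + PMD_CHUNK - 1) & ~(PMD_CHUNK - 1); /* round up */
    uintptr_t last  = end & ~(PMD_CHUNK - 1);                     /* round down */
    return last > first ? (last - first) / PMD_CHUNK : 0;
}
```

A 4 MiB range that starts one page past a 2 MiB boundary therefore yields fewer candidates than its size suggests, which is why allocators that want THP align large regions to 2 MiB.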
THP configuration:
# Global THP setting
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
# "always": THP for all anonymous VMAs
# "madvise": THP only for VMAs with MADV_HUGEPAGE hint
# "never": THP disabled
# Per-region opt-in (C, called from application code)
madvise(addr, len, MADV_HUGEPAGE); /* request THP for this range */
madvise(addr, len, MADV_NOHUGEPAGE); /* disable THP for this range */
# khugepaged scan tunables
cat /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
cat /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
cat /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
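With enabled set to madvise, an application has to align its region and issue the hint itself. The following sketch shows one way to do that with plain mmap; the helper name map_thp_hinted is hypothetical, it assumes a 2 MiB huge page size, and the hint is best-effort (the kernel may still use 4 KiB pages).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define THP_SIZE (2UL << 20)   /* assumed PMD huge page size: 2 MiB */

/* Map `len` bytes of anonymous memory aligned to 2 MiB and hint that
 * the kernel should back it with THP. `len` should be a multiple of
 * 2 MiB. Returns NULL on mmap failure. A madvise() failure (for
 * example on a kernel built without THP) is ignored. */
void *map_thp_hinted(size_t len)
{
    /* Over-allocate so that an aligned start is guaranteed to exist. */
    size_t padded = len + THP_SIZE;
    char *raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t start = ((uintptr_t)raw + THP_SIZE - 1) & ~(THP_SIZE - 1);
    size_t head = start - (uintptr_t)raw;
    size_t tail = padded - head - len;

    /* Trim the unaligned head and the unused tail. */
    if (head)
        munmap(raw, head);
    if (tail)
        munmap((char *)start + len, tail);

    (void)madvise((void *)start, len, MADV_HUGEPAGE);
    return (void *)start;
}
```

Whether the region actually ends up huge-page-backed can be checked afterwards through the AnonHugePages field of the process's smaps entry.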
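THP code paths:
- do_huge_pmd_anonymous_page() in mm/huge_memory.c: allocate a 2MB page on the first fault if the faulting range is aligned
- khugepaged_do_scan() in mm/khugepaged.c (part of mm/huge_memory.c before Linux 4.8): the collapse thread's scan loop
- split_huge_pmd() in mm/huge_memory.c: split a 2MB PMD mapping back into 512 × 4KB PTEs (on partial mprotect, munmap, etc.)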
THP and Fragmentation
THP requires a physically contiguous, 2 MiB-aligned block (an order-9 buddy allocation). On a long-running system, physical memory becomes fragmented: there may be plenty of free 4 KiB pages but no free contiguous 2 MiB blocks. This causes:
- THP allocation failure: khugepaged cannot collapse, and page faults fall back to 4 KiB pages, because no contiguous 2 MiB block is available.
- Memory compaction: The kernel may trigger compact_zone() to move pages and create contiguous blocks. This is expensive and shows up as latency spikes.
- HugeTLB pool growth failure: Raising nr_hugepages at runtime can fall short of the requested count, so a later mmap with MAP_HUGETLB fails with ENOMEM once the pool is exhausted, even though plenty of memory is free (just fragmented).
# Check fragmentation
cat /proc/buddyinfo
# Node 0, zone Normal: 100 200 50 30 10 5 2 1 0 0 0
# Columns: order 0 (4KB) through order 10 (4MB)
# Zeros in columns 9 and 10 (order-9 = 2MB, order-10 = 4MB) = no 2MB block is free without compaction
# Force compaction
echo 1 > /proc/sys/vm/compact_memory
# Monitor compaction events
grep compact /proc/vmstat
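The buddyinfo check can be automated. The sketch below parses a single line and reports how many 2 MiB blocks are immediately available; it assumes the 11-column layout (orders 0 through 10) shown above, which can differ on kernels built with another maximum order, and the function name is illustrative.

```c
#include <stdio.h>
#include <stdlib.h>

#define BUDDY_COLS 11   /* orders 0..10, as in the sample above */

/* Parse one /proc/buddyinfo line and return how many 2 MiB (order-9)
 * blocks could be handed out without compaction: each free order-9
 * block counts once and each free order-10 block splits into two.
 * Returns -1 if the line cannot be parsed. */
long free_order9_blocks(const char *line)
{
    int node, off = 0;
    char zone[32];
    if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &off) != 2)
        return -1;

    long counts[BUDDY_COLS];
    const char *p = line + off;
    for (int i = 0; i < BUDDY_COLS; i++) {
        char *end;
        counts[i] = strtol(p, &end, 10);
        if (end == p)
            return -1;   /* fewer columns than expected */
        p = end;
    }
    return counts[9] + 2 * counts[10];
}
```

A caller would read /proc/buddyinfo line by line with fgets() and sum the result over the zones of interest.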
NUMA and Huge Pages
HugeTLB pages are allocated from specific NUMA nodes:
# Per-node huge page allocation
echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 512 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
# numactl with huge pages
numactl --membind=0 ./database_server
THP respects NUMA first-touch policy: the first CPU to fault in a page determines which NUMA node the huge page is allocated from. On a 4-socket machine, applications should be NUMA-pinned before touching memory to avoid remote huge page allocations.
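A monitoring tool can read the per-node pool counters from the same sysfs files that the echo commands above write to. This is a minimal sketch; the two function names are hypothetical, and the read simply returns -1 when the file does not exist.

```c
#include <stdio.h>

/* Build the sysfs path of the per-node HugeTLB pool counter.
 * `size_kb` is the huge page size in KiB (2048 for 2 MiB pages).
 * Returns 0 on success, -1 if the buffer is too small. */
int node_hugepages_path(char *buf, size_t buflen, int node,
                        unsigned long size_kb)
{
    int n = snprintf(buf, buflen,
                     "/sys/devices/system/node/node%d/hugepages/"
                     "hugepages-%lukB/nr_hugepages", node, size_kb);
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Read the counter; returns -1 if the file is missing or unreadable
 * (for example on a kernel without NUMA or HugeTLB support). */
long read_node_hugepages(int node, unsigned long size_kb)
{
    char path[128];
    long value = -1;
    if (node_hugepages_path(path, sizeof path, node, size_kb) != 0)
        return -1;
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Comparing read_node_hugepages(0, 2048) with read_node_hugepages(1, 2048) shows whether the pool is balanced across the nodes the application is bound to.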
Huge Pages for Databases
PostgreSQL:
# postgresql.conf
huge_pages = on # use HugeTLB for shared buffers
huge_page_size = 0 # 0 = use default (2MB)
shared_buffers = 256GB # Must have enough HugeTLB pages pre-allocated
# Verification
SELECT name, setting FROM pg_settings WHERE name LIKE 'huge%';
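When huge_pages=on, PostgreSQL maps its main shared memory segment with mmap() and MAP_HUGETLB (CreateAnonymousSegment() in src/backend/port/sysv_shmem.c); on Linux this requires the default shared_memory_type = mmap. With huge_pages=on the server refuses to start if the pool is too small, whereas the default huge_pages=try falls back to regular pages. Performance impact: roughly 10–25% throughput improvement on OLAP workloads with large shared_buffers, depending on the workload.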
Oracle Database: Oracle uses HugeTLB (called "Large Pages" in Oracle terminology) for its SGA. Configuration:
# Oracle sga_target = 500G requires HugeTLB pre-allocation
echo 256000 > /proc/sys/vm/nr_hugepages # 500 GiB at 2 MiB each (256000 x 2 MiB)
# Oracle detects and uses them automatically
MySQL/InnoDB:
[mysqld]
large-pages # enable HugeTLB for InnoDB buffer pool
innodb_buffer_pool_size=128G
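The pool sizes in the database examples above come from dividing the shared memory size by the huge page size and rounding up. The sketch below shows only that arithmetic; real deployments usually add headroom for memory the database allocates beyond the buffer pool, and the function name is illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Huge pages needed to back `bytes` of memory, rounded up so that a
 * partial page at the end still gets a full huge page. */
uint64_t hugepages_needed(uint64_t bytes, uint64_t hugepage_size)
{
    assert(hugepage_size > 0);
    return (bytes + hugepage_size - 1) / hugepage_size;
}
```

For a 500 GiB SGA with 2 MiB pages, hugepages_needed(500ULL << 30, 2ULL << 20) gives 256000, the value written to nr_hugepages in the Oracle example; a 128 GiB InnoDB buffer pool needs 65536 pages.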
Huge Pages for JVM
# JVM huge page options
-XX:+UseLargePages # use HugeTLB if available
-XX:+UseTransparentHugePages # use THP (kernel handles it)
-XX:LargePageSizeInBytes=2m # specify page size
-XX:+AlwaysPreTouch # touch all heap pages at startup (faults the heap in up front, so huge pages are allocated before the application runs)
# JVM heap as HugeTLB:
# With -XX:+UseLargePages, JVM calls mmap(MAP_HUGETLB) for the heap
# Requires pre-allocated HugeTLB pages >= Xmx size
Production data: Netflix engineering reportedly saw a 15–20% GC pause reduction and a 10% throughput improvement for JVM services running with THP in madvise mode and -XX:+AlwaysPreTouch.
/proc/meminfo Huge Page Fields
HugePages_Total: 2048 # pre-allocated HugeTLB pool
HugePages_Free: 1024 # available in pool
HugePages_Rsvd: 256 # reserved by mmap() but not yet faulted
HugePages_Surp: 0 # surplus (above nr_hugepages limit, from overcommit)
Hugepagesize: 2048 kB # default huge page size
Hugetlb: 4194304 kB # total memory in HugeTLB pool (sum all sizes)
AnonHugePages: 3276800 kB # memory in THP anonymous pages
ShmemHugePages: 0 kB # THP in shared memory
FileHugePages: 0 kB # THP in file cache (only with CONFIG_READ_ONLY_THP_FOR_FS)
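These fields are easy to read programmatically. The sketch below extracts a field from meminfo-style text and derives how many huge pages a new mapping can still reserve; both function names are illustrative, and the text is passed in as a string so the parsing can be exercised without reading /proc.

```c
#include <stdio.h>
#include <string.h>

/* Find "<key>:" at the start of a line in /proc/meminfo-style text and
 * return its numeric value, or -1 if the key is absent. */
long meminfo_value(const char *text, const char *key)
{
    size_t klen = strlen(key);
    for (const char *line = text; line && *line; ) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long v;
            if (sscanf(line + klen + 1, "%ld", &v) == 1)
                return v;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;   /* step past the newline */
    }
    return -1;
}

/* Huge pages a new mapping can still reserve: free pages minus those
 * already promised to existing mappings (HugePages_Rsvd). */
long hugepages_available(const char *text)
{
    long free_pages = meminfo_value(text, "HugePages_Free");
    long rsvd = meminfo_value(text, "HugePages_Rsvd");
    if (free_pages < 0 || rsvd < 0)
        return -1;
    return free_pages - rsvd;
}
```

With the sample values above (1024 free, 256 reserved), 768 pages remain for new reservations.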
Historical Context
Huge page support was added to the Linux kernel in 2.5.36 (2002) via the HugeTLB subsystem, developed initially for large database workloads on IA-64. x86 support followed quickly. Support for 1 GiB huge pages was added in Linux 2.6.27 (2008). Transparent Huge Pages were introduced in Linux 2.6.38 (2011) by Andrea Arcangeli, motivated by the fact that the manual HugeTLB interface was too complex for most applications to use. THP's automatic promotion and khugepaged daemon made the optimization accessible to all workloads.
Production Examples
30% performance improvement for in-memory databases: Aerospike (an in-memory NoSQL database) documented a 30% read throughput improvement with HugeTLB pages on their SSD-backed tier (where the working set is fully in RAM and TLB pressure dominates).
Redis latency improvement: Production Redis deployments at Twitter and Pinterest reported 20–40% reduction in 99th-percentile GET latency after enabling THP, primarily due to reduced TLB miss overhead on random key lookups.
THP latency spike issue: MongoDB and Redis both recommend transparent_hugepage=never or madvise in their deployment guides. The reason: THP's background compaction and collapse operations cause periodic latency spikes (pauses of 1–100ms) that violate SLA requirements for latency-sensitive services. MongoDB experienced this in production, where random spikes of 50–200ms correlated exactly with compact_stall events in /proc/vmstat.
Debugging Notes
# Check THP usage
grep -E "AnonHugePages|HugePages_Total|HugePages_Free" /proc/meminfo
# Per-process THP usage
grep -E "AnonHugePages|Huge" /proc/$(pidof redis-server)/smaps | head -20
# Monitor khugepaged activity
watch -n1 "grep -E 'thp|compact' /proc/vmstat"
# Check for compaction stalls (THP-induced latency spikes)
grep compact_stall /proc/vmstat
# HugeTLB page failures (pool exhausted)
grep HugePages_Surp /proc/meminfo # surplus > 0 = overcommit pool used
# THP enabled/disabled status
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag # controls compaction aggressiveness
# Limit direct compaction on fault to MADV_HUGEPAGE regions to reduce latency spikes (at cost of fewer promotions)
echo defer-madvise > /sys/kernel/mm/transparent_hugepage/defrag
# Force a specific process to use THP
# (via madvise in code, or hack via LD_PRELOAD intercepting malloc)
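Absolute counter values in /proc/vmstat are cumulative since boot, so the useful signal is the change between two snapshots. The sketch below computes that delta from two captured snapshots; the function names are illustrative, and the snapshots are passed as strings (for example, the file contents read before and after a workload).

```c
#include <stdio.h>
#include <string.h>

/* Look up a counter in /proc/vmstat-style text ("name value" per line).
 * Returns -1 if the counter is not present. */
long long vmstat_counter(const char *text, const char *name)
{
    size_t nlen = strlen(name);
    for (const char *line = text; line && *line; ) {
        if (strncmp(line, name, nlen) == 0 && line[nlen] == ' ') {
            long long v;
            return sscanf(line + nlen, "%lld", &v) == 1 ? v : -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;   /* step past the newline */
    }
    return -1;
}

/* Events that occurred between two snapshots, or -1 if either
 * snapshot lacks the counter. */
long long vmstat_delta(const char *before, const char *after,
                       const char *name)
{
    long long a = vmstat_counter(before, name);
    long long b = vmstat_counter(after, name);
    return (a < 0 || b < 0) ? -1 : b - a;
}
```

A rising compact_stall delta during a latency spike points at direct compaction, while thp_fault_alloc and thp_collapse_alloc show whether huge pages are being created at fault time or by khugepaged.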
Security Implications
THP and heap spray: Heap spraying attacks benefit from huge pages: a single huge page allocation gives an attacker a contiguous 2 MiB block, making heap spray layout more predictable. madvise(MADV_HUGEPAGE) on a heap region can thus assist exploitation.
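Huge page code-path bugs: Huge pages must be cleared before they are handed to a new owner, just like 4KB pages, and any path that skipped this would expose 2 MiB of stale data at once. Bugs in huge-page-specific code have had real security impact: a well-known example is "Huge Dirty COW" (CVE-2017-1000405), in which a flaw in THP handling let an unprivileged process write to read-only huge pages.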
ASLR weakened by huge pages: A 2 MiB-aligned mapping has 21 fixed low address bits instead of 12, so huge-page-backed regions lose 9 bits of ASLR entropy relative to 4 KiB-aligned mappings. With the common x86-64 default of 28 bits of mmap randomization, that leaves 19 bits.
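The entropy cost of alignment is a direct calculation: each doubling of the required alignment fixes one more low address bit. The sketch below counts those bits; the function name is illustrative, and both arguments are assumed to be powers of two.

```c
#include <assert.h>

/* Bits of placement entropy lost when a mapping must be aligned to
 * `align` bytes instead of `base_page` bytes (both powers of two). */
unsigned entropy_bits_lost(unsigned long align, unsigned long base_page)
{
    assert(base_page > 0 && align >= base_page);
    unsigned bits = 0;
    while (base_page < align) {
        base_page <<= 1;   /* one more fixed low bit per doubling */
        bits++;
    }
    return bits;
}
```

For 2 MiB alignment over a 4 KiB base page the result is 9 bits; for 1 GiB alignment it is 18 bits.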
Performance Implications
- Allocation cost: alloc_pages(order=9) for a 2 MiB page is more expensive than alloc_pages(order=0) because the buddy allocator must find 512 contiguous frames. Allocation latency can be 10–100x higher if compaction is needed.
- CoW with THP: Historically, a write fault on a CoW-shared THP (e.g., after fork) copied the entire 2 MiB page, not just 4 KiB, which can cause unexpected latency in fork+CoW workloads. Since Linux 5.8, the kernel splits the PMD mapping at CoW time and copies only the 4 KiB page that was written.
- mprotect on THP: mprotect() of part of a 2 MiB THP triggers split_huge_pmd(), which splits the huge mapping back into 512 × 4 KiB PTEs. Applications that frequently call mprotect on subranges (for example, runtimes that use page protection for guard pages or GC barriers) should disable THP for those VMAs with MADV_NOHUGEPAGE.
Failure Modes and Real Incidents
MongoDB THP pauses (2015–2020): MongoDB consistently recommended transparent_hugepage=never in their production deployment guide after numerous customer reports of multi-second latency spikes. Root cause: khugepaged triggering memory compaction during high-throughput periods. MongoDB Atlas (cloud) enforces this at the OS level.
Elasticsearch heap corruption investigation: A 2018 incident at a major cloud provider attributed Elasticsearch JVM crashes to THP collapse during GC, where khugepaged collapsed pages that the JVM GC concurrently modified. The race was in early THP code and was fixed in later kernels, but the incident highlighted THP's interaction with runtime-managed memory.
HugeTLB allocation failure after reboot: A production system pre-allocated 256 GB of HugeTLB pages at runtime. After a reboot, the runtime request could not be satisfied because memory had already fragmented by the time nr_hugepages was written (not enough contiguous 2 MiB blocks available), so the pool came up short and the database service failed to start. Fix: use hugepages=N on the kernel command line (allocates at boot before fragmentation) rather than writing nr_hugepages at runtime.
Modern Usage
- io_uring and huge pages: Applications using io_uring with registered buffers benefit from THP on those buffers, as the kernel maps them once and holds the page table reference.
- QEMU/KVM huge pages: Virtual machine RAM can be allocated as HugeTLB on the host, reducing host TLB pressure from guest memory accesses. Configured with -mem-path /mnt/hugetlbfs.
- DPDK: The Data Plane Development Kit uses HugeTLB by default for its memory pools (rte_malloc). Network packet buffers in hugepage memory avoid TLB misses in the fast path, enabling line-rate packet processing.
- GPU workloads: Host buffers staged for GPU transfers can be backed by huge pages to reduce host TLB pressure. Note that MAP_HUGETLB applies to anonymous and hugetlbfs memory, not to device BAR (Base Address Register) MMIO mappings, which drivers map through their own mechanisms.
- folio API (Linux 5.16+): The kernel is migrating from struct page to struct folio for compound pages, including THP. The folio API simplifies huge page handling throughout the MM subsystem.
Future Directions
- Variable-size THP: Multi-size THP (mTHP), merged for anonymous memory in Linux 6.8, adds THP sizes other than 2 MiB (e.g., 16 KiB, 32 KiB, 64 KiB). This allows THP benefits for workloads with smaller working sets.
- File-backed THP: THP originally covered only anonymous memory; tmpfs/shmem support followed in Linux 4.8. CONFIG_READ_ONLY_THP_FOR_FS (Linux 5.4+) enables THP for read-only file-backed mappings. Extending this to writable file mappings is a work in progress.
- Contiguous PTE (CONT-PTE) on ARM64: ARM64 hardware allows a set of 16 adjacent PTEs to be treated as a single TLB entry if they map contiguous physical memory. With 4 KiB granules this provides TLB compression at 64 KiB granularity, needing only 64 KiB alignment rather than the 2 MiB alignment of PMD huge pages.
- Larger page sizes: x86-64 5-level paging extends virtual addresses to 57 bits but does not define new page sizes; 1 GiB remains the largest. A mapping one level higher would cover 512 GiB, and any such size for extreme-scale computing remains speculative.
Exercises
- Write a C program that allocates 4 GB of memory and accesses it randomly. Measure latency with /sys/kernel/mm/transparent_hugepage/enabled set to never, madvise (with MADV_HUGEPAGE), and always. Use CLOCK_MONOTONIC with nanosecond resolution.
- Pre-allocate 100 HugeTLB pages, then write a program that allocates them with mmap(MAP_HUGETLB) until the pool is exhausted. Handle the ENOMEM case.
- Observe fragmentation: run a memory-intensive workload for an hour, then try to allocate 2 MiB huge pages. Monitor /proc/buddyinfo before and after. Force compaction and measure the time.
- Measure the CoW cost of a fork when the parent has a 1 GB THP-backed region vs the same region with 4 KB pages. Use getrusage() to count page faults.
- Instrument khugepaged wake-up rate and collapse rate by reading /proc/vmstat fields thp_fault_alloc, thp_collapse_alloc, thp_split_pmd before and after running a workload.
- Configure a PostgreSQL instance with huge_pages=on and huge_pages=off. Run a read-heavy pgbench workload (pgbench's built-in scripts are OLTP-style, so an OLAP test needs custom scripts). Compare buffer cache hit rates (blks_hit and blks_read in pg_stat_database) and query times.
References
- mm/hugetlb.c: HugeTLB pool management, alloc_huge_page(), free_huge_page()
- mm/huge_memory.c: THP fault handling, split_huge_pmd()
- mm/khugepaged.c: khugepaged collapse logic
- mm/compaction.c: memory compaction for THP allocation
- arch/x86/include/asm/pgtable_types.h: _PAGE_PSE (Page Size Extension) bit
- include/linux/huge_mm.h: THP API functions
- /sys/kernel/mm/transparent_hugepage/: runtime THP configuration
- /proc/sys/vm/nr_hugepages: HugeTLB pool size
- Linux man pages: mmap(2) (MAP_HUGETLB), madvise(2) (MADV_HUGEPAGE)
- LWN: "Transparent huge pages in 2.6.38", https://lwn.net/Articles/423584/
- LWN: "THP and database workloads", https://lwn.net/Articles/358904/
- MongoDB: "Disable Transparent Huge Pages", MongoDB documentation
- Andrea Arcangeli, "Transparent HugePages in 2.6.38" (KVM Forum 2010)