Skip to content

Buddy Allocator

Technical Overview

The Linux buddy allocator is the kernel's physical page allocator — it manages the allocation and freeing of physical memory frames, serving as the foundation upon which all other memory allocators (slab, vmalloc, user-space allocations) are built. It was first described by Knuth (1965) and independently by Knowlton (1965), and Linux's implementation in mm/page_alloc.c is a direct descendant of Knuth's original algorithm.

The core idea: memory is partitioned into power-of-2-sized blocks. When a block of order n is split, it produces two buddies of order n-1. When two buddies are both free, they are merged back into an order-n block. This creates a fast O(log n) allocator with natural support for large contiguous allocations (needed for DMA, huge pages, and kernel data structures).

The allocator manages memory in zones (ZONE_DMA, ZONE_NORMAL, ZONE_MOVABLE, etc.) and on NUMA nodes. Each zone has its own free_area[] array indexed by order (0 through MAX_ORDER-1, where MAX_ORDER=11 by default, giving orders 0–10 = 4KB–4MB).

Prerequisites

  • Physical memory concepts (physical frames, NUMA nodes)
  • Virtual memory and paging (01-virtual-memory.md, 02-paging.md)
  • struct page basics
  • NUMA architecture overview (11-numa-memory.md)

Core Content

Buddy System Algorithm

Buddy Allocator Data Structures
=================================

struct zone {
    /* one free_area per order: order 0 = 4KB, order 1 = 8KB, ..., order 10 = 4MB */
    struct free_area  free_area[MAX_ORDER];  /* MAX_ORDER = 11 */
    ...
};

struct free_area {
    struct list_head  free_list[MIGRATE_TYPES];  /* per-migrate-type free lists */
    unsigned long     nr_free;                    /* total free pages at this order */
};

/* MIGRATE_TYPES (anti-fragmentation):
   MIGRATE_UNMOVABLE   = 0  (kernel allocations, cannot be moved)
   MIGRATE_MOVABLE     = 1  (user pages, can be moved/migrated)
   MIGRATE_RECLAIMABLE = 2  (page cache, can be reclaimed)
   MIGRATE_PCPTYPES    = 3  (first 3 are per-CPU page lists)
   MIGRATE_HIGHATOMIC  = 4  (emergency reserve)
   MIGRATE_CMA         = 5  (contiguous memory allocator)
   MIGRATE_ISOLATE     = 6  (isolation for migration)
*/

Allocation of order-2 (16 KB) block:

Buddy Allocator: Splitting Example
=====================================

Initial state (free_area, simplified to orders 0-4):
  order-4: [Block A: 64KB] [Block B: 64KB]
  order-3: []
  order-2: []
  order-1: []
  order-0: []

Request: alloc_pages(order=2)  [16KB]

Step 1: No order-2 block available. Split order-4 Block A into two order-3 buddies:
  order-4: [Block B: 64KB]
  order-3: [A0: 32KB] [A1: 32KB]
  order-2: []

Step 2: No order-2 block. Split A0 into two order-2 buddies:
  order-4: [Block B: 64KB]
  order-3: [A1: 32KB]
  order-2: [A00: 16KB] [A01: 16KB]

Step 3: Allocate A00 (first order-2 block):
  order-4: [Block B: 64KB]
  order-3: [A1: 32KB]
  order-2: [A01: 16KB]    ← A00 returned to caller

Free A00:
Step 4: A00 freed. Is buddy A01 also free? YES → merge:
  order-3: [A1: 32KB] [A0-merged: 32KB]

Step 5: Is buddy A1 free? YES (it's in free_area[3]) → merge:
  order-4: [Block B] [Block A-restored: 64KB]

The buddy of a block at physical frame pfn with order k is at frame pfn XOR (1 << k). Merging is done by checking PageBuddy() flag on the buddy's struct page and verifying it has the same order.

struct page

Every physical frame has a corresponding struct page in the kernel's mem_map[] array. This is one of the most important and complex structures in the Linux kernel (include/linux/mm_types.h):

struct page {
    unsigned long flags;         /* atomic flags: PG_locked, PG_dirty, PG_lru, ... */

    /* page->_mapcount: -1 if not mapped, 0+ = number of ptes pointing here */
    atomic_t _mapcount;

    /* used by various subsystems: */
    union {
        struct {  /* Page cache / anonymous pages */
            struct address_space *mapping;
            pgoff_t index;           /* offset within mapping */
        };
        struct {  /* Compound pages (huge pages) */
            unsigned long compound_head;
        };
        struct {  /* Slab allocator */
            struct kmem_cache *slab_cache;
            void *freelist;
        };
        /* ... many more anonymous unions ... */
    };

    union {
        atomic_t _refcount;    /* reference count */
        unsigned int page_type;
    };

    /* NUMA node in high bits of flags (NODE_SHIFT) */
    /* Zone in lower bits of flags (ZONE_SHIFT) */
} __attribute__((packed));

sizeof(struct page) is typically 64 bytes. On a machine with 1 TiB of RAM: 1TiB / 4KB = 268 million pages × 64 bytes = 17 GB for mem_map alone. This is why huge pages also reduce struct page overhead.

Memory Zones

Linux Memory Zone Layout (x86-64)
===================================

Physical Address Space:
0x0000000000000000
  |
  ├─ ZONE_DMA      [0 – 16 MB]
  |    For legacy ISA DMA devices (24-bit address bus)
  |    GFP_DMA allocations land here
  |
  ├─ ZONE_DMA32    [16 MB – 4 GB]
  |    For devices with 32-bit DMA addresses (most PCI devices)
  |    GFP_DMA32 allocations
  |
  ├─ ZONE_NORMAL   [4 GB – (depends on architecture)]
  |    General kernel allocations (GFP_KERNEL)
  |    Directly mapped in kernel address space
  |
  └─ ZONE_MOVABLE  [configurable, usually end of RAM]
       Movable/reclaimable pages: used by CMA, memory hotplug
       Allows offline of memory ranges without kernel pinned pages

On modern x86-64 with all RAM below 4 GB: ZONE_DMA32 and ZONE_NORMAL overlap. ZONE_HIGHMEM (32-bit era) no longer exists on 64-bit kernels because all physical RAM can be directly mapped.

Zone fallback order: if ZONE_NORMAL is exhausted, try ZONE_DMA32, then ZONE_DMA. This is the zone fallback chain (zonelist in include/linux/mmzone.h).

alloc_pages() Path

alloc_pages(gfp_mask, order)
  │
  ├── mm/page_alloc.c: __alloc_pages()
  │     │
  │     ├── get_page_from_freelist()   [fast path]
  │     │     │
  │     │     ├── For each zone in zonelist:
  │     │     │     └── rmqueue(zone, order, migratetype)
  │     │     │           │
  │     │     │           ├── Try per-CPU page cache (pcp) for order-0
  │     │     │           │   (avoid zone lock for single-page allocs)
  │     │     │           │
  │     │     │           └── __rmqueue(zone, order, migratetype)
  │     │     │                 │
  │     │     │                 ├── __rmqueue_smallest(): take from free_area[order]
  │     │     │                 └── __rmqueue_fallback(): steal from different migratetype
  │     │
  │     └── __alloc_pages_slowpath()   [slow path, if fast path fails]
  │           │
  │           ├── Wake kswapd (async reclaim)
  │           ├── Try allocation again
  │           ├── Direct reclaim (synchronous)
  │           ├── Memory compaction
  │           └── OOM killer (last resort)

Per-CPU page cache (PCP): Each CPU has a small cache of order-0 pages (per_cpu_pages in struct zone). Allocation from PCP avoids the zone spinlock. When the PCP is empty, it is refilled from the buddy allocator in a batch (pcp->batch pages at a time). This is critical for performance: single-page allocations (the most common) are lock-free on the fast path.

GFP Flags

GFP (Get Free Pages) flags control allocation behavior:

Flag Value Meaning
GFP_KERNEL __GFP_RECLAIM \| __GFP_IO \| __GFP_FS Normal kernel alloc; may sleep, reclaim
GFP_ATOMIC __GFP_HIGH Interrupt context; must not sleep
GFP_NOWAIT 0 Do not wait; return NULL if unavailable
GFP_NOIO __GFP_RECLAIM May reclaim but not do block I/O
GFP_NOFS __GFP_RECLAIM \| __GFP_IO May do I/O but not filesystem ops
GFP_USER GFP_KERNEL \| __GFP_HARDWALL User-space allocation
GFP_DMA __GFP_DMA Must be in ZONE_DMA
GFP_DMA32 __GFP_DMA32 Must be in ZONE_DMA32
__GFP_ZERO bit Zero the page on allocation
__GFP_HIGHMEM bit Prefer ZONE_HIGHMEM
__GFP_NOWARN bit Suppress OOM warnings

Critical rule: code holding a spinlock must use GFP_ATOMIC. Using GFP_KERNEL in atomic context → deadlock (the allocator may sleep waiting for reclaim, but reclaim requires acquiring locks you hold).

/proc/buddyinfo Interpretation

cat /proc/buddyinfo
# Node 0, zone      DMA      1      0      0      0      2      1      1      0      1      1      3
# Node 0, zone    DMA32     39     30     27     20     13      8      4      2      1      1    258
# Node 0, zone   Normal   1024    512    256    128     64     32     16      8      4      2    128
#                         ^ord0  ^ord1  ^ord2  ^ord3  ^ord4  ^ord5  ^ord6  ^ord7  ^ord8  ^ord9  ^ord10
#                         4KB    8KB   16KB   32KB   64KB  128KB  256KB  512KB   1MB    2MB    4MB

A zero (or low) count in column 9 (order-9, 2 MB) in ZONE_Normal means THP allocation will fail and compaction is needed.

Memory Fragmentation and Anti-Fragmentation

The fundamental problem: over time, kernel allocations (struct page, page tables, slab caches) and user allocations (file pages, anonymous pages) interleave in physical memory. A single pinned kernel page in the middle of a 2 MiB region prevents it from being used as a huge page.

Solution: Migrate types. Pages are classified at allocation time:

Anti-Fragmentation via Migrate Types
======================================

Physical memory is logically partitioned:
  [UNMOVABLE pages] [RECLAIMABLE pages] [MOVABLE pages]

UNMOVABLE: kernel objects, page tables, driver DMA buffers
  → These will always stay where allocated; cannot be moved

RECLAIMABLE: page cache, slab caches (shrinker-reclaimable)
  → Can be freed (data is on disk), freeing the physical frame

MOVABLE: anonymous pages, user file pages with swap backing
  → Can be moved to a different frame (page migration)

When all free pages in ZONE_NORMAL MOVABLE list are contiguous,
THP allocation can succeed without compaction.

Page migration (mm/migrate.c): - migrate_pages() — copies page content to a new frame, updates PTEs - Used by: memory compaction, NUMA balancing, CMA, memory hotplug - Cannot migrate pages held by DMA (mapped in IOMMU without IOMMU_NOBUF)

Historical Context

Donald Knuth described the buddy system in "The Art of Computer Programming, Vol. 1" (1968), though the algorithm dates to Knowlton (1965). The Linux buddy allocator was present from very early (Linux 0.99), though the zone system and NUMA support were added in Linux 2.6. The per-CPU page cache (PCP) was introduced to reduce zone lock contention as SMP machines became common. The anti-fragmentation/migrate-types system was added by Mel Gorman in Linux 2.6.24 (2008) after years of research on memory fragmentation. The folio conversion (in progress since Linux 5.16) is the latest major refactor.

Production Examples

ZONE_DMA32 exhaustion: A server running many processes each using NVMe drives (which use GFP_DMA32 for DMA bounce buffers) can exhaust ZONE_DMA32 on machines with less than 4 GB of RAM below the 4 GB boundary. This manifests as GFP_DMA32 allocation failures in dmesg and driver errors.

Huge page pool depletion: A cloud provider running many VMs on a host configured huge pages for QEMU. A burst of new VMs requested HugeTLB pages simultaneously. The pool was exhausted, and the runtime allocation (nr_hugepages_mempolicy) failed because no contiguous order-9 blocks remained. Some VMs failed to start. Fix: reserve huge pages at boot.

Buddy allocator lock contention (pre-PCP): On early Linux 2.4 SMP systems, the global zone spinlock for alloc_pages(order=0) became a scalability bottleneck. The PCP cache (introduced in 2.6) eliminated this contention for single-page allocations.

Debugging Notes

# Buddy allocator state
cat /proc/buddyinfo

# Fragmentation index per zone (0=not fragmented, 1000=fully fragmented)
cat /proc/extfrag_index

# Unusual allocation failures (OOM or GFP_ATOMIC failures)
dmesg | grep -E "out of memory|page allocation failure|alloc_pages"

# Watch free pages at each order in real time
watch -n1 "awk '/Node/{printf \"%-10s \", \$4; for(i=5;i<=NF;i++) printf \"%-6s \", \$i; print \"\"}' /proc/buddyinfo"

# Count of free pages at each zone (in pages)
grep -A5 "MemFree\|MemTotal\|MemAvailable" /proc/meminfo

# Memory compaction stats
grep -E "compact|migration" /proc/vmstat

# Force compaction (dangerous on production)
echo 1 > /proc/sys/vm/compact_memory

# Allocation failure trace (kernel debug)
# echo 1 > /proc/sys/vm/panic_on_oom  # panic instead of OOM kill (for debugging)

Security Implications

Heap spray via buddy allocator: An attacker with the ability to allocate and free many pages can manipulate the buddy allocator's free lists to influence where future allocations land. This is used in "heap feng shui" techniques to position shellcode or controlled data adjacent to target kernel structures.

Use-after-free via buddy merge: If a kernel struct is freed but a reference remains (use-after-free), the buddy allocator may give that memory to another allocation. CONFIG_DEBUG_PAGEALLOC poisons freed pages (fills with 0xAA pattern) and marks PTEs non-present, catching use-after-free immediately rather than silently.

Information leak via uncleared pages: On allocation, pages are not zeroed unless __GFP_ZERO is specified. A freshly allocated page from the buddy allocator contains stale data from the previous occupant. Kernel code that allocates a page and copies it to user space without clearing can leak kernel data. CONFIG_INIT_ON_ALLOC_DEFAULT_ON (added in Linux 5.3) enables zeroing of all allocations by default (at ~2–5% CPU overhead).

Rowhammer via buddy allocator: Physical adjacency of pages is determined by the buddy allocator. An attacker who can allocate page pairs that are physically adjacent (from the same buddy block) can use DRAM rowhammer to flip bits in the adjacent page.

Performance Implications

  • Order-0 allocation via PCP: ~50 ns (lock-free, PCP hit)
  • Order-0 allocation from buddy freelist: ~200 ns (zone lock acquired)
  • Order-9 allocation (2 MB, no split needed): ~500 ns
  • Order-9 allocation requiring split: ~1 µs
  • Order-9 allocation requiring compaction: ~1–100 ms

For any allocator built on top (slab, vmalloc, user mmap), these are the baseline costs. The slab allocator's PCP caches amortize buddy allocations for frequently-used small object sizes.

NUMA locality: alloc_pages_node(nid, gfp, order) allocates from the specified NUMA node. The default policy (for anonymous pages) uses MPOL_DEFAULT = first-touch from the local node. Cross-NUMA buddy allocation for hot pages is a major performance issue; NUMA balancing (automatic migration) attempts to fix it after the fact.

Failure Modes and Real Incidents

GFP_ATOMIC failure in interrupt handler: A network driver calling alloc_pages(GFP_ATOMIC) fails when ZONE_NORMAL is depleted. The driver drops the packet. Under severe memory pressure, all packet reception fails, causing the host to appear network-dead while the CPU is actually running. Fix: ensure adequate free memory headroom (vm.min_free_kbytes).

Split huge page cascade: When a huge page must be split (e.g., due to mprotect on a subrange), split_huge_page() must allocate 512 struct page entries and install 512 PTEs. If these allocations fail (memory pressure), the split fails, and the kernel may have to take an alternate code path or OOM kill.

ZONE_DMA32 fragmentation on 32-bit PCI: A server with 64 GB of RAM but heavy NVMe usage fragments ZONE_DMA32 with bounce buffers. After hours of operation, no contiguous 4 MB DMA-addressable block is available. The NVMe driver falls back to single-page DMA, cutting throughput by 60%. Fix: pre-allocate contiguous DMA regions at driver init with CMA.

Modern Usage

  • CMA (Contiguous Memory Allocator): A reserved region of MIGRATE_CMA pages that are normally used as MOVABLE pages but can be reclaimed for large contiguous allocations (GPU drivers, camera ISP, DMA devices). Implemented in mm/cma.c.
  • Memory hotplug: Adding or removing DIMM modules at runtime. ZONE_MOVABLE ensures that memory in hotpluggable regions contains only movable pages that can be migrated before the DIMM is removed.
  • HMM (Heterogeneous Memory Management): Extends buddy allocator concepts to GPU and FPGA device memory, allowing unified virtual address spaces over heterogeneous memory. Used by NVIDIA's CUDA kernel module.
  • GFP_KERNEL_ACCOUNT: Allocates memory and charges it to the cgroup's memory counter. Every kernel allocation on behalf of a user process should use this to properly account memory to the responsible cgroup.

Future Directions

  • Large folio support: The folio API (replacing struct page compound pages) will allow the buddy allocator to directly manage arbitrary-order allocations as first-class objects.
  • Memory tiering: With PMEM (persistent memory) and CXL-attached memory, the buddy allocator will need to manage multiple memory tiers with different latency/bandwidth characteristics. Work in progress in the "multi-gen LRU" and "demotion" patches.
  • Memory safety via DRAM tagging: AMD's SME/SEV and Intel's TME work at the physical frame level, requiring buddy allocator changes to manage encryption key IDs per frame.

Exercises

  1. Write a kernel module that calls alloc_pages(GFP_KERNEL, 9) 1000 times, logging success/failure for each. Run it on a freshly booted system vs a fragmented system. Compare.
  2. Implement a user-space buddy allocator simulator in C. Given a 1 GB address space and a log of allocation/free requests, compute the fragmentation index over time.
  3. Trace buddy allocator activity with perf probe on __alloc_pages and __free_pages. Measure allocations per second for a memory-intensive workload.
  4. Study the effect of vm.min_free_kbytes on GFP_ATOMIC failure rate under memory pressure. Lower it, run a network stress test, observe packet drops.
  5. Use /proc/buddyinfo to write a script that alerts when order-9 (2MB) blocks in ZONE_Normal drop to zero — indicating THP will fail.
  6. Examine /proc/extfrag_index on a freshly booted vs 7-day-uptime machine. Write a tool to plot fragmentation trend over time using cron + syslog.

References

  • mm/page_alloc.c__alloc_pages(), __rmqueue(), free_pages_ok()
  • mm/compaction.c — memory compaction, compact_zone()
  • mm/migrate.c — page migration, migrate_pages()
  • include/linux/mmzone.hstruct zone, struct free_area, zone type definitions
  • include/linux/mm_types.hstruct page
  • include/linux/gfp.h — GFP flag definitions
  • mm/cma.c — Contiguous Memory Allocator
  • Mel Gorman, "Understanding the Linux Virtual Memory Manager" (2004), Chapter 6
  • Mel Gorman, "Anti-Fragmentation Patches" — LWN.net (2008)
  • Donald Knuth, "The Art of Computer Programming, Vol. 1", Section 2.5 (Dynamic storage allocation)
  • LWN: "Memory compaction for better huge page allocations" — https://lwn.net/Articles/368869/
  • LWN: "Anti-fragmentation" — https://lwn.net/Articles/101916/