mmap

Technical Overview

mmap(2) is one of the most versatile memory management system calls in Linux. It maps a file or an anonymous memory region into the process's virtual address space, creating a Virtual Memory Area (VMA) that the kernel manages. Unlike read(2)/write(2), which copy data between the kernel's page cache and user buffers, mmap gives user space direct access to pages in the page cache (for file-backed mappings) or to anonymous physical frames.

mmap is the foundation of:

  • Dynamic linker operation (loading shared libraries)
  • Memory-mapped databases (LMDB, SQLite, RocksDB's block cache)
  • IPC via shared memory
  • Copy-on-Write fork semantics
  • Huge page allocation (MAP_HUGETLB)
  • User-space zero-copy I/O
  • Anonymous heap allocation (glibc malloc uses mmap for large allocations)

The system call signature:

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

Prerequisites

  • Virtual memory and VMA structure (01-virtual-memory.md)
  • Page fault handling (02-paging.md)
  • Copy-on-write mechanics (08-copy-on-write.md)
  • File system page cache basics
  • NUMA concepts (11-numa-memory.md)

Core Content

mmap Variants

mmap Flag Combinations
=======================

1. Anonymous private (malloc equivalent):
   mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
   → Allocates zero-filled virtual memory backed by nothing
   → Pages faulted in from zero page on first access
   → Changes visible only to this process

2. Anonymous shared (SysV shmem alternative):
   mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS, -1, 0)
   → Backed by tmpfs (anonymous file)
   → After fork(): shared between parent and child
   → Used for fast IPC (e.g., memcached → worker communication)

3. File private (demand-paging executable):
   mmap(NULL, size, PROT_READ|PROT_EXEC, MAP_PRIVATE, fd, offset)
   → Pages from file's page cache, demand-faulted
   → Writes create CoW copies (not reflected in file)
   → How ld-linux.so loads shared libraries

4. File shared (memory-mapped I/O):
   mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)
   → Pages from file's page cache, shared with all mappers
   → Writes dirty the page cache → written back to file
   → How databases use mmap for their data files

5. HugeTLB anonymous:
   mmap(NULL, 2<<20, PROT_READ|PROT_WRITE,
        MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB|MAP_HUGE_2MB, -1, 0)
   → Allocates 2MB huge page
   → Fails if HugeTLB pool empty

6. MAP_FIXED placement:
   mmap((void*)0x7f0000000000, size, PROT_READ|PROT_WRITE,
        MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
   → Places at exactly the given address
   → Silently unmaps any existing VMA at that address (DANGEROUS)
   → Use MAP_FIXED_NOREPLACE (Linux 4.17+) to fail instead of clobbering
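
The difference between variants 1 and 2 can be observed across fork(). The following is a minimal sketch, not taken from any particular project: the helper name child_write_visible_to_parent is hypothetical, and it assumes a Linux system where fork() is permitted. The child stores 42 into a one-int anonymous mapping and the parent reports what it sees afterwards.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: maps one int anonymously with the given sharing flag
 * (MAP_SHARED or MAP_PRIVATE), lets a forked child store 42 into it, and
 * returns the value the parent observes afterwards (-1 on error). */
int child_write_visible_to_parent(int flags)
{
    assert(flags == MAP_SHARED || flags == MAP_PRIVATE);

    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     flags | MAP_ANONYMOUS, -1, 0);
    if (cell == MAP_FAILED)
        return -1;
    *cell = 1;                      /* value written by the parent before fork */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {                 /* child: store and exit immediately */
        *cell = 42;
        _exit(0);
    }

    int status;
    waitpid(pid, &status, 0);       /* child has finished its store by now */
    int seen = *cell;
    munmap(cell, sizeof *cell);
    return seen;
}
```

With MAP_SHARED the parent should observe 42. With MAP_PRIVATE the child's store lands in its own copy-on-write page, so the parent still sees 1.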

Key flags and their semantics:

Flag                  VMA effect
MAP_SHARED            writes go to page cache / visible to other processes
MAP_PRIVATE           writes create CoW private copy
MAP_ANONYMOUS         not file-backed; fd is ignored on Linux (pass -1 for portability)
MAP_FIXED             place at exact address; clobbers existing
MAP_FIXED_NOREPLACE   place at exact address; fail if occupied
MAP_POPULATE          pre-fault all pages (eager allocation)
MAP_LOCKED            lock pages in RAM (like mlock, but a failed populate is not reported)
MAP_NORESERVE         skip swap space reservation
MAP_GROWSDOWN         stack-like: VMA grows down
MAP_HUGETLB           use HugeTLB pages
MAP_HUGE_2MB          request 2MB huge pages specifically
MAP_HUGE_1GB          request 1GB huge pages
MAP_SYNC              DAX: writes synchronously to persistent memory (requires MAP_SHARED_VALIDATE)
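
The MAP_SHARED and MAP_PRIVATE rows can be checked against a real file. This is a small sketch under stated assumptions: /tmp is writable, the scratch file is one 4096-byte block, and store_then_read_back is a hypothetical helper name. It stores one byte through the mapping and then reads the file back with pread().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: fills a scratch file with 'a', maps it with the given
 * sharing flag, stores 'X' at offset 0 through the mapping, and returns the
 * byte that pread() then finds at offset 0 of the file (0 on error). */
char store_then_read_back(int flags)
{
    assert(flags == MAP_SHARED || flags == MAP_PRIVATE);

    char path[] = "/tmp/mmap-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return 0;
    unlink(path);                   /* file disappears once fd is closed */

    char block[4096];
    memset(block, 'a', sizeof block);
    char result = 0;
    if (write(fd, block, sizeof block) == (ssize_t)sizeof block) {
        char *map = mmap(NULL, sizeof block, PROT_READ | PROT_WRITE,
                         flags, fd, 0);
        if (map != MAP_FAILED) {
            map[0] = 'X';           /* store through the mapping */
            if (pread(fd, &result, 1, 0) != 1)
                result = 0;
            munmap(map, sizeof block);
        }
    }
    close(fd);
    return result;
}
```

A shared mapping should return 'X' even without msync(), because the store modified the page cache page itself. A private mapping should return 'a', because the store went to a CoW copy.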

VMA Creation: do_mmap() Internals

mmap(addr, len, prot, flags, fd, offset)
  │
  └── SYSCALL_DEFINE6(mmap) [arch/x86/kernel/sys_x86_64.c]
        │
        └── ksys_mmap_pgoff() [mm/mmap.c]
              │
              ├── if not MAP_ANONYMOUS: fget(fd) → get file reference
              │
              └── vm_mmap_pgoff() [mm/util.c]
                    │
                    ├── security_mmap_file() [LSM hook: SELinux, AppArmor check]
                    ├── mmap_write_lock_killable(mm) [acquire mm->mmap_lock for write]
                    │
                    ├── do_mmap():
                    │     │
                    │     ├── get_unmapped_area(file, addr, len, pgoff, flags)
                    │     │     → finds a suitable gap in VA space
                    │     │     → uses arch_get_unmapped_area_topdown() on x86
                    │     │        (searches from top of mmap region downward)
                    │     │
                    │     ├── if MAP_FIXED_NOREPLACE: find_vma_intersection()
                    │     │     → fail with EEXIST on overlap
                    │     │
                    │     ├── Validate: alignment, length, offset, prot bits
                    │     │
                    │     └── mmap_region():
                    │           │
                    │           ├── Unmap any existing VMAs in the range (MAP_FIXED case)
                    │           ├── accountable_mapping(): check resource limits
                    │           ├── vm_area_alloc(): allocate struct vm_area_struct
                    │           ├── Set vma->vm_start, vm_end, vm_flags, vm_pgoff
                    │           ├── if file: vma->vm_file = file
                    │           │          call_mmap(): file->f_op->mmap(file, vma)
                    │           │          → installs vm_ops (e.g., generic_file_vm_ops)
                    │           ├── Insert VMA into the mm's tree
                    │           │     (red-black tree + list before Linux 6.1, maple tree since)
                    │           └── Return VMA start address
                    │
                    ├── mmap_write_unlock(mm)
                    └── if MAP_POPULATE: mm_populate() → fault in all pages

(Function names as of roughly Linux 5.8 through 6.x; individual helpers move between releases.)
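
The VMA that results from this path is visible from user space as one line of /proc/self/maps. The sketch below assumes a Linux /proc filesystem; find_vma here is a hypothetical user-space helper and is unrelated to the kernel function of the same name. It looks up an address and reports the permission string of the VMA covering it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical user-space helper: scans /proc/self/maps for the VMA that
 * covers addr. On success copies its permission field (for example "rw-p")
 * into perms and returns 1. Returns 0 if no VMA covers addr and -1 if the
 * file cannot be opened. */
int find_vma(const void *addr, char perms[5])
{
    assert(perms != NULL);

    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    uintptr_t a = (uintptr_t)addr;
    char line[512];
    int found = 0;
    while (fgets(line, sizeof line, f)) {
        unsigned long start, end;
        char p[5];
        /* Each line starts with "start-end perms ..." in hexadecimal. */
        if (sscanf(line, "%lx-%lx %4s", &start, &end, p) == 3 &&
            a >= start && a < end) {
            memcpy(perms, p, sizeof p);
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling it on the address returned by a PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS mmap should yield "r--p": readable, not writable, not executable, private.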

File-backed mmap and Page Cache Integration

When a file-backed VMA page is faulted in, the kernel calls the VMA's vm_ops->fault() handler:

Page fault on file-backed VMA:
  do_fault() [mm/memory.c]
    │
    └── vma->vm_ops->fault(vmf)  [e.g., filemap_fault()]
          │
          ├── find_get_page(): Is page already in page cache?
          │     YES: pin it, return it (minor fault)
          │     NO:  allocate page, call file->f_mapping->a_ops->readpage()
          │          (initiates I/O, waits for completion = major fault)
          │
          └── Install PTE pointing to page cache page
              MAP_PRIVATE: PTE is read-only (CoW protected)
              MAP_SHARED:  PTE is writable (if prot allows)

This is the key insight: for MAP_SHARED file mappings, the user process and the kernel page cache share the same physical pages. A store through an mmap'd file region directly dirties the page cache page, which is later written back to disk by the kernel's writeback threads, or immediately on request via msync().
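
Whether a page is currently resident can be queried from user space with mincore(2). The following is a minimal sketch; resident_pages is a hypothetical helper name. It counts the pages of a mapped range that the kernel reports as in memory. For a file mapping this reflects page cache presence.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns how many pages in [addr, addr + len) are
 * reported resident by mincore(), or -1 on error. addr must be page-aligned
 * and the whole range must be mapped. */
long resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    if (!vec)
        return -1;

    long count = -1;
    if (mincore(addr, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;    /* bit 0 set means the page is resident */
    }
    free(vec);
    return count;
}
```

After every page of an eight-page anonymous mapping has been touched, the function should return 8. On a cold file mapping it can be used before and after a read to see which pages a fault (plus readahead) brought in.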

mmap vs read() Performance Comparison

File Access Method Comparison
===============================

Method 1: read(fd, buf, size)
  User calls read()
    │ copy_to_user()
    ├── kernel reads into page cache
    ├── copies page cache → user buffer (1 copy)
    └── returns

Method 2: mmap(file, MAP_SHARED) + pointer access
  User faults into mmap'd region
    │ no copy — direct access to page cache
    ├── page fault installs PTE → page cache page
    └── subsequent accesses: ordinary loads/stores, no syscall and no copy

Advantages of mmap:
  + Zero copy (for read path): no user↔kernel buffer copy
  + Can use huge pages for TLB reduction
  + Random access to resident pages costs no system call per access
  + OS handles readahead via page cache
  + Works naturally with CoW (MAP_PRIVATE)

Advantages of read():
  + Predictable buffer sizing (no VA space consumption)
  + Works well for sequential streaming (read-ahead via fadvise)
  + No page table overhead
  + No TLB pressure from page table entries
  + Works over network filesystems (NFS) and FUSE without special handling
  + No SIGBUS risk: a file truncated while mapped raises SIGBUS on access, whereas read() just returns a short count

When mmap WINS:
  - Large file, random access, hot working set (databases, key-value stores)
  - Multiple processes accessing same file (page cache shared)
  - Need in-place file updates flushed explicitly with msync() (a flush, not an atomic commit)

When read() WINS:
  - Sequential streaming access (VFS readahead works better)
  - Small files accessed once
  - Files on FUSE/NFS (mmap has higher implementation overhead)
  - The file changes size during access (mmap → SIGBUS on truncation)
  - Memory is scarce (mmap holds pages; read() can free buffer after use)
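
The two access methods can be put side by side in a few lines. This sketch assumes a regular file on a descriptor open for reading; sum_with_read and sum_with_mmap are hypothetical helper names. Both compute the same byte sum, one through a user buffer filled by pread() and one by walking the mapped pages directly.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Method 1: pread() copies each chunk from the page cache into buf. */
int64_t sum_with_read(int fd)
{
    unsigned char buf[65536];
    int64_t sum = 0;
    off_t off = 0;
    ssize_t n;
    while ((n = pread(fd, buf, sizeof buf, off)) > 0) {
        for (ssize_t i = 0; i < n; i++)
            sum += buf[i];
        off += n;
    }
    return n < 0 ? -1 : sum;
}

/* Method 2: map the file and read the page cache pages in place. */
int64_t sum_with_mmap(int fd)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return -1;
    if (st.st_size == 0)
        return 0;                   /* a zero-length mmap fails with EINVAL */

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return -1;
    madvise(p, len, MADV_SEQUENTIAL);   /* hint only; failure is harmless */

    int64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];
    munmap(p, len);
    return sum;
}
```

Timing each function over a large file, once with a cold cache and once warm, is a simple way to see where each method wins on a given machine.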

mmap for IPC

Fastest IPC between two processes (both running on the same machine):

/* Needs <stdatomic.h>, <sys/mman.h>, <sys/syscall.h>, <linux/futex.h>, <unistd.h>;
   define _GNU_SOURCE for memfd_create(). Error checks omitted for brevity. */

/* Process A: create shared mapping */
int fd = memfd_create("shared_region", MFD_CLOEXEC);
ftruncate(fd, size);
void *shm = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

/* Send fd to Process B via SCM_RIGHTS (Unix socket).
   send_fd_via_socket() is an application helper, not a libc call. */
send_fd_via_socket(sock, fd);

/* Process B: mmap the received fd */
void *shm2 = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd_received, 0);

/* Now both processes share the same physical pages.
   A store from A becomes visible to B without any system call;
   ordering between the two still requires atomics or barriers. */

/* Synchronize with a futex word in the shared region.
   glibc provides no futex() wrapper, so go through syscall(2). */
atomic_store_explicit((atomic_int*)shm, value, memory_order_release);
syscall(SYS_futex, (int*)shm, FUTEX_WAKE, 1, NULL, NULL, 0);
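
Passing the descriptor is the one step in this setup that needs ancillary data. The following sketch shows one way to implement it with sendmsg()/recvmsg() and SCM_RIGHTS, following the pattern in cmsg(3). The names send_fd and recv_fd are hypothetical, and the socket is assumed to be a connected AF_UNIX socket.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one file descriptor over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
    char byte = 'F';                /* ancillary data must ride on real data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {                         /* union guarantees cmsghdr alignment */
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    memset(&ctl, 0, sizeof ctl);

    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctl.buf,
                          .msg_controllen = sizeof ctl.buf };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_RIGHTS;
    cm->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cm), &fd, sizeof fd);

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor. Returns the new descriptor, or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    memset(&ctl, 0, sizeof ctl);

    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctl.buf,
                          .msg_controllen = sizeof ctl.buf };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (!cm || cm->cmsg_level != SOL_SOCKET || cm->cmsg_type != SCM_RIGHTS ||
        cm->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cm), sizeof fd);
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file, so mapping it with MAP_SHARED reaches the same pages the sender mapped.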

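Throughput: mmap-based IPC moves data at cache or memory bandwidth because no per-message system call or kernel copy is involved. Figures of 10–50 GB/s between two cores on the same CPU (limited by L3 cache bandwidth) are typical, compared to roughly 3 GB/s for a pipe or 2 GB/s for a Unix socket. Treat these as order-of-magnitude numbers and measure on the target hardware.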

mmap for Databases

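LMDB (Lightning Memory-Mapped Database): LMDB's entire data model is built on mmap. The database file is memory-mapped read-only for all readers. One writer at a time, serialized by a mutex, builds modified pages copy-on-write. In the default mode those pages are written with ordinary file writes and flushed with fsync()/fdatasync(); with MDB_WRITEMAP the map itself is writable and is flushed with msync(). Readers access data at pointer speed, with zero copies. Database size is limited by the configured map size and therefore by available virtual address space (not RAM).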

/* LMDB internals (simplified) */
env->me_map = mmap(NULL, env->me_mapsize, PROT_READ,
                   MAP_SHARED, env->me_fd, 0);
/* All reads: just pointer dereference into me_map */
/* Writes (default mode): CoW pages written with file writes, then fdatasync();
   with MDB_WRITEMAP: stores go through a writable map, then msync() */

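SQLite memory-mapped I/O: SQLite can optionally access the database file through mmap, configured via PRAGMA mmap_size (a size of 0 disables it). With it enabled, reads are served straight from the kernel's page cache instead of being copied into SQLite's own page cache. Separately, WAL (Write-Ahead Logging) mode uses a memory-mapped shared-memory file (the -shm file) for the WAL index.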

RocksDB: Uses mmap for SST (Sorted String Table) file reads via PosixMmapReadableFile. Controlled by allow_mmap_reads option. Each SST file is memory-mapped; block cache lookups become page cache lookups.

mmap and NUMA

By default, anonymous mmap pages are allocated on the NUMA node local to the faulting CPU (first-touch policy). File-backed pages are allocated on the node local to whichever CPU first brings them into the page cache, so a process that maps the file later may find them on a remote node.

# Bind all memory allocated by a process to NUMA node 0
numactl --membind=0 ./database_server

# Or call mbind() from C for a fine-grained NUMA policy per mapping
# (declared in <numaif.h>, link with -lnuma)
mbind(addr, len, MPOL_BIND, &nodemask, maxnode, 0);

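For databases that use mmap, NUMA placement matters critically. If the database server runs on CPU 0 (NUMA node 0) but the mmap'd file pages are on NUMA node 1 (remote), every cache-missing access pays the remote-memory latency penalty, which can be a multiple of local latency depending on the interconnect and topology.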

mmap Security: MAP_FIXED Risks and ASLR

MAP_FIXED Dangers:
  mmap((void*)target_addr, page_size, ..., MAP_FIXED, ...)

  If target_addr is already mapped:
    → The existing mapping is SILENTLY UNMAPPED and replaced
    → If target_addr was a shared library's .got section:
       → Silent corruption, no error, possible code execution

  Safer alternative (Linux 4.17+):
  mmap((void*)target_addr, page_size, ..., MAP_FIXED_NOREPLACE, ...)
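    → Fails with MAP_FAILED and errno EEXIST if target_addr is occupied
    → Older kernels ignore the flag and treat addr as a hint, so compare
      the returned address with target_addr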

ASLR interaction:
  mmap(NULL, ...) → address chosen by kernel with ASLR entropy
  mmap(hint, ...) → hint is ADVISORY; kernel may use nearby address
  mmap(hint, ..., MAP_FIXED) → exact address, bypasses ASLR for that region

ASLR entropy for mmap region (x86-64):
  kernel.randomize_va_space=2 with vm.mmap_rnd_bits=28 (the x86-64 default):
  28 bits of entropy (2^28 = 256M possible offsets)
  An attacker who can spray enough mappings can reduce effective entropy.
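
The safer placement flag can be wrapped so that both failure modes (EEXIST on a current kernel, a silently relocated mapping on an old one) look the same to the caller. This is a minimal sketch; map_exactly_at is a hypothetical helper name, and the fallback #define assumes the generic Linux UAPI value of the flag for libc headers that predate it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000   /* assumption: generic Linux UAPI value */
#endif

/* Hypothetical helper: creates an anonymous private mapping at exactly addr
 * without replacing anything already there. Returns addr on success, or NULL
 * with errno set (EEXIST if the range is occupied). */
void *map_exactly_at(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    void *p = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;                /* errno already set by mmap */

    if (p != addr) {                /* kernel without the flag used addr as a hint */
        munmap(p, len);
        errno = EEXIST;
        return NULL;
    }
    return p;
}
```

Asking for an address that is already mapped should fail with EEXIST and leave the existing mapping untouched. After that mapping is removed with munmap(), the same request should succeed.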

Additional security considerations:

  • /proc/PID/maps exposure: ASLR is undermined if the attacker can read /proc/victim/maps. This is why ptrace_may_access() checks are enforced.
  • mmap 0 (null page): vm.mmap_min_addr (commonly 65536; the build-time default comes from CONFIG_DEFAULT_MMAP_MIN_ADDR) prevents mapping the first 64 KB, protecting against NULL pointer dereference exploits.
  • Executable mmap: mmap(..., PROT_EXEC, ...) of writable memory violates W^X and is restricted by the SELinux execmem permission.

Historical Context

The mmap system call was introduced in SunOS 4.0 (1988) as a way to expose the virtual memory system to user space. It was standardized in POSIX.1b (1993). Linux added mmap support very early (Linux 0.99). The MAP_ANONYMOUS flag (for non-file-backed mmaps) was added to handle the case of large heap allocations without requiring a backing file. memfd_create (Linux 3.17, 2014) replaced the old trick of creating a file in /tmp and immediately unlinking it just to get a file descriptor for anonymous shared memory.

Production Examples

Nginx sendfile vs mmap: Nginx can serve static files with either sendfile(2) (zero-copy kernel path) or read()+write(). For TLS connections without kernel TLS offload, sendfile doesn't help (data must pass through user space for encryption). A server in that position can mmap the file and hand the mapping straight to the encryption routine, which avoids the extra copy that read() into an intermediate buffer would make.

Elasticsearch index files: Elasticsearch (Lucene-based) memory-maps its index segment files (.cfs, .dvd, .dim files). Lucene's MMapDirectory maps each file and accesses it via ByteBuffer. The OS page cache serves as the "off-heap" buffer pool, which is why Elasticsearch deployments leave a large share of RAM to the OS instead of giving it all to the JVM heap.

PostgreSQL shared memory: PostgreSQL does not mmap its table files; it reads them into its own shared buffer pool with read()-style calls. It does use mmap for its main shared memory segment (an anonymous MAP_SHARED mapping since 9.3) and for dynamic shared memory. The effective_cache_size parameter allocates nothing: it tells the query planner how much memory, OS page cache included, is likely to be available for caching data.

Debugging Notes

# View all mmaps for a process
cat /proc/$(pidof myapp)/maps

# Detailed stats per VMA (PSS, swap, anon, etc.)
cat /proc/$(pidof myapp)/smaps

# Total mmap'd file size vs RAM usage
grep -E "^(VmRSS|VmSize|RssAnon|RssFile|RssShmem)" /proc/$(pidof myapp)/status

# Check if mmap is creating too many VMAs (hit max_map_count)
wc -l /proc/$(pidof myapp)/maps
cat /proc/sys/vm/max_map_count

# Trace mmap/munmap calls
strace -e trace=mmap,munmap,mprotect -p $(pidof myapp)

# msync calls (file-backed mmap durability)
strace -e trace=msync -p $(pidof myapp)

# mmap SIGBUS debugging (truncated file):
# Register SIGBUS handler, print the si_addr to identify the faulting address
# Compare with /proc/PID/maps to identify the file and offset

# For LMDB: check mmap size limit
# MDB_ENVINFO.me_mapsize = current mmap size
# If database grows beyond mapsize → MDB_MAP_FULL error

# Check page cache usage of mmap'd files
# vmtouch tool: shows which pages of a file are in page cache
vmtouch -v /path/to/database.db
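
The max_map_count check can also be done from inside the process. A minimal sketch, assuming a Linux /proc filesystem; count_vmas and max_map_count are hypothetical helper names. The line count of /proc/self/maps is a close approximation of the VMA count the kernel compares against the limit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>

/* Counts the lines of /proc/self/maps (one per VMA). Returns -1 on error. */
long count_vmas(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;
    long lines = 0;
    int c;
    while ((c = fgetc(f)) != EOF)
        if (c == '\n')
            lines++;
    fclose(f);
    return lines;
}

/* Reads vm.max_map_count. Returns -1 if the sysctl file is unreadable. */
long max_map_count(void)
{
    FILE *f = fopen("/proc/sys/vm/max_map_count", "r");
    if (!f)
        return -1;
    long limit = -1;
    if (fscanf(f, "%ld", &limit) != 1)
        limit = -1;
    fclose(f);
    return limit;
}
```

A service that maps many files could log count_vmas() against max_map_count() periodically and warn well before the ratio approaches 1.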

Security Implications

JIT spraying via mmap: mmap with PROT_READ|PROT_WRITE|PROT_EXEC and MAP_ANONYMOUS creates writable-and-executable memory, which JIT compilers rely on and JIT-spraying attacks abuse. Hardened systems enforce a W^X policy:

  • SELinux execmem boolean controls whether a process can have both W and X permissions
  • PROT_EXEC after PROT_WRITE is blocked by PaX/grsecurity MPROTECT restrictions

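mremap address leak: mremap() with MREMAP_FIXED can be used to probe whether a specific virtual address range is mapped in the calling process (the call fails with EFAULT if the old range is not mapped). An attacker who can issue system calls but cannot read memory or /proc/self/maps can use this to locate mappings, which makes it an ASLR bypass technique.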

MAP_SHARED + mprotect: If process A maps a file MAP_SHARED|PROT_READ|PROT_WRITE, and process B maps the same file MAP_SHARED|PROT_READ, process A's writes are visible to B. If A can write a shared library (that B has loaded), A can control B's execution. This is the basis of shared-library injection attacks.

Huge page SIGBUS on OOM: HugeTLB pages are normally reserved at mmap time. If a MAP_HUGETLB mapping is established without a reservation (for example with MAP_NORESERVE) and the system runs out of huge pages before all pages are faulted in, the faulting access receives SIGBUS. Applications must handle this.

Performance Implications

  • mmap + page cache coherency: For database workloads where the working set fits in RAM, mmap eliminates data copies between kernel and user space on the read path. Reported improvement over read()-based I/O: 20–50% throughput increase for random-read workloads (LMDB vs BerkeleyDB-btree benchmark); the gain depends on hardware and access pattern.
  • mmap and GC overhead: JVMs using mmap'd off-heap memory (Lucene, Direct ByteBuffer) bypass GC for the mapped data. This eliminates GC pressure but requires careful lifecycle management.
  • msync overhead: msync(MS_SYNC) forces all dirty pages in the range to disk synchronously. For a write-heavy database using mmap, msync at transaction commit time adds I/O latency. Alternative: use MAP_SYNC with DAX (direct access, bypasses page cache) for NVDIMM.
  • mmap vs fallocate: Pre-allocating a file with fallocate() and then mmap-ing it avoids SIGBUS when a store into a sparse region cannot allocate blocks (for example on a full disk), and also reduces fragmentation on many filesystems.
  • mmap unmap cost: munmap(addr, size) must flush TLBs for all CPUs with the mapping. For large mappings, this can be expensive (TLB shootdown cost scales with CPU count).
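
The preallocate-then-map pattern and the msync(MS_SYNC) flush from the list above combine naturally. The following sketch assumes a descriptor open for reading and writing on a filesystem with enough free space; write_durably is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: reserves len bytes for fd, maps the file shared,
 * copies msg to offset 0 and flushes synchronously. Returns 0 on success
 * or an errno value on failure. */
int write_durably(int fd, size_t len, const char *msg)
{
    assert(len > 0 && msg != NULL);

    /* posix_fallocate returns the error number directly (errno is not set).
     * Blocks are allocated now, so later stores cannot hit ENOSPC. */
    int err = posix_fallocate(fd, 0, (off_t)len);
    if (err)
        return err;

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return errno;

    size_t n = strlen(msg);
    if (n > len)
        n = len;
    memcpy(map, msg, n);            /* dirties the page cache page */

    /* MS_SYNC blocks until the dirty pages in the range reach storage. */
    int rc = msync(map, len, MS_SYNC) == 0 ? 0 : errno;
    munmap(map, len);
    return rc;
}
```

A database would typically call something like this once per commit on just the dirty range, since the cost of MS_SYNC grows with the number of dirty pages it has to write.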

Failure Modes and Real Incidents

SIGBUS on mmap truncation: If a file is truncated (by another process or by the mapping process itself) while it is mapped, any access to pages that now lie past end-of-file raises SIGBUS. LMDB documented this as a known issue: "If you have a reader transaction open when you truncate a database, you will get SIGBUS or a page fault." Production impact: crash of the database process.
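
The truncation failure is easy to reproduce in isolation. This is a sketch under stated assumptions: /tmp is writable, and recovering with siglongjmp() is acceptable for a demonstration (a production handler would need far more care). The name read_after_truncate is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf bus_jump;

static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(bus_jump, 1);        /* unwind out of the faulting load */
}

/* Maps a one-page scratch file, truncates it to zero, then reads the mapping.
 * Returns 1 if the read raised SIGBUS, 0 if it did not, -1 on setup error. */
int read_after_truncate(void)
{
    char path[] = "/tmp/mmap-bus-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }
    volatile char *map = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old);

    int got_sigbus;
    if (sigsetjmp(bus_jump, 1) == 0) {
        (void)map[0];               /* fine: the page is inside the file */
        if (ftruncate(fd, 0) != 0)
            got_sigbus = -1;
        else {
            (void)map[0];           /* now past EOF: expected to raise SIGBUS */
            got_sigbus = 0;
        }
    } else {
        got_sigbus = 1;             /* arrived here from the signal handler */
    }

    sigaction(SIGBUS, &old, NULL);
    munmap((void *)map, (size_t)page);
    close(fd);
    return got_sigbus;
}
```

On Linux the function is expected to return 1. The same load that succeeded before the ftruncate() faults afterwards, because the kernel removed the pages beyond the new end of file from the mapping.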

mmap OOM in Elasticsearch: A production Elasticsearch cluster with 50k shards (each shard has ~10 index files, each file mmap'd) exceeded the default vm.max_map_count=65530. New shard creation failed with "Cannot allocate memory" (mmap fails with errno ENOMEM). Fix: sysctl -w vm.max_map_count=262144, which is now a standard Elasticsearch deployment prerequisite.

PostgreSQL mmap and NFS: Using PostgreSQL data directory on NFS with mmap-based operations caused silent data corruption when the NFS server rebooted. NFS-backed mmap has subtle consistency issues (no coherent page cache between NFS client and server). PostgreSQL documentation explicitly warns against NFS for data directories.

LMDB and Windows WSL: On Windows Subsystem for Linux 1 (not WSL2), mmap semantics were subtly different (no MAP_SYNC, different SIGBUS behavior). LMDB was not officially supported on WSL1. WSL2 (full Linux kernel in a VM) fixed this.

Modern Usage

  • io_uring registered buffers: Similar to mmap but optimized for I/O: register fixed buffers with the kernel, enabling zero-copy I/O without repeated get_user_pages() calls per operation.
  • DAX (Direct Access mode): For NVDIMM (persistent memory), mmap(..., MAP_SYNC) on a DAX-enabled filesystem creates a persistent memory mapping. Writes go directly to the NVDIMM device without going through the page cache. Used by PMDK (Persistent Memory Development Kit).
  • memfd_secret (Linux 5.14): Creates a "secret" memfd whose pages are removed from the kernel direct map. Only the owning process can access them. Useful for HSM key material, cryptographic secrets.
  • Sealed memfd (F_SEAL_WRITE): A memfd created with MFD_ALLOW_SEALING can have seals added with fcntl(F_ADD_SEALS); once F_SEAL_WRITE is set, the file can no longer be written to. Used by Chrome to share read-only data with sandboxed renderer processes without risk of the renderer modifying shared data.
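
Sealing takes only a few calls. The sketch below assumes glibc 2.27 or newer (for the memfd_create wrapper) and a kernel with memfd sealing support; make_sealed_memfd is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: creates a memfd holding a copy of data, then seals it
 * against writes, resizing and further seal changes. Returns the descriptor,
 * or -1 on failure. */
int make_sealed_memfd(const void *data, size_t len)
{
    assert(data != NULL && len > 0);

    /* Sealing must be requested at creation time. */
    int fd = memfd_create("sealed-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) != (ssize_t)len ||
        fcntl(fd, F_ADD_SEALS,
              F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_SEAL) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

After sealing, write() on the descriptor and a writable MAP_SHARED mapping should both fail with EPERM, while a read-only MAP_PRIVATE mapping still works. Read-only MAP_SHARED mappings of a write-sealed memfd are refused on older kernels, so a private mapping is the portable choice for the consumer.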

Future Directions

  • mmap for CXL (Compute Express Link): CXL 2.0 memory expanders appear as NUMA nodes. mmap with mbind() will be used to place mappings on CXL memory, enabling tiered memory systems.
  • Huge page mmap for PMEM: PMDK is working on mmap with MAP_HUGETLB for NVDIMM, which would use 2MB or 1GB physical extents on the persistent memory device.
  • Anonymous huge pages via mmap hint: A MAP_THPHINT flag (proposed) would hint to the kernel that this specific mmap region should be a THP candidate, without the overhead of khugepaged scanning.

Exercises

  1. Write a program that opens a 1 GB file and accesses it with read() in a loop vs mmap() in a loop, measuring throughput for both sequential and random access patterns.
  2. Create two processes that communicate through a MAP_SHARED mapping of a memfd_create() file (passed via a file descriptor sent over a Unix socket). Measure throughput vs a pipe and vs shared memory via shm_open.
  3. Implement a simple memory-mapped key-value store: store 1 million key-value pairs in a mmap'd file, with a hash table in the mapped region. Handle process crashes gracefully using msync.
  4. Trigger a SIGBUS by mmap-ing a file and then truncating it while the mapping is open. Write a SIGBUS handler that remaps the truncated region to anonymous memory.
  5. Measure the overhead of msync(MS_SYNC) for different region sizes on an SSD. Plot latency vs sync'd bytes.
  6. Use /proc/PID/pagemap to determine for each page in a mmap'd file whether it is: (a) in the page cache, (b) not yet faulted in, (c) swapped out. Write a tool that outputs a visual "presence map" of the file.

References

  • mm/mmap.c: do_mmap(), mmap_region(), get_unmapped_area()
  • mm/memory.c: do_fault(), do_read_fault(), do_shared_fault()
  • mm/filemap.c: filemap_fault(), page cache integration
  • arch/x86/mm/mmap.c: arch_get_unmapped_area_topdown()
  • include/linux/mman.h: MAP_* flag definitions
  • include/uapi/linux/mman.h: userspace MAP_* definitions
  • Linux man pages: mmap(2), munmap(2), mprotect(2), msync(2), madvise(2), mremap(2), memfd_create(2)
  • Howard Chu, "LMDB: Lightning Memory-Mapped Database" (2013) — explains mmap as database buffer pool
  • Ulrich Drepper, "What Every Programmer Should Know About Memory" (2007), Section 4
  • LWN: "Memory mapped files" — https://lwn.net/Articles/591769/
  • LMDB source: libraries/liblmdb/mdb.c — mmap usage in a real database