Overlay Filesystems
Technical Overview
OverlayFS is a Linux union filesystem that presents a merged view of multiple directory layers to processes. It is the primary mechanism by which container images achieve both storage efficiency (layers shared across many containers) and writability (each container gets its own writable layer on top of a shared read-only image). Understanding OverlayFS is essential for debugging container storage issues, understanding image build performance, and reasoning about copy-on-write semantics.
The core concept: OverlayFS presents a merged directory that combines a read-only lower layer (or multiple lower layers) with a read-write upper layer. Reads that find a file in the upper layer return the upper version; reads that don't find a file in upper fall through to lower layers. All writes go to the upper layer. Deletions are represented by special "whiteout" files in the upper layer.
Prerequisites
- Linux VFS (Virtual Filesystem Switch) concepts
- Inode fundamentals (inode numbers, hard links, file identity)
- Understanding of container image layers
- Basic filesystem operations (stat, rename, link)
Historical Context
Union filesystems have a long history in Linux. The original motivation was bootable LiveCD systems — mount a read-only squashfs from CD, overlay a tmpfs for writes, user gets a fully writable system without modifying the CD.
Several implementations and milestones led up to today's OverlayFS:
- aufs (2006): "Another Union FileSystem," a patch set that was never merged into mainline. Docker used it as its initial storage driver because it was mature, but it required out-of-tree kernel patches.
- UnionFS: An earlier implementation that also remained outside the mainline kernel and saw limited adoption.
- overlayfs (Miklos Szeredi): Submitted to the kernel multiple times and merged in Linux 3.18 (December 2014) after years of review cycles. This was a major event for the container ecosystem because it brought union filesystem support to the mainline kernel without patches.
- Multiple lower layers (Linux 4.0, 2015): The initial OverlayFS supported only one lower layer. Linux 4.0 added support for multiple lower layers (up to a depth limit), enabling full container image layer stacking.
Docker migrated from aufs to overlay2 as the default storage driver once kernel support was widespread (circa 2017).
OverlayFS Core Operations
Mount Syntax
mount -t overlay overlay \
-o lowerdir=/lower2:/lower1,upperdir=/upper,workdir=/work \
/merged
- lowerdir: Colon-separated list of read-only lower directories (rightmost is bottom, leftmost is top of lower stack)
- upperdir: Read-write upper directory (must be on same filesystem as workdir)
- workdir: Scratch directory used internally by OverlayFS during operations like rename and copy-up (must be empty, same filesystem as upper)
- /merged: The merged view mountpoint
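Because the lowerdir ordering is easy to get backwards, it can help to build the option string from an explicitly ordered list. The following is a minimal sketch in Python; the helper name `overlay_options` is hypothetical, and the function only assembles the string for `-o`. It does not perform the mount, which needs root.

```python
def overlay_options(lowers_top_to_bottom, upper, work):
    """Build the -o string for an overlay mount.

    lowers_top_to_bottom: lower directories ordered from the top of the
    lower stack to the bottom, which is the order lowerdir= expects.
    """
    if not lowers_top_to_bottom:
        raise ValueError("at least one lower directory is required")
    for path in (*lowers_top_to_bottom, upper, work):
        # ':' and ',' act as separators in the option string, so this
        # sketch simply rejects paths that would need escaping.
        if ":" in path or "," in path:
            raise ValueError(f"path needs escaping: {path!r}")
    lowerdir = ":".join(lowers_top_to_bottom)
    return f"lowerdir={lowerdir},upperdir={upper},workdir={work}"


# /lower2 sits on top of /lower1, matching the mount example above.
opts = overlay_options(["/lower2", "/lower1"], "/upper", "/work")
print(opts)
```

The resulting string would be passed as `mount -t overlay overlay -o "$opts" /merged`.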
Read Operation
Process reads /merged/etc/nginx.conf
OverlayFS lookup:
1. Check upper (/upper/etc/nginx.conf) → not found
2. Check lower2, the top lower layer (/lower2/etc/nginx.conf) → not found
3. Check lower1, the bottom lower layer (/lower1/etc/nginx.conf) → FOUND
Return content from lower1
The lookup proceeds sequentially from upper down through the lower stack, so a file that exists only in the deepest layer costs O(N) directory lookups in the number of layers. Lookup results are cached, so this cost is paid on cold accesses. This is why very deep image layer stacks (Docker's overlay2 driver caps a stack at 128 lower layers) have measurable read overhead for cold-cache accesses.
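The top-down search can be modelled in a few lines of Python. This is a toy in-memory model, not kernel code: each layer is a hypothetical (name, dict) pair, and the layer names follow the mount example (lower2 stacked on top of lower1).

```python
def lookup(path, upper, lowers):
    """Return (layer_name, content) for path, searching top to bottom.

    upper is a (name, files) pair; lowers is a list of such pairs ordered
    from the top of the lower stack to the bottom.
    """
    for name, files in [upper, *lowers]:
        if path in files:
            return name, files[path]   # first hit wins; deeper layers are skipped
    raise FileNotFoundError(path)


upper = ("upper", {})
lowers = [
    ("lower2", {"/etc/hostname": "web-1\n"}),                  # top of the lower stack
    ("lower1", {"/etc/nginx.conf": "worker_processes 1;\n"}),  # bottom
]

layer, content = lookup("/etc/nginx.conf", upper, lowers)
print(layer)
```

The loop visits at most one dictionary per layer, which is the O(N) worst case described above.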
Write Operation
Process writes to /merged/var/log/nginx/access.log
If file exists in upper:
→ Write directly to upper file (fast path)
If file exists only in lower:
→ COPY-UP: copy entire file from lower to upper first
→ Then write to the upper copy
(future reads see the upper version)
If file does not exist anywhere:
→ Create new file directly in upper
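The three cases above can be sketched with the same kind of in-memory model. The function name `append` and the dict-based layers are illustrative assumptions; the write is modelled as an append so that the copied-up content stays visible.

```python
def append(path, data, upper, lowers):
    """Append data to path through the merged view; return the case hit."""
    if path in upper:
        upper[path] += data            # fast path: file already in upper
        return "upper"
    for files in lowers:               # top of the lower stack first
        if path in files:
            upper[path] = files[path]  # copy-up: the whole file goes to upper
            upper[path] += data        # then the write lands on the upper copy
            return "copy-up"
    upper[path] = data                 # file exists nowhere: create it in upper
    return "create"


log = "/var/log/nginx/access.log"
lowers = [{log: "old\n"}]
upper = {}

first = append(log, "new\n", upper, lowers)       # triggers copy-up
second = append(log, "more\n", upper, lowers)     # now served by upper
third = append("/tmp/scratch", "x", upper, lowers)
print(first, second, third)
```

Note that the lower layer's dictionary is never modified, which mirrors the read-only lower directories.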
Delete Operation (Whiteout Files)
When a file that exists in a lower layer is deleted through the merged view, OverlayFS cannot remove the lower-layer file (it is read-only). Instead, it creates a whiteout file in the upper layer:
Lower layer has: /lower1/etc/hosts
Process deletes: /merged/etc/hosts
OverlayFS creates: /upper/etc/hosts (with device number 0,0 — character device with major/minor both 0)
This is a whiteout file.
Subsequent lookup of /merged/etc/hosts:
1. Check upper: find whiteout file → stop, return ENOENT
(lower layers not consulted)
Opaque directories: When a directory that exists in a lower layer is deleted and then recreated through the merged view, OverlayFS marks the new directory in upper as opaque by setting the trusted.overlay.opaque=y xattr on it. Lookups inside an opaque directory do not fall through to the same-named directory in lower layers, so the old contents stay hidden.
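Whiteouts and opaque directories can be added to a toy lookup model as follows. This is a simplified sketch under stated assumptions: `WHITEOUT` is a sentinel standing in for the 0/0 character device, `opaque_dirs` is a set standing in for the xattr, and `delete` ignores opaque parents.

```python
import posixpath

WHITEOUT = object()   # stands in for the 0/0 character device


def lookup(path, upper, opaque_dirs, lowers):
    """Resolve path; upper can hide lower entries."""
    if path in upper:
        if upper[path] is WHITEOUT:
            raise FileNotFoundError(path)   # stop: lower layers not consulted
        return upper[path]
    # An opaque ancestor directory in upper hides everything below it.
    parent = posixpath.dirname(path)
    while True:
        if parent in opaque_dirs:
            raise FileNotFoundError(path)
        if parent == "/":
            break
        parent = posixpath.dirname(parent)
    for files in lowers:
        if path in files:
            return files[path]
    raise FileNotFoundError(path)


def delete(path, upper, lowers):
    """Delete through the merged view (simplified: ignores opaque parents)."""
    if upper.get(path) is WHITEOUT:
        raise FileNotFoundError(path)
    if any(path in files for files in lowers):
        upper[path] = WHITEOUT   # lower copy stays; the whiteout masks it
    elif path in upper:
        del upper[path]          # file only ever existed in upper
    else:
        raise FileNotFoundError(path)


def exists(path, upper, opaque_dirs, lowers):
    try:
        lookup(path, upper, opaque_dirs, lowers)
        return True
    except FileNotFoundError:
        return False


lowers = [{"/etc/hosts": "127.0.0.1 localhost\n", "/srv/app/old.txt": "v1\n"}]
upper, opaque_dirs = {}, set()

delete("/etc/hosts", upper, lowers)    # creates a whiteout in upper
opaque_dirs.add("/srv/app")            # directory was deleted and recreated
upper["/srv/app/new.txt"] = "v2\n"

print(exists("/etc/hosts", upper, opaque_dirs, lowers))
print(exists("/srv/app/old.txt", upper, opaque_dirs, lowers))
```

In both cases the lower-layer data is still present; only the merged view stops showing it.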
Copy-Up
Copy-up is the most expensive OverlayFS operation. It is triggered the first time a lower-layer file is opened for writing, or has its metadata changed, through the merged view:
File: /lower1/var/lib/mysql/ibdata1 (2GB InnoDB tablespace)
First write to this file via merged view:
1. Stat the lower file
2. Create parent directories in upper if they don't exist
3. Create a temporary file in workdir with the same attributes (mode, owner, times, xattrs)
4. COPY entire file content from lower to the temporary file (2GB read + 2GB write!)
5. Atomically rename the temporary file to the final upper path
6. Now writes go to the upper copy
Copy-up is atomic (rename-based) but it is not incremental — the entire file is copied. For large files, this causes significant latency on the first write. This is why containers that use large database files perform poorly when the DB data files are in the container's overlay layer — they incur a multi-gigabyte copy-up on first modification.
Mitigation: Keep database data directories on volumes or bind mounts (bypassing OverlayFS entirely). This is standard practice for stateful containers.
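The copy-up steps can be imitated in user space with ordinary files, which makes the "copy everything, then rename" pattern concrete. This is a sketch of the pattern, not the kernel's implementation: `copy_up` is a hypothetical helper, ownership is not copied, and the three directories are plain temporary directories rather than a real overlay mount.

```python
import os
import shutil
import tempfile


def copy_up(rel_path, lower, upper, work):
    """User-space imitation of copy-up for one regular file.

    upper and work must be on the same filesystem so that the final
    rename is atomic.
    """
    src = os.path.join(lower, rel_path)
    dst = os.path.join(upper, rel_path)
    os.stat(src)                                       # stat the lower file
    os.makedirs(os.path.dirname(dst), exist_ok=True)   # parents in upper
    fd, tmp = tempfile.mkstemp(dir=work)               # temp file in work
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)   # the full content is copied, not a delta
        shutil.copystat(src, tmp)   # mode and times (xattrs where supported)
        os.replace(tmp, dst)        # atomic rename into upper
    except BaseException:
        os.unlink(tmp)
        raise
    return dst


with tempfile.TemporaryDirectory() as root:
    lower, upper, work = (os.path.join(root, d) for d in ("lower", "upper", "work"))
    os.makedirs(os.path.join(lower, "var/lib"))
    os.makedirs(upper)
    os.makedirs(work)
    with open(os.path.join(lower, "var/lib/data.bin"), "wb") as f:
        f.write(b"x" * 4096)

    dst = copy_up("var/lib/data.bin", lower, upper, work)
    copied_size = os.path.getsize(dst)
    work_is_empty = os.listdir(work) == []

print(copied_size, work_is_empty)
```

Because the copy is written to a temporary name first, a reader never observes a half-copied file at the final path; the cost is still proportional to the file size.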
OverlayFS in Docker: overlay2 Storage Driver
Docker's overlay2 storage driver organizes layers under /var/lib/docker/overlay2/:
/var/lib/docker/overlay2/
├── <layer-sha256-1>/
│ ├── diff/ ← layer contents (the actual filesystem changes)
│ ├── link ← file holding the layer's shortened ID (the target of a symlink under overlay2/l/), used to keep mount option strings short
│ └── work/ ← OverlayFS work directory
│
├── <layer-sha256-2>/
│ ├── diff/
│ ├── link
│ ├── lower ← text file listing parent layer short IDs
│ └── work/
│
└── <container-id>/
├── diff/ ← container's writable upper layer
├── link
├── lower ← all image layers listed (colon-separated short IDs)
├── merged/ ← the live mountpoint (when container is running)
└── work/
OverlayFS Directory Structure Diagram
Docker Image: nginx (3 layers)
Layer 3 (top image layer: nginx config)
/var/lib/docker/overlay2/<sha3>/diff/
├── etc/nginx/nginx.conf
└── etc/nginx/conf.d/default.conf
Layer 2 (nginx binaries layer)
/var/lib/docker/overlay2/<sha2>/diff/
└── usr/sbin/nginx
Layer 1 (base OS layer)
/var/lib/docker/overlay2/<sha1>/diff/
├── bin/
├── etc/
│ └── hosts
├── lib/
└── usr/
Container writable layer (upper)
/var/lib/docker/overlay2/<container-id>/diff/
└── (empty at start; writes accumulate here)
OverlayFS mount (when container runs):
lowerdir=<sha3>/diff:<sha2>/diff:<sha1>/diff
upperdir=<container-id>/diff
workdir=<container-id>/work
merged=<container-id>/merged ← container's rootfs
Container sees at merged/:
├── bin/ ← from layer 1
├── etc/
│ ├── hosts ← from layer 1
│ └── nginx/ ← from layer 3
│ ├── nginx.conf
│ └── conf.d/
├── lib/ ← from layer 1
└── usr/
├── lib/ ← from layer 1
└── sbin/
└── nginx ← from layer 2
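The merged listing in the diagram can be derived mechanically from the layer contents. The sketch below uses a hypothetical `merged_view` helper and the file names from the diagram; it ignores whiteouts and directories for brevity.

```python
def merged_view(layers_top_to_bottom):
    """Map each path to the name of the topmost layer that provides it."""
    view = {}
    for name, paths in layers_top_to_bottom:
        for path in paths:
            view.setdefault(path, name)   # first (topmost) provider wins
    return view


layers = [
    ("container", []),   # writable upper layer, empty at start
    ("layer3", ["etc/nginx/nginx.conf", "etc/nginx/conf.d/default.conf"]),
    ("layer2", ["usr/sbin/nginx"]),
    ("layer1", ["etc/hosts"]),
]

view = merged_view(layers)
for path in sorted(view):
    print(f"{path:32} <- {view[path]}")

# After the container modifies etc/hosts, its upper layer provides the file.
after_write = merged_view([("container", ["etc/hosts"]), *layers[1:]])
```

The list order is the same top-to-bottom order used in lowerdir, with the container's upper layer placed first.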
Layer Sharing Across Containers
The power of the layered model is that image layers are shared across all containers that use the same image:
100 containers running nginx:latest
/var/lib/docker/overlay2/<sha1> ← base layer: 1 copy on disk, referenced by all 100
/var/lib/docker/overlay2/<sha2> ← nginx layer: 1 copy on disk, referenced by all 100
/var/lib/docker/overlay2/<sha3> ← config layer: 1 copy on disk, referenced by all 100
Per-container storage (100 separate writable layers):
/var/lib/docker/overlay2/<container1-id>/diff ← empty (or small diffs)
/var/lib/docker/overlay2/<container2-id>/diff ← empty (or small diffs)
...
/var/lib/docker/overlay2/<container100-id>/diff
Disk usage: 3 shared layers + 100 tiny writable layers
(not: 100 × full image size)
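A quick calculation shows the scale of the difference. The layer sizes and the per-container writable size below are made-up illustrative numbers, not measurements of any real nginx image.

```python
def disk_usage_mb(layer_sizes_mb, containers, writable_mb_each):
    """Return (shared, unshared) disk usage in MB.

    shared:   image layers stored once, plus one writable layer per container
    unshared: every container carries its own full copy of the image
    """
    image = sum(layer_sizes_mb)
    shared = image + containers * writable_mb_each
    unshared = containers * (image + writable_mb_each)
    return shared, unshared


# Hypothetical sizes: base OS 75 MB, nginx binaries 60 MB, config 1 MB,
# and 2 MB of writes per container.
shared, unshared = disk_usage_mb([75, 60, 1], containers=100, writable_mb_each=2)
print(f"shared layers: {shared} MB, independent copies: {unshared} MB")
```

With these assumed numbers the shared layout needs 336 MB against 13,800 MB for independent copies.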
OverlayFS Limitations and Quirks
Inode Renumbering
Files in different layers get different inode numbers when viewed through the merge, even for the same logical file. More critically: when a file is copied up from lower to upper, it gets a new inode number. Applications that cache inode numbers (some build tools, some database recovery mechanisms) can be confused by this.
redirect_dir mount option (Linux 4.10+): Improves handling of directory renames by recording the original directory path as an xattr, so cross-layer renames work correctly without full directory copy-up.
Hard Link Semantics
Hard links to files in lower layers do not behave as expected through the merged view once copy-up occurs. When a file is modified through one of its names (triggering copy-up), only that path is copied up; the other names still refer to the unmodified lower-layer inode, so the link between them is broken by default (the kernel's optional index feature exists to preserve it). This breaks hard-link-based atomic file replacement patterns used by some package managers (rpm, dpkg with hard-link deduplication).
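Both effects, the new inode number after copy-up and the broken hard link, can be reproduced with plain files. The sketch below imitates copy-up with `shutil.copy2` between two ordinary directories; it is an analogy for the behaviour described above, not a test of a real overlay mount.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as root:
    lower = os.path.join(root, "lower")
    upper = os.path.join(root, "upper")
    os.makedirs(lower)
    os.makedirs(upper)

    # Two names for one inode in the lower layer.
    a = os.path.join(lower, "a")
    b = os.path.join(lower, "b")
    with open(a, "w") as f:
        f.write("original\n")
    os.link(a, b)
    same_inode_before = os.stat(a).st_ino == os.stat(b).st_ino

    # Imitate copy-up of "a" only, then modify the upper copy.
    upper_a = os.path.join(upper, "a")
    shutil.copy2(a, upper_a)
    with open(upper_a, "a") as f:
        f.write("modified\n")

    # The upper copy is a different file with a different inode number.
    inode_changed = os.stat(upper_a).st_ino != os.stat(a).st_ino
    # The other name still points at the untouched lower inode.
    with open(b) as f:
        b_content = f.read()
    lower_links = os.stat(b).st_nlink

print(same_inode_before, inode_changed, repr(b_content), lower_links)
```

A program that wrote through "a" and expected to read the change back through "b" would see stale data in this situation.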
d_type Requirement
OverlayFS requires the underlying filesystem to support d_type (directory entry type in readdir()). XFS formatted without ftype support, and some network filesystems, do not support d_type. Docker detects this and falls back to the slower vfs storage driver.
NFS and Remote Filesystems
OverlayFS upper and work directories cannot be on NFS or other network filesystems — they must be local filesystems. Only lower layers can be on remote filesystems.
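Layer Depth Limit
Docker's overlay2 driver supports at most 128 lower layers in a single OverlayFS mount. The kernel enforces its own, higher cap on the number of lower layers (OVL_MAX_STACK in fs/overlayfs), and the option string that lists the lower directories must also fit within the page-size limit on mount data, which is why Docker refers to layers by short symlinked IDs. Docker image builds should stay well under 128 layers; an image with 128 or more filesystem layers hits this limit.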
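To check how close a mount is to the 128-layer limit, the lowerdir entries in its option string can be counted. The helper below is a hypothetical sketch; the sample options string is invented in the style of a /proc/mounts entry, and escaped colons inside paths are not handled.

```python
MAX_LOWER_LAYERS = 128   # the 128-layer limit discussed above


def count_lower_layers(mount_options):
    """Count lowerdir entries in an overlay mount option string."""
    for option in mount_options.split(","):
        key, _, value = option.partition("=")
        if key == "lowerdir":
            return len(value.split(":"))
    return 0   # no lowerdir option present


# Hypothetical options field, shortened for readability.
options = "rw,relatime,lowerdir=l/AAA:l/BBB:l/CCC,upperdir=x/diff,workdir=x/work"
n = count_lower_layers(options)
print(f"{n} lower layers, {MAX_LOWER_LAYERS - n} below the limit")
```

On a real host the options string would be the fourth field of the matching line in /proc/mounts.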
Layer Model Performance Analysis
Performance characteristics by operation:
Read (cached): O(1) — page cache, same as regular fs
Read (cold): O(N) layers traversed in worst case (file in deepest layer)
Write (new file): O(1) — direct write to upper
Write (modified): O(1) if file already in upper
O(file_size) COPY-UP if file only in lower
Delete: O(1) — whiteout file creation
Rename (dir): O(subtree_size) without redirect_dir
O(1) with redirect_dir xattr optimization
Image build layer analysis:
- Each filesystem-modifying Dockerfile instruction (RUN, COPY, ADD) creates a new layer
- Many small layers: each layer adds lookup overhead
- Fewer, larger layers: better read performance, worse build caching
Container Image Deduplication
Content-addressable storage means layers are identified by their SHA256 digest. If two different images both have a layer with the same content (same digest), the layer is stored once on disk and referenced by both image manifests.
In practice, base image layers (OS packages, common libraries) are heavily shared across many images. The space saving on a production node running dozens of different services from similar base images can be substantial — often 60-80% space reduction compared to storing each image independently.
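The deduplication rule, one stored blob per distinct digest, can be illustrated with a toy store. `LayerStore` is a hypothetical class written for this example and does not reflect Docker's or containerd's actual on-disk layout; the layer contents are placeholder bytes.

```python
import hashlib


class LayerStore:
    """Toy content-addressable store: one blob per distinct digest."""

    def __init__(self):
        self.blobs = {}       # digest -> layer bytes
        self.manifests = {}   # image name -> list of layer digests

    def add_image(self, name, layers):
        digests = []
        for data in layers:
            digest = "sha256:" + hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)   # stored once per digest
            digests.append(digest)
        self.manifests[name] = digests

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = LayerStore()
base = b"base-os" * 1000   # 7000 bytes standing in for a shared base layer
store.add_image("service-a", [base, b"app-a" * 100])
store.add_image("service-b", [base, b"app-b" * 100])

print(len(store.blobs), store.stored_bytes())
```

Four layers are referenced by the two manifests, but only three blobs are stored because the base layer's digest is identical in both images.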
OverlayFS vs Other Storage Drivers
| Driver | Kernel support | CoW mechanism | Performance | Notes |
|---|---|---|---|---|
| overlay2 | mainline 4.0+ (OverlayFS itself since 3.18) | File-level copy-up | Good | Default; recommended |
| aufs | Out-of-tree | File-level copy-up | Comparable | Deprecated, requires patched kernel |
| devicemapper | mainline | Block-level CoW | Worse for metadata | Complex setup, largely replaced |
| btrfs | mainline | Subvolume snapshots | Excellent for metadata | Requires btrfs-formatted storage |
| zfs | Out-of-tree | ZFS snapshots | Excellent | Requires ZFS kernel module |
| vfs | None | Full copies (no CoW!) | Poor | Fallback only; copies entire image for each container |
Production Examples
Inspect layer structure of a running container:
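# Find container's OverlayFS paths
docker inspect <container> | jq '.[0].GraphDriver'
# Shows LowerDir, UpperDir, MergedDir, WorkDir paths under Data
# Examine the actual mount; filter on the overlay2 directory name taken from
# MergedDir above, which may differ from the container ID
grep overlay /proc/mounts | grep <overlay2-dir-prefix>
# See upper layer changes (requires root on the host)
ls "$(docker inspect --format '{{ .GraphDriver.Data.UpperDir }}' <container>)"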
Find where disk space is consumed:
# Total overlay2 usage
du -sh /var/lib/docker/overlay2/
# Find largest layers
du -sh /var/lib/docker/overlay2/*/diff | sort -rh | head -20
# Dangling layers (not referenced by any image or container)
docker system df -v
docker image prune # remove dangling images and their layers
Understand why a docker build is slow:
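# Build with --no-cache to measure fresh build time
time docker build --no-cache -t myimage .
# Count layers in final image (the count includes the header row)
docker history myimage --no-trunc | wc -l
# Check if you're hitting copy-up on large files: -T prints the time spent in
# each syscall, and copy-up shows up as a slow openat for writing
strace -f -T -e trace=openat,write -p <container-pid>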
Debugging Notes
- "too many levels of symbolic links": OverlayFS with redirect_dir and many layers can generate deep symlink chains that hit the kernel limit. Reduce layer count or disable redirect_dir.
- "no space left on device" but df shows space: Could be inode exhaustion. OverlayFS on ext4 can exhaust inodes when many small files are created across many layers. Check df -i.
- Unexpected file permissions: After copy-up, the upper copy has the same permissions and ownership as the lower original. If the lower file is owned by a UID that has no mapping in the container's user namespace, the upper copy may appear to be owned by nobody.
- Database refusing to start: Check whether the DB data directory is on OverlayFS (it should be on a volume or bind mount). Databases (MySQL, PostgreSQL) often fail or perform terribly on OverlayFS due to copy-up of large data files.
- stale NFS handle errors: Using NFS as the Docker storage root. Not supported for overlay2 upper layers. Use local storage.
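The inode check can also be scripted. The sketch below uses `os.statvfs`, which exposes the same counters that df -i reports; the helper name is hypothetical, and "/" is used only so the example runs anywhere (on a Docker host the path of interest would be the Docker storage root).

```python
import os


def inode_usage_percent(path):
    """Return the percentage of inodes in use on path's filesystem.

    Returns None when the filesystem does not report a fixed inode count
    (f_files == 0), as some filesystems do.
    """
    st = os.statvfs(path)
    if st.f_files == 0:
        return None
    used = st.f_files - st.f_ffree
    return 100.0 * used / st.f_files


pct = inode_usage_percent("/")
print("inode usage:", "n/a" if pct is None else f"{pct:.1f}%")
```

A value near 100% alongside free blocks in df is the signature of the ENOSPC case described in the list above.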
Security Implications
- Layer content inspection: The lower layers of a running container are readable from the host by root. Sensitive secrets baked into image layers are accessible from the host. This is why secrets should not be in image layers.
- Upper layer writable by host root: The container's writable upper layer (diff/) is accessible from the host. A host root process can modify it, injecting code into the running container. Containers do not protect their filesystem from host root access.
- Layer sharing and side channels: Shared lower layers mean that CPU cache and page cache timing side channels could in principle leak information between containers sharing the same image layers. This is mainly a concern for high-security environments.
- Whiteout file leakage: If an image layer contains a whiteout for a lower layer file, the original file content is still visible on the lower layer in the content store. Do not rely on "deleting" sensitive files in a Dockerfile layer to hide them.
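One way to audit which paths a layer hides, and therefore which lower-layer files may still hold sensitive content, is to scan a layer's diff/ directory for whiteouts. The sketch below detects the on-disk form described earlier (a character device with device number 0/0); the function names are hypothetical, and the demo runs on an empty temporary directory because creating real whiteouts requires root.

```python
import os
import stat
import tempfile


def is_whiteout(path):
    """True if path is a character device with major and minor number 0."""
    st = os.lstat(path)
    return (
        stat.S_ISCHR(st.st_mode)
        and os.major(st.st_rdev) == 0
        and os.minor(st.st_rdev) == 0
    )


def find_whiteouts(layer_dir):
    """Yield paths (relative to layer_dir) that the layer marks as deleted."""
    for dirpath, dirnames, filenames in os.walk(layer_dir):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            if is_whiteout(full):
                yield os.path.relpath(full, layer_dir)


# Demo on a throwaway directory that contains only a regular file.
with tempfile.TemporaryDirectory() as layer:
    with open(os.path.join(layer, "regular.txt"), "w") as f:
        f.write("not a whiteout\n")
    found = list(find_whiteouts(layer))

print(found)
```

Run as root against a real diff/ directory, each reported path names a file that still exists, with its original content, somewhere in the layers below.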
Performance Implications
- Copy-up latency: The first write to any lower-layer file adds the full file copy overhead. For large files, this can take seconds. Design containers to keep large mutable data on volumes or bind mounts.
- Deep layer stacks: Each cold read traverses up to N layers. Merge related steps in the Dockerfile (RUN apt-get install ... && apt-get clean && ... in one step rather than separate RUN instructions).
- Page cache sharing: Lower layers (read-only) are backed by the page cache as usual. Multiple containers using the same lower layer share the same page cache entries, which is good for memory efficiency.
- Write-heavy workloads: OverlayFS is appropriate for mostly-read workloads. Write-heavy container workloads (CI/CD build containers writing many files) should use ephemeral volumes on local SSD rather than the container's overlay layer.
Failure Modes
| Failure | Symptom | Cause |
|---|---|---|
| Copy-up on large file | Container hangs on first write | Large file being copied from lower to upper; use a volume or bind mount |
| Inode exhaustion | ENOSPC despite free blocks | Too many small files; check df -i, resize or tune ext4 inode ratio |
| Layer depth exceeded | Container fails to start | overlay2 limit of 128 lower layers exceeded; flatten image |
| d_type missing | Docker falls back to vfs driver | Underlying fs doesn't support d_type; reformat XFS with ftype=1 |
| Upper dir not on local fs | Mount fails | NFS/CIFS upper dir; use local storage for Docker root |
| Rename across layers broken | Application errors on file rename | Missing redirect_dir; upgrade kernel or use a volume or bind mount |
Modern Usage
- BuildKit layer optimization: Docker BuildKit and other builders (such as ko and kaniko) aim to maximize layer reuse and minimize copy-up during builds
- OCI layer streaming: Projects like estargz and nydus implement lazy-loading of container image layers, pulling only the file blocks that are actually accessed rather than entire layers before starting
- OverlayFS for build caches: CI systems mount read-only base layers as overlayfs lower dirs and writable upper dirs per build step, then cache and reuse upper dirs as new lower layers for subsequent steps
- User namespace + OverlayFS: Linux 5.11+ supports OverlayFS mounting inside user namespaces, enabling rootless overlayfs (previously required root or a fuse-overlayfs fallback)
Future Directions
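- Composefs: A content-addressed, immutable storage format designed for container images, with per-file fs-verity for integrity verification. It was first proposed as a standalone kernel filesystem; the design that moved forward instead combines EROFS with OverlayFS features added in the 6.x kernels (such as data-only lower layers and fs-verity checks). Intended as a better alternative to squashfs for container image lower layers.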
- OverlayFS volatile upper: The volatile mount option (available since Linux 5.10) lets the upper layer skip sync operations for containers that do not need a crash-consistent upper layer (disposable containers), which can significantly improve write performance.
- Chunk-level deduplication: Moving beyond file-level sharing to block/chunk-level deduplication in container image stores, reducing storage further for images with similar but not identical layers.
Exercises
- Mount an OverlayFS manually: create lower, upper, work, and merged directories. Mount the overlay. Create files in lower, then modify one through merged. Observe the copy-up in the upper directory. Check the whiteout file after deleting a lower-layer file.
- Measure copy-up cost: create a 500MB file in the lower layer. Time the first write to this file through the merged view. Compare to a direct write to upper.
- Use docker history nginx:latest to see all layers. Find the corresponding directories under /var/lib/docker/overlay2/. Verify the lower file in each layer directory points to the correct parent layers.
- Build a Dockerfile that intentionally creates 5 layers. Run a container from it and observe the OverlayFS mount with cat /proc/mounts. Count the lowerdir entries.
- Create a container that writes many small files. Monitor inode usage with df -i. At what point does inode exhaustion become a concern?
- Compare disk space usage: pull the same image for 10 containers vs. having 10 independent copies of the filesystem. Measure the actual disk savings from layer sharing.
References
- OverlayFS kernel documentation: Documentation/filesystems/overlayfs.rst
- Miklos Szeredi's OverlayFS design documents (kernel mailing list archives)
- Docker overlay2 storage driver documentation: docs.docker.com/storage/storagedriver/overlayfs-driver/
- Linux kernel source: fs/overlayfs/
- mount(8) man page, overlayfs section
- Composefs design document: lkml archives, 2022
- estargz (Stargz Snapshotter): github.com/containerd/stargz-snapshotter
- Container storage interface (CSI) specification for volumes bypassing OverlayFS