Linux Namespaces
Technical Overview
Linux namespaces are a kernel mechanism that partitions global system resources so that each process (or group of processes) sees its own isolated view of those resources. From the perspective of a process inside a namespace, it appears to be the sole owner of that resource — it sees no other processes, no other network interfaces, no other filesystem mounts than what is permitted within its namespace context.
Namespaces are the foundational isolation primitive beneath containers. When people say "containers are just processes," what they mean precisely is: containers are processes whose global resource views have been partitioned using namespaces. There is no container daemon magic, no hypervisor — just a Linux process running with a different set of namespace memberships than the parent shell.
The kernel tracks namespaces as reference-counted objects. A namespace persists as long as at least one process belongs to it, or as long as an open file descriptor or a bind mount holds a reference to its namespace file in /proc/PID/ns/.
Prerequisites
- Linux kernel fundamentals (processes, VFS, system calls)
- Understanding of fork(), clone(), execve() semantics
- Familiarity with /proc filesystem layout
- Basic understanding of process credentials (UID/GID)
Historical Context
The namespace concept in Linux evolved incrementally over nearly two decades. The first namespace type — Mount — was introduced in kernel 2.4.19 in 2002 as part of work by Al Viro and others to support private mount trees per process. The vision of generalizing this isolation mechanism to other resource types came later.
Eric Biederman, in particular, drove the broader namespace effort at IBM and later Red Hat. The User namespace — arguably the most powerful and most carefully reviewed — took until kernel 3.8 in 2013 to reach a usable state, after years of partial implementations and security reviews.
The most recently added namespace type is the Time namespace (Linux 5.6, 2020), which gives processes isolated views of the system's monotonic and boottime clocks. It is primarily useful for checkpoint/restore workflows (CRIU), where restored processes must behave as if they never stopped.
All 8 Namespace Types
| Namespace | Flag | Kernel Version | Year | Isolates |
|---|---|---|---|---|
| Mount | CLONE_NEWNS | 2.4.19 | 2002 | Filesystem mount points (each ns has its own mount tree) |
| UTS | CLONE_NEWUTS | 2.6.19 | 2006 | Hostname and NIS domain name |
| IPC | CLONE_NEWIPC | 2.6.19 | 2006 | System V IPC, POSIX message queues |
| PID | CLONE_NEWPID | 2.6.24 | 2008 | Process ID number space (PIDs restart from 1 inside) |
| Network | CLONE_NEWNET | 2.6.24 | 2008 | Network devices, IPs, routing tables, iptables, sockets |
| User | CLONE_NEWUSER | 3.8 | 2013 | UID/GID mappings (container root → unprivileged host UID) |
| Cgroup | CLONE_NEWCGROUP | 4.6 | 2016 | Cgroup root view (process sees its cgroup as root) |
| Time | CLONE_NEWTIME | 5.6 | 2020 | Monotonic and boottime clock offsets |
Mount Namespace (CLONE_NEWNS)
Each process has a mount namespace that defines the set of filesystem mounts it can see. When a process creates a new mount namespace, it inherits a copy of the parent's mount tree. From that point on, mount/unmount operations in the child namespace do not affect the parent. This is how containers get their own root filesystem: the container runtime performs a pivot_root() or chroot() within the new mount namespace.
The flag name CLONE_NEWNS (not CLONE_NEWMNT) is a historical artifact: mount namespaces were the only namespace type when the flag was introduced, so it was simply called "new namespace."
UTS Namespace (CLONE_NEWUTS)
The UTS namespace isolates the hostname and NIS domain name returned by uname(). Inside a container, hostname can return "webapp-pod-1" while the host's hostname remains "prod-node-42". This prevents hostname-based service discovery logic inside containers from leaking host identity.
IPC Namespace (CLONE_NEWIPC)
Isolates System V IPC objects (semaphore arrays, shared memory segments, message queues) and POSIX message queues. Without IPC namespace isolation, a process in one container could attach to a shared memory segment created by another container — a serious security violation.
PID Namespace (CLONE_NEWPID)
PIDs are renumbered inside a new PID namespace. The first process created inside a new PID namespace gets PID 1, becoming the "init" of that namespace. If PID 1 inside the namespace exits, the kernel sends SIGKILL to all other processes in that namespace, tearing it down.
A process inside a PID namespace cannot see or signal processes outside its namespace. However, from the host, you can see all processes in all PID namespaces — a container's PID 1 might appear as PID 4721 on the host.
PID namespaces are nested: a process has a PID in its own namespace and in every ancestor namespace, up to a global PID as seen from the root namespace. /proc/PID/status contains an NSpid: line that lists the PID at each namespace level, outermost first.
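To make the nesting concrete, here is a minimal sketch that parses the NSpid: line of a status file. The function names parse_nspid and read_nspid are hypothetical helpers invented for this illustration, and the sketch assumes the line holds whitespace-separated decimal PIDs.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse one "NSpid:" line from /proc/<pid>/status.
 * Fills pids[] in the order the line lists them and returns the number
 * of PID namespace levels found, or -1 if this is not an NSpid line. */
int parse_nspid(const char *line, long *pids, int max)
{
    static const char prefix[] = "NSpid:";
    if (strncmp(line, prefix, sizeof prefix - 1) != 0)
        return -1;

    const char *p = line + sizeof prefix - 1;
    int n = 0;
    while (n < max) {
        char *end;
        long v = strtol(p, &end, 10);   /* skips tabs and spaces */
        if (end == p)                   /* no more numbers on the line */
            break;
        pids[n++] = v;
        p = end;
    }
    return n;
}

/* Scan a status file for the NSpid line (hypothetical convenience wrapper).
 * Returns the number of levels, or -1 if the file or line is missing. */
int read_nspid(const char *status_path, long *pids, int max)
{
    FILE *f = fopen(status_path, "r");
    if (!f)
        return -1;

    char line[256];
    int n = -1;
    while (fgets(line, sizeof line, f)) {
        n = parse_nspid(line, pids, max);
        if (n >= 0)
            break;
    }
    fclose(f);
    return n;
}
```

Calling read_nspid("/proc/self/status", pids, 8) from a process in the root PID namespace should report a single level; from the host, a containerized process would typically report two or more.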
Network Namespace (CLONE_NEWNET)
Each network namespace has its own set of network devices, IPv4/IPv6 stacks, routing tables, firewall rules, and socket table. When created, a new network namespace has only a loopback interface in DOWN state. Container runtimes then wire a virtual ethernet pair (veth) from the host namespace into the container namespace to provide connectivity.
User Namespace (CLONE_NEWUSER)
The most powerful namespace type. User namespaces allow mapping of UIDs/GIDs: UID 0 inside the container can map to UID 100000 on the host. This means a process that appears to be root inside a container is actually an unprivileged user on the host. User namespaces can be created without any privileges — they are the enabler for rootless containers.
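UID and GID mappings are defined in /proc/PID/uid_map and /proc/PID/gid_map. Each line has three fields, inside-id outside-id count: the first ID of the range inside the namespace, the first ID of the corresponding range outside it, and the length of the range.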
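The arithmetic behind a mapping line can be sketched as follows. The helper name map_id_to_outside is hypothetical, and the sketch handles a single map line only; a real map file may contain several lines that would each need to be checked.

```c
#include <stdio.h>

/* Translate an ID seen inside a user namespace to the ID outside it,
 * using one line of a uid_map or gid_map file.
 * Line format: <inside-start> <outside-start> <count>
 * Returns the outside ID, or -1 if the line is malformed or the ID
 * falls outside the mapped range. */
long long map_id_to_outside(const char *map_line, long long inside_id)
{
    long long inside_start, outside_start, count;

    if (sscanf(map_line, "%lld %lld %lld",
               &inside_start, &outside_start, &count) != 3)
        return -1;

    if (inside_id < inside_start || inside_id >= inside_start + count)
        return -1;                      /* ID is not covered by this line */

    return outside_start + (inside_id - inside_start);
}
```

With the line "0 100000 65536", container UID 0 translates to host UID 100000 and container UID 1000 to host UID 101000, while UID 65536 is unmapped.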
Cgroup Namespace (CLONE_NEWCGROUP)
Without cgroup namespaces, a process inside a container can read /proc/self/cgroup and see its full cgroup path on the host, leaking the host's cgroup hierarchy layout. With a cgroup namespace, the process's cgroup root is virtualized — it appears to be at / regardless of where it actually sits in the host hierarchy.
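As an illustration of what a process reads from /proc/self/cgroup, the sketch below extracts the path field from one line of that file. The helper name cgroup_path_from_line and the sample scope name used in the note are assumptions made for this example; the line layout hierarchy-id:controller-list:path is the documented one.

```c
#include <stdio.h>
#include <string.h>

/* Extract the cgroup path from one line of /proc/<pid>/cgroup.
 * Line format: <hierarchy-id>:<controller-list>:<path>
 * Copies the path (without the trailing newline) into out.
 * Returns 0 on success, -1 if the line is malformed or out is too small. */
int cgroup_path_from_line(const char *line, char *out, size_t outlen)
{
    const char *first = strchr(line, ':');
    if (!first)
        return -1;

    const char *second = strchr(first + 1, ':');
    if (!second)
        return -1;

    const char *path = second + 1;
    size_t len = strcspn(path, "\n");   /* stop at end of line */
    if (len == 0 || len >= outlen)
        return -1;

    memcpy(out, path, len);
    out[len] = '\0';
    return 0;
}
```

On a cgroup v2 host, a line such as "0::/system.slice/docker-abc.scope" yields the full host path, whereas the same process inside its own cgroup namespace would be expected to read "0::/" and get "/".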
Time Namespace (CLONE_NEWTIME)
Allows setting per-namespace offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. This is primarily used by CRIU (Checkpoint/Restore In Userspace) to restore containers so that their view of elapsed time matches when they were checkpointed, not when they were restored.
Namespace Creation: clone(), unshare(), setns()
clone() with CLONE_NEW* flags
The primary way to create a new namespace is via clone(), which creates a new process that starts life in the new namespace:
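// Create a child process in new network and UTS namespaces.
// child_fn and stack (a buffer of STACK_SIZE bytes) are defined elsewhere.
pid_t pid = clone(child_fn, stack + STACK_SIZE,
                  CLONE_NEWNET | CLONE_NEWUTS | SIGCHLD,
                  NULL);
if (pid == -1)
    perror("clone");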
For every namespace type not listed in the flags, the child remains a member of the parent's namespaces: it shares them rather than receiving a copy.
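A fuller version of the clone() pattern might look like the sketch below, which uses only CLONE_NEWUTS so that the effect is easy to observe. The names demo_clone_newuts and child_fn and the 1 MiB stack size are choices made for this illustration. Creating a UTS namespace this way normally requires CAP_SYS_ADMIN, so the function reports a failed clone() rather than treating it as fatal.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/utsname.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Runs in the child: set a hostname and confirm uname() reports it. */
static int child_fn(void *arg)
{
    const char *name = arg;
    struct utsname u;

    if (sethostname(name, strlen(name)) == -1)
        return 1;
    if (uname(&u) == -1)
        return 1;
    return strcmp(u.nodename, name) == 0 ? 0 : 1;
}

/* Returns 0 if the child saw the new hostname while the parent's hostname
 * stayed the same, -1 if setup or clone() failed (typically EPERM without
 * CAP_SYS_ADMIN), and 1 if the isolation check itself failed. */
int demo_clone_newuts(const char *name)
{
    struct utsname before, after;
    if (uname(&before) == -1)
        return -1;

    char *stack = malloc(STACK_SIZE);
    if (!stack)
        return -1;

    /* The stack grows downward on most architectures, so pass its top. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_NEWUTS | SIGCHLD, (void *)name);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);
    free(stack);
    if (waited == -1 || uname(&after) == -1)
        return -1;

    int child_ok = WIFEXITED(status) && WEXITSTATUS(status) == 0;
    int parent_same = strcmp(before.nodename, after.nodename) == 0;
    return (child_ok && parent_same) ? 0 : 1;
}
```

Run with sufficient privileges, demo_clone_newuts("ns-demo") should return 0: the hostname change is visible only inside the child's UTS namespace.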
unshare() — unshare without fork
unshare() lets a process disassociate itself from one or more namespaces and move into new ones, without creating a child process:
// Current process moves into a new mount namespace
unshare(CLONE_NEWNS);
This is how the unshare(1) tool works. For example:
$ unshare --mount --pid --fork bash
# now running in new mount and PID namespaces
Note: for PID namespaces, unshare() takes effect for the next fork() — the current process's PID namespace does not change, but its children will start in the new PID namespace.
setns() — join an existing namespace
setns() allows a process to join a namespace that already exists (identified by a file descriptor pointing to /proc/PID/ns/<type>):
int fd = open("/proc/12345/ns/net", O_RDONLY);
setns(fd, CLONE_NEWNET); // join PID 12345's network namespace
This is the mechanism behind nsenter and docker exec: to execute a command inside a running container, the runtime opens the container process's namespace file descriptors and calls setns() for each namespace type.
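The open-then-setns() sequence can be wrapped in a small helper with error handling, roughly as sketched here. The name join_namespace is hypothetical, and this is an illustration of the pattern rather than the code nsenter or any container runtime actually uses.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Join one namespace of a target process.
 * type is the file name under /proc/<pid>/ns/ (for example "net");
 * nstype is the matching CLONE_NEW* flag, which makes setns() reject a
 * descriptor that refers to a different kind of namespace.
 * Returns 0 on success, -1 on failure with errno set. */
int join_namespace(pid_t pid, const char *type, int nstype)
{
    char path[64];
    snprintf(path, sizeof path, "/proc/%ld/ns/%s", (long)pid, type);

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd == -1)
        return -1;

    int rc = setns(fd, nstype);
    int saved = errno;
    close(fd);          /* membership does not depend on the fd staying open */
    errno = saved;
    return rc;
}
```

A caller would typically invoke this once per namespace type and then execvp() the requested command. Joining a namespace generally requires CAP_SYS_ADMIN, and for PID namespaces only subsequently created children are placed in the joined namespace.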
The /proc/PID/ns/ Directory
/proc/12345/ns/
├── cgroup -> cgroup:[4026531835]
├── ipc -> ipc:[4026531839]
├── mnt -> mnt:[4026532456]
├── net -> net:[4026532459]
├── pid -> pid:[4026532457]
├── pid_for_children -> pid:[4026532457]
├── time -> time:[4026531834]
├── time_for_children -> time:[4026531834]
├── user -> user:[4026531837]
└── uts -> uts:[4026531838]
Each entry is a symbolic link pointing to a pseudo-file of the form <type>:[inode]. The inode number uniquely identifies the namespace instance. Two processes sharing the same namespace will have symlinks pointing to the same inode number — this is how you determine if two processes are in the same namespace without any special tools.
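The comparison described above can be sketched in two ways: by parsing the link text, or by comparing the device and inode numbers that stat() reports for the namespace files. The helper names parse_ns_link, read_ns_inode, and same_namespace are hypothetical.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Parse a link target such as "net:[4026532459]".
 * Stores the type name and inode number; returns 0 on success, -1 otherwise. */
int parse_ns_link(const char *target, char type[16], unsigned long *ino)
{
    if (sscanf(target, "%15[a-z_]:[%lu]", type, ino) != 2)
        return -1;
    return 0;
}

/* Read a /proc/<pid>/ns/<type> symlink and parse it. */
int read_ns_inode(const char *path, char type[16], unsigned long *ino)
{
    char target[64];
    ssize_t n = readlink(path, target, sizeof target - 1);
    if (n == -1)
        return -1;
    target[n] = '\0';
    return parse_ns_link(target, type, ino);
}

/* Returns 1 if both paths refer to the same namespace, 0 if they differ,
 * and -1 if either path cannot be examined. */
int same_namespace(const char *path_a, const char *path_b)
{
    struct stat a, b;
    if (stat(path_a, &a) == -1 || stat(path_b, &b) == -1)
        return -1;
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

For example, same_namespace("/proc/self/ns/net", "/proc/1/ns/net") indicates whether the caller shares a network namespace with PID 1, provided the caller has permission to inspect that process.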
You can bind-mount these pseudo-files to keep a namespace alive even after all processes in it have exited:
$ mkdir -p /run/netns && touch /run/netns/mynet   # the bind-mount target must already exist
$ mount --bind /proc/12345/ns/net /run/netns/mynet
Namespace Lifecycle
Namespace Creation
        │
        ▼
┌─────────────────────────────────────────────────────────┐
│ Namespace Object (kernel struct)                        │
│ refcount = member processes + open fds + bind mounts    │
└─────────────────────────────────────────────────────────┘
        │
        │ All processes exit AND all fds closed AND all bind mounts removed
        ▼
Namespace Destroyed (memory freed, resources released)
The reference counting is important for tools like ip netns: it uses bind mounts under /run/netns/ to keep network namespaces alive even when no process is running in them, allowing ip netns exec to enter them later.
Isolation Diagram
Host (root namespaces)
┌───────────────────────────────────────────────────────────┐
│ PID ns: [1,2,3,...,4721,4722,...]                         │
│ Net ns: eth0(192.168.1.10), lo                            │
│ Mnt ns: / → ext4, /home → ext4, /var → xfs                │
│ UTS ns: hostname=prod-node-42                             │
│                                                           │
│ Container A (new namespaces)  Container B                 │
│ ┌──────────────────────────┐  ┌─────────────────────────┐ │
│ │ PID ns: [1,2,3]          │  │ PID ns: [1,2]           │ │
│ │ (host sees: 4721,4722,   │  │ (host: 5010,5011)       │ │
│ │             4723)        │  │                         │ │
│ │ Net ns: eth0(10.0.0.2)   │  │ Net ns: eth0(10.0.0.3)  │ │
│ │ Mnt ns: overlay rootfs   │  │ Mnt ns: overlay rootfs  │ │
│ │ UTS ns: webapp-pod-1     │  │ UTS ns: db-pod-1        │ │
│ │ User ns: uid 0→100000    │  │ User ns: uid 0→100001   │ │
│ └──────────────────────────┘  └─────────────────────────┘ │
└───────────────────────────────────────────────────────────┘
Tools: nsenter and lsns
nsenter
Enters one or more namespaces of a running process:
# Enter the mount, UTS, IPC, network, and PID namespaces of PID 4721
nsenter --target 4721 --mount --uts --ipc --net --pid -- bash
# Enter only the network namespace
nsenter --target 4721 --net -- ip addr
# Enter using explicit namespace file (namespace kept via bind mount)
nsenter --net=/run/netns/mynet -- ip addr
nsenter works by opening /proc/<target>/ns/<type> and calling setns() for each requested namespace type before execve()-ing the command.
lsns
Lists all namespaces visible to the current process, grouped by type:
$ lsns
NS TYPE NPROCS PID USER COMMAND
4026531835 cgroup 215 1 root /sbin/init
4026531836 time 215 1 root /sbin/init
4026531837 user 215 1 root /sbin/init
4026531838 uts 213 1 root /sbin/init
4026531839 ipc 213 1 root /sbin/init
4026531840 pid 213 1 root /sbin/init
4026531992 net 213 1 root /sbin/init
4026532456 mnt 3 4721 user nginx
4026532457 pid 3 4721 user nginx
4026532459 net 3 4721 user nginx
The NPROCS column shows how many processes share that namespace. The inode number in NS is the same value seen in /proc/PID/ns/ symlinks.
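The NPROCS figure can be approximated by walking /proc and comparing each process's namespace link against a target string. This is a minimal sketch of the idea, not the implementation lsns uses, and the name count_procs_in_ns is hypothetical.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Count processes whose /proc/<pid>/ns/<type> link equals target,
 * for example type "net" and target "net:[4026532459]".
 * Processes that exit mid-scan or that we may not inspect are skipped.
 * Returns the count, or -1 if /proc cannot be opened. */
int count_procs_in_ns(const char *type, const char *target)
{
    DIR *proc = opendir("/proc");
    if (!proc)
        return -1;

    int count = 0;
    struct dirent *de;
    while ((de = readdir(proc)) != NULL) {
        if (!isdigit((unsigned char)de->d_name[0]))
            continue;                   /* not a PID directory */

        char path[512], link[64];
        snprintf(path, sizeof path, "/proc/%s/ns/%s", de->d_name, type);

        ssize_t n = readlink(path, link, sizeof link - 1);
        if (n == -1)
            continue;                   /* exited or permission denied */
        link[n] = '\0';

        if (strcmp(link, target) == 0)
            count++;
    }
    closedir(proc);
    return count;
}
```

Without root, the count covers only processes the caller is allowed to inspect, so it can be lower than what lsns reports when run as root.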
Production Examples
Docker container start flow:
1. dockerd calls containerd via gRPC
2. containerd-shim calls runc
3. runc calls clone() with CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC
4. Inside the cloned child, runc sets up mounts, pivots root, drops capabilities, then execve()s the container entrypoint
Debugging a container network issue:
# Find container PID
docker inspect --format '{{.State.Pid}}' mycontainer
# Run tcpdump in the container's network namespace from the host
nsenter --target <PID> --net -- tcpdump -i eth0
# Check routing table inside container without exec
nsenter --target <PID> --net -- ip route
Debugging Notes
- "Operation not permitted" on unshare: Unprivileged user namespace creation requires kernel.unprivileged_userns_clone=1 (a Debian/Ubuntu sysctl). Some hardened distributions disable this.
- PID namespace and /proc: After entering a PID namespace, /proc still shows the host's process tree unless you also mount a new /proc. Always run mount -t proc proc /proc (inside a new mount namespace, so the host's /proc is not affected) after creating a PID namespace.
- Stale namespace references: If a container crashes but its namespace persists (via bind mount or leaked fd), lsns will show the namespace with 0 processes. Find with ls -la /proc/*/ns/net | grep <inode>.
- /proc/PID/ns/pid vs pid_for_children: A process's own PID namespace is ns/pid; the namespace its children will be placed in is ns/pid_for_children. For the calling process these differ after an unshare(CLONE_NEWPID) call, and the pid_for_children link only gains a value once the first child has been created in the new namespace.
Security Implications
- Namespaces are an isolation mechanism, not a security mechanism by themselves. A process that is root inside a PID namespace can still affect host resources if it has escaped the namespace (e.g., through a kernel vulnerability).
- User namespaces significantly change the security posture: a root process inside a user namespace maps to an unprivileged UID on the host, so even if it escapes the namespace boundary, it has no host privileges.
- Unprivileged user namespace creation (CLONE_NEWUSER without root) has historically been a source of kernel exploits because it grants new capabilities within the namespace. Some distributions restrict this, for example via the Debian/Ubuntu kernel.unprivileged_userns_clone sysctl.
- /proc/PID/ns/ file descriptors must be protected: if an attacker can open a namespace fd and call setns(), they can move into that namespace. Container runtimes must ensure these fds are not leaked into containers.
Performance Implications
- Namespace creation itself is cheap (microsecond range), but mount namespace creation involves copying the mount tree which scales with the number of mounts. Systems with many bind mounts (e.g., Kubernetes nodes with many volumes) can experience slow container startup due to mount namespace copy.
- Network namespace setup (creating veth pairs, configuring IP addresses) involves multiple netlink calls and is typically in the millisecond range.
- setns() for a network namespace involves flushing socket state and is not free: profiling docker exec latency often shows network namespace switching as a contributor.
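To gauge how much work a mount namespace copy involves on a given host, one rough indicator is the number of entries in /proc/self/mountinfo. The sketch below counts them; the helper name count_mounts is hypothetical, and the count is only a proxy for copy cost, not a measurement of it.

```c
#include <stdio.h>

/* Count entries in a mountinfo file (one mount per line).
 * For /proc/self/mountinfo this is the number of mounts that a new mount
 * namespace created by the calling process would start out with.
 * Returns the count, or -1 if the file cannot be opened. */
long count_mounts(const char *mountinfo_path)
{
    FILE *f = fopen(mountinfo_path, "r");
    if (!f)
        return -1;

    long lines = 0;
    int c, last = '\n';
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n')
            lines++;
        last = c;
    }
    if (last != '\n')
        lines++;                /* final line without a trailing newline */

    fclose(f);
    return lines;
}
```

Comparing count_mounts("/proc/self/mountinfo") on a lightly loaded node and on one with many volumes gives a quick sense of why container startup times can diverge.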
Failure Modes
| Failure | Symptom | Cause |
|---|---|---|
| PID namespace orphan | Container process zombies, PID 1 not reaping | Entrypoint is not an init process |
| Mount namespace leak | Container sees stale mounts | pivot_root or chroot not called correctly |
| Network namespace fd leak | Namespace persists after container exit | Runtime bug — fd not closed |
| UTS not isolated | Container hostname shows host hostname | Namespace not created by runtime |
| User ns mapping missing | Permission errors inside container | /proc/PID/uid_map not written before process starts |
Modern Usage
Namespaces are used beyond containers:
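- systemd: Uses mount, UTS, and network namespaces for service isolation (PrivateTmp=, PrivateNetwork=, ProtectSystem=, ProtectHostname=)
- Chrome/Firefox: Sandboxed renderer processes use namespaces for isolation
- Flatpak/Snap: Desktop application sandboxing. Flatpak uses bubblewrap (bwrap), which can unshare user, mount, PID, IPC, network, UTS, and cgroup namespaces; Snap sets up mount namespaces through its own snap-confine helper
- CRIU: Checkpoint/restore uses the Time namespace to preserve monotonic and boottime clock values across a restore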
Future Directions
- Namespace-aware /proc: Ongoing kernel work to make more /proc entries namespace-aware, reducing information leakage between namespaces
- Device namespace: Proposed namespace type to isolate device number spaces, not yet merged as of kernel 6.x
- Nested namespaces deeper integration: Work to allow more complete nesting of namespaces (e.g., full network namespace nesting) for more complex sandbox scenarios
- eBPF and namespaces: eBPF programs increasingly need namespace-aware attachment points for per-container observability without host-level privilege
Exercises
- Create a new UTS namespace using unshare and change the hostname inside it. Verify the host hostname is unchanged.
- Write a small C program that uses clone() to create a child in a new PID namespace. Inside the child, print the PID. Confirm it prints 1.
- Use lsns to list all namespaces on your machine. Identify which processes share a network namespace.
- Use nsenter to inspect the routing table and open sockets of a running Docker container without using docker exec.
- Bind-mount a network namespace from a running container. Kill the container. Verify the namespace still exists via lsns. Enter it and verify network state. Then unmount and confirm it disappears.
- Create a new user namespace as an unprivileged user. Inside it, verify you appear as UID 0. Try to write to a root-owned file. What happens, and why?
References
- clone(2), unshare(2), setns(2): Linux man pages
- namespaces(7): Linux man page (comprehensive reference)
- pid_namespaces(7), network_namespaces(7), user_namespaces(7): type-specific man pages
- Michael Kerrisk, "Namespaces in operation" series: lwn.net/Articles/531114/
- Eric Biederman's namespace design documents: lkml archives
- Linux kernel source: include/linux/nsproxy.h, kernel/nsproxy.c
- util-linux source for nsenter(1) and lsns(8): github.com/util-linux/util-linux