
Linux Namespaces

Technical Overview

Linux namespaces are a kernel mechanism that partitions global system resources so that each process (or group of processes) sees its own isolated view of those resources. From the perspective of a process inside a namespace, it appears to be the sole owner of that resource — it sees no other processes, no other network interfaces, no other filesystem mounts than what is permitted within its namespace context.

Namespaces are the foundational isolation primitive beneath containers. When people say "containers are just processes," what they mean precisely is: containers are processes whose global resource views have been partitioned using namespaces. There is no container daemon magic, no hypervisor — just a Linux process running with a different set of namespace memberships than the parent shell.

The kernel tracks namespaces as reference-counted objects. A namespace persists as long as at least one process belongs to it, or as long as an open file descriptor or a bind mount holds a reference to its namespace file in /proc/PID/ns/.


Prerequisites

  • Linux kernel fundamentals (processes, VFS, system calls)
  • Understanding of fork(), clone(), execve() semantics
  • Familiarity with /proc filesystem layout
  • Basic understanding of process credentials (UID/GID)

Historical Context

The namespace concept in Linux evolved incrementally over nearly two decades. The first namespace type — Mount — was introduced in kernel 2.4.19 in 2002 as part of work by Al Viro and others to support private mount trees per process. The vision of generalizing this isolation mechanism to other resource types came later.

Eric Biederman, in particular, drove the broader namespace effort at IBM and later Red Hat. The User namespace — arguably the most powerful and most carefully reviewed — took until kernel 3.8 in 2013 to reach a usable state, after years of partial implementations and security reviews.

The most recently added namespace type is the Time namespace (Linux 5.6, 2020), which gives processes isolated views of the system's monotonic and boottime clocks. It is primarily useful for checkpoint/restore workflows (CRIU), where restored processes must behave as if they never stopped.


All 8 Namespace Types

Namespace  Flag             Kernel Version  Year  Isolates
Mount      CLONE_NEWNS      2.4.19          2002  Filesystem mount points (each ns has its own mount tree)
UTS        CLONE_NEWUTS     2.6.19          2006  Hostname and NIS domain name
IPC        CLONE_NEWIPC     2.6.19          2006  System V IPC, POSIX message queues
PID        CLONE_NEWPID     2.6.24          2008  Process ID number space (PIDs restart from 1 inside)
Network    CLONE_NEWNET     2.6.24          2008  Network devices, IPs, routing tables, iptables, sockets
User       CLONE_NEWUSER    3.8             2013  UID/GID mappings (container root → unprivileged host UID)
Cgroup     CLONE_NEWCGROUP  4.6             2016  Cgroup root view (process sees its cgroup as root)
Time       CLONE_NEWTIME    5.6             2020  Monotonic and boottime clock offsets

Mount Namespace (CLONE_NEWNS)

Each process has a mount namespace that defines the set of filesystem mounts it can see. When a process creates a new mount namespace, it inherits a copy of the parent's mount tree. From that point on, mount/unmount operations in the child namespace do not affect the parent, provided the mounts are not marked shared: with shared propagation (which systemd sets on / by default), mount events still cross the namespace boundary until the mounts are made private or slave. This is how containers get their own root filesystem: the container runtime performs a pivot_root() or chroot() within the new mount namespace.

The flag name CLONE_NEWNS (not CLONE_NEWMNT) is a historical artifact: mount was the only namespace type when the flag was introduced, so it was simply called "new namespace."

UTS Namespace (CLONE_NEWUTS)

The UTS namespace isolates the hostname and NIS domain name returned by uname(). Inside a container, hostname can return "webapp-pod-1" while the host's hostname remains "prod-node-42". This prevents hostname-based service discovery logic inside containers from leaking host identity.

IPC Namespace (CLONE_NEWIPC)

Isolates System V IPC objects (semaphore arrays, shared memory segments, message queues) and POSIX message queues. Without IPC namespace isolation, a process in one container could attach to a shared memory segment created by another container — a serious security violation.

PID Namespace (CLONE_NEWPID)

PIDs are renumbered inside a new PID namespace. The first process created inside a new PID namespace gets PID 1, becoming the "init" of that namespace. If PID 1 inside the namespace exits, the kernel sends SIGKILL to all other processes in that namespace, tearing it down.

A process inside a PID namespace cannot see or signal processes outside its namespace. However, from the host, you can see all processes in all PID namespaces — a container's PID 1 might appear as PID 4721 on the host.

PID namespaces are nested: a process has one PID in its own namespace and a different PID in each ancestor namespace, up to the root namespace. /proc/PID/status contains an NSpid: line that lists the PID at each namespace level, outermost first.

Network Namespace (CLONE_NEWNET)

Each network namespace has its own set of network devices, IPv4/IPv6 stacks, routing tables, firewall rules, and socket table. When created, a new network namespace has only a loopback interface in DOWN state. Container runtimes then wire a virtual ethernet pair (veth) from the host namespace into the container namespace to provide connectivity.

User Namespace (CLONE_NEWUSER)

The most powerful namespace type. User namespaces allow mapping of UIDs/GIDs: UID 0 inside the container can map to UID 100000 on the host. This means a process that appears to be root inside a container is actually an unprivileged user on the host. User namespaces can be created without any privileges — they are the enabler for rootless containers.

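UID and GID mappings are defined in /proc/PID/uid_map and /proc/PID/gid_map. Each line has three fields: the first ID inside the namespace, the first ID outside it (in the parent user namespace), and the length of the mapped range. For example, the line 0 100000 65536 maps container UIDs 0-65535 to host UIDs 100000-165535.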

Cgroup Namespace (CLONE_NEWCGROUP)

Without cgroup namespaces, a process inside a container can read /proc/self/cgroup and see its full cgroup path on the host, leaking the host's cgroup hierarchy layout. With a cgroup namespace, the process's cgroup root is virtualized — it appears to be at / regardless of where it actually sits in the host hierarchy.

Time Namespace (CLONE_NEWTIME)

Allows setting per-namespace offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. This is primarily used by CRIU (Checkpoint/Restore In Userspace) to restore containers so that their view of elapsed time matches when they were checkpointed, not when they were restored.


Namespace Creation: clone(), unshare(), setns()

clone() with CLONE_NEW* flags

The primary way to create a new namespace is via clone(), which creates a new process that starts life in the new namespace:

// Create a child process in a new network and UTS namespace
pid_t pid = clone(child_fn, stack + STACK_SIZE,
                  CLONE_NEWNET | CLONE_NEWUTS | SIGCHLD,
                  NULL);

For every namespace type not listed in the flags, the child stays in the same namespace as its parent: it shares that namespace rather than receiving a copy.

unshare() — unshare without fork

unshare() lets a process disassociate itself from one or more namespaces and move into new ones, without creating a child process:

// Current process moves into a new mount namespace
unshare(CLONE_NEWNS);

This is how the unshare(1) tool works. For example:

$ unshare --mount --pid --fork bash
# now running in new mount and PID namespaces

Note: for PID namespaces, unshare() takes effect for the next fork() — the current process's PID namespace does not change, but its children will start in the new PID namespace.

setns() — join an existing namespace

setns() allows a process to join a namespace that already exists (identified by a file descriptor pointing to /proc/PID/ns/<type>):

int fd = open("/proc/12345/ns/net", O_RDONLY);
setns(fd, CLONE_NEWNET);  // join PID 12345's network namespace

This is the mechanism behind nsenter and docker exec: to execute a command inside a running container, the runtime opens the container process's namespace file descriptors and calls setns() for each namespace type.


The /proc/PID/ns/ Directory

/proc/12345/ns/
├── cgroup  -> cgroup:[4026531835]
├── ipc     -> ipc:[4026531839]
├── mnt     -> mnt:[4026532456]
├── net     -> net:[4026532459]
├── pid     -> pid:[4026532457]
├── pid_for_children -> pid:[4026532457]
├── time    -> time:[4026531834]
├── time_for_children -> time:[4026531834]
├── user    -> user:[4026531837]
└── uts     -> uts:[4026531838]

Each entry is a symbolic link pointing to a pseudo-file of the form <type>:[inode]. The inode number uniquely identifies the namespace instance. Two processes sharing the same namespace will have symlinks pointing to the same inode number — this is how you determine if two processes are in the same namespace without any special tools.

You can bind-mount these pseudo-files to keep a namespace alive even after all processes in it have exited. The mount target must already exist as a regular file:

$ touch /run/netns/mynet
$ mount --bind /proc/12345/ns/net /run/netns/mynet

Namespace Lifecycle

Namespace Creation
      │
      ▼
┌─────────────────────────────────────────────────────────┐
│  Namespace Object (kernel struct)                        │
│  refcount = (number of processes) + (bind mounts)        │
└─────────────────────────────────────────────────────────┘
      │
      │  All processes exit AND all bind mounts removed
      ▼
Namespace Destroyed (memory freed, resources released)

The reference counting is important for tools like ip netns: it uses bind mounts under /run/netns/ to keep network namespaces alive even when no process is running in them, allowing ip netns exec to enter them later.


Isolation Diagram

   Host (root namespaces)
   ┌────────────────────────────────────────────────────────────┐
   │  PID ns: [1,2,3,...,4721,4722,...]                         │
   │  Net ns: eth0(192.168.1.10), lo                            │
   │  Mnt ns: / → ext4, /home → ext4, /var → xfs                │
   │  UTS ns: hostname=prod-node-42                             │
   │                                                            │
   │  Container A (new namespaces)  Container B                 │
   │  ┌──────────────────────────┐  ┌────────────────────────┐  │
   │  │ PID ns: [1,2,3]          │  │ PID ns: [1,2]          │  │
   │  │  (host sees: 4721,4722,  │  │  (host: 5010,5011)     │  │
   │  │   4723)                  │  │                        │  │
   │  │ Net ns: eth0(10.0.0.2)   │  │ Net ns: eth0(10.0.0.3) │  │
   │  │ Mnt ns: overlay rootfs   │  │ Mnt ns: overlay rootfs │  │
   │  │ UTS ns: webapp-pod-1     │  │ UTS ns: db-pod-1       │  │
   │  │ User ns: uid 0→100000    │  │ User ns: uid 0→100001  │  │
   │  └──────────────────────────┘  └────────────────────────┘  │
   └────────────────────────────────────────────────────────────┘

Tools: nsenter and lsns

nsenter

Enters one or more namespaces of a running process:

# Enter the mount, UTS, IPC, network, and PID namespaces of PID 4721
nsenter --target 4721 --mount --uts --ipc --net --pid -- bash

# Enter only the network namespace
nsenter --target 4721 --net -- ip addr

# Enter using explicit namespace file (namespace kept via bind mount)
nsenter --net=/run/netns/mynet -- ip addr

nsenter works by opening /proc/<target>/ns/<type> and calling setns() for each requested namespace type before execve()-ing the command.

lsns

Lists the namespaces that the caller can inspect through /proc, one row per namespace (run it as root to see every namespace on the host):

$ lsns
        NS TYPE   NPROCS   PID USER             COMMAND
4026531835 cgroup    215     1 root             /sbin/init
4026531836 time      215     1 root             /sbin/init
4026531837 user      215     1 root             /sbin/init
4026531838 uts       213     1 root             /sbin/init
4026531839 ipc       213     1 root             /sbin/init
4026531840 pid       213     1 root             /sbin/init
4026531992 net       213     1 root             /sbin/init
4026532456 mnt         3  4721 user             nginx
4026532457 pid         3  4721 user             nginx
4026532459 net         3  4721 user             nginx

The NPROCS column shows how many processes share that namespace. The inode number in NS is the same value seen in /proc/PID/ns/ symlinks.


Production Examples

Docker container start flow:

  1. dockerd calls containerd via gRPC
  2. containerd-shim calls runc
  3. runc calls clone() with CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWIPC
  4. Inside the cloned child, runc sets up mounts, pivots root, drops capabilities, then execve()s the container entrypoint

Debugging a container network issue:

# Find container PID
docker inspect --format '{{.State.Pid}}' mycontainer

# Run tcpdump in the container's network namespace from the host
nsenter --target <PID> --net -- tcpdump -i eth0

# Check routing table inside container without exec
nsenter --target <PID> --net -- ip route

Debugging Notes

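  • "Operation not permitted" on unshare: Creating any namespace other than a user namespace requires CAP_SYS_ADMIN. Unprivileged user namespace creation also depends on distribution policy, for example kernel.unprivileged_userns_clone=1 (a Debian/Ubuntu sysctl). Some hardened distributions disable it.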
  • PID namespace and /proc: After entering a PID namespace, /proc still shows the host's process tree unless you also mount a new /proc. Always mount -t proc proc /proc after creating a PID namespace.
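  • Stale namespace references: If a container crashes but its namespace persists (via bind mount or leaked fd), no process is a member any more, so scanning /proc/*/ns/ will not find it and lsns may not list it. Look for bind mounts with findmnt -t nsfs, and for leaked descriptors with ls -l /proc/*/fd 2>/dev/null | grep 'net:\[<inode>\]'.
  • /proc/PID/ns/pid vs pid_for_children: A process's own PID namespace is ns/pid; the namespace its children will be placed in is ns/pid_for_children. These differ after an unshare(CLONE_NEWPID) call: the caller never changes its own PID namespace, and only children forked afterwards land in the new one.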

Security Implications

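  • Namespaces are an isolation mechanism, not a security mechanism by themselves. They restrict what a process can see, but every container still shares the host kernel: a process that is root inside a container without a user namespace is also root to the kernel (limited only by the capabilities the runtime left it), and a kernel vulnerability can let it reach host resources.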
  • User namespaces significantly change the security posture: a root process inside a user namespace maps to an unprivileged UID on the host, so even if it escapes the namespace boundary, it has no host privileges.
  • Unprivileged user namespace creation (CLONE_NEWUSER without root) has historically been a source of kernel exploits because it grants new capabilities within the namespace. Many distributions restrict this via kernel.unprivileged_userns_clone.
  • /proc/PID/ns/ file descriptors must be protected: a process that holds a namespace fd and has CAP_SYS_ADMIN in the user namespace that owns the target namespace can call setns() and move into it. Container runtimes must ensure these fds are not leaked into containers.

Performance Implications

  • Namespace creation itself is cheap (microsecond range), but mount namespace creation involves copying the mount tree which scales with the number of mounts. Systems with many bind mounts (e.g., Kubernetes nodes with many volumes) can experience slow container startup due to mount namespace copy.
  • Network namespace setup (creating veth pairs, configuring IP addresses) involves multiple netlink calls and is typically in the millisecond range.
  • setns() is not free: entering a container typically takes one call per namespace type, and profiling docker exec latency often shows network namespace switching as a contributor.

Failure Modes

Failure                    Symptom                                       Cause
PID namespace orphan       Container process zombies, PID 1 not reaping  Entrypoint is not an init process
Mount namespace leak       Container sees stale mounts                   pivot_root or chroot not called correctly
Network namespace fd leak  Namespace persists after container exit       Runtime bug: fd not closed
UTS not isolated           Container hostname shows host hostname        Namespace not created by runtime
User ns mapping missing    Permission errors inside container            /proc/PID/uid_map not written before process starts

Modern Usage

Namespaces are used beyond containers:

  • systemd: Uses mount, UTS, and network namespaces for service isolation (PrivateTmp=, ProtectSystem=, ProtectHostname=, PrivateNetwork=)
  • Chrome/Firefox: Sandboxed renderer processes use namespaces for isolation
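  • Flatpak/Snap: Flatpak sandboxes desktop applications via bubblewrap (bwrap), which unshares user, mount, PID, and other namespaces as requested; Snap uses its own snap-confine helper, built mainly on mount namespaces
  • CRIU: Checkpoint/restore uses the Time namespace to keep monotonic and boottime clock values consistent across a restore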

Future Directions

  • Namespace-aware /proc: Ongoing kernel work to make more /proc entries namespace-aware, reducing information leakage between namespaces
  • Device namespace: Proposed namespace type to isolate device number spaces — not yet merged as of kernel 6.x
  • Nested namespaces deeper integration: Work to allow more complete nesting of namespaces (e.g., full network namespace nesting) for more complex sandbox scenarios
  • eBPF and namespaces: eBPF programs increasingly need namespace-aware attachment points for per-container observability without host-level privilege

Exercises

  1. Create a new UTS namespace using unshare and change the hostname inside it. Verify the host hostname is unchanged.
  2. Write a small C program that uses clone() to create a child in a new PID namespace. Inside the child, print the PID. Confirm it prints 1.
  3. Use lsns to list all namespaces on your machine. Identify which processes share a network namespace.
  4. Use nsenter to inspect the routing table and open sockets of a running Docker container without using docker exec.
  5. Bind-mount a network namespace from a running container. Kill the container. Verify the namespace still exists via findmnt -t nsfs (or lsns, if your version lists namespaces that have no processes). Enter it and verify network state. Then unmount and confirm it disappears.
  6. Create a new user namespace as an unprivileged user. Inside it, verify you appear as UID 0. Try to write to a root-owned file — what happens and why?

References

  • clone(2), unshare(2), setns(2) — Linux man pages
  • namespaces(7) — Linux man page (comprehensive reference)
  • pid_namespaces(7), network_namespaces(7), user_namespaces(7) — type-specific man pages
  • Michael Kerrisk, "Namespaces in operation" series — lwn.net/Articles/531114/
  • Eric Biederman's namespace design documents — lkml archives
  • Linux kernel source: include/linux/nsproxy.h, kernel/nsproxy.c
  • util-linux source for nsenter(1) and lsns(8) — github.com/util-linux/util-linux