05 — IPC Mechanisms
Technical Overview
Inter-Process Communication (IPC) is the set of mechanisms by which separate processes exchange data or synchronize actions. No single IPC mechanism suits all situations: the right choice depends on latency requirements, bandwidth needs, whether processes share a machine or are distributed, whether the communication is structured or stream-based, and whether data ordering or message boundaries matter. Linux provides at least eight distinct IPC families, each with its own trade-offs and legacy baggage.
This file covers the full taxonomy, with emphasis on the mechanisms used in modern production systems.
Prerequisites
- 01-process-concept.md: address space, file descriptors, mm_struct
- 02-fork-and-exec.md: dup_fd, inherited file descriptors
- Basic understanding of kernel buffers and reference counting
Core Content
IPC Comparison Table
┌─────────────────────┬───────────┬──────────────┬────────────────────────────────────┐
│ Mechanism │ Latency │ Bandwidth │ Best use case │
├─────────────────────┼───────────┼──────────────┼────────────────────────────────────┤
│ Anonymous pipe │ ~1 µs │ ~4 GB/s │ Parent↔child, shell pipelines │
│ Named pipe (FIFO) │ ~1 µs │ ~4 GB/s │ Unrelated processes, same host │
│ POSIX MQ │ 2–5 µs │ ~1 GB/s │ Priority messages, real-time │
│ System V msg queue │ 5–10 µs │ ~500 MB/s │ Legacy; avoid in new code │
│ System V shared mem │ <100 ns │ memory speed │ Legacy; avoid; use POSIX shm │
│ POSIX shared memory │ <100 ns │ memory speed │ Fastest bulk data transfer │
│ AF_UNIX stream │ ~5 µs │ ~3 GB/s │ Structured protocol, fd passing │
│ AF_UNIX dgram │ ~3 µs │ ~2 GB/s │ Message-oriented, connectionless │
│ memfd + mmap │ <100 ns │ memory speed │ Anonymous file-backed shared mem │
│ eventfd │ ~500 ns │ N/A (events) │ Lightweight event notification │
│ signalfd │ ~1 µs │ N/A (signals)│ Signal handling in event loop │
│ timerfd │ ~500 ns │ N/A (timers) │ Timer events in epoll loop │
└─────────────────────┴───────────┴──────────────┴────────────────────────────────────┘
Latency figures are rough order-of-magnitude for same-machine localhost, kernel 5.x, modern x86-64. Actual figures depend heavily on CPU cache state and copy sizes.
Anonymous Pipes
The simplest IPC: a unidirectional byte stream between two related processes.
int pipefd[2];
pipe2(pipefd, O_CLOEXEC); // pipefd[0] = read end, pipefd[1] = write end
pid_t pid = fork();
if (pid == 0) {
// child: writer
close(pipefd[0]);
write(pipefd[1], "hello", 5);
close(pipefd[1]);
_exit(0);
}
// parent: reader
close(pipefd[1]);
char buf[64];
ssize_t n = read(pipefd[0], buf, sizeof(buf));
Internals:
- Pipe is backed by a pipe buffer in kernel memory (not a file on disk).
- Default capacity: 65,536 bytes (PIPE_BUF = 4096 for atomic writes; total
pipe buffer = 16 pages = 65,536 bytes by default on Linux).
- Adjustable per-pipe via fcntl(fd, F_SETPIPE_SZ, size) up to
/proc/sys/fs/pipe-max-size (default 1 MB).
- Atomic writes: writes of ≤ PIPE_BUF (4096) bytes are atomic — they will not
be interleaved with writes from other processes to the same pipe.
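- Blocking semantics: write blocks when the pipe is full; read blocks when the pipe
is empty. Once all write ends are closed and the buffer is drained, read returns 0
(EOF). Once all read ends are closed, write raises SIGPIPE, or fails with EPIPE if
the signal is ignored or blocked.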
Writer process Reader process
┌──────────┐ ┌──────────┐
│ write() │──►┌─────────────┐──►│ read() │
└──────────┘ │ kernel pipe │ └──────────┘
│ buffer │
│ [65536 B] │
└─────────────┘
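The end-of-stream rules for pipes can be checked with a small sketch. The helper names pipe_read_at_eof and pipe_write_errno_without_reader are hypothetical, and SIGPIPE is ignored here only so that the failed write shows up as an errno value instead of terminating the process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Returns what read() reports after the buffered data is drained and every
 * write end is closed (0 means end-of-file). Hypothetical helper. */
ssize_t pipe_read_at_eof(void)
{
    int fds[2];
    char buf[16];
    if (pipe2(fds, O_CLOEXEC) == -1)
        return -1;
    ssize_t w = write(fds[1], "hi", 2);
    assert(w == 2);                     /* 2 bytes always fit in an empty pipe */
    close(fds[1]);                      /* last write end goes away */
    ssize_t first = read(fds[0], buf, sizeof buf);   /* buffered data: 2 */
    ssize_t second = read(fds[0], buf, sizeof buf);  /* nothing left: 0 */
    close(fds[0]);
    return first == 2 ? second : -1;
}

/* Returns the errno from write() once every read end is closed. */
int pipe_write_errno_without_reader(void)
{
    int fds[2];
    if (pipe2(fds, O_CLOEXEC) == -1)
        return -1;
    signal(SIGPIPE, SIG_IGN);           /* otherwise the write would kill us */
    close(fds[0]);                      /* last read end goes away */
    ssize_t w = write(fds[1], "x", 1);
    int err = (w == -1) ? errno : 0;
    close(fds[1]);
    return err;
}
```

The two helpers are deliberately single-process: a pipe does not need fork() to show these rules, only the open/closed state of its two ends.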
Named Pipes (FIFOs)
A FIFO is a pipe with a name in the filesystem. Any process with appropriate permissions can open it.
mkfifo /tmp/my_pipe
# Writer:
echo "data" > /tmp/my_pipe &
# Reader:
cat /tmp/my_pipe
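Key behavior: a blocking open() on a FIFO waits until the other end is opened too
(a reader waits for a writer, and a writer waits for a reader). With O_NONBLOCK, a
read-only open returns immediately, while a write-only open fails with ENXIO if no
reader exists. The FIFO name is in the filesystem but the data never touches disk;
it lives in the same kind of kernel pipe buffer as anonymous pipes.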
Use cases: log aggregation pipelines, inter-service communication in legacy Unix scripts,
systemd socket activation for FIFO-based services.
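With O_NONBLOCK, opening a FIFO has two asymmetric outcomes on Linux: a read-only open returns at once, while a write-only open with no reader fails with ENXIO. The sketch below demonstrates both. The /tmp path and the helper names are arbitrary choices for illustration, not part of any API.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Builds a per-process FIFO path (arbitrary, hypothetical location). */
static void fifo_path(char *path, size_t len)
{
    assert(len >= 32);
    snprintf(path, len, "/tmp/fifo_demo_%ld", (long)getpid());
}

/* Opens a fresh FIFO write-only with O_NONBLOCK while no reader exists.
 * Returns the resulting errno, 0 if the open succeeded, -1 on setup error. */
int fifo_writer_without_reader(void)
{
    char path[64];
    fifo_path(path, sizeof path);
    unlink(path);
    if (mkfifo(path, 0600) == -1)
        return -1;
    int wfd = open(path, O_WRONLY | O_NONBLOCK);
    int err = (wfd == -1) ? errno : 0;
    if (wfd != -1)
        close(wfd);
    unlink(path);
    return err;
}

/* Opens the read end first (returns immediately with O_NONBLOCK), then the
 * write end, and pushes one message through. Returns the bytes read. */
ssize_t fifo_roundtrip(void)
{
    char path[64], buf[16];
    fifo_path(path, sizeof path);
    unlink(path);
    if (mkfifo(path, 0600) == -1)
        return -1;
    int rfd = open(path, O_RDONLY | O_NONBLOCK);  /* does not wait for a writer */
    int wfd = open(path, O_WRONLY);               /* a reader exists: no blocking */
    ssize_t n = -1;
    if (rfd != -1 && wfd != -1 && write(wfd, "data", 4) == 4)
        n = read(rfd, buf, sizeof buf);
    if (rfd != -1) close(rfd);
    if (wfd != -1) close(wfd);
    unlink(path);
    return n;
}
```

Opening the read end first is the usual way for a single process to hold both ends of a FIFO without deadlocking on open().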
POSIX Message Queues
POSIX MQs (mqueue) provide a message-oriented (not stream-oriented) channel with
priority ordering: each message carries an unsigned priority, and a higher number
means higher priority, both in POSIX and on Linux. mq_receive always returns the
oldest message of the highest priority present, so higher-priority messages are
dequeued first regardless of arrival order, while equal priorities stay in FIFO order.
#include <mqueue.h>
// Sender:
mqd_t mq = mq_open("/my_queue", O_CREAT | O_WRONLY, 0644, NULL);
mq_send(mq, "msg", 3, 10 /* priority */);
mq_close(mq);
// Receiver:
mqd_t mq = mq_open("/my_queue", O_RDONLY);
char buf[8192]; // must be >= the queue's mq_msgsize (Linux default: 8192), else EMSGSIZE
unsigned int prio;
mq_receive(mq, buf, sizeof(buf), &prio);
mq_close(mq);
mq_unlink("/my_queue"); // remove when done
Properties:
- Queues appear in /dev/mqueue once the mqueue filesystem is mounted there (visible with ls /dev/mqueue).
- Maximum messages and message size configurable via struct mq_attr or
/proc/sys/fs/mqueue/.
- mq_notify(3): asynchronous notification via signal or thread creation when a message
arrives on an empty queue.
- Used in real-time systems, POSIX-compliant IPC code, and cases where message ordering
by priority is needed without a full message broker.
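Priority ordering is easy to confirm with one queue and two messages: on Linux the numerically larger priority is delivered first, whatever the send order. This is a minimal sketch; the queue name, the attribute values, and the helper name mq_first_received are assumptions made for the example, and older glibc versions may need -lrt at link time.

```c
#include <assert.h>
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sends "low" (priority 1) and then "high" (priority 9), then receives once.
 * Copies the first message received into out and returns its priority,
 * or -1 on error. */
int mq_first_received(char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    char name[64];
    snprintf(name, sizeof name, "/mq_demo_%ld", (long)getpid());
    struct mq_attr attr = { .mq_maxmsg = 4, .mq_msgsize = 64 };
    mq_unlink(name);                     /* ignore failure: stale queue cleanup */
    mqd_t mq = mq_open(name, O_CREAT | O_EXCL | O_RDWR, 0600, &attr);
    if (mq == (mqd_t)-1)
        return -1;

    int rc = -1;
    char buf[64];                        /* must be >= attr.mq_msgsize */
    unsigned int prio = 0;
    if (mq_send(mq, "low", 3, 1) == 0 && mq_send(mq, "high", 4, 9) == 0) {
        ssize_t n = mq_receive(mq, buf, sizeof buf, &prio);
        if (n >= 0 && (size_t)n < outlen) {
            memcpy(out, buf, (size_t)n);
            out[n] = '\0';
            rc = (int)prio;
        }
    }
    mq_close(mq);
    mq_unlink(name);
    return rc;
}
```

Passing an explicit struct mq_attr keeps the receive buffer small; with NULL attributes the buffer would have to match the system default message size instead.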
System V IPC (Legacy)
System V IPC (SysV) provides three mechanisms: message queues (msgget/msgsnd/msgrcv),
shared memory (shmget/shmat/shmdt), and semaphores (semget/semop).
Avoid in new code. Reasons:
- Identified by integer keys (generated with ftok(path, id)) rather than by file
descriptors or filesystem-style names; ftok keys are fragile and prone to collision.
- Not cleaned up automatically when a process exits (shared memory segments persist
until explicitly removed via shmctl(IPC_RMID) or reboot).
- Cannot be selected on with poll/epoll/select.
- Limited introspection compared to /proc.
Inspection:
ipcs -a # show all SysV IPC objects
ipcs -m # shared memory segments
ipcs -q # message queues
ipcs -s # semaphore sets
ipcrm -m <shmid> # remove a shared memory segment
SysV IPC is still encountered in: database shared memory (old PostgreSQL, Oracle), legacy financial trading systems, and any C code written before POSIX IPC was available.
POSIX Shared Memory
The fastest IPC mechanism: once established, both processes read/write the same physical pages with no kernel involvement in the data path.
// Creator:
int fd = shm_open("/my_shm", O_CREAT | O_RDWR, 0644);
ftruncate(fd, 4096);
void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
close(fd); // can close fd; mapping remains
// Consumer:
int fd = shm_open("/my_shm", O_RDWR, 0);
void *ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
Shared memory is backed by a tmpfs file in /dev/shm. Both processes' mmap calls
map the same physical pages. A write by one process is immediately visible to the other
— no copies, no kernel transitions.
Synchronization: shared memory alone is insufficient. Add a synchronization primitive
in the shared region: POSIX unnamed semaphore (sem_init(ptr, 1 /* pshared */, 1)),
futex-based mutex, or an atomic (__atomic_*) lock-free structure.
Process A Process B
┌──────────────────────┐ ┌──────────────────────┐
│ virtual address: │ │ virtual address: │
│ ptr → [0x7f...] │ │ ptr → [0x7e...] │
└──────────┬───────────┘ └──────────┬────────────┘
│ │
└─────────────┬───────────────┘
▼
┌─────────────────┐
│ Physical pages │
│ (tmpfs: /dev/shm│
│ /my_shm) │
└─────────────────┘
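The synchronization requirement can be illustrated with the simplest lock-free pattern, a release/acquire "ready" flag. This is a sketch under stated assumptions: the struct layout and the name shm_handoff are hypothetical, and a MAP_SHARED | MAP_ANONYMOUS mapping inherited across fork() stands in for the shm_open() + mmap() pair so the example has no external name to clean up.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Layout of the shared region (hypothetical): a ready flag plus a payload. */
struct shared_msg {
    atomic_int ready;        /* 0 = empty, 1 = payload published */
    char payload[56];
};

/* Forks a child that publishes a string through shared memory. The parent
 * waits on the flag, copies the payload into out, and returns 0 on success. */
int shm_handoff(char *out, size_t outlen)
{
    assert(out != NULL && outlen > 0);
    struct shared_msg *m = mmap(NULL, sizeof *m, PROT_READ | PROT_WRITE,
                                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
        return -1;
    atomic_init(&m->ready, 0);

    pid_t pid = fork();
    if (pid == -1) {
        munmap(m, sizeof *m);
        return -1;
    }
    if (pid == 0) {                       /* child: producer */
        strcpy(m->payload, "hello from child");
        /* Release store: the payload write above is published before the flag. */
        atomic_store_explicit(&m->ready, 1, memory_order_release);
        _exit(0);
    }
    /* Parent: consumer. The acquire load pairs with the child's release store. */
    while (atomic_load_explicit(&m->ready, memory_order_acquire) == 0)
        ;                                 /* busy-wait: acceptable for a demo only */
    strncpy(out, m->payload, outlen - 1);
    out[outlen - 1] = '\0';
    waitpid(pid, NULL, 0);
    munmap(m, sizeof *m);
    return 0;
}
```

Without the release/acquire pair the consumer could observe the flag before the payload; a process-shared semaphore or futex would give the same guarantee while letting the waiter sleep instead of spin.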
AF_UNIX Sockets
Unix domain sockets are the IPC workhorse of modern Linux systems. Unlike pipes, they are
bidirectional (SOCK_STREAM), support message boundaries (SOCK_DGRAM, SOCK_SEQPACKET),
and can transmit ancillary data — including file descriptors (SCM_RIGHTS) and
credentials (SCM_CREDENTIALS).
Types:
- SOCK_STREAM: byte stream, bidirectional, like TCP but without network stack overhead
- SOCK_DGRAM: message boundaries preserved, no connection. Unlike UDP, AF_UNIX
datagrams on Linux are reliable and are not reordered.
- SOCK_SEQPACKET: message boundaries + ordering + connection — the best of both worlds
// Server (abstract socket — no filesystem entry):
int sfd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
struct sockaddr_un addr = { .sun_family = AF_UNIX, .sun_path = "\0my_service" };
bind(sfd, (struct sockaddr*)&addr, sizeof(sa_family_t) + 1 + strlen("my_service"));
listen(sfd, 128);
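The address length is the subtle part of abstract sockets: the name is not NUL-terminated, so the length passed to bind() and connect() must cover exactly the leading NUL plus the name bytes. A minimal sketch follows, with hypothetical helper names and an arbitrary socket name; it binds a listener and connects to it from the same process.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Builds an abstract-namespace address and returns the length to pass to
 * bind()/connect(): header, leading NUL byte, then the name bytes. */
socklen_t abstract_addr(struct sockaddr_un *addr, const char *name)
{
    size_t n = strlen(name);
    assert(n > 0 && n < sizeof addr->sun_path);   /* NUL prefix + name must fit */
    memset(addr, 0, sizeof *addr);
    addr->sun_family = AF_UNIX;
    memcpy(addr->sun_path + 1, name, n);          /* sun_path[0] stays '\0' */
    return (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + n);
}

/* Binds a listener and connects to it. Returns 0 if accept() succeeds. */
int abstract_socket_roundtrip(void)
{
    char name[64];
    struct sockaddr_un addr;
    snprintf(name, sizeof name, "ipc_demo_%ld", (long)getpid());
    socklen_t len = abstract_addr(&addr, name);

    int srv = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    int cli = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    int rc = -1;
    if (srv != -1 && cli != -1 &&
        bind(srv, (struct sockaddr *)&addr, len) == 0 &&
        listen(srv, 1) == 0 &&
        connect(cli, (struct sockaddr *)&addr, len) == 0) {  /* queued in backlog */
        int conn = accept(srv, NULL, NULL);
        if (conn != -1) {
            rc = 0;
            close(conn);
        }
    }
    if (srv != -1) close(srv);
    if (cli != -1) close(cli);
    return rc;                  /* the abstract name vanishes with the last close */
}
```

Using offsetof instead of sizeof(sa_family_t) gives the same value on Linux and states the intent more directly.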
Passing file descriptors via SCM_RIGHTS:
// Send fd over Unix socket:
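struct msghdr msg = { ... }; // msg_control must point at >= CMSG_SPACE(sizeof(int)) bytes
struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
cmsg->cmsg_level = SOL_SOCKET;
cmsg->cmsg_type = SCM_RIGHTS;
cmsg->cmsg_len = CMSG_LEN(sizeof(int)); // header plus one int of payload
memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));
sendmsg(sock, &msg, 0); // send at least one byte of ordinary data alongside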
This is how:
- systemd passes pre-bound sockets to services (socket activation)
- Container runtimes pass network namespace fds to container processes
- Wayland compositors pass GPU buffer fds to clients
- D-Bus passes file descriptors between peers as message arguments (the UNIX_FD type)
Real-world users: systemd (private management socket at /run/systemd/private), D-Bus
(system bus at /run/dbus/system_bus_socket), Docker daemon
(/var/run/docker.sock), containerd (/run/containerd/containerd.sock), X11
(/tmp/.X11-unix/X0, plus the abstract socket @/tmp/.X11-unix/X0).
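A complete SCM_RIGHTS round trip needs a properly sized and aligned control buffer on both sides, plus at least one byte of ordinary data. The following is a minimal sketch, not a hardened implementation: send_fd, recv_fd, and fd_pass_roundtrip are hypothetical names, and the demo passes a pipe's read end across a socketpair inside one process to keep it self-contained.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Sends fd over a connected AF_UNIX socket. Returns 0 on success. */
int send_fd(int sock, int fd)
{
    char byte = 'F';                          /* at least 1 byte of real data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {                                   /* union forces cmsghdr alignment */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    memset(&ctrl, 0, sizeof ctrl);
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof ctrl.buf,
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receives one fd. Returns the new descriptor number or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = ctrl.buf, .msg_controllen = sizeof ctrl.buf,
    };
    if (recvmsg(sock, &msg, MSG_CMSG_CLOEXEC) != 1)
        return -1;
    if (msg.msg_flags & MSG_CTRUNC)           /* control data was truncated */
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Passes a pipe's read end across a socketpair and reads through the
 * received copy. Returns the number of bytes read. */
ssize_t fd_pass_roundtrip(void)
{
    int sv[2], p[2];
    char buf[8];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    if (pipe(p) == -1) {
        close(sv[0]); close(sv[1]);
        return -1;
    }
    ssize_t n = -1;
    if (write(p[1], "hello", 5) == 5 && send_fd(sv[0], p[0]) == 0) {
        int got = recv_fd(sv[1]);
        if (got != -1) {
            assert(got != p[0]);              /* a new descriptor number */
            n = read(got, buf, sizeof buf);
            close(got);
        }
    }
    close(sv[0]); close(sv[1]); close(p[0]); close(p[1]);
    return n;
}
```

The received descriptor refers to the same open file description as the sender's, so file offsets and status flags are shared between the two processes.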
memfd_create: Anonymous File-Based IPC
memfd_create(2) (Linux 3.17) creates an anonymous file backed by tmpfs. The file
descriptor is the only reference — no filesystem path exists.
int fd = memfd_create("my_buffer", MFD_CLOEXEC | MFD_ALLOW_SEALING);
ftruncate(fd, size);
void *ptr = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
// Share fd with another process via SCM_RIGHTS over Unix socket
// Recipient calls mmap() on the received fd
Sealing (MFD_ALLOW_SEALING): once a memfd is sealed, certain operations are
permanently forbidden. Seals are set via fcntl(fd, F_ADD_SEALS, seals):
- F_SEAL_WRITE: no more writes (makes the content immutable); adding it fails with EBUSY while a writable shared mapping of the file still exists
- F_SEAL_SHRINK / F_SEAL_GROW: cannot resize
- F_SEAL_SEAL: cannot add more seals
Use case: a server generates a response, seals it as read-only, sends the fd to a
client via SCM_RIGHTS. The client mmap's the fd and reads it at memory speed — no
copy. Wayland's wl_shm protocol uses this for buffer passing between compositors
and clients.
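The effect of F_SEAL_WRITE can be observed directly. In this sketch (helper names are hypothetical) the content is written with write() rather than through a mapping, so no writable shared mapping exists when the seal is added; afterwards both a write() and a new writable shared mmap() are expected to be refused with EPERM.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Creates a memfd holding 7 bytes and seals it against writes and resizing.
 * Returns the fd, or -1 on failure. */
int make_sealed_memfd(void)
{
    int fd = memfd_create("sealed_demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd == -1)
        return -1;
    if (write(fd, "payload", 7) != 7 ||
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) == -1) {
        close(fd);
        return -1;
    }
    int seals = fcntl(fd, F_GET_SEALS);
    assert(seals != -1 && (seals & F_SEAL_WRITE));
    return fd;
}

/* Returns the errno of a write attempted after sealing (0 if it succeeded). */
int sealed_write_errno(void)
{
    int fd = make_sealed_memfd();
    if (fd == -1)
        return -1;
    ssize_t w = pwrite(fd, "X", 1, 0);
    int err = (w == -1) ? errno : 0;
    close(fd);
    return err;
}

/* Returns the errno of a writable shared mmap attempted after sealing
 * (0 if the mapping was unexpectedly granted). */
int sealed_rw_mmap_errno(void)
{
    int fd = make_sealed_memfd();
    if (fd == -1)
        return -1;
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    int err = (p == MAP_FAILED) ? errno : 0;
    if (p != MAP_FAILED)
        munmap(p, 4096);
    close(fd);
    return err;
}
```

A receiver can call fcntl(fd, F_GET_SEALS) on a descriptor it was handed and refuse to trust the contents unless the expected seals are present.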
File-Descriptor-Based Event Primitives
Linux 2.6+ provides three fds that integrate with epoll for event-driven programming:
eventfd(2) — a counter with read/write semantics:
int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
// Increment counter:
uint64_t val = 1;
write(efd, &val, 8);
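// Consume pending events. efd was created with EFD_NONBLOCK, so this read fails
// with EAGAIN while the counter is zero; without that flag it would block:
uint64_t count;
read(efd, &count, 8); // returns the counter value and resets it to zero atomically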
Used by: io_uring (optional completion notification), KVM (irqfd and ioeventfd
signalling), and Docker's libcontainer (cgroup v1 OOM notifications).
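The counter semantics are worth seeing once: several writes accumulate, a single read returns the sum and resets the counter, and a further non-blocking read reports EAGAIN. A minimal sketch, with eventfd_drain as a hypothetical helper name:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Posts three events, then reads once. Returns the value read and stores in
 * *second_errno the errno of an immediate second read (0 if it succeeded). */
uint64_t eventfd_drain(int *second_errno)
{
    assert(second_errno != NULL);
    int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    if (efd == -1)
        return 0;

    uint64_t one = 1, count = 0, again = 0;
    for (int i = 0; i < 3; i++) {
        if (write(efd, &one, sizeof one) != (ssize_t)sizeof one) {
            close(efd);
            return 0;
        }
    }
    /* One read returns the accumulated count and resets the counter to zero. */
    if (read(efd, &count, sizeof count) != (ssize_t)sizeof count)
        count = 0;
    /* The counter is now zero, so a non-blocking read cannot succeed. */
    ssize_t r = read(efd, &again, sizeof again);
    *second_errno = (r == -1) ? errno : 0;
    close(efd);
    return count;
}
```

With EFD_SEMAPHORE the same three writes would instead be consumed by three reads that each return 1.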
timerfd_create(2) — a timer as a file descriptor:
int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC | TFD_NONBLOCK);
struct itimerspec ts = { .it_value = {1, 0}, .it_interval = {1, 0} };
timerfd_settime(tfd, 0, &ts, NULL);
// tfd becomes readable every second; read returns number of expirations
signalfd(2) — signals as a file descriptor (covered in 04-signals.md).
All three integrate naturally with epoll_wait(), enabling a single-thread event loop
that handles I/O, timers, and signals without threads or signal handlers.
Historical Context
Pipes are the oldest UNIX IPC (1973, Third Edition UNIX). The | pipe operator in
shells was one of the defining innovations of UNIX, allowing the composition of small
programs into processing pipelines: the "tools and pipes" philosophy.
System V IPC was introduced in Unix System V (AT&T, 1983) and brought from the UNIX world into BSD and Linux. It was widely used until POSIX IPC standardized a cleaner interface in the early 1990s.
AF_UNIX sockets appeared in BSD 4.2 (1983). The ability to pass file descriptors
(SCM_RIGHTS) was added in BSD 4.3 and is one of the most powerful features of the
Unix socket API.
eventfd and signalfd (both Linux 2.6.22, 2007) and timerfd (2.6.25, 2008) are
Linux-specific extensions from the 2.6.x era; memfd_create followed in Linux 3.17
(2014). All of them reflect the shift toward fd-based, epoll-compatible interfaces.
Production Examples
Check SysV IPC accumulation (common production issue):
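# List shmid, size in bytes, and nattch, largest segments first:
ipcs -m | awk 'NR>3 && NF>=6 {print $2, $5, $6}' | sort -k2 -rn | head
# Segments with nattch == 0 are candidates for leaked orphans.
# Print the removal commands, review them, then append "| sh" to execute:
ipcs -m | awk 'NR>3 && NF>=6 && $6 == 0 {print "ipcrm -m", $2}'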
Unix socket connection count:
ss -x | grep /var/run/docker.sock | wc -l
Pipe buffer size tuning for high-throughput pipelines:
// In C, after creating a pipe:
fcntl(pipefd[1], F_SETPIPE_SZ, 1048576); // 1 MB pipe buffer
# From a shell, check the system-wide maximum:
cat /proc/sys/fs/pipe-max-size
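The resize request is only a hint, so it is worth reading the capacity back. A small sketch, with hypothetical helper names and an arbitrary 256 KiB request that stays below the default pipe-max-size:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Requests a larger pipe buffer and returns the capacity the kernel actually
 * granted, or -1 on failure. The kernel may round the request up. */
int grow_pipe(int fd, int requested)
{
    assert(requested > 0);
    if (fcntl(fd, F_SETPIPE_SZ, requested) == -1)
        return -1;          /* EPERM if unprivileged and above pipe-max-size */
    return fcntl(fd, F_GETPIPE_SZ);
}

/* Creates a pipe, records its initial capacity in *before, and grows it.
 * Returns the capacity after resizing, or -1 on failure. */
int pipe_capacity_demo(int *before)
{
    int fds[2];
    if (pipe2(fds, O_CLOEXEC) == -1)
        return -1;
    *before = fcntl(fds[1], F_GETPIPE_SZ);
    int after = grow_pipe(fds[1], 262144);
    close(fds[0]);
    close(fds[1]);
    return after;
}
```

Either end of the pipe can be used for the fcntl calls, since both descriptors refer to the same kernel buffer.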
memfd for zero-copy IPC benchmark:
# Using memfd to share 100MB between two processes via fd passing:
# Measure: mmap + seal + pass fd via Unix socket + mmap in receiver
# vs: write() + read() through a pipe
# Expected: the memfd approach avoids both copies and should be markedly faster for large payloads (measure the ratio on your own hardware)
Debugging Notes
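- Pipe full / blocked writer: cat /proc/PID/fdinfo/N (where N is the write-end fd number) shows pos:, flags:, and mnt_id: for the descriptor. From inside the process, fcntl(fd, F_GETPIPE_SZ) reports the pipe's capacity and ioctl(fd, FIONREAD, &n) reports the bytes currently buffered. Use lsof -p PID to identify the pipe and find the process holding its other end.
- SysV shm not cleaning up: use ipcs -m to find orphaned segments. A nattch column (number of attached processes) of 0 combined with a large segment size indicates an orphan.
- AF_UNIX socket permission errors: check ownership and permissions of the socket file. Abstract sockets (path starts with \0) have no filesystem entry, so file permissions do not apply; the server has to check peer credentials itself via SO_PEERCRED.
- fd passing failure: with SCM_RIGHTS, the sender's sendmsg succeeds even when the receiver cannot accept the descriptors. If receiving them would exceed the receiver's RLIMIT_NOFILE, the excess descriptors are closed on the receiving side, and a control buffer that is too small sets MSG_CTRUNC in msg_flags. Check the receiver's RLIMIT_NOFILE, its current fd count, and msg_flags after recvmsg.
- Tracing IPC: strace -e trace=read,write,recvmsg,sendmsg -p PID shows all pipe/socket activity. Use bpftrace -e 'tracepoint:syscalls:sys_enter_pipe { ... }' for production tracing.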
Security Implications
- Pipe data visibility: /proc/PID/fd/N shows the pipe endpoint, but the data in the kernel pipe buffer is not directly readable from userspace (unlike procfs memory maps). However, if you can ptrace the writing process, you can intercept the data.
- AF_UNIX credential spoofing: SCM_CREDENTIALS carries pid/uid/gid values chosen by the sender. The kernel rejects values that are not the sender's own unless the sender holds the matching capability (CAP_SYS_ADMIN for pid, CAP_SETUID for uid, CAP_SETGID for gid), so a privileged peer can still claim another identity. Prefer SO_PEERCRED, which the kernel fills in when the connection is established, over trusting SCM_CREDENTIALS from the peer. systemd uses SO_PEERCRED to identify which service is connecting to its management socket.
- POSIX shm and world-readable files: shm_open("/foo", O_CREAT, 0644) creates a world-readable shared memory object in /dev/shm. Any process on the system can map it and read (or write, if writable) its contents.
- memfd sealing as a security primitive: a memfd created with MFD_ALLOW_SEALING and sealed with F_SEAL_WRITE can be passed to an untrusted process that needs read access to data without being able to modify it. Chrome uses memfd-based IPC with sealing for renderer/GPU process communication.
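Querying SO_PEERCRED takes one getsockopt call. The sketch below wraps it in a hypothetical helper, peer_credentials, and checks it on a socketpair, where both ends belong to the calling process and the reported peer is therefore the caller itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Fills *cred with the kernel-recorded identity of the peer on a connected
 * AF_UNIX socket. Returns 0 on success, -1 on error. */
int peer_credentials(int sock, struct ucred *cred)
{
    assert(cred != NULL);
    socklen_t len = sizeof *cred;
    if (getsockopt(sock, SOL_SOCKET, SO_PEERCRED, cred, &len) == -1)
        return -1;
    return len == sizeof *cred ? 0 : -1;
}

/* Demo on a socketpair. Returns 1 if the reported pid and uid match the
 * calling process, 0 if they do not, -1 on error. */
int peercred_matches_self(void)
{
    int sv[2];
    struct ucred cred;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    int rc = peer_credentials(sv[0], &cred);
    close(sv[0]);
    close(sv[1]);
    if (rc == -1)
        return -1;
    return cred.pid == getpid() && cred.uid == geteuid();
}
```

A server would call peer_credentials on each accepted connection and base its authorization decision on cred.uid, since the values are recorded by the kernel and cannot be altered by the client afterwards.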
Performance Implications
- Copy count: pipes require two copies (user→kernel pipe buffer, kernel→user). Shared memory requires zero copies after setup. For large payloads (>1 MB), shared memory is dramatically faster.
- Synchronization cost: shared memory's zero-copy advantage shrinks when a mutex or semaphore is contended, because each contended acquire or wake-up becomes a futex syscall (uncontended futex-based locks stay in userspace). Use lock-free atomics or short spin loops for extremely latency-sensitive shared-memory IPC.
- Unix socket vs. loopback TCP: AF_UNIX sockets skip the entire TCP/IP stack (no header parsing, no checksum, no port table lookup). Typical throughput is 2–5x higher than loopback TCP for small messages.
- epoll + eventfd/timerfd: polling a single fd via epoll_wait is essentially free (~500 ns) compared to the overhead of thread synchronization. The fd-based event primitives enable efficient single-threaded event loops without OS-level thread costs.
Failure Modes
| Failure | Symptom | Cause |
|---|---|---|
| Broken pipe | SIGPIPE / EPIPE | All readers closed before write |
| Pipe deadlock | Both processes blocked | Writer waiting on full pipe; reader waiting on writer |
| SysV shm orphan | Growing shm usage after process exit | No shmctl(IPC_RMID) called |
| AF_UNIX ECONNREFUSED | Connection refused | Socket file exists but nothing is listening (server crashed) |
| memfd permission denied | EACCES on mmap | Wrong flags in shm_open / memfd_create |
| FD table full on fd-pass | Fewer fds arrive than were sent | Receiver would exceed RLIMIT_NOFILE; excess fds are closed on receipt |
Modern Usage
io_uring and IPC: io_uring (Linux 5.1+) uses eventfd for completion notification
and internally uses a shared ring buffer between user and kernel space — itself a form of
shared memory IPC with careful memory ordering.
Wayland: the Wayland display protocol uses AF_UNIX sockets for all compositor↔client
communication, SCM_RIGHTS for GPU buffer fd passing, and memfd with sealing for
shared pixel buffers. It is a textbook modern Unix IPC design.
D-Bus → dbus-broker: the original dbus-daemon uses AF_UNIX sockets and copies every
message through the broker process. dbus-broker (2017) is a performance-focused
reimplementation that keeps the same AF_UNIX wire protocol; applications that move
bulk data can avoid the copy by passing a memfd as a file-descriptor argument.
Future Directions
- io_uring IPC: proposals for io_uring-native IPC that would allow a process to submit a "receive from shared ring buffer" operation as an io_uring SQE, completing it asynchronously when data is available and bypassing epoll entirely.
- Landlock and IPC access control: the Landlock LSM (Linux 5.13) initially covered only filesystem access. Extensions covering AF_UNIX socket connectivity have been under discussion, enabling fine-grained process isolation policies without full seccomp complexity.
- Cross-namespace pipes: current pipe semantics require shared mount/filesystem access for FIFOs; proposals for kernel-mediated cross-namespace byte channels would simplify container-to-host IPC.
Exercises
- Pipe throughput benchmark: write two programs (producer/consumer) communicating via a pipe. Measure throughput for message sizes of 4 B, 4 KB, 64 KB, 1 MB. At what size does the pipe's 65536-byte buffer become a bottleneck? Then use fcntl(F_SETPIPE_SZ, 1048576) and repeat.
- Zero-copy IPC: implement a benchmark comparing three methods for transferring a 100 MB payload from process A to process B: (a) write()/read() on a pipe, (b) POSIX shared memory with a mutex handshake, (c) memfd + SCM_RIGHTS fd passing + mmap. Plot latency and throughput for each.
- AF_UNIX fd passing: write a server that opens /etc/passwd (or any file), then passes the open fd to a client via SCM_RIGHTS. The client reads from the received fd without knowing the filename. Verify with lsof -p client_pid.
- SysV IPC audit: on a test Linux system, write a C program that uses shmget to create a shared memory segment without ever calling shmctl(IPC_RMID). Run it 100 times and observe ipcs -m accumulating orphaned segments. Write a cleanup script using ipcrm.
- eventfd-based task queue: implement a thread-safe work queue using a shared circular buffer and an eventfd for notification. One producer thread writes work items; one consumer thread uses epoll on the eventfd to wake up. Measure scheduling latency (time from eventfd write to consumer wakeup).
References
- fs/pipe.c: kernel pipe implementation
- ipc/mqueue.c: POSIX message queue
- ipc/shm.c, ipc/msg.c, ipc/sem.c: System V IPC
- mm/shmem.c: tmpfs backing for POSIX shm and memfd
- net/unix/af_unix.c: AF_UNIX socket implementation
- fs/eventfd.c, fs/timerfd.c, fs/signalfd.c
- Kerrisk, The Linux Programming Interface: Chapters 44 (pipes and FIFOs), 45–48 (SysV IPC), 52 (POSIX MQ), 54 (POSIX shm), 57 (AF_UNIX)
- man 2 pipe2, man 3 mq_open, man 2 shmget, man 3 shm_open, man 2 memfd_create, man 2 eventfd, man 2 timerfd_create
- man 7 unix: AF_UNIX socket details, SCM_RIGHTS, SO_PEERCRED
- Wayland protocol specification: wl_shm and buffer passing
- LWN: "Sealed files with memfd_create()" (2014)