02 — Sockets
Technical Overview
The socket abstraction, introduced by BSD Unix in 1983 and adopted wholesale by Linux, presents every network endpoint as a file descriptor. An application that can read and write file descriptors can communicate over any supported network protocol without knowing the underlying transport mechanics. Beneath this uniform interface lies a rich implementation that spans user-space system call stubs, VFS integration, protocol-specific socket implementations, and kernel buffer management. Understanding socket internals is essential for diagnosing application latency, connection exhaustion, and buffer-related performance pathologies.
Prerequisites
- POSIX system call knowledge (file descriptors, read/write/close)
- Basic TCP/UDP protocol understanding
- Linux process model (process, thread, file descriptor table)
- Familiarity with ss, netstat, strace
Core Content
Socket as File Descriptor
From userspace, a socket is an integer — an index into the process file descriptor table. Internally, the kernel creates:
fd table entry
|
v
struct file (f_op = socket_file_ops)
|
v
struct socket (state, type, sk, ops)
|
v
struct sock (protocol-specific: tcp_sock, udp_sock, raw_sock)
|
v
struct proto_ops (bind, connect, accept, sendmsg, recvmsg, ...)
struct socket is the generic socket wrapper. struct sock is the protocol-specific socket — for TCP it is actually a struct tcp_sock (which embeds struct inet_sock which embeds struct sock). This inheritance-via-embedding pattern is pervasive in kernel networking.
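The embedding pattern can be tried out in userspace. The following is a minimal sketch, not kernel code: the demo_ structs and their fields are hypothetical stand-ins for struct sock, struct inet_sock, and struct tcp_sock, which carry far more state in the real kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for struct sock, struct inet_sock, struct tcp_sock. */
struct demo_sock      { int sk_state; };
struct demo_inet_sock { struct demo_sock sk; uint16_t inet_sport; };
struct demo_tcp_sock  { struct demo_inet_sock inet; uint32_t snd_nxt; };

/* The downcast below is only valid while each base struct is the first member. */
_Static_assert(offsetof(struct demo_inet_sock, sk) == 0, "base must come first");
_Static_assert(offsetof(struct demo_tcp_sock, inet) == 0, "base must come first");

/* Downcast from the generic type to the TCP-specific type. */
static inline struct demo_tcp_sock *demo_tcp_sk(struct demo_sock *sk)
{
    return (struct demo_tcp_sock *)sk;
}

/* Generic code holds only a base pointer, yet can reach TCP-specific state. */
static uint32_t demo_next_seq(struct demo_sock *sk)
{
    assert(sk != NULL);
    return demo_tcp_sk(sk)->snd_nxt;
}
```

Because the base struct sits at offset 0, a pointer to the outer struct and a pointer to the embedded base are the same address, which is what makes the cast safe.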
Socket Types and Address Families
Socket types (second argument to socket()):
| Type | Protocol | Use |
|---|---|---|
| SOCK_STREAM | TCP (AF_INET) / UNIX stream | Reliable, ordered, connection-oriented |
| SOCK_DGRAM | UDP (AF_INET) / UNIX datagram | Unreliable, unordered, connectionless |
| SOCK_RAW | IP or lower | Custom protocol, packet injection (requires CAP_NET_RAW) |
| SOCK_SEQPACKET | SCTP / UNIX seqpacket | Reliable, message-boundary-preserving |
| SOCK_PACKET | Ethernet (deprecated) | Use AF_PACKET instead |
Address families (first argument to socket()):
| Family | Purpose | Address structure |
|---|---|---|
| AF_INET | IPv4 | struct sockaddr_in (4-byte addr, 2-byte port) |
| AF_INET6 | IPv6 | struct sockaddr_in6 (16-byte addr, 2-byte port, flow, scope) |
| AF_UNIX | Local IPC | struct sockaddr_un (pathname or abstract) |
| AF_NETLINK | Kernel-userspace IPC | struct sockaddr_nl (pid, groups) |
| AF_PACKET | Raw Ethernet frames | struct sockaddr_ll (interface index, protocol) |
| AF_XDP | Kernel-bypass networking | struct sockaddr_xdp |
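As an illustration of the AF_INET row, the sketch below fills a struct sockaddr_in from a dotted-quad string and a host-order port. The helper name make_ipv4_addr is an assumption made for this example; inet_pton() and htons() are the standard calls.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fill a sockaddr_in from IPv4 text and a host-order port.
 * Returns 0 on success, -1 if the text is not a valid IPv4 address. */
static int make_ipv4_addr(const char *ip, uint16_t port, struct sockaddr_in *out)
{
    assert(ip != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);   /* the port is stored in network byte order */
    return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}
```

The result is passed to bind() or connect() cast to struct sockaddr *, together with sizeof(struct sockaddr_in) as the length.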
Socket System Calls
Core lifecycle:
Server:                     Client:
socket()                    socket()
   |                           |
bind()                      connect() ──────────────┐
   |                                                |
listen()                                            |
   |                                                |
accept() <──────────────────────────────────────────┘
   |
[new fd]
   |
send()/recv() <──────────────> send()/recv()
   |
close()
socket(domain, type, protocol) — allocates a socket structure, returns fd. protocol is usually 0 (kernel selects based on type), but IPPROTO_TCP, IPPROTO_UDP, IPPROTO_ICMP can be specified for SOCK_RAW.
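bind(fd, addr, addrlen): associates the socket with a local address and port. Servers call it before listen() so that clients have a known port to reach. For clients it is optional (the kernel assigns an ephemeral port at connect()). Internals: inet_bind() invokes the protocol's get_port hook (inet_csk_get_port() for TCP in current kernels; older kernels used tcp_v4_get_port()) to allocate or verify the port in the bind hash table.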
listen(fd, backlog): marks the socket as passive (accepting connections). backlog limits the length of the completed connection queue (connections that have finished the handshake and are waiting for accept()); the kernel also maintains an incomplete queue (SYN received, SYN-ACK sent, awaiting final ACK), sized by net.ipv4.tcp_max_syn_backlog. The backlog argument is silently capped at net.core.somaxconn (default 4096 since Linux 5.4, previously 128).
accept(fd, addr, addrlen) — dequeues a connection from the completed queue, returns a new socket fd. Blocks if queue is empty (or returns EAGAIN if O_NONBLOCK).
connect(fd, addr, addrlen) — for TCP: initiates three-way handshake. Returns when connection established (blocking mode) or immediately with EINPROGRESS (non-blocking). For UDP: sets default destination, enables send()/recv() instead of sendto()/recvfrom().
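The call sequence described so far can be exercised inside a single process over loopback. This is a sketch under stated assumptions: loopback_roundtrip is a hypothetical helper, it uses blocking sockets, and it relies on the kernel completing the handshake and queueing the connection before accept() is called.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Run socket/bind/listen on one socket and socket/connect on another,
 * then accept, send, and recv. Returns bytes received, or -1 on error. */
static ssize_t loopback_roundtrip(const char *msg, char *out, size_t outlen)
{
    assert(msg != NULL && out != NULL);
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = 0 };
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    socklen_t alen = sizeof(addr);
    ssize_t n = -1;
    int conn = -1, cli = -1;

    int lst = socket(AF_INET, SOCK_STREAM, 0);
    if (lst < 0) return -1;
    /* Port 0 asks the kernel to pick a free port; getsockname() reveals it. */
    if (bind(lst, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    if (listen(lst, 16) < 0) goto out;
    if (getsockname(lst, (struct sockaddr *)&addr, &alen) < 0) goto out;

    cli = socket(AF_INET, SOCK_STREAM, 0);
    if (cli < 0) goto out;
    if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    conn = accept(lst, NULL, NULL);          /* new fd for this connection */
    if (conn < 0) goto out;
    if (send(cli, msg, strlen(msg), 0) < 0) goto out;
    n = recv(conn, out, outlen, 0);
out:
    if (conn >= 0) close(conn);
    if (cli >= 0) close(cli);
    close(lst);
    return n;
}
```

Note that the listening socket is never used for data: accept() hands back a separate descriptor, and the listener keeps accepting until it is closed.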
sendmsg(fd, msghdr, flags) / recvmsg(fd, msghdr, flags) — most general I/O calls, supporting:
- iov[] scatter-gather I/O (multiple buffers in one call)
- cmsg ancillary data (timestamps, credentials, TTL, TOS)
- msg_flags control
sendto / recvfrom — convenience wrappers for UDP, equivalent to sendmsg/recvmsg with simple address/buffer.
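A short sketch of the scatter-gather path through sendmsg(): two separate buffers leave in one call. The helper name send_two_parts is an assumption for illustration; on a datagram socket the two buffers arrive as one message.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send a header and a body from separate buffers with a single sendmsg(). */
static ssize_t send_two_parts(int fd, const char *hdr, const char *body)
{
    assert(fd >= 0 && hdr != NULL && body != NULL);
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,  .iov_len = strlen(hdr)  },
        { .iov_base = (void *)body, .iov_len = strlen(body) },
    };
    /* No destination address and no ancillary data in this sketch. */
    struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 2 };
    return sendmsg(fd, &msg, 0);
}
```

This avoids copying both parts into a temporary buffer first, and it avoids the two small writes that trigger the Nagle interaction on TCP.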
Socket State Lifecycle (TCP)
                    socket()
                       |
                       v
                    CLOSED
                       |
     server: listen()     client: connect()
            |                    |
            v                    v
         LISTEN               SYN_SENT
            |                    |
    accept() returns       SYN-ACK received
    new ESTABLISHED fd           |
            |                    v
            +-------> ESTABLISHED
                           |
                     data transfer
                           |
              close() or shutdown(SHUT_WR)
                  /                 \
          active close          passive close
              /                       \
        FIN_WAIT_1                CLOSE_WAIT
             |                        |
        FIN_WAIT_2                LAST_ACK
             |                        |
        TIME_WAIT                  CLOSED
             |
     (2*MSL; fixed at 60 s on Linux)
             |
          CLOSED
Inspect socket states:
ss -tan state established
ss -tan state time-wait | wc -l # count TIME_WAIT sockets
ss -o state established '( dport = :80 or sport = :80 )'
Socket Options
SOL_SOCKET level options:
SO_REUSEADDR: allows binding to a port that still has connections in TIME_WAIT. Essential for servers that restart: without it, bind() returns EADDRINUSE until those connections expire (60 seconds on Linux). Does NOT allow two sockets to simultaneously accept on the same port (use SO_REUSEPORT for that).
SO_REUSEPORT — allows multiple sockets to bind to the same address:port. Kernel distributes incoming connections/datagrams across all sockets using a hash (Linux 3.9+). Enables multi-process accept() without lock contention — critical for nginx/envoy worker model.
SO_KEEPALIVE — enables TCP keepalive probes. Configured via:
sysctl net.ipv4.tcp_keepalive_time # idle time before first probe (default 7200s)
sysctl net.ipv4.tcp_keepalive_intvl # interval between probes (default 75s)
sysctl net.ipv4.tcp_keepalive_probes # number of unacknowledged probes before disconnect (default 9)
Or per-socket: setsockopt(fd, SOL_TCP, TCP_KEEPIDLE, &val, sizeof(val))
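The per-socket overrides can be grouped in one helper. This is a sketch; enable_keepalive is a hypothetical name, and the three TCP-level options correspond to the three sysctls listed above.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on one TCP socket and override the sysctl defaults for it.
 * idle_s: idle seconds before the first probe; intvl_s: seconds between probes;
 * probes: unanswered probes before the connection is dropped. */
static int enable_keepalive(int fd, int idle_s, int intvl_s, int probes)
{
    assert(fd >= 0);
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0) return -1;
    return 0;
}
```

With enable_keepalive(fd, 60, 10, 3), a dead peer is detected after roughly 60 + 3 × 10 = 90 seconds of silence instead of more than two hours with the defaults.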
SO_RCVBUF / SO_SNDBUF — set receive/send buffer sizes. The kernel doubles the value for internal overhead (so requesting 128KB gives ~256KB effective). Hard ceiling: net.core.rmem_max / net.core.wmem_max. Auto-tuning (net.ipv4.tcp_moderate_rcvbuf=1) usually outperforms manual sizing for TCP.
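The doubling is easy to observe by reading the option back. A minimal sketch, with set_rcvbuf as a hypothetical helper name:

```c
#include <assert.h>
#include <sys/socket.h>

/* Request a receive buffer size and return what the kernel configured.
 * The kernel clamps the request to net.core.rmem_max and then doubles it,
 * so the result is normally 2 * requested. Returns -1 on error. */
static int set_rcvbuf(int fd, int requested)
{
    assert(fd >= 0 && requested > 0);
    int effective = 0;
    socklen_t len = sizeof(effective);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0) return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &effective, &len) < 0) return -1;
    return effective;
}
```

Setting SO_RCVBUF explicitly also turns off receive-buffer auto-tuning for that socket, which is one reason manual sizing often loses to the defaults.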
SO_LINGER — controls close() behavior: with l_onoff=1, l_linger=0, close() sends RST instead of FIN (avoids TIME_WAIT). With l_linger>0, close() blocks until send buffer drains or timeout. Avoid l_linger=0 in production — it causes RST storms on server restart.
IPPROTO_TCP level options:
TCP_NODELAY — disables Nagle algorithm. Nagle coalesces small writes into larger segments (waits for ACK of previous segment before sending a new small segment). Disable for latency-sensitive protocols (interactive CLI, database queries): setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one))
TCP_CORK — the opposite of TCP_NODELAY: hold segments until full MSS or cork is removed. Use for HTTP response construction: cork, write headers, write body, uncork → single large segment.
TCP_FASTOPEN — send data with the SYN packet using a TFO cookie, saving one RTT. Server: setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen)). Client: sendto() with MSG_FASTOPEN flag. Linux sysctl: net.ipv4.tcp_fastopen=3.
TCP_DEFER_ACCEPT — server does not wake accept() until data arrives with the connection (saves a wakeup for request/response protocols).
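The cork/uncork sequence for response construction might look like the sketch below. The helper names set_tcp_flag and send_corked are assumptions, and short writes are not retried here to keep the example brief.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Turn a boolean IPPROTO_TCP option (TCP_NODELAY, TCP_CORK, ...) on or off. */
static int set_tcp_flag(int fd, int opt, int on)
{
    return setsockopt(fd, IPPROTO_TCP, opt, &on, sizeof(on));
}

/* Cork, write headers and body, then uncork so the kernel can emit
 * full segments instead of one small segment per write. */
static int send_corked(int fd, const char *hdr, const char *body)
{
    assert(fd >= 0 && hdr != NULL && body != NULL);
    if (set_tcp_flag(fd, TCP_CORK, 1) < 0) return -1;
    ssize_t a = write(fd, hdr, strlen(hdr));
    ssize_t b = write(fd, body, strlen(body));
    /* Removing the cork flushes whatever is still pending. */
    if (set_tcp_flag(fd, TCP_CORK, 0) < 0) return -1;
    return (a < 0 || b < 0) ? -1 : 0;
}
```

For a latency-sensitive connection the same helper covers the other case: set_tcp_flag(fd, TCP_NODELAY, 1).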
Socket Buffers
The receive buffer (SO_RCVBUF) is the kernel memory holding data received from the network but not yet read by the application. When it fills, TCP advertises a shrinking (eventually zero) receive window so the sender pauses (flow control), while UDP simply drops incoming datagrams.
The send buffer (SO_SNDBUF) holds data written by the application but not yet acknowledged by the remote TCP. When it fills, write() blocks (or returns EAGAIN if non-blocking).
For TCP, auto-tuning adjusts buffer sizes dynamically based on the bandwidth-delay product:
/proc/sys/net/ipv4/tcp_rmem: min default max
4096 131072 6291456 (6MB default max)
/proc/sys/net/ipv4/tcp_wmem: min default max
4096 16384 4194304 (4MB default max)
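To judge whether these maxima suffice for a given path, compare them with the bandwidth-delay product. The helper below is a small arithmetic sketch (bdp_bytes is a hypothetical name); it ignores the share of the buffer the kernel reserves for bookkeeping overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth-delay product in bytes: the amount of data that must be
 * in flight to keep a path of the given rate and round-trip time full. */
static uint64_t bdp_bytes(uint64_t bits_per_sec, uint64_t rtt_ms)
{
    uint64_t bytes_per_sec = bits_per_sec / 8;
    /* Guard the multiplication against 64-bit overflow. */
    assert(rtt_ms == 0 || bytes_per_sec <= UINT64_MAX / rtt_ms);
    return bytes_per_sec * rtt_ms / 1000;
}
```

A 1 Gbit/s path with a 50 ms RTT needs about 6.25 MB in flight, which just fits under the 6291456-byte receive maximum shown above. A 10 Gbit/s path at the same RTT needs 62.5 MB, so the maxima would have to be raised.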
Blocking vs Non-blocking, and I/O Multiplexing
Blocking (default): recv() blocks until data is available; send() blocks until buffer has space.
Non-blocking (O_NONBLOCK or SOCK_NONBLOCK at creation): operations return immediately with EAGAIN/EWOULDBLOCK if they cannot complete.
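The EAGAIN behavior can be observed by writing to a non-blocking socket whose peer never reads. A sketch with hypothetical helper names (set_nonblocking, fill_until_eagain):

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Put an existing descriptor into non-blocking mode. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Write until the send buffer is full.
 * Returns the number of bytes queued, or -1 on a real error. */
static long fill_until_eagain(int fd)
{
    assert(fd >= 0);
    char chunk[4096];
    long total = 0;
    memset(chunk, 'x', sizeof(chunk));
    for (;;) {
        ssize_t n = send(fd, chunk, sizeof(chunk), MSG_NOSIGNAL);
        if (n > 0) { total += n; continue; }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) return total; /* buffer full */
        if (n < 0 && errno == EINTR) continue;   /* interrupted, retry */
        return -1;
    }
}
```

In a real server the EAGAIN return is the signal to register the descriptor for writability (EPOLLOUT) and resume once the kernel reports space.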
I/O multiplexing allows one thread to wait on multiple file descriptors:
select(nfds, readfds, writefds, exceptfds, timeout) — O(n) per call, 1024 fd limit, copies fd_sets between user and kernel on every call. Obsolete for high-fd-count servers.
poll(fds[], nfds, timeout) — removes 1024 fd limit but still O(n) per call. Better than select but still copies the entire event array every call.
epoll: O(1) per ready event. Level-triggered by default; edge-triggered when EPOLLET is set alongside the event mask (EPOLLIN, EPOLLOUT). The fd interest list lives in the kernel; epoll_wait() only returns ready events.
int epfd = epoll_create1(0);
struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = sockfd };
epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);
struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, -1); /* blocks */
for (int i = 0; i < n; i++) {
handle_fd(events[i].data.fd);
}
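With EPOLLET the handler has to read until EAGAIN, otherwise leftover data sits in the buffer with no further notification until new data arrives. The sketch below shows one way to do that; drain_fd and wait_and_drain are hypothetical helpers, and the read buffer is kept deliberately small so that several reads are needed.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read from a non-blocking fd until EAGAIN or EOF. Returns bytes read, or -1. */
static long drain_fd(int fd)
{
    char buf[64];
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) { total += n; continue; }
        if (n == 0) return total;                                   /* peer closed */
        if (errno == EAGAIN || errno == EWOULDBLOCK) return total;  /* fully drained */
        if (errno == EINTR) continue;
        return -1;
    }
}

/* Register fd edge-triggered, wait for one event, and drain it.
 * Returns bytes read, or -1 on error or timeout. */
static long wait_and_drain(int fd, int timeout_ms)
{
    assert(fd >= 0);
    long total = -1;
    int flags = fcntl(fd, F_GETFL, 0);
    /* Edge-triggered use requires a non-blocking descriptor. */
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) return -1;
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd < 0) return -1;

    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
    struct epoll_event ready;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0 &&
        epoll_wait(epfd, &ready, 1, timeout_ms) == 1)
        total = drain_fd(ready.data.fd);
    close(epfd);
    return total;
}
```

A production loop would keep the epoll instance alive across iterations and dispatch on events[i].events as well, for example to handle EPOLLHUP and EPOLLERR.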
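io_uring (Linux 5.1+): submits I/O operations through a ring buffer shared between kernel and userspace; completions are posted to a separate ring. Batching many operations per submission (or using a kernel polling thread via IORING_SETUP_SQPOLL) amortizes syscall overhead for high-rate I/O: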
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_recv(sqe, sockfd, buf, sizeof(buf), 0);
sqe->user_data = (uint64_t)sockfd;
io_uring_submit(&ring);
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
int result = cqe->res; /* bytes received or -errno */
io_uring_cqe_seen(&ring, cqe);
For networking, io_uring multishot receive (IORING_OP_RECV with the IORING_RECV_MULTISHOT flag, Linux 6.0+, used together with provided buffers) submits one operation and delivers subsequent receives as completions without re-submitting.
Historical Context
BSD sockets (4.2BSD, 1983) were designed by Bill Joy and Sam Leffler as an abstraction over ARPANET protocols. The design decision to use file descriptors was controversial — some argued for message-passing interfaces — but proved correct: it made network I/O composable with every Unix tool.
Linux implemented the BSD socket API in 0.99 (1993). The epoll call (Davide Libenzi, Linux 2.5.44, 2002) was a direct response to the "C10K problem": serving 10,000 concurrent connections on a single machine. The O(n) per-call cost of select/poll made C10K impractical; epoll's O(1) wakeup model enabled it.
SO_REUSEPORT was added in Linux 3.9 (2013), years after BSD had it, to enable multi-process servers without a thundering herd on accept().
Production Examples
nginx worker model with SO_REUSEPORT:
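# nginx.conf
worker_processes auto;
events {
    use epoll;
    multi_accept on;
}
http {
    server {
        # reuseport gives every worker its own SO_REUSEPORT listening socket
        listen 80 reuseport;
    }
}
# Each worker process independently accepts connections on port 80
# No accept mutex needed; kernel load-balances via SO_REUSEPORT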
Go net.Conn performance tuning:
// Go enables TCP_NODELAY on TCP connections by default; setting it here only documents intent
if tc, ok := conn.(*net.TCPConn); ok { // checked assertion: conn may not be a TCP connection
    tc.SetNoDelay(true)
    // Increase buffers for bulk transfers
    tc.SetReadBuffer(1 << 20) // 1MB
    tc.SetWriteBuffer(1 << 20)
}
Debugging Notes
# Show all listening sockets with process
ss -tlnp
# Show socket memory usage per socket
ss -tm state established
# Trace all socket syscalls for a PID (also covers accept4, sendto, recvfrom, sendmsg, recvmsg)
strace -f -e trace=network -p <PID>
# Count TIME_WAIT sockets (per destination IPv4 address); with a state filter ss omits the State column, so the peer is field 4
ss -tan state time-wait | awk 'NR>1 {print $4}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
# Log setsockopt() calls (fd, level, option name) system-wide
bpftrace -e 'tracepoint:syscalls:sys_enter_setsockopt { printf("pid=%d fd=%d level=%d optname=%d\n", pid, args->fd, args->level, args->optname); }'
# Measure accept() queue depth
ss -lnt | awk '{print $2, $3, $4}' # Recv-Q = pending connections
Recv-Q on a LISTEN socket shows the number of connections fully established but not yet accept()ed — if this is consistently non-zero, your application is too slow to drain the accept queue.
Security Implications
- SO_REUSEPORT hijacking: a process running under the same effective UID can bind to the same port with SO_REUSEPORT and receive a fraction of the connections. Mitigation: SO_ATTACH_REUSEPORT_CBPF (Linux 4.5) attaches a BPF program that decides which socket in the group receives each connection.
- CAP_NET_BIND_SERVICE: binding to ports < 1024 requires this capability (or lowering the net.ipv4.ip_unprivileged_port_start sysctl in newer kernels).
- Unix socket permissions: AF_UNIX path-based sockets are governed by filesystem permissions. Abstract sockets (@name) bypass the filesystem, so any process in the same network namespace can connect.
- SO_PEERCRED: obtains the UID/GID/PID of the connecting process on AF_UNIX sockets; use it for authentication in local service communication.
- TCP_MD5SIG: TCP MD5 signature option for BGP session authentication between routers; prevents RST injection attacks.
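A sketch of peer authentication with SO_PEERCRED. The helper names and the same-user policy are assumptions for illustration; struct ucred is only visible when _GNU_SOURCE is defined before the first include.

```c
#define _GNU_SOURCE            /* exposes struct ucred in <sys/socket.h> */
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill *cred with the peer's PID/UID/GID. Returns 0 on success, -1 on failure. */
static int peer_credentials(int fd, struct ucred *cred)
{
    assert(fd >= 0 && cred != NULL);
    socklen_t len = sizeof(*cred);
    return getsockopt(fd, SOL_SOCKET, SO_PEERCRED, cred, &len);
}

/* Hypothetical policy: accept only peers running as the same user. */
static int peer_is_same_user(int fd)
{
    struct ucred cred;
    return peer_credentials(fd, &cred) == 0 && cred.uid == geteuid();
}
```

The credentials are captured by the kernel when the connection (or socketpair) is created, so a peer cannot forge them by sending data.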
Performance Implications
| Scenario | Recommendation |
|---|---|
| Many short-lived connections | SO_REUSEADDR + SO_REUSEPORT + TCP_FASTOPEN |
| High-throughput bulk transfer | Large SO_SNDBUF/SO_RCVBUF + TCP_CORK |
| Low-latency RPC | TCP_NODELAY + small SO_SNDBUF (avoid buffer bloat) |
| Many concurrent idle connections | SO_KEEPALIVE with short TCP_KEEPIDLE to reap dead connections |
| Accept-intensive server | SO_REUSEPORT with multiple accept threads, no accept mutex |
The cost breakdown for a TCP recv():
- Syscall entry/exit: ~50–100 ns
- sk_buff dequeue: ~20–50 ns
- Data copy (kernel→user): ~1 ns/byte (limited by memory bandwidth)
- Zero-copy: TCP_ZEROCOPY_RECEIVE (mmap-based, Linux 4.18+) removes the receive copy for page-aligned payloads; MSG_ZEROCOPY is the send-side counterpart and adds completion-notification overhead
Failure Modes and Real Incidents
Incident: Accept queue overflow (2019, e-commerce peak load)
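net.core.somaxconn defaulted to 128. Under Black Friday traffic, the completed connection queue filled instantly. New connections were silently dropped: by default the kernel discards the final handshake ACK when the queue is full, so clients see retransmissions and timeouts (or an RST if net.ipv4.tcp_abort_on_overflow=1). Fix: sysctl -w net.core.somaxconn=65535 and a matching listen() backlog.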
Incident: Nagle + delayed ACK interaction (classic)
Client sends a small request in two writes (header then body, no TCP_CORK). Nagle holds the second write until ACK of first. Server delayed ACK timer (200ms) holds ACK. Result: 200ms added latency per request. Diagnosis: tcpdump shows 200ms gap after second write. Fix: TCP_NODELAY on client socket.
Failure Mode: fd exhaustion
A server not calling close() on accepted sockets leaks fds. At the process fd limit (ulimit -n), accept() returns EMFILE. Diagnosis: ls /proc/<pid>/fd | wc -l. Fix: increase ulimit -n to 1048576, and fix the leak.
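The same check that ls /proc/<pid>/fd performs can be built into the server as a self-monitoring hook. A sketch, assuming /proc is mounted; count_open_fds is a hypothetical helper.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Count this process's open descriptors by listing /proc/self/fd.
 * The count includes the descriptor opendir() holds during the scan.
 * Returns -1 if /proc is unavailable. */
static int count_open_fds(void)
{
    DIR *d = opendir("/proc/self/fd");
    if (d == NULL) return -1;
    int n = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (e->d_name[0] != '.') n++;   /* skip "." and ".." */
    assert(n >= 1);                     /* the directory's own fd is always listed */
    closedir(d);
    return n;
}
```

Sampling this value periodically and alerting on steady growth catches a leak long before accept() starts failing with EMFILE.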
Modern Usage
- io_uring is now used by production databases (RocksDB, ScyllaDB) and web servers for socket I/O, showing 20–40% reduction in CPU overhead vs epoll
- MSG_ZEROCOPY (Linux 4.14) is deployed at Google for large-payload services, reducing data copy overhead on 25Gbps+ NICs
- eBPF socket programs (BPF_PROG_TYPE_SOCK_OPS) can modify TCP behavior per-connection: custom congestion control invocation, RTT measurement, custom options
- AF_XDP with SO_PREFER_BUSY_POLL enables low-latency userspace packet processing for HFT and 5G RAN applications
Future Directions
- io_uring send/recv zerocopy is under active development (Linux 6.x); combined with registered fixed buffers, it targets <1 µs socket round-trip latency
- Multipath TCP (MPTCP) is stable in Linux 5.6+: socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP) creates an MPTCP socket with automatic subflow management
- TCP over UDP and QUIC challenge the TCP socket model: AF_INET/SOCK_DGRAM with userspace QUIC (lsquic, quiche, msquic) is the dominant deployment pattern until kernel QUIC lands
Exercises
- Write a minimal server using epoll in edge-triggered mode. Handle EAGAIN correctly. Explain why failing to drain the fd on EPOLLET causes missed events.
- Create a SO_REUSEPORT server with 4 worker processes. Use ss -tlnp to verify all 4 processes are listening on the same port. Measure accept() latency per-worker under load with perf stat.
- Demonstrate the Nagle + delayed ACK latency interaction: write a client that sends a 1-byte header and 1-byte body in two separate write() calls to a server with delayed ACK enabled. Measure RTT with tcpdump. Then fix it with TCP_NODELAY and measure again.
- Trace do_accept() in the kernel using bpftrace to measure the time from when a connection is added to the accept queue until the application calls accept(). At what queue depth does latency increase significantly?
- Using strace -c, measure the distribution of time spent in socket syscalls for a high-QPS HTTP server. Which syscall consumes the most time? How does switching to io_uring change the profile?
References
- net/socket.c: sys_socket, sys_bind, sys_listen, sys_accept
- net/ipv4/tcp.c: tcp_sendmsg, tcp_recvmsg
- net/ipv4/inet_connection_sock.c: accept queue management
- include/linux/net.h: struct socket, struct proto_ops
- Stevens, W.R. Unix Network Programming Vol. 1: The Sockets Networking API. 3rd Ed.
- Kerrisk, M. The Linux Programming Interface. Chapters 56–61 (Sockets).
- Libenzi, D. Scalable Event Multiplexing: epoll vs. kqueue. 2002.
- man 7 socket, man 7 tcp, man 7 ip, man 2 epoll_wait, man 2 io_uring_setup
- Documentation/networking/ip-sysctl.rst: full sysctl reference