02 — Sockets

Technical Overview

The socket abstraction, introduced by BSD Unix in 1983 and adopted wholesale by Linux, presents every network endpoint as a file descriptor. An application that can read and write file descriptors can communicate over any supported network protocol without knowing the underlying transport mechanics. Beneath this uniform interface lies a rich implementation that spans user-space system call stubs, VFS integration, protocol-specific socket implementations, and kernel buffer management. Understanding socket internals is essential for diagnosing application latency, connection exhaustion, and buffer-related performance pathologies.

Prerequisites

POSIX system call knowledge (file descriptors, read/write/close)
Basic TCP/UDP protocol understanding
Linux process model (process, thread, file descriptor table)
Familiarity with ss, netstat, strace

Core Content

Socket as File Descriptor

From userspace, a socket is an integer — an index into the process file descriptor table. Internally, the kernel creates:

fd table entry
     |
     v
struct file (f_op = socket_file_ops)
     |
     v
struct socket  (state, type, sk, ops -> struct proto_ops: bind, connect, accept, sendmsg, recvmsg, ...)
     |
     v
struct sock    (protocol-specific: tcp_sock, udp_sock, raw_sock)
     |
     v
struct proto   (sk_prot: tcp_prot, udp_prot, raw_prot; per-protocol handlers)

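struct socket is the generic, protocol-independent wrapper that the VFS layer sees; its ops field points to the struct proto_ops table. struct sock is the network-layer object and the base of every protocol-specific socket: for TCP it is the first member of struct inet_sock, which is embedded in struct inet_connection_sock, which in turn is embedded in struct tcp_sock. Because the base is always the first member, a struct sock pointer can be cast to the enclosing type. This inheritance-via-embedding pattern is pervasive in kernel networking.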
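
The embedding pattern can be reproduced in a few lines of userspace C. The sketch below is illustrative only: the toy_* structs and helpers are hypothetical stand-ins, not the kernel's definitions, and toy_container_of is a simplified version of the kernel's container_of() macro without its type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for struct sock / inet_sock / tcp_sock (hypothetical names). */
struct toy_sock      { int sk_state; };
struct toy_inet_sock { struct toy_sock sk; unsigned short inet_sport; };
struct toy_tcp_sock  { struct toy_inet_sock inet; unsigned int snd_cwnd; };

/* Recover the enclosing object from a pointer to an embedded member. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* "Downcast": only valid if sk really lives inside a toy_tcp_sock. */
static struct toy_tcp_sock *toy_tcp_sk(struct toy_sock *sk)
{
    assert(sk != NULL);
    return toy_container_of(sk, struct toy_tcp_sock, inet.sk);
}

/* "Upcast": generic code only ever sees the embedded base object. */
static struct toy_sock *toy_generic_sk(struct toy_tcp_sock *tp)
{
    return &tp->inet.sk;
}
```

Generic code would pass around the struct toy_sock pointer, and only protocol code that knows the concrete type would call toy_tcp_sk() to reach fields such as snd_cwnd.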

Socket Types and Address Families

Socket types (second argument to socket()):

Type	Protocol	Use
`SOCK_STREAM`	TCP (AF_INET) / UNIX stream	Reliable, ordered, connection-oriented
`SOCK_DGRAM`	UDP (AF_INET) / UNIX datagram	Unreliable, unordered, connectionless
`SOCK_RAW`	IP or lower	Custom protocol, packet injection (requires CAP_NET_RAW)
`SOCK_SEQPACKET`	SCTP / UNIX seqpacket	Reliable, message-boundary-preserving
`SOCK_PACKET`	Ethernet (deprecated)	Use AF_PACKET instead

Address families (first argument to socket()):

Family	Purpose	Address structure
`AF_INET`	IPv4	`struct sockaddr_in` (4-byte addr, 2-byte port)
`AF_INET6`	IPv6	`struct sockaddr_in6` (16-byte addr, 2-byte port, flow, scope)
`AF_UNIX`	Local IPC	`struct sockaddr_un` (pathname or abstract)
`AF_NETLINK`	Kernel-userspace IPC	`struct sockaddr_nl` (pid, groups)
`AF_PACKET`	Raw Ethernet frames	`struct sockaddr_ll` (interface index, protocol)
`AF_XDP`	Kernel-bypass networking	`struct sockaddr_xdp`
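
As a small illustration of the AF_INET row, the following sketch fills a struct sockaddr_in from a dotted-quad string and a host-order port. make_ipv4_addr is a hypothetical helper name; the point it shows is that both the address and the port are stored in network byte order.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Fill an IPv4 socket address.
 * Returns 0 on success, -1 if ip is not a valid dotted-quad string. */
static int make_ipv4_addr(const char *ip, uint16_t port, struct sockaddr_in *out)
{
    assert(ip != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);          /* port in network byte order */
    return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}
```

The resulting structure is what bind() and connect() expect, cast to struct sockaddr * and passed together with sizeof(struct sockaddr_in).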

Socket System Calls

Core lifecycle:

Server:                          Client:
socket()                         socket()
   |                                |
bind()                           connect() ──────────────┐
   |                                                      |
listen()                                                  |
   |                                                      |
accept() <────────────────────────────────────────────────┘
   |
[new fd]
   |
send()/recv()  <──────────────>  send()/recv()
   |
close()

socket(domain, type, protocol) — allocates a socket structure, returns fd. protocol is usually 0 (kernel selects based on type), but IPPROTO_TCP, IPPROTO_UDP, IPPROTO_ICMP can be specified for SOCK_RAW.

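bind(fd, addr, addrlen) — associates the socket with a local address and port. Servers call it before listen() so that clients can reach a known port; for clients it is optional, because the kernel assigns an ephemeral port at connect(). Internals: inet_bind() calls the protocol's get_port hook (sk->sk_prot->get_port, which for TCP is inet_csk_get_port() in current kernels) to allocate or verify the port in the bind hash table.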

listen(fd, backlog) — marks the socket as passive (accepting connections). On Linux, backlog limits the completed-connection (accept) queue: connections that have finished the three-way handshake and are waiting for accept(). The kernel keeps a separate incomplete (SYN) queue for connections in SYN_RCVD (SYN received, SYN-ACK sent, awaiting the final ACK), sized by net.ipv4.tcp_max_syn_backlog. The backlog argument is silently capped at net.core.somaxconn (default 4096 since Linux 5.4, previously 128).

accept(fd, addr, addrlen) — dequeues a connection from the completed queue, returns a new socket fd. Blocks if queue is empty (or returns EAGAIN if O_NONBLOCK).

connect(fd, addr, addrlen) — for TCP: initiates three-way handshake. Returns when connection established (blocking mode) or immediately with EINPROGRESS (non-blocking). For UDP: sets default destination, enables send()/recv() instead of sendto()/recvfrom().

sendmsg(fd, msghdr, flags) / recvmsg(fd, msghdr, flags) — most general I/O calls, supporting:
- iov[] scatter-gather I/O (multiple buffers in one call)
- cmsg ancillary data (timestamps, credentials, TTL, TOS)
- msg_flags: per-message status filled in by recvmsg() (for example MSG_TRUNC, MSG_CTRUNC)
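
The scatter-gather part can be sketched as follows. send_two_parts is a hypothetical helper that hands a header and a body to the kernel in one sendmsg() call, so they are queued together without first being copied into a single user buffer. No ancillary data is attached in this sketch.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send header and body with a single sendmsg() call (gather write).
 * Returns the number of bytes queued, or -1 with errno set. */
static ssize_t send_two_parts(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2];
    struct msghdr msg;

    assert(hdr != NULL && body != NULL);
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);

    memset(&msg, 0, sizeof(msg));         /* no address, no control data */
    msg.msg_iov = iov;
    msg.msg_iovlen = 2;
    return sendmsg(fd, &msg, MSG_NOSIGNAL);
}
```

On a stream socket the return value can be smaller than the total length, so production code would loop over the remaining bytes.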

sendto / recvfrom — convenience wrappers for UDP, equivalent to sendmsg/recvmsg with simple address/buffer.
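
The calls described above can be exercised end to end in a single process over loopback. The sketch below is a minimal illustration, not a server design: loopback_roundtrip is a hypothetical helper, the listener binds to port 0 so the kernel picks a free port, and the blocking connect() returns as soon as the handshake has placed the connection in the accept queue.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Run socket/bind/listen/connect/accept/send/recv once over 127.0.0.1.
 * Copies what the accepted socket received into buf; returns its length or -1. */
static ssize_t loopback_roundtrip(const char *msg, char *buf, size_t buflen)
{
    ssize_t n = -1;
    int lfd = -1, cfd = -1, afd = -1;
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);

    assert(msg != NULL && buf != NULL);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                                /* kernel picks the port */

    lfd = socket(AF_INET, SOCK_STREAM, 0);            /* server: socket() */
    if (lfd < 0) goto done;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;
    if (listen(lfd, 16) < 0) goto done;
    if (getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0) goto done;

    cfd = socket(AF_INET, SOCK_STREAM, 0);            /* client: socket() */
    if (cfd < 0) goto done;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto done;

    afd = accept(lfd, NULL, NULL);                    /* new fd per connection */
    if (afd < 0) goto done;
    if (send(cfd, msg, strlen(msg), MSG_NOSIGNAL) < 0) goto done;
    n = recv(afd, buf, buflen, 0);
done:
    if (afd >= 0) close(afd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    return n;
}
```

Note that the listening descriptor and the accepted descriptor are distinct: closing afd ends one connection while lfd could keep accepting others.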

Socket State Lifecycle (TCP)

         socket()
            |
            v
         CLOSED
            |
   server: listen()    client: connect()
            |                  |
            v                  v
         LISTEN           SYN_SENT
            |                  |
     accept() returns    SYN-ACK received
     new ESTABLISHED fd       |
            |                  v
            +-------> ESTABLISHED
                           |
                     data transfer
                           |
               close() or shutdown(SHUT_WR)
                     /           \
              active close    passive close
                  /                 \
           FIN_WAIT_1            CLOSE_WAIT
                |                     |
           FIN_WAIT_2              LAST_ACK
                |                     |
           TIME_WAIT              CLOSED
                |
           (2*MSL, ~60-120s)
                |
           CLOSED

Inspect socket states:

ss -tan state established
ss -Htan state time-wait | wc -l   # count TIME_WAIT sockets (-H drops the header line)
ss -o state established '( dport = :80 or sport = :80 )'

Socket Options

SOL_SOCKET level options:

SO_REUSEADDR — allows bind() to succeed while old connections on the port are still in TIME_WAIT. Essential for servers that restart: without it, bind() fails with EADDRINUSE until those connections expire (TIME_WAIT lasts 60 seconds on Linux). It does NOT allow two sockets to listen on the same address and port at the same time (use SO_REUSEPORT for that).

SO_REUSEPORT — allows multiple sockets to bind to the same address:port. Kernel distributes incoming connections/datagrams across all sockets using a hash (Linux 3.9+). Enables multi-process accept() without lock contention — critical for nginx/envoy worker model.

SO_KEEPALIVE — enables TCP keepalive probes. Configured via:

sysctl net.ipv4.tcp_keepalive_time   # idle time before first probe (default 7200s)
sysctl net.ipv4.tcp_keepalive_intvl  # interval between probes (default 75s)
sysctl net.ipv4.tcp_keepalive_probes # number of unacknowledged probes before disconnect (default 9)

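Or per socket, once SO_KEEPALIVE is enabled: setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &val, sizeof(val)), with TCP_KEEPINTVL and TCP_KEEPCNT overriding the other two sysctls. (SOL_TCP is a Linux-specific alias for IPPROTO_TCP.)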
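
Put together, the per-socket override might look like the sketch below. enable_keepalive is a hypothetical helper; the three TCP_KEEP* options correspond to the three sysctls listed above and apply only to the socket they are set on.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on one TCP socket and override the sysctl defaults for it.
 * idle and intvl are in seconds, cnt is a probe count.
 * Returns 0 on success, -1 with errno set. */
static int enable_keepalive(int fd, int idle, int intvl, int cnt)
{
    int on = 1;

    assert(idle > 0 && intvl > 0 && cnt > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0)
        return -1;
    return 0;
}
```

With enable_keepalive(fd, 60, 10, 3), a silent peer would be declared dead roughly 60 + 3 * 10 = 90 seconds after the last traffic.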

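SO_RCVBUF / SO_SNDBUF — set receive/send buffer sizes. The kernel doubles the requested value to leave room for bookkeeping overhead, and getsockopt() reports the doubled figure (requesting 128 KB reports 256 KB; the extra space is for metadata, not payload). Hard ceiling: net.core.rmem_max / net.core.wmem_max. Setting either option explicitly also disables auto-tuning for that direction on the socket; for TCP, auto-tuning (net.ipv4.tcp_moderate_rcvbuf=1) usually outperforms manual sizing.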
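
The doubling is easy to observe by reading the option back. set_and_read_rcvbuf is a hypothetical helper; the exact value reported depends on the kernel and on net.core.rmem_max, so the sketch returns it rather than assuming a number.

```c
#include <assert.h>
#include <sys/socket.h>

/* Request a receive buffer size, then return the size the kernel reports.
 * Returns -1 with errno set on failure. */
static int set_and_read_rcvbuf(int fd, int requested)
{
    int actual = 0;
    socklen_t len = sizeof(actual);

    assert(requested > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
        return -1;
    return actual;   /* on Linux, typically twice the request, capped by rmem_max */
}
```

Comparing the return value with the request is a quick way to check whether rmem_max silently clipped a large buffer.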

SO_LINGER — controls close() behavior: with l_onoff=1, l_linger=0, close() sends RST instead of FIN (avoids TIME_WAIT). With l_linger>0, close() blocks until send buffer drains or timeout. Avoid l_linger=0 in production — it causes RST storms on server restart.

IPPROTO_TCP level options:

TCP_NODELAY — disables Nagle algorithm. Nagle coalesces small writes into larger segments (waits for ACK of previous segment before sending a new small segment). Disable for latency-sensitive protocols (interactive CLI, database queries): setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one))

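TCP_CORK — the opposite of TCP_NODELAY: the kernel holds partial segments until a full MSS is queued, the cork is removed, or a 200 ms ceiling expires. Use for HTTP response construction: cork, write headers, write body, uncork, so that headers and body leave in as few segments as possible.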

TCP_FASTOPEN — send data with the SYN packet using a TFO cookie, saving one RTT. Server: setsockopt(fd, IPPROTO_TCP, TCP_FASTOPEN, &qlen, sizeof(qlen)). Client: sendto() with MSG_FASTOPEN flag. Linux sysctl: net.ipv4.tcp_fastopen=3.

TCP_DEFER_ACCEPT — server does not wake accept() until data arrives with the connection (saves a wakeup for request/response protocols).

Socket Buffers

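The receive buffer (SO_RCVBUF) is the kernel memory holding data received from the network but not yet read by the application. As it fills, TCP advertises a shrinking receive window, down to zero when the buffer is full, which stalls the sender (flow control); incoming data is still acknowledged. UDP has no flow control, so excess datagrams are dropped.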

The send buffer (SO_SNDBUF) holds data written by the application but not yet acknowledged by the remote TCP. When it fills, write() blocks (or returns EAGAIN if non-blocking).

For TCP, auto-tuning adjusts buffer sizes dynamically based on the bandwidth-delay product:

/proc/sys/net/ipv4/tcp_rmem: min  default  max
                              4096 131072   6291456  (6MB default max)
/proc/sys/net/ipv4/tcp_wmem: min  default  max
                              4096 16384    4194304  (4MB default max)

Blocking vs Non-blocking, and I/O Multiplexing

Blocking (default): recv() blocks until data is available; send() blocks until buffer has space.

Non-blocking (O_NONBLOCK or SOCK_NONBLOCK at creation): operations return immediately with EAGAIN/EWOULDBLOCK if they cannot complete.
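
The EAGAIN behaviour can be demonstrated by writing to a non-blocking socket until its send buffer is full. In this sketch set_nonblocking and fill_send_buffer are hypothetical helpers; any connected stream socket works, including an AF_UNIX socketpair() whose peer never reads.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Switch an existing fd to non-blocking mode. Returns 0 or -1. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Write to fd until the kernel reports that the send buffer is full.
 * Returns total bytes queued, or -1 on an error other than EAGAIN. */
static long fill_send_buffer(int fd)
{
    char chunk[4096];
    long total = 0;

    assert(fd >= 0);
    memset(chunk, 'x', sizeof(chunk));
    for (;;) {
        ssize_t n = send(fd, chunk, sizeof(chunk), MSG_NOSIGNAL);
        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;                 /* buffer full: a blocking socket would sleep here */
        return -1;
    }
}
```

An event-driven server would react to the EAGAIN by registering interest in writability (EPOLLOUT) and resuming the write when the socket is reported writable again.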

I/O multiplexing allows one thread to wait on multiple file descriptors:

select(nfds, readfds, writefds, exceptfds, timeout) — O(n) per call, 1024 fd limit, copies fd_sets between user and kernel on every call. Obsolete for high-fd-count servers.

poll(fds[], nfds, timeout) — removes 1024 fd limit but still O(n) per call. Better than select but still copies the entire event array every call.

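epoll — O(1) per ready event. Level-triggered by default (an fd is reported for as long as it remains ready) or edge-triggered with EPOLLET (an fd is reported only when its readiness changes, so the handler must read or write until EAGAIN). The fd interest list lives in the kernel; epoll_wait() only returns ready events.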

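/* needs <sys/epoll.h>, <stdio.h>, <stdlib.h> */
int epfd = epoll_create1(EPOLL_CLOEXEC);
if (epfd < 0) { perror("epoll_create1"); exit(EXIT_FAILURE); }

/* Edge-triggered: sockfd must be non-blocking and is drained until EAGAIN */
struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = sockfd };
if (epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev) < 0) { perror("epoll_ctl"); exit(EXIT_FAILURE); }

struct epoll_event events[64];
int n = epoll_wait(epfd, events, 64, -1);  /* blocks; returns -1 with EINTR on a signal */
for (int i = 0; i < n; i++) {
    handle_fd(events[i].data.fd);          /* must read until EAGAIN under EPOLLET */
}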

io_uring (Linux 5.1+) — submits I/O operations to a ring buffer shared between kernel and userspace; completions are posted to a separate ring. Batching submissions and reaping completions from shared memory removes most per-operation syscall overhead for high-rate I/O:

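/* liburing: #include <liburing.h>, link with -luring */
struct io_uring ring;
if (io_uring_queue_init(256, &ring, 0) < 0)           /* returns -errno on failure */
    exit(EXIT_FAILURE);

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);   /* NULL if the submission queue is full */
if (sqe == NULL)
    exit(EXIT_FAILURE);
io_uring_prep_recv(sqe, sockfd, buf, sizeof(buf), 0);
sqe->user_data = (uint64_t)sockfd;                    /* echoed back in cqe->user_data */
io_uring_submit(&ring);

struct io_uring_cqe *cqe;
if (io_uring_wait_cqe(&ring, &cqe) == 0) {
    int result = cqe->res;  /* bytes received or -errno */
    io_uring_cqe_seen(&ring, cqe);
}
io_uring_queue_exit(&ring);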

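For networking, io_uring multishot receive (the IORING_RECV_MULTISHOT flag on an IORING_OP_RECV request, Linux 6.0+; io_uring_prep_recv_multishot() in liburing) submits one operation and keeps posting a completion for every subsequent receive without re-submitting. It draws its buffers from a provided-buffer group and stays armed until it is cancelled or fails.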

Historical Context

BSD sockets (4.2BSD, 1983) were designed by Bill Joy and Sam Leffler as an abstraction over ARPANET protocols. The design decision to use file descriptors was controversial — some argued for message-passing interfaces — but proved correct: it made network I/O composable with every Unix tool.

Linux implemented the BSD socket API in 0.99 (1993). epoll (Davide Libenzi, Linux 2.5.44, 2002) was a direct response to the "C10K problem": serving 10,000 concurrent connections on a single machine. With select/poll every call costs O(n) in the number of watched descriptors, which made C10K impractical; epoll's cost scales with the number of ready descriptors instead.

SO_REUSEPORT was added in Linux 3.9 (2013), years after BSD had it, to enable multi-process servers without a thundering herd on accept().

Production Examples

nginx worker model with SO_REUSEPORT:

# nginx.conf
worker_processes auto;
events {
    use epoll;
    multi_accept on;
}
http {
    server {
        listen 80 reuseport;   # one SO_REUSEPORT listening socket per worker
    }
}
# Each worker process independently accepts connections on port 80
# No accept mutex needed; kernel load-balances via SO_REUSEPORT

Go net.Conn performance tuning:

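// conn is a net.Conn from net.Dial or a listener's Accept
if tcp, ok := conn.(*net.TCPConn); ok {
    // Disable Nagle for RPC. Go already enables TCP_NODELAY by default
    // on TCP connections, so this call only makes the intent explicit.
    _ = tcp.SetNoDelay(true)

    // Increase buffers for bulk transfers (SO_RCVBUF/SO_SNDBUF, capped by rmem_max/wmem_max)
    _ = tcp.SetReadBuffer(1 << 20) // 1 MiB
    _ = tcp.SetWriteBuffer(1 << 20)
}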

Debugging Notes

# Show all listening sockets with process
ss -tlnp

# Show socket memory usage per socket
ss -tm state established

# Trace all socket syscalls for a PID and its threads
# (the network class also covers accept4, sendto, recvfrom, sendmsg, recvmsg)
strace -f -e trace=network -p <PID>

# Count TIME_WAIT sockets (per destination IP)
# With a state filter ss omits the State column, so the peer address is field 4
ss -Htan state time-wait | awk '{print $4}' | sed 's/:[^:]*$//' | sort | uniq -c | sort -rn | head

# Trace setsockopt() calls system-wide (pid, fd, level, option name)
bpftrace -e 'tracepoint:syscalls:sys_enter_setsockopt { printf("pid=%d fd=%d level=%d optname=%d\n", pid, args->fd, args->level, args->optname); }'

# Measure accept() queue depth
ss -lnt | awk '{print $2, $3, $4}'  # Recv-Q = pending connections

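On a LISTEN socket, Recv-Q is the number of connections that are fully established but not yet accept()ed, and Send-Q is the configured backlog limit. A Recv-Q that is consistently non-zero means the application is too slow to drain the accept queue; actual overflows are counted in TcpExtListenOverflows and TcpExtListenDrops (nstat -az).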

Security Implications

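SO_REUSEPORT hijacking: Any process with the same effective UID can bind to the same port with SO_REUSEPORT and receive a share of the connections. Mitigation: run the service under a dedicated UID, and use SO_ATTACH_REUSEPORT_CBPF / SO_ATTACH_REUSEPORT_EBPF (Linux 4.5) to attach a BPF program that selects which socket in the group receives each connection.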
CAP_NET_BIND_SERVICE: Binding to ports < 1024 requires this capability (or net.ipv4.ip_unprivileged_port_start sysctl in newer kernels).
Unix socket permissions: AF_UNIX path-based sockets are governed by filesystem permissions. Abstract sockets (@name) bypass filesystem — any process in the same network namespace can connect.
SO_PEERCRED: Obtain UID/GID/PID of connecting process on AF_UNIX sockets — use for authentication in local service communication.
TCP_MD5SIG: TCP MD5 option for BGP session authentication between routers — prevents RST injection attacks.
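
The SO_PEERCRED item above can be sketched as follows. peer_credentials and peer_is_same_user are hypothetical helpers, and the "same effective UID" policy is only an example of an authorization rule; struct ucred requires _GNU_SOURCE with glibc.

```c
#define _GNU_SOURCE                       /* exposes struct ucred in <sys/socket.h> */
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Fetch the credentials the kernel recorded for the peer of an AF_UNIX socket.
 * Returns 0 on success, -1 on failure. */
static int peer_credentials(int fd, struct ucred *cred)
{
    socklen_t len = sizeof(*cred);

    assert(cred != NULL);
    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, cred, &len) < 0)
        return -1;
    return len == sizeof(*cred) ? 0 : -1;
}

/* Example policy (an assumption for illustration): accept only peers that
 * run under our own effective UID. Returns 1 if allowed, 0 otherwise. */
static int peer_is_same_user(int fd)
{
    struct ucred cred;

    if (peer_credentials(fd, &cred) < 0)
        return 0;
    return cred.uid == geteuid();
}
```

The credentials are captured by the kernel when the connection (or socketpair) is created, so the peer cannot forge them the way it could forge a username sent in the payload.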

Performance Implications

Scenario	Recommendation
Many short-lived connections	`SO_REUSEADDR` + `SO_REUSEPORT` + `TCP_FASTOPEN`
High-throughput bulk transfer	Large `SO_SNDBUF`/`SO_RCVBUF` + `TCP_CORK`
Low-latency RPC	`TCP_NODELAY` + small `SO_SNDBUF` (avoid buffer bloat)
Many concurrent idle connections	`SO_KEEPALIVE` with short `TCP_KEEPIDLE` to reap dead connections
Accept-intensive server	`SO_REUSEPORT` with multiple accept threads, no accept mutex

The cost breakdown for a TCP recv():
- Syscall entry/exit: ~50–100 ns
- sk_buff dequeue: ~20–50 ns
- Data copy (kernel→user): ~1 ns/byte (limited by memory bandwidth)
- Zero-copy (TCP_ZEROCOPY_RECEIVE or io_uring on the receive path; MSG_ZEROCOPY is the send-side equivalent): eliminates the copy, adds completion-notification overhead

Failure Modes and Real Incidents

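Incident: Accept queue overflow (2019, e-commerce peak load)

net.core.somaxconn defaulted to 128. Under Black Friday traffic, the completed connection queue filled instantly. New connections were silently dropped: by default the kernel discards the final ACK and the client retries until it times out, while with net.ipv4.tcp_abort_on_overflow=1 the client receives an RST instead. Fix: sysctl -w net.core.somaxconn=65535 and a matching listen() backlog, since the smaller of the two values applies.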

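Incident: Nagle + delayed ACK interaction (classic)

Client sends a small request in two writes (header then body, no TCP_CORK). Nagle holds the second write until the first is ACKed, while the server's delayed-ACK timer (40 ms minimum on Linux, up to 200 ms) holds that ACK back because there is no response to piggyback it on yet. Result: up to 200 ms added latency per request. Diagnosis: tcpdump shows the second segment leaving only after the delayed ACK arrives. Fix: TCP_NODELAY on the client socket, or send header and body in a single call (writev() or sendmsg()).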

Failure Mode: fd exhaustion

A server that does not close() accepted sockets leaks fds. At the process fd limit (ulimit -n), accept() fails with EMFILE. Diagnosis: ls /proc/<pid>/fd | wc -l, watched over time. Fix: find and fix the leak; raising ulimit -n (for example to 1048576) adds headroom but only delays the failure if the leak remains.
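
The same diagnosis can be done from inside the process, for example to export the descriptor count as a metric. The sketch below assumes a Linux /proc filesystem; count_open_fds is a hypothetical helper.

```c
#include <assert.h>
#include <dirent.h>
#include <sys/socket.h>
#include <unistd.h>

/* Count this process's open descriptors by listing /proc/self/fd.
 * The descriptor used for the listing itself is excluded.
 * Returns -1 if /proc is not available. */
static int count_open_fds(void)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *ent;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;                     /* skip "." and ".." */
        count++;
    }
    closedir(dir);
    return count - 1;                     /* minus the fd opendir() held */
}
```

A count that grows steadily under constant load, and never drops when connections end, points to a missing close() on some code path.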

Modern Usage

io_uring is now used by production databases (RocksDB, ScyllaDB) and web servers; CPU savings in the range of 20–40% versus epoll have been reported for socket I/O, but the gain depends heavily on workload and batching
MSG_ZEROCOPY (Linux 4.14) is deployed at Google for large-payload services, reducing data copy overhead on 25Gbps+ NICs
eBPF socket programs (BPF_PROG_TYPE_SOCK_OPS) can modify TCP behavior per-connection: custom congestion control invocation, RTT measurement, custom options
AF_XDP with SO_PREFER_BUSY_POLL enables low-latency userspace packet processing for HFT and 5G RAN applications

Future Directions

io_uring send/recv zerocopy is under active development (Linux 6.x); combined with registered fixed buffers, it targets <1 µs socket round-trip latency
Multipath TCP (MPTCP) is stable in Linux 5.6+ — socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP) creates an MPTCP socket with automatic subflow management
QUIC and other UDP-based transports challenge the TCP socket model: AF_INET / SOCK_DGRAM with a userspace QUIC library (lsquic, quiche, msquic) is the dominant deployment pattern until kernel QUIC lands

Exercises

Write a minimal server using epoll in edge-triggered mode. Handle EAGAIN correctly. Explain why failing to drain the fd on EPOLLET causes missed events.
Create a SO_REUSEPORT server with 4 worker processes. Use ss -tlnp to verify all 4 processes are listening on the same port. Measure accept() latency per-worker under load with perf stat.
Demonstrate the Nagle + delayed ACK latency interaction: write a client that sends a 1-byte header and 1-byte body in two separate write() calls to a server with delayed ACK enabled. Measure RTT with tcpdump. Then fix it with TCP_NODELAY and measure again.
Trace do_accept() in the kernel using bpftrace to measure the time from when a connection is added to the accept queue until the application calls accept(). At what queue depth does latency increase significantly?
Using strace -c, measure the distribution of time spent in socket syscalls for a high-QPS HTTP server. Which syscall consumes the most time? How does switching to io_uring change the profile?

References

net/socket.c — sys_socket, sys_bind, sys_listen, sys_accept
net/ipv4/tcp.c — tcp_sendmsg, tcp_recvmsg
net/ipv4/inet_connection_sock.c — accept queue management
include/linux/net.h — struct socket, struct proto_ops
Stevens, W.R. Unix Network Programming Vol. 1: The Sockets Networking API. 3rd Ed.
Kerrisk, M. The Linux Programming Interface. Chapters 56–61 (Sockets).
Libenzi, D. Scalable Event Multiplexing: epoll vs. kqueue. 2002.
man 7 socket, man 7 tcp, man 7 ip, man 2 epoll_wait, man 2 io_uring_setup
Documentation/networking/ip-sysctl.rst — full sysctl reference