PostgreSQL Internals
Overview
PostgreSQL is one of the most sophisticated open-source systems ever built. Behind its SQL interface lies a process-per-connection model, a heap storage format with tuple-level MVCC (Multi-Version Concurrency Control), a WAL (Write-Ahead Log) that guarantees durability, a Volcano-model query executor, and an autovacuum subsystem that reclaims space while the database runs. Understanding these internals is essential for debugging production performance problems, choosing the right query patterns, and making tuning decisions that have first-order impact.
This document covers PostgreSQL from the process level down to the byte layout of individual tuples on disk, and back up through the query execution pipeline.
Prerequisites
- Understanding of relational databases: tables, indexes, transactions, ACID
- Basic operating system concepts: processes, shared memory, virtual memory, mmap
- Understanding of B-tree data structures
- Familiarity with SQL execution concepts: joins, aggregates, indexes
- Basic knowledge of WAL/journaling concepts
Historical Context
PostgreSQL traces its lineage to POSTGRES, developed at UC Berkeley by Michael Stonebraker et al. starting in 1986. POSTGRES introduced extensibility (user-defined types, operators, functions), row-level locking, and a rule system. The name "PostgreSQL" reflects the SQL support added in 1994 when two graduate students (Andrew Yu and Jolly Chen) replaced the POSTQUEL query language with SQL.
PostgreSQL was open-sourced in 1995 and has been developed by a global community (the PostgreSQL Global Development Group) ever since. Major milestones:
- 7.1 (2001): WAL (Write-Ahead Logging). Crash recovery became reliable.
- 7.4 (2003): Background writer, significant performance improvements.
- 8.0 (2005): Windows port, tablespaces, savepoints.
- 8.1 (2005): Two-phase commit, shared row-level locking improvements.
- 9.0 (2010): Hot Standby (read replicas), streaming replication.
- 9.3 (2013): Materialized views, lateral joins.
- 9.4 (2014): JSONB, logical decoding (foundation for CDC).
- 10 (2017): Declarative table partitioning, logical replication.
- 11 (2018): Parallel query improvements, JIT compilation (LLVM).
- 12 (2019): Generated columns, CTE inlining.
- 13 (2020): Parallel vacuuming, incremental sorting.
- 14 (2021): BRIN improvements, pg_stat_progress_* views.
- 15 (2022): Merge command, improved logical replication.
- 16 (2023): Logical replication from standby, pg_stat_io.
Process Model
PostgreSQL uses a process-per-connection model (no threading):
PostgreSQL Process Architecture:
┌──────────────────────────────────────────────────────────────┐
│ Postmaster (PID 1) │
│ │
│ Responsibilities: │
│ - Listen on port 5432 │
│ - Accept connections → fork() backend │
│ - Restart crashed backends │
│ - Manage shared memory │
│ - Start background workers │
└──────────────────────────────────────────────────────────────┘
│ fork() per client connection
│
┌────────────────────────────────────────────────────────────┐
│ Background Processes: │
│ │
│ bgwriter: Periodically writes dirty shared_buffers │
│ pages to disk (reduces checkpoint I/O spikes) │
│ │
│ checkpointer: Performs checkpoints (flushes all dirty pages)│
│ Updates pg_control, advances WAL checkpoint │
│ │
│ walwriter: Flushes WAL buffers to pg_wal (WAL files) │
│ │
│ autovacuum: Launcher → spawns autovacuum workers │
│ Runs VACUUM on tables with dead tuples │
│ │
│ stats collector: Aggregates statistics for pg_stat_* views │
│ │
│ logical replication worker: streams WAL to subscribers │
└────────────────────────────────────────────────────────────┘
│
┌───────────────────────────────────────────────────────┐
│ Shared Memory (mmap, kernel-managed): │
│ │
│ shared_buffers: PostgreSQL's page cache │
│ WAL buffers: pending WAL records before flush │
│ Lock table: all lock metadata (shared row-level locks)│
│ Clog: transaction status (committed/aborted/in-progress)
│ Commit timestamps, multixact info │
└───────────────────────────────────────────────────────┘
Each backend connects to same shared memory.
No data copying between connections for shared pages.
Lock manager in shared memory → backends coordinate locking.
Limitation: each connection = one OS process (~5-10MB overhead).
1000 connections = 10GB RSS overhead before any query work.
→ Use PgBouncer connection pooling.
Heap Storage: Page Layout
Every table is stored as a sequence of 8KB pages (blocks). Each page has a fixed layout:
PostgreSQL 8KB Page Layout:
┌───────────────────────────────────────────────────────────────┐
│ PageHeader (24 bytes) │
│ pd_lsn: 8 bytes (LSN of last WAL record for this page)│
│ pd_checksum: 2 bytes (page checksum if enabled) │
│ pd_flags: 2 bytes (page flags: ALL_VISIBLE, etc.) │
│ pd_lower: 2 bytes (offset to end of item pointer array) │
│ pd_upper: 2 bytes (offset to start of free space) │
│ pd_special: 2 bytes (offset to special area, 0 for heap) │
│ pd_pagesize: 2 bytes (page size, always 8192 for heap) │
│ pd_prune_xid: 4 bytes (oldest prunable transaction XID) │
├───────────────────────────────────────────────────────────────┤
│ ItemId Array (line pointer array) │
│ ItemId[0]: {offset:15, flags:2, length:15} → points to tuple│
│ ItemId[1]: {offset, flags, length} │
│ ... │
│ (grows downward from pd_lower toward pd_upper) │
├───────────────────────────────────────────────────────────────┤
│ Free Space │
│ (between pd_lower and pd_upper) │
├───────────────────────────────────────────────────────────────┤
│ Tuples (heap tuples) │
│ (packed upward from page end toward pd_upper) │
│ │
│ Tuple N: HeapTupleHeader + data │
│ Tuple 1: HeapTupleHeader + data │
└───────────────────────────────────────────────────────────────┘
Physical address of a tuple: (block_number, offset_within_page)
= ctid in PostgreSQL: e.g., (0, 3) = block 0, item pointer 3
HeapTuple: Tuple Header
HeapTupleHeader (23 bytes + padding):
Field Size Purpose
───────── ──── ───────
t_xmin 4B Transaction ID that inserted this tuple
t_xmax 4B Transaction ID that deleted/updated this tuple
(0 = not deleted, or ongoing UPDATE transaction)
t_cid 4B Command ID (for multiple commands in same tx)
OR t_xvac (VACUUM transaction ID, when in special use)
t_ctid 6B Physical location of newest version of this tuple
(self if current version; points to newer version if updated)
t_infomask2 2B Attribute count + flags (HOT update, etc.)
t_infomask 2B Null bitmap present? Has OID? Xmin committed/aborted?
t_hoff 1B Offset to start of user data (past header + null bitmap)
Then: optional null bitmap (1 bit per attribute if any NULLs)
Then: actual column data (type-specific encoding, possibly TOAST pointer)
Example tuple for: INSERT INTO users VALUES (1, 'Alice', 30)
t_xmin = 500 (transaction 500 inserted this row)
t_xmax = 0 (not deleted)
t_ctid = (0,1) (this is the current version)
data: [int32: 1][varchar: Alice][int32: 30]
MVCC: How Visibility Works
PostgreSQL implements MVCC without a separate undo log. Old versions of tuples remain in the heap until VACUUM reclaims them:
MVCC Visibility Rule:
A tuple is visible to transaction T if and only if:
t_xmin committed AND (t_xmin = T.xid OR t_xmin < T.snapshot_xmin)
AND
(t_xmax = 0 OR t_xmax aborted OR t_xmax > T.snapshot_xmax
OR (t_xmax = T.xid AND T.cid <= t_cid))
In plain English:
- The inserting transaction (xmin) must be committed
(not in-progress or aborted)
- The deleting transaction (xmax) must not have committed yet
from T's perspective (its xmax > T's snapshot or xmax aborted)
Snapshot: taken at transaction start (or statement start in READ COMMITTED)
snapshot = (xmin, xmax, xip_list)
xmin: oldest active transaction
xmax: next transaction ID to be assigned
xip_list: list of currently in-progress transactions
UPDATE creates a new tuple version:
UPDATE users SET age = 31 WHERE id = 1;
Before UPDATE (tx 500):
Tuple A: xmin=500, xmax=0, ctid=(0,1), data={1, Alice, 30}
During UPDATE (tx 600):
Tuple A: xmin=500, xmax=600, ctid=(0,2), data={1, Alice, 30} ← old
Tuple B: xmin=600, xmax=0, ctid=(0,2), data={1, Alice, 31} ← new
After tx 600 commits:
- T with snapshot before 600: sees Tuple A (age=30)
- T with snapshot after 600: sees Tuple B (age=31)
Dead tuples: Tuple A (xmax=600 committed) is dead for all future txs.
VACUUM must find and reclaim these dead tuples.
Visibility map:
1 bit per page: set if ALL tuples on page are visible to ALL transactions.
Index-only scans: can skip table heap access for pages in visibility map.
Autovacuum sets bits; VACUUM clears them when tuples are modified.
WAL (Write-Ahead Log)
WAL guarantees: before a data change is visible to other transactions,
its WAL record must be flushed to disk.
This allows crash recovery: replay WAL to reconstruct in-memory state.
WAL directory: $PGDATA/pg_wal/
WAL file naming: 000000010000000000000001 (24 hex chars = timeline + LSN segment)
File size: wal_segment_size (default 16MB)
LSN (Log Sequence Number):
- 64-bit monotonically increasing counter
- Identifies a position in the WAL stream
- Format: hexadecimal XX/XXXXXXXX (upper/lower 32 bits)
WAL record structure:
┌───────────────────────────────────────┐
│ XLogRecord header │
│ xl_tot_len: total record length │
│ xl_xid: transaction ID │
│ xl_prev: previous record LSN │
│ xl_info: record type flags │
│ xl_rmid: resource manager ID │
│ xl_crc: CRC32C of record │
├───────────────────────────────────────┤
│ Block data (changed pages) │
│ Main data (operation-specific info) │
└───────────────────────────────────────┘
Resource managers (rmid):
HEAP: heap tuple operations (INSERT, UPDATE, DELETE, HOT update)
BTREE: index operations
COMMIT: transaction commit/abort records
STANDBY: logical decoding info for hot standby
...
Checkpoint:
- All dirty shared_buffers pages written to disk
- WAL checkpoint record written: recovery can start from this LSN
- checkpoint_timeout (default 5min) or checkpoint_completion_target (0.9)
Recovery after crash:
1. Read pg_control: find last checkpoint LSN
2. Replay WAL from checkpoint LSN to end of WAL
3. Each WAL record tells which page was modified and what the change was
4. Redo operations to restore consistent state
5. Open for connections
Query Execution: Volcano Model
PostgreSQL uses a pull-based (volcano/iterator) execution model:
Query: SELECT u.name, count(o.id)
FROM users u JOIN orders o ON u.id = o.user_id
WHERE u.age > 25
GROUP BY u.name;
Plan tree (result of planner/optimizer):
Aggregate (Hash Aggregate on u.name)
│
└── Hash Join (u.id = o.user_id)
├── Seq Scan on users (filter: age > 25)
└── Seq Scan on orders
Execution (volcano pull model):
Each node has three methods:
Init(): initialize state
Next(): return next tuple (or NULL when done)
End(): cleanup
Execution starts when top node calls Next():
Aggregate.Next() calls HashJoin.Next()
HashJoin.Next() calls both children:
First pass: Seq Scan on orders → build hash table on user_id
Then: Seq Scan on users.Next() → probe hash table
Filter: age > 25 (skip tuples where false)
Match found → return (user.name, order.id) tuple
HashJoin returns tuple to Aggregate
Aggregate accumulates count(o.id) per u.name
When HashJoin exhausted: finalize aggregation, return result rows
This "pull" model: top node drives execution, each node pulls from below.
Simplifies plan composition (nodes are composable).
Row-at-a-time vs vectorized:
Volcano processes one tuple per Next() call → function call overhead.
PostgreSQL JIT (LLVM, 11+): compiles tight inner loops for filter evaluation,
eliminating per-tuple function dispatch overhead for hot paths.
Vectorized engines (DuckDB, Apache Arrow DataFusion) process batches of
tuples (1024-4096 at a time) → better CPU cache utilization, SIMD.
PostgreSQL does NOT use vectorized execution (as of 2024); it is a goal.
EXPLAIN ANALYZE
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT u.name, count(o.id)
FROM users u JOIN orders o ON u.id = o.user_id
WHERE u.age > 25
GROUP BY u.name;
-- Output:
HashAggregate (cost=234.50..236.50 rows=200 width=64)
(actual time=12.3..12.8 rows=150 loops=1)
Group Key: u.name
Buffers: shared hit=45 read=12
-> Hash Join (cost=58.75..217.00 rows=3500 width=32)
(actual time=0.8..9.2 rows=3500 loops=1)
Hash Cond: (o.user_id = u.id)
Buffers: shared hit=45 read=12
-> Seq Scan on orders (cost=0.00..134.00 rows=10000 width=8)
(actual time=0.02..3.4 rows=10000 loops=1)
Buffers: shared hit=34
-> Hash (cost=52.50..52.50 rows=500 width=32)
(actual time=0.7..0.7 rows=500 loops=1)
Buckets: 1024 Batches: 1
-> Seq Scan on users (cost=0.00..52.50 rows=500 width=32)
(actual time=0.01..0.5 rows=500 loops=1)
Filter: (age > 25)
Rows Removed by Filter: 1500
-- Key readings:
-- cost=first_row_cost..total_cost (in arbitrary cost units)
-- actual time=startup_ms..total_ms
-- Buffers: shared hit=X (from shared_buffers cache) read=Y (from OS/disk)
-- loops=N: node executed N times (inside nested loop joins)
-- Rows Removed by Filter: estimate accuracy (500 returned, 1500 filtered)
Autovacuum
MVCC leaves dead tuples in the heap. Autovacuum reclaims them:
Autovacuum Trigger:
For each table, autovacuum runs VACUUM when:
n_dead_tup > autovacuum_vacuum_threshold (50) +
autovacuum_vacuum_scale_factor (0.2) × n_live_tup
Example: table with 10,000 live rows
Trigger at: 50 + 0.2 × 10,000 = 2,050 dead tuples
VACUUM process:
1. Scan each page; identify dead tuples (xmax committed before OldestXmin)
2. Remove dead tuple pointers (mark ItemId as LP_DEAD)
3. Update free space map (for future INSERT reuse)
4. Update visibility map (mark pages as ALL_VISIBLE if safe)
5. Update pg_class.relfrozenxid (for XID freeze, every 200M transactions)
VACUUM FULL:
- Rewrites entire table to new file (like pg_repack)
- Returns space to OS (regular VACUUM returns space to table for reuse only)
- LOCKS TABLE exclusively — avoid on production hot tables
- Use pg_repack instead for online table compaction
XID Wraparound:
Transaction IDs are 32-bit (4 billion total).
If not frozen, a 2-billion-year-old transaction would be "newer" than
a current transaction (modular arithmetic).
VACUUM FREEZEs old tuples: marks xmin as "frozen" (special FrozenXID)
→ visible to all future transactions regardless of their snapshot.
If autovacuum fails for too long: PostgreSQL enters safe mode at
autovacuum_freeze_max_age - 500,000 transactions from wraparound.
Monitor XID age:
SELECT relname, age(relfrozenxid), relfrozenxid
FROM pg_class WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC LIMIT 10;
-- Alert if age > 1.5 billion
Connection Pooling: PgBouncer
Without PgBouncer:
1000 app threads → 1000 PostgreSQL backends
Each backend: ~5MB RSS + kernel resources
Total: 5GB RAM just for idle connections
Connection overhead: 5ms per new connection (fork + auth + setup)
PgBouncer architecture:
App threads → PgBouncer → PostgreSQL backends
(1000 clients) (proxy) (10-50 server connections)
Pool modes:
Session pooling:
Client gets dedicated server connection for entire session.
Use: applications that use SET search_path or prepared statements
that must persist across requests.
Concurrency limited by pool size.
Transaction pooling:
Server connection held only during transaction.
Released immediately after COMMIT/ROLLBACK.
1000 app threads → 10 server connections (1% concurrency typical)
Limitation: cannot use session-level features (SET, advisory locks,
WITH HOLD cursors, LISTEN/NOTIFY) without extra configuration.
BEST for typical stateless web applications.
Statement pooling:
Server connection held only during single statement.
Most restrictive; breaks multi-statement transactions.
Rarely used.
PgBouncer with SCRAM:
PgBouncer 1.18+ supports auth_type = scram-sha-256 passthrough.
Earlier versions required md5 — security risk.
Always use recent PgBouncer with scram-sha-256.
Performance Tuning
Critical configuration parameters:
shared_buffers (default: 128MB):
PostgreSQL's page cache. Set to 25% of RAM.
For 32GB server: shared_buffers = 8GB
Higher is better up to ~40% RAM; beyond that OS page cache helps more.
effective_cache_size (default: 4GB):
Planner hint: how much total memory is available for caching
(shared_buffers + OS page cache).
Set to 75% of RAM. Does NOT allocate memory — only affects query plans.
Higher value → planner prefers index scans over seq scans.
work_mem (default: 4MB):
Memory per sort/hash operation PER QUERY.
A complex query may use 10+ sort nodes × work_mem simultaneously.
For 32GB server with 100 connections: max total = 100 × 10 × 4MB = 4GB (unsafe!)
Rule: work_mem = RAM / (max_connections × max_parallel_workers)
Typical: 16-64MB. Increase for analytics workloads.
maintenance_work_mem (default: 64MB):
Memory for VACUUM, CREATE INDEX, ALTER TABLE.
Set to 256MB-2GB for large tables. Speeds up VACUUM significantly.
max_wal_size (default: 1GB):
Maximum WAL size between checkpoints.
Larger = less frequent checkpoints = better write throughput.
Tradeoff: crash recovery takes longer (more WAL to replay).
Production OLTP: 2-8GB. Analytics: 16GB.
checkpoint_completion_target (default: 0.9):
Spread checkpoint I/O over 90% of checkpoint_timeout.
Avoids I/O spikes at checkpoint boundaries.
Keep at 0.9 for production.
random_page_cost (default: 4.0):
Planner's estimate of random disk read cost.
For SSDs: set to 1.1-1.5 (random reads almost as fast as sequential).
For NVMe: set to 1.0.
Affects index vs seq scan choice: lower → planner prefers index scans.
wal_compression (default: off):
Compress WAL records using LZ4 or zstd.
Reduces WAL I/O by 50-80% for compressible workloads.
Enable with: wal_compression = lz4 (PostgreSQL 15+)
PostgreSQL Process Architecture Diagram
PostgreSQL Architecture Overview:
┌─────────────────────────────────────────────────────────────────┐
│ Client Application │
│ (psql / pgAdmin / Java app / Django) │
└─────────────────────────────────────────────────────────────────┘
│ TCP :5432 (or Unix socket)
▼
┌─────────────────────────────────────────────────────────────────┐
│ PgBouncer (optional connection pool) │
│ 1000 app connections → 20 server connections │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Postmaster (pid 1) │
│ - Accepts connections, forks backend processes │
└─────────────────────────────────────────────────────────────────┘
│ fork()
▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Backend (pid │ │ Backend (pid │ │Autovacuum │
│ 1234) │ │ 1235) │ │Worker │
│ query: │ │ query: │ │vacuuming │
│ SELECT... │ │ INSERT... │ │table users │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────┐
│ Shared Memory │
│ ┌─────────────────┐ ┌────────────────┐ ┌──────────────────┐ │
│ │ shared_buffers │ │ WAL buffers │ │ Lock table │ │
│ │ (page cache) │ │ (wal_buffers) │ │ (locks, lwlocks)│ │
│ │ 8KB pages │ │ │ │ │ │
│ └─────────────────┘ └────────────────┘ └──────────────────┘ │
│ ┌─────────────────┐ ┌────────────────┐ │
│ │ CLOG (commit │ │ MultiXact │ │
│ │ log: txn status)│ │ (shared row │ │
│ │ │ │ locks) │ │
│ └─────────────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ $PGDATA on disk: │
│ base/ ← table and index heap files (one dir per DB) │
│ pg_wal/ ← WAL segment files (16MB each) │
│ pg_clog/ ← transaction status (commit log) │
│ pg_multixact/ ← shared row lock info │
│ global/ ← pg_database, pg_auth, pg_control │
└─────────────────────────────────────────────────────────────────┘
Debugging Notes
-- Find slow queries (pg_stat_statements required):
SELECT query, calls, mean_exec_time, total_exec_time, rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
-- Find tables bloated with dead tuples:
SELECT schemaname, tablename, n_live_tup, n_dead_tup,
round(n_dead_tup::numeric / (n_live_tup + n_dead_tup + 1) * 100, 2) AS dead_pct,
last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_pct DESC;
-- Find queries with sequential scans on large tables (should be indexed):
SELECT relname, seq_scan, seq_tup_read, idx_scan, idx_tup_fetch
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC LIMIT 10;
-- Check for index bloat:
SELECT tablename, indexname, pg_size_pretty(pg_relation_size(indexname::regclass)) AS size
FROM pg_indexes
ORDER BY pg_relation_size(indexname::regclass) DESC LIMIT 20;
-- Check for lock waits (blocking queries):
SELECT blocked.pid, blocked.query, blocking.pid AS blocking_pid,
blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
-- Check checkpoint frequency (too frequent = I/O pressure):
SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time,
checkpoint_sync_time
FROM pg_stat_bgwriter;
-- checkpoints_req >> checkpoints_timed: max_wal_size too low
-- Check cache hit rate:
SELECT sum(blks_hit) * 100 / (sum(blks_hit) + sum(blks_read)) AS cache_hit_pct
FROM pg_stat_database;
-- < 95%: shared_buffers too small or working set > RAM
Security Implications
- PostgreSQL's
pg_hba.confcontrols authentication methods per connection source. Usescram-sha-256notmd5(md5 is broken). Rejecttrustauthentication except for local Unix socket superuser access. - The
SUPERUSERrole has unrestricted access to all databases, including file system access viaCOPY TO/FROMandpg_read_file(). Never run applications as superuser. - Row-level security (RLS) policies enforce data isolation within a table:
CREATE POLICY tenant_isolation ON orders USING (tenant_id = current_setting('app.tenant_id')). Essential for multi-tenant applications. pg_auditextension provides detailed audit logging of all DML/DDL operations — required for PCI-DSS, HIPAA, and SOC 2 compliance.- Connection string credentials must not appear in
pg_stat_activity.query— use parameterized queries, never string interpolation.
Performance Implications
- The process-per-connection model means idle connections consume ~5-10MB RSS. Above ~200 direct connections, PostgreSQL performance degrades due to process scheduling overhead. Always use PgBouncer in transaction mode for web applications.
- Sequential scans read the heap in block order and can use prefetching. Index scans do random I/O to the heap after reading the index. On SSDs, index scans are almost always faster; on HDDs, sequential scans can be competitive for >20% selectivity.
- Partial indexes (
CREATE INDEX ON orders(user_id) WHERE status = 'pending') dramatically reduce index size for selective conditions. A partial index on a 1% subset is 100× smaller than a full index. CLUSTER tablename USING indexnamerewrites the heap in index order (physically). After clustering, range scans on the clustered column do sequential I/O. The cluster ordering degrades as rows are added; re-run periodically.
Failure Modes
| Symptom | Cause | Fix |
|---|---|---|
| XID wraparound warning | Autovacuum not keeping up | VACUUM FREEZE on oldest table; fix autovacuum config |
| Table bloat (1GB real, 10GB on disk) | Dead tuples accumulating | Increase autovacuum aggressiveness; check for long txns |
| Checkpoint too frequent | max_wal_size too low | Increase max_wal_size to 2-8GB |
| OOM killer hitting PostgreSQL | work_mem × connections exhausts RAM | Reduce work_mem; use PgBouncer to limit connections |
| Long lock wait | Transaction holds lock while waiting for I/O | lock_timeout; statement_timeout; kill idle transactions |
| Autovacuum can't run | Long-running transactions holding back OldestXmin | Cancel long-running idle transactions; set idle_in_transaction_session_timeout |
Modern Usage
- Logical Replication: PostgreSQL 10+ supports logical replication, enabling CDC (Change Data Capture) to stream row-level changes to Kafka via Debezium, or to other PostgreSQL databases for OLAP offloading.
- Partitioning: Declarative table partitioning (range, list, hash) enables partition pruning in queries and parallel vacuum per partition. Used at scale for time-series data (partition by month).
- pg_partman: Extension for automatic partition management — creates future partitions on schedule, drops old ones based on retention policy.
- Citus: PostgreSQL extension that distributes tables across multiple nodes (sharding), turning a single PostgreSQL server into a distributed database. Used by Microsoft Azure for Azure Cosmos DB for PostgreSQL.
- TimescaleDB: Extension optimizing time-series queries: automatic chunking, column compression, continuous aggregates. Built on PostgreSQL's extension API.
Future Directions
- Vectorized execution (JIT improvements): The community is exploring batch/vectorized execution to complement row-at-a-time Volcano model, closing the performance gap with DuckDB and ClickHouse for analytics.
- Direct I/O: PostgreSQL 16+ experiments with
O_DIRECTfor WAL writes; full direct I/O for the buffer manager is a long-running project to bypass OS page cache and gain predictable I/O latency. - Incremental backup: pg_basebackup creates full base backups. Incremental backups (only changed blocks since last backup) are in development, critical for large databases where full backups take hours.
- In-memory columnar storage (AX/Oriole): Projects like Oriole DB add a columnar storage engine to PostgreSQL, enabling hybrid row/column storage within the same database.
Exercises
-
Create a table with 1 million rows. Run a query with EXPLAIN ANALYZE and identify if it does a Seq Scan or Index Scan. Add an appropriate index and verify the plan changes. Calculate the reduction in buffer reads from BUFFERS output.
-
Observe MVCC in action: in one session, start a transaction with
BEGIN. In another session, update a row. In the first session,SELECTthat row — verify you see the old version. Commit the second transaction. Verify the first session still sees the old version (snapshot isolation). -
Simulate table bloat: update every row in a 100k-row table 10 times. Check
pg_stat_user_tables.n_dead_tup. RunVACUUM VERBOSEand observe how many dead tuples are reclaimed. Compare table size before and after. -
Reproduce and resolve lock contention: in one session, run a long UPDATE. In another session, run an UPDATE on the same row. Observe the second query blocked via
pg_stat_activity. Usepg_blocking_pids()to identify the blocker. Cancel the blocking query and observe resolution. -
Configure PgBouncer in transaction mode. Connect 100 "clients" to PgBouncer. Verify via
pg_stat_activitythat only 5-10 server connections are active at any time. Measure connection overhead: time to establish 100 direct connections vs 100 PgBouncer connections.
References
- PostgreSQL source code: github.com/postgres/postgres (particularly: src/backend/storage/page/, src/backend/access/heap/, src/backend/access/transam/)
- "The Internals of PostgreSQL" — Hironobu Suzuki (interdb.jp) — free online book, extremely detailed
- PostgreSQL documentation: postgresql.org/docs/current/
- "PostgreSQL: Up and Running" — Regina Obe & Leo Hsu, O'Reilly
- "Explaining the Postgres Query Optimizer" — Bruce Momjian, PGCon talk
- PgBouncer documentation: pgbouncer.org
- "Autovacuum Tuning Basics" — PostgreSQL Wiki
- MVCC in PostgreSQL: postgresql.org/docs/current/mvcc.html
- "Heap Only Tuples (HOT)" — PostgreSQL Wiki (UPDATE optimization)
- "The Art of PostgreSQL" — Dimitri Fontaine (2019)
- Citus: citusdata.com
- TimescaleDB: timescale.com