07 — Query Planning and Optimization

Technical Overview

Query optimization is the process of transforming a declarative SQL query into an efficient execution plan. It is simultaneously one of the most intellectually rich areas of database engineering and one of the most practically impactful: a bad query plan can cause a query to run 1000x slower than an optimal one. The optimizer must search a combinatorial space of possible execution strategies — join orderings, join algorithms, index choices, parallel strategies — and select the one with the lowest estimated cost.

PostgreSQL's optimizer is a cost-based optimizer with rule-based rewrites as preprocessing. InnoDB's optimizer is similar in structure. Understanding how the optimizer works — and critically, where and why it fails — is essential for schema design, query tuning, and interpreting EXPLAIN ANALYZE output.

Prerequisites

Understanding of relational algebra (projection, selection, join, aggregation)
Familiarity with B+ tree and hash index structures
Basic statistics knowledge (histograms, distributions)
Understanding of join algorithms at a conceptual level

Core Content

Query Processing Pipeline

SQL Text
    |
    v
+-------------+
| Parser      |  Lexer + grammar: SQL text -> parse tree
+-------------+
    |
    v
+-------------+
| Analyzer    |  Name resolution: resolve table/column names to OIDs
+-------------+  Type checking, semantic validation
    |
    v
+-------------+
| Rewriter    |  Apply query rewrite rules: view expansion,
+-------------+  rule system, subquery flattening
    |
    v
+-------------+
| Planner     |  Generate logical plan
| (Optimizer) |  Enumerate physical plans
+-------------+  Cost estimation
    |            Select cheapest plan
    v
+-------------+
| Executor    |  Execute physical plan operators
+-------------+  Return rows to client

In PostgreSQL, these stages map to: - Parser: src/backend/parser/gram.y (bison grammar), scan.l (flex lexer) - Analyzer: src/backend/parser/analyze.c (transforms parse tree to Query node) - Rewriter: src/backend/rewrite/rewriteHandler.c - Planner: src/backend/optimizer/ (the bulk of the optimizer code) - Executor: src/backend/executor/

Logical Plan vs Physical Plan

A logical plan expresses the query in terms of relational algebra: it describes what to compute, not how. A physical plan specifies the exact algorithms for each operation.

Query: SELECT e.name, d.dept_name
       FROM employees e JOIN departments d ON e.dept_id = d.id
       WHERE e.salary > 50000

Logical Plan:
   π(name, dept_name)
       |
       σ(salary > 50000)
       |
       ⋈(dept_id = id)
      / \
   Emp   Dept

Physical Plan (one possibility):
   Project(name, dept_name)
       |
   Hash Join (dept_id = id)
   /                   \
Seq Scan (employees)   Index Scan (departments, id)
Filter: salary > 50000

Relational Algebra Operations

The core operations and their physical implementations:

Logical Op	Physical Implementations
Scan	Sequential Scan, Index Scan, Index-Only Scan, Bitmap Scan
Filter (σ)	Applied inline during scan
Project (π)	Applied inline, or as separate Materialize node
Join (⋈)	Nested Loop Join, Hash Join, Merge Join, Index NL Join
Aggregate	HashAggregate, SortAggregate (GroupAggregate in PG)
Sort	External Merge Sort (quicksort for in-memory)
Group By	HashAggregate or GroupAggregate (requires sorted input)

Join Algorithms

Nested Loop Join (NLJ):

for each outer row o:
    for each inner row i:
        if o.key == i.key: emit (o, i)

Cost: O(|outer| * |inner|)
Best when: inner is small, or inner has an index on join key.

Index Nested Loop Join (INLJ):

for each outer row o:
    index_lookup(inner, o.key) -> matching inner rows
    emit (o, matching_inner_rows)

Cost: O(|outer| * log(|inner|))  [with B+ tree index]
Best when: inner has an index on join key; outer is small or selective.

Hash Join:

Build phase: hash all inner rows into a hash table on join key
Probe phase: for each outer row o, probe hash table with o.key

Cost: O(|outer| + |inner|)  [build + probe, I/O-wise]
Best when: no index on inner, moderate to large table sizes.

Grace Hash Join (for tables that don't fit in memory):

Partition phase: partition outer into O_1,...,O_k and inner into I_1,...,I_k
                 using the same hash function on join key
Join phase: for i in 1..k: hash join O_i with I_i (now fit in memory)

Merge Join (Sort-Merge Join):

Sort outer on join key (or use index scan for presorted data)
Sort inner on join key
Merge-scan both sorted sequences

Cost: O(|outer| log |outer| + |inner| log |inner|) + O(|outer| + |inner|)
Best when: both inputs are presorted (e.g., via index), or large tables with no index.

PostgreSQL's join algorithm selection is made by src/backend/optimizer/path/joinpath.c. The functions hash_inner_and_outer(), sort_inner_and_outer(), and match_unsorted_outer() create candidate join paths.

Cost Model

The PostgreSQL cost model estimates the total cost of a plan as a combination of: - I/O cost: number of sequential page reads (seq_page_cost=1.0) and random page reads (random_page_cost=4.0 default, but should be 1.1 on NVMe SSD). - CPU cost: number of tuple processing operations (cpu_tuple_cost=0.01) and operator evaluations (cpu_operator_cost=0.0025).

Sequential Scan cost:
  cost = seq_page_cost * relpages + cpu_tuple_cost * reltuples

Index Scan cost (simplified):
  cost = random_page_cost * pages_fetched + cpu_tuple_cost * tuples_fetched

The absolute values of these parameters matter less than their ratios. Setting random_page_cost=1.1 on an NVMe system tells the optimizer that random I/O is nearly as fast as sequential I/O, favoring index scans more aggressively.

Statistics: pg_statistic, Histograms, MCVs

The optimizer relies on statistics about column data distributions to estimate cardinalities (how many rows a filter or join will produce). PostgreSQL stores statistics in pg_statistic (updated by ANALYZE).

Key statistics per column: - null_frac: fraction of NULLs - avg_width: average byte width of a column value - n_distinct: number of distinct values (negative means fraction of total: -0.1 means 10% distinct) - most_common_vals (MCV): list of most common values + their frequencies - histogram_bounds: bucket boundaries for non-MCV values - correlation: physical ordering correlation (1.0 = perfectly sorted, 0 = random)

Example stats for 'status' column:
  most_common_vals = {'active', 'inactive', 'pending'}
  most_common_freqs = {0.70, 0.25, 0.04}
  null_frac = 0.01

Query: WHERE status = 'active'
  Estimated fraction: 0.70 (from MCV)
  Estimated rows: 0.70 * table_cardinality

pg_stats is a view over pg_statistic with human-readable output. ANALYZE VERBOSE my_table shows statistics collection progress. ALTER TABLE t ALTER COLUMN c SET STATISTICS 500 increases the MCV list and histogram resolution (default: 100 buckets).

Cardinality Estimation Errors and Their Impact

Cardinality estimation is the optimizer's Achilles heel. Systematic underestimation of join result sizes causes: - Nested loop join chosen instead of hash join (NLJ is catastrophic for large results) - Wrong join ordering (inner and outer sides flipped) - Index scans chosen when sequential scan would be faster (for large fractions of the table)

Common sources of estimation error: - Correlated columns: WHERE city='Seattle' AND state='WA' — the optimizer assumes independence, underestimates if city values are highly correlated with state. - Function application: WHERE upper(name) = 'ALICE' — statistics are for name, not upper(name). - JOIN cardinality: Joining two large tables through complex predicates; errors compound multiplicatively. - Out-of-date statistics: After large bulk loads, statistics may be stale. Auto-analyze updates statistics when ~10% of rows change.

PostgreSQL 14+ introduced extended statistics (CREATE STATISTICS) for multi-column correlations. CREATE STATISTICS stat1 (dependencies) ON city, state FROM addresses; teaches the optimizer about the correlation.

Join Ordering: Dynamic Programming and GEQO

For n tables in a query, there are O(n!) possible join orderings. PostgreSQL uses dynamic programming (the System R approach, Selinger et al. 1979) for up to join_collapse_limit tables (default 8).

The DP approach: for each subset S of tables, find the optimal plan to compute the join of all tables in S. Build up from single-table plans to full join plans using optimal substructure.

DP Join Ordering (simplified):

Plans[{A}] = best scan of A
Plans[{B}] = best scan of B
Plans[{A,B}] = best join of Plans[{A}] and Plans[{B}]
Plans[{A,B,C}] = min(
    join(Plans[{A,B}], Plans[{C}]),
    join(Plans[{A,C}], Plans[{B}]),
    join(Plans[{B,C}], Plans[{A}])
)

For more than geqo_threshold tables (default 12), PostgreSQL falls back to GEQO (Genetic Query Optimizer): a genetic algorithm that evolves a population of join orderings toward locally optimal solutions. GEQO is non-deterministic and can produce different plans across runs.

Query Plan Tree Diagram

EXPLAIN ANALYZE output interpretation:
  SELECT e.name, sum(o.amount)
  FROM employees e JOIN orders o ON e.id = o.employee_id
  WHERE e.dept = 'Engineering'
  GROUP BY e.name;

                     QUERY PLAN
  +-----------------------------------------+
  | HashAggregate (cost=850..900 rows=500)  |  <- aggregate on name
  |   Group Key: e.name                     |
  |   -> Hash Join (cost=200..700)          |  <- join employees + orders
  |        Hash Cond: (o.emp_id = e.id)     |
  |        -> Seq Scan on orders (cost=50)  |  <- full scan (build side)
  |        -> Hash                          |
  |             -> Index Scan on employees  |  <- index on dept (probe side)
  |                  Filter: dept='Engg'    |
  +-----------------------------------------+

Key metrics in EXPLAIN ANALYZE:
  (cost=startup..total) - estimated costs
  (rows=N) - estimated output rows
  (actual time=start..end ms) - measured timing
  (actual rows=N loops=M) - measured rows and loop count
  Rows Removed by Filter: X  <- important for detecting selectivity errors

A large discrepancy between rows=estimated and actual rows=measured is the primary diagnostic signal for optimizer estimation failures.

Rule-Based vs Cost-Based Optimization

PostgreSQL uses both:

Rule-based rewrites (applied always, before cost estimation): - View expansion (replace view reference with view definition) - Subquery to join conversion (IN (SELECT ...) → semi-join) - Constant folding (WHERE 1+1 > 1 → WHERE TRUE) - Outer join elimination (when foreign key constraints are known)

Cost-based optimization (applied after rewrites): - Choose among multiple access methods (seq scan vs index scan vs bitmap scan) - Choose join algorithm and join order - Choose aggregation method (hash vs sort-based) - Choose parallelism (parallel sequential scan, parallel hash join)

InnoDB's MySQL optimizer is similar in structure but differs in important details: MySQL uses a range-based single-table optimizer for predicate analysis, and the join optimizer uses a cost-based left-deep join tree search with heuristic pruning.

Adaptive Query Execution

Static query plans fail when cardinality estimates are wrong. Adaptive Query Execution (AQE) re-optimizes the plan at runtime based on actual statistics observed during execution.

Apache Spark AQE (since Spark 3.0): - Dynamically coalesces shuffle partitions based on actual data sizes - Dynamically switches join strategies (sort-merge join → broadcast join if one side turns out small) - Dynamically optimizes skew joins

Presto/Trino implements adaptive partitioning and dynamic filtering.

PostgreSQL does not yet have AQE for intra-query replanning, but has parallel query plans that adapt worker count.

Historical Context

The modern cost-based query optimizer originates with the IBM System R optimizer (Selinger et al., 1979), which introduced the concepts of: access path selection, cardinality estimation, dynamic programming for join ordering, and interesting orders. The System R paper, "Access Path Selection in a Relational Database Management System," remains required reading.

Starburst (IBM, 1985-1995) introduced extensible rule-based query rewriting, influencing PostgreSQL's rewriter. Volcano/Cascades (Graefe, 1993-1995) introduced the Memo structure and rule-based physical transformation, used in SQL Server and CockroachDB.

Production Examples

PostgreSQL planner: src/backend/optimizer/plan/planner.c is the entry point. subquery_planner() handles top-level query planning. make_one_rel() plans the FROM clause. grouping_planner() handles GROUP BY, ORDER BY, LIMIT.

MySQL optimizer: sql/sql_optimizer.cc. JOIN::optimize() is the main optimization entry point. make_join_statistics() and best_extension_by_limited_search() implement the join ordering search.

CockroachDB optimizer: Uses the Cascades framework (pkg/sql/opt/). Rules are expressed in a DSL (Optgen) and compiled to Go code. The Memo data structure stores all equivalent plan fragments.

Debugging Notes

EXPLAIN ANALYZE: Always use EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) in PostgreSQL — BUFFERS shows buffer pool hits/misses, critical for diagnosing I/O vs CPU bottlenecks.
Stale statistics: SELECT relname, last_analyze, n_live_tup, n_dead_tup FROM pg_stat_user_tables — if last_analyze is old relative to DML volume, run ANALYZE.
Estimation errors: Large discrepancy between rows=X and actual rows=Y in EXPLAIN ANALYZE indicates estimation failure. Create extended statistics or expression indexes.
Disable join type: SET enable_hashjoin=off; to force the planner to avoid hash joins (for testing). Never do this in production without understanding the cost.
Trace planner: PostgreSQL debug_print_parse, debug_print_rewritten, debug_print_plan GUC settings log intermediate query representations.
MySQL EXPLAIN FORMAT=JSON: Provides more detail than default EXPLAIN including per-table filtered percentage and actual row estimates.

Security Implications

Plan cache poisoning: Prepared statements cache execution plans. If table statistics change dramatically after a plan is cached, the cached plan may be catastrophically bad. PostgreSQL uses generic vs custom plans dynamically and invalidates plans when statistics change. Ensure plan_cache_mode=auto (default).
EXPLAIN privilege: In PostgreSQL, EXPLAIN SELECT requires only SELECT privilege on the queried tables, not EXPLAIN-specific privilege. This is generally safe. However, EXPLAIN ANALYZE actually executes the query — do not run EXPLAIN ANALYZE DELETE ... accidentally.
Information leakage via timing: Query execution time reveals information about data volume and index structure. This is a minor side channel in most threat models but relevant for compliance-sensitive deployments.

Performance Implications

random_page_cost tuning: The single most impactful planner configuration parameter. On HDD: random_page_cost=4.0 (default). On SSD: random_page_cost=1.5. On NVMe: random_page_cost=1.1. Incorrect values cause systematic preference for sequential scans or index scans.
Parallel query: PostgreSQL 9.6+ supports parallel sequential scans. max_parallel_workers_per_gather (default 2) controls parallelism. For large analytical queries, set to 4-8. min_parallel_table_scan_size controls when parallel scan activates.
work_mem: Each sort or hash join operation can use up to work_mem (default 4MB) of memory for its in-memory data structure. For complex queries with many sorts/joins, actual memory usage can be work_mem * (number_of_operations). Setting this too low causes external sort; too high causes memory pressure.
Join order sensitivity: For queries with many tables, small cardinality estimation errors compound into large plan cost differences. Use pg_hint_plan (PostgreSQL extension) or USE INDEX / STRAIGHT_JOIN (MySQL) to override specific planner decisions.

Failure Modes

Join reordering explosion: join_collapse_limit=8 (default). A query joining 20 tables exceeds this and triggers GEQO, which may produce a suboptimal non-deterministic plan.
Statistics lag: After a large INSERT/DELETE, statistics are stale until next ANALYZE. The optimizer sees old histograms and produces wrong cardinality estimates. Enable autovacuum aggressively.
Nested loop on large result sets: If the optimizer underestimates join cardinality, it may choose a nested loop join that is O(n²) on what is actually a large result. Symptoms: query runs for hours when it should take seconds. Fix: SET enable_nestloop=off; temporarily, then diagnose the statistics issue.
Bitmap index scan regression: PostgreSQL's bitmap scan is efficient for medium selectivities, but can be slower than sequential scan for high selectivities if the bitmap itself is large. The optimizer estimates this, but heap correlation affects the estimate.

Modern Usage

Modern query optimizers face the challenge of increasingly complex queries (100+ table joins in analytical workloads) and distributed execution (multi-node query planning where network costs must be modeled).

Learned cardinality estimators: Naru (Yang et al., 2019), MSCN (Kipf et al., 2018), and NeuroCard (Yang et al., 2020) use neural networks to estimate join cardinalities more accurately than histograms. Some are integrated experimentally into PostgreSQL and SQL Server.

Vectorized execution: DuckDB and Velox (Meta) use vectorized (batch-at-a-time, SIMD-friendly) execution engines rather than the tuple-at-a-time Volcano model. The query planner must account for batch size in cost models.

Future Directions

End-to-end learned optimizers: Bao (Marcus et al., 2021) uses a bandit learning approach to select query plans, learning from execution feedback. It selects among PostgreSQL's generated plans rather than generating plans itself.
LLM-assisted query optimization: Using large language models to suggest index creation, rewrite queries, and explain query plans in natural language. GitHub Copilot and Atlas (MongoDB) are early examples.
Unified OLTP/OLAP optimization: HTAP systems need optimizers that can seamlessly choose between row-store indexes and column-store scans within a single query plan.

Exercises

Write a 5-table join query in PostgreSQL. Run EXPLAIN ANALYZE and identify which join algorithm was chosen for each join. Then force different join algorithms using SET enable_hashjoin=off and observe cost changes.
Create a table with 1M rows and a correlated two-column predicate (e.g., city and zip code). Run a query filtering on both columns. Observe cardinality estimation error in EXPLAIN. Create CREATE STATISTICS ... (dependencies) and re-observe.
Benchmark the impact of random_page_cost on index vs sequential scan choice: set random_page_cost=4.0, 2.0, and 1.1. Run a query with ~5% selectivity and observe the plan change.
Implement a simplified dynamic programming join optimizer in Python: given a set of tables with cardinalities and join predicates, find the optimal left-deep join order by minimizing estimated intermediate result sizes.
Use pg_stat_statements to find the top-5 queries by total execution time in a running PostgreSQL instance. Use EXPLAIN ANALYZE to understand their plans and identify optimization opportunities.

References

Selinger, P.G., et al. (1979). Access Path Selection in a Relational Database Management System. SIGMOD 1979.
Graefe, G. (1993). The Volcano Model of Query Evaluation. IEEE Data Engineering Bulletin, 16(4).
Graefe, G. (1995). The Cascades Framework for Query Optimization. IEEE Data Engineering Bulletin, 18(3).
Ioannidis, Y. (1996). Query Optimization. ACM Computing Surveys, 28(1).
Leis, V., et al. (2015). How Good Are Query Optimizers, Really? PVLDB 2015.
Marcus, R., et al. (2021). Bao: Making Learned Query Optimization Practical. SIGMOD 2021.
PostgreSQL Optimizer README: src/backend/optimizer/README