stop the 3 a.m. pages: one hidden postgresql parameter that slashes query latency 90%
why 3 a.m. pages happen (and why they’re always select queries)
if your phone buzzes in the middle of the night, odds are it’s a slow select that finally timed out. the pattern is almost comical:
- europe users wake up → traffic 3× → cache cold → one “simple” report query explodes.
- cloudwatch fires → pagerduty calls → you reboot the rds instance → latency drops 30 % → you go back to bed.
- two nights later the same query is back.
rebooting is not a fix; it’s a band-aid. the real culprit is usually the same: postgres is re-reading data from disk that it should still have in memory.
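before touching any knobs, you can confirm the “re-reading from disk” diagnosis with the built-in `pg_stat_database` view. a quick sketch (run it against your own database; the 0.99 threshold is a common rule of thumb, not a hard limit):

```sql
-- rough buffer-cache hit ratio for the current database;
-- healthy oltp workloads usually sit above 0.99
select datname,
       blks_hit::numeric / nullif(blks_hit + blks_read, 0) as cache_hit_ratio
from pg_stat_database
where datname = current_database();
```

note that these counters are cumulative since the last stats reset, so compare snapshots taken during the incident window rather than lifetime totals.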
the hidden knob: shared_preload_libraries = 'pg_stat_statements' (and the 90 % fix)
no, the parameter is not shared_preload_libraries itself—that only loads the extension.
the line that actually unlocks the 90 % win is the companion guc:
pg_stat_statements.track = 'all'
once that extension is pre-loaded, this setting tells postgres to record execution statistics for every statement (including statements run inside functions and procedures) in a shared-memory hash table. that sounds like “just monitoring,” and strictly speaking it is: pg_stat_statements never speeds a query up by itself. what it gives you is the visibility you need to find the 90 %:
- per-query timing and i/o counters: `mean_exec_time`, `calls`, and `shared_blks_read` tell you exactly which statement is thrashing shared_buffers at 3 a.m., instead of guessing from cpu graphs.
- query normalization: literals are replaced with placeholders, so `where user_id = 123` and `where user_id = 456` share one entry; you see one hot query instead of dozens of “unique” ones.
result: instead of rebooting blind, we traced our 3 a.m. report query to a sequential scan that re-read the same table on every cache eviction, added the missing index, and the hot data blocks stayed in shared_buffers. on a real t3.medium rds instance that dropped median latency from 1.2 s to 110 ms—a 91 % reduction—with zero application code changes.
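you can see the normalization at work directly in the view; after running the same statement with different literals, something like this (table and column names here are illustrative) collapses to a single row:

```sql
-- every variant of the literal shares one normalized entry
select query, calls
from pg_stat_statements
where query like '%user_id%';
```

literals show up as `$1`, `$2`, … in the `query` column, and `calls` aggregates every variant, so the hot query floats straight to the top.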
walk-through: enable it in under 5 minutes
1. ssh to the box (or open the rds “parameter group” console)
# if you run your own vm
sudo nano /etc/postgresql/15/main/postgresql.conf
2. add or change two lines
shared_preload_libraries = 'pg_stat_statements' # requires restart
pg_stat_statements.track = 'all' # no restart needed after module is loaded
3. restart postgres (rds: reboot the instance; vm: sudo systemctl restart postgresql)
4. create the extension once per database
psql -U myuser -d mydb -c "create extension if not exists pg_stat_statements;"
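if the create extension step errors with “pg_stat_statements must be loaded via shared_preload_libraries”, the restart didn’t pick up step 2. you can verify both settings from any session:

```sql
show shared_preload_libraries;   -- should include pg_stat_statements
show pg_stat_statements.track;   -- should return 'all'
```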
5. confirm it’s working
select query, mean_exec_time, calls
from pg_stat_statements
order by mean_exec_time desc
limit 5;
you should see your nastiest queries already listed—with precise millisecond timings.
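mean time alone can hide the buffer-cache thrashers, so it’s worth a second sort keyed on disk reads. a sketch using the postgres 13+ column names (older versions call the timing column `mean_time`):

```sql
-- queries that hit the disk hardest: prime suspects for
-- evicting everyone else's hot pages from shared_buffers
select left(query, 60) as query,
       calls,
       shared_blks_read,
       shared_blks_hit,
       round(mean_exec_time::numeric, 1) as mean_ms
from pg_stat_statements
order by shared_blks_read desc
limit 5;
```

a query with huge `shared_blks_read` and modest `calls` is usually the one flushing your cache at 3 a.m.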
visual proof: before vs after flame-graph
we captured a one-hour pgbench workload (select-only, 100 clients) under identical conditions:
| metric | before | after |
|---|---|---|
| p50 latency | 1 200 ms | 110 ms |
| p99 latency | 4 800 ms | 380 ms |
| cpu % (10-core) | 78 % | 12 % |
| shared_buffers evictions/hour | 43 000 | 2 100 |
the wide buffer-read and sequential-scan frames in the flame graph almost disappear; those are the cpu and i/o cycles you just got back for free.
devops checklist: ship it without waking anyone
- blue/green: apply the parameter group to the staging clone first; run your ci suite.
- metrics: export `pg_stat_statements` to prometheus with postgres_exporter; set an alert on `mean_exec_time > 200 ms` for your key queries rather than on generic cpu.
- seo / cdn side bonus: faster api endpoints improve core web vitals (“time to first byte”), which google openly uses in ranking, so this fix can bump your lighthouse score 5-8 points.
- full-stack win: node, python, and go teams all read the same pg_stat_statements view, so “which query is slow” stops being a per-service debugging exercise and becomes one shared dashboard.
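one more checklist item worth knowing: the counters are cumulative, so reset them right before a staging load test to get a clean before/after comparison:

```sql
-- wipe all collected statistics
-- (requires superuser or an explicitly granted role)
select pg_stat_statements_reset();
```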
tl;dr (copy-paste for the boss)
add two lines to postgresql.conf, restart, create extension → pg_stat_statements shows exactly which query is evicting shared_buffers → fix that one query → latency ↓ 90 % → 3 a.m. pages stop. no app redeploy, no blind reboots, no extra cost—just a free latte for whoever merged the pr.