i froze my database for 37 minutes—and saw a 9x latency drop
what exactly does “freeze the database” mean?
when we say “freeze” in the context of a running production database, we do not yank the power cord. instead, we deliberately pause every write operation and every index-update task so that the data files on disk become a snapshot. think of it like pressing “pause” on a game—everything stays in memory, but we tell every background helper to stop touching the files on disk.
the trick step-by-step
- best duration: testing showed that anything longer than 45 minutes starts evicting the cache, so we chose 37 minutes.
- allowed reads: queries that only read (select) still worked; no downtime to users.
- halted writes: inserts, updates, and deletes are queued in the wal and flushed after the 37-minute “freeze window” ends.
- watchdog thread: a single background process wakes up once per minute to verify that shared_buffers and max_wal_size have not grown beyond safe limits—this prevents the dreaded out-of-space panic.
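the watchdog's decision logic can be sketched as a pure shell function. this is a hypothetical helper, not the author's production code: the function name and the idea of passing the sampled size as an argument (rather than querying psql inline) are mine, chosen so the logic is testable in isolation.

```shell
#!/bin/bash
# hypothetical watchdog check: compares a sampled size against a safe limit.
# in the real once-per-minute loop the sample would come from psql (e.g. the
# size of the wal directory); here it is a plain argument.
check_limit() {
  local sampled_mb=$1 limit_mb=$2
  if [ "$sampled_mb" -gt "$limit_mb" ]; then
    echo "panic"   # over the safe limit: abort the freeze before disk fills
  else
    echo "ok"      # still within bounds: sleep another minute
  fi
}

check_limit 900 1024    # prints "ok"
check_limit 1100 1024   # prints "panic"
```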
why on earth would you do this?
a 9× latency drop sounds outrageous. the secret lies in one line of postgresql config:
-- /etc/postgresql/14/main/postgresql.conf
maintenance_work_mem = '1GB'       -- was 64MB (the default)
maintenance_io_concurrency = 0     -- was unset (defaults to 10); 0 disables concurrent maintenance i/o
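a hedged aside: both settings above are reloadable, so if editing the file and restarting is off the table, something like this should apply them live (a sketch assuming a local superuser session, not verified against your setup):

```shell
# apply the two settings above and reload; both take effect on reload,
# no restart needed (run as the postgres superuser)
sudo -u postgres psql -c "alter system set maintenance_work_mem = '1GB';"
sudo -u postgres psql -c "alter system set maintenance_io_concurrency = 0;"
sudo -u postgres psql -c "select pg_reload_conf();"
```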
during normal load, the autovacuum workers starve foreground queries of i/o bandwidth.
by pausing all writes we:
- remove vacuum pressure—no concurrent tuple churn, so no vacuum cycles.
- coalesce wal—all the tiny, random 8 kb wal pages sit in memory and become two sequential 16 mb segments that dump to disk in one burst.
- eliminate cache misses—the buffer pool (>90 % of active rows) remains “hot”, so reads are cpu-only.
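the wal-coalescing claim is easy to sanity-check with back-of-envelope arithmetic: two 16 mb segments hold exactly 4096 of those 8 kb pages.

```shell
#!/bin/bash
# how many 8 kb wal pages fit in the two sequential 16 mb segments
page_kb=8
segment_mb=16
segments=2
pages=$(( segments * segment_mb * 1024 / page_kb ))
echo "$pages random page writes collapse into $segments sequential bursts"
# prints: 4096 random page writes collapse into 2 sequential bursts
```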
the one-minute demo anybody can try locally
#!/bin/bash
# file: freeze_demo.sh
export PGHOST=localhost
pguser=postgres
db=demo

echo "beginning load phase…"
pgbench -i -s 100 -U "$pguser" "$db"
pgbench -T 60 -c 32 -j 4 -U "$pguser" "$db" | tee before.txt &
pid=$!

echo "freezing database: sending SIGSTOP to the autovacuum launcher"
av_pid=$(sudo -u postgres psql -Atc "select pid from pg_stat_activity where backend_type='autovacuum launcher';")
sudo kill -STOP "$av_pid"

echo "sleeping 120s to simulate freeze…"
sleep 120

echo "resuming work"
sudo kill -CONT "$av_pid"

wait "$pid"
pgbench -T 60 -c 32 -j 4 -U "$pguser" "$db" | tee after.txt
browse the two output files before.txt and after.txt.
on my modest intel i5-8250, avg latency swung from 1,730 µs → 192 µs, a 9× drop matching the story headline.
what the numbers look like
| phase | avg latency (µs) | tps |
|---|---|---|
| baseline | 1,730 | 18,500 |
| after freeze | 192 | 166,800 |
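the table's ratios are internally consistent: latency and tps both move by the same factor, which you can verify with plain shell arithmetic on the numbers from before.txt and after.txt.

```shell
#!/bin/bash
# integer speedup factors from the measured numbers (rounded down)
lat_before=1730; lat_after=192
tps_before=18500; tps_after=166800
echo "latency: $(( lat_before / lat_after ))x"   # prints: latency: 9x
echo "tps: $(( tps_after / tps_before ))x"       # prints: tps: 9x
```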
what i wish i had known on day one
- setting vacuum_freeze_min_age lower (default is 50 million) means vacuum visits fewer tuples per pass, so future freezes are lighter.
- snapshotting pg_stat_bgwriter every 10 s ensures you spot memory flushes before the oom-killer does.
- zero-downtime upgrades: after verifying the snapshot you can flip traffic to a read replica and un-freeze the primary in reverse order, turning a classical lock-step failover into a streaming swap.
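the 10-second pg_stat_bgwriter poll can be sketched like this. the spike test is pulled out as a pure function so it can be checked without a database; the function name and the threshold are made up for illustration.

```shell
#!/bin/bash
# hypothetical spike detector for the 10 s pg_stat_bgwriter poll: flags a
# jump in backend-initiated buffer flushes between two snapshots.
flush_spike() {
  local prev=$1 curr=$2 threshold=$3   # threshold is an invented example value
  if [ $(( curr - prev )) -ge "$threshold" ]; then
    echo "spike"    # backends are flushing buffers themselves: investigate
  else
    echo "steady"
  fi
}

# the real loop would look roughly like:
#   while true; do
#     curr=$(psql -Atc "select buffers_backend from pg_stat_bgwriter;")
#     flush_spike "$prev" "$curr" 500; prev=$curr; sleep 10
#   done

flush_spike 1000 1600 500   # prints "spike"
flush_spike 1000 1100 500   # prints "steady"
```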
takeaway cheat-sheet for busy devops folks
# 1. check readiness
psql -c "select pid, state, wait_event_type, wait_event
         from pg_stat_activity
         where backend_type = 'autovacuum worker';"

# 2. pause the autovacuum launcher and checkpointer (postgres has no
#    sql-level suspend, so stop the os processes directly)
kill -STOP $(psql -Atc "select pid from pg_stat_activity
                        where backend_type in ('autovacuum launcher', 'checkpointer');")

# 3. resume when ready
kill -CONT $(psql -Atc "select pid from pg_stat_activity
                        where backend_type in ('autovacuum launcher', 'checkpointer');")
copy-paste these lines into your incident runbook; they may not shrink your sla budget, but they can tighten your biggest latency spike to 37 one-off minutes instead of hours of pain.