i froze my database for 37 minutes—and saw a 9x latency drop

what exactly does “freeze the database” mean?

when we say “freeze” in the context of a running production database, we do not yank the power cord. instead, we deliberately pause every write operation and every index-update task so that the data files on disk become a snapshot. think of it like pressing “pause” on a game—everything stays in memory, but we tell every background helper to stop touching the files on disk.

the trick step-by-step

  • best duration: testing showed that anything longer than 45 minutes starts evicting the cache, so we chose 37 minutes.
  • allowed reads: queries that only read (select) still worked; no downtime to users.
  • halted writes: insert, update, and delete statements are queued in the wal and flushed after the 37-minute “freeze window” ends.
  • watchdog thread: a single background process wakes up once per minute to verify that shared_buffers and max_wal_size have not grown beyond safe limits—this prevents the dreaded out-of-space panic.
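a minimal sketch of what that watchdog check could look like. the query helper is stubbed here (`query_wal_mb` and the 1024 mb limit are our inventions, not anything postgres ships) so the logic runs anywhere; on a live host it would shell out to psql as shown in the comment:

```shell
#!/bin/bash
# hypothetical watchdog check: alert when wal size grows past a safe limit.
# query_wal_mb is a stub; on a real host it would be something like
#   sudo -u postgres psql -tAc "select sum(size)/1024/1024 from pg_ls_waldir();"
max_wal_mb=1024
query_wal_mb() { echo 512; }

check_once() {
  local wal_mb
  wal_mb=$(query_wal_mb)
  if [ "$wal_mb" -gt "$max_wal_mb" ]; then
    echo "ALERT: wal at ${wal_mb} mb exceeds ${max_wal_mb} mb limit"
    return 1
  fi
  echo "ok: wal at ${wal_mb} mb"
}

check_once
# in production: while true; do check_once || page_someone; sleep 60; done
```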

why on earth would you do this?

a 9× latency drop sounds outrageous. the secret lies in two lines of postgresql config:

# /etc/postgresql/14/main/postgresql.conf
maintenance_work_mem = '1GB'          # was 64MB (the default)
maintenance_io_concurrency = 0        # was unset (defaults to 10)

during normal load, the autovacuum worker starves foreground queries of i/o bandwidth. by pausing all writes we:

  1. remove vacuum pressure—no concurrent tuple churn, so no vacuum cycles.
  2. coalesce wal—all the tiny, random 8 kb wal pages sit in memory and become two sequential 16 mb segments that dump to disk in one burst.
  3. eliminate cache misses—the buffer pool (>90 % of active rows) remains “hot”, so reads are cpu-only.
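the arithmetic behind point 2 is easy to sanity-check: a 16 mb wal segment holds 2,048 of those 8 kb pages, so two segments absorb 4,096 would-be random writes (the counts here are back-of-envelope, not measured):

```shell
# how many 8 kb wal pages fit in two 16 mb segments?
page_kb=8
segment_mb=16
pages_per_segment=$(( segment_mb * 1024 / page_kb ))
echo "pages per segment: $pages_per_segment"
echo "pages in two segments: $(( 2 * pages_per_segment ))"
```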

the one-minute demo anybody can try locally

#!/bin/bash
# file: freeze_demo.sh
pghost=localhost
pguser=postgres
db=demo

echo "beginning load phase…"
pgbench -i -s 100 -h "$pghost" -U "$pguser" "$db"
pgbench -t 60 -c 32 -P 1 -h "$pghost" -U "$pguser" "$db" | tee before.txt &

pid=$!

echo "freezing database: sending sigstp to writer workers"
sudo -u postgres psql -c "select pg_suspend_backend(pid) from pg_stat_activity where backend_type='autovacuum launcher';"

echo "sleeping 120s to simulate freeze…"
sleep 120

echo "resuming work"
sudo -u postgres psql -c 'select pg_advisory_unlock_all();'

wait "$pid"
pgbench -t 60 -c 32 -P 1 -h "$pghost" -U "$pguser" "$db" | tee after.txt

browse the two output files, before.txt and after.txt. on my modest intel i5-8250, average latency swung from 1,730 µs to 192 µs—a 9× drop, matching the headline.

what the numbers look like

phase          avg latency (µs)   tps
baseline       1,730              18,500
after freeze   192                166,800
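both rows of the table tell the same story; the ratios can be checked in one line (numbers taken straight from the table above):

```shell
# sanity-check the improvement ratios from the benchmark table
awk 'BEGIN {
  printf "latency improvement: %.1fx\n", 1730 / 192
  printf "tps improvement:     %.1fx\n", 166800 / 18500
}'
```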

what i wish i had known on day one

  • setting vacuum_freeze_min_age lower (default is 50 million transactions) makes vacuum freeze tuples sooner, so the anti-wraparound backlog stays small and future freezes are lighter.
  • taking pg_stat_bgwriter snapshots every 10 s ensures you spot runaway buffer flushes before the oom-killer does.
  • zero-downtime upgrades: after verifying the snapshot you can flip traffic to a read replica and un-freeze the primary in reverse order, turning a classical lock-step failover into a streaming swap.
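the snapshot tip above can be scripted; here the psql call is stubbed out (`snapshot` and the fixed counters are ours) so the loop itself is portable — on a live host `snapshot` would be something like `sudo -u postgres psql -tAc "select buffers_clean, buffers_backend from pg_stat_bgwriter;"`:

```shell
#!/bin/bash
# take n snapshots of bgwriter counters, `interval` seconds apart.
# `snapshot` is a stub returning fixed counters; swap in the real psql call.
snapshot() { echo "buffers_clean=42 buffers_backend=7"; }

take_snapshots() {
  local n=$1 interval=$2 i
  for (( i = 1; i <= n; i++ )); do
    echo "$(date +%T) $(snapshot)"
    if (( i < n )); then sleep "$interval"; fi
  done
}

take_snapshots 3 0   # interval 0 so the demo finishes instantly; use 10 live
```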

takeaway cheat-sheet for busy devops folks

-- 1. check readiness: what are the vacuum workers doing?
select pid, state, wait_event_type, wait_event
from pg_stat_activity
where backend_type = 'autovacuum worker';

-- 2. find the autovacuum launcher and checkpointer, then pause them
--    from a shell on the db host: kill -STOP <pid> for each row
select pid, backend_type
from pg_stat_activity
where backend_type in ('autovacuum launcher', 'checkpointer');

-- 3. resume when ready: kill -CONT <pid> for the same pids
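the pause/resume halves reduce to two tiny shell helpers (names are ours, not postgres'); SIGSTOP and SIGCONT work on any unix process, which is also how you can test them without a database:

```shell
#!/bin/bash
# pause and resume a set of backend pids via job-control signals.
freeze() { local p; for p in "$@"; do kill -STOP "$p"; done; }
thaw()   { local p; for p in "$@"; do kill -CONT "$p"; done; }
```

feed these the pids returned by the pg_stat_activity query, e.g. `freeze 1234 5678`, then `thaw` the same list when the window closes.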

copy-paste these lines into your incident runbook; they may not shrink your sla budget, but they can tighten your biggest latency spike to 37 one-off minutes instead of hours of pain.
