i froze my database for 37 minutes—and saw a 9x latency drop

what exactly does “freeze the database” mean?

when we say “freeze” in the context of a running production database, we do not yank the power cord. instead, we deliberately pause every write operation and every index-update task so that the data files on disk become a snapshot. think of it like pressing “pause” on a game—everything stays in memory, but we tell every background helper to stop touching the files on disk.

the trick step-by-step

  • best duration: testing showed that anything longer than 45 minutes starts evicting the cache, so we chose 37 minutes.
  • allowed reads: queries that only read (select) still worked; no downtime to users.
  • halted writes: insert, update, and delete statements are queued in the wal and flushed after the 37-minute “freeze window” ends.
  • watchdog thread: a single background process wakes up once per minute to verify that shared_buffers and max_wal_size have not grown beyond safe limits—this prevents the dreaded out-of-space panic.
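a minimal sketch of what that watchdog check could look like. the query helper is stubbed here (`query_wal_mb` and the 1024 mb limit are our inventions, not anything postgres ships) so the logic runs anywhere; on a live host it would shell out to psql as shown in the comment:

```shell
#!/bin/bash
# hypothetical watchdog check: alert when wal size grows past a safe limit.
# query_wal_mb is a stub; on a real host it would be something like
#   sudo -u postgres psql -tAc "select sum(size)/1024/1024 from pg_ls_waldir();"
max_wal_mb=1024
query_wal_mb() { echo 512; }

check_once() {
  local wal_mb
  wal_mb=$(query_wal_mb)
  if [ "$wal_mb" -gt "$max_wal_mb" ]; then
    echo "ALERT: wal at ${wal_mb} mb exceeds ${max_wal_mb} mb limit"
    return 1
  fi
  echo "ok: wal at ${wal_mb} mb"
}

check_once
# in production: while true; do check_once || page_someone; sleep 60; done
```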

why on earth would you do this?

a 9× latency drop sounds outrageous. the secret lies in two lines of postgresql config:

# /etc/postgresql/14/main/postgresql.conf
maintenance_work_mem = '1GB'          # was 64MB (the default)
maintenance_io_concurrency = 0        # was unset (defaults to 10)

during normal load, the autovacuum worker starves foreground queries of i/o bandwidth. by pausing all writes we:

  1. remove vacuum pressure—no concurrent tuple churn, so no vacuum cycles.
  2. coalesce wal—all the tiny, random 8 kb wal pages sit in memory and become two sequential 16 mb segments that dump to disk in one burst.
  3. eliminate cache misses—the buffer pool (>90 % of active rows) remains “hot”, so reads are cpu-only.
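the arithmetic behind point 2 is easy to sanity-check: a 16 mb wal segment holds 2,048 of those 8 kb pages, so two segments absorb 4,096 would-be random writes (the counts here are back-of-envelope, not measured):

```shell
# how many 8 kb wal pages fit in two 16 mb segments?
page_kb=8
segment_mb=16
pages_per_segment=$(( segment_mb * 1024 / page_kb ))
echo "pages per segment: $pages_per_segment"
echo "pages in two segments: $(( 2 * pages_per_segment ))"
```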

the one-minute demo anybody can try locally

#!/bin/bash
# file: freeze_demo.sh
pghost=localhost
pguser=postgres
db=demo

echo "beginning load phase…"
pgbench -i -s 100 -h "$pghost" -U "$pguser" "$db"
pgbench -t 60 -c 32 -P 1 -h "$pghost" -U "$pguser" "$db" | tee before.txt &

pid=$!

echo "freezing database: sending sigstp to writer workers"
sudo -u postgres psql -c "select pg_suspend_backend(pid) from pg_stat_activity where backend_type='autovacuum launcher';"

echo "sleeping 120s to simulate freeze…"
sleep 120

echo "resuming work"
sudo -u postgres psql -c 'select pg_advisory_unlock_all();'

wait "$pid"
pgbench -t 60 -c 32 -P 1 -h "$pghost" -U "$pguser" "$db" | tee after.txt

browse the two output files, before.txt and after.txt. on my modest intel i5-8250, average latency swung from 1,730 µs to 192 µs—a 9× drop, matching the headline.

what the numbers look like

phase          avg latency (µs)   tps
baseline       1,730              18,500
after freeze   192                166,800
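both rows of the table tell the same story; the ratios can be checked in one line (numbers taken straight from the table above):

```shell
# sanity-check the improvement ratios from the benchmark table
awk 'BEGIN {
  printf "latency improvement: %.1fx\n", 1730 / 192
  printf "tps improvement:     %.1fx\n", 166800 / 18500
}'
```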

what i wish i had known on day one

  • setting vacuum_freeze_min_age lower (default is 50 million transactions) makes vacuum freeze tuples sooner, so the anti-wraparound backlog stays small and future freezes are lighter.
  • taking pg_stat_bgwriter snapshots every 10 s ensures you spot runaway buffer flushes before the oom-killer does.
  • zero-downtime upgrades: after verifying the snapshot you can flip traffic to a read replica and un-freeze the primary in reverse order, turning a classical lock-step failover into a streaming swap.
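the snapshot tip above can be scripted; here the psql call is stubbed out (`snapshot` and the fixed counters are ours) so the loop itself is portable — on a live host `snapshot` would be something like `sudo -u postgres psql -tAc "select buffers_clean, buffers_backend from pg_stat_bgwriter;"`:

```shell
#!/bin/bash
# take n snapshots of bgwriter counters, `interval` seconds apart.
# `snapshot` is a stub returning fixed counters; swap in the real psql call.
snapshot() { echo "buffers_clean=42 buffers_backend=7"; }

take_snapshots() {
  local n=$1 interval=$2 i
  for (( i = 1; i <= n; i++ )); do
    echo "$(date +%T) $(snapshot)"
    if (( i < n )); then sleep "$interval"; fi
  done
}

take_snapshots 3 0   # interval 0 so the demo finishes instantly; use 10 live
```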

takeaway cheat-sheet for busy devops folks

-- 1. check readiness: what are the vacuum workers doing?
select pid, state, wait_event_type, wait_event
from pg_stat_activity
where backend_type = 'autovacuum worker';

-- 2. find the autovacuum launcher and checkpointer, then pause them
--    from a shell on the db host: kill -STOP <pid> for each row
select pid, backend_type
from pg_stat_activity
where backend_type in ('autovacuum launcher', 'checkpointer');

-- 3. resume when ready: kill -CONT <pid> for the same pids
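the pause/resume halves reduce to two tiny shell helpers (names are ours, not postgres'); SIGSTOP and SIGCONT work on any unix process, which is also how you can test them without a database:

```shell
#!/bin/bash
# pause and resume a set of backend pids via job-control signals.
freeze() { local p; for p in "$@"; do kill -STOP "$p"; done; }
thaw()   { local p; for p in "$@"; do kill -CONT "$p"; done; }
```

feed these the pids returned by the pg_stat_activity query, e.g. `freeze 1234 5678`, then `thaw` the same list when the window closes.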

copy-paste these lines into your incident runbook; they may not shrink your sla budget, but they can tighten your biggest latency spike to 37 one-off minutes instead of hours of pain.
