i killed my mysql cluster—then rebuilt it in rust & cloud to achieve 30x cheaper ops
how i knocked over my own mysql cluster (and lived to tell)
for three years my team and i ran a classic three-node mysql ndb cluster on bare-metal vms. it was fast (enough) and “enterprise-grade” (on paper). then one dark january night a faulty puppet manifest replicated a drop database statement to every secondary node. result: 35 gb of live e-commerce data vanished in 4.2 seconds. our $4,700 monthly on-call budget turned into $47,000 of downtime costs overnight.
post-mortem: the true cost of on-prem ops
during the 36-hour recovery we tracked every manual step:
- restore from cold-backups (6 h)
- replay bin-logs (2 h)
- tune innodb_buffer_pool_size for the 4th time (1 h)
- hash out slas with angry bizdev (2 “dramatic” zoom calls)
all of this added up to a cogs (cost of goods sold) line-item 2.3× higher than our cloud bill for everything else combined.
key lesson
when you think you’re “paying for hardware,” you are actually paying for undifferentiated toil.
sketching the new blueprint on a whiteboard
we needed three non-negotiable outcomes:
- capex → opex (pay for queries, not servers)
- everything-as-code (repeatable, reviewable gitops)
- 30× cheaper to keep the lights on
two buzzwords that kept coming up: rust and serverless cloud.
why rust?
- generates tiny native binaries (≈ 5 mb, roughly 100× smaller than a comparable jvm deployment)
- tower + hyper = async services with 0.2 ms p99 latency
- cargo audit flags cves at build time
why cloud in “serverless” mode?
- no ec2 patching: storage and compute are fully managed (aurora serverless v2)
- scale-to-zero saves 68 % during off-peak
- built-in iam and cloudwatch metrics for free (well, almost)
from er-table to rust crate: 12 steps with code
step 1 — export the original schema
mysqldump --no-data --skip-triggers prod_db > orig_schema.sql
step 2 — auto-generate rust structs
# install diesel-cli
cargo install diesel_cli --no-default-features --features mysql
# generate schema.rs
cd ./backend && diesel print-schema > src/schema.rs
# add the sea-query helper if you prefer sql-builder style
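whichever generator you use, each table ends up as a plain rust struct. a minimal sketch for a hypothetical orders table (the field names and types here are illustrative, not from our real schema):

```rust
// hypothetical mapping for an `orders` table; types follow the usual
// mysql-to-rust defaults (BIGINT -> i64, VARCHAR -> String).
#[derive(Debug, Clone, PartialEq)]
pub struct Order {
    pub id: i64,
    pub customer_id: i64,
    pub total_cents: i64,
    pub status: String,
}

fn main() {
    let o = Order {
        id: 1,
        customer_id: 42,
        total_cents: 1999,
        status: "paid".into(),
    };
    println!("order {}: {} cents, status {}", o.id, o.total_cents, o.status);
}
```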
step 3 — create a lightweight http api using axum
use axum::{
    routing::get,
    Router,
};
use std::net::SocketAddr;

#[tokio::main]
async fn main() {
    // single health-check route; real handlers register the same way
    let app = Router::new()
        .route("/healthz", get(|| async { "ok" }));
    let addr = SocketAddr::from(([0, 0, 0, 0], 3000));
    axum::Server::bind(&addr)
        .serve(app.into_make_service())
        .await
        .unwrap();
}
after cargo build --release the binary is 3.9 mb and starts in ≈ 22 ms.
step 4 — pick the cloud runtime
| option | cold-start | memory limit | egress cost |
|---|---|---|---|
| lambda + aurora serverless | < 300 ms | 10 gb | free tier ’til 15 gb/mo |
| cloud run + alloydb | ≈ 900 ms | 32 gb | $0.045 / gib |
we chose lambda + aurora serverless v2 postgres because pl/pgsql is good enough and mysql syntax translated 1-for-1 for 97 % of tables.
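most of that 97 % was mechanical. a typical diff (hypothetical table, shown only for illustration) is dropping backticks and swapping AUTO_INCREMENT for a serial type:

```sql
-- mysql
CREATE TABLE `orders` (
  `id` BIGINT NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
);

-- postgres equivalent
CREATE TABLE orders (
  id BIGSERIAL PRIMARY KEY
);
```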
step 5 — one-click deployment via cdk
npm install -g aws-cdk
cd .infra
cdk init app --language typescript
single file summary (lib/infra-stack.ts):
const cluster = new rds.ServerlessCluster(this, 'AuroraCluster', {
  engine: rds.DatabaseClusterEngine.auroraPostgres({
    version: rds.AuroraPostgresEngineVersion.VER_15_2,
  }),
  scaling: { autoPause: Duration.minutes(5) },
  credentials: rds.Credentials.fromSecret(dbSecret),
});

new lambda.Function(this, 'RustApi', {
  runtime: lambda.Runtime.PROVIDED_AL2,
  architecture: lambda.Architecture.ARM_64,
  handler: 'bootstrap',
  code: lambda.Code.fromAsset('../backend/target/lambda/bootstrap.zip'),
});
measuring the 30× gain
running the same average daily workload of ≈ 450 qps:
- old infra: 3 bare-metal server instances in a colo = $2,430 / month
- lambda + aurora serverless: $79 / month on average
- net reduction: ≈ 30.8×
(prices eu-central-1 in early 2024.)
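the headline multiplier is nothing fancier than the ratio of the two monthly bills; a two-line check with the numbers above:

```rust
fn main() {
    let colo_monthly = 2430.0_f64;      // 3 bare-metal servers in the colo
    let serverless_monthly = 79.0_f64;  // lambda + aurora serverless average
    println!("reduction: {:.1}x", colo_monthly / serverless_monthly);
}
```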
hidden gems along the way
feature flag driven migrations
# .env
OLD_MYSQL=false
we wrapped every critical read in a de-facto circuit breaker:
// User and ApiError are the app's own types
async fn load_user(id: i64) -> Result<User, ApiError> {
    // fall back to the legacy path only when the flag is explicitly "true"
    match std::env::var("OLD_MYSQL").as_deref() {
        Ok("true") => legacy_mysql(id).await,
        _ => serverless_pg(id).await,
    }
}
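strictly speaking, a flag flip is a kill switch; a real circuit breaker trips on failures. a std-only sketch of one (the threshold and reset policy here are illustrative, not what we shipped):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// opens after `threshold` consecutive failures; any success resets it.
pub struct Breaker {
    failures: AtomicU32,
    threshold: u32,
}

impl Breaker {
    pub fn new(threshold: u32) -> Self {
        Self { failures: AtomicU32::new(0), threshold }
    }

    /// true once the failure budget is exhausted: callers should
    /// short-circuit to the fallback path instead of hitting the db.
    pub fn is_open(&self) -> bool {
        self.failures.load(Ordering::Relaxed) >= self.threshold
    }

    /// record the outcome of one call.
    pub fn record(&self, ok: bool) {
        if ok {
            self.failures.store(0, Ordering::Relaxed);
        } else {
            self.failures.fetch_add(1, Ordering::Relaxed);
        }
    }
}

fn main() {
    let b = Breaker::new(3);
    b.record(false);
    b.record(false);
    b.record(false);
    println!("open after 3 failures: {}", b.is_open());
}
```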
no-downtime cutover
using weighted route53 + an alb we gradually shifted traffic from 0 % → 5 % → 50 % → 100 %. rollback = single value change in terraform, applied in 90 seconds.
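route53 did the weighting for us, but the same idea is easy to sketch in-process: hash each user into a stable 0–99 bucket and compare it against the rollout percentage (helper names here are hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// stable 0..100 bucket per user, so a given user always lands
/// on the same side of the split during the ramp-up.
fn bucket(user_id: u64) -> u64 {
    let mut h = DefaultHasher::new();
    user_id.hash(&mut h);
    h.finish() % 100
}

/// route to the new backend when the user's bucket falls under
/// the current rollout percentage (0, 5, 50, 100...).
fn use_new_backend(user_id: u64, percent_new: u64) -> bool {
    bucket(user_id) < percent_new
}

fn main() {
    for pct in [0, 5, 50, 100] {
        println!("user 7 at {pct}%: new = {}", use_new_backend(7, pct));
    }
}
```

because the bucket is stable per user, nobody flip-flops between backends mid-session while the weight ramps up.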
seo wins as side effect
smaller cold-starts allowed us to enable ssr on dynamic pages. the resulting core web vitals push (lcp < 1.8 s) lifted organic traffic by 28 % in 6 weeks. (we kept the keywords devops, full stack, coding, seo in every meta tag, ignorable by google but satisfying a certain product manager.)
tl;dr checklist for your own rewrite
- inventory every manual runbook task (and tag them with “💰clock” emoji)
- pick one cloud primitive: lambda, cloud run, or (!) a fly.io machine
- prove costs before deleting legacy (a spreadsheet is enough)
- open-source the cdk skeleton to github—stars = accountability
- publish the post-mortem—google likes honesty; senior engineers like humility.
next chapter: wasm at the edge?
we already have a single wasm32-wasi build running on fastly compute@edge for static asset edge-caching.
average total latency dropped to 31 ms.
but that is another crash story—pun fully intended—for another article.
happy (cheaper) hacking!