i killed my mysql cluster—then rebuilt it in rust & cloud to achieve 30x cheaper ops

how i knocked over my own mysql cluster (and lived to tell)

for three years my team and i ran a classic three-node mysql ndb cluster on bare-metal vms. it was fast (enough) and “enterprise-grade” (on paper). then one dark january night a faulty puppet manifest replicated a drop database statement to every secondary node. result: 35 gb of live e-commerce data vanished in 4.2 seconds, and roughly $4,700 of monthly on-call spend turned into $47,000 of downtime in a single night.

post-mortem: the true cost of “prem ops”

during the 36-hour recovery we tracked every manual step:

  • restore from cold-backups (6 h)
  • replay bin-logs (2 h)
  • tune innodb_buffer_pool_size for the 4th time (1 h)
  • hash out slas with angry bizdev (2 “dramatic” zoom calls)

all of this added up to a cogs (cost of goods sold) line-item 2.3× higher than our cloud bill for everything else combined.

key lesson

when you think you’re “paying for hardware,” you are actually paying for undifferentiated toil.

sketching the new blueprint on a whiteboard

we needed three non-negotiable outcomes:

  1. capex → opex (pay for queries, not servers)
  2. everything-as-code (repeatable, reviewable gitops)
  3. 30× cheaper to keep the lights on

two buzzwords that kept coming up: rust and serverless cloud.

why rust?

  • generates tiny native binaries (≈ 5 mb, roughly 100× smaller than a comparable jvm deployment)
  • tower + hyper = async services with 0.2 ms p99 latency
  • cargo audit flags cve’s at build time

why cloud in “serverless” mode?

  • no ec2 patching; compute and storage are fully managed (aurora serverless v2)
  • scale-to-zero saves 68 % during off-peak
  • built-in iam and cloudwatch metrics for free (well, almost)
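
as a sanity check, the 68 % figure is roughly what scale-to-zero predicts on paper. a minimal sketch, assuming a hypothetical day with ~8 busy hours at full capacity and the database paused the rest of the time (the hours are illustrative, not our billing data):

```rust
fn main() {
    // hypothetical shape of a day: ~8 peak hours, paused otherwise
    let peak_hours = 8.0_f64;
    let total_hours = 24.0_f64;

    // always-on capacity bills all 24 hours; scale-to-zero bills only the peak
    let savings = 1.0 - peak_hours / total_hours;
    println!("off-peak savings ≈ {:.0} %", savings * 100.0); // prints "off-peak savings ≈ 67 %"
}
```

real savings depend on how spiky your traffic is and on the database's minimum capacity floor, so treat this as an upper bound.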

from er-table to rust crate: five steps with code

step 1 — export the original schema


mysqldump --no-data --skip-triggers prod_db > orig_schema.sql

step 2 — auto-generate rust structs


# install diesel-cli
cargo install diesel_cli --no-default-features --features mysql

# generate schema.rs
cd ./backend && diesel print-schema > src/schema.rs
# add the sea-query helper if you prefer sql-builder style

step 3 — create a lightweight http api using axum


use axum::{
    routing::get,
    Router,
};
use std::net::SocketAddr;

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/healthz", get(|| async { "ok" }));

    let addr = SocketAddr::from(([0, 0, 0, 0], 3000));
    axum::Server::bind(&addr)
        .serve(app.into_make_service())
        .await
        .unwrap();
}

after cargo build --release the binary is 3.9 mb and starts in ≈ 22 ms.

step 4 — pick the cloud runtime

| option | cold-start | memory limit | egress cost |
| --- | --- | --- | --- |
| lambda + aurora serverless | < 300 ms | 10 gb | free tier ‘til 15 gb/mo |
| cloud run + alloydb | ≈ 900 ms | 32 gb | 0.045 usd/gib |

we chose lambda + aurora serverless v2 postgres because pl/pgsql is good enough and the mysql schema translated 1-for-1 for 97 % of tables.

step 5 — one-click deployment via cdk


npm install -g aws-cdk
cd .infra
cdk init app --language typescript

single file summary (lib/infra-stack.ts):


const cluster = new rds.ServerlessCluster(this, 'AuroraCluster', {
  engine: rds.DatabaseClusterEngine.auroraPostgres({
    version: rds.AuroraPostgresEngineVersion.VER_15_2,
  }),
  scaling: { autoPause: Duration.minutes(5) },
  credentials: rds.Credentials.fromSecret(dbSecret),
});
new lambda.Function(this, 'RustApi', {
  runtime: lambda.Runtime.PROVIDED_AL2,
  architecture: lambda.Architecture.ARM_64,
  handler: 'bootstrap',
  code: lambda.Code.fromAsset('../backend/target/lambda/bootstrap.zip'),
});

measuring the 30× gain

running the same average daily workload of ≈ 450 qps:

  • old infra: 3 bare-metal server instances in a colo = $2,430 / month
  • lambda + aurora serverless: $79 / month on average
  • net reduction: 30.7×

(prices eu-central-1 in early 2024.)
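
the headline number is just the ratio of the two monthly bills. a quick sketch with the figures above (30.76×, rounded down to 30.7× in the headline):

```rust
fn main() {
    // monthly figures from the comparison above (eu-central-1, early 2024)
    let colo = 2430.0_f64;      // 3 bare-metal instances
    let serverless = 79.0_f64;  // lambda + aurora serverless, average

    let reduction = colo / serverless;
    let yearly_savings = (colo - serverless) * 12.0;

    println!("reduction: {:.2}x", reduction);           // prints "reduction: 30.76x"
    println!("saved per year: ${:.0}", yearly_savings); // prints "saved per year: $28212"
}
```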

hidden gems along the way

feature flag driven migrations


# .env
OLD_MYSQL=false

we wrapped every critical read in a de-facto circuit breaker:


use std::env;

// `User`, `DbError`, `legacy_mysql` and `serverless_pg` are app-level items
async fn load_user(id: i64) -> Result<User, DbError> {
    // OLD_MYSQL=true flips reads back to the legacy cluster
    match env::var("OLD_MYSQL").as_deref() {
        Ok("true") => legacy_mysql(id).await,
        _ => serverless_pg(id).await,
    }
}

no-downtime cutover

using weighted route53 + an alb we gradually shifted traffic from 0 % → 5 % → 50 % → 100 %. rollback = single value change in terraform, applied in 90 seconds.
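
route53 handles the weighting for production traffic, but the mechanism is easy to model locally. a minimal sketch (hypothetical, not our production router): hash each user id and compare against the rollout percentage, so a given user lands on the same backend at every weight step:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// route `percent_new` percent of users to the new backend,
// deterministically per user id (sticky assignment)
fn routes_to_new_backend(user_id: u64, percent_new: u64) -> bool {
    let mut hasher = DefaultHasher::new();
    user_id.hash(&mut hasher);
    (hasher.finish() % 100) < percent_new
}

fn main() {
    // the same ramp we used: 0 % -> 5 % -> 50 % -> 100 %
    for percent in [0u64, 5, 50, 100] {
        let on_new = (0..10_000u64)
            .filter(|id| routes_to_new_backend(*id, percent))
            .count();
        println!("weight {percent:>3} % -> {on_new} / 10000 users on new backend");
    }
}
```

because the assignment is deterministic and thresholds are nested, raising the weight only moves *new* users over; nobody flaps between backends mid-session.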

seo wins as side effect

smaller cold-starts allowed us to enable ssr on dynamic pages. the resulting core web vitals push (lcp < 1.8 s) lifted organic traffic by 28 % in 6 weeks. (we kept the keywords devops, full stack, coding, seo in every meta tag, ignored by google but satisfying a certain product manager.)

tl;dr checklist for your own rewrite

  1. inventory every manual runbook task (and tag them with “💰clock” emoji)
  2. pick one cloud primitive: lambda, cloud run, or (!) a fly.io machine
  3. prove costs before deleting legacy (ab or k6 + a spreadsheet)
  4. open-source the cdk skeleton to github—stars = accountability
  5. publish the post-mortem—google likes honesty; senior engineers like humility.

next chapter: wasm at the edge?

we already have a single wasm32-wasi build running on fastly compute@edge for static asset edge-caching. average total latency dropped to 31 ms. but that is another crash story—pun fully intended—for another article.

happy (cheaper) hacking!
