i killed my mysql cluster—then rebuilt it in rust & cloud to achieve 30x cheaper ops

how i knocked over my own mysql cluster (and lived to tell)

for three years my team and i ran a classic three-node mysql ndb cluster on bare-metal vms. it was fast (enough) and “enterprise-grade” (on paper). then one dark january night a faulty puppet manifest replicated a drop database statement to every secondary node. result: 35 gb of live e-commerce data vanished in 4.2 seconds, and roughly $4,700 of monthly on-call spend turned into $47,000 of downtime in a single night.

post-mortem: the true cost of “prem ops”

during the 36-hour recovery we tracked every manual step:

  • restore from cold-backups (6 h)
  • replay bin-logs (2 h)
  • tune innodb_buffer_pool_size for the 4th time (1 h)
  • hash out slas with angry bizdev (2 “dramatic” zoom calls)

all of this added up to a cogs (cost of goods sold) line-item 2.3× higher than our cloud bill for everything else combined.

key lesson

when you think you’re “paying for hardware,” you are actually paying for undifferentiated toil.

sketching the new blueprint on a whiteboard

we needed three non-negotiable outcomes:

  1. capex → opex (pay for queries, not servers)
  2. everything-as-code (repeatable, reviewable gitops)
  3. 30× cheaper to keep the lights on

two buzzwords that kept coming up: rust and serverless cloud.

why rust?

  • generates tiny native binaries (≈ 5 mb, roughly 100× smaller than a comparable jvm deployment)
  • tower + hyper = async services with 0.2 ms p99 latency
  • cargo audit flags cve’s at build time

why cloud in “serverless” mode?

  • no ec2 patching; compute and storage are fully managed (aurora serverless v2)
  • scale-to-zero saves 68 % during off-peak
  • built-in iam and cloudwatch metrics for free (well, almost)
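
as a sanity check, the 68 % figure is roughly what scale-to-zero predicts on paper. a minimal sketch, assuming a hypothetical day with ~8 busy hours at full capacity and the database paused the rest of the time (the hours are illustrative, not our billing data):

```rust
fn main() {
    // hypothetical shape of a day: ~8 peak hours, paused otherwise
    let peak_hours = 8.0_f64;
    let total_hours = 24.0_f64;

    // always-on capacity bills all 24 hours; scale-to-zero bills only the peak
    let savings = 1.0 - peak_hours / total_hours;
    println!("off-peak savings ≈ {:.0} %", savings * 100.0); // prints "off-peak savings ≈ 67 %"
}
```

real savings depend on how spiky your traffic is and on the database's minimum capacity floor, so treat this as an upper bound.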

from er-table to rust crate: five steps with code

step 1 — export the original schema


mysqldump --no-data --skip-triggers prod_db > orig_schema.sql

step 2 — auto-generate rust structs


# install diesel-cli
cargo install diesel_cli --no-default-features --features mysql

# generate schema.rs
cd ./backend && diesel print-schema > src/schema.rs
# add the sea-query helper if you prefer sql-builder style

step 3 — create a lightweight http api using axum


use axum::{
    routing::get,
    Router,
};
use std::net::SocketAddr;

#[tokio::main]
async fn main() {
    let app = Router::new()
        .route("/healthz", get(|| async { "ok" }));

    let addr = SocketAddr::from(([0, 0, 0, 0], 3000));
    axum::Server::bind(&addr)
        .serve(app.into_make_service())
        .await
        .unwrap();
}

after cargo build --release the binary is 3.9 mb and starts in ≈ 22 ms.

step 4 — pick the cloud runtime

| option | cold-start | memory limit | egress cost |
| --- | --- | --- | --- |
| lambda + aurora serverless | < 300 ms | 10 gb | free tier ‘til 15 gb/mo |
| cloud run + alloydb | ≈ 900 ms | 32 gb | 0.045 usd/gib |

we chose lambda + aurora serverless v2 postgres because pl/pgsql is good enough and the mysql schema translated 1-for-1 for 97 % of tables.

step 5 — one-click deployment via cdk


npm install -g aws-cdk
cd .infra
cdk init app --language typescript

single file summary (lib/infra-stack.ts):


const cluster = new rds.ServerlessCluster(this, 'AuroraCluster', {
  engine: rds.DatabaseClusterEngine.auroraPostgres({
    version: rds.AuroraPostgresEngineVersion.VER_15_2,
  }),
  scaling: { autoPause: Duration.minutes(5) },
  credentials: rds.Credentials.fromSecret(dbSecret),
});
new lambda.Function(this, 'RustApi', {
  runtime: lambda.Runtime.PROVIDED_AL2,
  architecture: lambda.Architecture.ARM_64,
  handler: 'bootstrap',
  code: lambda.Code.fromAsset('../backend/target/lambda/bootstrap.zip'),
});

measuring the 30× gain

running the same average daily workload of ≈ 450 qps:

  • old infra: 3 bare-metal server instances in a colo = $2,430 / month
  • lambda + aurora serverless: $79 / month on average
  • net reduction: 30.7×

(prices eu-central-1 in early 2024.)
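
the headline number is just the ratio of the two monthly bills. a quick sketch with the figures above (30.76×, rounded down to 30.7× in the headline):

```rust
fn main() {
    // monthly figures from the comparison above (eu-central-1, early 2024)
    let colo = 2430.0_f64;      // 3 bare-metal instances
    let serverless = 79.0_f64;  // lambda + aurora serverless, average

    let reduction = colo / serverless;
    let yearly_savings = (colo - serverless) * 12.0;

    println!("reduction: {:.2}x", reduction);           // prints "reduction: 30.76x"
    println!("saved per year: ${:.0}", yearly_savings); // prints "saved per year: $28212"
}
```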

hidden gems along the way

feature flag driven migrations


# .env
OLD_MYSQL=false

we wrapped every critical read in a de-facto circuit breaker:


use std::env;

// `User`, `DbError`, `legacy_mysql` and `serverless_pg` are app-level items
async fn load_user(id: i64) -> Result<User, DbError> {
    // OLD_MYSQL=true flips reads back to the legacy cluster
    match env::var("OLD_MYSQL").as_deref() {
        Ok("true") => legacy_mysql(id).await,
        _ => serverless_pg(id).await,
    }
}

no-downtime cutover

using weighted route53 + an alb we gradually shifted traffic from 0 % → 5 % → 50 % → 100 %. rollback = single value change in terraform, applied in 90 seconds.
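
route53 handles the weighting for production traffic, but the mechanism is easy to model locally. a minimal sketch (hypothetical, not our production router): hash each user id and compare against the rollout percentage, so a given user lands on the same backend at every weight step:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// route `percent_new` percent of users to the new backend,
// deterministically per user id (sticky assignment)
fn routes_to_new_backend(user_id: u64, percent_new: u64) -> bool {
    let mut hasher = DefaultHasher::new();
    user_id.hash(&mut hasher);
    (hasher.finish() % 100) < percent_new
}

fn main() {
    // the same ramp we used: 0 % -> 5 % -> 50 % -> 100 %
    for percent in [0u64, 5, 50, 100] {
        let on_new = (0..10_000u64)
            .filter(|id| routes_to_new_backend(*id, percent))
            .count();
        println!("weight {percent:>3} % -> {on_new} / 10000 users on new backend");
    }
}
```

because the assignment is deterministic and thresholds are nested, raising the weight only moves *new* users over; nobody flaps between backends mid-session.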

seo wins as side effect

smaller cold-starts allowed us to enable ssr on dynamic pages. the resulting core web vitals push (lcp < 1.8 s) lifted organic traffic by 28 % in 6 weeks. (we kept the keywords devops, full stack, coding, seo in every meta tag, ignored by google but satisfying a certain product manager.)

tl;dr checklist for your own rewrite

  1. inventory every manual runbook task (and tag them with “💰clock” emoji)
  2. pick one cloud primitive: lambda, cloud run, or (!) a fly.io machine
  3. prove costs before deleting legacy (ab or k6 + a spreadsheet)
  4. open-source the cdk skeleton to github—stars = accountability
  5. publish the post-mortem—google likes honesty; senior engineers like humility.

next chapter: wasm at the edge?

we already have a single wasm32-wasi build running on fastly compute@edge for static asset edge-caching. average total latency dropped to 31 ms. but that is another crash story—pun fully intended—for another article.

happy (cheaper) hacking!
