The Rollback That Didn't Roll Back

deployment reliability consulting devops database

We practiced deployments religiously but never tested a rollback. When a release broke checkout and we hit the big red button, we found out half the system couldn't actually go backward.

The client shipped a new checkout flow on a Tuesday afternoon. It had been through staging, passed end-to-end tests, gotten sign-off from product. The deploy went out via their standard blue-green pipeline. Within twenty minutes, customer support was flooded. Orders were failing at payment confirmation, throwing a generic "something went wrong" error. Conversion rate fell off a cliff.

No panic, though. The team had a rollback button. One click in their CI dashboard, and the previous version would be redeployed. The lead engineer clicked it, the pipeline ran, the old containers spun up.

Orders kept failing.

Forward-only state

The new checkout flow had come with a database migration. Nothing dramatic — three new columns on the orders table, a renamed field (payment_ref became payment_reference_id), and a new enum value in the order_status column. The migration ran automatically as part of the deploy. Standard practice, nobody thought twice about it.

But the rollback didn't reverse the migration. It just redeployed the old application code. That old code expected a column called payment_ref. The column was now called payment_reference_id. Every insert into the orders table threw a column-not-found error, caught by a generic exception handler that returned a friendly "something went wrong" to the user.

This is the part that surprised the team: they'd been doing blue-green deployments for two years. They'd practiced failovers. They had runbooks. But nobody had ever tested what happens when you roll back across a schema change. The deploy pipeline treated application code and database migrations as a single forward-moving unit, with no concept of reverse.

It got worse

While the team was figuring out the column rename, a second problem surfaced. The new checkout flow had introduced a new order status: awaiting_confirmation. During the twenty minutes the new code was live, about 340 orders had been created with this status. The old code didn't recognize awaiting_confirmation — its status enum only knew pending, confirmed, shipped, and cancelled. The order listing page threw a deserialization error for any customer whose recent order carried the new status.
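The enum failure is easy to reproduce in isolation. What follows is a minimal sketch in Python, not the client's code: the class names and the UNKNOWN fallback member are assumptions made for illustration. Strict parsing raises on a status written by newer code, while a catch-all member lets older code degrade instead of crashing.

```python
from enum import Enum


class StrictOrderStatus(Enum):
    """Mirrors what the old code knew about: four statuses, nothing else."""
    PENDING = "pending"
    CONFIRMED = "confirmed"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"


class TolerantOrderStatus(Enum):
    """Same statuses plus a catch-all for values written by newer code."""
    PENDING = "pending"
    CONFIRMED = "confirmed"
    SHIPPED = "shipped"
    CANCELLED = "cancelled"
    UNKNOWN = "unknown"  # hypothetical fallback, not part of the original schema

    @classmethod
    def _missing_(cls, value):
        # Enum calls this hook when no member matches the given value.
        return cls.UNKNOWN


def parse_strict(raw: str) -> StrictOrderStatus:
    # Raises ValueError for any status the enum does not list.
    return StrictOrderStatus(raw)


def parse_tolerant(raw: str) -> TolerantOrderStatus:
    # Maps unrecognized statuses to UNKNOWN instead of raising.
    return TolerantOrderStatus(raw)
```

The tolerant version only helps if it ships before the new status does. The page still has to decide what to show for an UNKNOWN order, but that is a display problem and not an outage.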

Then a third problem. The new code had started publishing order events to a Kafka topic with an updated schema: one additional field in the payload. The downstream fulfillment service that consumed the topic had a lenient deserializer that ignored unknown fields, so it was unaffected. But a different downstream service had been quickly patched during the rollout to read the new field, and it treated that field as always present. When the old code resumed publishing events without it, that service started throwing null pointer exceptions.

Three layers of forward-only state: the database schema, the data itself, and the event contracts. The rollback only addressed the application binary.

What we actually did

The real recovery took four hours, not the thirty seconds the rollback button promised.

First, we wrote a targeted SQL migration to rename the column back. This was the easy part, but it required an engineer with production database access and the nerve to run DDL against a live checkout system during peak hours.

ALTER TABLE orders RENAME COLUMN payment_reference_id TO payment_ref;

Second, we moved the roughly 340 orders stuck in awaiting_confirmation back to pending and reprocessed them manually. Some had already been charged by the payment provider, so we reconciled each one against Stripe's records to avoid double-charging customers.

Third, we had to deal with the Kafka consumers. The downstream service that expected the new field needed a hotfix to make it optional. We couldn't un-publish the events that had already been produced with the new schema, so the consumer had to tolerate both shapes.
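A consumer that tolerates both shapes might look roughly like the sketch below. The service's real language and the name of the added field aren't given here, so OrderEvent, parse_order_event, and confirmation_token are hypothetical stand-ins. The point is only that the new field is read as optional and unknown fields are ignored.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrderEvent:
    order_id: str
    status: str
    # Hypothetical name for the field added by the new schema; None when absent.
    confirmation_token: Optional[str] = None


def parse_order_event(payload: bytes) -> OrderEvent:
    """Accept both the old and the new event shape."""
    data = json.loads(payload)
    return OrderEvent(
        order_id=data["order_id"],  # required in both shapes
        status=data["status"],      # required in both shapes
        # Optional: dict.get returns None when the old shape omits the field.
        confirmation_token=data.get("confirmation_token"),
    )


# Sample payloads (made up for illustration): old shape, then new shape
# with an extra field this consumer has never heard of.
old_shape = b'{"order_id": "o-1", "status": "pending"}'
new_shape = (
    b'{"order_id": "o-2", "status": "awaiting_confirmation", '
    b'"confirmation_token": "t-9", "some_future_field": 1}'
)
```

Because the parser picks out only the keys it knows, extra fields pass through harmlessly, and a missing optional field becomes None that downstream logic has to handle explicitly.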

None of this was in the runbook.

What changed afterward

The team adopted three rules that I've since carried to other engagements.

Separate schema migrations from code deploys. Migrations now run independently, ahead of the code that uses them. New columns get added first, then the code that uses them deploys, and only after the old code is fully retired do we drop unused columns. This means every migration must be backward-compatible with the currently running code, so a rename like payment_ref to payment_reference_id becomes an add, a backfill, and a later drop instead of a single in-place change. It's more work, but it means a code rollback never lands on an incompatible schema.
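Here is a small sketch of the expand phase applied to the rename from this incident. It uses an in-memory SQLite database as a stand-in for the real one, and the function names (expand, insert_v1, insert_v2) are illustrative assumptions. The contract step, dropping payment_ref, is deliberately left out because it only happens once the old code is gone.

```python
import sqlite3


def expand(conn):
    """Phase 1: add the new column alongside the old one and backfill it."""
    conn.execute("ALTER TABLE orders ADD COLUMN payment_reference_id TEXT")
    conn.execute("UPDATE orders SET payment_reference_id = payment_ref")


def insert_v1(conn, ref):
    """Old code: only knows about payment_ref."""
    conn.execute("INSERT INTO orders (payment_ref) VALUES (?)", (ref,))


def insert_v2(conn, ref):
    """New code during the transition: writes both columns."""
    conn.execute(
        "INSERT INTO orders (payment_ref, payment_reference_id) VALUES (?, ?)",
        (ref, ref),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payment_ref TEXT)")

insert_v1(conn, "ref-1")  # written before the migration
expand(conn)              # migration runs ahead of the code deploy
insert_v2(conn, "ref-2")  # new code is live
insert_v1(conn, "ref-3")  # code rolled back: the old insert still works

rows = conn.execute(
    "SELECT payment_ref, payment_reference_id FROM orders ORDER BY id"
).fetchall()
```

The row written after the rollback has no value in the new column, so the backfill has to run again before the old column can ever be dropped. That is the cost of keeping the door two-way.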

Write the down migration before you merge. Every migration file now requires a corresponding rollback script that's tested in CI. Not because we plan to run them automatically (we don't), but because writing the reverse forces you to think about whether the change is actually reversible. Reversing a column rename is trivial. Reversing a dropped column isn't, because the data is gone. Reversing a merge of two tables definitely isn't. If you can't write the down migration, that's a signal the deploy needs extra caution.
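One way a CI check for this could look is sketched below. It rests on assumptions: in-memory SQLite instead of the production database (RENAME COLUMN needs SQLite 3.25 or newer), a single orders table, and made-up helper names. It applies the up script, then the down script, and verifies the column list matches where it started.

```python
import sqlite3

# The rename from this incident, expressed as an up/down pair.
UP = "ALTER TABLE orders RENAME COLUMN payment_ref TO payment_reference_id"
DOWN = "ALTER TABLE orders RENAME COLUMN payment_reference_id TO payment_ref"


def column_names(conn):
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    return [row[1] for row in conn.execute("PRAGMA table_info(orders)")]


def round_trip_restores_schema(up, down):
    """Return True if running up then down leaves the column list unchanged."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payment_ref TEXT)")
    before = column_names(conn)
    conn.execute(up)
    conn.execute(down)
    return column_names(conn) == before
```

A check like this only compares column names. It says nothing about data lost along the way, which is exactly the part a human still has to reason about for drops and merges.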

Test rollbacks in staging. Deploy v2, generate some traffic, then roll back to v1. Does the old code still work against the database? Can downstream consumers handle the schema transition in both directions? This caught two more issues in the first month alone: a new index that led the query planner to pick poor plans for the old code's queries, and a cache key format change that caused stale data after rollback.
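The database half of that test can be approximated long before staging. The sketch below assumes you can list a few statements the old code runs (V1_STATEMENTS is a hypothetical sample) and replays them against a schema that has already taken the v2 migration, again using in-memory SQLite as a stand-in.

```python
import sqlite3

# Statements the old (v1) code is assumed to run; hypothetical examples.
V1_STATEMENTS = [
    "INSERT INTO orders (payment_ref) VALUES ('ref-1')",
    "SELECT id, payment_ref FROM orders",
]


def old_code_survives(migration_sql):
    """Apply a v2 migration, then replay v1's statements against the result."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, payment_ref TEXT)")
    conn.execute(migration_sql)
    try:
        for statement in V1_STATEMENTS:
            conn.execute(statement)
    except sqlite3.OperationalError:
        # SQLite raises OperationalError for a missing column or table.
        return False
    return True
```

Run against an additive migration such as adding a column, this returns True. Run against the in-place rename from this incident, it returns False, which is the twenty-minute outage compressed into a unit test. It does not cover data shapes, event contracts, or cache keys, so it complements the staging rollback rather than replacing it.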

Warning

If your deploy pipeline runs migrations forward but has no concept of reverse, your rollback button is decorative. Test it before you need it.

The pattern I keep seeing

Teams invest heavily in making deploys safe — canary analysis, feature flags, progressive rollouts. But rollback gets treated as a magic undo button that will always be there. It's the deployment equivalent of a backup you never tested restoring.

The uncomfortable truth is that most non-trivial releases create forward-only state. Database schemas change. Data takes new shapes. Event contracts evolve. Cache keys rotate. Each of these is a one-way door that your application rollback can't walk back through.

I don't have a universal answer for this. Expand-and-contract migrations help. Schema versioning helps. But they add real complexity, and I've watched teams adopt them with enthusiasm and then quietly stop writing down migrations three months later when nothing has gone wrong for a while.

Maybe the honest minimum is just this: before any release that touches shared state, ask the question out loud. "If we need to roll this back in an hour, what won't come with us?"