The Canary That Didn't Sing — What Our Deployment Strategy Missed

We built a canary deployment pipeline with automated rollbacks. It still let a bad release through to 100% of users. Here's what went wrong.


Six months ago I helped a client set up canary deployments. Argo Rollouts, automated traffic shifting, Prometheus metrics, the works. We were proud of it. The pipeline would shift 5% of traffic to a new version, watch error rates for ten minutes, then gradually ramp to 25%, 50%, and finally 100%. If the error rate crossed a threshold, it would roll back automatically.

It worked perfectly in every test we ran. Then it let a broken release sail through to production without a single alarm.

The Setup

The client ran a mid-size e-commerce platform — about 40 microservices, mostly Go and TypeScript. They'd been doing blue-green deployments for years, which worked but felt wasteful. Two full environments, a hard cutover, and when something went wrong you had to scramble to flip the switch back.

Canary felt like the obvious next step. We defined our rollout spec to watch two metrics: HTTP 5xx rate and p99 latency. If either spiked beyond a configurable threshold during any analysis window, the rollout would pause and then roll back.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 10m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
      analysis:
        templates:
          - templateName: error-rate
          - templateName: latency-p99
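
The two referenced templates live in separate AnalysisTemplate objects. The error-rate one looked roughly like this — the Prometheus address, metric names, and threshold below are illustrative rather than the client's exact values:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 1m
      failureLimit: 3
      # Fail the analysis if 5xx responses exceed 1% of traffic.
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
```

The latency template followed the same shape with a histogram_quantile query over p99.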

Clean, readable, well-documented. We ran drills where we intentionally deployed a version that returned 500s and watched the rollback kick in within minutes. Everyone felt good.

What Went Wrong

Three weeks later, a developer shipped a change to the order confirmation service. The code looked fine in review. It passed all tests. The canary rolled out to 5%, waited ten minutes, saw no spike in errors or latency, and promoted to 25%. Then 50%. Then 100%.

The problem didn't surface until the next morning when the finance team noticed that about 8% of orders from the previous evening had incorrect tax calculations. The amounts were off by small margins — a few cents here, a dollar there. No errors were thrown. The HTTP responses were all 200s. Latency was normal.

The bug was a rounding issue introduced when the developer refactored a currency conversion function. The old code rounded once at the end: multiply by 100, apply math.Round, divide back down to get whole cents. The new code applied rounding at a different point in the calculation, and for certain currency pairs the result drifted by fractions of a cent, which compounded on multi-item orders.

Our canary pipeline was watching the door for someone kicking it down. The bug walked in through a window.

The Real Problem: We Measured the Infrastructure, Not the Business

This is the part that stung. We'd spent weeks building a deployment pipeline that could detect infrastructure failures — crashes, timeouts, error spikes. But we hadn't wired in any business-level validation.

Nobody asked: "After this deploy, are the numbers still correct?"

It's easy to see in hindsight. At the time, we were focused on uptime. The client's previous pain had been full outages caused by bad deploys, so naturally we optimized for detecting outages. But the most dangerous bugs aren't the ones that crash your service. They're the ones that silently corrupt your data while every dashboard stays green.

Warning

A canary that only watches HTTP status codes and latency will catch maybe 40% of production issues. The rest are logic bugs, data corruption, and subtle behavioral changes that don't trip infrastructure alarms.

What We Changed

We didn't scrap the canary setup. We layered business metrics into the analysis. For the order service specifically, we added:

  • Order value deviation: compare the average order value between canary and stable pods over the analysis window. If they diverge by more than 1%, flag it.
  • Tax calculation spot checks: a lightweight job that runs sample calculations against known inputs and compares outputs between versions.
  • Conversion rate delta: if the canary cohort has a meaningfully different checkout completion rate, something might be off in the user-facing flow.
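
As a sketch, the order value deviation check fits naturally into another AnalysisTemplate. The metric name, label, and Prometheus address below are assumptions about their setup, not their exact config — Argo Rollouts can supply the canary and stable pod-template hashes as arguments:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: order-value-deviation
spec:
  args:
    - name: canary-hash
    - name: stable-hash
  metrics:
    - name: avg-order-value-delta
      interval: 5m
      failureLimit: 1
      # Fail if canary and stable average order values diverge by more than 1%.
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            abs(
              avg(order_value_dollars{rollouts_pod_template_hash="{{args.canary-hash}}"})
              /
              avg(order_value_dollars{rollouts_pod_template_hash="{{args.stable-hash}}"})
              - 1
            )
```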

These weren't hard to implement. The data was already in Prometheus and their data warehouse. We just hadn't thought to pipe it into the rollout analysis.

We also added a longer bake period for services that touch financial data. Instead of 10-minute windows, those services now get 30-minute windows at each step, with the first step running at just 2% of traffic. Slower, yes. But an overnight tax calculation bug costs a lot more than an extra hour of deployment time.
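
For those financial-path services, the step sequence in the Rollout spec now looks like this — same structure as before, just starting smaller and baking longer:

```yaml
strategy:
  canary:
    steps:
      - setWeight: 2
      - pause: { duration: 30m }
      - setWeight: 25
      - pause: { duration: 30m }
      - setWeight: 50
      - pause: { duration: 30m }
      - setWeight: 100
```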

The Broader Lesson

Every deployment strategy has assumptions baked into it. Blue-green assumes your smoke tests are comprehensive. Feature flags assume someone will eventually clean them up (they won't — I wrote about that). Canary deployments assume your metrics capture what "healthy" actually means for your system.

The tooling around progressive delivery has gotten genuinely impressive. Argo Rollouts, Flagger, and similar tools make the mechanics of traffic shifting almost trivial. But the mechanics were never the hard part. The hard part is deciding what to measure — and that requires understanding your domain, not just your infrastructure.

I keep coming back to a question that I don't think has a clean answer: how do you systematically identify the metrics that matter before a silent failure teaches you the hard way? We got better at it for this client, but it was a reactive improvement. Somewhere out there, another canary is humming along, all green, while the data underneath slowly drifts.