The Staging Environment Nobody Trusted (So Everyone Tested in Production)

A client's staging environment had drifted so far from production that developers stopped using it. Tests passed in staging and failed in prod. Tests failed in staging and passed in prod. Eventually the team just stopped looking.

I noticed it on my second day. The team had a perfectly functional staging environment — CI deployed to it on every merge, there was a URL pinned in Slack, the monitoring dashboards existed. But when I asked the tech lead to walk me through their QA process before a production deploy, she said, "Honestly? We just watch the logs for the first ten minutes after it goes out."

Nobody used staging. Not because they were reckless, but because they'd learned — through months of painful experience — that staging was a liar.

How it got this way

Nobody made a decision to let staging rot. It happened the way these things always happen: gradually, invisibly, one small skip at a time.

The production database got upgraded to PostgreSQL 15 eight months ago. Staging stayed on 13. Someone meant to update it. The ticket sat in the backlog.

Two third-party integrations — a payment processor and an address verification API — were stubbed out in staging with mock responses written in 2024. The real APIs had changed their response formats twice since then. The stubs still returned the old shape, which meant the parsing code looked fine in staging and threw errors in production.

Staging had 200 rows in the users table. Production had 1.4 million. The product catalog in staging had 50 items. Production had 12,000 with nested variants, bundles, and locale-specific pricing that nobody had bothered to seed.

And then there were the environment variables. Production had 63. Staging had 47. Sixteen config values — feature flags, API keys for newer integrations, rate limit thresholds — had been added to production over the past year and never propagated to staging. Some of the staging values were still pointing at services that had been decommissioned.

The trust collapse

Here's the thing about a drifted staging environment: it doesn't just stop being useful. It actively undermines your confidence.

When a test passes in staging but fails in production, you lose trust in staging. That's obvious. But the subtler damage comes from the reverse — when something fails in staging but works fine in production. That teaches developers to ignore staging failures. "Oh, that's just a staging thing." Once a team starts saying that, staging is dead. They just haven't buried it yet.

This team's workflow had quietly evolved into something nobody would have designed on purpose: merge to main, CI deploys to staging (because the pipeline requires it), skip any manual verification there, deploy to production, and watch Datadog for ten minutes. They were testing in production. They just didn't call it that.

The migration that made it real

A schema migration brought the problem into sharp focus. The migration added an index to the orders table. It was a sensible change, well written, and reviewed by two senior engineers. In staging, it completed in 80 milliseconds. In production, it blocked writes to the orders table for 47 minutes.

The staging orders table held two hundred rows; production's held 1.8 million. The migration used plain CREATE INDEX rather than CREATE INDEX CONCURRENTLY because the blocking form had never caused a problem. In staging, it never would.

The team lost 47 minutes of order processing during peak hours. The postmortem was uncomfortable because everyone already knew the root cause. They'd just been living with it.
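A cheap guard against this class of mistake is a CI lint over migration files. What follows is a minimal sketch, assuming migrations are plain .sql files with each CREATE INDEX statement starting on a single line; the function names are hypothetical and not part of the client's pipeline.

```shell
#!/bin/bash
# Hypothetical CI guard: flag CREATE INDEX statements that omit CONCURRENTLY.
# Assumes migrations are plain .sql files passed as arguments.

# Prints "file:line:statement" for each offending line in the given files.
find_blocking_indexes() {
  # -i: ignore case, -n: line numbers, -H: always print the file name.
  grep -inHE 'create[[:space:]]+(unique[[:space:]]+)?index' "$@" \
    | grep -viE 'index[[:space:]]+concurrently' || true
}

# Returns non-zero (failing the build) when any offending statement is found.
check_migrations() {
  local hits
  hits=$(find_blocking_indexes "$@")
  if [ -n "$hits" ]; then
    echo "CREATE INDEX without CONCURRENTLY:"
    echo "$hits"
    return 1
  fi
}
```

A CI step could call check_migrations on the migration files changed in a merge. Two caveats: a statement that breaks the line between INDEX and CONCURRENTLY would be flagged incorrectly, and PostgreSQL refuses to run CREATE INDEX CONCURRENTLY inside a transaction block, so a migration tool that wraps each file in a transaction needs that wrapping disabled for the file.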

What we actually fixed

The temptation was to build something elaborate — a full environment management platform, automated sync pipelines, the works. We didn't. We did four boring things.

Matched the infrastructure specs. Same Terraform modules for staging and production, parameterized by environment. Same database version, same Elasticsearch cluster topology, same Redis configuration. This took two days and surfaced a dozen silent incompatibilities that had been hiding for months.
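Shared Terraform modules keep the declared versions aligned. A runtime comparison is one possible complement, since it checks what each database actually reports. The sketch below is an illustration under assumptions, not the client's tooling: the function names are hypothetical, and the version strings are expected to come from something like SHOW server_version.

```shell
#!/bin/bash
# Hypothetical parity check: compare the PostgreSQL major versions
# reported by two environments.

# Extracts the major version from a server_version string,
# e.g. "15.4" or "13.11 (Debian 13.11-1.pgdg110+1)".
pg_major() {
  local version=$1
  # Drop everything from the first dot onward.
  echo "${version%%.*}"
}

# Succeeds only when both version strings share the same major version.
check_pg_parity() {
  local prod staging
  prod=$(pg_major "$1")
  staging=$(pg_major "$2")
  if [ "$prod" != "$staging" ]; then
    echo "PostgreSQL major version mismatch: prod=$prod staging=$staging"
    return 1
  fi
}

# Example wiring (PROD_URL and STAGING_URL are assumed connection strings):
#   check_pg_parity "$(psql "$PROD_URL" -Atc 'SHOW server_version')" \
#                   "$(psql "$STAGING_URL" -Atc 'SHOW server_version')"
```

Comparing only the major version is a deliberate choice here: minor releases within a major line rarely change behavior, while a 13 versus 15 gap is the kind of drift that had gone unnoticed for eight months.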

Seeded realistic data. Not a full production mirror — that's its own compliance headache. We wrote a seeding script that generated 100,000 users, 8,000 products with realistic variant structures, and six months of synthetic order history. Enough to make query plans and migration timings meaningful.
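To give a sense of the shape such a seeding script can take, here is a minimal sketch that emits synthetic users as CSV. It is not the script we wrote: the column names, the example.test domain, and the function name are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical seeding sketch: emit synthetic users as CSV on stdout.
# Columns (id, email, signup_days_ago) are assumed, not the client's schema.

generate_users() {
  local count=$1
  echo "id,email,signup_days_ago"
  # Rows are deterministic, so repeated runs yield the same data set.
  awk -v n="$count" 'BEGIN {
    for (i = 1; i <= n; i++) {
      # Spread signups across 180 days to mimic six months of history.
      printf "%d,user%d@example.test,%d\n", i, i, i % 180
    }
  }'
}
```

Running generate_users 100000 > users.csv produces a file that can be loaded with psql's \copy. Deterministic output is a deliberate choice: it keeps query plans and migration timings comparable from one reseed to the next.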

Added an environment variable drift check. A CI step that compared the set of config keys across environments and failed the build if staging was missing any key that production had. A short bash script:

#!/bin/bash
# Abort on any failed command or pipeline stage. Without this, a failed
# AWS call leaves a key list empty and the check can pass silently.
set -euo pipefail

# Collect parameter names under each path, strip the prefix, and sort for comm.
PROD_KEYS=$(aws ssm get-parameters-by-path --path /app/prod \
  --query 'Parameters[].Name' --output text | tr '\t' '\n' | \
  sed 's|^/app/prod/||' | sort)
STAGING_KEYS=$(aws ssm get-parameters-by-path --path /app/staging \
  --query 'Parameters[].Name' --output text | tr '\t' '\n' | \
  sed 's|^/app/staging/||' | sort)

# comm -23 prints lines found only in the first input:
# keys that production has and staging lacks.
MISSING=$(comm -23 <(echo "$PROD_KEYS") <(echo "$STAGING_KEYS"))
if [ -n "$MISSING" ]; then
  echo "Staging is missing these config keys:"
  echo "$MISSING"
  exit 1
fi

This caught three new drift cases in the first two weeks.

Replaced the API stubs with sandbox accounts. Both the payment processor and the address API offered sandbox environments. Switching to those meant staging was hitting real endpoints with real validation — just not processing real money. The stubs were deleted.

Note: the total effort was about four engineering days. The staging environment went from a ghost town to something developers actually opened in a browser before deploying to production.

The uncomfortable math

That staging environment had been running for 18 months at roughly $2,800/month. The team had spent about $50K (18 × $2,800 = $50,400) on infrastructure that actively misled them. If they'd had no staging environment at all, they probably would have built better production safeguards: canary deploys, feature flags, tighter rollback automation. Instead, they had the illusion of a safety net that hadn't been connected to anything for a year.

I don't think this is unusual. I've walked into a dozen client environments and asked, "How closely does staging match production?" The answer is almost always a confident "pretty close" followed by an uncomfortable silence when you start comparing specifics.

The maintenance tax on environment parity is real, and nobody budgets for it. Everyone budgets for building the environment. Nobody budgets for keeping it honest.