The Sixty-Seven Environment Variables Nobody Documented

A client's deployment kept failing in staging but not locally. The root cause wasn't code — it was sixty-seven environment variables spread across five files with no documentation and no single source of truth.


The deploy failed on a Thursday afternoon. Staging, not production — small mercy. The error was a generic ConnectionRefusedError from a service that had been working fine for months. The engineer who triggered the deploy checked the code diff: two lines changed in a utility function. Nothing that should touch networking.

She reverted. Redeployed the previous version. Same error.

That's when I got pulled in. I was four weeks into a consulting engagement with a mid-size SaaS company, mostly helping them untangle their CI/CD pipeline. But a staging environment that breaks on a no-op deploy is the kind of mystery that demands immediate attention.

Pulling the thread

The ConnectionRefusedError pointed to their payment gateway integration. I checked the service — it was trying to reach an internal URL that didn't exist in the staging VPC. The URL came from an environment variable called PAYMENT_GATEWAY_INTERNAL_URL.

I asked where it was set. Three different people gave me three different answers.

One pointed me to a .env.staging file in the repo. Another said it lived in AWS Parameter Store. A third was confident it was injected by their Helm chart. All three were partially right. The variable was defined in all three places, and the Helm chart value — the one that actually won at runtime — had been silently overwritten during an infrastructure change two weeks earlier. Nobody noticed because the old value still worked until a VPC peering connection expired.
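When several sources define the same key, it helps to lay them side by side and see which one wins. The sketch below is illustrative only: `traceVariable` is a hypothetical helper, the URLs are placeholders, and the precedence order of the two lower sources is an assumption. The only thing the story confirms is that the Helm chart value won at runtime.

```typescript
// One configuration source: a label plus the key/value pairs it defines.
interface ConfigSource {
  label: string;
  values: Record<string, string>;
}

interface VariableTrace {
  key: string;
  definedIn: string[];   // every source that sets the key, lowest precedence first
  winner: string | null; // label of the source whose value is used at runtime
  value: string | null;
  conflicting: boolean;  // true when at least two sources disagree on the value
}

// Sources are passed lowest precedence first, so a later source overrides an earlier one.
function traceVariable(key: string, sources: ConfigSource[]): VariableTrace {
  const trace: VariableTrace = {
    key: key,
    definedIn: [],
    winner: null,
    value: null,
    conflicting: false,
  };
  sources.forEach(function (source) {
    if (!Object.prototype.hasOwnProperty.call(source.values, key)) return;
    const candidate = source.values[key];
    if (trace.value !== null && trace.value !== candidate) trace.conflicting = true;
    trace.definedIn.push(source.label);
    trace.winner = source.label;
    trace.value = candidate;
  });
  return trace;
}

// Placeholder values; the ordering of the first two sources is assumed.
const exampleTrace = traceVariable("PAYMENT_GATEWAY_INTERNAL_URL", [
  { label: ".env.staging", values: { PAYMENT_GATEWAY_INTERNAL_URL: "http://old.example.internal" } },
  { label: "parameter-store", values: { PAYMENT_GATEWAY_INTERNAL_URL: "http://old.example.internal" } },
  { label: "helm-chart", values: { PAYMENT_GATEWAY_INTERNAL_URL: "http://new.example.internal" } },
]);
```

A trace with more than one entry in `definedIn` and `conflicting` set to true is exactly the situation that produced three different answers to one question.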

That was the immediate fix. But the real problem was much bigger.

The audit

I spent the next two days mapping every environment variable the application consumed. The .env.example file in the repo listed 23 variables. The actual running application in production used 67.

Forty-four environment variables had no documentation, no example values, and no mention in any onboarding guide. Some highlights from the gap:

  • LEGACY_AUTH_FALLBACK_ENABLED — set to "true" in production, absent everywhere else. Nobody on the current team knew what the legacy auth system was or why it needed a fallback.
  • CACHE_BYPASS_HEADER_NAME — a custom HTTP header that, when present, skipped the Redis cache. Documented nowhere. Discovered by one engineer during a debugging session eight months ago and quietly added to production.
  • ML_SCORING_TIMEOUT_MS — set to "30000" in production and "5000" in staging. The ML team had bumped it in production after their model got slower, but nobody updated staging. Every ML-dependent feature had been subtly broken in staging for three months.
  • FEATURE_X_ROLLOUT_PCT — for a feature that had been at 100% for over a year. Still checked on every request.
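The size of that gap is cheap to measure once you have both key lists. A minimal sketch, assuming dotenv-style `KEY=value` lines in the example file; `parseEnvKeys` and `findUndocumented` are hypothetical names, and how you obtain the runtime key list is left to you.

```typescript
// Extract variable names from dotenv-style text: one KEY=value per line.
// Blank lines and lines starting with # are skipped; an "export " prefix is tolerated.
function parseEnvKeys(envFileText: string): string[] {
  const keys: string[] = [];
  envFileText.split(/\r?\n/).forEach(function (rawLine) {
    const line = rawLine.trim();
    if (line === "" || line.charAt(0) === "#") return;
    const eq = line.indexOf("=");
    if (eq <= 0) return; // not a KEY=value line
    const key = line.slice(0, eq).replace(/^export\s+/, "").trim();
    if (keys.indexOf(key) === -1) keys.push(key);
  });
  return keys;
}

// Keys the running application sees that the example file never mentions.
function findUndocumented(documentedKeys: string[], runtimeKeys: string[]): string[] {
  return runtimeKeys
    .filter(function (key) {
      return documentedKeys.indexOf(key) === -1;
    })
    .sort();
}
```

A raw dump of a process environment also contains system variables such as PATH and HOME, so the runtime list needs a prefix filter or an ignore list before the comparison means anything.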

Warning

If your .env.example and your actual runtime configuration diverge, you don't have configuration management. You have configuration archaeology.

How it gets this bad

Nobody sets out to create sixty-seven undocumented environment variables. It happens one at a time, each one perfectly reasonable in isolation.

A developer needs a quick toggle for a feature flag. Environment variable. The ops team needs to configure a timeout without redeploying. Environment variable. Someone integrates a third-party service and drops in an API key. Environment variable. A debug setting that was supposed to be temporary becomes permanent because removing it feels riskier than leaving it.

The .env.example file gets updated for the first few, then people forget. New hires copy someone else's .env.local file over Slack. The values in that file are six months stale, but the app mostly works, so nobody asks questions until something breaks.

The pattern is identical to technical debt in code, but worse in one specific way: stale code at least shows up in grep results and IDE searches. A missing environment variable manifests as a runtime error — or worse, as a silent behavioral difference between environments that nobody notices for months.

What we did about it

We couldn't fix everything at once, but we made three changes that stopped the bleeding.

First, we created a canonical schema. We wrote a validation module that ran at application startup. Every environment variable the app needed was declared in a single file with its type, a description, whether it was required, and a default value for development.

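// validateEnv is our own startup validation module (excerpt; three of the declared variables shown).
const config = validateEnv({
  // Required: the app refuses to start if this is missing.
  PAYMENT_GATEWAY_INTERNAL_URL: {
    type: 'string',
    required: true,
    description: 'Internal VPC endpoint for payment gateway',
  },
  // Optional: falls back to the development default when unset.
  ML_SCORING_TIMEOUT_MS: {
    type: 'number',
    required: false,
    default: 5000,
    description: 'Timeout for ML scoring requests, in milliseconds',
  },
  // Optional: null means the cache bypass is disabled.
  CACHE_BYPASS_HEADER_NAME: {
    type: 'string',
    required: false,
    default: null,
    description: 'Header name to bypass Redis cache (debugging only)',
  },
});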

If a required variable was missing, the app refused to start. No more discovering missing config through a ConnectionRefusedError twenty minutes into a deploy.
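A validator with that shape can be small. The following is a sketch under assumptions, not the module from the engagement: the field names mirror the schema above, the supported types are limited to string, number, and boolean, the error format is invented for illustration, and the function takes the environment as an explicit second argument so it can be exercised without touching the real process environment.

```typescript
type VarType = "string" | "number" | "boolean";
type ConfigValue = string | number | boolean | null;

interface VarSpec {
  type: VarType;
  required: boolean;
  description: string;
  default?: ConfigValue;
}

// Validate raw environment strings against a schema and return typed values.
// Every problem is collected before throwing, so one failed start reports all of them.
function validateEnvSketch(
  schema: Record<string, VarSpec>,
  env: Record<string, string | undefined>
): Record<string, ConfigValue> {
  const result: Record<string, ConfigValue> = {};
  const problems: string[] = [];

  Object.keys(schema).forEach(function (key) {
    const spec = schema[key];
    const raw = env[key];

    // Treat unset and whitespace-only values the same way.
    if (raw === undefined || raw.trim() === "") {
      if (spec.required) {
        problems.push(key + " is required but not set (" + spec.description + ")");
      } else {
        result[key] = spec.default === undefined ? null : spec.default;
      }
      return;
    }

    if (spec.type === "number") {
      const parsed = Number(raw);
      if (!isFinite(parsed)) {
        problems.push(key + ' must be a number, got "' + raw + '"');
      } else {
        result[key] = parsed;
      }
    } else if (spec.type === "boolean") {
      if (raw === "true" || raw === "false") {
        result[key] = raw === "true";
      } else {
        problems.push(key + ' must be "true" or "false", got "' + raw + '"');
      }
    } else {
      result[key] = raw;
    }
  });

  if (problems.length > 0) {
    throw new Error("Invalid environment configuration:\n  " + problems.join("\n  "));
  }
  return result;
}
```

In a Node.js service this would be called once at startup with `process.env` as the second argument, with the thrown error left uncaught so the process exits before it accepts traffic.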

Second, we killed the multi-source problem. Environment variables came from exactly one place per environment: Parameter Store for staging and production, .env.local for development. The Helm chart stopped injecting values directly. The .env.staging file in the repo got deleted.

Third, we added a CI check. A simple script compared the keys in the schema file against what was defined in Parameter Store for each environment. If staging was missing a variable that production had, or an environment defined a key the schema didn't declare, the pipeline flagged it before any code shipped. The forty-four variables that had been missing from the documentation dropped to zero within a week.
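A minimal sketch of that kind of check, assuming the key lists have already been fetched: `checkDrift` and `hasDrift` are hypothetical names, and retrieving keys from Parameter Store is deliberately left out.

```typescript
interface DriftReport {
  environment: string;
  missing: string[];    // declared in the schema but not defined in this environment
  undeclared: string[]; // defined in this environment but absent from the schema
}

// Compare the schema's keys with the keys defined in each environment.
function checkDrift(
  schemaKeys: string[],
  environments: Record<string, string[]>
): DriftReport[] {
  return Object.keys(environments)
    .sort()
    .map(function (environment) {
      const defined = environments[environment];
      return {
        environment: environment,
        missing: schemaKeys
          .filter(function (key) { return defined.indexOf(key) === -1; })
          .sort(),
        undeclared: defined
          .filter(function (key) { return schemaKeys.indexOf(key) === -1; })
          .sort(),
      };
    });
}

// True when any environment has missing or undeclared keys.
function hasDrift(reports: DriftReport[]): boolean {
  return reports.some(function (report) {
    return report.missing.length > 0 || report.undeclared.length > 0;
  });
}
```

In a pipeline, the script would print the reports and exit non-zero when `hasDrift` returns true. This version treats every schema key as expected in every environment; if optional variables with defaults are allowed to be absent, pass only the required keys as `schemaKeys`.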

The cleanup

With the schema in place, we could finally see the full picture. Of the sixty-seven variables, we removed nineteen. Eight were for features that no longer existed. Six were duplicates with slightly different names (DB_HOST vs DATABASE_HOST — both set, only one read). Five were debug flags that had been on for so long that the code paths they guarded had become the only paths.
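Finding the "set but never read" candidates can be partly automated. The sketch below assumes a Node.js codebase that reads configuration through literal `process.env.NAME` or `process.env["NAME"]` references; `findReferencedVars` and `findUnreadVars` are hypothetical helpers, and anything accessed through a computed key will be missed, so the output is a list of candidates to investigate rather than a list of safe deletions.

```typescript
// Collect names referenced as process.env.NAME or process.env["NAME"] in source text.
function findReferencedVars(sourceText: string): string[] {
  const pattern =
    /process\.env(?:\.([A-Za-z_][A-Za-z0-9_]*)|\[\s*["']([A-Za-z_][A-Za-z0-9_]*)["']\s*\])/g;
  const found: string[] = [];
  let match = pattern.exec(sourceText);
  while (match !== null) {
    const varName = match[1] || match[2];
    if (found.indexOf(varName) === -1) found.push(varName);
    match = pattern.exec(sourceText);
  }
  return found;
}

// Keys that are defined somewhere but have no literal reference in any source file.
function findUnreadVars(definedKeys: string[], sourceTexts: string[]): string[] {
  const referenced: string[] = [];
  sourceTexts.forEach(function (text) {
    findReferencedVars(text).forEach(function (key) {
      if (referenced.indexOf(key) === -1) referenced.push(key);
    });
  });
  return definedKeys.filter(function (key) {
    return referenced.indexOf(key) === -1;
  });
}
```

Run against a codebase that only reads DATABASE_HOST, a defined DB_HOST would surface as unread, which is how a duplicate pair like the one above becomes visible.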

LEGACY_AUTH_FALLBACK_ENABLED turned out to protect a code path that hadn't been reachable since a database migration eighteen months earlier. We removed the flag, the fallback code, and about 400 lines of authentication logic that existed purely to serve a configuration value nobody understood.

The deeper issue

Configuration sprawl is a symptom of something teams rarely talk about: the gap between "it works on my machine" and "it works the same way everywhere." Every environment variable without documentation is a piece of tribal knowledge. When the person who added it leaves, that knowledge leaves with them, and what's left is a string value in a parameter store that everyone is afraid to touch.

I've seen this pattern at five different clients now. The number of variables varies. The shape of the problem doesn't. And the fix is always the same boring, unsexy work: write it down, validate it at startup, and make drift visible before it becomes an outage.

What's your configuration story? I'm genuinely curious how many production environment variables your team can account for without checking — and how far off that number is from reality.