The Feature Flag Graveyard We Built Over Two Years

technical-debt feature-flags code-quality consulting

How 340 feature flags turned a codebase into a minefield — and the boring process that dug us out.

Last year I joined a client engagement where the backend had 340 feature flags. Three hundred and forty. The team had started with good intentions — careful rollouts, A/B tests, kill switches for risky features. But nobody had ever removed one.

The codebase wasn't legacy in the traditional sense. It was a three-year-old Node.js monolith, actively developed, with a team of 14 engineers shipping weekly. By every surface metric, they were doing fine. Deploys were green. Sprints were closing. But the velocity numbers told a different story: features that should have taken three days were consistently taking eight or nine.

How it got this bad

The pattern was always the same. A product manager asks for a gradual rollout. An engineer wraps the new behavior in a flag. The feature launches, everyone's happy, the flag stays. Nobody deletes it because deletion requires understanding the blast radius, and understanding the blast radius requires reading code that's wrapped in three other flags.

I pulled up the flag management dashboard on my first day and sorted by creation date. The oldest active flag was from 21 months ago. It controlled whether users saw a redesigned settings page. The redesign had been fully rolled out for 19 months. Everyone on the current team assumed someone else owned it.

The real damage wasn't the stale flags themselves. It was the interactions between them. One endpoint had this:

async function getCheckoutSummary(user: User, cart: Cart) {
  let pricing = await calculateBase(cart);
 
  if (await flagService.isEnabled('new-tax-engine', user)) {
    pricing = await calculateTaxV2(pricing, user.region);
  } else {
    pricing = await calculateTax(pricing);
  }
 
  if (await flagService.isEnabled('loyalty-discount', user)) {
    if (await flagService.isEnabled('loyalty-v2-tiers', user)) {
      pricing = applyTieredDiscount(pricing, user.loyaltyTier);
    } else {
      pricing = applyFlatDiscount(pricing, user.loyaltyPoints);
    }
  }
 
  if (await flagService.isEnabled('checkout-fee-experiment', user)) {
    pricing = addProcessingFee(pricing);
  }
 
  return pricing;
}

Four flags, sixteen possible on/off combinations, and twelve distinct paths through the function once you account for the nesting. The new-tax-engine flag had been at 100% for five months. The loyalty-v2-tiers flag was nested inside loyalty-discount, which itself was at 100%, meaning the outer check was an always-true condition guarding live code. Nobody was confident enough to remove either one because the checkout flow was the one place you absolutely could not break.
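For comparison, here is a sketch of what that function collapses to once the two fully rolled-out flags (new-tax-engine and loyalty-discount) are removed. The types, the stubbed flagService, and the arithmetic inside the pricing helpers are placeholders I made up so the example runs on its own. Only the shape of getCheckoutSummary mirrors the real code.

```typescript
// Hypothetical minimal types and stubs so the sketch runs on its own;
// the real pricing helpers and flag client are not shown in this post.
interface CheckoutUser { region: string; loyaltyTier: number; loyaltyPoints: number; }
interface CheckoutCart { subtotal: number; }
interface Pricing { total: number; }

const evaluatedFlags: string[] = []; // records which flags get checked

const flagService = {
  async isEnabled(key: string, _user: CheckoutUser): Promise<boolean> {
    evaluatedFlags.push(key);
    return false; // stub: every remaining flag is off
  },
};

// Placeholder arithmetic, invented for illustration only.
const calculateBase = async (cart: CheckoutCart): Promise<Pricing> => ({ total: cart.subtotal });
const calculateTaxV2 = async (p: Pricing, _region: string): Promise<Pricing> => ({ total: p.total + 10 });
const applyTieredDiscount = (p: Pricing, tier: number): Pricing => ({ total: p.total - tier * 5 });
const applyFlatDiscount = (p: Pricing, points: number): Pricing => ({ total: p.total - points });
const addProcessingFee = (p: Pricing): Pricing => ({ total: p.total + 1 });

async function getCheckoutSummary(user: CheckoutUser, cart: CheckoutCart): Promise<Pricing> {
  let pricing = await calculateBase(cart);

  // 'new-tax-engine' was at 100%: keep only the V2 branch.
  pricing = await calculateTaxV2(pricing, user.region);

  // 'loyalty-discount' was at 100%: drop the outer check, keep the inner flag.
  if (await flagService.isEnabled('loyalty-v2-tiers', user)) {
    pricing = applyTieredDiscount(pricing, user.loyaltyTier);
  } else {
    pricing = applyFlatDiscount(pricing, user.loyaltyPoints);
  }

  if (await flagService.isEnabled('checkout-fee-experiment', user)) {
    pricing = addProcessingFee(pricing);
  }

  return pricing;
}
```

Two flags remain, so there are four paths to reason about instead of twelve, and the old calculateTax call disappears from this function entirely.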

Warning

When your team says "we can't remove that flag because we're not sure what happens" — that's not caution, that's a symptom. The flag has already become technical debt.

The audit nobody wanted to do

I proposed a flag audit. The room was not enthusiastic. One engineer said, and I'm paraphrasing, "We know the flags are a mess, but we have a roadmap to deliver." Which is fair. Cleanup work doesn't demo well in sprint reviews.

We compromised: two engineers for one week, with a clear scope. Categorize every flag into one of four buckets:

Dead: at 100% rollout for more than 30 days, or not evaluated at all in the last 30 days
Active experiment: still being tested or partially rolled out
Ops toggle: intentionally long-lived (circuit breakers, maintenance modes)
Unknown: nobody could explain what it did
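We did the bucketing by hand in a spreadsheet, but the rules are mechanical enough to sketch as a function. Everything below is hypothetical: the AuditedFlag fields, the isOpsToggle marker, and the order in which the rules are checked are my assumptions, not the schema of any real flag service.

```typescript
type Bucket = 'dead' | 'active-experiment' | 'ops-toggle' | 'unknown';

// Hypothetical audit record; field names are assumptions for illustration.
interface AuditedFlag {
  key: string;
  isOpsToggle: boolean;        // marked by a human as intentionally long-lived
  purpose?: string;            // left undefined when nobody can explain the flag
  rolloutPercentage: number;   // 0 to 100
  atFullRolloutSince?: Date;   // when it reached 100%, if it has
  lastEvaluatedAt?: Date;      // undefined when no evaluation data is available
}

const STALE_AFTER_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function daysBetween(earlier: Date, later: Date): number {
  return (later.getTime() - earlier.getTime()) / MS_PER_DAY;
}

function classifyFlag(flag: AuditedFlag, now: Date): Bucket {
  // Ops toggles are long-lived on purpose, so they are exempt from staleness rules.
  if (flag.isOpsToggle) return 'ops-toggle';

  const fullyRolledOutTooLong =
    flag.rolloutPercentage === 100 &&
    flag.atFullRolloutSince !== undefined &&
    daysBetween(flag.atFullRolloutSince, now) > STALE_AFTER_DAYS;

  const notEvaluatedRecently =
    flag.lastEvaluatedAt !== undefined &&
    daysBetween(flag.lastEvaluatedAt, now) > STALE_AFTER_DAYS;

  if (fullyRolledOutTooLong || notEvaluatedRecently) return 'dead';
  if (!flag.purpose) return 'unknown';
  return 'active-experiment';
}
```

Checking the ops marker first is a deliberate choice in this sketch: a circuit breaker that has sat at 100% for a year should not be flagged as dead.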

The results were grim. Out of 340 flags: 214 were dead, 47 were active experiments, 31 were ops toggles, and 48 were unknown. That means 77% of all flags were either dead weight or mystery meat.

The cleanup

We didn't try to be clever about this. No automated refactoring tools, no big-bang migration. Just a spreadsheet, a priority column, and a rule: every PR that touches a file with a dead flag must also remove that flag. We called it the "boy scout rule for flags."
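We enforced the rule through code review, not tooling. If you wanted a machine to do the nagging, a check along these lines could run in CI. This is a sketch under assumptions: the ChangedFile shape and the findDeadFlagReferences helper are hypothetical, and it takes file contents as plain input so it doesn't depend on any particular CI or Git API.

```typescript
// Hypothetical shape for a file touched by a pull request.
interface ChangedFile { path: string; content: string; }

interface DeadFlagHit { path: string; flagKey: string; }

// Returns every (file, flag) pair where a changed file still mentions a dead flag.
function findDeadFlagReferences(
  changedFiles: ChangedFile[],
  deadFlagKeys: string[]
): DeadFlagHit[] {
  const hits: DeadFlagHit[] = [];
  for (const file of changedFiles) {
    for (const flagKey of deadFlagKeys) {
      // Match the key as a quoted string literal to avoid partial matches,
      // e.g. 'new-tax-engine' inside 'new-tax-engine-v3'.
      const quoted = [`'${flagKey}'`, `"${flagKey}"`, `\`${flagKey}\``];
      if (quoted.some((literal) => file.content.includes(literal))) {
        hits.push({ path: file.path, flagKey });
      }
    }
  }
  return hits;
}
```

A CI step would fail the build when the returned list is non-empty and print the offending paths. Plain string matching will miss flag keys built dynamically, so treat it as a nudge and not a guarantee.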

For the 48 unknowns, we took a different approach. We added logging to track whether each flag's code path was actually being hit in production. After two weeks of data collection, 41 of the 48 were confirmed dead. The remaining seven turned out to be ops toggles that the infrastructure team had created and forgotten to document.
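The logging itself was nothing fancy. As a rough sketch of the idea, assuming an in-memory counter where a real setup would write to whatever logging or metrics pipeline is already in place, it looked something like this. The function names here are hypothetical.

```typescript
// In-memory hit counter, standing in for a real logging or metrics backend.
type Branch = 'on' | 'off';

const pathHits = new Map<string, Record<Branch, number>>();

// Call this wherever a flag is checked, passing the result of the check.
function recordFlagPath(flagKey: string, enabled: boolean): void {
  const counts = pathHits.get(flagKey) ?? { on: 0, off: 0 };
  counts[enabled ? 'on' : 'off'] += 1;
  pathHits.set(flagKey, counts);
}

// Flags from the watch list whose code path was never reached at all.
function flagsNeverHit(watchList: string[]): string[] {
  return watchList.filter((key) => !pathHits.has(key));
}

// Flags that were reached but only ever took one branch.
function flagsWithSingleBranch(watchList: string[]): string[] {
  return watchList.filter((key) => {
    const counts = pathHits.get(key);
    return counts !== undefined && (counts.on === 0 || counts.off === 0);
  });
}
```

Both lists are useful after the collection window: a flag that was never hit guards unreachable code, and a flag that only ever took one branch can usually be replaced by that branch.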

The highest-impact change was the simplest. We added createdAt and expiresAt fields to every new flag. When expiresAt passes, the flag doesn't stop working (that would be terrifying), but it shows up in a weekly Slack digest as "expired and pending removal." A human still has to do the work, but now there's a nudge.

interface FeatureFlag {
  key: string;
  enabled: boolean;
  rolloutPercentage: number;
  createdAt: Date;
  expiresAt: Date;        // required for release flags
  owner: string;          // team or individual
  type: 'release' | 'experiment' | 'ops';
}
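The digest logic is small enough to sketch. The version below only builds the message text. How it gets posted to Slack is left out, and the function names and message wording (apart from "expired and pending removal") are assumptions for illustration.

```typescript
// Same shape as the interface above, repeated so the sketch stands alone.
interface FeatureFlag {
  key: string;
  enabled: boolean;
  rolloutPercentage: number;
  createdAt: Date;
  expiresAt: Date;
  owner: string;
  type: 'release' | 'experiment' | 'ops';
}

// Flags whose expiresAt is in the past, oldest expiry first.
function findExpiredFlags(flags: FeatureFlag[], now: Date): FeatureFlag[] {
  return flags
    .filter((flag) => flag.expiresAt.getTime() < now.getTime())
    .sort((a, b) => a.expiresAt.getTime() - b.expiresAt.getTime());
}

// Builds the digest text only; delivery is deliberately not part of this sketch.
function buildExpiryDigest(flags: FeatureFlag[], now: Date): string {
  const expired = findExpiredFlags(flags, now);
  if (expired.length === 0) return 'No expired flags this week.';
  const lines = expired.map(
    (flag) =>
      `- ${flag.key} (owner: ${flag.owner}, expired ${flag.expiresAt.toISOString().slice(0, 10)})`
  );
  return [`${expired.length} flag(s) expired and pending removal:`, ...lines].join('\n');
}
```

Note that nothing here touches the enabled field. An expired flag keeps evaluating exactly as before, and the only consequence is that its owner gets named in the digest.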

Three months later

After 12 weeks of the boy scout approach plus some dedicated cleanup sprints, we were down to 89 flags. The team reported something I hear a lot on these engagements: "The code feels smaller." Cyclomatic complexity on the checkout module dropped from 34 to 11. Code review times shortened because reviewers weren't mentally parsing dead branches.

The velocity improvement was real but hard to attribute cleanly. Average cycle time for medium-sized features went from 8.2 days to 5.1 days, a drop of roughly 38%. Was all of that from flag cleanup? No. But when engineers stop hesitating before modifying a function because they're unsure which code paths are live, things move faster.

What I'd do differently next time

I should have pushed for the expiration policy from day one instead of starting with the audit. The audit was useful for understanding the scale of the problem, but the expiration policy is what prevents recurrence. Fix the system before you fix the symptoms.

I've also started recommending that teams write the removal PR at the same time they write the flag. Not merge it — just draft it. When the flag is ready to retire, the removal is a two-minute review instead of a two-hour archaeology expedition.

The uncomfortable truth about feature flags is that the discipline to remove them is harder than the discipline to add them. Every flag you create is a promise to your future self that you'll come back and clean up. Most of us are terrible at keeping that promise. The question isn't whether your team has flag debt — it's whether you know how much.