The Fallback That Was Worse Than the Failure

reliability architecture consulting debugging

A client's "graceful degradation" strategy silently served stale pricing data for 11 hours. The outage would have been better.

The incident report said the system "handled the outage gracefully." That was technically true. When the pricing service went down for maintenance at 10 PM on a Thursday, the e-commerce platform kept accepting orders. Customers saw prices. The checkout flow worked. No error pages, no 503s, no pager alerts.

The problem was that the prices were wrong.

For 11 hours, the platform served cached pricing data from a stale Redis snapshot — data that was between 3 and 47 days old depending on the product. By the time someone noticed at 9 AM Friday, the system had processed 1,400 orders with incorrect prices. Most were undercharged. Some significantly.

I got brought in the following week to help them figure out what happened and make sure it couldn't happen again.

The architecture that felt reasonable

The setup was straightforward. A product catalog service fetched pricing from a dedicated pricing service via REST. The pricing service owned the source of truth — base prices, discount rules, regional adjustments. It had its own Postgres database and a small API surface.

Between the two sat a Redis cache with a 15-minute TTL. Every request checked Redis first. On a cache miss, it called the pricing service, stored the response, and returned it. Standard read-through cache.

The team had also built a fallback path. If the pricing service returned an error or timed out, the catalog service would fall back to whatever was in Redis — even if the TTL had expired. The logic looked roughly like this:

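async function getPrice(productId: string): Promise<PriceResult> {
  // Read the cached value up front so the catch block can fall back to it.
  const cached = await redis.get(`price:${productId}`);

  try {
    const fresh = await pricingService.getPrice(productId);
    // Refresh the cache with a 15-minute (900-second) TTL.
    await redis.set(`price:${productId}`, JSON.stringify(fresh), 'EX', 900);
    return fresh;
  } catch (err) {
    // Fallback: the pricing service errored or timed out, so serve whatever
    // Redis returned, with no check on how old it is.
    if (cached) {
      return JSON.parse(cached);
    }
    throw new PricingUnavailableError(productId);
  }
}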

Notice what's missing. When the fallback path triggers, there's no indication in the response that the data is stale. The caller gets a PriceResult either way. No staleness flag, no timestamp, no warning.

The team had actually discussed this during design. They decided that showing some price was better than showing an error page. For a brief outage — a 30-second deploy, a flaky network blip — that's defensible. The cache is at most 15 minutes old. Close enough.

But there was a second problem, and our first theory about it was wrong. The Redis instance wasn't configured with a maxmemory eviction policy that matched the fallback assumption. Old keys weren't evicted on a schedule. They lingered. The theory went like this: some of the cached prices had been written weeks earlier, back when the TTL refresh cycle was keeping them alive, and then the pricing service had stopped being called for those particular products (low-traffic SKUs). The keys then sat in Redis with no TTL because the fallback write path didn't set one.

Actually, look at the code again. The set call applies a TTL of 900 seconds. But when the fallback path reads an expired key, Redis has already deleted it — unless redis.get is hitting a key that was written by the fallback path itself in a previous failure. And the fallback path doesn't write anything. So where were the stale keys coming from?

The second cache

It turned out there was another writer. A nightly batch job — built six months earlier by a different team — pre-warmed the cache for the top 5,000 products. It wrote to the same Redis keys but with no TTL. The original developer had reasoned that these products were popular enough that the read-through path would refresh them constantly, so a TTL was unnecessary.

That was true under normal operation. But during the outage, those never-expiring keys became the fallback source. The batch job ran with the previous night's data, so the prices were roughly 12-36 hours stale. For products whose prices had recently changed — and a promo had gone live that Wednesday — the cached prices were flat-out wrong.

The nightly job didn't know about the fallback behavior. The fallback behavior didn't know about the nightly job. Two reasonable systems, built at different times, combined into something neither team intended.
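
A keyspace audit is one way to catch a writer like this before an outage does. Redis's TTL command replies with -2 when a key doesn't exist and -1 when it exists but has no expiry, so any price: key answering -1 is a candidate for serving arbitrarily old data. Here is a minimal sketch of that check. TtlSample, classifyTtl, and findKeysWithoutExpiry are names made up for illustration, and collecting the samples (for example with SCAN followed by TTL) is left to whichever Redis client you use.

```typescript
// Hypothetical audit helper; not part of the client's codebase.
// A sample pairs a key with the integer reply of Redis's TTL command.
interface TtlSample {
  key: string;
  ttlSeconds: number;
}

type TtlState = "missing" | "no-expiry" | "expiring";

// TTL replies: -2 = key does not exist, -1 = key exists with no expiry,
// anything else = remaining time to live in seconds.
function classifyTtl(ttlSeconds: number): TtlState {
  if (ttlSeconds === -2) return "missing";
  if (ttlSeconds === -1) return "no-expiry";
  return "expiring";
}

// Return the keys under a prefix that will never expire on their own.
function findKeysWithoutExpiry(samples: TtlSample[], prefix: string): string[] {
  return samples
    .filter((s) => s.key.indexOf(prefix) === 0)
    .filter((s) => classifyTtl(s.ttlSeconds) === "no-expiry")
    .map((s) => s.key);
}

// Illustrative data only.
const ttlSamples: TtlSample[] = [
  { key: "price:sku-1", ttlSeconds: 412 }, // written by the read-through path
  { key: "price:sku-2", ttlSeconds: -1 },  // written without a TTL
  { key: "price:sku-3", ttlSeconds: -2 },  // already expired and deleted
];

const keysWithoutExpiry = findKeysWithoutExpiry(ttlSamples, "price:");
```

Run on a schedule, a check like this turns "a key with no TTL exists in a namespace that assumes TTLs" into something a team can see, rather than something they discover during an incident.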

What made it worse

Three things turned a stale cache into an 11-hour incident.

No staleness signal. The API response looked identical whether the price was fresh or 30 days old. Downstream consumers — the cart service, the checkout service, the order confirmation emails — had no way to know they were working with bad data.

No alerting on fallback activation. The catch block logged a warning, but nobody had built an alert for "pricing service fallback activated." The line was logged at warn level, and the team's alerting threshold was error. The log line existed. It just didn't wake anyone up.

The outage was planned. The pricing service was down for a scheduled database migration. The team doing the migration had sent a Slack message to #platform-updates at 4 PM. The catalog team didn't see it. There was no formal dependency check, no pre-maintenance validation that downstream fallbacks were safe for extended outages.

Warning

If your fallback path doesn't have its own alerting, you won't know it's active until the damage is done. Treat fallback activation as an incident, not a feature.

The fixes

We made four changes that week.

First, every cache read in the fallback path now attaches a staleness field to the response: the delta between the current time and a cachedAt timestamp stored alongside each cached price. Any consumer can check it. The checkout service now rejects prices that are more than 30 minutes stale and shows a "prices temporarily unavailable" message instead.
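
To make the staleness field concrete, here is a rough sketch of how it might look. The real PriceResult type isn't shown in this post, so the shapes below (CachedPrice, amountCents, stalenessMs) and the helper names are assumptions for illustration. Only the 30-minute checkout threshold comes from the actual fix.

```typescript
// Hypothetical shapes; the client's real types are not shown in this post.
interface CachedPrice {
  productId: string;
  amountCents: number;
  cachedAt: number; // epoch milliseconds, recorded when the entry was written
}

interface PriceResult {
  productId: string;
  amountCents: number;
  stalenessMs: number; // 0 for a fresh read from the pricing service
}

// The checkout service's limit: 30 minutes.
const CHECKOUT_MAX_STALENESS_MS = 30 * 60 * 1000;

// Convert a cache entry into a response that carries its own age.
function fromCache(entry: CachedPrice, nowMs: number): PriceResult {
  return {
    productId: entry.productId,
    amountCents: entry.amountCents,
    stalenessMs: Math.max(0, nowMs - entry.cachedAt),
  };
}

// Consumer-side check: each consumer decides how stale is too stale.
function isUsableAtCheckout(price: PriceResult): boolean {
  return price.stalenessMs <= CHECKOUT_MAX_STALENESS_MS;
}
```

The point of putting the age on the response, rather than hiding the decision in the catalog service, is that different consumers can draw the line in different places. A product listing page might tolerate an older price than checkout does.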

Second, the nightly batch job now writes to a separate key namespace (price-warm: instead of price:). The read-through cache and the pre-warm cache no longer collide. The fallback path only reads from the read-through namespace, where keys have TTLs.
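
A small sketch of what the namespace split can look like in code. Only the two prefixes come from the fix itself; the helper names and the guard function are hypothetical.

```typescript
// Prefixes from the fix; everything else here is illustrative.
const READ_THROUGH_PREFIX = "price:";
const PRE_WARM_PREFIX = "price-warm:";

// Key used by the read-through cache (always written with a TTL).
function readThroughKey(productId: string): string {
  return `${READ_THROUGH_PREFIX}${productId}`;
}

// Key used by the nightly pre-warm job.
function preWarmKey(productId: string): string {
  return `${PRE_WARM_PREFIX}${productId}`;
}

// Guard for the fallback path: refuse any key outside the read-through namespace.
function assertFallbackKey(key: string): string {
  if (key.indexOf(READ_THROUGH_PREFIX) !== 0) {
    throw new Error(`fallback may only read ${READ_THROUGH_PREFIX}* keys, got ${key}`);
  }
  return key;
}
```

Centralizing key construction in one place also gives a future second writer somewhere obvious to look before it picks a key name.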

Third, fallback activation fires a pricing.fallback.activated metric that triggers a PagerDuty alert if it exceeds 10 events in a 5-minute window. The team knows within minutes when they're serving stale data.
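
The alert rule itself lives in the monitoring stack, but the logic is simple enough to sketch. The class below is a hypothetical in-process version of "more than 10 events in a 5-minute window", written only to show the shape of the rule. It is not the client's implementation.

```typescript
// Hypothetical sliding-window counter; a real deployment would more likely
// express this rule in the metrics or alerting system itself.
class FallbackAlertWindow {
  private timestamps: number[] = [];

  constructor(
    private readonly threshold: number = 10,
    private readonly windowMs: number = 5 * 60 * 1000
  ) {}

  // Record one fallback activation; returns true when the alert should fire.
  record(nowMs: number): boolean {
    this.timestamps.push(nowMs);
    const cutoff = nowMs - this.windowMs;
    // Drop events that have aged out of the window.
    while (this.timestamps.length > 0 && this.timestamps[0] <= cutoff) {
      this.timestamps.shift();
    }
    // "Exceeds" the threshold: strictly more than `threshold` events.
    return this.timestamps.length > this.threshold;
  }
}
```

A threshold above zero keeps a single timeout from paging anyone, while a sustained outage crosses it almost immediately.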

Fourth — and this was the uncomfortable one — we added a hard staleness limit. If the cache is older than 60 minutes, the system returns an error instead of a stale price. The product team pushed back hard. "We'd rather show something than nothing." I get that instinct. But showing a wrong price isn't showing "something" — it's showing a lie that generates financial and legal liability. A clear error message is more honest and less expensive than 1,400 orders at the wrong price.
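
In code, the hard limit is a single comparison in the fallback path. The sketch below assumes each cache entry carries a cachedAt timestamp in epoch milliseconds. The CachedPrice shape, the PriceTooStaleError class, and the function name are illustrative, and the only number taken from the real fix is the 60-minute limit.

```typescript
// Hypothetical shape of a cache entry.
interface CachedPrice {
  productId: string;
  amountCents: number;
  cachedAt: number; // epoch milliseconds
}

// Raised instead of returning a price that is too old to trust.
class PriceTooStaleError extends Error {
  constructor(productId: string, ageMs: number) {
    super(`cached price for ${productId} is ${Math.round(ageMs / 60000)} minutes old`);
    this.name = "PriceTooStaleError";
  }
}

// The hard limit: 60 minutes.
const HARD_STALENESS_LIMIT_MS = 60 * 60 * 1000;

// Fallback read: return the cached entry only if it is within the hard limit.
function priceFromFallback(
  entry: CachedPrice | null,
  productId: string,
  nowMs: number
): CachedPrice {
  if (entry === null) {
    throw new Error(`no cached price for ${productId}`);
  }
  const ageMs = nowMs - entry.cachedAt;
  if (ageMs > HARD_STALENESS_LIMIT_MS) {
    throw new PriceTooStaleError(productId, ageMs);
  }
  return entry;
}
```

Note that this check only means something if entries can outlive the freshness window, so the cache retention and the staleness limit have to be chosen together.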

The broader pattern

I've seen this pattern on three other engagements since. The specifics differ — sometimes it's a stale feature flag, sometimes it's cached user permissions, sometimes it's an old product configuration — but the shape is always the same:

A team builds a fallback for short outages
Nobody tests what happens during a long outage
A different team adds a second writer to the same data store
The fallback silently serves data that's stale enough to cause real harm

The instinct to keep the system "up" is strong. Showing an error feels like failure. But there's a category of data where stale is worse than absent. Prices, permissions, rate limits, account balances — anything where acting on old data creates a problem that's harder to fix than the outage itself.

The question worth asking about any fallback path: what's the worst data this could serve, and what happens downstream when it does? If you don't like the answer, maybe the right fallback is a clear, honest error.