The Caching Layer That Made Everything Worse

We added Redis to fix slow API responses. Instead we got stale data, thundering herds, and a system that was harder to debug than the original problem.


Last fall I joined a client engagement where the main complaint was simple: the product detail API was too slow. Average response time sat around 2.4 seconds, with P99s touching 6 seconds. The product team was frustrated. The engineering lead's proposed fix was straightforward — put Redis in front of the database. Cache the heavy queries. Ship it.

I've seen this movie before, and it rarely ends at the first act.

The Obvious Fix

The codebase was a Node.js monolith backed by PostgreSQL. Product pages pulled data from five tables: products, pricing, inventory, reviews, and recommendations. Each page load triggered a handful of JOINs, a couple of subqueries, and one particularly ugly aggregation for review scores.

The team's initial approach was reasonable on the surface. They wrapped the main product query in a cache-aside pattern:

async function getProduct(id: string): Promise<Product> {
  // Cache-aside: check Redis first, fall back to Postgres on a miss
  const cached = await redis.get(`product:${id}`);
  if (cached) return JSON.parse(cached);

  const product = await db.query(PRODUCT_QUERY, [id]);
  // Write back with a one-hour TTL
  await redis.set(`product:${id}`, JSON.stringify(product), 'EX', 3600);
  return product;
}

A one-hour TTL. A simple key structure. Within a week they had cached most of the hot path. Response times dropped to around 200ms for cache hits. The team celebrated.

Then the bug reports started.

Three Problems Nobody Anticipated

Problem one: stale prices. The pricing team updated product prices through an admin tool that wrote directly to PostgreSQL. Nobody had wired up cache invalidation for that path. Customers were seeing yesterday's prices — sometimes for the full hour until the TTL expired. For a commerce platform processing a few thousand orders per day, this wasn't a minor glitch. One customer bought 200 units of something at a price that had been corrected upward three hours earlier.
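The missing piece was delete-on-write invalidation on the admin tool's write path. Here's an illustrative sketch of the pattern, using in-memory Maps as stand-ins for Redis and PostgreSQL — the helper names are hypothetical, not the client's actual code:

```typescript
// In-memory stand-ins for the cache and the pricing table
const cache = new Map<string, string>();
const pricingTable = new Map<string, number>();

function getPrice(productId: string): number {
  const key = `product:${productId}`;
  const cached = cache.get(key);
  if (cached !== undefined) return JSON.parse(cached).price;
  const price = pricingTable.get(productId) ?? 0;
  cache.set(key, JSON.stringify({ price }));
  return price;
}

function updatePrice(productId: string, newPrice: number): void {
  pricingTable.set(productId, newPrice);
  // Delete rather than overwrite the cached value: the next read
  // repopulates it, so a concurrent reader can't pin a stale payload
  // past this write.
  cache.delete(`product:${productId}`);
}
```

The key design choice is deleting instead of writing the new value into the cache, which avoids racing a concurrent writer over the cached payload.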

Problem two: the thundering herd. The most popular products had cache entries that all expired around the same time (they'd been warmed during a batch process that ran at midnight). When those keys expired, dozens of concurrent requests would all miss the cache simultaneously, all hit the database with that expensive JOIN, and the database would buckle. Response times spiked to 10+ seconds during these windows — worse than before caching existed.
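One cheap mitigation for synchronized expiry, independent of the lock-based fix we eventually used, is to jitter the TTL so keys warmed in the same batch don't all expire in the same instant. A sketch, with illustrative numbers:

```typescript
// Spread expiry times over a window above the base TTL so batch-warmed
// keys expire at different moments instead of all at once.
function jitteredTtl(baseSeconds: number, spreadFraction = 0.1): number {
  // e.g. base 3600s with 10% spread → somewhere in [3600, 3960)
  return Math.floor(baseSeconds * (1 + Math.random() * spreadFraction));
}
```

Usage would replace a fixed `'EX', 3600` with `'EX', jitteredTtl(3600)` when setting the key.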

Problem three: debugging got harder. Before caching, the mental model was clean — request comes in, hits the database, returns. Now every bug report required asking "is this the cached version or the live version?" The team started adding cache-busting query parameters for internal testing, which leaked into production URLs through copy-paste. Customer support couldn't reproduce issues because they were seeing different cached states than the customers reporting problems.

What Actually Fixed It

We ripped out the caching layer. Not permanently, but we needed to go back to a known state and fix things in the right order.

First, we fixed the actual query. I ran EXPLAIN ANALYZE on the product query and found what I expected — a sequential scan on the reviews table that had 1.2 million rows with no index on product_id. The recommendations subquery was doing a sort on an unindexed column. Two indexes and a materialized view for the review aggregation brought the uncached query from 2.4 seconds to 180ms.

CREATE INDEX idx_reviews_product_id ON reviews (product_id);
CREATE INDEX idx_recommendations_score ON recommendations (product_id, score DESC);

CREATE MATERIALIZED VIEW product_review_stats AS
SELECT product_id,
       COUNT(*) AS review_count,
       AVG(rating) AS avg_rating
FROM reviews
GROUP BY product_id;

-- A materialized view doesn't update itself; refresh it on a schedule.
-- The unique index lets REFRESH ... CONCURRENTLY run without blocking reads:
CREATE UNIQUE INDEX idx_review_stats_product_id ON product_review_stats (product_id);
-- e.g. from a periodic job:
-- REFRESH MATERIALIZED VIEW CONCURRENTLY product_review_stats;

Note

Always run EXPLAIN ANALYZE before reaching for a cache. The slowness you're trying to hide might be a missing index that takes two minutes to add.

Second, we added caching surgically. Instead of caching the entire product response, we cached only the recommendations query — the one piece that was genuinely expensive to compute and rarely changed. We used a stale-while-revalidate pattern with a four-minute soft refresh threshold and a ten-minute hard expiry: serve the stale value immediately while refreshing in the background.

async function getRecommendations(productId: string): Promise<Recommendation[]> {
  const key = `recs:${productId}`;
  const cached = await redis.get(key);

  if (cached) {
    const { data, refreshAt } = JSON.parse(cached);
    if (Date.now() > refreshAt) {
      // Past the soft threshold: refresh in background, serve stale data now
      refreshRecommendations(productId).catch(log.error);
    }
    return data;
  }

  return refreshRecommendations(productId);
}

async function refreshRecommendations(productId: string): Promise<Recommendation[]> {
  const key = `recs:${productId}`;
  const lock = await redis.set(`lock:${key}`, '1', 'NX', 'EX', 30);
  if (!lock) {
    // Another process holds the refresh lock. On a cold miss the caller
    // still needs data, so fall through to the database rather than
    // returning undefined (the declared return type is Recommendation[]).
    return db.query(RECOMMENDATIONS_QUERY, [productId]);
  }

  const data = await db.query(RECOMMENDATIONS_QUERY, [productId]);
  await redis.set(key, JSON.stringify({
    data,
    refreshAt: Date.now() + 4 * 60 * 1000 // Soft refresh after 4 min
  }), 'EX', 600); // Hard expire at 10 min
  await redis.del(`lock:${key}`); // Release early instead of waiting out the 30s TTL

  return data;
}

The lock prevented thundering herds. The background refresh meant users almost never waited for a cold cache. And because we were only caching one specific thing, the invalidation surface was tiny.

Third, we made the cache observable. We added metrics for hit rate, miss rate, stale-serve rate, and refresh latency. When the cache hit rate dropped below 80%, an alert fired. This sounds basic, but the original implementation had zero observability — the team had no idea what their hit rate was or whether the cache was actually helping.
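The counters behind those metrics can be sketched as plain in-process tallies — the metric names and the 80% threshold mirror the ones above, but this implementation (and how the counters get flushed to a metrics backend) is illustrative, not the client's actual code:

```typescript
// Minimal cache-observability counters. A stale serve still returned
// data from the cache, so it counts as a hit and is also tracked
// separately as a stale serve.
type CacheStats = { hits: number; misses: number; staleServes: number };

const stats: CacheStats = { hits: 0, misses: 0, staleServes: 0 };

function recordLookup(outcome: "hit" | "miss" | "stale"): void {
  if (outcome === "hit") stats.hits++;
  else if (outcome === "miss") stats.misses++;
  else { stats.hits++; stats.staleServes++; }
}

function hitRate(): number {
  const total = stats.hits + stats.misses;
  return total === 0 ? 1 : stats.hits / total;
}

// Fire an alert when the hit rate drops below the threshold
function shouldAlert(threshold = 0.8): boolean {
  return hitRate() < threshold;
}
```

In practice these would be emitted to whatever metrics system is in place rather than held in process memory; the point is that the cache's behavior becomes a number you can watch.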

The Numbers After

Final state: average response time of 160ms (down from 2.4 seconds). P99 at 420ms. No stale pricing incidents. The thundering herd problem was gone entirely. And we were caching exactly one thing instead of the entire response.

Redis memory usage dropped from 2.1 GB to about 340 MB because we weren't stuffing entire product payloads into it.

What I Took Away

Caching is a trade-off, not a solution. You're trading simplicity and consistency for speed, and you should only make that trade when you've exhausted the cheaper options. Missing indexes, bad query plans, unnecessary data fetching — fix those first. They're free. Caching has operational cost: invalidation logic, observability, debugging complexity, memory.

The client's instinct to "just add Redis" wasn't wrong in principle. It was wrong in sequence. They reached for a distributed system primitive to solve what turned out to be a missing B-tree index.

I keep coming back to a question that shapes how I approach these engagements: if you can't explain why a specific query is slow, how do you know caching is the right fix?