The Retry Storm That Took Down Three Services
A single slow database query triggered aggressive retries across four microservices. Within minutes, the entire order pipeline was down. Here's how we traced it and what we changed.
I got pulled into a production incident two hours into my first day on a new client engagement. The order service was returning 503s, the payment service was timing out, and the notification service had a queue depth that was climbing by thousands per minute. Three services down, all at once, with no deployment in the last 24 hours.
The engineering lead's first instinct was to restart everything. That bought about four minutes of calm before it all collapsed again.
The Setup
The client ran a mid-size e-commerce platform — roughly 8,000 orders per day across a dozen microservices. The architecture was sensible on paper: an API gateway routed to an order service, which called inventory, payment, and notification services downstream. Standard fan-out pattern. Each service had its own database, its own deploy pipeline, and its own on-call rotation.
What each service also had was its own retry logic. And that's where things went sideways.
Finding the Trigger
We pulled up Grafana and started correlating timelines. The first signs of trouble appeared in the inventory service. Its P99 response time jumped from 80ms to 2.3 seconds at 10:47 AM. Two minutes later, the order service started returning errors. Payment followed a minute after that.
The inventory service wasn't throwing errors — it was just slow. A routine query that checked stock levels across warehouses had started doing a sequential scan. The table had grown past 12 million rows overnight after a bulk import from a new supplier integration. The query planner switched strategies, and a 40ms indexed lookup became a 2-second table scan.
Here's the part that turned a slow query into a platform-wide outage: the order service retried every failed or timed-out call to inventory three times with no backoff.
```typescript
async function checkInventory(itemId: string): Promise<boolean> {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      // fetch has no `timeout` option; the 1-second cap was
      // enforced by aborting the request via AbortSignal.
      const res = await fetch(`${INVENTORY_URL}/check/${itemId}`, {
        signal: AbortSignal.timeout(1000),
      });
      if (res.ok) return (await res.json()).available;
    } catch {
      // swallow the error and immediately retry — no backoff
    }
  }
  throw new Error('Inventory check failed');
}
```

The timeout was 1 second. The inventory service was taking 2+ seconds. So every single request timed out, and every timeout spawned two more attempts. The order service was tripling its outbound traffic to a service that was already drowning.
But it didn't stop there. The payment service also called inventory to verify stock before processing charges — with its own retry loop. And the API gateway had a retry policy too, so each user-facing request that failed got retried at the gateway level before the user even saw an error.
One user clicking "Place Order" could generate up to 18 requests to the inventory service. Multiply that by a few hundred concurrent users and the inventory service was getting hammered with 20x its normal load — all retries, none of them going to succeed.
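The amplification is easy to underestimate, so it's worth writing out. Here's a back-of-the-envelope sketch of the worst case, assuming three attempts at each layer, which is what produces the "up to 18" figure:

```typescript
// Worst-case fan-out for a single "Place Order" click, assuming
// three attempts at every layer (per the retry policies above).
const gatewayAttempts = 3;
const orderToInventory = 3;   // order service's retry loop
const paymentToInventory = 3; // payment service's own retry loop

// Each gateway attempt triggers one order-service flow, which hits
// inventory up to 3 times directly and 3 more times via payment.
const requestsPerClick =
  gatewayAttempts * (orderToInventory + paymentToInventory);

console.log(requestsPerClick); // 18
```

The multiplication is the point: each layer's retry count multiplies against every layer above it, so three "reasonable" policies of three attempts each compound into an 18x hammer.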
The Arithmetic of Retries
This is the thing about retry storms that makes them so vicious. The math works against you exponentially.
Under normal conditions, the inventory service handled about 150 requests per second. When it slowed down, those requests started timing out, and the retry logic kicked in across all callers. Within minutes, the service was receiving over 3,000 requests per second. Its connection pool filled up. Its database connection limit was hit. Now requests weren't just slow — they were being actively rejected, which triggered even more retries from callers that had previously been getting slow-but-valid responses.
The failure cascaded outward. The order service exhausted its own connection pool waiting on inventory responses that were never coming. Payment couldn't reach the order service to confirm order state. The notification service was processing a growing backlog of failure notifications, each of which triggered its own downstream calls.
Three services down because one table got too big for its query plan.
What We Fixed
The immediate fix was adding an index that covered the stock-level query, giving the planner a cheap path again. Five minutes, problem solved — for now. But the retry behavior was a ticking bomb that would go off again the next time any downstream service hiccupped.
We spent the following week adding three layers of protection.
Exponential backoff with jitter. Every retry now waits longer than the last, with randomized jitter to prevent synchronized retry waves. This alone cut retry amplification by roughly 80%.
```typescript
function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function callWithBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      // Base delays of 200ms, 400ms, 800ms, plus up to 100% random jitter
      // so callers don't all retry in lockstep.
      const baseDelay = Math.pow(2, attempt) * 200;
      const jitter = Math.random() * baseDelay;
      await sleep(baseDelay + jitter);
    }
  }
  throw new Error('unreachable'); // loop always returns or throws
}
```

Circuit breakers. We added a circuit breaker in front of each downstream call. After five consecutive failures within a 30-second window, the circuit opens and calls fail immediately for 60 seconds before allowing a probe request through. This prevents a struggling service from getting buried under traffic it can't handle.
The key insight with circuit breakers: they protect the caller as much as the callee. When the order service's circuit breaker to inventory opened, it stopped burning its own threads on requests that were going to fail anyway. It could still process orders that didn't require an inventory check, and it returned fast errors instead of slow timeouts for those that did.
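A minimal breaker fits in a few dozen lines. This sketch is illustrative rather than the client's actual implementation, and for brevity it counts consecutive failures instead of failures within the 30-second window described above:

```typescript
type State = 'closed' | 'open' | 'half-open';

// Simplified circuit breaker: opens after N consecutive failures,
// fails fast while open, then lets one probe request through.
class CircuitBreaker {
  private state: State = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private resetTimeoutMs = 60_000,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        // Fast error: no thread burned, no load on the callee.
        throw new Error('Circuit open: failing fast');
      }
      this.state = 'half-open'; // reset window elapsed; allow one probe
    }
    try {
      const result = await fn();
      this.state = 'closed'; // probe (or normal call) succeeded
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```

Note the half-open handling: a failed probe reopens the circuit immediately, so a still-struggling callee only ever sees one request per reset window.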
Request budgets. Each service now tracks how many of its outbound requests are retries. If more than 20% of recent requests to a given downstream are retries, it stops retrying entirely and fails fast. This caps the amplification factor regardless of how many layers of retries exist in the call chain.
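A sketch of what that budget tracking might look like — the class name, window size, and decay strategy here are hypothetical; only the 20% threshold comes from the policy above:

```typescript
// Hypothetical retry-budget tracker. Counts retries as a fraction of
// recent outbound requests and denies retries past a fixed ratio.
class RetryBudget {
  private total = 0;
  private retries = 0;

  constructor(
    private maxRetryRatio = 0.2, // the 20% cap described above
    private windowSize = 100,    // decay counters every N requests
  ) {}

  // Check before attempting a retry; false means fail fast instead.
  canRetry(): boolean {
    return this.total === 0 || this.retries / this.total < this.maxRetryRatio;
  }

  record(isRetry: boolean): void {
    this.total++;
    if (isRetry) this.retries++;
    if (this.total >= this.windowSize) {
      // Halve rather than zero the counters so the ratio decays
      // smoothly instead of resetting in a cliff.
      this.total = Math.floor(this.total / 2);
      this.retries = Math.floor(this.retries / 2);
    }
  }
}
```

The appeal of a budget over per-call retry limits is exactly the capping property described above: no matter how many layers each add "just three attempts," total retry traffic can't exceed a fixed fraction of real traffic.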
The Uncomfortable Part
None of this was exotic. Circuit breakers, exponential backoff, jitter — these patterns have been in every distributed systems textbook for a decade. The team knew about them. They'd even discussed adding circuit breakers during an architecture review six months earlier.
What happened was the same thing I see at most places: the retry logic was written early, when the system was small and fast, and nobody revisited it as the system grew. That naive three-attempt loop with no backoff was fine when inventory responses took 40ms. It became a weapon when responses took 2 seconds.
I've started asking a specific question during architecture reviews at client sites: "Show me your retry policy." Not whether they have one — everyone has one — but the actual parameters. Timeout values, attempt counts, backoff strategy, and whether retries compound across service boundaries. The answer is almost always a shrug followed by someone grepping the codebase.
The systems that stay up aren't the ones with the best happy path. They're the ones where someone thought carefully about what happens when a dependency gets ten times slower on an ordinary Tuesday morning.