The Third-Party API That Went Slow, Not Down


A payment provider started responding in 8 seconds instead of 200ms. It wasn't an outage — their status page stayed green. But it took out our client's entire checkout flow because nobody had configured a timeout.

The incident started at 9:47 AM on a Wednesday. Customers couldn't complete checkout. The error wasn't a clean failure — no 500s, no connection refused. Requests just hung. The loading spinner spun forever, users rage-clicked, and the support queue exploded.

I was two weeks into a reliability engagement with an e-commerce client running about 4,000 orders per day. Their system was a fairly standard setup: a Next.js frontend, a Python API layer (FastAPI), and integrations with five external services — payment processing, shipping rates, inventory sync, email, and fraud scoring.

The first thing the on-call engineer checked was their own infrastructure. Databases healthy. CPU normal. Memory fine. No recent deploys. So the team did what everyone does: stared at dashboards and waited.

The status page said everything was fine

Forty minutes into the incident, someone finally checked the payment provider's status page. All green. "Operational." No degradation notice, no maintenance window. So the team dismissed it and kept looking internally.

This is where I got pulled in. I opened the application logs and noticed something: requests to the /checkout/confirm endpoint were taking 12-15 seconds. About 60% eventually completed, just very slowly. The rest hung until the load balancer cut them off at its 30-second timeout.

I ran a quick curl against the payment provider's API directly from the production box:

time curl -X POST https://api.paymentprovider.example/v1/charges \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"amount": 100, "currency": "EUR"}' \
  -w "\n%{time_total}\n"

8.2 seconds. Normally this call took 180-220ms.

The provider wasn't down. They were degraded. Their API was accepting connections, processing requests, and returning valid responses — just forty times slower than usual. And their status page wouldn't reflect this for another two hours.

Why "slow" is worse than "down"

A dead dependency is easy to handle. Connection refused, DNS failure, immediate 503 — your code gets an error back in milliseconds. You can catch it, show a friendly message, maybe retry once.

A slow dependency is insidious. The request is in flight. The socket is open. Your thread (or coroutine, or connection) is occupied, waiting. And if nothing sets an explicit timeout, a read on an established connection can wait indefinitely: the roughly two-minute Linux default that people tend to remember applies to establishing a TCP connection, not to waiting for a response on one that is already open.

Here's what happened to my client: the FastAPI service had a thread pool of 40 workers. Under normal load, a checkout request occupied a worker for about 300ms total. At 4,000 orders/day, that's roughly 3 checkouts per minute, so on average well under one worker was busy with checkout at any moment. Plenty of headroom.

When the payment API slowed to 8 seconds, each checkout request held a worker for 8+ seconds. The math changed fast. Even moderate traffic — 15 checkout attempts per minute — meant 2 workers perpetually stuck waiting. Then users started retrying. Within ten minutes, all 40 workers were blocked on payment calls, and the entire API became unresponsive. Not just checkout — product pages, search, account settings, everything.
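The arithmetic behind that paragraph is Little's law: the average number of busy workers equals the arrival rate multiplied by how long each request holds a worker. Here is a minimal sketch using the figures above (40 workers, 300ms versus 8 seconds, 15 attempts per minute). The function names are mine, for illustration only.

```python
def busy_workers(requests_per_minute: float, seconds_per_request: float) -> float:
    """Average number of workers occupied (Little's law: L = lambda * W)."""
    return requests_per_minute / 60.0 * seconds_per_request


def saturating_rate(pool_size: int, seconds_per_request: float) -> float:
    """Requests per minute at which every worker is busy on average."""
    return pool_size / seconds_per_request * 60.0


POOL_SIZE = 40

healthy = busy_workers(15, 0.3)    # 15 checkouts/min at about 300 ms each
degraded = busy_workers(15, 8.0)   # same traffic with the provider at 8 s
limit = saturating_rate(POOL_SIZE, 8.0)

print(f"healthy:  {healthy:.3f} workers busy on average")
print(f"degraded: {degraded:.1f} workers busy on average")
print(f"pool saturates at {limit:.0f} slow requests per minute")
```

These are averages, so they understate the problem: retries multiply the arrival rate, and any request that hangs longer than 8 seconds holds its worker for longer still.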

One slow dependency took out the entire service.
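To make the failure mode concrete, here is a small simulation of a shared worker pool, scaled down so it runs in under a second. It assumes, as in the incident, that slow checkout calls and cheap unrelated requests compete for the same workers. The pool size, durations, and function names are illustrative stand-ins, not the client's code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4             # scaled down from 40 to keep the demo quick
SLOW_CALL_SECONDS = 0.2   # stands in for the 8-second payment call


def slow_checkout() -> str:
    # The worker is held for the whole time it "waits on the provider".
    time.sleep(SLOW_CALL_SECONDS)
    return "checkout"


def fast_product_page(submitted_at: float) -> float:
    # Returns how long this request sat in the queue before a worker ran it.
    return time.monotonic() - submitted_at


def run_demo() -> float:
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # Two full waves of slow checkouts occupy every worker.
        slow = [pool.submit(slow_checkout) for _ in range(POOL_SIZE * 2)]
        # A cheap, unrelated request arrives right behind them.
        fast = pool.submit(fast_product_page, time.monotonic())
        queued_for = fast.result()
        for future in slow:
            future.result()
    return queued_for


queued_for = run_demo()
print(f"product page waited {queued_for:.2f}s for a free worker")
```

The product page does no slow work itself, yet it waits behind both waves of checkouts. Scale the sleep back up to 8 seconds and that queueing delay is what users saw on every page.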

What was missing

I spent the afternoon after the incident reviewing their integration code. The payment client was initialized like this:

import httpx
 
payment_client = httpx.Client(
    base_url="https://api.paymentprovider.example/v1",
    headers={"Authorization": f"Bearer {settings.payment_api_key}"},
)

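No explicit timeout. No retry configuration. No circuit breaker. Nothing in the code said how long a payment call was allowed to take. (httpx itself documents a default of 5 seconds for each of connect, read, write, and pool acquisition, and its read timeout bounds each wait for data rather than the whole request, so library defaults are not a total time budget. Some HTTP clients, such as requests, apply no timeout at all unless you pass one, and will wait indefinitely once connected.)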

The fix wasn't complicated. It never is, after the fact.

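import httpx
from circuitbreaker import circuit
 
# settings and ChargeResult come from the application.
# httpx.Timeout needs either a default or all four values set explicitly;
# pool=2.0 is an assumed value added to make the call valid.
payment_client = httpx.Client(
    base_url="https://api.paymentprovider.example/v1",
    headers={"Authorization": f"Bearer {settings.payment_api_key}"},
    timeout=httpx.Timeout(connect=2.0, read=3.0, write=2.0, pool=2.0),
)
 
# Open the circuit after 5 consecutive failures; allow a trial call after 30 seconds.
@circuit(failure_threshold=5, recovery_timeout=30)
def create_charge(amount: int, currency: str) -> ChargeResult:
    response = payment_client.post(
        "/charges",
        json={"amount": amount, "currency": currency},
    )
    # HTTP error statuses raise here and count toward the failure threshold.
    response.raise_for_status()
    return ChargeResult.from_response(response)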

Warning

Setting a timeout isn't enough on its own. Without a circuit breaker, you'll still burn through your thread pool — you'll just do it slightly slower, failing each request after 3 seconds instead of hanging forever.

The circuit breaker is what actually protects the rest of the system. After five consecutive failures (or timeouts), it stops sending requests to the payment API entirely and fails immediately. This keeps your workers free for everything else. After 30 seconds, it lets one request through to test if the dependency recovered.
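For readers who haven't looked inside one, the state machine just described can be sketched in a few dozen lines. This is an illustration of the idea under simplifying assumptions (single-threaded, any exception counts as a failure), not how the circuitbreaker package is implemented. The class and exception names are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the dependency while the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so tests need not sleep
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.recovery_timeout:
                # Open: fail immediately so the worker is released right away.
                raise CircuitOpenError("dependency marked unhealthy")
            # Recovery window elapsed: fall through and let this call probe.
        try:
            result = func()
        except Exception:
            self._consecutive_failures += 1
            probe_failed = self._opened_at is not None
            if probe_failed or self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
            raise
        # Any success closes the circuit and resets the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

Usage would look like `breaker.call(lambda: create_charge(100, "EUR"))`. A production breaker also needs locking so that only one probe goes through after the recovery window, which is one reason to use a maintained library rather than this sketch.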

The deeper problem: no dependency health model

The real issue wasn't a missing timeout — that was just the symptom. The team had no mental model for dependency health. They treated every external service as either "working" or "an exception to catch." The spectrum between those two states — degraded, slow, partially available, returning stale data — wasn't part of their design vocabulary.

After the incident, we mapped all five external dependencies and asked three questions about each:

What's our timeout budget? (How long can this call take before the user experience degrades unacceptably?)
What's our fallback? (Can we serve a degraded experience, queue the work for later, or do we hard-fail?)
How do we detect degradation independently of the provider's status page?

For payments, the answers were: 3 seconds max, hard-fail with a clear user message, and a synthetic canary request every 30 seconds that tracked p95 latency.
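The canary is simple enough to sketch. The version below keeps a rolling window of probe latencies and flags degradation when p95 crosses a threshold. The 1-second threshold, the 20-sample window, and all names are assumptions for illustration. The loop that calls run_once every 30 seconds and the alerting are left out.

```python
import time
from collections import deque
from typing import Callable, Deque, Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sequence of latencies."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


class LatencyCanary:
    """Keeps a rolling window of probe latencies and flags degradation."""

    def __init__(
        self,
        probe: Callable[[], None],
        threshold_seconds: float,
        window: int = 20,
    ) -> None:
        self._probe = probe
        self._threshold = threshold_seconds
        self._samples: Deque[float] = deque(maxlen=window)

    def record(self, latency_seconds: float) -> None:
        self._samples.append(latency_seconds)

    def run_once(self) -> float:
        """Time one probe call and record it, even if the probe raises."""
        started = time.monotonic()
        try:
            self._probe()
        finally:
            self.record(time.monotonic() - started)
        return self._samples[-1]

    def is_degraded(self) -> bool:
        return bool(self._samples) and p95(self._samples) > self._threshold


canary = LatencyCanary(probe=lambda: None, threshold_seconds=1.0)
for latency in [0.2] * 19 + [8.2]:
    canary.record(latency)
print(canary.is_degraded())  # a single slow outlier in 20 samples

for latency in [8.2, 8.2]:
    canary.record(latency)
print(canary.is_degraded())  # three slow samples in the same window
```

Tracking p95 rather than the mean or the latest sample is a deliberate trade: one slow probe does not page anyone, but a sustained slowdown shows up within a few samples, independent of what the provider's status page says.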

For shipping rates, the fallback was simpler — show a "calculated at next step" message and fetch rates asynchronously. For email, we could queue indefinitely. Each dependency got its own resilience strategy based on how critical it was and what alternatives existed.
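One way to keep those answers from living only in a post-incident doc is to write them down as data next to the integrations. This is a hypothetical sketch: the type and field names are mine, and a timeout of None marks a budget this write-up does not state, not a recommendation to run without one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Fallback(Enum):
    HARD_FAIL = "hard-fail with a clear user message"
    DEFER = "show a placeholder and fetch asynchronously"
    QUEUE = "queue the work and retry later"


@dataclass(frozen=True)
class DependencyPolicy:
    name: str
    timeout_seconds: Optional[float]  # None: budget not stated in this write-up
    fallback: Fallback
    detection: Optional[str] = None   # how degradation is detected independently


POLICIES: Dict[str, DependencyPolicy] = {
    policy.name: policy
    for policy in (
        DependencyPolicy(
            "payments", 3.0, Fallback.HARD_FAIL,
            detection="synthetic canary every 30 s, tracking p95 latency",
        ),
        DependencyPolicy("shipping_rates", None, Fallback.DEFER),
        DependencyPolicy("email", None, Fallback.QUEUE),
    )
}


def unanswered(policies: Dict[str, DependencyPolicy]) -> List[str]:
    """Names of dependencies still missing a timeout budget or a detection method."""
    return sorted(
        name for name, policy in policies.items()
        if policy.timeout_seconds is None or policy.detection is None
    )


print(unanswered(POLICIES))
```

A check like unanswered can run in CI, so a new integration without all three questions answered fails the build instead of waiting for an incident.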

The uncomfortable truth

Every team I've worked with that's had this kind of incident says the same thing afterward: "We knew we should have had timeouts." They did know. It's in every distributed systems talk, every architecture book, every "lessons learned" blog post. Including this one, now.

The reason it doesn't get done isn't ignorance — it's prioritization. Setting up proper timeouts, circuit breakers, and fallback behavior for five external services is maybe two days of work. But it's two days that compete against feature work with visible business value. It only becomes urgent after the first time you lose half a day of revenue.

I don't have a satisfying answer for that. The best I've managed is: when you add a new external dependency, configure its resilience in the same PR that adds the integration. Don't file a ticket for "add timeouts later." Later never comes until 9:47 AM on a Wednesday.