The Webhook That Silently Dropped Forty Thousand Events

Tags: reliability, architecture, consulting, debugging, webhooks

A client's payment provider was sending webhook notifications correctly. Their system acknowledged every one. And then quietly threw most of them away.

The support tickets started slow. A customer here and there complaining that their order status still showed "processing" days after they'd paid. The client — a mid-sized marketplace platform — assumed it was a UI caching issue and bumped it to the backlog.

Two weeks later, the trickle became a flood. Their finance team noticed that bank reconciliation was off. Payments were landing in Stripe, but the platform's internal records hadn't updated. When I got the call, they had roughly 40,000 webhook events that their system had received, acknowledged, and then lost.

The setup that looked fine

The architecture wasn't unusual. Stripe sends a webhook to their endpoint. The endpoint validates the signature, returns a 200, and puts the event payload onto a Redis queue for async processing. A set of worker processes picks events off the queue, updates the database, triggers emails, and adjusts order statuses.

They had monitoring on the workers. They had alerts on queue depth. Everything was green.

The webhook handler itself was short and straightforward:

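import json

import stripe
from fastapi import HTTPException, Request

# app, redis_client, and webhook_secret are defined elsewhere in the service.

@app.post("/webhooks/stripe")
async def handle_stripe_webhook(request: Request):
    # Signature verification needs the raw, unparsed request body.
    payload = await request.body()
    sig_header = request.headers.get("stripe-signature")

    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, webhook_secret
        )
    except ValueError:
        raise HTTPException(status_code=400, detail="Invalid payload")
    except stripe.error.SignatureVerificationError:
        raise HTTPException(status_code=400, detail="Invalid signature")

    # Enqueue for the workers, then acknowledge to Stripe.
    await redis_client.lpush("webhook_events", json.dumps(event))
    return {"status": "ok"}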

Receive, validate, enqueue, respond. The textbook pattern.

Finding the gap

The first thing I checked was whether Stripe was actually delivering events. Their Stripe dashboard showed a healthy delivery rate — 99.7% success, all returning 200. So the events were arriving and being acknowledged.

Next I looked at Redis. The queue length was near zero, which the team had taken as a sign that workers were keeping up. But near-zero can mean two things: workers are fast, or nothing is being enqueued.

I added a counter. A simple INCR webhook_received in the handler, and a separate INCR webhook_processed in the worker. After 24 hours the numbers were:

Received: 8,241
Processed: 8,206

That looked right. So where did the 40,000 go?

The answer was in the deploy logs. The platform ran on Kubernetes, and their webhook service had been restarting frequently — not crashing, just rolling deployments. They shipped multiple times a day, and each deploy cycled every pod in the webhook service.

Here's what happened during a deploy: Kubernetes sends SIGTERM to the old pod, starts a new one, and routes traffic to the new pod once it's ready. But there's a window — typically a few seconds — where the old pod has already received requests, pushed them to Redis, and is now shutting down. If the lpush to Redis happened but the pod died before the response went back to Stripe, Stripe would see a timeout and retry. That part was actually fine.

The real problem was subtler. Their Redis was configured with maxmemory-policy allkeys-lru. When memory pressure hit, which happened during traffic spikes when the queue backed up, Redis silently evicted keys using LRU. That included the webhook queue. A Redis list is a single key, so when the queue key was picked for eviction, every event still waiting in it went with it.

Warning

If your Redis instance uses allkeys-lru as its eviction policy and you're using it as a message queue, you're building on sand. An evicted queue doesn't raise an error. The key, and everything in it, just vanishes.

The events were received. They were enqueued. And then Redis, doing exactly what it was configured to do, quietly removed them to make room for newer data.
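
A cheap guard against this is to check the eviction policy when the service starts. What follows is a minimal sketch, not the client's code: the helper name queue_can_be_evicted is hypothetical, and it assumes the queue key has no TTL set (the volatile-* policies only consider keys with an expiry).

```python
from typing import Mapping

# Policies under which Redis will not evict a key that has no TTL.
# noeviction rejects writes instead of evicting; the volatile-* policies
# only pick victims among keys that have an expiry set.
SAFE_FOR_QUEUES = {
    "noeviction",
    "volatile-lru",
    "volatile-lfu",
    "volatile-random",
    "volatile-ttl",
}


def queue_can_be_evicted(config: Mapping[str, str]) -> bool:
    """Return True if the configured policy may evict a TTL-less queue key."""
    policy = config["maxmemory-policy"]
    return policy not in SAFE_FOR_QUEUES
```

With redis-py, the mapping could come from client.config_get("maxmemory-policy"), assuming the deployment allows CONFIG commands (some managed Redis offerings do not). Failing startup when this returns True turns a silent misconfiguration into a loud one.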

Why the monitoring missed it

Their alerting was built around queue depth. High queue depth meant workers were falling behind. Low queue depth meant everything was fine. But eviction never shows up as high queue depth. It shows up as queue depth dropping, which looks identical to workers doing their job.

They didn't have a metric for total events enqueued versus total events dequeued. The counter I added would have caught this, but nobody had thought to add it because the system appeared healthy from every angle they were watching.
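
The enqueued-versus-dequeued check can be sketched roughly as follows. This is an illustration under assumptions: it uses a synchronous client for brevity, the counter names match the INCR calls mentioned earlier, and unprocessed_gap is a hypothetical helper. The counters themselves should live somewhere eviction cannot reach, or they can vanish just like the queue did.

```python
def record_received(client) -> None:
    """Called by the webhook handler after a successful enqueue."""
    client.incr("webhook_received")


def record_processed(client) -> None:
    """Called by the worker after an event is fully processed."""
    client.incr("webhook_processed")


def unprocessed_gap(client, queue_key: str = "webhook_events") -> int:
    """Events received that are neither processed nor still in the queue.

    A value that stays above zero means events went missing between
    enqueue and dequeue. Small transient values are expected while
    workers hold events they have popped but not yet finished.
    """
    received = int(client.get("webhook_received") or 0)
    processed = int(client.get("webhook_processed") or 0)
    queued = client.llen(queue_key)
    return received - processed - queued
```

Alerting on a gap that stays positive for several minutes catches loss regardless of its cause, which queue depth alone cannot do.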

Redis does expose eviction stats. INFO stats includes evicted_keys. But nobody was shipping that metric to their monitoring stack. It was sitting right there in the Redis CLI, invisible to Grafana.
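
Shipping that number takes very little code. The sketch below assumes a redis-py style client whose info("stats") call returns a dict containing evicted_keys; the function name evictions_since and the polling approach are illustrative, not what the team deployed.

```python
from typing import Tuple


def evictions_since(client, previous_total: int) -> Tuple[int, int]:
    """Return (current_total, newly_evicted) from Redis INFO stats.

    evicted_keys is cumulative since the server started, so the useful
    signal is the delta between two samples.
    """
    stats = client.info("stats")
    total = int(stats["evicted_keys"])
    if total < previous_total:
        # The counter went backwards: the server restarted or stats
        # were reset, so everything counted so far is new.
        return total, total
    return total, total - previous_total
```

On an instance that holds a queue, any non-zero delta is worth an alert.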

The fix

We changed three things.

First, we moved the webhook queue to a Redis Stream instead of a plain list. A stream is still a single key, so on its own it is not immune to allkeys-lru; the queue has to live on an instance, or under a policy such as noeviction, where eviction cannot touch it. What streams add is an explicit max length and, more importantly, consumer groups with acknowledgment semantics. If a worker crashes mid-processing, the event doesn't disappear. It stays in the pending entries list until another worker claims it.

await redis_client.xadd(
    "webhook_events_stream",
    {"payload": json.dumps(event)},
    maxlen=500000,
)
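
On the consuming side, a worker built on a consumer group might look something like this. It is a sketch under assumptions: the group name webhook_workers is made up, the group is assumed to exist already (XGROUP CREATE), and the client is assumed to use decode_responses=True so field names come back as strings.

```python
import json
from typing import Callable

STREAM = "webhook_events_stream"
GROUP = "webhook_workers"  # hypothetical group name


def process_batch(
    client,
    consumer: str,
    handle: Callable[[dict], None],
    count: int = 10,
) -> int:
    """Read new entries for this consumer, acknowledging each after success.

    If handle() raises, the entry is never acknowledged, so it stays in
    the group's pending entries list instead of disappearing.
    """
    response = client.xreadgroup(
        GROUP, consumer, {STREAM: ">"}, count=count, block=5000
    )
    acked = 0
    for _stream_name, entries in response or []:
        for entry_id, fields in entries:
            handle(json.loads(fields["payload"]))
            client.xack(STREAM, GROUP, entry_id)
            acked += 1
    return acked
```

Entries left pending by a dead worker can be picked up by another consumer with XAUTOCLAIM (Redis 6.2 and later). One caveat with maxlen: trimming drops the oldest entries whether or not they have been processed, so the cap needs to sit well above the worst backlog you expect.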

Second, we added an idempotency layer. Every Stripe event has a unique ID. Before processing, the worker checks a Postgres table for that ID. If it's already been processed, skip it. If not, process it and record the ID. This meant we could safely replay Stripe's events for the affected time window without worrying about double-processing.
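
Here is one way such a layer can be sketched. Two loud assumptions: sqlite3 stands in for Postgres so the example runs anywhere, and the table name processed_events is hypothetical. It is also a slight variant of check-then-record: it inserts the ID first and lets the primary key reject duplicates, which avoids a race between two workers that both check before either records.

```python
import sqlite3
from typing import Callable

SCHEMA = (
    "CREATE TABLE IF NOT EXISTS processed_events "
    "(event_id TEXT PRIMARY KEY)"
)


def process_once(
    conn: sqlite3.Connection,
    event_id: str,
    handler: Callable[[], None],
) -> bool:
    """Run handler at most once per event_id. Returns False for duplicates."""
    try:
        cur = conn.execute(
            "INSERT INTO processed_events (event_id) VALUES (?) "
            "ON CONFLICT(event_id) DO NOTHING",
            (event_id,),
        )
        if cur.rowcount == 0:
            # The ID was already recorded: skip the event.
            conn.rollback()
            return False
        handler()
        conn.commit()
        return True
    except Exception:
        # Undo the ID insert so a retry can process the event again.
        conn.rollback()
        raise
```

The guarantee only covers what happens inside the transaction. If the handler writes order state through the same connection, the ID and the state change commit together; side effects such as emails are not rolled back and need their own deduplication.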

Third, we added the missing observability. Redis eviction count as a metric. A gauge comparing enqueue rate to dequeue rate. And a daily reconciliation job that compared Stripe's event log against our processed-events table and flagged any gaps.
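
The core of a reconciliation job like that is a set difference. A minimal sketch, assuming both sides can be reduced to iterables of event IDs (how they are fetched is left out, and find_missing_events is a hypothetical name):

```python
from typing import Iterable, List


def find_missing_events(
    provider_ids: Iterable[str],
    processed_ids: Iterable[str],
) -> List[str]:
    """Return provider event IDs with no matching processed record.

    Order follows provider_ids, and each missing ID is reported once.
    """
    processed = set(processed_ids)
    reported = set()
    missing = []
    for event_id in provider_ids:
        if event_id not in processed and event_id not in reported:
            missing.append(event_id)
            reported.add(event_id)
    return missing
```

An empty result is the healthy case. Anything else is a list of exactly which events to replay.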

Note

Stripe lets you list all events via their API with timestamp filters. For recovery, we pulled the full event history for the affected period and replayed them through the idempotent processor. It took about 20 minutes to recover all 40,000 events.
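
The replay loop itself can be very small. This sketch assumes the events are dict-like with id and created fields, as Stripe events are, and that process_once is an idempotent processor returning True when it applied the event and False when it skipped a duplicate; both names are illustrative.

```python
from typing import Callable, Iterable, Mapping, Tuple


def replay_events(
    events: Iterable[Mapping],
    process_once: Callable[[Mapping], bool],
) -> Tuple[int, int]:
    """Replay events oldest-first. Returns (applied, skipped).

    Stripe's list endpoint returns the newest events first, so the
    batch is sorted by creation time before processing.
    """
    ordered = sorted(events, key=lambda event: event["created"])
    applied = 0
    skipped = 0
    for event in ordered:
        if process_once(event):
            applied += 1
        else:
            skipped += 1
    return applied, skipped


# The events could be supplied by something like
#   stripe.Event.list(created={"gte": start, "lte": end}).auto_paging_iter()
# (check this against the version of the stripe library you run).
```

Because the processor is idempotent, the window can be generous: events that were handled correctly the first time simply count as skipped.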

The deeper lesson

This one sticks with me because every individual component was doing the right thing. Stripe delivered correctly. The handler validated and enqueued correctly. The workers processed correctly. Redis evicted according to its configuration. Kubernetes deployed according to its spec.

The failure existed only in the interaction between components — in the assumptions each layer made about the others. The handler assumed Redis would hold the data. The monitoring assumed queue depth told the full story. The team assumed that a 200 response to Stripe meant the event was safe.

I now ask a specific question during architecture reviews: "What happens to in-flight data when this component restarts or drops a message?" Not whether it can happen, but what the system does when it does. Most teams I work with don't have an answer, and that's where the interesting conversations start.

Webhooks are deceptively simple. Receive a POST, do a thing. But "receive" and "durably process" are separated by a gap that's easy to forget about — until 40,000 events remind you.