reliability

The Rate Limiter That Let Everything Through
api-design performance reliability consulting architecture
A client's API had rate limiting configured and enforced. It still couldn't prevent a single customer from tanking performance for everyone else. The problem wasn't the limiter — it was what we were counting.
Published On
July 10, 2026
The Rollback That Didn't Roll Back
deployment reliability consulting devops database
We practiced deployments religiously but never tested a rollback. When a release broke checkout and we hit the big red button, we found out half the system couldn't actually go backward.
Published On
July 1, 2026
The Deploy That Dropped Requests in Silence
reliability kubernetes consulting devops debugging
Every deploy was losing a handful of HTTP requests, but nobody noticed until a payment callback disappeared. The fix wasn't in the deployment pipeline — it was in the application code that never learned how to shut down.
Published On
June 24, 2026
The Query Plan That Changed Its Mind at 3 AM
databases debugging postgres consulting reliability
A routine ANALYZE flipped a Postgres query plan from an index scan to a sequential scan, and our API's response time went from 12 ms to 8 seconds. Here's what we learned about a failure mode most teams never think about.
Published On
June 22, 2026
The Webhook That Silently Dropped Forty Thousand Events
reliability architecture consulting debugging webhooks
A client's payment provider was sending webhook notifications correctly. Their system acknowledged every one. And then quietly threw most of them away.
Published On
June 15, 2026
The Fallback That Was Worse Than the Failure
reliability architecture consulting debugging
A client's "graceful degradation" strategy silently served stale pricing data for 11 hours. The outage would have been better.
Published On
June 12, 2026
Your Healthcheck Endpoint Is Probably Lying
reliability infrastructure kubernetes consulting
Most healthcheck endpoints return 200 OK as long as the process is running. That's not a healthcheck — it's a pulse check. Here's what happened when we confused the two, and what a real healthcheck should verify.
Published On
June 8, 2026
The Third-Party API That Went Slow, Not Down
reliability debugging consulting architecture resilience
A payment provider started responding in 8 seconds instead of 200ms. It wasn't an outage — their status page stayed green. But it took out our client's entire checkout flow because nobody had configured a timeout.
Published On
June 1, 2026
The Staging Environment Nobody Trusted (So Everyone Tested in Production)
devops consulting environments reliability developer-experience
A client's staging environment had drifted so far from production that developers stopped using it. Tests passed in staging and failed in prod. Tests failed in staging and passed in prod. Eventually the team just stopped looking.
Published On
May 20, 2026
The Connection Pool That Starved at 3 PM Every Day
database debugging performance consulting reliability
A client's API started throwing 500s every weekday afternoon like clockwork. The database was fine. The queries were fast. The problem was a reporting job that quietly hogged every available connection during peak traffic.
Published On
May 18, 2026
The Job Queue That Silently Ate 12,000 Emails
reliability queues debugging consulting observability
A client's notification queue was draining normally and all dashboards showed green. But three weeks of transactional emails had vanished into a catch block nobody thought to monitor.
Published On
May 15, 2026
The Timezone Bug That Quietly Ate Three Weeks of Revenue Data
debugging postgresql consulting reliability
A Node.js service was writing UTC timestamps to a PostgreSQL database configured for Europe/Berlin. Nobody noticed the mismatch until a DST transition made an entire hour of orders vanish from daily reports.
Published On
May 8, 2026
The Retry Storm That Took Down Three Services
microservices reliability architecture consulting
A single slow database query triggered aggressive retries across four microservices. Within minutes, the entire order pipeline was down. Here's how we traced it and what we changed.
Published On
April 24, 2026
The Postmortems We Wrote But Never Acted On
incident-management engineering-culture consulting reliability
We ran blameless postmortems after every incident. We wrote detailed action items with owners and deadlines. Six months later, 78% of those items were still open — and the same incidents kept happening.
Published On
April 17, 2026
The Cron Job That Ran Twice (And Charged Everyone Twice Too)
kubernetes distributed-systems consulting debugging reliability
A consulting story about a nightly billing job that quietly started double-charging customers after a Kubernetes migration — and the boring lock that finally fixed it.
Published On
April 13, 2026
