A Node.js service was writing UTC timestamps to a PostgreSQL database configured for Europe/Berlin. Nobody noticed the mismatch until a DST transition made an entire hour of orders vanish from daily reports.
A single slow database query triggered aggressive retries across four microservices. Within minutes, the entire order pipeline was down. Here's how we traced it and what we changed.
We ran blameless postmortems after every incident. We wrote detailed action items with owners and deadlines. Six months later, 78% of those items were still open — and the same incidents kept happening.
A consulting story about a nightly billing job that quietly started double-charging customers after a Kubernetes migration — and the boring lock that finally fixed it.