A Node.js service was writing UTC timestamps to a PostgreSQL database configured for Europe/Berlin. Nobody noticed the mismatch until a DST transition made an entire hour of orders vanish from daily reports.
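A rough sketch of the failure shape, assuming node-postgres and a hypothetical orders table (not the client's actual schema): the buggy variant stores a zone-less UTC string that reports later read as Berlin local time, while the timestamptz variant stores an absolute instant.

```typescript
// Sketch of the mismatch, assuming node-postgres ("pg") and a
// hypothetical "orders" table; not the client's actual schema.
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from PG* env vars

async function recordOrder(): Promise<void> {
  // Buggy shape: `created_at` is `timestamp` (no time zone). A UTC
  // wall-clock string is stored as-is, but downstream reports read it
  // as Europe/Berlin local time, so rows drift by an hour or two and a
  // DST transition can push an hour of orders into the wrong day.
  await pool.query(
    "INSERT INTO orders (created_at) VALUES ($1)",
    [new Date().toISOString().replace("Z", "")] // e.g. '2024-03-31T00:30:00.000'
  );

  // Safer shape: make the column `timestamptz` and pass a real Date.
  // Postgres then stores an absolute instant; the session time zone
  // only affects display, not which day an order lands on.
  await pool.query(
    "INSERT INTO orders (created_at) VALUES ($1)",
    [new Date()]
  );
}
```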
A client's dashboard took 11 seconds to render. Everyone blamed the database. The real problem was an ORM doing exactly what we told it to — we just never looked at what that meant.
A 20-person team was running 14 microservices with three full-time engineers just keeping the infrastructure alive. We consolidated six of those services into a modular monolith and cut their deploy time by 70%.
A client's team was drowning in a 200K-line TypeScript monolith. When we finally measured what was actually running in production, we found that almost a third of it was dead code nobody had touched in over a year.
We ran load tests before a big product launch, got green across the board, and watched the system buckle under real traffic two days later. The tests weren't wrong — they just weren't testing reality.
A client's on-call rotation was burning people out — not because of real incidents, but because of noise. Here's how we cut alerts by 87% without missing anything that actually required a human.
A single slow database query triggered aggressive retries across four microservices. Within minutes, the entire order pipeline was down. Here's how we traced it and what we changed.
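To make the mechanism concrete, here is a hypothetical sketch of how immediate, unbounded retries amplify one slow dependency, plus one common mitigation (bounded attempts with jittered backoff). The service URL and function names are made up; the actual changes are in the write-up.

```typescript
// Hypothetical illustration of retry amplification, not the client's code.
// Stack four services that all retry like this and one slow query
// becomes a storm.

async function fetchOrderNaive(id: string): Promise<Response> {
  // Any non-OK response is retried immediately and forever, so load on
  // the already-struggling dependency only grows.
  while (true) {
    const res = await fetch(`https://orders.internal/orders/${id}`);
    if (res.ok) return res;
  }
}

// One common mitigation (a sketch, not necessarily what this team shipped):
// cap the attempts and back off with jitter so retries spread out instead
// of arriving in synchronized waves.
async function fetchOrderBounded(id: string, maxAttempts = 3): Promise<Response> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetch(`https://orders.internal/orders/${id}`);
    if (res.ok) return res;
    if (attempt === maxAttempts) break;
    const backoffMs = Math.random() * 100 * 2 ** attempt; // jittered backoff
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
  }
  throw new Error(`order ${id} unavailable after ${maxAttempts} attempts`);
}
```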
A 47-minute build pipeline had become sacred infrastructure. When we finally opened it up, we found cargo-culted steps, redundant checks, and a team afraid of their own tooling.
We ran blameless postmortems after every incident. We wrote detailed action items with owners and deadlines. Six months later, 78% of those items were still open — and the same incidents kept happening.
A tool comparison from a consulting engagement — what Biome actually delivered, where it fell short, and the decision I'd make again (and the one I wouldn't).
A consulting story about a nightly billing job that quietly started double-charging customers after a Kubernetes migration — and the boring lock that finally fixed it.
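One suitably boring lock for this shape of problem is a PostgreSQL advisory lock: only one holder at a time, so a second Kubernetes pod running the same CronJob simply skips the run. A minimal sketch, assuming node-postgres and a made-up lock key, not necessarily the exact fix used here.

```typescript
// Sketch of a Postgres advisory lock guarding a nightly job, assuming
// node-postgres; the lock key and chargeCustomers() are placeholders.
import { Pool, PoolClient } from "pg";

const BILLING_JOB_LOCK_KEY = 427001; // arbitrary app-wide constant

async function chargeCustomers(client: PoolClient): Promise<void> {
  // placeholder for the actual billing work
}

async function runNightlyBilling(pool: Pool): Promise<void> {
  const client = await pool.connect();
  try {
    // pg_try_advisory_lock returns false immediately if another session
    // (for example a second pod started by the same CronJob) holds it.
    const { rows } = await client.query(
      "SELECT pg_try_advisory_lock($1) AS acquired",
      [BILLING_JOB_LOCK_KEY]
    );
    if (!rows[0].acquired) {
      console.log("another instance is already billing, skipping this run");
      return;
    }
    try {
      await chargeCustomers(client);
    } finally {
      await client.query("SELECT pg_advisory_unlock($1)", [BILLING_JOB_LOCK_KEY]);
    }
  } finally {
    client.release();
  }
}
```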
We added Redis to fix slow API responses. Instead we got stale data, thundering herds, and a system that was harder to debug than the slow one we started with.
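For illustration, here is roughly what the thundering herd looks like with naive cache-aside, plus one small mitigation (per-process request coalescing). ioredis and loadProductFromDb are assumptions for the sketch, not the client's stack.

```typescript
// Hypothetical sketch of the naive cache-aside pattern behind a cache
// stampede: when a hot key expires, every concurrent request misses at
// once and all of them hit the database together.
import Redis from "ioredis";

const redis = new Redis();

async function loadProductFromDb(id: string): Promise<string> {
  return JSON.stringify({ id }); // placeholder for the real query
}

async function getProductNaive(id: string): Promise<string> {
  const cached = await redis.get(`product:${id}`);
  if (cached !== null) return cached;

  const fresh = await loadProductFromDb(id);         // N concurrent misses
  await redis.set(`product:${id}`, fresh, "EX", 60); // => N identical DB queries
  return fresh;
}

// One simple mitigation (a sketch, not the full fix from the story):
// coalesce concurrent misses so only one DB query is in flight per key.
const inFlight = new Map<string, Promise<string>>();

async function getProductCoalesced(id: string): Promise<string> {
  const cached = await redis.get(`product:${id}`);
  if (cached !== null) return cached;

  let pending = inFlight.get(id);
  if (!pending) {
    pending = loadProductFromDb(id)
      .then(async (fresh) => {
        await redis.set(`product:${id}`, fresh, "EX", 60);
        return fresh;
      })
      .finally(() => inFlight.delete(id));
    inFlight.set(id, pending);
  }
  return pending;
}
```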
A consulting story about a platform engineering initiative that checked every box on paper but collected dust in practice — and the uncomfortable reasons why.
The Axios supply chain attack reminded me of a dependency audit I ran at a client last year. What I found was worse than anything a vulnerability scanner could flag.
A consulting story about a minor field rename in an internal API that cascaded into a production incident, and what we put in place to stop it from happening again.
The latest DORA report confirms what I've seen on consulting engagements — AI tools amplify existing conditions. If your processes are solid, AI accelerates you. If they're not, you just create technical debt faster.
A consulting engagement where we measured where engineering hours actually went — and discovered the biggest productivity killer wasn't what anyone expected.
A debugging deep dive into replacing wall-of-text logs with structured logging and trace IDs — and how it cut our mean time to resolution from hours to minutes.
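The core of the approach fits in a few lines. A minimal sketch, assuming pino for JSON logs and Node's AsyncLocalStorage for propagating a per-request trace ID; handleOrder and the field names are made up.

```typescript
// Minimal sketch: every log line is one JSON object carrying the same
// traceId, so a single query in the log backend reconstructs a request.
import pino from "pino";
import { AsyncLocalStorage } from "node:async_hooks";
import { randomUUID } from "node:crypto";

const als = new AsyncLocalStorage<{ traceId: string }>();
const base = pino();

function log() {
  return base.child({ traceId: als.getStore()?.traceId ?? "none" });
}

async function handleOrder(orderId: string): Promise<void> {
  log().info({ orderId }, "order received");
  // ... work ...
  log().info({ orderId, step: "processed" }, "order processed");
}

// Wrap each unit of work (HTTP handler, queue consumer, cron run) so
// everything it awaits shares one trace ID.
function withTrace<T>(fn: () => Promise<T>): Promise<T> {
  return als.run({ traceId: randomUUID() }, fn);
}

void withTrace(() => handleOrder("ord_123"));
```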
A consulting war story about a PostgreSQL-to-MongoDB migration that went sideways, what we missed in planning, and the uncomfortable lessons about knowing when not to migrate.