A client moved their reads to database replicas for performance. The latency numbers looked great — until customers started getting charged twice and inventory counts drifted from reality.
A client's payment provider was sending webhook notifications correctly. Their system acknowledged every one. And then quietly threw most of them away.
A client was confident about how their services talked to each other. Then we instrumented the system with OpenTelemetry and found out what was actually happening.
A payment provider started responding in 8 seconds instead of 200ms. It wasn't an outage — their status page stayed green. But it took out our client's entire checkout flow because nobody had configured a timeout.
A startup founder built their MVP almost entirely with AI coding agents. It worked. Then they hired a team, and within two months nobody could ship anything. I got called in to figure out why.
A consulting engagement where we finally opened the cloud bill and found forgotten dev environments, runaway log storage, and a data pipeline reprocessing everything from scratch every night.
A 20-person team was running 14 microservices with three full-time engineers just keeping the infrastructure alive. We consolidated six of them into a modular monolith and cut their deploy time by 70%.
We ran load tests before a big product launch, got green across the board, and watched the system buckle under real traffic two days later. The tests weren't wrong — they just weren't testing reality.
A single slow database query triggered aggressive retries across four microservices. Within minutes, the entire order pipeline was down. Here's how we traced it and what we changed.
We added Redis to fix slow API responses. Instead we got stale data, thundering herds, and a system that was harder to debug than the original problem.
A consulting story about a platform engineering initiative that checked every box on paper but collected dust in practice — and the uncomfortable reasons why.
A consulting story about a minor field rename in an internal API that cascaded into a production incident, and what we put in place to stop it from happening again.
A consulting war story about a PostgreSQL-to-MongoDB migration that went sideways, what we missed in planning, and the uncomfortable lessons about knowing when not to migrate.