A 20-person team was running 14 microservices, with three full-time engineers just keeping the infrastructure alive. We consolidated six of the services into a modular monolith and cut the team's deploy time by 70%.
We ran load tests before a big product launch, got green across the board, and watched the system buckle under real traffic two days later. The tests weren't wrong — they just weren't testing reality.
A single slow database query triggered aggressive retries across four microservices. Within minutes, the entire order pipeline was down. Here's how we traced it and what we changed.
We added Redis to fix slow API responses. Instead, we got stale data, thundering herds, and a system that was harder to debug than the original problem.
A consulting story about a platform engineering initiative that checked every box on paper but collected dust in practice — and the uncomfortable reasons why.
A consulting story about a minor field rename in an internal API that cascaded into a production incident, and what we put in place to stop it from happening again.
A consulting war story about a PostgreSQL-to-MongoDB migration that went sideways, what we missed in planning, and the uncomfortable lessons about knowing when not to migrate.