I Audited a Codebase That Was Vibe-Coded in Two Weeks
A startup founder built their MVP almost entirely with AI coding agents. It worked. Then they hired a team, and within two months nobody could ship anything. I got called in to figure out why.
A founder reached out in April with a familiar-sounding problem. He'd built an MVP for a logistics SaaS product in about two weeks using Claude and Cursor. The app worked. It had paying customers. Early feedback was strong. So he hired three engineers to keep building.
Two months later, the team was averaging one meaningful feature per sprint. Bugs that should have taken hours were taking days. Two of the three engineers had privately told the founder they were considering leaving. Nobody could agree on how anything in the codebase was supposed to work.
He wanted me to do a code audit and figure out where things went sideways.
The codebase at first glance
The stack was reasonable: Next.js frontend, a Node.js API layer, PostgreSQL. About 38,000 lines of TypeScript across 260 files. On the surface it looked like any early-stage product — maybe a bit messy, but functional.
Then I started reading.
The first thing I noticed was that the codebase had no memory. Every feature felt like it was written by a different person who had never seen the rest of the code. There were three different patterns for making API calls. Two different ways to handle authentication state. A utils/ folder with 47 files, at least a dozen of which exported functions that nothing imported.
This is the signature of AI-generated code built across many separate sessions. Each session starts clean, solves the immediate problem competently, and has no awareness of what the last session produced. A human developer builds up a mental model of the codebase over time. They remember that there's already a formatCurrency function in utils/format.ts. An AI agent prompted with "add price display to the invoice page" will just write a new one.
The patterns that kept showing up
Inconsistent error handling. Some endpoints had meticulous try-catch blocks with structured error responses. Others would let exceptions bubble up unhandled. One particularly memorable controller had a try-catch around every single database call — seven of them — each returning a slightly different error shape. The AI had been asked to "make this more robust" and it obliged in the most literal way possible.
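One way to collapse several error shapes into one is to route every failure through a single mapping function. This is a minimal sketch, not the code from the audit: `AppError`, `toErrorResponse`, and the error codes are hypothetical names chosen for illustration.

```typescript
// Hypothetical names throughout; none of these come from the audited codebase.
type ErrorResponse = {
  status: number;
  body: { error: { code: string; message: string } };
};

// Errors the application raises on purpose carry a status and a stable code.
class AppError extends Error {
  readonly status: number;
  readonly code: string;

  constructor(status: number, code: string, message: string) {
    super(message);
    // Keeps `instanceof AppError` working when compiling to ES5.
    Object.setPrototypeOf(this, AppError.prototype);
    this.name = "AppError";
    this.status = status;
    this.code = code;
  }
}

// Every handler funnels its failures through this one function,
// so clients only ever see one error shape.
function toErrorResponse(err: unknown): ErrorResponse {
  if (err instanceof AppError) {
    return {
      status: err.status,
      body: { error: { code: err.code, message: err.message } },
    };
  }
  // Anything unexpected becomes a generic 500 without leaking internals.
  return {
    status: 500,
    body: { error: { code: "internal_error", message: "Unexpected error" } },
  };
}

const notFound = toErrorResponse(
  new AppError(404, "shipment_not_found", "Shipment 42 does not exist"),
);
const unexpected = toErrorResponse(new Error("connection reset"));
```

With something like this in place, a controller needs one try-catch at its boundary rather than one per database call.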
Schema as afterthought. The PostgreSQL schema had 23 tables, which was reasonable for the domain. But five of those tables had a metadata column of type jsonb that was doing the heavy lifting for what should have been proper relations. One of those JSON blobs contained nested arrays of objects with their own implicit foreign keys to other tables. No constraints, no indexes on the JSON paths. Queries against this data were slow and the application code that parsed it was brittle.
Duplicate logic everywhere. I found the same date-range filtering logic implemented in four separate places, each written differently. One used >= and < for the bounds. Another used >= and <=. A third used BETWEEN, which is inclusive on both ends and so behaves like the second, just in a different form. The fourth parsed dates in JavaScript and filtered in memory after fetching all rows. Between them, they produced different results for edge cases on day boundaries, which explained a category of bugs the team had been chasing for weeks.
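The day-boundary disagreement is easy to reproduce. The sketch below uses hypothetical names (`Row`, `inHalfOpenRange`, `inClosedRange`) and made-up timestamps. It filters in memory only to keep the example self-contained; the same difference appears between `>= ... <` and `BETWEEN` in SQL.

```typescript
// Hypothetical helpers; none of these names come from the audited codebase.
type Row = { id: number; createdAt: Date };

// Half-open range [start, end): includes start, excludes end.
function inHalfOpenRange(rows: Row[], start: Date, end: Date): Row[] {
  return rows.filter(
    (r) =>
      r.createdAt.getTime() >= start.getTime() &&
      r.createdAt.getTime() < end.getTime(),
  );
}

// Closed range [start, end]: the behavior of `>= AND <=` or SQL BETWEEN.
function inClosedRange(rows: Row[], start: Date, end: Date): Row[] {
  return rows.filter(
    (r) =>
      r.createdAt.getTime() >= start.getTime() &&
      r.createdAt.getTime() <= end.getTime(),
  );
}

const rows: Row[] = [
  { id: 1, createdAt: new Date("2024-03-01T00:00:00.000Z") },
  { id: 2, createdAt: new Date("2024-03-01T23:59:59.999Z") },
  { id: 3, createdAt: new Date("2024-03-02T00:00:00.000Z") }, // exactly midnight
];

// "All of March 1st", expressed as midnight to midnight.
const start = new Date("2024-03-01T00:00:00.000Z");
const end = new Date("2024-03-02T00:00:00.000Z");

const halfOpenIds = inHalfOpenRange(rows, start, end).map((r) => r.id); // [1, 2]
const closedIds = inClosedRange(rows, start, end).map((r) => r.id); // [1, 2, 3]
```

The row stamped exactly at midnight lands in March 1st under the closed range and in March 2nd under the half-open one. If two reports use different variants, that row is either counted twice or dropped, depending on which pair of queries you compare.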
Tests that tested nothing. There was a test suite — about 80 files. I was impressed until I looked inside. Most tests checked that a function "returns a result" without asserting what that result was. Several tests had the expected output hardcoded from whatever the function happened to return when the test was generated. The suite passed, always, regardless of what you changed. It was a green checkmark generator.
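To make the difference concrete, here is a sketch of a vacuous check next to a meaningful one. Everything in it is hypothetical: the `invoiceTotalCents` function, the deliberately broken variant standing in for a regression, and the check helpers, which are written as plain functions so the example does not depend on any particular test framework.

```typescript
// Hypothetical function under test; not taken from the audited codebase.
type Line = { unitCents: number; qty: number };
type TotalFn = (lines: Line[]) => number;

const invoiceTotalCents: TotalFn = (lines) =>
  lines.reduce((sum, l) => sum + l.unitCents * l.qty, 0);

// A deliberately broken variant, standing in for a regression: it ignores qty.
const brokenInvoiceTotalCents: TotalFn = (lines) =>
  lines.reduce((sum, l) => sum + l.unitCents, 0);

// Vacuous check: only verifies that *something* numeric came back.
function vacuousCheck(fn: TotalFn): boolean {
  const result = fn([{ unitCents: 250, qty: 4 }]);
  return typeof result === "number";
}

// Meaningful check: the expected value is worked out by hand from the inputs
// (250 * 4 + 199 * 1 = 1199), not copied from whatever the function returned.
function meaningfulCheck(fn: TotalFn): boolean {
  const result = fn([
    { unitCents: 250, qty: 4 },
    { unitCents: 199, qty: 1 },
  ]);
  return result === 1199;
}

const vacuousPassesOnBroken = vacuousCheck(brokenInvoiceTotalCents); // true
const meaningfulPassesOnBroken = meaningfulCheck(brokenInvoiceTotalCents); // false
const meaningfulPassesOnCorrect = meaningfulCheck(invoiceTotalCents); // true
```

The vacuous check stays green against the broken implementation, which is exactly the behavior of a suite that passes regardless of what you change.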
What the team was actually struggling with
The engineers weren't struggling because the code was bad in some abstract sense. They were struggling because they couldn't build a mental model of it. When every file solves the same problem differently, you can't develop intuitions about how the system works. You have to read every function from scratch every time. That kills velocity more than any single technical decision.
One of the engineers told me: "I spend half my time just figuring out which pattern I'm supposed to follow, and the other half discovering there are two more patterns I didn't know about."
What we actually did
We didn't rewrite the codebase. That almost never makes sense for a product with paying customers and a small team. Instead, we spent two weeks on what I'd call "alignment work."
First, we picked winners. For every case where multiple patterns existed, we chose one and documented it. One API call pattern. One error handling approach. One way to do date filtering. We put these in a lightweight architecture decision record that lived in the repo.
Second, we deleted dead code aggressively. About 6,000 lines — roughly 16% of the codebase — were unreachable. Removing them didn't change any behavior but dramatically reduced the surface area the team had to reason about.
Third, we extracted the jsonb columns into proper tables with foreign keys and indexes. This was the riskiest change and took most of the two weeks, but it fixed an entire class of data consistency bugs and made several slow queries fast.
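As a rough illustration of what extracting a blob into a table involves, the sketch below flattens one hypothetical `metadata` value into rows for a child table and sets aside entries whose implicit foreign key points at nothing. The shapes are assumptions made for the example (a `stops` array with a `warehouseId` per entry); the article does not describe the real blobs, and the actual migration would also need SQL for the new table, its constraints, and its indexes.

```typescript
// All shapes and names here are hypothetical, chosen only for illustration.
type ShipmentRecord = { id: number; metadata: unknown };
type StopRow = { shipmentId: number; warehouseId: number; sequence: number };

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null;
}

// Turns metadata.stops into rows for a child table. Entries that are malformed
// or reference an unknown warehouse are returned separately for manual review
// instead of being silently dropped or inserted.
function extractStops(
  shipment: ShipmentRecord,
  knownWarehouseIds: Set<number>,
): { rows: StopRow[]; rejected: unknown[] } {
  const rows: StopRow[] = [];
  const rejected: unknown[] = [];

  const stops = isRecord(shipment.metadata)
    ? shipment.metadata["stops"]
    : undefined;
  if (!Array.isArray(stops)) return { rows, rejected };

  for (let i = 0; i < stops.length; i++) {
    const stop: unknown = stops[i];
    const warehouseId = isRecord(stop) ? stop["warehouseId"] : undefined;
    if (typeof warehouseId === "number" && knownWarehouseIds.has(warehouseId)) {
      // The array position becomes an explicit column.
      rows.push({ shipmentId: shipment.id, warehouseId, sequence: i });
    } else {
      rejected.push(stop);
    }
  }
  return { rows, rejected };
}

const extracted = extractStops(
  {
    id: 7,
    metadata: {
      stops: [{ warehouseId: 1 }, { warehouseId: 99 }, { warehouseId: "2" }],
    },
  },
  new Set([1, 2]),
);
```

Here only the first stop becomes a row. The second references a warehouse that does not exist and the third stores its id as a string, which are the kinds of inconsistency that a foreign key constraint would have rejected at write time.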
We didn't touch the test suite. I told the founder it needed a complete rethink, but that was a separate project. In the short term, the team started writing meaningful tests for new code and for anything they touched during bug fixes.
The uncomfortable takeaway
Vibe coding works for getting from zero to something. The founder built a real product, got real customers, and validated a real market — in two weeks. That's genuinely impressive and I'm not going to pretend it doesn't matter.
But there's a phase transition that nobody talks about. The moment you add a second person to the codebase, the rules change completely. Code that one person can hold in their head becomes a maze for two people. Conventions that were implicit become invisible. The AI agent doesn't leave behind the reasoning for its choices, and the person who prompted it probably doesn't remember why they accepted that particular suggestion three weeks ago.
I don't think the answer is "don't use AI to build prototypes." That ship has sailed and honestly the productivity gains are real. The answer might be closer to: treat vibe-coded output the way you'd treat a spike or a proof of concept. It proved the idea works. Now build the thing that a team can maintain — and use the prototype as a spec, not a foundation.
Or maybe there's a middle ground I haven't found yet. How are you handling the handoff from AI-built prototype to team-maintained product?