We Let AI Review Every Pull Request for Three Months. Here's What It Actually Caught.

AI code review tools are everywhere now. After rolling one out on a real project, I learned more about our team than about the tool itself.


Earlier this year I helped a client team of nine developers adopt an AI code review tool on their main repository. The pitch was straightforward: faster reviews, fewer bugs reaching staging, less burden on senior engineers who were drowning in PR queues. Three months later, the tool had flagged over 2,000 issues across roughly 400 pull requests. Some of those flags were genuinely useful. A surprising number were not. And the most interesting outcomes had nothing to do with the tool's accuracy at all.

The setup

The project was a mid-sized TypeScript backend — about 120,000 lines of code, a mix of REST endpoints and event-driven processing. The team was shipping two to three PRs per developer per week. Code reviews were a bottleneck. Senior engineers were spending close to 40% of their time reviewing other people's code, and the median time from "PR opened" to "first human review" was 19 hours.

We integrated the AI reviewer to comment on every PR automatically before a human looked at it. The idea wasn't to replace human review — it was to handle the low-hanging fruit so humans could focus on architecture and logic.

What it genuinely caught

The tool was excellent at a specific category of issues: things that are mechanically detectable but tedious for humans to notice consistently.

  • Unused imports and dead code paths.
  • Inconsistent error handling — one endpoint wrapping errors in a custom AppError class while another threw raw strings.
  • Missing await keywords on async calls that wouldn't fail immediately but would cause subtle race conditions down the line.

// The AI caught this pattern repeatedly — fire-and-forget async
// calls that should have been awaited
function processOrder(order: Order) {
  validateInventory(order.items); // returns Promise<void> — oops, never awaited
  chargePayment(order.paymentMethod, order.total); // also async, also fire-and-forget
  return { status: 'processing' };
}
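The fix is mechanical once the pattern is spotted: make the function async and await both calls, so a failed validation or charge surfaces here rather than as an unhandled rejection later. A minimal sketch, with the Order type and the two stubs filled in as assumptions:

```typescript
// Hypothetical types and stubs, for illustration only
interface Order {
  items: string[];
  paymentMethod: string;
  total: number;
}

async function validateInventory(items: string[]): Promise<void> {
  // stub: would check stock levels
}

async function chargePayment(method: string, total: number): Promise<void> {
  // stub: would call the payment provider
}

// Awaiting both calls means errors propagate to the caller instead of
// racing ahead and reporting 'processing' on an order that never charged
async function processOrder(order: Order): Promise<{ status: string }> {
  await validateInventory(order.items);
  await chargePayment(order.paymentMethod, order.total);
  return { status: 'processing' };
}
```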

It also caught a real security issue in week two: a logging statement that included raw request bodies, which occasionally contained authentication tokens. A human reviewer might have caught that too, but it had slipped through twice before in similar endpoints.
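The remediation for that class of leak is simple enough that it's worth doing centrally rather than per-endpoint. A minimal sketch of redacting known-sensitive fields before a body ever reaches the logger — the key names here are assumptions, not the client's actual payload shape:

```typescript
// Keys we never want in logs — adjust to your own payloads
const SENSITIVE_KEYS = new Set(['authorization', 'token', 'password']);

// Returns a shallow copy safe to pass to a logger; sensitive
// values are replaced, everything else passes through untouched
function redact(body: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(body)) {
    safe[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : value;
  }
  return safe;
}
```

Logging `redact(req.body)` instead of `req.body` everywhere turns a recurring human-review catch into a non-issue.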

Over the three months, I'd estimate about 15% of the AI's comments were things that genuinely improved the code and that humans had missed or would likely miss.

What it got wrong

The other 85% ranged from "technically correct but not useful" to "actively misleading."

The tool loved suggesting more defensive code. It wanted null checks on values that TypeScript's strict mode already guaranteed couldn't be null. It flagged database queries that "might be slow" based on superficial pattern matching, without understanding that the table had 200 rows and an index on every queried column. It suggested extracting three-line helper functions that were used exactly once.
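The null-check category looked like this (names hypothetical). With `"strict": true` in tsconfig.json, the compiler already guarantees a non-nullable parameter can't be null at the call site, so the guard the tool kept suggesting is dead weight:

```typescript
// Under strict mode, `user` has type string — not string | null —
// so the compiler rejects any caller passing null or undefined
function greet(user: string): string {
  // The AI's suggested guard, redundant under strict null checks:
  // if (user === null || user === undefined) return 'Hello, stranger';
  return `Hello, ${user}`;
}
```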

The worst category was what I started calling "style drift." The AI had opinions about code organization that conflicted with the team's existing conventions. It wanted utility functions in a utils/ directory when the team organized by domain. It suggested barrel exports when the team had explicitly decided against them after a painful circular dependency incident six months prior.

Warning

An AI reviewer doesn't know your team's history. It can't distinguish between a convention and an accident. If your team made a deliberate architectural decision, the tool will still flag deviations from whatever pattern it considers standard.

The second-order effects nobody predicted

Here's what I didn't expect: the tool changed how people wrote code before submitting it.

Junior developers started self-reviewing against the AI's likely complaints. That sounds good — and partly it was. They caught their own unused imports and inconsistent patterns. But they also started writing overly defensive code to avoid AI comments. Null checks everywhere. Try-catch blocks wrapping code that couldn't throw. One developer told me he spent twenty minutes restructuring a perfectly readable function because "the bot would probably complain about the nesting depth."
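One shape this took, reconstructed from memory with hypothetical names: a try-catch and a null guard wrapped around arithmetic that cannot throw and values the type system already constrains, next to the plain version the team actually wanted:

```typescript
// Written to preempt the bot: redundant guard, unreachable catch
function totalDefensive(prices: number[]): number {
  try {
    let sum = 0;
    for (const p of prices) {
      if (p !== null && p !== undefined) { // redundant: p is already a number
        sum += p;
      }
    }
    return sum;
  } catch {
    return 0; // unreachable: nothing above can throw
  }
}

// What a human reviewer would have been happy with
function total(prices: number[]): number {
  return prices.reduce((sum, p) => sum + p, 0);
}
```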

The AI reviewer created a second audience for every PR, and that audience was more demanding and less nuanced than any human. Developers optimized for the tool's preferences, not for readability.

On the senior side, something else happened. Review quality from humans actually improved. With the mechanical stuff handled, senior reviewers stopped spending their time on formatting and naming and started leaving comments about data flow, error recovery strategies, and domain modeling. The median length of human review comments went up by about 30%, and they shifted from "rename this variable" to "this approach won't handle the retry case correctly."

The numbers, honestly

After three months:

  • PRs with at least one human-missed issue caught by AI: 23%
  • PRs where AI comments were all noise: 41%
  • Median time to first human review: down from 19 hours to 11 hours
  • Developer satisfaction with the tool (anonymous survey): 6.2 out of 10

That satisfaction score tells the story. Nobody hated it. Nobody loved it. It was a mild net positive wrapped in a significant amount of noise.

What I'd do differently

If I set this up again, I'd do two things from the start.

First, spend a day configuring the tool's rules before turning it on. Most AI review tools let you suppress categories of feedback. We didn't do this upfront because we wanted to "see what it finds." What it found was a thousand style opinions that poisoned the team's first impression.
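To make that concrete: most of these tools accept a rules file that suppresses whole feedback categories. The shape below is hypothetical — key names vary by tool — but it shows the kind of day-one triage I mean:

```yaml
# Hypothetical rule config — category names will differ per tool
review:
  suppress:
    - style/naming            # the team has its own conventions
    - style/file-organization # we organize by domain, not by utils/
    - perf/speculative        # "might be slow" guesses without data
  keep:
    - correctness/async       # missing awaits
    - security/logging        # sensitive data in logs
```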

Second, I'd frame it differently to the team. We introduced it as "an extra reviewer." That set expectations wrong. A better framing: it's a linter with opinions. Treat its output like you'd treat ESLint warnings — worth a glance, not worth agonizing over.

Note

The biggest ROI wasn't in what the AI caught — it was in freeing senior engineers to review what actually matters. If your review bottleneck is volume, not quality, that alone might justify the tool.

The uncomfortable question

The part I'm still chewing on: if 41% of PRs got nothing but noise from the AI, and developers started contorting their code to satisfy it, is the net effect really positive? The review time improvement was real. The security catch was real. But there's a cost to adding a voice to every conversation that's right just often enough that you can't ignore it.

I don't have a clean answer. The team decided to keep the tool but with heavy rule customization and a standing agreement that "the bot said so" is never a valid reason to change code. That feels about right — but I wonder how long that cultural norm holds once I'm not around to reinforce it.