More PRs, Slower Delivery: The AI Agent Throughput Trap

A client's team tripled their PR output with coding agents. Six weeks later, their cycle time had doubled and nobody could explain why.

In February I started an engagement with a mid-size fintech company — about 40 engineers across five teams. They'd gone all-in on AI coding agents three months prior. Management was thrilled. The numbers looked fantastic: PR volume up 180%, lines of code per developer per week nearly tripled. Engineering leadership had slides about it.

The product team told a different story. Features that used to take a sprint were stretching into three. Releases kept slipping. The backlog of "ready for review" items had grown so long that developers joked about it in standups the way you joke about weather — constant, unavoidable, not worth fighting.

I was brought in to figure out the gap between the metrics and the reality.

Measuring the Wrong Thing

The first thing I did was pull their cycle time data from LinearB. The picture was stark. Median time from first commit to merge had gone from 1.8 days to 4.3 days. But the time spent actually writing code had dropped — from about 6 hours to under 2. The coding was faster. Everything around it was slower.
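For reference, cycle time here is just the wall-clock gap between first commit and merge. A minimal sketch of the median calculation, with invented timestamps standing in for what a tool like LinearB reports:

```python
from datetime import datetime
from statistics import median

# Hypothetical PR records: (first commit, merged) timestamps.
# In practice these come from your git host or delivery-metrics tool.
prs = [
    ("2025-02-03T09:15", "2025-02-07T16:40"),
    ("2025-02-04T11:00", "2025-02-08T10:05"),
    ("2025-02-05T14:30", "2025-02-09T09:20"),
]

def cycle_days(first_commit: str, merged_at: str) -> float:
    """Days elapsed from first commit to merge for one PR."""
    fmt = "%Y-%m-%dT%H:%M"
    delta = datetime.strptime(merged_at, fmt) - datetime.strptime(first_commit, fmt)
    return delta.total_seconds() / 86400

print(median(cycle_days(opened, merged) for opened, merged in prs))
```

The useful part is splitting that same window into coding time versus waiting-for-review time, which is exactly where this team's number had inverted.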

The bottleneck was review.

Each developer was now producing two to three PRs per day where they'd previously done one every day or two. But the team's review capacity hadn't changed. You still had the same humans reading the same diffs, except now each diff was bigger, there were more of them, and the code was subtly different from what those reviewers expected.

One senior engineer put it bluntly: "I spend my entire morning reviewing AI-generated code that almost works. The bugs aren't obvious — they're the kind where the logic is 95% right and the last 5% requires me to understand the full business context to catch."

The 95% Correctness Problem

This is what made it so insidious. The code wasn't bad in an obvious way. It compiled, it passed the existing tests, it mostly did the right thing. But "mostly" in a payments system means occasionally charging someone the wrong amount or dropping a webhook event.

Over two days, I sat with three different developers while they did code reviews. The pattern was consistent. Agent-generated PRs had:

  • Correct structure and reasonable naming
  • Passing unit tests (the agent wrote those too)
  • Subtle misunderstandings of domain rules that lived in tribal knowledge, not in code

One PR handled refund calculations. The logic was clean. It also didn't account for a tax clawback rule specific to three US states that the team had implemented as a special case eighteen months ago. The agent didn't know about it because the rule lived in a different service. The tests passed because the test fixtures didn't include those states.

Warning

When your tests and your agent share the same blind spots, the feedback loop can't catch domain errors. The agent writes code that passes agent-written tests — that's circular validation, not quality assurance.
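To make the circularity concrete, here is a toy sketch of that refund scenario. Everything in it (the states, the amounts, the function names) is invented for illustration:

```python
# Invented example: agent-written refund logic plus the agent's own test.
CLAWBACK_STATES = {"NJ", "CT", "PA"}  # placeholder states, not the client's

def refund_amount(charge: float, tax: float, state: str) -> float:
    """Agent's version: refund charge plus tax, in every state."""
    return charge + tax

def refund_amount_correct(charge: float, tax: float, state: str) -> float:
    """Domain-correct version: clawback states do not refund the tax."""
    return charge if state in CLAWBACK_STATES else charge + tax

def test_refund():
    # Agent-written fixtures: no clawback state appears,
    # so the buggy version validates itself.
    assert refund_amount(100.0, 8.0, "CA") == 108.0
    assert refund_amount(50.0, 4.0, "TX") == 54.0

test_refund()  # green, despite the domain bug
```

The suite is green and the diff looks clean; only a reviewer who already knows the clawback rule would flag it.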

Code Churn Tells the Real Story

I asked the team to pull their code churn metrics. Churn — the percentage of code that gets rewritten within two weeks of being merged — had climbed from 3% to nearly 9%. That meant almost one in ten lines merged was getting revised shortly after. Not refactored in a healthy way. Revised because it was wrong or incomplete.

The math is unforgiving. If you triple your PR output but 9% of that code needs rework, and each rework cycle involves another review, you're creating a compounding queue. More code in, more fixes needed, more reviews stacked up, slower cycle time for everything.
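A back-of-the-envelope queue model shows the compounding. The numbers below are illustrative, not the client's actual figures:

```python
# Minimal review-queue model: each merged PR has probability `rework`
# of coming back as another PR that needs another review.
def review_backlog(days: int, prs_per_day: float, capacity: float, rework: float) -> float:
    """Size of the review queue after `days` of constant inflow."""
    queue = 0.0
    for _ in range(days):
        queue += prs_per_day           # new PRs opened today
        reviewed = min(queue, capacity)
        queue -= reviewed              # reviews completed today
        queue += reviewed * rework     # a fraction comes back for rework
    return queue

# Pre-agent: inflow under capacity, queue stays near zero.
print(review_backlog(days=30, prs_per_day=5, capacity=6, rework=0.03))
# Post-agent: inflow tripled, capacity unchanged, churn at 9%.
print(review_backlog(days=30, prs_per_day=15, capacity=6, rework=0.09))
```

With inflow under capacity the queue settles near zero; triple the inflow against the same capacity and it grows linearly, day after day, with no mechanism to drain.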

The team was running faster and falling behind.

What We Changed

We didn't roll back the AI tools. That would've been the wrong lesson. Instead we restructured how the team used them.

Review-first capacity planning. We made review load a first-class constraint. Each developer was expected to open no more PRs than the team could review that day. Sounds obvious, but when agents make it effortless to generate PRs, you need explicit limits or the queue grows without friction.
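The limit itself can be as simple as a gate in CI or a bot comment. A sketch of the rule, with the capacity numbers invented:

```python
# Invented threshold: assume a reviewer can absorb ~3 meaningful reviews a day.
def may_open_pr(open_unreviewed: int, reviewers: int,
                reviews_per_reviewer_per_day: int = 3) -> bool:
    """Gate new PRs on whether today's review capacity can absorb them."""
    capacity = reviewers * reviews_per_reviewer_per_day
    return open_unreviewed < capacity
```

Four reviewers give a daily budget of twelve; the thirteenth unreviewed PR waits until the queue drains.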

Smaller, scoped agent tasks. Instead of pointing an agent at a Jira ticket and letting it generate a 400-line PR, developers started breaking work into chunks where the agent handled the mechanical parts — boilerplate, test scaffolding, type definitions — and the developer wrote the business logic by hand. Hybrid, not autonomous.

Domain context files. The team created lightweight markdown documents capturing the tribal knowledge that agents kept missing. Tax rules, retry policies, edge cases in the billing flow. These got fed into agent context. It didn't eliminate domain errors, but it cut them roughly in half over the following month.
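The files themselves were nothing fancy. A sketch of the shape (contents invented, not the client's actual rules):

```markdown
## billing-domain-context.md (fed into the agent's context)

### Tax
- Refunds in certain US states exclude the tax portion (clawback rule;
  the special case lives in the tax service, not in billing).

### Retries
- Webhook deliveries retry with exponential backoff; handlers must be
  idempotent on the event ID.

### Billing edge cases
- Proration is calendar-day based, not 30-day based.
```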

Churn as a health metric. We added code churn to the team's weekly dashboard, right next to cycle time. When churn crept above 5%, it triggered a conversation about whether the team was leaning too hard on agents for complex work.
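The dashboard check is trivial once you fix the window. A sketch with invented weekly numbers:

```python
# Churn: share of merged lines rewritten within two weeks of merging.
def churn_pct(lines_merged: int, lines_revised_within_14d: int) -> float:
    return 100.0 * lines_revised_within_14d / lines_merged

ALERT_THRESHOLD = 5.0  # the team's trigger for a conversation

week = churn_pct(lines_merged=12_000, lines_revised_within_14d=1_050)
if week > ALERT_THRESHOLD:
    print(f"churn {week:.1f}% is above {ALERT_THRESHOLD}%: review agent usage")
```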

The Uncomfortable Takeaway

After six weeks of adjustments, cycle time came back down to about 2.2 days — not quite as fast as pre-agent, but close, and with higher throughput. The team was delivering more features per sprint than before agents, but only about 40% more. Not the 180% the PR metrics had implied.

Forty percent is still meaningful. But it's a long way from the tripling that the dashboard suggested, and it only materialized after the team deliberately slowed down the part that felt most productive.

I think there's a pattern here that extends beyond this one team. AI coding agents are genuinely good at producing code. The bottleneck was never code production — it was code understanding, code review, and domain knowledge transfer. Making the fast part faster doesn't help when the slow part is the constraint. It's the Theory of Constraints applied to software delivery, and it shouldn't surprise anyone, but the dashboards make it easy to miss.

What's your team's actual constraint right now — and are your AI tools making it better or just making everything else louder?