Code Reviews are Broken and AI Can’t Fix Them. What Now?

Key Takeaways

- Code review is structurally broken. AI generates 70% more issues per PR, PRs are 154% longer, and human reviewers lose effectiveness past 400 lines. The math doesn't work.

- AI-powered code review tools aren't the fix. Even the best AI reviewers miss nearly half of bugs — so more AI code plus imperfect review means more bugs in production.

- Automated E2E testing is now a critical quality gate. It's the only layer that catches what AI is most prone to breaking: integration failures, workflow errors, and silent data corruption.

It seems like some core aspect of software engineering is declared broken or dead every six months. The sprint is broken. Standups are dead. Junior engineers are extinct. You learn to ignore the headlines. But every once in a while, one of these declarations turns out to be true.

This is one of those situations.

It shouldn’t come as a surprise that code reviews no longer work as they were intended. With AI coding agents, hundreds or thousands more lines of code each week get sent to the same number of reviewers. Inevitably, some AI slop slips through. The result has been a 23.5% increase in incidents per pull request and a 30% increase in change failure rates.

The verification crisis for AI-generated code is not new. But for the last year or so, there’s been an overwhelming amount of content published claiming AI code reviews are the solution. Code review tools like CodeRabbit, Greptile, Bugbot, Claude and others offer automated reviews as a first pass and free up reviewers to find more complex issues. But they don’t replace an engineer’s understanding of your team’s conventions and repo.

They’re helpful. But they only improve code reviews. They don’t fix them. Industry benchmarks show that even the best code review tools only catch around 50% of issues. That still translates into more bugs slipping into production than ever before.

Which brings us to the uncomfortable conclusion that many in the industry haven’t yet reckoned with: if code review — human or AI-assisted — can no longer be relied upon as a key quality gate, other quality gates need to be reinforced.

The obvious solution to the verification crisis we find ourselves in is to increase other forms of verification. That can mean investing more in testing and other forms of QA. One particularly effective quality gate that teams should be doubling down on is automated end-to-end (E2E) tests.

E2E tests are the only gate that operates at the right layer to catch the kinds of errors studies show AI is more likely to make, like integration errors that break user flows or bugs in multi-step workflows. And if you want to keep bugs out of production in the era of AI slop, you’ll need more fast, automated ways to test for them and catch them.

The numbers behind the collapse of code reviews

The strain on code reviews isn't just anecdotal; the data makes a damning case.

70% more issues per PR

A CodeRabbit study found that AI-generated code produces approximately 1.7x more issues than human-written code. Not just more bugs, but also worse bugs. The study found around 1.4x to 1.7x more critical and major issues: logic and correctness errors are 75% more common in AI-generated PRs, security vulnerabilities increase 1.5x to 2x, and performance inefficiencies appear nearly 8x more often.

20% more PRs

Now, layer in volume. According to Cortex's Engineering in the Age of AI report, pull requests per author are up 20% year-over-year. More bugs per PR multiplied by more PRs in the queue translates into significantly more work for human reviewers.
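To put rough numbers on that multiplication, here is a back-of-the-envelope sketch. It assumes every PR is AI-generated and that the CodeRabbit and Cortex figures, which come from separate studies, compound independently. Neither study makes that claim, so treat the result as an illustration rather than a measurement. The constant and function names are made up for this example.

```python
# Back-of-the-envelope estimate of review load, using the figures cited above.
# Assumption: the two multipliers come from different studies and are treated
# here as independent, and every PR is assumed to be AI-generated.

ISSUES_PER_PR_MULTIPLIER = 1.7   # AI-generated vs. human-written code (CodeRabbit)
PRS_PER_AUTHOR_MULTIPLIER = 1.2  # 20% more PRs per author (Cortex)


def relative_issue_load(issues_multiplier: float, pr_multiplier: float) -> float:
    """Return the issue volume reaching review, relative to a baseline of 1.0."""
    return issues_multiplier * pr_multiplier


load = relative_issue_load(ISSUES_PER_PR_MULTIPLIER, PRS_PER_AUTHOR_MULTIPLIER)
print(f"Relative issue load per author: {load:.2f}x")
```

Under those assumptions, each author sends roughly twice as many issues into the review queue as before, while the number of reviewers stays the same.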

PRs are 154% longer

AI-generated PRs are also getting larger. One analysis found that teams with high AI adoption saw PR size grow 154%, with review time up 91%. More code, bigger diffs, more bugs to find.

Human review quality degrades after 400 lines

What makes this especially hard to solve is that the limits of code review aren't a matter of effort; they're a matter of biology. One study spent 10 months analyzing 2,500 reviews covering 3.2 million lines of code at Cisco Systems and found that reviewers spot defects most effectively when examining 200 to 400 lines of code in sessions of no more than 60 minutes. Once reviews grow past 400 lines or 60 minutes, defect detection starts to drop off. The brain simply can't hold more than that at once. At 200 lines, review effectiveness sits at 80–90%, but it drops below 50% once a PR exceeds 1,000 lines.
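One way a team might act on those thresholds is a simple size check on each PR. The sketch below is illustrative only: the 400-line and 1,000-line cutoffs come from the figures above, while the function name and the three labels are hypothetical.

```python
# Thresholds taken from the review-effectiveness figures cited above.
# The function name and the returned labels are hypothetical.
EFFECTIVE_MAX_LINES = 400    # upper end of the range where reviewers do best
UNRELIABLE_MIN_LINES = 1000  # beyond this, effectiveness is reported below 50%


def review_risk(lines_changed: int) -> str:
    """Classify a PR by how well a human reviewer is likely to cope with its size."""
    if lines_changed < 0:
        raise ValueError("lines_changed must be non-negative")
    if lines_changed <= EFFECTIVE_MAX_LINES:
        return "ok"
    if lines_changed <= UNRELIABLE_MIN_LINES:
        return "split-recommended"
    return "review-unreliable"


for size in (200, 650, 2400):
    print(size, review_risk(size))
```

A check like this does not make large AI-generated diffs easier to review. It only makes visible how often PRs land outside the range where human review works well.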

AI code reviewers find less than 52% of bugs

Independent industry benchmarks like With Martian show that no tool finds more than 52% of the bugs in the PRs it reviews. Many of the added bugs in AI-generated PRs are therefore left for overwhelmed humans to find, and many are bound to slip through.

Why comprehensive, automated E2E testing

With code review increasingly failing within AI-driven development teams, it’s important to invest more in building quality layers that can serve as additional lines of defense. That's where automated end-to-end testing comes in.

There are two distinct categories of AI-generated bugs making it to production right now:

- Bugs that a thorough code review should have caught but didn't, because the reviewer was fatigued, the diff was too long, or the PR was the fifteenth one they'd looked at that week.

- Bugs that code review could not catch, or was unlikely to catch: the ones that only surface when you actually run the application end-to-end, follow a real user flow, and watch what happens.

The second group is also growing with AI-generated code, which is another reason end-to-end testing is more important than ever. A robust automated end-to-end testing suite can help find both types of bugs.

Common AI-generated bugs E2E testing can catch

AI-generated code is notorious for passing reviews at the surface level but not working in production. That’s why E2E tests are critical.

Here are some bugs that AI is more likely to generate and that E2E tests can catch quickly.

Integration and cross-service failures

AI-generated code is particularly prone to these types of bugs because of misaligned assumptions between human-written components and AI-generated logic. Two pieces of code can each pass review in isolation and break when they interact. Code review evaluates files, not call boundaries, so it can miss these failures. The resulting problems, such as API contract mismatches, wrong data format assumptions, and misconfigured service connections, only surface when the full stack is actually running.
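A minimal sketch of that kind of mismatch, using two hypothetical components invented for this example (a billing function and a receipt formatter). Each looks reasonable on its own, and the problem only shows up when a check runs them together.

```python
# Hypothetical components: each one is plausible in isolation.

def create_invoice(order_total_cents: int) -> dict:
    """Billing component: reports the total as integer cents."""
    return {"total_cents": order_total_cents, "currency": "USD"}


def format_receipt(invoice: dict) -> str:
    """Receipt component: assumes a dollar amount under the key 'total'."""
    total = invoice.get("total", 0.0)  # silently falls back to 0.0 if the key is missing
    return f"Total: ${total:.2f}"


def checkout_flow(order_total_cents: int) -> str:
    """The boundary an end-to-end check exercises: both components together."""
    return format_receipt(create_invoice(order_total_cents))


def check_receipt_shows_charged_amount() -> bool:
    """E2E-style check: a $19.99 order should produce a $19.99 receipt."""
    return checkout_flow(1999) == "Total: $19.99"


print(checkout_flow(1999))
print("check passed:", check_receipt_shows_charged_amount())
```

Neither function raises an error, and a unit test for each could pass against its own assumed input. Only the check that crosses the boundary reports that the customer sees a total of zero.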

Multi-step workflow and state bypass bugs

These bugs are high impact in production and hard for code reviewers to spot. In a multi-step checkout, payment is expected before order confirmation. If the confirmation endpoint verifies only that the user has a valid session, and not that payment was processed, the authentication check passes but the workflow logic fails. A reviewer reading the confirmation endpoint code in isolation sees nothing wrong. The bug only surfaces when you actually run the whole flow. E2E tests protect these flows because they exercise the system the way real users do, one step after another.
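The checkout example above can be sketched with a toy in-memory service. Everything here is hypothetical (the class names, method names, and the order ID), and a real E2E test would drive the deployed application rather than a Python object. The point is the shape of the check: walk the flow, skip a step, and see whether the system lets you.

```python
class CheckoutService:
    """Toy in-memory stand-in for a checkout backend (hypothetical)."""

    def __init__(self):
        self.sessions = set()
        self.paid_orders = set()
        self.confirmed_orders = set()

    def login(self, user):
        self.sessions.add(user)

    def pay(self, user, order_id):
        if user not in self.sessions:
            raise PermissionError("not logged in")
        self.paid_orders.add(order_id)

    def confirm(self, user, order_id):
        # Bug: only authentication is checked, never whether the order was paid.
        if user not in self.sessions:
            raise PermissionError("not logged in")
        self.confirmed_orders.add(order_id)


class FixedCheckoutService(CheckoutService):
    """Same service with the workflow check added."""

    def confirm(self, user, order_id):
        if order_id not in self.paid_orders:
            raise RuntimeError("payment required before confirmation")
        super().confirm(user, order_id)


def e2e_confirm_requires_payment(service) -> bool:
    """Walk the flow as a user would, skipping payment. True means the flow is safe."""
    service.login("alice")
    try:
        service.confirm("alice", "order-1")
    except Exception:
        return True  # confirmation was refused, as it should be
    return "order-1" not in service.confirmed_orders


print("buggy service safe:", e2e_confirm_requires_payment(CheckoutService()))
print("fixed service safe:", e2e_confirm_requires_payment(FixedCheckoutService()))
```

The buggy `confirm` method reads fine in isolation, which is why a reviewer is likely to approve it. The flow-level check fails it immediately.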

Silent data loss

AI tends to generate handlers that process requests but don't actually write everything to the database correctly — wrong field mappings, missing foreign keys, or updates that affect the wrong record. The UI does what it’s supposed to and nobody knows data isn't being saved until a user reports it days later. Code review misses it because nothing is obviously wrong. Unit tests miss it because they mock the database layer. E2E tests running against a real database catch it.
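Here is a small sketch of a field-mapping bug of that kind, using Python's built-in sqlite3 module with an in-memory database as a stand-in for a real one. The table, the handler, and the payload keys are all hypothetical. The handler reports success, and only reading the row back reveals the loss.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with one existing profile row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE profiles (user_id INTEGER PRIMARY KEY, email TEXT, phone TEXT)"
    )
    conn.execute("INSERT INTO profiles VALUES (1, 'old@example.com', '555-0100')")
    conn.commit()
    return conn


def update_profile(conn: sqlite3.Connection, user_id: int, payload: dict) -> dict:
    """Hypothetical handler with a field-mapping bug."""
    conn.execute(
        "UPDATE profiles SET email = ?, phone = ? WHERE user_id = ?",
        # Bug: the client sends "phone", so this lookup returns None
        # and the stored phone number is overwritten with NULL.
        (payload.get("email"), payload.get("phone_number"), user_id),
    )
    conn.commit()
    return {"status": "ok"}  # the caller, and the UI, see success


def e2e_profile_update_persists(conn: sqlite3.Connection) -> bool:
    """Submit an update, then read the row back from the real database."""
    response = update_profile(
        conn, 1, {"email": "new@example.com", "phone": "555-0199"}
    )
    row = conn.execute(
        "SELECT email, phone FROM profiles WHERE user_id = 1"
    ).fetchone()
    return response["status"] == "ok" and row == ("new@example.com", "555-0199")


print("update persisted:", e2e_profile_update_persists(make_db()))
```

A unit test that mocks the database would only see that `execute` was called and that the response was `ok`. The read-back step is what exposes the missing value.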

State pollution between sessions or users

AI sometimes generates code that stores user-specific state in a shared scope: a module-level variable, a cache with a missing key namespace, or a singleton that wasn't designed for concurrent access. One user's session bleeds into another's. This is virtually undetectable in code review unless the reviewer is actively hunting for it, and unit tests almost never simulate concurrent users. E2E tests that run parallel sessions surface it.

Performance degradation that breaks user flows

According to one study, performance inefficiencies such as excessive I/O, N+1 queries, or unindexed lookups appear nearly 8x more often in AI-generated code. A checkout flow that works fine under normal conditions starts failing when the orders table query does a full scan instead of using an index.
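One way to catch the full-scan case before it reaches production load is to inspect the query plan as part of a test. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's sqlite3 module. The table, index, and helper names are hypothetical, and other databases expose query plans through different commands and output formats, so this is specific to SQLite.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """Return the detail strings from SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the detail text is the last column


def uses_full_scan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> bool:
    """True if any step of the plan scans a whole table instead of searching it."""
    return any(detail.startswith("SCAN") for detail in query_plan(conn, sql, params))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total_cents INTEGER)"
)
LOOKUP = "SELECT id, total_cents FROM orders WHERE customer_id = ?"

# Without an index on customer_id, the lookup has to scan the whole table.
scan_before = uses_full_scan(conn, LOOKUP, (42,))

# With the index in place, SQLite can search it instead.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
scan_after = uses_full_scan(conn, LOOKUP, (42,))

print("full scan before index:", scan_before)
print("full scan after index:", scan_after)
```

A plan check like this is cheap to run on every PR and does not need production-sized data, though it only covers the queries you point it at.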

The common thread across most of these is that they arise from the interaction between components, not from any single piece of code in isolation. Those are exactly the types of issues that code reviews struggle to see. While some of them can’t be found until the system is under production-level load, E2E tests are able to catch more of these issues than other types of testing.

Not having a robust E2E testing suite is costing you

The consequences of not having a rigorous and comprehensive E2E testing suite aren't theoretical. A survey of 200 senior site reliability and DevOps leaders found that 43% of AI-generated code changes still require manual debugging in live production environments. It’s clear that far too many bugs are slipping into production.

Without comprehensive E2E tests, the risk of every AI-generated PR is unknown. You're shipping code and hoping for the best, then scrambling when something breaks in production. The same survey also found that organizations require multiple redeployment cycles to verify a single AI-suggested fix and, according to Google's 2025 DORA report, a single redeploy cycle takes a day to a week on average. It’s far better for teams to find these issues earlier, at the testing phase, and fix them before they make it to production and eat up even more time and cycles.

The real costs of skipping comprehensive E2E testing include:

- The time it takes to diagnose and fix production incidents that those tests would have caught.

- Potential loss of current or future customers due to low uptime.

- Developer burnout from the extra stress outages and hotfixes cause.

Code reviews can’t be fixed. The answer is better QA.

Code review has always had a hidden assumption baked in: that the amount of code humans could write would stay roughly proportional to the amount other humans could meaningfully review. AI has quietly demolished that assumption. The tools that generate code have no cognitive limits, no fatigue, and no maximum throughput. The humans reviewing that code have all three. And when those two curves diverge sharply enough, the system doesn't degrade gradually — it breaks. And even AI-enabled code reviews can’t save it.

The answer is relying more on QA. That means comprehensive, automated E2E tests that run on every PR, cover every critical user workflow, and block releases the moment something breaks. When the quality gate is automated and fast, bug-free code ships at AI speed while buggy code gets caught before it ever reaches a user. That's the version of QA that scales with the output of agentic development.
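As a rough sketch of what "block releases the moment something breaks" can look like in practice, here is a minimal gate runner. The `run_gate` function and the workflow names are hypothetical, and each check stands in for a real E2E test that would drive the running application.

```python
from typing import Callable, Dict


def run_gate(checks: Dict[str, Callable[[], bool]]) -> int:
    """Run every check and return 0 if all pass, 1 otherwise (a CI-style exit code)."""
    failed = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            passed = False  # a crashing flow counts as a failure
        if not passed:
            failed.append(name)
    for name in failed:
        print(f"FAILED: {name}")
    return 1 if failed else 0


# Placeholder checks; real ones would exercise critical user workflows.
exit_code = run_gate({
    "checkout": lambda: True,
    "profile-update": lambda: False,
})
print("gate exit code:", exit_code)
```

In a CI job, passing that return value to `sys.exit` would fail the job when any critical workflow breaks. Whether that actually blocks the merge depends on how the pipeline and branch protections are configured, which is outside this sketch.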

Everything else is just wishful thinking and hoping the next PR isn't the one that takes down production.
