Every AI-Written PR Has 11 Bugs: How Many Does Your Current Test Suite Catch?

Key Takeaways

- Testing velocity mismatch: Agents ship dozens of PRs weekly, but QA infrastructure was built for human-speed development. The window between bug introduction and detection keeps widening.

- Novel failure modes: AI-generated bugs are different. They're often syntactically correct code that does the wrong thing. Existing tests weren't designed to catch these.

- The stakes are existential: Gartner predicts a 2,500% increase in defects by 2028 if teams don't solve testing. The cost: production outages, customer churn, and engineering burnout.

Testing in the agentic SDLC

It's the end of a two-week sprint. An impressive number of PRs were merged (thanks to the help of your AI coding agents). One problem: none of them were tested end to end.

The test suite, last updated six weeks ago, was written against an application that no longer resembles the one in production. Many of the tests fail or flake on every run because the UI has changed too much, the element IDs are different, and the API contracts have broken.

The rest of the PRs you’re shipping have no test coverage because they’re completely new features. Writing tests would take longer than the agent took to build the feature, so you hope for the best and ship without them.

This is the current failure mode of the agentic software development lifecycle. We’ve successfully automated the creation of software but are struggling to automate its verification.

Meanwhile, the data on the impact of AI-generated code is worrying:

- AI-generated code produces roughly 1.7x more issues than human-written code across every major quality category, including logic, security, maintainability, and performance.

- AI-generated PRs average nearly 11 issues each, compared to roughly six in human-written submissions, resulting in longer review cycles and a higher risk of defects reaching production. Given that the best AI code review products top out at finding roughly 50% of issues, many slip past code review.
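
As a back-of-the-envelope reading of those figures, the sketch below estimates how many issues per PR would get past review. It assumes the roughly 50% ceiling applies evenly to every issue, which is a simplification, and the function name is made up for illustration.

```python
def issues_escaping_review(issues_per_pr: float, review_catch_rate: float) -> float:
    """Expected issues per PR that code review does not catch.

    Assumes every issue is equally likely to be caught, which real
    review tools do not guarantee.
    """
    if not 0.0 <= review_catch_rate <= 1.0:
        raise ValueError("review_catch_rate must be between 0 and 1")
    return issues_per_pr * (1.0 - review_catch_rate)


# Figures quoted in the article: ~11 issues per AI-generated PR,
# ~6 per human-written PR, ~50% best-case review detection.
ai_escaped = issues_escaping_review(11, 0.5)
human_escaped = issues_escaping_review(6, 0.5)

print(f"AI-generated PR: {ai_escaped:.1f} issues escape review")
print(f"Human-written PR: {human_escaped:.1f} issues escape review")
```

Under these assumptions, between five and six issues per AI-generated PR reach the test suite as the next line of defense.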

Testing is the agentic SDLC's most critical unsolved problem. And it’s one that’s becoming riskier for teams to ignore.

This is the first article in a three-part series.

Why the agentic SDLC’s testing problem is so critical

The problem isn't just that agents ship untested code. It's that the conditions created by agentic development amplify the consequences of that untested code in ways that have no historical precedent.

Scale asymmetry: Bugs multiply at agent speed

Human engineering teams have always shipped bugs. What's new is the rate at which those bugs can now accumulate before anyone notices.

The average developer might open five to ten pull requests in a good week. A well-orchestrated agentic pipeline can open that many before lunch. PRs per author have increased 20% year over year, while incidents per pull request have risen 23.5% and change failure rates have climbed around 30%.

The traditional QA feedback loop was calibrated for human-speed development, and many teams were already struggling to keep up. Now suppose a bug was introduced on Tuesday and has gone unnoticed for a few days. In an agentic pipeline, dozens of PRs have shipped since Tuesday. Each one is potentially interacting with the bug or inheriting the bad assumption at the heart of it. The window between introduction and detection stretches while the surface area of impact grows.

Novel failure modes: Errors that don't look like errors

Agentic development doesn't just produce more bugs. It produces different kinds of bugs. These are categories of failure that existing testing infrastructure wasn't designed to detect.

The most insidious of these: code that is syntactically correct, passes all static checks, and implements a coherent-looking function that nonetheless does the wrong thing because the agent misunderstood the requirement or silently made an assumption that doesn't hold.
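
A minimal sketch of what this can look like, assuming a hypothetical requirement that orders of $50.00 or more ship free. The function names, threshold, and fee are invented for illustration. Both versions are valid, typed Python that a linter would accept, and both pass tests that only probe values well above or below the threshold.

```python
FREE_SHIPPING_THRESHOLD = 50.00  # hypothetical: orders of $50.00 or more ship free
FLAT_SHIPPING_FEE = 5.99         # hypothetical flat fee


def shipping_fee_as_generated(order_total: float) -> float:
    """Plausible agent output: reads ">= 50" in the requirement as "> 50"."""
    return 0.0 if order_total > FREE_SHIPPING_THRESHOLD else FLAT_SHIPPING_FEE


def shipping_fee_as_specified(order_total: float) -> float:
    """What the requirement actually asks for."""
    return 0.0 if order_total >= FREE_SHIPPING_THRESHOLD else FLAT_SHIPPING_FEE


# Tests that only check "clearly above" and "clearly below" cannot
# tell the two implementations apart.
for fee in (shipping_fee_as_generated, shipping_fee_as_specified):
    assert fee(80.00) == 0.0
    assert fee(20.00) == FLAT_SHIPPING_FEE

# A boundary test derived from the requirement exposes the difference.
assert shipping_fee_as_specified(50.00) == 0.0
assert shipping_fee_as_generated(50.00) == FLAT_SHIPPING_FEE  # customer charged in error
```

Nothing in the generated version looks like an error. Only a test written from the requirement, not from the code, catches it.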

The integration time bomb

Software is a system: a web of shared states, implicit contracts, dependencies, and emergent behaviors – including some that only manifest when components interact under load or in complex sequences.

When agents build features in parallel, across multiple sessions, without a persistent model of the full system they're contributing to, they make local decisions that can have global consequences. For example, an agent building a discount engine has no awareness of the agent that built the checkout flow two sprints ago. Each feature works in isolation. But together, they produce a buggy checkout path.
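
The discount and checkout scenario can be sketched as follows. Everything here is hypothetical (function names, coupon codes, amounts in cents). The checkout code carries an unstated assumption that a discount never exceeds the subtotal, and the discount engine, built later, has no way of knowing that.

```python
# Hypothetical checkout flow from an earlier sprint. Implicit contract,
# never written down: discount_cents is at most subtotal_cents.
def amount_to_charge(subtotal_cents: int, discount_cents: int) -> int:
    return subtotal_cents - discount_cents


# Hypothetical discount engine built later in a separate agent session.
# It stacks fixed-amount coupons and knows nothing about the contract above.
COUPONS = {"WELCOME10": 1000, "SPRING15": 1500}


def total_discount(codes: list[str]) -> int:
    return sum(COUPONS[code] for code in codes)


# Each feature passes its own unit tests.
assert amount_to_charge(5000, 1000) == 4000
assert total_discount(["WELCOME10", "SPRING15"]) == 2500


# Only a test that exercises both together reveals the problem.
def checkout(subtotal_cents: int, codes: list[str]) -> int:
    return amount_to_charge(subtotal_cents, total_discount(codes))


charged = checkout(2000, ["WELCOME10", "SPRING15"])
print(charged)  # -500: a $20.00 cart with $25.00 in coupons produces a negative charge
```

Neither function is wrong by its own tests. The defect lives in the interaction, which is the territory end-to-end tests are meant to cover.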

In an agentic codebase growing at velocity, the number of potential interaction surfaces between features grows faster than most testing regimes can track.

The stakes: What happens when the agentic SDLC's testing gap goes unfixed

The consequences of the coverage gap often compound silently over time. But when the gap finally becomes visible, it tends to do so in ways that are expensive, public, and time-consuming to address.

2,500% increase in failures by 2028: Get ready for unprecedented production failures

The first and most direct consequence is software that breaks, at a scale and frequency that existing reliability practices were never designed to absorb.

More changes ship faster, but each change is slightly more likely to break something. Left unmanaged, AI-generated code drives maintenance costs to 4x traditional levels by year two as technical debt compounds.

The 2025 production incident rate hints at what the future will look like. 2025 saw more outages and incidents than prior years, and while not every outage can be tied directly to AI, the correlation is difficult to dismiss. Gartner believes this is only the beginning and predicts that AI-driven development will increase software defects by 2,500% by 2028 if current trajectories hold.

Millions lost: Customer churn and revenue loss

A 2014 report by Gartner claimed that the average cost of downtime is $5,600 per minute. But even if your team avoids an outage, performance degradations and buggy user flows can also lead to financial losses. Amazon once claimed that even a few milliseconds' difference in latency would cost it tens of millions of dollars. While not every business operates at Amazon’s scale, performance losses add up.

If you’re a SaaS company, quality issues could even impact your retention metrics. Churn rates can increase by as much as 7% following a significant reliability or breach incident. For a company generating $100 million annually, that translates to $7 million in direct revenue loss, not including the cost of rebuilding the churned pipeline or investing in customer win-back programs. Reputational damage extends beyond current customers: news travels at the speed of social media, and a major incident can deter prospective customers for years.
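
To plug in your own numbers, the arithmetic behind these figures can be sketched as below. It reads the 7% churn figure as a share of annual revenue, which is how the $7 million example works out. The function names and the 30-minute outage are assumptions for illustration. Integer dollars are used so the results stay exact.

```python
DOWNTIME_COST_PER_MINUTE = 5_600  # dollars, the 2014 Gartner average quoted above


def downtime_cost_dollars(outage_minutes: int) -> int:
    """Direct cost of an outage at the quoted per-minute average."""
    return outage_minutes * DOWNTIME_COST_PER_MINUTE


def churn_revenue_loss_dollars(annual_revenue: int, churn_increase_percent: int) -> int:
    """Annual revenue lost if that percentage of revenue churns."""
    return annual_revenue * churn_increase_percent // 100


# The article's example: $100 million in annual revenue, 7% additional churn.
print(churn_revenue_loss_dollars(100_000_000, 7))  # 7000000

# A hypothetical 30-minute outage at the quoted average.
print(downtime_cost_dollars(30))  # 168000
```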

Engineering team burnout

The human cost of the testing gap doesn't show up in incident post-mortems, but it accumulates steadily. When a test suite can't be trusted, the burden of quality shifts back onto engineers as a constant, low-grade anxiety that never fully resolves. Every release carries uncertainty and every sprint adds more untested surface area.

These are the conditions that burn out engineering teams. Developers who joined to build things find themselves spending increasing portions of their time firefighting, investigating flaky tests, and manually verifying behavior that should have been covered automatically.

Agentic development was supposed to eliminate toil. But for teams without adequate testing processes and infrastructure, it just relocated that toil to another part of the SDLC.

What comes next

The verification gap described in this piece is not a temporary growing pain that teams will naturally work through as agentic development matures. It’s a structural problem that gets worse the longer it goes unaddressed.

Teams are reaching for solutions, and the most intuitive ones are already in wide use: asking the coding agent to write unit tests alongside the code, generating E2E scripts from the same agent that built the features, and even pointing computer-use agents at the rendered UI as a proxy for real user testing. The next article in this series examines why these approaches still let an unprecedented number of bugs through to production systems.

The third article describes a real solution: a purpose-built QA platform designed around the structural requirements of the agentic SDLC from the ground up. QA Wolf's agentic testing platform was built specifically for this problem, using lessons from running 100 million tests for clients like DoorDash, Drata, Grafana, and CodeRabbit.

The teams that navigate the verification crisis well will be the ones that treat testing as something core to their work, not an afterthought. The ones that don’t might just end up seeing that 2,500% increase in software defects that Gartner predicted.

Curious about our platform? Try it today!