Making sense of all the AI-powered QA tools

John Gluck
Kirk Nathanson
September 2, 2025

A new AI-powered tool for QA seems to pop up every week. Making sense of all the different options can be confusing. How do they work? What do they do? Will they fit into my development process? This article will help you make sense of the landscape.

Before we begin, it’s important to understand that unless the AI-powered QA tool is developing its own model from the ground up (a costly and extremely unlikely proposition), it’s making use of someone else’s underlying LLM from OpenAI, Anthropic, Google, etc.

Because existing tools are constantly developing new features, new tools are coming on the market, and other tools are going out of business, we’ve decided the most helpful thing would be to discuss the four main types you’ll find out there:

  1. Code-based QA Agents
  2. Codeless QA Agents
  3. IDE Co-pilots
  4. Session Recorders

We’ll describe each approach at a high level: how its application of LLM technology to the challenges of QA leads to different outcomes, its benefits and drawbacks, and how to identify it.

So let’s begin!

Code-based QA Agents

Examples: QA Wolf

Code-based QA Agents take the best parts of AI and combine them with the best parts of traditional testing. Users prompt the AI to create a test (“Add pants to cart and check out”), which the AI generates in deterministic and verifiable Playwright or Appium code. As such, Code-based QA Agents are considered the “gold standard” for AI-powered QA.

With Code-based QA Agents, including QA Wolf, you get the efficiency and accessibility of natural-language prompts along with true E2E tests that are…

  1. Deterministic. Each test contains a series of steps, executed sequentially, and ending with an expected outcome.
  2. Verifiable. You can validate that the test was executed as intended each time.
  3. Realistic. The tests interact with the front-end UI the same way a human behind a keyboard would.  
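
To make that concrete, here is roughly what a prompt like “Add pants to cart and check out” might come back as. The store URL, selectors, and confirmation text below are hypothetical placeholders; the point is that the output is an ordinary, reviewable Playwright test that runs the same way every time.

```typescript
import { test, expect } from '@playwright/test';

test('add pants to cart and check out', async ({ page }) => {
  // Hypothetical storefront; a real suite would use its own URL and labels.
  await page.goto('https://shop.example.com');

  // Add the product to the cart
  await page.getByRole('link', { name: 'Pants' }).click();
  await page.getByRole('button', { name: 'Add to cart' }).click();

  // Check out
  await page.getByRole('link', { name: 'Cart' }).click();
  await page.getByRole('button', { name: 'Check out' }).click();

  // Expected outcome: the order confirmation is rendered
  await expect(page.getByText('Thank you for your order')).toBeVisible();
});
```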

Self-healing is a fundamental feature of most AI-powered QA tools since test maintenance (investigating failures and updating tests after a UI change) is the most time-consuming and labor-intensive part of the QA lifecycle. As we will see below, self-healing means different things depending on the type of AI agent we’re talking about.

With Code-based QA Agents, self-healing means programmatically updating the Playwright or Appium test code. The AI identifies where a test is failing and makes the appropriate changes, such as new selectors or additional test steps.
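
As a hypothetical sketch (not any particular vendor’s actual output), suppose a redesign renames the checkout button and breaks the old selector. The agent’s repair lands directly in the Playwright code, so it shows up as an ordinary, reviewable diff:

```typescript
import { test, expect } from '@playwright/test';

test('check out after the button rename', async ({ page }) => {
  await page.goto('https://shop.example.com/cart');

  // Before the redesign this step was:
  //   await page.getByRole('button', { name: 'Check out' }).click();
  // The agent rewrites the selector in the test code itself, so the fix
  // is visible in version control like any other change:
  await page.getByRole('button', { name: 'Proceed to payment' }).click();

  // The assertion is untouched: the test still has to prove the workflow
  // completed, so a genuinely broken checkout can't quietly "pass".
  await expect(page.getByText('Thank you for your order')).toBeVisible();
});
```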

Since the Code-based QA Agent is repairing and validating deterministic code, there’s no possibility for the agent to inadvertently allow a bug to escape the way there is with Codeless QA Agents. The test code itself is what will determine whether the workflow passes or fails, not the judgment (or lack thereof) of an AI.


Benefits of Code-based QA Agents

  • True determinism. Code-based QA Agents are the gold standard for AI-enabled E2E testing as they combine the speed and efficiency of AI with the determinism of coded tests.
  • Test complexity. The benefit of using code instead of prompts is that code provides greater flexibility for more comprehensive testing (accessibility, performance, APIs, and complex scenarios like canvas elements and browser extensions). Of course, the exact testing capabilities will vary from vendor to vendor.
  • Portability.  Because the output is standard test code (e.g., Playwright, Appium), your tests aren’t trapped inside a proprietary runtime. You can move them across environments, run them locally or in CI, and avoid lock-in to a single vendor’s infrastructure.
  • Transparency. Every test is human-readable code, which makes it easy to review, audit for compliance, and trace exactly what the test does. Unlike opaque no-code flows, the logic is explicit.
  • Extensibility. You can layer on custom helpers, utilities, and frameworks as your test suite matures. This allows teams to evolve their testing strategy without waiting for a vendor to add features.

Codeless QA Agents

Examples: Momentic, Spur, Mabl

Codeless QA Agents attempt to make testing more accessible to non-technical stakeholders by removing code from E2E testing. In a pre-AI world, that approach simply limited what the tools were capable of testing (things like APIs or third-party systems were mostly unavailable to them), but in an AI world, it also makes the tests themselves less trustworthy.

Codeless QA Agents abstract away the underlying test operators (selectors, locators, API calls) and present higher-level, natural-language or no-code interfaces. Whether you’re giving them English prompts, recording flows, or choosing from prebuilt actions, the system decides how to execute them under the hood.  Because the execution depends on heuristic choices — whether via AI models, adaptive locators, or vision-based recognition — the tests are non-deterministic. Two runs may resolve elements or flows differently, even with identical inputs.

Self-healing in these systems means the tool adapts the test to keep it running — whether by swapping locators, falling back to vision, or trying alternate paths. That adaptability reduces maintenance, but it can also hide regressions, since the test may ‘pass’ even when the user experience is broken.

Benefits and drawbacks of Codeless QA Agents

The main benefit of Codeless QA Agents is their ability to sidestep broken selectors by using adaptive locators, natural-language mapping, or computer vision to follow test steps. In theory, this makes tests easier to maintain — as the UI changes, the AI makes judgment calls. And that’s where you run into the drawbacks:

  • Non-determinism. Agent-driven runs are less repeatable than code. Whether the system is choosing locators, exploring flows heuristically, or falling back to vision, you don’t get the same guarantee of step-for-step consistency that Playwright or Cypress provides.
  • Lack of portability. Some vendors offer an export-to-code feature, but the output is often verbose and loses key behaviors. Locator abstraction, vision-based matching, and other heuristics don’t translate cleanly, so exported tests don’t behave the same as they did inside the product.
  • Vendor lock-in. The features that reduce maintenance — auto-healing, exploration logic, adaptive locators — remain proprietary. Even with export options, the most valuable functionality stays tied to the vendor runtime, making it costly to switch.
  • Coverage limitations. Codeless QA Agents are built around browser interactions. They have limited reach into APIs, background jobs, third-party services, or state setup.
  • High execution cost. Codeless agents are significantly more expensive to run than code-based automation, making large-scale or frequent execution cost-prohibitive.
  • Slower performance. Tests executed through codeless agents run much slower than coded tests, creating bottlenecks when integrated into fast-moving CI/CD pipelines.

IDE Co-pilots

Examples: Cursor, GitHub Copilot, Windsurf

While AI-enhanced IDEs are not QA tools, strictly speaking, they are frequently considered by teams leveraging AI as part of their approach to automated testing. So we’re including them here as an honorable mention.

Most of these IDE co-pilots use the codebase itself as the source of truth to generate end-to-end tests. Instead of crawling the UI or recording flows, these tools analyze your application code and generate runnable tests in your chosen framework (e.g., Playwright, Cypress, Jest).

From the outside, it looks like a productivity boost: You prompt the tool or point it at a function, and it scaffolds test code that lives in your repo. Your team can run these tests in your existing CI/CD pipeline, and, because they are just regular test code, they are fully editable. Planning, coverage modeling, and assertions still fall to your team, but the tool can generate tests at any level of complexity your framework supports.
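
As a rough sketch, imagine pointing a co-pilot at a hypothetical applyDiscount function in your repo and asking it to write tests. It might scaffold something like the Jest code below; note that the expected values come from what the function currently does, which matters for the risks that follow.

```typescript
// A hypothetical function in your repo (in a real project this would live
// in its own module, e.g. a pricing module).
export function applyDiscount(total: number, code: string): number {
  // 10% off with the "SAVE10" code, otherwise no discount.
  return code === 'SAVE10' ? total * 0.9 : total;
}

// The kind of Jest test a co-pilot might scaffold from that code. The
// expected values are derived from the implementation above, not from a
// business requirement.
describe('applyDiscount', () => {
  it('applies a 10% discount for SAVE10', () => {
    expect(applyDiscount(100, 'SAVE10')).toBeCloseTo(90);
  });

  it('leaves the total unchanged for unknown codes', () => {
    expect(applyDiscount(100, 'BOGUS')).toBe(100);
  });
});
```

If applyDiscount itself misreads the requirement (say, the business actually wanted 15% off), the generated tests will happily lock in the wrong behavior — which is exactly the coverage problem described below.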

Risks and Limitations

Despite their appeal, these tools come with important constraints:

  • Implementation-driven coverage. Since they reason from your source code, they often verify what the code already does instead of whether the system meets business requirements.
  • Conflict of interest. The codebase becomes both the specification and the subject of verification, which means the source of truth is also the source of the bugs.
  • Ongoing maintenance. Like any test suite, you own execution, reporting, and upkeep. These tools won’t self-heal — resilience depends on selectors and helpers your engineers create.
  • Infrastructure requirements. While generating tests is straightforward, running them at scale (e.g., daily or on every pull request) requires CI/CD infrastructure that your team must build and maintain.
  • Scaling challenges. Like any IDE, they are not a complete QA toolchain. They lack the infrastructure needed for parallel execution, environment management, and reliable debugging beyond a developer’s local machine.
  • Lack of visibility. These tools are designed for technical resources, leaving PMs, execs, and manual testers in the dark about test coverage, test results, and product quality.

Session Recorders

Examples: Meticulous, Replay.io

Session Recorders don’t “test” in the true sense of the word, meaning interacting with the rendered UI and asserting that some expected result happened. Instead, Session Recorders have you instrument the code base (through a browser extension or a code snippet in the application header) so they can execute lines of code directly.

To determine which lines of code to execute, the Session Recorders observe a human’s clicks and keystrokes and log the network activity between the client and server. To run a test, a Session Recorder re-executes those recorded interactions against the application’s UI, simulating real user behavior.

What’s really happening under the hood is that they capture browser-rendered activity — DOM mutations, JavaScript events, user inputs, and network traffic — then reconstruct it on replay. To make this work, they typically mock or snapshot network calls (including to third-party services), which means they don’t validate the actual backend or side effects. Think of these tools like a driving simulator for websites.  They’re useful for replaying what a user saw, but not for true end-to-end testing. And because they rely on browser-level rendering, they won’t work at all for native mobile apps or desktop applications (like Electron) that don’t expose a DOM.
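
No vendor publishes its internals, so the sketch below is only an approximation of the replay step, using Playwright’s network interception to stand in for the recorder’s own machinery. Recorded responses are served back instead of reaching the real backend, which is why the backend is never actually exercised.

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical responses captured during the original recorded session.
const recordedResponses: Record<string, unknown> = {
  '/api/cart': { items: [{ sku: 'PANTS-42', qty: 1 }] },
  '/api/checkout': { status: 'confirmed', orderId: '12345' },
};

test('replay of a recorded checkout session', async ({ page }) => {
  // Serve the snapshotted responses instead of calling the real backend,
  // which is roughly what a session recorder does on replay. The server,
  // payment provider, and database are never actually exercised.
  await page.route('**/api/**', async (route) => {
    const path = new URL(route.request().url()).pathname;
    const body = recordedResponses[path] ?? {};
    await route.fulfill({ json: body });
  });

  await page.goto('https://shop.example.com/cart');
  await page.getByRole('button', { name: 'Check out' }).click();

  // The replay can only assert what the browser rendered during recording.
  await expect(page.getByText('Order #12345 confirmed')).toBeVisible();
});
```

The replay faithfully reproduces what the browser saw, but the order confirmation in this sketch comes from a canned response, not from a server that actually processed an order.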

Benefits and drawbacks of Session Recorders

Setup is fairly straightforward: Just install an extension or add a code snippet to your application’s header, and you’re off to the races. Tests can be developed by non-technical team members like designers, customer success managers, and product managers.

But there are drawbacks:

  1. Every test has to be recorded individually. And as new features are built, those tests have to be recorded as well.
  2. Not true E2E functional tests. Stubbing out servers is convenient, but mocking can miss real-world issues like tricky redirects and cookies, cross-site rules, data mismatches, slow servers, and side effects (like an email that never actually gets sent). These tools are okay for reproducing buggy user sessions, but not so great for proving a given path works from beginning to end.
  3. Security concerns. Most testing tools require giving the vendor some level of access (for example, source code snippets, test results, or limited environment credentials). Session recorders go significantly further—they instrument the browser and continuously capture live sessions, which can include credentials, session tokens, internal URLs, and PII. Allowing a vendor that level of visibility into staging or behind-VPN environments greatly expands the attack surface, making access controls essential. There’s also the ever-present possibility that the instrumentation in your code will get accidentally deployed to production.
  4. Limited test cases. These tools can only test what the application exposes in the UI. APIs, database calls, browser extensions, etc., would all be out of reach for these agents. Furthermore, they won’t work for native mobile apps that don’t render a DOM, desktop applications built in technologies that don’t expose browser-level events (e.g., Electron, Qt, .NET), or heavily backend-driven workflows where correctness depends on server logic, state, or external APIs. Lastly, they won’t let you test side effects that mocking can’t prove, such as sending an email, charging a credit card, generating a report, authenticating a user, or changing database state.
  5. Potentially noisy results. These tools flag visual diffs in the application, which means that everything from intentional changes to feature flags to rendering errors in the environment could cause a test to fail. They can also create too many irrelevant, low-impact tests because they simply replay user flows, and there is a nearly infinite number of user flows.
  6. Vendor lock-in. The artifacts created by these tools are not open source and are only designed to be used within the tool.

Choosing the right kind of AI in QA

When you line them all up, the differences become pretty clear.

Session recorders are fine if you just want replays, but they miss real-world issues, pile on noisy results, and box you into a vendor’s sandbox. Codeless QA agents? They can be flashy, but the randomness and lock-in make them more of a gamble than a guarantee. IDE co-pilots? Great for scaffolding code, but they’re really just parroting back your own source, and you’re still left carrying all the maintenance.

Code-based QA Agents are in a different league. They give you real, verifiable tests in code you own, with the speed of AI and the reliability of proven frameworks. No smoke, no mirrors, no lock-in — just tests you can see and trust.

For any team serious about leveraging AI in QA, the only logical choice is a Code-based QA Agent—delivering the efficiency of AI without sacrificing the trustworthiness and durability that automated testing demands.

