A new AI-powered tool for QA seems to pop up every week, and keeping track of all the different options can be overwhelming. How do they work? What do they do? Will they fit into my development process? This article will help you make sense of the landscape.
Before we begin, it’s important to understand that unless the AI-powered QA tool is developing its own model from the ground up (a costly and extremely unlikely proposition), it’s making use of someone else’s underlying LLM from OpenAI, Anthropic, Google, etc.
Because existing tools are constantly developing new features, new tools are coming on the market, and other tools are going out of business, we’ve decided the most helpful thing would be to discuss the four main types you’ll find out there: Code-based QA Agents, Codeless QA Agents, AI-enhanced IDEs, and Session Recorders.
We’ll describe each approach at a high level: how it applies LLM technology to the challenges of QA, what outcomes that leads to, its benefits and drawbacks, and how to identify it.
So let’s begin!
Examples: QA Wolf
Code-based QA Agents take the best parts of AI and combine them with the best parts of traditional testing. Users prompt the AI to create a test (“Add pants to cart and check out”), which the AI generates in deterministic and verifiable Playwright or Appium code. As such, Code-based QA Agents are considered the “gold standard” for AI-powered QA.
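For a sense of what that looks like in practice, here’s a minimal sketch of the kind of Playwright test such a prompt might produce. The URL, selectors, and copy are hypothetical, not taken from any particular product:

```ts
// A hypothetical test for the prompt "Add pants to cart and check out".
// The URL, selectors, and copy are invented for illustration.
import { test, expect } from '@playwright/test';

test('add pants to cart and check out', async ({ page }) => {
  await page.goto('https://shop.example.com');

  // Find the pants category and add the first item to the cart.
  await page.getByRole('link', { name: 'Pants' }).click();
  await page.getByRole('button', { name: 'Add to cart' }).first().click();

  // Open the cart and complete checkout.
  await page.getByRole('link', { name: 'Cart' }).click();
  await page.getByRole('button', { name: 'Check out' }).click();

  // A deterministic assertion: the order confirmation must be visible.
  await expect(page.getByText('Thank you for your order')).toBeVisible();
});
```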
With Code-based QA Agents, including QA Wolf, the efficiency and accessibility of natural language prompts return true E2E tests that are…
Self-healing is a fundamental feature of most AI-powered QA tools since test maintenance (investigating failures and updating tests after a UI change) is the most time-consuming and labor-intensive part of the QA lifecycle. As we will see below, self-healing means different things depending on the type of AI agent we’re talking about.
With Code-based QA Agents, self-healing means programmatically updating the Playwright or Appium test code. The AI identifies where a test is failing and makes the appropriate changes, such as new selectors or additional test steps.
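A healed test is still ordinary, reviewable Playwright code. Here’s a hypothetical before-and-after, assuming the checkout button was renamed in a UI refactor:

```ts
import { test, expect } from '@playwright/test';

// Hypothetical "healed" test: the checkout button was renamed in a UI refactor,
// so the agent swapped in the new selector. The change lands as an ordinary
// code diff that a human can review like any other commit.
test('checkout completes after the button rename', async ({ page }) => {
  await page.goto('https://shop.example.com/cart');

  // Before the UI change, this step was:
  //   await page.getByRole('button', { name: 'Check out' }).click();
  await page.getByRole('button', { name: 'Proceed to checkout' }).click();

  // The assertion is untouched, so the test still fails if checkout actually breaks.
  await expect(page.getByText('Thank you for your order')).toBeVisible();
});
```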
Since the Code-based QA Agent is repairing and validating deterministic code, it can’t inadvertently let a bug escape the way a Codeless QA Agent can. The test code itself determines whether the workflow passes or fails, not the judgment (or lack thereof) of an AI.
Examples: Momentic, Spur, Mabl
Codeless QA Agents attempt to make testing more accessible to non-technical stakeholders by removing code from E2E testing. In a pre-AI world, that approach simply limited what the tools were capable of testing (things like APIs or third-party systems were mostly unavailable to them), but in an AI world, it also makes the tests themselves less trustworthy.
Codeless QA Agents abstract away the underlying test operators (selectors, locators, API calls) and present higher-level, natural-language or no-code interfaces. Whether you’re giving them English prompts, recording flows, or choosing from prebuilt actions, the system decides how to execute them under the hood. Because the execution depends on heuristic choices — whether via AI models, adaptive locators, or vision-based recognition — the tests are non-deterministic. Two runs may resolve elements or flows differently, even with identical inputs.
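To see why that matters, here’s a purely conceptual sketch in TypeScript of how a codeless step like “click the checkout button” might get resolved at run time. This is not any vendor’s actual implementation, just an illustration of the heuristic fallback pattern:

```ts
import { Page, Locator } from '@playwright/test';

// Purely conceptual sketch of an adaptive-locator step resolver. No vendor
// exposes this exact API; the point is the heuristic fallback chain. The
// candidate list would normally be derived from the instruction by a model;
// it is hardcoded here for brevity.
async function resolveStep(page: Page, instruction: string): Promise<Locator> {
  const candidates: Locator[] = [
    page.getByRole('button', { name: /check ?out/i }), // accessibility-tree match
    page.getByText(/check ?out/i),                     // fuzzy text match
    page.locator('[class*="checkout"]'),               // attribute guess
  ];

  // Return the first strategy that matches anything on the page. Which one wins
  // can change as the page changes, so two runs may not resolve the same element.
  for (const candidate of candidates) {
    if ((await candidate.count()) > 0) {
      return candidate.first();
    }
  }
  throw new Error(`Could not resolve step: ${instruction}`);
}
```

Because the winning strategy depends on heuristics evaluated against the live page, a copy or layout change can silently shift which element the step acts on.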
Self-healing in these systems means the tool adapts the test to keep it running — whether by swapping locators, falling back to vision, or trying alternate paths. That adaptability reduces maintenance, but it can also hide regressions, since the test may ‘pass’ even when the user experience is broken.
The main benefit of Codeless QA Agents is their ability to side-step broken selectors by using adaptive locators, natural language mapping, or computer vision to follow test steps. In theory, this makes tests easier to maintain — as the UI changes, the AI makes judgment calls. And that’s where you run into the drawbacks:
Examples: Cursor, GitHub Copilot, Windsurf
While AI-enhanced IDEs are not QA tools, strictly speaking, they’re frequently considered by teams looking to bring AI into their automated-testing approach. So we’re including them here as an honorable mention.
Most of these IDE co-pilots use the codebase itself as the source of truth to generate end-to-end tests. Instead of crawling the UI or recording flows, these tools analyze your application code and generate runnable tests in your chosen framework (e.g., Playwright, Cypress, Jest).
From the outside, it looks like a productivity boost: You prompt the tool or point it at a function, and it scaffolds test code that lives in your repo. Your team can run these tests in your existing CI/CD pipeline, and, because they are just regular test code, they are fully editable. Planning, coverage modeling, and assertions still fall to your team, but the tool can generate tests at any level of complexity your framework supports.
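As a hypothetical illustration, suppose your repo contains a small pricing utility. Pointed at it, a co-pilot will typically scaffold something like the Jest test below (the function and test cases are invented for this example):

```ts
// applyDiscount.test.ts: the kind of Jest test a co-pilot scaffolds after reading
// a hypothetical utility in the repo with the signature
//   export function applyDiscount(total: number, percent: number): number
// which applies a percentage discount and throws on percentages outside 0-100.
import { applyDiscount } from './applyDiscount';

describe('applyDiscount', () => {
  it('applies a percentage discount to the total', () => {
    expect(applyDiscount(200, 25)).toBe(150);
  });

  it('rejects percentages outside the 0-100 range', () => {
    expect(() => applyDiscount(100, 120)).toThrow();
  });
});
```

Because the tool’s source of truth is the code itself, the generated test asserts what the code currently does, which is why planning and coverage decisions still sit with your team.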
Despite their appeal, these tools come with important constraints:
Examples: Meticulous, Replay.io
Session Recorders don’t “test” in the true sense of the word, i.e., interacting with the rendered UI and asserting that some result happened. Instead, they have you instrument the application (through a browser extension or a code snippet in the application’s header) so they can execute lines of code directly.
To determine which lines of code to execute, the Session Recorders observe a human’s clicks and keystrokes and log the network activity between the client and server. To run a test, a Session Recorder re-executes those recorded interactions against the application’s UI, simulating real user behavior.
What’s really happening under the hood is that they capture browser-rendered activity — DOM mutations, JavaScript events, user inputs, and network traffic — then reconstruct it on replay. To make this work, they typically mock or snapshot network calls (including to third-party services), which means they don’t validate the actual backend or side effects. Think of these tools like a driving simulator for websites. They’re useful for replaying what a user saw, but not for true end-to-end testing. And because they rely on browser-level rendering, they won’t work at all for native mobile apps or desktop applications (like Electron) that don’t expose a DOM.
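To make the simulator analogy concrete, here’s a conceptual sketch written with Playwright’s own network-mocking API rather than any recorder’s internals. The URLs and payloads are invented; the point is that a replayed “test” answers every request from a snapshot:

```ts
import { test, expect } from '@playwright/test';

// Hypothetical snapshot of responses captured during the original session.
// A real recorder would persist and key these automatically.
const recordedResponses: Record<string, unknown> = {
  'https://shop.example.com/api/cart': { items: [{ sku: 'PANTS-01', qty: 1 }] },
  'https://shop.example.com/api/checkout': { status: 'confirmed' },
};

test('replayed checkout session', async ({ page }) => {
  // Answer every API call from the snapshot, so the real backend (and any side
  // effect like charging a card or sending an email) is never exercised.
  await page.route('**/api/**', (route) =>
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify(recordedResponses[route.request().url()] ?? {}),
    }),
  );

  await page.goto('https://shop.example.com/checkout');
  await expect(page.getByText('Order confirmed')).toBeVisible();
});
```

Everything the user originally saw is reproduced faithfully, but nothing past the network boundary, such as payments, emails, or database writes, is ever exercised.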
Setup is fairly straightforward: just install the browser extension or drop the code snippet into your application’s header, and you’re off to the races. Tests can be developed by non-technical team members like designers, customer success managers, and product managers.
But there are drawbacks:
When you line them all up, the differences get pretty clear.
Session Recorders are fine if you just want replays, but they miss real-world issues, pile on noisy results, and box you into a vendor’s sandbox. Codeless QA Agents? They can be flashy, but the randomness and lock-in make them more of a gamble than a guarantee. IDE co-pilots? Great for scaffolding code, but they’re really just parroting back your own source, and you’re still left carrying all the maintenance.
Code-based QA Agents are in a different league. They give you real, verifiable tests in code you own, with the speed of AI and the reliability of proven frameworks. No smoke, no mirrors, no lock-in — just tests you can see and trust.
For any team serious about leveraging AI in QA, the only logical choice is a Code-based QA Agent—delivering the efficiency of AI without sacrificing the trustworthiness and durability that automated testing demands.