Quality Assurance in the age of AI-powered testing tools

Kirk Nathanson
March 27, 2024
I'd say the various no-code AI tools were very easy to set up but very difficult to scale. It always felt like the no-code AI tools just made things more complicated in more unpredictable ways. The premise is interesting but they slowed you down about as much as they sped you up.
Philip Seifi, co-founder @ Colabra

Regression testing is a repetitive, ceaseless operation — Does that work? Yeah. Does that work? Yeah. Does that work? Yeah — that’s why so much effort is spent automating it. Robots are great at repetitive tasks. But just running automated tests is a very, very small part of what goes into QA automation.

The real effort is investigating test failures after each and every run. That means manually reproducing the test to confirm there’s a bug, re-running the test if it flaked, or fixing the test if the latest release changed the expected behavior.

The reason that two-thirds of companies test less than 50% of their user workflows is that it takes one test automation engineer for every four front-end developers to keep up with the investigation and maintenance demands of a test suite. That’s why AI-powered testing tools that promise “natural language” test creation and “self-healing” are so appealing: if you could get faster and cheaper test maintenance, you could maintain a larger test suite and test more often.

But before you replace your whole QA team with generative AI, make sure you understand how these tools actually work, where they fall short, and what risks they introduce.

Generative AI’s primary motivation is getting a test to pass

This is in contrast to humans, whose primary motivation is to catch bugs before they go into production.

AI-powered testing tools, sometimes billed as “self-healing,” try to reduce the cost of test maintenance by automatically experimenting with new selectors when tests appear to fail. “If a test can pass,” the AI reasons, “then ipso facto there can’t be a bug.” To the AI, the logic is sound: all it needs to do is fix what’s wrong with the test, and the errant failure will be resolved.

For example: Your app has a built-in messaging system. To view your messages you have to click a button that dynamically shows the number of unread messages, like “Inbox (3)”. When the AI writes a test to open the inbox it uses “Inbox (3)” as a selector and doesn’t know that the message count can change. On the next run, the UI shows “Inbox (7)” which breaks the test, so the AI replaces the selector with “Inbox (7)” and the test passes.
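Here’s what that brittleness can look like in a Playwright-style test. This is a minimal sketch; the URL, labels, and test ID are invented for illustration rather than taken from any particular tool’s output:

```typescript
import { test, expect } from '@playwright/test';

test('open the inbox', async ({ page }) => {
  await page.goto('https://example.com/app'); // placeholder URL

  // Brittle: the unread count is baked into the locator, so this only
  // passes while exactly three messages are unread.
  await page.getByRole('button', { name: 'Inbox (3)' }).click();

  // More robust alternatives a human would reach for instead:
  // match only the stable part of the label, or use a dedicated test id.
  // await page.getByRole('button', { name: /^Inbox/ }).click();
  // await page.getByTestId('inbox-button').click();

  await expect(page.getByRole('heading', { name: 'Messages' })).toBeVisible();
});
```

A self-healing tool patches the brittle locator after the fact; a human would have avoided the dynamic text in the first place.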

This approach works, and the AI can keep making changes like this indefinitely. But the approach has its downsides, and it’s why companies need to be cautious about how they implement generative AI: if you modify a failing test just so that it passes, you may be changing the intent of the test and letting a bug slip through.

Consider a similar example:

A developer accidentally changes the “log in” button to read “logggin.” The test step that checks the log-in flow fails because it can’t find the old (correct) selector, and the AI attempts to self-heal by changing the selector to “logggin.” Now the locator is valid, the test resumes and passes. And if you are practicing continuous delivery, you have a glaring typo on your production site.
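Sketching the same scenario in test code makes the problem concrete. Everything here is hypothetical, including the URL, the form labels, and the post-login redirect:

```typescript
import { test, expect } from '@playwright/test';

test('user can log in', async ({ page }) => {
  await page.goto('https://example.com/login'); // placeholder URL
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('correct horse battery staple');

  // Original step, recorded before the typo shipped:
  // await page.getByRole('button', { name: 'Log in' }).click();

  // "Self-healed" step: the AI swapped in whatever text it found, so the
  // button label is no longer being verified at all.
  await page.getByRole('button', { name: 'Logggin' }).click();

  // The flow still works, the run goes green, and the typo ships.
  await expect(page).toHaveURL(/dashboard/);
});
```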

The worst part isn’t the embarrassment of the typo; it’s that your developers no longer trust the AI to accurately test their releases, so they go back to manually testing each PR.

LLMs with restricted context limits can struggle to find the correct selectors

As with most things, you can buy your way out of this problem if you can afford it.

Without getting too technical, the context limit is the maximum amount of text (measured in tokens) that an LLM can consider (or “remember”) at any one time. To manage very large amounts of data, generative AI tools “compress” the HTML of a web app so that it can be analyzed more efficiently. In doing so, genAI-powered tools with lower context limits have to drop parts of the page from memory, potentially including the selectors needed to create or update a test.
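To make that concrete, here is a toy sketch of how a tight token budget can silently drop the one element a test needs. The budget, the characters-per-token estimate, and the compressed page summary are all assumptions made for the example; real tools use model-specific tokenizers and far more sophisticated compression:

```typescript
// Rough token estimate: ~4 characters per token.
const approxTokens = (s: string): number => Math.ceil(s.length / 4);

const BUDGET_TOKENS = 50; // deliberately tiny for illustration

// A "compressed" page: one line per interactive element.
const pageSummary: string[] = [
  '<nav id="main-nav">…</nav>',
  '<button data-testid="search">Search</button>',
  '<button data-testid="inbox">Inbox (7)</button>',
  '<button data-testid="settings">Settings</button>',
  '<button data-testid="log-in">Log in</button>', // the element the test needs
];

// Greedily keep lines until the budget runs out; everything after that
// point is simply invisible to the model.
let used = 0;
const visibleToModel = pageSummary.filter((line) => {
  used += approxTokens(line);
  return used <= BUDGET_TOKENS;
});

console.log(visibleToModel.some((line) => line.includes('log-in')));
// false: the log-in button never reached the model, so the model
// can't write or repair a selector for it.
```

On a real page with thousands of elements, whatever falls outside the window gets this treatment, and you usually won’t know which elements were dropped.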

Since a higher context limit (bigger memory, less compression) increases the cost of using LLMs like ChatGPT, Llama, and Claude, testing teams have to weigh how much uncertainty they’re willing to accept against the testing budget they can afford.

Generative AI creates tests for itself, not for collaboration or portability

This is only a problem if you have to debug a test or you ever want to change vendors.

One way that generative AI testing tools try to reduce the cost of QA is by providing a point-and-click interface for recording test steps; another is letting you describe a test in plain English using “natural-language processing.” Both approaches are designed so that non-technical (and presumably less expensive) people in the company can create tests.

Both approaches also hide the underlying code from the tester. Should you ever want to change tools, you won’t be able to take the tests you’ve created with you. You’ll have to rebuild your test suite from scratch.

There are exceptions: some tools let you export the underlying code as Playwright or Cypress, but that doesn’t mean the exported tests will run out of the box, or even that you’ll understand them. Since the AI was writing tests only for its own comprehension, the code is usually convoluted and inefficient, with no comments, and it often calls internal services that you’d need to recreate in your own testing environment. Evaluate a handful of exported tests to understand how much effort it would take to make the suite useful should the need arise.
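As a hypothetical illustration of the gap, compare an export-flavored test (opaque XPaths, no comments, a vendor-only helper stubbed out) with the same check written for humans. Every name, selector, and URL here is invented:

```typescript
import { test, expect } from '@playwright/test';

// Export-flavored: machine-generated name, positional XPaths that break on
// any layout change, and a dependency on the vendor's own infrastructure.
test('t_00231', async ({ page }) => {
  await page.goto('https://example.com/app'); // you still have to wire up the URL
  // await vendorCloud.resolveElement(page, 'elm_4821'); // vendor-only helper, doesn't exist locally
  await page.locator('xpath=//*[@id="root"]/div[3]/div[2]/div/div[1]/button[2]').click();
  await expect(page.locator('xpath=//*[@id="root"]/div[3]/div[4]/span')).toBeVisible();
});

// The same check written for collaboration and portability.
test('user can open the inbox from the nav bar', async ({ page }) => {
  await page.goto('https://example.com/app'); // placeholder URL
  await page.getByRole('button', { name: /^Inbox/ }).click();
  await expect(page.getByRole('heading', { name: 'Messages' })).toBeVisible();
});
```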

Self-healing tests don’t eliminate the need for human investigation (yet) because the AI can’t be completely trusted (yet)

At the top of this piece we pointed out that the most expensive part of automated testing is investigating failures and updating tests when the UI changes, which are exactly the tasks that seem like they could be delegated to generative AI. But let’s acknowledge what we’re really asking generative AI to do for us, because it’s not simply fixing selectors. We’re entrusting the AI with business-critical decisions about the state of our software, decisions that could have huge repercussions for our users and our company.

The day when the AI can be trusted with those decisions is coming (probably… maybe… eventually), but if you’re banking the future of your QA process on AI, it’s important to understand that when it comes to QA you’re not paying for human labor; you’re paying for human judgment.
