Measuring what matters: QA metrics that improve products, teams, and businesses

Kirk Nathanson
July 21, 2023

There’s the old chestnut: “You can’t improve what you don’t measure.” What to measure, of course, isn’t always clear. When it comes to QA and automated testing, we find an over-reliance on the performance metrics of individual contributors: the number of test cases allocated to each team member, test cases executed, or open bugs to be retested. IC performance metrics offer a very limited view of product quality and don’t drive company performance. In this post, we’ll cover the QA metrics we think engineering and business leaders should focus on, the targets to aim for, and the actions to take.

We're also going to assume that you already have basic tracking on bug reports and customer support tickets. If you don't, pause here and set that up, and make sure you can slice the data by severity and by feature set. If you only look at the product engineering side of things, you're not counting the true cost that bugs have on your business. Even a modest reduction in customer support volume can have an outsized impact: at GUIDEcx, a 19% reduction in ticket volume from software bugs saved the company $98K/year in customer support.
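Here's a minimal sketch of the kind of tagging that makes that slicing possible; the field names and categories are illustrative, not a prescribed schema:

```ts
// Illustrative ticket shape: enough structure to slice bug data by severity,
// feature area, and where the report came from. Adjust names to your tools.
type Severity = "critical" | "major" | "minor";

interface BugTicket {
  id: string;
  severity: Severity;
  featureArea: string; // e.g. "checkout", "onboarding"
  source: "support" | "qa" | "monitoring";
  openedAt: Date;
}

// Roll tickets up by feature so you can see where bugs actually cost you.
function countByFeature(tickets: BugTicket[]): Record<string, number> {
  return tickets.reduce<Record<string, number>>((acc, t) => {
    acc[t.featureArea] = (acc[t.featureArea] ?? 0) + 1;
    return acc;
  }, {});
}
```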

Automated test coverage metrics 

Unit & integration test coverage

Unit and integration coverage is frequently called “code coverage” because these lower-level, white-box tests check individual functions or small groups of components in the code itself. 100% code coverage is the typical mantra, but we like to say 100% of common-sense tests: writing false-negative tests (tests that pass even when the code is broken) doesn’t add any value. Practicing test-driven development (TDD) is the first step to getting your score up. TDD uses the “red-green” approach: the test fails first because the function isn’t written yet, then passes once the function works as intended. By starting with the test, you make sure that everything you build has some level of test coverage.
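Here’s a minimal sketch of the red-green loop using Node’s built-in test runner; applyDiscount is a hypothetical function used only to illustrate the flow:

```ts
import { test } from "node:test";
import assert from "node:assert/strict";

// Red: write this test first. It fails until applyDiscount exists and works.
test("applies a percentage discount to an order total", () => {
  assert.equal(applyDiscount(200, 0.25), 150);
});

// Green: the simplest implementation that makes the test pass.
// (applyDiscount is illustrative, not part of any real codebase.)
function applyDiscount(total: number, rate: number): number {
  return total * (1 - rate);
}
```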

Effective end-to-end workflow coverage

With E2E tests, you’re checking entire workflows (ideally every possible workflow) from beginning to end. The basic formula for E2E test coverage is the total number of tests over the total number of workflows, with each permutation of a workflow counted separately. For example:

  • Add a pair of pants and check out with a new account
  • Add a pair of pants and check out with an existing account
  • Add a pair of pants and check out with an anonymous account

But a better metric is your effective E2E test coverage, which doesn’t count tests that are down for maintenance. Flaky coverage is fake coverage, and tests that are written but not maintained provide a false sense of security.
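A quick sketch of the difference between raw and effective coverage, assuming you track which tests are currently down for maintenance (the numbers are illustrative):

```ts
// Raw coverage: every written test counts, healthy or not.
function coverage(totalTests: number, totalWorkflows: number): number {
  return totalTests / totalWorkflows;
}

// Effective coverage: only tests that are actually running count.
function effectiveCoverage(
  totalTests: number,
  testsInMaintenance: number,
  totalWorkflows: number
): number {
  return (totalTests - testsInMaintenance) / totalWorkflows;
}

// 300 tests over 320 workflow permutations looks like ~94% coverage,
// but with 60 tests down for maintenance the effective number is 75%.
console.log(coverage(300, 320));              // 0.9375
console.log(effectiveCoverage(300, 60, 320)); // 0.75
```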

You should aim to have a functioning E2E test running against 80% of your total workflows or more. When there’s less coverage than that, the likelihood of regressions finding their way into production increases dramatically. Notably, fewer than 15% of companies reach that goal, mostly because of the dizzying amount of maintenance that a test suite needs as it grows (SmartBear, 2022). 

Depending on the complexity of your application, you may find there’s an effectively infinite number of possible tests. If that’s the case, there are a variety of strategies to help you define a representative sample: risk-based testing, for example, prioritizes the workflows users see and use most often, along with the business risk of shipping bugs in different areas of the application.
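One hedged way to sketch that prioritization: score each workflow by how often users hit it and how costly a bug there would be, then automate the top of the list. The 1–5 scales and the simple multiplication are illustrative assumptions, not a formal methodology:

```ts
interface Workflow {
  name: string;
  usageFrequency: number; // 1 (rare) to 5 (constant), illustrative scale
  businessImpact: number; // 1 (cosmetic) to 5 (revenue-critical), illustrative scale
}

// Rank workflows by risk score and take the top N as the first tests to automate.
function prioritize(workflows: Workflow[], topN: number): Workflow[] {
  return [...workflows]
    .sort(
      (a, b) =>
        b.usageFrequency * b.businessImpact -
        a.usageFrequency * a.businessImpact
    )
    .slice(0, topN);
}
```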

→ Read more: How to prioritize your first 100 end-to-end tests

Time to fulfill coverage requests

As your team releases new features, your developers need to request additional tests. The time it takes to fulfill those coverage requests is a helpful metric that can point to delays in the release process. As long as your team is blocking releases from going out until they're fully tested — and they should be! — then new tests have to be written before new features can ship. 

A healthy SLA is about one business week after a request is made, although that can be shortened by bringing QA closer to the product design stage so that work can begin during development and QA engineers can push for more testable code. 
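A rough sketch of how you might compute the metric, assuming each coverage request records when it was opened and when the finished test was merged (the CoverageRequest shape is illustrative):

```ts
interface CoverageRequest {
  openedAt: Date;    // when the developer asked for the test
  fulfilledAt: Date; // when the finished test was merged
}

const MS_PER_DAY = 86_400_000;

// Median calendar days from request to fulfilled test.
function medianDaysToFulfill(requests: CoverageRequest[]): number {
  const days = requests
    .map(r => (r.fulfilledAt.getTime() - r.openedAt.getTime()) / MS_PER_DAY)
    .sort((a, b) => a - b);
  const mid = Math.floor(days.length / 2);
  return days.length % 2 ? days[mid] : (days[mid - 1] + days[mid]) / 2;
}
```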

Percentage of tests skipped 

High-performing teams can expect 5–10% of their E2E tests to break each week without continuous maintenance; within six weeks, nearly 50% of a test suite will be out of date. That means, inevitably, that when a developer wants to deploy, some of the regression tests will be down for maintenance, and one of two things happens:

  1. The developer bypasses the test and merges anyway.
  2. The developer has to spend hours manually testing before bypassing the automated test.

In the first case, the risk of bugs goes way up. In the second, the productivity of the developer goes way down. 
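The six-week figure above follows from compounding: if each test has roughly a 10% chance of breaking in any given week, the broken share stacks up fast. Here's a quick sketch of the arithmetic, assuming breakage is independent from week to week:

```ts
// Share of the suite that has broken at least once after a number of weeks,
// given an independent weekly breakage probability and no maintenance.
function shareBroken(weeklyBreakRate: number, weeks: number): number {
  return 1 - Math.pow(1 - weeklyBreakRate, weeks);
}

console.log(shareBroken(0.1, 6)); // ≈ 0.47, nearly half the suite
```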

A good target for a mature engineering team is to have fewer than 5% of automated tests down at any given time. 

QA cycles and deployment timeline metrics

Time spent babysitting test runs 

Babysitting test runs, that is, waiting around for a test suite to finish, is an under-appreciated time-suck at most companies. While automated tests will always be faster and more accurate than the same suite run manually, it can still take hours or days to run all of them sequentially. If 250 tests take five minutes apiece, a developer could wait roughly 20 hours for a full regression report. Even a small smoke test, say 50 workflows, could leave a developer idle for several hours.
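The back-of-the-envelope math, using the numbers above (runtimes and worker counts are illustrative):

```ts
// Hours a developer waits for a suite, given test count, average duration,
// and how many tests run at once.
function suiteHours(tests: number, minutesPerTest: number, parallelWorkers = 1): number {
  return (tests * minutesPerTest) / parallelWorkers / 60;
}

console.log(suiteHours(250, 5));     // ~20.8 hours run one at a time
console.log(suiteHours(250, 5, 50)); // ~0.4 hours (25 minutes) with 50 parallel workers
console.log(suiteHours(50, 5));      // ~4.2 hours for a 50-workflow smoke test
```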

Google recently published research showing the devastating impact that long builds have on developer productivity, and it’s worth a read.

You can speed up your test suite by hours or days if you run the tests in parallel. Keep in mind that test-running services like GitHub Actions or BrowserStack charge extra to run tests in parallel, so you may need to compare the cost of parallel runs against the lost productivity of your engineering team. At QA Wolf, we charge by test created and maintained (not by the test run), and unlimited parallel runs are included in the management fee.

→ Read more: Why QA Wolf charges per test managed, not per test run
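If your suite runs on Playwright, for example, parallelism is largely a configuration and CI setting. A minimal sketch, with illustrative worker and shard counts:

```ts
// playwright.config.ts
import { defineConfig } from "@playwright/test";

export default defineConfig({
  fullyParallel: true,                     // run tests within a file in parallel too
  workers: process.env.CI ? 8 : undefined, // 8 parallel workers on CI (illustrative)
});

// On CI you can also split the suite across machines with sharding, e.g.:
//   npx playwright test --shard=1/4
```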

Time spent triaging test failures and reproducing bugs

When tests fail, a human needs to investigate whether there's a real bug or just a flaky test, and that investigation time adds up fast. Say you have 250 tests and 23% fail: that's 57 tests to triage. You can safely assume it takes 15–30 minutes to review each one, which comes to 14–28 hours. And that's before anyone starts fixing the bugs.

The best way to reduce your triage times is to reduce the number of failing tests. According to Tricentis, about 23% of automated test runs flake out. We've gotten that down to 15% with a mix of better tests and new technology. And we layer on 24-hour global teams to keep up with the volume. Our clients don't even worry about this metric because QA Wolf takes care of everything for them.
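One way to trim triage time is to let the test runner retry failures automatically, so tests that fail once and then pass are flagged as flaky while consistent failures get human attention first. A minimal sketch for a Playwright suite (the retry counts are illustrative):

```ts
// playwright.config.ts
import { defineConfig } from "@playwright/test";

export default defineConfig({
  retries: process.env.CI ? 2 : 0, // retry failed tests up to twice on CI only
});
// Tests that fail and then pass on retry show up as "flaky" in the report,
// so reviewers can separate them from failures that reproduce every time.
```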

→ Read more: How to build tests like a QA Wolf

Engineering productivity and team velocity metrics

Mean Time to Recovery

By tracking MTTR, organizations can get a sense of how effective they are at addressing and resolving problems. MTTR is the average time it takes to fix a problem from the moment it occurs, and it has two major contributing factors: the time needed to report a bug and the time needed to fix it. We find it helpful to measure both separately because each one points to different areas for improvement, but they’re certainly related.
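A minimal sketch of splitting MTTR into those two parts, assuming each incident records when the bug was introduced, reported, and fixed (the Incident shape is illustrative):

```ts
interface Incident {
  introducedAt: Date; // when the bug shipped
  reportedAt: Date;   // when it was caught and filed
  fixedAt: Date;      // when the fix landed
}

const hours = (ms: number) => ms / 3_600_000;
const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Mean time to report, mean time to fix, and overall MTTR, all in hours.
function mttrBreakdown(incidents: Incident[]) {
  return {
    meanTimeToReport: mean(incidents.map(i => hours(i.reportedAt.getTime() - i.introducedAt.getTime()))),
    meanTimeToFix: mean(incidents.map(i => hours(i.fixedAt.getTime() - i.reportedAt.getTime()))),
    mttr: mean(incidents.map(i => hours(i.fixedAt.getTime() - i.introducedAt.getTime()))),
  };
}
```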

According to survey data from Undo, developers spend more than 10% of their time — half a day each week — simply trying to reproduce bugs assigned to them. Detailed bug reports can dramatically cut that down. At QA Wolf, the bug reports also include video recordings, Playwright trace logs, and HAR files to minimize the time developers have to spend identifying the root cause of the issue and maximize the time they spend shipping new code.

Revenue and business-impact metrics

Revenue-damage from bugs

To be clear, every bug has a revenue impact on the business, but some bugs have a more direct impact than others. In 2014, a pricing bug on Amazon's UK marketplace caused prices for thousands of items to drop to one penny for over an hour. The bug cost Amazon and its sellers millions of dollars and did lasting damage to the reputation of small businesses across the UK.

Just as companies track the cost of internal team meetings, you should be calculating the lost revenue that bugs cause. This can be difficult to do in real time, but post-mortems are an ideal time to go back and look at what a bug did to the bottom line.

On the other side, QA teams should be tracking the revenue saved by bugs caught. A million-dollar bug that was prevented by a $100,000 investment in automated testing demonstrates the huge return that QA provides companies in simple dollars and cents. 

Metrics we never recommend

Percentage of tests that catch bugs

We often hear people say that their automated tests aren’t catching bugs, and therefore they’re not worth the cost or effort. This misses the point, because the percentage of tests that catch bugs conflates two very different situations. A test suite might not be catching bugs because the developers are careful and quality-focused, so there are no bugs to catch. Or it might not be catching bugs because it’s testing the wrong things (or not testing enough).

The latter is far more likely, which is why we suggest 80% test coverage or higher. But even if your developers really are that good, it doesn’t mean you shouldn’t invest in testing. Test coverage is an insurance policy that gives your developers the confidence to move fast, and it only takes one bug to ruin years of work.

Test cases executed

Like the percentage of tests that catch bugs above, test cases executed creates the appearance of testing without tracking the value of the work. Tracking it can incentivize teams to create unnecessary tests, which needlessly increases the maintenance burden and the time it takes to run a full regression.

Test cases allocated to each team member

This is another metric that incentivizes the wrong behavior. Team members may create unnecessary tests to juice their numbers, or duplicate test cases across multiple tests. Moreover, not every test case is equal and the most complex 20% will take up 80% of the team’s energy. 

Open bugs to be retested

While open bugs to retest is not inherently a good or bad metric, it can be misleading and unproductive. On its own, the number of open bugs doesn’t reflect the complexity of the bugs in question. Complex bugs that take developers a while to solve can make a QA team look ineffective. Consequently, QA teams may prefer to file easy, non-critical bugs to test and close instead of focusing on complex, critical ones.

The bigger issue with this metric is that QA should be preventing bugs in the first place, not just finding them and re-testing. Preventative activities aren’t captured by a metric focused only on retesting.

Measure what matters 

When we're talking about QA and automated testing, we need to look at the big picture, which is increasing developer productivity and driving a financial impact on the business. These aren’t always easy things to quantify and measure, and all of these metrics taken on their own have blind spots. What’s most important is delivering the best possible product to the end user, minimizing the cost of preventable issues, and maximizing return on investment in QA.

That’s where QA Wolf comes in. We combine high E2E test coverage with nearly instantaneous test results by running everything in parallel. We remove the bottlenecks in the deployment process by keeping test suites healthy, reproducing bugs, and separating them from flaky tests. And we free up in-house QA resources to focus on other quality initiatives.
