What your system should do with a flaky test

John Gluck
July 11, 2025

Automated tests will fail from time to time, as surely as the sun will rise tomorrow. When there’s too much flakiness in a suite, your team gets burned out chasing shadows and trust in the results disappears (which means reverting to manual testing). You should do everything possible to reduce the rate of flakes to below 10%. And the way you do that is to log each flake as structured data: tagged, classified, de-duped, and tracked over time.

When you do that, flakes cease to be agents of chaos and become agents of change. You can see the patterns. You know what to fix. And your test suite becomes a real signal again.

Here’s what to do:

Auto-retry every individual test

A flake-resistant system retries failures by default. That way, you don’t waste time re-running the entire suite or hand-picking flaky tests while trying to guess what other tests they depend on. The system should rerun only what failed and do so automatically.

If a test fails once and passes on retry, that’s a flake. If it keeps failing across multiple successive retries, that’s something worth investigating. You need the retry to separate the noise from the signal.
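Here’s a minimal sketch, in Python, of what that retry-and-classify loop might look like. The `run_test` callable is a hypothetical stand-in for however your framework executes a single test; the point is that only the failing test is rerun, attempts are capped, and every attempt is recorded.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

MAX_ATTEMPTS = 3  # one initial run plus two retries; cap it to avoid masking real instability


@dataclass
class TestOutcome:
    test_id: str
    attempts: List[bool] = field(default_factory=list)  # one entry per attempt, True = pass

    @property
    def verdict(self) -> str:
        if self.attempts[0]:
            return "pass"
        if any(self.attempts):
            return "flake"  # failed first, passed on a retry
        return "fail"       # failed every attempt: worth investigating


def run_with_retries(test_id: str, run_test: Callable[[], bool]) -> TestOutcome:
    """Rerun only this test, on the same commit and environment, up to the cap."""
    outcome = TestOutcome(test_id)
    for _ in range(MAX_ATTEMPTS):
        passed = run_test()
        outcome.attempts.append(passed)  # log every attempt for traceability
        if passed:
            break
        time.sleep(1)  # brief pause between attempts
    return outcome
```

If you run pytest, the pytest-rerunfailures plugin (`--reruns 2`) gives you capped per-test retries off the shelf; most modern runners offer an equivalent setting.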

Operational rules:

  • Run retries on the same commit and environment to eliminate drift.
  • Cap retries at 2–3 attempts to avoid masking real instability.
  • Log the retry behavior alongside the test result for traceability.
  • Bonus: For each round of retries, adjust the conditions—add waits, reduce parallelism, or limit test scope—to help surface timing issues, resource conflicts, or order-dependent flakes that don’t show up in default runs.

Track and classify flakes with failure signatures

Automatically retrying tests will reduce the number of failures to investigate on any given run of your test suite. Still, if you’re not logging and analyzing which tests need retries, you won’t be able to identify the tests with setup problems, race conditions, or environment-specific bugs. Without structured history, they’ll keep passing on the second or third try and slipping through unnoticed.

That’s why your system needs two things:

  1. A historical log of test outcomes.
  2. Failure signatures that group recurring flakes by behavior.

Log the essentials:

  • Test name or ID.
  • Timestamp of the run.
  • Error message and stack trace.
  • Basic environment info (browser, device).
  • Commit hash or run ID.

Trends indicate where things are deteriorating. To go further, generate failure signatures: fingerprints that group similar failures across different tests or environments. If 12 tests throw the same timeout in the same interaction, that’s one flaky behavior, not 12 separate bugs.
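Here’s a minimal sketch of how those signatures might be generated: normalize the volatile parts of the error and stack trace (memory addresses, timestamps, bare numbers), then hash the result. The record fields mirror the essentials above; the normalization rules are illustrative assumptions, so tune them to the failure output your stack actually produces.

```python
import hashlib
import re
from dataclasses import dataclass


@dataclass
class FlakeRecord:
    test_id: str
    timestamp: str       # ISO-8601 time of the run
    error_message: str
    stack_trace: str
    environment: str     # e.g., browser and device
    commit: str          # commit hash or run ID


def normalize(text: str) -> str:
    """Strip volatile details so equivalent failures hash to the same value."""
    text = re.sub(r"0x[0-9a-fA-F]+", "<addr>", text)                # memory addresses
    text = re.sub(r"\d{4}-\d{2}-\d{2}[T ][\d:.]+", "<time>", text)  # timestamps
    text = re.sub(r"\b\d+\b", "<n>", text)                          # ports, counters, line numbers
    return text.strip().lower()


def failure_signature(record: FlakeRecord) -> str:
    """Fingerprint that groups similar failures across tests and environments."""
    payload = normalize(record.error_message) + "\n" + normalize(record.stack_trace)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Two tests that hit the same timeout in the same interaction now produce the same signature, so they show up as one flaky behavior instead of two unrelated failures.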

Operational rules:

  • Log test results with consistent metadata.
  • Normalize and hash the error + stack trace to generate signatures.
  • Track flake frequency per test and signature.
  • Monitor trends: which signatures are rising, stable, or fixed?
  • Let reviewers reclassify or merge signatures when needed.
  • Bonus: Create a report that identifies tests that flake 30% of the time or more and use it to investigate root causes (a sketch follows this list).
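Here’s a sketch of that report, assuming per-run verdicts are already stored as (test ID, verdict) pairs like the ones the retry loop above produces; the 30% cutoff matches the threshold suggested in the bonus rule.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

FLAKE_THRESHOLD = 0.30  # flag tests that flake on 30% of their runs or more


def flake_report(outcomes: Iterable[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """outcomes: (test_id, verdict) pairs; verdict is 'pass', 'flake', or 'fail'."""
    runs: Dict[str, Counter] = defaultdict(Counter)
    for test_id, verdict in outcomes:
        runs[test_id][verdict] += 1

    flagged = []
    for test_id, counts in runs.items():
        flake_rate = counts["flake"] / sum(counts.values())
        if flake_rate >= FLAKE_THRESHOLD:
            flagged.append((test_id, flake_rate))

    # Worst offenders first, so root-cause investigation starts where it pays off most.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```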

Drive action based on flake ownership

A flaky test without ownership is just noise. If the test execution system can’t identify where a flake originated or who is responsible for it, your team will just end up playing whack-a-mole. Your system needs to do more than detect and group flakes. It has to route them to the right person, with the right context, at the right time.

Every recurring flake should be linked to a known issue:

  • A real product bug, such as a race condition, a mismatched feature flag, or an unstable third-party dependency (open or resolved).
  • A test bug (pending or fixed).
  • An infrastructure problem (e.g., staging slowness, auth timeouts).
  • Or marked “won’t fix” with a clear reason.

That linkage turns passive noise into active signals. It’s how your system stops flagging the same failure over and over—and starts treating flakes like incidents with a path to resolution.
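Sketched below is one possible shape for that linkage: each failure signature maps to a known issue with a category, an owner, and a status, so a recurring failure is matched against the register instead of being reported as new. The categories mirror the list above; the field names and routing targets are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class IssueKind(Enum):
    PRODUCT_BUG = "product bug"        # race condition, mismatched flag, unstable third party
    TEST_BUG = "test bug"              # bad wait, stale selector, leaked state
    INFRASTRUCTURE = "infrastructure"  # staging slowness, auth timeouts
    WONT_FIX = "won't fix"             # accepted, with a documented reason


@dataclass
class KnownIssue:
    ticket: str        # tracker ID, status page, or code annotation
    kind: IssueKind
    owner: str         # team or person accountable for resolution
    resolved: bool = False


# signature -> linked issue; anything missing here is a new flake that needs triage
issue_register: Dict[str, KnownIssue] = {}


def triage(signature: str) -> Optional[KnownIssue]:
    """Return the issue already linked to this signature; None means open a new one."""
    return issue_register.get(signature)


def route(issue: KnownIssue) -> str:
    """Decide who gets notified, based on what kind of issue the flake is linked to."""
    if issue.kind is IssueKind.TEST_BUG:
        return "test-maintainers"  # test bugs go to the test maintainer, not the product team
    if issue.kind is IssueKind.INFRASTRUCTURE:
        return "platform-oncall"   # escalate infrastructure flakes instead of working around them
    return issue.owner             # clean product bugs go straight to the owning team
```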

Operational rules:

  • Show unresolved flake signatures in a live dashboard.
  • Link each flake to tickets, status pages, or code annotations.
  • Assign test bugs to the test maintainer, not the product team.
  • Route clean product bugs to the owning team automatically.
  • Escalate infrastructure flakes instead of working around them.
  • Periodically review bugs in “won’t fix” and “cannot reproduce” resolution statuses—they have a habit of coming back.
  • Bonus: Build automated handoffs and chains of responsibility—so when flakes spike, the system can automatically determine when to raise the alarm and who to notify (see the sketch after this list).
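Here’s a minimal sketch of that alarm, assuming flake occurrences per signature are already counted over a rolling window and that owners come from a register like the one above; the spike threshold is an arbitrary example value.

```python
from typing import Dict, List, Tuple

SPIKE_THRESHOLD = 5  # occurrences of one signature within the window (example value)


def flakes_to_escalate(
    window_counts: Dict[str, int],  # signature -> occurrences in the last N hours
    owners: Dict[str, str],         # signature -> owning team or person
) -> List[Tuple[str, str]]:
    """Return (signature, owner) pairs that have spiked and need a human notified."""
    alerts = []
    for signature, count in window_counts.items():
        if count >= SPIKE_THRESHOLD:
            # Unowned signatures fall back to a triage queue rather than being dropped.
            alerts.append((signature, owners.get(signature, "qa-triage")))
    return alerts
```

Each alert can then be forwarded through whatever notification channel your team already uses: chat, paging, or the ticketing system itself.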

Flaky coverage is fake coverage

Left to their own devices, flakes give you no information. Worse, they force your team to repeat the same investigations over and over, which slows down product delivery.

You can’t eliminate flakes. But you can turn each one into a signal that makes your suite more resilient. A real flake-tolerant system automates retries, fingerprints errors, tracks patterns, links known defects, and drives action—without ever asking a developer to re-run a test “just to be sure.” That’s how you get trustworthy test results and stop wasting time chasing ghosts.

We’ve built this system. If you’re ready for test infrastructure that treats flakes like data—not drama—talk to us.
