Why mobile E2E tests flake, and how QA Wolf controls every layer to stop it

John Gluck
May 29, 2025

Automated end-to-end testing of mobile apps is harder than it is for web apps, but not because selectors are flakier, devices are slower, or gestures are harder to script—although all those things are true, they’re easy enough to deal with. The real difficulty with mobile E2E tests is that they run in an environment that wasn’t designed to be tested at all.

Web browsers evolved with testability in mind. DevTools, headless mode, stateless sessions, inspectable DOMs, addressable URLs, network mocking, clearable cookies—it’s all automation-friendly by default.

Mobile? You’re testing a black box built for human fingers. There’s no API to tell the app, “Skip onboarding.” There’s no selector you can count on across Android and iOS. Every gesture is a guess, every permission dialog is a landmine, and every OS update is a new source of flakes waiting to happen.

Out of the box, automated mobile app testing is a long list of things you don’t control. So if you want reliable mobile tests, you have to build the infrastructure to force state control at three levels: device, app, and data. And unless your team builds that infrastructure, you’re going to end up with a test suite you spend more time apologizing for than trusting.

Here’s how to guarantee your tests start from a known state and keep the environment stable so tests run the same way every time.

Controlling device state: permissions, configuration, and OS-level leftovers

Every mobile test runs on top of a full operating system (Android or iOS), and that system holds onto state — meaning artifacts from previous runs linger in the OS unless you explicitly reset it. Reinstalling the app doesn’t wipe system-level memory. Neither does rebooting the test runner.

When testing on mobile, you’ll run into test failures that the app didn’t cause. The problem lives inside the OS. Here are specific examples of OS-level state that can interfere with mobile test reliability:

  • Permission grants: The OS tracks system permissions (camera, location, notifications, microphone). If a previous run granted access, the next run may never see the prompt it expects. The OS decides when those prompts appear—your test and your app have no say in it.
  • System configuration: Animation speed, font scaling, text contrast, and accessibility settings are OS-level configurations that persist across app sessions. They directly affect how and when the UI is drawn. A test can fail simply because a system animation is still in progress, even though the app is behaving correctly.
  • Locale and input settings: The OS controls regional formatting, keyboard layouts, and language preferences. For example, if one test switches the device to Japanese, the next test may encounter full-width characters in text inputs or different layouts for date/time pickers.
  • System overlays and alerts: First-launch agreements, low battery warnings, and OS update prompts are external to the app and can interrupt execution. They appear based on system conditions that aren’t always predictable or controllable during test execution. When they do appear, they block UI interaction until dismissed, causing tests to fail even though the app logic hasn’t changed.
  • Background processes: Processes like Google Play Services (Android) or Spotlight indexing (iOS) can trigger mid-test, creating unpredictable timing issues. These aren’t visible to the app and aren’t cleared by reinstalling it.

These failures all come from the system-level state the OS retains between tests. If you don’t reset that state, you get unexpected permission prompts, system overlays that block input, and UI elements that load out of sequence. The result is flakiness—failures unrelated to your code that still break your tests and pollute your signal.
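
As a rough sketch of what neutralizing some of that state can look like on Android, adb can revoke permission grants and disable system animations before a run. The package name and permission list here are illustrative placeholders, not a complete reset:

```ts
// Sketch: neutralizing common Android OS-level state with adb before a run.
// The package name and permission list are illustrative placeholders.
import { execSync } from "node:child_process";

const adb = (cmd: string) => execSync(`adb ${cmd}`, { stdio: "inherit" });

// Revoke runtime permissions so the next run sees the prompts it expects
// (assumes the app actually declares these permissions).
for (const perm of ["android.permission.CAMERA", "android.permission.ACCESS_FINE_LOCATION"]) {
  adb(`shell pm revoke com.example.app ${perm}`);
}

// Disable system animations so the UI settles deterministically.
adb("shell settings put global window_animation_scale 0");
adb("shell settings put global transition_animation_scale 0");
adb("shell settings put global animator_duration_scale 0");
```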

How to recognize device state issues

You’re likely dealing with leftover device state when:

  • A permission dialog appears on one test run but not the next.
  • The same test passes on one device (which could be a real device, emulator, or simulator) but fails on another, even with identical app builds.
  • A flaky interaction becomes reliable after wiping or erasing the device.
  • The UI behaves differently because of regional settings or system-managed keyboard features (autocorrect, locale-specific layouts, input methods) that the test never configured.

These failures tend to disappear when you start from a freshly erased device, which confirms that the root cause lives in the system layer.

How QA Wolf resets the device between tests

QA Wolf treats the mobile device as part of the test environment, not just a container for running the app. We rebuild that environment from scratch before every test run to eliminate OS-level state as a variable.

On Android, we sidestep the issue entirely: for each test run, we destroy the emulator and spin up a new one with controlled boot parameters for locale, timezone, and network conditions.
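
A minimal sketch of that pattern, using standard emulator and adb flags (the AVD name is a placeholder, and the real pipeline does more than this):

```ts
// Sketch: booting a disposable emulator with controlled boot parameters,
// then blocking until Android reports the boot is actually complete.
import { execSync, spawn } from "node:child_process";

const AVD = "e2e-pixel"; // hypothetical AVD created ahead of time

spawn("emulator", [
  "-avd", AVD,
  "-no-snapshot", "-wipe-data", "-no-window",   // no reused state, headless
  "-timezone", "America/Los_Angeles",           // pin timezone
  "-netdelay", "none", "-netspeed", "full",     // pin network conditions
], { detached: true, stdio: "ignore" });

execSync("adb wait-for-device");
while (execSync("adb shell getprop sys.boot_completed").toString().trim() !== "1") {
  execSync("sleep 2"); // poll until the OS finishes booting
}
```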

On iOS, we use Apple Configurator to wipe the device and re-provision user data, installed apps, and configuration.
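
The rough shape of that step, using Apple Configurator’s cfgutil command line (the ECID and build path are placeholders, and exact subcommands vary by Apple Configurator version):

```ts
// Sketch: erasing and re-provisioning a USB-connected iOS device with cfgutil
// (Apple Configurator's CLI). ECID and .ipa path are placeholders.
import { execSync } from "node:child_process";

const ECID = "0x000DEADBEEF"; // device identifier, as reported by `cfgutil list`

// Wipe user data, installed apps, and settings back to a known baseline.
execSync(`cfgutil --ecid ${ECID} erase`, { stdio: "inherit" });

// Re-provision: install the signed build this run expects.
execSync(`cfgutil --ecid ${ECID} install-app ./builds/app-under-test.ipa`, { stdio: "inherit" });
```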

Controlling app state: stored flags, cached logic, and persistent sessions

Application state refers to any data the app stores locally that affects its behavior when launched. This state persists between test runs unless the test infrastructure intentionally removes it. Uninstalling the app isn’t always enough—some of this state lives in storage that the system doesn’t clear by default.

These retained values can quietly shift app behavior:

  • A stored onboarding flag might cause the app to bypass first-time setup.
  • A saved session token might trigger an automatic login.
  • A cached feature flag might activate a UI variant that the test doesn’t cover.

Most test frameworks don’t reset secure storage, local files, or sandboxed preferences by default. That means the app might launch with leftover session tokens, feature flags, or dismissed popups from a previous run. It can skip onboarding, auto-log in, or load the wrong config before your test even begins. When that happens, you’ve got a test that didn’t set up the state properly and is stuck with the consequences.

Sometimes, that leads to false positives: the test fails because the app skipped some logic it should have run. Sometimes it’s a false negative: the test passes even though the app state is wrong. Either way, you can’t trust the result. Tests need isolation. Cleanup has to happen before the app launches—once it starts up in a stale state, it’s already too late to undo.
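
With Appium, for example, that kind of pre-launch cleanup is something you opt into explicitly when the session is created. A minimal sketch, assuming a WebdriverIO client (the server address, build path, and IDs are placeholders):

```ts
// Sketch: forcing a clean install and pinned locale before the app ever launches.
import { remote } from "webdriverio";

const driver = await remote({
  hostname: "127.0.0.1",
  port: 4723,
  capabilities: {
    platformName: "Android",
    "appium:automationName": "UiAutomator2",
    "appium:app": "./builds/app-under-test.apk", // placeholder build path
    // fullReset uninstalls the app and its data before (and after) the session,
    // so no stored flags, tokens, or cached onboarding state survives into this run.
    "appium:fullReset": true,
    "appium:noReset": false,
    // Pin language/locale so leftover input settings can't leak in.
    "appium:language": "en",
    "appium:locale": "US",
  },
});
```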

How to recognize app state issues

You’re likely seeing retained app state when:

  • A new test account launches into a logged-in or partially completed session instead of a first-time user experience.
  • The app skips onboarding screens or routes to a different home screen based on data that the test didn’t set.
  • UI elements reflect prior preferences, like saved filters, toggled settings, or dismissed dialogs.
  • Feature flags or A/B variants persist across runs, even when the test intends to reset them.
  • A test that fails repeatedly starts passing only after manually wiping app data or reinstalling with storage-clearing tools.

These behaviors usually trace back to app-owned state—data stored in local preferences, caches, or secure storage—that persists between runs and overrides the conditions the test expects.

How QA Wolf wipes the app state between tests

At QA Wolf, we eliminate retained app state before every test run. That means removing all storage that the app can access, not just superficial caches.

On Android, since we boot a fresh emulator every time, there's no retained app data—no sandbox files, no stored sessions, no permission history, and no cached onboarding state. That means the test controls every part of the app’s state from the first launch screen forward.

On iOS, we uninstall the app between tests using a scripted process, not just by reinstalling over an existing binary. Uninstalling ensures all app container data is removed. During device provisioning, we clear NSUserDefaults and on-device caches, remove Keychain entries, and reset any test-related environment variables.
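
For teams running on simulators instead of real devices, the equivalent wipe can be scripted directly with xcrun simctl. A rough sketch (the bundle ID is a placeholder; the privacy and keychain subcommands require a reasonably recent Xcode):

```ts
// Sketch: the simulator analog of a full app-state wipe, via xcrun simctl.
import { execSync } from "node:child_process";

const BUNDLE_ID = "com.example.app"; // placeholder
const simctl = (cmd: string) => execSync(`xcrun simctl ${cmd}`, { stdio: "inherit" });

simctl(`uninstall booted ${BUNDLE_ID}`);         // removes the app container and its data
simctl(`privacy booted reset all ${BUNDLE_ID}`); // clears permission grants for the app
simctl("keychain booted reset");                 // wipes the simulator keychain
```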

Controlling data state: Backend seeding, sync timing, and UI readiness

Most mobile apps don’t fetch fresh data on launch—they wait for a user action like opening a view, pulling to refresh, or backgrounding and resuming. Some even delay fetches on purpose to improve cold start performance or reduce server load.

But test code doesn’t wait. It seeds backend data and immediately checks the UI, expecting the app to reflect the update. The app, meanwhile, is still showing cached content (or nothing at all). The backend is right. The test is right. But the app hasn’t caught up. It’s not a bug, it’s a timing problem. And unless your infrastructure coordinates the sync, your test will fail for the wrong reason.

How to recognize that you have unsynced data

You’re likely dealing with unsynced data when:

  • The test creates a new item (such as an order or message), but the UI displays an empty list.
  • The system enables a feature flag for the test user, but the app still renders the default layout.
  • The app doesn’t display seeded data until the user refreshes, navigates, or restarts the session.
  • The first test run fails, but the rerun passes, even though nothing changed.
  • The test only passes when it explicitly triggers a sync action.

How QA Wolf handles data sync and timing

We don’t leave sync behavior to test authors or rely on timing guesses. We control exactly when and how the app fetches test data, and we pause the test until it does.

We seed data through the same APIs the app uses in production. Each test uses its own scoped test data—seeded specifically for that run—so no other test can modify or rely on it. When the app needs to sync, our infrastructure triggers the process by navigating to the correct screen, simulating a refresh gesture, or backgrounding and resuming the app.

We wait for a clear signal that the sync has completed, such as a loading spinner disappearing or a specific record appearing, before running any assertions. We don’t guess. We script the interaction, trigger the sync, and verify the result.
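
Reduced to a sketch, that flow looks something like this. The endpoint, selectors, and driver session are illustrative, not our actual harness:

```ts
// Sketch: seed through the backend API, force the app's sync path, and wait
// for a deterministic UI signal before asserting. Assumes an existing
// WebdriverIO `driver` session; endpoint and selectors are placeholders.
const order = await fetch("https://api.example.test/v1/orders", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({ userId: "e2e-user-123", sku: "WIDGET-1" }),
}).then((r) => r.json());

// Trigger the fetch the app would otherwise defer: background and resume it.
await driver.background(2);

// Don't assert until the seeded record is actually on screen.
const row = driver.$(`~order-${order.id}`); // accessibility-id selector
await row.waitForDisplayed({ timeout: 15_000 });
```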

Why you’re having trouble with device farms

So far, we’ve focused on what mobile tests need to be stable: clean device state, reset app state, and synced backend data. But even if you get all three right, you’ll still hit flakes if you’re running tests on infrastructure you don’t control, like a public device farm.

Device farms don’t give you more control—they take it away. They hide the details, share devices across runs, and skip setup steps you actually need. You don’t control the environment, and you can’t fix what you don’t own. If your mobile tests fail in a device farm, it’s probably not your test—it’s the setup. The environment just isn’t built for consistency.

Here’s what gets in the way:

  • Devices aren’t truly clean: You don’t get to wipe them. You don’t always control OS version, language, or animation settings. One test’s leftover state becomes another test’s flake.
  • Apps don’t reset completely: Farms often reinstall apps but don’t clear secure storage or cached files. You don’t control the teardown, so local state bleeds across runs.
  • Data and test execution are out of sync: The test infrastructure seeds the backend, but the farm fails to trigger the app to sync or wait for confirmation.
  • You don’t own the timing: Shared infrastructure introduces contention. Two tests may compete for device time. Background tasks or slow teardown sequences overlap. Your test doesn’t get to run when the environment is ready—it runs when the scheduler decides.

Even with a clean device and solid test code, flaky tests happen when you don’t control the setup. Device farms hide what’s going on under the hood. You don’t know what ran before, you can’t control timing, and you can’t guarantee the right order. When something breaks, you’re stuck debugging someone else’s system.

How QA Wolf handles orchestration and control

QA Wolf doesn’t lease test infrastructure from someone else. We configure and manage it ourselves, so we control what runs, when it runs, and what state it starts in.

iOS tests run on USB-connected real devices with fixed OS versions. That gives us complete control over the system image, hardware behavior, and device lifecycle. We install, configure, run, and wipe each test locally—nothing shared, nothing virtualized, nothing left behind. It’s the only way to reliably suppress system prompts, apply entitlement-specific builds, and eliminate mid-test interruptions.

Android tests run on warm emulators that get reset between runs, with pre-applied settings for locale, network, animations, and runtime flags. Each test gets its own isolated environment, entirely under our control. No shared state. No resource contention. No unpredictability.

Each test starts from a known baseline: a clean device state, reset app state, and seeded backend data that has already been synced and verified. We don’t inherit test environments—we recreate them, from the OS up.

We don’t rely on external schedulers. Our internal orchestration engine is wired directly into environment readiness. A test doesn’t start until we:

  • Provision the correct OS and hardware.
  • Sign and install the app, and confirm it reaches an idle state.
  • Seed the test data and confirm that the app has pulled it before starting assertions.

We gate every run with readiness checks, not queue availability. This approach removes timing flakiness and cross-test contamination.
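
The shape of that gating logic, reduced to a sketch (the checks and endpoint are illustrative, not our production orchestrator):

```ts
// Sketch: a readiness gate that blocks the test until the OS and seeded data
// both report ready. Helpers and endpoint are illustrative placeholders.
import { execSync } from "node:child_process";

async function waitFor(check: () => Promise<boolean> | boolean, timeoutMs = 60_000) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await check()) return;
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  throw new Error("readiness check timed out; the test never starts");
}

// 1. The OS is fully booted, not merely attached.
await waitFor(() =>
  execSync("adb shell getprop sys.boot_completed").toString().trim() === "1");

// 2. Seeded data is queryable through the same API the app will use.
await waitFor(async () =>
  (await fetch("https://api.example.test/v1/orders/e2e-order-1")).ok);

// Only now does the orchestrator hand the device to the test.
```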

We enforce teardown after every run, regardless of whether it passes or fails. That includes app removal, state reset, and device reconfiguration. No assumptions, no skipped steps. If teardown fails, the next test doesn’t start. Full stop.

We script timing instead of guessing. That means we don’t add sleeps or retries and hope they’re long enough. We wait for deterministic signals in the UI, the logs, or the system layer before continuing. If a screen isn’t ready, the test doesn’t move. That level of synchronization only works when you control every layer of the system, and we do.

When a test fails, we don’t retry until it passes. We trace the failure back to its origin—timing, state, or config—and fix it there. That’s what it takes to run mobile tests at scale without flakiness. Not luck. Not volume. Control.

Mobile tests don’t flake when you control the stack

Sure, sometimes the test is wrong. But on mobile, it’s more likely that your environment is doing something you didn’t ask for. The OS remembers things. The app reads from places you didn’t clear. The data hasn’t synced yet. None of that is your fault. Most teams don’t have the tooling to control those layers because, until now, no one has made it easy.

Most teams try to fix flakiness at the surface—tweaking waits, changing selectors, reworking flows. But stability doesn’t come from patching. It comes from infrastructure built to handle the full stack of mobile state.

QA Wolf gives our QA engineers the tools to do precisely that:

  • The device starts from a known state—clean, consistent, and isolated.
  • The app launches without leftovers—no cached sessions or residual flags.
  • The data arrives on time, seeded, synced, and confirmed before any checks run.

Every test starts from the ground up, with each layer deliberately set and verified.

If you’re still chasing down flakes, rewriting the test won’t fix what the infrastructure broke. You need control over everything the test depends on. We built QA Wolf to give you that control.

