Automated end-to-end testing of mobile apps is harder than testing web apps, but not because selectors are flakier, devices are slower, or gestures are harder to script. All of those things are true, and all of them are easy enough to deal with. The real difficulty with mobile E2E tests is that they run in an environment that was never designed to be tested.
Web browsers evolved with testability in mind. DevTools, headless mode, stateless sessions, inspectable DOMs, deep-linkable URLs, network mocking, scriptable cookies: it’s all automation-friendly by default.
Mobile? You’re testing a black box built for human fingers. There’s no API to tell the app, “Skip onboarding.” There’s no selector you can count on across Android and iOS. Every gesture is a guess, every permission dialog is a landmine, and every OS update is a new source of flakes waiting to happen.
Out of the box, automated mobile app testing is a long list of things you can’t control. So if you want reliable mobile tests, you have to build infrastructure that forces state control at three levels: device, app, and data. And unless your team builds that infrastructure, you’re going to end up with a test suite you spend more time apologizing for than trusting.
Here’s how to guarantee your tests start from a known state and keep the environment stable so tests run the same way every time.
Every mobile test runs on top of a full operating system (Android or iOS), and that system holds onto state: artifacts from previous runs linger in the OS unless you explicitly reset it. Reinstalling the app doesn’t wipe system-level state. Neither does rebooting the test runner.
When testing on mobile, you’ll run into test failures that the app didn’t cause; the problem lives inside the OS. Permission grants carried over from a previous run, system overlays that block input, animation and locale settings, timezone, network conditions: all of it is system-level state that the OS retains between tests. If you don’t reset that state, you get unexpected permission prompts, overlays that swallow taps, and UI elements that load out of sequence. The result is flakiness: failures unrelated to your code that still break your tests and pollute your signal.
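To make that concrete, here is a minimal sketch of inspecting and resetting one slice of OS-level state, runtime permission grants, on Android. It assumes adb is on PATH with a single connected device; the package name and permission list are placeholders, and the script illustrates the general technique rather than any particular team’s pipeline.

```typescript
// reset-permission-state.ts
// Minimal sketch: clear leftover runtime-permission state for one app via adb.
// Assumes `adb` is on PATH and a single device or emulator is connected.
// The package name and permission list are placeholders.
import { execFileSync } from "node:child_process";

const PACKAGE = "com.example.app"; // hypothetical app under test

const adb = (...args: string[]): string =>
  execFileSync("adb", args, { encoding: "utf8" });

// Runtime permissions the test expects to be ungranted at launch, so the
// permission dialog appears exactly where the flow exercises it.
const permissions = [
  "android.permission.CAMERA",
  "android.permission.ACCESS_FINE_LOCATION",
];

for (const permission of permissions) {
  try {
    // Revoke any grant left over from a previous run (this kills the app process).
    adb("shell", "pm", "revoke", PACKAGE, permission);
  } catch {
    // Best effort: revoking a permission that was never granted can error.
  }
}

// Clear per-app appops overrides ("only this time", background restrictions).
adb("shell", "appops", "reset", PACKAGE);

// Sanity check: fail fast if anything is still granted from a previous run.
const dump = adb("shell", "dumpsys", "package", PACKAGE);
const leftovers = permissions.filter((p) => dump.includes(`${p}: granted=true`));
if (leftovers.length > 0) {
  throw new Error(`Leftover permission grants: ${leftovers.join(", ")}`);
}
```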
The telltale sign that you’re dealing with leftover device state: the failures disappear when you start from a freshly erased device, which confirms that the root cause lives in the system layer.
QA Wolf treats the mobile device as part of the test environment, not just a container for running the app. We rebuild that environment from scratch before every test run to eliminate OS-level state as a variable.
On Android, we sidestep the whole issue by destroying the emulator after each test run and spinning up a new one with controlled boot parameters for locale, timezone, and network conditions.
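As an illustration of what a per-run emulator rebuild can look like, here is a rough sketch using the Android SDK command-line tools (avdmanager, emulator, adb). The AVD name, system image, and boot parameters are placeholder choices, not a prescription.

```typescript
// fresh-emulator.ts
// Minimal sketch of a per-run emulator rebuild. Assumes the Android SDK
// command-line tools (avdmanager, emulator, adb) are on PATH; the AVD name,
// system image, and boot parameters are placeholders.
import { execFileSync, spawn } from "node:child_process";

const AVD = "qa-run";
const IMAGE = "system-images;android-34;google_apis;x86_64";

// 1. Delete any AVD left over from a previous run, then create a fresh one.
try {
  execFileSync("avdmanager", ["delete", "avd", "-n", AVD]);
} catch {
  // No old AVD to delete.
}
execFileSync("avdmanager", ["create", "avd", "-n", AVD, "-k", IMAGE, "--force"], {
  input: "no\n", // decline the custom hardware profile prompt
});

// 2. Boot with controlled parameters: no snapshots, wiped user data, a fixed
//    timezone, explicit network shaping, and (on stock images) a boot-time
//    locale property.
spawn(
  "emulator",
  [
    "-avd", AVD,
    "-no-snapshot",
    "-wipe-data",
    "-no-window",
    "-timezone", "Etc/UTC",
    "-netdelay", "none",
    "-netspeed", "full",
    "-prop", "persist.sys.locale=en-US",
  ],
  { stdio: "ignore", detached: true },
).unref();

// 3. Block until the OS reports boot completion before any test starts.
execFileSync("adb", ["wait-for-device"]);
let booted = "";
while (booted.trim() !== "1") {
  // Poll every couple of seconds via the device shell itself.
  booted = execFileSync("adb", ["shell", "sleep 2; getprop sys.boot_completed"], {
    encoding: "utf8",
  });
}
```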
On iOS, we use Apple Configurator to wipe and re-provision user data, installed apps, and configuration.
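For a sense of what a scripted wipe-and-reprovision step can look like, here is a hedged sketch built around Apple Configurator’s cfgutil command-line tool. It assumes cfgutil is installed via Apple Configurator’s automation tools and that the device is supervised; the ECID, build path, and exact flags are placeholders to adapt to your own setup.

```typescript
// provision-ios-device.ts
// Minimal sketch of wiping and re-provisioning a USB-connected iPhone with
// Apple Configurator's command-line tool. Assumes `cfgutil` is installed
// (Apple Configurator > Install Automation Tools) and the device is already
// supervised; the ECID and .ipa path are placeholders.
import { execFileSync } from "node:child_process";

const ECID = "0x1234567890ABCD";        // hypothetical device identifier
const APP = "/builds/MyApp-test.ipa";   // hypothetical test build

const cfgutil = (...args: string[]): string =>
  execFileSync("cfgutil", ["--ecid", ECID, ...args], { encoding: "utf8" });

// 1. Erase all content and settings so nothing from the previous run survives.
cfgutil("erase");

// 2. Wait for the erased device to re-enumerate over USB.
//    (The list output format varies; matching on the ECID is illustrative.)
while (!execFileSync("cfgutil", ["list"], { encoding: "utf8" }).includes(ECID)) {
  execFileSync("sleep", ["5"]);
}

// 3. Re-install the app under test. A real runner would also re-apply
//    profiles and skip Setup Assistant at this point.
cfgutil("install-app", APP);
```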
Application state refers to any data the app stores locally that affects its behavior when launched. This state persists between test runs unless the test infrastructure intentionally removes it. Uninstalling the app isn’t always enough—some of this state lives in storage that the system doesn’t clear by default.
Retained values like session tokens, feature flags, onboarding markers, and cached configuration can quietly shift app behavior.
Most test frameworks don’t reset secure storage, local files, or sandboxed preferences by default. That means the app might launch with leftover session tokens, feature flags, or dismissed popups from a previous run. It can skip onboarding, auto-log in, or load the wrong config before your test even begins. When that happens, you’ve got a test that didn’t set up the state properly and is stuck with the consequences.
Sometimes, that leads to false positives: the test fails because the app skipped some logic it should have run. Sometimes it’s a false negative: the test passes even though the app state is wrong. Either way, you can’t trust the result. Tests need isolation. Cleanup has to happen before the app launches—once it starts up in a stale state, it’s already too late to undo.
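As one illustration of what “cleanup before the app launches” can mean on Android when the device itself isn’t being rebuilt, here is a minimal sketch using adb. The package name is a placeholder, and pm clear is a generic technique rather than a description of any specific pipeline.

```typescript
// clear-app-state.ts
// Minimal sketch of clearing app-owned state before launch on Android, for
// setups that reuse a device instead of rebuilding it. Assumes `adb` is on
// PATH; the package name is a placeholder.
import { execFileSync } from "node:child_process";

const PACKAGE = "com.example.app"; // hypothetical app under test

const adb = (...args: string[]) => execFileSync("adb", args, { encoding: "utf8" });

// Stop the app first so nothing rewrites state mid-cleanup.
adb("shell", "am", "force-stop", PACKAGE);

// `pm clear` wipes the app sandbox: shared preferences, databases, internal
// files, and caches, and resets its runtime permission grants.
adb("shell", "pm", "clear", PACKAGE);

// App-specific external storage isn't covered by `pm clear`; remove it too.
// (On recent Android versions this path can require `adb root` on emulators.)
adb("shell", "rm", "-rf", `/sdcard/Android/data/${PACKAGE}`);
```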
You’re likely seeing retained app state when the app skips onboarding it should have shown, auto-logs in with a session the test never created, or loads configuration the test didn’t set. These behaviors usually trace back to app-owned state (data stored in local preferences, caches, or secure storage) that persists between runs and overrides the conditions the test expects.
At QA Wolf, we eliminate retained app state before every test run. That means removing all storage that the app can access, not just superficial caches.
On Android, since we boot a fresh emulator every time, there's no retained app data—no sandbox files, no stored sessions, no permission history, and no cached onboarding state. That means the test controls every part of the app’s state from the first launch screen forward.
On iOS, we uninstall the app between tests using a scripted process, not just by reinstalling over an existing binary. Uninstalling ensures that all of the app’s container data is removed. During device provisioning, we clear NSUserDefaults and on-device caches, remove Keychain entries, and reset any test-related environment variables.
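The shape of a scripted uninstall/reinstall hook, sketched here for a WebdriverIO + Appium (XCUITest) setup with a placeholder bundle id and build path, looks something like this:

```typescript
// reset-app.ts
// Minimal sketch of a scripted per-test uninstall/reinstall, written for a
// WebdriverIO + Appium (XCUITest) setup. The bundle id and build path are
// placeholders; Keychain cleanup still has to happen at the device-
// provisioning layer, as described above.
import { driver } from "@wdio/globals";

const BUNDLE_ID = "com.example.myapp";        // hypothetical
const BUILD_PATH = "/builds/MyApp-test.app";  // hypothetical test build

export async function resetAppState(): Promise<void> {
  // Kill the app if it's running, then remove it entirely. Removing the app
  // deletes its container: NSUserDefaults, Documents, Caches, tmp.
  if (await driver.isAppInstalled(BUNDLE_ID)) {
    await driver.terminateApp(BUNDLE_ID);
    await driver.removeApp(BUNDLE_ID);
  }

  // Install a fresh copy and launch it, so the first screen the test sees is
  // the app's true first-run state, not whatever the last test left behind.
  await driver.installApp(BUILD_PATH);
  await driver.activateApp(BUNDLE_ID);
}
```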
Most mobile apps don’t fetch fresh data on launch—they wait for a user action like opening a view, pulling to refresh, or backgrounding and resuming. Some even delay fetches on purpose to improve cold start performance or reduce server load.
But test code doesn’t wait. It seeds backend data and immediately checks the UI, expecting the app to reflect the update. The app, meanwhile, is still showing cached content (or nothing at all). The backend is right. The test is right. But the app hasn’t caught up. It’s not a bug, it’s a timing problem. And unless your infrastructure coordinates the sync, your test will fail for the wrong reason.
You’re likely dealing with unsynced data when an assertion fails against cached or empty content even though the backend already holds the records the test just seeded.
We don’t leave sync behavior to test authors or rely on timing guesses. We control exactly when and how the app fetches test data, and we pause the test until it does.
We seed data through the same APIs the app uses in production. Each test uses its own scoped test data—seeded specifically for that run—so no other test can modify or rely on it. When the app needs to sync, our infrastructure triggers the process by navigating to the correct screen, simulating a refresh gesture, or by backgrounding and resuming the app.
We wait for a clear signal that the sync has completed, such as a loading spinner disappearing or a specific record appearing, before running any assertions. We don’t guess. We script the interaction, trigger the sync, and verify the result.
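The seed, trigger, verify pattern reduces to something like the sketch below, written for WebdriverIO + Appium. The API endpoint, payload, and accessibility ids are placeholders; the point is that the test triggers the sync itself and waits on explicit signals rather than timers.

```typescript
// seeded-sync.spec.ts
// Minimal sketch of the seed -> trigger -> verify pattern, written for
// WebdriverIO + Appium. The API endpoint, payload, and accessibility ids are
// placeholders.
import { driver, $, expect } from "@wdio/globals";

it("shows a freshly seeded order", async () => {
  // 1. Seed scoped test data through the same API the app uses in production.
  const res = await fetch("https://api.example.test/orders", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ sku: "TEST-SKU-123", runId: process.env.RUN_ID }),
  });
  const { id } = (await res.json()) as { id: string };

  // 2. Trigger the sync explicitly instead of hoping the app fetches on its
  //    own: background the app briefly so it refreshes on resume.
  await driver.background(2);

  // 3. Wait for deterministic signals before asserting: the loading spinner
  //    is gone and the seeded record is actually on screen.
  await $("~loading-spinner").waitForDisplayed({ reverse: true, timeout: 15000 });
  const row = $(`~order-${id}`);
  await row.waitForDisplayed({ timeout: 15000 });
  await expect(row).toBeDisplayed();
});
```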
So far, we’ve focused on what mobile tests need to be stable: clean device state, reset app state, and synced backend data. But even if you get all three right, you’ll still hit a flake if you’re running tests on infrastructure you don’t control, like a public device farm.
Device farms don’t give you more control—they take it away. They hide the details, share devices across runs, and skip setup steps you actually need. You don’t control the environment, and you can’t fix what you don’t own. If your mobile tests fail in a device farm, it’s probably not your test—it’s the setup. The environment just isn’t built for consistency.
Here’s what gets in the way: devices shared across runs, configuration you can’t inspect or change, queues you don’t control, and setup steps the farm skips on your behalf.
Even with a clean device and solid test code, flaky tests happen when you don’t control the setup. Device farms hide what’s going on under the hood. You don’t know what ran before, you can’t control timing, and you can’t guarantee the right order. When something breaks, you’re stuck debugging someone else’s system.
QA Wolf doesn’t lease test infrastructure from someone else. We configure and manage it ourselves, so we control what runs, when it runs, and what state it starts in.
iOS tests run on USB-connected real devices with fixed OS versions. That gives us complete control over the system image, hardware behavior, and device lifecycle. We install, configure, run, and wipe each test locally—nothing shared, nothing virtualized, nothing left behind. It’s the only way to reliably suppress system prompts, apply entitlement-specific builds, and eliminate mid-test interruptions.
Android tests run on emulators that are rebuilt between runs, with pre-applied settings for locale, network, animations, and runtime flags. Each test gets its own isolated environment, entirely under our control. No shared state. No resource contention. No unpredictability.
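A few of those pre-applied settings, sketched with plain adb calls, look like this; zeroing the animation scales in particular removes a whole class of timing guesses.

```typescript
// emulator-settings.ts
// Minimal sketch of pre-applied device settings: animations off and the
// screen kept awake. Assumes `adb` is on PATH and a single emulator.
import { execFileSync } from "node:child_process";

const adb = (...args: string[]) => execFileSync("adb", args, { encoding: "utf8" });

// Zero out all animation scales so waits key off real UI state rather than
// transition timing.
adb("shell", "settings", "put", "global", "window_animation_scale", "0");
adb("shell", "settings", "put", "global", "transition_animation_scale", "0");
adb("shell", "settings", "put", "global", "animator_duration_scale", "0");

// Keep the screen on while powered so a lock screen never interrupts a test.
adb("shell", "svc", "power", "stayon", "true");
```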
Each test starts from a known baseline: a clean device state, reset app state, and seeded backend data that has already been synced and verified. We don’t inherit test environments—we recreate them, from the OS up.
We don’t rely on external schedulers. Our internal orchestration engine is wired directly into environment readiness: a test doesn’t start until the device is provisioned, the app is installed in a known state, and the test data has been seeded and confirmed synced.
We gate every run with readiness checks, not queue availability. This approach removes timing flakiness and cross-test contamination.
We enforce teardown after every run, regardless of whether it passes or fails. That includes app removal, state reset, and device reconfiguration. No assumptions, no skipped steps. If teardown fails, the next test doesn’t start. Full stop.
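In code, readiness gating plus enforced teardown boils down to a pattern like the sketch below; the prepare, isReady, and teardown callbacks are hypothetical stand-ins for the provisioning, seeding, and reset steps described above.

```typescript
// run-isolated.ts
// Minimal sketch of readiness-gated execution with enforced teardown. The
// prepare/isReady/teardown callbacks are hypothetical stand-ins for the
// provisioning, seeding, and reset steps described above.
type Env = { deviceId: string; runId: string };

export async function runIsolated(
  prepare: () => Promise<Env>,              // provision device, install app, seed data
  isReady: (env: Env) => Promise<boolean>,  // booted, app installed, data synced
  test: (env: Env) => Promise<void>,
  teardown: (env: Env) => Promise<void>,    // remove app, reset state, reconfigure device
): Promise<void> {
  const env = await prepare();

  // Gate on environment readiness, not on a free slot in a queue.
  if (!(await isReady(env))) {
    throw new Error(`environment ${env.deviceId} not ready; refusing to start`);
  }

  try {
    await test(env);
  } finally {
    // Teardown runs whether the test passed or failed. If it throws, the
    // error propagates and the device is not handed to the next test, so a
    // dirty environment is never inherited.
    await teardown(env);
  }
}
```

A runner wires real implementations into those callbacks; the structure is what matters: no test body executes before readiness is confirmed, and no device is reused after a failed teardown.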
We script timing instead of guessing. That means we don’t sprinkle in sleeps and retries and hope they’re enough. We wait for deterministic signals in the UI, the logs, or the system layer before continuing. If a screen isn’t ready, the test doesn’t move. That level of synchronization only works when you control every layer of the system, and we do.
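Here is the difference in miniature, sketched with WebdriverIO + Appium on Android; the selectors and activity name are placeholders.

```typescript
// deterministic-wait.ts
// Minimal sketch of replacing a guessed sleep with explicit readiness
// signals, written for WebdriverIO + Appium on Android. The selectors and
// activity name are placeholders.
import { driver, $ } from "@wdio/globals";

export async function openCheckout(): Promise<void> {
  await $("~checkout-button").click();

  // Not this: a guess that is either too short (flaky) or too long (slow).
  // await driver.pause(5000);

  // This: wait for the screen to actually be in the foreground and for a
  // key element to render before moving on.
  await driver.waitUntil(
    async () => (await driver.getCurrentActivity()).endsWith(".CheckoutActivity"),
    { timeout: 15000, timeoutMsg: "checkout screen never became foreground" },
  );
  await $("~checkout-total").waitForDisplayed({ timeout: 15000 });
}
```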
When a test fails, we don’t retry until it passes. We trace the failure back to its origin—timing, state, and config—and fix it there. That’s what it takes to run mobile tests at scale without flakiness. Not luck. Not volume. Control.
Mobile tests don’t flake when you control the stack
Sure, sometimes the test is wrong. But on mobile, it’s more likely your environment is doing something you didn’t ask for. The OS remembers things. The app reads from places you didn’t clear. The data hasn’t synced yet. None of that is your fault. Most teams don’t have the tooling to control those layers because until now, no one has made it easy.
Most teams try to stabilize flake at the surface—tweaking waits, changing selectors, reworking flows. But stability doesn’t come from patching. It comes from infrastructure built to handle the full stack of mobile state.
QA Wolf gives our QA engineers the tools to do precisely that: rebuild the device environment from scratch, reset app state before launch, seed and sync test data through production APIs, and orchestrate every run on infrastructure we own.
Every test starts from the ground up, with each layer deliberately set and verified.
If you’re still chasing down flakes, rewriting the test won’t fix what the infrastructure broke. You need control over everything the test depends on. We built QA Wolf to give you that control.