- To execute hundreds of thousands of concurrent Android E2E tests across different devices and OS configurations, you need to solve two related problems: providing access to the devices, and absorbing huge fluctuations in load.
- Pre-provisioning environments for every device and OS combination creates an artificial limit on how many options you can support.
- QA Wolf treats OS version and device type as independent inputs, so any combination resolves at runtime with no pre-built environments required.
- At this scale, cloud infrastructure providers fall short; only a custom-built on-prem solution provided the necessary control and performance.
The Android ecosystem is deeply fragmented. At any given time, six or more major OS versions each hold meaningful market share, running across hundreds of manufacturers and thousands of distinct hardware configurations. The OS version determines how your app interacts with system services, permissions, and background processes. The hardware defines how it performs: CPU speed, GPU capability, and available memory set the performance ceiling. Your users are distributed across all of it, and your E2E test coverage has to follow them.
If your team runs automated E2E tests on Android, you've probably used emulators. Emulators run a full Android OS and respond to the same inputs a real device would. Provisioning one locally is usually straightforward. Provisioning hundreds or thousands of them across every device and OS combination your users run on, and maintaining consistent performance when users request them on demand, is an entirely different problem.
Building in-house infrastructure that supports any permutation of OS and device a customer needs, at scale and without performance degradation, meant solving two fundamental problems:
- Device access. No two apps are the same, and no two user bases are the same. Customers need to test on the device/OS combinations that reflect their users, not just what happens to be available.
- Demand. Developers trigger test suites at unpredictable times: sometimes one test at a time, sometimes many simultaneously. Each session consumes varying amounts of CPU, GPU, and memory. Our infrastructure had to absorb that load so tests execute quickly and reliably whenever a team is ready to release.
These are the problems our system was built to solve.
Pre-provisioning environments for every combination is the obvious solution—and the wrong one
Because most mobile testing frameworks require a full device configuration upfront, QA teams typically use a pre-provisioning strategy to solve device access: define a target device and OS version, create an Android Virtual Device (AVD) for that combination, and point the framework at it. Each combination in the matrix gets its own AVD.
That system works, but only as long as the matrix stays small. Every new combination adds another environment to build and run, and every new test executes across every entry in the matrix. The resource cost of running all those environments concurrently sets a ceiling on coverage. This forces teams to limit their test coverage to what they can afford to manage.
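The arithmetic behind that ceiling is worth making explicit. A hypothetical sketch (the function names and figures are ours, for illustration):

```typescript
// Each pre-provisioned environment is one (device, OS) pair,
// and every test runs against every environment in the matrix.
function environmentCount(deviceTypes: number, osVersions: number): number {
  return deviceTypes * osVersions;
}

function totalTestRuns(tests: number, deviceTypes: number, osVersions: number): number {
  return tests * environmentCount(deviceTypes, osVersions);
}

// 20 device types x 6 OS versions = 120 environments to build and maintain;
// a 500-test suite then executes 60,000 times per full pass.
console.log(environmentCount(20, 6));   // 120
console.log(totalTestRuns(500, 20, 6)); // 60000
```

Both numbers grow multiplicatively, which is why the matrix gets pruned long before coverage is complete.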
Ideally, a new combination means a new configuration
The approach we took follows the principle of separation of concerns: keep what something is separate from how it runs. Android Studio already applies this principle—the OS image and the device profile are two separate inputs that the emulator combines at runtime. For example, a Pixel 9 and a Pixel Tablet running Android 14 share the same OS image—they differ only in device configuration.
Our system applies the same model, but executes it remotely across shared infrastructure. The test specifies an OS version and a device type. The system resolves them into a running emulator instance at runtime—no pre-built environment required. Each instance runs in its own container with dedicated CPU, GPU, and memory. Sessions don't share resources and don't interfere with each other. A resource-intensive environment doesn't block lighter ones from running alongside it.
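That resolution step can be sketched as follows. The type names, resource numbers, and `resolveSession` function are hypothetical; the post doesn't describe the platform's internals:

```typescript
// OS version and device type are independent inputs; any pairing is
// resolved into a session spec at request time, with no pre-built AVD.
type OsVersion = "13" | "14" | "15";
type DeviceType = "Pixel 9" | "Pixel Tablet" | "Galaxy S24";

interface SessionSpec {
  osImage: string;           // which Android system image to boot
  deviceProfile: DeviceType; // hardware profile the emulator applies
  cpu: number;               // dedicated cores for this container
  memoryGb: number;          // dedicated memory for this container
}

function resolveSession(os: OsVersion, device: DeviceType): SessionSpec {
  // Heavier profiles get more dedicated resources; sessions never share.
  const isTablet = device === "Pixel Tablet";
  return {
    osImage: `android-${os}`,
    deviceProfile: device,
    cpu: isTablet ? 4 : 2,
    memoryGb: isTablet ? 8 : 4,
  };
}

const spec = resolveSession("14", "Pixel Tablet");
console.log(spec.osImage); // "android-14"
```

Because the OS image and device profile are combined only at this point, adding a new combination is a matter of accepting a new input pair, not building a new environment.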
That separation makes a device and OS combination easy to specify in a test. Here's what the Appium code looks like on our platform:
```typescript
import { flow } from "@qawolf/flows/android";

export default flow(
  "FlowNameHere",
  "Android - Tablet(Android 14)",
  async ({ test, wdio, ...testContext }) => {
    const driver = wdio.startAndroid({
      "appium:browserName": "Chrome",
      // ...
    });
  },
);
```
Since adopting this model, we've expanded from supporting a single device type and OS version to supporting any combination a customer needs.
Moving on-premises gave the model what it needed to perform
QA Wolf built the initial infrastructure on Google Cloud. Since emulators are virtual machines, running them on GCP meant running a VM inside GCP's own virtualization layer. That layer is proprietary, so when sessions degraded, the team had limited ability to inspect what was happening or optimize performance at the hardware level. We determined that an on-prem solution, where we owned the hardware, would be necessary to achieve the level of control the system required.
Moving to dedicated racks in a colocation facility and running on bare metal gave the team direct access to physical hardware and eliminated the cloud virtualization layers. The performance improvements were immediate and concrete. On GCP, every session required pulling a fresh container image—at roughly 120,000 runs per day, that image pull traffic added measurable startup latency and consumed significant network bandwidth. On-prem, images stay on the node. Sessions start instantly. The bandwidth that was consumed by image pulls is now available for test traffic. When a 2,000-test suite fires, every test starts running immediately rather than waiting for GKE to autoscale and pull images. On GCP, spot instances kept costs down but introduced another failure mode—nodes could be terminated mid-run, interrupting tests unpredictably. On-prem, nodes are always available and never terminate mid-run.
Local NVMe storage replaced GCP's network-attached persistent disks. Emulators are I/O-intensive—every read and write that previously crossed a network hop now hits local storage directly. The latency difference compounds across thousands of concurrent sessions.
As the infrastructure grew and the machines got larger, the team added more GPUs per machine to increase capacity. This created a new problem: more GPUs meant more devices for Nvidia drivers and the OS kernel to track and manage. A kernel responsible for a large number of GPUs accumulates management overhead that slows down every session running on that machine.
The team solved this with a hypervisor. Rather than attaching many GPUs to a single large machine, the hypervisor partitions each physical machine into smaller virtual nodes. Each node runs its own kernel and manages a small number of GPUs, keeping management overhead low. The hypervisor allocates a dedicated portion of the physical machine's GPU to each virtual node. Once allocated, the node accesses that GPU directly. The team can create and destroy nodes as VMs without touching the physical hardware, which makes the infrastructure easier to manage and limits the security surface area each kernel is responsible for.
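The partitioning idea can be sketched like this (illustrative numbers; the post doesn't specify QA Wolf's actual GPU counts per node):

```typescript
// Split a machine's GPUs into virtual nodes so each kernel only
// manages a handful of devices, keeping driver overhead low.
function partitionGpus(totalGpus: number, gpusPerNode: number): number[][] {
  const nodes: number[][] = [];
  for (let gpu = 0; gpu < totalGpus; gpu += gpusPerNode) {
    // Each inner array is the set of GPU indices one virtual node owns.
    nodes.push(
      Array.from(
        { length: Math.min(gpusPerNode, totalGpus - gpu) },
        (_, i) => gpu + i,
      ),
    );
  }
  return nodes;
}

// 16 physical GPUs, 2 per virtual node -> 8 small nodes, each with its own kernel.
console.log(partitionGpus(16, 2).length); // 8
```

The total GPU count per physical machine stays the same; what changes is how many devices any single kernel has to track.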
The hypervisor also increased emulator density per machine. Because pods rarely consume their full CPU and memory allocation, the hypervisor allows the team to overprovision virtual nodes. A machine with 512 CPU cores can be scheduled at 768 cores—fully utilizing resources that would otherwise sit idle.
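The figures above work out to a 1.5x CPU overcommit ratio; a minimal sketch:

```typescript
// Schedulable capacity = physical cores x overcommit ratio.
// This works because emulator pods rarely use their full CPU request at once.
function schedulableCores(physicalCores: number, overcommitRatio: number): number {
  return Math.floor(physicalCores * overcommitRatio);
}

console.log(schedulableCores(512, 1.5)); // 768
```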
The system runs on Kubernetes with custom controllers and scheduling built specifically to handle large numbers of concurrent Android emulator sessions. The scheduling layer allocates sessions across the machine pool based on each session's resource requirements. Each session runs in full isolation with dedicated resources, regardless of what runs alongside it.
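As a toy model, the scheduling decision resembles first-fit allocation over node capacities (the real controllers are custom Kubernetes components the post doesn't detail):

```typescript
interface Node { name: string; freeCpu: number; freeMemoryGb: number; }
interface Session { cpu: number; memoryGb: number; }

// Place a session on the first node with enough free CPU and memory,
// reserving the resources so no two sessions ever share them.
function schedule(session: Session, nodes: Node[]): string | null {
  for (const node of nodes) {
    if (node.freeCpu >= session.cpu && node.freeMemoryGb >= session.memoryGb) {
      node.freeCpu -= session.cpu;
      node.freeMemoryGb -= session.memoryGb;
      return node.name;
    }
  }
  return null; // no capacity: the session waits rather than degrading others
}

const pool: Node[] = [
  { name: "node-a", freeCpu: 4, freeMemoryGb: 8 },
  { name: "node-b", freeCpu: 16, freeMemoryGb: 32 },
];
console.log(schedule({ cpu: 8, memoryGb: 16 }, pool)); // "node-b"
```

A resource-intensive session simply lands on a node with room for it; lighter sessions continue to schedule around it.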
For customers, the difference is immediate: lower latency, faster session startup, sharper device rendering, and more responsive interactions during test sessions. Every layer of this is invisible to the engineer running tests. They specify a device and an OS version. The infrastructure handles the rest.
What is an Android emulator grid?
An Android emulator grid is a system that runs multiple Android emulator sessions simultaneously on shared infrastructure. Each session runs a full Android OS instance against a specified device profile and OS version, allowing teams to execute automated tests across many device and OS combinations concurrently.
Why is Android E2E testing harder than web testing?
Android E2E tests run full OS instances rather than browser sessions. Each session requires dedicated CPU, GPU, and memory, and needs to be fully isolated from every other session running alongside it. That resource demand, combined with Android's device and OS fragmentation, makes reliable execution at volume a harder infrastructure problem than web testing.
Why do teams end up with gaps in their Android test coverage?
Most mobile testing frameworks require a full device configuration upfront, leading teams to pre-provision a separate environment for each combination. As the matrix grows, the system produces more test flakes and false positives. When tests start misbehaving, teams shrink the matrix to restore stability rather than fix the infrastructure—and coverage gaps follow.
What is the right architecture for Android emulator testing?
OS versions and device types are independent inputs. A system that treats them that way resolves any combination at runtime without requiring a pre-built environment for each one. Each session runs in its own isolated container with dedicated resources. New combinations require new configurations, not new infrastructure.
Why don't more teams run Android emulator infrastructure on-premises?
It's extremely difficult to set up and maintain. Cloud providers handle the hard parts—provisioning, scaling, and managing the layer between emulators and physical hardware—at the cost of some overhead. On-premises gives emulators direct access to physical GPUs and full control over resource allocation, but the team takes on all of that operational burden themselves.