    Preventing mobile performance regressions with Maestro

    December 17, 2025
    Previously we have written about how we adopted the React Native New Architecture as one way to boost our performance. Before we dive into how we detect regressions, let’s first explain how we define performance.

    Mobile performance vitals

    In browsers, the Core Web Vitals already provide an industry-standard set of performance metrics, and while they are by no means perfect, they focus on the actual impact on the user experience. We wanted something similar for apps, so we adopted App Render Complete (ARC) and Navigation Total Blocking Time (NTBT) as our two most important metrics.

    • App Render Complete is the time from a cold boot of the app, for an authenticated user, to it being fully loaded and interactive, roughly equivalent to Time To Interactive in the browser.
    • Navigation Total Blocking Time is the time the application is blocked from processing code during the 2-second window after a navigation. It’s a proxy for overall responsiveness in lieu of something better like Interaction to Next Paint.
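As an illustration, Navigation Total Blocking Time can be approximated by summing the portion of each long task above a blocking threshold, for tasks that start inside the 2-second post-navigation window. The sketch below is an assumption about the shape of the computation, not the post's actual implementation; the task interface and the 50 ms threshold are illustrative.

```typescript
// Minimal sketch of a Navigation Total Blocking Time (NTBT) calculation.
// Task shape and thresholds are illustrative assumptions.
interface LongTask {
  start: number;    // ms since the navigation started
  duration: number; // ms
}

const BLOCKING_THRESHOLD_MS = 50; // only the portion above this counts as "blocking"
const NTBT_WINDOW_MS = 2000;      // only tasks starting within 2 s of navigation

function navigationTotalBlockingTime(tasks: LongTask[]): number {
  return tasks
    .filter((t) => t.start < NTBT_WINDOW_MS)
    .reduce((sum, t) => sum + Math.max(0, t.duration - BLOCKING_THRESHOLD_MS), 0);
}
```

A 120 ms task starting at 100 ms contributes 70 ms of blocking time, while a task starting after the 2-second window contributes nothing.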

    We still collect a slew of other metrics – such as render times, bundle sizes, network requests, frozen frames, memory usage etc. – but they are indicators to tell us why something went wrong rather than how our users perceive our apps.

    Their advantage over the more holistic ARC/NTBT metrics is that they are more granular and deterministic. For example, it’s much easier to reliably impact and detect that bundle size increased or that total bandwidth usage decreased, but such changes don’t automatically translate to a noticeable difference for our users.

    Collecting metrics

    In the end, what we care about is how our apps run on our users’ actual physical devices, but we also want to know how an app performs before we ship it. For this we leverage the Performance API (via react-native-performance) that we pipe to Sentry for Real User Monitoring, and in development this is supported out of the box by Rozenite. 

    But we also wanted a reliable way to benchmark and compare two different builds to know whether our optimizations move the needle or new features regress performance. Since Maestro was already used for our End to End test suite, we simply extended that to also collect performance benchmarks in certain key flows.

    To adjust for flukes, we ran the same flow many times on different devices in our CI and calculated statistical significance for each metric. We were now able to compare each Pull Request to our main branch and see how it fared performance-wise. Surely, performance regressions were a thing of the past.
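A significance check along these lines can be sketched with Welch's t-statistic, which tolerates unequal variances between the two builds' samples. The post doesn't say which test was actually used, so treat this as one common choice rather than the team's implementation.

```typescript
// Welch's t-statistic for comparing benchmark samples from two builds.
// Which test the team actually used is not specified; this is one common choice.
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1);
}

function welchT(a: number[], b: number[]): number {
  // Standard error under unequal variances (Welch's formulation)
  const se = Math.sqrt(sampleVariance(a) / a.length + sampleVariance(b) / b.length);
  return (mean(a) - mean(b)) / se;
}
```

The statistic would then be compared against a t-distribution (with Welch–Satterthwaite degrees of freedom) to decide whether the difference between builds is significant.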

    Reality check

    In practice, this didn’t have the outcomes we had hoped for, for a few reasons. First, we saw that the automated benchmarks were mainly used when developers wanted validation that their optimizations had an effect – which in itself is important and highly valuable – but this was typically after we had seen a regression in Real User Monitoring, not before.

    To address this we started running benchmarks between release branches to see how they fared. While this did catch regressions, they were typically hard to address as there was a full week of changes to go through – something our release managers simply weren’t able to do in every instance. Even if they found the cause, simply reverting often wasn’t a possibility.

    On top of that, the App Render Complete metric was network-dependent and non-deterministic, so if the servers had extra load that hour or if a feature flag turned on, it would affect the benchmarks even if the code didn’t change, invalidating the statistical significance calculation.

    Precision, specificity and variance

    We had to go back to the drawing board and reconsider our strategy. We had three major challenges:

    1. Precision: Even if we could detect that a regression had occurred, it was not clear to us what change caused it. 
    2. Specificity: We wanted to detect regressions caused by changes to our mobile codebase. While catching user-impacting regressions in production is crucial regardless of their cause, the opposite is true pre-production, where we want to isolate the mobile code as much as possible.
    3. Variance: For reasons mentioned above, our benchmarks simply weren’t stable enough between each run to confidently say that one build was faster than another. 

    The solution to the precision problem was simple: we just needed to run the benchmarks for every merge, so that we could see on a time series graph when things changed. This was mainly an infrastructure problem, but thanks to optimized pipelines, build processes and caching we were able to cut the total time from merge to benchmarks ready down to about 8 minutes.

    When it comes to specificity, we needed to cut out as many confounding factors as possible, with the backend being the main one. To achieve this we first record the network traffic, and then replay it during the benchmarks, including API requests, feature flags and websocket data. Additionally the runs were spread out across even more devices.
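The record-and-replay idea can be sketched as a keyed store: in record mode, responses are captured; in replay mode, the same key serves the stored response, keeping the live backend out of the loop. This is a toy illustration only; the real setup also covers feature flags and websocket data.

```typescript
// Toy record/replay store for network responses, keyed by method + URL.
// A real setup would also capture feature flags and websocket frames.
type Mode = "record" | "replay";

class NetworkReplayer {
  private store = new Map<string, string>();
  constructor(private mode: Mode) {}

  private key(method: string, url: string): string {
    return `${method} ${url}`;
  }

  // Capture a response body while in record mode.
  record(method: string, url: string, body: string): void {
    if (this.mode === "record") this.store.set(this.key(method, url), body);
  }

  // Serve the previously captured response, or undefined if none was recorded.
  replay(method: string, url: string): string | undefined {
    return this.store.get(this.key(method, url));
  }
}
```

During a benchmark the app's network layer would consult the replayer first, so every run sees identical backend behavior regardless of server load or flag changes.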

    Together, these changes also contributed to solving the variance problem, in part by reducing it, but also by increasing the sample size by orders of magnitude. Just like in production, a single sample never tells the whole story, but by looking at all of them over time it was easy to see trend shifts that we could attribute to a range of 1-5 commits. 

    Alerting 

    As mentioned above, simply having the metrics isn’t enough, as any regression needs to be actioned quickly, so we needed an automated way to alert us. At the same time, if we alerted too often or incorrectly due to inherent variance, it would go ignored.

    After trialing more esoteric models like Bayesian online changepoint detection, we settled on a much simpler moving average. When a metric regresses more than 10% for at least two consecutive runs, we fire an alert.
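The rule above, more than a 10% regression versus a baseline (such as a moving average of prior runs) for at least two consecutive runs, can be sketched as:

```typescript
// Fires when the last `consecutive` runs each exceed the baseline
// (e.g. a moving average of prior runs) by more than `threshold`.
function shouldAlert(
  baseline: number,
  runs: number[],
  threshold = 0.10,
  consecutive = 2,
): boolean {
  if (runs.length < consecutive) return false;
  return runs
    .slice(-consecutive)
    .every((r) => r > baseline * (1 + threshold));
}
```

Requiring consecutive breaches is what keeps a single noisy run from paging anyone, at the cost of one extra run of detection latency.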

    Next steps

    While detecting and fixing regressions before a release branch is cut is fantastic, the holy grail is to prevent them from getting merged in the first place.

    What’s stopping us at the moment is twofold: on one hand, running this for every commit in every branch requires even more capacity in our pipelines; on the other, we need enough statistical power to tell whether a change had an effect.

    The two are antagonistic: with a fixed budget, covering every commit in every branch means fewer runs per comparison, which reduces statistical power.

    The trick we intend to apply is to spend our resources smarter – since the expected effect size can vary, so can our sample size. Essentially, for changes with a big impact we can do fewer runs, and for changes with a smaller impact we do more runs.
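This maps onto the textbook power-analysis relationship: the required sample size grows with the square of noise over effect size, roughly n = 2((z_α + z_β)·σ/δ)². The sketch below uses that normal-approximation formula with z-values for a two-sided α of 0.05 and 80% power; the post doesn't specify the exact method, so this is an assumption.

```typescript
// Required benchmark runs per build for a two-sample comparison, using the
// textbook normal-approximation power formula: n = 2 * ((zAlpha + zBeta) * sigma / delta)^2.
// Small expected effects (delta) demand many more runs than large ones.
function runsNeeded(
  sigma: number,  // per-run standard deviation of the metric
  delta: number,  // smallest effect we want to detect, in the same units as sigma
  zAlpha = 1.96,  // two-sided alpha = 0.05
  zBeta = 0.84,   // power of roughly 80%
): number {
  return Math.ceil(2 * (((zAlpha + zBeta) * sigma) / delta) ** 2);
}
```

With 100 ms of run-to-run noise, detecting a 50 ms effect needs runsNeeded(100, 50) = 63 runs per build, while a 200 ms effect needs only runsNeeded(100, 200) = 4 – which is exactly why big changes can get away with far fewer runs.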

    Making mobile performance regressions observable and actionable

    By combining Maestro-based benchmarks, tighter control over variance, and pragmatic alerting, we have moved performance regression detection from a reactive exercise to a systematic, near-real-time signal.

    While there is still work to do to stop regressions before they are merged, this approach has already made performance a first-class, continuously monitored concern – helping us ship faster without getting slower.
