- What is “probability to beat baseline”? Is that the same as confidence?
- What is “probability to be best”?
- How do you decide when to call an experiment?
- How do you decide when you have a leader?
- Why aren’t the median conversion rates the same if I just divide one number by the other?
- What are the conversion rate numbers that Optimize shows? What does each one mean?
- Why does Optimize show a range of numbers for “improvement”? Most tools don’t do that.
- Is there a way to know how long my experiment should last?
- What are the benefits of user-centric and session-centric measurement?
- Related resources
What is “probability to beat baseline”? Is that the same as confidence?
No, although you can use it in a similar way. Probability to beat baseline is closer to what most people think, or wish, confidence would provide. Most people have a hard time intuitively understanding confidence (which is 1 minus p-value), while probability to beat baseline is exactly what it sounds like: the probability that a variant will perform better than the original. You can wait until probability to beat baseline reaches 95%, or deploy earlier if you’re willing to accept a higher risk of being incorrect. Read more about the challenges of interpreting p-values.
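As a rough illustration of the idea, the sketch below estimates a probability to beat baseline by sampling. It assumes a plain Beta-Binomial model with uniform priors, which is a simplification and not the model Optimize actually uses; the function name, parameters, and example counts are all hypothetical.

```python
import random


def prob_to_beat_baseline(base_conv, base_n, var_conv, var_n, draws=20000, seed=0):
    """Estimate P(variant rate > baseline rate) by Monte Carlo sampling.

    Assumption: independent Beta(1, 1) priors on each conversion rate.
    This is an illustrative stand-in, not Optimize's model.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Draw one plausible conversion rate per arm from its posterior.
        base_rate = rng.betavariate(1 + base_conv, 1 + base_n - base_conv)
        var_rate = rng.betavariate(1 + var_conv, 1 + var_n - var_conv)
        if var_rate > base_rate:
            wins += 1
    return wins / draws


# Hypothetical counts: 100/2000 conversions on the original, 130/2000 on the variant.
p = prob_to_beat_baseline(base_conv=100, base_n=2000, var_conv=130, var_n=2000)
print(f"Probability to beat baseline: {p:.1%}")
```

Unlike a p-value, the number printed here can be read directly as "the chance the variant is better", under the stated model.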
What is “probability to be best”?
Probability to be best tells you which variant is likely to be the best performing overall. It is exactly what it sounds like, with no extra interpretation needed! To produce the same result in a frequentist environment, you’d have to do extra legwork, such as Bonferroni corrections (read about multiple comparisons), to ensure you’re not getting inaccurate answers.
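One way to picture this number is as the share of simulated "possible worlds" in which each variant comes out on top. The sketch below assumes a simple Beta-Binomial model with uniform priors (an assumption for illustration, not Optimize's model); the function name and the arm names and counts are hypothetical.

```python
import random


def prob_to_be_best(arms, draws=20000, seed=0):
    """Return {arm name: estimated probability that the arm has the highest rate}.

    arms maps a name to (conversions, trials).
    Assumption: independent Beta(1, 1) priors; not Optimize's actual model.
    """
    rng = random.Random(seed)
    wins = dict.fromkeys(arms, 0)
    for _ in range(draws):
        # Sample one plausible rate per arm, then record which arm won this draw.
        sampled = {
            name: rng.betavariate(1 + conv, 1 + n - conv)
            for name, (conv, n) in arms.items()
        }
        wins[max(sampled, key=sampled.get)] += 1
    return {name: count / draws for name, count in wins.items()}


probs = prob_to_be_best({
    "original": (100, 2000),
    "variant_a": (130, 2000),
    "variant_b": (115, 2000),
})
for name, p in probs.items():
    print(f"{name}: {p:.1%}")
```

Because all arms are compared in the same draw, the probabilities sum to 1 and no separate multiple-comparison correction is applied in this sketch.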
How do you decide when to call an experiment?
Currently, we follow a couple of rules when displaying the status message at the top of the reporting page:
- We check that there is traffic to the experiment each day, to ensure it’s a valid experiment.
- We wait until an experiment has run for two weeks. Why two weeks? With experiments designed to be deployed to regularly-encountered parts of a digital property, that’s generally a good period of time to get a well-rounded sense of your data, including weekdays, weekends, and any other anomalies that might happen from one week to the next. However, two weeks is a minimum, and experiments can run longer. You can also make a call earlier if you think your traffic isn’t likely to change qualitatively, although we don’t recommend doing this.
- We look at a metric called Potential Value Remaining. The statistical term for this is “Regret”, but you can also think of it as Potential Loss or Potential Opportunity Cost. This metric is currently not available in the Optimize user interface, but we may provide it in the future. This metric describes how much your objective metric (e.g., conversion rate or revenue) might still be improved over the current leader. An example of a statement you could make using that metric against a revenue objective is: “There’s a chance that one of your variants could still beat your current leader by $2. Running your experiment longer is likely to help lower the risk of losing that $2.” It generally trends down to 0 as you get more data and more certainty about your results, though changes in how your experiment traffic behaves sometimes cause it to rise. Right now, we call experiment results when we don’t think we can improve your best conversion rate by more than an additional 1%.
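The Potential Value Remaining rule can be sketched as follows. This is only a guess at the shape of the calculation: it assumes a Beta-Binomial model with uniform priors, picks the leader by posterior mean, and summarizes the remaining value by its 95th percentile relative to the leader's rate. None of these choices is confirmed by the text above, and all names are hypothetical.

```python
import random


def potential_value_remaining(arms, draws=20000, seed=0, quantile=0.95):
    """Return (leader name, relative value remaining at the given quantile).

    arms maps a name to (conversions, trials).
    Assumptions: Beta(1, 1) priors, leader chosen by posterior mean, and a
    quantile summary of (best rate - leader rate) / leader rate.
    """
    rng = random.Random(seed)
    leader = max(arms, key=lambda name: (1 + arms[name][0]) / (2 + arms[name][1]))
    remaining = []
    for _ in range(draws):
        sampled = {
            name: rng.betavariate(1 + conv, 1 + n - conv)
            for name, (conv, n) in arms.items()
        }
        best = max(sampled.values())
        # Zero whenever the leader is also the best arm in this draw.
        remaining.append((best - sampled[leader]) / sampled[leader])
    remaining.sort()
    return leader, remaining[int(quantile * (draws - 1))]


leader, value = potential_value_remaining({"original": (10, 200), "variant": (13, 200)})
print(f"Leader: {leader}, potential value remaining: {value:.1%}")
print("Call the experiment" if value <= 0.01 else "Keep running")
```

With small samples the value stays well above the 1% threshold; as data accumulates and the leader wins almost every draw, it shrinks toward 0.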
How do you decide when you have a leader?
If the conditions above are met, then we look for the variant with the highest probability to be best. If that variant’s probability to beat the original is also greater than 95%, we show a leader.
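Expressed as code, the leader rule might look like the sketch below. The function and its inputs are hypothetical: it takes already-computed probabilities as dictionaries and a single flag for the earlier conditions, and it treats "original has the highest probability to be best" as "no leader", which is an assumption.

```python
def pick_leader(prob_best, prob_beat_baseline, conditions_met, threshold=0.95):
    """Return the name of the leading variant, or None if no leader is shown.

    prob_best: {name: probability to be best}, may include the original.
    prob_beat_baseline: {variant name: probability to beat the original}.
    conditions_met: True once the calling conditions (daily traffic, two
    weeks elapsed, low potential value remaining) are all satisfied.
    """
    if not conditions_met or not prob_best:
        return None
    candidate = max(prob_best, key=prob_best.get)
    # The original has no probability to beat itself, so it defaults to 0.
    if prob_beat_baseline.get(candidate, 0.0) > threshold:
        return candidate
    return None


leader = pick_leader(
    prob_best={"original": 0.01, "variant_a": 0.84, "variant_b": 0.15},
    prob_beat_baseline={"variant_a": 0.98, "variant_b": 0.85},
    conditions_met=True,
)
print(leader)
```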
Why aren’t the median conversion rates the same if I just divide one number by the other?
We use advanced models that take into account time, user context, result consistency, and other factors. Simply dividing one number by the other can't account for all of these factors. We do this to best model how your variants are likely to perform in the future, so your results are more likely to remain valid in the long run.
What are the conversion rate numbers that Optimize shows? What does each one mean?
Optimize shows several numbers, all modeled. In particular, we show the range where your actual conversion rate is 95% likely to be contained. Hovering over these numbers in the bottom-most card also shows you the median value and the 50% range. Usually, those bounds narrow over the course of the experiment as more data comes in. As the conversion rate ranges overlap less, you’ll see your probabilities go up for better-performing variants. You can see this progression in the time series chart at the bottom of the reporting page, and the ranges are also shown in the rows of that same bottom card.
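To make the median, 50% range, and 95% range concrete, here is a minimal sketch that reads them off sampled conversion rates. It assumes a Beta-Binomial model with a uniform prior, which is far simpler than the modeling described above; the function name and counts are hypothetical.

```python
import random


def rate_summary(conversions, trials, draws=20000, seed=0):
    """Return the median, 50% range, and 95% range of the modeled rate.

    Assumption: a Beta(1, 1) prior on the conversion rate; an illustrative
    stand-in for Optimize's model, not a reproduction of it.
    """
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + conversions, 1 + trials - conversions)
        for _ in range(draws)
    )

    def quantile(p):
        return samples[int(p * (draws - 1))]

    return {
        "median": quantile(0.5),
        "range_50": (quantile(0.25), quantile(0.75)),
        "range_95": (quantile(0.025), quantile(0.975)),
    }


early = rate_summary(conversions=100, trials=2000)
later = rate_summary(conversions=1000, trials=20000)
for label, summary in (("early", early), ("later", later)):
    low, high = summary["range_95"]
    print(f"{label}: median {summary['median']:.2%}, 95% range {low:.2%} to {high:.2%}")
```

The second summary uses ten times as much data at the same underlying rate, so its ranges come out narrower, which mirrors how the bounds tighten over the course of an experiment.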
Why does Optimize show a range of numbers for “improvement”? Most tools don’t do that.
They should! Those numbers show the range of values possible for each variant’s improvement relative to the original. Every improvement number, regardless of testing method, should have an interval around it. Tools that don't show this oversimplify and don't tell you the whole story. The interval that we show is where we’re 95% sure your true improvement falls, and where it will stay if conditions stay consistent. You can also hover over the numbers on the last card to see the median and 50% interval.
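A range for improvement falls out naturally once each conversion rate is treated as uncertain. The sketch below computes the relative improvement for many sampled pairs of rates and reports its median, 50% interval, and 95% interval. It assumes independent Beta-Binomial models with uniform priors, which is an illustrative simplification; the names and counts are hypothetical.

```python
import random


def improvement_summary(base_conv, base_n, var_conv, var_n, draws=20000, seed=0):
    """Summarize relative improvement (variant - original) / original.

    Assumption: independent Beta(1, 1) priors on both rates; not the
    model Optimize actually uses.
    """
    rng = random.Random(seed)
    lifts = []
    for _ in range(draws):
        base = rng.betavariate(1 + base_conv, 1 + base_n - base_conv)
        var = rng.betavariate(1 + var_conv, 1 + var_n - var_conv)
        lifts.append((var - base) / base)
    lifts.sort()

    def quantile(p):
        return lifts[int(p * (draws - 1))]

    return {
        "median": quantile(0.5),
        "interval_50": (quantile(0.25), quantile(0.75)),
        "interval_95": (quantile(0.025), quantile(0.975)),
    }


summary = improvement_summary(base_conv=100, base_n=2000, var_conv=130, var_n=2000)
low, high = summary["interval_95"]
print(f"Improvement: {low:+.0%} to {high:+.0%} (median {summary['median']:+.0%})")
```

Dividing the raw rates gives a single figure (130/100 is a 30% lift), but the sampled interval shows how wide the plausible range around that figure still is.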
Is there a way to know how long my experiment should last?
Because our approach adapts to the test conditions, different conditions cause different experiment lengths. For example, if your conversion rates are very consistent over time, we’ll find results faster. But if there’s a lot of variability in the rates, it’s likely to take longer as we model out the various influencing factors. Tools that allow for predicting experiment length assume that there’s no variability or dependence on time. This is seldom true in real experiments. Also, we suggest that you run experiments for a minimum of two weeks in order to capture cyclical variations in your traffic, such as weekdays/weekends, and to even out any recency effects or other anomalies.
What are the benefits of user-centric and session-centric measurement?
User-centric measurement lets you see experiment results for people who use your website across many sessions. Session-centric measurement lets you see experiment results for individual sessions, regardless of whether those sessions come from the same person.
User-centric measurement provides you with a more holistic view into user behavior and can provide you with more meaningful experiment results. For example, if you want to measure the number of people you acquire compared to the number of people who make a purchase, you likely want to look at the number of converted users rather than the number of converted sessions to understand the effectiveness of your marketing efforts.
Optimize for Google Analytics 4 provides user-centric measurement, while Optimize for Universal Analytics provides session-centric measurement.
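The difference between the two measurement styles is easy to see on a toy session log. In the sketch below, the record layout (a user ID plus a converted flag per session), the function name, and the data are hypothetical, and a user is counted as converted if any of their sessions converted.

```python
from collections import defaultdict


def conversion_rates(sessions):
    """Return (session-centric rate, user-centric rate).

    sessions: non-empty iterable of (user_id, converted) pairs, one per session.
    """
    sessions = list(sessions)
    converted_by_user = defaultdict(bool)
    for user_id, converted in sessions:
        # A user counts as converted if at least one session converted.
        converted_by_user[user_id] = converted_by_user[user_id] or converted
    session_rate = sum(1 for _, converted in sessions if converted) / len(sessions)
    user_rate = sum(converted_by_user.values()) / len(converted_by_user)
    return session_rate, user_rate


log = [
    ("u1", False), ("u1", False), ("u1", True),  # converts on the third visit
    ("u2", False),                               # never converts
    ("u3", True), ("u3", True),                  # converts twice
]
session_rate, user_rate = conversion_rates(log)
print(f"Session-centric: {session_rate:.1%}, user-centric: {user_rate:.1%}")
```

The same log yields different rates because repeat visits by one person count several times in the session-centric view but only once in the user-centric view.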
This FAQ article is part of a series of FAQ articles on Optimize statistics and methodology. Here are the other FAQs: