Experiment results

What is “probability to beat baseline”? Is that the same as confidence?

No, but you can use it in a similar way, and probability to beat baseline is closer to what most people think, or wish, confidence would tell them. Confidence (1 minus the p-value) is hard to interpret intuitively, while probability to beat baseline is exactly what it sounds like: the probability that a variant will perform better than the original. You can wait until probability to beat baseline reaches 95%, or deploy earlier if you’re willing to accept a higher risk of being incorrect. Read more about the challenges of interpreting p-values.
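To make the idea concrete, here is a minimal sketch of how a probability to beat baseline could be estimated. It assumes a simple Beta-Binomial model with a uniform prior, which is a deliberate simplification and not the model Optimize uses. The function name, parameters, and sample counts are hypothetical.

```python
import random

def prob_to_beat_baseline(base_conv, base_sessions, var_conv, var_sessions,
                          draws=20000, seed=0):
    """Estimate P(variant rate > original rate) by Monte Carlo sampling.

    Assumption: each conversion rate has a Beta(1 + conversions,
    1 + non-conversions) posterior (uniform prior, no time effects).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        base_rate = rng.betavariate(1 + base_conv, 1 + base_sessions - base_conv)
        var_rate = rng.betavariate(1 + var_conv, 1 + var_sessions - var_conv)
        if var_rate > base_rate:
            wins += 1
    return wins / draws

# Hypothetical data: 100/2000 conversions for the original, 130/2000 for the variant.
p = prob_to_beat_baseline(100, 2000, 130, 2000)
print(f"Probability to beat baseline: {p:.3f}")
```

Under this simplified model, the result reads directly as "the share of plausible scenarios in which the variant outperforms the original", which is the interpretation described above.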

What is “probability to be best”?

Probability to be best tells you which variant is likely to perform best overall. It is exactly what it sounds like, with no extra interpretation needed. To get a comparable result in a frequentist environment, you’d have to do extra legwork, such as applying Bonferroni corrections (read about multiple comparisons), to avoid inaccurate answers.
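As an illustration, probability to be best can be sketched by sampling a plausible conversion rate for every variant at once and counting how often each one comes out on top. This again assumes a simple Beta-Binomial model with a uniform prior rather than Optimize's actual model, and the variant names and counts are made up.

```python
import random

def prob_to_be_best(variants, draws=20000, seed=0):
    """variants maps a name to (conversions, sessions).

    Returns, for each name, the fraction of joint posterior draws in which
    that variant had the highest sampled conversion rate.
    """
    rng = random.Random(seed)
    counts = {name: 0 for name in variants}
    for _ in range(draws):
        sampled = {
            name: rng.betavariate(1 + conv, 1 + sessions - conv)
            for name, (conv, sessions) in variants.items()
        }
        counts[max(sampled, key=sampled.get)] += 1
    return {name: count / draws for name, count in counts.items()}

# Hypothetical experiment with an original and two variants.
probs = prob_to_be_best({
    "original": (100, 2000),
    "variant_a": (110, 2000),
    "variant_b": (140, 2000),
})
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

Because all variants are compared in the same draw, the probabilities sum to 1 and no separate multiple-comparison correction is applied in this sketch.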

How do you decide when to call an experiment?

Currently, we follow a couple of rules when displaying the status message at the top of the reporting page:

  • We check that there is traffic to the experiment each day, to ensure it’s a valid experiment.
  • We wait until an experiment has run for two weeks. Why two weeks? For experiments on regularly encountered parts of a digital property, that’s generally long enough to get a well-rounded sense of your data, including weekdays, weekends, and any anomalies that might happen from one week to the next. Two weeks is a minimum, and experiments can run longer. You can also make a call earlier if you think your traffic isn’t likely to change qualitatively, although we don’t recommend doing this.
  • We look at a metric called Potential Value Remaining. The statistical term for this is “Regret”, but you can also think of it as Potential Loss or Potential Opportunity Cost. This metric is currently not available in the Optimize user interface, but we may provide it in the future. This metric describes how much your objective metric (e.g., conversion rate or revenue) might still be improved over the current leader. An example of a statement you could make using that metric against a revenue objective is: “There’s a chance that one of your variants could still beat your current leader by $2. Running your experiment longer is likely to help lower the risk of losing that $2.” It generally trends down to 0 as you get more data and more certainty about your results, though changes in how your experiment traffic behaves sometimes cause it to rise. Right now, we call experiment results when we don’t think we can improve your best conversion rate by more than an additional 1%.
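The Potential Value Remaining idea in the last rule can be sketched as follows. This is one common formulation, assumed here for illustration: take the 95th percentile of how far the best sampled variant exceeds the current leader, relative to the leader. The Beta-Binomial model, the 95th percentile, and all names are assumptions, not a description of Optimize's internal calculation.

```python
import random

def potential_value_remaining(variants, draws=20000, seed=0, quantile=0.95):
    """variants maps a name to (conversions, sessions).

    Returns (leader, value_remaining), where value_remaining is the chosen
    quantile of the relative amount by which some variant might still beat
    the leader, e.g. 0.01 means "up to 1% better than the leader".
    """
    rng = random.Random(seed)
    rows = []
    wins = {name: 0 for name in variants}
    for _ in range(draws):
        row = {
            name: rng.betavariate(1 + conv, 1 + sessions - conv)
            for name, (conv, sessions) in variants.items()
        }
        rows.append(row)
        wins[max(row, key=row.get)] += 1
    # The leader is the variant that was best in the most draws.
    leader = max(wins, key=wins.get)
    shortfalls = sorted(
        (max(row.values()) - row[leader]) / row[leader] for row in rows
    )
    return leader, shortfalls[int(quantile * (draws - 1))]

# Hypothetical early-stage data: results are still uncertain.
leader, remaining = potential_value_remaining(
    {"original": (50, 1000), "variant": (55, 1000)}
)
print(leader, f"{remaining:.1%}")
```

With little data the value is typically well above the 1% threshold mentioned above, and it tends to shrink toward 0 as sessions accumulate and the leader becomes clearer.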

How do you decide when you have a leader?

If the conditions above are met, we look for the variant with the highest probability to be best. If that variant’s probability to beat the original is also above 95%, we show a leader.
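Put together, the rules above could be expressed as a small decision function. This is only a sketch of the logic as described in this article. The function name, inputs, and the choice to return no leader when the original itself has the highest probability to be best are assumptions.

```python
def pick_leader(days_run, daily_sessions, value_remaining,
                prob_best, prob_beat_baseline):
    """Return the name of the leading variant, or None if no leader is shown.

    daily_sessions: sessions per day since the experiment started.
    value_remaining: Potential Value Remaining as a fraction (0.01 == 1%).
    prob_best / prob_beat_baseline: dicts mapping variant name to probability.
    """
    conditions_met = (
        days_run >= 14                              # two-week minimum
        and all(s > 0 for s in daily_sessions)      # traffic every day
        and value_remaining <= 0.01                 # at most 1% left to gain
    )
    if not conditions_met:
        return None
    candidate = max(prob_best, key=prob_best.get)
    # Assumption: the original has no entry here, so it is never shown as leader.
    if prob_beat_baseline.get(candidate, 0.0) > 0.95:
        return candidate
    return None

# Hypothetical inputs for a finished two-week experiment.
leader = pick_leader(
    days_run=14,
    daily_sessions=[300] * 14,
    value_remaining=0.004,
    prob_best={"original": 0.01, "variant_a": 0.02, "variant_b": 0.97},
    prob_beat_baseline={"variant_a": 0.70, "variant_b": 0.98},
)
print(leader)
```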

Why don’t the median conversion rates match what I get if I just divide one number by the other?

We use advanced models that take into account time, user context, result consistency, and other factors. Simply dividing one number by the other can't account for all of these factors. We do this to best model how your variants are likely to perform in the future, so your results are more likely to remain valid in the long run.

What are the conversion rate numbers that Optimize shows? What does each one mean?

Optimize shows several numbers, all modeled. In particular, we show the range where your actual conversion rate is 95% likely to be contained. Hovering over these numbers in the bottom-most card also shows you the median value and the 50% range. Usually, those bounds narrow over the course of the experiment as more data comes in. As the conversion rate ranges overlap less, you’ll see your probabilities go up for better-performing variants. You can see this progression in the time series chart at the bottom of the reporting page, and the ranges are also shown in the rows of that same bottom card.
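For intuition, the median, 50% range, and 95% range can be read off as quantiles of a modeled conversion rate. The sketch below assumes a plain Beta-Binomial model with a uniform prior, which is much simpler than the modeling Optimize applies, and uses hypothetical names and counts.

```python
import random

def rate_summary(conversions, sessions, draws=20000, seed=0):
    """Summarize the modeled conversion rate with a median and two ranges."""
    rng = random.Random(seed)
    samples = sorted(
        rng.betavariate(1 + conversions, 1 + sessions - conversions)
        for _ in range(draws)
    )

    def quantile(p):
        return samples[int(p * (draws - 1))]

    return {
        "median": quantile(0.5),
        "range_50": (quantile(0.25), quantile(0.75)),
        "range_95": (quantile(0.025), quantile(0.975)),
    }

# Hypothetical data at two points in time: the ranges narrow as data grows.
early = rate_summary(100, 2000)
later = rate_summary(1000, 20000)
print("early 95% range:", early["range_95"])
print("later 95% range:", later["range_95"])
```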

Why does Optimize show a range of numbers for “improvement”? Most tools don’t do that.

They should! Those numbers show the range of values possible for each variant’s improvement relative to the original. Every improvement number, regardless of testing method, should have an interval around it. Tools that don't show this oversimplify and don't tell you the whole story. The interval that we show is where we’re 95% sure your true improvement falls, and where it will stay if conditions stay consistent. You can also hover over the numbers on the last card to see the median and 50% interval.
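One way to see why improvement is a range rather than a single number is to compute relative improvement for many plausible pairs of conversion rates. The sketch below assumes a simple Beta-Binomial model with a uniform prior and hypothetical counts. It is not the calculation Optimize performs.

```python
import random

def improvement_interval(original, variant, draws=20000, seed=0):
    """original and variant are (conversions, sessions) tuples.

    Returns the median plus 50% and 95% intervals of the relative
    improvement (variant_rate - original_rate) / original_rate.
    """
    rng = random.Random(seed)
    (base_conv, base_n), (var_conv, var_n) = original, variant
    lifts = []
    for _ in range(draws):
        base_rate = rng.betavariate(1 + base_conv, 1 + base_n - base_conv)
        var_rate = rng.betavariate(1 + var_conv, 1 + var_n - var_conv)
        lifts.append((var_rate - base_rate) / base_rate)
    lifts.sort()

    def quantile(p):
        return lifts[int(p * (draws - 1))]

    return {
        "median": quantile(0.5),
        "interval_50": (quantile(0.25), quantile(0.75)),
        "interval_95": (quantile(0.025), quantile(0.975)),
    }

# Hypothetical data: 100/2000 conversions for the original, 130/2000 for the variant.
summary = improvement_interval((100, 2000), (130, 2000))
low, high = summary["interval_95"]
print(f"median {summary['median']:.1%}, 95% interval {low:.1%} to {high:.1%}")
```

The naive point estimate here would be a 30% improvement, but the interval shows how wide the plausible range around that figure still is.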

Is there a way to know how long my experiment should last?

Because our approach adapts to the test conditions, different conditions cause different experiment lengths. For example, if your conversion rates are very consistent over time, we’ll find results faster. But if there’s a lot of variability in the rates, it’s likely to take longer as we model out the various influencing factors. Tools that allow for predicting experiment length assume that there’s no variability or dependence on time. This is seldom true in real experiments. Also, we suggest that you run experiments for a minimum of two weeks in order to capture cyclical variations in your traffic, such as weekdays/weekends, and to even out any recency effects or other anomalies.

Why do you use session-centric measurement rather than user-centric?

Different experiments require potentially very different approaches, from extremely granular to extremely coarse. A publisher solving for maximum pageviews might optimize against a pageviews-per-session based objective. In contrast, an ecommerce provider solving for new customer acquisition might focus on the first checkout and optimize against a “converted users” based objective. Many examples also exist in between.

Additionally, each type of measurement presents its own challenges when deciding how best to measure statistical impact. Granular forms of measurement allow for a precise understanding of context and greater insight into what might be happening on a day-to-day basis. Coarser measurements do not provide this insight.

With Optimize, we continually look into new ways of evaluating experiments to help identify the best, most actionable results. Our session-centric approach seeks to strike a balance between the various options and tradeoffs available. It allows us to understand more about day-to-day variation in experiment performance while producing very similar results to other, more coarse-grained approaches.

This FAQ article is part of a series of FAQ articles on Optimize statistics and methodology.
