Optimize uses a Bayesian inference approach to generate experiment results from data. This article introduces the basics of Bayesian inference, its benefits, and its pitfalls.
Primer on Bayesian inference
Bayes' theorem is an equation that tells us how to use observable data to make inferences about unobservable things. For example, Optimize users often want to pick the treatment that has the largest conversion rate in the long run. The only way to know this with certainty is to observe every single website visitor for the treatment’s entire lifetime. However, waiting this long would defeat the purpose of an experiment. Instead, we take a random sample of users to estimate which treatment has the largest conversion rate in the long run.
Bayes' theorem allows us to take data from a random sample of users and make estimates on something that is unobservable – like whether a treatment has the largest conversion rate among a set of treatments. An unobservable statement like this is a hypothesis and is represented by "H."
The Bayesian methodology used by Optimize centers around using data to infer how likely a hypothesis is to be true with Bayes' theorem:
$$P(H \mid \text{data}) = \frac{P(\text{data} \mid H)\, P(H)}{P(\text{data})}$$
which outputs P(H | data). The function "P()" is a way of saying "probability" and "|" is a way of saying "given that". So P(H | data) is the probability of a hypothesis being true given the data we’ve observed. The right-hand side of Bayes’ theorem must be understood in order to perform one’s own Bayesian inference, but it isn’t necessary for understanding Optimize’s output. If interested, there are many good introductory resources on the topic [1, 2, 3].
Optimize has built a Bayesian-based methodology to determine the probability of a hypothesis given data. The core hypotheses that Optimize considers are whether each treatment is better than all the others. That is, Optimize is looking to see which treatment is the best.
In an A/B test with an original and a single variant, Optimize considers two hypotheses:
H1: The original is better than the variant
H2: The variant is better than the original
Optimize uses Bayes' theorem to determine P(H1 | data) and P(H2 | data), i.e., the probabilities that the original and the variant, respectively, are the best treatment (see Probability to be Best in Optimize reports). In a test with more treatments, there is one hypothesis per treatment: that it is better than all of the others. Optimize uses Bayes’ theorem to determine the probability of each of these hypotheses given the data (also reported as Probability to be Best).
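To make the idea of a probability per hypothesis concrete, here is a minimal sketch of how such probabilities could be estimated by simulation. It is not Optimize's actual model: it assumes a simple Beta(1, 1) prior on each conversion rate, and the function name, treatment names, and counts are all made up for illustration.

```python
import random

def probability_to_be_best(results, draws=20000, seed=0):
    """Estimate P(treatment has the highest conversion rate | data).

    results maps a treatment name to (conversions, sessions).
    Assumes an independent Beta(1, 1) prior per treatment, which is an
    illustrative choice and not necessarily what Optimize uses.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in results}
    for _ in range(draws):
        # Draw one plausible conversion rate per treatment from its posterior.
        sampled = {
            name: rng.betavariate(1 + conv, 1 + sessions - conv)
            for name, (conv, sessions) in results.items()
        }
        # Credit the treatment that came out on top in this draw.
        wins[max(sampled, key=sampled.get)] += 1
    return {name: count / draws for name, count in wins.items()}

# Hypothetical experiment data: (conversions, sessions).
probs = probability_to_be_best({"original": (100, 2000), "variant": (130, 2000)})
print(probs)
```

Because exactly one treatment wins each draw, the estimated probabilities sum to one, which mirrors the absence of a "tie" hypothesis.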
Notice that there is no hypothesis for the original and variant to be tied. This is because our methodology assumes it’s impossible for the two treatments to be exactly equal in an Optimize experiment (the reason is quite technical). However, it’s possible for the original and variant to have a negligible difference, as we’ll discuss below.
Other uses of Bayes' theorem
In addition to making inferences about hypotheses that are either true or false, Bayes' theorem can be used to make inferences about continuous ranges of values. For example, we can use data to answer questions like:
- What is the probability that the treatment’s conversion rate is less than 50%?
- What is the probability that the treatment’s conversion rate lies between 1% and 4%?
- What is the range that the treatment’s conversion rate has a 95% chance of being in?
The third question is calculated in Optimize and called the 95% credible interval.
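The three questions above can be sketched with posterior samples. This is only an illustration under assumed conditions: a Beta(1, 1) prior, invented data (60 conversions in 2,000 sessions), and hypothetical helper names. It is not a description of how Optimize computes its credible interval.

```python
import random

def posterior_samples(conversions, sessions, draws=20000, seed=0):
    """Sorted draws from a Beta posterior, assuming a Beta(1, 1) prior."""
    rng = random.Random(seed)
    return sorted(
        rng.betavariate(1 + conversions, 1 + sessions - conversions)
        for _ in range(draws)
    )

def prob_between(samples, low, high):
    """Share of posterior draws that fall inside [low, high]."""
    return sum(low <= s <= high for s in samples) / len(samples)

def credible_interval(samples, mass=0.95):
    """Approximate central interval holding `mass` of the sorted draws."""
    tail = (1 - mass) / 2
    lower = samples[int(tail * len(samples))]
    upper = samples[int((1 - tail) * len(samples)) - 1]
    return lower, upper

samples = posterior_samples(conversions=60, sessions=2000)
p_below_half = prob_between(samples, 0.0, 0.5)   # P(rate < 50%)
p_1_to_4 = prob_between(samples, 0.01, 0.04)     # P(1% <= rate <= 4%)
interval = credible_interval(samples)            # 95% credible interval
print(p_below_half, p_1_to_4, interval)
```

A central interval is only one way to pick a range with 95% probability; other choices (such as the narrowest such range) are possible.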
Benefits of Optimize's Bayesian approach
Get clear answers to important questions
The following are some questions you should answer before deciding which treatment to deploy:
- How big of an impact did a change make to my site?
- How much happier are my customers with a change?
- Which change offers the largest improvement with how customers see my product?
As long as these questions (and the many more one could ask) can be quantified, they can be answered. However, a random sample during an experiment can’t answer these questions with complete certainty. Optimize’s Bayesian approach instead provides a range of highly probable answers. Take, for example, the question "How big of an impact did a change make to my site?" The random sample of users doesn’t provide a single answer to this question. Instead, we say something like, "There's a 95% chance this change will gain somewhere between $0.47 and $0.57 per session."
Find treatments that provide more value
Optimize uses something called Potential Value Remaining (PVR) [4] to recommend ending an experiment when the data suggests there is little reason to continue running it. Potential Value Remaining recommends ending an experiment if either:
- There is a high probability* one treatment is best, or
- There is a high probability* that there is negligible difference** between top ranked treatments.
In the first situation, there is little reason to continue running the experiment because there is a high probability that deploying the winning treatment is a good choice for optimizing your website.
In the second situation, one could continue running the experiment to determine which treatment is better. However, Optimize is confident that the difference between the top-ranked treatments is negligible, so there is only a very small advantage to be gained by finding the absolute best of these treatments. The additional time the experiment would need to run to find this small advantage could be better spent by ending it and starting your next, potentially high-impact, experiment.
To summarize Potential Value Remaining into one idea, Optimize makes a recommendation when there's a treatment that has a small probability of being worse than the truly optimal treatment by more than a negligible amount. We compare this criterion to another common criterion in a later section. We’ll also see in the next section that our ability to make recommendations with this criterion holds regardless of the number of times we check if it has passed our threshold. Thus, we say that we can recommend a treatment as soon as the data suggests it.
*Optimize considers >95% to be high probability.
**Optimize considers a <1% relative difference to be negligible.
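The recommendation criterion summarized above can be sketched as follows. This is a rough illustration of the idea and not Optimize's implementation: it assumes Beta(1, 1) priors, measures each treatment's shortfall relative to the best treatment in each simulated draw, and uses the 95% and 1% thresholds from the footnotes. The function name and data are hypothetical.

```python
import random

def recommend(results, negligible=0.01, high_prob=0.95, draws=20000, seed=0):
    """Return (recommended treatment or None, per-treatment probabilities).

    results maps a treatment name to (conversions, sessions).
    A treatment is recommended when the probability that it is either the
    best or worse than the best by at most `negligible` (relative) exceeds
    `high_prob`. Beta(1, 1) priors are an illustrative assumption.
    """
    rng = random.Random(seed)
    within = {name: 0 for name in results}
    for _ in range(draws):
        sampled = {
            name: rng.betavariate(1 + conv, 1 + sessions - conv)
            for name, (conv, sessions) in results.items()
        }
        best = max(sampled.values())
        for name, rate in sampled.items():
            # Relative shortfall versus the best treatment in this draw.
            if (best - rate) / best <= negligible:
                within[name] += 1
    probs = {name: count / draws for name, count in within.items()}
    candidate = max(probs, key=probs.get)
    return (candidate if probs[candidate] > high_prob else None), probs

# Hypothetical data: a clear winner, and a test that is still undecided.
clear_choice, _ = recommend({"original": (100, 2000), "variant": (160, 2000)})
undecided_choice, _ = recommend({"original": (100, 2000), "variant": (105, 2000)})
print(clear_choice, undecided_choice)
```

In the second example neither treatment clears the threshold, so the sketch returns no recommendation and the experiment would keep running.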
Recommendations aren’t impacted by multiple comparisons or peeking
As mentioned earlier, an A/B test observing only a random sample can’t know with complete certainty which treatment is optimal. Given that some error is inevitable, many A/B testing approaches will make a mathematical guarantee about the error. For example, Optimize makes a recommendation when a treatment has a high chance of either being the optimal treatment or being suboptimal by a negligible amount. As another example, A/B testing tools that use Null Hypothesis Significance Testing (NHST) often make a recommendation when the chance of a false positive is small. Here, a false positive is defined as "concluding a difference between treatments when in fact there was no difference."
We say there’s a "multiple comparisons problem" when the mathematical guarantee of a testing approach doesn’t apply when one considers a set of statistical inferences. For example, in a set of statistical inferences where each has a chance of a false positive, the chance that any one of the inferences considered results in a false positive increases as the number of inferences increases. This may happen when one is simultaneously comparing multiple variants against the original.
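As a worked illustration, suppose each of k inferences independently has a false positive probability of α (independence and the symbols k and α are assumptions made here for simplicity). The chance that at least one of them produces a false positive is then:

$$P(\text{at least one false positive}) = 1 - (1 - \alpha)^{k}$$

With α = 0.05 and k = 5 variants compared against the original, this gives 1 − 0.95^5 ≈ 0.226, so the chance of at least one false positive is roughly 23% rather than 5%.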
Similarly, we say there’s a "peeking problem" when the mathematical guarantee doesn’t apply when one checks the results of the experiment several times during the experiment and acts upon the result that they see. For example, if one "peeks" at the results over and over again, eventually sees that the results suggest a difference between treatments, and then claims there is a difference, then they have increased their chances of claiming a false positive. Said differently, if one were to collect more data, and "peek" again, the result may change from claiming a "difference" to "no difference." The reason is very similar to the multiple comparisons problem. Each time one peeks there is a chance of a false positive, so the overall chance of a false positive goes up as the number of peeks increases.
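The peeking effect can be seen in a small simulation. The sketch below runs many A/A experiments (both arms share the same true conversion rate) and applies a two-proportion z-test at a 95% significance threshold after each batch of users. The test choice, sample sizes, and rates are illustrative assumptions, not a description of any particular tool.

```python
import math
import random

def z_statistic(conv_a, conv_b, n):
    """Two-proportion z statistic with n users in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se

def simulate(experiments=1000, peeks=10, users_per_peek=100, rate=0.2, seed=0):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    ever_significant = 0
    final_significant = 0
    for _experiment in range(experiments):
        conv_a = conv_b = n = 0
        flagged = False
        z = 0.0
        for _peek in range(peeks):
            # Both arms have the same true rate, so any "difference" is noise.
            conv_a += sum(rng.random() < rate for _ in range(users_per_peek))
            conv_b += sum(rng.random() < rate for _ in range(users_per_peek))
            n += users_per_peek
            z = z_statistic(conv_a, conv_b, n)
            flagged = flagged or abs(z) > 1.96
        ever_significant += flagged          # acted on any peek
        final_significant += abs(z) > 1.96   # looked only once, at the end
    return ever_significant / experiments, final_significant / experiments

peeking_rate, single_look_rate = simulate()
print(peeking_rate, single_look_rate)
```

Stopping at the first "significant" peek can only flag more experiments than a single final look, which is why the first number comes out larger than the second.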
A/B testing approaches that focus on keeping the false positive rate small must take the multiple comparisons and peeking problems into account in order to keep their guarantee. Doing so usually reduces the chance of a true positive. Here, a true positive is defined as "concluding a difference between treatments when there is indeed a difference." A lower chance of a true positive can be overcome by requiring more data (and hence, in the case of web experiments, longer experiments). This is a negative side effect, and it’s not clear which is worse: the medicine or the disease.
The approach Optimize takes does not have a multiple comparisons or peeking problem because our guarantee still applies with multiple comparisons and with peeking. Optimize makes a recommendation when a treatment has a high chance of either being the optimal treatment or being suboptimal by a negligible amount. This is true regardless of the number of comparisons or the number of times we check if it has passed our threshold.
Criticism of Optimize's Bayesian approach
False positive rate
As alluded to above, Optimize does not focus on false positives. One reason not to focus on false positives is that we believe they will never happen in a real experiment. A false positive happens when we conclude a difference between treatments when in fact there was no difference. However, we believe there is always some difference between treatments; it’s just a matter of the magnitude of the difference and which treatment is better. Instead of focusing on false positives, Optimize makes a recommendation when there is a treatment that has a high chance of either being the optimal treatment or being suboptimal by a negligible amount. We believe making recommendations based on the likely outcomes of each treatment is better for those who want to optimize their websites. Controlling the false positive rate, as NHST does, perhaps makes sense when a false positive could exist and its consequences are high. For example, when scientists declare discoveries as scientifically true, they may want a very small proportion to be false, because a loss of faith in science or one’s professional reputation may be at stake. However, when making business decisions, Optimize believes there is little reason to emphasize the false positive rate over other kinds of errors.
A consequence of this is that our false positive rate is likely higher than that of testing tools that focus on it. For example, testing tools that use Null Hypothesis Significance Testing (NHST) with a significance threshold set to 95% often control the false positive rate to be at most 5% (assuming the multiple comparisons and peeking problems have been accounted for if necessary). In an A/A test where there is no difference between the treatments, there is an expectation that an NHST testing tool will recommend a treatment at most 5% of the time. This is why A/A tests are a useful way to check if an NHST testing tool is doing what it’s supposed to do.
Optimize, on the other hand, doesn’t focus on false positives, so there should be no expectation that A/A tests in Optimize will recommend a treatment only 5% of the time. If Optimize is confident that there is a treatment that is suboptimal by only a negligible amount, it will make a recommendation. In an A/A test, Optimize makes a good recommendation according to its own criteria, because the suboptimality of choosing one A over the other A is zero. Optimize’s recommended treatment in an A/A test ought to be reviewed together with the modeled improvement. You will likely see that the modeled improvement is negligible.
Priors
Anyone using Bayesian analysis must decide on what’s called a "prior." In the equation for Bayes’ theorem, the "prior" is P(H). Priors express one's beliefs, and the certainty of these beliefs, about whatever is being estimated before data is taken into account. The prior is contrasted with the posterior, P(H | data), which is the probability of a hypothesis being true after data is considered.
There are many ways one can decide on a prior. For example, a prior can express a high certainty in the conversion rate of a variant. In this case, lots of data is needed to overcome this belief. These are called "informative priors." A prior can also express that one has very little idea of what the conversion rate of a variant is without any data. In this case, the data in the experiment speaks for itself. These are called "uninformative priors." Although a prior that is completely uninformative isn’t possible, Optimize chooses its priors to be quite uninformative.
There are two common criticisms of priors. The first is that it can take quite a bit of work to decide on a well-reasoned prior. Luckily, Optimize does this work for you. The second is that a prior adds subjective assumptions to an analysis, even if they can be weak assumptions. To this, we point out that every analysis must make some assumptions. For example, a non-Bayesian analysis may assume that "error" is a normal distribution around zero. Similarly, it may be assumed in a Bayesian analysis that all conversion rates are equally likely before any data is seen. These assumptions all have some impact on the results of the analysis.
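The difference between informative and uninformative priors can be illustrated with a Beta prior on a conversion rate, where the posterior mean has a simple closed form. The prior parameters and data below are invented for illustration and say nothing about the priors Optimize actually uses. Beta(1, 1) corresponds to "all conversion rates are equally likely," while Beta(50, 950) expresses a fairly strong belief in a rate near 5%.

```python
def posterior_mean(prior_a, prior_b, conversions, sessions):
    """Posterior mean of a conversion rate under a Beta(prior_a, prior_b) prior.

    With binomial data the posterior is
    Beta(prior_a + conversions, prior_b + sessions - conversions).
    """
    return (prior_a + conversions) / (prior_a + prior_b + sessions)

# Hypothetical data: 30 conversions in 200 sessions (observed rate 15%).
data = {"conversions": 30, "sessions": 200}

uninformative = posterior_mean(1, 1, **data)    # 31 / 202, about 0.153
informative = posterior_mean(50, 950, **data)   # 80 / 1200, about 0.067

print(uninformative, informative)
```

With the uninformative prior the estimate stays close to the observed 15%, while the informative prior pulls it strongly toward 5%. Much more data would be needed to overcome that belief.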
Comparisons with Null Hypothesis Significance Testing (NHST)
Optimize does not use the same techniques that you may be familiar with, in particular Null Hypothesis Significance Testing (NHST). If you are familiar with NHST terms, it may be tempting to compare Optimize’s results with NHST results. However, this is not recommended, as Optimize results are not the same as NHST results (some of the reasons we don’t take an NHST approach are highlighted above). This section clarifies what Optimize does provide relative to some of the NHST terms you may be familiar with.
Statistical significance & p-values
Using a p-value to determine the statistical significance of a result is the goal of NHST. One of the first steps of NHST is to pick a significance threshold. If a p-value is below this threshold, often 0.05, one says the result is "statistically significant" or, with a threshold of 0.05, "significant at 95%."
A common misinterpretation of statistical significance is that it’s the probability that a variant is outperforming the original [5]. So, if an experiment says the variant is outperforming the original at a 95% statistical significance threshold using an NHST methodology, one is not in general able to say that the variant has a 95% probability of beating the original. Rather, one is able to say that the probability of concluding a difference between treatments, when in fact there was no difference, is at most 5%. The difference is perhaps subtle, but is meaningful [6].
Optimize’s Bayesian-based methodology does not compute p-values or determine statistical significance. Instead, we compute interpretable probabilities that directly answer some of your questions. For example, a 95% probability to beat the original is just as it seems: the probability that this variant is better than the original. No extra interpretation needed.
Confidence intervals
Optimize provides "credible intervals" instead of "confidence intervals". A common misinterpretation of "confidence intervals" is that they provide a probability statement on the range of likely values for the experiment objective [7]. Although this is wrong for a confidence interval, it is, luckily, exactly the correct interpretation of Optimize’s credible intervals. In other words, a credible interval can be thought of as a range of likely values for the experiment objective.
[1] Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC Press.
[2] Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press.
[3] McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.
[4] Scott, S. L. (2015). Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1), 37-45.
[5] McShane, B. B., & Gal, D. (2017). Statistical significance and the dichotomization of evidence. Journal of the American Statistical Association, 112(519), 885-895.
[6] Nickerson, R. S. (2000). Null hypothesis significance testing: a review of an old and continuing controversy. Psychological Methods, 5(2), 241.
[7] Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157-1164.