General methodology

What is different about the Optimize approach to measuring experiment results?

When compared to the approach taken by many other testing tools -- a frequentist-based analysis of the results over the life of the experiment -- our approach differs in two important ways.

First, we use Bayesian inference to generate our statistics. Bayesian inference is an advanced method of statistical analysis that allows us to continually refine experiment results as more data is gathered. While computationally involved and expensive, Bayesian inference offers a number of benefits when compared to more traditional approaches:

  • We can state the probability that any one variant is the best overall, without the multiple testing problems associated with hypothesis-testing approaches.
  • Bayesian methods allow us to compute probabilities directly, to better answer the questions that marketers actually have (as opposed to providing p-values, which few people truly understand). Read more about p-values.
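To make "the probability of being the best" concrete, here is a minimal sketch. It is not Optimize's actual model: it assumes a simple Beta-Binomial setup with a uniform Beta(1, 1) prior per variant, and the variant names and counts are made up for illustration. It draws from each variant's posterior many times and counts how often each one comes out on top.

```python
import random


def prob_best(results, draws=20000, seed=0):
    """Estimate P(variant is best) by sampling from Beta posteriors.

    results maps a variant name to (conversions, trials).
    Assumption: independent uniform Beta(1, 1) priors, chosen for simplicity.
    """
    rng = random.Random(seed)
    wins = {name: 0 for name in results}
    for _ in range(draws):
        # One plausible conversion rate per variant, drawn from its posterior.
        samples = {
            name: rng.betavariate(1 + conv, 1 + trials - conv)
            for name, (conv, trials) in results.items()
        }
        wins[max(samples, key=samples.get)] += 1
    return {name: count / draws for name, count in wins.items()}


# Hypothetical experiment data: (conversions, trials) per variant.
results = {
    "original": (30, 1000),
    "variant_a": (45, 1000),
    "variant_b": (38, 1000),
}
probs = prob_best(results)
for name, p in probs.items():
    print(f"{name}: probability to be best = {p:.3f}")
```

The output is a direct probability statement per variant, and the probabilities sum to 1 across all variants, so no pairwise tests are involved.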

One of the greatest benefits of using Bayesian inference, however, is that it allows us to use more advanced models for our analysis of A/B and multivariate testing results -- the second major difference in our approach. With traditional testing methods, numerous assumptions are made that treat experiment results with a “one-size-fits-all” approach. With Bayesian inference, however, we’re able to use different models that adapt to each and every experiment. We’re constantly evaluating new models to help experimenters find highly accurate results as quickly as possible. For example, here are a few models we’ve used:

  • Hierarchical models allow us to model the consistency of a variant’s conversion rates over time. If an experiment has significant “newness” effects that wear off over time, hierarchical models more effectively compensate for this, offering a more accurate representation of how the variants will perform in the future.
  • Contextual models allow us to capture information about the experiment or user context. If new users behave differently from returning users, we can incorporate that information into the overall results to give you a more accurate final outcome.
  • Restless models neutralize overall performance trends that affect all variants, isolating and clarifying the impact of each variant's changes. So, if conversion rates on the weekend are much different from conversion rates on weekdays, those effects are evened out and the differences surface more clearly.

By using Bayesian inference with more complex models, we’re better able to model all of the factors that can affect your test results. In the real world, users don't always see a variant just once and then convert. Some users see a variant multiple times, others just once. Some users visit on sale days, some visit other days. Some have interacted with your digital properties for years, others are new. Our models capture factors like these that influence your test results, whereas traditional approaches ignore them. Here are just a few of the benefits:

  • We can take into account other complexities that affect your test results, offering greater accuracy about the performance you can expect from your variants.
  • We can often provide results more quickly in low-traffic experiments, since we don’t require minimum sample sizes and can rely on other aspects of your results.
  • We can run and analyze multivariate tests quickly and comprehensively.

What problems does the Optimize approach seek to solve with respect to A/B testing measurement?

When we looked at the current state of the market and the data that we had from past experience with Content Experiments and Google Website Optimizer, we noticed a few primary problems:

  • Experimenters want to know that results are right. They want to know how likely a variant's results are to be best overall. And they want to know the magnitude of the results. P-values and hypothesis tests don’t actually tell you those things! Most experimenters don’t really understand what p-values tell them and arrive at incorrect conclusions as a result. Even scientists often struggle with this.
  • Experimenters like to look at tests frequently, resulting in the “multiple look” problem. Acting on early data in a frequentist framework can result in incorrect decisions.
  • Experimenters want to get results quickly, but also accurately. Off-the-shelf approaches to testing assume that results aren't affected by time, even though most experiments change as users react to new content or change behavior throughout the course of an experiment. Consequently, many experimenters find that test results don't hold over time, even after finding a confident winner. In addition, cyclical behavior, such as differences between a weekday and weekend, often affects results, and ignoring those cycles can lead to incorrect conclusions.
  • Simplistic approaches to multivariate tests often require a tradeoff between very long runtimes and running just a few combinations at the cost of data quality.
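The "multiple look" problem in the list above can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any particular tool: it runs A/A tests (both arms share the same true conversion rate, so any "winner" is a false positive), checks a standard two-proportion z-test after every batch of users, and compares stopping at the first significant look against testing only once at the end. The batch sizes, rate, and number of simulated tests are arbitrary choices.

```python
import math
import random


def z_test_significant(c_a, n_a, c_b, n_b, z_crit=1.96):
    """Two-sided two-proportion z-test at roughly the 5% level."""
    pooled = (c_a + c_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return False  # no variation yet, so nothing to test
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return abs(c_a / n_a - c_b / n_b) / se > z_crit


def simulate(n_tests=500, looks=10, users_per_look=100, rate=0.1, seed=1):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        c_a = c_b = n = 0
        any_significant = False
        significant = False
        for _ in range(looks):
            # Both arms convert at the same true rate (an A/A test).
            c_a += sum(rng.random() < rate for _ in range(users_per_look))
            c_b += sum(rng.random() < rate for _ in range(users_per_look))
            n += users_per_look
            significant = z_test_significant(c_a, n, c_b, n)
            any_significant = any_significant or significant
        peeking_hits += any_significant  # acted on any look
        final_hits += significant        # acted on the last look only
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, final_rate = simulate()
print(f"false positives when acting on any look: {peeking_rate:.1%}")
print(f"false positives with a single final look: {final_rate:.1%}")
```

Because the final look is one of the looks, the peeking rate can never be lower than the single-look rate, and in practice it tends to sit well above the nominal 5%.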

What’s an example of an “advanced model” that you use?

We use a variety of models for different objectives, but one that we use often is a hierarchical model, which allows us to take the daily conversion rate for each variant as inputs to our models. (This contrasts with the more typical approach, in which the raw conversion and trial numbers over the life of the test are summed and used as inputs to a simple frequentist calculation.) This matters because it means we can better understand how your conversion rates are going to perform in the future. It also means that we can provide results more quickly when conversion rates are very consistent, and more accurate results when conversion rates are highly variable.

Consider a simple example:

  • One original, one variant
  • 1000 trials a day for each
  • Variant true conversion rate (CvR) in the long term: 1%
  • Original (constant) CvR: 3%
  • “Newness” effect for variant: users click on it more often because it’s new (e.g., a CvR of 10% at the start of the experiment, tapering off over a few days).

This could result in performance that looks something like this over time:

Chart: average conversion rate

Most tools show the average conversion rate (in red). Note that the average conversion rate takes a very long time to approach the true conversion rate of 1%. It also shows the variant as winning until day 8 or so.

On the other hand, what we calculate with these hierarchical models is much more like the Daily CvR curve (in blue). Because we’re looking to see how consistent conversion rates are, we see that they’re actually highly variable. As a result, while the variant is winning for a couple of days, by day 3 it's clear that the results are far more uncertain than the average CvR over the life of the experiment would suggest.
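The gap between the two curves can be reproduced with a few lines of arithmetic. The sketch below uses the numbers from the example (original at 3%, variant starting at 10% and settling at 1%) and adds one assumption of its own: the newness effect halves every day. It uses expected rates only, with no random noise, and compares the daily CvR with the cumulative average.

```python
ORIGINAL_CVR = 0.03   # constant, from the example
LONG_TERM_CVR = 0.01  # variant's true long-term rate, from the example
INITIAL_CVR = 0.10    # variant's rate on day 1, from the example
DECAY = 0.5           # assumption: the newness effect halves each day


def daily_cvr(day):
    """Expected variant conversion rate on a given day (day 1 is the first)."""
    return LONG_TERM_CVR + (INITIAL_CVR - LONG_TERM_CVR) * DECAY ** (day - 1)


def cumulative_cvr(day):
    """Average rate since the start; equal daily trials make this a plain mean."""
    return sum(daily_cvr(d) for d in range(1, day + 1)) / day


def first_day_below(rate_fn, threshold, max_day=60):
    """First day on which rate_fn(day) drops below threshold, or None."""
    for day in range(1, max_day + 1):
        if rate_fn(day) < threshold:
            return day
    return None


daily_cross = first_day_below(daily_cvr, ORIGINAL_CVR)
cumulative_cross = first_day_below(cumulative_cvr, ORIGINAL_CVR)

for day in range(1, 11):
    print(f"day {day:2d}: daily {daily_cvr(day):.4f}  cumulative {cumulative_cvr(day):.4f}")
print(f"daily CvR falls below the original on day {daily_cross}")
print(f"cumulative CvR falls below the original on day {cumulative_cross}")
```

Under this assumed decay, the daily rate drops below the original's 3% on day 4, while the cumulative average keeps the variant "winning" through day 8 and only falls below 3% on day 9.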

What is Bayesian inference?

Bayesian inference is a fancy way of saying that we use data we already have to make better assumptions about new data. As we get new data, we refine our “model” of the world, producing more accurate results.

Here is a practical illustration.

Imagine that you’ve lost your phone in your house, and hear it ringing in one of 5 rooms. You know from past experience that you often leave your phone in your bedroom.

A frequentist approach would require you to stand still and listen to it ring, hoping that you can tell with enough certainty from where you’re standing (without moving!) which room it’s in. And, by the way, you wouldn’t be allowed to use that knowledge about where you usually leave your phone.

On the other hand, a Bayesian approach is well aligned with our common sense. First, you know you often leave your phone in your bedroom, so you have an increased chance of finding it there, and you’re allowed to use that knowledge. Second, each time the phone rings, you’re allowed to walk a bit closer to where you think the phone is. Your chances of finding your phone quickly are much better.
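The phone search maps directly onto Bayes' rule: start with a prior belief about each room, then reweight it by how well each room explains what you just heard. The sketch below is purely illustrative. The room names, the prior, and the likelihoods (how probable it is that a ring sounds like it comes from upstairs, given the phone's true room) are all invented numbers.

```python
def bayes_update(prior, likelihood):
    """Posterior is proportional to prior times likelihood, then normalized."""
    unnormalized = {room: prior[room] * likelihood[room] for room in prior}
    total = sum(unnormalized.values())
    return {room: value / total for room, value in unnormalized.items()}


# Prior: you often leave your phone in the bedroom (hypothetical numbers).
prior = {
    "bedroom": 0.40,
    "office": 0.15,
    "bathroom": 0.15,
    "kitchen": 0.15,
    "living room": 0.15,
}

# Hypothetical likelihood of hearing the ring "from upstairs" for each room.
ring_sounds_upstairs = {
    "bedroom": 0.8,
    "office": 0.7,
    "bathroom": 0.6,
    "kitchen": 0.1,
    "living room": 0.1,
}

belief = prior
for ring in range(1, 4):  # three rings, each one refines the belief
    belief = bayes_update(belief, ring_sounds_upstairs)
    best = max(belief, key=belief.get)
    print(f"after ring {ring}: most likely room = {best} ({belief[best]:.2f})")
```

Each ring is new data, and each update uses the previous posterior as the next prior, which is the same "refine as data arrives" loop described above.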

That’s interesting, but can you make Bayesian statistics clearer for me?

We’d love to, but we don’t know that we could do better than several great statisticians already have. Here’s a great starter overview.

Why doesn’t everyone use Bayesian inference or these advanced models?

There are a few reasons. First, non-Bayesian methods are easier to teach. As a result, they are traditionally taught in introductory statistics classes. Bayesian modeling requires a more in-depth approach to probability, and furthermore, Bayesian inference is computationally quite expensive. Generating results for a single variant/objective combination requires tens of thousands (or more) Markov chain Monte Carlo (MCMC) iterations - simulations that model the performance of each variant. This wasn’t feasible for a long time, and even now, it takes a great deal of scale to compute many of these. Fortunately, Google happens to be very good at this kind of scaling problem.
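For a sense of what an MCMC iteration involves, here is a minimal Metropolis sampler for a single conversion rate. It is a teaching sketch, far simpler than the models discussed in this article: it assumes a uniform prior, a binomial likelihood, a Gaussian random-walk proposal, and made-up data (30 conversions in 1000 trials). Even this toy version needs thousands of iterations for one variant and one objective.

```python
import math
import random


def log_posterior(p, conversions, trials):
    """Log of the unnormalized posterior: uniform prior times binomial likelihood."""
    if not 0.0 < p < 1.0:
        return -math.inf  # rates outside (0, 1) are impossible
    return conversions * math.log(p) + (trials - conversions) * math.log(1 - p)


def metropolis(conversions, trials, iterations=20000, step=0.01, seed=0):
    """Random-walk Metropolis sampler; returns samples after burn-in."""
    rng = random.Random(seed)
    current = 0.5  # arbitrary starting point
    current_lp = log_posterior(current, conversions, trials)
    samples = []
    for _ in range(iterations):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_posterior(proposal, conversions, trials)
        # Accept with probability min(1, posterior ratio).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples[iterations // 4:]  # discard the first quarter as burn-in


samples = metropolis(conversions=30, trials=1000)
posterior_mean = sum(samples) / len(samples)
print(f"posterior mean conversion rate: {posterior_mean:.4f}")
```

For this simple model the exact posterior is Beta(31, 971), with mean 31/1002, so the sampler's answer can be checked. Richer models have no such closed form, which is why the simulation cost matters.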

Using Bayesian methods also makes it feasible to use advanced models. While it might be possible to use some of these models with frequentist approaches, the corrections needed to display accurate results are much, much more difficult and still don’t offer some of the advantages that Bayesian inference does.

How does Optimize solve for these problems?

The interpretation problem: Bayesian statistics can answer the question “how likely is this variant to be better than what I had?”, or “how likely is this variant to be the best overall?”. While the computations are more complex, the answers are actually more aligned with how humans think.

The multiple-look (a.k.a. “peeking”) problem: Because we use models that are designed to take into account changes in your results over time, it’s always OK to look at the results. Our probabilities are continually refined as we gather more data.

The multiple-comparisons problem: Because Bayesian methods directly compute the relative performance of all variants together, and not just pairwise comparisons of variants, experimenters don’t have to perform multiple comparisons of variants to understand how each is likely to perform. Additionally, Bayesian methods don’t require advanced statistical corrections when looking at different slices of the data. In hypothesis testing approaches, however, statistical corrections are needed when you look at the data in different ways, and most tools don’t do this. Random chance can always still produce “clear” results if you look at your data in enough slices, but we try to minimize the chances of this happening.
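A quick calculation shows why uncorrected hypothesis tests struggle with many slices. The sketch below assumes the slices are independent and each is tested at the same significance level. It computes the chance of at least one false positive across all of them, along with the Bonferroni-corrected per-test threshold that a frequentist analysis could use instead.

```python
def familywise_error_rate(alpha, comparisons):
    """P(at least one false positive) across independent tests at level alpha."""
    return 1 - (1 - alpha) ** comparisons


def bonferroni_alpha(alpha, comparisons):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / comparisons


for k in (1, 5, 20):
    fwer = familywise_error_rate(0.05, k)
    corrected = bonferroni_alpha(0.05, k)
    print(f"{k:2d} slices: P(any false positive) = {fwer:.3f}, "
          f"corrected per-test alpha = {corrected:.4f}")
```

With 20 independent slices tested at 5%, the chance of at least one spurious "clear" result is about 64%.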

Speed and accuracy: Because we more accurately model the performance of all variants together over time (and aren’t just performing pairwise comparisons on the totals), we’re not subject to a “one-size-fits-all” frequentist approach. So, we’re often faster when your data is consistent, particularly in low-volume environments, and more accurate when it’s not.

Traffic changing over time: We use advanced models that assume that time can affect the results of your experiment. We factor that assumption into our analysis, to give you results that are more likely to hold true over time.

Multivariate testing: Optimize’s approach can learn about both the performance of combinations against each other, and about the performance of a variant across various combinations. As a result, we can run all combinations, but find results much more quickly than an equivalent A/B test.

This FAQ article is part of a series of FAQ articles on Optimize statistics and methodology.
