Overview of Content Experiments

FAQ (multi-armed bandit)

Does the bandit always find the optimal arm?

The multi-armed bandit algorithm we use is guaranteed to find the optimal arm if the experiment runs forever [3] [4]. You won't run your experiment forever, so there can be no iron-clad guarantee that the arm it settles on is optimal. Of course, no statistical method can be 100% certain of finding the optimal answer from finite data, so it isn't surprising that our bandit algorithm can't either. This is why we cap the experiment length at three months. If we haven't found a winner by then, it probably means there isn't much to find, and you'd be better off experimenting with other aspects of your site.

Is it always shorter than a classical test?

The bandit can yield results much faster than classical testing, at less cost, and with just as much statistical validity, but there may be occasional experiments that take longer than expected just by chance.

What types of experiments make the multi-armed bandit do particularly well (or poorly) compared to classical testing?

The multi-armed bandit has the clearest advantages over classical testing in complex experiments where there really is an effect to be found [1]. If one of your variations performs much better than the others, the optimal arm will be found very quickly. If one or more variations perform much worse than the others, they will be down-weighted very quickly, so the experiment can focus on finding the best arm.
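This down-weighting behavior can be sketched with a small Thompson-sampling simulation (Thompson sampling is a standard randomized-probability-matching approach for bandits of this kind; the two true conversion rates and the visitor count below are made up for illustration):

```python
import random

random.seed(42)
true_rates = [0.20, 0.05]  # hypothetical conversion rates; arm 0 is much better
alpha = [1, 1]             # Beta posterior parameters: 1 + conversions per arm
beta_ = [1, 1]             # Beta posterior parameters: 1 + non-conversions per arm
pulls = [0, 0]             # how many visitors each arm has served

for _ in range(2000):      # 2000 simulated visitors
    # Thompson sampling: draw a plausible rate from each arm's posterior,
    # then serve the arm whose draw is highest.
    draws = [random.betavariate(alpha[i], beta_[i]) for i in range(2)]
    arm = draws.index(max(draws))
    pulls[arm] += 1
    if random.random() < true_rates[arm]:
        alpha[arm] += 1    # conversion observed
    else:
        beta_[arm] += 1    # no conversion

print(pulls)  # the weaker arm ends up with only a small share of the traffic
```

Because the weaker arm's posterior quickly concentrates below the stronger arm's, it rarely wins the sampling step, so almost all later traffic goes to the better variation.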

The worst case for the bandit is an experiment with two arms that perform exactly the same. In that case the ideal solution is for the arms to accumulate observations at identical rates until the experiment ends. The bandit exhibits this behavior on average, but in any given experiment one arm will accumulate observations more quickly just by chance.

It is worth remembering that people do experiments because they think they can improve on the existing page, so we don’t want to over-emphasize the worst case scenario assumed by classical tests.

What happens if the optimal arm is unlucky in the beginning? Can it recover?

Even if an arm has been down-weighted early in the experiment, it can still recover. An arm can be unfairly down-weighted for two reasons: either that arm performed uncharacteristically badly, or another arm performed uncharacteristically well (or both). If chance has unfairly favored an inferior arm, that arm will begin to accumulate more observations, we will learn that it wasn't as good as we thought, its weight will decrease, and the weights of the competing arms will increase.

Are the results from the bandit statistically valid?

Yes. The bandit uses sequential Bayesian updating to learn from each day's experimental results, which is a different notion of statistical validity from the one used by classical testing. A classical test starts by assuming a null hypothesis, for example, "The variations are all equally effective." It then accumulates evidence about the hypothesis and makes a judgment about whether it can be rejected. If you can reject the null hypothesis, you've found a statistically significant result.
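For a conversion-rate experiment, sequential Bayesian updating with a Beta prior reduces to simple counting: each day's conversions and non-conversions are added to the posterior's parameters. A minimal sketch for one arm (the uniform prior and the daily counts below are hypothetical):

```python
a, b = 1, 1  # Beta(1, 1) prior: uniform over possible conversion rates
daily_counts = [(12, 488), (9, 491), (15, 485)]  # hypothetical (conversions, non-conversions) per day

for conversions, non_conversions in daily_counts:
    a += conversions        # conjugate update: successes add to the first parameter
    b += non_conversions    # failures add to the second parameter

posterior_mean = a / (a + b)
print(a, b, round(posterior_mean, 4))  # → 37 1465 0.0246
```

After three days the posterior is Beta(37, 1465), whose mean of about 2.5% is the current best estimate of this arm's conversion rate; tomorrow's data simply continues the same update.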

Statistical significance exists to keep you from making a type I error. In the context of website optimization, a type I error means picking a new variation that is really no different from the original, performance-wise. You'd like to avoid type I errors (they're errors, after all), but in this context they are far less costly than type II errors. For us, a type II error means failing to switch to a better arm, which is costly because it means you're losing conversions.

Bayesian updating asks the question “What is the probability that this is the best arm, given what I know now?” Hypothesis testing asks “What is the probability that I would see this outcome if all the arms were equal?” Both are valid questions, but the Bayesian question is easier for most people to understand, and it naturally balances between type I and type II errors by taking advantage of information from your experiment as it becomes available.
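The Bayesian question can be answered directly by Monte Carlo: draw a plausible conversion rate from each arm's posterior many times and count how often each arm comes out on top. A sketch assuming Beta(1, 1) priors and made-up conversion counts:

```python
import random

random.seed(0)
data = [(30, 1000), (45, 1000), (38, 1000)]  # hypothetical (conversions, trials) per arm
n_draws = 50_000
wins = [0] * len(data)

for _ in range(n_draws):
    # One posterior draw per arm: Beta(1 + conversions, 1 + failures)
    samples = [random.betavariate(1 + c, 1 + t - c) for c, t in data]
    wins[samples.index(max(samples))] += 1

probs = [w / n_draws for w in wins]
print(probs)  # estimated P(this arm is the best), one entry per arm
```

The resulting probabilities sum to one and update every day as new data arrive, which is exactly the quantity the bandit uses to decide how much traffic each arm should receive.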

Classical hypothesis tests make you wait until you’ve seen a certain number of observations before you look at your data, because the probability question they need to answer becomes too complicated otherwise. If you’ve got a poorly performing arm in your experiment, then classical tests impose a heavy opportunity cost. So if both methods are valid, why not use the one that saves you time and money, and skip the complicated, expensive one that makes you wait to look at your experimental results?
