25 Apr

A/B Testing - Common Mistakes - Simultaneous Tests


Running more than one A/B test at the same time

When you are running an A/B test for a new feature, you wait to reach that magical 95% confidence level so you have statistical significance. You declare a victor and update your site accordingly. When you are running many A/B tests for multiple features, you wait for each of your tests to reach that magical 95% confidence level so you've reached statistical significance for each test. You declare the victors and update your site accordingly. Unfortunately, you're not actually done. If you haven't already, please read this previous post first, which explains how a statistical test works.

What changes if I’m running many A/B tests for different features?

If there is a single A/B test running, 95% significance means that a difference as large as the one we observed (or larger) will only happen 5% of the time by chance if we assume that the treatment has no effect. However, if we have many A/B tests running, it is more likely that at least one of them will show a large difference just by chance. If we use 95% significance for each individual test, we have a larger than 5% chance that some test shows a large difference and we accidentally reject our assumption.
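
To put a number on "larger than 5%": if the tests are independent, the chance that at least one of them looks significant just by chance is 1 minus the chance that none of them do. A quick back-of-the-envelope calculation (assuming independent tests):

alpha <- 0.05
number_of_tests <- 4
# Chance that at least one of the tests is falsely significant
1 - (1 - alpha)^number_of_tests   # about 0.185, or 18.5%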

Let's run some examples in R. In the previous post, we created two test sets, both from the same distribution, and saw how often we would've rejected the Null Hypothesis. With a 95% confidence level, we found, as expected, that we falsely rejected it in about 5% of the experiments.

Now, we will create more test data sets from the same distribution. But this time, we will create 8 different data sets and run 4 t-tests at the same time. Let's see how often at least one of the tests falsely rejects the Null Hypothesis.

set.seed(2015)
run_experiment_once <- function(x) {
    # Test 1 and its p-value
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_1 <- t.test(data1, data2)$p.value

    # Test 2 and its p-value
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_2 <- t.test(data1, data2)$p.value

    # Test 3 and its p-value
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_3 <- t.test(data1, data2)$p.value
    
    # Test 4 and its p-value
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_4 <- t.test(data1, data2)$p.value
    
    # Since we only want to know whether any of the tests falsely rejects,
    # we only need to return the smallest (most significant) p-value
    min(p_value_1, p_value_2, p_value_3, p_value_4)
}

# sapply will repeat the experiment 10,000 times
result <- sapply(1:10000, run_experiment_once)

# "< 0.05" will compare the result of each experiment (the p-value) with 0.05 which will
# create a list of "TRUE" and "FALSE" values
reject.null.hypothesis <- result < 0.05

# sum() will add up the "TRUE" and "FALSE" values where TRUE=1 and FALSE=0. So this gives
# the number of "TRUE" values
true.count <- sum(reject.null.hypothesis)

# Finally, divide by 10,000 to get the percentage
true.count / 10000
## [1] 0.185

We can see that, with 4 simultaneous tests, we falsely reject at least one test 18.5% of the time at a 95% significance level, which matches the roughly 18.5% we computed above rather than the 5% we might expect.

What do I do now?

There are multiple ways to fix this problem, but the easiest is the Bonferroni Correction. Let's say we are running 4 A/B tests. Instead of looking for a 100% - 5% = 95% confidence level, we now look for a 100% - (5% / 4) = 98.75% confidence level. That is, we divide the 5% by the number of tests and use the resulting, stricter confidence level for each test. This is a conservative correction, meaning we are less likely to reject than strictly necessary, but it is very easy to compute. Depending on the exact situation, there are other corrections that are less conservative, but they are outside the scope of this article.
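
As an aside, base R can apply this correction (and several of the less conservative ones) for you with the built-in p.adjust function, which scales the p-values up instead of tightening the threshold. A quick sketch using some made-up p-values for illustration:

# Four hypothetical p-values from four simultaneous tests
p_values <- c(0.030, 0.200, 0.011, 0.650)

# Bonferroni-adjusted p-values: each p-value is multiplied by the
# number of tests (capped at 1) and can be compared to 0.05 as usual
p.adjust(p_values, method = "bonferroni")
## [1] 0.120 0.800 0.044 1.000

# Equivalent to comparing the raw p-values against 0.05 / 4 = 0.0125
p_values < 0.05 / 4
## [1] FALSE FALSE  TRUE FALSE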

So let's repeat the previous experiment where we ran 4 simultaneous A/B tests, but this time apply the Bonferroni correction and look for a 98.75% significance level, that is, compare each p-value to 0.05 / 4 = 0.0125.

set.seed(2015)
run_experiment_once <- function(x) {
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_1 <- t.test(data1, data2)$p.value

    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_2 <- t.test(data1, data2)$p.value

    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_3 <- t.test(data1, data2)$p.value
    
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_4 <- t.test(data1, data2)$p.value
    
    min(p_value_1, p_value_2, p_value_3, p_value_4)
}

# The Bonferroni correction is applied here: compare each minimum p-value
# to 0.05 / 4 = 0.0125
sum(sapply(1:10000, run_experiment_once) < 0.0125) / 10000
## [1] 0.0506

Perfect! By using a Bonferroni correction, we falsely reject at least one of the four tests only about 5% of the time, which falls in line with our original 95% confidence level.

But what happens as the number of tests changes over time?

Say you have 10 tests running right now, so you're using a 100% - (5% / 10) = 99.5% confidence level. What if 8 of your tests end, leaving you with 2 tests? If you update your confidence level to 100% - (5% / 2) = 97.5%, one of the two remaining tests might suddenly become statistically significant! Personally, I would suggest staying conservative and keeping the 99.5% level for these two tests. In general, this means holding each test to the highest confidence level it required at any point during its life.
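
If you want to double-check that arithmetic, it is only a line of R (the helper function here is defined purely for illustration):

# Bonferroni-corrected confidence level for a given number of tests
bonferroni_confidence <- function(number_of_tests, alpha = 0.05) {
    1 - alpha / number_of_tests
}

bonferroni_confidence(10)
## [1] 0.995
bonferroni_confidence(2)
## [1] 0.975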

Additional Information

Here are some Wikipedia articles on correcting for multiple tests. I have also included a link about false discovery rates, which Optimizely uses in their new stats engine.

http://en.wikipedia.org/wiki/Familywise_error_rate

http://en.wikipedia.org/wiki/Bonferroni_correction

http://en.wikipedia.org/wiki/False_discovery_rate

http://blog.optimizely.com/2015/01/20/statistics-for-the-internet-age-the-story-behind-optimizelys-new-stats-engine/

Conclusion

If you are running many A/B tests, don’t forget to change your significance level. Otherwise, you’ll declare statistical significance when you don’t actually have it.

21 Apr

A/B Testing - Basics - 95% Significance

What does 95% significance level mean anyway?

Please read this other article for a technical explanation of 95% significance. For this article, I'd like to give a more intuitive explanation that was once given to me.

To develop this intuition, we're going to flip a coin that we suspect may not be fair. This is an imperfect analogy, since we rarely encounter fake coins, but it should help. So let's start flipping the coin and suppose the first coin flip is heads. At this point, you would have no reason to believe the coin is a fake. The chance of getting heads on the first flip of a fair coin is 50%, which corresponds to 50% significance.

Suppose the second coin flip is also heads. You'd probably still think the coin is real. The chance of getting two heads in a row is 25%, which is 75% significance.

Suppose the third and fourth coin flips are also heads. You might start to get suspicious, but you would probably have a hard time saying for certain that the coin is fake. The chance of getting four heads in a row is 6.25%, which is 93.75% significance.

Suppose the fifth and sixth coin flips are also heads. Now you probably want to take a good look at that coin. The chance of getting 5 heads in a row is 3.125% (96.875% significance) and the chance of getting 6 heads in a row is 1.5625% (98.4375% significance). This is where we cross the 95% significance level that is commonly used for tests. I suggest sticking with 95% confidence or higher.
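
If you'd like to reproduce the arithmetic above, a couple of lines of R will do it:

# Chance of getting n heads in a row from a fair coin, and the
# corresponding "significance" of that streak
n <- 1:6
0.5^n
## [1] 0.500000 0.250000 0.125000 0.062500 0.031250 0.015625
1 - 0.5^n
## [1] 0.500000 0.750000 0.875000 0.937500 0.968750 0.984375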

However, if you are telling yourself that you'd be happy to decide after seeing only 3 or 4 heads, consider this situation. Your company is running an A/B test and nobody is sure whether it is successful or not. It may or may not be losing money each day. But once you decide, the feature may or may not be profitable for the remainder of the product's life. How many heads would you wait for before declaring an answer? Are you really willing to take those risks after seeing only 4 heads in a row?

Why 95%?

This is the confidence level commonly used for experiments across many areas of science. However, it is not the only one. The experiments hunting for the Higgs Boson particle look for a significance of 99.99997133% (the famous "five sigma" standard). But reaching that level of significance takes lots and lots of data. So while 95% has become a standard, the right choice depends on the level of certainty you're looking for and how long you're willing to wait to declare the test over.
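
That oddly precise number is just the one-sided tail probability of a Normal distribution at five standard deviations, which you can check in R:

# The "five sigma" rule: probability that a standard Normal draw
# falls below five standard deviations above the mean
pnorm(5)
## [1] 0.9999997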

Lastly, consider this: you don't need to defend using a 95% significance level since it is the standard. It is only slightly harder to defend a 99% significance level. It will be hardest to defend using a 90% confidence level.

Conclusion

So the intuition: 95% significance is roughly the same as seeing 5 heads in a row from a coin you suspect might be fake.

11 Apr

A/B Testing - Basics - Statistical Tests

If there are two data sets, each with lots of variance, how do we tell whether one is higher than the other on average? We can see that there is a difference between the averages of the two data sets. But is that difference real, or just due to chance? If we hypothetically ran this A/B test again, we would get two different data sets with two different averages, and a new difference between those averages. That difference might be smaller or larger. What would that mean? Well, we can actually use this idea of repeated hypothetical experiments to determine whether the difference is real or not.

The Null Hypothesis

First, we assume that the treatment doesn't do anything. Then statistics tells us how often we will encounter our observed data just by chance. If our observed data doesn't happen often just by chance, then we have evidence that our assumption is incorrect and that the treatment does do something.

Let me try to explain it with more technical terms. Let’s assume that both data sets are samples from the same distribution. This is called the Null Hypothesis and we assume that it is true. Then using statistics, we can estimate what would happen if we ran this experiment many times, getting different data sets each time, and look at the difference between the averages of the two data sets. Then statistics can tell us how often our observed difference (or larger) in our actual test data will happen if we assume the Null Hypothesis. If the size of the difference (or larger) that we observed does not happen often, then we have evidence that our assumption is not true. We say that the no-effect treatment assumption is not true and we “reject the Null Hypothesis”.

Examples: Two different distributions

Let's look at some examples written in R. Let the first data set be 100 Normally distributed points with mean=1 and standard deviation=2. Let the second data set be 100 Normally distributed points with mean=2 and standard deviation=2. What does Student's t-test tell us?

set.seed(2015)
# Create a list of 100 random draws from a normal distribution 
# with mean 1 and standard deviation 2
data1 <- rnorm(100, mean=1, sd=2)
# Create a second list of 100 random draws from a normal
# distribution with mean 2 and standard deviation 2
data2 <- rnorm(100, mean=2, sd=2)
# Perform a t-test on these two data sets and get the p-value
t.test(data1, data2)$p.value
## [1] 0.0005304826

In this case, the data was actually created from two different distributions. If we assume they came from the same distribution (the Null Hypothesis), the t-test says we would observe data this far apart or further only 0.05% of the time. This is a 99.95% significance level, so we reject the Null Hypothesis and declare the second data set to be higher than the first.

Now, let's move the second data set closer to the first. Let's change its mean to 1.3 and keep the first data set's mean at 1.0.

set.seed(2015)
# Create a list of 100 random draws from a normal distribution 
# with mean 1 and standard deviation 2
data1 <- rnorm(100, mean=1, sd=2)
# Create a second list of 100 random draws from a normal
# distribution with mean 1.3 and standard deviation 2
data2 <- rnorm(100, mean=1.3, sd=2)
# Perform a t-test on these two data sets and get the p-value
t.test(data1, data2)$p.value
## [1] 0.3258681

Now, even though the data was created from two different distributions, the t-test shows that we have about a 33% chance of observing data this far apart or further when we assume the Null Hypothesis. This is only a 67% significance level, so we cannot reject the Null Hypothesis, and we declare that we don't have a winner yet.

Examples: A single distribution

Let’s look at a different type of example. Let’s still create two different data sets but from the same distribution. We will repeat this experiment 10,000 times and see what happens.

set.seed(2015)
run_experiment_once <- function(x) {
    # Create a list of 100 random draws from a 
    # specific Normal distribution
    data1 <- rnorm(100, mean=1, sd=2)
    # Create a second list of 100 random draws from the 
    # same specific Normal distribution
    data2 <- rnorm(100, mean=1, sd=2)
    # Perform a t-test on these two data sets and get
    # the p-value
    t.test(data1, data2)$p.value
    # the p-value is the last expression evaluated, so it is what
    # the function returns
}

# sapply will repeat the experiment 10,000 times
result <- sapply(1:10000, run_experiment_once)

# "< 0.05" will compare the result of the experiment
# (the p-value) with 0.05. This will create a list of
# "TRUE" and "FALSE" values
reject.null.hypothesis <- result < 0.05

# sum() will add up the "TRUE" and "FALSE" values where 
# TRUE=1 and FALSE=0. So this gives the number of "TRUE"
# values
true.count <- sum(reject.null.hypothesis)

# Finally, divide by 10,000 to get the percentage
true.count / 10000
## [1] 0.051

Even though the two data sets came from the same distribution, we still reject the Null Hypothesis 5.1% of the time, which falls in line with our 95% significance level. Remember, the Null Hypothesis assumes the two distributions are equal, which was actually true in this example. The significance level then controls how often a difference that large (or larger) occurs just by chance, which we selected to be 5%.

Additional Information

Just in case my explanation wasn't quite your style, here are some other links:

http://en.wikipedia.org/wiki/Null_hypothesis

http://en.wikipedia.org/wiki/P-value

https://statistics.laerd.com/statistical-guides/hypothesis-testing-3.php

http://blog.minitab.com/blog/understanding-statistics/things-statisticians-say-failure-to-reject-the-null-hypothesis

Frequentist Perspective

I just want to mention that the explanation above is called Frequentist statistics, which is what almost every introductory statistics class teaches. The other branch is Bayesian statistics. A larger discussion of the two branches is outside the scope of this article, but I have included a few links below.

http://stats.stackexchange.com/questions/22/bayesian-and-frequentist-reasoning-in-plain-english

http://www.quora.com/What-is-the-difference-between-Bayesian-and-frequentist-statisticians

http://simplystatistics.org/2014/10/13/as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential/

Conclusion

A statistical test tells us how often we would see data like ours (or more extreme) across many hypothetical replications of the experiment, assuming there is no real difference. If such data doesn't happen often, we have evidence that the two groups are different.