02 May

A/B Testing - Common Mistakes - Multiple Treatments

A/B Testing: Common Mistakes

Tests with more than one Treatment group

When an A/B test has a control group and a single treatment group, you wait for your statistical test to reach 95% significance, declare a victor, and you're done. If there are multiple treatment groups, you might wait for 95% statistical significance for any treatment against control, then declare a victor and call it done. But you're really not done. If you've read my previous post about multiple simultaneous A/B tests, you might be able to guess one problem. But there is also a second problem.

How do I handle this situation?

If you have read my previous post, you know the first problem: if we're testing multiple features, we're more likely to see a large difference between test groups just by chance. Since we are now testing more than one treatment against control, it is more likely that one of these treatment groups looks different just by chance. This means we need to correct our test procedure. One easy correction is the Bonferroni correction. Suppose we have one control group and 4 treatment groups, so we have four tests being performed against the control group. We change our 100% - 5% = 95% significance level to 100% - 5%/4 = 98.75% significance. This is actually a conservative correction and there are better corrections, but the Bonferroni correction is quick and easy to do.
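As a quick illustration of that arithmetic, here is a minimal sketch in R (the variable names are my own, not part of the experiment below):

# Bonferroni correction: divide the 5% threshold by the number of tests
alpha <- 0.05        # original threshold for 95% significance
n_tests <- 4         # one test per treatment group vs. control
alpha / n_tests      # new threshold: 0.0125, i.e. a 98.75% significance level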

Now, there is still a second problem. Let's say two of our treatment groups are higher than control and one of them is statistically significant. We pick the statistically significant one and we're done, right? Not exactly. Yes, one treatment group is higher than the control group. But we don't know if that same treatment group is also higher than the other treatment groups. We need to perform another statistical test comparing the treatment groups in order to correctly pick a winner. In general, we should be comparing all pairs of groups with each other. If we have k total groups (1 control group and k-1 treatment groups), we need to perform up to (k choose 2) different pairwise comparisons to find the best group. This also means making a Bonferroni correction for (k choose 2) comparisons. So back to our example of 1 control and 4 treatment groups: we make (5 choose 2) = 10 different pairwise comparisons, looking for a significance level of 100% - 5%/10 = 99.5%.
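In R, the built-in pairwise.t.test() function will run all of these comparisons and apply the Bonferroni correction for you. Here is a minimal sketch on simulated data of my own (the group means are made up for illustration):

set.seed(42)
# Simulate 1 control and 4 treatment groups of 100 observations each
values <- c(rnorm(100, mean=1.0, sd=2),   # control
            rnorm(100, mean=1.2, sd=2),   # treatment 1
            rnorm(100, mean=1.3, sd=2),   # treatment 2
            rnorm(100, mean=1.4, sd=2),   # treatment 3
            rnorm(100, mean=1.5, sd=2))   # treatment 4
groups <- rep(c("control", "t1", "t2", "t3", "t4"), each=100)

# (5 choose 2) = 10 pairwise comparisons
choose(5, 2)

# p.adjust.method="bonferroni" scales each p-value by 10, which is equivalent
# to comparing the raw p-values against 0.05 / 10
pairwise.t.test(values, groups, p.adjust.method="bonferroni")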

There is a second solution to our problem. Our goal is to see if the best treatment is better than the other treatments. All we need to do is take the best treatment as one group and place all the remaining treatments into a second group. Then we perform a single test between the two groups.
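A minimal sketch of that pooled test, again on simulated data of my own:

set.seed(42)
best_treatment  <- rnorm(100, mean=1.6, sd=2)     # the current leader
everything_else <- c(rnorm(100, mean=1.0, sd=2),  # control
                     rnorm(100, mean=1.4, sd=2))  # remaining treatment
# One t-test: the leader against all other groups pooled together
t.test(best_treatment, everything_else)$p.value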

However, that isn't perfect. It is possible that the worst performing group is the only group that is statistically different, and there could be any number of possible splits we might want to consider. Still, we don't need to make all (k choose 2) pairwise comparisons; we just need to know if the best group is actually the best. So we only need to make (k-1) comparisons from the best group to all the others, and only need a Bonferroni correction for (k-1) comparisons. For our example, this means we pick the 1 out of the 5 groups (1 control and 4 treatments) that is currently the highest, and we test it for significance against the other 4 groups using a significance level of 100% - 5%/4 = 98.75%.
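Here is a minimal sketch of that "best against the rest" procedure on simulated data (the group means are invented for illustration):

set.seed(42)
k <- 5
alpha <- 0.05 / (k - 1)    # 0.0125, i.e. a 98.75% significance level

groups <- list(control    = rnorm(100, mean=1.0, sd=2),
               treatment1 = rnorm(100, mean=1.3, sd=2),
               treatment2 = rnorm(100, mean=1.4, sd=2),
               treatment3 = rnorm(100, mean=1.5, sd=2),
               treatment4 = rnorm(100, mean=1.6, sd=2))

# Pick the group with the highest observed mean
best <- which.max(sapply(groups, mean))

# Test the best group against each of the other (k - 1) groups
p_values <- sapply(groups[-best], function(g) t.test(groups[[best]], g)$p.value)

# Declare a winner only if the best group beats every other group
all(p_values < alpha)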

Let's look at some examples in R. Let's use just a control and two treatments, each from a different distribution: control has the lowest average, treatment 1 the middle average, and treatment 2 the highest average.

set.seed(2015)
run_experiment_once <- function(x) {
    # Create 3 data sets with 3 different means
    control <- rnorm(100, mean=1, sd=2)
    treatment1 <- rnorm(100, mean=1.4, sd=2)
    treatment2 <- rnorm(100, mean=1.6, sd=2)
    
    # Compute significance for each of the three pairs of tests
    p_value_1 <- t.test(control, treatment1)$p.value
    p_value_2 <- t.test(control, treatment2)$p.value
    p_value_3 <- t.test(treatment1, treatment2)$p.value
    
    # Return the 3 p-values and the 3 group means
    c(p_value_1, p_value_2, p_value_3, mean(control), mean(treatment1), mean(treatment2)) 
}

# Repeat the experiment 10,000 times
result <- sapply(1:10000, run_experiment_once)

Let's look at each treatment compared with control and see how often we would've finished our A/B test. Finishing the A/B test in this case means finding significance for either of the treatments when compared with control. Don't forget to apply the Bonferroni correction.

# The first row in result is the significance between control and treatment 1
result1 <- result[1,] < 0.025

# The second row in result is the significance between control and treatment 2
result2 <- result[2,] < 0.025

# Sum will add up the list of boolean expressions where TRUE=1 and FALSE=0
count.either_success <- sum(result1 | result2)

# display the percentage of success
count.either_success / 10000
## [1] 0.4959

It looks like, for our experiment, at least one of the two treatments is statistically significant 49.6% of the time. Now, let's take a look and see how often treatment 2 was significant while treatment 1 was not.

count.treatment1_failure.treatment2_success <- sum(result[1,] > 0.025 & result[2,] < 0.025)

count.treatment1_failure.treatment2_success / count.either_success
## [1] 0.5922565
count.treatment1_failure.treatment2_success / 10000
## [1] 0.2937

Given that we found statistical significance, 59% of the time it was treatment 2 (the one generated with the higher mean) that was significant while treatment 1 was not. If the approach is to declare victory at this point, you would be picking correctly. Overall, out of 10,000 experiments, we would've correctly chosen treatment 2 this way 29% of the time.

Now, let's take a look at how often treatment 1 was significant while treatment 2 was not.

count.treatment1_success.treatment2_failure <- sum(result[1,] < 0.025 & result[2,] > 0.025)

count.treatment1_success.treatment2_failure / count.either_success
## [1] 0.1103045
count.treatment1_success.treatment2_failure / 10000
## [1] 0.0547

Now, given that we found statistical significance, 11% of the time we would've incorrectly declared treatment 1 the winner even though we generated its data with a lower mean. Overall, this would've happened 5.5% of the time.

For our experiment, it wouldn't be useful to check the case where both treatments are significant: under our strategy of picking the first treatment that shows significance, we would have already stopped before that happens.

Let's try the other approach. We take the highest group and see how often it is significantly higher than both of the other groups. Don't forget the Bonferroni correction.

# the fourth row is the average of control
# the fifth row is the average of treatment 1
# the sixth row is the average of treatment 2

count.control_highest <- sum(result[4,] > result[5,] & result[4,] > result[6,])
count.control_highest / 10000
## [1] 0.0065
count.treatment1_highest <- sum(result[5,] > result[4,] & result[5,] > result[6,])
count.treatment1_highest / 10000
## [1] 0.2415
count.treatment2_highest <- sum(result[6,] > result[4,] & result[6,] > result[5,])
count.treatment2_highest / 10000
## [1] 0.752

We see that control is highest 0.65% of the time, treatment 1 is highest 24% of the time, and treatment 2 is highest 75% of the time. Now, given that a group is the highest, let's see if it is also statistically significant against both of the other groups.

# the first row is p-value for control vs treatment 1
# the second row is p-value for control vs treatment 2
# the third row is p-value for treatment 1 vs treatment 2

count.control_significance <- sum(result[4,] > result[5,] & result[4,] > result[6,] & result[1,] < 0.025 & result[2,] < 0.025) 
# given control is highest, what percent is it significant?
count.control_significance / count.control_highest
## [1] 0
# what percent is control highest and significant?
count.control_significance / 10000
## [1] 0

We can see that when control is the highest, we never find statistical significance, which is correct since we generated it with the lowest mean.

count.treatment1_significance <- sum(result[5,] > result[4,] & result[5,] > result[6,] & result[1,] < 0.025 & result[3,] < 0.025) 
# given treatment 1 is highest, what percent is it significant?
count.treatment1_significance / count.treatment1_highest
## [1] 0.004968944
# what percent is treatment 1 highest and significant?
count.treatment1_significance / 10000
## [1] 0.0012

Given that treatment 1 is highest, it was statistically significant only 0.5% of the time. Overall, it was highest and statistically significant only 0.12% of the time, which is much better than the 5.5% false-winner rate under the first approach.

count.treatment2_significance <- sum(result[6,] > result[4,] & result[6,] > result[5,] & result[2,] < 0.025 & result[3,] < 0.025)  
# given treatment 2 is highest, what percent is it significant?
count.treatment2_significance / count.treatment2_highest
## [1] 0.06263298
# what percent is treatment 2 highest and significant?
count.treatment2_significance / 10000
## [1] 0.0471

Given that treatment 2 is highest, we find statistical significance 6% of the time. Overall, it was highest and statistically significant 4.7% of the time, much less frequently than with the first approach. This happens because we're now testing for significance against both control and treatment 1, instead of just control.

Is there something better?

Since we're now looking for significance against every other group rather than just control, we're going to need more time and data. But this is the proper thing to do. If desired, you can drop treatment groups along the way once they are statistically significantly behind the leader.

There is an additional approach that can work. If we only look at pairwise comparisons between groups, we ignore all of the other groups and don't use all of our information. We can instead rank all of our treatments and split them into a high group and a low group, each time choosing a different point at which to break up the treatments. This may have more success than looking at only two groups at a time.

Finally, there are various tests that compare all of the groups at once and tell you whether all groups have the same average. If one group has a different average, the test will reject the null hypothesis. ANOVA (analysis of variance) is one such test. You can read more about it below.

http://www.physics.csbsju.edu/stats/anova.html

https://explorable.com/anova

http://en.wikipedia.org/wiki/Analysis_of_variance
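As a rough idea of what this looks like in R, here is a minimal one-way ANOVA sketch on simulated data similar to the earlier example (aov() is part of base R's stats package):

set.seed(2015)
values <- c(rnorm(100, mean=1.0, sd=2),
            rnorm(100, mean=1.4, sd=2),
            rnorm(100, mean=1.6, sd=2))
groups <- factor(rep(c("control", "treatment1", "treatment2"), each=100))

# A small p-value for the groups term means at least one group mean differs
summary(aov(values ~ groups))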

Conclusion

Now you know how to properly control your level of significance and also find the correct winner when there is more than one treatment group in your A/B test. As you can see, it will take longer to reach significance against all of the groups instead of just control, but it is the correct thing to do.

25 Apr

A/B Testing - Common Mistakes - Simultaneous Tests

A/B Testing: Common Mistakes

Running more than one A/B test at the same time

When you are running an A/B test for a new feature, you wait to reach that magical 95% confidence level so you have statistical significance. You declare a victor and update your site accordingly. When you are running many A/B tests for multiple features, you wait for each of your tests to reach that magical 95% confidence level so you’ve reached statistical significance for each test. You declare the victors and update your site accordingly. Unfortunately, you’re actually not done. If you haven’t already, please read this previous post first which explains how a statistical test works.

What changes if I’m running many A/B tests for different features?

If there is a single A/B test running, 95% significance means that our observed difference (or a larger one) will happen only 5% of the time by chance if we assume the treatment has no effect. However, if we have many A/B tests running, it is more likely that we will observe a large difference just by chance. If we use a 95% significance level for each test, we have a larger than 5% chance of some test showing a large difference and accidentally rejecting our assumption.
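Assuming the tests are independent, a one-line calculation in R shows how large that chance gets with 4 tests:

# Probability that at least one of 4 independent tests rejects by chance:
# 1 - 0.95^4 is roughly 0.185, close to the 18.5% we will see in the
# simulation below
1 - 0.95^4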

Let's run some examples in R. In the previous post, we created two test sets, both from the same distribution, and saw how often we would've rejected the null hypothesis. With a 95% confidence level, we correctly found that we would reject in 5% of the experiments.

Now, we will create more test data sets from the same distribution. But this time, we will create 8 different test sets and run 4 tests at the same time. Let's see how often at least one of the tests rejects the null hypothesis.

set.seed(2015)
run_experiment_once <- function(x) {
    # Test 1 and its p-value
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_1 <- t.test(data1, data2)$p.value

    # Test 2 and its p-value
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_2 <- t.test(data1, data2)$p.value

    # Test 3 and its p-value
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_3 <- t.test(data1, data2)$p.value
    
    # Test 4 and its p-value
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_4 <- t.test(data1, data2)$p.value
    
    # Since we only want to find if any of the tests fail, we only
    # need to return the most significant test
    min(p_value_1, p_value_2, p_value_3, p_value_4)
}

# sapply will repeat the experiment 10,000 times
result <- sapply(1:10000, run_experiment_once)

# "< 0.05" will compare the result of each experiment (the p-value) with 0.05 which will
# create a list of "TRUE" and "FALSE" values
reject.null.hypothesis <- result < 0.05

# sum() will add up the "TRUE" and "FALSE" values where TRUE=1 and FALSE=0. So this gives
# the number of "TRUE" values
true.count <- sum(reject.null.hypothesis)

# Finally, divide by 10,000 to get the percentage
true.count / 10000
## [1] 0.185

We can see that, with 4 simultaneous tests, we reject at least one test 18.5% of the time at a 95% significance level and not 5% as expected.

What do I do now?

There are multiple ways to fix this problem. However, the easiest is to use a Bonferroni correction. Let's say we are running 4 A/B tests. Instead of looking for a 100% - 5% = 95% confidence level, we now look for a 100% - (5% / 4) = 98.75% confidence level. That is, we divide the 5% in the confidence level by the number of tests and compute a new confidence level to test for. This is usually a conservative correction, meaning that we are less likely to reject than necessary, but it is very easy to compute. Depending on the exact situation, there are other corrections that are less conservative, but they are outside the scope of this article.
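An equivalent way to apply the correction in R is to adjust the p-values themselves with the built-in p.adjust(), which multiplies each p-value by the number of tests so you can keep comparing against 0.05. The p-values below are made up for illustration:

p_values <- c(0.030, 0.240, 0.008, 0.620)
# Bonferroni adjustment: multiply each p-value by 4 (capped at 1)
# -> 0.120 0.960 0.032 1.000, so only the third test stays below 0.05
p.adjust(p_values, method="bonferroni")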

So let's repeat the previous experiment where we ran 4 simultaneous A/B tests, but this time apply the Bonferroni correction and look for a 98.75% significance level.

set.seed(2015)
run_experiment_once <- function(x) {
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_1 <- t.test(data1, data2)$p.value

    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_2 <- t.test(data1, data2)$p.value

    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_3 <- t.test(data1, data2)$p.value
    
    data1 <- rnorm(100, mean=1, sd=2)
    data2 <- rnorm(100, mean=1, sd=2)
    p_value_4 <- t.test(data1, data2)$p.value
    
    min(p_value_1, p_value_2, p_value_3, p_value_4)
}

#Bonferroni Correction is applied here
sum(sapply(1:10000, run_experiment_once) < 0.0125) / 10000
## [1] 0.0506

Perfect! By using a Bonferroni correction, we reject at least one test in an experiment only about 5% of the time, which falls in line with our original 95% confidence level.

But what happens as the number of tests change over time?

Say you have 10 tests running right now and you're using a 99.5% confidence level. What if 8 of your tests end, leaving you with 2 tests? Then you update your confidence level to 97.5%, and suddenly one of the two remaining tests might show statistical significance! Personally, I would suggest staying conservative and keeping the 99.5% level for these two tests. This implies using the highest confidence level that each test has had during its lifetime.

Additional Information

Here are some Wikipedia articles on correcting for multiple tests. I have also included a link about false discovery rates, which Optimizely uses in its new stats engine.

http://en.wikipedia.org/wiki/Familywise_error_rate

http://en.wikipedia.org/wiki/Bonferroni_correction

http://en.wikipedia.org/wiki/False_discovery_rate

http://blog.optimizely.com/2015/01/20/statistics-for-the-internet-age-the-story-behind-optimizelys-new-stats-engine/

Conclusion

If you are running many A/B tests, don’t forget to change your significance level. Otherwise, you’ll declare statistical significance when you don’t actually have it.