30 May

A/B Testing - Common Mistakes - Users / Sessions

A/B Testing: Common Mistakes

Users or sessions?

Do you collect data at the user level or at the session level? Are treatments assigned to each user or to each session? And is your data aggregated by user or by session? The answer to all of these questions should be: by user.

Why by User?

When collecting your data, it is better to assign test groups to each user instead of each session. When a user comes to the site and sees a feature, the feature may or may not affect the user during that session; it may take repeated sessions before he or she acts on it.

There is also a statistical reason. You may remember the acronym IID from your statistics class. It stands for independent and identically distributed, which is what your sample is assumed to be. For this article, we’re concentrating on independence. Independence means that knowing one data point tells you nothing additional about any other data point. For our purposes, if your data points are sessions, then once you know one session from a user, you have a better idea of what the other sessions from that same user look like.

If your data isn’t independent, this causes problems in your variance and error calculations. The mean of your data will stay the same, but your standard errors will be different. Having multiple observations from the same person is called clustered sampling, and it requires a specific way to compute the variance of your sample. Suppose the people in your data set vary quite a bit, but every observation from a specific user is exactly the same. The variance you observe will then be lower than the true variance across users. If you compute the variance without accounting for the clustering, you will underestimate it.
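
As a rule of thumb (this is the standard survey-sampling design-effect formula, not something from this article), you can estimate how much the naive standard error understates the truth when sessions from the same user are correlated. Here is a minimal sketch in R with made-up numbers:

# Design effect for clustered samples (a sketch, assuming equal cluster sizes):
# m sessions per user, rho = intraclass correlation between a user's sessions.
m    <- 3
rho  <- 0.5                         # hypothetical value
deff <- 1 + (m - 1) * rho           # design effect

se.naive   <- 0.05                  # hypothetical naive standard error
se.cluster <- se.naive * sqrt(deff) # rough size of the correct standard error
se.cluster
## [1] 0.07071068

With three sessions per user and a moderate correlation, the correct standard error is roughly 40% larger than the naive one.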

Let’s do an example in R. We’ll first create 300 data points from 300 users and compute the mean and its standard error. Then we’ll create 3 data points from each of 100 users and compute the mean and standard error again. We will run this 1,000 times and look at the distribution of the 1,000 differences. In this case, the variance between users will be comparable to the variance within each user.

set.seed(2015)
library(survey)

samplesize.independent <- 300
samplesize.dependent <- 100

run_once <- function(i) {
    
    # Create 300 people each with a different mean and variance
    population.mean <- rnorm(samplesize.independent, mean=0, sd=1)
    population.var <- abs(rnorm((samplesize.independent), 
                                 mean=0, sd=1))

    # Create one data point for each user with their mean and variance
    points.independent <- mapply (function(m, v) {
                               rnorm(1, mean=m, sd=sqrt(v))
                          }, population.mean, population.var)
    points.independent <- unlist(as.list(points.independent))
    
    # create the design object, where each row is a different user
    df.independent <- data.frame(id=1:samplesize.independent,
                                 point=points.independent)
    design.independent = svydesign(id=~id, data=df.independent, 
                                   weights=~1)
    
    # compute the mean and the mean's standard error
    mean.independent <- coef(svymean(~point, design.independent))
    # mean.independent is just the same as the line below
    # mean(df.independent$point)

    se.independent <- SE(svymean(~point, design.independent))
    # se.independent is the same as this calculation below
    #sd(df.independent$point)/sqrt(nrow(df.independent))

    # Create 100 people, each with a different mean and variance, 
    # but with same parameters as above
    population.mean <- rnorm(samplesize.dependent, mean=0, sd=1)
    population.var <- abs(rnorm(samplesize.dependent, mean=0, sd=1))

    # Create 3 data points for each user with same parameters as above
    pointsperuser <- samplesize.independent/samplesize.dependent
    points.dependent <- mapply (function(m, v) {
        rnorm(pointsperuser, mean=m, sd=sqrt(v))
    }, population.mean, population.var)
    points.dependent <- unlist(as.list(points.dependent))

    # compute the design object, setting the id to define each user
    df.dependent <- data.frame(id=sort(rep(1:samplesize.dependent, 
                          pointsperuser)), point=points.dependent)
    design.dependent = svydesign(id=~id, data=df.dependent, 
                                 weights=~1)
    
    # compute the mean and the mean's standard error
    mean.dependent <- coef(svymean(~point, design.dependent))
    # mean.dependent is the same as the line below
    #mean(df.dependent$point)

    se.dependent <- SE(svymean(~point, design.dependent))
    # se.dependent is no longer the same as below
    se.dependent.wrong <- sd(df.dependent$point) /                
                                 sqrt(nrow(df.dependent))

    c(mean.independent, se.independent, mean.dependent, 
           se.dependent, se.dependent.wrong)
}

result <- sapply(1:1000, run_once)

# Let's look at the percentiles of the difference in means
quantile(result[3,]-result[1,], c(0.025, 0.25, 0.50, 0.75, 0.975))
##         2.5%          25%          50%          75%        97.5% 
## -0.254349807 -0.091006840  0.007049011  0.100500093  0.278679862
# Let's look at the percentiles of the difference in standard errors
quantile(result[4,]-result[2,], c(0.025, 0.25, 0.50, 0.75, 0.975))
##       2.5%        25%        50%        75%      97.5% 
## 0.01619895 0.02914339 0.03480557 0.04017795 0.05136426
# Let's look at the percentiles of the difference between the correct
# and the incorrectly computed standard errors
quantile(result[4,]-result[5,], c(0.025, 0.25, 0.50, 0.75, 0.975))
##       2.5%        25%        50%        75%      97.5% 
## 0.02582071 0.03208966 0.03520337 0.03811021 0.04305041

We can see that the difference in means is centered around zero, with a 95% interval of [-0.25, 0.28]. This is just as expected.

We also see that the 95% interval for the difference in the standard error of the mean is [0.016, 0.051], meaning the dependent points have a higher standard error (and hence variance) than the independent points. We can also see that if we compute the standard error without considering the clustering, we get a standard error that is too small: the interval for the difference is [0.026, 0.043]. This will lead to declaring significance when we don’t really have it.
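
In practice, the simplest way to stay on the safe side is to aggregate to one row per user before testing. Here is a minimal sketch, assuming session-level rows in a hypothetical data frame called sessions with columns user_id, group, and metric (these names are illustrative, not from the article):

# Collapse session-level data to one observation per user, then test as usual
# (assuming each user belongs to exactly one of two test groups).
by.user <- aggregate(metric ~ user_id + group, data=sessions, FUN=mean)
t.test(metric ~ group, data=by.user)

# Or keep the session-level rows and let the survey package account for the
# clustering, as in the simulation above, e.g. with a design-based t-test:
# design <- svydesign(id=~user_id, data=sessions, weights=~1)
# svyttest(metric ~ group, design)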

Why is this happening? Here is a plot with just twelve points.

set.seed(2015)
# Create 12 users' means and variances
mean <- rnorm(12, mean=0, sd=10)
var <- rexp(12, rate=1)

# Let's create six data points for each user, using the means and variances from above
points <- mapply (function(x, y) {rnorm(6, mean=x, sd=sqrt(y))}, mean, var)

par(mfrow=c(2, 1))
stripchart(points[1,], xlim=c(-20, 10), main="12 independent points")

stripchart(c(points[,1], points[,6]), xlim=c(-20, 10), main="12 points from 2 users")

[Figure: two strip charts on the same x-axis scale; top panel "12 independent points", bottom panel "12 points from 2 users"]

You can see in the bottom plot that the points are clustered and the clusters are more spread out. This is what gives us the higher variance.

That said, a perfect statistical test requires many assumptions, and a lot of research has been done on how far we can deviate from them. Your data points will never be completely independent, so some departure from this assumption is expected. But staying as close as possible to independence is the safest thing to do and shouldn’t need to be justified; rather, it should be on you to demonstrate that the assumption can safely be relaxed.

Additional Information

Here are some links that talk about sampling and/or cluster sampling. The last three links contain formulas and derivations.

http://en.wikipedia.org/wiki/Sampling_(statistics)

http://en.wikipedia.org/wiki/Cluster_sampling

http://stattrek.com/survey-research/cluster-sampling.aspx

http://www.stat.purdue.edu/~jennings/stat522/notes/topic5.pdf

http://ocw.jhsph.edu/courses/statmethodsforsamplesurveys/PDFs/Lecture5.pdf

http://www.ph.ucla.edu/epi/rapidsurveys/RScourse/chap5rapid_2004.pdf

Conclusion

Please keep each user to a single test group and aggregate your data by user. Otherwise, you may declare significance when there is none.

16 May

A/B Testing - Common Mistakes - Adjusting Traffic

A/B Testing: Common Mistakes

Adjusting Traffic proportions

Let’s say you are A/B testing a new feature and you’ve given that feature 1% of the traffic and the other 99% to control. After a week, the feature is looking promising, so you give it 10% of the traffic. As you gain more confidence that the feature is working well, you keep giving it a larger portion of the traffic. After a month, your A/B test reaches statistical significance and you declare victory. Right? Not really.

What’s the matter?

As the test went on, the proportion of traffic was shifted between control and treatment. Meanwhile, the composition of the overall traffic also changes during the test. As a result, the control average is weighted more towards the old traffic and the treatment average is weighted more towards the new traffic.
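
To see the size of the bias, here is a quick back-of-the-envelope calculation using the proportions from the simulation below (assuming the metric averages 1 in the first half of the experiment and 2 in the second half):

# Control: 90 old points and 50 new points; treatment: 10 old and 50 new.
mean.control   <- (90 * 1 + 50 * 2) / (90 + 50)
mean.treatment <- (10 * 1 + 50 * 2) / (10 + 50)
mean.treatment - mean.control
## [1] 0.4761905

So even though the treatment does nothing, the expected gap is almost half a standard deviation.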

Let’s try an example in R. Suppose that because of marketing efforts, or because of other successful tests, our traffic suddenly changed and our metric is higher during the second half of our experiment. Now let’s use a treatment that actually has no effect, but increase its proportion in the second half from 10% to 50% of the data.

set.seed(2015)
# For the first half of the experiment
# create 100 random draws from a normal distribution with mean=1
first_half <- rnorm(100, mean=1, sd=1)

# For the second half of the experiment
# create a second list of 100 random draws from a normal distribution with mean=2
second_half <- rnorm(100, mean=2, sd=1)

# Define control as having 90% of traffic from the first half
# and 50% of traffic from the second half
control <- c(first_half[1:90], second_half[1:50])

# Define treatment as having 10% of traffic from the first half
# and 50% of traffic from the second half
treatment <- c(first_half[91:100], second_half[51:100])

# Perform t-test
t.test(control, treatment)$p.value
## [1] 0.002112838

Here, the treatment has no effect over control: both draw from the same before and after distributions. All we changed was the sampling proportion of before and after. As a result, even though treatment and control should show no significant difference, the test reports a p-value of 0.002, which is 99.8% significance.

What can we do about it?

We can adjust our statistical test to handle the situation. Adjusting the estimated mean and variance for the different proportions is out of the scope of this article.
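
For the curious, one simplified way such an adjustment could look (a sketch, not something covered in this article) is to include the time period in the model, so the before/after shift is not attributed to the treatment. The variable names below are made up, and the control and treatment vectors are reused from the example above:

# Label each observation with its period and test group, then fit a model
# that controls for the period.
period <- c(rep("before", 90), rep("after", 50),   # control: 90 old, 50 new
            rep("before", 10), rep("after", 50))   # treatment: 10 old, 50 new
group  <- c(rep("control", 140), rep("treatment", 60))
metric <- c(control, treatment)
summary(lm(metric ~ group + period))$coefficients["grouptreatment", ]
# With the period accounted for, the treatment coefficient is expected to be
# near zero instead of looking falsely significant.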

However, there is a simple way to avoid the situation entirely. If you want to start with 10% of traffic on the new treatment, only put 10% of traffic on control. Then if you want to increase traffic to treatment, add the same amount of traffic to control.

Let’s continue our last example, but this time with proportions of 10%/10% control/treatment before the change and 50%/50% after the change.

# Define control to be 10% from the first half
# and 50% of traffic from the second half
control <- c(first_half[1:10], second_half[1:50])

# Define treatment to be 10% from the first half
# and 50% from the second half
treatment <- c(first_half[91:100], second_half[51:100])

# Perform t-test
t.test(control, treatment)$p.value
## [1] 0.3189567

We can see the p-value is 32%, which translates into 68% significance. This non-significance is what we expected, since we used a treatment with no effect.

There is one last note. If we keep the proportions of control and treatment exactly equal, we can only ever have up to 50% of traffic on the new treatment. If you eventually want to have 80% of traffic on treatment, then start with 8% of traffic on treatment and 2% on control, and scale both up proportionally to an 80%/20% split. However, it is worth mentioning that the fastest way to statistical significance is a 50/50 split of traffic.
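
Here is a quick sketch of that 4:1 version, reusing first_half and second_half from the examples above (this block is illustrative and not from the original article):

# 8% treatment / 2% control from the first half, 80% / 20% from the second half.
control.small   <- c(first_half[1:2],  second_half[1:20])
treatment.large <- c(first_half[3:10], second_half[21:100])
t.test(control.small, treatment.large)$p.value
# Both groups now draw about 9% of their data from the first half, so neither
# is biased towards the old or the new traffic, and the p-value is not
# systematically pushed towards significance.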

Additional Information

If you absolutely must change your traffic proportions, then you need to adjust your mean and variance using oversampling and undersampling techniques. Once the traffic proportions have changed, you have oversampled with respect to time, and this needs to be corrected.

http://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis

http://www.data-mining-blog.com/tips-and-tutorials/overrepresentation-oversampling/

Conclusion

Be careful when shifting more traffic to your treatment group. Make sure the ratio of traffic is always the same between your control and all treatment groups.

02 May

A/B Testing - Common Mistakes - Multiple Treatments

A/B Testing: Common Mistakes

Tests with more than one Treatment group

When the A/B test has a control group and one treatment group, you just wait for your statistical test to reach 95% significance, declare a victor, and you’re done. If there are multiple treatment groups, you might wait for 95% statistical significance for any treatment against control, then declare a victor and call it done. But you’re really not done. If you’ve read my previous post about multiple simultaneous A/B tests, you might be able to guess one problem. But there is also a second problem.

How do I handle this situation?

If you have read my previous post, you know the first problem: if we’re testing multiple features, we’re more likely to see a large difference between test groups just by chance. Since we are now testing more than one treatment against control, it is more likely that one of these treatment groups is different just by chance. This means we need to correct our test procedure. One easy correction is the Bonferroni correction. Suppose we have one control group and 4 treatment groups, so there are four tests being performed against the control group. We change our 100% - 5% = 95% significance level to 100% - 5%/4 = 98.75% significance. This is a conservative correction and there are better ones, but the Bonferroni correction is quick and easy to do.
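
In R this only takes a line or two. Here is a minimal sketch using made-up p-values for the four treatment-vs-control tests; p.adjust is in base R:

# Four hypothetical p-values, one per treatment-vs-control comparison.
p.values <- c(0.030, 0.200, 0.011, 0.048)

# Compare against the Bonferroni-corrected threshold...
p.values < 0.05 / 4
## [1] FALSE FALSE  TRUE FALSE

# ...or, equivalently, inflate the p-values and compare against 0.05.
p.adjust(p.values, method="bonferroni")
## [1] 0.120 0.800 0.044 0.192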

Now, there is still a second problem. Suppose two of our treatment groups are higher than control and one of them is statistically significant. We pick the statistically significant one and we’re done, right? Not exactly. Yes, that treatment group is higher than the control group. But we don’t know whether it is also higher than the other treatment group. We need to perform another statistical test comparing these two treatment groups in order to correctly pick a winner. In general, we should be comparing all pairs of groups with each other. If we have k total groups (1 control group and k-1 treatment groups), we need to perform up to (k choose 2) different pairwise comparisons to find the best group. This also means making a Bonferroni correction with (k choose 2) comparisons. So back to our example of 1 control and 4 treatment groups: we make (5 choose 2) = 10 different pairwise comparisons, looking for a significance level of 100% - 5%/10 = 99.5%. A minimal sketch with base R’s pairwise.t.test follows.
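
This sketch is not from the original article; the data and group means below are simulated just to show the mechanics:

set.seed(1)
# Simulate 1 control and 4 treatment groups (means are made up).
metric <- c(rnorm(100, mean=1.0, sd=2), rnorm(100, mean=1.2, sd=2),
            rnorm(100, mean=1.4, sd=2), rnorm(100, mean=1.5, sd=2),
            rnorm(100, mean=1.6, sd=2))
group  <- rep(c("control", "treatment1", "treatment2", "treatment3", "treatment4"),
              each=100)

# All (5 choose 2) = 10 pairwise t-tests, with Bonferroni-adjusted p-values.
pairwise.t.test(metric, group, p.adjust.method="bonferroni", pool.sd=FALSE)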

There is a second solution to our problem. Our goal is to see if the best treatment is better than the other treatments. All we need to do is take the best treatment as one group and place all the remaining treatments into a second group. Then we perform a single test between the two groups.

However, that isn't perfect. It is possible that the worst performing group is the only group that is statistically different, and there could be any number of possible pairs we might want to consider. Still, we don’t need to make all (k choose 2) pairwise comparisons; we just need to know whether the best group is actually the best. So we only need to make (k-1) comparisons from the best group to all the others, and only need a Bonferroni correction with (k-1) comparisons. For our example, this means we pick the 1 group out of 5 (1 control and 4 treatments) that is currently the highest, and test it for significance against the other 4 groups using a significance level of 100% - 5%/4 = 98.75%.

Let’s look at some examples in R. We’ll use just a control and two treatments, each from a different distribution. Control has the lowest average, treatment 1 the middle average, and treatment 2 the highest average.

set.seed(2015)
run_experiment_once <- function(x) {
    # Create 3 data sets with 3 different means
    control <- rnorm(100, mean=1, sd=2)
    treatment1 <- rnorm(100, mean=1.4, sd=2)
    treatment2 <- rnorm(100, mean=1.6, sd=2)
    
    # Compute significance for each of the three pairs of tests
    p_value_1 <- t.test(control, treatment1)$p.value
    p_value_2 <- t.test(control, treatment2)$p.value
    p_value_3 <- t.test(treatment1, treatment2)$p.value
    
    # Return all 3 results
    c(p_value_1, p_value_2, p_value_3, mean(control), mean(treatment1), mean(treatment2)) 
}

# Repeat the experiment 10,000 times
result <- sapply(1:10000, run_experiment_once)

Let’s look at each treatment compared with control and see how often we would’ve finished our A/B test. Finishing the A/B test in this case means finding significance in either of the treatments when compared with control. Don’t forget to apply the Bonferroni correction.

# The first row in result is the significance between control and treatment 1
result1 <- result[1,] < 0.025

# The second row in result is the significance between control and treatment 2
result2 <- result[2,] < 0.025

# Sum will add up the list of boolean expressions where TRUE=1 and FALSE=0
count.either_success <- sum(result1 | result2)

# display the percentage of success
count.either_success / 10000
## [1] 0.4959

It looks like for our experiment, at least one of the two treatments is statistically significant 49.6% of the time. Now, let’s take a look and see how often treatment 2 was significant and treatment 1 was not.

count.treatment1_failure.treatment2_success <- sum(result[1,] > 0.025 & result[2,] < 0.025)

count.treatment1_failure.treatment2_success / count.either_success
## [1] 0.5922565
count.treatment1_failure.treatment2_success / 10000
## [1] 0.2937

Given that we found statistical significance, we can see that 59% of the time it was treatment 2 (the higher one) that reached significance before treatment 1. If the approach was to declare victory at this point, you would be correct. Overall, out of 10,000 experiments, we would’ve correctly chosen treatment 2 about 29% of the time.

Now, let’s take a look at how often treatment 1 was significant and treatment 2 was not.

count.treatment1_success.treatment2_failure <- sum(result[1,] < 0.025 & result[2,] > 0.025)

count.treatment1_success.treatment2_failure / count.either_success
## [1] 0.1103045
count.treatment1_success.treatment2_failure / 10000
## [1] 0.0547

Now, given that we found statistical significance, 11% of the time we would’ve incorrectly declared treatment 1 the winner, even though we generated its data with a lower mean. Overall, this happens 5.5% of the time.

For our experiment, there is no point checking how often both treatments are significant: under the strategy of declaring victory as soon as one treatment shows significance, we would never get that far.

Let’s try the other approach. We take the highest group and see how often it was significantly higher than both of the other groups. Don’t forget the Bonferroni correction.

# the fourth row is the average of control
# the fifth row is the average of treatment 1
# the sixth row is the average of treatment 2

count.control_highest <- sum(result[4,] > result[5,] & result[4,] > result[6,])
count.control_highest / 10000
## [1] 0.0065
count.treatment1_highest <- sum(result[5,] > result[4,] & result[5,] > result[6,])
count.treatment1_highest / 10000
## [1] 0.2415
count.treatment2_highest <- sum(result[6,] > result[4,] & result[6,] > result[5,])
count.treatment2_highest / 10000
## [1] 0.752

We see that control is highest 0.65% of the time, treatment 1 is highest 24% of the time, and treatment 2 is highest 75% of the time. Now, given that a group is highest, let’s see how often it is also statistically significant against both of the other groups.

# the first row is p-value for control vs treatment 1
# the second row is p-value for control vs treatment 2
# the third row is p-value for treatment 1 vs treatment 2

count.control_significance <- sum(result[4,] > result[5,] & result[4,] > result[6,] & result[1,] < 0.025 & result[2,] < 0.025) 
# given control is highest, what percent is it significant?
count.control_significance / count.control_highest
## [1] 0
# what percent is control highest and significant?
count.control_significance / 10000
## [1] 0

We can see that when control is the highest, we never reach statistical significance, which is correct since we generated it with the lowest mean.

count.treatment1_significance <- sum(result[5,] > result[4,] & result[5,] > result[6,] & result[1,] < 0.025 & result[3,] < 0.025) 
# given treatment 1 is highest, what percent is it significant?
count.treatment1_significance / count.treatment1_highest
## [1] 0.004968944
# what percent is treatment 1 highest and significant?
count.treatment1_significance / 10000
## [1] 0.0012

Given treatment 1 is highest, it was statistically significant against both other groups only 0.5% of the time. Overall, it was highest and statistically significant only 0.12% of the time, a much lower false-winner rate than the 5.5% from the first approach.

count.treatment2_significance <- sum(result[6,] > result[4,] & result[6,] > result[5,] & result[2,] < 0.025 & result[3,] < 0.025)  
# given treatment 2 is highest, what percent is it significant?
count.treatment2_significance / count.treatment2_highest
## [1] 0.06263298
# what percent is treatment 2 highest and significant?
count.treatment2_significance / 10000
## [1] 0.0471

Given treatment 2 is highest, we find statistical significance against both other groups 6% of the time. Overall, it was highest and statistically significant 4.7% of the time, much less frequently than with the first approach. This happens because we’re now testing significance against both control and treatment 1, instead of just control.

Is there something better?

Since we’re still looking at significance between multiple pairs of groups, we’re going to need more time and data. But this is the proper thing to do. If desired, however, you can remove the test groups that are already statistically significant.

There is an additional approach that works. If we only look at pairwise comparisons between groups, we are ignoring all the other groups and not using all of our information. We can instead rank all of our treatments and split them into a high group and a low group, each time choosing a different point at which to split. This may give more success than looking at only two groups at a time.

Finally, there are various tests that look at all the groups at once and tell you whether all the groups have the same average. If one group has a different average, the test will reject the null hypothesis. ANOVA is one such test. You can read more about it in the links below; a minimal sketch in R follows after them.

http://www.physics.csbsju.edu/stats/anova.html

https://explorable.com/anova

http://en.wikipedia.org/wiki/Analysis_of_variance
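
As promised, here is a minimal one-way ANOVA sketch using base R’s aov. The data is freshly simulated in the spirit of the experiment above, not taken from it:

set.seed(2015)
# Three groups with different true means, as in the simulation above.
metric <- c(rnorm(100, mean=1, sd=2), rnorm(100, mean=1.4, sd=2),
            rnorm(100, mean=1.6, sd=2))
group  <- factor(rep(c("control", "treatment1", "treatment2"), each=100))

# One test across all groups at once: a small p-value for the group term
# says the means are not all equal, but not which group is the best.
summary(aov(metric ~ group))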

Conclusion

Now you know how to properly control your level of significance and find the correct winner when there is more than one treatment group in your A/B test. As you can see, it takes longer to reach significance against all groups instead of just control, but it is the correct thing to do.