27 Jun

A/B Testing – Nonparametric tests

A/B Testing

Nonparametric Statistics

Most A/B testing platforms use Student's t-test to test for statistical significance. However, this test has assumptions that need to be met, and it has some known shortcomings. This is where the Mann-Whitney U test comes in handy: it has fewer assumptions and a different set of shortcomings. Instead of using the data directly, this test converts all data points into ranks by combining the test groups into one group and ranking every observation in the combined group. It then analyzes the groups separately using those combined ranks.
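
As a quick illustration of the ranking step (a small sketch with made-up numbers, not the internals of any particular platform):

# Two hypothetical test groups (made-up values, for illustration only)
group_a <- c(0.5, 1.2, 3.4, 0.8)
group_b <- c(2.1, 0.9, 5.7, 1.5)

# Combine both groups and rank every observation in the combined group
combined_ranks <- rank(c(group_a, group_b))

# Split the combined ranks back out by group
ranks_a <- combined_ranks[seq_along(group_a)]
ranks_b <- combined_ranks[-seq_along(group_a)]

# wilcox.test() runs the Mann-Whitney U test directly on the raw values
wilcox.test(group_a, group_b)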

Normality of Mean Assumption

This is not the same as assuming the data itself is Normal. The assumption is that if we hypothetically repeated this test many times and computed the mean each time, the distribution of those means would be Normal. This is justified by the Central Limit Theorem, which states that as you collect more and more data points, the distribution of the sample mean becomes closer and closer to Normal. This holds for almost any data, even if the data itself is not Normally distributed. However, the less Normal your data is, the more data it takes before the Central Limit Theorem's approximation becomes accurate. More details are given in the links below.
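
As a rough illustration (a sketch with arbitrary sample sizes, not part of the original analysis), we can draw repeated samples from a heavily skewed Gamma distribution and watch the distribution of the sample mean become more and more Normal as the sample size grows:

set.seed(2015)

# 1000 sample means from a skewed Gamma distribution, first with small samples,
# then with much larger samples
means_small <- replicate(1000, mean(rgamma(5, shape=0.2, rate=10)))
means_large <- replicate(1000, mean(rgamma(500, shape=0.2, rate=10)))

par(mfrow=c(1, 2))
hist(means_small, main="Means of samples of size 5")
hist(means_large, main="Means of samples of size 500")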

The Mann-Whitney U test does not require this assumption. Because it uses ranks, it discards almost all details of the data's specific distribution, so its assumptions are much easier to meet.

So if the data is not normally distributed, the Mann-Whitney U test is actually more "efficient" than the t-test, and it is almost as "efficient" when the data is normally distributed. (Reference)

Let's look at an example: the Gamma distribution with shape=0.2 and rate=10.

x <- seq(0, 3, length=100)
y <- dgamma(x, shape=0.2, rate=10)
plot(x, y, type="n", main="Gamma Density Function (shape=0.2, rate=10)")
lines(x, y)

[Figure: Gamma density function (shape=0.2, rate=10)]

We can see this distribution is not Normally distributed. Let's draw a sample of 1,000 points from it, and another sample of 1,000 from a slightly different Gamma distribution (shape=0.24, rate=10, the same setup used in the ties section later in this post). We will perform both a t-test and a Mann-Whitney U test, record the p-values, repeat this 100 times, and take the mean of the p-values.
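
The simulation code for this step is not shown in the original post; the sketch below is one way it might look, assuming a helper run_once(n, shape1, shape2, rate) that draws the two Gamma samples, runs both tests, and returns the t-test p-value followed by the Mann-Whitney U p-value (matching how run_once is called later in this post):

set.seed(2015)

run_once <- function(n, shape1, shape2, rate) {
    # Draw one sample from each of the two slightly different Gamma distributions
    data1 <- rgamma(n, shape=shape1, rate=rate)
    data2 <- rgamma(n, shape=shape2, rate=rate)

    # Return the t-test p-value and the Mann-Whitney U p-value
    c(t.test(data1, data2)$p.value,
      wilcox.test(data1, data2)$p.value)
}

# Repeat the experiment 100 times and average the p-values from each test
p.value <- sapply(1:100, function(x) run_once(1000, 0.2, 0.24, 10))
mean(p.value[1,])
mean(p.value[2,])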

## [1] 0.1681351
## [1] 0.004257654

We can see that the t-test does not reach significance, with an average p-value of about 0.17, while the Mann-Whitney U test does, with an average p-value of about 0.004.

Let's try this again, but with something that looks more Normally distributed. We will use a Gamma distribution with shape=10 and rate=10.

x <- seq(0, 3, length=100)
y <- dgamma(x, shape=10, rate=10)
plot(x, y, type="n", main="Gamma Density Function (shape=10, rate=10)")
lines(x, y)

[Figure: Gamma density function (shape=10, rate=10)]

Let's run our experiment again.

p.value <- sapply(1:100, function(x) run_once(1000, 10, 10.4, 10))
mean(p.value[1,])
## [1] 0.04503111
mean(p.value[2,])
## [1] 0.05102875

We can see both tests have about the same significance. The t-test p-value is about 0.045 and the Mann-Whitney U test p-value is about 0.051.

Outliers

The t-test has problems dealing with outliers (see the outliers post below). The Mann-Whitney U test doesn't suffer from this problem, since everything is converted into ranks: an outlier is treated no differently than any other slightly larger value.
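
To see why, note that the rank of the largest value is the same whether it is slightly larger than the rest or enormously larger (a toy example):

# The largest observation gets the same rank whether it is 11 or 1,000,000
rank(c(5, 7, 9, 11))
rank(c(5, 7, 9, 1000000))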

Let's look at the example from that post again, but also run the Mann-Whitney U test. We will create two data sets with different means, then add a single outlier point to one of them.

set.seed(2015)

# Create a list of 100 random draws from a normal distribution 
# with mean 1 and standard deviation 2
data1 <- rnorm(100, mean=1, sd=2)

# Create a second list of 100 random draws from a normal distribution 
# with mean 2 and standard deviation 2
data2 <- rnorm(100, mean=2, sd=2)

# Perform a t-test on these two data sets
# and get the p-value
t.test(data1, data2)$p.value
## [1] 0.0005304826
# Perform a Mann-Whitney-U test on these two data sets
# and get the p-value
wilcox.test(data1, data2)$p.value
## [1] 0.0002706636
# append 1000 to the first data set only
data1 <- c(data1, 1000)

# Perform a t-test on these two data sets
# and get the p-value
t.test(data1, data2)$p.value
## [1] 0.369525
# Perform a Mann-Whitney-U test on these two data sets
# and get the p-value
wilcox.test(data1, data2)$p.value
## [1] 0.0004766358

We can see the Mann-Whitney U test finds statistical significance before adding the outlier. After we add the outlier, its p-value increases slightly, but the result is still significant. The t-test is significant before the outlier, but loses significance once the outlier is added.

Ties in the values

The Mann-Whitney U test works best when every value is unique. This is normally not a problem for continuous data, but if you have many zeros or count data, you will end up with many ties. There are various ways to resolve ties, but the results are then approximate rather than exact. Approximate isn't necessarily bad, since the t-test is also approximate when the data is not Normally distributed.
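
For example, in R, wilcox.test() cannot compute an exact p-value when ties are present and falls back to a normal approximation (a small sketch; the rounding here is just to manufacture ties):

set.seed(2015)
x <- round(rgamma(30, shape=0.2, rate=10), 2)   # rounding creates ties
y <- round(rgamma(30, shape=0.24, rate=10), 2)

# Requesting the exact test with ties present produces a warning and
# wilcox.test() uses the normal approximation instead
wilcox.test(x, y, exact=TRUE)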

Let's repeat our first test, which found statistical significance with the Mann-Whitney U test. This time we will round off the values to create ties and see how the test performs.

set.seed(2015)

run_once <- function() {
    # Create a list of 1000 random draws from a Gamma distribution 
    # with shape=0.2 and rate=10
    data1 <- rgamma(1000, shape=0.2, rate=10)
    
    # Create a second list of 1000 random draws from a Gamma
    # distribution with shape=0.24 and rate=10
    data2 <- rgamma(1000, shape=0.24, rate=10)
    
    # Perform a Mann-Whitney U test on the original (continuous) data
    a <- wilcox.test(data1, data2)$p.value
    
    # Perform a Mann-Whitney U test after rounding both data sets to the nearest 1/500
    b <- wilcox.test(round(data1*500)/500, round(data2*500)/500)$p.value
    
    # Perform a Mann-Whitney U test after rounding both data sets to the nearest 1/100
    c <- wilcox.test(round(data1*100)/100, round(data2*100)/100)$p.value
    
    # Perform a Mann-Whitney U test after rounding both data sets to the nearest 1/25
    d <- wilcox.test(round(data1*25)/25, round(data2*25)/25)$p.value
    
    # Perform a Mann-Whitney U test after rounding both data sets to the nearest 1/5
    e <- wilcox.test(round(data1*5)/5, round(data2*5)/5)$p.value

    c(a, b, c, d, e)
}

p.value <- sapply(1:100, function(x) run_once())
# mean of continuous data
mean(p.value[1,])
## [1] 0.004257654
# mean after rounding to the nearest 1/500
mean(p.value[2,])
## [1] 0.01127112
# mean after rounding to the nearest 1/100
mean(p.value[3,])
## [1] 0.02937118
# mean after rounding to the nearest 1/25
mean(p.value[4,])
## [1] 0.08112481
# mean after rounding to the nearest 1/5
mean(p.value[5,])
## [1] 0.3527484

We can see that the average p-value is about 0.004 before any rounding. As we round the data more and more aggressively, we create more and more ties, and we lose significance.

Additional Information

http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test

http://www.statisticalengineering.com/central_limit_theorem.htm

http://spin.atomicobject.com/2015/02/12/central-limit-theorem-intro/

Conclusion

If your data is not Normally distributed or contains outliers, consider using the Mann-Whitney U test.

13 Jun

A/B Testing - Common Mistakes - Outliers

A/B Testing: Common Mistakes

Exploratory Data Analysis: Outliers

Outliers can easily cause problems with your A/B test. You may have seen strange anomalies in your metrics, with a particular metric looking too high or too low compared to the others. You may have seen a statistical test start out significant and then lose significance. These problems may be coming from outliers in your data.

Let's look at an example in R. We will create two data sets, the first with a smaller mean than the second.

set.seed(2015)
# Create a list of 100 random draws from a normal distribution 
# with mean 1 and standard deviation 2
data1 <- rnorm(100, mean=1, sd=2)
# Create a second list of 100 random draws from a normal distribution 
# with mean 2 and standard deviation 2
data2 <- rnorm(100, mean=2, sd=2)
# Perform a t-test on these two data sets and get the p-value
t.test(data1, data2)$p.value
## [1] 0.0005304826

We can see this t-test gives a p-value of about 0.0005, which corresponds to a confidence level of about 99.95%. Now let's add a single outlier to the first data set.

# append 1000 to the first data set only
data1 <- c(data1, 1000)
# Perform a t-test on these two data sets and get the p-value
t.test(data1, data2)$p.value
## [1] 0.369525

Now you can see that, even though we had 100 points in each data set, a single large outlier caused the test to lose significance, leaving only about 63% confidence.

How do we fix this?

There are multiple ways to fix this problem. Student's t-test is not robust against outliers, so we can run the Mann-Whitney U test instead. A deeper discussion of that approach is outside the scope of this article.

We can also detect the outliers and consider removing them. This approach needs to be taken very carefully: we should only remove data that does not come from our target population. If we see a single outlying data point, it may simply be bad luck that we observed such a strange value; but it may equally be bad luck that we observed only one of them. Therefore, outliers should be removed only after careful examination. For A/B testing, this usually means removing data that comes from bots rather than humans, which can be difficult because not all bots and scripts report their User Agent properly.
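
As a rough sketch of the idea (the data frame and its column names here are hypothetical), a first pass at removing obvious bot traffic by User Agent might look something like this:

# Hypothetical session-level data with a user_agent column
sessions <- data.frame(
    user_id    = c(1, 2, 3, 4),
    revenue    = c(10, 0, 25, 5000),
    user_agent = c("Mozilla/5.0 (Windows NT 6.1)", "Googlebot/2.1",
                   "Mozilla/5.0 (Macintosh)", "python-requests/2.7"),
    stringsAsFactors = FALSE
)

# Flag user agents that look like bots or scripts (this pattern is illustrative,
# not exhaustive; many bots do not identify themselves at all)
bot_pattern <- "bot|crawler|spider|curl|python-requests"
sessions$is_bot <- grepl(bot_pattern, sessions$user_agent, ignore.case=TRUE)

# Keep only the traffic that appears to come from humans
human_sessions <- sessions[!sessions$is_bot, ]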

Let's look at some diagnostic tools in R. We will create 100 points from a single distribution, then add 2 outliers to the data set.

set.seed(2015)
# Create a list of 100 random draws from a normal distribution 
# with mean 1 and standard deviation 2
data <- rnorm(100, mean=1, sd=2)

# lets add two outliers
data <- c(data, 20)
data <- c(data, 40)

# Create an image with two plots side by side
par(mfrow=c(1, 2))
hist(data)
boxplot(data, main="Box plot of data")

[Figure: histogram of the data (left) and box plot of the data (right)]

On the left is a histogram. We can clearly see the outlier at 40; it is more questionable whether 20 is an outlier. For the box plot on the right, the box itself contains the 25th to 75th percentiles of the data, and the thick line in the middle is the median. The "whiskers" at the top and bottom of the plot extend to the minimum and maximum of the data, excluding "outliers". A good explanation of how box plots in R define outliers can be found at the bottom of this page: http://msenux.redwoods.edu/math/R/boxplot.php
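
R can also list the points that the box plot flags as outliers, using the same 1.5 * IQR rule the whiskers are based on (a quick sketch using the data vector from above):

# Points drawn individually beyond the whiskers (more than 1.5 * IQR from the box)
boxplot.stats(data)$out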

So now we have found a few outliers in our data. Remember, it is important to carefully consider each point before removing it, since we could just as easily have seen several points in that region rather than only one. One technique to try is to perform the test again with the point removed; if the test gives the same result, we might as well leave the data point in.
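
For example, going back to the two-sample test at the start of this post (a sketch; the suspected outlier there is the value 1000 appended to data1), we can compare the test with and without that point:

# p-value with the suspected outlier included
t.test(data1, data2)$p.value

# p-value with the suspected outlier (the last, appended element) removed
t.test(data1[-length(data1)], data2)$p.value

# Here the two results differ a great deal, so the point deserves close scrutiny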

Outliers in more than one dimension

If your data contains two variables, there is another type of outlier to look for. Let's look at the plot below: it has 100 points again, but with two correlated variables, and then we add a single outlier.

set.seed(2015)
# Create a list of 100 random draws from a normal distribution 
# with mean 1 and standard deviation 2
data1 <- rnorm(100, mean=1, sd=2)

# Let's create a second, correlated variable.
correlation <- 0.95
data2 <- correlation * data1 + sqrt(1-correlation) * rnorm(100, mean=1, sd=2)

# Let's add our outlier
data1 <- c(data1, -3)
data2 <- c(data2, 4)

par(mfrow=c(1, 1))
plot(data1, data2)

[Figure: scatter plot of data1 versus data2, with the single outlier in the upper left]

Here most of the data lies close to the lower-left to upper-right diagonal, and we have a single point in the upper left of the plot. In either single dimension, that point is well within the range of the data, but combined in two dimensions it becomes an outlier.

We can find this point by computing something called leverage. Although leverage is generally used to find outliers during linear regression, we can use it here to help detect some outliers.

# create a matrix with our two data sets
data_matrix <- matrix(c(data1, data2), nrow=101, ncol=2)
tail(data_matrix)
##              [,1]        [,2]
## [96,]   0.1888523  0.37660990
## [97,]  -2.3504425 -2.08643945
## [98,]   0.9110532  0.06500559
## [99,]  -1.0946785 -1.09385952
## [100,] -2.4602479 -2.40095114
## [101,] -3.0000000  4.00000000
# leverage is also known as the hat values
leverage <- hat(data_matrix)
tail(leverage)
## [1] 0.01121853 0.03700946 0.02901691 0.02292416 0.04241248 0.71791422

Above are the last 6 rows of the matrix, with our outlier as the last point, along with the corresponding leverage values. You can see the very high leverage value for the last point. As a rule of thumb, leverage values that exceed twice the average leverage should be examined more closely. However, an A/B test usually has many observations and a wide range of leverage values, so in practice you would start with the highest-leverage points and work your way down.
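
A quick way to apply that rule of thumb to the leverage values computed above (a small sketch):

# Flag points whose leverage exceeds twice the average leverage
threshold <- 2 * mean(leverage)
which(leverage > threshold)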

Alternatives

As mentioned above, Student's t-test is not very robust to outliers. There are other tests that are more robust because they are based on each observation's rank instead of its actual value. You can start by looking at the Mann-Whitney U test and enter the world of nonparametric statistics.

http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test

http://en.wikipedia.org/wiki/Nonparametric_statistics

http://www.originlab.com/index.aspx?go=Products/Origin/Statistics/NonparametricTests

http://support.minitab.com/en-us/minitab/17/topic-library/basic-statistics-and-graphs/hypothesis-tests/nonparametrics-tests/understanding-nonparametric-tests/

http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Nonparametric/BS704_Nonparametric2.html

Additional Information

Here are some other links about leverage and other types of outlier detection

http://en.wikipedia.org/wiki/Leverage_(statistics)

http://en.wikipedia.org/wiki/Outlier

http://onlinestatbook.com/2/regression/influential.html

http://pages.stern.nyu.edu/~churvich/Undergrad/Handouts2/31-Reg6.pdf

Conclusion

Bots and scripts can cause problems in your A/B tests. It is important to try to detect these users in your data and remove them since they do not represent your target population.

30 May

A/B Testing - Common Mistakes - Users / Sessions

A/B Testing: Common Mistakes

Users or sessions?

Do you collect data at the user level or at the session level? Are treatments assigned to each user or to each session? And is your data aggregated by user or by session? The answer to all of these questions should be: by user.

Why by User?

When collecting your data, it is better to assign test groups to each user instead of to each session. When a user comes to the site and sees a feature, the feature may or may not affect the user during that session; it may take repeated sessions before he or she acts on it.

There is also a statistical reason. You may remember the acronym IID from your statistics class. It stands for independent and identically distributed, and it is a property your sample should have. For this article, we are concentrating on independence. Independence means that knowing one data point tells you nothing additional about any other data point. For our purposes, if your data points are sessions, then once you know one session from a user, you have a better idea of what the other sessions from that same user look like.

If your data points aren't independent, this causes problems in your variance and standard error calculations. The mean of your data will stay the same, but your standard errors will be different. Having multiple observations from the same person is called clustered sampling, and it requires a specific way of computing the variance of your sample. Suppose the people in your data set vary quite a bit, but every observation from a specific user is exactly the same. Each extra observation from a user then adds no new information, so if you compute the variance of the mean without accounting for the clustering, it will be underestimated.
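
The size of this effect is often summarized by the design effect, which for equal cluster sizes is approximately 1 + (m - 1) * rho, where m is the number of observations per user and rho is the within-user (intraclass) correlation. A small sketch of that formula (the values of m and rho here are arbitrary):

# Approximate variance inflation from clustered sampling: DEFF = 1 + (m - 1) * rho
design_effect <- function(m, rho) 1 + (m - 1) * rho

# 3 sessions per user, observations within a user perfectly correlated:
# the variance of the mean is 3 times what the naive formula suggests
design_effect(3, 1.0)

# 3 sessions per user with moderate within-user correlation
design_effect(3, 0.3)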

Let's do an example in R. First we create 300 data points from 300 users and compute the mean and its standard error. Then we create 3 data points from each of 100 users and again compute the mean and its standard error. We run this 1,000 times and look at the 1,000 differences. In this setup, the variability between users is comparable to the variability within each user.

set.seed(2015)
library(survey)

samplesize.independent <- 300
samplesize.dependent <- 100

run_once <- function(i) {
    
    # Create 300 people each with a different mean and variance
    population.mean <- rnorm(samplesize.independent, mean=0, sd=1)
    population.var <- abs(rnorm((samplesize.independent), 
                                 mean=0, sd=1))

    # Create one data point for each user with their mean and variance
    points.independent <- mapply (function(m, v) {
                               rnorm(1, mean=m, sd=sqrt(v))
                          }, population.mean, population.var)
    points.independent <- unlist(as.list(points.independent))
    
    # create the design object, where each row is a different user
    df.independent <- data.frame(id=1:samplesize.independent,
                                 point=points.independent)
    design.independent = svydesign(id=~id, data=df.independent, 
                                   weights=~1)
    
    # compute the mean and the mean's standard error
    mean.independent <- coef(svymean(~point, design.independent))
    # mean.independent is the same as the simple mean below
    # mean(df.independent$point)

    se.independent <- SE(svymean(~point, design.independent))
    # se.independent is the same as this calculation below
    #sd(df.independent$point)/sqrt(nrow(df.independent))

    # Create 100 people, each with a different mean and variance, 
    # but with same parameters as above
    population.mean <- rnorm(samplesize.dependent, mean=0, sd=1)
    population.var <- abs(rnorm(samplesize.dependent, mean=0, sd=1))

    # Create 3 data points for each user, with the same parameters as above
    pointsperuser <- samplesize.independent/samplesize.dependent
    points.dependent <- mapply (function(m, v) {
        rnorm(pointsperuser, mean=m, sd=sqrt(v))
    }, population.mean, population.var)
    points.dependent <- unlist(as.list(points.dependent))

    # compute the design object, setting the id to define each user
    df.dependent <- data.frame(id=sort(rep(1:samplesize.dependent, 
                          pointsperuser)), point=points.dependent)
    design.dependent = svydesign(id=~id, data=df.dependent, 
                                 weights=~1)
    
    # compute the mean and the mean's standard error
    mean.dependent <- coef(svymean(~point, design.dependent))
    # mean.dependent is the same as the simple mean below
    # mean(df.dependent$point)

    se.dependent <- SE(svymean(~point, design.dependent))
    # se.dependent is no longer the same as below
    se.dependent.wrong <- sd(df.dependent$point) /                
                                 sqrt(nrow(df.dependent))

    c(mean.independent, se.independent, mean.dependent, 
           se.dependent, se.dependent.wrong)
}

result <- sapply(1:1000, run_once)

# Let's look at the percentiles of the difference in means 
quantile(result[3,]-result[1,], c(0.025, 0.25, 0.50, 0.75, 0.975))
##         2.5%          25%          50%          75%        97.5% 
## -0.254349807 -0.091006840  0.007049011  0.100500093  0.278679862
# Let's look at the percentiles of the difference in standard errors
quantile(result[4,]-result[2,], c(0.025, 0.25, 0.50, 0.75, 0.975))
##       2.5%        25%        50%        75%      97.5% 
## 0.01619895 0.02914339 0.03480557 0.04017795 0.05136426
# Let's look at the percentiles of the difference between the correctly 
# and incorrectly computed standard errors
quantile(result[4,]-result[5,], c(0.025, 0.25, 0.50, 0.75, 0.975))
##       2.5%        25%        50%        75%      97.5% 
## 0.02582071 0.03208966 0.03520337 0.03811021 0.04305041

We can see that the difference in mean is around zero. The 95% confidence interval is [-0.25, 0.28]. This is just as expected.

We also see that the 95% confidence interval for the difference in the standard error of the mean is [0.016, 0.051], meaning the dependent points give a larger standard error than the independent points. We can also see that if we compute the standard error without accounting for the clustering, we get a standard error that is too small, with a confidence interval for the difference of [0.026, 0.043]. This will lead to declaring significance when we don't really have it.

Why is this happening? Here is a plot with just twelve points.

set.seed(2015)
# Create 12 users' means and variances
mean <- rnorm(12, mean=0, sd=10)
var <- rexp(12, rate=1)

# Let's create six data points for each user, using the means and variances from above
points <- mapply (function(x, y) {rnorm(6, mean=x, sd=sqrt(y))}, mean, var)

par(mfrow=c(2, 1))
stripchart(points[1,], xlim=c(-20, 10), main="12 independent points")

stripchart(c(points[,1], points[,6]), xlim=c(-20, 10), main="12 points from 2 users (6 each)")

[Figure: strip charts of 12 independent points (top) and 12 points from 2 users (bottom)]

You can see in the bottom plot that the points are clustered into groups and are more spread out overall. This is what gives us the higher variance.

That said, many assumptions are required for a perfect statistical test, and a lot of research has been done on how far we can deviate from these assumptions. It would be impossible for your data points to be completely independent, so some departure from this assumption is expected. Still, staying as close as possible to the independence assumption is the safest thing to do and shouldn't need justification; rather, it should be necessary to demonstrate that the assumption can safely be relaxed before you do so.

Additional Information

Here are some links about sampling and cluster sampling. The last three links contain formulas and derivations.

http://en.wikipedia.org/wiki/Sampling_(statistics)

http://en.wikipedia.org/wiki/Cluster_sampling

http://stattrek.com/survey-research/cluster-sampling.aspx

http://www.stat.purdue.edu/~jennings/stat522/notes/topic5.pdf

http://ocw.jhsph.edu/courses/statmethodsforsamplesurveys/PDFs/Lecture5.pdf

http://www.ph.ucla.edu/epi/rapidsurveys/RScourse/chap5rapid_2004.pdf

Conclusion

Please keep each user to a single test group and aggregate your data by user. Otherwise, you may declare significance when there is none.