A/B Testing: Common Mistakes

Users or sessions?

Do you collect data at the user level or at the session level? Are treatments assigned to each user or to each session? And is your data aggregated by user or by session? The answer to all of these questions should be: by user.

Why by User?

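When collecting your data, it is better to assign test groups to each user instead of each session. When a user comes to the site and sees a feature, the feature may not affect the user’s behavior during that session; it may take repeated sessions before he or she acts on it. If the assignment changed from session to session, the user would see both variants, and you could not attribute the eventual action to either one.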
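
One way to keep a user in the same group across sessions is to derive the group from the user id instead of drawing it at random on every visit. The following is a minimal sketch in base R. The helper `assign_group`, the id format `user-1042`, and the toy character-code hash are all assumptions made for illustration; a real system would likely use a proper hash function and a per-test salt.

```r
# Hypothetical helper: map a user id to a test group deterministically,
# so the same user lands in the same group in every session.
# The weighted character-code sum is a toy hash for illustration only.
assign_group <- function(user.id, groups = c("control", "treatment")) {
  codes <- utf8ToInt(as.character(user.id))
  bucket <- sum(codes * seq_along(codes)) %% length(groups)
  groups[bucket + 1]
}

# Three sessions from the same (made-up) user get the same assignment
sessions <- c("user-1042", "user-1042", "user-1042")
assignments <- vapply(sessions, assign_group, character(1),
                      USE.NAMES = FALSE)
```

Because the group depends only on the id, no per-session state is needed to keep the assignment stable.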

There is also a statistical reason. You may remember the acronym IID from your statistics class. It stands for independent and identically distributed, and it describes a property your sample should have. For this article, we’re concentrating on independence. Independence means that knowing one data point tells you nothing additional about any other data point. If your data points are sessions, this fails: once you know one session from a user, you have a better idea of what the other sessions from that same user look like.
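
To make the dependence concrete, here is a minimal sketch in base R. It assumes a made-up model in which each user has a personal baseline and each session adds some noise on top of it; the names `user.baseline`, `session1`, `session2` and all parameter values are illustrative assumptions.

```r
set.seed(1)
n.users <- 1000

# Hypothetical model: each user has a personal baseline ...
user.baseline <- rnorm(n.users, mean = 0, sd = 1)

# ... and each session is that baseline plus session-level noise
session1 <- user.baseline + rnorm(n.users, mean = 0, sd = 0.5)
session2 <- user.baseline + rnorm(n.users, mean = 0, sd = 0.5)

# Two sessions from the same user: knowing one tells you about the other
cor.same.user <- cor(session1, session2)

# Shuffle the second sessions so that pairs come from different users
cor.different.users <- cor(session1, sample(session2))
```

Under these assumptions, `cor.same.user` should come out clearly positive, while `cor.different.users` should be close to zero.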

If your data isn’t independent, this causes problems in your variance and error calculations. The mean of your data will stay the same, but your standard errors will be different. Having multiple observations from the same person is called cluster sampling, and it requires a specific way to compute the variance of your sample. Suppose the people in your data set vary quite a bit, but every observation from a given user is exactly the same. Each extra observation from that user then adds no new information, yet a naive calculation counts it as another independent data point. If you compute the variance of the mean without considering the clustering, it will be underestimated.
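
One simple way to respect the clustering is to collapse each user’s observations into a single number first and then compute the standard error across users. The sketch below uses base R only and assumes an equal number of sessions per user; the data frame `sessions`, its columns `user` and `value`, and the parameter values are hypothetical.

```r
set.seed(2)
n.users <- 100
sessions.per.user <- 3

# Hypothetical data: a baseline per user plus session-level noise
baseline <- rnorm(n.users, mean = 0, sd = 1)
sessions <- data.frame(
  user  = rep(seq_len(n.users), each = sessions.per.user),
  value = rep(baseline, each = sessions.per.user) +
          rnorm(n.users * sessions.per.user, mean = 0, sd = 0.5)
)

# Naive standard error: treats all 300 sessions as independent
se.naive <- sd(sessions$value) / sqrt(nrow(sessions))

# User-level standard error: one mean per user, then the usual formula
user.means <- tapply(sessions$value, sessions$user, mean)
se.by.user <- sd(user.means) / sqrt(length(user.means))
```

With equal cluster sizes the overall mean is unchanged by the aggregation, but `se.by.user` should be noticeably larger than `se.naive`, because it is based on 100 independent users and not on 300 supposedly independent sessions.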

Let’s check this with a simulation in R. First we create 300 data points from 300 users and compute the mean and its standard error. Then we create 3 data points from each of 100 users and again compute the mean and its standard error. We run this 1,000 times and look at the 1,000 differences. In both samples, the users’ means and variances are drawn from the same distributions.

set.seed(2015)
library(survey)

samplesize.independent <- 300
samplesize.dependent <- 100

run_once <- function(i) {
    
    # Create 300 people each with a different mean and variance
    population.mean <- rnorm(samplesize.independent, mean=0, sd=1)
    population.var <- abs(rnorm((samplesize.independent), 
                                 mean=0, sd=1))

    # Create one data point for each user with their mean and variance
    points.independent <- mapply (function(m, v) {
                               rnorm(1, mean=m, sd=sqrt(v))
                          }, population.mean, population.var)
    points.independent <- unlist(as.list(points.independent))
    
    # create the design object, where each row is a different user
    df.independent <- data.frame(id=1:samplesize.independent,
                                 point=points.independent)
    design.independent = svydesign(id=~id, data=df.independent, 
                                   weights=~1)
    
    # compute the mean and the mean's standard error
    mean.independent <- coef(svymean(~point, design.independent))
    # mean.independent is the same as below
    # mean(df.independent$point)

    se.independent <- SE(svymean(~point, design.independent))
    # se.independent is the same as this calculation below
    #sd(df.independent$point)/sqrt(nrow(df.independent))

    # Create 100 people, each with a different mean and variance, 
    # but with same parameters as above
    population.mean <- rnorm(samplesize.dependent, mean=0, sd=1)
    population.var <- abs(rnorm(samplesize.dependent, mean=0, sd=1))

    # Create 3 data points for each user with same parameters as above
    pointsperuser<- samplesize.independent/samplesize.dependent
    points.dependent <- mapply (function(m, v) {
        rnorm(pointsperuser, mean=m, sd=sqrt(v))
    }, population.mean, population.var)
    points.dependent <- unlist(as.list(points.dependent))

    # compute the design object, setting the id to define each user
    df.dependent <- data.frame(id=sort(rep(1:samplesize.dependent, 
                          pointsperuser)), point=points.dependent)
    design.dependent = svydesign(id=~id, data=df.dependent, 
                                 weights=~1)
    
    # compute the mean and the mean's standard error
    mean.dependent <- coef(svymean(~point, design.dependent))
    # mean.dependent is the same as below
    #mean(df.dependent$point)

    se.dependent <- SE(svymean(~point, design.dependent))
    # se.dependent is no longer the same as below
    se.dependent.wrong <- sd(df.dependent$point) /                
                                 sqrt(nrow(df.dependent))

    c(mean.independent, se.independent, mean.dependent, 
           se.dependent, se.dependent.wrong)
}

result <- sapply(1:1000, run_once)

# Let's look at the percentiles of the difference in means
quantile(result[3,]-result[1,], c(0.025, 0.25, 0.50, 0.75, 0.975))

##         2.5%          25%          50%          75%        97.5% 
## -0.254349807 -0.091006840  0.007049011  0.100500093  0.278679862

# Let's look at the percentiles of the difference in standard errors
# (clustered sample minus independent sample)
quantile(result[4,]-result[2,], c(0.025, 0.25, 0.50, 0.75, 0.975))

##       2.5%        25%        50%        75%      97.5% 
## 0.01619895 0.02914339 0.03480557 0.04017795 0.05136426

# Let's look at the percentiles of the difference between the correctly
# and the incorrectly computed standard error of the clustered sample
quantile(result[4,]-result[5,], c(0.025, 0.25, 0.50, 0.75, 0.975))

##       2.5%        25%        50%        75%      97.5% 
## 0.02582071 0.03208966 0.03520337 0.03811021 0.04305041

We can see that the difference in means is centered around zero, with the middle 95% of the simulated differences in [-0.25, 0.28]. This is just as expected.

We also see that the middle 95% of the differences in the standard error of the mean falls in [0.016, 0.051], meaning the mean of the dependent points has a higher standard error than the mean of the independent points. We can also see that if we compute the standard error without considering the clustering, we get a standard error that is too small: the difference between the correct and the naive standard error falls in [0.026, 0.043]. This will lead to declaring significance when we don’t really have it.
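
The effect on significance can be sketched with a simulated A/A test, where both groups come from the same model and every "significant" result is therefore a false positive. This is a minimal sketch in base R under assumed parameters (200 users per arm, 5 sessions per user, a per-user baseline plus session noise); the helpers `simulate_arm` and `run_aa_test` are hypothetical names introduced here.

```r
set.seed(3)
users.per.arm <- 200
sessions.per.user <- 5
n.sims <- 500

# One arm of a hypothetical A/A test: user baselines plus session noise
simulate_arm <- function() {
  baseline <- rnorm(users.per.arm, mean = 0, sd = 1)
  data.frame(
    user  = rep(seq_len(users.per.arm), each = sessions.per.user),
    value = rep(baseline, each = sessions.per.user) +
            rnorm(users.per.arm * sessions.per.user, mean = 0, sd = 0.5)
  )
}

run_aa_test <- function(i) {
  a <- simulate_arm()
  b <- simulate_arm()
  # Session-level test: ignores that sessions share a user
  p.session <- t.test(a$value, b$value)$p.value
  # User-level test: aggregate to one mean per user first
  p.user <- t.test(tapply(a$value, a$user, mean),
                   tapply(b$value, b$user, mean))$p.value
  c(session = p.session, user = p.user)
}

p.values <- sapply(seq_len(n.sims), run_aa_test)

# Share of simulations declared significant at the 5% level
false.positive.rate <- rowMeans(p.values < 0.05)
```

Under these assumptions, the session-level rate should come out well above the nominal 5%, while the user-level rate should stay close to it.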

Why is this happening? Here is a plot with just twelve points.

set.seed(2015)
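# Create 12 users' means and variances
# (named user.mean / user.var to avoid masking base R's mean() and var())
user.mean <- rnorm(12, mean=0, sd=10)
user.var <- rexp(12, rate=1)

# Let's create six data points for each user, using the mean and variance from above.
# The result is a 6 x 12 matrix: one column per user, one row per draw.
points <- mapply (function(x, y) {rnorm(6, mean=x, sd=sqrt(y))}, user.mean, user.var)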

par(mfrow=c(2, 1))
stripchart(points[1,], xlim=c(-20, 10), main="12 independent points")

stripchart(c(points[,1], points[,6]), xlim=c(-20, 10), main="6 points from each of 2 users")

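You can see that in the bottom plot the points form two clusters, one per user, and are more spread out overall. Twelve points from two users tell us much less about the population than twelve points from twelve users, and this is what gives us higher variance.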

That said, a perfect statistical test requires many assumptions, and a lot of research has gone into finding out how far we can deviate from them. It would be impossible for your data points to be completely independent, so some departure from this assumption is expected. But staying as close as possible to independence is the safest thing to do, and that choice shouldn’t need to be justified. The burden of proof should be on showing that the assumption can be relaxed.

Additional Information

Here are some links that talk about sampling and/or cluster sampling. The last three links contain formulas and derivations.

http://en.wikipedia.org/wiki/Sampling_(statistics)

http://en.wikipedia.org/wiki/Cluster_sampling

http://stattrek.com/survey-research/cluster-sampling.aspx

http://www.stat.purdue.edu/~jennings/stat522/notes/topic5.pdf

http://ocw.jhsph.edu/courses/statmethodsforsamplesurveys/PDFs/Lecture5.pdf

http://www.ph.ucla.edu/epi/rapidsurveys/RScourse/chap5rapid_2004.pdf

Conclusion

Please keep each user to a single test group and aggregate your data by user. Otherwise, you may declare significance when there is none.