Summary
I explain blocking, optimal design and covariate adjustment as methods to improve power in design of experiments. I motivate these as methods that data scientists working with online experiments ought to be using, since they can drastically improve the power of an experiment and make design of experiments tractable where it otherwise would not be. I also implement a D-optimal design fitting algorithm from first principles in Python to give the reader a deeper sense of what optimal design does, and I provide a slightly hand-wavy example to sketch how all these methods could be used together in the real world.
Introduction
I feel a certain responsibility to advocate for an interpretation of data science which frames it as a rigorous profession. I’ll spare the polemic; however, I think that whatever the definition of “data science”, it should include the design and analysis of (controlled) experiments (DoE).
In the online business world, basic applications of DoE such as factorial testing (e.g. A/B tests) are known to most because they are enabled by tools such as Optimizely, VWO and so on. However, these tools – to my knowledge – typically do not make use of blocking, covariate adjustment or optimal designs (I’ll cover these later) to improve the power of experiments and minimise the sample sizes needed for significance. I suspect this is the case because these methods often assume an underlying model, and model specification is not a very “point and click” affair. As a glimpse into how the high-tech companies do DoE, Facebook – some time ago – open-sourced a tool called PlanOut, which is a neat DSL able to facilitate all sorts of sampling designs, but as a result the analysis is left up to the user.
Even with fancier tooling, the online setting often assumes that the marginal cost of trials is low and that we can sample ad nauseam. Sampling at that scale is often necessary because the effect sizes we are trying to detect are typically small, the audience over which they are measured is heterogeneous, and the timing of treatments is confounded. I have direct experience here with major online retailers: the number of samples needed to settle a naive A/B test with a small effect size can easily run into the millions and take weeks to collect. However, there are many situations in which we cannot sample ad nauseam, either because we do not have the traffic or because the cost of the experiment is high; I’d go so far as to say that this is probably true in most cases. For example, you have relatively low-throughput software with a relatively small number of customers (most SaaS and on-site enterprise software), or you are marketing to small segments (e.g. major shipping companies, British high-street retailers, Danish high-tech SMBs, etc.), or you are making changes (even to large websites) which may result in revenue loss (e.g. handing out discounts), and so on.
There are, however, more frugal ways. We can use experiment design to sample more specifically in aid of the inferences we are interested in. Once the form of the design is settled we can then use blocking, optimal design and covariate adjustment, together or separately, to significantly improve the power of an experiment. I cover each method in the sections to follow. I won’t cover things like factorial designs, randomisation, the importance of replication and power calculations, which are obviously important but assumed as background knowledge. However, I will start with a few words about linear models because they feature prominently throughout.
Analysing DoE data with linear models
Linear models are perhaps the most featureful way to analyse design of experiments data. All sorts of designs can be expressed in terms of dummy variables, and linear models also facilitate optimal design and error reduction using covariate adjustment, as we will see later. We can write down an \(m\)-dimensional linear model in matrix form as follows:
\[ y = X\beta + \epsilon\] where \(y\) is the response variable, \(X\) is the design matrix of the explanatory variables, \(\beta\) are the explanatory variable weights and \(\epsilon\) is the Gaussian error term. If we are fitting the linear model from experimental data, then \(y\) is the vector of \(m\) responses collected, whilst the \(m \times n\) matrix \(X\) holds the settings used at each trial as row vectors. Data collection for a linear model must produce enough trials to identify the model, and it must also produce statistically significant estimates for the parameters of interest in order to be useful.
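To make the notation concrete, here is a minimal sketch (with hypothetical numbers) that builds the design matrix for a simulated two-factor experiment and estimates the weights by ordinary least squares, \(\hat{\beta} = (X^TX)^{-1}X^Ty\):

import numpy as np

rng = np.random.default_rng(0)

# A 2x2 factorial in two factors, replicated three times (m = 12 trials).
levels = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
factors = np.tile(levels, (3, 1))
X = np.column_stack([np.ones(len(factors)), factors])  # intercept + factor settings
beta_true = np.array([1.0, 0.5, -0.25])                # hypothetical true weights
y = X @ beta_true + rng.normal(0, 0.3, size=len(X))    # simulated responses

# Least-squares estimate of the weights.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # close to beta_true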
Vanilla linear models such as those we will discuss herein assume that the error variance is homoscedastic, which roughly means that it is the same for any given subset of the data. This is a simplifying but unnecessary assumption which can be relaxed using mixed-effects models. Analysis of linear models usually also assumes that the error is Gaussian. I won’t deal with that here, but there are plenty of powerful non-parametric methods, such as bootstrap resampling and permutation tests, which do not require any such assumptions, often without loss of power.
I will now describe methods which can be used in conjunction with linear models to increase their power.
Blocking
Say we are measuring some variable for a population which is naturally partitioned into two groups: A and B. Say the variable is high on average for Group A but low for Group B. If we conduct our experiment ignorant of these groups, then the variable we are measuring will inherit the variability associated with group membership. “Blocking” on these groups means repeating the experiment within each group, and therefore removing the inter-group variance by accounting for group membership. That is, blocking can improve the power of an experiment by removing the between-group variability from the error term, which increases the chances of detecting a true treatment effect if one exists.
This process is particularly straightforward when using homoscedastic linear models. In the initial example, the linear model without blocking factors might be \(y = ax + b + \epsilon\). We can account for the Group A/B block by adding a dummy variable for the groups, where the variable is 1 when the measurement is in the group and zero otherwise. For example, \(y = ax + b + c\gamma_A + \epsilon\), where \(\gamma_A\) is the dummy variable for group A and \(c\) is the coefficient that accounts for the difference in means associated with group A. In this way, the coefficients \(a\) and \(b\) are no longer confounded by the difference in group means. Note that Group B does not need to be explicitly represented since the blocking factor is binary: it is absorbed into the intercept where \(\gamma_A=0\). Generally, a blocking factor with \(k\) levels requires \(k-1\) dummy variables to account for it.
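As a quick illustration of the variance reduction, here is a simulated sketch (hypothetical numbers throughout) comparing the standard error of the slope \(a\) with and without the \(\gamma_A\) dummy:

import numpy as np

rng = np.random.default_rng(0)

# Group A has a higher baseline response than Group B, unrelated to x.
n = 200
group_a = np.repeat([0.0, 1.0], n // 2)    # block dummy: 1 = Group A
x = rng.uniform(0, 1, n)                   # controlled variable
y = 2.0 * x + 1.0 + 3.0 * group_a + rng.normal(0, 0.5, n)

X_unblocked = np.column_stack([x, np.ones(n)])           # y = ax + b
X_blocked = np.column_stack([x, np.ones(n), group_a])    # y = ax + b + c*gamma_A

for name, X in [("unblocked", X_unblocked), ("blocked", X_blocked)]:
    beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = rss[0] / (n - X.shape[1])                     # residual variance
    se_a = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])  # std. error of a
    print(name, round(beta[0], 2), round(se_a, 3))

The blocked model absorbs the difference in group means into the dummy’s coefficient, so the residual variance and the standard error of \(a\) shrink accordingly.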
When deciding how to block, ideally each block should be as homogeneous as possible internally and the blocks as heterogeneous as possible from one another: this will maximise the benefit of blocking. There are fancy techniques for using covariates and other information to create synthetic blocks with the aim of maximising this goal (see propensity score matching). It is also important to keep in mind what the blocking factors interact with. To be of benefit, they must account for part of the experimental error, but if they also interact with the controlled variables then it is necessary to add interaction terms to the model to represent this, otherwise the estimates for the affected variables will be biased by the blocks they fall into.
Optimal design
Once we’ve decided on an experiment design and its blocking factors we have effectively set the design matrix \(X\). The covariance of the resulting parameter estimates is given by:
\[ Cov(\hat{\beta}) = (X^T X)^{-1}\sigma^2\] where \(\sigma^2\) is the variance of the \(\epsilon\) term. Note that this covariance matrix does not depend on \(y\) or \(\beta\), which means that \(X\) can be chosen to minimise the covariance prior to carrying out any experiments. Here, “minimise the covariance” might mean any number of things, and as a result there is a whole host of eligible optimality criteria. The choice of optimality criterion can depend on the specific characteristics of the experiment. For instance, A-optimality – which aims to minimise the average variance of the estimated regression coefficients (the trace of the covariance matrix) – might be more suitable if the primary goal is to obtain precise estimates for all parameters, whilst E-optimality – which aims to minimise the maximum eigenvalue of the covariance matrix – might be more suitable in situations where the worst-case variance needs to be controlled. I will focus on “D-optimality”, which is both very popular and simple to explain.
Recall that the eigenvectors of a covariance matrix point in the variance-maximising directions – à la principal component analysis – and that they are scaled by their corresponding eigenvalues. Meanwhile, the determinant of a matrix is equal to the product of its eigenvalues. Hence, minimising the determinant (D) also minimises the overall spread of the estimates by shrinking the eigenvalues. The task is to pick \(X\) such that \(\arg\min_X{|(X^TX)^{-1}|}\), which minimises the variance a priori, irrespective of the experimental result. Note that since \(|A^{-1}| = 1/|A|\), we will maximise the determinant of \(X^TX\) instead in the implementation to follow.
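For a sense of how these criteria are computed, here is a small sketch (with made-up designs) that evaluates the D-, A- and E-criteria of a design matrix; a balanced two-level design scores better than a lopsided one on all three:

import numpy as np

def design_criteria(X):
    # Determinant, trace and largest eigenvalue of (X^T X)^{-1}: the
    # covariance matrix of the estimates up to the unknown factor sigma^2.
    cov = np.linalg.inv(X.T @ X)
    eigvals = np.linalg.eigvalsh(cov)
    return {"D": np.linalg.det(cov),   # product of the eigenvalues
            "A": np.trace(cov),        # sum of the coefficient variances
            "E": eigvals.max()}        # worst-case direction

balanced = np.array([[1, -1], [1, -1], [1, 1], [1, 1]], dtype=float)
lopsided = np.array([[1, -1], [1, 1], [1, 1], [1, 1]], dtype=float)
print(design_criteria(balanced))   # smaller on all three criteria
print(design_criteria(lopsided))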
If the factors in experiments were typically on continuous scales, it might be straightforward to write a differentiable programme to minimise the determinant via gradient descent and be done with it. However, experiment designs – with some exceptions – are typically discretised such that factors are divided into levels representing significant changes to the factor. Further, factors are often systematically related and cannot be changed independently; they may be subject to thresholds, or a minimal proportion must be preserved, or some parts of their range are irrelevant, and so on. The usual treatment in the context of D-optimal design is therefore to pick a set of \(m\) rows from the space of possible (discretised) settings such that the determinant of the covariance matrix is minimised, making it a combinatorial problem. Fortunately, there are some simple candidate-exchange algorithms which do a good job of finding D-optimal designs in practice.
In 1972, Fedorov and Malyutov published a paper titled “Optimal designs in regression problems” in which they described a simple exchange algorithm – later known as Fedorov exchange – for finding determinant minimising designs. The central insight was that the marginal effect on the determinant of changing a row in a matrix can be calculated efficiently:
\[\mid X_{t}^TX_{t}\mid=\mid X_{t-1}^TX_{t-1}\mid\times (1+\Delta(x_i,x_j))\] where \(\Delta(x_i,x_j)=d(x_j,x_j)-\big[d(x_i,x_i)d(x_j,x_j)-d(x_i,x_j)^2\big]-d(x_i,x_i)\) and \(d(x_i,x_j) = x_i^T(X_{t-1}^TX_{t-1})^{-1}x_j\).
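As a quick numerical sanity check, the identity can be verified directly by exchanging one row of a (made-up) design and comparing determinants:

import numpy as np

rng = np.random.default_rng(1)

X = rng.normal(size=(6, 3))       # hypothetical 6-run, 3-column design
x_i = X[0]                        # row to remove from the design
x_j = rng.normal(size=3)          # candidate row to swap in

M_inv = np.linalg.inv(X.T @ X)
d = lambda a, b: a @ M_inv @ b
delta = d(x_j, x_j) - (d(x_i, x_i) * d(x_j, x_j) - d(x_i, x_j) ** 2) - d(x_i, x_i)

X_new = np.vstack([x_j, X[1:]])   # the exchanged design
lhs = np.linalg.det(X_new.T @ X_new)
rhs = np.linalg.det(X.T @ X) * (1 + delta)
print(np.isclose(lhs, rhs))       # True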
Given some initial design \(X_{t-1}\) (which could be a random selection of candidates), we find \(\arg\max_{x_i,x_j}\Delta(x_i,x_j)\), where \(x_i\) is a row of the current solution and \(x_j\) is a candidate, and we exchange them if \(\Delta(x_i,x_j)\) is bigger than some tolerance. It is notable that for each exchange the algorithm needs to calculate the change in determinant \(mM\) times, where \(m\) is the number of rows in the design and \(M\) is the number of candidates. Cook and Nachtsheim (1980) proposed a simple modification to reduce the number of calculations per exchange from \(mM\) to \(M\): for each row \(x_i\) in the current solution, we calculate \(\Delta(x_i,x_j)\) for all candidates \(x_j\) and exchange the candidate \(\arg\max_{x_j}{\Delta(x_i,x_j)}\) provided \(\Delta(x_i,x_j)>0\). We sweep over the rows repeatedly until the change in determinant is below some tolerance. Both versions of the algorithm are local optimisations; they may need to be run from many starting solutions to make finding an optimal solution more likely. Here is the modified algorithm made explicit in Python:
import numpy as np

def argmax_det(cands, can, sol, s):
    """Find the candidate whose exchange with row `s` most increases |X^T X|.

    `cands` holds all candidate rows, `sol` the indices of the current design,
    `can` the indices not currently in the design and `s` the index of the row
    being considered for removal. Returns the best candidate index and the
    corresponding Delta (0 if no exchange improves the determinant).
    """
    m = cands[list(sol)]
    i = np.linalg.inv(m.transpose() @ m)        # (X^T X)^{-1} for the current design
    d = lambda x, y: x @ i @ y                  # d(x, y) as defined above
    best, val = None, 0
    rs = d(cands[s], cands[s])                  # d(x_i, x_i)
    for c in can:
        rc = d(cands[c], cands[c])              # d(x_j, x_j)
        # Delta(x_i, x_j) from the determinant update formula above
        res = rc - (rs * rc - d(cands[c], cands[s]) ** 2) - rs
        if res > val:
            best, val = c, res
    return best, val

def fedorov(cand_list, k, itr=1000, tol=1e-10):
    """Search for a k-row D-optimal design among the rows of `cand_list`."""
    assert isinstance(cand_list, np.ndarray)
    assert len(cand_list) > k
    can = set(np.arange(len(cand_list)))
    # Start from a random selection of k candidate rows.
    sol = set(np.random.choice(list(can), k, replace=False))
    can = can - sol
    mods = 1
    # Sweep over the design rows until no exchange improves the determinant
    # (or the iteration budget runs out).
    while itr > 0 and mods != 0:
        itr, mods = itr - 1, 0
        for s in sol.copy():
            c, v = argmax_det(cand_list, can, sol, s)
            if v > tol:
                sol = (sol - {s}).union({c})    # swap row s out, candidate c in
                can = (can - {c}).union({s})
                mods += 1
    return cand_list[list(sol)]
The fedorov function takes a numpy array of candidate rows and returns an optimised design matrix with \(k\) rows. A caveat with D-optimal designs, and optimal designs in general, is that they are model dependent: traditional designs can be significantly more flexible with regard to how the collected data will be analysed, whereas changing the model ex post may invalidate an optimal design and introduce bias.
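To close the section, here is a minimal usage sketch (hypothetical model): a single continuous factor discretised into five levels, fitted with an intercept, a linear and a quadratic term; fedorov then picks the four candidate rows that maximise \(|X^TX|\):

levels = np.linspace(-1, 1, 5)
cands = np.column_stack([np.ones_like(levels), levels, levels ** 2])

np.random.seed(0)                 # fedorov draws its starting design from np.random
design = fedorov(cands, 4)
print(design)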
Covariate adjustment
In many situations the response variable of interest is correlated with other variables which can be measured before or at the same time as the response. For example, how often a customer revisits a web page and whether they eventually buy the item may be strongly correlated. If the covariates of interest are not strongly related to the factors we are trying to control, then we can gain power by removing their effect from the response and working with the residual instead. That is, given some model \(y = ax + b + \epsilon\) and some covariate \(z\), we can first regress \(y = cz + d\) to obtain the residual \(\hat{y} = y - cz - d\) and then fit the model \(\hat{y} = ax + b + \epsilon\) instead. Generally, adding covariates should not compromise the estimation of the main effects even if the “covariates” are entirely unrelated to the response, because in that event their coefficients would end up close to zero. They do, however, consume degrees of freedom, which may increase variance in experiments with very small sample sizes or very large numbers of covariates. The main caveat is that if the covariates are strongly related to both the outcome and the controlled variables, they can introduce multicollinearity or bias if not properly accounted for.
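Here is a simulated sketch (hypothetical numbers) of the two-stage adjustment described above, comparing the standard error of the treatment effect before and after removing a strong covariate from the response:

import numpy as np

rng = np.random.default_rng(0)

n = 200
x = rng.permutation(np.repeat([0.0, 1.0], n // 2))   # randomised treatment
z = rng.normal(0, 1, n)                              # pre-treatment covariate
y = 0.5 * x + 2.0 * z + rng.normal(0, 0.5, n)        # response

def slope_and_se(X, y):
    beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = rss[0] / (len(y) - X.shape[1])
    return beta[0], np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])

X = np.column_stack([x, np.ones(n)])

# Unadjusted analysis: y = ax + b.
print(slope_and_se(X, y))

# Regress y on z first, then analyse the residual with the same treatment model.
Z = np.column_stack([z, np.ones(n)])
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
y_res = y - Z @ coef
print(slope_and_se(X, y_res))

Because most of the variance in \(y\) came from \(z\), the adjusted analysis estimates the same treatment effect with a much smaller standard error.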
A hand-wavy example
Here is a shallow example to help bring the concepts together in a real-world setting. Say there is an online store looking to maximise revenue by offering preferential discounts to its customers. It plans to run an experiment to find the optimal discount rate and communication timing, and it does not want to hand out more discounts than necessary to collect the data for the study.
The response variable of interest is “revenue” (say, measured over the 3 months after the discount is offered). The main controlled variables are “discount rate”, which takes eight levels from 5% to 40% in increments of 5%, and “day of week”, which could be Monday to Sunday. To control for variability that could be introduced by different customer groups, the store decides to block on “new customers” versus “returning customers”, as their buying behaviour and responsiveness to discounts is likely to differ. This helps to reduce the residual variability and increase the chances of detecting a true effect of discount on revenue. Further, the store will make use of several covariates to reduce experimental error. Namely:
- average past spending of each customer.
- frequency of past visits to the website.
- number of items usually purchased per visit.
All three covariates are known to be related to total revenue but not to the allocation of discounts, which is randomised within blocks. It should be noted that the covariates are not used in the design matrix: they are used during analysis to reduce experimental error via covariate adjustment.
The discount rates and days of week, together with the blocking factor, constitute the row candidates for the design matrix. There are \(8\times 7 \times 2 = 112\) combinations. The store thinks it has budget for about 350 experimental runs, which would allow it to run the full factorial design with 3 replications whilst remaining within budget. The store thinks this is too few replications per combination and decides instead to increase the number of replications to 5. It uses D-optimal design to select the \(350 / 5 = 70\) combinations of discount rate, day of the week and customer type that minimise the determinant of the covariance matrix, improving the efficiency of the estimation without increasing the experimental budget.
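The candidate matrix itself is easy to construct; here is a rough sketch (with an assumed dummy coding: a numeric discount column, six day-of-week dummies with Monday as the reference level, and a returning-customer dummy) fed to the fedorov function from earlier:

import numpy as np
from itertools import product

discounts = np.linspace(0.05, 0.40, 8)    # the 8 discount levels
days = range(7)                           # Monday..Sunday
blocks = (0.0, 1.0)                       # 0 = new customer, 1 = returning

rows = []
for disc, day, block in product(discounts, days, blocks):
    day_dummies = [1.0 if day == d else 0.0 for d in range(1, 7)]
    rows.append([1.0, disc] + day_dummies + [block])
cands = np.array(rows)                    # 112 x 9 candidate matrix

np.random.seed(0)
design = fedorov(cands, 70)               # 70 distinct runs (each to be replicated 5 times)
print(design.shape)                       # (70, 9)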
Prior to running the experiment, the store’s statistician writes down the planned linear model, settles the design matrix using a D-optimal design search algorithm, and tests the power of the resulting design by simulation under various common-sense assumptions. The statistician also considers variables such as seasonal effects, current consumer spending, etc., and whether they will affect the execution of the experiment. Once the statistician and the execution team are happy with the experiment ex ante, it enters an execution phase for several months. The statistician is aware that, due to the optimal design, the results of the experiment will need to be analysed according to the statistical model that was specified prior to the experiment, and if that model turns out to be inappropriate, further experimental runs may be required to “save” or de-bias the analysis.
Conclusions
Design of experiments is commonly used online, but typically in high-traffic settings and even then in elementary forms such as the A/B test. Yet I suspect most possible uses for DoE don’t fit those constraints. Hopefully I’ve given a reader with some knowledge of DoE a greater sense of how experiments can be put together more frugally using blocking, optimal design and covariate adjustment with linear models and simple algorithms.