CUPED · Tae Hyun Kim (Lowell)

Definition

CUPED (Controlled-experiment Using Pre-Experiment Data) is a technique that uses data collected before an experiment to reduce the variance of the treatment-effect estimate in A/B tests.

$\hat{Y}_{cuped} = Y - \theta(X - E[X])$

where:

$Y$ : the outcome observed during the experiment
$X$ : pre-experiment data (e.g., behavior during the 2 weeks before the experiment)
$\theta = \frac{Cov(Y, X)}{Var(X)}$ : the adjustment coefficient

It was proposed by Deng et al. (2013) at Microsoft Research.

Intuitive Understanding

Each customer has a different baseline propensity to purchase. Some customers naturally buy a lot, while others buy little.

CUPED uses “how much this customer tends to purchase in the first place” to reduce the noise in the experiment results. By removing the variation that is predictable from pre-experiment behavior, the treatment effect can be estimated more precisely.
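
To make the intuition concrete, here is a minimal sketch with six hypothetical customers. The spend figures are invented for illustration, and the names `x_pre`, `y`, and `y_cuped` are assumptions, not part of the original note.

```python
import numpy as np

# Hypothetical spend for six customers (illustrative numbers only)
x_pre = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])  # pre-experiment spend
y = np.array([12.0, 19.0, 33.0, 41.0, 48.0, 63.0])      # spend during the experiment

# Adjustment coefficient: Cov(Y, X) / Var(X), both computed with ddof=1
theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)

# Remove the part of y that is predictable from pre-experiment spend
y_cuped = y - theta * (x_pre - x_pre.mean())

print(f"theta        = {theta:.3f}")
print(f"Var(y)       = {np.var(y, ddof=1):.2f}")
print(f"Var(y_cuped) = {np.var(y_cuped, ddof=1):.2f}")
```

Most of the spread in `y` comes from customers simply differing in how much they usually spend, so the adjusted values vary far less while keeping the same mean.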

Key Properties

Variance Reduction

$Var(\hat{Y}_{cuped}) = Var(Y)(1 - \rho^2_{XY})$

$\rho_{XY}$ : the correlation coefficient between $X$ and $Y$
The higher the correlation, the greater the variance reduction
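
A short derivation sketch of this result, written for a generic coefficient $\theta$ and treating $E[X]$ as a known constant:

$Var(Y - \theta(X - E[X])) = Var(Y) - 2\theta \, Cov(Y, X) + \theta^2 Var(X)$

Setting the derivative with respect to $\theta$ to zero gives $\theta = \frac{Cov(Y, X)}{Var(X)}$, and substituting this value back in:

$Var(\hat{Y}_{cuped}) = Var(Y) - \frac{Cov(Y, X)^2}{Var(X)} = Var(Y)(1 - \rho^2_{XY})$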

Unbiasedness Preserved

Under random assignment: $E[\hat{Y}_{cuped}|T=1] - E[\hat{Y}_{cuped}|T=0] = E[Y|T=1] - E[Y|T=0]$

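Because $X$ is measured before assignment, randomization gives $E[X|T=1] = E[X|T=0]$, so the $\theta(X - E[X])$ terms cancel in the difference. The difference in CUPED-adjusted means therefore remains an unbiased estimator of the treatment effect.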
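
The property can be checked empirically with a small Monte Carlo sketch. The data-generating process, the effect size of 2, and all variable names below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_effect = 2.0       # assumed effect size for this simulation
n, n_sims = 2000, 500   # users per experiment, number of simulated experiments

raw_estimates, cuped_estimates = [], []
for _ in range(n_sims):
    baseline = rng.normal(50, 10, n)         # unobserved propensity
    x_pre = baseline + rng.normal(0, 5, n)   # pre-experiment covariate
    treatment = rng.integers(0, 2, n)        # random 0/1 assignment
    y = baseline + true_effect * treatment + rng.normal(0, 5, n)

    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    y_cuped = y - theta * (x_pre - x_pre.mean())

    raw_estimates.append(y[treatment == 1].mean() - y[treatment == 0].mean())
    cuped_estimates.append(
        y_cuped[treatment == 1].mean() - y_cuped[treatment == 0].mean()
    )

raw_estimates = np.array(raw_estimates)
cuped_estimates = np.array(cuped_estimates)

print(f"Mean raw estimate:   {raw_estimates.mean():.3f}")
print(f"Mean CUPED estimate: {cuped_estimates.mean():.3f}")
print(f"Std of raw estimates:   {raw_estimates.std(ddof=1):.3f}")
print(f"Std of CUPED estimates: {cuped_estimates.std(ddof=1):.3f}")
```

Both averages should land close to the assumed effect, while the CUPED estimates should scatter less from one simulated experiment to the next.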

Efficiency Gain

When variance is reduced, the same Statistical Power is achieved with fewer samples:

$\text{Effective sample size multiplier} = \frac{1}{1 - \rho^2_{XY}}$

Example: if $\rho_{XY} = 0.5$, the multiplier is $1 / 0.75 \approx 1.33$, so the effective sample size increases by about 33%.
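
The same arithmetic can be tabulated for other correlations. The helper names `effective_sample_multiplier` and `required_sample_fraction` are hypothetical and introduced only for this sketch.

```python
def effective_sample_multiplier(rho):
    """Factor by which CUPED multiplies the effective sample size."""
    if not -1 < rho < 1:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 / (1.0 - rho ** 2)


def required_sample_fraction(rho):
    """Fraction of the original sample needed to keep the same variance."""
    if not -1 < rho < 1:
        raise ValueError("rho must lie strictly between -1 and 1")
    return 1.0 - rho ** 2


for rho in (0.3, 0.5, 0.7, 0.9):
    print(
        f"rho={rho:.1f}  multiplier={effective_sample_multiplier(rho):.2f}  "
        f"sample needed={required_sample_fraction(rho):.0%}"
    )
```

Because the gain depends on $\rho^2$, weakly correlated covariates help little, and the benefit grows quickly as the correlation approaches 1.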

Example

Python Implementation

import numpy as np

class CUPEDEstimator:
    def __init__(self, pre_period_days=14):
        self.pre_period_days = pre_period_days

    def fit(self, Y, X, treatment):
        """
        Y: outcome during the experiment
        X: pre-experiment data (covariate)
        treatment: treatment indicator (0/1)
        """
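        # Accept lists or pandas Series by converting to NumPy arrays
        Y, X, treatment = (np.asarray(a) for a in (Y, X, treatment))

        # Compute theta on the pooled data; ddof=1 matches np.cov's default
        self.theta = np.cov(Y, X)[0, 1] / np.var(X, ddof=1)
        self.X_mean = np.mean(X)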

        # CUPED-adjusted outcome
        Y_cuped = Y - self.theta * (X - self.X_mean)

        # Estimate treatment effect
        Y_cuped_treatment = Y_cuped[treatment == 1]
        Y_cuped_control = Y_cuped[treatment == 0]

        self.effect = np.mean(Y_cuped_treatment) - np.mean(Y_cuped_control)
        # Standard error from unbiased (ddof=1) sample variances
        self.effect_se = np.sqrt(
            np.var(Y_cuped_treatment, ddof=1) / len(Y_cuped_treatment) +
            np.var(Y_cuped_control, ddof=1) / len(Y_cuped_control)
        )

        # Comparison: before adjustment (same ddof=1 convention)
        self.effect_raw = np.mean(Y[treatment == 1]) - np.mean(Y[treatment == 0])
        self.effect_raw_se = np.sqrt(
            np.var(Y[treatment == 1], ddof=1) / np.sum(treatment == 1) +
            np.var(Y[treatment == 0], ddof=1) / np.sum(treatment == 0)
        )

        # Variance reduction rate
        self.variance_reduction = 1 - (self.effect_se / self.effect_raw_se)**2

        return self

    def summary(self):
        return {
            'effect_cuped': self.effect,
            'se_cuped': self.effect_se,
            'effect_raw': self.effect_raw,
            'se_raw': self.effect_raw_se,
            'variance_reduction': self.variance_reduction,
            'theta': self.theta
        }

Usage Example

# Simulated data
np.random.seed(42)
n = 10000

# Per-individual baseline propensity (unobserved)
baseline = np.random.randn(n) * 10 + 50

# Pre-experiment data (revenue during the 2 weeks before the experiment)
X_pre = baseline + np.random.randn(n) * 5

# Treatment assignment
treatment = np.random.binomial(1, 0.5, n)

# Outcome during the experiment (treatment effect = 2)
true_effect = 2
Y = baseline + true_effect * treatment + np.random.randn(n) * 5

# Apply CUPED
cuped = CUPEDEstimator()
cuped.fit(Y, X_pre, treatment)
results = cuped.summary()

print(f"True effect: {true_effect}")
print(f"Raw estimate: {results['effect_raw']:.3f} ± {results['se_raw']:.3f}")
print(f"CUPED estimate: {results['effect_cuped']:.3f} ± {results['se_cuped']:.3f}")
print(f"Variance reduction: {results['variance_reduction']:.1%}")
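
As a rough cross-check, the values this simulation should produce can be worked out from its own parameters (baseline standard deviation 10, both noise terms standard deviation 5), up to sampling noise:

$Var(X) = 10^2 + 5^2 = 125, \quad Cov(Y, X) = 10^2 = 100, \quad \theta = \frac{100}{125} = 0.8$

Within each arm $Var(Y) = 125$ as well, so:

$\rho_{XY} = \frac{100}{\sqrt{125 \cdot 125}} = 0.8, \quad 1 - \rho^2_{XY} = 0.36$

The printed variance reduction should therefore be close to 64%. With about 5,000 users per arm, the standard errors should be near $\sqrt{2 \cdot 125 / 5000} \approx 0.224$ for the raw estimate and $\sqrt{2 \cdot 45 / 5000} \approx 0.134$ for the CUPED estimate.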

Extension to Multiple Covariates

import numpy as np
from sklearn.linear_model import LinearRegression

def cuped_multiple_covariates(Y, X_covariates, treatment):
    """
    CUPED using multiple pre-experiment variables
    """
    # Fit a linear model of Y on the covariates; its coefficients generalize theta
    model = LinearRegression()
    model.fit(X_covariates, Y)

    # Subtract the predictions, then add back their mean so the mean of Y is unchanged
    Y_pred = model.predict(X_covariates)
    Y_cuped = Y - Y_pred + np.mean(Y_pred)

    # Treatment effect
    effect = np.mean(Y_cuped[treatment == 1]) - np.mean(Y_cuped[treatment == 0])

    return effect, Y_cuped
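
A usage sketch for the multi-covariate case follows. To stay self-contained it uses `np.linalg.lstsq` rather than scikit-learn. With centered covariates this should give the same adjustment as a linear regression with an intercept. The function name `cuped_lstsq` and the simulated covariates are assumptions made for illustration.

```python
import numpy as np


def cuped_lstsq(y, X_cov, treatment):
    """Multi-covariate CUPED sketch using ordinary least squares via NumPy."""
    y = np.asarray(y, dtype=float)
    X_cov = np.asarray(X_cov, dtype=float)
    treatment = np.asarray(treatment)

    # Center covariates so the adjustment leaves the overall mean of y unchanged
    X_centered = X_cov - X_cov.mean(axis=0)

    # Least-squares coefficients of y on the centered covariates (one per column)
    theta, *_ = np.linalg.lstsq(X_centered, y - y.mean(), rcond=None)

    y_cuped = y - X_centered @ theta
    effect = y_cuped[treatment == 1].mean() - y_cuped[treatment == 0].mean()
    return effect, y_cuped, theta


rng = np.random.default_rng(1)
n = 10_000
baseline = rng.normal(50, 10, n)

# Two hypothetical pre-experiment covariates: noisy measurements of the baseline
X_cov = np.column_stack([
    baseline + rng.normal(0, 5, n),
    baseline + rng.normal(0, 8, n),
])
treatment = rng.integers(0, 2, n)
y = baseline + 2.0 * treatment + rng.normal(0, 5, n)

effect, y_cuped, theta = cuped_lstsq(y, X_cov, treatment)
raw_effect = y[treatment == 1].mean() - y[treatment == 0].mean()

print(f"Raw estimate:   {raw_effect:.3f}")
print(f"CUPED estimate: {effect:.3f}")
print(f"Variance ratio: {np.var(y_cuped, ddof=1) / np.var(y, ddof=1):.3f}")
```

Adding a second covariate helps only to the extent that it carries information about the outcome that the first one does not.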

Related Concepts

Statistical Power - what CUPED improves
A-B Testing - the context in which CUPED is applied
Design Effect - another effective-sample adjustment

References

Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). “Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data.” In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining (WSDM '13).
Comprehensive Personalized Pricing Guide, Part V, §14.3
