Tae Hyun Kim (Lowell)

IPW (Inverse Propensity Weighting)

#causal-inference #reweighting #ipw

Definition

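IPW estimates treatment effects by weighting each observation by the inverse of the probability of receiving the treatment it actually received, i.e. the inverse of the propensity score for treated units and of one minus the propensity score for controls.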

$$\hat{\text{ATE}}_{IPW} = \frac{1}{n}\sum_{i=1}^{n} \left[\frac{W_i Y_i}{e(X_i)} - \frac{(1-W_i) Y_i}{1-e(X_i)}\right]$$

Here, $e(X) = P(W=1 \mid X)$ is the propensity score.


Intuitive Understanding

Why inverse weighting?

“More weight to underrepresented samples”

| Situation | Treatment probability | Weight | Meaning |
|---|---|---|---|
| Treatment rare | $e(X) = 0.1$ | $1/0.1 = 10$ | This person represents 10 individuals |
| Treatment common | $e(X) = 0.9$ | $1/0.9 \approx 1.1$ | Nearly 1:1 representation |

Resampling Perspective

IPW is equivalent to:

  • “Replicating” each sample according to its weight
  • Creating a hypothetical pseudo-RCT (pseudo-randomized experiment)
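The pseudo-RCT idea can be checked numerically. The sketch below is an illustration on simulated data, not part of the original note: the data-generating process, the variable names, and the use of the true propensity score (rather than an estimated one) are all assumptions. It compares the covariate gap between treated and control groups before and after weighting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: a single confounder X drives treatment assignment.
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))      # true propensity score e(X) (assumed known here)
w = rng.binomial(1, e)            # treatment indicator W

# Inverse propensity weights: 1/e(X) for treated, 1/(1 - e(X)) for controls.
ipw = np.where(w == 1, 1.0 / e, 1.0 / (1.0 - e))

def weighted_mean(values, weights):
    """Weighted average with weights normalized to sum to one."""
    return np.sum(values * weights) / np.sum(weights)

treated, control = (w == 1), (w == 0)
raw_gap = x[treated].mean() - x[control].mean()
weighted_gap = (weighted_mean(x[treated], ipw[treated])
                - weighted_mean(x[control], ipw[control]))

print(f"raw covariate gap:      {raw_gap:.3f}")
print(f"weighted covariate gap: {weighted_gap:.3f}")
```

In the raw sample the treated group has systematically higher X; in the weighted pseudo-population the gap should be close to zero, which is what randomization would produce.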

Mathematical Derivation

ATE Identification

Under Strong Ignorability:

$$E[Y(w)] = E\left[\frac{\mathbb{1}(W=w) \cdot Y}{P(W=w \mid X)}\right]$$
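For readers who want the intermediate steps, the identity can be sketched with the law of iterated expectations. This derivation is an addition to the note; it assumes consistency ($Y = Y(w)$ when $W = w$), unconfoundedness ($Y(w) \perp W \mid X$), and overlap ($P(W=w \mid X) > 0$), which are the usual components of strong ignorability.

```latex
\begin{aligned}
E\left[\frac{\mathbb{1}(W=w)\, Y}{P(W=w \mid X)}\right]
  &= E\left[\frac{\mathbb{1}(W=w)\, Y(w)}{P(W=w \mid X)}\right]
     && \text{(consistency)} \\
  &= E\left[\frac{E[\mathbb{1}(W=w)\, Y(w) \mid X]}{P(W=w \mid X)}\right]
     && \text{(iterated expectations)} \\
  &= E\left[\frac{P(W=w \mid X)\, E[Y(w) \mid X]}{P(W=w \mid X)}\right]
     && \text{(unconfoundedness)} \\
  &= E\big[E[Y(w) \mid X]\big] = E[Y(w)]
\end{aligned}
```

Overlap is what allows the division by $P(W=w \mid X)$ in every step.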

Therefore:

$$\text{ATE} = E\left[\frac{WY}{e(X)}\right] - E\left[\frac{(1-W)Y}{1-e(X)}\right]$$

Sample Estimator

$$\hat{\text{ATE}}_{IPW} = \frac{1}{n}\sum_i \left[\frac{W_i Y_i}{e(X_i)} - \frac{(1-W_i) Y_i}{1-e(X_i)}\right]$$
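A minimal NumPy sketch of this sample estimator follows. The function name `ipw_ate` and the simulated data are hypothetical; the true propensity score is used in place of an estimated one so that the example isolates the weighting step.

```python
import numpy as np

def ipw_ate(y, w, e):
    """Unnormalized (Horvitz-Thompson style) IPW estimate of the ATE."""
    y, w, e = (np.asarray(a, dtype=float) for a in (y, w, e))
    return np.mean(w * y / e - (1.0 - w) * y / (1.0 - e))

# Hypothetical confounded data with a true ATE of 2 by construction.
rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))           # true propensity score (assumed known)
w = rng.binomial(1, e)
y = 2.0 * w + x + rng.normal(size=n)   # X affects both treatment and outcome

naive = y[w == 1].mean() - y[w == 0].mean()   # biased by confounding
ate_hat = ipw_ate(y, w, e)

print(f"naive difference in means: {naive:.3f}")
print(f"IPW estimate:              {ate_hat:.3f}")
```

The naive difference in means absorbs the effect of X on the outcome, while the IPW estimate should land near the true effect of 2.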

Normalized Version

More stable when the PS is estimated:

$$\hat{\text{ATE}}_{IPW}^{norm} = \frac{\sum_i \frac{W_i Y_i}{e(X_i)}}{\sum_i \frac{W_i}{e(X_i)}} - \frac{\sum_i \frac{(1-W_i) Y_i}{1-e(X_i)}}{\sum_i \frac{(1-W_i)}{1-e(X_i)}}$$

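Advantages: the weights sum to one within each arm, so the estimate stays within the range of the observed outcomes, is unaffected by adding a constant to $Y$, and typically has lower variance. The price is a small finite-sample bias (it is a ratio estimator) that vanishes as $n$ grows.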
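One way to see why normalization helps is to shift every outcome by a constant, which should leave any treatment effect unchanged. The sketch below is illustrative only; the function names and the simulated data are assumptions, and the true propensity score is used for simplicity.

```python
import numpy as np

def ipw_ate_unnormalized(y, w, e):
    """IPW estimate that divides by n."""
    return np.mean(w * y / e - (1 - w) * y / (1 - e))

def ipw_ate_normalized(y, w, e):
    """IPW estimate that divides each arm by its own sum of weights."""
    w1 = w / e                  # weights for treated units
    w0 = (1 - w) / (1 - e)      # weights for control units
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

rng = np.random.default_rng(2)
n = 2_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))    # true propensity score (assumed known)
w = rng.binomial(1, e)
y = 2.0 * w + x + rng.normal(size=n)

shift = 100.0                   # add the same constant to every outcome
unnorm_change = ipw_ate_unnormalized(y + shift, w, e) - ipw_ate_unnormalized(y, w, e)
norm_change = ipw_ate_normalized(y + shift, w, e) - ipw_ate_normalized(y, w, e)

print(f"change in unnormalized estimate: {unnorm_change:.4f}")
print(f"change in normalized estimate:   {norm_change:.4f}")
```

The unnormalized estimate moves because the two sets of weights do not average exactly to one in a finite sample; the normalized estimate is unchanged up to floating-point error.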


IPW for ATT

$$\hat{\text{ATT}}_{IPW} = \frac{\sum_i W_i Y_i}{\sum_i W_i} - \frac{\sum_i (1-W_i) \frac{e(X_i)}{1-e(X_i)} Y_i}{\sum_i (1-W_i) \frac{e(X_i)}{1-e(X_i)}}$$

See ATT
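The ATT formula leaves treated units unweighted and reweights controls by the odds $e(X)/(1-e(X))$. A small sketch, assuming simulated data with a heterogeneous effect so that ATT and ATE differ; the function name `ipw_att` and the data-generating process are hypothetical.

```python
import numpy as np

def ipw_att(y, w, e):
    """Normalized IPW estimate of the ATT using odds weights for controls."""
    treated_mean = np.sum(w * y) / np.sum(w)
    odds = (1 - w) * e / (1 - e)          # zero for treated units
    control_mean = np.sum(odds * y) / np.sum(odds)
    return treated_mean - control_mean

rng = np.random.default_rng(3)
n = 200_000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))              # true propensity score (assumed known)
w = rng.binomial(1, e)
tau = 1.0 + x                             # individual effect grows with X
y = tau * w + x + rng.normal(size=n)

true_att = tau[w == 1].mean()             # known only because the data are simulated
att_hat = ipw_att(y, w, e)

print(f"true ATT (simulation): {true_att:.3f}")
print(f"IPW ATT estimate:      {att_hat:.3f}")
```

Because treated units tend to have larger X here, the ATT is larger than the ATE of 1, and the odds-weighted estimate should track the ATT rather than the ATE.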


Pros and Cons

Advantages

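| Advantage | Description |
|---|---|
| Simple | Intuitive and easy to implement |
| Nonparametric | No outcome-model assumptions required |
| Theoretical justification | Consistent, provided strong ignorability holds and the propensity score model is correctly specified |
| Flexibility | Applicable to a variety of estimands |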

Disadvantages

| Disadvantage | Description |
|---|---|
| Dependence on PS estimation | Biased when the PS is misspecified |
| Sensitivity to extreme PS | Unstable when $e(X) \approx 0$ or $1$ |
| High variance | Especially when overlap is weak |
| Difficult in high dimensions | PS estimation is difficult |

Extreme PS Problem

Problem

When $e(X) \to 0$ or $e(X) \to 1$:

  • Weights explode: $1/e(X) \to \infty$ for treated units as $e(X) \to 0$, and $1/(1-e(X)) \to \infty$ for control units as $e(X) \to 1$
  • Estimator becomes unstable

Solutions

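  1. Trimming: remove samples with extreme PS (this changes the population the estimate refers to)
  2. Overlap Weighting: use bounded weights, $1-e(X)$ for treated units and $e(X)$ for controls, which target the overlap population rather than the full-population ATE
  3. Weight clipping: set an upper bound on weights (trades some bias for lower variance)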
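The three remedies can be sketched as small helper functions. The helper names and the thresholds (0.05 and 0.95 for trimming, 20 for the weight cap) are illustrative assumptions, not recommendations from the note; suitable values depend on the data.

```python
import numpy as np

def trim_mask(e, low=0.05, high=0.95):
    """Boolean mask of samples to keep after trimming extreme propensity scores."""
    return (e >= low) & (e <= high)

def clipped_ipw_weights(w, e, max_weight=20.0):
    """Standard IPW weights capped at max_weight."""
    raw = np.where(w == 1, 1.0 / e, 1.0 / (1.0 - e))
    return np.minimum(raw, max_weight)

def overlap_weights(w, e):
    """Overlap weights: 1 - e(X) for treated, e(X) for controls; always in [0, 1]."""
    return np.where(w == 1, 1.0 - e, e)

# Toy example with two extreme propensity scores at the ends.
e = np.array([0.01, 0.2, 0.5, 0.8, 0.99])
w = np.array([1, 1, 0, 0, 0])

keep = trim_mask(e)
clipped = clipped_ipw_weights(w, e)
overlap = overlap_weights(w, e)

print("kept after trimming:", keep)
print("clipped IPW weights:", clipped)
print("overlap weights:    ", overlap)
```

In the toy example the first and last units would carry raw weights of about 100; trimming drops them, clipping caps them at 20, and overlap weighting never lets any weight exceed 1.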

Implementation

Python (EconML)

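from econml.dr import LinearDRLearner
from sklearn.linear_model import LogisticRegression

# Note: LinearDRLearner is a doubly robust learner. It combines the propensity
# model below with an outcome model (model_regression, left at its default),
# so this is IPW plus outcome regression rather than pure IPW.
model = LinearDRLearner(model_propensity=LogisticRegression())
model.fit(Y, T, X=X)  # X must be passed as a keyword argument
ate = model.effect(X).mean()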

R

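library(WeightIt)

# Propensity score weights for the ATE
w_out <- weightit(treat ~ x1 + x2, data = df, method = "ps", estimand = "ATE")

# Weighted outcome regression: the coefficient on treat is the IPW estimate.
# Note: lm()'s default standard errors do not account for the weights.
lm(y ~ treat, data = df, weights = w_out$weights)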

Related

  • Re-weighting Methods Overview - consolidated overview of reweighting methods
  • Propensity Score - the core tool
  • Doubly Robust Estimator - IPW + outcome regression
  • CBPS - directly optimizing balance
  • Trimming - handling extreme PS
  • Overlap Weighting - stable weighting

Application: Correcting RTB Win Selection Bias

In RTB, training only on won impressions introduces win selection bias. Correct it with IPW:

$$w_i = \frac{1}{p_{\text{win}}(x_i, b_i)}, \quad p_{\text{win}} = P(\text{win} \mid X, \text{bid})$$

The win propensity is estimated via Survival Analysis (Kaplan-Meier) or gradient boosting. Weight stabilization (clipping, normalization) is essential. For details, see Multi-Task Learning (IPW-ESCM²).
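As a rough sketch of how such weights could enter training, the example below builds clipped, mean-normalized inverse win-probability weights and applies them to a log loss over won impressions. Everything here is hypothetical: the helper names, the cap of 50, and the toy `p_win` values, which in practice would come from a win-rate model such as the ones mentioned above.

```python
import numpy as np

def win_ipw_weights(p_win, max_weight=50.0):
    """Inverse win-probability weights, clipped and rescaled to mean one."""
    raw = 1.0 / np.clip(p_win, 1e-6, 1.0)     # guard against division by zero
    clipped = np.minimum(raw, max_weight)     # clipping step
    return clipped / clipped.mean()           # normalization step

def weighted_log_loss(y_true, y_pred, weights):
    """Weighted binary cross-entropy over won impressions."""
    eps = 1e-12
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.sum(weights * losses) / np.sum(weights)

# Toy won impressions: estimated win probabilities, click labels, model predictions.
p_win = np.array([0.9, 0.5, 0.1, 0.01])
clicks = np.array([1, 0, 0, 1])
pred = np.array([0.3, 0.2, 0.1, 0.4])

weights = win_ipw_weights(p_win)
loss = weighted_log_loss(clicks, pred, weights)

print("weights:", np.round(weights, 3))
print(f"weighted log loss: {loss:.4f}")
```

Impressions that were unlikely to be won receive larger weights, so they stand in for the similar auctions that were lost and never observed.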


References

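  • Yao, L., et al. (2021). A survey on causal inference. ACM Transactions on Knowledge Discovery from Data. Section 3.1.3
  • Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika
  • Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association
  • Zhang, W., et al. (2016). Bid-aware gradient descent for unbiased learning with censored data in display advertising. KDD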
