Tae Hyun Kim (Lowell)

Meta-learners

Definition

Meta-learner is a general term for algorithms that estimate the CATE (conditional average treatment effect) by leveraging existing supervised learning methods (base learners).

Core idea:

Decompose the CATE estimation problem into sub-regression problems that a base learner can solve

$$\tau(x) = E[Y(1) - Y(0) | X = x]$$

Major Meta-learners:

  • S-Learner: Single model with treatment as feature
  • T-Learner: Two separate models
  • X-Learner: Two-stage imputation approach
  • R-Learner: Residualized regression
  • DR-Learner: Doubly robust pseudo-outcome regression
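
The first two entries in this list can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: the synthetic data (randomized treatment, true CATE equal to `1 + x0`), the helper `make_forest`, and the forest settings are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data, assumed for illustration: randomized treatment, true CATE = 1 + x0.
n = 2000
X = rng.normal(size=(n, 2))
W = rng.binomial(1, 0.5, size=n)
Y = X[:, 1] + W * (1.0 + X[:, 0]) + rng.normal(scale=0.5, size=n)


def make_forest():
    # Hypothetical base-learner settings; any regressor with fit/predict would do.
    return RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)


def s_learner(X, W, Y, X_new):
    """Single model on (X, W); CATE = mu_hat(x, 1) - mu_hat(x, 0)."""
    model = make_forest().fit(np.column_stack([X, W]), Y)
    treated = np.column_stack([X_new, np.ones(len(X_new))])
    control = np.column_stack([X_new, np.zeros(len(X_new))])
    return model.predict(treated) - model.predict(control)


def t_learner(X, W, Y, X_new):
    """Two models, one per arm; CATE = mu1_hat(x) - mu0_hat(x)."""
    mu1 = make_forest().fit(X[W == 1], Y[W == 1])
    mu0 = make_forest().fit(X[W == 0], Y[W == 0])
    return mu1.predict(X_new) - mu0.predict(X_new)


X_new = rng.normal(size=(500, 2))
tau_true = 1.0 + X_new[:, 0]
mse_s = np.mean((s_learner(X, W, Y, X_new) - tau_true) ** 2)
mse_t = np.mean((t_learner(X, W, Y, X_new) - tau_true) ** 2)
print(f"S-learner MSE: {mse_s:.3f}, T-learner MSE: {mse_t:.3f}")
```

Because the true CATE is known in a simulation, the two estimators can be compared directly; on real data only the observed outcome is available, so this kind of check is not possible.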

Intuitive Understanding

Why “Meta”?

  • Base learners (RF, BART, NN, etc.) are designed to estimate $E[Y|X]$
  • The CATE $\tau(x) = E[Y(1) - Y(0)|X]$ cannot be estimated directly, because $Y(1)$ and $Y(0)$ are never both observed for the same unit
  • Meta-learners repurpose base learners to estimate the CATE
Base Learner:    Designed for E[Y|X] (standard regression)

Meta-Learner:    Transforms CATE problem → sub-regression problems

                 Uses base learners to solve sub-problems

                 Combines results → τ̂(x)
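
As one instance of this recipe, the R-Learner listed above can be written as two ordinary prediction problems followed by a weighted regression. The sketch below is a simplified illustration under stated assumptions: the synthetic data, the choice of random forest and logistic regression for the sub-problems, the linear final stage, and the clipping bounds on the estimated propensity are all choices made for the example, not part of the method's definition.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)

# Synthetic data, assumed for illustration: confounded treatment, true CATE = 1 + x0.
n = 2000
X = rng.normal(size=(n, 2))
e_true = 1.0 / (1.0 + np.exp(-X[:, 1]))
W = rng.binomial(1, e_true)
Y = X[:, 1] + W * (1.0 + X[:, 0]) + rng.normal(scale=0.5, size=n)

# Sub-problem 1: m(x) = E[Y | X = x], an ordinary regression (cross-fitted).
m_hat = cross_val_predict(
    RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0),
    X, Y, cv=5,
)

# Sub-problem 2: e(x) = P(W = 1 | X = x), an ordinary classification (cross-fitted).
e_hat = cross_val_predict(
    LogisticRegression(), X, W, cv=5, method="predict_proba"
)[:, 1]
e_hat = np.clip(e_hat, 0.05, 0.95)  # assumed bounds, keeps W - e_hat away from zero

# Combine: minimize sum_i ((Y_i - m_hat_i) - tau(X_i) * (W_i - e_hat_i))^2,
# rewritten as a weighted regression of Y_res / W_res on X with weights W_res^2.
Y_res = Y - m_hat
W_res = W - e_hat
final = LinearRegression().fit(X, Y_res / W_res, sample_weight=W_res ** 2)

X_new = rng.normal(size=(500, 2))
tau_hat = final.predict(X_new)
mse_r = np.mean((tau_hat - (1.0 + X_new[:, 0])) ** 2)
print(f"R-learner MSE: {mse_r:.3f}")
```

The final stage accepts any regressor that supports `sample_weight`; a linear model is used here only because the simulated CATE is linear.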

Comparison of Meta-learners

| Method | Approach | Pros | Cons | Best When |
|---|---|---|---|---|
| S-Learner | $\hat{\mu}(x,1) - \hat{\mu}(x,0)$ | Simple, shares data | May ignore small effects | CATE ≈ 0 |
| T-Learner | $\hat{\mu}_1(x) - \hat{\mu}_0(x)$ | Captures different response functions | No data sharing | $\mu_0 \neq \mu_1$ structures |
| X-Learner | Two-stage imputation + weighting | Exploits CATE structure, handles imbalance | More complex | Unbalanced groups, smooth CATE |
| R-Learner | Residualized regression | Orthogonality | Requires product rate | Heterogeneous effects |
| DR-Learner | DR pseudo-outcome regression | Double robustness | Stability condition | Robustness desired |
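
The "two-stage imputation + weighting" entry for the X-Learner is the least self-explanatory row, so a minimal sketch may help. It assumes synthetic data with about 10% treated units and a smooth CATE, a hypothetical `make_model` factory for the base learner, and a logistic-regression propensity model as the weighting function; none of these choices come from the note itself.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Synthetic data, assumed for illustration: ~10% treated, true CATE = 1 + x0.
n = 3000
X = rng.normal(size=(n, 2))
W = rng.binomial(1, 0.1, size=n)
Y = np.sin(2 * X[:, 1]) + W * (1.0 + X[:, 0]) + rng.normal(scale=0.5, size=n)


def make_model():
    # Hypothetical base learner; any regressor with fit/predict would do.
    return RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)


def x_learner(X, W, Y, X_new):
    X0, Y0 = X[W == 0], Y[W == 0]
    X1, Y1 = X[W == 1], Y[W == 1]

    # Stage 1: response functions, fitted separately per arm.
    mu0 = make_model().fit(X0, Y0)
    mu1 = make_model().fit(X1, Y1)

    # Stage 2: impute individual effects, then regress them on X.
    D1 = Y1 - mu0.predict(X1)  # treated: observed minus imputed control outcome
    D0 = mu1.predict(X0) - Y0  # control: imputed treated outcome minus observed
    tau1 = make_model().fit(X1, D1)
    tau0 = make_model().fit(X0, D0)

    # Weighting: combine the two CATE estimates with the estimated propensity g(x).
    g = LogisticRegression().fit(X, W).predict_proba(X_new)[:, 1]
    return g * tau0.predict(X_new) + (1 - g) * tau1.predict(X_new)


X_new = rng.normal(size=(500, 2))
tau_x = x_learner(X, W, Y, X_new)
mse_x = np.mean((tau_x - (1.0 + X_new[:, 0])) ** 2)
print(f"X-learner MSE: {mse_x:.3f}")
```

With few treated units the weight `g` is small, so the final estimate leans on `tau1`, which is built from the control-arm model that had the most data to learn from.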

Framework and Setup

Potential Outcomes Framework

$$X \sim \Lambda, \quad W \sim \text{Bern}(e(X))$$

$$Y(0) = \mu_0(X) + \epsilon(0), \quad Y(1) = \mu_1(X) + \epsilon(1)$$

where:

  • $X \in \mathbb{R}^d$: Covariates
  • $W \in \{0, 1\}$: Treatment indicator
  • $e(x) = P(W=1|X=x)$: Propensity Score
  • $\mu_a(x) = E[Y(a)|X=x]$: Response functions
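
To make the setup concrete, the following sketch draws one dataset from this model. The specific choices are assumptions for illustration only: a standard normal covariate distribution for $\Lambda$, a logistic propensity, simple linear response functions, and standard normal noise. The observed outcome is taken to be $Y = Y(W)$, which is the usual consistency assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 3


def e(X):
    # Hypothetical propensity score e(x) = P(W = 1 | X = x).
    return 1.0 / (1.0 + np.exp(-X[:, 0]))


def mu0(X):
    # Hypothetical control response function.
    return X[:, 1]


def mu1(X):
    # Hypothetical treated response function; implies CATE = 1 + x0.
    return X[:, 1] + 1.0 + X[:, 0]


X = rng.normal(size=(n, d))           # X ~ Lambda (here: standard normal)
W = rng.binomial(1, e(X))             # W ~ Bern(e(X))
Y0 = mu0(X) + rng.normal(size=n)      # Y(0) = mu_0(X) + eps(0)
Y1 = mu1(X) + rng.normal(size=n)      # Y(1) = mu_1(X) + eps(1)
Y = np.where(W == 1, Y1, Y0)          # observed outcome, assuming Y = Y(W)
tau = mu1(X) - mu0(X)                 # true CATE at each sampled X

print(X.shape, W.mean(), tau.mean())
```

Only `X`, `W`, and `Y` would be available to an estimator; `Y0`, `Y1`, and `tau` exist here because the data are simulated.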

Identification Assumptions

  1. Unconfoundedness: $(Y(0), Y(1)) \perp W | X$
  2. Positivity: $0 < e_{min} < e(x) < e_{max} < 1$
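
These two assumptions are what turn the CATE into a quantity that regression can target. A short derivation, assuming in addition that the observed outcome satisfies $Y = Y(W)$ (consistency, which is not stated explicitly above):

```latex
\begin{aligned}
\tau(x) &= \mu_1(x) - \mu_0(x) \\
        &= E[Y(1) \mid X = x] - E[Y(0) \mid X = x] \\
        &= E[Y(1) \mid X = x, W = 1] - E[Y(0) \mid X = x, W = 0]
           && \text{(unconfoundedness)} \\
        &= E[Y \mid X = x, W = 1] - E[Y \mid X = x, W = 0]
           && \text{(consistency, } Y = Y(W)\text{)}
\end{aligned}
```

Positivity ensures that both conditioning events have positive probability at every $x$, so the last line involves only conditional expectations of observed data.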

Estimation Target

$$\text{EMSE}(P, \hat{\tau}) = E[(\tau(X) - \hat{\tau}(X))^2]$$
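
In a simulation where $\tau$ is known, this quantity can be approximated by Monte Carlo. The sketch below is an illustration under assumptions: the helper `emse`, the sampler `sample_x`, and the toy estimator are hypothetical names, and the estimator is held fixed, so the result is the error conditional on that estimator rather than an average over training samples.

```python
import numpy as np


def emse(tau, tau_hat, sample_x, n_draws=100_000, seed=0):
    """Monte Carlo approximation of E[(tau(X) - tau_hat(X))^2] for a fixed tau_hat."""
    rng = np.random.default_rng(seed)
    X = sample_x(rng, n_draws)
    return float(np.mean((tau(X) - tau_hat(X)) ** 2))


def tau(X):
    # Hypothetical true CATE.
    return 1.0 + X[:, 0]


def tau_hat_const(X):
    # Toy estimator that ignores heterogeneity and predicts the average effect.
    return np.ones(len(X))


def sample_x(rng, n):
    # Assumed covariate distribution: standard normal in two dimensions.
    return rng.normal(size=(n, 2))


value = emse(tau, tau_hat_const, sample_x)
print(f"EMSE of the constant estimator: {value:.3f}")
```

For this toy setup the error equals the variance of `x0`, so the printed value should be close to 1.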

Convergence Rates

Notation: $S(a)$ = function class with minimax rate $N^{-a}$

T-Learner Rate

$$\text{Rate} = O(m^{-a_\mu} + n^{-a_\mu})$$

  • $m$: control group size, $n$: treatment group size
  • $a_\mu$: smoothness of response functions

X-Learner Rate (under conditions)

  • $\hat{\tau}_0$: $O(m^{-a_\tau} + n^{-a_\mu})$
  • $\hat{\tau}_1$: $O(m^{-a_\mu} + n^{-a_\tau})$

Linear CATE + Lipschitz response → Parametric rate is achievable
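
A small numeric comparison shows why these rates favor the X-Learner when groups are unbalanced and the CATE is smoother than the response functions. The group sizes and exponents below are arbitrary illustrative values, and the big-O constants are ignored, so only the ordering of the two numbers is meaningful.

```python
def t_learner_rate(m, n, a_mu):
    # Rate expression for the T-Learner: m^(-a_mu) + n^(-a_mu).
    return m ** -a_mu + n ** -a_mu


def x_learner_rate_tau1(m, n, a_mu, a_tau):
    # Rate expression for the X-Learner's tau_hat_1: m^(-a_mu) + n^(-a_tau).
    return m ** -a_mu + n ** -a_tau


# Assumed scenario: many controls, few treated, CATE smoother than responses.
m, n = 100_000, 1_000
a_mu, a_tau = 0.25, 0.5

t_bound = t_learner_rate(m, n, a_mu)              # about 0.234
x_bound = x_learner_rate_tau1(m, n, a_mu, a_tau)  # about 0.088
print(f"T-learner: {t_bound:.3f}, X-learner (tau_1): {x_bound:.3f}")
```

The T-Learner expression is dominated by the small treated group through $n^{-a_\mu}$, while the expression for $\hat{\tau}_1$ replaces that term with the faster $n^{-a_\tau}$.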

Choosing a Meta-learner

Decision Guide


Mermaid source:
> flowchart TD
>     A[Start] --> B{CATE mostly zero?}
>     B -->|Yes| C[S-Learner]
>     B -->|No| D{Groups balanced?}
>     D -->|No| E[X-Learner]
>     D -->|Yes| F{Response functions similar?}
>     F -->|Yes| G[X-Learner or R-Learner]
>     F -->|No| H[T-Learner]
>
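
The same decision guide can be written as a small function, which makes the order of the checks explicit. The function name and its boolean arguments are hypothetical; deciding whether each condition holds for a given dataset is left to the analyst.

```python
def choose_meta_learner(cate_mostly_zero, groups_balanced, responses_similar):
    """Return the meta-learner suggested by the flowchart in this section."""
    if cate_mostly_zero:
        return "S-Learner"
    if not groups_balanced:
        return "X-Learner"
    if responses_similar:
        return "X-Learner or R-Learner"
    return "T-Learner"


# Example: balanced groups, non-zero CATE, dissimilar response functions.
print(choose_meta_learner(False, True, False))  # T-Learner
```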

Practical Recommendations (Künzel et al.)

  1. Default choice: X-Learner (unless strong prior that CATE ≈ 0)
  2. Small datasets: BART as base learner
  3. Large datasets: Random Forest as base learner

Base Learners

Meta-learners are compatible with a variety of base learners:

| Base Learner | Strengths | Typical Use |
|---|---|---|
| Random Forest | Scalable, handles high-dim | Large datasets |
| BART | Uncertainty quantification, regularization | Small datasets |
| Neural Networks | Flexible, complex patterns | Very large datasets |
| Lasso/Ridge | Sparse/regularized linear | High-dim, interpretable |
| Boosting | Adaptive, accurate | General purpose |

Related Concepts

  • CATE - the estimation target
  • S-Learner - Single model approach
  • T-Learner - Two model approach
  • X-Learner - Imputation-based approach
  • R-Learner - Residualized approach
  • DR-Learner - Doubly robust approach
  • Propensity Score - Treatment probability

Implementation

Python (econml):

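from econml.metalearners import SLearner, TLearner, XLearner
from sklearn.ensemble import RandomForestRegressor

# Assumed inputs: Y (outcomes), T (binary treatment, written W in the notation above),
# X (covariates), and X_test (covariates at which to evaluate the CATE).

# S-Learner: one model over covariates and treatment
s_learner = SLearner(overall_model=RandomForestRegressor())
s_learner.fit(Y, T, X=X)

# T-Learner: one model per treatment arm
t_learner = TLearner(models=RandomForestRegressor())
t_learner.fit(Y, T, X=X)

# X-Learner: two-stage imputation with propensity weighting
x_learner = XLearner(models=RandomForestRegressor())
x_learner.fit(Y, T, X=X)

# Estimated CATE at the test points
cate = x_learner.effect(X_test)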
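
The snippet above covers the S-, T-, and X-Learner. For the DR-Learner listed earlier, the pseudo-outcome regression can be sketched directly with scikit-learn. This is a simplified illustration, not a library API: the synthetic data, the single sample split (instead of full cross-fitting), the helper `forest`, and the propensity clipping bounds are all assumptions of the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# Synthetic data, assumed for illustration: confounded treatment, true CATE = 1 + x0.
n = 4000
X = rng.normal(size=(n, 2))
W = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.5 * X[:, 1])))
Y = X[:, 1] + W * (1.0 + X[:, 0]) + rng.normal(scale=0.5, size=n)

# Sample splitting: nuisance models on half A, pseudo-outcome regression on half B.
half = n // 2
XA, WA, YA = X[:half], W[:half], Y[:half]
XB, WB, YB = X[half:], W[half:], Y[half:]


def forest(min_leaf):
    # Hypothetical base-learner settings.
    return RandomForestRegressor(n_estimators=100, min_samples_leaf=min_leaf, random_state=0)


mu0 = forest(5).fit(XA[WA == 0], YA[WA == 0])
mu1 = forest(5).fit(XA[WA == 1], YA[WA == 1])
prop = LogisticRegression().fit(XA, WA)

# Doubly robust pseudo-outcome: outcome-model difference plus
# inverse-propensity-weighted residual corrections.
e_hat = np.clip(prop.predict_proba(XB)[:, 1], 0.05, 0.95)
m0, m1 = mu0.predict(XB), mu1.predict(XB)
phi = m1 - m0 + WB * (YB - m1) / e_hat - (1 - WB) * (YB - m0) / (1 - e_hat)

# Final stage: regress the pseudo-outcome on X to obtain the CATE estimate.
cate_model = forest(20).fit(XB, phi)

X_new = rng.normal(size=(500, 2))
tau_dr = cate_model.predict(X_new)
mse_dr = np.mean((tau_dr - (1.0 + X_new[:, 0])) ** 2)
print(f"DR-learner MSE: {mse_dr:.3f}")
```

Clipping the estimated propensity is a practical guard in the spirit of the positivity assumption; the bounds 0.05 and 0.95 are arbitrary here.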

R (causalToolbox):

library(causalToolbox)

# S-Learner with RF
s_rf <- S_RF(feat = X, tr = W, yobs = Y)
cate_s <- EstimateCate(s_rf, X_test)

# X-Learner with RF
x_rf <- X_RF(feat = X, tr = W, yobs = Y)
cate_x <- EstimateCate(x_rf, X_test)

References

  • kunzelMetalearnersEstimatingHeterogeneous2019 - S, T, X-learner
  • nieQuasiOracleEstimationHeterogeneous2020 - R-learner
  • kennedyOptimalDoublyRobust2023 - DR-learner
