Meta-learners · Tae Hyun Kim (Lowell)

Definition

Meta-learner is a general term for algorithms that estimate the conditional average treatment effect (CATE) by leveraging existing supervised learning methods (base learners).

Core idea:

Decompose the CATE estimation problem into sub-regression problems that a base learner can solve

$\tau(x) = E[Y(1) - Y(0) | X = x]$

Major Meta-learners:

S-Learner: Single model with treatment as feature
T-Learner: Two separate models
X-Learner: Two-stage imputation approach
R-Learner: Residualized regression
DR-Learner: Doubly robust pseudo-outcome regression
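
For orientation, the sub-regression problems behind each learner can be written out as follows. This is a summary sketch in the standard notation of the cited papers, not a derivation: $\mu_a$ and $e$ are the response functions and propensity score from the setup section, while $m(x)$, the weight function $g(x)$, the imputed effects $\tilde{D}$, the pseudo-outcome $\hat{\varphi}$, and the penalty term are symbols introduced here for illustration.

```latex
\begin{align*}
\text{S-learner:}\quad & \hat{\tau}(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0),
  \qquad \mu(x, w) = E[Y \mid X = x, W = w] \\
\text{T-learner:}\quad & \hat{\tau}(x) = \hat{\mu}_1(x) - \hat{\mu}_0(x) \\
\text{X-learner:}\quad & \tilde{D}_i^1 = Y_i - \hat{\mu}_0(X_i)\ \ (W_i = 1),
  \qquad \tilde{D}_i^0 = \hat{\mu}_1(X_i) - Y_i\ \ (W_i = 0), \\
 & \hat{\tau}_1 \text{ regresses } \tilde{D}^1 \text{ on } X,\quad
   \hat{\tau}_0 \text{ regresses } \tilde{D}^0 \text{ on } X, \\
 & \hat{\tau}(x) = g(x)\,\hat{\tau}_0(x) + (1 - g(x))\,\hat{\tau}_1(x) \\
\text{R-learner:}\quad & \hat{\tau} = \arg\min_{\tau} \sum_i
  \big[(Y_i - \hat{m}(X_i)) - (W_i - \hat{e}(X_i))\,\tau(X_i)\big]^2
  + \text{penalty}(\tau),
  \qquad m(x) = E[Y \mid X = x] \\
\text{DR-learner:}\quad & \hat{\varphi}_i = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i)
  + \frac{W_i\,(Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)}
  - \frac{(1 - W_i)\,(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)}, \\
 & \hat{\tau} \text{ regresses } \hat{\varphi} \text{ on } X
\end{align*}
```

In each case the only learning step is an ordinary regression, which is what lets any base learner be plugged in.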

Intuitive Understanding

Why “Meta”?

Base learners (RF, BART, NN, etc.) are designed to estimate $E[Y|X]$
The CATE $\tau(x) = E[Y(1) - Y(0)|X]$ cannot be estimated by a direct regression, because only one of $Y(0)$ and $Y(1)$ is observed for each unit
Meta-learners repurpose base learners to estimate the CATE

Base Learner:    Designed for E[Y|X] (standard regression)
                       ↓
Meta-Learner:    Transforms CATE problem → sub-regression problems
                       ↓
                 Uses base learners to solve sub-problems
                       ↓
                 Combines results → τ̂(x)
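
The pipeline above can be sketched with the T-Learner, the simplest case with two sub-problems. This is a minimal illustration, not a library implementation: `fit_line` is a hypothetical one-covariate least-squares base learner, and the toy data (with true effect $\tau(x) = 1 + 2x$) is assumed purely for the example.

```python
import random


def fit_line(xs, ys):
    """Hypothetical base learner: least-squares fit of y on a single covariate x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def t_learner(xs, ws, ys, base_learner=fit_line):
    """Fit one response model per arm and return tau_hat(x) = mu1_hat(x) - mu0_hat(x)."""
    x0 = [x for x, w in zip(xs, ws) if w == 0]
    y0 = [y for y, w in zip(ys, ws) if w == 0]
    x1 = [x for x, w in zip(xs, ws) if w == 1]
    y1 = [y for y, w in zip(ys, ws) if w == 1]
    mu0_hat = base_learner(x0, y0)  # sub-problem 1: E[Y | X, W = 0]
    mu1_hat = base_learner(x1, y1)  # sub-problem 2: E[Y | X, W = 1]
    return lambda x: mu1_hat(x) - mu0_hat(x)  # combine the two fits


# Toy data (assumed): mu_0(x) = x and mu_1(x) = 1 + 3x, so tau(x) = 1 + 2x.
rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(2000)]
ws = [1 if rng.random() < 0.5 else 0 for _ in xs]
ys = [(1 + 3 * x if w == 1 else x) + rng.gauss(0, 0.1) for x, w in zip(xs, ws)]

tau_hat = t_learner(xs, ws, ys)
print(tau_hat(0.5))  # should be close to the true tau(0.5) = 2
```

Swapping `fit_line` for any other function with the same signature changes the base learner without touching the meta-learner logic.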

Comparison of Meta-learners

Method	Approach	Pros	Cons	Best When
S-Learner	$\hat{\mu}(x,1) - \hat{\mu}(x,0)$	Simple, shares data	May ignore small effects	CATE ≈ 0
T-Learner	$\hat{\mu}_1(x) - \hat{\mu}_0(x)$	Captures different response functions	No data sharing	$\mu_0 \neq \mu_1$ structures
X-Learner	Two-stage imputation + weighting	Exploits CATE structure, handles imbalance	More complex	Unbalanced groups, smooth CATE
R-Learner	Residualized regression	Orthogonality (less sensitive to nuisance errors)	Requires the product of the nuisance error rates to vanish fast enough	Heterogeneous effects
DR-Learner	DR pseudo-outcome regression	Double robustness	Stability condition	Robustness desired
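
The X-Learner row ("two-stage imputation + weighting") is the least self-explanatory, so here is a minimal sketch of its three steps. It assumes a known constant propensity used as the weight function and a hypothetical one-covariate least-squares base learner (`fit_line`); the unbalanced toy data with $\tau(x) = 1 + 2x$ is likewise made up for illustration.

```python
import random


def fit_line(xs, ys):
    """Hypothetical base learner: least-squares fit of y on a single covariate x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: (mean_y - slope * mean_x) + slope * x


def x_learner(xs, ws, ys, propensity, base_learner=fit_line):
    """X-Learner sketch with a constant weight g(x) = propensity (an assumption)."""
    treated = [(x, y) for x, w, y in zip(xs, ws, ys) if w == 1]
    control = [(x, y) for x, w, y in zip(xs, ws, ys) if w == 0]
    x1, y1 = zip(*treated)
    x0, y0 = zip(*control)

    # Stage 1: response functions, one per arm (same as the T-Learner).
    mu0_hat = base_learner(x0, y0)
    mu1_hat = base_learner(x1, y1)

    # Stage 2: impute individual effects, then regress them on X within each arm.
    d1 = [y - mu0_hat(x) for x, y in treated]  # treated: observed minus imputed Y(0)
    d0 = [mu1_hat(x) - y for x, y in control]  # control: imputed Y(1) minus observed
    tau1_hat = base_learner(x1, d1)
    tau0_hat = base_learner(x0, d0)

    # Stage 3: weighted combination tau_hat = g * tau0_hat + (1 - g) * tau1_hat.
    return lambda x: propensity * tau0_hat(x) + (1 - propensity) * tau1_hat(x)


# Unbalanced toy data (assumed): about 10% treated, mu_0(x) = x, tau(x) = 1 + 2x.
rng = random.Random(1)
e = 0.1
xs = [rng.uniform(0, 1) for _ in range(3000)]
ws = [1 if rng.random() < e else 0 for _ in xs]
ys = [x + (1 + 2 * x) * w + rng.gauss(0, 0.1) for x, w in zip(xs, ws)]

tau_hat = x_learner(xs, ws, ys, propensity=e)
print(tau_hat(0.5))  # should be close to the true tau(0.5) = 2
```

With few treated units the weight shifts toward `tau1_hat`, which relies on `mu0_hat` fitted on the large control group; this is the sense in which the X-Learner "handles imbalance".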

Framework and Setup

Potential Outcomes Framework

$X \sim \Lambda, \quad W \sim \text{Bern}(e(X))$

$Y(0) = \mu_0(X) + \epsilon(0), \quad Y(1) = \mu_1(X) + \epsilon(1)$

where:

$X \in \mathbb{R}^d$ : Covariates
$W \in \{0, 1\}$ : Treatment indicator
$e(x) = P(W=1|X=x)$ : Propensity Score
$\mu_a(x) = E[Y(a)|X=x]$ : Response functions
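
To make the setup concrete, the following sketch simulates one dataset from this model. Every functional form here is an assumption chosen for illustration (uniform covariate, logistic propensity, sinusoidal baseline, linear effect); the only structural ingredient beyond the equations above is the usual consistency rule that the observed outcome is $Y = W\,Y(1) + (1 - W)\,Y(0)$.

```python
import math
import random


def simulate(n_obs, seed=0):
    """Draw (x, w, y) rows from an assumed instance of the potential outcomes model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_obs):
        x = rng.uniform(-1, 1)                  # X ~ Lambda (assumed Uniform(-1, 1))
        e_x = 1 / (1 + math.exp(-x))            # e(x): assumed logistic propensity
        w = 1 if rng.random() < e_x else 0      # W ~ Bern(e(X))
        y0 = math.sin(x) + rng.gauss(0, 1)      # Y(0) = mu_0(X) + eps(0)
        y1 = math.sin(x) + 1 + x + rng.gauss(0, 1)  # Y(1) = mu_1(X) + eps(1)
        y = y1 if w == 1 else y0                # only one potential outcome is observed
        rows.append({"x": x, "w": w, "y": y, "tau": 1 + x})  # tau kept for evaluation
    return rows


rows = simulate(1000)
share_treated = sum(r["w"] for r in rows) / len(rows)
print(share_treated)
```

On this covariate range the assumed propensity stays between roughly 0.27 and 0.73, so the simulated data has overlap between the two arms.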

Identification Assumptions

Unconfoundedness: $(Y(0), Y(1)) \perp W | X$
Positivity: $0 < e_{min} < e(x) < e_{max} < 1$

Estimation Target

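Estimators are compared by their expected mean squared error (EMSE) under the data distribution $P$:

$\text{EMSE}(P, \hat{\tau}) = E[(\tau(X) - \hat{\tau}(X))^2]$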
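
In simulations, where the true $\tau$ is known, this quantity can be approximated by Monte Carlo. The sketch below averages only over draws of $X$ for one fixed estimate; the full EMSE would also average over training samples. The helper `emse` and the two example functions are hypothetical.

```python
import random


def emse(tau, tau_hat, draw_x, n_draws=10_000, seed=0):
    """Monte Carlo average of (tau(X) - tau_hat(X))^2 over draws of X."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = draw_x(rng)
        total += (tau(x) - tau_hat(x)) ** 2
    return total / n_draws


def tau(x):
    return 1 + 2 * x      # assumed true CATE


def tau_hat(x):
    return 1.1 + 2 * x    # assumed estimate with a constant bias of 0.1


value = emse(tau, tau_hat, lambda rng: rng.uniform(0, 1))
print(value)  # squared bias, approximately 0.01
```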

Convergence Rates

Notation: $S(a)$ = function class with minimax rate $N^{-a}$

T-Learner Rate

$\text{Rate} = O(m^{-a_\mu} + n^{-a_\mu})$

$m$ : control group size, $n$ : treatment group size
$a_\mu$ : rate exponent of the response functions, i.e. $\mu_0, \mu_1 \in S(a_\mu)$ (smoother functions allow a larger $a_\mu$)

X-Learner Rate (under conditions)

$a_\tau$ : rate exponent of the CATE, i.e. $\tau \in S(a_\tau)$
$\hat{\tau}_0$ : $O(m^{-a_\tau} + n^{-a_\mu})$
$\hat{\tau}_1$ : $O(m^{-a_\mu} + n^{-a_\tau})$

Linear CATE + Lipschitz response → Parametric rate is achievable
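
A quick order-of-magnitude comparison shows why these rates matter for unbalanced designs. The numbers below are assumed for illustration, ignore constants, and take the stated rates at face value.

```latex
\begin{align*}
&\text{Assume } m = 10^{6} \text{ controls},\quad n = 10^{2} \text{ treated},\quad
  a_\mu = \tfrac{1}{2},\quad a_\tau = 1. \\[4pt]
&\text{T-Learner:}\quad m^{-a_\mu} + n^{-a_\mu}
  = 10^{-3} + 10^{-1} \approx 0.101 \\
&\text{X-Learner } (\hat{\tau}_1):\quad m^{-a_\mu} + n^{-a_\tau}
  = 10^{-3} + 10^{-2} = 0.011
\end{align*}
```

The T-Learner bound is dominated by the small treated group, whereas $\hat{\tau}_1$ pays the slow rate $a_\mu$ only on the large control group and the faster rate $a_\tau$ on the treated group.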

Choosing a Meta-learner

Decision Guide


> flowchart TD
>     A[Start] --> B{CATE mostly zero?}
>     B -->|Yes| C[S-Learner]
>     B -->|No| D{Groups balanced?}
>     D -->|No| E[X-Learner]
>     D -->|Yes| F{Response functions similar?}
>     F -->|Yes| G[X-Learner or R-Learner]
>     F -->|No| H[T-Learner]
>
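
The same decision guide can be expressed as a small function. This is a direct transcription of the flowchart; the function name and the boolean argument names are hypothetical.

```python
def choose_meta_learner(cate_mostly_zero, groups_balanced, responses_similar):
    """Return the flowchart's recommendation for the given yes/no answers."""
    if cate_mostly_zero:
        return "S-Learner"
    if not groups_balanced:
        return "X-Learner"
    if responses_similar:
        return "X-Learner or R-Learner"
    return "T-Learner"


# Example: non-zero effects, few treated units.
print(choose_meta_learner(cate_mostly_zero=False,
                          groups_balanced=False,
                          responses_similar=True))
```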

Practical Recommendations (Künzel et al.)

Default choice: X-Learner (unless strong prior that CATE ≈ 0)
Small datasets: BART as base learner
Large datasets: Random Forest as base learner

Base Learners

Meta-learners are compatible with a variety of base learners:

Base Learner	Strengths	Typical Use
Random Forest	Scalable, handles high-dim	Large datasets
BART	Uncertainty quantification, regularization	Small datasets
Neural Networks	Flexible, complex patterns	Very large datasets
Lasso/Ridge	Sparse/regularized linear	High-dim, interpretable
Boosting	Adaptive, accurate	General purpose

Related Concepts

CATE - the estimation target
S-Learner - Single model approach
T-Learner - Two model approach
X-Learner - Imputation-based approach
R-Learner - Residualized approach
DR-Learner - Doubly robust approach
Propensity Score - Treatment probability

Implementation

Python (econml):

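from econml.metalearners import SLearner, TLearner, XLearner
from sklearn.ensemble import RandomForestRegressor

# Assumed inputs: Y (outcomes), T (binary treatment), X (covariates),
# and X_test (covariates at which to evaluate the CATE).

# S-Learner: one model fit on covariates and treatment jointly
s_learner = SLearner(overall_model=RandomForestRegressor())
s_learner.fit(Y, T, X=X)

# T-Learner: one model per treatment arm
t_learner = TLearner(models=RandomForestRegressor())
t_learner.fit(Y, T, X=X)

# X-Learner: imputed effects combined with propensity weights
x_learner = XLearner(models=RandomForestRegressor())
x_learner.fit(Y, T, X=X)

# Estimated CATE for each row of X_test
cate = x_learner.effect(X_test)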

R (causalToolbox):

library(causalToolbox)

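# Assumed inputs: X (data frame of covariates), W (0/1 treatment vector),
# Y (numeric outcomes), X_test (covariates at which to evaluate the CATE)

# S-Learner with RF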
s_rf <- S_RF(feat = X, tr = W, yobs = Y)
cate_s <- EstimateCate(s_rf, X_test)

# X-Learner with RF
x_rf <- X_RF(feat = X, tr = W, yobs = Y)
cate_x <- EstimateCate(x_rf, X_test)

References

kunzelMetalearnersEstimatingHeterogeneous2019 - S, T, X-learner
nieQuasiOracleEstimationHeterogeneous2020 - R-learner
kennedyOptimalDoublyRobust2023 - DR-learner
