R-Learner

Definition

The R-Learner (Residualized Learner) is a meta-learner that estimates the CATE from residualized outcomes and residualized treatments, based on the Robinson Transformation.

Algorithm:

Step 1: Estimate nuisance functions (with Cross-fitting) $\hat{m}(x) = \hat{E}[Y|X=x], \quad \hat{e}(x) = \hat{P}(W=1|X=x)$

Step 2: Minimize the R-Loss $\hat{\tau} = \arg\min_\tau \hat{L}_n(\tau) + \Lambda_n(\tau)$

where: $\hat{L}_n(\tau) = \frac{1}{n} \sum_{i=1}^n \left[ \{Y_i - \hat{m}^{(-q(i))}(X_i)\} - \{W_i - \hat{e}^{(-q(i))}(X_i)\} \tau(X_i) \right]^2$

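$q(i)$ : Index of the cross-fitting fold that contains the $i$-th observation
$\hat{m}^{(-q(i))}, \hat{e}^{(-q(i))}$ : $\hat{m}$ and $\hat{e}$ estimated on all folds except fold $q(i)$, so observation $i$ is never used to fit its own nuisance predictions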
$\Lambda_n(\tau)$ : Regularization term
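
Once the cross-fitted nuisance predictions are available, the loss above can be evaluated directly for any candidate $\tau$. The following is a minimal sketch rather than the paper's implementation: the function name `r_loss`, its arguments, and the toy numbers are assumptions made for illustration, and the regularization term is passed in as a plain number.

```python
import numpy as np


def r_loss(y, w, m_hat, e_hat, tau_values, penalty=0.0):
    """Empirical R-loss of a candidate CATE evaluated at the sample points.

    m_hat and e_hat are assumed to be out-of-fold (cross-fitted) predictions.
    """
    y_res = np.asarray(y, dtype=float) - np.asarray(m_hat, dtype=float)
    w_res = np.asarray(w, dtype=float) - np.asarray(e_hat, dtype=float)
    tau_values = np.asarray(tau_values, dtype=float)
    fit_term = np.mean((y_res - w_res * tau_values) ** 2)
    return fit_term + penalty


# Toy data (hypothetical): the outcome residual equals exactly 2 * treatment residual
y = np.array([1.0, 3.0, 2.0, 4.0])
w = np.array([0, 1, 0, 1])
m_hat = np.array([2.0, 2.0, 3.0, 3.0])
e_hat = np.full(4, 0.5)

loss_at_two = r_loss(y, w, m_hat, e_hat, tau_values=np.full(4, 2.0))
loss_at_zero = r_loss(y, w, m_hat, e_hat, tau_values=np.zeros(4))
```

In this toy example the candidate $\tau \equiv 2$ reproduces the residualized outcomes exactly, so its loss is lower than that of $\tau \equiv 0$.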

Intuitive Understanding

Core idea:

After removing the influence of the covariates from both the outcome and the treatment, learn only the pure treatment effect.

Step 1: Estimate nuisance functions
        m̂(x) = E[Y|X]    (outcome model)
        ê(x) = P(W=1|X)  (propensity model)
              ↓
Step 2: Compute residuals (via cross-fitting)
        Ỹᵢ = Yᵢ - m̂(Xᵢ)        (outcome residual)
        W̃ᵢ = Wᵢ - ê(Xᵢ)        (treatment residual)
              ↓
Step 3: Minimize R-loss
        τ̂ = argmin Σ[Ỹᵢ - W̃ᵢ·τ(Xᵢ)]² + regularization

Why “R”?

Uses Residuals
Based on the Robinson transformation
Or the initial of Robinson, the author of the transformation

Key Properties

Quasi-Oracle Property

Key theorem (Theorem 1): If the nuisance components are estimated at a rate of $o(n^{-1/4})$ in root-mean-squared error, the R-learner achieves the same convergence rate as the oracle that knows the true nuisance functions.

$\|\hat{\tau} - \tau^*\|^2 = O_P\left(\frac{\log n}{n}\right) + o_P(1) \cdot \text{nuisance error}$

Meaning:

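Nuisance estimation error has no first-order effect on the CATE estimate; it enters only through squares and products of the nuisance errors
The nuisance estimators may converge more slowly than the parametric rate: since their errors enter only at second order, an $o(n^{-1/4})$ rate is enough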
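
The second-order behavior can be checked numerically in the simplest case of a constant treatment effect, where the unpenalized R-loss has a closed-form minimizer. This is a small simulation sketch under assumed settings, not a reproduction of the paper's experiments: the data-generating process, the perturbation shapes, and the name `constant_tau_estimate` are all hypothetical.

```python
import numpy as np


def constant_tau_estimate(y, w, m_hat, e_hat):
    """Minimizer of the unpenalized R-loss over constant tau."""
    y_res = y - m_hat
    w_res = w - e_hat
    return np.sum(w_res * y_res) / np.sum(w_res ** 2)


rng = np.random.default_rng(0)
n = 20_000
x = rng.uniform(-1.0, 1.0, size=n)
e_true = 1.0 / (1.0 + np.exp(-x))          # true propensity
w = rng.binomial(1, e_true)
tau_true = 1.0
baseline = np.sin(3.0 * x)
y = baseline + tau_true * w + rng.normal(size=n)
m_true = baseline + tau_true * e_true       # true E[Y|X]

# Perturb both nuisances by a smooth error of size n^(-0.3),
# which shrinks faster than the n^(-1/4) threshold
delta = n ** -0.3
m_noisy = m_true + delta * np.cos(5.0 * x)
e_noisy = e_true + delta * np.sin(7.0 * x)

tau_oracle = constant_tau_estimate(y, w, m_true, e_true)
tau_noisy = constant_tau_estimate(y, w, m_noisy, e_noisy)
gap = abs(tau_noisy - tau_oracle)           # effect of the nuisance error
```

Under these assumptions the gap between the oracle and the perturbed estimate should be well below `delta` itself, because the perturbations enter only through their squares and products.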

Orthogonality

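The orthogonality condition of the Robinson Transformation: $E[\varepsilon \mid X, W] = 0$, where $\varepsilon = Y - m^*(X) - \{W - e^*(X)\}\,\tau^*(X)$

Because the residualized treatment $W - e^*(X)$ is uncorrelated with $\varepsilon$ given $X$, small errors in $\hat{m}$ and $\hat{e}$ affect the R-loss only at second order. This secures robustness against nuisance error.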
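
For reference, the transformation follows in a few lines from a partially linear outcome model. The derivation below is a standard sketch, assuming a binary treatment that is unconfounded given $X$; the baseline function $b(x)$ is a symbol introduced here for illustration.

$$
\begin{aligned}
Y &= b(X) + W\,\tau^*(X) + \varepsilon, \qquad E[\varepsilon \mid X, W] = 0 \\
m^*(X) &= E[Y \mid X] = b(X) + e^*(X)\,\tau^*(X) \\
Y - m^*(X) &= \{W - e^*(X)\}\,\tau^*(X) + \varepsilon
\end{aligned}
$$

The last line is the regression that the R-loss fits: the residualized outcome is regressed on the residualized treatment, with $\tau^*(X)$ as the coefficient.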

Separation of Concerns

Confounding control: Estimate $\hat{m}, \hat{e}$ in Step 1
Treatment effect estimation: Focus purely on the CATE in Step 2

A different ML method can be used at each step.
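
As an illustration of mixing methods, the sketch below uses gradient boosting for $\hat{m}$, logistic regression for $\hat{e}$, and a linear model for $\tau$. It is one possible arrangement under assumed choices, not a recommended configuration: the function name `fit_linear_r_learner`, the model choices, and the synthetic data are hypothetical. For a linear $\tau(x) = a + x^\top b$, minimizing the unpenalized R-loss reduces to ordinary least squares of the outcome residual on the treatment residual times $[1, x]$.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_predict


def fit_linear_r_learner(X, w, y, n_folds=5):
    """R-learner with a different model per step and a linear CATE."""
    # Step 1: confounding control with cross-fitted nuisance models
    m_hat = cross_val_predict(
        GradientBoostingRegressor(random_state=0), X, y, cv=n_folds
    )
    e_hat = cross_val_predict(
        LogisticRegression(), X, w, cv=n_folds, method="predict_proba"
    )[:, 1]
    y_res = y - m_hat
    w_res = w - e_hat

    # Step 2: OLS of y_res on the columns w_res * [1, x_1, ..., x_d]
    design = w_res[:, None] * np.column_stack([np.ones(len(y)), X])
    coef = LinearRegression(fit_intercept=False).fit(design, y_res).coef_

    def predict_tau(X_new):
        return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

    return predict_tau, coef


# Synthetic data (hypothetical): nonlinear baseline, linear treatment effect
rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 3))
e_true = 1.0 / (1.0 + np.exp(-X[:, 0]))
w = rng.binomial(1, e_true)
tau_true = 1.0 + 0.5 * X[:, 1]
y = np.sin(X[:, 0]) + X[:, 2] ** 2 + w * tau_true + rng.normal(size=n)

predict_tau, coef = fit_linear_r_learner(X, w, y)  # coef = [a, b_1, b_2, b_3]
```

The nuisance models never see the final-stage model, so either side can be swapped (for example, a random forest for $\hat{m}$ or a lasso for $\tau$) without touching the other.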

Algorithm Detail

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold


def r_learner(X, W, Y, base_learner, n_folds=5, eps=1e-6):
    # base_learner: an sklearn regressor whose fit() accepts sample_weight
    n = len(Y)
    m_hat = np.zeros(n)  # out-of-fold estimates of E[Y|X]
    e_hat = np.zeros(n)  # out-of-fold estimates of P(W=1|X)

    # Step 1: Cross-fitted nuisance estimation
    kf = KFold(n_splits=n_folds, shuffle=True)

    for train_idx, val_idx in kf.split(X):
        # Fit outcome model on a fresh copy so models do not share state
        outcome_model = clone(base_learner).fit(X[train_idx], Y[train_idx])
        m_hat[val_idx] = outcome_model.predict(X[val_idx])

        # Fit propensity model (regression of the 0/1 treatment on X)
        propensity_model = clone(base_learner).fit(X[train_idx], W[train_idx])
        e_hat[val_idx] = propensity_model.predict(X[val_idx])

    # Compute residuals
    Y_tilde = Y - m_hat  # outcome residual
    W_tilde = W - e_hat  # treatment residual

    # Step 2: Minimize R-loss
    # τ̂ = argmin Σ(Ỹᵢ - W̃ᵢ·τ(Xᵢ))²
    # This is equivalent to weighted least squares:
    # Ỹᵢ/W̃ᵢ ≈ τ(Xᵢ) with weight W̃ᵢ²

    # Pseudo-outcome for regression; the guard against division by
    # (near) zero must keep the sign of W̃ᵢ
    W_safe = np.where(W_tilde >= 0, 1.0, -1.0) * np.maximum(np.abs(W_tilde), eps)
    pseudo_outcome = Y_tilde / W_safe
    weights = W_tilde ** 2

    # Fit CATE model (weighted regression)
    tau_model = clone(base_learner).fit(X, pseudo_outcome, sample_weight=weights)

    return tau_model.predict
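
The weighted-least-squares form used in Step 2 rests on the identity $(\tilde{Y}_i - \tilde{W}_i \tau)^2 = \tilde{W}_i^2 (\tilde{Y}_i / \tilde{W}_i - \tau)^2$. The short check below verifies it numerically on made-up residuals (all values are hypothetical). It also shows why any guard against division by zero has to preserve the sign of $\tilde{W}_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
w = rng.binomial(1, 0.3, size=n)
w_res = w - 0.3                  # treatment residuals: either -0.3 or 0.7
y_res = rng.normal(size=n)       # stand-in outcome residuals
tau = rng.normal(size=n)         # an arbitrary candidate tau(X_i)

# R-loss written directly and as a weighted squared error on the pseudo-outcome
direct = np.sum((y_res - w_res * tau) ** 2)
weighted = np.sum(w_res ** 2 * (y_res / w_res - tau) ** 2)

# A one-sided clip replaces every negative residual with a tiny positive
# number; a sign-preserving guard leaves these residuals unchanged
one_sided = np.clip(w_res, 1e-6, None)
sign_kept = np.where(w_res >= 0, 1.0, -1.0) * np.maximum(np.abs(w_res), 1e-6)
```

Here `direct` and `weighted` agree up to floating-point error, while `one_sided` no longer contains any negative value, which would flip the sign of the pseudo-outcome for control units.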

Comparison with Other Meta-Learners

Aspect	S-Learner	T-Learner	X-Learner	R-Learner
Models	1	2	4	3 (m, e, τ)
Data usage	All together	Split	Cross-group	All + cross-fitting
Targets	Response	Response	Imputed effects	Residualized outcome
Key feature	Simple	Separate	Imbalance handling	Orthogonality
Best when	CATE ≈ 0	Different μ₀, μ₁	Unbalanced groups	Simple CATE, complex nuisance

When to Use

Good Scenarios

When the CATE is simpler than the nuisance: Orthogonality minimizes the impact of nuisance complexity
When confounding is complex but the treatment effect is simple
When cross-validation is important: Hyperparameter tuning is possible at each step

Bad Scenarios

When the propensity score is extreme: Unstable if $e(x) \approx 0$ or $1$
When the sample size is very small: Data loss due to cross-fitting
When nuisance estimation is difficult: No guarantee if the $n^{-1/4}$ rate is not achieved
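
The first point can be made concrete: the expected squared treatment residual at $x$ is $e(x)\{1 - e(x)\}$, which acts as the weight that observation carries in the R-loss and vanishes as $e(x)$ approaches 0 or 1. A minimal diagnostic sketch follows; the function name `overlap_summary` and the 0.05/0.95 thresholds are arbitrary assumptions, not values from the paper.

```python
import numpy as np


def overlap_summary(e_hat, low=0.05, high=0.95):
    """Summarize how much of the sample has extreme estimated propensities."""
    e_hat = np.asarray(e_hat, dtype=float)
    extreme = (e_hat < low) | (e_hat > high)
    # e(1 - e) is the expected squared treatment residual, i.e. the
    # effective weight of an observation in the R-loss
    info = e_hat * (1.0 - e_hat)
    return {
        "share_extreme": float(extreme.mean()),
        "min_weight": float(info.min()),
        "mean_weight": float(info.mean()),
    }


# Hypothetical propensity estimates: two well-balanced units, two extreme ones
summary = overlap_summary([0.5, 0.5, 0.02, 0.99])
```

A large `share_extreme` or a `min_weight` close to zero suggests that the CATE is weakly identified in parts of the covariate space, whatever final-stage model is used.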

Convergence rates:

Penalized kernel regression: $O(n^{-\alpha/(2\alpha+d)})$ where $\alpha$ is smoothness and $d$ is the covariate dimension
Linear CATE: Parametric rate $O(n^{-1/2})$ possible

Simulation Results (from Paper)

Setup A (Complex nuisance, simple CATE):

R-learner has the strongest performance
Orthogonality removes the confounding effect

Setup B (RCT, constant propensity):

R-learner ≈ T-learner
No particular advantage

Setup C (Easy propensity, complex baseline):

R-learner is competitive
Performance similar to X-learner

Setup D (Unrelated arms):

T-learner is advantageous
R-learner does not benefit from data sharing

Related Concepts

Meta-learners - The overall framework
Robinson Transformation - Theoretical foundation
R-Loss - Optimization objective function
Quasi-Oracle Property - Key theoretical guarantee
Cross-fitting - Removes overfitting bias
S-Learner, T-Learner, X-Learner - Alternative methods
DR-Learner - Doubly robust approach
CATE - Estimation target
Propensity Score - Nuisance component

Implementation

Python (econml):

from econml.dml import NonParamDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# R-learner is closely related to DML
r_learner = NonParamDML(
    model_y=RandomForestRegressor(),
    model_t=RandomForestClassifier(),
    model_final=RandomForestRegressor(),
    discrete_treatment=True,  # binary treatment, so model_t is a classifier
    cv=5  # cross-fitting folds
)
# T is the treatment (W in the notation above); econml uses W for extra controls
r_learner.fit(Y, T, X=X)
cate = r_learner.effect(X_test)

R (rlearner package):

library(rlearner)

# Using the lasso as base learner
r_fit <- rlasso(X, W, Y)  # or rboost, etc.
cate <- predict(r_fit, X_test)

References

nieQuasiOracleEstimationHeterogeneous2020 - Original R-learner paper
chernozhukovDoubleDebiasedMachine2018 - Related DML theory
