S-Learner · Tae Hyun Kim (Lowell)

Definition

The S-Learner (Single Learner) is a Meta-learner that estimates the response function with a single model that includes the treatment indicator as a feature, then computes the CATE.

Algorithm:

Estimate the combined response function with a single model: $\hat{\mu}(x, w) = \hat{E}[Y | X = x, W = w]$
Estimate the CATE: $\hat{\tau}_S(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0)$

Intuitive Understanding

Core idea:

Treat the treatment $W$ simply as one more feature, and train a single model on the entire dataset.

Data:  (X, W, Y) for all observations
           ↓
Model: μ̂(x, w) = f(x, w)  (single model)
           ↓
CATE:  τ̂(x) = μ̂(x, 1) - μ̂(x, 0)

Advantages:

The simplest approach
Uses all data jointly (data sharing)
Exploits common patterns shared between treatment/control

Disadvantages:

The treatment may be ignored when its effect is small (regularization can drop $W$)
Unsuitable when the structures of $\mu_0$ and $\mu_1$ are very different

Key Properties

Trains a single model using all $(n + m)$ observations, treated and control units combined
Can learn patterns common to control/treatment

Regularization Bias

$\hat{\mu}(x, w) \approx \hat{\mu}(x) \quad \text{if treatment effect is small}$

The stronger the regularization, the more the model tends to ignore the influence of $W$
This is harmless when CATE ≈ 0; otherwise the estimate is biased toward zero
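
As a worked sketch of this shrinkage, assume a ridge-penalized additive linear base learner with penalty $\lambda$ and an unpenalized intercept. This setting is an assumption chosen for tractability and is not specified above. The model and its implied CATE are:

$\hat{\mu}(x, w) = \hat{\beta}_0 + x^\top \hat{\beta} + \hat{\gamma}_\lambda w, \quad \hat{\tau}_S(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0) = \hat{\gamma}_\lambda$

If the centered treatment column $\tilde{w}$ is orthogonal to the centered covariates in the sample (roughly the case under randomized assignment), the ridge normal equations decouple and, with $\tilde{y}$ the centered outcome:

$\hat{\gamma}_\lambda = \frac{\tilde{w}^\top \tilde{y}}{\tilde{w}^\top \tilde{w} + \lambda} = \frac{\tilde{w}^\top \tilde{w}}{\tilde{w}^\top \tilde{w} + \lambda} \, \hat{\gamma}_{\text{OLS}}$

The multiplying factor is below 1 for any $\lambda > 0$ and tends to 0 as $\lambda$ grows. Under these assumptions the estimated effect is pulled toward zero, which costs nothing when the true effect is zero and introduces bias otherwise.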

Convergence Rate

Depends on the smoothness $a_\mu$ of the response function: $\text{Rate} = O((n+m)^{-a_\mu})$

Algorithm Detail

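import numpy as np

def s_learner(X, W, Y, base_learner):
    # Step 1: Append the treatment indicator as the last feature column
    X_combined = np.column_stack([X, W])

    # Step 2: Fit a single model (assumes fit() returns the fitted estimator,
    # as scikit-learn estimators do)
    model = base_learner.fit(X_combined, Y)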

    # Step 3: Predict CATE
    def predict_cate(X_new):
        X_treat = np.column_stack([X_new, np.ones(len(X_new))])
        X_ctrl = np.column_stack([X_new, np.zeros(len(X_new))])
        return model.predict(X_treat) - model.predict(X_ctrl)

    return predict_cate
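
A minimal usage sketch of the function above. The base learner `InteractionOLS` is hypothetical, written here only so the example runs with NumPy alone, and the data-generating process (true CATE $\tau(x) = 1 + x$) is likewise an illustrative assumption. Because this base learner includes an $x \cdot w$ interaction, the single model can represent an effect that varies with $x$.

```python
import numpy as np


class InteractionOLS:
    """Hypothetical base learner: least squares on [1, x, w, x*w].

    Expects the treatment indicator in the last column, matching s_learner.
    """

    def _design(self, X_combined):
        X, w = X_combined[:, :-1], X_combined[:, -1:]
        return np.column_stack([np.ones(len(X)), X, w, X * w])

    def fit(self, X_combined, Y):
        self.coef_, *_ = np.linalg.lstsq(self._design(X_combined), Y, rcond=None)
        return self  # returning self lets s_learner use the result of fit()

    def predict(self, X_combined):
        return self._design(X_combined) @ self.coef_


def s_learner(X, W, Y, base_learner):
    # Same logic as the algorithm above, repeated so this block is self-contained
    model = base_learner.fit(np.column_stack([X, W]), Y)

    def predict_cate(X_new):
        X_treat = np.column_stack([X_new, np.ones(len(X_new))])
        X_ctrl = np.column_stack([X_new, np.zeros(len(X_new))])
        return model.predict(X_treat) - model.predict(X_ctrl)

    return predict_cate


rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 1))
W = rng.integers(0, 2, size=n)            # randomized binary treatment
tau = 1.0 + X[:, 0]                       # true CATE, chosen for illustration
Y = X[:, 0] + W * tau + rng.normal(scale=0.1, size=n)

predict_cate = s_learner(X, W, Y, InteractionOLS())
X_new = np.array([[-1.0], [0.0], [1.0]])
cate_hat = predict_cate(X_new)            # true values at these points: 0, 1, 2
```

Any object with scikit-learn style `fit` and `predict` methods could be passed in place of `InteractionOLS`.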

When to Use

Good Scenarios

When the CATE is mostly close to 0: Shrinking the influence of $W$ toward zero matches the truth
When the response functions are similar: $\mu_0(x) \approx \mu_1(x) + c$
When data is limited: Benefits from data sharing

Bad Scenarios

When the treatment effect is clearly nonzero: Regularization may shrink it toward zero
When the response functions are very different: Hard to capture structural differences
When heterogeneous effects matter: Subtle differences are missed

Comparison with T-Learner

Aspect	S-Learner	T-Learner
Models	1	2
Data usage	All together	Split by treatment
Sharing	Yes	No
Best when	CATE ≈ 0	Different response structures
Risk	Ignore small effects	No data sharing
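
The contrast in the table can be sketched numerically. The setup below is an assumption made for illustration: $\mu_0(x) = x$ and $\mu_1(x) = -x$, so $\tau(x) = -2x$, with plain least squares as the base learner for both methods. The helpers `fit_ols` and `predict_ols` are hypothetical names. With an additive linear model, $W$ can only shift the intercept, so the S-Learner returns one constant for every $x$, while the T-Learner fits a separate slope per group.

```python
import numpy as np


def fit_ols(features, y):
    """Least-squares fit with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_ols(coef, features):
    """Predict from coefficients produced by fit_ols."""
    return np.column_stack([np.ones(len(features)), features]) @ coef


rng = np.random.default_rng(1)
n = 4000
X = rng.uniform(-1, 1, size=(n, 1))
W = rng.integers(0, 2, size=n)
# Assumed setup: mu_0(x) = x, mu_1(x) = -x, so tau(x) = -2x.
Y = np.where(W == 1, -X[:, 0], X[:, 0]) + rng.normal(scale=0.1, size=n)

X_new = np.array([[-0.5], [0.5]])         # true CATE here: +1 and -1

# S-learner: one additive model on [X, W]; W only shifts the intercept.
s_coef = fit_ols(np.column_stack([X, W]), Y)
s_cate = (predict_ols(s_coef, np.column_stack([X_new, np.ones(2)]))
          - predict_ols(s_coef, np.column_stack([X_new, np.zeros(2)])))

# T-learner: separate models for treated and control units.
coef_1 = fit_ols(X[W == 1], Y[W == 1])
coef_0 = fit_ols(X[W == 0], Y[W == 0])
t_cate = predict_ols(coef_1, X_new) - predict_ols(coef_0, X_new)
```

This limitation comes from the additive base learner rather than from the S-Learner itself; a base learner that can model interactions between $W$ and $X$ may recover the heterogeneity.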

Example

Simulation setup:

$\mu_0(x) = x$
$\mu_1(x) = x$ (i.e., $\tau(x) = 0$)

S-Learner result:

The regularized model ignores $W$ → $\hat{\tau}(x) \approx 0$ ✓
Correct estimation

Opposite scenario:

$\mu_0(x) = x$, $\mu_1(x) = x + 2$ (i.e., $\tau(x) = 2$)
The S-Learner may shrink the influence of $W$ via regularization → bias
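
Both scenarios can be simulated with a small sketch. The choices below are assumptions for illustration: a ridge-penalized additive linear model as the base learner, sample size 200, Gaussian noise, and penalty `lam = 50`. The function `ridge_s_learner_cate` is a hypothetical helper, not part of any library.

```python
import numpy as np


def ridge_s_learner_cate(X, W, Y, lam):
    """CATE from a ridge-penalized additive S-learner (intercept unpenalized).

    With an additive linear model the estimated CATE is the coefficient on W.
    """
    Z = np.column_stack([X, W]).astype(float)
    Zc = Z - Z.mean(axis=0)               # centering keeps the intercept unpenalized
    Yc = Y - Y.mean()
    p = Zc.shape[1]
    coef = np.linalg.solve(Zc.T @ Zc + lam * np.eye(p), Zc.T @ Yc)
    return coef[-1]                       # last column is the treatment indicator


rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 1))
W = rng.integers(0, 2, size=n)
noise = rng.normal(scale=0.5, size=n)

Y_null = X[:, 0] + noise                  # mu_0 = mu_1 = x,  tau = 0
Y_shift = X[:, 0] + 2.0 * W + noise       # mu_1 = x + 2,     tau = 2

lam = 50.0                                # illustrative penalty strength
tau_null = ridge_s_learner_cate(X, W, Y_null, lam)        # stays near 0
tau_shift = ridge_s_learner_cate(X, W, Y_shift, lam)      # pulled well below 2
tau_shift_ols = ridge_s_learner_cate(X, W, Y_shift, 0.0)  # no penalty: near 2
```

In the first scenario the shrinkage is harmless because the target is already zero. In the second, the penalized estimate falls well short of the true effect, while the unpenalized fit does not.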

Related Concepts

Meta-learners - The overall framework
T-Learner - Alternative: two separate models
X-Learner - Improvement over S/T-learner
CATE - The estimation target

Implementation

Python (econml):

from econml.metalearners import SLearner
from sklearn.ensemble import RandomForestRegressor

s_learner = SLearner(overall_model=RandomForestRegressor())
s_learner.fit(Y, T, X=X)
cate = s_learner.effect(X_test)

R (causalToolbox):

library(causalToolbox)
s_rf <- S_RF(feat = X, tr = W, yobs = Y)
cate <- EstimateCate(s_rf, X_test)

References

kunzelMetalearnersEstimatingHeterogeneous2019 - S-learner analysis
