Tae Hyun Kim (Lowell)

S-Learner

#causal-inference #hte #meta-learner

Definition

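The S-Learner (Single Learner) is a meta-learner that estimates the response function with a single model that includes the treatment indicator as a feature, then computes the CATE (conditional average treatment effect) as the difference between that model's predictions with the treatment set to 1 and to 0.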

Algorithm:

  1. Estimate the combined response function with a single model: $\hat{\mu}(x, w) = \hat{E}[Y \mid X = x, W = w]$

  2. Estimate the CATE: $\hat{\tau}_S(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0)$

Intuitive Understanding

Core idea:

Treat the treatment indicator $W$ as just one more feature, and train a single model on the entire dataset.

Data:  (X, W, Y) for all observations

Model: μ̂(x, w) = f(x, w)  (single model)

CATE:  τ̂(x) = μ̂(x, 1) - μ̂(x, 0)

Advantages:

  • The simplest approach
  • Uses all data jointly (data sharing)
  • Exploits common patterns shared between treatment/control

Disadvantages:

  • The treatment indicator may be ignored when the treatment effect is small (regularization drops $W$)
  • Unsuitable when the structures of $\mu_0$ and $\mu_1$ are very different

Key Properties

Data Sharing

  • Trains a single model using all $n + m$ observations (control and treated units combined)
  • Can learn patterns common to control/treatment

Regularization Bias

$\hat{\mu}(x, w) \approx \hat{\mu}(x) \quad \text{if the treatment effect is small}$

  • The stronger the regularization, the more the model tends to ignore the influence of $W$
  • Suitable when CATE ≈ 0; otherwise the estimate is biased toward zero
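
To see where this shrinkage comes from, consider a deliberately simple case. Everything in it is an assumption of this sketch rather than something stated above: a linear S-learner $\hat{\mu}(x, w) = \hat{\alpha} + \hat{\beta}^\top x + \hat{\gamma} w$, fit by ridge regression with penalty $\lambda$ on centered data, where the centered treatment column $\tilde{W}$ happens to be orthogonal to the centered covariates. Under those assumptions the CATE estimate is just the ridge coefficient on $W$:

$$
\hat{\tau}_S(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0) = \hat{\gamma}_{\text{ridge}}
= \frac{\tilde{W}^\top \tilde{Y}}{\tilde{W}^\top \tilde{W} + \lambda}
= \hat{\gamma}_{\text{OLS}} \cdot \frac{S_{WW}}{S_{WW} + \lambda},
\qquad S_{WW} = \sum_i (W_i - \bar{W})^2
$$

The factor $S_{WW} / (S_{WW} + \lambda)$ is below 1 for any $\lambda > 0$ and tends to 0 as $\lambda$ grows. In this toy setting the estimated effect is therefore always pulled toward zero, which is harmless when the true CATE is 0 and a source of bias otherwise.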

Convergence Rate

Depends on the smoothness $a_\mu$ of the response function: $\text{Rate} = O((n+m)^{-a_\mu})$

Algorithm Detail

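import numpy as np

def s_learner(X, W, Y, base_learner):
    """Fit an S-learner and return a function that predicts the CATE.

    base_learner is any regressor whose fit() returns the fitted model
    and which exposes predict() (the scikit-learn convention).
    """
    # Step 1: Append the treatment indicator as one more feature column
    X_combined = np.column_stack([X, W])

    # Step 2: Fit a single model on all observations
    model = base_learner.fit(X_combined, Y)

    # Step 3: Predict the CATE as mu_hat(x, 1) - mu_hat(x, 0)
    def predict_cate(X_new):
        X_treat = np.column_stack([X_new, np.ones(len(X_new))])
        X_ctrl = np.column_stack([X_new, np.zeros(len(X_new))])
        return model.predict(X_treat) - model.predict(X_ctrl)

    return predict_cate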

When to Use

Good Scenarios

  • When the CATE is mostly close to 0: Shrinking the influence of $W$ toward zero matches the truth
  • When the response functions are similar: $\mu_0(x) \approx \mu_1(x) + c$
  • When data is limited: Benefits from data sharing

Bad Scenarios

  • When the treatment effect is clearly nonzero: Regularization may shrink or ignore the effect
  • When the response functions are very different: Hard to capture structural differences
  • When heterogeneous effects matter: Subtle differences in the effect across $x$ are missed

Comparison with T-Learner

| Aspect | S-Learner | T-Learner |
| --- | --- | --- |
| Models | 1 | 2 |
| Data usage | All together | Split by treatment |
| Sharing | Yes | No |
| Best when | CATE ≈ 0 | Different response structures |
| Risk | Ignore small effects | No data sharing |
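
The contrast in the table can be made concrete with a small simulation. The sketch below is illustrative only: the data-generating process ($\mu_0(x) = x$, $\mu_1(x) = -x$, so $\tau(x) = -2x$), the helper `fit_ols`, and the choice of a linear base learner without interaction terms are all assumptions made here, not part of the original comparison. With that base learner the S-learner can only produce a constant effect, while the T-learner fits each arm separately and can recover the heterogeneity. A more flexible base learner could narrow the gap.

```python
import numpy as np

def fit_ols(features, y):
    """Hypothetical helper: least-squares fit with an intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda F: np.column_stack([np.ones(len(F)), F]) @ coef

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 1))
W = rng.integers(0, 2, size=n)

# Assumed setup: mu_0(x) = x and mu_1(x) = -x, so tau(x) = -2x
Y = np.where(W == 1, -X[:, 0], X[:, 0]) + rng.normal(scale=0.1, size=n)
tau_true = -2 * X[:, 0]

# S-learner: one model on [X, W], then toggle W between 1 and 0
mu = fit_ols(np.column_stack([X, W]), Y)
tau_s = mu(np.column_stack([X, np.ones(n)])) - mu(np.column_stack([X, np.zeros(n)]))

# T-learner: one model per treatment arm, then take the difference
mu0 = fit_ols(X[W == 0], Y[W == 0])
mu1 = fit_ols(X[W == 1], Y[W == 1])
tau_t = mu1(X) - mu0(X)

mse_s = np.mean((tau_s - tau_true) ** 2)
mse_t = np.mean((tau_t - tau_true) ** 2)
print(f"S-learner CATE MSE: {mse_s:.3f}")
print(f"T-learner CATE MSE: {mse_t:.5f}")
```

In this setup the S-learner's error is dominated by its inability to represent an effect that varies with $x$, which is the "different response structures" row of the table. In the opposite regime (CATE ≈ 0 and little data per arm) the T-learner's lack of data sharing would be the weaker choice.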

Example

Simulation setup:

  • $\mu_0(x) = x$
  • $\mu_1(x) = x$ (i.e., $\tau(x) = 0$)

S-Learner result:

  • The regularized model ignores $W$ → $\hat{\tau}(x) \approx 0$
  • Correct estimation

Opposite scenario:

  • $\mu_0(x) = x$, $\mu_1(x) = x + 2$ (i.e., $\tau(x) = 2$)
  • The S-Learner may shrink the influence of $W$ via regularization → bias
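
Both scenarios can be reproduced in a few lines. This is a minimal sketch under assumptions that are not part of the example above: the base learner is a ridge-penalized linear model solved in closed form, the noise level and sample size are arbitrary, and `ridge_s_learner_effect` is a hypothetical helper name. For a linear model the S-learner's CATE is simply the fitted coefficient on $W$.

```python
import numpy as np

def ridge_s_learner_effect(X, W, Y, lam):
    """Constant CATE estimate from a ridge-penalized linear S-learner (illustrative helper)."""
    Z = np.column_stack([X, W])
    Zc = Z - Z.mean(axis=0)  # center features so the intercept is not penalized
    Yc = Y - Y.mean()
    beta = np.linalg.solve(Zc.T @ Zc + lam * np.eye(Zc.shape[1]), Zc.T @ Yc)
    return beta[-1]  # mu_hat(x, 1) - mu_hat(x, 0) equals the coefficient on W

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=n)
W = rng.integers(0, 2, size=n)
noise = rng.normal(scale=0.5, size=n)

Y_null = X + noise             # mu_0(x) = mu_1(x) = x, so tau(x) = 0
Y_shift = X + 2.0 * W + noise  # mu_1(x) = x + 2, so tau(x) = 2

for lam in (0.0, 100.0, 10000.0):
    est_null = ridge_s_learner_effect(X, W, Y_null, lam)
    est_shift = ridge_s_learner_effect(X, W, Y_shift, lam)
    print(f"lambda={lam:>8}: tau_hat (true 0) = {est_null:.3f}, tau_hat (true 2) = {est_shift:.3f}")
```

With no penalty the estimate should land near the true effect in both cases. As the penalty grows, the estimate is pulled toward zero, which costs nothing when $\tau(x) = 0$ but understates the effect when $\tau(x) = 2$.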

Implementation

Python (econml):

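from econml.metalearners import SLearner
from sklearn.ensemble import RandomForestRegressor

# Y: outcomes, T: treatment indicator, X: covariates, X_test: covariates to score
s_learner = SLearner(overall_model=RandomForestRegressor())
s_learner.fit(Y, T, X=X)
cate = s_learner.effect(X_test)  # estimated CATE for each row of X_test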

R:

library(causalToolbox)
s_rf <- S_RF(feat = X, tr = W, yobs = Y)
cate <- EstimateCate(s_rf, X_test)

References

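  • Künzel, S. R., Sekhon, J. S., Bickel, P. J., & Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10), 4156-4165. (S-learner analysis)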
