Tae Hyun Kim (Lowell)

CATE (Conditional Average Treatment Effect)

Tags: #causal-inference #cate #hte

Definition

The Conditional Average Treatment Effect (CATE) is the average treatment effect given covariates $X = x$:

$$\tau(x) = E[Y(1) - Y(0) \mid X = x]$$

where:

  • $Y(1)$: potential outcome under treatment
  • $Y(0)$: potential outcome under no treatment
  • $X$: pre-treatment covariates (feature variables)

Related terms:

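  • HTE (Heterogeneous Treatment Effect): often used interchangeably with CATE; strictly, HTE names the phenomenon of effects varying across units, while CATE is the estimand that describes it
  • ITE (Individual Treatment Effect): $\tau_i = Y_i(1) - Y_i(0)$ (unobservable)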

Intuitive Understanding

Key question:

“How effective is the treatment for a person with specific characteristics?”

ATE vs CATE:

| Quantity | Definition | Question |
|---|---|---|
| ATE | $E[Y(1) - Y(0)]$ | "Is there an effect on average?" |
| CATE | $E[Y(1) - Y(0) \mid X = x]$ | "Is there an effect for a person with these characteristics?" |

Example:

  • The average effect of a new drug is positive (ATE > 0)
  • But for patients aged 65 and over it has no or even a negative effect ($\tau(x_{\text{age} \geq 65}) \leq 0$)

The ATE is the average of the CATE over the covariate distribution: $\text{ATE} = E[\tau(X)] = \int \tau(x) \, dP(x)$
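The relationship between ATE and CATE can be checked numerically. The sketch below is purely illustrative: the age distribution and the step-shaped `tau` function are assumptions made up for this example, not estimates from any real drug trial.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical covariate: patient age, uniform on [30, 90).
age = rng.uniform(30, 90, size=n)

def tau(age):
    """Assumed true CATE: +5 for age < 65, -1 for age >= 65."""
    return np.where(age < 65, 5.0, -1.0)

cate = tau(age)
ate = cate.mean()  # Monte Carlo approximation of E[tau(X)]

print(f"ATE estimate:            {ate:.2f}")
print(f"Mean CATE for age < 65:  {cate[age < 65].mean():.2f}")
print(f"Mean CATE for age >= 65: {cate[age >= 65].mean():.2f}")
```

Under these assumptions the ATE is positive (analytically $5 \cdot 35/60 - 1 \cdot 25/60 = 2.5$) even though the effect is negative for everyone aged 65 and over, which is exactly the situation the example describes.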

Key Properties

Fundamental Problem of Causal Inference

At the individual level, $Y(1)$ and $Y(0)$ cannot be observed simultaneously:

  • Actual observation: $Y = A \, Y(1) + (1 - A) \, Y(0)$
  • The counterfactual is always missing
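A small simulation makes the missing-data structure concrete. This is a sketch on made-up data: the constant effect of 2 and the coin-flip treatment assignment are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5

y0 = rng.normal(10, 1, size=n)   # Y(0): potential outcome without treatment
y1 = y0 + 2.0                    # Y(1): assumed constant effect of 2
a = rng.integers(0, 2, size=n)   # A: treatment indicator (0 or 1)

# Observed outcome: Y = A * Y(1) + (1 - A) * Y(0)
y = a * y1 + (1 - a) * y0

# What the analyst actually sees: one potential outcome per unit,
# with the counterfactual recorded as NaN.
y1_obs = np.where(a == 1, y1, np.nan)
y0_obs = np.where(a == 0, y0, np.nan)

for i in range(n):
    print(f"unit {i}: A={a[i]}  Y(1)={y1_obs[i]:.2f}  Y(0)={y0_obs[i]:.2f}  Y={y[i]:.2f}")
```

Every row has exactly one NaN, so the individual effect $Y_i(1) - Y_i(0)$ can never be computed directly from observed data.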

Identification Assumptions

The standard assumptions for identifying CATE:

  1. SUTVA (Stable Unit Treatment Value Assumption)

    • No interference: another unit’s treatment does not affect my outcome
    • Consistency: $Y = Y(A)$
  2. Unconfoundedness (Ignorability): $(Y(0), Y(1)) \perp A \mid X$

    • Given $X$, treatment assignment is independent of the potential outcomes
  3. Positivity (Overlap): $0 < P(A = 1 \mid X = x) < 1, \quad \forall x \in \mathcal{X}$

    • At every covariate value the probability of receiving treatment lies strictly between 0 and 1
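Of the three assumptions, positivity is the one that can be partly checked from data. The sketch below assumes a single discrete covariate, so the propensity can be estimated by simple stratum frequencies; the strata, the true propensities, and the threshold `eps` are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical discrete covariate with four strata (0, 1, 2, 3).
x = rng.integers(0, 4, size=n)

# Assumed true propensities P(A=1 | X=k); stratum 3 is never treated,
# which violates positivity.
true_propensity = np.array([0.5, 0.3, 0.8, 0.0])
a = rng.binomial(1, true_propensity[x])

# Estimate the propensity in each stratum by the observed treated share.
e_hat = np.array([a[x == k].mean() for k in range(4)])

eps = 0.05  # illustrative threshold, not a standard value
violations = np.where((e_hat < eps) | (e_hat > 1 - eps))[0]

print("estimated propensities:", np.round(e_hat, 3))
print("strata with poor overlap:", violations.tolist())
```

With continuous or high-dimensional covariates the same idea is usually applied to a fitted propensity model instead of stratum frequencies, and the CATE in flagged regions should be treated as unidentified rather than extrapolated.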

Structure of CATE

Under the identification assumptions above, CATE can be written as a difference of two observable regression functions: $\tau(x) = \mu_1(x) - \mu_0(x)$

where $\mu_a(x) = E[Y \mid X = x, A = a]$
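For reference, the step from potential outcomes to observable regressions can be spelled out as follows. This is the standard identification argument, shown here as a sketch with each assumption marked where it is used.

```latex
\begin{align*}
\tau(x) &= E[Y(1) \mid X = x] - E[Y(0) \mid X = x]
  && \text{(linearity of expectation)} \\
&= E[Y(1) \mid X = x, A = 1] - E[Y(0) \mid X = x, A = 0]
  && \text{(unconfoundedness; positivity makes both conditionals well defined)} \\
&= E[Y \mid X = x, A = 1] - E[Y \mid X = x, A = 0]
  && \text{(consistency)} \\
&= \mu_1(x) - \mu_0(x)
\end{align*}
```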

Estimation Methods

Meta-Learners

| Method | Description | Best When |
|---|---|---|
| S-Learner | Single model: $\hat{\mu}(x, a)$, then $\hat{\tau}(x) = \hat{\mu}(x, 1) - \hat{\mu}(x, 0)$ | Homogeneous effects |
| T-Learner | Two models: $\hat{\mu}_1(x)$, $\hat{\mu}_0(x)$ separately | Different response functions |
| X-Learner | Two-stage imputation with propensity weighting | Unbalanced treatment groups |
| R-Learner | Residualize then regress: minimize $\sum_i \left( Y_i - \hat{\mu}(X_i) - (A_i - \hat{\pi}(X_i)) \, \tau(X_i) \right)^2$ | Heterogeneous effects |
| DR-Learner | Regress doubly robust pseudo-outcome on $X$ | Double robustness desired |

In the R-Learner row, $\hat{\mu}(x)$ estimates the marginal outcome regression $E[Y \mid X = x]$ and $\hat{\pi}(x)$ estimates the propensity score $P(A = 1 \mid X = x)$.
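To make the first two rows of the table concrete, here is a minimal sketch of a T-learner and an S-learner on simulated data. It assumes a randomized binary treatment and uses ordinary least squares as the base learner purely to keep the example dependency-free; in practice any regression model can be plugged in. The data-generating process and the helper names `fit_linear` and `predict_linear` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

x = rng.uniform(-1, 1, size=n)
a = rng.binomial(1, 0.5, size=n)        # randomized treatment
tau_true = 1.0 + 2.0 * x                # assumed true CATE
y = 0.5 * x + a * tau_true + rng.normal(0, 1, size=n)

def fit_linear(features, target):
    """Ordinary least squares with an intercept; returns the coefficients."""
    design = np.column_stack([np.ones(len(target)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def predict_linear(coef, features):
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef

# T-learner: separate outcome models for treated and control units.
coef1 = fit_linear(x[a == 1], y[a == 1])
coef0 = fit_linear(x[a == 0], y[a == 0])
tau_t = predict_linear(coef1, x) - predict_linear(coef0, x)

# S-learner: one model on (x, a); here deliberately without an
# x-by-a interaction, so it can only express a constant effect.
coef_s = fit_linear(np.column_stack([x, a]), y)
tau_s = (predict_linear(coef_s, np.column_stack([x, np.ones(n)]))
         - predict_linear(coef_s, np.column_stack([x, np.zeros(n)])))

def rmse(est):
    return float(np.sqrt(np.mean((est - tau_true) ** 2)))

print(f"T-learner RMSE vs. true CATE: {rmse(tau_t):.3f}")
print(f"S-learner RMSE vs. true CATE: {rmse(tau_s):.3f}")
```

The S-learner here returns the same value for every unit because its single linear model has no interaction term, which illustrates why it suits roughly homogeneous effects. With a flexible base learner (for example gradient boosting) the S-learner can pick up heterogeneity, but it tends to shrink it toward zero.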

Tree-Based Methods

  • Causal Forest (Wager & Athey): Random forest adapted for CATE
  • BART (Bayesian Additive Regression Trees)
  • Causal MARS

Deep Learning

  • CEVAE (Causal Effect VAE)
  • TARNet (Treatment-Agnostic Representation Network)
  • DragonNet

Example

Medical scenario:

  • $Y$: reduction in blood pressure
  • $A$: whether the new drug is administered (0/1)
  • $X$: (age, sex, baseline blood pressure, BMI, …)

$$\tau(x) = E[\text{BP reduction} \mid \text{new drug}, X = x] - E[\text{BP reduction} \mid \text{placebo}, X = x]$$

Interpretation:

  • $\tau(x) > 0$: the new drug is effective for a patient with these characteristics
  • $\tau(x) < 0$: the new drug is harmful for a patient with these characteristics
  • $\tau(x) \approx 0$: no effect for a patient with these characteristics

Applications

Treatment Targeting (Policy Learning)

Learning the optimal treatment rule (assuming larger outcomes are better and treatment has no cost): $d^*(x) = \mathbf{1}[\tau(x) > 0]$

  • treat if $\tau(x) > 0$
  • don’t treat if $\tau(x) \leq 0$
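The gain from targeting can be illustrated with a short simulation. The sketch assumes the true CATE is known and equal to `x` on a uniform covariate, which is a made-up setup; with real data, `tau` would be replaced by an estimate and the resulting policy value would itself need to be estimated.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

x = rng.uniform(-1, 1, size=n)
tau = x                          # assumed true CATE: helpful only when x > 0

# Thresholding rule: treat exactly those units with a positive effect.
d_star = (tau > 0).astype(int)

# Expected gain per unit relative to treating no one.
gain_treat_none = 0.0
gain_treat_all = tau.mean()
gain_targeted = (d_star * tau).mean()

print(f"treat none: {gain_treat_none:.3f}")
print(f"treat all:  {gain_treat_all:.3f}")
print(f"targeted:   {gain_targeted:.3f}")
```

In this setup the ATE is close to zero, so treating everyone gains almost nothing, while the targeted rule captures the benefit from the half of the population that responds positively.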

Personalized Medicine

  • Tailored treatment based on patient characteristics
  • Minimize side effects & maximize efficacy

Precision Marketing

  • Estimating per-customer marketing effects
  • Personalized promotion targeting

Policy Evaluation

  • Analyzing policy effects by subgroup
  • Exploring heterogeneity

Evaluation Metrics

Evaluating CATE estimates is difficult because the true CATE is never observed, so there is no ground-truth label to score predictions against.

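When Ground Truth or an RCT Is Available

  • PEHE (Precision in Estimation of HTE): $\sqrt{E[(\hat{\tau}(X) - \tau(X))^2]}$. This requires the true $\tau(x)$, so in practice it is computed on simulated or semi-synthetic data; an RCT alone does not reveal unit-level effects.
  • ATE Error: $|\hat{\tau}_{ATE} - \tau_{ATE}|$, where $\tau_{ATE}$ comes from known ground truth or from an RCT estimate used as a benchmark.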
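Both metrics are short to compute once a ground-truth CATE is available. The sketch below uses a tiny hand-made example; the arrays are hypothetical and chosen so that the two metrics disagree.

```python
import numpy as np

def pehe(tau_hat, tau_true):
    """Root mean squared error between estimated and true CATE."""
    tau_hat, tau_true = np.asarray(tau_hat), np.asarray(tau_true)
    return float(np.sqrt(np.mean((tau_hat - tau_true) ** 2)))

def ate_error(tau_hat, tau_true):
    """Absolute difference between the implied ATE and the true ATE."""
    return float(abs(np.mean(tau_hat) - np.mean(tau_true)))

# Hypothetical true and estimated unit-level effects.
tau_true = np.array([1.0, 2.0, 3.0, 4.0])
tau_hat = np.array([2.0, 1.0, 4.0, 3.0])

print("PEHE:     ", pehe(tau_hat, tau_true))
print("ATE error:", ate_error(tau_hat, tau_true))
```

Here every unit-level estimate is off by 1, yet the errors cancel in the average, so the ATE error is zero while the PEHE is 1. A low ATE error therefore says little about how well heterogeneity is captured.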

Observational Data

  • AUUC (Area Under Uplift Curve): treatment targeting performance
  • Qini Coefficient: uplift modeling evaluation

Related Concepts

  • ATE - Average Treatment Effect (the average of CATE)
  • ATT - Average Treatment Effect on the Treated
  • Propensity Score - Treatment assignment probability, $P(A = 1 \mid X = x)$
  • DR-Learner - A doubly robust method for CATE estimation
  • Double-Debiased ML - Treatment effect estimation with high-dimensional or machine-learned nuisance functions
  • Causal Forest - Tree-based CATE estimation

Key Papers

  • Künzel et al. (2019) - Meta-learners (S, T, X-learner)
  • Nie & Wager (2020) - R-learner
  • Kennedy (2023) - DR-learner, optimal rates
  • Wager & Athey (2018) - Causal Forests
  • Chernozhukov et al. (2018) - DML for treatment effects

Implementation

Python (econml):

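from econml.dml import CausalForestDML
from econml.dr import DRLearner

# Notation: Y = outcome, T = binary treatment (A in the text),
# X = covariates the CATE is conditioned on, W = additional controls.
# Y, T, X, W and X_test are assumed to be existing NumPy arrays.

# Causal Forest (discrete_treatment=True because the treatment is binary)
cf = CausalForestDML(discrete_treatment=True)
cf.fit(Y, T, X=X, W=W)
cate_cf = cf.effect(X_test)  # estimated tau(x) for each row of X_test

# DR-Learner
dr = DRLearner()
dr.fit(Y, T, X=X, W=W)
cate_dr = dr.effect(X_test)  # estimated tau(x) for each row of X_test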

R (grf):

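library(grf)

# Note: grf calls the treatment vector W (A in the text); X is the covariate matrix.
cf <- causal_forest(X, Y, W)
tau_hat <- predict(cf)$predictions  # out-of-bag CATE estimates for the training rows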

References

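  • Künzel, Sekhon, Bickel & Yu (2019) - “Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning”
  • Nie & Wager (2020) - “Quasi-Oracle Estimation of Heterogeneous Treatment Effects”
  • Kennedy (2023) - “Towards Optimal Doubly Robust Estimation of Heterogeneous Causal Effects”
  • Chernozhukov et al. (2018) - “Double/Debiased Machine Learning for Treatment and Structural Parameters”
  • Wager & Athey (2018) - “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests”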
