ESCM² (Entire Space Counterfactual Multi-Task Model)

Definition

A model that integrates a counterfactual risk regularizer based on the Inverse Propensity Score (IPS) and the Doubly Robust Estimator into ESMM, in order to address ESMM’s two theoretical limitations — Inherent Estimation Bias (IEB) and Potential Independence Priority (PIP).

Final training objective:

$$\mathcal{L}_{\text{ESCM}^2} = \underbrace{\mathcal{L}_{\text{CTR}}}_{\text{Empirical Risk}} + \lambda_c \underbrace{\mathcal{L}_{\text{CVR}}}_{\text{Counterfactual Risk}} + \lambda_g \underbrace{\mathcal{L}_{\text{CTCVR}}}_{\text{Global Risk}}$$

Redefining the CVR estimand via the do-calculus:

$$P(r=1 \mid do(o=1))$$

→ removing the $X \to O$ dependency resolves selection bias and PIP simultaneously.

Intuitive Understanding

ESMM’s Two Problems

IEB (Inherent Estimation Bias): ESMM estimates CVR as the ratio $\hat{R} = \hat{C}/\hat{O}$ (pCTCVR divided by pCTR), but by Jensen’s inequality $E[\hat{C}/\hat{O}] \geq E[\hat{C}]/E[\hat{O}]$ → it always overestimates CVR.
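
The Jensen step can be checked numerically. The sketch below assumes a hypothetical pCTR estimate that is noisy but unbiased (uniform on [0.02, 0.08] around a true CTR of 0.05) and holds pCTCVR fixed at its true value; all numbers are illustrative and not taken from the paper.

```python
import random

random.seed(0)

TRUE_CTCVR = 0.005   # assumed true click-and-convert probability
# With an assumed true CTR of 0.05, the true CVR is 0.005 / 0.05 = 0.1.

# Hypothetical pCTR estimates: unbiased for 0.05 but noisy.
o_hat = [random.uniform(0.02, 0.08) for _ in range(100_000)]

mean_o_hat = sum(o_hat) / len(o_hat)

# Ratio of means: divide by the averaged denominator.
ratio_of_means = TRUE_CTCVR / mean_o_hat

# Mean of ratios: what dividing by a noisy denominator yields on average.
mean_of_ratios = sum(TRUE_CTCVR / o for o in o_hat) / len(o_hat)

print(f"E[C]/E[O] ~ {ratio_of_means:.4f}")
print(f"E[C/O]    ~ {mean_of_ratios:.4f}")
```

Because $1/x$ is convex on positive values, the mean of ratios exceeds the ratio of means for any non-constant sample. Under these assumed distributions the analytic value of $E[C/O]$ is about 0.1155, against a true CVR of 0.1.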

PIP (Potential Independence Priority): In ESMM’s causal graph, the click→conversion edge ($O \to R$) is missing, so the CVR tower risks learning $P(r=1)$ as if conversion were independent of whether a click occurred.

Causal Graph Comparison

(a) ESMM              (b) Naive            (c) ESCM²
 X                     X                    X
 ↓                    ╱ ╲                     ╲
 O    R               O → R              do(O) → R
  ╲  ╱                     ╲                     ╲
   C                        C                     C

O→R missing!         X→O: selection bias    do removes X→O +
                                            O→R retained

ESCM² achieves the structure of Figure 3(c) through the do-calculus: for samples with click=0, it answers the counterfactual question “what would the conversion probability have been if the user had clicked.”
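
To make the gap between the click-only CVR and the counterfactual CVR concrete, here is a small simulation. It assumes a hypothetical scalar feature `x` that raises both the click probability and the conversion probability; the coefficients are invented for illustration.

```python
import random

random.seed(1)

def simulate(n=200_000):
    """Synthetic exposure log where a feature x drives both clicks and conversions."""
    clicked_n = 0
    clicked_conv = 0
    potential_conv = 0
    for _ in range(n):
        x = random.random()                      # hypothetical feature in [0, 1)
        p_click = 0.05 + 0.4 * x                 # X -> O
        p_conv_if_click = 0.02 + 0.3 * x         # X -> R, given a click
        # Potential outcome: would this exposure convert if it were clicked?
        r_if_click = random.random() < p_conv_if_click
        potential_conv += r_if_click
        # In real logs the conversion label is only observed after a click.
        if random.random() < p_click:
            clicked_n += 1
            clicked_conv += r_if_click
    naive = clicked_conv / clicked_n             # CVR over the click space only
    counterfactual = potential_conv / n          # CVR under do(o=1), entire space
    return naive, counterfactual

naive_cvr, do_cvr = simulate()
print(f"naive CVR over clicked samples : {naive_cvr:.3f}")
print(f"CVR under do(o=1), all samples : {do_cvr:.3f}")
```

With these assumed coefficients the click-only average comes out near 0.21 while the entire-space counterfactual average is near 0.17: clicked exposures are skewed toward high `x`, which also converts more often.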

Two Variants

ESCM²-IPS

Uses the CTR tower output as the propensity score to inverse-probability-weight the CVR loss:

$$\mathcal{R}_{\text{IPS}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i} \cdot \delta(r_{u,i}, \hat{r}_{u,i})}{\hat{o}_{u,i}}$$

If CTR is accurate ($\hat{o} = o$, i.e. the predicted CTR matches the true click propensity), unbiased CVR estimation is guaranteed (Theorem 2)
Drawback: high variance (weights blow up at low CTR)
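
A minimal sketch of the IPS risk in plain Python, assuming log loss as the error function $\delta$. The names `log_loss` and `ips_risk` and the toy batch are hypothetical, chosen only for illustration.

```python
import math

def log_loss(label, pred, eps=1e-7):
    """Binary cross-entropy for one sample; stands in for delta(r, r_hat)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))

def ips_risk(clicks, conversions, cvr_preds, ctr_preds):
    """R_IPS: sum of o * delta / o_hat over the exposure space D, divided by |D|."""
    total = 0.0
    for o, r, r_hat, o_hat in zip(clicks, conversions, cvr_preds, ctr_preds):
        if o == 1:                       # non-clicked rows contribute zero
            total += log_loss(r, r_hat) / o_hat
    return total / len(clicks)           # normalize by |D|, not by the click count

# Toy batch of four exposures (made-up numbers).
clicks      = [1, 0, 1, 0]
conversions = [1, 0, 0, 0]
cvr_preds   = [0.5, 0.3, 0.5, 0.2]
ctr_preds   = [0.5, 0.1, 0.25, 0.2]

risk = ips_risk(clicks, conversions, cvr_preds, ctr_preds)
print(risk)
```

The second clicked row has a lower predicted CTR (0.25), so its error is weighted by 4 instead of 2. This is the mechanism behind the variance problem: the smaller $\hat{o}$ gets, the more a single sample dominates the sum.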

ESCM²-DR

Adds an imputation tower to IPS to reduce variance:

$$\mathcal{R}_{\text{DR}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ \hat{\delta}_{u,i} + \frac{o_{u,i} \cdot (\delta_{u,i} - \hat{\delta}_{u,i})}{\hat{o}_{u,i}} \right]$$

$\hat{\delta}_{u,i}$: the CVR error predicted by the imputation tower; $\delta_{u,i}$ abbreviates the actual error $\delta(r_{u,i}, \hat{r}_{u,i})$
Double robustness: unbiased as long as either the imputation or the propensity is accurate
Additional imputation loss: the imputation tower is trained via $\mathcal{R}_{\text{DR}}^{\text{imp}}$
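
A minimal sketch of the DR risk, assuming the per-sample errors are already computed. The function `dr_risk` and its argument names are hypothetical; errors for non-clicked rows are passed as `None` because they are unobserved.

```python
def dr_risk(clicks, errors, imputed_errors, ctr_preds):
    """R_DR: mean over D of delta_hat + o * (delta - delta_hat) / o_hat."""
    total = 0.0
    for o, delta, delta_hat, o_hat in zip(clicks, errors, imputed_errors, ctr_preds):
        # The correction term only exists where the true error is observed (o = 1).
        correction = (delta - delta_hat) / o_hat if o == 1 else 0.0
        total += delta_hat + correction
    return total / len(clicks)

clicks         = [1, 0, 1, 0]
errors         = [0.2, None, 0.8, None]   # unobserved for non-clicked rows
imputed_errors = [0.2, 0.5, 0.8, 0.1]     # imputation is exact on the clicked rows

# Two deliberately different (and arbitrary) propensity vectors.
risk_a = dr_risk(clicks, errors, imputed_errors, [0.9, 0.9, 0.9, 0.9])
risk_b = dr_risk(clicks, errors, imputed_errors, [0.1, 0.1, 0.1, 0.1])
print(risk_a, risk_b)
```

Because the imputed errors match the observed ones here, the correction term vanishes and both calls return the mean of `imputed_errors`, whatever the propensities are. The other half of double robustness (accurate propensities with a poor imputation) holds in expectation over clicks rather than on a single batch.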

Architecture

        Shared Embedding Lookup Table
        ┌─────────────┬─────────────┐
        ↓             ↓             ↓
    CTR Tower     Imp Tower     CVR Tower
        ↓             ↓             ↓
      pCTR         δ̂ (imp)        pCVR ──── × ──── pCTR
        │             │             │       ↓
        │             │             │    pCTCVR
        ↓             ↓             ↓       ↓
      L_CTR           L_CVR (IPS/DR)     L_CTCVR
      ───────── + ─────────────────── + ─────────
                    L_ESCM² (final)
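
The way the three losses combine can be sketched in plain Python for the IPS variant. This is an illustration under stated assumptions, not the authors' implementation: tower outputs are passed in as plain floats, log loss is used for every task, and the defaults `lambda_c=0.5` and `lambda_g=1.0` are placeholder values.

```python
import math

def bce(label, pred, eps=1e-7):
    """Binary cross-entropy for a single prediction."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))

def escm2_ips_loss(batch, lambda_c=0.5, lambda_g=1.0):
    """batch: list of (click, conversion, pctr, pcvr) tuples over all exposures."""
    n = len(batch)
    l_ctr = l_cvr = l_ctcvr = 0.0
    for o, r, pctr, pcvr in batch:
        pctcvr = pctr * pcvr               # product node in the diagram
        l_ctr += bce(o, pctr)              # empirical risk on clicks
        l_cvr += o * bce(r, pcvr) / pctr   # IPS-weighted CVR risk; pctr acts as a constant weight
        l_ctcvr += bce(o * r, pctcvr)      # global risk on click-and-convert
    return (l_ctr + lambda_c * l_cvr + lambda_g * l_ctcvr) / n

# One clicked-and-converted exposure with pCTR = pCVR = 0.5.
loss = escm2_ips_loss([(1, 1, 0.5, 0.5)], lambda_c=1.0, lambda_g=1.0)
print(loss)
```

In a real training loop the three towers would be neural networks over the shared embeddings, and the propensity in the CVR term would be excluded from backpropagation.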

Implementation Tips

Propensity clipping: prevent weight explosion when $\hat{o}_{u,i}$ is very small → clip from below at threshold 0.1
Gradient truncation: block the gradient of $\mathcal{L}_{\text{CVR}}$ from flowing into the CTR tower (the propensity) → protects CTR training
Setting $\lambda_c$: typically in the 0.1–1.0 range. Too large a value hinders CTR training → degrades CTCVR performance
Setting $\lambda_g$: 1.0 or above is recommended. Beneficial for both CVR and CTCVR
DR stabilization: train the imputation tower and the CVR tower in alternating steps
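
Propensity clipping can be sketched as follows; the helper name `clipped_ips_weight` is hypothetical, and the default floor of 0.1 is the threshold mentioned in the tips.

```python
def clipped_ips_weight(click, pctr, floor=0.1):
    """Inverse-propensity weight with the propensity clipped from below."""
    return click / max(pctr, floor)

# Weights for clicked samples at decreasing predicted CTR.
weights = [clipped_ips_weight(1, p) for p in (0.5, 0.1, 0.01, 0.001)]
print(weights)
```

Clipping caps every weight at `1 / floor`, which bounds the variance at the cost of some bias for samples whose predicted CTR falls below the floor. Gradient truncation is framework-specific and is usually done with a stop-gradient operator, for example `Tensor.detach()` in PyTorch or `jax.lax.stop_gradient` in JAX.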

Theoretical Guarantees

| Theorem | Statement | Meaning |
| --- | --- | --- |
| Theorem 1 | $\text{Bias}^{\text{ESMM}} > 0$ | ESMM structurally overestimates CVR |
| Theorem 2 | $\mathcal{R}_{\text{IPS}} = \mathcal{P}$ (when CTR is accurate) | IPS regularizer → unbiased CVR (resolves IEB) |
| Theorem 3 | $\hat{r}^{\text{IPS}} \to P(r \mid do(o=1))$ | IPS → convergence to counterfactual CVR (resolves PIP) |
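
A short sketch of why Theorem 2 holds. It assumes each click indicator $o_{u,i}$ is a Bernoulli draw with true propensity $p_{u,i} = P(o_{u,i}=1)$ (notation introduced here for illustration) and reads $\mathcal{P}$ as the average CVR error over all exposures:

$$
\mathbb{E}_{O}\left[\mathcal{R}_{\text{IPS}}\right]
= \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{\mathbb{E}[o_{u,i}] \cdot \delta_{u,i}}{\hat{o}_{u,i}}
= \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{p_{u,i}}{\hat{o}_{u,i}} \, \delta_{u,i}
\;\overset{\hat{o}_{u,i} = p_{u,i}}{=}\;
\frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \delta_{u,i}
= \mathcal{P}
$$

The equality is in expectation over the click indicators. When $\hat{o}_{u,i} \neq p_{u,i}$, each term is rescaled by $p_{u,i}/\hat{o}_{u,i}$, which is where the bias of an inaccurate CTR tower enters.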

Related Concepts

ESMM: the base model that ESCM² improves upon
Multi-Task Learning: the learning paradigm
Propensity Score: uses CTR as the propensity
Doubly Robust Estimator: the theoretical basis of the DR regularizer
Counterfactual Reasoning: redefines the CVR estimand via the do-calculus
Selection Bias: the core problem ESCM² addresses
IPW: the underlying methodology of the IPS regularizer

Key Papers

wangESCM$^2$EntireSpace2022: the original ESCM² paper (SIGIR 2022)
maEntireSpaceMultiTask2018: the original ESMM paper (SIGIR 2018)
