ESCM² (Entire Space Counterfactual Multi-Task Model)
Definition
A model that integrates a counterfactual risk regularizer based on the Inverse Propensity Score (IPS) and the Doubly Robust Estimator into ESMM, in order to address ESMM’s two theoretical limitations — Inherent Estimation Bias (IEB) and Potential Independence Priority (PIP).
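Final training objective:
$$\mathcal{L}_{\mathrm{ESCM^2}} = \mathcal{L}_{\mathrm{CTR}} + \lambda_c \, \mathcal{L}_{\mathrm{CVR}} + \lambda_g \, \mathcal{L}_{\mathrm{CTCVR}}$$
where $\mathcal{L}_{\mathrm{CVR}}$ is the IPS or DR counterfactual regularizer, and $\lambda_c$, $\lambda_g$ weight the counterfactual risk and the global (CTCVR) risk.
Redefining the CVR estimand via the do-calculus:
$$P(r = 1 \mid do(o = 1), x)$$
→ removing the $X \to O$ dependency resolves selection bias and PIP simultaneously.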
Intuitive Understanding
ESMM’s Two Problems
IEB (Inherent Estimation Bias): ESMM estimates CVR implicitly as the ratio pCTCVR / pCTR, but by Jensen's inequality the expected value of this estimate exceeds the true CVR → it always overestimates.
PIP (Potential Independence Priority): In ESMM’s causal graph, the click→conversion causal relationship ($O \to R$) is missing → the CVR tower risks learning $P(r = 1 \mid x)$, which is independent of whether a click occurred.
Causal Graph Comparison
[Figure 3: causal graphs over the nodes X, O, R, and C]
(a) ESMM: the O → R edge is missing.
(b) Naive: O → R is present, but X → O introduces selection bias.
(c) ESCM²: do(O) removes X → O while O → R is retained.
ESCM² achieves the structure of Figure 3(c) through the do-calculus: for samples with click=0, it answers the counterfactual question “what would the conversion probability have been if the user had clicked.”
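The gap between the click-space CVR and the counterfactual, entire-space CVR can be illustrated with a toy calculation. The sketch below is not taken from the paper: the two user segments, their click propensities, and their conversion probabilities are made-up numbers, and the function names are hypothetical. It compares the naive estimate (clicked samples only) with the entire-space target and with an inverse-propensity-weighted estimate, all computed in expectation.

```python
# Hypothetical population: two equally sized segments (made-up numbers).
# propensity = P(click | x); cvr = P(conversion | do(click), x).
segments = [
    {"propensity": 0.8, "cvr": 0.5},
    {"propensity": 0.2, "cvr": 0.1},
]


def entire_space_cvr(segs):
    """Average counterfactual CVR over every impression."""
    return sum(s["cvr"] for s in segs) / len(segs)


def click_space_cvr(segs):
    """Expected CVR measured only on clicked impressions (naive estimate)."""
    clicks = sum(s["propensity"] for s in segs)
    conversions = sum(s["propensity"] * s["cvr"] for s in segs)
    return conversions / clicks


def ips_cvr(segs):
    """Expected inverse-propensity-weighted CVR over every impression."""
    # A segment is clicked with probability `propensity`, and each clicked
    # impression is weighted by 1 / propensity, so the two factors cancel.
    weighted = sum(s["propensity"] * s["cvr"] / s["propensity"] for s in segs)
    return weighted / len(segs)


print(entire_space_cvr(segments), click_space_cvr(segments), ips_cvr(segments))
```

With these numbers the entire-space CVR is about 0.30, the click-space estimate is about 0.42 (inflated because the high-propensity segment also converts more), and the IPS-weighted estimate returns to about 0.30.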
Two Variants
ESCM²-IPS
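Uses the CTR tower output $\hat{o}_i$ as the propensity score to inverse-probability-weight the CVR loss:
$$\mathcal{L}_{\mathrm{IPS}} = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \frac{o_i \, e(r_i, \hat{r}_i)}{\hat{o}_i}$$
where $o_i$ is the click label, $r_i$ the conversion label, $\hat{r}_i$ the CVR tower output, and $e(\cdot, \cdot)$ the per-sample log loss.
- If CTR is accurate ($\hat{o}_i = P(o_i = 1 \mid x_i)$), unbiased CVR estimation is guaranteed (Theorem 2)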
- Drawback: high variance (weights blow up at low CTR)
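A minimal sketch of an IPS-weighted CVR loss in plain Python may help make the weighting and the variance problem concrete. It assumes binary cross-entropy as the per-sample error and averages over all impressions; the function names and the sample values are hypothetical and not from the paper.

```python
import math


def log_loss(label, pred, eps=1e-7):
    """Binary cross-entropy for one sample; eps guards against log(0)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


def ips_cvr_loss(clicks, conversions, p_ctr, p_cvr):
    """IPS-weighted CVR loss averaged over all impressions.

    Only clicked samples contribute; each is weighted by 1 / pCTR.
    """
    total = 0.0
    for o, r, ctr, cvr in zip(clicks, conversions, p_ctr, p_cvr):
        if o == 1:
            total += log_loss(r, cvr) / ctr
    return total / len(clicks)


# Illustrative batch of four impressions (made-up values).
clicks = [1, 0, 1, 0]
conversions = [1, 0, 0, 0]
p_cvr = [0.6, 0.3, 0.2, 0.1]

# Same batch, but in the second call the third sample has a tiny propensity.
loss_moderate = ips_cvr_loss(clicks, conversions, [0.5, 0.2, 0.5, 0.1], p_cvr)
loss_low_ctr = ips_cvr_loss(clicks, conversions, [0.5, 0.2, 0.01, 0.1], p_cvr)
print(loss_moderate, loss_low_ctr)
```

The second loss is more than ten times the first even though only one propensity changed, which is the weight blow-up at low CTR noted above.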
ESCM²-DR
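Adds an imputation tower to IPS to reduce variance:
$$\mathcal{L}_{\mathrm{DR}} = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \left[ \hat{\delta}_i + \frac{o_i \, (\delta_i - \hat{\delta}_i)}{\hat{o}_i} \right], \qquad \delta_i = e(r_i, \hat{r}_i)$$
with click label $o_i$, propensity $\hat{o}_i$ (pCTR), and $\delta_i$ the actual CVR error, observable only on clicked samples.
- $\hat{\delta}_i$: the CVR error predicted by the imputation tower
- Double robustness: unbiased as long as either the imputation or the propensity is accurate
- Additional imputation loss: the imputation tower is trained via $\mathcal{L}_{\mathrm{Imp}} = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \frac{o_i \, (\hat{\delta}_i - \delta_i)^2}{\hat{o}_i}$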
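The doubly robust regularizer and the imputation loss can be sketched in the same plain-Python style. This is an illustration under assumptions: it uses the common form from doubly robust learning (imputed error plus an IPS-weighted correction on clicked samples, and an IPS-weighted squared error for the imputation tower), binary cross-entropy as the per-sample error, and hypothetical function names and sample values.

```python
import math


def log_loss(label, pred, eps=1e-7):
    """Binary cross-entropy for one sample; eps guards against log(0)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


def dr_cvr_loss(clicks, conversions, p_ctr, p_cvr, imputed_err):
    """Doubly robust CVR loss averaged over all impressions."""
    total = 0.0
    for o, r, ctr, cvr, d_hat in zip(clicks, conversions, p_ctr, p_cvr, imputed_err):
        # The true error is observable only on clicked samples, so the
        # IPS-weighted correction is applied only there.
        correction = (log_loss(r, cvr) - d_hat) / ctr if o == 1 else 0.0
        total += d_hat + correction
    return total / len(clicks)


def imputation_loss(clicks, conversions, p_ctr, p_cvr, imputed_err):
    """IPS-weighted squared gap between imputed and observed CVR error."""
    total = 0.0
    for o, r, ctr, cvr, d_hat in zip(clicks, conversions, p_ctr, p_cvr, imputed_err):
        if o == 1:
            total += (d_hat - log_loss(r, cvr)) ** 2 / ctr
    return total / len(clicks)


# Illustrative batch (made-up values).
clicks = [1, 0, 1, 0]
conversions = [1, 0, 0, 0]
p_ctr = [0.5, 0.2, 0.5, 0.1]
p_cvr = [0.6, 0.3, 0.2, 0.1]

# With an all-zero imputation the DR loss reduces to the plain IPS loss.
dr_zero_imp = dr_cvr_loss(clicks, conversions, p_ctr, p_cvr, [0.0] * 4)

# With an imputation that is exact on clicked samples the correction vanishes.
exact = [log_loss(1, 0.6), 0.4, log_loss(0, 0.2), 0.1]
dr_exact_imp = dr_cvr_loss(clicks, conversions, p_ctr, p_cvr, exact)
imp_exact = imputation_loss(clicks, conversions, p_ctr, p_cvr, exact)
```

The two special cases show the two halves of double robustness: an uninformative imputation falls back to IPS, while an accurate imputation removes the dependence on the inverse-propensity term.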
Architecture
Shared Embedding Lookup Table
├─→ CTR Tower ─→ pCTR ─→ L_CTR
├─→ Imp Tower ─→ δ̂ (imp) ─→ L_CVR (IPS/DR)
└─→ CVR Tower ─→ pCVR ─→ L_CVR (IPS/DR)
pCVR × pCTR ─→ pCTCVR ─→ L_CTCVR
L_CTR + L_CVR (IPS/DR) + L_CTCVR ─→ L_ESCM² (final)
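How the tower outputs in the diagram combine into the final loss can be sketched as follows. This is an assumption-laden illustration: the towers themselves are omitted and their outputs are passed in as plain lists, the counterfactual CVR loss (IPS or DR) is supplied as a precomputed number, and the names `lambda_c` and `lambda_g` for the two loss weights, as well as their default values, are illustrative choices.

```python
import math


def mean_log_loss(labels, preds, eps=1e-7):
    """Mean binary cross-entropy over a batch; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def pctcvr(p_ctr, p_cvr):
    """Entire-space click-and-convert probability: pCTCVR = pCTR * pCVR."""
    return [c * v for c, v in zip(p_ctr, p_cvr)]


def escm2_loss(clicks, conversions, p_ctr, p_cvr, l_cvr,
               lambda_c=0.5, lambda_g=1.0):
    """Weighted sum of the CTR, counterfactual CVR, and CTCVR losses.

    l_cvr is the already computed IPS or DR regularizer value.
    """
    # The CTCVR label is 1 only when the impression was clicked and converted.
    ctcvr_labels = [o * r for o, r in zip(clicks, conversions)]
    l_ctr = mean_log_loss(clicks, p_ctr)
    l_ctcvr = mean_log_loss(ctcvr_labels, pctcvr(p_ctr, p_cvr))
    return l_ctr + lambda_c * l_cvr + lambda_g * l_ctcvr
```

Both the CTR and CTCVR terms are computed over every impression, while the selection-bias correction lives entirely inside `l_cvr`.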
Implementation Tips
- Propensity clipping: prevent weight explosion when $\hat{o}$ (pCTR) is very small → clip at threshold 0.1
- Gradient truncation: block the gradient from $\mathcal{L}_{\mathrm{CVR}}$ into the CTR tower (propensity) → protect CTR training
- Setting $\lambda_c$ (weight of $\mathcal{L}_{\mathrm{CVR}}$): typically in the 0.1–1.0 range. Too large hinders CTR training → degrades CTCVR performance
- Setting $\lambda_g$ (weight of $\mathcal{L}_{\mathrm{CTCVR}}$): 1.0 or above is recommended. Beneficial for both CVR and CTCVR
- DR stabilization: alternate training between the imputation tower and the CVR tower
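The propensity clipping tip can be sketched in a few lines. The sketch assumes that "clip at threshold 0.1" means bounding the propensity from below at 0.1, so that no inverse-propensity weight exceeds 10; the function names are hypothetical.

```python
def clip_propensity(p_ctr, floor=0.1):
    """Bound each propensity from below so 1 / pCTR never exceeds 1 / floor."""
    return [max(p, floor) for p in p_ctr]


def ips_weights(clicks, p_ctr, floor=0.1):
    """Per-sample IPS weights o / max(pCTR, floor); unclicked samples get 0."""
    clipped = clip_propensity(p_ctr, floor)
    return [o / p for o, p in zip(clicks, clipped)]


# The second sample has pCTR = 0.001: unclipped its weight would be 1000.
weights = ips_weights([1, 1, 0], [0.5, 0.001, 0.3])
print(weights)
```

In an autodiff framework, gradient truncation would additionally wrap the propensity in a stop-gradient before it enters the weight, for example `jax.lax.stop_gradient` in JAX or `Tensor.detach()` in PyTorch, so that the CVR loss does not update the CTR tower.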
Theoretical Guarantees
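| Theorem | Statement | Meaning |
|---|---|---|
| Theorem 1 | The expected CVR estimate of ESMM exceeds the true CVR | ESMM structurally overestimates CVR (IEB) |
| Theorem 2 | The IPS regularizer is unbiased for the entire-space CVR risk (when CTR is accurate) | IPS regularizer → unbiased CVR (resolves IEB) |
| Theorem 3 | Minimizing the IPS regularizer drives the CVR estimate toward $P(r = 1 \mid do(o = 1), x)$ | IPS → convergence to counterfactual CVR (resolves PIP) |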
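
The unbiasedness claim behind Theorem 2 follows from a short, standard IPS argument, sketched here in notation that is assumed rather than quoted from the paper: $o_i$ is the click indicator, $\hat{o}_i$ the predicted propensity (pCTR), $r_i$ the conversion outcome under a click, $\hat{r}_i$ the CVR prediction, and $e(\cdot, \cdot)$ the per-sample error. The expectation is taken over the click indicators.

```latex
\begin{aligned}
\mathbb{E}_{O}\!\left[\mathcal{L}_{\mathrm{IPS}}\right]
  &= \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}}
     \frac{\mathbb{E}[o_i] \, e(r_i, \hat{r}_i)}{\hat{o}_i} \\
  &= \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}}
     \frac{P(o_i = 1 \mid x_i)}{\hat{o}_i} \, e(r_i, \hat{r}_i) \\
  &= \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} e(r_i, \hat{r}_i)
     \qquad \text{when } \hat{o}_i = P(o_i = 1 \mid x_i).
\end{aligned}
```

The last line is the CVR risk averaged over the entire impression space, which is why an accurate CTR tower makes the IPS regularizer unbiased. When $\hat{o}_i$ is off, the ratio in the second line no longer cancels and the bias scales with the propensity error.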
Related Concepts
- ESMM — the base model that ESCM² improves upon
- Multi-Task Learning — the learning paradigm
- Propensity Score — uses CTR as the propensity
- Doubly Robust Estimator — the theoretical basis of the DR regularizer
- Counterfactual Reasoning — redefines the CVR estimand via the do-calculus
- Selection Bias — the core problem ESCM² addresses
- IPW — the underlying methodology of the IPS regularizer
Key Papers
- wangESCMEntireSpace2022 — the original ESCM² paper (SIGIR 2022)
- maEntireSpaceMultiTask2018 — the original ESMM paper (SIGIR 2018)