ESMM (Entire Space Multi-Task Model)
Definition
A multi-task model that addresses two problems of post-click conversion rate (CVR) estimation at once, Sample Selection Bias and Data Sparsity, by exploiting the sequential pattern of user behavior (impression → click → conversion) to learn CVR indirectly over the entire impression space.
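Core decomposition:
$$\underbrace{p(y=1, z=1 \mid x)}_{\text{pCTCVR}} = \underbrace{p(y=1 \mid x)}_{\text{pCTR}} \times \underbrace{p(z=1 \mid y=1, x)}_{\text{pCVR}}$$
where $x$ is the feature vector, $y$ the click label, and $z$ the conversion label.
Training objective:
$$L = L_{CTR} + L_{CTCVR}$$
- $L_{CTR}$: learns CTR over the entire impression space
- $L_{CTCVR}$: learns CTCVR over the entire impression space
- The CVR tower has no direct loss; it is learned indirectly through the product pCTR × pCVR that feeds the CTCVR loss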
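The objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: it assumes binary cross-entropy for both terms, averages over the batch (a scaling choice made here for readability), and uses the hypothetical names `bce` and `esmm_loss` with made-up batch values.

```python
import math

def bce(p, t, eps=1e-12):
    """Binary cross-entropy for one prediction p and one label t in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(t * math.log(p) + (1 - t) * math.log(1.0 - p))

def esmm_loss(p_ctr, p_cvr, clicks, conversions):
    """Mean L_CTR + L_CTCVR, computed over every impression in the batch."""
    n = len(clicks)
    loss_ctr = sum(bce(p, y) for p, y in zip(p_ctr, clicks)) / n
    # CTCVR label is (click AND conversion); the prediction is pCTR * pCVR.
    loss_ctcvr = sum(
        bce(pc * pv, y * z)
        for pc, pv, y, z in zip(p_ctr, p_cvr, clicks, conversions)
    ) / n
    return loss_ctr + loss_ctcvr

# Hypothetical batch of four impressions (illustrative values only).
p_ctr = [0.9, 0.8, 0.2, 0.1]
p_cvr = [0.7, 0.1, 0.3, 0.2]
clicks = [1, 1, 0, 0]
conversions = [1, 0, 0, 0]
print(esmm_loss(p_ctr, p_cvr, clicks, conversions))
```

Note that `p_cvr` never appears in a loss term of its own: it only enters through the product inside the CTCVR term, and it does so for clicked and unclicked impressions alike.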
Intuitive Understanding
Why learn over the entire space?
The problem with the naive CVR model:
Entire impression (D)   Clicked items (O)     Conversion (R)
┌──────────────┐        ┌──────────┐              ┌────┐
│ ■ ■ □ □ □    │        │ ■ ■ □    │     CVR      │ ■  │
│ □ □ □ □ □    │   →    │ ■ □ □    │   train →    │ ■  │
│ □ □ □ □ □    │        │          │      ↑       │    │
└──────────────┘        └──────────┘      │       └────┘
Inference space         Training      Selection
                        space           Bias!
- Training: CVR is learned only from click=1 samples
- Inference: CVR is predicted over all impressions
- → distribution mismatch (MNAR: Missing Not At Random)
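The mismatch can be made concrete with a small deterministic calculation. The sketch below assumes two hypothetical user segments with made-up rates; the function names are illustrative. It compares the average CVR seen in the click space (training) with the average over all impressions (inference).

```python
# Two hypothetical user segments; every rate here is an illustrative assumption.
segments = [
    # (impressions, true click rate, true conversion rate given a click)
    (1000, 0.5, 0.20),  # engaged users: click often, convert often
    (1000, 0.1, 0.02),  # casual users: click rarely, convert rarely
]

def mean_cvr_click_space(segments):
    """Average CVR weighted by expected clicks (what click-only training sees)."""
    clicks = sum(n * ctr for n, ctr, _ in segments)
    conversions = sum(n * ctr * cvr for n, ctr, cvr in segments)
    return conversions / clicks

def mean_cvr_entire_space(segments):
    """Average CVR weighted by impressions (what inference has to cover)."""
    impressions = sum(n for n, _, _ in segments)
    return sum(n * cvr for n, _, cvr in segments) / impressions

print(round(mean_cvr_click_space(segments), 3))   # 0.17
print(round(mean_cvr_entire_space(segments), 3))  # 0.11
```

Because users who click more also convert more in this toy setup, the click space over-represents high-CVR traffic, so a model fit only on clicked samples sees a different distribution from the one it is later asked to score.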
ESMM’s solution: instead of training the CVR tower directly, it exploits the relationship CTR × CVR = CTCVR to learn CVR indirectly over the entire impression space via the CTCVR loss.
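To see why the CTCVR loss trains the CVR tower on every impression, it may help to write out the per-sample gradient. The derivation below is a sketch that assumes a binary cross-entropy loss; the symbols $t_i$ (CTCVR label, equal to click × conversion) and $p_i$ (predicted CTCVR) are introduced here for illustration.

```latex
\ell_i = -\bigl[\, t_i \log p_i + (1 - t_i)\log(1 - p_i) \,\bigr],
\qquad p_i = \mathrm{pCTR}_i \cdot \mathrm{pCVR}_i

\frac{\partial \ell_i}{\partial\, \mathrm{pCVR}_i}
  = -\left[ \frac{t_i}{p_i} - \frac{1 - t_i}{1 - p_i} \right] \mathrm{pCTR}_i

% Unclicked (or clicked but unconverted) impression, t_i = 0:
\frac{\partial \ell_i}{\partial\, \mathrm{pCVR}_i}
  = \frac{\mathrm{pCTR}_i}{1 - \mathrm{pCTR}_i \,\mathrm{pCVR}_i} \neq 0
  \quad \text{whenever } \mathrm{pCTR}_i > 0
```

Under these assumptions the gradient with respect to pCVR is nonzero even when the impression was never clicked, which is the sense in which the CVR tower is learned over the entire impression space rather than only the click space.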
Architecture
           Raw Features (User, Item)
                       ↓
    ┌─── Shared Embedding Lookup Table ───┐
    │                                     │
    ↓                                     ↓
CTR Tower                             CVR Tower
    ↓                                     ↓
  pCTR ─────────────── × ─────────────── pCVR
                       ↓
                    pCTCVR
                       ↓
          L_CTR + L_CTCVR (training)
- Shared Embedding: transfers CTR’s abundant data (click labels) to the CVR tower → mitigates data sparsity
- Multiplicative structure: the CVR tower is learned over the entire space → bypasses selection bias
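The architecture can be sketched as a toy forward pass. The code below is an assumption-laden illustration, not the paper's implementation: the class name `TinyESMM`, the separate user and item tables, the random initialization, and the single linear layer per tower are all simplifications chosen here (real towers would typically be MLPs).

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

class TinyESMM:
    """Toy forward pass: shared embedding tables feeding two linear towers."""

    def __init__(self, n_users, n_items, dim, seed=0):
        rng = random.Random(seed)

        def rand_vec(size):
            return [rng.uniform(-0.1, 0.1) for _ in range(size)]

        # Shared embedding lookup tables: both towers read the same vectors.
        self.user_emb = [rand_vec(dim) for _ in range(n_users)]
        self.item_emb = [rand_vec(dim) for _ in range(n_items)]
        # One linear layer per tower (a stand-in for a deeper network).
        self.w_ctr = rand_vec(2 * dim)
        self.w_cvr = rand_vec(2 * dim)

    def forward(self, user_id, item_id):
        # Concatenate the shared user and item embeddings.
        x = self.user_emb[user_id] + self.item_emb[item_id]
        p_ctr = sigmoid(sum(w * v for w, v in zip(self.w_ctr, x)))
        p_cvr = sigmoid(sum(w * v for w, v in zip(self.w_cvr, x)))
        # pCTCVR is not a separate head; it is the product of the two towers.
        return p_ctr, p_cvr, p_ctr * p_cvr

model = TinyESMM(n_users=10, n_items=10, dim=4)
p_ctr, p_cvr, p_ctcvr = model.forward(user_id=3, item_id=7)
print(p_ctr, p_cvr, p_ctcvr)
```

At serving time only `p_cvr` (or `p_ctcvr`, depending on the ranking target) would be read out; during training the losses attach to `p_ctr` and `p_ctcvr`, so the embeddings receive gradient from both tasks.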
Problems Addressed
| Problem | Mechanism |
|---|---|
| Sample Selection Bias | CTCVR is learned over the entire impression space instead of the click space |
| Data Sparsity | Shared embedding transfers CTR → CVR information |
Limitations
1. Inherent Estimation Bias (IEB)
ESMM’s CVR estimate is structurally always higher than the ground truth.
Cause: the gap follows from Jensen’s inequality, and the condition under which equality would hold is unrealistic in practice.
2. Potential Independence Priority (PIP)
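In ESMM’s causal graph the edge from click (O) to conversion (R) is missing → the CVR tower risks learning a conversion probability that treats conversion as independent of the click, ignoring the causal effect of the click.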
Both limitations are resolved in ESCM² via a counterfactual risk regularizer.
Related Concepts
- Multi-Task Learning: ESMM’s learning paradigm
- Selection Bias: the core problem ESMM addresses
- ESCM²: the successor model that resolves ESMM’s IEB/PIP limitations
- Propensity Score: in ESCM², the CTR estimate serves as the propensity score
Key Papers
- maEntireSpaceMultiTask2018 — the original ESMM paper (SIGIR 2018)
- wangESCMEntireSpace2022 — proof of the IEB/PIP limitations and the ESCM² proposal