ESMM (Entire Space Multi-Task Model) · Tae Hyun Kim (Lowell)

Definition

A multi-task model for post-click conversion rate (CVR) estimation that addresses CVR's Sample Selection Bias and Data Sparsity problems simultaneously. It exploits the sequential user behavior $\text{impression} \to \text{click} \to \text{conversion}$ to learn CVR indirectly over the entire impression space.

Core decomposition:

$$\underbrace{P(o=1, r=1)}_{\text{CTCVR}} = \underbrace{P(o=1)}_{\text{CTR}} \times \underbrace{P(r=1 \mid o=1)}_{\text{CVR}}$$
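
As a quick sanity check of the decomposition, the sketch below computes the three rates from raw funnel counts. The helper name `funnel_rates` and the counts are made up for illustration; they do not come from the ESMM paper.

```python
def funnel_rates(impressions: int, clicks: int, conversions: int) -> dict:
    """Empirical CTR, CVR and CTCVR from funnel counts (hypothetical helper)."""
    if impressions <= 0 or clicks <= 0:
        raise ValueError("need at least one impression and one click")
    ctr = clicks / impressions         # P(o=1)
    cvr = conversions / clicks         # P(r=1 | o=1)
    ctcvr = conversions / impressions  # P(o=1, r=1)
    return {"ctr": ctr, "cvr": cvr, "ctcvr": ctcvr}


# Illustrative counts: 1000 impressions, 50 clicks, 5 conversions.
rates = funnel_rates(impressions=1000, clicks=50, conversions=5)

# CTR x CVR recovers CTCVR up to floating-point rounding.
assert abs(rates["ctr"] * rates["cvr"] - rates["ctcvr"]) < 1e-12
```

Note that CTR and CTCVR are both defined per impression, while CVR is defined per click. That difference in denominators is what the rest of this note is about.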

Training objective:

$$\mathcal{L}_{\text{ESMM}} = \mathcal{L}_{\text{CTR}} + \mathcal{L}_{\text{CTCVR}}$$

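$\mathcal{L}_{\text{CTR}} = E_{(u,i) \in \mathcal{D}}[\delta(o_{u,i}, \hat{o}_{u,i})]$: learns CTR over the entire impression space $\mathcal{D}$
$\mathcal{L}_{\text{CTCVR}} = E_{(u,i) \in \mathcal{D}}[\delta(o_{u,i} \cdot r_{u,i},\ \hat{o}_{u,i} \cdot \hat{r}_{u,i})]$: learns CTCVR over the entire impression space $\mathcal{D}$
Here $\delta$ is the per-sample loss (cross-entropy in the original paper), $o_{u,i}$ and $r_{u,i}$ are the click and conversion labels, and the hatted symbols are the corresponding predictions.
The CVR tower has no direct loss; it is learned indirectly through the CTCVR multiplication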
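
The objective can be sketched in plain Python as follows. This is a minimal illustration, assuming $\delta$ is binary cross-entropy. The function names `bce` and `esmm_loss` and the toy batch are hypothetical, and no gradient step is shown.

```python
import math


def bce(label: float, pred: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one sample (assumed choice of delta)."""
    p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def esmm_loss(clicks, conversions, p_ctr, p_cvr) -> float:
    """Mean L_CTR + L_CTCVR over all impressions (equal-length lists)."""
    n = len(clicks)
    loss_ctr = sum(bce(o, po) for o, po in zip(clicks, p_ctr)) / n
    # CTCVR label is o * r, CTCVR prediction is pCTR * pCVR.
    # Non-clicked impressions contribute too, with label 0.
    loss_ctcvr = sum(
        bce(o * r, po * pr)
        for o, r, po, pr in zip(clicks, conversions, p_ctr, p_cvr)
    ) / n
    return loss_ctr + loss_ctcvr


# Toy batch with made-up numbers: four impressions, two clicks, one conversion.
clicks = [1, 1, 0, 0]
conversions = [1, 0, 0, 0]
p_ctr = [0.8, 0.6, 0.1, 0.2]
p_cvr = [0.7, 0.2, 0.3, 0.1]
total = esmm_loss(clicks, conversions, p_ctr, p_cvr)
```

`p_cvr` never appears in a loss term on its own. It only enters through the product `po * pr`, which is what "no direct loss" means for the CVR tower.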

Intuitive Understanding

Why learn over the entire space?

The problem with the naive CVR model:

Entire impression (D)     Clicked items (O)       Conversion (R)
 ┌──────────────┐     ┌──────────┐        ┌────┐
 │  ■ ■ □ □ □   │     │  ■ ■ □   │  CVR   │ ■  │
 │  □ □ □ □ □   │ →   │  ■ □ □   │ train →│ ■  │
 │  □ □ □ □ □   │     │          │   ↑     │    │
 └──────────────┘     └──────────┘   │     └────┘
   Inference space      Training     Selection
                        space        Bias!

Training: CVR is learned only from click=1 samples
Inference: CVR is predicted over all impressions
→ distribution mismatch (MNAR: Missing Not At Random)
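
The mismatch can be made concrete with a small deterministic example. The sketch assumes a hypothetical population in which one feature `x` raises both the click and the conversion probability. The linear curves are invented for illustration only.

```python
# Hypothetical population: one feature x drives both click and conversion.
xs = [i / 100 for i in range(1, 100)]      # feature values 0.01 .. 0.99
p_click = [0.02 + 0.3 * x for x in xs]      # assumed CTR curve P(o=1 | x)
p_conv = [0.01 + 0.2 * x for x in xs]       # assumed CVR curve P(r=1 | o=1, x)

# Average CVR over the entire impression space (what inference needs).
cvr_entire = sum(p_conv) / len(xs)

# Average CVR over the click space: each impression is weighted by its
# click probability, which is what a click-only training set reflects.
cvr_click_space = sum(c * v for c, v in zip(p_click, p_conv)) / sum(p_click)

# Clicked samples over-represent high-x impressions, so the click-space
# average is higher than the entire-space average.
assert cvr_click_space > cvr_entire
```

Under these assumed curves, a model fit only to clicked samples is calibrated to a population that differs from the one it is later asked to score.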

ESMM’s solution: instead of training the CVR tower directly, it exploits the relationship CTR × CVR = CTCVR to learn CVR indirectly over the entire impression space via the CTCVR loss.

Architecture

Raw Features (User, Item)
         ↓
┌─── Shared Embedding Lookup Table ───┐
│                                      │
↓                                      ↓
CTR Tower                         CVR Tower
   ↓                                  ↓
  pCTR ──────────── × ──────────── pCVR
                     ↓
                  pCTCVR
                     ↓
            L_CTR + L_CTCVR (training)

Shared Embedding: transfers CTR’s abundant data (click labels) to the CVR tower → mitigates data sparsity
Multiplicative structure: the CVR tower is learned over the entire space → bypasses selection bias
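
A forward pass through this structure might look like the sketch below. It is an illustration only: `TinyESMM` is a hypothetical class, each tower is reduced to a single linear layer with a sigmoid, and the weights are random rather than trained.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def rand_vec(rng: random.Random, dim: int) -> list:
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


class TinyESMM:
    """Forward pass only: shared embeddings feed two separate towers."""

    def __init__(self, n_users: int, n_items: int, dim: int = 4, seed: int = 0):
        rng = random.Random(seed)
        # Shared embedding lookup tables, read by both towers.
        self.user_emb = [rand_vec(rng, dim) for _ in range(n_users)]
        self.item_emb = [rand_vec(rng, dim) for _ in range(n_items)]
        # One linear layer per tower, kept minimal for brevity.
        self.w_ctr = rand_vec(rng, 2 * dim)
        self.w_cvr = rand_vec(rng, 2 * dim)

    def forward(self, user: int, item: int):
        x = self.user_emb[user] + self.item_emb[item]  # list concatenation
        p_ctr = sigmoid(sum(w * v for w, v in zip(self.w_ctr, x)))
        p_cvr = sigmoid(sum(w * v for w, v in zip(self.w_cvr, x)))
        return p_ctr, p_cvr, p_ctr * p_cvr  # pCTCVR is the product


model = TinyESMM(n_users=3, n_items=5)
p_ctr, p_cvr, p_ctcvr = model.forward(user=1, item=2)
```

Because both towers read the same embedding tables, gradients from the abundant click labels would also update the representations that the CVR tower consumes.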

Problems Addressed

Problem	Mechanism
Sample Selection Bias	CTCVR is learned over the entire impression space instead of the click space
Data Sparsity	Shared embedding transfers CTR → CVR information

Limitations

1. Inherent Estimation Bias (IEB)

In expectation, ESMM's CVR estimate is structurally higher than the ground truth:

$$\text{Bias}^{\text{ESMM}} := E_\mathcal{D}[\hat{R}] - E_\mathcal{D}[R] > 0$$

Cause: ESMM recovers CVR as the ratio $\hat{R} = \hat{C}/\hat{O}$, where $\hat{C}$ is the predicted CTCVR and $\hat{O}$ the predicted CTR. By Jensen's inequality, $E[\hat{C}/\hat{O}] \geq E[\hat{C}]/E[\hat{O}]$, and the equality condition ($\text{Var}(\hat{O})=0$) is unrealistic.
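
The Jensen step can be checked numerically. The sketch below is an illustration, not the full proof: it holds the CTCVR prediction fixed (an assumption) so that only the convexity of $1/x$ is at play, and the numbers are invented.

```python
from statistics import fmean

c_hat = 0.01                   # CTCVR prediction, held fixed (assumption)
o_hat = [0.05, 0.10, 0.20]     # CTR predictions that vary across impressions

mean_of_ratios = fmean(c_hat / o for o in o_hat)  # E[C/O]
ratio_of_means = c_hat / fmean(o_hat)             # E[C] / E[O]

# Convexity of 1/x makes the mean of ratios the larger quantity.
assert mean_of_ratios > ratio_of_means

# With zero variance in the CTR predictions the gap closes.
o_const = [0.10, 0.10, 0.10]
gap_when_constant = fmean(c_hat / o for o in o_const) - c_hat / fmean(o_const)
assert abs(gap_when_constant) < 1e-12
```

The gap grows as the CTR predictions spread out, which is why the equality condition $\text{Var}(\hat{O})=0$ matters.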

2. Potential Independence Priority (PIP)

In ESMM's causal graph the $O \to R$ edge is missing, so the CVR tower risks learning the marginal $P(r=1)$ instead of $P(r=1 \mid o=1)$, ignoring the causal effect of the click.

Both limitations are resolved in ESCM² via a counterfactual risk regularizer.

Related Concepts

Multi-Task Learning: ESMM's learning paradigm
Selection Bias: the core problem ESMM addresses
ESCM²: the successor model that resolves ESMM's IEB/PIP limitations
Propensity Score: in ESCM², the CTR serves as the propensity score

Key Papers

maEntireSpaceMultiTask2018: the original ESMM paper (SIGIR 2018)
wangESCM$^2$EntireSpace2022: proof of the IEB/PIP limitations and the ESCM² proposal
