Tae Hyun Kim (Lowell)

ESMM (Entire Space Multi-Task Model)

3 min read #recsys #representation-learning

Definition

A multi-task model that addresses CVR’s Sample Selection Bias and Data Sparsity problems simultaneously by exploiting the sequential user behavior $\text{impression} \to \text{click} \to \text{conversion}$ to learn CVR indirectly over the entire impression space.

Core decomposition:

$$\underbrace{P(o=1, r=1)}_{\text{CTCVR}} = \underbrace{P(o=1)}_{\text{CTR}} \times \underbrace{P(r=1 \mid o=1)}_{\text{CVR}}$$

Training objective:

$$\mathcal{L}_{\text{ESMM}} = \mathcal{L}_{\text{CTR}} + \mathcal{L}_{\text{CTCVR}}$$
  • $\mathcal{L}_{\text{CTR}} = E_{(u,i) \in \mathcal{D}}[\delta(o_{u,i}, \hat{o}_{u,i})]$: learns CTR over the entire impression space
  • $\mathcal{L}_{\text{CTCVR}} = E_{(u,i) \in \mathcal{D}}[\delta(o_{u,i} \cdot r_{u,i},\ \hat{o}_{u,i} \cdot \hat{r}_{u,i})]$: learns CTCVR over the entire impression space
  • The CVR tower has no direct loss; it is learned indirectly through the CTCVR multiplication
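
As a minimal sketch of the objective above, the two losses can be averaged over every impression, clicked or not. This assumes $\delta$ is binary cross-entropy (the formula leaves $\delta$ generic); the names `bce` and `esmm_loss` and the toy predictions are hypothetical.

```python
import math


def bce(label: float, pred: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one sample; an assumed stand-in for delta."""
    p = min(max(pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def esmm_loss(clicks, conversions, p_ctr, p_cvr):
    """Return (L_CTR + L_CTCVR, L_CTR, L_CTCVR), averaged over all impressions."""
    n = len(clicks)
    loss_ctr = sum(bce(o, po) for o, po in zip(clicks, p_ctr)) / n
    # The CTCVR label is o * r and the CTCVR prediction is pCTR * pCVR.
    loss_ctcvr = sum(
        bce(o * r, po * pr)
        for o, r, po, pr in zip(clicks, conversions, p_ctr, p_cvr)
    ) / n
    return loss_ctr + loss_ctcvr, loss_ctr, loss_ctcvr


# Toy batch of four impressions: the first two were clicked,
# and only the first one converted.
clicks = [1, 1, 0, 0]
conversions = [1, 0, 0, 0]
p_ctr = [0.8, 0.6, 0.2, 0.1]
p_cvr = [0.7, 0.2, 0.3, 0.1]
total, loss_ctr, loss_ctcvr = esmm_loss(clicks, conversions, p_ctr, p_cvr)
```

Changing `p_cvr` for an unclicked impression still changes `loss_ctcvr`, which is how the CVR tower receives a training signal from the whole impression space even though it has no loss of its own.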

Intuitive Understanding

Why learn over the entire space?

The problem with the naive CVR model:

Entire impression (D)     Clicked items (O)       Conversion (R)
 ┌──────────────┐     ┌──────────┐        ┌────┐
 │  ■ ■ □ □ □   │     │  ■ ■ □   │  CVR   │ ■  │
 │  □ □ □ □ □   │ →   │  ■ □ □   │ train →│ ■  │
 │  □ □ □ □ □   │     │          │   ↑     │    │
 └──────────────┘     └──────────┘   │     └────┘
   Inference space      Training     Selection
                        space        Bias!
  • Training: CVR is learned only from click=1 samples
  • Inference: CVR is predicted over all impressions
  • → distribution mismatch (MNAR: Missing Not At Random)
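
The mismatch can be illustrated with a small simulation. This is a sketch under made-up assumptions: each impression has a hypothetical latent `interest` that raises both its click probability and its conversion probability, so clicked impressions are not a random sample of all impressions.

```python
import random

random.seed(0)

n = 100_000
true_cvr_all = []      # true conversion probability for every impression
true_cvr_clicked = []  # the same quantity, restricted to clicked impressions

for _ in range(n):
    interest = random.random()        # hypothetical latent factor in [0, 1)
    p_click = 0.02 + 0.3 * interest   # illustrative click probability
    p_conv = 0.01 + 0.2 * interest    # illustrative conversion probability
    true_cvr_all.append(p_conv)
    if random.random() < p_click:     # the impression is observed as a click
        true_cvr_clicked.append(p_conv)

mean_all = sum(true_cvr_all) / len(true_cvr_all)
mean_clicked = sum(true_cvr_clicked) / len(true_cvr_clicked)
```

Under these assumptions `mean_clicked` comes out noticeably higher than `mean_all` (roughly 0.14 versus 0.11), so a model fitted only on the click space sees a different CVR distribution from the one it is asked to predict at inference time.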

ESMM’s solution: instead of training the CVR tower directly, it exploits the relationship CTR × CVR = CTCVR to learn CVR indirectly over the entire impression space via the CTCVR loss.

Architecture

Raw Features (User, Item)

┌─── Shared Embedding Lookup Table ───┐
│                                      │
↓                                      ↓
CTR Tower                         CVR Tower
   ↓                                  ↓
  pCTR ──────────── × ──────────── pCVR

                  pCTCVR

            L_CTR + L_CTCVR (training)
  • Shared Embedding: transfers CTR’s abundant data (click labels) to the CVR tower → mitigates data sparsity
  • Multiplicative structure: the CVR tower is learned over the entire space → bypasses selection bias
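
The forward pass of this architecture can be sketched as follows. It is a deliberately reduced illustration, not the paper's network: each tower is assumed to be a single logistic layer, and the class name `ToyESMM`, the embedding size, and the initialisation ranges are all hypothetical.

```python
import math
import random

random.seed(0)
EMB_DIM = 4  # assumed embedding size, for illustration only


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def random_vector(dim: int, scale: float) -> list:
    return [random.uniform(-scale, scale) for _ in range(dim)]


class ToyESMM:
    """Two logistic towers on top of one shared embedding table."""

    def __init__(self, num_users: int, num_items: int):
        # Shared embedding lookup tables: both towers read the same vectors.
        self.user_emb = [random_vector(EMB_DIM, 0.1) for _ in range(num_users)]
        self.item_emb = [random_vector(EMB_DIM, 0.1) for _ in range(num_items)]
        # Each tower is reduced to one weight vector for brevity.
        self.w_ctr = random_vector(2 * EMB_DIM, 0.5)
        self.w_cvr = random_vector(2 * EMB_DIM, 0.5)

    def forward(self, user: int, item: int):
        # Concatenate the user and item embeddings (list concatenation).
        x = self.user_emb[user] + self.item_emb[item]
        p_ctr = sigmoid(sum(w * v for w, v in zip(self.w_ctr, x)))
        p_cvr = sigmoid(sum(w * v for w, v in zip(self.w_cvr, x)))
        return p_ctr, p_cvr, p_ctr * p_cvr  # pCTR, pCVR, pCTCVR


model = ToyESMM(num_users=3, num_items=5)
p_ctr, p_cvr, p_ctcvr = model.forward(user=1, item=2)
```

Because both towers read from `user_emb` and `item_emb`, gradients from the abundant click labels would update the same embeddings that the CVR tower consumes, which is the sharing the first bullet describes.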

Problems Addressed

| Problem | Mechanism |
| --- | --- |
| Sample Selection Bias | CTCVR is learned over the entire impression space instead of the click space |
| Data Sparsity | Shared embedding transfers CTR → CVR information |

Limitations

1. Inherent Estimation Bias (IEB)

ESMM’s CVR estimate is structurally always higher than the ground truth:

$$\text{Bias}^{\text{ESMM}} := E_\mathcal{D}[\hat{R}] - E_\mathcal{D}[R] > 0$$

Cause: since $\hat{R} = \hat{C}/\hat{O}$ (with $\hat{C}$ the predicted CTCVR and $\hat{O}$ the predicted CTR), Jensen’s inequality gives $E[\hat{C}/\hat{O}] \geq E[\hat{C}]/E[\hat{O}]$, and the equality condition ($\text{Var}(\hat{O})=0$) is unrealistic.
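
A two-point numeric check makes the inequality concrete. The numbers are made up, and the predicted CTCVR is held constant across the two impressions purely to isolate the effect of variance in the predicted CTR; this is an illustration, not a proof.

```python
def mean(xs):
    return sum(xs) / len(xs)


# Two equally likely impressions with the same predicted CTCVR
# but different predicted CTR (illustrative values only).
c_hat = [0.05, 0.05]
o_hat = [0.10, 0.50]

mean_of_ratios = mean([c / o for c, o in zip(c_hat, o_hat)])  # E[C/O]
ratio_of_means = mean(c_hat) / mean(o_hat)                    # E[C]/E[O]

# With zero variance in the predicted CTR the two quantities coincide.
o_const = [0.30, 0.30]
mean_of_ratios_const = mean([c / o for c, o in zip(c_hat, o_const)])
ratio_of_means_const = mean(c_hat) / mean(o_const)
```

Here `mean_of_ratios` exceeds `ratio_of_means`, while the constant-CTR case gives equal values, matching the stated equality condition.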

2. Potential Independence Priority (PIP)

In ESMM’s causal graph the $O \to R$ edge is missing → the CVR tower risks learning $P(r=1)$, ignoring the causal effect of the click.

Both limitations are resolved in ESCM² via a counterfactual risk regularizer.
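
One form such a regularizer can take is an inverse-propensity-scored (IPS) CVR loss, in which each clicked sample is reweighted by the inverse of its predicted CTR. The sketch below is written under assumptions and is not the paper's reference implementation: $\delta$ is taken to be binary cross-entropy, and the function names and the propensity floor `min_propensity` are hypothetical additions.

```python
import math


def bce(label: float, pred: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one sample; an assumed stand-in for delta."""
    p = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def ips_cvr_loss(clicks, conversions, p_ctr, p_cvr, min_propensity=1e-3):
    """IPS-weighted CVR loss, normalised by the number of impressions."""
    total = 0.0
    for o, r, po, pr in zip(clicks, conversions, p_ctr, p_cvr):
        if o == 1:
            # Conversion labels exist only for clicks; reweight each one by
            # the inverse predicted CTR (floored here as a safety assumption).
            total += bce(r, pr) / max(po, min_propensity)
    return total / len(clicks)


clicks = [1, 1, 0, 0]
conversions = [1, 0, 0, 0]
p_ctr = [0.8, 0.6, 0.2, 0.1]
p_cvr = [0.7, 0.2, 0.3, 0.1]
loss = ips_cvr_loss(clicks, conversions, p_ctr, p_cvr)
```

Clicked impressions with a low predicted CTR get a larger weight, which compensates for their under-representation in the click space.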

  • Multi-Task Learning: ESMM’s learning paradigm
  • Selection Bias: the core problem ESMM addresses
  • ESCM²: the successor model that resolves ESMM’s IEB/PIP limitations
  • Propensity Score: in ESCM², the predicted CTR serves as the propensity score

Key Papers

  • maEntireSpaceMultiTask2018: the original ESMM paper (SIGIR 2018)
  • wangESCM²EntireSpace2022: proof of the IEB/PIP limitations and the ESCM² proposal
