Tae Hyun Kim (Lowell)

ESCM² (Entire Space Counterfactual Multi-Task Model)

Definition

A model that integrates a counterfactual risk regularizer based on the Inverse Propensity Score (IPS) and the Doubly Robust Estimator into ESMM, in order to address ESMM’s two theoretical limitations — Inherent Estimation Bias (IEB) and Potential Independence Priority (PIP).

Final training objective:

$$\mathcal{L}_{\text{ESCM}^2} = \underbrace{\mathcal{L}_{\text{CTR}}}_{\text{Empirical Risk}} + \lambda_c \underbrace{\mathcal{L}_{\text{CVR}}}_{\text{Counterfactual Risk}} + \lambda_g \underbrace{\mathcal{L}_{\text{CTCVR}}}_{\text{Global Risk}}$$
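
As a minimal sketch of how the three terms combine, the snippet below adds three already-computed scalar losses with the two weights. The function name `escm2_objective`, the default weights, and the example loss values are assumptions made for illustration; they do not come from the paper.

```python
def escm2_objective(l_ctr, l_cvr, l_ctcvr, lambda_c=0.5, lambda_g=1.0):
    """Weighted sum of empirical (CTR), counterfactual (CVR) and global (CTCVR) risk."""
    return l_ctr + lambda_c * l_cvr + lambda_g * l_ctcvr


# Hypothetical per-batch loss values, chosen only to show the arithmetic.
total = escm2_objective(l_ctr=0.40, l_cvr=0.20, l_ctcvr=0.10)
print(total)
```

Setting `lambda_c=0` removes the counterfactual regularizer from the sum, leaving only the CTR and CTCVR terms.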

Redefining the CVR estimand via the do-calculus:

$$P(r=1 \mid do(o=1))$$

→ Removing the $X \to O$ dependency resolves selection bias and PIP simultaneously.

Intuitive Understanding

ESMM’s Two Problems

IEB (Inherent Estimation Bias): ESMM estimates CVR as $\hat{R} = \hat{C}/\hat{O}$, but by Jensen’s inequality $E[\hat{C}/\hat{O}] \geq E[\hat{C}]/E[\hat{O}]$ → it always overestimates.

PIP (Potential Independence Priority): In ESMM’s causal graph, the click→conversion causal relationship ($O \to R$) is missing → the CVR tower risks learning $P(r=1)$, which is independent of whether a click occurred.
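
The gap between the mean of ratios and the ratio of means can be checked with a toy calculation. The numbers below are hypothetical: the CTCVR estimate is held fixed while the CTR estimate fluctuates, so the two are independent. This is the simplest case in which the inequality holds, not a proof of the general claim.

```python
from statistics import mean

# Hypothetical noisy estimates for one impression: the CTCVR estimate is
# constant, the CTR estimate varies around its mean of 0.3.
c_hat = [0.05, 0.05]
o_hat = [0.1, 0.5]

# What a ratio-based CVR estimate averages to.
mean_of_ratios = mean(c / o for c, o in zip(c_hat, o_hat))

# What it would be if the ratio were taken after averaging.
ratio_of_means = mean(c_hat) / mean(o_hat)

print(mean_of_ratios, ratio_of_means)
```

Because $1/\hat{O}$ is convex, small CTR estimates inflate the ratio more than large ones deflate it, which is the source of the upward bias.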

Causal Graph Comparison

(a) ESMM              (b) Naive            (c) ESCM²
 X                     X                    X
 ↓                    ╱ ╲                     ╲
 O    R               O → R              do(O) → R
  ╲  ╱                     ╲                     ╲
   C                        C                     C

O→R missing!         X→O: selection bias    do removes X→O +
                                            O→R retained

ESCM² achieves the structure of Figure 3(c) through the do-calculus: for samples with click=0, it answers the counterfactual question “what would the conversion probability have been if the user had clicked.”
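
The difference between the observed click-space CVR and the counterfactual CVR can be illustrated with exact expectations over a toy population. Everything in this sketch is assumed for illustration: the two equally sized user groups, their click and conversion probabilities, and the variable names.

```python
# Hypothetical population of two equally sized user groups.
# p_click: P(o=1 | x); p_conv_if_click: P(r=1 | do(o=1), x).
groups = [
    {"p_click": 0.8, "p_conv_if_click": 0.5},
    {"p_click": 0.1, "p_conv_if_click": 0.1},
]

# Target: the conversion rate if every impression had been clicked.
counterfactual_cvr = sum(g["p_conv_if_click"] for g in groups) / len(groups)

# Naive estimate: conversion rate among the clicks that were actually observed.
clicks = sum(g["p_click"] for g in groups)
conversions = sum(g["p_click"] * g["p_conv_if_click"] for g in groups)
naive_cvr = conversions / clicks

# IPS estimate: each observed conversion is weighted by 1 / p_click,
# then averaged over all impressions rather than over clicks only.
ips_cvr = sum(
    g["p_click"] * g["p_conv_if_click"] / g["p_click"] for g in groups
) / len(groups)

print(counterfactual_cvr, naive_cvr, ips_cvr)
```

In this toy setup the naive estimate is pulled toward the group that clicks often and also converts often, while the inverse-propensity weighting recovers the entire-space target.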

Two Variants

ESCM²-IPS

Uses the CTR tower output as the propensity score to inverse-probability-weight the CVR loss:

$$\mathcal{R}_{\text{IPS}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \frac{o_{u,i} \cdot \delta(r_{u,i}, \hat{r}_{u,i})}{\hat{o}_{u,i}}$$
  • If CTR is accurate ($\hat{o} = o$), unbiased CVR estimation is guaranteed (Theorem 2)
  • Drawback: high variance (weights blow up at low CTR)
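
The IPS risk above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the error function $\delta$ is taken to be binary log loss, and the names `log_loss`, `ips_risk`, and the example values are hypothetical.

```python
import math


def log_loss(label, pred, eps=1e-7):
    """Binary log loss for one example, with the prediction kept away from 0 and 1."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


def ips_risk(clicks, conversions, p_ctr, p_cvr):
    """IPS-weighted CVR risk, averaged over all impressions (the entire space)."""
    total = 0.0
    for o, r, o_hat, r_hat in zip(clicks, conversions, p_ctr, p_cvr):
        if o == 1:  # unclicked impressions contribute zero to the sum
            total += log_loss(r, r_hat) / o_hat
    return total / len(clicks)


# Hypothetical mini-batch of four impressions.
clicks = [1, 0, 1, 0]
conversions = [1, 0, 0, 0]
p_ctr = [0.5, 0.2, 0.25, 0.1]
p_cvr = [0.8, 0.3, 0.1, 0.2]
risk = ips_risk(clicks, conversions, p_ctr, p_cvr)
print(risk)
```

The sum runs only over clicked impressions, but the normalizer is the full impression count, which is what makes the weighted click-space loss an estimate of the entire-space risk.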

ESCM²-DR

Adds an imputation tower to IPS to reduce variance:

$$\mathcal{R}_{\text{DR}} = \frac{1}{|\mathcal{D}|} \sum_{(u,i) \in \mathcal{D}} \left[ \hat{\delta}_{u,i} + \frac{o_{u,i} \cdot (\delta_{u,i} - \hat{\delta}_{u,i})}{\hat{o}_{u,i}} \right]$$
  • $\hat{\delta}_{u,i}$: the CVR error predicted by the imputation tower
  • Double robustness: unbiased as long as either the imputation or the propensity is accurate
  • Additional imputation loss: the imputation tower is trained via $\mathcal{R}_{\text{DR}}^{\text{imp}}$
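
A minimal sketch of the DR risk, assuming the per-impression errors $\delta$ and imputed errors $\hat{\delta}$ have already been computed. The function name `dr_risk`, the argument names, and the numbers are hypothetical; the imputation loss $\mathcal{R}_{\text{DR}}^{\text{imp}}$ is not shown because its form is not given in this note.

```python
def dr_risk(clicks, errors, imputed_errors, p_ctr):
    """Doubly robust CVR risk averaged over all impressions.

    For unclicked impressions the true error is unobserved; any placeholder
    can be passed because the correction term is multiplied by o = 0.
    """
    total = 0.0
    for o, e, e_hat, o_hat in zip(clicks, errors, imputed_errors, p_ctr):
        total += e_hat + o * (e - e_hat) / o_hat
    return total / len(clicks)


# Hypothetical batch in which the imputation is exact on clicked impressions.
clicks = [1, 0, 1, 0]
errors = [0.2, 0.0, 0.6, 0.0]          # 0.0 is a placeholder where o = 0
imputed_errors = [0.2, 0.4, 0.6, 0.1]

good_propensity = dr_risk(clicks, errors, imputed_errors, [0.5, 0.2, 0.25, 0.1])
bad_propensity = dr_risk(clicks, errors, imputed_errors, [0.9, 0.9, 0.9, 0.9])
print(good_propensity, bad_propensity)
```

When the imputed errors match the observed ones, the correction term vanishes and the result no longer depends on the propensities, which is one half of the double-robustness property in this toy case.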

Architecture

       Shared Embedding Lookup Table
       ┌──────┬──────────┬──────────┐
       ↓      ↓          ↓          ↓
   CTR Tower  Imp Tower  CVR Tower
       ↓      ↓          ↓
      pCTR   δ̂ (imp)   pCVR ──── × ──── pCTR
       │      │          │                  ↓
       │      │          │               pCTCVR
       ↓      ↓          ↓                  ↓
   L_CTR   L_CVR(IPS/DR)              L_CTCVR
   ───────────── + ─────────── + ──────────
              L_ESCM² (final)
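
The right-hand branch of the diagram, where pCTR and pCVR are multiplied into pCTCVR, can be sketched for a single impression as follows. The tower outputs are passed in as plain numbers, and the use of binary cross-entropy for the CTR and CTCVR losses, along with the names `bce` and `tower_losses`, is an assumption made for illustration.

```python
import math


def bce(label, pred, eps=1e-7):
    """Binary cross-entropy for one example, with the prediction kept away from 0 and 1."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


def tower_losses(click, conversion, p_ctr, p_cvr):
    """Combine tower outputs into pCTCVR and the two entire-space losses."""
    p_ctcvr = p_ctr * p_cvr                      # pCTCVR = pCTR x pCVR
    l_ctr = bce(click, p_ctr)                    # label: click
    l_ctcvr = bce(click * conversion, p_ctcvr)   # label: click AND conversion
    return p_ctcvr, l_ctr, l_ctcvr


# Hypothetical impression: clicked but not converted.
p_ctcvr, l_ctr, l_ctcvr = tower_losses(click=1, conversion=0, p_ctr=0.2, p_cvr=0.5)
print(p_ctcvr, l_ctr, l_ctcvr)
```

Both losses here are defined on every impression; only the CVR regularizer (IPS or DR) needs the click indicator and the propensity.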

Implementation Tips

  1. Propensity clipping: prevent weight explosion when $\hat{o}_{u,i}$ is very small → clip at threshold 0.1
  2. Gradient truncation: block the gradient from $\mathcal{L}_{\text{CVR}}$ into the CTR tower (propensity) → protect CTR training
  3. Setting $\lambda_c$: typically in the 0.1–1.0 range. Too large hinders CTR training → degrades CTCVR performance
  4. Setting $\lambda_g$: 1.0 or above is recommended. Beneficial for both CVR and CTCVR
  5. DR stabilization: train the imputation tower and the CVR tower in alternation (alternating training)
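
Tip 1 can be sketched as a floor on the propensity before it is inverted. The function name `ips_weight` and the example CTR value are hypothetical; the floor of 0.1 is the threshold mentioned above. Gradient truncation (tip 2) would correspond to a stop-gradient operation on the propensity in an autodiff framework and is not shown here.

```python
def ips_weight(click, p_ctr, floor=None):
    """Inverse-propensity weight for one impression, optionally clipped from below."""
    denom = p_ctr if floor is None else max(p_ctr, floor)
    return click / denom


# Hypothetical clicked impression with a very small predicted CTR.
raw = ips_weight(1, 0.001)
clipped = ips_weight(1, 0.001, floor=0.1)
print(raw, clipped)
```

With a floor of 0.1 no weight can exceed 10. Clipping generally trades some bias for lower variance, so the threshold is a tuning choice rather than a fixed rule.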

Theoretical Guarantees

| Theorem | Statement | Meaning |
|---|---|---|
| Theorem 1 | $\text{Bias}^{\text{ESMM}} > 0$ | ESMM structurally overestimates CVR |
| Theorem 2 | $\mathcal{R}_{\text{IPS}} = \mathcal{P}$ (when CTR is accurate) | IPS regularizer → unbiased CVR (resolves IEB) |
| Theorem 3 | $\hat{r}^{\text{IPS}} \to P(r \mid do(o=1))$ | IPS → convergence to counterfactual CVR (resolves PIP) |

Related Concepts

  • ESMM — the base model that ESCM² improves upon
  • Multi-Task Learning — the learning paradigm
  • Propensity Score — uses CTR as the propensity
  • Doubly Robust Estimator — the theoretical basis of the DR regularizer
  • Counterfactual Reasoning — redefines the CVR estimand via the do-calculus
  • Selection Bias — the core problem ESCM² addresses
  • IPW — the underlying methodology of the IPS regularizer

Key Papers

  • wangESCM$^2$EntireSpace2022: the original ESCM² paper (SIGIR 2022)
  • maEntireSpaceMultiTask2018: the original ESMM paper (SIGIR 2018)
