Tae Hyun Kim (Lowell)

Multi-Task Learning

#recsys #multi-task-learning

Definition

A learning paradigm that jointly trains several related tasks, improving generalization through a shared representation.

$$\mathcal{L}_{total} = \sum_{k=1}^{K} \lambda_k \mathcal{L}_k$$

Here $\mathcal{L}_k$ is the loss of task $k$, and $\lambda_k$ is the task weight.
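As a minimal sketch of the weighted sum above, assuming binary cross-entropy as each task loss (the definition itself does not fix the loss type) and hypothetical toy labels, predictions, and weights:

```python
import math

def bce(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch of labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def total_loss(task_losses, task_weights):
    """L_total = sum over k of lambda_k * L_k."""
    return sum(w * l for w, l in zip(task_weights, task_losses))

# Hypothetical toy batch: click and conversion labels with model predictions.
ctr_loss = bce([1, 0, 0, 1], [0.8, 0.2, 0.1, 0.6])
cvr_loss = bce([1, 0, 0, 0], [0.5, 0.1, 0.1, 0.2])

# The weights 1.0 and 0.5 are illustrative values for lambda_k, not recommendations.
loss = total_loss([ctr_loss, cvr_loss], task_weights=[1.0, 0.5])
```

The weights $\lambda_k$ are hyperparameters here. Raising one of them makes the shared parameters favor that task.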


Intuitive Understanding

Why learn jointly?

  • Shared representation: when related tasks learn common patterns, each individual task can benefit from features it would not have learned alone.
  • Regularization effect: the extra tasks constrain the shared layers, which helps prevent overfitting to any single task.
  • Data efficiency: tasks with sparse labels (such as CVR) can borrow signal from data-rich tasks (such as CTR).

Core application in advertising/recommendation

Predicting CTR and CVR simultaneously:

User/Item Features
       ↓
  Shared Embedding
       ↓
   ┌───┴───┐
   ↓       ↓
CTR Tower  CVR Tower
   ↓       ↓
  pCTR    pCVR

Main Architectures

1. Shared-Bottom

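All tasks share the same lower network, and each task adds its own tower on top. Simple, but weakly related tasks can conflict over the shared parameters (negative transfer).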
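A shared-bottom forward pass for the CTR/CVR case could look roughly like the sketch below. It uses plain Python with randomly initialized weights. The layer sizes, the helper names (`linear`, `init_layer`, `shared_bottom_forward`), and the input vector are all assumptions made for illustration.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
bottom = init_layer(rng, n_in=4, n_out=3)     # shared by both tasks
ctr_tower = init_layer(rng, n_in=3, n_out=1)  # task-specific head
cvr_tower = init_layer(rng, n_in=3, n_out=1)  # task-specific head

def shared_bottom_forward(x):
    h = relu(linear(x, *bottom))              # shared representation
    p_ctr = sigmoid(linear(h, *ctr_tower)[0])
    p_cvr = sigmoid(linear(h, *cvr_tower)[0])
    return p_ctr, p_cvr

p_ctr, p_cvr = shared_bottom_forward([0.2, -1.0, 0.5, 1.5])
```

Both heads read the same vector `h`, so gradients from either task update `bottom`. That coupling is where task conflict can arise.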

2. MMoE (Multi-gate Mixture-of-Experts)

Each task has its own gate, which weights the outputs of a shared pool of expert networks. This helps most when the tasks are only weakly related.
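The gating idea can be sketched in plain Python as follows. The sizes (`N_IN`, `N_HID`, `N_EXPERTS`), the single-layer experts, and the random weights are assumptions for illustration, not settings from the MMoE paper.

```python
import math
import random

def linear(x, weights):
    """Bias-free dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]

def init(rng, n_in, n_out):
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
N_IN, N_HID, N_EXPERTS = 4, 3, 3
TASKS = ("ctr", "cvr")
experts = [init(rng, N_IN, N_HID) for _ in range(N_EXPERTS)]  # shared expert pool
gates = {t: init(rng, N_IN, N_EXPERTS) for t in TASKS}        # one gate per task
towers = {t: init(rng, N_HID, 1) for t in TASKS}              # one tower per task

def mmoe_forward(x):
    expert_out = [[max(0.0, a) for a in linear(x, w)] for w in experts]
    preds, mix = {}, {}
    for t in TASKS:
        g = softmax(linear(x, gates[t]))  # this task's weights over the experts
        h = [sum(g[e] * expert_out[e][j] for e in range(N_EXPERTS))
             for j in range(N_HID)]       # gate-weighted sum of expert outputs
        logit = linear(h, towers[t])[0]
        preds[t] = 1.0 / (1.0 + math.exp(-logit))
        mix[t] = g
    return preds, mix

preds, mix = mmoe_forward([0.2, -1.0, 0.5, 1.5])
```

Because `mix["ctr"]` and `mix["cvr"]` are computed by separate gates, each task can lean on different experts instead of being forced through one shared bottom.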

3. PLE (Progressive Layered Extraction)

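Separates shared experts from task-specific experts and stacks several such extraction layers, so the representation becomes progressively more task-specific. Designed to reduce task conflict.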
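The sketch below shows a single extraction layer only. Full PLE stacks several such layers and adds a gate for the shared experts in the intermediate layers, which this sketch omits. The expert counts, layer sizes, and random weights are assumptions for illustration.

```python
import math
import random

def linear(x, weights):
    """Bias-free dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]

def init(rng, n_in, n_out):
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
N_IN, N_HID = 4, 3
TASKS = ("ctr", "cvr")
shared_experts = [init(rng, N_IN, N_HID) for _ in range(2)]
task_experts = {t: [init(rng, N_IN, N_HID) for _ in range(2)] for t in TASKS}
# Each task gate scores its own experts plus the shared ones (2 + 2 = 4).
task_gates = {t: init(rng, N_IN, 4) for t in TASKS}

def expert_forward(x, w):
    return [max(0.0, a) for a in linear(x, w)]

def extraction_layer(x):
    shared_out = [expert_forward(x, w) for w in shared_experts]
    fused = {}
    for t in TASKS:
        own_out = [expert_forward(x, w) for w in task_experts[t]]
        candidates = own_out + shared_out  # the other task's experts are excluded
        g = softmax(linear(x, task_gates[t]))
        fused[t] = [sum(g[e] * candidates[e][j] for e in range(len(candidates)))
                    for j in range(N_HID)]
    return fused

fused = extraction_layer([0.2, -1.0, 0.5, 1.5])
```

The difference from MMoE is in `candidates`: a task never mixes in the other task's private experts, so those experts receive gradients from one task only.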


Main Models in Ad Prediction

ESMM (Entire Space Multi-Task Model)

  • Problem solved: CVR Selection Bias (training only on click=1) + Data Sparsity
  • Key idea: learn $\text{CTCVR} = \text{pCTR} \times \text{pCVR}$ over the entire impression space.
  • Limitations: Inherent Estimation Bias (IEB) + Potential Independence Priority (PIP)
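The ESMM objective can be sketched as a CTR loss plus a CTCVR loss, both computed over every impression. The labels and predictions below are hypothetical toy values, and the two terms are summed with equal weight as a simplifying assumption.

```python
import math

def bce(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy over a batch."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def esmm_loss(clicks, conversions, p_ctr, p_cvr):
    # pCTCVR = pCTR * pCVR, defined for every impression.
    p_ctcvr = [a * b for a, b in zip(p_ctr, p_cvr)]
    # CTCVR label: the impression was clicked and then converted.
    ctcvr_labels = [c * v for c, v in zip(clicks, conversions)]
    # Both terms use all impressions, so pCVR is never fit on clicked rows alone.
    return bce(clicks, p_ctr) + bce(ctcvr_labels, p_ctcvr)

# Hypothetical toy batch of four impressions.
clicks = [1, 0, 0, 1]
conversions = [1, 0, 0, 0]
p_ctr = [0.8, 0.2, 0.1, 0.6]
p_cvr = [0.5, 0.1, 0.1, 0.2]
loss = esmm_loss(clicks, conversions, p_ctr, p_cvr)
```

There is no direct loss on `p_cvr` in this sketch. The CVR tower is supervised only through the product.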

ESCM² (Entire Space Counterfactual Multi-Task Model)

  • Improvement: ESMM + IPS/DR counterfactual risk regularizer
  • Method: redefines CVR as $P(r=1 \mid do(o=1))$ and trains it unbiasedly with IPS/DR.
  • Limitations: depends on CTR accuracy, operates only in the impression space.
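A rough sketch of the IPS part of that regularizer: the CVR log loss on clicked impressions is divided by the predicted click propensity (pCTR) and averaged over all impressions. The clipping floor `min_propensity` and the toy values are assumptions for illustration, and the DR variant is not shown.

```python
import math

def log_loss(y, p, eps=1e-7):
    """Binary cross-entropy for a single example."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def ips_cvr_loss(clicks, conversions, p_ctr, p_cvr, min_propensity=0.05):
    total = 0.0
    for o, r, propensity, p in zip(clicks, conversions, p_ctr, p_cvr):
        if o == 1:  # conversion labels are only observed after a click
            # Inverse propensity weight; the floor (an assumption) limits variance.
            total += log_loss(r, p) / max(propensity, min_propensity)
    return total / len(clicks)  # average over all impressions, clicked or not

# Hypothetical toy batch of four impressions.
clicks = [1, 0, 0, 1]
conversions = [1, 0, 0, 0]
p_ctr = [0.8, 0.2, 0.1, 0.6]
p_cvr = [0.5, 0.1, 0.1, 0.2]
loss = ips_cvr_loss(clicks, conversions, p_ctr, p_cvr)
```

Clicked impressions with a low predicted CTR get a large weight, which is why the estimate depends on CTR accuracy.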

IPW-ESCM² (proposed framework)

  • Combines ESCM² with IPW weights.
  • Corrects win selection bias and click selection bias simultaneously.
  • See research_design_selection_bias for details.

Related Concepts

  • Selection Bias - the core problem MTL aims to address
  • IPW - a weighting method for bias correction
  • Calibration - the accuracy of predicted probabilities
  • Survival Analysis - used for win propensity estimation

Key Papers

  • Ma et al. (2018). Entire Space Multi-Task Model (ESMM), SIGIR 2018 [maEntireSpaceMultiTask2018]
  • Wang et al. (2022). ESCM²: Entire Space Counterfactual Multi-Task Model, SIGIR 2022 [wangESCM2EntireSpace2022]
  • Ma et al. (2018). Modeling Task Relationships in Multi-Task Learning with Multi-Gate Mixture-of-Experts (MMoE), KDD 2018
