Multi-Task Learning
Definition
A learning paradigm that jointly trains several related tasks, improving generalization through a shared representation.
$$\mathcal{L} = \sum_{k=1}^{K} w_k \mathcal{L}_k$$
Here $\mathcal{L}_k$ is the loss of task $k$, and $w_k$ is the task weight.
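As a minimal sketch of the weighted-sum objective, the snippet below combines two per-task losses into one scalar. The helper names (`binary_cross_entropy`, `multitask_loss`), the toy labels, and the weights 1.0 and 0.5 are illustrative assumptions, not values from this note.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean log loss over one task's examples."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def multitask_loss(task_losses, task_weights):
    """Weighted sum of per-task losses, one weight per task."""
    if len(task_losses) != len(task_weights):
        raise ValueError("need exactly one weight per task")
    return sum(w * loss for w, loss in zip(task_weights, task_losses))

# Toy example: a CTR task and a CVR task (hypothetical numbers).
ctr_loss = binary_cross_entropy([1, 0, 0, 1], [0.8, 0.1, 0.3, 0.6])
cvr_loss = binary_cross_entropy([0, 0, 0, 1], [0.2, 0.05, 0.1, 0.4])
total = multitask_loss([ctr_loss, cvr_loss], [1.0, 0.5])
print(f"ctr={ctr_loss:.4f} cvr={cvr_loss:.4f} total={total:.4f}")
```

In practice the weights are tuned or learned; fixed weights are only the simplest choice.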
Intuitive Understanding
Why learn jointly?
- Shared representation: related tasks learn common patterns in the shared layers, which can improve the performance of each individual task.
- Regularization effect: fitting several tasks at once constrains the shared layers, which helps prevent overfitting to any single task.
- Data efficiency: tasks with sparse labels can borrow signal from data-rich tasks.
Core application in advertising/recommendation
Predicting CTR and CVR simultaneously:
   User/Item Features
            ↓
    Shared Embedding
            ↓
      ┌─────┴─────┐
      ↓           ↓
  CTR Tower   CVR Tower
      ↓           ↓
     pCTR        pCVR
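The diagram can be read as a forward pass: one shared layer feeds two task towers, each ending in a sigmoid. The sketch below uses plain Python lists instead of a deep-learning framework; the class name `SharedEmbeddingModel`, the layer sizes, and the random initialization are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def relu(v):
    return [max(0.0, a) for a in v]

def dense(x, weights, bias):
    """Affine layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

class SharedEmbeddingModel:
    """Shared layer followed by one small tower per task (untrained)."""

    def __init__(self, n_features, n_shared, n_tower, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(rng, n_features, n_shared)
        self.towers = {}
        for task in ("ctr", "cvr"):
            hidden = init_layer(rng, n_shared, n_tower)
            output = init_layer(rng, n_tower, 1)
            self.towers[task] = (hidden, output)

    def forward(self, x):
        shared = relu(dense(x, *self.shared))  # computed once, used by both towers
        preds = {}
        for task, (hidden, output) in self.towers.items():
            h = relu(dense(shared, *hidden))
            preds[task] = sigmoid(dense(h, *output)[0])
        return preds

model = SharedEmbeddingModel(n_features=4, n_shared=8, n_tower=4)
preds = model.forward([0.2, 1.0, 0.0, 0.5])
print(preds)  # {'ctr': pCTR, 'cvr': pCVR}, both in (0, 1)
```

Because both towers read the same shared output, a gradient from either task would update the shared layer, which is where both the transfer and the possible conflict come from.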
Main Architectures
1. Shared-Bottom
All tasks share the same lower network. Simple, but tasks may conflict.
2. MMoE (Multi-gate Mixture-of-Experts)
A per-task gate computes softmax weights over a shared pool of expert networks, so each task gets its own mixture of experts. Helps most when the tasks are only loosely related.
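A rough sketch of the gating step, assuming single-layer ReLU experts and linear gates over the raw input: every task sees the same expert outputs but weights them with its own softmax gate. The function name `mmoe_forward`, the sizes, and the random weights are hypothetical.

```python
import math
import random

def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]

def relu(v):
    return [max(0.0, a) for a in v]

def dense(x, weights, bias):
    """Affine layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def mmoe_forward(x, experts, gates):
    """Return per-task mixtures of expert outputs and the gate weights used."""
    expert_outs = [relu(dense(x, w, b)) for w, b in experts]
    dim = len(expert_outs[0])
    mixed, gate_weights = {}, {}
    for task, (gate_w, gate_b) in gates.items():
        weights = softmax(dense(x, gate_w, gate_b))  # one weight per expert
        gate_weights[task] = weights
        mixed[task] = [sum(g * out[j] for g, out in zip(weights, expert_outs))
                       for j in range(dim)]
    return mixed, gate_weights

rng = random.Random(0)
experts = [init_layer(rng, 4, 3) for _ in range(3)]              # 3 shared experts
gates = {task: init_layer(rng, 4, 3) for task in ("ctr", "cvr")}  # one gate per task
mixed, gate_weights = mmoe_forward([0.2, 1.0, 0.0, 0.5], experts, gates)
print(gate_weights)
```

Each entry of `mixed` would then feed that task's tower, as in the shared-bottom layout.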
3. PLE (Progressive Layered Extraction)
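Splits experts into shared and task-specific groups and extracts task-specific representations progressively across stacked layers. Designed to reduce task conflict (negative transfer).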
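The sketch below shows a single extraction layer in the spirit of PLE, under simplifying assumptions: each task's gate mixes only that task's own experts plus the shared experts, never another task's. The full model stacks several such layers; the names (`extraction_layer`, `mix`) and sizes here are hypothetical.

```python
import math
import random

def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]

def relu(v):
    return [max(0.0, a) for a in v]

def dense(x, weights, bias):
    """Affine layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def mix(gate_weights, outputs):
    """Weighted sum of expert output vectors."""
    dim = len(outputs[0])
    return [sum(g * out[j] for g, out in zip(gate_weights, outputs))
            for j in range(dim)]

def extraction_layer(x, shared_experts, task_experts, gates):
    shared_outs = [relu(dense(x, w, b)) for w, b in shared_experts]
    result = {}
    for task, experts in task_experts.items():
        own_outs = [relu(dense(x, w, b)) for w, b in experts]
        candidates = own_outs + shared_outs  # other tasks' experts are excluded
        gate_w, gate_b = gates[task]
        result[task] = mix(softmax(dense(x, gate_w, gate_b)), candidates)
    return result

rng = random.Random(0)
shared_experts = [init_layer(rng, 4, 3) for _ in range(2)]
task_experts = {t: [init_layer(rng, 4, 3)] for t in ("ctr", "cvr")}
gates = {t: init_layer(rng, 4, 3) for t in ("ctr", "cvr")}  # 1 own + 2 shared experts
reps = extraction_layer([0.2, 1.0, 0.0, 0.5], shared_experts, task_experts, gates)
print(reps)
```

Keeping task-specific experts out of the other task's gate is what separates this layout from MMoE, where every gate sees every expert.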
Main Models in Ad Prediction
ESMM (Entire Space Multi-Task Model)
- Problem solved: CVR Selection Bias (training only on click=1) + Data Sparsity
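- Key idea: learn over the entire impression space by factorizing pCTCVR = pCTR × pCVR and supervising only pCTR and pCTCVR, so the CVR tower receives gradients from every impression rather than from clicked samples alone.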
- Limitations: Inherent Estimation Bias (IEB) + Potential Independence Priority (PIP)
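The entire-space idea can be sketched as a loss over all impressions: the click label supervises pCTR, and the click-and-convert label supervises pCTCVR = pCTR × pCVR, so pCVR is never trained directly on the clicked subset. The function names and toy inputs below are illustrative assumptions.

```python
import math

def log_loss(y, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def esmm_loss(clicks, conversions, p_ctr, p_cvr):
    """Mean CTR loss plus mean CTCVR loss over every impression."""
    n = len(clicks)
    ctr_loss = 0.0
    ctcvr_loss = 0.0
    for click, conv, ctr, cvr in zip(clicks, conversions, p_ctr, p_cvr):
        p_ctcvr = ctr * cvr                       # pCTCVR = pCTR * pCVR
        ctr_loss += log_loss(click, ctr)
        ctcvr_loss += log_loss(click * conv, p_ctcvr)  # positive only if clicked and converted
    return (ctr_loss + ctcvr_loss) / n

# Four impressions: two clicks, one of which converted (toy data).
loss = esmm_loss(
    clicks=[1, 1, 0, 0],
    conversions=[1, 0, 0, 0],
    p_ctr=[0.7, 0.6, 0.2, 0.1],
    p_cvr=[0.5, 0.2, 0.3, 0.1],
)
print(f"ESMM loss: {loss:.4f}")
```

Note that unclicked impressions still contribute to both terms, which is how the CVR tower sees the whole impression space.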
ESCM² (Entire Space Counterfactual Multi-Task Model)
- Improvement: ESMM + IPS/DR counterfactual risk regularizer
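- Method: redefines CVR as a counterfactual quantity, the conversion probability under an enforced click, $P(r = 1 \mid do(o = 1))$ with $o$ the click and $r$ the conversion, and trains it without bias using IPS/DR.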
- Limitations: depends on CTR accuracy, operates only in the impression space.
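A minimal sketch of the IPS variant of the CVR risk: the CVR log loss is computed on clicked impressions only, each term is divided by the predicted click propensity (pCTR), and the sum is averaged over all impressions. The propensity floor `min_propensity` is an assumption added here for numerical stability, and the DR variant, which also needs an imputation model, is not shown.

```python
import math

def log_loss(y, p, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def ips_cvr_loss(clicks, conversions, p_ctr, p_cvr, min_propensity=0.05):
    """Inverse-propensity-weighted CVR loss, averaged over all impressions."""
    n = len(clicks)
    total = 0.0
    for click, conv, ctr, cvr in zip(clicks, conversions, p_ctr, p_cvr):
        if click:  # conversion labels are only observed after a click
            propensity = max(ctr, min_propensity)  # assumed floor against tiny pCTR
            total += log_loss(conv, cvr) / propensity
    return total / n

loss = ips_cvr_loss(
    clicks=[1, 1, 0, 0],
    conversions=[1, 0, 0, 0],
    p_ctr=[0.7, 0.6, 0.2, 0.1],
    p_cvr=[0.5, 0.2, 0.3, 0.1],
)
print(f"IPS CVR loss: {loss:.4f}")
```

The division by pCTR makes the dependence on CTR accuracy concrete: a badly estimated propensity rescales the corresponding loss term directly.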
IPW-ESCM² (proposed framework)
- Combines ESCM² with IPW weights.
- Corrects win selection bias and click selection bias simultaneously.
- See research_design_selection_bias for details.
Related Concepts
- Selection Bias - the core problem that entire-space MTL models (ESMM, ESCM²) aim to address
- IPW - a weighting method for bias correction
- Calibration - the accuracy of predicted probabilities
- Survival Analysis - used for win propensity estimation
Key Papers
- maEntireSpaceMultiTask2018: Entire Space Multi-Task Model (ESMM), SIGIR 2018
- wangESCMEntireSpace2022: ESCM²: Entire Space Counterfactual Multi-Task Model, SIGIR 2022
- Ma et al. (2018): Modeling Task Relationships in Multi-Task Learning with Multi-Gate Mixture-of-Experts (MMoE), KDD 2018