Multi-Task Learning · Tae Hyun Kim (Lowell)

Definition

A learning paradigm that trains one model on several related tasks jointly, improving generalization through a shared representation.

$$\mathcal{L}_{\text{total}} = \sum_{k=1}^{K} \lambda_k \mathcal{L}_k$$

Here $\mathcal{L}_k$ is the loss of task $k$, $\lambda_k$ is its weight, and $K$ is the number of tasks.
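
As a minimal sketch of the weighted sum above, the snippet below combines per-task losses in plain Python. The function name `total_loss` and the example loss and weight values are illustrative assumptions, not taken from any particular library or experiment.

```python
def total_loss(task_losses, task_weights):
    """Weighted sum of per-task losses: sum_k lambda_k * L_k."""
    if len(task_losses) != len(task_weights):
        raise ValueError("need exactly one weight per task")
    return sum(w * loss for w, loss in zip(task_weights, task_losses))


# Hypothetical per-task losses for K = 2 tasks (e.g. CTR and CVR).
losses = [0.40, 0.10]
weights = [1.0, 0.5]
combined = total_loss(losses, weights)  # 1.0 * 0.40 + 0.5 * 0.10
```

In practice the weights $\lambda_k$ are usually tuned as hyperparameters, because a task with a larger loss scale can otherwise dominate the shared parameters.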

Shared representation: when related tasks learn common patterns, each individual task can perform better than it would alone.
Regularization effect: the additional tasks constrain the shared parameters, which helps prevent overfitting to any single task.
Data efficiency: tasks with sparse labels can borrow information from data-rich tasks.

Predicting CTR and CVR simultaneously:

User/Item Features
       ↓
  Shared Embedding
       ↓
   ┌───┴───┐
   ↓       ↓
CTR Tower  CVR Tower
   ↓       ↓
  pCTR    pCVR
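
The diagram can be sketched as a forward pass in plain Python. This is an illustration under stated assumptions, not a production model: the layer sizes, the random initialization, and the helper names (`linear`, `init_layer`, `shared_bottom_forward`) are all hypothetical, and a single dense layer stands in for the shared embedding.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def shared_bottom_forward(features, params):
    # Shared layer: both towers read the same representation.
    shared = relu(linear(features, *params["shared"]))
    # Task towers: separate parameters, one logit each.
    ctr_logit = linear(shared, *params["ctr_tower"])[0]
    cvr_logit = linear(shared, *params["cvr_tower"])[0]
    return sigmoid(ctr_logit), sigmoid(cvr_logit)


rng = random.Random(0)
params = {
    "shared": init_layer(rng, n_out=4, n_in=3),
    "ctr_tower": init_layer(rng, n_out=1, n_in=4),
    "cvr_tower": init_layer(rng, n_out=1, n_in=4),
}
p_ctr, p_cvr = shared_bottom_forward([0.2, -1.0, 0.7], params)
```

Both outputs depend on `params["shared"]`, so gradients from either task would update the shared layer during training.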

Shared-Bottom: all tasks share the same lower network. Simple, but conflicting tasks can hurt one another through the shared parameters (negative transfer).

MMoE (Multi-Gate Mixture-of-Experts): each task has its own gate that selectively combines a shared pool of expert networks. Advantageous when task relationships are weak.
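
Assuming this describes the multi-gate mixture-of-experts design from the Ma et al. (2018) KDD reference, the gating step can be sketched as below. The function name `mmoe_mix` and the toy values are hypothetical. In a full model the gate logits would be computed from the input features by a learned layer; here they are passed in directly to keep the sketch short.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mmoe_mix(expert_outputs, gate_logits_per_task):
    """Return one mixed representation per task.

    expert_outputs: list of E vectors (one per expert), all the same length.
    gate_logits_per_task: dict mapping task name -> list of E gate logits.
    """
    dim = len(expert_outputs[0])
    mixed = {}
    for task, logits in gate_logits_per_task.items():
        gate = softmax(logits)  # weights over experts, summing to 1
        mixed[task] = [
            sum(g * out[d] for g, out in zip(gate, expert_outputs))
            for d in range(dim)
        ]
    return mixed


experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gates = {"ctr": [2.0, 0.0, 0.0], "cvr": [0.0, 0.0, 0.0]}
mixed = mmoe_mix(experts, gates)
```

Because every task has its own gate, two weakly related tasks can put most of their weight on different experts while still drawing from the same pool.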

PLE (Progressive Layered Extraction): progressively separates shared experts from task-specific experts across layers. Reduces task conflict.
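
Assuming the design described above is Progressive Layered Extraction (PLE), a single extraction layer can be sketched as follows: each task's gate sees only its own experts plus the shared experts, never another task's experts. The function name `cgc_mix` and the toy values are hypothetical, gate logits are passed in rather than learned, and a full PLE model would stack several such layers.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cgc_mix(shared_experts, task_experts, gate_logits):
    """Mix expert outputs per task.

    shared_experts: list of output vectors from experts visible to all tasks.
    task_experts: dict mapping task name -> list of that task's expert outputs.
    gate_logits: dict mapping task name -> logits over [own experts + shared experts].
    """
    mixed = {}
    for task, own in task_experts.items():
        candidates = own + shared_experts  # other tasks' experts are excluded
        gate = softmax(gate_logits[task])
        if len(gate) != len(candidates):
            raise ValueError("need one gate logit per visible expert")
        dim = len(candidates[0])
        mixed[task] = [
            sum(g * c[d] for g, c in zip(gate, candidates)) for d in range(dim)
        ]
    return mixed


shared = [[1.0, 1.0]]
own = {"ctr": [[2.0, 0.0]], "cvr": [[0.0, 2.0]]}
logits = {"ctr": [0.0, 0.0], "cvr": [0.0, 0.0]}
mixed = cgc_mix(shared, own, logits)
```

The task-specific experts receive gradients from one task only, which is what keeps conflicting tasks from overwriting each other's parameters.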

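ESMM (Entire Space Multi-Task Model)

Problem solved: sample selection bias in CVR (the CVR model is trained only on clicked impressions, click=1, but applied to all impressions) and data sparsity (clicks are far rarer than impressions).
Key idea: learn $\text{CTCVR} = \text{pCTR} \times \text{pCVR}$ over the entire impression space, so pCVR is supervised indirectly through the CTR and CTCVR labels.
Limitations: Inherent Estimation Bias (IEB), where the CVR estimate is systematically too high, and Potential Independence Priority (PIP), where the causal dependence of conversion on click may be overlooked.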
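
The key idea can be sketched as a loss function in plain Python. This is an illustration assuming binary cross-entropy for both terms and equal weighting. The names `bce` and `esmm_loss` and the toy data are hypothetical.

```python
import math


def bce(label, prob, eps=1e-7):
    """Binary cross-entropy for one example, with clipping for stability."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def esmm_loss(clicks, conversions, p_ctr, p_cvr):
    """Mean CTR loss plus mean CTCVR loss, both over all impressions."""
    n = len(clicks)
    ctr_loss = sum(bce(c, p) for c, p in zip(clicks, p_ctr)) / n
    # A conversion requires a click, so the CTCVR label is click * conversion,
    # and the CTCVR prediction is the product pCTR * pCVR.
    ctcvr_loss = sum(
        bce(c * v, pc * pv)
        for c, v, pc, pv in zip(clicks, conversions, p_ctr, p_cvr)
    ) / n
    return ctr_loss + ctcvr_loss


clicks = [1, 1, 0, 0]
conversions = [1, 0, 0, 0]
p_ctr = [0.8, 0.6, 0.2, 0.1]
p_cvr = [0.5, 0.1, 0.3, 0.2]
loss = esmm_loss(clicks, conversions, p_ctr, p_cvr)
```

There is no separate CVR loss term: the CVR tower is trained only through the product, which is why unclicked impressions also contribute to it.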

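ESCM² (Entire Space Counterfactual Multi-Task Model)

Improvement: adds an IPS/DR counterfactual risk regularizer to ESMM.
Method: redefines CVR as $P(r=1 \mid do(o=1))$, where $o$ is the click and $r$ the conversion, and debiases CVR training with inverse propensity scoring (IPS) or a doubly robust (DR) estimator.
Limitations: depends on CTR accuracy, because pCTR serves as the propensity; operates only in the impression space.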
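
A minimal sketch of the IPS variant of the regularizer follows, assuming binary cross-entropy as the per-sample CVR error and pCTR as the propensity. The names `bce` and `ips_cvr_loss` are hypothetical, and the `min_propensity` clipping is an added assumption to keep the weights bounded, not something stated above.

```python
import math


def bce(label, prob, eps=1e-7):
    """Binary cross-entropy for one example, with clipping for stability."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def ips_cvr_loss(clicks, conversions, p_ctr, p_cvr, min_propensity=0.05):
    """IPS-weighted CVR loss.

    Only clicked impressions carry a CVR label; each is weighted by
    1 / pCTR, and the sum is averaged over all impressions.
    """
    n = len(clicks)
    total = 0.0
    for o, r, pc, pv in zip(clicks, conversions, p_ctr, p_cvr):
        if o == 1:
            propensity = max(pc, min_propensity)  # clipping is an assumption
            total += bce(r, pv) / propensity
    return total / n


clicks = [1, 0]
conversions = [1, 0]
p_ctr = [0.5, 0.2]
p_cvr = [0.5, 0.9]
loss = ips_cvr_loss(clicks, conversions, p_ctr, p_cvr)
```

The 1 / pCTR weighting is where the dependence on CTR accuracy comes from: an underestimated pCTR inflates the weight of that sample. In a differentiable framework the propensity would commonly be treated as a constant so that this term does not push gradients into the CTR tower.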

Ma et al. (2018). Entire Space Multi-Task Model (ESMM), SIGIR
Wang et al. (2022). ESCM²: Entire Space Counterfactual Multi-Task Model, SIGIR
Ma et al. (2018). Modeling Task Relationships in Multi-Task Learning with Multi-Gate Mixture-of-Experts (MMoE), KDD