Tae Hyun Kim (Lowell)

Off-Policy Evaluation (OPE)

1 min read #decision-making #ope #doubly-robust

Definition

Estimate the value $V(\pi_e)=E_{\pi_e}[\sum r]$ of a target policy $\pi_e$ from logs collected under a different behavior policy $\pi_b$.

  • Direct Method (DM): model-based plug-in using $\hat Q$ (low variance, but biased when the reward model is misspecified)
  • IPS/IS: $\hat V=\frac{1}{n}\sum_{i=1}^{n} \frac{\pi_e(a_i|x_i)}{\pi_b(a_i|x_i)} r_i$ (unbiased, high variance)
  • Doubly Robust (DR): DM + IPS correction → consistent if either the reward model or the propensity model is correct (Dudík et al.)
  • Variance control: SWITCH/clipping; RL extensions: per-decision IS, MAGIC, DRL.
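
The three estimators above can be sketched for a contextual bandit with a finite action set. This is a minimal illustration, not a reference implementation: the function names, the array layout (`pi_e_all[i, a]` for the target policy's probabilities, `q_hat[i, a]` for the fitted reward model, `pi_b_logged[i]` for the behavior policy's probability of the logged action), and the assumption that the behavior propensities are known are all choices made here for the example.

```python
import numpy as np

def dm_estimate(q_hat, pi_e_all):
    """Direct Method: average over contexts of sum_a pi_e(a|x) * q_hat(x, a)."""
    q_hat = np.asarray(q_hat, dtype=float)
    pi_e_all = np.asarray(pi_e_all, dtype=float)
    return float(np.mean(np.sum(pi_e_all * q_hat, axis=1)))

def ips_estimate(pi_e_all, pi_b_logged, actions, rewards):
    """IPS: average of importance-weighted logged rewards."""
    pi_e_all = np.asarray(pi_e_all, dtype=float)
    actions = np.asarray(actions, dtype=int)
    idx = np.arange(len(actions))
    # Importance weight pi_e(a_i|x_i) / pi_b(a_i|x_i) for each logged action.
    w = pi_e_all[idx, actions] / np.asarray(pi_b_logged, dtype=float)
    return float(np.mean(w * np.asarray(rewards, dtype=float)))

def dr_estimate(q_hat, pi_e_all, pi_b_logged, actions, rewards):
    """Doubly Robust: DM baseline plus an IPS-weighted residual correction."""
    q_hat = np.asarray(q_hat, dtype=float)
    pi_e_all = np.asarray(pi_e_all, dtype=float)
    actions = np.asarray(actions, dtype=int)
    idx = np.arange(len(actions))
    w = pi_e_all[idx, actions] / np.asarray(pi_b_logged, dtype=float)
    baseline = np.sum(pi_e_all * q_hat, axis=1)
    # The residual is zero wherever the reward model fits the logged reward.
    residual = np.asarray(rewards, dtype=float) - q_hat[idx, actions]
    return float(np.mean(baseline + w * residual))
```

With `q_hat` set to all zeros, `dr_estimate` reduces to `ips_estimate`. When `q_hat` matches the logged rewards exactly, the correction term vanishes and it reduces to `dm_estimate`.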

Intuitive Understanding

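“From old-policy logs alone, what value would the new policy have earned?” The answer is only as reliable as the policy overlap (positivity: $\pi_b(a|x) > 0$ wherever $\pi_e(a|x) > 0$), which governs the bias-variance trade-off: weak overlap produces large importance weights and high variance, and taming those weights by clipping introduces bias.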
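
One rough way to see how overlap affects an importance-weighted estimate is to look at the weights themselves. The sketch below uses two hypothetical helpers introduced for this example: `effective_sample_size` computes the Kish effective sample size of the weights (a common diagnostic, not something the cited papers prescribe), and `clipped_ips` caps the weights at an assumed threshold `max_weight`, which lowers variance and in general adds bias.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / np.sum(w ** 2))

def clipped_ips(weights, rewards, max_weight):
    """IPS with importance weights capped at max_weight."""
    w = np.minimum(np.asarray(weights, dtype=float), max_weight)
    return float(np.mean(w * np.asarray(rewards, dtype=float)))
```

Uniform weights give an effective sample size equal to the number of logged samples. When a single weight dominates, it drops toward 1, which signals that the estimate rests on very few observations.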

Key Papers

  • Dudík, Erhan, Langford & Li, “Doubly Robust Policy Evaluation and Optimization”, Statistical Science 29(4):485–511, 2014
  • Uehara, Shi & Kallus, “A Review of Off-Policy Evaluation in RL”, Statistical Science (in press), 2025
