Off-Policy Evaluation (OPE)
Definition
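Estimate the value $V(\pi)$ of a target policy $\pi$ from logs $(x_i, a_i, r_i)$, $i = 1, \dots, n$, collected under a different behavior policy $\pi_b$.
- Direct Method (DM): model-based plug-in using a fitted reward model $\hat{r}(x, a)$, i.e. $\hat{V}_{DM} = \frac{1}{n} \sum_i \sum_a \pi(a \mid x_i)\, \hat{r}(x_i, a)$ (low variance, but biased when the model is misspecified)
- IPS/IS: $\hat{V}_{IPS} = \frac{1}{n} \sum_i \frac{\pi(a_i \mid x_i)}{\pi_b(a_i \mid x_i)}\, r_i$ (unbiased when the propensities are known and overlap holds, high variance)
- Doubly Robust (DR): DM + IPS correction on the residual $r_i - \hat{r}(x_i, a_i)$ → consistent if either the reward model or the propensity model is correct (Dudík et al.)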
- Variance control: SWITCH/clipping; RL extensions: per-decision IS, MAGIC, DRL.
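The three estimators above can be sketched for a contextual bandit as follows. This is a minimal illustration, not a reference implementation: the function names, the log layout `(x, a, r, p_b)` (context, action, reward, logged behavior propensity), and the synthetic two-context, two-action problem are all assumptions made for the example. The reward model is deliberately misspecified (a constant 0.5) to show DM drifting from the true value while IPS and DR stay close to it.

```python
import random


def dm_estimate(contexts, target_probs, reward_model):
    """Direct Method: average the model's predicted reward under the target policy."""
    total = 0.0
    for x in contexts:
        probs = target_probs(x)
        total += sum(p * reward_model(x, a) for a, p in enumerate(probs))
    return total / len(contexts)


def ips_estimate(logs, target_probs, clip=None):
    """IPS: reweight each logged reward by pi(a|x) / pi_b(a|x), optionally clipped."""
    total = 0.0
    for x, a, r, p_b in logs:
        w = target_probs(x)[a] / p_b
        if clip is not None:
            w = min(w, clip)  # clipping lowers variance but introduces bias
        total += w * r
    return total / len(logs)


def dr_estimate(logs, target_probs, reward_model):
    """Doubly Robust: DM baseline plus an importance-weighted residual correction."""
    total = 0.0
    for x, a, r, p_b in logs:
        probs = target_probs(x)
        baseline = sum(p * reward_model(x, k) for k, p in enumerate(probs))
        w = probs[a] / p_b
        total += baseline + w * (r - reward_model(x, a))
    return total / len(logs)


# --- Synthetic problem (every number here is an illustrative assumption) ---
random.seed(0)


def true_reward(x, a):
    # Mean reward is high when the action matches the context.
    return 0.8 if a == x else 0.2


def target_probs(x):
    # Target policy: picks the matching action with probability 0.9.
    return [0.9, 0.1] if x == 0 else [0.1, 0.9]


def bad_model(x, a):
    # Deliberately misspecified reward model.
    return 0.5


logs = []
for _ in range(20000):
    x = random.randrange(2)
    a = random.randrange(2)  # uniform behavior policy, so p_b = 0.5
    r = 1.0 if random.random() < true_reward(x, a) else 0.0
    logs.append((x, a, r, 0.5))

contexts = [x for x, _, _, _ in logs]
true_value = 0.9 * 0.8 + 0.1 * 0.2  # 0.74 by construction

print("true value:", true_value)
print("DM  (bad model):", dm_estimate(contexts, target_probs, bad_model))
print("IPS            :", ips_estimate(logs, target_probs))
print("DR  (bad model):", dr_estimate(logs, target_probs, bad_model))
```

In this setup the behavior propensities are known exactly, so DR remains consistent even though the reward model is wrong; swapping `bad_model` for `true_reward` makes DM exact as well.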
Intuitive Understanding
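“From old-policy logs alone, what value would the new policy have earned?” The answer hinges on policy overlap (positivity): the less often the behavior policy takes the actions the target policy prefers, the larger the importance weights and the higher the variance, while clipping those weights trades variance for bias.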
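The role of overlap can be made concrete with a small calculation. The sketch below assumes a context-free two-action problem and a hypothetical helper `weight_stats`; it computes the exact mean and variance of the importance weight under the behavior policy. With reward fixed at 1, the mean weight equals the expected IPS estimate, so a mean below 1 after clipping shows up directly as bias.

```python
def weight_stats(p_target, p_behavior, clip=None):
    """Exact mean and variance of the importance weight for two actions.

    p_target / p_behavior: probability that each policy picks action 0.
    clip: optional cap on the weight (hypothetical parameter for this sketch).
    """
    weights = [p_target / p_behavior, (1 - p_target) / (1 - p_behavior)]
    if clip is not None:
        weights = [min(w, clip) for w in weights]
    probs = [p_behavior, 1 - p_behavior]  # actions are drawn from the behavior policy
    mean = sum(q * w for q, w in zip(probs, weights))
    var = sum(q * (w - mean) ** 2 for q, w in zip(probs, weights))
    return mean, var


# Shrinking overlap: the target policy prefers action 0 (p = 0.9),
# while the behavior policy takes it less and less often.
for p_b in (0.5, 0.2, 0.05):
    mean, var = weight_stats(0.9, p_b)
    c_mean, c_var = weight_stats(0.9, p_b, clip=5.0)
    print(f"p_b={p_b}: unclipped mean={mean:.3f} var={var:.3f} | "
          f"clipped mean={c_mean:.3f} var={c_var:.3f}")
```

Unclipped weights always average to 1 (no bias) but their variance grows as the behavior policy covers the preferred action less; once the cap binds, the variance drops and the mean falls below 1.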
Related Concepts
- Policy Learning · Doubly Robust Estimator · IPW · Multi-Armed Bandits · Contextual Bandits
- e-process — anytime-valid OPE
Key Papers
- Dudík, Erhan, Langford & Li, “Doubly Robust Policy Evaluation and Optimization”, Statistical Science 29(4):485–511, 2014
- Uehara, Shi & Kallus, “A Review of Off-Policy Evaluation in Reinforcement Learning”, Statistical Science (in press), 2025