Tae Hyun Kim (Lowell)

Decision-Making under Uncertainty

Off-Policy Evaluation (OPE)

2026-06-11 1 min read #decision-making #ope #doubly-robust

Definition

Estimate the value $V(\pi_e)=E_{\pi_e}[\sum r]$ of a target policy $\pi_e$ from logs collected under a different behavior policy $\pi_b$.

Direct Method (DM): model-based plug-in using $\hat Q$ (low variance, but biased when the reward model is misspecified)
IPS/IS: $\hat V=\frac1n\sum_i \frac{\pi_e(a_i|x_i)}{\pi_b(a_i|x_i)} r_i$ (unbiased when $\pi_b$ is known and overlap holds, but high variance)
Doubly Robust (DR): DM + IPS correction → consistent if either the reward model $\hat Q$ or the propensities $\pi_b$ are correct (Dudík et al.)
Variance control: SWITCH/clipping; RL extensions: per-decision IS, MAGIC, DRL.
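
A minimal sketch of the three contextual-bandit estimators above, assuming NumPy and logged data already arranged as arrays. The function and argument names (`dm_estimate`, `ips_estimate`, `dr_estimate`, `q_hat`, `pi_e`, `pi_b_logged`) are illustrative, not taken from any library. The DR form used here is $\hat V_{DR}=\frac1n\sum_i\big[\sum_a \pi_e(a|x_i)\hat Q(x_i,a) + \frac{\pi_e(a_i|x_i)}{\pi_b(a_i|x_i)}(r_i-\hat Q(x_i,a_i))\big]$.

```python
import numpy as np

def dm_estimate(q_hat, pi_e):
    """Direct Method: average the modeled reward under the target policy.

    q_hat[i, a] -- modeled reward for action a in context i (assumed given)
    pi_e[i, a]  -- target-policy probability of action a in context i
    """
    return float(np.mean(np.sum(pi_e * q_hat, axis=1)))

def ips_estimate(pi_e, pi_b_logged, actions, rewards):
    """IPS: importance-weight each logged reward by pi_e / pi_b.

    pi_b_logged[i] -- behavior-policy probability of the logged action
    actions[i]     -- index of the logged action
    """
    idx = np.arange(len(actions))
    weights = pi_e[idx, actions] / pi_b_logged
    return float(np.mean(weights * rewards))

def dr_estimate(q_hat, pi_e, pi_b_logged, actions, rewards):
    """Doubly Robust: DM baseline plus an IPS-weighted residual correction."""
    idx = np.arange(len(actions))
    weights = pi_e[idx, actions] / pi_b_logged
    baseline = np.sum(pi_e * q_hat, axis=1)
    correction = weights * (rewards - q_hat[idx, actions])
    return float(np.mean(baseline + correction))

# Toy log: two contexts, two actions (numbers are made up for illustration).
pi_e = np.array([[0.5, 0.5], [1.0, 0.0]])
q_hat = np.array([[1.0, 0.0], [2.0, 0.0]])
actions = np.array([0, 0])
pi_b_logged = np.array([0.5, 0.5])
rewards = np.array([1.0, 3.0])

v_dm = dm_estimate(q_hat, pi_e)
v_ips = ips_estimate(pi_e, pi_b_logged, actions, rewards)
v_dr = dr_estimate(q_hat, pi_e, pi_b_logged, actions, rewards)
```

With $\hat Q \equiv 0$ the DR estimate reduces to IPS, and with a perfect $\hat Q$ the correction term has mean zero, which is one way to see where the "either one correct" property comes from.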

Intuitive Understanding

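“From old-policy logs alone, what value would the new policy have earned?” Policy overlap (positivity) governs the bias-variance trade-off: where $\pi_b$ rarely takes the actions $\pi_e$ prefers, importance weights blow up and variance grows, and taming them (e.g. by clipping) introduces bias.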
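
To make the overlap point concrete, here is a small sketch, assuming NumPy, of two common diagnostics: the effective sample size of the importance weights and a clipped IPS estimate. The helper names and the `max_weight` threshold are hypothetical choices for illustration.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2.

    Equals n when all weights are equal and approaches 1 when a single
    weight dominates, i.e. when overlap is poor.
    """
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights) ** 2 / np.sum(weights ** 2))

def clipped_ips(weights, rewards, max_weight):
    """IPS with weights capped at max_weight (lower variance, some bias)."""
    clipped = np.minimum(np.asarray(weights, dtype=float), max_weight)
    return float(np.mean(clipped * np.asarray(rewards, dtype=float)))

# Good overlap: uniform weights. Poor overlap: one sample carries everything.
ess_good = effective_sample_size([1.0, 1.0, 1.0, 1.0])
ess_poor = effective_sample_size([4.0, 0.0, 0.0, 0.0])

# Clipping the dominant weight shrinks the estimate toward zero (bias).
v_unclipped = clipped_ips([4.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], max_weight=np.inf)
v_clipped = clipped_ips([4.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], max_weight=2.0)
```

A low effective sample size relative to the log size is a hint that the IPS estimate rests on a handful of samples; clipping stabilizes it at the cost of systematically underweighting exactly those samples.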

Policy Learning · Doubly Robust Estimator · IPW · Multi-Armed Bandits · Contextual Bandits
e-process — anytime-valid OPE

Key Papers

Dudík, Erhan, Langford & Li, “Doubly Robust Policy Evaluation and Optimization”, Statistical Science 29(4):485–511, 2014
Uehara, Shi & Kallus, “A Review of Off-Policy Evaluation in RL”, Statistical Science (in press), 2025
