Tae Hyun Kim (Lowell)


Decision-Making under Uncertainty

Off-Policy Evaluation (OPE)

2026-06-11 1 min read #decision-making #ope #doubly-robust

Definition

Estimate the value $V(\pi_e)=E_{\pi_e}[\sum_t r_t]$ of a target policy $\pi_e$ using only logs collected under a different behavior policy $\pi_b$.

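Direct Method (DM): plug-in estimate from a reward model $\hat Q$ (low variance, but biased when the model is misspecified)
IPS/IS: $\hat V_{IPS}=\frac1n\sum_{i=1}^n \frac{\pi_e(a_i|x_i)}{\pi_b(a_i|x_i)} r_i$ (unbiased when $\pi_b$ is known and overlap holds; high variance)
Doubly Robust (DR): DM plus an IPS correction on the model residual, $\hat V_{DR}=\frac1n\sum_{i=1}^n \left[\hat Q(x_i,\pi_e)+\frac{\pi_e(a_i|x_i)}{\pi_b(a_i|x_i)}\,(r_i-\hat Q(x_i,a_i))\right]$ with $\hat Q(x,\pi_e)=\sum_a \pi_e(a|x)\hat Q(x,a)$; consistent if either the reward model or the propensities are correct (Dudík et al.)
Variance control: SWITCH, weight clipping. RL extensions: per-decision IS, MAGIC, DRL.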
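The three estimators listed above can be compared on a toy contextual bandit. The following is a minimal sketch, not a reference implementation: the binary context, the reward table `TRUE_Q`, both policies, and the deliberately misspecified constant model `q_hat` are all hypothetical choices made for illustration, and the propensities of `pi_b` are assumed to be known exactly.

```python
import random

ACTIONS = (0, 1)
# Hypothetical true mean rewards q(x, a): binary context, two actions.
TRUE_Q = {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.6, (1, 1): 0.4}

def pi_b(a, x):
    """Behavior policy (assumed known): favors action 0 in every context."""
    return 0.7 if a == 0 else 0.3

def pi_e(a, x):
    """Target policy: mostly plays the better action for each context."""
    return 0.9 if a == 1 - x else 0.1

def q_hat(x, a):
    """Deliberately misspecified reward model: ignores both x and a."""
    return 0.5

def collect_logs(n, rng):
    """Sample (x, a, r) tuples with x uniform, a ~ pi_b, r ~ Bernoulli(q(x, a))."""
    logs = []
    for _ in range(n):
        x = rng.randrange(2)
        a = 0 if rng.random() < pi_b(0, x) else 1
        r = 1.0 if rng.random() < TRUE_Q[(x, a)] else 0.0
        logs.append((x, a, r))
    return logs

def model_value(x):
    """Model-based value of pi_e at context x: sum_a pi_e(a|x) * q_hat(x, a)."""
    return sum(pi_e(a, x) * q_hat(x, a) for a in ACTIONS)

def direct_method(logs):
    return sum(model_value(x) for x, _, _ in logs) / len(logs)

def ips(logs):
    return sum(pi_e(a, x) / pi_b(a, x) * r for x, a, r in logs) / len(logs)

def doubly_robust(logs):
    # Model baseline plus an importance-weighted correction on the residual.
    return sum(
        model_value(x) + pi_e(a, x) / pi_b(a, x) * (r - q_hat(x, a))
        for x, a, r in logs
    ) / len(logs)

# Ground truth under the uniform context distribution used in collect_logs.
true_value = sum(0.5 * pi_e(a, x) * TRUE_Q[(x, a)] for x in (0, 1) for a in ACTIONS)

logs = collect_logs(20_000, random.Random(0))
v_dm, v_ips, v_dr = direct_method(logs), ips(logs), doubly_robust(logs)
print(f"true={true_value:.3f}  DM={v_dm:.3f}  IPS={v_ips:.3f}  DR={v_dr:.3f}")
```

Because `q_hat` is a constant, DM returns 0.5 regardless of the data, while IPS and DR should land near the true value since the propensities are correct. Replacing `q_hat` with a lookup into `TRUE_Q` would make DM accurate as well, which is the "either one is enough" property of DR seen from the other side.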

Intuition

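“Using only logs from the old policy, how much would the new policy have earned?” Policy overlap (positivity) governs the bias-variance trade-off: where $\pi_b$ rarely takes the actions $\pi_e$ prefers, importance weights blow up (variance), or the estimator has to lean on clipping or the reward model (bias).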
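The role of overlap can be checked numerically. The sketch below assumes a hypothetical two-action setting in which the target policy always plays action 1 (reward Bernoulli(0.5), so the true value is 0.5) and the behavior policy plays it with probability `p_b`. It repeats the IPS estimate many times to compare the spread under good and poor overlap, and with weights clipped at an arbitrary threshold of 5.

```python
import random
import statistics

TRUE_VALUE = 0.5  # target always plays action 1, whose reward is Bernoulli(0.5)

def ips_once(p_b, n, clip, rng):
    """One IPS estimate from n logged rounds; weights are capped at `clip`."""
    total = 0.0
    for _ in range(n):
        if rng.random() < p_b:  # behavior policy happened to play action 1
            reward = 1.0 if rng.random() < TRUE_VALUE else 0.0
            weight = min(1.0 / p_b, clip)  # pi_e(1) / pi_b(1), optionally clipped
            total += weight * reward
        # Action 0 has pi_e = 0, so those rounds contribute nothing.
    return total / n

def summarize(p_b, clip=float("inf"), n=200, reps=500, seed=0):
    """Mean and standard deviation of the estimate over repeated datasets."""
    rng = random.Random(seed)
    estimates = [ips_once(p_b, n, clip, rng) for _ in range(reps)]
    return statistics.mean(estimates), statistics.stdev(estimates)

good = summarize(p_b=0.5)               # strong overlap
poor = summarize(p_b=0.05)              # weak overlap, weights of 20
clipped = summarize(p_b=0.05, clip=5.0) # weak overlap, weights capped at 5

for name, (mean, sd) in [("good", good), ("poor", poor), ("clipped", clipped)]:
    print(f"{name:8s} mean={mean:.3f}  sd={sd:.3f}")
```

With weak overlap the unclipped estimate stays centered on the true value but its spread grows; clipping shrinks the spread again at the price of a downward bias, since the capped weights no longer undo the sampling probabilities.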

Related concepts

Policy Learning · Doubly Robust Estimator · IPW · Multi-Armed Bandits · Contextual Bandits
e-process — anytime-valid OPE

References

Dudík, Erhan, Langford & Li, “Doubly Robust Policy Evaluation and Optimization”, Statistical Science 29(4):485–511, 2014
Uehara, Shi & Kallus, “A Review of Off-Policy Evaluation in Reinforcement Learning”, Statistical Science (in press), 2025
