#decision-making

13 notes

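Anytime-Valid OPE Anytime-valid off-policy evaluation: provides time-uniform confidence sequences for the off-policy value that remain valid at any stopping time; built on e-processes and confidence sequences.
Contextual Bandits Contextual bandits are a multi-armed bandit problem in which the optimal action (arm) depends on the context.
Decision-Making Overview Methodology for (sequential) decision-making under uncertainty: from bandit regret through RL and off-policy evaluation to dynamic/optimal treatment regimes. It underpins personalization in both clinical (DTR/OTR) and industrial (targeting, bidding) settings.
Dynamic Treatment Regimes (DTR / OTR) A DTR is a sequence of decision rules $\{d_t(H_t)\}_{t=1}^T$ that maps the accumulated history $H_t$ (covariates, prior treatments, intermediate outcomes) to a treatment. The optimal treatment regime (OTR) maximizes the expected long-term outcome $E[Y^{d}]$. Estimation: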
From Estimation to Action — How HTE Drives Personalized Policy Across Domains One methodological spine — estimate heterogeneous treatment effects and turn them into individual-level policies — powers both clinical sequential treatment decisions and industrial targeting, pricing, and recommendation.
Marketing Attribution at Scale — From Simulation to Causal Inference A case study comparing 10+ multi-touch attribution methods against a known-ground-truth simulator, then scaling them on the public Criteo dataset, closing the loop with budget off-policy evaluation for channel allocation.
MDP (Markov Decision Process) A Markov decision process (MDP) is a mathematical framework for sequential decision-making problems.
Multi-Armed Bandits With $K$ arms, each round $t$ pulls an arm $A_t$ and observes a reward. Minimize cumulative regret:
Off-Policy Evaluation (OPE) Estimates the value $V(\pi_e)=E_{\pi_e}[\sum r]$ of a target policy $\pi_e$ from logs collected under a different behavior policy $\pi_b$.
Policy Trees Policy trees are an interpretable policy learning method proposed by Athey & Wager (2021).
RTB Bidding Strategy via Causal ML — From Prediction to Optimization A five-stage case study on the public iPinYou RTB dataset that moves from pCTR/pCVR prediction through causal effect estimation (CATE, SCM) to budget-constrained optimal bidding and off-policy policy evaluation.
Sequential and Adaptive Decision-Making — From Bandits to Dynamic Treatment Regimes A synthesis essay tracing one methodological spine through sequential decision-making under uncertainty — exploration–exploitation in bandits, off-policy evaluation, and optimal/dynamic treatment regimes — that powers clinical adaptive trials and real-time bidding alike.
Thompson Sampling Thompson Sampling is a Bayesian approach to balancing exploration and exploitation.
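
To make the Thompson Sampling entry concrete, here is a minimal sketch for the Bernoulli-reward case. It assumes uniform Beta(1, 1) priors on each arm; the function name `thompson_sampling` and the arm success probabilities are hypothetical, chosen only for illustration and not taken from the note itself.

```python
import random


def thompson_sampling(true_probs, n_rounds, seed=0):
    """Beta-Bernoulli Thompson Sampling on a simulated K-armed bandit.

    true_probs: hypothetical success probability of each arm (simulation only).
    Returns the number of times each arm was pulled.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    successes = [0] * k  # observed rewards of 1 per arm
    failures = [0] * k   # observed rewards of 0 per arm
    pulls = [0] * k

    for _ in range(n_rounds):
        # Draw one sample per arm from its Beta posterior (Beta(1, 1) prior).
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)
        ]
        # Play the arm whose sampled mean is largest.
        arm = max(range(k), key=lambda a: samples[a])
        # Simulate a Bernoulli reward and update that arm's posterior counts.
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1

    return pulls


pulls = thompson_sampling([0.2, 0.5, 0.8], n_rounds=2000)
print(pulls)
```

Early rounds spread pulls across arms because the posteriors are wide (exploration); as evidence accumulates, samples from the best arm dominate and it receives most of the pulls (exploitation).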