Decision-Making under Uncertainty
bandits · RL · OPE · DTR/OTR · policy learning
Turning estimated effects into decisions — optimal policy learning, bandits and reinforcement learning, off-policy evaluation, and dynamic / optimal treatment regimes.
20 notes
-
From Estimation to Action — How HTE Drives Personalized Policy Across Domains
One methodological spine — estimate heterogeneous treatment effects and turn them into individual-level policies — powers both clinical sequential treatment decisions and industrial targeting, pricing, and recommendation.
-
Marketing Attribution at Scale — From Simulation to Causal Inference
A case study comparing 10+ multi-touch attribution methods against a known-ground-truth simulator, then scaling them on the public Criteo dataset, closing the loop with budget off-policy evaluation for channel allocation.
-
RTB Bidding Strategy via Causal ML — From Prediction to Optimization
A five-stage case study on the public iPinYou RTB dataset that moves from pCTR/pCVR prediction through causal effect estimation (CATE, SCM) to budget-constrained optimal bidding and off-policy policy evaluation.
-
Sequential and Adaptive Decision-Making — From Bandits to Dynamic Treatment Regimes
A synthesis essay tracing one methodological spine through sequential decision-making under uncertainty — exploration–exploitation in bandits, off-policy evaluation, and optimal/dynamic treatment regimes — that powers clinical adaptive trials and real-time bidding alike.
-
Anytime-Valid Inference Overview
Game-theoretic statistics that resolves the "peeking" problem of fixed-sample hypothesis testing. The mathematical foundation for real-time monitoring of identification-validity drift.
-
Anytime-Valid OPE
Off-policy evaluation with time-uniform confidence sequences for the policy value, valid at any stopping time; built on e-processes.
-
Confidence Sequence
A confidence sequence (CS) $(C_t)_{t\ge 1}$ is a sequence of confidence intervals with time-uniform coverage:
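In the usual formulation, with $\theta$ the target parameter and $\alpha\in(0,1)$ the error level (both symbols are assumed here for illustration), time-uniform coverage can be written as:

$$P\left(\forall t\ge 1:\ \theta\in C_t\right)\ \ge\ 1-\alpha.$$

A fixed-sample confidence interval, by contrast, only guarantees $P(\theta\in C_t)\ge 1-\alpha$ for each fixed $t$ separately.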
-
Decision-Making Overview
Methods for (sequential) decision-making under uncertainty — from bandit regret to RL, off-policy evaluation, and dynamic/optimal treatment regimes. Underpins both clinical (DTR/OTR) and industrial (targeting, bidding) personalization.
-
Dynamic Treatment Regimes (DTR / OTR)
A DTR is a sequence of decision rules $\{d_t(H_t)\}_{t=1}^T$ mapping the accumulated history $H_t$ (covariates, prior treatments, intermediate outcomes) to a treatment. The optimal treatment regime (OTR) maximizes the expected long-term outcome $E[Y^{d}]$. Estimation:
-
e-process (e-value)
An e-value $E$ is a nonnegative random variable with $E_P[E]\le 1$ for all $P\in H_0$, i.e. under the null $H_0$. An e-process $(E_t)$ is a nonnegative process such that $E_\tau$ is an e-value at any stopping time $\tau$ ($E[E_\tau]\le 1$), typically a nonnegative supermartingale under the null.…
-
Multi-Armed Bandits
$K$ arms; at each round $t$ pull $A_t$ and observe a reward. Minimize cumulative regret:
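One common form, assuming stationary arms with mean rewards $\mu_1,\dots,\mu_K$, best mean $\mu^*=\max_a \mu_a$, and horizon $T$ (notation assumed here for illustration), is the expected cumulative regret:

$$R_T \;=\; T\mu^* \;-\; E\left[\sum_{t=1}^{T}\mu_{A_t}\right].$$

Under this definition, a policy that always pulls the best arm has $R_T=0$, and a good algorithm keeps $R_T$ growing sublinearly in $T$.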
-
Off-Policy Evaluation (OPE)
Estimate the value $V(\pi_e)=E_{\pi_e}[\sum r]$ of a target policy $\pi_e$ from logs collected under a different behavior policy $\pi_b$.
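As a rough illustration, the simplest estimator of this kind is inverse propensity scoring (IPS), sketched below for the single-step (contextual bandit) case. The function name `ips_value` and the toy log are hypothetical; the sketch assumes the logged action probabilities under both policies are known and that the behavior policy gave every logged action positive probability.

```python
def ips_value(rewards, target_probs, behavior_probs):
    """IPS estimate of the target policy's value from single-step logs.

    rewards[i]        : reward observed for the logged action in round i
    target_probs[i]   : probability the target policy assigns to that action
    behavior_probs[i] : probability the behavior policy assigned to it (> 0)
    """
    if not (len(rewards) == len(target_probs) == len(behavior_probs)):
        raise ValueError("all inputs must have the same length")
    if not rewards:
        raise ValueError("need at least one logged round")

    total = 0.0
    for r, p_e, p_b in zip(rewards, target_probs, behavior_probs):
        if p_b <= 0:
            raise ValueError("behavior probability must be positive")
        # Reweight each reward by how much more (or less) likely the
        # target policy is to take the logged action.
        total += (p_e / p_b) * r
    return total / len(rewards)


# Toy log (made-up numbers): uniform behavior policy over two actions.
rewards = [1.0, 0.0, 1.0, 1.0]
behavior_probs = [0.5, 0.5, 0.5, 0.5]
target_probs = [0.8, 0.2, 0.8, 0.2]
estimate = ips_value(rewards, target_probs, behavior_probs)
```

IPS is unbiased under these assumptions but can have high variance when the two policies differ a lot; clipped, self-normalized, and doubly robust variants trade some bias for lower variance.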
-
A/B Testing
A/B testing is the online application of the randomized controlled trial (RCT), estimating causal effects by randomly exposing two or more variants to users.
-
Contextual Bandits
Contextual bandits are a variant of the multi-armed bandit problem in which the optimal action (arm) depends on the observed context.
-
CUPED
CUPED (Controlled-experiment Using Pre-Experiment Data) is a technique that leverages pre-experiment data to reduce the variance of A/B tests.
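A minimal sketch of the adjustment, assuming a single pre-experiment covariate per user (for example, the same metric measured before the experiment). The function name `cuped_adjust` and the numbers are hypothetical; in practice the coefficient is usually estimated on data pooled across arms and the adjusted metric is then compared between arms.

```python
from statistics import mean, pvariance


def cuped_adjust(y, x):
    """Return the CUPED-adjusted metric y - theta * (x - mean(x)).

    y : in-experiment metric per user
    x : pre-experiment covariate per user (same order as y)
    theta = cov(y, x) / var(x) minimizes the variance of the adjusted metric.
    """
    if len(y) != len(x):
        raise ValueError("y and x must have the same length")
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    if var_x == 0:
        # A constant covariate carries no information; leave y unchanged.
        return list(y)
    theta = cov_xy / var_x
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Made-up data: the pre-period metric is strongly correlated with the outcome.
pre = [10.0, 12.0, 9.0, 15.0, 11.0]
post = [11.0, 13.0, 9.0, 16.0, 12.0]
adjusted = cuped_adjust(post, pre)

# The mean is preserved while the variance shrinks.
mean_shift = abs(mean(adjusted) - mean(post))
variance_before = pvariance(post)
variance_after = pvariance(adjusted)
```

Because the covariate is centered, the adjustment leaves the sample mean unchanged; the variance reduction grows with the squared correlation between the covariate and the metric.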
-
Design Effect
The Design Effect (DEFF) is the ratio of an estimator's variance under a complex sampling design to its variance under simple random sampling with the same sample size.
-
MDP (Markov Decision Process)
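A Markov Decision Process (MDP) is a mathematical framework for sequential decision-making, defined by states, actions, transition probabilities, and rewards, in which the next state depends only on the current state and action.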
-
Policy Trees
Policy Trees, proposed by Athey & Wager (2021), are an interpretable policy-learning method.
-
Statistical Power
Statistical power is the probability that a test rejects the null hypothesis when an effect of a given size truly exists, i.e. $1-\beta$ for type II error rate $\beta$.
-
Thompson Sampling
Thompson Sampling is a Bayesian approach that balances exploration and exploitation by sampling from the posterior over each arm's reward and playing the arm whose sample is best.
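A minimal sketch for the Bernoulli-reward case, assuming independent uniform Beta(1, 1) priors on each arm's success rate. The function name `thompson_bernoulli` and the simulated `true_rates` are hypothetical and exist only to drive the toy simulation.

```python
import random


def thompson_bernoulli(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling; return pull counts per arm."""
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k  # observed rewards of 1 per arm
    failures = [0] * k   # observed rewards of 0 per arm
    pulls = [0] * k

    for _ in range(rounds):
        # Draw one plausible success rate per arm from its Beta posterior.
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1) for a in range(k)
        ]
        # Play the arm whose sampled rate is highest.
        arm = max(range(k), key=samples.__getitem__)
        # Simulated environment: Bernoulli reward with the arm's true rate.
        reward = 1 if rng.random() < true_rates[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_bernoulli(true_rates=[0.2, 0.5, 0.8], rounds=2000)
```

Arms with uncertain posteriors still produce high samples occasionally, so they keep being explored; as evidence accumulates, the posterior of the best arm concentrates and it tends to receive most of the pulls.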