Sequential and Adaptive Decision-Making — From Bandits to Dynamic Treatment Regimes

Most real-world decisions do not end after a single shot. A patient’s first-line therapy changes their state next month, and that state in turn redefines the options for second-line therapy. A single ad bid shaves the budget, and the remaining budget decides the headroom of every later bid. Decisions cascade along the time axis, and — more subtly — each choice changes what we will get to learn next. In such a world, a static policy that is learned once and frozen breaks down quickly. This essay follows a single spine — from bandits, through off-policy evaluation, to dynamic treatment regimes — and shows why clinical adaptive trials and real-time bidding share the same mathematics.

The Unifying Idea: When Learning and Acting Are Entangled

Static supervised learning carries an implicit premise: “the data are given exogenously, and the model passively fits them.” Sequential decision-making breaks that premise. The action we choose generates the next data. Two essential difficulties follow.

The exploration–exploitation tension. Repeating only the action that currently looks best (exploit) may mean never discovering a better alternative; trying unfamiliar actions to gather information (explore) sacrifices short-term reward.
Distribution shift and the counterfactual. We observe outcomes only for the actions we actually tried; the outcomes of untried actions are missing forever. This makes off-policy evaluation and causal inference two faces of the same problem — “estimating the value of a different policy from observed logs” is fundamentally a counterfactual question.

A one-line unifying principle runs through both: quantify uncertainty, explore in proportion to that uncertainty, and explicitly model the structure by which actions reshape the future. This principle threads the entire method arc below.

The Method Arc

1. Bandits — Exploration–Exploitation in Its Purest Form

The multi-armed bandit (MAB) is the leanest distillation of sequential decision-making. Each round, you pull one of $K$ arms, observe a reward, and minimize cumulative regret

$R_T = T\mu^\* - \mathbb{E}\Big[\sum_{t=1}^T \mu_{A_t}\Big],\qquad \mu^\* = \max_k \mu_k.$

There is no state transition and no context here — only the explore–exploit balance. Two canonical strategies resolve it.

UCB (Upper Confidence Bound). “Optimism in the face of uncertainty.” Add a confidence bonus to each arm’s estimated mean and pull $A_t = \arg\max_k(\hat\mu_k + \sqrt{2\log t / N_k})$ . This deterministic rule attains $O(\log T)$ regret, meeting the Lai–Robbins lower bound.
Thompson Sampling. The Bayesian view — draw a sample from the posterior over each arm’s reward parameter and pull the arm with the largest draw. This is probability matching: “select in proportion to the posterior probability that an arm is optimal,” so uncertain arms occasionally draw large samples and get explored naturally. It is simple to implement, empirically robust, and widely used in industry.

Add context and you get the contextual bandit. Each round you first observe a context $x_t$ (patient features, user profile) and learn the best action conditional on that context. Regret becomes $\text{Regret}_T = \sum_t [r^*(x_t) - r(x_t, a_t)]$ , modeling reward as a function of context (LinUCB — linear assumption plus UCB — is the canonical example). Already at this step the decision is personalization: it asks “what is optimal for this context.”

2. When the State Reacts to the Action — MDP

The decisive limitation of bandits is that each decision is independent. In reality, today’s action changes tomorrow’s state — inventory drops, a patient’s disease progresses, a budget is depleted. The framework that makes this dynamics explicit is the MDP (Markov Decision Process). A 5-tuple $(S, A, P, R, \gamma)$ specifies state, action, transition probability, reward, and discount, and the Bellman optimality equation

$V^*(s) = \max_a \Big[ R(s,a) + \gamma \sum_{s'} P(s'|s,a)\, V^*(s') \Big]$

defines long-run value. A contextual bandit is essentially a special case of an MDP with no transitions ( $\gamma=0$ , or a single step), and conversely an MDP generalizes the bandit by adding temporal coupling. This is where the territory of reinforcement learning (RL) begins.

3. Re-evaluating a Path Already Taken — Off-Policy Evaluation

Testing a sequential policy in the live environment is expensive, and in clinical settings it can be unethical. So the central question shifts: using only logs already collected (under a behavior policy $\pi_b$ ), can we estimate the value of a new policy $\pi_e$ that has not yet been deployed? This is off-policy evaluation (OPE), and three canonical estimators carve out the bias–variance spectrum.

Direct Method (DM). Fit a reward model $\hat Q$ and plug in. Low variance, but vulnerable to model misspecification.
Inverse Propensity Scoring (IPS). Importance-weight, $\hat V = \frac1n\sum \frac{\pi_e(a|x)}{\pi_b(a|x)} r$ . Unbiased, but high variance when weights explode.
Doubly Robust (DR). Combine the DM plug-in with an IPS correction — consistent if either the reward model or the propensity model is correct. This is the policy-evaluation counterpart of the doubly-robust estimator and AIPW.

Here causal inference and decision-making meet formally. OPE’s key assumption, policy overlap, is the same condition as positivity in causal identification, and estimating the missing outcomes of untried actions is exactly counterfactual reasoning. Pushing policy from evaluation to learning leads to policy learning: take estimated individual-level effects as input and learn an interpretable policy rule in decision-tree form, $\pi^*(x)=\arg\max_a E[Y(a)\mid X=x]$ , maximizing policy value $V(\pi)=E[\tau(X)\,\mathbf 1\{\pi(X)=1\}]$ (Athey–Wager). Unlike a black-box effect estimate, this can explain “why this action for this context” to a regulatory or clinical reviewer.

4. Personalization Along the Time Axis — Dynamic Treatment Regimes

The final step integrates the pieces above into a clinical time series. A dynamic treatment regime (DTR) is a sequence of decision rules $\{d_t(H_t)\}_{t=1}^T$ mapping accumulated history $H_t$ (covariates, prior treatments, intermediate outcomes) to a treatment, and the optimal treatment regime (OTR) is the rule sequence maximizing the expected long-term outcome $E[Y^d]$ . It decides “what to do next given everything so far” so as to optimize the final outcome — essentially personalization extended along time. The estimation tools sit at the confluence of RL and causal inference.

Q-learning — fit the Q-function by backward-induction regression.
A-learning — model only the contrast/advantage, robust to baseline misspecification.
G-estimation — Robins’ structural nested mean model.
OWL (outcome-weighted learning) — recast policy learning as weighted classification.

The key identification assumption is sequential ignorability (at each time, conditional on observed history, treatment is independent of future potential outcomes), generalizing single-timepoint no-unmeasured-confounding along the time axis. In the end, the DTR holds the bandit’s explore–exploit, the MDP’s temporal coupling, and OPE/policy learning’s counterfactual evaluation all inside a single sequential clinical decision.

Why It Matters Across Domains

The practical payoff is that this spine is singular. The surface domains differ, but the mathematical core is the same.

Clinical adaptive trials. Patients enroll sequentially, and interim outcomes update the next assignment probabilities. Response-adaptive randomization is effectively a bandit, multi-stage treatment optimization is DTR/OTR, and evaluating a new protocol from existing trial logs before exposing patients is OPE.
Real-time bidding (RTB). Ad auctions flow tens of thousands per second, and a bid is set conditional on context (user, page, time) — a contextual/linear bandit. The budget is a state variable depleted over time, so the MDP’s temporal coupling kicks in, and the value of a new bidding policy before deployment is estimated by budget OPE.

The same four stages — exploration–exploitation, state dynamics, off-policy evaluation, and personalization along time — underpin both domains. A regret bound or a doubly-robust estimator obtained on one side transfers directly to the other. This is the methodological basis for viewing “personalized decision-making under uncertainty” as a duality of the clinical and the industrial.

One honest caveat: all of these guarantees (regret bounds, consistency, coverage) depend on assumptions — overlap/positivity, sequential ignorability, model fit. When the assumptions break, the estimates are silently biased. That is why distribution-free guarantees (anytime-valid OPE, conformal-style methods) and sensitivity analysis are not add-ons but a necessary safety layer when deploying sequential decision-making in practice.

Decision-Making Overview — the hub for the full landscape this essay compresses
Multi-Armed Bandits · Contextual Bandits · Thompson Sampling — exploration–exploitation
MDP — sequential decisions where the state reacts to the action
Off-Policy Evaluation · Policy Trees — counterfactual policy evaluation and learning
Dynamic Treatment Regimes — personalized treatment rules along the time axis

Local graph