Multi-Armed Bandits · Tae Hyun Kim (Lowell)

Definition

$K$ arms; at each round $t$ pull $A_t$ and observe a reward. Minimize cumulative regret: $R_T=T\mu^\*-\mathbb{E}\Big[\sum_{t=1}^T \mu_{A_t}\Big],\quad \mu^\*=\max_k\mu_k.$ Balance explore–exploit. UCB (optimism): $A_t=\arg\max_k\big(\hat\mu_k+\sqrt{2\log t/N_k}\big)$ → $O\!\big(\sum_{k:\Delta_k>0}\log T/\Delta_k\big)$ regret, attaining the Lai–Robbins lower bound. Thompson Sampling: sample from the posterior. Extensions: contextual/linear bandits.

Intuitive Understanding

The optimal balance between “try an uncertain arm more (explore) vs. pull the best one so far (exploit).” Concentration and minimax arguments are the core tools of regret theory.

Contextual Bandits · Thompson Sampling · MDP (→ bridge to RL) · Off-Policy Evaluation

Key Papers

Lattimore & Szepesvári, Bandit Algorithms, Cambridge UP 2020 (free PDF) — the full bandit curriculum
Russo, Van Roy, Kazerouni, Osband & Wen, “A Tutorial on Thompson Sampling”, FnT in ML 11(1), 2018

Definition

Intuitive Understanding

Related Concepts

Key Papers

Local graph