Multi-Armed Bandits
Definition
arms; at each round pull and observe a reward. Minimize cumulative regret: Balance explore–exploit. UCB (optimism): → regret, attaining the Lai–Robbins lower bound. Thompson Sampling: sample from the posterior. Extensions: contextual/linear bandits.
Intuitive Understanding
The optimal balance between “try an uncertain arm more (explore) vs. pull the best one so far (exploit).” Concentration and minimax arguments are the core tools of regret theory.
Related Concepts
- Contextual Bandits · Thompson Sampling · MDP (→ bridge to RL) · Off-Policy Evaluation
Key Papers
- Lattimore & Szepesvári, Bandit Algorithms, Cambridge UP 2020 (free PDF) — the full bandit curriculum
- Russo, Van Roy, Kazerouni, Osband & Wen, “A Tutorial on Thompson Sampling”, FnT in ML 11(1), 2018