Tae Hyun Kim (Lowell)

Multi-Armed Bandits

1 min read #decision-making#bandits

Definition

KK arms; at each round tt pull AtA_t and observe a reward. Minimize cumulative regret: RT=Tμ\*E[t=1TμAt],μ\*=maxkμk.R_T=T\mu^\*-\mathbb{E}\Big[\sum_{t=1}^T \mu_{A_t}\Big],\quad \mu^\*=\max_k\mu_k. Balance explore–exploit. UCB (optimism): At=argmaxk(μ^k+2logt/Nk)A_t=\arg\max_k\big(\hat\mu_k+\sqrt{2\log t/N_k}\big)O ⁣(k:Δk>0logT/Δk)O\!\big(\sum_{k:\Delta_k>0}\log T/\Delta_k\big) regret, attaining the Lai–Robbins lower bound. Thompson Sampling: sample from the posterior. Extensions: contextual/linear bandits.

Intuitive Understanding

The optimal balance between “try an uncertain arm more (explore) vs. pull the best one so far (exploit).” Concentration and minimax arguments are the core tools of regret theory.

Key Papers

  • Lattimore & Szepesvári, Bandit Algorithms, Cambridge UP 2020 (free PDF) — the full bandit curriculum
  • Russo, Van Roy, Kazerouni, Osband & Wen, “A Tutorial on Thompson Sampling”, FnT in ML 11(1), 2018

Local graph