Tae Hyun Kim (Lowell)

Thompson Sampling

4 min read #decision-making#bandits

Definition

Thompson Sampling is a Bayesian approach that balances exploration and exploitation.

At each round:

  1. Sample from the posterior distribution of each action’s reward
  2. Select the action with the highest sampled value
  3. Update the posterior distribution with the observed outcome

at=argmaxaθ~a,where θ~aP(θahistory)a_t = \arg\max_a \tilde{\theta}_a, \quad \text{where } \tilde{\theta}_a \sim P(\theta_a | \text{history})

Intuitive Understanding

Thompson Sampling performs “probability matching.”

It selects action aa in proportion to the posterior probability that aa is optimal. Uncertain actions are explored because they occasionally draw high samples, while actions known to be good are exploited because they frequently draw high samples.

Key Properties

Binomial Rewards with the Beta Distribution

For conversion rates (0/1 rewards):

  • Prior: θaBeta(α,β)\theta_a \sim \text{Beta}(\alpha, \beta)
  • Posterior after observation: θadataBeta(α+successes,β+failures)\theta_a | \text{data} \sim \text{Beta}(\alpha + \text{successes}, \beta + \text{failures})

Comparison with UCB

AlgorithmApproachCharacteristics
UCBUses upper confidence boundsDeterministic, optimistic exploration
Thompson SamplingSamples from the posterior distributionStochastic, natural exploration

Empirically, Thompson Sampling often performs better.

Advantages

  1. Simplicity: Only maintains a posterior distribution and performs sampling
  2. Efficient exploration: Exploration proportional to uncertainty
  3. Flexibility: Applicable to a variety of posterior distributions
  4. Strong empirical performance: Excellent results across many applications

Example

Price Optimization Implementation

import numpy as np
from scipy import stats

class ThompsonSamplingPricing:
    def __init__(self, price_levels, marginal_cost):
        self.price_levels = price_levels
        self.MC = marginal_cost
        self.n_prices = len(price_levels)

        # 각 가격에서의 전환율에 대한 베타 사전 분포
        self.alpha = np.ones(self.n_prices)  # 성공 (구매)
        self.beta = np.ones(self.n_prices)   # 실패 (미구매)

    def select_price(self):
        """베타 분포에서 샘플링하여 가격 선택"""
        sampled_rates = np.array([
            stats.beta.rvs(self.alpha[i], self.beta[i])
            for i in range(self.n_prices)
        ])

        # 기대 이익 = (가격 - MC) × 전환율
        expected_profits = (self.price_levels - self.MC) * sampled_rates
        best_idx = np.argmax(expected_profits)

        return best_idx, self.price_levels[best_idx]

    def update(self, price_idx, converted):
        """관측 결과로 사후 분포 업데이트"""
        if converted:
            self.alpha[price_idx] += 1
        else:
            self.beta[price_idx] += 1

    def get_optimal_price(self):
        """현재까지의 최적 가격 (활용만)"""
        mean_rates = self.alpha / (self.alpha + self.beta)
        expected_profits = (self.price_levels - self.MC) * mean_rates
        return self.price_levels[np.argmax(expected_profits)]

    def get_posterior_stats(self):
        """사후 분포 통계"""
        return {
            'mean': self.alpha / (self.alpha + self.beta),
            'std': np.sqrt(self.alpha * self.beta /
                          ((self.alpha + self.beta)**2 * (self.alpha + self.beta + 1)))
        }

Simulation

# 실제 전환율 (알 수 없음)
true_rates = np.array([0.08, 0.06, 0.04, 0.03, 0.02])
price_levels = np.array([10, 15, 20, 25, 30])
MC = 5

# Thompson Sampling 에이전트
agent = ThompsonSamplingPricing(price_levels, MC)

# 시뮬레이션
cumulative_regret = []
total_regret = 0
optimal_profit = max((price_levels - MC) * true_rates)

for t in range(5000):
    # 가격 선택
    price_idx, price = agent.select_price()

    # 구매 여부 (베르누이)
    converted = np.random.random() < true_rates[price_idx]
    profit = (price - MC) * converted

    # 학습
    agent.update(price_idx, converted)

    # 후회 계산
    total_regret += optimal_profit - (price - MC) * true_rates[price_idx]
    cumulative_regret.append(total_regret)

print(f"최종 후회: {total_regret:.2f}")
print(f"학습된 최적 가격: ${agent.get_optimal_price()}")

# 사후 분포 확인
stats = agent.get_posterior_stats()
for i, p in enumerate(price_levels):
    print(f"가격 ${p}: 추정 전환율 {stats['mean'][i]:.3f} ± {stats['std'][i]:.3f}")

Visualization

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# 누적 후회
axes[0].plot(cumulative_regret)
axes[0].set_xlabel('Round')
axes[0].set_ylabel('Cumulative Regret')
axes[0].set_title('Regret over Time')

# 사후 분포
x = np.linspace(0, 0.2, 1000)
for i in range(agent.n_prices):
    y = stats.beta.pdf(x, agent.alpha[i], agent.beta[i])
    axes[1].plot(x, y, label=f'${price_levels[i]}')
axes[1].axvline(true_rates[0], color='r', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Conversion Rate')
axes[1].set_ylabel('Density')
axes[1].set_title('Posterior Distributions')
axes[1].legend()

plt.tight_layout()

References

  • Thompson, W. R. (1933). “On the Likelihood that One Unknown Probability Exceeds Another.”
  • Russo, D., et al. (2018). “A Tutorial on Thompson Sampling.”
  • Comprehensive Personalized Pricing Guide, Part IX, §26.2

Local graph