Thompson Sampling
Definition
Thompson Sampling is a Bayesian approach that balances exploration and exploitation.
At each round:
- Sample from the posterior distribution of each action’s reward
- Select the action with the highest sampled value
- Update the posterior distribution with the observed outcome
Intuitive Understanding
Thompson Sampling performs “probability matching.”
It selects action in proportion to the posterior probability that is optimal. Uncertain actions are explored because they occasionally draw high samples, while actions known to be good are exploited because they frequently draw high samples.
Key Properties
Binomial Rewards with the Beta Distribution
For conversion rates (0/1 rewards):
- Prior:
- Posterior after observation:
Comparison with UCB
| Algorithm | Approach | Characteristics |
|---|---|---|
| UCB | Uses upper confidence bounds | Deterministic, optimistic exploration |
| Thompson Sampling | Samples from the posterior distribution | Stochastic, natural exploration |
Empirically, Thompson Sampling often performs better.
Advantages
- Simplicity: Only maintains a posterior distribution and performs sampling
- Efficient exploration: Exploration proportional to uncertainty
- Flexibility: Applicable to a variety of posterior distributions
- Strong empirical performance: Excellent results across many applications
Example
Price Optimization Implementation
import numpy as np
from scipy import stats
class ThompsonSamplingPricing:
def __init__(self, price_levels, marginal_cost):
self.price_levels = price_levels
self.MC = marginal_cost
self.n_prices = len(price_levels)
# 각 가격에서의 전환율에 대한 베타 사전 분포
self.alpha = np.ones(self.n_prices) # 성공 (구매)
self.beta = np.ones(self.n_prices) # 실패 (미구매)
def select_price(self):
"""베타 분포에서 샘플링하여 가격 선택"""
sampled_rates = np.array([
stats.beta.rvs(self.alpha[i], self.beta[i])
for i in range(self.n_prices)
])
# 기대 이익 = (가격 - MC) × 전환율
expected_profits = (self.price_levels - self.MC) * sampled_rates
best_idx = np.argmax(expected_profits)
return best_idx, self.price_levels[best_idx]
def update(self, price_idx, converted):
"""관측 결과로 사후 분포 업데이트"""
if converted:
self.alpha[price_idx] += 1
else:
self.beta[price_idx] += 1
def get_optimal_price(self):
"""현재까지의 최적 가격 (활용만)"""
mean_rates = self.alpha / (self.alpha + self.beta)
expected_profits = (self.price_levels - self.MC) * mean_rates
return self.price_levels[np.argmax(expected_profits)]
def get_posterior_stats(self):
"""사후 분포 통계"""
return {
'mean': self.alpha / (self.alpha + self.beta),
'std': np.sqrt(self.alpha * self.beta /
((self.alpha + self.beta)**2 * (self.alpha + self.beta + 1)))
}
Simulation
# 실제 전환율 (알 수 없음)
true_rates = np.array([0.08, 0.06, 0.04, 0.03, 0.02])
price_levels = np.array([10, 15, 20, 25, 30])
MC = 5
# Thompson Sampling 에이전트
agent = ThompsonSamplingPricing(price_levels, MC)
# 시뮬레이션
cumulative_regret = []
total_regret = 0
optimal_profit = max((price_levels - MC) * true_rates)
for t in range(5000):
# 가격 선택
price_idx, price = agent.select_price()
# 구매 여부 (베르누이)
converted = np.random.random() < true_rates[price_idx]
profit = (price - MC) * converted
# 학습
agent.update(price_idx, converted)
# 후회 계산
total_regret += optimal_profit - (price - MC) * true_rates[price_idx]
cumulative_regret.append(total_regret)
print(f"최종 후회: {total_regret:.2f}")
print(f"학습된 최적 가격: ${agent.get_optimal_price()}")
# 사후 분포 확인
stats = agent.get_posterior_stats()
for i, p in enumerate(price_levels):
print(f"가격 ${p}: 추정 전환율 {stats['mean'][i]:.3f} ± {stats['std'][i]:.3f}")
Visualization
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
# 누적 후회
axes[0].plot(cumulative_regret)
axes[0].set_xlabel('Round')
axes[0].set_ylabel('Cumulative Regret')
axes[0].set_title('Regret over Time')
# 사후 분포
x = np.linspace(0, 0.2, 1000)
for i in range(agent.n_prices):
y = stats.beta.pdf(x, agent.alpha[i], agent.beta[i])
axes[1].plot(x, y, label=f'${price_levels[i]}')
axes[1].axvline(true_rates[0], color='r', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Conversion Rate')
axes[1].set_ylabel('Density')
axes[1].set_title('Posterior Distributions')
axes[1].legend()
plt.tight_layout()
Related Concepts
- Contextual Bandits - Context-based extension
- A-B Testing - Pure exploration strategy
- MDP - Extension with state transitions
- Policy Trees - Interpreting the learned policy
References
- Thompson, W. R. (1933). “On the Likelihood that One Unknown Probability Exceeds Another.”
- Russo, D., et al. (2018). “A Tutorial on Thompson Sampling.”
- Comprehensive Personalized Pricing Guide, Part IX, §26.2