Thompson Sampling · Tae Hyun Kim (Lowell)

Definition

Thompson Sampling is a Bayesian approach that balances exploration and exploitation.

At each round:

Sample from the posterior distribution of each action’s reward
Select the action with the highest sampled value
Update the posterior distribution with the observed outcome

$a_t = \arg\max_a \tilde{\theta}_a, \quad \text{where } \tilde{\theta}_a \sim P(\theta_a | \text{history})$

Intuitive Understanding

Thompson Sampling performs “probability matching.”

It selects action $a$ in proportion to the posterior probability that $a$ is optimal. Uncertain actions are explored because they occasionally draw high samples, while actions known to be good are exploited because they frequently draw high samples.

Key Properties

Binomial Rewards with the Beta Distribution

For conversion rates (0/1 rewards):

Prior: $\theta_a \sim \text{Beta}(\alpha, \beta)$
Posterior after observation: $\theta_a | \text{data} \sim \text{Beta}(\alpha + \text{successes}, \beta + \text{failures})$

Comparison with UCB

Algorithm	Approach	Characteristics
UCB	Uses upper confidence bounds	Deterministic, optimistic exploration
Thompson Sampling	Samples from the posterior distribution	Stochastic, natural exploration

Empirically, Thompson Sampling often performs better.

Advantages

Simplicity: Only maintains a posterior distribution and performs sampling
Efficient exploration: Exploration proportional to uncertainty
Flexibility: Applicable to a variety of posterior distributions
Strong empirical performance: Excellent results across many applications

Example

Price Optimization Implementation

import numpy as np
from scipy import stats

class ThompsonSamplingPricing:
    def __init__(self, price_levels, marginal_cost):
        self.price_levels = price_levels
        self.MC = marginal_cost
        self.n_prices = len(price_levels)

        # 각 가격에서의 전환율에 대한 베타 사전 분포
        self.alpha = np.ones(self.n_prices)  # 성공 (구매)
        self.beta = np.ones(self.n_prices)   # 실패 (미구매)

    def select_price(self):
        """베타 분포에서 샘플링하여 가격 선택"""
        sampled_rates = np.array([
            stats.beta.rvs(self.alpha[i], self.beta[i])
            for i in range(self.n_prices)
        ])

        # 기대 이익 = (가격 - MC) × 전환율
        expected_profits = (self.price_levels - self.MC) * sampled_rates
        best_idx = np.argmax(expected_profits)

        return best_idx, self.price_levels[best_idx]

    def update(self, price_idx, converted):
        """관측 결과로 사후 분포 업데이트"""
        if converted:
            self.alpha[price_idx] += 1
        else:
            self.beta[price_idx] += 1

    def get_optimal_price(self):
        """현재까지의 최적 가격 (활용만)"""
        mean_rates = self.alpha / (self.alpha + self.beta)
        expected_profits = (self.price_levels - self.MC) * mean_rates
        return self.price_levels[np.argmax(expected_profits)]

    def get_posterior_stats(self):
        """사후 분포 통계"""
        return {
            'mean': self.alpha / (self.alpha + self.beta),
            'std': np.sqrt(self.alpha * self.beta /
                          ((self.alpha + self.beta)**2 * (self.alpha + self.beta + 1)))
        }

Simulation

# 실제 전환율 (알 수 없음)
true_rates = np.array([0.08, 0.06, 0.04, 0.03, 0.02])
price_levels = np.array([10, 15, 20, 25, 30])
MC = 5

# Thompson Sampling 에이전트
agent = ThompsonSamplingPricing(price_levels, MC)

# 시뮬레이션
cumulative_regret = []
total_regret = 0
optimal_profit = max((price_levels - MC) * true_rates)

for t in range(5000):
    # 가격 선택
    price_idx, price = agent.select_price()

    # 구매 여부 (베르누이)
    converted = np.random.random() < true_rates[price_idx]
    profit = (price - MC) * converted

    # 학습
    agent.update(price_idx, converted)

    # 후회 계산
    total_regret += optimal_profit - (price - MC) * true_rates[price_idx]
    cumulative_regret.append(total_regret)

print(f"최종 후회: {total_regret:.2f}")
print(f"학습된 최적 가격: ${agent.get_optimal_price()}")

# 사후 분포 확인
stats = agent.get_posterior_stats()
for i, p in enumerate(price_levels):
    print(f"가격 ${p}: 추정 전환율 {stats['mean'][i]:.3f} ± {stats['std'][i]:.3f}")

Visualization

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# 누적 후회
axes[0].plot(cumulative_regret)
axes[0].set_xlabel('Round')
axes[0].set_ylabel('Cumulative Regret')
axes[0].set_title('Regret over Time')

# 사후 분포
x = np.linspace(0, 0.2, 1000)
for i in range(agent.n_prices):
    y = stats.beta.pdf(x, agent.alpha[i], agent.beta[i])
    axes[1].plot(x, y, label=f'${price_levels[i]}')
axes[1].axvline(true_rates[0], color='r', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Conversion Rate')
axes[1].set_ylabel('Density')
axes[1].set_title('Posterior Distributions')
axes[1].legend()

plt.tight_layout()

Contextual Bandits - Context-based extension
A-B Testing - Pure exploration strategy
MDP - Extension with state transitions
Policy Trees - Interpreting the learned policy

References

Thompson, W. R. (1933). “On the Likelihood that One Unknown Probability Exceeds Another.”
Russo, D., et al. (2018). “A Tutorial on Thompson Sampling.”
Comprehensive Personalized Pricing Guide, Part IX, §26.2

Local graph