Thompson Sampling · Tae Hyun Kim (Lowell)

Definition

Thompson Sampling은 탐색과 활용의 균형을 맞추는 베이지안 접근법입니다.

각 라운드에서:

각 행동의 보상에 대한 사후 분포에서 샘플링
샘플된 값이 가장 높은 행동 선택
관측 결과로 사후 분포 업데이트

$a_t = \arg\max_a \tilde{\theta}_a, \quad \text{where } \tilde{\theta}_a \sim P(\theta_a | \text{history})$

Intuitive Understanding

Thompson Sampling은 “확률 매칭(probability matching)“을 수행합니다.

행동 $a$ 가 최적일 사후 확률과 비례하여 $a$ 를 선택합니다. 불확실한 행동은 때때로 높은 샘플을 뽑아 탐색되고, 좋다고 알려진 행동은 자주 높은 샘플을 뽑아 활용됩니다.

Key Properties

베타 분포를 사용한 이항 보상

전환율(0/1 보상)에 대해:

사전: $\theta_a \sim \text{Beta}(\alpha, \beta)$
관측 후 사후: $\theta_a | \text{data} \sim \text{Beta}(\alpha + \text{successes}, \beta + \text{failures})$

UCB와의 비교

알고리즘	접근법	특징
UCB	상한 신뢰 구간 사용	결정론적, 낙관적 탐색
Thompson Sampling	사후 분포에서 샘플링	확률적, 자연스러운 탐색

실증적으로 Thompson Sampling이 종종 더 나은 성과를 보입니다.

장점

단순함: 사후 분포만 유지, 샘플링만 수행
효율적 탐색: 불확실성에 비례한 탐색
유연성: 다양한 사후 분포에 적용 가능
좋은 실증 성과: 많은 응용에서 우수한 결과

Example

가격 최적화 구현

import numpy as np
from scipy import stats

class ThompsonSamplingPricing:
    def __init__(self, price_levels, marginal_cost):
        self.price_levels = price_levels
        self.MC = marginal_cost
        self.n_prices = len(price_levels)

        # 각 가격에서의 전환율에 대한 베타 사전 분포
        self.alpha = np.ones(self.n_prices)  # 성공 (구매)
        self.beta = np.ones(self.n_prices)   # 실패 (미구매)

    def select_price(self):
        """베타 분포에서 샘플링하여 가격 선택"""
        sampled_rates = np.array([
            stats.beta.rvs(self.alpha[i], self.beta[i])
            for i in range(self.n_prices)
        ])

        # 기대 이익 = (가격 - MC) × 전환율
        expected_profits = (self.price_levels - self.MC) * sampled_rates
        best_idx = np.argmax(expected_profits)

        return best_idx, self.price_levels[best_idx]

    def update(self, price_idx, converted):
        """관측 결과로 사후 분포 업데이트"""
        if converted:
            self.alpha[price_idx] += 1
        else:
            self.beta[price_idx] += 1

    def get_optimal_price(self):
        """현재까지의 최적 가격 (활용만)"""
        mean_rates = self.alpha / (self.alpha + self.beta)
        expected_profits = (self.price_levels - self.MC) * mean_rates
        return self.price_levels[np.argmax(expected_profits)]

    def get_posterior_stats(self):
        """사후 분포 통계"""
        return {
            'mean': self.alpha / (self.alpha + self.beta),
            'std': np.sqrt(self.alpha * self.beta /
                          ((self.alpha + self.beta)**2 * (self.alpha + self.beta + 1)))
        }

시뮬레이션

# 실제 전환율 (알 수 없음)
true_rates = np.array([0.08, 0.06, 0.04, 0.03, 0.02])
price_levels = np.array([10, 15, 20, 25, 30])
MC = 5

# Thompson Sampling 에이전트
agent = ThompsonSamplingPricing(price_levels, MC)

# 시뮬레이션
cumulative_regret = []
total_regret = 0
optimal_profit = max((price_levels - MC) * true_rates)

for t in range(5000):
    # 가격 선택
    price_idx, price = agent.select_price()

    # 구매 여부 (베르누이)
    converted = np.random.random() < true_rates[price_idx]
    profit = (price - MC) * converted

    # 학습
    agent.update(price_idx, converted)

    # 후회 계산
    total_regret += optimal_profit - (price - MC) * true_rates[price_idx]
    cumulative_regret.append(total_regret)

print(f"최종 후회: {total_regret:.2f}")
print(f"학습된 최적 가격: ${agent.get_optimal_price()}")

# 사후 분포 확인
stats = agent.get_posterior_stats()
for i, p in enumerate(price_levels):
    print(f"가격 ${p}: 추정 전환율 {stats['mean'][i]:.3f} ± {stats['std'][i]:.3f}")

시각화

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# 누적 후회
axes[0].plot(cumulative_regret)
axes[0].set_xlabel('Round')
axes[0].set_ylabel('Cumulative Regret')
axes[0].set_title('Regret over Time')

# 사후 분포
x = np.linspace(0, 0.2, 1000)
for i in range(agent.n_prices):
    y = stats.beta.pdf(x, agent.alpha[i], agent.beta[i])
    axes[1].plot(x, y, label=f'${price_levels[i]}')
axes[1].axvline(true_rates[0], color='r', linestyle='--', alpha=0.5)
axes[1].set_xlabel('Conversion Rate')
axes[1].set_ylabel('Density')
axes[1].set_title('Posterior Distributions')
axes[1].legend()

plt.tight_layout()

Contextual Bandits - 맥락 기반 확장
A-B Testing - 순수 탐색 전략
MDP - 상태 전이가 있는 확장
Policy Trees - 학습된 정책의 해석

References

Thompson, W. R. (1933). “On the Likelihood that One Unknown Probability Exceeds Another.”
Russo, D., et al. (2018). “A Tutorial on Thompson Sampling.”
Comprehensive Personalized Pricing Guide, Part IX, §26.2

연결 그래프