Contextual Bandits · Tae Hyun Kim (Lowell)

Definition

Contextual Bandits는 **맥락(context)**에 따라 최적의 행동(arm)이 달라지는 다중 슬롯 머신 문제입니다.

각 라운드 $t$ 에서:

맥락(context) $x_t$ 를 관측 (예: 고객 특성)
행동 $a_t$ 를 선택 (예: 가격 수준)
보상 $r_t$ 를 관측 (예: 구매 여부 × 이익)

목표: 누적 보상 최대화 또는 후회(regret) 최소화

$\text{Regret}_T = \sum_{t=1}^{T} \left[ r^*(x_t) - r(x_t, a_t) \right]$

Intuitive Understanding

프라이싱에서 Contextual Bandits는 “이 고객에게 어떤 가격을 제시해야 하는가?”를 실시간으로 학습합니다.

고소득 고객에게는 높은 가격이, 가격 민감 고객에게는 낮은 가격이 최적일 수 있습니다. Contextual Bandits는 **탐색(exploration)**과 **활용(exploitation)**의 균형을 맞추면서 이 관계를 학습합니다.

Key Properties

기본 Bandits와의 차이

유형	설명	프라이싱 예시
K-armed Bandits	맥락 없이 최적 행동 학습	단일 최적 가격 찾기
Contextual Bandits	맥락에 따른 최적 행동 학습	고객별 최적 가격 찾기

탐색-활용 트레이드오프

활용(Exploitation): 현재 최선으로 알려진 행동 선택
탐색(Exploration): 불확실한 행동을 시도하여 정보 수집

너무 많은 활용 → 차선의 정책에 고착 너무 많은 탐색 → 단기 보상 희생

주요 알고리즘

알고리즘	접근법	특징
ε-greedy	확률 ε로 무작위 탐색	단순, 하지만 비효율적 탐색
UCB	불확실성 보너스 추가	효율적 탐색, 결정론적
Thompson Sampling	사후 분포에서 샘플링	베이지안, 좋은 실증 성과
LinUCB	선형 보상 가정 + UCB	대표적 contextual 알고리즘

Example

LinUCB 구현

class LinUCBPricing:
    def __init__(self, n_prices, context_dim, alpha=1.0):
        self.n_prices = n_prices
        self.d = context_dim
        self.alpha = alpha  # 탐색 파라미터

        # 각 행동(가격)에 대한 파라미터
        self.A = [np.eye(context_dim) for _ in range(n_prices)]
        self.b = [np.zeros(context_dim) for _ in range(n_prices)]

    def select_price(self, context, price_levels, marginal_cost):
        """UCB로 가격 선택"""
        ucb_values = []

        for i in range(self.n_prices):
            A_inv = np.linalg.inv(self.A[i])
            theta = A_inv @ self.b[i]

            # 예측 전환율 + 불확실성 보너스
            pred = context @ theta
            uncertainty = self.alpha * np.sqrt(context @ A_inv @ context)
            ucb = pred + uncertainty

            # 기대 이익 = (가격 - MC) × 예측 전환율
            expected_profit = (price_levels[i] - marginal_cost) * ucb
            ucb_values.append(expected_profit)

        best_action = np.argmax(ucb_values)
        return best_action, price_levels[best_action]

    def update(self, price_idx, context, reward):
        """관측으로 파라미터 업데이트"""
        self.A[price_idx] += np.outer(context, context)
        self.b[price_idx] += reward * context

프라이싱 시뮬레이션

# 가격 수준
price_levels = np.array([10, 15, 20, 25, 30])
marginal_cost = 8

# LinUCB 에이전트
agent = LinUCBPricing(n_prices=5, context_dim=4, alpha=0.5)

# 시뮬레이션
total_reward = 0
for t in range(10000):
    # 고객 도착
    context = generate_customer()  # [income, age, loyalty, recency]

    # 가격 선택
    action, price = agent.select_price(context, price_levels, marginal_cost)

    # 구매 결정 (실제 환경에서는 관측)
    conversion = simulate_purchase(context, price)
    reward = (price - marginal_cost) * conversion

    # 학습
    agent.update(action, context, conversion)
    total_reward += reward

print(f"총 보상: {total_reward:.2f}")

Thompson Sampling - 베이지안 탐색 전략
MDP - 상태 전이가 있는 확장
Policy Trees - 해석 가능한 정책 학습
A-B Testing - 순수 탐색 (실험)

References

Li, L., et al. (2010). “A Contextual-Bandit Approach to Personalized News Article Recommendation.”
Agrawal, S., & Goyal, N. (2013). “Thompson Sampling for Contextual Bandits with Linear Payoffs.”
Comprehensive Personalized Pricing Guide, Part IX, §26.3

연결 그래프