Tae Hyun Kim (Lowell)

Contextual Bandits

Definition

Contextual Bandits are a multi-armed bandit problem in which the optimal action (arm) varies depending on the context.

At each round tt:

  1. Observe the context xtx_t (e.g., customer features)
  2. Choose an action ata_t (e.g., a price level)
  3. Observe the reward rtr_t (e.g., purchase indicator × profit)

Objective: maximize cumulative reward, or minimize regret.

RegretT=t=1T[r(xt)r(xt,at)]\text{Regret}_T = \sum_{t=1}^{T} \left[ r^*(x_t) - r(x_t, a_t) \right]

Intuitive Understanding

In pricing, Contextual Bandits learn in real time the answer to “what price should I offer this customer?”

A high price may be optimal for high-income customers, while a low price may be optimal for price-sensitive ones. Contextual Bandits learn this relationship while balancing exploration and exploitation.

Key Properties

Difference from basic Bandits

TypeDescriptionPricing example
K-armed BanditsLearn the optimal action without contextFind a single optimal price
Contextual BanditsLearn the optimal action given the contextFind the optimal price per customer

Exploration-Exploitation Trade-off

  • Exploitation: Choose the action currently believed to be best
  • Exploration: Try uncertain actions to gather information

Too much exploitation → getting stuck in a suboptimal policy Too much exploration → sacrificing short-term reward

Main Algorithms

AlgorithmApproachCharacteristics
ε-greedyRandom exploration with probability εSimple, but inefficient exploration
UCBAdd an uncertainty bonusEfficient exploration, deterministic
Thompson SamplingSample from the posterior distributionBayesian, strong empirical performance
LinUCBLinear reward assumption + UCBA representative contextual algorithm

Example

LinUCB Implementation

class LinUCBPricing:
    def __init__(self, n_prices, context_dim, alpha=1.0):
        self.n_prices = n_prices
        self.d = context_dim
        self.alpha = alpha  # 탐색 파라미터

        # 각 행동(가격)에 대한 파라미터
        self.A = [np.eye(context_dim) for _ in range(n_prices)]
        self.b = [np.zeros(context_dim) for _ in range(n_prices)]

    def select_price(self, context, price_levels, marginal_cost):
        """UCB로 가격 선택"""
        ucb_values = []

        for i in range(self.n_prices):
            A_inv = np.linalg.inv(self.A[i])
            theta = A_inv @ self.b[i]

            # 예측 전환율 + 불확실성 보너스
            pred = context @ theta
            uncertainty = self.alpha * np.sqrt(context @ A_inv @ context)
            ucb = pred + uncertainty

            # 기대 이익 = (가격 - MC) × 예측 전환율
            expected_profit = (price_levels[i] - marginal_cost) * ucb
            ucb_values.append(expected_profit)

        best_action = np.argmax(ucb_values)
        return best_action, price_levels[best_action]

    def update(self, price_idx, context, reward):
        """관측으로 파라미터 업데이트"""
        self.A[price_idx] += np.outer(context, context)
        self.b[price_idx] += reward * context

Pricing Simulation

# 가격 수준
price_levels = np.array([10, 15, 20, 25, 30])
marginal_cost = 8

# LinUCB 에이전트
agent = LinUCBPricing(n_prices=5, context_dim=4, alpha=0.5)

# 시뮬레이션
total_reward = 0
for t in range(10000):
    # 고객 도착
    context = generate_customer()  # [income, age, loyalty, recency]

    # 가격 선택
    action, price = agent.select_price(context, price_levels, marginal_cost)

    # 구매 결정 (실제 환경에서는 관측)
    conversion = simulate_purchase(context, price)
    reward = (price - marginal_cost) * conversion

    # 학습
    agent.update(action, context, conversion)
    total_reward += reward

print(f"총 보상: {total_reward:.2f}")

References

  • Li, L., et al. (2010). “A Contextual-Bandit Approach to Personalized News Article Recommendation.”
  • Agrawal, S., & Goyal, N. (2013). “Thompson Sampling for Contextual Bandits with Linear Payoffs.”
  • Comprehensive Personalized Pricing Guide, Part IX, §26.3

Local graph