Contextual Bandits · Tae Hyun Kim (Lowell)

Definition

Contextual Bandits are a multi-armed bandit problem in which the optimal action (arm) varies depending on the context.

At each round $t$ :

Observe the context $x_t$ (e.g., customer features)
Choose an action $a_t$ (e.g., a price level)
Observe the reward $r_t$ (e.g., purchase indicator × profit)

Objective: maximize cumulative reward, or minimize regret.

$\text{Regret}_T = \sum_{t=1}^{T} \left[ r^*(x_t) - r(x_t, a_t) \right]$

Intuitive Understanding

In pricing, Contextual Bandits learn in real time the answer to “what price should I offer this customer?”

A high price may be optimal for high-income customers, while a low price may be optimal for price-sensitive ones. Contextual Bandits learn this relationship while balancing exploration and exploitation.

Key Properties

Difference from basic Bandits

Type	Description	Pricing example
K-armed Bandits	Learn the optimal action without context	Find a single optimal price
Contextual Bandits	Learn the optimal action given the context	Find the optimal price per customer

Exploration-Exploitation Trade-off

Exploitation: Choose the action currently believed to be best
Exploration: Try uncertain actions to gather information

Too much exploitation → getting stuck in a suboptimal policy Too much exploration → sacrificing short-term reward

Main Algorithms

Algorithm	Approach	Characteristics
ε-greedy	Random exploration with probability ε	Simple, but inefficient exploration
UCB	Add an uncertainty bonus	Efficient exploration, deterministic
Thompson Sampling	Sample from the posterior distribution	Bayesian, strong empirical performance
LinUCB	Linear reward assumption + UCB	A representative contextual algorithm

Example

LinUCB Implementation

class LinUCBPricing:
    def __init__(self, n_prices, context_dim, alpha=1.0):
        self.n_prices = n_prices
        self.d = context_dim
        self.alpha = alpha  # 탐색 파라미터

        # 각 행동(가격)에 대한 파라미터
        self.A = [np.eye(context_dim) for _ in range(n_prices)]
        self.b = [np.zeros(context_dim) for _ in range(n_prices)]

    def select_price(self, context, price_levels, marginal_cost):
        """UCB로 가격 선택"""
        ucb_values = []

        for i in range(self.n_prices):
            A_inv = np.linalg.inv(self.A[i])
            theta = A_inv @ self.b[i]

            # 예측 전환율 + 불확실성 보너스
            pred = context @ theta
            uncertainty = self.alpha * np.sqrt(context @ A_inv @ context)
            ucb = pred + uncertainty

            # 기대 이익 = (가격 - MC) × 예측 전환율
            expected_profit = (price_levels[i] - marginal_cost) * ucb
            ucb_values.append(expected_profit)

        best_action = np.argmax(ucb_values)
        return best_action, price_levels[best_action]

    def update(self, price_idx, context, reward):
        """관측으로 파라미터 업데이트"""
        self.A[price_idx] += np.outer(context, context)
        self.b[price_idx] += reward * context

Pricing Simulation

# 가격 수준
price_levels = np.array([10, 15, 20, 25, 30])
marginal_cost = 8

# LinUCB 에이전트
agent = LinUCBPricing(n_prices=5, context_dim=4, alpha=0.5)

# 시뮬레이션
total_reward = 0
for t in range(10000):
    # 고객 도착
    context = generate_customer()  # [income, age, loyalty, recency]

    # 가격 선택
    action, price = agent.select_price(context, price_levels, marginal_cost)

    # 구매 결정 (실제 환경에서는 관측)
    conversion = simulate_purchase(context, price)
    reward = (price - marginal_cost) * conversion

    # 학습
    agent.update(action, context, conversion)
    total_reward += reward

print(f"총 보상: {total_reward:.2f}")

Thompson Sampling - Bayesian exploration strategy
MDP - Extension with state transitions
Policy Trees - Learning interpretable policies
A-B Testing - Pure exploration (experimentation)

References

Li, L., et al. (2010). “A Contextual-Bandit Approach to Personalized News Article Recommendation.”
Agrawal, S., & Goyal, N. (2013). “Thompson Sampling for Contextual Bandits with Linear Payoffs.”
Comprehensive Personalized Pricing Guide, Part IX, §26.3

Local graph