Contextual Bandits
Definition
Contextual Bandits are a multi-armed bandit problem in which the optimal action (arm) varies depending on the context.
At each round :
- Observe the context (e.g., customer features)
- Choose an action (e.g., a price level)
- Observe the reward (e.g., purchase indicator × profit)
Objective: maximize cumulative reward, or minimize regret.
Intuitive Understanding
In pricing, Contextual Bandits learn in real time the answer to “what price should I offer this customer?”
A high price may be optimal for high-income customers, while a low price may be optimal for price-sensitive ones. Contextual Bandits learn this relationship while balancing exploration and exploitation.
Key Properties
Difference from basic Bandits
| Type | Description | Pricing example |
|---|---|---|
| K-armed Bandits | Learn the optimal action without context | Find a single optimal price |
| Contextual Bandits | Learn the optimal action given the context | Find the optimal price per customer |
Exploration-Exploitation Trade-off
- Exploitation: Choose the action currently believed to be best
- Exploration: Try uncertain actions to gather information
Too much exploitation → getting stuck in a suboptimal policy Too much exploration → sacrificing short-term reward
Main Algorithms
| Algorithm | Approach | Characteristics |
|---|---|---|
| ε-greedy | Random exploration with probability ε | Simple, but inefficient exploration |
| UCB | Add an uncertainty bonus | Efficient exploration, deterministic |
| Thompson Sampling | Sample from the posterior distribution | Bayesian, strong empirical performance |
| LinUCB | Linear reward assumption + UCB | A representative contextual algorithm |
Example
LinUCB Implementation
class LinUCBPricing:
def __init__(self, n_prices, context_dim, alpha=1.0):
self.n_prices = n_prices
self.d = context_dim
self.alpha = alpha # 탐색 파라미터
# 각 행동(가격)에 대한 파라미터
self.A = [np.eye(context_dim) for _ in range(n_prices)]
self.b = [np.zeros(context_dim) for _ in range(n_prices)]
def select_price(self, context, price_levels, marginal_cost):
"""UCB로 가격 선택"""
ucb_values = []
for i in range(self.n_prices):
A_inv = np.linalg.inv(self.A[i])
theta = A_inv @ self.b[i]
# 예측 전환율 + 불확실성 보너스
pred = context @ theta
uncertainty = self.alpha * np.sqrt(context @ A_inv @ context)
ucb = pred + uncertainty
# 기대 이익 = (가격 - MC) × 예측 전환율
expected_profit = (price_levels[i] - marginal_cost) * ucb
ucb_values.append(expected_profit)
best_action = np.argmax(ucb_values)
return best_action, price_levels[best_action]
def update(self, price_idx, context, reward):
"""관측으로 파라미터 업데이트"""
self.A[price_idx] += np.outer(context, context)
self.b[price_idx] += reward * context
Pricing Simulation
# 가격 수준
price_levels = np.array([10, 15, 20, 25, 30])
marginal_cost = 8
# LinUCB 에이전트
agent = LinUCBPricing(n_prices=5, context_dim=4, alpha=0.5)
# 시뮬레이션
total_reward = 0
for t in range(10000):
# 고객 도착
context = generate_customer() # [income, age, loyalty, recency]
# 가격 선택
action, price = agent.select_price(context, price_levels, marginal_cost)
# 구매 결정 (실제 환경에서는 관측)
conversion = simulate_purchase(context, price)
reward = (price - marginal_cost) * conversion
# 학습
agent.update(action, context, conversion)
total_reward += reward
print(f"총 보상: {total_reward:.2f}")
Related Concepts
- Thompson Sampling - Bayesian exploration strategy
- MDP - Extension with state transitions
- Policy Trees - Learning interpretable policies
- A-B Testing - Pure exploration (experimentation)
References
- Li, L., et al. (2010). “A Contextual-Bandit Approach to Personalized News Article Recommendation.”
- Agrawal, S., & Goyal, N. (2013). “Thompson Sampling for Contextual Bandits with Linear Payoffs.”
- Comprehensive Personalized Pricing Guide, Part IX, §26.3