Tae Hyun Kim (Lowell)

Endogeneity

Definition

Endogeneity is the problem that arises when an explanatory variable is correlated with the error term.

$$Y = \beta_0 + \beta_1 X + u \quad \text{where} \quad Cov(X, u) \neq 0$$

In this case the OLS estimator $\hat{\beta}_1$ is biased and inconsistent.

Intuitive Understanding

In estimating price elasticity, endogeneity is the most fundamental challenge.

Firms do not set prices at random. They raise prices when they expect demand to be high and lower them when they expect it to be low. This optimizing behavior makes price correlated with demand factors the analyst does not observe, and those factors end up in the error term.
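A minimal sketch of this mechanism, assuming a hypothetical pricing rule in which the firm observes a demand shock that the analyst does not. The coefficients (1.5 and 5) and all variable names are illustrative assumptions, not estimates from any real market.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Unobserved demand shock: known to the firm, invisible to the analyst.
demand_shock = rng.normal(size=n)

# Assumed pricing rule: the firm raises price when it expects high demand.
price = 20 + 1.5 * demand_shock + rng.normal(size=n)

# The shock is part of the regression error u, so price and u move together.
u = 5 * demand_shock + rng.normal(size=n)

cov_price_u = np.cov(price, u)[0, 1]
print(f"Cov(price, u) = {cov_price_u:.2f}")  # positive, so Cov(X, u) = 0 fails
```

Under this rule the covariance is positive by construction, which is exactly the condition in the definition.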

Key Properties

Sources of Endogeneity

| Source | Description | Example |
| --- | --- | --- |
| Simultaneity | Price and quantity are determined simultaneously | Market equilibrium |
| Omitted variables | Omission of a variable that affects both price and demand | Quality, brand |
| Reverse causality | Demand affects price | Demand-forecast-based pricing |
| Measurement error | Measurement error in the explanatory variable | Missing records of price discounts |
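The measurement-error row behaves differently from the others and is easy to check numerically. The sketch below assumes classical measurement error (mean-zero noise independent of the true price), which is a simplification: unrecorded discounts would in practice push the recorded price above the paid price. The helper `ols_slope` and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_effect = -2.0

# Price actually paid; exogenous here so that measurement error is the only problem.
true_price = 20 + 2 * rng.normal(size=n)
demand = 100 + true_effect * true_price + rng.normal(size=n)

# Recorded price: assumed to equal the true price plus classical noise.
recorded_price = true_price + 2 * rng.normal(size=n)

def ols_slope(x, y):
    """Slope from a simple regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

slope_true = ols_slope(true_price, demand)
slope_recorded = ols_slope(recorded_price, demand)
print(f"Slope using true price:     {slope_true:.2f}")
print(f"Slope using recorded price: {slope_recorded:.2f}")  # pulled toward zero
```

With equal signal and noise variances, the slope on the recorded price is shrunk to about half of the true effect (attenuation bias).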

Direction of the Bias

Price endogeneity typically biases the estimated price coefficient upward (a positive bias), so the magnitude of the elasticity is underestimated.

$$\hat{\beta}_{OLS} = \beta + \frac{Cov(X, u)}{Var(X)}$$

Since higher quality → higher price and higher demand:

  • $Cov(\text{price}, \text{quality}) > 0$
  • $Cov(\text{quality}, \text{demand}) > 0$

Result: although the true elasticity is negative, the estimate can approach 0 or even become positive.
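The step from these two covariances to the sign of the bias can be written out. The sketch assumes a linear demand equation in which quality enters with coefficient $\gamma > 0$ and $\varepsilon$ is noise uncorrelated with price; $\gamma$ and $\varepsilon$ are notation introduced here for illustration.

```latex
\begin{aligned}
\text{demand} &= \beta_0 + \beta\,\text{price} + \gamma\,\text{quality} + \varepsilon,
  \qquad u = \gamma\,\text{quality} + \varepsilon \\
Cov(\text{price}, u) &= \gamma\, Cov(\text{price}, \text{quality}) > 0 \\
\hat{\beta}_{OLS} &= \beta + \frac{\gamma\, Cov(\text{price}, \text{quality})}{Var(\text{price})} > \beta
\end{aligned}
```

Because quality is omitted, it sits in $u$, and the bias term takes the sign of $\gamma \cdot Cov(\text{price}, \text{quality})$.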

Example

Simulation

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(42)
n = 5000

# Unobserved quality (confounder)
quality = np.random.randn(n)

# Price: depends on quality (endogenous!)
true_price_effect = -2.0
price = 20 + 3 * quality + np.random.randn(n) * 2

# Demand: depends on both price and quality
demand = 100 + true_price_effect * price + 10 * quality + np.random.randn(n) * 5

# Naive OLS (quality uncontrolled) - biased!
naive_model = LinearRegression()
naive_model.fit(price.reshape(-1, 1), demand)
print(f"True price effect: {true_price_effect}")
print(f"Naive OLS estimate: {naive_model.coef_[0]:.3f}")  # biased upward to roughly +0.3 (wrong sign)

# After controlling for quality - consistent
X_controlled = np.column_stack([price, quality])
controlled_model = LinearRegression()
controlled_model.fit(X_controlled, demand)
print(f"After controlling for quality: {controlled_model.coef_[0]:.3f}")  # ~ -2.0
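The naive estimate can be checked against the bias formula $\hat{\beta}_{OLS} = \beta + Cov(X, u)/Var(X)$ using the parameters of the simulation above. The variable names below are introduced here for illustration; the numbers are copied from the price and demand equations.

```python
# Parameters copied from the simulation's price and demand equations.
true_price_effect = -2.0
quality_to_price = 3.0     # coefficient on quality in the price equation
price_noise_sd = 2.0       # std. dev. of the independent price noise
quality_to_demand = 10.0   # coefficient on quality in the demand equation
var_quality = 1.0          # quality is drawn from a standard normal

# Var(price) = 3^2 * Var(quality) + 2^2 = 13
var_price = quality_to_price**2 * var_quality + price_noise_sd**2

# With quality omitted, u = 10 * quality + noise, so Cov(price, u) = 10 * 3 = 30
cov_price_u = quality_to_demand * quality_to_price * var_quality

bias = cov_price_u / var_price
expected_naive = true_price_effect + bias
print(f"Large-sample bias: {bias:.3f}")                   # 2.308
print(f"Expected naive estimate: {expected_naive:.3f}")   # 0.308
```

The bias (about 2.3) is larger than the true effect in absolute value, which is why the naive coefficient flips sign.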

Solution Approaches

| Approach | Core idea | Key assumption or constraint |
| --- | --- | --- |
| Experiments | Randomly assign prices | Ethical/cost constraints |
| Instrumental variables | Exploit exogenous price variation | Exclusion restriction |
| Control strategy | Control for sufficient covariates | Unconfoundedness |
| Structural models | Explicitly model the economic structure | Functional-form assumptions |
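As one illustration of the instrumental-variables row, the sketch below runs a manual two-stage least squares on simulated data. The instrument `cost_shock` is hypothetical: it is constructed to shift price without entering demand directly, so the exclusion restriction holds by design here, which cannot be taken for granted with real data. The helper `fit` and all coefficients are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
true_price_effect = -2.0

quality = rng.normal(size=n)       # unobserved confounder
cost_shock = rng.normal(size=n)    # hypothetical instrument: affects price only

price = 20 + 3 * quality + 2 * cost_shock + rng.normal(size=n)
demand = 100 + true_price_effect * price + 10 * quality + 5 * rng.normal(size=n)

def fit(x, y):
    """OLS of y on a constant and x; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Stage 1: keep only the part of price explained by the instrument.
a0, a1 = fit(cost_shock, price)
price_hat = a0 + a1 * cost_shock

# Stage 2: regress demand on the instrumented price.
_, iv_estimate = fit(price_hat, demand)
_, naive_estimate = fit(price, demand)

print(f"Naive OLS estimate: {naive_estimate:.3f}")
print(f"2SLS estimate:      {iv_estimate:.3f}")  # close to the true effect of -2
```

The point estimate from the manual second stage is valid, but its naive standard errors are not, so a dedicated IV routine is preferable for inference.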

Related Concepts

  • Instrumental Variables - the primary remedy for endogeneity
  • Confounder - a variable that induces endogeneity
  • A-B Testing - causal estimation free of endogeneity
  • Price Elasticity - the target for which endogeneity is a problem

References

  • Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data.
  • Comprehensive Personalized Pricing Guide, Part I, §3
