RTB Bidding Strategy via Causal ML — From Prediction to Optimization

Real-Time Bidding (RTB) forces a decision under a hard clock: every time an ad impression goes to auction, you have under 100ms to answer “how much should I bid?”. The common approach predicts click/conversion and bids in proportion to that expected value, but that is prediction, not decision-making. What we actually want to know is “if I raise my bid by 1%, how much do win rate and conversion causally change, and how does that effect differ across segments?”. This note walks through a five-stage pipeline built on the public iPinYou dataset (prediction → traditional analysis → CATE → SCM → optimization) and shows how causal inference builds the bridge from prediction to decision.

Every number below is illustrative, derived at the scale of the public iPinYou dataset. No proprietary baselines or client identifiers are used; the figures convey the shape of the methodology, not absolute performance.

Problem: why prediction alone is not enough

The expected value of an impression is usually computed as:

$V(x) = \text{pCTR}(x) \times \text{pCVR}(x) \times \text{value}$

Two problems. First, knowing $V(x)$ does not tell you what to bid. In a first-price auction, bidding $\text{bid} = V$ yields zero margin, so bid shading is mandatory — and the optimal shading depends on the responsiveness of win rate to the bid (elasticity). Second, that elasticity must be a causal effect, not a correlation. In observed logs, high bids already concentrate on high-value impressions (high pCTR), so the naive regression slope of “bid↑ → win↑” is contaminated by confounding.

The core causal structure:

Campaign / Context / User → pCTR/pCVR → Bid → Win → Impression → Click → Conversion
                                          ↑
                                  Competition, Floor

Here pCTR and Campaign are confounders that influence both Bid and Win, while Competition and Floor are the other inputs that determine Win. Identifying the causal effect of Bid requires controlling for them.

Data: public iPinYou scale

iPinYou is a widely used public RTB benchmark, roughly at this scale (rounded / approximate):

Stage	Scale (approx.)	Meaning
Bid requests	~300M	All bid opportunities sent to auction
Wins (impressions)	~60M	Won and served
Clicks	~800K	Click events
Conversions	~16K	Conversion events

This funnel structure is itself the central analytical difficulty. We observe click/conversion only among the ~60M wins, and conversion only when click=1. That is, a two-stage selection bias is baked into the data.

Stage 1 (Win selection): only won impressions are observed → prediction over the full bid space is biased; overfit to samples won with high bids.
Stage 2 (Click selection): CVR is trained only on click=1 samples → the conversion intent of non-clicked impressions is missing.
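To make the two selection stages concrete, the stage-to-stage rates implied by the rounded counts in the table can be computed directly. This is only arithmetic on those approximate figures; the dictionary keys and variable names are illustrative.

```python
# Approximate funnel counts from the table above (rounded, illustrative).
funnel = {
    "bid_requests": 300_000_000,
    "wins": 60_000_000,
    "clicks": 800_000,
    "conversions": 16_000,
}

win_rate = funnel["wins"] / funnel["bid_requests"]   # share of auctions won
ctr = funnel["clicks"] / funnel["wins"]              # clicks per won impression
cvr = funnel["conversions"] / funnel["clicks"]       # conversions per click
ctcvr = funnel["conversions"] / funnel["wins"]       # conversions per won impression

print(f"win rate {win_rate:.1%}, CTR {ctr:.2%}, CVR {cvr:.1%}, CTCVR {ctcvr:.4%}")
```

At these rates, click labels exist for only about a fifth of bid requests, and conversion labels exist only inside the small clicked subset. That is why each selection stage needs its own correction.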

Method / Pipeline

Stage 1 — prediction + selection-bias debiasing

The base prediction targets are $\text{pCTR} = P(\text{Click}=1\mid X)$, $\text{pCVR} = P(\text{Conv}=1\mid \text{Click}=1, X)$, and the entire-space target $\text{CTCVR} = \text{pCTR}\times\text{pCVR}$. The two selection biases are corrected separately: click selection (selection Stage 2) is handled with an entire-space multi-task structure that learns CTCVR over the impression space, and win selection (selection Stage 1) is handled with inverse propensity weighting (IPW).

The win propensity used for IPW is estimated over the full bid log (including win=0):

$p_{\text{win}}(x, b) = P(\text{win}=1 \mid X=x,\ \text{bid}=b), \qquad w_i = \frac{1}{p_{\text{win}}(x_i, b_i)}$

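To prevent gradient explosion from extreme weights, we use stabilized weights $w_i^{\text{stab}} = P(\text{win})/p_{\text{win}}(x_i,b_i)$, clipping, and self-normalization (the Hájek estimator). The justification is a simple reweighting identity, valid wherever $p_{\text{win}}(x,b) > 0$; the factor $P(\text{win})$ in it is the numerator of the stabilized weight:

$E_{\text{bid}}[f(x)] = E\!\left[\frac{\mathbf{1}\{\text{win}=1\}}{p_{\text{win}}(x,b)}\,f(x)\right] = P(\text{win})\; E\!\left[\frac{f(x)}{p_{\text{win}}(x,b)}\ \middle|\ \text{win}=1\right]$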
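The reweighting step can be illustrated with a small simulation. The sketch below is a toy under stated assumptions: the two segments, their win propensities, and the values of `f(x)` are made up, and the propensity is taken as known, whereas in practice it would be an estimated model. It compares the naive mean over won impressions with a clipped, stabilized, self-normalized estimate.

```python
import random

random.seed(0)

# Synthetic bid log (illustrative): two segments with different win propensities.
# Segment 0 is rarely won (p_win = 0.1), segment 1 is often won (p_win = 0.6).
P_WIN = {0: 0.1, 1: 0.6}
F_X = {0: 1.0, 1: 3.0}          # the quantity f(x) we want to average over all bids

log = []
for _ in range(50_000):
    x = random.choice([0, 1])                 # segments are equally common among bids
    win = random.random() < P_WIN[x]
    log.append((x, win))

won = [x for x, win in log if win]
p_win_marginal = len(won) / len(log)          # P(win), the stabilizing numerator

def clipped_stabilized_weight(x, clip=20.0):
    """Stabilized IPW weight P(win) / p_win(x), clipped from above."""
    return min(p_win_marginal / P_WIN[x], clip)

weights = [clipped_stabilized_weight(x) for x in won]

naive_mean = sum(F_X[x] for x in won) / len(won)
# Self-normalized (Hajek) estimate: weighted mean with weights summing to one.
hajek_mean = sum(w * F_X[x] for w, x in zip(weights, won)) / sum(weights)
true_mean = sum(F_X[x] for x, _ in log) / len(log)

print(f"naive {naive_mean:.3f}  hajek {hajek_mean:.3f}  true {true_mean:.3f}")
```

The naive mean is pulled toward the frequently won segment, while the weighted mean should land close to the full-bid-space average. The clip threshold of 20 is arbitrary and never binds in this toy.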

Illustrative result: the pCTR/pCVR models showed discrimination in the AUC ~0.82–0.88 range, and the debiased (IPW-weighted) variants markedly improved calibration (ECE) precisely in low-win-rate segments. The value of debiasing shows up more clearly in the IPW-weighted AUC over the full bid space and in calibration than in average AUC — exactly because it fixes predictions in regions that were previously underrepresented.

Stage 2 — traditional win-rate / ROI analysis

First, at the correlational level, fit the bid–win-rate curve with logistic regression:

$P(\text{Win}=1 \mid \text{Bid}, X) = \Lambda\!\big(\beta_0 + \beta_1 \log(\text{Bid}) + \gamma' X\big)$

The bid elasticity then comes out in closed form:

$\varepsilon = \frac{\partial \log P(\text{Win})}{\partial \log \text{Bid}} = \beta_1\,(1 - P(\text{Win}))$

ROI is $\text{ROI} = (n_{\text{conv}}\cdot\text{value} - \sum \text{payprice}_i)/\sum \text{payprice}_i$. The key observation at this stage is that elasticity varies across segments (low-competition hours → higher elasticity). But this curve is a descriptive statistic mixed with confounding, not a causal quantity, which is why we move on to Stage 3.
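The closed-form elasticity of the logistic win curve can be checked against a finite difference. The coefficients below are hypothetical and chosen only for readability, and `gx` stands in for the covariate term $\gamma' X$.

```python
import math

def win_prob(bid, beta0, beta1, gx=0.0):
    """Logistic win-rate curve: Lambda(beta0 + beta1*log(bid) + gamma'x)."""
    z = beta0 + beta1 * math.log(bid) + gx
    return 1.0 / (1.0 + math.exp(-z))

def elasticity(bid, beta0, beta1, gx=0.0):
    """Closed-form d log P(Win) / d log Bid = beta1 * (1 - P(Win))."""
    return beta1 * (1.0 - win_prob(bid, beta0, beta1, gx))

# Hypothetical coefficients, chosen only to make the numbers readable.
beta0, beta1, bid = -9.0, 2.0, 80.0

eps_closed = elasticity(bid, beta0, beta1)

# Finite-difference check: change in log P(Win) for a small change in log(bid).
h = 1e-6
eps_numeric = (math.log(win_prob(bid * math.exp(h), beta0, beta1))
               - math.log(win_prob(bid * math.exp(-h), beta0, beta1))) / (2 * h)

print(f"P(win)={win_prob(bid, beta0, beta1):.3f}  eps={eps_closed:.3f}  check={eps_numeric:.3f}")
```

Because the elasticity is $\beta_1(1 - P(\text{Win}))$, it falls as the win rate rises: the same coefficient implies a more elastic response where wins are rare.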

Stage 3 — CATE (DML / Causal Forest)

Treating the treatment as a continuous variable $T = \log(\text{bid})$, we estimate the conditional average treatment effect with Double-Debiased ML / Causal Forest (CausalForestDML):

$\tau(x) = \frac{\partial}{\partial T}\, E[\text{Win}\mid do(T), X=x]$

The variable split:

Y (Outcome):       win, click, conversion
T (Treatment):     log_bid  (continuous treatment)
X (Heterogeneity): campaign, hour, exchange, region
W (Confounders):   floor, competition_proxy, pCTR

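DML learns the nuisances ($E[Y\mid X,W]$, $E[T\mid X,W]$) with ML, using cross-fitting, and orthogonalizes via Robinson residualization. Because the resulting score is Neyman-orthogonal, first-stage estimation error enters only at second order, so average and low-dimensional effect parameters are $\sqrt{n}$-consistent as long as the nuisance estimates converge fast enough (faster than $n^{-1/4}$). Causal Forest’s honest splitting supports asymptotically valid pointwise confidence intervals for $\tau(x)$ under regularity conditions, rather than guaranteeing them in finite samples.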
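The residualization idea can be shown on synthetic data. This is not the CausalForestDML pipeline itself: it assumes a single discrete confounder `w`, a constant treatment effect, a continuous outcome, and per-level group means as the nuisance models. All of those are simplifications made for illustration.

```python
import random

random.seed(1)
TRUE_EFFECT = 0.3          # true slope of the outcome in T (synthetic ground truth)

# Synthetic data: a discrete confounder w raises both the treatment and the outcome.
rows = []
for _ in range(20_000):
    w = random.choice([0, 1, 2])
    t = w + random.gauss(0, 1)                      # treatment, e.g. log_bid
    y = TRUE_EFFECT * t + 0.5 * w + random.gauss(0, 1)
    rows.append((w, t, y))

def slope(ts, ys):
    """OLS slope of ys on ts (with intercept)."""
    mt, my = sum(ts) / len(ts), sum(ys) / len(ys)
    cov = sum((a - mt) * (b - my) for a, b in zip(ts, ys))
    var = sum((a - mt) ** 2 for a in ts)
    return cov / var

def group_means(data, col):
    """Nuisance model: mean of column `col` within each level of w."""
    sums, counts = {}, {}
    for r in data:
        sums[r[0]] = sums.get(r[0], 0.0) + r[col]
        counts[r[0]] = counts.get(r[0], 0) + 1
    return {k: sums[k] / counts[k] for k in sums}

# Two-fold cross-fitting: fit nuisances on one fold, residualize the other.
folds = [rows[0::2], rows[1::2]]
t_res, y_res = [], []
for k in (0, 1):
    train, held_out = folds[1 - k], folds[k]
    m_t, m_y = group_means(train, 1), group_means(train, 2)   # E[T|W], E[Y|W]
    for w, t, y in held_out:
        t_res.append(t - m_t[w])
        y_res.append(y - m_y[w])

naive = slope([r[1] for r in rows], [r[2] for r in rows])
dml = slope(t_res, y_res)       # residual-on-residual (Robinson) estimate
print(f"naive {naive:.3f}  dml {dml:.3f}  truth {TRUE_EFFECT}")
```

In this setup the naive slope absorbs part of the confounder's effect, while the residual-on-residual slope should sit close to the true value. A forest-based estimator replaces the final single slope with a locally weighted one to obtain $\tau(x)$.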

Illustrative result: the ATE was estimated to be positive (95% CI excluding 0 — a unit increase in log_bid raises win rate by a few percentage points), and the CATE distribution spread widely around 0, revealing substantial heterogeneity. The pattern matches intuition:

Heterogeneity dimension	Pattern (illustrative)
Campaign (vertical)	Clear differences in $\tau$ by vertical
Hour	$\tau$ ↑ in off-peak hours
Competition	competition ↑ → $\tau$ ↓ (moderation effect)

In other words, the same 1% bid increase has a very different marginal effect depending on where you spend it — and that is the starting point for policy design.

Stage 4 — SCM / counterfactual

If CATE answers “how much on average”, the SCM answers “what if this bid had been different?” — individual and policy counterfactuals. The structural equations:

$\begin{aligned} \text{Bid} &:= f_{\text{bid}}(\text{pCTR}, \text{pCVR}, \text{Campaign}) + U_{\text{bid}} \\ \text{Win} &:= \mathbf{1}\{\text{Bid} \geq \max(\text{Competition}, \text{Floor})\} \\ \text{Click} &:= f_{\text{click}}(\text{ad}, \text{user}) + U_{\text{click}} \ \mid\ \text{Win}=1 \\ \text{Conv} &:= f_{\text{conv}}(\text{intent}, \text{landing}) + U_{\text{conv}} \ \mid\ \text{Click}=1 \end{aligned}$

Using the do-operator, the interventional distribution is identified by backdoor adjustment:

$P(\text{Win}\mid do(\text{Bid}=b)) = \sum_z P(\text{Win}\mid b, z)\,P(z), \quad Z = \{\text{pCTR}, \text{campaign}, \text{hour}, \text{exchange}, \text{floor}, \text{competition\_proxy}\}$
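A two-stratum toy table shows how the backdoor sum differs from the raw conditional win rate. Every number below is hypothetical: `strata` holds, for each value of $z$, its share $P(z)$, the win rate at two bid levels, and how often each bid level was used in that stratum.

```python
# Hypothetical stratified log: for each stratum z, its share P(z) and
# observed win rates at two bid levels, plus how often each bid was used.
strata = {
    # z: (P(z), {bid: (P(Win | bid, z), P(bid | z))})
    "low_competition":  (0.5, {50: (0.40, 0.8), 100: (0.80, 0.2)}),
    "high_competition": (0.5, {50: (0.05, 0.2), 100: (0.30, 0.8)}),
}

def naive_win_rate(bid):
    """P(Win | bid): mixes strata in proportion to who actually bid `bid`."""
    num = sum(pz * tbl[bid][1] * tbl[bid][0] for pz, tbl in strata.values())
    den = sum(pz * tbl[bid][1] for pz, tbl in strata.values())
    return num / den

def backdoor_win_rate(bid):
    """P(Win | do(bid)) = sum_z P(Win | bid, z) * P(z)."""
    return sum(pz * tbl[bid][0] for pz, tbl in strata.values())

naive_lift = naive_win_rate(100) - naive_win_rate(50)
causal_lift = backdoor_win_rate(100) - backdoor_win_rate(50)
print(f"naive lift {naive_lift:.3f}, backdoor-adjusted lift {causal_lift:.3f}")
```

In this toy the high bid was used mostly where competition is strong, so the naive comparison understates the lift. With a different bid allocation the bias can point the other way, which is why the adjustment set matters more than the sign of the raw slope.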

Individual counterfactuals are solved by abduction–action–prediction. Let $F$ be the CDF of the market price $\max(\text{Competition}, \text{Floor})$ in the impression’s context. For example, “I lost at bid=80; what if it had been bid=130?” → (abduction) $\text{Win}=0$ at $\text{Bid}=80$ implies the market price exceeded 80 → (action) $do(\text{Bid}=130)$ → (prediction) the win probability is $[F(130)-F(80)]/[1-F(80)]$. In mediation analysis, since Bid affects Conversion only through Win, we get NIE/TE ≈ 1 (Win as an almost complete mediator).
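The bid=80 versus bid=130 example can be worked through numerically once a distribution for the market price (the highest competing bid or the floor) is assumed. The log-normal shape and its parameters below are purely illustrative, not fitted to any data.

```python
import math

def market_price_cdf(price, mu=4.5, sigma=0.5):
    """Assumed log-normal CDF F of the market price (highest competing bid or floor)."""
    return 0.5 * (1.0 + math.erf((math.log(price) - mu) / (sigma * math.sqrt(2.0))))

def counterfactual_win_prob(lost_bid, new_bid, cdf=market_price_cdf):
    """P(win at new_bid | lost at lost_bid) via abduction-action-prediction."""
    if new_bid <= lost_bid:
        return 0.0            # a bid no higher than the losing one still loses
    f_lost, f_new = cdf(lost_bid), cdf(new_bid)
    # Abduction: market price > lost_bid. Action: do(Bid = new_bid).
    # Prediction: mass of the truncated distribution below new_bid.
    return (f_new - f_lost) / (1.0 - f_lost)

p_cf = counterfactual_win_prob(80, 130)
p_marginal = market_price_cdf(130)
print(f"P(win at 130 | lost at 80) = {p_cf:.3f}; unconditional P(win at 130) = {p_marginal:.3f}")
```

The counterfactual probability is lower than the unconditional win rate at 130, because having lost at 80 is evidence that this particular auction was more competitive than average.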

SUTVA caveat: because of its competitive structure, RTB can weakly violate SUTVA — one bid can shift another bidder’s win probability. We assume that in a sufficiently large market the impact of any individual bid is negligible, but flag this as a limitation (a point to revisit through a contextual-bandit / interference lens).

Stage 5 — optimal bidding + policy simulation (OPE)

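Plug the CATE/elasticity directly into the optimal bid formula. Without a budget constraint, maximizing the expected surplus $(V - b)\,P(\text{Win}\mid b)$ gives the first-order condition $V - b = b/\varepsilon$, the buyer-side mirror image of the monopoly pricing (inverse-elasticity) rule:

$b^*(x) = V(x)\,\frac{\varepsilon(x)}{1+\varepsilon(x)} \approx V(x)\left(1 - \frac{1}{\varepsilon(x)}\right)\ \text{for large } \varepsilon(x)$

Here $\varepsilon(x)$ is evaluated at the optimal bid, so when the elasticity varies with the bid the formula is a fixed-point condition rather than a one-shot plug-in. Solving the budget-$B$-constrained problem with a Lagrangian introduces a shadow price $\lambda$:

$\max_{\{b_i\}}\ \sum_i (V_i - b_i)\,P(\text{Win}_i\mid b_i)\quad \text{s.t.}\quad \sum_i b_i\,P(\text{Win}_i\mid b_i) \le B \ \Rightarrow\ b_i^* = \frac{V_i}{1+\lambda}\cdot\frac{\varepsilon_i}{1+\varepsilon_i}$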
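The bid rule can be checked numerically. The sketch below assumes a constant-elasticity win curve $(b/\text{scale})^{\varepsilon}$, under which the first-order condition of $(V - b)\,P(\text{Win}\mid b)$ has the stationary point $b = V\varepsilon/(1+\varepsilon)$; the factor $1 - 1/\varepsilon$ is its large-$\varepsilon$ approximation. The three impressions, the budget, and all parameter values are hypothetical.

```python
def win_curve(bid, scale, eps):
    """Assumed constant-elasticity win curve, capped at 1: (bid/scale)**eps."""
    return min(1.0, (bid / scale) ** eps)

def shaded_bid(value, eps, lam=0.0):
    """Stationary point of (V - (1+lam)*b) * P(Win|b) when the elasticity is eps."""
    return value / (1.0 + lam) * eps / (1.0 + eps)

# Unconstrained check: grid search over bids versus the closed form.
V, SCALE, EPS = 100.0, 400.0, 1.5
grid = [i / 100 for i in range(1, 10_001)]                 # bids 0.01 .. 100.00
best = max(grid, key=lambda b: (V - b) * win_curve(b, SCALE, EPS))
closed = shaded_bid(V, EPS)

# Budget-constrained case: bisect on the shadow price lambda until spend fits.
impressions = [(100.0, 400.0, 1.5), (60.0, 300.0, 0.8), (150.0, 500.0, 2.5)]  # (V, scale, eps)

def expected_spend(lam):
    return sum(shaded_bid(v, e, lam) * win_curve(shaded_bid(v, e, lam), s, e)
               for v, s, e in impressions)

BUDGET = 5.0
lo, hi = 0.0, 100.0
if expected_spend(lo) <= BUDGET:
    lam = 0.0                                              # budget not binding
else:
    for _ in range(60):
        mid = (lo + hi) / 2
        if expected_spend(mid) > BUDGET:
            lo = mid
        else:
            hi = mid
    lam = hi

print(f"grid optimum {best:.2f}, closed form {closed:.2f}, lambda {lam:.3f}")
```

Bisection works here because expected spend falls monotonically as $\lambda$ rises. A tighter budget therefore maps to a larger shadow price and uniformly lower bids.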

Then comes this project’s key validation step: to evaluate a policy learned from logs without putting it online, we use Off-Policy Evaluation. We evaluate the candidate policies (fixed bid / pCTR-proportional / value-based / CATE-optimal) counterfactually with IPS and doubly-robust estimators, and use the SCM simulator to compare “what would ROI have been under this policy?”.

Illustrative result: the CATE-optimal policy produced per-segment bid discount factors — bidding more aggressively (less shading) in low-competition, high-elasticity segments, and more conservatively in segments that are expensive relative to their value. By OPE, this policy improved budget efficiency (cost per conversion) over the value-based and fixed baselines. That said, OPE estimates are trustworthy only where propensity overlap is good, so for extreme bid regions with weak positivity we reported wide confidence intervals.
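The two estimators named above can be sketched on a synthetic log. Everything here is an assumption made for illustration: two segments, three discrete bid levels, known logging propensities, Bernoulli rewards, and a per-cell empirical mean as the reward model. Real logs would need estimated propensities and a fitted outcome model.

```python
import random

random.seed(2)
ACTIONS = [0, 1, 2]                       # hypothetical bid levels: low / mid / high
TRUE_REWARD = {                           # E[reward | segment, action], synthetic
    0: [0.10, 0.30, 0.20],
    1: [0.05, 0.10, 0.40],
}
LOGGING = {0: [0.5, 0.3, 0.2], 1: [0.2, 0.3, 0.5]}   # logging propensities mu(a | x)

def target_policy(x):
    """Candidate deterministic policy to evaluate: best known action per segment."""
    return 1 if x == 0 else 2

logs = []
for _ in range(40_000):
    x = random.choice([0, 1])
    a = random.choices(ACTIONS, weights=LOGGING[x])[0]
    r = 1.0 if random.random() < TRUE_REWARD[x][a] else 0.0
    logs.append((x, a, r))

# Reward model q_hat(x, a): per-cell empirical mean (a deliberately simple choice).
cells = {}
for x, a, r in logs:
    s, n = cells.get((x, a), (0.0, 0))
    cells[(x, a)] = (s + r, n + 1)
q_hat = {k: s / n for k, (s, n) in cells.items()}

ips_terms, dr_terms = [], []
for x, a, r in logs:
    pi_a = target_policy(x)
    w = (1.0 / LOGGING[x][a]) if a == pi_a else 0.0       # importance weight
    ips_terms.append(w * r)
    dr_terms.append(q_hat[(x, pi_a)] + w * (r - q_hat[(x, a)]))

ips = sum(ips_terms) / len(logs)
dr = sum(dr_terms) / len(logs)
truth = 0.5 * TRUE_REWARD[0][1] + 0.5 * TRUE_REWARD[1][2]
print(f"IPS {ips:.3f}  DR {dr:.3f}  true value {truth:.3f}")
```

Both estimates should land near the true policy value because every action has a healthy logging propensity. Shrinking one of the `LOGGING` entries toward zero is a quick way to see the importance weights, and with them the IPS variance, blow up.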

Serving constraint: <100ms SLA

Finally, all of this must run within 100ms at online bid time. The Ad Exchange timeout is 100ms, and feature lookup, model inference, bid computation, and budget pacing all have to fit inside that budget. The measured targets were on the order of P50 < 20ms and P99 < 80ms. A causally-derived policy is useless if it blows the latency budget, so the CATE/optimal-bid values are precomputed and served as per-segment lookup tables and lightweight models.
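The precompute-and-lookup pattern can be sketched as follows. The table keys, shading factors, fallback value, and the convention that returning 0 means "no bid" are all hypothetical; budget pacing and feature lookup are left out.

```python
import time

# Hypothetical precomputed table: (campaign, hour_bucket) -> bid shading factor.
SHADING_TABLE = {
    ("camp_a", "off_peak"): 0.75,
    ("camp_a", "peak"): 0.55,
    ("camp_b", "off_peak"): 0.70,
}
DEFAULT_SHADING = 0.50        # conservative fallback for unseen segments

def compute_bid(value, campaign, hour_bucket, floor):
    """O(1) bid computation: one dict lookup, no model call on the hot path."""
    shading = SHADING_TABLE.get((campaign, hour_bucket), DEFAULT_SHADING)
    bid = value * shading
    return bid if bid >= floor else 0.0      # do not bid below the floor

start = time.perf_counter()
bid = compute_bid(value=120.0, campaign="camp_a", hour_bucket="off_peak", floor=30.0)
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"bid={bid:.1f} computed in {elapsed_ms:.4f} ms")
```

Moving the causal estimation offline leaves only a dictionary lookup and a multiplication on the request path. That keeps the policy's share of the latency budget small and predictable.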

Key findings (illustrative summary)

Debiasing pays off in the tails, not the average. IPW-based correction helped calibration in low-win-rate segments far more than it moved average AUC — because it fixes bid space that was previously invisible.
Bid effects are strongly heterogeneous. The ATE is positive, but the CATE distribution spreads widely around 0, with a clear moderation effect where competition suppresses the effect.
Correlational elasticity ≠ causal elasticity. The naive regression slope can be inflated by confounding; using the DML/Causal-Forest-corrected $\tau(x)$ in the optimal bid is what keeps the policy from breaking.
OPE does not replace online experiments, but it de-risks. OPE estimates over weak-positivity regions carry large uncertainty — report confidence intervals honestly and bid conservatively there.

Lessons

The bridge from prediction to decision is a causal quantity. $V(x)$ (prediction) alone does not yield a bid. What you need to solve for the bid is the causal derivative — elasticity / CATE — and that is the decisive difference from a prediction-only pipeline.
Treat selection bias as part of the data-generating process. The RTB funnel is structural, not accidental. Naming Win/Click selection explicitly as Stage 1/2 and treating them separately with reweighting + entire-space learning makes it clear where the correction helps (the tail segments).
An average policy that ignores heterogeneity leaks money. Bidding off a single ATE simultaneously misses the opportunity in high-elasticity segments and the waste in expensive ones. CATE-based per-segment discount factors send budget to the right place.
Don’t deploy a policy without OPE — but don’t over-trust OPE either. Off-policy estimation lets you filter policies without live risk, but in poor-overlap regions it is itself biased. Read it alongside a positivity diagnostic.
Causality that loses to the latency budget is useless. Precomputing and lightweighting the policy so it runs inside the 100ms SLA is part of the decision system, not an afterthought.

Off-Policy Evaluation — counterfactual policy-value estimation from logs (IPS / doubly-robust)
Contextual Bandits — the natural next step that extends bidding to online explore-exploit
Double-Debiased ML — CATE estimation with confounding removed via nuisance orthogonalization
Causal Forest — nonparametric CATE with honest confidence intervals
CATE — the per-segment bid effect $\tau(x)$ , the policy’s core input
do-operator — identification of the interventional distribution $P(\text{Win}\mid do(\text{Bid}))$
Optimal Targeting Policy — turning CATE into a budget-constrained action rule
Decision-Making Overview — the big picture from prediction → evaluation → policy
IPW, Positivity, SCM, Confounder — foundational concepts for identification and correction
