Tae Hyun Kim (Lowell)

Marketing Attribution at Scale — From Simulation to Causal Inference

The Problem — “Which channel drove the conversion?” Is a Causal Question

Marketing attribution looks like a simple accounting problem on the surface: “how much did each marketing channel contribute to a conversion?” In practice, though, the question is fundamentally counterfactual. What actually matters is not “a Paid Search click happened right before the conversion” but “would this user have converted without the Paid Search ad?” Rule-based heuristics like last-click answer the former, yet the answer you need when reallocating budget is the latter.

This gap is what makes attribution hard. Correlational and causal methods can assign credit to different channels on the same data, and deciding which is right requires ground truth — which real marketing data never has, because we cannot observe the counterfactual world.

This case study resolves the dilemma in three stages: (1) design a simulator whose ground truth is known, to quantify how accurately 10+ methods recover the true contribution; (2) scale those methods onto the large public Criteo dataset to test real-world robustness; and (3) close the loop by connecting estimated channel effects to budget decisions via Off-Policy Evaluation.

Every DGP parameter and number in this note is either researcher-designed ground truth or a public dataset specification — no client or proprietary metric is involved. Figures are illustrative, meant to convey method behavior.

Data — Simulator (Interpretable) + Criteo (Large, Public)

The core diagnosis is this: a large, public multi-touch attribution (MTA) dataset with interpretable features essentially does not exist. The Criteo Attribution Dataset is large, but every feature is hashed/anonymized, so you cannot tell whether a channel is Display or Search. Conversely, interpretable public sources either cannot reliably reconstruct per-session channel sequences or are limited to a single binary treatment. Hence a hybrid approach.

| Source | Scale | Role | Key property |
| --- | --- | --- | --- |
| Simulator (self-designed) | ~100K users, 2–3% conversion (~2–3K converters) | ground-truth accuracy evaluation | 7 interpretable channels, known DGP, counterfactuals computable directly |
| Criteo Attribution (2018, public) | ~16.5M events, ~2.6M journeys, 7 channels | large real-data scale/robustness check | view-through vs click-through distinction, cost included, features hashed |

The simulator’s DGP integrates three validated academic frameworks — a backbone of sequential dependence and user heterogeneity, channel-specific temporal-decay functions, and cross-channel influence. Conversions are generated as the realization of an inhomogeneous Poisson process whose intensity follows a log-linear model:

$$\log \lambda_i(t) = \alpha_0 + \sum_j \sum_k \beta_k\, x_{jk}\, f_{\text{channel}}(t - t_j) + \sum_{i,j} \delta_{ij}\, \text{cross}(c_i, c_j) + d_i \cdot \eta$$

Here $f_{\text{channel}}$ is the per-channel decay (e.g., Display decays slowly as $\exp(-\Delta t / 14\text{d})$ for long-lived awareness, while Paid Search dies off immediately as $\exp(-\Delta t / 1\text{d})$), $\delta_{ij}$ is cross-channel synergy (e.g., Display → Paid Search), and $d_i \cdot \eta$ is user heterogeneity. The Shapley value derived from these $\beta, \delta$ parameters is the ground truth, and it is the grading key for every attribution model.
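As a minimal sketch of how the log-linear intensity above could be evaluated for one user, the snippet below assumes that $x_{jk}$ is a one-hot channel indicator and that $\text{cross}(c_i, c_j)$ is an indicator for an ordered (earlier, later) channel pair. Neither reading is spelled out in the note. Every numeric value except the 14-day and 1-day decay constants is hypothetical.

```python
import math

# All numeric values are hypothetical, chosen only for illustration.
ALPHA_0 = -6.0                                       # baseline log-intensity
BETA = {"display": 0.4, "paid_search": 1.2}          # per-channel effect size
DECAY_DAYS = {"display": 14.0, "paid_search": 1.0}   # decay constants quoted in the text
DELTA = {("display", "paid_search"): 0.5}            # synergy for an ordered channel pair
ETA = 0.3                                            # loading on the user trait d_i


def log_intensity(t, touches, d_user):
    """Log conversion intensity at time t (days) for one user.

    touches: list of (time_in_days, channel) pairs.
    """
    past = sorted((tj, c) for tj, c in touches if tj <= t)
    value = ALPHA_0 + d_user * ETA
    # Direct effect of each past touch, decayed by its channel's kernel.
    for tj, channel in past:
        value += BETA[channel] * math.exp(-(t - tj) / DECAY_DAYS[channel])
    # Synergy term for every (earlier, later) pair of past touches.
    for a in range(len(past)):
        for b in range(a + 1, len(past)):
            value += DELTA.get((past[a][1], past[b][1]), 0.0)
    return value


journey = [(0.0, "display"), (1.0, "paid_search")]
for day in (1.0, 2.0, 8.0):
    print(day, round(log_intensity(day, journey, d_user=0.0), 3))
```

With these assumed parameters the Paid Search term has almost vanished a week after the click, while the Display term is still contributing, which is the asymmetry the decay functions are meant to encode.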

Pipeline — From Rule-Based to Causal DL

We apply 10+ methods to the same simulated data and measure accuracy against ground truth. The organizing axis is whether the question a method answers is correlational or causal.

| Category | Method | Question answered | Causality level |
| --- | --- | --- | --- |
| Rule-based | Last / First / Linear / Time-Decay / Position | “By what rule do we split credit?” | None |
| Statistical | Markov Chain (1st/2nd/higher-order, Removal Effect) | “Which channel matters in the transition structure?” | Weak |
| Game-theoretic | Shapley Value (conversion-based / model-based, exact over $2^7 = 128$ coalitions) | “What is a fair allocation?” | Weak |
| Predictive DL | LSTM+Attention / Transformer | “Which touchpoint matters for predicting conversion?” | Weak |
| Incremental causal | Incremental Shapley | “What is the incremental lift from ads?” | Medium |
| Time-to-event causal | Survival / Poisson attribution | “How much did this ad accelerate the conversion?” | Medium |
| Debiased causal | IPW / Doubly Robust / DML | “What is the channel effect after correcting selection bias?” | High |
| Causal DL | Causal Attention (CAMTA variant) | “Is the deep-learning attribution causally valid?” | Medium–High |
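To make the exact Shapley row concrete, here is a small sketch that enumerates every coalition for a toy three-channel value function. The function `toy_value` and its numbers are hypothetical stand-ins for a conversion-rate model. With 7 channels the same loop would cover the $2^7 = 128$ coalitions mentioned in the table.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values; value maps a frozenset of players to a number."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for size in range(n):
            # Probability that p joins right after a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[p] += weight * (value(s | {p}) - value(s))
    return phi


def toy_value(coalition):
    """Hypothetical conversion rate when only these channels are active."""
    rate = 0.02  # base conversion with no ads at all
    rate += 0.01 if "display" in coalition else 0.0
    rate += 0.03 if "paid_search" in coalition else 0.0
    rate += 0.005 if "email" in coalition else 0.0
    if {"display", "paid_search"} <= coalition:
        rate += 0.02  # synergy, split equally by Shapley
    return rate


channels = ["display", "paid_search", "email"]
phi = shapley_values(channels, toy_value)
print(phi)
```

By the efficiency property, the values sum to `toy_value(all channels) - toy_value(no channels)`, so in this formulation the base rate of 0.02 is credited to no channel.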

Four causal methods sit at the center of the analysis. (1) IPW: user segments make channel-exposure probabilities differ, inducing confounding in the underlying SCM, so we reweight by the inverse propensity. (2) DML: treat each channel as the treatment and the remaining channels and user features as confounders, remove the nuisance via cross-fitting, and estimate per-channel CATE/ATE (EconML LinearDML). (3) Incremental Shapley: allocate only the incremental conversions, refusing credit to the base conversions that would have happened without any ad. (4) Survival/Poisson: compute the counterfactual intensity $\lambda_{-j}(t)$ with ad $j$ removed, extending causal interpretation onto the time axis.
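The IPW idea in (1) can be illustrated with a deliberately small sketch. It assumes a single discrete confounder (user segment) and uses the per-segment exposure rate as the propensity score, which is a simplification of whatever propensity model the study fitted. The counts are invented so that the true lift is 0.03 in both segments.

```python
from collections import defaultdict


def make_rows():
    """Build (segment, exposed, converted) rows from hypothetical counts."""
    # (segment, exposed flag, users, converters); the numbers are invented.
    spec = [
        ("loyal", 1, 800, 104), ("loyal", 0, 200, 20),
        ("new", 1, 200, 10), ("new", 0, 800, 16),
    ]
    rows = []
    for segment, exposed, users, converters in spec:
        rows += [(segment, exposed, 1)] * converters
        rows += [(segment, exposed, 0)] * (users - converters)
    return rows


def naive_difference(rows):
    """Conversion rate of exposed users minus that of unexposed users."""
    exposed = [y for _, t, y in rows if t == 1]
    unexposed = [y for _, t, y in rows if t == 0]
    return sum(exposed) / len(exposed) - sum(unexposed) / len(unexposed)


def ipw_ate(rows):
    """Horvitz-Thompson IPW estimate using per-segment exposure rates."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [exposed users, all users]
    for segment, t, _ in rows:
        counts[segment][0] += t
        counts[segment][1] += 1
    propensity = {s: exposed / total for s, (exposed, total) in counts.items()}

    treated_sum = control_sum = 0.0
    for segment, t, y in rows:
        e = propensity[segment]
        treated_sum += t * y / e
        control_sum += (1 - t) * y / (1.0 - e)
    return (treated_sum - control_sum) / len(rows)


rows = make_rows()
print("naive:", naive_difference(rows))
print("ipw:  ", ipw_ate(rows))
```

Because the loyal segment is both more exposed and more likely to convert anyway, the naive difference overstates the lift, while reweighting recovers the 0.03 built into the counts.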

For the deep-learning models we compare three contribution-extraction techniques (Attention weight / SHAP DeepExplainer / Leave-One-Out). The key hypothesis is “attention ≠ attribution” — that attention weights may not reflect the true counterfactual contribution — which we quantify with Kendall’s Tau rank agreement.
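A rank-agreement check of this kind can be sketched in a few lines. The function below computes Kendall's tau-a (no tie correction) between two score vectors for a single journey. The attention and Leave-One-Out numbers are hypothetical. In practice `scipy.stats.kendalltau`, which computes the tie-corrected tau-b by default, would be the more usual choice.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for the four touchpoints of one journey.
attention = [0.50, 0.30, 0.15, 0.05]      # attention weights
leave_one_out = [0.10, 0.40, 0.30, 0.20]  # drop in predicted conversion when removed

print(kendall_tau_a(attention, leave_one_out))
```

A tau near 1 would mean attention ranks touchpoints the way the counterfactual-style measure does. Values near 0 or below flag the journeys where reading attention as attribution is least safe.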

Scaling to Criteo, the same baselines (rule-based, Markov, Shapley) and deep sequence models run on ~2.6M journeys, but because features are hashed we deliberately make no domain interpretation of the form “increase budget for channel X.” Criteo’s unique value is (a) whether the methods work technically at scale, (b) whether view-through (impression-only) touchpoints contribute to conversion, and (c) a cost-per-attributed-conversion analysis using the cost field.

Key Findings (Illustrative)

All figures below come from ground-truth comparisons on the researcher-designed DGP; they illustrate method behavior.

  • Correlation and causation diverge, and more so as confounding strengthens. As the spread of channel-exposure probability across user segments (= confounding strength) grows, the credit allocations of correlational methods (Shapley, LSTM-Attention) and causal methods (IPW, DML) pull apart. Under strong CRM targeting (Email/Direct concentrated on certain segments), correlational methods systematically over-credited those channels. → “When is causal correction needed” is decided by the data’s selection-bias strength.
  • Positivity violations break IPW. When propensity scores hit the extremes (near-perfect prediction of exposure), IPW weights explode and the estimate destabilizes. Trimming restores stability, and DML, which absorbs the nuisance functions with ML, was more robust: a practical payoff of the doubly-robust structure.
  • The higher the base conversion, the more traditional attribution misleads. Raising the DGP’s base conversion rate (natural conversion without ads) from 0% to 20% widens the gap between Incremental Shapley and traditional Shapley. In businesses with heavy organic conversion, traditional MTA overstates what ads actually did.
  • Attention is not attribution. Rank agreement between deep-learning attention weights and SHAP/LOO-based contributions was imperfect, and high-disagreement journeys were consistently identifiable — a caution against reading attention as explanation.
  • Complexity is not always justified. On short sequences (2–3 touchpoints), the Transformer gave no meaningful gain over the LSTM — and “Transformers are unnecessary on short journeys” is itself a practically useful finding.
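The weight explosion described in the positivity finding can be seen with a toy calculation. The sketch below assumes propensity clipping as the form of trimming (the note does not say whether units were clipped or dropped) and uses the Kish effective sample size as a stability diagnostic. The propensities and clip bounds are hypothetical.

```python
def inverse_propensity_weights(propensities, clip=None):
    """Weights 1/e for exposed users; clip=(low, high) bounds e before inverting."""
    weights = []
    for e in propensities:
        if clip is not None:
            e = min(max(e, clip[0]), clip[1])
        weights.append(1.0 / e)
    return weights


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


# Hypothetical propensities: 99 ordinary users and one who was almost never exposed.
propensities = [0.5] * 99 + [0.001]
raw = inverse_propensity_weights(propensities)
clipped = inverse_propensity_weights(propensities, clip=(0.05, 0.95))

print(max(raw), effective_sample_size(raw))
print(max(clipped), effective_sample_size(clipped))
```

A single extreme propensity is enough to make one user dominate the estimate. Clipping trades a small bias for a much larger effective sample.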

Connecting to Budget — Channel Allocation as Policy Evaluation

Comparing methods is not the end goal. The final deliverable is a decision: how to move the budget. Viewing the estimated per-channel response curve (Adstock + Saturation at the aggregate level, causal ATE at the user level) as a policy, a new budget split is a new policy, and “how much revenue will this split produce” becomes an Off-Policy Evaluation problem — estimating that policy’s value from logged data.
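One common way to estimate a policy's value from logs is inverse propensity scoring (IPS), shown here together with its self-normalized variant. This is a sketch under a strong simplification that the note does not state: a budget split is treated as the probability of showing each channel within a segment. The log rows, segment name, and probabilities are all hypothetical.

```python
def ips_value(logs, target_policy):
    """IPS estimate of the target policy's average reward."""
    total = 0.0
    for segment, action, reward, logging_prob in logs:
        total += target_policy[segment][action] / logging_prob * reward
    return total / len(logs)


def snips_value(logs, target_policy):
    """Self-normalized IPS: divides by the sum of weights instead of the log size."""
    numerator = denominator = 0.0
    for segment, action, reward, logging_prob in logs:
        weight = target_policy[segment][action] / logging_prob
        numerator += weight * reward
        denominator += weight
    return numerator / denominator


# Hypothetical log: (segment, channel shown, converted, logging probability).
logs = (
    [("all", "paid_search", r, 0.5) for r in (1, 0, 1, 0)]
    + [("all", "display", r, 0.5) for r in (0, 0, 1, 0)]
)
# Candidate split: shift exposure share from Display to Paid Search.
new_policy = {"all": {"paid_search": 0.8, "display": 0.2}}

print(ips_value(logs, new_policy), snips_value(logs, new_policy))
```

Both estimators rely on the logging probabilities being known and nonzero for every action the new policy takes, which is the same positivity requirement that IPW has.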

Under a fixed total-budget constraint, we find the revenue-maximizing allocation via numerical optimization and present, relative to the current split, “where to cut and where to move.” The crux is feeding the response curve with causally estimated effects, not correlational ones — reallocating on correlational credit mistakes organic conversions for ad lift and pours money into the wrong channel. Uncertainty is reported as a credible interval rather than a point estimate (e.g., if a channel’s ATE is $24 with a 95% CI that straddles zero, the spend-increase decision is held).
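A rough sketch of the allocation step follows. It assumes a saturating response curve of the form `cap * (1 - exp(-spend / scale))` per channel and uses a greedy unit-by-unit search, which is adequate for concave, separable curves. The note does not specify its curve form or optimizer, and every parameter below, including the interval bounds in the usage line, is hypothetical.

```python
import math

# channel -> (revenue cap, saturation scale); hypothetical values.
CURVES = {
    "paid_search": (100.0, 20.0),
    "display": (60.0, 40.0),
    "email": (30.0, 10.0),
}


def revenue(channel, spend):
    """Assumed saturating response curve for one channel."""
    cap, scale = CURVES[channel]
    return cap * (1.0 - math.exp(-spend / scale))


def total_revenue(spend):
    return sum(revenue(c, s) for c, s in spend.items())


def allocate(total_budget, step=1.0):
    """Give each budget unit to the channel with the largest marginal revenue."""
    spend = {c: 0.0 for c in CURVES}
    for _ in range(int(round(total_budget / step))):
        best = max(
            CURVES,
            key=lambda c: revenue(c, spend[c] + step) - revenue(c, spend[c]),
        )
        spend[best] += step
    return spend


def spend_decision(ate, ci_low, ci_high):
    """Hold whenever the interval for the channel effect straddles zero."""
    if ci_low <= 0.0 <= ci_high:
        return "hold"
    return "increase" if ate > 0 else "decrease"


current = {"paid_search": 20.0, "display": 20.0, "email": 20.0}
proposed = allocate(60.0)
print(proposed, total_revenue(proposed) - total_revenue(current))
print(spend_decision(24.0, -3.0, 51.0))  # interval bounds are made up
```

The difference between `proposed` and `current` is the "where to cut and where to move" view. In a real pipeline the curve parameters would come from the causally estimated effects rather than from constants.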

Lessons

  • Without ground truth you cannot grade methods. The simulator’s real value is not data volume but that counterfactuals can be computed directly. Real data (Criteo) only checks robustness; the simulator delivers the accuracy verdict — the division of labor must be explicit.
  • Correlational attribution is an assumption, not a default. Shapley/Markov suffice only when “selection bias is weak” holds. Using them without verifying that assumption in the data (by measuring exposure-probability differences across segments) is risky.
  • Do not separate estimation from decision. Attribution numbers only acquire meaning once they translate into budget policy. So we treat CATE estimation → Off-Policy Evaluation → budget allocation as a single pipeline.
  • Carry uncertainty all the way through. Weight explosion under positivity violations, estimation variance on short data — reporting point estimates alone breeds false confidence. We expose the CI/credible interval directly to the decision.

Related Concepts

  • IPW: the starting point for selection-bias correction, fragile under positivity violations
  • DML: robust ATE/CATE estimation under high-dimensional confounding via cross-fitting
  • Off-Policy Evaluation: estimating the value of a budget reallocation from logs
  • CATE: heterogeneous treatment effects per channel/segment
  • SCM: the language for the confounding structure user segments create
  • Attribution: the credit-allocation problem in general
