Customer Segmentation & Causal Targeting — An Applied Case Study

In marketing, “who should we send the campaign to?” is not a prediction problem — it is a causal decision problem. The goal is not to pick customers with a high response probability, but customers for whom the campaign makes the biggest difference between treatment and no treatment (uplift). This note is an applied field note: I built the entire pipeline — from segmentation through a causal targeting policy and off-policy validation — on a single public dataset, and wrote down what I learned. Every number below is an illustrative finding on a public dataset, not an audited result for any specific real retailer.

Problem

Targeting on predicted response probability (purchase propensity) routinely wastes coupons on people who would have bought anyway. The quantity we actually want is the conditional average treatment effect (CATE), $\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$: the expected difference in outcome caused by sending (vs. not sending) the campaign to a customer with features $x$. This case study asks two questions:

(Segmentation) Which latent dimensions summarize customer behavior, and how do we group customers into actionable segments?
(Causal Targeting) How does campaign effect vary across customers, and what targeting rule maximizes return on spend?

Data (public)

I used the Dunnhumby “The Complete Journey” public retail dataset: transaction and campaign logs for roughly 2,500 households over 102 weeks, with millions of transactions and dozens of marketing-campaign exposures. Because anyone can download it, the results below should be read strictly as an illustrative demonstration of methodology.

Method / Pipeline

Track 1 — Latent Factor Segmentation

Feature engineering: reduced the behavioral feature set by removing highly collinear features (pairwise correlation $|r| \ge 0.7$).
Latent factor extraction: NMF (Non-negative Matrix Factorization) yielded 5 latent behavioral factors. I chose NMF over PCA because the non-negativity constraint gives a parts-based interpretation (“customer = 0.3×loyal + 0.5×fresh-food”) and matches the natural non-negativity of spend data. The 5 factors explained most (~92%) of the behavioral variance.
Clustering: K-Means over the factor scores produced 7 customer segments.
Stability check: bootstrap resampling (100 iterations, 80% subsamples) plus Adjusted Rand Index (ARI) measured segment reproducibility → bootstrap ARI ≈ 0.77 (strong stability). Where internal quality metrics (e.g. silhouette) ask “is the current assignment good?”, bootstrap ARI asks “is this assignment reproducible?” — the two are complementary. Silhouette was only moderate (typical for behavioral data), but the high ARI backed up the substantive stability.

These 7 segments are a marketing insight in their own right and, in Track 2, serve as a moderator (heterogeneity axis) for the heterogeneous treatment effect (HTE) analysis.
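
The Track 1 steps (NMF factors, K-Means on the factor scores, bootstrap ARI) can be sketched roughly as follows. This is a minimal sketch on synthetic data, not the pipeline I actually ran: the function names `fit_segments` and `bootstrap_ari`, the synthetic matrix, and the choice to compare each refit against the full-data labels on the same subsample are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import NMF
from sklearn.metrics import adjusted_rand_score


def fit_segments(X, n_factors=5, n_segments=7, seed=0):
    """NMF factor scores followed by K-Means on those scores."""
    nmf = NMF(n_components=n_factors, init="nndsvda", max_iter=1000, random_state=seed)
    W = nmf.fit_transform(X)  # rows: households, columns: latent factors
    km = KMeans(n_clusters=n_segments, n_init=10, random_state=seed).fit(W)
    return W, km.labels_


def bootstrap_ari(W, ref_labels, n_segments=7, n_iter=100, frac=0.8, seed=0):
    """Mean ARI between reference labels and labels refit on random subsamples."""
    rng = np.random.default_rng(seed)
    n_sub = int(frac * len(W))
    scores = []
    for b in range(n_iter):
        # Assumption: "80% subsamples" means sampling without replacement.
        idx = rng.choice(len(W), size=n_sub, replace=False)
        km_b = KMeans(n_clusters=n_segments, n_init=10, random_state=seed + b + 1)
        labels_b = km_b.fit_predict(W[idx])
        # ARI is invariant to label permutation, so no cluster matching is needed.
        scores.append(adjusted_rand_score(ref_labels[idx], labels_b))
    return float(np.mean(scores))


# Synthetic non-negative "behavioral" matrix (hypothetical stand-in for real features).
rng = np.random.default_rng(42)
centers = rng.gamma(shape=2.0, scale=2.0, size=(7, 12))
true_seg = rng.integers(0, 7, size=350)
X = centers[true_seg] + rng.gamma(shape=1.0, scale=0.3, size=(350, 12))

W, labels = fit_segments(X)
ari = bootstrap_ari(W, labels, n_iter=20)  # fewer iterations to keep the demo fast
print(f"factor scores: {W.shape}, bootstrap ARI: {ari:.2f}")
```

Because ARI ignores how clusters are numbered, the refit labels can be compared directly with the reference labels. A low mean ARI would suggest the segments depend on which households happened to be sampled.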

Track 2 — Causal Targeting

I first diagnosed the positivity (overlap) assumption. To check whether the treatment and control groups were effectively different populations, I trained a propensity-score (PS) model, and it achieved PS AUC ≈ 0.99, i.e. it could almost perfectly predict who got treated. That is a strong signal of a positivity violation: when treatment is nearly deterministic given the covariates, few treated customers have comparable untreated counterparts. The causally identifiable overlap region (propensity within $[0.1, 0.9]$) was only about 17% of the data; the remaining ~83% sat in a strong-extrapolation region.
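
A diagnostic of this kind can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the model used in this note: it uses a plain logistic regression with out-of-fold predictions, the helper name `overlap_diagnostics` is hypothetical, and the data are synthetic with one covariate that nearly determines treatment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def overlap_diagnostics(X, t, lo=0.1, hi=0.9, cv=5):
    """Out-of-fold propensity scores, their AUC, and the share inside [lo, hi]."""
    model = LogisticRegression(max_iter=1000)
    ps = cross_val_predict(model, X, t, cv=cv, method="predict_proba")[:, 1]
    auc = roc_auc_score(t, ps)
    overlap_share = float(((ps >= lo) & (ps <= hi)).mean())
    return ps, float(auc), overlap_share


# Synthetic data where one covariate nearly determines treatment (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
p_treat = 1.0 / (1.0 + np.exp(-4.0 * X[:, 0]))
t = rng.binomial(1, p_treat)

ps, auc, overlap_share = overlap_diagnostics(X, t)
print(f"PS AUC: {auc:.2f}, share with propensity in [0.1, 0.9]: {overlap_share:.0%}")
```

Out-of-fold scores keep the AUC from being inflated by overfitting. An AUC near 0.5 would indicate treated and untreated customers look alike on the covariates, while an AUC near 1 indicates the two groups barely overlap.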

On top of that diagnosis I estimated the HTE.

CATE estimation: I compared meta-learners (S/T/X-learner) against a Causal Forest (CausalForestDML). Model selection was driven not by statistical significance alone but by targeting-ranking quality (uplift curve / AUUC, the area under the uplift curve) together with estimation stability (variance). The Causal Forest had the highest AUUC and the lowest variance, so I adopted it. The BLP (best linear predictor) heterogeneity test was borderline: there is some evidence of heterogeneity, but it is not strong.
Optimal targeting policy: I built a threshold policy that treats only customers whose $\hat\tau(x)$ exceeds the breakeven threshold (campaign cost ÷ margin), $\pi^\star(x) = \mathbf{1}\{\hat\tau(x) > \text{breakeven}\}$. This is the industrial form of an Optimal Targeting Policy. I also compared learned rules such as a PolicyTree; the continuous CATE threshold dominated, plausibly because a tree coarsens the continuous CATE into a few leaf-level decisions and discards ranking information.
Off-policy validation: rather than deploy the new policy, I estimated its value with Off-Policy Evaluation (OPE), using a doubly-robust estimator. Where positivity is weak, CATE uncertainty is large, so I tuned conservatism with a risk-adjusted variant $\text{CE-CATE}(\lambda) = (1-\lambda)\,\hat\tau + \lambda\cdot\text{LB}$ that blends the point estimate with a lower bound LB on the CATE: $\lambda = 0$ recovers $\hat\tau$, and $\lambda = 1$ targets on the lower bound alone.
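
The policy and validation steps above can be illustrated with a small numeric sketch. Everything here is a toy under stated assumptions: I treat $\hat\tau$ as incremental revenue in dollars, margin as a rate, and cost as dollars per contact; the function names are hypothetical; and the doubly-robust formula is the standard AIPW-style policy-value estimator, which may differ in detail from the one used in this note.

```python
import numpy as np


def breakeven_policy(tau_hat, cost, margin):
    """Treat (1) only where estimated uplift exceeds cost / margin."""
    return (tau_hat > cost / margin).astype(int)


def ce_cate(tau_hat, lower_bound, lam):
    """CE-CATE(lambda) = (1 - lambda) * tau_hat + lambda * LB."""
    return (1.0 - lam) * tau_hat + lam * lower_bound


def dr_policy_value(y, t, ps, mu0, mu1, policy, clip=(0.01, 0.99)):
    """Doubly-robust estimate of the mean outcome if `policy` were followed."""
    ps = np.clip(ps, *clip)                      # guard against extreme weights
    mu_pi = np.where(policy == 1, mu1, mu0)      # model prediction under the policy
    mu_obs = np.where(t == 1, mu1, mu0)          # model prediction under logged action
    p_obs = np.where(t == 1, ps, 1.0 - ps)       # probability of the logged action
    match = (t == policy).astype(float)          # logged action agrees with policy
    return float(np.mean(mu_pi + match / p_obs * (y - mu_obs)))


# Toy inputs (all hypothetical).
tau_hat = np.array([5.0, 1.0, -2.0, 3.0])
lower = tau_hat - 2.0                            # assumed lower bound on each CATE
cost, margin = 0.5, 0.25                         # breakeven = 0.5 / 0.25 = 2.0

policy = breakeven_policy(tau_hat, cost, margin)
cautious = breakeven_policy(ce_cate(tau_hat, lower, lam=0.5), cost, margin)

mu0 = np.full(4, 10.0)
mu1 = mu0 + tau_hat
t = np.array([1, 0, 1, 0])
y = np.where(t == 1, mu1, mu0)                   # outcomes that match the model exactly
ps = np.full(4, 0.5)
value = dr_policy_value(y, t, ps, mu0, mu1, policy)
print(policy, cautious, value)
```

With $\lambda = 0.5$ the fourth customer drops out of the target set, because the blended score falls to the breakeven value and no longer exceeds it. The value returned is in outcome units (mean revenue per customer), so campaign cost would still need to be subtracted to get profit. Clipping the propensity limits the variance of the correction term, which matters most exactly where positivity is weak.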

Key Findings (illustrative)

All numbers below are illustrative results on the public Dunnhumby dataset and should be read as hypothesis-generating only.

Policy value is non-monotone in the targeting rate. Targeting only about 31% of customers put policy value near its optimum, at an illustrative ROI ≈ 125% (roughly $2,426 in profit). Narrowing further raises ROI but lowers absolute profit, and targeting everyone actually produced a net loss (about -$4,657).
“Target everyone” is the worst policy because of negative-CATE segments. Counter-intuitive negative-CATE segments like VIP Heavy (about -$38) and Bulk Shoppers (about -$40) — the uplift literature’s so-called “sleeping dogs” — cancel out the positive effects. Nudging a customer who already buys plenty can yield zero or negative effect (a ceiling effect / coupon-mismatch hypothesis).
Conversely, under-targeted positive-uplift segments surfaced (e.g. Light Grocery) — a clear signal of where to move budget.
Positivity violation constrains everything. PS AUC ≈ 0.99 / ~17% overlap made ATE estimates swing by an order of magnitude across methods (naive vs. IPW vs. AIPW vs. DML), so trustworthy causal claims were effectively limited to the overlap region. Every conclusion is therefore left as a hypothesis pending an A/B test.
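
The non-monotone shape in the first finding follows mechanically from ranking customers by estimated uplift and accumulating per-customer profit. The toy sketch below shows the mechanism only; the five uplift values, the cost, and the margin are made up, and the helper name `profit_curve` is hypothetical.

```python
import numpy as np


def profit_curve(tau_hat, cost, margin):
    """Cumulative expected profit when targeting the top-k customers by tau_hat."""
    order = np.argsort(-tau_hat)                   # highest estimated uplift first
    per_customer = margin * tau_hat[order] - cost  # expected profit of each contact
    cum_profit = np.cumsum(per_customer)
    rate = np.arange(1, len(tau_hat) + 1) / len(tau_hat)
    return rate, cum_profit


# Toy uplift estimates in dollars, including one strongly negative customer.
tau_hat = np.array([10.0, -80.0, 40.0, 0.0, 20.0])
rate, cum_profit = profit_curve(tau_hat, cost=1.0, margin=0.25)

best = int(np.argmax(cum_profit))
print(f"best targeting rate: {rate[best]:.0%}, profit there: {cum_profit[best]:.2f}")
print(f"profit when targeting everyone: {cum_profit[-1]:.2f}")
```

Profit rises while each added customer clears the breakeven uplift (cost ÷ margin) and falls afterwards, so the peak of this curve coincides with the breakeven threshold policy. The curve is built from estimated uplift, so it describes what the model expects rather than a validated outcome.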

Lessons

Predictive ≠ causal targeting. Ranking by response probability is not the same as ranking by uplift. The existence of negative-CATE segments is the sharpest illustration of why targeting on response probability burns money.
Diagnose assumptions first, and honestly. Not assuming positivity but measuring it via PS AUC was the single most important step here. Once you know overlap is only 17%, you read every point estimate differently (more humbly).
The model-selection criterion should be the decision criterion. Choosing the CATE model by targeting-ranking quality (AUUC) rather than statistical significance reflects the fact that we ship a policy, not an estimate.
Uncertainty is a policy knob, not a bug. The risk-adjusted $\lambda$ and OPE let you set “how aggressively to target” differently before vs. after validation — conservative pre-experiment, aggressive once verified.
The honest line for public-data case studies. Absolute dollar totals are illustrative figures that demonstrate the method, not audited business outcomes. Not mixing hypothesis generation with validated causal claims is the discipline this note tries to keep.

Customer Segmentation — NMF + K-Means behavioral segmentation
Uplift Modeling — targeting incremental, not predictive, effect
CATE — individual-level conditional average treatment effect
Meta-learners — S/T/X-learner family for CATE
Causal Forest — nonparametric CATE / adopted model
Optimal Targeting Policy — breakeven threshold policy
Off-Policy Evaluation — estimating value without deployment
Targeting Overview — map of industrial targeting methods
Positivity — the overlap assumption and diagnosing its violation
