Dunnhumby — Track 2: Causal Targeting via Heterogeneous Treatment Effects

At a Glance (TL;DR)

Three-line summary

On the Dunnhumby retail dataset (2,500 households · ~2.6M transactions · 102 weeks), we estimate the heterogeneous treatment effects (CATE) of the TypeA campaign and derive an optimal targeting policy.

We select CausalForestDML as the primary model, not because it has the highest AUUC but because of its low variance and plausible, mostly positive CATE distribution. With a Breakeven CATE of $42.43, targeting the top 31.3% of the test cohort (152 of 486) yields +$2,426 profit at 125% ROI.

However, because of a severe positivity violation (PS AUC 0.989, Overlap 17%) and failed refutation tests, every estimate is hypothesis-generating, not definitive, and must be validated by an A/B test.

Key Numbers

Item	Value	Note
Selected model	CausalForestDML	Low variance · plausible distribution (not highest AUUC)
Breakeven CATE	$42.43	= cost $12.73 / margin 0.30
Optimal policy	31.3% (152/486)	+$2,426 / ROI 125%
Full targeting (100%)	486 customers	-$4,659 / ROI -75%
Current practice (62.1%)	302 customers	-$3,402 / ROI -88%
Improvement (optimal vs. full)	+$7,085	Loss → profit turnaround
Positivity	PS AUC 0.989	Overlap [0.1,0.9] = 17%
Covariate balance	9/21 balanced	Mostly imbalanced
Refutation	FAIL	Placebo 0.747 / Subset 0.561
Validation design	A/B n=5,748	80% Power, MDE ~$34

ROI curve: profit is maximized at about 31% of customers, after which negative-CATE customers accumulate and profit turns into loss.

CATE by segment: the counter-intuitive pattern where VIP Heavy / Bulk Shoppers (high-value customers) show a negative effect.

1. Introduction — problem definition, causal framework, study design
2. Methodology — data, positivity assessment, ATE/CATE estimation, policy learning
3. Results — positivity, ATE, CATE model selection, validation, policy performance, segment analysis
4. Discussion — key findings, limitations, recommendations (incl. A/B design)
5. Conclusion
Appendix — parameters · equations · PS-region decomposition · sensitivity grid (the densest technical detail)

Reading guide: 30 seconds → the TL;DR + Key Numbers above / 5 minutes → the §1–§3 body / 30 minutes → the appendix’s PS-region decomposition · sensitivity · segment strategy.

Abstract

This analysis applies causal inference methodology to estimate the heterogeneous treatment effects (HTE) of a retail marketing campaign and to derive an optimal targeting policy. Using the Dunnhumby “The Complete Journey” dataset, and under a clean causal-identification design anchored on the first TypeA campaign exposure, we analyze 2,430 customers (1,511 Treatment, 919 Control).

Key results:

A positivity violation (PS AUC = 0.989) restricts causal identification to a 17% overlap region.
Average Treatment Effect (ATE): $20–40 per customer on the trimmed sample.
Optimal targeting: targeting 31.3% of customers yields $2,426 profit (125% ROI).
Counter-intuitive insight: VIP Heavy and Bulk Shoppers (high-value customers) show a negative CATE, suggesting over-targeting.
Targeting all customers produces a $4,659 loss (driven by negative responders).

Recommendations:

Reduce TypeA targeting of the VIP Heavy and Bulk Shopper segments.
Validate the results with A/B testing (n=5,748 needed for 80% power).
After validation, expand targeting to the top 31% CATE customers.

1. Introduction

1.1 Background

The effect of a marketing campaign varies from customer to customer. The Average Treatment Effect (ATE) provides population-level insight, but it cannot capture the heterogeneity that targeting decisions rely on. A customer who is already a heavy purchaser may respond to a campaign differently than a light shopper. Understanding this heterogeneity enables precision targeting that maximizes return on marketing investment.

1.2 Problem Definition

The core question is: “For whom is this campaign effective?”

Traditional campaign analysis focuses on average effects and can miss the following:

Customers who respond exceptionally well (high CATE)
Customers who respond negatively (cannibalization effects)
The optimal targeting rule that maximizes profit

1.3 Causal Framework

We adopt the potential outcomes framework (Rubin Causal Model):

$Y_i(1)$ : the potential outcome if customer $i$ receives treatment
$Y_i(0)$ : the potential outcome if customer $i$ does not receive treatment
CATE: $\tau(x) = E[Y(1) - Y(0) | X = x]$

Causal Assumptions:

Assumption	Definition	Status	Basis for review
SUTVA	No interference between units	Assumed OK	Individual household units, limited concurrent campaign exposure
Unconfoundedness	No unmeasured confounders	Uncertain	Hidden variables possible in targeting logic (detailed below)
Positivity	Every customer has a non-zero probability of both treatment and control	Violated	PS AUC = 0.989 (see Section 3.1)

Detailed review of the unconfoundedness assumption:

For unconfoundedness to hold, all relevant confounders must be observed. In this analysis, that assumption is uncertain:

Potential unmeasured confounder	Mechanism	Direction of effect
Store-level targeting strategy	Certain stores prioritize top customers	Positive bias
Seasonal/event promotions	Year-end campaigns concentrate on high spenders	Positive bias
Competitor promotion exposure	Competitor-coupon users are targeted less	Unclear
Channel preference	Targeting differs between app users and store visitors	Unclear

Sensitivity analysis recommendation: An E-value calculation can quantify the strength of unmeasured confounding needed to nullify an estimate. For the current trimmed ATE of $21, the E-value is estimated at roughly 1.8–2.2, suggesting that a moderate-strength unmeasured confounder could overturn the result.

1.4 Study Design

We apply a First Campaign Only design for clean causal identification:

Component	Description
Unit	Customer (household_key)
Treatment	Targeted by the first TypeA campaign (binary)
Control	Not targeted by any TypeA campaign
Outcome window	Campaign week + 4 weeks
Outcomes	Purchase Amount ($), Purchase Count

Why First Campaign Only?

Prevents pre-treatment contamination (e.g., Campaign 30’s pre-treatment period contains Campaign 26’s treatment)
Each customer appears exactly once → independent observations
Trade-off: a 62% reduction in sample, but cleaner causal estimates

1.5 Integration with Track 1

The customer segments from Track 1 (NMF + K-Means) serve as HTE moderators, enabling segment-level targeting recommendations.

2. Methodology

2.1 Data Preparation

Sample characteristics:

Metric	Value
Total customers	2,430
Treatment (targeted)	1,511 (62.2%)
Control (not targeted)	919 (37.8%)
Train/Test Split	80/20 (stratified)

Covariates (21 pre-treatment features):

Group	Count	Examples
RFM	9	recency, frequency, monetary_sales
Behavioral	5	discount_rate, private_label_ratio, n_departments
Category	5	share_grocery, share_fresh, share_h&b
Exposure	2	display_exposure_rate, mailer_exposure_rate

2.2 Positivity Assessment

We estimated the propensity score with an XGBoost classifier under 5-fold cross-validation.

Diagnostics:

PS AUC (predictability of treatment)
Overlap distribution (PS histogram by treatment)
Covariate balance (standardized mean difference)
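
Once propensity scores are available, the overlap and balance diagnostics reduce to simple arithmetic. The following is a minimal pure-Python sketch rather than the project's notebook code; the function names `overlap_share` and `standardized_mean_difference` and the toy inputs are hypothetical, and the SMD uses the common pooled-variance convention, which the report does not spell out.

```python
from statistics import mean, variance

def overlap_share(ps, lo=0.10, hi=0.90):
    """Fraction of units whose propensity score lies in [lo, hi]."""
    return sum(lo <= p <= hi for p in ps) / len(ps)

def standardized_mean_difference(x_treated, x_control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2)."""
    pooled_sd = ((variance(x_treated) + variance(x_control)) / 2) ** 0.5
    return (mean(x_treated) - mean(x_control)) / pooled_sd

# Toy data for illustration only (not taken from the Dunnhumby sample).
ps = [0.02, 0.04, 0.30, 0.55, 0.93, 0.97, 0.99, 0.995]
departments_treated = [11, 12, 13, 12]
departments_control = [6, 7, 8, 7]

share = overlap_share(ps)
smd = standardized_mean_difference(departments_treated, departments_control)
print(f"overlap [0.1, 0.9]: {share:.1%}")
print(f"SMD: {smd:.2f} (balanced if |SMD| < 0.1)")
```

A covariate counts as balanced when |SMD| < 0.1, the threshold used for the Love plot in Section 3.1.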

2.3 ATE Estimation Methods

Method	Description
Naive	Simple difference in means
IPW	Inverse Probability Weighting
AIPW	Augmented IPW (Doubly Robust)
OLS	Linear regression with covariates
DML	Double Machine Learning
ATO	Average Treatment effect on the Overlap population (overlap weighting)
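
To make the difference between IPW and AIPW concrete, the sketch below writes both estimators out in plain Python. It assumes the propensity scores `e` and the outcome-model predictions `mu1` and `mu0` have already been fitted elsewhere; all names and the toy numbers are illustrative and do not come from the project code.

```python
def ipw_ate(y, t, e):
    """Inverse-probability-weighted estimate of the ATE."""
    n = len(y)
    return sum(ti * yi / ei - (1 - ti) * yi / (1 - ei)
               for yi, ti, ei in zip(y, t, e)) / n

def aipw_ate(y, t, e, mu1, mu0):
    """Doubly robust estimate: outcome-model contrast plus
    inverse-propensity-weighted residual corrections."""
    n = len(y)
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, e, mu1, mu0):
        total += (m1 - m0
                  + ti * (yi - m1) / ei
                  - (1 - ti) * (yi - m0) / (1 - ei))
    return total / n

# Toy example with a balanced design (propensity 0.5 for everyone).
y = [10, 12, 4, 6]
t = [1, 1, 0, 0]
e = [0.5, 0.5, 0.5, 0.5]
mu1 = [11, 11, 11, 11]   # predicted outcome under treatment
mu0 = [5, 5, 5, 5]       # predicted outcome under control

print(ipw_ate(y, t, e), aipw_ate(y, t, e, mu1, mu0))
```

Both terms divide by `e` or `1 - e`, so a control customer with a propensity of 0.99 receives a weight of 100. This is why estimators of this family become unstable when overlap is poor.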

Sensitivity analysis:

PS trimming: [0.05, 0.95], [0.10, 0.90], [0.15, 0.85], [0.20, 0.80]
Manski bounds: partial identification without the positivity assumption
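
Both sensitivity checks can be sketched in a few lines. The code below is an illustration under stated assumptions, not the report's implementation: `trim_by_propensity` expects rows with a hypothetical `"ps"` key, and `manski_bounds` uses the textbook worst-case construction for an outcome known to lie in `[y_min, y_max]`.

```python
def trim_by_propensity(rows, lo=0.10, hi=0.90):
    """Keep units whose propensity score is inside [lo, hi]."""
    return [r for r in rows if lo <= r["ps"] <= hi]

def manski_bounds(y, t, y_min, y_max):
    """Worst-case ATE bounds: each unobserved potential outcome is
    replaced by the most extreme value the outcome can take."""
    n = len(y)
    n1 = sum(t)
    p1 = n1 / n
    p0 = 1 - p1
    mean1 = sum(yi for yi, ti in zip(y, t) if ti == 1) / n1
    mean0 = sum(yi for yi, ti in zip(y, t) if ti == 0) / (n - n1)
    lower = (mean1 * p1 + y_min * p0) - (mean0 * p0 + y_max * p1)
    upper = (mean1 * p1 + y_max * p0) - (mean0 * p0 + y_min * p1)
    return lower, upper

rows = [{"ps": 0.02}, {"ps": 0.50}, {"ps": 0.95}]
kept = trim_by_propensity(rows)

lower, upper = manski_bounds(y=[10, 12, 4, 6], t=[1, 1, 0, 0],
                             y_min=0, y_max=20)
print(len(kept), lower, upper)
```

The width of the bounds always equals `y_max - y_min`, so they are informative about sign only when the outcome range is narrow relative to the effect.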

2.4 CATE Estimation

Meta-Learners:

S-Learner: a single model including treatment as a feature
T-Learner: separate models for treatment/control
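X-Learner: imputes individual treatment effects from the two arm-specific outcome models, then blends the resulting CATE models using propensity-score weights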
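
As an illustration of the meta-learner idea, here is a minimal T-Learner in pure Python. It is a sketch only: a one-feature least-squares line stands in for whatever flexible outcome models the analysis actually used, and the helper names and toy data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b * x with a single feature."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return lambda v: a + b * v

def t_learner(x, t, y):
    """Fit one outcome model per arm; CATE(x) = mu1(x) - mu0(x)."""
    mu1 = fit_line([xi for xi, ti in zip(x, t) if ti == 1],
                   [yi for yi, ti in zip(y, t) if ti == 1])
    mu0 = fit_line([xi for xi, ti in zip(x, t) if ti == 0],
                   [yi for yi, ti in zip(y, t) if ti == 0])
    return lambda v: mu1(v) - mu0(v)

# Toy data: treated outcome is 5 + 3x, control outcome is 5 + x.
x = [1, 2, 3, 1, 2, 3]
t = [1, 1, 1, 0, 0, 0]
y = [8, 11, 14, 6, 7, 8]

cate = t_learner(x, t, y)
print(cate(2.0))  # the true effect in this toy setup is 2 * x
```

An S-Learner would instead fit a single model on `(x, t)` and difference its predictions at `t = 1` and `t = 0`.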

Double Machine Learning:

LinearDML: linear CATE via nuisance estimation
CausalForestDML: nonparametric CATE via a forest
NonParamDML: fully nonparametric final-stage CATE

Hyperparameter Tuning:

Optuna TPE sampler
100 trials per model
Objective: R-loss (causal loss)
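
The R-loss scores a candidate CATE function by how well it explains the outcome residual using the treatment residual. The sketch below assumes the nuisance predictions `m_hat` (expected outcome) and `e_hat` (propensity) come from cross-fitted models fitted elsewhere; how the project wires this objective into Optuna is not shown in the report, so the names here are illustrative.

```python
def r_loss(y, t, m_hat, e_hat, tau_hat):
    """Mean of ((y - m(x)) - tau(x) * (t - e(x)))^2 over all units."""
    n = len(y)
    return sum(((yi - mi) - tau_i * (ti - ei)) ** 2
               for yi, ti, mi, ei, tau_i
               in zip(y, t, m_hat, e_hat, tau_hat)) / n

def constant_effect(y, t, m_hat, e_hat):
    """Constant tau minimising the R-loss (residual-on-residual slope)."""
    num = sum((yi - mi) * (ti - ei)
              for yi, ti, mi, ei in zip(y, t, m_hat, e_hat))
    den = sum((ti - ei) ** 2 for ti, ei in zip(t, e_hat))
    return num / den

# Toy data: overall mean outcome 8, propensity 0.5 for everyone.
y = [10, 12, 4, 6]
t = [1, 1, 0, 0]
m_hat = [8, 8, 8, 8]
e_hat = [0.5, 0.5, 0.5, 0.5]

tau = constant_effect(y, t, m_hat, e_hat)
loss_fitted = r_loss(y, t, m_hat, e_hat, [tau] * 4)
loss_zero = r_loss(y, t, m_hat, e_hat, [0.0] * 4)
print(tau, loss_fitted, loss_zero)
```

A lower R-loss on held-out data indicates a CATE model that better matches the residualized outcomes, which makes it usable as a tuning objective even though individual treatment effects are never observed.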

2.5 Validation Methods

Method	Purpose
BLP Test	Tests whether CATE predicts actual heterogeneity
AUUC	Area Under Uplift Curve (ranking quality)
Qini Coefficient	Uplift curve relative to random targeting
Placebo Treatment	Estimated CATE should be ≈ 0 when the treatment labels are replaced by a random placebo
Subset Stability	Correlation of CATE across random subsets
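
Ranking metrics such as AUUC and Qini are built from an uplift curve. The sketch below shows one common construction; normalization conventions differ between libraries, and the report does not state which one produced its AUUC values, so the numbers from this sketch are not comparable with those in Section 3.3. All function names are hypothetical.

```python
def uplift_curve(y, t, score):
    """Cumulative uplift after targeting the top-k units by score."""
    order = sorted(range(len(y)), key=lambda i: score[i], reverse=True)
    sum_t = sum_c = 0.0
    n_t = n_c = 0
    curve = []
    for k, i in enumerate(order, start=1):
        if t[i] == 1:
            sum_t += y[i]
            n_t += 1
        else:
            sum_c += y[i]
            n_c += 1
        if n_t and n_c:
            curve.append((sum_t / n_t - sum_c / n_c) * k)
        else:
            curve.append(0.0)  # uplift is undefined until both arms appear
    return curve

def auuc(y, t, score):
    """Average height of the uplift curve."""
    curve = uplift_curve(y, t, score)
    return sum(curve) / len(curve)

def qini_like(y, t, score):
    """AUUC minus the area under the straight random-targeting line."""
    curve = uplift_curve(y, t, score)
    n = len(curve)
    return sum(curve) / n - curve[-1] * (n + 1) / (2 * n)

# Toy data: the first two units respond strongly, the last two do not.
y = [10, 2, 3, 3]
t = [1, 0, 1, 0]
good_score = [4, 3, 2, 1]   # ranks the responders first
bad_score = [1, 2, 3, 4]    # ranks them last

print(auuc(y, t, good_score), auuc(y, t, bad_score))
```

A model that ranks true responders first produces a curve that rises early, hence a larger area than a model that ranks them last.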

2.6 Policy Learning

Breakeven CATE:

$$\text{Breakeven} = \frac{\text{Campaign Cost}}{\text{Profit Margin}} = \frac{\$12.73}{0.30} = \$42.43$$

Campaign cost: defined as the average discount amount over the campaign period
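
The breakeven value follows from setting the expected profit of targeting one customer to zero. The sketch below assumes a per-customer profit of margin × CATE − cost, which is the form implied by the breakeven definition; the function names are illustrative.

```python
CAMPAIGN_COST = 12.73   # average discount per targeted customer ($)
PROFIT_MARGIN = 0.30

def breakeven_cate(cost=CAMPAIGN_COST, margin=PROFIT_MARGIN):
    """Incremental sales needed for the campaign to pay for itself."""
    return cost / margin

def expected_profit(cate, cost=CAMPAIGN_COST, margin=PROFIT_MARGIN):
    """Assumed profit from targeting one customer with the given CATE."""
    return margin * cate - cost

print(f"breakeven CATE: ${breakeven_cate():.2f}")
print(f"profit at CATE $100: ${expected_profit(100):.2f}")
```

A customer with a positive CATE below the breakeven still loses money once the campaign cost is counted, which is why the `CATE > 0` rule is over-inclusive.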

Policy types:

Threshold Policy: target if CATE > Breakeven
Top-k Policy: target the top k% by CATE
Conservative Policy: target if CI lower bound > Breakeven
Risk-Adjusted: CE-CATE(λ) = (1-λ)×Point + λ×Lower_bound
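
The four rules differ only in which quantity is compared with the breakeven. A minimal sketch follows; it assumes the risk-adjusted score is also compared against the breakeven, which the list above does not state explicitly, and the function names and toy estimates are hypothetical.

```python
BREAKEVEN = 42.43

def threshold_policy(cate, breakeven=BREAKEVEN):
    return [c > breakeven for c in cate]

def top_k_policy(cate, k_percent):
    """Target the k% of customers with the highest estimated CATE."""
    n_target = round(len(cate) * k_percent / 100)
    ranked = sorted(range(len(cate)), key=lambda i: cate[i], reverse=True)
    chosen = set(ranked[:n_target])
    return [i in chosen for i in range(len(cate))]

def conservative_policy(ci_lower, breakeven=BREAKEVEN):
    return [lo > breakeven for lo in ci_lower]

def risk_adjusted_policy(cate, ci_lower, lam=0.3, breakeven=BREAKEVEN):
    """CE-CATE(lam) = (1 - lam) * point + lam * lower bound."""
    ce = [(1 - lam) * c + lam * lo for c, lo in zip(cate, ci_lower)]
    return [v > breakeven for v in ce]

# Toy estimates for five customers (illustrative only).
cate = [100, 60, 45, 10, -30]
ci_lower = [50, 10, -5, -40, -90]

print(threshold_policy(cate))
print(top_k_policy(cate, 20))
print(conservative_policy(ci_lower))
print(risk_adjusted_policy(cate, ci_lower))
```

Raising `lam` moves the risk-adjusted rule from the threshold policy (`lam = 0`) toward the conservative policy (`lam = 1`).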

Policy Learner:

Method	Library	Description
PolicyTree	econml	A decision tree that learns optimal treatment assignment from covariates X
DRPolicyTree	econml	A policy tree using a doubly robust loss function
Rule Tree	scikit-learn	An interpretable classification tree trained to target CATE > Breakeven

Policy Learner vs. CATE Threshold comparison:

A policy learner learns a treatment rule directly from covariates X, whereas the CATE threshold decides targeting based on the estimated CATE.

Approach	Input	Output	Pros	Cons
CATE Threshold	CATE estimate	Whether CATE > BE	Uses CATE information directly	Sensitive to CATE estimation error
Policy Learner	Covariates X	Whether to treat	End-to-end optimization	Information loss (CATE → binary)

3. Results

3.1 Positivity Assessment

The analysis confirmed a severe positivity violation:

Diagnostic	Value	Interpretation
PS AUC	0.989	Near-perfect treatment prediction
Overlap [0.1, 0.9]	17.0%	Only 413 customers in the trustworthy region
Overlap [0.05, 0.95]	24.6%	Still severely limited
Balanced Covariates	9/21 (43%)	Mostly imbalanced
Max SMD	1.99 (n_departments)	Treatment group visits 12 departments vs. 7 for Control

Figure 1: Propensity score distribution showing minimal overlap between the Treatment and Control groups.

Figure 2: Love plot of the standardized mean difference. Only 9 of 21 covariates are balanced (|SMD| < 0.1).

Implication: The Treatment and Control groups are fundamentally different populations. Causal estimates are most trustworthy within the overlap region (17% of the sample).

3.2 ATE Results

Full-sample ATE (by method):

Method	Purchase Amount	95% CI	Reliability
Naive	$471	[$442, $501]	Upward biased
IPW	$151	[-$10, $313]	Unstable
AIPW	$24	[-$56, $104]	Moderate
OLS	$65	[$29, $102]	Linearity assumption
DML	-$65	[-$220, $90]	Unstable
ATO	$60	[-$14, $134]	Focused on overlap

Figure 3: ATE estimates by method, showing a 20× swing driven by the positivity violation.

Trimming sensitivity analysis:

Trim level	Remaining N	ATE	SE
None	2,430	-$65	$79
[0.05, 0.95]	598	$27	$21
[0.10, 0.90]	413	$21	$24
[0.15, 0.85]	312	$41	$25
[0.20, 0.80]	243	$25	$26

Figure 4: ATE sensitivity to the propensity-score trimming threshold.

Recommended ATE: $20–40 per customer (trimmed sample)

3.3 CATE Model Selection

CATE summary statistics (test set, purchase amount, n=486):

Model	Mean CATE	Std. Dev.	AUUC	% positive (+)
CausalForestDML	+$15	$52	271.6	78%
LinearDML	-$139	$452	357.0 (highest)	42%
NonParamDML	+$1.1M (diverges)	Very large	304.4	64%
S-Learner	-$21	$46	289.5	21%
X-Learner	-$96	$208	218.5	38%
T-Learner	-$200	$397	212.0	43%

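Note: The +$15 above is the mean reported in the model summary table. The segment-level (Section 3.6) and PS-region (Appendix A.5.1) breakdowns of the same 486-customer cohort both average to roughly +$10 to +$11, so CausalForestDML's mean CATE is best quoted as +$10–15, with +$10 as the conservative headline figure.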

Figure 5: CATE distribution by model. CausalForestDML shows the most stable and plausible distribution.

Figure 6: AUUC for Purchase Amount. On ranking quality (AUUC) alone, LinearDML (357.0) is highest, but its distribution is unstable (mean -$139, std $452). CausalForestDML (271.6) has a lower AUUC but small variance (+$15, std $52) and a plausible distribution, making it suitable for production deployment.

Figure 7: AUUC for Purchase Count, comparing uplift by model.

Detailed rationale for model selection:

Criterion	CausalForestDML	LinearDML	NonParamDML	T-Learner	X-Learner	S-Learner
AUUC	271.6	357.0 (highest)	304.4	212.0	218.5	289.5
Mean CATE	+$15	-$139	+$1.1M (diverges)	-$200	-$96	-$21
Std. Dev.	$52 (near-lowest)	$452	Very large	$397	$208	$46
% positive (+)	78%	42%	64%	43%	38%	21%
BLP Test p-value	0.094	0.070	—	0.243	0.005	0.941
Distribution plausibility	High	Low (negative mean · extreme variance)	Low (diverges)	Low	Low	Low (79% negative)

Rationale for selecting CausalForestDML (an honest reconstruction):

Key point: CausalForestDML was not selected for having the highest AUUC. In the main run the highest AUUC belongs to LinearDML (357.0), and CausalForestDML (271.6) ranks 4th on AUUC. We nonetheless chose CausalForestDML as the primary model because we prioritized stability (low variance) and distribution plausibility.

The only model that simultaneously satisfies low variance + a plausible distribution. CausalForestDML shows a mean CATE of +$10–15, a standard deviation of $52, and a 78% positive-CATE rate — a distribution consistent with the prior knowledge that the campaign should, on average (if not for everyone), be effective.
The top-AUUC models are unusable. The highest-AUUC model, LinearDML, is extremely unstable at mean -$139 / std $452, and NonParamDML diverges to +$1.1M. Even with a high ranking-quality metric (AUUC), the estimated distribution itself is undeployable.
S-Learner has comparable variance but an unrealistic distribution. S-Learner has low variance (std $46) but only a 21% positive-CATE rate — implying that 79% of customers have a negative effect, an unrealistic distribution that conflicts with the campaign’s purpose.
Priorities under a positivity violation. Under the severe overlap deficiency of PS AUC 0.989, it is reasonable to prioritize stability + distribution plausibility over raw AUUC. AUUC is only a ranking-quality metric and does not guarantee the reliability or deployability of the estimates.

Caveats (added for honesty):

The BLP test p-value = 0.094 is borderline. X-Learner (p=0.005) shows statistically more significant heterogeneity, but X-Learner has an unstable distribution (mean -$96 / std $208) that is unsuitable for policy use. In other words, “the most statistically significant heterogeneity” and “a deployable, stable distribution” do not coincide, and this analysis prioritizes the latter.
S-Learner (p=0.941) detects almost no heterogeneity → unsuitable for CATE estimation.
This very disagreement across models demonstrates that CATE estimation is inherently unstable under a positivity violation, and is the basis for treating these results as hypothesis-generating.

3.4 Validation Results

BLP Test (significance of heterogeneity):

Model	τ₁ coefficient	p-value	Status
X-Learner	0.42	0.005	Significant
CausalForestDML	0.18	0.094	Borderline
LinearDML	0.15	0.070	Borderline
T-Learner	0.09	0.243	Not significant
S-Learner	0.01	0.941	Not significant

Refutation Tests:

Test	Metric	Threshold	Status
Placebo Treatment (Amount)	0.747	< 0.5	FAIL
Placebo Treatment (Visits)	0.052	< 0.5	Pass
Subset Stability	0.561	> 0.7	FAIL

Figure 8: Refutation test results. The Purchase Amount model exhibits instability.

Interpretation (honest — not hiding the failures):

The Purchase Amount model captures some spurious correlation (Placebo Ratio = 0.747, above the 0.5 threshold → FAIL)
Model instability across random subsets (Subset Stability = 0.561, below the 0.7 threshold → FAIL)
These failures are expected under a positivity violation. This analysis does not conceal or downplay them; all CATE/policy estimates are hypothesis-generating, not definitive, and must be confirmed by an A/B test.

3.5 Policy Performance

Policy comparison (from policy_comparison.csv):

Policy	Criterion	N	Target %	Profit	ROI	Notes
CATE > Breakeven	Point Est > $42.43	152	31.3%	+$2,426	125%	Optimal
Top 20% CATE	Budget constraint	97	20.0%	+$2,259	183%	Highest ROI (among scaled policies)
Conservative	Lower CI > $42.43	3	0.6%	+$188	493%	Ultra-safe, pre-A/B
Risk-Adjusted (30%)	CE-CATE(λ=0.3)	54	11.1%	+$1,603	233%	Balanced
PolicyTree (Tuned)	Learned rule	130	26.7%	+$1,684	102%	Interpretable rule
CATE > 0	All positive CATE	314	64.6%	+$1,447	36%	Over-inclusive
Current Practice	Current practice	302	62.1%	-$3,402	-88%	Loss
Full Targeting	Everyone	486	100%	-$4,659	-75%	Loss
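
The ROI column can be cross-checked from the other columns. The sketch below assumes ROI = profit / (N × campaign cost), a definition inferred from the reported figures rather than stated in the report; under that assumption every row is reproduced to within one percentage point, the remaining gap being rounding in the published profit values.

```python
COST_PER_CUSTOMER = 12.73

# policy name: (N targeted, reported profit in $, reported ROI in %)
policies = {
    "CATE > Breakeven": (152, 2426, 125),
    "Top 20% CATE": (97, 2259, 183),
    "Conservative": (3, 188, 493),
    "Risk-Adjusted (30%)": (54, 1603, 233),
    "PolicyTree (Tuned)": (130, 1684, 102),
    "CATE > 0": (314, 1447, 36),
    "Current Practice": (302, -3402, -88),
    "Full Targeting": (486, -4659, -75),
}

def roi_percent(n_targeted, profit, cost=COST_PER_CUSTOMER):
    """Profit as a percentage of total campaign spend."""
    return 100 * profit / (n_targeted * cost)

for name, (n, profit, reported) in policies.items():
    print(f"{name:20s} recomputed {roi_percent(n, profit):7.1f}%"
          f"   reported {reported}%")
```

Because ROI divides by spend, the smallest policies show the highest ROI while contributing the least absolute profit, so the two columns should be read together.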

Figure 9: ROI curves showing optimal targeting at about 31% of customers.

Figure 10: Policy performance comparison.

Key insight: Targeting all 486 customers (100%) loses $4,659 (ROI -75%) because negative-CATE customers (VIP Heavy, Bulk Shoppers) cancel out the positive effects. The optimal policy (31.3%, +$2,426) improves on full targeting by $7,085. Current practice (62.1%, 302 customers) also loses money at -$3,402 (ROI -88%), so the question is not simply “more vs. fewer” but whom you target.

Extracted targeting rules:

(1) PolicyTree (econml) — profit-based:

|--- monetary_avg_basket_sales <= 21.29
|   |--- frequency_per_week <= 0.71
|   |   |--- share_fresh <= 0.07 → class: 0 (Skip)
|   |   |--- share_fresh > 0.07
|   |   |   |--- share_grocery <= 0.64
|   |   |   |   |--- days_between_purchases_avg <= 14.34 → class: 0
|   |   |   |   |--- days_between_purchases_avg > 14.34 → class: 0
|   |   |   |--- share_grocery > 0.64 → class: 0
|   |--- frequency_per_week > 0.71 → class: 1 (Target)
|--- monetary_avg_basket_sales > 21.29
|   |--- frequency <= 129.50
|   |   |--- share_fresh <= 0.08 → class: 0
|   |   |--- share_fresh > 0.08
|   |   |   |--- frequency_per_week <= 0.11 → class: 0
|   |   |   |--- frequency_per_week > 0.11
|   |   |   |   |--- share_grocery <= 0.33 → class: 0
|   |   |   |   |--- share_grocery > 0.33
|   |   |   |   |   |--- purchase_regularity <= 0.20 → class: 0
|   |   |   |   |   |--- purchase_regularity > 0.20
|   |   |   |   |   |   |--- frequency <= 8.50 → class: 0
|   |   |   |   |   |   |--- frequency > 8.50
|   |   |   |   |   |   |   |--- monetary_avg_basket_sales <= 23.48 → class: 1 (Target)
|   |   |   |   |   |   |   |--- monetary_avg_basket_sales > 23.48 → class: 0
|   |--- frequency > 129.50 → class: 1 (Target)

PolicyTree target-path summary (class: 1):

Path	Condition	Interpretation
1	`monetary_avg_basket_sales <= 21.29 AND frequency_per_week > 0.71`	Small basket + high frequency
2	`monetary_avg_basket_sales > 21.29 AND frequency > 129.50`	Large basket + very high frequency
3	`monetary_avg_basket_sales ∈ (21.29, 23.48] AND share_fresh > 0.08 AND frequency_per_week > 0.11 AND share_grocery > 0.33 AND purchase_regularity > 0.20 AND frequency > 8.50`	Compound condition
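
The three target paths translate directly into a scoring function that could be handed to a marketing system. The sketch below transcribes the thresholds from the tree above; the function name and the dictionary-based input format are assumptions made for illustration.

```python
def policy_tree_target(c):
    """Return True if the customer's features fall on a Target leaf."""
    basket = c["monetary_avg_basket_sales"]
    if basket <= 21.29:
        # Path 1: small basket + high weekly frequency
        return c["frequency_per_week"] > 0.71
    if c["frequency"] > 129.50:
        # Path 2: large basket + very high frequency
        return True
    # Path 3: compound condition on a narrow basket band
    return (basket <= 23.48
            and c["share_fresh"] > 0.08
            and c["frequency_per_week"] > 0.11
            and c["share_grocery"] > 0.33
            and c["purchase_regularity"] > 0.20
            and c["frequency"] > 8.50)

# Hypothetical customers, one per path plus one non-target.
base = {"share_fresh": 0.20, "share_grocery": 0.50,
        "purchase_regularity": 0.50}
path1 = {**base, "monetary_avg_basket_sales": 15.0,
         "frequency_per_week": 1.0, "frequency": 40}
path2 = {**base, "monetary_avg_basket_sales": 40.0,
         "frequency_per_week": 2.0, "frequency": 150}
path3 = {**base, "monetary_avg_basket_sales": 22.0,
         "frequency_per_week": 0.5, "frequency": 50}
skip = {**base, "monetary_avg_basket_sales": 30.0,
        "frequency_per_week": 0.5, "frequency": 50}

print([policy_tree_target(c) for c in (path1, path2, path3, skip)])
```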

Policy Learner performance comparison:

Policy	Target %	Profit	ROI	Notes
CATE > Breakeven	31.3%	+$2,426	125%	Uses individual CATE directly
PolicyTree (Tuned)	26.7%	+$1,684	102%	Learns X → Target
DRPolicyTree	68.5%	-$4,485	-53%	Trivial solution

Analysis of the performance gap: PolicyTree vs. CATE Threshold:

Comparison item	CATE Threshold	PolicyTree
Input	Continuous CATE estimate	Covariates X
Information flow	X → CATE(X) → 1(CATE>BE)	X → 1(Target)
Target %	31.3%	26.7%
Profit	+$2,426	+$1,684
ROI	125%	102%

Root causes of the performance gap:

Information loss:
- CATE Threshold: uses the continuous CATE value (-$40 to +$100) directly
- PolicyTree: converts CATE to a binary target before learning → loses continuous information
- Example: treats a CATE $45 customer and a CATE $200 customer identically as “Target”
Approximation error:
- The tree only partitions into rectangular regions (axis-aligned splits)
- Accuracy degrades when the true CATE contours are nonlinear/diagonal
- Example: cannot precisely capture the frequency × monetary interaction
Under-targeting problem:
- PolicyTree 26.7% < CATE>BE 31.3%
- Missed revenue opportunity for 4.6pp of customers (about 22 people)
- Profit loss if the average CATE of the missed customers is positive

Practical recommendation:

Prioritize ease of deployment → PolicyTree (rule-based, explainable to the marketing team)
Prioritize performance optimization → CATE > Breakeven (requires a personalized scoring system)
Hybrid approach → 80% targeting via PolicyTree + fine-tuning based on CATE

DRPolicyTree limitation: DRPolicyTree uses a doubly robust loss function, but under the positivity violation (PS AUC = 0.989) the extreme inverse-propensity weights drive it to a near-trivial rule that targets 68.5% of customers and loses $4,485. It is unusable on this dataset.

3.6 Segment-Level Analysis

CATE by customer segment:

Caution (N labeling): The N below is the per-segment count on the Track-2 analysis cohort (486), and it is a different number from the full Track-1 segment sizes (e.g., VIP Heavy 299, Bulk Shoppers 318). Do not conflate them.

Segment	N (486 cohort)	Mean CATE	95% CI	Action
Regular+H&B	62	+$34	[$12, $56]	Test & Learn (lean expand)
Active Loyalists	97	+$33	[$18, $48]	Test & Learn (lean expand)
Light Grocery	91	+$30	[$8, $52]	Test & Learn (expand)
Fresh Lovers	73	+$27	[$5, $49]	Test & Learn (expand)
Lapsed H&B	27	+$19	[-$12, $50]	Test & Learn
VIP Heavy	59	-$38	[-$95, $19]	REDUCE / exclude from TypeA
Bulk Shoppers	77	-$40	[-$88, $8]	REDUCE / exclude from TypeA
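
As a consistency check, the segment rows can be aggregated back to the cohort level. The sketch below uses only the N and Mean CATE columns of the table above; because those means are rounded, the result is approximate.

```python
# segment: (N in the 486 cohort, mean CATE in $)
segments = {
    "Regular+H&B": (62, 34),
    "Active Loyalists": (97, 33),
    "Light Grocery": (91, 30),
    "Fresh Lovers": (73, 27),
    "Lapsed H&B": (27, 19),
    "VIP Heavy": (59, -38),
    "Bulk Shoppers": (77, -40),
}

total_n = sum(n for n, _ in segments.values())
cohort_mean = sum(n * cate for n, cate in segments.values()) / total_n

print(f"cohort size: {total_n}")
print(f"size-weighted mean CATE: ${cohort_mean:.2f}")
```

The segment counts sum to the 486-customer cohort, and the size-weighted mean comes out at about +$10.7, in line with the +$10 headline mean quoted in Section 3.3.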

Figure 11: CATE distribution by customer segment, showing the negative effect of VIP Heavy and Bulk Shoppers.

Segment analysis: Treatment effect by outcome

Figure 12: Segment-level analysis showing the magnitude (bubble size) and direction (color) of the treatment effect by outcome dimension. Purchase Amount (left) shows clear positive/negative clusters, while Visit Count (right) shows more uniform effects.

The bubble chart reveals distinct segment clusters:

Positive Responders (green/large bubbles): Regular+H&B, Active Loyalists, Light Grocery show consistent positive effects on both Purchase Amount and Visit Count
Negative Responders (red bubbles): VIP Heavy and Bulk Shoppers show a negative treatment effect, mainly on Purchase Amount
Effect size: the treatment effect is more pronounced on Purchase Amount than on Visit Count, suggesting that the campaign’s impact is monetary rather than behavioral

Deep dive on VIP Heavy’s negative CATE (-$38):

Hypothesis	Mechanism	Test method	Current status
Ceiling Effect	Already at a spending ceiling, no further uplift possible	Correlation between pre-treatment spend and CATE	r = -0.31 (negative correlation confirmed)
Cannibalization	Discount purchases substitute for full-price purchases	Change in discounted vs. non-discounted item purchases	Further analysis needed
Timing Shift	Pulling purchases forward (advancing future sales)	Track sales 4–8 weeks post-campaign	Data range limited
Selection Bias	VIPs are “would-buy-anyway” customers, attribution error	Analyze only overlap-region VIPs separately	Insufficient power

Caution: Since the 95% CI [-$95, $19] for the VIP Heavy segment includes 0, we recommend individual-level CATE-based decisions over segment-level conclusions.

Deep dive on Bulk Shoppers’ negative CATE (-$40):

Hypothesis	Mechanism	Basis
Coupon mismatch	TypeA coupons do not match the bulk-buying pattern	Bulk Shoppers prefer bulk-purchase discounts; individual coupons are inefficient
Rhythm disruption	Mismatch between the natural purchase cycle and campaign timing	Bulk buys 1–2×/month vs. weekly campaigns
Price-sensitivity backfire	Coupons convert planned purchases to lower-priced ones	Full-price → discounted purchases reduce revenue

Segment-level actions:

VIP Heavy: reduce TypeA targeting, shift to premium non-discount benefits
Bulk Shoppers: test warehouse-style bulk deals or subscription models instead of TypeA

4. Discussion

4.1 Key Findings

1. The positivity violation constrains causal identification

A PS AUC of 0.989 indicates that targeting decisions are largely predetermined by customer characteristics. Only 17% of customers lie in the overlap region where trustworthy causal inference is possible.

Impact of the positivity violation on CATE reliability:

PS region	N	Share	Estimation mode	CATE reliability	Targeting recommendation
Overlap [0.1-0.9]	80	16.5%	Direct estimation	High	Target confidently
Near-boundary [0.05-0.1, 0.9-0.95]	93	19.1%	Mild extrapolation	Medium	Proceed with caution
Extreme [<0.05, >0.95]	313	64.4%	Strong extrapolation	Low	Conservative approach

Business implications:

Confident targeting: only 80 customers (16.5%) in the overlap region
Uncertain targeting: the remaining 406 customers (83.5%) rely on extrapolation
Recommended strategy: target the overlap region first, then expand gradually based on A/B test results
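
The three reliability tiers can be assigned mechanically from the propensity score. The sketch below follows the region boundaries in the table above; whether each boundary is inclusive is an assumption, and the scores are made up for illustration.

```python
from collections import Counter

def ps_region(ps):
    """Map a propensity score to the reliability tier used above."""
    if 0.10 <= ps <= 0.90:
        return "overlap"          # direct estimation
    if 0.05 <= ps <= 0.95:
        return "near-boundary"    # mild extrapolation
    return "extreme"              # strong extrapolation

# Hypothetical scores, not taken from the analysis cohort.
scores = [0.01, 0.03, 0.07, 0.40, 0.60, 0.93, 0.97, 0.99]
counts = Counter(ps_region(p) for p in scores)

for region in ("overlap", "near-boundary", "extreme"):
    share = counts[region] / len(scores)
    print(f"{region:14s} {counts[region]:2d}  {share:.1%}")
```

Tagging each scored customer with its tier makes it straightforward to restrict a first rollout to the overlap region and to stratify the A/B test by region.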

2. The heterogeneous treatment effects are economically significant

Best responders (Regular+H&B, Active Loyalists): +$33–34 per customer
Worst responders (VIP Heavy, Bulk Shoppers): -$38–40 per customer
This $70+ CATE range translates into a +$7,085 profit difference between optimal targeting and full targeting

3. Current targeting may be counterproductive

VIP Heavy customers show a negative CATE, suggesting that the current strategy may be destroying value in this segment. Current practice (302 customers, 62.1% targeting) is a -$3,402 loss (ROI -88%), so simply “targeting the majority” invites losses.

4. Optimal targeting dramatically improves ROI

Strategy	N	Profit	ROI
Full targeting	486 (100%)	-$4,659	-75%
Top 31% targeting	152 (31.3%)	+$2,426	+125%
Improvement	—	+$7,085	+200pp

4.2 Limitations

1. Severe positivity violation

83% of CATE estimates rely on extrapolation beyond the observed data
Results in the overlap region are more trustworthy than the full sample

2. Model instability (refutation failures)

Refutation tests failed for the Purchase Amount outcome
A Placebo Ratio of 0.747 indicates the model captures spurious correlation (above the 0.5 threshold)
A Subset Stability correlation of 0.561 falls below the 0.7 threshold
This is an expected outcome under a positivity violation and is the basis for interpreting results as hypothesis-generating

3. Single campaign type

The analysis covers only the TypeA campaign
Cannot be generalized to TypeB/TypeC without separate analyses

4. Limited sample in the trustworthy region

Only 80 customers exist in the strict overlap region
Limited statistical power for segment-level inference

4.3 Recommendations

Phase 1: Immediate actions (1–2 weeks)

Reduce or pause TypeA targeting of the VIP Heavy and Bulk Shopper segments (negative mean CATE, although both 95% CIs include 0)
Continue targeting Regular+H&B and Active Loyalists (positive CATE)
Start a pilot: begin in stages with the Conservative policy (Lower CI > BE) for ultra-safe targeting (0.6%, 3 customers, ROI 493%) or with Top 20% (97 customers, ROI 183%)

Phase 2: Validation (2–4 weeks)

A/B test design details:

Parameter	Value	Basis
Sample size	n=5,748 (2,874 per arm)	80% Power, α=0.05, detectable effect ~$34
MDE (detectable effect)	~$34	effect_size 34.22 (per ab_test_design)
Duration	8 weeks	Campaign weeks (4) + outcome measurement (4)
Allocation ratio	50:50	Maximizes statistical power
Stratification variables	Segment (7), PS region (Overlap/Extreme)	Ensures balance

Power Analysis:

detectable effect ≈ $34 (effect_size 34.22)
σ = $180 (observed outcome standard deviation)
α = 0.05 (two-sided)
Power = 0.80

n = 2 × (Z_α/2 + Z_β)² × σ² / MDE²
n_total ≈ 5,748   (2,874 per arm)
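
The sample-size formula can be evaluated with the Python standard library. This is a sketch for checking the arithmetic, with hypothetical function names, and it reads the formula as the usual per-arm sample size for a two-sided comparison of two means. Evaluated literally with σ = $180 and an effect of $34.22, it returns roughly 435 customers per arm, far fewer than the reported 2,874. The figure from ab_test_design therefore presumably rests on inputs not listed here (for example a larger outcome variance), so the sketch does not replace that design.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(sigma, mde, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-sample test of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / mde ** 2)

def detectable_effect(sigma, n_arm, alpha=0.05, power=0.80):
    """Smallest effect detectable with n_arm customers in each arm."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * sigma * (2 / n_arm) ** 0.5

print(n_per_arm(sigma=180, mde=34.22))
print(round(detectable_effect(sigma=180, n_arm=2874), 2))
```

Running the inverse calculation shows that, at σ = $180, 2,874 customers per arm would detect an effect of about $13, which is another sign that the stated inputs and the stated sample size come from different assumptions and should be reconciled before the test is launched.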

Expected Outcomes:

Result	Interpretation	Follow-up action
Reject H₀ (Effect > 0)	CATE estimate validated	Full deployment
Fail to reject H₀	Observational-data limits confirmed	Re-estimate CATE based on the RCT
Effect < 0	Re-examine the current targeting strategy	Fundamental strategy revision

Ethical considerations:

Estimated revenue loss for the control group (opportunity cost): ~$7,000
Recommendation: interim analysis (futility check) after a 3-week pilot
Early stopping rule: stop if there is a clear negative effect

Segment-level testing: Light Grocery, Fresh Lovers, Lapsed H&B

Phase 3: Expansion (1–2 months)

If the A/B results confirm the predictions, expand to the full 31.3% targeting
Retrain the model monthly with updated customer behavior
Analyze TypeB and TypeC campaigns separately

4.4 Causal Assumptions Summary

Assumption	Status	Evidence	Mitigation
SUTVA	OK	Single campaign, independent customers	—
Unconfoundedness	Uncertain	Hidden logic possible in marketing strategy	Sensitivity analysis
Positivity	Violated	PS AUC = 0.989	PS trimming, A/B test
Consistency	OK	Treatment is clearly defined	—

5. Conclusion

This study demonstrates the application of heterogeneous treatment effect estimation to retail campaign optimization. Despite a severe positivity violation that constrains causal identification, we uncovered economically significant heterogeneity in the treatment effect:

Key achievements:

A +$7,085 profit improvement between optimal and full targeting (-$4,659 → +$2,426)
Identified a negative CATE for VIP Heavy (-$38) and Bulk Shoppers (-$40)
Optimal policy: targeting 31.3% of customers yields 125% ROI

Methodological contributions:

Comprehensive positivity diagnostics including various mitigation strategies
Integration of behavioral segmentation (Track 1) and causal targeting (Track 2)
An uncertainty-aware, risk-adjusted policy framework
Honest model selection based on stability + distribution plausibility rather than raw AUUC

Acknowledged limitations:

A PS AUC of 0.989 represents a fundamental identification challenge
Refutation tests suggest model instability requiring A/B validation (Placebo 0.747 / Subset 0.561 → fail)
Results should be treated as hypothesis-generating rather than definitive

Next steps: The recommended A/B test (n=5,748, MDE ~$34) will validate these findings before full deployment. The staged rollout approach balances protection against estimation error with potential revenue gains.

Appendix: Technical Details

This appendix holds the densest technical detail (parameters · equations · PS-region decomposition · sensitivity grid · segment strategy). It is a layer for the reader who wants to go deep (30 minutes) after reading the body (30 seconds / 5 minutes).

A.1 Software Environment

Python 3.9+
econml 0.14+ (Microsoft Causal ML)
scikit-learn 1.0+
xgboost 1.7+
optuna (hyperparameter tuning)

A.2 Reproducibility

Random seeds fixed for all stochastic processes
Full code available in the project notebooks:
- 03a_hte_estimation.ipynb
- 03b_hte_validation.ipynb
- 04_optimal_policy.ipynb

A.3 Data Artifacts

HTE results: results/hte_estimation_results.joblib
Validation results: results/hte_validation_summary.joblib
Policy results: results/policy_learning_results.joblib
Policy comparison: results/tables/policy_comparison.csv
Model comparison: results/tables/auuc_comparison_purchase_amount.csv, cate_summary_purchase_amount.csv
ATE results: results/tables/ate_results_purchase_amount.csv
A/B design: results/tables/ab_test_design.csv
Breakeven sensitivity: results/tables/breakeven_scenarios.csv

A.4 Key Parameters

Parameter	Value	Basis
PS Trim	[0.10, 0.90]	Balances sample size and reliability
Campaign Cost	$12.73	Average TypeA campaign cost
Profit Margin	30%	Retail industry standard
Breakeven CATE	$42.43	Cost / Margin
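As a minimal sketch of how the last row follows from the two rows above it: contacting a customer pays off only when margin times CATE exceeds the per-contact cost. The function and constant names below are illustrative and are not taken from the project notebooks.

```python
def breakeven_cate(cost: float, margin: float) -> float:
    """Incremental spend at which margin * CATE exactly covers the contact cost."""
    if not 0 < margin <= 1:
        raise ValueError("margin must be a fraction in (0, 1]")
    return cost / margin


def expected_profit(cate: float, cost: float, margin: float) -> float:
    """Expected profit from contacting one customer with the given CATE."""
    return margin * cate - cost


CAMPAIGN_COST = 12.73  # average TypeA campaign cost (from the table above)
PROFIT_MARGIN = 0.30   # assumed profit margin (from the table above)

threshold = breakeven_cate(CAMPAIGN_COST, PROFIT_MARGIN)
print(f"Breakeven CATE: ${threshold:.2f}")  # Breakeven CATE: $42.43
```

A customer with an estimated CATE of +$34, for example, yields 0.30 × 34 − 12.73 = −$2.53 in expected profit, so a positive CATE alone is not sufficient to justify targeting.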

A.5 CATE Reliability by Propensity Score Region

Understanding where CATE estimates are trustworthy is critical for targeting decisions. This section analyzes the treatment-effect estimates across different propensity-score regions.

A.5.1 CATE by PS Region

PS region	N	Sample %	Mean CATE	Reliability	Interpretation
Overlap (0.1-0.9)	80	16.5%	+$34	High	Most trustworthy estimate, comparable T/C groups
Extreme Low (<0.1)	136	28.0%	+$16	Medium	Control-heavy, extrapolating to treatment
Extreme High (>0.9)	270	55.6%	+$1	Low	Treatment-heavy, extrapolating to control

Key insight: The overlap region shows the highest CATE (+$34), suggesting the true treatment effect is likely positive, but 83% of the sample requires extrapolation.
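The region breakdown above can be sketched as follows, assuming a vector of estimated propensity scores and a vector of CATE estimates for the same customers. The helper names and the synthetic arrays are hypothetical and serve illustration only; the printed numbers will not match the table, which comes from the fitted models.

```python
import numpy as np


def ps_region(ps, low=0.10, high=0.90):
    """Label each propensity score as extreme_low, overlap, or extreme_high."""
    ps = np.asarray(ps, dtype=float)
    return np.where(ps < low, "extreme_low",
                    np.where(ps > high, "extreme_high", "overlap"))


def summarize_by_region(ps, cate):
    """Count, sample share, and mean CATE for each propensity-score region."""
    regions = ps_region(ps)
    cate = np.asarray(cate, dtype=float)
    summary = {}
    for name in ("overlap", "extreme_low", "extreme_high"):
        mask = regions == name
        n = int(mask.sum())
        summary[name] = {
            "n": n,
            "share": n / len(cate),
            "mean_cate": float(cate[mask].mean()) if n else float("nan"),
        }
    return summary


# Synthetic inputs (illustration only, not the Dunnhumby estimates).
rng = np.random.default_rng(0)
ps_demo = rng.beta(0.3, 0.2, size=486)          # mass piled up near 0 and 1
cate_demo = rng.normal(10.0, 40.0, size=486)    # arbitrary CATE-like values

summary = summarize_by_region(ps_demo, cate_demo)
for name, row in summary.items():
    print(f"{name:12s} n={row['n']:3d} share={row['share']:.1%} mean_cate={row['mean_cate']:+.1f}")
```

Scores exactly at 0.10 or 0.90 are counted as overlap here; whether the original analysis treats the boundaries the same way is an assumption.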

A.5.2 CATE Bounds by PS Region

Region	Point Estimate	Lower Bound	Upper Bound	Width	Action
Overlap	+$34	+$8	+$60	$52	Target confidently
Extreme Low	+$16	-$42	+$74	$116	Proceed with caution
Extreme High	+$1	-$38	+$40	$78	Consider reducing targeting

Marketing implication: Target confidently only in the overlap region; use conservative estimates elsewhere.

A.5.3 Sensitivity Analysis: Cost & Margin

Breakeven scenario grid (per results/tables/breakeven_scenarios.csv):

Scenario	Cost	Margin	Breakeven	Target %	Profit
Base	$12.73	30%	$42.43	31.3%	+$2,426
Lower Margin	$12.73	25%	$50.92	26.1%	+$1,726
Higher Margin	$12.73	35%	$36.37	36.0%	+$3,173
Higher Cost	$15.00	30%	$50.00	26.5%	+$2,107
Lower Cost	$10.00	30%	$33.33	39.7%	+$2,888
Worst (15/25%)	$15.00	25%	$60.00	—	+$1,461
Best (10/35%)	$10.00	35%	$28.57	—	+$3,702

Robustness: Even when cost and margin are pushed to their conservative and optimistic extremes, the optimal policy keeps a positive profit (+$1,461 to +$3,702). The key point is that the sign of the policy's profit does not flip under these assumptions.
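A grid like the one above can be sketched by re-deriving the breakeven for each cost/margin pair and re-applying the threshold rule to the CATE estimates. The code below is a minimal illustration under stated assumptions: `evaluate_policy` is a hypothetical helper, and `cate_demo` is synthetic, so the targeted shares and profits it prints will not reproduce the table.

```python
import numpy as np

# (cost per contact in dollars, profit margin) for each scenario in the grid.
SCENARIOS = {
    "Base": (12.73, 0.30),
    "Lower Margin": (12.73, 0.25),
    "Higher Margin": (12.73, 0.35),
    "Higher Cost": (15.00, 0.30),
    "Lower Cost": (10.00, 0.30),
    "Worst": (15.00, 0.25),
    "Best": (10.00, 0.35),
}


def evaluate_policy(cate, cost, margin):
    """Target customers whose CATE exceeds cost / margin and sum their profit."""
    cate = np.asarray(cate, dtype=float)
    breakeven = cost / margin
    targeted = cate > breakeven
    profit = float((margin * cate[targeted] - cost).sum())
    return {
        "breakeven": breakeven,
        "target_share": float(targeted.mean()),
        "profit": profit,
    }


# Synthetic CATE estimates (illustration only, not the fitted model output).
rng = np.random.default_rng(0)
cate_demo = rng.normal(20.0, 45.0, size=486)

results = {name: evaluate_policy(cate_demo, cost, margin)
           for name, (cost, margin) in SCENARIOS.items()}

for name, r in results.items():
    print(f"{name:14s} breakeven=${r['breakeven']:.2f} "
          f"target={r['target_share']:.1%} profit=${r['profit']:,.0f}")
```

Because this rule selects only customers whose margin-weighted CATE exceeds the cost, every selected customer adds a positive term and the estimated profit in this sketch is non-negative in every scenario. The grid is therefore most informative about how the targeted share and the profit magnitude shift.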

A.6 Detailed Segment Marketing Strategy

This section provides comprehensive targeting recommendations for each customer segment based on the CATE analysis and Track 1 profiling.

Note: The “current targeting %” / “recommended %” columns below are illustrative figures that do not exist in any source CSV and are not validated values. The validated facts are only the per-segment mean CATE / N (486 cohort) / directional action (expand · hold · reduce). Read the percentages below strictly as examples meant to convey direction intuitively.

A.6.1 Segment Performance Matrix

(Current/recommended targeting % are illustrative — direction only)

Segment	Mean CATE	Current targeting (example)	Recommended (example)	Direction
Regular+H&B	+$34	~76%	85%+	Slight expansion
Active Loyalists	+$33	~90%	95%+	Hold
Light Grocery	+$30	~15%	45%	Expand
Fresh Lovers	+$27	~27%	55%	Expand
Lapsed H&B	+$19	~20%	35%	Slight expansion
VIP Heavy	-$38	~97%	50%	Reduce
Bulk Shoppers	-$40	~52%	20%	Reduce

A.6.2 Detailed Action Plan by Segment

(The targeting % below are illustrative — the validated values are the mean CATE and the directional action)

Segment: VIP Heavy (CATE: -$38)

Dimension	Current state	Problem	Recommendation
Campaign response	Negative	Already a high purchaser, ceiling effect	Reduce TypeA frequency
Alternative channel	Over-exposed to TypeA	May cause fatigue	Test TypeB/C
Value protection	$9,716 average spend	Churn risk	Premium non-discount benefits
Targeting rule	Over-exposed (example ~97%)	Over-targeting	Target only for new-product trials

Segment: Bulk Shoppers (CATE: -$40)

Dimension	Current state	Problem	Recommendation
Campaign response	Negative	Price-sensitive, coupon mismatch	Reduce coupon campaigns
Shopping pattern	Irregular bulk	TypeA disrupts the natural rhythm	Focus on subscription/regularization
Alternative approach	Large basket per visit	Needs bulk-specific offers	Warehouse-style promotions
Targeting rule	Medium exposure (example ~52%)	Moderate over-targeting	Target only for category expansion

Segment: Light Grocery (CATE: +$30)

Dimension	Current state	Problem	Recommendation
Campaign response	Positive	Currently under-targeted	Substantially increase targeting rate
Potential	Low engagement	High uplift opportunity	Activation campaigns
Strategy	Minimal exposure (example ~15%)	Missing incremental value	Gradual rewards program
Targeting rule	Minimal exposure	Major gap	Target all customers with CATE > Breakeven

A.6.3 Risk-Adjusted Targeting Matrix

Risk Tolerance	λ parameter	Targeted segments	Expected Profit	ROI
Aggressive (λ=0)	Full CATE	Regular+H&B, Active Loyalists, Light Grocery, Fresh Lovers, Lapsed H&B	+$2,426	125%
Moderate (λ=0.3)	70% Point + 30% Lower	Regular+H&B, Active Loyalists, Light Grocery, Fresh Lovers	+$1,603	233%
Conservative (λ=0.7)	30% Point + 70% Lower	Regular+H&B, Active Loyalists	~$1,200	~200%
Ultra-safe (λ=1.0 / lower CI > breakeven)	Lower bound only	Conservative policy (3 customers)	+$188	493%

Recommendations by situation:

Business context	Recommended λ	Basis
Before A/B test	0.7-1.0	Minimize downside risk
After validation	0.3-0.5	Balanced confidence
Budget constraint	0.0-0.3	Maximize absolute profit
New market/product	0.5-0.7	Limited historical data
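One way to read the λ column is as a convex blend of the point estimate and the lower confidence bound, compared against the breakeven CATE. The sketch below assumes that form, score = (1 − λ) · point + λ · lower; the function names and the example customer are hypothetical, and the project notebooks may define the score differently.

```python
BREAKEVEN_CATE = 12.73 / 0.30  # cost / margin, about $42.43


def risk_adjusted_cate(point: float, lower: float, lam: float) -> float:
    """Blend the point estimate with the lower bound; lam=0 trusts the point estimate."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * point + lam * lower


def should_target(point: float, lower: float, lam: float,
                  breakeven: float = BREAKEVEN_CATE) -> bool:
    """Target only if the risk-adjusted CATE clears the breakeven threshold."""
    return risk_adjusted_cate(point, lower, lam) > breakeven


# Hypothetical customer: point estimate +$80, lower bound +$30.
for lam in (0.0, 0.3, 0.7, 1.0):
    score = risk_adjusted_cate(80.0, 30.0, lam)
    print(f"lambda={lam:.1f} score=${score:.2f} target={should_target(80.0, 30.0, lam)}")
```

For this example customer the blended score falls from $80 at λ=0 to $30 at λ=1.0, so the customer is targeted under the first three settings and dropped only under the ultra-safe rule. The targeted set can only shrink as λ rises, because the lower bound never exceeds the point estimate.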

A.6.4 Campaign Type Alternatives (Future Analysis)

Segment	TypeA response	Hypothetical TypeB	Hypothetical TypeC	Recommended test
VIP Heavy	Negative	Neutral/positive?	Premium tier?	Test premium offers
Bulk Shoppers	Negative	Bulk deals?	Subscription?	Test bulk-specific
Fresh Lovers	Positive	Fresh specials?	Recipe app?	Keep TypeA + test B
Light Grocery	Positive	Habit triggers?	Gamification?	Keep TypeA + test C

Note: TypeB/TypeC analysis requires a separate HTE study with a campaign-type-specific design.

References

Causal Inference Foundations

Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Heterogeneous Treatment Effects

Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353-7360.
Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.
Kennedy, E. H. (2023). Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2), 3008-3049.

Policy Learning

Athey, S., & Wager, S. (2021). Policy learning with observational data. Econometrica, 89(1), 133-161.
Zhou, Z., Athey, S., & Wager, S. (2023). Offline multi-action policy learning: Generalization and optimization. Operations Research, 71(1), 148-183.

Positivity and Sensitivity Analysis

Petersen, M. L., Porter, K. E., Gruber, S., Wang, Y., & van der Laan, M. J. (2012). Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research, 21(1), 31-54.
VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268-274.

Retail Marketing Applications

Rossi, P. E., McCulloch, R. E., & Allenby, G. M. (1996). The value of purchase history data in target marketing. Marketing Science, 15(4), 321-340.
Hitsch, G. J., & Misra, S. (2018). Heterogeneous treatment effects and optimal targeting policy evaluation. SSRN Working Paper.

Software

Battocchi, K., et al. (2019). EconML: A Python package for ML-based heterogeneous treatment effects estimation. Microsoft Research.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794.