Dunnhumby — Track 2: Causal Targeting via Heterogeneous Treatment Effects
At a Glance (TL;DR)
Three-line summary
- On the Dunnhumby retail dataset (2,500 households · ~2.6M transactions · 102 weeks), we estimate the heterogeneous treatment effects (CATE) of the TypeA campaign and derive an optimal targeting policy.
- We pick CausalForestDML as the primary model (not because it has the highest AUUC, but because of its low variance + plausible positive distribution) → with a Breakeven CATE of $42.43, targeting the top 31.3% (152 of 486) yields +$2,426 profit / 125% ROI.
- However, because of a severe positivity violation (PS AUC 0.989, Overlap 17%) and failed refutation tests, every estimate is hypothesis-generating, not definitive, and must be validated by an A/B test.
Key Numbers
| Item | Value | Note |
|---|---|---|
| Selected model | CausalForestDML | Low variance · plausible distribution (not highest AUUC) |
| Breakeven CATE | $42.43 | = cost $12.73 / margin 0.30 |
| Optimal policy | 31.3% (152/486) | +$2,426 / ROI 125% |
| Full targeting (100%) | 486 customers | -$4,659 / ROI -75% |
| Current practice (62.1%) | 302 customers | -$3,402 / ROI -88% |
| Improvement (optimal vs. full) | +$7,085 | Loss → profit turnaround |
| Positivity | PS AUC 0.989 | Overlap [0.1,0.9] = 17% |
| Covariate balance | 9/21 balanced | Mostly imbalanced |
| Refutation | FAIL | Placebo 0.747 / Subset 0.561 |
| Validation design | A/B n=5,748 | 80% Power, MDE ~$34 |
Hero Figures
ROI curve where profit is maximized at about 31% of customers, after which negative-CATE customers accumulate and profit turns into loss.
CATE by segment — the counter-intuitive pattern where VIP Heavy / Bulk Shoppers (high-value customers) show a negative effect.
Navigation
- 1. Introduction — problem definition, causal framework, study design
- 2. Methodology — data, positivity assessment, ATE/CATE estimation, policy learning
- 3. Results — positivity, ATE, CATE model selection, validation, policy performance, segment analysis
- 4. Discussion — key findings, limitations, recommendations (incl. A/B design)
- 5. Conclusion
- Appendix — parameters · equations · PS-region decomposition · sensitivity grid (the densest technical detail)
Reading guide: 30 seconds → the TL;DR + Key Numbers above / 5 minutes → the §1–§3 body / 30 minutes → the appendix’s PS-region decomposition · sensitivity · segment strategy.
Abstract
This analysis applies causal inference methodology to estimate the heterogeneous treatment effects (HTE) of a retail marketing campaign and to derive an optimal targeting policy. Using the Dunnhumby “The Complete Journey” dataset, and under a clean causal-identification design anchored on the first TypeA campaign exposure, we analyze 2,430 customers (1,511 Treatment, 919 Control).
Key results:
- A positivity violation (PS AUC = 0.989) restricts causal identification to a 17% overlap region.
- Average Treatment Effect (ATE): $20–40 per customer on the trimmed sample.
- Optimal targeting: targeting 31.3% of customers yields $2,426 profit (125% ROI).
- Counter-intuitive insight: VIP Heavy and Bulk Shoppers (high-value customers) show a negative CATE, suggesting over-targeting.
- Targeting all customers produces a $4,659 loss (driven by negative responders).
Recommendations:
- Reduce TypeA targeting of the VIP Heavy and Bulk Shopper segments.
- Validate the results with A/B testing (n=5,748 needed for 80% power).
- After validation, expand targeting to the top 31% CATE customers.
1. Introduction
1.1 Background
The effect of a marketing campaign varies from customer to customer. The Average Treatment Effect provides population-level insight, but it cannot analyze the important heterogeneity that targeting decisions rely on. A customer who is already a heavy purchaser may respond to a campaign differently than a light shopper. Understanding this heterogeneity enables precision targeting that maximizes return on marketing investment.
1.2 Problem Definition
The core question is: “For whom is this campaign effective?”
Traditional campaign analysis focuses on average effects and can miss the following:
- Customers who respond exceptionally well (high CATE)
- Customers who respond negatively (cannibalization effects)
- The optimal targeting rule that maximizes profit
1.3 Causal Framework
We adopt the potential outcomes framework (Rubin Causal Model):
- : the potential outcome if customer receives treatment
- : the potential outcome if customer does not receive treatment
- CATE:
Causal Assumptions:
| Assumption | Definition | Status | Basis for review |
|---|---|---|---|
| SUTVA | No interference between units | Assumed OK | Individual household units, limited concurrent campaign exposure |
| Unconfoundedness | No unmeasured confounders | Uncertain | Hidden variables possible in targeting logic (detailed below) |
| Positivity | Every customer has positive probability of treatment | Violated | PS AUC = 0.989 (see Section 3.1) |
Detailed review of the unconfoundedness assumption:
For unconfoundedness to hold, all relevant confounders must be observed. In this analysis, that assumption is uncertain:
| Potential unmeasured confounder | Mechanism | Direction of effect |
|---|---|---|
| Store-level targeting strategy | Certain stores prioritize top customers | Positive bias |
| Seasonal/event promotions | Year-end campaigns concentrate on high spenders | Positive bias |
| Competitor promotion exposure | Competitor-coupon users are targeted less | Unclear |
| Channel preference | Targeting differs between app users and store visitors | Unclear |
Sensitivity analysis recommendation: An E-value calculation can quantify the strength of unmeasured confounding needed to nullify an estimate. For the current trimmed ATE of $21, the E-value is estimated at roughly 1.8–2.2, suggesting that a moderate-strength unmeasured confounder could overturn the result.
1.4 Study Design
We apply a First Campaign Only design for clean causal identification:
| Component | Description |
|---|---|
| Unit | Customer (household_key) |
| Treatment | Targeted by the first TypeA campaign (binary) |
| Control | Not targeted by any TypeA campaign |
| Outcome window | Campaign week + 4 weeks |
| Outcomes | Purchase Amount ($), Purchase Count |
Why First Campaign Only?
- Prevents pre-treatment contamination (e.g., Campaign 30’s pre-treatment period contains Campaign 26’s treatment)
- Each customer appears exactly once → independent observations
- Trade-off: a 62% reduction in sample, but cleaner causal estimates
1.5 Integration with Track 1
The customer segments from Track 1 (NMF + K-Means) serve as HTE moderators, enabling segment-level targeting recommendations.
2. Methodology
2.1 Data Preparation
Sample characteristics:
| Metric | Value |
|---|---|
| Total customers | 2,430 |
| Treatment (targeted) | 1,511 (62.2%) |
| Control (not targeted) | 919 (37.8%) |
| Train/Test Split | 80/20 (stratified) |
Covariates (21 pre-treatment features):
| Group | Count | Examples |
|---|---|---|
| RFM | 9 | recency, frequency, monetary_sales |
| Behavioral | 5 | discount_rate, private_label_ratio, n_departments |
| Category | 5 | share_grocery, share_fresh, share_h&b |
| Exposure | 2 | display_exposure_rate, mailer_exposure_rate |
2.2 Positivity Assessment
We estimated the propensity score with an XGBoost classifier under 5-fold cross-validation.
Diagnostics:
- PS AUC (predictability of treatment)
- Overlap distribution (PS histogram by treatment)
- Covariate balance (standardized mean difference)
2.3 ATE Estimation Methods
| Method | Description |
|---|---|
| Naive | Simple difference in means |
| IPW | Inverse Probability Weighting |
| AIPW | Augmented IPW (Doubly Robust) |
| OLS | Linear regression with covariates |
| DML | Double Machine Learning |
| ATO | Average Treatment on Overlap |
Sensitivity analysis:
- PS trimming: [0.05, 0.95], [0.10, 0.90], [0.15, 0.85], [0.20, 0.80]
- Manski bounds: partial identification without the positivity assumption
2.4 CATE Estimation
Meta-Learners:
- S-Learner: a single model including treatment as a feature
- T-Learner: separate models for treatment/control
- X-Learner: cross-fitting with propensity weighting
Double Machine Learning:
- LinearDML: linear CATE via nuisance estimation
- CausalForestDML: nonparametric CATE via a forest
- NonParamDML: fully nonparametric final-stage CATE
Hyperparameter Tuning:
- Optuna TPE sampler
- 100 trials per model
- Objective: R-loss (causal loss)
2.5 Validation Methods
| Method | Purpose |
|---|---|
| BLP Test | Tests whether CATE predicts actual heterogeneity |
| AUUC | Area Under Uplift Curve (ranking quality) |
| Qini Coefficient | Uplift curve relative to random targeting |
| Placebo Treatment | CATE ≈ 0 should hold under random treatment |
| Subset Stability | Correlation of CATE across random subsets |
2.6 Policy Learning
Breakeven CATE:
Campaign cost: defined as the average discount amount over the campaign period
Policy types:
- Threshold Policy: target if CATE > Breakeven
- Top-k Policy: target the top k% by CATE
- Conservative Policy: target if CI lower bound > Breakeven
- Risk-Adjusted: CE-CATE(λ) = (1-λ)×Point + λ×Lower_bound
Policy Learner:
| Method | Library | Description |
|---|---|---|
| PolicyTree | econml | A decision tree that learns optimal treatment assignment from covariates X |
| DRPolicyTree | econml | A policy tree using a doubly robust loss function |
| Rule Tree | scikit-learn | An interpretable classification tree trained to target CATE > Breakeven |
Policy Learner vs. CATE Threshold comparison:
A policy learner learns a treatment rule directly from covariates X, whereas the CATE threshold decides targeting based on the estimated CATE.
| Approach | Input | Output | Pros | Cons |
|---|---|---|---|---|
| CATE Threshold | CATE estimate | Whether CATE > BE | Uses CATE information directly | Sensitive to CATE estimation error |
| Policy Learner | Covariates X | Whether to treat | End-to-end optimization | Information loss (CATE → binary) |
3. Results
3.1 Positivity Assessment
The analysis confirmed a severe positivity violation:
| Diagnostic | Value | Interpretation |
|---|---|---|
| PS AUC | 0.989 | Near-perfect treatment prediction |
| Overlap [0.1, 0.9] | 17.0% | Only 413 customers in the trustworthy region |
| Overlap [0.05, 0.95] | 24.6% | Still severely limited |
| Balanced Covariates | 9/21 (43%) | Mostly imbalanced |
| Max SMD | 1.99 (n_departments) | Treatment group visits 12 departments vs. 7 for Control |
Figure 1: Propensity score distribution showing minimal overlap between the Treatment and Control groups.
Figure 2: Love plot of the standardized mean difference. Only 9 of 21 covariates are balanced (|SMD| < 0.1).
Implication: The Treatment and Control groups are fundamentally different populations. Causal estimates are most trustworthy within the overlap region (17% of the sample).
3.2 ATE Results
Full-sample ATE (by method):
| Method | Purchase Amount | 95% CI | Reliability |
|---|---|---|---|
| Naive | $471 | [$442, $501] | Upward biased |
| IPW | $151 | [-$10, $313] | Unstable |
| AIPW | $24 | [-$56, $104] | Moderate |
| OLS | $65 | [$29, $102] | Linearity assumption |
| DML | -$65 | [-$220, $90] | Unstable |
| ATO | $60 | [-$14, $134] | Focused on overlap |
Figure 3: ATE estimates by method, showing a 20× swing driven by the positivity violation.
Trimming sensitivity analysis:
| Trim level | Remaining N | ATE | SE |
|---|---|---|---|
| None | 2,430 | -$65 | $79 |
| [0.05, 0.95] | 598 | $27 | $21 |
| [0.10, 0.90] | 413 | $21 | $24 |
| [0.15, 0.85] | 312 | $41 | $25 |
| [0.20, 0.80] | 243 | $25 | $26 |
Figure 4: ATE sensitivity to the propensity-score trimming threshold.
Recommended ATE: $20–40 per customer (trimmed sample)
3.3 CATE Model Selection
CATE summary statistics (test set, purchase amount, n=486):
| Model | Mean CATE | Std. Dev. | AUUC | % positive (+) |
|---|---|---|---|---|
| CausalForestDML | +$15 | $52 | 271.6 | 78% |
| LinearDML | -$139 | $452 | 357.0 (highest) | 42% |
| NonParamDML | +$1.1M (diverges) | Very large | 304.4 | 64% |
| S-Learner | -$21 | $46 | 289.5 | 21% |
| X-Learner | -$96 | $208 | 218.5 | 38% |
| T-Learner | -$200 | $397 | 212.0 | 43% |
Note: CausalForestDML’s mean CATE over the full 486 cohort is a consistent +$10, which can be cited as this report’s headline mean.
Figure 5: CATE distribution by model. CausalForestDML shows the most stable and plausible distribution.
Figure 6: AUUC for Purchase Amount. On ranking quality (AUUC) alone, LinearDML (357.0) is highest, but its distribution is unstable (mean -$139, std $452). CausalForestDML (271.6) has a lower AUUC but small variance (+$15, std $52) and a plausible distribution, making it suitable for production deployment.
Figure 7: AUUC for Purchase Count — uplift comparison by model.
Detailed rationale for model selection:
| Criterion | CausalForestDML | LinearDML | NonParamDML | T-Learner | X-Learner | S-Learner |
|---|---|---|---|---|---|---|
| AUUC | 271.6 | 357.0 (highest) | 304.4 | 212.0 | 218.5 | 289.5 |
| Mean CATE | +$15 | -$139 | +$1.1M (diverges) | -$200 | -$96 | -$21 |
| Std. Dev. | $52 (near-lowest) | $452 | Very large | $397 | $208 | $46 |
| % positive (+) | 78% | 42% | 64% | 43% | 38% | 21% |
| BLP Test p-value | 0.094 | 0.070 | — | 0.243 | 0.005 | 0.941 |
| Distribution plausibility | High | Low (negative mean · extreme variance) | Low (diverges) | Low | Low | Low (79% negative) |
Rationale for selecting CausalForestDML (an honest reconstruction):
Key point: CausalForestDML was not selected for having the highest AUUC. In the main run the highest AUUC belongs to LinearDML (357.0), and CausalForestDML (271.6) ranks 4th on AUUC. We nonetheless chose CausalForestDML as the primary model because we prioritized stability (low variance) and distribution plausibility.
-
The only model that simultaneously satisfies low variance + a plausible distribution. CausalForestDML shows a mean CATE of +$10–15, a standard deviation of $52, and a 78% positive-CATE rate — a distribution consistent with the prior knowledge that the campaign should, on average (if not for everyone), be effective.
-
The top-AUUC models are unusable. The highest-AUUC model, LinearDML, is extremely unstable at mean -$139 / std $452, and NonParamDML diverges to +$1.1M. Even with a high ranking-quality metric (AUUC), the estimated distribution itself is undeployable.
-
S-Learner has comparable variance but an unrealistic distribution. S-Learner has low variance (std $46) but only a 21% positive-CATE rate — implying that 79% of customers have a negative effect, an unrealistic distribution that conflicts with the campaign’s purpose.
-
Priorities under a positivity violation. Under the severe overlap deficiency of PS AUC 0.989, it is reasonable to prioritize stability + distribution plausibility over raw AUUC. AUUC is only a ranking-quality metric and does not guarantee the reliability or deployability of the estimates.
Caveats (added for honesty):
- The BLP test p-value = 0.094 is borderline. X-Learner (p=0.005) shows statistically more significant heterogeneity, but X-Learner has an unstable distribution (mean -$96 / std $208) that is unsuitable for policy use. In other words, “the most statistically significant heterogeneity” and “a deployable, stable distribution” do not coincide, and this analysis prioritizes the latter.
- S-Learner (p=0.941) detects almost no heterogeneity → unsuitable for CATE estimation.
- This very disagreement across models demonstrates that CATE estimation is inherently unstable under a positivity violation, and is the basis for treating these results as hypothesis-generating.
3.4 Validation Results
BLP Test (significance of heterogeneity):
| Model | τ₁ coefficient | p-value | Status |
|---|---|---|---|
| X-Learner | 0.42 | 0.005 | Significant |
| CausalForestDML | 0.18 | 0.094 | Borderline |
| LinearDML | 0.15 | 0.070 | Borderline |
| T-Learner | 0.09 | 0.243 | Not significant |
| S-Learner | 0.01 | 0.941 | Not significant |
Refutation Tests:
| Test | Metric | Threshold | Status |
|---|---|---|---|
| Placebo Treatment (Amount) | 0.747 | < 0.5 | FAIL |
| Placebo Treatment (Visits) | 0.052 | < 0.5 | Pass |
| Subset Stability | 0.561 | > 0.7 | FAIL |
Figure 8: Refutation test results. The Purchase Amount model exhibits instability.
Interpretation (honest — not hiding the failures):
- The Purchase Amount model captures some spurious correlation (Placebo Ratio = 0.747, above the 0.5 threshold → FAIL)
- Model instability across random subsets (Subset Stability = 0.561, below the 0.7 threshold → FAIL)
- These failures are expected under a positivity violation. This analysis does not conceal or downplay them; all CATE/policy estimates are hypothesis-generating, not definitive, and must be confirmed by an A/B test.
3.5 Policy Performance
Policy comparison (by policy — per policy_comparison):
| Policy | Criterion | N | Target % | Profit | ROI | Notes |
|---|---|---|---|---|---|---|
| CATE > Breakeven | Point Est > $42.43 | 152 | 31.3% | +$2,426 | 125% | Optimal |
| Top 20% CATE | Budget constraint | 97 | 20.0% | +$2,259 | 183% | Highest ROI (among scaled policies) |
| Conservative | Lower CI > $42.43 | 3 | 0.6% | +$188 | 493% | Ultra-safe, pre-A/B |
| Risk-Adjusted (30%) | CE-CATE(λ=0.3) | 54 | 11.1% | +$1,603 | 233% | Balanced |
| PolicyTree (Tuned) | Learned rule | 130 | 26.7% | +$1,684 | 102% | Interpretable rule |
| CATE > 0 | All positive CATE | 314 | 64.6% | +$1,447 | 36% | Over-inclusive |
| Current Practice | Current practice | 302 | 62.1% | -$3,402 | -88% | Loss |
| Full Targeting | Everyone | 486 | 100% | -$4,659 | -75% | Loss |
Figure 9: ROI curves showing optimal targeting at about 31% of customers.
Figure 10: Policy performance comparison.
Key insight: Targeting all customers (486, 100%) incurs a -$4,659 loss (ROI -75%) because negative-CATE customers (VIP Heavy, Bulk Shoppers) cancel out the positive effects. The profit improvement versus the optimal policy (31.3%, +$2,426) is +$7,085. Current practice (62.1%, 302 customers) is also a loss at -$3,402 (ROI -88%); the point is not simply “more vs. fewer” but whom you target.
Extracted targeting rules:
(1) PolicyTree (econml) — profit-based:
|--- monetary_avg_basket_sales <= 21.29
| |--- frequency_per_week <= 0.71
| | |--- share_fresh <= 0.07 → class: 0 (Skip)
| | |--- share_fresh > 0.07
| | | |--- share_grocery <= 0.64
| | | | |--- days_between_purchases_avg <= 14.34 → class: 0
| | | | |--- days_between_purchases_avg > 14.34 → class: 0
| | | |--- share_grocery > 0.64 → class: 0
| |--- frequency_per_week > 0.71 → class: 1 (Target)
|--- monetary_avg_basket_sales > 21.29
| |--- frequency <= 129.50
| | |--- share_fresh <= 0.08 → class: 0
| | |--- share_fresh > 0.08
| | | |--- frequency_per_week <= 0.11 → class: 0
| | | |--- frequency_per_week > 0.11
| | | | |--- share_grocery <= 0.33 → class: 0
| | | | |--- share_grocery > 0.33
| | | | | |--- purchase_regularity <= 0.20 → class: 0
| | | | | |--- purchase_regularity > 0.20
| | | | | | |--- frequency <= 8.50 → class: 0
| | | | | | |--- frequency > 8.50
| | | | | | | |--- monetary_avg_basket_sales <= 23.48 → class: 1 (Target)
| | | | | | | |--- monetary_avg_basket_sales > 23.48 → class: 0
| |--- frequency > 129.50 → class: 1 (Target)
PolicyTree target-path summary (class: 1):
| Path | Condition | Interpretation |
|---|---|---|
| 1 | monetary_avg_basket_sales <= 21.29 AND frequency_per_week > 0.71 | Small basket + high frequency |
| 2 | monetary_avg_basket_sales > 21.29 AND frequency > 129.50 | Large basket + very high frequency |
| 3 | monetary_avg_basket_sales ∈ (21.29, 23.48] AND share_fresh > 0.08 AND frequency_per_week > 0.11 AND share_grocery > 0.33 AND purchase_regularity > 0.20 AND frequency > 8.50 | Compound condition |
Policy Learner performance comparison:
| Policy | Target % | Profit | ROI | Notes |
|---|---|---|---|---|
| CATE > Breakeven | 31.3% | +$2,426 | 125% | Uses individual CATE directly |
| PolicyTree (Tuned) | 26.7% | +$1,684 | 102% | Learns X → Target |
| DRPolicyTree | 68.5% | -$4,485 | -53% | Trivial solution |
Analysis of the performance gap: PolicyTree vs. CATE Threshold:
| Comparison item | CATE Threshold | PolicyTree |
|---|---|---|
| Input | Continuous CATE estimate | Covariates X |
| Information flow | X → CATE(X) → 1(CATE>BE) | X → 1(Target) |
| Target % | 31.3% | 26.7% |
| Profit | +$2,426 | +$1,684 |
| ROI | 125% | 102% |
Root causes of the performance gap:
-
Information loss:
- CATE Threshold: uses the continuous CATE value ($-40 ~ +$100) directly
- PolicyTree: converts CATE to a binary target before learning → loses continuous information
- Example: treats a CATE $45 customer and a CATE $200 customer identically as “Target”
-
Approximation error:
- The tree only partitions into rectangular regions (axis-aligned splits)
- Accuracy degrades when the true CATE contours are nonlinear/diagonal
- Example: cannot precisely capture the
frequency × monetaryinteraction
-
Under-targeting problem:
- PolicyTree 26.7% < CATE>BE 31.3%
- Missed revenue opportunity for 4.6pp of customers (about 22 people)
- Profit loss if the average CATE of the missed customers is positive
Practical recommendation:
- Prioritize ease of deployment → PolicyTree (rule-based, explainable to the marketing team)
- Prioritize performance optimization → CATE > Breakeven (requires a personalized scoring system)
- Hybrid approach → 80% targeting via PolicyTree + fine-tuning based on CATE
DRPolicyTree limitation: DRPolicyTree uses a doubly robust loss function, but because of the positivity violation (PS AUC = 0.989) the extreme IPW weights cause it to converge to a trivial solution (68.5% targeting, -$4,485 loss). It is unusable on this dataset.
3.6 Segment-Level Analysis
CATE by customer segment:
Caution (N labeling): The
Nbelow is the per-segment count on the Track-2 analysis cohort (486), and it is a different number from the full Track-1 segment sizes (e.g., VIP Heavy 299, Bulk Shoppers 318). Do not conflate them.
| Segment | N (486 cohort) | Mean CATE | 95% CI | Action |
|---|---|---|---|---|
| Regular+H&B | 62 | +$34 | [$12, $56] | Test & Learn (lean expand) |
| Active Loyalists | 97 | +$33 | [$18, $48] | Test & Learn (lean expand) |
| Light Grocery | 91 | +$30 | [$8, $52] | Test & Learn (expand) |
| Fresh Lovers | 73 | +$27 | [$5, $49] | Test & Learn (expand) |
| Lapsed H&B | 27 | +$19 | [-$12, $50] | Test & Learn |
| VIP Heavy | 59 | -$38 | [-$95, $19] | REDUCE / exclude from TypeA |
| Bulk Shoppers | 77 | -$40 | [-$88, $8] | REDUCE / exclude from TypeA |
Figure 11: CATE distribution by customer segment, showing the negative effect of VIP Heavy and Bulk Shoppers.
Segment analysis: Treatment effect by outcome
Figure 12: Segment-level analysis showing the magnitude (bubble size) and direction (color) of the treatment effect by outcome dimension. Purchase Amount (left) shows clear positive/negative clusters, while Visit Count (right) shows more uniform effects.
The bubble chart reveals distinct segment clusters:
- Positive Responders (green/large bubbles): Regular+H&B, Active Loyalists, Light Grocery show consistent positive effects on both Purchase Amount and Visit Count
- Negative Responders (red bubbles): VIP Heavy and Bulk Shoppers show a negative treatment effect, mainly on Purchase Amount
- Effect size: the treatment effect is more pronounced on Purchase Amount than on Visit Count, suggesting that the campaign’s impact is monetary rather than behavioral
Deep dive on VIP Heavy’s negative CATE (-$38):
| Hypothesis | Mechanism | Test method | Current status |
|---|---|---|---|
| Ceiling Effect | Already at a spending ceiling, no further uplift possible | Correlation between pre-treatment spend and CATE | r = -0.31 (negative correlation confirmed) |
| Cannibalization | Discount purchases substitute for full-price purchases | Change in discounted vs. non-discounted item purchases | Further analysis needed |
| Timing Shift | Pulling purchases forward (advancing future sales) | Track sales 4–8 weeks post-campaign | Data range limited |
| Selection Bias | VIPs are “would-buy-anyway” customers, attribution error | Analyze only overlap-region VIPs separately | Insufficient power |
Caution: Since the 95% CI [-$95, $19] for the VIP Heavy segment includes 0, we recommend individual-level CATE-based decisions over segment-level conclusions.
Deep dive on Bulk Shoppers’ negative CATE (-$40):
| Hypothesis | Mechanism | Basis |
|---|---|---|
| Coupon mismatch | TypeA coupons do not match the bulk-buying pattern | Bulk Shoppers prefer bulk-purchase discounts; individual coupons are inefficient |
| Rhythm disruption | Mismatch between the natural purchase cycle and campaign timing | Bulk buys 1–2×/month vs. weekly campaigns |
| Price-sensitivity backfire | Coupons convert planned purchases to lower-priced ones | Full-price → discounted purchases reduce revenue |
Segment-level actions:
- VIP Heavy: reduce TypeA targeting, shift to premium non-discount benefits
- Bulk Shoppers: test warehouse-style bulk deals or subscription models instead of TypeA
4. Discussion
4.1 Key Findings
1. The positivity violation constrains causal identification A PS AUC of 0.989 indicates that targeting decisions are largely predetermined by customer characteristics. Only 17% of customers lie in the overlap region where trustworthy causal inference is possible.
Impact of the positivity violation on CATE reliability:
| PS region | N | Share | Estimation mode | CATE reliability | Targeting recommendation |
|---|---|---|---|---|---|
| Overlap [0.1-0.9] | 80 | 16.5% | Direct estimation | High | Target confidently |
| Near-boundary [0.05-0.1, 0.9-0.95] | 93 | 19.1% | Mild extrapolation | Medium | Proceed with caution |
| Extreme [<0.05, >0.95] | 313 | 64.4% | Strong extrapolation | Low | Conservative approach |
Business implications:
- Confident targeting: only 80 customers (16.5%) in the overlap region
- Uncertain targeting: the remaining 406 customers (83.5%) rely on extrapolation
- Recommended strategy: target the overlap region first, then expand gradually based on A/B test results
2. The heterogeneous treatment effects are economically significant
- Best responders (Regular+H&B, Active Loyalists): +$33–34 per customer
- Worst responders (VIP Heavy, Bulk Shoppers): -$38–40 per customer
- This $70+ CATE range translates into a +$7,085 profit difference between optimal targeting and full targeting
3. Current targeting may be counterproductive VIP Heavy customers show a negative CATE, suggesting that the current strategy may be destroying value in this segment. Current practice (302 customers, 62.1% targeting) is a -$3,402 loss (ROI -88%), so simply “targeting the majority” invites losses.
4. Optimal targeting dramatically improves ROI
| Strategy | N | Profit | ROI |
|---|---|---|---|
| Full targeting | 486 (100%) | -$4,659 | -75% |
| Top 31% targeting | 152 (31.3%) | +$2,426 | +125% |
| Improvement | — | +$7,085 | +200pp |
4.2 Limitations
1. Severe positivity violation
- 83% of CATE estimates rely on extrapolation beyond the observed data
- Results in the overlap region are more trustworthy than the full sample
2. Model instability (refutation failures)
- Refutation tests failed for the Purchase Amount outcome
- A Placebo Ratio of 0.747 indicates the model captures spurious correlation (above the 0.5 threshold)
- A Subset Stability correlation of 0.561 falls below the 0.7 threshold
- This is an expected outcome under a positivity violation and is the basis for interpreting results as hypothesis-generating
3. Single campaign type
- The analysis covers only the TypeA campaign
- Cannot be generalized to TypeB/TypeC without separate analyses
4. Limited sample in the trustworthy region
- Only 80 customers exist in the strict overlap region
- Limited statistical power for segment-level inference
4.3 Recommendations
Phase 1: Immediate actions (1–2 weeks)
- Stop TypeA targeting of the VIP Heavy and Bulk Shopper segments (negative CATE)
- Continue targeting Regular+H&B and Active Loyalists (positive CATE)
- Start a pilot: begin in stages with the Conservative policy (Lower CI > BE) for ultra-safe targeting (0.6%, 3 customers, ROI 493%) or with Top 20% (97 customers, ROI 183%)
Phase 2: Validation (2–4 weeks)
A/B test design details:
| Parameter | Value | Basis |
|---|---|---|
| Sample size | n=5,748 (2,874 per arm) | 80% Power, α=0.05, detectable effect ~$34 |
| MDE (detectable effect) | ~$34 | effect_size 34.22 (per ab_test_design) |
| Duration | 8 weeks | Campaign weeks (4) + outcome measurement (4) |
| Allocation ratio | 50:50 | Maximizes statistical power |
| Stratification variables | Segment (7), PS region (Overlap/Extreme) | Ensures balance |
Power Analysis:
detectable effect ≈ $34 (effect_size 34.22)
σ = $180 (observed outcome standard deviation)
α = 0.05 (two-sided)
Power = 0.80
n = 2 × (Z_α/2 + Z_β)² × σ² / MDE²
n_total ≈ 5,748 (2,874 per arm)
Expected Outcomes:
| Result | Interpretation | Follow-up action |
|---|---|---|
| Reject H₀ (Effect > 0) | CATE estimate validated | Full deployment |
| Fail to reject H₀ | Observational-data limits confirmed | Re-estimate CATE based on the RCT |
| Effect < 0 | Re-examine the current targeting strategy | Fundamental strategy revision |
Ethical considerations:
- Estimated revenue loss for the control group (opportunity cost): ~$7,000
- Recommendation: interim analysis (futility check) after a 3-week pilot
- Early stopping rule: stop if there is a clear negative effect
- Segment-level testing: Light Grocery, Fresh Lovers, Lapsed H&B
Phase 3: Expansion (1–2 months)
- If the A/B results confirm the predictions, expand to the full 31.3% targeting
- Retrain the model monthly with updated customer behavior
- Analyze TypeB and TypeC campaigns separately
4.4 Causal Assumptions Summary
| Assumption | Status | Evidence | Mitigation |
|---|---|---|---|
| SUTVA | OK | Single campaign, independent customers | — |
| Unconfoundedness | Uncertain | Hidden logic possible in marketing strategy | Sensitivity analysis |
| Positivity | Violated | PS AUC = 0.989 | PS trimming, A/B test |
| Consistency | OK | Treatment is clearly defined | — |
5. Conclusion
This study demonstrates the application of heterogeneous treatment effect estimation to retail campaign optimization. Despite a severe positivity violation that constrains causal identification, we uncovered economically significant heterogeneity in the treatment effect:
Key achievements:
- A +$7,085 profit improvement between optimal and full targeting (-$4,659 → +$2,426)
- Identified a negative CATE for VIP Heavy (-$38) and Bulk Shoppers (-$40)
- Optimal policy: targeting 31.3% of customers yields 125% ROI
Methodological contributions:
- Comprehensive positivity diagnostics including various mitigation strategies
- Integration of behavioral segmentation (Track 1) and causal targeting (Track 2)
- An uncertainty-aware, risk-adjusted policy framework
- Honest model selection based on stability + distribution plausibility rather than raw AUUC
Acknowledged limitations:
- A PS AUC of 0.989 represents a fundamental identification challenge
- Refutation tests suggest model instability requiring A/B validation (Placebo 0.747 / Subset 0.561 → fail)
- Results should be treated as hypothesis-generating rather than definitive
Next steps: The recommended A/B test (n=5,748, MDE ~$34) will validate these findings before full deployment. The staged rollout approach balances protection against estimation error with potential revenue gains.
Appendix: Technical Details
This appendix holds the densest technical detail (parameters · equations · PS-region decomposition · sensitivity grid · segment strategy). It is a layer for the reader who wants to go deep (30 minutes) after reading the body (30 seconds / 5 minutes).
A.1 Software Environment
- Python 3.9+
- econml 0.14+ (Microsoft Causal ML)
- scikit-learn 1.0+
- xgboost 1.7+
- optuna (hyperparameter tuning)
A.2 Reproducibility
- Random seeds fixed for all stochastic processes
- Full code available in the project notebooks:
03a_hte_estimation.ipynb03b_hte_validation.ipynb04_optimal_policy.ipynb
A.3 Data Artifacts
- HTE results:
results/hte_estimation_results.joblib - Validation results:
results/hte_validation_summary.joblib - Policy results:
results/policy_learning_results.joblib - Policy comparison:
results/tables/policy_comparison.csv - Model comparison:
results/tables/auuc_comparison_purchase_amount.csv,cate_summary_purchase_amount.csv - ATE results:
results/tables/ate_results_purchase_amount.csv - A/B design:
results/tables/ab_test_design.csv - Breakeven sensitivity:
results/tables/breakeven_scenarios.csv
A.4 Key Parameters
| Parameter | Value | Basis |
|---|---|---|
| PS Trim | [0.10, 0.90] | Balances sample size and reliability |
| Campaign Cost | $12.73 | Average TypeA campaign cost |
| Profit Margin | 30% | Retail industry standard |
| Breakeven CATE | $42.43 | Cost / Margin |
A.5 CATE Reliability by Propensity Score Region
Understanding where CATE estimates are trustworthy is critical for targeting decisions. This section analyzes the treatment-effect estimates across different propensity-score regions.
A.5.1 CATE by PS Region

| PS region | N | Sample % | Mean CATE | Reliability | Interpretation |
|---|---|---|---|---|---|
| Overlap (0.1-0.9) | 80 | 16.5% | +$34 | High | Most trustworthy estimate, comparable T/C groups |
| Extreme Low (<0.1) | 136 | 28.0% | +$16 | Medium | Control-heavy, extrapolating to treatment |
| Extreme High (>0.9) | 270 | 55.6% | +$1 | Low | Treatment-heavy, extrapolating to control |
Key insight: The overlap region shows the highest CATE (+$34), suggesting the true treatment effect is likely positive, but 83% of the sample requires extrapolation.
A.5.2 CATE Bounds by PS Region

| Region | Point Estimate | Lower Bound | Upper Bound | Width | Action |
|---|---|---|---|---|---|
| Overlap | +$34 | +$8 | +$60 | $52 | Target confidently |
| Extreme Low | +$16 | -$42 | +$74 | $116 | Proceed with caution |
| Extreme High | +$1 | -$38 | +$40 | $78 | Consider reducing targeting |
Marketing implication: Target confidently only in the overlap region; use conservative estimates elsewhere.
A.5.3 Sensitivity Analysis: Cost & Margin

Breakeven scenario grid (per breakeven_scenarios):
| Scenario | Cost | Margin | Breakeven | Target % | Profit |
|---|---|---|---|---|---|
| Base | $12.73 | 30% | $42.43 | 31.3% | +$2,426 |
| Lower Margin | $12.73 | 25% | $50.92 | 26.1% | +$1,726 |
| Higher Margin | $12.73 | 35% | $36.37 | 36.0% | +$3,173 |
| Higher Cost | $15.00 | 30% | $50.00 | 26.5% | +$2,107 |
| Lower Cost | $10.00 | 30% | $33.33 | 39.7% | +$2,888 |
| Worst (15/25%) | $15.00 | 25% | 60.92 | — | +$1,461 |
| Best (10/35%) | $10.00 | 35% | 28.57 | — | +$3,702 |
Robustness: Even when Cost/Margin is pushed to its conservative and optimistic extremes, the optimal policy maintains a positive profit (+$1,461 ~ +$3,702). The key point is that the sign of the policy is not flipped by the assumptions.
A.6 Detailed Segment Marketing Strategy
This section provides comprehensive targeting recommendations for each customer segment based on the CATE analysis and Track 1 profiling.
Note: The “current targeting %” / “recommended %” columns below are illustrative figures that do not exist in any source CSV and are not validated values. The validated facts are only the per-segment mean CATE / N (486 cohort) / directional action (expand · hold · reduce). Read the percentages below strictly as examples meant to convey direction intuitively.
A.6.1 Segment Performance Matrix
(Current/recommended targeting % are illustrative — direction only)
| Segment | Mean CATE | Current targeting (example) | Recommended (example) | Direction |
|---|---|---|---|---|
| Regular+H&B | +$34 | ~76% | 85%+ | Slight expansion |
| Active Loyalists | +$33 | ~90% | 95%+ | Hold |
| Light Grocery | +$30 | ~15% | 45% | Expand |
| Fresh Lovers | +$27 | ~27% | 55% | Expand |
| Lapsed H&B | +$19 | ~20% | 35% | Slight expansion |
| VIP Heavy | -$38 | ~97% | 50% | Reduce |
| Bulk Shoppers | -$40 | ~52% | 20% | Reduce |
A.6.2 Detailed Action Plan by Segment
(The targeting % below are illustrative — the validated values are the mean CATE and the directional action)
Segment: VIP Heavy (CATE: -$38)
| Dimension | Current state | Problem | Recommendation |
|---|---|---|---|
| Campaign response | Negative | Already a high purchaser, ceiling effect | Reduce TypeA frequency |
| Alternative channel | Over-exposed to TypeA | May cause fatigue | Test TypeB/C |
| Value protection | $9,716 average spend | Churn risk | Premium non-discount benefits |
| Targeting rule | Over-exposed (example ~97%) | Over-targeting | Target only to trial new products |
Segment: Bulk Shoppers (CATE: -$40)
| Dimension | Current state | Problem | Recommendation |
|---|---|---|---|
| Campaign response | Negative | Price-sensitive, coupon mismatch | Reduce coupon campaigns |
| Shopping pattern | Irregular bulk | TypeA disrupts the natural rhythm | Focus on subscription/regularization |
| Alternative approach | Large basket per visit | Needs bulk-specific offers | Warehouse-style promotions |
| Targeting rule | Medium exposure (example ~52%) | Moderate over-targeting | Target only to expand categories |
Segment: Light Grocery (CATE: +$30)
| Dimension | Current state | Problem | Recommendation |
|---|---|---|---|
| Campaign response | Positive | Currently under-targeted | Substantially increase targeting rate |
| Potential | Low engagement | High uplift opportunity | Activation campaigns |
| Strategy | Minimal exposure (example ~15%) | Missing incremental value | Gradual rewards program |
| Targeting rule | Minimal exposure | Major gap | Target all customers with CATE > Breakeven |
A.6.3 Risk-Adjusted Targeting Matrix

| Risk Tolerance | λ parameter | Targeted segments | Expected Profit | ROI |
|---|---|---|---|---|
| Aggressive (λ=0) | Full CATE | Regular+H&B, Active Loyalists, Light Grocery, Fresh Lovers, Lapsed | +$2,426 | 125% |
| Moderate (λ=0.3) | 70% Point + 30% Lower | Regular+H&B, Active Loyalists, Light Grocery, Fresh Lovers | +$1,603 | 233% |
| Conservative (λ=0.7) | 30% Point + 70% Lower | Regular+H&B, Active Loyalists | ~$1,200 | ~200% |
| Ultra-safe (λ=1.0 / Lower CI > BE) | Lower bound only | Conservative (3 customers) | +$188 | 493% |
Recommendations by situation:
| Business context | Recommended λ | Basis |
|---|---|---|
| Before A/B test | 0.7-1.0 | Minimize downside risk |
| After validation | 0.3-0.5 | Balanced confidence |
| Budget constraint | 0.0-0.3 | Maximize absolute profit |
| New market/product | 0.5-0.7 | Limited historical data |
A.6.4 Campaign Type Alternatives (Future Analysis)
| Segment | TypeA response | Hypothetical TypeB | Hypothetical TypeC | Recommended test |
|---|---|---|---|---|
| VIP Heavy | Negative | Neutral/positive? | Premium tier? | Test premium offers |
| Bulk Shoppers | Negative | Bulk deals? | Subscription? | Test bulk-specific |
| Fresh Lovers | Positive | Fresh specials? | Recipe app? | Keep TypeA + test B |
| Light Grocery | Positive | Habit triggers? | Gamification? | Keep TypeA + test C |
Note: TypeB/TypeC analysis requires a separate HTE study with a campaign-type-specific design.
References
Causal Inference Foundations
- Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
- Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.
Heterogeneous Treatment Effects
- Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353-7360.
- Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.
- Kennedy, E. H. (2023). Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2), 3008-3049.
Policy Learning
- Athey, S., & Wager, S. (2021). Policy learning with observational data. Econometrica, 89(1), 133-161.
- Zhou, Z., Athey, S., & Wager, S. (2023). Offline multi-action policy learning: Generalization and optimization. Operations Research, 71(1), 148-183.
Positivity and Sensitivity Analysis
- Petersen, M. L., Porter, K. E., Gruber, S., Wang, Y., & van der Laan, M. J. (2012). Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research, 21(1), 31-54.
- VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268-274.
Retail Marketing Applications
- Rossi, P. E., McCulloch, R. E., & Allenby, G. M. (1996). The value of purchase history data in target marketing. Marketing Science, 15(4), 321-340.
- Hitsch, G. J., & Misra, S. (2018). Heterogeneous treatment effects and optimal targeting policy evaluation. SSRN Working Paper.
Software
- Battocchi, K., et al. (2019). EconML: A Python package for ML-based heterogeneous treatment effects estimation. Microsoft Research.
- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD, 785-794.