Tae Hyun Kim (Lowell)

Dunnhumby — Track 2: Causal Targeting via Heterogeneous Treatment Effects

At a Glance (TL;DR)

Three-line summary

  1. On the Dunnhumby retail dataset (2,500 households · ~2.6M transactions · 102 weeks), we estimate the heterogeneous treatment effects (CATE) of the TypeA campaign and derive an optimal targeting policy.
  2. We select CausalForestDML as the primary model, not because it has the highest AUUC but because it combines low variance with a plausible, mostly positive CATE distribution. With a Breakeven CATE of $42.43, targeting the top 31.3% of customers (152 of 486) yields +$2,426 profit (125% ROI).
  3. However, because of a severe positivity violation (PS AUC 0.989, Overlap 17%) and failed refutation tests, every estimate is hypothesis-generating, not definitive, and must be validated by an A/B test.

Key Numbers

| Item | Value | Note |
| --- | --- | --- |
| Selected model | CausalForestDML | Low variance · plausible distribution (not highest AUUC) |
| Breakeven CATE | $42.43 | = cost $12.73 / margin 0.30 |
| Optimal policy | 31.3% (152/486) | +$2,426 / ROI 125% |
| Full targeting (100%) | 486 customers | -$4,659 / ROI -75% |
| Current practice (62.1%) | 302 customers | -$3,402 / ROI -88% |
| Improvement (optimal vs. full) | +$7,085 | Loss → profit turnaround |
| Positivity | PS AUC 0.989 | Overlap [0.1,0.9] = 17% |
| Covariate balance | 9/21 balanced | Mostly imbalanced |
| Refutation | FAIL | Placebo 0.747 / Subset 0.561 |
| Validation design | A/B n=5,748 | 80% Power, MDE ~$34 |

Hero Figures

Figure: ROI curve where profit is maximized at about 31% of customers, after which negative-CATE customers accumulate and profit turns into loss.

Figure: CATE by segment, showing the counter-intuitive pattern where VIP Heavy / Bulk Shoppers (high-value customers) show a negative effect.

  • 1. Introduction — problem definition, causal framework, study design
  • 2. Methodology — data, positivity assessment, ATE/CATE estimation, policy learning
  • 3. Results — positivity, ATE, CATE model selection, validation, policy performance, segment analysis
  • 4. Discussion — key findings, limitations, recommendations (incl. A/B design)
  • 5. Conclusion
  • Appendix — parameters · equations · PS-region decomposition · sensitivity grid (the densest technical detail)

Reading guide: 30 seconds → the TL;DR + Key Numbers above / 5 minutes → the §1–§3 body / 30 minutes → the appendix’s PS-region decomposition · sensitivity · segment strategy.


Abstract

This analysis applies causal inference methodology to estimate the heterogeneous treatment effects (HTE) of a retail marketing campaign and to derive an optimal targeting policy. Using the Dunnhumby “The Complete Journey” dataset, and under a clean causal-identification design anchored on the first TypeA campaign exposure, we analyze 2,430 customers (1,511 Treatment, 919 Control).

Key results:

  • A positivity violation (PS AUC = 0.989) restricts causal identification to a 17% overlap region.
  • Average Treatment Effect (ATE): $20–40 per customer on the trimmed sample.
  • Optimal targeting: targeting 31.3% of customers yields $2,426 profit (125% ROI).
  • Counter-intuitive insight: VIP Heavy and Bulk Shoppers (high-value customers) show a negative CATE, suggesting over-targeting.
  • Targeting all customers produces a $4,659 loss (driven by negative responders).

Recommendations:

  1. Reduce TypeA targeting of the VIP Heavy and Bulk Shopper segments.
  2. Validate the results with A/B testing (n=5,748 needed for 80% power).
  3. After validation, expand targeting to the top 31% CATE customers.

1. Introduction

1.1 Background

The effect of a marketing campaign varies from customer to customer. The Average Treatment Effect (ATE) provides population-level insight, but it cannot capture the heterogeneity that targeting decisions rely on. For example, a customer who is already a heavy purchaser may respond to a campaign differently than a light shopper. Understanding this heterogeneity enables precision targeting that maximizes return on marketing investment.

1.2 Problem Definition

The core question is: “For whom is this campaign effective?”

Traditional campaign analysis focuses on average effects and can miss the following:

  • Customers who respond exceptionally well (high CATE)
  • Customers who respond negatively (cannibalization effects)
  • The optimal targeting rule that maximizes profit

1.3 Causal Framework

We adopt the potential outcomes framework (Rubin Causal Model):

  • $Y_i(1)$: the potential outcome if customer $i$ receives treatment
  • $Y_i(0)$: the potential outcome if customer $i$ does not receive treatment
  • CATE: $\tau(x) = E[Y(1) - Y(0) \mid X = x]$

Causal Assumptions:

| Assumption | Definition | Status | Basis for review |
| --- | --- | --- | --- |
| SUTVA | No interference between units | Assumed OK | Individual household units, limited concurrent campaign exposure |
| Unconfoundedness | No unmeasured confounders | Uncertain | Hidden variables possible in targeting logic (detailed below) |
| Positivity | Every customer has a treatment probability strictly between 0 and 1 | Violated | PS AUC = 0.989 (see Section 3.1) |

Detailed review of the unconfoundedness assumption:

For unconfoundedness to hold, all relevant confounders must be observed. In this analysis, that assumption is uncertain:

| Potential unmeasured confounder | Mechanism | Direction of effect |
| --- | --- | --- |
| Store-level targeting strategy | Certain stores prioritize top customers | Positive bias |
| Seasonal/event promotions | Year-end campaigns concentrate on high spenders | Positive bias |
| Competitor promotion exposure | Competitor-coupon users are targeted less | Unclear |
| Channel preference | Targeting differs between app users and store visitors | Unclear |

Sensitivity analysis recommendation: An E-value calculation can quantify the strength of unmeasured confounding needed to nullify an estimate. For the current trimmed ATE of $21, the E-value is estimated at roughly 1.8–2.2, suggesting that a moderate-strength unmeasured confounder could overturn the result.
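
As a rough illustration of the E-value calculation, the sketch below implements the standard point-estimate formula for an effect expressed as a risk ratio, E = RR + sqrt(RR × (RR - 1)). How the dollar-valued ATE was mapped onto a risk-ratio scale is not stated in this report, so the risk ratios in the loop are hypothetical inputs chosen only to show the shape of the mapping, and `e_value` is an illustrative name.

```python
import math


def e_value(risk_ratio: float) -> float:
    """E-value for a point estimate expressed as a risk ratio."""
    if risk_ratio <= 0:
        raise ValueError("risk ratio must be positive")
    # The formula is defined for RR >= 1; protective effects are inverted first.
    rr = risk_ratio if risk_ratio >= 1 else 1.0 / risk_ratio
    return rr + math.sqrt(rr * (rr - 1.0))


# Hypothetical risk ratios (assumptions, not values from this analysis).
for rr in (1.0, 1.3, 1.5, 2.0):
    print(f"RR = {rr:.1f} -> E-value = {e_value(rr):.2f}")
```

An E-value close to 1 means that even weak unmeasured confounding could explain the estimate away; larger values indicate more robustness.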

1.4 Study Design

We apply a First Campaign Only design for clean causal identification:

| Component | Description |
| --- | --- |
| Unit | Customer (household_key) |
| Treatment | Targeted by the first TypeA campaign (binary) |
| Control | Not targeted by any TypeA campaign |
| Outcome window | Campaign week + 4 weeks |
| Outcomes | Purchase Amount ($), Purchase Count |

Why First Campaign Only?

  • Prevents pre-treatment contamination (e.g., Campaign 30’s pre-treatment period contains Campaign 26’s treatment)
  • Each customer appears exactly once → independent observations
  • Trade-off: a 62% reduction in sample, but cleaner causal estimates

1.5 Integration with Track 1

The customer segments from Track 1 (NMF + K-Means) serve as HTE moderators, enabling segment-level targeting recommendations.


2. Methodology

2.1 Data Preparation

Sample characteristics:

| Metric | Value |
| --- | --- |
| Total customers | 2,430 |
| Treatment (targeted) | 1,511 (62.2%) |
| Control (not targeted) | 919 (37.8%) |
| Train/Test Split | 80/20 (stratified) |

Covariates (21 pre-treatment features):

| Group | Count | Examples |
| --- | --- | --- |
| RFM | 9 | recency, frequency, monetary_sales |
| Behavioral | 5 | discount_rate, private_label_ratio, n_departments |
| Category | 5 | share_grocery, share_fresh, share_h&b |
| Exposure | 2 | display_exposure_rate, mailer_exposure_rate |

2.2 Positivity Assessment

We estimated the propensity score with an XGBoost classifier under 5-fold cross-validation.

Diagnostics:

  • PS AUC (predictability of treatment)
  • Overlap distribution (PS histogram by treatment)
  • Covariate balance (standardized mean difference)
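
Two of these diagnostics reduce to short formulas. The sketch below is a minimal, dependency-free illustration: it takes propensity scores as given (the XGBoost fitting step is not reproduced) and runs on synthetic numbers rather than the Dunnhumby data. The function names and the pooled-variance form of the SMD are assumptions; the report does not say which SMD variant was used.

```python
import math
import random
import statistics


def standardized_mean_difference(treated, control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2)."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.fmean(treated) - statistics.fmean(control)) / pooled_sd


def overlap_share(propensity_scores, low=0.10, high=0.90):
    """Fraction of units whose propensity score lies inside [low, high]."""
    inside = sum(low <= p <= high for p in propensity_scores)
    return inside / len(propensity_scores)


# Synthetic covariate loosely shaped like n_departments (assumed values).
rng = random.Random(0)
treated_departments = [rng.gauss(12, 2.5) for _ in range(500)]
control_departments = [rng.gauss(7, 2.5) for _ in range(300)]
smd = standardized_mean_difference(treated_departments, control_departments)
print(f"SMD = {smd:.2f}, balanced (|SMD| < 0.1): {abs(smd) < 0.1}")

# Hypothetical propensity scores concentrated near 0 and 1.
scores = [0.01, 0.03, 0.20, 0.55, 0.93, 0.97, 0.99, 0.995]
print(f"overlap share in [0.10, 0.90] = {overlap_share(scores):.1%}")
```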

2.3 ATE Estimation Methods

| Method | Description |
| --- | --- |
| Naive | Simple difference in means |
| IPW | Inverse Probability Weighting |
| AIPW | Augmented IPW (Doubly Robust) |
| OLS | Linear regression with covariates |
| DML | Double Machine Learning |
| ATO | Average Treatment Effect on the Overlap population (overlap weighting) |
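
To make the difference between the first three estimators concrete, here is a minimal sketch on synthetic data with a known effect of 20. It uses oracle nuisance functions (true propensity and outcome models) purely for illustration; in the actual analysis these would be estimated, and all names and the data-generating process below are assumptions.

```python
import random


def naive_ate(y, t):
    """Simple difference in means between treated and control."""
    treated = [yi for yi, ti in zip(y, t) if ti == 1]
    control = [yi for yi, ti in zip(y, t) if ti == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_ate(y, t, e):
    """Inverse probability weighting with propensity scores e."""
    terms = [ti * yi / ei - (1 - ti) * yi / (1 - ei) for yi, ti, ei in zip(y, t, e)]
    return sum(terms) / len(terms)


def aipw_ate(y, t, e, mu1, mu0):
    """AIPW: outcome-model contrast plus a weighted residual correction."""
    total = 0.0
    for yi, ti, ei, m1, m0 in zip(y, t, e, mu1, mu0):
        total += (m1 - m0) + ti * (yi - m1) / ei - (1 - ti) * (yi - m0) / (1 - ei)
    return total / len(y)


# Synthetic data: x drives both treatment probability and the outcome.
rng = random.Random(42)
n = 5000
x = [rng.random() for _ in range(n)]
e = [0.2 + 0.6 * xi for xi in x]                      # true propensity
t = [1 if rng.random() < ei else 0 for ei in e]
y = [100 * xi + 20 * ti + rng.gauss(0, 10) for xi, ti in zip(x, t)]
mu0 = [100 * xi for xi in x]                          # oracle control outcome
mu1 = [m + 20 for m in mu0]                           # oracle treated outcome

print(f"naive = {naive_ate(y, t):.1f}")
print(f"IPW   = {ipw_ate(y, t, e):.1f}")
print(f"AIPW  = {aipw_ate(y, t, e, mu1, mu0):.1f}")
```

In this setup the naive contrast is biased upward by confounding through x, while IPW and AIPW should land near 20, with AIPW typically the less noisy of the two.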

Sensitivity analysis:

  • PS trimming: [0.05, 0.95], [0.10, 0.90], [0.15, 0.85], [0.20, 0.80]
  • Manski bounds: partial identification without the positivity assumption

2.4 CATE Estimation

Meta-Learners:

  • S-Learner: a single model including treatment as a feature
  • T-Learner: separate models for treatment/control
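  • X-Learner: imputes individual treatment effects from the treatment- and control-group outcome models, then blends the two resulting CATE models with propensity-score weights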

Double Machine Learning:

  • LinearDML: linear CATE via nuisance estimation
  • CausalForestDML: nonparametric CATE via a forest
  • NonParamDML: fully nonparametric final-stage CATE

Hyperparameter Tuning:

  • Optuna TPE sampler
  • 100 trials per model
  • Objective: R-loss (causal loss)
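
The R-loss used as the tuning objective can be written as the mean of ((Y - m(X)) - (T - e(X)) τ(X))², where m(X) = E[Y | X] and e(X) is the propensity score. The sketch below evaluates it for three candidate CATE functions on synthetic data; the data-generating process, the oracle nuisances, and the function name are all assumptions for illustration, not the project's tuning code.

```python
import random


def r_loss(y, t, m_hat, e_hat, tau_hat):
    """Mean squared residual-on-residual error for a candidate CATE."""
    squared = [
        ((yi - mi) - (ti - ei) * taui) ** 2
        for yi, ti, mi, ei, taui in zip(y, t, m_hat, e_hat, tau_hat)
    ]
    return sum(squared) / len(squared)


rng = random.Random(7)
n = 2000
x = [rng.random() for _ in range(n)]
e_hat = [0.3 + 0.4 * xi for xi in x]                  # propensity e(x)
t = [1 if rng.random() < ei else 0 for ei in e_hat]
true_tau = [10 + 40 * xi for xi in x]                 # heterogeneous effect
y = [50 * xi + tau * ti + rng.gauss(0, 5) for xi, tau, ti in zip(x, true_tau, t)]
m_hat = [50 * xi + ei * tau for xi, ei, tau in zip(x, e_hat, true_tau)]  # E[Y | X]

candidates = {
    "true tau(x)": true_tau,
    "constant 30": [30.0] * n,
    "zero effect": [0.0] * n,
}
losses = {name: r_loss(y, t, m_hat, e_hat, tau) for name, tau in candidates.items()}
for name, value in losses.items():
    print(f"{name:12s} R-loss = {value:.1f}")
```

A lower R-loss indicates a CATE function that better explains the residual variation in the outcome, which is why it can serve as a model-selection objective when true effects are unobserved.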

2.5 Validation Methods

| Method | Purpose |
| --- | --- |
| BLP Test | Tests whether CATE predicts actual heterogeneity |
| AUUC | Area Under Uplift Curve (ranking quality) |
| Qini Coefficient | Uplift curve relative to random targeting |
| Placebo Treatment | CATE ≈ 0 should hold under random treatment |
| Subset Stability | Correlation of CATE across random subsets |
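
AUUC definitions differ across libraries, and the exact normalization behind the values in this report is not stated. The sketch below uses one common convention (cumulative treated-minus-control mean outcome, scaled by the number of customers targeted so far) on synthetic randomized data. It only illustrates why a good CATE ranking earns a larger area than a bad one; the function names and data are assumptions.

```python
import random


def uplift_curve(y, t, score):
    """Cumulative uplift when customers are targeted in descending score order."""
    order = sorted(range(len(y)), key=lambda i: score[i], reverse=True)
    sum_t = sum_c = 0.0
    n_t = n_c = 0
    curve = []
    for k, i in enumerate(order, start=1):
        if t[i] == 1:
            sum_t += y[i]
            n_t += 1
        else:
            sum_c += y[i]
            n_c += 1
        # Uplift is undefined until both groups are represented.
        if n_t and n_c:
            curve.append((sum_t / n_t - sum_c / n_c) * k)
        else:
            curve.append(0.0)
    return curve


def auuc(curve):
    """Area under the uplift curve, approximated by the mean curve height."""
    return sum(curve) / len(curve)


rng = random.Random(1)
n = 2000
x = [rng.random() for _ in range(n)]
t = [1 if rng.random() < 0.5 else 0 for _ in range(n)]   # randomized treatment
tau = [100 * xi - 30 for xi in x]                        # true effect rises with x
y = [40 + taui * ti + rng.gauss(0, 5) for taui, ti in zip(tau, t)]

auuc_good = auuc(uplift_curve(y, t, x))                      # ranks by true driver
auuc_reversed = auuc(uplift_curve(y, t, [-xi for xi in x]))  # worst-first ranking
print(f"AUUC (good ranking)     = {auuc_good:.0f}")
print(f"AUUC (reversed ranking) = {auuc_reversed:.0f}")
```

Both curves end at the same point (everyone targeted), so the area difference reflects ranking quality only. This is consistent with the report's caution that AUUC measures ranking and says nothing about whether the CATE magnitudes are plausible.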

2.6 Policy Learning

Breakeven CATE:

$$\text{Breakeven} = \frac{\text{Campaign Cost}}{\text{Profit Margin}} = \frac{\$12.73}{0.30} = \$42.43$$

Campaign cost: defined as the average discount amount over the campaign period

Policy types:

  • Threshold Policy: target if CATE > Breakeven
  • Top-k Policy: target the top k% by CATE
  • Conservative Policy: target if CI lower bound > Breakeven
  • Risk-Adjusted: CE-CATE(λ) = (1-λ)×Point + λ×Lower_bound
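
The four policy types above can be expressed as small selection rules. The sketch below is illustrative only: the function names are not from the project code, and the five CATE point estimates and lower confidence bounds are hypothetical values chosen so that the policies disagree.

```python
def breakeven_cate(cost: float, margin: float) -> float:
    """Incremental sales needed for the margin to cover the campaign cost."""
    return cost / margin


def threshold_policy(cate, breakeven):
    """Target if the CATE point estimate exceeds breakeven."""
    return [c > breakeven for c in cate]


def top_k_policy(cate, k_percent):
    """Target the top k% of customers ranked by CATE."""
    n_target = round(len(cate) * k_percent / 100)
    ranked = sorted(range(len(cate)), key=lambda i: cate[i], reverse=True)
    chosen = set(ranked[:n_target])
    return [i in chosen for i in range(len(cate))]


def conservative_policy(ci_lower, breakeven):
    """Target only if the CI lower bound exceeds breakeven."""
    return [lo > breakeven for lo in ci_lower]


def risk_adjusted_cate(point, lower, lam):
    """CE-CATE(lam) = (1 - lam) * point + lam * lower_bound."""
    return [(1 - lam) * p + lam * lo for p, lo in zip(point, lower)]


be = breakeven_cate(12.73, 0.30)
point = [60.0, 45.0, 30.0, 80.0, -10.0]   # hypothetical CATE estimates ($)
lower = [44.0, 20.0, 5.0, 30.0, -40.0]    # hypothetical CI lower bounds ($)

print(f"breakeven = ${be:.2f}")
print("threshold    :", threshold_policy(point, be))
print("top 40%      :", top_k_policy(point, 40))
print("conservative :", conservative_policy(lower, be))
print("risk-adjusted:", threshold_policy(risk_adjusted_cate(point, lower, 0.3), be))
```

As λ moves from 0 to 1, the risk-adjusted rule moves from the threshold policy toward the conservative policy, which is why its targeting rate in the results sits between the two.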

Policy Learner:

| Method | Library | Description |
| --- | --- | --- |
| PolicyTree | econml | A decision tree that learns optimal treatment assignment from covariates X |
| DRPolicyTree | econml | A policy tree using a doubly robust loss function |
| Rule Tree | scikit-learn | An interpretable classification tree trained to target CATE > Breakeven |

Policy Learner vs. CATE Threshold comparison:

A policy learner learns a treatment rule directly from covariates X, whereas the CATE threshold decides targeting based on the estimated CATE.

| Approach | Input | Output | Pros | Cons |
| --- | --- | --- | --- | --- |
| CATE Threshold | CATE estimate | Whether CATE > BE | Uses CATE information directly | Sensitive to CATE estimation error |
| Policy Learner | Covariates X | Whether to treat | End-to-end optimization | Information loss (CATE → binary) |

3. Results

3.1 Positivity Assessment

The analysis confirmed a severe positivity violation:

| Diagnostic | Value | Interpretation |
| --- | --- | --- |
| PS AUC | 0.989 | Near-perfect treatment prediction |
| Overlap [0.1, 0.9] | 17.0% | Only 413 customers in the trustworthy region |
| Overlap [0.05, 0.95] | 24.6% | Still severely limited |
| Balanced Covariates | 9/21 (43%) | Mostly imbalanced |
| Max SMD | 1.99 (n_departments) | Treatment group visits 12 departments vs. 7 for Control |

Figure 1: Propensity score distribution showing minimal overlap between the Treatment and Control groups.

Figure 2: Love plot of the standardized mean difference. Only 9 of 21 covariates are balanced (|SMD| < 0.1).

Implication: The Treatment and Control groups are fundamentally different populations. Causal estimates are most trustworthy within the overlap region (17% of the sample).

3.2 ATE Results

Full-sample ATE (by method):

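| Method | ATE (Purchase Amount) | 95% CI | Reliability |
| --- | --- | --- | --- |
| Naive | $471 | [$442, $501] | Upward biased |
| IPW | $151 | [-$10, $313] | Unstable |
| AIPW | $24 | [-$56, $104] | Moderate |
| OLS | $65 | [$29, $102] | Linearity assumption |
| DML | -$65 | [-$220, $90] | Unstable |
| ATO | $60 | [-$14, $134] | Focused on overlap |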

Figure 3: ATE estimates by method, showing a 20× swing driven by the positivity violation.

Trimming sensitivity analysis:

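| Trim level | Remaining N | ATE | SE |
| --- | --- | --- | --- |
| None | 2,430 | -$65 | $79 |
| [0.05, 0.95] | 598 | $27 | $21 |
| [0.10, 0.90] | 413 | $21 | $24 |
| [0.15, 0.85] | 312 | $41 | $25 |
| [0.20, 0.80] | 243 | $25 | $26 |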

Figure 4: ATE sensitivity to the propensity-score trimming threshold.

Recommended ATE: $20–40 per customer (trimmed sample)

3.3 CATE Model Selection

CATE summary statistics (test set, purchase amount, n=486):

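| Model | Mean CATE | Std. Dev. | AUUC | % positive (+) |
| --- | --- | --- | --- | --- |
| CausalForestDML | +$15 | $52 | 271.6 | 78% |
| LinearDML | -$139 | $452 | 357.0 (highest) | 42% |
| NonParamDML | +$1.1M (diverges) | Very large | 304.4 | 64% |
| S-Learner | -$21 | $46 | 289.5 | 21% |
| X-Learner | -$96 | $208 | 218.5 | 38% |
| T-Learner | -$200 | $397 | 212.0 | 43% |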

Note: CausalForestDML’s mean CATE over the full 486 cohort is a consistent +$10, which can be cited as this report’s headline mean.

Figure 5: CATE distribution by model. CausalForestDML shows the most stable and plausible distribution.

Figure 6: AUUC for Purchase Amount. On ranking quality (AUUC) alone, LinearDML (357.0) is highest, but its distribution is unstable (mean -$139, std $452). CausalForestDML (271.6) has a lower AUUC but small variance (+$15, std $52) and a plausible distribution, making it suitable for production deployment.

Figure 7: AUUC for Purchase Count (uplift comparison by model).

Detailed rationale for model selection:

| Criterion | CausalForestDML | LinearDML | NonParamDML | T-Learner | X-Learner | S-Learner |
| --- | --- | --- | --- | --- | --- | --- |
| AUUC | 271.6 | 357.0 (highest) | 304.4 | 212.0 | 218.5 | 289.5 |
| Mean CATE | +$15 | -$139 | +$1.1M (diverges) | -$200 | -$96 | -$21 |
| Std. Dev. | $52 (near-lowest) | $452 | Very large | $397 | $208 | $46 |
| % positive (+) | 78% | 42% | 64% | 43% | 38% | 21% |
| BLP Test p-value | 0.094 | 0.070 | | 0.243 | 0.005 | 0.941 |
| Distribution plausibility | High | Low (negative mean · extreme variance) | Low (diverges) | Low | Low | Low (79% negative) |

Rationale for selecting CausalForestDML:

Key point: CausalForestDML was not selected for having the highest AUUC. In the main run the highest AUUC belongs to LinearDML (357.0), and CausalForestDML (271.6) ranks 4th on AUUC. We nonetheless chose CausalForestDML as the primary model because we prioritized stability (low variance) and distribution plausibility.

  1. The only model that simultaneously satisfies low variance + a plausible distribution. CausalForestDML shows a mean CATE of +$10–15, a standard deviation of $52, and a 78% positive-CATE rate — a distribution consistent with the prior knowledge that the campaign should, on average (if not for everyone), be effective.

  2. The top-AUUC models are unusable. The highest-AUUC model, LinearDML, is extremely unstable at mean -$139 / std $452, and NonParamDML diverges to +$1.1M. Even with a high ranking-quality metric (AUUC), the estimated distribution itself is undeployable.

  3. S-Learner has comparable variance but an unrealistic distribution. S-Learner has low variance (std $46) but only a 21% positive-CATE rate — implying that 79% of customers have a negative effect, an unrealistic distribution that conflicts with the campaign’s purpose.

  4. Priorities under a positivity violation. Under the severe overlap deficiency of PS AUC 0.989, it is reasonable to prioritize stability + distribution plausibility over raw AUUC. AUUC is only a ranking-quality metric and does not guarantee the reliability or deployability of the estimates.

Caveats:

  • The BLP test p-value = 0.094 is borderline. X-Learner (p=0.005) shows statistically more significant heterogeneity, but X-Learner has an unstable distribution (mean -$96 / std $208) that is unsuitable for policy use. In other words, “the most statistically significant heterogeneity” and “a deployable, stable distribution” do not coincide, and this analysis prioritizes the latter.
  • S-Learner (p=0.941) detects almost no heterogeneity → unsuitable for CATE estimation.
  • This very disagreement across models demonstrates that CATE estimation is inherently unstable under a positivity violation, and is the basis for treating these results as hypothesis-generating.

3.4 Validation Results

BLP Test (significance of heterogeneity):

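| Model | τ₁ coefficient | p-value | Status |
| --- | --- | --- | --- |
| X-Learner | 0.42 | 0.005 | Significant |
| CausalForestDML | 0.18 | 0.094 | Borderline |
| LinearDML | 0.15 | 0.070 | Borderline |
| T-Learner | 0.09 | 0.243 | Not significant |
| S-Learner | 0.01 | 0.941 | Not significant |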

Refutation Tests:

| Test | Metric | Threshold | Status |
| --- | --- | --- | --- |
| Placebo Treatment (Amount) | 0.747 | < 0.5 | FAIL |
| Placebo Treatment (Visits) | 0.052 | < 0.5 | Pass |
| Subset Stability | 0.561 | > 0.7 | FAIL |

Figure 8: Refutation test results. The Purchase Amount model exhibits instability.

Interpretation:

  • The Purchase Amount model captures some spurious correlation (Placebo Ratio = 0.747, above the 0.5 threshold → FAIL)
  • Model instability across random subsets (Subset Stability = 0.561, below the 0.7 threshold → FAIL)
  • These failures are expected under a positivity violation. This analysis does not conceal or downplay them; all CATE/policy estimates are hypothesis-generating, not definitive, and must be confirmed by an A/B test.

3.5 Policy Performance

Policy comparison (per policy_comparison.csv):

| Policy | Criterion | N | Target % | Profit | ROI | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| CATE > Breakeven | Point Est > $42.43 | 152 | 31.3% | +$2,426 | 125% | Optimal |
| Top 20% CATE | Budget constraint | 97 | 20.0% | +$2,259 | 183% | Highest ROI (among scaled policies) |
| Conservative | Lower CI > $42.43 | 3 | 0.6% | +$188 | 493% | Ultra-safe, pre-A/B |
| Risk-Adjusted (30%) | CE-CATE(λ=0.3) | 54 | 11.1% | +$1,603 | 233% | Balanced |
| PolicyTree (Tuned) | Learned rule | 130 | 26.7% | +$1,684 | 102% | Interpretable rule |
| CATE > 0 | All positive CATE | 314 | 64.6% | +$1,447 | 36% | Over-inclusive |
| Current Practice | Current practice | 302 | 62.1% | -$3,402 | -88% | Loss |
| Full Targeting | Everyone | 486 | 100% | -$4,659 | -75% | Loss |

Figure 9: ROI curves showing optimal targeting at about 31% of customers.

Figure 10: Policy performance comparison.

Key insight: Targeting all customers (486, 100%) incurs a $4,659 loss (ROI -75%) because negative-CATE customers (VIP Heavy, Bulk Shoppers) cancel out the positive effects. Switching from full targeting to the optimal policy (31.3%, +$2,426) improves profit by $7,085. Current practice (62.1%, 302 customers) also loses $3,402 (ROI -88%), so what matters is not simply targeting more or fewer customers but whom you target.
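
The report does not spell out the profit and ROI formulas. One reading that reproduces the ROI column of the policy table is profit = Σ (margin × CATE - cost) over targeted customers and ROI = profit / (N × cost). Both are assumptions in the sketch below, as are the function names. Under the same assumption, the full-targeting loss implies a mean CATE of roughly $10.5 across the 486 customers.

```python
COST = 12.73    # campaign cost per targeted customer ($)
MARGIN = 0.30   # profit margin on incremental sales


def policy_profit(cate, targeted, cost=COST, margin=MARGIN):
    """Assumed profit rule: margin on incremental sales minus campaign cost."""
    return sum(margin * c - cost for c, hit in zip(cate, targeted) if hit)


def roi(profit, n_targeted, cost=COST):
    """Assumed ROI rule: profit divided by total campaign spend."""
    return profit / (n_targeted * cost)


def implied_mean_cate(profit, n_targeted, cost=COST, margin=MARGIN):
    """Mean CATE among targeted customers implied by a reported profit."""
    return (profit + n_targeted * cost) / (margin * n_targeted)


# (profit, N) pairs taken from the policy comparison table.
reported = {
    "CATE > Breakeven": (2426, 152),
    "Top 20% CATE": (2259, 97),
    "Current Practice": (-3402, 302),
    "Full Targeting": (-4659, 486),
}
for name, (profit, n) in reported.items():
    print(f"{name:17s} ROI = {roi(profit, n):+.0%}")

print(f"implied mean CATE, full targeting = ${implied_mean_cate(-4659, 486):.2f}")
```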

Extracted targeting rules:

(1) PolicyTree (econml) — profit-based:

|--- monetary_avg_basket_sales <= 21.29
|   |--- frequency_per_week <= 0.71
|   |   |--- share_fresh <= 0.07 → class: 0 (Skip)
|   |   |--- share_fresh > 0.07
|   |   |   |--- share_grocery <= 0.64
|   |   |   |   |--- days_between_purchases_avg <= 14.34 → class: 0
|   |   |   |   |--- days_between_purchases_avg > 14.34 → class: 0
|   |   |   |--- share_grocery > 0.64 → class: 0
|   |--- frequency_per_week > 0.71 → class: 1 (Target)
|--- monetary_avg_basket_sales > 21.29
|   |--- frequency <= 129.50
|   |   |--- share_fresh <= 0.08 → class: 0
|   |   |--- share_fresh > 0.08
|   |   |   |--- frequency_per_week <= 0.11 → class: 0
|   |   |   |--- frequency_per_week > 0.11
|   |   |   |   |--- share_grocery <= 0.33 → class: 0
|   |   |   |   |--- share_grocery > 0.33
|   |   |   |   |   |--- purchase_regularity <= 0.20 → class: 0
|   |   |   |   |   |--- purchase_regularity > 0.20
|   |   |   |   |   |   |--- frequency <= 8.50 → class: 0
|   |   |   |   |   |   |--- frequency > 8.50
|   |   |   |   |   |   |   |--- monetary_avg_basket_sales <= 23.48 → class: 1 (Target)
|   |   |   |   |   |   |   |--- monetary_avg_basket_sales > 23.48 → class: 0
|   |--- frequency > 129.50 → class: 1 (Target)

PolicyTree target-path summary (class: 1):

| Path | Condition | Interpretation |
| --- | --- | --- |
| 1 | monetary_avg_basket_sales <= 21.29 AND frequency_per_week > 0.71 | Small basket + high frequency |
| 2 | monetary_avg_basket_sales > 21.29 AND frequency > 129.50 | Large basket + very high frequency |
| 3 | monetary_avg_basket_sales ∈ (21.29, 23.48] AND frequency ∈ (8.50, 129.50] AND share_fresh > 0.08 AND frequency_per_week > 0.11 AND share_grocery > 0.33 AND purchase_regularity > 0.20 | Compound condition |
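
For deployment, the three target paths can be collapsed into a single predicate. The sketch below transcribes the thresholds from the printed tree. The function name and the dictionary-based customer record are illustrative choices, not part of the project code.

```python
def policy_tree_target(customer: dict) -> bool:
    """Return True if the extracted PolicyTree rules say 'Target'."""
    basket = customer["monetary_avg_basket_sales"]

    # Path 1: small basket + high weekly frequency.
    if basket <= 21.29:
        return customer["frequency_per_week"] > 0.71

    # Path 2: larger basket + very high purchase count.
    if customer["frequency"] > 129.50:
        return True

    # Path 3: compound condition inside a narrow basket band.
    return (
        basket <= 23.48
        and customer["share_fresh"] > 0.08
        and customer["frequency_per_week"] > 0.11
        and customer["share_grocery"] > 0.33
        and customer["purchase_regularity"] > 0.20
        and customer["frequency"] > 8.50
    )


# Hypothetical customer record used only to exercise the rule.
example = {
    "monetary_avg_basket_sales": 22.0,
    "frequency": 50,
    "frequency_per_week": 0.5,
    "share_fresh": 0.20,
    "share_grocery": 0.50,
    "purchase_regularity": 0.30,
}
print("target example customer:", policy_tree_target(example))
```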

Policy Learner performance comparison:

| Policy | Target % | Profit | ROI | Notes |
| --- | --- | --- | --- | --- |
| CATE > Breakeven | 31.3% | +$2,426 | 125% | Uses individual CATE directly |
| PolicyTree (Tuned) | 26.7% | +$1,684 | 102% | Learns X → Target |
| DRPolicyTree | 68.5% | -$4,485 | -53% | Trivial solution |

Analysis of the performance gap: PolicyTree vs. CATE Threshold:

| Comparison item | CATE Threshold | PolicyTree |
| --- | --- | --- |
| Input | Continuous CATE estimate | Covariates X |
| Information flow | X → CATE(X) → 1(CATE>BE) | X → 1(Target) |
| Target % | 31.3% | 26.7% |
| Profit | +$2,426 | +$1,684 |
| ROI | 125% | 102% |

Root causes of the performance gap:

  1. Information loss:

    • CATE Threshold: uses the continuous CATE value (roughly -$40 to +$100) directly
    • PolicyTree: converts CATE to a binary target before learning → loses continuous information
    • Example: treats a CATE $45 customer and a CATE $200 customer identically as “Target”
  2. Approximation error:

    • The tree only partitions into rectangular regions (axis-aligned splits)
    • Accuracy degrades when the true CATE contours are nonlinear/diagonal
    • Example: cannot precisely capture the frequency × monetary interaction
  3. Under-targeting problem:

    • PolicyTree 26.7% < CATE>BE 31.3%
    • Missed revenue opportunity for 4.6pp of customers (about 22 people)
    • Profit loss if the average CATE of the missed customers is positive

Practical recommendation:

  • Prioritize ease of deployment → PolicyTree (rule-based, explainable to the marketing team)
  • Prioritize performance optimization → CATE > Breakeven (requires a personalized scoring system)
  • Hybrid approach → 80% targeting via PolicyTree + fine-tuning based on CATE

DRPolicyTree limitation: DRPolicyTree uses a doubly robust loss function, but because of the positivity violation (PS AUC = 0.989) the extreme IPW weights cause it to converge to a trivial solution (68.5% targeting, -$4,485 loss). It is unusable on this dataset.

3.6 Segment-Level Analysis

CATE by customer segment:

Caution (N labeling): The N below is the per-segment count on the Track-2 analysis cohort (486), and it is a different number from the full Track-1 segment sizes (e.g., VIP Heavy 299, Bulk Shoppers 318). Do not conflate them.

| Segment | N (486 cohort) | Mean CATE | 95% CI | Action |
| --- | --- | --- | --- | --- |
| Regular+H&B | 62 | +$34 | [$12, $56] | Test & Learn (lean expand) |
| Active Loyalists | 97 | +$33 | [$18, $48] | Test & Learn (lean expand) |
| Light Grocery | 91 | +$30 | [$8, $52] | Test & Learn (expand) |
| Fresh Lovers | 73 | +$27 | [$5, $49] | Test & Learn (expand) |
| Lapsed H&B | 27 | +$19 | [-$12, $50] | Test & Learn |
| VIP Heavy | 59 | -$38 | [-$95, $19] | REDUCE / exclude from TypeA |
| Bulk Shoppers | 77 | -$40 | [-$88, $8] | REDUCE / exclude from TypeA |

Figure 11: CATE distribution by customer segment, showing the negative effect of VIP Heavy and Bulk Shoppers.

Segment analysis: Treatment effect by outcome

Figure 12: Segment-level analysis showing the magnitude (bubble size) and direction (color) of the treatment effect by outcome dimension. Purchase Amount (left) shows clear positive/negative clusters, while Visit Count (right) shows more uniform effects.

The bubble chart reveals distinct segment clusters:

  • Positive Responders (green/large bubbles): Regular+H&B, Active Loyalists, Light Grocery show consistent positive effects on both Purchase Amount and Visit Count
  • Negative Responders (red bubbles): VIP Heavy and Bulk Shoppers show a negative treatment effect, mainly on Purchase Amount
  • Effect size: the treatment effect is more pronounced on Purchase Amount than on Visit Count, suggesting that the campaign’s impact is monetary rather than behavioral

Deep dive on VIP Heavy’s negative CATE (-$38):

| Hypothesis | Mechanism | Test method | Current status |
| --- | --- | --- | --- |
| Ceiling Effect | Already at a spending ceiling, no further uplift possible | Correlation between pre-treatment spend and CATE | r = -0.31 (negative correlation confirmed) |
| Cannibalization | Discount purchases substitute for full-price purchases | Change in discounted vs. non-discounted item purchases | Further analysis needed |
| Timing Shift | Pulling purchases forward (advancing future sales) | Track sales 4–8 weeks post-campaign | Data range limited |
| Selection Bias | VIPs are “would-buy-anyway” customers, attribution error | Analyze only overlap-region VIPs separately | Insufficient power |

Caution: Since the 95% CI [-$95, $19] for the VIP Heavy segment includes 0, we recommend individual-level CATE-based decisions over segment-level conclusions.

Deep dive on Bulk Shoppers’ negative CATE (-$40):

| Hypothesis | Mechanism | Basis |
| --- | --- | --- |
| Coupon mismatch | TypeA coupons do not match the bulk-buying pattern | Bulk Shoppers prefer bulk-purchase discounts; individual coupons are inefficient |
| Rhythm disruption | Mismatch between the natural purchase cycle and campaign timing | Bulk buys 1–2×/month vs. weekly campaigns |
| Price-sensitivity backfire | Coupons convert planned purchases to lower-priced ones | Full-price → discounted purchases reduce revenue |

Segment-level actions:

  • VIP Heavy: reduce TypeA targeting, shift to premium non-discount benefits
  • Bulk Shoppers: test warehouse-style bulk deals or subscription models instead of TypeA

4. Discussion

4.1 Key Findings

1. The positivity violation constrains causal identification

A PS AUC of 0.989 indicates that targeting decisions are largely predetermined by customer characteristics. Only 17% of customers lie in the overlap region where trustworthy causal inference is possible.

Impact of the positivity violation on CATE reliability:

| PS region | N | Share | Estimation mode | CATE reliability | Targeting recommendation |
| --- | --- | --- | --- | --- | --- |
| Overlap [0.1-0.9] | 80 | 16.5% | Direct estimation | High | Target confidently |
| Near-boundary [0.05-0.1, 0.9-0.95] | 93 | 19.1% | Mild extrapolation | Medium | Proceed with caution |
| Extreme [<0.05, >0.95] | 313 | 64.4% | Strong extrapolation | Low | Conservative approach |

Business implications:

  • Confident targeting: only 80 customers (16.5%) in the overlap region
  • Uncertain targeting: the remaining 406 customers (83.5%) rely on extrapolation
  • Recommended strategy: target the overlap region first, then expand gradually based on A/B test results
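
A minimal sketch of how customers could be bucketed into these three reliability tiers from their propensity scores. Whether the boundaries 0.05, 0.10, 0.90, and 0.95 are inclusive is not stated in the report, so the inclusive comparisons below are an assumption, as are the function names. The share check at the end uses the counts from the table above.

```python
from collections import Counter


def ps_region(p: float) -> str:
    """Assign a propensity score to a reliability tier."""
    if 0.10 <= p <= 0.90:
        return "overlap"
    if 0.05 <= p <= 0.95:
        return "near-boundary"
    return "extreme"


def region_shares(scores):
    """Share of customers falling into each tier."""
    counts = Counter(ps_region(p) for p in scores)
    return {
        region: counts[region] / len(scores)
        for region in ("overlap", "near-boundary", "extreme")
    }


# Hypothetical scores, only to exercise the function.
print(region_shares([0.50, 0.07, 0.93, 0.99]))

# Counts reported for the 486-customer cohort.
reported_counts = {"overlap": 80, "near-boundary": 93, "extreme": 313}
total = sum(reported_counts.values())
for region, n in reported_counts.items():
    print(f"{region:13s} {n:3d}  {n / total:.1%}")
```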

2. The heterogeneous treatment effects are economically significant

  • Best responders (Regular+H&B, Active Loyalists): +$33–34 per customer
  • Worst responders (VIP Heavy, Bulk Shoppers): -$38–40 per customer
  • This $70+ CATE range translates into a +$7,085 profit difference between optimal targeting and full targeting

3. Current targeting may be counterproductive

VIP Heavy customers show a negative CATE, suggesting that the current strategy may be destroying value in this segment. Current practice (302 customers, 62.1% targeting) is a -$3,402 loss (ROI -88%), so simply “targeting the majority” invites losses.

4. Optimal targeting dramatically improves ROI

| Strategy | N | Profit | ROI |
| --- | --- | --- | --- |
| Full targeting | 486 (100%) | -$4,659 | -75% |
| Top 31% targeting | 152 (31.3%) | +$2,426 | +125% |
| Improvement | | +$7,085 | +200pp |

4.2 Limitations

1. Severe positivity violation

  • 83% of CATE estimates rely on extrapolation beyond the observed data
  • Results in the overlap region are more trustworthy than the full sample

2. Model instability (refutation failures)

  • Refutation tests failed for the Purchase Amount outcome
  • A Placebo Ratio of 0.747 indicates the model captures spurious correlation (above the 0.5 threshold)
  • A Subset Stability correlation of 0.561 falls below the 0.7 threshold
  • This is an expected outcome under a positivity violation and is the basis for interpreting results as hypothesis-generating

3. Single campaign type

  • The analysis covers only the TypeA campaign
  • Cannot be generalized to TypeB/TypeC without separate analyses

4. Limited sample in the trustworthy region

  • Only 80 customers of the 486-customer test cohort lie in the strict overlap region
  • Limited statistical power for segment-level inference

4.3 Recommendations

Phase 1: Immediate actions (1–2 weeks)

  1. Reduce TypeA targeting of the VIP Heavy and Bulk Shopper segments (negative mean CATE, though both 95% CIs include 0)
  2. Continue targeting Regular+H&B and Active Loyalists (positive CATE)
  3. Start a pilot: begin in stages with the Conservative policy (Lower CI > BE) for ultra-safe targeting (0.6%, 3 customers, ROI 493%) or with Top 20% (97 customers, ROI 183%)

Phase 2: Validation (2–4 weeks)

A/B test design details:

| Parameter | Value | Basis |
| --- | --- | --- |
| Sample size | n=5,748 (2,874 per arm) | 80% Power, α=0.05, detectable effect ~$34 |
| MDE (detectable effect) | ~$34 | effect_size 34.22 (per ab_test_design) |
| Duration | 8 weeks | Campaign weeks (4) + outcome measurement (4) |
| Allocation ratio | 50:50 | Maximizes statistical power |
| Stratification variables | Segment (7), PS region (Overlap/Extreme) | Ensures balance |

Power Analysis:

detectable effect ≈ $34 (effect_size 34.22)
σ = $180 (observed outcome standard deviation)
α = 0.05 (two-sided)
Power = 0.80

n_per_arm = 2 × (Z_{α/2} + Z_β)² × σ² / MDE²
n_total ≈ 5,748   (2,874 per arm)
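
The closed form above is easy to evaluate directly. The sketch below uses only the Python standard library, and the function names are illustrative rather than taken from the project notebooks. One point worth checking against ab_test_design.csv: with the σ = $180 and $34.22 effect listed above, this textbook two-sample formula returns a per-arm size in the hundreds rather than 2,874. The reported n therefore presumably reflects inputs that are not shown in this section, for example a different variance or an additional adjustment.

```python
from math import ceil, sqrt
from statistics import NormalDist


def _z_sum(alpha: float, power: float) -> float:
    """Z_{alpha/2} + Z_beta for a two-sided test."""
    std_normal = NormalDist()
    return std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)


def n_per_arm(sigma: float, mde: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sample comparison of means."""
    return ceil(2 * _z_sum(alpha, power) ** 2 * sigma ** 2 / mde ** 2)


def detectable_effect(sigma: float, n_arm: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable effect for a given per-arm sample size."""
    return _z_sum(alpha, power) * sigma * sqrt(2 / n_arm)


n_arm = n_per_arm(sigma=180, mde=34.22)
print(f"per-arm n for sigma=$180, MDE=$34.22: {n_arm} (total {2 * n_arm})")
print(f"MDE at 2,874 per arm, sigma=$180: ${detectable_effect(180, 2874):.2f}")
```

Running both directions (effect to sample size, and sample size to effect) is a cheap way to confirm that a reported design and its stated inputs agree before committing to an 8-week test.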

Expected Outcomes:

| Result | Interpretation | Follow-up action |
| --- | --- | --- |
| Reject H₀ (Effect > 0) | CATE estimate validated | Full deployment |
| Fail to reject H₀ | Observational-data limits confirmed | Re-estimate CATE based on the RCT |
| Effect < 0 | Re-examine the current targeting strategy | Fundamental strategy revision |

Ethical considerations:

  • Estimated revenue loss for the control group (opportunity cost): ~$7,000
  • Recommendation: interim analysis (futility check) after a 3-week pilot
  • Early stopping rule: stop if there is a clear negative effect

Segment-level testing: Light Grocery, Fresh Lovers, Lapsed H&B

Phase 3: Expansion (1–2 months)

  1. If the A/B results confirm the predictions, expand to the full 31.3% targeting
  2. Retrain the model monthly with updated customer behavior
  3. Analyze TypeB and TypeC campaigns separately

4.4 Causal Assumptions Summary

| Assumption | Status | Evidence | Mitigation |
| --- | --- | --- | --- |
| SUTVA | OK | Single campaign, independent customers | |
| Unconfoundedness | Uncertain | Hidden logic possible in marketing strategy | Sensitivity analysis |
| Positivity | Violated | PS AUC = 0.989 | PS trimming, A/B test |
| Consistency | OK | Treatment is clearly defined | |

5. Conclusion

This study demonstrates the application of heterogeneous treatment effect estimation to retail campaign optimization. Despite a severe positivity violation that constrains causal identification, we uncovered economically significant heterogeneity in the treatment effect:

Key achievements:

  • A +$7,085 profit improvement between optimal and full targeting (-$4,659 → +$2,426)
  • Identified a negative CATE for VIP Heavy (-$38) and Bulk Shoppers (-$40)
  • Optimal policy: targeting 31.3% of customers yields 125% ROI

Methodological contributions:

  • Comprehensive positivity diagnostics including various mitigation strategies
  • Integration of behavioral segmentation (Track 1) and causal targeting (Track 2)
  • An uncertainty-aware, risk-adjusted policy framework
  • Honest model selection based on stability + distribution plausibility rather than raw AUUC

Acknowledged limitations:

  • A PS AUC of 0.989 represents a fundamental identification challenge
  • Refutation tests suggest model instability requiring A/B validation (Placebo 0.747 / Subset 0.561 → fail)
  • Results should be treated as hypothesis-generating rather than definitive

Next steps: The recommended A/B test (n=5,748, MDE ~$34) will validate these findings before full deployment. The staged rollout approach balances protection against estimation error with potential revenue gains.


Appendix: Technical Details

This appendix holds the densest technical detail (parameters · equations · PS-region decomposition · sensitivity grid · segment strategy). It is a layer for the reader who wants to go deep (30 minutes) after reading the body (30 seconds / 5 minutes).

A.1 Software Environment

  • Python 3.9+
  • econml 0.14+ (Microsoft Causal ML)
  • scikit-learn 1.0+
  • xgboost 1.7+
  • optuna (hyperparameter tuning)

A.2 Reproducibility

  • Random seeds fixed for all stochastic processes
  • Full code available in the project notebooks:
    • 03a_hte_estimation.ipynb
    • 03b_hte_validation.ipynb
    • 04_optimal_policy.ipynb

A.3 Data Artifacts

  • HTE results: results/hte_estimation_results.joblib
  • Validation results: results/hte_validation_summary.joblib
  • Policy results: results/policy_learning_results.joblib
  • Policy comparison: results/tables/policy_comparison.csv
  • Model comparison: results/tables/auuc_comparison_purchase_amount.csv, cate_summary_purchase_amount.csv
  • ATE results: results/tables/ate_results_purchase_amount.csv
  • A/B design: results/tables/ab_test_design.csv
  • Breakeven sensitivity: results/tables/breakeven_scenarios.csv

A.4 Key Parameters

Parameter | Value | Basis
PS Trim | [0.10, 0.90] | Balances sample size and reliability
Campaign Cost | $12.73 | Average TypeA campaign cost
Profit Margin | 30% | Retail industry standard
Breakeven CATE | $42.43 | Cost / Margin
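
The arithmetic behind these parameters can be checked with a minimal sketch. It assumes that ROI is defined as profit divided by total campaign cost (customers targeted × cost per contact). The function names are hypothetical, and the profit and customer counts are the reported policy figures, not values recomputed from the data.

```python
def breakeven_cate(cost: float, margin: float) -> float:
    """Incremental spend at which margin * spend just covers the contact cost."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    return cost / margin


def roi(profit: float, n_targeted: int, cost: float) -> float:
    """Assumed definition: profit relative to total campaign cost."""
    return profit / (n_targeted * cost)


COST, MARGIN = 12.73, 0.30  # Campaign Cost and Profit Margin from the table

breakeven = round(breakeven_cate(COST, MARGIN), 2)

# Reported (profit, customers targeted) for each policy
reported = {
    "optimal": (2426, 152),
    "full": (-4659, 486),
    "current": (-3402, 302),
}
roi_pct = {name: round(100 * roi(p, n, COST)) for name, (p, n) in reported.items()}

print(breakeven)  # 42.43
print(roi_pct)    # {'optimal': 125, 'full': -75, 'current': -88}
```

Under that assumed ROI definition, the reported profit figures reproduce the stated 125%, -75%, and -88% ROI values.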

A.5 CATE Reliability by Propensity Score Region

Understanding where CATE estimates are trustworthy is critical for targeting decisions. This section analyzes the treatment-effect estimates across different propensity-score regions.


A.5.1 CATE by PS Region


PS region | N | Sample % | Mean CATE | Reliability | Interpretation
Overlap (0.1-0.9) | 80 | 16.5% | +$34 | High | Most trustworthy estimate, comparable T/C groups
Extreme Low (<0.1) | 136 | 28.0% | +$16 | Medium | Control-heavy, extrapolating to treatment
Extreme High (>0.9) | 270 | 55.6% | +$1 | Low | Treatment-heavy, extrapolating to control

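Key insight: The overlap region shows the highest mean CATE (+$34), which suggests the true treatment effect is likely positive where treated and control customers are comparable. However, the remaining ~83% of the sample (406 of 486) lies outside the overlap region, so its estimates rest on extrapolation.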
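
A decomposition like the one in this table can be sketched in a few lines. This is a minimal illustration rather than the notebook code: the helper names and the five example customers are hypothetical, and the only values taken from the analysis are the region cutoffs 0.10 and 0.90.

```python
from collections import defaultdict


def ps_region(ps: float, low: float = 0.10, high: float = 0.90) -> str:
    """Assign a propensity score to one of the three regions used above."""
    if ps < low:
        return "Extreme Low"
    if ps > high:
        return "Extreme High"
    return "Overlap"


def summarize_by_region(ps_scores, cates):
    """Return count, sample share, and mean CATE for each PS region."""
    groups = defaultdict(list)
    for ps, cate in zip(ps_scores, cates):
        groups[ps_region(ps)].append(cate)
    total = len(ps_scores)
    return {
        region: {
            "n": len(values),
            "share": len(values) / total,
            "mean_cate": sum(values) / len(values),
        }
        for region, values in groups.items()
    }


# Hypothetical propensity scores and CATE estimates, for illustration only
example_ps = [0.05, 0.50, 0.95, 0.97, 0.30]
example_cate = [10.0, 40.0, 0.0, 2.0, 30.0]

region_summary = summarize_by_region(example_ps, example_cate)
for region, stats in region_summary.items():
    print(region, stats)
```

In practice the two input lists would be the fitted propensity scores and the per-household CATE estimates for the 486-customer cohort.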


A.5.2 CATE Bounds by PS Region


Region | Point Estimate | Lower Bound | Upper Bound | Width | Action
Overlap | +$34 | +$8 | +$60 | $52 | Target confidently
Extreme Low | +$16 | -$42 | +$74 | $116 | Proceed with caution
Extreme High | +$1 | -$38 | +$40 | $78 | Consider reducing targeting

Marketing implication: Target confidently only in the overlap region; use conservative estimates elsewhere.
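
One possible reading of this rule is sketched below. It assumes that "conservative estimate" means planning with the lower bound outside the overlap region and with the point estimate inside it. That interpretation and the name `planning_cate` are assumptions. The bounds themselves are the values from the table in this section.

```python
# (point, lower, upper) in dollars, from the CATE bounds table
REGION_BOUNDS = {
    "Overlap": (34, 8, 60),
    "Extreme Low": (16, -42, 74),
    "Extreme High": (1, -38, 40),
}


def planning_cate(region: str, point: float, lower: float) -> float:
    """Assumed rule: trust the point estimate only in the overlap region."""
    return point if region == "Overlap" else lower


bounds_summary = {}
for region, (point, lower, upper) in REGION_BOUNDS.items():
    bounds_summary[region] = {
        "width": upper - lower,          # interval width
        "excludes_zero": lower > 0,      # is the whole interval positive?
        "planning_cate": planning_cate(region, point, lower),
    }

for region, stats in bounds_summary.items():
    print(region, stats)
```

Only the overlap interval excludes zero, which is consistent with confining confident targeting to that region.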


A.5.3 Sensitivity Analysis: Cost & Margin


Breakeven scenario grid (per breakeven_scenarios):

Scenario | Cost | Margin | Breakeven | Target % | Profit
Base | $12.73 | 30% | $42.43 | 31.3% | +$2,426
Lower Margin | $12.73 | 25% | $50.92 | 26.1% | +$1,726
Higher Margin | $12.73 | 35% | $36.37 | 36.0% | +$3,173
Higher Cost | $15.00 | 30% | $50.00 | 26.5% | +$2,107
Lower Cost | $10.00 | 30% | $33.33 | 39.7% | +$2,888
Worst (15/25%) | $15.00 | 25% | $60.00 | not reported | +$1,461
Best (10/35%) | $10.00 | 35% | $28.57 | not reported | +$3,702

Robustness: Even when cost and margin are pushed to their conservative and optimistic extremes, the optimal policy keeps a positive profit (+$1,461 to +$3,702). The key point is that the sign of the policy's profit does not flip under any of these assumptions.
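
The grid can be re-derived with a short sketch, assuming Breakeven = Cost / Margin as in A.4 and a per-customer profit of margin × CATE minus cost. The cost and margin pairs are the scenario values from the grid. The CATE list is hypothetical and does not reproduce the reported Target % or Profit columns.

```python
SCENARIOS = {
    "Base": (12.73, 0.30),
    "Lower Margin": (12.73, 0.25),
    "Higher Margin": (12.73, 0.35),
    "Higher Cost": (15.00, 0.30),
    "Lower Cost": (10.00, 0.30),
    "Worst": (15.00, 0.25),
    "Best": (10.00, 0.35),
}

# Hypothetical per-customer CATE estimates in dollars, for illustration only
example_cates = [120.0, 75.0, 55.0, 45.0, 38.0, 30.0, 5.0, -40.0]


def evaluate(cost: float, margin: float, cates):
    """Re-optimize the threshold policy for one cost/margin scenario."""
    threshold = cost / margin
    targeted = [c for c in cates if c > threshold]
    profit = sum(margin * c - cost for c in targeted)
    return {
        "breakeven": round(threshold, 2),
        "n_targeted": len(targeted),
        "target_share": len(targeted) / len(cates),
        "profit": round(profit, 2),
    }


grid = {name: evaluate(cost, margin, example_cates)
        for name, (cost, margin) in SCENARIOS.items()}

for name, row in grid.items():
    print(name, row)
```

If the CATE estimates were exactly right, re-optimizing the threshold in each scenario could never produce a loss, because every targeted customer clears breakeven. The profit sign therefore depends on CATE accuracy rather than on the cost and margin assumptions.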


A.6 Detailed Segment Marketing Strategy

This section provides comprehensive targeting recommendations for each customer segment based on the CATE analysis and Track 1 profiling.

Note: The “current targeting %” / “recommended %” columns below are illustrative figures that do not exist in any source CSV and are not validated values. The validated facts are only the per-segment mean CATE / N (486 cohort) / directional action (expand · hold · reduce). Read the percentages below strictly as examples meant to convey direction intuitively.


A.6.1 Segment Performance Matrix

(Current/recommended targeting % are illustrative — direction only)

Segment | Mean CATE | Current targeting (example) | Recommended (example) | Direction
Regular+H&B | +$34 | ~76% | 85%+ | Slight expansion
Active Loyalists | +$33 | ~90% | 95%+ | Hold
Light Grocery | +$30 | ~15% | 45% | Expand
Fresh Lovers | +$27 | ~27% | 55% | Expand
Lapsed H&B | +$19 | ~20% | 35% | Slight expansion
VIP Heavy | -$38 | ~97% | 50% | Reduce
Bulk Shoppers | -$40 | ~52% | 20% | Reduce

A.6.2 Detailed Action Plan by Segment

(The targeting % below are illustrative — the validated values are the mean CATE and the directional action)

Segment: VIP Heavy (CATE: -$38)

Dimension | Current state | Problem | Recommendation
Campaign response | Negative | Already a high purchaser, ceiling effect | Reduce TypeA frequency
Alternative channel | Over-exposed to TypeA | May cause fatigue | Test TypeB/C
Value protection | $9,716 average spend | Churn risk | Premium non-discount benefits
Targeting rule | Over-exposed (example ~97%) | Over-targeting | Target only to trial new products

Segment: Bulk Shoppers (CATE: -$40)

Dimension | Current state | Problem | Recommendation
Campaign response | Negative | Price-sensitive, coupon mismatch | Reduce coupon campaigns
Shopping pattern | Irregular bulk | TypeA disrupts the natural rhythm | Focus on subscription/regularization
Alternative approach | Large basket per visit | Needs bulk-specific offers | Warehouse-style promotions
Targeting rule | Medium exposure (example ~52%) | Moderate over-targeting | Target only to expand categories

Segment: Light Grocery (CATE: +$30)

Dimension | Current state | Problem | Recommendation
Campaign response | Positive | Currently under-targeted | Substantially increase targeting rate
Potential | Low engagement | High uplift opportunity | Activation campaigns
Strategy | Minimal exposure (example ~15%) | Missing incremental value | Gradual rewards program
Targeting rule | Minimal exposure | Major gap | Target all customers with CATE > Breakeven

A.6.3 Risk-Adjusted Targeting Matrix


Risk Tolerance | CATE used for scoring | Targeted segments | Expected Profit | ROI
Aggressive (λ=0) | Full CATE (point estimate) | Regular+H&B, Active Loyalists, Light Grocery, Fresh Lovers, Lapsed H&B | +$2,426 | 125%
Moderate (λ=0.3) | 70% Point + 30% Lower | Regular+H&B, Active Loyalists, Light Grocery, Fresh Lovers | +$1,603 | 233%
Conservative (λ=0.7) | 30% Point + 70% Lower | Regular+H&B, Active Loyalists | ~$1,200 | ~200%
Ultra-safe (λ=1.0 / Lower CI > BE) | Lower bound only | Conservative policy (3 customers) | +$188 | 493%

Recommendations by situation:

Business context | Recommended λ | Basis
Before A/B test | 0.7-1.0 | Minimize downside risk
After validation | 0.3-0.5 | Balanced confidence
Budget constraint | 0.0-0.3 | Maximize absolute profit
New market/product | 0.5-0.7 | Limited historical data
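
The λ weighting in the risk matrix ("70% Point + 30% Lower" at λ=0.3) can be written as a blended score, (1 − λ) × point estimate + λ × lower bound, compared against breakeven. The sketch below assumes that form. The function names and the three example customers are hypothetical.

```python
def risk_adjusted_cate(point: float, lower: float, lam: float) -> float:
    """Blend the point estimate with the lower confidence bound."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    return (1.0 - lam) * point + lam * lower


def select_targets(customers, breakeven: float, lam: float):
    """Return IDs whose risk-adjusted CATE exceeds the breakeven CATE."""
    return [
        cid
        for cid, point, lower in customers
        if risk_adjusted_cate(point, lower, lam) > breakeven
    ]


BREAKEVEN = 42.43  # cost $12.73 / margin 0.30

# Hypothetical (id, point estimate, lower bound) triples in dollars
example_customers = [("A", 90.0, 50.0), ("B", 60.0, 20.0), ("C", 45.0, 10.0)]

selected = {lam: select_targets(example_customers, BREAKEVEN, lam)
            for lam in (0.0, 0.3, 0.7, 1.0)}

for lam, ids in selected.items():
    print(lam, ids)
```

As λ rises, the targeted set shrinks toward customers whose lower bound alone clears breakeven, which matches the small 3-customer policy in the Ultra-safe row.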

A.6.4 Campaign Type Alternatives (Future Analysis)

Segment | TypeA response | Hypothetical TypeB | Hypothetical TypeC | Recommended test
VIP Heavy | Negative | Neutral/positive? | Premium tier? | Test premium offers
Bulk Shoppers | Negative | Bulk deals? | Subscription? | Test bulk-specific
Fresh Lovers | Positive | Fresh specials? | Recipe app? | Keep TypeA + test B
Light Grocery | Positive | Habit triggers? | Gamification? | Keep TypeA + test C

Note: TypeB/TypeC analysis requires a separate HTE study with a campaign-type-specific design.


References

Causal Inference Foundations

  • Imbens, G. W., & Rubin, D. B. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
  • Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.

Heterogeneous Treatment Effects

  • Athey, S., & Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353-7360.
  • Wager, S., & Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523), 1228-1242.
  • Kennedy, E. H. (2023). Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2), 3008-3049.

Policy Learning

  • Athey, S., & Wager, S. (2021). Policy learning with observational data. Econometrica, 89(1), 133-161.
  • Zhou, Z., Athey, S., & Wager, S. (2023). Offline multi-action policy learning: Generalization and optimization. Operations Research, 71(1), 148-183.

Positivity and Sensitivity Analysis

  • Petersen, M. L., Porter, K. E., Gruber, S., Wang, Y., & van der Laan, M. J. (2012). Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research, 21(1), 31-54.
  • VanderWeele, T. J., & Ding, P. (2017). Sensitivity analysis in observational research: Introducing the E-value. Annals of Internal Medicine, 167(4), 268-274.

Retail Marketing Applications

  • Rossi, P. E., McCulloch, R. E., & Allenby, G. M. (1996). The value of purchase history data in target marketing. Marketing Science, 15(4), 321-340.
  • Hitsch, G. J., & Misra, S. (2018). Heterogeneous treatment effects and optimal targeting policy evaluation. SSRN Working Paper.

Software

  • Battocchi, K., et al. (2019). EconML: A Python package for ML-based heterogeneous treatment effects estimation. Microsoft Research.
  • Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD, 785-794.