Dunnhumby — Track 1: Latent-Factor Customer Segmentation

At a Glance (TL;DR)

Three-line summary

Applying NMF (k=5, 92.44% explained variance) + K-Means (k=7) to the Dunnhumby data (2,500 households · ~2.6M transactions · 102 weeks) yields 7 customer segments.

The segments show high stability (Bootstrap ARI 0.77 ± 0.11, n=100) and reveal a clear Pareto structure in which the high-value tier (High: segments 0, 1, and 6; 45.0% of all customers) accounts for roughly 73.9% of revenue.

These descriptive segments are a foundation for marketing strategy in their own right and also serve as the moderators for Track 2’s causal targeting analysis, where causal responsiveness is validated.

Key Numbers

Item	Value	Note
Number of segments	7	K-Means, selected by minimum DBI + interpretability
NMF Latent Factor	k=5	Explained variance 92.44%
Segment stability	ARI 0.77 ± 0.11	Bootstrap n=100, 80% sample
Largest segment	Light Grocery 21.0% (524 households)	By customer count
Highest-value segment	VIP Heavy $9,716/household (12.0%)	By average revenue
Pareto structure	High tier 45.0% of customers → 73.9% of revenue	Confirms value concentration

Hero Figures

Standardized feature profile by segment (Z-scores): shows the behavioral differentiation of the 7 segments at a glance.

Feature loadings of the 5 NMF latent factors: separation of the Value dimensions (F2, F3) and the Need dimensions (F1, F4, F5).

[30 sec] At a Glance (TL;DR)
[5 min] Abstract · 1. Introduction · 2. Methodology · 3. Results · 4. Discussion · 5. Conclusion
[30 min] Appendix: Technical Details (parameters · clustering-metrics full table · per-bubble-chart marketing-action mapping) · References

Abstract

This analysis presents a behavior-based customer segmentation framework using the retail transaction data from the Dunnhumby “The Complete Journey” dataset. Combining Non-negative Matrix Factorization (NMF) with K-Means Clustering, we derive 7 customer segments from 2,500 households over a 102-week observation period.

Key results:

5 interpretable latent factors explain 92.44% of the variance in customer behavior
High stability of the 7 customer segments (Bootstrap ARI = 0.77 ± 0.11, n=100)
The VIP segment (12% of all customers) generates an average of $9,716 in revenue per customer
High-value customers (45.0%) contribute roughly 73.9% of total revenue
Clear marketing strategies derived for each segment

This segmentation provides the foundation for a personalized marketing strategy and serves as the input to the subsequent Causal Targeting analysis (Track 2).

1. Introduction

1.1 Background

Customer segmentation is central to modern retail marketing strategy. By grouping customers based on behavioral patterns, retailers can develop targeted interventions that maximize the return on marketing investment. Traditional demographics-based segmentation often fails to capture the subtle behavioral differences that drive purchase decisions. This study adopts a behavior-first approach that discovers natural customer groups using features extracted from transaction data.

1.2 Dataset

This study analyzes the Dunnhumby “The Complete Journey” dataset:

Item	Value
Number of households	2,500
Number of transactions	~2.6 million (2,595,732)
Analysis period	102 weeks (2 years)
Number of campaigns	30 marketing campaigns
Number of products	92,000+ SKUs
Number of stores	400+

The dataset includes transaction records, household demographics (32% coverage), campaign targeting, and coupon distribution and redemption data.

1.3 Research Objectives

Extract latent behavioral factors that characterize customer shopping patterns
Identify distinct customer segments that yield actionable marketing implications
Validate segment stability through bootstrap resampling
Develop per-segment marketing strategies / recommendations

1.4 Analysis Framework

This analysis is part of a 2-track research framework:

Track 1 (this report): Customer understanding through segmentation
Track 2 (separate): Causal targeting through heterogeneous treatment effect estimation

The Track 1 segments serve as moderators for the Track 2 causal analysis, enabling per-segment campaign optimization.

2. Methodology

2.1 Feature Engineering

We constructed 33 customer-level features from the transaction data, organized into 6 conceptual groups:

Group	Count	Description	Examples
Recency	6	Time since last purchase	days_since_last, active_last_4w
Frequency	6	Shopping frequency patterns	visits_per_week, purchase_regularity
Monetary	7	Spending characteristics	total_sales, avg_basket_size, coupon_savings
Behavioral	7	Shopping behavior	discount_rate, private_label_ratio, n_departments
Category	6	Category preferences	share_grocery, share_fresh, share_h&b
Time	1	Tenure coverage	week_coverage

To address multicollinearity, we removed one feature from each highly correlated pair (r ≥ 0.7), reducing the feature set from 33 to 19.

Multicollinearity handling details:

Removal criterion	Examples of removed features	Retained feature
Perfect correlation (r = 1.0)	frequency_per_month	frequency_per_week
High correlation (r ≥ 0.9)	monetary_actual	monetary_sales
Redundant information (r ≥ 0.7)	active_last_12w	active_last_4w

The 14 removed features:

Frequency: frequency_per_month, transaction_count
Monetary: monetary_actual, monetary_avg_basket_actual, monetary_per_week
Recency: active_last_12w, recency_weeks
Behavioral: avg_products_per_basket
Other redundant derived variables

Note: We chose correlation-based removal over VIF analysis because (1) it maintains compatibility with NMF’s non-negativity constraint, and (2) it prioritizes interpretability. As an alternative, automatic variable selection via Elastic Net regularization is also possible, but here we secured reproducibility through explicit removal.
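The correlation-based removal described above can be illustrated with a minimal sketch. It is not the project's notebook code: the helper `drop_correlated`, the toy data, the use of the absolute correlation, and the rule of keeping the first-listed column of each pair are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.7) -> list:
    """Return columns to drop so that no retained pair has |r| >= threshold.

    Hypothetical helper: keeps the first-listed column of each correlated pair.
    """
    corr = features.corr().abs()
    cols = list(corr.columns)
    to_drop = []
    for i, keep in enumerate(cols):
        if keep in to_drop:
            continue
        for other in cols[i + 1:]:
            if other not in to_drop and corr.loc[keep, other] >= threshold:
                to_drop.append(other)
    return to_drop


# Toy data (synthetic, not the Dunnhumby features).
rng = np.random.default_rng(0)
per_week = rng.gamma(2.0, 1.0, size=200)
toy = pd.DataFrame({
    "frequency_per_week": per_week,
    "frequency_per_month": per_week * 4.0,        # perfectly correlated duplicate
    "share_fresh": rng.uniform(0, 1, size=200),   # unrelated feature
})

dropped = drop_correlated(toy, threshold=0.7)
print(dropped)
```

In this toy case only the rescaled duplicate is flagged, which mirrors the frequency_per_month / frequency_per_week example in the table above.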

Preprocessing: For NMF compatibility, MinMaxScaler normalization was applied to the [0, 1] range (non-negative input required).

2.2 Latent Factor Modeling (NMF)

Non-negative Matrix Factorization (NMF) decomposes the customer-feature matrix into two low-dimensional matrices to derive latent behavioral factors.
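As a reference, the decomposition can be written as follows. The symbols X, W, H, n, m are introduced here for illustration, and the squared Frobenius loss is an assumption (it is the scikit-learn default; the report does not state the loss used). In this study n = 2,500 households, m = 19 features, and k = 5 factors.

```latex
X \approx W H, \qquad
X \in \mathbb{R}_{\ge 0}^{n \times m}, \quad
W \in \mathbb{R}_{\ge 0}^{n \times k}, \quad
H \in \mathbb{R}_{\ge 0}^{k \times m}

\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2
```

Each row of W holds one household's factor scores, and each row of H holds one factor's feature loadings, so every household is approximated as a non-negative sum of factor profiles.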

Rationale for NMF vs PCA:

Criterion	NMF	PCA
Interpretability	Parts-based decomposition → intuitive factor interpretation	Orthogonal axes → hard to interpret
Non-negativity constraint	Naturally non-negative loadings	Allows negative loadings
Business fit	”Customer A = 0.3×loyal + 0.5×fresh” is interpretable	”Customer A = PC1 - 0.2×PC2” is unclear
Marketing collaboration	Easy to communicate with non-technical teams via additive-parts interpretation	Requires technical explanation
Prior research	Parts-based learning method (Lee & Seung, 1999), widely used in retail segmentation	General dimensionality reduction

Empirical validation: At the same k, we confirmed that NMF factors produce clearer clustering than PCA components in terms of category/behavior.

Model selection:

n_components: evaluated over the range 2–8
Selection criteria: reconstruction error (elbow method) and factor interpretability
Selected: 5 components (explaining 92.44% of variance)

Figure 1: NMF component selection (reconstruction error and cumulative explained variance).

NMF parameters:

Solver: Coordinate Descent
Initialization: Random
Max iterations: 1,000
Random state: fixed for reproducibility
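A minimal sketch of this configuration in scikit-learn is shown below, run on synthetic data rather than the study's feature matrix. The seed value 42 and the way "explained variance" is computed (share of the squared Frobenius norm that is reconstructed) are assumptions; the report does not specify either.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for a household-by-feature matrix with 19 features.
rng = np.random.default_rng(42)
raw = rng.gamma(shape=2.0, scale=1.0, size=(300, 19))

# Scale every column into [0, 1] so the input is non-negative.
X = MinMaxScaler().fit_transform(raw)

model = NMF(
    n_components=5,
    solver="cd",        # coordinate descent
    init="random",
    max_iter=1000,
    random_state=42,    # assumed seed, for reproducibility
)
W = model.fit_transform(X)   # household-by-factor scores
H = model.components_        # factor-by-feature loadings

# One possible definition of explained variance (an assumption, see lead-in).
explained = 1.0 - np.linalg.norm(X - W @ H) ** 2 / np.linalg.norm(X) ** 2
print(W.shape, H.shape)
```

Repeating the fit for n_components in the 2 to 8 range and plotting `model.reconstruction_err_` against the number of components would give an elbow curve of the kind used for model selection.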

2.3 Clustering

We applied K-Means clustering to the NMF factor scores to identify customer segments.

Clustering evaluation:

Tested over the range k = 2–11
Compared K-Means vs. Gaussian Mixture Model (GMM)
K-Means substantially outperformed GMM (Silhouette: 0.219 vs. 0.047)

Optimal-k selection (an honest trade-off):

The internal validation metrics for the candidate k values are below (full table in Appendix A.5):

k	Silhouette	Calinski-Harabasz	Davies-Bouldin
3	0.271	984.9	1.256
5	0.225	794.2	1.321
6	0.207	756.3	1.342
7	0.219	732.0	1.241 ← selected
8	0.209	700.2	1.244

Davies-Bouldin Index (DBI): minimized at k = 7 (1.241) — the best cluster separation among the candidates.
The Silhouette Score is in fact highest at lower k (max 0.271 at k=3). The Silhouette at k=7 (0.219) is not the maximum, and we do not claim it to be “optimal.”
Selected: k = 7 — a decision integrating (1) DBI minimization, (2) business interpretability/actionability (the 7 segments map naturally to marketing actions), and (3) high bootstrap stability (ARI 0.77). In other words, this is not a single-metric optimum but a choice balanced across a quantitative indicator (DBI), a qualitative criterion (actionability), and robustness (ARI).
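The k scan behind this kind of trade-off could be sketched as follows. The data are synthetic stand-ins for the 5 factor scores, and `n_init=10` and the seed are assumed settings, so the resulting numbers are unrelated to the table above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for the NMF factor scores (not the study's data).
rng = np.random.default_rng(0)
centers = rng.uniform(0, 1, size=(4, 5))
scores = np.vstack([c + rng.normal(0, 0.05, size=(100, 5)) for c in centers])

results = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores)
    results[k] = {
        "silhouette": silhouette_score(scores, labels),                 # higher is better
        "calinski_harabasz": calinski_harabasz_score(scores, labels),   # higher is better
        "davies_bouldin": davies_bouldin_score(scores, labels),         # lower is better
    }

# The metrics need not agree on a single k, which is why a combined judgment is used.
best_dbi_k = min(results, key=lambda k: results[k]["davies_bouldin"])
best_sil_k = max(results, key=lambda k: results[k]["silhouette"])
print(best_dbi_k, best_sil_k)
```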

Figure 2: Clustering evaluation metrics as a function of k. Silhouette is higher at lower k, while DBI is minimized at k=7.

2.4 Stability Validation

We performed bootstrap resampling to assess segment stability:

100 bootstrap iterations
80% sample ratio per iteration
Metric: Adjusted Rand Index (ARI) between the original and bootstrap assignments
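One way such a stability check might be implemented is sketched below. Several details are assumptions not stated in the report: subsamples are drawn without replacement, K-Means is refit on each subsample, and the refit labels are compared with the original labels of the same households. The helper name `bootstrap_ari` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score


def bootstrap_ari(scores, k, n_iter=100, frac=0.8, seed=0):
    """Mean and SD of the ARI between full-data labels and subsample refits."""
    rng = np.random.default_rng(seed)
    base = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(scores)
    n = scores.shape[0]
    aris = []
    for i in range(n_iter):
        # Draw an 80% subsample (without replacement, by assumption).
        idx = rng.choice(n, size=int(frac * n), replace=False)
        boot = KMeans(n_clusters=k, n_init=10, random_state=seed + i + 1)
        boot_labels = boot.fit_predict(scores[idx])
        # ARI is invariant to label permutations, so no label matching is needed.
        aris.append(adjusted_rand_score(base[idx], boot_labels))
    return float(np.mean(aris)), float(np.std(aris))


# Toy factor scores with three well-separated groups (synthetic).
rng = np.random.default_rng(1)
centers = rng.uniform(0, 1, size=(3, 5))
toy_scores = np.vstack([c + rng.normal(0, 0.03, size=(80, 5)) for c in centers])

mean_ari, sd_ari = bootstrap_ari(toy_scores, k=3, n_iter=20)
print(round(mean_ari, 2), round(sd_ari, 2))
```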

3. Results

3.1 Latent Factor Interpretation

Through NMF, we identified 5 interpretable latent factors representing distinct aspects of customer behavior:

Factor	Name	Top features (loading)	Interpretation
F1	Grocery Deal Seeker	share_grocery (6.72), discount_usage_pct (5.13), private_label_ratio (3.41)	Budget-conscious grocery shoppers who seek discounts
F2	Loyal Regular	purchase_regularity (4.63), n_departments (2.61), n_products (1.53), frequency (1.04)	High-engagement one-stop shoppers
F3	Big Basket	monetary_std (2.45), monetary_avg_basket (2.35), share_grocery (2.08)	Irregular bulk buyers
F4	Fresh Focused	share_fresh (2.26), n_departments (1.21)	Fresh-food category specialists
F5	Health & Beauty	share_health_beauty (2.03), recency (0.41)	Drugstore-type shoppers

Figure 3: NMF factor loadings heatmap (feature weights for each latent factor).

The factors separate naturally into a Value dimension (F2, F3, capturing frequency and monetary value) and a Need dimension (F1, F4, F5, capturing category preference).

3.2 Clustering Evaluation Metrics

Metric	Value	Interpretation
Explained Variance	92.44%	High factor coverage
Silhouette Score (k=7)	0.219	Adequate for behavioral data (benchmark 0.15–0.30); note the max is at k=3 (0.271)
Calinski-Harabasz Index	732.0	High between-cluster variance
Davies-Bouldin Index	1.241	Minimum among candidate k (best separation)
Bootstrap ARI	0.77 ± 0.11	High stability (95% CI: 0.55–0.99)

Silhouette Score interpretation:

A Silhouette Score of 0.219 is moderate, and in this data it is higher at lower k (k=3). However, this is a common pattern in behavioral clustering, where customer characteristics exist on a continuum rather than in discrete groups. We make clear that k=7 is not the Silhouette optimum but a choice based on DBI minimization + interpretability + stability criteria (§2.3).

Comparison	Silhouette	Source
This study (k=7)	0.219	-
Retail segmentation (general)	0.15–0.30	Wedel & Kamakura (2000)
E-commerce clustering	0.15–0.30	Industry benchmark
Demographics-based segmentation	0.35–0.50	Higher separation from discrete attributes

Causes of the low Silhouette:

Customer behavior is inherently continuously distributed (no discrete boundaries)
RFM and category preferences form a gradient
Transitional customers exist between segments (e.g., Light Grocery → Active Loyalists in transition)

Acceptability assessment: 0.219 falls within the benchmark range for behavioral data. The DBI minimum at k=7 and the high Bootstrap ARI (0.77) complement each other and support the substantive stability of the segments.

3.3 Stability Validation

Bootstrap resampling (100 iterations, 80% sample) yielded an Adjusted Rand Index of 0.77 ± 0.11, indicating high segment stability. An ARI of 0.70 or above is generally considered strong agreement, confirming that the 7-segment solution is robust to sampling variation.
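The 0.55 to 0.99 interval reported with the ARI in §3.2 is consistent with a normal approximation of mean ± 1.96 standard deviations. That reading is an inference from the reported numbers, not something the report states, and the short check below only reproduces the arithmetic.

```python
mean_ari = 0.77   # reported bootstrap mean
sd_ari = 0.11     # reported bootstrap standard deviation
z = 1.96          # normal quantile for a 95% interval (assumed method)

lower = mean_ari - z * sd_ari
upper = mean_ari + z * sd_ari
print(f"{lower:.2f} to {upper:.2f}")
```

Read this way, the interval describes the spread of individual bootstrap ARI values rather than the uncertainty of their mean.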

3.4 The 7 Customer Segments

Clustering identified 7 distinct customer segments (over all 2,500 households):

Seg	Name	Size	Avg revenue	Frequency	Recency	Regularity	Primary factor
0	Active Loyalists	509 (20.4%)	$3,878	171	6 days	0.78	F2 (Loyal)
1	VIP Heavy	299 (12.0%)	$9,716	256	4 days	0.88	F2 (Loyal)
2	Lapsed H&B	193 (7.7%)	$872	37	75 days	0.25	F5 (H&B)
3	Fresh Lovers	339 (13.6%)	$1,233	48	36 days	0.34	F4 (Fresh)
4	Light Grocery	524 (21.0%)	$942	43	42 days	0.30	F1 (Grocery-Deal)
5	Bulk Shoppers	318 (12.7%)	$3,206	56	24 days	0.41	F3 (Basket)
6	Regular + H&B	318 (12.7%)	$3,393	152	12 days	0.70	F2 (Loyal)

Note: We label the primary factor of Seg4 (Light Grocery) as F1 (Grocery-Deal). With a grocery share of 0.56 and a discount usage of 0.51, both of which load strongly on F1, this segment is dominated by a grocery/discount-seeking tendency.

Figure 4: Customer segment size distribution.

3.5 Segment Profiles

Figure 5: Standardized feature profile by segment (Z-scores).

Figure 6: Average factor score for each customer segment.

Segment characteristics:

Segment 0: Active Loyalists (20.4%)

High purchase regularity (0.78) and diverse category shopping
Strong private-label preference (highest PL ratio at 0.34)
Budget-conscious yet highly loyal shoppers

Segment 1: VIP Heavy (12.0%)

Top performance on every RFM metric
Highest frequency (256), monetary value ($9,716), lowest recency (4 days)
True one-stop shoppers buying an average of 1,316 unique products

Segment 2: Lapsed H&B (7.7%)

Highest recency (75 days) — effectively churned
H&B category specialists with low overall engagement
Win-back targets with uncertain ROI

Segment 3: Fresh Lovers (13.6%)

Fresh-food category specialists with moderate engagement
Relatively active customers (36-day recency) with concentrated baskets

Segment 4: Light Grocery (21.0%)

The largest segment by customer count, with the lowest value per customer ($942)
Light engagement centered on groceries/discounts (F1 Grocery-Deal dominant)
An activation opportunity with habit-formation potential

Segment 5: Bulk Shoppers (12.7%)

Highest average basket size (about $57 per visit)
Low frequency (56) but high spend per visit
Warehouse/Costco-style shopping pattern

Segment 6: Regular + H&B (12.7%)

A second-tier value segment with VIP-conversion potential
Regular buyers (frequency 152) with an H&B focus
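The per-visit spend quoted for Bulk Shoppers can be cross-checked from the segment table in §3.4 by dividing average revenue by frequency. This sketch assumes the Frequency column counts visits over the observation period, which the report implies but does not define explicitly.

```python
# (average revenue in $, frequency) per segment, copied from the table in §3.4.
segments = {
    "Active Loyalists": (3878, 171),
    "VIP Heavy": (9716, 256),
    "Lapsed H&B": (872, 37),
    "Fresh Lovers": (1233, 48),
    "Light Grocery": (942, 43),
    "Bulk Shoppers": (3206, 56),
    "Regular + H&B": (3393, 152),
}

# Approximate spend per visit = average revenue / frequency.
basket = {name: revenue / visits for name, (revenue, visits) in segments.items()}
largest = max(basket, key=basket.get)

for name, value in sorted(basket.items(), key=lambda item: -item[1]):
    print(f"{name:18s} ${value:6.2f}")
```

Bulk Shoppers come out at $57.25 per visit, ahead of VIP Heavy at roughly $38, which matches the "about $57" figure quoted above.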

3.6 Value Tier Distribution

The segments separate naturally into value tiers (over all 2,500 households):

Tier	Segments	N	Customer share	Avg revenue	Total revenue	Revenue share
High	0, 1, 6	1,126	45.0%	$5,291	$5,958K	73.9%
Medium	3, 5	657	26.3%	$2,188	$1,437K	17.8%
Low/At-Risk	2, 4	717	28.7%	$923	$662K	8.2%
Total	-	2,500	100%	$3,223	$8,057K	100%

Calculation basis:

Total revenue = Σ(segment N × average revenue)
Revenue share = tier total revenue / overall total revenue
High Value segments: Active Loyalists ($3,878), VIP Heavy ($9,716), Regular+H&B ($3,393)
(Totals are computed from the per-segment average revenues in §3.4, e.g., $942 for Seg4.)

Pareto-law check: The High tier (45.0% of customers) contributes 73.9% of revenue. This confirms clear value concentration, although it is milder than the classic 80/20 rule.
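The tier figures can be reproduced from the segment sizes and average revenues in §3.4 using the calculation basis given above. The sketch below uses only numbers from the report; the variable names are illustrative.

```python
# segment id: (name, households, average revenue in $), from §3.4.
segments = {
    0: ("Active Loyalists", 509, 3878),
    1: ("VIP Heavy", 299, 9716),
    2: ("Lapsed H&B", 193, 872),
    3: ("Fresh Lovers", 339, 1233),
    4: ("Light Grocery", 524, 942),
    5: ("Bulk Shoppers", 318, 3206),
    6: ("Regular + H&B", 318, 3393),
}
tiers = {"High": [0, 1, 6], "Medium": [3, 5], "Low/At-Risk": [2, 4]}

# Total revenue = sum over segments of (N x average revenue).
total_revenue = sum(n * avg for _, n, avg in segments.values())
total_households = sum(n for _, n, _ in segments.values())

summary = {}
for tier, ids in tiers.items():
    n = sum(segments[i][1] for i in ids)
    revenue = sum(segments[i][1] * segments[i][2] for i in ids)
    summary[tier] = {
        "households": n,
        "customer_share": round(100 * n / total_households, 1),
        "avg_revenue": round(revenue / n),
        "revenue_share": round(100 * revenue / total_revenue, 1),
    }

for tier, row in summary.items():
    print(tier, row)
```

Because of rounding, the three revenue shares (73.9%, 17.8%, 8.2%) sum to 99.9% rather than exactly 100%.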

3.7 Multidimensional Segment Positioning

Figure 7: Segment positioning on the Loyalty (F2) vs Deal-Seeking (F1) dimensions.

Figure 8: RFM value positioning showing VIP dominance and segment differentiation.

Figure 9: Customer lifecycle positioning identifying active high-value and lapsed segments.

4. Discussion

4.1 Key Insights

1. A clear value hierarchy The segmentation shows a clear Pareto-style concentration: the high-value segments (45.0% of customers) contribute roughly 73.9% of estimated revenue. VIP Heavy alone (12.0% of customers, about 36% of estimated revenue) is the most important retention target.

2. Behavioral differentiation The factors successfully separate customers along both the Value (frequency, monetary) and Need (category preference) dimensions. This dual structure enables both value-based prioritization and need-based personalization.

3. Lifecycle stages The segments map to distinct lifecycle stages:

Active/Growing: Segments 0, 1, 6 (low recency, high engagement)
Stable: Segments 3, 4, 5 (medium recency)
Declining/Churned: Segment 2 (high recency, low engagement)

4. Category specialists Fresh Lovers (13.6%) and the H&B-focused segments show category specialization, suggesting opportunities for category-specific marketing approaches.

4.2 Marketing Strategy / Recommendations

Segment	Priority	Strategy	Key actions
VIP Heavy	High	Retention	Premium perks, churn-prediction alerts, exclusive access
Active Loyalists	High	Strengthen	Private-label promotions, loyalty points, basket expansion
Regular + H&B	Medium	Upgrade	VIP-conversion program, cross-category incentives
Bulk Shoppers	Medium	Regularize	Subscription offers, scheduled delivery, bundle deals
Fresh Lovers	Medium	Engage	Fresh-food content marketing, daily specials, recipe inspiration
Light Grocery	Low	Activate	Habit-formation campaigns, progressive rewards, onboarding
Lapsed H&B	Low	Win-back	Re-engagement campaigns, H&B-focused offers

Recommended budget allocation:

High Priority (45%): VIP Heavy (25%), Active Loyalists (20%)
Medium Priority (35%): Regular + H&B (15%), Bulk Shoppers (10%), Fresh Lovers (10%)
Low Priority (20%): Light Grocery (10%), Lapsed H&B (10%)

Caution (descriptive vs. causal): The allocation above is a priority based on descriptive value (revenue contribution). “Which segment actually responds better to promotions” is a separate causal question, validated in Track 2 via per-segment CATE — and that result may differ from the revenue-value ranking (e.g., a high-value segment does not necessarily have a high treatment effect).

4.3 Limitations

1. Moderate Silhouette Score (0.219) Behavioral data is inherently continuous rather than having discrete boundaries. In this data, the Silhouette is higher at lower k (k=3), so k=7 is not the Silhouette optimum but a choice based on DBI minimization, interpretability, and stability. This score is acceptable for customer segmentation but indicates some overlap between segments.

2. Limited demographics coverage (32%) Only 801 of the 2,500 households have demographic information, which limits demographics-based stratification and persona development.

3. Descriptive vs. causal This segmentation is descriptive. Questions such as “which segment responds best to promotions?” require causal analysis (Track 2).

4. Single-retailer context The results are specific to this retailer’s customer base and may not generalize to other retail contexts.

4.4 Future Directions

1. Track 2 integration The segments will serve as heterogeneous-treatment-effect moderators in the Track 2 causal analysis. This enables per-segment campaign-effect estimation.

2. A/B testing validation The recommended strategies should be validated through controlled experiments before full-scale deployment.

3. Dynamic segmentation Periodic re-clustering to capture segment migration and evolving customer behavior.

4. Value × Need framework An optional extension using separate Value (RFM) and Need (Category) factor models for cross-sell optimization scenarios.

5. Conclusion

This study demonstrates an effective approach to behavior-based customer segmentation using latent factor modeling and clustering. The NMF + K-Means framework successfully identified 7 distinct customer segments with high stability (ARI = 0.77 ± 0.11) and clear business interpretability.

Key achievements:

5 latent factors (92.44% explained variance) capturing the Value (Loyalty, Monetary) and Need (Category Preference) dimensions
7 actionable segments ranging from VIP Heavy ($9,716 average) to Lapsed H&B ($872 average)
A clear priority hierarchy including the 45.0% of high-value customers (contributing 73.9% of revenue) that warrant concentrated retention efforts
Per-segment strategies from Retention (VIP) to Activation (Light Grocery) to Win-back (Lapsed)

The segmentation provides a solid foundation for personalized marketing and serves as the input to the subsequent Causal Targeting analysis, enabling evidence-based marketing optimization.

Appendix: Technical Details

A.1 Software Environment

Python 3.9+
scikit-learn (NMF, K-Means)
pandas, numpy (data processing)
matplotlib, seaborn (visualization)

A.2 Reproducibility

Random seed fixed for all stochastic processes
Full code available in the project notebooks:
- 01_feature_engineering.ipynb
- 02_customer_profiling.ipynb

A.3 Data Artifacts

Segment assignments: data/dunnhumby/processed/segment_models.joblib
Feature metadata: data/dunnhumby/processed/feature_metadata.json