Wide and Deep · Tae Hyun Kim (Lowell)

Definition

Wide & Deep (Cheng et al., 2016) is a CTR prediction model that combines a linear wide component (memorization) with a DNN deep component (generalization). It was first deployed for Google Play app recommendation.

$\hat{y} = \sigma\big(\mathbf{w}_{\text{wide}}^T [\mathbf{x}, \boldsymbol{\phi}(\mathbf{x})] + \mathbf{w}_{\text{deep}}^T \mathbf{a}^{(L)} + b\big)$

Wide Component

A generalized linear model that takes the raw features $\mathbf{x}$ and a cross-product transformation $\boldsymbol{\phi}(\mathbf{x})$ as input:

$\boldsymbol{\phi}_k(\mathbf{x}) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\}$

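These cross-product features must be designed manually, and they memorize the co-occurrence of specific feature combinations. For binary features, $\phi_k(\mathbf{x})$ is 1 only when every feature selected by $c_{ki} = 1$ is itself 1, so a cross-product such as AND(gender=female, language=en) fires only when both conditions hold.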
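The cross-product transformation can be sketched in a few lines of Python. This is only an illustrative sketch: the one-hot feature layout and the names `cross_product`, `and_female_en`, and `and_male_en` are assumptions made for the example, not part of the original paper or any library.

```python
from typing import Sequence


def cross_product(x: Sequence[int], c_k: Sequence[int]) -> int:
    """Compute phi_k(x) = prod_i x_i ** c_ki for binary x and a binary mask c_k."""
    if len(x) != len(c_k):
        raise ValueError("x and c_k must have the same length")
    result = 1
    for x_i, c_ki in zip(x, c_k):
        if c_ki == 1:
            result *= x_i  # only features selected by the mask contribute
    return result


# Hypothetical one-hot layout: [gender=female, gender=male, lang=en, lang=ko]
x = [1, 0, 1, 0]
and_female_en = [1, 0, 1, 0]  # mask for AND(gender=female, lang=en)
and_male_en = [0, 1, 1, 0]    # mask for AND(gender=male, lang=en)

print(cross_product(x, and_female_en))  # 1: both selected features are active
print(cross_product(x, and_male_en))    # 0: gender=male is not active
```

Each mask corresponds to one hand-designed cross feature, which is why the wide part scales with the amount of feature engineering invested in it.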

Deep Component

A feed-forward neural network that converts categorical features into dense embeddings and then passes them through several hidden layers:

$\mathbf{a}^{(l+1)} = \text{ReLU}(\mathbf{W}^{(l)} \mathbf{a}^{(l)} + \mathbf{b}^{(l)})$

It implicitly learns high-order feature interactions, handling generalization to unseen feature combinations.
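As a rough sketch of the forward pass described above, the following pure-Python code looks up one embedding per categorical feature, concatenates them into $\mathbf{a}^{(0)}$, and applies the ReLU layer recurrence. The function names, the tiny embedding tables, and the weights are all hypothetical values chosen so the arithmetic is easy to follow; a real model would learn them and would use a tensor library.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def relu(v: Vector) -> Vector:
    return [max(0.0, z) for z in v]


def dense(W: Matrix, a: Vector, b: Vector) -> Vector:
    """Return W a + b for a row-major weight matrix W."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]


def deep_forward(category_ids: Sequence[int],
                 embedding_tables: Sequence[Matrix],
                 layers: Sequence[Tuple[Matrix, Vector]]) -> Vector:
    # Look up one embedding per categorical feature and concatenate into a^(0).
    a: Vector = []
    for table, idx in zip(embedding_tables, category_ids):
        a.extend(table[idx])
    # Apply a^(l+1) = ReLU(W^(l) a^(l) + b^(l)) for each hidden layer.
    for W, b in layers:
        a = relu(dense(W, a, b))
    return a


# Hypothetical setup: two categorical features, each with a 2-d embedding.
app_table = [[0.5, -1.0], [1.0, 2.0]]
lang_table = [[0.0, 1.0], [-0.5, 0.5]]
layers = [
    ([[1.0, 0.5, 0.0, -1.0], [-1.0, 0.0, 2.0, 1.0]], [0.0, -0.5]),
    ([[2.0, 1.0]], [0.5]),
]

final_activation = deep_forward([1, 0], [app_table, lang_table], layers)
print(final_activation)  # [2.5]
```

Because the embeddings are dense, two feature values that never co-occurred in training still produce a usable activation, which is the generalization behavior attributed to the deep part.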

Joint Training

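The wide and deep components are trained jointly: their outputs are combined in a weighted sum that feeds a single logistic loss, and the gradient of that loss is backpropagated to both parts at the same time. This differs from an ensemble, where each model is trained separately and combined only at inference. In the original paper, the wide part is optimized with FTRL with L1 regularization, while the deep part is optimized with AdaGrad.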
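The joint prediction and its shared loss can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the deep activations are passed in as a plain list, and the optimizer steps (FTRL for the wide weights, AdaGrad for the deep weights) are omitted, so only the output-layer gradients are shown.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def predict(w_wide, wide_features, w_deep, deep_activations, bias) -> float:
    """sigma(w_wide^T [x, phi(x)] + w_deep^T a^(L) + b)."""
    logit = dot(w_wide, wide_features) + dot(w_deep, deep_activations) + bias
    return sigmoid(logit)


def log_loss(y: int, y_hat: float) -> float:
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def output_gradients(y: int, y_hat: float,
                     wide_features: Sequence[float],
                     deep_activations: Sequence[float]
                     ) -> Tuple[List[float], List[float], float]:
    # For logistic loss, dL/dlogit = y_hat - y. Both parts receive this same
    # error term, which is what makes the training joint.
    err = y_hat - y
    grad_w_wide = [err * f for f in wide_features]
    grad_w_deep = [err * a for a in deep_activations]
    return grad_w_wide, grad_w_deep, err


# Hypothetical inputs: [x, phi(x)] for the wide part, a^(L) for the deep part.
wide_features = [1.0, 1.0]
deep_activations = [2.5]
y_hat = predict([0.0, 0.0], wide_features, [0.0], deep_activations, 0.0)
grad_w_wide, grad_w_deep, grad_bias = output_gradients(
    1, y_hat, wide_features, deep_activations)
```

With all weights at zero the prediction is 0.5, and for a positive label both weight vectors are pushed in the direction that raises the logit, each scaled by its own input.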

Intuitive Understanding

A recommender system needs two capabilities at once:

Memorization (Wide): “Recommend apps similar to those the user has installed in the past” — memorizing direct patterns from historical data
Generalization (Deep): “Given this user’s overall tastes, they might also like apps from a new category” — generalizing to unseen combinations

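Using Wide alone can only memorize feature combinations that appeared in the training data and cannot generalize to new ones, while using Deep alone can over-generalize and miss specific co-occurrences, especially when interactions are sparse. Combining the two captures the strengths of both.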

Advantages and Disadvantages

Advantages:

A balanced combination of memorization and generalization
Validated by large-scale industrial deployment (Google Play, described in the original paper as having over one billion active users)
Interpretable, since the roles of the Wide and Deep parts are clearly defined

Disadvantages:

The cross-product features of the wide part must be designed manually (requires domain expertise)
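Wide and Deep do not share input representations: the wide part reads sparse cross-product features while the deep part learns its own embeddings, so neither component reuses what the other has learned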
The choice of cross-product features strongly affects performance — poor design degrades performance
Low-order interactions that are not manually defined as cross-products cannot be memorized by the wide part and are left to the deep part to learn implicitly

Related Models

DeepFM - Replaces Wide with an FM so no feature engineering is needed; introduces shared embeddings
Factorization Machine - Automatically learns second-order feature interactions; replaces Wide’s cross-products
FNN - DNN trained after FM pre-training; a different approach from Wide & Deep
PNN - Captures interactions via a product layer; extends only the Deep part of Wide & Deep

Key Papers

Cheng, H.-T., et al. (2016). Wide & deep learning for recommender systems. DLRS 2016. - original paper
Guo, H., et al. (2017). DeepFM: A factorization-machine based neural network for CTR prediction. IJCAI 2017. - improves on the limitations of Wide & Deep
