Tae Hyun Kim (Lowell)

Wide and Deep

2 min read · #recsys #factor-models

Definition

Wide & Deep (Cheng et al., 2016) is a CTR prediction model that combines a linear wide component (memorization) with a DNN deep component (generalization). It was first deployed for Google Play app recommendation.

$$\hat{y} = \sigma\big(\mathbf{w}_{\text{wide}}^T [\mathbf{x}, \boldsymbol{\phi}(\mathbf{x})] + \mathbf{w}_{\text{deep}}^T \mathbf{a}^{(L)} + b\big)$$
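As a rough illustration of the prediction formula, the sketch below computes the combined output from already-prepared inputs. The function and argument names (`wide_deep_predict`, `phi_x`, `a_last`) are hypothetical, and the final deep activations are assumed to be given rather than computed here.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def wide_deep_predict(x, phi_x, a_last, w_wide, w_deep, b):
    """Combine the wide and deep logits and squash them with a sigmoid.

    x      : raw features for the wide part
    phi_x  : cross-product features phi(x)
    a_last : final hidden activations a^(L) of the deep part (assumed given)
    """
    wide_input = list(x) + list(phi_x)  # the concatenation [x, phi(x)]
    if len(wide_input) != len(w_wide) or len(a_last) != len(w_deep):
        raise ValueError("weight and input lengths must match")
    wide_logit = sum(w * v for w, v in zip(w_wide, wide_input))
    deep_logit = sum(w * a for w, a in zip(w_deep, a_last))
    return sigmoid(wide_logit + deep_logit + b)


# Toy numbers chosen by hand: wide logit = 0.5, deep logit = 0.0
p = wide_deep_predict(
    x=[1, 0], phi_x=[1], a_last=[0.5, 2.0],
    w_wide=[0.2, -0.1, 0.3], w_deep=[1.0, -0.25], b=0.0,
)
print(round(p, 4))  # sigmoid(0.5), about 0.6225
```

Both parts contribute to a single logit, so there is only one sigmoid and one bias term.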

Wide Component

A generalized linear model that takes the raw features $\mathbf{x}$ and a cross-product transformation $\boldsymbol{\phi}(\mathbf{x})$ as input:

$$\boldsymbol{\phi}_k(\mathbf{x}) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\}$$

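Here $c_{ki}$ is 1 if the $i$-th feature is part of the $k$-th transformation and 0 otherwise, so for binary features the product equals 1 only when every selected feature is 1. These cross-product features must be designed manually, and they memorize the co-occurrence of specific feature combinations.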
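A small worked example of the cross-product transformation may help. It assumes binary (one-hot style) features; the feature names and the `cross_product` helper are made up for illustration.

```python
def cross_product(x, c_k):
    """phi_k(x) = prod_i x_i ** c_ki for binary x_i and c_ki.

    Features with c_ki == 0 contribute a factor of 1 and are skipped,
    so the result is 1 only if every selected feature is 1.
    """
    result = 1
    for x_i, c_ki in zip(x, c_k):
        if c_ki == 1:
            result *= x_i
    return result


# Hypothetical binary features, in this order:
feature_names = ["gender=female", "language=en", "language=es"]

# Hand-designed cross: AND(gender=female, language=en)
c_female_en = [1, 1, 0]

user_a = [1, 1, 0]  # female, English
user_b = [1, 0, 1]  # female, Spanish

phi_a = cross_product(user_a, c_female_en)  # both selected features are 1
phi_b = cross_product(user_b, c_female_en)  # language=en is 0
print(phi_a, phi_b)  # 1 0
```

Each row `c_k` corresponds to one manually chosen feature combination, which is why the wide part needs feature engineering.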

Deep Component

A feed-forward neural network that converts categorical features into dense embeddings and then passes them through several hidden layers:

$$\mathbf{a}^{(l+1)} = \text{ReLU}(\mathbf{W}^{(l)} \mathbf{a}^{(l)} + \mathbf{b}^{(l)})$$

It implicitly learns high-order feature interactions, handling generalization to unseen feature combinations.
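The forward pass of the deep part can be sketched in plain Python as follows. This is a minimal illustration, not the paper's implementation: the helper names, the tiny embedding tables, and the hand-picked weights are all assumptions.

```python
def relu(v):
    return [max(0.0, t) for t in v]


def dense(W, a, b):
    """Compute W a + b, with W given as a list of rows."""
    return [sum(w * a_j for w, a_j in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]


def deep_forward(category_ids, embedding_tables, layers):
    """Look up one embedding per categorical feature, concatenate them
    into a^(0), then apply a^(l+1) = ReLU(W^(l) a^(l) + b^(l)) per layer."""
    a = []
    for table, idx in zip(embedding_tables, category_ids):
        a.extend(table[idx])  # dense embedding of this category id
    for W, b in layers:
        a = relu(dense(W, a, b))
    return a


# Two categorical features: a 2-d embedding and a 1-d embedding (toy values)
embedding_tables = [
    [[1.0, -1.0], [0.5, 0.5]],  # feature 0 has 2 categories
    [[2.0], [-3.0]],            # feature 1 has 2 categories
]
# One hidden layer mapping the 3-d concatenation to 2 units
layers = [([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.0, 0.5])]

a_last = deep_forward([0, 1], embedding_tables, layers)
print(a_last)  # [1.0, 0.0]
```

In a real model the embedding tables and layer weights are learned; here they are fixed so the arithmetic can be checked by hand.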

Joint Training

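The wide and deep components are trained jointly: their output log-odds are summed and passed through a single sigmoid, and the gradient of the logistic loss is backpropagated into both parts at once. This differs from an ensemble, where the models are trained separately and combined only at inference time. In the original paper, the wide part is optimized with FTRL with L1 regularization, and the deep part with AdaGrad.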
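The sketch below illustrates joint training with two different optimizers driven by one shared error signal. It is heavily simplified and rests on several assumptions: the deep part is reduced to its final weight vector over fixed activations, the bias is omitted, the FTRL step uses the commonly stated per-coordinate FTRL-Proximal form, and all class names and hyperparameters are illustrative.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class FTRLProximal:
    """Per-coordinate FTRL-Proximal with L1/L2 (used for the wide weights)."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=0.01, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim
        self.n = [0.0] * dim

    def weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 keeps small coordinates exactly zero
        sign = 1.0 if z > 0 else -1.0
        denom = (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def update(self, i, g):
        w = self.weight(i)
        sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
        self.z[i] += g - sigma * w
        self.n[i] += g * g


class AdaGrad:
    """AdaGrad over a dense weight vector (stands in for the deep part)."""

    def __init__(self, dim, lr=0.1, eps=1e-8):
        self.lr, self.eps = lr, eps
        self.w = [0.0] * dim
        self.g2 = [0.0] * dim

    def update(self, j, g):
        self.g2[j] += g * g
        self.w[j] -= self.lr * g / (math.sqrt(self.g2[j]) + self.eps)


def predict(wide, deep, wide_x, a_last):
    logit = sum(wide.weight(i) * v for i, v in enumerate(wide_x))
    logit += sum(w * a for w, a in zip(deep.w, a_last))
    return sigmoid(logit)


def joint_step(wide, deep, wide_x, a_last, y):
    p = predict(wide, deep, wide_x, a_last)
    err = p - y  # d(log loss)/d(logit), shared by both components
    for i, v in enumerate(wide_x):
        if v != 0:  # only active sparse features are updated
            wide.update(i, err * v)
    for j, a in enumerate(a_last):
        deep.update(j, err * a)
    return p


# Toy data: (wide features, final deep activations, label)
data = [([1, 0], [1.0, 0.0], 1), ([0, 1], [0.0, 1.0], 0)]
wide, deep = FTRLProximal(dim=2), AdaGrad(dim=2)
for _ in range(200):
    for wide_x, a_last, y in data:
        joint_step(wide, deep, wide_x, a_last, y)

p_pos = predict(wide, deep, *data[0][:2])
p_neg = predict(wide, deep, *data[1][:2])
print(p_pos > 0.5, p_neg < 0.5)  # True True
```

The key point is that `err` is computed once from the combined logit, so each part only has to explain what the other does not.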

Intuitive Understanding

A recommender system needs two capabilities at once:

  • Memorization (Wide): “Recommend apps similar to those the user has installed in the past” — memorizing direct patterns from historical data
  • Generalization (Deep): “Given this user’s overall tastes, they might also like apps from a new category” — generalizing to unseen combinations

A wide-only model cannot generalize to feature pairs that never co-occurred in the training data, while a deep-only model can over-generalize and miss specific, sparse co-occurrences. Combining the two captures the strengths of both.

Advantages and Disadvantages

Advantages:

  • A balanced combination of memorization and generalization
  • Validated by large-scale industrial deployment (Google Play, described in the paper as having over one billion active users)
  • Interpretable, since the roles of the Wide and Deep parts are clearly defined

Disadvantages:

  • The cross-product features of the wide part must be designed manually (requires domain expertise)
  • Wide and Deep do not share embeddings — the learning signals of the two components are separated
  • The choice of cross-product features strongly affects performance — poor design degrades performance
  • Low-order interactions that are not manually defined are not modeled explicitly; they are left to the deep part to learn implicitly

Related Models

  • DeepFM - Replaces Wide with an FM so no feature engineering is needed; introduces shared embeddings
  • Factorization Machine - Automatically learns second-order feature interactions; replaces Wide’s cross-products
  • FNN - A DNN initialized from a pre-trained FM; a different approach from Wide & Deep
  • PNN - Captures interactions via a product layer; extends only the Deep part of Wide & Deep

Key Papers

  • Cheng, H., et al. (2016). Wide & deep learning for recommender systems. DLRS 2016. Original paper.
  • Guo, H., et al. (2017). DeepFM: A factorization-machine based neural network for CTR prediction. IJCAI 2017. DeepFM; improves on the limitations of Wide & Deep.
