Tae Hyun Kim (Lowell)

DeepFM

Definition

DeepFM (Guo et al., 2017) is a CTR prediction model that combines an FM component and a Deep component in parallel, jointly learning low-order feature interactions (explicit) and high-order feature interactions (implicit).

$$\hat{y} = \sigma\big(\underbrace{y_{\text{FM}}}_{\text{low-order}} + \underbrace{y_{\text{DNN}}}_{\text{high-order}}\big)$$
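
As a small illustration of the formula, the two component outputs are added as logits and passed through a single sigmoid. The sketch below uses only the standard library; the function name `deepfm_predict` and the input values are assumptions made for this example.

```python
import math


def deepfm_predict(y_fm: float, y_dnn: float) -> float:
    """Combine the FM and DNN logits and squash the sum into a click probability."""
    z = y_fm + y_dnn
    # Branch on the sign of z so that math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Logits that cancel out give a probability of exactly 0.5.
p_neutral = deepfm_predict(0.3, -0.3)
# Both components agreeing on a positive signal pushes the probability towards 1.
p_positive = deepfm_predict(2.0, 1.0)
```

Because the sum happens before the sigmoid, neither component produces a probability on its own; only the combined logit is calibrated.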

FM Component (Wide)

The same 2nd-order feature interaction as the Factorization Machine:

$$y_{\text{FM}} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$$
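
The double sum over pairs does not need to be evaluated pair by pair. The standard FM reformulation computes it per latent dimension as half of "square of the sum minus sum of the squares". The plain-Python sketch below checks the two forms against each other; the function names and the random toy values are assumptions made for this example.

```python
import random


def fm_pairwise_naive(x: list[float], v: list[list[float]]) -> float:
    """Sum over i < j of <v_i, v_j> * x_i * x_j, computed pair by pair."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            total += dot * x[i] * x[j]
    return total


def fm_pairwise_fast(x: list[float], v: list[list[float]]) -> float:
    """Same quantity as 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    k = len(v[0])
    total = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        total += s * s - s_sq
    return 0.5 * total


random.seed(0)
n, k = 6, 4  # toy sizes, chosen arbitrarily
x = [random.choice([0.0, 1.0]) for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(k)] for _ in range(n)]
naive, fast = fm_pairwise_naive(x, v), fm_pairwise_fast(x, v)
```

The naive form costs O(n²k) and the factored form O(nk), which is why practical implementations use the latter.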

Deep Component

A feed-forward neural network that takes dense embeddings as input:

$$\mathbf{a}^{(0)} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_m]$$
$$\mathbf{a}^{(l)} = \text{ReLU}(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})$$
$$y_{\text{DNN}} = \mathbf{w}^T \mathbf{a}^{(L)} + b$$
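
The three equations can be traced by hand on a tiny example. The following is a minimal plain-Python sketch with hand-picked toy weights; the helper names (`relu`, `affine`, `deep_component`) and all numbers are assumptions made for illustration.

```python
def relu(z: list[float]) -> list[float]:
    return [max(0.0, t) for t in z]


def affine(W: list[list[float]], a: list[float], b: list[float]) -> list[float]:
    """Compute W a + b for a weight matrix given as a list of rows."""
    return [sum(w * t for w, t in zip(row, a)) + bias for row, bias in zip(W, b)]


def deep_component(embeddings, hidden_layers, w_out, b_out) -> float:
    # a^(0): concatenate the per-field embeddings into one flat vector.
    a = [val for e in embeddings for val in e]
    # a^(l) = ReLU(W^(l) a^(l-1) + b^(l)) for each hidden layer.
    for W, b in hidden_layers:
        a = relu(affine(W, a, b))
    # y_DNN = w^T a^(L) + b
    return sum(w * t for w, t in zip(w_out, a)) + b_out


embeddings = [[1.0, -1.0], [0.5, 2.0]]  # m = 2 fields, embedding size 2
hidden_layers = [([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, -1.0]], [0.0, 0.5])]
y_dnn = deep_component(embeddings, hidden_layers, w_out=[2.0, 3.0], b_out=0.1)
# Hidden pre-activations are [1.0, -2.5], ReLU gives [1.0, 0.0], so y_dnn = 2.1
```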

Key: Shared Embedding

The latent vector $\mathbf{v}_i$ of the FM component and the embedding $\mathbf{e}_i$ of the Deep component share the same parameters. This yields:

  • The two components reinforce each other’s learning signals
  • End-to-end learning from raw features alone, without feature engineering
  • A reduced number of parameters
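
To make the parameter saving concrete, the sketch below counts parameters for a DeepFM laid out like the implementation later in this note (one embedding table, first-order weights plus a bias, and an MLP with a single output unit). The `shared=False` variant, with a second embedding table for the Deep part, is a hypothetical baseline used only for comparison, and the field sizes are invented.

```python
def deepfm_param_count(
    field_dims: list[int], embed_dim: int, mlp_dims: list[int], shared: bool = True
) -> int:
    total_features = sum(field_dims)
    embedding = total_features * embed_dim
    # Shared: FM and Deep read one table. Separate (hypothetical): one table each.
    embedding_tables = 1 if shared else 2
    first_order = total_features + 1  # one weight per feature, plus the global bias
    mlp = 0
    in_dim = len(field_dims) * embed_dim
    for dim in mlp_dims:
        mlp += in_dim * dim + dim  # weights + biases of one hidden layer
        in_dim = dim
    mlp += in_dim + 1  # output layer: weights + bias
    return embedding_tables * embedding + first_order + mlp


field_dims = [1000, 500, 20]  # invented vocabulary sizes for three fields
shared = deepfm_param_count(field_dims, embed_dim=8, mlp_dims=[32, 16])
separate = deepfm_param_count(field_dims, embed_dim=8, mlp_dims=[32, 16], shared=False)
# shared = 15026, separate = 27186; the gap is exactly one embedding table (12160)
```

In CTR settings the embedding table usually dominates the total, so sharing it removes close to half of the parameters in this toy setup.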

Intuitive Understanding

In CTR prediction there are two kinds of feature interactions:

  • Low-order: “male + shooting game → high click probability” — explicit, interpretable 2nd-order combinations
  • High-order: “male in their 20s + Friday evening + mobile + new RPG → click” — a compound interaction of several features

Google’s Wide & Deep proposed combining Wide (memorization) + Deep (generalization), but the Wide part required manual feature engineering (cross-products). DeepFM replaces the Wide part with an FM to automatically learn 2nd-order interactions, while the Deep part complements with high-order interactions.
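
The difference can be illustrated without any model code. In the sketch below, the Wide & Deep style requires someone to list which crosses to build, while the FM style implicitly covers every pair of fields. The sample record, the field names, and the helper functions are all invented for this example.

```python
from itertools import combinations

sample = {"gender": "male", "age": "20s", "device": "mobile", "genre": "shooting"}

# Wide & Deep style: an expert decides in advance which crosses are worth building.
manual_crosses = [("gender", "genre")]


def cross_features(record: dict[str, str], crosses: list[tuple[str, ...]]) -> list[str]:
    """Build one cross-product feature string per hand-picked field combination."""
    return ["&".join(f"{field}={record[field]}" for field in cross) for cross in crosses]


def all_field_pairs(record: dict[str, str]) -> list[tuple[str, str]]:
    """FM style: every pair of fields interacts through <v_i, v_j>; nothing is hand-picked."""
    return list(combinations(sorted(record), 2))


wide_features = cross_features(sample, manual_crosses)  # ['gender=male&genre=shooting']
fm_pairs = all_field_pairs(sample)                      # 6 pairs for 4 fields
```

With m fields the FM term covers m(m-1)/2 pairs per sample, so no interaction is missed merely because nobody thought to define it.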

DeepFM

Mermaid source:
> flowchart LR
>     Input["Sparse Input"] --> Emb["Shared Embedding Layer"]
>     Emb --> FM_comp["FM Component<br/>(2nd-order interaction)"]
>     Emb --> DNN["Deep Component<br/>(high-order interaction)"]
>     FM_comp --> Add(("+"))
>     DNN --> Add
>     Add --> Sigmoid["σ(·)"] --> Output["CTR"]

Comparison with Wide & Deep

| Criterion | Wide & Deep | DeepFM |
|---|---|---|
| Wide part | Cross-product (manually designed) | FM (automatically learned) |
| Feature engineering | Required (expert knowledge) | Not required |
| Embedding sharing | Separate for Wide/Deep | Shared |
| Low-order interaction | Only manually defined combinations | All 2nd-order combinations |

Position in the FM Lineage

| Model | Structure | Interaction |
|---|---|---|
| Factorization Machine | FM only | explicit 2nd-order |
| DeepFM | FM + DNN (parallel) | explicit 2nd-order + implicit high-order |
| xDeepFM | CIN + DNN | explicit high-order + implicit high-order |
| AutoInt | Self-attention + DNN | attention-weighted high-order |

DeepFM’s limitation is that high-order interactions rely solely on the implicit learning of the DNN. xDeepFM addressed this by adding explicit high-order interactions via the CIN (Compressed Interaction Network).

Advantages and Disadvantages

Advantages:

  • Feature engineering not required: end-to-end learning from raw features alone
  • Shared embeddings let FM and DNN learn in a mutually complementary way
  • Consistent performance gains over Wide & Deep on both datasets reported in the original paper (the public Criteo dataset and the proprietary Company* dataset)
  • Relatively simple to implement and widely used as a CTR baseline in industry

Disadvantages:

  • High-order interactions rely on the implicit learning of the DNN (not explicit)
  • Requires hyperparameter tuning of DNN depth and width
  • The embedding dimension is identical for all features (the optimal dimension may differ per field)
  • Cannot directly model sequential patterns (user behavior sequences)

Implementation

Core DeepFM structure (PyTorch):

import torch
import torch.nn as nn

class DeepFM(nn.Module):
    def __init__(self, field_dims: list[int], embed_dim: int, mlp_dims: list[int]):
        super().__init__()
        num_fields = len(field_dims)
        total_dims = sum(field_dims)

        # Shared Embedding
        self.embedding = nn.Embedding(total_dims, embed_dim)
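        # Start index of each field in the shared table; registered as a buffer
        # so it moves with the module across devices
        self.register_buffer("offsets", torch.tensor([0] + field_dims[:-1]).cumsum(0))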

        # FM: 1st-order + 2nd-order
        self.linear = nn.Embedding(total_dims, 1)
        self.bias = nn.Parameter(torch.zeros(1))

        # Deep
        mlp_input = num_fields * embed_dim
        layers = []
        for dim in mlp_dims:
            layers += [nn.Linear(mlp_input, dim), nn.ReLU(), nn.Dropout(0.2)]
            mlp_input = dim
        layers.append(nn.Linear(mlp_input, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_fields) — per-field feature index
        x = x + self.offsets.to(x.device)
        embed = self.embedding(x)  # (batch, fields, embed_dim)

        # FM component
        linear_out = self.linear(x).sum(dim=1)      # 1st-order term
        sum_sq = embed.sum(dim=1) ** 2               # (Σe_i)²
        sq_sum = (embed ** 2).sum(dim=1)             # Σe_i²
        fm_out = 0.5 * (sum_sq - sq_sum).sum(dim=1, keepdim=True)  # 2nd-order term

        # Deep component
        deep_out = self.mlp(embed.view(embed.size(0), -1))  # flatten → MLP

        return torch.sigmoid(self.bias + linear_out + fm_out + deep_out).squeeze(1)
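
The `offsets` trick in the code above is easy to miss: every field has its own local index range, and adding a cumulative offset maps those ranges onto a single shared table. The plain-Python sketch below reproduces only that index arithmetic; the helper names and the toy field sizes are assumptions made for this example.

```python
from itertools import accumulate


def field_offsets(field_dims: list[int]) -> list[int]:
    """Start position of each field inside one table of size sum(field_dims)."""
    return [0] + list(accumulate(field_dims))[:-1]


def to_global_indices(x: list[int], field_dims: list[int]) -> list[int]:
    """Shift per-field indices so that no two fields share a table row."""
    return [idx + off for idx, off in zip(x, field_offsets(field_dims))]


field_dims = [3, 5, 2]                                 # toy vocabulary sizes
offsets = field_offsets(field_dims)                    # [0, 3, 8]
global_idx = to_global_indices([2, 0, 1], field_dims)  # [2, 3, 9]
```

The largest reachable index is `sum(field_dims) - 1`, which matches the size of the `nn.Embedding(total_dims, embed_dim)` table.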

Related Concepts

  • Factorization Machine - the Wide part of DeepFM; foundation of 2nd-order feature interaction
  • Wide and Deep - the predecessor model DeepFM improves on; the Wide part requires feature engineering
  • FNN - FM pre-training + DNN; unlike DeepFM, cannot learn end-to-end
  • PNN - learns interactions via a product layer; lacks low-order terms
  • Hybrid-Expert Adaptor - uses DeepFM as the backbone recommendation model in KAR
  • Multi-Task Learning - the parallel training of FM/Deep can be interpreted from a multi-task perspective

Key Papers

  • Guo, H., et al. (2017). DeepFM: A factorization-machine based neural network for CTR prediction. IJCAI 2017. - the original DeepFM paper
  • Rendle, S. (2010). Factorization machines. ICDM 2010. - the original FM paper; theoretical basis of DeepFM
  • Cheng, H., et al. (2016). Wide & deep learning for recommender systems. DLRS 2016. - the prior work DeepFM improves on
  • Lian, J., et al. (2018). xDeepFM: Combining explicit and implicit feature interactions for recommender systems. KDD 2018. - a follow-up advance to DeepFM
