DeepFM · Tae Hyun Kim (Lowell)

Definition

DeepFM (Guo et al., 2017) is a CTR prediction model that combines an FM component and a Deep component in parallel, jointly learning low-order feature interactions (explicit) and high-order feature interactions (implicit).

$\hat{y} = \sigma\big(\underbrace{y_{\text{FM}}}_{\text{low-order}} + \underbrace{y_{\text{DNN}}}_{\text{high-order}}\big)$
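As a minimal sketch of this output rule, the two component scores can be treated as logits that are added before the sigmoid. The function names and the example scores below are illustrative assumptions, not taken from the paper.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def deepfm_probability(y_fm: float, y_dnn: float) -> float:
    """Add the FM and DNN scores (logits) and squash the sum into a probability."""
    return sigmoid(y_fm + y_dnn)

# Hypothetical scores: the two logits cancel, so the prediction is exactly 0.5.
p = deepfm_probability(y_fm=0.4, y_dnn=-0.4)
```

Because the sum happens before the sigmoid, neither component produces a probability on its own; both contribute to a single logit.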

FM Component (Wide)

The FM component is the standard 2nd-order Factorization Machine: a bias, first-order weights, and pairwise interactions factorized through latent vectors:

$y_{\text{FM}} = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$
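The pairwise sum can be computed without enumerating all pairs, using the identity $\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{f} \big[ (\sum_i v_{i,f} x_i)^2 - \sum_i v_{i,f}^2 x_i^2 \big]$. The sketch below checks the two forms against each other in plain Python; the function names and the toy values of `x`, `w`, and `V` are assumptions chosen for illustration.

```python
from itertools import combinations

def fm_pairwise_naive(x, V):
    """Sum of <v_i, v_j> x_i x_j over all pairs i < j, computed directly."""
    total = 0.0
    for i, j in combinations(range(len(x)), 2):
        dot = sum(a * b for a, b in zip(V[i], V[j]))
        total += dot * x[i] * x[j]
    return total

def fm_pairwise_fast(x, V):
    """Same quantity via 0.5 * sum_f [(sum_i v_if x_i)^2 - sum_i (v_if x_i)^2]."""
    n, k = len(x), len(V[0])
    total = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        total += s * s - s_sq
    return 0.5 * total

def fm_score(x, w0, w, V):
    """Full FM output: bias + first-order term + pairwise term."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x)) + fm_pairwise_fast(x, V)

# Toy input: 4 features (one inactive), latent dimension k = 2.
x = [1.0, 0.0, 1.0, 1.0]
V = [[0.5, -1.0], [2.0, 0.25], [1.5, 0.5], [-0.5, 1.0]]
w = [0.1, 0.2, 0.3, 0.4]
```

The naive form visits every pair of features, while the rewritten form needs one pass over the features per latent dimension.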

Deep Component

A feed-forward neural network that takes dense embeddings as input:

$\mathbf{a}^{(0)} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_m]$

$\mathbf{a}^{(l)} = \text{ReLU}(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)})$

$y_{\text{DNN}} = \mathbf{w}^T \mathbf{a}^{(L)} + b$
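The three steps (concatenate the field embeddings, apply ReLU layers, project to a scalar) can be traced by hand on a tiny network. This is a plain-Python sketch under assumed toy weights; `deep_component`, `affine`, and the numbers are hypothetical and only mirror the equations above.

```python
def relu(vec):
    return [max(0.0, v) for v in vec]

def affine(W, a, b):
    """Compute W a + b for a weight matrix W given as a list of rows."""
    return [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]

def deep_component(embeddings, hidden_layers, w_out, b_out):
    """embeddings: list of m field embeddings; hidden_layers: list of (W, b)."""
    a = [v for e in embeddings for v in e]        # a^(0): concatenation
    for W, b in hidden_layers:
        a = relu(affine(W, a, b))                 # a^(l)
    return sum(w * v for w, v in zip(w_out, a)) + b_out   # y_DNN

# Toy setup: m = 2 fields with 2-dimensional embeddings, one hidden layer of width 2.
embeddings = [[1.0, -1.0], [0.5, 2.0]]
hidden_layers = [([[1.0, 0.0, 0.0, 1.0],
                   [0.0, 1.0, -1.0, 0.0]], [0.0, 0.0])]
y_dnn = deep_component(embeddings, hidden_layers, w_out=[1.0, 1.0], b_out=0.5)
```

Here the hidden pre-activations are 3.0 and -1.5, ReLU zeroes the second one, and the output is 3.0 + 0.5 = 3.5.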

Key: Shared Embedding

The latent vector $\mathbf{v}_i$ of the FM component and the embedding $\mathbf{e}_i$ of the Deep component share the same parameters. This yields:

The two components reinforce each other’s learning signals
End-to-end learning from raw features alone, without feature engineering
Fewer parameters than keeping a separate embedding table for each component
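The parameter saving can be made concrete by counting embedding weights. The comparison below is against a hypothetical variant that keeps one table for the FM latent vectors and a second one for the DNN embeddings; the field sizes and `embed_dim` are made-up values for illustration.

```python
def embedding_params(field_dims, embed_dim):
    """Parameters in one embedding table covering every feature value."""
    return sum(field_dims) * embed_dim

def count_params(field_dims, embed_dim, shared):
    """Embedding parameters with one shared table, or one table per component."""
    tables = 1 if shared else 2
    return tables * embedding_params(field_dims, embed_dim)

field_dims = [1000, 50, 7]   # hypothetical vocabulary size per field
shared = count_params(field_dims, embed_dim=16, shared=True)
separate = count_params(field_dims, embed_dim=16, shared=False)
```

With these numbers the shared table holds 16,912 weights and the separate variant twice that. In CTR data the vocabularies are usually far larger than the MLP, so the embedding table tends to dominate the parameter count.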

Intuitive Understanding

In CTR prediction there are two kinds of feature interactions:

Low-order: “male + shooting game → high click probability” — explicit, interpretable 2nd-order combinations
High-order: “male in their 20s + Friday evening + mobile + new RPG → click” — a compound interaction of several features

Google’s Wide & Deep proposed combining Wide (memorization) + Deep (generalization), but the Wide part required manual feature engineering (cross-products). DeepFM replaces the Wide part with an FM to automatically learn 2nd-order interactions, while the Deep part complements with high-order interactions.
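The difference between hand-written crosses and FM can be sketched as follows: a Wide-style cross-product feature exists only if someone listed it, whereas FM assigns a learned interaction to every pair of active features. The feature strings and helper names are hypothetical, loosely following the "male + shooting game" example above.

```python
from itertools import combinations

def cross_product_features(active, crosses):
    """Wide & Deep style: a cross fires only if all of its features are active."""
    return {cross: int(all(f in active for f in cross)) for cross in crosses}

def fm_pairs(active):
    """FM style: every pair of active features gets a learned interaction."""
    return list(combinations(sorted(active), 2))

active = {"gender=male", "genre=shooter", "device=mobile"}

# Hand-picked by a domain expert; any pair not listed here is invisible to the Wide part.
manual_crosses = [("gender=male", "genre=shooter")]

wide_features = cross_product_features(active, manual_crosses)
pairs = fm_pairs(active)
```

In this toy case the Wide part sees one cross, while FM covers all three pairs, including the two involving `device=mobile` that nobody specified.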

DeepFM

> flowchart LR
>     Input["Sparse Input"] --> Emb["Shared Embedding Layer"]
>     Emb --> FM_comp["FM Component<br/>(2nd-order interaction)"]
>     Emb --> DNN["Deep Component<br/>(high-order interaction)"]
>     FM_comp --> Add(("+"))
>     DNN --> Add
>     Add --> Sigmoid["σ(·)"] --> Output["CTR"]
>

Comparison with Wide & Deep

Criterion	Wide & Deep	DeepFM
Wide part	Cross-product (manually designed)	FM (automatically learned)
Feature engineering	Required (expert knowledge)	Not required
Embedding sharing	Not shared (Wide and Deep take separate inputs)	Shared
Low-order interaction	Only manually defined combinations	All 2nd-order combinations

Position in the FM Lineage

Model	Structure	Interaction
Factorization Machine	FM only	explicit 2nd-order
DeepFM	FM + DNN (parallel)	explicit 2nd-order + implicit high-order
xDeepFM	CIN + DNN	explicit high-order + implicit high-order
AutoInt	Self-attention + DNN	attention-weighted high-order

DeepFM’s limitation is that high-order interactions rely solely on the implicit learning of the DNN. xDeepFM addressed this by adding explicit high-order interactions via the CIN (Compressed Interaction Network).

Advantages and Disadvantages

Advantages:

Feature engineering not required: end-to-end learning from raw features alone
Shared embeddings let FM and DNN learn in a mutually complementary way
Higher AUC and lower Logloss than Wide & Deep in the experiments of the original paper (Criteo and Company* datasets)
Relatively simple to implement and widely used as a CTR baseline in industry

Disadvantages:

High-order interactions rely on the implicit learning of the DNN (not explicit)
Requires hyperparameter tuning of DNN depth and width
The embedding dimension is identical for all features (the optimal dimension may differ per field)
Cannot directly model sequential patterns (user behavior sequences)

Implementation

Core DeepFM structure (PyTorch):

import torch
import torch.nn as nn

class DeepFM(nn.Module):
    def __init__(self, field_dims: list[int], embed_dim: int, mlp_dims: list[int]):
        super().__init__()
        num_fields = len(field_dims)
        total_dims = sum(field_dims)

        # Shared Embedding
        self.embedding = nn.Embedding(total_dims, embed_dim)
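        # Start row of each field in the shared table; a buffer follows the module across devices
        self.register_buffer("offsets", torch.tensor([0] + field_dims[:-1]).cumsum(0))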

        # FM: 1st-order + 2nd-order
        self.linear = nn.Embedding(total_dims, 1)
        self.bias = nn.Parameter(torch.zeros(1))

        # Deep
        mlp_input = num_fields * embed_dim
        layers = []
        for dim in mlp_dims:
            layers += [nn.Linear(mlp_input, dim), nn.ReLU(), nn.Dropout(0.2)]
            mlp_input = dim
        layers.append(nn.Linear(mlp_input, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_fields) — per-field feature index
        x = x + self.offsets.to(x.device)
        embed = self.embedding(x)  # (batch, fields, embed_dim)

        # FM component
        linear_out = self.linear(x).sum(dim=1)      # 1st-order term
        sum_sq = embed.sum(dim=1) ** 2               # (Σe_i)²
        sq_sum = (embed ** 2).sum(dim=1)             # Σe_i²
        fm_out = 0.5 * (sum_sq - sq_sum).sum(dim=1, keepdim=True)  # 2nd-order term

        # Deep component
        deep_out = self.mlp(embed.view(embed.size(0), -1))  # flatten → MLP

        return torch.sigmoid(self.bias + linear_out + fm_out + deep_out).squeeze(1)
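The `offsets` step in the code above is what lets a single embedding table serve every field: each per-field index is shifted into its own row range. The plain-Python sketch below reproduces that index arithmetic without PyTorch; the helper names and the vocabulary sizes are assumptions for illustration.

```python
from itertools import accumulate

def field_offsets(field_dims):
    """Start position of each field inside one concatenated embedding table."""
    return [0] + list(accumulate(field_dims[:-1]))

def to_global_indices(x, field_dims):
    """Shift per-field indices so that fields occupy disjoint row ranges."""
    return [idx + off for idx, off in zip(x, field_offsets(field_dims))]

field_dims = [3, 2, 4]                            # hypothetical vocabulary sizes
offsets = field_offsets(field_dims)               # [0, 3, 5]
rows = to_global_indices([2, 1, 0], field_dims)   # [2, 4, 5]
```

Field 0 owns rows 0 to 2, field 1 rows 3 to 4, and field 2 rows 5 to 8, so the same local index in two different fields never maps to the same embedding row.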

Related Concepts

Factorization Machine - the Wide part of DeepFM; foundation of 2nd-order feature interaction
Wide and Deep - the predecessor model DeepFM improves on; the Wide part requires feature engineering
FNN - FM pre-training + DNN; unlike DeepFM, cannot learn end-to-end
PNN - learns interactions via a product layer; lacks low-order terms
Hybrid-Expert Adaptor - uses DeepFM as the backbone recommendation model in KAR
Multi-Task Learning - the parallel training of FM/Deep can be interpreted from a multi-task perspective

Key Papers

Guo, H., et al. (2017). DeepFM: A factorization-machine based neural network for CTR prediction. IJCAI 2017. - the original DeepFM paper
Rendle, S. (2010). Factorization machines. ICDM 2010. - the original FM paper; theoretical basis of DeepFM
Cheng, H., et al. (2016). Wide & deep learning for recommender systems. DLRS 2016. - the prior work DeepFM improves on
Lian, J., et al. (2018). xDeepFM: Combining explicit and implicit feature interactions. KDD 2018. - a follow-up advance to DeepFM
