DeepFM
Definition
DeepFM (Guo et al., 2017) is a CTR prediction model that combines an FM component and a Deep component in parallel, jointly learning low-order feature interactions (explicit) and high-order feature interactions (implicit).
FM Component (Wide)
The same 2nd-order feature interaction as the Factorization Machine:
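Written out in the usual FM notation, the component would look as follows. The symbols are assumptions introduced here, following Rendle's formulation: $w_0$ is the global bias, $w_i$ the first-order weight of feature $i$, $v_i \in \mathbb{R}^k$ its latent vector, and $x_i$ its input value.

$$
y_{FM}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
$$

The pairwise term can be evaluated in $O(kn)$ rather than $O(kn^2)$ with the standard identity:

$$
\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j
= \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} \, x_i \right)^{2} - \sum_{i=1}^{n} v_{i,f}^{2} \, x_i^{2} \right]
$$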
Deep Component
A feed-forward neural network that takes dense embeddings as input:
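A sketch of the forward pass in assumed notation: $e_i$ is the embedding of field $i$, $m$ the number of fields, $L$ the number of hidden layers, and $\phi$ the hidden activation (for example ReLU).

$$
a^{(0)} = [e_1, e_2, \dots, e_m], \qquad
a^{(l+1)} = \phi\left( W^{(l)} a^{(l)} + b^{(l)} \right), \qquad
y_{DNN} = W^{(L)} a^{(L)} + b^{(L)}
$$

The outputs of the two components are summed and passed through a sigmoid:

$$
\hat{y} = \sigma\left( y_{FM} + y_{DNN} \right)
$$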
Key: Shared Embedding
The latent vector of the FM component and the embedding of the Deep component share the same parameters. This yields:
- The two components reinforce each other’s learning signals
- End-to-end learning from raw features alone, without feature engineering
- A reduced number of parameters
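The parameter saving can be illustrated with a rough count. This is a minimal sketch, not a benchmark: the field sizes and embedding dimension are hypothetical values chosen for illustration, and the "separate" variant is an assumed alternative design in which the FM latent vectors and the DNN embeddings live in two tables.

```python
import torch.nn as nn

# Hypothetical vocabulary sizes for three categorical fields (illustrative only).
field_dims = [1000, 50, 20]
embed_dim = 16
total = sum(field_dims)


def n_params(module: nn.Module) -> int:
    """Count all trainable and non-trainable parameters of a module."""
    return sum(p.numel() for p in module.parameters())


# Shared: one table serves as both the FM latent vectors and the DNN input.
shared = nn.Embedding(total, embed_dim)

# Separate (assumed alternative): one table per component.
separate = nn.ModuleList(
    [nn.Embedding(total, embed_dim), nn.Embedding(total, embed_dim)]
)

print(n_params(shared), n_params(separate))  # prints: 17120 34240
```

Because embedding tables usually dominate the parameter count in CTR models, sharing them roughly halves the embedding-side parameters relative to this two-table design.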
Intuitive Understanding
In CTR prediction there are two kinds of feature interactions:
- Low-order: “male + shooting game → high click probability” — explicit, interpretable 2nd-order combinations
- High-order: “male in their 20s + Friday evening + mobile + new RPG → click” — a compound interaction of several features
Google’s Wide & Deep combined a Wide part (memorization) with a Deep part (generalization), but the Wide part depended on manually engineered cross-product features. DeepFM replaces the Wide part with an FM, which learns all 2nd-order interactions automatically, while the Deep part supplies the high-order interactions.
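The difference between hand-listed cross-products and FM-style pairwise weights can be sketched as below. Everything in this example is hypothetical: the feature names, the single hand-picked cross with weight 0.8, and the randomly initialized latent vectors `V` stand in for values a real model would learn.

```python
import torch

torch.manual_seed(0)

# Hypothetical one-hot vocabulary (illustrative only).
features = ["gender=male", "gender=female", "genre=shooting", "genre=puzzle"]
idx = {name: i for i, name in enumerate(features)}

# Wide & Deep style: every cross-product is listed by hand, one weight each.
manual_crosses = {("gender=male", "genre=shooting"): 0.8}


def wide_cross_score(active: list[str]) -> float:
    """Sum the weights of hand-picked crosses whose features are all active."""
    return sum(w for pair, w in manual_crosses.items() if set(pair) <= set(active))


# FM style: one latent vector per feature, so every pair (i, j) gets the
# weight <v_i, v_j> without being listed explicitly.
V = torch.randn(len(features), 4)


def fm_pair_score(active: list[str]) -> float:
    """Sum of inner products over all pairs of active features."""
    ids = [idx[a] for a in active]
    total = 0.0
    for p in range(len(ids)):
        for q in range(p + 1, len(ids)):
            total += float(V[ids[p]] @ V[ids[q]])
    return total


print(wide_cross_score(["gender=male", "genre=shooting"]))   # a listed cross
print(wide_cross_score(["gender=female", "genre=puzzle"]))   # an unlisted cross scores zero
print(fm_pair_score(["gender=female", "genre=puzzle"]))      # FM still assigns a pair weight
```

The unlisted pair contributes nothing in the hand-crafted scheme, whereas the FM assigns it a weight through the two latent vectors, including for pairs that rarely co-occur in training data.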
Mermaid source:
> flowchart LR
> Input["Sparse Input"] --> Emb["Shared Embedding Layer"]
> Emb --> FM_comp["FM Component<br/>(2nd-order interaction)"]
> Emb --> DNN["Deep Component<br/>(high-order interaction)"]
> FM_comp --> Add(("+"))
> DNN --> Add
> Add --> Sigmoid["σ(·)"] --> Output["CTR"]
Comparison with Wide & Deep
| Criterion | Wide & Deep | DeepFM |
|---|---|---|
| Wide part | Cross-product (manually designed) | FM (automatically learned) |
| Feature engineering | Required (expert knowledge) | Not required |
| Embedding sharing | Separate for Wide/Deep | Shared |
| Low-order interaction | Only manually defined combinations | All 2nd-order combinations |
Position in the FM Lineage
| Model | Structure | Interaction |
|---|---|---|
| Factorization Machine | FM only | explicit 2nd-order |
| DeepFM | FM + DNN (parallel) | explicit 2nd-order + implicit high-order |
| xDeepFM | CIN + DNN | explicit high-order + implicit high-order |
| AutoInt | Self-attention + DNN | attention-weighted high-order |
DeepFM’s limitation is that high-order interactions rely solely on the implicit learning of the DNN. xDeepFM addressed this by adding explicit high-order interactions via the CIN (Compressed Interaction Network).
Advantages and Disadvantages
Advantages:
- Feature engineering not required: end-to-end learning from raw features alone
- Shared embeddings let FM and DNN learn in a mutually complementary way
- Consistent performance gains over Wide & Deep, reported on the Criteo dataset and on the proprietary dataset the paper calls Company*
- Relatively simple to implement and widely used as a CTR baseline in industry
Disadvantages:
- High-order interactions rely on the implicit learning of the DNN (not explicit)
- Requires hyperparameter tuning of DNN depth and width
- The embedding dimension is identical for all features (the optimal dimension may differ per field)
- Cannot directly model sequential patterns (user behavior sequences)
Implementation
Core DeepFM structure (PyTorch):
import torch
import torch.nn as nn


class DeepFM(nn.Module):
    def __init__(self, field_dims: list[int], embed_dim: int, mlp_dims: list[int]):
        super().__init__()
        num_fields = len(field_dims)
        total_dims = sum(field_dims)

        # Shared Embedding
        self.embedding = nn.Embedding(total_dims, embed_dim)
        self.offsets = torch.tensor([0] + field_dims[:-1]).cumsum(0)

        # FM: 1st-order + 2nd-order
        self.linear = nn.Embedding(total_dims, 1)
        self.bias = nn.Parameter(torch.zeros(1))

        # Deep
        mlp_input = num_fields * embed_dim
        layers = []
        for dim in mlp_dims:
            layers += [nn.Linear(mlp_input, dim), nn.ReLU(), nn.Dropout(0.2)]
            mlp_input = dim
        layers.append(nn.Linear(mlp_input, 1))
        self.mlp = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_fields), per-field feature index
        x = x + self.offsets.to(x.device)
        embed = self.embedding(x)  # (batch, fields, embed_dim)

        # FM component
        linear_out = self.linear(x).sum(dim=1)  # 1st-order term
        sum_sq = embed.sum(dim=1) ** 2  # (Σe_i)²
        sq_sum = (embed ** 2).sum(dim=1)  # Σe_i²
        fm_out = 0.5 * (sum_sq - sq_sum).sum(dim=1, keepdim=True)  # 2nd-order term

        # Deep component
        deep_out = self.mlp(embed.view(embed.size(0), -1))  # flatten → MLP

        return torch.sigmoid(self.bias + linear_out + fm_out + deep_out).squeeze(1)
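The 2nd-order term in the forward pass relies on the identity ½[(Σe_i)² − Σe_i²] = Σ_{i<j} ⟨e_i, e_j⟩. The following sketch checks that numerically against a brute-force pairwise sum. The tensor shapes are arbitrary assumptions, and the random `embed` tensor stands in for the output of the shared embedding layer.

```python
import torch

torch.manual_seed(0)
batch, fields, embed_dim = 4, 5, 8
embed = torch.randn(batch, fields, embed_dim)  # stand-in for self.embedding(x)

# Brute force: sum <e_i, e_j> over all field pairs i < j, O(fields^2).
brute = torch.zeros(batch)
for i in range(fields):
    for j in range(i + 1, fields):
        brute += (embed[:, i] * embed[:, j]).sum(dim=1)

# Sum-square trick, as in the FM component above, O(fields).
sum_sq = embed.sum(dim=1) ** 2        # (batch, embed_dim)
sq_sum = (embed ** 2).sum(dim=1)      # (batch, embed_dim)
fast = 0.5 * (sum_sq - sq_sum).sum(dim=1)

# The two results should agree up to float32 rounding.
print(torch.allclose(brute, fast, atol=1e-4))
```

The linear-time form is why adding the FM component costs little on top of the embedding lookup that the Deep component already needs.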
Related Concepts
- Factorization Machine - the Wide part of DeepFM; foundation of 2nd-order feature interaction
- Wide and Deep - the predecessor model DeepFM improves on; the Wide part requires feature engineering
- FNN - FM pre-training + DNN; unlike DeepFM, cannot learn end-to-end
- PNN - learns interactions via a product layer; lacks low-order terms
- Hybrid-Expert Adaptor - uses DeepFM as the backbone recommendation model in KAR
- Multi-Task Learning - the parallel training of FM/Deep can be interpreted from a multi-task perspective
Key Papers
- Guo, H., et al. (2017). DeepFM: A factorization-machine based neural network for CTR prediction. IJCAI 2017. The original DeepFM paper.
- Rendle, S. (2010). Factorization machines. ICDM 2010. The original FM paper; theoretical basis of DeepFM.
- Cheng, H., et al. (2016). Wide & deep learning for recommender systems. DLRS 2016. The prior work DeepFM improves on.
- Lian, J., et al. (2018). xDeepFM: Combining explicit and implicit feature interactions. KDD 2018. A follow-up advance to DeepFM.