Factorization Machine
Definition
The Factorization Machine (FM) is a general-purpose prediction model proposed by Rendle (2010) that models interactions between all pairs of features as the inner product of latent factor vectors.
- : global bias
- : per-feature weights
- : latent vector of feature (: embedding dimension)
- : inner product of two latent vectors
Computational Trick: Linear Time
The naive pairwise computation is , but using the following identity it can be reduced to :
This is derived from the identity .
Intuitive Understanding
In recommender systems, user-item interaction data is extremely sparse. For example, direct observations of “Did User A click Item B?” are very few.
The core idea of FM: assigning a -dimensional latent vector to each feature allows the model to generalize even to feature-pair interactions that are not directly observed. Instead of estimating the interaction strength between features and with an independent parameter , modeling it as gives:
- Data efficiency: the latent vector of a feature is learned from all interactions in which that feature participates
- Generalization: the learned latent vectors enable prediction of interactions for unobserved pairs
- Parameter efficiency: only parameters are needed instead of independent parameters
Relationship to Matrix Factorization
Matrix Factorization (MF) is a special case of FM.
MF’s prediction for user and item :
To express this as an FM, the input vector is constructed as the one-hot encoding of the user and the item:
In this case, the only interaction activated in the FM’s second-order term is , so it becomes equivalent to MF.
| Criterion | Matrix Factorization | Factorization Machine |
|---|---|---|
| Input | user-item rating matrix | arbitrary feature vector |
| Interactions | user-item only | all feature pairs |
| Side info | requires separate handling | integrated naturally |
| Generality | specialized for recommendation (CF) | classification/regression/ranking all possible |
Comparison with Other Approaches
| Model | Feature Interaction | Sparse Data | Computational Complexity | Notes |
|---|---|---|---|---|
| Linear Model | none | possible | ignores interactions | |
| Polynomial Regression | independent | not possible | cannot learn when sparse | |
| SVM (Polynomial kernel) | not possible | dual form, cannot generalize under sparsity | ||
| FM | possible | generalizes via latent vectors |
The key difference: Polynomial Regression and SVM estimate as independent parameters, so they cannot learn without observed data, whereas FM learns indirectly through latent vector factorization.
FM Variants and Lineage
Mermaid source (click to expand)
> flowchart TD > FM["<b>FM</b><br/>Rendle 2010<br/>2nd-order feature interaction"] --> FFM["<b>FFM</b><br/>Juan et al. 2016<br/>field-aware latent vectors"] > FM --> DeepFM["<b>DeepFM</b><br/>Guo et al. 2017<br/>FM + DNN parallel combination"] > FM --> AFM["<b>AFM</b><br/>Xiao et al. 2017<br/>attention-based interaction"] > DeepFM --> xDeepFM["<b>xDeepFM</b><br/>Lian et al. 2018<br/>explicit + implicit<br/>high-order interaction"] > DeepFM --> AutoInt["<b>AutoInt</b><br/>Song et al. 2019<br/>self-attention interaction"] >
| Model | Key Improvement | Interaction Order |
|---|---|---|
| FM | Second-order interactions via latent vector inner products | 2nd-order |
| FFM | Uses different latent vectors per field | 2nd-order (field-aware) |
| DeepFM | FM (explicit 2nd-order) + DNN (implicit high-order) in parallel | 2nd-order + high-order |
| AFM | Weights important interactions via attention | 2nd-order (weighted) |
| xDeepFM | Learns explicit high-order interactions via CIN | explicit high-order |
Advantages and Disadvantages
Advantages:
- linear-time training and inference: suitable for large-scale sparse data
- Can generalize to unobserved feature pairs in sparse data
- General-purpose model: applicable to CF, classification, regression, and ranking alike
- Subsumes existing models such as MF, SVD++, and PITF as special cases
- Naturally integrates side information (context features)
Disadvantages:
- Models only second-order interactions (higher-order requires follow-up models such as DeepFM)
- Limited expressiveness compared to DNN-based models
- Sensitive to the choice of embedding dimension
- Insufficient for tasks where very high-order interactions are important
Implementation
The core operation of the FM second-order term (PyTorch):
import torch
import torch.nn as nn
class FMInteraction(nn.Module):
"""FM 2차 상호작용 레이어"""
def __init__(self, num_features: int, k: int):
super().__init__()
self.V = nn.Parameter(torch.randn(num_features, k) * 0.01)
def forward(self, x: torch.Tensor) -> torch.Tensor:
# x: (batch_size, num_features)
# O(kn) 계산 트릭 적용
vx = x @ self.V # (batch, k)
square_of_sum = vx ** 2 # (Σ v_i x_i)²
sum_of_square = (x ** 2) @ (self.V ** 2) # Σ v_i² x_i²
interaction = 0.5 * (square_of_sum - sum_of_square).sum(dim=1)
return interaction # (batch_size,)
Related Concepts
- DeepFM - FM + DNN in parallel; the representative successor of FM
- NMF - matrix-factorization-based factor model; FM generalizes MF
- Factorization Prompting - a prompting technique inspired by FM’s factor decomposition idea
- Multi-Task Learning - multi-task learning of explicit/implicit interactions in DeepFM and similar models
Key Papers
- rendleFactorizationMachines2010 - the original FM paper; combines SVM with factorization models
- guoDeepFMFactorizationMachineBased2017 - DeepFM; CTR prediction by combining FM + DNN in parallel
- Juan, Y., et al. (2016). Field-aware factorization machines for CTR prediction. RecSys 2016.
- Lian, J., et al. (2018). xDeepFM: Combining explicit and implicit feature interactions for recommender systems. KDD 2018.