Factorization Machine · Tae Hyun Kim (Lowell)

Definition

The Factorization Machine (FM) is a general-purpose prediction model proposed by Rendle (2010) that models interactions between all pairs of features as the inner product of latent factor vectors.

$\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$

$w_0 \in \mathbb{R}$ : global bias
$\mathbf{w} \in \mathbb{R}^n$ : per-feature weights
$\mathbf{v}_i \in \mathbb{R}^k$ : latent vector of feature $i$ ( $k$ : embedding dimension)
$\langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f}$ : inner product of two latent vectors

Computational Trick: $O(kn)$ Linear Time

The naive pairwise computation is $O(kn^2)$ , but using the following identity it can be reduced to $O(kn)$ :

$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right]$

This is derived from the identity $(\sum a_i)^2 = \sum a_i^2 + 2\sum_{i<j} a_i a_j$ .

Intuitive Understanding

In recommender systems, user-item interaction data is extremely sparse. For example, direct observations of “Did User A click Item B?” are very few.

The core idea of FM: assigning a $k$ -dimensional latent vector to each feature allows the model to generalize even to feature-pair interactions that are not directly observed. Instead of estimating the interaction strength between features $i$ and $j$ with an independent parameter $w_{ij}$ , modeling it as $\langle \mathbf{v}_i, \mathbf{v}_j \rangle$ gives:

Data efficiency: the latent vector of a feature is learned from all interactions in which that feature participates
Generalization: the learned latent vectors enable prediction of interactions for unobserved pairs
Parameter efficiency: only $O(kn)$ parameters are needed instead of $O(n^2)$ independent parameters

Relationship to Matrix Factorization

Matrix Factorization (MF) is a special case of FM.

MF’s prediction for user $u$ and item $i$ :

$\hat{y} = \mu + b_u + b_i + \langle \mathbf{p}_u, \mathbf{q}_i \rangle$

To express this as an FM, the input vector $\mathbf{x}$ is constructed as the one-hot encoding of the user and the item:

$\mathbf{x} = (\underbrace{0, \ldots, 1, \ldots, 0}_{\text{user}}, \underbrace{0, \ldots, 1, \ldots, 0}_{\text{item}})$

In this case, the only interaction activated in the FM’s second-order term is $\langle \mathbf{v}_u, \mathbf{v}_i \rangle$ , so it becomes equivalent to MF.

Criterion	Matrix Factorization	Factorization Machine
Input	user-item rating matrix	arbitrary feature vector
Interactions	user-item only	all feature pairs
Side info	requires separate handling	integrated naturally
Generality	specialized for recommendation (CF)	classification/regression/ranking all possible

Comparison with Other Approaches

Model	Feature Interaction	Sparse Data	Computational Complexity	Notes
Linear Model	none	possible	$O(n)$	ignores interactions
Polynomial Regression	independent $w_{ij}$	not possible	$O(n^2)$	cannot learn $w_{ij}$ when sparse
SVM (Polynomial kernel)	$w_{ij}$	not possible	$O(n^2)$	dual form, cannot generalize under sparsity
FM	$\langle \mathbf{v}_i, \mathbf{v}_j \rangle$	possible	$O(kn)$	generalizes via latent vectors

The key difference: Polynomial Regression and SVM estimate $w_{ij}$ as independent parameters, so they cannot learn without observed data, whereas FM learns indirectly through latent vector factorization.

FM Variants and Lineage

Factorization Machine

Mermaid source (click to expand)

> flowchart TD
>     FM["<b>FM</b><br/>Rendle 2010<br/>2nd-order feature interaction"] --> FFM["<b>FFM</b><br/>Juan et al. 2016<br/>field-aware latent vectors"]
>     FM --> DeepFM["<b>DeepFM</b><br/>Guo et al. 2017<br/>FM + DNN parallel combination"]
>     FM --> AFM["<b>AFM</b><br/>Xiao et al. 2017<br/>attention-based interaction"]
>     DeepFM --> xDeepFM["<b>xDeepFM</b><br/>Lian et al. 2018<br/>explicit + implicit<br/>high-order interaction"]
>     DeepFM --> AutoInt["<b>AutoInt</b><br/>Song et al. 2019<br/>self-attention interaction"]
>

Model	Key Improvement	Interaction Order
FM	Second-order interactions via latent vector inner products	2nd-order
FFM	Uses different latent vectors per field	2nd-order (field-aware)
DeepFM	FM (explicit 2nd-order) + DNN (implicit high-order) in parallel	2nd-order + high-order
AFM	Weights important interactions via attention	2nd-order (weighted)
xDeepFM	Learns explicit high-order interactions via CIN	explicit high-order

Advantages and Disadvantages

Advantages:

$O(kn)$ linear-time training and inference: suitable for large-scale sparse data
Can generalize to unobserved feature pairs in sparse data
General-purpose model: applicable to CF, classification, regression, and ranking alike
Subsumes existing models such as MF, SVD++, and PITF as special cases
Naturally integrates side information (context features)

Disadvantages:

Models only second-order interactions (higher-order requires follow-up models such as DeepFM)
Limited expressiveness compared to DNN-based models
Sensitive to the choice of embedding dimension $k$
Insufficient for tasks where very high-order interactions are important

Implementation

The core operation of the FM second-order term (PyTorch):

import torch
import torch.nn as nn

class FMInteraction(nn.Module):
    """FM 2차 상호작용 레이어"""
    def __init__(self, num_features: int, k: int):
        super().__init__()
        self.V = nn.Parameter(torch.randn(num_features, k) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, num_features)
        # O(kn) 계산 트릭 적용
        vx = x @ self.V                          # (batch, k)
        square_of_sum = vx ** 2                   # (Σ v_i x_i)²
        sum_of_square = (x ** 2) @ (self.V ** 2)  # Σ v_i² x_i²
        interaction = 0.5 * (square_of_sum - sum_of_square).sum(dim=1)
        return interaction  # (batch_size,)

DeepFM - FM + DNN in parallel; the representative successor of FM
NMF - matrix-factorization-based factor model; FM generalizes MF
Factorization Prompting - a prompting technique inspired by FM’s factor decomposition idea
Multi-Task Learning - multi-task learning of explicit/implicit interactions in DeepFM and similar models

Key Papers

rendleFactorizationMachines2010 - the original FM paper; combines SVM with factorization models
guoDeepFMFactorizationMachineBased2017 - DeepFM; CTR prediction by combining FM + DNN in parallel
Juan, Y., et al. (2016). Field-aware factorization machines for CTR prediction. RecSys 2016.
Lian, J., et al. (2018). xDeepFM: Combining explicit and implicit feature interactions for recommender systems. KDD 2018.

Local graph