Tae Hyun Kim (Lowell)

Factorization Machine

Definition

The Factorization Machine (FM) is a general-purpose prediction model proposed by Rendle (2010) that models interactions between all pairs of features as the inner product of latent factor vectors.

y^(x)=w0+i=1nwixi+i=1nj=i+1nvi,vjxixj\hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j

  • w0Rw_0 \in \mathbb{R}: global bias
  • wRn\mathbf{w} \in \mathbb{R}^n: per-feature weights
  • viRk\mathbf{v}_i \in \mathbb{R}^k: latent vector of feature ii (kk: embedding dimension)
  • vi,vj=f=1kvi,fvj,f\langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} \cdot v_{j,f}: inner product of two latent vectors

Computational Trick: O(kn)O(kn) Linear Time

The naive pairwise computation is O(kn2)O(kn^2), but using the following identity it can be reduced to O(kn)O(kn):

i=1nj=i+1nvi,vjxixj=12f=1k[(i=1nvi,fxi)2i=1nvi,f2xi2]\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f} x_i \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right]

This is derived from the identity (ai)2=ai2+2i<jaiaj(\sum a_i)^2 = \sum a_i^2 + 2\sum_{i<j} a_i a_j.

Intuitive Understanding

In recommender systems, user-item interaction data is extremely sparse. For example, direct observations of “Did User A click Item B?” are very few.

The core idea of FM: assigning a kk-dimensional latent vector to each feature allows the model to generalize even to feature-pair interactions that are not directly observed. Instead of estimating the interaction strength between features ii and jj with an independent parameter wijw_{ij}, modeling it as vi,vj\langle \mathbf{v}_i, \mathbf{v}_j \rangle gives:

  1. Data efficiency: the latent vector of a feature is learned from all interactions in which that feature participates
  2. Generalization: the learned latent vectors enable prediction of interactions for unobserved pairs
  3. Parameter efficiency: only O(kn)O(kn) parameters are needed instead of O(n2)O(n^2) independent parameters

Relationship to Matrix Factorization

Matrix Factorization (MF) is a special case of FM.

MF’s prediction for user uu and item ii:

y^=μ+bu+bi+pu,qi\hat{y} = \mu + b_u + b_i + \langle \mathbf{p}_u, \mathbf{q}_i \rangle

To express this as an FM, the input vector x\mathbf{x} is constructed as the one-hot encoding of the user and the item:

x=(0,,1,,0user,0,,1,,0item)\mathbf{x} = (\underbrace{0, \ldots, 1, \ldots, 0}_{\text{user}}, \underbrace{0, \ldots, 1, \ldots, 0}_{\text{item}})

In this case, the only interaction activated in the FM’s second-order term is vu,vi\langle \mathbf{v}_u, \mathbf{v}_i \rangle, so it becomes equivalent to MF.

CriterionMatrix FactorizationFactorization Machine
Inputuser-item rating matrixarbitrary feature vector
Interactionsuser-item onlyall feature pairs
Side inforequires separate handlingintegrated naturally
Generalityspecialized for recommendation (CF)classification/regression/ranking all possible

Comparison with Other Approaches

ModelFeature InteractionSparse DataComputational ComplexityNotes
Linear ModelnonepossibleO(n)O(n)ignores interactions
Polynomial Regressionindependent wijw_{ij}not possibleO(n2)O(n^2)cannot learn wijw_{ij} when sparse
SVM (Polynomial kernel)wijw_{ij}not possibleO(n2)O(n^2)dual form, cannot generalize under sparsity
FMvi,vj\langle \mathbf{v}_i, \mathbf{v}_j \ranglepossibleO(kn)O(kn)generalizes via latent vectors

The key difference: Polynomial Regression and SVM estimate wijw_{ij} as independent parameters, so they cannot learn without observed data, whereas FM learns indirectly through latent vector factorization.

FM Variants and Lineage

Factorization Machine

Mermaid source (click to expand)
> flowchart TD
>     FM["<b>FM</b><br/>Rendle 2010<br/>2nd-order feature interaction"] --> FFM["<b>FFM</b><br/>Juan et al. 2016<br/>field-aware latent vectors"]
>     FM --> DeepFM["<b>DeepFM</b><br/>Guo et al. 2017<br/>FM + DNN parallel combination"]
>     FM --> AFM["<b>AFM</b><br/>Xiao et al. 2017<br/>attention-based interaction"]
>     DeepFM --> xDeepFM["<b>xDeepFM</b><br/>Lian et al. 2018<br/>explicit + implicit<br/>high-order interaction"]
>     DeepFM --> AutoInt["<b>AutoInt</b><br/>Song et al. 2019<br/>self-attention interaction"]
>
ModelKey ImprovementInteraction Order
FMSecond-order interactions via latent vector inner products2nd-order
FFMUses different latent vectors per field2nd-order (field-aware)
DeepFMFM (explicit 2nd-order) + DNN (implicit high-order) in parallel2nd-order + high-order
AFMWeights important interactions via attention2nd-order (weighted)
xDeepFMLearns explicit high-order interactions via CINexplicit high-order

Advantages and Disadvantages

Advantages:

  • O(kn)O(kn) linear-time training and inference: suitable for large-scale sparse data
  • Can generalize to unobserved feature pairs in sparse data
  • General-purpose model: applicable to CF, classification, regression, and ranking alike
  • Subsumes existing models such as MF, SVD++, and PITF as special cases
  • Naturally integrates side information (context features)

Disadvantages:

  • Models only second-order interactions (higher-order requires follow-up models such as DeepFM)
  • Limited expressiveness compared to DNN-based models
  • Sensitive to the choice of embedding dimension kk
  • Insufficient for tasks where very high-order interactions are important

Implementation

The core operation of the FM second-order term (PyTorch):

import torch
import torch.nn as nn

class FMInteraction(nn.Module):
    """FM 2차 상호작용 레이어"""
    def __init__(self, num_features: int, k: int):
        super().__init__()
        self.V = nn.Parameter(torch.randn(num_features, k) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch_size, num_features)
        # O(kn) 계산 트릭 적용
        vx = x @ self.V                          # (batch, k)
        square_of_sum = vx ** 2                   # (Σ v_i x_i)²
        sum_of_square = (x ** 2) @ (self.V ** 2)  # Σ v_i² x_i²
        interaction = 0.5 * (square_of_sum - sum_of_square).sum(dim=1)
        return interaction  # (batch_size,)
  • DeepFM - FM + DNN in parallel; the representative successor of FM
  • NMF - matrix-factorization-based factor model; FM generalizes MF
  • Factorization Prompting - a prompting technique inspired by FM’s factor decomposition idea
  • Multi-Task Learning - multi-task learning of explicit/implicit interactions in DeepFM and similar models

Key Papers

  • rendleFactorizationMachines2010 - the original FM paper; combines SVM with factorization models
  • guoDeepFMFactorizationMachineBased2017 - DeepFM; CTR prediction by combining FM + DNN in parallel
  • Juan, Y., et al. (2016). Field-aware factorization machines for CTR prediction. RecSys 2016.
  • Lian, J., et al. (2018). xDeepFM: Combining explicit and implicit feature interactions for recommender systems. KDD 2018.

Local graph