LLM Multi-Layer Attribute Extraction for Cross-Domain Recommendation

Recommenders are mostly trained inside a single domain, in a closed world. User behavior logs and item-ID embeddings alone make it hard to explain why a user likes an item. This case study covers a pipeline that extracts the open-world knowledge held in LLMs and VLMs as structured attributes and fuses them into standard recommendation backbones. The core hypothesis is simple: decompose attributes into three layers instead of one, and catalog expressiveness and personalization quality should improve together.

Every number here comes from public benchmarks, and the cost figure is a projected API price. No internal experimental metrics are included.

Problem

Prior frameworks that bring LLMs into recommendation (the KAR line) typically split knowledge into just two buckets — reasoning knowledge and factual knowledge. But in a real domain, “factual knowledge” is a basket of things with very different textures. Objective measurable attributes like color or BPM, subjective perceptual attributes like mood or wearing context, and expert structural attributes like key, chord progression, or silhouette theory differ in extraction method, reliability, and caching policy alike.

Research questions (abbreviated). (1) Does a multi-layer taxonomy beat a single layer on expressiveness and recommendation accuracy? (2) Does the same framework transfer across heterogeneous domains like fashion and music? (3) Does multi-layer user profiling capture latent preference more finely? (4) Can LLM/VLM inference cost be kept at a practical level?

Data (public benchmarks)

Both domains use public Kaggle data as pilots, deliberately chosen so their schemas correspond 1:1 — letting us check whether one framework plugs into both.

Domain	Dataset (public)	Scale (approx.)	Modalities
Fashion	H&M Personalized Fashion (Kaggle)	~105K items, ~1.37M customers, ~31M transactions	image + text + transaction log
Music	KKBOX Music Recommendation (WSDM Cup 2018)	~360K songs, ~30K users, ~7.4M+ interactions	audio metadata + listening log

The two schemas are intentionally aligned (e.g., H&M postal_code ↔ KKBOX city, H&M age ↔ KKBOX bd). This symmetry lets us directly ablate whether a design that works on music also works on fashion.
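
The column alignment above can be sketched as a small renaming step. This is a minimal illustration, not the pipeline's actual code: the canonical names `region` and `age`, the helper `to_canonical`, and the sample rows are hypothetical. Only the two column pairs (postal_code ↔ city, age ↔ bd) come from the text.

```python
# Hypothetical canonical field names; only the two column pairs below
# (postal_code <-> city, age <-> bd) are taken from the text.
CANONICAL_COLUMNS = {
    "hm": {"postal_code": "region", "age": "age"},
    "kkbox": {"city": "region", "bd": "age"},
}


def to_canonical(row: dict, domain: str) -> dict:
    """Rename a raw user row to the shared schema, dropping unmapped columns."""
    mapping = CANONICAL_COLUMNS[domain]
    return {mapping[col]: value for col, value in row.items() if col in mapping}


# Illustrative rows (values are made up).
hm_user = to_canonical(
    {"postal_code": "abc123", "age": 31, "club_member_status": "ACTIVE"}, "hm"
)
kkbox_user = to_canonical({"city": 13, "bd": 24, "gender": "female"}, "kkbox")

# Both domains now expose the same field names to downstream stages.
same_fields = hm_user.keys() == kkbox_user.keys()
```

Once both domains share field names, the later stages can in principle be written once and pointed at either dataset.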

Method / pipeline

The pipeline has four stages and runs end to end.

Stage 1 — Multi-layer attribute extraction (3-layer taxonomy)

Every item is decomposed into three layers.

Layer	Character	Extraction	Example (music)
L1 Product	objective, deterministic, measurable	metadata + lightweight VLM/library	genre, BPM, instrumentation
L2 Perceptual	subjective, affective, context-dependent	LLM world knowledge + reviews/UGC	mood, energy, listening context
L3 Theory-grounded	expert, structural, implicit	domain tools + LLM theoretical interpretation	key/mode, chord progression, rhythmic complexity

Multimodal inputs are handled with model tiering. L1 goes to a lightweight VLM (or direct metadata), L2 to a mid-tier model, L3 to a high-end model — and only attributes where the lightweight model has low confidence get escalated to a higher tier. Fashion uses multi-resolution reasoning (low-res whole image → high-res ROI crop); music renders spectrogram/piano-roll images as L3 input.
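
The confidence-based escalation can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Extraction` record, the `extract_with_escalation` helper, the 0.8 threshold, and the three stub extractors that stand in for real model calls. The text does not specify how confidence is obtained or which threshold is used.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Extraction:
    value: str
    confidence: float  # in [0, 1], as reported by the (hypothetical) extractor
    tier: str


# An extractor takes an item description and returns (value, confidence).
Extractor = Callable[[str], Tuple[str, float]]


def extract_with_escalation(
    item: str,
    tiers: Sequence[Tuple[str, Extractor]],
    threshold: float = 0.8,  # assumed cut-off, not from the text
) -> Extraction:
    """Try tiers from cheapest to most expensive; stop at the first confident answer."""
    if not tiers:
        raise ValueError("at least one tier is required")
    result: Optional[Extraction] = None
    for name, extractor in tiers:
        value, confidence = extractor(item)
        result = Extraction(value, confidence, name)
        if confidence >= threshold:
            break
    # If no tier is confident, the most expensive tier's answer is returned.
    return result


# Stub extractors standing in for lightweight / mid-tier / high-end models.
def light_model(item: str) -> Tuple[str, float]:
    return ("upbeat", 0.55)


def mid_model(item: str) -> Tuple[str, float]:
    return ("melancholic", 0.90)


def high_model(item: str) -> Tuple[str, float]:
    return ("melancholic", 0.97)


tiers = [("light", light_model), ("mid", mid_model), ("high", high_model)]
mood = extract_with_escalation("example song", tiers)
```

In this toy run the lightweight stub is below the threshold, so the attribute is escalated once and the high-end tier is never called. That is the cost behavior the tiering is meant to produce.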

Stage 2 — User attribute inference (user profiling)

User profiles are built as a mirror of item attributes. We aggregate the L1/L2/L3 of items a user interacted with into per-layer preference vectors — L1 is “what they consume,” L2 is “why they consume it,” L3 is the “hidden taste they themselves don’t notice.” Rather than re-processing the entire history each time, incremental update (a summary of the prior profile plus only new behavior) saves tokens. See user profiling for the mechanics.
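
A count-based sketch of the per-layer aggregation and the incremental update is shown below. It is a simplification under stated assumptions: the text describes an LLM-side update (prior profile summary plus new behavior), whereas this example keeps running attribute counts and folds in only the new items. The names `empty_profile`, `update_profile`, and `preference_vector`, and the sample attributes, are hypothetical.

```python
from collections import Counter

LAYERS = ("L1", "L2", "L3")


def empty_profile() -> dict:
    """One attribute counter per layer."""
    return {layer: Counter() for layer in LAYERS}


def update_profile(profile: dict, new_items: list) -> dict:
    """Fold only the newly seen items into the existing per-layer counts."""
    for item in new_items:
        for layer in LAYERS:
            profile[layer].update(item.get(layer, []))
    return profile


def preference_vector(profile: dict, layer: str) -> dict:
    """Normalize one layer's counts into shares that sum to 1."""
    counts = profile[layer]
    total = sum(counts.values())
    return {attr: n / total for attr, n in counts.items()} if total else {}


# Initial history (illustrative attributes).
profile = empty_profile()
update_profile(profile, [
    {"L1": ["pop"], "L2": ["calm"], "L3": ["minor key"]},
    {"L1": ["pop"], "L2": ["energetic"], "L3": ["minor key"]},
])

# Later: only the new interaction is processed, not the full history.
update_profile(profile, [{"L1": ["jazz"], "L2": ["calm"], "L3": ["major key"]}])

l1_pref = preference_vector(profile, "L1")
l3_pref = preference_vector(profile, "L3")
```

The stored counts play the role of the "summary of the prior profile": each update touches only new behavior, so the cost of an update does not grow with history length.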

Stage 3 — Knowledge adaptation (hybrid-expert adaptor)

This stage converts text attributes into vectors a recommender can consume. The structure is a mixture-of-experts (MoE): per-layer text encoders (Sentence-BERT / BGE / E5) feed per-layer expert networks, and a gating network combines the expert outputs dynamically with weights $g_1, g_2, g_3$:

$$
\mathbf{v}_{\text{aug}} = \sum_{\ell \in \{1,2,3\}} g_\ell(\mathbf{x}) \cdot E_\ell(\mathbf{t}_\ell)
$$

where $E_\ell$ is the expert for layer $\ell$, $\mathbf{t}_\ell$ is that layer’s attribute text, and $\mathbf{x}$ is the input to the gating network. This MoE design borrows the shared/specialized expert structure directly from multi-task learning. The gating weights $(g_1, g_2, g_3)$ are themselves an interpretable signal of “which layer matters for this user,” so they get reused for segmentation and targeting.
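
The weighted sum can be written out in a few lines of dependency-free Python. This is a sketch under several assumptions that the text does not state: each expert is a single untrained linear map, the gate is a softmax over linear scores of $\mathbf{x}$, and each $\mathbf{t}_\ell$ has already been encoded into a fixed-size vector. The class name `HybridExpertAdaptor` and all dimensions are hypothetical.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(scores):
    """Numerically stable softmax; outputs are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class HybridExpertAdaptor:
    """v_aug = sum over layers of g_l(x) * E_l(t_l), with three layers."""

    def __init__(self, text_dim, gate_dim, out_dim, seed=0):
        rng = random.Random(seed)
        # One linear "expert" per layer (assumed; real experts would be trained networks).
        self.experts = [random_matrix(out_dim, text_dim, rng) for _ in range(3)]
        # Gate maps the gating input x to one score per layer (softmax is an assumption).
        self.gate = random_matrix(3, gate_dim, rng)

    def forward(self, x, layer_texts):
        gates = softmax(matvec(self.gate, x))                # g_1, g_2, g_3
        outs = [matvec(W, t) for W, t in zip(self.experts, layer_texts)]  # E_l(t_l)
        out_dim = len(outs[0])
        v_aug = [sum(g * o[i] for g, o in zip(gates, outs)) for i in range(out_dim)]
        return v_aug, gates


adaptor = HybridExpertAdaptor(text_dim=4, gate_dim=2, out_dim=3)
encoded_layers = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]  # stand-ins for encoded L1/L2/L3 text
v_aug, gates = adaptor.forward([0.2, -0.1], encoded_layers)
```

Returning `gates` alongside `v_aug` is what makes the weights reusable as a per-user signal without a second forward pass.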

A practical starting point. For fast prototyping and training stability, first build a baseline that concatenates the three layers into a single encoder/adaptor. Then attach per-layer experts + gating as an optional ablation and compare interpretability and performance against that baseline.

Stage 4 — Recommendation fusion

The augmented vector $\mathbf{v}_{\text{aug}}$ is fused into a standard backbone via concatenation or cross-attention. Validated backbones span factorization, sequential, and graph families: MF, DeepFM, SASRec, LightGCN. All attribute extraction and vector conversion is pre-stored (computed offline), so at inference time only vectors are looked up — no LLM calls. For how DeepFM fuses these, see DeepFM and Factorization Machine.
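
The offline-lookup-plus-concatenation path might look like the sketch below. It assumes a matrix-factorization-style dot-product scorer and in-memory dictionaries as the vector store. The names `AUG_STORE`, `ID_EMBEDDINGS`, `fused_item_vector`, and `score`, the vector values, and the zero-vector fallback for items without extracted attributes are all hypothetical choices, not details from the pipeline.

```python
# Hypothetical offline store: item ID -> precomputed augmented vector (v_aug).
AUG_STORE = {
    "item_1": [0.1, 0.3],
    "item_2": [0.0, -0.2],
}

# Hypothetical learned ID embeddings from the backbone.
ID_EMBEDDINGS = {
    "item_1": [0.5, 0.5, 0.1],
    "item_2": [0.2, 0.9, 0.4],
    "item_3": [0.3, 0.3, 0.3],
}

AUG_DIM = 2


def fused_item_vector(item_id):
    """Concatenate the ID embedding with the precomputed augmented vector."""
    id_vec = ID_EMBEDDINGS[item_id]
    # Assumed fallback: a zero vector for items whose attributes were not extracted yet.
    aug_vec = AUG_STORE.get(item_id, [0.0] * AUG_DIM)
    return id_vec + aug_vec


def score(user_vec, item_id):
    """Dot-product score against the fused item vector; no LLM call is involved."""
    return sum(u * v for u, v in zip(user_vec, fused_item_vector(item_id)))


user_vec = [1.0, 0.0, 0.0, 0.0, 1.0]  # must match the fused dimension (3 + 2)
s = score(user_vec, "item_2")
```

Serving reduces to dictionary lookups and arithmetic, which is why inference latency is independent of the LLM used for extraction.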

Key findings (illustrative / design-level)

Internal metrics are not disclosed, so the following are design-level expectations and qualitative observations.

Layer decomposition naturally forks the extraction policy. L1 is deterministic, so caching/reuse is cheap; L2/L3 lean heavily on LLM inference. Bundled together, the work would be dragged to the cost/reliability profile of the most expensive layer (L3); split apart, each falls to its optimal tier.
Gating weights are a useful by-product. $(g_1, g_2, g_3)$ serve both as internal weights learned for recommendation accuracy and as a profiling signal that separates user segments: “L1-dominant (brand-loyal) / L2-dominant (affect-sensitive) / L3-dominant (responsive to structural patterns).”
The framework crosses domains. Fashion’s color theory and silhouette versus music’s key and harmony are entirely different in content yet fill the same L3 slot. Because the data columns are aligned, a design from one domain is expected to port almost verbatim to the other, which is the main design-level argument for cross-domain transfer.
Cost stays in a manageable range. Combining prompt caching (~90% off the shared prefix), a semantic attribute cache (avoiding duplicate calls on near-identical items), and incremental updates, full-catalog attribute extraction across both pilots projects to roughly $200–300 in LLM inference (a projection from public API pricing, not a measurement).
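
The way these savings combine can be made explicit with a small projection function. Every input below (item count, hit rate, token counts, per-million-token prices) is a placeholder chosen for easy arithmetic, not a real price or a figure from the pilots, and the example does not reproduce the $200–300 projection. The function `projected_cost` is hypothetical and ignores any surcharge a provider may apply when the cached prefix is first written.

```python
def projected_cost(
    n_items,
    semantic_hit_rate,      # share of items served from the semantic attribute cache
    prefix_tokens,          # shared prompt prefix per call
    unique_tokens,          # item-specific input tokens per call
    output_tokens,          # generated tokens per call
    input_price_per_mtok,   # placeholder price per million input tokens
    output_price_per_mtok,  # placeholder price per million output tokens
    prefix_discount=0.9,    # ~90% off the shared prefix, as stated in the text
):
    """Projected LLM spend in currency units under the simplifying assumptions above."""
    calls = n_items * (1.0 - semantic_hit_rate)
    billed_input = prefix_tokens * (1.0 - prefix_discount) + unique_tokens
    per_call = (
        billed_input * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return calls * per_call


# Placeholder scenario: same workload with and without the two caches.
common = dict(
    n_items=1000,
    prefix_tokens=1000,
    unique_tokens=500,
    output_tokens=200,
    input_price_per_mtok=1.0,
    output_price_per_mtok=2.0,
)
with_caches = projected_cost(semantic_hit_rate=0.2, prefix_discount=0.9, **common)
without_caches = projected_cost(semantic_hit_rate=0.0, prefix_discount=0.0, **common)
savings_ratio = 1.0 - with_caches / without_caches
```

The two caches multiply: the semantic cache reduces the number of calls, and the prefix discount reduces the input cost of each remaining call. Output tokens are not discounted, so long generations limit how far the total can fall.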

Lessons

“Factual knowledge” is not one thing. Treating objective, perceptual, and theoretical attributes as one lump holds the extraction policy hostage to the most expensive layer. Making the layer split explicit is enough, on its own, to simplify caching, model tiering, and reliability management.
Get interpretability for free. Don’t discard the gating weights — reuse them as segmentation/targeting signals, and the same model does double duty: recommendation and analysis.
Start from a simple baseline. Running a single concat adaptor first and adding the MoE as an ablation preserves both prototyping speed and the measurability of “which layer actually contributes.”
Public data is enough to validate. Picking two public benchmarks with corresponding schemas lets you test the cross-domain transfer hypothesis directly, without data-acquisition cost or a reproducibility burden.

DeepFM — the backbone that fuses the augmented vector (FM + deep)
Factorization Machine — the foundation for modeling feature interactions
Multi-Task Learning — the shared/specialized structure of the hybrid-expert adaptor
User Profiling — Stage 2 multi-layer user preference inference
ESMM — entire-space multi-task estimation (an adjacent method from the selection-bias angle)
