[SIGIR 2026] KARMA: Healing the Knowledge-Action Gap in Taobao’s Personalized Search
Abstract

The paper introduces KARMA, a multimodal alignment framework for Taobao's personalized search that bridges the "Knowledge-Action Gap" in LLMs. By using semantic reconstruction as a train-only regularizer, it enables LLMs to achieve SOTA performance in next-item prediction without online overhead.

Executive Summary

TL;DR: Alibaba researchers identify a critical failure mode in LLM-based search: Semantic Collapse. When LLMs are trained solely on user clicks, they "forget" their semantic knowledge and degenerate into brittle ID-matchers. KARMA fixes this by forcing the model to "explain itself": the search embeddings must be able to reconstruct the original item descriptions and images during training, without adding a single millisecond to online latency.

Background: This work sits at the intersection of Multimodal LLMs and Industrial Recommender Systems. It moves beyond the naive "fine-tune on clicks" approach toward a more theoretically grounded alignment of semantic knowledge and behavioral actions.

The Problem: The "Barcode" Trap and Semantic Collapse

In industrial search, we often compress a user's entire history into a single vector (the continuous-token interface) to meet strict latency requirements. However, the authors discovered that when you optimize this vector only for "Action" (clicks), the LLM takes a shortcut.

Instead of understanding that a user likes "minimalist Nordic furniture," the model identifies a specific high-frequency Item ID and focuses all its attention on a few token positions. This results in Attention Sinks—visualized as barcode-like strips in the attention maps—where the model ignores the actual content of the items.
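To make the collapse concrete, here is a minimal diagnostic sketch (my own illustration, not the paper's tooling): assuming a Hugging Face-style model that returns per-layer attention weights via `output_attentions=True`, we can measure the normalized entropy of attention over the history positions. Barcode-like collapse shows up as entropy near zero; healthy, content-aware attention stays close to one.

```python
import math
import torch

def history_attention_entropy(attn: torch.Tensor, history_slice: slice) -> float:
    """Diagnose attention collapse over the encoded user-history span.

    attn: attention weights from one layer, shape (heads, query_len, key_len),
          e.g. taken from model outputs when output_attentions=True.
    history_slice: the key positions that hold the compressed history tokens.

    Returns mean normalized entropy in [0, 1]. Values near 0 mean the mass is
    piled on a few positions (the "barcode" pattern); values near 1 mean the
    attention is spread over the actual item content.
    """
    hist = attn[..., history_slice]                               # (heads, Q, H)
    hist = hist / hist.sum(dim=-1, keepdim=True).clamp_min(1e-9)  # renormalize
    entropy = -(hist * hist.clamp_min(1e-9).log()).sum(dim=-1)    # per query
    max_entropy = math.log(hist.shape[-1])                        # uniform case
    return (entropy / max_entropy).mean().item()

# Example: score one layer, assuming history tokens occupy key positions 0..127.
# score = history_attention_entropy(attentions[layer][0], slice(0, 128))
```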

Methodology: The KARMA Framework

The core philosophy of KARMA is: Decodability as a Regularizer. The model must learn to predict the next click, but it must also prove that it still knows what the current items "mean."

1. Dual Decodability Paths

  • History-Conditioned Generation ($L_{gen}$): The LLM must be able to generate the text of the target item based on the user's history. This keeps the model's "brain" close to its pre-trained state.
  • Embedding-Conditioned Reconstruction ($L_{recon}$): The resulting search embedding ($h_t$) must contain enough information to reconstruct the target's text and visual features. If the embedding becomes a "meaningless ID," this reconstruction will fail.
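Putting the two paths together with the retrieval objective, the training loss implied by this description looks roughly like the following (a sketch: the retrieval term $L_{ret}$ and the weights $\lambda_{gen}$, $\lambda_{recon}$ are my notation, not necessarily the paper's):

$$L_{total} \;=\; L_{ret}(h_t,\, i^{+}) \;+\; \lambda_{gen}\, L_{gen}\big(\text{text}(i^{+}) \mid \text{history}\big) \;+\; \lambda_{recon}\, L_{recon}\big(\text{text}(i^{+}),\, \text{vis}(i^{+}) \mid h_t\big)$$

Only the retrieval path survives at serving time; the two decodability terms act purely as train-only regularizers, which is how the framework stays latency-neutral online.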

[Figure: KARMA architecture overview]

2. Multimodal Diffusion Regularization

For visual content, KARMA doesn't just use a simple projection. It employs a Diffusion/Flow-Matching head to reconstruct frozen visual features. This provides a high-fidelity grounding signal that text alone cannot capture.
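For intuition, here is a minimal sketch of what such a flow-matching reconstruction head can look like (my illustration under assumed shapes; the layer sizes, the rectified-flow parameterization, and the `FlowMatchingReconHead` name are assumptions, not the paper's exact design). The head conditions on the search embedding $h_t$ and is asked to denoise toward a frozen visual feature; if $h_t$ has collapsed into a bare ID, this loss stays high.

```python
import torch
import torch.nn as nn

class FlowMatchingReconHead(nn.Module):
    """Train-only head: regularizes the search embedding by requiring it to
    reconstruct a frozen visual feature via flow matching. Dropped at serving."""

    def __init__(self, emb_dim: int, vis_dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vis_dim + emb_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, vis_dim),
        )

    def loss(self, h: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        """h: search embeddings (B, emb_dim); v: frozen visual features (B, vis_dim)."""
        x0 = torch.randn_like(v)                        # pure-noise endpoint
        t = torch.rand(v.shape[0], 1, device=v.device)  # random time in [0, 1]
        x_t = (1.0 - t) * x0 + t * v                    # straight-line interpolant
        target_velocity = v - x0                        # rectified-flow target
        pred = self.net(torch.cat([x_t, h, t], dim=-1))
        return ((pred - target_velocity) ** 2).mean()

# During training: total_loss = retrieval_loss + lambda_recon * head.loss(h_t, v)
# The head (and all diffusion machinery) is simply discarded at inference time.
```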

Experiments: Breaking the Bottleneck

KARMA was evaluated on Taobao's large-scale search logs, and the results were clear-cut:

  • Retrieval Performance: The embedding-conditioned reconstruction was the "hero" feature, yielding a massive +19.19 improvement in HR@200.
  • Structural Health: Qualitative analysis of attention maps showed that KARMA successfully broke the "barcode" patterns, restoring a distributed, healthy attention mechanism.

[Figure: Attention sink comparison. Left: collapsed "barcode" attention under standard training; Right: rich, distributed attention under KARMA]

Theoretical Insight: Mean-Seeking vs. Mode-Seeking

An intriguing finding from the paper is that Diffusion is a great regularizer but a poor generator for retrieval.

  • Generative models (Diffusion) are mode-seeking: they want to pick one specific, high-quality version of the future.
  • Retrieval models must be mean-seeking: the embedding should represent the "centroid" of all possible things a user might click next. This is why KARMA uses Diffusion only to "guard" the semantics during training, while keeping the retrieval path discriminative.
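The textbook fact behind this argument (a standard least-squares result, not a claim from the paper) is that the MSE-optimal point prediction is the conditional mean:

$$\arg\min_{\hat{x}}\; \mathbb{E}\big[\|x - \hat{x}\|^2 \mid h\big] \;=\; \mathbb{E}[\,x \mid h\,]$$

An embedding trained with a discriminative objective therefore settles near the centroid of all plausible next items (mean-seeking), whereas a diffusion sampler commits to one plausible item per draw (mode-seeking), which is exactly why the generative head is confined to the training-time reconstruction path.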

Critical Analysis & Future Work

Takeaway: KARMA shows that industrial LLM applications should not discard the model's generative nature. Instead, that generative power can serve as "scaffolding" (a train-only regularizer) that is torn down at inference time to keep the system fast.

Limitations: The "Semantic Warm-up" stage is crucial but adds complexity to the training pipeline. Future work might explore if this alignment can be achieved in a single stage through better initialization or MoE (Mixture of Experts) architectures.

Conclusion: By bridging the Knowledge-Action Gap, KARMA allows Taobao to leverage 0.6B to 2B parameter LLMs effectively, gaining the generalization of a linguist and the precision of a salesman—all with zero additional serving cost.

Discover similar papers

Try these examples

  • Search for recent papers addressing "Semantic Collapse" or "Attention Sinks" in Large Language Models when fine-tuned on discriminative recommendation tasks.
  • Which paper first introduced the concept of "Attention Sinks" in Transformers, and how does KARMA's interpretation of this phenomenon in personalized search differ from its original context?
  • Find studies that compare Diffusion-based generative modeling versus traditional MSE/Cross-Entropy for learning representation embeddings in information retrieval.