LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval

[Pre-print 2025] LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval

Abstract

LaSER is a novel self-distillation framework that internalizes explicit Chain-of-Thought (CoT) reasoning into the latent space of dense retrievers. Built on a shared LLM backbone (e.g., Qwen3, Llama 3), it achieves SOTA performance on reasoning-intensive benchmarks like BRIGHT, matching the effectiveness of "rewrite-then-retrieve" pipelines while reducing latency by over 99%.

TL;DR

LaSER bridges the gap between the high accuracy of "Chain-of-Thought" (CoT) retrieval pipelines and the lightning speed of standard dense retrievers. By distilling explicit reasoning paths into "latent thinking tokens" within the LLM's hidden space, LaSER allows models to "think silently" before embedding a query. It matches the performance of GPT-4-level rewriters while being 300x faster.
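As a quick consistency check, assuming the "300x faster" figure and the abstract's "over 99%" latency reduction describe the same comparison (the source does not say so explicitly), a speedup factor $s$ translates into a relative latency reduction of:

$$
1 - \frac{1}{s} = 1 - \frac{1}{300} \approx 0.9967
$$

That is roughly a 99.7% reduction, which agrees with "over 99%."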

The "Reasoning Gap" in Modern IR

The shift from BERT to LLM-based backbones (like Mistral or Qwen) has provided retrievers with massive internal knowledge. However, we typically use them as "dumb encoders": one forward pass, one vector.

When a user asks a complex, multi-hop, or ambiguous query, standard semantic matching fails because it ignores the intent-discovery phase. Existing workarounds follow a "Rewrite-then-Retrieve" strategy:

Rewriter: LLM generates a 200-word explanation of the query.
Retriever: Encodes the expanded text.

This works, but generating a long rewrite for every query is too slow for production. Conversely, existing "implicit reasoning" models (like GIRCSE) try to learn "thinking tokens" from scratch; without a guide, these tokens often end up semantically meaningless and fail to capture the logic the task actually requires.
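To make the cost difference concrete, here is a minimal Python sketch of the two query paths. The function names and the toy `toy_rewrite` / `toy_encode` stand-ins are hypothetical and are not taken from the paper; a real system would plug in an LLM rewriter and a dense encoder.

```python
from typing import Callable, List

Vector = List[float]


def rewrite_then_retrieve(query: str,
                          rewrite: Callable[[str], str],
                          encode: Callable[[str], Vector]) -> Vector:
    """Two-stage path: generate an explanation, then embed query + explanation."""
    explanation = rewrite(query)  # autoregressive text generation: the slow part
    return encode(query + "\n" + explanation)


def single_pass(query: str, encode: Callable[[str], Vector]) -> Vector:
    """Standard dense retrieval path: one encoder call, no generated text."""
    return encode(query)


# Hypothetical toy stand-ins so the sketch runs without any model.
def toy_rewrite(query: str) -> str:
    return "This question is about " + query.lower()


def toy_encode(text: str) -> Vector:
    return [float(len(text)), float(text.count(" "))]


expanded_vec = rewrite_then_retrieve("Why X", toy_rewrite, toy_encode)
plain_vec = single_pass("Why X", toy_encode)
```

The two-stage path pays for a full text generation on every query before the encoder even runs, which is the latency that the approach described in this post tries to avoid.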

Methodology: The Art of Latent Thinking

LaSER (Latent Self-distillation for Efficient Retrieval) solves this by using a Dual-View Framework during training.

1. Dual-View Architecture

The model shares a single LLM backbone but processes two paths:

Explicit View (The Teacher): Receives the [Query + Ground-truth CoT]. It knows exactly why it's looking for a document.
Latent View (The Student): Receives only the [Query]. It must generate $K$ continuous "latent tokens" (vectors) to simulate the reasoning path it doesn't see.
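The two views can be sketched with a toy NumPy "backbone". Everything here is an illustrative assumption rather than the authors' implementation: `backbone` is a hypothetical stand-in for the shared LLM, and feeding each latent vector back in as the next input position is one plausible way to produce continuous latent tokens, not a detail confirmed by the source.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # hidden size of the toy backbone (assumption)
W = rng.normal(scale=0.3, size=(D, D))  # one weight matrix shared by both views


def backbone(token_embs: np.ndarray) -> np.ndarray:
    """Toy stand-in for the shared LLM: maps a (n, D) sequence to one D-dim state."""
    return np.tanh(token_embs.mean(axis=0) @ W)


def explicit_view(query_embs: np.ndarray, cot_embs: np.ndarray) -> np.ndarray:
    """Teacher: encodes the query followed by the explicit reasoning text."""
    return backbone(np.vstack([query_embs, cot_embs]))


def latent_view(query_embs: np.ndarray, k: int):
    """Student: sees only the query and appends k continuous latent tokens."""
    seq = query_embs
    latents = []
    for _ in range(k):
        h = backbone(seq)          # state summarizing the sequence so far
        latents.append(h)
        seq = np.vstack([seq, h])  # assumption: the vector is fed back as input
    return backbone(seq), np.stack(latents)


query_embs = rng.normal(size=(5, D))   # 5 toy query token embeddings
cot_embs = rng.normal(size=(20, D))    # 20 toy reasoning token embeddings
teacher_vec = explicit_view(query_embs, cot_embs)
student_vec, latent_tokens = latent_view(query_embs, k=3)
```

Because both functions call the same `backbone` with the same weights, the teacher and student are the same model seen through two different inputs, which is what makes this self-distillation rather than distillation from a separate teacher.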

LaSER Framework Architecture

2. Multi-Grained Alignment

The "secret sauce" is how the student learns from the teacher. LaSER doesn't just align the final vector (Output-Level Distillation). It uses Trajectory Alignment:

It identifies "keyframes" in the long CoT text.
It forces the $i$-th latent token to represent the same semantic rank-preference as the $j$-th segment of the explicit logic.
The Result: The latent tokens are effectively forced to "summarize" the reasoning steps in high-dimensional space.
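One way to read the alignment steps above is as a KL divergence between document-ranking distributions. The sketch below is a guess at the general shape, not the paper's loss: the evenly spaced keyframe rule, the temperature `tau`, and all function names are assumptions introduced for illustration.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    z = scores - scores.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def rank_distribution(state: np.ndarray, doc_embs: np.ndarray, tau: float = 0.1) -> np.ndarray:
    """Turn similarities to candidate documents into a 'rank preference' distribution."""
    return softmax(doc_embs @ state / tau)


def kl(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def keyframe_indices(num_segments: int, k: int) -> list:
    """Assumption: pick k evenly spaced segments, ending at the last one."""
    if num_segments < k:
        raise ValueError("need at least k explicit segments")
    return [((i + 1) * num_segments) // k - 1 for i in range(k)]


def trajectory_alignment_loss(latent_states, explicit_states, doc_embs, tau=0.1) -> float:
    """Average KL(teacher || student) between the i-th latent token and its keyframe."""
    k = len(latent_states)
    idx = keyframe_indices(len(explicit_states), k)
    losses = [
        kl(rank_distribution(explicit_states[j], doc_embs, tau),
           rank_distribution(latent_states[i], doc_embs, tau))
        for i, j in enumerate(idx)
    ]
    return sum(losses) / k
```

With 12 explicit segments and $K = 3$, this rule pairs the three latent tokens with segments 3, 7, and 11 (zero-based). The loss is zero when each latent token ranks the candidate documents exactly as its paired segment does.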

Experimental Battleground: BRIGHT & Beyond

The authors tested LaSER on BRIGHT, a benchmark specifically designed for reasoning-intensive retrieval (e.g., math, coding, and complex logic).

SOTA Results

LaSER consistently outperformed both standard retrievers and specialized implicit reasoning ones. Notably, it even beat the "Rewrite-then-Retrieve" pipeline in some cases, which suggests that the internalized model can be more robust than a decoupled two-model system.

Performance Comparison on BRIGHT

The Latency-Utility Frontier

The efficiency gains are the most striking feature of LaSER. As seen in the figure below, traditional rewriter pipelines (red dot) sit in the high-latency zone. LaSER (stars) resides in the high-performance, low-latency quadrant, essentially giving us the "best of both worlds."

Latency vs Performance

Deep Insight: Scaling the "Thought"

An interesting finding from the ablation studies is that during inference, more thinking steps ($K$) lead to better results. While the model was trained with $K = 3$, increasing the number of latent tokens at test-time allows the model to perform "iterative refinement" of the embedding. This suggests that the latent space learned a general logic-progression capability, not just a fixed mapping.
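Mechanically, raising $K$ at test time just means running more iterations of the same loop, with no retraining. The sketch below assumes a deterministic, autoregressive latent step and assumes that the final latent state is used as the query embedding; `toy_step` is a hypothetical stand-in for one backbone step, so the example says nothing about retrieval quality.

```python
import numpy as np


def latent_rollout(step, query_state: np.ndarray, k: int) -> np.ndarray:
    """Run k latent 'thinking' steps and return the whole trajectory, shape (k, D)."""
    states = []
    state = query_state
    for _ in range(k):
        state = step(state)
        states.append(state)
    return np.stack(states)


# Hypothetical stand-in for one step of the backbone.
rng = np.random.default_rng(0)
A = rng.normal(scale=0.4, size=(4, 4))


def toy_step(state: np.ndarray) -> np.ndarray:
    return np.tanh(A @ state + 0.1)


query_state = rng.normal(size=4)
trained_k = latent_rollout(toy_step, query_state, k=3)  # K used in training
longer_k = latent_rollout(toy_step, query_state, k=6)   # larger K at test time

# The longer rollout extends the shorter one rather than replacing it.
assert np.allclose(longer_k[:3], trained_k)
query_embedding = longer_k[-1]  # assumption: last latent state is the embedding
```

Under these assumptions, the first three states of the $K = 6$ run are identical to the $K = 3$ run, so the extra steps continue the trajectory from where training-length reasoning stopped.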

Conclusion & Future Outlook

LaSER shows that a model does not need to output text in order to "reason." Internalizing reasoning into the latent space is not only a viable path for IR but likely the blueprint for the next generation of RAG (Retrieval-Augmented Generation) backbones.

Limitations: The current model still relies on a high-quality external LLM (like GPT-4o) to generate the reasoning text for the "Explicit View" during training. Future work involving Reinforcement Learning (RL) could allow the model to discover its own optimal "latent logic" without externally generated CoT.

Keywords: Dense Retrieval, LLMs, Chain-of-Thought, Knowledge Distillation, Latent Reasoning, BRIGHT Benchmark.
