LaSER is a novel self-distillation framework that internalizes explicit Chain-of-Thought (CoT) reasoning into the latent space of dense retrievers. Built on a shared LLM backbone (e.g., Qwen3, Llama 3), it achieves SOTA performance on reasoning-intensive benchmarks like BRIGHT, matching the effectiveness of "rewrite-then-retrieve" pipelines while reducing latency by over 99%.
TL;DR
LaSER bridges the gap between the high accuracy of "Chain-of-Thought" (CoT) retrieval pipelines and the lightning speed of standard dense retrievers. By distilling explicit reasoning paths into "latent thinking tokens" within the LLM's hidden space, LaSER allows models to "think silently" before embedding a query. It matches the performance of GPT-4-level rewriters while being 300x faster.
The "Reasoning Gap" in Modern IR
The shift from BERT to LLM-based backbones (like Mistral or Qwen) has provided retrievers with massive internal knowledge. However, we typically use them as "dumb encoders": one forward pass, one vector.
When a user asks a complex, multi-hop, or ambiguous query, standard semantic matching fails because it ignores the intent-discovery phase. Existing workarounds follow a "Rewrite-then-Retrieve" strategy:
- Rewriter: LLM generates a 200-word explanation of the query.
- Retriever: Encodes the expanded text.
This works but is unbearably slow for production. Conversely, existing "implicit reasoning" models (like GIRCSE) try to learn "thinking tokens" from scratch, but without a guide, these tokens often become semantically "junk," failing to capture the actual logic required for the task.
Methodology: The Art of Latent Thinking
LaSER (Latent Self-distillation for Efficient Retrieval) solves this by using a Dual-View Framework during training.
1. Dual-View Architecture
The model shares a single LLM backbone but processes two paths:
- Explicit View (The Teacher): Receives the [Query + Ground-truth CoT]. It knows exactly why it's looking for a document.
- Latent View (The Student): Receives only the [Query]. It must generate continuous "latent tokens" (vectors) to simulate the reasoning path it doesn't see.
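To make the two views concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `toy_backbone` is a stand-in for the shared LLM, `DIM` is an arbitrary hidden size, and feeding each last hidden state back in as the next input vector is one plausible way to produce continuous latent tokens, not necessarily the exact mechanism the authors use.

```python
import math
import random

DIM = 8  # hypothetical hidden size for this toy example


def toy_backbone(inputs):
    """Stand-in for the shared LLM (assumption): mean-pool the input vectors,
    mix neighbouring dimensions, and squash with tanh."""
    pooled = [sum(v[d] for v in inputs) / len(inputs) for d in range(DIM)]
    return [math.tanh(pooled[d] + 0.5 * pooled[(d + 1) % DIM]) for d in range(DIM)]


def encode_latent_view(query_vectors, num_latent_tokens):
    """Latent view: only the query is visible. Each step appends one continuous
    latent token (a hidden state) to the sequence; no text is decoded."""
    sequence = list(query_vectors)
    latent_tokens = []
    for _ in range(num_latent_tokens):
        hidden = toy_backbone(sequence)
        latent_tokens.append(hidden)
        sequence.append(hidden)
    embedding = toy_backbone(sequence)  # final query embedding
    return latent_tokens, embedding


def encode_explicit_view(query_vectors, cot_vectors):
    """Explicit view: the same backbone reads the query followed by the CoT."""
    return toy_backbone(list(query_vectors) + list(cot_vectors))


random.seed(0)
query = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(4)]
cot = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(6)]

latents, student_embedding = encode_latent_view(query, num_latent_tokens=3)
teacher_embedding = encode_explicit_view(query, cot)
print(len(latents), len(student_embedding), len(teacher_embedding))
```

The point of the sketch is the shape of the computation: both views go through the same function, and the student only ever sees the query plus vectors it produced itself.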

2. Multi-Grained Alignment
The "secret sauce" is how the student learns from the teacher. LaSER doesn't just align the final vector (Output-Level Distillation). It uses Trajectory Alignment:
- It identifies "keyframes" in the long CoT text.
- It forces the k-th latent token to represent the same semantic rank-preference as the k-th segment of the explicit logic.
- The Result: The latent tokens are effectively forced to "summarize" the reasoning steps in high-dimensional space.
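One way to read "same semantic rank-preference" is as a per-step distillation loss over candidate documents. The sketch below assumes that reading: for each step k, the k-th CoT segment and the k-th latent token each score the same candidate documents, and the student is penalized by the KL divergence between the two softmax distributions. The function names, the dot-product scoring, and the choice of KL are illustrative assumptions, not the paper's confirmed loss.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def trajectory_alignment_loss(latent_tokens, segment_vectors, doc_vectors):
    """Average KL(teacher || student) over steps. Step k compares the document
    preference induced by the k-th CoT segment (teacher) with the one induced
    by the k-th latent token (student)."""
    if len(latent_tokens) != len(segment_vectors):
        raise ValueError("need one latent token per CoT segment")
    losses = []
    for latent, segment in zip(latent_tokens, segment_vectors):
        teacher = softmax([dot(segment, d) for d in doc_vectors])
        student = softmax([dot(latent, d) for d in doc_vectors])
        losses.append(kl_divergence(teacher, student))
    return sum(losses) / len(losses)


docs = [[1.0, 0.0], [0.0, 1.0]]
segments = [[2.0, 0.0]]          # teacher prefers the first document
aligned_latents = [[2.0, 0.0]]   # student agrees
misaligned_latents = [[0.0, 2.0]]  # student prefers the second document

print(trajectory_alignment_loss(aligned_latents, segments, docs))
print(trajectory_alignment_loss(misaligned_latents, segments, docs))
```

The loss is zero when a latent token ranks documents exactly as its matching segment does, and it grows as the two preferences diverge, which is what pushes each latent token to "summarize" its step.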
Experimental Battleground: BRIGHT & Beyond
The authors tested LaSER on BRIGHT, a benchmark specifically designed for reasoning-intensive retrieval (e.g., math, coding, and complex logic).
SOTA Results
LaSER consistently outperformed both standard retrievers and specialized implicit reasoning ones. Notably, it even beat the "Rewrite-then-Retrieve" pipeline in some cases, which suggests that the internalized model can be more robust than a decoupled two-model system.

The Latency-Utility Frontier
The efficiency gains are the most striking feature of LaSER. As seen in the figure below, traditional rewriter pipelines (red dot) sit in the high-latency zone. LaSER (stars) resides in the high-performance, low-latency quadrant, essentially giving us the "best of both worlds."
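The two efficiency figures quoted in the summary (a 300x speedup and a latency reduction of over 99%) describe the same gap. A quick check, assuming the speedup is measured as the ratio of pipeline latency to LaSER latency:

```latex
\text{latency reduction} \;=\; 1 - \frac{1}{\text{speedup}} \;=\; 1 - \frac{1}{300} \;\approx\; 99.67\%
```

Conversely, any reduction above 99% implies a speedup of more than 100x, so the 300x figure is the sharper of the two statements.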

Deep Insight: Scaling the "Thought"
An interesting finding from the ablation studies is that, at inference time, more latent thinking steps lead to better results. Although the model is trained with a fixed number of latent tokens, increasing that number at test time lets it perform "iterative refinement" of the embedding. This suggests that the latent space has learned a general logic-progression capability, not just a fixed mapping.
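In practice this makes the number of latent steps an inference-time knob rather than a fixed architectural constant. The sketch below only illustrates that parameterization: `toy_step` is a hypothetical stand-in for one latent step of the backbone, and the code says nothing about whether more steps actually help, which is an empirical result from the paper.

```python
import math


def toy_step(state, query):
    """Hypothetical stand-in for one latent step: refine the current state
    using the query (assumption, not the real model)."""
    n = len(state)
    return [math.tanh(0.5 * state[d] + 0.5 * query[(d + 1) % n]) for d in range(n)]


def embeddings_by_budget(query, max_steps):
    """Yield (num_steps, embedding) for every step budget from 0 to max_steps,
    so one pass can be evaluated at several budgets."""
    state = list(query)
    yield 0, state
    for k in range(1, max_steps + 1):
        state = toy_step(state, query)  # returns a new list each time
        yield k, state


query = [0.1, -0.2, 0.3, 0.4]
for steps, embedding in embeddings_by_budget(query, max_steps=4):
    # A real evaluation would score each embedding against a document index
    # and report the retrieval metric per budget.
    print(steps, [round(x, 3) for x in embedding])
```

Because every intermediate state is a usable embedding, the same model can be served at different latency budgets by choosing where to stop.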
Conclusion & Future Outlook
LaSER proves that we don't need to output characters to "reason." Internalizing reasoning into the latent space is not only a viable path for IR but likely the blueprint for the next generation of RAG (Retrieval-Augmented Generation) backbones.
Limitations: The current model still relies on a high-quality external LLM (like GPT-4o) to generate the "Explicit View" during training. Future work involving Reinforcement Learning (RL) could allow the model to discover its own optimal "latent logic" without externally generated CoT.
Keywords: Dense Retrieval, LLMs, Chain-of-Thought, Knowledge Distillation, Latent Reasoning, BRIGHT Benchmark.
