Revisiting Text Ranking in Deep Research

[ArXiv 2025] Revisiting Text Ranking in Deep Research: Why IR Fundamentals Still Rule the Agent Era

Abstract

This paper systematically revisits text ranking methods in the "Deep Research" setting using the BrowseComp-Plus dataset. It evaluates 5 retrievers and 3 re-rankers across two open-source agents (gpt-oss-20b and GLM-4.7-Flash), finding that properly tuned BM25 combined with a re-ranking pipeline can achieve state-of-the-art performance comparable to a GPT-5-based setup.

TL;DR

As LLM agents evolve into "Deep Research" assistants traversing the open web, the underlying search mechanism becomes the bottleneck. This paper demonstrates that neural retrievers often fail in deep research because agent queries do not match the natural-language questions they were trained on, while "old-school" BM25 on passage-level units, when paired with a modest re-ranker, lets a 20B open-source agent nearly match a GPT-5-based setup.

Problem & Motivation: The Black-Box Search Trap

Current deep research agents (like OpenAI's Deep Research or various CoT-based agents) typically interact with opaque web search APIs. This creates two major technical blind spots:

The Context Window Tax: Feeding entire web pages into an LLM wastes tokens. Truncating them leads to "Information Loss."
The "Google-Style" Query Paradox: Agents don't ask questions like humans; they issue keyword-heavy, quoted queries (e.g., "90+7" attendance 61700). Most neural retrievers are trained on natural-language questions such as those in MS MARCO, so they handle this syntax poorly.

The authors set out to determine if standard IR best practices—passage-level retrieval, re-ranking, and query normalization—can fix these agent-specific search failures.

Methodology: The Core Architecture of Agentic Search

The study benchmarks two agents (gpt-oss-20b and GLM-4.7-Flash) against a fixed corpus (BrowseComp-Plus) to "white-box" the retrieval process.

1. Passage-Level Retrieval

Instead of retrieving 10,000-word documents and truncating them, the authors split the corpus into concise passages (250 words). This allows the agent to "see" more diverse evidence within the same context window and avoids the infamous BM25 length-normalization issues found in long documents.
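As a rough illustration of passage-level indexing, the sketch below splits a document into consecutive 250-word passages. The whitespace tokenization, the non-overlapping windows, and the function name `split_into_passages` are assumptions made for this example; the paper's actual segmentation procedure may differ.

```python
def split_into_passages(text: str, max_words: int = 250) -> list[str]:
    """Split a document into consecutive passages of at most max_words words.

    Assumption: words are whitespace-separated and windows do not overlap.
    """
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Hypothetical 600-word document used only to show the resulting passage sizes.
doc = " ".join(f"w{i}" for i in range(600))
passages = split_into_passages(doc)
print([len(p.split()) for p in passages])
```

Each passage can then be indexed as its own retrieval unit, so the agent reads several short, targeted passages instead of one truncated page.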

2. The Q2Q (Query-to-Question) Bridge

To fix the neural retriever performance drop, they introduced Q2Q. This module translates ambiguous keyword strings into descriptive questions using the agent's internal reasoning trace as context.
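The idea can be sketched as a prompt builder that hands the keyword query and the agent's recent reasoning to a rewriting model. Everything here is hypothetical: the function name `build_q2q_prompt`, the prompt wording, the trace truncation, and the sample reasoning trace are illustrative assumptions, not the authors' implementation, and the call to the rewriting LLM itself is left out.

```python
def build_q2q_prompt(query: str, reasoning_trace: str, max_trace_chars: int = 2000) -> str:
    """Build a prompt asking an LLM to turn a keyword query into a question.

    Assumption: only the tail of the reasoning trace is kept, since the most
    recent reasoning is most likely to explain the current query.
    """
    trace = reasoning_trace[-max_trace_chars:]
    return (
        "Rewrite the search query below as one descriptive natural-language question.\n"
        "Use the reasoning trace only to work out what the query is looking for.\n\n"
        f"Reasoning trace:\n{trace}\n\n"
        f"Search query:\n{query}\n\n"
        "Question:"
    )


# The query is the example from the article; the trace is invented for illustration.
prompt = build_q2q_prompt(
    query='"90+7" attendance 61700',
    reasoning_trace="I need the match with a goal in the 97th minute and a crowd of 61,700.",
)
print(prompt)
```

The rewritten question, rather than the raw keyword string, would then be sent to the neural retriever.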

[Equations 1 & 2 (model architecture): the ReAct-based interaction between the agent and the ranking function.]

Experiments & Results: David vs. Goliath

The results provide a reality check for the "neural-only" trend:

BM25 is King (Again): On passage-level data, BM25 outperformed specialized neural models like RepLLaMA. This is because agent queries are structurally more similar to "web search" than "reading comprehension."
The Power of Re-ranking: Adding a monoT5-3B re-ranker (a 3-billion-parameter model) to a 20B agent yielded a final accuracy of 0.689, nearly matching a GPT-5-based setup (0.701).
Query Mismatch is Real: Without Q2Q, neural retrievers like SPLADE-v3 were significantly handicapped. Using Q+R (Query + Reasoning) reformulation boosted neural accuracy by ~8%.

Figure 1 (BM25 hyperparameter sensitivity): Heatmap showing that standard BM25 parameters fail on long documents but succeed on passages, highlighting the importance of length normalization.
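To make the role of length normalization concrete, here is a minimal from-scratch sketch of one common BM25 formulation. The defaults `k1=1.2` and `b=0.75` are widely used textbook values, not the tuned settings from the paper, and the two toy documents are invented for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the query with BM25.

    b controls length normalization: b=0 ignores document length,
    b=1 normalizes fully by length relative to the corpus average.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        norm = 1 - b + b * len(d) / avgdl
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores


# Toy corpus: a short passage and a long document, each containing the term once.
docs = [["attendance", "61700"], ["attendance"] + ["filler"] * 50]
query = ["attendance"]
no_norm = bm25_scores(query, docs, b=0.0)     # length is ignored
with_norm = bm25_scores(query, docs, b=0.75)  # the long document is penalized
print(no_norm, with_norm)
```

With `b=0.0` both documents receive the same score, while with `b=0.75` the short passage ranks above the long document, which is the kind of sensitivity the heatmap describes.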

Critical Analysis & Conclusion

The Takeaway

If you are building a deep-research agent, don't just throw an API at it.

Passage-level indexing is essential for context-window efficiency.
Multi-stage ranking (Retrieve -> Re-rank) is non-negotiable.
Query Reformulation is the key to unlocking neural retrievers.
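The Retrieve -> Re-rank pattern can be sketched as two sorting passes with pluggable scorers. The scorers below are toy stand-ins chosen so the example runs without dependencies; in a real system the first stage would be something like BM25 and the second a model-based re-ranker such as monoT5. All function names and cutoff values are assumptions for illustration.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(query: str, passages: list[str],
                         first_stage: Scorer, reranker: Scorer,
                         k_retrieve: int = 100, k_final: int = 5) -> list[str]:
    """Keep the top k_retrieve passages by a cheap scorer, then re-order them
    with a more expensive scorer and return the top k_final."""
    candidates = sorted(passages, key=lambda p: first_stage(query, p), reverse=True)
    candidates = candidates[:k_retrieve]
    return sorted(candidates, key=lambda p: reranker(query, p), reverse=True)[:k_final]


def term_overlap(query: str, passage: str) -> float:
    """Toy first stage: number of distinct query terms found in the passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def coverage_ratio(query: str, passage: str) -> float:
    """Toy re-ranker: overlap divided by passage length, favoring focused passages."""
    length = len(passage.split())
    return term_overlap(query, passage) / length if length else 0.0


passages = [
    "attendance was 61700 at the final",
    "the final attendance figure was never published by the club",
    "unrelated text about something else",
]
top = retrieve_then_rerank("attendance 61700", passages,
                           term_overlap, coverage_ratio, k_retrieve=2, k_final=1)
print(top)
```

The design point is that the expensive scorer only ever sees the `k_retrieve` candidates, which keeps the cost of the second stage bounded regardless of corpus size.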

Limitations

The study identifies that "Reasoning-based re-rankers" (like Rank1) actually struggle with agent queries because they try to over-analyze keyword strings that don't have complex linguistic structure. This suggests that "more reasoning" isn't always the answer—sometimes robust lexical matching is simply more efficient.

Future Outlook

As we move toward agents with million-token windows, the "truncation" problem might fade, but the needle-in-a-haystack problem will intensify. Future work must focus on "Semantic Density"—ensuring that every token the agent reads maximizes the probability of finding the final answer.
