This paper systematically revisits text ranking methods in the "Deep Research" setting using the BrowseComp-Plus dataset. It evaluates 5 retrievers and 3 re-rankers across two open-source agents (gpt-oss-20b and GLM-4.7-Flash), finding that properly tuned BM25 combined with multi-stage re-ranking pipelines can reach accuracy close to that of a GPT-5-based setup.
TL;DR
As LLM agents evolve into "Deep Research" assistants traversing the open web, the underlying search mechanism becomes the bottleneck. This paper demonstrates that fancy neural retrievers often fail in deep research due to query mismatch, while "old-school" BM25 on passage-level units, when paired with a modest re-ranker, lets a 20B open-source agent nearly match a GPT-5-based setup.
Problem & Motivation: The Black-Box Search Trap
Current deep research agents (like OpenAI's Deep Research or various CoT-based agents) typically interact with opaque web search APIs. This creates two major technical blind spots:
- The Context Window Tax: Feeding entire web pages into an LLM wastes tokens. Truncating them leads to "Information Loss."
- The "Google-Style" Query Paradox: Agents don't ask questions like humans; they issue keyword-heavy, quoted queries (e.g., "90+7" attendance 61700). Most neural retrievers (trained on natural-language MS MARCO queries) are allergic to this syntax.
The authors set out to determine if standard IR best practices—passage-level retrieval, re-ranking, and query normalization—can fix these agent-specific search failures.
Methodology: The Core Architecture of Agentic Search
The study benchmarks two agents (gpt-oss-20b and GLM-4.7-Flash) against a fixed corpus (BrowseComp-Plus) to "white-box" the retrieval process.
1. Passage-Level Retrieval
Instead of retrieving 10,000-word documents and truncating them, the authors split the corpus into concise passages (250 words). This allows the agent to "see" more diverse evidence within the same context window and avoids the infamous BM25 length-normalization issues found in long documents.
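The passage-splitting step can be illustrated with a minimal sketch. It assumes fixed-size, non-overlapping windows of 250 whitespace-separated words; the paper's exact segmentation (sentence boundaries, overlap, title handling) is not described here, and the name split_into_passages is hypothetical.

```python
def split_into_passages(text: str, max_words: int = 250) -> list[str]:
    """Split text into consecutive passages of at most max_words words.

    Assumption: plain whitespace tokenization and no overlap between
    passages. A real pipeline may respect sentence boundaries instead.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


# A synthetic 600-word document becomes three passages.
document = " ".join(f"w{i}" for i in range(600))
passages = split_into_passages(document)
print([len(p.split()) for p in passages])  # [250, 250, 100]
```

Each passage is then indexed as its own retrieval unit, so a single long page can contribute several independent candidates.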
2. The Q2Q (Query-to-Question) Bridge
To fix the neural retriever performance drop, they introduced Q2Q. This module translates ambiguous keyword strings into descriptive questions using the agent's internal reasoning trace as context.
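As a rough illustration of the idea, the sketch below only builds the prompt that a rewriting model might receive. The template wording, the function name build_q2q_prompt, and the choice to keep only the tail of the reasoning trace are all assumptions made for this example, not the authors' actual prompt.

```python
# Hypothetical template: the paper's real Q2Q prompt is not reproduced here.
Q2Q_TEMPLATE = (
    "You are rewriting a search query for a neural retriever.\n"
    "Agent reasoning so far:\n{reasoning}\n\n"
    "Keyword query:\n{query}\n\n"
    "Rewrite the keyword query as one self-contained natural-language question."
)


def build_q2q_prompt(query: str, reasoning: str, max_reasoning_chars: int = 2000) -> str:
    """Combine the agent's keyword query with its reasoning trace.

    Assumption: the most recent reasoning is the most useful context,
    so only the last max_reasoning_chars characters are kept.
    """
    trimmed = reasoning[-max_reasoning_chars:]
    return Q2Q_TEMPLATE.format(reasoning=trimmed.strip(), query=query.strip())


prompt = build_q2q_prompt(
    query='"90+7" attendance 61700',
    reasoning="The match ended with a stoppage-time goal; I need the stadium and crowd size.",
)
print(prompt)
```

The resulting string would be sent to whichever LLM performs the rewrite, and the returned question would replace the keyword query as input to the neural retriever.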
Equations 1 and 2: The ReAct-based interaction between the Agent and the Ranking Function.
Experiments & Results: David vs. Goliath
The results provide a reality check for the "neural-only" trend:
- BM25 is King (Again): On passage-level data, BM25 outperformed specialized neural models like RepLLaMA. This is because agent queries are structurally more similar to "web search" than "reading comprehension."
- The Power of Re-ranking: Adding a monoT5-3B re-ranker (a 3-billion-parameter model) to a 20B agent yielded a final accuracy of 0.689, nearly matching a GPT-5-based setup (0.701).
- Query Mismatch is Real: Without Q2Q, neural retrievers like SPLADE-v3 were significantly handicapped. Using Q+R (Query + Reasoning) reformulation boosted neural accuracy by ~8%.
Figure 1: Heatmap showing that standard BM25 parameters fail on long documents but succeed on passages, highlighting the importance of length normalization.
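To make the role of length normalization concrete, here is a small self-contained sketch of the standard Okapi BM25 scoring formula with the Lucene-style IDF. The defaults k1=1.2 and b=0.75 are the commonly cited textbook values, not the tuned settings from the paper, and the function name bm25_scores is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against query_terms.

    b controls length normalization: b=0 ignores document length,
    b=1 normalizes fully by length relative to the corpus average.
    Assumes docs is a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    ["attendance", "61700"],             # short passage
    ["attendance"] + ["filler"] * 50,    # long document, same term frequency
    ["other"],
]
with_norm = bm25_scores(["attendance"], docs, b=0.75)
without_norm = bm25_scores(["attendance"], docs, b=0.0)
print(with_norm[0] > with_norm[1])        # True: the short passage is favored
print(without_norm[0] == without_norm[1])  # True: length is ignored when b=0
```

Because the penalty depends on document length relative to the corpus average, parameters that work for uniform 250-word passages can behave very differently on a corpus with a few extremely long pages.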
Critical Analysis & Conclusion
The Takeaway
If you are building a deep-research agent, don't just throw an API at it.
- Passage-level indexing is essential for context-window efficiency.
- Multi-stage ranking (Retrieve -> Re-rank) is non-negotiable.
- Query Reformulation is the key to unlocking neural retrievers.
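The retrieve-then-re-rank pattern in the second point can be sketched as a function that takes two pluggable scorers. The scorers below are toy stand-ins chosen so the example runs without any model: term_overlap plays the part of a lexical first stage such as BM25, and phrase_match plays the part of a learned re-ranker such as monoT5. All names and cutoff values here are assumptions for illustration.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    passages: list[str],
    first_stage: Scorer,
    reranker: Scorer,
    k_retrieve: int = 100,
    k_final: int = 5,
) -> list[str]:
    """Cheap scorer over all passages, expensive scorer over the top k only."""
    candidates = sorted(
        passages, key=lambda p: first_stage(query, p), reverse=True
    )[:k_retrieve]
    return sorted(
        candidates, key=lambda p: reranker(query, p), reverse=True
    )[:k_final]


def term_overlap(query: str, passage: str) -> float:
    """Toy first stage: number of query terms present in the passage."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


def phrase_match(query: str, passage: str) -> float:
    """Toy re-ranker: 1.0 if the whole query appears verbatim, else 0.0."""
    return 1.0 if query.lower() in passage.lower() else 0.0


passages = [
    "attendance figures were not reported",
    "61700 people; attendance confirmed",
    "the match attendance 61700 was a record",
    "unrelated text",
]
top = retrieve_then_rerank(
    "attendance 61700", passages, term_overlap, phrase_match,
    k_retrieve=2, k_final=2,
)
print(top[0])  # the match attendance 61700 was a record
```

The first stage keeps the two passages that contain both query terms, and the second stage promotes the one with the exact phrase. In a real system only the second scorer would involve a neural model, which is what keeps the cost of re-ranking bounded by k_retrieve.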
Limitations
The study identifies that "Reasoning-based re-rankers" (like Rank1) actually struggle with agent queries because they try to over-analyze keyword strings that don't have complex linguistic structure. This suggests that "more reasoning" isn't always the answer—sometimes robust lexical matching is simply more efficient.
Future Outlook
As we move toward agents with million-token windows, the "truncation" problem might fade, but the needle-in-a-haystack problem will intensify. Future work must focus on "Semantic Density"—ensuring that every token the agent reads maximizes the probability of finding the final answer.
