OpenSeeker-v2: Pushing the Limits of Search Agents with High-Difficulty Trajectories
Abstract

OpenSeeker-v2 is a 30B-parameter search agent that achieves state-of-the-art (SOTA) performance using only Supervised Fine-Tuning (SFT). Developed by an academic team at Shanghai Jiao Tong University, it outperforms industrial models such as Tongyi DeepResearch and RedSearcher on four major benchmarks, including BrowseComp and Humanity’s Last Exam.

TL;DR

OpenSeeker-v2 shows that you don't need the massive compute of Big Tech to build a world-class search agent. By focusing on the granularity and difficulty of training data rather than on complex Reinforcement Learning (RL) or Continual Pre-training (CPT) pipelines, an academic team has set new SOTA records on four major benchmarks using a simple Supervised Fine-Tuning (SFT) approach.

Background: Breaking the "Closed-Door Game"

In the current LLM landscape, "Deep Research" capabilities are the new frontier. However, the recipe for models like OpenAI's Deep Research or Alibaba's Tongyi DeepResearch usually involves a prohibitive pipeline: massive Continual Pre-training (CPT) + SFT + Reinforcement Learning (RL).

The authors of OpenSeeker-v2 challenge this status quo. They argue that the bottleneck isn't the training algorithm, but the informativeness of the trajectories. If an agent is trained on "shallow" data, it will never learn the "deep" intuition needed for complex scholarly or technical searches.

Methodology: Engineering Difficulty

The core philosophy of OpenSeeker-v2 is to force the model to "work harder" during training. The authors achieve this through three primary modifications to their data synthesis pipeline:

1. Scaling Graph Size for Richer Exploration

Instead of generating questions from small, localized data points, the authors increased the expansion budget for source subgraphs. The question generator therefore has to connect facts across a much larger "information map," which naturally leads to multi-hop questions that cannot be answered with a single web search.
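To make the idea of an expansion budget concrete, here is a minimal sketch, not the authors' implementation. It assumes the knowledge graph is a plain adjacency dictionary and that expansion is breadth-first; the function name `expand_subgraph`, the `budget` parameter, and the toy node names are all hypothetical.

```python
from collections import deque


def expand_subgraph(graph, seed, budget):
    """Breadth-first expansion from `seed`, stopping once `budget` nodes are collected.

    `graph` is an adjacency dict {node: [neighbors]}; all names here are illustrative.
    """
    visited = [seed]
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < budget:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                visited.append(neighbor)
                queue.append(neighbor)
                if len(visited) >= budget:
                    break
    return visited


# Toy graph (hypothetical entities) to compare a small and a large budget.
toy_graph = {
    "paper_A": ["author_X", "venue_Y"],
    "author_X": ["paper_B", "institution_Z"],
    "venue_Y": ["paper_C"],
    "paper_B": ["dataset_D"],
}

small = expand_subgraph(toy_graph, "paper_A", budget=3)
large = expand_subgraph(toy_graph, "paper_A", budget=6)
print(small)
print(large)
```

With the small budget the subgraph stays within one hop of the seed, so any question built from it is close to a single lookup. The larger budget pulls in second-hop nodes, which is what allows a question to chain several facts together.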

2. Expanding the Tool Set

By increasing the variety of tools (e.g., specialized search, calculators, site-specific scrapers), the agent learns functional flexibility—choosing the right tool for the right sub-task rather than defaulting to generic searches.
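One common way to expose several tools to an agent is a name-to-function registry, sketched below under stated assumptions. The tool names, the `run_tool_call` helper, and the call format are hypothetical and are not taken from the paper; the search tool is a stub that does not contact any backend.

```python
import operator


def search_tool(query):
    # Stub: a real tool would query a search backend here.
    return f"[search results for: {query}]"


def calculator_tool(expression):
    # Handles only "a op b" with + - * / so the sketch stays small and safe.
    left, op, right = expression.split()
    ops = {
        "+": operator.add,
        "-": operator.sub,
        "*": operator.mul,
        "/": operator.truediv,
    }
    return str(ops[op](float(left), float(right)))


# Hypothetical registry: the agent emits {"tool": name, "input": text}.
TOOLS = {"search": search_tool, "calculator": calculator_tool}


def run_tool_call(call):
    name = call["tool"]
    if name not in TOOLS:
        return f"error: unknown tool '{name}'"
    return TOOLS[name](call["input"])


print(run_tool_call({"tool": "calculator", "input": "6 * 7"}))
print(run_tool_call({"tool": "search", "input": "BrowseComp benchmark"}))
```

A larger registry gives the model more choices at each step, so training trajectories can demonstrate picking a calculator for arithmetic instead of issuing another generic search.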

3. Strict Low-Step Filtering

This is perhaps the most vital "quality control" measure. The authors discarded any synthetic trajectory that was solved in too few steps. This ensures the training set is composed entirely of long-horizon problems requiring sustained reasoning.
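The filter itself can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: the trajectory layout (a list of steps with a `type` field) is invented for the example, and the cutoff `MIN_TOOL_CALLS = 3` is a placeholder, since the actual threshold is not given here.

```python
# Placeholder cutoff for illustration only; the real threshold is not stated here.
MIN_TOOL_CALLS = 3


def count_tool_calls(trajectory):
    """Count the steps in a trajectory that are tool calls."""
    return sum(1 for step in trajectory["steps"] if step["type"] == "tool_call")


def filter_low_step(trajectories, min_tool_calls):
    """Keep only trajectories with at least `min_tool_calls` tool calls."""
    return [t for t in trajectories if count_tool_calls(t) >= min_tool_calls]


# Hypothetical synthetic trajectories: one shallow, one long-horizon.
trajectories = [
    {"id": "short", "steps": [{"type": "tool_call"}, {"type": "answer"}]},
    {"id": "long", "steps": [{"type": "tool_call"}] * 5 + [{"type": "answer"}]},
]

kept = filter_low_step(trajectories, min_tool_calls=MIN_TOOL_CALLS)
print([t["id"] for t in kept])
```

Because the filter runs on already-synthesized trajectories, raising the cutoff trades dataset size for difficulty: fewer samples survive, but every remaining one demonstrates sustained multi-step search.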

Figure: Model Architecture and Paradigm Overview

Experiments and Results: SFT vs. The World

Despite being trained on a relatively small dataset of 10.6k samples, OpenSeeker-v2 (initialized from a Qwen3-30B base) achieved SOTA results.

Performance Highlights:

  • BrowseComp: 46.0%, ahead of Tongyi DeepResearch's 43.4%.
  • BrowseComp-ZH: 58.1%, an 11.4-percentage-point lead over the industrial baseline.
  • Trajectory Depth: an average of 64.67 tool calls per trajectory, far exceeding RedSearcher (36.01) and OpenSeeker-v1 (46.97).
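As a quick sanity check on the figures above, the short script below computes the BrowseComp margin and the relative trajectory depth. It uses only the numbers quoted in the highlights; the variable names are illustrative.

```python
# Scores and tool-call averages as quoted in the highlights above.
browsecomp_v2 = 46.0
browsecomp_tongyi = 43.4

tool_calls_v2 = 64.67
tool_calls_redsearcher = 36.01
tool_calls_v1 = 46.97

# Absolute margin on BrowseComp, in percentage points.
margin_pp = browsecomp_v2 - browsecomp_tongyi

# How many times deeper OpenSeeker-v2 trajectories are, on average.
ratio_vs_redsearcher = tool_calls_v2 / tool_calls_redsearcher
ratio_vs_v1 = tool_calls_v2 / tool_calls_v1

print(f"BrowseComp margin: {margin_pp:.1f} points")
print(f"Tool calls vs RedSearcher: {ratio_vs_redsearcher:.2f}x")
print(f"Tool calls vs OpenSeeker-v1: {ratio_vs_v1:.2f}x")
```

The margin on BrowseComp is 2.6 points, while the average trajectory is roughly 1.8 times as long as RedSearcher's and about 1.4 times as long as OpenSeeker-v1's.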

Table: Performance Comparison across Benchmarks

The results in the table above show that OpenSeeker-v2-30B-SFT not only beats models of similar size but even challenges much larger models like DeepSeek-V3.1-671B and Claude-4.5-Sonnet in specific research-oriented benchmarks.

Deep Insight: The Value of Persistent Reasoning

Why does it work? The data suggests that trajectory length is a useful proxy for how persistently an agent reasons. By filtering for long-horizon tasks, the model develops "grit": the ability to recover from failed searches or ambiguous results without giving up or hallucinating a final answer prematurely.

Figure 2: Comparison of average tool calls, showing OpenSeeker-v2's commitment to deeper exploration.

Conclusion & Future Outlook

OpenSeeker-v2 is a triumph for the open-source community. It proves that data synthesis logic is a massive lever that can compensate for smaller compute budgets.

Takeaways for Researchers:

  • Stop worrying about RL until your SFT data is truly difficult.
  • Use Knowledge Graphs to generate structural multi-hop queries.
  • Filter your data by "reasoning density" (i.e., step count).

The team has open-sourced both the weights and the data synthesis findings, paving the way for the next generation of academic search agents.
