Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

Uno-Orchestra: Mastering the Accuracy-Efficiency Frontier via Selective Delegation

总结

问题

方法

结果

要点

摘要

Uno-Orchestra is a unified orchestration policy that jointly optimizes task decomposition and subtask routing to heterogeneous (model, primitive) pairs. By collapsing planning and dispatching into a single causal-LM backbone, it achieves SOTA performance across 22 baselines on a 13-benchmark suite, reaching 77.0% macro pass@1.

TL;DR

Uno-Orchestra is a breakthrough in multi-agent systems that treats task planning and model routing as a single, unified decision process. Unlike previous systems that either simply route to one model or always force complex workflows, Uno-Orchestra learns to be "parsimonious"—decomposing only when necessary and dispatching tasks to the cheapest possible capable worker. It beats the best existing benchmarks by 16% in accuracy while being 10x cheaper.

Problem & Motivation: The Orchestration Paradox

As LLMs proliferate, we face a "capabilities vs. cost" ladder spanning three orders of magnitude. Traditional systems fail in two specific ways:

Rigid Routing: They treat every query as a single unit or "atom," missing the parallelizable sub-structure of complex tasks.
Decoupled Planning: They use one model to plan and another to route, creating massive context overhead and failing to optimize the "depth" of the plan against the actual execution budget.

The authors' central insight: Orchestration is a single causal-LM problem. A model should be able to say, in one forward pass, "This part of the problem needs GPT-4 via Python execution, but this other part can be handled by Gemini Flash via direct answer."

Methodology: Selective Delegation and Agentic-GRPO

Uno-Orchestra transitions through four modes of behavior: Lazy (direct answer), Oneshot (parallel subtasks), Continuation (sequential reasoning), and Decomp-Repair (fixing failures).

1. Unified Architecture

The system utilizes a single Qwen2.5-7B backbone that serves as both the planner and the dispatcher. It emits an XML-based DAG (Directed Acyclic Graph) where each node is an "admissible pair" consists of:

Worker Model: (e.g., Claude-3.5-Sonnet, GPT-4o, Gemini Flash).
Primitive/Skill: (e.g., web_search, execute_python, symbolic_math).

Uno-Orchestra Paradigms Figure 1: Contrast between standard routers, hierarchical systems, and the integrated Uno-Orchestra approach.

2. Agentic-GRPO

Standard RL (like vanilla GRPO) struggles with multi-turn tasks because the reward only comes at the very end. If a model makes a great planning decision in Turn 1 but fails to aggregate in Turn 3, the whole trajectory gets a "0." Agentic-GRPO introduces Turn-level Credit Assignment. It distributes parts of the final terminal reward back to individual turns and adds "process shaping" rewards (e.g., rewarding valid XML schema or successful tool calls) to guide the model through the long-horizon desert.

Experimental Results: Breaking the Pareto Frontier

The results are striking. Across 13 benchmarks (Math, Code, Agentic tools, etc.), Uno-Orchestra doesn't just improve accuracy; it shifts the entire efficiency curve.

Accuracy: 77.0% macro pass@1 (vs. 67.2% for the strongest baseline).
Thrift: It achieves this at roughly $0.10 p er q u er y, w h er e a ssy s t e m s w i t h co m p a r ab l e a cc u r a cy (l ik e A g e n tO r c h es t r a) ba l l oo n t oo v er$ 1.20 per query.

Accuracy vs Billed Cost Figure 2: Uno-Orchestra (bottom-right) dominates the Accuracy-Cost frontier, offering high performance at near-minimal cost.

The "Lazy" Advantage

A key reason for the efficiency is the "Lazy" mode (15.6% of trajectories). On simple queries, the model recognizes it doesn't need a multi-agent swarm and just answers directly. This "abstention from complexity" is what allows it to save massive amounts of context and cost compared to traditional "agentic workflows" that always trigger multi-step processes.

Critical Analysis & Conclusion

Takeaways

Joint Optimization Matters: Learning "Whether to decompose" and "Who to route to" together is strictly superior to treating them as separate steps.
Anonymized Training: The authors trained the router on "anonymous" worker labels to prevent the model from simply relying on "brand names" (like GPT-4), forcing it to actually learn the capability profiles of different workers.

Limitations

Verifier Dependency: The RL quality is still highly dependent on having a ground-truth verifier. In "open-ended" real-world tasks where no symbolic or exact-match answer exists, the reward signal becomes much noisier.
API Latency: While token costs are low, the wall-clock time is still subject to the slowest worker in the DAG.

Future Work

The "Selective Delegation" principle is likely the future of "Computer Use" agents. As models begin to interact with operating systems, the ability to parsimoniously decide between expensive "reasoning" steps and cheap "action" steps will be the difference between a viable product and an expensive research demo.

发现相似论文

试试这些示例

Search for recent papers on cost-aware LLM routing that support multi-turn task decomposition beyond flat single-call scenarios.
What is the origin of Group Relative Policy Optimization (GRPO), and how have other works adapted it for long-horizon agentic credit assignment?
Investigate if the Selective Delegation principle has been applied to multi-modal agents or robotics, where tool use and sub-goal planning must be balanced against high compute costs.

Uno-Orchestra: Mastering the Accuracy-Efficiency Frontier via Selective Delegation

1. TL;DR

2. Problem & Motivation: The Orchestration Paradox

3. Methodology: Selective Delegation and Agentic-GRPO

3.1. 1. Unified Architecture

3.2. 2. Agentic-GRPO

4. Experimental Results: Breaking the Pareto Frontier

4.1. The "Lazy" Advantage

5. Critical Analysis & Conclusion

5.1. Takeaways

5.2. Limitations

5.3. Future Work