LLM Router: Prefill is All You Need — Mechanistic Orchestration of LLMs
Abstract

This paper introduces LLM Router, a mechanistic orchestration framework that uses the internal prefill activations of small "encoder" models to predict the performance of larger "target" models. By replacing shallow semantic embeddings with Encoder-Target Decoupling and the SharedTrunkNet architecture, the system achieves state-of-the-art routing efficiency, capturing 45.58% of the Oracle accuracy gap while cutting costs by 74.31%.

TL;DR

NVIDIA researchers have unveiled a new paradigm for LLM routing that ignores "what the query says" (semantics) and focuses on "how a model reacts" (mechanics). By analyzing the internal prefill activations of small open-source models, they can predict with high accuracy whether a massive frontier model like GPT-5 or Claude 4.6 will succeed on a specific task. This approach, called Encoder-Target Decoupling, captures nearly half of the theoretical "Oracle" performance headroom while slashing costs by over 70%.

Background: The Semantic Bottleneck

In the current LLM landscape, we often use "routers" to decide which model should handle a query. Traditionally, these routers use Semantic Signals: they look at the text embedding of a user's question and try to guess which model is best based on past performance on similar topics.

However, the authors argue this is fundamentally flawed. Semantic similarity does not equal task difficulty. A simple-sounding math problem and a complex-sounding one might have similar embeddings but vastly different success rates across models. To bridge this "Semantic-Complexity Gap," we need a mechanistic signal—a peek into the engine as it starts to process the request.

Methodology: Peeking under the Hood

The core innovation lies in Encoder-Target Decoupling. Instead of relying on the target model (which might be a restricted, expensive API), the router uses a local, lightweight "Encoder" model.

1. Signal Extraction & Layer Selection

The researchers found that an LLM's "confidence" in its answer is often encoded in its hidden states during the prefill stage (before a single token is generated). But which layer holds the best signal? They used two mathematical probes:

  • Effective Dimensionality ($d_{eff}$): Measures how widely information is distributed across neurons.
  • Fisher Separability ($J$): A multivariate criterion used to identify which layer's geometry most cleanly separates "correct" from "incorrect" outcomes.
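The two probes can be sketched in a few lines. The formulas below are common textbook forms (the participation ratio for effective dimensionality, and a two-class Fisher criterion projected onto the mean-difference direction); the paper's exact multivariate variants may differ, and all data here is synthetic:

```python
import numpy as np

def effective_dimensionality(H):
    """Participation ratio of the covariance spectrum: a standard proxy
    for how widely information is spread across neuron directions."""
    eig = np.linalg.eigvalsh(np.cov(H - H.mean(axis=0), rowvar=False))
    eig = np.clip(eig, 0.0, None)
    return eig.sum() ** 2 / (eig ** 2).sum()

def fisher_separability(H, y):
    """Two-class Fisher criterion J = between-class scatter / within-class
    scatter, after projecting onto the mean-difference direction."""
    mu0, mu1 = H[y == 0].mean(axis=0), H[y == 1].mean(axis=0)
    w = mu1 - mu0                          # discriminant direction
    s0, s1 = H[y == 0] @ w, H[y == 1] @ w  # 1-D projections per class
    return (s1.mean() - s0.mean()) ** 2 / (s0.var() + s1.var())

# Score every "layer" and keep the most separable one
rng = np.random.default_rng(0)
layers = [rng.normal(size=(200, 32)) for _ in range(4)]  # fake hidden states
y = rng.integers(0, 2, size=200)                         # correct/incorrect labels
layers[2][y == 1] += 1.5                 # plant a separable signal in layer 2
best = max(range(4), key=lambda i: fisher_separability(layers[i], y))
print(best)                              # → 2 (the planted layer wins)
```

On real activations, `H` would be the prefill hidden states of the encoder model at a given layer, one row per query.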

2. SharedTrunkNet Architecture

Once the optimal layers are identified, the features are fed into SharedTrunkNet. This multi-output network predicts the success probability for all candidate models simultaneously. This joint optimization allows the router to learn the relative strengths and weaknesses of models in context, rather than evaluating them in isolation.
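A minimal forward-pass sketch of the shared-trunk idea, assuming a single ReLU trunk layer and one sigmoid head per candidate model. The layer sizes, initialization, and internals are illustrative assumptions; the paper's actual architecture and training procedure are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def init(shape):
    return rng.normal(scale=0.1, size=shape)

class SharedTrunkNet:
    """One shared trunk over the prefill features, then one sigmoid head
    per candidate model, so all success probabilities are predicted jointly."""
    def __init__(self, d_in, d_hidden, n_models):
        self.W1, self.b1 = init((d_in, d_hidden)), np.zeros(d_hidden)
        self.heads = [(init((d_hidden,)), 0.0) for _ in range(n_models)]

    def forward(self, x):
        h = np.maximum(0.0, x @ self.W1 + self.b1)    # shared trunk (ReLU)
        logits = np.array([h @ w + b for w, b in self.heads])
        return 1.0 / (1.0 + np.exp(-logits))          # per-model P(success)

net = SharedTrunkNet(d_in=64, d_hidden=32, n_models=3)
p = net.forward(rng.normal(size=64))      # features from the encoder's prefill
route_to = int(np.argmax(p))              # pick the model most likely to succeed
print(p.shape, route_to)
```

Because the heads share the trunk, gradients from every candidate model shape the same feature extractor during training, which is how the network learns relative strengths rather than isolated per-model scores.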

Figure: Overview of the two-stage routing architecture

Experimental Results: Dominating the Pareto Frontier

The team tested their router across three tiers: Frontier (the giants), Small (the 7-9B models), and Mixed (a real-world blend).

Key Metrics:

  • P-AUCCC: a newly introduced metric, the Area Under the Cost-Coverage Curve, which rewards routers that reach high accuracy at the lowest possible cost.
  • Router Efficacy: the fraction of the headroom between the best standalone model and a theoretical "Oracle" selector that the router recovers.
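As a shape-level illustration of a cost-coverage metric (the paper's precise P-AUCCC definition is not reproduced here; this is just trapezoid-rule accuracy-over-normalized-cost on made-up points):

```python
import numpy as np

def auccc(costs, accuracies):
    """Trapezoid-rule area under an accuracy-vs-normalized-cost curve.
    Illustrative only; the paper's P-AUCCC may normalize differently."""
    order = np.argsort(costs)
    c = np.asarray(costs, float)[order]
    a = np.asarray(accuracies, float)[order]
    c = (c - c[0]) / (c[-1] - c[0])        # normalize cost to [0, 1]
    return float(np.sum((a[1:] + a[:-1]) / 2 * np.diff(c)))

# A router that reaches high accuracy while still cheap scores higher
cheap_router = auccc([1, 2, 10], [0.70, 0.82, 0.84])
pricey_baseline = auccc([1, 6, 10], [0.70, 0.72, 0.84])
print(cheap_router > pricey_baseline)      # → True
```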

Figure: Frontier pool, raw accuracy vs. total cost

The results were striking. As shown in the cost-accuracy curve above, SharedTrunkNet (the blue line) consistently sits above the semantic baselines (like NV-Embed). In the Frontier pool, it closed 45.58% of the gap between the best standalone model and a perfect Oracle, while providing a 74.31% cost reduction compared to always using the most expensive model.
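The two headline numbers decompose into simple formulas. Only the 45.58% and 74.31% figures come from the paper; every accuracy and price below is a placeholder chosen so the arithmetic reproduces them:

```python
# All accuracies and costs here are assumed placeholders, NOT paper data.
best_single_acc = 0.700          # best standalone model (assumed)
oracle_acc      = 0.800          # per-query oracle ceiling (assumed)
router_acc      = 0.74558        # chosen so the example reproduces 45.58%

# Router efficacy: fraction of the oracle headroom the router recovers
gap_closed = (router_acc - best_single_acc) / (oracle_acc - best_single_acc)
print(f"{gap_closed:.2%}")       # → 45.58%

always_best_cost = 10.00         # cost of always using the priciest model (assumed)
router_cost      = 2.569         # chosen so the example reproduces 74.31%
cost_reduction   = 1 - router_cost / always_best_cost
print(f"{cost_reduction:.2%}")   # → 74.31%
```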

Why it Works: The Geography of Logic

Perhaps the most surprising finding is that "foreign" encoders, such as a Qwen model, can predict the success of a Claude or GPT model better than those models' own internal states could (if they were accessible). The authors suggest that some architectures are simply better "sensors" of task difficulty than others. By using the Fisher $J$ criterion to pick the right "slice" of these sensors, SharedTrunkNet finds a global geometry of correctness that transcends any individual model's proprietary weights.

Critical Analysis & Conclusion

This work represents a shift from "LLMs as Black Boxes" to "LLMs as Geometrical Manifolds."

Limitations:

  • The current version focuses on confidence but defers output token length prediction (a major cost factor) to future work.
  • While prefill activations can be cached in specialized engines like vLLM, there is still a non-zero compute overhead for the encoder pass.

Future Outlook: LLM Router provides a theoretically grounded foundation for "Collaborative AI." Instead of building one model to rule them all, the future likely involves a swarm of specialized models, orchestrated by a mechanistic router that knows exactly who is best for the job before the first word of the answer is even written.
