[Technical Report] Composer 2: Scaling Frontier-Level Agents for Real-World Software Engineering
Abstract

The Cursor Research Team introduces Composer 2, a specialized MoE model (1.04T total / 32B active parameters) optimized for agentic software engineering. By combining continued pretraining with large-scale asynchronous Reinforcement Learning (RL), the model achieves state-of-the-art performance, scoring 61.3% on the realistic CursorBench and 73.7% on SWE-bench Multilingual.

Executive Summary

Composer 2 represents a milestone in "Agentic Engineering"—moving beyond simple code completion to autonomous problem-solving within complex, large-scale codebases. Developed by the Cursor Research Team, this 1.04T parameter Mixture-of-Experts (MoE) model is not just another LLM with a coding "flavor"; it is a purpose-built agent trained to navigate, edit, and debug software through specialized Reinforcement Learning (RL) and a dedicated infrastructure.

Positioning: This work presents a SOTA specialized model, demonstrating that domain-specific scaling can match or even exceed massive general-purpose models such as GPT-5 or Claude 3.5 on engineering-heavy tasks.

The Problem: The Gap Between Benchmarks and Reality

Current AI coding evaluations often suffer from "Clean Room Syndrome." Public benchmarks like SWE-bench focus on small, isolated bug fixes with highly descriptive prompts. In reality:

  • Context is Messy: Real tasks are underspecified. Developers often give a terse bug report (e.g., "The streaming is broken") rather than a detailed specification.
  • Scope is Massive: Composer 2 targets tasks with a median of 181 lines changed, compared to just 7-10 lines in typical benchmarks.
  • Contamination: Many public benchmark solutions are already leaked into the training data of frontier LLMs, leading to "memorized" rather than "reasoned" solutions.

Methodology: Specialized Training at Scale

1. Continued Pretraining (Knowledge Foundation)

The team selected Kimi K2.5 as the base model and extended it with a code-dominated data mix. They observed a crucial log-linear relationship between codebase perplexity and final RL performance, justifying the compute spend on better "base knowledge" before agentic training begins.
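The reported log-linear relationship can be illustrated with a small least-squares fit of RL score against log-perplexity. The data points and the `predicted_score` helper below are synthetic placeholders for illustration, not numbers or code from the report:

```python
import math

# Hypothetical (codebase perplexity, RL score) pairs -- purely
# illustrative numbers, not measurements from the report.
observations = [
    (2.10, 48.0),
    (1.80, 52.5),
    (1.55, 57.0),
    (1.35, 61.0),
]

# Fit score = a + b * log(perplexity) by ordinary least squares.
xs = [math.log(p) for p, _ in observations]
ys = [s for _, s in observations]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

def predicted_score(perplexity: float) -> float:
    """Extrapolate downstream RL performance from codebase perplexity."""
    return a + b * math.log(perplexity)
```

A negative slope `b` is what a log-linear trend of this kind predicts: each constant-factor reduction in perplexity buys a roughly constant number of benchmark points, which is what justifies spending pretraining compute before RL begins.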

2. Asynchronous Reinforcement Learning (The Agentic Brain)

The core of Composer 2’s intelligence comes from its RL phase. Unlike standard fine-tuning, this RL occurs in a stateful environment called Anyrun, where the model can run shell commands, write tests, and view logs—exactly like a human engineer.

Key Technical Innovations:

  • Router Replay: To stabilize MoE training during RL, they override the trainer's router so it reproduces the expert choices recorded during inference. This keeps the log-probabilities used for the policy gradient consistent with the trajectories that were actually sampled.
  • Self-Summarization: To bypass context window limits during long tasks, the model is trained to summarize its own past actions. This "memory" is rewarded if the final coding outcome is successful.
  • Nonlinear Length Penalties: To prevent the agent from being "lazy" or "over-thinking," a concave reward curve encourages speed on easy tasks while allowing deep "thinking tokens" for complex debugging.
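The router-replay idea can be sketched in a few lines of NumPy. Everything here (the toy layer sizes, the `moe_forward` helper, the weight-drift step) is an illustrative assumption rather than the report's implementation; the point is only that the trainer reuses the expert indices recorded at inference time instead of re-deriving top-k from its own, slightly drifted router:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MoE layer: 4 experts, each a linear map over an 8-dim hidden state.
NUM_EXPERTS, DIM, TOP_K = 4, 8, 2
router_w = rng.normal(size=(DIM, NUM_EXPERTS))
experts = rng.normal(size=(NUM_EXPERTS, DIM, DIM))

def moe_forward(h, replay_experts=None):
    """Run the MoE layer; if replay_experts is given, force those
    expert indices instead of re-deriving top-k from the router."""
    logits = h @ router_w
    if replay_experts is None:
        chosen = np.argsort(logits)[-TOP_K:]      # trainer's own top-k
    else:
        chosen = np.asarray(replay_experts)       # replayed inference choice
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    out = sum(g * (experts[e] @ h) for g, e in zip(gates, chosen))
    return out, chosen

# Inference time: record which experts fired for this token.
h = rng.normal(size=DIM)
_, inference_choice = moe_forward(h)

# Training time: small weight drift could flip the top-k; router replay
# pins the expert set so the recomputed log-probs stay on-policy.
router_w += 0.05 * rng.normal(size=router_w.shape)
out_replayed, chosen = moe_forward(h, replay_experts=inference_choice)
```

Note that only the expert *selection* is replayed; the gate weights are still recomputed from the current router, so gradients continue to flow into it.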
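A concave length penalty of the kind described above can be sketched as follows. The square-root shape, the token budget, and the penalty scale are assumptions chosen for illustration, not the report's actual reward schedule:

```python
import math

def length_shaped_reward(task_reward: float, tokens_used: int,
                         budget: int = 32_000, penalty_scale: float = 0.1) -> float:
    """Apply a concave (square-root) length penalty: the first tokens are
    the most expensive, so easy tasks are pushed to finish quickly, while
    the marginal cost of extra tokens shrinks, leaving room for long
    'thinking' trajectories on hard debugging tasks."""
    penalty = penalty_scale * math.sqrt(min(tokens_used, budget) / budget)
    return task_reward - penalty
```

Because the curve is concave, spending the first half of the budget costs more reward than spending the second half, which is exactly the asymmetry that discourages laziness without punishing deep reasoning.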

Figure: Model architecture and training flow. The Grouped GEMM path in the MoE layer, optimized for hardware-level precision (MXFP8/NVFP4).

Experimental Results: Breaking the Pareto Frontier

Composer 2 achieves a superior balance between cost and accuracy. While it competes with the world's most capable models (GPT-5, Codex) on accuracy, it operates at a significantly lower inference cost due to its MoE architecture and efficient speculative decoding via MTP layers.
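Speculative decoding of the kind enabled by MTP layers can be sketched as a draft-and-verify loop. In Composer 2 the MTP heads let the model draft its own future tokens; in this sketch a separate toy `draft_next` function stands in for them, and `target_next`/`speculative_decode` are hypothetical names, not the report's API:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=12):
    """Draft-and-verify loop: a cheap draft proposes k tokens per step;
    the target checks them and keeps the longest agreeing prefix."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. The draft proposes k tokens autoregressively (cheap calls).
        proposed, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. The target verifies all k proposals (conceptually one batched pass).
        accepted = 0
        for i, t in enumerate(proposed):
            if target_next(seq + proposed[:i]) != t:
                break
            accepted += 1
        seq += proposed[:accepted]
        # 3. On a mismatch the target supplies its own next token, so
        #    every iteration makes progress even when the draft is wrong.
        if len(seq) - len(prompt) < max_new:
            seq.append(target_next(seq))
    return seq[len(prompt):][:max_new]  # trim any overshoot from a full accept

# Toy models: the target counts upward mod 10; the draft agrees except after 5.
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    return 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10

completion = speculative_decode(target_next, draft_next, [0], k=4, max_new=9)
```

When the draft agrees with the target, several tokens are committed per expensive target call, which is where the inference-cost savings come from; acceptance rates degrade gracefully because a rejection still yields one correct token.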

  • CursorBench: 61.3% Accuracy (vs. 44.2% for Composer 1.5).
  • Efficiency: The model sits on the Pareto frontier of cost versus accuracy; its high-effort reasoning mode outperforms the low-effort variants of much larger models at comparable cost.

Figure: Performance comparison. The Pareto frontier showing Composer 2's token and cost efficiency relative to other frontier models.

Critical Insight & Conclusion

The success of Composer 2 stems from minimizing the train-test mismatch. By training the model within the same "harness" used by the Cursor IDE, using real production logs and obscure bug reports as training data, the team has moved closer to a true "AI Software Engineer."

Limitations: While powerful, the model still faces challenges in extremely long-horizon tasks (hours of execution) where signal-to-noise ratios in RL rewards become thin. Future work will likely focus on "Verifiable Rewards" and even longer-term memory architectures.

Takeaway: Scaling isn't just about more data; it's about the fidelity of the environment in which the model learns to act.
