The Cursor Research Team introduces Composer 2, a specialized Mixture-of-Experts (MoE) model (1.04T total / 32B active parameters) optimized for agentic software engineering. By combining continued pretraining with large-scale asynchronous Reinforcement Learning (RL), the model achieves state-of-the-art performance, scoring 61.3% on the realistic CursorBench and 73.7% on SWE-bench Multilingual.
Executive Summary
Composer 2 represents a milestone in "Agentic Engineering": moving beyond simple code completion to autonomous problem-solving within complex, large-scale codebases. Developed by the Cursor Research Team, this 1.04T-parameter MoE model is not just another LLM with a coding "flavor"; it is a purpose-built agent trained to navigate, edit, and debug software through specialized Reinforcement Learning (RL) and dedicated infrastructure.
Positioning: Composer 2 is presented as a specialized state-of-the-art model, demonstrating that domain-specific scaling can match or even exceed massive general-purpose models such as GPT-5 or Claude 3.5 on engineering-heavy tasks.
The Problem: The Gap Between Benchmarks and Reality
Current AI coding evaluations often suffer from "Clean Room Syndrome." Public benchmarks like SWE-bench focus on small, isolated bug fixes with highly descriptive prompts. In reality:
- Context is Messy: Real tasks are underspecified. Developers often give a terse bug report (e.g., "The streaming is broken") rather than a detailed specification.
- Scope is Massive: Composer 2 targets tasks with a median of 181 lines changed, compared to just 7-10 lines in typical benchmarks.
- Contamination: Many public benchmark solutions have already leaked into the training data of frontier LLMs, yielding "memorized" rather than "reasoned" solutions.
Methodology: Specialized Training at Scale
1. Continued Pretraining (Knowledge Foundation)
The team selected Kimi K2.5 as the base model and extended it with a code-dominated data mix. They observed a crucial log-linear relationship between codebase perplexity and final RL performance, justifying the compute spend on better "base knowledge" before agentic training begins.
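To make the claimed relationship concrete: if benchmark score falls on a straight line against log(perplexity), it can be fit directly. Below is a minimal Python sketch of such a fit; the numbers are invented purely for illustration and do not come from the report.

```python
import numpy as np

# Hypothetical checkpoint measurements (illustrative numbers only):
# codebase perplexity after continued pretraining vs. downstream RL score.
perplexity = np.array([2.8, 2.4, 2.1, 1.9, 1.7])
rl_score = np.array([41.0, 47.5, 52.0, 56.0, 60.5])

# Log-linear model: score ~ a + b * log(perplexity), expecting b < 0,
# i.e. lower perplexity after pretraining predicts a higher RL score.
b, a = np.polyfit(np.log(perplexity), rl_score, deg=1)
print(f"score ~ {a:.1f} + {b:.1f} * log(perplexity)")
```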
2. Asynchronous Reinforcement Learning (The Agentic Brain)
The core of Composer 2’s intelligence comes from its RL phase. Unlike standard fine-tuning, this RL occurs in a stateful environment called Anyrun, where the model can run shell commands, write tests, and view logs—exactly like a human engineer.
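Anyrun's actual interface is not described in detail, but the stateful agent loop it implies looks roughly like the sketch below; every name here (reset, act, step, run_tests) is a hypothetical stand-in, not Anyrun's API.

```python
def rollout(policy, env, max_steps=50):
    """One RL episode in a stateful coding environment (hypothetical API)."""
    obs = env.reset()                         # fresh checkout of the repo
    trajectory = []
    for _ in range(max_steps):
        action = policy.act(obs)              # e.g. edit a file, run a shell command
        obs, done = env.step(action)          # state (files, processes, logs) persists
        trajectory.append((action, obs))
        if done:                              # agent declares the task complete
            break
    reward = 1.0 if env.run_tests() else 0.0  # sparse, outcome-based reward
    return trajectory, reward
```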
Key Technical Innovations:
- Router Replay: To stabilize the MoE during RL, the trainer's router is overridden to match the expert choices made during inference. This keeps the log-probabilities used in the policy-gradient update consistent with the trajectory that was actually sampled (first sketch after this list).
- Self-Summarization: To work past context-window limits on long tasks, the model is trained to summarize its own past actions. This "memory" is rewarded only if the final coding outcome is successful (second sketch after this list).
- Nonlinear Length Penalties: To keep the agent from being "lazy" or over-thinking, a concave reward curve encourages speed on easy tasks while still leaving room for deep "thinking tokens" on complex debugging (third sketch after this list).
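Router replay, in a minimal PyTorch sketch. The function name and tensor layout are assumptions, not Cursor's implementation; the point is that replayed expert indices bypass the trainer's top-k selection, while gate weights are still computed from the trainer's logits so gradients flow through the router.

```python
import torch
import torch.nn.functional as F

def select_experts(router_logits, top_k=2, replay_indices=None):
    """Pick top_k experts per token. If replay_indices (recorded at
    inference time) is given, reuse those expert choices instead of
    re-running top-k, so the trainer's log-probs match the trajectory
    that was actually sampled."""
    if replay_indices is None:
        # Normal path: top-k selection from the trainer's own router.
        topk_logits, topk_idx = router_logits.topk(top_k, dim=-1)
    else:
        # Replay path: force the inference-time expert choices, but
        # gather the trainer's logits for them so the gate is trainable.
        topk_idx = replay_indices
        topk_logits = router_logits.gather(-1, topk_idx)
    gate_weights = F.softmax(topk_logits, dim=-1)
    return topk_idx, gate_weights
```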
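Self-summarization reduces to a context-compaction step. A sketch under assumed names (compact_history, count_tokens, and summarize are hypothetical callables, with summarize being the model itself prompted to compress its own actions):

```python
def compact_history(messages, count_tokens, summarize, budget=100_000, keep=8):
    """Once the transcript exceeds the context budget, replace the oldest
    turns with a model-written summary; the most recent turns stay
    verbatim. During training, the summary earns reward only when the
    episode's final coding outcome succeeds."""
    if count_tokens(messages) <= budget:
        return messages
    old, recent = messages[:-keep], messages[-keep:]
    summary = summarize(old)  # e.g. "Edited parser.py; tests A and B now pass"
    header = {"role": "system", "content": f"Summary of earlier work: {summary}"}
    return [header] + recent
```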
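And the concave length penalty can be written in one line; the exact functional form, budget, and coefficients below are illustrative assumptions, not the paper's values.

```python
def shaped_reward(outcome_reward, tokens_used, budget=32_768, alpha=0.5, scale=0.2):
    """Concave length penalty: (tokens/budget)**alpha with alpha < 1 rises
    steeply at first, so padding out an easy task is expensive, then
    flattens, so a genuinely hard task can still spend many thinking
    tokens without the penalty dominating the outcome reward."""
    frac = min(tokens_used, budget) / budget
    return outcome_reward - scale * frac ** alpha
```

With alpha < 1 the marginal cost of each extra token shrinks, which is what nudges easy tasks to finish fast while letting hard debugging runs think at length.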
Figure: The Grouped GEMM training flow in the MoE layer, optimized for hardware-level precision (MXFP8/NVFP4).
Experimental Results: Breaking the Pareto Frontier
Composer 2 achieves a superior balance between cost and accuracy. While it competes with the world's most capable models (GPT-5, Codex) on accuracy, it runs at a significantly lower inference cost thanks to its MoE architecture and efficient speculative decoding via multi-token-prediction (MTP) layers (sketched after the results below).
- CursorBench: 61.3% Accuracy (vs. 44.2% for Composer 1.5).
- Efficiency: The model demonstrates Pareto-optimal trade-offs: its high-effort reasoning paths outperform the low-effort variants of much larger models.
Figure: The Pareto frontier showing Composer 2's efficiency in tokens and cost relative to other frontier models.
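For readers unfamiliar with speculative decoding: an MTP head cheaply drafts several tokens ahead, and the full model verifies them in one pass. The accept/reject rule below is the standard one from the speculative-sampling literature, not a description of Cursor's implementation; p_target and p_draft are hypothetical callables mapping a token to its probability.

```python
import random

def verify_draft(drafted_tokens, p_target, p_draft):
    """Accept each drafted token with probability min(1, p_target/p_draft),
    which preserves the target model's output distribution; the first
    rejection truncates the block and decoding resumes from the target."""
    accepted = []
    for tok in drafted_tokens:
        if random.random() < min(1.0, p_target(tok) / p_draft(tok)):
            accepted.append(tok)   # verified: keep the drafted token
        else:
            break                  # rejected: fall back to the target model
    return accepted
```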
Critical Insight & Conclusion
The success of Composer 2 stems from minimizing the train-test mismatch. By training the model within the same "harness" used by the Cursor IDE, on real production logs and terse, underspecified bug reports, the team has moved closer to a true "AI Software Engineer."
Limitations: While powerful, the model still struggles on extremely long-horizon tasks (hours of execution), where the signal-to-noise ratio of RL rewards becomes thin. Future work will likely focus on "Verifiable Rewards" and even longer-term memory architectures.
Takeaway: Scaling isn't just about more data; it's about the fidelity of the environment in which the model learns to act.
