The paper introduces SlopCodeBench (SCBench), a language-agnostic benchmark evaluating coding agents on long-horizon, iterative software development tasks. Featuring 20 problems across 93 checkpoints, it measures how agents extend their own prior code while tracking trajectory-level "slop" metrics.
TL;DR
Performance on standard coding benchmarks is often an illusion of competence. SlopCodeBench reveals that when AI agents are forced to iterate on their own code across multiple checkpoints, they don't just fail to solve complex tasks—they drown in their own "slop." Even the best models (like GPT-5.4 and Opus 4.6) show massive structural erosion and verbosity that diverges sharply from how humans write software.
Positioning: This work shifts the evaluation paradigm from "Can it pass the tests?" to "Is the generated code maintainable for the next version?"
The Problem: The "Single-Shot" Blind Spot
Most current benchmarks evaluate an agent's ability to fix a bug or implement a feature in one go. If a benchmark is multi-turn, it often provides the agent with "gold standard" code between steps, effectively resetting the technical debt.
In reality, software engineering is a series of compromises. A decision to hardcode a variable in Version 1 makes Version 3 an architectural nightmare. Prior benchmarks didn't capture this causal chain of failure.
Methodology: Measuring "Slop" and Erosion
SlopCodeBench introduces 20 problems (like building a CLI search tool or a REST API) that evolve over 3–8 checkpoints. Crucially:
- Agent Persistence: Agents must build on their own previous (and potentially messy) workspace.
- Hidden Tests: No pass/fail feedback from the hidden test suite is provided; the agent must rely on the specification alone.
- New Quality Metrics:
  - Structural Erosion: Uses Cyclomatic Complexity (CC) and SLOC to check whether the "complexity mass" concentrates in a few "God functions."
  - Verbosity: Uses AST-grep rules to detect redundant patterns and clone ratios to catch copy-pasted logic (crude proxies of both metrics are sketched after this list).
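The exact metric definitions live in the paper; as a rough illustration, here is a minimal Python sketch of the two ideas, assuming a radon-based CC/SLOC pass and a naive duplicated-line proxy for clones. The function names, the top-k cutoff, and the thresholds are invented for the sketch and are not SCBench's implementation.

```python
# Minimal sketch of the quality metrics (assumptions: radon for CC/SLOC,
# duplicated non-trivial lines as a crude clone proxy). Not SCBench's code.
from collections import Counter
from pathlib import Path

from radon.complexity import cc_visit
from radon.raw import analyze


def complexity_mass(source: str, top_k: int = 3) -> float:
    """Share of total cyclomatic complexity held by the top_k most complex functions."""
    blocks = cc_visit(source)                      # one entry per function/method/class
    complexities = sorted((b.complexity for b in blocks), reverse=True)
    total = sum(complexities)
    return sum(complexities[:top_k]) / total if total else 0.0


def clone_ratio(source: str, min_len: int = 30) -> float:
    """Fraction of non-trivial lines that occur more than once (copy-paste proxy)."""
    lines = [ln.strip() for ln in source.splitlines() if len(ln.strip()) >= min_len]
    counts = Counter(lines)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(lines) if lines else 0.0


def snapshot(repo: Path) -> dict:
    """Per-checkpoint snapshot of a workspace: size, erosion, and verbosity proxies."""
    sources = [p.read_text() for p in repo.rglob("*.py")]
    return {
        "sloc": sum(analyze(s).sloc for s in sources),
        "max_complexity_mass": max((complexity_mass(s) for s in sources), default=0.0),
        "max_clone_ratio": max((clone_ratio(s) for s in sources), default=0.0),
    }
```

Tracked across checkpoints, a complexity-mass value drifting toward 1.0 is exactly the "God function" concentration the benchmark flags.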
Figure 1: The Iterative Flow. At each checkpoint, the agent receives a new spec and must modify its own prior workspace.
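A rough harness sketch of that loop (the Checkpoint structure and the run_agent, run_tests, and measure_slop callables are placeholders invented here, not the benchmark's API):

```python
# Hypothetical harness sketch: the agent keeps editing the SAME workspace across
# checkpoints, and hidden-test results are recorded for scoring but never fed back.
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class Checkpoint:
    spec: str      # natural-language specification for this version
    tests: str     # hidden test suite; the agent never sees its pass/fail results


def evaluate_problem(
    run_agent: Callable[[str, Path], None],    # (spec, workspace) -> edits workspace in place
    run_tests: Callable[[Path, str], float],   # (workspace, tests) -> pass rate
    measure_slop: Callable[[Path], dict],      # (workspace) -> erosion/verbosity metrics
    workspace: Path,
    checkpoints: list[Checkpoint],
) -> list[dict]:
    results = []
    for i, cp in enumerate(checkpoints):
        run_agent(cp.spec, workspace)          # agent persistence: no reset to gold code
        results.append({
            "checkpoint": i,
            "core": run_tests(workspace, cp.tests),                                  # new features
            "regression": [run_tests(workspace, p.tests) for p in checkpoints[:i]],  # old features
            "slop": measure_slop(workspace),
        })
    return results
```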
The "Slop" Trajectory: AI vs. Human
The results are sobering. While human-maintained repositories (like Flask or Scikit-learn) tend to plateau in complexity and verbosity, agent code worsens monotonically.
- Erosion rises in 80% of trajectories. Instead of creating new helper functions, agents patch logic into massive `if/else` ladders (illustrated in the sketch below). In one case, a `main()` function grew to over 1,000 lines with a CC of 285.
- Verbosity rises in 89.8% of trajectories. Agents tend to repeat boilerplate and defensive scaffolding rather than extracting shared logic.
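A hypothetical illustration of that pattern (invented for this post, not taken from any benchmark trajectory): the sloppy version patches every new requirement into one function, while the refactor agents rarely perform pulls the shared logic out.

```python
import csv
import io
import json


# Slop trajectory: every checkpoint bolts another branch onto the same function,
# so its cyclomatic complexity climbs with each new spec.
def render(rows, fmt, paginate=False, page_size=10):
    if fmt == "json" and not paginate:
        return json.dumps(rows)
    elif fmt == "json" and paginate:
        return json.dumps(rows[:page_size])
    elif fmt == "csv" and paginate:
        buf = io.StringIO()
        csv.writer(buf).writerows(rows[:page_size])
        return buf.getvalue()
    elif fmt == "csv":
        buf = io.StringIO()
        csv.writer(buf).writerows(rows)
        return buf.getvalue()
    # ...dozens more branches accrete here by the final checkpoint...


# The refactor agents rarely make: shared concerns extracted, branching collapsed.
def render_clean(rows, fmt, paginate=False, page_size=10):
    rows = rows[:page_size] if paginate else rows
    if fmt == "json":
        return json.dumps(rows)
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()
```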
Figure 3: Tracking the monotonic increase of Slop across different model providers.
Can We Prompt Our Way Out?
The authors tested "Anti-Slop" and "Plan-First" prompts. The findings illustrate a fundamental limitation of current LLMs:
- Prompting shifts the intercept, not the slope.
- Quality-aware prompts make the first version cleaner, but the rate of degradation stays the same once the agent starts iterating.
- Cleaner code didn't improve pass rates, but it did increase costs (up to 47.9% more reasoning tokens).
Figure 5: Prompts lower initial slop, but the lines of degradation remain parallel to the baseline.
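One way to read the intercept-versus-slope claim quantitatively (a sketch with made-up numbers, not the paper's analysis): fit a line to each condition's slop-per-checkpoint trajectory and compare the fitted coefficients.

```python
# Sketch of the intercept-vs-slope reading; the slop values below are invented.
import numpy as np

checkpoints = np.arange(6)
slop = {
    "baseline":         np.array([0.30, 0.38, 0.47, 0.55, 0.62, 0.71]),
    "anti-slop prompt": np.array([0.18, 0.27, 0.35, 0.44, 0.51, 0.60]),
}

for name, y in slop.items():
    slope, intercept = np.polyfit(checkpoints, y, deg=1)
    print(f"{name}: intercept={intercept:.2f}, slope={slope:.2f} per checkpoint")

# "Shifts the intercept, not the slope": the intercepts differ, but the
# per-checkpoint slopes stay roughly equal, i.e. parallel degradation lines.
```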
Critical Insight: The "Core-to-Regression" Collapse
As problems progress, the gap between passing "Core" tests (new features) and "Regression" tests (old features) widens (see Figure 2 in paper). Agents often break existing functionality to implement the new spec because their internal architectural choices have become too rigid or "sloppy" to accommodate change.
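A minimal sketch of how that gap could be tracked, assuming per-checkpoint pass rates on the newly introduced ("core") tests and on every earlier checkpoint's ("regression") tests are available; the names, data layout, and numbers below are illustrative only.

```python
# Sketch of the core-vs-regression gap per checkpoint (illustrative layout, not the paper's).
def core_regression_gap(core_pass: list[float], regression_pass: list[list[float]]) -> list[float]:
    """core_pass[i]: pass rate on checkpoint i's new tests.
    regression_pass[i]: pass rates on each earlier checkpoint's tests, re-run at step i."""
    gaps = []
    for i, core in enumerate(core_pass):
        old = regression_pass[i]
        regression = sum(old) / len(old) if old else core   # no earlier tests at checkpoint 0
        gaps.append(core - regression)   # a widening gap = old features break as new ones land
    return gaps


# Hypothetical trajectory: new-feature tests keep passing while earlier
# checkpoints' functionality is increasingly broken.
print(core_regression_gap(
    core_pass=[0.9, 0.85, 0.8, 0.8],
    regression_pass=[[], [0.8], [0.7, 0.65], [0.6, 0.55, 0.5]],
))
```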
Conclusion & Takeaways
SlopCodeBench proves that pass rates are a lagging indicator of agent failure. A model might have a high "isolated" pass rate today, but its inability to manage structural erosion makes it useless for long-horizon production environments.
Future Work: We need agents that don't just write code but also possess design discipline: the ability to refactor proactively and manage complexity mass over time.
