RepoZero: Can LLMs Generate a Code Repository from Scratch?

RepoZero: Moving Beyond Snippets to Verifiable Full-Repo Synthesis

Abstract

RepoZero is the first verifiable and scalable benchmark for evaluating whether Large Language Models (LLMs) can generate entire software repositories from scratch. It uses a novel "repository reproduction" task with cross-language constraints (e.g., Py2JS, C2Rust) and an automated, execution-based verification pipeline, so that results are checked by running code rather than by subjective judgment.

Executive Summary

TL;DR: While LLMs have become proficient at writing individual functions, can they architect a whole software repository from scratch? RepoZero introduces a rigorous, execution-based benchmark to answer this. By forcing models to reimplement existing Python/C++ libraries in JavaScript/Rust and verifying them against a "gold standard" source oracle, it exposes a massive gap in current AI capabilities.

Positioning: This work moves the goalposts from simple code completion to Autonomous Software Engineering. It sits at the intersection of benchmarking and test-time scaling, and argues that the next leap in AI coding will come not just from bigger models but from better self-verification loops.

Problem & Motivation: The "Vibe Coding" Trap

Current code benchmarks suffer from two fatal flaws:

Subjectivity: Many rely on "LLM-as-a-judge," where one model guesses if another model's code "looks" right. This lacks the binary rigour of actual software engineering.
Data Leakage: Since models are trained on GitHub, they might "remember" a solution rather than "reason" through it.

The authors argue that the only way to truly test an agent is to see if it can build a system that behaves exactly like a reference implementation, but in a different language to prevent rote memorization.

Methodology: The ACE Framework and Source Oracle

The core innovation is Repository Reproduction. Instead of a vague prompt like "make a web scraper," RepoZero gives the agent the API signatures of a known library and says "make this library in a different language so it passes these tests."

The Oracle Advantage

Because the authors use existing, functional repositories as the "source," they have a Source Oracle. They can generate thousands of inputs, run them through the original code to get the ground truth, and then use those exact pairs to check the agent's work.
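The oracle idea can be illustrated with a minimal sketch. It assumes a single pure function stands in for the source repository, and it keeps both the reference and the candidate in Python so the example is self-contained; in the benchmark the candidate would live in another language and be executed separately. The names source_median, build_oracle_cases, and check_candidate are hypothetical and do not come from the paper.

```python
import random
import statistics


def source_median(values):
    """Hypothetical stand-in for a function exported by the source repository."""
    return statistics.median(values)


def build_oracle_cases(reference_fn, n_cases, seed=0):
    """Generate random inputs and record the reference outputs as ground truth."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n_cases):
        size = rng.randint(1, 8)
        values = [rng.randint(-100, 100) for _ in range(size)]
        cases.append((values, reference_fn(values)))
    return cases


def check_candidate(candidate_fn, cases):
    """Return the fraction of oracle cases the candidate reproduces exactly."""
    passed = sum(
        1 for inputs, expected in cases if candidate_fn(list(inputs)) == expected
    )
    return passed / len(cases)


def buggy_median(values):
    """Candidate port with a deliberate bug: no averaging for even-length input."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def correct_median(values):
    """Candidate port that matches the reference behaviour."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


cases = build_oracle_cases(source_median, 200)
print("correct port:", check_candidate(correct_median, cases))
print("buggy port:  ", check_candidate(buggy_median, cases))
```

Because the expected values come from running the reference, nobody has to hand-write the answers, and the number of cases can be scaled simply by raising n_cases.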

Agentic Code-Test Evolution (ACE)

To solve these hard tasks, they propose the ACE workflow:

Initial Synthesis: The agent writes the repo.
Test Generation: A testing agent creates edge cases.
Oracle Verification: The source repo provides the correct answers.
Iterative Refinement: The agent uses execution errors to fix its code.
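The four steps above can be sketched as a small loop. This is only an illustration under strong assumptions: the coding agent is replaced by a scripted list of revisions (REVISIONS), the testing agent by a fixed schedule of edge cases (generate_tests), and the source repository by a single function (reference_abs). None of these names come from the paper, and a real agent would condition each revision on the reported failures rather than follow a script.

```python
from typing import Callable, List, Optional, Tuple

Impl = Callable[[int], int]


def reference_abs(x: int) -> int:
    """Hypothetical source-repo function acting as the oracle."""
    return abs(x)


# Hypothetical stand-in for the coding agent: one scripted revision per round.
REVISIONS: List[Impl] = [
    lambda x: x,                     # draft 0: forgets negative inputs
    lambda x: -x if x < -1 else x,   # draft 1: wrong at the boundary x == -1
    lambda x: -x if x < 0 else x,    # draft 2: correct
]


def generate_tests(round_index: int) -> List[int]:
    """Hypothetical testing agent: adds one more edge case each round."""
    inputs = [0, 1, 5]
    edge_cases = [-5, -1, -(10**6)]
    inputs.extend(edge_cases[: round_index + 1])
    return inputs


def ace_loop(max_rounds: int) -> Tuple[Optional[int], List[List[int]]]:
    """Run synthesize -> generate tests -> oracle check -> refine.

    Returns the round at which every test passed (or None if the budget ran
    out) together with the failing inputs observed in each round.
    """
    history: List[List[int]] = []
    for round_index in range(max_rounds):
        candidate = REVISIONS[min(round_index, len(REVISIONS) - 1)]
        tests = generate_tests(round_index)
        # Oracle verification: the reference supplies the expected outputs.
        failures = [x for x in tests if candidate(x) != reference_abs(x)]
        history.append(failures)
        if not failures:
            return round_index, history
    return None, history


solved_at, history = ace_loop(max_rounds=5)
print(solved_at, history)
```

The round budget (max_rounds) is the test-time compute knob in this sketch: with a budget of two rounds the loop stops before the correct draft is reached.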

Figure 1: The RepoZero construction pipeline and the iterative ACE feedback loop.

Experiments & Results: A Reality Check

The results are a sobering reminder of LLM limitations. Even with powerful scaffolds like Mini-SWE-Agent, most models fail to reach a 60% success rate on "Easy" tasks, and many plummet to below 20% on "Hard" repositories.

Key Findings:

Claude-4.6-Sonnet leads the pack, but even its average performance hovers around 50%.
The Runnability Gap: Models often produce code that runs but is semantically wrong (wrong logic, wrong math precision). About 40% of executable code failed strict output matching.
Contextual Drift: Agents often forget initial constraints as the project grows larger, a sign that long-context management is still a bottleneck.
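The difference between "runs" and "matches the oracle" can be made concrete with a small sketch. It assumes a hypothetical summation function as the source behaviour (reference_total) and a hypothetical port (candidate_total) that executes without error on most inputs but loses floating-point precision; the three outcome labels are illustrative and are not the paper's metric names.

```python
import math


def reference_total(values):
    """Hypothetical source function: exactly rounded float summation."""
    return math.fsum(values)


def candidate_total(values):
    """Hypothetical port: naive accumulation, and it rejects empty input."""
    if not values:
        raise ValueError("empty input")
    total = 0.0
    for v in values:
        total += v
    return total


def classify(candidate, reference, inputs):
    """Count crashes, runs with wrong output, and exact matches."""
    counts = {"crashed": 0, "ran_but_wrong": 0, "passed": 0}
    for args in inputs:
        expected = reference(args)
        try:
            actual = candidate(args)
        except Exception:
            counts["crashed"] += 1
            continue
        counts["passed" if actual == expected else "ran_but_wrong"] += 1
    return counts


inputs = [[1.0, 2.0], [0.1] * 10, [], [0.5, 0.25]]
report = classify(candidate_total, reference_total, inputs)
print(report)
```

Here three of the four inputs execute, but only two match: adding 0.1 ten times in a naive loop yields 0.9999999999999999, while the reference returns 1.0. A metric that only checks whether code runs would count that case as a success.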

Table 1: Performance across different models and difficulty levels. Claude-4.6-Sonnet remains the SOTA baseline.

Critical Analysis & Conclusion

Takeaway: RepoZero successfully shifts the focus from "writing code" to "verifying logic." The success of the ACE framework suggests that test-time scaling, meaning the ability of an agent to think and test before finalizing, is the most promising path toward autonomous development.

Limitations:

The benchmark is currently "semi-synthetic," meaning it uses existing repos rather than entirely new real-world requirements.
It focuses on deterministic libraries (math, data structures); non-deterministic systems (like UI or concurrent networking) remain a challenge for this verification style.
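Why determinism matters for this verification style can be shown with a short sketch. It assumes a simple repeat-and-compare check, which is an illustration rather than anything the paper describes; normalize_keys, request_id, and is_deterministic are hypothetical names. If the reference gives different outputs for the same input, the recorded "ground truth" cannot be matched exactly even by a perfect port.

```python
import itertools


def normalize_keys(record):
    """Hypothetical deterministic source function: output depends only on input."""
    return sorted(key.lower() for key in record)


_call_counter = itertools.count()


def request_id(payload):
    """Hypothetical stateful source function: output depends on hidden call count."""
    return f"{payload}-{next(_call_counter)}"


def is_deterministic(fn, inputs, repeats=5):
    """Return True if repeated calls on each input always give the same output."""
    for args in inputs:
        first = fn(args)
        if any(fn(args) != first for _ in range(repeats)):
            return False
    return True


print(is_deterministic(normalize_keys, [["B", "a"], ["Z"]]))
print(is_deterministic(request_id, ["ping", "pong"]))
```

A check like this can only reveal nondeterminism that shows up within the repeats it runs, so it is a filter for obviously unsuitable oracles, not a proof that a function is safe to use as one.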

Future Outlook: We are entering the era of Agentic Engineering, where the value of an LLM isn't its first draft, but its ability to debug itself into a perfect solution. RepoZero provides the first rigorous measuring stick for that evolution.
