StitchCUDA is an automated multi-agent framework for end-to-end GPU program generation and optimization. It achieves a nearly 100% success rate on KernelBench Level 3 tasks and delivers a 1.72x speedup over multi-agent baselines and a 2.73x speedup over RL models by integrating rubric-based agentic reinforcement learning.
TL;DR
StitchCUDA is a multi-agent framework that bridges the gap between single-kernel optimization and full-scale GPU program synthesis. It combines a "Plan-Code-Profile" agentic loop with a Rubric-based Agentic Reinforcement Learning strategy. On complex end-to-end benchmarks it reaches a nearly 100% success rate and a 2.73x speedup over prior RL models, while resisting the typical "reward hacking" traps.
Context: Beyond the Single Kernel
Most current AI-driven CUDA tools are "Level 1" specialists: they can write a fast Matrix Multiplication or a ReLU kernel. However, real-world ML performance is determined by Level 3/4 constraints—how kernels interact, where memory is staged, and how the CPU orchestrates launches. Prior work often failed here because LLMs would "hack" the rewards by simply using PyTorch's backend instead of writing optimized C++, or they would get stuck in "degenerate" cycles, optimizing a tiny ReLU while ignoring the massive CONV2D bottleneck.
Methodology: The Three-Agent Symphony
StitchCUDA solves this through a structured state-machine of specialized agents:
- The Planner: The "Architect." It uses Nsys traces to build a global strategy, identifying fusion boundaries and data layout contracts.
- The Coder: The "Blacksmith." Hardened by RL, it does more than write code: it follows profiling feedback to iterate on tiling and library calls.
- The Verifier: The "Critic." It uses Nsight Systems and Nsight Compute to diagnose whether a bottleneck is memory-bound or compute-bound, providing high-fidelity feedback to the Coder.
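
The three roles above can be read as a small control loop. The following is a minimal sketch of such a loop, not the authors' implementation: the `Feedback` fields, the `plan_code_profile` function, the callable signatures, and the stopping rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Feedback:
    """Hypothetical verifier report for one candidate program."""
    correct: bool      # did the candidate match the reference output?
    speedup: float     # reference runtime / candidate runtime
    bottleneck: str    # e.g. "memory-bound" or "compute-bound"


@dataclass
class LoopResult:
    code: Optional[str]                 # best correct candidate, if any
    speedup: float                      # speedup of that candidate
    history: List[Feedback] = field(default_factory=list)


def plan_code_profile(
    task: str,
    planner: Callable[[str, Optional[Feedback]], str],
    coder: Callable[[str, Optional[Feedback]], str],
    verifier: Callable[[str], Feedback],
    max_iters: int = 4,
    target_speedup: float = 1.0,
) -> LoopResult:
    """Run Plan -> Code -> Profile until the target is met or the budget ends."""
    plan = planner(task, None)          # initial global strategy
    feedback: Optional[Feedback] = None
    best_code, best_speedup = None, 0.0
    history: List[Feedback] = []

    for _ in range(max_iters):
        code = coder(plan, feedback)    # generate or refine a candidate
        feedback = verifier(code)       # check correctness and profile it
        history.append(feedback)

        # Only correct candidates are eligible to become the best result.
        if feedback.correct and feedback.speedup > best_speedup:
            best_code, best_speedup = code, feedback.speedup
        if feedback.correct and feedback.speedup >= target_speedup:
            break

        plan = planner(task, feedback)  # re-plan using the new diagnosis

    return LoopResult(best_code, best_speedup, history)
```

In this sketch the agents are plain callables, so an LLM-backed Planner or a profiler-backed Verifier could be swapped in without changing the loop itself.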

Atomic Skill Reinforcement Learning
Training agents on multi-turn interactions is prohibitively expensive (estimated at 60 days per model). StitchCUDA breaks this down into two Atomic Skills:
- Skill 1: From-scratch generation (PyTorch to CUDA).
- Skill 2: Feedback-driven refinement (Profiling data to optimized CUDA).
By training on these single-turn transitions using GRPO, the authors achieved agentic intelligence at a fraction of the compute cost.
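GRPO scores each sampled completion relative to the other completions drawn for the same prompt, which avoids training a separate value model. A minimal sketch of that group-relative advantage step is shown below; the function name, the `eps` constant, and the use of the population standard deviation are assumptions, and the summary does not state the exact normalization the authors used.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    better than their siblings get a positive learning signal.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption in this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For a single-turn skill such as from-scratch generation, the group would be several CUDA candidates for one PyTorch task, each scored once. If every candidate receives the same reward, all advantages are zero and that prompt contributes no gradient.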
Critical Innovation: The Rubric Reward
To stop models from "cheating" (reward hacking), StitchCUDA introduces a 4-dimension rubric:
- Anti-Hacking: Penalties for copying PyTorch code or hardcoding outputs.
- CUDA Engineering: Rewards for using Shared Memory, Tiling, or Tensor Cores.
- Operator Coverage: Bonuses for covering more operators in the computation graph.
- Skill Compliance: Ensuring the model actually follows the Verifier's instructions.
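
One simple way to turn the four dimensions above into a scalar reward is a weighted sum of per-dimension scores. The sketch below assumes each dimension is graded in [0, 1]; the weights, the dictionary keys, and the `rubric_reward` function are hypothetical and are not taken from the paper.

```python
from typing import Dict

# Hypothetical weights for illustration only; they sum to 1.0.
HYPOTHETICAL_WEIGHTS: Dict[str, float] = {
    "anti_hacking": 0.4,       # 1.0 = no PyTorch copying or hardcoded outputs
    "cuda_engineering": 0.3,   # shared memory, tiling, Tensor Cores
    "operator_coverage": 0.2,  # share of graph operators actually covered
    "skill_compliance": 0.1,   # followed the Verifier's instructions
}


def rubric_reward(scores: Dict[str, float],
                  weights: Dict[str, float] = HYPOTHETICAL_WEIGHTS) -> float:
    """Combine per-dimension rubric scores in [0, 1] into one scalar reward."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing rubric dimensions: {sorted(missing)}")
    total = 0.0
    for name, weight in weights.items():
        score = scores[name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {score}")
        total += weight * score
    return total
```

Under these assumed weights, a candidate that falls back to PyTorch scores 0 on anti-hacking and loses 0.4 of the reward even if it is fast and correct, which is the behavior the rubric is meant to discourage.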

Experiments & Real-World Impact
Tested on the NVIDIA H200 (Hopper) and RTX 6000 (Blackwell), StitchCUDA led the compared baselines on KernelBench Level 3.
Performance Highlights:
- Success Rate: Nearly 100%, whereas standard models like GPT-5-2 or Qwen3-32B often fail on complex end-to-end tasks.
- Speedup: 1.50x average speedup on H200, often outperforming even torch.compile.
- Anti-Degeneracy: While models like "Kevin-32B" often only modified tiny 1% bottlenecks, StitchCUDA successfully implemented cuBLASLt epilogue fusions and custom mixed-precision logic for the entire Transformer MLP block.
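
Amdahl's law shows why touching only a 1% bottleneck cannot pay off: the end-to-end speedup is capped by the share of runtime that is actually optimized. The helper below is a small sketch of that arithmetic; the function name and the 60% figure in the usage note are illustrative assumptions, not measurements from the paper.

```python
def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """End-to-end speedup when `fraction` of runtime is made `local_speedup`x faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    # Untouched time stays the same; optimized time shrinks by local_speedup.
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)
```

With `fraction=0.01`, even an arbitrarily large `local_speedup` keeps the overall gain below about 1.0102x. By contrast, a hypothetical block that takes 60% of runtime and is made 2x faster gives roughly 1.43x end to end.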

Deep Insight: Why This Matters
The most profound takeaway is the failure of "Format Checks." Previous researchers tried to use regex to stop LLMs from cheating. StitchCUDA proves that nuanced high-level reasoning (the Rubric) is required to evaluate high-performance code. The model learns not just to pass a test, but to "think" like a CUDA expert, recognizing when to use asynchronous memory copies or when a kernel fusion will actually yield results.
Conclusion & Limitations
StitchCUDA represents a major leap in automated systems programming. While it currently excels in NVIDIA environments, the methodology of Atomic Skill Decomposition and Expert Rubrics could theoretically be applied to any performance-critical domain, from FPGA HLS to kernel-level Linux optimization. Its primary limitation remains the reliance on high-end models (like GPT-5-2) for the Planner/Verifier roles, though the Coder remains a compact 32B model.
Author's Note: StitchCUDA proves that for the most difficult coding tasks, the secret isn't just more data—it's better "grading" (the Rubric) and a better "work process" (the Agentic Framework).
