StitchCUDA: Orchestrating Multi-Agent Reinforcement Learning for End-to-End GPU Mastery
Abstract

StitchCUDA is an automated multi-agent framework for end-to-end GPU program generation and optimization. By integrating rubric-based agentic reinforcement learning, it achieves a nearly 100% success rate on KernelBench Level 3 tasks and delivers a 1.72x speedup over multi-agent baselines and a 2.73x speedup over RL models.

TL;DR

StitchCUDA is a multi-agent framework that bridges the gap between single-kernel optimization and full-scale GPU program synthesis. By combining a "Plan-Code-Profile" agentic loop with a Rubric-based Agentic Reinforcement Learning strategy, it achieves a nearly 100% success rate on complex end-to-end benchmarks and delivers significant speedups (2.73x over prior RL models) while resisting the typical "reward hacking" traps.

Context: Beyond the Single Kernel

Most current AI-driven CUDA tools are "Level 1" specialists: they can write a fast matrix multiplication or ReLU kernel. Real-world ML performance, however, is determined at Level 3/4: how kernels interact, where memory is staged, and how the CPU orchestrates launches. Prior work often failed here because LLMs would "hack" the reward by simply calling PyTorch's backend instead of writing optimized C++, or would get stuck in "degenerate" cycles, optimizing a tiny ReLU while ignoring the massive Conv2D bottleneck.

Methodology: The Three-Agent Symphony

StitchCUDA solves this through a structured state machine of specialized agents:

  1. The Planner: The "Architect." It uses Nsight Systems (nsys) traces to build a global strategy, identifying fusion boundaries and data layout contracts.
  2. The Coder: The "Blacksmith." Hardened by RL, it does more than write code: it follows profiling feedback to iterate on tiling and library calls.
  3. The Verifier: The "Critic." It uses Nsight Systems and Nsight Compute to diagnose whether a bottleneck is memory-bound or compute-bound, providing high-fidelity feedback to the Coder.
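
The three roles above can be read as a small control loop. The sketch below is only an illustration of that reading, not the authors' implementation: `Feedback`, `LoopState`, `run_loop`, the stub agents, and the stopping rule (`target_speedup`, `max_rounds`) are all hypothetical names and choices, and the stubs return canned values instead of calling an LLM or a profiler.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Feedback:
    correct: bool    # did the candidate match the reference output?
    speedup: float   # measured speedup over the reference program
    diagnosis: str   # e.g. "memory-bound" or "compute-bound"


@dataclass
class LoopState:
    plan: str = ""
    code: str = ""
    history: List[Feedback] = field(default_factory=list)


def run_loop(task, planner, coder, verifier, max_rounds=4, target_speedup=1.0):
    """Plan once, then alternate Coder and Verifier until the target is met."""
    state = LoopState(plan=planner(task))
    best: Optional[Tuple[str, Feedback]] = None
    for _ in range(max_rounds):
        last = state.history[-1] if state.history else None
        state.code = coder(state.plan, last)   # from scratch, or refine on feedback
        fb = verifier(state.code)              # profile and diagnose
        state.history.append(fb)
        if fb.correct and (best is None or fb.speedup > best[1].speedup):
            best = (state.code, fb)
        if fb.correct and fb.speedup >= target_speedup:
            break
    return best, state


# Stub agents with canned behaviour (hypothetical, for illustration only).
def stub_planner(task):
    return f"fuse elementwise ops in {task}"


def stub_coder(plan, last):
    if last is None:
        return "from-scratch kernel"
    return f"refined kernel ({last.diagnosis})"


def stub_verifier(code):
    if code.startswith("from-scratch"):
        return Feedback(True, 0.8, "memory-bound")
    return Feedback(True, 1.4, "compute-bound")


best, state = run_loop("mlp_block", stub_planner, stub_coder, stub_verifier,
                       target_speedup=1.2)
print(len(state.history), best[0], best[1].speedup)
```

With these stubs the first attempt is slower than the reference, the Verifier's diagnosis is fed back to the Coder, and the second attempt clears the target, so the loop stops after two rounds.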

[Figure: StitchCUDA Workflow]

Atomic Skill Reinforcement Learning

Training agents on multi-turn interactions is prohibitively expensive (estimated at 60 days per model). StitchCUDA breaks this down into two Atomic Skills:

  • Skill 1: From-scratch generation (PyTorch to CUDA).
  • Skill 2: Feedback-driven refinement (Profiling data to optimized CUDA).

By training on these single-turn transitions with GRPO (Group Relative Policy Optimization), the authors obtain multi-turn agentic behavior at a fraction of the compute cost.
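
For readers unfamiliar with GRPO, its core step is to score each sampled completion relative to the other completions drawn for the same prompt. The sketch below shows only that group-relative normalization; the function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration, and the source does not describe the authors' training configuration at this level.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for a single prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their siblings get a positive signal and the rest a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical single-turn samples for one prompt (e.g. one Skill 1 task).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Because the baseline is the group mean, no separate value network is needed, which is one reason single-turn training of this kind is comparatively cheap.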

Critical Innovation: The Rubric Reward

To stop models from "cheating" (reward hacking), StitchCUDA introduces a four-dimension rubric:

  • Anti-Hacking: Penalties for copying PyTorch code or hardcoding outputs.
  • CUDA Engineering: Rewards for using Shared Memory, Tiling, or Tensor Cores.
  • Operator Coverage: Bonuses for covering more operators in the computation graph.
  • Skill Compliance: Ensuring the model actually follows the Verifier's instructions.
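
One way to picture how the four dimensions above could be folded into a single training signal is sketched below. Everything numeric here is an assumption: the source does not give the weights, the score ranges, or the combination rule, so `RubricScores`, `WEIGHTS`, the 0.5 hacking threshold, and `hack_penalty` are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class RubricScores:
    anti_hacking: float       # 1.0 = clean, 0.0 = clear hacking (assumed scale)
    cuda_engineering: float   # use of shared memory, tiling, Tensor Cores
    operator_coverage: float  # share of graph operators actually covered
    skill_compliance: float   # how closely the Verifier's feedback was followed


# Hypothetical weights; chosen only so that they sum to 1.
WEIGHTS = {
    "cuda_engineering": 0.5,
    "operator_coverage": 0.25,
    "skill_compliance": 0.25,
}


def rubric_reward(scores, hack_penalty=-1.0):
    """Treat anti-hacking as a gate, then take a weighted sum of the rest."""
    if scores.anti_hacking < 0.5:
        return hack_penalty
    return (WEIGHTS["cuda_engineering"] * scores.cuda_engineering
            + WEIGHTS["operator_coverage"] * scores.operator_coverage
            + WEIGHTS["skill_compliance"] * scores.skill_compliance)


honest = rubric_reward(RubricScores(1.0, 1.0, 1.0, 1.0))
cheater = rubric_reward(RubricScores(0.0, 1.0, 1.0, 1.0))
print(honest, cheater)
```

Gating on the anti-hacking score, rather than averaging it in, means a submission that copies PyTorch code cannot recover its reward through high scores elsewhere; whether the paper uses a gate or a weighted penalty is not stated in this summary.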

[Figure: Rubric RL Process]

Experiments & Real-World Impact

Tested on the NVIDIA H200 (Hopper) and RTX 6000 (Blackwell), StitchCUDA leads the compared baselines on KernelBench Level 3 in both success rate and speedup.

Performance Highlights:

  • Success Rate: Nearly 100%, whereas standard models like GPT-5-2 or Qwen3-32B often fail on complex end-to-end tasks.
  • Speedup: 1.50x average speedup on H200, often outperforming even torch.compile.
  • Anti-Degeneracy: While models like "Kevin-32B" often only modified tiny 1% bottlenecks, StitchCUDA successfully implemented cuBLASLt epilogue fusions and custom mixed-precision logic for the entire Transformer MLP block.
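
The anti-degeneracy point can be made concrete with Amdahl's law: if a kernel accounts for a fraction p of runtime and is accelerated by a factor s, the whole program speeds up by 1 / ((1 - p) + p / s). The sketch below assumes the "1%" above refers to share of runtime; the 80% figure for a dominant block is a made-up comparison value, not a number from the paper.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: end-to-end speedup from accelerating one part of a program."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


# Making a 1%-of-runtime kernel 10x faster barely moves the total.
tiny = overall_speedup(0.01, 10.0)

# Making a hypothetical 80%-of-runtime block only 2x faster matters far more.
dominant = overall_speedup(0.80, 2.0)

print(f"tiny bottleneck: {tiny:.4f}x, dominant block: {dominant:.4f}x")
```

Even an infinitely fast version of the 1% kernel could not exceed about 1.01x overall, which is why targeting the dominant operators is a precondition for the reported end-to-end gains.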

[Figure: Performance Comparison]

Deep Insight: Why This Matters

The most profound takeaway is the failure of "Format Checks." Previous researchers tried to use regex to stop LLMs from cheating. StitchCUDA proves that nuanced high-level reasoning (the Rubric) is required to evaluate high-performance code. The model learns not just to pass a test, but to "think" like a CUDA expert, recognizing when to use asynchronous memory copies or when a kernel fusion will actually yield results.

Conclusion & Limitations

StitchCUDA represents a major leap in automated systems programming. While it currently excels in NVIDIA environments, the methodology of Atomic Skill Decomposition and Expert Rubrics could theoretically be applied to any performance-critical domain, from FPGA HLS to kernel-level Linux optimization. Its primary limitation remains the reliance on high-end models (like GPT-5-2) for the Planner/Verifier roles, though the Coder remains a compact 32B model.


Author's Note: StitchCUDA proves that for the most difficult coding tasks, the secret isn't just more data; it's better "grading" (the Rubric) and a better "work process" (the Agentic Framework).
