This paper introduces CodePercept, a novel paradigm that establishes executable Python code as a fundamental medium for enhancing the visual perception of Multimodal Large Language Models (MLLMs) in STEM domains. By constructing the ICC-1M dataset and the STEM2Code-Eval benchmark, it achieves state-of-the-art results in STEM visual reasoning, demonstrating that perception, rather than reasoning, is the primary bottleneck for current models.
TL;DR
Why do even the most advanced AI models fail at simple geometry or physics problems? A groundbreaking new study reveals that the culprit isn't "bad logic"—it's weak vision. CodePercept shifts the paradigm from vague natural language descriptions to precise, executable Python code. By teaching models to "see" images as structured code, it shatters the previous SOTA on STEM benchmarks, proving that in the world of Science and Math, a line of code is worth a thousand words.
The "Descriptive Aphasia" Problem
In STEM domains, we often encounter what researchers call "descriptive aphasia." Imagine trying to describe the exact coordinates, angles, and intersections of a 3D polyhedral lattice using only natural language. It’s nearly impossible to be precise without being incredibly verbose and confusing.
Current MLLMs rely on captions "distilled" from teacher models like GPT-4. These captions are often hallucinated: a caption might claim there are 68 arrows when the image actually contains 63. The authors' first critical insight comes from their scaling analysis: scaling a model's perception capability yields far higher gains than simply scaling its reasoning engine.

Methodology: Code as the Source of Truth
CodePercept introduces a three-pronged data engine to build ICC-1M, a dataset of one million Image-Caption-Code triplets:
- Image Reproduction: Decomposing existing images into code using high-end MLLMs.
- Image Diversity: Extracting scientific principles (e.g., a "domino puzzle") and procedurally generating infinite new variations.
- Solid Geometry Synthesis: Since LLMs are notoriously bad at 3D spatial thinking, the authors used parametric templates to programmatically create 3D scenes with guaranteed accuracy.
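The parametric-template idea can be sketched as follows. This is a minimal illustration of our own, not the paper's actual templates: a cuboid generator renders a 3D scene with Matplotlib, and because every vertex coordinate is known by construction, any caption derived from the returned geometry is accurate by definition.

```python
# Minimal sketch of parametric 3D synthesis (our own toy template, not the
# paper's): the scene's ground-truth geometry is known exactly at render time.
import itertools

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def cuboid_scene(w: float, h: float, d: float, path: str) -> dict:
    """Render a w x h x d cuboid and return its exact ground-truth geometry."""
    verts = [(x, y, z) for x, y, z in itertools.product((0, w), (0, h), (0, d))]
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    edges = 0
    # draw every edge: pairs of vertices that differ in exactly one axis
    for a, b in itertools.combinations(verts, 2):
        if sum(p != q for p, q in zip(a, b)) == 1:
            ax.plot(*zip(a, b), color="black")
            edges += 1
    fig.savefig(path)
    plt.close(fig)
    return {"vertices": verts, "edge_count": edges, "dims": (w, h, d)}

truth = cuboid_scene(2.0, 1.0, 3.0, "cuboid.png")
print(truth["edge_count"])  # a cuboid always has 12 edges
```

Because the caption is generated from `truth` rather than from a model's guess, hallucinated counts or coordinates are impossible by design.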
The Learning Paradox: Code vs. Caption
The core of the methodology lies in Code-Grounded Learning.
- Code-Grounded Captions: Instead of asking a model to describe an image, the authors use an Execution Tracer to log exactly what the code rendered (coordinates, colors, labels). This "ground truth" log is used to fix hallucinations in AI-generated descriptions.
- Image-to-Code Translation: The model learns to output Matplotlib code that can actually recreate the input image.
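The Execution Tracer idea can be illustrated with a hedged sketch (our own minimal version; we assume the paper's tracer similarly hooks the rendering calls). Every point and text label the code actually draws is logged, and the log, not an MLLM's description, becomes the ground truth used to correct captions.

```python
# Minimal sketch of an "execution tracer" (our own illustration): wrap the
# rendering calls so that whatever is actually drawn is logged as ground truth.
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

trace = []  # ground-truth log of rendered elements

def traced_scatter(ax, xs, ys, **kw):
    trace.append({"op": "scatter", "points": list(zip(xs, ys))})
    return ax.scatter(xs, ys, **kw)

def traced_text(ax, x, y, s, **kw):
    trace.append({"op": "text", "xy": (x, y), "label": s})
    return ax.text(x, y, s, **kw)

fig, ax = plt.subplots()
traced_scatter(ax, [0, 1, 2], [0, 1, 4], color="red")
traced_text(ax, 1, 1, "vertex B")
plt.close(fig)

# The log tells us there are exactly 3 points -- no guessing required.
n_points = sum(len(e["points"]) for e in trace if e["op"] == "scatter")
print(n_points)  # 3
```

A caption fixer can then cross-check any AI-generated description ("there are five red points") against `trace` and rewrite the parts that disagree.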

Reinforcement Learning with Verifiable Rewards
To squeeze out maximum performance, the authors applied GRPO (Group Relative Policy Optimization). Unlike natural language, which offers only subjective feedback, code yields an objectively verifiable reward signal: does it run, and does it reproduce the image? By combining a Format Reward (the output matches the required code format), an Execution Reward (the code runs without crashing), and a Visual Similarity Reward (using GPT-4o as a "visual judge"), the model learns to refine its perceptual "blueprint" until the rendered image matches the input.
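Such a verifiable reward can be sketched as below. The equal weighting, the regex-based format check, and the stubbed-out visual-similarity term are all our own assumptions (the paper uses GPT-4o as the visual judge); only the structure of the three combined rewards follows the text.

```python
# Hedged sketch of a verifiable reward (weights and checks are our assumptions;
# visual similarity is passed in as a stub, standing in for the GPT-4o judge).
import os
import re
import subprocess
import sys
import tempfile

def format_reward(response: str) -> float:
    """1.0 if the response wraps its code in a ```python fence."""
    return 1.0 if re.search(r"```python\n.*?```", response, re.S) else 0.0

def execution_reward(code: str) -> float:
    """1.0 if the extracted code runs to completion without crashing."""
    if not code.strip():
        return 0.0
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=30)
        return 1.0 if proc.returncode == 0 else 0.0
    finally:
        os.unlink(path)

def total_reward(response: str, visual_sim: float = 0.0) -> float:
    m = re.search(r"```python\n(.*?)```", response, re.S)
    code = m.group(1) if m else ""
    # equal weights are our assumption, not the paper's
    return (format_reward(response) + execution_reward(code) + visual_sim) / 3.0

resp = "```python\nprint('hi')\n```"
print(total_reward(resp, visual_sim=1.0))  # 1.0
```

The key property is that two of the three terms are checked mechanically: a policy cannot "talk its way" to a high reward without producing code that parses and executes.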
Experimental Results: Seeing is Believing
The results on the new STEM2Code-Eval benchmark are telling. This benchmark requires the model to generate code to reconstruct 1,000 complex STEM images.
| Model | Image Score | Code Score | Exec Rate |
| :--- | :--- | :--- | :--- |
| GPT-5-Thinking | 64.97 | 64.98 | 96.60 |
| CodePercept-32B-R1 | 68.97 | 62.53 | 95.90 |
| Qwen3-VL-32B-Base | 36.85 | 39.98 | 81.80 |
Note the massive leap from the base Qwen3-VL model (36.85 image score) to CodePercept-R1 (68.97). By grounding the model in code, it essentially gains a "high-resolution" internal representation of mathematical visual elements.

Critical Insights & Future Outlook
Conclusion: CodePercept proves that for AI to master STEM, it must move beyond "vibe-based" language and embrace "structure-based" symbolic logic. The fact that an 8B model can rival 72B models simply by having better "perceptual captions" is a wake-up call for the industry.
Limitations: Currently, CodePercept is limited by the libraries it uses (primarily Matplotlib). Extending this to CAD formats or specialized physics engine code could be the next frontier in AI-driven engineering.
Takeaway: If you want your model to solve the problem, first make sure it can see the problem. And if you want it to see clearly, teach it to code.
