CodePercept: Breaking the Perception Bottleneck in STEM via Code-Grounded Learning
Abstract

This paper introduces CodePercept, a novel paradigm that establishes executable Python code as a fundamental medium for enhancing the visual perception of Multimodal Large Language Models (MLLMs) in STEM domains. By constructing the ICC-1M dataset and the STEM2Code-Eval benchmark, it achieves state-of-the-art results in STEM visual reasoning, demonstrating that perception, rather than reasoning, is the primary bottleneck for current models.

TL;DR

Why do even the most advanced AI models fail at simple geometry or physics problems? A groundbreaking new study reveals that the culprit isn't "bad logic"—it's weak vision. CodePercept shifts the paradigm from vague natural language descriptions to precise, executable Python code. By teaching models to "see" images as structured code, it shatters the previous SOTA on STEM benchmarks, proving that in the world of Science and Math, a line of code is worth a thousand words.

The "Descriptive Aphasia" Problem

In STEM domains, we often encounter what researchers call "descriptive aphasia." Imagine trying to describe the exact coordinates, angles, and intersections of a 3D polyhedral lattice using only natural language. It’s nearly impossible to be precise without being incredibly verbose and confusing.

Current MLLMs rely on "distilled" captions from teachers like GPT-4. These captions are often "hallucinated"—they might say there are 68 arrows when there are actually 63. The authors' first critical insight is shown in their scaling analysis: scaling a model's perception capability yields far higher gains than simply scaling its reasoning engine.

Figure: Perception Scaling Analysis

Methodology: Code as the Source of Truth

CodePercept introduces a three-pronged data engine to build ICC-1M (1 Million Image-Caption-Code triplets):

  1. Image Reproduction: Decomposing existing images into code using high-end MLLMs.
  2. Image Diversity: Extracting scientific principles (e.g., a "domino puzzle") and procedurally generating infinite new variations (a minimal sketch of this template-driven generation follows this list).
  3. Solid Geometry Synthesis: Since LLMs are notoriously weak at 3D spatial reasoning, the authors used parametric templates to programmatically create 3D scenes with guaranteed accuracy.
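
To make the data engine concrete, here is a minimal sketch of the parametric-template idea in Python/Matplotlib: sample a figure from a template and emit its ground truth directly from the generating parameters. The template, parameter ranges, and output schema are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of a parametric template in the spirit of the ICC-1M data
# engine. The template, parameter ranges, and record schema are assumptions;
# only the idea "generate from code so the ground truth is exact" is from the
# paper summary.
import json
import random
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt


def sample_triangle_instance(seed: int) -> dict:
    """Sample one synthetic geometry figure plus its exact ground truth."""
    rng = random.Random(seed)
    # Parametric template: three random vertices in a 10x10 region.
    pts = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(3)]

    fig, ax = plt.subplots(figsize=(4, 4))
    xs, ys = zip(*(pts + [pts[0]]))          # close the polygon
    ax.plot(xs, ys, color="tab:blue")
    for name, (x, y) in zip("ABC", pts):
        ax.annotate(name, (x, y))
    ax.set_aspect("equal")
    fig.savefig(f"triangle_{seed}.png")
    plt.close(fig)

    # The record is exact because it comes from the generating parameters,
    # not from a caption written after looking at the image.
    return {
        "image": f"triangle_{seed}.png",
        "vertices": {name: [round(x, 3), round(y, 3)]
                     for name, (x, y) in zip("ABC", pts)},
    }


if __name__ == "__main__":
    dataset = [sample_triangle_instance(seed) for seed in range(5)]
    print(json.dumps(dataset, indent=2))
```

Because each figure is produced from code, every coordinate in the record is exact by construction, which is what makes captions derived from it free of hallucinated details.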

The Learning Paradox: Code vs. Caption

The core of the methodology lies in Code-Grounded Learning.

  • Code-Grounded Captions: Instead of asking a model to describe an image, the authors use an Execution Tracer to log exactly what the code rendered (coordinates, colors, labels). This "ground truth" log is used to fix hallucinations in AI-generated descriptions (a sketch of the tracing idea follows this list).
  • Image-to-Code Translation: The model learns to output Matplotlib code that can actually recreate the input image.
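
As a rough illustration of what an execution-trace log can contain, the sketch below runs a plotting snippet and then reads the rendered Matplotlib artists back out of the figure to record coordinates, colors, and text labels. The function name, log format, and use of exec are assumptions for illustration; the paper's actual Execution Tracer may hook rendering differently.

```python
# Illustrative execution-trace idea: run plotting code, then inspect the
# resulting Matplotlib figure to log what was actually drawn.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt


def trace_rendered_figure(code: str) -> list[dict]:
    """Execute plotting code, then log the coordinates, colors, and labels
    it actually produced."""
    namespace = {"plt": plt}
    exec(code, namespace)  # run the snippet (a real system would sandbox this)
    log = []
    for fig_num in plt.get_fignums():
        fig = plt.figure(fig_num)
        for ax in fig.axes:
            for line in ax.get_lines():
                log.append({
                    "kind": "line",
                    "xdata": list(map(float, line.get_xdata())),
                    "ydata": list(map(float, line.get_ydata())),
                    "color": line.get_color(),
                })
            for text in ax.texts:
                log.append({
                    "kind": "text",
                    "content": text.get_text(),
                    "position": list(map(float, text.get_position())),
                })
    plt.close("all")
    return log


snippet = """
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], color="red")
ax.text(1, 1, "vertex B")
"""
print(trace_rendered_figure(snippet))
```

A caption-repair step can then cross-check every number and label in an AI-written description against this log before the triplet enters the training set.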

Figure: Overall Architecture

Reinforcement Learning with Verifiable Rewards

To squeeze out maximum performance, the authors applied GRPO (Group Relative Policy Optimization). Unlike natural language, which is judged subjectively, code provides a verifiable reward signal: does it execute? By combining a Format Reward (the response contains code in the expected format), an Execution Reward (the code runs without crashing), and a Visual Similarity Reward (using GPT-4o as a "visual judge"), the model learns to refine its perceptual "blueprint" until it is pixel-perfect.
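
A minimal sketch of such a composite, verifiable reward is shown below, assuming the policy's response carries its code in a fenced Python block. The regex, the reward weights, and the stubbed visual judge are illustrative assumptions; only the three components (format, execution, visual similarity) come from the summary above.

```python
# Sketch of a verifiable reward for image-to-code RL (GRPO-style rollouts).
# Only the three components are from the paper summary; the regex, weights,
# and subprocess-based execution check are illustrative assumptions.
import re
import subprocess
import sys
import tempfile

CODE_BLOCK = re.compile(r"`{3}python\n(.*?)`{3}", re.DOTALL)


def format_reward(response: str) -> float:
    """1.0 if the response contains exactly one fenced Python code block."""
    return 1.0 if len(CODE_BLOCK.findall(response)) == 1 else 0.0


def execution_reward(response: str, timeout: float = 10.0) -> float:
    """1.0 if the extracted code runs to completion without crashing."""
    match = CODE_BLOCK.search(response)
    if match is None:
        return 0.0
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(match.group(1))
        script = f.name
    try:
        proc = subprocess.run([sys.executable, script],
                              capture_output=True, timeout=timeout)
        return 1.0 if proc.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0


def visual_similarity_reward(rendered_png: bytes, target_png: bytes) -> float:
    """Placeholder for the 'visual judge' score in [0, 1]; a real system would
    ask a vision model to compare the rendered image with the target."""
    return 0.0  # stubbed for this sketch


def total_reward(response: str, rendered_png: bytes, target_png: bytes,
                 weights=(0.2, 0.3, 0.5)) -> float:
    """Weighted sum of the three signals; the weights are an assumption."""
    w_fmt, w_exec, w_vis = weights
    return (w_fmt * format_reward(response)
            + w_exec * execution_reward(response)
            + w_vis * visual_similarity_reward(rendered_png, target_png))
```

In practice the visual-similarity term would render the generated code and ask a vision model (the summary names GPT-4o) to score the result against the target image.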

Experimental Results: Seeing is Believing

The results on the new STEM2Code-Eval benchmark are telling. This benchmark requires the model to generate code to reconstruct 1,000 complex STEM images.

| Model | Image Score | Code Score | Exec Rate |
| :--- | :--- | :--- | :--- |
| GPT-5-Thinking | 64.97 | 64.98 | 96.60 |
| CodePercept-32B-R1 | 68.97 | 62.53 | 95.90 |
| Qwen3-VL-32B-Base | 36.85 | 39.98 | 81.80 |

Note the massive leap from the base Qwen3-VL model (36.85 image score) to CodePercept-R1 (68.97). By grounding the model in code, it essentially gains a "high-resolution" internal representation of mathematical visual elements.


Critical Insights & Future Outlook

Conclusion: CodePercept proves that for AI to master STEM, it must move beyond "vibe-based" language and embrace "structure-based" symbolic logic. The fact that an 8B model can rival 72B models simply by having better "perceptual captions" is a wake-up call for the industry.

Limitations: Currently, CodePercept is limited by the libraries it uses (primarily Matplotlib). Extending this to CAD formats or specialized physics engine code could be the next frontier in AI-driven engineering.

Takeaway: If you want your model to solve the problem, first make sure it can see the problem. And if you want it to see clearly, teach it to code.
