RVLM (Recursive Vision-Language Model) is a unified framework for medical AI that replaces traditional single-pass inference with an iterative "generate-execute" REPL loop. By integrating RRouter for adaptive iteration depth, it achieves state-of-the-art interpretability and efficiency on BraTS 2023 (Brain MRI) and MIMIC-CXR benchmarks without task-specific fine-tuning.
TL;DR
Medicine demands transparency, yet most AI models are "black boxes." RVLM (Recursive Vision-Language Model) changes the paradigm by replacing single-pass guesses with a verifiable Python-based reasoning loop. Coupled with RRouter, which scales the "thinking time" based on case complexity, this framework achieves human-like multi-step diagnostic reasoning that is fully auditable and efficient.
Context: Why "One-Shot" Isn't Enough for Radiology
When a radiologist looks at an MRI, they don't just glance and shout a diagnosis. They compare modalities (T1 vs. T2), zoom into suspicious regions, and measure volumes. Current VLMs (like GPT-4o or Gemini) try to do all this in a single "forward pass." This leads to two major issues:
- Interpretability Gap: You can't see why the model reached a conclusion.
- Resource Waste: The model spends the same "compute energy" on a clear-cut case as it does on a rare, complex tumor.
Methodology: The "Generate-Execute" Loop
RVLM embeds a multimodal model within a Visual REPL (Read-Eval-Print Loop). Instead of outputting text directly, the model writes Python code at each step to:
- Inspect: Specifically query different image modalities (e.g., looking for "dural tail signs" on T1ce).
- Manipulate: Crop regions of interest or compute "difference maps" between slices.
- Verify: Store observations in variables (t1_desc, edema_check) to build a structured evidence chain.
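The loop above can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the function names (run_visual_repl, vlm_generate, execute) and the final_diagnosis stop signal are assumptions made for clarity.

```python
# Hypothetical sketch of one generate-execute cycle. The VLM proposes
# Python code; the REPL runs it in a *persistent* namespace so that
# variables like t1_desc survive across iterations as an evidence chain.

def run_visual_repl(vlm_generate, execute, max_iters=6):
    """Iterate: the model writes code, the REPL executes it, and the
    result is fed back as context for the next generation step."""
    namespace = {}   # persistent: variables survive across iterations
    transcript = []  # auditable log of (step, code, result)
    for step in range(max_iters):
        code = vlm_generate(transcript)          # model proposes Python code
        result = execute(code, namespace)        # run it in the shared namespace
        transcript.append((step, code, result))  # record for the audit trail
        if namespace.get("final_diagnosis"):     # model signals it is done
            break
    return namespace.get("final_diagnosis"), transcript
```

The key design point is the shared namespace: because each iteration's variables persist, a clinician can later inspect exactly what the model stored at step 2 versus step 4.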
Figure 1: The RVLM System Architecture showing the iteration between the VLM controller and the persistent Python environment.
RRouter: Thinking Only as Much as Needed
To solve the "fixed-budget" problem, the authors introduced RRouter. It calculates a Complexity Score ($s$) based on:
- Label Entropy ($H$): How "messy" is the tumor's structure?
- Volume ($V$): How large is the lesion?
- Sub-region Count: How many different types of tissue are involved?
Simple cases get 3 iterations; complex ones get up to 6+. This mirrors clinical intuition: don't over-analyze a simple fracture, but take your time with a multifocal mass.
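A plausible shape for this routing is sketched below. The inputs (label entropy, volume, sub-region count) and the 3-to-6 iteration range come from the text; the weights, normalization constants, and thresholds are purely illustrative assumptions.

```python
# Illustrative RRouter sketch. Weights w, the volume reference v_ref,
# and the sub-region reference r_ref are assumed values, not the paper's.
import math

def complexity_score(label_probs, volume_cc, n_subregions,
                     w=(0.5, 0.3, 0.2), v_ref=50.0, r_ref=3):
    # Shannon entropy H of the label distribution ("messiness" of structure)
    H = -sum(p * math.log2(p) for p in label_probs if p > 0)
    h_norm = H / math.log2(len(label_probs))  # normalize to [0, 1]
    v_norm = min(volume_cc / v_ref, 1.0)      # lesion volume V, capped at 1
    r_norm = min(n_subregions / r_ref, 1.0)   # tissue sub-region count
    return w[0] * h_norm + w[1] * v_norm + w[2] * r_norm

def iteration_budget(s, floor=3, ceiling=6):
    # Map a score s in [0, 1] to an integer budget in [floor, ceiling]
    return floor + round(s * (ceiling - floor))
```

Under these assumed weights, a near-certain single-label case with a small lesion lands at the 3-iteration floor, while a high-entropy, large, multi-region mass is routed to the ceiling.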
Experimental Highlights: BraTS & MIMIC-CXR
The framework was tested on Brain MRIs (BraTS) and Chest X-rays (MIMIC-CXR) using Gemini 2.5 Flash as the engine.
1. The Power of Cross-Modal Verification
In a notable case (BraTS-MEN-00008-000), RVLM detected a hyperintense ring on the FLAIR modality but noticed the segmentation mask showed 0.00 cc of edema. It autonomously flagged this cross-modal discrepancy—a high-level reasoning task that single-pass models typically fail.
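The kind of check RVLM wrote in that case can be approximated as follows. This is a hedged sketch: the helper name and the tolerance threshold are invented for illustration, and only the FLAIR-versus-mask logic reflects the described case.

```python
# Illustrative cross-modal consistency check (function name and tol
# are assumptions, not the model's actual generated code).

def check_flair_vs_mask(flair_ring_present, edema_volume_cc, tol=0.01):
    """Flag a discrepancy when FLAIR shows a hyperintense ring but the
    segmentation mask reports essentially zero edema volume."""
    if flair_ring_present and edema_volume_cc < tol:
        return ("DISCREPANCY: FLAIR hyperintensity present but mask "
                f"reports {edema_volume_cc:.2f} cc of edema")
    return "consistent"
```

Because the check is just executable code over stored observations, the flag itself becomes part of the audit trail rather than an unexplained assertion.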
2. Efficiency Gains
RRouter successfully identified simple cases and capped their budget at 3 iterations, cutting API cost and latency by roughly 4x relative to always running at the maximum-iteration ceiling, with no loss of diagnostic accuracy.
Figure 2: An auto-generated structured PDF report produced by RVLM, complete with an audit trail of the reasoning steps.
Critical Insight: Trust-by-Design
The most significant contribution of RVLM isn't just "higher accuracy"—it's Regulatory Readiness. By grounding every claim in executable code, RVLM provides:
- Traceability: You can re-run the exact code used to generate a diagnosis.
- Human Oversight: Clinicians can inspect the intermediate Python variables to see what the AI "saw" at iteration 2 vs. iteration 4.
- Compliance: It maps directly to the transparency requirements of the EU AI Act and ISO 42001.
Limitations & Future Path
While groundbreaking, the study is currently limited by a small qualitative sample size and 2D slice analysis. The authors point towards learned routers (using ML to predict complexity) and 3D volumetric reasoning as the next frontiers.
Conclusion
RVLM proves that we don't need to choose between performance and interpretability. By treating diagnostic reasoning as a recursive, adaptive process rather than a static prediction, we can build AI systems that clinicians can actually trust.
