Mario: Multimodal Graph Reasoning with Large Language Models


[arXiv 2025] Mario: Unleashing LLM Reasoning on Multimodal Graphs with Adaptive Modality Routing

Abstract

Mario is a unified two-stage framework for multimodal graph (MMG) reasoning using Large Language Models. It introduces a graph-conditioned vision–language model for structure-aware alignment and a modality-adaptive instruction tuning mechanism, and it reports state-of-the-art (SOTA) results on node classification and link prediction across major benchmarks.

TL;DR

Mario is a framework designed to bridge the gap between Large Language Models (LLMs) and Multimodal Graphs (MMGs). It pairs a structure-aware alignment stage with a modality-adaptive instruction tuning stage to address two problems: "noisy" modality pairs, and the fact that the importance of each modality varies from node to node. It reports new SOTA results and does particularly well in zero-shot transfer scenarios, where traditional graph models struggle.

The "Disconnected" Reality of Multimodal Graphs

In the real world, multimodal data such as products on Amazon or posts on Reddit does not exist in a vacuum. These entities are interlinked through co-purchases, comments, or citations. However, current Vision-Language Models (VLMs) like CLIP typically treat an image and its text description as an isolated pair.

The authors identify two critical failures in this approach:

Weak Cross-modal Consistency (C1): A product image might focus on a warranty logo while the text describes the product's technical specs. Without neighboring nodes to "bridge" the context, the VLM cannot align these views effectively.
Heterogeneous Modality Preference (C2): Not all nodes provide equal value in all modalities. For some nodes, the image is worth a thousand words; for others, the image is just noise.

Methodology: The Two-Stage Powerhouse

Stage 1: Graph-Conditioned Alignment

Instead of relying on frozen embeddings, Mario uses a Topology-Aware Multimodal Mixer. This component injects graph structural bias (like shortest-path distances) directly into the Transformer layers of the vision and text encoders. This ensures that the resulting embeddings for a node are influenced by its neighbors, effectively "denoising" the representation before it ever reaches the LLM.

Figure: Mario Framework Architecture
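To make the idea of a structural bias concrete, here is a minimal sketch of one common way to inject shortest-path distances into attention: add a per-hop scalar to the attention score before the softmax. This is an illustrative assumption, not Mario's actual mixer. The function names, the toy path graph, and the `bias_per_hop` values are all hypothetical; in a trained model the bias values would be learned.

```python
import math
from collections import deque


def shortest_path_hops(adj, source):
    """BFS hop distance from `source` to every node; unreachable nodes get None."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return [dist.get(v) for v in range(len(adj))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(query, keys, hops, bias_per_hop, max_hop):
    """Attention weights of one query node over all nodes, with a distance bias."""
    scale = math.sqrt(len(query))
    scores = []
    for key, hop in zip(keys, hops):
        dot = sum(q * k for q, k in zip(query, key)) / scale
        # Unreachable or distant nodes share the last bias bucket.
        bucket = max_hop if hop is None else min(hop, max_hop)
        scores.append(dot + bias_per_hop[bucket])
    return softmax(scores)


# Hypothetical toy graph: a path 0 - 1 - 2 - 3, stored as an adjacency list.
adj = [[1], [0, 2], [1, 3], [2]]
# Identical features, so only the structural bias can separate the nodes.
features = [[1.0, 0.0] for _ in range(4)]
# Hypothetical bias table indexed by hop count (learned in a real model).
bias_per_hop = [0.0, -0.5, -1.0, -2.0]

hops = shortest_path_hops(adj, source=0)
weights = biased_attention_weights(
    features[0], features, hops, bias_per_hop, max_hop=3
)
```

Because every node has the same features here, the content term of the score is constant, and the attention weights from node 0 fall off purely with hop distance. That is the sense in which a node's representation is shaped by its neighborhood.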

Stage 2: Modality-Adaptive Prompt Routing (MAPR)

The most distinctive part of Mario is MAPR. Rather than forcing the LLM to look at both text and image for every node, a lightweight router analyzes the node's features and its neighborhood, then picks one of three views:

Text-only
Image-only
Multimodal (Both)

The router is trained using a teacher-student paradigm: the LLM's own loss on each view (a lower loss means the correct answer is more likely under that view) guides the router to favor the most informative modality for that specific node.
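The training signal described above can be sketched as a small distillation objective. The sketch assumes that the per-view LLM losses are turned into a soft target with a softmax over negative losses, and that the router is pulled toward that target with a KL divergence. Both choices, along with the names `teacher_targets`, `router_kl_loss`, and `route` and all numeric values, are hypothetical illustrations rather than the paper's exact formulation. The router itself is left out: the sketch starts from its output logits.

```python
import math

VIEWS = ["text", "image", "multimodal"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_targets(view_losses, temperature=1.0):
    """Soft target from per-view LLM losses: lower loss gives higher probability."""
    return softmax([-loss / temperature for loss in view_losses])


def router_kl_loss(router_logits, view_losses, temperature=1.0):
    """KL(teacher || router): distance of the router from the loss-derived target."""
    target = teacher_targets(view_losses, temperature)
    predicted = softmax(router_logits)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


def route(router_logits):
    """At inference time, pick the single view with the highest router score."""
    best = max(range(len(VIEWS)), key=lambda i: router_logits[i])
    return VIEWS[best]


# Hypothetical LLM losses for one node under each view: the image view fits best.
view_losses = [2.3, 0.4, 0.9]
target = teacher_targets(view_losses)

# Two hypothetical router outputs: one agrees with the teacher, one does not.
good_router = [-1.0, 1.5, 0.8]
bad_router = [2.0, -1.0, -1.0]

good_loss = router_kl_loss(good_router, view_losses)
bad_loss = router_kl_loss(bad_router, view_losses)
chosen_view = route(good_router)
```

In this setup the router that ranks the image view highest incurs a much smaller KL penalty than the one that prefers text, so gradient descent on this loss would move the router toward the view the LLM handles best for that node.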

Experimental Breakthroughs

Mario was tested across E-commerce, Social Media, and Literature datasets. The results are striking:

Consistent SOTA: Outperforms baselines like GraphGPT and LLaGA across the board.
Zero-Shot Dominance: When trained on toy-product data and tested on movie and book data, Mario retains its reasoning ability, suggesting it learns "how to reason about structure" rather than just memorizing labels.

Figure: Performance Comparison

Insights & Visual Evidence

The authors provide a fascinating visualization of "modality homophily." Their analysis shows that nodes in a local cluster often share the same modality preference. For example, in an "Arts" graph, a cluster of related products might all be best understood through their visual attributes rather than their sparse text descriptions.

Figure: Modality Preference Visualization
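One simple way to quantify the "modality homophily" described above is an edge-level ratio: the fraction of edges whose two endpoints prefer the same view. The source does not say how the authors measure it, so this metric, the function name `modality_homophily`, and the toy graph below are assumptions for illustration only.

```python
def modality_homophily(edges, preferred_view):
    """Fraction of edges whose two endpoints prefer the same view."""
    if not edges:
        return 0.0
    same = sum(1 for u, v in edges if preferred_view[u] == preferred_view[v])
    return same / len(edges)


# Hypothetical toy graph: a triangle of image-preferring nodes (0, 1, 2)
# attached to a pair of text-preferring nodes (3, 4).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
preferred_view = ["image", "image", "image", "text", "text"]

score = modality_homophily(edges, preferred_view)
```

Here four of the five edges connect nodes with the same preference, so the ratio is 0.8. A value near 1 would indicate strongly clustered preferences, which is the pattern the visualization is said to show.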

Conclusion: A New Path for MLLMs

Mario suggests that LLMs are capable of sophisticated graph reasoning if we respect the inherent structure and the varying "richness" of multimodal inputs. By moving away from static prompting to an adaptive, structure-aware routing system, Mario offers a blueprint for next-generation recommendation systems and knowledge graph reasoners.

Key Takeaway: Don't treat all modalities as equal for every data point. Let the model decide what is "informative" based on the graph context.
