MATHNET: The Benchmark That Proves AI Still Can't "Browse" Mathematics
Abstract

The paper introduces MATHNET, a large-scale (30K+ problems) multimodal and multilingual benchmark for Olympiad-level mathematical reasoning and retrieval. Spanning 47 countries and 17 languages, it defines tasks in problem solving and symbolic-aware retrieval that challenge current state-of-the-art models, with Gemini-3.1-Pro currently leading at 78.4% accuracy.

TL;DR

Researchers from MIT, KAUST, and other institutions have released MATHNET, a dataset of over 30,000 Olympiad-level math problems from 47 countries. Beyond problem solving, the paper exposes a critical weakness in modern AI: mathematical retrieval. While models like Gemini-3.1-Pro are becoming formidable solvers, today's embedding models are effectively "math-blind," failing to recognize identical problems when one symbol is simply swapped for another.

Figure: Overview of MATHNET

The Motivation: Moving Beyond "AoPS Contamination"

For years, the AI community has relied on platforms like Art of Problem Solving (AoPS) for math data. However, these are often noisy, informal, and increasingly "leaked" into LLM training sets.

MATHNET takes a harder path: it harvests official national booklets from nearly 50 countries (1985–2025). This isn't just a dataset; it's a global archive of human ingenuity. The authors argue that a true AI mathematician must do more than calculate; it must exhibit Analogical Reasoning—the ability to realize that a problem in combinatorial geometry might share the exact same structure as a problem in number theory.

The Core Innovation: A Taxonomy of Mathematical Similarity

The most insightful part of MATHNET is its definition of "Similarity." The authors categorize how math problems relate into three modes:

  1. Invariance: Strict equivalence (e.g., the same problem restated with different symbols).
  2. Resonance: Partial similarity where the same "trick" or Lemma applies.
  3. Affinity: Just being in the same neighborhood (e.g., both are prime number problems).

The Math-Aware Retrieval Gap

The researchers tested 27 models and the results were humbling. While Gemini-3.1-Pro hit a staggering 78.4% on solving, most embedding models (used for search/RAG) failed the retrieval test. If you provide a retriever with a problem and ask for its "invariant" twin, the models are often distracted by "hard negatives"—problems that look similar (share keywords like "triangle") but are mathematically unrelated.
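The "hard negative" effect can be illustrated with a toy example. The sketch below is not the paper's evaluation: it uses a plain bag-of-words cosine similarity as a stand-in for a lexically biased retriever, and the three problem statements are invented for illustration. It shows how a problem that shares surface vocabulary can outscore a restatement of the same mathematical fact.

```python
import math
import re
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two strings."""
    ta = Counter(re.findall(r"[a-z]+", a.lower()))
    tb = Counter(re.findall(r"[a-z]+", b.lower()))
    dot = sum(count * tb[word] for word, count in ta.items())
    norm_a = math.sqrt(sum(c * c for c in ta.values()))
    norm_b = math.sqrt(sum(c * c for c in tb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical problem statements, invented for this illustration.
query = "Prove that in any triangle the sum of two sides exceeds the third side."
# Same fact (the triangle inequality), phrased with different words and symbols.
invariant_twin = (
    "Let a, b, c be the side lengths of a three-sided polygon. Show that a + b > c."
)
# Shares many keywords with the query but is about angles, not side lengths.
hard_negative = (
    "Prove that in any triangle the sum of the three angles equals two right angles."
)

twin_score = bow_cosine(query, invariant_twin)
negative_score = bow_cosine(query, hard_negative)
print(f"invariant twin: {twin_score:.3f}")
print(f"hard negative:  {negative_score:.3f}")
```

A purely lexical scorer ranks the hard negative above the invariant twin here, which is the failure pattern the paper attributes to current embedding models.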

Figure: MathNet Extraction Pipeline

Methodology: The 3-Stage "Gold-Standard" Pipeline

To ensure the quality of MATHNET, the authors used a three-stage pipeline:

  • OCR & Parsing: Documents were converted to Markdown using dots-ocr.
  • LLM Alignment: A GPT-4/Gemini ensemble matched problems to solutions, even when they were scattered across different sections of 25,000 scanned pages.
  • Human Verification: Every problem pair in the RAG dataset was curated by expert mathematicians (including many IMO team leaders).
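The alignment step is the least obvious of the three, so here is a minimal sketch of what "matching problems to solutions across sections" means. This is a rule-based stand-in written for illustration, not the LLM ensemble the authors describe; the heading pattern (`Problem N` / `Solution N`), the `Pair` type, and the `align` function are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Pair:
    number: int
    problem: str
    solution: Optional[str]  # None when no matching solution was found


# Assumed heading convention: optional Markdown '#'s, then "Problem 3" or "Solution 3".
HEADING = re.compile(r"^#*\s*(Problem|Solution)\s+(\d+)\.?\s*$")


def align(markdown: str) -> List[Pair]:
    """Pair each numbered problem with the solution that carries the same number."""
    bodies: Dict[str, Dict[int, List[str]]] = {"Problem": {}, "Solution": {}}
    current: Optional[List[str]] = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            kind, number = match.group(1), int(match.group(2))
            current = bodies[kind].setdefault(number, [])
        elif line.startswith("#"):
            current = None  # any other heading ends the current block
        elif current is not None:
            current.append(line)

    def text(lines: List[str]) -> str:
        return "\n".join(lines).strip()

    pairs = []
    for number, lines in sorted(bodies["Problem"].items()):
        solution = bodies["Solution"].get(number)
        pairs.append(Pair(number, text(lines), text(solution) if solution else None))
    return pairs


sample = """# Problems
## Problem 1
Find all primes p such that p + 2 is also prime and p < 10.
## Problem 2
Show that n^2 + n is even for every integer n.
## Problem 3
Determine all functions f with f(x + y) = f(x) + f(y).
# Solutions
## Solution 2
n^2 + n = n(n + 1) is a product of two consecutive integers.
## Solution 1
Checking p = 2, 3, 5, 7 gives p = 3 and p = 5.
"""

pairs = align(sample)
for pair in pairs:
    print(pair.number, "->", pair.solution)
```

Real booklets rarely follow one clean heading convention, which is presumably why the authors relied on LLM-based alignment followed by human verification rather than rules like these.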

Experiments: Solving vs. Searching

The benchmarks are categorized into three tasks:

  • MathNet-Solve: Direct problem sets. Geometry and Discrete Math remain the "final bosses," where even GPT-5's performance dips compared to Algebra.
  • MathNet-Retrieve: Testing embeddings. Results show that current vectors prioritize lexical overlap over symbolic logic.
  • MathNet-RAG: Retrieval-augmented solving. Results show that a model's reasoning performance improves markedly when it is given the expert-retrieved related problem.
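To make the retrieval task concrete, the sketch below scores a retriever with Recall@k. The summary above does not say which metric MathNet-Retrieve reports, so Recall@k is an assumption chosen because it is a common retrieval metric; the problem IDs and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of the relevant items that appear among the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    rankings: Dict[str, List[str]], gold: Dict[str, List[str]], k: int
) -> float:
    """Average Recall@k over all queries that have a gold annotation."""
    scores = [recall_at_k(rankings[q], gold[q], k) for q in gold]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical retriever output: for each query, candidate IDs sorted by score.
rankings = {
    "q1": ["p7", "p2", "p9"],  # the invariant twin p2 is ranked second
    "q2": ["p4", "p1", "p3"],  # the invariant twin p4 is ranked first
}
gold = {"q1": ["p2"], "q2": ["p4"]}

print(mean_recall_at_k(rankings, gold, k=1))
print(mean_recall_at_k(rankings, gold, k=2))
```

In this toy run, a hard negative (`p7`) ranked above the true twin lowers the score at k=1 but not at k=2, which is how lexically distracted retrievers show up in such a metric.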

Figure: Experiment Results Comparison

Critical Insight: The "Retriever" Bottleneck

The paper suggests that our bottleneck for "AGI in Math" isn't just the LLM's reasoning engine (the "brain"), but its library system (the "retriever"). If an AI cannot recognize that it has solved a structurally identical problem before, it must re-solve every problem from first principles—a highly inefficient path compared to how human Olympians operate.

Conclusion & Future Outlook

MATHNET sets a new bar for what a "high-quality" dataset looks like. It tells us that:

  1. Multimodality is still weak: Figures in geometry problems don't help models as much as they should.
  2. Symbolic-Aware Embeddings are needed: We need new ways to represent math in vector space that respect algebraic transformations.
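As a small illustration of what "respecting algebraic transformations" could mean at its simplest, the sketch below canonicalizes variable names before comparison, so that two formulas differing only by a consistent renaming map to the same string. This is not a method from the paper; `canonicalize` and the `RESERVED` set are assumptions, and the approach handles only renaming and whitespace, not commutativity or deeper algebraic equivalence.

```python
import re

# Function names that should not be treated as variables (illustrative, not exhaustive).
RESERVED = {"sin", "cos", "tan", "log", "exp", "sqrt"}


def canonicalize(expr: str) -> str:
    """Rename variables to v0, v1, ... in order of first appearance."""
    mapping = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name in RESERVED:
            return name
        if name not in mapping:
            mapping[name] = f"v{len(mapping)}"
        return mapping[name]

    compact = re.sub(r"\s+", "", expr)  # ignore whitespace differences
    return re.sub(r"[A-Za-z_]\w*", rename, compact)


first = canonicalize("x^2 + y^2 = z^2")
second = canonicalize("a^2+b^2=c^2")
print(first)
print(second)
print(first == second)
```

A lookup keyed on such a canonical form would catch pure renamings, but it would still miss reordered terms such as `y^2 + x^2`, which is one reason learned symbolic-aware representations are an open problem rather than a string-processing exercise.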

The takeaway for researchers? Stop scraping forums and start building systems that can understand the "DNA" of a formula.


For more details, visit: mathnet.mit.edu
