GLM-OCR Technical Report

[Technical Report] GLM-OCR: The 0.9B Powerhouse Redefining Efficient Document Understanding

Abstract

GLM-OCR is a compact 0.9B-parameter multimodal model (CogViT-0.4B + GLM-0.5B) designed for high-efficiency document understanding. It introduces a Multi-Token Prediction (MTP) mechanism and a two-stage parallel pipeline, achieving SOTA performance on OmniDocBench v1.5 (94.6) and superior results in table, formula, and seal recognition.

TL;DR

The document intelligence field has long been caught between "accurate but heavy" MLLMs and "fast but rigid" traditional OCR pipelines. GLM-OCR shatters this trade-off. By combining a compact 0.9B architecture with Multi-Token Prediction (MTP) and a smart two-stage layout-aware pipeline, it achieves State-of-the-Art (SOTA) results—even beating 235B-parameter models—while being light enough for edge deployment.

Motivation: Why Scale Isn't Everything in OCR

In the era of LLMs, the instinct is often to throw more parameters at a problem. However, the authors of GLM-OCR identified three critical bottlenecks in current document understanding:

Hallucinations in Complex Layouts: Small models often lose track when faced with multi-column text or dense tables.
The Autoregressive Penalty: Predicting one token at a time is slow and "blind" to local structural dependencies like closing a LaTeX bracket or a Markdown table tag.
Deployment Barriers: Real-world industrial use (invoices, contracts) requires high throughput and low cost, which GPU-heavy models cannot provide.

Methodology: The "Speed-First" Architecture

The GLM-OCR framework is built on a "Divide and Conquer" philosophy.

1. The Core Model: LLM meets MTP

The model consists of a 0.4B CogViT encoder and a 0.5B GLM decoder. The "secret sauce" is the Multi-Token Prediction (MTP). Instead of predicting just the next token, the model uses auxiliary heads to predict multiple future tokens simultaneously.

Why it works: In OCR, structure is deterministic. If you start a table row, the next few tokens are highly predictable. MTP exploits this, generating an average of 5.2 tokens per step.
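To make the "tokens per step" idea concrete, here is a toy draft-and-verify loop. It is only a sketch under stated assumptions: the accept-the-matching-prefix rule, the function names, and the hand-written "draft head" are all hypothetical illustrations, not the decoding procedure described in the report.

```python
from typing import Callable, List, Tuple

# Toy reference output: a two-row Markdown table, already tokenized.
REFERENCE = ["|", "Item", "|", "Qty", "|", "\n",
             "|", "---", "|", "---", "|", "\n", "<eos>"]
STRUCTURAL = {"|", "\n", "---", "<eos>"}


def target_next(prefix: List[str]) -> str:
    """Stand-in for the main next-token head (always correct in this toy)."""
    return REFERENCE[len(prefix)]


def draft_next_k(prefix: List[str], k: int) -> List[str]:
    """Stand-in for auxiliary heads: right on structure, wrong on content."""
    window = REFERENCE[len(prefix):len(prefix) + k]
    return [tok if tok in STRUCTURAL else "<unk>" for tok in window]


def decode(target: Callable[[List[str]], str],
           draft: Callable[[List[str], int], List[str]],
           k: int = 4, max_tokens: int = 64,
           eos: str = "<eos>") -> Tuple[List[str], int]:
    """Accept drafted tokens while they agree with the main head.

    Each step emits at least one token: on the first disagreement the
    main head's token is kept and the step ends.
    """
    out: List[str] = []
    steps = 0
    while len(out) < max_tokens and (not out or out[-1] != eos):
        steps += 1
        for drafted in draft(out, k):
            expected = target(out)
            out.append(expected)
            if drafted != expected or expected == eos or len(out) >= max_tokens:
                break
    return out, steps


tokens, steps = decode(target_next, draft_next_k)
tokens_per_step = len(tokens) / steps
```

In this toy run the 13 tokens are produced in 5 steps (2.6 tokens per step) instead of the 13 steps a one-token-at-a-time decoder would need. The gain comes entirely from the structural tokens being easy to guess, which is the intuition behind the 5.2 figure quoted above.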

2. The Two-Stage Pipeline

Rather than feeding a whole A4 page into the VLM, GLM-OCR first uses PP-DocLayout-V3 to detect regions (tables, formulas, text) and then runs recognition on each region separately.

Parallelism: These regions can be processed in parallel.
Stability: Breaking the page into simpler sub-problems drastically reduces the chance of the model "hallucinating" or skipping lines.
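The detect-then-recognize flow can be sketched in a few lines. Everything below is a hypothetical stand-in: `detect_layout` and `recognize_region` are stubs invented for illustration (they are not the PP-DocLayout-V3 or GLM-OCR APIs), and a thread pool is just one assumed way to express the parallelism.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    order: int  # reading-order index assigned by the layout stage
    kind: str   # "text", "table", or "formula"
    crop: str   # stand-in for the cropped image pixels


def detect_layout(page: List[Tuple[str, str]]) -> List[Region]:
    """Stage 1 (hypothetical stub): split a page into typed regions."""
    return [Region(order=i, kind=kind, crop=crop)
            for i, (kind, crop) in enumerate(page)]


def recognize_region(region: Region) -> str:
    """Stage 2 (hypothetical stub): one recognition call per region."""
    if region.kind == "formula":
        return f"$$ {region.crop} $$"
    if region.kind == "table":
        return f"| {region.crop} |"
    return region.crop


def parse_page(page: List[Tuple[str, str]], max_workers: int = 4) -> str:
    # Sort once so the final document follows reading order.
    regions = sorted(detect_layout(page), key=lambda r: r.order)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, whatever the finish order.
        outputs = list(pool.map(recognize_region, regions))
    return "\n\n".join(outputs)


toy_page = [("text", "Quarterly report"),
            ("table", "Q1 | 42"),
            ("formula", "E = mc^2")]
markdown = parse_page(toy_page)
```

Note that a region the detector never returns is never recognized, so the quality of stage 1 bounds the quality of the whole pipeline.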

[Figure: Overall Architecture]

Experiments: Punching Above Its Weight Class

GLM-OCR was tested against far larger models such as Qwen3-VL (235B) and Gemini-3 Pro. The headline results:

OmniDocBench v1.5: Ranked #1 with 94.6, surpassing models more than 100x its size (235B vs. 0.9B parameters).
Specialized Excellence: It showed massive leads in Seal Recognition (90.5 vs. 63.0 for the next best open model) and Table Recognition (TEDS score of 93.96).

Throughput and Cost

In production environments, speed is currency. GLM-OCR reaches a throughput of 1.86 pages/second for PDFs, nearly 4x faster than MinerU 2.5. Furthermore, its MaaS API pricing (0.2 RMB per million tokens) makes it roughly 1/10th the cost of traditional OCR systems.
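A quick back-of-the-envelope calculation shows what these numbers mean for a batch job. The throughput and price come from the figures above; the tokens-per-page value is a made-up assumption for illustration, and the sketch ignores how the API actually counts billable tokens.

```python
PAGES_PER_SECOND = 1.86            # PDF throughput quoted above
RMB_PER_MILLION_TOKENS = 0.2       # MaaS API price quoted above
ASSUMED_TOKENS_PER_PAGE = 1_500    # hypothetical, not from the report


def pages_per_hour(pages_per_second: float = PAGES_PER_SECOND) -> float:
    """Convert per-second throughput to pages per hour."""
    return pages_per_second * 3600


def hours_for(pages: int, pages_per_second: float = PAGES_PER_SECOND) -> float:
    """Wall-clock hours to process `pages` on a single pipeline."""
    return pages / pages_per_hour(pages_per_second)


def cost_rmb(tokens: int, price: float = RMB_PER_MILLION_TOKENS) -> float:
    """API cost in RMB for a given number of billed tokens."""
    return tokens / 1_000_000 * price


batch_pages = 1_000_000
batch_hours = hours_for(batch_pages)
batch_cost = cost_rmb(batch_pages * ASSUMED_TOKENS_PER_PAGE)
```

Under these assumptions, one pipeline handles about 6,696 pages per hour, so a million pages take roughly 149 hours and cost about 300 RMB in API fees. Changing `ASSUMED_TOKENS_PER_PAGE` scales the cost linearly.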

[Figure: Performance Comparison]

Real-World Versatility

GLM-OCR isn't just a benchmark chaser. It excels in:

Formula Recognition: Converting complex scientific notations into valid LaTeX.
Table Recovery: Preserving hierarchical headers and merged cells.
KIE (Key Information Extraction): Directly outputting JSON for invoices and forms under zero-shot prompt control.
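For the KIE case, the calling code typically has to build a field list into the prompt and then check the JSON that comes back. The sketch below shows one way to do that; the prompt wording, the field names, and the simulated response are all hypothetical, since the actual prompt format is not given here.

```python
import json
from typing import Dict, List, Optional


def build_kie_prompt(fields: List[str]) -> str:
    """Build a zero-shot prompt asking for the given fields as JSON.

    The wording is an assumption, not the model's documented prompt format.
    """
    schema = {name: "" for name in fields}
    return ("Extract the following fields from the document "
            "and answer with JSON only:\n" + json.dumps(schema, indent=2))


def parse_kie_response(raw: str, fields: List[str]) -> Dict[str, Optional[str]]:
    """Parse the model's reply, keeping only requested fields.

    Missing fields become None; non-JSON or non-object replies raise ValueError.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {name: data.get(name) for name in fields}


invoice_fields = ["invoice_number", "date", "total"]
prompt = build_kie_prompt(invoice_fields)

# Simulated model reply; a real call would send `prompt` plus the image.
simulated_reply = '{"invoice_number": "INV-001", "total": "128.50", "note": "x"}'
record = parse_kie_response(simulated_reply, invoice_fields)
```

Here `record` keeps `invoice_number` and `total`, sets the missing `date` to `None`, and drops the unrequested `note` key, which makes downstream validation straightforward.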

[Figure: Formula Recognition Example]

Critical Insight & Conclusion

The success of GLM-OCR marks a shift in AI strategy: efficiency via structural alignment. By acknowledging that OCR is a deterministic, layout-dependent task, the authors moved away from "brute-force scaling" and toward "architectural optimization."

Limitations to Watch: As a two-stage model, its primary weakness is error propagation. If the layout detector misses a table, the recognition engine never sees it. Even so, for most enterprise applications, the 0.9B size and cost-efficiency make it a strong candidate for mass deployment in 2026.

Takeaway: GLM-OCR proves that a tiny, well-optimized model with "look-ahead" (MTP) capabilities can outperform the largest general-purpose models in specialized vertical domains.
