[ArXiv 2026] SCORE: Curing MLLM "Context Rot" via Reinforcement Learning-Based Token Compression
Abstract

The paper introduces SCORE (Surprise-augmented token COmpression via REinforcement learning), a framework designed to compress visual tokens for Multimodal Large Language Models (MLLMs) in video understanding. It employs a lightweight policy network optimized via Reinforcement Learning to achieve a 16x prefill speedup while maintaining 99.5% of original performance, even surpassing uncompressed baselines in certain settings.

TL;DR

The sheer volume of visual tokens in long-video understanding leads to "context rot," where LLMs lose the ability to reason as sequences grow too long. SCORE solves this by inserting a lightweight RL-based compressor that learns exactly which tokens to keep. The result? A staggering 16x speedup in prefill time and, surprisingly, better-than-vanilla accuracy from stripping away visual noise.

The Motivation: Drowning in Redundant Pixels

Current Multimodal Large Language Models (MLLMs) are brilliant but inefficient. A single long video can generate over 40,000 visual tokens. Most of these are "junk": static walls, clear skies, or repetitive frames where nothing happens.

The authors identify two fatal flaws in current solutions:

  1. Transformation-based (Pooling/Conv): These are "blind" operators that squash data uniformly, often destroying fine-grained details.
  2. Heuristic-based: These use fixed rules (like similarity clustering) that aren't aware of what the LLM actually needs to answer a specific question.

The "Context Rot" problem means that as you add these junk tokens, the LLM's performance actually drops because the "needle" (the key action) is buried in a "haystack" of thousands of irrelevant tokens.

Methodology: Mining the "Surprise"

SCORE (Surprise-augmented token COmpression via REinforcement learning) bridges the gap between efficiency and semantics.

1. The Surprise-Augmented State

To help the model distinguish between a static background and a moving person, the authors introduce Inter-frame Residuals ($\Delta X$). The policy network looks at both the current token and the change from the previous frame. This "surprise" signal acts as a high-pass filter for motion.
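
To make this concrete, here is a minimal PyTorch-style sketch of a surprise-augmented keep/drop scorer. The tensor layout ([frames, tokens, dim]), the two-layer scorer, and all names are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class SurprisePolicy(nn.Module):
    """Illustrative keep/drop scorer over surprise-augmented token states."""

    def __init__(self, dim: int):
        super().__init__()
        # Scores each token from the concatenation [x_t ; delta_x_t].
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [T, N, D] = frames x tokens-per-frame x hidden dim (assumed).
        # Inter-frame residual: delta_x_t = x_t - x_{t-1} (zeros for frame 0),
        # a cheap "surprise" signal that is large only where content changes.
        delta = torch.zeros_like(tokens)
        delta[1:] = tokens[1:] - tokens[:-1]
        state = torch.cat([tokens, delta], dim=-1)  # surprise-augmented state
        return self.scorer(state).squeeze(-1)       # keep logits, [T, N]

policy = SurprisePolicy(dim=1024)
logits = policy(torch.randn(64, 196, 1024))  # e.g. 64 frames x 196 tokens
keep_prob = torch.sigmoid(logits)            # per-token keep probability
```

Because the residual is near zero on static backgrounds, an unchanging wall adds almost no surprise to its own keep score, which gives the scorer the high-pass motion cue described above.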

2. Reinforcement Learning with Split-Advantage

Since "token selection" is a discrete (keep/drop) operation, it is non-differentiable. SCORE uses RL to optimize the policy directly based on the LLM's output quality.

  • Group Rollouts: It samples multiple different "masks" (keep/drop decisions) for the same video and scores each one.
  • Split-Advantage: It rewards sparsity ONLY if the performance remains high, preventing the model from over-pruning to chase speed at the cost of intelligence (a minimal sketch of this loop follows the list).
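
A hedged sketch of how such a group-rollout update with a split advantage could look. The Bernoulli mask sampling, the 0.95 quality floor, the group-mean baseline, and the REINFORCE-style loss are all assumptions standing in for the paper's exact reward and optimizer:

```python
import torch

def split_advantage(quality, sparsity, quality_floor=0.95):
    """Reward sparsity only when answer quality stays high (assumed gating).

    quality, sparsity: [G] scores for G rollouts of the same video
    (quality = LLM answer score, sparsity = fraction of tokens dropped).
    """
    reward = quality + (quality >= quality_floor).float() * sparsity
    # Group-relative advantage: compare each rollout to its group mean.
    return reward - reward.mean()

def policy_gradient_loss(logits, masks, advantage):
    """REINFORCE-style loss for non-differentiable keep/drop masks."""
    dist = torch.distributions.Bernoulli(logits=logits)
    log_prob = dist.log_prob(masks).sum(dim=(1, 2))  # [G]
    return -(advantage.detach() * log_prob).mean()

# One update step: sample G masks for one video, score, backpropagate.
G, T, N = 8, 64, 196
logits = torch.randn(T, N, requires_grad=True)  # one policy output per video
masks = torch.distributions.Bernoulli(logits=logits).sample((G,))  # G rollouts
quality = torch.rand(G)                    # stand-in for LLM answer quality
sparsity = 1.0 - masks.mean(dim=(1, 2))    # more dropped tokens = sparser
loss = policy_gradient_loss(logits, masks, split_advantage(quality, sparsity))
loss.backward()
```

The gating term is the key design choice: sparsity earns reward only above the quality floor, so the cheapest policy that still answers correctly wins, rather than the sparsest policy outright.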

(Figure: Architectural overview of SCORE.)

3. The Pseudo-to-Real Curriculum

Learning to prune in a complex video is hard. SCORE starts with "Pseudo-videos"—stitched images with high-contrast changes. This "warm-up" teaches the model the basic intuition: "Big change = keep; No change = drop." It then refines this on real-world datasets.
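
One plausible way to construct such a warm-up set is sketched below. Stitching a few unrelated images into a "pseudo-video" makes the target signal trivial: inter-frame residuals are near zero inside a segment and spike at segment boundaries. The segment length and the first-frame-only labels are illustrative assumptions, not the paper's exact recipe:

```python
import torch

def make_pseudo_video(images, repeats=8):
    """Stitch unrelated images into a pseudo-video with obvious boundaries.

    images: list of [C, H, W] tensors. Each image is held for `repeats`
    frames, so residuals vanish within a segment and spike at boundaries,
    encoding the warm-up rule "big change = keep; no change = drop".
    """
    frames, keep_labels = [], []
    for img in images:
        for t in range(repeats):
            frames.append(img)
            # Keep the first frame of each segment (high surprise),
            # drop the exact repeats (zero surprise).
            keep_labels.append(1.0 if t == 0 else 0.0)
    return torch.stack(frames), torch.tensor(keep_labels)

# Example: three random "images" -> a 24-frame pseudo-video, 3 keep frames.
video, labels = make_pseudo_video([torch.rand(3, 224, 224) for _ in range(3)])
print(video.shape, int(labels.sum()))  # torch.Size([24, 3, 224, 224]) 3
```

After this warm-up, the same policy moves to real videos, where boundaries are soft and the RL reward, not fixed labels, decides what counts as a keeper.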

Performance: Faster AND Smarter

The most shocking result from the experiments is that at a 25% retention ratio, SCORE actually outperforms the uncompressed Vanilla model. By removing the "clutter" tokens, the LLM can focus better on the core reasoning task.

(Figure: Performance benchmarks.)

Key Metrics:

  • Speed: 16.2x prefill speedup at 10% retention.
  • Efficiency: The policy network adds less than 1% overhead to total inference time.
  • Selectivity: Qualitative results show the model automatically focuses on humans, animals, and tools while ignoring the background.

(Figure: Qualitative visualizations of token selection.)

Critical Insights & Future Work

SCORE demonstrates that LLMs are sensitive to visual noise. The fact that performance increases after removing 75% of the visual tokens suggests that our current "feed everything" approach to multimodal data is fundamentally flawed.

Limitations: The model currently relies on a frozen visual encoder. Potential future directions include jointly learning the encoder and the pruning policy, or adapting the pruning ratio dynamically based on the complexity of the text query itself.

In conclusion, SCORE provides a scalable, industrial-strength solution for deploying long-form video agents that are both lightning-fast and analytically sharp.
