The paper introduces SCORE (Surprise-augmented token COmpression via REinforcement learning), a framework for compressing visual tokens in Multimodal Large Language Models (MLLMs) for video understanding. A lightweight policy network, optimized with reinforcement learning, achieves a 16.2x prefill speedup while maintaining 99.5% of the original performance, even surpassing the uncompressed baseline in certain settings.
TL;DR
Visual tokens pile up fast in video understanding, causing "context rot": a failure mode in which LLMs lose the ability to reason as sequences grow too long. SCORE solves this by inserting a lightweight RL-based compressor that learns exactly which tokens to keep. The result? A staggering 16.2x speedup in prefill time and, surprisingly, better-than-vanilla accuracy from stripping away visual noise.
The Motivation: Drowning in Redundant Pixels
Current Multimodal Large Language Models (MLLMs) are brilliant but inefficient. A single long video can generate over 40,000 visual tokens. Most of these are "junk": static walls, clear skies, or repetitive frames where nothing happens.
The authors identify two fatal flaws in existing compression approaches:
- Transformation-based (Pooling/Conv): These are "blind" operators that squash data uniformly, often destroying fine-grained details.
- Heuristic-based: These use fixed rules (like similarity clustering) that aren't aware of what the LLM actually needs to answer a specific question.
The "Context Rot" problem means that as you add these junk tokens, the LLM's performance actually drops because the "needle" (the key action) is buried in a "haystack" of thousands of irrelevant tokens.
Methodology: Mining the "Surprise"
SCORE (Surprise-augmented token COmpression via REinforcement learning) bridges the gap between efficiency and semantics.
1. The Surprise-Augmented State
To help the model distinguish between a static background and a moving person, the authors introduce Inter-frame Residuals ($\Delta X$). The policy network looks at both the current token and the change from the previous frame. This "surprise" signal acts as a high-pass filter for motion.
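To make this concrete, here is a minimal sketch of a surprise-augmented policy input, assuming the residual is simply concatenated to each token before a small MLP scores it. The class name, tensor shapes, and hidden size are illustrative guesses, not the paper's architecture:

```python
import torch
import torch.nn as nn

class SurprisePolicy(nn.Module):
    """Toy keep/drop scorer over per-frame visual tokens.

    Concatenating token + inter-frame residual is an illustrative
    assumption, not the paper's exact design.
    """
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        # Lightweight MLP: sees both the token and its residual.
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, N, D) = frames x tokens-per-frame x embedding dim
        prev = torch.roll(x, shifts=1, dims=0)
        prev[0] = x[0]                      # first frame has no predecessor
        delta = x - prev                    # inter-frame residual ("surprise")
        state = torch.cat([x, delta], dim=-1)
        return self.mlp(state).squeeze(-1)  # (T, N) keep logits

policy = SurprisePolicy(dim=1024)
tokens = torch.randn(32, 196, 1024)         # 32 frames, 196 tokens each
keep_prob = torch.sigmoid(policy(tokens))   # static regions get low scores
```

Static backgrounds yield near-zero residuals, so their tokens carry no "surprise" and score low; motion produces large residuals that survive the filter.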
2. Reinforcement Learning with Split-Advantage
Since "token selection" is a discrete (keep/drop) operation, it is non-differentiable. SCORE uses RL to optimize the policy directly based on the LLM's output quality.
- Group Rollouts: It samples several candidate keep/drop masks for the same video and scores each by the LLM's resulting output quality.
- Split-Advantage: It rewards sparsity ONLY if performance stays high, preventing the policy from over-pruning to chase speed at the cost of intelligence (a toy sketch follows this list).
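Here is a toy sketch of how group rollouts and a split advantage could fit together. The gating rule, the `quality_floor` threshold, and the group-normalized advantages are my assumptions for illustration, not the paper's exact formula:

```python
import torch

def split_advantage(quality, sparsity, quality_floor=0.99):
    """Toy split advantage over a group of sampled masks.

    quality:  (G,) task reward per rollout (e.g., answer correctness)
    sparsity: (G,) fraction of tokens dropped per rollout
    The gating rule and normalization are illustrative assumptions.
    """
    # Quality advantage, normalized within the rollout group.
    q_adv = (quality - quality.mean()) / (quality.std() + 1e-6)
    # Sparsity is credited only to rollouts whose quality stays near the
    # group's best, so the policy cannot over-prune to chase speed.
    gate = (quality >= quality_floor * quality.max()).float()
    s_adv = gate * (sparsity - sparsity.mean()) / (sparsity.std() + 1e-6)
    return q_adv + s_adv

# Group rollout: sample G different keep/drop masks for the same video.
G, T, N = 8, 32, 196
keep_prob = torch.rand(T, N)       # in training this comes from the policy
dist = torch.distributions.Bernoulli(probs=keep_prob.expand(G, T, N))
masks = dist.sample()              # G binary masks
quality = torch.rand(G)            # stand-in for the LLM's answer reward
sparsity = 1.0 - masks.mean(dim=(1, 2))
adv = split_advantage(quality, sparsity)

# REINFORCE-style loss: selection is discrete, so optimize via log-probs.
logp = dist.log_prob(masks).sum(dim=(1, 2))
loss = -(adv.detach() * logp).mean()
```

The log-probability trick is what sidesteps the non-differentiability of the keep/drop decision mentioned above.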

3. The Pseudo-to-Real Curriculum
Learning to prune in a complex video is hard. SCORE starts with "pseudo-videos": sequences stitched together from unrelated images, so the changes are high-contrast. This warm-up teaches the model the basic intuition ("big change = keep; no change = drop") before refining on real-world datasets. A sketch of the construction follows.
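This is a minimal sketch of what such a pseudo-video construction might look like, assuming each source image is held for a few frames so residuals vanish inside a segment and spike at its boundary. The function name, shapes, and labeling scheme are hypothetical:

```python
import torch

def make_pseudo_video(images, repeats=4):
    """Stitch unrelated images into a 'pseudo-video' for warm-up.

    images: list of (N, D) token grids from distinct images.
    Each image is held for `repeats` frames, so inter-frame residuals
    are ~zero inside a segment and spike at segment boundaries.
    This construction is an illustrative guess at the curriculum data.
    """
    frames, keep_label = [], []
    for tokens in images:
        for i in range(repeats):
            frames.append(tokens)
            # Target intuition: "big change = keep; no change = drop".
            keep_label.append(torch.full((tokens.shape[0],), float(i == 0)))
    return torch.stack(frames), torch.stack(keep_label)

imgs = [torch.randn(196, 1024) for _ in range(3)]
video, labels = make_pseudo_video(imgs)   # (12, 196, 1024), (12, 196)
```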
Performance: Faster AND Smarter
The most striking result from the experiments is that at a 25% retention ratio, SCORE actually outperforms the uncompressed vanilla model. By removing the "clutter" tokens, the LLM can focus better on the core reasoning task.

Key Metrics:
- Speed: a 16.2x prefill speedup at 10% token retention (a rough sanity check follows this list).
- Efficiency: The policy network adds less than 1% overhead to total inference time.
- Selectivity: Qualitative results show the model automatically focuses on humans, animals, and tools while ignoring the background.
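As a rough sanity check (my arithmetic, not the paper's), assume prefill cost splits into a quadratic self-attention term and a linear MLP term. With retention ratio $r = 0.1$:

$$
\frac{C_{\text{attn}}(rn)}{C_{\text{attn}}(n)} \approx r^2 = 0.01,
\qquad
\frac{C_{\text{mlp}}(rn)}{C_{\text{mlp}}(n)} \approx r = 0.1
$$

An end-to-end 16.2x therefore sits between the 10x bound set by the linear terms and the 100x bound set by attention, which is where a mixed workload would be expected to land.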

Critical Insights & Future Work
SCORE is strong evidence that LLMs are sensitive to visual noise. The fact that performance increases after removing 75% of the tokens suggests that our current "feed everything" approach to multimodal data is fundamentally flawed.
Limitations: The model currently relies on a frozen visual encoder. A promising future direction would be jointly learning the encoder and the pruning policy, or adapting pruning dynamically to the complexity of the text query itself.
In conclusion, SCORE provides a scalable, industrial-strength solution for deploying long-form video agents that are both lightning-fast and analytically sharp.
