MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models


[CVPR 2024/2025] MASQuant: Solving the "Smoothing Misalignment" in Multimodal LLMs

Abstract

MASQuant is a post-training quantization (PTQ) framework designed specifically for Multimodal Large Language Models (MLLMs) such as Qwen2.5-VL and Qwen2.5-Omni. It introduces Modality-Aware Smoothing (MAS) and Cross-Modal Compensation (CMC) to achieve state-of-the-art performance in W4A8 settings (4-bit weights, 8-bit activations), maintaining near-lossless accuracy across vision, audio, and text tasks.

TL;DR

Post-training quantization (PTQ) for Multimodal Large Language Models (MLLMs) has long been plagued by a "winner-takes-all" problem where visual tokens, having much larger activation magnitudes, dictate the quantization scales and destroy audio or text signals. MASQuant resolves this by introducing Modality-Aware Smoothing (MAS) and a clever SVD-based Cross-Modal Compensation (CMC) technique, enabling efficient W4A8 quantization with near-zero performance degradation.

The "Smoothing Misalignment" Crisis

In text-only models, SmoothQuant successfully migrates outliers from activations to weights using a single scaling factor per channel. However, MLLMs process heterogeneous data. Visual tokens are "loud" (high magnitude), while audio tokens are "quiet."

As shown in the paper's analysis, if you use one smoothing factor for a mixed-modality layer:

The factor aligns with the dominant modality (usually Vision).
The non-dominant modalities (Audio/Text) are over-smoothed, effectively crushing their signal into quantization noise.

This is why traditional PTQ methods cause the audio Word Error Rate (WER) to skyrocket from 3.9% to nearly 80%: once quantized, the model effectively "stops hearing" clearly.

Figure 1: Smoothing misalignment analysis. Comparison showing how uniform smoothing in SmoothQuant leads to low SQNR and high PPL in MLLMs compared to the proposed MASQuant.
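To make the misalignment concrete, here is a minimal NumPy sketch. It assumes the standard SmoothQuant per-channel factor, s_j = max|X_j|^α / max|W_j|^(1-α), and uses synthetic activations whose magnitudes (50.0 for vision, 0.5 for audio) are illustrative assumptions, not values from the paper. It only shows that a factor computed on mixed tokens collapses to the vision-only factor and over-scales audio relative to the factor audio would get on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out = 256, 16, 8

# Hypothetical calibration activations: vision is "loud", audio is "quiet".
x_vision = 50.0 * rng.standard_normal((n_tokens, d_in))
x_audio = 0.5 * rng.standard_normal((n_tokens, d_in))
w = rng.standard_normal((d_in, d_out))


def smoothquant_factor(x, w, alpha=0.5):
    """Per-input-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)."""
    act_max = np.abs(x).max(axis=0)  # max over tokens, one value per input channel
    w_max = np.abs(w).max(axis=1)    # max over outputs, one value per input channel
    return act_max**alpha / w_max ** (1.0 - alpha)


# One shared factor computed on the mixed-modality calibration set ...
s_shared = smoothquant_factor(np.vstack([x_vision, x_audio]), w)
# ... versus the factor each modality would get on its own.
s_vision = smoothquant_factor(x_vision, w)
s_audio = smoothquant_factor(x_audio, w)

# How much harder the shared factor divides audio than its own factor would.
over_smoothing = s_shared / s_audio

print("shared factor equals vision-only factor:", np.allclose(s_shared, s_vision))
print("mean over-smoothing of audio: %.1fx" % over_smoothing.mean())
```

Because the per-channel maximum is taken over all tokens, the loudest modality alone determines the shared factor in this toy setup.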

Methodology: The MASQuant Framework

MASQuant breaks the compromise by allowing each modality to have its own "ideal" smoothing factor while keeping memory overhead low.

1. Modality-Aware Smoothing (MAS)

Instead of searching for a single hyperparameter $\beta$, MAS treats smoothing factors as free parameters optimized with a Modality-Balanced Reconstruction loss. It computes $S_{\text{text}}$, $S_{\text{vision}}$, and $S_{\text{audio}}$ independently, so that each modality is scaled according to its own activation statistics.
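The idea of one factor per modality can be sketched as follows. This is not the paper's optimizer. It assumes a SmoothQuant-style closed form (`init_factor`) as a stand-in for the learned factors, a simple symmetric round-to-nearest `fake_quant`, and a hypothetical `balanced_loss` that averages the relative reconstruction error over modalities. All names and the synthetic calibration data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
w = rng.standard_normal((d_in, d_out))

# Hypothetical calibration activations per modality, shape (tokens, d_in).
calib = {
    "text": 2.0 * rng.standard_normal((128, d_in)),
    "vision": 50.0 * rng.standard_normal((128, d_in)),
    "audio": 0.5 * rng.standard_normal((128, d_in)),
}


def init_factor(x, w, alpha=0.5):
    """Closed-form starting point; MAS would treat this vector as trainable."""
    return np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1.0 - alpha)


def fake_quant(t, bits, axis):
    """Symmetric round-to-nearest quantization; the max is taken along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(t / scale), -qmax, qmax) * scale


def recon_error(x, w, s):
    """Relative output error of a W4A8 fake-quantized layer smoothed with s."""
    x_q = fake_quant(x / s, bits=8, axis=1)           # one scale per token
    w_q = fake_quant(s[:, None] * w, bits=4, axis=0)  # one scale per output channel
    ref = x @ w
    return float(np.linalg.norm(ref - x_q @ w_q) ** 2 / np.linalg.norm(ref) ** 2)


# One smoothing factor per modality instead of a single shared one.
factors = {m: init_factor(x, w) for m, x in calib.items()}

# In full precision, smoothing is lossless: (X / s) @ (diag(s) W) == X @ W.
exact = all(
    np.allclose((x / factors[m]) @ (factors[m][:, None] * w), x @ w)
    for m, x in calib.items()
)

# Hypothetical balanced objective: every modality contributes equally.
errors = {m: recon_error(x, w, factors[m]) for m, x in calib.items()}
balanced_loss = sum(errors.values()) / len(errors)
print(exact, errors, balanced_loss)
```

Normalizing each modality's error by its own output norm is one way to keep a loud modality from dominating the objective; the paper's exact loss may be defined differently.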

2. Cross-Modal Compensation (CMC)

The problem with having three different smoothing factors is that each one rescales the weights differently, so you would normally need three sets of quantized weights. To save memory, MASQuant uses Text as the base. For other modalities, it calculates the residual weight difference $\Delta W$.

The ingenious part? The authors prove that while $\Delta W$ isn't naturally low-rank, it becomes highly low-rank after activation whitening. By applying SVD on the whitened residual, they can compress the modal difference into two tiny matrices ($L_{1}$, $L_{2}$).

Figure 2: Architecture overview. The MASQuant pipeline showing how low-rank matrices compensate the base quantized weights for non-text modalities.
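The whitening-then-SVD step can be sketched as below. This uses a common construction that may differ from the paper's exact formulation: take the Cholesky factor of the activation Gram matrix, run a truncated SVD on the weighted residual, then undo the weighting on the left factor. The residual `delta_w`, the activation statistics, and the chosen `rank` are synthetic assumptions. The sketch compares the output error of this whitened factorization with a plain truncated SVD of the same rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_in, d_out, rank = 512, 32, 24, 4

# Hypothetical smoothed activations with very uneven per-channel energy.
channel_scale = np.logspace(-1, 1, d_in)
x = rng.standard_normal((n_tokens, d_in)) * channel_scale
# Stand-in for the residual between modality-specific and base weights.
delta_w = rng.standard_normal((d_in, d_out))


def truncated_svd(m, r):
    """Return (U_r * sigma_r, V_r^T), the best rank-r factors of m."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return u[:, :r] * s[:r], vt[:r]


def whitened_low_rank(x, delta_w, r, eps=1e-6):
    """Rank-r factors (L1, L2) that target the error of x @ delta_w."""
    gram = x.T @ x + eps * np.eye(x.shape[1])  # small ridge for stability
    chol = np.linalg.cholesky(gram)            # gram = chol @ chol.T
    us, vt = truncated_svd(chol.T @ delta_w, r)
    l1 = np.linalg.solve(chol.T, us)           # undo the weighting on the left factor
    return l1, vt


l1, l2 = whitened_low_rank(x, delta_w, rank)
p1, p2 = truncated_svd(delta_w, rank)  # baseline: ignore activations

target = x @ delta_w
err_whitened = np.linalg.norm(target - x @ l1 @ l2)
err_plain = np.linalg.norm(target - (x @ p1) @ p2)
print("whitened: %.2f  plain: %.2f" % (err_whitened, err_plain))
```

The design point is that the compensation only needs to be accurate in the directions the activations actually occupy, so the truncation is chosen with respect to the layer output rather than the raw weight difference.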

Experimental Battleground

MASQuant was evaluated on Qwen2.5-VL and Qwen2.5-Omni.

W8A8 Performance: MASQuant is virtually indistinguishable from FP16.
W4A8 Performance: While SmoothQuant (SQ) and MBQ show catastrophic failure in audio tasks (WER > 70%), MASQuant maintains a WER of ~3.6%.
Speed & VRAM: The framework delivers a 2.5x speedup over FP16 with 2.8x memory savings, with only a tiny 5-10% latency overhead compared to basic weight-only quantization.

Table 1: Performance on OmniBench and Audio tasks. Note the catastrophic failure of SmoothQuant (SQ) and round-to-nearest (RTN) in the W4A8 category.

Why It Works: The Effective Rank Intuition

The success of CMC hinges on Theorem 2 in the paper. By whitening the activations, the "energy" of the modality difference is concentrated into a few singular values. The paper demonstrates that the "Effective Rank" of the weight difference drops significantly after whitening, allowing a rank-ratio as small as 0.05 to recover almost all the lost accuracy.
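For readers who want to measure this quantity themselves, a minimal sketch follows. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values); the paper may define its "Effective Rank" differently.

```python
import numpy as np


def effective_rank(m, eps=1e-12):
    """exp(Shannon entropy) of the normalized singular-value distribution."""
    s = np.linalg.svd(m, compute_uv=False)
    p = s / s.sum()
    p = p[p > eps]  # drop numerically zero singular values
    return float(np.exp(-(p * np.log(p)).sum()))


# Sanity checks on matrices whose spectrum is known.
identity = np.eye(8)                                  # energy spread evenly over 8 directions
rank_one = np.outer(np.arange(1.0, 9.0), np.ones(6))  # all energy in one direction

print(effective_rank(identity), effective_rank(rank_one))
```

A matrix whose energy sits in a few singular values has a low effective rank, which is the regime where a small rank ratio can capture most of the residual.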

Critical Insight & Conclusion

MASQuant proves that the "one-size-fits-all" approach to quantization is dead for the multimodal era. The disparity between modalities is a feature, not a bug, and successful compression must embrace modality-specific scaling.

Limitations: While CMC is efficient, it still requires a "modality mask" during inference to trigger the low-rank branch, which adds slight complexity to the CUDA kernels. However, this is a small price to pay for a model that can actually "see," "hear," and "read" in 4-bit precision.
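The role of the modality mask can be sketched in NumPy as follows. This is an illustrative reference path, not the authors' kernel: quantization is omitted, the smoothing factors are random placeholders, and the low-rank factors are kept at full rank so that the output can be checked exactly against the unsmoothed layer. In practice the base weights would be quantized and the rank would be small.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 16, 8
w = rng.standard_normal((d_in, d_out))

# Hypothetical per-modality smoothing factors (positive, one per input channel).
s_text = rng.uniform(0.5, 2.0, d_in)
s_vision = rng.uniform(5.0, 20.0, d_in)

w_base = s_text[:, None] * w              # shared base weights (stored quantized in practice)
delta_w = s_vision[:, None] * w - w_base  # residual needed by vision tokens

u, sv, vt = np.linalg.svd(delta_w, full_matrices=False)
r = len(sv)                               # full rank here so the check below is exact
l1, l2 = u[:, :r] * sv[:r], vt[:r]


def forward(x, is_vision):
    """x: (tokens, d_in) activations; is_vision: (tokens,) boolean modality mask."""
    s = np.where(is_vision[:, None], s_vision, s_text)  # per-token smoothing factor
    x_s = x / s
    y = x_s @ w_base                                    # base path, all tokens
    y[is_vision] += (x_s[is_vision] @ l1) @ l2          # low-rank branch, masked tokens only
    return y


x = rng.standard_normal((6, d_in))
mask = np.array([False, True, True, False, True, False])
y = forward(x, mask)
print(np.allclose(y, x @ w))
```

Text tokens never touch the low-rank branch, so the extra cost scales with the number of non-text tokens and the chosen rank.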
