GlowQ is a novel group-shared low-rank approximation framework for quantized Large Language Models (LLMs) that introduces a shared right-factor matrix for modules within the same input-sharing group. It achieves SOTA recovery in low-bit (W4A16) regimes, matching the accuracy of independent per-layer correction methods while significantly reducing memory and latency overhead.
TL;DR
GlowQ redefines how we "fix" quantization errors in Large Language Models. Instead of slapping a unique low-rank correction module on every single layer (which is slow and memory-hungry), GlowQ groups modules that share the same input and uses a single, shared high-precision projection. The result? 37.4% higher throughput and lower perplexity compared to standard 4-bit quantization methods.
The Problem: The High Cost of Being Unique
Post-Training Quantization (PTQ) to 4-bits (W4A16) is the industry standard for LLM deployment. However, 4-bit weights often lead to accuracy degradation. Current state-of-the-art methods like L2QER and QERA fix this by adding a low-rank term: $W \approx W_q + AB$.
The catch? These methods treat every projection ($W_q, W_k, W_v$, etc.) as an island. They compute the high-precision product $A(BX)$ repeatedly. In a standard Transformer, $Q, K,$ and $V$ all look at the exact same input $X$. Computing $B_q X, B_k X,$ and $B_v X$ separately is logically redundant and computationally expensive. This "independence" is a bottleneck that prevents quantized models from reaching their full speed potential.
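To make the baseline concrete, here is a minimal NumPy sketch of independent per-layer correction (the toy 4-bit quantizer, shapes, and rank are illustrative, not the kernels these papers actually use): quantize a weight matrix, take an SVD of the quantization error, and apply the rank-$r$ term $A(BX)$ at inference.

```python
import numpy as np

def quantize_4bit(W, n_levels=16):
    # Toy symmetric uniform quantizer -- a stand-in for a real W4A16 kernel.
    scale = np.abs(W).max() / (n_levels / 2 - 1)
    return np.round(W / scale) * scale

def lowrank_correction(W, W_q, rank):
    # Independent per-layer correction: best rank-r fit to the error W - W_q.
    U, s, Vt = np.linalg.svd(W - W_q, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (d_out, r)
    B = Vt[:rank]                # (r, d_in)
    return A, B

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d))
W_q = quantize_4bit(W)
A, B = lowrank_correction(W, W_q, rank=8)

X = rng.standard_normal((d, 4))
err_q = np.linalg.norm(W @ X - W_q @ X)                  # plain quantized
err_c = np.linalg.norm(W @ X - (W_q @ X + A @ (B @ X)))  # with correction
assert err_c < err_q  # the low-rank term shrinks the output error
```

Note that the expensive part at inference is the extra `B @ X` product, and under this scheme every one of $W_q, W_k, W_v$ pays for its own copy of it.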
Methodology: Shared Projections and Covariance Alignment
1. The Group-Shared Insight
GlowQ’s core innovation is simple yet profound: If modules share an input, they should share the heavy lifting. In a Transformer block, the $Q, K,$ and $V$ projections are grouped. GlowQ learns a single $B_{shared}$ for the entire group.
- Inference: Compute $R = B_{shared} X$ once.
- Correction: Each module $i$ only performs a lightweight $A_i R$.
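The two steps above can be sketched as follows (the shapes, the `q`/`k`/`v` naming, and the random placeholder factors are illustrative; in practice $B_{shared}$ and each $A_i$ come out of calibration, and the weights would be real 4-bit tensors):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, seq = 64, 8, 16

# One shared right factor for the whole Q/K/V group, one small A per module.
B_shared = rng.standard_normal((r, d))
A = {name: rng.standard_normal((d, r)) for name in ("q", "k", "v")}
W4 = {name: rng.standard_normal((d, d)) for name in ("q", "k", "v")}  # stand-ins

X = rng.standard_normal((d, seq))

# Inference: the expensive (r, d) @ (d, seq) product happens ONCE per group.
R = B_shared @ X  # (r, seq)

# Correction: each module only pays for a cheap (d, r) @ (r, seq) matmul.
outputs = {name: W4[name] @ X + A[name] @ R for name in ("q", "k", "v")}
```

The redundant work eliminated is exactly the three separate $B_iX$ products the independent schemes compute.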
2. Why it Works: Covariance Alignment
You can't just average the errors. Real-world LLM activations are anisotropic (highly directional). GlowQ uses a data-aware objective: $$\min_{A, B} \left\| (E_{cat} - AB)\, \Sigma_x^{1/2} \right\|_F^2$$ By "whitening" the error matrix with the input covariance $\Sigma_x$, the shared factor $B$ is forced to align with the directions the model actually uses most frequently.
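One way to solve a whitened objective of this form in closed form, sketched below under my own assumptions (that $\Sigma_x$ is estimated from a calibration batch `X_calib`, and with an `eps` floor on eigenvalues added for numerical safety): take the SVD of the whitened error $E_{cat}\Sigma_x^{1/2}$, truncate to rank $r$, and un-whiten the right factor.

```python
import numpy as np

def covariance_aligned_lowrank(E_cat, X_calib, rank, eps=1e-6):
    # Estimate the input covariance from calibration activations.
    Sigma = X_calib @ X_calib.T / X_calib.shape[1]
    evals, evecs = np.linalg.eigh(Sigma)
    evals = np.clip(evals, eps, None)
    S = evecs * np.sqrt(evals) @ evecs.T          # Sigma^{1/2}
    S_inv = evecs * (1.0 / np.sqrt(evals)) @ evecs.T

    # Best rank-r fit to the WHITENED error, then un-whiten the right factor.
    U, s, Vt = np.linalg.svd(E_cat @ S, full_matrices=False)
    A = U[:, :rank] * s[:rank]
    B = Vt[:rank] @ S_inv                          # shared right factor
    return A, B
```

Because the SVD truncation is the optimal rank-$r$ fit to $E_{cat}\Sigma_x^{1/2}$, the resulting $AB$ beats a plain (unweighted) SVD of $E_{cat}$ under the covariance-weighted norm whenever the activations are anisotropic.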

3. Scaling Up with QR-Reduced RSVD
To avoid the massive memory overhead of calculating SVD on tall matrices, the authors use a QR-reduced Randomized SVD. They compress the stacked error into a $d \times d$ core, perform a fast randomized sketch, and then lift the solution back. This makes the "calibration" phase feasible even for 30B+ models.
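One plausible reading of that pipeline, sketched with NumPy (the oversampling amount and the Halko-style range finder are standard randomized-SVD choices on my part, not details taken from the paper):

```python
import numpy as np

def qr_reduced_rsvd(E_tall, rank, oversample=8, seed=0):
    # Step 1: QR compresses the tall stacked error (N x d, N >> d)
    # into a small d x d core, so the SVD never touches the tall matrix.
    Q, R_core = np.linalg.qr(E_tall, mode="reduced")   # Q: (N, d), R_core: (d, d)

    # Step 2: randomized sketch of the core (Halko-style range finder).
    rng = np.random.default_rng(seed)
    Omega = rng.standard_normal((R_core.shape[1], rank + oversample))
    Q2, _ = np.linalg.qr(R_core @ Omega, mode="reduced")
    U_small, s, Vt = np.linalg.svd(Q2.T @ R_core, full_matrices=False)

    # Step 3: lift the left factor back to the original (tall) row space.
    U = Q @ (Q2 @ U_small[:, :rank])
    return U * s[:rank], Vt[:rank]                     # A: (N, r), B: (r, d)
```

The peak working set is the $d \times d$ core plus the thin $Q$, rather than an $N \times N$ object, which is what makes calibration tractable at 30B+ scale.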
Experimental Results: Efficiency Meets Accuracy
GlowQ was tested against a battery of models including LLaMA 3, Qwen 2.5, and Mistral.
- SOTA Recovery: On LLaMA 3 8B, GlowQ achieved a perplexity of 6.59, outperforming both AWQ (6.64) and GPTQ (6.63) while being faster.
- Latency Gains: In LLaMA 2 13B tests, GlowQ-S (the selective version) reduced the Time-To-First-Token (TTFT) by 23.4%.
- Throughput: The throughput (tokens per second) increased by 37.4% because the GPU no longer wastes cycles on redundant high-precision matmuls.

Deep Insight: Beyond Dense Models
One of the most impressive findings is GlowQ's performance on Mixture-of-Experts (MoE). In an MoE block, you have dozens of experts. Standard methods would add correction parameters to every expert. GlowQ uses a single $B_{shared}$ across all experts in a group, reducing the memory footprint of error correction by 63% while matching the accuracy of far more bloated methods.
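A back-of-the-envelope parameter count shows why sharing pays off at MoE scale (the expert count, hidden size, and rank below are hypothetical picks of mine, so the resulting saving differs from the 63% reported above):

```python
n_experts, d, r = 32, 1024, 64  # hypothetical MoE block shapes

# Independent correction: every expert stores its own (A_e, B_e) pair.
params_independent = n_experts * (d * r + r * d)

# GlowQ-style sharing: a small A_e per expert, but ONE B_shared per group.
params_shared = n_experts * d * r + r * d

saving = 1 - params_shared / params_independent
print(f"correction-parameter saving: {saving:.1%}")  # → 48.4% with these shapes
```

The saving approaches 50% as the expert count grows for equal-sized factors; larger reported numbers are consistent with groups whose shared right factor dominates the per-module cost.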
Conclusion & Takeaways
GlowQ proves that "independent error correction" is a luxury we don't need. By exploiting the inherent grouping of operations in the Transformer architecture, we can have our cake and eat it too: the memory savings of 4-bit quantization with the speed of optimized kernels and the accuracy of high-precision models.
Key Takeaway for Practitioners: When designing low-rank adapters (LoRA) or quantization corrections, always look for shared input manifolds. If multiple weights act on the same activation, there is a "shared subspace" waiting to be exploited for massive efficiency gains.
Limitations: The "Selective Restore" policy (GlowQ-S) requires model-specific tuning (the "elbow point" on the PPL/Latency curve) to find the optimal trade-off. However, the default "GlowQ" (full restoration) still offers a significant speedup with no tuning required.
