Long Context Pre-Training with Lighthouse Attention

WisPaper

学术搜索

学术问答

价格

TrueCite

工作空间

Home

Blog

Long Context Pre-Training with Lighthouse Attention

Lighthouse Attention: Symmetrical Hierarchy for Subquadratic Long-Context Pre-training

总结

问题

方法

结果

要点

摘要

Lighthouse Attention is a training-only subquadratic attention mechanism that uses a symmetrical hierarchical selection process (pooling Q, K, and V) to handle extreme sequence lengths. It wraps around standard Scaled Dot-Product Attention (SDPA) and achieves SOTA training efficiency, matching or beating full-attention models after a short "recovery" phase while being significantly faster at contexts up to 1M tokens.

TL;DR

Lighthouse Attention is a novel training-time algorithm designed to break the quadratic barrier of Transformer attention without sacrificing model quality. By pooling Queries, Keys, and Values symmetrically and using a discrete "Top-K" selection, it speeds up long-context training (up to 1M tokens) by nearly 2x. Most importantly, it introduces a "Recovery" phase where the model is switched back to full attention, proving that sparse training doesn't have to break inference.

Problem & Motivation: The Saliency Paradox

As LLMs move toward context windows of 128K, 1M, and beyond, we hit a hardware wall. Standard attention (SDPA) scales quadratically ( $Θ (N^{2})$ ). Current sparse solutions generally fall into two traps:

Asymmetry: They compress Keys and Values but leave Queries dense. This makes sense for inference (one query at a time) but is inefficient during training where all queries are known.
Kernel Entanglement: Most sparse methods require custom CUDA kernels that are hard to maintain and don't benefit from the rapid optimizations of stock FlashAttention.

The authors' insight is simple: Queries are as redundant as Keys. If we compress both, we can reduce the problem to "thick" attention over a much smaller, gathered sub-sequence.

Methodology: The Four-Stage Pipeline

Lighthouse wraps the attention layer in a 4-stage process that leaves the core kernel untouched:

Pyramid Pooling: Q, K, and V are average-pooled into multiple resolution levels (e.g., every 2, 4, or 8 tokens).
Hierarchical Selection: A parameter-free scorer (using $ℓ_{2}$ norms) identifies the most "important" entries across all levels. A custom Chunked-Bitonic Top-K kernel handles this selection efficiently on GPUs.
Gather & FlashAttention: The selected entries are packed into a single, contiguous sub-sequence. Crucially, because this sub-sequence is dense, we can use unmodified FlashAttention.
Scatter-Back: The result is redistributed to the original sequence positions using a deterministic scatter kernel.

Lighthouse Architecture

Why No Differentiable Selection?

Lighthouse does not use Straight-Through Estimators (STE) or Gumbel-Softmax. The selection is discrete and non-differentiable. The weights learn to produce values that are useful when selected, rather than learning how to "game" a scorer. This avoids the common optimization instability seen in learned sparse models.

Experiments: Faster Training, Better Recovery

The authors tested Lighthouse on B200 GPUs at various context lengths.

1. Scaling Performance

The performance gap widens as sequences get longer. At 512K tokens, Lighthouse provides a staggering 21x speedup in the forward pass. This efficiency allows for training on contexts that would otherwise lead to Out-Of-Memory (OOM) errors on high-end hardware.

Latency Comparison

2. The SDPA Recovery Test

A major contribution is the Two-Stage Recipe. After pre-training for ~10k steps with Lighthouse (sparse), the model is switched to full SDPA (dense) for the final ~6k steps.

Result: The "recovered" models didn't just match the baseline; they often surpassed it in terms of validation loss (0.6980 vs. 0.7237).
Retrieval: In "Needle-in-a-Haystack" tests, the recovered models showed retrieval capabilities on par with models trained entirely with dense attention.

Training Results Table

Critical Analysis & Takeaways

Lighthouse Attention proves that sparsity is a bridge, not a permanent modification.

Pros:

Kernel Agnostic: Allows researchers to immediately benefit from FlashAttention-3/4.
Symmetric Efficiency: Compressing Queries significantly reduces FLOPs during the training phase.
Superior Regularization: Surprisingly, smaller budgets ( $k$ ) resulted in lower loss, suggesting the hierarchical selection acts as a useful bottleneck.

Limitations:

Autoregressive Decoding: The symmetric pooling assumes Queries are known in advance, making this specifically a pre-training optimization rather than a zero-shot inference speedup.
Complexity: While $k$ is fixed, the complexity is sub-quadratic, not strictly linear.

Future Outlook: The multi-scale pyramid approach is naturally suited for multimodal tasks. We should expect to see Lighthouse-style selection applied to vision and audio, where spatial/temporal redundancy is even higher than in text.

发现相似论文

试试这些示例

Search for recent papers that utilize Query pooling or symmetric compression in Transformer attention during the pre-training phase.
Which paper first introduced the concept of two-stage training (sparse-to-dense) for long-context LLMs, and how does Lighthouse's recovery phase compare to it?
Investigate studies that apply hierarchical selection-based attention to non-text modalities like high-resolution video or medical imaging.

Lighthouse Attention: Symmetrical Hierarchy for Subquadratic Long-Context Pre-training

1. TL;DR

2. Problem & Motivation: The Saliency Paradox

3. Methodology: The Four-Stage Pipeline

3.1. Why No Differentiable Selection?

4. Experiments: Faster Training, Better Recovery

4.1. 1. Scaling Performance

4.2. 2. The SDPA Recovery Test

5. Critical Analysis & Takeaways