FlashAttention-4 is a next-generation attention algorithm specifically co-designed for the NVIDIA Blackwell architecture. It introduces techniques like asynchronous pipeline redesign, software-emulated exponentials, and 2-CTA MMA modes to achieve up to 1613 TFLOPs/s (71% utilization) and a 1.3x speedup over cuDNN 9.13.
TL;DR
FlashAttention-4 is not just an incremental update; it is a fundamental redesign of the attention mechanism for the NVIDIA Blackwell (B200/GB200) era. As Tensor Core throughput doubles while memory bandwidth lags, FlashAttention-4 introduces kernel-algorithm co-design to mitigate non-matmul bottlenecks. It achieves a staggering 1613 TFLOPs/s (71% utilization) and is implemented in a new Python-based DSL that compiles 30x faster than the old C++ templates.
The "Blackwell Bottleneck": Why FlashAttention-3 Wasn't Enough
In the jump from Hopper (H100) to Blackwell (B200), NVIDIA doubled the peak BF16 throughput to 2.25 PFLOPS. However, the Multi-Function Unit (MUFU)—responsible for the exp operations in softmax—and Shared Memory (SMEM) bandwidth did not scale at the same rate.
The authors' roofline analysis reveals a grim reality for standard kernels: for typical tile sizes on Blackwell, shared memory traffic and exponential operations now exceed MMA compute time by up to 60%. The GPU is essentially waiting for the "math" to finish while it struggles to move data and calculate exponents.
Methodology: High-Precision Engineering for Asymmetric Hardware
1. Software-Emulated Exponentials
To solve the bottleneck of the limited MUFU units, FlashAttention-4 implements software-emulated 2^x using standard Floating-Point (FMA) units.
- The Insight: Use FMA units (which are abundant) to compute a polynomial approximation of the fractional part of an exponent while using integer ALU shifts for the integer part.
- The Result: By offloading 10-25% of the exponential work to FMA units, they effectively increase the total exponential throughput of the SM.
Table: Polynomial approximation is indistinguishable from hardware MUFU once rounded to BF16.
2. The 2-CTA Backward Pass
The backward pass is notoriously memory-bound. Blackwell introduces a 2-CTA MMA mode where a pair of CTAs (thread blocks) work together.
- Shared Memory Savings: Each CTA stages only half of operand B, reducing redundant traffic.
- Atomic Reduction: By partitioning the output across the CTA pair, FlashAttention-4 halves the number of global atomic adds required for the
dQgradient, which is a major source of non-determinism and latency.
Figure: The 2-CTA dQ step partitioning the workload to minimize SMEM traffic.
3. Pipeline Pipelining: Tensor Memory (TMEM)
Unlike Hopper, where accumulators lived in registers, Blackwell introduces Tensor Memory (TMEM), a 256KB on-chip buffer. FlashAttention-4 uses a "ping-pong" schedule in TMEM to overlap the next tile's matrix multiplication with the current tile's softmax computation, ensuring the Tensor Cores are never idle.
Performance: Breaking Records
FlashAttention-4 doesn't just beat the baseline; it dominates.
- Speedup: 1.3x faster than cuDNN 9.13 and 2.7x faster than Triton on B200.
- Efficiency: Reaches 71% of the B200's theoretical peak performance.
- Compilation: By moving from C++ templates to CuTe-DSL in Python, the team reduced compile headers and logic from minutes to seconds (32x faster for backward pass).
Figure: Performance comparison across different sequence lengths on B200.
Critical Insight & Future Outlook
The most profound contribution of FlashAttention-4 is the move toward Python-embedded DSLs for high-performance kernels. For years, the barrier to entry for CUDA optimization was the "C++ template hell." By providing a Python-based framework that compiles directly to SASS/PTX without losing control, the authors have democratized low-level GPU programming.
However, the reliance on Blackwell-specific features (TMEM, 2-CTA mode) means that while this is the "gold standard" for B200, the industry still faces a fragmented landscape between Blackwell datacenters and consumer RDNA/GeForce hardware.
Conclusion
FlashAttention-4 proves that as hardware scales asymmetrically, we can no longer rely on hardware-agnostic code. The future of AI efficiency lies in the tight coupling of algorithmic math (skipping rescaling) and hardware-specific plumbing (TMEM/2-CTA).
