Curiosity · Concept

FlashAttention

FlashAttention, introduced by Tri Dao et al. in 2022, is the standard kernel for self-attention on modern GPUs. It is mathematically identical to naive attention but restructures the computation: it fuses the softmax with the matrix multiplies, tiles the work into blocks that fit in fast on-chip SRAM, and never materializes the full N×N attention matrix in HBM. The result is 2-4x faster attention with far lower memory use.

Quick reference

Proficiency
Advanced
Also known as
FlashAttention-2, FA2, FA3
Prerequisites
Self-attention, GPU memory hierarchy (HBM vs SRAM)

Frequently asked questions

What is FlashAttention?

An IO-aware GPU kernel for self-attention that gives mathematically exact results while being faster and more memory-efficient than standard implementations. It tiles the attention computation into blocks that fit in fast on-chip SRAM, avoiding the slow step of writing the N×N attention matrix to HBM.
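The tiling trick relies on an "online softmax": as key/value blocks stream through SRAM, the kernel keeps only a running row-max, a running normalizer, and a rescaled output accumulator, so no N×N matrix is ever formed. A minimal NumPy sketch of that recurrence (illustrative only; the real kernel does this per-thread-block on GPU, and the function names here are our own):

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference implementation: materializes the full N x N score matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=16):
    # FlashAttention-style online softmax: stream K/V in blocks,
    # keeping only running max (m), normalizer (l), and accumulator (acc).
    N, d = Q.shape
    O = np.empty_like(Q)
    scale = 1.0 / np.sqrt(d)
    for qs in range(0, N, block):
        q = Q[qs:qs + block]
        m = np.full(q.shape[0], -np.inf)   # running row max
        l = np.zeros(q.shape[0])           # running softmax normalizer
        acc = np.zeros((q.shape[0], d))    # running (unnormalized) output
        for ks in range(0, N, block):
            s = q @ K[ks:ks + block].T * scale      # one block of scores
            m_new = np.maximum(m, s.max(axis=-1))   # updated row max
            p = np.exp(s - m_new[:, None])          # block softmax numerator
            corr = np.exp(m - m_new)                # rescale old statistics
            l = l * corr + p.sum(axis=-1)
            acc = acc * corr[:, None] + p @ V[ks:ks + block]
            m = m_new
        O[qs:qs + block] = acc / l[:, None]
    return O
```

Because the rescaling is exact, `tiled_attention` matches `naive_attention` to floating-point precision, which is why FlashAttention can claim bitwise-faithful results rather than an approximation.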

How is FlashAttention different from approximate attention?

Approximate attention methods (Linformer, Performer, Longformer) change the math to reduce complexity, at some accuracy cost. FlashAttention is exact — identical output to vanilla attention — but runs faster by reorganizing memory access. You get the speedup without changing model behavior.

What did FlashAttention-2 and FlashAttention-3 add?

FA2 improved parallelism across the sequence length and the warp-level work distribution, yielding roughly another 2x speedup over FA1. FA3 specializes for NVIDIA Hopper with asynchronous tensor-core execution, warp specialization, and FP8 support — key for fast training and inference on H100-class hardware.

Do I need to do anything to use FlashAttention?

It is integrated into PyTorch's scaled_dot_product_attention, Hugging Face Transformers, vLLM, TensorRT-LLM, and most modern LLM inference stacks — usually enabled by default on supported GPUs. You get it automatically when running modern LLMs.
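In PyTorch this means simply calling `torch.nn.functional.scaled_dot_product_attention`, which dispatches to a fused kernel (FlashAttention on supported GPUs, a fallback path elsewhere) while producing the same result as the explicit computation. A small sketch (shapes and tolerance are our own choices):

```python
import torch
import torch.nn.functional as F

# Toy shapes: (batch, heads, seq_len, head_dim).
q, k, v = (torch.randn(1, 4, 128, 64) for _ in range(3))

# Dispatches to the best available fused attention backend.
out = F.scaled_dot_product_attention(q, k, v)

# The same computation spelled out with a materialized attention matrix.
scores = q @ k.transpose(-2, -1) / (64 ** 0.5)
ref = scores.softmax(dim=-1) @ v
```

Because the fused path is exact, `out` and `ref` agree up to floating-point tolerance; no model changes or retraining are needed to benefit.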

Sources

  1. Dao et al. — FlashAttention — accessed 2026-04-20
  2. Dao — FlashAttention-2 — accessed 2026-04-20
  3. Shah et al. — FlashAttention-3 — accessed 2026-04-20