Understanding FlashAttention: Inner vs Outer Loop Optimization
FlashAttention is a groundbreaking optimization technique for computing attention in Transformer models. It dramatically improves performance by reducing traffic to slow GPU memory (HBM) and making better use of the GPU's fast on-chip memory (SRAM).

What Problem Does It Solve?
In traditional attention mechanisms:
- The query (Q), key (K), and value (V) matrices are large, and the intermediate attention score matrix (QKᵀ) grows quadratically with sequence length.
- GPU cores (CUDA cores) must fetch data from HBM (High Bandwidth Memory) over and over.
- Each round trip to HBM is slow compared to on-chip memory.
- Fast shared memory (SRAM) exists but sits mostly idle.
This leads to frequent memory transfers, under-utilized compute units, and slow attention computation.
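To make the bottleneck concrete, here is a minimal NumPy sketch of standard attention (written for this post, not taken from any library). The intermediate score matrix grows quadratically with sequence length, and a standard GPU kernel would write it to and read it back from HBM:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # (N, N) scores -- the huge intermediate
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable row-wise softmax
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                   # (N, d) output

# Illustrative, made-up sizes: at N = 8192 tokens, S alone is
# 8192 * 8192 * 4 bytes = 256 MB in fp32, and a standard kernel
# round-trips it (plus the softmax output) through HBM.
np.random.seed(0)
N, d = 4096, 64
Q, K, V = (np.random.randn(N, d).astype(np.float32) for _ in range(3))
print(naive_attention(Q, K, V).shape)  # (4096, 64)
```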
FlashAttention to the Rescue
FlashAttention solves this by:
- Dividing the Q, K, and V matrices into smaller blocks (e.g., 32×32 tiles).
- Copying each block from HBM to SRAM once (not repeatedly).
- Performing all computations inside SRAM, near the GPU cores.
- Writing results back to HBM only once per block.
This dramatically reduces memory access overhead and accelerates attention computations.
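As a rough back-of-the-envelope sketch (block sizes here are illustrative, matching the 32×32 example above, not the exact values a real FlashAttention kernel would pick), the point of the blocking is that one tile of Q, K, and V is tiny compared to the full score matrix and fits easily in on-chip SRAM:

```python
import numpy as np

# Illustrative sizes only (not the tuned values a real FlashAttention kernel uses).
N, d = 4096, 64      # sequence length, head dimension
Br = Bc = 32         # block (tile) size along the sequence dimension

Q = np.random.randn(N, d).astype(np.float32)
K = np.random.randn(N, d).astype(np.float32)
V = np.random.randn(N, d).astype(np.float32)

# Split along the sequence dimension: each block is (32, 64).
Q_blocks = np.array_split(Q, N // Br)
K_blocks = np.array_split(K, N // Bc)
V_blocks = np.array_split(V, N // Bc)

# One Q-block + K-block + V-block triple is about 3 * 32 * 64 * 4 bytes = 24 KB,
# which fits comfortably in the ~100-200 KB of SRAM (shared memory) available
# per streaming multiprocessor, whereas the full 4096 x 4096 score matrix would
# take 4096 * 4096 * 4 bytes = 64 MB -- far too big to keep on-chip.
print(len(Q_blocks), Q_blocks[0].shape)          # 128 (32, 64)
print(3 * Br * d * 4, "bytes per block triple")  # 24576
```

Because each block is copied from HBM once and then reused for many arithmetic operations in SRAM, the kernel spends its time computing rather than waiting on memory.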
Inner Loop vs Outer Loop
Outer Loop
- Responsible for loading blocks of K/V from HBM to SRAM.
- Each iteration handles a large memory transfer.
- Runs infrequently but handles heavy data movement.
Inner Loop
- Executes on the data already in SRAM.
- Performs the matrix multiplication Q×Kᵀ, the softmax, and the multiplication of the softmax output by V.
- Runs frequently but operates on fast-access memory.
- Fast and efficient: no further HBM access is needed.
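The sketch below (plain NumPy, written for this explanation rather than copied from the FlashAttention code) mirrors that two-level loop: the outer loop picks a K/V block (the data that would be staged into SRAM), and the inner loop computes partial scores for a Q block and folds them into running max/sum statistics, so the exact softmax(QKᵀ/√d)·V result comes out without ever forming the full N×N matrix. The NumPy arrays simply stand in for data the real kernel keeps in SRAM and registers.

```python
import numpy as np

def tiled_attention(Q, K, V, Br=32, Bc=32):
    """Exact attention computed block-by-block with an online (rescaled) softmax.

    Outer loop: one K/V block per iteration (the heavy data movement).
    Inner loop: partial scores for a Q block, folded into running statistics.
    """
    N, d = Q.shape
    O = np.zeros((N, d), dtype=np.float32)          # running (unnormalized) output
    l = np.zeros((N, 1), dtype=np.float32)          # running softmax denominator
    m = np.full((N, 1), -np.inf, dtype=np.float32)  # running row-wise max
    scale = 1.0 / np.sqrt(d)

    for j in range(0, N, Bc):                       # OUTER loop: load one K/V block
        Kj, Vj = K[j:j + Bc], V[j:j + Bc]
        for i in range(0, N, Br):                   # INNER loop: compute on the staged block
            Sij = Q[i:i + Br] @ Kj.T * scale        # (Br, Bc) partial scores
            m_new = np.maximum(m[i:i + Br], Sij.max(axis=-1, keepdims=True))
            Pij = np.exp(Sij - m_new)               # safely exponentiated partial probs
            alpha = np.exp(m[i:i + Br] - m_new)     # rescale factor for old statistics
            l[i:i + Br] = alpha * l[i:i + Br] + Pij.sum(axis=-1, keepdims=True)
            O[i:i + Br] = alpha * O[i:i + Br] + Pij @ Vj
            m[i:i + Br] = m_new
    return O / l                                    # final normalization

# Check against a straightforward softmax(Q K^T / sqrt(d)) V reference.
np.random.seed(0)
N, d = 256, 64
Q, K, V = (np.random.randn(N, d).astype(np.float32) for _ in range(3))
S = Q @ K.T / np.sqrt(d)
P = np.exp(S - S.max(axis=-1, keepdims=True))
ref = (P / P.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(tiled_attention(Q, K, V), ref, atol=1e-4))  # True
```

The running rescale factor (alpha) is what allows the softmax to be computed block-by-block; this is commonly called the online softmax trick.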
Analogy: Kitchen Example
- HBM = Warehouse far away.
- SRAM = Workbench in your kitchen.
- Outer loop = You bring a tray of ingredients from the warehouse to your kitchen.
- Inner loop = You cook the full meal using what’s already on your workbench.
Traditional attention = you run back to the warehouse for every spoonful of spice.
FlashAttention = bring the whole spice rack over once, then cook in peace!
Summary
| Feature | Traditional Attention | FlashAttention |
|---|---|---|
| Memory access | Frequent HBM reads/writes | One transfer per block |
| SRAM usage | Under-utilized | Fully utilized per block |
| Computation location | Intermediates spill to HBM | All in SRAM |
| Speed | Slower, memory-bound | Much faster, memory-efficient |
FlashAttention is a key breakthrough for making large models faster and more scalable, during both training and inference.