Understanding FlashAttention: Inner vs Outer Loop Optimization

Feb 1, 2025 · 2 min read

FlashAttention is a groundbreaking technique for computing exact attention in Transformer models. It drastically improves performance by reducing memory bottlenecks: it minimizes traffic to slow GPU memory and makes full use of fast on-chip memory.

Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022)


🚀 What Problem Does It Solve?

In traditional attention mechanisms:

  • The intermediate attention matrix, softmax(Q×Kᵀ), is huge: it grows as N×N with sequence length N.
  • GPU cores (CUDA cores) must repeatedly fetch Q, K, V, and that N×N matrix from HBM (High Bandwidth Memory).
  • Each access to HBM is slow compared to on-chip compute.
  • Fast on-chip shared memory (SRAM) exists but is barely used.

This leads to frequent memory transfers, under-utilized cores, and slow training and inference.
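
To make the bottleneck concrete, here is a minimal NumPy sketch of standard attention (single head, no masking; the names are illustrative). Every intermediate is a full N×N matrix that has to round-trip through HBM:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention: materializes the full N x N score matrix."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # (N, N) scores, written to HBM
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # (N, N) again, read and rewritten
    P = P / P.sum(axis=-1, keepdims=True)          # softmax normalization
    return P @ V                                   # (N, d) output

# At N = 8192 tokens, a single fp32 N x N matrix is 8192**2 * 4 bytes = 256 MiB
# per head, and it crosses the HBM boundary several times.
print(8192**2 * 4 / 2**20, "MiB")                  # 256.0 MiB
```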


⚡ FlashAttention to the Rescue

FlashAttention solves this by:

  • Dividing the Q, K, and V matrices into smaller blocks (e.g., 32×32 tiles small enough to fit in SRAM).
  • Copying each block from HBM to SRAM once (not repeatedly).
  • Performing all computations inside SRAM, near the GPU cores.
  • Writing results back to HBM only once per block.

This dramatically reduces memory access overhead and accelerates attention computations.
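
Below is a simplified, pure-NumPy sketch of the scheme (single head, no masking or dropout; the block sizes and variable names are illustrative, and real SRAM residency is decided by the CUDA kernel, not by Python). The running statistics `m` and `l` are what make it possible to compute softmax one block at a time:

```python
import numpy as np

def flash_attention(Q, K, V, Bc=32, Br=32):
    """FlashAttention-style blockwise attention (simplified sketch)."""
    N, d = Q.shape
    O = np.zeros((N, d))         # output accumulator (still unnormalized)
    l = np.zeros(N)              # running softmax denominators
    m = np.full(N, -np.inf)      # running row maxima (numerical stability)
    scale = 1.0 / np.sqrt(d)

    for j in range(0, N, Bc):              # outer loop: one K/V block per HBM trip
        Kj, Vj = K[j:j + Bc], V[j:j + Bc]
        for i in range(0, N, Br):          # inner loop: compute on resident blocks
            Qi = Q[i:i + Br]
            S = Qi @ Kj.T * scale                          # (Br, Bc) score tile
            m_new = np.maximum(m[i:i + Br], S.max(axis=1))
            P = np.exp(S - m_new[:, None])                 # block-local exponentials
            alpha = np.exp(m[i:i + Br] - m_new)            # rescales older blocks
            l[i:i + Br] = alpha * l[i:i + Br] + P.sum(axis=1)
            O[i:i + Br] = alpha[:, None] * O[i:i + Br] + P @ Vj
            m[i:i + Br] = m_new

    return O / l[:, None]                  # final softmax normalization
```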


🔁 Inner Loop vs Outer Loop

Outer Loop

  • Responsible for loading blocks of K/V from HBM to SRAM.
  • Each iteration handles a large memory transfer.
  • Runs infrequently but handles heavy data movement.

Inner Loop

  • Executes on the data already in SRAM.
  • Performs the block computations: Q×Kᵀ, an online (running) softmax, and the product with V.
  • Runs frequently but operates on fast-access memory.
  • Fast and efficient — no further HBM access needed.
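
Because a normal softmax needs an entire row of scores at once, the inner loop instead keeps a running row maximum and running sum, rescaling what it has accumulated whenever a new block arrives. A quick check on the two sketches above shows this loses nothing; FlashAttention is exact attention, not an approximation:

```python
np.random.seed(0)
Q = np.random.randn(128, 64)
K = np.random.randn(128, 64)
V = np.random.randn(128, 64)

ref = naive_attention(Q, K, V)                # full-matrix reference
out = flash_attention(Q, K, V, Bc=32, Br=32)  # blockwise version
print(np.allclose(ref, out))                  # True: the results match
```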

🧠 Analogy: Kitchen Example

  • HBM = Warehouse far away.
  • SRAM = Workbench in your kitchen.
  • Outer loop = You bring a tray of ingredients from the warehouse to your kitchen.
  • Inner loop = You cook the full meal using what’s already on your workbench.

Traditional attention = you run back to the warehouse for every pinch of spice 😅
FlashAttention = bring the whole spice rack once, cook in peace! 👨‍🍳


✅ Summary

| Feature              | Traditional Attention       | FlashAttention                |
|----------------------|-----------------------------|-------------------------------|
| Memory Access        | Frequent HBM access         | One-time block transfer       |
| SRAM Usage           | Under-utilized              | Fully utilized per block      |
| Computation Location | Mix of HBM and registers    | All in SRAM                   |
| Speed                | Slower, memory-bottlenecked | Much faster, memory-efficient |

FlashAttention is a key breakthrough for making large models faster and more scalable, during both training and inference, especially at long sequence lengths where the N×N attention matrix dominates.