efficiency
Published: May 27, 2022
FlashAttention: Fast and memory-efficient exact attention with IO-awareness
By Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Research TL;DR
"Reduces memory bottlenecks in Transformer attention calculation using hardware-aware tiling. Enables longer context lengths and faster training speeds."
Abstract
We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads and writes between GPU high-bandwidth memory (HBM) and on-chip GPU SRAM. We show that FlashAttention speeds up both training and inference for Transformers.
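To make the tiling idea concrete, here is a minimal pure-Python sketch, not the authors' CUDA kernel. It processes keys and values in blocks and keeps only a running maximum, a running normalizer, and a running output per query, so the full row of attention scores is never stored at once. The function names (`naive_attention`, `tiled_attention`) and the `block_size` parameter are hypothetical, chosen for illustration. The sketch also tiles only over keys and values, and it does not model SRAM or HBM traffic.

```python
import math


def naive_attention(Q, K, V):
    """Reference softmax attention: builds the full score row for each query."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Same result, but K and V are visited one block at a time.

    Illustrative sketch only: per query it keeps a running max (m), a running
    softmax denominator (l), and a running unnormalized output (acc).
    """
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = float("-inf")   # running max of the scores seen so far
        l = 0.0             # running sum of exp(score - m)
        acc = [0.0] * dv    # running sum of exp(score - m) * value
        for start in range(0, len(K), block_size):
            K_blk = K[start:start + block_size]
            V_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K_blk]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new max.
            # On the first block this is exp(-inf) == 0.0.
            correction = math.exp(m - m_new)
            p = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(p)
            acc = [a * correction + sum(pi * v[j] for pi, v in zip(p, V_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out


# Small usage example with made-up numbers.
Q = [[0.1, 0.2], [0.3, -0.4], [1.0, 0.5]]
K = [[0.5, 0.1], [-0.2, 0.3], [0.7, -0.6], [0.0, 0.9], [1.2, 0.4]]
V = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [0.0, 2.0, 0.3],
     [-1.0, 0.2, 0.8], [2.0, -0.5, 1.0]]
reference = naive_attention(Q, K, V)
blocked = tiled_attention(Q, K, V, block_size=2)
```

Up to floating-point rounding, `blocked` matches `reference` for any block size, which is what "exact attention" refers to: the blockwise computation reorders the arithmetic without approximating the softmax.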