efficiency

Published: May 27, 2022

FlashAttention: Fast and memory-efficient exact attention with IO-awareness

By Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

Research TL;DR

"Reduces memory bottlenecks in Transformer attention calculation using hardware-aware tiling. Enables longer context lengths and faster training speeds."

Abstract

We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads and writes between GPU high bandwidth memory (HBM) and on-chip GPU SRAM. We show that FlashAttention speeds up both training and inference for Transformers.
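To make the tiling idea concrete, here is a minimal sketch in plain Python, not the authors' GPU kernel. It processes the keys and values for a single query one block at a time and keeps only a running maximum, a running softmax denominator, and a running weighted sum, so the full row of attention scores is never stored at once. The abstract above mentions only tiling; the rescaling step used here to combine blocks exactly (often called online softmax) is an assumption of this sketch, as are the function names `naive_attention` and `tiled_attention` and the `block_size` parameter.

```python
import math


def naive_attention(q, keys, values):
    """Reference: compute every score for one query, then apply softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Illustrative sketch: same result, visiting keys/values block by block.

    Assumption: blocks are merged with a running max and running sum so
    that earlier partial results can be rescaled when a larger score appears.
    """
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized weighted sum of values

    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]

        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this is exp(-inf) == 0.0, which is harmless.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]

        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max

    return [a / running_sum for a in acc]
```

Both functions should agree up to floating-point rounding for any `block_size` of at least 1, which is what "exact" means here: the tiled version changes the order of the computation, not its result. On a GPU the benefit would come from keeping each block in fast on-chip memory; this sketch only demonstrates the arithmetic.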


Related Research

Training compute-optimal large language models

LoRA: Low-rank adaptation of large language models