vision
Published: October 22, 2020
An image is worth 16x16 words: Transformers for image recognition at scale
By Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner
Research TL;DR
"Introduces the Vision Transformer (ViT), showing that self-attention layers can fully replace convolutions in image classification tasks."
Abstract
While the Transformer architecture has become the de facto standard for natural language processing, its applications to computer vision remain limited. In this paper, we show that a standard Transformer applied directly to images, with the fewest possible modifications, performs well on image classification.
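The "16x16 words" in the title refer to treating fixed-size image patches as the tokens of a sequence, much like words in a sentence. The sketch below illustrates only that first step, assuming a 224x224 RGB input and NumPy; the function name `patchify` and the input size are illustrative choices, not taken from this page, and the learned linear projection, position embeddings, and the Transformer itself are omitted.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened square patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    grid_h, grid_w = h // patch_size, w // patch_size
    # Separate each spatial axis into (grid index, offset within patch).
    patches = image.reshape(grid_h, patch_size, grid_w, patch_size, c)
    # Bring the two grid axes together so each patch is contiguous.
    patches = patches.transpose(0, 2, 1, 3, 4)
    # Flatten the grid into a sequence and each patch into a vector.
    return patches.reshape(grid_h * grid_w, patch_size * patch_size * c)


# Hypothetical input: a 224x224 RGB image filled with placeholder values.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)
print(tokens.shape)  # (196, 768): 14 x 14 patches, each 16 * 16 * 3 values
```

Under these assumptions the image becomes a sequence of 196 "words", each a 768-dimensional vector, which is the kind of input a standard Transformer already expects.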