Vision · Published: October 22, 2020

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

By Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, et al.

Research TL;DR

"Introduces the Vision Transformer (ViT), showing that self-attention layers can fully replace convolutions in image classification tasks."

Abstract

While the Transformer architecture has become the de-facto standard for natural language processing, its applications to computer vision remain limited. In this paper, we show that applying a standard Transformer directly to images, with the fewest possible modifications, works well.
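To make the "16x16 words" in the title concrete, here is a minimal sketch of how an image could be cut into a sequence of patch tokens before a standard Transformer sees it. The function name `patchify`, the nested-list image representation, and the 224 x 224 RGB input size are illustrative assumptions, not details taken from this summary. The model described in the paper additionally applies a learned linear projection and position embeddings to these patches, which the sketch leaves out.

```python
def patchify(image, patch_size=16):
    """Split an H x W image into a row-major sequence of flattened patches.

    `image` is a nested list: rows of pixels, each pixel a list of channel
    values. This representation is an assumption made for illustration.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            # Flatten one patch_size x patch_size block into a single vector.
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)
            patches.append(flat)
    return patches


# Hypothetical 224 x 224 RGB input filled with zeros.
image = [[[0, 0, 0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))  # 196 768
```

Under these assumptions the image becomes 14 x 14 = 196 tokens, each a vector of 16 x 16 x 3 = 768 values, which plays the role that a sequence of word embeddings plays in a language model.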

Read full paper on arXiv →