Variant of Transformer designed for vision processing
A vision transformer (ViT) is a transformer designed for computer vision.[1] A ViT decomposes an input image into a series of patches (rather than text into tokens), serializes each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.
ViTs were designed as alternatives to convolutional neural networks (CNNs) in computer vision applications. They have different inductive biases, training stability, and data efficiency.[2] Compared to CNNs, ViTs are less data efficient, but have higher capacity. Some of the largest modern computer vision models are ViTs, such as one with 22B parameters.[3][4] More recently, a 113 Billion Parameters dense ViT model is proposed for weather and climate prediction, the largest ViT to date, and is trained on Frontier supercomputer at 1.6 exaFLOP computing throughput for training.[5]