A vision transformer (ViT) is a transformer designed for computer vision.[1] A ViT breaks down an input image into a series of patches (rather than breaking up text into tokens), serialises each patch into a vector, and maps it to a smaller dimension with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.
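The patch-to-embedding step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the projection matrix here is random, where a trained ViT would use learned weights, and the patch and embedding sizes are arbitrary toy values.

```python
import numpy as np

def patch_embed(image, patch_size, W):
    """Split an image (H, W, C) into non-overlapping patches, flatten
    each patch into a vector, and project all of them to a smaller
    dimension with a single matrix multiplication."""
    H, Wd, C = image.shape
    p = patch_size
    # Arrange pixels into a (rows, p, cols, p, C) grid of patches,
    # then flatten each patch to a vector of length p * p * C.
    patches = image.reshape(H // p, p, Wd // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    # One matrix multiply maps every patch vector to the embedding size.
    return patches @ W

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))       # toy 32x32 RGB image
W = rng.standard_normal((16 * 16 * 3, 64))   # stand-in for learned weights
tokens = patch_embed(img, 16, W)
print(tokens.shape)  # (4, 64): four patches, each a 64-dim embedding
```

The resulting `tokens` array plays the same role as a sequence of token embeddings in a text transformer, and is what the transformer encoder then processes.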
ViTs were designed as alternatives to convolutional neural networks (CNNs) in computer vision applications. The two architectures differ in their inductive biases, training stability, and data efficiency.[2] Compared to CNNs, ViTs are less data-efficient but have higher capacity. Some of the largest modern computer vision models are ViTs, such as one with 22B parameters.[3][4]