Vision transformer

The architecture of Vision Transformer. An input image is divided into patches, each of which is linearly mapped through a patch embedding layer, before entering a standard Transformer encoder.

A vision transformer (ViT) is a transformer designed for computer vision.[1] A ViT decomposes an input image into a series of patches (rather than decomposing text into tokens), serializes each patch into a vector, and maps it to a lower-dimensional embedding with a single matrix multiplication. These vector embeddings are then processed by a transformer encoder as if they were token embeddings.
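The patch-embedding step can be illustrated with a short sketch. The code below is a minimal illustration rather than the reference implementation; the class name and the chosen sizes (224×224 input, 16×16 patches, 768-dimensional embeddings) are assumptions matching a common ViT configuration.

```python
# Minimal sketch of ViT patch embedding (illustrative, not the original implementation).
# Assumes a 224x224 RGB image, 16x16 patches, and an embedding dimension of 768.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to cutting the image into
        # non-overlapping patches and applying one shared linear map to each patch.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                 # x: (batch, 3, 224, 224)
        x = self.proj(x)                  # (batch, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (batch, 196, 768): one token per patch
        return x

# Example: a batch of two images becomes two sequences of 196 patch tokens.
tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 196, 768])
```

The resulting sequence of patch tokens is then fed to a standard transformer encoder, exactly as a sequence of word embeddings would be.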

ViTs were designed as alternatives to convolutional neural networks (CNNs) in computer vision applications. They have different inductive biases, training stability, and data efficiency.[2] Compared to CNNs, ViTs are less data-efficient but have higher capacity. Some of the largest modern computer vision models are ViTs, such as one with 22 billion parameters.[3][4] More recently, a dense ViT with 113 billion parameters was proposed for weather and climate prediction; it is the largest ViT to date and was trained on the Frontier supercomputer at a throughput of 1.6 exaFLOPs.[5]

Subsequent to its publication, many variants were proposed, including hybrid architectures that combine features of ViTs and CNNs. ViTs have found application in image recognition, image segmentation, and autonomous driving.[6][7]

  1. ^ Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021-06-03). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929 [cs.CV].
  2. ^ Cite error: The named reference :1 was invoked but never defined (see the help page).
  3. ^ Dehghani, Mostafa; Djolonga, Josip; Mustafa, Basil; Padlewski, Piotr; Heek, Jonathan; Gilmer, Justin; Steiner, Andreas; Caron, Mathilde; Geirhos, Robert (2023-02-10), Scaling Vision Transformers to 22 Billion Parameters, arXiv:2302.05442
  4. ^ "Scaling vision transformers to 22 billion parameters". research.google. Retrieved 2024-08-07.
  5. ^ "ORBIT: Oak Ridge Base Foundation Model for Earth System Predictability." https://arxiv.org/abs/2404.14712
  6. ^ Han, Kai; Wang, Yunhe; Chen, Hanting; Chen, Xinghao; Guo, Jianyuan; Liu, Zhenhua; Tang, Yehui; Xiao, An; Xu, Chunjing; Xu, Yixing; Yang, Zhaohui; Zhang, Yiman; Tao, Dacheng (2023-01-01). "A Survey on Vision Transformer". IEEE Transactions on Pattern Analysis and Machine Intelligence. 45 (1): 87–110. arXiv:2012.12556. doi:10.1109/TPAMI.2022.3152247. ISSN 0162-8828. PMID 35180075.
  7. ^ Khan, Salman; Naseer, Muzammal; Hayat, Munawar; Zamir, Syed Waqas; Khan, Fahad Shahbaz; Shah, Mubarak (2022-09-13). "Transformers in Vision: A Survey". ACM Comput. Surv. 54 (10s): 200:1–200:41. arXiv:2101.01169. doi:10.1145/3505244. ISSN 0360-0300.