Vision Transformers (ViT)

Vision Transformers (ViT) adapt the Transformer architecture from NLP to computer vision. Instead of the convolutional layers of a traditional CNN, a ViT splits an image into fixed-size patches, linearly embeds each patch into a token, and processes the resulting sequence of patch tokens through standard Transformer encoder layers with self-attention.
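The patch-splitting and linear-embedding steps can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `patchify` and `linear_embed`, the toy sizes (an 8x8 RGB image, 4x4 patches, embedding width 6), and the random weights are all assumptions made for the example. A trained model would learn the projection weights and would typically add position information to the tokens before the encoder.

```python
import random


def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    patches = []
    # Walk the patch grid row by row, left to right.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append every channel of this pixel
            patches.append(flat)
    return patches


def linear_embed(patches, weight, bias):
    """Project each flattened patch to a token: token = weight @ patch + bias."""
    tokens = []
    for patch in patches:
        token = [
            sum(w * x for w, x in zip(weight_row, patch)) + b
            for weight_row, b in zip(weight, bias)
        ]
        tokens.append(token)
    return tokens


# Toy sizes, chosen only for illustration (assumptions, not values from the text).
rng = random.Random(0)
H = W = 8   # image height and width
C = 3       # colour channels
P = 4       # patch side length
D = 6       # embedding (token) width

image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0.0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D

patches = patchify(image, P)
tokens = linear_embed(patches, weight, bias)

# (H / P) * (W / P) = 4 patches, each P * P * C = 48 values, projected to D = 6.
print(len(patches), len(patches[0]), len(tokens), len(tokens[0]))  # 4 48 4 6
```

The sequence length seen by the encoder is the number of patches, so halving the patch side quadruples the number of tokens for the same image.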

ViT Architecture:

Scaling Properties:

ViT Variants and Improvements:

Vision Transformers represent a paradigm shift in computer vision. They demonstrate that the inductive biases built into convolutions (locality, translation equivariance) are not strictly necessary when sufficient data and compute are available: self-attention can learn these patterns from data, and because every patch token can attend to every other token within a single layer, it also captures long-range dependencies that CNNs can only reach by stacking many layers.
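To make the long-range point concrete, here is a minimal sketch of single-head scaled dot-product self-attention over a short token sequence. It assumes identity query, key, and value projections to keep the example small. A real Transformer layer uses learned projections and multiple heads, and the function name `self_attention` and the toy tokens are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    weights = []
    for query in tokens:
        # Score this token against every token in the sequence, near or far.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights.append(softmax(scores))
    # Each output is a weighted average of all value vectors.
    outputs = [
        [sum(w * value[j] for w, value in zip(row, tokens)) for j in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Four toy 2-dimensional tokens (arbitrary values chosen for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, weights = self_attention(tokens)

# Every row of the attention matrix covers the whole sequence and sums to 1.
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` has a non-zero entry for every token, so the first token's output already mixes in information from the last token after one layer. A 3x3 convolution, by contrast, only combines immediately neighbouring positions per layer.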

