Vision Transformers (ViT)

Vision Transformers (ViT) are a deep learning architecture that applies the Transformer's self-attention mechanism directly to image recognition. An image is split into a sequence of fixed-size patches, each patch is embedded as a token, and the sequence is processed by standard Transformer encoder layers, reaching state-of-the-art image classification accuracy without any convolutional layers.

The Patch Embedding Insight

ConvNets process images through local receptive fields that gradually expand across layers. ViT takes a radically different approach: a 224x224 image is divided into a grid of non-overlapping patches (typically 16x16 pixels each, yielding a 14x14 grid of 196 patches). Each patch is flattened into a vector of 16 x 16 x 3 = 768 pixel values and mapped by a learned linear projection to the model's embedding dimension (also 768 in ViT-Base). The result is a sequence of 196 "visual tokens", to which a learnable [CLS] classification token is prepended, for 197 tokens in total.
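The patch arithmetic above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names `patchify` and `embed_patches`, the random weights, and the zero-initialized [CLS] vector are all assumptions made for the example, and a trained model would learn the projection and the [CLS] token.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    # (gh, P, gw, P, C) -> (gh, gw, P, P, C) -> (gh * gw, P * P * C)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, w_proj, b_proj, cls_token, patch_size=16):
    """Project flattened patches to the embedding dimension and prepend [CLS]."""
    patches = patchify(image, patch_size)      # (N, P*P*C)
    tokens = patches @ w_proj + b_proj         # (N, D)
    return np.concatenate([cls_token[None, :], tokens], axis=0)  # (N + 1, D)

# Hypothetical inputs: a random "image" and randomly initialized parameters.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)
embed_dim = 768
w_proj = rng.normal(0.0, 0.02, size=(16 * 16 * 3, embed_dim)).astype(np.float32)
b_proj = np.zeros(embed_dim, dtype=np.float32)
cls_token = np.zeros(embed_dim, dtype=np.float32)

tokens = embed_patches(image, w_proj, b_proj, cls_token)
print(tokens.shape)  # (197, 768)
```

The reshape and transpose pair is what turns the pixel grid into a row-major list of patches; the first row of `patchify(image, 16)` is the top-left 16x16 patch, flattened.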

Architecture

1. Patch Embedding: Linear projection of flattened patches, plus learnable positional embeddings (self-attention by itself ignores token order, so without them the model has no information about where each patch sits in the image).
2. Transformer Encoder: Standard multi-head self-attention and MLP blocks, typically 12-24 layers (12 in ViT-Base, 24 in ViT-Large). Every patch attends to every other patch from the first layer, giving a global receptive field immediately, unlike ConvNets, which build global context gradually.
3. Classification Head: The [CLS] token's final representation is projected through a linear layer to class logits.
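The three stages can be sketched as a small NumPy forward pass. This is a simplified illustration under stated assumptions, not the reference ViT implementation: biases and the learnable LayerNorm scale/shift are omitted, weights are random, the dimensions are deliberately tiny, and helper names such as `init_block` and `vit_forward` are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Simplification: no learnable scale/shift parameters.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def self_attention(x, p, num_heads):
    n, d = x.shape
    head_dim = d // num_heads

    def split(w):
        # Project, then split into heads: (heads, N, head_dim)
        return (x @ w).reshape(n, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(p["wq"]), split(p["wk"]), split(p["wv"])
    # Every token attends to every token: weights have shape (heads, N, N).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ p["wo"], attn

def encoder_block(x, p, num_heads):
    # Pre-norm residual block: attention, then a two-layer MLP.
    attn_out, attn = self_attention(layer_norm(x), p, num_heads)
    x = x + attn_out
    x = x + gelu(layer_norm(x) @ p["w1"]) @ p["w2"]
    return x, attn

def init_block(rng, d, hidden):
    shapes = {"wq": (d, d), "wk": (d, d), "wv": (d, d), "wo": (d, d),
              "w1": (d, hidden), "w2": (hidden, d)}
    return {name: rng.normal(0.0, 0.02, size=s) for name, s in shapes.items()}

def vit_forward(tokens, pos_embed, blocks, w_head, num_heads):
    x = tokens + pos_embed                 # 1. add positional embeddings
    attn = None
    for p in blocks:                       # 2. Transformer encoder
        x, attn = encoder_block(x, p, num_heads)
    cls_final = layer_norm(x)[0]           # 3. [CLS] token is row 0
    return cls_final @ w_head, attn

# Toy sizes (assumed for the example): 197 tokens, width 64, 4 heads, 2 layers.
rng = np.random.default_rng(0)
num_tokens, dim, heads, depth, num_classes = 197, 64, 4, 2, 10
tokens = rng.normal(size=(num_tokens, dim))   # stand-in for [CLS] + patch tokens
pos_embed = rng.normal(0.0, 0.02, size=(num_tokens, dim))
blocks = [init_block(rng, dim, 4 * dim) for _ in range(depth)]
w_head = rng.normal(0.0, 0.02, size=(dim, num_classes))

logits, attn = vit_forward(tokens, pos_embed, blocks, w_head, heads)
print(logits.shape, attn.shape)  # (10,) (4, 197, 197)
```

The 197x197 attention matrix per head makes the global receptive field concrete: each row is a probability distribution over all tokens, so even in the first layer a patch can draw on any other patch in the image.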

Scaling Behavior

ViT's key finding: Transformers underperform comparable ConvNets when trained on smaller datasets (ImageNet-1K alone) because they lack the inductive biases (translation equivariance, locality) that help ConvNets learn efficiently from limited data. However, when pre-trained on large datasets (ImageNet-21K, JFT-300M) and then fine-tuned, ViT matches or exceeds the best ConvNets while requiring less compute to pre-train.

Major Variants

Impact Beyond Classification

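ViT's success accelerated the adoption of Transformers across computer vision: object detection (DETR, DINO), semantic segmentation (SegFormer, Mask2Former), video understanding (TimeSformer, VideoMAE), and multimodal models (CLIP, LLaVA) all build on Transformer architectures. Many of them use ViT or ViT-style backbones, while others, such as the original DETR, pair a convolutional backbone with a Transformer encoder-decoder.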

Vision Transformers are the architecture that showed attention is all you need for images too, demonstrating that the same mechanism powering language models can classify and interpret visual information when given enough data to overcome its lack of visual inductive bias.
