Vision Transformers (ViT) adapt the Transformer architecture from NLP to computer vision: instead of traditional convolutional neural networks, they split images into fixed-size patches, linearly embed each patch into a token, and process the sequence of patch tokens through standard Transformer encoder layers with self-attention.
ViT Architecture:
- Patch Embedding: an image of size H×W×C is split into N patches of size P×P; each patch is flattened to a P²C-dimensional vector and linearly projected to embedding dimension D. Typical P=16 on 224×224 images produces N=196 patches
- Position Embeddings: learnable 1D position embeddings are added to the patch embeddings to encode the spatial location information lost during patch extraction; 2D-aware position encodings (relative or sinusoidal) offer only marginal improvement
- Class Token: a special [CLS] token prepended to the patch sequence; its output representation after the final Transformer layer serves as the image-level representation for classification. Alternative: global average pooling over all patch outputs
- Transformer Encoder: standard multi-head self-attention (MSA) and feed-forward network (FFN) blocks; each layer applies LayerNorm → MSA → residual → LayerNorm → FFN → residual. ViT-Base has 12 layers, D=768, and 12 attention heads
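The full pipeline above can be sketched end to end in numpy. This is a toy forward pass with small dimensions and freshly sampled random weights (not a trained model), and it substitutes ReLU for ViT's GELU activation to stay self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Toy config (real ViT-Base: 224x224 input, P=16, D=768, 12 heads, 12 layers)
H = W = 32; C = 3; P = 8; D = 64; heads = 4
N = (H // P) * (W // P)                          # number of patches = 16

img = rng.standard_normal((H, W, C))

# 1) Patch embedding: split into PxP patches, flatten, project to D
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)          # (N, P^2 * C)
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ W_embed                       # (N, D)

# 2) Prepend [CLS] token, add learnable position embeddings
cls = rng.standard_normal((1, D)) * 0.02
x = np.concatenate([cls, tokens], axis=0)        # (N+1, D)
x = x + rng.standard_normal((N + 1, D)) * 0.02

# 3) One encoder block: LN -> MSA -> residual, LN -> FFN -> residual
def msa(x):
    # Random projections sampled inline -- an untrained sketch
    d_h = D // heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.02 for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    for h in range(heads):
        s = slice(h * d_h, (h + 1) * d_h)
        attn = softmax(q[:, s] @ k[:, s].T / np.sqrt(d_h))
        out[:, s] = attn @ v[:, s]
    return out @ Wo

W1 = rng.standard_normal((D, 4 * D)) * 0.02      # FFN expands 4x
W2 = rng.standard_normal((4 * D, D)) * 0.02
x = x + msa(layer_norm(x))
x = x + np.maximum(layer_norm(x) @ W1, 0) @ W2   # ReLU stand-in for GELU

cls_out = x[0]        # image-level representation for a classifier head
print(x.shape)        # (17, 64): 16 patch tokens + 1 CLS token
```

A real implementation stacks 12 such blocks with learned weights; the structure per block is exactly the LayerNorm/MSA/FFN/residual order listed above.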
Scaling Properties:
- Data Requirements: ViT requires significantly more training data than CNNs to reach comparable accuracy; the standard recipe is pre-training on ImageNet-21k (14M images) or JFT-300M (300M images) followed by fine-tuning on the target dataset
- DeiT (Data-efficient ViT): achieves competitive accuracy training on ImageNet-1k alone by combining strong data augmentation (RandAugment, CutMix, Mixup), regularization (stochastic depth), and a distillation token that learns from a CNN teacher
- Scale Progression: ViT-Small (22M params), ViT-Base (86M), ViT-Large (307M), ViT-Huge (632M); accuracy scales log-linearly with model size and dataset size, and the largest models match or exceed the best CNNs on standard benchmarks
- Compute Scaling: self-attention is O(N²) in the number of patches N, which limits input resolution; a 384×384 input with P=16 produces 576 patches, roughly 3× more tokens than 224×224 and therefore nearly 9× more attention compute
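The quadratic cost of resolution is easy to verify directly: token count grows with the square of resolution, and pairwise attention grows with the square of token count.

```python
def n_patches(res, p=16):
    """Number of patch tokens for a square image at the given resolution."""
    return (res // p) ** 2

n224, n384 = n_patches(224), n_patches(384)
print(n224, n384)              # 196 576
print(n384 / n224)             # ~2.94x more tokens at 384x384
print((n384 / n224) ** 2)      # ~8.6x more pairwise attention compute
```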
ViT Variants and Improvements:
- Swin Transformer: hierarchical ViT with shifted window attention; O(N) complexity enables processing high-resolution images. Window-based self-attention restricts each token's attention to a local window, with cross-window connections introduced by shifting the windows between layers
- BEiT/MAE: self-supervised pre-training for ViT; the Masked Autoencoder (MAE) masks 75% of patches and reconstructs them, learning strong visual representations without labeled data
- Hybrid ViT: combines a CNN backbone for early feature extraction with a Transformer for later layers; the CNN handles low-level features efficiently while the Transformer captures global relationships
- Multi-Scale ViT: processes patches at multiple resolutions or progressively reduces token count; achieves a CNN-like feature pyramid for dense prediction tasks (detection, segmentation)
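MAE's high mask ratio is also a compute win: the encoder only ever sees the visible 25% of tokens. A minimal sketch of the random-masking step (the decoder and reconstruction loss are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_masking(tokens, mask_ratio=0.75, rng=rng):
    """MAE-style masking: keep a random subset of patch tokens.
    The encoder processes only the kept tokens; a lightweight decoder
    later reconstructs the masked ones from mask tokens + positions."""
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)        # True = masked (to reconstruct)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

tokens = rng.standard_normal((196, 768))   # 14x14 patches, ViT-Base dim
visible, keep_idx, mask = random_masking(tokens)
print(visible.shape)    # (49, 768): encoder sees only 25% of tokens
print(int(mask.sum()))  # 147 patches left for the decoder to reconstruct
```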
Vision Transformers represent a paradigm shift in computer vision: they demonstrate that the inductive biases of convolutions (locality, translation equivariance) are not necessary when sufficient data and compute are available, with self-attention learning these patterns from data while also capturing long-range dependencies that CNNs struggle with.