Vision Transformers (ViT) adapt the Transformer architecture from NLP to computer vision: instead of traditional convolutional neural networks, they split images into fixed-size patches, linearly embed each patch into a token, and process the sequence of patch tokens through standard Transformer encoder layers with self-attention.
ViT Architecture:
- Patch Embedding: an image of size H×W×C is split into N patches of size P×P; each patch is flattened to a P²C-dimensional vector and linearly projected to embedding dimension D. Typical P=16 on 224×224 images produces N=196 patches
- Position Embeddings: learnable 1D position embeddings are added to the patch embeddings to encode the spatial location information lost during patch extraction; 2D-aware position encodings (relative or sinusoidal) offer only marginal improvement
- Class Token: a special [CLS] token prepended to the patch sequence; its output representation after the final Transformer layer serves as the image-level representation for classification. Alternative: global average pooling over all patch outputs
- Transformer Encoder: standard multi-head self-attention (MSA) and feed-forward network (FFN) blocks; each layer applies LayerNorm → MSA → residual → LayerNorm → FFN → residual. ViT-Base has 12 layers, D=768, and 12 attention heads
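The full pipeline above can be sketched end to end in numpy. This is a toy forward pass with small dimensions and freshly sampled random weights (not a trained model), and it substitutes ReLU for ViT's GELU activation to stay self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Toy config (real ViT-Base: 224x224 input, P=16, D=768, 12 heads, 12 layers)
H = W = 32; C = 3; P = 8; D = 64; heads = 4
N = (H // P) * (W // P)                          # number of patches = 16

img = rng.standard_normal((H, W, C))

# 1) Patch embedding: split into PxP patches, flatten, project to D
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)          # (N, P^2 * C)
W_embed = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ W_embed                       # (N, D)

# 2) Prepend [CLS] token, add learnable position embeddings
cls = rng.standard_normal((1, D)) * 0.02
x = np.concatenate([cls, tokens], axis=0)        # (N+1, D)
x = x + rng.standard_normal((N + 1, D)) * 0.02

# 3) One encoder block: LN -> MSA -> residual, LN -> FFN -> residual
def msa(x):
    # Random projections sampled inline -- an untrained sketch
    d_h = D // heads
    Wq, Wk, Wv, Wo = (rng.standard_normal((D, D)) * 0.02 for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    out = np.zeros_like(x)
    for h in range(heads):
        s = slice(h * d_h, (h + 1) * d_h)
        attn = softmax(q[:, s] @ k[:, s].T / np.sqrt(d_h))
        out[:, s] = attn @ v[:, s]
    return out @ Wo

W1 = rng.standard_normal((D, 4 * D)) * 0.02      # FFN expands 4x
W2 = rng.standard_normal((4 * D, D)) * 0.02
x = x + msa(layer_norm(x))
x = x + np.maximum(layer_norm(x) @ W1, 0) @ W2   # ReLU stand-in for GELU

cls_out = x[0]        # image-level representation for a classifier head
print(x.shape)        # (17, 64): 16 patch tokens + 1 CLS token
```

A real implementation stacks 12 such blocks with learned weights; the structure per block is exactly the LayerNorm/MSA/FFN/residual order listed above.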
Scaling Properties:
- Data Requirements: ViT requires significantly more training data than CNNs to reach comparable accuracy; the standard recipe is pre-training on ImageNet-21k (14M images) or JFT-300M (300M images) followed by fine-tuning on the target dataset
- DeiT (Data-efficient ViT): achieves competitive accuracy training on ImageNet-1k alone by combining strong data augmentation (RandAugment, CutMix, Mixup), regularization (stochastic depth), and a distillation token that learns from a CNN teacher
- Scale Progression: ViT-Small (22M params), ViT-Base (86M), ViT-Large (307M), ViT-Huge (632M); accuracy scales log-linearly with model size and dataset size, and the largest models match or exceed the best CNNs on standard benchmarks
- Compute Scaling: self-attention is O(N²) in the number of patches N, which limits input resolution; a 384×384 input with P=16 produces 576 patches, roughly 3× more tokens than 224×224 and therefore nearly 9× more attention compute
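The quadratic cost of resolution is easy to verify directly: token count grows with the square of resolution, and pairwise attention grows with the square of token count.

```python
def n_patches(res, p=16):
    """Number of patch tokens for a square image at the given resolution."""
    return (res // p) ** 2

n224, n384 = n_patches(224), n_patches(384)
print(n224, n384)              # 196 576
print(n384 / n224)             # ~2.94x more tokens at 384x384
print((n384 / n224) ** 2)      # ~8.6x more pairwise attention compute
```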
ViT Variants and Improvements:
- Swin Transformer: hierarchical ViT with shifted window attention; O(N) complexity enables processing high-resolution images. Window-based self-attention restricts each token's attention to a local window, with cross-window connections introduced by shifting the windows between layers
- BEiT/MAE: self-supervised pre-training for ViT; the Masked Autoencoder (MAE) masks 75% of patches and reconstructs them, learning strong visual representations without labeled data
- Hybrid ViT: combines a CNN backbone for early feature extraction with a Transformer for later layers; the CNN handles low-level features efficiently while the Transformer captures global relationships
- Multi-Scale ViT: processes patches at multiple resolutions or progressively reduces token count; achieves a CNN-like feature pyramid for dense prediction tasks (detection, segmentation)
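MAE's high mask ratio is also a compute win: the encoder only ever sees the visible 25% of tokens. A minimal sketch of the random-masking step (the decoder and reconstruction loss are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_masking(tokens, mask_ratio=0.75, rng=rng):
    """MAE-style masking: keep a random subset of patch tokens.
    The encoder processes only the kept tokens; a lightweight decoder
    later reconstructs the masked ones from mask tokens + positions."""
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    mask = np.ones(n, dtype=bool)        # True = masked (to reconstruct)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

tokens = rng.standard_normal((196, 768))   # 14x14 patches, ViT-Base dim
visible, keep_idx, mask = random_masking(tokens)
print(visible.shape)    # (49, 768): encoder sees only 25% of tokens
print(int(mask.sum()))  # 147 patches left for the decoder to reconstruct
```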
Vision Transformers represent a paradigm shift in computer vision: they demonstrate that the inductive biases of convolutions (locality, translation equivariance) are not necessary when sufficient data and compute are available, with self-attention learning these patterns from data while also capturing long-range dependencies that CNNs struggle with.