Vision Transformer (ViT) is an architecture that applies pure self-attention to image patches, without convolutions, demonstrating that transformer scale can substitute for convolutional inductive biases and reach state-of-the-art image classification performance when trained on sufficient data.
ViT Architecture Overview:
- Image patchification: divide image into non-overlapping 16×16 pixel patches; a 224×224 image → 14×14 = 196 patches (patchification and embedding are sketched in code after this list)
- Patch embedding: linear projection embeds each patch to D dimensions (typically 768); learnable projection weights
- Positional embedding: absolute position embeddings (learnable or fixed sinusoidal) added to patch embeddings; encode patch positions
- CLS token: learnable token prepended to sequence; aggregates global information; used for classification
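A minimal PyTorch sketch of this input pipeline (patchify → embed → prepend CLS → add positions), assuming a ViT-Base-like configuration with 16×16 patches and D = 768; the class and variable names are illustrative rather than taken from any specific library.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # 14 * 14 = 196
        # A strided convolution is equivalent to a linear projection of flattened patches.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Learnable absolute position embeddings for the CLS token + all patches.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                                # x: (B, 3, 224, 224)
        x = self.proj(x)                                 # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)                 # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # (B, 1, 768)
        x = torch.cat([cls, x], dim=1)                   # (B, 197, 768)
        return x + self.pos_embed                        # add position information

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))   # -> (2, 197, 768)
```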
Self-Attention Mechanism:
- Pure transformer: stacked transformer encoder blocks; each block applies multi-head self-attention + feed-forward (one block is sketched after this list)
- No convolution: departure from CNN inductive bias (locality, translation equivariance); learn from data
- Global receptive field: every token attends to all other tokens; effective receptive field is entire image
- Computational complexity: O(n²) attention where n = number of patches; manageable for 196-1024 patches
- Interpretability: attention weights visualizable; show which patches relevant for prediction
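A minimal sketch of one pre-norm transformer encoder block as stacked in ViT, again in PyTorch; D = 768 and 12 heads are assumed, and the returned attention weights illustrate the interpretability point above.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, num_heads=12, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):                       # x: (B, N, D); every token attends to all N tokens
        h = self.norm1(x)
        attn_out, attn_weights = self.attn(h, h, h, need_weights=True,
                                           average_attn_weights=True)
        x = x + attn_out                        # residual connection around attention
        x = x + self.mlp(self.norm2(x))         # residual connection around the MLP
        return x, attn_weights                  # attn_weights: (B, N, N), visualizable

x, attn = EncoderBlock()(torch.randn(2, 197, 768))
```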
Training Data Requirements:
- Supervised learning limitation: trained from scratch, ViT underperforms ResNet on ImageNet-1k (~1.3M images) without strong augmentation/regularization
- Large-scale pretraining: ViT shines on datasets >10M images (ImageNet-21k, JFT-300M); scaling laws favor transformers
- Scaling curves: ViT performance improves predictably with model size and data; simple scaling laws
- Inductive bias importance: CNNs exploit locality/translation; ViTs require data to learn these; large data compensates
Data-Efficient ViT (DeiT):
- Knowledge distillation: use a CNN teacher to guide ViT training via a dedicated distillation token; soft or hard teacher targets improve learning (a distillation loss is sketched after this list)
- Augmentation strategy: RandAugment, Mixup, Cutmix significantly improve ViT training stability
- Regularization: stochastic depth (drop path); reduces overfitting on ImageNet-1k
- Training recipe: careful hyperparameter selection (learning rates, schedules) important; not automatic transfer from CNN recipes
- Performance: DeiT-B reaches 81.8% top-1 on ImageNet with ~86M parameters; competitive with EfficientNet despite training on ImageNet-1k alone
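A minimal sketch of DeiT-style hard-label distillation, assuming student logits from both the classification head (CLS token) and the distillation head (distillation token) plus teacher logits are already available; the equal 0.5/0.5 weighting follows the hard-distillation setup, but the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels):
    # The classification head learns from the ground-truth labels...
    loss_cls = F.cross_entropy(cls_logits, labels)
    # ...while the distillation head learns from the teacher's hard predictions.
    teacher_labels = teacher_logits.argmax(dim=-1)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * loss_cls + 0.5 * loss_dist

# Hypothetical shapes: batch of 8, 1000 classes.
cls_logits = torch.randn(8, 1000)
dist_logits = torch.randn(8, 1000)
teacher_logits = torch.randn(8, 1000)   # e.g. from a frozen RegNet/ResNet teacher
labels = torch.randint(0, 1000, (8,))
loss = deit_hard_distillation_loss(cls_logits, dist_logits, teacher_logits, labels)
```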
Hybrid Architectures:
- Convolutional stem: initial convolutional layers extract features; the resulting feature map is tokenized and fed to the transformer (sketched after this list)
- Hybrid ViT: combine CNN inductive biases with transformer flexibility; improved data efficiency
- Trade-off: some inductive bias reduces data requirements; pure transformers more flexible
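A minimal sketch of a convolutional stem producing a 14×14 token grid (total stride 16, matching 16×16 patchification); the channel widths here are illustrative assumptions rather than values from any specific hybrid model.

```python
import torch
import torch.nn as nn

conv_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
    nn.Conv2d(256, 768, kernel_size=3, stride=2, padding=1),  # total stride 16
)

x = torch.randn(2, 3, 224, 224)
feat = conv_stem(x)                          # (2, 768, 14, 14): same token grid as 16x16 patches
tokens = feat.flatten(2).transpose(1, 2)     # (2, 196, 768) -> fed to the transformer encoder
```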
Vision Transformer Variants:
- Swin Transformer: hierarchical structure with shifted windows; efficient local attention; multi-scale features
- Local attention: window-based self-attention reduces complexity from O(n²) to O(n); enables large images/3D data
- Hierarchical features: coarse-to-fine features like CNNs; better for dense prediction (detection, segmentation)
- Shifted windows: windows shifted between consecutive layers; enables cross-window communication while keeping computation efficient (window partitioning and shifting are sketched below)
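A minimal sketch of (shifted) window partitioning in the spirit of Swin: attention is then computed independently inside each 7×7 window, and alternating layers cyclically shift the feature map by half a window before partitioning. Function name and shapes are illustrative.

```python
import torch

def window_partition(x, window_size=7):
    # x: (B, H, W, C) feature map -> (num_windows * B, window_size*window_size, C)
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

x = torch.randn(2, 56, 56, 96)               # stage-1-like feature map
windows = window_partition(x)                # (2 * 8 * 8, 49, 96): attention cost is per-window

# Shift by half a window (7 // 2 = 3) before partitioning, so tokens near window borders
# can attend across the previous layer's window boundaries.
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted)
```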
ViT Downstream Tasks:
- Image classification: primary task; competitive with CNNs when sufficient data
- Object detection: ViT backbones adapted for detection (e.g., DETR, ViTDet) are competitive with CNN-based detectors
- Semantic segmentation: adapt ViT for dense prediction; strong performance with appropriate architectural modifications
- Instance segmentation: mask heads added on top of the backbone; competitive instance and panoptic segmentation
- 3D perception: extend ViT to 3D point clouds, video; show transformer generality
Analysis and Interpretability:
- Attention visualization: attention patterns reveal which image regions are relevant to a prediction; interpretable behavior (a CLS-attention heatmap is sketched after this list)
- Emergent properties: ViT learns edge-, texture-, and object-selective features even though none of these are built in architecturally
- Low-level features: early layers attend both locally and globally and learn diverse low-level filters, in contrast to the strictly local early layers of CNNs
- Patch tokenization: learned patch embeddings develop interesting semantic structure
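A minimal sketch of attention visualization: take the CLS token's attention over the 196 patch tokens (e.g. from the encoder block sketched earlier) and reshape it into a 14×14 heatmap; the random tensor here is only a stand-in for real attention weights.

```python
import torch

attn = torch.softmax(torch.randn(1, 197, 197), dim=-1)   # stand-in for real (B, N, N) weights
cls_to_patches = attn[0, 0, 1:]                           # CLS row, drop the CLS->CLS entry: (196,)
heatmap = cls_to_patches.reshape(14, 14)                  # one value per image patch
# Upsample the 14x14 map (e.g. 16x bilinear) and overlay it on the input image to see
# which regions the model attended to for its prediction.
```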
Advantages Over CNNs:
- Scalability: ViT scaling laws are cleaner and more favorable than those of CNNs; global receptive field from the first layer
- Flexibility: patch-based approach applies to any modality (images, video, 3D, audio); CNNs modality-specific
- Transfer learning: ViT pretraining transfers better to downstream tasks; learned representations more general
- Theoretical understanding: transformer scaling behavior better understood; principled scaling laws
Computational Efficiency:
- Memory requirements: the n×n attention matrix requires O(n²) memory; challenging for high-resolution images (a rough estimate follows this list)
- Efficient variants: sparse attention patterns, local windows reduce complexity; maintain performance
- Hardware acceleration: transformers parallelize well on TPUs/GPUs; efficient implementation critical
- Speed vs accuracy: larger ViTs have slower inference; model size must be chosen to meet latency constraints
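A rough back-of-the-envelope sketch of how attention-matrix memory grows with input resolution for 16×16 patches; fp16 activations, 12 heads, and a single layer's attention matrix are assumed.

```python
def attn_matrix_bytes(img_size, patch=16, heads=12, bytes_per_el=2):
    n = (img_size // patch) ** 2 + 1          # patches + CLS token
    return heads * n * n * bytes_per_el       # one n x n matrix per head

for size in (224, 384, 1024):
    print(size, f"{attn_matrix_bytes(size) / 2**20:.1f} MiB per image per layer")
# 224 -> ~0.9 MiB, 384 -> ~7.6 MiB, 1024 -> ~384 MiB: quadratic growth in the number of patches.
```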
Vision Transformer demonstrates that pure self-attention applied to image patches, without the inductive biases of convolution, achieves strong performance when combined with large-scale pretraining and appropriate regularization.