Patch embedding

Keywords: patch embedding, computer vision

Patch embedding is the linear projection layer that maps each flattened image patch from pixel space into a high-dimensional vector representation. It converts the raw RGB pixel values within a patch into a dense feature vector that serves as an input token to the Vision Transformer encoder, analogous to a word embedding in natural language processing.

What Is Patch Embedding?

- Definition: A learnable linear transformation (typically implemented as a Conv2D layer) that projects each image patch from its raw pixel representation (e.g., 16×16×3 = 768 values) into a D-dimensional embedding vector (e.g., D = 768 for ViT-Base).
- Implementation: A Conv2D layer with kernel_size = patch_size and stride = patch_size simultaneously extracts patches and projects them: Conv2D(in_channels=3, out_channels=768, kernel_size=16, stride=16). See the sketch after this list.
- Output: For a 224×224 image with 16×16 patches, the embedding layer produces 196 vectors of dimension D, forming the input sequence to the transformer.
- Learnable Weights: The embedding projection matrix is learned during training — the model discovers which linear combinations of pixel values create the most useful feature representations.
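
A minimal PyTorch sketch of this layer, using ViT-Base hyperparameters; the class and argument names are illustrative rather than tied to any particular library:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each one.

    A Conv2D with kernel_size == stride == patch_size is mathematically
    equivalent to flattening every patch and applying a shared linear layer.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2  # 196 for 224/16
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768) token sequence

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```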

Why Patch Embedding Matters

- Dimensionality Alignment: Transforms each flattened patch (e.g., 768 pixel values for a 16×16 RGB patch) into a fixed-size vector matching the transformer's hidden dimension, enabling standard transformer processing.
- Feature Extraction: The learned projection captures basic visual features (edges, colors, textures) within each patch — functioning like the first convolutional layer of a CNN but without the sliding window.
- Dimension Preservation: For ViT-Base, the 16×16×3 = 768 pixel values in each patch map to exactly 768 embedding dimensions, so the projection restructures the patch for transformer processing rather than compressing it; smaller models such as ViT-Tiny (D = 192) do compress each patch.
- Computational Efficiency: A single matrix multiplication per patch replaces the multi-layer convolutional stem that CNNs use to reach a comparable spatial resolution.
- Foundation for Attention: The quality of patch embeddings directly affects the transformer's ability to compute meaningful attention patterns between patches — poor embeddings mean poor attention.

Patch Embedding Variants

Standard Linear Projection (ViT):
- Single Conv2D with large kernel matching patch size.
- Simplest and most common approach.
- Works well with sufficient pretraining data.

Convolutional Stem (Hybrid ViT):
- Replace the single large-kernel convolution with a small CNN stem (three to five convolutional layers with 3×3 kernels).
- Provides better low-level feature extraction and translation equivariance.
- Improves performance when pretraining data is limited; see the sketch below.
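
A sketch of one possible stem, loosely following the recipe in "Early Convolutions Help Transformers See Better" (Xiao et al., 2021); the channel widths here are illustrative assumptions:

```python
import torch
import torch.nn as nn

def conv_stem(embed_dim=768):
    """Four 3x3 convolutions with stride 2 take 224 -> 14 spatially,
    then a 1x1 convolution projects to the embedding dimension."""
    dims = [3, 64, 128, 256, 512]  # illustrative channel widths
    layers = []
    for i in range(4):
        layers += [nn.Conv2d(dims[i], dims[i + 1], 3, stride=2, padding=1),
                   nn.BatchNorm2d(dims[i + 1]),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(dims[-1], embed_dim, kernel_size=1))
    return nn.Sequential(*layers)

feat = conv_stem()(torch.randn(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 768, 14, 14]) -> 196 tokens after flattening
```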

Overlapping Patch Embedding (CvT, CMT):
- Use stride smaller than kernel size to create overlapping patches.
- Reduces information loss at patch boundaries.
- Increases sequence length and compute cost, since overlapping patches tile the image more densely; see the sketch below.
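
A sketch of the idea; the kernel, stride, and padding values below are illustrative, not the exact CvT or CMT configuration:

```python
import torch
import torch.nn as nn

# stride < kernel_size, so neighbouring patches share an 8-pixel border.
# Tokens per axis: floor((H + 2*pad - kernel) / stride) + 1
overlap_embed = nn.Conv2d(3, 768, kernel_size=16, stride=8, padding=4)

y = overlap_embed(torch.randn(1, 3, 224, 224))  # (1, 768, 28, 28)
tokens = y.flatten(2).transpose(1, 2)           # 28*28 = 784 tokens vs. 196
print(tokens.shape)  # torch.Size([1, 784, 768])
```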

Embedding Dimension Comparison

| Model | Patch Size | Embedding Dim | Patches (224×224 input) | Params in Embedding |
|-------|------------|---------------|-------------------------|---------------------|
| ViT-Tiny | 16×16 | 192 | 196 | 147K |
| ViT-Small | 16×16 | 384 | 196 | 295K |
| ViT-Base | 16×16 | 768 | 196 | 590K |
| ViT-Large | 16×16 | 1024 | 196 | 786K |
| ViT-Huge | 14×14 | 1280 | 256 | 753K |
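
The parameter counts above can be reproduced with one line of arithmetic per model: the projection weight has (patch_size² × 3) inputs and D outputs. The table appears to omit the small bias term (D extra parameters per model):

```python
# Weights of the patch-embedding projection: (P*P*3) inputs x D outputs.
configs = {"ViT-Tiny": (16, 192), "ViT-Small": (16, 384),
           "ViT-Base": (16, 768), "ViT-Large": (16, 1024),
           "ViT-Huge": (14, 1280)}
for name, (p, d) in configs.items():
    print(f"{name}: {p * p * 3 * d / 1e3:.0f}K")
# ViT-Tiny: 147K, ViT-Small: 295K, ViT-Base: 590K,
# ViT-Large: 786K, ViT-Huge: 753K
```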

Position Embedding Addition

After patch embedding, a position embedding is added to each patch token to encode spatial location:
- Learned Position Embeddings: A separate learnable vector for each patch position; standard in the original ViT and sketched below.
- Sinusoidal Position Embeddings: Fixed mathematical encoding using sine and cosine functions.
- Without Position Embedding: The model loses all spatial information — it cannot distinguish a patch in the top-left from one in the bottom-right.
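
A minimal sketch of the learned variant; for brevity it omits the [CLS] token that the original ViT prepends before adding position embeddings (which would make the sequence length 197):

```python
import torch
import torch.nn as nn

num_patches, embed_dim = 196, 768
patch_tokens = torch.randn(1, num_patches, embed_dim)  # patch-embedding output

# One trainable vector per position, broadcast over the batch and added.
pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
nn.init.trunc_normal_(pos_embed, std=0.02)  # ViT-style initialization

tokens = patch_tokens + pos_embed  # (1, 196, 768), now position-aware
```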

Tools & Frameworks

- PyTorch: timm library provides ViT implementations with configurable patch embedding layers.
- Hugging Face: transformers.ViTModel includes standard patch embedding as ViTEmbeddings.
- JAX/Flax: Google's scenic and big_vision repositories implement patch embedding for TPU training.
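
For example, inspecting the patch embedding of a timm model (assuming a recent timm version where this model name and the patch_embed attribute are available):

```python
import timm
import torch

model = timm.create_model("vit_base_patch16_224", pretrained=False)
print(type(model.patch_embed).__name__)  # PatchEmbed

x = torch.randn(1, 3, 224, 224)
print(model.patch_embed(x).shape)  # torch.Size([1, 196, 768])
```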

Patch embedding is the critical first transformation in every Vision Transformer — converting the continuous pixel world into discrete token representations that unlock the full power of self-attention for visual understanding.
