Home Knowledge Base Patch embedding

Patch embedding is the linear projection layer that maps each flattened image patch from pixel space into a high-dimensional vector representation — converting raw RGB pixel values within each patch into dense feature vectors that serve as input tokens to the Vision Transformer encoder, analogous to word embeddings in natural language processing.

What Is Patch Embedding?

Why Patch Embedding Matters

Patch Embedding Variants

Standard Linear Projection (ViT):

Convolutional Stem (Hybrid ViT):

Overlapping Patch Embedding (CvT, CMT):

Embedding Dimension Comparison

ModelPatch SizeEmbedding DimPatches (224²)Params in Embedding
ViT-Tiny16×16192196147K
ViT-Small16×16384196295K
ViT-Base16×16768196590K
ViT-Large16×161024196786K
ViT-Huge14×141280256753K

Position Embedding Addition

After patch embedding, a position embedding is added to each patch token to encode spatial location:

Tools & Frameworks

Patch embedding is the critical first transformation in every Vision Transformer — converting the continuous pixel world into discrete token representations that unlock the full power of self-attention for visual understanding.

patch embeddingcomputer vision

Explore 500+ Semiconductor & AI Topics

From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.