Class token (CLS)

The class token (CLS) is a learnable embedding vector prepended to the sequence of patch tokens in a Vision Transformer (ViT). Through self-attention across the encoder layers it aggregates global image information, and its final-layer state serves as the summary representation of the whole image that is fed into the classification head to produce the prediction.
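The flow described above (prepend the token, mix it with the patches through self-attention, read the prediction from its position) can be sketched in a few lines of NumPy. This is a minimal illustration, not a full ViT block: the sizes, the random weights, and the helper names `softmax` and `self_attention` are assumptions made for the example, and positional embeddings, multi-head attention, layer normalization, and the MLP are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only.
num_patches, dim, num_classes = 16, 32, 10

# Stand-in for the output of a patch-embedding layer for one image.
patch_tokens = rng.normal(size=(num_patches, dim))

# The class token is a single learnable vector; here it is only initialised,
# since no training takes place in this sketch.
cls_token = rng.normal(size=(1, dim)) * 0.02

# Prepend CLS so that it sits at index 0 of the sequence.
tokens = np.concatenate([cls_token, patch_tokens], axis=0)  # (num_patches + 1, dim)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over all tokens."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights


# Random projection matrices stand in for trained weights.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
attended, attn = self_attention(tokens, w_q, w_k, w_v)

# Residual connection around the attention output.
tokens_out = tokens + attended

# Row 0 of the attention matrix: how CLS weights itself and every patch.
cls_attention = attn[0]

# The classification head reads only the CLS position.
w_head = rng.normal(size=(dim, num_classes)) / np.sqrt(dim)
logits = tokens_out[0] @ w_head

print(tokens.shape, cls_attention.shape, logits.shape)
```

In a real model this attention step is repeated in every encoder layer, so the CLS position is updated from the patches many times before the head reads it.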

What Is the Class Token?

Why the Class Token Matters

How the CLS Token Works

Layer 1 (Early):

Middle Layers:

Final Layers:

Classification Head:

CLS Token vs. Global Average Pooling

| Aspect | CLS Token | Global Average Pooling (GAP) |
|---|---|---|
| Mechanism | Learned attention-based aggregation | Simple mean of all patch tokens |
| Learnable | Yes (additional parameters) | No (fixed operation) |
| Flexibility | Can weight patches differently | Equal weight to all patches |
| Performance | Slightly better with large-scale pretraining | Competitive or better with less data |
| DeiT Default | CLS token used | |
| MAE/BEiT | Often use GAP instead | Preferred in self-supervised ViTs |
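The two readouts compared in the table differ only in how the final-layer tokens are reduced to one vector. The sketch below assumes the encoder output is laid out with CLS at index 0 followed by the patch tokens, and that GAP averages the patch tokens only; both are conventions assumed for illustration, and implementations vary on whether a CLS token is kept at all when GAP is used. The function names `cls_readout` and `gap_readout` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 16, 32  # hypothetical sizes

# Stand-in for final-layer encoder output: index 0 is CLS, the rest are patches.
final_tokens = rng.normal(size=(num_patches + 1, dim))


def cls_readout(tokens):
    """Use the CLS position as the image representation."""
    return tokens[0]


def gap_readout(tokens):
    """Average the patch tokens (CLS excluded) into one vector."""
    return tokens[1:].mean(axis=0)


cls_repr = cls_readout(final_tokens)
gap_repr = gap_readout(final_tokens)

# GAP is a weighted sum of patches with fixed, uniform weights 1/num_patches,
# whereas the CLS token's weights over patches are produced by attention.
uniform_weights = np.full(num_patches, 1.0 / num_patches)
gap_as_weighted_sum = uniform_weights @ final_tokens[1:]

print(cls_repr.shape, gap_repr.shape)
print(np.allclose(gap_repr, gap_as_weighted_sum))
```

Either vector can be passed to the same linear classification head, which is why the choice between them is usually a configuration option rather than an architectural change.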

Variants and Extensions

The class token is the lens through which a Vision Transformer sees the whole image. By attending to the patch tokens in every encoder layer, this single learned vector distills an image into a representation rich enough to drive classification and transfer learning.
