Class token (CLS)

Keywords: class token, cls, computer vision

The class token (CLS) is a special learnable embedding vector prepended to the sequence of patch tokens in a Vision Transformer. Through self-attention it aggregates global image information, serving as the summary representation of the entire image that is ultimately fed into the classification head to produce the final prediction.

What Is the Class Token?

- Definition: A trainable parameter vector of the same dimension as patch embeddings (e.g., 768-D for ViT-Base) that is concatenated to the beginning of the patch token sequence before being processed by the transformer encoder layers.
- Origin: Borrowed directly from BERT (Bidirectional Encoder Representations from Transformers), where the [CLS] token similarly aggregates sequence-level information for classification tasks.
- Sequence Position: Added at position 0, making the full input sequence [CLS, patch_1, patch_2, ..., patch_N] with length N+1 (e.g., 197 tokens for 196 patches + 1 CLS).
- Output Usage: After passing through all transformer layers, only the CLS token's final hidden state is used for classification — it is fed into an MLP head that produces class probabilities.
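
A minimal PyTorch sketch of this construction, assuming ViT-Base dimensions (the tensor names here are illustrative, not taken from any particular library):

```python
import torch
import torch.nn as nn

# Sketch: prepend a learnable CLS token to the patch token sequence.
# Dimensions follow ViT-Base: 768-D embeddings, 196 patches
# (224x224 image, 16x16 patches).
embed_dim, num_patches, batch = 768, 196, 8

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                # learnable (1, 1, D)
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # one slot per token, incl. CLS

patch_tokens = torch.randn(batch, num_patches, embed_dim)  # stand-in for real patch embeddings

cls = cls_token.expand(batch, -1, -1)        # broadcast CLS to the batch: (B, 1, D)
x = torch.cat([cls, patch_tokens], dim=1)    # (B, 197, D): [CLS, patch_1, ..., patch_N]
x = x + pos_embed                            # add positional embeddings

print(x.shape)  # torch.Size([8, 197, 768])
```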

Why the Class Token Matters

- Global Information Aggregation: Through self-attention across all transformer layers, the CLS token attends to every patch in the image, gradually building a holistic representation of the entire visual scene.
- Task-Agnostic Representation: The CLS token learns a general-purpose image representation during pretraining that transfers effectively to diverse downstream tasks.
- Decoupled from Spatial Structure: Unlike CNN global average pooling, the CLS token is not tied to any spatial location — it can learn complex non-linear combinations of patch information through attention.
- Clean Architectural Separation: The CLS token cleanly separates the "understanding" function (transformer encoder) from the "decision" function (classification head) without requiring architectural modifications.
- BERT Compatibility: Using a CLS token maintains architectural consistency with NLP transformers, enabling shared research insights and multimodal fusion between vision and language models.

How the CLS Token Works

Early Layers:
- CLS token attends broadly to all patches with roughly uniform attention weights.
- Captures low-level global statistics (average color, overall brightness, texture distribution).

Middle Layers:
- Attention becomes more selective — CLS token focuses on informative patches (objects, distinctive features).
- Builds intermediate feature representations combining local and global context.

Final Layers:
- By the final layers, the CLS token has attended to every patch many times, with residual connections carrying this accumulated information forward.
- Contains a rich, compressed representation of the entire image's semantic content.
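
To make this aggregation concrete, here is a toy single-head self-attention step (a sketch with illustrative dimensions, not a full ViT layer): row 0 of the attention matrix holds the CLS token's weights over every token, and the CLS output is the correspondingly weighted mix of value vectors.

```python
import torch
import torch.nn.functional as F

# Toy single-head self-attention over [CLS, patch_1, ..., patch_N] showing
# how the CLS token (row 0) aggregates information from every patch.
D, N = 64, 196
tokens = torch.randn(N + 1, D)                 # row 0 is the CLS token

Wq, Wk, Wv = (torch.randn(D, D) / D**0.5 for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

attn = F.softmax(Q @ K.T / D**0.5, dim=-1)     # (N+1, N+1) attention weights

cls_weights = attn[0]                          # CLS token's attention over all 197 tokens
cls_out = cls_weights @ V                      # CLS output: weighted sum of value vectors

print(cls_weights.shape, cls_weights.sum())    # torch.Size([197]) tensor(1.0000)
```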

Classification Head:
- The CLS token's final hidden state (768-D for ViT-Base) is passed through a classification head, as sketched below.
- The head is typically a single Linear(768, num_classes) layer, or a small MLP: Linear(768, hidden) → GELU → Linear(hidden, num_classes).
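
A sketch of this final step, assuming ViT-Base's 768-D output (the single-linear variant matches the original ViT's fine-tuning head):

```python
import torch
import torch.nn as nn

# Classify from the CLS token's final hidden state (ViT-Base: 768-D).
num_classes, hidden_dim = 1000, 3072

encoder_output = torch.randn(8, 197, 768)  # (B, N+1, D) from the last encoder layer
cls_final = encoder_output[:, 0]           # (B, 768): keep only the CLS token

# Variant 1: a single linear layer.
head = nn.Linear(768, num_classes)

# Variant 2: an MLP with one hidden layer.
mlp_head = nn.Sequential(
    nn.Linear(768, hidden_dim),
    nn.GELU(),
    nn.Linear(hidden_dim, num_classes),
)

logits = head(cls_final)           # (B, num_classes)
probs = logits.softmax(dim=-1)     # class probabilities
```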

CLS Token vs. Global Average Pooling

| Aspect | CLS Token | Global Average Pooling (GAP) |
|--------|-----------|------------------------------|
| Mechanism | Learned attention-based aggregation | Simple mean of all patch tokens |
| Learnable | Yes (additional parameters) | No (fixed operation) |
| Flexibility | Can weight patches differently | Equal weight to all patches |
| Performance | Slightly better with large-scale pretraining | Competitive or better with less data |
| Typical adopters | ViT, DeiT (default) | MAE, BEiT, other self-supervised ViTs |
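
The two strategies differ by a single line of pooling code. A hedged sketch, assuming the encoder output is a (B, N+1, D) tensor with the CLS token at index 0:

```python
import torch

# Two ways to pool a ViT encoder's output into one image-level vector.
encoder_output = torch.randn(8, 197, 768)        # (B, N+1, D), CLS at index 0

# CLS pooling: read off the learned class token's final state.
cls_pooled = encoder_output[:, 0]                # (B, 768)

# Global average pooling: mean over the patch tokens (CLS slot excluded;
# a CLS-free model would simply average all tokens).
gap_pooled = encoder_output[:, 1:].mean(dim=1)   # (B, 768)
```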

Variants and Extensions

- Register Tokens: Recent work (Darcet et al., 2023) adds additional learnable tokens beyond CLS to serve as "registers" that reduce attention artifacts in patch tokens.
- Multiple CLS Tokens: Some architectures use separate CLS tokens for different tasks or scales in multi-task learning.
- CLS-Free ViTs: Some models replace the CLS token with global average pooling over the patch tokens; MAE (Masked Autoencoders), for example, reports competitive or superior results with average pooling.
- Distillation Token (DeiT): A second class-like token trained to match a teacher model's predictions, used alongside the standard CLS token.
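
As an illustration of the distillation-token idea, a DeiT-style token layout might be sketched as follows (hedged; the names are illustrative and the encoder itself is omitted):

```python
import torch
import torch.nn as nn

# DeiT-style input layout: [CLS, DIST, patch_1, ..., patch_N].
embed_dim, num_patches, batch, num_classes = 768, 196, 8, 1000

cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # distillation token

patches = torch.randn(batch, num_patches, embed_dim)     # stand-in patch embeddings
x = torch.cat([cls_token.expand(batch, -1, -1),
               dist_token.expand(batch, -1, -1),
               patches], dim=1)                           # (B, N+2, D)

# (The transformer encoder would process x here; omitted for brevity.)
# Two heads: the CLS head is trained on ground-truth labels, while the
# distillation head is trained to match the teacher model's predictions.
cls_head = nn.Linear(embed_dim, num_classes)
dist_head = nn.Linear(embed_dim, num_classes)
cls_logits = cls_head(x[:, 0])     # from the CLS slot
dist_logits = dist_head(x[:, 1])   # from the distillation slot
```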

The class token is the lens through which a Vision Transformer sees the whole image — by attending to every patch across every layer, this single learned vector distills an entire image into a representation rich enough to drive accurate classification and transfer learning.
