The Transformer architecture

The Transformer architecture was introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced recurrence with pure attention mechanisms and has since become the foundation for virtually all modern large language models.

Problems with Previous Approaches (RNNs/LSTMs)

The Key Insight

Attention alone is sufficient. By allowing every position to directly attend to every other position in a single operation, the sequential constraint is eliminated entirely.

Core Mechanism: Self-Attention

Scaled Dot-Product Attention

The heart of the Transformer is scaled dot-product attention. Given an input sequence of embeddings $X$, we compute three learned linear projections: queries $Q = XW^Q$, keys $K = XW^K$, and values $V = XW^V$.

Mathematical Formulation

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

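Where $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the dimension of each query and key vector.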
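The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the projection matrices `W_q`, `W_k`, `W_v` are random placeholders, and the sizes (5 tokens, 16-dimensional embeddings, $d_k = d_v = 8$) are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> output (n, d_v), weights (n, m)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of every query with every key
    weights = softmax(scores, axis=-1)    # each row is a distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))              # 5 tokens, 16-dim embeddings (arbitrary)
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))  # placeholder weights
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Each row of `weights` sums to one, so every output vector is a weighted average of the value vectors.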

Why the Scaling Factor?

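The scaling factor $\sqrt{d_k}$ is critical. If the components of $q$ and $k$ are independent with zero mean and unit variance, then without scaling:

$$ q \cdot k = \sum_{i=1}^{d_k} q_i k_i \quad \text{has variance } d_k, \text{ so its typical magnitude grows as } O(\sqrt{d_k}) $$

Dividing by $\sqrt{d_k}$ restores unit variance. Without it, large scores push softmax into regions of extremely small gradients: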

$$ \frac{\partial}{\partial x_i} \text{softmax}(x)_j = \text{softmax}(x)_j \left(\delta_{ij} - \text{softmax}(x)_i\right) $$

When inputs are large, softmax outputs approach one-hot vectors, and gradients vanish.
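A small numerical check can make this concrete. The sketch below assumes query and key components are independent standard normal samples (an idealization, not a property of trained models). Under that assumption the variance of the dot product grows linearly with $d_k$ (standard deviation $\sqrt{d_k}$), and the unscaled softmax is more sharply peaked than the scaled one.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    # Empirical standard deviation of q . k for random unit-variance vectors.
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))
    return (q * k).sum(axis=-1).std()

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

std_64 = dot_product_std(64)        # close to sqrt(64) = 8
std_1024 = dot_product_std(1024)    # close to sqrt(1024) = 32

d_k = 1024
q = rng.normal(size=d_k)
keys = rng.normal(size=(8, d_k))
unscaled = softmax(keys @ q)                  # scores with std around 32
scaled = softmax(keys @ q / np.sqrt(d_k))     # scores with std around 1
```

Comparing `unscaled.max()` with `scaled.max()` shows how the unscaled distribution concentrates on a single key.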

Properties of Self-Attention

Multi-Head Attention

Rather than computing a single attention function, Transformers use multiple parallel attention "heads."

Mathematical Formulation

$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O $$

Where each head is:

$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$

Projection Dimensions

Typical Configuration

For a model with $d_{\text{model}} = 512$ and $h = 8$ heads:

$$ d_k = d_v = \frac{d_{\text{model}}}{h} = \frac{512}{8} = 64 $$
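The configuration above can be sketched as follows. As an implementation convenience (an assumption of this sketch, not something the formulas require), the per-head matrices $W_i^Q$, $W_i^K$, $W_i^V$ are stacked into single $d_{\text{model}} \times d_{\text{model}}$ matrices and split into heads by reshaping. All weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // h

    def split_heads(M):
        # (n, d_model) -> (h, n, d_k)
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (h, n, n)
    heads = softmax(scores) @ V                             # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # Concat(head_1..head_h)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, h, n = 512, 8, 10
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))
Y = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
```

The output has the same shape as the input, `(n, d_model)`, which is what lets layers be stacked.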

Why Multiple Heads?

Position Encoding

The Problem

Self-attention is permutation-equivariant:

$$ \text{Attention}(\pi(X)) = \pi(\text{Attention}(X)) $$

Where $\pi$ is any permutation. The operation has no inherent notion of position or order.
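The equivariance property can be checked numerically. The sketch below uses a single attention head with random placeholder weights and no positional information: permuting the input rows and then applying attention gives the same result as applying attention and then permuting the output rows.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 8))                        # 6 tokens, no position signal
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
perm = rng.permutation(6)

attend_then_permute = self_attention(X, W_q, W_k, W_v)[perm]
permute_then_attend = self_attention(X[perm], W_q, W_k, W_v)
equivariant = np.allclose(attend_then_permute, permute_then_attend)
```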

Sinusoidal Position Encodings (Original)

The original paper used fixed sinusoidal encodings:

$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$

$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$

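Where $pos$ is the token position and $i$ indexes the sine/cosine dimension pair, with $0 \leq i < d_{\text{model}}/2$.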

Properties of Sinusoidal Encodings

$$ PE_{pos+k} = T_k \cdot PE_{pos} $$

Where $T_k$ is a block-diagonal rotation matrix that depends only on the offset $k$, not on $pos$.
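Both the encoding and the shift property can be sketched in NumPy. The helper `shift_matrix` is a hypothetical name introduced here; it builds $T_k$ from the angle-addition identities, one $2 \times 2$ block per frequency. An even $d_{\text{model}}$ is assumed.

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    pos = np.arange(n_positions)[:, None]           # (n, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model / 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dimensions
    pe[:, 1::2] = np.cos(angles)                    # odd dimensions
    return pe

def shift_matrix(k, d_model):
    """Block-diagonal T_k such that PE[pos + k] = T_k @ PE[pos]."""
    T = np.zeros((d_model, d_model))
    for i in range(d_model // 2):
        w = 1.0 / (10000 ** (2 * i / d_model))
        c, s = np.cos(k * w), np.sin(k * w)
        # sin((p+k)w) =  sin(pw)cos(kw) + cos(pw)sin(kw)
        # cos((p+k)w) = -sin(pw)sin(kw) + cos(pw)cos(kw)
        T[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return T

pe = sinusoidal_encoding(50, 16)
T3 = shift_matrix(3, 16)
shifted = T3 @ pe[10]       # should match pe[13]
```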

Modern Alternatives

#Rotary Position Embeddings (RoPE)

Encodes position through rotation in 2D subspaces:

$$ f(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix} $$

For query $q$ at position $m$ and key $k$ at position $n$:

$$ q_m^T k_n = (R_m q)^T (R_n k) = q^T R_{n-m} k $$

This makes attention depend only on relative position $(n-m)$.
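The relative-position property can be verified with a short sketch. Two choices here are assumptions of the example rather than statements from the text above: the frequencies $\theta_i = 10000^{-2i/d}$ and the pairing of consecutive dimensions into 2D subspaces.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by position * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # one frequency per 2D subspace
    angle = position * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
score_near = rope(q, 2) @ rope(k, 5)      # positions (2, 5), offset 3
score_far = rope(q, 40) @ rope(k, 43)     # positions (40, 43), offset 3
```

The two scores agree because only the offset $n - m = 3$ enters the dot product.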

#ALiBi (Attention with Linear Biases)

Adds a linear bias based on distance:

$$ \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} - m \cdot |i-j|\right)V $$

Where $m$ is a head-specific slope and $|i-j|$ is the distance between positions.
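The bias term can be built as one matrix per head. In this sketch the slopes are passed in explicitly. The geometric sequence $1/2, 1/4, \ldots, 1/256$ used for 8 heads follows the schedule described in the ALiBi paper and should be treated as an assumption of the example.

```python
import numpy as np

def alibi_bias(n, slopes):
    """Return an (h, n, n) array of biases -m * |i - j|, one slice per head."""
    pos = np.arange(n)
    distance = np.abs(pos[:, None] - pos[None, :])          # (n, n)
    return -np.asarray(slopes)[:, None, None] * distance    # (h, n, n)

# Assumed slope schedule for 8 heads: 1/2, 1/4, ..., 1/256.
slopes = [2.0 ** -(i + 1) for i in range(8)]
bias = alibi_bias(4, slopes)
```

The bias is added to the scaled scores before the softmax, so more distant positions receive a larger penalty, and heads with larger slopes focus more locally.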

The Complete Transformer Layer

Layer Composition

A single Transformer layer consists of:

Input → [Layer Norm] → Multi-Head Attention → [+ Residual] →
      → [Layer Norm] → Feed-Forward Network → [+ Residual] → Output

Feed-Forward Network (FFN)

Applied position-wise (identically to each position):

$$ \text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2 $$

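Where $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$ are learned weight matrices, $b_1$ and $b_2$ are biases, and $\sigma$ is a nonlinear activation function. The original paper used $d_{ff} = 4 \cdot d_{\text{model}}$.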

Activation Functions

#ReLU (Original)

$$ \text{ReLU}(x) = \max(0, x) $$

#GELU (Common in modern models)

$$ \text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702x) $$

Where $\Phi$ is the standard Gaussian CDF and $\sigma$ here denotes the logistic sigmoid.

#SwiGLU (Used in many recent large language models)

$$ \text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2) $$

Where $\text{Swish}(x) = x \cdot \sigma(x)$ and $\odot$ is element-wise multiplication.
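The three activations can be compared with scalar functions from the standard library. In this sketch `swiglu` takes the two projections $xW_1$ and $xW_2$ as already-computed lists of floats, which is a simplification made for the example.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu_exact(x):
    # x * Phi(x), with Phi written in terms of the error function.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_sigmoid(x):
    # Sigmoid approximation x * sigmoid(1.702 x).
    return x * sigmoid(1.702 * x)

def swish(x):
    return x * sigmoid(x)

def swiglu(a, b):
    """a and b stand for the projections xW_1 and xW_2 (lists of floats)."""
    return [swish(ai) * bi for ai, bi in zip(a, b)]

gap = abs(gelu_exact(1.0) - gelu_sigmoid(1.0))   # approximation error at x = 1
gated = swiglu([0.0, 1.0], [3.0, 2.0])
```

The gate `swish(a)` decides how much of each component of `b` passes through, which is why a zero gate input yields a zero output regardless of `b`.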

Layer Normalization

$$ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $$

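Where $\mu$ and $\sigma^2$ are the mean and variance of $x$ over the feature dimension, $\gamma$ and $\beta$ are learned scale and shift parameters, and $\epsilon$ is a small constant for numerical stability.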

#Pre-LN vs Post-LN

Post-LN (Original): $$ x' = \text{LayerNorm}(x + \text{Attention}(x)) $$

Pre-LN (Modern, more stable): $$ x' = x + \text{Attention}(\text{LayerNorm}(x)) $$

RMSNorm (Simplified Alternative)

$$ \text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} $$

Removes the mean-centering step for efficiency.
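The difference between the two normalizations is easy to see side by side. The sketch below applies both to random data with a nonzero mean. The values `gamma = 1`, `beta = 0`, and `eps = 1e-5` are placeholder choices for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)       # variance over the feature dimension
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    mean_sq = (x ** 2).mean(axis=-1, keepdims=True)
    return gamma * x / np.sqrt(mean_sq + eps)  # no mean subtraction, no beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 64))
gamma, beta = np.ones(64), np.zeros(64)
ln = layer_norm(x, gamma, beta)     # zero mean, unit variance per row
rn = rms_norm(x, gamma)             # unit root-mean-square per row, mean kept
```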

Residual Connections

$$ x_{l+1} = x_l + F_l(x_l) $$

Essential for:

Architectural Variants

Encoder-Only (BERT-style)

Attention Pattern: Bidirectional (each position attends to all positions)

$$ \text{Mask}_{ij} = 0 \quad \forall i, j $$

Use Cases:

Pre-training Objective: Masked Language Modeling (MLM)

$$ \mathcal{L}_{\text{MLM}} = -\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i \in \mathcal{M}} \log P(x_i | x_{\backslash \mathcal{M}}) \right] $$

Decoder-Only (GPT-style)

Attention Pattern: Causal (each position attends only to itself and earlier positions)

$$ \text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases} $$
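The mask is added to the attention scores before the softmax, so masked entries receive zero weight. A minimal sketch with random placeholder scores:

```python
import numpy as np

def causal_mask(n):
    # 0 where j <= i, -inf where j > i (strictly upper triangle).
    return np.triu(np.full((n, n), -np.inf), k=1)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)                 # exp(-inf) evaluates to exactly 0
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
weights = softmax(scores + causal_mask(4))
```

Row $i$ of `weights` is a distribution over positions $0, \ldots, i$ only. The first row puts all of its weight on position 0.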

Use Cases:

Pre-training Objective: Next Token Prediction

$$ \mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(x_t | x_{<t}) $$
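The next-token objective is a sum of negative log-probabilities of the observed tokens. The sketch below assumes the model has already produced a `(T, V)` array of logits in which row `t` scores the token at position `t` given the tokens before it. The function name and the toy vocabulary size are illustrative.

```python
import numpy as np

def next_token_loss(logits, tokens):
    """logits: (T, V) scores for each position; tokens: (T,) observed token ids."""
    logits = logits - logits.max(axis=-1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(tokens)), tokens].sum()

T, V = 6, 10
tokens = np.array([1, 4, 2, 9, 0, 3])
uniform_loss = next_token_loss(np.zeros((T, V)), tokens)       # uninformed model

confident = np.full((T, V), -20.0)
confident[np.arange(T), tokens] = 20.0                         # correct token favored
confident_loss = next_token_loss(confident, tokens)
```

A model that assigns uniform probability to all $V$ tokens has loss $T \log V$, and a model that concentrates probability on the correct tokens drives the loss toward zero.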

Encoder-Decoder (Original Transformer)

Components:

Cross-Attention:

$$ \text{CrossAttention}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) $$

Where queries come from the decoder and keys/values come from the encoder.
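The only difference from self-attention is where the inputs come from, which shows up in the shapes. In this sketch the decoder has 3 positions and the encoder 7. The states and weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec, enc, W_q, W_k, W_v):
    Q = dec @ W_q                      # queries from decoder states
    K, V = enc @ W_k, enc @ W_v        # keys and values from encoder states
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
dec = rng.normal(size=(3, 16))         # 3 decoder positions
enc = rng.normal(size=(7, 16))         # 7 encoder positions
W_q, W_k, W_v = (rng.normal(size=(16, 8)) for _ in range(3))
out, weights = cross_attention(dec, enc, W_q, W_k, W_v)
```

The attention matrix is rectangular: one row per decoder position and one column per encoder position.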

Use Cases:

Why Transformers Scale

Empirical Scaling Laws

Test loss $L$ follows approximate power laws in scale, as long as the other two factors are not the bottleneck:

$$ L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} $$

$$ L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} $$

$$ L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C} $$

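Where $N$ is the number of model parameters, $D$ is the dataset size in tokens, $C$ is the training compute, and $N_c$, $D_c$, $C_c$, $\alpha_N$, $\alpha_D$, $\alpha_C$ are empirically fitted constants.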
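A practical consequence of the power-law form is that multiplying scale by a fixed factor multiplies the loss by a fixed factor. The sketch below uses made-up constants (`N_c = 1e13`, `alpha_N = 0.08`) purely for illustration. They are not fitted values from any study.

```python
def power_law_loss(x, x_c, alpha):
    # L(x) = (x_c / x) ** alpha
    return (x_c / x) ** alpha

# Hypothetical constants, chosen only to illustrate the shape of the curve.
N_c, alpha_N = 1.0e13, 0.08

loss_1b = power_law_loss(1e9, N_c, alpha_N)
loss_2b = power_law_loss(2e9, N_c, alpha_N)
ratio = loss_2b / loss_1b        # equals 2 ** -alpha_N regardless of N_c
```

With a small exponent each doubling removes only a few percent of the loss, which is why gains are usually discussed in terms of orders of magnitude of scale.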

Factors Contributing to Scaling

Computational Considerations

Complexity Analysis

| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Self-Attention | $O(n^2 \cdot d)$ | $O(n^2 + nd)$ |
| FFN | $O(n \cdot d \cdot d_{ff})$ | $O(nd + d \cdot d_{ff})$ |
| Full Layer | $O(n^2 \cdot d + n \cdot d \cdot d_{ff})$ | $O(n^2 + nd_{ff})$ |

KV Cache for Inference

For autoregressive generation, we cache keys and values:

$$ K_t = [K_{t-1}; k_t], \quad V_t = [V_{t-1}; v_t] $$
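The sketch below checks that decoding one token at a time with a growing cache gives the same outputs as full causal attention over the whole sequence. It uses a single head and random placeholder queries, keys, and values.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    # One query vector against all cached keys and values.
    return softmax(q @ K.T / np.sqrt(q.shape[-1])) @ V

rng = np.random.default_rng(0)
n, d_k = 6, 8
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))

# Incremental decoding: append k_t and v_t, then attend with q_t only.
K_cache = np.empty((0, d_k))
V_cache = np.empty((0, d_k))
cached_outputs = []
for t in range(n):
    K_cache = np.vstack([K_cache, K[t:t + 1]])    # K_t = [K_{t-1}; k_t]
    V_cache = np.vstack([V_cache, V[t:t + 1]])    # V_t = [V_{t-1}; v_t]
    cached_outputs.append(attend(Q[t], K_cache, V_cache))
cached_outputs = np.stack(cached_outputs)

# Reference: full causal attention computed in one pass.
mask = np.triu(np.full((n, n), -np.inf), k=1)
full_outputs = softmax(Q @ K.T / np.sqrt(d_k) + mask) @ V
```

Each decoding step costs $O(t \cdot d_k)$ instead of recomputing all earlier keys and values.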

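Memory per layer: $2 \cdot n \cdot d_{\text{model}}$ values for keys and values combined, that is, $2 \cdot n \cdot d_k$ per head with $h \cdot d_k = d_{\text{model}}$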

Total cache size for one sequence, where $L$ is the number of layers and $n$ is the sequence length:

$$ \text{KV Cache} = 2 \times L \times n \times d_{\text{model}} \times \text{precision bytes} $$
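As a worked example of the formula, consider a hypothetical model with 32 layers, $d_{\text{model}} = 4096$, a 4096-token sequence, and 16-bit (2-byte) values. These numbers are chosen for round arithmetic and do not describe any specific model.

```python
def kv_cache_bytes(n_layers, seq_len, d_model, bytes_per_value):
    # Factor 2 accounts for storing both keys and values.
    return 2 * n_layers * seq_len * d_model * bytes_per_value

size = kv_cache_bytes(n_layers=32, seq_len=4096, d_model=4096, bytes_per_value=2)
size_gib = size / 2**30     # 2.0 GiB for a single sequence
```

The cache grows linearly with sequence length and with batch size, so it often dominates memory use during long-context inference.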

Efficient Attention Variants

#Flash Attention

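Uses tiling to compute exact attention with $O(n)$ additional memory, processing keys and values block by block so that the full $n \times n$ attention matrix is never materialized.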
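The key numerical idea is an online softmax: keep a running maximum and a running denominator per query, and rescale the partial output whenever the maximum changes. The NumPy sketch below illustrates that idea only. It tiles over keys but not queries, and it is not the actual FlashAttention GPU kernel, which also manages on-chip memory explicitly.

```python
import numpy as np

def standard_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)    # running max score per query
    row_sum = np.zeros((n, 1))            # running softmax denominator
    for start in range(0, K.shape[0], block):
        K_blk, V_blk = K[start:start + block], V[start:start + block]
        S = Q @ K_blk.T / np.sqrt(d_k)                      # (n, block) scores only
        new_max = np.maximum(row_max, S.max(axis=-1, keepdims=True))
        scale = np.exp(row_max - new_max)                   # rescale old accumulators
        P = np.exp(S - new_max)
        row_sum = row_sum * scale + P.sum(axis=-1, keepdims=True)
        out = out * scale + P @ V_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
tiled = tiled_attention(Q, K, V, block=4)
reference = standard_attention(Q, K, V)
```

The result matches standard attention up to floating-point error, while only an `(n, block)` slice of scores exists at any time.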

#Multi-Query Attention (MQA)

Shares keys and values across all heads:

$$ \text{head}_i = \text{Attention}(QW_i^Q, KW^K, VW^V) $$

Benefit: Reduces KV cache by factor of $h$

#Grouped-Query Attention (GQA)

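Compromise between MHA and MQA: the $h$ query heads are divided into $g$ groups, and each group shares one key/value head. Setting $g = h$ recovers MHA and $g = 1$ recovers MQA, and the KV cache shrinks by a factor of $h/g$.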
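A minimal sketch of grouped-query attention, assuming queries are already split into `h` heads and keys/values into `g` shared heads, with `g` dividing `h`. Repeating the key/value heads is done here for clarity. An efficient implementation would avoid the copy.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (h, n, d_k); K, V: (g, n, d_k) with g dividing h."""
    h, g = Q.shape[0], K.shape[0]
    K = np.repeat(K, h // g, axis=0)    # each K/V head serves h // g query heads
    V = np.repeat(V, h // g, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
h, g, n, d_k = 8, 2, 5, 16
Q = rng.normal(size=(h, n, d_k))
K = rng.normal(size=(g, n, d_k))        # only g key heads need to be cached
V = rng.normal(size=(g, n, d_k))        # only g value heads need to be cached
out = grouped_query_attention(Q, K, V)
cache_reduction = h // g                # KV cache is 4x smaller than with MHA
```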

Applications Beyond Language

Vision Transformers (ViT)

Images treated as sequences of patches:

$$ x_{\text{patch}} = \text{Flatten}(\text{Image}[i:i+P, j:j+P]) \cdot E $$

Where $P \times P$ is patch size and $E$ is the embedding projection.
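Patch extraction can be written as a reshape followed by a matrix multiply. The sizes below (a $224 \times 224$ RGB image, $P = 16$, embedding width 768) are illustrative choices, the embedding matrix `E` is a random placeholder, and the image height and width are assumed to be divisible by $P$.

```python
import numpy as np

def patch_embed(image, P, E):
    """image: (H, W, C); E: (P*P*C, d_model) -> (num_patches, d_model)."""
    H, W, C = image.shape
    # Split rows and columns into (patch index, offset within patch).
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)     # flatten each P x P x C patch
    return patches @ E

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
E = rng.normal(size=(16 * 16 * 3, 768))
tokens = patch_embed(image, 16, E)               # one row per patch
first_patch = image[:16, :16].reshape(-1) @ E    # direct computation for patch (0, 0)
```

A $224 \times 224$ image with $16 \times 16$ patches yields $14 \times 14 = 196$ tokens, which the Transformer then processes like a sequence of word embeddings.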

AlphaFold (Protein Structure)

Uses Transformers for:

Other Domains

Open Questions and Frontiers

Theoretical Understanding

Efficiency Challenges

Alternative Architectures

Current Research Directions

Summary

The Transformer architecture represents one of the most significant innovations in deep learning history. Its key contributions:

1. Eliminated sequential processing through self-attention
2. Enabled massive parallelization for efficient training
3. Demonstrated remarkable scaling properties that continue to hold
4. Proved domain-general across language, vision, biology, and beyond

The combination of parallelism, global connectivity, and learned routing, all without recurrence, unlocked capabilities that seemed far off before its introduction.
