The Transformer architecture
Keywords: transformer,transformers,transformer architecture,self-attention,attention mechanism,encoder-decoder,multi-head attention,positional encoding,BERT,GPT,neural networks
The Transformer architecture was introduced in the landmark 2017 paper "Attention Is All You Need" and has since become the foundation for virtually all modern large language models.
The Transformer architecture was introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced recurrence with pure attention mechanisms and has since become the foundation for virtually all modern large language models.
Problems with Previous Approaches (RNNs/LSTMs)
- Sequential bottleneck: Processing proceeded step-by-step through sequences, preventing parallelization
- Long-range dependency challenges: Information from distant positions had to flow through many intermediate steps
- Vanishing gradient problems: Training signals degraded over long sequences, even with gating mechanisms
- Computational inefficiency: Sequential nature created fundamental bottlenecks on modern parallel hardware
The Key Insight
Attention alone is sufficient. By allowing every position to directly attend to every other position in a single operation, the sequential constraint is eliminated entirely.
Core Mechanism: Self-Attention
Scaled Dot-Product Attention
The heart of the Transformer is scaled dot-product attention. Given an input sequence of embeddings, we compute three projections:
- Query ($Q$): What information is this position looking for?
- Key ($K$): What information does this position contain?
- Value ($V$): What information should be transmitted if attended to?
Mathematical Formulation
$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Where:
- $Q \in \mathbb{R}^{n \times d_k}$ β Query matrix
- $K \in \mathbb{R}^{n \times d_k}$ β Key matrix
- $V \in \mathbb{R}^{n \times d_v}$ β Value matrix
- $d_k$ β Dimension of keys/queries
- $n$ β Sequence length
Why the Scaling Factor?
The scaling factor $\sqrt{d_k}$ is critical. Without it:
$$ \text{For large } d_k: \quad q \cdot k = \sum_{i=1}^{d_k} q_i k_i \quad \text{grows as } O(d_k) $$
This pushes softmax into regions of extremely small gradients:
$$ \frac{\partial}{\partial x_i} \text{softmax}(x)_j = \text{softmax}(x)_j \left(\delta_{ij} - \text{softmax}(x)_i\right) $$
When inputs are large, softmax outputs approach one-hot vectors, and gradients vanish.
Properties of Self-Attention
- Parallelization: All positions computed simultaneously β $O(1)$ sequential operations
- Direct connectivity: Any position can directly access any other
- Learned routing: Attention patterns are computed fresh for each input
- Computational complexity: $O(n^2 \cdot d)$ time and $O(n^2)$ memory
Multi-Head Attention
Rather than computing a single attention function, Transformers use multiple parallel attention "heads."
Mathematical Formulation
$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O $$
Where each head is:
$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$
Projection Dimensions
- $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$
- $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$
- $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$
- $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$
Typical Configuration
For a model with $d_{\text{model}} = 512$ and $h = 8$ heads:
$$ d_k = d_v = \frac{d_{\text{model}}}{h} = \frac{512}{8} = 64 $$
Why Multiple Heads?
- Different representation subspaces: Each head can learn different relationship types
- Specialization: One head might track syntactic dependencies, another semantic relationships
- Redundancy and robustness: Information captured across multiple heads
- Efficient computation: Same total dimensionality as single-head attention
Position Encoding
The Problem
Self-attention is permutation-equivariant:
$$ \text{Attention}(\pi(X)) = \pi(\text{Attention}(X)) $$
Where $\pi$ is any permutation. The operation has no inherent notion of position or order.
Sinusoidal Position Encodings (Original)
The original paper used fixed sinusoidal encodings:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
Where:
- $pos$ β Position in the sequence $(0, 1, 2, \ldots)$
- $i$ β Dimension index $(0, 1, \ldots, d_{\text{model}}/2 - 1)$
- $d_{\text{model}}$ β Model dimension
Properties of Sinusoidal Encodings
- Unique encoding: Each position gets a distinct vector
- Bounded values: All values in $[-1, 1]$
- Relative position as linear transformation: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$
$$ PE_{pos+k} = T_k \cdot PE_{pos} $$
Where $T_k$ is a rotation matrix depending only on $k$.
Modern Alternatives
#Rotary Position Embeddings (RoPE)
Encodes position through rotation in 2D subspaces:
$$ f(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix} $$
For query $q$ at position $m$ and key $k$ at position $n$:
$$ q_m^T k_n = (R_m q)^T (R_n k) = q^T R_{n-m} k $$
This makes attention depend only on relative position $(n-m)$.
#ALiBi (Attention with Linear Biases)
Adds a linear bias based on distance:
$$ \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} - m \cdot |i-j|\right)V $$
Where $m$ is a head-specific slope and $|i-j|$ is the distance between positions.
The Complete Transformer Layer
Layer Composition
A single Transformer layer consists of:
Input β [Layer Norm] β Multi-Head Attention β [+ Residual] β
β [Layer Norm] β Feed-Forward Network β [+ Residual] β Output
Feed-Forward Network (FFN)
Applied position-wise (identically to each position):
$$ \text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2 $$
Where:
- $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$ β Expansion projection
- $W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$ β Contraction projection
- $d_{ff}$ β Inner dimension (typically $4 \times d_{\text{model}}$)
- $\sigma$ β Activation function
Activation Functions
#ReLU (Original) $$ \text{ReLU}(x) = \max(0, x) $$
#GELU (Common in modern models) $$ \text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702x) $$
Where $\Phi$ is the standard Gaussian CDF.
#SwiGLU (State-of-the-art) $$ \text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2) $$
Where $\text{Swish}(x) = x \cdot \sigma(x)$ and $\odot$ is element-wise multiplication.
Layer Normalization
$$ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $$
Where:
- $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ β Mean across features
- $\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$ β Variance across features
- $\gamma, \beta$ β Learned scale and shift parameters
- $\epsilon$ β Small constant for numerical stability
#Pre-LN vs Post-LN
Post-LN (Original): $$ x' = \text{LayerNorm}(x + \text{Attention}(x)) $$
Pre-LN (Modern, more stable): $$ x' = x + \text{Attention}(\text{LayerNorm}(x)) $$
RMSNorm (Simplified Alternative)
$$ \text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} $$
Removes the mean-centering step for efficiency.
Residual Connections
$$ x_{l+1} = x_l + F_l(x_l) $$
Essential for:
- Gradient flow: Direct path for gradients in deep networks
- Incremental learning: Layers learn refinements rather than complete transformations
- Training stability: Easier optimization landscape
Architectural Variants
Encoder-Only (BERT-style)
Attention Pattern: Bidirectional (each position attends to all positions)
$$ \text{Mask}_{ij} = 0 \quad \forall i, j $$
Use Cases:
- Text classification
- Named entity recognition
- Question answering
- Sentence embeddings
Pre-training Objective: Masked Language Modeling (MLM)
$$ \mathcal{L}_{\text{MLM}} = -\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i \in \mathcal{M}} \log P(x_i | x_{\backslash \mathcal{M}}) \right] $$
Decoder-Only (GPT-style)
Attention Pattern: Causal (positions only attend to previous positions)
$$ \text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases} $$
Use Cases:
- Text generation
- Conversational AI
- Code completion
- General-purpose LLMs (GPT, Claude, LLaMA)
Pre-training Objective: Next Token Prediction
$$ \mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(x_t | x_{ Encoder-Decoder (Original Transformer) Components: Cross-Attention: $$ \text{CrossAttention}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) $$ Where queries come from decoder, keys/values from encoder. Use Cases: Why Transformers Scale Empirical Scaling Laws Performance follows predictable power laws with scale: $$ L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} $$ $$ L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} $$ $$ L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C} $$ Where: Factors Contributing to Scaling Computational Considerations Complexity Analysis KV Cache for Inference For autoregressive generation, we cache keys and values: $$ K_t = [K_{t-1}; k_t], \quad V_t = [V_{t-1}; v_t] $$ Memory per layer: $O(n \cdot d_k \cdot 2)$ for keys and values Total cache size: $$ \text{KV Cache} = 2 \times L \times n \times d_{\text{model}} \times \text{precision bytes} $$ Efficient Attention Variants #Flash Attention Uses tiling to compute exact attention with $O(n)$ memory: #Multi-Query Attention (MQA) Shares keys and values across all heads: $$ \text{head}_i = \text{Attention}(QW_i^Q, KW^K, VW^V) $$ Benefit: Reduces KV cache by factor of $h$ #Grouped-Query Attention (GQA) Compromise between MHA and MQA: Applications Beyond Language Vision Transformers (ViT) Images treated as sequences of patches: $$ x_{\text{patch}} = \text{Flatten}(\text{Image}[i:i+P, j:j+P]) \cdot E $$ Where $P \times P$ is patch size and $E$ is the embedding projection. AlphaFold (Protein Structure) Uses Transformers for: Other Domains Open Questions and Frontiers Theoretical Understanding Efficiency Challenges Alternative Architectures Current Research Directions Summary The Transformer architecture represents one of the most significant innovations in deep learning history. Its key contributions: 1. Eliminated sequential processing through self-attention 2. Enabled massive parallelization for efficient training 3. Demonstrated remarkable scaling properties that continue to hold 4. Proved domain-general across language, vision, biology, and beyond The combination of parallelism, global connectivity, and learned routingβall without recurrenceβunlocked capabilities that seemed far off before its introduction. Source: ChipFoundryServices β Search this topic β Ask CFSGPT From EUV lithography to CUDA optimization β search the full knowledge base or chat with our AI assistant.Operation Time Complexity Space Complexity Self-Attention $O(n^2 \cdot d)$ $O(n^2 + nd)$ FFN $O(n \cdot d \cdot d_{ff})$ $O(nd + d \cdot d_{ff})$ Full Layer $O(n^2 \cdot d + n \cdot d \cdot d_{ff})$ $O(n^2 + nd_{ff})$ Explore 500+ Semiconductor & AI Topics