Home› Knowledge Base› The Transformer architecture

The Transformer architecture

Keywords: transformer,transformers,transformer architecture,self-attention,attention mechanism,encoder-decoder,multi-head attention,positional encoding,BERT,GPT,neural networks

The Transformer architecture was introduced in the landmark 2017 paper "Attention Is All You Need" and has since become the foundation for virtually all modern large language models.

The Transformer architecture was introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced recurrence with pure attention mechanisms and has since become the foundation for virtually all modern large language models.

Problems with Previous Approaches (RNNs/LSTMs)

Sequential bottleneck: Processing proceeded step-by-step through sequences, preventing parallelization
Long-range dependency challenges: Information from distant positions had to flow through many intermediate steps
Vanishing gradient problems: Training signals degraded over long sequences, even with gating mechanisms
Computational inefficiency: Sequential nature created fundamental bottlenecks on modern parallel hardware

The Key Insight

Attention alone is sufficient. By allowing every position to directly attend to every other position in a single operation, the sequential constraint is eliminated entirely.

Core Mechanism: Self-Attention

Scaled Dot-Product Attention

The heart of the Transformer is scaled dot-product attention. Given an input sequence of embeddings, we compute three projections:

Query ($Q$): What information is this position looking for?
Key ($K$): What information does this position contain?
Value ($V$): What information should be transmitted if attended to?

Mathematical Formulation

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where:

$Q \in \mathbb{R}^{n \times d_k}$ — Query matrix
$K \in \mathbb{R}^{n \times d_k}$ — Key matrix
$V \in \mathbb{R}^{n \times d_v}$ — Value matrix
$d_k$ — Dimension of keys/queries
$n$ — Sequence length

Why the Scaling Factor?

The scaling factor $\sqrt{d_k}$ is critical. Without it:

$$ \text{For large } d_k: \quad q \cdot k = \sum_{i=1}^{d_k} q_i k_i \quad \text{grows as } O(d_k) $$

This pushes softmax into regions of extremely small gradients:

$$ \frac{\partial}{\partial x_i} \text{softmax}(x)_j = \text{softmax}(x)_j \left(\delta_{ij} - \text{softmax}(x)_i\right) $$

When inputs are large, softmax outputs approach one-hot vectors, and gradients vanish.

Properties of Self-Attention

Parallelization: All positions computed simultaneously — $O(1)$ sequential operations
Direct connectivity: Any position can directly access any other
Learned routing: Attention patterns are computed fresh for each input
Computational complexity: $O(n^2 \cdot d)$ time and $O(n^2)$ memory

Multi-Head Attention

Rather than computing a single attention function, Transformers use multiple parallel attention "heads."

Mathematical Formulation

$$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O $$

Where each head is:

$$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$

Projection Dimensions

$W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$
$W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$
$W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$
$W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$

Typical Configuration

For a model with $d_{\text{model}} = 512$ and $h = 8$ heads:

$$ d_k = d_v = \frac{d_{\text{model}}}{h} = \frac{512}{8} = 64 $$

Why Multiple Heads?

Different representation subspaces: Each head can learn different relationship types
Specialization: One head might track syntactic dependencies, another semantic relationships
Redundancy and robustness: Information captured across multiple heads
Efficient computation: Same total dimensionality as single-head attention

Position Encoding

The Problem

Self-attention is permutation-equivariant:

$$ \text{Attention}(\pi(X)) = \pi(\text{Attention}(X)) $$

Where $\pi$ is any permutation. The operation has no inherent notion of position or order.

Sinusoidal Position Encodings (Original)

The original paper used fixed sinusoidal encodings:

$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$

$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$

Where:

$pos$ — Position in the sequence $(0, 1, 2, \ldots)$
$i$ — Dimension index $(0, 1, \ldots, d_{\text{model}}/2 - 1)$
$d_{\text{model}}$ — Model dimension

Properties of Sinusoidal Encodings

Unique encoding: Each position gets a distinct vector
Bounded values: All values in $[-1, 1]$
Relative position as linear transformation: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$

$$ PE_{pos+k} = T_k \cdot PE_{pos} $$

Where $T_k$ is a rotation matrix depending only on $k$.

Modern Alternatives

#Rotary Position Embeddings (RoPE)

Encodes position through rotation in 2D subspaces:

$$ f(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix} $$

For query $q$ at position $m$ and key $k$ at position $n$:

$$ q_m^T k_n = (R_m q)^T (R_n k) = q^T R_{n-m} k $$

This makes attention depend only on relative position $(n-m)$.

#ALiBi (Attention with Linear Biases)

Adds a linear bias based on distance:

$$ \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} - m \cdot |i-j|\right)V $$

Where $m$ is a head-specific slope and $|i-j|$ is the distance between positions.

The Complete Transformer Layer

Layer Composition

A single Transformer layer consists of:

Input → [Layer Norm] → Multi-Head Attention → [+ Residual] → 
      → [Layer Norm] → Feed-Forward Network → [+ Residual] → Output

Feed-Forward Network (FFN)

Applied position-wise (identically to each position):

$$ \text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2 $$

Where:

$W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$ — Expansion projection
$W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$ — Contraction projection
$d_{ff}$ — Inner dimension (typically $4 \times d_{\text{model}}$)
$\sigma$ — Activation function

Activation Functions

#ReLU (Original) $$ \text{ReLU}(x) = \max(0, x) $$

#GELU (Common in modern models) $$ \text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702x) $$

Where $\Phi$ is the standard Gaussian CDF.

#SwiGLU (State-of-the-art) $$ \text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2) $$

Where $\text{Swish}(x) = x \cdot \sigma(x)$ and $\odot$ is element-wise multiplication.

Layer Normalization

$$ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $$

Where:

$\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ — Mean across features
$\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$ — Variance across features
$\gamma, \beta$ — Learned scale and shift parameters
$\epsilon$ — Small constant for numerical stability

#Pre-LN vs Post-LN

Post-LN (Original): $$ x' = \text{LayerNorm}(x + \text{Attention}(x)) $$

Pre-LN (Modern, more stable): $$ x' = x + \text{Attention}(\text{LayerNorm}(x)) $$

RMSNorm (Simplified Alternative)

$$ \text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} $$

Removes the mean-centering step for efficiency.

Residual Connections

$$ x_{l+1} = x_l + F_l(x_l) $$

Essential for:

Gradient flow: Direct path for gradients in deep networks
Incremental learning: Layers learn refinements rather than complete transformations
Training stability: Easier optimization landscape

Architectural Variants

Encoder-Only (BERT-style)

Attention Pattern: Bidirectional (each position attends to all positions)

$$ \text{Mask}_{ij} = 0 \quad \forall i, j $$

Use Cases:

Text classification
Named entity recognition
Question answering
Sentence embeddings

Pre-training Objective: Masked Language Modeling (MLM)

$$ \mathcal{L}_{\text{MLM}} = -\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i \in \mathcal{M}} \log P(x_i | x_{\backslash \mathcal{M}}) \right] $$

Decoder-Only (GPT-style)

Attention Pattern: Causal (positions only attend to previous positions)

$$ \text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases} $$

Use Cases:

Text generation
Conversational AI
Code completion
General-purpose LLMs (GPT, Claude, LLaMA)

Pre-training Objective: Next Token Prediction

$$ \mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(x_t | x_{

Encoder-Decoder (Original Transformer)

Components:

Encoder: Bidirectional self-attention on input
Decoder: Causal self-attention + cross-attention to encoder

Cross-Attention:

$$ \text{CrossAttention}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) $$

Where queries come from decoder, keys/values from encoder.

Use Cases:

Machine translation
Summarization
Speech-to-text

Why Transformers Scale

Empirical Scaling Laws

Performance follows predictable power laws with scale:

$$ L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} $$

$$ L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} $$

$$ L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C} $$

Where:

$L$ — Loss
$N$ — Number of parameters
$D$ — Dataset size
$C$ — Compute budget
$\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, $\alpha_C \approx 0.050$ (empirically measured)

Factors Contributing to Scaling

Efficient hardware utilization: Parallel attention maps well to GPUs/TPUs
Minimal inductive bias: Flexibility to learn optimal information flow
Depth: Residual connections enable very deep networks (100+ layers)
Dynamic computation: Input-dependent attention patterns

Computational Considerations

Complexity Analysis

Operation	Time Complexity	Space Complexity
Self-Attention	$O(n^2 \cdot d)$	$O(n^2 + nd)$
FFN	$O(n \cdot d \cdot d_{ff})$	$O(nd + d \cdot d_{ff})$
Full Layer	$O(n^2 \cdot d + n \cdot d \cdot d_{ff})$	$O(n^2 + nd_{ff})$

KV Cache for Inference

For autoregressive generation, we cache keys and values:

$$ K_t = [K_{t-1}; k_t], \quad V_t = [V_{t-1}; v_t] $$

Memory per layer: $O(n \cdot d_k \cdot 2)$ for keys and values

Total cache size: $$ \text{KV Cache} = 2 \times L \times n \times d_{\text{model}} \times \text{precision bytes} $$

Efficient Attention Variants

#Flash Attention

Uses tiling to compute exact attention with $O(n)$ memory:

Operates on blocks of size $B$
Avoids materializing the full $n \times n$ attention matrix
IO-aware: Minimizes memory bandwidth bottleneck

#Multi-Query Attention (MQA)

Shares keys and values across all heads:

$$ \text{head}_i = \text{Attention}(QW_i^Q, KW^K, VW^V) $$

Benefit: Reduces KV cache by factor of $h$

#Grouped-Query Attention (GQA)

Compromise between MHA and MQA:

Groups of $g$ heads share the same K, V projections
Memory reduction factor: $h/g$

Applications Beyond Language

Vision Transformers (ViT)

Images treated as sequences of patches:

$$ x_{\text{patch}} = \text{Flatten}(\text{Image}[i:i+P, j:j+P]) \cdot E $$

Where $P \times P$ is patch size and $E$ is the embedding projection.

AlphaFold (Protein Structure)

Uses Transformers for:

Multiple sequence alignment processing
Structure module with invariant point attention
Achieved revolutionary accuracy in protein folding

Other Domains

Audio/Speech: Speech recognition, music generation, text-to-speech
Code: Code generation, completion, and understanding
Multimodal: Vision-language models processing both images and text
Time Series: Forecasting, anomaly detection
Robotics: Decision making, control policies

Open Questions and Frontiers

Theoretical Understanding

Why do they work so well? Complete theoretical understanding remains elusive
In-context learning: Evidence suggests implicit gradient descent during inference
Mechanistic interpretability: What do individual attention heads compute?

Efficiency Challenges

Long context: $O(n^2)$ attention remains a fundamental bottleneck
Inference cost: KV cache grows linearly with context
Training efficiency: Data and compute requirements continue to grow

Alternative Architectures

State Space Models (Mamba): Linear-time sequence modeling with selection
Hybrid architectures: Combining attention with other mechanisms
Sparse architectures: Mixture of Experts (MoE) for efficient scaling

Current Research Directions

Extremely long context (millions of tokens)
More efficient architectures without capability loss
Better theoretical foundations
Improved interpretability and alignment

Summary

The Transformer architecture represents one of the most significant innovations in deep learning history. Its key contributions:

1. Eliminated sequential processing through self-attention 2. Enabled massive parallelization for efficient training 3. Demonstrated remarkable scaling properties that continue to hold 4. Proved domain-general across language, vision, biology, and beyond

The combination of parallelism, global connectivity, and learned routing—all without recurrence—unlocked capabilities that seemed far off before its introduction.

Source: ChipFoundryServices — Search this topic — Ask CFSGPT

transformertransformerstransformer architectureself-attentionattention mechanismencoder-decodermulti-head attentionpositional encodingBERTGPTneural networks

Explore 500+ Semiconductor & AI Topics

From EUV lithography to CUDA optimization — search the full knowledge base or chat with our AI assistant.

🔍 Search Topics 💬 Ask CFSGPT 📚 Browse All