l-diversity, training techniques
L-diversity requires at least l diverse values for sensitive attributes within groups.
Bound maximum per-pixel change.
Limit number of pixels changed.
Bound Euclidean distance.
Change labels in training data.
Semi-supervised learning via graph structure.
Soften hard labels for regularization.
Learn dynamics from Lagrangian mechanics.
Google's conversational AI model.
Efficient attention for long context.
Attend to selected landmark tokens for efficiency.
LangChain provides a framework for building LLM applications with chains and agents.
Framework for building LLM applications with chains and agents.
LangChain is a framework for LLM applications, offering chains, agents, and tools, with a popular ecosystem.
LangChain and LlamaIndex are frameworks for building LLM apps: chains, agents, RAG, memory. Accelerate development.
Sample using gradient of log density.
Langflow is a visual LLM flow builder from DataStax.
Remove language-specific features.
Understanding how language models make decisions.
Pre-train for particular language.
# Large Language Model (LLM) Training
A comprehensive guide to understanding how Large Language Models are trained, from architecture to deployment.
## 1. Architecture: The Transformer
Modern LLMs are built on the **Transformer architecture** (Vaswani et al., 2017). The key innovation is the **self-attention mechanism**.
### 1.1 Self-Attention Mechanism
The attention function computes a weighted sum of values based on query-key similarity:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
Where:
- $Q$ = Query matrix of shape $(n \times d_k)$
- $K$ = Key matrix of shape $(n \times d_k)$
- $V$ = Value matrix of shape $(n \times d_v)$
- $d_k$ = Dimension of keys (dividing by $\sqrt{d_k}$ keeps the softmax inputs from growing too large, which would otherwise saturate the softmax and shrink gradients)
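As a concrete reference, here is a minimal NumPy sketch of the attention formula above; the causal-mask option and the toy shapes are illustrative assumptions, not part of any particular library's API.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) query-key similarities
    if causal:
        # Mask out future positions for autoregressive decoding
        n = scores.shape[0]
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (n, d_v)

# Toy usage: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V, causal=True).shape)  # (4, 8)
```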
### 1.2 Multi-Head Attention
Multiple attention heads learn different relationship types:
$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$
Where each head is computed as:
$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$
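A minimal sketch of the split-project-concat bookkeeping behind multi-head attention, assuming $d_{model}$ is divisible by the number of heads; the random projection matrices are placeholders.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (n, d_model); Wq/Wk/Wv/Wo: (d_model, d_model). Returns (n, d_model)."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split(W):
        # Project, then split the feature dimension into heads: (num_heads, n, d_head)
        return (X @ W).reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    heads = weights @ V                                    # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1, ..., head_h)
    return concat @ Wo

# Toy usage: 4 tokens, d_model = 16, 4 heads
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 16))
Ws = [rng.normal(size=(16, 16)) * 0.1 for _ in range(4)]
print(multi_head_attention(X, *Ws, num_heads=4).shape)  # (4, 16)
```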
### 1.3 Transformer Block Components
Each transformer block contains:
- **Multi-head self-attention layer**
- Allows tokens to attend to all previous tokens
- Masked attention for autoregressive generation
- **Feed-forward network (FFN)**
$$
\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
$$
- Modern models often replace ReLU with GELU or SwiGLU activations
- **Layer normalization**
$$
\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$
- **Residual connections**
$$
\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))
$$
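The wiring of these components can be summarized in a short sketch. It follows the post-norm arrangement shown above (many modern LLMs instead use pre-norm, $x + \text{Sublayer}(\text{LayerNorm}(x))$); the attention sub-layer is passed in as a placeholder function and the parameter shapes are illustrative.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU variant of the FFN

def transformer_block(x, attn_fn, ffn_params, gamma, beta):
    """Post-norm wiring: output = LayerNorm(x + Sublayer(x)) for each sub-layer."""
    x = layer_norm(x + attn_fn(x), gamma, beta)             # attention sub-layer + residual
    x = layer_norm(x + ffn(x, *ffn_params), gamma, beta)    # FFN sub-layer + residual
    return x

# Toy usage with a placeholder attention sub-layer (identity), d_model = 8, d_ff = 32
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
params = (rng.normal(size=(8, 32)) * 0.1, np.zeros(32),
          rng.normal(size=(32, 8)) * 0.1, np.zeros(8))
print(transformer_block(x, lambda h: h, params, np.ones(8), np.zeros(8)).shape)  # (4, 8)
```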
### 1.4 Positional Encoding
Since attention is permutation-invariant, position information must be injected:
**Sinusoidal (original Transformer):**
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
**Rotary Position Embedding (RoPE):** Used in modern models like LLaMA:
$$
f_q(x_m, m) = (W_q x_m) e^{im\theta}
$$
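A minimal NumPy sketch of the sinusoidal encoding above, assuming an even $d_{model}$; the sequence length and dimensions in the example are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even indices
    pe[:, 1::2] = np.cos(angles)                        # odd indices
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```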
## 2. Tokenization
Text must be converted to discrete tokens before training.
### 2.1 Byte-Pair Encoding (BPE)
**Algorithm:**
1. Initialize vocabulary with all unique characters
2. Count frequency of adjacent token pairs
3. Merge most frequent pair into new token
4. Repeat until vocabulary size reached
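A toy pure-Python version of the merge loop above; the miniature corpus and the number of merges are made up for illustration.

```python
from collections import Counter

def merge_word(symbols, pair):
    """Merge every adjacent occurrence of `pair` inside one word (tuple of symbols)."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1]); i += 2
        else:
            out.append(symbols[i]); i += 1
    return tuple(out)

def train_bpe(word_freqs, num_merges):
    """word_freqs: {tuple_of_symbols: count}. Returns the learned merge list."""
    merges = []
    for _ in range(num_merges):
        pairs = Counter()                          # step 2: count adjacent pairs
        for symbols, freq in word_freqs.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]          # step 3: most frequent pair
        merges.append(best)
        word_freqs = {merge_word(s, best): f for s, f in word_freqs.items()}
    return merges                                  # step 4: repeat until budget is used

# Step 1: start from characters; toy corpus with word frequencies
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
print(train_bpe(corpus, num_merges=5))
```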
**Vocabulary sizes:**
| Model | Vocab Size |
|-------|------------|
| GPT-2 | 50,257 |
| GPT-4 | ~100,000 |
| LLaMA | 32,000 |
| Claude | ~100,000 |
### 2.2 Tokenization Impact
- **Efficiency**: Tokens per word ratio affects context utilization
- **Multilingual**: Some languages require more tokens per concept
- **Code**: Special handling for programming syntax
- **Rare words**: Split into subword pieces
## 3. Pre-Training
The most computationally intensive phase where the model learns language patterns.
### 3.1 Training Objective: Next-Token Prediction
**Causal Language Modeling Loss:**
$$
\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, x_2, \ldots, x_{t-1}; \theta)
$$
Or equivalently, the cross-entropy loss:
$$
\mathcal{L}_{CE} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i)
$$
Where:
- $V$ = vocabulary size
- $y_i$ = one-hot encoded true token
- $\hat{y}_i$ = predicted probability for token $i$
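A minimal NumPy sketch of this loss over one sequence, using teacher forcing (position $t$ predicts token $t+1$); the random logits stand in for model outputs.

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """logits: (T, V) next-token scores at each position.
    token_ids: (T+1,) the tokenized sequence. Returns mean negative log-likelihood."""
    targets = token_ids[1:]                                  # shift: predict x_t from x_<t
    z = logits - logits.max(axis=-1, keepdims=True)          # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]       # pick true-token log-probs
    return nll.mean()

# Toy usage: vocabulary of 50 tokens, sequence of length 9 (8 predictions)
rng = np.random.default_rng(0)
token_ids = rng.integers(0, 50, size=9)
logits = rng.normal(size=(8, 50))
print(causal_lm_loss(logits, token_ids))   # roughly log(50) ≈ 3.9 for random logits
```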
### 3.2 Training Data
**Data sources (illustrative mixture; exact proportions vary by model):**
- Web crawls (Common Crawl): ~60%
- Books and literature: ~8%
- Academic papers: ~5%
- Code repositories: ~10%
- Wikipedia: ~3%
- Curated high-quality sources: ~14%
**Data processing pipeline:**
1. **Deduplication**
- MinHash / SimHash for near-duplicate detection
- Exact substring matching
2. **Quality filtering**
- Perplexity-based filtering
- Heuristic rules (length, symbol ratio)
- Classifier-based quality scoring
3. **Toxicity removal**
- Keyword filtering
- Classifier-based detection
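As a toy illustration of the first two pipeline stages, the sketch below applies exact deduplication by content hash plus simple length and symbol-ratio heuristics; the thresholds are made-up illustrative values, not recommendations.

```python
import hashlib

def clean_corpus(docs, min_chars=200, max_symbol_ratio=0.3):
    """Exact dedup by content hash, then simple heuristic quality filters."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:                          # 1. exact-duplicate removal
            continue
        seen.add(digest)
        if len(doc) < min_chars:                    # 2a. too short
            continue
        symbols = sum(not (c.isalnum() or c.isspace()) for c in doc)
        if symbols / len(doc) > max_symbol_ratio:   # 2b. too symbol-heavy
            continue
        kept.append(doc)
    return kept

docs = ["A short snippet.", "x" * 300, "!@#$%" * 100, "x" * 300]
print(len(clean_corpus(docs)))   # 1: the duplicate, short, and symbol-heavy docs are dropped
```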
### 3.3 Scaling Laws
**Kaplan et al. (2020) Power Laws:**
$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076
$$
$$
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095
$$
$$
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050
$$
Where:
- $L$ = Loss (lower is better)
- $N$ = Number of parameters
- $D$ = Dataset size (tokens)
- $C$ = Compute budget (FLOPs)
**Chinchilla Optimal Scaling:**
For compute-optimal training:
$$
N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}
$$
Rule of thumb: **~20 tokens per parameter**
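Combining the ~20 tokens/parameter rule with the FLOPs ≈ 6ND estimate from Section 3.5 gives a quick budgeting helper; the 70B example mirrors the model size used later in this guide.

```python
def chinchilla_budget(num_params, tokens_per_param=20.0):
    """Compute-optimal token count and rough training FLOPs for a given model size."""
    tokens = tokens_per_param * num_params      # D_opt ≈ 20 N
    flops = 6.0 * num_params * tokens           # FLOPs ≈ 6 N D
    return tokens, flops

# Example: a 70B-parameter model
tokens, flops = chinchilla_budget(70e9)
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")  # 1.40e+12 tokens, 5.88e+23 FLOPs
```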
### 3.4 Optimization
**AdamW Optimizer:**
$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$
$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$
$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$
$$
\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)
$$
**Typical hyperparameters:**
- $\beta_1 = 0.9$
- $\beta_2 = 0.95$
- $\epsilon = 10^{-8}$
- $\lambda = 0.1$ (weight decay)
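A minimal NumPy implementation of one AdamW step as written above, using the typical hyperparameters listed; the learning rate and the toy quadratic objective in the demo are assumptions.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update following the equations above. Returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad              # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad**2           # second-moment EMA
    m_hat = m / (1 - beta1**t)                      # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Toy usage: a few steps on the quadratic loss 0.5 * ||theta||^2 (gradient = theta)
theta = np.array([1.0, -2.0, 3.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 6):
    theta, m, v = adamw_step(theta, grad=theta, m=m, v=v, t=t)
print(theta)
```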
**Learning Rate Schedule:**
Warmup + Cosine decay:
$$
\eta_t = \begin{cases}
\eta_{max} \cdot \frac{t}{T_{warmup}} & \text{if } t < T_{warmup} \\
\eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t - T_{warmup}}{T_{total} - T_{warmup}} \pi\right)\right) & \text{otherwise}
\end{cases}
$$
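The piecewise schedule translates directly into a small function; the step counts and peak learning rate in the example are illustrative.

```python
import math

def lr_schedule(t, total_steps, warmup_steps, lr_max, lr_min=0.0):
    """Linear warmup followed by cosine decay, matching the piecewise formula above."""
    if t < warmup_steps:
        return lr_max * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Example: 10k warmup steps out of 100k total, peak LR 3e-4
for step in (0, 5_000, 10_000, 55_000, 100_000):
    print(step, lr_schedule(step, 100_000, 10_000, lr_max=3e-4))
```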
### 3.5 Distributed Training
**Parallelism strategies:**
- **Data Parallelism (DP)**
- Model replicated across $N$ GPUs
- Batch split into $N$ micro-batches
- Gradients synchronized via all-reduce
- **Tensor Parallelism (TP)**
- Individual layers split across GPUs
- For attention: $Q, K, V$ projections partitioned
- **Pipeline Parallelism (PP)**
- Sequential layers on different GPUs
- Micro-batch pipelining reduces bubble time
- **ZeRO (Zero Redundancy Optimizer)**
- Stage 1: Partition optimizer states
- Stage 2: + Partition gradients
- Stage 3: + Partition parameters
**Compute requirements:**
$$
\text{FLOPs} \approx 6 \times N \times D
$$
Where $N$ = parameters, $D$ = training tokens.
For a 70B model on 2T tokens:
$$
\text{FLOPs} \approx 6 \times 70 \times 10^9 \times 2 \times 10^{12} = 8.4 \times 10^{23}
$$
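For intuition, the FLOPs estimate can be turned into a rough wall-clock figure; the assumed sustained throughput of 3×10¹⁴ FLOP/s per GPU and the GPU count are illustrative round numbers, not measured values.

```python
def training_time_estimate(num_params, num_tokens, num_gpus,
                           flops_per_gpu_per_sec=3e14):
    """Rough wall-clock estimate from FLOPs ≈ 6 N D (throughput is an assumed value)."""
    total_flops = 6.0 * num_params * num_tokens
    seconds = total_flops / (num_gpus * flops_per_gpu_per_sec)
    return total_flops, seconds / 86_400              # FLOPs and days

flops, days = training_time_estimate(70e9, 2e12, num_gpus=2048)
print(f"{flops:.2e} FLOPs, ~{days:.0f} days on 2048 GPUs")   # 8.40e+23 FLOPs
```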
## 4. Post-Training & Alignment
Transforms a base model into a helpful assistant.
### 4.1 Supervised Fine-Tuning (SFT)
**Objective:** Same as pre-training, but on curated instruction-response pairs.
$$
\mathcal{L}_{SFT} = -\sum_{t=1}^{T} \log P(y_t | x, y_{<t}; \theta)
$$
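A minimal NumPy sketch of this objective with the common (but not universal) practice of masking prompt tokens so that only response tokens contribute to the loss; the placeholder logits and the masking convention are assumptions.

```python
import numpy as np

def sft_loss(logits, token_ids, prompt_len):
    """logits: (T, V) next-token scores; token_ids: (T+1,) prompt followed by response.
    Only positions that predict response tokens contribute to the loss."""
    targets = token_ids[1:]
    z = logits - logits.max(axis=-1, keepdims=True)          # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    mask = np.arange(len(targets)) >= (prompt_len - 1)        # positions predicting the response
    return (nll * mask).sum() / mask.sum()

# Toy usage: 4 prompt tokens followed by 6 response tokens, vocabulary of 50
rng = np.random.default_rng(0)
ids = rng.integers(0, 50, size=10)
print(sft_loss(rng.normal(size=(9, 50)), ids, prompt_len=4))
```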
Laser-based focused ion beam alternatives use pulsed lasers for material removal avoiding gallium contamination in failure analysis.
Use laser to fix mask defects.
Laser voltage probing measures internal node voltages non-invasively through electro-optic or thermal effects.
Non-contact voltage measurement.
Combine predictions at end.
Interact embeddings at fine granularity.
Latency prediction models estimate inference time from architecture specifications guiding hardware-aware NAS.
Fast sampling in latent space.
Diffusion in compressed latent space.
Run diffusion in compressed latent space for efficiency (Stable Diffusion).
Latent diffusion models perform diffusion in compressed latent space reducing computational cost.
Latent directions are vectors in latent space corresponding to semantic attributes.
Defects not caught by test.
Neural ODEs in latent space for irregular time series.
Add/subtract concepts in latent space.
Combine latent codes with arithmetic operations.
Separate independent factors.
Smooth transitions in latent space.
Latent space interpolation smoothly transitions between generated samples.
Smoothly transition between samples.
Edit by moving in latent space.
Explore latent space systematically.
Upscale in latent space.
Learn compact representations of environment dynamics.
Different normalization schemes.
Backpropagate relevance scores.
Small constant preventing division by zero.
Layout optimization selects memory formats for tensors minimizing data transformation overhead.