l-diversity, training techniques
L-diversity requires at least l diverse values for sensitive attributes within groups.
Ensure diversity of sensitive attributes within groups.
Bound maximum per-pixel change.
Limit number of pixels changed.
Level 1 cache on GPU.
Bound Euclidean distance.
Level 2 cache shared across GPU.
Variation between different lots.
Label encoding converts categories to integers. Use it for ordinal data.
Change labels in training data.
Errors in training labels.
Semi-supervised learning via graph structure.
Output distribution changes.
Regularization technique for classification.
Soften hard labels for regularization.
Annotation tools (Label Studio, Argilla) streamline data labeling. UI for annotators, quality control, export.
Label Studio is an open-source annotation tool. Supports multiple data types. Customizable.
Labelbox is an enterprise annotation platform with model-assisted labeling.
ViT learns structure from data.
Learn dynamics from Lagrangian formulation.
Lagrangian methods solve constrained RL by introducing multipliers and alternating between policy and multiplier updates.
Learn dynamics from Lagrangian mechanics.
LakeFS provides Git-like versioning for data lakes. Branch, merge, rollback data.
Optimizer for large batch training.
Lambda Labs offers GPU cloud and workstations. Training and inference. Alternative to hyperscalers.
LambdaRank optimizes ranking metrics directly by computing gradients from swapping document pairs.
Google's conversational AI model.
Thin sample preparation for TEM.
Smooth unidirectional air flow to prevent particle turbulence.
Lamp heaters use radiant energy from high-intensity bulbs.
Contact pads instead of balls.
Efficient attention for long context.
Attend to selected landmark tokens for efficiency.
LangChain provides framework for building LLM applications with chains and agents.
Framework for building LLM applications with chains and agents.
LangChain is a framework for LLM applications. Chains, agents, tools. Popular ecosystem.
LangChain and LlamaIndex are frameworks for building LLM apps: chains, agents, RAG, memory. Accelerate development.
Sample using gradient of log density.
Langflow is a visual LLM flow builder from DataStax.
Langfuse is an open-source LLM tracing tool. Debug chains, track costs. Self-hostable.
Measure plasma density and potential.
Remove language-specific features.
Keep specific languages.
Detect text language.
Understanding how language models make decisions.
Universal cross-lingual embeddings.
Pre-train for particular language.
# Large Language Model (LLM) Training
A comprehensive guide to understanding how Large Language Models are trained, from architecture to deployment.
## 1. Architecture: The Transformer
Modern LLMs are built on the **Transformer architecture** (Vaswani et al., 2017). The key innovation is the **self-attention mechanism**.
### 1.1 Self-Attention Mechanism
The attention function computes a weighted sum of values based on query-key similarity:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
Where:
- $Q$ = Query matrix of shape $(n \times d_k)$
- $K$ = Key matrix of shape $(n \times d_k)$
- $V$ = Value matrix of shape $(n \times d_v)$
- $d_k$ = Dimension of keys (scaling by $\sqrt{d_k}$ keeps large dot products from saturating the softmax and shrinking its gradients)
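As a concrete illustration, here is a minimal NumPy sketch of this formula for a single, unmasked head (the random inputs and sizes are placeholders):

```python
# Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns an (n, d_v) weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, n) query-key similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))  # 4 tokens, d_k = d_v = 8
out = attention(Q, K, V)                               # shape (4, 8)
```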
### 1.2 Multi-Head Attention
Multiple attention heads learn different relationship types:
$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$
Where each head is computed as:
$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$
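A multi-head sketch in the same style; `d_model = 64`, `h = 8`, and the randomly initialized projection matrices are illustrative stand-ins for learned weights:

```python
# Multi-head attention: project, attend per head, concatenate, project back.
import numpy as np

def attention(Q, K, V):                                 # same as the previous sketch
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d_model, h = 64, 8
d_k = d_v = d_model // h
Wq, Wk, Wv = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))  # W_i^Q, W_i^K, W_i^V
Wo = rng.normal(size=(h * d_v, d_model))                             # output projection W^O

def multi_head(X):
    """X: (n, d_model) token representations."""
    heads = [attention(X @ Wq[i], X @ Wk[i], X @ Wv[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wo          # (n, d_model)

out = multi_head(rng.normal(size=(10, d_model)))        # 10 tokens -> (10, 64)
```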
### 1.3 Transformer Block Components
Each transformer block contains:
- **Multi-head self-attention layer**
- Allows tokens to attend to all previous tokens
- Masked attention for autoregressive generation
- **Feed-forward network (FFN)**
$$
\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
$$
- Or using GELU/SwiGLU activations in modern models
- **Layer normalization**
$$
\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
$$
- **Residual connections**
$$
\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))
$$
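Putting these pieces together, a minimal decoder block might look like the PyTorch sketch below (post-norm, matching the residual equation above; the dimensions and dropout rate are illustrative):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        n = x.size(1)
        # Causal mask: True above the diagonal blocks attention to future tokens.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool, device=x.device), 1)
        a, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        x = self.norm1(x + a)                   # residual + LayerNorm (attention sublayer)
        x = self.norm2(x + self.ffn(x))         # residual + LayerNorm (FFN sublayer)
        return x

y = TransformerBlock()(torch.randn(2, 16, 512))  # (batch=2, seq=16, d_model=512)
```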
### 1.4 Positional Encoding
Since attention is permutation-invariant, position information must be injected:
**Sinusoidal (original Transformer):**
$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
**Rotary Position Embedding (RoPE):** Used in modern models like LLaMA:
$$
f_q(x_m, m) = (W_q x_m) e^{im\theta}
$$
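A NumPy sketch of the sinusoidal encodings; RoPE, by contrast, rotates the query/key projections inside attention rather than adding a vector to the embeddings:

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """Returns a (max_len, d_model) table of positional encodings (d_model even)."""
    pos = np.arange(max_len)[:, None]                    # positions 0..max_len-1
    i = np.arange(0, d_model, 2)[None, :]                # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                         # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                         # PE(pos, 2i+1)
    return pe

pe = sinusoidal_pe(max_len=128, d_model=64)              # added to token embeddings
```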
## 2. Tokenization
Text must be converted to discrete tokens before training.
### 2.1 Byte-Pair Encoding (BPE)
**Algorithm:**
1. Initialize vocabulary with all unique characters
2. Count frequency of adjacent token pairs
3. Merge most frequent pair into new token
4. Repeat until vocabulary size reached
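A toy sketch of this merge loop; the corpus and target vocabulary size are placeholders, and real tokenizers additionally handle bytes, pre-tokenization, and special tokens:

```python
from collections import Counter

corpus = ["low", "lower", "newest", "widest"]            # illustrative corpus
target_vocab = 30

words = Counter(tuple(w) for w in corpus)                # each word as a tuple of symbols
vocab = {c for w in words for c in w}                    # step 1: all unique characters

while len(vocab) < target_vocab:
    pairs = Counter()                                    # step 2: count adjacent pairs
    for w, freq in words.items():
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += freq
    if not pairs:                                        # every word is a single token
        break
    (a, b), _ = pairs.most_common(1)[0]                  # step 3: most frequent pair
    vocab.add(a + b)
    merged = {}
    for w, freq in words.items():                        # apply the merge everywhere
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                out.append(a + b); i += 2
            else:
                out.append(w[i]); i += 1
        merged[tuple(out)] = freq
    words = merged                                       # step 4: repeat
```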
**Vocabulary sizes:**
| Model | Vocab Size |
|-------|------------|
| GPT-2 | 50,257 |
| GPT-4 | ~100,000 |
| LLaMA | 32,000 |
| Claude | ~100,000 |
### 2.2 Tokenization Impact
- **Efficiency**: Tokens per word ratio affects context utilization
- **Multilingual**: Some languages require more tokens per concept
- **Code**: Special handling for programming syntax
- **Rare words**: Split into subword pieces
## 3. Pre-Training
The most computationally intensive phase where the model learns language patterns.
### 3.1 Training Objective: Next-Token Prediction
**Causal Language Modeling Loss:**
$$
\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, x_2, \ldots, x_{t-1}; \theta)
$$
Or equivalently, the cross-entropy loss:
$$
\mathcal{L}_{CE} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i)
$$
Where:
- $V$ = vocabulary size
- $y_i$ = one-hot encoded true token
- $\hat{y}_i$ = predicted probability for token $i$
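A short PyTorch sketch of this objective: shifting aligns each position's prediction with the next token. The tiny vocabulary and random logits are placeholders for real model outputs:

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 8, 5
logits = torch.randn(1, seq_len, vocab_size)         # model outputs at each position
tokens = torch.randint(0, vocab_size, (1, seq_len))  # the training sequence x_1..x_T

# Position t predicts token t+1: drop the last prediction and the first target.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                       tokens[:, 1:].reshape(-1))
```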
### 3.2 Training Data
**Data sources (approximate mixture; proportions vary by model):**
- Web crawls (Common Crawl): ~60%
- Books and literature: ~8%
- Academic papers: ~5%
- Code repositories: ~10%
- Wikipedia: ~3%
- Curated high-quality sources: ~14%
**Data processing pipeline:**
1. **Deduplication**
- MinHash / SimHash for near-duplicate detection (toy sketch after this list)
- Exact substring matching
2. **Quality filtering**
- Perplexity-based filtering
- Heuristic rules (length, symbol ratio)
- Classifier-based quality scoring
3. **Toxicity removal**
- Keyword filtering
- Classifier-based detection
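As a toy illustration of near-duplicate detection, the sketch below compares documents by Jaccard similarity over character shingles; production pipelines use MinHash/LSH precisely to avoid this quadratic pairwise comparison:

```python
def shingles(text, k=5):
    """Set of overlapping character k-grams."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

docs = ["the quick brown fox jumps", "the quick brown fox jumps!", "something else entirely"]
near_dupes = [(i, j) for i in range(len(docs)) for j in range(i + 1, len(docs))
              if jaccard(docs[i], docs[j]) > 0.7]        # threshold is illustrative
```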
### 3.3 Scaling Laws
**Kaplan et al. (2020) Power Laws:**
$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076
$$
$$
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095
$$
$$
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050
$$
Where:
- $L$ = Loss (lower is better)
- $N$ = Number of parameters
- $D$ = Dataset size (tokens)
- $C$ = Compute budget (FLOPs)
**Chinchilla Optimal Scaling:**
For compute-optimal training:
$$
N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}
$$
Rule of thumb: **~20 tokens per parameter**
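A quick back-of-the-envelope check of this rule of thumb, using the 70B-parameter example that appears later in this guide:

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal token count for a given parameter count."""
    return tokens_per_param * n_params

print(f"{chinchilla_tokens(70e9) / 1e12:.1f}T tokens")   # ~1.4T tokens for a 70B model
```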
### 3.4 Optimization
**AdamW Optimizer:**
$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$
$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$
$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$
$$
\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)
$$
**Typical hyperparameters:**
- $\beta_1 = 0.9$
- $\beta_2 = 0.95$
- $\epsilon = 10^{-8}$
- $\lambda = 0.1$ (weight decay)
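A NumPy sketch of a single AdamW step implementing the update above; the learning rate here is an illustrative value, and the remaining hyperparameters are the typical ones just listed:

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=3e-4,
               beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.1):
    m = beta1 * m + (1 - beta1) * grad                 # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * grad**2              # second-moment estimate v_t
    m_hat = m / (1 - beta1**t)                         # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

theta, m, v = np.ones(4), np.zeros(4), np.zeros(4)
theta, m, v = adamw_step(theta, grad=np.array([0.1, -0.2, 0.3, 0.0]), m=m, v=v, t=1)
```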
**Learning Rate Schedule:**
Warmup + Cosine decay:
$$
\eta_t = \begin{cases}
\eta_{max} \cdot \frac{t}{T_{warmup}} & \text{if } t < T_{warmup} \\
\eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t - T_{warmup}}{T_{total} - T_{warmup}} \pi\right)\right) & \text{otherwise}
\end{cases}
$$
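The same schedule as a small Python function; the warmup length, total steps, and learning-rate bounds are illustrative choices:

```python
import math

def lr_at(t, max_lr=3e-4, min_lr=3e-5, warmup=2000, total=100_000):
    if t < warmup:
        return max_lr * t / warmup                     # linear warmup
    progress = (t - warmup) / (total - warmup)         # 0 at end of warmup, 1 at the end
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```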
### 3.5 Distributed Training
**Parallelism strategies:**
- **Data Parallelism (DP)**
- Model replicated across $N$ GPUs
- Batch split into $N$ micro-batches
  - Gradients synchronized via all-reduce (see the sketch after this list)
- **Tensor Parallelism (TP)**
- Individual layers split across GPUs
- For attention: $Q, K, V$ projections partitioned
- **Pipeline Parallelism (PP)**
- Sequential layers on different GPUs
- Micro-batch pipelining reduces bubble time
- **ZeRO (Zero Redundancy Optimizer)**
- Stage 1: Partition optimizer states
- Stage 2: + Partition gradients
- Stage 3: + Partition parameters
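To make the data-parallel case concrete, here is a minimal PyTorch DistributedDataParallel sketch, launched with `torchrun` (one process per GPU); the linear model, batch, and hyperparameters are stand-ins:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")             # torchrun provides rank/world size
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()          # stand-in for the transformer
model = DDP(model, device_ids=[local_rank])         # replicates weights across ranks
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

for step in range(10):
    x = torch.randn(8, 1024, device="cuda")         # this rank's micro-batch
    loss = model(x).pow(2).mean()                   # dummy loss
    loss.backward()                                 # gradients all-reduced here
    opt.step(); opt.zero_grad()

dist.destroy_process_group()
```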
**Compute requirements:**
$$
\text{FLOPs} \approx 6 \times N \times D
$$
Where $N$ = parameters, $D$ = training tokens.
For a 70B model on 2T tokens:
$$
\text{FLOPs} \approx 6 \times 70 \times 10^9 \times 2 \times 10^{12} = 8.4 \times 10^{23}
$$
## 4. Post-Training & Alignment
Transforms a base model into a helpful assistant.
### 4.1 Supervised Fine-Tuning (SFT)
**Objective:** Same as pre-training, but on curated instruction-response pairs.
$$
\mathcal{L}_{SFT} = -\sum_{t=1}^{T} \log P(y_t | x, y_{<t}; \theta)
$$
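Here $x$ is the instruction/prompt and $y$ the target response. In many SFT setups the loss is applied only to response tokens; the sketch below assumes that convention, masking prompt positions with `ignore_index` (shapes and the prompt length are placeholders):

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len, prompt_len = 8, 10, 4
logits = torch.randn(1, seq_len, vocab_size)            # model outputs
tokens = torch.randint(0, vocab_size, (1, seq_len))     # prompt followed by response

labels = tokens.clone()
labels[:, :prompt_len] = -100                           # ignore prompt positions

loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab_size),
                       labels[:, 1:].reshape(-1),
                       ignore_index=-100)
```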
Larger-the-better characteristics should be maximized with minimal variation.