AI Factory Glossary

442 technical terms and definitions


l-diversity, training techniques

L-diversity requires at least l distinct values for each sensitive attribute within every group of records that share quasi-identifiers, protecting against homogeneity attacks that defeat k-anonymity.

l-diversity,privacy

Ensures diversity of sensitive-attribute values within each anonymized group.
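
As a concrete sketch (the record structure and field names here are illustrative, not from any particular library), a dataset satisfies l-diversity when every group sharing quasi-identifiers contains at least l distinct sensitive values:

```python
from collections import defaultdict

def satisfies_l_diversity(records, quasi_ids, sensitive, l):
    """Check l-diversity: every group of records sharing the same
    quasi-identifiers must have >= l distinct sensitive values."""
    groups = defaultdict(set)
    for row in records:
        key = tuple(row[q] for q in quasi_ids)   # equivalence class
        groups[key].add(row[sensitive])          # distinct sensitive values
    return all(len(values) >= l for values in groups.values())

rows = [
    {"zip": "021**", "age": "30-40", "disease": "flu"},
    {"zip": "021**", "age": "30-40", "disease": "cancer"},
    {"zip": "021**", "age": "30-40", "disease": "asthma"},
]
print(satisfies_l_diversity(rows, ["zip", "age"], "disease", l=3))  # True
```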

l-infinity attacks, ai safety

Adversarial attacks that bound the maximum change to any single pixel: the perturbation δ must satisfy ‖δ‖∞ ≤ ε (see the norm-projection sketch after the l2 attacks entry below).

l0 attacks, l0, ai safety

Adversarial attacks that limit the number of pixels changed, i.e. bound the ‖δ‖₀ "norm" of the perturbation.

l1 cache, l1, hardware

Level 1 cache: the smallest, fastest cache on a GPU, private to each streaming multiprocessor.

l2 attacks, l2, ai safety

Adversarial attacks that bound the Euclidean distance between clean and perturbed inputs: ‖δ‖₂ ≤ ε.
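
The l-infinity, l0, and l2 attack entries above differ only in how the perturbation is measured. A minimal numpy sketch of projecting a perturbation into each constraint set (illustrative helper functions, not from any attack library):

```python
import numpy as np

def project_linf(delta, eps):
    """Clamp each coordinate so that ||delta||_inf <= eps."""
    return np.clip(delta, -eps, eps)

def project_l2(delta, eps):
    """Rescale onto the Euclidean ball so that ||delta||_2 <= eps."""
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def project_l0(delta, k):
    """Keep only the k largest-magnitude coordinates: ||delta||_0 <= k."""
    out = np.zeros_like(delta)
    idx = np.argsort(np.abs(delta).ravel())[-k:]
    out.ravel()[idx] = delta.ravel()[idx]
    return out

delta = np.array([0.5, -0.02, 0.3, -0.4])
print(project_linf(delta, eps=0.1))  # every entry clipped to [-0.1, 0.1]
print(project_l0(delta, k=2))        # only the two largest entries survive
```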

l2 cache, l2, hardware

Level 2 cache: a larger cache shared by all streaming multiprocessors on a GPU.

l2l (lot-to-lot variation),l2l,lot-to-lot variation,manufacturing

Variation in process or device characteristics between different manufacturing lots.

label encoding,ordinal,convert

Label encoding converts categorical values to integers. Use it for ordinal features, where the integer order is meaningful; for nominal features it imposes a spurious order, so one-hot encoding is usually preferred (see the sketch below).
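
A minimal plain-Python sketch of label encoding an ordinal feature (the categories and their ordering here are illustrative):

```python
# Ordinal feature: the integer ranks carry real meaning (small < medium < large).
sizes = ["small", "medium", "large", "medium", "small"]
order = {"small": 0, "medium": 1, "large": 2}

encoded = [order[s] for s in sizes]
print(encoded)  # [0, 1, 2, 1, 0]

# For nominal categories (e.g. colors) this ordering would be spurious;
# one-hot encoding is usually the safer choice there.
```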

label flipping, ai safety

A data-poisoning attack that changes (flips) labels in the training data to corrupt the learned model.

label noise,data quality

Errors in training labels.

label propagation on graphs, graph neural networks

Semi-supervised learning that spreads known labels to unlabeled nodes along graph edges.

label shift,transfer learning

Distribution shift in which the label distribution P(y) changes between training and deployment while the class-conditional P(x|y) stays fixed.

label smoothing in vit, computer vision

Label smoothing applied when training Vision Transformers: a classification regularizer that softens one-hot targets.

label smoothing, machine learning

Softens hard one-hot labels for regularization: the true class gets probability 1 − ε and the remaining ε is spread over the other classes, which reduces overconfidence and improves calibration (see the sketch below).
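
A minimal numpy sketch of one common formulation (true class gets 1 − ε, the other K − 1 classes share ε uniformly; a variant spreads ε over all K classes):

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Turn hard class indices into smoothed target distributions."""
    targets = np.full((len(labels), num_classes), eps / (num_classes - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

print(smooth_labels([2], num_classes=4, eps=0.1))
# [[0.0333 0.0333 0.9    0.0333]]  -- each row still sums to 1
```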

label studio,annotation tool

Annotation tools such as Label Studio and Argilla streamline data labeling: a UI for annotators, quality control, and export to training formats.

label studio,annotation,open

Label Studio is an open-source annotation tool that supports multiple data types with customizable labeling interfaces.

labelbox,platform,annotation

Labelbox is an enterprise annotation platform with model-assisted labeling.

lack of inductive bias, computer vision

Vision Transformers lack the built-in inductive biases of CNNs (locality, translation equivariance) and must learn such structure from data, which is why they typically need large datasets or strong augmentation.

lagrangian mechanics learning, scientific ml

Learn dynamics from Lagrangian formulation.

lagrangian methods rl, reinforcement learning advanced

Lagrangian methods solve constrained RL by introducing multipliers and alternating between policy and multiplier updates.
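
A runnable toy sketch of the alternating primal-dual update on a stand-in problem (maximize r(θ) = θ subject to c(θ) = θ² ≤ 1; in real constrained RL the gradient steps would come from a policy-gradient estimator):

```python
# Lagrangian: L(theta, lam) = theta - lam * (theta**2 - budget)
theta, lam = 0.0, 0.0
lr_theta, lr_lam, budget = 0.05, 0.05, 1.0

for _ in range(2000):
    # primal (policy) ascent step on the Lagrangian
    grad_theta = 1.0 - 2.0 * lam * theta
    theta += lr_theta * grad_theta
    # dual (multiplier) ascent step: grow lam while the constraint is violated
    lam = max(0.0, lam + lr_lam * (theta**2 - budget))

# Primal-dual iterates can oscillate, but hover near theta ~ 1, lam ~ 0.5.
print(round(theta, 2), round(lam, 2))
```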

lagrangian neural networks, scientific ml

Learn dynamics from Lagrangian mechanics.

lakefs,data lake,version

LakeFS provides Git-like versioning for data lakes. Branch, merge, rollback data.

lamb, lamb, optimization

LAMB (Layer-wise Adaptive Moments optimizer for Batch training) enables stable training at very large batch sizes by extending Adam with per-layer trust-ratio scaling of updates.

lambda labs,gpu,cloud

Lambda Labs offers GPU cloud instances and workstations for training and inference, as an alternative to the hyperscalers.

lambdarank, recommendation systems

LambdaRank optimizes ranking metrics such as NDCG directly by weighting pairwise gradients by the change in the metric obtained from swapping each document pair.

lamda (language model for dialogue applications),lamda,language model for dialogue applications,foundation model

LaMDA is Google's conversational AI model, trained specifically for open-ended dialogue.

lamella preparation,metrology

Preparation of a thin sample slice (lamella), typically by focused ion beam milling, for TEM imaging.

laminar flow,facility

Smooth, unidirectional airflow used in cleanrooms to carry particles away without turbulence.

lamp heater, manufacturing equipment

Lamp heaters use radiant energy from high-intensity bulbs for rapid, non-contact heating.

land grid array, lga, packaging

A package with flat contact pads on its underside instead of solder balls; the mating pins sit in the socket (contrast BGA).

landmark attention, llm architecture

Efficient attention for long context.

landmark attention,llm architecture

Attend to selected landmark tokens for efficiency.

langchain, ai agents

LangChain provides a framework for building LLM applications with chains and agents.

langchain,framework

Framework for building LLM applications with chains and agents.

langchain,framework,llm

LangChain is a framework for LLM applications, covering chains, agents, and tools, with a popular ecosystem of integrations.

langchain,llamaindex,framework

LangChain and LlamaIndex are frameworks for building LLM apps (chains, agents, RAG, memory) that accelerate development.

langevin dynamics,generative models

Sampling by following the gradient of the log density (the score) plus injected Gaussian noise; the basis of score-based generative models (see the sketch below).
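
A minimal unadjusted Langevin sketch targeting a standard Gaussian (the score function shown is specific to that target; step size and iteration count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_p(x):
    """Score of a standard Gaussian target: d/dx log N(x; 0, 1) = -x."""
    return -x

x, step = 5.0, 0.1
for _ in range(5000):
    # Langevin update: drift along the score plus Gaussian noise
    x += 0.5 * step * grad_log_p(x) + np.sqrt(step) * rng.normal()

# After burn-in, successive values of x are approximately N(0, 1) samples.
print(x)
```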

langflow,visual,llm

Langflow is a visual, drag-and-drop builder for LLM flows (a DataStax project).

langfuse,tracing,open source

Langfuse is an open-source LLM tracing and observability tool for debugging chains and tracking costs; it is self-hostable.

langmuir probe,metrology

An electrode inserted into a plasma to measure plasma density, electron temperature, and plasma potential.

language adversarial training, nlp

Adversarial training that removes language-specific features so that representations transfer across languages.

language filtering, data quality

Filtering a corpus to keep only documents in specific target languages.

language identification, data quality

Automatically detecting the language of a text, typically as a corpus filtering step.

language model interpretability, explainable ai

Understanding how language models make decisions.

language-agnostic representations, nlp

Embeddings that map text from many languages into a shared, language-independent space.

language-specific pre-training, transfer learning

Pre-training a model on data from one particular language before fine-tuning.

large language model training, transformer, attention, tokenization, scaling laws, RLHF, DPO, mixture of experts, extended context, evaluation, deployment

# Large Language Model (LLM) Training

A comprehensive guide to understanding how Large Language Models are trained, from architecture to deployment.

## 1. Architecture: The Transformer

Modern LLMs are built on the **Transformer architecture** (Vaswani et al., 2017). The key innovation is the **self-attention mechanism**.

### 1.1 Self-Attention Mechanism

The attention function computes a weighted sum of values based on query-key similarity:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Where:

- $Q$ = Query matrix of shape $(n \times d_k)$
- $K$ = Key matrix of shape $(n \times d_k)$
- $V$ = Value matrix of shape $(n \times d_v)$
- $d_k$ = Dimension of keys (the scaling factor prevents vanishing softmax gradients)

### 1.2 Multi-Head Attention

Multiple attention heads learn different relationship types:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$

Where each head is computed as:

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

### 1.3 Transformer Block Components

Each transformer block contains:

- **Multi-head self-attention layer**
  - Allows tokens to attend to all previous tokens
  - Masked attention for autoregressive generation
- **Feed-forward network (FFN)**
  $$
  \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
  $$
  - Modern models use GELU or SwiGLU activations instead of ReLU
- **Layer normalization**
  $$
  \text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
  $$
- **Residual connections**
  $$
  \text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))
  $$

### 1.4 Positional Encoding

Since attention is permutation-invariant, position information must be injected.

**Sinusoidal (original Transformer):**

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

**Rotary Position Embedding (RoPE):** used in modern models like LLaMA:

$$
f_q(x_m, m) = (W_q x_m) e^{im\theta}
$$

## 2. Tokenization

Text must be converted to discrete tokens before training.

### 2.1 Byte-Pair Encoding (BPE)

**Algorithm:**

1. Initialize the vocabulary with all unique characters
2. Count the frequency of adjacent token pairs
3. Merge the most frequent pair into a new token
4. Repeat until the target vocabulary size is reached

**Vocabulary sizes:**

| Model  | Vocab Size |
|--------|------------|
| GPT-2  | 50,257     |
| GPT-4  | ~100,000   |
| LLaMA  | 32,000     |
| Claude | ~100,000   |

### 2.2 Tokenization Impact

- **Efficiency**: the tokens-per-word ratio affects context utilization
- **Multilingual**: some languages require more tokens per concept
- **Code**: programming syntax needs special handling
- **Rare words**: split into subword pieces

## 3. Pre-Training

The most computationally intensive phase, in which the model learns language patterns.

### 3.1 Training Objective: Next-Token Prediction

**Causal language modeling loss:**

$$
\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, x_2, \ldots, x_{t-1}; \theta)
$$

Or equivalently, the cross-entropy loss:

$$
\mathcal{L}_{CE} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i)
$$

Where:

- $V$ = vocabulary size
- $y_i$ = one-hot encoded true token
- $\hat{y}_i$ = predicted probability for token $i$

### 3.2 Training Data

**Data sources:**

- Web crawls (Common Crawl): ~60%
- Books and literature: ~8%
- Academic papers: ~5%
- Code repositories: ~10%
- Wikipedia: ~3%
- Curated high-quality sources: ~14%

**Data processing pipeline:**

1. **Deduplication**
   - MinHash / SimHash for near-duplicate detection
   - Exact substring matching
2. **Quality filtering**
   - Perplexity-based filtering
   - Heuristic rules (length, symbol ratio)
   - Classifier-based quality scoring
3. **Toxicity removal**
   - Keyword filtering
   - Classifier-based detection

### 3.3 Scaling Laws

**Kaplan et al. (2020) power laws:**

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076
$$

$$
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095
$$

$$
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050
$$

Where:

- $L$ = loss (lower is better)
- $N$ = number of parameters
- $D$ = dataset size (tokens)
- $C$ = compute budget (FLOPs)

**Chinchilla optimal scaling:** for compute-optimal training,

$$
N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}
$$

Rule of thumb: **~20 tokens per parameter**.

### 3.4 Optimization

**AdamW optimizer:**

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

$$
\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)
$$

**Typical hyperparameters:**

- $\beta_1 = 0.9$
- $\beta_2 = 0.95$
- $\epsilon = 10^{-8}$
- $\lambda = 0.1$ (weight decay)

**Learning rate schedule:** warmup + cosine decay:

$$
\eta_t =
\begin{cases}
\eta_{max} \cdot \frac{t}{T_{warmup}} & \text{if } t < T_{warmup} \\
\eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t - T_{warmup}}{T_{total} - T_{warmup}} \pi\right)\right) & \text{otherwise}
\end{cases}
$$

### 3.5 Distributed Training

**Parallelism strategies:**

- **Data Parallelism (DP)**
  - Model replicated across $N$ GPUs
  - Batch split into $N$ micro-batches
  - Gradients synchronized via all-reduce
- **Tensor Parallelism (TP)**
  - Individual layers split across GPUs
  - For attention: $Q, K, V$ projections partitioned
- **Pipeline Parallelism (PP)**
  - Sequential layers on different GPUs
  - Micro-batch pipelining reduces bubble time
- **ZeRO (Zero Redundancy Optimizer)**
  - Stage 1: partition optimizer states
  - Stage 2: + partition gradients
  - Stage 3: + partition parameters

**Compute requirements:**

$$
\text{FLOPs} \approx 6 \times N \times D
$$

Where $N$ = parameters and $D$ = training tokens. For a 70B model on 2T tokens:

$$
\text{FLOPs} \approx 6 \times 70 \times 10^9 \times 2 \times 10^{12} = 8.4 \times 10^{23}
$$

## 4. Post-Training & Alignment

Transforms a base model into a helpful assistant.

### 4.1 Supervised Fine-Tuning (SFT)

**Objective:** same as pre-training, but on curated instruction-response pairs:

$$
\mathcal{L}_{SFT} = -\sum_{t=1}^{T} \log P(y_t \mid x, y_{<t})
$$
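
To make the attention equation in Section 1.1 concrete, here is a minimal numpy sketch of scaled dot-product attention with the causal mask described in Section 1.3 (illustrative, not a production implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=True):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) similarity matrix
    if causal:
        # forbid attending to future positions
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ V                 # weighted sum of values

n, d_k, d_v = 4, 8, 8
rng = np.random.default_rng(0)
out = attention(rng.normal(size=(n, d_k)),
                rng.normal(size=(n, d_k)),
                rng.normal(size=(n, d_v)))
print(out.shape)  # (4, 8)
```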
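
The BPE algorithm in Section 2.1 fits in a short script. This toy trainer (the word-count input is made up for illustration) runs the count-and-merge loop:

```python
from collections import Counter

def bpe_train(word_counts, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = {tuple(w): c for w, c in word_counts.items()}  # word -> symbols
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count                       # step 2: count pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]                # step 3: best pair
        merges.append((a, b))
        new_vocab = {}
        for symbols, count in vocab.items():               # apply the merge
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

print(bpe_train({"lower": 5, "lowest": 3, "newer": 6}, num_merges=3))
# [('w', 'e'), ('we', 'r'), ('l', 'o')]
```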
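
The warmup-plus-cosine schedule in Section 3.4 translates directly to code; a minimal sketch (the function name and default values are illustrative):

```python
import math

def lr(t, t_warmup, t_total, eta_max, eta_min=0.0):
    """Linear warmup to eta_max, then cosine decay to eta_min."""
    if t < t_warmup:
        return eta_max * t / t_warmup
    progress = (t - t_warmup) / (t_total - t_warmup)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(progress * math.pi))

print(lr(50, 100, 1000, 3e-4))    # mid-warmup: 1.5e-4
print(lr(1000, 100, 1000, 3e-4))  # end of training: fully decayed to eta_min
```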
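
The compute estimates in Sections 3.3 and 3.5 are easy to reproduce; this snippet reruns the worked 70B/2T example and the Chinchilla 20-tokens-per-parameter rule of thumb:

```python
N = 70e9   # parameters
D = 2e12   # training tokens

flops = 6 * N * D
print(f"{flops:.2e}")              # 8.40e+23, matching the worked example

chinchilla_tokens = 20 * N         # compute-optimal token budget
print(f"{chinchilla_tokens:.1e}")  # 1.4e+12 tokens for a 70B model
```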

larger-the-better, quality & reliability

Larger-the-better quality characteristics (in the Taguchi classification) should be maximized while keeping variation minimal.