AI Factory Glossary

169 technical terms and definitions

l-diversity, training techniques

L-diversity requires each group of records sharing the same quasi-identifiers to contain at least l distinct values for every sensitive attribute.
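
A minimal sketch of how an l-diversity check can be implemented, assuming records are plain dictionaries and groups are formed by shared quasi-identifier values (the field names and data are illustrative):

```python
from collections import defaultdict

def is_l_diverse(records, quasi_ids, sensitive, l):
    """Return True if every group sharing the same quasi-identifier values
    contains at least l distinct values of the sensitive attribute."""
    groups = defaultdict(set)
    for row in records:
        key = tuple(row[q] for q in quasi_ids)   # equivalence-class key
        groups[key].add(row[sensitive])          # collect sensitive values per group
    return all(len(values) >= l for values in groups.values())

# Tiny usage example: is this table 2-diverse on "diagnosis" within (zip, age) groups?
records = [
    {"zip": "130**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "130**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "148**", "age": "30-39", "diagnosis": "flu"},
]
print(is_l_diverse(records, ["zip", "age"], "diagnosis", l=2))  # False: second group has only 1 value
```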

l-infinity attacks, ai safety

Bound maximum per-pixel change.

l0 attacks, l0, ai safety

Limit number of pixels changed.

l2 attacks, l2, ai safety

Bound Euclidean distance.
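
The three norm-constrained attack entries above (l0, l2, l-infinity) differ only in how the perturbation is bounded. A minimal NumPy sketch of projecting a perturbation back onto each ball; the radius values and array shapes are illustrative:

```python
import numpy as np

def project_linf(delta, eps):
    # L-infinity ball: clip each per-pixel change to [-eps, eps].
    return np.clip(delta, -eps, eps)

def project_l2(delta, eps):
    # L2 ball: rescale the perturbation if its Euclidean norm exceeds eps.
    norm = np.linalg.norm(delta)
    return delta if norm <= eps else delta * (eps / norm)

def project_l0(delta, k):
    # L0 constraint: keep only the k largest-magnitude pixel changes.
    out = np.zeros_like(delta)
    idx = np.argsort(np.abs(delta).ravel())[-k:]
    out.flat[idx] = delta.flat[idx]
    return out

# Tiny usage example on a 2x2 "image" perturbation.
delta = np.array([[0.3, -0.8], [0.05, 0.6]])
print(project_linf(delta, 0.5))
print(project_l2(delta, 1.0))
print(project_l0(delta, k=2))
```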

label flipping, ai safety

Change labels in training data.

label propagation on graphs, graph neural networks

Semi-supervised learning via graph structure.
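
A minimal NumPy sketch of iterative label propagation in the spirit of the standard Zhou et al. formulation (normalized adjacency, damping factor alpha); the toy graph and hyperparameters are illustrative:

```python
import numpy as np

def label_propagation(A, Y, alpha=0.9, iters=100):
    """Propagate labels over a graph.

    A: (n, n) symmetric adjacency matrix
    Y: (n, c) one-hot rows for labeled nodes, zero rows for unlabeled nodes
    Returns soft label scores F of shape (n, c).
    """
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = D_inv_sqrt @ A @ D_inv_sqrt            # symmetrically normalized adjacency
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * S @ F + (1 - alpha) * Y    # spread neighbors' labels, keep seed labels
    return F

# Toy chain graph 0-1-2-3: node 0 labeled class 0, node 3 labeled class 1.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
Y = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], float)
print(label_propagation(A, Y).argmax(axis=1))  # -> [0 0 1 1]
```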

label smoothing, machine learning

Soften hard labels for regularization.
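
A minimal NumPy sketch of label smoothing, mixing the one-hot target with a uniform distribution over classes; the smoothing factor eps = 0.1 is an illustrative choice:

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Replace one-hot targets with (1 - eps) * one_hot + eps / num_classes."""
    one_hot = np.eye(num_classes)[labels]
    return (1.0 - eps) * one_hot + eps / num_classes

print(smooth_labels(np.array([2]), num_classes=4, eps=0.1))
# [[0.025 0.025 0.925 0.025]]
```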

lagrangian neural networks, scientific ml

Learn dynamics from Lagrangian mechanics.

lamda (language model for dialogue applications), lamda, foundation model

Google's conversational AI model.

landmark attention, llm architecture

Attend to selected landmark tokens for efficient attention over long contexts.

langchain, ai agents

LangChain provides a framework for building LLM applications with chains and agents.

langchain, framework

Framework for building LLM applications with chains and agents.

langchain, framework, llm

LangChain is a framework for LLM applications, providing chains, agents, and tools within a popular ecosystem.

langchain, llamaindex, framework

LangChain and LlamaIndex are frameworks for building LLM apps with chains, agents, RAG, and memory; both accelerate development.

langevin dynamics, generative models

Sample using gradient of log density.
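
A minimal NumPy sketch of unadjusted Langevin dynamics, where each step follows the score (gradient of the log density) plus Gaussian noise; the step size, iteration count, and standard-normal target are illustrative:

```python
import numpy as np

def langevin_sample(score_fn, x0, step=0.01, n_steps=1000, rng=None):
    """x_{t+1} = x_t + step * score(x_t) + sqrt(2 * step) * noise."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.normal(size=x.shape)
        x = x + step * score_fn(x) + np.sqrt(2.0 * step) * noise
    return x

# Standard normal target: log p(x) = -x^2/2 + const, so the score is -x.
samples = np.stack([langevin_sample(lambda x: -x, [5.0], rng=np.random.default_rng(i))
                    for i in range(200)])
print(samples.mean(), samples.std())  # roughly 0 and 1
```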

langflow, visual, llm

Langflow is a visual LLM flow builder from DataStax.

language adversarial training, nlp

Remove language-specific features.

language model interpretability, explainable ai

Understanding how language models make decisions.

language-specific pre-training, transfer learning

Pre-train for particular language.

large language model training, transformer, attention, tokenization, scaling laws, RLHF, DPO, mixture of experts, extended context, evaluation, deployment

# Large Language Model (LLM) Training

A comprehensive guide to understanding how Large Language Models are trained, from architecture to deployment.

## 1. Architecture: The Transformer

Modern LLMs are built on the **Transformer architecture** (Vaswani et al., 2017). The key innovation is the **self-attention mechanism**.

### 1.1 Self-Attention Mechanism

The attention function computes a weighted sum of values based on query-key similarity:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Where:

- $Q$ = Query matrix of shape $(n \times d_k)$
- $K$ = Key matrix of shape $(n \times d_k)$
- $V$ = Value matrix of shape $(n \times d_v)$
- $d_k$ = Dimension of keys (scaling factor prevents gradient vanishing)

### 1.2 Multi-Head Attention

Multiple attention heads learn different relationship types:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$

Where each head is computed as:

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

### 1.3 Transformer Block Components

Each transformer block contains:

- **Multi-head self-attention layer**
  - Allows tokens to attend to all previous tokens
  - Masked attention for autoregressive generation
- **Feed-forward network (FFN)**
  $$
  \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
  $$
  - Or using GELU/SwiGLU activations in modern models
- **Layer normalization**
  $$
  \text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
  $$
- **Residual connections**
  $$
  \text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))
  $$

### 1.4 Positional Encoding

Since attention is permutation-invariant, position information must be injected:

**Sinusoidal (original Transformer):**

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

**Rotary Position Embedding (RoPE):** Used in modern models like LLaMA:

$$
f_q(x_m, m) = (W_q x_m) e^{im\theta}
$$
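
To make the attention formula in Section 1.1 concrete, here is a minimal NumPy sketch of scaled dot-product attention with an optional causal mask; the shapes and variable names are illustrative rather than taken from any particular framework:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q, K: (n, d_k), V: (n, d_v). With causal=True, token t may only attend
    to positions <= t (masked attention for autoregressive generation).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (n, n) similarity matrix
    if causal:
        n = scores.shape[0]
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)    # block attention to future tokens
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return weights @ V                           # weighted sum of values

# Tiny usage example with random projections of a 4-token sequence.
rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```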

## 2. Tokenization

Text must be converted to discrete tokens before training.

### 2.1 Byte-Pair Encoding (BPE)

**Algorithm:**

1. Initialize vocabulary with all unique characters
2. Count frequency of adjacent token pairs
3. Merge most frequent pair into new token
4. Repeat until vocabulary size reached

**Vocabulary sizes:**

| Model | Vocab Size |
|-------|------------|
| GPT-2 | 50,257 |
| GPT-4 | ~100,000 |
| LLaMA | 32,000 |
| Claude | ~100,000 |

### 2.2 Tokenization Impact

- **Efficiency**: Tokens-per-word ratio affects context utilization
- **Multilingual**: Some languages require more tokens per concept
- **Code**: Special handling for programming syntax
- **Rare words**: Split into subword pieces

## 3. Pre-Training

The most computationally intensive phase, where the model learns language patterns.

### 3.1 Training Objective: Next-Token Prediction

**Causal Language Modeling Loss:**

$$
\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, x_2, \ldots, x_{t-1}; \theta)
$$

Or equivalently, the cross-entropy loss:

$$
\mathcal{L}_{CE} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i)
$$

Where:

- $V$ = vocabulary size
- $y_i$ = one-hot encoded true token
- $\hat{y}_i$ = predicted probability for token $i$

### 3.2 Training Data

**Data sources:**

- Web crawls (Common Crawl): ~60%
- Books and literature: ~8%
- Academic papers: ~5%
- Code repositories: ~10%
- Wikipedia: ~3%
- Curated high-quality sources: ~14%

**Data processing pipeline:**

1. **Deduplication**
   - MinHash / SimHash for near-duplicate detection
   - Exact substring matching
2. **Quality filtering**
   - Perplexity-based filtering
   - Heuristic rules (length, symbol ratio)
   - Classifier-based quality scoring
3. **Toxicity removal**
   - Keyword filtering
   - Classifier-based detection

### 3.3 Scaling Laws

**Kaplan et al. (2020) Power Laws:**

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076
$$

$$
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095
$$

$$
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050
$$

Where:

- $L$ = Loss (lower is better)
- $N$ = Number of parameters
- $D$ = Dataset size (tokens)
- $C$ = Compute budget (FLOPs)

**Chinchilla Optimal Scaling:** For compute-optimal training:

$$
N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}
$$

Rule of thumb: **~20 tokens per parameter**

### 3.4 Optimization

**AdamW Optimizer:**

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

$$
\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)
$$

**Typical hyperparameters:**

- $\beta_1 = 0.9$
- $\beta_2 = 0.95$
- $\epsilon = 10^{-8}$
- $\lambda = 0.1$ (weight decay)

**Learning Rate Schedule:** Warmup + cosine decay:

$$
\eta_t =
\begin{cases}
\eta_{max} \cdot \frac{t}{T_{warmup}} & \text{if } t < T_{warmup} \\
\eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t - T_{warmup}}{T_{total} - T_{warmup}} \pi\right)\right) & \text{otherwise}
\end{cases}
$$

### 3.5 Distributed Training

**Parallelism strategies:**

- **Data Parallelism (DP)**
  - Model replicated across $N$ GPUs
  - Batch split into $N$ micro-batches
  - Gradients synchronized via all-reduce
- **Tensor Parallelism (TP)**
  - Individual layers split across GPUs
  - For attention: $Q, K, V$ projections partitioned
- **Pipeline Parallelism (PP)**
  - Sequential layers on different GPUs
  - Micro-batch pipelining reduces bubble time
- **ZeRO (Zero Redundancy Optimizer)**
  - Stage 1: Partition optimizer states
  - Stage 2: + Partition gradients
  - Stage 3: + Partition parameters

**Compute requirements:**

$$
\text{FLOPs} \approx 6 \times N \times D
$$

Where $N$ = parameters, $D$ = training tokens. For a 70B model on 2T tokens:

$$
\text{FLOPs} \approx 6 \times 70 \times 10^9 \times 2 \times 10^{12} = 8.4 \times 10^{23}
$$

## 4. Post-Training & Alignment

Transforms a base model into a helpful assistant.

### 4.1 Supervised Fine-Tuning (SFT)

**Objective:** Same as pre-training, but on curated instruction-response pairs:

$$
\mathcal{L}_{SFT} = -\sum_{t=1}^{T} \log P(y_t | x, y_{<t}; \theta)
$$
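
To make the SFT objective concrete, here is a minimal NumPy sketch of the token-level cross-entropy loss with the prompt positions masked out, so only response tokens contribute; the packed prompt+response layout and the `prompt_len` convention are illustrative assumptions, not a specific framework's API:

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def sft_loss(logits, targets, prompt_len):
    """Causal-LM cross-entropy restricted to response tokens.

    logits:  (T, V) next-token logits for one packed (prompt + response) sequence
    targets: (T,)   gold next-token ids aligned with the logits
    prompt_len:     number of leading prompt tokens excluded from the loss
    """
    logp = log_softmax(logits)                            # (T, V)
    token_logp = logp[np.arange(len(targets)), targets]   # log P(y_t | x, y_<t)
    mask = np.arange(len(targets)) >= prompt_len          # only response positions count
    return -(token_logp * mask).sum() / mask.sum()        # mean NLL over response tokens

# Tiny usage example: T=6 tokens, vocabulary of 10, first 2 positions are prompt.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))
targets = rng.integers(0, 10, size=6)
print(sft_loss(logits, targets, prompt_len=2))
```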

laser fib, failure analysis advanced

Laser-based focused ion beam alternatives use pulsed lasers for material removal, avoiding gallium contamination in failure analysis.

laser repair, lithography

Use a laser to fix mask defects.

laser voltage probing, failure analysis advanced

Laser voltage probing measures internal node voltages non-invasively through electro-optic or thermal effects.

laser voltage probing, failure analysis

Non-contact voltage measurement.

late fusion, multimodal ai

Combine per-modality predictions at the decision stage.

late interaction models, rag

Compute token-level interactions between query and document embeddings at scoring time (e.g., ColBERT).

latency prediction, model optimization

Latency prediction models estimate inference time from architecture specifications, guiding hardware-aware NAS.

latent consistency models, generative models

Fast sampling in latent space.

latent diffusion models, ldm, generative models

Run diffusion in a compressed latent space for efficiency (e.g., Stable Diffusion).

latent diffusion, multimodal ai

Latent diffusion models perform diffusion in a compressed latent space, reducing computational cost.

latent direction, multimodal ai

Latent directions are vectors in latent space corresponding to semantic attributes.

latent failures, reliability

Defects that escape testing and cause failures later in the field.

latent odes, neural architecture

Neural ODEs in latent space for irregular time series.

latent space arithmetic, generative models

Add or subtract latent codes to combine concepts in latent space.

latent space disentanglement, generative models

Separate independent factors.

latent space interpolation, generative models

Smoothly transition between samples by interpolating their latent codes.
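
A minimal NumPy sketch of linear and spherical (slerp) interpolation between two latent codes; slerp is often preferred for Gaussian latents because intermediate points keep a typical norm. The latent dimension and codes are illustrative:

```python
import numpy as np

def lerp(z0, z1, t):
    # Straight-line interpolation between latent codes.
    return (1 - t) * z0 + t * z1

def slerp(z0, z1, t):
    # Spherical interpolation: follow the great circle between the two directions.
    omega = np.arccos(np.clip(np.dot(z0 / np.linalg.norm(z0),
                                     z1 / np.linalg.norm(z1)), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return lerp(z0, z1, t)
    return (np.sin((1 - t) * omega) * z0 + np.sin(t * omega) * z1) / np.sin(omega)

# Walk between two random latent codes in 10 steps.
rng = np.random.default_rng(0)
z0, z1 = rng.normal(size=512), rng.normal(size=512)
path = [slerp(z0, z1, t) for t in np.linspace(0.0, 1.0, 10)]
print(len(path), path[0].shape)  # 10 (512,)
```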

latent space interpolation, multimodal ai

Latent space interpolation smoothly transitions between generated samples.

latent space manipulation, generative models

Edit by moving in latent space.

latent space navigation, generative models

Explore latent space systematically.

latent upscaling, generative models

Upscale in latent space.

latent world models, reinforcement learning

Learn compact representations of environment dynamics.

layer normalization variants, neural architecture

Alternative normalization schemes such as RMSNorm.

layer-wise relevance propagation, lrp, explainable ai

Backpropagate relevance scores from the output prediction down to the input features.
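
A minimal NumPy sketch of the LRP epsilon rule for a single linear layer, redistributing output relevance to inputs in proportion to each input's contribution a_j * w_jk; the toy weights and relevance values are illustrative:

```python
import numpy as np

def lrp_epsilon_linear(a, W, R_out, eps=1e-6):
    """Redistribute relevance through z_k = sum_j a_j * W[j, k].

    a:     (J,)   layer input activations
    W:     (J, K) weights
    R_out: (K,)   relevance of the layer outputs
    Returns R_in of shape (J,); its sum approximately equals sum(R_out).
    """
    z = a @ W                          # (K,) pre-activations
    z = z + eps * np.sign(z)           # epsilon stabilizer against small denominators
    s = R_out / z                      # (K,)
    return a * (W @ s)                 # R_j = a_j * sum_k W[j, k] * s_k

a = np.array([1.0, 2.0])
W = np.array([[0.5, -1.0], [1.0, 0.75]])
R_out = np.array([1.0, 0.0])
print(lrp_epsilon_linear(a, W, R_out))  # ~[0.2 0.8]; relevance is conserved
```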

layernorm epsilon, neural architecture

Small constant preventing division by zero.

layout optimization, model optimization

Layout optimization selects memory formats for tensors, minimizing data transformation overhead.