
AI Factory Glossary

3,145 technical terms and definitions


langchain,llamaindex,framework

LangChain and LlamaIndex are frameworks for building LLM applications, providing abstractions for chains, agents, RAG, and memory that accelerate development.

langevin dynamics,generative models

Sampling method that follows the gradient of the log density (the score) with added Gaussian noise.
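A standard form of the (unadjusted) Langevin update with step size $\epsilon$ is:

$$
x_{t+1} = x_t + \frac{\epsilon}{2}\,\nabla_x \log p(x_t) + \sqrt{\epsilon}\, z_t, \qquad z_t \sim \mathcal{N}(0, I)
$$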

langflow,visual,llm

Langflow is a visual LLM flow builder from DataStax.

language adversarial training, nlp

Remove language-specific features.

language model interpretability, explainable ai

Understanding how language models make decisions.

language-specific pre-training, transfer learning

Pre-train a model on a particular language.

large language model training, transformer, attention, tokenization, scaling laws, RLHF, DPO, mixture of experts, extended context, evaluation, deployment

# Large Language Model (LLM) Training

A comprehensive guide to understanding how Large Language Models are trained, from architecture to deployment.

## 1. Architecture: The Transformer

Modern LLMs are built on the **Transformer architecture** (Vaswani et al., 2017). The key innovation is the **self-attention mechanism**.

### 1.1 Self-Attention Mechanism

The attention function computes a weighted sum of values based on query-key similarity:

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Where:

- $Q$ = Query matrix of shape $(n \times d_k)$
- $K$ = Key matrix of shape $(n \times d_k)$
- $V$ = Value matrix of shape $(n \times d_v)$
- $d_k$ = Dimension of keys (scaling factor prevents gradient vanishing)

### 1.2 Multi-Head Attention

Multiple attention heads learn different relationship types:

$$
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O
$$

Where each head is computed as:

$$
\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)
$$

### 1.3 Transformer Block Components

Each transformer block contains:

- **Multi-head self-attention layer**
  - Allows tokens to attend to all previous tokens
  - Masked attention for autoregressive generation
- **Feed-forward network (FFN)**
  $$
  \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2
  $$
  - Or using GELU/SwiGLU activations in modern models
- **Layer normalization**
  $$
  \text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta
  $$
- **Residual connections**
  $$
  \text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))
  $$

### 1.4 Positional Encoding

Since attention is permutation-invariant, position information must be injected:

**Sinusoidal (original Transformer):**

$$
PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

$$
PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

**Rotary Position Embedding (RoPE):** Used in modern models like LLaMA:

$$
f_q(x_m, m) = (W_q x_m) e^{im\theta}
$$

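To make the attention equations in Section 1 concrete, here is a minimal NumPy sketch of scaled dot-product attention with an optional causal mask. The `causal` flag, the toy shapes, and the `-1e9` masking constant are illustrative assumptions, not part of the original entry.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (n, n) similarity matrix
    if causal:
        # Mask future positions so token t only attends to tokens <= t.
        n = scores.shape[-1]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)
    return softmax(scores, axis=-1) @ V

# Toy example: 5 tokens, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V, causal=True)
print(out.shape)  # (5, 8)
```

Multi-head attention simply applies this function in parallel to $h$ learned projections of the input and concatenates the results, as in the MultiHead formula above.
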
## 2. Tokenization

Text must be converted to discrete tokens before training.

### 2.1 Byte-Pair Encoding (BPE)

**Algorithm:**

1. Initialize vocabulary with all unique characters
2. Count frequency of adjacent token pairs
3. Merge most frequent pair into new token
4. Repeat until vocabulary size reached

**Vocabulary sizes:**

| Model  | Vocab Size |
|--------|------------|
| GPT-2  | 50,257     |
| GPT-4  | ~100,000   |
| LLaMA  | 32,000     |
| Claude | ~100,000   |

### 2.2 Tokenization Impact

- **Efficiency**: Tokens per word ratio affects context utilization
- **Multilingual**: Some languages require more tokens per concept
- **Code**: Special handling for programming syntax
- **Rare words**: Split into subword pieces

## 3. Pre-Training

The most computationally intensive phase where the model learns language patterns.

### 3.1 Training Objective: Next-Token Prediction

**Causal Language Modeling Loss:**

$$
\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_1, x_2, \ldots, x_{t-1}; \theta)
$$

Or equivalently, the cross-entropy loss:

$$
\mathcal{L}_{CE} = -\sum_{i=1}^{V} y_i \log(\hat{y}_i)
$$

Where:

- $V$ = vocabulary size
- $y_i$ = one-hot encoded true token
- $\hat{y}_i$ = predicted probability for token $i$

### 3.2 Training Data

**Data sources:**

- Web crawls (Common Crawl): ~60%
- Books and literature: ~8%
- Academic papers: ~5%
- Code repositories: ~10%
- Wikipedia: ~3%
- Curated high-quality sources: ~14%

**Data processing pipeline:**

1. **Deduplication**
   - MinHash / SimHash for near-duplicate detection
   - Exact substring matching
2. **Quality filtering**
   - Perplexity-based filtering
   - Heuristic rules (length, symbol ratio)
   - Classifier-based quality scoring
3. **Toxicity removal**
   - Keyword filtering
   - Classifier-based detection

### 3.3 Scaling Laws

**Kaplan et al. (2020) Power Laws:**

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076
$$

$$
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095
$$

$$
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050
$$

Where:

- $L$ = Loss (lower is better)
- $N$ = Number of parameters
- $D$ = Dataset size (tokens)
- $C$ = Compute budget (FLOPs)

**Chinchilla Optimal Scaling:**

For compute-optimal training:

$$
N_{opt} \propto C^{0.5}, \quad D_{opt} \propto C^{0.5}
$$

Rule of thumb: **~20 tokens per parameter**

### 3.4 Optimization

**AdamW Optimizer:**

$$
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
$$

$$
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
$$

$$
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
$$

$$
\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right)
$$

**Typical hyperparameters:**

- $\beta_1 = 0.9$
- $\beta_2 = 0.95$
- $\epsilon = 10^{-8}$
- $\lambda = 0.1$ (weight decay)

**Learning Rate Schedule:**

Warmup + Cosine decay:

$$
\eta_t =
\begin{cases}
\eta_{max} \cdot \frac{t}{T_{warmup}} & \text{if } t < T_{warmup} \\
\eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t - T_{warmup}}{T_{total} - T_{warmup}} \pi\right)\right) & \text{otherwise}
\end{cases}
$$

### 3.5 Distributed Training

**Parallelism strategies:**

- **Data Parallelism (DP)**
  - Model replicated across $N$ GPUs
  - Batch split into $N$ micro-batches
  - Gradients synchronized via all-reduce
- **Tensor Parallelism (TP)**
  - Individual layers split across GPUs
  - For attention: $Q, K, V$ projections partitioned
- **Pipeline Parallelism (PP)**
  - Sequential layers on different GPUs
  - Micro-batch pipelining reduces bubble time
- **ZeRO (Zero Redundancy Optimizer)**
  - Stage 1: Partition optimizer states
  - Stage 2: + Partition gradients
  - Stage 3: + Partition parameters

**Compute requirements:**

$$
\text{FLOPs} \approx 6 \times N \times D
$$

Where $N$ = parameters, $D$ = training tokens. For a 70B model on 2T tokens:

$$
\text{FLOPs} \approx 6 \times 70 \times 10^9 \times 2 \times 10^{12} = 8.4 \times 10^{23}
$$

## 4. Post-Training & Alignment

Transforms a base model into a helpful assistant.

### 4.1 Supervised Fine-Tuning (SFT)

**Objective:** Same as pre-training, but on curated instruction-response pairs.

$$
\mathcal{L}_{SFT} = -\sum_{t=1}^{T} \log P(y_t | x, y_{<t}; \theta)
$$
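
As an illustration of the SFT objective above, here is a minimal NumPy sketch of token-level cross-entropy with prompt positions masked out, so that only response tokens contribute to the loss. The `IGNORE_INDEX = -100` convention and the toy shapes are assumptions borrowed from common training code, not specified in this entry.

```python
import numpy as np

IGNORE_INDEX = -100  # conventional "do not train on this token" label (assumption)

def sft_loss(logits, labels):
    """Mean negative log-likelihood of target tokens, skipping masked positions.

    logits: (T, V) unnormalized scores per position
    labels: (T,)   target token ids, with IGNORE_INDEX on prompt positions
    """
    # Stable log-softmax over the vocabulary.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

    keep = labels != IGNORE_INDEX                      # only response tokens
    nll = -log_probs[np.arange(len(labels))[keep], labels[keep]]
    return nll.mean()

# Toy example: 6 positions, vocab of 10; the first 3 tokens are the prompt.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 10))
labels = np.array([-100, -100, -100, 4, 7, 2])
print(sft_loss(logits, labels))
```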

laser fib, failure analysis advanced

Laser-based focused ion beam alternatives use pulsed lasers for material removal, avoiding gallium contamination in failure analysis.

laser repair, lithography

Use a laser to repair mask defects.

laser voltage probing, failure analysis advanced

Laser voltage probing measures internal node voltages non-invasively through electro-optic or thermal effects.

laser voltage probing,failure analysis

Non-contact voltage measurement.

late fusion, multimodal ai

Combine predictions from each modality at the end of the pipeline.

late interaction models, rag

Compute query-document relevance from interactions between individual token embeddings (e.g., ColBERT).

latency prediction, model optimization

Latency prediction models estimate inference time from architecture specifications guiding hardware-aware NAS.

latent consistency models,generative models

Fast sampling in latent space.

latent diffusion models, ldm, generative models

Diffusion in compressed latent space.

latent diffusion models,generative models

Run diffusion in compressed latent space for efficiency (Stable Diffusion).

latent diffusion, multimodal ai

Latent diffusion models perform diffusion in compressed latent space reducing computational cost.

latent direction, multimodal ai

Latent directions are vectors in latent space corresponding to semantic attributes.

latent failures, reliability

Defects not caught by test.

latent odes, neural architecture

Neural ODEs in latent space for irregular time series.

latent space arithmetic, generative models

Add/subtract concepts in latent space.

latent space arithmetic,generative models

Combine latent codes with arithmetic operations.

latent space disentanglement, generative models

Separate independent factors.

latent space interpolation, generative models

Smooth transitions in latent space.
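A common choice is linear interpolation between two latent codes (spherical interpolation is often preferred for Gaussian latents):

$$
z_\lambda = (1 - \lambda)\, z_1 + \lambda\, z_2, \qquad \lambda \in [0, 1]
$$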

latent space interpolation, multimodal ai

Latent space interpolation smoothly transitions between generated samples.

latent space interpolation,generative models

Smoothly transition between samples.

latent space manipulation,generative models

Edit by moving in latent space.

latent space navigation, generative models

Explore latent space systematically.

latent upscaling, generative models

Upscale in latent space.

latent world models, reinforcement learning

Learn compact representations of environment dynamics.

layer normalization variants, neural architecture

Alternative normalization schemes such as RMSNorm, or different placements of LayerNorm (pre-LN vs. post-LN).

layer-wise relevance propagation, lrp, explainable ai

Backpropagate relevance scores.

layernorm epsilon, neural architecture

Small constant added to the variance in LayerNorm to prevent division by zero.

layout optimization, model optimization

Layout optimization selects memory formats for tensors minimizing data transformation overhead.

lazy class, code ai

Code smell: a class that does too little to justify its existence.

lazy training regime, theory

Networks stay close to initialization.

lead optimization, healthcare ai

Improve drug candidate properties.

lead time management, supply chain & logistics

Lead time management optimizes procurement and delivery schedules minimizing delays and carrying costs.

leaky relu, neural architecture

ReLU variant with a fixed small slope for negative inputs.

learnable position embedding, transformer

Position embeddings as parameters.

learned layer selection, neural architecture

Train which layers to use per input.

learned noise schedule, generative models

Train noise schedule.

learned routing, llm architecture

Learned routing trains networks to assign tokens to experts.

learned step size, model optimization

Learned step size quantization optimizes quantization scale factors during training.

learning curve prediction, neural architecture search

Learning curve prediction forecasts final performance from initial training enabling efficient architecture selection.

learning rate schedule,model training

Plan for adjusting the learning rate during training (e.g., cosine, linear, or step decay).
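A minimal sketch of a warmup-plus-cosine schedule, matching the formula given under the large language model training entry above; the step counts and rates are illustrative assumptions.

```python
import math

def lr_at_step(t, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup followed by cosine decay to min_lr (illustrative values)."""
    if t < warmup_steps:
        return max_lr * t / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at_step(0), lr_at_step(2_000), lr_at_step(100_000))  # 0.0, max_lr, min_lr
```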

learning to rank,machine learning

ML approaches to ranking.

learning using privileged information, lupi, machine learning

Framework for learning from privileged information that is available only at training time.