zero (zero redundancy optimizer),zero,zero redundancy optimizer,model training
**ZeRO (Zero Redundancy Optimizer)** partitions optimizer states, gradients, and parameters across data-parallel devices.
- **The problem**: Data parallelism replicates everything on each device - wasteful memory usage. Every one of N devices stores a full copy of the model and its optimizer state.
- **ZeRO insight**: Optimizer states (Adam moments), gradients, and parameters don't all need to be replicated - partition them instead.
- **Stage 1**: Partition optimizer states. ~4× memory reduction (Adam's optimizer states dominate model-state memory).
- **Stage 2**: Also partition gradients. ~8× reduction.
- **Stage 3**: Also partition parameters. Reduction scales linearly with device count.
- **How it works**: Each device owns a shard of the parameters. All-gather reconstructs the parameters needed for forward/backward, gradients are reduce-scattered, and each device updates its local shard.
- **Communication overhead**: More communication than vanilla data parallelism, but it enables training otherwise-impossible model sizes.
- **Memory savings**: ZeRO-3 spreads model state across the cluster, so models far larger than any single GPU's memory become trainable.
- **DeepSpeed**: Microsoft's library implementing ZeRO; the industry standard for large-scale training.
- **ZeRO-Offload**: Offload to CPU memory for even larger models.
- **ZeRO-Infinity**: Offload to NVMe for multi-trillion-parameter models.
zero liquid discharge, environmental & sustainability
**Zero Liquid Discharge** is **a wastewater strategy in which liquid effluent is eliminated through treatment and recovery** - it minimizes environmental discharge by recovering reusable water and isolating solids for handling or disposal.
**What Is Zero Liquid Discharge?**
- **Definition**: a wastewater strategy where liquid effluent is eliminated through treatment and recovery.
- **Core Mechanism**: Advanced treatment, concentration, and crystallization systems recover reusable water from waste streams.
- **Operational Scope**: Applied where discharge regulations or water scarcity demand maximum recovery - e.g., power, chemical, textile, and mining operations.
- **Failure Modes**: High energy demand and scaling issues can challenge economic feasibility.
**Why Zero Liquid Discharge Matters**
- **Outcome Quality**: Recovering nearly all process water reduces freshwater intake and eliminates liquid discharge liability.
- **Risk Management**: Removing the liquid discharge pathway cuts exposure to permit violations and pollutant releases.
- **Operational Efficiency**: Recovered water is reused on site, lowering both intake and disposal costs over time.
- **Strategic Alignment**: Water-balance metrics connect treatment operations directly to compliance and sustainability goals.
- **Scalable Deployment**: The same treatment train (pretreatment, membrane concentration, evaporation, crystallization) adapts across industries and feed waters.
**How It Is Used in Practice**
- **Method Selection**: Choose approaches by compliance targets, resource intensity, and long-term sustainability objectives.
- **Calibration**: Optimize energy-water tradeoffs and monitor concentrate-management reliability.
- **Validation**: Track resource efficiency, emissions performance, and objective metrics through recurring controlled evaluations.
Zero Liquid Discharge is **a high-stringency approach to water compliance and sustainability goals** - it trades energy and capital intensity for near-complete elimination of liquid effluent.
zero optimizer deepspeed,zero redundancy optimizer,distributed training memory,zero stage 1 2 3,memory efficient distributed training
**ZeRO (Zero Redundancy Optimizer)** is **the memory optimization technique for distributed training that partitions optimizer states, gradients, and parameters across data-parallel processes** — eliminating memory redundancy to enable training models 100-1000× larger than possible with standard data parallelism, achieving linear scaling to thousands of GPUs while maintaining training efficiency and convergence properties.
**Memory Redundancy in Data Parallelism:**
- **Standard Data Parallelism**: each GPU stores complete copy of model parameters, gradients, and optimizer states; for Adam optimizer with model size M: each GPU stores M (parameters) + M (gradients) + 2M (momentum, variance) = 4M memory
- **Redundancy Problem**: for 8 GPUs, total memory 32M but only M unique parameters; 31M wasted on redundant copies; limits model size to what fits on single GPU; inefficient memory utilization
- **Example**: GPT-3 175B parameters in FP16: 350GB parameters + 350GB gradients + 700GB optimizer states = 1.4TB per GPU; impossible on 80GB A100; ZeRO partitions across GPUs
- **Communication**: standard data parallelism requires all-reduce of gradients; communication volume scales with model size; ZeRO adds communication for parameter gathering but reduces memory dramatically
**ZeRO Stages:**
- **ZeRO Stage 1 (Optimizer State Partitioning)**: partition optimizer states across GPUs; each GPU stores 1/N of optimizer states for N GPUs; reduces optimizer memory by N×; parameters and gradients still replicated; 4× memory reduction for Adam
- **ZeRO Stage 2 (Gradient Partitioning)**: partition gradients in addition to optimizer states; each GPU stores 1/N of gradients; reduces gradient memory by N×; parameters still replicated; 8× memory reduction total
- **ZeRO Stage 3 (Parameter Partitioning)**: partition parameters across GPUs; each GPU stores 1/N of parameters; gather parameters just-in-time for forward/backward; maximum memory reduction; model-state memory shrinks linearly with device count (8× with 8 GPUs, 64× with 64)
- **Stage Selection**: Stage 1 for moderate models (1-10B); Stage 2 for large models (10-100B); Stage 3 for extreme models (100B-1T); trade-off between memory and communication
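The stage memory figures above can be sanity-checked with the simplified accounting from this section (M parameters + M gradients + 2M Adam states = 4M for standard data parallelism). A minimal sketch; real deployments add communication buffers and activation memory, and the 4×/8× per-stage figures quoted above come from the ZeRO paper's byte-level accounting with FP32 optimizer states:

```python
def per_gpu_memory(M, N, stage):
    """Model-state memory per GPU for N data-parallel devices under each
    ZeRO stage, using the simplified accounting: M parameters, M gradients,
    2M Adam optimizer states."""
    params, grads, opt = M, M, 2 * M
    if stage == 0:      # standard data parallelism: everything replicated
        return params + grads + opt
    if stage == 1:      # partition optimizer states only
        return params + grads + opt / N
    if stage == 2:      # also partition gradients
        return params + grads / N + opt / N
    if stage == 3:      # also partition parameters
        return (params + grads + opt) / N
    raise ValueError("stage must be 0-3")

# 175B-parameter model in FP16: M = 350 GB of parameter memory
M = 350
for s in range(4):
    print(f"ZeRO stage {s}: {per_gpu_memory(M, 8, s):7.2f} GB/GPU on 8 GPUs")
```

Only Stage 3 divides the full 4M by N, which is why it is the stage whose savings scale linearly with device count.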
**ZeRO Stage 3 Deep Dive:**
- **Parameter Gathering**: before computing layer, all-gather parameters from all GPUs; each GPU broadcasts its 1/N partition; reconstructs full layer; computes forward pass; discards parameters after use
- **Gradient Computation**: backward pass gathers parameters again; computes gradients; reduces gradients to owner GPU; each GPU receives 1/N of gradients; updates its 1/N of parameters
- **Communication Pattern**: all-gather parameters for the forward pass, all-gather again for the backward pass, then reduce-scatter gradients; total communication volume is ~1.5× that of standard data parallelism; but enables N× larger models
- **Overlapping**: overlap communication with computation; prefetch next layer parameters while computing current layer; hide communication latency; maintains training efficiency
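The gather/scatter pattern above can be simulated in a single process, with plain NumPy arrays standing in for `torch.distributed` collectives (a toy sketch; real implementations overlap these collectives with compute and prefetch the next layer):

```python
import numpy as np

def shard(x, n):
    """Split a flat parameter vector into n equal shards (one per 'GPU')."""
    return np.split(x, n)

def all_gather(shards):
    """Every rank reconstructs the full vector from all ranks' shards."""
    return np.concatenate(shards)

def reduce_scatter(grads_per_rank, n):
    """Sum the full gradient vectors from all ranks, then hand each rank
    only the 1/n slice covering the parameters it owns."""
    total = np.sum(grads_per_rank, axis=0)
    return np.split(total, n)

n_gpus, dim, lr = 4, 8, 0.1
rng = np.random.default_rng(0)
full_params = rng.standard_normal(dim)

# ZeRO-3: each rank permanently stores only its own shard.
owned = shard(full_params, n_gpus)

# Forward/backward: all-gather the full parameters just in time...
gathered = all_gather(owned)

# ...each rank computes a full gradient on its local batch...
local_grads = [rng.standard_normal(dim) for _ in range(n_gpus)]

# ...then reduce-scatter: each rank receives only the gradient slice for
# the parameters it owns, and updates its shard locally.
grad_shards = reduce_scatter(local_grads, n_gpus)
owned = [p - lr * g for p, g in zip(owned, grad_shards)]
```

Note that each rank only ever *stores* `dim / n_gpus` parameters; the full vector exists transiently during the gather, which is exactly the memory/communication trade the section describes.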
**Memory Savings:**
- **Model States**: ZeRO-3 reduces per-GPU memory from 4M to 4M/N + communication buffers; for 8 GPUs: 8× reduction; for 64 GPUs: 64× reduction; enables models 10-100× larger
- **Activation Memory**: ZeRO doesn't reduce activation memory; combine with gradient checkpointing for activation savings; multiplicative benefits; enables 100-1000× larger models
- **Example Calculation**: 175B-parameter model with Adam: standard DP needs 1.4TB per GPU (impossible on any single device); ZeRO-3 on 8 GPUs needs ~175GB per GPU (still above an 80GB A100, so add ZeRO-Offload or more devices); on 64 GPUs it drops to ~22GB per GPU (comfortably feasible)
- **Scaling**: memory per GPU decreases linearly with GPU count; enables training arbitrarily large models with enough GPUs; practical limit from communication overhead
**Communication Overhead:**
- **Bandwidth Requirements**: ZeRO-3 requires ~1.5× communication vs standard data parallelism (two parameter all-gathers plus a gradient reduce-scatter, vs a single all-reduce); but enables models that don't fit otherwise
- **Latency Sensitivity**: small models or fast GPUs may see slowdown from communication; ZeRO-3 beneficial when model size > 1B parameters; smaller models use Stage 1 or 2
- **Network Topology**: requires high-bandwidth interconnect (NVLink, InfiniBand); 100-400 Gb/s per GPU; slower networks (Ethernet) see larger overhead; topology-aware optimization helps
- **Scaling Efficiency**: maintains 80-95% scaling efficiency to 64-128 GPUs; degrades to 60-80% at 512-1024 GPUs; still enables training impossible otherwise
**DeepSpeed Integration:**
- **DeepSpeed Library**: Microsoft's implementation of ZeRO; production-ready; used for training GPT-3, Megatron-Turing NLG, Bloom; extensive optimization and tuning
- **Configuration**: simple JSON config to enable ZeRO stages; zero_optimization: {stage: 3}; automatic partitioning and communication; minimal code changes
- **ZeRO-Offload**: offload optimizer states and gradients to CPU memory; further reduces GPU memory; trades PCIe bandwidth for memory; enables training on consumer GPUs
- **ZeRO-Infinity**: offload to NVMe SSD; enables training models larger than total system memory; extreme memory savings at cost of I/O latency; for models 1T+ parameters
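A representative DeepSpeed configuration enabling ZeRO-3 with CPU optimizer offload, expanding the `zero_optimization: {stage: 3}` snippet above. This is a sketch of commonly used keys; exact fields and defaults vary by DeepSpeed version, so consult the configuration reference for yours:

```json
{
  "train_micro_batch_size_per_gpu": 4,
  "gradient_accumulation_steps": 8,
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu" },
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
```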
**Combining with Other Techniques:**
- **ZeRO + Gradient Checkpointing**: multiplicative memory savings; ZeRO reduces model state memory, checkpointing reduces activation memory; enables 100-1000× larger models
- **ZeRO + Mixed Precision**: FP16/BF16 training reduces memory 2×; combined with ZeRO the savings multiply (e.g., 64× from ZeRO-3 on 64 GPUs times 2× from mixed precision = 128×)
- **ZeRO + Model Parallelism**: ZeRO for data parallelism, pipeline/tensor parallelism for model parallelism; hybrid approach for extreme scale; used in Megatron-DeepSpeed
- **ZeRO + LoRA**: ZeRO enables fine-tuning large models; LoRA reduces trainable parameters; combination enables fine-tuning 100B+ models on modest hardware
**Production Deployment:**
- **Training Stability**: ZeRO maintains same convergence as standard training; no hyperparameter changes needed; extensively validated on large models
- **Fault Tolerance**: checkpoint/resume works with ZeRO; each GPU saves its partition; restore from checkpoint seamlessly; critical for long training runs
- **Monitoring**: DeepSpeed provides memory and communication profiling; identifies bottlenecks; helps optimize configuration; essential for large-scale training
- **Multi-Node Scaling**: ZeRO scales to thousands of GPUs across hundreds of nodes; used for training largest models (Bloom 176B, Megatron-Turing 530B); production-proven
**Best Practices:**
- **Stage Selection**: use Stage 1 for models <10B, Stage 2 for 10-100B, Stage 3 for >100B; measure memory and speed; choose based on bottleneck
- **Batch Size**: increase batch size with saved memory; improves training stability and convergence; typical increase 4-16× vs standard data parallelism
- **Communication Optimization**: use NVLink for intra-node, InfiniBand for inter-node; enable NCCL optimizations; topology-aware placement; critical for efficiency
- **Profiling**: profile memory and communication; identify bottlenecks; adjust configuration; iterate to optimal settings; essential for large-scale training
ZeRO is **the breakthrough that made training 100B+ parameter models practical** — by eliminating memory redundancy in distributed training, it enables models 100-1000× larger than possible with standard approaches, democratizing large-scale AI research and enabling the frontier models that define the current state of artificial intelligence.
zero-cost proxies, neural architecture
**Zero-Cost Proxies** are **metrics that estimate the performance of a neural architecture without any training** — computed in a single forward/backward pass at initialization, enabling architecture ranking in seconds instead of hours.
**What Are Zero-Cost Proxies?**
- **Examples**:
- **SynFlow**: Sum of per-parameter synaptic saliencies, computed by running an all-ones input through the network with absolute-valued weights (measures signal propagation).
- **NASWOT**: Log-determinant of a kernel built from the network's binary ReLU activation patterns on a minibatch at initialization.
- **GradNorm**: Norm of gradients at initialization.
- **Fisher**: Fisher information of the network at initialization.
- **Cost**: One forward + one backward pass = seconds per architecture.
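The SynFlow idea above can be illustrated on a two-layer linear network in plain NumPy, with the gradients written out by hand (a toy sketch; real implementations run autograd on the actual candidate architecture):

```python
import numpy as np

def synflow_score(W1, W2):
    """SynFlow-style zero-cost proxy: push an all-ones input through the
    network with absolute-valued weights, take R = sum of outputs, and
    score = sum over all parameters of theta * dR/dtheta."""
    A1, A2 = np.abs(W1), np.abs(W2)
    x = np.ones(A1.shape[1])
    h = A1 @ x                                  # hidden "signal"
    # R = 1^T A2 A1 x ; hand-derived gradients of R:
    dR_dA2 = np.outer(np.ones(A2.shape[0]), h)  # dR/dA2[k,i] = h[i]
    dR_dA1 = np.outer(A2.sum(axis=0), x)        # dR/dA1[i,j] = colsum_i(A2)
    return float((A1 * dR_dA1).sum() + (A2 * dR_dA2).sum())

rng = np.random.default_rng(42)
W1 = rng.standard_normal((16, 8))   # toy architecture: 8 -> 16 -> 4
W2 = rng.standard_normal((4, 16))
score = synflow_score(W1, W2)
```

Because R is linear in each layer's weights, the per-layer saliencies each sum to R, so for this two-layer toy the score equals exactly 2R; what matters for NAS is only the *ranking* these scores induce across architectures.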
**Why It Matters**
- **Speed**: Evaluate 10,000 architectures in minutes (vs. days for one-shot, weeks for full training).
- **Pre-Filtering**: Use zero-cost proxies to prune the search space before expensive evaluation.
- **Limitation**: Correlation with trained accuracy is imperfect (0.5-0.8 Spearman rank), but improving.
**Zero-Cost Proxies** are **instant architecture critics** — predicting network performance at birth, before a single weight update.
zero-cost proxy, neural architecture search
**Zero-cost proxy** is **a neural-architecture-evaluation signal that estimates model quality without full training** - Proxies use initialization-time statistics such as gradient norms or synaptic saliency to rank architectures quickly.
**What Is Zero-cost proxy?**
- **Definition**: A neural-architecture-evaluation signal that estimates model quality without full training.
- **Core Mechanism**: Proxies use initialization-time statistics such as gradient norms or synaptic saliency to rank architectures quickly.
- **Operational Scope**: Used in neural architecture search to filter large candidate pools before any training budget is spent.
- **Failure Modes**: Proxy rankings can fail when task characteristics differ from assumptions behind the proxy.
**Why Zero-cost proxy Matters**
- **Performance Quality**: Cheap but informative rankings let a search examine far more candidates within the same budget.
- **Efficiency**: Scoring at initialization replaces thousands of partial training runs, cutting NAS compute by orders of magnitude.
- **Risk Control**: Validating proxy rankings against trained references guards against optimizing a misleading signal.
- **Deployment Readiness**: Proxy scores combine naturally with latency and hardware cost models when ranking candidates.
- **Scalable Learning**: Proxies whose rank correlation holds across search spaces transfer to new tasks and datasets.
**How It Is Used in Practice**
- **Method Selection**: Choose approach by data regime, action space, compute budget, and operational constraints.
- **Calibration**: Combine multiple proxies and validate rank correlation against partially trained reference models.
- **Validation**: Track distributional metrics, stability indicators, and end-task outcomes across repeated evaluations.
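The calibration step above - checking a proxy's rank correlation against reference accuracies - can be sketched with a hand-rolled Spearman coefficient (`scipy.stats.spearmanr` does the same with tie handling; the numbers below are invented for illustration):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation, no tie correction: the Pearson
    correlation of the two rank vectors."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# Hypothetical data: proxy scores for 6 candidate architectures vs.
# accuracies from partially trained reference runs of the same candidates.
proxy_scores = np.array([0.12, 0.55, 0.31, 0.90, 0.47, 0.08])
ref_accuracy = np.array([0.61, 0.74, 0.70, 0.81, 0.69, 0.55])

rho = spearman(proxy_scores, ref_accuracy)
```

A rho near 1 means the proxy preserves the true ordering; values in the 0.5-0.8 range (as reported for current proxies) justify using the proxy only as a coarse pre-filter, not as the final selector.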
Zero-cost proxy is **a high-value technique in advanced machine-learning system engineering** - It accelerates NAS by reducing dependence on expensive full training loops.
zero-failure testing, reliability
**Zero-failure testing** is the **qualification strategy that defines pass criteria based on observing no failures over a planned sample and exposure window** - it simplifies acceptance decisions, but requires disciplined statistical design to avoid false confidence.
**What Is Zero-failure testing?**
- **Definition**: Test plan where any observed failure fails the criterion and zero failures are required to pass.
- **Statistical Basis**: Pass meaning is expressed as lower confidence bound on reliability, not absolute perfection.
- **Typical Use**: Early qualification gates, screening validation, and high-reliability component acceptance.
- **Key Variables**: Sample count, stress time, confidence level, and assumed failure model.
**Why Zero-failure testing Matters**
- **Operational Simplicity**: Clear pass-fail rule improves execution speed and review clarity.
- **High Assurance**: When properly sized, zero-failure plans provide strong reliability evidence.
- **Release Discipline**: Strict criterion discourages weakly justified reliability claims.
- **Risk Visibility**: Failure occurrence immediately triggers root cause and containment investigation.
- **Program Fit**: Useful when product class requires conservative qualification behavior.
**How It Is Used in Practice**
- **Plan Sizing**: Compute required sample and stress exposure for desired reliability-confidence target.
- **Mechanism Coverage**: Ensure stress conditions activate relevant field failure mechanisms.
- **Failure Response**: Define rapid escalation and corrective action workflow before test start.
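Plan sizing in the first bullet is typically done with the success-run relation R^n = 1 - C: to demonstrate reliability R at confidence C with zero observed failures, the required sample size is n = ln(1 - C) / ln(R). A minimal sketch (one unit-time stress exposure per unit; time-scaled plans adjust n via the assumed failure model):

```python
import math

def zero_failure_sample_size(reliability, confidence):
    """Units required, all passing, to claim the given lower-bound
    reliability at the given confidence (success-run formula)."""
    if not (0 < reliability < 1 and 0 < confidence < 1):
        raise ValueError("reliability and confidence must be in (0, 1)")
    return math.ceil(math.log(1 - confidence) / math.log(reliability))

# Demonstrate 90% reliability at 90% confidence with zero failures:
n = zero_failure_sample_size(0.90, 0.90)   # classic "success-run" plan
```

Note how quickly n grows with the target: raising reliability to 95% at the same confidence roughly doubles the sample, which is why zero-failure plans must be sized before committing to the pass-fail rule.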
Zero-failure testing is **a strict but effective reliability gate when statistically designed correctly** - it trades tolerance for clarity and strong confidence in release readiness.
zero-shot chain-of-thought,reasoning
**Zero-shot chain-of-thought (Zero-shot CoT)** is the remarkably simple technique of appending the phrase **"Let's think step by step"** (or a similar instruction) to a prompt — without providing any reasoning examples — to trigger the language model to generate its own step-by-step reasoning before producing a final answer.
**The Discovery**
- Standard **few-shot CoT** requires carefully crafted reasoning examples in the prompt — effective but labor-intensive to create for each task.
- Researchers discovered that simply adding **"Let's think step by step"** to the end of a zero-shot prompt (no examples at all) dramatically improves reasoning performance.
- This single phrase can improve accuracy on math and logic benchmarks by **tens of percentage points** compared to standard zero-shot prompting.
**How Zero-Shot CoT Works**
- **Without CoT**: "What is 23 + 47 × 2?" → Model often gives wrong answer by misapplying order of operations.
- **With Zero-Shot CoT**: "What is 23 + 47 × 2? Let's think step by step." → Model responds:
```
Step 1: First, compute 47 × 2 = 94
Step 2: Then, add 23 + 94 = 117
Answer: 117
```
**Two-Stage Process**
1. **Reasoning Extraction**: Append "Let's think step by step" → model generates a reasoning chain.
2. **Answer Extraction**: After the reasoning, prompt "Therefore, the answer is" → model produces the final answer.
- Some implementations use both stages explicitly; others let the model naturally conclude with an answer.
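The two-stage process above is just prompt assembly; a minimal sketch, where `complete` is a placeholder for any text-completion callable (e.g., an LLM API client) and `fake_complete` is a stub used purely for illustration:

```python
def zero_shot_cot(question, complete):
    """Two-stage zero-shot CoT: (1) elicit a reasoning chain with the
    trigger phrase, (2) extract the final answer from that chain.
    `complete` maps a prompt string to a completion string."""
    # Stage 1: reasoning extraction
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = complete(reasoning_prompt)
    # Stage 2: answer extraction, conditioned on the generated reasoning
    answer_prompt = f"{reasoning_prompt}{reasoning}\nTherefore, the answer is"
    return complete(answer_prompt).strip()

# Stubbed completion function standing in for a real model:
def fake_complete(prompt):
    if prompt.endswith("step by step."):
        return " 47 x 2 = 94; 23 + 94 = 117."
    return " 117"

print(zero_shot_cot("What is 23 + 47 x 2?", fake_complete))
```

The single-stage variant simply appends the trigger phrase and lets the model conclude with its own answer; the explicit second stage makes answer parsing more reliable.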
**Why It Works**
- The phrase **activates reasoning patterns** learned during pretraining — the model has seen many examples of step-by-step reasoning in its training data.
- Without the prompt, the model defaults to **pattern matching** or **direct recall** — which often fails for problems requiring multi-step logic.
- The instruction makes the model **allocate more computation** (more tokens) to the problem before committing to an answer.
**Effective Trigger Phrases**
- "Let's think step by step" — the original and most studied.
- "Let's work this out step by step to be sure we have the right answer."
- "Let's solve this carefully."
- "Think about this step by step before answering."
- Research shows the exact phrasing matters — some variations work better than others for specific models.
**Limitations**
- **Less Effective Than Few-Shot CoT**: On many benchmarks, few-shot CoT with well-crafted examples still outperforms zero-shot CoT.
- **Model Size Dependent**: Zero-shot CoT primarily works with large models (>100B parameters). Smaller models may produce incoherent reasoning.
- **Task Dependent**: Works well for math, logic, and commonsense reasoning. Less effective for creative tasks or tasks requiring domain-specific procedures.
- **Unfaithful Reasoning**: The model may generate plausible-looking but logically flawed reasoning — the presence of steps doesn't guarantee correctness.
**Practical Impact**
- Zero-shot CoT is the **most cost-effective reasoning improvement** available — it requires no example crafting, no fine-tuning, and works across many tasks.
- It's become a **standard baseline** in prompt engineering — virtually every complex prompt now includes some form of "think step by step" instruction.
Zero-shot chain-of-thought is one of the **most influential discoveries** in prompt engineering — a single phrase that unlocks latent reasoning capabilities, demonstrating that how you ask is as important as what you ask.
zero-shot distillation, model compression
**Zero-Shot Distillation** is a **variant of data-free distillation where the student is trained without any real data or data generation process** — relying entirely on the teacher's learned parameters and the structure of the output space to transfer knowledge.
**How Does Zero-Shot Distillation Work?**
- **Crafted Inputs**: Generate pseudo-data by optimizing random noise to maximize specific class activations in the teacher.
- **Model Inversion**: Use gradient-based optimization to "invert" the teacher — finding inputs that produce representative outputs.
- **Dirichlet Sampling**: Sample from the simplex of class probabilities to create diverse soft label targets.
- **Difference from Data-Free**: Zero-shot is even more restrictive — no generator network training, just direct optimization.
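The crafted-inputs step above can be shown in miniature: gradient ascent on random noise to maximize a target class's probability under a tiny linear-softmax "teacher" (gradient written by hand; real methods invert deep networks with autograd plus image priors and regularizers):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def invert_class(W, target, steps=200, lr=0.5, seed=0):
    """Optimize an input x so the teacher softmax(W @ x) assigns high
    probability to `target`, yielding one pseudo-example for that class."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(W.shape[1]) * 0.01   # start from small noise
    for _ in range(steps):
        p = softmax(W @ x)
        # Gradient of log p[target] w.r.t. x:  (e_target - p) @ W
        grad = (np.eye(W.shape[0])[target] - p) @ W
        x += lr * grad
    return x

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 10))   # toy "teacher": 5 classes, 10-dim inputs
x = invert_class(W, target=2)
probs = softmax(W @ x)
```

Repeating this for many targets (or for Dirichlet-sampled soft-label targets, as in the bullet above) builds the pseudo-dataset on which the student is then trained against the teacher's outputs.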
**Why It Matters**
- **Extreme Constraint**: When not even a generator can be trained (no compute budget for data generation).
- **Model IP**: Enables knowledge transfer from a teacher whose training data is unavailable; note that gradient-based inversion still requires white-box access to the teacher's parameters.
- **Research**: Explores the fundamental limits of how much knowledge can be extracted from a model without data.
**Zero-Shot Distillation** is **knowledge transfer at the extreme** — distilling a model's knowledge with literally zero training examples from any source.