
AI Factory Glossary

107 technical terms and definitions


isotonic regression, ai safety

**Isotonic Regression** is a non-parametric calibration technique that fits a monotonically non-decreasing step function to map a model's raw prediction scores to calibrated probabilities, without assuming any specific functional form for the calibration mapping. The method partitions the score range into bins where the calibrated probability within each bin equals the empirical accuracy, subject to the constraint that the mapping is monotonically non-decreasing.

**Why Isotonic Regression Matters in AI/ML:** Isotonic regression provides **flexible, assumption-free calibration** that can correct arbitrary distortions in a model's probability estimates — including non-linear miscalibration patterns that parametric methods like Platt scaling cannot capture.

- **Non-parametric flexibility** — Unlike Platt scaling (which assumes a sigmoid calibration curve), isotonic regression makes no assumptions about the shape of the miscalibration; it can correct S-shaped, concave, step-wise, or arbitrarily distorted probability mappings.
- **Monotonicity constraint** — The only assumption is that higher model scores should correspond to higher true probabilities; this minimal constraint preserves the model's ranking while adjusting the probability magnitudes.
- **Pool Adjacent Violators (PAV) algorithm** — Isotonic regression is solved efficiently by the PAV algorithm: scores are sorted, and whenever the monotonicity constraint is violated (a higher score has lower observed accuracy), the violating groups are merged and their probabilities averaged.
- **Calibration quality** — With sufficient data, isotonic regression achieves better calibration than Platt scaling because it can model complex miscalibration patterns; however, it requires more calibration data (roughly 5,000-10,000 examples) to avoid overfitting.
- **Step function output** — The calibrated mapping is a step function with as many steps as distinct score-accuracy groups; for smooth probabilities, the output can be further smoothed with interpolation.

| Property | Isotonic Regression | Platt Scaling |
|----------|---------------------|---------------|
| Parametric | No (non-parametric) | Yes (2 parameters) |
| Flexibility | Arbitrary monotone mapping | Sigmoid only |
| Data requirements | 5,000-10,000 examples | 1,000-5,000 examples |
| Overfitting risk | Higher (with small data) | Lower (constrained) |
| Calibration quality | Better (with enough data) | Good (if sigmoid appropriate) |
| Output shape | Step function | Smooth sigmoid |
| Multiclass extension | One-vs-rest | Temperature scaling |

**Isotonic regression is the most flexible post-hoc calibration technique available, providing non-parametric, assumption-free correction of arbitrary probability miscalibration patterns while preserving the model's ranking, making it the preferred calibration method when sufficient validation data is available and the miscalibration pattern is complex or unknown.**
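The PAV procedure described above can be sketched in a few lines of pure Python. This is a minimal illustration (labels assumed already sorted by model score, unit weights); in practice `sklearn.isotonic.IsotonicRegression` provides a tested implementation of the same idea:

```python
def pav(labels):
    """Pool Adjacent Violators: fit a non-decreasing sequence to `labels`,
    which must already be sorted by the model's raw score."""
    # Each block holds (mean value, total weight, count of original points).
    blocks = []
    for y in labels:
        blocks.append((float(y), 1.0, 1))
        # Merge backwards while a block violates monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            w = w1 + w2
            blocks.append(((v1 * w1 + v2 * w2) / w, w, c1 + c2))
    # Expand blocks back to one calibrated value per input point.
    out = []
    for v, _, c in blocks:
        out.extend([v] * c)
    return out

# Binary labels sorted by increasing model score; the fitted values are
# the calibrated probabilities for those score groups.
calibrated = pav([0, 0, 1, 0, 1, 1])  # -> [0.0, 0.0, 0.5, 0.5, 1.0, 1.0]
```

Note how the violating pair (label 1 followed by label 0) is pooled into a single block with the averaged probability 0.5, which is exactly the merge step described in the PAV bullet above.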

issue triaging, code ai

**Issue Triaging** is the **code AI task of automatically classifying, prioritizing, assigning, and de-duplicating bug reports and feature requests in software issue trackers** — enabling development teams to process incoming GitHub Issues, Jira tickets, and Bugzilla reports at scale without the triaging bottleneck that delays critical bug fixes, causes duplicate work, and leaves important user feedback unaddressed.

**What Is Issue Triaging?**
- **Input**: Issue title, description body, labels, reporter information, linked code references, and similar existing issues.
- **Triage Actions**:
  - **Classification**: Bug vs. feature request vs. documentation vs. question vs. enhancement.
  - **Priority Assignment**: Critical / High / Medium / Low based on impact and urgency.
  - **Component Assignment**: Which team, repository, or subsystem owns this issue.
  - **Duplicate Detection**: Does this issue already exist under a different title?
  - **Assignee Recommendation**: Which developer has the relevant expertise and capacity?
  - **Label Application**: Apply standardized labels from the project taxonomy.
  - **Status Routing**: Close as "won't fix," "needs more info," or move to sprint planning.
- **Key Benchmarks**: GHTorrent (GitHub archive), Bugzilla DBs (Mozilla, Eclipse, NetBeans), GitHub Issues corpora, DeepTriage (Microsoft).

**The Triaging Scale Problem**

At scale, issue triaging is a significant operational burden:
- VS Code: ~5,000 new GitHub issues/month; 180,000+ total open/closed issues.
- Linux Kernel: ~15,000 bug reports/year across multiple subsystems.
- Android AOSP: ~50,000+ issues tracked across hundreds of components.

Manual triaging requires a dedicated team of engineers who could otherwise be writing code. Microsoft has reported that automated triage for VS Code reduces manual triaging effort by 60%.

**Technical Tasks in Detail**

**Bug Report Classification**:
- Fine-tuned BERT/RoBERTa on labeled issue datasets.
- Accuracy ~88-92% for binary bug/not-bug classification.
- Harder: 7-class granular classification (performance, crash, security, UI, documentation, etc.) achieves ~72-80%.

**Duplicate Issue Detection**:
- Semantic similarity between a new issue and all existing open issues.
- Siamese-network or bi-encoder models comparing issue titles and bodies.
- Challenge: "App crashes when clicking back button" and "SegFault on navigation back gesture" are duplicates despite zero lexical overlap.
- Best models achieve ~85% precision@5 for duplicate retrieval.

**Priority Prediction**:
- Regress or classify priority from issue text features + reporter history + code component affected.
- Imbalanced task: most issues are medium priority; critical bugs are rare.
- Microsoft DeepTriage: 85% accuracy on 3-class priority with bug-specific features.

**Assignee Recommendation**:
- Predict which developer on the team should fix a given bug based on code ownership, expertise profile, and recent contribution history.
- Hybrid: text similarity to past issues + code file ownership graph + developer workload.
- Accuracy: ~70-78% for top-3 assignee recommendation on established projects.

**Why Issue Triaging Matters**
- **Developer Productivity**: Developers interrupted by triage duties repeatedly lose flow state. Automated first-pass triage lets human reviewers focus only on edge cases requiring judgment.
- **SLA Compliance**: Enterprise software support contracts define response-time SLAs by severity. Automated severity classification ensures SLA routing happens immediately on ticket creation.
- **Community Health**: Open source projects with slow issue response (weeks to triage) lose contributor trust. Automated triage plus quick acknowledgment improves community satisfaction.
- **Security Vulnerability Identification**: Automatically detecting security-related issues (crash reports that may indicate exploitable bugs, authentication-related failures) enables faster escalation to security teams.
- **Product Roadmap Signal**: Aggregating and classifying thousands of feature requests enables data-driven prioritization of development roadmap items based on frequency and user impact.

Issue Triaging is **the intelligent inbox for software development** — automatically classifying, prioritizing, routing, and deduplicating the continuous stream of user-reported bugs and feature requests that would otherwise overwhelm development teams, ensuring that critical issues reach the right engineers immediately while noise and duplicates are filtered efficiently.
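A first-pass duplicate detector can be sketched with bag-of-words cosine similarity in pure Python (function names here are illustrative, not from any triage tool). This lexical baseline misses exactly the hard case noted above, zero-overlap duplicates, which is why production systems use learned bi-encoder embeddings instead:

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity between two token lists, via term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def top_duplicates(new_issue, existing_issues, k=5):
    """Return up to k existing issues ranked by similarity to the new one,
    dropping issues with no overlap at all."""
    q = tokenize(new_issue)
    scored = sorted(
        ((cosine(q, tokenize(e)), e) for e in existing_issues),
        reverse=True,
    )
    return [issue for score, issue in scored[:k] if score > 0]

existing = [
    "App crashes when clicking back button",
    "Feature request: dark mode for editor",
]
print(top_duplicates("Crash after clicking the back button", existing))
```

A bi-encoder setup keeps the same retrieve-and-rank structure; only `cosine` over token counts is replaced by cosine over learned sentence embeddings, precomputed for the existing issues.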

iterated amplification, ai safety

**Iterated Amplification** is an **AI alignment technique that bootstraps human oversight by iteratively using AI assistance to solve increasingly complex evaluation tasks** — starting with problems humans can evaluate directly, then using AI-assisted humans to evaluate slightly harder problems, and continuing to expand the frontier of evaluable tasks.

**Amplification Process**
- **Base Case**: Human evaluates simple AI outputs directly — standard RLHF.
- **Amplification Step**: For harder tasks, decompose into sub-problems that a human-with-AI-assistant can evaluate.
- **Iteration**: The AI assistant itself was trained using the previous round's amplified evaluator.
- **Distillation**: Train a new model to mimic the amplified evaluator — producing a standalone, efficient model.

**Why It Matters**
- **Scalable Oversight**: Enables evaluation of AI outputs that are too complex for unaided human judgment.
- **Alignment Path**: Provides a concrete path to aligning superhuman AI — evaluation capability grows with AI capability.
- **Decomposition**: Complex tasks are decomposed into human-manageable sub-problems — divide and conquer for alignment.

**Iterated Amplification** is **growing the evaluator alongside the AI** — bootstrapping human oversight to keep pace with increasingly capable AI systems.
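The decompose-evaluate-combine loop can be illustrated with a deliberately simple toy in Python: judging a "hard" task (here, just summing a long list) by recursive decomposition down to a base case small enough for direct human judgment. All names are illustrative stand-ins, not from any alignment library:

```python
def human_judge(task):
    # Base case: a subproblem small enough for a human to evaluate directly.
    return sum(task)

def amplified_evaluate(task, base_size=2):
    """Amplification step: split a task too hard to judge directly into
    subproblems, evaluate each recursively (standing in for the
    human-with-AI-assistant), then combine the sub-judgments."""
    if len(task) <= base_size:
        return human_judge(task)
    mid = len(task) // 2
    return amplified_evaluate(task[:mid]) + amplified_evaluate(task[mid:])

print(amplified_evaluate(list(range(100))))  # -> 4950
```

The distillation step would then train a model to reproduce `amplified_evaluate`'s judgments in a single forward pass, so the expensive recursive tree is not needed at inference time.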

iterated amplification, ai safety

**Iterated Amplification** is **an alignment approach where hard tasks are recursively decomposed into easier subproblems humans can supervise** - a core method in scalable-oversight research.

**What Is Iterated Amplification?**
- **Definition**: An alignment approach where hard tasks are recursively decomposed into easier subproblems humans can supervise.
- **Core Mechanism**: Human-model collaboration expands effective oversight by chaining simpler, individually evaluable steps.
- **Operational Scope**: Applied in AI safety engineering and alignment research to supervise tasks whose final outputs humans cannot judge directly.
- **Failure Modes**: Poor decomposition quality can propagate early mistakes into final judgments.

**Why Iterated Amplification Matters**
- **Scalable Oversight**: Oversight capacity grows with model capability instead of being capped by unaided human judgment.
- **Error Control**: Decomposition makes each step individually checkable, reducing the chance that subtle errors hide inside one monolithic answer.
- **Efficiency**: Distilling the amplified evaluator into a standalone model keeps inference cheap after training.
- **Generality**: The decompose-evaluate-distill loop transfers across task domains.

**How It Is Used in Practice**
- **Method Selection**: Choose amplification when direct human evaluation of final outputs is infeasible but subproblem evaluation is not.
- **Calibration**: Validate decomposition trees and include cross-check mechanisms between branches.
- **Validation**: Track agreement between amplified judgments and ground truth on tasks where ground truth is available.

Iterated Amplification is **a path toward supervising complex reasoning beyond direct human capacity** - growing the overseer together with the system it oversees.

iteration / step, model training

An iteration or step is one update of model weights after processing one batch, the atomic unit of training.

- **Definition**: Forward pass on a batch, compute loss, backward pass, optimizer step = one iteration.
- **Relationship to epochs**: steps_per_epoch = dataset_size / batch_size. Total steps = epochs × steps_per_epoch.
- **LLM training**: Progress is often measured in steps rather than epochs; large models train for up to millions of steps.
- **What happens each step**: Load batch, forward pass, compute loss, backward pass (gradients), optimizer update, optional logging.
- **With gradient accumulation**: A logical step may span multiple forward-backward passes before the optimizer update.
- **Logging frequency**: Log every N steps (e.g., 100). Too frequent is expensive; too infrequent misses issues.
- **Checkpointing**: Save the model every N steps or epochs, balancing safety against storage.
- **Learning rate per step**: Most schedulers update the LR per step, not per epoch, for smoother adaptation.
- **Steps vs. samples**: Progress is sometimes reported in samples (steps × batch size) for comparisons across batch sizes.
- **Progress tracking**: Steps measure progress independently of dataset size; epochs depend on dataset size.
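The step accounting and the forward-loss-backward-update cycle can be sketched with a toy one-parameter model in pure Python (the quadratic loss, learning rate, and dataset sizes are all illustrative):

```python
def train_step(w, lr=0.1):
    """One iteration for the toy loss(w) = (w - 3)^2: the 'backward pass'
    computes the gradient, the 'optimizer step' applies plain SGD."""
    grad = 2.0 * (w - 3.0)   # backward pass: dLoss/dw
    return w - lr * grad     # optimizer update

dataset_size, batch_size, epochs = 10_000, 32, 3
steps_per_epoch = dataset_size // batch_size   # 312
total_steps = epochs * steps_per_epoch         # 936

w = 0.0
for step in range(total_steps):
    w = train_step(w)        # one batch -> one weight update
# After enough steps, w converges to the loss minimum at 3.0.
```

With gradient accumulation over `k` micro-batches, the loop body would instead sum `k` gradients before a single `w` update, so the logical step count drops by a factor of `k` while the number of forward-backward passes stays the same.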

iterative magnitude pruning, model optimization

**Iterative Magnitude Pruning (IMP)** is the **standard algorithm for finding Lottery Tickets** — repeatedly cycling through training, pruning the smallest weights, and rewinding to the original initialization until the desired sparsity is reached.

**What Is IMP?**
- **Algorithm**:
  1. Initialize the network with $\theta_0$.
  2. Train to convergence -> $\theta_T$.
  3. Prune the bottom $p\%$ of weights by magnitude.
  4. Reset surviving weights to $\theta_0$ (or $\theta_k$ for Late Rewinding).
  5. Repeat from step 2 until target sparsity.
- **Cost**: Very expensive. Requires full training $N$ times for $N$ pruning rounds.

**Why It Matters**
- **Gold Standard**: The definitive method for finding winning tickets (used to benchmark other methods).
- **Trade-off**: Achieves the best accuracy at high sparsity, but at extreme computational cost.
- **Research Driver**: The high cost of IMP motivates research into cheaper ticket-finding methods.

**Iterative Magnitude Pruning** is **the brute-force search for the essential network** — expensive but proven to find the sparsest accurate sub-networks.
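The loop above can be sketched in pure Python over a flat weight vector. `train` here is a stand-in that just assigns random trained magnitudes; in a real setting it would be full training to convergence, which is exactly why IMP is expensive:

```python
import random

def train(theta0, mask):
    # Stand-in for "train to convergence": random magnitudes for
    # surviving weights, zeros for pruned positions.
    return [random.uniform(-1.0, 1.0) if m else 0.0 for m in mask]

def iterative_magnitude_prune(theta0, rounds=3, prune_frac=0.2):
    mask = [True] * len(theta0)
    for _ in range(rounds):
        trained = train(theta0, mask)                       # step 2: train
        alive = sorted((abs(w), i) for i, w in enumerate(trained) if mask[i])
        for _, i in alive[:int(len(alive) * prune_frac)]:   # step 3: prune p%
            mask[i] = False
        # Step 4 (the rewind): survivors reset to theta0, which happens
        # here by reusing theta0 on the next train() call.
    winning_ticket = [w if m else 0.0 for w, m in zip(theta0, mask)]
    return mask, winning_ticket

mask, ticket = iterative_magnitude_prune([0.5] * 100)
print(sum(mask))  # 100 -> 80 -> 64 -> 52 surviving weights
```

Note the compounding: pruning 20% of the *remaining* weights per round gives 52% density after three rounds, not 40%, which is why reaching very high sparsity takes many rounds.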

iterative pruning, model optimization

**Iterative Pruning** is **a staged pruning process that alternates parameter removal and recovery training** - it preserves performance better than aggressive one-pass sparsification.

**What Is Iterative Pruning?**
- **Definition**: A staged pruning process that alternates parameter removal and recovery training.
- **Core Mechanism**: Small pruning increments are applied over multiple cycles, with fine-tuning between steps so the network can recover from each removal.
- **Operational Scope**: Applied in model-optimization workflows where high sparsity is needed without the accuracy collapse of one-shot pruning.
- **Failure Modes**: Too many cycles increase training cost with diminishing accuracy gains.

**Why Iterative Pruning Matters**
- **Accuracy Retention**: Gradual removal with recovery training reaches sparsity levels at which one-shot pruning would severely degrade accuracy.
- **Risk Management**: Small increments make regressions visible early, so a failing schedule can be stopped before the model is damaged.
- **Operational Efficiency**: The resulting sparse model cuts inference latency, memory, and energy use.
- **Scalable Deployment**: The cycle count and prune ratio can be tuned per hardware target and accuracy budget.

**How It Is Used in Practice**
- **Method Selection**: Prefer iterative over one-shot pruning when latency targets, memory budgets, and acceptable accuracy tradeoffs demand high sparsity.
- **Calibration**: Set cycle count and prune ratio per cycle based on accuracy recovery curves.
- **Validation**: Track accuracy, latency, memory, and energy metrics through recurring controlled evaluations.

Iterative Pruning is **a robust strategy for high-sparsity targets with controlled risk** - gradual sparsification that gives the model time to recover.
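One common calibration question, how much to prune per cycle to hit a target sparsity, follows directly from the compounding arithmetic: if each cycle removes a fraction p of the *remaining* weights, the final density is (1 - p)^cycles. A small Python helper (the function name is illustrative):

```python
def per_cycle_prune_ratio(target_sparsity, cycles):
    """Fraction of remaining weights to prune each cycle so that
    `cycles` rounds reach `target_sparsity` overall."""
    final_density = 1.0 - target_sparsity
    return 1.0 - final_density ** (1.0 / cycles)

# Reaching 90% sparsity in 5 cycles needs ~36.9% pruned per cycle,
# well below the 90% a one-shot prune would remove at once.
p = per_cycle_prune_ratio(0.90, 5)
```

In practice the schedule derived this way is a starting point; the per-cycle ratio is then adjusted against the accuracy recovery curves mentioned above.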