
AI Factory Glossary

656 technical terms and definitions


default,full stack ai,ai stack,ai infrastructure,ai build,vertical integration,frontier ai,ai systems,ai development

# The Full Stack AI Build: A Comprehensive Analysis

## Overview

The 5-Layer AI Stack represents the complete vertical integration required to build frontier AI systems:

$$
\text{Energy} \rightarrow \text{Chips} \rightarrow \text{Infrastructure} \rightarrow \text{Models} \rightarrow \text{Applications}
$$

## Layer 1: Energy (Electricity)

The foundational constraint upon which all other layers depend.

### Key Metrics

- **Training Energy Consumption**: A frontier LLM requires approximately 50-100+ GWh
- **Power Usage Effectiveness (PUE)**:
  $$
  \text{PUE} = \frac{\text{Total Facility Energy}}{\text{IT Equipment Energy}}
  $$
- **Target PUE**: $\text{PUE} \leq 1.2$ for modern AI data centers

### Critical Concerns

- Reliability and uptime requirements: $99.999\%$ availability
- Cost optimization: USD per kWh directly impacts training costs
- Carbon footprint: $\text{kg CO}_2/\text{kWh}$ varies by source
- Geographic availability constraints

### Energy Cost Model

$$
C_{\text{energy}} = P_{\text{peak}} \times T_{\text{training}} \times \text{PUE} \times \text{Cost}_{\text{kWh}}
$$

Where:

- $P_{\text{peak}}$ = Peak power consumption (kW)
- $T_{\text{training}}$ = Training duration (hours)
- $\text{Cost}_{\text{kWh}}$ = Energy cost per kilowatt-hour

## Layer 2: Chips

The computational substrate transforming electricity into useful operations.

### 2.1 Design Chips

#### Architecture Decisions

- **GPU vs TPU vs Custom ASIC**: Trade-offs in flexibility vs efficiency
- **Core compute unit**: Matrix multiplication engines
  $$
  C = A \times B \quad \text{where } A \in \mathbb{R}^{m \times k},\ B \in \mathbb{R}^{k \times n}
  $$
- **Operations count**:
  $$
  \text{FLOPs}_{\text{matmul}} = 2 \times m \times n \times k
  $$

#### Memory Bandwidth Optimization

The real bottleneck for transformer models:

$$
\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Accessed}}
$$

For transformer attention:

$$
\text{AI}_{\text{attention}} = \frac{2 \times n^2 \times d}{n^2 + 2nd} \approx O(d) \quad \text{(memory bound)}
$$

#### Numerical Precision Trade-offs

| Precision | Bits | Dynamic Range | Use Case |
|-----------|------|---------------|----------|
| FP32 | 32 | $\pm 3.4 \times 10^{38}$ | Reference |
| FP16 | 16 | $\pm 65504$ | Training |
| BF16 | 16 | $\pm 3.4 \times 10^{38}$ | Training |
| FP8 | 8 | $\pm 448$ (E4M3) | Inference |
| INT8 | 8 | $[-128, 127]$ | Quantized inference |
| INT4 | 4 | $[-8, 7]$ | Extreme quantization |

Quantization error bound:

$$
\|W - W_q\|_F \leq \frac{\Delta}{2}\sqrt{n}
$$

Where $\Delta$ is the quantization step size and $n$ is the number of weights.
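As a sanity check on this bound, here is a minimal NumPy sketch; the weight matrix is hypothetical, and symmetric uniform INT8 quantization is assumed:

```python
import numpy as np

# Hypothetical example: quantize a random weight matrix to INT8 and
# compare the measured Frobenius error against the (Delta/2)*sqrt(n) bound.
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float32)

# Symmetric uniform quantization: step size Delta spans [-max|W|, +max|W|]
delta = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / delta), -128, 127) * delta

err = np.linalg.norm(W - W_q)            # Frobenius norm of the error
bound = (delta / 2.0) * np.sqrt(W.size)  # worst-case bound from the text

print(f"measured error = {err:.4f}, bound = {bound:.4f}")  # err <= bound
```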
#### Power Efficiency

$$
\text{Efficiency} = \frac{\text{FLOPS}}{\text{Watt}} \quad [\text{FLOPS/W}]
$$

Modern targets:

- Training: $> 300$ TFLOPS/chip at $< 700$ W
- Inference: $> 1000$ TOPS/chip (INT8)

### 2.2 Wafer Fabrication

#### Process Technology

- **Transistor density**:
  $$
  D = \frac{N_{\text{transistors}}}{A_{\text{die}}} \quad [\text{transistors/mm}^2]
  $$
- **Node progression**: $7\,\text{nm} \rightarrow 5\,\text{nm} \rightarrow 3\,\text{nm} \rightarrow 2\,\text{nm}$

#### Yield Optimization

Defect density model (Poisson):

$$
Y = e^{-D_0 \times A}
$$

Where:

- $Y$ = Yield (probability of a good die)
- $D_0$ = Defect density (defects/cm²)
- $A$ = Die area (cm²)

#### Advanced Packaging

- **CoWoS** (Chip-on-Wafer-on-Substrate)
- **Chiplets**: Disaggregated design
- **3D Stacking**: HBM memory integration

Bandwidth scaling:

$$
\text{BW}_{\text{HBM3}} = N_{\text{stacks}} \times \text{BW}_{\text{per\_stack}} \approx 6 \times 819\ \text{GB/s} \approx 4.9\ \text{TB/s}
$$

## Layer 3: Infrastructure

Systems layer orchestrating chips into usable compute.

### 3.1 Build AI Infrastructure

#### Cluster Architecture

- **Nodes per cluster**: $N_{\text{nodes}} = 1000$ to $10000+$
- **GPUs per node**: $G_{\text{per\_node}} = 8$ (typical)
- **Total GPUs**:
  $$
  G_{\text{total}} = N_{\text{nodes}} \times G_{\text{per\_node}}
  $$

#### Network Topology

**Fat-tree bisection bandwidth**:

$$
\text{Bisection BW} = \frac{N \times \text{BW}_{\text{link}}}{2}
$$

**All-reduce communication cost** (ring algorithm):

$$
T_{\text{all-reduce}} = 2(n-1) \times \frac{M}{n \times \text{BW}} + 2(n-1) \times \alpha
$$

Where:

- $M$ = Message size
- $n$ = Number of participants
- $\alpha$ = Latency per hop

#### Storage Requirements

Training data storage:

$$
S_{\text{data}} = N_{\text{tokens}} \times \text{bytes\_per\_token} \times \text{redundancy}
$$

For 10T tokens at 2 bytes per token with $3\times$ redundancy:

$$
S \approx 10^{13} \times 2 \times 3 = 60\ \text{PB}
$$

#### Reliability Engineering

**Checkpointing overhead**:

$$
\text{Overhead} = \frac{T_{\text{checkpoint}}}{T_{\text{checkpoint\_interval}}}
$$

**Mean Time Between Failures (MTBF)** for a cluster:

$$
\text{MTBF}_{\text{cluster}} = \frac{\text{MTBF}_{\text{component}}}{N_{\text{components}}}
$$

## Layer 4: Models (LLMs)

Where core AI capability emerges.

### 4.1 Build Large Language Models

#### Transformer Architecture

**Self-attention mechanism**:

$$
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Where:

- $Q = XW_Q \in \mathbb{R}^{n \times d_k}$ (Queries)
- $K = XW_K \in \mathbb{R}^{n \times d_k}$ (Keys)
- $V = XW_V \in \mathbb{R}^{n \times d_v}$ (Values)

**Multi-Head Attention (MHA)**:

$$
\text{MHA}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O
$$

$$
\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)
$$

**Grouped Query Attention (GQA)**: Reduces the KV cache by a factor of $g$:

$$
\text{KV\_cache} = 2 \times L \times n \times d \times \frac{h}{g} \times \text{bytes}
$$

#### Feed-Forward Network

$$
\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2
$$

SwiGLU variant:

$$
\text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_3)W_2
$$

#### Model Parameter Count

For a decoder-only transformer:

$$
P = 12 \times L \times d^2 + V \times d
$$

Where:

- $L$ = Number of layers
- $d$ = Model dimension
- $V$ = Vocabulary size

#### Mixture of Experts (MoE)

$$
y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x)
$$

Where $G(x)$ is the gating function:

$$
G(x) = \text{TopK}(\text{softmax}(xW_g))
$$

### 4.2 Pre-training

#### Training Objective

**Next-token prediction (autoregressive)**:

$$
\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})
$$
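A minimal PyTorch sketch of this objective; the decoder itself is omitted, and random logits stand in for a real model's output:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Autoregressive cross-entropy: predict token t+1 from tokens <= t.

    logits: (batch, seq_len, vocab) output of a decoder-only model
    tokens: (batch, seq_len) input token ids
    """
    # Shift so that position t predicts the token at position t+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = tokens[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )

# Toy usage with random logits standing in for a model forward pass
B, T, V = 2, 16, 100
loss = next_token_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)))
```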

defect density (d0),defect density,d0,manufacturing

Defects per unit area.

defect density map, yield enhancement

Defect density maps visualize the spatial distribution of defects across wafers, identifying problem regions.

defect density map,metrology

Spatial distribution of defects.

defect density model, yield enhancement

Defect density models such as Poisson and negative binomial relate defect counts to die area, predicting yield from defect measurements.
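A minimal sketch of the two models named above; the $D_0$ and die-area values are illustrative:

```python
import math

def yield_poisson(d0: float, area: float) -> float:
    """Poisson yield model: Y = exp(-D0 * A)."""
    return math.exp(-d0 * area)

def yield_negative_binomial(d0: float, area: float, alpha: float) -> float:
    """Negative binomial model: Y = (1 + D0*A/alpha)^(-alpha).

    alpha captures defect clustering; as alpha -> infinity this
    approaches the Poisson model.
    """
    return (1.0 + d0 * area / alpha) ** (-alpha)

# Illustrative numbers: D0 = 0.1 defects/cm^2, die area = 6 cm^2
print(yield_poisson(0.1, 6.0))                # ~0.55
print(yield_negative_binomial(0.1, 6.0, 2.0)) # ~0.59 (clustering helps yield)
```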

defect density,production

Number of defects per unit area; directly affects yield.

defect detection and correction, quality

Find and fix failure causes.

defect inspection,metrology

Automated optical or e-beam inspection to find particles, scratches, and other defects.

defect pareto, quality

Rank defect types by frequency.

defect pareto, yield enhancement

Defect Pareto analysis ranks defect types by frequency to prioritize yield improvement efforts on the most impactful defect categories.

defect part per million (dppm),defect part per million,dppm,quality

Number of defective parts per million shipped; a quality metric for outgoing product.

defect rate, quality

Fraction of units with defects.

defect review, metrology

High-resolution follow-up of detected defects.

defect review, yield enhancement

Defect review classifies detected defects by type, size, and source, enabling targeted process improvements.

defect source analysis, dsa, metrology

Trace defects to source.

defect vs defective, quality

A defect is a flaw; a defective is a unit that has one or more flaws.

defect waste, manufacturing operations

Defect waste produces nonconforming products requiring rework or scrap.

defect waste, production

Making defective products.

defect-level prediction, advanced test & probe

Defect-level prediction estimates outgoing defect rate from fault coverage and defect density models.
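One widely used formula for this is the Williams-Brown model; a minimal sketch with illustrative numbers:

```python
def defect_level_ppm(yield_: float, fault_coverage: float) -> float:
    """Williams-Brown model: DL = 1 - Y^(1 - T).

    yield_: process yield Y in (0, 1]
    fault_coverage: test fault coverage T in [0, 1]
    Returns the estimated escape rate in parts per million.
    """
    dl = 1.0 - yield_ ** (1.0 - fault_coverage)
    return dl * 1e6

# Illustrative: 80% yield, 99% fault coverage -> ~2230 ppm escapes
print(defect_level_ppm(0.80, 0.99))
```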

defects per million opportunities, dpmo, quality

Number of defects per million opportunities; a Six Sigma quality metric.

defense in depth,ai safety

Layer multiple safety mechanisms for robustness.

definite description resolution, nlp

Resolve "the X" references.

definitive screening design, dsd, doe

Three-level screening in few runs.

deflashing, packaging

Remove molding flash.

deformable alignment, video understanding

Align with learned deformations.

deformable attention, transformer

Learn offset locations for attention.

deformable convolution, computer vision

Adapt sampling locations geometrically.

deformable models,computer vision

Models that can deform to fit data.

deformation field, multimodal ai

Deformation fields warp canonical representations to model non-rigid motion.

degenerate doping, device physics

Doping so heavy that the Fermi level enters the band.

degraded failure analysis, reliability

Analyze partially degraded units.

degraded mode, manufacturing operations

Degraded mode operation continues at reduced capacity or performance during partial failure.

deit (data-efficient image transformer),deit,data-efficient image transformer,computer vision

Train ViT efficiently with distillation.

delay fault,testing

Signal arrives too late.

delay test, advanced test & probe

Delay testing applies transition patterns to detect excessive path delays caused by parametric variations or defects.

delegation pattern,multi-agent

Main agent delegates subtasks to specialized sub-agents.

delimiter-based protection, ai safety

Use special markers to separate trusted instructions from untrusted input.

delta lake,acid,table

Delta Lake adds ACID transactions to data lakes, along with time travel and schema enforcement.
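A minimal PySpark sketch of the write-then-time-travel flow, assuming the delta-spark package is installed; the table path is illustrative:

```python
from pyspark.sql import SparkSession

# Standard Delta Lake session configuration for Spark
spark = (SparkSession.builder
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

path = "/tmp/events_delta"  # illustrative table path

# ACID write: concurrent readers never see a partial commit
df = spark.range(100).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save(path)

# Time travel: read the table as of an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
```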

delta-i noise, signal & power integrity

Delta-I noise results from rapid current changes flowing through power-distribution inductance, causing voltage transients on supply rails.
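With illustrative values, the governing relation $V = L\,dI/dt$ gives:

$$
V = L \frac{dI}{dt} \approx 1\,\text{nH} \times \frac{10\,\text{A}}{1\,\text{ns}} = 10\,\text{V}
$$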

demand control ventilation, environmental & sustainability

Demand control ventilation modulates outdoor air intake based on occupancy, reducing conditioning loads.

demand forecasting, supply chain & logistics

Demand forecasting predicts future material requirements based on production schedules and market trends.

demo,prototype,poc

Build demos quickly to validate ideas; prototypes gather feedback; a POC proves feasibility before the full build.

democratic co-learning, advanced training

Democratic co-learning trains multiple diverse models that teach each other through weighted voting on unlabeled examples.

demographic parity, evaluation

Demographic parity requires equal positive prediction rates across groups.

demographic parity,equal outcome,fair

Demographic parity: equal positive rates across groups. May conflict with accuracy. Choose metric carefully.

demographic parity,fairness

Model outputs same distribution for all demographic groups.
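A minimal check of this criterion on toy data, with binary predictions assumed:

```python
import numpy as np

def demographic_parity_gap(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Max difference in positive prediction rate across groups.

    y_pred: binary predictions (0/1)
    group:  group label per example
    Returns 0.0 under perfect demographic parity.
    """
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

# Toy data: two groups with different positive rates
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 1])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
print(demographic_parity_gap(y_pred, group))  # 0.75 - 0.25 = 0.5
```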

demonstration retrieval, prompting techniques

Demonstration retrieval searches example databases for relevant few-shot demonstrations.
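A common implementation, sketched here with random vectors standing in for a real encoder's embeddings, retrieves the k nearest examples by cosine similarity:

```python
import numpy as np

def retrieve_demonstrations(query_emb: np.ndarray,
                            example_embs: np.ndarray,
                            k: int = 4) -> np.ndarray:
    """Return indices of the k examples most similar to the query.

    query_emb:    (d,) embedding of the new input
    example_embs: (N, d) embeddings of the demonstration pool
    """
    # Cosine similarity = dot product of L2-normalized vectors
    q = query_emb / np.linalg.norm(query_emb)
    E = example_embs / np.linalg.norm(example_embs, axis=1, keepdims=True)
    sims = E @ q
    return np.argsort(-sims)[:k]

# Toy pool of 10 random "embeddings" standing in for encoder output
rng = np.random.default_rng(0)
pool = rng.normal(size=(10, 64))
print(retrieve_demonstrations(rng.normal(size=64), pool, k=4))
```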

demonstration selection, prompting techniques

Demonstration selection chooses optimal few-shot examples maximizing task performance.

demonstration selection,prompt engineering

Choose best in-context examples for few-shot.

dendritic growth, reliability

Conductive filament formation.