darts, darts, neural architecture search
Differentiable Architecture Search enables gradient-based NAS by relaxing the discrete architecture search space into continuous representations.
3,145 technical terms and definitions
Differentiable Architecture Search enables gradient-based NAS by relaxing the discrete architecture search space into continuous representations.
Data anonymization removes identifying information preventing re-identification.
Privacy-preserving augmentation creates variations without exposing original data.
Transform existing data (paraphrase translate corrupt) to increase training diversity.
Groups of data appearing together.
When model exposes training data or sensitive information.
Data minimization collects only necessary information reducing privacy risks.
Data mix balances domains (web, books, code). Proportions affect model capabilities. Code improves reasoning.
Combine different data sources.
Impact of data sequence.
Data parallel: same model, different data. Model parallel: split model across GPUs. Hybrid for largest models.
Replicate model on each device process different data batches.
Inject malicious data into training set to corrupt model behavior.
Data poisoning injects malicious samples into training data. Degrades or manipulates model behavior.
Ratio of different data types.
Revisit earlier training data.
Data retention policies specify how long data is stored before deletion.
Randomize order across distributed system.
Limited by available data.
Distill without access to original data.
Split datasets across nodes.
LLM quality depends heavily on data: diverse, large, deduplicated, filtered, and domain-relevant corpora produce stronger and more robust models.
Deterministic sampling for diffusion.
Deterministic diffusion sampling.
# Semiconductor Manufacturing: Dielectric Deposition Process (DDP) Modeling ## Overview **DDP (Dielectric Deposition Process)** refers to the set of techniques used to deposit insulating films in semiconductor fabrication. Dielectric materials serve critical functions: - **Gate dielectrics** — $\text{SiO}_2$, high-$\kappa$ materials like $\text{HfO}_2$ - **Interlayer dielectrics (ILD)** — isolating metal interconnect layers - **Spacer dielectrics** — defining transistor gate dimensions - **Passivation layers** — protecting finished devices - **Hard masks** — etch selectivity during patterning ## Dielectric Deposition Methods ### Primary Techniques | Method | Full Name | Temperature Range | Typical Applications | |--------|-----------|-------------------|---------------------| | **PECVD** | Plasma-Enhanced CVD | $200-400°C$ | $\text{SiO}_2$, $\text{SiN}_x$ for ILD, passivation | | **LPCVD** | Low-Pressure CVD | $400-800°C$ | High-quality $\text{Si}_3\text{N}_4$, poly-Si | | **HDPCVD** | High-Density Plasma CVD | $300-450°C$ | Gap-fill for trenches and vias | | **ALD** | Atomic Layer Deposition | $150-350°C$ | Ultra-thin gate dielectrics ($\text{HfO}_2$, $\text{Al}_2\text{O}_3$) | | **Thermal Oxidation** | — | $800-1200°C$ | Gate oxide ($\text{SiO}_2$) | | **Spin-on** | SOG/SOD | $100-400°C$ | Planarization layers | ### Selection Criteria - **Conformality requirements** — ALD > LPCVD > PECVD - **Thermal budget** — PECVD/ALD for low-$T$, thermal oxidation for high-quality - **Throughput** — CVD methods faster than ALD - **Film quality** — Thermal > LPCVD > PECVD generally ## Physics of Dielectric Deposition Modeling ### Fundamental Transport Equations Modeling dielectric deposition requires solving coupled partial differential equations for mass, momentum, and energy transport. #### Mass Transport (Species Concentration) $$ \frac{\partial C}{\partial t} + \nabla \cdot (\mathbf{v}C) = D\nabla^2 C + R $$ Where: - $C$ — species concentration $[\text{mol/m}^3]$ - $\mathbf{v}$ — velocity field $[\text{m/s}]$ - $D$ — diffusion coefficient $[\text{m}^2/\text{s}]$ - $R$ — reaction rate $[\text{mol/m}^3 \cdot \text{s}]$ #### Energy Balance $$ \rho C_p \left(\frac{\partial T}{\partial t} + \mathbf{v} \cdot \nabla T\right) = k\nabla^2 T + Q $$ Where: - $\rho$ — density $[\text{kg/m}^3]$ - $C_p$ — specific heat capacity $[\text{J/kg} \cdot \text{K}]$ - $k$ — thermal conductivity $[\text{W/m} \cdot \text{K}]$ - $Q$ — heat generation rate $[\text{W/m}^3]$ #### Momentum Balance (Navier-Stokes) $$ \rho\left(\frac{\partial \mathbf{v}}{\partial t} + \mathbf{v} \cdot \nabla \mathbf{v}\right) = -\nabla p + \mu \nabla^2 \mathbf{v} + \rho \mathbf{g} $$ Where: - $p$ — pressure $[\text{Pa}]$ - $\mu$ — dynamic viscosity $[\text{Pa} \cdot \text{s}]$ - $\mathbf{g}$ — gravitational acceleration $[\text{m/s}^2]$ ### Surface Reaction Kinetics #### Arrhenius Rate Expression $$ k = A \exp\left(-\frac{E_a}{RT}\right) $$ Where: - $k$ — rate constant - $A$ — pre-exponential factor - $E_a$ — activation energy $[\text{J/mol}]$ - $R$ — gas constant $= 8.314 \, \text{J/mol} \cdot \text{K}$ - $T$ — temperature $[\text{K}]$ #### Langmuir Adsorption Isotherm (for ALD) $$ \theta = \frac{K \cdot p}{1 + K \cdot p} $$ Where: - $\theta$ — fractional surface coverage $(0 \leq \theta \leq 1)$ - $K$ — equilibrium adsorption constant - $p$ — partial pressure of adsorbate #### Sticking Coefficient $$ S = S_0 \cdot (1 - \theta)^n \cdot \exp\left(-\frac{E_a}{RT}\right) $$ Where: - $S$ — sticking coefficient (probability of adsorption) - $S_0$ — initial sticking coefficient - $n$ — reaction order ### Plasma Modeling (PECVD/HDPCVD) #### Electron Energy Distribution Function (EEDF) For non-Maxwellian plasmas, the Druyvesteyn distribution: $$ f(\varepsilon) = C \cdot \varepsilon^{1/2} \exp\left(-\left(\frac{\varepsilon}{\bar{\varepsilon}}\right)^2\right) $$ Where: - $\varepsilon$ — electron energy $[\text{eV}]$ - $\bar{\varepsilon}$ — mean electron energy - $C$ — normalization constant #### Ion Bombardment Energy $$ E_{ion} = e \cdot V_{sheath} + \frac{1}{2}m_{ion}v_{Bohm}^2 $$ Where: - $V_{sheath}$ — plasma sheath voltage - $v_{Bohm} = \sqrt{\frac{k_B T_e}{m_{ion}}}$ — Bohm velocity #### Radical Generation Rate $$ R_{radical} = n_e \cdot n_{gas} \cdot \langle \sigma v \rangle $$ Where: - $n_e$ — electron density $[\text{m}^{-3}]$ - $n_{gas}$ — neutral gas density - $\langle \sigma v \rangle$ — rate coefficient (energy-averaged cross-section × velocity) ## Feature-Scale Modeling ### Critical Phenomena in High Aspect Ratio Structures Modern semiconductor devices require filling trenches and vias with aspect ratios (AR) exceeding 50:1. #### Knudsen Number $$ Kn = \frac{\lambda}{d} $$ Where: - $\lambda$ — mean free path of gas molecules - $d$ — characteristic feature dimension | Regime | Knudsen Number | Transport Type | |--------|---------------|----------------| | Continuum | $Kn < 0.01$ | Viscous flow | | Slip | $0.01 < Kn < 0.1$ | Transition | | Transition | $0.1 < Kn < 10$ | Mixed | | Free molecular | $Kn > 10$ | Ballistic/Knudsen | #### Mean Free Path Calculation $$ \lambda = \frac{k_B T}{\sqrt{2} \pi d_m^2 p} $$ Where: - $d_m$ — molecular diameter $[\text{m}]$ - $p$ — pressure $[\text{Pa}]$ ### Step Coverage Model $$ SC = \frac{t_{sidewall}}{t_{top}} \times 100\% $$ For diffusion-limited deposition: $$ SC \approx \frac{1}{\sqrt{1 + AR^2}} $$ For reaction-limited deposition: $$ SC \approx 1 - \frac{S \cdot AR}{2} $$ Where: - $S$ — sticking coefficient - $AR$ — aspect ratio = depth/width ### Void Formation Criterion Void formation occurs when: $$ \frac{d(thickness_{sidewall})}{dz} > \frac{w(z)}{2 \cdot t_{total}} $$ Where: - $w(z)$ — feature width at depth $z$ - $t_{total}$ — total deposition time ## Film Properties to Model ### Structural Properties - **Thickness uniformity**: $$ U = \frac{t_{max} - t_{min}}{t_{max} + t_{min}} \times 100\% $$ - **Film stress** (Stoney equation): $$ \sigma_f = \frac{E_s t_s^2}{6(1-\nu_s)t_f} \cdot \frac{1}{R} $$ Where: - $E_s$, $\nu_s$ — substrate Young's modulus and Poisson ratio - $t_s$, $t_f$ — substrate and film thickness - $R$ — radius of curvature - **Density from refractive index** (Lorentz-Lorenz): $$ \frac{n^2 - 1}{n^2 + 2} = \frac{4\pi}{3} N \alpha $$ Where $N$ is molecular density and $\alpha$ is polarizability ### Electrical Properties - **Dielectric constant** (capacitance method): $$ \kappa = \frac{C \cdot t}{\varepsilon_0 \cdot A} $$ - **Breakdown field**: $$ E_{BD} = \frac{V_{BD}}{t} $$ - **Leakage current density** (Fowler-Nordheim tunneling): $$ J = \frac{q^3 E^2}{8\pi h \phi_B} \exp\left(-\frac{8\pi\sqrt{2m^*}\phi_B^{3/2}}{3qhE}\right) $$ Where: - $E$ — electric field - $\phi_B$ — barrier height - $m^*$ — effective electron mass ## Multiscale Modeling Hierarchy ### Scale Linking Framework ``` ┌─────────────────────────────────────────────────────────────────────┐ │ ATOMISTIC (Å-nm) MESOSCALE (nm-μm) CONTINUUM │ │ ───────────────── ────────────────── (μm-mm) │ │ ────────── │ │ • DFT calculations • Kinetic Monte Carlo • CFD │ │ • Molecular Dynamics • Level-set methods • FEM │ │ • Ab initio MD • Cellular automata • TCAD │ │ │ │ Outputs: Outputs: Outputs: │ │ • Binding energies • Film morphology • Flow │ │ • Reaction barriers • Growth rate • T, C │ │ • Diffusion coefficients • Surface roughness • Profiles │ └─────────────────────────────────────────────────────────────────────┘ ``` ### DFT Calculations Solve the Kohn-Sham equations: $$ \left[-\frac{\hbar^2}{2m}\nabla^2 + V_{eff}(\mathbf{r})\right]\psi_i(\mathbf{r}) = \varepsilon_i \psi_i(\mathbf{r}) $$ Where: $$ V_{eff} = V_{ext} + V_H + V_{xc} $$ - $V_{ext}$ — external potential (nuclei) - $V_H$ — Hartree potential (electron-electron) - $V_{xc}$ — exchange-correlation potential ### Kinetic Monte Carlo (kMC) Event selection probability: $$ P_i = \frac{k_i}{\sum_j k_j} $$ Time advancement: $$ \Delta t = -\frac{\ln(r)}{\sum_j k_j} $$ Where $r$ is a random number $\in (0,1]$ ## Specific Process Examples ### PECVD $\text{SiO}_2$ from TEOS #### Overall Reaction $$ \text{Si(OC}_2\text{H}_5\text{)}_4 + 12\text{O}^* \xrightarrow{\text{plasma}} \text{SiO}_2 + 8\text{CO}_2 + 10\text{H}_2\text{O} $$ #### Key Process Parameters | Parameter | Typical Range | Effect | |-----------|--------------|--------| | RF Power | $100-1000 \, \text{W}$ | ↑ Power → ↑ Density, ↓ Dep rate | | Pressure | $0.5-5 \, \text{Torr}$ | ↑ Pressure → ↑ Dep rate, ↓ Conformality | | Temperature | $300-400°C$ | ↑ Temp → ↑ Density, ↓ H content | | TEOS:O₂ ratio | $1:5$ to $1:20$ | Affects stoichiometry, quality | #### Deposition Rate Model $$ R_{dep} = k_0 \cdot p_{TEOS}^a \cdot p_{O_2}^b \cdot \exp\left(-\frac{E_a}{RT}\right) $$ Typical values: $a \approx 0.5$, $b \approx 0.3$, $E_a \approx 0.3 \, \text{eV}$ ### ALD High-$\kappa$ Dielectrics ($\text{HfO}_2$) #### Half-Reactions **Cycle A (Metal precursor):** $$ \text{Hf(N(CH}_3\text{)}_2\text{)}_4\text{(g)} + \text{*-OH} \rightarrow \text{*-O-Hf(N(CH}_3\text{)}_2\text{)}_3 + \text{HN(CH}_3\text{)}_2 $$ **Cycle B (Oxidizer):** $$ \text{*-O-Hf(N(CH}_3\text{)}_2\text{)}_3 + 2\text{H}_2\text{O} \rightarrow \text{*-O-Hf(OH)}_3 + 3\text{HN(CH}_3\text{)}_2 $$ #### Growth Per Cycle (GPC) $$ \text{GPC} = \frac{\theta_{sat} \cdot \rho_{site} \cdot M_{HfO_2}}{\rho_{HfO_2} \cdot N_A} $$ Typical GPC for $\text{HfO}_2$: $0.8-1.2 \, \text{Å/cycle}$ #### ALD Window ``` ┌────────────────────────────┐ GPC │ ┌──────────────┐ │ (Å/ │ /│ │\ │ cycle) │ / │ ALD │ \ │ │ / │ WINDOW │ \ │ │ / │ │ \ │ │/ │ │ \ │ └─────┴──────────────┴─────┴─┘ T_min T_max Temperature (°C) ``` Below $T_{min}$: Condensation, incomplete reactions Above $T_{max}$: Precursor decomposition, CVD-like behavior ### HDPCVD Gap Fill #### Deposition-Etch Competition Net deposition rate: $$ R_{net}(z) = R_{dep}(\theta) - R_{etch}(E_{ion}, \theta) $$ Where: - $R_{dep}(\theta)$ — angular-dependent deposition rate - $R_{etch}$ — ion-enhanced etch rate - $\theta$ — angle from surface normal #### Sputter Yield (Yamamura Formula) $$ Y(E, \theta) = Y_0(E) \cdot f(\theta) $$ Where: $$ f(\theta) = \cos^{-f}\theta \cdot \exp\left[-\Sigma(\cos^{-1}\theta - 1)\right] $$ ## Machine Learning Applications ### Virtual Metrology **Objective:** Predict film properties from in-situ sensor data without destructive measurement. $$ \hat{y} = f_{ML}(\mathbf{x}_{sensors}, \mathbf{x}_{recipe}) $$ Where: - $\hat{y}$ — predicted property (thickness, stress, etc.) - $\mathbf{x}_{sensors}$ — OES, pressure, RF power signals - $\mathbf{x}_{recipe}$ — setpoints and timing ### Gaussian Process Regression $$ y(\mathbf{x}) \sim \mathcal{GP}\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\right) $$ Posterior mean prediction: $$ \mu(\mathbf{x}^*) = \mathbf{k}^T(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y} $$ Uncertainty quantification: $$ \sigma^2(\mathbf{x}^*) = k(\mathbf{x}^*, \mathbf{x}^*) - \mathbf{k}^T(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{k} $$ ### Bayesian Optimization for Recipe Development **Acquisition function** (Expected Improvement): $$ \text{EI}(\mathbf{x}) = \mathbb{E}\left[\max(f(\mathbf{x}) - f^+, 0)\right] $$ Where $f^+$ is the best observed value. ## Advanced Node Challenges (Sub-5nm) ### Critical Challenges | Challenge | Technical Details | Modeling Complexity | |-----------|------------------|---------------------| | **Ultra-high AR** | 3D NAND: 100+ layers, AR > 50:1 | Knudsen transport, ballistic modeling | | **Atomic precision** | Gate dielectrics: 1-2 nm | Monolayer-level control, quantum effects | | **Low-$\kappa$ integration** | $\kappa < 2.5$ porous films | Mechanical integrity, plasma damage | | **Selective deposition** | Area-selective ALD | Nucleation control, surface chemistry | | **Thermal budget** | BEOL: $< 400°C$ | Kinetic limitations, precursor chemistry | ### Equivalent Oxide Thickness (EOT) For high-$\kappa$ gate stacks: $$ \text{EOT} = t_{IL} + \frac{\kappa_{SiO_2}}{\kappa_{high-k}} \cdot t_{high-k} $$ Where: - $t_{IL}$ — interfacial layer thickness - $\kappa_{SiO_2} = 3.9$ - Typical high-$\kappa$: $\kappa_{HfO_2} \approx 20-25$ ### Low-$\kappa$ Dielectric Design Effective dielectric constant: $$ \kappa_{eff} = \kappa_{matrix} \cdot (1 - p) + \kappa_{air} \cdot p $$ Where $p$ is porosity fraction. Target for advanced nodes: $\kappa_{eff} < 2.0$ ## Tools and Software ### Commercial TCAD - **Synopsys Sentaurus Process** — full process simulation - **Silvaco Victory Process** — alternative TCAD suite - **Lam Research SEMulator3D** — 3D topography simulation ### Multiphysics Platforms - **COMSOL Multiphysics** — coupled PDE solving - **Ansys Fluent** — CFD for reactor design - **Ansys CFX** — alternative CFD solver ### Specialized Tools - **CHEMKIN** (Ansys) — gas-phase reaction kinetics - **Reaction Design** — combustion and plasma chemistry - **Custom Monte Carlo codes** — feature-scale simulation ### Open Source Options - **OpenFOAM** — CFD framework - **LAMMPS** — molecular dynamics - **Quantum ESPRESSO** — DFT calculations - **SPARTA** — DSMC for rarefied gas dynamics ## Summary Dielectric deposition modeling in semiconductor manufacturing integrates: 1. **Transport phenomena** — mass, momentum, energy conservation 2. **Reaction kinetics** — surface and gas-phase chemistry 3. **Plasma physics** — for PECVD/HDPCVD processes 4. **Feature-scale physics** — conformality, void formation 5. **Multiscale approaches** — atomistic to continuum 6. **Machine learning** — for optimization and virtual metrology The goal is predicting and optimizing film properties based on process parameters while accounting for the extreme topography of modern semiconductor devices.
Original diffusion formulation.
Design novel drug candidates from scratch.
Find unused code.
Dead code elimination removes unused operations from computation graphs.
AIs debate to find truth.
AI debate has models argue opposing positions enabling human judgment of complex questions.
Disentangled attention and enhanced decoding.
Methods to remove bias.
Methods to reduce unwanted biases (data augmentation INLP CDA).
Approximate neural network with tree.
Architecture with only decoder blocks (GPT for generation).
Reconstruct inputs from features.
Early failures declining.
Synthesize from formal specifications.
Neural network with CORAL loss.
Ensemble of independently trained neural networks.
Techniques for training 50+ layer ViTs.
DeepAR is an autoregressive recurrent network that learns probabilistic forecasts across multiple related time series using Monte Carlo sampling.
DeepEval provides unit tests for LLMs. Assert on metrics. CI integration.
Find minimal perturbation to decision boundary.
Decompose predictions to input features.
Microsoft's training optimization library.
Random walk-based graph embeddings.
# The Full Stack AI Build: A Comprehensive Analysis
## Overview
The 5-Layer AI Stack represents the complete vertical integration required to build frontier AI systems:
$$
\text{Energy} \rightarrow \text{Chips} \rightarrow \text{Infrastructure} \rightarrow \text{Models} \rightarrow \text{Applications}
$$
## Layer 1: Energy (Electricity)
The foundational constraint upon which all other layers depend.
### Key Metrics
- **Training Energy Consumption**: A frontier LLM requires approximately $50-100+ \text{ GWh}$
- **Power Usage Effectiveness (PUE)**:
$$
\text{PUE} = \frac{\text{Total Facility Energy}}{\text{IT Equipment Energy}}
$$
- **Target PUE**: $\text{PUE} \leq 1.2$ for modern AI data centers
### Critical Concerns
- Reliability and uptime requirements: $99.999\%$ availability
- Cost optimization: USD per kWh directly impacts training costs
- Carbon footprint: $\text{kg CO}_2/\text{kWh}$ varies by source
- Geographic availability constraints
### Energy Cost Model
$$
C_{\text{energy}} = P_{\text{peak}} \times T_{\text{training}} \times \text{PUE} \times \text{Cost}_{\text{kWh}}
$$
Where:
- $P_{\text{peak}}$ = Peak power consumption (MW)
- $T_{\text{training}}$ = Training duration (hours)
- $\text{Cost}_{\text{kWh}}$ = Energy cost per kilowatt-hour
## Layer 2: Chips
The computational substrate transforming electricity into useful operations.
### 2.1 Design Chips
#### Architecture Decisions
- **GPU vs TPU vs Custom ASIC**: Trade-offs in flexibility vs efficiency
- **Core compute unit**: Matrix multiplication engines
$$
C = A \times B \quad \text{where } A \in \mathbb{R}^{m \times k}, B \in \mathbb{R}^{k \times n}
$$
- **Operations count**:
$$
\text{FLOPs}_{\text{matmul}} = 2 \times m \times n \times k
$$
#### Memory Bandwidth Optimization
The real bottleneck for transformer models:
$$
\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Accessed}}
$$
For transformers:
$$
\text{AI}_{\text{attention}} = \frac{2 \times n^2 \times d}{n^2 + 2nd} \approx O(d) \text{ (memory bound)}
$$
#### Numerical Precision Trade-offs
| Precision | Bits | Dynamic Range | Use Case |
|-----------|------|---------------|----------|
| FP32 | 32 | $\pm 3.4 \times 10^{38}$ | Reference |
| FP16 | 16 | $\pm 65504$ | Training |
| BF16 | 16 | $\pm 3.4 \times 10^{38}$ | Training |
| FP8 | 8 | $\pm 448$ (E4M3) | Inference |
| INT8 | 8 | $[-128, 127]$ | Quantized inference |
| INT4 | 4 | $[-8, 7]$ | Extreme quantization |
Quantization error bound:
$$
\|W - W_q\|_F \leq \frac{\Delta}{2}\sqrt{n}
$$
Where $\Delta$ is the quantization step size.
#### Power Efficiency
$$
\text{Efficiency} = \frac{\text{FLOPS}}{\text{Watt}} \quad [\text{FLOPS/W}]
$$
Modern targets:
- Training: $> 300 \text{ TFLOPS/chip}$ at $< 700\text{W}$
- Inference: $> 1000 \text{ TOPS/W}$ (INT8)
### 2.2 Wafer Fabrication
#### Process Technology
- **Transistor density**:
$$
D = \frac{N_{\text{transistors}}}{A_{\text{die}}} \quad [\text{transistors/mm}^2]
$$
- **Node progression**: $7\text{nm} \rightarrow 5\text{nm} \rightarrow 3\text{nm} \rightarrow 2\text{nm}$
#### Yield Optimization
Defect density model (Poisson):
$$
Y = e^{-D_0 \times A}
$$
Where:
- $Y$ = Yield (probability of good die)
- $D_0$ = Defect density (defects/cm²)
- $A$ = Die area (cm²)
#### Advanced Packaging
- **CoWoS** (Chip-on-Wafer-on-Substrate)
- **Chiplets**: Disaggregated design
- **3D Stacking**: HBM memory integration
Bandwidth scaling:
$$
\text{BW}_{\text{HBM3}} = N_{\text{stacks}} \times \text{BW}_{\text{per\_stack}} \approx 6 \times 819 = 4.9 \text{ TB/s}
$$
## Layer 3: Infrastructure
Systems layer orchestrating chips into usable compute.
### 3.1 Build AI Infrastructure
#### Cluster Architecture
- **Nodes per cluster**: $N_{\text{nodes}} = 1000-10000+$
- **GPUs per node**: $G_{\text{per\_node}} = 8$ (typical)
- **Total GPUs**:
$$
G_{\text{total}} = N_{\text{nodes}} \times G_{\text{per\_node}}
$$
#### Network Topology
**Fat-tree bandwidth**:
$$
\text{Bisection BW} = \frac{N \times \text{BW}_{\text{link}}}{2}
$$
**All-reduce communication cost**:
$$
T_{\text{all-reduce}} = 2(n-1) \times \frac{M}{n \times \text{BW}} + 2(n-1) \times \alpha
$$
Where:
- $M$ = Message size
- $n$ = Number of participants
- $\alpha$ = Latency per hop
#### Storage Requirements
Training data storage:
$$
S_{\text{data}} = N_{\text{tokens}} \times \text{bytes\_per\_token} \times \text{redundancy}
$$
For 10T tokens:
$$
S \approx 10^{13} \times 2 \times 3 = 60 \text{ PB}
$$
#### Reliability Engineering
**Checkpointing overhead**:
$$
\text{Overhead} = \frac{T_{\text{checkpoint}}}{T_{\text{checkpoint\_interval}}}
$$
**Mean Time Between Failures (MTBF)** for cluster:
$$
\text{MTBF}_{\text{cluster}} = \frac{\text{MTBF}_{\text{component}}}{N_{\text{components}}}
$$
## Layer 4: Models (LLMs)
Where core AI capability emerges.
### 4.1 Build Large Language Models
#### Transformer Architecture
**Self-attention mechanism**:
$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$
Where:
- $Q = XW_Q \in \mathbb{R}^{n \times d_k}$ (Queries)
- $K = XW_K \in \mathbb{R}^{n \times d_k}$ (Keys)
- $V = XW_V \in \mathbb{R}^{n \times d_v}$ (Values)
**Multi-Head Attention (MHA)**:
$$
\text{MHA}(X) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W_O
$$
$$
\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)
$$
**Grouped Query Attention (GQA)**:
Reduces KV cache by factor $g$:
$$
\text{KV\_cache} = 2 \times L \times n \times d \times \frac{h}{g} \times \text{bytes}
$$
#### Feed-Forward Network
$$
\text{FFN}(x) = \text{GELU}(xW_1 + b_1)W_2 + b_2
$$
SwiGLU variant:
$$
\text{SwiGLU}(x) = (\text{Swish}(xW_1) \odot xW_3)W_2
$$
#### Model Parameter Count
For decoder-only transformer:
$$
P = 12 \times L \times d^2 + V \times d
$$
Where:
- $L$ = Number of layers
- $d$ = Model dimension
- $V$ = Vocabulary size
#### Mixture of Experts (MoE)
$$
y = \sum_{i=1}^{N} G(x)_i \cdot E_i(x)
$$
Where $G(x)$ is the gating function:
$$
G(x) = \text{TopK}(\text{softmax}(xW_g))
$$
### 4.2 Pre-training
#### Training Objective
**Next-token prediction (autoregressive)**:
$$
\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t | x_{
Defect density models like Poisson and negative binomial relate defect counts to die area predicting yield from defect measurements.