few-shot learning for rare defects, data analysis
Learn from few examples of rare defects.
395 technical terms and definitions
Provide examples in the prompt.
Few-shot prompting supplies several worked examples in the prompt so the model picks up the task through in-context learning.
Generate with minimal steps.
Feed-Forward Equalization uses an FIR filter to pre-emphasize or de-emphasize signal frequencies.
Field-aware Factorization Machines learn separate latent factors for each feature field improving expressiveness.
FFT convolutions compute long kernels efficiently in frequency domain.
FFT convolution performs spatial convolution through frequency domain multiplication for large kernels.
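As a minimal sketch of the idea in the two entries above, the following pure-Python example convolves a short signal with a kernel by multiplying their spectra and checks the result against a direct sum. The function names (`fft`, `fft_convolve`, `direct_convolve`) and the sample values are illustrative assumptions; a production system would normally rely on an optimized FFT library rather than this recursive radix-2 routine.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft_convolve(signal, kernel):
    """Full linear convolution via pointwise multiplication in the frequency domain."""
    out_len = len(signal) + len(kernel) - 1
    size = 1
    while size < out_len:  # zero-pad to a power of two to avoid circular wrap-around
        size *= 2
    a = fft([complex(v) for v in signal] + [0j] * (size - len(signal)))
    b = fft([complex(v) for v in kernel] + [0j] * (size - len(kernel)))
    product = [x * y for x, y in zip(a, b)]
    # The inverse transform above is unnormalized, so divide by size here.
    result = fft(product, inverse=True)
    return [(v / size).real for v in result[:out_len]]

def direct_convolve(signal, kernel):
    """Reference O(N*M) convolution used only to check the FFT result."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

signal = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, -1.0, 0.25]
fast = fft_convolve(signal, kernel)
slow = direct_convolve(signal, kernel)
```

The FFT route costs roughly O(N log N) instead of O(N·M), which is why it pays off mainly for long kernels.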
Ceiling units that provide filtered laminar air flow in cleanroom.
Single-step gradient-based attack.
Fast Gradient Sign Method generates adversarial examples using gradient sign to maximize loss.
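The single-step update can be sketched on a tiny logistic-regression classifier, where the input gradient has a closed form. Everything here (the weights, the input, `epsilon`, and the helper names) is a hypothetical illustration, not a reference implementation for any particular framework.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """Return x + epsilon * sign(dL/dx), the one-step FGSM perturbation."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # For logistic regression with cross-entropy loss, dL/dx = (p - y) * w.
    grad_x = [(p - y) * wi for wi in w]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad_x)]

# Hypothetical model parameters and input, chosen only for illustration.
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.4, -0.3, 0.8], 1
x_adv = fgsm(w, b, x, y, epsilon=0.1)
clean_loss = bce_loss(w, b, x, y)
adv_loss = bce_loss(w, b, x_adv, y)
```

Each input component moves by exactly `epsilon` in the direction that increases the loss, so the perturbation is bounded in the max norm.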
Use ion beam to cut cross-sections or edit circuits for FA.
Measure distribution similarity.
Alignment references on board.
Failures occurring at customer.
Thick oxide for isolation (older technology).
Learn from actual failures.
FIFOs buffer data transfers between asynchronous clock domains preventing data loss.
First-in-first-out dispatching processes lots in arrival order.
First-in-first-out lanes control WIP between processes without pull signals maintaining sequence.
Understand metaphors and idioms.
Fill-in-the-middle models complete code using the context on both sides of the gap, which often suits editing tasks better than left-to-right completion.
Add dummy features for CMP uniformity.
Fill rate measures order fulfillment completeness as percentage of requested items delivered.
Generate code for middle section given surrounding context.
Silica particles in EMC.
Percentage of filler.
Internal tensile or compressive stress in deposited film.
Users only see similar content.
Normalization without batch statistics.
Data filtering removes low-quality text. Heuristics (length, language) or trained classifiers. Clean data = better models.
Fin efficiency measures the effectiveness of heat sink fins in transferring heat accounting for temperature gradients along fin length.
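A common textbook approximation, assumed here rather than taken from the entry above, treats a thin straight rectangular fin with an adiabatic tip, giving an efficiency of tanh(mL)/(mL) with m = sqrt(2h/(k·t)). The sketch below uses hypothetical values for the convection coefficient `h`, conductivity `k`, fin thickness, and fin length.

```python
import math

def fin_efficiency(h, k, thickness, length):
    """Efficiency of a thin straight fin with an adiabatic tip (assumed model).

    h: convection coefficient [W/(m^2 K)], k: thermal conductivity [W/(m K)],
    thickness and length in metres.
    """
    m = math.sqrt(2.0 * h / (k * thickness))
    ml = m * length
    return math.tanh(ml) / ml

# Hypothetical aluminium-like fin, 1 mm thick, in moderate forced air.
eta_short = fin_efficiency(h=50.0, k=200.0, thickness=1e-3, length=0.02)
eta_long = fin_efficiency(h=50.0, k=200.0, thickness=1e-3, length=0.06)
```

Longer fins add area but run cooler toward the tip, so their efficiency drops; this is the trade-off a fin optimization has to balance.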
Fin optimization designs heat sink geometry, balancing surface area, material, and pressure drop.
Defect check before shipment.
Yield at final test.
Percentage passing final test.
Test packaged chips for functionality and performance.
Final yield represents the percentage of die passing all tests after complete fabrication reflecting cumulative process quality.
Create financial summaries.
FinBERT is a financial sentiment model trained on financial news.
Fine-tuning APIs (OpenAI, Anthropic) let you customize models without infrastructure. Easy but less control.
Classify entities into detailed types.
Detailed emotional states.
Classify on scale (very negative to very positive).
Small ball pitch.
Very small spacing between connections.
Compare full fine-tuning to linear evaluation.
# Comprehensive Guide: LLM Training, AI Architecture & the Modern AI Stack
## 1. Foundational Hierarchy from Math to Intelligence
### The Conceptual Stack
The AI field builds upon layers of abstraction, each enabling the next:
- **Mathematics & Statistics**
  - Linear algebra (vectors, matrices, tensors)
  - Calculus (gradients, optimization)
  - Probability theory (distributions, Bayesian inference)
  - Information theory (entropy, mutual information)
- **Machine Learning**
  - Statistical learning from data
  - Pattern recognition without explicit programming
  - Core paradigm: learn from examples, not rules
- **Deep Learning**
  - Neural networks with multiple layers
  - Hierarchical feature learning
  - End-to-end differentiable systems
- **Artificial Intelligence**
  - The overarching goal
  - Systems exhibiting intelligent behavior
  - ML/DL are currently the most successful approaches
### Mathematical Foundation
The fundamental learning objective in supervised learning:
$$\min_{\theta} \mathcal{L}(\theta) = \min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(x_i), y_i)$$
Where:
- $\theta$ = model parameters
- $f_\theta$ = model function
- $\ell$ = loss function
- $(x_i, y_i)$ = training examples
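To make the objective concrete, the sketch below evaluates the average loss for a hypothetical one-dimensional linear model with a squared-error loss and reduces it by plain gradient descent. The model form, the toy data, and the learning rate are assumptions chosen for illustration only.

```python
def model(theta, x):
    """Hypothetical linear model f_theta(x) = theta[0] * x + theta[1]."""
    return theta[0] * x + theta[1]

def squared_loss(prediction, target):
    return (prediction - target) ** 2

def empirical_risk(theta, data):
    """Average loss over the N training examples, i.e. L(theta)."""
    return sum(squared_loss(model(theta, x), y) for x, y in data) / len(data)

def gradient_step(theta, data, lr):
    """One gradient-descent update using the analytic gradient of the risk."""
    n = len(data)
    grad_w = sum(2 * (model(theta, x) - y) * x for x, y in data) / n
    grad_b = sum(2 * (model(theta, x) - y) for x, y in data) / n
    return (theta[0] - lr * grad_w, theta[1] - lr * grad_b)

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by y = 2x + 1
theta = (0.0, 0.0)
initial_risk = empirical_risk(theta, data)
for _ in range(2000):
    theta = gradient_step(theta, data, lr=0.05)
final_risk = empirical_risk(theta, data)
```

Deep learning replaces the linear model with a neural network and the hand-written gradient with automatic differentiation, but the objective has the same shape.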
## 2. The Transformer Revolution
### Why Transformers Changed Everything
Before 2017, sequence modeling relied on recurrent architectures:
- **RNNs (Recurrent Neural Networks)**
  - Sequential processing (cannot parallelize)
  - Vanishing gradient problem
  - Limited long-range dependencies
- **LSTMs (Long Short-Term Memory)**
  - Gating mechanisms for memory
  - Better long-range, but still sequential
  - Computationally expensive for long sequences
### The Attention Mechanism
The breakthrough paper "Attention Is All You Need" (Vaswani et al., 2017) introduced:
**Self-Attention Formula:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q$ = Query matrix $\in \mathbb{R}^{n \times d_k}$
- $K$ = Key matrix $\in \mathbb{R}^{n \times d_k}$
- $V$ = Value matrix $\in \mathbb{R}^{n \times d_v}$
- $d_k$ = dimension of keys (scaling factor)
- $n$ = sequence length
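The formula can be traced step by step with a small pure-Python sketch: scores are computed as $QK^T$, scaled by $\sqrt{d_k}$, normalized row-wise with softmax, and used to mix the rows of $V$. The matrices below are made-up values with $n = 3$, $d_k = 2$, $d_v = 3$, and the helper names are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the attention weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))  # n x n matrix of query-key similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
output, weights = attention(Q, K, V)
```

Each output row is a convex combination of the rows of `V`, with the mixing weights determined by how well that position's query matches every key.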
**Multi-Head Attention:**
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
### Key Architectural Components
| Component | Function | Mathematical Operation |
|-----------|----------|----------------------|
| Multi-Head Attention | Multiple parallel attention patterns | $h$ parallel attention functions |
| Positional Encoding | Injects sequence order | $PE_{(pos,2i)} = \sin(pos/10000^{2i/d})$ |
| Feed-Forward Networks | Non-linear transformations | $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ |
| Layer Normalization | Training stability | $\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta$ |
| Residual Connections | Gradient flow | $\text{output} = x + \text{SubLayer}(x)$ |
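As one worked instance of the table, the sinusoidal positional encoding can be sketched as follows. The table lists only the sine term for even indices; the cosine term for odd indices is taken from the original Transformer paper and is an addition here. The function name and the sizes are illustrative.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(seq_len):
        row = []
        for j in range(d_model):
            i = j // 2  # dimensions 2i and 2i+1 share one frequency
            angle = pos / (10000 ** (2 * i / d_model))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(seq_len=4, d_model=6)
```

The resulting `seq_len x d_model` table is added to the token embeddings so that otherwise order-blind attention layers can distinguish positions.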
### BERT vs GPT: Two Paradigms
**BERT (Bidirectional Encoder Representations from Transformers)**
- Architecture: Encoder-only
- Context: Bidirectional (sees left and right)
- Pre-training objectives:
  - **MLM (Masked Language Modeling):**
    $$\mathcal{L}_{\text{MLM}} = -\mathbb{E}\left[\log P(x_{\text{mask}} | x_{\text{context}})\right]$$
  - **NSP (Next Sentence Prediction):**
    $$\mathcal{L}_{\text{NSP}} = -\mathbb{E}\left[\log P(\text{IsNext} | A, B)\right]$$
- Best for: Classification, NER, extractive QA
**GPT (Generative Pre-trained Transformer)**
- Architecture: Decoder-only
- Context: Autoregressive (left-to-right only)
- Pre-training objective:
  $$\mathcal{L}_{\text{GPT}} = -\sum_{t=1}^{T} \log P(x_t | x_{<t})$$
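The autoregressive objective sums the negative log-probability of each token given the tokens before it. A minimal sketch, assuming a hypothetical `next_token_probs` callable that stands in for a trained model:

```python
import math

def autoregressive_nll(tokens, next_token_probs):
    """Sum of -log P(x_t | x_<t) over the sequence.

    next_token_probs(prefix) is a hypothetical callable returning a dict that
    maps each candidate token to its probability given the prefix.
    """
    total = 0.0
    for t, token in enumerate(tokens):
        probs = next_token_probs(tokens[:t])  # the model only sees earlier tokens
        total -= math.log(probs[token])
    return total

def toy_model(prefix):
    # Toy stand-in for a trained model: uniform over a three-token vocabulary.
    return {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

loss = autoregressive_nll(["a", "b", "c", "a"], toy_model)
```

With a uniform model every token costs $\log 3$, so the loss is $4 \log 3$; training lowers it by shifting probability toward the tokens that actually occur.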
Modern transistor architecture with vertical fin channel.
# FinFET: The Three-Dimensional Transistor Revolution

Fin Field-Effect Transistor (FinFET)

## 1. Definition and Overview

A **FinFET** (Fin Field-Effect Transistor) is a non-planar, multi-gate transistor architecture where:

- The conducting channel is formed by a thin silicon "fin" rising vertically from the substrate
- The gate electrode wraps around the fin on **three sides** (top + two sidewalls)
- Superior electrostatic control is achieved compared to planar MOSFETs

The name derives from the **fin-like silicon structure** that protrudes from the substrate, resembling a fish's dorsal fin.

## 2. The Scaling Problem: Why FinFETs Were Needed

### 2.1 Short-Channel Effects in Planar MOSFETs

As transistor gate length $L_g$ scaled below $\sim 25 \text{ nm}$, planar architectures encountered severe degradation:

#### Drain-Induced Barrier Lowering (DIBL)

$$ \text{DIBL} = \frac{V_{th,lin} - V_{th,sat}}{V_{DS,sat} - V_{DS,lin}} \quad \left[\frac{\text{mV}}{\text{V}}\right] $$

Where:

- $V_{th,lin}$ = Threshold voltage at low $V_{DS}$ (linear region)
- $V_{th,sat}$ = Threshold voltage at high $V_{DS}$ (saturation region)
- Ideal DIBL $\approx 0$; problematic when $> 100 \text{ mV/V}$

#### Subthreshold Swing Degradation

The subthreshold swing $SS$ determines how sharply a transistor turns off:

$$ SS = \frac{\partial V_{GS}}{\partial (\log_{10} I_D)} = \ln(10) \cdot \frac{kT}{q} \cdot \left(1 + \frac{C_D}{C_{ox}}\right) $$

Where:

- $k$ = Boltzmann constant ($1.38 \times 10^{-23} \text{ J/K}$)
- $T$ = Temperature (K)
- $q$ = Electron charge ($1.6 \times 10^{-19} \text{ C}$)
- $C_D$ = Depletion capacitance
- $C_{ox}$ = Gate oxide capacitance

**Theoretical minimum at room temperature:**

$$ SS_{min} = \ln(10) \cdot \frac{kT}{q} \approx 60 \text{ mV/decade} \quad \text{(at } T = 300 \text{ K)} $$

#### Threshold Voltage Roll-Off

$$ \Delta V_{th} = V_{th}(L) - V_{th}(L \to \infty) $$

As $L_g \to 0$, $\Delta V_{th}$ becomes increasingly negative (threshold voltage drops unpredictably).
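The short-channel metrics above are easy to evaluate numerically. The sketch below computes the subthreshold swing from $T$ and the ratio $C_D/C_{ox}$, and DIBL from two threshold-voltage measurements. The bias points and threshold voltages are hypothetical example values, not data from a real device.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C

def subthreshold_swing_mv(temperature_k, cd_over_cox=0.0):
    """SS = ln(10) * (kT/q) * (1 + C_D/C_ox), returned in mV/decade."""
    thermal_voltage = BOLTZMANN * temperature_k / ELECTRON_CHARGE
    return 1e3 * math.log(10) * thermal_voltage * (1.0 + cd_over_cox)

def dibl_mv_per_v(vth_lin, vth_sat, vds_lin, vds_sat):
    """DIBL = (Vth,lin - Vth,sat) / (VDS,sat - VDS,lin), returned in mV/V."""
    return 1e3 * (vth_lin - vth_sat) / (vds_sat - vds_lin)

ss_ideal = subthreshold_swing_mv(300.0)                       # about 59.5 mV/decade
ss_degraded = subthreshold_swing_mv(300.0, cd_over_cox=0.5)   # 1.5x the ideal value
# Hypothetical measurement: Vth drops from 0.30 V to 0.26 V between VDS = 0.05 V and 0.80 V.
dibl = dibl_mv_per_v(vth_lin=0.30, vth_sat=0.26, vds_lin=0.05, vds_sat=0.80)
```

A ratio $C_D/C_{ox}$ of 0.5 already pushes the swing to roughly 90 mV/decade, which illustrates why weak gate control is so costly for leakage.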
### 2.2 The Electrostatic Control Problem

In planar MOSFETs, the **natural length** $\lambda$ characterizes gate control:

$$ \lambda = \sqrt{\frac{\epsilon_{Si}}{\epsilon_{ox}} \cdot t_{ox} \cdot t_{Si}} $$

Where:

- $\epsilon_{Si}$ = Silicon permittivity ($11.7 \cdot \epsilon_0$)
- $\epsilon_{ox}$ = Oxide permittivity ($3.9 \cdot \epsilon_0$)
- $t_{ox}$ = Gate oxide thickness
- $t_{Si}$ = Silicon body thickness

**Rule of thumb:** Good short-channel control requires:

$$ L_g \geq 5\lambda \text{ to } 10\lambda $$

## 3. FinFET Architecture

### 3.1 Physical Structure

```
             ┌─────────────────┐
             │      GATE       │
             │  (Metal/Poly)   │
             └────────┬────────┘
                      │
       ┌──────────────┼──────────────┐
       │              │              │
       ▼              ▼              ▼
   ┌────────┐    ┌────────┐    ┌────────┐
   │  GATE  │    │  FIN   │    │  GATE  │
   │ (side) │◄───│  (Si)  │───►│ (side) │
   └────────┘    │        │    └────────┘
                 │ W_fin  │
                 │◄──────►│
                 │        │
                 │ H_fin  │
                 │   ▲    │
                 │   │    │
                 └───┴────┘
                     │
              ───────┴───────
                 SUBSTRATE
```

### 3.2 Key Dimensions

| Parameter | Symbol | Typical Value (Advanced Nodes) |
|-----------|--------|-------------------------------|
| Fin Width | $W_{fin}$ | 5–10 nm |
| Fin Height | $H_{fin}$ | 40–50 nm |
| Fin Pitch | $P_{fin}$ | 20–30 nm |
| Gate Length | $L_g$ | 7–20 nm |
| Equivalent Oxide Thickness | $EOT$ | 0.5–1.0 nm |

### 3.3 Effective Channel Width

The effective width in a FinFET is given by:

$$ W_{eff} = N_{fin} \cdot (2 \cdot H_{fin} + W_{fin}) $$

Where:

- $N_{fin}$ = Number of parallel fins
- $H_{fin}$ = Fin height
- $W_{fin}$ = Fin width (top surface contribution)

**Note:** Often $W_{fin} \ll H_{fin}$, so:

$$ W_{eff} \approx 2 \cdot N_{fin} \cdot H_{fin} $$

### 3.4 Drive Current

The saturation drain current for a FinFET:

$$ I_{D,sat} = \frac{W_{eff}}{L_g} \cdot \mu_{eff} \cdot C_{ox} \cdot \frac{(V_{GS} - V_{th})^2}{2} \cdot (1 + \lambda V_{DS}) $$

Where:

- $\mu_{eff}$ = Effective carrier mobility
- $C_{ox}$ = Gate oxide capacitance per unit area
- $\lambda$ = Channel length modulation parameter (distinct from the natural length $\lambda$ of Section 2.2)

For **velocity-saturated** short-channel devices:

$$ I_{D,sat} \approx W_{eff} \cdot C_{ox} \cdot v_{sat} \cdot (V_{GS} - V_{th}) $$

Where $v_{sat} \approx 10^7 \text{ cm/s}$ for electrons in silicon.

## 4. Electrostatic Advantage of FinFETs

### 4.1 Multi-Gate Scaling Length

For a **tri-gate** FinFET, the natural length becomes:

$$ \lambda_{FinFET} = \sqrt{\frac{\epsilon_{Si}}{3\epsilon_{ox}} \cdot t_{ox} \cdot \frac{W_{fin}}{2}} $$

**Comparison:**

| Architecture | Natural Length $\lambda$ |
|--------------|-------------------------|
| Single-gate (planar) | $\sqrt{\frac{\epsilon_{Si}}{\epsilon_{ox}} \cdot t_{ox} \cdot t_{Si}}$ |
| Double-gate | $\sqrt{\frac{\epsilon_{Si}}{2\epsilon_{ox}} \cdot t_{ox} \cdot t_{Si}}$ |
| Tri-gate (FinFET) | $\sqrt{\frac{\epsilon_{Si}}{3\epsilon_{ox}} \cdot t_{ox} \cdot \frac{W_{fin}}{2}}$ |
| Gate-All-Around | $\sqrt{\frac{\epsilon_{Si}}{4\epsilon_{ox}} \cdot t_{ox} \cdot \frac{d}{4}}$ |

### 4.2 DIBL Improvement

Empirically, FinFETs achieve:

$$ \text{DIBL}_{FinFET} \approx \frac{\text{DIBL}_{planar}}{3 \text{ to } 5} $$

Typical values:

- Planar at 22nm: $\text{DIBL} \approx 150-200 \text{ mV/V}$
- FinFET at 22nm: $\text{DIBL} \approx 30-50 \text{ mV/V}$

### 4.3 Subthreshold Swing

FinFETs approach the theoretical limit more closely:

$$ SS_{FinFET} \approx 65-75 \text{ mV/decade} $$

Compared to planar devices at equivalent nodes:

$$ SS_{planar} \approx 90-120 \text{ mV/decade} $$

## 5. FinFET Capacitance Model

### 5.1 Total Gate Capacitance

$$ C_{gate} = C_{ox} + C_{fringe} + C_{overlap} $$

Where:

**Oxide capacitance per fin:**

$$ C_{ox} = \epsilon_{ox} \cdot \frac{(2H_{fin} + W_{fin}) \cdot L_g}{t_{ox}} $$

**Fringe capacitance** (fin-to-fin coupling):

$$ C_{fringe} \approx \epsilon_0 \cdot \kappa_{eff} \cdot \frac{H_{fin}}{P_{fin} - W_{fin}} $$

### 5.2 Parasitic Capacitance Components

```
┌─────────────────────────────────────────────────┐
│                                                 │
│        C_gd (gate-drain overlap)                │
│             ┌───┐                               │
│        ┌────┤   ├────┐                          │
│        │    └───┘    │                          │
│       ─┴─   GATE    ─┴─                         │
│      SOURCE        DRAIN                        │
│        │             │                          │
│        └──── C_ds ───┘  (source-drain fringe)   │
│                                                 │
└─────────────────────────────────────────────────┘
```

## 6. Power and Performance Metrics

### 6.1 Dynamic Power

$$ P_{dynamic} = \alpha \cdot C_{load} \cdot V_{DD}^2 \cdot f $$

Where:

- $\alpha$ = Activity factor
- $C_{load}$ = Load capacitance
- $V_{DD}$ = Supply voltage
- $f$ = Operating frequency

**FinFET advantage:** Lower $V_{DD}$ is possible due to better electrostatics:

$$ V_{DD,FinFET} \approx (0.7 \text{ to } 0.9) \cdot V_{DD,planar} $$

### 6.2 Static (Leakage) Power

$$ P_{static} = I_{leak} \cdot V_{DD} $$

**Subthreshold leakage current:**

$$ I_{sub} = I_0 \cdot e^{\frac{V_{GS} - V_{th}}{n \cdot V_T}} \cdot \left(1 - e^{-\frac{V_{DS}}{V_T}}\right) $$

Where:

- $V_T = \frac{kT}{q} \approx 26 \text{ mV}$ at room temperature
- $n$ = Subthreshold slope factor ($\approx 1.0-1.1$ for FinFET vs $1.3-1.5$ for planar)

**FinFET leakage reduction:**

$$ \frac{I_{leak,FinFET}}{I_{leak,planar}} \approx \frac{1}{5} \text{ to } \frac{1}{10} $$

### 6.3 Energy-Delay Product

$$ EDP = E \cdot t_d = C_{load} \cdot V_{DD}^2 \cdot \frac{C_{load} \cdot V_{DD}}{I_{on}} $$

$$ EDP = \frac{C_{load}^2 \cdot V_{DD}^3}{I_{on}} $$

## 7. Manufacturing Considerations

### 7.1 Critical Process Steps

1. **Fin Patterning**
   - Self-Aligned Double Patterning (SADP)
   - Self-Aligned Quadruple Patterning (SAQP)
   - EUV Lithography (at 7nm and below)
2. **Fin Etch**
   - High aspect ratio etching
   - Target: $AR = \frac{H_{fin}}{W_{fin}} \approx 5:1$ to $10:1$
3. **Gate Stack Formation**
   - High-κ dielectric ($\kappa \approx 20-25$ for HfO₂)
   - Metal gate (TiN, TaN, work function metals)
4. **Source/Drain Engineering**
   - Epitaxial raised S/D
   - Stress engineering (SiGe for PMOS, SiC/SiP for NMOS)

### 7.2 Variability Sources

**Line Edge Roughness (LER):**

$$ \sigma_{V_{th}} \propto \frac{\sigma_{LER}}{W_{fin}} $$

**Fin Height Variation:**

$$ \frac{\Delta I_{on}}{I_{on}} \approx \frac{\Delta H_{fin}}{H_{fin}} $$

**Random Dopant Fluctuation (reduced in FinFETs):**

$$ \sigma_{V_{th,RDF}} = \frac{A_{VT}}{\sqrt{W_{eff} \cdot L_g}} $$

Where $A_{VT}$ is the Pelgrom coefficient (lower for FinFETs due to undoped channels).

## 8. FinFET Variants

### 8.1 Bulk FinFET vs. SOI FinFET

| Feature | Bulk FinFET | SOI FinFET |
|---------|-------------|------------|
| Substrate | Standard Si wafer | Silicon-on-Insulator |
| Isolation | STI + punch-through stopper | Buried oxide (BOX) |
| Cost | Lower | Higher (~2× wafer cost) |
| Body bias | Possible | Limited |
| Thermal | Better heat dissipation | Worse (floating body) |
| Latch-up | Possible | Immune |

### 8.2 Fin Profile Variations

**Rectangular fin:**

$$ W_{eff} = 2H_{fin} + W_{fin} $$

**Tapered/Trapezoidal fin:**

$$ W_{eff} = 2\sqrt{H_{fin}^2 + \left(\frac{W_{top} - W_{bottom}}{2}\right)^2} + W_{top} $$

## 9. Historical Timeline

| Year | Milestone |
|------|-----------|
| 1989 | Hisamoto (Hitachi) invents DELTA device |
| 1999 | UC Berkeley demonstrates practical FinFET |
| 2002 | TSMC demonstrates 25nm FinFET |
| 2011 | **Intel 22nm Tri-Gate** (Ivy Bridge), first commercial FinFET |
| 2014 | Intel 14nm, Samsung/TSMC 16nm/14nm |
| 2017 | 10nm production (Intel, Samsung, TSMC) |
| 2018 | 7nm production (TSMC, Samsung) |
| 2020 | 5nm production (TSMC, Samsung) |
| 2022 | 3nm production begins; Samsung transitions to GAA |
| 2024 | Intel 20A/18A (RibbonFET), TSMC N2 (GAA) |

## 10. Beyond FinFETs: Gate-All-Around (GAA)

### 10.1 Nanosheet/Nanoribbon Architecture

From the 3nm node (Samsung) and the 2nm-class nodes (Intel, TSMC) onward, the industry transitions to **Gate-All-Around** structures:

```
┌─────────────────────────┐
│          GATE           │
│  ┌───────────────────┐  │
│  │    Nanosheet 3    │  │
│  └───────────────────┘  │
│  ┌───────────────────┐  │
│  │    Nanosheet 2    │  │
│  └───────────────────┘  │
│  ┌───────────────────┐  │
│  │    Nanosheet 1    │  │
│  └───────────────────┘  │
│          GATE           │
└─────────────────────────┘
```

### 10.2 GAA Electrostatics

For a cylindrical nanowire of diameter $d$:

$$ \lambda_{GAA} = \sqrt{\frac{\epsilon_{Si}}{4\epsilon_{ox}} \cdot t_{ox} \cdot \frac{d}{4}} = \sqrt{\frac{\epsilon_{Si} \cdot t_{ox} \cdot d}{16\epsilon_{ox}}} $$

**Scaling comparison:**

$$ \lambda_{GAA} < \lambda_{FinFET} < \lambda_{double-gate} < \lambda_{planar} $$

### 10.3 Nanosheet Width Flexibility

Unlike FinFETs, whose width is quantized in whole fins, GAA nanosheets offer:

$$ W_{eff,GAA} = N_{sheets} \cdot 2 \cdot (W_{sheet} + t_{sheet}) $$

Where $W_{sheet}$ can be varied during design, restoring some analog design flexibility.

## 11. Equations

### Fundamental FinFET Equations

| Parameter | Equation |
|-----------|----------|
| Effective Width | $W_{eff} = N_{fin}(2H_{fin} + W_{fin})$ |
| Subthreshold Swing | $SS = \ln(10)\frac{kT}{q}(1 + \frac{C_D}{C_{ox}})$ |
| DIBL | $\text{DIBL} = \frac{V_{th,lin} - V_{th,sat}}{V_{DS,sat} - V_{DS,lin}}$ |
| Natural Length | $\lambda = \sqrt{\frac{\epsilon_{Si}}{n\epsilon_{ox}} t_{ox} \frac{W_{fin}}{2}}$ |
| Drive Current | $I_{D,sat} = \frac{W_{eff}}{L_g}\mu_{eff}C_{ox}\frac{(V_{GS}-V_{th})^2}{2}$ |
| Leakage Current | $I_{sub} = I_0 e^{(V_{GS}-V_{th})/(nV_T)}$ |
| Thermal Voltage | $V_T = \frac{kT}{q} \approx 26\text{ mV}$ |

In the natural-length row, $n$ is the effective number of gates (3 for a tri-gate FinFET); in the leakage row, $n$ is the subthreshold slope factor.

## Physical Constants

| Constant | Symbol | Value |
|----------|--------|-------|
| Boltzmann constant | $k$ | $1.38 \times 10^{-23}$ J/K |
| Elementary charge | $q$ | $1.60 \times 10^{-19}$ C |
| Permittivity of free space | $\epsilon_0$ | $8.85 \times 10^{-12}$ F/m |
| Silicon permittivity | $\epsilon_{Si}$ | $11.7 \cdot \epsilon_0$ |
| SiO₂ permittivity | $\epsilon_{ox}$ | $3.9 \cdot \epsilon_0$ |
| Thermal voltage (300K) | $V_T$ | $\approx 26$ mV |
| Electron saturation velocity (Si) | $v_{sat}$ | $\approx 10^7$ cm/s |
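As a quick numerical check of the effective-width formulas, the sketch below evaluates the rectangular and tapered fin expressions. The dimensions (in nm) are hypothetical values picked from the typical ranges quoted in the article, and the function names are illustrative.

```python
import math

def rectangular_fin_width(n_fin, h_fin, w_fin):
    """W_eff = N_fin * (2 * H_fin + W_fin) for rectangular fins."""
    return n_fin * (2 * h_fin + w_fin)

def tapered_fin_width(h_fin, w_top, w_bottom):
    """Per-fin W_eff for a trapezoidal profile: two slanted sidewalls plus the top."""
    sidewall = math.sqrt(h_fin ** 2 + ((w_top - w_bottom) / 2) ** 2)
    return 2 * sidewall + w_top

# Hypothetical three-fin device with 45 nm tall, 7 nm wide fins.
w_eff = rectangular_fin_width(n_fin=3, h_fin=45.0, w_fin=7.0)   # 291.0 nm
w_approx = 2 * 3 * 45.0                                         # 270.0 nm, neglecting W_fin
w_tapered = tapered_fin_width(h_fin=45.0, w_top=5.0, w_bottom=9.0)
```

Because the width changes only in steps of one fin (about 97 nm per fin here), designers cannot size a FinFET continuously, which is the quantization that nanosheets relax.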