# Semiconductor Manufacturing Process: Machine Learning Applications & Mathematical Modeling
A comprehensive exploration of the intersection of advanced mathematics, statistical learning, and semiconductor physics.
## 1. The Problem Landscape
Semiconductor manufacturing is arguably the most complex manufacturing process ever devised:
- **500+ sequential process steps** for advanced chips
- **Thousands of control parameters** per tool
- **Sub-nanometer precision** requirements (modern nodes at 3nm, moving to 2nm)
- **Billions of transistors** per chip
- **Yield sensitivity** — a single defect can destroy a \$10,000+ chip
This creates an ideal environment for ML:
- High dimensionality
- Massive data generation
- Complex nonlinear physics
- Enormous economic stakes
### Key Manufacturing Stages
1. **Front-end processing (wafer fabrication)**
- Photolithography
- Etching (wet and dry)
- Deposition (CVD, PVD, ALD)
- Ion implantation
- Chemical mechanical planarization (CMP)
- Oxidation
- Metallization
2. **Back-end processing**
- Wafer testing
- Dicing
- Packaging
- Final testing
## 2. Core Mathematical Frameworks
### 2.1 Virtual Metrology (VM)
**Problem**: Physical metrology is slow and expensive. Predict metrology outcomes from in-situ sensor data.
**Mathematical formulation**:
Given process sensor data $\mathbf{X} \in \mathbb{R}^{n \times p}$ and sparse metrology measurements $\mathbf{y} \in \mathbb{R}^n$, learn:
$$
\hat{y} = f(\mathbf{x}; \theta)
$$
**Key approaches**:
| Method | Mathematical Form | Strengths |
|--------|-------------------|-----------|
| Partial Least Squares (PLS) | Maximize $\text{Cov}(\mathbf{Xw}, \mathbf{Yc})$ | Handles multicollinearity |
| Gaussian Process Regression | $f(x) \sim \mathcal{GP}(m(x), k(x,x'))$ | Uncertainty quantification |
| Neural Networks | Compositional nonlinear mappings | Captures complex interactions |
| Ensemble Methods | Aggregation of weak learners | Robustness |
**Critical mathematical consideration — Regularization**:
$$
L(\theta) = \|\mathbf{y} - f(\mathbf{X};\theta)\|^2 + \lambda_1\|\theta\|_1 + \lambda_2\|\theta\|_2^2
$$
The **elastic net penalty** is essential because semiconductor data has:
- High collinearity among sensors
- Far more features than samples for new processes
- Need for interpretable sparse solutions
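A minimal fitting sketch in Python, assuming synthetic stand-in data (the sensor matrix `X`, target `y`, and all dimensions are illustrative); scikit-learn's `ElasticNetCV` tunes both the penalty strength and the $L_1$/$L_2$ mix by cross-validation:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 wafers, 500 sensor features, with injected collinearity.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)   # two nearly identical sensors
y = 3.0 * X[:, 0] - 2.0 * X[:, 10] + 0.1 * rng.normal(size=200)

# ElasticNetCV searches over penalty strength (alpha) and the L1/L2 mix (l1_ratio).
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5),
)
model.fit(X, y)

enet = model.named_steps["elasticnetcv"]
print("nonzero coefficients:", np.sum(enet.coef_ != 0))  # sparse, interpretable fit
```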
### 2.2 Fault Detection and Classification (FDC)
**Mathematical framework for detection**:
Define normal operating region $\Omega$ from training data. For new observation $\mathbf{x}$, compute:
$$
d(\mathbf{x}, \Omega) = \text{anomaly score}
$$
#### PCA-based Approach (Industry Workhorse)
Project data onto principal components. Compute:
- **$T^2$ statistic** (variation within model):
$$
T^2 = \sum_{i=1}^{k} \frac{t_i^2}{\lambda_i}
$$
- **$Q$ statistic / SPE** (variation outside model):
$$
Q = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 = \|(I - PP^T)\mathbf{x}\|^2
$$
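A short sketch of both statistics using scikit-learn's PCA; the component count `k` and the control limits are assumptions to be calibrated on normal-run data:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_monitor(X_train, k):
    """Fit a k-component PCA model on normal operating data."""
    return PCA(n_components=k).fit(X_train)

def t2_and_q(pca, x):
    """Hotelling T^2 and Q (SPE) for a single observation x."""
    t = pca.transform(x.reshape(1, -1)).ravel()            # scores t_i
    t2 = float(np.sum(t ** 2 / pca.explained_variance_))   # T^2 = sum t_i^2 / lambda_i
    x_hat = pca.inverse_transform(t.reshape(1, -1)).ravel()
    q = float(np.sum((x - x_hat) ** 2))                    # Q = ||x - x_hat||^2
    return t2, q

# In practice, control limits come from the training data, e.g. the 99th
# percentile of T^2 and Q over known-good runs; new wafers exceeding either
# limit are flagged for review.
```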
#### Deep Learning Extensions
- **Autoencoders**: Reconstruction error as anomaly score
- **Variational Autoencoders**: Probabilistic anomaly detection via ELBO
- **One-class Neural Networks**: Learn decision boundary around normal data
#### Fault Classification
Given fault signatures, this becomes multi-class classification. The mathematical challenge is **class imbalance** — faults are rare.
**Solutions**:
- SMOTE and variants for synthetic oversampling
- Cost-sensitive learning
- **Focal loss**:
$$
FL(p) = -\alpha(1-p)^\gamma \log(p)
$$
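A compact binary focal-loss implementation in PyTorch matching the formula above, with $p_t$ the predicted probability of the true class (tensor shapes and default hyperparameters are illustrative):

```python
import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t).

    logits: raw scores, shape (N,); targets: float 0/1 labels, shape (N,).
    gamma > 0 down-weights easy examples so the rare fault class dominates.
    """
    p = torch.sigmoid(logits)
    p_t = targets * p + (1 - targets) * (1 - p)            # prob. of the true class
    alpha_t = targets * alpha + (1 - targets) * (1 - alpha)
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))).mean()
```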
### 2.3 Run-to-Run (R2R) Process Control
**The control problem**: Processes drift due to chamber conditioning, consumable wear, and environmental variation. Adjust recipe parameters between wafer runs to maintain targets.
#### EWMA Controller (Simplest Form)
$$
u_{k+1} = u_k + \lambda \cdot G^{-1}(y_{\text{target}} - y_k)
$$
where $G$ is the process gain matrix $\left(\frac{\partial y}{\partial u}\right)$.
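A scalar sketch of this controller; the process behavior inside `run_process`, the assumed gain, and the drift profile are all toy stand-ins:

```python
import numpy as np

def ewma_r2r(u0, y_target, gain, lam, run_process, n_runs=25):
    """Scalar EWMA run-to-run controller: u_{k+1} = u_k + lam * (y_target - y_k) / gain.

    u0: initial recipe setting; gain: assumed process gain dy/du;
    lam: EWMA weight in (0, 1]; run_process: callable u -> measured y.
    """
    u, history = u0, []
    for _ in range(n_runs):
        y = run_process(u)                     # run one wafer, measure the output
        u = u + lam * (y_target - y) / gain    # correct the recipe toward target
        history.append((u, y))
    return history

# Example: a process with true gain 2.0 plus a slow additive drift.
drift = iter(np.linspace(0.0, 1.0, 25))
hist = ewma_r2r(u0=1.0, y_target=5.0, gain=2.0, lam=0.4,
                run_process=lambda u: 2.0 * u + next(drift))
```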
#### Model Predictive Control Formulation
$$
\min_{u_k} J = (y_{\text{target}} - \hat{y}_k)^T Q (y_{\text{target}} - \hat{y}_k) + \Delta u_k^T R \, \Delta u_k
$$
**Subject to**:
- Process model: $\hat{y} = f(u, \text{state})$
- Constraints: $u_{\min} \leq u \leq u_{\max}$
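A one-step sketch of this optimization with `scipy.optimize.minimize`; the linear plant `G`, the weights, and the bounds are illustrative placeholders:

```python
import numpy as np
from scipy.optimize import minimize

def mpc_step(u_prev, y_target, model, Q, R, u_bounds):
    """Solve the one-step MPC problem for the next recipe u.

    model: callable u -> predicted y; Q, R: output and move-suppression weights;
    u_bounds: list of (min, max) per recipe parameter.
    """
    def cost(u):
        e = y_target - model(u)
        du = u - u_prev
        return e @ Q @ e + du @ R @ du         # tracking error + actuation penalty
    return minimize(cost, x0=u_prev, bounds=u_bounds).x

# Toy linear process with two recipe knobs and two outputs.
G = np.array([[2.0, 0.3], [0.1, 1.5]])
u_next = mpc_step(u_prev=np.zeros(2), y_target=np.array([1.0, 0.5]),
                  model=lambda u: G @ u, Q=np.eye(2), R=0.1 * np.eye(2),
                  u_bounds=[(-1, 1), (-1, 1)])
```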
#### Adaptive/Learning R2R
The process model drifts. Use recursive estimation:
$$
\hat{\theta}_{k+1} = \hat{\theta}_k + K_k(y_k - \hat{y}_k)
$$
where $K$ is the **Kalman gain**, or use online gradient descent for neural network models.
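Recursive least squares with a forgetting factor is one concrete instance of this update; all names and dimensions below are illustrative:

```python
import numpy as np

def rls_update(theta, P, phi, y, lam=0.98):
    """One RLS step: theta_{k+1} = theta_k + K_k (y_k - y_hat_k).

    theta: parameter estimate (d,); P: covariance (d, d); phi: regressors (d,);
    y: new measurement; lam < 1 is a forgetting factor so the model tracks drift.
    """
    y_hat = phi @ theta
    K = P @ phi / (lam + phi @ P @ phi)        # gain vector (Kalman-like)
    theta = theta + K * (y - y_hat)
    P = (P - np.outer(K, phi @ P)) / lam
    return theta, P

# Usage: start from a diffuse prior and update after each wafer.
theta, P = np.zeros(2), 1e3 * np.eye(2)
theta, P = rls_update(theta, P, phi=np.array([1.0, 0.5]), y=2.1)
```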
### 2.4 Yield Modeling and Optimization
#### Classical Defect-Limited Yield
**Poisson model**:
$$
Y = e^{-AD}
$$
where $A$ = chip area, $D$ = defect density.
**Negative binomial** (accounts for clustering):
$$
Y = \left(1 + \frac{AD}{\alpha}\right)^{-\alpha}
$$
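A quick numeric comparison of the two models (the area, defect density, and clustering parameter are made-up illustrative values); note that clustering raises the predicted yield for the same average defect density:

```python
import numpy as np

def poisson_yield(area_cm2, d0_per_cm2):
    """Poisson defect-limited yield Y = exp(-A * D)."""
    return np.exp(-area_cm2 * d0_per_cm2)

def negbin_yield(area_cm2, d0_per_cm2, alpha):
    """Negative binomial yield; alpha captures defect clustering."""
    return (1 + area_cm2 * d0_per_cm2 / alpha) ** (-alpha)

# Illustrative numbers: A = 1 cm^2, D = 0.3 defects/cm^2, clustering alpha = 2.
print(poisson_yield(1.0, 0.3))      # ~0.741
print(negbin_yield(1.0, 0.3, 2.0))  # ~0.756: clustered defects waste fewer dies
```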
#### ML-based Yield Prediction
The yield is a complex function of hundreds of process parameters across all steps. This is a high-dimensional regression problem with:
- Interactions between distant process steps
- Nonlinear effects
- Spatial patterns on wafer
**Gradient boosted trees** (XGBoost, LightGBM) excel here due to:
- Automatic feature selection
- Interaction detection
- Robustness to outliers
#### Spatial Yield Modeling
Uses Gaussian processes with spatial kernels:
$$
k(x_i, x_j) = \sigma^2 \exp\left(-\frac{\|x_i - x_j\|^2}{2\ell^2}\right)
$$
to capture systematic wafer-level patterns.
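A minimal sketch with scikit-learn's GP regressor and an RBF kernel matching the form above; the die coordinates and the radial yield pattern are synthetic stand-ins:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in: die coordinates (x, y) on a unit wafer with edge yield loss.
rng = np.random.default_rng(1)
xy = rng.uniform(-1, 1, size=(300, 2))
r2 = np.sum(xy ** 2, axis=1)
yield_frac = 0.95 - 0.2 * r2 + 0.02 * rng.normal(size=300)

# RBF is the squared-exponential spatial kernel above; WhiteKernel absorbs
# die-level measurement noise.
gp = GaussianProcessRegressor(kernel=1.0 * RBF(length_scale=0.5) + WhiteKernel())
gp.fit(xy, yield_frac)
mean, std = gp.predict(np.array([[0.0, 0.0], [0.9, 0.0]]), return_std=True)
print(mean, std)   # center vs. edge prediction, with uncertainty
```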
## 3. Physics-Informed Machine Learning
### 3.1 The Hybrid Paradigm
Pure data-driven models struggle with:
- Extrapolation beyond training distribution
- Limited data for new processes
- Physical implausibility of predictions
#### Physics-Informed Neural Networks (PINNs)
$$
L = L_{\text{data}} + \lambda_{\text{physics}} L_{\text{physics}}
$$
where $L_{\text{physics}}$ penalizes violations of the governing equations (a minimal sketch follows the table below).
**Examples in semiconductor context**:
| Process | Governing Physics | PDE Constraint |
|---------|-------------------|----------------|
| Thermal processing | Heat equation | $\frac{\partial T}{\partial t} = \alpha \nabla^2 T$ |
| Diffusion/implant | Fick's law | $\frac{\partial C}{\partial t} = D \nabla^2 C$ |
| Plasma etch | Boltzmann + fluid | Complex coupled system |
| CMP | Preston equation | $\frac{dh}{dt} = k_p \cdot P \cdot V$ |
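A minimal PINN sketch for the 1-D heat equation from the table, assuming PyTorch; the network size, collocation sampling, physics weight, and the all-zero "measurements" are placeholders, not a production recipe:

```python
import torch
import torch.nn as nn

alpha = 0.1   # assumed thermal diffusivity
net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                    nn.Linear(64, 64), nn.Tanh(),
                    nn.Linear(64, 1))

def physics_residual(xt):
    """PDE residual u_t - alpha * u_xx at collocation points xt = (x, t)."""
    xt = xt.clone().requires_grad_(True)
    u = net(xt)
    grads = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = grads[:, 0], grads[:, 1]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, 0]
    return u_t - alpha * u_xx

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    xt_data, u_data = torch.rand(64, 2), torch.zeros(64, 1)  # stand-in measurements
    xt_col = torch.rand(256, 2)                              # collocation points
    loss = ((net(xt_data) - u_data) ** 2).mean() \
         + 1.0 * (physics_residual(xt_col) ** 2).mean()      # L_data + lambda * L_physics
    opt.zero_grad(); loss.backward(); opt.step()
```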
### 3.2 Computational Lithography
#### The Forward Problem
Mask pattern $M(\mathbf{r})$ → Optical system $H(\mathbf{k})$ → Aerial image → Resist chemistry → Final pattern
$$
I(\mathbf{r}) = \left|\mathcal{F}^{-1}\{H(\mathbf{k}) \cdot \mathcal{F}\{M(\mathbf{r})\}\}\right|^2
$$
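A sketch of this forward model in the fully coherent limit (a single pupil kernel, not a full Hopkins partial-coherence treatment); the grid size, pupil cutoff, and mask feature are illustrative:

```python
import numpy as np

def aerial_image(mask, na_cutoff=0.25):
    """Coherent-imaging sketch: low-pass the mask spectrum with a circular pupil.

    mask: 2D array M(r); na_cutoff: pupil radius as a fraction of Nyquist.
    """
    n = mask.shape[0]
    fx = np.fft.fftfreq(n)
    kx, ky = np.meshgrid(fx, fx)
    pupil = (np.sqrt(kx**2 + ky**2) <= na_cutoff * 0.5).astype(float)  # H(k)
    field = np.fft.ifft2(pupil * np.fft.fft2(mask))   # F^{-1}{ H . F{M} }
    return np.abs(field) ** 2                         # I(r) = |field|^2

mask = np.zeros((256, 256))
mask[96:160, 120:136] = 1.0    # a single line feature
img = aerial_image(mask)       # blurred, ringing image of the sharp mask edge
```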
#### Inverse Lithography / OPC
Given target pattern, find mask that produces it. This is a **non-convex optimization**:
$$
\min_M \|P_{\text{target}} - P(M)\|^2 + R(M)
$$
#### ML Acceleration
- **CNNs** learn the forward mapping (1000× faster than rigorous simulation)
- **GANs** for mask synthesis
- **Differentiable lithography simulators** for end-to-end optimization
## 4. Time Series and Sequence Modeling
### 4.1 Equipment Health Monitoring
#### Remaining Useful Life (RUL) Prediction
Model equipment degradation as a stochastic process:
$$
S(t) = S_0 + \int_0^t g(S(\tau), u(\tau)) \, d\tau + \sigma W(t)
$$
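An Euler-Maruyama simulation of this degradation process, with a naive threshold-crossing RUL readout; the drift function, noise level, and failure threshold are toy assumptions:

```python
import numpy as np

def simulate_degradation(s0, g, sigma, dt=0.1, n_steps=500, seed=0):
    """Euler-Maruyama integration of dS = g(S, u) dt + sigma dW."""
    rng = np.random.default_rng(seed)
    s = np.empty(n_steps + 1)
    s[0] = s0
    for k in range(n_steps):
        u = 1.0   # stand-in operating condition
        s[k + 1] = s[k] + g(s[k], u) * dt + sigma * np.sqrt(dt) * rng.normal()
    return s

# RUL estimate: first time the simulated health index crosses a failure threshold.
path = simulate_degradation(s0=1.0, g=lambda s, u: -0.01 * u * s, sigma=0.02)
below = path < 0.5
rul_steps = int(np.argmax(below)) if below.any() else None
```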
#### Deep Learning Approaches
- **LSTM/GRU**: Capture long-range temporal dependencies in sensor streams
- **Temporal Convolutional Networks**: Dilated convolutions for efficient long sequences
- **Transformers**: Attention over maintenance history and operating conditions
### 4.2 Trace Data Analysis
Each wafer run produces high-frequency sensor traces (temperature, pressure, RF power, etc.).
#### Feature Extraction Approaches
- Statistical moments (mean, variance, skewness)
- Frequency domain (FFT coefficients)
- Wavelet decomposition
- Learned features via 1D CNNs or autoencoders
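A small extractor combining the first two groups above (statistical moments plus leading FFT magnitudes); the number of retained coefficients is an arbitrary choice:

```python
import numpy as np
from scipy import stats

def trace_features(trace, n_fft=8):
    """Summary features for one sensor trace: four moments plus n_fft FFT magnitudes."""
    spec = np.abs(np.fft.rfft(trace - trace.mean()))
    return np.concatenate([
        [trace.mean(), trace.std(), stats.skew(trace), stats.kurtosis(trace)],
        spec[:n_fft],   # coarse frequency-domain signature of the trace
    ])
```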
#### Dynamic Time Warping (DTW)
For trace comparison:
$$
DTW(X, Y) = \min_{\pi} \sum_{(i,j) \in \pi} d(x_i, y_j)
$$
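The classic dynamic-programming recursion, written out directly (quadratic time; production code would use a windowed or optimized library implementation):

```python
import numpy as np

def dtw_distance(x, y):
    """O(len(x) * len(y)) DTW with absolute-difference local cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Best of: insertion, deletion, match along the warping path.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two traces with the same shape but shifted timing align cheaply under DTW:
t = np.linspace(0, 1, 100)
print(dtw_distance(np.sin(6 * t), np.sin(6 * (t - 0.1))))
```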
## 5. Bayesian Optimization for Process Development
### 5.1 The Experimental Challenge
New process development requires finding optimal recipe settings with minimal experiments (each wafer costs \$1000+, time is critical).
#### Bayesian Optimization Framework
1. Fit Gaussian Process surrogate to observations
2. Compute acquisition function
3. Query next point: $x_{\text{next}} = \arg\max_x \alpha(x)$
4. Repeat
#### Acquisition Functions
- **Expected Improvement**:
$$
EI(x) = \mathbb{E}[\max(f(x) - f^*, 0)]
$$
- **Knowledge Gradient**: Value of information from observing at $x$
- **Upper Confidence Bound**:
$$
UCB(x) = \mu(x) + \kappa\sigma(x)
$$
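A single BO iteration sketch combining a Matern-kernel GP surrogate with Expected Improvement; the toy 1-D objective stands in for an actual wafer-run measurement:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for maximization; xi trades exploration against exploitation."""
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# One iteration over a 1-D recipe knob (toy objective replaces a wafer run).
objective = lambda u: -(u - 0.6) ** 2
U = np.array([[0.1], [0.5], [0.9]])                 # recipes tried so far
f = objective(U).ravel()
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5)).fit(U, f)
grid = np.linspace(0, 1, 200).reshape(-1, 1)
mu, sigma = gp.predict(grid, return_std=True)
u_next = grid[np.argmax(expected_improvement(mu, sigma, f.max()))]  # next experiment
```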
### 5.2 High-Dimensional Extensions
Standard BO struggles beyond ~20 dimensions. Semiconductor recipes have 50-200 parameters.
**Solutions**:
- **Random embeddings** (REMBO)
- **Additive structure**: $f(\mathbf{x}) = \sum_i f_i(x_i)$
- **Trust region methods** (TuRBO)
- **Neural network surrogates**
## 6. Causal Inference for Root Cause Analysis
### 6.1 The Problem
**Correlation ≠ Causation**. When yield drops, engineers need to find the *cause*, not just correlated variables.
#### Granger Causality (Time Series)
$X$ Granger-causes $Y$ if past $X$ improves prediction of $Y$ beyond past $Y$ alone:
$$
\sigma^2(Y_t \mid Y_{<t}, X_{<t}) < \sigma^2(Y_t \mid Y_{<t})
$$
i.e., including the history of $X$ reduces the prediction error variance of $Y$.
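A hedged sketch of a pairwise Granger test with `statsmodels`; the variable names (`etch_rate`, `pressure`), the injected lag, and the lag order are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

# Does chamber pressure help predict etch rate beyond etch rate's own past?
# Convention: first column is the effect, second is the candidate cause.
rng = np.random.default_rng(2)
pressure = rng.normal(size=300)
etch_rate = np.roll(pressure, 2) + 0.5 * rng.normal(size=300)  # lagged dependence
data = pd.DataFrame({"etch_rate": etch_rate, "pressure": pressure})
res = grangercausalitytests(data[["etch_rate", "pressure"]], maxlag=4)
```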