AI Factory Glossary

9,967 technical terms and definitions

rtp (rapid thermal processing),rtp,rapid thermal processing,diffusion

Fast heating for short high-temperature treatments.

ruff,lint,fast

Ruff is a fast Python linter. It replaces flake8 and isort.

rule extraction from neural networks, explainable ai

Derive interpretable rules from trained models.

rule-based filtering, data quality

Apply hand-crafted rules to remove or flag low-quality records.

run chart, quality & reliability

Run charts plot measurements over time, revealing trends and patterns.

run rules, spc

Patterns indicating process changes.

run-around loop, environmental & sustainability

Run-around loops circulate fluid between exhaust and supply coils, transferring thermal energy.

run-to-failure, production

Operate until breakdown.

run-to-run control, manufacturing operations

Run-to-run control adjusts process parameters based on previous lot results.
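
A common run-to-run scheme is an EWMA controller that folds each lot's metrology result into a disturbance estimate and recenters the recipe toward target. The sketch below is illustrative only; the gain, target, and EWMA weight are assumed values, not taken from this glossary.

```python
# Minimal EWMA run-to-run controller sketch (illustrative values).
def ewma_r2r(measurements, target=50.0, process_gain=1.0, lam=0.3, u0=50.0):
    """Return the recipe setting to use for each upcoming lot."""
    u = u0          # current recipe input
    bias = 0.0      # EWMA estimate of the process disturbance
    settings = [u]
    for y in measurements:                        # post-lot metrology result
        bias = lam * (y - process_gain * u) + (1 - lam) * bias
        u = (target - bias) / process_gain        # recenter the next lot on target
        settings.append(u)
    return settings

print(ewma_r2r([52.1, 51.4, 50.6, 50.2]))
```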

run-to-run control, r2r, process control

Adjust recipe between wafer lots.

runner system, packaging

Channels in the mold through which encapsulant compound flows to the package cavities.

runner waste, packaging

Mold compound that solidifies in the runners and is discarded as waste.

runpod,gpu cloud,inference

RunPod provides GPU instances for ML, available spot or on-demand, along with inference endpoints.

ruptures library, time series models

Ruptures provides algorithms for offline change point detection including binary segmentation and dynamic programming.
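
For example, a binary-segmentation detector can be run as below; the synthetic signal and parameter choices are illustrative, not from this glossary.

```python
import numpy as np
import ruptures as rpt

# Synthetic piecewise-constant signal with two mean shifts (illustrative).
signal = np.concatenate([np.zeros(100), 3 * np.ones(100), -1 * np.ones(100)])
signal = signal + 0.2 * np.random.randn(signal.size)

algo = rpt.Binseg(model="l2").fit(signal)   # binary segmentation, least-squares cost
breakpoints = algo.predict(n_bkps=2)        # indices of detected change points
print(breakpoints)                          # approximately [100, 200, 300]
```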

ruthenium contact, process integration

Ruthenium contacts offer excellent scaling properties with low resistivity and good electromigration resistance.

ruthenium interconnect,beol

Emerging metal for advanced nodes with better performance than Cu at small dimensions.

rutherford backscattering spectrometry (rbs),rutherford backscattering spectrometry,rbs,metrology

Elemental depth profiling.

rvae, rvae, time series models

Recurrent Variational Autoencoder models sequences through hierarchical latent variables with temporal structure.

rwkv for vision, rwkv, computer vision

RNN-like architecture for images.

rwkv, rwkv, llm architecture

Receptance Weighted Key Value combines RNN efficiency with transformer expressiveness.

rwkv,foundation model

RNN-like architecture competitive with Transformers.

rx equalization, signal & power integrity

Receiver equalization compensates for channel loss at the receiver through active or passive filtering.

s-parameters, signal & power integrity

S-parameters characterize multi-port networks in the frequency domain, describing transmission and reflection for high-speed interconnect analysis.

s-parameters,rf design

Scattering parameters for RF characterization.

s/d extension, s/d, process integration

Source/drain extensions are shallow, lightly doped regions under the gate spacers that manage short-channel effects and junction capacitance.

s3-rec, recommendation systems

Self-Supervised learning for Sequential Recommendation uses data augmentation and contrastive learning.

s4 (structured state spaces),s4,structured state spaces,llm architecture

Efficient SSM using special parameterization.

s4 model, s4, llm architecture

The Structured State Space (S4) model uses a structured (diagonal plus low-rank) parameterization of the state matrix for efficient training.
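
The continuous-time state space model that S4 discretizes is, in standard SSM notation (assumed here, not stated elsewhere in this glossary):

$$
x'(t) = A x(t) + B u(t), \qquad y(t) = C x(t) + D u(t)
$$

The structure imposed on $A$ is what makes the resulting sequence operator cheap to train and apply.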

s5 model, s5, llm architecture

Simplified diagonal state space model improves training stability and efficiency.

sac alloy, sac, packaging

Tin-silver-copper (SnAgCu) alloy, the most common lead-free solder.

sac, soft actor critic, reinforcement learning advanced, continuous control, maximum entropy, actor critic, policy gradient

# Soft Actor-Critic (SAC): Advanced Reinforcement Learning Logic

## Core Philosophy

SAC is a **maximum entropy reinforcement learning** algorithm that fundamentally reframes the RL objective. Instead of simply maximizing expected cumulative reward, SAC maximizes a modified objective that includes an entropy bonus.

### Standard RL Objective

$$
J_{\text{standard}}(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) \right]
$$

### SAC's Maximum Entropy Objective

$$
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t)) \right]
$$

Where:

- $\pi$ — Policy (maps states to action distributions)
- $\rho_\pi$ — State-action marginal induced by policy $\pi$
- $r(s_t, a_t)$ — Reward function
- $\alpha$ — Temperature parameter (entropy coefficient)
- $\mathcal{H}(\pi(\cdot|s))$ — Entropy of the policy at state $s$

## Maximum Entropy Objective

### Entropy Definition

The entropy of a continuous policy distribution is:

$$
\mathcal{H}(\pi(\cdot|s)) = -\int_{\mathcal{A}} \pi(a|s) \log \pi(a|s) \, da
$$

For discrete actions:

$$
\mathcal{H}(\pi(\cdot|s)) = -\sum_{a \in \mathcal{A}} \pi(a|s) \log \pi(a|s)
$$

### Intuition

- **High entropy** → Policy is close to uniform (exploratory)
- **Low entropy** → Policy is concentrated/deterministic (exploitative)
- **SAC incentivizes** acting as randomly as possible while achieving high reward

## Why Entropy Maximization Matters

### 1. Exploration-Exploitation Balance Built-In

- Traditional RL separates exploration from the objective:
  - $\epsilon$-greedy exploration
  - Noise injection (Ornstein-Uhlenbeck, Gaussian)
  - Exploration bonuses
- SAC embeds exploration **into** the optimization target:
  - A policy that collapses to deterministic actions pays an entropy penalty
  - This forces continued exploration even as the policy improves

### 2. Robustness and Multi-Modality

- Stochastic policies naturally capture **multiple near-optimal solutions**
- If two action sequences yield similar returns:
  - Deterministic policy: arbitrarily picks one
  - SAC policy: maintains probability mass on both
- Result: More robust policies that handle perturbations better

### 3. Better Credit Assignment

- Entropy bonus provides a **dense intrinsic reward signal**
- Even with sparse external rewards, the agent receives a learning signal
- Effectively a form of **curiosity** without explicit curiosity modules

### 4. Improved Convergence Properties

- Entropy regularization smooths the optimization landscape
- Prevents premature convergence to local optima
- Provides better gradient signal in early training

## Architecture: Three Networks

SAC employs an **actor-critic architecture** with three main components:

### 1. Critic Networks (Q-functions)

Two independent Q-networks: $Q_{\theta_1}(s, a)$ and $Q_{\theta_2}(s, a)$

**Training objective (for each Q-network):**

$$
\mathcal{L}_Q(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q_{\theta_i}(s, a) - y \right)^2 \right]
$$

**Target value:**

$$
y = r + \gamma \mathbb{E}_{a' \sim \pi_\phi} \left[ \min_{j=1,2} Q_{\bar{\theta}_j}(s', a') - \alpha \log \pi_\phi(a'|s') \right]
$$

**Key features:**

- **Clipped Double-Q trick** — Uses $\min(Q_{\theta_1}, Q_{\theta_2})$ to combat overestimation bias
- **Target networks** $\bar{\theta}$ — Soft-updated for stability:

$$
\bar{\theta} \leftarrow \tau \theta + (1 - \tau) \bar{\theta}
$$

### 2. Actor Network (Policy)

A stochastic policy $\pi_\phi(a|s)$, typically parameterized as a **squashed Gaussian**:

$$
a = \tanh\left( \mu_\phi(s) + \sigma_\phi(s) \odot \epsilon \right), \quad \epsilon \sim \mathcal{N}(0, I)
$$

**Policy objective:**

$$
\mathcal{L}_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D}, \, \epsilon \sim \mathcal{N}} \left[ \alpha \log \pi_\phi(a_\phi(s, \epsilon)|s) - Q_\theta(s, a_\phi(s, \epsilon)) \right]
$$

**Key features:**

- **Reparameterization trick** — Enables low-variance gradient estimation
- **Tanh squashing** — Bounds actions to valid range $[-1, 1]$
- **Log-probability correction** — For the change of variables:

$$
\log \pi(a|s) = \log \mu(u|s) - \sum_{i=1}^{D} \log(1 - \tanh^2(u_i))
$$

### 3. Temperature Parameter ($\alpha$)

**Automatic tuning via constrained optimization:**

$$
\alpha^* = \arg\min_\alpha \mathbb{E}_{a \sim \pi^*} \left[ -\alpha \log \pi^*(a|s) - \alpha \bar{\mathcal{H}} \right]
$$

**Simplified loss:**

$$
\mathcal{L}(\alpha) = \mathbb{E}_{a \sim \pi} \left[ -\alpha \left( \log \pi(a|s) + \bar{\mathcal{H}} \right) \right]
$$

Where $\bar{\mathcal{H}}$ is the target entropy, typically set to:

$$
\bar{\mathcal{H}} = -\dim(\mathcal{A})
$$

**Intuition:**

- If current entropy < target → $\alpha$ increases → more exploration
- If current entropy > target → $\alpha$ decreases → more exploitation

## The Soft Bellman Equation

### Standard Bellman Equation

$$
Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')
$$

### Soft Bellman Equation

$$
Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim p} \left[ V(s') \right]
$$

Where the **soft value function** is:

$$
V(s) = \mathbb{E}_{a \sim \pi} \left[ Q(s, a) - \alpha \log \pi(a|s) \right]
$$

Alternatively, the soft value function can be expressed as:

$$
V(s) = \alpha \log \int_{\mathcal{A}} \exp\left( \frac{1}{\alpha} Q(s, a) \right) da
$$

### Soft Q-Learning Update

The soft Q-function satisfies:

$$
Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s'} \left[ \mathbb{E}_{a' \sim \pi}[Q(s', a')] + \alpha \mathcal{H}(\pi(\cdot|s')) \right]
$$

### Optimal Soft Policy

The optimal policy under the maximum entropy framework is:

$$
\pi^*(a|s) = \frac{\exp\left( \frac{1}{\alpha} Q^*(s, a) \right)}{\int_{\mathcal{A}} \exp\left( \frac{1}{\alpha} Q^*(s, a') \right) da'}
$$

This is a **Boltzmann distribution** with $Q$ as negative energy.

## Advanced Logic: Why SAC Works

### 1. Off-Policy Learning + Sample Efficiency

**Off-policy learning properties:**

- Can learn from **any** data in the replay buffer
- No need for data from the current policy
- Aggressive experience reuse possible

**Why this matters for SAC:**

- The entropy framework provides stability for off-policy learning
- Stochastic policies avoid the brittleness of deterministic off-policy methods
- Better sample efficiency than on-policy methods (PPO, A2C)

### 2. Reparameterization Trick

**Classic REINFORCE gradient (high variance):**

$$
\nabla_\phi J = \mathbb{E}_{a \sim \pi_\phi} \left[ \nabla_\phi \log \pi_\phi(a|s) \cdot Q(s, a) \right]
$$

**SAC's reparameterized gradient (low variance):**

$$
\nabla_\phi J = \mathbb{E}_{s \sim \mathcal{D}, \, \epsilon \sim \mathcal{N}} \left[ \nabla_\phi \alpha \log \pi_\phi(a_\phi|s) - \nabla_a Q(s, a)|_{a=a_\phi} \cdot \nabla_\phi a_\phi(s, \epsilon) \right]
$$

**Key insight:**

- Action $a_\phi = f_\phi(\epsilon; s)$ is deterministically computed from noise
- Gradients flow through the sampling process
- Much lower variance than score-function estimators

### 3. Clipped Double-Q Learning

**Overestimation bias in Q-learning:**

$$
\mathbb{E}[\max_a \hat{Q}(s, a)] \geq \max_a \mathbb{E}[\hat{Q}(s, a)]
$$

**SAC's solution:**

$$
Q_{\text{target}} = \min(Q_{\theta_1}, Q_{\theta_2})
$$

**Benefits:**

- Combats overestimation from function approximation errors
- Borrowed from TD3 (Twin Delayed DDPG)
- Provides more conservative, stable value estimates

### 4. Automatic Temperature Adjustment

**Constrained optimization formulation:**

$$
\max_\pi \mathbb{E}\left[ \sum_t r(s_t, a_t) \right] \quad \text{subject to} \quad \mathbb{E}[-\log \pi(a|s)] \geq \bar{\mathcal{H}}
$$

**Lagrangian dual:**

$$
\min_\alpha \max_\pi \mathbb{E}\left[ \sum_t r(s_t, a_t) + \alpha(\mathcal{H}(\pi) - \bar{\mathcal{H}}) \right]
$$

**Interpretation:**

- $\alpha$ is the Lagrange multiplier
- Automatically adjusts based on constraint satisfaction
- Removes sensitive hyperparameter tuning

## Connections to Other Frameworks

### 1. Information-Theoretic View

**Rate-distortion theory connection:**

- Policy is a "channel" from states to actions
- Entropy regularization controls "bandwidth"
- Trade-off: information transmission vs. compression

**Mutual information perspective:**

$$
I(\mathcal{S}; \mathcal{A}) = \mathbb{E}[\log \pi(a|s)] - \mathbb{E}[\log \pi(a)]
$$

Maximum entropy policies minimize $I(\mathcal{S}; \mathcal{A})$ subject to achieving the target reward.

### 2. Control as Inference

**Graphical model formulation:**

- Introduce a binary optimality variable $\mathcal{O}_t \in \{0, 1\}$
- Define: $p(\mathcal{O}_t = 1 | s_t, a_t) \propto \exp(r(s_t, a_t))$
- The optimal policy is the posterior: $\pi^*(a|s) = p(a|s, \mathcal{O}_{1:T} = 1)$

**Result:**

- RL becomes probabilistic inference
- Soft Bellman equations emerge naturally
- Unifies RL with probabilistic planning (Levine, 2018)

### 3. Energy-Based Models

**Energy function:** $E(s, a) = -Q(s, a)$

**Boltzmann policy:**

$$
\pi(a|s) = \frac{\exp(-E(s, a) / \alpha)}{Z(s)} = \frac{\exp(Q(s, a) / \alpha)}{\int \exp(Q(s, a') / \alpha) da'}
$$

**Interpretation:**

- The Q-function defines an energy landscape
- The policy samples proportional to $\exp(Q/\alpha)$
- SAC approximates this with a tractable Gaussian

### 4. Connection to SQL and Other Algorithms

| Algorithm | Key Difference from SAC |
|-----------|------------------------|
| **SQL** (Soft Q-Learning) | Value-based, no explicit policy |
| **TD3** | Deterministic policy, no entropy |
| **PPO** | On-policy, clipped surrogate |
| **MPO** | E-step/M-step optimization |

## Limitations and Frontiers

### Current Limitations

1. **Gaussian Assumption**
   - Unimodal Gaussian may miss multi-modal optimal policies
   - Potential solutions:
     - Normalizing flows
     - Mixture density networks
     - Implicit policies (SVGD)
2. **Discrete Action Spaces**
   - Original formulation assumes continuous actions
   - SAC-Discrete exists but loses some elegance
   - Gumbel-Softmax reparameterization for discrete actions
3. **Hyperparameter Sensitivity**
   - Despite auto-tuning $\alpha$, SAC remains sensitive to:
     - Target entropy $\bar{\mathcal{H}}$ choice
     - Learning rates (actor, critic, alpha)
     - Network architectures
     - Replay buffer size
4. **Sample Efficiency vs. Model-Based**
   - SAC is efficient for a model-free method
   - Model-based approaches can be orders of magnitude better:
     - Dreamer
     - MBPO (Model-Based Policy Optimization)
     - MuZero

### Research Frontiers

1. **Distributional SAC**
   - Learn the distribution over returns, not just the expectation
   - Better risk-sensitive control
2. **Multi-Agent SAC**
   - Extend to cooperative/competitive settings
   - Handle non-stationarity
3. **Hierarchical SAC**
   - Temporal abstraction
   - Options framework integration
4. **Offline/Batch RL**
   - CQL (Conservative Q-Learning)
   - Behavior-constrained SAC

## Algorithm

### SAC Algorithm Pseudocode

```
Initialize:
  - Policy network π_φ
  - Q-networks Q_θ₁, Q_θ₂
  - Target networks Q̄_θ₁, Q̄_θ₂
  - Temperature α
  - Replay buffer D

For each iteration:
  For each environment step:
    1. Sample action: a ~ π_φ(a|s)
    2. Execute action, observe (r, s')
    3. Store (s, a, r, s') in D
  For each gradient step:
    4. Sample batch from D
    5. Update Q-networks:
       y = r + γ(min Q̄(s',ã') - α log π(ã'|s'))
       θᵢ ← θᵢ - λ_Q ∇_θᵢ (Q_θᵢ(s,a) - y)²
    6. Update policy:
       φ ← φ - λ_π ∇_φ (α log π_φ(ã|s) - Q(s,ã))
    7. Update temperature:
       α ← α - λ_α ∇_α (-α(log π(a|s) + H̄))
    8. Update targets:
       θ̄ᵢ ← τθᵢ + (1-τ)θ̄ᵢ
```

### Hyperparameters

| Parameter | Typical Value | Description |
|-----------|---------------|-------------|
| $\gamma$ | 0.99 | Discount factor |
| $\tau$ | 0.005 | Target smoothing coefficient |
| $\alpha$ | Auto-tuned | Temperature / entropy coefficient |
| $\bar{\mathcal{H}}$ | $-\dim(\mathcal{A})$ | Target entropy |
| Learning rate | 3e-4 | For all networks |
| Batch size | 256 | Samples per gradient step |
| Buffer size | $10^6$ | Replay buffer capacity |

## Mathematical Notation

| Symbol | Meaning |
|--------|---------|
| $\pi_\phi$ | Policy parameterized by $\phi$ |
| $Q_\theta$ | Q-function parameterized by $\theta$ |
| $V(s)$ | Soft value function |
| $\mathcal{H}(\cdot)$ | Entropy |
| $\alpha$ | Temperature parameter |
| $\bar{\mathcal{H}}$ | Target entropy |
| $\mathcal{D}$ | Replay buffer |
| $\gamma$ | Discount factor |
| $\tau$ | Target network update rate |
| $\rho_\pi$ | State-action distribution under $\pi$ |
| $\mathcal{A}$ | Action space |
| $\mathcal{S}$ | State space |
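
A compact single-update sketch in PyTorch-style Python follows. It is a minimal illustration of the critic, actor, and temperature losses above, not a full training loop; the network sizes, fixed target entropy, and random toy batch are assumptions.

```python
# Minimal one-gradient-step SAC sketch (PyTorch). Toy sizes and a random batch
# stand in for a real environment and replay buffer.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim, batch = 3, 1, 256
gamma, tau, target_entropy = 0.99, 0.005, -float(act_dim)

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, 256), nn.ReLU(), nn.Linear(256, out))

actor = mlp(obs_dim, 2 * act_dim)                  # outputs mean and log-std
q1, q2 = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
q1_t, q2_t = mlp(obs_dim + act_dim, 1), mlp(obs_dim + act_dim, 1)
q1_t.load_state_dict(q1.state_dict()); q2_t.load_state_dict(q2.state_dict())
log_alpha = torch.zeros(1, requires_grad=True)

opt_pi = torch.optim.Adam(actor.parameters(), lr=3e-4)
opt_q = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)
opt_a = torch.optim.Adam([log_alpha], lr=3e-4)

def sample_action(obs):
    """Squashed-Gaussian sample with the tanh log-prob correction."""
    mean, log_std = actor(obs).chunk(2, dim=-1)
    std = log_std.clamp(-20, 2).exp()
    dist = torch.distributions.Normal(mean, std)
    u = dist.rsample()                             # reparameterization trick
    a = torch.tanh(u)
    logp = dist.log_prob(u) - torch.log(1 - a.pow(2) + 1e-6)
    return a, logp.sum(-1, keepdim=True)

# Toy replay-buffer batch (random placeholders).
s = torch.randn(batch, obs_dim); a = torch.rand(batch, act_dim) * 2 - 1
r = torch.randn(batch, 1); s2 = torch.randn(batch, obs_dim); done = torch.zeros(batch, 1)
alpha = log_alpha.exp()

# Critic update: clipped double-Q soft Bellman target.
with torch.no_grad():
    a2, logp2 = sample_action(s2)
    q_t = torch.min(q1_t(torch.cat([s2, a2], -1)), q2_t(torch.cat([s2, a2], -1)))
    y = r + gamma * (1 - done) * (q_t - alpha * logp2)
q_loss = F.mse_loss(q1(torch.cat([s, a], -1)), y) + F.mse_loss(q2(torch.cat([s, a], -1)), y)
opt_q.zero_grad(); q_loss.backward(); opt_q.step()

# Actor update: minimize alpha * log pi - min Q.
a_pi, logp_pi = sample_action(s)
q_pi = torch.min(q1(torch.cat([s, a_pi], -1)), q2(torch.cat([s, a_pi], -1)))
pi_loss = (alpha.detach() * logp_pi - q_pi).mean()
opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()

# Temperature update toward the target entropy.
alpha_loss = -(log_alpha.exp() * (logp_pi.detach() + target_entropy)).mean()
opt_a.zero_grad(); alpha_loss.backward(); opt_a.step()

# Polyak-average the target networks.
for net, net_t in ((q1, q1_t), (q2, q2_t)):
    for p, p_t in zip(net.parameters(), net_t.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
```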

sacred, mlops

Tool for configuring and logging experiments.
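
A minimal Sacred experiment looks roughly like this; the experiment name and config values are placeholders.

```python
from sacred import Experiment

ex = Experiment("demo")          # placeholder experiment name

@ex.config
def config():
    lr = 1e-3                    # hyperparameters captured automatically
    epochs = 10

@ex.automain
def run(lr, epochs):
    # Sacred injects config values and records them with the run.
    print(f"training for {epochs} epochs at lr={lr}")
```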

sacrificial layer, process

Temporary layer later removed.

sacvd (sub-atmospheric cvd),sacvd,sub-atmospheric cvd,cvd

CVD at pressure between atmospheric and vacuum.

sadp / saqp,lithography

Self-aligned double/quadruple patterning uses spacers to multiply feature density beyond the lithographic pitch.

safe rl, reinforcement learning advanced

Safe reinforcement learning constrains exploration and learned policies to satisfy safety constraints during training and deployment.

safetensors,format,safe

Safetensors is a safe tensor serialization format: no arbitrary code execution and fast loading.
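
For instance, with the safetensors PyTorch helpers (tensor names and the file path are placeholders):

```python
import torch
from safetensors.torch import save_file, load_file

tensors = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}  # placeholder tensors
save_file(tensors, "model.safetensors")   # serialize without pickling code
loaded = load_file("model.safetensors")   # load back; no code execution
print(loaded["weight"].shape)
```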

safety benchmarks,evaluation

Standard datasets for testing model safety (ToxiGen, RealToxicityPrompts).

safety classifier, ai safety

Safety classifiers predict whether content violates policy guidelines.

safety fine-tuning, ai safety

Safety fine-tuning adjusts model parameters to reduce harmful outputs.

safety guardrails, ai safety

Systems preventing harmful outputs.

safety stock, supply chain & logistics

Safety stock is buffer inventory maintained to protect against demand variability and supply disruptions, ensuring production continuity.

safety training, ai safety

Safety training teaches models to decline harmful requests and follow guidelines.

safety,guardrail,filter,policy

Safety = preventing harmful, illegal, or sensitive outputs. Use policies, classifiers, rule-based filters, and human review for high-risk use cases.

sagemaker,aws,mlops

SageMaker is AWS's managed ML platform for training, hosting, and MLOps, with enterprise integration.

sagpool, graph neural networks

Self-Attention Graph Pooling selects important nodes based on learned attention scores enabling differentiable coarsening for graph classification.
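
A stripped-down version of the selection step, with a plain linear scorer standing in for the GNN scoring layer (all shapes and the pooling ratio are illustrative):

```python
import torch

def sagpool_step(x, adj, ratio=0.5):
    """Toy SAGPool-style selection: score nodes, keep top-k, gate kept features."""
    scorer = torch.nn.Linear(x.size(-1), 1)      # stand-in for a GNN scoring layer
    scores = scorer(x).squeeze(-1)               # one attention score per node
    k = max(1, int(ratio * x.size(0)))
    topk = scores.topk(k).indices                # keep the k highest-scoring nodes
    x_pool = x[topk] * torch.tanh(scores[topk]).unsqueeze(-1)   # gate features by score
    adj_pool = adj[topk][:, topk]                # induced subgraph of kept nodes
    return x_pool, adj_pool

x, adj = torch.randn(6, 8), torch.randint(0, 2, (6, 6)).float()
print(sagpool_step(x, adj)[0].shape)             # torch.Size([3, 8])
```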

sagpool, graph neural networks

Use attention for pooling.

salicide (self-aligned silicide),salicide,self-aligned silicide,feol

Silicide formed selectively on silicon.

salicide, process integration

Self-aligned silicide forms selectively on exposed silicon surfaces without additional masks, reducing contact resistance to the source, drain, and gate.

saliency map, interpretability

Saliency maps visualize input regions most influential to predictions using gradient magnitudes.
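
A minimal gradient-saliency sketch in PyTorch; the tiny model and random input are placeholders for a trained classifier and a real example.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder classifier
x = torch.randn(1, 1, 28, 28, requires_grad=True)            # placeholder input image

logits = model(x)
logits[0, logits.argmax()].backward()        # gradient of the top class score w.r.t. input
saliency = x.grad.abs().squeeze()            # per-pixel influence magnitudes
print(saliency.shape)                        # torch.Size([28, 28])
```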