RTP (rapid thermal processing)
Fast heating for short high-temperature treatments.
Ruff is a fast Python linter. Replaces flake8 and isort.
Derive interpretable rules from trained models.
Apply hand-crafted rules.
Run charts plot measurements over time, revealing trends and patterns.
Patterns indicating process changes.
Run-around loops circulate fluid between exhaust and supply coils, transferring thermal energy.
Operate until breakdown.
Run-to-run control adjusts process parameters based on previous lot results.
Adjust recipe between wafer lots.
Channels feeding compound.
Compound in runners.
RunPod provides GPU instances for ML. Spot and on-demand pricing. Inference endpoints.
Ruptures provides algorithms for offline change point detection, including binary segmentation and dynamic programming.
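A minimal usage sketch of this interface; the signal below is synthetic and the number of breakpoints is chosen purely for illustration.

```python
import numpy as np
import ruptures as rpt

# Synthetic piecewise-constant signal with three mean shifts (illustrative only).
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(m, 1.0, 100) for m in (0, 5, 0, 5)])

# Binary segmentation with a least-squares cost; rpt.Dynp would give the exact
# dynamic-programming solution at higher cost.
algo = rpt.Binseg(model="l2").fit(signal)
breakpoints = algo.predict(n_bkps=3)   # segment end indices, roughly [100, 200, 300, 400]
print(breakpoints)
```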
Ruthenium contacts offer excellent scaling properties with low resistivity and good electromigration resistance.
Emerging metal for advanced nodes with better performance than Cu at small dimensions.
Elemental depth profiling.
Recurrent Variational Autoencoder models sequences through hierarchical latent variables with temporal structure.
RNN-like architecture for images.
Receptance Weighted Key Value combines RNN efficiency with transformer expressiveness.
RNN-like architecture competitive with Transformers.
Receiver equalization compensates for channel loss at receiver through active or passive filtering.
S-parameters characterize multi-port networks in the frequency domain, describing transmission and reflection for high-speed interconnect analysis.
Scattering parameters for RF characterization.
Source-drain extensions are shallow, lightly doped regions implanted before spacer formation, managing short-channel effects and junction capacitance.
Self-Supervised learning for Sequential Recommendation uses data augmentation and contrastive learning.
Efficient SSM using special parameterization.
Structured State Space model uses diagonal approximations for efficient training.
Simplified diagonal state space model improves training stability and efficiency.
Common lead-free solder.
# Soft Actor-Critic (SAC): Advanced Reinforcement Learning Logic

## Core Philosophy

SAC is a **maximum entropy reinforcement learning** algorithm that fundamentally reframes the RL objective. Instead of simply maximizing expected cumulative reward, SAC maximizes a modified objective that includes an entropy bonus.

### Standard RL Objective

$$
J_{\text{standard}}(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) \right]
$$

### SAC's Maximum Entropy Objective

$$
J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot | s_t)) \right]
$$

Where:

- $\pi$: Policy (maps states to action distributions)
- $\rho_\pi$: State-action marginal induced by policy $\pi$
- $r(s_t, a_t)$: Reward function
- $\alpha$: Temperature parameter (entropy coefficient)
- $\mathcal{H}(\pi(\cdot|s))$: Entropy of the policy at state $s$

## Maximum Entropy Objective

### Entropy Definition

The entropy of a continuous policy distribution is:

$$
\mathcal{H}(\pi(\cdot|s)) = -\int_{\mathcal{A}} \pi(a|s) \log \pi(a|s) \, da
$$

For discrete actions:

$$
\mathcal{H}(\pi(\cdot|s)) = -\sum_{a \in \mathcal{A}} \pi(a|s) \log \pi(a|s)
$$

### Intuition

- **High entropy** → Policy is close to uniform (exploratory)
- **Low entropy** → Policy is concentrated/deterministic (exploitative)
- **SAC incentivizes** acting as randomly as possible while achieving high reward

## Why Entropy Maximization Matters

### 1. Exploration-Exploitation Balance Built-In

- Traditional RL separates exploration from the objective:
  - $\epsilon$-greedy exploration
  - Noise injection (Ornstein-Uhlenbeck, Gaussian)
  - Exploration bonuses
- SAC embeds exploration **into** the optimization target
- A policy that collapses to deterministic actions pays an entropy penalty
- This forces continued exploration even as the policy improves

### 2. Robustness and Multi-Modality

- Stochastic policies naturally capture **multiple near-optimal solutions**
- If two action sequences yield similar returns:
  - Deterministic policy: arbitrarily picks one
  - SAC policy: maintains probability mass on both
- Result: more robust policies that handle perturbations better

### 3. Better Credit Assignment

- Entropy bonus provides a **dense intrinsic reward signal**
- Even with sparse external rewards, the agent receives a learning signal
- Effectively a form of **curiosity** without explicit curiosity modules

### 4. Improved Convergence Properties

- Entropy regularization smooths the optimization landscape
- Prevents premature convergence to local optima
- Provides a better gradient signal in early training

## Architecture: Three Networks

SAC employs an **actor-critic architecture** with three main components.

### 1. Critic Networks (Q-functions)

Two independent Q-networks: $Q_{\theta_1}(s, a)$ and $Q_{\theta_2}(s, a)$

**Training objective (for each Q-network):**

$$
\mathcal{L}_Q(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q_{\theta_i}(s, a) - y \right)^2 \right]
$$

**Target value:**

$$
y = r + \gamma \mathbb{E}_{a' \sim \pi_\phi} \left[ \min_{j=1,2} Q_{\bar{\theta}_j}(s', a') - \alpha \log \pi_\phi(a'|s') \right]
$$

**Key features:**

- **Clipped Double-Q trick**: Uses $\min(Q_{\theta_1}, Q_{\theta_2})$ to combat overestimation bias
- **Target networks** $\bar{\theta}$: Soft-updated for stability:

$$
\bar{\theta} \leftarrow \tau \theta + (1 - \tau) \bar{\theta}
$$
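To make the critic update concrete, here is a minimal PyTorch-style sketch. The names (`critic_loss`, `soft_update`, a `policy.sample` method returning an action with its log-probability, Q-networks returning a 1-D batch of values) are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn.functional as F

def critic_loss(batch, q1, q2, q1_target, q2_target, policy, alpha, gamma=0.99):
    """Clipped double-Q soft target: y = r + gamma * (min Q_target(s', a') - alpha * log pi(a'|s'))."""
    s, a, r, s_next, done = batch            # replay-buffer tensors, assumed shape (batch,)
    with torch.no_grad():
        a_next, logp_next = policy.sample(s_next)             # a' ~ pi(.|s') with log-prob
        q_next = torch.min(q1_target(s_next, a_next),
                           q2_target(s_next, a_next))          # clipped double-Q
        y = r + gamma * (1.0 - done) * (q_next - alpha * logp_next)  # done masks terminal states
    return F.mse_loss(q1(s, a), y) + F.mse_loss(q2(s, a), y)

def soft_update(target, source, tau=0.005):
    """Polyak averaging: theta_bar <- tau * theta + (1 - tau) * theta_bar."""
    for p_t, p in zip(target.parameters(), source.parameters()):
        p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```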
### 2. Actor Network (Policy)

A stochastic policy $\pi_\phi(a|s)$ typically parameterized as a **squashed Gaussian**:

$$
a = \tanh\left( \mu_\phi(s) + \sigma_\phi(s) \odot \epsilon \right), \quad \epsilon \sim \mathcal{N}(0, I)
$$

**Policy objective:**

$$
\mathcal{L}_\pi(\phi) = \mathbb{E}_{s \sim \mathcal{D}, \, \epsilon \sim \mathcal{N}} \left[ \alpha \log \pi_\phi(a_\phi(s, \epsilon)|s) - Q_\theta(s, a_\phi(s, \epsilon)) \right]
$$

**Key features:**

- **Reparameterization trick**: Enables low-variance gradient estimation
- **Tanh squashing**: Bounds actions to the valid range $[-1, 1]$
- **Log-probability correction** for the change of variables:

$$
\log \pi(a|s) = \log \mu(u|s) - \sum_{i=1}^{D} \log(1 - \tanh^2(u_i))
$$

### 3. Temperature Parameter ($\alpha$)

**Automatic tuning via constrained optimization:**

$$
\alpha^* = \arg\min_\alpha \mathbb{E}_{a \sim \pi^*} \left[ -\alpha \log \pi^*(a|s) - \alpha \bar{\mathcal{H}} \right]
$$

**Simplified loss:**

$$
\mathcal{L}(\alpha) = \mathbb{E}_{a \sim \pi} \left[ -\alpha \left( \log \pi(a|s) + \bar{\mathcal{H}} \right) \right]
$$

Where $\bar{\mathcal{H}}$ is the target entropy, typically set to:

$$
\bar{\mathcal{H}} = -\dim(\mathcal{A})
$$

**Intuition:**

- If current entropy < target → $\alpha$ increases → more exploration
- If current entropy > target → $\alpha$ decreases → more exploitation

## The Soft Bellman Equation

### Standard Bellman Equation

$$
Q(s, a) = r(s, a) + \gamma \max_{a'} Q(s', a')
$$

### Soft Bellman Equation

$$
Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim p} \left[ V(s') \right]
$$

Where the **soft value function** is:

$$
V(s) = \mathbb{E}_{a \sim \pi} \left[ Q(s, a) - \alpha \log \pi(a|s) \right]
$$

Alternatively, the soft value function can be expressed as:

$$
V(s) = \alpha \log \int_{\mathcal{A}} \exp\left( \frac{1}{\alpha} Q(s, a) \right) da
$$

### Soft Q-Learning Update

The soft Q-function satisfies:

$$
Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s'} \left[ \mathbb{E}_{a' \sim \pi}[Q(s', a')] + \alpha \mathcal{H}(\pi(\cdot|s')) \right]
$$

### Optimal Soft Policy

The optimal policy under the maximum entropy framework is:

$$
\pi^*(a|s) = \frac{\exp\left( \frac{1}{\alpha} Q^*(s, a) \right)}{\int_{\mathcal{A}} \exp\left( \frac{1}{\alpha} Q^*(s, a') \right) da'}
$$

This is a **Boltzmann distribution** with $Q$ as negative energy.

## Advanced Logic: Why SAC Works

### 1. Off-Policy Learning + Sample Efficiency

**Off-policy learning properties:**

- Can learn from **any** data in the replay buffer
- No need for data from the current policy
- Aggressive experience reuse possible

**Why this matters for SAC:**

- Entropy framework provides stability for off-policy learning
- Stochastic policies avoid the brittleness of deterministic off-policy methods
- Better sample efficiency than on-policy methods (PPO, A2C)

### 2. Reparameterization Trick

**Classic REINFORCE gradient (high variance):**

$$
\nabla_\phi J = \mathbb{E}_{a \sim \pi_\phi} \left[ \nabla_\phi \log \pi_\phi(a|s) \cdot Q(s, a) \right]
$$

**SAC's reparameterized gradient (low variance):**

$$
\nabla_\phi J = \mathbb{E}_{s \sim \mathcal{D}, \, \epsilon \sim \mathcal{N}} \left[ \nabla_\phi \alpha \log \pi_\phi(a_\phi|s) - \nabla_a Q(s, a)|_{a=a_\phi} \cdot \nabla_\phi a_\phi(s, \epsilon) \right]
$$

**Key insight:**

- Action $a_\phi = f_\phi(\epsilon; s)$ is deterministically computed from noise
- Gradients flow through the sampling process
- Much lower variance than score-function estimators
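A matching sketch of the squashed-Gaussian actor, including the reparameterized sample and the tanh log-probability correction. It follows the same assumed interfaces as the critic sketch above; taking the minimum of the two critics in the actor loss is a common convention rather than something required by the objective, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class SquashedGaussianPolicy(nn.Module):
    """Minimal squashed-Gaussian actor: a = tanh(mu(s) + sigma(s) * eps)."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def sample(self, state):
        h = self.net(state)
        mu, log_std = self.mu(h), self.log_std(h).clamp(-20, 2)
        dist = torch.distributions.Normal(mu, log_std.exp())
        u = dist.rsample()                      # reparameterized: u = mu + sigma * eps
        a = torch.tanh(u)                       # squash into [-1, 1]
        # Change of variables: log pi(a|s) = log mu(u|s) - sum_i log(1 - tanh^2(u_i))
        logp = dist.log_prob(u).sum(-1) - torch.log(1.0 - a.pow(2) + 1e-6).sum(-1)
        return a, logp

def actor_loss(states, policy, q1, q2, alpha):
    """L_pi = E[ alpha * log pi(a|s) - Q(s, a) ], with a drawn via rsample()."""
    a, logp = policy.sample(states)
    q = torch.min(q1(states, a), q2(states, a))   # min over both critics (common choice)
    return (alpha * logp - q).mean()
```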
### 3. Clipped Double-Q Learning

**Overestimation bias in Q-learning:**

$$
\mathbb{E}[\max_a \hat{Q}(s, a)] \geq \max_a \mathbb{E}[\hat{Q}(s, a)]
$$

**SAC's solution:**

$$
Q_{\text{target}} = \min(Q_{\theta_1}, Q_{\theta_2})
$$

**Benefits:**

- Combats overestimation from function approximation errors
- Borrowed from TD3 (Twin Delayed DDPG)
- Provides more conservative, stable value estimates

### 4. Automatic Temperature Adjustment

**Constrained optimization formulation:**

$$
\max_\pi \mathbb{E}\left[ \sum_t r(s_t, a_t) \right] \quad \text{subject to} \quad \mathbb{E}[-\log \pi(a|s)] \geq \bar{\mathcal{H}}
$$

**Lagrangian dual:**

$$
\min_\alpha \max_\pi \mathbb{E}\left[ \sum_t r(s_t, a_t) + \alpha(\mathcal{H}(\pi) - \bar{\mathcal{H}}) \right]
$$

**Interpretation:**

- $\alpha$ is the Lagrange multiplier
- Automatically adjusts based on constraint satisfaction
- Removes sensitive hyperparameter tuning
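One common way to realize this adjustment in code is to optimize $\log \alpha$ with the simplified loss from the temperature section above; the exact parameterization and learning rate below are assumptions, not the only option.

```python
import torch

# Optimize log(alpha) so that alpha stays positive; target_entropy = -dim(A) by convention.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_opt = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(logp, target_entropy):
    """L(alpha) = E[-alpha * (log pi(a|s) + H_bar)]; logp is detached from the actor graph."""
    alpha_loss = -(log_alpha.exp() * (logp.detach() + target_entropy)).mean()
    alpha_opt.zero_grad()
    alpha_loss.backward()
    alpha_opt.step()
    return log_alpha.exp().item()   # current alpha to plug into the actor/critic losses
```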
## Connections to Other Frameworks

### 1. Information-Theoretic View

**Rate-distortion theory connection:**

- Policy is a "channel" from states to actions
- Entropy regularization controls "bandwidth"
- Trade-off: information transmission vs. compression

**Mutual information perspective:**

$$
I(\mathcal{S}; \mathcal{A}) = \mathbb{E}[\log \pi(a|s)] - \mathbb{E}[\log \pi(a)]
$$

Maximum entropy policies minimize $I(\mathcal{S}; \mathcal{A})$ subject to achieving the target reward.

### 2. Control as Inference

**Graphical model formulation:**

- Introduce a binary optimality variable $\mathcal{O}_t \in \{0, 1\}$
- Define: $p(\mathcal{O}_t = 1 | s_t, a_t) \propto \exp(r(s_t, a_t))$
- Optimal policy is the posterior: $\pi^*(a|s) = p(a|s, \mathcal{O}_{1:T} = 1)$

**Result:**

- RL becomes probabilistic inference
- Soft Bellman equations emerge naturally
- Unifies RL with probabilistic planning (Levine, 2018)

### 3. Energy-Based Models

**Energy function:** $E(s, a) = -Q(s, a)$

**Boltzmann policy:**

$$
\pi(a|s) = \frac{\exp(-E(s, a) / \alpha)}{Z(s)} = \frac{\exp(Q(s, a) / \alpha)}{\int \exp(Q(s, a') / \alpha) \, da'}
$$

**Interpretation:**

- Q-function defines an energy landscape
- Policy samples proportional to $\exp(Q/\alpha)$
- SAC approximates this with a tractable Gaussian

### 4. Connection to SQL and Other Algorithms

| Algorithm | Key Difference from SAC |
|-----------|------------------------|
| **SQL** (Soft Q-Learning) | Value-based, no explicit policy |
| **TD3** | Deterministic policy, no entropy |
| **PPO** | On-policy, clipped surrogate |
| **MPO** | E-step/M-step optimization |

## Limitations and Frontiers

### Current Limitations

1. **Gaussian Assumption**
   - Unimodal Gaussian may miss multi-modal optimal policies
   - Potential solutions:
     - Normalizing flows
     - Mixture density networks
     - Implicit policies (SVGD)
2. **Discrete Action Spaces**
   - Original formulation assumes continuous actions
   - SAC-Discrete exists but loses some elegance
   - Gumbel-Softmax reparameterization for discrete actions
3. **Hyperparameter Sensitivity**
   - Despite auto-tuning $\alpha$, SAC remains sensitive to:
     - Target entropy $\bar{\mathcal{H}}$ choice
     - Learning rates (actor, critic, alpha)
     - Network architectures
     - Replay buffer size
4. **Sample Efficiency vs. Model-Based**
   - SAC is efficient for model-free methods
   - Model-based approaches can be orders of magnitude better:
     - Dreamer
     - MBPO (Model-Based Policy Optimization)
     - MuZero

### Research Frontiers

1. **Distributional SAC**
   - Learn a distribution over returns, not just the expectation
   - Better risk-sensitive control
2. **Multi-Agent SAC**
   - Extend to cooperative/competitive settings
   - Handle non-stationarity
3. **Hierarchical SAC**
   - Temporal abstraction
   - Options framework integration
4. **Offline/Batch RL**
   - CQL (Conservative Q-Learning)
   - Behavior-constrained SAC

## Algorithm

### SAC Algorithm Pseudocode

```
Initialize:
  - Policy network π_φ
  - Q-networks Q_θ₁, Q_θ₂
  - Target networks Q̄_θ₁, Q̄_θ₂
  - Temperature α
  - Replay buffer D

For each iteration:
  For each environment step:
    1. Sample action: a ~ π_φ(a|s)
    2. Execute action, observe (r, s')
    3. Store (s, a, r, s') in D
  For each gradient step:
    4. Sample batch from D
    5. Update Q-networks:
       y = r + γ(min Q̄(s',ã') - α log π(ã'|s'))
       θᵢ ← θᵢ - λ_Q ∇_θᵢ (Q_θᵢ(s,a) - y)²
    6. Update policy:
       φ ← φ - λ_π ∇_φ (α log π_φ(ã|s) - Q(s,ã))
    7. Update temperature:
       α ← α - λ_α ∇_α (-α(log π(a|s) + H̄))
    8. Update targets:
       θ̄ᵢ ← τθᵢ + (1-τ)θ̄ᵢ
```

### Hyperparameters

| Parameter | Typical Value | Description |
|-----------|---------------|-------------|
| $\gamma$ | 0.99 | Discount factor |
| $\tau$ | 0.005 | Target smoothing coefficient |
| $\alpha$ | Auto-tuned | Temperature / entropy coefficient |
| $\bar{\mathcal{H}}$ | $-\dim(\mathcal{A})$ | Target entropy |
| Learning rate | 3e-4 | For all networks |
| Batch size | 256 | Samples per gradient step |
| Buffer size | $10^6$ | Replay buffer capacity |

## Mathematical Notation

| Symbol | Meaning |
|--------|---------|
| $\pi_\phi$ | Policy parameterized by $\phi$ |
| $Q_\theta$ | Q-function parameterized by $\theta$ |
| $V(s)$ | Soft value function |
| $\mathcal{H}(\cdot)$ | Entropy |
| $\alpha$ | Temperature parameter |
| $\bar{\mathcal{H}}$ | Target entropy |
| $\mathcal{D}$ | Replay buffer |
| $\gamma$ | Discount factor |
| $\tau$ | Target network update rate |
| $\rho_\pi$ | State-action distribution under $\pi$ |
| $\mathcal{A}$ | Action space |
| $\mathcal{S}$ | State space |
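As a small numeric illustration of the soft value function and the Boltzmann-form optimal policy from the energy-based view above, a discrete toy example (the Q-values are made up purely for demonstration):

```python
import numpy as np

# Toy discrete case: 3 actions with made-up soft Q-values.
Q = np.array([1.0, 2.0, 0.5])
alpha = 0.5

# Optimal soft policy: pi*(a|s) proportional to exp(Q(s,a)/alpha)  (Boltzmann distribution).
pi = np.exp(Q / alpha) / np.exp(Q / alpha).sum()

# Soft value: V(s) = alpha * log sum_a exp(Q(s,a)/alpha); >= max Q, approaching it as alpha -> 0.
V = alpha * np.log(np.exp(Q / alpha).sum())

print(pi.round(3), round(V, 3))   # e.g. [0.114 0.844 0.042] 2.085
```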
Tool for configuring and logging experiments.
Temporary layer later removed.
CVD at pressure between atmospheric and vacuum.
Self-aligned patterning using spacers.
Safe reinforcement learning constrains exploration and learned policies to satisfy safety constraints during training and deployment.
Safetensors is a safe tensor serialization format. No code execution. Fast loading.
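A minimal sketch with the PyTorch bindings; the file name and tensor names are placeholders.

```python
import torch
from safetensors.torch import save_file, load_file

# Save and reload a plain dict of tensors.
tensors = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}
save_file(tensors, "example.safetensors")

loaded = load_file("example.safetensors")   # deserialization never executes code
print(loaded["weight"].shape)
```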
Standard datasets for testing model safety (ToxiGen, RealToxicityPrompts).
Safety classifiers predict whether content violates policy guidelines.
Safety fine-tuning adjusts model parameters to reduce harmful outputs.
Systems preventing harmful outputs.
Safety stock is buffer inventory maintained to protect against demand variability and supply disruptions, ensuring production continuity.
Safety training teaches models to decline harmful requests and follow guidelines.
Safety = preventing harmful, illegal, or sensitive outputs. Use policies, classifiers, rule-based filters, and human review for high-risk use cases.
SageMaker is the AWS ML platform. Training, hosting, MLOps. Enterprise integration.
Self-Attention Graph Pooling selects important nodes based on learned attention scores, enabling differentiable coarsening for graph classification.
Use attention for pooling.
Silicide formed selectively on silicon.
Self-aligned silicide forms selectively on exposed silicon surfaces without masks, reducing contact resistance to the source, drain, and gate.
Saliency maps visualize input regions most influential to predictions using gradient magnitudes.
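A minimal gradient-saliency sketch in PyTorch; the `saliency_map` helper name and the channel-wise max reduction are illustrative choices, not a fixed convention.

```python
import torch

def saliency_map(model, image):
    """Gradient saliency for a single image tensor of shape (C, H, W)."""
    model.eval()
    x = image.detach().clone().requires_grad_(True)   # leaf tensor so x.grad is populated
    logits = model(x.unsqueeze(0))                     # (1, num_classes)
    logits.max().backward()                            # gradient of the top-class score
    return x.grad.abs().max(dim=0).values              # (H, W): max of |grad| over channels
```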