{311} defects, process
Rod-like defects from implant damage.
Extreme quantization for gradients.
2.5D field solvers approximate 3D effects with efficiency suitable for large interconnect extraction.
Use an interposer to connect multiple dies.
Extend 1D sinusoidal to 2D.
99.7% of parts within 3 standard deviations.
Three-dimensional atomic force microscopy.
Convolutional networks with temporal dimension.
3D field solvers accurately compute electromagnetic fields in complex structures at higher computational cost.
Explicit 3D scene representation.
Efficient 3D scene representation.
3D Gaussians are primitives with position, scale, rotation, and color, rendered through splatting.
AI 3D generation from images. NeRFs, Gaussian Splatting. Photorealistic 3D scenes.
Stack multiple dies vertically with interconnects.
Build 3D models from images.
Generate 3D objects.
Build 3D structures by bonding multiple wafers.
Generate with 3D understanding.
3D generation from text/images is emerging. NeRFs, point clouds, mesh generation. Early but advancing fast.
3D space plus time.
Record full diffraction pattern at each position.
Root cause technique asking why repeatedly.
Iterative questioning technique.
5S organizes workplaces through Sort, Set in order, Shine, Standardize, and Sustain, improving efficiency.
99.99966% yield target.
Structured problem-solving method.
8D methodology provides a structured approach to problem solving with team-based root cause analysis.
Problem-solving methodology.
Minimize average variance.
Create variants for testing.
Deploy multiple model versions and compare performance.
Compare two model versions by showing different outputs to users.
One-page problem-solving report.
Parallel asynchronous RL.
# A3C: Asynchronous Advantage Actor-Critic

A3C (Asynchronous Advantage Actor-Critic) was introduced by DeepMind in 2016 and represented a paradigm shift in deep reinforcement learning by solving several fundamental problems simultaneously:

- Sample inefficiency
- Training instability
- The need for expensive hardware (GPUs with large replay buffers)

## Core Architecture

### The Key Insight

Instead of using experience replay (as in DQN), A3C achieves decorrelated training data through **parallelism**:

- Multiple agents interact with separate environment instances simultaneously
- Each agent contributes gradients to a shared global network
- Temporal correlation is broken without requiring replay buffers

## The Three A's Explained

### 1. Asynchronous

Multiple worker threads run in parallel, each with:

- Its own copy of the environment
- A local copy of the policy network
- Independent exploration trajectories

Workers periodically sync with a global network:

- **Push**: Send computed gradients to the global network
- **Pull**: Receive updated parameters from the global network

### 2. Advantage

Rather than using raw returns or Q-values, A3C uses the **advantage function**:

$$
A(s, a) = Q(s, a) - V(s)
$$

In practice, the advantage is estimated using n-step returns:

$$
A_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) - V(s_t)
$$

Where:

- $\gamma$ = discount factor
- $r_{t+i}$ = reward at timestep $t+i$
- $V(s)$ = value function estimate
- $k$ = number of steps (typically 5 or 20)

**Why use advantage?**

- Reduces variance significantly compared to REINFORCE
- Maintains unbiased gradients
- Tells us: "How much better was this action than expected on average?"

### 3. Actor-Critic

Two components share a neural network backbone:

| Component | Output | Role |
|-----------|--------|------|
| **Actor** ($\pi$) | Action probabilities | Policy improvement |
| **Critic** ($V$) | State value estimate | Variance reduction via baseline |

The shared representation allows feature learning to benefit both objectives.
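To make the n-step advantage estimate above concrete, here is a minimal NumPy sketch; the function name and the toy numbers at the end are illustrative, not taken from the original paper.

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Compute A_t = sum_i gamma^i r_{t+i} + gamma^k V(s_{t+k}) - V(s_t) for one segment.

    rewards:          r_t, ..., r_{t+k-1} from a rollout segment
    values:           V(s_t), ..., V(s_{t+k-1}) predicted by the critic
    bootstrap_value:  V(s_{t+k}), or 0.0 if the segment ended in a terminal state
    """
    returns = np.zeros(len(rewards))
    R = bootstrap_value
    # Walk backwards through the segment, accumulating discounted returns.
    for i in reversed(range(len(rewards))):
        R = rewards[i] + gamma * R
        returns[i] = R
    return returns - np.asarray(values)

# Toy example: a 5-step segment with constant reward 1.0 (numbers are illustrative)
adv = n_step_advantages([1.0] * 5, [0.5] * 5, bootstrap_value=0.4)
print(adv)
```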
## Loss Function

The total loss combines three terms (the entropy bonus is subtracted so that maximizing entropy is rewarded, consistent with the total loss in the reference section below):

$$
L_{total} = L_{policy} + c_1 \cdot L_{value} - c_2 \cdot L_{entropy}
$$

### Policy Loss (Actor)

$$
L_{policy} = -\log \pi(a_t | s_t) \cdot A_t
$$

Where:

- $\pi(a_t | s_t)$ = probability of taking action $a_t$ in state $s_t$
- $A_t$ = advantage estimate at time $t$

### Value Loss (Critic)

$$
L_{value} = \frac{1}{2}(R_t - V(s_t))^2
$$

Where:

- $R_t$ = discounted return (target)
- $V(s_t)$ = predicted state value

### Entropy Bonus (Exploration)

$$
L_{entropy} = -\sum_{a} \pi(a | s) \log \pi(a | s)
$$

Purpose:

- Prevents premature convergence to deterministic policies
- Encourages exploration
- Typical coefficient: $c_2 = 0.01$

## N-Step Returns

A3C uses n-step bootstrapping for return estimation:

$$
R_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k})
$$

This balances:

- **Bias**: From bootstrapping with imperfect $V$
- **Variance**: From Monte Carlo-style long rollouts

Common choices:

- $n = 5$ for faster updates
- $n = 20$ for more accurate returns

## Algorithm Pseudocode

```
Global shared parameters: θ (policy), θ_v (value)
Global shared counter: T = 0
Maximum timesteps: T_max

For each worker thread:
    Initialize thread step counter: t = 1
    Initialize local parameters: θ' = θ, θ'_v = θ_v
    Repeat:
        Reset gradients: dθ = 0, dθ_v = 0
        Synchronize: θ' = θ, θ'_v = θ_v
        t_start = t
        Get state: s_t
        Repeat:
            Perform a_t according to π(a_t | s_t; θ')
            Receive reward r_t and new state s_{t+1}
            t = t + 1
            T = T + 1
        Until terminal s_t OR t - t_start == t_max
        R = 0 if terminal else V(s_t; θ'_v)
        For i in {t-1, ..., t_start}:
            R = r_i + γ * R
            Accumulate gradients for π:
                dθ += ∇_{θ'} log π(a_i | s_i; θ') * (R - V(s_i; θ'_v))
            Accumulate gradients for V:
                dθ_v += ∂(R - V(s_i; θ'_v))² / ∂θ'_v
        Perform asynchronous update of θ using dθ
        Perform asynchronous update of θ_v using dθ_v
    Until T > T_max
```

## Network Architecture

### Shared Backbone

```
   Input State s
         │
         ▼
┌─────────────────┐
│  Conv Layer 1   │   (if image input)
│  32 filters 8x8 │
│  stride 4, ReLU │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Conv Layer 2   │
│  64 filters 4x4 │
│  stride 2, ReLU │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Conv Layer 3   │
│  64 filters 3x3 │
│  stride 1, ReLU │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Fully Connected │
│ 512 units, ReLU │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌───────┐ ┌───────┐
│ Actor │ │Critic │
│ π(s)  │ │ V(s)  │
│Softmax│ │Linear │
└───────┘ └───────┘
```

### Output Specifications

**Actor (Policy) Head:**

$$
\pi(a | s) = \text{softmax}(W_\pi \cdot h + b_\pi)
$$

**Critic (Value) Head:**

$$
V(s) = W_v \cdot h + b_v
$$

Where $h$ is the shared hidden representation.
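A minimal PyTorch sketch of the shared-backbone network described above. Layer sizes follow the diagram; the class name `A3CNet`, the four-channel 84x84 stacked-frame input, and the example action count are assumptions for illustration, not part of the original specification.

```python
import torch
import torch.nn as nn

class A3CNet(nn.Module):
    """Shared conv backbone with an actor (softmax) head and a critic (linear) head."""

    def __init__(self, num_actions, in_channels=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature map assumes 84x84 input frames
        )
        self.actor = nn.Linear(512, num_actions)  # logits; softmax gives π(a|s)
        self.critic = nn.Linear(512, 1)           # scalar state value V(s)

    def forward(self, x):
        h = self.backbone(x)
        return torch.softmax(self.actor(h), dim=-1), self.critic(h).squeeze(-1)

# Example: a batch of 8 observations, each a stack of four 84x84 frames
probs, value = A3CNet(num_actions=6)(torch.zeros(8, 4, 84, 84))
```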
## Continuous Action Spaces

For continuous control, the actor outputs parameters of a Gaussian distribution:

$$
\pi(a | s) = \mathcal{N}(\mu(s), \sigma(s)^2)
$$

Where:

- $\mu(s)$ = mean action (network output)
- $\sigma(s)$ = standard deviation (can be learned or fixed)

**Sampling:**

$$
a = \mu(s) + \sigma(s) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)
$$

**Log probability:**

$$
\log \pi(a | s) = -\frac{(a - \mu(s))^2}{2\sigma(s)^2} - \log(\sigma(s)) - \frac{1}{2}\log(2\pi)
$$

## Hyperparameters

| Parameter | Typical Value | Description |
|-----------|---------------|-------------|
| $\gamma$ | 0.99 | Discount factor |
| Learning rate | $10^{-4}$ to $7 \times 10^{-4}$ | Step size for optimization |
| $n$-step | 5 or 20 | Steps before bootstrapping |
| Entropy coef ($c_2$) | 0.01 | Exploration encouragement |
| Value coef ($c_1$) | 0.5 | Value loss weight |
| Max grad norm | 40 | Gradient clipping threshold |
| Workers | 16 | Number of parallel threads |

## Comparison: Before and After A3C

### Before A3C (DQN Era)

- Required massive replay buffers (millions of transitions)
- GPU-bound, single environment
- Off-policy complications
- Discrete actions only (without modifications)
- Memory intensive: $O(10^6)$ transitions stored

### A3C's Contributions

- **CPU-friendly**: Runs efficiently on multi-core CPUs
- **On-policy**: Simpler, more stable gradients
- **Continuous actions**: Natural extension via Gaussian policies
- **Faster wall-clock training**: Parallelism compensates for sample inefficiency
- **Memory efficient**: No replay buffer needed

## Limitations and Evolution

| Limitation | Successor | Solution |
|------------|-----------|----------|
| Noisy gradients from async updates | A2C | Synchronous updates |
| Sample inefficiency (on-policy) | IMPALA | V-trace off-policy correction |
| No trust region | PPO/TRPO | Clipped/constrained updates |
| Hyperparameter sensitivity | PPO | Robust clipped objective |

### A2C (Synchronous Version)

- Waits for all workers to finish before updating
- Enables GPU batching
- Often matches or beats A3C
- Simpler to implement and debug

### PPO (Proximal Policy Optimization)

Clipped surrogate objective:

$$
L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]
$$

Where:

$$
r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}
$$

## Generalized Advantage Estimation (GAE)

An improvement often used with A3C/A2C:

$$
\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}
$$

Where the TD residual is:

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

Properties:

- $\lambda = 0$: One-step TD (high bias, low variance)
- $\lambda = 1$: Monte Carlo (low bias, high variance)
- $\lambda = 0.95$: Common choice balancing bias-variance

## Implementation Considerations

### Gradient Clipping

Essential for stability:

```python
import torch

# Clip gradients of the (global) network by global norm before the optimizer step
grad_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=40.0
)
```

### Shared Optimizer State

Use optimizers that handle asynchronous updates:

- **RMSprop** (original paper)
- **Adam** with shared statistics

### Environment Normalization

Normalize observations and rewards:

$$
\hat{s} = \frac{s - \mu_s}{\sigma_s + \epsilon}
$$

$$
\hat{r} = \frac{r}{\sigma_r + \epsilon}
$$

## When to Use A3C Today

### Good Use Cases

- Many CPU cores but limited GPU
- Environment simulation is the bottleneck
- Need continuous control with minimal tuning
- Teaching/learning RL fundamentals
- Rapid prototyping

### Better Alternatives

- **PPO**: Most practical applications (robust, simple)
- **SAC**: Continuous control with sample efficiency
- **IMPALA**: Large-scale distributed training
- **DreamerV3**: Model-based with better sample efficiency

## Mathematical Reference

### Core Equations

**Policy Gradient Theorem:**

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot A^{\pi_\theta}(s, a) \right]
$$

**Advantage Function:**

$$
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)
$$

**Bellman Equation for V:**

$$
V^\pi(s) = \mathbb{E}_{a \sim \pi} \left[ r(s, a) + \gamma V^\pi(s') \right]
$$

**Total Loss:**

$$
L(\theta) = -\mathbb{E}_t \left[ \log \pi_\theta(a_t|s_t) A_t \right] + c_1 \mathbb{E}_t \left[ (V_\theta(s_t) - R_t)^2 \right] - c_2 H(\pi_\theta(\cdot|s_t))
$$

## Quick Reference Card

```
┌─────────────────────────────────────────────────────────┐
│                   A3C QUICK REFERENCE                    │
├─────────────────────────────────────────────────────────┤
│ Algorithm Type:   On-policy, Actor-Critic               │
│ Action Space:     Discrete or Continuous                │
│ Parallelization:  Asynchronous multi-threading          │
│ Memory:           No replay buffer                      │
│ Hardware:         CPU-friendly                          │
├─────────────────────────────────────────────────────────┤
│ Key Hyperparameters:                                    │
│   γ (discount)     = 0.99                               │
│   lr               = 1e-4 to 7e-4                       │
│   n-step           = 5 or 20                            │
│   entropy_coef     = 0.01                               │
│   value_coef       = 0.5                                │
│   max_grad_norm    = 40                                 │
│   num_workers      = 16                                 │
├─────────────────────────────────────────────────────────┤
│ Loss = -log(π) * A + 0.5 * (R - V)² - 0.01 * H(π)       │
└─────────────────────────────────────────────────────────┘
```
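Tying the pieces together, here is a minimal sketch of the combined loss from the reference card above for one rollout segment. The function and variable names are illustrative; `log_probs`, `values`, `returns`, and `entropy` would come from running the policy network and the n-step return computation.

```python
import torch

def a3c_loss(log_probs, values, returns, entropy, value_coef=0.5, entropy_coef=0.01):
    """Loss = -log π(a|s) * A + c1 * (R - V)^2 - c2 * H(π)."""
    advantages = returns - values
    policy_loss = -(log_probs * advantages.detach()).mean()  # actor term: no gradient through A
    value_loss = advantages.pow(2).mean()                    # critic term: squared return error
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()

# Toy tensors standing in for a 5-step segment (values are illustrative)
lp = torch.log(torch.full((5,), 0.25))
loss = a3c_loss(lp, values=torch.zeros(5), returns=torch.ones(5),
                entropy=torch.full((5,), 1.2))
```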
First-principles quantum mechanical calculations.
ABC analysis categorizes inventory by value and usage, prioritizing management attention on the high-value items that contribute most to costs.
Infer most likely explanation.
Ultra-high resolution TEM with correctors.
Use ablation to generate activation maps.
I can help you design ablations to see which components (data, features, modules) really drive performance.
Ablation studies isolate impact of each component. Remove or modify one thing, measure effect. Scientific rigor.
Ablation removes or zeros components to measure contribution. Essential interpretability technique.
Score single output on scale.
Diffusion where tokens gradually become mask tokens.
Refuse to answer when uncertain.
Sound over-approximation of network behavior.
Soundly approximate program behavior.
Use A/B tests and canary rollouts to compare models safely. Start with a small traffic slice, check metrics + human feedback, then scale up.
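One common way to implement the small traffic slice is deterministic hashing of a user or request ID, so each user consistently sees the same model version; this is a minimal sketch, and the function name and 5% slice are illustrative assumptions.

```python
import hashlib

def assign_variant(user_id: str, canary_fraction: float = 0.05) -> str:
    """Deterministically route a user to 'canary' or 'control' based on a hash of their ID."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "control"

# The same user always lands in the same bucket, keeping the comparison consistent.
print(assign_variant("user-1234"))
```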
AC parametric tests characterize frequency-dependent behavior such as capacitance, propagation delay, and switching characteristics.