
AI Factory Glossary

513 technical terms and definitions


a-optimal design, doe

Design-of-experiments criterion that minimizes the average variance of the estimated model parameters.

a/b test generation,content creation

Create variants for testing.

a/b testing for models,mlops

Deploy multiple model versions and compare performance.

a/b testing,evaluation

Compare two model versions by showing different outputs to users.

a3 problem solving, a3, quality

One-page problem-solving report.

a3c, a3c, reinforcement learning

Parallel asynchronous RL.

a3c, asynchronous advantage actor critic, reinforcement learning advanced, actor critic, asynchronous rl, deepmind, advanced rl

# A3C: Asynchronous Advantage Actor-Critic

A3C (Asynchronous Advantage Actor-Critic) was introduced by DeepMind in 2016 and represented a paradigm shift in deep reinforcement learning by solving several fundamental problems simultaneously:

- Sample inefficiency
- Training instability
- The need for expensive hardware (GPUs with large replay buffers)

## Core Architecture

### The Key Insight

Instead of using experience replay (as in DQN), A3C achieves decorrelated training data through **parallelism**:

- Multiple agents interact with separate environment instances simultaneously
- Each agent contributes gradients to a shared global network
- Temporal correlation is broken without requiring replay buffers

## The Three A's Explained

### 1. Asynchronous

Multiple worker threads run in parallel, each with:

- Its own copy of the environment
- A local copy of the policy network
- Independent exploration trajectories

Workers periodically sync with a global network:

- **Push**: Send computed gradients to global network
- **Pull**: Receive updated parameters from global network

### 2. Advantage

Rather than using raw returns or Q-values, A3C uses the **advantage function**:

$$
A(s, a) = Q(s, a) - V(s)
$$

In practice, the advantage is estimated using n-step returns:

$$
A_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) - V(s_t)
$$

Where:

- $\gamma$ = discount factor
- $r_{t+i}$ = reward at timestep $t+i$
- $V(s)$ = value function estimate
- $k$ = number of steps (typically 5 or 20)

**Why use advantage?**

- Reduces variance significantly compared to REINFORCE
- Maintains unbiased gradients
- Tells us: "How much better was this action than expected on average?"

### 3. Actor-Critic

Two components share a neural network backbone:

| Component | Output | Role |
|-----------|--------|------|
| **Actor** ($\pi$) | Action probabilities | Policy improvement |
| **Critic** ($V$) | State value estimate | Variance reduction via baseline |

The shared representation allows feature learning to benefit both objectives.
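As a concrete illustration of the n-step advantage estimate defined above, here is a minimal NumPy sketch (not part of the original entry; the names `n_step_advantages`, `rewards`, `values`, and `bootstrap_value` are illustrative):

```python
import numpy as np

def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """For each step t in a k-step segment: A_t = sum_i gamma^i r_{t+i} + gamma^(k-t) V(s_k) - V(s_t)."""
    k = len(rewards)
    returns = np.zeros(k)
    R = bootstrap_value                      # V(s_{t+k}), or 0 if the episode terminated
    for t in reversed(range(k)):             # work backwards: R_t = r_t + gamma * R_{t+1}
        R = rewards[t] + gamma * R
        returns[t] = R
    return returns - np.asarray(values, dtype=float)

# 5-step example: rewards and value estimates from one worker's rollout segment
adv = n_step_advantages(
    rewards=[1.0, 0.0, 0.0, 1.0, 0.0],
    values=[0.5, 0.4, 0.6, 0.7, 0.2],
    bootstrap_value=0.3,
)
```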
## Loss Function

The total loss combines three terms:

$$
L_{total} = L_{policy} + c_1 \cdot L_{value} + c_2 \cdot L_{entropy}
$$

### Policy Loss (Actor)

$$
L_{policy} = -\log \pi(a_t | s_t) \cdot A_t
$$

Where:

- $\pi(a_t | s_t)$ = probability of taking action $a_t$ in state $s_t$
- $A_t$ = advantage estimate at time $t$

### Value Loss (Critic)

$$
L_{value} = \frac{1}{2}(R_t - V(s_t))^2
$$

Where:

- $R_t$ = discounted return (target)
- $V(s_t)$ = predicted state value

### Entropy Bonus (Exploration)

$$
L_{entropy} = -\sum_{a} \pi(a | s) \log \pi(a | s)
$$

Purpose:

- Prevents premature convergence to deterministic policies
- Encourages exploration
- Typical coefficient: $c_2 = 0.01$

## N-Step Returns

A3C uses n-step bootstrapping for return estimation:

$$
R_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k})
$$

This balances:

- **Bias**: From bootstrapping with imperfect $V$
- **Variance**: From Monte Carlo-style long rollouts

Common choices:

- $n = 5$ for faster updates
- $n = 20$ for more accurate returns

## Algorithm Pseudocode

```
Global shared parameters: θ (policy), θ_v (value)
Global shared counter: T = 0
Maximum timesteps: T_max

For each worker thread:
    Initialize thread step counter: t = 1
    Initialize local parameters: θ' = θ, θ'_v = θ_v
    Repeat:
        Reset gradients: dθ = 0, dθ_v = 0
        Synchronize: θ' = θ, θ'_v = θ_v
        t_start = t
        Get state: s_t
        Repeat:
            Perform a_t according to π(a_t | s_t; θ')
            Receive reward r_t and new state s_{t+1}
            t = t + 1
            T = T + 1
        Until terminal s_t OR t - t_start == t_max
        R = 0 if terminal else V(s_t; θ'_v)
        For i in {t-1, ..., t_start}:
            R = r_i + γ * R
            Accumulate gradients for π:
                dθ += ∇_{θ'} log π(a_i | s_i; θ') * (R - V(s_i; θ'_v))
            Accumulate gradients for V:
                dθ_v += ∂(R - V(s_i; θ'_v))² / ∂θ'_v
        Perform asynchronous update of θ using dθ
        Perform asynchronous update of θ_v using dθ_v
    Until T > T_max
```

## Network Architecture

### Shared Backbone

```
   Input State s
         │
         ▼
┌─────────────────┐
│  Conv Layer 1   │  (if image input)
│ 32 filters 8x8  │
│ stride 4, ReLU  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Conv Layer 2   │
│ 64 filters 4x4  │
│ stride 2, ReLU  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Conv Layer 3   │
│ 64 filters 3x3  │
│ stride 1, ReLU  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Fully Connected │
│ 512 units, ReLU │
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌───────┐ ┌───────┐
│ Actor │ │Critic │
│ π(s)  │ │ V(s)  │
│Softmax│ │Linear │
└───────┘ └───────┘
```

### Output Specifications

**Actor (Policy) Head:**

$$
\pi(a | s) = \text{softmax}(W_\pi \cdot h + b_\pi)
$$

**Critic (Value) Head:**

$$
V(s) = W_v \cdot h + b_v
$$

Where $h$ is the shared hidden representation.
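A minimal PyTorch sketch of this shared-backbone actor-critic, using vector observations instead of the convolutional stack; the class name `ActorCritic` and the helper `a3c_loss` are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared backbone with a softmax policy head and a linear value head (illustrative)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        # Shared feature extractor; swap in the conv stack above for image inputs
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.actor = nn.Linear(hidden, n_actions)   # logits -> softmax gives π(a|s)
        self.critic = nn.Linear(hidden, 1)          # V(s)

    def forward(self, obs: torch.Tensor):
        h = self.backbone(obs)
        dist = torch.distributions.Categorical(logits=self.actor(h))
        return dist, self.critic(h).squeeze(-1)

def a3c_loss(model, obs, actions, returns, value_coef=0.5, entropy_coef=0.01):
    """Three-term A3C loss on one rollout segment (n-step returns R_t assumed precomputed)."""
    dist, values = model(obs)
    advantages = returns - values.detach()          # A_t = R_t - V(s_t); no gradient through the baseline
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = 0.5 * (returns - values).pow(2).mean()
    entropy = dist.entropy().mean()
    return policy_loss + value_coef * value_loss - entropy_coef * entropy

# Quick shape check with random data
model = ActorCritic(obs_dim=4, n_actions=2)
loss = a3c_loss(model, torch.randn(8, 4), torch.randint(0, 2, (8,)), torch.randn(8))
```

The helper mirrors the three-term loss above: the policy term is weighted by a detached advantage so gradients flow only through the actor, the value term is a squared error against the n-step return, and the entropy bonus is subtracted to keep the policy stochastic.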
## Continuous Action Spaces

For continuous control, the actor outputs parameters of a Gaussian distribution:

$$
\pi(a | s) = \mathcal{N}(\mu(s), \sigma(s)^2)
$$

Where:

- $\mu(s)$ = mean action (network output)
- $\sigma(s)$ = standard deviation (can be learned or fixed)

**Sampling:**

$$
a = \mu(s) + \sigma(s) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1)
$$

**Log probability:**

$$
\log \pi(a | s) = -\frac{(a - \mu(s))^2}{2\sigma(s)^2} - \log(\sigma(s)) - \frac{1}{2}\log(2\pi)
$$

## Hyperparameters

| Parameter | Typical Value | Description |
|-----------|---------------|-------------|
| $\gamma$ | 0.99 | Discount factor |
| Learning rate | $10^{-4}$ to $7 \times 10^{-4}$ | Step size for optimization |
| $n$-step | 5 or 20 | Steps before bootstrapping |
| Entropy coef ($c_2$) | 0.01 | Exploration encouragement |
| Value coef ($c_1$) | 0.5 | Value loss weight |
| Max grad norm | 40 | Gradient clipping threshold |
| Workers | 16 | Number of parallel threads |

## Comparison: Before and After A3C

### Before A3C (DQN Era)

- Required massive replay buffers (millions of transitions)
- GPU-bound, single environment
- Off-policy complications
- Discrete actions only (without modifications)
- Memory intensive: $O(10^6)$ transitions stored

### A3C's Contributions

- **CPU-friendly**: Runs efficiently on multi-core CPUs
- **On-policy**: Simpler, more stable gradients
- **Continuous actions**: Natural extension via Gaussian policies
- **Faster wall-clock training**: Parallelism compensates for sample inefficiency
- **Memory efficient**: No replay buffer needed

## Limitations and Evolution

| Limitation | Successor | Solution |
|------------|-----------|----------|
| Noisy gradients from async updates | A2C | Synchronous updates |
| Sample inefficiency (on-policy) | IMPALA | V-trace off-policy correction |
| No trust region | PPO/TRPO | Clipped/constrained updates |
| Hyperparameter sensitivity | PPO | Robust clipped objective |

### A2C (Synchronous Version)

- Waits for all workers to finish before updating
- Enables GPU batching
- Often matches or beats A3C
- Simpler to implement and debug

### PPO (Proximal Policy Optimization)

Clipped surrogate objective:

$$
L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]
$$

Where:

$$
r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}
$$

## Generalized Advantage Estimation (GAE)

An improvement often used with A3C/A2C:

$$
\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}
$$

Where the TD residual is:

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

Properties:

- $\lambda = 0$: One-step TD (high bias, low variance)
- $\lambda = 1$: Monte Carlo (low bias, high variance)
- $\lambda = 0.95$: Common choice balancing bias-variance

## Implementation Considerations

### Gradient Clipping

Essential for stability:

```python
# Clip by global norm
grad_norm = torch.nn.utils.clip_grad_norm_(
    model.parameters(), max_norm=40.0
)
```

### Shared Optimizer State

Use optimizers that handle asynchronous updates:

- **RMSprop** (original paper)
- **Adam** with shared statistics

### Environment Normalization

Normalize observations and rewards:

$$
\hat{s} = \frac{s - \mu_s}{\sigma_s + \epsilon}
$$

$$
\hat{r} = \frac{r}{\sigma_r + \epsilon}
$$
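A minimal running-statistics normalizer along the lines of the formulas above (a sketch assuming a Welford-style streaming update; `RunningNorm` is an illustrative name, not part of A3C itself):

```python
import numpy as np

class RunningNorm:
    """Streaming mean/variance normalizer for observations (Welford-style update)."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.eps = eps

    def update(self, x):
        # Fold one observation into the running mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.var += (delta * (x - self.mean) - self.var) / self.count

    def normalize(self, x):
        return (x - self.mean) / (np.sqrt(self.var) + self.eps)

norm = RunningNorm(shape=(4,))
for obs in np.random.randn(1000, 4):   # e.g. observations collected by one worker
    norm.update(obs)
print(norm.normalize(np.zeros(4)))
```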
## When to Use A3C Today

### Good Use Cases

- Many CPU cores but limited GPU
- Environment simulation is the bottleneck
- Need continuous control with minimal tuning
- Teaching/learning RL fundamentals
- Rapid prototyping

### Better Alternatives

- **PPO**: Most practical applications (robust, simple)
- **SAC**: Continuous control with sample efficiency
- **IMPALA**: Large-scale distributed training
- **DreamerV3**: Model-based with better sample efficiency

## Mathematical Foundations

### Core Equations

**Policy Gradient Theorem:**

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \cdot A^{\pi_\theta}(s, a) \right]
$$

**Advantage Function:**

$$
A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)
$$

**Bellman Equation for V:**

$$
V^\pi(s) = \mathbb{E}_{a \sim \pi} \left[ r(s, a) + \gamma V^\pi(s') \right]
$$

**Total Loss:**

$$
L(\theta) = -\mathbb{E}_t \left[ \log \pi_\theta(a_t|s_t) A_t \right] + c_1 \mathbb{E}_t \left[ (V_\theta(s_t) - R_t)^2 \right] - c_2 H(\pi_\theta(\cdot|s_t))
$$

## Quick Reference Card

```
┌─────────────────────────────────────────────────────────┐
│ A3C QUICK REFERENCE                                      │
├─────────────────────────────────────────────────────────┤
│ Algorithm Type:   On-policy, Actor-Critic                │
│ Action Space:     Discrete or Continuous                 │
│ Parallelization:  Asynchronous multi-threading           │
│ Memory:           No replay buffer                       │
│ Hardware:         CPU-friendly                           │
├─────────────────────────────────────────────────────────┤
│ Key Hyperparameters:                                     │
│   γ (discount)    = 0.99                                 │
│   lr              = 1e-4 to 7e-4                         │
│   n-step          = 5 or 20                              │
│   entropy_coef    = 0.01                                 │
│   value_coef      = 0.5                                  │
│   max_grad_norm   = 40                                   │
│   num_workers     = 16                                   │
├─────────────────────────────────────────────────────────┤
│ Loss = -log(π) * A + 0.5 * (R - V)² - 0.01 * H(π)        │
└─────────────────────────────────────────────────────────┘
```

ab initio simulation, simulation

First-principles quantum mechanical calculations.

abc analysis, abc, supply chain & logistics

ABC analysis categorizes inventory by value and usage, prioritizing management attention on the high-value items that contribute most to costs.

abductive reasoning,reasoning

Infer the most likely explanation for an observation.

aberration-corrected tem, metrology

Ultra-high resolution TEM with correctors.

ablation cam, explainable ai

Generate class activation maps by ablating feature maps and measuring the drop in class score, rather than using gradients.

ablation study,analysis,what matters

Design ablations to see which components (data, features, modules) really drive performance.

ablation,experiment,study

Ablation studies isolate the impact of each component: remove or modify one thing at a time and measure the effect. This brings scientific rigor to understanding what matters.

ablation,remove,contribution

Ablation removes or zeroes out components to measure their contribution; an essential interpretability technique.

absolute grading,evaluation

Score a single output on an absolute scale rather than comparing it against another output.

absorbing state diffusion, generative models

Diffusion where tokens gradually become mask tokens.

abstention,ai safety

Refuse to answer when uncertain.

abstract interpretation for neural networks, ai safety

Sound over-approximation of network behavior.

abstract interpretation,software engineering

Soundly approximate program behavior.

abtest,online eval,rollout,canary

Use A/B tests and canary rollouts to compare models safely. Start with a small traffic slice, check metrics + human feedback, then scale up.
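A minimal sketch of the traffic-splitting step (the `routes` table and `pick_model` helper are hypothetical, not a specific serving framework's API):

```python
import random

# Hypothetical registry of model versions and their traffic shares
routes = {"model_v1": 0.95, "model_v2_canary": 0.05}   # start the canary on a small slice

def pick_model(routes):
    """Route one request to a model version in proportion to its traffic share."""
    versions, weights = zip(*routes.items())
    return random.choices(versions, weights=weights, k=1)[0]

print(pick_model(routes))

# Once metrics and human feedback look healthy, widen the canary slice
routes = {"model_v1": 0.75, "model_v2_canary": 0.25}
```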

ac parametric, ac, advanced test & probe

AC parametric tests characterize frequency-dependent behavior such as capacitance, propagation delay, and switching characteristics.

ac termination, ac, signal & power integrity

AC termination uses a series capacitor that blocks DC while providing high-frequency impedance matching.

ac testing,testing

Test dynamic electrical characteristics.

accelerate,distributed,huggingface

Hugging Face Accelerate simplifies distributed training, handling multi-GPU, TPU, and mixed-precision setups with minimal code changes.
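A typical usage pattern, sketched here with a toy model and dataset (the model, optimizer, and data are placeholders; only the `Accelerator` calls are the library's API):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Placeholder model and data, purely for illustration
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

accelerator = Accelerator()  # device placement, DDP, and mixed precision come from `accelerate config`
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for inputs, targets in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)   # replaces loss.backward()
    optimizer.step()
```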

accelerated life test design, reliability

Plan accelerated testing.

accelerated life test, alt, reliability

Test under stress.

accelerated life testing,reliability

Speed up aging to test reliability.

accelerated testing correlation, reliability

Relate accelerated tests to field life.

accelerated testing, business & standards

Accelerated testing uses stress to induce failures faster than normal operation.

accelerated thermal cycling, reliability

Speed up thermal cycling tests.

acceleration factor, business & standards

Acceleration factors relate accelerated test conditions to normal use conditions, enabling prediction of field reliability.

acceleration factor, reliability

Ratio of time to failure at normal use conditions to time to failure under accelerated stress.

acceleration voltage,implant

Energy given to ions before they hit the wafer (keV to MeV).

accent adaptation, audio & speech

Accent adaptation fine-tunes ASR models for specific accents or dialects using targeted data.

accent removal, nlp

Remove diacritics.

accept/reject criteria, reliability

Decision rules for reliability demonstration testing (RDT).

acceptance control charts, spc

Control charts that combine statistical process control with acceptance sampling criteria.

acceptance rate, inference

Fraction of draft-model tokens accepted by the target model during speculative decoding.

acceptance sampling, quality & reliability

Acceptance sampling uses sampling plans to accept or reject lots based on measured defect rates.

accessibility,a11y,audit

AI-assisted accessibility audits with WCAG compliance suggestions.

accordion, distributed training

Adaptive communication-computation tradeoff.

accountability,ethics

Clear responsibility for AI system outcomes.

accuracy, evaluation

Accuracy measures the fraction of correct predictions.

accuracy,metrology

How close measurement is to true value.

acid contamination, contamination

Acidic vapor contamination.

acid exhaust,facility

Dedicated exhaust system for corrosive acidic fumes.

acid gas scrubbing, environmental & sustainability

Acid gas scrubbing neutralizes acidic vapors by contacting them with an alkaline solution.

acid neutralization, environmental & sustainability

Acid neutralization treats acidic waste streams by adding bases, precipitating metals, and adjusting pH before discharge.

acid neutralization,facility

Treat acidic waste streams to safe pH.