AI Factory Glossary

9,967 technical terms and definitions

full factorial design,doe

Test all combinations of factor levels.
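
As a minimal sketch of what "all combinations of factor levels" means, the run list can be enumerated with a Cartesian product. The factor names and levels below are hypothetical and chosen only for illustration.

```python
from itertools import product


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    level_lists = (factors[name] for name in names)
    return [dict(zip(names, levels)) for levels in product(*level_lists)]


# Hypothetical factors and levels (not taken from the glossary).
factors = {
    "temperature_c": [350, 400],
    "pressure_torr": [1, 5, 10],
    "time_s": [30, 60],
}
runs = full_factorial(factors)
print(len(runs))  # number of runs is the product of the level counts
```

The run count grows multiplicatively with each added factor, which is why full factorial designs are usually reserved for a small number of factors.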

full scan, design & verification

Full scan makes all flip-flops scannable, achieving maximum fault coverage.

full wafer test, testing

Test all dies on wafer.

full-grad, explainable ai

Use full gradient information for saliency.

function calling api,ai agent

LLM outputs structured function calls with parameters to invoke external tools.

function calling formatting, tool use

Structure API calls correctly.

function calling, prompting techniques

Function calling allows models to invoke structured functions with parameters.

function calling,tool use,json

Function calling: model outputs structured JSON to invoke tools. Parse JSON, execute function, return result to model.
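
The parse, execute, and return steps can be sketched as follows. This is a minimal illustration, not any vendor's API: the `get_weather` tool, the `TOOLS` registry, and the `{"name": ..., "arguments": ...}` message shape are assumptions made for the example.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would call an external service."""
    return {"city": city, "forecast": "sunny"}


# Registry mapping tool names to Python callables (assumed layout).
TOOLS = {"get_weather": get_weather}


def handle_function_call(model_output: str) -> str:
    """Parse the model's JSON, run the named tool, and serialize the result."""
    try:
        call = json.loads(model_output)
        func = TOOLS[call["name"]]
        result = func(**call.get("arguments", {}))
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        # Report malformed calls back to the model instead of crashing.
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return json.dumps(result)


model_output = '{"name": "get_weather", "arguments": {"city": "Oslo"}}'
tool_message = handle_function_call(model_output)
print(tool_message)
```

The returned string would be appended to the conversation as the tool result so the model can continue from it. Returning errors as JSON gives the model a chance to correct a malformed call.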

functional causal models, time series models

Functional causal models represent causal relationships as structural equations with noise terms.

functional test vectors, advanced test & probe

Functional test vectors are input patterns applied to devices to verify logic functionality and detect manufacturing defects.

functional test, advanced test & probe

Functional testing validates digital logic or circuit behavior at speed using test vectors that exercise operational functionality.

functional testing,testing

Verify circuit operates correctly.

functional yield loss, production

Loss from functional failures.

funnel transformer, transformer

Progressively reduce sequence length.

furnace anneal,implant

Longer anneal in batch furnace for diffusion and activation.

fuse programming, yield enhancement

Fuse programming permanently disconnects defective memory elements and enables redundant replacements by laser or electrical blowing.

fused attention, optimization

Optimized attention kernel.

fused layernorm, optimization

Efficient normalization implementation.

fused operations, optimization

Combine multiple ops into one.

fusion bonding, advanced packaging

Silicon oxide bonds formed by thermal treatment.

fusion-in-decoder (fid),fusion-in-decoder,fid,rag

Process retrieved documents separately then fuse.

future,agi,superintelligence

AGI timelines are uncertain. Focus on current capabilities and near-term safety. Prepare, but don't over-hype.

fuzzing input generation, code ai

Generate inputs for fuzzing.

fuzzing with llms,software testing

Generate test inputs to find bugs.

fuzzy deduplication, data quality

Remove similar examples.
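
One common way to decide that two examples are "similar" is Jaccard similarity over word n-grams (shingles). The sketch below assumes that measure, a hypothetical threshold of 0.6, and a simple quadratic scan. Large corpora typically use approximate methods such as MinHash instead.

```python
def shingles(text, n=3):
    """Set of lowercase word n-grams; short texts fall back to one shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedup(texts, threshold=0.6):
    """Keep a text only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for text in texts:
        s = shingles(text)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog today",
    "gradient descent minimizes a loss function",
]
print(fuzzy_dedup(docs))
```

The second document shares seven of eight shingles with the first, so it is dropped. The third shares none and is kept.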

gage capability, metrology

Measurement system capability.

gage r&r (repeatability and reproducibility),gage r&r,repeatability and reproducibility,quality

Evaluate measurement system variation.

gage r&r study, quality

Quantify measurement system variation.

gage r&r, quality & reliability

Gage repeatability and reproducibility studies assess measurement system variation.

gage repeatability, quality

Same operator same part.

gage reproducibility, quality

Different operators same part.

gaia benchmark, gaia, ai agents

GAIA tests general AI assistants on questions requiring reasoning and tool use.

gail, generative adversarial imitation learning, reinforcement learning advanced, imitation learning, adversarial training, advanced rl

# GAIL: Generative Adversarial Imitation Learning

## Advanced Reinforcement Learning Guide

## 1. Introduction and Core Concept

GAIL (Generative Adversarial Imitation Learning), introduced by Ho and Ermon (2016), is an imitation learning algorithm that combines ideas from **inverse reinforcement learning (IRL)** and **generative adversarial networks (GANs)** to learn policies directly from expert demonstrations.

The fundamental insight is that imitation learning can be cast as a **distribution matching problem**: we want the state-action occupancy measure of our learned policy to match that of the expert.

## 2. The Occupancy Measure Perspective

### 2.1 Definition

For a policy $\pi$, the **occupancy measure** $\rho_\pi(s,a)$ represents the (unnormalized, discounted) distribution of state-action pairs encountered when following $\pi$:

$$
\rho_\pi(s,a) = \pi(a|s) \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \pi)
$$

Where:

- $\pi(a|s)$: policy, the probability of taking action $a$ in state $s$
- $\gamma$: discount factor, $\gamma \in [0, 1)$
- $P(s_t = s \mid \pi)$: probability of being in state $s$ at time $t$ under policy $\pi$

### 2.2 Key Theoretical Result

There exists a **bijection** between policies and valid occupancy measures:

- Every policy induces a unique occupancy measure
- Every valid occupancy measure corresponds to a unique policy

**Implication:** Matching occupancy measures $\Leftrightarrow$ matching policies

$$
\rho_\pi = \rho_{\pi_E} \iff \pi \equiv \pi_E
$$

## 3. From Inverse RL to GAIL

### 3.1 Maximum Entropy IRL Formulation

Traditional Maximum Entropy IRL solves the following optimization:

$$
\max_{c \in \mathcal{C}} \left( \min_\pi -H(\pi) + \mathbb{E}_\pi[c(s,a)] \right) - \mathbb{E}_{\pi_E}[c(s,a)]
$$

Where:

- $c(s,a)$: cost function to be learned
- $H(\pi)$: causal entropy of policy $\pi$
- $\pi_E$: expert policy
- $\mathcal{C}$: set of candidate cost functions

### 3.2 The Computational Problem

This is computationally expensive because:

- The inner RL problem must be solved **to completion** for each update to the cost function
- It requires nested optimization loops
- It scales poorly to complex environments

### 3.3 Ho & Ermon's Key Insight

Adding a cost regularizer $\psi$ to the IRL problem turns the combined IRL-then-RL procedure into occupancy measure matching, with the mismatch measured by the convex conjugate $\psi^*$. For a specific choice of $\psi$, the problem reduces (up to an additive constant) to minimizing the **Jensen-Shannon divergence** with an entropy bonus:

$$
\min_\pi D_{JS}(\rho_\pi \| \rho_{\pi_E}) - \lambda H(\pi)
$$

The Jensen-Shannon divergence is defined as:

$$
D_{JS}(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)
$$

Where $M = \frac{1}{2}(P + Q)$ is the mixture distribution.

## 4. The GAIL Algorithm

### 4.1 Adversarial Framework

GAIL operationalizes the distribution matching as an adversarial game between two networks.

#### Discriminator Objective

The discriminator $D_\phi(s,a)$ is trained to distinguish expert state-action pairs from policy-generated ones:

$$
\max_{D_\phi} \mathbb{E}_{(s,a) \sim \pi_E}[\log D_\phi(s,a)] + \mathbb{E}_{(s,a) \sim \pi_\theta}[\log(1 - D_\phi(s,a))]
$$

#### Policy Objective

The policy $\pi_\theta$ is trained via policy gradient methods using the discriminator's output as a reward signal:

$$
\max_{\pi_\theta} \mathbb{E}_{(s,a) \sim \pi_\theta}[\log D_\phi(s,a)] + \lambda H(\pi_\theta)
$$

Where $\lambda$ is an entropy regularization coefficient.
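
The link between the discriminator objective and the Jensen-Shannon divergence can be checked numerically on a toy problem. The sketch below assumes discrete, normalized distributions over three state-action pairs (the numbers are made up). It plugs in the optimal discriminator $D^* = \rho_{\pi_E} / (\rho_{\pi_E} + \rho_\pi)$ and compares the objective value with $2 D_{JS} - \log 4$, the standard GAN identity.

```python
import math


def kl(p, q):
    """KL divergence between discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence via the mixture M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def discriminator_objective(p_expert, p_policy):
    """E_expert[log D] + E_policy[log(1 - D)] at the optimal discriminator.

    Assumes every (s, a) has positive mass under at least one distribution.
    """
    total = 0.0
    for pe, pp in zip(p_expert, p_policy):
        d = pe / (pe + pp)  # optimal discriminator output for this (s, a)
        if pe > 0:
            total += pe * math.log(d)
        if pp > 0:
            total += pp * math.log(1 - d)
    return total


# Toy occupancy distributions over three state-action pairs (illustrative).
expert = [0.7, 0.2, 0.1]
policy = [0.1, 0.3, 0.6]

objective = discriminator_objective(expert, policy)
identity = 2 * js(expert, policy) - math.log(4)
print(objective, identity)
```

The two printed values agree up to floating-point error. When the policy matches the expert, the divergence is zero and the objective bottoms out at $-\log 4$.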
### 4.2 Reward Formulations

Two reward formulations are commonly used. Both increase monotonically with $D_\phi(s,a)$, but they are not equivalent: the first is always positive, while the second can take either sign.

**Formulation 1 (Log-likelihood):**

$$
r(s,a) = -\log(1 - D_\phi(s,a))
$$

**Formulation 2 (Log-odds ratio):**

$$
r(s,a) = \log D_\phi(s,a) - \log(1 - D_\phi(s,a))
$$

### 4.3 Algorithm Pseudocode

```
Algorithm: GAIL
─────────────────────────────────────────────────────────
Input: Expert trajectories τ_E, initial policy π_θ, discriminator D_φ

1. Initialize policy parameters θ and discriminator parameters φ
2. For iteration i = 1, 2, ..., N do:
   2.1 Sample trajectories τ_i ~ π_θ from current policy
   2.2 Update discriminator φ via gradient ascent:
       ∇_φ [ E_{τ_E}[log D_φ(s,a)] + E_{τ_i}[log(1 - D_φ(s,a))] ]
   2.3 Compute rewards: r(s,a) = -log(1 - D_φ(s,a))
   2.4 Update policy θ using TRPO/PPO with rewards r(s,a)
3. Return: Learned policy π_θ
─────────────────────────────────────────────────────────
```

## 5. Theoretical Properties

### 5.1 Convergence Guarantee

At the **Nash equilibrium** of the adversarial game:

$$
\rho_{\pi^*} = \rho_{\pi_E}
$$

For a fixed policy, the optimal discriminator is the ratio below, which equals 0.5 everywhere once the occupancy measures match:

$$
D^*(s,a) = \frac{\rho_{\pi_E}(s,a)}{\rho_{\pi_E}(s,a) + \rho_\pi(s,a)} = 0.5
$$

### 5.2 Reward Ambiguity

Like all IRL methods, GAIL faces **reward ambiguity**:

- Many reward functions can explain the same behavior
- The set of equivalent rewards forms an equivalence class

GAIL sidesteps this by:

- Never explicitly recovering a reward function
- Using the discriminator as an implicit, adaptive reward signal

### 5.3 Sample Efficiency Analysis

**Behavioral Cloning (BC):**

$$
\text{Error}_{\text{BC}} = O\left(\frac{|S|}{N_{\text{expert}}}\right)
$$

**GAIL:**

$$
\text{Error}_{\text{GAIL}} = O\left(\frac{1}{\sqrt{N_{\text{expert}}}}\right)
$$

GAIL achieves better dependence on expert data because it interacts with the environment and so leverages the MDP structure.

## 6. Advanced Extensions

### 6.1 AIRL (Adversarial Inverse Reinforcement Learning)

Fu et al. (2018) modified GAIL to recover **disentangled, transferable** reward functions:

$$
D_{\theta,\phi}(s,a,s') = \frac{\exp(f_{\theta,\phi}(s,a,s'))}{\exp(f_{\theta,\phi}(s,a,s')) + \pi(a|s)}
$$

The function $f_{\theta,\phi}$ can be decomposed:

$$
f_{\theta,\phi}(s,a,s') = g_\theta(s,a) + \gamma h_\phi(s') - h_\phi(s)
$$

Where:

- $g_\theta(s,a)$: reward approximator
- $h_\phi(s)$: shaping potential function

**Key benefit:** Enables reward transfer across different dynamics.

### 6.2 InfoGAIL

Addresses **multimodal expert behavior** by adding a latent code $c$ (not to be confused with the cost function $c(s,a)$ of Section 3.1).

**Objective:**

$$
\max_\pi \; I(c; \tau) - D_{JS}(\rho_\pi \| \rho_{\pi_E}), \qquad c \sim p(c), \; \tau \sim \pi(\cdot \mid c)
$$

Where $I(c; \tau)$ is the mutual information between latent codes and trajectories.

**Capabilities:**

- Discovers distinct strategies from mixed demonstrations
- Reproduces different expert modes with different latent codes
- Enables controllable imitation

### 6.3 Off-Policy GAIL Variants

Standard GAIL requires **on-policy** samples, which makes it expensive in environment interactions. Extensions include:

#### DAC (Discriminator-Actor-Critic)

Combines GAIL with an off-policy actor-critic, training the critic on replayed transitions by minimizing a squared Bellman residual:

$$
\mathcal{L}_{\text{DAC}} = \mathbb{E}_{(s,a) \sim \mathcal{B}}\left[\left(Q_\phi(s,a) - r_D(s,a) - \gamma \mathbb{E}_{s'}[V_\phi(s')]\right)^2\right]
$$

Where $\mathcal{B}$ is a replay buffer and $r_D$ is the discriminator-based reward.
#### ValueDICE

Uses distribution correction estimation:

$$
\min_\pi D_{KL}\left(\rho_\pi \| \rho_{\pi_E}\right) \approx \min_\pi \max_\nu \mathbb{E}_{\rho_\pi}[\nu(s,a)] - \log \mathbb{E}_{\rho_{\pi_E}}[e^{\nu(s,a)}]
$$

### 6.4 PWIL (Primal Wasserstein Imitation Learning)

Replaces Jensen-Shannon divergence with **Wasserstein distance**:

$$
W_1(\rho_\pi, \rho_{\pi_E}) = \inf_{\xi \in \Pi(\rho_\pi, \rho_{\pi_E})} \mathbb{E}_{(x,y) \sim \xi}[\|x - y\|]
$$

Where $\Pi(\rho_\pi, \rho_{\pi_E})$ is the set of couplings of the two distributions. The coupling is written $\xi$ here to avoid a clash with the discount factor $\gamma$.

**Advantages:**

- More stable gradients when distributions have limited overlap
- Better behavior in high-dimensional spaces
- Does not require adversarial training

### 6.5 SQIL (Soft Q Imitation Learning)

A simplified approach:

$$
r(s,a) = \begin{cases} +1 & \text{if } (s,a) \in \mathcal{D}_{\text{expert}} \\ 0 & \text{if } (s,a) \in \mathcal{D}_{\text{agent}} \end{cases}
$$

Then run soft Q-learning. Surprisingly effective, and it avoids discriminator training instabilities.

## 7. Practical Challenges and Solutions

### 7.1 Mode Collapse / Reward Hacking

**Problem:** Policy finds degenerate solutions that fool the discriminator without actually imitating the expert.

**Solutions:**

- Gradient penalties (WGAN-GP style), with penalty weight $\lambda_{\text{GP}}$:

  $$
  \mathcal{L}_{\text{GP}} = \lambda_{\text{GP}} \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[\left(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\right)^2\right]
  $$

- Spectral normalization of discriminator weights
- Careful architecture design with limited discriminator capacity

### 7.2 Discriminator Overfitting

**Problem:** With limited expert data, the discriminator memorizes rather than generalizes.

**Solutions:**

- Dropout regularization: $p_{\text{drop}} \in [0.1, 0.5]$
- Data augmentation on state observations
- Limiting discriminator capacity (fewer layers/units)
- Early stopping based on validation performance

### 7.3 Reward Signal Instability

**Problem:** As the discriminator improves, its output saturates ($D \approx 0$ on policy samples, $D \approx 1$ on expert samples), so the reward signal becomes uninformative.
**Solutions:**

- Gradient penalty regularization
- Reward clipping: $r(s,a) \leftarrow \text{clip}(r(s,a), -R_{\max}, R_{\max})$
- Reward normalization with running statistics
- Soft labels for discriminator training

### 7.4 Covariate Shift

**Problem:** Early in training, the policy visits very different states than the expert.

**Solutions:**

- Curriculum learning (start from expert states)
- Demonstrations covering diverse initial conditions
- State resetting to expert states during training
- Importance weighting of samples

### 7.5 Hyperparameter Sensitivity

**Critical hyperparameters:**

| Parameter | Typical Range | Notes |
|-----------|---------------|-------|
| Discriminator LR | $10^{-4}$ to $10^{-3}$ | Often lower than policy LR |
| Policy LR | $3 \times 10^{-4}$ | Standard for PPO |
| Discriminator updates per policy update | 1-5 | More can cause instability |
| Entropy coefficient $\lambda$ | 0.001 to 0.01 | Encourages exploration |
| Batch size | 64-2048 | Larger for stability |

## 8. Comparison with Other Methods

### 8.1 Method Comparison Table

| Method | Expert Data Needed | Env. Interactions | Recovers Reward | Online Learning |
|--------|--------------------|-------------------|-----------------|-----------------|
| Behavioral Cloning | High | None | No | No |
| DAgger | Medium | Expert queries | No | Yes |
| MaxEnt IRL | Low | Many | Yes | Yes |
| GAIL | Low | Many | No (implicit) | Yes |
| AIRL | Low | Many | Yes | Yes |
| SQIL | Low | Many | No | Yes |

### 8.2 Sample Complexity Comparison

**Expert demonstrations required for an $\epsilon$-optimal policy:**

- Behavioral Cloning: $O(|S|^2 / \epsilon^2)$
- GAIL: $O(1 / \epsilon^2)$
- AIRL: $O(1 / \epsilon^2)$

**Environment interactions required:**

- GAIL/AIRL: $O(\text{poly}(|S|, |A|, H) / \epsilon^2)$

Where $H$ is the horizon length.

## 9. When to Use GAIL

### 9.1 Good Fit

- Limited expert demonstrations available (fewer than 100 trajectories)
- Can interact extensively with the environment or a simulator
- Expert behavior is unimodal (or use InfoGAIL for multimodal behavior)
- Don't need an interpretable reward function
- Continuous control problems
- Complex state spaces where BC fails

### 9.2 Poor Fit

- No simulator available (pure offline setting)
- Need to transfer learned behavior to different dynamics
- Expert demonstrations are highly multimodal without labels
- Need sample efficiency in environment interactions
- Reward function interpretability is required
- Very limited computational budget

## 10. Mathematical Derivations

### 10.1 Occupancy Measure Properties

The occupancy measure defined in Section 2.1 satisfies the **Bellman flow constraint**:

$$
\sum_a \rho(s,a) = p_0(s) + \gamma \sum_{s',a'} P(s|s',a') \rho(s',a')
$$

Where $p_0(s)$ is the initial state distribution. If the occupancy measure is instead normalized by $(1-\gamma)$ so that it sums to one, the first term becomes $(1-\gamma) p_0(s)$.

### 10.2 Dual Formulation

The GAIL objective can be written as a saddle-point (min-max) problem. Because the policy minimizes, the entropy bonus enters with a negative sign:

$$
\min_\pi \max_D \mathbb{E}_{\rho_\pi}[\log(1 - D(s,a))] + \mathbb{E}_{\rho_{\pi_E}}[\log D(s,a)] - \lambda H(\pi)
$$

At optimality:

$$
D^*(s,a) = \frac{\rho_{\pi_E}(s,a)}{\rho_\pi(s,a) + \rho_{\pi_E}(s,a)}
$$

### 10.3 Policy Gradient for GAIL

Using the REINFORCE estimator:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \hat{A}_t \right]
$$

Where the advantage is computed from GAIL rewards with generalized advantage estimation. The GAE parameter is written $\lambda_{\text{GAE}}$ to distinguish it from the entropy coefficient $\lambda$:

$$
\hat{A}_t = \sum_{k=0}^{T-t} (\gamma \lambda_{\text{GAE}})^k \delta_{t+k}
$$

$$
\delta_t = r_D(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)
$$

## 11. Implementation Checklist

### 11.1 Network Architectures

**Discriminator:**

- Input: $(s, a)$ concatenated
- Hidden layers: 2-3 layers, 256-512 units each
- Activation: Tanh or ReLU
- Output: Sigmoid for $D(s,a) \in (0,1)$

**Policy (Actor):**

- Input: State $s$
- Hidden layers: 2-3 layers, 256-512 units
- Output: Gaussian parameters $(\mu, \sigma)$ for continuous actions

**Value Function (Critic):**

- Input: State $s$
- Hidden layers: 2-3 layers, 256-512 units
- Output: Scalar $V(s)$

### 11.2 Training Loop

```python
# Pseudocode structure: policy, discriminator, env, expert_demos, values,
# collect_trajectories, sample, compute_disc_loss, compute_gae, and log are
# placeholders to be supplied by the surrounding implementation.
for iteration in range(num_iterations):
    # 1. Collect trajectories with the current policy
    trajectories = collect_trajectories(policy, env, num_steps)

    # 2. Update discriminator
    for _ in range(disc_updates):
        expert_batch = sample(expert_demos)
        policy_batch = sample(trajectories)
        disc_loss = compute_disc_loss(expert_batch, policy_batch)
        discriminator.update(disc_loss)

    # 3. Compute GAIL rewards (elementwise over all collected steps)
    rewards = -log(1 - discriminator(trajectories))

    # 4. Update policy with PPO; "lambda" is a reserved word in Python,
    #    so the GAE parameter is named gae_lambda
    advantages = compute_gae(rewards, values, gamma, gae_lambda)
    policy.update(trajectories, advantages)
```

## 12. Recent Research Directions

### 12.1 Offline Imitation Learning

Learning from fixed datasets without environment interaction:

$$
\min_\pi D_f(\rho_\pi \| \rho_{\pi_E}) + \alpha \cdot \text{Constraint}(\pi, \mathcal{D})
$$

### 12.2 GAIL from Observations Only

Learning without action labels, using only state sequences:

$$
\min_\pi D_{JS}(\rho_\pi^s \| \rho_{\pi_E}^s)
$$

Where $\rho^s$ denotes the marginal state occupancy.

### 12.3 Multi-Agent GAIL

Extending to settings with multiple interacting agents:

$$
\min_{\pi_1, \ldots, \pi_n} \sum_{i=1}^{n} D_{JS}(\rho_{\pi_i} \| \rho_{\pi_E^i})
$$

### 12.4 Model-Based GAIL

Using learned dynamics models to improve sample efficiency:

$$
\hat{P}(s'|s,a) \approx P(s'|s,a)
$$

This enables planning and reduces real environment interactions.
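
As a concrete companion to the reward and advantage formulas in this entry, the sketch below computes the reward $r = -\log(1 - D)$ and generalized advantage estimates for a single short rollout. It is a minimal pure-Python illustration under stated assumptions: the discriminator outputs and critic values are made-up numbers, and `gail_reward` and `compute_gae` are hypothetical helper names rather than functions from any library.

```python
import math


def gail_reward(d, eps=1e-8):
    """r = -log(1 - D), with 1 - D clamped away from 0 for numerical safety."""
    return -math.log(max(1.0 - d, eps))


def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout.

    `values` has one more entry than `rewards`: the last entry is the
    bootstrap value of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Recursive form of the sum over (gamma * lam)^k * delta_{t+k}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical discriminator outputs and critic values for a 3-step rollout.
disc_outputs = [0.5, 0.9, 0.1]
values = [0.0, 0.0, 0.0, 0.0]

rewards = [gail_reward(d) for d in disc_outputs]
advantages = compute_gae(rewards, values, gamma=1.0, lam=1.0)
print(rewards, advantages)
```

With `gamma = 1`, `lam = 1`, and zero values, each advantage reduces to the sum of the remaining rewards, which gives a quick sanity check. In a real implementation these operations would run on batched tensors and feed a PPO or TRPO update.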

galactica,meta,scientific

Galactica is Meta's scientific knowledge model, trained on papers.

galvanic corrosion, reliability

Corrosion from dissimilar metals.

gamma-gate (γ-gate),gamma-gate,γ-gate,rf design

Gate shape optimizing current flow.

gan anomaly ts, gan, time series models

GANs for time series anomaly detection learn the distribution of normal data and flag samples with low discriminator scores as anomalous.

gan inversion, gan, generative models

Find latent code for real image.

gan inversion, gan, multimodal ai

GAN inversion finds latent codes that reconstruct real images, enabling editing.

gan time series, gan, time series models

GANs for time series learn the distribution of normal patterns, flagging samples that the discriminator rejects as anomalies.

gan-gcl, gan-gcl, reinforcement learning advanced

Generative adversarial imitation from observation learns policies from state-only demonstrations without action labels.

gang bonding, packaging

Bond multiple connections simultaneously.

gang scheduling, infrastructure

Schedule related jobs together.

gans for data augmentation, data analysis

Create realistic training data.

gantt chart, quality & reliability

Gantt charts display project schedules showing task timing and relationships.

gap fill, process integration

Gap fill techniques ensure complete filling of narrow high-aspect-ratio features with dielectrics or metals without creating voids.

gap fill,cvd

Filling high-aspect-ratio trenches without voids.

gap sentence generation, nlp

Generate missing sentences.

garat, garat, reinforcement learning advanced

Guided adversarial reward augmented trajectory learning balances imitation and exploration through adaptive reward shaping.

garbage collection,gc,memory

Garbage collection frees unused memory. GC pauses can hurt latency. Tune the collector or use manual memory management.
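
One way to keep collector pauses out of a latency-sensitive path, sketched here with Python's standard `gc` module, is to pause the cyclic collector during the critical section and collect explicitly afterwards. The helper name `run_without_gc_pauses` is hypothetical. Note that in CPython this only affects the cycle collector, since reference counting still frees most objects immediately.

```python
import gc


def run_without_gc_pauses(critical_section):
    """Run a callable with the cyclic collector paused, then collect once."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        return critical_section()
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # pay the collection cost outside the critical path


result = run_without_gc_pauses(lambda: sum(i * i for i in range(1000)))
print(result)
```

For tuning rather than pausing, `gc.get_threshold()` and `gc.set_threshold()` adjust how often each generation is collected.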