Adversarial training for safety (AI safety)
Adversarial training includes adversarial examples in the training set, making models more robust to attacks at the cost of additional training compute.
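A minimal sketch of one common form of adversarial training (FGSM-style perturbations, PyTorch assumed); `model`, `optimizer`, and the 50/50 clean–adversarial mix are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimizer, epsilon=0.03):
    # Craft adversarial examples with the Fast Gradient Sign Method (FGSM).
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    grad = torch.autograd.grad(loss, x_adv)[0]
    x_adv = (x_adv + epsilon * grad.sign()).detach()

    # Train on a mix of clean and adversarial inputs (the expensive part:
    # every step pays for an extra forward/backward pass to build x_adv).
    optimizer.zero_grad()
    total = 0.5 * F.cross_entropy(model(x), y) + \
            0.5 * F.cross_entropy(model(x_adv), y)
    total.backward()
    optimizer.step()
    return total.item()
```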
Perturb weights for robustness.
Find medication side effects in text.
Attention Free Transformer uses element-wise operations instead of attention.
Agent approval requires human confirmation before executing high-stakes actions.
Agent benchmarking evaluates performance across standardized task suites.
Agent communication protocols enable information exchange and coordination between agents.
Agent debugging identifies and resolves issues in planning and execution logic.
Feedback loops allow humans to correct and guide agent behavior iteratively.
Agent handoff transfers responsibility for tasks between agents smoothly.
Agent logging records decisions, actions, and reasoning for debugging and auditing.
Agent loops repeatedly observe, plan, act, and update until objectives are achieved.
Agent memory maintains conversation history, observations, and learned information across interactions.
Agent negotiation resolves conflicts through offers, counteroffers, and compromise.
Agent protocols standardize interfaces for agent interoperability.
Stopping criteria define conditions when agents should terminate execution.
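A minimal sketch of the agent loop, memory, and stopping criteria described in the entries above; the `environment` and `planner` objects and their methods are hypothetical placeholders.

```python
def run_agent(environment, planner, max_steps=20):
    memory = []                                      # agent memory across steps
    observation = environment.observe()
    for step in range(max_steps):                    # stopping criterion: step budget
        action = planner.plan(observation, memory)   # plan the next action
        result = environment.act(action)             # execute it
        memory.append((observation, action, result)) # log and update memory
        if result.get("done"):                       # stopping criterion: objective met
            break
        observation = environment.observe()          # observe again and repeat
    return memory
```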
Models semiconductor fab operations using autonomous agents.
AgentBench provides comprehensive evaluation framework for LLM-based agents.
Aggregate functions in GNNs combine neighbor information using operations like sum, mean, max, or attention.
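A minimal sketch of these aggregation operations over one node's neighbor features, in plain PyTorch (no specific GNN library assumed); the attention branch uses toy scores for illustration.

```python
import torch

def aggregate(neighbor_feats, mode="mean"):
    """Combine a [num_neighbors, dim] tensor of neighbor features into one vector."""
    if mode == "sum":
        return neighbor_feats.sum(dim=0)
    if mode == "mean":
        return neighbor_feats.mean(dim=0)
    if mode == "max":
        return neighbor_feats.max(dim=0).values
    if mode == "attention":
        # Weight neighbors by softmax scores (derived here from the features
        # themselves; real GNN layers learn these scores).
        scores = torch.softmax(neighbor_feats.sum(dim=-1), dim=0)
        return (scores.unsqueeze(-1) * neighbor_feats).sum(dim=0)
    raise ValueError(f"unknown mode: {mode}")
```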
The EU AI Act regulates AI systems by risk level; high-risk systems must meet strict compliance requirements, and the Act is setting a global regulatory trend.
Framework for protecting people from algorithmic harm.
AI feedback uses model-generated evaluations to train or align other models.
Purpose-built systems for AI training.
Aider is an AI pair-programming tool that runs in the terminal, editing local files with an LLM.
Tool for aerial image inspection.
Ultra-stable surface for metrology.
Number of times cleanroom air is completely replaced per hour.
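A quick worked example of the air-changes-per-hour calculation (hourly supply airflow divided by room volume); the numbers are arbitrary.

```python
airflow_cfm = 2000                          # supply airflow, cubic feet per minute
room_volume_ft3 = 4000                      # cleanroom volume, cubic feet
ach = airflow_cfm * 60 / room_volume_ft3
print(f"Air changes per hour: {ach:.0f}")   # 30
```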
Use air (k = 1) as an insulator between metal lines for the lowest capacitance.
Enclosed space that blows high-velocity air to remove particles before cleanroom entry.
Gaseous contaminants.
Apache Airflow orchestrates data pipelines using DAGs to define task dependencies; it is a standard tool for ETL workflows.
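A minimal sketch of an Airflow DAG with two dependent tasks; the dag_id, task names, and the extract/load callables are hypothetical placeholders.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling source data")

def load():
    print("writing to the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task   # the >> edge defines the dependency in the DAG
```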
Airgaps introduce air (k = 1) between metal lines, providing the lowest possible dielectric constant and reducing capacitance and crosstalk.
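A rough illustration of why k = 1 minimizes line-to-line capacitance, using the parallel-plate approximation C = k·ε0·A/d; the geometry values are arbitrary.

```python
EPS0 = 8.854e-12        # F/m, vacuum permittivity
area = 1e-12            # m^2, facing area between adjacent lines (arbitrary)
spacing = 50e-9         # m, line-to-line spacing (arbitrary)

for name, k in [("air gap", 1.0), ("low-k dielectric", 2.5), ("SiO2", 3.9)]:
    c = k * EPS0 * area / spacing
    print(f"{name:>16}: C ~ {c:.2e} F")   # lower k -> proportionally lower C
```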
# Adversarial Inverse Reinforcement Learning (AIRL)

**AIRL** (Adversarial Inverse Reinforcement Learning) is an advanced algorithm that combines inverse reinforcement learning with adversarial training to recover reward functions from expert demonstrations.

## The Core Problem AIRL Solves

Traditional **Inverse Reinforcement Learning (IRL)** aims to recover a reward function from expert demonstrations. The fundamental challenges include:

- **Reward ambiguity**: Many different reward functions can explain the same observed behavior
- **Computational expense**: Requires solving an RL problem in an inner loop
- **Poor scalability**: Struggles with high-dimensional problems
- **Dynamics dependence**: Learned rewards often don't transfer to new environments

## Mathematical Formulation

### Discriminator Architecture

The discriminator in AIRL has a specifically structured form:

$$
D_\theta(s, a, s') = \frac{\exp(f_\theta(s, a, s'))}{\exp(f_\theta(s, a, s')) + \pi(a|s)}
$$

Where:

- $s$ = current state
- $a$ = action taken
- $s'$ = next state
- $\pi(a|s)$ = policy probability
- $f_\theta$ = learned function (detailed below)

### Reward-Shaping Decomposition

The function $f_\theta$ is decomposed as:

$$
f_\theta(s, a, s') = g_\theta(s, a) + \gamma h_\phi(s') - h_\phi(s)
$$

| Component | Description | Role |
|-----------|-------------|------|
| $g_\theta(s, a)$ | Reward approximator | Transferable reward signal |
| $h_\phi(s)$ | Shaping potential | Captures dynamics-dependent info |
| $\gamma$ | Discount factor | Temporal discounting (typically 0.99) |

### State-Only Reward Variant

For better transfer, use state-only rewards:

$$
f_\theta(s, s') = g_\theta(s) + \gamma h_\phi(s') - h_\phi(s)
$$

## Training Algorithm

### Objective Functions

**Discriminator Loss** (minimize):

$$
\mathcal{L}_D = -\mathbb{E}_{\tau_E}\left[\log D_\theta(s, a, s')\right] - \mathbb{E}_{\tau_\pi}\left[\log(1 - D_\theta(s, a, s'))\right]
$$

Where:

- $\tau_E$ = expert trajectories
- $\tau_\pi$ = policy-generated trajectories

**Generator (Policy) Objective** (maximize):

$$
\mathcal{L}_\pi = \mathbb{E}_{\tau_\pi}\left[\sum_{t=0}^{T} \gamma^t \log D_\theta(s_t, a_t, s_{t+1})\right]
$$

### Training Loop Pseudocode

```python
# AIRL Training Loop
for iteration in range(max_iterations):
    # Step 1: Sample trajectories from current policy
    policy_trajectories = sample_trajectories(policy, env, n_samples)

    # Step 2: Update Discriminator
    for d_step in range(discriminator_steps):
        expert_batch = sample_batch(expert_demonstrations)
        policy_batch = sample_batch(policy_trajectories)

        # Discriminator predictions
        D_expert = discriminator(expert_batch)
        D_policy = discriminator(policy_batch)

        # Binary cross-entropy loss
        loss_D = -torch.mean(torch.log(D_expert)) \
                 - torch.mean(torch.log(1 - D_policy))

        optimizer_D.zero_grad()
        loss_D.backward()
        optimizer_D.step()

    # Step 3: Compute rewards for policy update
    rewards = torch.log(D_policy) - torch.log(1 - D_policy)

    # Step 4: Update Policy (using PPO, TRPO, etc.)
    policy.update(policy_trajectories, rewards)
```

## Theoretical Properties

### 1. Reward Recovery Guarantees

At optimality, under ergodicity and sufficient expressiveness:

$$
g_\theta(s, a) \rightarrow A^*(s, a) = Q^*(s, a) - V^*(s)
$$

Or for state-only rewards:

$$
g_\theta(s) \rightarrow r^*(s)
$$

This recovers the **ground-truth reward** up to a constant.

### 2. Disentanglement Theorem

The decomposition separates:

$$
\underbrace{f_\theta(s, a, s')}_{\text{Full signal}} = \underbrace{g_\theta(s, a)}_{\text{Reward (transferable)}} + \underbrace{\gamma h_\phi(s') - h_\phi(s)}_{\text{Shaping (dynamics-dependent)}}
$$

**Key insight**: Potential-based shaping ($\gamma h(s') - h(s)$) does not change the optimal policy, so $g_\theta$ captures the "true" reward.

### 3. Connection to Maximum Entropy IRL

AIRL approximates MaxEnt IRL:

$$
\max_\theta \mathbb{E}_{\tau_E}\left[\sum_t r_\theta(s_t, a_t)\right] + \mathcal{H}(\pi)
$$

Where $\mathcal{H}(\pi)$ is the policy entropy. AIRL achieves this without the expensive inner-loop policy optimization.

## Comparison

| Method | Recovers Reward | Dynamics-Invariant | Scalable | Sample Efficiency |
|--------|-----------------|-------------------|----------|-------------------|
| Behavioral Cloning | ❌ No | N/A | ✅ Yes | ✅ High |
| GAIL | ❌ No (policy only) | ❌ No | ✅ Yes | ⚠️ Medium |
| MaxEnt IRL | ✅ Yes | ⚠️ Partially | ❌ No | ❌ Low |
| **AIRL** | ✅ **Yes** | ✅ **Yes** | ✅ **Yes** | ⚠️ Medium |

### GAIL vs AIRL

**GAIL Discriminator**:

$$
D_\theta^{GAIL}(s, a) = \sigma(f_\theta(s, a))
$$

**AIRL Discriminator**:

$$
D_\theta^{AIRL}(s, a, s') = \frac{\exp(f_\theta(s, a, s'))}{\exp(f_\theta(s, a, s')) + \pi(a|s)}
$$

The key difference: AIRL's structure enables reward recovery; GAIL's does not.

## Implementation Details

### Network Architecture

```python
import torch
import torch.nn as nn

class AIRLDiscriminator(nn.Module):
    """
    AIRL Discriminator with reward-shaping decomposition.
    """
    def __init__(self, state_dim, action_dim, hidden_dim=256,
                 gamma=0.99, state_only=True):
        super().__init__()
        self.gamma = gamma
        self.state_only = state_only

        # Reward network g(s) or g(s,a)
        if state_only:
            self.g_net = nn.Sequential(
                nn.Linear(state_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1)
            )
        else:
            self.g_net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1)
            )

        # Shaping potential h(s)
        self.h_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1)
        )

    def get_reward(self, states, actions=None):
        """Extract the learned reward g(s) or g(s,a)."""
        if self.state_only:
            return self.g_net(states)
        else:
            sa = torch.cat([states, actions], dim=-1)
            return self.g_net(sa)

    def forward(self, states, actions, next_states, log_pi, dones):
        """
        Compute f(s,a,s') = g(s,a) + gamma*h(s') - h(s)

        Args:
            states: Current states [batch, state_dim]
            actions: Actions taken [batch, action_dim]
            next_states: Next states [batch, state_dim]
            log_pi: Log probability of actions [batch, 1]
            dones: Episode termination flags [batch, 1]

        Returns:
            D(s,a,s'): Discriminator output [batch, 1]
        """
        # Reward component
        g = self.get_reward(states, actions)

        # Shaping component
        h_s = self.h_net(states)
        h_s_next = self.h_net(next_states)

        # f = g + gamma*h(s') - h(s), with masking for terminal states
        shaping = self.gamma * (1 - dones) * h_s_next - h_s
        f = g + shaping

        # D(s,a,s') = exp(f) / (exp(f) + pi(a|s))
        # In log space: D = sigmoid(f - log_pi)
        log_D = f - log_pi
        D = torch.sigmoid(log_D)
        return D, f, g
```

### Hyperparameters

```python
# Recommended hyperparameters for AIRL
config = {
    # Environment
    "gamma": 0.99,                    # Discount factor

    # Networks
    "hidden_dim": 256,                # Hidden layer size
    "n_hidden_layers": 2,             # Number of hidden layers
    "state_only_reward": True,        # Use g(s) instead of g(s,a)

    # Training
    "batch_size": 256,                # Batch size for updates
    "discriminator_lr": 3e-4,         # Discriminator learning rate
    "policy_lr": 3e-4,                # Policy learning rate
    "discriminator_steps": 1,         # D updates per policy update

    # Regularization
    "gradient_penalty_coef": 10.0,    # Gradient penalty (optional)
    "entropy_coef": 0.01,             # Policy entropy bonus

    # Data
    "n_expert_trajectories": 50,      # Number of expert demos
    "samples_per_iteration": 2048,    # Policy samples per iteration
}
```

## Practical Considerations

### Advantages

- **Reward transfer**: Learned $g_\theta$ transfers to new dynamics
- **Interpretability**: Explicit reward function for analysis
- **Data efficiency**: Better than BC with limited demonstrations
- **Theoretical grounding**: Provable reward recovery guarantees

### Challenges

- **Training instability**: GAN-like adversarial dynamics
- **Hyperparameter sensitivity**: Requires careful tuning
- **Discriminator overfitting**: Can memorize expert data
- **Absorbing states**: Terminal states need special handling

### Stability Tricks

```python
# 1. Gradient Penalty (from WGAN-GP)
def gradient_penalty(discriminator, expert_data, policy_data):
    alpha = torch.rand(expert_data.size(0), 1)
    interpolated = alpha * expert_data + (1 - alpha) * policy_data
    interpolated.requires_grad_(True)
    d_interpolated = discriminator(interpolated)
    gradients = torch.autograd.grad(
        outputs=d_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(d_interpolated),
        create_graph=True
    )[0]
    gradient_norm = gradients.norm(2, dim=1)
    penalty = ((gradient_norm - 1) ** 2).mean()
    return penalty

# 2. Spectral Normalization
from torch.nn.utils import spectral_norm
layer = spectral_norm(nn.Linear(256, 256))

# 3. Label Smoothing
expert_labels = 0.9   # Instead of 1.0
policy_labels = 0.1   # Instead of 0.0
```

## Extensions and Variants

### 1. FAIRL (Forward Adversarial IRL)

Corrects for state distribution shift:

$$
r_{FAIRL}(s, a) = r_{AIRL}(s, a) - \log \pi(a|s)
$$

### 2. Off-Policy AIRL

Uses a replay buffer for sample efficiency:

$$
\mathcal{L}_D = -\mathbb{E}_{\tau_E}[\log D] - \mathbb{E}_{\mathcal{B}}[\rho(s,a) \log(1-D)]
$$

Where $\rho(s,a)$ is an importance weight.

### 3. Multi-Task AIRL

Learns shared reward structure across tasks:

$$
g_\theta(s, a) = g_{shared}(s, a) + g_{task}(s, a)
$$

## When to Use AIRL

### Good Fit ✅

- Need the **reward function**, not just the policy
- Want to **transfer behavior** to different dynamics
- Have **limited but high-quality** demonstrations
- **Interpretability** of learned behavior matters

### Consider Alternatives

- Only need to **match behavior** → Use GAIL (simpler)
- Have **abundant demonstrations** → BC might suffice
- **Reward function is known** → Use standard RL
- Need **real-time performance** → BC is faster

## Summary

AIRL provides a principled approach to learning **transferable reward functions** from demonstrations by:

1. Using a **structured discriminator** that separates reward from dynamics
2. Leveraging **adversarial training** for scalability
3. Providing **theoretical guarantees** on reward recovery
4. Enabling **reward transfer** across different environments

The key equation to remember:

$$
\boxed{f_\theta(s, a, s') = g_\theta(s, a) + \gamma h_\phi(s') - h_\phi(s)}
$$

Where $g_\theta$ is your transferable reward signal.
Lighter BERT variant using cross-layer parameter sharing and factorized embedding parameterization.
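A minimal sketch of the factorized embedding idea (a small embedding size E projected up to the hidden size H); the sizes are illustrative, not a specific model's configuration.

```python
import torch.nn as nn

vocab_size, E, H = 30000, 128, 768
factorized_embedding = nn.Sequential(
    nn.Embedding(vocab_size, E),   # V x E parameters instead of V x H
    nn.Linear(E, H),               # project up to the hidden size
)
```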
Aleatoric uncertainty comes from inherent randomness that is irreducible even with perfect knowledge.
Inherent randomness in data.
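A minimal sketch of capturing aleatoric uncertainty in regression by predicting both a mean and a variance and training with the Gaussian negative log-likelihood; PyTorch assumed, shapes illustrative.

```python
import torch
import torch.nn as nn

class HeteroscedasticHead(nn.Module):
    def __init__(self, in_dim):
        super().__init__()
        self.mean = nn.Linear(in_dim, 1)
        self.log_var = nn.Linear(in_dim, 1)   # predicted data-noise variance (log scale)

    def forward(self, features):
        return self.mean(features), self.log_var(features)

def gaussian_nll(mean, log_var, target):
    # Large predicted variance down-weights the squared error but is itself penalized.
    return (0.5 * torch.exp(-log_var) * (target - mean) ** 2 + 0.5 * log_var).mean()
```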
Alias-free GANs eliminate coordinate-dependent artifacts through continuous signal processing.
Simple relative position encoding.
Efficiently aggregate across nodes.
Exchange data between all devices.
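Assuming the two entries above refer to collective communication operations (e.g. all-reduce), here is a minimal sketch of gradient averaging with torch.distributed; the process group must already be initialized, one process per device.

```python
import torch.distributed as dist

def average_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum each gradient tensor across all processes, then average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```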
Fast equivariant neural network.
Allegro achieves fast equivariant message passing through strict locality and efficient tensor operations.
Alpaca demonstrates instruction-following through distillation from stronger models.
DeepMind's competitive programming model.
DeepMind's protein structure prediction system.
Altair is a declarative visualization library for Python, built on Vega-Lite.
Alternative chemistries develop less hazardous process chemicals, maintaining performance while reducing environmental impact.