agent orchestration,multi-agent
Framework to coordinate multiple specialized agents working on subtasks.
Agent protocols standardize interfaces for agent interoperability.
Stopping criteria define conditions when agents should terminate execution.
Model a fab using autonomous agents (agent-based modeling).
An AI agent can call tools (APIs, DBs, code) based on the conversation: the LLM plans, picks tools, reads the results, and responds with the updated knowledge.
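A minimal sketch of that loop, assuming a generic `llm_complete` callable and placeholder tools (none of these names come from a specific framework):

```python
# Hedged sketch of an agent tool-calling loop; llm_complete and TOOLS are
# hypothetical placeholders, not a specific library's API.
import json

TOOLS = {
    "search_db": lambda args: f"rows matching {args!r}",   # placeholder tool
    "run_code": lambda args: "stdout: ...",                # placeholder tool
}

def agent_loop(messages, llm_complete, max_turns=5):
    """Let the model alternate between tool calls and a final text answer."""
    for _ in range(max_turns):
        reply = llm_complete(messages)                     # text or a JSON tool request
        call = json.loads(reply) if reply.startswith("{") else None
        if call is None or call.get("tool") not in TOOLS:
            return reply                                   # plain text -> final answer
        result = TOOLS[call["tool"]](call.get("args"))
        messages.append({"role": "tool", "content": str(result)})
    return "max turns reached"
```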
AgentBench provides comprehensive evaluation framework for LLM-based agents.
RAG system where the agent decides when to retrieve, what queries to use, and how to synthesize the results.
Aggregate functions in GNNs combine neighbor information using operations like sum, mean, max, or attention.
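A small sketch of mean aggregation in plain PyTorch (no GNN library assumed; tensor shapes are illustrative):

```python
# Hedged sketch: mean-aggregate neighbor features per node.
import torch

def mean_aggregate(x, edge_index):
    """x: [num_nodes, dim] node features; edge_index: [2, num_edges] of (src, dst)."""
    src, dst = edge_index
    agg = torch.zeros_like(x)
    agg.index_add_(0, dst, x[src])                        # sum of neighbor features
    deg = torch.zeros(x.size(0)).index_add_(0, dst, torch.ones(dst.numel()))
    return agg / deg.clamp(min=1).unsqueeze(-1)           # divide by in-degree -> mean

x = torch.randn(4, 8)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 1]])          # edges 0->1, 1->2, 2->1
out = mean_aggregate(x, edge_index)                        # shape [4, 8]
```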
Aggregation strategies combine individual preferences into group recommendations through averaging or consensus.
Track device degradation (aging) over time.
Include aging-induced degradation in timing analysis.
Mobile robot that transports wafers or materials on the fab floor.
Optimize AGV paths.
EU AI Act regulates AI by risk level. High-risk requires compliance. Global regulatory trend.
Framework for protecting people from algorithmic harm.
AI feedback uses model-generated evaluations to train or align other models.
Purpose-built systems for AI training.
Aider is AI pair programming in your terminal. It edits files with an LLM.
Tool for aerial image inspection.
Ultra-stable surface for metrology.
Number of times cleanroom air is completely replaced per hour.
Use air (k=1) as insulator between metal lines for lowest capacitance.
Enclosed space that blows high-velocity air to remove particles before cleanroom entry.
Gaseous contaminants.
Apache Airflow orchestrates data pipelines. DAGs define dependencies. Standard for ETL.
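A minimal DAG sketch, assuming Airflow 2.x; the task names and the extract/load callables are illustrative, not from any particular project:

```python
# Hedged sketch of an Airflow DAG with a two-task dependency.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling rows from the source")    # placeholder extract step

def load():
    print("writing rows to the warehouse")   # placeholder load step

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task                # extract runs before load
```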
Airgaps introduce air (k = 1) between metal lines, providing the lowest possible dielectric constant and reducing capacitance and crosstalk.
# Adversarial Inverse Reinforcement Learning (AIRL)

**AIRL** (Adversarial Inverse Reinforcement Learning) is an advanced algorithm that combines inverse reinforcement learning with adversarial training to recover reward functions from expert demonstrations.

## The Core Problem AIRL Solves

Traditional **Inverse Reinforcement Learning (IRL)** aims to recover a reward function from expert demonstrations. The fundamental challenges include:

- **Reward ambiguity**: Many different reward functions can explain the same observed behavior
- **Computational expense**: Requires solving an RL problem in an inner loop
- **Poor scalability**: Struggles with high-dimensional problems
- **Dynamics dependence**: Learned rewards often don't transfer to new environments

## Mathematical Formulation

### Discriminator Architecture

The discriminator in AIRL has a specifically structured form:

$$
D_\theta(s, a, s') = \frac{\exp(f_\theta(s, a, s'))}{\exp(f_\theta(s, a, s')) + \pi(a|s)}
$$

Where:

- $s$ = current state
- $a$ = action taken
- $s'$ = next state
- $\pi(a|s)$ = policy probability
- $f_\theta$ = learned function (detailed below)

### Reward-Shaping Decomposition

The function $f_\theta$ is decomposed as:

$$
f_\theta(s, a, s') = g_\theta(s, a) + \gamma h_\phi(s') - h_\phi(s)
$$

| Component | Description | Role |
|-----------|-------------|------|
| $g_\theta(s, a)$ | Reward approximator | Transferable reward signal |
| $h_\phi(s)$ | Shaping potential | Captures dynamics-dependent info |
| $\gamma$ | Discount factor | Temporal discounting (typically 0.99) |

### State-Only Reward Variant

For better transfer, use state-only rewards:

$$
f_\theta(s, s') = g_\theta(s) + \gamma h_\phi(s') - h_\phi(s)
$$

## Training Algorithm

### Objective Functions

**Discriminator Loss** (minimize):

$$
\mathcal{L}_D = -\mathbb{E}_{\tau_E}\left[\log D_\theta(s, a, s')\right] - \mathbb{E}_{\tau_\pi}\left[\log(1 - D_\theta(s, a, s'))\right]
$$

Where:

- $\tau_E$ = expert trajectories
- $\tau_\pi$ = policy-generated trajectories

**Generator (Policy) Objective** (maximize):

$$
\mathcal{L}_\pi = \mathbb{E}_{\tau_\pi}\left[\sum_{t=0}^{T} \gamma^t \log D_\theta(s_t, a_t, s_{t+1})\right]
$$

### Training Loop Pseudocode

```python
# AIRL training loop
for iteration in range(max_iterations):
    # Step 1: Sample trajectories from the current policy
    policy_trajectories = sample_trajectories(policy, env, n_samples)

    # Step 2: Update the discriminator
    for d_step in range(discriminator_steps):
        expert_batch = sample_batch(expert_demonstrations)
        policy_batch = sample_batch(policy_trajectories)

        # Discriminator predictions
        D_expert = discriminator(expert_batch)
        D_policy = discriminator(policy_batch)

        # Binary cross-entropy loss
        loss_D = -torch.mean(torch.log(D_expert)) \
                 - torch.mean(torch.log(1 - D_policy))

        optimizer_D.zero_grad()
        loss_D.backward()
        optimizer_D.step()

    # Step 3: Compute rewards for the policy update
    rewards = torch.log(D_policy) - torch.log(1 - D_policy)

    # Step 4: Update the policy (using PPO, TRPO, etc.)
    policy.update(policy_trajectories, rewards)
```

## Theoretical Properties

### 1. Reward Recovery Guarantees

At optimality, under ergodicity and sufficient expressiveness:

$$
g_\theta(s, a) \rightarrow A^*(s, a) = Q^*(s, a) - V^*(s)
$$

Or for state-only rewards:

$$
g_\theta(s) \rightarrow r^*(s)
$$

This recovers the **ground-truth reward** up to a constant.

### 2. Disentanglement Theorem

The decomposition separates:

$$
\underbrace{f_\theta(s, a, s')}_{\text{Full signal}} = \underbrace{g_\theta(s, a)}_{\text{Reward (transferable)}} + \underbrace{\gamma h_\phi(s') - h_\phi(s)}_{\text{Shaping (dynamics-dependent)}}
$$

**Key insight**: Potential-based shaping ($\gamma h(s') - h(s)$) does not change the optimal policy, so $g_\theta$ captures the "true" reward.

### 3. Connection to Maximum Entropy IRL

AIRL approximates MaxEnt IRL:

$$
\max_\theta \mathbb{E}_{\tau_E}\left[\sum_t r_\theta(s_t, a_t)\right] + \mathcal{H}(\pi)
$$

Where $\mathcal{H}(\pi)$ is the policy entropy. AIRL achieves this without the expensive inner-loop policy optimization.

## Comparison

| Method | Recovers Reward | Dynamics-Invariant | Scalable | Sample Efficiency |
|--------|-----------------|--------------------|----------|-------------------|
| Behavioral Cloning | ❌ No | N/A | ✅ Yes | ✅ High |
| GAIL | ❌ No (policy only) | ❌ No | ✅ Yes | ⚠️ Medium |
| MaxEnt IRL | ✅ Yes | ⚠️ Partially | ❌ No | ❌ Low |
| **AIRL** | ✅ **Yes** | ✅ **Yes** | ✅ **Yes** | ⚠️ Medium |

### GAIL vs AIRL

**GAIL Discriminator**:

$$
D_\theta^{GAIL}(s, a) = \sigma(f_\theta(s, a))
$$

**AIRL Discriminator**:

$$
D_\theta^{AIRL}(s, a, s') = \frac{\exp(f_\theta(s, a, s'))}{\exp(f_\theta(s, a, s')) + \pi(a|s)}
$$

The key difference: AIRL's structure enables reward recovery; GAIL's does not.

## Implementation Details

### Network Architecture

```python
import torch
import torch.nn as nn


class AIRLDiscriminator(nn.Module):
    """AIRL discriminator with reward-shaping decomposition."""

    def __init__(self, state_dim, action_dim, hidden_dim=256,
                 gamma=0.99, state_only=True):
        super().__init__()
        self.gamma = gamma
        self.state_only = state_only

        # Reward network g(s) or g(s,a)
        if state_only:
            self.g_net = nn.Sequential(
                nn.Linear(state_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )
        else:
            self.g_net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, 1),
            )

        # Shaping potential h(s)
        self.h_net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def get_reward(self, states, actions=None):
        """Extract the learned reward g(s) or g(s,a)."""
        if self.state_only:
            return self.g_net(states)
        sa = torch.cat([states, actions], dim=-1)
        return self.g_net(sa)

    def forward(self, states, actions, next_states, log_pi, dones):
        """
        Compute f(s,a,s') = g(s,a) + gamma*h(s') - h(s) and the discriminator output.

        Args:
            states: Current states [batch, state_dim]
            actions: Actions taken [batch, action_dim]
            next_states: Next states [batch, state_dim]
            log_pi: Log probability of actions [batch, 1]
            dones: Episode termination flags [batch, 1]

        Returns:
            D(s,a,s'): Discriminator output [batch, 1], along with f and g
        """
        # Reward component
        g = self.get_reward(states, actions)

        # Shaping component
        h_s = self.h_net(states)
        h_s_next = self.h_net(next_states)

        # f = g + gamma*h(s') - h(s), with masking for terminal states
        shaping = self.gamma * (1 - dones) * h_s_next - h_s
        f = g + shaping

        # D(s,a,s') = exp(f) / (exp(f) + pi(a|s))
        # In log space: D = sigmoid(f - log_pi)
        log_D = f - log_pi
        D = torch.sigmoid(log_D)
        return D, f, g
```

### Hyperparameters

```python
# Recommended hyperparameters for AIRL
config = {
    # Environment
    "gamma": 0.99,                    # Discount factor

    # Networks
    "hidden_dim": 256,                # Hidden layer size
    "n_hidden_layers": 2,             # Number of hidden layers
    "state_only_reward": True,        # Use g(s) instead of g(s,a)

    # Training
    "batch_size": 256,                # Batch size for updates
    "discriminator_lr": 3e-4,         # Discriminator learning rate
    "policy_lr": 3e-4,                # Policy learning rate
    "discriminator_steps": 1,         # D updates per policy update

    # Regularization
    "gradient_penalty_coef": 10.0,    # Gradient penalty (optional)
    "entropy_coef": 0.01,             # Policy entropy bonus

    # Data
    "n_expert_trajectories": 50,      # Number of expert demos
    "samples_per_iteration": 2048,    # Policy samples per iteration
}
```

## Practical Considerations

### Advantages

- **Reward transfer**: Learned $g_\theta$ transfers to new dynamics
- **Interpretability**: Explicit reward function for analysis
- **Data efficiency**: Better than BC with limited demonstrations
- **Theoretical grounding**: Provable reward recovery guarantees

### Challenges

- **Training instability**: GAN-like adversarial dynamics
- **Hyperparameter sensitivity**: Requires careful tuning
- **Discriminator overfitting**: Can memorize expert data
- **Absorbing states**: Terminal states need special handling

### Stability Tricks

```python
# 1. Gradient penalty (from WGAN-GP)
def gradient_penalty(discriminator, expert_data, policy_data):
    alpha = torch.rand(expert_data.size(0), 1)
    interpolated = alpha * expert_data + (1 - alpha) * policy_data
    interpolated.requires_grad_(True)
    d_interpolated = discriminator(interpolated)
    gradients = torch.autograd.grad(
        outputs=d_interpolated,
        inputs=interpolated,
        grad_outputs=torch.ones_like(d_interpolated),
        create_graph=True,
    )[0]
    gradient_norm = gradients.norm(2, dim=1)
    penalty = ((gradient_norm - 1) ** 2).mean()
    return penalty

# 2. Spectral normalization
from torch.nn.utils import spectral_norm
layer = spectral_norm(nn.Linear(256, 256))

# 3. Label smoothing
expert_labels = 0.9   # Instead of 1.0
policy_labels = 0.1   # Instead of 0.0
```

## Extensions and Variants

### 1. FAIRL (Forward Adversarial IRL)

Corrects for state distribution shift:

$$
r_{FAIRL}(s, a) = r_{AIRL}(s, a) - \log \pi(a|s)
$$

### 2. Off-Policy AIRL

Uses a replay buffer for sample efficiency:

$$
\mathcal{L}_D = -\mathbb{E}_{\tau_E}[\log D] - \mathbb{E}_{\mathcal{B}}[\rho(s,a) \log(1-D)]
$$

Where $\rho(s,a)$ is an importance weight.

### 3. Multi-Task AIRL

Learns shared reward structure across tasks:

$$
g_\theta(s, a) = g_{shared}(s, a) + g_{task}(s, a)
$$

## When to Use AIRL

### Good Fit ✅

- Need the **reward function**, not just the policy
- Want to **transfer behavior** to different dynamics
- Have **limited but high-quality** demonstrations
- **Interpretability** of learned behavior matters

### Consider Alternatives

- Only need to **match behavior** → Use GAIL (simpler)
- Have **abundant demonstrations** → BC might suffice
- **Reward function is known** → Use standard RL
- Need **real-time performance** → BC is faster

## Summary

AIRL provides a principled approach to learning **transferable reward functions** from demonstrations by:

1. Using a **structured discriminator** that separates reward from dynamics
2. Leveraging **adversarial training** for scalability
3. Providing **theoretical guarantees** on reward recovery
4. Enabling **reward transfer** across different environments

The key equation to remember:

$$
\boxed{f_\theta(s, a, s') = g_\theta(s, a) + \gamma h_\phi(s') - h_\phi(s)}
$$

Where $g_\theta$ is your transferable reward signal.
Monitor and respond to tool alarms via automation system.
Lighter BERT variant using cross-layer parameter sharing and factorized embedding parameterization.
Albumentations is a fast image augmentation library with many transforms.
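A minimal pipeline sketch (the transform choices and placeholder image are illustrative):

```python
# Hedged sketch of an Albumentations augmentation pipeline; images are
# numpy arrays in HWC format.
import numpy as np
import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),               # random left-right flip
    A.RandomBrightnessContrast(p=0.2),     # mild photometric jitter
])

image = np.zeros((224, 224, 3), dtype=np.uint8)    # placeholder image
augmented = transform(image=image)["image"]         # result returned in a dict
```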
Sequential self-limiting surface reactions for atomic-level thickness control.
One precursor pulse + purge + reactant pulse + purge.
Aleatoric uncertainty comes from inherent randomness irreducible even with perfect knowledge.
Inherent randomness in data.
Set thresholds and notifications for issues.
Alerts notify on-call when issues occur. PagerDuty, OpsGenie. Escalation policies.
Specifies which effects are confounded with one another in a fractional factorial design.
Alias-free GANs eliminate coordinate-dependent artifacts through continuous signal processing.
Simple relative position encoding that adds a distance-proportional bias to attention scores.
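A hedged sketch of an ALiBi-style bias for causal attention (a simplified variant of the paper's slope schedule, assuming the head count is a power of two):

```python
# Hedged sketch: head-specific negative bias proportional to query-key distance.
import torch

def alibi_bias(seq_len, num_heads):
    """Return a [num_heads, seq_len, seq_len] additive bias for causal attention."""
    # Geometric head slopes (simplified; exact schedule in the ALiBi paper)
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)   # i - j for keys j <= i
    return -slopes[:, None, None] * distance                # add to attention logits

bias = alibi_bias(seq_len=16, num_heads=8)   # apply before softmax, with a causal mask
```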
Aligners position wafers accurately using notch or flat detection.
Mechanism to orient wafer notch or flat to a standard position.
Reference patterns on wafer used to align each layer.
Alignment tax: safety measures may reduce capability. Balance helpfulness and harmlessness.
Alignment = making models follow human values and instructions. RLHF/DPO leverage human preference data to push the model toward desired behavior.
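For the DPO side, a hedged sketch of the preference loss on precomputed sequence log-probabilities (tensor names are illustrative; `beta` scales the implicit reward):

```python
# Hedged sketch of the DPO loss; inputs are [batch] summed log-probs of the
# chosen/rejected responses under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)        # implicit reward, chosen
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)  # implicit reward, rejected
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```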
Vision models without convolution or attention.
Efficiently aggregate across nodes.
Exchange data between all devices.
Fast equivariant neural network.
Allegro achieves fast equivariant message passing through strict locality and efficient tensor operations.
Distribute limited supply among customers.