
AI Factory Glossary

653 technical terms and definitions


task sampling strategies, multi-task learning

How to sample tasks during training.

task similarity, multi-task learning

Relatedness between tasks.

task tokens, multi-task learning

Prepend task identifiers.

task-incremental learning, continual learning

Learn tasks sequentially with task labels at test time.

task-oriented dialogue, dialogue

Goal-driven conversations.

task-specific heads, multi-task learning

Separate output layers per task.

task-specific parameters, multi-task learning

Separate parameters per task.

task-specific pre-training, transfer learning

Pre-train for specific downstream task.

taskfile, yaml, runner

A Taskfile defines tasks in YAML for Task, a Go-based task runner.

tasnet, audio & speech

TasNet performs end-to-end time-domain audio separation using temporal convolutional networks.

taylor expansion pruning, model optimization

Taylor expansion pruning approximates loss change from removing weights using Taylor series.
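For intuition, a minimal PyTorch sketch of the common first-order variant, which scores each weight by |w * dL/dw| after a backward pass (function and variable names are illustrative, not from a specific library):

```python
import torch

def first_order_taylor_scores(model):
    """Approximate the loss change from zeroing each weight as |w * dL/dw|
    (the first-order Taylor term). Assumes loss.backward() has already run
    so that .grad is populated."""
    scores = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            scores[name] = (param.detach() * param.grad.detach()).abs()
    return scores
```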

tbats, tbats, time series models

TBATS combines a Box-Cox transformation, Fourier-based seasonality, ARMA errors, and trend components to model complex seasonal time series.

tcad (technology cad), tcad, technology cad, design

Simulation tools for process and device physics.

tcad model parameters, tcad, simulation

Physical parameters used in device/process simulation.

tcn, tcn, time series models

Temporal Convolutional Networks use dilated causal convolutions for sequence modeling with long effective history.
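A minimal PyTorch sketch of the core building block, a dilated causal convolution that left-pads so outputs never see future timesteps (class and variable names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated causal 1D convolution: the output at time t only sees inputs <= t."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad only, so no future leakage
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                 # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))       # pad the past side of the time axis
        return self.conv(x)

# Stacking layers with dilations 1, 2, 4, ... grows the effective history exponentially.
tcn_block = nn.Sequential(*[CausalConv1d(16, dilation=2**i) for i in range(4)])
```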

td3, td3, reinforcement learning

Twin Delayed DDPG, an actor-critic algorithm for continuous control that reduces the Q-value overestimation seen in DDPG.

td3, twin delayed ddpg, reinforcement learning advanced, continuous control, actor critic, ddpg, advanced rl

# TD3: Twin Delayed Deep Deterministic Policy Gradient

**Advanced Reinforcement Learning Algorithm for Continuous Control**

## Overview

TD3, introduced by Fujimoto et al. (2018), addresses a fundamental problem in continuous control: **overestimation bias** in actor-critic methods. DDPG (Deep Deterministic Policy Gradient), while effective, suffers from significant value overestimation that compounds over training, leading to poor policies.

## Core Motivation: The Overestimation Problem

### Mathematical Formulation

In Q-learning, the Bellman target is:

$$
y = r + \gamma \max_{a'} Q(s', a')
$$

The problem arises because:

$$
\mathbb{E}\left[\max_{a'} \hat{Q}(s', a')\right] \geq \max_{a'} \mathbb{E}\left[\hat{Q}(s', a')\right]
$$

Where $\hat{Q}$ is the estimated Q-function with approximation error.

### Why This Matters

- Function approximation introduces noise: $\hat{Q}(s,a) = Q^*(s,a) + \epsilon$
- The $\max$ operator preferentially selects overestimated values
- Errors propagate and amplify through bootstrapping
- The policy exploits these overestimations, leading to divergence

## The Three Pillars of TD3

### 1. Clipped Double Q-Learning

TD3 maintains **two** critic networks $(Q_{\theta_1}, Q_{\theta_2})$ and uses the minimum for targets:

$$
y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tilde{a}')
$$

Where:

- $Q_{\theta'_1}, Q_{\theta'_2}$ are target networks
- $\tilde{a}'$ is the smoothed target action

**Loss function for each critic:**

$$
\mathcal{L}(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q_{\theta_i}(s,a) - y \right)^2 \right]
$$

#### Key Insight

| Method | Approach | Effect |
|--------|----------|--------|
| Double DQN | Decouples selection and evaluation | Reduces overestimation |
| TD3 | Takes minimum of two estimates | More aggressive bias reduction |

### 2. Delayed Policy Updates

The actor is updated **less frequently** than the critics:

$$
\theta_{\pi} \leftarrow \theta_{\pi} + \alpha \nabla_{\theta_\pi} J(\theta_\pi) \quad \text{every } d \text{ steps}
$$

Where $d$ is typically 2.

**Policy gradient:**

$$
\nabla_{\theta_\pi} J(\theta_\pi) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \nabla_a Q_{\theta_1}(s,a) \big|_{a=\pi_{\theta_\pi}(s)} \nabla_{\theta_\pi} \pi_{\theta_\pi}(s) \right]
$$

#### Rationale

- Policy updates depend on accurate value estimates
- High-variance critic estimates cause policy divergence
- Delayed updates allow the critic to stabilize
- Reduces the "moving target" problem

### 3. Target Policy Smoothing

Noise is added to target actions:

$$
\tilde{a}' = \pi_{\theta'_\pi}(s') + \epsilon, \quad \epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c)
$$

**Purpose:**

$$
Q(s', a') \approx Q(s', a' + \epsilon) \quad \text{for small } \epsilon
$$

This regularizes the Q-function, preventing exploitation of narrow peaks.
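To make pillars 1 and 3 concrete, here is a minimal PyTorch-style sketch of the target computation, assuming hypothetical `actor_target`, `critic1_target`, and `critic2_target` modules and a `batch` dict of replay-buffer tensors (all names are illustrative, not from a specific library):

```python
import torch

def td3_target(batch, actor_target, critic1_target, critic2_target,
               gamma=0.99, sigma_tilde=0.2, noise_clip=0.5, max_action=1.0):
    """Clipped double-Q target with target policy smoothing (a sketch).

    `batch` is assumed to hold tensors: reward (N, 1), next_state (N, dim_s),
    done (N, 1)."""
    with torch.no_grad():
        next_action = actor_target(batch["next_state"])
        # Target policy smoothing: clipped Gaussian noise on the target action,
        # then clamp to the valid action range.
        noise = (torch.randn_like(next_action) * sigma_tilde).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)

        # Clipped double Q-learning: take the minimum of the two target critics.
        q1 = critic1_target(batch["next_state"], next_action)
        q2 = critic2_target(batch["next_state"], next_action)
        target_q = torch.min(q1, q2)

        # Bellman target; (1 - done) removes the bootstrap term at episode ends.
        return batch["reward"] + gamma * (1.0 - batch["done"]) * target_q
```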
## Complete TD3 Algorithm

### Pseudocode

```
Initialize:
  - Critic networks Q_θ₁, Q_θ₂
  - Actor network π_φ
  - Target networks θ'₁ ← θ₁, θ'₂ ← θ₂, φ' ← φ
  - Replay buffer D

For each timestep t:
  1. Select action with exploration: a ~ π_φ(s) + ε, ε ~ N(0, σ)
  2. Execute a, observe r, s'
  3. Store (s, a, r, s') in D
  4. Sample mini-batch from D
  5. Compute target:
       ã ← π_φ'(s') + clip(ε, -c, c), ε ~ N(0, σ̃)
       y ← r + γ min_{i=1,2} Q_θ'ᵢ(s', ã)
  6. Update critics: θᵢ ← θᵢ - α∇_θᵢ (Q_θᵢ(s,a) - y)²
  7. If t mod d = 0:
       - Update actor: φ ← φ + β∇_φ Q_θ₁(s, π_φ(s))
       - Update targets:
           θ'ᵢ ← τθᵢ + (1-τ)θ'ᵢ
           φ' ← τφ + (1-τ)φ'
```

### Hyperparameters

| Parameter | Symbol | Typical Value | Description |
|-----------|--------|---------------|-------------|
| Discount factor | $\gamma$ | 0.99 | Future reward weighting |
| Soft update rate | $\tau$ | 0.005 | Target network update rate |
| Policy delay | $d$ | 2 | Critic updates per actor update |
| Target noise | $\tilde{\sigma}$ | 0.2 | Smoothing noise std |
| Noise clip | $c$ | 0.5 | Smoothing noise bounds |
| Exploration noise | $\sigma$ | 0.1 | Action exploration std |
| Batch size | $N$ | 256 | Mini-batch size |
| Learning rate | $\alpha, \beta$ | 3e-4 | Network learning rates |

## Mathematical Deep Dive

### Overestimation Bias Analysis

Let the true Q-value be $Q^*(s,a)$ and the estimate be:

$$
\hat{Q}(s,a) = Q^*(s,a) + \epsilon(s,a)
$$

Where $\epsilon(s,a)$ is zero-mean noise with variance $\sigma^2$.

**Single estimator bias:**

$$
\mathbb{E}\left[\max_a \hat{Q}(s,a)\right] - \max_a Q^*(s,a) \approx \sigma\sqrt{\frac{2\log n}{\pi}}
$$

For $n$ actions sampled.

**Double estimator (TD3) bias:**

$$
\mathbb{E}\left[\min_{i=1,2} \hat{Q}_i(s,a)\right] \leq Q^*(s,a)
$$

TD3 trades overestimation for slight underestimation, which is empirically more stable.

### Deterministic Policy Gradient

The actor maximizes expected return:

$$
J(\theta_\pi) = \mathbb{E}_{s \sim \rho^\pi} \left[ Q^{\pi}(s, \pi_{\theta_\pi}(s)) \right]
$$

Gradient (Silver et al., 2014):

$$
\nabla_{\theta_\pi} J(\theta_\pi) = \mathbb{E}_{s \sim \rho^\pi} \left[ \nabla_{\theta_\pi} \pi_{\theta_\pi}(s) \nabla_a Q^{\pi}(s,a) \big|_{a=\pi_{\theta_\pi}(s)} \right]
$$

## Comparison with Related Algorithms

### TD3 vs DDPG vs SAC

| Aspect | DDPG | TD3 | SAC |
|--------|------|-----|-----|
| Number of critics | 1 | 2 (min) | 2 (min) |
| Policy type | Deterministic | Deterministic | Stochastic |
| Entropy regularization | ✗ | ✗ | ✓ |
| Exploration method | External noise | External noise | Policy entropy |
| Actor update frequency | Every step | Delayed | Every step |
| Target smoothing | ✗ | ✓ | ✗ |

### SAC Objective (for comparison)

$$
J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \left( r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right) \right]
$$

Where $\mathcal{H}$ is entropy and $\alpha$ is the temperature parameter.

## Practical Implementation Notes

### Network Architecture

**Critic Network:**

$$
Q_\theta(s,a) = f_\theta(\text{concat}(s, a))
$$

Typical architecture:

- Input: $\text{dim}(s) + \text{dim}(a)$
- Hidden layers: [256, 256] with ReLU
- Output: 1 (scalar Q-value)

**Actor Network:**

$$
\pi_\phi(s) = \tanh(f_\phi(s)) \cdot a_{\max}
$$

Typical architecture:

- Input: $\text{dim}(s)$
- Hidden layers: [256, 256] with ReLU
- Output: $\text{dim}(a)$ with tanh activation
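A minimal PyTorch sketch of these two networks, following the layer sizes listed above (class names and `max_action` are illustrative):

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): concatenates state and action, outputs a scalar value."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class Actor(nn.Module):
    """pi(s): deterministic action scaled to [-max_action, max_action] via tanh."""
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, state):
        return self.max_action * torch.tanh(self.net(state))
```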
### Common Failure Modes

1. **Insufficient exploration**
   - Symptom: Premature convergence to suboptimal policy
   - Solution: Increase exploration noise, use parameter noise
2. **Critic divergence**
   - Symptom: Q-values grow unboundedly
   - Solution: Reduce learning rate, gradient clipping
3. **Slow learning**
   - Symptom: Policy improves very slowly
   - Solution: Reduce policy delay, increase batch size

### Debugging Tips

- Monitor Q-value statistics: $\mathbb{E}[Q]$, $\text{Var}[Q]$, $\max Q$
- Track actor and critic losses separately
- Visualize learned policy periodically
- Compare predicted vs actual returns

## When to Use TD3

### Good Fit

- Continuous control with dense rewards
- Robotic manipulation and locomotion
- When sample efficiency matters
- Environments with smooth dynamics

### Consider Alternatives

- **Discrete actions** → DQN, Rainbow, C51
- **Maximum exploration needed** → SAC
- **Model available** → MBPO, Dreamer, MuZero
- **Multi-task/Meta-learning** → MAML, RL²

## Equations

### Target Computation

$$
y = r + \gamma \min_{i=1,2} Q_{\theta'_i}\left(s', \pi_{\theta'_\pi}(s') + \text{clip}(\epsilon, -c, c)\right)
$$

### Critic Loss

$$
\mathcal{L}_{\text{critic}} = \frac{1}{N} \sum_{j=1}^{N} \left( Q_{\theta_i}(s_j, a_j) - y_j \right)^2
$$

### Actor Loss

$$
\mathcal{L}_{\text{actor}} = -\frac{1}{N} \sum_{j=1}^{N} Q_{\theta_1}(s_j, \pi_{\theta_\pi}(s_j))
$$

### Soft Target Update

$$
\theta' \leftarrow \tau \theta + (1 - \tau) \theta'
$$
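Tying the equations above together, here is a hedged sketch of one TD3 optimization step in PyTorch; it assumes the networks and optimizers already exist and reuses the hypothetical `td3_target` helper from the earlier sketch:

```python
import torch
import torch.nn.functional as F

def td3_update(step, batch, actor, critics, targets, optims,
               policy_delay=2, tau=0.005):
    """One TD3 step: critic regression, delayed actor update, Polyak averaging."""
    critic1, critic2 = critics
    actor_t, critic1_t, critic2_t = targets

    # Critic loss: regress both critics onto the shared clipped double-Q target.
    y = td3_target(batch, actor_t, critic1_t, critic2_t)  # from the earlier sketch
    q1 = critic1(batch["state"], batch["action"])
    q2 = critic2(batch["state"], batch["action"])
    critic_loss = F.mse_loss(q1, y) + F.mse_loss(q2, y)
    optims["critic"].zero_grad()
    critic_loss.backward()
    optims["critic"].step()

    # Delayed policy and target updates (every `policy_delay` critic steps).
    if step % policy_delay == 0:
        actor_loss = -critic1(batch["state"], actor(batch["state"])).mean()
        optims["actor"].zero_grad()
        actor_loss.backward()
        optims["actor"].step()

        # Soft (Polyak) target updates: theta' <- tau*theta + (1 - tau)*theta'.
        for net, net_t in [(actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)]:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
```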

tddb testing, reliability

Time-dependent dielectric breakdown testing.

tdr, tdr, signal & power integrity

Time Domain Reflectometry measures impedance discontinuities along signal paths by analyzing reflected waveforms from incident step signals.

te-nas, te-nas, neural architecture search

Training-free ensemble NAS combines multiple zero-cost proxies, improving the reliability of architecture evaluation.

teacher-student cl, advanced training

Teacher-student curriculum learning uses a teacher model to assess sample difficulty and guide curriculum design for student training.

teacher-student framework, model compression

General paradigm for distillation.

teacher-student training, model optimization

Teacher-student training transfers knowledge from complex to simple models through soft targets.

teaching assistant, model compression

Intermediate model between teacher and student.

team training, internal course, playbook

I can help you turn your knowledge into internal docs, playbooks, or mini-courses for onboarding your team.

team, hire, skills, culture

AI teams need diverse skills: ML, engineering, product, and domain expertise, plus a culture of experimentation and learning.

technical debt identification, code ai

Find areas needing refactoring.

technical debt, refactor, maintain

AI tech debt: hacky prompts, hardcoded logic, missing tests. Schedule time to refactor and maintain.

technical document generation, content creation

Create technical specifications.

technology licensing, business

License process technology.

technology node comparison, business

Compare different process nodes.

technology nodes, business

Process generation names.

technology readiness level, trl, production

Maturity of technology.

technology roadmap, business

Plan for future technology development.

technology roadmap, business & strategy

Technology roadmaps project the evolution of capabilities and requirements over time.

technology transfer, production

Move from development to production.

tee, secure enclave, confidential

TEEs (Trusted Execution Environments) run code in isolated secure enclaves, protecting models and data.

teep, teep, manufacturing operations

Total Effective Equipment Performance includes all calendar time in the denominator.

tem (transmission electron microscopy), tem, transmission electron microscopy, metrology

Ultra-high resolution imaging for defect analysis.

temperature calibration, ai safety

Adjust the softmax temperature to produce better-calibrated probabilities.

temperature control unit (tcu), temperature control unit, tcu, facility

Precision temperature controller for wafer chucks and process zones.

temperature control, manufacturing operations

Temperature control maintains stable thermal environments for processes and metrology.

temperature cycling during burn-in, reliability

Thermal stress during screening.

temperature cycling simulation, simulation

Model thermal stress from cycling.

temperature cycling, reliability

Repeatedly heat and cool to test thermal stress.

temperature distillation, model optimization

The temperature parameter in distillation softens predictions, revealing relative class probabilities.
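A hedged PyTorch sketch of the usual temperature-softened distillation loss (names and the default T are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student distributions.

    Dividing logits by T > 1 flattens the softmax, exposing the teacher's relative
    probabilities over non-target classes; the T**2 factor keeps gradient
    magnitudes comparable across temperatures."""
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T ** 2)
```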

temperature humidity bias, thb, reliability

Combined environmental stress test.

temperature in distillation, model compression

Soften probability distributions.

temperature sampling for tasks, multi-task learning

Adjust task probabilities.

temperature sampling, text generation

Scale logits before sampling.
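A minimal PyTorch sketch of the idea (names are illustrative):

```python
import torch

def sample_with_temperature(logits, temperature=0.8):
    """Scale logits by 1/temperature, then sample from the resulting softmax.

    T < 1 sharpens the distribution (closer to greedy); T > 1 flattens it
    (more random)."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```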