task sampling strategies, multi-task learning
How to sample tasks during training.
Relatedness between tasks.
Prepend task identifiers.
Learn tasks sequentially with task labels at test time.
Goal-driven conversations.
Separate output layers per task.
Separate parameters per task.
Pre-train for specific downstream task.
Taskfile defines tasks in YAML for Task, a Go-based task runner.
TasNet performs end-to-end time-domain audio separation; the Conv-TasNet variant uses temporal convolutional networks in place of the original LSTM separator.
Taylor expansion pruning uses a Taylor-series approximation of the loss change caused by removing a weight to rank weight importance.
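For illustration, a minimal PyTorch sketch that scores weights by the first-order Taylor term |w · ∂L/∂w|; the toy model, batch, and scoring loop are assumptions for the example, not part of any specific pruning library.

```python
# Minimal sketch of first-order Taylor importance scoring (illustrative only).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

# First-order Taylor estimate of the loss change if a weight is zeroed:
# |delta_L| ~= |w * dL/dw|  (higher score => more important to keep)
scores = {
    name: (p * p.grad).abs()
    for name, p in model.named_parameters()
    if p.grad is not None
}
print({name: s.mean().item() for name, s in scores.items()})
```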
TBATS combines Box-Cox transformation, Fourier (trigonometric) seasonality, ARMA errors, and trend for complex seasonal time series.
Simulation tools for process and device physics.
Physical parameters used in device/process simulation.
Temporal Convolutional Networks use dilated causal convolutions for sequence modeling with long effective history.
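For illustration, a minimal PyTorch sketch of the dilated causal convolution stack at the core of a TCN; the channel counts, kernel size, and depth are arbitrary example choices, not a reference implementation.

```python
# Minimal sketch of a dilated causal convolution stack (core TCN building block).
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, ch_in, ch_out, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation            # left-pad only => causal
        self.conv = nn.Conv1d(ch_in, ch_out, kernel_size, dilation=dilation)

    def forward(self, x):                                  # x: (batch, channels, time)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

# Dilations 1, 2, 4, ... grow the receptive field exponentially,
# giving a long effective history with few layers.
tcn = nn.Sequential(
    CausalConv1d(1, 16, kernel_size=3, dilation=1), nn.ReLU(),
    CausalConv1d(16, 16, kernel_size=3, dilation=2), nn.ReLU(),
    CausalConv1d(16, 16, kernel_size=3, dilation=4), nn.ReLU(),
)
y = tcn(torch.randn(2, 1, 100))   # output keeps the time length: (2, 16, 100)
```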
Improved actor-critic algorithm for continuous control that curbs DDPG's overestimation bias (detailed below).
# TD3: Twin Delayed Deep Deterministic Policy Gradient

**Advanced Reinforcement Learning Algorithm for Continuous Control**

## Overview

TD3, introduced by Fujimoto et al. (2018), addresses a fundamental problem in continuous control: **overestimation bias** in actor-critic methods. DDPG (Deep Deterministic Policy Gradient), while effective, suffers from significant value overestimation that compounds over training, leading to poor policies.

## Core Motivation: The Overestimation Problem

### Mathematical Formulation

In Q-learning, the Bellman target is:

$$
y = r + \gamma \max_{a'} Q(s', a')
$$

The problem arises because:

$$
\mathbb{E}\left[\max_{a'} \hat{Q}(s', a')\right] \geq \max_{a'} \mathbb{E}\left[\hat{Q}(s', a')\right]
$$

Where $\hat{Q}$ is the estimated Q-function with approximation error.

### Why This Matters

- Function approximation introduces noise: $\hat{Q}(s,a) = Q^*(s,a) + \epsilon$
- The $\max$ operator preferentially selects overestimated values
- Errors propagate and amplify through bootstrapping
- The policy exploits these overestimations, leading to divergence

## The Three Pillars of TD3

### 1. Clipped Double Q-Learning

TD3 maintains **two** critic networks $(Q_{\theta_1}, Q_{\theta_2})$ and uses the minimum for targets:

$$
y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tilde{a}')
$$

Where:

- $Q_{\theta'_1}, Q_{\theta'_2}$ are target networks
- $\tilde{a}'$ is the smoothed target action

**Loss function for each critic:**

$$
\mathcal{L}(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q_{\theta_i}(s,a) - y \right)^2 \right]
$$

#### Key Insight

| Method | Approach | Effect |
|--------|----------|--------|
| Double DQN | Decouples selection and evaluation | Reduces overestimation |
| TD3 | Takes minimum of two estimates | More aggressive bias reduction |

### 2. Delayed Policy Updates

The actor is updated **less frequently** than the critics:

$$
\theta_{\pi} \leftarrow \theta_{\pi} + \alpha \nabla_{\theta_\pi} J(\theta_\pi) \quad \text{every } d \text{ steps}
$$

Where $d$ is typically 2.

**Policy gradient:**

$$
\nabla_{\theta_\pi} J(\theta_\pi) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \nabla_a Q_{\theta_1}(s,a) \big|_{a=\pi_{\theta_\pi}(s)} \nabla_{\theta_\pi} \pi_{\theta_\pi}(s) \right]
$$

#### Rationale

- Policy updates depend on accurate value estimates
- High-variance critic estimates cause policy divergence
- Delayed updates allow the critic to stabilize
- Reduces the "moving target" problem

### 3. Target Policy Smoothing

Noise is added to target actions:

$$
\tilde{a}' = \pi_{\theta'_\pi}(s') + \epsilon, \quad \epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c)
$$

**Purpose:**

$$
Q(s', a') \approx Q(s', a' + \epsilon) \quad \text{for small } \epsilon
$$

This regularizes the Q-function, preventing exploitation of narrow peaks.

## Complete TD3 Algorithm

### Pseudocode

```
Initialize:
- Critic networks Q_θ₁, Q_θ₂
- Actor network π_φ
- Target networks θ'₁ ← θ₁, θ'₂ ← θ₂, φ' ← φ
- Replay buffer D

For each timestep t:
1. Select action with exploration: a ~ π_φ(s) + ε, ε ~ N(0, σ)
2. Execute a, observe r, s'
3. Store (s, a, r, s') in D
4. Sample mini-batch from D
5. Compute target:
   ã ← π_φ'(s') + clip(ε, -c, c), ε ~ N(0, σ̃)
   y ← r + γ min_{i=1,2} Q_θ'ᵢ(s', ã)
6. Update critics:
   θᵢ ← θᵢ - α∇_θᵢ (Q_θᵢ(s,a) - y)²
7. If t mod d = 0:
   - Update actor: φ ← φ + β∇_φ Q_θ₁(s, π_φ(s))
   - Update targets:
     θ'ᵢ ← τθᵢ + (1-τ)θ'ᵢ
     φ' ← τφ + (1-τ)φ'
```

### Hyperparameters

| Parameter | Symbol | Typical Value | Description |
|-----------|--------|---------------|-------------|
| Discount factor | $\gamma$ | 0.99 | Future reward weighting |
| Soft update rate | $\tau$ | 0.005 | Target network update rate |
| Policy delay | $d$ | 2 | Critic updates per actor update |
| Target noise | $\tilde{\sigma}$ | 0.2 | Smoothing noise std |
| Noise clip | $c$ | 0.5 | Smoothing noise bounds |
| Exploration noise | $\sigma$ | 0.1 | Action exploration std |
| Batch size | $N$ | 256 | Mini-batch size |
| Learning rate | $\alpha, \beta$ | 3e-4 | Network learning rates |

## Mathematical Deep Dive

### Overestimation Bias Analysis

Let the true Q-value be $Q^*(s,a)$ and the estimate be:

$$
\hat{Q}(s,a) = Q^*(s,a) + \epsilon(s,a)
$$

Where $\epsilon(s,a)$ is zero-mean noise with variance $\sigma^2$.

**Single estimator bias:**

$$
\mathbb{E}\left[\max_a \hat{Q}(s,a)\right] - \max_a Q^*(s,a) \approx \sigma\sqrt{\frac{2\log n}{\pi}}
$$

For $n$ actions sampled.

**Double estimator (TD3) bias:**

$$
\mathbb{E}\left[\min_{i=1,2} \hat{Q}_i(s,a)\right] \leq Q^*(s,a)
$$

TD3 trades overestimation for slight underestimation, which is empirically more stable.

### Deterministic Policy Gradient

The actor maximizes expected return:

$$
J(\theta_\pi) = \mathbb{E}_{s \sim \rho^\pi} \left[ Q^{\pi}(s, \pi_{\theta_\pi}(s)) \right]
$$

Gradient (Silver et al., 2014):

$$
\nabla_{\theta_\pi} J(\theta_\pi) = \mathbb{E}_{s \sim \rho^\pi} \left[ \nabla_{\theta_\pi} \pi_{\theta_\pi}(s) \nabla_a Q^{\pi}(s,a) \big|_{a=\pi_{\theta_\pi}(s)} \right]
$$

## Comparison with Related Algorithms

### TD3 vs DDPG vs SAC

| Aspect | DDPG | TD3 | SAC |
|--------|------|-----|-----|
| Number of critics | 1 | 2 (min) | 2 (min) |
| Policy type | Deterministic | Deterministic | Stochastic |
| Entropy regularization | ✗ | ✗ | ✓ |
| Exploration method | External noise | External noise | Policy entropy |
| Actor update frequency | Every step | Delayed | Every step |
| Target smoothing | ✗ | ✓ | ✗ |

### SAC Objective (for comparison)

$$
J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \left( r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right) \right]
$$

Where $\mathcal{H}$ is entropy and $\alpha$ is the temperature parameter.

## Practical Implementation Notes

### Network Architecture

**Critic Network:**

$$
Q_\theta(s,a) = f_\theta(\text{concat}(s, a))
$$

Typical architecture:

- Input: $\text{dim}(s) + \text{dim}(a)$
- Hidden layers: [256, 256] with ReLU
- Output: 1 (scalar Q-value)

**Actor Network:**

$$
\pi_\phi(s) = \tanh(f_\phi(s)) \cdot a_{\max}
$$

Typical architecture:

- Input: $\text{dim}(s)$
- Hidden layers: [256, 256] with ReLU
- Output: $\text{dim}(a)$ with tanh activation

### Common Failure Modes

1. **Insufficient exploration**
   - Symptom: Premature convergence to a suboptimal policy
   - Solution: Increase exploration noise, use parameter noise
2. **Critic divergence**
   - Symptom: Q-values grow unboundedly
   - Solution: Reduce learning rate, gradient clipping
3. **Slow learning**
   - Symptom: Policy improves very slowly
   - Solution: Reduce policy delay, increase batch size

### Debugging Tips

- Monitor Q-value statistics: $\mathbb{E}[Q]$, $\text{Var}[Q]$, $\max Q$
- Track actor and critic losses separately
- Visualize the learned policy periodically
- Compare predicted vs actual returns

## When to Use TD3

### Good Fit

- Continuous control with dense rewards
- Robotic manipulation and locomotion
- When sample efficiency matters
- Environments with smooth dynamics

### Consider Alternatives

- **Discrete actions** → DQN, Rainbow, C51
- **Maximum exploration needed** → SAC
- **Model available** → MBPO, Dreamer, MuZero
- **Multi-task/Meta-learning** → MAML, RL²

## Equations

### Target Computation

$$
y = r + \gamma \min_{i=1,2} Q_{\theta'_i}\left(s', \pi_{\theta'_\pi}(s') + \text{clip}(\epsilon, -c, c)\right)
$$

### Critic Loss

$$
\mathcal{L}_{\text{critic}} = \frac{1}{N} \sum_{j=1}^{N} \left( Q_{\theta_i}(s_j, a_j) - y_j \right)^2
$$

### Actor Loss

$$
\mathcal{L}_{\text{actor}} = -\frac{1}{N} \sum_{j=1}^{N} Q_{\theta_1}(s_j, \pi_{\theta_\pi}(s_j))
$$

### Soft Target Update

$$
\theta' \leftarrow \tau \theta + (1 - \tau) \theta'
$$
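## Minimal Update Sketch (PyTorch)

The following condenses the target computation, critic loss, delayed actor update, and soft target update above into one function. It is an illustrative sketch, not a reference implementation: the actor/critic networks, their target copies, the optimizers (a single optimizer is assumed to cover both critics), and the replay-buffer batch of float tensors `(s, a, r, s2, done)` are assumed to exist, and all names are placeholders.

```python
# Condensed sketch of a single TD3 update step; objects passed in are assumed
# to be standard nn.Module networks and torch optimizers (names illustrative).
import torch
import torch.nn.functional as F

GAMMA, TAU, POLICY_DELAY = 0.99, 0.005, 2
TARGET_NOISE, NOISE_CLIP, MAX_ACTION = 0.2, 0.5, 1.0

def td3_update(step, batch, nets, opts):
    s, a, r, s2, done = batch                  # done is a 0/1 float tensor
    actor, critic1, critic2, actor_t, critic1_t, critic2_t = nets
    actor_opt, critic_opt = opts

    # --- Target with policy smoothing and clipped double Q ---
    with torch.no_grad():
        noise = (torch.randn_like(a) * TARGET_NOISE).clamp(-NOISE_CLIP, NOISE_CLIP)
        a2 = (actor_t(s2) + noise).clamp(-MAX_ACTION, MAX_ACTION)
        q_target = torch.min(critic1_t(s2, a2), critic2_t(s2, a2))
        y = r + GAMMA * (1.0 - done) * q_target

    # --- Critic update: both critics regress to the same target y ---
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # --- Delayed actor and target-network updates ---
    if step % POLICY_DELAY == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Soft (Polyak) update of all target networks
        for net, net_t in [(actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)]:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - TAU).add_(TAU * p.data)
```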
Time-dependent dielectric breakdown testing.
Time Domain Reflectometry measures impedance discontinuities along signal paths by analyzing reflected waveforms from incident step signals.
Training-free ensemble NAS combines multiple zero-cost proxies to improve the reliability of architecture evaluation.
Teacher-student curriculum learning uses a teacher model to assess sample difficulty and guide curriculum design for student training.
General paradigm for distillation.
Teacher-student training transfers knowledge from complex to simple models through soft targets.
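A minimal PyTorch sketch of such a soft-target loss, assuming teacher and student logits are available; the temperature `T`, weight `alpha`, and the T² gradient-scale factor follow the common Hinton-style formulation, and the random tensors stand in for real model outputs.

```python
# Minimal sketch of a soft-target distillation loss with temperature T.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Softened distributions: T > 1 spreads probability mass over non-target
    # classes, exposing the teacher's relative class preferences.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # T^2 restores gradient scale (Hinton et al.)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10),
                         torch.randint(0, 10, (8,)))
```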
Intermediate model between teacher and student.
Turning expert knowledge into internal docs, playbooks, or mini-courses for onboarding a team.
AI teams need diverse skills (ML, engineering, product, domain expertise) and a culture of experimentation and learning.
Find areas needing refactoring.
AI tech debt: hacky prompts, hardcoded logic, missing tests. Schedule time to refactor and maintain.
Create technical specifications.
License process technology.
Compare different process nodes.
Process generation names.
Maturity of technology.
Plan for future technology development.
Technology roadmaps project evolution of capabilities and requirements over time.
Move from development to production.
TEEs (Trusted Execution Environments) run code in isolated, secure enclaves, protecting models and data.
Total Effective Equipment Performance includes all calendar time in the denominator.
Ultra-high resolution imaging for defect analysis.
Adjust temperature for better calibrated probabilities.
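A minimal sketch of post-hoc temperature scaling in PyTorch, assuming held-out validation logits and labels; the optimizer choice (Adam rather than the more common L-BFGS) and the step count are arbitrary for the example.

```python
# Minimal sketch of temperature scaling: fit a single scalar T on held-out
# logits/labels, then divide logits by T before softmax at inference.
import torch

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    log_t = torch.zeros(1, requires_grad=True)      # optimize log(T) so T > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

T = fit_temperature(torch.randn(100, 5), torch.randint(0, 5, (100,)))
calibrated_probs = torch.softmax(torch.randn(3, 5) / T, dim=-1)
```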
Precision temperature controller for wafer chucks and process zones.
Temperature control maintains stable thermal environments for processes and metrology.
Thermal stress during screening.
Model thermal stress from cycling.
Repeatedly heat and cool to test thermal stress.
Temperature parameter in distillation softens predictions, revealing relative class probabilities.
Combined environmental stress test.
Soften probability distributions.
Adjust task probabilities.
Scale logits before sampling.
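A tiny sketch of temperature-scaled sampling with placeholder logits; the value of `T` below is an arbitrary example (T < 1 sharpens the distribution, T > 1 flattens it).

```python
# Minimal sketch of sampling from temperature-scaled logits.
import torch

def sample_with_temperature(logits, T=0.8):
    probs = torch.softmax(logits / T, dim=-1)
    return torch.multinomial(probs, num_samples=1)

token = sample_with_temperature(torch.randn(1, 50))
```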