Tally sheet (quality & reliability)
Tally sheets record event counts or observations for frequency analysis.
Tantalum nitride barrier.
Test Access Port controller is a state machine that interprets boundary scan instructions and controls test data register operations.
Tape-out: the final design is sent to the foundry as a GDSII file. Point of no return; it takes months to get chips back.
Standard widths: 8 mm, 12 mm, 16 mm, etc.
Tapeout completes the design by sending the data for mask fabrication.
Finalize design and send to foundry for mask making.
Anti-reflective coating (ARC) layer on top of the resist.
Target encoding replaces each category with the mean of the target variable for that category. It can leak target information unless the encoding is computed out-of-fold or smoothed.
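As an illustration (not part of the glossary), a minimal pandas sketch of out-of-fold mean target encoding, which limits the leakage mentioned above; the column names `city` and `price` in the usage line are placeholders.

```python
# Minimal sketch of out-of-fold mean target encoding (illustrative column names).
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df, cat_col, target_col, n_splits=5):
    """Encode cat_col with per-category target means computed out-of-fold."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(df):
        # Means come only from the training fold, so a row never sees its own target.
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        vals = df.iloc[val_idx][cat_col].map(fold_means).fillna(global_mean)
        encoded.iloc[val_idx] = vals.to_numpy()
    return encoded

# Usage (hypothetical frame): df["city_encoded"] = target_encode(df, "city", "price")
```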
Target impedance specifies maximum allowable power distribution network impedance versus frequency ensuring acceptable voltage ripple.
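A common first-order sizing rule (added here for illustration; the symbols are generic, not from this glossary) relates the target impedance to the supply voltage, the allowed ripple fraction, and the worst-case transient current step:

$$
Z_{\text{target}} = \frac{V_{\text{supply}} \times \text{ripple}_{\text{allowed}}}{\Delta I_{\text{transient}}}
$$

For example, a 1.0 V rail with 5% allowed ripple and a 10 A transient step gives 0.05 V / 10 A = 5 mΩ across the frequencies of interest.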
Target speaker extraction isolates specific speaker from mixture using enrollment utterance.
Desired final wafer thickness.
Target tracking adjusts process aims as equipment drifts, maintaining centered output.
Design to hit the nominal value, not just meet the specs.
Desired nominal value.
Source material being sputtered (metal alloy).
Task allocation assigns responsibilities to agents based on capabilities and load.
Combine high-level task planning with motion planning.
Add/subtract task vectors.
Ensure fair task representation.
Task decomposition breaks complex goals into manageable subtasks.
Task diversity in instruction tuning exposes models to varied problem types.
Cluster related tasks.
Task instructions specify desired actions clearly guiding model behavior.
Tasks hurting each other's performance.
Use prompts to specify task.
Model identifying task from examples.
Direct inputs to task-specific modules.
How to sample tasks during training.
Relatedness between tasks.
Prepend task identifiers.
Learn tasks sequentially with task labels at test time.
Goal-driven conversations.
Separate output layers per task.
Separate parameters per task.
Pre-train for specific downstream task.
A Taskfile defines tasks in YAML for Task, a Go-based task runner.
TasNet performs end-to-end time-domain audio separation; its Conv-TasNet variant replaces the original LSTM separator with temporal convolutional networks.
Taylor expansion pruning approximates loss change from removing weights using Taylor series.
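A hedged PyTorch sketch of the first-order version of this idea, scoring each weight by |w · ∂L/∂w| as an estimate of the loss change if that weight were removed; `model`, `loss_fn`, `inputs`, and `targets` are placeholders.

```python
# Sketch: first-order Taylor importance scores |w * dL/dw| per weight.
import torch

def taylor_importance(model, loss_fn, inputs, targets):
    """Return a dict of per-parameter first-order Taylor pruning scores."""
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    scores = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            # First-order approximation of the loss change if the weight is zeroed.
            scores[name] = (param.detach() * param.grad.detach()).abs()
    return scores
```

Weights (or whole channels, by summing scores) with the smallest values are the candidates to prune.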
TBATS combines Box-Cox transformation, Fourier-based seasonality, ARMA errors, and trend for complex seasonal time series.
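For context, a sketch using the third-party Python `tbats` package; the constructor arguments and `forecast` call are assumed from that package's documented interface, and the seasonal periods (weekly and yearly on daily data) are illustrative.

```python
# Sketch with the third-party `tbats` package (interface assumed, periods illustrative).
import numpy as np
from tbats import TBATS

y = np.random.rand(3 * 365)                      # placeholder daily series
estimator = TBATS(seasonal_periods=[7, 365.25])  # Box-Cox, trend, ARMA errors chosen by AIC
model = estimator.fit(y)
forecast = model.forecast(steps=30)              # 30-step-ahead forecast
```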
Simulation tools for process and device physics.
Physical parameters used in device/process simulation.
Temporal Convolutional Networks use dilated causal convolutions for sequence modeling with long effective history.
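A minimal PyTorch sketch of the core TCN building block, a causal dilated 1-D convolution; channel counts and kernel size are illustrative, and the usual residual connections and normalization are omitted for brevity.

```python
# Sketch of a causal dilated 1-D convolution, the TCN building block.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, dilation=1):
        super().__init__()
        # Left-pad so the output at time t depends only on inputs at times <= t.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))
        return self.conv(x)

# Stacking blocks with dilations 1, 2, 4, ... grows the receptive field exponentially.
tcn = nn.Sequential(*[CausalConv1d(16, 16, dilation=2**i) for i in range(4)])
```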
Improved continuous control algorithm.
# TD3: Twin Delayed Deep Deterministic Policy Gradient

**Advanced Reinforcement Learning Algorithm for Continuous Control**

## Overview

TD3, introduced by Fujimoto et al. (2018), addresses a fundamental problem in continuous control: **overestimation bias** in actor-critic methods. DDPG (Deep Deterministic Policy Gradient), while effective, suffers from significant value overestimation that compounds over training, leading to poor policies.

## Core Motivation: The Overestimation Problem

### Mathematical Formulation

In Q-learning, the Bellman target is:

$$
y = r + \gamma \max_{a'} Q(s', a')
$$

The problem arises because:

$$
\mathbb{E}\left[\max_{a'} \hat{Q}(s', a')\right] \geq \max_{a'} \mathbb{E}\left[\hat{Q}(s', a')\right]
$$

Where $\hat{Q}$ is the estimated Q-function with approximation error.

### Why This Matters

- Function approximation introduces noise: $\hat{Q}(s,a) = Q^*(s,a) + \epsilon$
- The $\max$ operator preferentially selects overestimated values
- Errors propagate and amplify through bootstrapping
- Policy exploits these overestimations, leading to divergence

## The Three Pillars of TD3

### 1. Clipped Double Q-Learning

TD3 maintains **two** critic networks $(Q_{\theta_1}, Q_{\theta_2})$ and uses the minimum for targets:

$$
y = r + \gamma \min_{i=1,2} Q_{\theta'_i}(s', \tilde{a}')
$$

Where:

- $Q_{\theta'_1}, Q_{\theta'_2}$ are target networks
- $\tilde{a}'$ is the smoothed target action

**Loss function for each critic:**

$$
\mathcal{L}(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q_{\theta_i}(s,a) - y \right)^2 \right]
$$

#### Key Insight

| Method | Approach | Effect |
|--------|----------|--------|
| Double DQN | Decouples selection and evaluation | Reduces overestimation |
| TD3 | Takes minimum of two estimates | More aggressive bias reduction |

### 2. Delayed Policy Updates

The actor is updated **less frequently** than critics:

$$
\theta_{\pi} \leftarrow \theta_{\pi} + \alpha \nabla_{\theta_\pi} J(\theta_\pi) \quad \text{every } d \text{ steps}
$$

Where $d$ is typically 2.

**Policy gradient:**

$$
\nabla_{\theta_\pi} J(\theta_\pi) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \nabla_a Q_{\theta_1}(s,a) \big|_{a=\pi_{\theta_\pi}(s)} \nabla_{\theta_\pi} \pi_{\theta_\pi}(s) \right]
$$

#### Rationale

- Policy updates depend on accurate value estimates
- High-variance critic estimates cause policy divergence
- Delayed updates allow critic to stabilize
- Reduces the "moving target" problem

### 3. Target Policy Smoothing

Noise is added to target actions:

$$
\tilde{a}' = \pi_{\theta'_\pi}(s') + \epsilon, \quad \epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c)
$$

**Purpose:**

$$
Q(s', a') \approx Q(s', a' + \epsilon) \quad \text{for small } \epsilon
$$

This regularizes the Q-function, preventing exploitation of narrow peaks.

## Complete TD3 Algorithm

### Pseudocode

```
Initialize:
  - Critic networks Q_θ₁, Q_θ₂
  - Actor network π_φ
  - Target networks θ'₁ ← θ₁, θ'₂ ← θ₂, φ' ← φ
  - Replay buffer D

For each timestep t:
  1. Select action with exploration:
     a ~ π_φ(s) + ε, ε ~ N(0, σ)
  2. Execute a, observe r, s'
  3. Store (s, a, r, s') in D
  4. Sample mini-batch from D
  5. Compute target:
     ã ← π_φ'(s') + clip(ε, -c, c), ε ~ N(0, σ̃)
     y ← r + γ min_{i=1,2} Q_θ'ᵢ(s', ã)
  6. Update critics:
     θᵢ ← θᵢ - α∇_θᵢ (Q_θᵢ(s,a) - y)²
  7. If t mod d = 0:
     - Update actor:
       φ ← φ + β∇_φ Q_θ₁(s, π_φ(s))
     - Update targets:
       θ'ᵢ ← τθᵢ + (1-τ)θ'ᵢ
       φ' ← τφ + (1-τ)φ'
```
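The pseudocode above maps almost line-for-line onto code. The following is a minimal, hedged PyTorch sketch of a single TD3 update (not the authors' reference implementation); the networks `actor`, `critic1`, `critic2`, their `*_t` target copies, the optimizers, and the batch tensors are assumed to already exist with the obvious shapes.

```python
# Hedged sketch of one TD3 update step, assuming pre-built PyTorch modules and optimizers.
import torch
import torch.nn.functional as F

def td3_update(batch, actor, critic1, critic2,
               actor_t, critic1_t, critic2_t,
               actor_opt, critic_opt, step,
               gamma=0.99, tau=0.005, policy_delay=2,
               noise_std=0.2, noise_clip=0.5, max_action=1.0):
    s, a, r, s2, done = batch  # states, actions, rewards, next states, done flags

    # --- Critic update: clipped double Q-learning with target policy smoothing ---
    with torch.no_grad():
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a2 = (actor_t(s2) + noise).clamp(-max_action, max_action)
        q_target = torch.min(critic1_t(s2, a2), critic2_t(s2, a2))
        y = r + gamma * (1.0 - done) * q_target
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # --- Delayed actor and target updates (every `policy_delay` steps) ---
    if step % policy_delay == 0:
        actor_loss = -critic1(s, actor(s)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        # Soft (Polyak) target updates: θ' ← τθ + (1 - τ)θ'
        for net, net_t in [(actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)]:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
```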
### Hyperparameters

| Parameter | Symbol | Typical Value | Description |
|-----------|--------|---------------|-------------|
| Discount factor | $\gamma$ | 0.99 | Future reward weighting |
| Soft update rate | $\tau$ | 0.005 | Target network update rate |
| Policy delay | $d$ | 2 | Critic updates per actor update |
| Target noise | $\tilde{\sigma}$ | 0.2 | Smoothing noise std |
| Noise clip | $c$ | 0.5 | Smoothing noise bounds |
| Exploration noise | $\sigma$ | 0.1 | Action exploration std |
| Batch size | $N$ | 256 | Mini-batch size |
| Learning rate | $\alpha, \beta$ | 3e-4 | Network learning rates |

## Mathematical Deep Dive

### Overestimation Bias Analysis

Let the true Q-value be $Q^*(s,a)$ and the estimate be:

$$
\hat{Q}(s,a) = Q^*(s,a) + \epsilon(s,a)
$$

Where $\epsilon(s,a)$ is zero-mean noise with variance $\sigma^2$.

**Single estimator bias:**

$$
\mathbb{E}\left[\max_a \hat{Q}(s,a)\right] - \max_a Q^*(s,a) \approx \sigma\sqrt{\frac{2\log n}{\pi}}
$$

For $n$ actions sampled.

**Double estimator (TD3) bias:**

$$
\mathbb{E}\left[\min_{i=1,2} \hat{Q}_i(s,a)\right] \leq Q^*(s,a)
$$

TD3 trades overestimation for slight underestimation, which is empirically more stable.

### Deterministic Policy Gradient

The actor maximizes expected return:

$$
J(\theta_\pi) = \mathbb{E}_{s \sim \rho^\pi} \left[ Q^{\pi}(s, \pi_{\theta_\pi}(s)) \right]
$$

Gradient (Silver et al., 2014):

$$
\nabla_{\theta_\pi} J(\theta_\pi) = \mathbb{E}_{s \sim \rho^\pi} \left[ \nabla_{\theta_\pi} \pi_{\theta_\pi}(s) \nabla_a Q^{\pi}(s,a) \big|_{a=\pi_{\theta_\pi}(s)} \right]
$$

## Comparison with Related Algorithms

### TD3 vs DDPG vs SAC

| Aspect | DDPG | TD3 | SAC |
|--------|------|-----|-----|
| Number of critics | 1 | 2 (min) | 2 (min) |
| Policy type | Deterministic | Deterministic | Stochastic |
| Entropy regularization | ✗ | ✗ | ✓ |
| Exploration method | External noise | External noise | Policy entropy |
| Actor update frequency | Every step | Delayed | Every step |
| Target smoothing | ✗ | ✓ | ✗ |

### SAC Objective (for comparison)

$$
J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \left( r_t + \alpha \mathcal{H}(\pi(\cdot|s_t)) \right) \right]
$$

Where $\mathcal{H}$ is entropy and $\alpha$ is the temperature parameter.

## Practical Implementation Notes

### Network Architecture

**Critic Network:**

$$
Q_\theta(s,a) = f_\theta(\text{concat}(s, a))
$$

Typical architecture:

- Input: $\text{dim}(s) + \text{dim}(a)$
- Hidden layers: [256, 256] with ReLU
- Output: 1 (scalar Q-value)

**Actor Network:**

$$
\pi_\phi(s) = \tanh(f_\phi(s)) \cdot a_{\max}
$$

Typical architecture:

- Input: $\text{dim}(s)$
- Hidden layers: [256, 256] with ReLU
- Output: $\text{dim}(a)$ with tanh activation
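The two architectures above can be written down directly; the PyTorch sketch below is illustrative (layer sizes follow the [256, 256] ReLU description, everything else is an assumption rather than a reference implementation).

```python
# Illustrative TD3 actor/critic networks matching the [256, 256] ReLU description.
import torch
import torch.nn as nn

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),  # scalar Q-value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action=1.0):
        super().__init__()
        self.max_action = max_action
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # bounded in [-1, 1]
        )

    def forward(self, state):
        return self.max_action * self.net(state)
```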
### Common Failure Modes

1. **Insufficient exploration**
   - Symptom: Premature convergence to suboptimal policy
   - Solution: Increase exploration noise, use parameter noise
2. **Critic divergence**
   - Symptom: Q-values grow unboundedly
   - Solution: Reduce learning rate, gradient clipping
3. **Slow learning**
   - Symptom: Policy improves very slowly
   - Solution: Reduce policy delay, increase batch size

### Debugging Tips

- Monitor Q-value statistics: $\mathbb{E}[Q]$, $\text{Var}[Q]$, $\max Q$
- Track actor and critic losses separately
- Visualize learned policy periodically
- Compare predicted vs actual returns

## When to Use TD3

### Good Fit

- Continuous control with dense rewards
- Robotic manipulation and locomotion
- When sample efficiency matters
- Environments with smooth dynamics

### Consider Alternatives

- **Discrete actions** → DQN, Rainbow, C51
- **Maximum exploration needed** → SAC
- **Model available** → MBPO, Dreamer, MuZero
- **Multi-task/Meta-learning** → MAML, RL²

## Equations

### Target Computation

$$
y = r + \gamma \min_{i=1,2} Q_{\theta'_i}\left(s', \pi_{\theta'_\pi}(s') + \text{clip}(\epsilon, -c, c)\right)
$$

### Critic Loss

$$
\mathcal{L}_{\text{critic}} = \frac{1}{N} \sum_{j=1}^{N} \left( Q_{\theta_i}(s_j, a_j) - y_j \right)^2
$$

### Actor Loss

$$
\mathcal{L}_{\text{actor}} = -\frac{1}{N} \sum_{j=1}^{N} Q_{\theta_1}(s_j, \pi_{\theta_\pi}(s_j))
$$

### Soft Target Update

$$
\theta' \leftarrow \tau \theta + (1 - \tau) \theta'
$$
Time-dependent dielectric breakdown testing.
Time Domain Reflectometry measures impedance discontinuities along signal paths by analyzing reflected waveforms from incident step signals.
Training-free ensemble NAS combines multiple zero-cost proxies, improving architecture evaluation reliability.
Teacher-student curriculum learning uses a teacher model to assess sample difficulty and guide curriculum design for student training.
General paradigm for distillation.