
AI Factory Glossary

653 technical terms and definitions


transfer chamber,production

Vacuum chamber with robot for moving wafers between modules.

transfer entropy, time series models

Transfer entropy quantifies directed information flow between time series by measuring reduction in uncertainty when conditioning on source history.
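For reference, the standard (Schreiber) formulation, where $Y_t^{(k)}$ and $X_t^{(l)}$ denote the length-$k$ and length-$l$ histories of the target and source series:

$$T_{X \to Y} = H\!\left(Y_{t+1} \mid Y_t^{(k)}\right) - H\!\left(Y_{t+1} \mid Y_t^{(k)}, X_t^{(l)}\right)$$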

transfer learning for defect detection, data analysis

Adapt models pre-trained on large image datasets to defect classification and inspection tasks, reducing the amount of labeled fab data required.

transfer learning rec, recommendation systems

Transfer learning in recommendations adapts models trained on data-rich domains to improve sparse target domains.

transfer learning theory, advanced training

Transfer learning theory studies how knowledge from source tasks generalizes to target tasks through shared representations and domains.

transfer learning,pretrain finetune

Transfer learning: pretrain on large data, fine-tune on task. Foundation of modern NLP and vision.

transfer molding, packaging

Inject molding compound under pressure.

transfer nas, neural architecture search

Transfer learning in NAS reuses search results or weight-sharing supernets across related tasks or datasets.

transfer pressure, packaging

Pressure applied to force molding compound into the mold cavity during transfer molding.

transfer standard,metrology

Calibrated artifact or instrument used to transfer a calibration from a reference standard to tools at another location.

transformation for normality, spc

Apply a mathematical transformation (e.g., log or Box-Cox) so data better approximate a normal distribution before control charting.

transformer as memory network, theory

Theoretical view of the transformer as an associative memory, with attention performing key-value retrieval over stored representations.

transformer tts, audio & speech

Transformer TTS replaces recurrent layers with self-attention for parallel efficient text-to-speech synthesis.

transformer-hawkes, time series models, hawkes process, transformer, attention mechanism, temporal modeling, event prediction

# Transformer-Hawkes and Time Series Models ## Introduction **Transformer-Hawkes models** represent a cutting-edge fusion of classical stochastic processes and modern deep learning architectures, specifically designed for temporal event sequence modeling in continuous time. ### Key Components - **Hawkes Processes**: Self-exciting temporal point processes - **Transformer Architecture**: Attention-based neural networks - **Temporal Point Processes**: Models for events occurring in continuous time ## Mathematical Foundations ### Classical Hawkes Process The conditional intensity function of a Hawkes process is defined as: λ(t) = μ + Σ φ(t - t_i) for all t_i < t Where: - $\mu$ : baseline intensity (background rate) - $\phi(t - t_i)$ : excitation kernel measuring influence of past events - $t_i$ : time of the $i$-th event ### Common Excitation Kernels **Exponential kernel** (most common): φ(t - t_i) = α exp(-β(t - t_i)) Where: - $\alpha$ : excitation magnitude (how much each event increases future intensity) - $\beta$ : decay rate (how quickly the influence fades) **Power-law kernel**: $$\phi(t - t_i) = \frac{\alpha}{(t - t_i + c)^{1+\beta}}$$ ### Multivariate Hawkes Process For $K$ event types, the intensity of type $k$ is: $$\lambda_k(t) = \mu_k + \sum_{j=1}^{K} \sum_{t_i^j < t} \phi_{kj}(t - t_i^j)$$ Where $\phi_{kj}$ captures cross-excitation from type $j$ to type $k$. ### Log-Likelihood Function For observed event sequence $\mathcal{H} = \{(t_i, m_i)\}_{i=1}^N$: $$\log \mathcal{L}(\mathcal{H}) = \sum_{i=1}^{N} \log \lambda_{m_i}(t_i) - \sum_{k=1}^{K} \int_{0}^{T} \lambda_k(s) \, ds$$ **Components**: - First term: log-intensity at observed event times (fitting term) - Second term: integral of intensity over observation window (compensator) ## Core Concepts ### Self-Exciting Property - **Definition**: Past events increase the probability of future events - **Mathematical representation**: The intensity $\lambda(t)$ increases immediately after each event - **Key insight**: Creates clustering patterns in temporal data - **Real-world manifestation**: - Earthquake aftershocks - Social media cascades - Financial trade clustering ### Temporal Point Process A temporal point process is characterized by: - **Event times**: $\{t_1, t_2, ..., t_n\}$ occurring in continuous time - **Counting process**: $N(t) = \sum_{i} \mathbb{1}(t_i \leq t)$ - **Conditional intensity**: $\lambda(t | \mathcal{H}_t)$ depends on history up to time $t$ - **Probability formulation**: $$P(\text{event in } [t, t+dt) | \mathcal{H}_t) = \lambda(t | \mathcal{H}_t) \, dt + o(dt)$$ ### Marked Point Process Extension where each event has an associated mark (type/category): - **Event representation**: $(t_i, m_i)$ where $m_i \in \{1, 2, ..., K\}$ - **Type-specific intensities**: $\lambda^*(t) = \sum_{k=1}^{K} \lambda_k(t)$ - **Applications**: - Multi-asset trading (mark = asset type) - Hospital events (mark = procedure type) - Social networks (mark = action type) ## Architecture Details ### Transformer Hawkes Process (THP) #### Input Representation For event sequence $\mathcal{H} = \{(t_1, m_1), (t_2, m_2), ..., (t_n, m_n)\}$: **Event embedding**: e_i = Embed(m_i) + TimeEncode(t_i) Where: - Embed(m_i) ∈ R^d : learnable type embedding - TimeEncode(t_i) : continuous-time positional encoding #### Temporal Encoding **Continuous-time positional encoding** (inspired by Transformer-XL): TimeEncode(t) = [sin(t/ω₁), cos(t/ω₁), ..., sin(t/ω_{d/2}), cos(t/ω_{d/2})] Where ω_j = 10000^(-2j/d) are frequency parameters. 
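A minimal PyTorch sketch of this encoding (the sin/cos ordering and frequency convention vary across implementations; this follows the formula as written above and assumes an even `d_model`):

```python
import torch

def time_encode(t: torch.Tensor, d_model: int) -> torch.Tensor:
    """Continuous-time sinusoidal encoding for a vector of event times t (shape [n])."""
    j = torch.arange(d_model // 2, dtype=torch.float32)
    omega = 10000.0 ** (-2.0 * j / d_model)        # frequency parameters omega_j
    angles = t.unsqueeze(-1) / omega               # arguments t / omega_j
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # shape [n, d_model]

# Example: encode three event times into 8-dimensional embeddings.
enc = time_encode(torch.tensor([0.5, 1.2, 3.7]), d_model=8)            # -> shape [3, 8]
```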
**Alternative: Learnable time encoding**: TimeEncode(t) = W_t · [1, t, t², ..., t^p]ᵀ #### Self-Attention Mechanism **Multi-head self-attention** for event sequence: Attention(Q, K, V) = softmax(QK^T / √d_k) V Where: - $\mathbf{Q} = \mathbf{E}\mathbf{W}_Q$ : queries - $\mathbf{K} = \mathbf{E}\mathbf{W}_K$ : keys - $\mathbf{V} = \mathbf{E}\mathbf{W}_V$ : values - $\mathbf{E} \in \mathbb{R}^{n \times d}$ : embedded event sequence **Causal masking**: Ensures event i only attends to events j < i: Mask_ij = 0 if t_j < t_i, otherwise -∞ \end{cases}$$ #### Intensity Function Parameterization After transformer encoding, history representation at time $t$: $$\mathbf{h}(t) = \text{Transformer}(\{\mathbf{e}_i : t_i < t\})$$ **Neural intensity function**: λ*(t | H_t) = Softplus(w^T h(t) + b) Or for marked processes: λ_k(t | H_t) = Softplus(w_k^T h(t) + b_k) **Softplus activation** ensures positivity: Softplus(x) = log(1 + exp(x)) ### Training Objective **Maximum likelihood estimation**: $$\mathcal{L}(\theta) = \sum_{i=1}^{N} \log \lambda^*(t_i | \mathcal{H}_{t_i}) - \int_{0}^{T} \lambda^*(s | \mathcal{H}_s) \, ds$$ **Integral approximation** via Monte Carlo: $$\int_{0}^{T} \lambda^*(s | \mathcal{H}_s) \, ds \approx \frac{T}{M} \sum_{j=1}^{M} \lambda^*(s_j | \mathcal{H}_{s_j})$$ Where $s_j$ are uniformly sampled time points. **Alternative: Exact integration** for specific forms: If intensity has closed-form antiderivative, compute exactly: $$\int_{t_i}^{t_{i+1}} \lambda^*(s) \, ds$$ ### Full Architecture Stack ``` - Input: Event sequence {(t₁, m₁), ..., (tₙ, mₙ)} ↓ [Event Type Embedding Layer] ↓ [Continuous Time Encoding] ↓ [Input: Combined embeddings e₁, ..., eₙ] ↓ [Transformer Layer 1] • Multi-Head Self-Attention (with causal mask) • Layer Normalization • Feed-Forward Network • Residual Connection ↓ [Transformer Layer 2] • ... ↓ ... ↓ [Transformer Layer L] ↓ [History Representation: h(t)] ↓ [Intensity Network] • Linear layer: W^T h(t) + b • Softplus activation ↓ Output: λ*(t) or {λ₁(t), ..., λₖ(t)} ``` ### Key Architectural Choices #### Attention Variants - **Full attention**: $O(n^2)$ complexity, attends to all past events - **Pros**: Complete information flow - **Cons**: Quadratic complexity - **Sparse attention**: Only attends to recent $k$ events or events within time window - **Pros**: $O(nk)$ complexity, more scalable - **Cons**: May miss long-range dependencies - **Hierarchical attention**: Multi-scale attention (fine + coarse) - **Pros**: Captures both local and global patterns - **Cons**: More complex implementation #### Position Encoding Strategies - **Absolute time encoding**: Encodes actual event times - **Use case**: When absolute timing matters (e.g., circadian patterns) - **Relative time encoding**: Encodes inter-event intervals $\Delta t_i = t_i - t_{i-1}$ - **Use case**: When relative timing matters (e.g., click patterns) - **Learned time embedding**: Neural network learns optimal encoding - **Use case**: When temporal patterns are complex/unknown ## Applications ### 1. 
Financial Markets #### High-Frequency Trading **Problem formulation**: - **Events**: Market orders, limit orders, cancellations - **Marks**: Order type, asset ID, bid/ask side - **Goal**: Predict next order arrival time and type **Model setup**: $$\lambda_k^{\text{asset}}(t) = f_\theta(\text{order history}, t)$$ **Applications**: - **Market making**: Optimal quote placement based on predicted order flow - **Execution algorithms**: Timing of large order splitting - **Risk management**: Tail risk from order clustering **Key insights**: - Cross-asset excitation: Trade in Asset A triggers trades in Asset B - Microstructure effects: Bid-ask bounce, momentum - Regime changes: Volatility clustering #### Market Microstructure **Order book dynamics**: $$\lambda^{\text{trade}}(t) = \mu + \alpha_1 N^{\text{trades}}(t^-) + \alpha_2 \text{Spread}(t) + \alpha_3 \text{Imbalance}(t)$$ Where: - $N^{\text{trades}}(t^-)$ : recent trade count - $\text{Spread}(t)$ : bid-ask spread - $\text{Imbalance}(t)$ : order book imbalance ### 2. Healthcare Applications #### Patient Event Modeling **Electronic health records** as marked point process: - **Events**: Admissions, procedures, diagnoses, prescriptions - **Marks**: ICD codes, procedure codes, medication IDs - **Time**: Timestamp of each clinical event **Risk prediction**: $$\lambda^{\text{readmission}}(t) = f_\theta(\text{clinical history}, \text{demographics}, t)$$ **Clinical applications**: - **Readmission prediction**: When will patient return to hospital? - **Disease progression**: Modeling transition between disease states - **Treatment response**: Predicting response to interventions - **Resource allocation**: ICU bed demand forecasting **Example: ICU monitoring**: Events sequence: $(t_1, \text{admission}) \to (t_2, \text{lab\_test}) \to (t_3, \text{medication}) \to ...$ Model learns: - Which events typically cluster (e.g., abnormal lab triggers medication) - Time-to-next-event distributions - Patient-specific risk trajectories ### 3. Social Networks #### Information Diffusion **Cascade modeling**: - **Events**: Retweets, shares, mentions - **Network structure**: Follower graph $\mathcal{G} = (V, E)$ - **Goal**: Predict cascade size and velocity **Intensity with network structure**: $$\lambda_u(t) = \mu_u + \sum_{v \in \text{neighbors}(u)} \sum_{t_v < t} \alpha_{uv} \exp(-\beta(t - t_v))$$ Where: - $u$ : target user - $v$ : influencing neighbors - $\alpha_{uv}$ : influence weight (can be learned) **Applications**: - **Viral content detection**: Early identification of trending content - **Influence maximization**: Selecting seed users for marketing - **Bot detection**: Unusual retweeting patterns - **Trend prediction**: When will topic peak? #### User Engagement Modeling **Event types**: - Views, clicks, likes, comments, shares - Each with different excitation patterns **Multi-type intensity**: $$\lambda_{\text{share}}(t) > \lambda_{\text{like}}(t) > \lambda_{\text{view}}(t)$$ Capturing engagement hierarchy. ### 4. E-commerce #### Session-Based Recommendation **Click stream as point process**: - **Events**: Page views, add-to-cart, purchases - **Marks**: Product IDs, categories - **Goal**: Predict next action and timing **Intensity for product $p$**: $$\lambda_p(t) = f_\theta(\text{click history}, \text{product features}, t)$$ **Use cases**: - **Real-time recommendations**: What to show next? - **Conversion prediction**: Will user purchase? - **Session length modeling**: When will user leave? - **Cross-selling**: What products are bought together? 
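The domain-specific intensities above all share the same multivariate exponential-kernel skeleton. A minimal NumPy sketch of evaluating such an intensity (the event data and parameter values below are purely illustrative):

```python
import numpy as np

def multivariate_intensity(t, event_times, event_types, mu, alpha, beta):
    """lambda_k(t) = mu_k + sum over past events i of alpha[k, m_i] * exp(-beta * (t - t_i))."""
    lam = mu.copy()
    past = event_times < t
    for t_i, m_i in zip(event_times[past], event_types[past]):
        lam += alpha[:, m_i] * np.exp(-beta * (t - t_i))
    return lam

# Two event types, e.g. "view" (0) and "purchase" (1) in a click stream.
mu    = np.array([0.2, 0.05])                  # baseline rates
alpha = np.array([[0.3, 0.1],                  # alpha[k, j]: excitation from type j to type k
                  [0.2, 0.4]])
beta  = 1.5                                    # shared decay rate
times = np.array([0.4, 1.1, 2.0])
types = np.array([0, 0, 1])
print(multivariate_intensity(t=2.5, event_times=times, event_types=types,
                             mu=mu, alpha=alpha, beta=beta))
```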
#### Customer Lifetime Value **Purchase events over time**: $$\text{CLV} = \mathbb{E}\left[\sum_{t_i > \text{now}} r(t_i) \cdot \exp(-\delta(t_i - \text{now}))\right]$$ Where: - $r(t_i)$ : revenue at purchase $i$ - $\delta$ : discount rate - Expectation over predicted purchase times from Transformer-Hawkes ### 5. System Reliability #### Server Log Analysis **Event types in logs**: - Errors, warnings, info messages - Different severity levels - Multiple services/components **Anomaly detection**: $$\text{Anomaly score}(t) = \frac{\lambda^{\text{observed}}(t)}{\lambda^{\text{expected}}(t)}$$ High ratio indicates unusual activity clustering. **Predictive maintenance**: - Learn failure precursors from historical logs - Predict time-to-failure distributions - Optimize maintenance scheduling **Cascading failure detection**: - Error in Service A triggers errors in Services B, C - Model cross-service excitation patterns #### Network Traffic Analysis **Packet arrival modeling**: - **Events**: Packet arrivals, connections, disconnections - **Marks**: Protocol, source/destination, payload size **DDoS detection**: $$\lambda^{\text{traffic}}(t) \gg \lambda^{\text{baseline}}$$ Unusual intensity spike indicates attack. ### 6. Earthquake Seismology **Epidemic-Type Aftershock Sequence (ETAS) model**: $$\lambda(t) = \mu + \sum_{t_i < t} K \exp(\alpha(M_i - M_0)) (t - t_i + c)^{-p}$$ Where: - $M_i$ : magnitude of earthquake $i$ - $K, \alpha, c, p$ : model parameters - **Omori's law**: Aftershock rate decays as power-law **Transformer-Hawkes extension**: - Learn excitation patterns from data - Incorporate spatial information - Capture magnitude-dependent triggering ## Comparative Analysis ### Classical Hawkes vs Neural Approaches | Aspect | Classical Hawkes | RNN-Based | Transformer-Hawkes | |--------|-----------------|-----------|-------------------| | **Interpretability** | ⭐⭐⭐⭐⭐ High | ⭐⭐ Low | ⭐⭐⭐ Medium | | **Flexibility** | ⭐⭐ Limited | ⭐⭐⭐⭐ High | ⭐⭐⭐⭐⭐ Very High | | **Long-range dependencies** | ⭐⭐⭐ Medium | ⭐⭐ Poor | ⭐⭐⭐⭐⭐ Excellent | | **Training speed** | ⭐⭐⭐⭐⭐ Fast | ⭐⭐⭐ Medium | ⭐⭐⭐⭐ Fast (parallel) | | **Data requirements** | ⭐⭐ Low | ⭐⭐⭐⭐ High | ⭐⭐⭐⭐ High | | **Inference speed** | ⭐⭐⭐⭐⭐ Fast | ⭐⭐⭐ Medium | ⭐⭐⭐⭐ Fast | | **Parameter count** | ⭐⭐⭐⭐⭐ Few | ⭐⭐⭐ Medium | ⭐⭐ Many | ### Detailed Comparison #### Classical Hawkes Processes **Advantages**: - **Mathematical elegance**: Closed-form solutions often available - **Interpretability**: Parameters have clear meaning ($\alpha$ = excitation, $\beta$ = decay) - **Efficiency**: Fast inference, no GPU needed - **Theoretical guarantees**: Stationarity conditions, stability analysis - **Small data**: Works well with limited observations **Limitations**: - **Fixed functional form**: Must specify kernel shape (exponential, power-law) - **Linear superposition**: Assumes additive excitation - **Homogeneity**: Parameters typically constant over time - **Limited expressivity**: Cannot capture complex triggering patterns **When to use**: - Small datasets (< 1000 events) - Need interpretable parameters - Well-understood domain with simple excitation - Real-time constraints with limited compute #### RNN-Based Neural Hawkes **Architecture**: LSTM/GRU to model history, output intensity function $$h_t = \text{LSTM}(e_t, h_{t-1})$$ $$\lambda(t) = f(h_t, t - t_{\text{last}})$$ **Advantages**: - **Recurrent state**: Natural for sequential processing - **Continuous-time**: Can handle irregular event timing - **Non-parametric**: Learns excitation from data 
**Limitations**: - **Vanishing gradients**: Struggles with long sequences - **Sequential training**: Cannot parallelize over time - **Long-range dependencies**: Difficulty attending to distant past - **Hidden state**: Less interpretable than attention weights **When to use**: - Medium-length sequences (100-1000 events) - Sequential nature fits problem structure - Moderate model complexity needed #### Transformer Hawkes Processes **Advantages**: - **Parallel training**: Processes all events simultaneously - **Long-range attention**: Can attend to any past event - **Flexible**: Learns complex triggering patterns - **Scalable**: Handles long sequences better than RNNs - **Interpretable attention**: Weights show event influence - **State-of-the-art performance**: Often best predictive accuracy **Limitations**: - **Data hungry**: Needs substantial training data (>10k events) - **Computational cost**: $O(n^2)$ attention complexity - **Memory requirements**: Stores attention matrices - **Hyperparameter sensitivity**: Many architectural choices - **Overparameterization**: Risk of overfitting on small data **When to use**: - Large datasets (>10k events) - Long sequences with long-range dependencies - Complex, unknown excitation patterns - Computational resources available - Predictive accuracy is priority ### Benchmark Performance **Typical metrics**: - **Log-likelihood**: Higher is better - **Time prediction error**: MAE/RMSE of predicted vs actual event times - **Type accuracy**: Classification accuracy for event marks - **Cascade prediction**: F1 score for predicting cascade size/shape **Example results** (averaged across multiple datasets): ``` Dataset: Financial trades (100k events) - Classical Hawkes: Log-lik = -8542, Time MAE = 2.3s - RNN-Hawkes: Log-lik = -7891, Time MAE = 1.8s - Transformer-Hawkes: Log-lik = -7234, Time MAE = 1.4s Dataset: Social cascades (50k events) - Classical Hawkes: F1 = 0.62, Type Acc = 71% - RNN-Hawkes: F1 = 0.71, Type Acc = 78% - Transformer-Hawkes: F1 = 0.79, Type Acc = 84% ``` ### Hybrid Approaches **Combining strengths**: 1. **Parametric + Neural**: Use classical kernel with learned parameters $$\lambda(t) = \mu + \sum_{t_i < t} \alpha(m_i; \theta) \exp(-\beta(m_i; \theta)(t - t_i))$$ Where $\alpha(\cdot), \beta(\cdot)$ are neural networks. 2. **Physics-informed**: Constrain neural model with domain knowledge - Enforce stability conditions: $\int \phi(s) ds < 1$ - Respect causality: Only past can influence future - Incorporate known excitation patterns 3. **Multi-scale**: Classical for short-term, neural for long-term $$\lambda(t) = \lambda^{\text{classic}}_{\text{short}}(t) + \lambda^{\text{neural}}_{\text{long}}(t)$$ ## Implementation Considerations ### Numerical Challenges #### Intensity Integration **Challenge**: Computing $\int_{0}^{T} \lambda(s) ds$ for likelihood. **Monte Carlo approximation**: $$\int_{0}^{T} \lambda(s) ds \approx \frac{T}{M} \sum_{j=1}^{M} \lambda(s_j)$$ Where $s_j \sim \text{Uniform}(0, T)$. **Pros**: - Unbiased estimator - Works for any intensity function - Easy to implement **Cons**: - High variance with small $M$ - Requires many function evaluations - Stochastic gradients **Adaptive quadrature**: Use numerical integration (e.g., Simpson's rule, Gaussian quadrature) on sub-intervals where intensity changes rapidly. 
**Pros**: - Deterministic - Lower error for smooth functions - Adaptive refinement possible **Cons**: - Requires more careful implementation - May be expensive for complex $\lambda(t)$ **Thinning-based estimation**: Use Ogata's thinning algorithm to generate samples, estimate integral from sample statistics. #### Numerical Stability **Softplus overflow**: $$\text{Softplus}(x) = \log(1 + e^x)$$ For large $x$, $e^x$ overflows. **Solution**: Use identity for $x > 20$: $$\text{Softplus}(x) \approx x \text{ for large } x$$ **Log-space computations**: For log-likelihood, work in log-space to avoid underflow: $$\log \mathcal{L} = \sum_i \log \lambda(t_i) - \int \lambda(s) ds$$ Use log-sum-exp trick for stable summation. ### Computational Efficiency #### Attention Complexity **Full attention**: $O(n^2 d)$ for sequence length $n$, dimension $d$ **Sparse attention strategies**: 1. **Local attention**: Only attend to recent $k$ events - Complexity: $O(nkd)$ - Implementation: Mask out distant events 2. **Strided attention**: Attend to every $s$-th event - Reduces effective sequence length - May miss important events 3. **Adaptive attention**: Learn which events to attend to - Combines full + sparse patterns - More complex but more flexible #### Caching for Sequential Prediction When predicting next event, cache transformer states: ``` h_1, ..., h_n = Transformer(e_1, ..., e_n) # Compute once # For intensity queries at different times: λ(t) = IntensityNet(h_n, t) # Reuse h_n ``` **Savings**: Avoid recomputing transformer for each $t$ query. #### Batch Processing **Training**: Batch multiple sequences together - Pad to same length or use packing - Mask out padding in attention **Inference**: Can process multiple queries in parallel - Batch different time points $\{t_1, ..., t_K\}$ - Single forward pass gives $\{\lambda(t_1), ..., \lambda(t_K)\}$ ### Training Strategies #### Curriculum Learning **Idea**: Start with easier examples, gradually increase difficulty. **Implementation**: 1. **Stage 1**: Train on short sequences (50-100 events) 2. **Stage 2**: Increase to medium sequences (100-500 events) 3. **Stage 3**: Full-length sequences (500+ events) **Benefits**: - Faster initial convergence - Better final performance - More stable training #### Regularization Techniques **Attention dropout**: Randomly drop attention weights during training $$\text{Attention}^{\text{drop}} = \text{Dropout}(\text{softmax}(\mathbf{QK}^T / \sqrt{d_k})) \mathbf{V}$$ **Intensity smoothness**: Penalize rapid changes in intensity $$\mathcal{L}_{\text{smooth}} = \int \left|\frac{d\lambda(t)}{dt}\right|^2 dt$$ **Total variation regularization**: Encourage sparse excitation patterns $$\mathcal{L}_{\text{TV}} = \sum_{i,j} |\alpha_{ij}|$$ Where $\alpha_{ij}$ are learned excitation weights. #### Learning Rate Scheduling **Warmup + decay**: Common for transformers ``` lr(t) = d_model^(-0.5) * min(t^(-0.5), t * warmup_steps^(-1.5)) ``` **Cosine annealing**: Smooth decay to minimum $$lr(t) = lr_{\min} + \frac{1}{2}(lr_{\max} - lr_{\min})(1 + \cos(\pi t / T))$$ ### Sampling Procedures #### Generating Event Sequences **Ogata's thinning algorithm** for simulating from learned intensity: ``` Algorithm: Sample next event 1. Initialize t = t_last 2. Compute λ_max = max_{s>t} λ(s) (upper bound) 3. Sample dt ~ Exponential(λ_max) 4. Set t = t + dt 5. Sample u ~ Uniform(0, 1) 6. 
If u < λ(t) / λ_max: Accept t as next event time Sample event type m ~ Categorical(λ_1(t), ..., λ_K(t)) Return (t, m) Else: Go to step 2 (reject and continue) ``` **Finding $\lambda_{\max}$**: - For neural models, use optimization or grid search - Overestimate for safety (higher rejection rate but correct) #### Conditional Sampling **Problem**: Sample future given observed prefix **Importance**: - Counterfactual analysis (what if intervention happened?) - Missing data imputation - Scenario testing **Implementation**: ``` Given: Observed events H_obs = {(t_1, m_1), ..., (t_n, m_n)} Goal: Sample future events {(t_{n+1}, m_{n+1}), ..., (t_N, m_N)} λ(t | H_obs) = Transformer-Hawkes(H_obs, t) Use thinning algorithm with conditional intensity ``` ### Software Frameworks #### PyTorch Implementation Sketch ```python import torch import torch.nn as nn class TransformerHawkes(nn.Module): def __init__(self, num_types, d_model, nhead, num_layers): super().__init__() self.type_embed = nn.Embedding(num_types, d_model) self.time_encode = PositionalEncoding(d_model) encoder_layer = nn.TransformerEncoderLayer( d_model=d_model, nhead=nhead, batch_first=True ) self.transformer = nn.TransformerEncoder( encoder_layer, num_layers=num_layers ) self.intensity_net = nn.Sequential( nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, num_types) ) def forward(self, event_types, event_times, query_times): # Embed events type_emb = self.type_embed(event_types) time_emb = self.time_encode(event_times) x = type_emb + time_emb # Causal mask mask = self.generate_causal_mask(event_times, query_times) # Transformer encoding h = self.transformer(x, mask=mask) # Intensity at query times intensities = torch.nn.functional.softplus( self.intensity_net(h) ) return intensities def log_likelihood(self, events, T): # Sum over log-intensities at event times log_lambda = torch.log(self.forward(events) + 1e-8) sum_log_lambda = log_lambda.sum() # Integral approximation (Monte Carlo) sample_times = torch.rand(1000) * T lambda_samples = self.forward(sample_times) integral = (T / 1000) * lambda_samples.sum() return sum_log_lambda - integral ``` #### Existing Libraries **Python packages**: - `tick`: Classical point process models - `tpp`: Temporal point processes with neural variants - `pytorch-lightning`: Training framework - `transformers`: Hugging Face transformers (adapt for TPP) **Julia packages**: - `PointProcesses.jl`: High-performance implementations - `HawkesProcesses.jl`: Specialized for Hawkes models **R packages**: - `hawkes`: Basic Hawkes process functionality - `ppstat`: Point process statistics ### Debugging and Validation #### Sanity Checks 1. **Intensity positivity**: Verify $\lambda(t) > 0$ always 2. **Causality**: Event $i$ only depends on $\{j : t_j < t_i\}$ 3. **Likelihood ordering**: Training LL should improve 4. 
**Overfitting check**: Training LL >> Test LL indicates overfitting #### Visualization Tools **Intensity plot**: Plot $\lambda(t)$ over time with events marked ```python import matplotlib.pyplot as plt times = np.linspace(0, T, 1000) intensities = model.intensity(times).detach().numpy() plt.plot(times, intensities) plt.scatter(event_times, model.intensity(event_times), c='red', marker='x') plt.xlabel('Time') plt.ylabel('Intensity λ(t)') ``` **Attention heatmap**: Visualize which past events influence current ```python attention_weights = model.get_attention_weights() plt.imshow(attention_weights, aspect='auto') plt.xlabel('Event index (past)') plt.ylabel('Event index (query)') plt.colorbar(label='Attention weight') ``` **Quantile-quantile plot**: Check model fit - Transform events to standard Poisson: $\tau_i = \int_0^{t_i} \lambda(s) ds$ - If model correct, $\tau_i - \tau_{i-1} \sim \text{Exp}(1)$ - Plot empirical vs theoretical quantiles ## Recent Research ### Spatiotemporal Extensions **Problem**: Events have both time and location $(t_i, \mathbf{x}_i)$ **Spatiotemporal intensity**: $$\lambda(t, \mathbf{x}) = \mu(\mathbf{x}) + \sum_{(t_i, \mathbf{x}_i) < t} \phi(t - t_i, \|\mathbf{x} - \mathbf{x}_i\|)$$ **Transformer approach**: - Encode location with learnable positional encoding - Spatial attention mechanism - Applications: Crime prediction, disease spread, earthquake forecasting **Example architecture**: ``` Event representation: e_i = Embed(m_i) + TimeEncode(t_i) + SpaceEncode(x_i) Attention: Combines temporal and spatial similarity ``` ### Graph-Structured Hawkes **Problem**: Events occur on nodes of a network $\mathcal{G} = (V, E)$ **Graph neural intensity**: $$\lambda_u(t) = f_\theta(\text{GNN}(\mathcal{G}), \text{history}_u, t)$$ Where GNN aggregates neighborhood information. **Applications**: - Social networks with explicit follower graphs - Financial contagion across connected institutions - Traffic networks with road connectivity - Neural spike trains with brain connectivity **Key insight**: Combine graph structure with temporal dynamics **Architecture**: 1. **Graph encoder**: GNN to get node embeddings 2. **Temporal encoder**: Transformer for event sequences 3. **Fusion**: Combine graph + temporal representations 4. **Intensity predictor**: Output $\lambda_u(t)$ per node ### Continuous-Time Attention **Motivation**: Standard transformers use discrete positions **Continuous-time self-attention**: $$\text{Attention}(t) = \int_{0}^{t} \alpha(t, s) \cdot v(s) \, ds$$ Where: - $\alpha(t, s) \propto \exp(q(t)^T k(s) / \sqrt{d})$ : continuous attention weight - $v(s)$ : value function over continuous time **Benefits**: - True continuous-time modeling - No discretization artifacts - Natural for irregular event timing **Challenges**: - Integral computation expensive - Approximations needed for practical implementation **Recent work**: - Neural ODE + Attention - Continuous-time transformers with RBF kernels - Stochastic process attention ### Neural Relational Inference **Problem**: Unknown causal structure between event types **Goal**: Learn $\mathcal{G}$ where edge $j \to k$ means type $j$ excites type $k$ **Approach**: 1. **Structure learning**: Estimate adjacency matrix $\mathbf{A}$ 2. 
**Parameter learning**: Learn $\phi_{kj}$ for edges in $\mathbf{A}$ **Intensity with learned graph**: $$\lambda_k(t) = \mu_k + \sum_{j : A_{jk} = 1} \sum_{t_i^j < t} \phi_{kj}(t - t_i^j)$$ **Transformer implementation**: - Attention weights reveal influence structure - Sparse attention corresponds to graph edges - Regularization to encourage sparsity **Applications**: - Discovering causal relationships in multivariate time series - Understanding cross-asset spillover effects - Inferring brain connectivity from spike trains ### Few-Shot and Transfer Learning **Challenge**: Limited data for new domains **Transfer learning strategies**: 1. **Pre-train on large corpus**: E.g., financial trades across many assets 2. **Fine-tune on target task**: Specific asset or user **Few-shot learning**: - Meta-learning approaches (MAML, Prototypical Networks) - Learn intensity function with few examples - Transfer temporal patterns across domains **Domain adaptation**: - Source domain: Abundant labeled data - Target domain: Limited data, different distribution - Adversarial training to align representations **Example**: ``` Pre-training: 1M trading events across 100 stocks Fine-tuning: 10K events for new stock Result: Better performance than training from scratch ``` ### Interpretability Research **Attention analysis**: - Which past events most influence current intensity? - Do attention patterns match domain knowledge? **Counterfactual queries**: - What if event $i$ didn't occur? How would intensity change? - Use to identify critical events in cascades **Feature importance**: - Which event features (type, time, covariates) matter most? - Integrated gradients, SHAP values for neural TPPs **Parametric distillation**: - Train transformer model - Distill into interpretable classical Hawkes - Best of both: performance + interpretability ### Robust and Uncertainty-Aware Models **Uncertainty quantification**: - Bayesian neural Hawkes: Posterior over intensity functions - Monte Carlo dropout for epistemic uncertainty - Prediction intervals for event times **Robust to outliers**: - Heavy-tailed excitation kernels - Robust loss functions (Huber loss) - Anomaly detection via likelihood thresholding **Handling missing data**: - Missing event types: Marginalize over possibilities - Observation windows: Account for censoring - Incomplete sequences: Use partial likelihood ### Hybrid Physics-Neural Models **Motivation**: Combine domain knowledge with learning **Approaches**: 1. **Constrained neural architectures**: - Enforce stability: $\|A\|_{\text{spectral}} < 1$ - Respect physical laws: Conservation, causality 2. **Physics-informed losses**: - Add penalty for violating known constraints - Example: Force decay in excitation over time 3. **Decomposition**: $$\lambda(t) = \lambda_{\text{physics}}(t) + \lambda_{\text{neural}}(t)$$ - Physics part handles known dynamics - Neural part learns residuals **Benefits**: - Better generalization with less data - More interpretable - Respects domain constraints **Applications**: - Earthquake modeling (ETAS + neural corrections) - Option pricing (Black-Scholes + learned volatility) - Epidemiology (SIR model + behavioral factors) ### Multimodal Temporal Modeling **Problem**: Events with rich context (text, images, etc.) 
**Architecture**: ``` Event: (time, type, text, image) ↓ [Text Encoder: BERT] → text_emb [Image Encoder: ResNet] → image_emb [Type Embedding] → type_emb [Time Encoding] → time_emb ↓ Combined: e_i = [text_emb; image_emb; type_emb; time_emb] ↓ [Transformer Encoder] ↓ λ(t) ``` **Applications**: - Social media: Tweets with images/videos - Healthcare: Clinical notes + imaging + lab results - E-commerce: Product descriptions + images + reviews ## Advanced Topics ### Marked Spatio-Temporal Point Processes **Full representation**: $(t_i, \mathbf{x}_i, m_i)$ - time, location, type **Intensity**: $$\lambda(t, \mathbf{x}, m) = f_\theta(\mathcal{H}_t, t, \mathbf{x}, m)$$ **Conditional distributions**: - **When**: $f_t(\tau | \mathcal{H})$ - next event time - **Where**: $f_x(\mathbf{x} | t, \mathcal{H})$ - event location - **What**: $f_m(m | t, \mathbf{x}, \mathcal{H})$ - event type **Decomposition**: $$\lambda(t, \mathbf{x}, m) = \lambda^*(t) \cdot p(\mathbf{x} | t) \cdot p(m | t, \mathbf{x})$$ Separate when, where, what components. ### Recurrent Transformer Architectures **Motivation**: Very long sequences exceed memory **Approach**: Combine recurrence + attention ``` For each time window [t_k, t_{k+1}]: 1. Process events with transformer 2. Summarize to fixed-size state s_k 3. Pass state to next window: s_{k+1} = f(s_k, events_k) ``` **Benefits**: - Constant memory regardless of sequence length - Retain long-term dependencies via state - Efficient for streaming applications ### Online Learning and Adaptation **Problem**: Distribution shifts over time **Approaches**: 1. **Sliding window**: Only use recent data - Discard old events - Retrain periodically 2. **Exponential weighting**: Down-weight old events $$\mathcal{L} = \sum_i \exp(-\lambda(t_{\text{now}} - t_i)) \log \lambda(t_i)$$ 3. **Meta-learning**: Quick adaptation - Few gradient steps on new data - Maintain base model + adapted model 4. **Continual learning**: Update without forgetting - Elastic weight consolidation - Progressive neural networks **Real-time deployment**: - Periodically re-fit model - A/B test new vs old model - Gradual rollout of updates ### Variational Inference for TPPs **Bayesian neural Hawkes**: Posterior over parameters $$p(\theta | \mathcal{D}) \propto p(\mathcal{D} | \theta) p(\theta)$$ **Challenge**: Intractable posterior **Solution**: Variational approximation $$q_\phi(\theta) \approx p(\theta | \mathcal{D})$$ **ELBO objective**: $$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi}[\log p(\mathcal{D} | \theta)] - \text{KL}(q_\phi(\theta) \| p(\theta))$$ **Benefits**: - Uncertainty quantification - Regularization through prior - Ensemble predictions **Implementation**: Reparameterization trick + Monte Carlo gradients ### Amortized Inference **Problem**: Per-sequence inference expensive **Idea**: Train inference network $$\phi^* = \text{InferenceNet}(\text{observed events})$$ Maps observations directly to parameters. **Training**: 1. Sample sequence from model 2. Run inference to get true posterior 3. Train network to predict posterior 4. 
Use network at test time (fast inference) **Applications**: - Real-time parameter estimation - Embedding sequences into latent space - Transfer learning across domains ## Practical Guidelines ### When to Use Transformer-Hawkes **✅ Good fit when**: - Large datasets (>10k events) - Complex, unknown excitation patterns - Long-range temporal dependencies critical - Multiple event types with interactions - Need state-of-the-art predictive performance - Computational resources available **❌ Avoid when**: - Small datasets (<1k events) - Simple, well-understood dynamics (use classical) - Interpretability is critical requirement - Real-time inference with strict latency (<1ms) - Limited computational budget - Domain has strong physical constraints ### Model Selection Checklist 1. **Data size**: How many events? (Classical if <1k, Neural if >10k) 2. **Sequence length**: Average events per sequence? (Transformer better for long) 3. **Complexity**: Are excitation patterns simple or complex? 4. **Interpretability**: Do you need explainable parameters? 5. **Compute**: GPU available? Training time acceptable? 6. **Generalization**: Will distribution shift? (Consider robustness) 7. **Deployment**: Latency requirements? Model size constraints? ### Hyperparameter Tuning **Key hyperparameters**: | Parameter | Typical Range | Impact | |-----------|---------------|--------| | `d_model` | 64-512 | Model capacity, memory | | `n_heads` | 4-16 | Attention diversity | | `n_layers` | 2-8 | Depth, long-range modeling | | `dropout` | 0.1-0.3 | Regularization | | `learning_rate` | 1e-5 to 1e-3 | Convergence speed | | `batch_size` | 16-256 | Training stability | **Search strategies**: - Grid search for critical params - Random search for broader exploration - Bayesian optimization for efficiency **Cross-validation**: Time-series aware splitting - Never shuffle: Maintain temporal order - Use rolling origin: Train on [0, T], test on [T, T+Δ] ### Deployment Considerations **Model serving**: - Serialize with ONNX or TorchScript - Use model compression (quantization, pruning) - Batch requests for throughput **Monitoring**: - Track log-likelihood on recent data - Alert on distribution shifts - A/B test model updates **Versioning**: - Track model version, training data, hyperparameters - Enable rollback to previous version - Gradual traffic shifting ## Conclusion **Transformer-Hawkes models** represent a powerful synthesis of: - **Classical theory**: Hawkes processes, point process mathematics - **Modern deep learning**: Transformers, attention mechanisms - **Applied statistics**: Maximum likelihood, Bayesian inference **Key takeaways**: - Significant performance gains over classical methods on complex data - Trade-off between interpretability and flexibility - Computational cost justified for large-scale applications - Active research area with rapid developments **Future directions**: - Improved efficiency (sparse attention, quantization) - Better uncertainty quantification - Transfer learning and few-shot adaptation - Integration with causal discovery - Real-time streaming applications Mathematical Derivations ### Derivation of Log-Likelihood **Counting process**: $N(t) = \sum_{i=1}^{\infty} \mathbb{1}(t_i \leq t)$ **Likelihood of observing events** $\{t_1, ..., t_n\}$ in $[0, T]$: $$L = p(N(T) = n) \cdot p(t_1, ..., t_n | N(T) = n)$$ **First term** (Poisson probability): $$p(N(T) = n) = \frac{\Lambda(T)^n}{n!} \exp(-\Lambda(T))$$ where $\Lambda(T) = \int_0^T \lambda(s) ds$ is the compensator. 
**Second term** (order statistics): $$p(t_1, ..., t_n | N(T) = n) = \frac{n!}{\Lambda(T)^n} \prod_{i=1}^{n} \lambda(t_i)$$ **Combined**: $$L = \exp(-\Lambda(T)) \prod_{i=1}^{n} \lambda(t_i)$$ **Log-likelihood**: $$\log L = \sum_{i=1}^{n} \log \lambda(t_i) - \int_0^T \lambda(s) ds$$ ### Gradient Computation **Gradient of log-likelihood** w.r.t. parameters $\theta$: $$\nabla_\theta \log L = \sum_{i=1}^{n} \frac{\nabla_\theta \lambda(t_i)}{\lambda(t_i)} - \int_0^T \nabla_\theta \lambda(s) ds$$ **Monte Carlo estimate of integral**: $$\int_0^T \nabla_\theta \lambda(s) ds \approx \frac{T}{M} \sum_{j=1}^{M} \nabla_\theta \lambda(s_j)$$ where $s_j \sim \text{Uniform}(0, T)$. **Backpropagation**: Use automatic differentiation (PyTorch, TensorFlow) to compute $\nabla_\theta \lambda(t)$. ### Stability Condition for Hawkes Process **Branching ratio**: Expected number of offspring per event $$n = \int_0^\infty \phi(s) ds$$ **Stability condition**: $n < 1$ **Proof sketch**: - Each event produces $n$ offspring in expectation - Total events: $1 + n + n^2 + ... = \frac{1}{1-n}$ (geometric series) - Converges iff $n < 1$ **For exponential kernel** $\phi(s) = \alpha e^{-\beta s}$: $$n = \int_0^\infty \alpha e^{-\beta s} ds = \frac{\alpha}{\beta}$$ Stable iff $\alpha < \beta$. ### Continuous-Time Transformer Attention **Standard discrete attention**: $$\text{Attention}(Q, K, V) = \text{softmax}(QK^T / \sqrt{d}) V$$ **Continuous analog**: Replace sum with integral $$\text{Attention}(t) = \int_{-\infty}^{t} \alpha(t, s) v(s) ds$$ where: $$\alpha(t, s) = \frac{\exp(q(t)^T k(s) / \sqrt{d})}{\int_{-\infty}^{t} \exp(q(t)^T k(s') / \sqrt{d}) ds'}$$ **Practical approximation**: Discretize time or use kernels $$\alpha(t, s) \approx \kappa(t - s) \quad \text{(e.g., RBF kernel)}$$
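To make the derivations concrete, here is a small self-contained NumPy sketch (parameter values are illustrative) that simulates a univariate exponential-kernel Hawkes process with Ogata's thinning and evaluates its log-likelihood using the closed-form compensator:

```python
import numpy as np

rng = np.random.default_rng(0)

def intensity(t, events, mu, alpha, beta):
    """lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))."""
    events = np.asarray(events)
    past = events[events < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

def simulate_hawkes(mu, alpha, beta, T):
    """Ogata's thinning for a univariate exponential-kernel Hawkes process on [0, T]."""
    events, t = [], 0.0
    while True:
        # Upper bound: intensity just after t (it only decays until the next accepted event).
        lam_bar = mu + alpha * np.exp(-beta * (t - np.asarray(events))).sum() if events else mu
        t += rng.exponential(1.0 / lam_bar)
        if t > T:
            return np.array(events)
        if rng.uniform() * lam_bar <= intensity(t, events, mu, alpha, beta):
            events.append(t)

def log_likelihood(events, mu, alpha, beta, T):
    """sum_i log lambda(t_i) minus the closed-form compensator for the exponential kernel."""
    log_term = sum(np.log(intensity(t_i, events, mu, alpha, beta)) for t_i in events)
    compensator = mu * T + (alpha / beta) * np.sum(1.0 - np.exp(-beta * (T - events)))
    return log_term - compensator

mu, alpha, beta, T = 0.5, 0.8, 1.2, 100.0      # stable: branching ratio alpha/beta ~ 0.67 < 1
seq = simulate_hawkes(mu, alpha, beta, T)
print(len(seq), log_likelihood(seq, mu, alpha, beta, T))
```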

transformer, transformer architecture, self-attention, attention mechanism, encoder-decoder, multi-head attention, positional encoding, BERT, GPT, neural networks

# The Transformer Architecture **A comprehensive technical guide to the architecture that revolutionized deep learning** ## Historical Context The Transformer architecture was introduced in the landmark 2017 paper **"Attention Is All You Need"** by Vaswani et al. It replaced recurrence with pure attention mechanisms and has since become the foundation for virtually all modern large language models. ### Problems with Previous Approaches (RNNs/LSTMs) - **Sequential bottleneck**: Processing proceeded step-by-step through sequences, preventing parallelization - **Long-range dependency challenges**: Information from distant positions had to flow through many intermediate steps - **Vanishing gradient problems**: Training signals degraded over long sequences, even with gating mechanisms - **Computational inefficiency**: Sequential nature created fundamental bottlenecks on modern parallel hardware ### The Key Insight *Attention alone is sufficient.* By allowing every position to directly attend to every other position in a single operation, the sequential constraint is eliminated entirely. ## Core Mechanism: Self-Attention ### Scaled Dot-Product Attention The heart of the Transformer is **scaled dot-product attention**. Given an input sequence of embeddings, we compute three projections: - **Query ($Q$)**: What information is this position looking for? - **Key ($K$)**: What information does this position contain? - **Value ($V$)**: What information should be transmitted if attended to? ### Mathematical Formulation $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$ Where: - $Q \in \mathbb{R}^{n \times d_k}$ — Query matrix - $K \in \mathbb{R}^{n \times d_k}$ — Key matrix - $V \in \mathbb{R}^{n \times d_v}$ — Value matrix - $d_k$ — Dimension of keys/queries - $n$ — Sequence length ### Why the Scaling Factor? The scaling factor $\sqrt{d_k}$ is critical. Without it: $$ \text{For large } d_k: \quad q \cdot k = \sum_{i=1}^{d_k} q_i k_i \quad \text{grows as } O(d_k) $$ This pushes softmax into regions of extremely small gradients: $$ \frac{\partial}{\partial x_i} \text{softmax}(x)_j = \text{softmax}(x)_j \left(\delta_{ij} - \text{softmax}(x)_i\right) $$ When inputs are large, softmax outputs approach one-hot vectors, and gradients vanish. ### Properties of Self-Attention - **Parallelization**: All positions computed simultaneously — $O(1)$ sequential operations - **Direct connectivity**: Any position can directly access any other - **Learned routing**: Attention patterns are computed fresh for each input - **Computational complexity**: $O(n^2 \cdot d)$ time and $O(n^2)$ memory ## Multi-Head Attention Rather than computing a single attention function, Transformers use multiple parallel attention "heads." ### Mathematical Formulation $$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O $$ Where each head is: $$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$ ### Projection Dimensions - $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$ - $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ - $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ - $W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$ ### Typical Configuration For a model with $d_{\text{model}} = 512$ and $h = 8$ heads: $$ d_k = d_v = \frac{d_{\text{model}}}{h} = \frac{512}{8} = 64 $$ ### Why Multiple Heads? 
- **Different representation subspaces**: Each head can learn different relationship types - **Specialization**: One head might track syntactic dependencies, another semantic relationships - **Redundancy and robustness**: Information captured across multiple heads - **Efficient computation**: Same total dimensionality as single-head attention ## Position Encoding ### The Problem Self-attention is **permutation-equivariant**: $$ \text{Attention}(\pi(X)) = \pi(\text{Attention}(X)) $$ Where $\pi$ is any permutation. The operation has no inherent notion of position or order. ### Sinusoidal Position Encodings (Original) The original paper used fixed sinusoidal encodings: $$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$ $$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$ Where: - $pos$ — Position in the sequence $(0, 1, 2, \ldots)$ - $i$ — Dimension index $(0, 1, \ldots, d_{\text{model}}/2 - 1)$ - $d_{\text{model}}$ — Model dimension ### Properties of Sinusoidal Encodings - **Unique encoding**: Each position gets a distinct vector - **Bounded values**: All values in $[-1, 1]$ - **Relative position as linear transformation**: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ $$ PE_{pos+k} = T_k \cdot PE_{pos} $$ Where $T_k$ is a rotation matrix depending only on $k$. ### Modern Alternatives #### Rotary Position Embeddings (RoPE) Encodes position through rotation in 2D subspaces: $$ f(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix} $$ For query $q$ at position $m$ and key $k$ at position $n$: $$ q_m^T k_n = (R_m q)^T (R_n k) = q^T R_{n-m} k $$ This makes attention depend only on relative position $(n-m)$. #### ALiBi (Attention with Linear Biases) Adds a linear bias based on distance: $$ \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} - m \cdot |i-j|\right)V $$ Where $m$ is a head-specific slope and $|i-j|$ is the distance between positions. ## The Complete Transformer Layer ### Layer Composition A single Transformer layer consists of: ``` Input → [Layer Norm] → Multi-Head Attention → [+ Residual] → → [Layer Norm] → Feed-Forward Network → [+ Residual] → Output ``` ### Feed-Forward Network (FFN) Applied position-wise (identically to each position): $$ \text{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2 $$ Where: - $W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}$ — Expansion projection - $W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}$ — Contraction projection - $d_{ff}$ — Inner dimension (typically $4 \times d_{\text{model}}$) - $\sigma$ — Activation function ### Activation Functions #### ReLU (Original) $$ \text{ReLU}(x) = \max(0, x) $$ #### GELU (Common in modern models) $$ \text{GELU}(x) = x \cdot \Phi(x) \approx x \cdot \sigma(1.702x) $$ Where $\Phi$ is the standard Gaussian CDF. #### SwiGLU (State-of-the-art) $$ \text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2) $$ Where $\text{Swish}(x) = x \cdot \sigma(x)$ and $\odot$ is element-wise multiplication. 
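A minimal PyTorch sketch of the two sublayers described so far — scaled dot-product attention (with an optional causal mask) and a SwiGLU feed-forward block. The single-head formulation and the dimensions are simplifications for illustration, not a faithful reproduction of any particular model:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, optionally with a causal mask."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)            # [..., n, n]
    if causal:
        n = scores.size(-1)
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))          # positions j > i get -inf
    return torch.softmax(scores, dim=-1) @ v

class SwiGLUFFN(nn.Module):
    """Position-wise FFN with SwiGLU: Swish(x W1) ⊙ (x W2), then project back to d_model."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_ff, d_model, bias=False)
    def forward(self, x):
        return self.w3(F.silu(self.w1(x)) * self.w2(x))           # F.silu is Swish

x = torch.randn(2, 5, 64)                                         # [batch, seq, d_model]
out = scaled_dot_product_attention(x, x, x, causal=True)          # causal self-attention
out = out + SwiGLUFFN(64, 4 * 64)(out)                            # FFN with residual connection
```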
### Layer Normalization $$ \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta $$ Where: - $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ — Mean across features - $\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2$ — Variance across features - $\gamma, \beta$ — Learned scale and shift parameters - $\epsilon$ — Small constant for numerical stability #### Pre-LN vs Post-LN **Post-LN (Original)**: $$ x' = \text{LayerNorm}(x + \text{Attention}(x)) $$ **Pre-LN (Modern, more stable)**: $$ x' = x + \text{Attention}(\text{LayerNorm}(x)) $$ ### RMSNorm (Simplified Alternative) $$ \text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} $$ Removes the mean-centering step for efficiency. ### Residual Connections $$ x_{l+1} = x_l + F_l(x_l) $$ Essential for: - **Gradient flow**: Direct path for gradients in deep networks - **Incremental learning**: Layers learn refinements rather than complete transformations - **Training stability**: Easier optimization landscape ## Architectural Variants ### Encoder-Only (BERT-style) **Attention Pattern**: Bidirectional (each position attends to all positions) $$ \text{Mask}_{ij} = 0 \quad \forall i, j $$ **Use Cases**: - Text classification - Named entity recognition - Question answering - Sentence embeddings **Pre-training Objective**: Masked Language Modeling (MLM) $$ \mathcal{L}_{\text{MLM}} = -\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{i \in \mathcal{M}} \log P(x_i | x_{\backslash \mathcal{M}}) \right] $$ ### Decoder-Only (GPT-style) **Attention Pattern**: Causal (positions only attend to previous positions) $$ \text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases} $$ **Use Cases**: - Text generation - Conversational AI - Code completion - General-purpose LLMs (GPT, Claude, LLaMA) **Pre-training Objective**: Next Token Prediction $$ \mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(x_t | x_{
<t}) $$

transformers library,huggingface,models

Hugging Face Transformers provides thousands of pretrained models with simple APIs for inference and fine-tuning, backed by the Hugging Face Model Hub.
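A minimal usage sketch (the checkpoint shown is just one example of a model hosted on the Hub):

```python
from transformers import pipeline

# Download a pretrained model from the Hub and run inference in two lines.
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("The new process recipe improved yield significantly."))
```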

transient enhanced diffusion, ted, process

Temporarily enhanced dopant diffusion during annealing, driven by excess point defects created by implant damage.

transient thermal analysis, simulation

Time-dependent thermal behavior.

transient thermal analysis, thermal management

Transient thermal analysis simulates time-dependent temperature changes during power-up cycles or varying loads.

transient thermal, thermal management

Transient thermal analysis simulates time-dependent temperature changes during power cycling or variable load conditions.

transition fault, advanced test & probe

Transition faults model defects causing slow-to-rise or slow-to-fall transitions detected by delay testing.

transition fault,testing

Delay fault in which a signal is slow to rise or slow to fall during a transition.

transition metal dichalcogenides, research

2D semiconductor materials.

transition-based parsing, structured prediction

Transition-based parsing builds parse trees through sequences of shift-reduce actions using trained classifiers for action selection.

translate-test, transfer learning

Cross-lingual evaluation strategy: machine-translate the test data into the model's training language at inference time.

translate-train, transfer learning

Cross-lingual strategy: machine-translate the training data into the target language, then train on the translated data.

translate,language,convert

Translate between languages. Preserve meaning, tone.

translation adequacy, evaluation

How well meaning is preserved.

translation fluency, evaluation

How natural translation sounds.

translation,multilingual,mt

LLMs can translate between languages. For best quality, use models trained on parallel corpora or multilingual data.

transliteration, nlp

Convert text from one writing system to another (e.g., romanizing non-Latin scripts) while approximately preserving pronunciation.

transmission electron microscope (tem),transmission electron microscope,tem,metrology

Images electrons transmitted through electron-transparent thin samples, resolving structure at atomic scale.

transmission kikuchi diffraction, tkd, metrology

EBSD-style orientation mapping performed in transmission on thin samples, giving finer spatial resolution than conventional EBSD.

transmission line effect, signal & power integrity

Transmission line effects become significant when interconnect length exceeds signal wavelength fraction.

transmission line effects,design

High-frequency behavior of long interconnects.

transnas, neural architecture search

TransNAS applies neural architecture search specifically to transformer architectures, optimizing attention patterns.

transparency, ai safety

Transparency exposes a model's internal mechanisms, enabling analysis of its reasoning.

transparency,ethics

Making model behavior and decisions understandable.

transparent substrate processing, process

Processing of transparent substrates such as glass or sapphire.

transportation waste, manufacturing operations

Transportation waste moves materials without adding value.

transportation waste, production

Unnecessary movement of materials.

trap-assisted tunneling, tat, device physics

Tunneling via defect states.

traveler, manufacturing operations

Travelers are documents accompanying lots recording processing steps and results.

tray packaging, packaging

Components in matrix tray.

treatment recommendation,healthcare ai

Suggest treatment options.

tree diagram, quality & reliability

Tree diagrams systematically break goals into progressively detailed tasks.

tree of thought,search,planning

Tree-of-Thought explores multiple reasoning branches, backtracks if needed. Better for complex planning problems.

tree of thoughts (tot),tree of thoughts,tot,reasoning

Explore multiple reasoning paths in a tree structure and backtrack if needed.

tree of thoughts, prompting techniques

Tree of thoughts explores multiple reasoning branches systematically searching solution space.

trench contact, process integration

Trench contacts etch deeply into silicon reducing resistance but requiring careful profile control.