N-BEATS (Neural Basis Expansion Analysis for Time Series) is an architecture for interpretable time series forecasting built from stacks of blocks that each emit a backcast and a forecast.
Naive Bayes is a simple probabilistic classifier that applies Bayes' theorem with a strong feature-independence assumption; it is fast and commonly used as a baseline.
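A minimal sketch of a Naive Bayes baseline using scikit-learn's `GaussianNB`; the toy arrays are purely illustrative.

```python
# Minimal Gaussian Naive Bayes baseline on toy data (illustrative only).
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])  # toy features
y = np.array([0, 0, 1, 1])                                      # toy labels

clf = GaussianNB()
clf.fit(X, y)
print(clf.predict([[1.2, 1.9], [5.5, 8.5]]))  # expected: [0 1]
```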
Replace names to test bias.
Cell-based neural architecture search discovers repeatable computational blocks that are stacked to form full networks.
NAS-Bench provides standardized benchmarks with pre-computed architecture performance metrics to enable reproducible and efficient NAS research.
Reinforcement learning agents for NAS explore architecture spaces using policy gradients to maximize validation performance.
Neural Architecture Search Without Training scores untrained networks using statistics computed at initialization to predict architecture performance, avoiding costly training during the search.
Model negative bias temperature instability (NBTI) degradation, a transistor aging effect.
NCHW layout stores tensors in batch-channel-height-width order preferred by some accelerators.
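A small NumPy sketch of the layout difference; the tensor shapes are arbitrary examples.

```python
# Illustration of NCHW vs. NHWC tensor layouts using NumPy.
import numpy as np

batch, channels, height, width = 8, 3, 224, 224
x_nchw = np.zeros((batch, channels, height, width))

# Reorder axes to NHWC (batch, height, width, channels), the layout
# preferred by some other frameworks and kernels.
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))
print(x_nchw.shape, x_nhwc.shape)  # (8, 3, 224, 224) (8, 224, 224, 3)
```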
Ranking quality metric.
More realistic model with defect clustering.
Specify what to avoid.
Negative prompting specifies undesired attributes, guiding generation away from those features.
Neighborhood sampling limits aggregation to random subsets of neighbors reducing computational cost in large graphs.
NeMo Guardrails is an open-source programmable safety layer from NVIDIA for LLM applications, with guardrails defined in the Colang language.
ML experiment management.
E(3)-equivariant network for atomistic simulations.
Neural Equivariant Interatomic Potentials combine E(3) equivariance with message passing for molecular dynamics.
Optimize a NeRF from input images.
NeRF synthesizes novel views by volumetrically rendering scenes from learned implicit representations.
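A minimal NumPy sketch of the volumetric rendering quadrature along one camera ray, with made-up densities and colors; this is not any specific NeRF codebase.

```python
# Volume rendering quadrature used by NeRF, sketched with random
# per-sample densities and colors along a single ray.
import numpy as np

n_samples = 64
sigma = np.random.rand(n_samples) * 5.0        # predicted densities
rgb = np.random.rand(n_samples, 3)             # predicted colors
delta = np.full(n_samples, 1.0 / n_samples)    # distances between samples

alpha = 1.0 - np.exp(-sigma * delta)                           # per-sample opacity
trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
weights = trans * alpha
pixel_color = (weights[:, None] * rgb).sum(axis=0)             # composited RGB
print(pixel_color)
```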
Net zero emissions balance residual greenhouse gas releases with equivalent removals or offsets.
Transform networks while preserving function.
Remove entire structures like filters.
Remove individual weights.
Additive models using neural shape functions.
Neural architecture distillation transfers architectural knowledge, not just parameter values.
Generate architectures automatically.
Automatically discover optimal model architectures.
Sophisticated AutoML for architecture discovery.
Find architectures suitable for edge deployment.
NAS automatically searches for high-performing architectures using three components: a search space, a search algorithm, and an evaluation strategy. It is computationally expensive but effective.
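A hedged sketch of those three components using plain random search over a tiny hypothetical search space; `evaluate` is a stand-in for training and validating a candidate, not a real API.

```python
# Sketch of NAS components: search space, search algorithm (random search
# here), and an evaluation step. `evaluate` is a hypothetical placeholder.
import random

SEARCH_SPACE = {
    "depth": [2, 4, 8],
    "width": [64, 128, 256],
    "activation": ["relu", "gelu"],
}

def sample_architecture():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate(arch):
    # Placeholder: in practice, train the candidate (or a cheap proxy)
    # and return validation performance. Here we fake a score.
    return random.random()

best_arch, best_score = None, float("-inf")
for _ in range(20):                     # search budget
    arch = sample_architecture()
    score = evaluate(arch)
    if score > best_score:
        best_arch, best_score = arch, score
print(best_arch, best_score)
```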
Transfer architectures across tasks.
Neural articulation represents articulated objects with learned kinematic structures.
Neural beamforming learns spatial filters through deep learning rather than classical signal processing.
Neural caching stores and reuses intermediate computations for similar inputs reducing redundancy.
Neural Collaborative Filtering replaces inner products with multi-layer perceptrons to learn complex non-linear user-item interactions.
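A minimal PyTorch-style sketch of this idea (user and item embeddings concatenated and scored by an MLP instead of a dot product); the embedding size and layer widths are arbitrary assumptions, not a reference implementation.

```python
# Sketch of Neural Collaborative Filtering: replace the dot product of
# user/item embeddings with an MLP over their concatenation.
import torch
import torch.nn as nn

class NCF(nn.Module):
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, users, items):
        x = torch.cat([self.user_emb(users), self.item_emb(items)], dim=-1)
        return torch.sigmoid(self.mlp(x)).squeeze(-1)  # interaction probability

model = NCF(n_users=1000, n_items=500)
score = model(torch.tensor([1, 2]), torch.tensor([10, 20]))
print(score.shape)  # torch.Size([2])
```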
Neural Chat is an Intel-optimized chat model designed to run efficiently on Intel hardware.
Control policies implemented as interpretable neural circuits with liquid time constants.
Policies implemented as neural circuits.
Neural codecs compress audio or images using learned representations for efficient transmission.
Neural constituency parsing uses recursive or tree-structured neural networks to predict hierarchical phrase structure trees.
Use neural networks to parameterize controlled differential equations (CDEs).
Use neural models for data verbalization.
Neural architecture encoders convert graph structures into fixed-dimensional vectors for predictor training.
Apple's specialized hardware for accelerating on-device ML.
Neural fabrics represent search spaces as trellis structures where NAS learns to select paths through pre-defined computational building blocks.
# Neural Hawkes Process and Time Series Models

## Neural Hawkes Process

## 1. Introduction

A **Hawkes process** is a self-exciting point process used to model events that occur randomly in continuous time, where the occurrence of past events increases the likelihood of future events.

**Key characteristics:**

- Events occur at random times $t_1, t_2, t_3, \ldots$
- Past events "excite" or increase the probability of future events
- The process has memory: history matters
- Widely used in finance, seismology, social networks, and neuroscience

## 2. Classical Hawkes Process

### 2.1 Intensity Function

The **conditional intensity function** $\lambda(t)$ represents the instantaneous rate of event occurrence:

$$
\lambda(t) = \mu + \sum_{t_i < t} \phi(t - t_i)
$$

**Where:**

- $\mu > 0$ — Base intensity (background rate)
- $\phi(\cdot)$ — Triggering kernel (excitation function)
- $t_i$ — Times of past events
- $\sum_{t_i < t}$ — Sum over all events before time $t$

### 2.2 Common Triggering Kernels

**Exponential kernel (most common):**

$$
\phi(\tau) = \alpha \cdot e^{-\beta \tau}
$$

- $\alpha > 0$ — Excitation magnitude
- $\beta > 0$ — Decay rate
- Constraint: $\frac{\alpha}{\beta} < 1$ for stationarity

**Power-law kernel:**

$$
\phi(\tau) = \frac{\alpha}{(\tau + c)^{(1+\omega)}}
$$

- Used in seismology (Omori's law)
- Heavier tails than exponential

### 2.3 Likelihood Function

For a sequence of events $\{t_1, t_2, \ldots, t_n\}$ in interval $[0, T]$:

$$
\mathcal{L} = \prod_{i=1}^{n} \lambda(t_i) \cdot \exp\left( -\int_0^T \lambda(s) \, ds \right)
$$

**Log-likelihood:**

$$
\log \mathcal{L} = \sum_{i=1}^{n} \log \lambda(t_i) - \int_0^T \lambda(s) \, ds
$$

### 2.4 Branching Structure

The Hawkes process has a **branching interpretation:**

- **Immigrants:** Events from the background rate $\mu$
- **Offspring:** Events triggered by previous events
- **Branching ratio:** $n^* = \int_0^\infty \phi(\tau) \, d\tau$
  - If $n^* < 1$: Process is subcritical (stationary)
  - If $n^* = 1$: Process is critical
  - If $n^* > 1$: Process is supercritical (explosive)

## 3. Neural Hawkes Process

### 3.1 Motivation

**Limitations of classical Hawkes processes:**

- Parametric kernels may not capture complex dynamics
- Difficult to model **inhibition** (events reducing future probability)
- Limited expressiveness for multi-type event interactions
- Manual feature engineering required

**Solution:** Replace parametric components with neural networks.

### 3.2 Continuous-Time LSTM (CT-LSTM)

The Neural Hawkes Process (Mei & Eisner, 2017) uses a **continuous-time LSTM** where the hidden state evolves between events.

**Standard LSTM update at event $t_i$:**

$$
\begin{aligned}
i_i &= \sigma(W_i x_i + U_i h_{i-1} + b_i) \\
f_i &= \sigma(W_f x_i + U_f h_{i-1} + b_f) \\
o_i &= \sigma(W_o x_i + U_o h_{i-1} + b_o) \\
\tilde{c}_i &= \tanh(W_c x_i + U_c h_{i-1} + b_c) \\
c_i &= f_i \odot c_{i-1} + i_i \odot \tilde{c}_i \\
h_i &= o_i \odot \tanh(c_i)
\end{aligned}
$$

**Where:**

- $i_i$ — Input gate
- $f_i$ — Forget gate
- $o_i$ — Output gate
- $c_i$ — Cell state
- $h_i$ — Hidden state
- $\sigma(\cdot)$ — Sigmoid function
- $\odot$ — Element-wise multiplication

### 3.3 Continuous-Time Dynamics

**Key innovation:** Cell state decays continuously between events.

**Cell state at time $t$ (between events $t_i$ and $t_{i+1}$):**

$$
c(t) = \bar{c}_i + (c_i - \bar{c}_i) \cdot e^{-\delta_i (t - t_i)}
$$

**Where:**

- $c_i$ — Cell state immediately after event $t_i$
- $\bar{c}_i$ — Target cell state (what $c(t)$ decays toward)
- $\delta_i > 0$ — Decay rate (learned)
- $t - t_i$ — Time elapsed since last event

**Target cell state:**

$$
\bar{c}_i = \bar{f}_i \odot \bar{c}_{i-1} + \bar{i}_i \odot \tilde{c}_i
$$

**Hidden state at time $t$:**

$$
h(t) = o_i \odot \tanh(c(t))
$$

### 3.4 Intensity Function

The intensity for event type $k$ at time $t$:

$$
\lambda_k(t) = f_k(h(t)) = \text{softplus}(w_k^\top h(t) + b_k)
$$

**Softplus function:**

$$
\text{softplus}(x) = \log(1 + e^x)
$$

**Properties:**

- Ensures $\lambda_k(t) > 0$ (intensity must be positive)
- Smooth approximation to ReLU
- Allows for both excitation and inhibition

### 3.5 Training Objective

**Negative log-likelihood:**

$$
\mathcal{L} = -\sum_{i=1}^{n} \log \lambda_{k_i}(t_i) + \sum_{k=1}^{K} \int_0^T \lambda_k(s) \, ds
$$

**Where:**

- $k_i$ — Type of the $i$-th event
- $K$ — Total number of event types
- The integral is computed via Monte Carlo sampling or numerical integration

### 3.6 Architecture Summary

```
Input: Event sequence {(t_1, k_1), (t_2, k_2), ..., (t_n, k_n)}
          │
          ▼
┌───────────────────┐
│  Event Embedding  │
│  x_i = embed(k_i) │
└───────────────────┘
          │
          ▼
┌───────────────────┐
│   CT-LSTM Cell    │
│    c(t), h(t)     │
└───────────────────┘
          │
          ▼
┌───────────────────┐
│  Intensity Layer  │
│ λ_k(t) = softplus │
└───────────────────┘
          │
          ▼
Output: λ(t) for prediction, NLL for training
```

## 4. Relationship to Time Series Models

### 4.1 Comparison Table

| Aspect | Traditional Time Series | Point Processes |
|:-------|:-----------------------|:----------------|
| **Data** | Regular samples $y_1, y_2, \ldots$ | Event times $t_1, t_2, \ldots$ |
| **Question** | What is $y$ at time $t$? | When does next event occur? |
| **Spacing** | Fixed intervals $\Delta t$ | Irregular, continuous |
| **Models** | ARIMA, GARCH, RNN | Poisson, Hawkes, Neural TPP |

### 4.2 Key Differences

**Time series models:**

- Observations at fixed time intervals: $y_t, y_{t+1}, y_{t+2}, \ldots$
- Model the value/magnitude of observations
- Examples: stock prices, temperature, sensor readings

**Point processes:**

- Events at irregular, continuous times: $t_1, t_2, t_3, \ldots$
- Model **when** events occur (and optionally what type)
- Examples: transactions, earthquakes, social media posts

### 4.3 Connections

**Converting between representations:**

- **Point process → Time series:** Count events in fixed bins
  $$N_t = \#\{t_i : t_i \in [t, t+\Delta t)\}$$
- **Time series → Point process:** Treat threshold crossings as events

**Shared neural architectures:**

- Both use RNNs, LSTMs, Transformers
- Attention mechanisms applicable to both
- Encoder-decoder frameworks common

## 5. Modern Extensions

### 5.1 Transformer Hawkes Process

**Reference:** Zuo et al., 2020

**Key idea:** Replace RNN with self-attention mechanism.

**Advantages:**

- Parallelizable training (no sequential dependency)
- Better long-range dependency modeling
- Scales to longer sequences

**Self-attention for events:**

$$
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V
$$

**Temporal encoding:**

$$
\text{PE}(t, 2i) = \sin\left(\frac{t}{10000^{2i/d}}\right)
$$

$$
\text{PE}(t, 2i+1) = \cos\left(\frac{t}{10000^{2i/d}}\right)
$$

### 5.2 Neural Jump SDEs

**Combines:**

- Continuous diffusion dynamics (SDEs)
- Discrete jumps (point processes)

**Formulation:**

$$
dX_t = f(X_t) \, dt + g(X_t) \, dW_t + h(X_t) \, dN_t
$$

**Where:**

- $f(X_t) \, dt$ — Drift term
- $g(X_t) \, dW_t$ — Diffusion (Brownian motion)
- $h(X_t) \, dN_t$ — Jump term (point process)

### 5.3 Variational Approaches

**Variational Autoencoder for Point Processes:**

$$
\mathcal{L}_{\text{ELBO}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\text{KL}}(q(z|x) \| p(z))
$$

**Benefits:**

- Uncertainty quantification
- Latent structure discovery
- Generative modeling

### 5.4 Marked Temporal Point Processes

Events carry additional information (**marks**):

$$
\{(t_1, m_1), (t_2, m_2), \ldots, (t_n, m_n)\}
$$

**Joint intensity:**

$$
\lambda(t, m) = \lambda_g(t) \cdot f(m | t, \mathcal{H}_t)
$$

- $\lambda_g(t)$ — Ground intensity (when)
- $f(m | t, \mathcal{H}_t)$ — Mark distribution (what)

## 6. Applications

### 6.1 When to Use Neural Hawkes

**Good fit:**

- Event data with self-exciting patterns
- Multiple interacting event types
- Complex, nonlinear dependencies
- Large datasets where neural networks can generalize

**Specific domains:**

- **Finance:** High-frequency trading, order book dynamics
- **Social networks:** Information cascades, retweets, viral content
- **Healthcare:** Patient events, hospital admissions, disease outbreaks
- **Criminology:** Crime prediction, recidivism modeling
- **Seismology:** Earthquake aftershock prediction
- **Neuroscience:** Neural spike train modeling

### 6.2 When to Consider Alternatives

| Scenario | Recommended Alternative |
|:---------|:-----------------------|
| Regularly sampled data | Standard time series (ARIMA, LSTM) |
| Need interpretability | Classical Hawkes with explicit kernels |
| Very sparse data | Simple parametric models |
| Real-time constraints | Lightweight models, online learning |

### 6.3 Implementation Resources

**Libraries:**

- `tick` (Python) — Classical point processes
- `PtPack` (Python) — Neural temporal point processes
- `pytorch-transformer-hawkes` — Transformer-based models

**Key papers:**

- Mei & Eisner (2017): "The Neural Hawkes Process"
- Zuo et al. (2020): "Transformer Hawkes Process"
- Du et al. (2016): "Recurrent Marked Temporal Point Processes"

## 7. Mathematical Reference

### Core Equations Reference

**Classical Hawkes intensity:**

$$
\lambda(t) = \mu + \sum_{t_i < t} \alpha e^{-\beta(t - t_i)}
$$

**Neural Hawkes continuous-time cell:**

$$
c(t) = \bar{c}_i + (c_i - \bar{c}_i) e^{-\delta_i(t - t_i)}
$$

**Neural intensity function:**

$$
\lambda_k(t) = \text{softplus}(w_k^\top h(t) + b_k)
$$

**Log-likelihood:**

$$
\log \mathcal{L} = \sum_{i=1}^{n} \log \lambda_{k_i}(t_i) - \int_0^T \sum_{k=1}^{K} \lambda_k(s) \, ds
$$
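As a worked illustration of the core equations above, here is a small NumPy sketch (not tied to `tick` or any other library) that evaluates a classical exponential-kernel Hawkes intensity, a Monte Carlo estimate of the integral term in the log-likelihood, and the Neural Hawkes continuous-time cell decay with a softplus intensity head; all parameter values and dimensions are arbitrary assumptions.

```python
# Sketch of the core equations with NumPy; values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Classical Hawkes intensity: lambda(t) = mu + sum_i alpha * exp(-beta*(t - t_i))
def hawkes_intensity(t, event_times, mu=0.2, alpha=0.6, beta=1.0):
    past = event_times[event_times < t]
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

# Monte Carlo estimate of the compensator integral_0^T lambda(s) ds
def mc_integral(event_times, T, n_samples=10_000):
    s = rng.uniform(0.0, T, n_samples)
    return T * np.mean([hawkes_intensity(si, event_times) for si in s])

events = np.array([0.5, 1.1, 1.3, 2.7, 4.0])
T = 5.0
log_lik = sum(np.log(hawkes_intensity(t, events)) for t in events) - mc_integral(events, T)
print("classical Hawkes log-likelihood ~", round(float(log_lik), 3))

# Neural Hawkes pieces: continuous-time cell decay and softplus intensity
def cell_state(t, t_i, c_i, c_bar_i, delta_i):
    # c(t) = c_bar + (c_i - c_bar) * exp(-delta * (t - t_i))
    return c_bar_i + (c_i - c_bar_i) * np.exp(-delta_i * (t - t_i))

def intensity_k(h_t, w_k, b_k):
    # softplus keeps the intensity strictly positive
    return np.log1p(np.exp(w_k @ h_t + b_k))

c_i, c_bar_i, o_i = rng.normal(size=8), rng.normal(size=8), rng.uniform(size=8)
delta_i = rng.uniform(0.1, 1.0, size=8)
h_t = o_i * np.tanh(cell_state(t=2.0, t_i=1.3, c_i=c_i, c_bar_i=c_bar_i, delta_i=delta_i))
print("neural intensity ~", intensity_k(h_t, w_k=rng.normal(size=8), b_k=0.1))
```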
Networks representing shapes.
Learn surfaces as zero level sets.
Encode meshes with neural networks.