Graph Transformer applies full self-attention over graph nodes with positional encodings for structural information.
Graph Variational Autoencoder generates graphs by learning latent distributions over graph structures.
Green chemistry principles minimize hazardous substances in semiconductor processes through alternative chemistries and process optimization.
Green solvents replace hazardous organic solvents with safer alternatives like supercritical CO2 or water-based solutions.
Try all combinations of hyperparameter values.
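This entry appears to describe exhaustive grid search; a minimal sketch using only the standard library, with a hypothetical search space and a placeholder `evaluate` function standing in for a real train/validate run:

```python
from itertools import product

# Hypothetical search space; evaluate() is a stand-in for a full training run.
grid = {"lr": [1e-2, 1e-3], "hidden": [64, 128], "dropout": [0.0, 0.5]}

def evaluate(cfg):
    # Placeholder objective -- replace with real training and validation.
    return -abs(cfg["lr"] - 1e-3) - abs(cfg["dropout"] - 0.5)

# Try every combination and keep the best-scoring configuration.
best = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=evaluate,
)
print(best)  # {'lr': 0.001, 'hidden': 64, 'dropout': 0.5}
```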
Sudden generalization long after overfitting.
Sudden jump in generalization long after training loss plateaus.
Convolutions over symmetry groups.
Grouped convolutions partition input channels into groups, processing each group independently to reduce parameters.
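A quick PyTorch sketch (not from the original text) showing the parameter savings; the channel counts are arbitrary:

```python
import torch
import torch.nn as nn

# Dense 3x3 conv, 64 -> 128 channels: 128 * 64 * 9 = 73,728 weights (+128 biases).
dense = nn.Conv2d(64, 128, kernel_size=3, padding=1)
# Same mapping with 4 groups: each group maps 16 -> 32 channels,
# 4 * 32 * 16 * 9 = 18,432 weights (+128 biases), i.e. 4x fewer weights.
grouped = nn.Conv2d(64, 128, kernel_size=3, padding=1, groups=4)

x = torch.randn(1, 64, 56, 56)
assert dense(x).shape == grouped(x).shape  # both produce [1, 128, 56, 56]
print(sum(p.numel() for p in dense.parameters()),    # 73856
      sum(p.numel() for p in grouped.parameters()))  # 18560
```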
Middle ground between MQA and multi-head attention.
Normalize within groups of channels.
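This entry appears to describe group normalization; a minimal PyTorch sketch (channel and group counts are arbitrary):

```python
import torch
import torch.nn as nn

# 64 channels split into 8 groups of 8; statistics are computed per sample
# over each group of channels, independent of batch size (unlike BatchNorm).
gn = nn.GroupNorm(num_groups=8, num_channels=64)
y = gn(torch.randn(4, 64, 32, 32))  # shape preserved: [4, 64, 32, 32]
```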
Quantum search algorithm.
Graph Transformer Networks learn new graph structures through soft edge selection for heterogeneous graphs.
Framework for adding structure validation and safety to LLM outputs.
Guardrails constrain model behavior, preventing specific undesired outputs.
Guardrails prevent unwanted model behavior through topic restrictions, format requirements, and safety filters.
Strength of conditioning guidance.
Guidance scale controls trade-off between prompt adherence and sample diversity in guided generation.
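A minimal sketch of how a guidance scale is typically applied in classifier-free guidance (the function name and default value are illustrative, not from the original text):

```python
import torch

def apply_guidance(eps_uncond, eps_cond, guidance_scale=7.5):
    """Combine unconditional and conditional noise predictions.

    guidance_scale = 1.0 reproduces the conditional prediction; larger values
    push samples toward the prompt at the cost of diversity.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```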
Modified backprop for visualization.
Hybrid SSM+attention architecture.
Halide separates the algorithm from the schedule, enabling portable, high-performance image processing.
Identify false statements.
Generating false information.
Software complexity measures.
Learn energy-conserving dynamics.
Heterogeneous Graph Attention Network uses hierarchical attention at node-level and semantic-level to learn from multi-relational graph structures.
Hard example mining focuses training on samples with high loss or misclassification to improve model performance on difficult cases.
Focus on difficult examples.
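A minimal PyTorch sketch of the hard-example-mining idea described above: compute per-sample losses and keep only the hardest fraction of the batch (the keep fraction is an arbitrary choice):

```python
import torch
import torch.nn.functional as F

def hard_example_loss(logits, targets, keep_frac=0.25):
    # Per-sample losses, then keep only the hardest (highest-loss) fraction.
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    k = max(1, int(keep_frac * per_sample.numel()))
    hard_losses, _ = per_sample.topk(k)
    return hard_losses.mean()
```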
Hard routing assigns tokens exclusively to selected experts.
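A minimal sketch of hard top-k routing; the softmax-then-top-k ordering and renormalization are one common choice, not a specific library's API:

```python
import torch

def hard_route(router_logits, k=1):
    # router_logits: [num_tokens, num_experts]
    probs = router_logits.softmax(dim=-1)
    weights, expert_ids = probs.topk(k, dim=-1)             # k experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over selected experts
    return expert_ids, weights
```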
Hardware-aware design optimizes architectures for specific deployment platforms, considering latency, memory, and energy.
Search considering hardware constraints.
Hardware-aware neural architecture search optimizes architectures jointly for accuracy and hardware metrics like latency, energy, or memory footprint.
Jointly optimize hardware and algorithms.
Harmful content includes text promoting violence, illegal activity, or other dangers.
Hash routing deterministically assigns tokens to experts based on a hash function.
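A minimal sketch of hash routing; the multiplicative hash is just one deterministic choice:

```python
def hash_route(token_ids, num_experts):
    # Fixed, parameter-free assignment: the same token id always goes
    # to the same expert, so no router needs to be trained.
    return [(int(t) * 2654435761) % num_experts for t in token_ids]

print(hash_route([5, 17, 5, 42], num_experts=4))  # token 5 always maps to the same expert
```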
Hybrid Attention Transformer improves super-resolution through channel and spatial attention.
Hardware-Aware Transformers optimize transformer architectures jointly for accuracy and hardware-specific latency constraints.
Identify hateful or discriminatory content.
Self-excitation in Hawkes processes models how past events increase likelihood of future events.
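A minimal sketch of the conditional intensity of a univariate Hawkes process with an exponential kernel (the parameter values are arbitrary):

```python
import numpy as np

def hawkes_intensity(t, event_times, mu=0.2, alpha=0.8, beta=1.0):
    # lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i)):
    # every past event temporarily raises the rate of future events.
    past = np.asarray([ti for ti in event_times if ti < t])
    return mu + (alpha * np.exp(-beta * (t - past))).sum()

print(hawkes_intensity(5.0, event_times=[1.0, 4.0, 4.5]))
```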
Hazardous waste from semiconductor manufacturing requires specialized handling, storage, and disposal.
Heat recovery systems capture waste heat from process tools and HVAC for space heating or power generation, improving energy efficiency.
Heat wheels transfer thermal energy between exhaust and supply air streams through a rotating matrix.
Crack at bond heel.
Filter that removes at least 99.97% of airborne particles 0.3 microns in diameter.
GNNs for graphs with different node/edge types.
Heterogeneous graphs contain multiple node types and edge types requiring specialized message passing for different relation semantics.
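A minimal sketch of a heterogeneous graph in PyTorch Geometric's HeteroData container (node counts and feature sizes are arbitrary); the HGT entry later in this glossary operates on exactly this kind of structure:

```python
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Two node types with their own feature dimensions.
data['paper'].x = torch.randn(100, 32)
data['author'].x = torch.randn(50, 16)

# One edge type, keyed by a (src_type, relation, dst_type) triple.
src = torch.randint(0, 50, (200,))    # author indices
dst = torch.randint(0, 100, (200,))   # paper indices
data['author', 'writes', 'paper'].edge_index = torch.stack([src, dst])

print(data.node_types, data.edge_types)
```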
Heterogeneous skip-gram predicts context nodes of different types given target nodes.
Heterogeneous Self-Attention Neural Network adaptively learns importance of different metapaths and neighbors.
Second-order ODE solver.
# Heterogeneous Graph Transformer (HGT)

**HGT (Heterogeneous Graph Transformer)** is a graph neural network architecture designed specifically for **heterogeneous graphs** — graphs where nodes and edges can have different types. It was introduced by Hu et al. in 2020.

## 1. Problem Setting

### 1.1 Heterogeneous Graph Definition

A heterogeneous graph is defined as:

$$
G = (V, E, \tau, \phi)
$$

Where:

- $V$ — Set of nodes
- $E$ — Set of edges
- $\tau: V \rightarrow \mathcal{T}$ — Node type mapping function
- $\phi: E \rightarrow \mathcal{R}$ — Edge type mapping function
- $\mathcal{T}$ — Set of node types
- $\mathcal{R}$ — Set of edge/relation types

### 1.2 Real-World Examples

- **Academic Networks**:
  - Node types: `Paper`, `Author`, `Venue`, `Institution`
  - Edge types: `writes`, `cites`, `published_in`, `affiliated_with`
- **E-commerce Graphs**:
  - Node types: `User`, `Product`, `Brand`, `Category`
  - Edge types: `purchases`, `reviews`, `belongs_to`, `manufactures`
- **Knowledge Graphs**:
  - Node types: `Person`, `Organization`, `Location`, `Event`
  - Edge types: `works_at`, `located_in`, `participated_in`

## 2. HGT Architecture

### 2.1 Core Components

The HGT layer consists of three main operations:

1. **Heterogeneous Mutual Attention**
2. **Heterogeneous Message Passing**
3. **Target-Specific Aggregation**

### 2.2 Type-Dependent Linear Projections

For each node type $\tau \in \mathcal{T}$, HGT defines separate projection matrices:

$$
Q_{\tau}^{(i)} \in \mathbb{R}^{d \times \frac{d}{h}}, \quad K_{\tau}^{(i)} \in \mathbb{R}^{d \times \frac{d}{h}}, \quad V_{\tau}^{(i)} \in \mathbb{R}^{d \times \frac{d}{h}}
$$

Where:

- $d$ — Hidden dimension
- $h$ — Number of attention heads
- $i$ — Attention head index $(i = 1, 2, \ldots, h)$

## 3. Mathematical Formulation

### 3.1 Attention Mechanism

For a source node $s$ and target node $t$ connected by edge $e$:

#### Step 1: Compute Query and Key

$$
\text{Query}^{(i)}(t) = Q_{\tau(t)}^{(i)} \cdot H^{(l-1)}[t]
$$

$$
\text{Key}^{(i)}(s) = K_{\tau(s)}^{(i)} \cdot H^{(l-1)}[s]
$$

#### Step 2: Compute Attention Score

$$
\text{ATT-head}^{(i)}(s, e, t) = \left( \text{Key}^{(i)}(s) \cdot W_{\phi(e)}^{\text{ATT}} \cdot \text{Query}^{(i)}(t)^T \right) \cdot \frac{\mu_{\langle \tau(s), \phi(e), \tau(t) \rangle}}{\sqrt{d}}
$$

Where:

- $W_{\phi(e)}^{\text{ATT}} \in \mathbb{R}^{\frac{d}{h} \times \frac{d}{h}}$ — Edge-type-specific attention matrix
- $\mu_{\langle \tau(s), \phi(e), \tau(t) \rangle}$ — Prior importance of meta-relation (learnable scalar)

#### Step 3: Softmax Normalization

$$
\text{Attention}^{(i)}(s, e, t) = \text{softmax}_{s \in \mathcal{N}(t)} \left( \text{ATT-head}^{(i)}(s, e, t) \right)
$$

### 3.2 Message Computation

$$
\text{Message}^{(i)}(s, e, t) = V_{\tau(s)}^{(i)} \cdot H^{(l-1)}[s] \cdot W_{\phi(e)}^{\text{MSG}}
$$

Where:

- $W_{\phi(e)}^{\text{MSG}} \in \mathbb{R}^{\frac{d}{h} \times \frac{d}{h}}$ — Edge-type-specific message matrix

### 3.3 Multi-Head Aggregation

$$
\tilde{H}^{(l)}[t] = \bigoplus_{i=1}^{h} \left( \sum_{s \in \mathcal{N}(t)} \text{Attention}^{(i)}(s, e, t) \cdot \text{Message}^{(i)}(s, e, t) \right)
$$

Where $\bigoplus$ denotes concatenation across heads.

### 3.4 Final Output with Residual Connection

$$
H^{(l)}[t] = \sigma \left( W_{\tau(t)}^{\text{OUT}} \cdot \tilde{H}^{(l)}[t] + H^{(l-1)}[t] \right)
$$

Where:

- $W_{\tau(t)}^{\text{OUT}} \in \mathbb{R}^{d \times d}$ — Target-type-specific output projection
- $\sigma$ — Activation function (e.g., ReLU, GELU)
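To make Sections 3.1–3.4 concrete, here is a minimal, single-head, single-meta-relation sketch in plain PyTorch (not the authors' reference implementation); the per-type projections collapse to one linear layer each, and the scatter softmax is written out by hand:

```python
import math
import torch
import torch.nn as nn

class SingleHeadHGTAttention(nn.Module):
    """One attention head for a single meta-relation <tau(s), phi(e), tau(t)> (h = 1)."""

    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d, bias=False)        # Q_{tau(t)}
        self.k = nn.Linear(d, d, bias=False)        # K_{tau(s)}
        self.v = nn.Linear(d, d, bias=False)        # V_{tau(s)}
        self.w_att = nn.Parameter(torch.eye(d))     # W^{ATT}_{phi(e)}
        self.w_msg = nn.Parameter(torch.eye(d))     # W^{MSG}_{phi(e)}
        self.mu = nn.Parameter(torch.tensor(1.0))   # meta-relation prior
        self.d = d

    def forward(self, h_src, h_dst, edge_index):
        src, dst = edge_index                       # [num_edges] source / target node ids
        q = self.q(h_dst)[dst]                      # Query(t) gathered per edge
        k = self.k(h_src)[src]                      # Key(s) gathered per edge
        # ATT-head(s, e, t) = (Key(s) W^ATT Query(t)^T) * mu / sqrt(d)
        att = ((k @ self.w_att) * q).sum(-1) * self.mu / math.sqrt(self.d)
        # Softmax over the incoming edges of each target node (scatter softmax).
        att = (att - att.max()).exp()
        denom = torch.zeros(h_dst.size(0)).index_add_(0, dst, att)
        alpha = att / denom[dst]
        # Message(s, e, t) = V_{tau(s)} H[s] W^MSG, then attention-weighted aggregation.
        msg = self.v(h_src)[src] @ self.w_msg
        return torch.zeros_like(h_dst).index_add_(0, dst, alpha.unsqueeze(-1) * msg)

# Tiny usage example: 5 source nodes, 3 target nodes, 4 edges, d = 8.
layer = SingleHeadHGTAttention(d=8)
edge_index = torch.tensor([[0, 1, 2, 4], [0, 0, 1, 2]])
out = layer(torch.randn(5, 8), torch.randn(3, 8), edge_index)  # -> [3, 8]
```

The target-specific output projection and residual connection of Section 3.4 would wrap this aggregated output in a full layer.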
## 4. Relative Temporal Encoding (RTE)

For temporal/dynamic graphs, HGT incorporates time information:

$$
\text{RTE}(\Delta t) = \text{Linear}\left( \text{T2V}(\Delta t) \right)
$$

Where $\Delta t = t_{\text{target}} - t_{\text{source}}$ is the time difference.

### 4.1 Time2Vec Encoding

$$
\text{T2V}(\Delta t)[i] = \begin{cases} \omega_i \cdot \Delta t + \varphi_i & \text{if } i = 0 \\ \sin(\omega_i \cdot \Delta t + \varphi_i) & \text{if } i > 0 \end{cases}
$$

The temporal attention becomes:

$$
\text{ATT-head}^{(i)}(s, e, t) = \left( \text{Key}^{(i)}(s) + \text{RTE}(\Delta t) \right) \cdot W_{\phi(e)}^{\text{ATT}} \cdot \text{Query}^{(i)}(t)^T
$$

## 5. Comparison

| Method | Heterogeneity Handling | Metapaths Required | Parameter Efficiency |
|--------|------------------------|--------------------|----------------------|
| **GCN** | ❌ Homogeneous only | N/A | ✅ High |
| **GAT** | ❌ Homogeneous only | N/A | ✅ High |
| **R-GCN** | ✅ Yes | ❌ No | ❌ Low (separate weights per relation) |
| **HAN** | ✅ Yes | ✅ Yes (manual design) | ⚠️ Medium |
| **HGT** | ✅ Yes | ❌ No (automatic) | ✅ High (decomposition) |

## 6. Implementation

### 6.1 PyTorch Geometric Implementation

```python
import torch
import torch.nn as nn
from torch_geometric.nn import HGTConv, Linear


class HGT(nn.Module):
    def __init__(self, metadata, hidden_channels, out_channels, num_heads, num_layers):
        super().__init__()
        self.node_types = metadata[0]
        self.edge_types = metadata[1]

        # Linear projections for each node type
        self.lin_dict = nn.ModuleDict()
        for node_type in self.node_types:
            self.lin_dict[node_type] = Linear(-1, hidden_channels)

        # HGT convolutional layers
        self.convs = nn.ModuleList()
        for _ in range(num_layers):
            conv = HGTConv(
                in_channels=hidden_channels,
                out_channels=hidden_channels,
                metadata=metadata,
                heads=num_heads,
                group='sum'
            )
            self.convs.append(conv)

        # Output projection
        self.out_lin = Linear(hidden_channels, out_channels)

    def forward(self, x_dict, edge_index_dict):
        # Initial projection to the shared hidden dimension
        x_dict = {
            node_type: self.lin_dict[node_type](x).relu()
            for node_type, x in x_dict.items()
        }

        # HGT layers
        for conv in self.convs:
            x_dict = conv(x_dict, edge_index_dict)

        # Output projection (maps hidden_channels -> out_channels per node type)
        return {
            node_type: self.out_lin(x)
            for node_type, x in x_dict.items()
        }
```

### 6.2 Usage Example

```python
# Define metadata
metadata = (
    ['paper', 'author', 'venue'],  # Node types
    [
        ('author', 'writes', 'paper'),
        ('paper', 'cites', 'paper'),
        ('paper', 'published_in', 'venue'),
    ]  # Edge types as (src, relation, dst)
)

# Initialize model
model = HGT(
    metadata=metadata,
    hidden_channels=64,
    out_channels=16,
    num_heads=4,
    num_layers=2
)

# Forward pass (x_dict and edge_index_dict come from a heterogeneous graph,
# e.g. the .x_dict / .edge_index_dict attributes of a HeteroData object)
out_dict = model(x_dict, edge_index_dict)
```

## 7. Training Objective

### 7.1 Node Classification

$$
\mathcal{L}_{\text{node}} = -\sum_{v \in V_{\text{labeled}}} \sum_{c=1}^{C} y_{v,c} \log(\hat{y}_{v,c})
$$

Where:

- $y_{v,c}$ — Ground truth label (one-hot)
- $\hat{y}_{v,c} = \text{softmax}(H^{(L)}[v])_c$ — Predicted probability

### 7.2 Link Prediction

$$
\mathcal{L}_{\text{link}} = -\sum_{(s,e,t) \in E} \log \sigma(H^{(L)}[s]^T \cdot W_{\phi(e)} \cdot H^{(L)}[t]) - \sum_{(s,e,t') \in E^{-}} \log \sigma(-H^{(L)}[s]^T \cdot W_{\phi(e)} \cdot H^{(L)}[t'])
$$

Where:

- $E^{-}$ — Negative edge samples
- $\sigma$ — Sigmoid function
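A sketch of how the node-classification objective above might be wired up with the model from Section 6; `data` is assumed to be a PyG `HeteroData` object with labels and a `train_mask` on the `'paper'` node type (these names are illustrative, not from the original text):

```python
import torch
import torch.nn.functional as F

# One forward pass to materialize the lazily initialized input projections
# (the Linear(-1, ...) layers in Section 6.1) before building the optimizer.
with torch.no_grad():
    model(data.x_dict, data.edge_index_dict)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

model.train()
for epoch in range(200):
    optimizer.zero_grad()
    out_dict = model(data.x_dict, data.edge_index_dict)   # per-type logits
    mask = data['paper'].train_mask
    # Cross-entropy over labeled 'paper' nodes realizes the L_node objective above.
    loss = F.cross_entropy(out_dict['paper'][mask], data['paper'].y[mask])
    loss.backward()
    optimizer.step()
```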
## 8. Complexity Analysis

### 8.1 Time Complexity

$$
O\left( |E| \cdot d^2 / h + |V| \cdot d^2 \right)
$$

Where:

- $|E|$ — Number of edges
- $|V|$ — Number of nodes
- $d$ — Hidden dimension
- $h$ — Number of heads

### 8.2 Space Complexity (Parameters)

$$
O\left( |\mathcal{T}| \cdot d^2 + |\mathcal{R}| \cdot d^2 / h \right)
$$

This is more efficient than R-GCN, which requires $O(|\mathcal{R}| \cdot d^2)$.

## 9. Key Advantages

- **No Manual Metapath Design**: Unlike HAN, HGT automatically learns the importance of different meta-relations
- **Parameter Efficient**: Uses decomposition to avoid parameter explosion with many relation types
- **Unified Framework**: Handles any heterogeneous graph schema
- **Temporal Support**: Can incorporate relative time encoding for dynamic graphs
- **Interpretable**: Attention weights reveal learned importance of different relations

## 10. Limitations

- **Computational Overhead**: More complex than homogeneous GNNs
- **Data Requirements**: Needs sufficient examples per node/edge type
- **Memory Usage**: Multi-head attention increases memory consumption
- **Hyperparameter Sensitivity**: Performance depends on number of heads, layers, hidden dimensions

## 11. Reference

| Symbol | Description |
|--------|-------------|
| $G = (V, E, \tau, \phi)$ | Heterogeneous graph |
| $\tau(v)$ | Type of node $v$ |
| $\phi(e)$ | Type of edge $e$ |
| $H^{(l)}[v]$ | Node $v$ representation at layer $l$ |
| $\mathcal{N}(t)$ | Neighbors of target node $t$ |
| $Q, K, V$ | Query, Key, Value projections |
| $W^{\text{ATT}}, W^{\text{MSG}}$ | Attention and Message weight matrices |
| $\mu$ | Learnable meta-relation prior |