AI/ML for HPC Optimization

AI/ML for HPC optimization is an emerging approach that uses machine learning to automate parameter tuning, performance modeling, and resource scheduling, addressing the combinatorial complexity of tuning modern HPC systems.

ML-Based Autotuning (OpenTuner, Bayesian Optimization)

- Autotuning Problem: Optimize kernel parameters (block size, loop unroll factor, cache tiling dimensions) for performance. The search space grows exponentially with the number of parameters (10^6+ combinations).
- OpenTuner Framework: Runs an ensemble of search techniques and uses a multi-armed bandit to give more trials to the techniques that are finding better configurations. Focuses search on promising regions, eliminates poor performers early.
- Bayesian Optimization: Probabilistic model of objective function (kernel performance vs parameters). Samples most promising points, refines model iteratively.
- Performance Gain: Autotuning typically reaches 80-95% of hand-optimized performance without manual tuning. Speedup: 2-10x over default parameters.
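
The bandit idea above can be sketched in a few lines. This is a minimal illustration, not OpenTuner's implementation: it applies the textbook UCB1 rule directly to individual configurations, whereas OpenTuner's bandit allocates trials among search techniques. The function `measure_runtime` and its cost surface are hypothetical stand-ins for timing a real kernel, and the reward (inverse runtime) is an assumed choice.

```python
import math
import random

def measure_runtime(block_size, unroll, rng):
    """Hypothetical stand-in for timing a kernel: lower is better."""
    # Synthetic cost surface with its minimum at block_size=128, unroll=4.
    base = 1.0 + abs(math.log2(block_size) - 7) * 0.3 + abs(unroll - 4) * 0.1
    return base * (1.0 + rng.uniform(-0.05, 0.05))  # up to 5% measurement noise

def ucb1_autotune(configs, budget, seed=0):
    """Spend `budget` timed runs across `configs` using the UCB1 rule."""
    rng = random.Random(seed)
    counts = [0] * len(configs)
    mean_reward = [0.0] * len(configs)
    for t in range(1, budget + 1):
        if t <= len(configs):
            arm = t - 1  # first pass: try every configuration once
        else:
            # Pick the arm with the best mean reward plus exploration bonus.
            arm = max(
                range(len(configs)),
                key=lambda i: mean_reward[i] + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = 1.0 / measure_runtime(*configs[arm], rng)  # faster run, higher reward
        counts[arm] += 1
        mean_reward[arm] += (reward - mean_reward[arm]) / counts[arm]
    best = max(range(len(configs)), key=lambda i: mean_reward[i])
    return configs[best], counts

configs = [(b, u) for b in (32, 64, 128, 256, 512) for u in (1, 2, 4, 8)]
best_config, counts = ucb1_autotune(configs, budget=400)
```

On this synthetic surface `best_config` comes out as `(128, 4)`. With a real kernel, `measure_runtime` would compile and time the kernel, and the budget would be set by how expensive each run is.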

Neural Network Performance Models

- Prediction Task: Input = kernel code, parameters, hardware. Output = predicted execution time, or a derived metric such as achieved GFLOP/s or memory bandwidth.
- Training Data: Run kernel on hardware with various parameter combinations. Collect measured runtimes and hardware statistics (memory bandwidth, cache hits, branch mispredictions).
- Model Architecture: Multi-layer neural network (5-10 layers, 100-1000 neurons per layer). ReLU activations, batch normalization. Trained via supervised learning (MSE loss).
- Accuracy: Typical error: 10-30%. Acceptable for ranking candidate kernels, which is what most optimization decisions need; less suitable for predicting absolute performance.
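
To see why a 10-30% error can still be good enough for ranking, the sketch below compares two metrics on the same predictions: mean relative error and the fraction of kernel pairs ordered correctly. The runtimes are invented for illustration, and both helper functions are hypothetical names rather than part of any library.

```python
from itertools import combinations

def mean_relative_error(measured, predicted):
    """Average of |predicted - measured| / measured over all kernels."""
    return sum(abs(p - m) / m for m, p in zip(measured, predicted)) / len(measured)

def pairwise_rank_agreement(measured, predicted):
    """Fraction of kernel pairs the model orders the same way as the measurements."""
    pairs = list(combinations(range(len(measured)), 2))
    agree = sum(
        1 for i, j in pairs
        if (measured[i] - measured[j]) * (predicted[i] - predicted[j]) > 0
    )
    return agree / len(pairs)

# Hypothetical runtimes in milliseconds for five kernel variants.
measured = [10.0, 14.0, 20.0, 31.0, 45.0]
predicted = [12.0, 11.5, 24.0, 26.0, 52.0]  # each off by roughly 15-20%

mre = mean_relative_error(measured, predicted)
agreement = pairwise_rank_agreement(measured, predicted)
```

Here the mean relative error is about 18%, yet 9 of the 10 pairs are ordered correctly; the only swap is between the two fastest variants, whose measured runtimes are closest in relative terms.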

Roofline Prediction via ML

- Roofline Model Integration: ML model predicts arithmetic intensity (FLOP/byte) and achieved occupancy. The roofline model maps these to a performance ceiling: min(peak compute, arithmetic intensity × peak memory bandwidth).
- Hybrid Approach: ML predicts occupancy + arithmetic intensity; roofline formula yields performance. Can be more accurate than direct performance regression because the hardware limits are built in rather than learned.
- Symbolic Execution: Code analysis (loop depth, memory access patterns) extracts symbolic features. ML model trained on (features, performance) pairs.
- Transfer Learning: Model trained on one GPU, transfers to similar GPU with fine-tuning. Reduces training data requirement.
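
The hybrid approach can be written down directly. In this sketch the two ML outputs (arithmetic intensity and occupancy) are passed in as plain numbers, the device peaks are hypothetical, and scaling the roofline ceiling linearly by occupancy is an assumed simplification rather than a method the text specifies.

```python
def roofline_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    """Classic roofline ceiling: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbs)

def hybrid_prediction(pred_intensity, pred_occupancy, peak_gflops, peak_bandwidth_gbs):
    """Scale the roofline ceiling by predicted occupancy (assumed linear coupling)."""
    ceiling = roofline_gflops(pred_intensity, peak_gflops, peak_bandwidth_gbs)
    return pred_occupancy * ceiling

# Hypothetical device: 10,000 GFLOP/s peak compute, 1,000 GB/s memory bandwidth.
PEAK_GFLOPS, PEAK_BW = 10_000.0, 1_000.0

# Low intensity (2 FLOP/byte): limited by the memory roof of 2,000 GFLOP/s.
memory_bound = hybrid_prediction(2.0, 0.8, PEAK_GFLOPS, PEAK_BW)
# High intensity (50 FLOP/byte): limited by the compute roof of 10,000 GFLOP/s.
compute_bound = hybrid_prediction(50.0, 0.5, PEAK_GFLOPS, PEAK_BW)
```

The first kernel is predicted at 1,600 GFLOP/s and the second at 5,000 GFLOP/s. Because the ceilings come from the hardware specification, the model only has to learn the two intermediate quantities.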

Reinforcement Learning for HPC Job Scheduling

- Scheduling Problem: Assign jobs to nodes, optimize for throughput, latency, fairness. Combinatorial search space (exponential in job count).
- RL Formulation: State = job queue, node status. Action = assign job to node (or defer). Reward = throughput increase, with a penalty for idle nodes.
- Agent Training: Deep Q-learning (DQN) or policy gradient (PPO) trained via simulation. Agent learns a scheduling policy that maximizes expected cumulative reward.
- Benchmark Results: RL-based schedulers in the research literature (e.g., DeepRM, RLScheduler) are reported to outperform heuristic schedulers (first-fit, best-fit) by 10-20% in throughput.
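
The state/action/reward formulation above can be made concrete with a toy simulator. This sketch contains no learning: it defines a hypothetical `ToyCluster` environment and a first-fit baseline, and an RL agent would take the place of `first_fit` as the policy. Placing one job per time step and using busy cores as a throughput proxy are simplifying assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class ToyCluster:
    num_nodes: int = 2
    cores_per_node: int = 4
    queue: list = field(default_factory=list)    # pending jobs: (cores, duration)
    running: list = field(default_factory=list)  # active jobs: [node, cores, steps_left]

    def free_cores(self):
        free = [self.cores_per_node] * self.num_nodes
        for node, cores, _ in self.running:
            free[node] -= cores
        return free

    def state(self):
        """Observation an agent would see: the job queue plus node status."""
        return tuple(self.queue), tuple(self.free_cores())

    def step(self, action):
        """Apply one action: a node index for the head job, or num_nodes to defer."""
        if self.queue and action < self.num_nodes:
            cores, duration = self.queue[0]
            if self.free_cores()[action] >= cores:  # infeasible placements act as defer
                self.queue.pop(0)
                self.running.append([action, cores, duration])
        free = self.free_cores()
        busy = self.num_nodes * self.cores_per_node - sum(free)
        idle_nodes = sum(1 for f in free if f == self.cores_per_node)
        reward = busy - idle_nodes  # busy cores as throughput proxy, minus idle penalty
        # Advance simulated time by one step and retire finished jobs.
        for job in self.running:
            job[2] -= 1
        self.running = [job for job in self.running if job[2] > 0]
        return reward

def first_fit(env):
    """Heuristic baseline: first node with enough free cores, else defer."""
    cores, _ = env.queue[0]
    for node, free in enumerate(env.free_cores()):
        if free >= cores:
            return node
    return env.num_nodes

env = ToyCluster(queue=[(2, 2), (4, 1), (2, 1)])
total_reward = 0
while env.queue:
    total_reward += env.step(first_fit(env))
```

For this three-job queue the first-fit baseline collects a total reward of 8. Training a DQN or PPO agent would mean repeatedly running such episodes and updating the policy from the observed `(state, action, reward)` transitions.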

AI-Guided Compiler Optimization

- Compiler Problem: Select best optimization order (loop unroll → vectorization → inlining) for input program. Order impacts final performance (10-30% variation).
- ML Integration in LLVM: ML model predicts which optimization sequence yields best performance for given function. Replaces hand-written heuristics.
- Feature Engineering: Extract program features (instruction count, loop depth, call-graph properties). Train model on (features, optimization sequence, performance) triplets.
- Production Deployment: Compiler queries the model during the optimization phase. Improves optimization quality transparently, with no user intervention.
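
As a rough illustration of learning from (features, optimization sequence, performance) triplets, the sketch below uses a 1-nearest-neighbor lookup. This is a deliberately simple stand-in, not the model LLVM uses. The training data, the feature choice, and the pass labels are all hypothetical; the labels are illustrative names, not actual compiler flags.

```python
import math

# Hypothetical training triplets: (features, pass sequence, measured speedup).
# Features are (instruction_count, max_loop_depth, call_sites).
TRAINING = [
    ((120, 2, 1), ("loop-unroll", "vectorize", "inline"), 1.30),
    ((120, 2, 1), ("inline", "loop-unroll", "vectorize"), 1.10),
    ((900, 0, 14), ("inline", "vectorize", "loop-unroll"), 1.25),
    ((900, 0, 14), ("loop-unroll", "vectorize", "inline"), 1.05),
]

def distance(a, b):
    # Log-scale the features so instruction count does not dominate.
    return math.dist([math.log1p(x) for x in a], [math.log1p(x) for x in b])

def predict_sequence(features):
    """1-nearest-neighbor: reuse the best-performing sequence of the closest function."""
    nearest = min((f for f, _, _ in TRAINING), key=lambda f: distance(f, features))
    candidates = [(speedup, seq) for f, seq, speedup in TRAINING if f == nearest]
    return max(candidates)[1]

loop_heavy = predict_sequence((150, 3, 0))   # resembles the small, loop-nested function
call_heavy = predict_sequence((800, 0, 10))  # resembles the large, call-heavy function
```

The loop-heavy query gets the sequence that starts with unrolling, and the call-heavy query gets the one that starts with inlining. A production model would replace the lookup with a trained predictor, but the input and output shapes stay the same.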

Learned Prefetching and Memory Optimization

- Prefetch Policy Prediction: ML model learns data access pattern from instruction history. Predicts the next memory address and prefetches it from DRAM into cache before it is needed.
- Address Pattern Recognition: Recurrent neural networks (LSTM) model access sequences. Train on execution traces (millions of memory accesses).
- Performance Improvement: 10-20% speedup on memory-bound kernels (FFT, GEMM variants). Trade-off: prefetcher power overhead.
- Hardware Implementation: Prefetcher implemented in CPU microarchitecture (no ISA changes). Transparent to software.
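
The prediction task (given recent accesses, guess the next address) can be shown without a neural network. The sketch below swaps the LSTM for a small table keyed on the last two address deltas, which is a much simpler technique but has the same interface: observe a trace, predict the next address. The class name and the trace are hypothetical.

```python
from collections import Counter, defaultdict

class DeltaPrefetcher:
    """Predicts the next address from the last two address deltas (table-based stand-in)."""

    def __init__(self):
        self.table = defaultdict(Counter)  # (delta1, delta2) -> counts of the following delta
        self.history = []                  # addresses seen so far

    def predict(self):
        """Return the predicted next address, or None if the pattern is unseen."""
        if len(self.history) < 3:
            return None
        a, b, c = self.history[-3:]
        followers = self.table.get((b - a, c - b))
        if not followers:
            return None
        return c + followers.most_common(1)[0][0]

    def observe(self, address):
        """Record an access and update the delta table."""
        self.history.append(address)
        if len(self.history) >= 4:
            a, b, c, d = self.history[-4:]
            self.table[(b - a, c - b)][d - c] += 1

def hit_rate(trace):
    """Replay a trace, counting how many predictions matched the actual next address."""
    prefetcher, hits, predictions = DeltaPrefetcher(), 0, 0
    for address in trace:
        guess = prefetcher.predict()
        if guess is not None:
            predictions += 1
            hits += guess == address
        prefetcher.observe(address)
    return hits, predictions

# Hypothetical trace: repeating strides of +64, +64, +4096 bytes.
trace = [0]
for i in range(30):
    trace.append(trace[-1] + (4096 if i % 3 == 2 else 64))
hits, predictions = hit_rate(trace)
```

After a short warm-up the table predicts every access in this regular trace (25 predictions, 25 hits). A learned model is aimed at irregular patterns that a fixed table like this one cannot capture.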

AI for Power Management in HPC Centers

- Power Prediction: ML model predicts power consumption (watts) per job, given parameters (clock frequency, core count, vectorization level).
- Dynamic Voltage and Frequency Scaling (DVFS): Adjust clock frequency (and supply voltage) per node based on power budget. ML model chooses the frequency that satisfies the power constraint with the least performance loss.
- Thermal Management: Predict temperature rise; throttle hot nodes, boost cool nodes. ML-guided DVFS aims for a more uniform temperature distribution.
- Data Center Savings: Power oversubscription enables 20-40% cost reduction (fewer power supplies, reduced cooling requirements). ML-guided power management maintains reliability.
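
Frequency selection under a power cap reduces to a small search once a power model exists. In this sketch the learned model is replaced by a hypothetical closed-form function (static power plus a term cubic in frequency, a common first-order approximation), and the coefficients, frequency steps, and power caps are invented for illustration.

```python
def predicted_power_watts(freq_ghz, cores, static_watts=40.0, coeff=2.0):
    """Hypothetical power model: static power plus dynamic power ~ cores * f^3."""
    return static_watts + coeff * cores * freq_ghz ** 3

def pick_frequency(cores, power_cap_watts, freqs_ghz=(1.2, 1.6, 2.0, 2.4, 2.8)):
    """Highest available frequency whose predicted power fits under the cap."""
    feasible = [f for f in freqs_ghz if predicted_power_watts(f, cores) <= power_cap_watts]
    # If nothing fits, fall back to the lowest frequency step.
    return max(feasible) if feasible else min(freqs_ghz)

# A 16-core job under two hypothetical node power caps.
relaxed = pick_frequency(cores=16, power_cap_watts=800.0)
tight = pick_frequency(cores=16, power_cap_watts=300.0)
```

With the 800 W cap the job runs at the top step (2.8 GHz); with the 300 W cap it drops to 2.0 GHz. In an ML-guided setup, `predicted_power_watts` would be a model trained on measured per-job power.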

Current Limitations and Future Directions

- Generalization Challenge: ML models are trained on specific hardware (GPU architecture, interconnect topology). Transfer to substantially different hardware requires retraining or at least fine-tuning.
- Interpretability: "Black box" ML models don't explain optimization decisions. Hard to debug if model performance degrades.
- Data Requirements: Large training datasets necessary (100k+ kernel runs). Expensive to collect, which limits applicability in niche domains where such data is scarce.
- Emerging Trends: AutoML techniques (neural architecture search) automatically design model architectures. Federated learning enables knowledge sharing across systems without data centralization.
