Hyperparameter Optimization (HPO)

Hyperparameter Optimization (HPO) is the systematic process of selecting the configuration of training hyperparameters (learning rate, batch size, architecture choices, regularization strength, and optimizer settings) that maximizes model performance at an acceptable computational cost. It replaces manual trial-and-error tuning with automated search methods ranging from Bayesian optimization to population-based training.

Search Strategy Taxonomy:
- Grid Search: Evaluate all combinations of discretized hyperparameter values; exhaustive but exponentially expensive in the number of hyperparameters (curse of dimensionality)
- Random Search: Sample hyperparameter configurations independently at random from the search space; more efficient than grid search when only a few hyperparameters matter, as shown empirically and theoretically by Bergstra & Bengio (2012)
- Bayesian Optimization: Build a probabilistic surrogate model of the objective function and use an acquisition function to select the most promising configuration to evaluate next
- Tree-Structured Parzen Estimator (TPE): A Bayesian optimization variant that models the densities of good and bad configurations separately with kernel density estimators and proposes points that maximize the ratio of the good density to the bad density (used in Optuna and Hyperopt)
- Gaussian Process (GP): A Bayesian optimization surrogate that fits a Gaussian process to observed (configuration, performance) pairs and selects the next point with an acquisition function such as Expected Improvement or Upper Confidence Bound
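- Successive Halving / Hyperband: Allocate a small budget to many configurations, then progressively eliminate the worst performers and allocate more resources to the survivors; Hyperband repeats this procedure with several different trade-offs between the number of configurations and the starting budget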
- Population-Based Training (PBT): Maintain a population of models training in parallel, periodically replacing poor performers with perturbed copies of good performers, so that hyperparameter schedules can evolve during training
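
The successive halving idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: `toy_loss` is a hypothetical stand-in for "train for a given budget and report validation loss", and the reduction factor `eta=3`, the starting budget, and the assumed optimum at a learning rate of 1e-2 are all choices made for the example.

```python
import math
import random


def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations each round.

    evaluate(config, budget) returns a loss (lower is better).
    Returns the winner and a log of (num_configs, budget) per round.
    """
    survivors = list(configs)
    budget = min_budget
    rounds = []
    while len(survivors) > 1:
        rounds.append((len(survivors), budget))
        # Rank all survivors using the current (small) budget.
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        # Drop the worst performers, keeping at least one configuration.
        survivors = ranked[: max(1, len(ranked) // eta)]
        # Survivors earn eta times more budget in the next round.
        budget *= eta
    return survivors[0], rounds


def toy_loss(config, budget):
    # Hypothetical stand-in for "train for `budget` epochs, return validation loss".
    # Assumes the best learning rate is 1e-2 and that loss shrinks with budget.
    return abs(math.log10(config["lr"]) + 2.0) + 1.0 / budget


rng = random.Random(0)
candidates = [{"lr": 10 ** rng.uniform(-5, 0)} for _ in range(27)]
best, rounds = successive_halving(candidates, toy_loss)
print(rounds)  # [(27, 1), (9, 3), (3, 9)]
print(best)
```

With 27 candidates and `eta=3`, most of the configurations are only ever evaluated at the smallest budget, which is where the savings come from. Hyperband would run this routine several times with different starting budgets and candidate counts.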

Key Frameworks and Tools:
- Optuna: Python framework with TPE-based sampler, pruning via median/percentile stopping, multi-objective optimization, and rich visualization (contour plots, parameter importance, optimization history)
- Ray Tune: Distributed HPO library integrated with Ray, supporting multiple search algorithms (Bayesian, Hyperband, PBT, BOHB), fault-tolerant distributed execution, and seamless scaling from laptop to cluster
- Weights & Biases Sweeps: Cloud-integrated HPO with Bayesian and random search, real-time experiment tracking, and collaborative visualization
- KerasTuner: Keras-native HPO with built-in Hyperband, random search, and Bayesian optimization for Keras/TensorFlow models
- SMAC3: Sequential Model-Based Algorithm Configuration using random forests as surrogate models, excelling on conditional and high-dimensional search spaces
- Ax/BoTorch: Meta's adaptive experimentation platform built on BoTorch (Bayesian optimization in PyTorch), supporting multi-objective and constrained optimization

Early Stopping and Pruning:
- Median Pruner: Stop a trial if its intermediate performance is worse than the median of earlier trials' intermediate results at the same step
- Percentile Pruner: Generalize median pruning to any percentile threshold, trading aggressiveness for risk of pruning eventually-good trials
- ASHA (Asynchronous Successive Halving): Asynchronously promote or stop trials based on their performance at predefined rungs, enabling efficient utilization of distributed resources
- Learning Curve Extrapolation: Fit parametric curves to partial training histories to predict final performance and prune unlikely candidates early
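
The median and percentile rules above reduce to a small comparison against earlier trials. The sketch below is a simplified illustration written for this section, not the implementation used by any particular library: the function name `should_prune`, the `min_trials` guard, and the nearest-rank percentile are assumptions, and lower values are assumed to be better.

```python
import math


def should_prune(current_value, step, history, percentile=50.0, min_trials=3):
    """Decide whether to stop a trial at `step` (losses: lower is better).

    history holds one list of intermediate losses per earlier trial.
    The trial is pruned when its loss is worse than the given percentile
    of the earlier losses at the same step (percentile=50 is the median rule).
    """
    peers = sorted(h[step] for h in history if len(h) > step)
    if len(peers) < min_trials:
        return False  # too little evidence to prune safely
    # Nearest-rank percentile: smallest value with at least percentile% at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(peers)))
    threshold = peers[rank - 1]
    return current_value > threshold


# Hypothetical validation losses of three earlier trials at steps 0, 1, 2.
history = [[1.0, 0.8, 0.6], [0.9, 0.7, 0.5], [1.2, 1.0, 0.9]]
print(should_prune(0.95, 1, history))  # True: worse than the median 0.8
print(should_prune(0.75, 1, history))  # False: better than the median
```

Lowering `percentile` makes the rule more aggressive, since a trial then has to beat a stricter threshold to keep running; that is the trade-off against pruning trials that would have improved later.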

Multi-Objective and Constrained HPO:
- Pareto Optimization: Simultaneously optimize accuracy, latency, and model size, returning a Pareto front of non-dominated solutions
- Constrained Optimization: Enforce hard constraints (e.g., model must be under 50MB, inference under 10ms) while maximizing accuracy
- Cost-Aware Search: Weight the acquisition function by the computational cost of each configuration (for example, expected improvement per unit of cost), so that cheap evaluations are preferred when the expected gains are similar
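
To make the Pareto front and the hard constraints concrete, the following sketch filters a handful of trial results. The numbers are made up for illustration, every objective is expressed so that lower is better (error instead of accuracy), and the 10 ms and 50 MB limits are the ones used as examples above.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical trial results as (error, latency_ms, size_mb); all minimized.
trials = [
    (0.08, 12.0, 80.0),  # most accurate, but slow and large
    (0.10, 8.0, 45.0),
    (0.12, 5.0, 20.0),
    (0.12, 9.0, 48.0),   # dominated by (0.10, 8.0, 45.0)
    (0.15, 5.0, 25.0),   # dominated by (0.12, 5.0, 20.0)
]

front = pareto_front(trials)
print(front)  # [(0.08, 12.0, 80.0), (0.1, 8.0, 45.0), (0.12, 5.0, 20.0)]

# Constrained view: keep only trials under 10 ms and 50 MB, then minimize error.
feasible = [t for t in trials if t[1] <= 10.0 and t[2] <= 50.0]
best_feasible = min(feasible, key=lambda t: t[0])
print(best_feasible)  # (0.1, 8.0, 45.0)
```

The pairwise check is quadratic in the number of trials, which is usually acceptable for the few hundred trials of a typical study.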

Practical Recommendations:
- Start with Random Search: Establish baselines and understand the hyperparameter landscape before deploying more sophisticated methods
- Use Log-Uniform Sampling: For learning rates, weight decay, and other scale-sensitive parameters, sample uniformly in log space
- Budget Allocation: As a rough guideline, allocate 20–50% of the total compute budget to HPO, and use Hyperband-style early stopping to maximize the number of configurations evaluated
- Warm-Starting: Initialize Bayesian optimization with previously observed configurations from related tasks or model architectures
- Hyperparameter Importance Analysis: Use fANOVA (functional ANOVA) to quantify which hyperparameters most affect performance, and focus future search on the most influential ones
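
As a small illustration of the first two recommendations, random search with log-uniform sampling needs only the standard library. The parameter names and ranges below are assumptions chosen for the example, not recommended defaults.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng):
    # Hypothetical search space; adjust names and ranges to the model at hand.
    return {
        "learning_rate": sample_log_uniform(rng, 1e-5, 1e-1),
        "weight_decay": sample_log_uniform(rng, 1e-6, 1e-2),
        "batch_size": rng.choice([32, 64, 128, 256]),
    }


rng = random.Random(0)
configs = [sample_config(rng) for _ in range(5)]
for config in configs:
    print(config)
```

Sampling in log space gives each order of magnitude the same probability: about half of the learning rates drawn from [1e-5, 1e-1] fall below 1e-3, whereas plain uniform sampling over the same range would place roughly 99% of them above 1e-3.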

Hyperparameter optimization has evolved from a manual art into a rigorous engineering discipline. Modern frameworks let practitioners navigate vast configuration spaces efficiently, discover non-obvious hyperparameter interactions, and systematically extract the best achievable performance from deep learning models within a fixed computational budget.
