Hyperparameter optimization (Bayesian optimization, Optuna, Population-Based Training) is the systematic process of selecting training configurations (learning rates, batch sizes, architectures, regularization strengths) that maximize model performance. It replaces manual trial-and-error tuning with principled search algorithms that explore high-dimensional configuration spaces efficiently.
The Hyperparameter Challenge
Neural network performance is highly sensitive to hyperparameter choices: a 2x change in learning rate can mean the difference between convergence and divergence; batch size affects generalization; weight decay interacts non-linearly with learning rate and architecture. Manual tuning is time-consuming and biased by practitioner experience. The search space grows combinatorially: 10 hyperparameters with 10 values each yield 10^10 (10 billion) combinations, which makes exhaustive search infeasible.
Grid Search and Random Search
- Grid search: Evaluates all combinations of discrete hyperparameter values; scales exponentially O(k^d) where k is values per dimension and d is number of hyperparameters
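- Random search (Bergstra and Bengio, 2012): Randomly samples configurations from specified distributions; shown empirically and theoretically to be more efficient than grid search when only a few hyperparameters strongly affect the result
- Why random beats grid: Grid search repeats the same few values of each hyperparameter across many trials, so most evaluations are spent varying dimensions that may not matter; random search draws a new value for every hyperparameter in every trial, so n trials test n distinct values of each important dimension
- Practical recommendation: Random search with 60 trials is a reasonable baseline for many problems, because 60 independent samples have about a 95% chance (1 - 0.95^60) of including at least one configuration from the best 5% of the search space; it also serves as the baseline for more sophisticated methods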
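The combinatorial growth of grid search and the 60-trial rule of thumb can both be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the probability calculation assumes trials are sampled independently and that "good" means landing in the best 5% of the search space by volume.

```python
import itertools
import random


def grid_size(values_per_dim, n_dims):
    """Number of configurations a full grid must evaluate: k^d."""
    return values_per_dim ** n_dims


def prob_hit_top_fraction(n_trials, top_fraction=0.05):
    """Chance that at least one of n independent random trials
    falls in the best `top_fraction` of the search space."""
    return 1.0 - (1.0 - top_fraction) ** n_trials


def grid_points(values, n_dims):
    """All combinations of the same candidate values in every dimension."""
    return list(itertools.product(values, repeat=n_dims))


def random_points(n_trials, n_dims, rng):
    """Independent uniform samples from the unit hypercube."""
    return [tuple(rng.random() for _ in range(n_dims)) for _ in range(n_trials)]


def unique_values_per_dim(points, dim):
    """How many distinct values of one hyperparameter were actually tried."""
    return len({p[dim] for p in points})


if __name__ == "__main__":
    rng = random.Random(0)
    grid = grid_points([0.1, 0.5, 0.9], n_dims=2)   # 9 trials
    rand = random_points(len(grid), n_dims=2, rng=rng)  # also 9 trials
    print("grid size for k=10, d=10:", grid_size(10, 10))
    print("distinct values of dim 0 (grid):", unique_values_per_dim(grid, 0))
    print("distinct values of dim 0 (random):", unique_values_per_dim(rand, 0))
    print("P(hit top 5% in 60 trials):", round(prob_hit_top_fraction(60), 3))
```

With the same budget of nine trials, the grid tests only three distinct values of each hyperparameter, while random sampling tests nine. That is the practical content of the argument for random search when only one or two dimensions matter.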
Bayesian Optimization
- Surrogate model: Builds a probabilistic model (Gaussian Process, Tree-structured Parzen Estimator, or Random Forest) of the objective function from evaluated configurations
- Acquisition function: Balances exploration (uncertain regions) and exploitation (promising regions)—Expected Improvement (EI), Upper Confidence Bound (UCB), or Knowledge Gradient
- Sequential refinement: Each trial's result updates the surrogate model, and the next configuration is chosen to maximize the acquisition function
- Gaussian Process BO: Models the objective as a GP, typically with an RBF or Matérn kernel; provides uncertainty estimates but scales poorly beyond ~20 dimensions and ~1000 evaluations, since exact GP inference costs O(n^3) in the number of evaluations
- Tree-structured Parzen Estimator (TPE): Models the distribution of good and bad configurations separately using kernel density estimation; handles conditional and hierarchical hyperparameters naturally; the default sampler in Optuna and the main algorithm in Hyperopt
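The surrogate-plus-acquisition loop is easiest to see in a one-dimensional, TPE-style sketch. This is a simplified illustration, not the algorithm as implemented in Optuna or Hyperopt: it assumes a single continuous hyperparameter, a fixed kernel bandwidth, and equal-weight Gaussian kernels, and all function names are hypothetical. Observed trials are split into a "good" group (the best fraction `gamma`) and a "bad" group, each group is modeled with a kernel density estimate, and the next point is the sampled candidate with the highest good-to-bad density ratio, which plays the role of the acquisition function.

```python
import math
import random


def gaussian_kde(points, bandwidth):
    """Return a density function built from equal-weight Gaussian kernels."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm

    return density


def tpe_suggest(history, low, high, rng, gamma=0.25, n_candidates=24):
    """Pick the next x to evaluate from a list of (x, loss) pairs."""
    if len(history) < 2:
        return rng.uniform(low, high)  # not enough data to split yet
    ordered = sorted(history, key=lambda pair: pair[1])  # lowest loss first
    n_good = max(1, min(len(ordered) - 1, math.ceil(gamma * len(ordered))))
    good = [x for x, _ in ordered[:n_good]]
    bad = [x for x, _ in ordered[n_good:]]
    bandwidth = (high - low) / 10.0  # fixed bandwidth: an assumption of this sketch
    good_density = gaussian_kde(good, bandwidth)
    bad_density = gaussian_kde(bad, bandwidth)
    # Sample candidates from the "good" density, clipped to the search range.
    candidates = [
        min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
        for _ in range(n_candidates)
    ]
    # Prefer candidates that are likely under "good" and unlikely under "bad".
    return max(candidates, key=lambda x: good_density(x) / (bad_density(x) + 1e-12))


def minimize(objective, low, high, n_startup=10, n_trials=50, seed=0):
    """Random startup trials followed by sequential model-guided trials."""
    rng = random.Random(seed)
    history = []
    for i in range(n_trials):
        if i < n_startup:
            x = rng.uniform(low, high)
        else:
            x = tpe_suggest(history, low, high, rng)
        history.append((x, objective(x)))  # each result updates the model's data
    return min(history, key=lambda pair: pair[1])


if __name__ == "__main__":
    best_x, best_loss = minimize(lambda x: (x - 2.0) ** 2, low=-5.0, high=5.0)
    print("best x:", best_x, "loss:", best_loss)
```

Because the two densities are refit from the full history on every call, each new result changes where the next candidate is drawn, which is the sequential refinement described above.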
Optuna Framework
- Define-by-run API: Hyperparameter search spaces are defined within the objective function using trial.suggest_* methods, enabling dynamic and conditional parameters
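- Pruning (early stopping): MedianPruner and HyperbandPruner terminate unpromising trials early based on intermediate results; compute savings of roughly 2-5x are often cited, though the actual gain depends on the task and on how early poor trials can be identified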
- Multi-objective optimization: Simultaneously optimizes accuracy and latency/model size using Pareto-optimal trial selection (NSGA-II)
- Distributed search: Scales across multiple workers that share one storage backend (a relational database such as MySQL or PostgreSQL, or Redis where the installed Optuna version supports it)
- Visualization: Built-in plotting for optimization history, parameter importance, parallel coordinate plots, and contour maps
- Integration: Integration modules for PyTorch Lightning, Keras, and XGBoost provide callback-based pruning; scikit-learn estimators are supported through the OptunaSearchCV wrapper
Population-Based Training (PBT)
- Evolutionary approach: Maintains a population of models training in parallel, each with different hyperparameters
- Exploit and explore: Periodically, underperforming members copy weights from top performers (exploit) and perturb hyperparameters (explore)
- Online schedule discovery: PBT implicitly learns hyperparameter schedules (e.g., learning rate warmup then decay) rather than fixed values, reflecting the fact that the best hyperparameters change during training
- DeepMind results: In the original DeepMind work (Jaderberg et al., 2017), PBT found training schedules for machine-translation Transformers, GANs, and deep RL agents that outperformed hand-tuned baselines
- Communication overhead: Requires shared filesystem or network storage for model checkpoints; population size of 20-50 is typical
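The exploit-and-explore cycle can be sketched on a toy problem. In the example below, each population member "trains" a single weight `theta` by gradient descent on f(theta) = theta^2, and the only hyperparameter is the learning rate. The quartile cutoff and the 0.8/1.2 perturbation factors are illustrative choices, the function names are hypothetical, and members run sequentially in one process rather than in parallel with shared checkpoint storage.

```python
import random


def train_step(theta, lr):
    """One gradient-descent step on f(theta) = theta^2 (gradient is 2 * theta)."""
    return theta - lr * 2.0 * theta


def score(theta):
    """Higher is better: the negative of the loss."""
    return -(theta ** 2)


def pbt(population_size=8, rounds=10, steps_per_round=5, seed=0):
    rng = random.Random(seed)
    # Every member starts from the same weights with a different learning rate.
    population = [
        {"theta": 5.0, "lr": 10 ** rng.uniform(-4, -1)}
        for _ in range(population_size)
    ]
    for _ in range(rounds):
        # Train each member for a short interval with its own hyperparameters.
        for member in population:
            for _ in range(steps_per_round):
                member["theta"] = train_step(member["theta"], member["lr"])

        ranked = sorted(population, key=lambda m: score(m["theta"]), reverse=True)
        cutoff = max(1, population_size // 4)
        top, bottom = ranked[:cutoff], ranked[-cutoff:]
        for loser in bottom:
            winner = rng.choice(top)
            loser["theta"] = winner["theta"]  # exploit: copy weights
            loser["lr"] = winner["lr"] * rng.choice([0.8, 1.2])  # explore: perturb
    return max(population, key=lambda m: score(m["theta"]))


if __name__ == "__main__":
    best = pbt()
    print("best theta:", best["theta"], "final lr:", best["lr"])
```

Since the learning rate of a member can change at every round, the sequence of values it has held over time forms a schedule, which is how PBT ends up discovering schedules rather than a single fixed setting.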
Advanced Methods and Practical Guidance
- BOHB (Bayesian Optimization and HyperBand): Combines a TPE-style Bayesian optimization model with Hyperband's adaptive resource allocation for efficient multi-fidelity search
- Multi-fidelity optimization: Evaluate configurations cheaply first (few epochs, subset of data, smaller model) and allocate full resources only to promising candidates
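- Transfer learning for HPO: Warm-start optimization using results from related tasks or datasets; reductions of 50-80% in required evaluations are sometimes reported, but the benefit depends on how similar the source and target tasks are
- Learning rate range test: Smith's learning rate finder increases the learning rate from small to large over a short run (often about one epoch) while recording the loss, identifying a workable range (from where the loss starts to fall to just before it diverges) without full HPO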
- Hyperparameter importance: fANOVA (functional ANOVA) decomposes objective variance to identify which hyperparameters matter most, focusing search on high-impact dimensions
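The multi-fidelity idea behind Hyperband and BOHB can be sketched with successive halving, the building block they share. The example below is a simplified illustration under stated assumptions: `evaluate` is a hypothetical stand-in for "train for `budget` epochs and return validation loss", the reduction factor `eta=3` is an illustrative choice, and each rung re-evaluates survivors from scratch rather than resuming from checkpoints.

```python
import math
import random


def evaluate(config, budget):
    """Hypothetical stand-in for training `budget` epochs and returning
    validation loss; the loss floor is lowest near lr = 1e-2."""
    floor = (math.log10(config["lr"]) + 2.0) ** 2
    return floor + 1.0 / budget


def successive_halving(configs, min_budget=1, eta=3):
    """Keep the best 1/eta of the configurations at each rung while
    multiplying the per-configuration budget by eta."""
    survivors = list(configs)
    budget = min_budget
    total_cost = 0
    while len(survivors) > 1:
        total_cost += len(survivors) * budget  # epochs spent at this rung
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget))
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0], total_cost


if __name__ == "__main__":
    rng = random.Random(0)
    configs = [{"lr": 10 ** rng.uniform(-5, 0)} for _ in range(27)]
    best, cost = successive_halving(configs)
    print("best config:", best, "total epochs spent:", cost)
```

With 27 starting configurations and `eta=3`, the rungs evaluate 27, 9, and 3 configurations at budgets of 1, 3, and 9 epochs, so every rung costs the same 27 epochs and most of the candidates are discarded after only a cheap evaluation.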
Hyperparameter optimization has evolved from ad-hoc manual tuning to a principled engineering practice, with frameworks like Optuna and methods like PBT enabling practitioners to systematically discover training configurations that unlock the full potential of their neural network architectures.