Neural Architecture Search (NAS) Efficiency Methods are a set of techniques that reduce the computational cost of automated architecture discovery from thousands of GPU-days to a few GPU-days or even single-digit GPU-hours, transforming NAS from a prohibitively expensive research curiosity into a practical tool for designing high-performance neural networks.
Early NAS and the Cost Problem
The original NAS (Zoph and Le, 2017) used reinforcement learning to search over architectures, requiring roughly 22,400 GPU-days of compute (a cloud bill far beyond typical research budgets) to find a single CNN architecture for CIFAR-10. The follow-up NASNet found cells that transfer to ImageNet, but its search still consumed about 2,000 GPU-days (48,000 GPU-hours). Each candidate architecture was trained from scratch to convergence before evaluation, so cost scaled with the number of candidates drawn from a combinatorially large search space. This motivated efficient alternatives that share computation across candidates.
One-Shot NAS and Supernet Training
- Supernet concept: A single over-parameterized network (supernet) encodes all candidate architectures as subnetworks within a shared weight space
- Weight sharing: All candidate architectures share parameters; evaluating a candidate requires only a forward pass through the relevant subnetwork
- Single training run: The supernet is trained once (typically 100-200 epochs), then candidates are evaluated by inheriting supernet weights
- Path sampling: During supernet training, random paths (subnetworks) are sampled each batch, approximating joint training of all candidates (see the sketch after this list)
- Cost reduction: From thousands of GPU-days to 1-4 GPU-days for complete search
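The following is a minimal, illustrative PyTorch sketch of single-path supernet training: each layer holds several candidate operations inside one shared parameter space, and each batch updates only the operations on a randomly sampled path. The names `MixedLayer`, `Supernet`, and `sample_path` are invented for illustration and do not correspond to any specific NAS library.

```python
import random
import torch
import torch.nn as nn

class MixedLayer(nn.Module):
    """One supernet layer: every candidate op lives in the shared weight space."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 5, padding=2),
            nn.Identity(),                      # skip-connection candidate
        ])

    def forward(self, x, choice):
        return self.ops[choice](x)              # forward only the sampled op

class Supernet(nn.Module):
    def __init__(self, channels=16, depth=4, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.layers = nn.ModuleList([MixedLayer(channels) for _ in range(depth)])
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x, path):
        x = self.stem(x)
        for layer, choice in zip(self.layers, path):
            x = layer(x, choice)
        return self.head(x.mean(dim=(2, 3)))    # global average pool, then classify

def sample_path(net):
    """Pick one candidate op per layer uniformly at random (one subnetwork)."""
    return [random.randrange(len(layer.ops)) for layer in net.layers]

# One training step: sample a random subnetwork and update the shared weights
# it touches; repeated over many batches, this approximates jointly training
# all candidate architectures.
net = Supernet()
opt = torch.optim.SGD(net.parameters(), lr=0.05, momentum=0.9)
images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(net(images, sample_path(net)), labels)
opt.zero_grad()
loss.backward()
opt.step()
```

After training, candidates are ranked by evaluating them with inherited supernet weights, and only the best-ranked architecture is retrained from scratch.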
DARTS: Differentiable Architecture Search
- Continuous relaxation: DARTS (Liu et al., 2019) replaces discrete architecture choices with continuous softmax weights over operations (convolution, pooling, skip connection)
- Bilevel optimization: Architecture parameters (α) are optimized on the validation loss while network weights (w) are optimized on the training loss, via alternating gradient descent (illustrated in the sketch after this list)
- Search cost: Approximately 1.5 GPU-days on CIFAR-10, three to four orders of magnitude cheaper than RL-based NAS
- Collapse problem: DARTS tends to converge to parameter-free operations (skip connections, pooling) due to optimization bias; this is addressed by variants such as DARTS+ and FairDARTS and by progressively pruning the candidate operation set
- Cell-based search: Discovers normal and reduction cells that are stacked to form the final architecture
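A first-order DARTS-style sketch of the continuous relaxation and alternating bilevel updates is shown below. The module and variable names (`MixedOp`, `TinyDARTSNet`) are illustrative assumptions rather than the official DARTS code, and the second-order gradient correction from the paper is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Continuous relaxation: output is a softmax-weighted sum of candidate ops."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # One architecture parameter (alpha) per candidate operation.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

class TinyDARTSNet(nn.Module):
    def __init__(self, channels=16, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.cell = MixedOp(channels)
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x):
        x = self.cell(self.stem(x))
        return self.head(x.mean(dim=(2, 3)))

net = TinyDARTSNet()
arch_params = [p for n, p in net.named_parameters() if n.endswith("alpha")]
weight_params = [p for n, p in net.named_parameters() if not n.endswith("alpha")]
w_opt = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9)
a_opt = torch.optim.Adam(arch_params, lr=3e-4)

def step(optimizer, images, labels):
    optimizer.zero_grad()
    F.cross_entropy(net(images), labels).backward()
    optimizer.step()

# Alternating first-order bilevel updates: alpha on a validation batch,
# then w on a training batch.
val_x, val_y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
train_x, train_y = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
step(a_opt, val_x, val_y)       # architecture step (validation loss)
step(w_opt, train_x, train_y)   # weight step (training loss)

# After search, the strongest operation on each edge is kept for the final net.
print(F.softmax(net.cell.alpha, dim=0).argmax().item())
```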
Progressive and Predictor-Based Methods
- Progressive NAS (PNAS): Grows architectures incrementally from simple to complex, pruning unpromising candidates early
- Predictor-based NAS: Trains a surrogate model (MLP, GNN, or Gaussian process) to predict architecture performance from an encoding of the architecture
- Zero-cost proxies: Evaluate architectures at initialization, without any training, using metrics such as Jacobian covariance, synaptic saliency, or gradient norm (a minimal proxy sketch follows this list)
- Hardware-aware NAS: Jointly optimizes accuracy and latency/FLOPs/energy using multi-objective search (e.g., MnasNet, FBNet, EfficientNet)
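As a concrete example of a zero-cost proxy, the sketch below scores untrained candidates by the total gradient norm produced by a single minibatch. The function `grad_norm_score` and the toy candidate networks are hypothetical names used only for illustration; published proxies differ in their exact formulations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_norm_score(model, images, labels):
    """Score a randomly initialized candidate; higher is taken as a cheap
    signal of trainability, with no training required."""
    model.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    return sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)

# Rank two hypothetical candidates without training either of them.
candidates = {
    "conv_net": nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10)),
    "deeper_net": nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10)),
}
images, labels = torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,))
for name, model in candidates.items():
    print(name, grad_norm_score(model, images, labels))
```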
Search Space Design
- Cell-based: Search within a repeatable cell structure; stack cells to form the network (NASNet, DARTS); a toy genotype encoding is sketched after this list
- Network-level: Search over depth, width, resolution, and connectivity patterns (EfficientNet compound scaling)
- Operation set: Typically includes 3x3/5x5 convolutions, depthwise separable convolutions, dilated convolutions, skip connections, and zero (no connection)
- Macro search: Full topology discovery including branching and merging paths
- Hierarchical search: Multi-level search combining cell-level and network-level decisions
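The toy encoding below shows what a cell-based search space looks like as a data structure: a fixed operation set plus a genotype that assigns an operation and a source node to each edge of a small DAG. The operation names and the `sample_cell` helper are assumptions for illustration, not any framework's exact format.

```python
import random

# A typical operation set, including the "zero" op that removes an edge.
OPERATIONS = [
    "conv_3x3",
    "conv_5x5",
    "sep_conv_3x3",
    "dilated_conv_3x3",
    "max_pool_3x3",
    "skip_connect",
    "none",
]

def sample_cell(num_nodes=4):
    """Sample a genotype: for each intermediate node, pick two input edges
    from earlier nodes and an operation for each edge."""
    genotype = []
    for node in range(2, num_nodes + 2):          # nodes 0 and 1 are the cell inputs
        inputs = random.sample(range(node), 2)
        genotype.append([(random.choice(OPERATIONS), src) for src in inputs])
    return genotype

# A full network is then a stack of normal cells, with reduction cells inserted
# at fixed depths (e.g., one third and two thirds of the way down).
normal_cell = sample_cell()
reduction_cell = sample_cell()
print(normal_cell)
```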
Practical Deployment and Recent Advances
- Once-for-All (OFA): Trains a single supernet supporting elastic depth, width, kernel size, and resolution; specialized subnets are extracted for different hardware targets without retraining (an elastic-width toy example follows this list)
- NAS benchmarks: NAS-Bench-101 and NAS-Bench-201 provide exhaustively precomputed results, and NAS-Bench-301 provides surrogate-modeled results, enabling reproducible NAS research
- AutoML frameworks: Auto-PyTorch, NNI (Microsoft), and AutoGluon integrate NAS into end-to-end pipelines
- Transferability: Architectures found on proxy tasks (CIFAR-10) often transfer well to larger datasets (ImageNet) via scaling
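To make the elastic-subnetwork idea concrete, the hypothetical `ElasticLinear` layer below serves narrower subnets by slicing the leading units of a full-width weight matrix. This is a minimal sketch of the elastic-width concept only, not the OFA codebase; OFA additionally supports elastic depth, kernel size, and resolution and uses a progressive-shrinking training schedule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticLinear(nn.Module):
    """A layer trained at full width that can serve narrower subnets by
    slicing its leading output units, so deployment-specific subnets are
    extracted without retraining."""
    def __init__(self, in_features, max_out_features):
        super().__init__()
        self.weight = nn.Parameter(0.02 * torch.randn(max_out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(max_out_features))

    def forward(self, x, width):
        # During supernet training the width is sampled per batch; at
        # deployment it is fixed per target device.
        return F.linear(x, self.weight[:width], self.bias[:width])

layer = ElasticLinear(in_features=64, max_out_features=128)
x = torch.randn(4, 64)
print(layer(x, width=128).shape)   # full-width subnet -> torch.Size([4, 128])
print(layer(x, width=32).shape)    # narrow subnet     -> torch.Size([4, 32])
```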
Efficient NAS methods have democratized architecture design, enabling practitioners to discover hardware-optimized networks in hours rather than weeks and making automated architecture engineering a standard component of the modern deep learning workflow.