Energy-Based Models (EBMs)

Keywords: energy based models ebm,contrastive divergence training,score matching energy,langevin dynamics sampling,boltzmann machine deep learning

Energy-Based Models (EBMs) are a general class of generative models that define a probability distribution over data by assigning a scalar energy value to each input configuration, with lower energy corresponding to higher probability. Because the density is specified only up to a normalizing constant, the energy function can be parameterized by an arbitrary neural network, free of the architectural constraints imposed by normalizing flows and the training instability of GANs.

Mathematical Foundation:
- Energy Function: A learned function E_theta(x) maps each data point x to a scalar energy value; the model does not require E to have any specific structure beyond being differentiable with respect to its parameters (and, for gradient-based training and sampling, with respect to its input)
- Boltzmann Distribution: The probability density is defined as p_theta(x) = exp(-E_theta(x)) / Z_theta, where Z_theta is the partition function (normalizing constant) obtained by integrating exp(-E_theta(x)) over all possible inputs; a minimal sketch of these ingredients appears after this list
- Intractable Partition Function: Computing Z_theta requires integrating over the entire data space, which is infeasible for high-dimensional inputs — making maximum likelihood training challenging and motivating approximate training methods
- Free Energy: For models with latent variables, the free energy marginalizes over latent configurations: F(x) = -log(sum_h exp(-E(x, h))), connecting EBMs to traditional probabilistic graphical models
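
To make these definitions concrete, below is a minimal PyTorch sketch; the network shape, layer sizes, and the name EnergyNet are illustrative assumptions, not part of any standard API. A small MLP plays the role of E_theta(x), -E_theta(x) serves as an unnormalized log-density, and the intractable log Z_theta cancels whenever two inputs are compared.

    import torch
    import torch.nn as nn

    class EnergyNet(nn.Module):
        """Maps an input x to a scalar energy E_theta(x)."""
        def __init__(self, in_dim=2, hidden_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, hidden_dim), nn.SiLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
                nn.Linear(hidden_dim, 1),
            )

        def forward(self, x):
            return self.net(x).squeeze(-1)  # one scalar energy per example

    energy = EnergyNet()
    x1, x2 = torch.randn(1, 2), torch.randn(1, 2)

    # log p(x) = -E(x) - log Z; the unknown log Z cancels in log-ratios,
    # so relative likelihoods are available even though Z is intractable.
    log_ratio = -energy(x1) + energy(x2)   # log p(x1) - log p(x2)
    print(float(log_ratio))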

Training Methods:
- Contrastive Divergence (CD): Approximate the gradient of the log-likelihood by running k steps of MCMC (typically Gibbs sampling) starting from data points; CD-1 uses a single step and was instrumental in training Restricted Boltzmann Machines
- Persistent Contrastive Divergence (PCD): Maintain persistent MCMC chains across training iterations rather than reinitializing from data, producing better gradient estimates at the cost of storing the state of the negative-sample chains between updates
- Score Matching: Minimize the squared difference between the model's score function (gradient of the log-density with respect to the input) and the data score, avoiding partition function computation entirely; when the data are perturbed with noise, the objective reduces to denoising score matching
- Noise Contrastive Estimation (NCE): Train a binary classifier to distinguish data from noise samples, implicitly learning the energy function as the log-ratio of data to noise density
- Sliced Score Matching: Project the score matching objective onto random directions, reducing computational cost from computing the full Hessian trace to evaluating directional derivatives
- Denoising Score Matching (DSM): Perturb data with known noise and train the model to estimate the score of the noised distribution; this objective is directly connected to the training of diffusion models (see the sketch after this list)
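
As a concrete illustration of these objectives, the following sketch implements denoising score matching using the toy EnergyNet defined earlier. The noise scale sigma and the single fixed noise level are simplifying assumptions; practical systems typically train over many noise levels.

    import torch

    def dsm_loss(energy, x, sigma=0.1):
        """Denoising score matching: match the model score grad_x log p_theta
        at noised points to the known score of the Gaussian corruption."""
        noise = torch.randn_like(x) * sigma
        x_tilde = (x + noise).requires_grad_(True)
        # Model score: grad_x log p_theta(x_tilde) = -grad_x E_theta(x_tilde)
        e = energy(x_tilde).sum()
        model_score = -torch.autograd.grad(e, x_tilde, create_graph=True)[0]
        # Score of the corruption kernel N(x, sigma^2 I), evaluated at x_tilde
        target_score = -noise / sigma**2
        return ((model_score - target_score) ** 2).sum(dim=-1).mean()

    # Usage: loss = dsm_loss(energy, batch); loss.backward(); optimizer.step()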

Sampling from EBMs:
- Langevin Dynamics (SGLD): Initialize samples from noise, then iteratively update them by following the gradient of the log-density plus Gaussian noise: x_{t+1} = x_t + (step/2) * grad_x log p(x_t) + sqrt(step) * z_t, where z_t ~ N(0, I); a minimal sampler in this form appears after this list
- Hamiltonian Monte Carlo (HMC): Augment the state with momentum variables and simulate Hamiltonian dynamics to produce distant, low-autocorrelation samples
- Replay Buffer: Maintain a buffer of previously generated samples and use them to initialize SGLD chains, dramatically reducing the mixing time needed for high-quality samples
- Short-Run MCMC: Use very few MCMC steps (10–100) for each sample, accepting that samples are not fully converged but sufficient for training signal
- Amortized Sampling: Train a separate generator network to produce approximate samples, which are then refined with a few MCMC steps — combining the speed of amortized inference with EBM flexibility
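
Below is a minimal unadjusted Langevin sampler corresponding to the update rule above, again assuming the toy EnergyNet from the earlier sketch; the step size, chain length, and noise initialization are illustrative hyperparameters rather than recommended settings.

    import torch

    def langevin_sample(energy, n_samples=64, dim=2, n_steps=100, step=0.01):
        x = torch.randn(n_samples, dim)  # initialize chains from Gaussian noise
        for _ in range(n_steps):
            x = x.detach().requires_grad_(True)
            grad_E = torch.autograd.grad(energy(x).sum(), x)[0]
            # x_{t+1} = x_t + (step/2) * grad_x log p(x_t) + sqrt(step) * z_t,
            # with grad_x log p(x) = -grad_x E(x)
            x = x - 0.5 * step * grad_E + (step ** 0.5) * torch.randn_like(x)
        return x.detach()

    samples = langevin_sample(energy)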

Connections to Other Generative Models:
- Diffusion Models: Score-based diffusion models can be viewed as EBMs trained at multiple noise levels, with Langevin dynamics providing the sampling mechanism — DSM is their primary training objective
- GANs: The discriminator in a GAN can be interpreted as an energy function, and some EBM training methods resemble adversarial training
- Normalizing Flows: Flows provide tractable density evaluation but with architectural constraints; EBMs trade tractable density for maximal architectural flexibility
- Variational Autoencoders: VAEs optimize a lower bound on log-likelihood with amortized inference; EBMs can use MCMC for more accurate but slower posterior estimation

Applications:
- Compositional Generation: Energy functions naturally compose through addition (product of experts), enabling modular generation where multiple EBMs controlling different attributes combine during sampling (see the sketch after this list)
- Out-of-Distribution Detection: Use energy values as confidence scores — in-distribution data receives low energy, out-of-distribution inputs receive high energy
- Classifier-Free Guidance: The guidance mechanism in modern diffusion models is interpretable as composing conditional and unconditional energy functions
- Protein Structure Prediction: Model the energy landscape of protein conformations, with low-energy states corresponding to stable folded structures
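
As an illustration of compositional generation, summing two energy functions multiplies their unnormalized densities (a product of experts). The sketch below assumes two independently trained EnergyNet instances, named energy_a and energy_b purely for illustration, and reuses the langevin_sample routine defined earlier.

    def composed_energy(x, energy_a, energy_b, w_a=1.0, w_b=1.0):
        # E(x) = w_a * E_a(x) + w_b * E_b(x) corresponds to
        # p(x) proportional to p_a(x)^w_a * p_b(x)^w_b
        return w_a * energy_a(x) + w_b * energy_b(x)

    # Hypothetical usage: draw samples that score well under both attributes
    # samples = langevin_sample(lambda x: composed_energy(x, energy_a, energy_b))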

Energy-based models provide one of the most general and flexible frameworks for probabilistic generative modeling. The freedom to define arbitrary energy landscapes comes at the cost of intractable normalization, motivating a rich ecosystem of approximate training and sampling methods that have profoundly influenced the development of modern diffusion models and score-based generative approaches.
