Energy-Based Models (EBMs) are a probabilistic framework that assigns an energy value to each configuration, with probability decreasing as energy increases; they can be trained via contrastive divergence or score matching, enabling joint learning of generative and discriminative patterns.
Energy-Based Modeling Framework:
- Energy function: E(x) assigns a scalar energy to each configuration x; lower energy → higher probability
- Unnormalized probability: p(x) ∝ exp(-E(x)); the partition function Z = ∫exp(-E(x))dx is often intractable in high dimensions (a minimal numerical sketch follows this list)
- Boltzmann distribution: the statistical-mechanics connection; EBMs sample from the Gibbs/Boltzmann distribution p(x) = exp(-E(x))/Z
- Inference: finding the minimum-energy configuration is MAP inference; closely related to constraint satisfaction
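As a concrete illustration, here is a minimal numerical sketch of these definitions in NumPy; the double-well energy is a hypothetical example, not something from the source.

```python
import numpy as np

def energy(x):
    # Hypothetical 1D double-well energy: two low-energy modes at x = ±1.
    return (x**2 - 1.0)**2

# Unnormalized probability p(x) ∝ exp(-E(x)).
xs = np.linspace(-3.0, 3.0, 1001)
unnorm = np.exp(-energy(xs))

# In 1D the partition function Z = ∫ exp(-E(x)) dx can be estimated by a
# Riemann sum; in high dimensions this integral is intractable.
Z = np.sum(unnorm) * (xs[1] - xs[0])
p = unnorm / Z

# MAP inference: the minimum-energy configurations are the density's modes.
print("approximate modes:", xs[np.argsort(energy(xs))[:2]])
```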
Training via Contrastive Divergence:
- Contrastive divergence (CD): approximate maximum likelihood training without computing the partition function
- Positive phase: samples come from the data; learning lowers their energy, increasing the probability of data
- Negative phase: samples come from the model; learning raises their energy, decreasing the probability of model samples
- K-step CD (CD-k): run k MCMC steps starting from a data point; the resulting negative samples are biased toward the data, but the approximation is practical
- Practical approximation: CD-1 (a single Gibbs/MCMC step) is often sufficient and avoids the intractable exact MLE gradient (see the sketch after this list)
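A rough sketch of a CD-k loss in PyTorch, assuming a hypothetical `energy_net` module that maps a batch of inputs to scalar energies; this is an illustrative approximation, not a reference implementation.

```python
import math
import torch

def cd_loss(energy_net, x_data, k=1, step=0.01):
    """Contrastive divergence (CD-k): positive phase on data,
    negative phase on samples from a k-step chain started at the data."""
    # Negative phase: short Langevin-style MCMC initialized at the data.
    x_neg = x_data.detach().clone().requires_grad_(True)
    for _ in range(k):
        grad, = torch.autograd.grad(energy_net(x_neg).sum(), x_neg)
        x_neg = (x_neg - step * grad
                 + math.sqrt(2 * step) * torch.randn_like(x_neg))
        x_neg = x_neg.detach().requires_grad_(True)
    # Positive phase lowers the energy of data; negative phase raises the
    # energy of model samples; the difference approximates the MLE gradient.
    return energy_net(x_data).mean() - energy_net(x_neg.detach()).mean()
```

Minimizing this loss with a standard optimizer pushes data energies down and model-sample energies up; CD-1 corresponds to k=1.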
MCMC Sampling via Langevin Dynamics:
- Langevin dynamics: gradient-based MCMC sampling from an energy function; iterative update x_{t+1} = x_t - η∇E(x_t) + √(2η)·ε_t with ε_t ~ N(0, I)
- Gradient direction: move opposite to the energy gradient (downhill in the energy landscape); the injected noise keeps the Markov chain ergodic
- Convergence: with small steps and enough iterations, Langevin dynamics samples from p(x) ∝ exp(-E(x)); only the energy gradient is needed, never the partition function
- Mixing time: the number of steps to converge depends on the energy landscape; sharp, well-separated minima require more steps (a minimal sampler sketch follows this list)
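A minimal Langevin sampler for a known energy (NumPy; the double-well gradient continues the hypothetical example from above):

```python
import numpy as np

def grad_energy(x):
    # Gradient of the hypothetical double-well E(x) = (x^2 - 1)^2.
    return 4.0 * x * (x**2 - 1.0)

def langevin_sample(n_steps=10_000, eta=1e-3, x0=0.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(n_steps):
        # x_{t+1} = x_t - η ∇E(x_t) + sqrt(2η) ε_t,  ε_t ~ N(0, 1)
        x = x - eta * grad_energy(x) + np.sqrt(2.0 * eta) * rng.standard_normal()
    return x

# Each long-enough chain yields an approximate draw from p(x) ∝ exp(-E(x));
# sharper, more separated minima would require longer chains to mix.
samples = np.array([langevin_sample(seed=s) for s in range(100)])
```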
Score Matching:
- Score function: the score ∇_x log p(x) = -∇_x E(x) is independent of Z, so matching scores matches the density without computing the partition function
- Denoising score matching: add Gaussian noise to the data and train the model to match the score of the noised distribution (equivalent to a denoising objective); avoids singularities when data lies on a low-dimensional manifold (see the loss sketch after this list)
- Sliced score matching: project scores onto random directions, turning the intractable Jacobian-trace term into cheap directional derivatives
- Score-based generative models: train a score network and sample via a reverse SDE (score-based diffusion models); closely related to EBMs
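A sketch of the denoising score matching loss at a single noise level σ (PyTorch; `score_net` is an assumed network returning a score estimate of the same shape as its input):

```python
import torch

def dsm_loss(score_net, x, sigma=0.1):
    """Denoising score matching: for x̃ = x + σε the conditional score is
    ∇_x̃ log q(x̃ | x) = (x - x̃)/σ² = -ε/σ; train the network to match it."""
    eps = torch.randn_like(x)
    x_noisy = x + sigma * eps
    target = -eps / sigma
    pred = score_net(x_noisy)
    return ((pred - target) ** 2).flatten(1).sum(dim=1).mean()
```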
Joint EBM Architecture:
- Discriminative + generative: a single energy function is used for both classification and generation
- Discriminative application: a joint energy E(x, y) gives p(y|x) ∝ exp(-E(x, y)); this enables joint learning of class boundaries and data generation (see the sketch after this list)
- Hybrid learning: a supervised loss plus a generative contrastive loss; can improve both classification and generation
- Parameter sharing: a single network learns both tasks; more parameter-efficient than separate models
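One way to realize this parameter sharing, in the style of joint energy-based models (JEM), is to reuse classifier logits as negative energies. A hedged sketch, assuming a hypothetical classifier `f` that returns logits:

```python
import torch
import torch.nn.functional as F

def joint_and_marginal_energy(f, x):
    """Reuse classifier logits as negative energies:
    E(x, y) = -f(x)[y]  and  E(x) = -logsumexp_y f(x)[y],
    so p(y|x) = softmax(f(x)) falls out for free."""
    logits = f(x)                              # shape: (batch, num_classes)
    e_xy = -logits                             # joint energies per class
    e_x = -torch.logsumexp(logits, dim=-1)     # marginal (generative) energy
    return e_xy, e_x

def hybrid_loss(f, x, y, generative_loss):
    # Supervised term: ordinary cross-entropy on the logits.
    # Generative term: e.g., a CD-style loss on the marginal energy E(x),
    # supplied by the caller.
    return F.cross_entropy(f(x), y) + generative_loss
```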
EBM Applications:
- Anomaly detection: high-energy examples are anomalous; a learned energy function flags out-of-distribution examples (see the thresholding sketch after this list)
- Image generation: sample via MCMC from learned energy function; slower than GANs but theoretically principled
- Structured prediction: energy incorporates constraints; inference finds satisfying assignments; useful for combinatorial problems
- Collaborative filtering: energy models user-item interactions; joint learning with side information
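A minimal anomaly-detection sketch: calibrate an energy threshold on in-distribution validation data, then flag the high-energy tail (the quantile-based threshold is an assumption, not a prescribed method):

```python
import numpy as np

def flag_anomalies(energy_fn, x_val, x_test, q=0.99):
    """Flag test points whose energy exceeds the q-quantile of
    in-distribution validation energies."""
    threshold = np.quantile(energy_fn(x_val), q)
    return energy_fn(x_test) > threshold  # True = likely out-of-distribution
```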
Connection to Denoising Diffusion Models:
- Score matching foundation: modern diffusion models train a score network via score matching; equivalent to a denoising (noise-prediction) objective
- Reverse process: sampling follows the learned score (the negative energy gradient); Langevin-style evolution generates samples (see the annealed-sampling sketch after this list)
- Generative modeling: diffusion models are a successful, practical, and scalable application of the score-based approach
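A sketch of annealed Langevin sampling with a learned score, in the spirit of score-based generative models; `score_net(x, sigma)` is an assumed noise-conditional score network, and the step-size schedule is one common heuristic:

```python
import math
import torch

@torch.no_grad()
def annealed_langevin(score_net, shape, sigmas, steps_per_level=100, eps=2e-5):
    """Langevin dynamics at decreasing noise levels; the learned score
    plays the role of the negative energy gradient -∇E(x)."""
    x = torch.randn(shape)
    for sigma in sigmas:                        # large noise → small noise
        step = eps * (sigma / sigmas[-1]) ** 2  # smaller steps at lower noise
        for _ in range(steps_per_level):
            x = (x + step * score_net(x, sigma)
                 + math.sqrt(2 * step) * torch.randn_like(x))
    return x
```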
EBM Challenges:
- Sampling inefficiency: MCMC sampling is slow compared to direct generation (e.g., a single GAN forward pass); limits practical application
- Evaluation difficulty: the partition function is intractable, so likelihood evaluation is challenging and there is no natural likelihood objective
- Scalability: contrastive divergence requires two phases (data and model samples), adding computational overhead
- Mode coverage: if the negative-phase chains mix poorly, the model can miss modes (an effect similar to mode collapse)
Energy-based models provide a principled probabilistic framework that assigns energy to configurations and can be trained without computing the intractable partition function, using contrastive divergence or score matching for both generation and discrimination.