Energy-Based Models (EBMs) are a class of generative models that define a scalar energy function E(x) over inputs, where low energy corresponds to high probability. They provide a flexible, principled framework for modeling complex distributions without computing normalized probabilities, with applications spanning generation, anomaly detection, and compositional reasoning, and deep connections to both diffusion models and contrastive learning.
Core Concept
```
Probability:  p(x) = exp(-E(x)) / Z
where         Z = ∫ exp(-E(x)) dx   (partition function / normalizing constant)

Low energy  E(x) → high probability p(x)
High energy E(x) → low probability  p(x)

The energy landscape defines the data distribution:
  Training data → valleys (low energy)
  Non-data      → hills   (high energy)
```
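To make the energy-to-probability mapping concrete, here is a tiny sketch over a discrete toy space where the partition function is just a finite sum (the energy values are made up for illustration):

```python
import torch

# Toy discrete space: three candidate points with hand-picked energies (illustrative values only)
energies = torch.tensor([0.5, 2.0, 4.0])   # E(x) for each candidate
probs = torch.softmax(-energies, dim=0)    # exp(-E(x)) / Z, where Z is a finite sum here
print(probs)                               # the lowest-energy point gets the highest probability
```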
Why EBMs Are Attractive
| Property | EBM | GAN | VAE | Autoregressive |
|----------|-----|-----|-----|----------------|
| Handles unnormalized density | Yes | N/A (implicit, no explicit density) | No | No |
| Flexible architecture | Any f(x) → scalar (see sketch below) | Generator + discriminator | Encoder + decoder | Sequential |
| Compositional | Yes (add energies) | Difficult | Difficult | Difficult |
| Mode coverage | Full (in principle; MCMC may mix slowly) | Mode-collapse risk | Good | Full |
| Sampling | Slow (MCMC) | Fast (one forward pass) | Fast (one forward pass) | Sequential |
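The "any f(x) → scalar" row is essentially the whole architectural constraint. A minimal PyTorch sketch (layer sizes and names are assumptions, not taken from any particular paper):

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Any network that maps an input to a single scalar can serve as E(x)."""
    def __init__(self, dim=784, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),              # scalar energy per input
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)         # shape: (batch,)
```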
Training EBMs
| Method | How | Trade-offs |
|--------|-----|------------|
| Contrastive divergence (CD) | MCMC samples for the negative phase (see sketch below) | Biased gradient, but practical |
| Score matching | Match ∇ₓ log p(x) directly | Avoids the partition function |
| Noise contrastive estimation (NCE) | Discriminate data from noise samples | Scalable |
| Denoising score matching | Predict the noise added to the data | Equivalent to diffusion-model training |
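As a rough sketch of the contrastive-divergence row, using a short-run Langevin chain for the negative phase (this assumes the langevin_sample helper defined later in this note; hyperparameters are illustrative, not prescriptive):

```python
import torch

def cd_loss(energy_net, real_x, n_mcmc_steps=20, step_size=0.01):
    """CD-style loss: push energy down on data, up on model samples."""
    # Negative phase: short-run Langevin chain started from noise
    fake_x = torch.randn_like(real_x)
    fake_x = langevin_sample(energy_net, fake_x, n_steps=n_mcmc_steps, step_size=step_size)

    e_real = energy_net(real_x).mean()
    e_fake = energy_net(fake_x).mean()
    return e_real - e_fake   # minimizing this lowers data energy and raises sample energy
```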
Connection to Diffusion Models
```
Diffusion model training:
  L = ||ε_θ(x_t, t) - ε||²                          (predict the added noise)

This is equivalent (up to a per-timestep scale factor) to:
  L = ||s_θ(x_t, t) - ∇ₓ log p_t(x_t | x_0)||²      (denoising score matching)

where s_θ(x) ≈ ∇ₓ log p(x) = -∇ₓ E(x)               (score = negative energy gradient)

→ Diffusion models ARE energy-based models trained with denoising score matching!
```
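A minimal sketch of the denoising-score-matching objective at a single fixed noise level σ (a real diffusion model sweeps over a noise schedule and conditions the network on the timestep; score_net here is a hypothetical network that returns a tensor shaped like its input):

```python
import torch

def dsm_loss(score_net, x, sigma=0.1):
    """Denoising score matching at one noise level."""
    noise = torch.randn_like(x) * sigma
    x_noisy = x + noise
    # Score of the Gaussian perturbation kernel: ∇ log N(x_noisy; x, σ²I) = -(x_noisy - x) / σ²
    target_score = -(x_noisy - x) / sigma**2
    pred_score = score_net(x_noisy)
    return ((pred_score - target_score) ** 2).mean()
```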
Compositional Generation
```
Key advantage of EBMs: compose concepts by adding energies
(adding energies multiplies the unnormalized probabilities: a product of experts)

E_dog(x):  low for images of dogs
E_red(x):  low for red images

E_composed(x) = E_dog(x) + E_red(x)
  → low energy = high probability for RED DOGS
  → zero-shot composition without training on "red dog" examples!

Sampling: run MCMC/Langevin dynamics on E_composed → generate red dogs
```
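A toy sketch of the same idea in code, with quadratic stand-ins for the two concept energies (purely illustrative; real concept energies would be trained networks). The composed energy can be handed directly to the Langevin sampler in the next section:

```python
import torch

# Stand-ins for separately trained concept energies: each is low near a different "concept"
def e_dog(x):
    return ((x - 1.0) ** 2).sum(dim=-1)

def e_red(x):
    return ((x + 1.0) ** 2).sum(dim=-1)

def e_composed(x):
    return e_dog(x) + e_red(x)    # adding energies = multiplying unnormalized probabilities

# Sampling from e_composed (e.g., with langevin_sample below) concentrates near x = 0,
# the region that satisfies both constraints at once.
```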
Langevin Dynamics Sampling
```python
import math
import torch

def langevin_sample(energy_fn, x_init, n_steps=100, step_size=0.01):
    """Approximate sampling: gradient descent on the energy plus Gaussian noise."""
    x = x_init.clone().requires_grad_(True)
    for _ in range(n_steps):
        energy = energy_fn(x).sum()                       # reduce to a scalar for autograd
        grad = torch.autograd.grad(energy, x)[0]
        noise = torch.randn_like(x) * math.sqrt(2 * step_size)
        # Move toward low energy, plus noise so we sample rather than just minimize
        x = (x - step_size * grad + noise).detach().requires_grad_(True)
    return x.detach()
```
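A quick sanity-check usage of the sampler above: with the quadratic energy E(x) = ½‖x‖², exp(-E) is a standard normal, so the samples should end up with roughly zero mean and unit standard deviation (a toy check; shapes and step counts are arbitrary):

```python
import torch

energy = lambda x: 0.5 * (x ** 2).sum()
x0 = torch.randn(1000, 2) * 5.0                        # deliberately bad initialization
samples = langevin_sample(energy, x0, n_steps=500, step_size=0.01)
print(samples.mean(dim=0), samples.std(dim=0))         # ≈ 0 mean, ≈ 1 std if mixing worked
```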
Applications
| Application | How EBM Is Used |
|------------|----------------|
| Image generation | Energy landscape over images → sample via Langevin/MCMC |
| Anomaly detection | High energy = anomalous, low energy = normal |
| Protein design | Energy over protein conformations → sample stable structures |
| Reinforcement learning | Energy over state-action pairs → optimal policy |
| Compositional generation | Sum energies for novel concept combinations |
| Molecular design | Energy = binding affinity → optimize drug candidates |
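For the anomaly-detection row above, the usual recipe is to threshold energies against a high quantile of the training energies. A minimal sketch (energy_net is assumed to be already trained; the quantile choice is an assumption, not prescribed here):

```python
import torch

@torch.no_grad()
def anomaly_flags(energy_net, train_x, test_x, quantile=0.99):
    """Flag test points whose energy exceeds a high quantile of the training energies."""
    threshold = torch.quantile(energy_net(train_x), quantile)
    return energy_net(test_x) > threshold     # True = anomalous (unusually high energy)
```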
Modern EBM Research
- Classifier-free guidance in diffusion = implicit energy composition.
- Score-based generative models (Song & Ermon) = continuous-time EBMs.
- Energy-based concept composition: combine text prompts as energy terms.
- Equilibrium models: Learn energy minimization as a forward pass.
Energy-based models are a theoretical foundation that unifies many approaches in generative AI, from the contrastive loss in CLIP to the denoising objective in diffusion models. The energy perspective provides a principled framework for understanding and combining generative models, with the unique advantage of compositional generation: zero-shot combination of learned concepts in ways that other generative frameworks cannot naturally achieve.