Diffusion models are a class of generative models that learn to reverse a gradual noising process — training a neural network to iteratively denoise Gaussian noise back into realistic data samples — and they achieve state-of-the-art image generation quality, surpassing GANs in fidelity, diversity, and training stability.
Forward Diffusion Process:
- Noise Schedule: progressively add Gaussian noise to data over T timesteps (typically T=1000) — x_t = √(ᾱ_t)x_0 + √(1-ᾱ_t)ε where ᾱ_t = ∏_{s≤t}(1-β_s) decreases from ~1 toward 0; by t=T, x_T ≈ N(0,I), pure noise
- Variance Schedule: β_t controls noise added at each step — linear schedule (β₁=10⁻⁴ to β_T=0.02), cosine schedule (smoother transition, better for high-resolution), or learned schedule
- Markov Chain: each step depends only on the previous step — q(x_t|x_{t-1}) = N(x_t; √(1-β_t)x_{t-1}, β_tI); forward process has no learnable parameters
- Closed-Form Sampling: x_t can be computed directly from x_0 at any t without sequential simulation — the key efficiency trick for training: sample a random t, compute x_t, predict the noise (sketched below)
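A minimal sketch of the closed-form forward sample in PyTorch, assuming a linear β schedule with T=1000; the tensor shapes and the random `x0` placeholder are illustrative, not a reference implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear variance schedule β_t
alphas = 1.0 - betas                            # α_t = 1 - β_t
alpha_bars = torch.cumprod(alphas, dim=0)       # ᾱ_t = ∏_{s≤t} α_s, decays from ~1 toward 0

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Closed-form forward sample: x_t = √(ᾱ_t) x_0 + √(1 - ᾱ_t) ε."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)        # broadcast per-sample ᾱ_t over (B, C, H, W)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

# Illustrative usage: a batch of 8 "images", random timesteps, Gaussian noise.
x0 = torch.randn(8, 3, 32, 32)                  # stand-in for real data
t = torch.randint(0, T, (8,))                   # one random timestep per sample
noise = torch.randn_like(x0)
xt = q_sample(x0, t, noise)
```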
Reverse Denoising Process:
- Noise Prediction Network: U-Net (or Transformer) ε_θ(x_t, t) trained to predict the noise ε added to x_0 to produce x_t — loss = ||ε - ε_θ(x_t, t)||² averaged over random t and random noise ε (see the training sketch after this list)
- Score Matching Equivalence: predicting noise is equivalent to estimating the score ∇_x log p(x_t) — score function points toward higher data density; denoising follows the gradient of log-probability
- Sampling: starting from x_T ~ N(0,I), iteratively denoise: x_{t-1} = (1/√α_t)(x_t - (β_t/√(1-ᾱ_t))ε_θ(x_t,t)) + σ_t z — each step removes the predicted noise and adds a small amount of fresh noise for stochasticity (see the sampling sketch after this list)
- Accelerated Sampling: DDIM (deterministic implicit sampling) reduces 1000 steps to 50-100 — DPM-Solver cuts this to roughly 10-20 steps, and consistency models / distillation reach 1-4 steps while maintaining quality
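The training loss and the ancestral sampling update above translate almost line-for-line into code. A minimal sketch, reusing `T`, `betas`, `alphas`, `alpha_bars`, and `q_sample` from the forward-process snippet; `model` is a stand-in for any ε_θ(x_t, t) network (e.g., a U-Net) and its call signature is an assumption here.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Training objective: predict the noise ε used to corrupt x_0 into x_t."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # random timestep per sample
    noise = torch.randn_like(x0)                         # the ε the network must recover
    xt = q_sample(x0, t, noise)                          # closed-form forward sample
    return F.mse_loss(model(xt, t), noise)               # ||ε - ε_θ(x_t, t)||²

@torch.no_grad()
def p_sample_step(model, xt: torch.Tensor, t: int) -> torch.Tensor:
    """One reverse (ancestral) step: x_{t-1} from x_t using the predicted noise."""
    beta_t, alpha_t, ab_t = betas[t], alphas[t], alpha_bars[t]
    eps = model(xt, torch.full((xt.shape[0],), t, dtype=torch.long))
    mean = (xt - (beta_t / (1.0 - ab_t).sqrt()) * eps) / alpha_t.sqrt()
    if t == 0:
        return mean                                      # no noise added at the final step
    sigma_t = beta_t.sqrt()                              # common choice: σ_t = √β_t
    return mean + sigma_t * torch.randn_like(xt)

# Full generation runs p_sample_step from t = T-1 down to t = 0,
# starting from x_T ~ N(0, I).
```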
Guidance and Conditioning:
- Classifier Guidance: use a pre-trained classifier's gradient to steer generation toward a target class — ε̃ = ε_θ(x_t,t) - s√(1-ᾱ_t)∇_x log p(y|x_t); guidance scale s controls class adherence vs. diversity
- Classifier-Free Guidance (CFG): train unconditional and conditional models together (randomly dropping the conditioning) — guided prediction = (1+w)ε_θ(x_t,t,c) - wε_θ(x_t,t) where w controls guidance strength; eliminates the need for a separate classifier (see the CFG sketch after this list)
- Text-to-Image (Stable Diffusion): diffusion in the learned latent space of a VAE — a CLIP text encoder provides conditioning via cross-attention; the 8×-downsampled latent space (e.g., a 512×512×3 image becomes a 64×64×4 latent) enables high-resolution (512-1024px) generation at reasonable compute cost
- ControlNet: adds spatial conditioning (edges, depth, pose) to pre-trained diffusion models — trainable copy of encoder with zero-convolution connections; preserves original model quality while adding precise spatial control
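A sketch of the classifier-free guidance combination, assuming `model` accepts an optional conditioning tensor `c` (with `None` meaning the unconditional branch); the signature is illustrative, not a specific library's API.

```python
import torch

def cfg_noise(model, xt: torch.Tensor, t: torch.Tensor, c: torch.Tensor, w: float) -> torch.Tensor:
    """Classifier-free guidance: (1 + w) * ε_θ(x_t, t, c) - w * ε_θ(x_t, t)."""
    eps_cond = model(xt, t, c)        # conditional prediction (e.g., on a text embedding)
    eps_uncond = model(xt, t, None)   # unconditional prediction (conditioning dropped)
    return (1.0 + w) * eps_cond - w * eps_uncond
```

Note that (1+w)ε_cond - wε_uncond equals ε_uncond + (1+w)(ε_cond - ε_uncond), so many implementations express the same update as unconditional-plus-scaled-difference; w = 0 recovers the purely conditional model, and larger w trades diversity for prompt adherence.

The zero-convolution trick used by ControlNet can likewise be sketched as a 1×1 convolution initialized to all zeros, so the control branch contributes nothing before fine-tuning and the pre-trained model's behavior is preserved at initialization; channel counts here are illustrative.

```python
import torch.nn as nn

def zero_conv(in_ch: int, out_ch: int) -> nn.Conv2d:
    """1x1 convolution initialized to zeros: its output starts as exactly zero,
    so adding it to the frozen model's features is a no-op at the start of training."""
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```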
Diffusion models represent the current frontier of generative AI — powering Stable Diffusion, DALL-E, Midjourney, and Sora with unprecedented image and video generation quality, fundamentally changing creative workflows and establishing new benchmarks in generative modeling that GANs and VAEs could not achieve.