Diffusion models are a class of generative models that learn to reverse a gradual noising process — training a neural network to iteratively denoise Gaussian noise back into realistic data samples — and they achieve state-of-the-art image generation quality, surpassing GANs in fidelity, diversity, and training stability.
Forward Diffusion Process:
- Noise Schedule: progressively add Gaussian noise to data over T timesteps (typically T=1000) — x_t = √(ᾱ_t)x_0 + √(1-ᾱ_t)ε where ᾱ_t = ∏_{s≤t}(1-β_s) decreases from ~1 toward 0; by t=T, x_T ≈ N(0,I), pure noise
- Variance Schedule: β_t controls noise added at each step — linear schedule (β₁=10⁻⁴ to β_T=0.02), cosine schedule (smoother transition, better for high-resolution), or learned schedule
- Markov Chain: each step depends only on the previous step — q(x_t|x_{t-1}) = N(x_t; √(1-β_t)x_{t-1}, β_tI); forward process has no learnable parameters
- Closed-Form Sampling: x_t can be computed directly from x_0 at any t without sequential simulation — the key efficiency trick for training: sample a random t, compute x_t, predict the noise (sketched below)
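A minimal sketch of the closed-form forward sample in PyTorch, assuming a linear β schedule with T=1000; the tensor shapes and the random `x0` placeholder are illustrative, not a reference implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear variance schedule β_t
alphas = 1.0 - betas                            # α_t = 1 - β_t
alpha_bars = torch.cumprod(alphas, dim=0)       # ᾱ_t = ∏_{s≤t} α_s, decays from ~1 toward 0

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Closed-form forward sample: x_t = √(ᾱ_t) x_0 + √(1 - ᾱ_t) ε."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)        # broadcast per-sample ᾱ_t over (B, C, H, W)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

# Illustrative usage: a batch of 8 "images", random timesteps, Gaussian noise.
x0 = torch.randn(8, 3, 32, 32)                  # stand-in for real data
t = torch.randint(0, T, (8,))                   # one random timestep per sample
noise = torch.randn_like(x0)
xt = q_sample(x0, t, noise)
```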
Reverse Denoising Process:
- Noise Prediction Network: U-Net (or Transformer) ε_θ(x_t, t) trained to predict the noise ε added to x_0 to produce x_t — loss = ||ε - ε_θ(x_t, t)||² averaged over random t and random noise ε (see the training sketch after this list)
- Score Matching Equivalence: predicting noise is equivalent to estimating the score ∇_x log p(x_t) — score function points toward higher data density; denoising follows the gradient of log-probability
- Sampling: starting from x_T ~ N(0,I), iteratively denoise: x_{t-1} = (1/√α_t)(x_t - (β_t/√(1-ᾱ_t))ε_θ(x_t,t)) + σ_t z — each step removes the predicted noise and adds a small amount of fresh noise for stochasticity (see the sampling sketch after this list)
- Accelerated Sampling: DDIM (deterministic implicit sampling) reduces 1000 steps to 50-100 — DPM-Solver cuts this to roughly 10-20 steps, and consistency models / distillation reach 1-4 steps while maintaining quality
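The training loss and the ancestral sampling update above translate almost line-for-line into code. A minimal sketch, reusing `T`, `betas`, `alphas`, `alpha_bars`, and `q_sample` from the forward-process snippet; `model` is a stand-in for any ε_θ(x_t, t) network (e.g., a U-Net) and its call signature is an assumption here.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """Training objective: predict the noise ε used to corrupt x_0 into x_t."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # random timestep per sample
    noise = torch.randn_like(x0)                         # the ε the network must recover
    xt = q_sample(x0, t, noise)                          # closed-form forward sample
    return F.mse_loss(model(xt, t), noise)               # ||ε - ε_θ(x_t, t)||²

@torch.no_grad()
def p_sample_step(model, xt: torch.Tensor, t: int) -> torch.Tensor:
    """One reverse (ancestral) step: x_{t-1} from x_t using the predicted noise."""
    beta_t, alpha_t, ab_t = betas[t], alphas[t], alpha_bars[t]
    eps = model(xt, torch.full((xt.shape[0],), t, dtype=torch.long))
    mean = (xt - (beta_t / (1.0 - ab_t).sqrt()) * eps) / alpha_t.sqrt()
    if t == 0:
        return mean                                      # no noise added at the final step
    sigma_t = beta_t.sqrt()                              # common choice: σ_t = √β_t
    return mean + sigma_t * torch.randn_like(xt)

# Full generation runs p_sample_step from t = T-1 down to t = 0,
# starting from x_T ~ N(0, I).
```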
Guidance and Conditioning:
- Classifier Guidance: use a pre-trained classifier's gradient to steer generation toward a target class — ε̃ = ε_θ(x_t,t) - s√(1-ᾱ_t)∇_x log p(y|x_t); guidance scale s controls class adherence vs. diversity
- Classifier-Free Guidance (CFG): train unconditional and conditional models together (randomly dropping the conditioning) — guided prediction = (1+w)ε_θ(x_t,t,c) - wε_θ(x_t,t) where w controls guidance strength; eliminates the need for a separate classifier (see the CFG sketch after this list)
- Text-to-Image (Stable Diffusion): diffusion in the learned latent space of a VAE — a CLIP text encoder provides conditioning via cross-attention; the 8×-downsampled latent space (e.g., a 512×512×3 image becomes a 64×64×4 latent) enables high-resolution (512-1024px) generation at reasonable compute cost
- ControlNet: adds spatial conditioning (edges, depth, pose) to pre-trained diffusion models — trainable copy of encoder with zero-convolution connections; preserves original model quality while adding precise spatial control
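A sketch of the classifier-free guidance combination, assuming `model` accepts an optional conditioning tensor `c` (with `None` meaning the unconditional branch); the signature is illustrative, not a specific library's API.

```python
import torch

def cfg_noise(model, xt: torch.Tensor, t: torch.Tensor, c: torch.Tensor, w: float) -> torch.Tensor:
    """Classifier-free guidance: (1 + w) * ε_θ(x_t, t, c) - w * ε_θ(x_t, t)."""
    eps_cond = model(xt, t, c)        # conditional prediction (e.g., on a text embedding)
    eps_uncond = model(xt, t, None)   # unconditional prediction (conditioning dropped)
    return (1.0 + w) * eps_cond - w * eps_uncond
```

Note that (1+w)ε_cond - wε_uncond equals ε_uncond + (1+w)(ε_cond - ε_uncond), so many implementations express the same update as unconditional-plus-scaled-difference; w = 0 recovers the purely conditional model, and larger w trades diversity for prompt adherence.

The zero-convolution trick used by ControlNet can likewise be sketched as a 1×1 convolution initialized to all zeros, so the control branch contributes nothing before fine-tuning and the pre-trained model's behavior is preserved at initialization; channel counts here are illustrative.

```python
import torch.nn as nn

def zero_conv(in_ch: int, out_ch: int) -> nn.Conv2d:
    """1x1 convolution initialized to zeros: its output starts as exactly zero,
    so adding it to the frozen model's features is a no-op at the start of training."""
    conv = nn.Conv2d(in_ch, out_ch, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```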
Diffusion models represent the current frontier of generative AI — powering Stable Diffusion, DALL-E, Midjourney, and Sora with unprecedented image and video generation quality, fundamentally changing creative workflows and establishing new benchmarks in generative modeling that GANs and VAEs could not achieve.