Diffusion models are a class of generative models that learn to create images, audio, and other data by reversing a gradual noise-addition process: generation starts from pure Gaussian noise and iteratively denoises it into a coherent output, optionally guided by text, images, or other conditioning signals. Introduced by Sohl-Dickstein et al. (2015) and made practical by Ho et al.'s DDPM (2020), diffusion models displaced GANs as the dominant image generation approach by 2022 and power Stable Diffusion, DALL-E 3, Midjourney, Google Imagen, and Adobe Firefly.
The Two Phases of Diffusion
Forward Process (Training): Adding Noise
During training, real images are progressively corrupted:
$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$
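At each timestep $t \in \{1, ..., T\}$ (typically $T=1000$), Gaussian noise with variance $\beta_t$ is added while the existing signal is scaled by $\sqrt{1-\beta_t}$. After $T$ steps, the image is practically indistinguishable from pure noise: $x_T$ is approximately distributed as $\mathcal{N}(0, I)$.
Because a chain of Gaussian steps is itself Gaussian, any timestep can be sampled directly from $x_0$. With $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$: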
$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
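The closed-form sample above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the linear schedule from `1e-4` to `0.02` is the DDPM paper's default and is assumed here, and the names `make_schedule` and `q_sample` are hypothetical.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed; other schedules such as cosine exist)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bar_t = product of alpha_1..alpha_t
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0. t is a 1-based timestep in 1..T."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # toy "image" scaled to [-1, 1]
x_mid, _ = q_sample(x0, 500, alpha_bars, rng)
x_end, _ = q_sample(x0, 1000, alpha_bars, rng)
signal_kept = np.sqrt(alpha_bars[-1])  # weight on x_0 remaining at t = T
```

Under this schedule `signal_kept` is well below 0.01, which is why $x_T$ can be treated as pure noise.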
Reverse Process (Inference): Removing Noise
A neural network $\epsilon_\theta(x_t, t)$ is trained to predict the noise $\epsilon$ that was mixed into $x_t$:
$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
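A single Monte Carlo sample of this loss might look like the sketch below. It assumes a linear beta schedule, and `zero_model` is a hypothetical stand-in for a trained network, included only so the example runs; a real $\epsilon_\theta$ would be a U-Net or transformer.

```python
import numpy as np

def simple_loss(eps_model, x0, alpha_bars, rng):
    """One-sample estimate of L_simple for a single image x0."""
    T = len(alpha_bars)
    t = int(rng.integers(1, T + 1))      # uniform timestep in 1..T
    eps = rng.standard_normal(x0.shape)  # the noise the model must recover
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    # Mean squared error: the squared norm divided by the number of elements.
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

def zero_model(x_t, t):
    """Hypothetical placeholder network that always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
x0 = rng.uniform(-1.0, 1.0, size=(64, 64))
loss = simple_loss(zero_model, x0, alpha_bars, rng)
```

Since $\epsilon$ has unit variance, a predictor that always outputs zero scores a loss near 1; training should push the loss below that baseline.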
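At inference, start with pure noise $x_T \sim \mathcal{N}(0, I)$ and denoise step by step, where $z \sim \mathcal{N}(0, I)$ is fresh noise (set to zero at the final step) and $\sigma_t$ is the sampling standard deviation, commonly $\sqrt{\beta_t}$: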
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sigma_t z$$
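One reverse step can be sketched as follows. The choice $\sigma_t = \sqrt{\beta_t}$ and the linear schedule are assumptions, and `p_sample_step` is a hypothetical name. The usage part feeds in the true noise as an "oracle" prediction to check that the update recovers $x_0$ at the last step.

```python
import numpy as np

def p_sample_step(x_t, t, eps_pred, betas, alpha_bars, rng):
    """One ancestral sampling step x_t -> x_{t-1}; t is 1-based."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    a_bar_t = alpha_bars[t - 1]
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(alpha_t)
    if t == 1:
        return mean  # no noise is added at the final step
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(beta_t) * z  # sigma_t = sqrt(beta_t), an assumed choice

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)
x0 = rng.uniform(-1.0, 1.0, size=(4, 4))
eps = rng.standard_normal(x0.shape)
x1 = np.sqrt(alpha_bars[0]) * x0 + np.sqrt(1.0 - alpha_bars[0]) * eps
recovered = p_sample_step(x1, 1, eps, betas, alpha_bars, rng)
```

With a perfect noise prediction, `recovered` matches `x0` up to floating-point error. A trained network only approximates $\epsilon$, which is why many small steps are used.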
Latent Diffusion Models (Stable Diffusion)
Pixel-space diffusion is computationally expensive: a 512×512 RGB image has 786,432 values (512 × 512 × 3). Latent diffusion (Rombach et al., 2022) instead operates in a compressed latent space:
1. VAE Encoder: Compress image $x \in \mathbb{R}^{H \times W \times 3}$ to latent $z \in \mathbb{R}^{h \times w \times C}$ (typically 8x spatial compression, $C=4$)
2. Diffusion in latent space: Train U-Net to denoise latents (64×64×4 instead of 512×512×3, which is 48x fewer values)
3. VAE Decoder: Decode final clean latent back to pixel space
This is why Stable Diffusion runs on consumer GPUs: the actual diffusion happens on a 64×64 latent, not a 512×512 pixel grid.
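The 48x figure follows directly from the shapes. A small sketch of the arithmetic, with `latent_shape` and `num_values` as hypothetical helper names:

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent tensor shape for an image, given the VAE's spatial downsampling factor."""
    return height // downsample, width // downsample, latent_channels

def num_values(shape):
    """Total number of scalar values in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n

pixel_shape = (512, 512, 3)
z_shape = latent_shape(512, 512)
ratio = num_values(pixel_shape) / num_values(z_shape)
print(z_shape, num_values(pixel_shape), num_values(z_shape), ratio)
# (64, 64, 4) 786432 16384 48.0
```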
Major Diffusion Model Families
| Model | Organization | Architecture | Release | Key Innovation |
|---|---|---|---|---|
| DDPM | UC Berkeley | U-Net, pixel space | 2020 | Proved diffusion practical |
| DALL-E 2 | OpenAI | CLIP + cascaded diffusion | 2022 | Text-image alignment |
| Imagen | Google | Cascaded pixel diffusion | 2022 | T5 text encoder, photorealism |
| Stable Diffusion 1.x | Stability AI | LDM with CLIP | 2022 | Open source, runs on 8GB GPU |
| Stable Diffusion XL | Stability AI | Larger LDM, CLIP+OpenCLIP | 2023 | 1024×1024 native resolution |
| DALL-E 3 | OpenAI | Diffusion (architecture not fully disclosed) | 2023 | Improved prompt following via recaptioned training data |
| Flux.1 | Black Forest Labs | Rectified flow transformer | 2024 | High-quality open-weight variants (dev, schnell) |
| Stable Diffusion 3.5 | Stability AI | MMDiT architecture | 2024 | Multimodal diffusion transformer |
Text Conditioning: How It Guides Generation
Text-to-image generation uses Classifier-Free Guidance (CFG):
1. Encode text prompt with CLIP or T5 text encoder → text embedding
2. At each denoising step, run U-Net twice: once with text conditioning, once unconditionally
3. Combine: $\tilde{\epsilon} = \epsilon_{\text{uncond}} + w \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$
4. $w$ = guidance scale (default 7.5): higher = more prompt-adherent, less diverse
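The combination in step 3 is a single line of arithmetic. In this sketch `cfg_combine` is a hypothetical name and the two inputs are toy arrays standing in for the U-Net's two noise predictions:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w=7.5):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward (and, for w > 1, past) the conditional one."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.0])
eps_cond = np.array([1.0, 2.0])

no_guidance = cfg_combine(eps_uncond, eps_cond, w=0.0)  # ignores the prompt
plain_cond = cfg_combine(eps_uncond, eps_cond, w=1.0)   # plain conditional prediction
guided = cfg_combine(eps_uncond, eps_cond, w=7.5)       # array([ 7.5, 15. ])
```

Setting $w=1$ recovers the ordinary conditional model. Values above 1 amplify the difference the prompt makes, which is where the trade-off between prompt adherence and diversity comes from.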
Sampling Schedulers
The original DDPM sampler uses all $T=1000$ denoising steps, which is slow. Modern schedulers reduce this dramatically; the speedups below are approximate step-count ratios relative to 1000-step DDPM:
| Scheduler | Steps Needed | Quality | Speed |
|---|---|---|---|
| DDPM | 1000 | Excellent | Slow |
| DDIM | 20-50 | Very good | 20-50x faster |
| DPM-Solver++ | 15-20 | Excellent | 50-70x faster |
| LCM (Consistency) | 4-8 | Good | 125-250x faster |
| Turbo/Flash (distilled) | 1-4 | Good | 250-1000x faster |
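To illustrate how a scheduler can skip steps, here is a sketch of the deterministic DDIM update (the $\eta = 0$ case) on an evenly spaced subset of timesteps. The function names and the linear schedule are assumptions, and the true noise is used as an oracle prediction so the result can be checked.

```python
import numpy as np

def ddim_timesteps(T=1000, num_steps=50):
    """Evenly spaced 1-based timesteps, in descending order (T down to 1)."""
    ts = np.linspace(1, T, num_steps).round().astype(int)
    return ts[::-1]

def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """Deterministic DDIM update between two noise levels."""
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(a_bar_t)
    # Re-noise that estimate to the earlier (less noisy) level.
    return np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * eps_pred

rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
ts = ddim_timesteps(1000, 50)

x0 = rng.uniform(-1.0, 1.0, size=(4, 4))
eps = rng.standard_normal(x0.shape)
a_t, a_prev = alpha_bars[ts[0] - 1], alpha_bars[ts[1] - 1]
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps
x_prev = ddim_step(x_t, eps, a_t, a_prev)      # jumps from ts[0] to ts[1] in one update
x_final = ddim_step(x_t, eps, a_t, 1.0)        # a_bar_prev = 1 means "fully clean"
```

Because the update is written in terms of $\bar{\alpha}$ rather than single-step quantities, it can jump across many original timesteps at once, which is what makes 20-50 step sampling possible.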
Customization Ecosystem
- LoRA (Low-Rank Adaptation): Fine-tune a specific style, character, or concept with 100-1000 images. LoRA weights are ~50-200MB, compared to 4GB for the full model.
- DreamBooth: Personalize the model to generate a specific subject (person, product, pet) from 3-30 images.
- ControlNet: Add spatial conditioning — users can guide generation with edge maps, depth maps, pose skeletons, or reference images. Enables precise layout control.
- IP-Adapter: Image-prompt adapter that conditions on a reference image style.
- Textual inversion: Encode a concept as a new learned text token.
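The small size of LoRA weights comes from replacing a full weight update with two thin matrices. A minimal sketch for one linear layer follows; the 768-wide layer, rank 8, and all names are illustrative assumptions rather than values from any specific model.

```python
import numpy as np

def lora_param_counts(d_out, d_in, rank):
    """Parameters in the full weight matrix versus in the LoRA factors."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)
    return full, lora

def lora_forward(x, W, A, B, alpha, rank):
    """y = x W^T + (alpha / rank) * x A^T B^T, with W frozen and only A, B trained."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

rng = np.random.default_rng(0)
d_in = d_out = 768
rank = 8
W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # small random init
B = np.zeros((d_out, rank))                   # zero init: the adapter starts as a no-op
x = rng.standard_normal((2, d_in))

y = lora_forward(x, W, A, B, alpha=8.0, rank=rank)
full, lora = lora_param_counts(d_out, d_in, rank)  # 589824 vs 12288
```

For this layer the adapter holds 48x fewer parameters than the weight it modifies, and because `B` starts at zero the adapted layer initially reproduces the base model exactly.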
Beyond Images: Diffusion in Other Modalities
- Video: Sora (OpenAI), Gen-2/Gen-3 (Runway), Kling (Kuaishou), Wan (Alibaba) — temporal consistency is the key challenge
- Audio: AudioLDM, Stable Audio — music and sound effects from text
- 3D: DreamFusion, Point-E — 3D assets from text
- Proteins: RFdiffusion (Baker Lab), de novo protein structure design, a task complementary to AlphaFold's structure prediction
- Molecules: DiffSBDD, TargetDiff — drug molecule generation conditioned on protein binding sites
- Synthetic data: Diffusion-generated images and video used as training data for other AI models
Hardware and Compute Requirements
- Minimum for inference: 4GB VRAM (SD 1.5 FP16, 512×512)
- Comfortable inference: 8-12GB VRAM (SDXL 1024×1024)
- Training from scratch: Dozens to hundreds of A100-class GPUs for weeks to months (the Stable Diffusion v1 model card reports 256 A100s)
- LoRA fine-tuning: Single RTX 4090 (24GB) is sufficient for most use cases
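A rough lower bound on VRAM is the memory needed just to hold the weights. The sketch below assumes a U-Net of about 860 million parameters for SD 1.5, which is an approximate figure not stated above, and it ignores activations, the text encoder, and the VAE, all of which add to the total.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

unet_params = 860e6  # assumed approximate SD 1.5 U-Net size

fp32_gb = weight_memory_gb(unet_params, 4)  # about 3.44 GB
fp16_gb = weight_memory_gb(unet_params, 2)  # about 1.72 GB
```

Halving the precision halves the weight footprint, which is consistent with FP16 inference fitting on a 4GB card while leaving room for activations.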
Diffusion models represent a fundamental advance in generative AI — their ability to produce photorealistic, controllable, diverse outputs has transformed creative workflows across design, filmmaking, advertising, gaming, and scientific research.