Diffusion Models

Diffusion models are a class of generative models that learn to create images, audio, and other data by iteratively reversing a gradual noise-addition process: sampling starts from pure Gaussian noise and progressively denoises it into a coherent output, guided by text, images, or other conditioning signals. Introduced by Sohl-Dickstein et al. (2015) and made practical by Ho et al.'s DDPM (2020), diffusion models displaced GANs as the dominant image generation approach by 2022 and power Stable Diffusion, DALL-E 3, Midjourney, Google Imagen, and Adobe Firefly.

The Two Phases of Diffusion

Forward Process (Training): Adding Noise

During training, real images are progressively corrupted:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

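At each timestep $t \in \{1, ..., T\}$ (typically $T=1000$), a small amount of Gaussian noise with variance $\beta_t$ is added while the signal is scaled down by $\sqrt{1-\beta_t}$. After $T$ steps, the image is effectively indistinguishable from pure noise: $x_T$ is approximately distributed as $\mathcal{N}(0, I)$.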

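Because a chain of Gaussian steps is itself Gaussian, $x_t$ can be sampled directly from $x_0$ at any timestep. With $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, the reparameterization trick gives: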

$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
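The closed-form jump to an arbitrary timestep can be sketched in a few lines. This is a minimal illustration in dependency-free Python, not a production implementation: the linear schedule from $10^{-4}$ to $0.02$ follows the DDPM paper's defaults, while the function names (`linear_beta_schedule`, `cumulative_alphas`, `q_sample`) and the toy 4-value "image" are assumptions made for this example. Timesteps are 0-based here, so index 0 corresponds to $t=1$.

```python
import math
import random

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (DDPM-style defaults)."""
    return [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t directly from x_0; t is a 0-based index into alpha_bars."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)

x0 = [1.0, -0.5, 0.25, 0.0]                      # toy stand-in for an image
x_early, _ = q_sample(x0, 0, alpha_bars, rng)    # barely perturbed
x_late, _ = q_sample(x0, 999, alpha_bars, rng)   # almost pure noise
```

With this schedule $\bar{\alpha}_T$ is close to zero, so the signal term $\sqrt{\bar{\alpha}_T}\,x_0$ has nearly vanished by the last step. Real implementations do the same arithmetic on tensors with a library such as PyTorch.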

Reverse Process (Inference): Removing Noise

A neural network $\epsilon_\theta(x_t, t)$ is trained to predict the noise $\epsilon$ that was mixed into $x_t$ by the forward process:

$$L_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
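The training objective can be estimated by Monte Carlo: pick a random timestep, noise the clean sample, and compare the true noise with the prediction. The sketch below is a hedged illustration in plain Python. `simple_loss` and `zero_model` are hypothetical names, `zero_model` merely stands in for a trained network, and the loss is averaged per element rather than summed as a squared norm.

```python
import math
import random

def simple_loss(eps_model, x0_batch, alpha_bars, rng):
    """Monte Carlo estimate of L_simple, averaged over all elements."""
    total, count = 0.0, 0
    for x0 in x0_batch:
        t = rng.randrange(len(alpha_bars))         # uniform random timestep (0-based)
        a = alpha_bars[t]
        eps = [rng.gauss(0.0, 1.0) for _ in x0]    # target noise
        x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
        pred = eps_model(x_t, t)                   # model's noise estimate
        total += sum((e - p) ** 2 for e, p in zip(eps, pred))
        count += len(x0)
    return total / count

# Linear beta schedule (DDPM-style defaults) and its cumulative products.
T = 1000
alpha_bars, running = [], 1.0
for i in range(T):
    running *= 1.0 - (1e-4 + (0.02 - 1e-4) * i / (T - 1))
    alpha_bars.append(running)

def zero_model(x_t, t):
    """Hypothetical untrained stand-in: always predicts zero noise."""
    return [0.0] * len(x_t)

rng = random.Random(0)
batch = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(256)]
loss = simple_loss(zero_model, batch, alpha_bars, rng)
```

A predictor that always outputs zero scores close to 1 per element, because $\mathbb{E}[\epsilon^2] = 1$ for standard Gaussian noise. A trained network should drive the loss well below that baseline.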

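At inference, start with pure noise $x_T \sim \mathcal{N}(0, I)$ and denoise step by step, drawing fresh noise $z \sim \mathcal{N}(0, I)$ at every step except the last (where $z = 0$), with $\sigma_t^2 = \beta_t$ as one common choice: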

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sigma_t z$$
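The update rule above can be turned into a sampling loop. The following is a minimal sketch in plain Python under stated assumptions: $\sigma_t^2 = \beta_t$, no noise on the final step, and 0-based timestep indices. The names `make_schedule`, `ddpm_sample`, and `oracle_eps` are hypothetical. Since no trained network is available here, `oracle_eps` is the exact noise predictor for a toy dataset in which every sample is $x_0 = 0$, which makes the result easy to check.

```python
import math
import random

def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Return betas, alphas, and cumulative alpha_bars for a linear schedule."""
    betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars

def ddpm_sample(eps_model, dim, betas, alphas, alpha_bars, rng):
    """Ancestral sampling: start from noise and apply the reverse update T times."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]          # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):                   # indices T-1 down to 0
        eps_hat = eps_model(x, t)
        coef = (1.0 - alphas[t]) / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps_hat)]
        if t > 0:
            sigma = math.sqrt(betas[t])                     # assumes sigma_t^2 = beta_t
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean                                        # final step adds no noise
    return x

betas, alphas, alpha_bars = make_schedule()

def oracle_eps(x_t, t):
    """Exact noise predictor when all training data is x_0 = 0 (toy assumption)."""
    s = math.sqrt(1.0 - alpha_bars[t])
    return [xi / s for xi in x_t]

sample = ddpm_sample(oracle_eps, 4, betas, alphas, alpha_bars, random.Random(0))
```

With this oracle the sampler lands on the only point in the toy data distribution, $x_0 = 0$, up to floating-point error. In practice `eps_model` would be a trained U-Net or transformer.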

Latent Diffusion Models (Stable Diffusion)

Pixel-space diffusion is computationally expensive: a 512×512 RGB image has 786,432 values (512 × 512 × 3). Latent diffusion (Rombach et al., 2022) instead operates in a compressed latent space:

1. VAE Encoder: Compress image $x \in \mathbb{R}^{H \times W \times 3}$ to latent $z \in \mathbb{R}^{h \times w \times C}$ (typically 8x spatial compression, $C=4$)
2. Diffusion in latent space: Train U-Net to denoise latents (64×64×4 instead of 512×512×3, which is 48x fewer values)
3. VAE Decoder: Decode final clean latent back to pixel space

This is why Stable Diffusion runs on consumer GPUs: the actual diffusion happens on a 64×64 latent, not a 512×512 pixel grid.
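The 48x figure follows directly from the shapes. As a quick check, the arithmetic can be written out in Python. The helper name `latent_shape` is hypothetical, and the 8x downsampling factor and 4 latent channels are the typical values quoted above.

```python
import math

def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape of the VAE latent for an image of the given pixel size."""
    return (height // downsample, width // downsample, latent_channels)

pixel_shape = (512, 512, 3)
z_shape = latent_shape(512, 512)

pixel_values = math.prod(pixel_shape)    # 786432
latent_values = math.prod(z_shape)       # 16384
ratio = pixel_values / latent_values     # 48.0

print(z_shape, pixel_values, latent_values, ratio)
```

The denoiser is evaluated once or twice per sampling step, so shrinking its input by 48x reduces the cost of every step, not just of a single pass.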

Major Diffusion Model Families

| Model | Organization | Architecture | Release | Key Innovation |
|---|---|---|---|---|
| DDPM | UC Berkeley | U-Net, pixel space | 2020 | Proved diffusion practical |
| DALL-E 2 | OpenAI | CLIP + cascaded diffusion | 2022 | Text-image alignment |
| Imagen | Google | Cascaded pixel diffusion | 2022 | T5 text encoder, photorealism |
| Stable Diffusion 1.x | Stability AI | LDM with CLIP | 2022 | Open source, runs on 8GB GPU |
| Stable Diffusion XL | Stability AI | Larger LDM, CLIP+OpenCLIP | 2023 | 1024×1024 native resolution |
| DALL-E 3 | OpenAI | Transformer-based diffusion | 2023 | Improved prompt following |
| Flux.1 | Black Forest Labs | Rectified flow transformer | 2024 | Best open-source quality |
| Stable Diffusion 3.5 | Stability AI | MMDiT architecture | 2024 | Multimodal diffusion transformer |

Text Conditioning: How It Guides Generation

Text-to-image generation uses Classifier-Free Guidance (CFG):

1. Encode text prompt with CLIP or T5 text encoder → text embedding
2. At each denoising step, run U-Net twice: once with text conditioning, once unconditionally
3. Combine: $\tilde{\epsilon} = \epsilon_{\text{uncond}} + w \cdot (\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})$
4. $w$ = guidance scale (default 7.5): higher = more prompt-adherent, less diverse
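The combination in step 3 is a simple extrapolation from the unconditional prediction toward the conditional one. A minimal sketch in plain Python, where `cfg_combine` is a hypothetical helper name and the two short lists stand in for the U-Net's noise predictions:

```python
def cfg_combine(eps_uncond, eps_cond, w=7.5):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy stand-ins for the two noise predictions at one denoising step.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

guided = cfg_combine(eps_uncond, eps_cond, w=7.5)
print(guided)  # [7.5, 1.0, 6.5]
```

Setting $w = 1$ recovers the plain conditional prediction and $w = 0$ the unconditional one. Values above 1 push the result past the conditional prediction, which is where the stronger prompt adherence comes from. Elements on which both predictions agree (the middle value here) are unaffected by $w$.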

Sampling Schedulers

The original DDPM requires $T=1000$ denoising steps (slow). Modern schedulers reduce this dramatically:

| Scheduler | Steps Needed | Quality | Speed |
|---|---|---|---|
| DDPM | 1000 | Excellent | Slow |
| DDIM | 20-50 | Very good | 20-50x faster |
| DPM-Solver++ | 15-20 | Excellent | 50-70x faster |
| LCM (Consistency) | 4-8 | Good | 125-250x faster |
| Turbo/Flash (1-step) | 1-4 | Good | 250-1000x faster |
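To make the step reduction concrete, here is a hedged sketch of how a DDIM-style scheduler skips timesteps. The article does not spell out the DDIM update, so the rule below is the standard deterministic form (no added noise): estimate $x_0$ from the predicted noise, then re-noise that estimate to the earlier timestep. The names `ddim_timesteps` and `ddim_step` are hypothetical, timesteps are 0-based, and the true noise is used in place of a trained network so the result can be checked.

```python
import math
import random

def make_alpha_bars(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for i in range(T):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
        alpha_bars.append(running)
    return alpha_bars

def ddim_timesteps(T, num_steps):
    """Evenly strided subset of 0-based timesteps, highest first."""
    stride = T // num_steps
    return list(range(T - 1, -1, -stride))[:num_steps]

def ddim_step(x_t, eps_hat, a_t, a_prev):
    """Deterministic DDIM update from noise level a_t to noise level a_prev."""
    x0_pred = [(x - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
               for x, e in zip(x_t, eps_hat)]
    return [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
            for x0, e in zip(x0_pred, eps_hat)]

alpha_bars = make_alpha_bars()
steps = ddim_timesteps(1000, 50)           # 50 network evaluations instead of 1000

# Consistency check with the true noise standing in for the network.
rng = random.Random(0)
x0 = [1.0, -0.5, 0.25, 0.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
a_t, a_prev = alpha_bars[999], alpha_bars[499]
x_t = [math.sqrt(a_t) * x + math.sqrt(1.0 - a_t) * e for x, e in zip(x0, eps)]
x_prev = ddim_step(x_t, eps, a_t, a_prev)  # jumps 500 timesteps in one update
```

When the noise prediction is exact, a single update moves the sample to the correct noise level at any earlier timestep, which is why a coarse grid of 20 to 50 steps can replace the full 1000-step chain. The speed column in the table is roughly the ratio of 1000 to the number of steps used.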

Customization Ecosystem

Beyond Images: Diffusion in Other Modalities

Hardware and Compute Requirements

Diffusion models represent a fundamental advance in generative AI — their ability to produce photorealistic, controllable, diverse outputs has transformed creative workflows across design, filmmaking, advertising, gaming, and scientific research.
