ControlNet adds spatial control signals like edges, depth, or poses to guide diffusion model image generation.

Problem: Text-to-image models have limited spatial control; a prompt alone can't specify exact composition, poses, or structure.
Solution: Condition the diffusion model on additional spatial inputs alongside the text prompt.
Control signals: Canny edges, depth maps, pose skeletons, segmentation maps, normal maps, scribbles, line art.
Architecture: Clone the encoder weights of the diffusion U-Net, process the control signal with the cloned encoder, and inject its features into the original network via zero convolutions (sketches below).
Zero convolutions: Convolutions initialized to zero that gradually learn their contribution during training; since the injected features are exactly zero at initialization, the pretrained model is not destabilized.
Training: Pairs of images and control signals, often extracted automatically (edge detection, depth estimation).
Inference: Extract a control signal from a reference image → generate an image matching that structure.
Use cases: Pose-to-image, architectural rendering from sketches, consistent character generation, style transfer with structure preservation.
Multi-ControlNet: Combine multiple control signals (edges + depth + pose).
Ecosystem: Many community models for different control types. ControlNet revolutionized controlled image generation.
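A minimal PyTorch sketch of the zero convolution described above: a 1x1 convolution whose weight and bias start at zero, so its output is exactly zero until training updates it. The class name ZeroConv2d is illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Module):
    """1x1 convolution whose weight and bias start at zero, so its
    output is zero at initialization and grows only as training
    updates the weights."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.conv.weight)
        nn.init.zeros_(self.conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)

# At step 0 the injected feature is exactly zero, so the pretrained
# U-Net behaves as if the control branch were absent.
zc = ZeroConv2d(64)
x = torch.randn(1, 64, 32, 32)
assert torch.all(zc(x) == 0)
```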
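Building on that, a toy sketch of the overall wiring: a frozen original encoder, a trainable clone processing the control signal, and zero-conv injection. TinyEncoder, TinyControlNet, and hint_proj are stand-in names for illustration (reusing ZeroConv2d from the sketch above); the real model injects features at every encoder resolution and into the decoder skip connections, which this collapses to a single level.

```python
import copy
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the diffusion U-Net encoder (illustration only)."""
    def __init__(self, ch: int = 64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

class TinyControlNet(nn.Module):
    """Frozen original encoder + trainable clone + zero-conv injection."""
    def __init__(self, encoder: TinyEncoder, ch: int = 64):
        super().__init__()
        self.control_encoder = copy.deepcopy(encoder)     # trainable clone
        self.encoder = encoder
        for p in self.encoder.parameters():               # lock pretrained weights
            p.requires_grad_(False)
        self.hint_proj = nn.Conv2d(3, ch, kernel_size=1)  # embed control image
        self.zero_conv = ZeroConv2d(ch)                   # from the sketch above

    def forward(self, x: torch.Tensor, hint: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)                                  # original features
        c = self.control_encoder(x + self.hint_proj(hint))  # control branch
        return h + self.zero_conv(c)   # zero at init, so output == h initially

net = TinyControlNet(TinyEncoder())
x = torch.randn(1, 64, 32, 32)
hint = torch.randn(1, 3, 32, 32)  # e.g. an edge map
out = net(x, hint)                # equals net.encoder(x) at initialization
```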
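Training data can be produced automatically from unlabeled images. A sketch that derives (control, target) pairs with OpenCV's Canny detector; the directory name and thresholds are illustrative.

```python
import cv2
from pathlib import Path

def make_canny_pair(path: str, low: int = 100, high: int = 200):
    """Derive one (control, target) pair: the target is the image
    itself, the control is its Canny edge map."""
    target = cv2.imread(path)                          # BGR uint8 image
    edges = cv2.Canny(target, low, high)               # single-channel edges
    control = cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR)  # match 3-channel input
    return control, target

# Build pairs for every PNG in an (illustrative) images/ directory.
pairs = [make_canny_pair(str(p)) for p in Path("images").glob("*.png")]
```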
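One way to run the extract-then-generate loop is the Hugging Face diffusers library; this is a usage sketch, not the only interface. The checkpoint IDs are published models; the file names and prompt are placeholders.

```python
# pip install diffusers transformers accelerate opencv-python pillow
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# 1. Extract the control signal: a Canny edge map from a reference image.
reference = np.array(Image.open("reference.png").convert("RGB"))
edges = cv2.Canny(reference, 100, 200)
control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# 2. Load a Canny-conditioned ControlNet and attach it to Stable Diffusion.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# 3. Generate: the output follows the edge structure of the reference.
image = pipe(
    "a futuristic city at dusk, detailed, photorealistic",
    image=control_image,
    num_inference_steps=30,
).images[0]
image.save("output.png")
```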
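For Multi-ControlNet, diffusers accepts a list of ControlNets with one conditioning image each, plus per-signal strengths. A sketch assuming canny.png and depth.png were prepared beforehand (e.g. as in the extraction sketch above).

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# One ControlNet per control signal.
canny = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
depth = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=[canny, depth],
    torch_dtype=torch.float16,
).to("cuda")

# One conditioning image per ControlNet; the scales weight each signal.
image = pipe(
    "a character standing in a narrow alley",
    image=[Image.open("canny.png"), Image.open("depth.png")],
    controlnet_conditioning_scale=[1.0, 0.5],
).images[0]
```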