Capsule networks are an alternative neural architecture in which capsules (small groups of neurons encoding the pose of an entity) are connected via routing by agreement, capturing part-whole hierarchies and equivariance to viewpoint transformations better than standard convolutional networks.
Capsule Entity Representation:
- Capsule abstraction: a group of neurons (e.g., 8 in the primary capsules of the original CapsNet, 16 in the digit capsules) represents an entity at a specific position/scale; the activity vector encodes pose information
- Pose vector: contains position, size, orientation, and other transformation parameters for detected feature
- Equivariance property: when input transforms (rotation, translation), pose vectors transform correspondingly; not true for standard neurons
- Routing responsibility: capsule outputs routed to higher-level capsules based on agreement; mechanism for part-whole relationships
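The lower-to-higher prediction step above can be sketched in a few lines of NumPy; all shapes and names here are illustrative, not from a specific implementation. Each lower capsule multiplies its pose vector by a learned transformation matrix to cast a "vote" for each higher capsule's pose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 6 lower capsules (8-D poses) predict 3 higher capsules (16-D poses).
n_lower, d_lower = 6, 8
n_higher, d_higher = 3, 16

u = rng.normal(size=(n_lower, d_lower))                      # lower-capsule pose vectors
W = rng.normal(size=(n_lower, n_higher, d_higher, d_lower))  # learned transform matrices

# Each lower capsule i casts a vote u_hat[i, j] for the pose of higher capsule j.
u_hat = np.einsum('ijkl,il->ijk', W, u)
assert u_hat.shape == (n_lower, n_higher, d_higher)
```

These prediction vectors are exactly what the routing procedure then compares for agreement.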
Dynamic Routing by Agreement:
- Routing algorithm: iterative procedure routes lower-level capsule outputs to higher-level capsules based on prediction agreement
- Coupling coefficients: soft weights determining how a lower capsule's output is distributed over higher capsules; computed by softmaxing routing logits and updated each iteration from agreement, not learned by backpropagation
- Routing iterations: typically 2-3 iterations; each iteration refines coupling coefficients to route to agreeing capsules
- Squashing activation: a non-linear squashing function shrinks each output vector's length into [0, 1) while preserving its direction, so length can be read as the probability that the entity exists
- Prediction agreement: if a lower capsule's prediction matches an upper capsule's output, their coupling strength increases (routing flows toward agreeing capsules)
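The routing loop above can be written compactly; this is a minimal NumPy sketch of the Sabour et al. 2017 procedure, with illustrative shapes and a simple dot-product agreement update.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Shrink vector length into [0, 1) while preserving direction."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: (n_lower, n_higher, d) prediction vectors -> higher-capsule outputs."""
    n_lower, n_higher, d = u_hat.shape
    b = np.zeros((n_lower, n_higher))                          # routing logits
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)                 # weighted sum per higher capsule
        v = squash(s)                                          # (n_higher, d)
        b += np.einsum('ijd,jd->ij', u_hat, v)                 # agreement raises logits
    return v

rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(6, 3, 16)))
# Squashing guarantees every output length lies strictly below 1.
assert np.all(np.linalg.norm(v, axis=-1) < 1.0)
```

Note that only the transformation matrices producing `u_hat` are trained; the coupling coefficients `c` are recomputed from scratch on every forward pass.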
EM Routing (Hinton et al., 2018):
- Expectation-Maximization routing: alternative to dynamic routing; more principled probabilistic approach
- Gaussian modeling: model capsule outputs as mixture of Gaussians; EM algorithm learns mixture weights and parameters
- Linear transformation: pose predictions from lower to higher capsules via learned transformation matrices
- Iterative EM: alternating expectation (assign capsules to clusters) and maximization (update cluster parameters)
- Improved performance: EM routing slightly improves accuracy; computational cost vs marginal gain tradeoff
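The alternating E/M steps can be sketched as follows; this is a simplified illustration with diagonal Gaussians and without the activation-cost term of the full matrix-capsule formulation, so treat names and shapes as assumptions.

```python
import numpy as np

def em_routing(votes, a_lower, n_iters=3, eps=1e-8):
    """Simplified EM-routing sketch (diagonal Gaussians, no activation cost).

    votes: (n_lower, n_higher, d) pose votes; a_lower: (n_lower,) activations.
    """
    n_lower, n_higher, d = votes.shape
    r = np.full((n_lower, n_higher), 1.0 / n_higher)   # assignment probabilities
    for _ in range(n_iters):
        # M-step: fit one diagonal Gaussian per higher capsule from weighted votes.
        rw = r * a_lower[:, None]
        mu = (rw[..., None] * votes).sum(0) / (rw.sum(0)[:, None] + eps)
        var = (rw[..., None] * (votes - mu) ** 2).sum(0) / (rw.sum(0)[:, None] + eps) + eps
        # E-step: re-assign lower capsules by Gaussian log-likelihood of their votes.
        log_p = -0.5 * (((votes - mu) ** 2) / var + np.log(2 * np.pi * var)).sum(-1)
        r = np.exp(log_p - log_p.max(1, keepdims=True))
        r /= r.sum(1, keepdims=True)
    return mu, r

rng = np.random.default_rng(0)
mu, r = em_routing(rng.normal(size=(6, 3, 4)), np.ones(6))
```

Each higher capsule's pose `mu` ends up as the mean of the votes that cluster around it, which is the "agreement" signal in probabilistic form.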
Part-Whole Relationships:
- Hierarchical structure: capsules explicitly encode part-whole relationships; lower-level features → higher-level entities
- Compositional learning: model learns that wheels, doors, windows compose cars; explicit semantic hierarchy
- Robustness to viewpoint: capsule vectors contain viewpoint information; networks generalize across viewpoints
- Inverse graphics: capsules hypothesized to learn inverse graphics model (generate images from poses)
Equivariance to Transformations:
- Equivariance advantage: standard CNNs are equivariant only to translation (a consequence of convolutional weight sharing); capsules aim for equivariance to a wider set of transformations
- Pose generalization: viewpoint transformation in input reflected in pose vector; enables better generalization
- Affine transformations: capsule networks hypothesized to be equivariant to affine transforms; supported empirically
- Robustness benefits: equivariance hypothesized to improve adversarial robustness; empirical validation ongoing
Capsule Network Architecture:
- CapsNet for MNIST: a standard convolutional layer, a convolutional capsule layer (PrimaryCaps), and a fully connected capsule layer (DigitCaps); margin loss for multiclass classification
- Weight sharing: each capsule type shares its transformation weights across spatial positions; reduces parameters vs fully-connected networks
- Reconstruction regularizer: add reconstruction loss (decoder reconstructs image from class capsule); additional supervision signal
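The margin loss used for classification can be stated directly; this is a NumPy sketch of the loss from Sabour et al. 2017, with their published constants (m+ = 0.9, m- = 0.1, lambda = 0.5).

```python
import numpy as np

def margin_loss(v_norms, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss over class-capsule lengths.

    v_norms: (batch, n_classes) capsule lengths; targets: one-hot (batch, n_classes).
    """
    pos = targets * np.maximum(0.0, m_pos - v_norms) ** 2          # present class too short
    neg = lam * (1 - targets) * np.maximum(0.0, v_norms - m_neg) ** 2  # absent class too long
    return (pos + neg).sum(axis=1).mean()

# A confident, correct prediction incurs near-zero loss.
v = np.array([[0.95, 0.05]])
t = np.array([[1.0, 0.0]])
loss = margin_loss(v, t)
assert loss < 1e-3
```

The reconstruction regularizer is simply added to this loss with a small weight (0.0005 per pixel in the original paper), so it shapes the capsule poses without dominating classification.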
Limitations and Challenges:
- Scalability: routing complexity increases with network depth; computational overhead substantial
- Training difficulty: capsule networks harder to train than CNNs; require careful initialization and hyperparameter tuning
- Performance gains: improvements over CNNs modest on standard benchmarks; larger benefits hypothesized for novel viewpoints
- Interpretability: capsule poses should be interpretable (rotations, positions, etc.); empirical pose interpretability mixed
Capsule networks introduce geometric structure, routing by agreement and pose vectors that encode transformations, and propose a more biologically inspired alternative to standard convolutions, with advantages in capturing part-whole hierarchies.