Modality hallucination is a knowledge distillation technique in which a model learns to internally generate (hallucinate) the features of a modality that is missing at inference time. A student network is trained to mimic the representations a teacher network produces from a modality that is available during training but unavailable during deployment, so the student benefits from multimodal knowledge while operating on a single modality.
What Is Modality Hallucination?
- Definition: A training paradigm where a model that will only receive modality A at test time is trained to internally reconstruct the features of modality B (which was available during training), effectively "imagining" what the missing modality would look like and using those hallucinated features to improve predictions.
- Teacher-Student Framework: A teacher network processes both modalities (e.g., RGB + Depth) during training; a student network receives only one modality (RGB) but is trained to produce intermediate features that match what the teacher extracts from the missing modality (Depth).
- Feature Mimicry: The hallucination loss minimizes the squared Euclidean distance between the student's hallucinated features and the teacher's real features: L_hall = ||f_student(x_RGB) - f_teacher(x_Depth)||², forcing the student's hallucination branch to learn a mapping from the available modality's input to the missing modality's features. In practice this term is added to the task loss with a weighting coefficient, e.g. L_total = L_task + λ·L_hall.
- Inference Efficiency: At test time, only the student network runs on the single available modality — no additional sensors, data collection, or processing for the missing modality is needed.
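The feature-mimicry objective above can be sketched numerically. The NumPy snippet below is a minimal illustration, not a reference implementation: the layer sizes, single-layer encoders, and names (`teacher_features`, `hallucinated_features`, `W_hall`) are all assumptions. A real system would use deep networks trained by backpropagation and add λ·L_hall to the task loss; here, central-difference gradient descent on L_hall alone is enough to show the hallucination branch learning to match the teacher.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions for illustration): RGB input, depth input, feature width.
RGB_DIM, DEPTH_DIM, FEAT_DIM = 12, 6, 4

def relu(x):
    return np.maximum(x, 0.0)

# Frozen teacher depth encoder: only usable while paired depth data exists.
W_teacher = rng.normal(size=(DEPTH_DIM, FEAT_DIM))

def teacher_features(x_depth):
    return relu(x_depth @ W_teacher)

def hallucinated_features(x_rgb, W):
    # Student's hallucination branch: maps RGB input to depth-like features.
    return relu(x_rgb @ W)

def hallucination_loss(x_rgb, x_depth, W):
    # L_hall = ||f_student(x_RGB) - f_teacher(x_Depth)||^2, averaged over the batch.
    diff = hallucinated_features(x_rgb, W) - teacher_features(x_depth)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

# One paired RGB-D training batch (random stand-ins for real sensor data).
x_rgb = rng.normal(size=(32, RGB_DIM))
x_depth = rng.normal(size=(32, DEPTH_DIM))

W_hall = rng.normal(scale=0.1, size=(RGB_DIM, FEAT_DIM))
loss_before = hallucination_loss(x_rgb, x_depth, W_hall)

# Central-difference gradient descent on L_hall; a real system would instead
# backpropagate L_task + lambda * L_hall through a deep student network.
lr, eps = 1e-2, 1e-5
for _ in range(200):
    grad = np.zeros_like(W_hall)
    for i in range(RGB_DIM):
        for j in range(FEAT_DIM):
            w_plus, w_minus = W_hall.copy(), W_hall.copy()
            w_plus[i, j] += eps
            w_minus[i, j] -= eps
            grad[i, j] = (hallucination_loss(x_rgb, x_depth, w_plus)
                          - hallucination_loss(x_rgb, x_depth, w_minus)) / (2 * eps)
    W_hall -= lr * grad

loss_after = hallucination_loss(x_rgb, x_depth, W_hall)

# At test time the depth sensor is gone: only hallucinated_features(x_rgb, W_hall) runs.
```

The key structural point the sketch preserves is the inference path: the teacher and the depth input appear only inside the training loop, while deployment calls just the hallucination branch on RGB.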
Why Modality Hallucination Matters
- Sensor Cost Reduction: Depth cameras (LiDAR, structured light) are expensive and power-hungry; hallucinating depth features from cheap RGB cameras provides depth-like understanding without the hardware cost.
- Missing Data Robustness: In real-world deployment, modalities frequently become unavailable (sensor failure, occlusion, privacy restrictions); hallucination enables graceful degradation rather than complete failure.
- Deployment Simplicity: A model that hallucinates missing modalities can be deployed with fewer sensors and simpler infrastructure while retaining much of the multimodal model's accuracy.
- Privacy Preservation: Some modalities (thermal imaging, depth) reveal sensitive information; hallucinating their features from less invasive modalities (RGB) enables the performance benefits without the privacy concerns.
Modality Hallucination Applications
- RGB → Depth: Training on RGB-D data, deploying with RGB only — the model hallucinates depth features for improved 3D understanding, object detection, and scene segmentation.
- Multimodal → Unimodal Medical Imaging: Training on MRI + CT + PET, deploying with MRI only — hallucinating CT and PET features improves diagnosis when only one imaging modality is available.
- Audio-Visual → Visual Only: Training on video with audio, deploying on silent video — hallucinated audio features improve action recognition and event detection in surveillance footage.
- Multi-Sensor → Single Sensor Autonomous Driving: Training on camera + LiDAR + radar, deploying with camera only — hallucinating LiDAR features enables 3D perception from monocular cameras.
| Scenario | Training Modalities | Test Modality | Hallucinated | Performance Recovery |
|----------|-------------------|--------------|-------------|---------------------|
| RGB → Depth | RGB + Depth | RGB only | Depth features | 85-95% of multimodal |
| MRI → CT | MRI + CT | MRI only | CT features | 80-90% of multimodal |
| Video → Audio | Video + Audio | Video only | Audio features | 75-85% of multimodal |
| Camera → LiDAR | Camera + LiDAR | Camera only | LiDAR features | 80-90% of multimodal |
| Text → Image | Text + Image | Text only | Image features | 70-85% of multimodal |
Modality hallucination is the knowledge distillation bridge between multimodal training and unimodal deployment. By mimicking a multimodal teacher's representations, models learn to internally imagine missing sensory inputs, letting single-modality systems approach multimodal performance without the cost, complexity, or availability constraints of additional sensors.