Single-image depth estimation is the task of predicting per-pixel depth from a single RGB image: inferring 3D scene geometry from 2D appearance using learned priors about object sizes, perspective, occlusion, and scene layout. It enables 3D understanding without stereo cameras or depth sensors.
What Is Single-Image Depth Estimation?
- Definition: Predict depth map from single RGB image.
- Input: Single RGB image.
- Output: Depth map (distance to camera for each pixel).
- Challenge: The problem is ill-posed; infinitely many 3D scenes project to the same 2D image.
- Solution: Learn priors from data to resolve ambiguity.
Why Single-Image Depth?
- Accessibility: Works with any camera, no special hardware.
- Convenience: No stereo calibration, no multiple views needed.
- Ubiquity: Enables depth understanding on the billions of images that already exist.
- Applications: AR, robotics, autonomous vehicles, photography.
Depth Estimation Approaches
Geometric Cues:
- Perspective: Parallel lines converge at vanishing points.
- Occlusion: Closer objects occlude farther objects.
- Relative Size: Known object sizes provide scale.
- Texture Gradient: Projected texture appears finer and denser with distance.
Learning-Based:
- Supervised: Train on images with ground truth depth.
- Self-Supervised: Train on stereo pairs or video sequences.
- Transfer Learning: Pre-train on large datasets, fine-tune.
Depth Estimation Methods
Supervised Learning:
- Training Data: RGB images + ground truth depth (from lidar, depth sensors).
- Network: CNN or Transformer encoder-decoder.
- Loss: L1, L2, or a scale-invariant log loss (sketched after this list).
- Examples: MiDaS, DPT, AdaBins.
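The scale-invariant log loss is worth seeing concretely, since it is what allows a network to learn depth structure without being penalized for global scale errors. Below is a minimal PyTorch sketch following the formulation of Eigen et al. (2014); the tensor shapes, the `lam` weighting, and the `eps` masking are illustrative assumptions, not a reference implementation.

```python
import torch

def scale_invariant_loss(pred, target, lam=0.5, eps=1e-6):
    """Scale-invariant log loss in the style of Eigen et al. (2014).

    pred, target: positive depth tensors of shape (B, H, W).
    lam weights the scale term; lam=0 reduces to a plain log-space MSE.
    """
    valid = target > eps                          # ignore invalid ground-truth pixels
    d = torch.log(pred[valid] + eps) - torch.log(target[valid] + eps)
    return (d ** 2).mean() - lam * d.mean() ** 2
```

With `lam=1` the loss becomes the variance of the log-difference, which is fully invariant to a global scale factor: multiplying `pred` by any constant shifts every element of `d` equally.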
Self-Supervised Learning:
- Training Data: Stereo pairs or monocular video.
- Supervision: Photometric consistency.
- Process:
1. Predict depth from left image.
2. Warp right image using predicted depth.
3. Minimize the difference between the left image and the warped right image (sketched in code below).
- Examples: Monodepth, Monodepth2, PackNet.
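To make the warping step concrete, here is a minimal PyTorch sketch of stereo photometric self-supervision in the style of Monodepth: the predicted disparity resamples the right image into the left view, and the reconstruction error supervises depth. The function names, the pixel-unit disparity convention, and the plain L1 penalty are assumptions; published methods typically combine L1 with SSIM and add an edge-aware smoothness term.

```python
import torch
import torch.nn.functional as F

def warp_right_to_left(right, disparity):
    """Resample the right image into the left view using predicted disparity.

    right:     (B, 3, H, W) right stereo image
    disparity: (B, 1, H, W) horizontal disparity in pixels, left-image frame
    """
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=right.device),
                            torch.arange(w, device=right.device), indexing="ij")
    xs = xs.float().expand(b, h, w) - disparity.squeeze(1)  # shift by disparity
    ys = ys.float().expand(b, h, w)
    # Normalize pixel coordinates to [-1, 1], as grid_sample expects.
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
    return F.grid_sample(right, grid, align_corners=True)

def photometric_loss(left, right, disparity):
    """L1 photometric consistency between the left and warped right images."""
    return (left - warp_right_to_left(right, disparity)).abs().mean()
```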
Depth Estimation Architectures
Encoder-Decoder:
- Encoder: Extract features (ResNet, EfficientNet, ViT).
- Decoder: Upsample to full resolution depth map.
- Skip Connections: Preserve fine details (see the toy network below).
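A stripped-down PyTorch sketch of this pattern, assuming input height and width divisible by 4; the class name, channel widths, and two-level depth are invented for brevity and are far smaller than any real model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDepthNet(nn.Module):
    """Toy encoder-decoder for depth: two down blocks, two up blocks, one skip."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec1 = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.ReLU())
        self.dec2 = nn.Conv2d(32 + 32, 1, 3, padding=1)   # takes the skip from enc1

    def forward(self, x):
        e1 = self.enc1(x)                                  # 1/2 resolution
        e2 = self.enc2(e1)                                 # 1/4 resolution
        d1 = F.interpolate(self.dec1(e2), scale_factor=2,
                           mode="bilinear", align_corners=False)   # back to 1/2
        d2 = torch.cat([d1, e1], dim=1)                    # skip connection
        out = F.interpolate(self.dec2(d2), scale_factor=2,
                            mode="bilinear", align_corners=False)  # full resolution
        return F.softplus(out)                             # keep depth positive
```

The final `softplus` is one way to keep predictions positive; predicting inverse depth or disparity instead is an equally common design choice.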
Transformer-Based:
- DPT (Dense Prediction Transformer): Vision Transformer for depth.
- Benefit: Better global context, long-range dependencies.
Multi-Scale:
- Predict: Depth maps at multiple resolutions during decoding.
- Benefit: Capture both coarse structure and fine details (see the loss sketch below).
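One way to train such a network is to penalize every scale at once, downsampling the ground truth to match each prediction. A brief sketch, where the scale ordering and per-scale weights are assumptions:

```python
import torch.nn.functional as F

def multi_scale_loss(preds, target, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of L1 depth losses over predictions at several scales.

    preds:  list of (B, 1, h_i, w_i) depth maps, finest to coarsest (assumed)
    target: (B, 1, H, W) full-resolution ground-truth depth
    """
    total = 0.0
    for pred, w in zip(preds, weights):
        gt = F.interpolate(target, size=pred.shape[-2:],
                           mode="bilinear", align_corners=False)
        total = total + w * (pred - gt).abs().mean()
    return total
```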
Applications
Augmented Reality:
- Occlusion: Render AR objects behind real objects.
- Placement: Place virtual objects on real surfaces.
- Interaction: Enable realistic AR interactions.
Autonomous Vehicles:
- Obstacle Detection: Identify obstacles and their distances.
- Path Planning: Plan safe paths using depth information.
- Backup: Complement lidar with camera-based depth.
Robotics:
- Navigation: Avoid obstacles using depth.
- Manipulation: Understand object geometry for grasping.
- Mapping: Build 3D maps from monocular cameras.
Photography:
- Bokeh: Simulate depth-of-field effects.
- Refocusing: Change focus after capture.
- 3D Photos: Create 3D effects from 2D images.
Accessibility:
- Navigation Assistance: Help visually impaired navigate.
- Scene Description: Describe spatial layout of scenes.
Challenges
Scale Ambiguity:
- Problem: Monocular depth is recovered only up to an unknown global scale.
- Solution: Predict relative depth, exploit known object sizes, or align to a reference at evaluation time (sketched below).
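A common evaluation-time workaround is median scaling: before computing metrics, the scale-ambiguous prediction is multiplied by the ratio of ground-truth to predicted medians. A minimal NumPy sketch:

```python
import numpy as np

def median_scale(pred, gt):
    """Align a scale-ambiguous prediction to metric ground truth by median ratio."""
    valid = gt > 0                    # use only pixels with valid ground truth
    return pred * np.median(gt[valid]) / np.median(pred[valid])
```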
Textureless Regions:
- Problem: Smooth surfaces lack features.
- Solution: Learn priors, use global context.
Occlusions:
- Problem: Can't see behind objects.
- Solution: Infer from context, learned priors.
Generalization:
- Problem: Models trained on specific data may not generalize.
- Solution: Train on diverse datasets, domain adaptation.
Depth Estimation Datasets
Indoor:
- NYU Depth V2: Indoor scenes with Kinect depth.
- ScanNet: RGB-D scans of indoor environments.
Outdoor:
- KITTI: Autonomous driving with lidar depth.
- Cityscapes: Urban street scenes with stereo-derived disparity.
Mixed:
- MegaDepth: Internet photos with depth from structure-from-motion (SfM) reconstruction.
- Taskonomy: Large-scale indoor scenes annotated for many tasks, including depth.
Quality Metrics
Absolute Metrics:
- RMSE: Root mean squared error.
- MAE: Mean absolute error.
- Abs Rel: Mean absolute relative error.
Relative Metrics:
- δ < 1.25: Fraction of pixels where max(pred/gt, gt/pred) < 1.25, i.e. within 25% of the true depth.
- δ < 1.25²: Within a factor of ≈1.56 (about 56% relative error).
- δ < 1.25³: Within a factor of ≈1.95 (about 95% relative error).
Scale-Invariant:
- SILog: Scale-invariant logarithmic error.
- Benefit: Robust to global scale errors (all the metrics above are computed in the sketch below).
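For reference, here is how these metrics are typically computed. A NumPy sketch with assumed array shapes, using the fully scale-invariant form of SILog (benchmarks sometimes rescale or reweight it, as in KITTI's variant):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid ground-truth pixels."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    rmse = np.sqrt(np.mean((pred - gt) ** 2))     # root mean squared error
    mae = np.mean(np.abs(pred - gt))              # mean absolute error
    abs_rel = np.mean(np.abs(pred - gt) / gt)     # mean absolute relative error

    ratio = np.maximum(pred / gt, gt / pred)      # per-pixel max ratio
    deltas = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]

    d = np.log(pred) - np.log(gt)                 # log-space error
    silog = np.sqrt(np.mean(d ** 2) - np.mean(d) ** 2)

    return dict(rmse=rmse, mae=mae, abs_rel=abs_rel,
                deltas=deltas, silog=silog)
```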
Depth Estimation Models
MiDaS:
- Training: Mixed datasets (multiple sources).
- Benefit: Generalizes well to diverse scenes.
- Output: Relative (inverse) depth, defined only up to scale and shift (usage sketch below).
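MiDaS is easy to try because it ships with torch.hub entry points. The sketch below follows the usage pattern documented in the intel-isl/MiDaS repository; the model and transform names may change between releases, and the image path is a placeholder.

```python
import cv2
import torch
import torch.nn.functional as F

# Load a small MiDaS model and its matching preprocessing transform.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small").eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # placeholder path
batch = transforms.small_transform(img)

with torch.no_grad():
    pred = midas(batch)                           # relative inverse depth, (1, h, w)
    depth = F.interpolate(pred.unsqueeze(1), size=img.shape[:2],
                          mode="bicubic", align_corners=False).squeeze()
```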
DPT (Dense Prediction Transformer):
- Architecture: Vision Transformer encoder + convolutional decoder.
- Benefit: State-of-the-art accuracy, good generalization.
AdaBins:
- Innovation: Adaptive bins for depth prediction.
- Benefit: Better handling of depth range.
Monodepth2:
- Training: Self-supervised on monocular video.
- Benefit: No ground truth depth needed.
Depth Estimation Techniques
Multi-Task Learning:
- Method: Train depth jointly with other tasks (segmentation, normals).
- Benefit: Shared representations improve all tasks.
Domain Adaptation:
- Method: Adapt model trained on synthetic data to real data.
- Benefit: Leverage large synthetic datasets.
Test-Time Optimization:
- Method: Fine-tune on the test image(s) using a self-supervised loss.
- Benefit: Improves accuracy on that specific input (sketched below).
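A schematic of test-time refinement, reusing the `photometric_loss` sketched in the self-supervised section above; the step count, learning rate, and stereo-pair setup are assumptions, and video-based variants use adjacent frames instead of a right image.

```python
import torch

def test_time_refine(model, left, right, steps=20, lr=1e-5):
    """Fine-tune a depth model on one stereo pair with self-supervision."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        disparity = model(left)                          # (B, 1, H, W) prediction
        loss = photometric_loss(left, right, disparity)  # from the earlier sketch
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model.eval()
```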
Future of Single-Image Depth
- Zero-Shot: Generalize to any scene without training.
- Metric Depth: Predict absolute depth, not just relative.
- Real-Time: Fast depth estimation for mobile devices.
- Video: Temporally consistent depth for video.
- Semantic: Integrate semantic understanding.
- Foundation Models: Large pre-trained models for depth.
Single-image depth estimation is a fundamental capability in computer vision — it enables 3D understanding from ordinary 2D images, making depth perception accessible without special hardware, supporting applications from augmented reality to robotics to photography.