Scene Flow is the 3D motion estimation task of computing a dense 3D velocity vector for every point in a scene. It is the three-dimensional generalization of optical flow: it captures how objects move in real-world coordinates (meters per second) rather than how their projections shift on the 2D image plane (pixels per frame). This makes it essential for autonomous driving, robotics, and AR/VR systems, where understanding true 3D motion is required for safe navigation, object manipulation, and realistic virtual object interaction.
What Is Scene Flow?
- Definition: A dense 3D vector field $(dx, dy, dz)$ assigned to every visible point $(x, y, z)$ in the scene, describing its 3D motion between consecutive time steps.
- Inputs: Typically stereo video (two cameras), RGB-D (depth sensor), or LiDAR point clouds — any sensor providing 3D geometry.
- Output: Per-point 3D displacement vectors representing real-world motion.
- Optical Flow vs. Scene Flow: Optical flow captures apparent 2D pixel motion. Scene flow captures true 3D world motion — two objects moving at the same 3D speed but at different depths have different optical flow but identical scene flow magnitude.
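The depth dependence is easy to verify with a pinhole camera model. Below is a minimal numpy sketch (the focal length `f` and the specific numbers are illustrative, not from any particular sensor): two points share the same 3D velocity, so their scene flow is identical, while the projected optical flow shrinks with depth.

```python
import numpy as np

def project(p, f=700.0):
    """Pinhole projection of a 3D point (x, y, z) to pixel coordinates.
    f is an illustrative focal length in pixels."""
    x, y, z = p
    return np.array([f * x / z, f * y / z])

dt = 0.1                              # seconds between frames
v = np.array([1.0, 0.0, 0.0])         # both points move at 1 m/s along x

for z in (5.0, 50.0):                 # same motion at two different depths
    p0 = np.array([0.0, 0.0, z])
    p1 = p0 + v * dt                  # scene flow = v * dt, independent of depth
    du = project(p1) - project(p0)    # optical flow depends strongly on depth
    print(f"depth {z:4.0f} m | scene flow {v * dt} m | optical flow {du} px")
```

At 5 m the point moves 14 px between frames; at 50 m, only 1.4 px, even though the 3D motion is identical.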
Why Scene Flow Matters
- Autonomous Driving: Distinguishing moving vehicles from parked ones in LiDAR point clouds — critical for collision avoidance and path planning. A parked car and a car approaching at 60 km/h look identical in a single frame but have dramatically different scene flow.
- Robotics Manipulation: Grasping a moving object requires predicting its 3D trajectory — scene flow provides the velocity field needed for interception planning.
- AR/VR: Realistic interaction between virtual objects and the real environment requires understanding 3D motion of real-world surfaces.
- Motion Segmentation: Scene flow enables automatic decomposition of a scene into independently moving objects, since each rigid body has approximately uniform scene flow (see the sketch after this list).
- Depth from Motion: Scene flow combined with ego-motion provides additional depth cues that complement stereo or monocular depth estimation.
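As a toy illustration of the uniform-flow idea behind motion segmentation, the sketch below clusters per-point flow vectors with DBSCAN to separate a moving rigid body from the static background. The synthetic data, the `eps` value, and the use of scikit-learn are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN  # assumes scikit-learn is installed

rng = np.random.default_rng(0)
# Synthetic scene: static background plus one rigid body moving at ~(2, 0, 0) m/s.
static_flow = rng.normal(0.0, 0.02, size=(500, 3))                # noise around zero
moving_flow = rng.normal(0.0, 0.02, size=(100, 3)) + [2.0, 0.0, 0.0]
flow = np.vstack([static_flow, moving_flow])                      # per-point flow (m/s)

# Rigid bodies have approximately uniform flow, so clustering the flow
# vectors themselves separates independently moving objects.
labels = DBSCAN(eps=0.2, min_samples=10).fit(flow).labels_
for lab in np.unique(labels):
    mean_v = flow[labels == lab].mean(axis=0)
    print(f"cluster {lab}: {np.sum(labels == lab)} points, mean velocity {mean_v.round(2)}")
```

Note that purely translating bodies give exactly uniform flow; rotating bodies give spatially varying flow, which is why the uniformity is only approximate.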
Computation Methods
| Approach | Input | Method | Speed |
|----------|-------|--------|-------|
| Variational | Stereo video | Joint optimization of disparity + flow | Slow (minutes) |
| Deep Learning (Supervised) | Point clouds or stereo | FlowNet3D, PointPWC-Net | Real-time |
| Self-Supervised | Stereo or mono + depth | Photometric/geometric consistency losses (sketch below) | Real-time |
| Scene Flow from LiDAR | Sequential LiDAR scans | Point cloud registration + flow estimation | Real-time |
| Neural Scene Flow Prior | Any 3D input | Implicit neural representation of flow field | Slow |
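The Self-Supervised row above replaces ground-truth flow with geometric consistency: warp frame-1 points by the predicted flow, penalize their nearest-neighbor (Chamfer) distance to frame 2, and regularize the flow to be locally smooth. The numpy sketch below shows only the shape of such a loss; the function names, neighbor count `k`, and weight `lam` are illustrative, and real training code would compute this in an autodiff framework with accelerated kNN.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric nearest-neighbor distance between point sets a (N,3) and b (M,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def self_supervised_loss(p1, p2, flow, k=8, lam=0.1):
    """Geometric consistency loss: warped frame-1 points should land on frame-2
    geometry, and flow should vary smoothly across spatial neighbors."""
    warped = p1 + flow                       # apply predicted scene flow
    data_term = chamfer(warped, p2)
    # Smoothness: each point's flow should match its k nearest neighbors' flow.
    d = np.linalg.norm(p1[:, None, :] - p1[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]   # skip self at index 0
    smooth_term = np.linalg.norm(flow[:, None, :] - flow[nn], axis=-1).mean()
    return data_term + lam * smooth_term

# Toy check: identical clouds with zero flow give a loss of exactly 0.
p = np.random.default_rng(0).normal(size=(64, 3))
print(self_supervised_loss(p, p, np.zeros_like(p)))
```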
Scene Flow Components
| Component | Description | Representation |
|-----------|-------------|---------------|
| Disparity Change | Depth variation between frames | $\Delta d$ (stereo) or $\Delta z$ (metric) |
| 2D Optical Flow | Pixel displacement on the image plane | $(u, v)$ per pixel |
| 3D Translation | Combined 3D motion vector | $(dx, dy, dz)$ in world coordinates |
| Ego-Motion Compensation | Remove camera/vehicle self-motion | Rigid transform subtraction |
| Residual (Object) Flow | Motion after ego-motion removal (see sketch below) | Per-object 3D velocity |
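In code, the last two components are a subtraction: compute the flow that a static point would exhibit under the known sensor motion, and remove it from the measured flow. A minimal sketch, assuming the ego transform `(R_ego, t_ego)` maps frame-t coordinates into frame-(t+1) and comes from odometry or SLAM (the names and convention are illustrative):

```python
import numpy as np

def residual_flow(points, total_flow, R_ego, t_ego):
    """Subtract the flow induced by camera/vehicle self-motion.

    points:       (N, 3) point positions at time t
    total_flow:   (N, 3) measured scene flow from t to t+1
    R_ego, t_ego: rigid transform mapping frame-t coordinates to frame-(t+1)
    """
    ego_flow = points @ R_ego.T + t_ego - points  # flow a static point would show
    return total_flow - ego_flow                  # what remains is object motion
```

For a static point, the measured flow equals `ego_flow`, so its residual is ~0; significantly nonzero residuals feed the moving-object detection described in the next section.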
Key Applications in Autonomous Driving
- Moving Object Detection: Points with significant residual scene flow (after ego-motion subtraction) belong to moving objects, so motion can be flagged without running a trained object detector.
- Velocity Estimation: Scene flow directly provides the 3D velocity of every detected object — critical for trajectory prediction and collision risk assessment.
- Point Cloud Accumulation: Compensate object motion when stacking sequential LiDAR scans, so static structure aligns and moving objects are placed correctly (sketched after this list).
- Free Space Estimation: Regions with nonzero flow are occupied by moving, potentially hazardous objects; scene flow therefore augments occupancy grid predictions.
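A minimal numpy sketch of the motion-compensated accumulation referenced above. It assumes ego-motion has already been removed and that velocity is roughly constant over the window, so each older scan is simply carried forward by repeated flow steps; real pipelines chain per-step ego-motion and flow estimates instead.

```python
import numpy as np

def accumulate_scans(scans, flows):
    """Motion-compensated stacking of sequential LiDAR scans.

    scans: list of (N_i, 3) point arrays, oldest first
    flows: flows[i] is the (N_i, 3) per-step scene flow of scan i's points
    Each older scan is warped forward into the latest scan's time under a
    constant-velocity assumption, so moving objects stay crisp when stacked.
    """
    T = len(scans)
    warped = [scans[i] + flows[i] * (T - 1 - i) for i in range(T - 1)]
    warped.append(scans[-1])
    return np.vstack(warped)
```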
Challenges
- Ambiguity: Large uniform surfaces (walls, roads) have ambiguous flow due to the aperture problem — similar to optical flow but in 3D.
- Occlusion: Points that become occluded or newly visible between frames have undefined flow.
- Computation Cost: Dense 3D flow estimation is substantially more expensive than 2D optical flow — real-time performance requires careful architecture design.
- Ground Truth Scarcity: Labeled 3D scene flow is extremely hard to obtain. Synthetic datasets such as FlyingThings3D are the primary training source, supplemented by small real-world benchmarks like KITTI Scene Flow.
Scene Flow is the culmination of motion perception for 3D understanding. It provides the complete dynamic picture of how the physical world is moving, beyond the flat projection of optical flow, and enables autonomous systems to reason about, predict, and react to the true three-dimensional motion around them.