Abstract:
MonoFusion allows a user to build dense 3D reconstructions of their environment in real-time, utilizing only a single, off-the-shelf web camera as the input sensor. The camera could be one already built into a tablet or phone, or a peripheral web camera. No additional input hardware is required. This removes the need for power-intensive active sensors that do not work robustly in natural outdoor lighting. Using the input stream of the camera, we first estimate the 6DoF pose of the camera using a sparse tracking method. These poses are then used for efficient dense wide-baseline stereo matching between the input frame and a previously extracted key frame. The resulting dense depth maps are directly fused into a voxel-based implicit model (using a computationally inexpensive method) and surfaces are extracted per frame. The system is able to recover from tracking failures as well as filter out geometrically inconsistent noise from the 3D reconstruction. Furthermore, compared to existing approaches, our system does not require maintaining a compute- and memory-intensive cost volume and avoids using expensive global optimization methods for fusion. This paper details the algorithmic components that make up our system and a GPU implementation of our approach. Qualitative results demonstrate high-quality reconstructions that are visually comparable even to those of active depth sensor-based systems such as KinectFusion.
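For intuition on the voxel-based implicit fusion step, a common formulation in KinectFusion-style systems (given here only as a hedged sketch under assumed notation, not necessarily the exact scheme used in this work) stores per voxel $v$ a signed distance $D(v)$ and a weight $W(v)$, and folds in each new depth map's truncated signed distance $d_t(v)$ with weight $w_t(v)$ as a running weighted average:
\[
D_t(v) = \frac{W_{t-1}(v)\,D_{t-1}(v) + w_t(v)\,d_t(v)}{W_{t-1}(v) + w_t(v)}, \qquad
W_t(v) = W_{t-1}(v) + w_t(v).
\]
Here $D$, $W$, $d_t$, and $w_t$ are assumed symbols introduced solely for this illustration; an incremental update of this form is inexpensive per voxel and avoids any global optimization over the volume.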