OCTraN

Modern approaches for vision-centric environment perception for autonomous navigation make extensive use of self-supervised monocular depth estimation algorithms to predict the depth of a 3D scene using only a single camera image. However, when this depth map is projected onto 3D space, the errors in disparity are magnified, resulting in a depth estimation error that increases quadratically as the distance from the camera increases. Though Light Detection and Ranging (LiDAR) can solve this issue, it is expensive and not feasible for many applications. To address the challenge of accurate ranging with low-cost sensors, we propose a transformer architecture that uses iterative-attention to convert 2D image features into 3D occupancy features and makes use of convolution and transpose convolution to efficiently operate on spatial information. We also develop a self-supervised training pipeline to generalize the model to any scene by eliminating the need for LiDAR ground truth by substituting it with pseudo-ground truth labels obtained from neural radiance fields (NeRFs) and boosting monocular depth estimation. We will be making our code and dataset public

We gathered a dataset spanning 114 minutes and 165K frames in Bengaluru, India. Our dataset consists of video data from a calibrated camera sensor with a resolution of 1920×1080 recorded at a framerate of 30 Hz. We utilize a Depth Dataset Generation pipeline that only uses videos as input to produce high-resolution disparity maps

Aditya N G	Dhruval P B	Harshith M K	Priya S S	Dr. Surabhi Narayan
adityang5@gmail.com	dhruvalpb@gmail.com	hiharshith18@gmail.com	sspriya147@gmail.com	surabhinarayan@pes.edu

Bengaluru Driving Dataset

Short Presentation

Source Code

Paper and Bibtex

Related Projects

Say Hi!