Lightweight deep learning for monocular depth estimation
Abstract
Monocular depth estimation is a challenging yet important computer vision task with applications in many other fields. It aims to predict a relative depth map from a single input image. In the past, conventional methods could produce rough depth estimates; however, their accuracy was insufficient. In recent years, deep convolutional neural networks (DCNNs) have substantially improved estimation accuracy, but they do so at the expense of compute resources and runtime. This motivates the need for more lightweight solutions to the task.
In this thesis, we leverage recent advances in lightweight network design to reduce model complexity. Furthermore, we use conventional methods to improve the performance of lightweight networks. Specifically, we propose a novel lightweight network architecture with significantly reduced complexity compared to current methods while maintaining competitive accuracy. We propose an encoder-decoder architecture that uses DiCE units [47] to reduce the complexity of the encoder, together with a custom-designed decoder built from depthwise-separable convolutions, of the kind sketched below.
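The abstract does not fix the exact decoder layers; the following is a minimal PyTorch sketch of one depthwise-separable building block of the kind such a decoder could be composed of. The class name, channel sizes, and the upsampling stage are illustrative assumptions, not the thesis's actual implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A depthwise-separable convolution: a per-channel (depthwise) 3x3
    convolution followed by a 1x1 pointwise convolution that mixes channels.
    Compared to a standard 3x3 convolution, this factorization reduces
    parameters and multiply-adds by roughly a factor of 8-9."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, padding=1,
            groups=in_channels, bias=False)  # one filter per input channel
        self.pointwise = nn.Conv2d(
            in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# A hypothetical decoder stage: upsample, then refine with the block above.
stage = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    DepthwiseSeparableConv(128, 64),
)
x = torch.randn(1, 128, 30, 40)  # dummy encoder feature map
print(stage(x).shape)            # torch.Size([1, 64, 60, 80])
```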
Furthermore, we propose a novel lightweight self-supervised training framework that leverages conventional methods to remove the need for pose estimation inherent in current self-supervised approaches. Like current unsupervised and self-supervised methods, our method requires a pair of stereo images during training; however, we exploit this requirement to compute a ground-truth approximation, which eliminates the pose estimation that other self-supervised approaches rely on. Both our lightweight network and our self-supervised framework reduce the size and complexity of current state-of-the-art methods while maintaining competitive results in their respective areas.
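The abstract does not name the conventional method used to compute the ground-truth approximation; classical stereo matching is one plausible instantiation. The sketch below assumes rectified stereo pairs and OpenCV's semi-global block matcher; the file names and calibration values are hypothetical.

```python
import cv2
import numpy as np

# Semi-global block matching over a rectified stereo pair. numDisparities
# must be a multiple of 16; both values here are illustrative.
matcher = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,
    blockSize=5,
)

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# OpenCV returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# With known calibration (focal length f in pixels, baseline b in meters),
# disparity converts to depth via depth = f * b / disparity.
f, b = 721.5, 0.54           # hypothetical KITTI-like calibration values
valid = disparity > 0        # mask out unmatched pixels
depth = np.zeros_like(disparity)
depth[valid] = f * b / disparity[valid]
```

Because a classical matcher like this has no learned parameters and needs no camera pose, a depth map derived this way could supervise the network directly, which is consistent with the framework's avoidance of a pose-estimation network.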