Tian Zhang

and 4 more

Self-supervised monocular depth estimation (MDE) relies solely on sequential image frames for supervision. However, its performance in dynamic scenes is suboptimal, because the model's depth estimates for dynamic objects deteriorate over the course of training. This paper proposes a Depth Self-Inhibition (DSI) training framework that suppresses this deterioration, allowing the MDE model to maintain strong performance in dynamic scenes. The method consists of two stages: the raw stage and the self-inhibition stage. In the raw stage, we train the MDE model without considering dynamic objects, aiming to identify unreliable depth regions, that is, regions where the depth values change significantly during training. In the self-inhibition stage, we retrain the MDE model and use the novel Disparity Difference Mask (DD-mask) method to exclude these unreliable regions from the loss. Additionally, the Ground-contact-prior Disparity Smoothness Loss (GDS-Loss) is employed to supervise depth learning in these areas. In the same stage, we also use a novel Normal Distribution-based Loss (ND-Loss), which outperforms the L1 loss in dynamic scenes. For evaluation, we introduce the Ratio Median Scaling (RM-scaling) method to address scale ambiguity in the estimated depth, providing a more reliable performance evaluation than the existing median scaling method. Experimental results show that the DSI training framework can be conveniently applied to existing MDE models, significantly improving their performance on the Cityscapes and KITTI datasets, particularly in moving object regions. The code will be released.
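To make the scale-ambiguity step concrete, the sketch below contrasts the existing median scaling (the ratio of the ground-truth median to the predicted median) with one plausible reading of RM-scaling. The abstract does not define RM-scaling, so the function `ratio_median_scaling`, which takes the median of per-pixel ground-truth-to-prediction ratios, is an assumption made for illustration only and may differ from the authors' formulation. The depth values are invented toy numbers.

```python
from statistics import median


def median_scaling(pred, gt):
    """Existing median scaling: ratio of the two medians."""
    return median(gt) / median(pred)


def ratio_median_scaling(pred, gt):
    """ASSUMPTION: median of per-pixel ratios gt / pred.

    This is a guess at what "Ratio Median Scaling" could mean;
    the abstract does not give the actual definition.
    """
    return median(g / p for g, p in zip(gt, pred))


# Toy depths: three static pixels off by a factor of 2, plus one
# badly estimated pixel (e.g. on a moving object).
pred = [1.0, 2.0, 4.0, 10.0]
gt = [2.0, 4.0, 8.0, 5.0]

print(median_scaling(pred, gt))        # 1.5
print(ratio_median_scaling(pred, gt))  # 2.0
```

In this toy case the single outlier pixel pulls the ratio of medians away from the factor of 2 shared by the other pixels, while the median of ratios recovers it. Whether this is the effect the authors exploit cannot be confirmed from the abstract alone.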