The ability to perceive environments provides an important foundation for our self-developed robotic rat to improve its kinematic performance and application potential. However, most existing visual perception methods for quadruped robots suffer from poor accuracy in real-world dynamic environments. To mitigate erroneous data association, the main cause of this low accuracy, this work presents an approach that combines leg odometry (LO) and IMU measurements with VSLAM to provide robust localization for small-scale quadruped robots in challenging scenarios by estimating the depth map and removing moving objects in dynamic environments. The method incorporates a depth estimation network that achieves higher accuracy by combining the Transformer attention mechanism with the RAFT-Stereo depth estimation algorithm. It also combines object detection and segmentation with 3D projection of feature points to remove moving objects in dynamic environments. In addition, LO and IMU data are fused within a modified ORB-SLAM3 framework to achieve highly accurate localization. The proposed approach is robust against erroneous data association caused by moving objects and the body wobble of quadruped robots. Evaluation results at multiple stages demonstrate that the system performs competitively in dynamic environments, outperforming existing visual perception methods on both public benchmarks and our customized small-scale robotic rat.
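To make the moving-object removal step concrete, the Python sketch below filters feature points that fall on segmented dynamic objects and back-projects the survivors to 3D camera coordinates. This is a minimal illustration under assumed interfaces, not the paper's implementation: `DYNAMIC_CLASSES`, `filter_dynamic_keypoints`, and `backproject` are hypothetical names, and the back-projection uses the standard pinhole model rather than any paper-specific geometry.

```python
import numpy as np

# Hypothetical set of semantic labels treated as potentially moving
# (illustrative only; the actual label set depends on the segmentation model).
DYNAMIC_CLASSES = {1, 2}  # e.g., 1 = person, 2 = vehicle

def filter_dynamic_keypoints(keypoints, seg_mask, depth, margin=3):
    """Drop keypoints lying on (or within `margin` px of) dynamic objects.

    keypoints : (N, 2) float array of (u, v) pixel coordinates
    seg_mask  : (H, W) int array of per-pixel class labels
    depth     : (H, W) float array from the stereo depth network
    Returns the surviving keypoints and their depths.
    """
    h, w = seg_mask.shape
    keep = []
    for u, v in keypoints:
        ui, vi = int(round(u)), int(round(v))
        # Reject points outside the image or without a valid depth.
        if not (0 <= ui < w and 0 <= vi < h) or depth[vi, ui] <= 0:
            continue
        # Check a small window so points hugging an object boundary
        # are also discarded.
        u0, u1 = max(ui - margin, 0), min(ui + margin + 1, w)
        v0, v1 = max(vi - margin, 0), min(vi + margin + 1, h)
        if not np.isin(seg_mask[v0:v1, u0:u1], list(DYNAMIC_CLASSES)).any():
            keep.append((u, v, depth[vi, ui]))
    pts = np.asarray(keep) if keep else np.empty((0, 3))
    return pts[:, :2], pts[:, 2]

def backproject(uv, z, fx, fy, cx, cy):
    """Back-project pixels with depth to 3D camera coordinates
    using the standard pinhole model."""
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

In a pipeline of this kind, only the static 3D points returned here would be handed to the SLAM back end, so that feature tracks on moving objects never enter the pose-estimation data association.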