Manually labeling fine-grained objects is extremely challenging: it is not only labor-intensive but also requires professional knowledge. Accordingly, robust learning methods for fine-grained recognition with web images collected from the Internet of Things have drawn significant attention. However, training deep fine-grained models directly on untrusted web images faces two primary obstacles: 1) label noise in web images and 2) domain variance between the online sources and test datasets. In this study, we focus on addressing these two problems. Specifically, we introduce an end-to-end network that jointly addresses both concerns while separating trusted data from untrusted web images. To validate the efficacy of our proposed model, we first collect untrusted web images using the text category labels found in fine-grained datasets. We then employ the designed deep model to eliminate label noise and ameliorate domain mismatch, and use the selected trusted web data for model training. Comprehensive experiments and ablation studies show that our method consistently surpasses other state-of-the-art approaches on the fine-grained recognition task in a real-world scenario. This work also introduces a novel pipeline for fine-grained recognition with substantial efficacy in practical applications. The source code and models can be accessed at: https://github.com/NUST-Machine-Intelligence-Laboratory/DDN.