In complex environments such as night, fog, and battlefield camouflage, a single camera sensor cannot adequately capture scene information, and multi-source sensors can raise environmental awareness. The combination of infrared sensors and visible cameras can effectively address the problem of sensing such complex environments. However, the inputs from different sensors differ greatly, and how to fuse the information from the two sensors and apply it to a specific task remains a problem to be solved. To address this problem, we propose a multi-source input detection algorithm that combines infrared sensors and visible cameras, with the aim of overcoming the low detection accuracy of a single sensor in complex and changing environments. First, we design a differential feature enhancement (DFE) module to enhance features that are progressively degraded during network propagation. Second, we design a cross-modal fusion (CF) module to fuse features from multiple sources. Finally, we embed the designed modules into a generic two-stream network. Experiments on the publicly available FLIR and LLVIP datasets show that our algorithm improves mAP75 by 8.3/4.3 compared to a single-source detector. In some special environments, our modules occupy only 0.1 MB of additional space while providing a 1.7 mAP boost. Extensive ablation experiments demonstrate that the modules proposed in this paper are lightweight, efficient, and plug-and-play.
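The abstract does not give implementation details, but the following minimal PyTorch sketch illustrates one way such a two-stream pipeline could be wired. The `DFE` and `CF` blocks, the channel count, and the difference-based re-weighting are assumptions made purely for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn


class DFE(nn.Module):
    """Hypothetical differential feature enhancement: re-weights each stream
    using attention computed from the difference between modality features."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f_rgb, f_ir):
        diff = f_rgb - f_ir            # differential features between modalities
        w = self.attn(diff)            # attention weights derived from the difference
        return f_rgb + w * diff, f_ir + (1 - w) * diff


class CF(nn.Module):
    """Hypothetical cross-modal fusion: concatenate both streams and project back."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_rgb, f_ir):
        return self.fuse(torch.cat([f_rgb, f_ir], dim=1))


class TwoStreamFusion(nn.Module):
    """Toy two-stream trunk: one stem per modality, DFE between the streams,
    CF producing a fused map that a detection head (not shown) would consume."""
    def __init__(self, channels=64):
        super().__init__()
        self.stem_rgb = nn.Conv2d(3, channels, 3, stride=2, padding=1)  # visible branch
        self.stem_ir = nn.Conv2d(1, channels, 3, stride=2, padding=1)   # infrared branch
        self.dfe = DFE(channels)
        self.cf = CF(channels)

    def forward(self, rgb, ir):
        f_rgb, f_ir = self.stem_rgb(rgb), self.stem_ir(ir)
        f_rgb, f_ir = self.dfe(f_rgb, f_ir)
        return self.cf(f_rgb, f_ir)


if __name__ == "__main__":
    model = TwoStreamFusion()
    fused = model(torch.randn(1, 3, 512, 640), torch.randn(1, 1, 512, 640))
    print(fused.shape)  # torch.Size([1, 64, 256, 320])
```

Because both modules operate on feature maps of matching shape, a design of this kind can in principle be dropped between the backbone stages of any two-stream detector, which is consistent with the plug-and-play claim above.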