Pneumonia, a respiratory disease often caused by bacterial infection in the distal lung, necessitates prompt and precise diagnosis, particularly in critical care settings. Optical endomicroscopy (OEM) facilitates real-time acquisition of in vivo and in situ optical biopsies, thus expediting bacterial detection. Nonetheless, visually analysing the vast number of images generated by OEM in real time is challenging, potentially impeding timely intervention. To rapidly segment and detect bacteria, we propose EmiNet, a novel dual-stream network that integrates the capabilities of Transformers and Convolutional Neural Networks (CNNs) within an encoder-decoder architecture, simultaneously capturing local-global appearance and motion features. Within EmiNet, we introduce a multimodal cross-channel attention module that fuses motion features with appearance features. Furthermore, to compensate for the lack of annotated training data, we developed a synthetic dataset by simulating bacterial motion and superimposing the simulated bacteria onto real background frames devoid of bacteria. The authenticity of this dataset was confirmed through a Visual Turing Test in which medical experts assessed a mixture of synthetic and real bacterial images; the results indicate that the synthetic images are almost indistinguishable from the real ones. EmiNet's performance is evaluated on both real and synthetic datasets. Experiments show that EmiNet surpasses state-of-the-art segmentation models and achieves a 6.8% improvement in detection correlation over state-of-the-art bacteria detection algorithms.
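
To make the dual-stream fusion concrete, the sketch below illustrates one plausible form of a multimodal cross-channel attention module that re-weights each stream's channels using a descriptor from the other stream. This is a minimal illustration only: the abstract does not specify the module's exact formulation, and the squeeze-and-excitation-style gating, the class name `CrossChannelAttentionFusion`, and all parameters here are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossChannelAttentionFusion(nn.Module):
    """Illustrative fusion of appearance and motion feature maps via
    cross-channel attention. Each stream is re-weighted by a channel
    gate computed from the other stream (an assumed design; the paper's
    actual module may differ)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global "squeeze" over H x W
        # Gate derived from appearance features, applied to motion features
        self.gate_app = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        # Gate derived from motion features, applied to appearance features
        self.gate_mot = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
        # 1x1 convolution to project the concatenated streams back to `channels`
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = appearance.shape
        # Per-channel descriptors for each modality
        app_desc = self.pool(appearance).view(b, c)
        mot_desc = self.pool(motion).view(b, c)
        # Cross-modal gating: each stream is modulated by the other's descriptor
        motion_attended = motion * self.gate_app(app_desc).view(b, c, 1, 1)
        appearance_attended = appearance * self.gate_mot(mot_desc).view(b, c, 1, 1)
        # Concatenate and fuse into a single feature map
        return self.fuse(torch.cat([appearance_attended, motion_attended], dim=1))


if __name__ == "__main__":
    fusion = CrossChannelAttentionFusion(channels=64)
    appearance = torch.randn(2, 64, 32, 32)  # e.g. encoder appearance features
    motion = torch.randn(2, 64, 32, 32)      # e.g. motion (temporal-difference / flow) features
    fused = fusion(appearance, motion)
    print(fused.shape)  # torch.Size([2, 64, 32, 32])
```

In this sketch the fused map keeps the spatial resolution and channel count of the inputs, so it could drop into an encoder-decoder at any scale; where the actual EmiNet module sits and how it is parameterised should be taken from the paper itself.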