DUCFNet: Dual U-shaped Cross-modal Fusion Network for Lung Infection
Region Segmentation
Abstract
The further development of medical image segmentation increasingly demands
high-quality datasets. Regrettably, constructing such datasets faces two
major obstacles: the difficulty of acquiring usable medical images and the
financial burden of data annotation. To overcome these difficulties, we
leverage medical text data to compensate for the shortcomings of existing
image datasets. In this work, we propose a dual U-shaped network that
achieves thorough cross-modal fusion of image and text features.
Specifically, one U-shaped branch, named U-CNN, is based on a convolutional
neural network; it mainly extracts global image features and generates the
final prediction. The other branch, named U-ViT, is built from vision
transformer blocks and is responsible for processing the text information
and merging the text features with the image features from U-CNN.
Additionally, we equip the skip connections of U-CNN with a Cross-Attention
Channel Fusion module and a Channel-wise Dual-branch Cross Fusion module,
which help bridge the semantic gaps and further integrate cross-modal
information. Experimental results on two lung infection image datasets of
different modalities (X-ray and CT) show that our method achieves excellent
performance compared with state-of-the-art alternatives.
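
To make the dual-branch idea concrete, the following is a minimal,
self-contained PyTorch sketch of an image-text segmenter that fuses a CNN
image path with a text path via cross-attention at the skip level. It is our
own illustrative code, not the authors' DUCFNet implementation: the module
names (CrossAttentionChannelFusion, DualBranchSegmenter), layer sizes, and
the single cross-attention fusion step are all assumptions.

import torch
import torch.nn as nn

class CrossAttentionChannelFusion(nn.Module):
    """Fuse image features with text features via cross-attention (assumed form)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_feat, txt_feat):
        # img_feat: (B, C, H, W) -> tokens (B, H*W, C); txt_feat: (B, L, C)
        b, c, h, w = img_feat.shape
        tokens = img_feat.flatten(2).transpose(1, 2)
        fused, _ = self.attn(query=tokens, key=txt_feat, value=txt_feat)
        tokens = self.norm(tokens + fused)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class DualBranchSegmenter(nn.Module):
    """Toy dual-branch model: a CNN image path plus a text path, fused at the skip."""
    def __init__(self, txt_vocab=1000, dim=64, num_classes=1):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU())
        self.txt_embed = nn.Embedding(txt_vocab, dim)
        self.txt_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)
        self.fuse = CrossAttentionChannelFusion(dim)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 2, stride=2), nn.ReLU(),
            nn.Conv2d(dim, num_classes, 1))

    def forward(self, image, text_ids):
        img_feat = self.enc(image)                          # (B, dim, H/2, W/2)
        txt_feat = self.txt_enc(self.txt_embed(text_ids))   # (B, L, dim)
        fused = self.fuse(img_feat, txt_feat)               # cross-modal fusion
        return self.dec(fused)                              # segmentation logits

# Usage: a 1-channel 64x64 image paired with a short text report (token ids).
model = DualBranchSegmenter()
mask_logits = model(torch.randn(2, 1, 64, 64), torch.randint(0, 1000, (2, 16)))
print(mask_logits.shape)  # torch.Size([2, 1, 64, 64])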