DSAN: Exploring the Relationship between Deformable Convolution and Spatial Attention

Zewen Yu; Xiaoqin Zhang; Li Zhao; Guobao Xiao

doi:10.36227/techrxiv.171328860.02479778/v1

Abstract

Deformable convolutional neural networks have recently become common in computer vision tasks and achieve remarkable results. The existing method, DCNv3, targets heavyweight models rather than lightweight ones. Heavyweight models are unsuitable for small computing devices, whose hardware limits them to lightweight convolutional neural networks (CNNs). In this article, we focus on applying the DCNv3 operation to lightweight CNNs. To explore the performance of lightweight CNNs based on DCNv3, we conduct experiments and find that DCNv3 cannot fully exploit its advantages in lightweight CNNs because of sparse sampling. The traditional remedy of increasing the kernel size raises the computational load, which makes it unsuitable. We address this dilemma at two levels: the core operation and the visual feature extraction module. At the core operation level, we propose Deformable Strip Convolution (DSCN). As a simplified version of DCNv3 with a large kernel, DSCN requires only 63.2% of the original's computational load with respect to the deformable sampling method. DSCN further avoids a quadratic increase in computational load with kernel size by restricting the deformable sampling kernels to a single axis. At the visual feature extraction module level, we propose Deformable Spatial Attention (DSA), constructed from DSCN, as a replacement for DCNv3. Specifically, we observe that the modulation mask branch in DCNv3 resembles spatial attention, and based on this similarity we use spatial attention in place of the modulation mask branch to reduce parameters and memory consumption. Finally, to verify the effectiveness of our improved design, we propose a lightweight CNN backbone named DSAN. Extensive experiments show that DSA's inference speed is 2.1 times faster than that of DCNv3 with a large kernel. In dense prediction tasks such as semantic segmentation, DSAN-S with a lightweight decoder achieves 48.8% mIoU on ADE20K, which is higher than the result of InternImage-T based on DCNv3 with a heavyweight decoder, while using only 35.0% of its parameters and 9.1% of its computation. Our code is available at https://github.com/MarcYugo/DSAN-Deformable-Spatial-Attention.
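
The claim that single-axis kernels avoid a quadratic increase in computational load can be illustrated with a simple count of sampling points per output location. The sketch below is illustrative only and is not taken from the DSAN code: the function names are hypothetical, and it assumes a strip design built from two length-k strips (one per axis), which the abstract does not confirm. It does not attempt to reproduce the reported 63.2% figure, which depends on details not given here.

```python
def dense_sampling_points(k: int) -> int:
    """Sampling points per output location for a full k x k deformable kernel."""
    return k * k


def strip_sampling_points(k: int, num_strips: int = 2) -> int:
    """Sampling points when each kernel is a length-k strip along one axis.

    num_strips=2 (one horizontal, one vertical strip) is an assumption
    made for illustration, not a detail stated in the abstract.
    """
    return num_strips * k


if __name__ == "__main__":
    # Dense kernels grow quadratically with k; strips grow linearly.
    for k in (3, 5, 7, 11):
        dense = dense_sampling_points(k)
        strip = strip_sampling_points(k)
        print(f"k={k:2d}  dense={dense:3d}  strip={strip:2d}")
```

Under these assumptions, the dense count grows as k² while the strip count grows as 2k, so the gap widens as the kernel size increases.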