Convolutional Non-local Spatial-Temporal Learning for Multi-Modality
Action Recognition
Abstract
Deep convolutional networks (ConvNets) have shown that RGB and depth modalities are
complementary for video action recognition. However, a single ConvNet struggles to
capture the underlying relationships and complementary features between these two
modalities, which limits further improvement in recognition accuracy. In this paper,
we propose a novel two-stream ConvNet for multi-modality action recognition that
jointly optimizes the extraction of global features from RGB and depth sequences.
Specifically, a non-local multi-modality compensation block (NL-MMCB) is introduced
to learn semantically fused features that improve recognition performance.
Experimental results on two multi-modality human action datasets, NTU RGB+D 120 and
PKU-MMD, verify the effectiveness of the proposed recognition framework and demonstrate
that the NL-MMCB learns complementary features and improves recognition accuracy.
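To make the fusion idea concrete, the following is a minimal sketch of a cross-modality
non-local block, assuming the NL-MMCB follows the standard embedded-Gaussian non-local
formulation (Wang et al., 2018) with queries taken from one modality and keys/values from
the other. The class name, the channel reduction factor, and the residual connection are
illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn


class CrossModalNonLocalBlock(nn.Module):
    """Hypothetical sketch of a non-local block fusing RGB and depth features."""

    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        # 1x1 convolutions project features into query/key/value embeddings.
        self.query = nn.Conv2d(channels, inter, kernel_size=1)   # from the RGB stream
        self.key = nn.Conv2d(channels, inter, kernel_size=1)     # from the depth stream
        self.value = nn.Conv2d(channels, inter, kernel_size=1)   # from the depth stream
        self.out = nn.Conv2d(inter, channels, kernel_size=1)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, rgb_feat, depth_feat):
        b, c, h, w = rgb_feat.shape
        q = self.query(rgb_feat).flatten(2).transpose(1, 2)    # (B, HW, C')
        k = self.key(depth_feat).flatten(2)                     # (B, C', HW)
        v = self.value(depth_feat).flatten(2).transpose(1, 2)   # (B, HW, C')
        # Pairwise affinity between every RGB position and every depth position.
        attn = self.softmax(torch.bmm(q, k))                    # (B, HW, HW)
        fused = torch.bmm(attn, v).transpose(1, 2).reshape(b, -1, h, w)
        # Residual connection: compensate the RGB features with depth context.
        return rgb_feat + self.out(fused)


if __name__ == "__main__":
    block = CrossModalNonLocalBlock(channels=256)
    rgb = torch.randn(2, 256, 14, 14)
    depth = torch.randn(2, 256, 14, 14)
    print(block(rgb, depth).shape)  # torch.Size([2, 256, 14, 14])
```

In a two-stream setup, such a block would typically be inserted at an intermediate stage
of each stream (and mirrored with queries from depth and keys/values from RGB) so that
each modality is compensated by global context from the other before the final fusion.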