Evaluating Significant Features in Context-Aware Multimodal Emotion
Recognition with XAI Methods
Abstract
Analyzing human emotions from multimodal data to support critical decisions is an emerging area of research. The evolution of deep learning algorithms has improved the potential for extracting value from multimodal data. However, these algorithms often do not explain how their outputs are derived from the data. This study focuses on the risks of using black-box deep learning models for critical tasks, such as emotion recognition, and argues that human-understandable interpretations of these models are essential. The study utilizes one of the largest multimodal datasets available, CMU-MOSEI.
Many researchers have used the pre-extracted features provided by the
CMU Multimodal SDK with black-box deep learning models, making it
difficult to interpret the contribution of individual features. This
study examines the implications of individual features from the audio,
video, and text modalities in Context-Aware Multimodal Emotion
Recognition, and describes the process of curating reduced-feature
models using the GradientSHAP XAI method. These reduced models, built
from the most highly contributing features, achieve results comparable
to, and in some cases better than, their corresponding all-feature
models as well as the baseline GraphMFN model, showing that carefully
selecting significant features can improve model robustness and
performance and, in turn, make the model more trustworthy.
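
As a rough illustration of the feature-selection step summarized above, the following is a minimal sketch of how GradientSHAP attributions could be computed and used to rank input features, assuming a PyTorch classifier and the Captum implementation of GradientShap. The model architecture, feature dimension, target emotion index, and top-k cutoff shown here are hypothetical placeholders, not the exact configuration used in the study.

```python
# Minimal sketch: ranking fused multimodal features with GradientSHAP (Captum).
# The model, dimensions, and cutoff below are illustrative placeholders.
import torch
import torch.nn as nn
from captum.attr import GradientShap

# Toy stand-in for a multimodal emotion classifier operating on a fused
# feature vector (e.g., concatenated audio/video/text features).
model = nn.Sequential(nn.Linear(74, 64), nn.ReLU(), nn.Linear(64, 6))
model.eval()

inputs = torch.randn(32, 74)    # batch of fused feature vectors
baselines = torch.zeros(5, 74)  # reference samples for SHAP baselines

gs = GradientShap(model)
# Attribute one emotion logit (index 0 here) to each input feature.
attr = gs.attribute(inputs, baselines=baselines, target=0, n_samples=20)

# Aggregate absolute attributions over the batch and rank features.
importance = attr.abs().mean(dim=0)
ranked = torch.argsort(importance, descending=True)
top_k = ranked[:20]  # retain only the most highly contributing features
print("Most influential feature indices:", top_k.tolist())
```

In practice, the retained feature indices would be used to build a reduced-feature model, whose performance can then be compared against the corresponding all-feature model.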