Analysis of human emotions from multimodal data for making critical decisions is an emerging area of research. The evolution of deep learning algorithms has improved the potential for extracting value from multimodal data; however, these algorithms often do not explain how particular outputs are produced from the data. This study focuses on the risks of using black-box deep learning models for critical tasks such as emotion recognition, and argues that human-understandable interpretations of these models are essential. It utilizes one of the largest multimodal datasets available, CMU-MOSEI. Many researchers have used the pre-extracted features provided by the CMU Multimodal SDK with black-box deep learning models, making it difficult to interpret the contribution of individual features. This study examines the implications of individual features from the audio, video, and text modalities in Context-Aware Multimodal Emotion Recognition, and describes the process of curating reduced-feature models using the GradientSHAP XAI method. These reduced models, built from the most highly contributing features, achieve comparable and in some cases better results than both their corresponding all-feature models and the baseline GraphMFN model, demonstrating that carefully selecting significant features can improve model robustness and performance and, in turn, make the models more trustworthy.
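To make the GradientSHAP-based feature-selection idea concrete, the minimal sketch below shows how per-feature attributions can be computed with Captum and used to keep only the top-k features; it is not the authors' pipeline, and the stand-in model, tensor shapes, feature dimension, and top-k threshold are illustrative assumptions rather than settings from the paper.

```python
# Illustrative sketch (not the authors' code): rank pre-extracted multimodal
# features by GradientSHAP attributions and keep the top-k for a reduced model.
import torch
import torch.nn as nn
from captum.attr import GradientShap

torch.manual_seed(0)

n_features = 74   # assumed size of a flattened per-sample feature vector
n_classes = 6     # six emotion classes, as in CMU-MOSEI emotion labels

# Hypothetical stand-in classifier over a single flattened feature vector.
model = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, n_classes))
model.eval()

inputs = torch.randn(32, n_features)     # placeholder for real pre-extracted features
baselines = torch.zeros(8, n_features)   # baseline distribution for GradientSHAP

explainer = GradientShap(model)
attributions = explainer.attribute(inputs, baselines=baselines, n_samples=20, target=0)

# Aggregate attributions over the batch and rank features by mean |attribution|.
importance = attributions.abs().mean(dim=0)
top_k = 20
top_features = torch.topk(importance, k=top_k).indices

# A reduced-feature model would then be retrained on inputs[:, top_features].
print(top_features.tolist())
```

In this sketch the attribution is computed for a single target class (target=0); in practice one would aggregate attributions across all emotion classes, or repeat the ranking per class, before deciding which features to retain.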