As multimodal data proliferates, emotion analysis has advanced by integrating information from diverse modalities. This study introduces CIME (Contextual Interaction-Based Multimodal Emotion Analysis with Enhanced Semantic Information), a spatiotemporal interaction network that leverages enhanced semantic information to improve the accuracy and robustness of emotion analysis along both semantic and contextual dimensions. The model combines attention mechanisms and graph convolutional networks: a cross-attention-based semantic interaction module enriches textual semantic understanding, and a graph-convolution-based spatial interaction module captures the contextual relationships among speakers. Together, these components allow the model to mine latent associations within multimodal emotional data. In extensive evaluations on the public IEMOCAP and MOSEI datasets, CIME outperforms existing methods on multimodal emotion classification. Modality ablation studies and comparisons of fusion strategies further confirm the model's effectiveness and adaptability, offering new insights and methodology for multimodal emotion analysis. Code supporting this study is available at https://github.com/gcp666/CIME.
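To make the two interaction modules concrete, the following minimal PyTorch sketch illustrates how a cross-attention semantic interaction step and a graph-convolution step over speaker context could be wired together. It is an illustrative assumption only: the class names, dimensions, and adjacency construction are hypothetical, and the official implementation is in the repository linked above.

```python
# Minimal sketch of the two interaction ideas described in the abstract.
# NOT the authors' implementation; all names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Cross-attention: text features attend over another modality (e.g., audio/visual)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # text: (batch, seq_text, dim), other: (batch, seq_other, dim)
        enriched, _ = self.attn(query=text, key=other, value=other)
        return enriched + text  # residual keeps the original textual semantics


class SpeakerGraphConv(nn.Module):
    """One graph-convolution step over utterance nodes linked by speaker/context edges."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (num_utterances, dim), adj: normalized (num_utterances, num_utterances)
        return torch.relu(self.linear(adj @ node_feats))
```

Under this sketch, cross-attention enriches each textual utterance with cues from the other modalities, and the graph convolution then propagates those enriched features along speaker/context edges so that each utterance's representation reflects its conversational neighborhood.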