This study examines multimodal emotion detection using the RAVDESS dataset. The dataset, which contains audio-visual recordings of emotional speech from 24 professional actors labeled with eight distinct emotions, was adapted to focus on five emotions (calm, angry, neutral, sad, and disgust). This reduction supports a practical model for workplace applications, potentially reducing the customer-service resources required for emotion detection. Additionally, noise was introduced into the video and speech samples to better reflect real-world conditions. Within this multimodal framework, two unimodal models were developed and optimized. The first, combining mel spectrograms with convolutional neural networks (CNNs) and long short-term memory (LSTM) networks, achieved a speech-based emotion detection accuracy of 75.93%. The second, designed for video-based emotion detection, used a DenseNet CNN architecture and attained an accuracy of 71.06%. Applying a late-fusion softmax-averaging technique raised the combined accuracy to 89.12%. However, because the models were evaluated on a single dataset, their generalizability remains limited; future work incorporating additional datasets may improve practical relevance.
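The noise-injection and mel-spectrogram preprocessing for the speech branch could be implemented along the following lines. This is a minimal sketch: the sampling rate handling, the 20 dB SNR target, and the 128 mel bands are illustrative assumptions, not the study's reported settings.

```python
import numpy as np
import librosa

def noisy_log_mel(path, snr_db=20.0, n_mels=128):
    """Load a speech clip, inject white Gaussian noise at a target
    signal-to-noise ratio, and return a log-mel spectrogram.
    snr_db and n_mels are illustrative defaults, not the paper's values."""
    y, sr = librosa.load(path, sr=None)
    # Scale the noise so that 10 * log10(P_signal / P_noise) == snr_db.
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    y_noisy = y + np.sqrt(noise_power) * np.random.randn(len(y))
    mel = librosa.feature.melspectrogram(y=y_noisy, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)
```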
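A CNN + LSTM speech classifier over such spectrograms might look like the minimal Keras sketch below; all layer counts and sizes are assumptions, since the abstract does not specify the exact architecture. The convolutions extract per-frame frequency features, which the LSTM then summarizes over time before a five-way softmax.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_speech_model(n_frames=128, n_mels=128, n_classes=5):
    """Illustrative CNN + LSTM over log-mel frames; layer sizes
    are assumptions, not the study's reported configuration."""
    inputs = layers.Input(shape=(n_frames, n_mels, 1))
    # Pool only the frequency axis so the time axis is preserved.
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    # One feature vector per time frame, then temporal modeling.
    x = layers.Reshape((n_frames, -1))(x)
    x = layers.LSTM(128)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```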
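Late-fusion softmax averaging itself is straightforward: each unimodal model emits a probability distribution over the five emotions, the two distributions are averaged, and the arg-max of the average is the fused prediction. A minimal sketch with made-up probability vectors:

```python
import numpy as np

EMOTIONS = ["calm", "angry", "neutral", "sad", "disgust"]

# Hypothetical softmax outputs for one clip from each unimodal model.
speech_probs = np.array([0.10, 0.55, 0.15, 0.12, 0.08])
video_probs = np.array([0.20, 0.40, 0.10, 0.25, 0.05])

# Late fusion by softmax averaging: mean of the two distributions.
fused = (speech_probs + video_probs) / 2.0
print(EMOTIONS[int(np.argmax(fused))])  # -> "angry"
```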