Emotion detection plays a critical role in diagnosing mental illnesses, supporting people with learning disabilities, helping businesses gauge customer sentiment, and enhancing human-computer interaction. However, despite recent progress and the high accuracy that neural networks have achieved, their high complexity and parameter counts make them difficult to deploy reliably in practical applications. This paper proposes a new multimodal approach to emotion detection that uses complementary audio and video stimuli. The proposed model benefits from two stimuli that directly complement each other, yielding relatively high accuracy, while remaining lightweight enough to be readily applied in real-world settings. As a proof of concept, the RAVDESS dataset was used to build a late fusion model in which a KNN classifier merges audio-based and video-based emotion predictions. The audio predictions were generated by a CNN-LSTM model, and the video predictions by a 3D CNN model. The resulting approach achieved 70.12% accuracy, comparable to previous studies, which report accuracies in the range of 65.71% to 86.70%. Additionally, the proposed model uses only 1.4 million parameters, compared with the 6.1 to 11.3 million used in previous studies. This substantial reduction in required computing power makes the proposed model more readily accessible for industrial applications.
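To illustrate the late-fusion idea summarized above, the following is a minimal sketch in which a KNN classifier merges per-modality emotion predictions. The variable names, the use of concatenated softmax outputs as the fused representation, the random placeholder data, and the choice of k = 5 are assumptions for illustration only, not the paper's actual configuration or results.

```python
# Hypothetical sketch of KNN late fusion over per-modality emotion predictions.
# Shapes and settings are illustrative assumptions, not the paper's pipeline.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_clips, n_classes = 1000, 8  # RAVDESS annotates 8 emotion classes

# Stand-ins for the softmax outputs of the audio CNN-LSTM and video 3D CNN.
audio_probs = rng.dirichlet(np.ones(n_classes), size=n_clips)
video_probs = rng.dirichlet(np.ones(n_classes), size=n_clips)
labels = rng.integers(0, n_classes, size=n_clips)

# Late fusion: concatenate the two prediction vectors for each clip and let a
# KNN classifier learn how to merge them into a final emotion label.
fused = np.hstack([audio_probs, video_probs])
X_train, X_test, y_train, y_test = train_test_split(
    fused, labels, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("fusion accuracy:", accuracy_score(y_test, knn.predict(X_test)))
```

Because the fusion step operates only on the low-dimensional prediction vectors rather than raw audio or video frames, it adds essentially no parameters on top of the two unimodal models, which is consistent with the lightweight design goal stated above.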