Speech Emotion Recognition Using Denoised Speech Signals Based on
Dual-Tree Complex Wavelet Transform and 1-D Convolutional Neural Networks
Abstract
People use speech to express their feelings, thoughts, wishes, and
opinions to others. Speech emotion recognition (SER) is a technology
designed to analyze human voices to comprehend the speaker’s emotional
condition. It has recently gained attention from researchers in signal
processing, human-computer interaction, and natural language
processing. This paper
proposes a four-step method for recognizing emotions from speech sounds.
Firstly, speech audio signals are preprocessed. Secondly, Dual-Tree
Complex Wavelet Transform (DTCWT) is applied to the audio signals to
remove noise. Thirdly, features are extracted from the denoised signals
using One-Dimensional Convolutional Neural Networks (1D-CNN). Finally,
the resulting feature vectors are classified using algorithms such as
Support Vector Machines (SVM) with different kernels and Random Forest.
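The denoising step can be made concrete with the minimal sketch below.
It assumes the open-source `dtcwt` Python package and a standard
MAD-based universal soft-thresholding rule; the function name,
decomposition depth, and threshold choice are illustrative assumptions,
not the exact implementation used in this paper.

```python
import numpy as np
import dtcwt  # third-party package: pip install dtcwt


def dtcwt_denoise(signal: np.ndarray, nlevels: int = 5) -> np.ndarray:
    """Soft-threshold the DTCWT highpass coefficients, then reconstruct."""
    transform = dtcwt.Transform1d()
    pyramid = transform.forward(signal, nlevels=nlevels)
    # Estimate the noise level from the finest-scale coefficients (MAD rule),
    # then apply the universal threshold sigma * sqrt(2 ln N).
    sigma = np.median(np.abs(pyramid.highpasses[0])) / 0.6745
    thresh = sigma * np.sqrt(2.0 * np.log(len(signal)))
    for hp in pyramid.highpasses:
        mag = np.abs(hp)
        # Shrink complex magnitudes toward zero while preserving phase.
        hp *= np.maximum(mag - thresh, 0.0) / np.maximum(mag, 1e-12)
    return np.asarray(transform.inverse(pyramid)).ravel()


# Example on a synthetic noisy tone; the length is a power of two so every
# decomposition level sees an even-length input.
t = np.arange(2 ** 14) / 16000.0
noisy = np.sin(2 * np.pi * 440.0 * t) + 0.3 * np.random.randn(t.size)
denoised = dtcwt_denoise(noisy)
```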
This study used two widely used audio datasets for emotion recognition:
EMO-DB and IEMOCAP. The DTCWT & L-SVM & 1D-CNN combination achieved
90.38% accuracy on the EMO-DB dataset, while the DTCWT & Q-SVM & 1D-CNN
combination achieved 85.79% accuracy on the IEMOCAP dataset.
The experimental findings validate the effectiveness of the proposed
architecture in handling SER tasks.
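As a rough illustration of the feature-extraction and classification
steps, the sketch below assumes PyTorch for the 1D-CNN and scikit-learn
for the SVM. The network architecture, feature dimension, and kernel
choice (e.g., kernel="linear" as a stand-in for L-SVM) are illustrative
assumptions rather than the configuration evaluated above.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC


class FeatureCNN(nn.Module):
    """A small 1D-CNN used as a fixed-length feature extractor (step 3)."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(32, feat_dim, kernel_size=9, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global pooling -> one vector per clip
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # (batch, feat_dim)


def extract_features(cnn, clips) -> np.ndarray:
    """Run each (denoised) clip through the CNN; no gradients needed."""
    with torch.no_grad():
        feats = [cnn(torch.tensor(c, dtype=torch.float32).view(1, 1, -1))
                 for c in clips]
    return torch.cat(feats).numpy()


# Step 4: classify the feature vectors. In the full pipeline the clips
# would be the DTCWT-denoised signals; stand-in data is used here.
rng = np.random.default_rng(0)
clips = [rng.standard_normal(2 ** 14) for _ in range(8)]  # stand-in audio
labels = [0, 1, 0, 1, 0, 1, 0, 1]                         # stand-in emotions
X = extract_features(FeatureCNN(), clips)
clf = SVC(kernel="linear").fit(X, labels)  # kernel="poly", degree=2 ~ Q-SVM
print(clf.predict(X[:2]))
```

Global average pooling keeps the feature vector length fixed regardless
of clip duration, which is what allows a single SVM to operate on
variable-length recordings.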