Speech Emotion Recognition Using Denoised Speech Signals Based on
Dual-Tree Complex Wavelet Transform and 1-D Convolutional Neural Networks
Abstract
People use speech to express their feelings, thoughts, wishes, and
opinions to others. Speech emotion recognition (SER) is a technology
designed to analyze human voices to comprehend the speaker’s emotional
condition. It has recently gained attention from researchers in signal
processing, human-computer interaction, and natural language
processing. This paper
proposes a four-step method for recognizing emotions from speech sounds.
Firstly, speech audio signals are preprocessed. Secondly, Dual-Tree
Complex Wavelet Transform (DTCWT) is applied to the audio signals to
remove noise. Thirdly, features are extracted from the denoised signals
using One-Dimensional Convolutional Neural Networks (1D-CNN). Finally,
the resulting feature vectors are classified using algorithms such as
Support Vector Machines (SVM) with different kernels and Random Forest.
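The denoising step can be made concrete with the minimal sketch below.
It assumes the open-source `dtcwt` Python package and a standard
MAD-based universal soft-thresholding rule; the function name,
decomposition depth, and threshold choice are illustrative assumptions,
not the exact implementation used in this paper.

```python
import numpy as np
import dtcwt  # third-party package: pip install dtcwt


def dtcwt_denoise(signal: np.ndarray, nlevels: int = 5) -> np.ndarray:
    """Soft-threshold the DTCWT highpass coefficients, then reconstruct."""
    transform = dtcwt.Transform1d()
    pyramid = transform.forward(signal, nlevels=nlevels)
    # Estimate the noise level from the finest-scale coefficients (MAD rule),
    # then apply the universal threshold sigma * sqrt(2 ln N).
    sigma = np.median(np.abs(pyramid.highpasses[0])) / 0.6745
    thresh = sigma * np.sqrt(2.0 * np.log(len(signal)))
    for hp in pyramid.highpasses:
        mag = np.abs(hp)
        # Shrink complex magnitudes toward zero while preserving phase.
        hp *= np.maximum(mag - thresh, 0.0) / np.maximum(mag, 1e-12)
    return np.asarray(transform.inverse(pyramid)).ravel()


# Example on a synthetic noisy tone; the length is a power of two so every
# decomposition level sees an even-length input.
t = np.arange(2 ** 14) / 16000.0
noisy = np.sin(2 * np.pi * 440.0 * t) + 0.3 * np.random.randn(t.size)
denoised = dtcwt_denoise(noisy)
```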
This study used two widely used audio datasets for emotion recognition:
EMO-DB and IEMOCAP. The DTCWT & L-SVM & 1D-CNN combination achieved
90.38% accuracy on the EMO-DB dataset, while the DTCWT & Q-SVM & 1D-CNN
combination achieved 85.79% accuracy on the IEMOCAP dataset.
The experimental findings validate the effectiveness of the proposed
architecture in handling SER tasks.
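As a rough illustration of the feature-extraction and classification
steps, the sketch below assumes PyTorch for the 1D-CNN and scikit-learn
for the SVM. The network architecture, feature dimension, and kernel
choice (e.g., kernel="linear" as a stand-in for L-SVM) are illustrative
assumptions rather than the configuration evaluated above.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import SVC


class FeatureCNN(nn.Module):
    """A small 1D-CNN used as a fixed-length feature extractor (step 3)."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2), nn.ReLU(),
            nn.Conv1d(32, feat_dim, kernel_size=9, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global pooling -> one vector per clip
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # (batch, feat_dim)


def extract_features(cnn, clips) -> np.ndarray:
    """Run each (denoised) clip through the CNN; no gradients needed."""
    with torch.no_grad():
        feats = [cnn(torch.tensor(c, dtype=torch.float32).view(1, 1, -1))
                 for c in clips]
    return torch.cat(feats).numpy()


# Step 4: classify the feature vectors. In the full pipeline the clips
# would be the DTCWT-denoised signals; stand-in data is used here.
rng = np.random.default_rng(0)
clips = [rng.standard_normal(2 ** 14) for _ in range(8)]  # stand-in audio
labels = [0, 1, 0, 1, 0, 1, 0, 1]                         # stand-in emotions
X = extract_features(FeatureCNN(), clips)
clf = SVC(kernel="linear").fit(X, labels)  # kernel="poly", degree=2 ~ Q-SVM
print(clf.predict(X[:2]))
```

Global average pooling keeps the feature vector length fixed regardless
of clip duration, which is what allows a single SVM to operate on
variable-length recordings.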