Bin Zhang - 21DOCS Test Area

Sound event localization is an important subtask of two-dimensional direction-of-arrival (2D-DOA) estimation: it estimates the azimuth and elevation angles of each active sound event class in an audio segment using multi-label regression in 3D Cartesian coordinates. The main problem with the traditional multiple signal classification (MUSIC) algorithm and the existing baseline convolutional recurrent neural network (BCRNN) is low precision and heavy computation in low signal-to-noise ratio (SNR) environments. In particular, the MUSIC algorithm becomes completely distorted when the SNR is lower than -5 dB. We therefore design an effective residual self-attention recurrent neural network (ESRNN), not only to overcome the distortion of the traditional MUSIC algorithm at low SNR but also to further reduce the predicted 2D-DOA error in reverberant environments with different SNRs. Two different filter structures, ESRNN-L and ESRNN-G, are designed to specifically target the regimes in which the SNR is above 0 dB or below -5 dB. To verify efficiency and robustness, we train and test our model on the TAU Spatial Sound Events 2019 datasets, from which phase and magnitude spectrograms are extracted, with synthetic SNRs ranging from -10 dB to 30 dB. The experimental results show that the best 2D-DOA error of ESRNN-L is 21% lower than that of BCRNN when the SNR is less than -5 dB, and that ESRNN-G achieves a 15% lower error with 10% fewer parameters when the SNR is higher than 0 dB.
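
As an illustration of two ideas mentioned above, regressing DOA in 3D Cartesian coordinates and synthesizing data at a chosen SNR, the following minimal Python sketch may help. It is not taken from the ESRNN implementation. The function names, the axis convention (x forward, y left, z up, angles in degrees), and the mean-power definition of SNR are all assumptions made for this example.

```python
import math

# Hypothetical helpers for illustration only; the axis convention and
# the SNR definition below are assumptions, not the paper's code.

def angles_to_cartesian(azimuth_deg, elevation_deg):
    """Map (azimuth, elevation) in degrees to a unit vector (x, y, z)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))

def cartesian_to_angles(x, y, z):
    """Recover (azimuth, elevation) in degrees from a nonzero vector."""
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return azimuth, elevation

def angular_error_deg(u, v):
    """Angle in degrees between two nonzero direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))

def noise_gain_for_snr(signal, noise, snr_db):
    """Gain for `noise` so that mean signal power / mean scaled-noise
    power equals the requested SNR in dB."""
    p_signal = sum(s * s for s in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    return math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))

if __name__ == "__main__":
    target = angles_to_cartesian(30.0, 20.0)
    estimate = angles_to_cartesian(35.0, 20.0)
    print(cartesian_to_angles(*target))
    print(angular_error_deg(target, estimate))
    print(noise_gain_for_snr([1.0, -1.0, 1.0, -1.0],
                             [0.5, -0.5, 0.5, 0.5], -10.0))
```

Under these assumptions, a regression network would output the three Cartesian components per sound class, and the angles would be recovered afterwards with `cartesian_to_angles`. Regressing unit vectors instead of raw angles avoids the discontinuity where azimuth wraps around at ±180°.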