Application II: REAL signal from speakers' throats
More generally, REAL can directly capture the vibrations of the throat surface (Figure 4a). We note that our laser power is far below the maximum permissible exposure for skin specified in IEC 60825-1:2014 and can therefore be considered safe. In the simulated cocktail-party environment, Figure 4b shows the STFT spectrograms from the microphone on the listener and from the REAL system (acting as the listener). The speaker's speech cannot be distinguished in the listener's microphone recording, whereas the REAL signal resembles the speaker's speech, but only in the lower frequencies (Figure 4d) owing to throat filtering. In addition, the REAL audio lacks unvoiced components (such as /sh/) because unvoiced sounds are generated in the mouth rather than the throat. Directly understanding the REAL audio without priors is therefore difficult. However, the information captured by REAL should be adequate to recover the content, because the positions of the vocal resonances, their timing characteristics, and the speech pattern are intrinsically correlated in this multi-modality space. To recover human-understandable speech, we propose a data-driven model (Figure 4c) operating on the STFT spectrogram to learn the mapping from the REAL signal to the ground-truth audio. A convolutional neural network (CNN) captures both the frequency and short-time temporal patterns of human voice commands, with an additional long short-term memory (LSTM) network to correlate information over longer intervals in speech. The model is expected to learn to enhance the high-frequency texture and supplement the missing unvoiced information. The details of the proposed model are described in Methods. The result is presented in Figure 4d, where the recovered audio from REAL is substantially enhanced compared with Figure 4b. A video demonstrating this experiment is provided in Supplementary Video 2.
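The STFT spectrogram that serves as the model's input and output representation can be sketched in a few lines of numpy. The window length, hop size, and sampling rate below are illustrative placeholders, not the parameters used in this work; the low-frequency test tone merely mimics the band-limited character of the throat-coupled signal.

```python
import numpy as np

def stft_magnitude(x, n_fft=512, hop=128):
    """Magnitude STFT: frames the signal with a Hann window and takes
    the one-sided FFT of each frame, yielding a (freq, time) array
    like the spectrograms the recovery model operates on."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # One-sided spectrum: n_fft // 2 + 1 frequency bins per frame
    return np.abs(np.fft.rfft(frames, axis=1)).T

# A 100 Hz tone at an assumed 8 kHz sampling rate concentrates its
# energy in a low frequency bin, as the throat-filtered signal does
fs = 8000
t = np.arange(fs) / fs
spec = stft_magnitude(np.sin(2 * np.pi * 100 * t))
```

In this representation, the recovery network's task amounts to transforming a band-limited spectrogram into one whose high-frequency bins carry plausible speech texture.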
We evaluate the performance using the source-to-distortion ratio (SDR)[29] and short-time objective intelligibility (STOI),[30] two commonly used evaluation metrics in speech-enhancement tasks. As shown in Figure 4e, as training proceeds, a final SDR score of around 6.8 is obtained for the recovered REAL audio in the test set, representing a significant increase in content clarity.[31] Figure 4f shows SDR and STOI histograms of the 293 test samples (original and recovered), demonstrating the effectiveness of the audio-recovery model (see Methods for evaluation metrics). Finally, we note that this model is capable of real-time inference on modest hardware, so REAL can operate on the robot's onboard computing platform (see Supporting Information for real-time analysis).
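To make the metric concrete, the sketch below computes a simplified SDR: the energy ratio between the reference and the residual distortion, in dB. The full BSS Eval definition of SDR[29] additionally decomposes the error into interference and artifact terms, and STOI requires a dedicated implementation, so this is only a toy illustration; the signals and noise levels are synthetic stand-ins, not data from our experiments.

```python
import numpy as np

def simple_sdr(reference, estimate):
    """Simplified source-to-distortion ratio in dB, assuming the
    estimate is already time-aligned and scaled to the reference."""
    distortion = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / np.sum(distortion ** 2))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                    # stand-in for ground truth
raw = clean + 0.9 * rng.standard_normal(16000)        # heavily distorted stand-in
recovered = clean + 0.45 * rng.standard_normal(16000) # enhanced stand-in

sdr_raw = simple_sdr(clean, raw)
sdr_recovered = simple_sdr(clean, recovered)
```

A higher SDR after recovery indicates that the model output lies closer to the ground-truth audio, which is the behavior Figure 4e tracks over training iterations.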