Figure \ref{fig:1}(a) illustrates the working principle of the REAL system. A low-power collimated laser beam is aimed at the remote vibrating surface. A telescoping lens system is aligned with the remote spot to collect the back-scattered photons and focus them onto an avalanche photodetector (APD), whose AC signal is amplified and processed as the REAL audio. This signal arises from the change in back-scattered light intensity as the surface vibrates. If we assume the scattering surface is Lambertian\cite{RN176}, the collected back-scattered laser power decays with the square of the distance, ${P_c} \propto {P_0}/{r^2}$, where $P_0$ is the scattered laser power exiting normal to the surface and $r$ is the distance between the surface spot and the collecting lens. As the surface vibrates, this distance changes ($r = {r_0} + \Delta r$, with $r_0$ the rest distance), and the displacement $\Delta r$ changes the collected laser power by $\Delta {P_c} \propto {P_0}(1/{r_0^2} - 1/{({r_0} + \Delta r)^2})$. The APD detects this change and converts it into an audio signal. As an example, the waveform of the spoken word ‘brother’ captured by the REAL system (detected from the speaker’s mask) is shown in Fig. \ref{fig:1}(a), together with a microphone waveform for comparison. Because of its simple construction, the REAL system costs much less than a conventional LDV system and can be readily miniaturized (see \textcolor{urlblue}{Supplement 1}).
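
The inverse-square relation above can be linearized to show why the detected signal follows the surface displacement. The following is a sketch rather than a result stated in the original analysis: it assumes the vibration amplitude is much smaller than the standoff distance ($|\Delta r| \ll r_0$, with $r_0$ the rest distance between the surface spot and the collecting lens) and that the proportionality constant (collection aperture, surface reflectivity) does not change during the vibration.
\[
\Delta P_c \;\propto\; P_0\left[\frac{1}{r_0^{2}} - \frac{1}{(r_0+\Delta r)^{2}}\right]
= P_0\,\frac{2 r_0\,\Delta r + \Delta r^{2}}{r_0^{2}\,(r_0+\Delta r)^{2}}
\;\approx\; \frac{2 P_0\,\Delta r}{r_0^{3}},
\qquad
\frac{\Delta P_c}{P_c} \approx \frac{2\,\Delta r}{r_0}.
\]
Under these assumptions the power change is, to first order, proportional to $\Delta r$, so the AC component of the APD output tracks the surface vibration, while the absolute signal amplitude falls off as $1/r_0^{3}$ with standoff distance. With the sign convention used here, a positive $\Delta r$ (surface moving away from the lens) corresponds to a decrease in collected power.
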
As a robotic ear, the REAL system needs to continuously capture a specific voice source. In Fig. \ref{fig:1}(b), the system is mounted on a motorized gimbal, and a camera is used to detect and track the throat or the mask of the speaking person. The detected target position is fed into the control loop of the gimbal, which keeps the laser pointed at the target as it moves. A microphone is attached to the REAL system to collect the noise-contaminated audio signal for comparison, and to further augment the REAL signal by fusing the two independent sensing modalities. Fig. \ref{fig:1}(c) shows a cocktail party scenario in which the gimbaled REAL system ‘hears’ a specific person remotely without acoustic channel interference.


Conclusion

Our results demonstrate that REAL can recover the speech signal by exploiting the back-scattered intensities from vibrating surfaces. With strong resistance to acoustic noise and the ability to collect specific audio signals over long distances, REAL provides a feasible solution to the cocktail party problem in the optical channel. We show that REAL can directly ‘hear’ voices from masks and throats in a noisy environment, where the noise characteristics are accounted for in the hardware design and neural networks assist in signal recovery. Further work could include additional sensing modalities, such as audio-visual cues and microphone arrays, to enhance the overall detection accuracy. With its high signal quality, simple construction, affordability and readiness for miniaturization, we anticipate that the REAL system will foster a new approach to human-robot interaction, benefiting applications in speaker identification and speech understanding and accelerating the development of voice-guided home and field robots.
