Recent advances in deep learning have enabled the detection of COVID-19 from audio signals, primarily breathing, coughing, isolated phonemes, speech, or combinations of these sound modalities. However, existing methods are tied to a specific input audio type (modality) and, consequently, cannot operate across diverse datasets. In this paper, we present an end-to-end deep learning architecture capable of selectively analyzing multiple sources of audio information, even when certain modalities are absent from the dataset. Our method can process up to nine audio modalities, including coughing, breathing, phonemes, and counting, by employing parallel convolutional branches and introducing an attention-like modality selection mechanism. The proposed approach can also be viewed as a feature selector that determines, at runtime, which acoustic modalities are pertinent to each classification decision. Our findings show that this attention-guided mechanism increases classification accuracy compared with standard multimodal approaches and enables the reuse of the trained network across diverse datasets. Specifically, when trained on the Coswara dataset, the proposed method achieves 97.75% testing accuracy and, without retraining, attains 82% accuracy on the Virufy dataset, despite that dataset's different, unimodal structure.
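
The sketch below is a minimal illustration, not the authors' implementation, of how parallel convolutional branches combined with an attention-like gate can weight per-modality embeddings and handle missing modalities at runtime. All class names, layer sizes, and the masking convention (absent modalities passed as zero tensors and masked out of the attention weights) are assumptions made for this example.

```python
import torch
import torch.nn as nn

class ModalitySelectiveNet(nn.Module):
    """Illustrative multimodal classifier with attention-like modality selection."""
    def __init__(self, n_modalities=9, n_classes=2, emb_dim=64):
        super().__init__()
        # One small 1-D convolutional branch per audio modality.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(1, 16, kernel_size=9, stride=4), nn.ReLU(),
                nn.Conv1d(16, 32, kernel_size=9, stride=4), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                nn.Linear(32, emb_dim), nn.ReLU(),
            )
            for _ in range(n_modalities)
        ])
        # Attention-like gate: scores each modality embedding, acting as a
        # runtime feature selector over the available modalities.
        self.gate = nn.Linear(emb_dim, 1)
        self.classifier = nn.Linear(emb_dim, n_classes)

    def forward(self, inputs, present_mask):
        # inputs: list of (batch, 1, samples) tensors, one per modality
        # present_mask: (batch, n_modalities), 1 where the modality is recorded
        embs = torch.stack(
            [branch(x) for branch, x in zip(self.branches, inputs)], dim=1
        )                                               # (batch, n_modalities, emb_dim)
        scores = self.gate(embs).squeeze(-1)            # (batch, n_modalities)
        scores = scores.masked_fill(present_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=1)          # attention over modalities
        fused = (weights.unsqueeze(-1) * embs).sum(dim=1)
        return self.classifier(fused), weights

# Usage: a batch where only a cough recording is available (a unimodal case
# analogous to Virufy); the other eight modalities are zero-filled and masked.
model = ModalitySelectiveNet()
inputs = [torch.zeros(4, 1, 16000) for _ in range(9)]
inputs[0] = torch.randn(4, 1, 16000)                    # cough recordings
mask = torch.zeros(4, 9)
mask[:, 0] = 1.0
logits, weights = model(inputs, mask)
```

Because absent modalities receive zero attention weight, the same trained network can, in principle, be applied to datasets with fewer modalities than it was trained on, which mirrors the cross-dataset reuse described above.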