Building Text-to-Speech Models for Low-Resourced Languages from
Crowdsourced Data
Abstract
Text-to-speech (TTS) models have expanded the scope of digital
inclusivity by serving as a basis for assistive communication
technologies for visually impaired people, facilitating language
learning, and enabling digital textual content to be consumed as audio
across various sectors. Despite these benefits, the full potential of TTS
models is often not realized for the majority of low-resourced African
languages, because such models have traditionally required large amounts
of high-quality single-speaker recordings, which are financially costly
and time-consuming to obtain. In this paper, we demonstrate that
crowdsourced recordings can help overcome the lack of single-speaker
data by compensating with speech from other speakers with similar
intonation (how the voice rises and falls in speech). We fine-tuned an
English Variational Inference with adversarial learning for end-to-end
Text-to-Speech (VITS) model on over 10 hours of speech from six female
Common Voice (CV) speakers for Luganda and Kiswahili. A human mean
opinion score (MOS) evaluation on 100 test sentences shows that the
model trained on six speakers sounds more natural than benchmark models
trained on two speakers and on a single speaker for both languages.
Combined with careful data curation, this approach shows promise for
advancing speech synthesis for low-resourced African languages. Our
final models for Luganda and Kiswahili are available at
https://huggingface.co/marconilab/VITS-commonvoice-females.
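
As a rough illustration (not part of the paper's reported setup), the sketch below shows how the released checkpoints might be loaded with the Hugging Face Transformers text-to-speech pipeline; it assumes the marconilab/VITS-commonvoice-females repository is stored in a Transformers-compatible VITS format, and the example sentence and output filename are placeholders.

# Minimal sketch, assuming the released VITS checkpoint is in a
# Transformers-compatible format; if it is a Coqui-TTS or raw VITS
# checkpoint instead, that toolkit's loading utilities would be needed.
import scipy.io.wavfile
from transformers import pipeline

# Repository name as given in the abstract; any per-language subfolder
# or revision layout is not specified there.
tts = pipeline("text-to-speech", model="marconilab/VITS-commonvoice-females")

# "Good morning" in Kiswahili; substitute any Luganda or Kiswahili text.
out = tts("Habari za asubuhi.")

# The pipeline returns a waveform array and its sampling rate.
scipy.io.wavfile.write("sample.wav", rate=out["sampling_rate"], data=out["audio"].squeeze())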