Understanding speech in noisy environments remains a critical challenge for the development of hearing-loss interventions. Advances in methods for enhancing speech recognition require extensive datasets for training and testing algorithms. The majority of speech corpora focus on individual utterances, dyadic interactions and scripted recordings (e.g. of phonetically balanced sentences), despite the fact that much of our social communication occurs in dynamic, small-group interactions. This paper presents a systematic review of existing speech corpora that encompass both audio and visual recordings of small-group conversations. In doing so, we examine the currently available data for testing and evaluating computational audio-visual speech enhancement models. The datasets available for testing speech enhancement techniques are varied but often specific to narrow use cases, and the limitations of these training and testing databases in turn restrict technology development. There is a need for a corpus of audio-visual multi-talker speech that can be used to train audio-visual speech enhancement technology and to further cognitive hearing research. Existing audio-visual corpora are typically confined to the specific paradigms for which they were collected. Studies that require speech with various types and levels of background noise generally create such material by combining datasets of utterances or dialogue with databases of background noise, mixed at chosen signal-to-noise ratios (see the sketch below). Available datasets do not incorporate free-flowing speech between multiple talkers whilst also accounting for differing listening environments and hearing abilities, and no database exists to computationally explore the needs and preferences of hearing aid users in typical situations such as conversations in social environments or while navigating busy events. We propose the generation of a new dataset containing audio and visual information from multiple talkers interacting spontaneously in a variety of sound environments. Such a corpus would provide audio-visual data of free-flowing conversations with which to evaluate and fine-tune technological developments in realistic scenarios. Evaluation data of this kind are necessary to develop speech enhancement models that can cope with the complex demands of hearing in the real world.
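
To make the noise-mixing convention described above concrete, the following is a minimal sketch of combining a clean utterance with a background-noise recording at a target signal-to-noise ratio. It assumes single-channel NumPy arrays sampled at the same rate; the function name mix_at_snr and the placeholder signals are illustrative, not drawn from any corpus discussed in this review.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the target SNR in dB.

    Illustrative helper, not from any reviewed corpus. Both inputs are
    1-D float arrays recorded (or resampled) at the same sample rate.
    """
    # Loop (tile) the noise if it is shorter than the speech segment.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain such that 10 * log10(speech_power / (gain**2 * noise_power)) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

# Usage: mix a placeholder "utterance" with placeholder "noise" at 0 dB and +5 dB SNR.
# A real pipeline would load waveforms from a speech corpus and a noise database here.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stands in for a 1 s utterance at 16 kHz
noise = rng.standard_normal(8000)    # stands in for a background-noise recording
noisy_0db = mix_at_snr(speech, noise, snr_db=0.0)
noisy_5db = mix_at_snr(speech, noise, snr_db=5.0)
```

In practice such studies sweep snr_db over a range of conditions and noise types; the point here is only the scaling step that existing work relies on when no jointly recorded noisy conversational data are available.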