Sibitenda Harriet

and 6 more

Extracting knowledge about the prevalent issues discussed on social media in Africa using Artificial Intelligence techniques is vital for informing public governance. The goals of our study are twofold: (a) to develop machine learning-based models that identify common topics of social concern about Africa on social media, and (b) to design a classifier capable of inferring the common topic associated with a given social media post. To achieve the first goal, topic identification, we designed a three-step framework. The first step uses text-based representation learning methods to generate text embeddings for feature representation. The second step leverages state-of-the-art Natural Language Processing models for topic modeling to organize the representations into groups. The third step generates topics from each group, including the use of large language models to generate meaningful short-sentence labels from the bag of tokens associated with each group. To achieve the second goal, classification, we trained classifiers using ensemble voting and stacking learners to infer which of the identified common topics best characterizes a social media post. For our experimental study, we collected a text corpus called Social Media for Africa (SMA), composed of 22,036 records extracted from social media comments on Twitter (X) and YouTube. The clustering-based model BERTopic yielded 304 topics with a C_v topic coherence of 0.81. After merging the topics into classes, BERTopic+ produced 11 common topic classes with a C_v topic coherence of 0.76. We then used the identified topics, based on the resulting groupings, as labels for training a topic classifier. These labels were created using Llama2 on our SMA corpus. Our comparative study of topic classifiers using stacking and voting schemes shows that, when training on topics, BERTopic with ensemble voting achieves an accuracy of 0.83 and an F1 score of 0.82.
Furthermore, when training on topic classes, BERTopic+ with ensemble voting achieved the highest accuracy (0.95) and F1 score (0.95) among the methods compared on our corpus. For short-sentence topic labeling, the overall performance of classifiers using ensemble stacking is slightly better than that of classifiers using voting. For Africa, our findings suggest that policymakers should focus on the most pressing social issues: COVID-19 restrictions affecting public health and economic recovery, promoting entrepreneurial innovation in energy and environmental sustainability to combat climate change, and strategically responding to China's rise in global politics to maintain geopolitical stability and foster international cooperation.
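As a rough illustration of the ensemble voting idea used for topic classification, the sketch below combines the predictions of several base classifiers by majority (hard) voting. It is a minimal sketch under stated assumptions, not the study's implementation: the abstract does not say whether hard or soft voting was used, and the function names, the three base models, and the topic labels are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(predictions: Sequence[str]) -> str:
    """Return the label predicted by the most base classifiers.

    Ties go to the label that appears first in `predictions`
    (an assumed tie-breaking rule, chosen for simplicity).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    return Counter(predictions).most_common(1)[0][0]


def vote_over_posts(per_model_predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine predictions post by post.

    per_model_predictions[m][i] is the label that base model m
    assigns to post i; all models must label the same posts.
    """
    return [hard_vote(column) for column in zip(*per_model_predictions)]


# Hypothetical outputs of three base classifiers on four posts.
model_a = ["health", "energy", "politics", "health"]
model_b = ["health", "politics", "politics", "energy"]
model_c = ["energy", "energy", "politics", "health"]

final_labels = vote_over_posts([model_a, model_b, model_c])
print(final_labels)
```

Stacking differs in that a further model is trained on the base classifiers' outputs instead of counting votes; that variant is not shown here.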