Tasmiul Alam Shopnil

and 2 more

Text document summarization is critical for managing today's vast volume of textual data. This paper presents an extractive approach to text document summarization that does not rely on word embedding techniques. Instead, our method proceeds in four steps: sentence segmentation, sentence embedding, K-means clustering, and summary generation. First, the input text is segmented into individual sentences using an NLP tool such as NLTK's sentence tokenizer. Next, we extract a contextual embedding for each sentence using the Sentence Transformer method; these embeddings capture the meaning of each sentence within the context of the surrounding text. The sentence embeddings are then grouped by K-means clustering, so that each cluster represents a set of semantically related sentences. To generate the summary, we choose one sentence from each cluster, namely the sentence closest to the cluster centroid, and order the selected sentences as they appeared in the original text. We implemented the summarizer and evaluated it on the DUC 2007 dataset, a collection of news articles paired with summaries written by human experts. The results show that our summarizer produces informative and concise summaries and outperforms a baseline that simply extracts the top-ranked sentences from the input text. Our work thus contributes an alternative that combines sentence segmentation, contextual embeddings, K-means clustering, and centroid-based selection into a viable method for generating high-quality summaries. Further research can explore enhancements to our approach and its application in other domains where text summarization is essential.
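The pipeline described in the abstract (segment, embed, cluster, pick the sentence nearest each centroid, restore document order) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the regex splitter stands in for NLTK's sentence tokenizer, a plain-Python Lloyd's iteration with farthest-point initialization stands in for a library K-means, and the embedding model is passed in as a callable named `embed` (a hypothetical parameter) so that any sentence encoder, such as a Sentence Transformer model, could be plugged in. The function names `split_sentences`, `kmeans`, and `summarize` are likewise illustrative.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def split_sentences(text: str) -> List[str]:
    """Naive regex splitter; an assumed stand-in for NLTK's sentence tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def kmeans(points: List[Vector], k: int, iters: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm with deterministic farthest-point initialization.

    Returns one cluster label per point and the final centroids.
    """
    # Seed with the first point, then repeatedly add the point that is
    # farthest from its nearest already-chosen centroid.
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = [-1] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def summarize(
    text: str,
    embed: Callable[[List[str]], List[Vector]],
    k: int,
) -> List[str]:
    """Pick the sentence closest to each cluster centroid, in document order."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    k = min(k, len(sentences))
    vectors = embed(sentences)
    labels, centroids = kmeans(vectors, k)

    chosen = []
    for j in range(k):
        members = [i for i, lab in enumerate(labels) if lab == j]
        if members:
            chosen.append(min(members, key=lambda i: math.dist(vectors[i], centroids[j])))
    # Sorting the indices restores the original sentence order.
    return [sentences[i] for i in sorted(chosen)]
```

With real components, `split_sentences` would typically be replaced by `nltk.tokenize.sent_tokenize`, `embed` by the `encode` method of a `SentenceTransformer` model, and `kmeans` by `sklearn.cluster.KMeans`. The number of clusters `k` sets the summary length, since one sentence is taken per cluster; how `k` is chosen is not stated in the abstract.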