Shalin Shah et al.

Text summarization is an important problem in natural language processing. Summarization of large, complex documents has applications in many domains, such as finance (10-K reports, earnings transcripts), customer reviews, healthcare (doctor notes, aggregate patient information), legal and compliance (contracts, court rulings), and academia (scholarly articles, textbooks). A summary can help an investor make better decisions quickly. However, naive summarization of complex documents can be incomplete and may omit important topics. Unlike 10-K reports, which regulatory compliance requires to follow a fixed reporting structure, most other documents, such as earnings transcripts, ESG disclosures, messaging, customer feedback, real-time news, social media posts, and meeting notes, are not formally organized and may lack standardization. This is where our method is especially helpful.

The proposed solution in this paper is to first extract a table of contents (TOC), a list of the important topics present in the document, and then use retrieval-augmented generation to pull the relevant passages from the original document for the summary. This approach works well, but because a TOC is used, there is a risk of text fragmentation: the text might not flow smoothly, and abrupt discontinuities in the generated summary can hurt readability. To solve this problem, we still extract the TOC, but for each topic entry we also generate, from the entire document, a set of questions the entry should answer and a description of the topic. We additionally generate high-level objectives of the document as a whole. We then feed the objectives, the TOC entry, and its questions and description to the LLM to generate a topic summary, and finally synthesize the full summary from the section summaries. This significantly reduces fragmentation and yields a readable summary with a smooth flow.
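The pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical helper standing in for any chat-completion API (here stubbed so the control flow runs end to end), and the prompts are placeholders.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    return f"[LLM output for: {prompt[:40]}...]"

def summarize_document(document: str, topics=None) -> str:
    # 1. High-level objectives generated from the entire document.
    objectives = call_llm(f"List the high-level objectives of:\n{document}")

    # 2. Table of contents: either supplied by the user (customized
    #    summarization) or extracted from the document.
    toc = topics or call_llm(
        f"Extract a table of contents (one topic per line) from:\n{document}"
    ).splitlines()

    section_summaries = []
    for topic in toc:
        # 3. Per-topic questions and a description, grounded in the full text.
        questions = call_llm(
            f"Questions a summary of '{topic}' should answer, given:\n{document}"
        )
        description = call_llm(f"Describe the topic '{topic}' based on:\n{document}")

        # 4. Topic summary conditioned on objectives + questions + description,
        #    which is what reduces fragmentation between sections.
        section_summaries.append(call_llm(
            f"Objectives: {objectives}\nTopic: {topic}\n"
            f"Questions: {questions}\nDescription: {description}\n"
            "Write a summary of this topic."
        ))

    # 5. Synthesize one flowing summary from the section summaries.
    return call_llm(
        "Synthesize a single coherent summary from:\n" + "\n".join(section_summaries)
    )
```

In practice the per-topic prompts would retrieve only the relevant passages (retrieval-augmented generation) rather than pass the whole document, as described above.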
Our method can also give structure to documents that lack a natural topic-based organization, such as earnings transcripts and news. Furthermore, it can generate a summary from a user-supplied set of topics, resulting in a customized summarization of the document. Results show that the proposed method outperforms chain-of-density (COD) summarization on ROUGE scores, LLM-based evaluation, BERTScore, and embedding-based evaluation: of the 250 measurements, our method performs better on 72.8% of them. In Illustrations 4 and 5, we show that the COD summary focuses mostly on financial performance, while the TDS summary (our method) is more comprehensive. We also generate knowledge graphs of both summaries. Finally, to see whether the method scales to very large documents, we applied it to a 750-page Fannie Mae servicing guide, and the results are encouraging.
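For context on the evaluation metrics mentioned above: ROUGE measures n-gram overlap between a candidate summary and a reference. A minimal ROUGE-1 F1 in pure Python (a simplified sketch; the real toolkit adds stemming and other variants such as ROUGE-2 and ROUGE-L) looks like this:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between two texts (whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # matched unigram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

BERTScore, by contrast, matches tokens by contextual-embedding similarity rather than exact overlap, which is why the paper reports both alongside LLM-based and embedding-based evaluation.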