Shalin Shah et al.

Text summarization is an important problem in natural language processing. Summarization of large, complex documents has applications in many domains, such as finance (10-K reports, earnings transcripts), customer reviews, healthcare (doctor notes, aggregate patient information), legal and compliance (contracts, court rulings), and academia (scholarly articles, textbooks). A summary can help an investor make better decisions quickly. However, naive summarization of complex documents can be incomplete and may omit important topics. Unlike 10-K reports, which regulatory compliance requires to follow a fixed reporting structure, most other documents, such as earnings transcripts, ESG disclosures, messaging, customer feedback, real-time news, social media posts, and meeting notes, are not formally organized and may lack standardization. This is where our method is especially helpful.

The proposed solution in this paper is to first extract a table of contents (TOC), a list of the important topics present in the document, and then use retrieval-augmented generation to pull the relevant passages from the original document for the summary. This approach works well, but because a TOC is used, there is a risk of text fragmentation: the text might not flow smoothly, and abrupt discontinuities in the generated summary can hurt readability. To solve this problem, we still extract the TOC, but for each topic entry we also generate, from the entire document, a set of questions the entry should answer and a description of the topic. We additionally generate high-level objectives of the document as a whole. We then feed the objectives, the TOC entry, and its questions and description to the LLM to generate a topic summary, and finally synthesize the full summary from the section summaries. This significantly reduces fragmentation and yields a readable summary with a smooth flow.
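The pipeline above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `call_llm` is a hypothetical helper standing in for any chat-completion API (here stubbed so the control flow runs end to end), and the prompts are placeholders.

```python
def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API here.
    return f"[LLM output for: {prompt[:40]}...]"

def summarize_document(document: str, topics=None) -> str:
    # 1. High-level objectives generated from the entire document.
    objectives = call_llm(f"List the high-level objectives of:\n{document}")

    # 2. Table of contents: either supplied by the user (customized
    #    summarization) or extracted from the document.
    toc = topics or call_llm(
        f"Extract a table of contents (one topic per line) from:\n{document}"
    ).splitlines()

    section_summaries = []
    for topic in toc:
        # 3. Per-topic questions and a description, grounded in the full text.
        questions = call_llm(
            f"Questions a summary of '{topic}' should answer, given:\n{document}"
        )
        description = call_llm(f"Describe the topic '{topic}' based on:\n{document}")

        # 4. Topic summary conditioned on objectives + questions + description,
        #    which is what reduces fragmentation between sections.
        section_summaries.append(call_llm(
            f"Objectives: {objectives}\nTopic: {topic}\n"
            f"Questions: {questions}\nDescription: {description}\n"
            "Write a summary of this topic."
        ))

    # 5. Synthesize one flowing summary from the section summaries.
    return call_llm(
        "Synthesize a single coherent summary from:\n" + "\n".join(section_summaries)
    )
```

In practice the per-topic prompts would retrieve only the relevant passages (retrieval-augmented generation) rather than pass the whole document, as described above.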
Our method can also give structure to documents that lack a natural topic-based organization, such as earnings transcripts and news. Furthermore, it can generate a summary from a user-supplied set of topics, resulting in a customized summarization of the document. Results show that the proposed method outperforms chain-of-density (COD) summarization on ROUGE scores, LLM-based evaluation, BERTScore, and embedding-based evaluation: of the 250 measurements, our method performs better on 72.8% of them. In Illustrations 4 and 5, we show that the COD summary focuses mostly on financial performance, while the TDS summary (our method) is more comprehensive. We also generate knowledge graphs of both summaries. Finally, to see whether the method scales to very large documents, we applied it to a 750-page Fannie Mae servicing guide, and the results are encouraging.
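For context on the evaluation metrics mentioned above: ROUGE measures n-gram overlap between a candidate summary and a reference. A minimal ROUGE-1 F1 in pure Python (a simplified sketch; the real toolkit adds stemming and other variants such as ROUGE-2 and ROUGE-L) looks like this:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between two texts (whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # matched unigram count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

BERTScore, by contrast, matches tokens by contextual-embedding similarity rather than exact overlap, which is why the paper reports both alongside LLM-based and embedding-based evaluation.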