METHODOLOGY

The proposed methodology combines Optical Character Recognition (OCR) and a Large Language Model (LLM) to create an application that generates Python code from flowchart images. Flowchart images uploaded through the application’s user interface, developed using Gradio, are processed with the Python library EasyOCR to extract the textual content of the flowcharts. EasyOCR uses deep learning models such as VGG and ResNet for feature extraction, Long Short-Term Memory networks to capture the sequential context of the extracted features, and the Connectionist Temporal Classification algorithm to decode the labelled sequences into text. A query is constructed from the OCR-extracted text and provided as a prompt to the Llama 2 Chat LLM (70-billion-parameter version), accessed through an API on the Replicate platform. The output returned by the LLM contains the Python code and a brief explanation of it; the code is extracted, stored in a Python string object, and returned to the user interface as the output for the uploaded flowchart. Functionality to copy the Python code displayed on the UI was also incorporated, allowing the user to test the generated code. A minimal sketch of this end-to-end flow is given below.
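The following sketch illustrates the pipeline under stated assumptions; it is not the authors’ exact implementation. The model identifier meta/llama-2-70b-chat, the inline prompt construction, and the choice of Gradio components are assumptions based on the description above, and a valid REPLICATE_API_TOKEN environment variable is assumed to be set.

import easyocr
import gradio as gr
import replicate

reader = easyocr.Reader(["en"])

def flowchart_to_code(image):
    # Extract (bounding_box, text, confidence) tuples from the image.
    img_text = reader.readtext(image)
    # Build a comment-style prompt from the recognized text; the
    # create_query helper described in a later section does this.
    prompt = "# " + " ".join(text for _, text, _ in img_text) + " def"
    # Llama 2 Chat (70B) hosted on Replicate; replicate.run streams
    # the response as string chunks, joined here into one string.
    chunks = replicate.run("meta/llama-2-70b-chat", input={"prompt": prompt})
    # The full response (code plus explanation) is returned; the code
    # portion is then extracted for display.
    return "".join(chunks)

# gr.Code renders the generated code with a built-in copy button.
demo = gr.Interface(
    fn=flowchart_to_code,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Code(language="python"),
)
demo.launch()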

Text Extraction using Optical Character Recognition

Optical Character Recognition (OCR) is the process of extracting readable text from images containing textual content. It is used to interpret printed or handwritten text from sources such as scanned documents or captured images. OCR operates on the principles of pattern recognition and feature extraction. First, the input image is preprocessed to enhance its quality: noise is reduced using Gaussian filters and morphological operations such as erosion and dilation. The processed image is then segmented to isolate individual characters from the background, employing techniques such as text line extraction with the Hough transform and word extraction with connected component analysis. Statistical features, such as the density and direction of the foreground pixels, are extracted using zoning, projection histograms, and similar methods, along with structural features such as cross points, strokes, loops, and horizontal curves. Finally, pattern matching algorithms interpret the extracted features to generate readable text output.

The EasyOCR library is used to extract the text from the flowchart components in the uploaded image. The uploaded flowchart image is converted into an array using OpenCV, and this array is passed to the readtext method of the Reader class from the EasyOCR library. For every word detected in the image, the method returns the extracted text, the coordinates of its bounding box, and a confidence score; all of this data is stored in a variable. The bounding boxes are drawn on the uploaded image and the extracted text is placed at the appropriate positions using OpenCV methods, as shown in Fig 1. Using Matplotlib, the flowchart image with the bounding boxes and extracted text is displayed in a window.
Fig 1. Extraction of text from flowchart components in the image using OCR
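This extraction and visualization step can be sketched as follows. The file name flowchart.png is a placeholder; readtext returns each detection as a tuple containing the four corner points of the bounding box, the recognized text, and a confidence score.

import cv2
import easyocr
import matplotlib.pyplot as plt

# Read the uploaded flowchart into a NumPy array ("flowchart.png"
# is a placeholder file name).
image = cv2.imread("flowchart.png")

# readtext returns a list of (bounding_box, text, confidence) tuples;
# each bounding box holds the four corner points of the detected word.
reader = easyocr.Reader(["en"])
results = reader.readtext(image)

for bbox, text, confidence in results:
    # Use the top-left and bottom-right corners to draw the box.
    top_left = tuple(map(int, bbox[0]))
    bottom_right = tuple(map(int, bbox[2]))
    cv2.rectangle(image, top_left, bottom_right, (0, 255, 0), 2)
    cv2.putText(image, text, (top_left[0], top_left[1] - 5),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (255, 0, 0), 2)

# Display the annotated image, converting BGR (OpenCV) to RGB.
plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.axis("off")
plt.show()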

Creating Structured Query for Large Language Model (LLM)

A custom Python function named create_query was written to build a structured query with which the Llama 2 LLM generates Python code from the extracted text. The function takes two parameters. The first, img_text, contains the textual content and other details extracted from the flowchart image; it is an iterable of tuples, each holding three elements: the bounding box coordinates, the extracted text, and the confidence score for that text. The second is a Python string containing "#". The function uses a for loop to iterate over all elements of img_text and extract only the text, which is appended to the string; the string "def" is then attached at the end. The final structured query string is returned by the function. The query string containing the OCR-extracted text can be provided to an LLM as user input, and the LLM can then generate Python code for the extracted text, as shown in Fig 2.
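A minimal sketch of create_query follows. The parameter names and the initial value of the query string ("# ") are illustrative assumptions, since the text only states that the second argument contains "#".

def create_query(img_text, query="# "):
    # img_text is an iterable of (bounding_box, text, confidence)
    # tuples, as returned by EasyOCR's readtext method.
    for _, text, _ in img_text:
        # Keep only the recognized text; discard boxes and scores.
        query += text + " "
    # Append "def" so the LLM continues with a function definition.
    query += "def"
    return query

For example, a flowchart whose components read "start", "read a, b", "c = a + b", "print c", and "end" would yield the query "# start read a, b c = a + b print c def", prompting the model to complete a function implementing those steps.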