Open Source Large Language Models in Action: A Bioinformatics Chatbot
for PRIDE database
Abstract
We present a chatbot assistant infrastructure
(https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions
with the PRIDE database, the most popular proteomics data repository.
Our system uses two advanced Large Language Models (LLMs), llama2-13b
and chatglm2-6b, and includes a web service API (Application Programming
Interface), a web interface, and sophisticated algorithms. We have
developed a novel approach to constructing the vector-based
representations that underpin the LLM responses, featuring a curated
version and a comprehensive database of relevant links and paragraphs
for each generated response. An important part of the framework is a benchmark
component based on an Elo-ranking system, providing a scalable method
for evaluating not only the performance of llama2-13b and chatglm2-6b
but also of any other available and future open-source LLMs. Throughout
the benchmarking process, the PRIDE documentation for external users was
refined to enhance its clarity and efficacy in addressing user queries.
Importantly, while our infrastructure is exemplified through its
application in the PRIDE database context, the modular and adaptable
nature of our approach positions it as a valuable tool for improving
user experiences across a spectrum of bioinformatics and proteomics
tools and resources, among other domains. The integration of advanced
LLMs, innovative vector-based construction, the benchmarking framework,
and optimized documentation collectively form a robust and transferable
chatbot assistant infrastructure.
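
To illustrate the Elo-ranking idea behind the benchmark component, the following is a minimal sketch of the standard Elo update applied to pairwise comparisons between model answers. It is not the PRIDE chatbot's actual implementation: the function names, the K-factor of 32, the initial rating of 1000, and the placeholder model names are all assumptions chosen for illustration.

```python
# Minimal Elo sketch for pairwise model comparisons.
# Assumptions (not from the source): K-factor 32, initial rating 1000,
# and the hypothetical helper names used below.

def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was,
    and 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rank_models(comparisons, initial=1000.0, k=32.0):
    """Process (model_a, model_b, score_a) tuples in order and
    return models sorted by final rating, highest first."""
    ratings = {}
    for model_a, model_b, score_a in comparisons:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, score_a, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Made-up votes between placeholder models, for illustration only.
    votes = [
        ("model_x", "model_y", 1.0),
        ("model_y", "model_z", 0.5),
        ("model_x", "model_z", 1.0),
    ]
    for name, rating in rank_models(votes):
        print(f"{name}: {rating:.1f}")
```

Because each update depends only on the two models being compared, a newly released model can be added to the ranking by comparing its answers against those of existing models, without re-evaluating every earlier pair.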