Open Source Large Language Models in Action: A Bioinformatics Chatbot for PRIDE database
  • Jingwen Bai,
  • Selvakumar Kamatchinathan,
  • Deepti J Kundu,
  • Chakradhar Bandla,
  • Juan Antonio Vizcaino,
  • Yasset Perez Riverol
Jingwen Bai
European Bioinformatics Institute
Selvakumar Kamatchinathan
European Bioinformatics Institute
Deepti J Kundu
European Bioinformatics Institute
Chakradhar Bandla
European Bioinformatics Institute
Juan Antonio Vizcaino
EMBL-EBI
Yasset Perez Riverol
European Bioinformatics Institute

Corresponding Author:[email protected]


Abstract

We present a chatbot assistant infrastructure (https://www.ebi.ac.uk/pride/chatbot/) that simplifies user interactions with the PRIDE database, the most popular proteomics data repository. Our system uses two advanced Large Language Models (LLMs), llama2-13b and chatglm2-6b, and includes a web service API (Application Programming Interface), a web interface, and sophisticated algorithms. We developed a novel approach to constructing the vector-based representations that underpin the LLM responses, featuring a curated version and a comprehensive database of relevant links and paragraphs for each generated response. An important part of the framework is a benchmark component based on an Elo-ranking system, which provides a scalable method for evaluating the performance not only of llama2-13b and chatglm2-6b but also of any other available or future open-source LLM. Throughout the benchmarking process, the PRIDE documentation for external users was refined to address user queries more clearly and effectively. Although our infrastructure is exemplified through its application to the PRIDE database, its modular and adaptable design makes it a valuable tool for improving user experiences across a spectrum of bioinformatics and proteomics tools and resources, among other domains. Together, the integration of advanced LLMs, the innovative vector-based construction, the benchmarking framework, and the optimized documentation form a robust and transferable chatbot assistant infrastructure.
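
To make the Elo-based benchmarking idea concrete, the following is a minimal sketch of the standard Elo update applied to pairwise comparisons between two models' answers. It is not the authors' implementation: the abstract does not specify their parameters, so the K-factor of 32, the starting rating of 1000, the placeholder model names, and the example outcomes are all assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was,
    and 0.5 for a tie. k=32 is an assumed K-factor.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical models, all starting from an assumed rating of 1000.
ratings = {"model_a": 1000.0, "model_b": 1000.0}

# Made-up judgments for illustration only: (first, second, score of first).
comparisons = [
    ("model_a", "model_b", 1.0),
    ("model_a", "model_b", 0.5),
    ("model_b", "model_a", 1.0),
]

for first, second, score in comparisons:
    ratings[first], ratings[second] = update_elo(
        ratings[first], ratings[second], score
    )

# Print the models from highest to lowest rating.
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update depends only on the two ratings involved, a newly released model can be added to the `ratings` dictionary and compared against existing ones without re-running earlier comparisons, which is one reason a scheme of this kind scales to additional LLMs.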
Submitted to PROTEOMICS
30 Jan 2024 Review(s) Completed, Editorial Evaluation Pending
01 Feb 2024 Editorial Decision: Revise Minor
11 Mar 2024 Reviewer(s) Assigned
20 Mar 2024 Review(s) Completed, Editorial Evaluation Pending
20 Mar 2024 Editorial Decision: Accept