Comparative Evaluation of Large Language Models using Key Metrics and Emerging Tools
  • Sarah McAvinue, Health Service Executive
  • Kapal Dev, Munster Technological University (Corresponding Author: [email protected])
Abstract

This research involved designing and building an interactive generative AI application to conduct a comparative analysis of two advanced Large Language Models (LLMs), GPT-4 and Claude 2, using Langsmith evaluation tools. The project was developed to explore the potential of LLMs in facilitating postgraduate course recommendations within a simulated environment at Munster Technological University (MTU). Designed for comparative analysis, the application enables testing of GPT-4 and Claude 2 and can be hosted flexibly on either AWS (Amazon Web Services) or Azure. It utilizes advanced natural language processing and retrieval-augmented generation (RAG) techniques to process proprietary data tailored to postgraduate needs. A key component of this research was the rigorous assessment of the LLMs using the Langsmith evaluation tool against both customized and standard benchmarks. The evaluation focused on metrics such as bias, safety, accuracy, cost, robustness, and latency. Additionally, adaptability, covering critical features such as language translation and internet access, was researched independently, since the Langsmith tool does not evaluate this metric. This ensures a holistic assessment of the LLMs' capabilities.
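
The comparative workflow summarized above can be illustrated with a short sketch. The snippet below is a hypothetical, minimal example of how two chat models might be run against the same Langsmith dataset with a custom evaluator; the dataset name, model identifiers, prompts, and the keyword-overlap scorer are illustrative assumptions rather than the exact pipeline used in this study.

```python
# Minimal sketch (not the paper's exact pipeline): run GPT-4 and Claude 2 on the
# same LangSmith dataset and score answers with a custom evaluator.
# The dataset name "mtu-postgrad-demo" and the keyword-overlap scorer are
# illustrative assumptions.
from langsmith import Client
from langsmith.evaluation import evaluate
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

client = Client()  # expects a LangSmith API key in the environment


def keyword_overlap(run, example) -> dict:
    """Toy accuracy proxy: fraction of reference keywords found in the answer."""
    answer = (run.outputs or {}).get("answer", "").lower()
    keywords = example.outputs.get("keywords", [])
    hits = sum(1 for kw in keywords if kw.lower() in answer)
    return {"key": "keyword_overlap", "score": hits / max(len(keywords), 1)}


def make_target(llm):
    """Wrap a chat model so it can be invoked on each dataset example."""
    def target(inputs: dict) -> dict:
        reply = llm.invoke(inputs["question"])
        return {"answer": reply.content}
    return target


for name, llm in {
    "gpt-4": ChatOpenAI(model="gpt-4", temperature=0),
    "claude-2": ChatAnthropic(model="claude-2", temperature=0),
}.items():
    evaluate(
        make_target(llm),
        data="mtu-postgrad-demo",        # assumed pre-created LangSmith dataset
        evaluators=[keyword_overlap],
        experiment_prefix=f"course-reco-{name}",
    )
```

Each run produces a separate experiment in Langsmith, so the per-model scores, latencies, and costs can be compared side by side in the same project.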
History

13 Jun 2024: Submitted to Expert Systems
14 Jun 2024: Submission Checks Completed
14 Jun 2024: Assigned to Editor
09 Jul 2024: Review(s) Completed, Editorial Evaluation Pending
20 Jul 2024: Editorial Decision: Revise Major
23 Jul 2024: 1st Revision Received
26 Jul 2024: Submission Checks Completed
26 Jul 2024: Assigned to Editor
26 Jul 2024: Reviewer(s) Assigned
03 Aug 2024: Review(s) Completed, Editorial Evaluation Pending
13 Aug 2024: Editorial Decision: Accept