Comparative Evaluation of Large Language Models using Key Metrics and Emerging Tools

Sarah McAvinue; Kapal Dev

doi:10.22541/au.172225490.00673881/v1

Abstract

This research involved designing and building an interactive generative AI application to conduct a comparative analysis of two advanced Large Language Models (LLMs), GPT-4 and Claude 2, using Langsmith evaluation tools. The project explores the potential of LLMs to support postgraduate course recommendations within a simulated environment at Munster Technological University (MTU). The application enables testing of both GPT-4 and Claude 2 and can be hosted on either AWS (Amazon Web Services) or Azure. It uses natural language processing and retrieval-augmented generation (RAG) techniques to process proprietary data tailored to postgraduate needs. A key component of this research was the assessment of the LLMs with the Langsmith evaluation tool against both customized and standard benchmarks. The evaluation focused on metrics such as bias, safety, accuracy, cost, robustness, and latency. Adaptability, which covers critical features such as language translation and internet access, was researched independently because the Langsmith tool does not evaluate it. Including it gives a more holistic assessment of the LLMs' capabilities.

13 Jun 2024: Submitted to Expert Systems

14 Jun 2024: Submission Checks Completed

14 Jun 2024: Assigned to Editor

09 Jul 2024: Review(s) Completed, Editorial Evaluation Pending

20 Jul 2024: Editorial Decision: Revise Major

23 Jul 2024: 1st Revision Received

26 Jul 2024: Submission Checks Completed

26 Jul 2024: Assigned to Editor

26 Jul 2024: Reviewer(s) Assigned

03 Aug 2024: Review(s) Completed, Editorial Evaluation Pending

13 Aug 2024: Editorial Decision: Accept

Peer review status: ACCEPTED