This study addresses the performance gap between academic benchmarks and real-world applications of Large Language Models (LLMs) in Natural Language to SQL (NL-to-SQL) tasks. We evaluated a diverse range of LLMs across multiple hosting platforms, using a balanced set of 360 questions spanning simple, moderate, and challenging difficulty levels. Our analysis incorporated metrics covering execution accuracy, operational cost, and throughput. Results revealed significant platform-dependent performance variations: models such as GPT-4o and Claude-3.5 Sonnet demonstrated superior accuracy (41.44% and 40.94%, respectively) but at higher operational cost, while GPT-4o mini emerged as a cost-effective alternative with 38.56% accuracy. The Mistral family exhibited exceptional throughput, particularly on the Bedrock platform. Across all models, performance declined consistently as query complexity increased, highlighting ongoing challenges in handling intricate database queries. Our findings emphasize the complex interplay between accuracy, cost, and throughput in real-world deployments and underscore the importance of platform-specific optimizations. This study provides a nuanced understanding of LLM capabilities in NL-to-SQL tasks for researchers and practitioners, and guides future research directions in this rapidly evolving field. The code is available at https://github.com/petavue/NL2SQL-Benchmark.
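A common formulation of the execution-accuracy metric mentioned above compares the rows returned by the gold query and the model-generated query, order-insensitively. The sketch below is illustrative only; the function names (`execution_match`, `score`) are hypothetical and may not match the benchmark's actual implementation.

```python
from collections import Counter

def execution_match(gold_rows, pred_rows):
    """Two SQL result sets match if they contain the same rows,
    ignoring row order (a multiset comparison)."""
    return Counter(map(tuple, gold_rows)) == Counter(map(tuple, pred_rows))

def score(result_pairs):
    """Fraction of (gold, predicted) result-set pairs that match exactly."""
    return sum(execution_match(g, p) for g, p in result_pairs) / len(result_pairs)
```

In practice, both queries are executed against the same database and their result sets passed to a comparison like this; stricter variants may also require matching column order or duplicates-sensitive semantics.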