Soude Ghari

Data science has become an integral part of enterprise business operations, aiming to uncover data-driven insights. However, such systems are subject to volatility and variability: assessing the accuracy of the predictions they make requires accounting for a multitude of operating conditions, any of which can change at any time. The challenge, therefore, is to estimate the predictive performance of data science systems even under uncertainty and variability. Such an estimate enables effective planning and, potentially, dynamic resource allocation for these projects. In this work, we present a study of various machine learning and deep learning models for estimating the performance of data science projects deployed on Apache Spark, a popular and flexible distributed analytics platform. We design experiments to gather training data and assess feature importance, identifying which input configuration parameters contribute most to the predictions. We demonstrate the full process of building such a model, from data collection through training and testing, and we systematically compare the alternatives to help decision-makers choose the best one. By providing insight into the performance of data science projects under uncertain and variable conditions, this work contributes to both research and practice, supporting informed decisions, effective planning, and dynamic resource allocation. Our results show that LSTM and MLP outperform the other models for both response time and throughput prediction.
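The abstract describes a workflow of training a predictor on job configurations and then ranking feature importance to find which configuration parameters matter most. As a rough illustration only, the sketch below trains an MLP regressor on synthetic data and applies permutation importance with scikit-learn; the feature names, data-generating function, and model settings are assumptions for the example, not the paper's actual Spark benchmark setup or pipeline.

```python
# Minimal sketch: predict a job's response time from its configuration,
# then rank which configuration parameter contributes most.
# All data here is synthetic; it does NOT reproduce the paper's experiments.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Hypothetical Spark-style configuration features:
# executor count, executor memory (GB), input size (GB), shuffle partitions.
feature_names = ["executors", "memory_gb", "input_gb", "partitions"]
X = rng.uniform(low=[1, 1, 1, 8], high=[16, 32, 100, 512], size=(400, 4))
# Synthetic response time: dominated by input size per executor, plus noise.
y = X[:, 2] / X[:, 0] + 0.05 * X[:, 3] + rng.normal(0.0, 0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale inputs before the MLP; wide-ranging raw features hurt convergence.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0),
).fit(X_tr, y_tr)
print("held-out R^2:", round(model.score(X_te, y_te), 3))

# Permutation importance: how much held-out score drops when each
# feature's column is shuffled, approximating its contribution.
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, score in zip(feature_names, imp.importances_mean):
    print(f"{name}: {score:.3f}")
```

The same comparison loop can be repeated for other model families (e.g. an LSTM over time-ordered runs) to mirror the model selection the abstract reports.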