Enhancing Outlier Detection in Air Quality Index Data Using a Stacked
Machine Learning Model
Abstract
Air quality is a key component of environmental health, with serious
consequences for human health and well-being. The Air Quality Index
(AQI) is a frequently used metric for assessing air quality in various
areas and at different times. However, AQI data, like many other types
of environmental data, can contain outliers: data points that deviate
significantly from other observations, indicating exceptionally good or
poor air quality. Detecting them is a critical step in identifying and
understanding extreme pollution episodes that can have serious
environmental and public health consequences. These outliers can be
caused by a variety of factors, including measurement errors, unusual
meteorological conditions, and pollution events. While outliers can
occasionally give useful information about these unusual conditions,
they can also skew analyses and models if they are not adequately
accounted for. This paper describes a hybrid method for detecting
outliers in data and evaluates it on AQI data. The method is a
stacked machine learning model that combines K-means clustering,
Random Forest (RF), and Gradient Boosting Classifier (GBC). K-means is
used for initial categorization, followed by RF model training, and
finally the RF output is used as input to the GBC to generate the
final classification. The performance of this stacked machine learning
model is evaluated and compared with single models using accuracy as
the metric. The findings show that the proposed technique is effective,
achieving an accuracy of 0.99, which indicates its potential for
reliable outlier detection in data.
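
The three-stage pipeline described above can be illustrated with a minimal sketch in Python using scikit-learn. It is not the implementation used in this study. The synthetic readings, the choice of two clusters, the rule that the smaller cluster is the outlier class, and the helper `with_rf_output` are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for AQI readings (NOT the study's data):
# mostly moderate values plus a small group of extreme episodes.
normal = rng.normal(loc=80, scale=15, size=(950, 1))
extreme = rng.normal(loc=350, scale=30, size=(50, 1))
X = np.vstack([normal, extreme])

# Stage 1: K-means for initial categorization.
# Assumption: with two clusters, the smaller one is labeled "outlier" (1).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_sizes = np.bincount(kmeans.labels_, minlength=2)
y = (kmeans.labels_ == np.argmin(cluster_sizes)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Stage 2: train the Random Forest on the K-means-derived labels.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)


def with_rf_output(model, features):
    """Hypothetical helper: append the RF outlier probability as a feature."""
    proba = model.predict_proba(features)[:, 1]
    return np.column_stack([features, proba])


# Stage 3: the RF output feeds the Gradient Boosting Classifier,
# which produces the final classification.
gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(with_rf_output(rf, X_train), y_train)

stacked_acc = accuracy_score(y_test, gbc.predict(with_rf_output(rf, X_test)))
rf_acc = accuracy_score(y_test, rf.predict(X_test))
print(f"RF alone: {rf_acc:.3f}  stacked RF -> GBC: {stacked_acc:.3f}")
```

In this sketch the labels come from K-means itself, so the reported accuracy measures agreement with the clustering rather than with independently verified outliers. The synthetic groups are also far easier to separate than real AQI data would be.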