Big Data Analytics to Enable Integrated Research of Biodiversity and
Climate Datasets in the Amazon Basin
Abstract
With the mass adoption of data analysis in several scientific fields
such as climatology, medicine, astronomy and astrophysics, the
availability of an appropriate analytics infrastructure has become a
necessity increasingly recognized by the scientific community.
Processing the large volume of data collected and generated by
researchers requires appropriate tools and applications, and one of the
biggest challenges is that these tools must be brought together before
they can be applied in specific domains. Bioclimatic data is a
scientific field that still has much room for improvement in this
respect: few efforts have been made to provide methodologies and tools
that facilitate the understanding of the complex phenomena through which
environmental variables influence biodiversity on the planet. Thus, the
purpose of this work is to propose
a big data analytics architecture in the form of an ecosystem that
systematizes and simplifies the work of scientists dealing with the
complexity of bioclimatic data analysis, providing tools for storage,
management, analysis using machine learning and data mining algorithms,
and visualization. Methodologically, we conducted a thorough literature
review to identify the most widely used tools and to assess the
suitability of each for the purpose of this work. In addition, the
literature pointed to methodologies for implementing software
ecosystems, which served as a guide in the architecture
design. Within the architecture, we sought to assemble a bioclimatic
dataset from a subset of the Atmospheric Radiation Measurement (ARM)
data repository, for climatic data, and from the Brazilian Biodiversity
Portal, for biodiversity data. As a result, we
were able to gather a set of tools covering data access, such as
Cassandra; distributed processing, such as Spark; a programming
interface, represented by Jupyter Notebook; system modules for data
format conversion; machine learning libraries; and software for data
visualization. This research discusses the importance of a
domain-specific design for a bioclimatic data analysis architecture. We
conclude that this type of ecosystem is imperative to facilitate the
research process and increase the quality of the results.
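
As a rough illustration of the kind of integration the architecture is meant to support, the sketch below joins climate measurements with species occurrence records by site and date. It is not the implementation described in this work: the record fields, site names, species names, and values are all invented for illustration and do not come from ARM or the Brazilian Biodiversity Portal, and plain Python stands in for the storage and distributed processing layers (Cassandra and Spark).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; field names and values are illustrative only.
climate_records = [
    {"site": "site_a", "date": "2015-03-01", "temp_c": 26.0},
    {"site": "site_a", "date": "2015-03-01", "temp_c": 28.0},
    {"site": "site_b", "date": "2015-03-01", "temp_c": 24.5},
]
occurrence_records = [
    {"site": "site_a", "date": "2015-03-01", "species": "species_x"},
    {"site": "site_b", "date": "2015-03-01", "species": "species_y"},
    {"site": "site_c", "date": "2015-03-02", "species": "species_x"},
]


def daily_mean_temperature(records):
    """Average temperature readings per (site, date) key."""
    grouped = defaultdict(list)
    for record in records:
        grouped[(record["site"], record["date"])].append(record["temp_c"])
    return {key: mean(values) for key, values in grouped.items()}


def join_occurrences_with_climate(occurrences, climate_by_key):
    """Attach the daily mean temperature to each occurrence record.

    Occurrences with no matching climate data keep a value of None,
    so missing coverage stays visible instead of being dropped.
    """
    joined = []
    for occurrence in occurrences:
        key = (occurrence["site"], occurrence["date"])
        joined.append({**occurrence, "mean_temp_c": climate_by_key.get(key)})
    return joined


climate_by_key = daily_mean_temperature(climate_records)
integrated = join_occurrences_with_climate(occurrence_records, climate_by_key)
for row in integrated:
    print(row)
```

Keeping unmatched occurrences with an empty temperature, rather than discarding them, is one possible design choice; it lets an analyst see where the two sources do not overlap before applying machine learning or visualization steps.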