A Review of Data Structures for Data Science

Fernando Perez,
Jey Kottalam,
Kyle Barbary,
Kathryn Huff,
Daniel Turek,
Nathaniel Smith,
zhangzhao,
Dav Clark,
Stéfan van der Walt

Abstract

Data structures are the foundation upon which computational tools are built. For example, the simple pointer-to-memory approach, established by languages such as Fortran and C, acts as a de facto standard through which different packages and libraries can interoperate on a single shared array of numerical data in memory. While this simple abstraction for n-dimensional arrays has served us well, there is a clear need for data structures that have richer semantics and make it easy to express and manipulate common forms of (semi-)structured data. This need is highlighted by the popularity of R’s data frames and of Python libraries such as bcolz (column storage), pandas (indexed data frames), and xray (n-dimensional indexed arrays).
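As a minimal sketch of the pointer-to-memory convention, the following uses only the Python standard library to let two independent consumers operate on one block of doubles without copying. The variable names are illustrative and do not come from any of the libraries named above.

```python
import array
import ctypes

# One contiguous block of C doubles owned by a standard-library array.
values = array.array("d", [1.0, 2.0, 3.0, 4.0])

# Consumer 1: a memoryview exposes the same memory without copying.
view = memoryview(values)

# Consumer 2: ctypes wraps the same memory as a C double[4].
# from_buffer shares the underlying buffer; it does not copy it.
c_doubles = (ctypes.c_double * len(values)).from_buffer(values)

# buffer_info() returns (address, element count): the raw pointer and
# length that a C or Fortran routine would receive.
address, count = values.buffer_info()

# A write through either consumer is visible in the original array.
c_doubles[0] = 10.0
view[1] = 20.0
```

The shared block carries only an element type and a length. It has no labels, column names, or index, which is the gap that data frames and indexed arrays are meant to fill.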

This paper aims to present the state of the art in the data structures that are foundational to data science, scientific computing, and statistical applications, across programming languages and implementations. It will review the data representation semantics currently implemented by various libraries, packages, and languages, with an explicit emphasis on interoperability across languages and process boundaries.