Opinions about data sharing among researchers continue to be widely surveyed \cite{wiley2017,Fecher_2015}. Actual data sharing practices have been investigated by examining the data availability statements published in journal articles \cite{Federer_2018,Stall_2019,Hardwicke_2018}. Data availability statements describe whether and how researchers' newly analysed research data have been made available, and the conditions under which the data can be accessed. When authors have shared data in appropriate research data repositories, their data availability statements can include persistent identifiers that link the journal article to the data. Studies of data availability statements have been used to assess the impact and effectiveness of journal data sharing policies \cite{Federer_2018,Stall_2019,Hardwicke_2018}, and to track how data sharing practices are changing. Some conclude that practices fall short of the study authors' expectations \cite{Wallach_2018,Federer_2018,Rowhani_Farid_2016}. This underlines how important it is for publishers and journals to set reasonable expectations, and to support those expectations with robust policy and process \cite{open}. It also speaks to the pace of change in different communities as they become familiar with, interested in, able to, and required to share research data. Publishers and journals need to match that pace, and they can also lead change. Measuring, interpreting, and acting on data sharing trends ensures that publishers and journals continue to serve researchers well.
\section{Methods}
We used topic modelling, an unsupervised machine learning technique, to identify topics in 124,000 data availability statements submitted by research authors to 176 Wiley journals between 2013 and 2019. The complete workflow is available on GitHub (https://github.com/DWFlanagan/data-availability-statements) and is managed with Snakemake \cite{Koster_2012}.
Wiley's electronic editorial office systems allow for the inclusion of custom questions on a journal-by-journal basis. We first extracted all records whose custom question or answer contained the term "data", then limited the selection to questions that mentioned "data availability" or "data accessibility".
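The two-pass filter described above can be sketched in pandas. The column names and example records here are illustrative assumptions, not Wiley's actual editorial office schema:

```python
import pandas as pd

# Hypothetical export of custom question/answer records; the column names
# "question" and "answer" are illustrative, not the real schema.
records = pd.DataFrame({
    "question": [
        "Data Availability Statement",
        "Please confirm data accessibility",
        "Conflict of interest?",
    ],
    "answer": [
        "Data are available in the Dryad repository.",
        "Data are available on request.",
        "None declared.",
    ],
})

# First pass: keep any record mentioning "data" in the question or answer.
mask_data = (
    records["question"].str.contains("data", case=False)
    | records["answer"].str.contains("data", case=False)
)
candidates = records[mask_data]

# Second pass: restrict to data availability/accessibility questions.
mask_das = candidates["question"].str.contains(
    "data availability|data accessibility", case=False, regex=True
)
statements = candidates[mask_das]
```

Filtering on the question text in the second pass avoids counting answers that merely mention data in passing.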
We then used spaCy \cite{python} to tokenize the answers, limiting the tokens to nouns, proper nouns, and adjectives. We also added custom stop words to ignore, such as "Wiley", "url", "et", and "al".
Next, we used scikit-learn \cite{scikit-learn} to create a term frequency-inverse document frequency (TF-IDF) matrix \cite{SPARCK_JONES_1972} of the tokenized answers, followed by Latent Dirichlet Allocation (LDA) \cite{blei2003latent}. We initially clustered the documents into 20 topics. We used pyLDAvis \cite{sievert2014ldavis} to visualize the topics estimated by the model.
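The TF-IDF and LDA steps can be sketched with scikit-learn on a toy corpus (the documents below are invented; the real pipeline ran over the ~124,000 statements with 20 topics). Note that LDA is more commonly fit on raw term counts, so fitting it on TF-IDF weights, as described above, should be read as a sketch of this workflow rather than general practice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus standing in for the tokenized data availability statements.
docs = [
    "data available dryad repository",
    "data available request corresponding author",
    "data sharing not applicable new data",
    "code data available github",
] * 5  # repeated so the toy example has a few more documents

# TF-IDF matrix of the tokenized answers.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# LDA over the matrix; n_components is the number of topics (20 in the
# real pipeline, 3 here for the toy corpus).
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(tfidf)

# Each row of doc_topics is one statement's distribution over topics.
```

The fitted model's `components_` matrix gives per-topic term weights, which is what pyLDAvis visualizes to help with topic labelling.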
Finally, we labelled the topics where possible for further analysis and discussion, using Wiley's Data Sharing Policy Author Templates \cite{z3qpxr} as a starting guide.
\section{Results and Analysis}
Simply counting the number of answers to the custom questions that contain the terms "data availability" or "data accessibility" shows a dramatic uptick in volume starting in early 2019 (Figure \ref{530186}). This coincides with the rollout of Wiley's Expects Data policy, which added data availability statement requirements to more than 100 journals starting in December 2018 \cite{Wu_2019}.
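The monthly counting behind a figure like this can be sketched in pandas. The dates and answers below are invented for illustration, not the actual Wiley submission data:

```python
import pandas as pd

# Invented submission records; real data covered 2013-2019 across 176 journals.
submissions = pd.DataFrame({
    "date": pd.to_datetime(
        ["2018-11-05", "2018-12-12", "2019-01-08", "2019-01-20", "2019-02-02"]
    ),
    "answer": [
        "No data availability statement provided.",
        "Data accessibility: deposited in GenBank.",
        "Data availability: available on request.",
        "Data availability: openly available in Zenodo.",
        "Data accessibility: see supporting information.",
    ],
})

# Keep answers mentioning either term, then count per calendar month.
mask = submissions["answer"].str.contains(
    "data availability|data accessibility", case=False
)
monthly = (
    submissions[mask]
    .set_index("date")
    .resample("MS")  # month-start bins
    .size()
)
```

Plotting `monthly` over the full date range is one way to produce the kind of volume trend shown in the figure.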