Introduction
Researchers are creating data, code, and software in previously unimaginable quantities. Data sources are everywhere. Researchers use new tools, such as digital notebooks, to compile and version data, to write code, and to summarize the methods and results of their work for sharing with others. They have ways to control access to research data and to share it when they’re ready. Some of the researchers who make the most significant impact in both the human and natural sciences are those whose work is data-intensive, who adopt new technologies and practices as they go, and who actively share their data to maximize and accelerate the impact of their work.
Sharing data can, however, raise challenges and concerns for researchers and, on the publishing side, for authors and editors of academic research journals. Depending on the discipline, most authors have options for where and how to share their data. They may wonder which repositories are best suited to their data, where others in their community are already sharing, and where their data will be most accessible and discoverable and, in turn, able to make the biggest impact. When it comes time to submit their work to a journal for publication, they must also consider which repositories and methods comply with that journal's data sharing policies. Journal editors have concerns of their own. Some wonder whether expectations around data sharing will be difficult for authors to meet and whether that will in turn affect submissions to their journal. Others are eager to promote data sharing policies but need help improving their own knowledge and expertise so that they can better guide and assist their authors.
Research publishers that don’t recognize the fundamental shift towards data-intensive research in the human and natural sciences, that don’t appreciate the challenges researchers and journal editors face while trying to make an impact in this changing environment, and that don’t offer the support researchers look to publishers for are taking a significant risk. They miss the opportunity to be part of the research data revolution and to support quality and best practices in data sharing, and in turn they risk becoming less and less relevant. Most researchers need support to thrive in this new data-intensive world. There is an opportunity right now for publishers to provide support around data sharing for researchers and authors and to drive change through their journals.
In recognition of the challenges that authors and journal editors alike face in navigating the changing landscape around data, Wiley is developing resources that support researchers who want or need to share the new data they create, including those shared in this preprint \cite{Wu_2019}. By doing so, we can help lead positive change. To this end, we analyzed the data sharing behaviors of communities of research authors who publish in Wiley journals. We present the methodology, results, and our analyses here. These results will inform future efforts and plans for developing useful and valuable resources for our author and editor communities. It is our hope that these resources will help facilitate a deeper understanding of best practices around data sharing, reporting standards, and data repositories.
Methods
In our previous report \cite{Graf}, we extracted 124,000 Data Availability Statements (DASs) from the answers to custom questions asked by 176 journals in Wiley's electronic editorial office systems between 2013 and 2019. We limited the analysis to answers from journals whose custom questions contained the terms "data availability" or "data accessibility".
We used spaCy's Matcher to identify and extract URLs from the answers to the custom questions for each submission, and tldextract to extract the domains and subdomains from each URL.
For this study, we applied the same data source and method to about 145,000 data availability statements in total. About 28,000 of these contained URLs, and we were able to resolve about 19,500 of those to repositories.
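The extraction and resolution steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the pipeline used in this study: a regular expression stands in for spaCy's Matcher, `urllib.parse` stands in for tldextract, and the pattern, the function names, and the small `REPOSITORY_BY_HOST` lookup table are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter
from typing import Dict, List, Optional
from urllib.parse import urlsplit

# Assumption: a simple pattern for http(s) URLs and bare "www." hosts.
# The study used spaCy's Matcher instead; this regex is only a stand-in.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>()\[\]]+", re.IGNORECASE)

# Assumption: an illustrative host-to-repository table, not the study's mapping.
REPOSITORY_BY_HOST: Dict[str, str] = {
    "zenodo.org": "Zenodo",
    "figshare.com": "Figshare",
    "datadryad.org": "Dryad",
    "osf.io": "OSF",
    "github.com": "GitHub",
}


def extract_urls(statement: str) -> List[str]:
    """Return URL-like strings found in one data availability statement."""
    # Strip trailing sentence punctuation that the pattern may have captured.
    return [m.rstrip(".,;:'\"") for m in URL_PATTERN.findall(statement)]


def host_of(url: str) -> str:
    """Return the lower-cased host of a URL, without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlsplit needs a scheme to locate the host
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:  # malformed URL, e.g. an unbalanced IPv6 bracket
        return ""
    return host[4:] if host.startswith("www.") else host


def resolve_repository(host: str) -> Optional[str]:
    """Match the host, or any of its parent domains, against the table."""
    parts = host.split(".")
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in REPOSITORY_BY_HOST:
            return REPOSITORY_BY_HOST[candidate]
    return None


def summarize(statements: List[str]) -> dict:
    """Count statements, statements with URLs, and statements resolved."""
    with_urls = 0
    resolved = 0
    by_repository: Counter = Counter()
    for statement in statements:
        urls = extract_urls(statement)
        if not urls:
            continue
        with_urls += 1
        repos = {resolve_repository(host_of(u)) for u in urls} - {None}
        if repos:
            resolved += 1
            by_repository.update(repos)
    return {
        "statements": len(statements),
        "with_urls": with_urls,
        "resolved": resolved,
        "by_repository": by_repository,
    }
```

Calling `summarize` on a list of statements returns the same three kinds of counts reported above (all statements, statements with URLs, and statements resolved to a repository). Unlike tldextract, this sketch does not consult the Public Suffix List, so it matches hosts only against the suffixes listed in the lookup table.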