Optimising High-throughput sequencing data analysis, from gene database selection to the analysis of compositional data: A case study on tropical soil nematodes
  • Simin Wang,
  • Dominik Schneider,
  • Tamara Hartke,
  • Johannes Ballauff,
  • Carina Moura,
  • Garvin Schulz,
  • Zhipeng Li,
  • Andrea Polle,
  • Rolf Daniel,
  • Oliver Gailing,
  • Bambang Irawan,
  • Stefan Scheu,
  • Valentyna Krashevska
Simin Wang, University of Göttingen (Corresponding Author: [email protected])
Dominik Schneider, Georg-August-Universität Göttingen
Tamara Hartke, University of Göttingen
Johannes Ballauff, University of Göttingen
Carina Moura, University of Göttingen
Garvin Schulz, University of Göttingen
Zhipeng Li, Chinese Academy of Sciences
Andrea Polle, University of Göttingen
Rolf Daniel, University of Göttingen
Oliver Gailing, University of Göttingen
Bambang Irawan, Jambi University
Stefan Scheu, University of Göttingen
Valentyna Krashevska, University of Göttingen

Abstract

High-throughput sequencing (HTS) provides an efficient and cost-effective way to generate large amounts of sequence data. However, marker-based methods and the resulting datasets come with a range of challenges and disputes, including incomplete reference databases, controversial sequence similarity thresholds for delineating taxa, and the compositional nature of the data in downstream analyses. Here, we use HTS data from a soil nematode biodiversity experiment to address the following questions: (1) how the choice of reference database affects HTS data analysis, (2) whether the same ecological patterns are detected with amplicon sequence variants (ASVs; 100% similarity) versus classical operational taxonomic units (OTUs; 97% similarity), and (3) how different data normalization methods affect the recovery of beta diversity patterns and the identification of differentially abundant taxa. At the time of analysis, the SILVA database performed better than PR2, assigning more reads to the family level and providing higher phylogenetic resolution. ASV- and OTU-based alpha and beta diversity of nematodes correlated closely, indicating that OTU-based studies represent useful reference points. For downstream data analyses, our results indicate that rarefaction-based methods are more likely to miss findings, while methods based on the centered log-ratio (clr) transformation may overestimate the tested effects. ANCOM-BC retains all data and accounts for the uneven sampling fraction of each sample, suggesting that it is currently the optimal method for analyzing compositional data. Overall, our study highlights the importance of comparing and selecting taxonomic reference databases before data analysis, and provides solid evidence for the similarity and comparability of OTU- and ASV-based nematode studies. Further, the results highlight the potential weaknesses of rarefaction-based and clr-transformation-based methods. We recommend that future studies use ASVs and that both the taxonomic reference database and the normalization strategy be carefully tested and selected before the data are analyzed.
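
To make the contrast between two of the normalization families named above more concrete, the following is a minimal Python sketch of rarefaction and the clr transformation applied to a single sample. It is an illustration under stated assumptions, not the pipeline used in this study: the function names, the example read counts, the rarefaction depth, and the pseudocount of 0.5 are all hypothetical choices made for this example. ANCOM-BC is not reproduced here.

```python
import numpy as np


def rarefy(counts, depth, rng):
    """Subsample one sample's read counts to a fixed depth.

    Reads are drawn without replacement, which corresponds to a
    multivariate hypergeometric draw over the taxa.
    """
    counts = np.asarray(counts, dtype=np.int64)
    if counts.sum() < depth:
        # Samples below the chosen depth would have to be discarded,
        # which is one way rarefaction loses data.
        raise ValueError("sample has fewer reads than the requested depth")
    return rng.multivariate_hypergeometric(counts, depth)


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample.

    The pseudocount (an assumption of this sketch) replaces zeros so that
    the logarithm is defined; the choice of value can influence results.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtracting the mean log equals dividing by the geometric mean.
    return log_x - log_x.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    sample = [120, 30, 0, 850]  # hypothetical read counts for four taxa
    print("rarefied to 100 reads:", rarefy(sample, 100, rng))
    print("clr-transformed:", np.round(clr(sample), 3))
```

Rarefaction throws away reads (and possibly whole samples) to equalize depth, whereas the clr transformation keeps every read but depends on how zeros are handled; both properties are consistent with the weaknesses the abstract attributes to these approaches.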