Batch effects in population genomic studies with low-coverage whole genome sequencing data: causes, detection, and mitigation

Runyang Nicolas Lou; Nina Overgaard Therkildsen

doi:10.22541/au.162791857.78788821/v2

Abstract

Over the past few decades, the rapid democratization of high-throughput sequencing and the growing emphasis on open science practices have resulted in an explosion in the amount of publicly available sequencing data. This opens new opportunities for combining datasets to achieve unprecedented sample sizes, spatial coverage, or temporal replication in population genomic studies. However, a common concern is that non-biological differences between datasets may generate batch effects that can confound real biological patterns. Despite general awareness about the risk of batch effects, few studies have examined empirically how they manifest in real datasets, and it remains unclear what factors cause batch effects and how to best detect and mitigate their impact bioinformatically. In this paper, we compare two batches of low-coverage whole genome sequencing (lcWGS) data generated from the same populations of Atlantic cod (Gadus morhua). First, we show that with a “batch-effect-naive” bioinformatic pipeline, batch effects severely biased our genetic diversity estimates, population structure inference, and selection scan. We then demonstrate that these batch effects resulted from multiple technical differences between our datasets, including the sequencing instrument model/chemistry, read type, read length, DNA degradation level, and sequencing depth, but their impact can be detected and substantially mitigated with simple bioinformatic approaches. We conclude that combining datasets remains a powerful approach as long as batch effects are explicitly accounted for. We focus on lcWGS data in this paper, which may be particularly vulnerable to certain causes of batch effects, but many of our conclusions also apply to other sequencing strategies.

22 Jul 2021 Submitted to Molecular Ecology Resources

29 Jul 2021 Submission Checks Completed

29 Jul 2021 Assigned to Editor

02 Aug 2021 Reviewer(s) Assigned

30 Aug 2021 Review(s) Completed, Editorial Evaluation Pending

22 Sep 2021 Editorial Decision: Revise Minor

05 Nov 2021 Review(s) Completed, Editorial Evaluation Pending

05 Nov 2021 1st Revision Received

11 Nov 2021 Editorial Decision: Accept

Jul 2022 Published in Molecular Ecology Resources, volume 22, issue 5, pages 1678-1692. doi:10.1111/1755-0998.13559

Peer review status: Published