1 | INTRODUCTION
Advances in DNA sequencing during the past decade have led to
genomes of diverse organisms being successfully sequenced and assembled
(de Man et al., 2016; Iorizzo et al., 2016; Jarvis et al., 2017; Lien et
al., 2016). High-quality genome assembly requires high levels of
contiguity, which enable new insights into genome structure evolution
and increase the gene space completeness of the assembly (Berlin et al.,
2015; Gordon et al., 2016; Koren et al., 2013; Loman, Quick, & Simpson,
2015). However, the presence of repetitive regions in a genome poses a
major challenge to the assembly of highly contiguous genomes.
Mate-pair sequencing involves the generation of long-insert paired-end
DNA libraries whose inserts span several kilobase pairs and can therefore
bridge long repeat regions.
This is useful for many sequencing applications, including de novo
sequencing, genome finishing, structural variant detection, and
identification of complex genomic rearrangements (Maretty et al., 2017;
Smadbeck et al., 2018; Tan, Tan, & Cheng, 2020; van Heesch et al.,
2013; Wetzel, Kingsford, & Pop, 2011). During mate-pair library
preparation, DNA is fragmented and fragments of the desired length are
isolated. The ends of these fragments are then biotinylated, and the
fragments are circularized. Next, the circular DNA is sheared into smaller
fragments (400-600 bp). Fragments carrying the biotin tag are enriched, and
adapters are ligated. The resulting library is then ready for cluster
generation and
sequencing. Although this technology does not produce long reads, it is
able to span repeat regions if the insert size is sufficiently large.
Combining data generated from mate-pair library sequencing with those
from short-insert paired-end reads provides a powerful combination of
read lengths for maximal sequencing coverage across the genome, leading
to a dramatic improvement in the assembly of large genomes. Mate pairs
with small, medium, and large insert sizes are usually used to scaffold
contigs in order to improve genome assemblies (Pop, Phillippy, Delcher,
& Salzberg, 2004).
Third-generation long-read sequencing technologies, such as PacBio
(Rhoads & Au, 2015) and Nanopore (Jain, Olsen, Paten, & Akeson,
2016), increase read lengths to overcome the challenge of sequencing
repetitive regions: reads must be long enough to anchor in
nonrepetitive sequences and span the repeats. A repeat can be
spanned, and the region subsequently assembled, if the read
length is substantially longer than the repeat region (Bongartz, 2019).
Third-generation long reads are also used for scaffolding during genome
assembly (Boetzer & Pirovano, 2014).
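The spanning condition described above can be illustrated with a minimal sketch. The function name `spans_repeat`, the half-open coordinate convention, and the `min_anchor` parameter are assumptions introduced here for illustration only; they are not taken from any of the cited tools.

```python
def spans_repeat(read_start, read_end, repeat_start, repeat_end, min_anchor=500):
    """Return True if a read (or mate-pair insert) spans a repeat.

    Coordinates are half-open intervals [start, end) on the same sequence.
    The read must extend at least `min_anchor` bases (a hypothetical
    threshold) into nonrepetitive sequence on both sides of the repeat.
    """
    left_anchor = repeat_start - read_start   # unique bases before the repeat
    right_anchor = read_end - repeat_end      # unique bases after the repeat
    return left_anchor >= min_anchor and right_anchor >= min_anchor
```

Under these assumptions, a 10 kb read covering a 5 kb repeat with 2 kb and 3 kb of flanking unique sequence would count as spanning, whereas a read that ends inside the repeat, or that is anchored by only a few bases on one side, would not.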
High-quality DNA, which is crucial for mate-pair sequencing, can only be
obtained from material that is both fresh and abundant. Similarly,
high-molecular-weight DNA (>50 kb) is needed to realize the
full potential of third-generation sequencing. The
lack of suitable starting material limits the choice of sequencing
technology and affects the quality of the obtained data. For example, in
a comparative genomics study of ruminants, the genomes of several
species, such as mountain nyala, common eland, bongo, and oribi, could be
assembled only at the contig level because the DNA samples were degraded
and therefore unsuitable for constructing mate-pair libraries (Chen et al., 2019).
Another example of poor-quality DNA is ancient DNA (aDNA)
(Stoneking & Krause, 2011), which mostly consists of very short fragments
of between 44 and 172 bp (Sawyer, Krause, Guschanski, Savolainen, & Paabo,
2012).
Although mate-pair and third-generation sequencing cannot be applied
to degraded or ancient samples, Grau, Hackl, Koepfli, and
Hofreiter (2018) developed a method that generates in silico mate-pair libraries using a reference genome from a closely related
species, thereby helping to assemble genomes at the scaffold level. In
order to improve genome contiguity, they developed cross-species
scaffolding, a new pipeline that imports long-range distance
information directly into a de novo assembly process by
constructing mate-pair libraries in silico. After processing,
cleaned reads of the target species are mapped to the repeat-masked
reference genome, and a consensus is computed. Next, read pairs of
mate-pair libraries are generated from the consensus. Finally, the
cleaned reads and in silico mate pairs are used to assemble the
genome using SOAPdenovo2 (Luo et al., 2012). Application of this in silico mate-pair method resulted in a dramatic improvement in
contiguity and accuracy, as demonstrated by the assembly of two
primate genomes based on just ∼30x coverage of shotgun sequencing data
(Grau et al., 2018). A drawback of this approach is the introduction of
assembly chimeras (Grau et al., 2018). Furthermore, phylogenetic
distance, quality, and completeness of the reference genome, as well as
its overall synteny and transposable element content, influence the
final number of misassemblies. Methods by which misassemblies can be
reduced and the best references chosen for generating in silico mate pairs are yet to be tested.
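The step that turns a consensus sequence into mate pairs can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of Grau et al. (2018): the function names, the sliding-window `step`, the forward-reverse read orientation, and the rule of skipping reads that contain `N` are all hypothetical choices made for this example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def in_silico_mate_pairs(consensus, insert_size, read_len, step):
    """Cut read pairs from a consensus sequence at a fixed insert size.

    A window of `insert_size` bases slides along the consensus in
    increments of `step`. The first `read_len` bases of each window form
    read 1; the last `read_len` bases, reverse-complemented, form read 2.
    Pairs in which either read contains an uncalled base (N) are skipped,
    which is an assumption of this sketch.
    """
    pairs = []
    for start in range(0, len(consensus) - insert_size + 1, step):
        fragment = consensus[start:start + insert_size]
        read1 = fragment[:read_len]
        read2 = fragment[-read_len:]
        if "N" in read1.upper() or "N" in read2.upper():
            continue
        pairs.append((read1, reverse_complement(read2)))
    return pairs
```

Calling the function several times with different `insert_size` values would yield libraries with small, medium, and large inserts. Because the distance between the two reads is inherited from the reference-guided consensus, any rearrangement between the reference and the target within a window can carry over into the pairs, which is consistent with the chimera risk noted above.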
In addition to the in silico mate-pair method, the similarity between
the target and reference species can be exploited in what is referred to
as the reference-guided approach to gain additional information, which
often leads to more complete and improved genome assemblies (Bao, Jiang,
& Girke, 2014; Pop et al., 2004; Schneeberger et al., 2011). In
contrast to the in silico method, which generates mate pairs prior
to genome assembly, reference-guided approaches, such as Chromosomer
(Tamazian et al., 2016), Ragout (Kolmogorov, Raney, Paten, & Pham,
2014), and RaGOO (Alonge et al., 2019), use a single reference to
order, orient, and join contigs and long reads. Therefore, the in silico mate-pair method is more flexible than the
reference-guided approach. For example, high-quality, conserved mate pairs can be
selected by comparing two or more reference genomes to reduce
misassemblies in the target genome assembly.
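One way such a selection could work is sketched below. The criterion used here (both mates on the same sequence in every reference, separated by a distance within a tolerance of the intended insert size) is an assumption chosen for illustration, as are the function name `is_conserved`, the tuple layout, and the default `tolerance`; the text above does not specify how conservation is assessed.

```python
def is_conserved(placements, insert_size, tolerance=0.2):
    """Decide whether a mate pair is placed consistently in all references.

    `placements` holds one entry per reference genome: either None (the
    pair could not be placed) or a tuple (seq1, pos1, seq2, pos2) giving
    the sequence name and start position of each mate. The pair is kept
    only if, in every reference, both mates lie on the same sequence and
    their separation deviates from `insert_size` by at most
    `tolerance * insert_size`.
    """
    for placement in placements:
        if placement is None:
            return False
        seq1, pos1, seq2, pos2 = placement
        if seq1 != seq2:
            return False
        if abs(abs(pos2 - pos1) - insert_size) > tolerance * insert_size:
            return False
    return True
```

A pair that sits about 2 kb apart on one chromosome in both references would be retained, while a pair whose mates fall on different chromosomes in one reference would be discarded as a likely rearrangement breakpoint.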
In this study, we attempted to optimize the use of the in silico method. First, we investigated how the phylogenetic distance between a
reference and a target affects the quality of genome assembly. We then
tested whether generating a conserved mate pair by comparing multiple
reference genomes improves the quality of genome assembly. Finally, we
tested the effect of the optimized in silico mate-pair strategy
on degraded samples and simulated ancient DNA data.