4 | DISCUSSION
High-quality genome sequences are critical for biological research
studies that focus on chromosomal structure and gene rearrangement,
among others. Despite recent advances in sequencing technologies, many
genome assemblies have not yet achieved the desirable level of quality.
Assembling the large or complex genomes of some species remains
challenging. Moreover, current technologies, such as long-read
sequencing and mate-pair sequencing, cannot be used to generate
high-quality genome assemblies for some rare or extinct species, because
the available DNA of these species is either degraded or ancient.
Therefore, in silico mate-pair assembly may still be useful,
especially for species for which only degraded DNA or ancient samples
are available.
The phylogenetic distance of the reference to the target species, the
quality and completeness of the reference genome, and its overall
synteny and transposable element content all affect the final quality of
target genome assemblies. Thus, not all references are appropriate for
genome assembly of a target species. We therefore tested multiple
references at different phylogenetic distances from the target species.
This was demonstrated by constructing the genome assemblies of
C. batrachus, T. bimaculatus, and T. buxtoni using in silico mate-pair
libraries generated separately from different references. In summary, a
reference from the same genus as the target species is better for making
in silico mate pairs than more divergent references. In addition to
phylogenetic
distance, the quality of the reference genome also affected the target
genome assembly. For example, the number of in silico mate pairs
generated from the B. grunniens genome (different genus but same
subfamily) to assemble the genome of T. buxtoni was higher than
the number generated from T. scriptus or T. strepsiceros (same
genus). The genome of B. grunniens had an N50 of 114 Mb, which
was much larger than that of T. scriptus (890 Kb) or T.
strepsiceros (511 Kb). Nevertheless, the number of complete BUSCO genes
in the target genome assembled using B. grunniens as the
reference was only slightly higher than that using the congener as the
reference. Thus, the quality and completeness of references influence
the final assemblies, but to a lesser extent than the phylogenetic
distance of the reference species to the target.
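
For readers comparing the contiguity figures quoted above, the following
is a minimal sketch of how an N50 value is conventionally computed from
a list of contig or scaffold lengths. The function name and the example
lengths are illustrative assumptions, not values or code from this
study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Illustrative lengths only (hypothetical, not from the paper).
example_lengths = [10, 8, 6, 4, 2]
example_n50 = n50(example_lengths)
```

Because N50 depends only on the length distribution, a reference with a
very high N50 can still be a poorer guide than a more fragmented
congener, which is consistent with the comparison above.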
Misassemblies, a common issue encountered in genome assembly, are mainly
caused by sequencing or assembler errors. In de novo assembly
based on long sequence reads, polishing with short reads is often used
to improve the base-pair accuracy of assemblies (Rice & Green, 2019).
Misassemblies in reference-guided genome assemblers or scaffolders are
inevitable because of unknown differences in synteny and transposable
element content between the references and the target species. This
issue is particularly severe for assemblers designed around a single
reference, which limits the wider use of reference-guided assembly
algorithms and tools. Reducing misassemblies in final genome assemblies
is therefore an important issue for genomic studies. We optimized the
in silico mate-pair method by searching for conserved in silico mate
pairs that reduce final misassemblies, under the assumption that
conserved mate pairs would display more consistent synteny in the target
species.
We found that using three or more references (conserved at the family or
order level) reduced the number of misassemblies dramatically, but only
by sacrificing contiguity and the number of complete genes. However,
using two references from the same genus as the target species balanced
contiguity, accuracy, and gene completeness of the final assemblies. By
contrast, the original in silico mate-pair method using one
reference resulted in more complete genes as well as in more
misassemblies. Closer examination of these extra genes indicated that
many did not exist in the “true” genome or were erroneous.
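
The idea of retaining only conserved mate pairs can be illustrated with
a small sketch. It is a simplification under stated assumptions, not the
procedure used in this study: here a pair counts as "conserved" when the
same two end sequences occur in the libraries of at least `min_refs`
references and the insert sizes agree within a relative `tolerance`. The
function name, the parameters, and the tuple layout are all
hypothetical.

```python
from collections import defaultdict


def conserved_pairs(libraries, min_refs=2, tolerance=0.1):
    """Keep mate pairs supported by several references.

    libraries: dict mapping a reference name to an iterable of
               (left_seq, right_seq, insert_size) tuples.
    Returns a list of (left_seq, right_seq, insert_size) tuples, where
    insert_size is the middle value of the sorted supporting sizes
    (the upper median when the count is even).
    """
    # (left, right) -> {reference name: insert size}
    seen = defaultdict(dict)
    for ref_name, pairs in libraries.items():
        for left, right, insert_size in pairs:
            # Keep the first insert size seen per reference.
            seen[(left, right)].setdefault(ref_name, insert_size)

    kept = []
    for (left, right), by_ref in seen.items():
        if len(by_ref) < min_refs:
            continue  # not supported by enough references
        sizes = sorted(by_ref.values())
        middle = sizes[len(sizes) // 2]
        # Require consistent spacing, a rough proxy for shared synteny.
        if all(abs(s - middle) <= tolerance * middle for s in sizes):
            kept.append((left, right, middle))
    return kept


# Hypothetical toy libraries from two references.
toy = {
    "refA": [("ACGT", "TTGA", 2000), ("GGCC", "AATT", 2000)],
    "refB": [("ACGT", "TTGA", 2100), ("GGCC", "AATT", 5000)],
}
toy_conserved = conserved_pairs(toy)
```

Raising `min_refs` in this sketch discards more pairs, which mirrors the
trade-off described above: fewer misassemblies at the cost of contiguity
and gene completeness.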
The amount of sequence data from aDNA samples has grown steadily since
high-throughput sequencing was first applied to ancient human remains
(Rasmussen et al., 2010), with over 2000 ancient samples recorded
(Brunson & Reich, 2019). In addition to the limitations of aDNA
sequences, such as read length and contamination, data processing and
analysis algorithms lag behind the current speed and cost of sequencing.
This impedes paleogenomics, particularly the recovery of the
full nuclear genome. The genome assembly of ancient DNA data relies on
the alignment of sequencing reads to a linear reference genome, leading
to the selection of endogenous DNA sequences. Thus, we simulated aDNA
sequences and used these for genome assembly via different methods. The
results suggested that the optimized in silico mate-pair method
performed better than the use of aDNA reads alone or the original
in silico mate-pair method. It was also more accurate than the
assembler RaGOO, which may be because RaGOO is designed around a single
reference.
Use of in silico mate pairs for scaffolding is a simple method
that enables long-range distance information from a reference genome to
be incorporated into a de novo genome assembly, via the
generation of in silico mate-pair libraries. It is essentially a
novel reference-guided approach, since other chromosome scaffolders, such
as Chromosomer (Tamazian et al., 2016), MeDuSa (Bosi et al., 2015),
AlignGraph (Bao et al., 2014), and RaGOO (Alonge et al., 2019) exploit
distance information from a genome of a closely related organism to
order and extend scaffolds or contigs after the de novo assembly
process. By contrast, in silico mate-pair libraries obtain
distance information prior to the assembly process and can be adapted to
any genome assembler that accepts mate-pair sequences as input. The
contiguity of a genome assembly may be improved via the application of
in silico methods or other reference-guided approaches. However,
some reference-guided scaffolders rely heavily on paired-end or
long-read information, making them unsuitable for single-end
reads. In addition, a large proportion of these reference-guided
scaffolders are designed around a single reference, resulting in many
misassemblies in the draft genomes. Finally, all reference-guided genome
assemblers or scaffolders share a limitation: only the regions conserved
between the target species and the references can be resolved, whereas
the sequence between conserved regions remains unknown.
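
The core idea of an in silico mate-pair library discussed in this
section, namely sampling read pairs a fixed distance apart on a
reference sequence, can be sketched as follows. This is an illustration
under assumptions, not the implementation used in this study: the
function names, the parameters (`insert_size`, `read_len`, `step`), and
the forward/reverse orientation of the two reads are all hypothetical
choices, and a real assembler may expect a different orientation.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (N is left as N)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def in_silico_mate_pairs(reference, insert_size, read_len, step):
    """Yield (read1, read2) pairs spanning insert_size bases.

    read1 is taken from the start of each window on the forward strand;
    read2 is the reverse complement of the end of the same window.
    Windows whose reads contain N (gap bases) are skipped.
    """
    last_start = len(reference) - insert_size
    for start in range(0, last_start + 1, step):
        window = reference[start:start + insert_size]
        read1 = window[:read_len]
        read2 = reverse_complement(window[-read_len:])
        if "N" in read1.upper() or "N" in read2.upper():
            continue
        yield read1, read2


# Toy reference sequence (hypothetical, for illustration only).
toy_reference = "AAAACCCCGGGGTTTT"
toy_pairs = list(in_silico_mate_pairs(toy_reference, insert_size=12,
                                      read_len=4, step=4))
```

Pairs produced this way carry the long-range distance information of the
reference, so they can be supplied to any assembler that accepts
mate-pair input, as noted above.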