Genome quality assessment, assortment and annotation
We performed comprehensive assessments for the chromosome-level genome
assembly using datasets from PacBio, Illumina, and RNA-seq derived from
roots, stems and leaves. A number of other indices showed that the
genome was of high quality. By mapping the genomic sequencing data, we
found that the Illumina and PacBio data covered 99.45% and 99.76% of
the whole genomes, repectively. In total, 96.5% of BUSCO (Benchmarking
Universal Single-Copy Orthologs) genes were represented as complete, and
the proportion of transcriptome data that mapped to the genome was
97.8%. The coverage depth distribution for duplicated and single-copy
BUSCO core genes was identical, showing an expected Poisson distribution
(Fig. S1). This indicates that duplicated genes were not derived from
assembling redundancy.
We also used transcripts from several white poplar species, includingP. alba , P. adenopoda , P. davidiana , and P.
grandidentata , together with genomes derived from P. alba var.pyramidalis (J. Ma et al., 2019),P. tremula and P. tremuloides(Lin et al., 2018) and P.
trichocarpa (Tuskan et al., 2006), to
assess the chromosome-scale pseudomolecules (based on co-phylogenetic
analysis). The P. tomentosa genome was successfully separated
into two subgenomes (2×19 choromosomes), with sizes of 336.7 Mb and
344.4 Mb, respectively (Table 1, Table S5a, Table S5b). Mapping of
syntenic regions within the assembly showed clear
chromosome-to-chromosome correspondence and also extensive synteny among
different chromosomes, as expected for the highly duplicatedPopulus genome (Fig. 3). Furthermore, we also evaluated and
compared the read depth between the two subgenomes using both the PacBio
and Illumina reads; the results showed the same depth distribution
between two subgenomes, suggesting an accurate haplotype-resolved
assembly (Fig. S2).
Compared with previous poplar genome assemblies
(J. Ma et al., 2019;
T. Ma et al., 2013;
Tuskan et al., 2006;
Yang et al., 2017), the result of
comprehensive assessments showed that the P. tomentosa assembly
quality in the present study was substantially improved (Table S6).
Using a combination of RepeatModeler and RepeatMasker, 1,001,718 repeats
were identified and masked (de novo identification,
classification, and masking were all run under default parameters).
Collectively, these repeats were 307.6 Mb in size and comprised
~41.6% of the genome (Fig. 3a). Long-terminal repeats
(LTR) were the most abundant, making up 17.5% of the genome. 13.3% of
these were LTR/Gypsy elements, and 4.0% were LTR/Copia repeats. Second
to LTR were unknown elements, making up 9.8% of the genome. This was
followed by 5.6% of Helitron repeats and 5.4% of DNA
elements
(Fig. 3b) (details in Table S7). There were only slight differences in
repeat size between two subgenomes; total repeats sizes were 133.7 Mb
and 137.9 Mb in subgenomes A and D (discussed further below),
respectively, and the sizes of LTRs were 54.9 Mb and 58.3 Mb,
respectively (Table 1).
Transposable element (TE) abundance and distribution varied
significantly between the sub-genomes (Fig. S3). The TEs are composed of
Class I and Class II [Class I consists of LTR (Copia, Gypsy), LINE and
SINE (tRNA); Class II consists of DNA (CMC-EnSpm, hAT-Ac, hAT-Tag1,
PIF-Harbinger) and RC (Helitron)], all of which were widely
distributed on the two subgenomes) (Fig. S3 ①-⑦). For example, a total
109,542 and 114,615 TEs of Class I were found in Subgenome A and
Subgenome D, respectively (4.6% increase), and a total 101,190 and
97,478 TEs of Class II were found in the two subgenomes (3.6%
decrease), respectively (Fig. S4). Further Pearson’s Chi-square test
showed that all TE categories but the LINE’s were significant
differences after Bonferonni correction (alpha = 0.05/7 = 0.007142857)
(Table S8).
To annotate genes, we first annotated the unmasked P. tomentosagenome, incorporating 73,919 protein sequences, and 137,918 transcripts
assembled from P. tomentosa RNA-seq data. A total of 59,124 high
quality gene models were identified, with an average coding-sequence
length of 1.31 kb, 6.04 exons per gene, and 430 amino acids (aa) per
protein. There was 28.5% genome coverage with an total length of 210.8
Mb (Table 1, Table S9, Table S10).
The annotated genes were then associated with the three onotological
classes: biological process, cellular components, and molecular
functions (Fig. 3c). We predicted 662 tRNAs with a total length of
49,659 bp (average length per tRNA: 75 bp), and 436 rRNAs (106 28S
rRNAs, 106 18S rRNAs, and 224 5S rRNAs) with a length of 610,293 bp. We
also annotated 2,072 ncRNAs with a length of 218,117 bp using RfamScan
(Kalvari et al., 2018). Finally, we
performed alignments with protein databases using BLAT
(Kent, 2002), with a maximum annotation
ratio of 98.6% (Table S11).
Comparative genomics and evolution
We compared 19,594 gene families, cointaining 59,124 genes, in theP. tomentosa genome with those of other three sequenced poplar
genomes including P. trichocarpa , P. euphratica , andP. pruinosa using OrthoMCL (L. Li,
Stoeckert, & Roos, 2003). A total of 22,386 gene families (142,738
genes) were identified by homolog clustering. In addition, 14,738 gene
families (119,375 genes) were shared by all four poplar species, and
1,154 gene families consisting of 2,038 genes were found to be unique toP. tomentosa based on OrthoMCL “mutual optimization.”
Similarly, 646/1,349, 179/261, and 399/1,041 gene families/genes were
found to be unique to P. trichocarpa , P. euphratica andP. pruinosa, respectively (Fig. 4a, Table S12).
To phase the chromosome pairs and study the parental origin of P.
tomentosa , we selected 1,052 orthologous genes that appear to be
allelic between each P. tomentosa chromosme pair and are
single-copy genes in poplars of other sections, and then constructed
gene trees to assess phylogenetic distances. The allelic gene-pairs of
each P. tomentosa chromsosme pair were observed to be clearly
closest to either P. alba var. pyramidalis (PA) orP. adenopoda (PD), respectively, on most gene trees. Thus, it
sucessfully divided a total of 38 chromosomes of P. tomentosainto two subgenomes (2 × 19 chromosomes) based on phylogenetic
distances. To confirm the results, we measured Ks distances among
two subgenomes of P. tomentosa , P. alba var.pyramidalis (PA) and P. adenopoda (PD) using 5,345
single-copy orthologous genes. The results were also consistent; we
refer to the genome of P. tomentosa as comprised of subgenome A
(putatively derived from P. alba var. pyramidalis ) and
subgenome D (putatively derived from P. adenopoda ).
To investigate potential recombination events between the two
sub-genomes, the synonymous (Ks ) distance between 5,345 single
copy orthologus genes of P. tomentosa , P. alba var.pyramidalis and P. adenopoda was estimated. We found that
there was limited apparent recombination events within the large
majority of gene loci (4,309: 80.62%), though a low level of
recombination appeared to occur (38 loci, 0.87%); and 998 loci (18.7%)
did not meet either of above two hypotheses and thus were uninformative
(Table S13, Fig. S5). This suggests that the two parental subgenomes may
be largely still intact in P. tomentosa , at least with respect to
genic composition.
We re-constructed phylogenetic
trees of subgenome A, subgenome D and of other poplars (Fig. S6), as
well as for each of the corresponding 19 pairs of chromosomes (Fig. S7).
All of these analyses supported the hypothesis that the P.
tomentosa genome originated from hybridization between P.
adenopoda and P. alba var. pyramidalis . Based on the fact
that the P. alba var. pyramidalis is a male clone, and no
female clone are found, together with our previous phylogenetic analyses
of chloroplast genomes from section Populus(Gao et al., 2019) which indicated thatP. adenopoda is the maternal parent of P. tomentosa , we
deduce that P. alba var. pyramidalis and P.
adenopoda were the male and female parents, respectively, in the hybrid
formation of P. tomentosa .
To
address dates of divergence and duplication events in poplars, we
conducted collinearity analysis of homologous gene pairs derived fromPopulus species vs. Salix suchowensis using MCScanX
(Y. Wang et al., 2012). From theKs (synonymous substitution rate) distribution, we inferred a
whole genome duplication event (WGD) (based on paralogous pairs) and a
species divergence event (based on orthologous pairs). The Ksdistribution among syntenic genes of the four poplar species andS. suchowensis contained two peaks. One peak indicated that
poplar and Salix species both underwent a common WGD event
(Ks ≈ 0.25). Such WGD events are known to have occurred
frequently in the evolution of angiosperms
(Jiao et al., 2011;
Myburg et al., 2014;
Otto, 2007;
Van de Peer, Mizrachi, & Marchal, 2017).
This result is also consistent with a previous study on Salix
suchowensis (Dai et al., 2014). Another
peak that represents divergence between Populus and Salixis also visible, suggesting some further chromosomal duplication
(Ks ≈ 0.12) (Fig. 4b). Further analysis showed that sectionPopulus and P. trichocarpa have a divergence at Ks≈ 0.035, and P. adenopoda and P. alba at Ks ≈
0.025. Subsequently, as a variant, P. alba var.pyramidalis is separated from P. alba at Ks ≈
0.008. The hybridization event between P. adenopoda and P.
alba var. pyramidalis subsequently occurred, followed by the
emergence of P. tomentosa (Ks ≈ 0.005) (Fig. 4c).
To study the parental origin of P. tomentosa , we constructed
phylogenetic trees using Salix suchowensis as an outgroup.
Further, referencing the fossil-based divergence time of Populusand Salix at 48 Mya (Boucher,
Manchester, & Judd, 2003; Manchester,
Dilcher, & Tidwell, 1986), we estimated dates for taxonomic
divergence.
Phylogenetic
analysis indicated that the divergence event between sectionPopulus and section Tacamahaca (P. trichocarpa )
occurred at approximately 13.4 Mya. P. adenopoda , an ancestor ofP. tomentosa , was the first to separate from the Populusfamily as an independent clade approximately 9.3 Mya. Subsequently, the
aspen group and white poplars group underwent a divergence event
(approximately 8.4 Mya). Another ancestor of P. tomentosa ,P. alba var. pyramidalis , gave rise to an independent
variant of P. alba at approximately 4.8 Mya. Approximately 3.9
Mya, P. tomentosa was created by hybridization between P.
adenopoda and P. alba var. pyramidalis (Fig. 4d).
Phylogenetic trees constructed using the chloroplast genomes of 15 white
poplar species and P. trichocapa indicated that the most probable
female parent of P. tomentosa is P. adenopoda (Fig.
S8 ).
Whole-genome synteny analysis revealed pairs of P.
trichocarpa -homologous regions shared between chromosomes corresponding
to the two subgenomes of P. tomentosa . A dot plot (Fig. 4e)
indicated that most of the common linear segments of homologous
chromosomes were shared between P. trichocarpa, subgenome A and
subgenome D. The diagonal distribution (“/”) indicated orthologous
collinear genes in P. tomentosa and P. trichocarpa , and
other dispersed distribution-blocks in the dot plot, suggested the
collinearity of paralogous genes on non-homologous chromosomes between
the two poplars (Fig. 4e). These findings show that both of the P.
tomentosa sub-genomes are highly syntenic with P. trichocarpa .