3.3 Repeat annotation, gene prediction and gene annotation
A total of 384.29 Mb of repeat sequences were detected, accounting for 66.74% of the assembled genome (Table 6). This repeat content was considerably higher than the estimate (36.60%) obtained from the k-mer analysis. The repetitive sequences consisted mainly of DNA transposable elements (289.32 Mb; 50.24% of the assembly), long terminal repeats (66.95 Mb; 11.63%), and long interspersed elements (30.96 Mb; 5.38%) (Table 7).
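As a quick consistency check on the figures above, the assembly size implied by the total repeat length and its percentage can be recovered and used to recompute the share of each repeat class. The following is a minimal sketch that uses only the numbers quoted in this paragraph; the variable and function names are illustrative and are not part of the original analysis pipeline.

```python
# Values quoted in the text (Tables 6 and 7); names are illustrative only.
total_repeat_mb = 384.29
total_repeat_pct = 66.74

# Assembly size implied by the total repeat length and its percentage.
assembly_mb = total_repeat_mb / (total_repeat_pct / 100)

repeat_classes_mb = {
    "DNA transposable elements": 289.32,
    "long terminal repeats": 66.95,
    "long interspersed elements": 30.96,
}


def percent_of_assembly(length_mb, assembly_size_mb=assembly_mb):
    """Return the share of the assembly covered by a repeat class, in percent."""
    return 100 * length_mb / assembly_size_mb


class_percentages = {
    name: percent_of_assembly(length_mb)
    for name, length_mb in repeat_classes_mb.items()
}

for name, pct in class_percentages.items():
    print(f"{name}: {pct:.2f}% of assembly")
```

Small differences in the last decimal place relative to the reported percentages are expected, because the implied assembly size is itself derived from rounded values.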
A total of 21,664 protein-coding genes were predicted by combining ab initio, homology-based, and RNA-seq-based strategies. The average gene length, exon length, and intron length were 14,606, 292.38, and 1,223 bp, respectively (Table 8). The statistics of the predicted gene models were compared with those of ten other teleost species (Acanthochromis polyacanthus, Oryzias latipes, Amphiprion ocellaris, Anabas testudineus, Astatotilapia calliptera, Astyanax mexicanus, Austrofundulus limnaeus, Gadus morhua, Lepisosteus oculatus, and Notothenia coriiceps) and showed similar distribution patterns in mRNA length, CDS length, exon length, intron length, and exon number (Supplementary Figure S2). A summary of the genome characteristics of burbot is shown in Figure 3. A total of 20,658 predicted genes (95.36%) were successfully annotated by alignment to the nucleotide, protein, and annotation databases InterPro, NR, Swiss-Prot, TrEMBL, KOG, GO, and KEGG (Table 9). A total of 6,390 tRNAs, 300 rRNAs, and 519 microRNAs were identified by noncoding RNA prediction (Supplementary Table S8).
3.4 Comparative genomics and the mechanism of adaptation to freshwater
A total of 19,998 gene families and 2,650 single-copy orthologous genes were identified using the genomes and genes of 13 selected teleosts. The 21,664 burbot genes could be clustered into 14,504 gene families, including 132 unique gene families (Supplementary Table S9). A maximum likelihood (ML) phylogenetic tree constructed from the single-copy orthologous genes showed that burbot and Atlantic cod clustered together, with a divergence time between the two cod species of ~44.4 Mya (Figure 4). This divergence time was consistent with the estimate of Hughes et al. (2018). The burbot genome displayed 639 expanded and 1,564 contracted gene families compared with the common ancestor of burbot and Atlantic cod (Figure 4). The expanded gene families of burbot were significantly enriched in 73 GO terms and 34 KEGG pathways, mainly including DNA integration (GO:0015074, corrected P value = 0.00E+00), DNA metabolic process (GO:0006259, corrected P value = 2.05E−06), apoptotic process (GO:0006915, corrected P value = 5.22E−05), zinc ion binding (GO:0008270, corrected P value = 2.02E−96), transition metal ion binding (GO:0046914, corrected P value = 1.19E−91), natural killer cell-mediated cytotoxicity (ko04650, corrected P value = 4.200299E−20), and hematopoietic cell lineage (ko04640, corrected P value = 3.04E−18), which are associated with cell damage repair, ion binding, and the immune system (Supplementary Tables S10 and S11). Conversely, the contracted gene families of burbot were enriched in the GO terms homophilic cell adhesion via plasma membrane adhesion molecules (GO:0007156, corrected P value = 2.69E−29), cell-cell adhesion via plasma-membrane adhesion molecules (GO:0098742, corrected P value = 2.69E−29), and membrane (GO:0016020, corrected P value = 1.18E−10), and in the amino sugar and nucleotide sugar metabolism (ko00520, corrected P value = 1.52E−04) and NOD-like receptor signaling (ko04621, corrected P value = 2.41E−02) pathways (Supplementary Tables S12 and S13).
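The text reports corrected P values for GO and KEGG enrichment without stating the test or the correction used. As an illustration only, the following minimal sketch assumes a one-sided hypergeometric over-representation test followed by Benjamini-Hochberg correction, which is one common combination; the function names and the toy counts are hypothetical and do not come from the study.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_background_in_term, n_query, n_query_in_term):
    """One-sided P(X >= n_query_in_term) under the hypergeometric distribution."""
    upper = min(n_query, n_background_in_term)
    total = comb(n_background, n_query)
    tail = sum(
        comb(n_background_in_term, i)
        * comb(n_background - n_background_in_term, n_query - i)
        for i in range(n_query_in_term, upper + 1)
    )
    return tail / total


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy counts (hypothetical): 100 background genes, 10 annotated to a term,
# 10 genes in expanded families, 5 of which carry the term.
raw_p = hypergeom_enrichment_p(100, 10, 10, 5)
corrected = benjamini_hochberg([raw_p, 0.04, 0.03])
print(raw_p, corrected)
```

Under this assumed scheme, a term would be called enriched when its adjusted P value falls below the chosen significance threshold.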
Notably, the three freshwater species shared no expanded gene families but shared two contracted gene families, which were associated with cell adhesion (GO:0007155, corrected P value = 0.00E+00) and membrane (GO:0016020, corrected P value = 0.00E+00) (Supplementary Table S14). These functions are critical for adjusting the ion concentrations inside and outside the cell. However, no enriched KEGG pathway was found for the contracted gene families. The contraction of these gene families may reflect reduced functional requirements for cell membrane permeability in the stable ionic environment of freshwater. These findings are consistent with the differences in omega-3 fatty acid composition between marine and freshwater fish (Taşbozan & Gökçe, 2017). Marine fish have higher levels of omega-3 fatty acids than freshwater species. Compared with omega-6 fatty acids, omega-3 fatty acids help improve cell membrane fluidity and provide osmoregulatory capabilities.
To identify genes evolving under positive selection for freshwater adaptation, two different likelihood ratio tests (branch-site model) were performed. A total of 377 genes were identified as positively selected genes (PSGs) in the burbot genome (Supplementary Table S15). The burbot PSGs were functionally enriched in the organic cyclic compound metabolic process (GO:1901360, corrected P value = 1.83E−02), cellular nitrogen compound metabolic process (GO:0034641, corrected P value = 4.11E−03), RNA metabolic process (GO:0016070, corrected P value = 4.13E−03), and nucleic acid metabolic process (GO:0090304, corrected P value = 6.16E−03) (Supplementary Table S16). Additionally, 38 PSGs were detected with the three freshwater lineages (burbot, M. albus, and G. affinis) as foreground branches (Supplementary Table S17). Four PSGs (stk33, ino80e, nabp1a, and znf385a) were related to DNA damage repair. Genes stk33 and nabp1 participate in the mitotic DNA damage checkpoint. znf385a is located upstream in the p53 activating pathway; it interacts with p53/TP53 and promotes DNA damage-induced cell cycle arrest (Das et al., 2007). Protein ino80e is a component of the chromatin remodeling INO80 complex and contributes to DNA double-strand break repair (Yao et al., 2008).
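For readers unfamiliar with likelihood ratio tests, the comparison for a single gene can be sketched as follows. The example assumes that the null and alternative branch-site models differ by one free parameter and that the test statistic is compared with a chi-square distribution with one degree of freedom, which is a common convention; the text does not state which null distribution or software was used, and the log-likelihood values below are hypothetical.

```python
from math import erfc, sqrt


def branch_site_lrt(lnl_null, lnl_alt):
    """Return (LRT statistic, P value) for nested models differing by one parameter."""
    # Twice the log-likelihood gain of the alternative model, floored at zero.
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods for one gene (not values from the study).
stat, p = branch_site_lrt(lnl_null=-2345.67, lnl_alt=-2340.12)
print(f"2*deltaLnL = {stat:.2f}, P = {p:.2e}")
```

In a genome-wide scan, the per-gene P values obtained this way would normally be corrected for multiple testing before genes are reported as PSGs.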
Exposure of freshwater fish to UV radiation may cause DNA damage. The presence of a group of positively selected genes involved in DNA repair is consistent with the higher levels of UV exposure in freshwater environments than in the ocean. This finding suggests that these genes have undergone functional convergence in the three freshwater lineages.
The PSGs of the freshwater lineages were enriched in the folic acid transport GO term (GO:0015884, slc19a1, corrected P value = 8.10E−05) and in amino acid metabolism, replication, and repair pathways (Supplementary Tables S18 and S19). slc19a1 has an important role in folate transmembrane transport. Low osmotic pressure has previously been shown to affect the efficiency of folic acid absorption in the intestine (Zhao et al., 2011). Positive selection on slc19a1 may therefore improve folic acid absorption in freshwater species. These data will serve as valuable resources for future evolutionary studies of burbot.
4. Conclusion
A chromosome-scale genome assembly of the burbot was generated by integrating Hi-C and PacBio long-read sequencing data. The burbot is the only freshwater member of the cod family and has the widest longitudinal range of any freshwater fish in the world. The genome assembly and annotation provide the second high-quality genome of the order Gadiformes and important genomic data for whole-genome analyses of the evolution of burbot relative to other cod species. A series of candidate genes involved in freshwater adaptation were identified in the comparative genomic analyses. These results help elucidate the evolutionary process of the order Gadiformes under environmental change. The data are also useful for diverse conservation applications, including identifying conservation units, assessing gene flow, detecting local adaptation of populations, and elucidating the evolutionary history of burbot.