Repeat annotation, gene prediction and functional annotation
Repeat elements were annotated in the burbot genome before the gene prediction. Tandem Repeat Finder (Benson, 1999) and LTR_FINDER (Zhao et al., 2007) were applied for the ab initio  prediction of repeat elements in this genome. RepeatMasker and RepeatProteinMask (http://www.repeatmasker.org) were executed to search the genome sequences for known repeat elements, with the genome sequences used as queries against the Repbase database (Jurka et al., 2005). The ribosomal RNA (rRNA) and microRNA genes were predicted using the Infernal v.1.1 software based on the Rfam and miRBase databases, respectively (Griffiths-Jones et al., 2005). Transfer RNA (tRNA) genes were identified using the tRNAscan-SE v. 1.3.1 software (Lowe & Eddy, 1997).
Based on the repeat-masked genome, three strategies based on abinitio , homologs, and RNA-sequencing were applied to predict the protein-coding genes in the burbot genome assembly. Ab initiogene prediction was performed using Augustus (v2.7) (Stanke et al., 2006) and GenScan (Burge & Karlin, 1997) with default settings. For the homology-based prediction, protein sequences of Gadus morhua ,Acanthochromis polyacanthus , Oryzias latipes ,Amphiprion ocellaris , Anabas testudineus ,Astatotilapia calliptera , Astyanax mexicanus ,Austrofundulus limnaeus , Lepisosteus oculatus , andNotothenia coriiceps were downloaded from the NCBI database and aligned to the burbot genome by using tBLASTn (E-value ≤ 1e−5). The homologous genome sequences were then aligned against the matching proteins by using GeneWise (v2.4.0) (Doerks et al., 2002) for accurately spliced alignments. Transcriptomic data (2×150 bp, NovaSeq-6000 platform) generated from a mixture of six tissues, including the muscle, eye, gonad, gill, liver, and spleen, were aligned to the assembled genome sequences by using HISAT2 (v2.0.10) (Pertea et al., 2016), and the putative transcript structures were detected using gene structure, which was predicted by Cufflinks (Ghosh et al., 2016). All gene models were merged, and redundancy was removed by MAKER (Cantarel et al., 2008) and HiCESAP.
The NCBI nonredundant protein (NR), InterPro, the SwissProt, TrEMBL (Boeckmann et al., 2003), eukaryotic orthologous groups of proteins (KOG) (Tatusov et al., 2003), and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa & Goto, 2000) databases with an E-value threshold of 1e–5 were used for the functional annotation of the protein-coding genes by using BLASTX and the BLASTN utility (Lobo, 2008). Functional ontology and pathway information from the Gene Ontology (GO) database was assigned to the genes by using Blast2GO (Conesa et al., 2005).
Comparative genomic analyses and testing for genomic selection
The protein sequences of 12 species of teleost fish (Supporting Information Table S1) were downloaded from Ensembl (release version 100). Only the longest transcript was selected for each gene locus with alternative splicing variants. Orthologous groups were constructed by ORTHOMCL (14) v2.0.9 by using the default settings based on the filtered BLASTP results. The single‐copy orthologous genes shared by all 13 species were further aligned using MUSCLE (version 3.8.31) (Edgar, 2004) and concatenated to construct a phylogenetic tree with RaxML (Stamatakis, 2014). The divergence time among species was estimated by the r8s (Sanderson, 2003). Divergence times ofLarimichthys crocea-Notothenia coriiceps (88–114), L. crocea-Gambusia affinis (96.9–150.9), Clupea harengus-G. affinis (149.85–165.2) from the TimeTree database (Kumar et al., 2017) were used as the calibration times. Gene family expansion and contraction analyses were performed using CAFÉ 3.1 (De Bie et al., 2006) with the estimated phylogenetic tree information. P value < 0.05 was used to indicate significantly changed gene families. The expanded and contraction gene families in burbot, the shared contraction gene families of three freshwater species (burbot,Monopterus albus, and G. affinis ) in GO terms, and KEGG pathways were enriched, and the Benjamini and Hochberg FDR correction was applied. Significantly overrepresented GO terms and KEGG pathways were identified with corrected P values ≤0.05.
To construct multiple sequence alignments among the ortholog genes, the CODEML program in PAML 4.5 was used to estimate the dN/dS ratio (ω) (Yang, 2007). Two different branch-site likelihood ratio tests were applied to find genes under positive selection. First, the species-specific positively selected genes (PSGs) in burbot were identified with burbot as foreground species and other species, excluding zebrafish Danio rerio , Clupea harengus andEsox lucius as background species. Second, three freshwater species (i.e., burbot, M. albus, and G. affinis ) were selected as foreground species and other species, excluding zebrafishDanio rerio , Clupea harengus and Esox lucius , as background species. GO and KEGG categories were assigned to orthologous groups according to the zebrafish genome reference for the functional enrichment analyses.