Repeat annotation, gene prediction and functional annotation
Repeat elements were annotated in the burbot genome before the gene
prediction. Tandem Repeat Finder (Benson, 1999) and LTR_FINDER (Zhao et
al., 2007) were applied for the ab initio prediction of repeat
elements in this genome. RepeatMasker and RepeatProteinMask
(http://www.repeatmasker.org)
were executed to search the genome sequences for known repeat elements,
with the genome sequences used as queries against the Repbase database
(Jurka et al., 2005). The ribosomal RNA (rRNA) and microRNA genes were
predicted using the Infernal v.1.1 software based on the Rfam and
miRBase databases, respectively (Griffiths-Jones et al., 2005). Transfer
RNA (tRNA) genes were identified using the tRNAscan-SE v. 1.3.1 software
(Lowe & Eddy, 1997).
Based on the repeat-masked genome, three strategies based on abinitio , homologs, and RNA-sequencing were applied to predict the
protein-coding genes in the burbot genome assembly. Ab initiogene prediction was performed using Augustus (v2.7) (Stanke et al.,
2006) and GenScan (Burge & Karlin, 1997) with default settings. For the
homology-based prediction, protein sequences of Gadus morhua ,Acanthochromis polyacanthus , Oryzias latipes ,Amphiprion ocellaris , Anabas testudineus ,Astatotilapia calliptera , Astyanax mexicanus ,Austrofundulus limnaeus , Lepisosteus oculatus , andNotothenia coriiceps were downloaded from the NCBI database and
aligned to the burbot genome by using tBLASTn (E-value ≤ 1e−5). The
homologous genome sequences were then aligned against the matching
proteins by using GeneWise (v2.4.0) (Doerks et al., 2002) for accurately
spliced alignments. Transcriptomic data (2×150 bp, NovaSeq-6000
platform) generated from a mixture of six tissues, including the muscle,
eye, gonad, gill, liver, and spleen, were aligned to the assembled
genome sequences by using HISAT2 (v2.0.10) (Pertea et al., 2016), and
the putative transcript structures were detected using gene structure,
which was predicted by Cufflinks (Ghosh et al., 2016). All gene models
were merged, and redundancy was removed by MAKER (Cantarel et al., 2008)
and HiCESAP.
The NCBI nonredundant protein (NR), InterPro, the SwissProt, TrEMBL
(Boeckmann et al., 2003), eukaryotic orthologous groups of proteins
(KOG) (Tatusov et al., 2003), and Kyoto Encyclopedia of Genes and
Genomes (KEGG) (Kanehisa & Goto, 2000) databases with an E-value
threshold of 1e–5 were used for the functional annotation of the
protein-coding genes by using BLASTX and the BLASTN utility (Lobo,
2008). Functional ontology and pathway information from the Gene
Ontology (GO) database was assigned to the genes by using Blast2GO
(Conesa et al., 2005).
Comparative genomic analyses and testing for genomic selection
The protein sequences of 12 species of teleost fish (Supporting
Information Table S1) were downloaded from Ensembl (release version
100). Only the longest transcript was selected for each gene locus with
alternative splicing variants. Orthologous groups were constructed by
ORTHOMCL (14) v2.0.9 by using the default settings based on the filtered
BLASTP results. The single‐copy orthologous genes shared by all 13
species were further aligned using MUSCLE (version 3.8.31)
(Edgar, 2004)
and concatenated to construct a phylogenetic tree
with RaxML (Stamatakis, 2014). The divergence time among
species was estimated by the r8s (Sanderson, 2003). Divergence times ofLarimichthys crocea-Notothenia coriiceps (88–114), L.
crocea-Gambusia affinis (96.9–150.9), Clupea harengus-G.
affinis (149.85–165.2) from the TimeTree database (Kumar et al.,
2017)
were used as the calibration times. Gene family expansion and
contraction analyses were performed using CAFÉ 3.1 (De Bie et al.,
2006)
with the estimated phylogenetic tree information. P value
< 0.05 was used to indicate significantly changed gene
families. The expanded and contraction gene families in burbot, the
shared contraction gene families of three freshwater species (burbot,Monopterus albus, and G. affinis ) in GO terms, and KEGG
pathways were enriched, and the Benjamini and Hochberg FDR correction
was applied. Significantly overrepresented GO terms and KEGG pathways
were identified with corrected P values ≤0.05.
To construct multiple sequence alignments among the ortholog genes, the
CODEML program in PAML 4.5 was used to estimate the dN/dS ratio (ω)
(Yang, 2007). Two different branch-site likelihood ratio tests were
applied to find genes under positive selection. First, the
species-specific positively selected genes (PSGs) in burbot were
identified with burbot as foreground species and other species,
excluding zebrafish Danio rerio , Clupea harengus andEsox lucius as background species. Second, three freshwater
species (i.e., burbot, M. albus, and G. affinis ) were
selected as foreground species and other species, excluding zebrafishDanio rerio , Clupea harengus and Esox lucius , as
background species. GO and KEGG categories were assigned to orthologous
groups according to the zebrafish genome reference for the functional
enrichment analyses.