Gene prediction and functional annotation
Three approaches were used for gene prediction: ab initio prediction, homology-based protein mapping, and transcript-based annotation. For ab initio prediction, AUGUSTUS was used to predict G. przewalskii genes, with a training set derived from the transcriptome (Stanke, et al. 2008; Hoff and Stanke 2019). For homology-based protein mapping, data from six related species (Carassius auratus, Cyprinus carpio, Oryzias latipes, Takifugu rubripes, Gasterosteus aculeatus, and Danio rerio) were downloaded from the National Center for Biotechnology Information (NCBI) and Ensembl to construct a homologous protein database (Table S1), which GeMoMa used to annotate genes (Keilwagen, et al. 2016). For transcript-based annotation, PacBio Sequel transcript data were corrected with IsoSeq2 (https://github.com/PacificBiosciences/IsoSeq) to produce a transcriptome, and Illumina X-ten RNA-seq data were mapped to the G. przewalskii genome and assembled into a transcriptome with HISAT2 and StringTie (Pertea, et al. 2016). The RNA-seq and Iso-seq data were then used to predict genes with PASA (--ALIGNERS gmap -f) (Haas, et al. 2003). EVidenceModeler (Haas, et al. 2008) was then used to integrate the predictions above into a raw gene set. Finally, we searched the annotated genes against the PSI database and removed the genes with hits from the raw gene set to obtain the final gene set (Altschul, et al. 1997).
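The final filtering step can be illustrated with a minimal Python sketch. It assumes the raw gene set is held as a mapping from gene ID to sequence and that the IDs of genes with database hits have already been collected; the function name `remove_hit_genes` and the toy gene IDs are hypothetical and are not taken from the pipeline used here.

```python
from typing import Dict, Iterable, Set


def remove_hit_genes(raw_gene_set: Dict[str, str],
                     hit_ids: Iterable[str]) -> Dict[str, str]:
    """Return the genes of raw_gene_set whose IDs are not in hit_ids."""
    hits: Set[str] = set(hit_ids)  # set lookup keeps the filter linear in gene count
    return {gene_id: seq for gene_id, seq in raw_gene_set.items()
            if gene_id not in hits}


# Hypothetical toy data: three predicted genes, one of which had a hit.
raw = {"gene1": "MKT", "gene2": "MAA", "gene3": "MLV"}
final = remove_hit_genes(raw, ["gene2"])
print(sorted(final))
```

In practice the hit IDs would be read from the search output rather than listed by hand.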
To assign gene functions, all predicted genes were searched against five databases: KEGG, KOG, NR, Swiss-Prot, and GO (Ashburner, et al. 2000; Kanehisa and Goto 2000). For the first four databases, gene sequences were translated into protein sequences and searched against each database using BLASTP (e-value 0.00001) (Altschul 1990). For the GO database, InterProScan was used for annotation (Hunter, et al. 2009).
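As an illustration of how the e-value threshold can be applied, the following Python sketch keeps the best hit per query from BLASTP tabular output. It assumes the default 12-column tabular format (`-outfmt 6`), in which the e-value is the eleventh column, and treats the cutoff as inclusive; the function name `best_hits` and the example records are hypothetical and do not come from the data analyzed here.

```python
import csv
import io
from typing import Dict, TextIO, Tuple

EVALUE_CUTOFF = 1e-5  # threshold stated in the text (0.00001)


def best_hits(handle: TextIO,
              cutoff: float = EVALUE_CUTOFF) -> Dict[str, Tuple[str, float]]:
    """Map each query ID to its (subject ID, e-value) hit with the lowest e-value."""
    best: Dict[str, Tuple[str, float]] = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue > cutoff:
            continue  # discard hits weaker than the cutoff
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical records in the default tabular column order.
example = (
    "gene1\tprotA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "gene1\tprotB\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-10\t120\n"
    "gene2\tprotC\t35.0\t60\t39\t0\t1\t60\t1\t60\t2e-3\t40\n"
)
hits = best_hits(io.StringIO(example))
print(hits)
```

Genes absent from the returned mapping, such as `gene2` in this toy input, would remain unannotated for that database.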