Gene prediction and functional annotation
Gene prediction combined three approaches: ab initio prediction,
homologous protein mapping, and transcript-based annotation. For ab
initio prediction, AUGUSTUS was used to predict G. przewalskii genes,
with the training set derived from the transcriptome
(Stanke, et al. 2008;
Hoff and Stanke 2019). For homologous
protein mapping, protein data for six related species (Carassius auratus, Cyprinus carpio, Oryzias latipes, Takifugu
rubripes, Gasterosteus aculeatus, Danio rerio) were
downloaded from the National Center for Biotechnology Information (NCBI) and
Ensembl to construct a homologous protein database (Table S1), which
GeMoMa used to annotate genes
(Keilwagen, et al. 2016). Next, PacBio
Sequel transcript data were corrected with IsoSeq2
(https://github.com/PacificBiosciences/IsoSeq) to produce a
transcriptome, and Illumina X Ten RNA-seq data were mapped to the
G. przewalskii genome with HISAT2 and assembled into a transcriptome
with StringTie (Pertea, et al. 2016). The RNA-seq
and Iso-Seq data were then used to predict genes with PASA (--ALIGNERS gmap -f
) (Haas, et al. 2003). Then,
EVidenceModeler (Haas, et al. 2008) was
used to integrate the gene predictions described above into a raw
gene set. Finally, we searched the annotated genes against the PSI
database and removed the genes with hits from the raw gene set to
obtain the final gene set (Altschul, et al. 1997).
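
The final filtering step above (dropping every gene that produced a hit in the database search) can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes the search results are available as BLAST tabular output (-outfmt 6, query ID in the first column) and that any reported hit marks a gene for removal. The function names and the gene IDs in the usage note are hypothetical.

```python
import csv
import io


def hit_query_ids(blast_tabular_text):
    """Collect the query IDs that have at least one reported hit.

    Assumption: the input is BLAST tabular output (-outfmt 6), i.e.
    tab-separated rows with the query ID in the first column.
    Blank lines and '#' comment lines are skipped.
    """
    hits = set()
    reader = csv.reader(io.StringIO(blast_tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        hits.add(row[0])
    return hits


def remove_hit_genes(raw_gene_ids, hits):
    """Return the raw gene set without the genes that had a hit, keeping order."""
    return [gene_id for gene_id in raw_gene_ids if gene_id not in hits]
```

For example, with a made-up raw set `["gene1", "gene2", "gene3"]` and a result table in which only `gene2` appears as a query, `remove_hit_genes` returns `["gene1", "gene3"]`. If the search was filtered by score or e-value before this step, that filter would have to be applied to the table first; the text does not state one.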
To assign gene functions, all predicted genes were searched
against five databases: KEGG, KOG, NR, Swiss-Prot, and GO
(Ashburner, et al. 2000;
Kanehisa and Goto 2000). For the first
four databases, the gene sequences were translated into protein
sequences and mapped to each database using BLASTP (e-value
0.00001) (Altschul 1990). For the GO
database, InterProScan was used for annotation
(Hunter, et al. 2009).
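
The two mechanical steps in the BLASTP-based annotation, translating each gene sequence into protein and keeping only hits within the e-value threshold of 0.00001, can be sketched as below. This is an illustrative sketch under stated assumptions: the standard genetic code (NCBI translation table 1) is assumed, keeping one best hit per gene is an assumption rather than a rule stated in the text, and all function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated over TCAG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

EVALUE_CUTOFF = 0.00001  # threshold given in the text


def translate_cds(cds):
    """Translate a coding sequence up to the first stop codon.

    Codons containing ambiguous bases are translated as 'X';
    trailing bases that do not form a full codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def best_hits(blast_rows, cutoff=EVALUE_CUTOFF):
    """Keep, per query, the lowest-e-value hit at or below the cutoff.

    blast_rows: iterable of (query_id, subject_id, evalue) tuples.
    Assumption: one best hit per gene is retained.
    """
    best = {}
    for query, subject, evalue in blast_rows:
        if evalue > cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best
```

Applied to each of KEGG, KOG, NR, and Swiss-Prot in turn, `best_hits` would yield one annotation table per database; genes without any hit under the threshold simply do not appear in the result.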