Protein-coding gene annotation and filtering
We annotated protein-coding genes using ab initio, RNA-seq-based, and homology-based methods in the MAKER v2.31.10 genome annotation pipeline (Cantarel et al., 2008). Augustus v3.2.3 (Stanke & Waack, 2003) and SNAP v2013-02-16 (Korf, 2004) were used for ab initio gene prediction. For Augustus, we used the parameters retrained during the BUSCO analysis of the genome assembly described above, obtained by invoking the Augustus retraining option. In the first round of annotation, we ran MAKER with the PFM transcriptome assemblies, protein sequences from eight lepidopteran species (Bombyx mori, Trichoplusia ni, Ostrinia furnacalis, Bombyx mandarina, Galleria mellonella, Spodoptera litura, Helicoverpa armigera, Plutella xylostella), and the Augustus model as evidence. The GFF3 file from the first round of annotation was used to train the SNAP parameters. In each of the next three rounds, the GFF3 file from the previous round and the Augustus and SNAP models were used as evidence.
The annotation results from the MAKER pipeline were filtered using gene expression evidence, functional annotation results, and Annotation Edit Distance (AED) values. Genes with an FPKM value greater than 0 in any RNA-seq sample were considered real genes and retained for further analysis. Functional domains of proteins were identified using InterProScan 5.34-74.0 (Jones et al., 2014) against the Pfam database v32.0 (El-Gebali et al., 2019). The gene models were filtered based on domain content and evidence support following Campbell, Holt, Moore, and Yandell (2014). Finally, only annotations with AED < 0.75 were retained (Campbell et al., 2014).
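The expression and AED criteria can be illustrated with a minimal Python sketch. It is not the script used in this study. It assumes that AED is read from the `_AED=` attribute that MAKER writes on mRNA features in its GFF3 output, that lower AED indicates better agreement with the evidence (so models under the cutoff are the ones kept), and that the two criteria are applied jointly. The function names `parse_aed` and `keep_model` and the example attribute string are hypothetical, and the Pfam domain filter is omitted.

```python
import re

# Matches the "_AED=" attribute but not "_eAED=", which MAKER also writes.
AED_PATTERN = re.compile(r"(?:^|;)_AED=([0-9.]+)")


def parse_aed(attributes):
    """Return the AED value from a GFF3 attribute string, or None if absent."""
    match = AED_PATTERN.search(attributes)
    return float(match.group(1)) if match else None


def keep_model(fpkm_values, aed, aed_cutoff=0.75):
    """Assumed joint rule: FPKM > 0 in at least one sample and AED below the cutoff."""
    expressed = any(value > 0 for value in fpkm_values)
    return expressed and aed is not None and aed < aed_cutoff


# Hypothetical attribute column of one mRNA feature.
attrs = "ID=gene1-mRNA-1;Parent=gene1;_AED=0.12;_eAED=0.30"
aed = parse_aed(attrs)
print(keep_model([0.0, 3.2], aed))
```

Keeping the two predicates in one small function makes it easy to change how they are combined, for example to retain every expressed gene regardless of AED, if that reading of the filtering steps is preferred.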
Functions of the protein-coding genes were annotated using eggNOG-mapper v1.0.3 (Huerta-Cepas et al., 2017), a tool for fast functional annotation of novel sequences using precomputed eggNOG-based orthology assignments, against the eggNOG v5.0 database (Huerta-Cepas et al., 2019).