Protein-coding gene annotation and filtering
We annotated protein-coding genes using ab initio, RNA-seq-based,
and homolog-based methods in the MAKER v2.31.10 genome annotation
pipeline (Cantarel et al., 2008). Augustus v3.2.3 (Stanke & Waack,
2003) and SNAP v2013-02-16 (Korf, 2004) were used for the ab
initio gene prediction. For Augustus, we used the parameters retrained
during the BUSCO analysis of the genome assembly described above, which
was run with the Augustus retraining option enabled. In the first round
of annotation, we ran MAKER with the transcriptome assemblies of PFM,
protein sequences from eight lepidopteran species (Bombyx mori,
Trichoplusia ni, Ostrinia furnacalis, Bombyx mandarina, Galleria
mellonella, Spodoptera litura, Helicoverpa armigera, and Plutella
xylostella), and the Augustus model as evidence. The GFF3 file from the
first round of annotation was used to train the SNAP parameters. In each
of the next three rounds of annotation, the GFF3 file from the previous
round and the Augustus and SNAP models were used as evidence.
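The round structure described above can be summarised in a short Python sketch that assembles the per-round settings for MAKER's maker_opts.ctl control file. This is only an illustration under stated assumptions: the keys genome, est, protein, augustus_species, snaphmm, and maker_gff are standard control-file options, but all file names, the model name, and the helper functions maker_round_options and render_ctl are hypothetical, and the text does not report how transcript and protein evidence was carried into later rounds or whether SNAP was retrained after each round.

```python
def maker_round_options(round_no, prev_gff=None, snap_hmm=None):
    """Return illustrative maker_opts.ctl settings for one annotation round.

    All paths and the Augustus model name are hypothetical placeholders.
    """
    opts = {
        "genome": "assembly.fasta",             # hypothetical assembly path
        "augustus_species": "busco_retrained",  # hypothetical retrained model name
    }
    if round_no == 1:
        # Round 1: transcript and protein evidence plus the Augustus model.
        opts["est"] = "pfm_transcripts.fasta"
        opts["protein"] = "lepidoptera_proteins.fasta"
    else:
        # Rounds 2-4: previous GFF3 plus the Augustus and SNAP models.
        if prev_gff is None or snap_hmm is None:
            raise ValueError("later rounds need the previous GFF3 and a SNAP HMM")
        opts["maker_gff"] = prev_gff
        opts["snaphmm"] = snap_hmm  # assumed: one SNAP model trained on round 1
    return opts


def render_ctl(opts):
    """Render settings as key=value lines, the layout used by maker_opts.ctl."""
    return "\n".join(f"{key}={value}" for key, value in opts.items()) + "\n"


if __name__ == "__main__":
    prev_gff = None
    for round_no in range(1, 5):  # one initial round plus three further rounds
        opts = maker_round_options(round_no, prev_gff, "snap_round1.hmm")
        print(f"# round {round_no}")
        print(render_ctl(opts))
        prev_gff = f"round{round_no}.all.gff"  # hypothetical output name
```

The sketch only writes out settings; it does not run MAKER, and any control-file option not mentioned in the text is left out rather than guessed.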
The annotation results from the MAKER pipeline were filtered using
gene expression evidence, functional annotation results, and Annotation
Edit Distance (AED) values. Genes with an FPKM value greater than 0 in
any RNA-seq sample were considered real genes and retained for further
analysis. Functional domains of the proteins were identified using
InterProScan 5.34-74.0 (Jones et al., 2014) against the Pfam database v32.0
(El-Gebali et al., 2019). The gene models were filtered based on
domain content and evidence support following Campbell, Holt, Moore, and
Yandell (2014). Finally, annotations with AED > 0.75 were
removed (Campbell et al., 2014).
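To make the filtering criteria concrete, the following Python sketch computes AED from its standard definition, AED = 1 - (sensitivity + specificity) / 2 at the nucleotide level, and applies an expression, domain, and AED filter to a single gene model. It is a minimal illustration, not the procedure actually run: MAKER reports AED itself, the function names annotation_edit_distance and keep_model are hypothetical, and the way the three criteria are combined (expressed or domain-containing, then AED at or below the cutoff) is an assumption. The sketch follows the conventional reading in which an AED of 0 means perfect agreement with the evidence and 1 means no support, so models above the cutoff are the ones dropped.

```python
def covered_bases(intervals):
    """Return the set of positions covered by closed (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def annotation_edit_distance(model_exons, evidence_exons):
    """AED = 1 - (sensitivity + specificity) / 2, computed per nucleotide."""
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_exons)
    if not model or not evidence:
        return 1.0  # nothing to compare: treated as no evidence support
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # fraction of evidence bases recovered
    specificity = shared / len(model)     # fraction of model bases supported
    return 1.0 - (sensitivity + specificity) / 2.0


def keep_model(fpkm_by_sample, pfam_domains, aed, aed_cutoff=0.75):
    """Decide whether to keep a gene model (assumed combination of criteria)."""
    expressed = any(value > 0 for value in fpkm_by_sample)
    has_domain = len(pfam_domains) > 0
    # Assumption: expression or a Pfam domain is required, and the model
    # must not exceed the AED cutoff.
    return (expressed or has_domain) and aed <= aed_cutoff


if __name__ == "__main__":
    aed = annotation_edit_distance([(1, 100)], [(51, 150)])
    print(aed)                                       # half-overlapping exons
    print(keep_model([0.0, 2.5], [], aed))           # expressed, AED under cutoff
    print(keep_model([0.0, 0.0], ["PF00001"], 0.9))  # domain found, AED too high
```

In this toy case the model and the evidence each span 100 bases and share 50, so sensitivity and specificity are both 0.5 and the AED is 0.5.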
Functions of the protein-coding genes were annotated using
eggNOG-mapper v1.0.3 (Huerta-Cepas et al., 2017), a tool for fast
functional annotation of novel sequences using precomputed eggNOG-based
orthology assignments, against the eggNOG v5.0 database (Huerta-Cepas
et al., 2019).