2.6 Genome annotation
To detect transposable elements (TEs) in the M. japonicus genome, two approaches were used: de novo prediction and homology-based alignment. For the homology-based alignment, RepeatMasker and RepeatProteinMask (version 4.0.7) (http://www.repeatmasker.org/) were used to screen the M. japonicus genome against the Repbase library (Jurka et al., 2005). For de novo prediction, RepeatModeler (v1.0.5, http://repeatmasker.org/RepeatModeler.html), RepeatScout (v1.0.5) (Price et al., 2005), and LTR_FINDER (v1.07) (Xu and Wang, 2007) were used with default settings to build a de novo library of non-redundant repeats. RepeatMasker (v4.0.7) (Chen, 2009) was then run on the M. japonicus genome using this de novo library. In addition, Tandem Repeats Finder (TRF, v4.07b) (Benson, 1999) was used with default settings to predict tandem repeats.
Three approaches were used to predict protein-coding genes in the M. japonicus genome: homology-based prediction, ab initio prediction, and transcriptome-based prediction. For the homology-based prediction, TBLASTN (version 2.2.26; E-value ≤ 1e-5) (Camacho et al., 2009) was used to align protein sequences from Homo sapiens (GCF_000001405.38), Tetranychus urticae (GCF_000239435.1), Caenorhabditis elegans (GCF_000002985.6), Crassostrea gigas (GCF_000297895.1), Drosophila melanogaster (GCF_000001215.4), Daphnia pulex (GCA_000187875.1), Ixodes scapularis (GCF_016920785.1), Parasteatoda tepidariorum (GCF_000365465.2), Litopenaeus vannamei (GCA_003789085.1), Tribolium castaneum (GCF_000002335.3), Strongylocentrotus purpuratus (GCF_000002235.5), Cherax quadricarinatus, and Fenneropenaeus chinensis onto the M. japonicus genome. The BLAST hits were then concatenated using Solar (Yu et al., 2006). GeneWise (version 2.4.1) (Stamatakis, 2014) was used to determine the exact gene structure within the genomic region corresponding to each BLAST hit. Homology predictions were denoted as the “Homology-set”. For ab initio prediction, Augustus (version 3.2.3) (Stamatakis, 2014), GlimmerHMM (version 3.0.4) (Majoros et al., 2004), Genscan (version 1.0) (Burge and Karlin, 1997), Geneid (version 1.4.4) (Burge and Karlin, 1997), and SNAP (version 2013-11-29) (Korf, 2004) were used to predict coding regions in the repeat-masked genome sequences. For transcriptome-based prediction, Trinity (version 2.0) (Grabherr et al., 2011) was used to assemble the RNA-seq data from the seven tissues. The Program to Assemble Spliced Alignments (PASA) (Haas et al., 2003) was then used to align these assembled transcript sequences, together with the full-length transcript sequences generated by PacBio sequencing, against the M. japonicus genome. Valid alignments were clustered according to genome mapping location and assembled into gene structures. Gene models created by PASA were denoted as the PASA-T-set (PASA transcript set).
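The clustering of valid alignments by genome mapping location can be illustrated with a minimal sketch. It is not PASA's actual algorithm, which also checks splice-site compatibility before assembling gene structures. It only shows the basic idea of grouping alignments that overlap on the same scaffold and strand. The `Alignment` record, the function name, and the example coordinates are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """One transcript-to-genome alignment (hypothetical, simplified record)."""
    transcript_id: str
    scaffold: str
    strand: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def cluster_by_location(alignments):
    """Group alignments that overlap on the same scaffold and strand."""
    by_locus = defaultdict(list)
    for aln in alignments:
        by_locus[(aln.scaffold, aln.strand)].append(aln)

    clusters = []
    for key in sorted(by_locus):
        current, current_end = [], None
        # Walk alignments left to right; start a new cluster at each gap.
        for aln in sorted(by_locus[key], key=lambda a: (a.start, a.end)):
            if current and aln.start > current_end:
                clusters.append(current)
                current, current_end = [], None
            current.append(aln)
            current_end = aln.end if current_end is None else max(current_end, aln.end)
        if current:
            clusters.append(current)
    return clusters


# Made-up example alignments, for illustration only.
alignments = [
    Alignment("tx1", "scaffold_1", "+", 100, 500),
    Alignment("tx2", "scaffold_1", "+", 450, 900),
    Alignment("tx3", "scaffold_1", "+", 2000, 2600),
    Alignment("tx4", "scaffold_1", "-", 300, 700),
]
clusters = cluster_by_location(alignments)
cluster_ids = [[a.transcript_id for a in c] for c in clusters]
```

In this toy input, `tx1` and `tx2` overlap on the plus strand and fall into one cluster, whereas `tx4` stays separate because it maps to the opposite strand.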
In addition, Illumina RNA-seq reads were aligned directly to the genome using TopHat (v2.0.13) (Trapnell et al., 2009), and Cufflinks (v2.1.1) (Trapnell et al., 2012) was used to predict gene models (Cufflinks-set). Gene models obtained from all methods were integrated into a comprehensive, non-redundant gene set using EvidenceModeler (EVM, v1.1.1) (Haas et al., 2008). Weights for each type of evidence were set as follows: PASA-T-set > Homology-set > Cufflinks-set > Augustus = GlimmerHMM = Genscan = Geneid = SNAP.
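The weight ordering above could be written as an EVM weights file, which lists one evidence source per line as three tab-separated columns: evidence class, source label, and weight. The sketch below is illustrative only. The numeric weights, the source labels (which in practice must match the source column of the input GFF3 files), and the choice to treat the Cufflinks-set as TRANSCRIPT evidence are all assumptions, since the text specifies only the relative ordering.

```python
# Hypothetical weights: only the ordering is taken from the text,
# the numeric values and source labels are assumptions.
AB_INITIO = ["Augustus", "GlimmerHMM", "Genscan", "Geneid", "SNAP"]

WEIGHTS = [
    ("TRANSCRIPT", "PASA-T-set", 10),
    ("PROTEIN", "Homology-set", 5),
    ("TRANSCRIPT", "Cufflinks-set", 2),
] + [("ABINITIO_PREDICTION", tool, 1) for tool in AB_INITIO]


def format_weights(weights):
    """Render rows as tab-separated 'class, source, weight' lines."""
    return "".join(f"{cls}\t{source}\t{weight}\n" for cls, source, weight in weights)


def ordering_holds(weights):
    """Check PASA-T-set > Homology-set > Cufflinks-set > all ab initio tools (equal)."""
    w = {source: weight for _, source, weight in weights}
    ab_initio_weights = {w[tool] for tool in AB_INITIO}
    return (
        len(ab_initio_weights) == 1
        and w["PASA-T-set"] > w["Homology-set"] > w["Cufflinks-set"] > w["Augustus"]
    )


weights_text = format_weights(WEIGHTS)
print(weights_text, end="")
```

Checking the ordering in code before running the integration guards against an edited weight silently inverting the intended priority of the evidence types.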
Functional annotation was performed using BLASTP searches against the SwissProt (Boeckmann et al., 2003) and NCBI non-redundant protein (NR) databases (Pruitt et al., 2007) with an E-value cutoff of < 1e-5. In addition, InterProScan (version 4.8) (Quevillon et al., 2005) was used to screen proteins against five databases (Pfam, PRINTS, PROSITE, ProDom, and SMART) to identify protein domains and motifs. Gene Ontology (GO) terms were retrieved from the corresponding InterPro entries (Apweiler et al., 2001). The Kyoto Encyclopedia of Genes and Genomes (KEGG) database was also searched to identify enriched pathways (Kanehisa & Goto, 2000).
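As an illustration of the E-value filtering step, the following minimal sketch parses BLASTP results in the standard 12-column tabular format and keeps, for each query, the highest-scoring hit with an E-value below 1e-5. Selecting the best hit by bit score is an assumption, as the text states only the E-value cutoff. The function name and the example identifiers are hypothetical.

```python
EVALUE_CUTOFF = 1e-5  # cutoff stated in the text; hits must be strictly below it


def best_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue >= cutoff:
            continue  # discard hits that do not pass the E-value cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Made-up tabular records, for illustration only.
EXAMPLE = (
    "gene1\tsubjA\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "gene1\tsubjB\t60.0\t180\t72\t0\t5\t184\t3\t182\t1e-20\t150\n"
    "gene2\tsubjC\t35.0\t90\t58\t1\t1\t90\t10\t99\t1e-3\t40\n"
    "gene3\tsubjD\t40.0\t100\t60\t0\t1\t100\t1\t100\t1e-5\t50\n"
)
hits = best_hits(EXAMPLE)
```

Here only `gene1` receives an annotation: the hit for `gene2` exceeds the cutoff, and the hit for `gene3` sits exactly at 1e-5, which a strict "less than" comparison excludes.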