2.4 Genome size estimation and assembly
The genome size of M. japonicus was estimated by k-mer analysis. High-quality short-insert reads were used to calculate the 17-mer frequency distribution, and the genome size was estimated according to the formula: genome size = (total number of 17-mers)/(depth at the main peak of the 17-mer distribution) (Marçais & Kingsford, 2011).
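As a worked illustration of the formula above, the following minimal sketch computes a genome size estimate from a k-mer depth histogram. The function name, the `min_depth` cutoff used to skip low-depth (likely erroneous) k-mers when locating the peak, and the toy histogram are all assumptions made for illustration; they are not the values or the exact procedure used in this study.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping depth -> number of distinct k-mers seen at that depth.
    min_depth: hypothetical cutoff; depths below it are ignored when searching
               for the main peak, because they are dominated by sequencing errors.
    """
    # Total number of k-mers = sum over depths of (depth * distinct k-mers at that depth).
    total_kmers = sum(depth * count for depth, count in histogram.items())

    # Peak depth = depth with the most distinct k-mers, ignoring the low-depth error tail.
    candidates = {d: c for d, c in histogram.items() if d >= min_depth}
    peak_depth = max(candidates, key=candidates.get)

    # genome size = (total number of k-mers) / (peak depth)
    return total_kmers / peak_depth, peak_depth


# Toy histogram (illustrative numbers only, not data from this study).
toy_histogram = {1: 500, 2: 100, 18: 200, 19: 400, 20: 1000, 21: 400, 22: 200}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated genome size = {size:.0f} bp")
```

In this sketch all k-mers, including the low-depth ones, contribute to the total; whether error k-mers are removed before the division is a design choice that the formula in the text does not specify.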
De novo assembly of the PacBio reads was performed using wtdbg2 (version 2.5) with the parameters "--node-drop 0.20 --node-len 1536 --node-max 600 -s 0.05 -e 3" (Ruan & Li, 2020). Three rounds of consensus correction were performed using racon (version 1.3.1) with default parameters (Vaser et al., 2017), and pilon (version 1.22) was then used to polish the resulting assembly with the Illumina short paired-end reads (Walker et al., 2014). Next, the high-quality paired-end reads generated by the Hi-C method were mapped onto the M. japonicus draft genome and filtered using HICUP (version 0.7.4) (Wingett et al., 2015) to build a chromosome-level assembly. Briefly, HICUP_TRUNCATER was used to truncate the Hi-C reads at the ligation junction of the restriction enzyme site (^GATC), and the resulting trimmed forward and reverse reads were then aligned to the genome with bowtie2 (version 2.2.5) (Langmead & Salzberg, 2012), yielding a BAM alignment file. Only unique, high-quality, and valid alignments were used to build the raw intra- and inter-chromosomal interaction maps. Lastly, contigs were clustered and anchored into 42 pseudo-chromosomes using ALLHIC (Zhang et al., 2019). Juicebox (v1.18) (https://github.com/aidenlab/Juicebox) was used for interactive, graphical manual fine-tuning to obtain the final chromosome-level assembly.
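The read truncation step can be illustrated with a simplified sketch. It assumes that, for an enzyme cutting at ^GATC, fill-in and blunt-end ligation produce the junction sequence GATCGATC, and that a read is cut so that only the sequence up to and including the first restriction-site copy is kept. The function below is a hypothetical illustration of that idea and is not the HICUP implementation.

```python
# Assumed ligation junction for an enzyme cutting at ^GATC (fill-in + blunt ligation).
LIGATION_JUNCTION = "GATCGATC"


def truncate_at_junction(read, junction=LIGATION_JUNCTION):
    """Return the read truncated at the first ligation junction, if any.

    Sequence downstream of the junction may originate from a different
    genomic fragment, so only the part up to the first half of the
    junction (one copy of the restriction site) is kept.
    """
    pos = read.find(junction)
    if pos == -1:
        # No junction found: the read is returned unchanged.
        return read
    # Keep everything before the junction plus its first half (GATC).
    return read[: pos + len(junction) // 2]


# Toy reads (illustrative sequences only).
print(truncate_at_junction("AAAGATCGATCTTT"))  # read spanning a junction
print(truncate_at_junction("ACGTACGTACGT"))    # read without a junction
```

Truncated forward and reverse reads are then aligned independently, which is why the text describes separate alignment of the trimmed mates before valid pairs are selected.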
To evaluate the accuracy of the genome assembly, the short-insert paired-end reads were aligned to the M. japonicus genome using the Burrows-Wheeler Aligner (BWA) (Li & Durbin, 2009) with the parameters '-k 32 -w 10 -B 3 -O 11 -E 4'. To assess the completeness of the genome, BUSCO (Benchmarking Universal Single Copy Orthologs) (version 4.1.2) (Simão et al., 2015) was run against the arthropoda_odb10 dataset. In addition, genome completeness was assessed using CEGMA (Core Eukaryotic Genes Mapping Approach) (version 2.5) based on a set of 248 conserved core eukaryotic genes (Parra et al., 2007).