2.4 Genome size estimation and assembly
The genome size of M. japonicus was estimated by k-mer
analysis. High-quality short-insert reads were used to calculate
the 17-mer frequency distribution, and the genome size was estimated
according to the formula: genome size = (total number of
17-mers)/(depth at the main peak of the distribution) (Marçais & Kingsford, 2011).
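The formula above can be illustrated with a minimal Python sketch. It assumes the k-mer counter has already produced a histogram mapping each depth to the number of distinct k-mers observed at that depth. The function name, the `min_depth` cutoff used to skip the low-depth error peak, and the toy values are illustrative assumptions, not part of the original analysis.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size as (total k-mers) / (peak depth).

    histogram: dict mapping depth -> number of distinct k-mers at that depth.
    min_depth: assumed cutoff; k-mers below it are treated as sequencing
               errors and ignored (hypothetical default, tune per dataset).
    """
    # Keep only depths at or above the assumed error cutoff.
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")

    # Total number of k-mer occurrences: each distinct k-mer counts 'depth' times.
    total_kmers = sum(d * n for d, n in usable.items())

    # Peak depth: the depth shared by the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)

    return total_kmers / peak_depth


# Toy histogram (made-up values, for illustration only).
toy_histogram = {1: 1000, 2: 200, 10: 50, 20: 300, 30: 50}
print(estimate_genome_size(toy_histogram))
```

In this toy case the main peak is at depth 20 and 8,000 k-mer occurrences remain after the cutoff, so the estimate is 400 bases. Real data would also need handling of heterozygous and repeat peaks, which this sketch ignores.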
De novo assembly of the PacBio reads was performed using
wtdbg2 (version 2.5) with the parameters “--node-drop
0.20 --node-len 1536 --node-max 600 -s 0.05 -e 3” (Ruan & Li,
2020). Three rounds of consensus correction were performed using racon
(version 1.3.1) with default parameters (Vaser et al., 2017), and
pilon (version 1.22) was then used to polish the resulting assembly with the
Illumina short paired-end reads (Walker et al., 2014). Next, the
high-quality paired-end reads generated by the Hi-C method were mapped
onto the M. japonicus draft genome followed by filtering using
HICUP (version 0.7.4) (Wingett et al., 2015) to generate a
chromosome-level genome. Briefly, HICUP_TRUNCATER was used to truncate
the Hi-C reads at the ligation junction formed at the enzyme digestion site (^GATC), and
the resulting trimmed forward and reverse reads were then aligned to the
genome with bowtie2 (version 2.2.5) (Langmead & Salzberg, 2012), yielding
an alignment BAM file. Only unique, high-quality, valid alignments
were used to build the raw intra- and inter-chromosomal
interaction maps. Lastly, contigs were clustered and anchored into 42
pseudo-chromosomes using ALLHIC (Zhang et al., 2019). Juicebox (v1.18)
(https://github.com/aidenlab/Juicebox) was used for manual fine-tuning
in a graphical and interactive fashion to obtain the final
chromosome-level assembly.
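The read truncation step described above can be pictured with a simplified Python sketch. It assumes that filling in and ligating two ^GATC ends produces the junction GATCGATC, and that a read is cut so that only the sequence up to and including the first GATC is kept. The function name is hypothetical. This illustrates the idea only and is not HICUP's implementation.

```python
def truncate_at_junction(read, site="GATC"):
    """Truncate a Hi-C read at the first putative ligation junction.

    Assumption: the enzyme cuts before the site (^GATC), so a filled-in and
    re-ligated junction appears in the read as the site repeated twice.
    """
    junction = site + site          # e.g. GATCGATC
    idx = read.find(junction)
    if idx == -1:
        return read                 # no junction found: read is left unchanged
    # Keep everything up to and including the first copy of the site.
    return read[: idx + len(site)]


# Toy reads (made-up sequences, for illustration only).
print(truncate_at_junction("AAAGATCGATCTTT"))
print(truncate_at_junction("ACGTACGT"))
```

Truncating in this way removes the chimeric portion of a read, so that each mate aligns to a single genomic locus.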
To evaluate the accuracy of the genome assembly, the short-insert paired-end
reads were aligned to the M. japonicus genome using the Burrows-Wheeler Aligner
(BWA) (Li & Durbin, 2009) with the parameters ‘-k 32 -w
10 -B 3 -O 11 -E 4’. To assess the completeness of
the genome, BUSCO (Benchmarking Universal Single-Copy Orthologs)
(version 4.1.2) (Simão et al., 2015) was run against
the arthropoda_odb10 dataset. In addition, genome completeness was
assessed using CEGMA (Core Eukaryotic Genes Mapping Approach) (version
2.5) with a set of 248 conserved core eukaryotic genes (Parra et al.,
2007).