javascript:void(0)Read mapping and variant identification
The quality of the RNA-seq reads was checked with fastqc (Andrews 2010). Reads and bases with a Phred score higher than 20 were retained (Martin, 2011). For transcriptomic data analysis, clean reads were aligned to theP. australis reference genome of a Chinese individual, using the STAR aligner two-pass procedure (Dobin et al., 2013), and the resulting bam files were sorted with Samtools (Li et al., 2009). To ensure data accuracy, mapped reads from the transcriptomic data were screened for duplicates using the ‘MarkDuplicates’ function in the GATK pipeline (v 4.5.0.0), and any artificial duplicates resulting from the sequencing process were removed. Then we used the “SplitNCigarReads” tool on the RNA-seq samples to split reads at intronic regions. Genomic variants were called using DeepVariant (1.1.0), with the argument –model_type WES. Joint variant calling across all samples was performed using GLnexus, including an additional joint call for each population with all invariant sites, configured for DeepVariantWES. Although DeepVariant is optimized for SNP calling in diploids, we used this approach due to its high accuracy. To filter out false variants, only biallelic variants with less than 50% missingness were retained, and variants with a minor allele frequency (MAF) below 0.05 were filtered out using VCFtools. To avoid the effects of partially called SNPs, we filtered SNPs using the –vcf-half-call missing parameter. Finally, lineages were grouped according to their phylogenetic classification.