javascript:void(0)Read mapping and variant identification
The quality of the RNA-seq reads was checked with fastqc (Andrews 2010).
Reads and bases with a Phred score higher than 20 were retained (Martin,
2011). For transcriptomic data analysis, clean reads were aligned to theP. australis reference genome of a Chinese individual, using the
STAR aligner two-pass procedure (Dobin et al., 2013), and the resulting
bam files were sorted with Samtools (Li et al., 2009). To ensure data
accuracy, mapped reads from the transcriptomic data were screened for
duplicates using the ‘MarkDuplicates’ function in the GATK pipeline (v
4.5.0.0), and any artificial duplicates resulting from the sequencing
process were removed. Then we used the “SplitNCigarReads” tool on the
RNA-seq samples to split reads at intronic regions. Genomic variants
were called using DeepVariant (1.1.0), with the argument –model_type
WES. Joint variant calling across all samples was performed using
GLnexus, including an additional joint call for each population with all
invariant sites, configured for DeepVariantWES. Although DeepVariant is
optimized for SNP calling in diploids, we used this approach due to its
high accuracy. To filter out false variants, only biallelic variants
with less than 50% missingness were retained, and variants with a minor
allele frequency (MAF) below 0.05 were filtered out using VCFtools. To
avoid the effects of partially called SNPs, we filtered SNPs using the
–vcf-half-call missing parameter. Finally, lineages were grouped
according to their phylogenetic classification.