2 Materials and Methods
2.1 Sample collection and DNA
sequencing
We analysed 333 individuals across the four Himalopsyche species.
Specimens were collected from 2010 to 2020 in the Himalayas (H.
tibetana : Nind (number of individuals) = 97,
Npop (number of populations) = 28; H. digitata :
Nind = 105; Npop = 23) and the HM
(H. gregoryi : Nind = 71, Npop =
18; H. platon : Nind = 60; Npop =
12, Fig. 1). All specimens were preserved in 95% ethanol and archived
in the collections of the Senckenberg Research Institute and Natural
History Museum (Supplementary Table 1).
We extracted DNA from abdominal
and/or thoracic tissue after removing the intestinal tract following
Miller et al. (1988). We purified low-quality DNA samples with magnetic
beads as described by Deng et al. (2022). After quality check, all DNA
samples were sent to Novogene Co., Ltd. (Hongkong, China) for DNA
library preparation and 2×150 bp whole genome resequencing with a
coverage of 12× on an Illumina HiSeq 2000 platform.
2.2 Data preparation
We assessed the quality of Illumina reads using FastQC v0.11.8 (Babraham
Institute) and MultiQC v1.7 (Ewels et al., 2016). Following Deng et al.
(2022), we trimmed overrepresented k -mers using
autotrim.pl v0.6.1 (Waldvogel et al., 2018) with Trimmomatic v0.38
(Bolger et al., 2014) and removed adapters with Cutadapt v2.23 (Martin,
2011). After cleaning, the average sequence coverage was 14× (see Fig.
S1 for details). We mapped the reads to a published de novogenome of H. tibetana (Deng et al., 2022, NCBI accession number:
JAHFWH000000000) using Bowtie2 (Fig. S2). Duplicate reads were marked
and indexed using Picard v2.20.8
(Picard Tools – By Broad
Institute) and SAMtools (v1.1, Danecek et al., 2021).
We used ANGSD (v0.931,
Korneliussen et al., 2014) to estimate genotype likelihoods and GATK
(v4.1.7.0, Danecek et al., 2011) for genotype calling (see Supplementary
material 1, section 4 for details and parameter settings). Since the
genotype likelihood algorithm in ANGSD is designed for low to
medium-depth data (Korneliussen et al., 2014), we used it for all of the
population genomic inferences except for the TreeMix analysis since the
program cooperated better with GATK for calling the SNPs without missing
data (Pickrell & Pritchard, 2012). With the GATK pipeline, we obtained
approximately 4,643,429; 68,261; 221,016; 245,170 SNPs for H.
tibetana , H. digitata , H. gregoryi, and H. platonrespectively. The genotype likelihood estimation with ANGSD produced
1,668,494; 6,756; 88,947: 12,842 high-quality variants for H.
tibetana , H. digitata , H. gregoryi, and H. platon,respectively. The quality of the genotype was shown in Fig. S3, S4, and
S5.
2.3 Genetic structure
Before performing the principal component analysis (PCA), we pruned
linked sites within our SNP dataset estimated by genotype likelihoods
using ngsLD (Fox et al., 2019). We calculated a covariance matrix for
PCA using PCAngsd (Meisner & Albrechtsen, 2018) and admixture
proportions using NGSadmix (Skotte et al., 2013) based on the linkage
pruned genotype likelihoods. The optimal value of ancestral components
(K) was determined by CLUMPAK (Kopelman et al., 2015).
We measured the genome-wide nuclear heterozygosity (π) of each sample
based on a site frequency spectrum (SFS) following the workflow
suggested by the authors of ANGSD v0.931 (Korneliussen et al., 2014). To
quantify the alleles inherited from common ancestors in an individual’s
lineage, we estimated individual inbreeding coefficients (F )
based on the filtered genotype likelihoods (same filter as used for PCA
analyses) using ngsF v1.2.0 (Vieira et al., 2013) following de Jager et
al. (2021).
To better understand the relationships among populations of each
species, we reconstructed a maximum-likelihood tree with migration
events among populations of each species respectively using TreeMix
version 1.13 (Pickrell & Pritchard, 2012) following Dahms et al. (2022)
and software guidelines. We additionally calculated a median-joining
(MJ) network of each species based on the COI sequences using Network 10
(Fluxus-engineering.com, Bandelt et al., 1999).
To better characterize the genetic structure, we assigned the
populations to subbasins delineated by Lehner and Grill (2013) at level
10 (http://www.hydrosheds.org/page/hydroatlas). The four-digit
code of each subbasin used in this study is replicated from the source
layer; additionally, we use descriptive names for the subbasins (see
Table S9).
2.4 Historic effective population
size
To gain insights into the history of population-size change, especially
during the LGM, we inferred the demographic history of the four target
species by applying the pairwise sequentially Markovian coalescent
(PSMC) model to each genome (Li & Durbin, 2011). We prepared the input
files for PSMC and ran the analyses following the software documentation
(more details see Supplementary material 1, section 8). For the final
visualization, we plotted the inverse instantaneous coalescent rate
(IICR) as estimated by PSMC following Humble et al. (2020) and de Jager
et al. (2021) in R. Notably, for the plot we used a generation time of 1
year (Tsuruishi, 2003, 2006) and a per generation mutation rate for
caddisflies of 2.9e−9 as estimated for a Lepidoptera species (Keightley
et al., 2015), though the mutation rate was proposed to be consistent
with most of the insects (Liu et al., 2017).
2.5 Ecological variables statistics and species distribution
modelling
(SDM)
We collected in-situ environmental variables and GIS resources at
each sampling site (all variables see Table S1, S2). We conducted a PCA
for each species pair from the Himalayas and the HM respectively, using
all environmental variables as mentioned above (Fig. S6, S7).
To better understand the
influence of different climatic conditions on Himalopsyche , we
compared the potential suitable area of each species between the LGM and
the present day by applying SDMs. We used all known locations of each
species from our previous work and the current study as validated
occurrence points in the models (Supplementary Table 1). According to
the current distribution ranges of the species in references and field
observation, we defined the extent of our study area as the main
drainage basins of all four species’ occurrences (Fig. 1, details see
Supplementary material 1, section 10). Considering the high dependence
of caddisflies on water generally and flight ability at the adult stage,
we attempted to include hydrological connectivity in the modelling.
Thus, we applied a sub-sampled layer that was calculated from the DEM
with a stream-initialization threshold of 200 upstream 90 m grid cells,
resulting in 2771220 sub-catchments and approximately 10
km2 for each sub-catchment.
We downloaded topographic variables from the EarthEnv database and the
bioclimatic variables from the CHELSA database (hereafter 1 km, Table
S3). All variables were chosen based on our supposition of their
importance for the distribution of caddisflies (Graf et al., 2008; Morse
et al., 2019). We used seven bioclimatic datasets (CCSM4, MRI, CNRM,
FGOALS, CM5A, MIROC, and MPI download date: 10.12.2019) provided by
CHELSA for the LGM. All variable layers were cropped to our study area
first and buffered approximately 500 km around the study area to
dissolve and avoid errors caused by the coast. Then we calculated the
univariate statistics of each environmental layer that was assigned to
the sub-sampled layer as described above. Before building the SDMs, we
removed highly correlated variables (correlation coefficient ‘r’ cutoff
= 0.75). The same variables were used for all species at different time
scenarios.
We applied the ensemble method implemented in the R package biomod2
(Thuiller et al., 2009) to construct the SDM for each species at
different time scenarios. This method enables combining the predictive
outputs of different algorithms, thus improving the accuracy of the
modelling (Hao et al., 2019; Stone & Wolfe, 2021). The details of the
setting were described in Supplementary material 1, section 10. To
compare the distribution range of each species between the LGM and at
present, we obtained a consensus binary map for the LGM prediction by
using the committee averaging method (Araujo & New, 2007; Gallien et
al., 2012; Gillard et al., 2017).
The final results of the range size change from the LGM to the present
for each species were deposited on the IGB-GeoNode as raster layers
(https://fred.igb-berlin.de/data/package/812). All intermediate results
of the predictions using different models on different time scenarios
were presented in supplementary material 1 (Table S4–S8, Fig. S8–S21).