2 Materials and Methods

2.1 Sample collection and DNA sequencing

We analysed 333 individuals across the four Himalopsyche species. Specimens were collected from 2010 to 2020 in the Himalayas (H. tibetana : Nind (number of individuals) = 97, Npop (number of populations) = 28; H. digitata : Nind = 105; Npop = 23) and the HM (H. gregoryi : Nind = 71, Npop = 18; H. platon : Nind = 60; Npop = 12, Fig. 1). All specimens were preserved in 95% ethanol and archived in the collections of the Senckenberg Research Institute and Natural History Museum (Supplementary Table 1).
We extracted DNA from abdominal and/or thoracic tissue after removing the intestinal tract following Miller et al. (1988). We purified low-quality DNA samples with magnetic beads as described by Deng et al. (2022). After quality check, all DNA samples were sent to Novogene Co., Ltd. (Hongkong, China) for DNA library preparation and 2×150 bp whole genome resequencing with a coverage of 12× on an Illumina HiSeq 2000 platform.

2.2 Data preparation

We assessed the quality of Illumina reads using FastQC v0.11.8 (Babraham Institute) and MultiQC v1.7 (Ewels et al., 2016). Following Deng et al. (2022), we trimmed overrepresented k -mers using autotrim.pl v0.6.1 (Waldvogel et al., 2018) with Trimmomatic v0.38 (Bolger et al., 2014) and removed adapters with Cutadapt v2.23 (Martin, 2011). After cleaning, the average sequence coverage was 14× (see Fig. S1 for details). We mapped the reads to a published de novogenome of H. tibetana (Deng et al., 2022, NCBI accession number: JAHFWH000000000) using Bowtie2 (Fig. S2). Duplicate reads were marked and indexed using Picard v2.20.8 (Picard Tools – By Broad Institute) and SAMtools (v1.1, Danecek et al., 2021).
We used ANGSD (v0.931, Korneliussen et al., 2014) to estimate genotype likelihoods and GATK (v4.1.7.0, Danecek et al., 2011) for genotype calling (see Supplementary material 1, section 4 for details and parameter settings). Since the genotype likelihood algorithm in ANGSD is designed for low to medium-depth data (Korneliussen et al., 2014), we used it for all of the population genomic inferences except for the TreeMix analysis since the program cooperated better with GATK for calling the SNPs without missing data (Pickrell & Pritchard, 2012). With the GATK pipeline, we obtained approximately 4,643,429; 68,261; 221,016; 245,170 SNPs for H. tibetana , H. digitata , H. gregoryi, and H. platonrespectively. The genotype likelihood estimation with ANGSD produced 1,668,494; 6,756; 88,947: 12,842 high-quality variants for H. tibetana , H. digitata , H. gregoryi, and H. platon,respectively. The quality of the genotype was shown in Fig. S3, S4, and S5.

2.3 Genetic structure

Before performing the principal component analysis (PCA), we pruned linked sites within our SNP dataset estimated by genotype likelihoods using ngsLD (Fox et al., 2019). We calculated a covariance matrix for PCA using PCAngsd (Meisner & Albrechtsen, 2018) and admixture proportions using NGSadmix (Skotte et al., 2013) based on the linkage pruned genotype likelihoods. The optimal value of ancestral components (K) was determined by CLUMPAK (Kopelman et al., 2015).
We measured the genome-wide nuclear heterozygosity (π) of each sample based on a site frequency spectrum (SFS) following the workflow suggested by the authors of ANGSD v0.931 (Korneliussen et al., 2014). To quantify the alleles inherited from common ancestors in an individual’s lineage, we estimated individual inbreeding coefficients (F ) based on the filtered genotype likelihoods (same filter as used for PCA analyses) using ngsF v1.2.0 (Vieira et al., 2013) following de Jager et al. (2021).
To better understand the relationships among populations of each species, we reconstructed a maximum-likelihood tree with migration events among populations of each species respectively using TreeMix version 1.13 (Pickrell & Pritchard, 2012) following Dahms et al. (2022) and software guidelines. We additionally calculated a median-joining (MJ) network of each species based on the COI sequences using Network 10 (Fluxus-engineering.com, Bandelt et al., 1999).
To better characterize the genetic structure, we assigned the populations to subbasins delineated by Lehner and Grill (2013) at level 10 (http://www.hydrosheds.org/page/hydroatlas). The four-digit code of each subbasin used in this study is replicated from the source layer; additionally, we use descriptive names for the subbasins (see Table S9).

2.4 Historic effective population size

To gain insights into the history of population-size change, especially during the LGM, we inferred the demographic history of the four target species by applying the pairwise sequentially Markovian coalescent (PSMC) model to each genome (Li & Durbin, 2011). We prepared the input files for PSMC and ran the analyses following the software documentation (more details see Supplementary material 1, section 8). For the final visualization, we plotted the inverse instantaneous coalescent rate (IICR) as estimated by PSMC following Humble et al. (2020) and de Jager et al. (2021) in R. Notably, for the plot we used a generation time of 1 year (Tsuruishi, 2003, 2006) and a per generation mutation rate for caddisflies of 2.9e−9 as estimated for a Lepidoptera species (Keightley et al., 2015), though the mutation rate was proposed to be consistent with most of the insects (Liu et al., 2017).

2.5 Ecological variables statistics and species distribution modelling (SDM)

We collected in-situ environmental variables and GIS resources at each sampling site (all variables see Table S1, S2). We conducted a PCA for each species pair from the Himalayas and the HM respectively, using all environmental variables as mentioned above (Fig. S6, S7).
To better understand the influence of different climatic conditions on Himalopsyche , we compared the potential suitable area of each species between the LGM and the present day by applying SDMs. We used all known locations of each species from our previous work and the current study as validated occurrence points in the models (Supplementary Table 1). According to the current distribution ranges of the species in references and field observation, we defined the extent of our study area as the main drainage basins of all four species’ occurrences (Fig. 1, details see Supplementary material 1, section 10). Considering the high dependence of caddisflies on water generally and flight ability at the adult stage, we attempted to include hydrological connectivity in the modelling. Thus, we applied a sub-sampled layer that was calculated from the DEM with a stream-initialization threshold of 200 upstream 90 m grid cells, resulting in 2771220 sub-catchments and approximately 10 km2 for each sub-catchment.
We downloaded topographic variables from the EarthEnv database and the bioclimatic variables from the CHELSA database (hereafter 1 km, Table S3). All variables were chosen based on our supposition of their importance for the distribution of caddisflies (Graf et al., 2008; Morse et al., 2019). We used seven bioclimatic datasets (CCSM4, MRI, CNRM, FGOALS, CM5A, MIROC, and MPI download date: 10.12.2019) provided by CHELSA for the LGM. All variable layers were cropped to our study area first and buffered approximately 500 km around the study area to dissolve and avoid errors caused by the coast. Then we calculated the univariate statistics of each environmental layer that was assigned to the sub-sampled layer as described above. Before building the SDMs, we removed highly correlated variables (correlation coefficient ‘r’ cutoff = 0.75). The same variables were used for all species at different time scenarios.
We applied the ensemble method implemented in the R package biomod2 (Thuiller et al., 2009) to construct the SDM for each species at different time scenarios. This method enables combining the predictive outputs of different algorithms, thus improving the accuracy of the modelling (Hao et al., 2019; Stone & Wolfe, 2021). The details of the setting were described in Supplementary material 1, section 10. To compare the distribution range of each species between the LGM and at present, we obtained a consensus binary map for the LGM prediction by using the committee averaging method (Araujo & New, 2007; Gallien et al., 2012; Gillard et al., 2017).
The final results of the range size change from the LGM to the present for each species were deposited on the IGB-GeoNode as raster layers (https://fred.igb-berlin.de/data/package/812). All intermediate results of the predictions using different models on different time scenarios were presented in supplementary material 1 (Table S4–S8, Fig. S8–S21).