Results

Overview of the shotgun and hybridization capture data sets

Shotgun sequencing of the entire DNA extracted from the ancient lake sediment samples resulted in about 424 million read pairs for the four samples and 25 million reads for the two blanks (library and extraction blank). Per-sample read count ranged from 78­­–145 million paired reads. After trimming and filtering, 62.6% of the reads remained for the analysis, i.e. 265 million (M) reads in total, and 0.1–0.2% for the blanks, i.e. a total of 39 thousand (k) reads. Eighty-two% of the sample reads overlapped and were merged.
Using kraken2 with the nt database, 0.3% of the quality filtered reads could be classified (Figure 1). Bacteria comprise 62.6% of the classified reads (506 k reads across the four samples) and 23.4% were classified as Eukaryota (189 k reads). Of the Eukaryota, the majority was assigned to Viridiplantae (122 k reads) and 2.8 k reads were assigned to Larix (Supplementary Table S1).
The sequencing of the four hybridization capture samples resulted in approximately 192 M paired-end read counts with a per-sample read count of 42–50 M. Sequencing of the extraction blank and the library blank resulted in 4.3 and 5.7 M reads, respectively. After trimming and quality filtering, 66% of reads were kept from the samples (126 M reads) and 0.1% of reads were kept from the blanks (27 k reads). About 91% of the reads of each sample had overlapping reads and were merged.
Classification with kraken2 using the nt database could classify 28% of the quality filtered reads. Of the classified reads, 31% were classified as Bacteria (11 M reads) and 44% were classified as Eukaryota (15.8 M reads). Of the latter, 15.2 M reads correspond to Viridiplantae and 3 M reads to Larix (Figure 1, Supplementary Table S1). A comparison of the shotgun and capture datasets shows that 46.6 to 155.8-fold more reads were assigned to Eukaryota in the capture dataset. On the taxonomic level of Viridiplantae enrichment ranged from 77.8 to 236.9-fold enrichment of captured data in respect to shogun data. The number of Larix -classified reads per sample corresponds to an increase of around 800 to 1160-fold compared to the shotgun data (Supplementary Table S2).

Ancient DNA authenticity

A mapDamage analysis (Jónsson et al., 2013) was applied to the alignment files ofLarix- classified reads aligned to the L. gmeliniichloroplast genome. The overlapping merged reads for the three ancient samples (from 1900, 5400 and 6700 cal-BP) show a clear increase of C to T substitutions at both ends with a greater pronunciation at the 5’ ends. A clear increase of substitution rate with age is visible (Supplementary Figure S4). The unmerged paired-end reads show comparable C to T substitution rates for the forward reads at the 5’ ends and for the reverse reads at the 3’ ends. The youngest sample (60 cal-BP) does not show any clear pattern of substitution rates for merged or unmerged paired-end reads. The length of sequencing reads was between 50 bp and 340 bp for all samples (Supplementary Figure S4 and S5).

Retrieval of the Larix chloroplast genome

Sequence reads classified as Larix using a custom chloroplast database, were mapped to the L. gmelinii reference genome. In the hybridization capture experiment this resulted in a near-complete retrieval of the Larix chloroplast genomes for most samples. The coverage of the complete chloroplast genome declined from the oldest to the most recent sample. At a minimum of 1-fold coverage 91.4% of sample “6700 cal-BP” is covered, 80.3% of sample “5400 cal-BP”, 85.1% of sample “1900 cal-BP” and 14.3% of sample “60 cal-BP”. The mean coverage of the samples is 21.0x ± 14.2x (6700 cal-BP), 5.0x ± 4.5x (5400 cal-BP), 6.6x ± 5.9x (1900 cal-BP) and 0.3x ± 0.7x (60 cal-BP). Comparing the alignments of shotgun and capture reads, the enrichment ranged from 6.4 to 16.2-fold (Supplementary Table S2).
In the alignment of reads against the L. gmelinii reference chloroplast genome, the coverage is not equal across the different annotated regions. When aligning the Larix- classified reads, the coverage is highest for inverted repeats and lowest for ribosomal RNA (Figure 2, dark shaded colours). In the same dataset, the coverage is, on average, higher for intergenic regions, pseudogenes and conserved open reading frames (ORFs), than for protein coding genes. When aligning the complete (quality controlled) reads against the same reference, the coverages of the different annotated regions show a different pattern: highest coverage is at the ribosomal RNA, followed by the photosystem complex coding region and the inverted repeats.
Considering the 294 sites which differ between the two reference genomes of L. gmelinii and L. sibirica , most of the reads carryL. gmelinii specific variations, but in all samples reads carrying the L. sibirica variants can also be found (Figure 3). Almost no reads carry neither of the two species-specific variations (“other”). Between 53% and 1% of the positions contained reads which were classified as L. sibirica variants, with the highest number detected at 6700 cal-BP and the lowest number detected at 60 cal-BP (Supplementary Table S3). The ratio of L. sibirica variants over all positions and reads varied from 5.17% (6700 cal-BP) to 1.84% (5400 cal-BP). Most of the variation between the two Larix species lie in the intergenic region and in conserved open reading frames (ORFs) of unknown function (Figure 3).