Results
Overview of the shotgun and hybridization capture data
sets
Shotgun sequencing of the entire DNA extracted from the ancient lake
sediment samples resulted in about 424 million read pairs for the four
samples and 25 million reads for the two blanks (library and extraction
blank). Per-sample read count ranged from 78–145 million paired
reads. After trimming and filtering, 62.6% of the reads remained for
the analysis, i.e. 265 million (M) reads in total, and 0.1–0.2% for
the blanks, i.e. a total of 39 thousand (k) reads. Eighty-two% of the
sample reads overlapped and were merged.
Using kraken2 with the nt database, 0.3% of the quality filtered reads
could be classified (Figure 1). Bacteria comprise 62.6% of the
classified reads (506 k reads across the four samples) and 23.4% were
classified as Eukaryota (189 k reads). Of the Eukaryota, the majority
was assigned to Viridiplantae (122 k reads) and 2.8 k reads were
assigned to Larix (Supplementary Table S1).
The sequencing of the four hybridization capture samples resulted in
approximately 192 M paired-end read counts with a per-sample read count
of 42–50 M. Sequencing of the extraction blank and the library blank
resulted in 4.3 and 5.7 M reads, respectively. After trimming and
quality filtering, 66% of reads were kept from the samples (126 M
reads) and 0.1% of reads were kept from the blanks (27 k reads). About
91% of the reads of each sample had overlapping reads and were merged.
Classification with kraken2 using the nt database could classify 28% of
the quality filtered reads. Of the classified reads, 31% were
classified as Bacteria (11 M reads) and 44% were classified as
Eukaryota (15.8 M reads). Of the latter, 15.2 M reads correspond to
Viridiplantae and 3 M reads to Larix (Figure 1, Supplementary
Table S1). A comparison of the shotgun and capture datasets shows
that 46.6 to 155.8-fold more reads were assigned to Eukaryota in the
capture dataset. On the taxonomic level of Viridiplantae enrichment
ranged from 77.8 to 236.9-fold enrichment of captured data in respect to
shogun data. The number of Larix -classified reads per sample
corresponds to an increase of around 800 to 1160-fold compared to the
shotgun data (Supplementary Table S2).
Ancient DNA authenticity
A mapDamage analysis
(Jónsson
et al., 2013) was applied to the alignment files ofLarix- classified reads aligned to the L. gmeliniichloroplast genome. The overlapping merged reads for the three ancient
samples (from 1900, 5400 and 6700 cal-BP) show a clear increase of C to
T substitutions at both ends with a greater pronunciation at the 5’
ends. A clear increase of substitution rate with age is visible
(Supplementary Figure S4). The unmerged paired-end reads show comparable
C to T substitution rates for the forward reads at the 5’ ends and for
the reverse reads at the 3’ ends. The youngest sample (60 cal-BP) does
not show any clear pattern of substitution rates for merged or unmerged
paired-end reads. The length of sequencing reads was between 50 bp and
340 bp for all samples (Supplementary Figure S4 and S5).
Retrieval of the Larix chloroplast
genome
Sequence reads classified as Larix using a custom chloroplast
database, were mapped to the L. gmelinii reference genome. In the
hybridization capture experiment this resulted in a near-complete
retrieval of the Larix chloroplast genomes for most samples. The
coverage of the complete chloroplast genome declined from the oldest to
the most recent sample. At a minimum of 1-fold coverage 91.4% of sample
“6700 cal-BP” is covered, 80.3% of sample “5400 cal-BP”, 85.1% of
sample “1900 cal-BP” and 14.3% of sample “60 cal-BP”. The mean
coverage of the samples is 21.0x ± 14.2x (6700 cal-BP), 5.0x ± 4.5x
(5400 cal-BP), 6.6x ± 5.9x (1900 cal-BP) and 0.3x ± 0.7x (60 cal-BP).
Comparing the alignments of shotgun and capture reads, the enrichment
ranged from 6.4 to 16.2-fold (Supplementary Table S2).
In the alignment of reads against the L. gmelinii reference
chloroplast genome, the coverage is not equal across the different
annotated regions. When aligning the Larix- classified reads, the
coverage is highest for inverted repeats and lowest for ribosomal RNA
(Figure 2, dark shaded colours). In the same dataset, the coverage is,
on average, higher for intergenic regions, pseudogenes and conserved
open reading frames (ORFs), than for protein coding genes. When aligning
the complete (quality controlled) reads against the same reference, the
coverages of the different annotated regions show a different pattern:
highest coverage is at the ribosomal RNA, followed by the photosystem
complex coding region and the inverted repeats.
Considering the 294 sites which differ between the two reference genomes
of L. gmelinii and L. sibirica , most of the reads carryL. gmelinii specific variations, but in all samples reads
carrying the L. sibirica variants can also be found (Figure 3).
Almost no reads carry neither of the two species-specific variations
(“other”). Between 53% and 1% of the positions contained reads which
were classified as L. sibirica variants, with the highest number
detected at 6700 cal-BP and the lowest number detected at 60 cal-BP
(Supplementary Table S3). The ratio of L. sibirica variants over
all positions and reads varied from 5.17% (6700 cal-BP) to 1.84% (5400
cal-BP). Most of the variation between the two Larix species lie
in the intergenic region and in conserved open reading frames (ORFs) of
unknown function (Figure 3).