MATERIAL AND
METHODS
Sampling and sequence data
generation
We collected 68 blood or tissue samples from 25 of the 32 recognised
species of shearwaters (Gill, Donsker, & Rasmussen, 2020) (Table S1)
representing all the major lineages in the group
(Austin
et al., 2004). Species that
could not be included (Puffinus heinrothi , P. bannermani ,P. bryani , P. myrtae , P. auricularis , P.
persicus and P. subalaris ) breed in remote islands, have very
limited distributions and/or are categorised as critically endangered by
the IUCN Red List of Threatened Species (http://www.iucnredlist.org/).
Sampling was conducted under permits issued by the relevant authorities.
Sequence data for 51 of these samples were generated previously in a
recent phylogenomic study (Ferrer-Obiol et al. under review).
For the new samples generated here, we extracted genomic DNA using the
Qiagen DNeasy Blood and Tissue Kit according to the manufacturer’s
instructions (Qiagen GmbH, Hilden, Germany). We used a Qubit Fluorometer
(Life Technologies) to quantify and standardise DNA concentrations of
all samples at 10 ng/ul. Approximately 250 ng of genomic DNA of each
sample was sent to the Genomic Sequencing and Analysis Facility,
University of Texas at Austin, to perform ddRAD library preparation
following the Peterson et al.
(2012)
protocol. DNA was fragmented using an uncommon cutter EcoRI and
a common cutter MspI in a single reaction. Illumina adaptors
containing sample-specific barcodes and Illumina indexes were ligated
onto the fragments and four pools were produced differing by their
Illumina index. Barcodes differed by at least two base pairs to reduce
the chance of inaccurate barcode assignment. Pooled libraries were size
selected (between 150 and 300 bp after accounting for adapter length)
using a Pippin Prep size fractionator (Sage Science, Beverly, Ma).
Libraries were amplified in a final PCR step prior to sequencing on a
single lane of an Illumina HiSeq4000 platform using a 150-bp paired-end
(PE) metric.
PE-ddRAD-Seq data filtering and
assembly
Raw reads were demultiplexed and cleaned using process_radtags in
Stacks v2.41 (Rochette,
Rivera-Colón, & Catchen, 2019). To maximise the amount of biological
information, we built loci using the forward reads with parameters
optimised for shearwater data (see Ferrer-Obiol et al. under
review) using the USTACKS-CSTACKS-SSTACKS core clustering
algorithm. We used the TSV2BAM program to incorporate reverse reads by
matching the set of forward read IDs in each locus. We then assembled a
contig for each locus, called SNPs using the Bayesian genotype caller
(BGC;
Maruki
& Lynch, 2015, 2017) and phased haplotypes using GSTACKS.
Subsequently, we mapped the GSTACKS catalog to the Balearic shearwater
(Puffinus mauretanicus ) genome assembly (Cuevas et al., 2019)
using BWA-MEM v. 0.7.17 (Li,
2013). We sorted and indexed the mapped reads using SAMtools v.1.4
(Li, 2011; Li et al.,
2009) and integrated alignment positions to the catalog using
STACKS-INTEGRATE-ALIGNMENTS
(Paris, Stevens, & Catchen,
2017). Finally, we used the POPULATIONS program to filter SNP data
requiring a minor allele frequency (MAF) above 5% and an observed
heterozygosity below 50% to generate datasets for downstream analysis.
Species tree
inference
To estimate a time-calibrated species tree for shearwaters, we applied
the SNP-based MSC approach of
(Stange, Sánchez-Villagra,
Salzburger, & Matschiner, 2018) implemented in the SNAPP v.1.3.
(Bryant, Bouckaert,
Felsenstein, Rosenberg, & RoyChoudhury, 2012) package of the program
BEAST 2 v.2.5.0 (R. Bouckaert
et al., 2019). To prepare a suitable dataset for this method, we
selected a maximum of two individuals per subspecies (51 individuals in
total) and we exported called variants to variant call format (VCF).
Because SNAPP assumes a single nucleotide substitution rate, we
performed the analyses including only transitions to reduce
heterogeneity in the evolutionary rate. We further processed the VCF
file with VCFtools v.0.1.15
(Danecek et al., 2011) to
include only biallelic SNPs without missing data, to mask genotypes if
the per-sample read depth was below 5 or above 150, or if the genotype
quality was below 30. Finally, we selected a single SNP per ddRAD locus
to remove potentially linked SNPs (minimum distance between SNPs
> 500 bp). After filtering, we retained a dataset of 1397
transitions.
We followed recommendations of
Stange et al. (2018) by
constraining the root of the species tree to follow a normal
distribution with a mean of 20.23 Mya and a standard deviation (SD) of 2
as reported by Ferrer-Obiol et al. (under review ) based on three
fossil calibrations (see calibration strategy B there-in) and a relaxed
clock. SD was calculated to fit the posterior distribution for the root
in Ferrer-Obiol et al. (under review ). This divergence time
estimate for the root was further supported by a global study on birds
using relaxed clocks (Jetz,
Thomas, Joy, Hartmann, & Mooers, 2012). As we were mainly interested
in SNAPP’s ability to estimate divergence times rather than the tree
topology, we fixed the species tree topology to that inferred by
Ferrer-Obiol et al. (under review ) using UCE and ddRAD data. We
also tested the robustness of divergence-time estimates by performing
two additional analyses. Firstly, we explored the effects of fixing the
topology by also performing the analysis without the topology being
fixed. We also evaluated the use of fossil calibrations using three
different calibration points based on those described in strategy B of
Ferrer-Obiol et al. (under review ). We used the ruby script
snapp_prep.rb (https://github.com/mmatschiner/snapp_prep) to
prepare the XML file for SNAPP analyses. For each analysis, we conducted
three replicate runs, each with a run length of 500,000 Markov-chain
Monte Carlo (MCMC) iterations. Convergence and stationarity were
confirmed (effective sample sizes > 300) using Tracer
v.1.7.1 (Rambaut, Drummond,
Xie, Baele, & Suchard, 2018). The first 10% of each MCMC was
discarded as burn-in, and posterior distributions of run replicates were
merged to generate maximum-clade-credibility (MCC) trees with node
heights set to mean age estimates with TreeAnnotator
(Heled & Bouckaert, 2013).
SNAPP trees were visualised in Densitree v.2.2.7
(Bouckaert, 2010).
The finite-sites model implemented in SNAPP allows the estimation of
both branch lengths (times) and population sizes (θ)
(Bryant et al., 2012).
Because the Stange et al.
(2018) approach only estimates a single value of θ for all branches, we
also constructed a SNAPP phylogeny without any age constraint, in order
to estimate θ values for each branch.
Ancestral range
estimation
Biogeographic analyses were performed to estimate ancestral ranges and
to examine patterns of shearwater dispersal across five broad areas. The
five areas were chosen based on contemporary shearwater breeding ranges:
Southern Australia and New Zealand (A), Southern Ocean (B), North and
Tropical Pacific Ocean (C), Tropical Indian Ocean (D), and North
Atlantic Ocean and Mediterranean Sea (E). We set the limit between areas
A and B at the Subtropical Front
(Sutton, 2001). The R
package BioGeoBEARS v. 1.1.2
(Matzke, 2013) was used to
estimate ancestral ranges using likelihood versions of three models:
dispersal-extinction-cladogenesis (DEC;
Ree & Smith, 2008),
dispersal-vicariance (DIVA;
Ronquist, 1997), and BayArea
(Landis, Matzke, Moore, &
Huelsenbeck, 2013), and the time-calibrated shearwater tree (Figure 1).
We compared ancestral range estimates of these models with and without
the founder-event speciation parameter (j) under two scenarios: one that
allowed unrestricted dispersal between all areas and another that
limited dispersal between areas connected by major surface ocean
currents from the Pliocene to the present, when most of the shearwater
diversification occurred (Figure 1). Corrected Akaike Information
Criterion (AICc) and AICc weights were used to select the best-fit
scenario for the models with and without the j parameter separately,
because the DEC + j model has been criticised for not being
statistically comparable to the DEC model
(Ree & Sanmartín, 2018). To
infer the ancestral range of the shearwaters’ most recent common
ancestor (MRCA), we used the ranges of the two most closely related
outgroup lineages (for which no genetic data are available): genusProcellaria , and genera Pseudobulweria and Bulweria(Estandía, 2019). Pseudobulweria rostrata , Bulweria
bulwerii , Procellaria westlandica and Procellaria cinereawere chosen because they represent the totality of ranges within their
clades. Divergence times between the outgroups and shearwaters and among
the outgroups were retrieved from the TimeTree database
(Kumar, Stecher, Suleski, &
Hedges, 2017). Outgroups were incorporated into the time-calibrated
shearwater tree using the bind.tree function from the ape package
(Paradis & Schliep, 2019)
in R.
Phylogenetic comparative
analyses
To evaluate potential predictors of body size variation in shearwaters,
we retrieved data for four body size measures: 1) mean body mass; 2)
range of body mass (maximum body mass - minimum body mass); 3) wing
length; and 4) total body length, and five predictors: 1) minimum, 2)
mean and 3) maximum breeding latitudes (in absolute values), 4)
latitudinal range occupied by a species year-round (maximum latitude -
minimum latitude where the species is present either during the breeding
or the non-breeding period) and 5) migratory strategy (long-distance
migrant, short distance migrant or dispersive/sedentary). Additionally,
we retrieved wingspan measurements to obtain a mean body mass measure
corrected by body surface (mean body mass / (body length x wingspan).
Data were retrieved for all recognised species of shearwaters (Gill et
al. 2020) with the exception of Heinroth’s shearwater (Puffinus
heinrothi ), because no information about its phylogenetic placement is
available. The majority of morphometric, distributional and behavioural
data were retrieved from Birds of the World (Billerman, Keeney,
Rodewald, & Schulenberg, 2020) and additional morphometric data were
extracted from (Onley &
Scofield, 2013). We performed phylogenetic generalised least squares
regressions (PGLS) for all body size measures against each of the
potential predictors using the R package caper
(Orme et al., 2013) and we
adjusted p-values by FDR correction for multiple testing. Due to the
reduced number of species, we only performed univariate regressions to
avoid overfitting (Mundry,
2014). Following recommendations of
(Revell, 2010), we
simultaneously estimated the λ parameter
(Pagel, 1999) to account for
deviations from a pure Brownian motion (BM). PGLS analyses were run
using the time-calibrated tree with the species for which no sequencing
data was available incorporated into the phylogeny using the bind.tip
function from the R package phytools
(Revell, 2012) according to
the phylogenetic position and branch lengths from previous phylogenetic
studies (Austin et al.,
2004; Pyle, Welch, &
Fleischer, 2011);
(Martínez-Gómez,
Matías-Ferrer, Sehgal, & Escalante, 2015);
(Kawakami, Eda, Izumi,
Horikoshi, & Suzuki, 2018). We estimated ancestral states for body
size measures using the function fastAnc in the R package phytools and
visualised the reconstructions with phenograms using the R package
ggtree (Yu, Smith, Zhu, Guan,
& Lam, 2017). We also reconstructed ancestral states for migratory
behaviour using maximum likelihood (ML) with the function
rerootingMethod in the R package phytools.
To evaluate the effect of life-history traits (LHT) on the nucleotide
substitution rate and the equilibrium of GC content (GC*), we modelled
the correlation between these two parameters, and their correlations
with mean body mass and the number of breeding pairs as a multivariate
Brownian motion in CoEvol
(Lartillot & Poujol, 2011).
We ran two independent Markov Chain Monte Carlo (MCMC) chains, and
stopped the process after reaching convergence (effective sample size
> 1,000 and discrepancy between chains < 0.05 for
all statistics; 5,000 generations) using tracecomp from the Coevol
package. The number of breeding pairs for each species were retrieved
from Birds of the World (Billerman et al. 2020) and BirdLife
International (2020).
The time-calibrated tree was also used to calculate evolutionary
distinctness (ED) scores and EDGE scores
(i.e., evolutionary
distinctness and globally endangered status; Isaac, Turvey, Collen,
Waterman, & Baillie, 2007), based on IUCN Red List of Threatened
Species threat-status (GE, as of June 2020;
http://www.iucnredlist.org/), calculated in the R package caper. EDGE
scores for each species were calculated as follows: EDGE = ln(1 + ED) +
GE x ln(2).
Patterns of recent coancestry and
sequence
divergence
To explore congruence between the current shearwater taxonomy and the
genetic structure among species, we used fineRADstructure v0.3.2
(Malinsky, Trucchi, Lawson,
& Falush, 2018) to infer the shared ancestry among all individuals.
fineRADstructure uses haplotype linkage information to derive a
co-ancestry matrix based on the most recent coalescent events. We
exported haplotypes for loci present in at least 75% of the individuals
to RADpainter format using POPULATIONS, resulting in a set of haplotypes
for 8,049 PE-ddRAD loci containing a total of 63,492 SNPs. RADpainter
was used to infer a coancestry matrix and the fineSTRUCTURE MCMC
clustering algorithm was used to assign individuals into clusters, with
a burn‐in period of 100,000 generations and an extra 100,000 MCMC
iterations sampled every 1,000 generations. To arrange the clusters
based on their relationships within the coancestry matrix, we built a
tree within fineSTRUCTURE using default parameters. To visualise the
results, we used the R scripts fineradstructureplot.r and
finestructurelibrary.r (available athttp://cichlid.gurdon.cam.ac.uk/fineRADstructure.html).
As an additional approach to examining congruence between the current
shearwater taxonomy and genomic divergence, we examined the distribution
of pairwise genetic distances using loci present in at least 95% of the
individuals (1,525 loci; 11,055 SNPs). Briefly, we exported variants
into a VCF file using POPULATIONS in Stacks, we converted the VCF file
into a DNAbin object using the R package vcfR
(Knaus & Grünwald, 2017),
and we calculated pairwise distances using the dist.dna function from
the ape package in R.