Sequencing data, genome assembly and annotation
Illumina paired-end (2x150 bp) sequencing of the male-Mll yielded
147.7 Gbp (Table 1), representing a mean coverage of 118x.
The five runs of ONT sequencing of the unsexed-Ei resulted in 10x
coverage with a read N50 of 9,431 bp. RNA sequencing of the chick-Mll
(2x100 bp) yielded 15 Gbp of data.
The hybrid assembly obtained with MaSuRCA comprised 4,169 scaffolds,
with an N50 of 2.1 Mbp and a total length of 1.21 Gbp (Table 2,
Figure 1b). BUSCO completeness was 95.9%; only 0.3% of the complete
genes were duplicated, and 1.1% were fragmented (Table 2). Our de novo
repeat annotation showed that 9.95% of the genome consists of
repetitive regions (Table
S1), which is within the range of previously sequenced avian genomes (G.
Zhang et al., 2014). Among repeat elements, long interspersed nuclear
elements (LINEs) were the most abundant (4.45% of the genome). The
genome annotation yielded a total of 21,959 protein-coding
genes, of which 18,769 (85.5%) have at least one associated GO term
and 19,218 (87.5%) have hits across the surveyed curated databases
(Table S2).
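For readers less familiar with the N50 statistic reported above, the following minimal Python sketch shows how it is conventionally computed: sequences are sorted from longest to shortest, and N50 is the length of the sequence at which the running total first reaches half of the summed length. The function name `n50` and the toy lengths are illustrative assumptions; they are not the scaffold lengths of this assembly, nor the code used in the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy scaffold lengths (hypothetical, not from the study).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # total = 300, half = 150, reached at the 70-long scaffold
```

The same calculation applies to the read N50 reported for the ONT data, with read lengths in place of scaffold lengths.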
Blood transcriptome assembly from the chick-Mll resulted in 224,904
transcripts (Table S3). However, BUSCO completeness was only 62.4%,
far below that of the genome, probably because the RNA came from a
single tissue with low transcriptional activity.
The assembly of the mitogenome of P. mauretanicus resulted in a
single contig 19,885 bp long, with a coverage (Illumina reads) of
371x, around three times the coverage of the
nuclear genome. This mitogenome has the same gene order as other
published Procellariiformes mitogenomes (Figure S1). The mitogenome has
two copies of the nad6 gene, as predicted in P.
lherminieri (Torres et al., 2018); this feature was also
confirmed by analysing the mean coverage (Illumina reads) across genes
(Table S4).
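As an illustration of the kind of per-gene coverage summary referred to above, the sketch below computes the mean read depth within each annotated gene and expresses it relative to the median across genes. It is a hedged example only: the function names, the 0-based half-open coordinates, and the toy depth values are assumptions, not the procedure or data behind Table S4. How a deviating gene should be interpreted depends on whether duplicated copies were assembled separately or collapsed into one.

```python
from statistics import median


def mean_depth_per_gene(depth, genes):
    """Mean per-base depth for each gene.

    depth: list of per-base read depths along the contig.
    genes: dict mapping gene name -> (start, end), 0-based, half-open.
    """
    means = {}
    for name, (start, end) in genes.items():
        window = depth[start:end]
        if not window:
            raise ValueError(f"empty interval for gene {name!r}")
        means[name] = sum(window) / len(window)
    return means


def relative_depth(means):
    """Each gene's mean depth divided by the median across all genes."""
    baseline = median(means.values())
    return {name: value / baseline for name, value in means.items()}


# Toy contig (hypothetical): the middle gene has twice the depth of its neighbours.
toy_depth = [100] * 10 + [200] * 5 + [100] * 10
toy_genes = {"geneA": (0, 10), "geneB": (10, 15), "geneC": (15, 25)}

toy_means = mean_depth_per_gene(toy_depth, toy_genes)
print(relative_depth(toy_means))
```

In practice the per-base depths would come from a read alignment against the mitogenome contig; the toy list here only stands in for that input.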