Sequencing data, genome assembly and annotation
Illumina paired-end (2x150 bp) sequencing of the male-Mll yielded 147.7 Gbp of data (Table 1), representing a mean coverage of 118x. The five ONT sequencing runs of the unsexed-Ei yielded 10x coverage, with a read N50 of 9,431 bp. RNA sequencing of the chick-Mll (2x100 bp) yielded 15 Gbp of data.
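The relation between throughput and coverage can be made explicit with a short calculation. The sketch below assumes that mean coverage is simply the total number of sequenced bases divided by the genome size. The genome size used for the 118x figure is not stated in this section, so the code back-calculates the size implied by the two reported numbers. The function names are illustrative only.

```python
def mean_coverage(total_bases: float, genome_size: float) -> float:
    """Mean depth, assuming coverage = sequenced bases / genome size."""
    return total_bases / genome_size


def implied_genome_size(total_bases: float, coverage: float) -> float:
    """Genome size (bp) implied by a given throughput and mean coverage."""
    return total_bases / coverage


# Values reported in the text for the Illumina data.
illumina_bases = 147.7e9   # 147.7 Gbp
reported_coverage = 118    # 118x

size = implied_genome_size(illumina_bases, reported_coverage)
print(f"Implied genome size: {size / 1e9:.2f} Gbp")
```

Under this assumption the reported values correspond to a genome size of about 1.25 Gbp.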
The hybrid assembly obtained with MaSuRCA comprised 4,169 scaffolds, with an N50 of 2.1 Mbp and a total length of 1.21 Gbp (Table 2, Figure 1b). BUSCO completeness was 95.9%, with only 0.3% of the complete genes duplicated and 1.1% fragmented (Table 2). Our de novo repeat annotation showed that 9.95% of the genome consists of repetitive regions (Table S1), which is within the range of previously sequenced avian genomes (G. Zhang et al., 2014). Among repeat elements, long interspersed nuclear elements (LINEs) were the most abundant (4.45% of the genome). Genome annotation resulted in a total of 21,959 protein-coding genes, of which 18,769 (85.5%) have at least one associated GO term, and 19,218 (87.5%) have hits across the surveyed curated databases (Table S2).
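For readers unfamiliar with the N50 statistic reported above, the following is a minimal sketch of how it is commonly computed: the length of the scaffold (or read) at which the cumulative length, summed from the longest sequence downwards, first reaches half of the total. The scaffold lengths in the example are hypothetical toy values, not data from this assembly, and the function is an illustration rather than the tool actually used to produce Table 2.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence to the shortest.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical toy scaffold lengths (arbitrary units), for illustration only.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)
print(toy_n50)
```

In the toy example the total is 290, so half is 145; the two longest scaffolds (80 + 70 = 150) already exceed that, giving an N50 of 70.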
Blood transcriptome assembly from the chick-Mll resulted in 224,904 transcripts (Table S3). However, BUSCO completeness was only 62.4%, far below that of the genome, probably because the RNA came from a single tissue with low transcriptional activity.
The assembly of the mitogenome of P. mauretanicus resulted in a single contig 19,885 bp long, with an Illumina read coverage of 371x, around three times that of the nuclear genome. This mitogenome has the same gene order as other published Procellariiformes mitogenomes (Figure S1). It has two copies of the nad6 gene, as predicted in P. lherminieri (Torres et al., 2018); the latter feature was also confirmed by analysing the mean coverage (Illumina reads) across genes (Table S4).
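The coverage comparisons in this paragraph can be sketched as follows. The first part reproduces the mitochondrial-to-nuclear ratio from the two reported values. The second part shows one possible way to compare mean depth across genes, by expressing each gene's depth relative to the median. The per-gene depths are hypothetical placeholders, not the values in Table S4, and the procedure is an assumption, since the exact method is not described in this section. As a general rule of thumb, two assembled copies that each show a ratio near 1 are consistent with a genuine duplication, a ratio near 0.5 would suggest an assembly artefact, and a ratio near 2 would suggest a collapsed duplicate.

```python
from statistics import median


def relative_depth(gene_depth):
    """Express each gene's mean depth as a ratio to the median gene depth."""
    reference = median(gene_depth.values())
    return {gene: depth / reference for gene, depth in gene_depth.items()}


# Ratio of mitochondrial to nuclear coverage, from the values in the text.
mito_to_nuclear = 371 / 118
print(f"Mito/nuclear coverage ratio: {mito_to_nuclear:.2f}")

# Hypothetical per-gene mean depths, for illustration only (not Table S4).
toy_depth = {
    "cox1": 370.0,
    "cob": 365.0,
    "nad5": 380.0,
    "nad6_copy1": 372.0,
    "nad6_copy2": 368.0,
}
ratios = relative_depth(toy_depth)
for gene, ratio in ratios.items():
    print(f"{gene}: {ratio:.2f}")
```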