Differential gene expression analysis
NGS reads were assessed for quality control metrics after adapter
trimming with Trim Galore (Krueger, F., 2019). We then used Kallisto
(Bray, N. L. et al., 2016) to quantify transcript abundance by
pseudo-alignment on the latest release of the Apis melliferagenome assembly Amel_HAv3.1 (Wallberg, A. et al., 2019), compiled the
transcript abundance estimates for each sample into an expression matrix
in R (R Core Team, 2019), and summed transcripts by gene using the
tximport method, see Table S2 (Soneson, C. et al., 2015). Low count
genes (< 1 count across > 25% of samples) were
removed from the analysis, resulting in 92.38% (9178/9935) of theA. mellifera protein-coding genes (Wallberg, A. et al., 2019)
represented in this study.
We employed a two-factor design using DESeq2 (Love, M. I. et al., 2014)
to assess differential expression of protein-coding genes (DEGs) within
tissue types between nurse, forager, and winter bees, using colony as a
cofactor. In each analysis, we applied the default Benjamini-Hochberg
(Benjamini, Y. & Hochberg, Y., 1995) correction and set the threshold
for differential expression at padj < 0.01.
To assess sample relationships and variance, hierarchical clustering
analysis was conducted using the variance stabilized transformed (VST)
counts generated by DESeq2. We determined that two of the nurse fat body
samples (N1 and N3) were outliers from this analysis, as each created
separate branches independent of the other samples that clustered as
expected by phenotype (Figure S1). We identified 411 genes that were
differentially expressed between the clustered and outlier nurse samples
(padj < 0.05) but did not observe functional enrichment
for any biological processes among these genes. FastQC did not report
major differences in any of the default sequence quality metrics apart
from reads corresponding to highly overrepresented sequences. A local
BLASTN (McGinnis, S. & Madden, T. L., 2004) search of the
overrepresented sequences revealed many hits with 100% identity and
100% query cover for A. mellifera 16S ribosomal RNAs, a common
source of contamination during library preparation (Zhao, W. et al.,
2014). However, upon filtering our sequencing reads for honey bee rRNAs
(The RNAcentral Consortium, 2018), sample clustering did not improve
(Figure S2). Thus, we could not determine the true source of this
discrepancy, and so removed these samples from all downstream analyses.
\(\chi^{2}\) tests and k-means clustering
We performed chi-squared tests (with Yate’s continuity correction) to
determine, for each tissue, whether the number of DEGs between winter
bees and nurses is significantly different than the number of DEGs
between winter bees and foragers. Additionally, we performedk -means clustering (Oyelade, J. et al., 2016) to assess whether
the tissue-specific transcriptomes of winter bees had more in common
with foragers or nurses using the fviz_cluster function from the
factoextra (Kassambara, A., 2019) package in R, testing for \(k=2\)clusters. Separate tests were performed using: 1) all genes passing our
initial low-count filter (9178), and 2) only forager vs nurse DEGs.