Abstract
A decade ago, de novo transcriptome assembly emerged as a versatile and
powerful approach for making evolutionary inferences, analyzing gene
expression, and annotating novel transcripts, in particular for non-model
organisms lacking an appropriate reference genome. Various tools have
been developed to generate a transcriptome assembly, and even more
computational methods depend on the results of these tools for further
downstream analyses. In this issue of Molecular Ecology Resources,
Freedman et al. (2020) present a comprehensive analysis of errors in de
novo transcriptome assemblies across public data sets and different
assembly methods. They focus on two implicit assumptions that are often
violated: first, that the assembly presents an unbiased view of the
transcriptome; and second, that the expression estimates derived from the
assembly are reasonable, albeit noisy, approximations of the relative
frequencies of the expressed transcripts. They show that appropriate filtering
can reduce this bias but can also lead to the loss of a substantial
number of highly expressed transcripts. Thus, to partly alleviate the
noise in expression estimates, they propose a new normalization method,
length-rescaled CPM (counts per million). Remarkably, the authors found considerable
distortions at the nucleotide level, which lead to an underestimation
of diversity in transcriptome assemblies. The study by Freedman et al.
clearly shows that we have not yet reached “high quality” in the field
of transcriptome assembly. Above all, it helps researchers become aware of
these problems and to filter and interpret their transcriptome assembly
data appropriately and with caution.