BITACORA: A comprehensive tool for the identification and annotation of
gene families in genome assemblies
Abstract
Gene annotation is a critical bottleneck in genomic research, especially
for the comprehensive study of very large gene families in the genomes
of non-model organisms. Despite recent progress in automatic
methods, state-of-the-art tools used for this task often produce
inaccurate annotations, such as fused, chimeric, partial or even
completely absent gene models for many family copies, errors that
require considerable extra effort to correct. Here we present
BITACORA, a bioinformatics solution that integrates popular sequence
similarity-based search tools and Perl scripts to facilitate both the
curation of these inaccurate annotations and the identification of
previously undetected gene family copies directly in genomic DNA
sequences. We tested BITACORA by annotating the members of two
chemosensory gene families with different repertoire sizes in seven
available genome sequences, and compared its performance with that of
Augustus-PPX, a tool also designed to improve automatic annotations
using a sequence similarity-based approach. Despite the relatively high
fragmentation of some of these draft genomes, BITACORA improved the
annotation of many members of these families and detected thousands of
new chemoreceptors encoded in the genome sequences. The program
creates general feature format (GFF) files, with both curated and newly
identified gene models, and FASTA files with the predicted proteins.
These outputs can be easily integrated into genomic annotation editors,
greatly facilitating subsequent manual annotation and downstream
evolutionary analyses.