Chérubin Mugisha

Domain-specific language models (LMs) are ubiquitous in natural language processing. Effective biomedical text mining requires in-domain models of biomedical and clinical documents, which pose particular challenges. Most existing biomedical LMs rely on full self-attention, which does not scale to clinical-document lengths because of its quadratic time and memory complexity. In addition, in-domain LMs often overlook the importance of a dedicated tokenizer and of tuning hyperparameters for downstream tasks. In this paper, we introduce a model that addresses these three issues. Inspired by the BigBird architecture, we adopt an attention mechanism that scales linearly with sequence length, which allows us to extend the position embeddings and process sequences up to eight times longer. Our model was trained on biomedical and raw clinical corpora, and the same data were used to build a domain-specific SentencePiece tokenizer. In experiments comparing our tokenizer with those of existing models, ours achieved the lowest fertility rate. To optimize performance, we carried out additional data-oriented hyperparameter tuning, improving the accuracy and reproducibility of the model. Using the BLURB benchmark, we evaluated our model on named-entity recognition, relation extraction, sentence similarity, and question answering. Our model achieved competitive results, outperforming strong self-attention baselines on several datasets.
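As an aside, tokenizer fertility, the average number of subword tokens produced per whitespace-delimited word, is straightforward to measure. The sketch below is not the paper's code; it is a minimal illustration using Hugging Face tokenizers, with publicly available model names and a toy clinical sample chosen only for demonstration (a lower fertility on domain text suggests the vocabulary better matches that domain).

```python
# Minimal sketch: comparing tokenizer fertility (subword tokens per word)
# on a tiny clinical-style sample. Model names are illustrative, not the
# tokenizers evaluated in the paper.
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of subword tokens per whitespace-delimited word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    return n_tokens / n_words

sample = [
    "The patient was administered 5 mg of amlodipine for hypertension.",
    "MRI revealed a small ischemic lesion in the left parietal lobe.",
]

for name in ["bert-base-uncased", "dmis-lab/biobert-base-cased-v1.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(f"{name}: fertility = {fertility(tok, sample):.2f}")
```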