Efficient Biomedical Text Summarization with Quantized LLaMA 2:
Enhancing Memory Usage and Inference on Low-Powered Devices
Abstract
The deployment of large language models (LLMs) on edge devices
and non-server environments presents significant challenges, primarily
due to constraints in memory usage, computational power, and inference
time. This paper investigates the feasibility of running LLMs on
such devices by focusing on optimizing memory usage, employing
quantization techniques, and reducing inference time. Specifically, we
utilize LLaMA 2 for biomedical text summarization and implement Low-Rank
Adaptation (LoRA) quantization to compress the model and fine-tune it
using limited resources. Our study
systematically evaluates memory consumption during both training and
inference phases, demonstrating substantial reductions through efficient
LoRA quantization. Our results indicate that with careful optimization,
it is feasible to deploy sophisticated LLMs like LLaMA 2 on low-powered
devices, thereby broadening the scope of their application in
resource-constrained environments.
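
As a minimal sketch of the kind of setup described above, the following
shows how 4-bit quantization can be combined with LoRA fine-tuning using
the Hugging Face transformers, peft, and bitsandbytes libraries. The
checkpoint name, NF4 quantization settings, and LoRA hyperparameters
(rank 8, targeting the query and value projections) are illustrative
assumptions, not the exact configuration used in this work.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    model_id = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint

    # Load the base model with 4-bit NF4 weights to reduce memory use
    # during both fine-tuning and inference (QLoRA-style configuration).
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)

    # Attach LoRA adapters: the quantized base weights stay frozen and
    # only the small low-rank update matrices are trained.
    lora_config = LoraConfig(
        r=8,                                  # assumed adapter rank
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumed target layers
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # reports the small trainable fraction

Fine-tuning then proceeds with a standard causal-language-modeling loop
over summarization pairs, and the same quantized base model plus trained
adapters can be used for generation at inference time.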