Vulnerability Aspects Extraction and Discrepancies Detection across
Heterogeneous Threat Intelligence
Abstract
Security vulnerabilities are constantly reported and must be accurately
documented in vulnerability repositories. Each vulnerability
description usually includes key aspects, such as the vulnerable
product, version, component, vulnerability type, root cause, impact, and
attack vector. Understanding and managing these key aspects is crucial,
but manually analyzing and integrating the growing number of
vulnerabilities from heterogeneous databases is impractical, which
creates a need for automated solutions. This study investigates the
substantial discrepancies in aspect-level vulnerability information
across major vulnerability databases such as NVD, IBM X-Force,
ExploitDB, and Openwall. It addresses two major challenges: improving
the accuracy of extracting critical vulnerability aspects and detecting
discrepancies in these aspects across databases. The complexity of this
task stems from the heterogeneous and often conflicting nature of data
sources, coupled with the lack of effective techniques for accurate
aspect extraction and discrepancy resolution. Recent research has shown
that advanced natural language processing techniques, particularly
large language models (LLMs) such as GPT-3.5 and GPT-4, excel at
handling detailed and context-rich textual data. Our approach leverages
these LLMs to detect aspect-level discrepancies in vulnerability
information across databases. Through rigorous testing on a variety of
datasets, our approach extracts vulnerability aspects and detects
discrepancies significantly more accurately than traditional models,
which in turn supports more effective management and integration of
threat intelligence data.
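
To make the notion of an aspect-level discrepancy concrete, the following
minimal sketch represents the seven aspects named above as a record and
compares two records field by field. It is an illustration only and not
the LLM-based method of this study: the class VulnerabilityAspects, the
helpers normalize and find_discrepancies, and both sample records are
hypothetical, and a plain normalized string comparison stands in for the
semantic comparison an LLM would perform.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class VulnerabilityAspects:
    """The seven key aspects; None means the source does not state it."""
    product: Optional[str] = None
    version: Optional[str] = None
    component: Optional[str] = None
    vulnerability_type: Optional[str] = None
    root_cause: Optional[str] = None
    impact: Optional[str] = None
    attack_vector: Optional[str] = None


def normalize(value):
    """Case-fold and collapse whitespace to ignore trivial formatting."""
    if value is None:
        return None
    return " ".join(value.split()).casefold()


def find_discrepancies(a, b):
    """Map each aspect that both records state but disagree on to its values."""
    diffs = {}
    for f in fields(VulnerabilityAspects):
        va, vb = getattr(a, f.name), getattr(b, f.name)
        if va is None or vb is None:
            continue  # assumption: a missing aspect is unknown, not a conflict
        if normalize(va) != normalize(vb):
            diffs[f.name] = (va, vb)
    return diffs


# Hypothetical records describing the same vulnerability in two databases.
record_a = VulnerabilityAspects(
    product="ExampleServer",
    version="before 2.4.1",
    vulnerability_type="Buffer Overflow",
    attack_vector="remote",
)
record_b = VulnerabilityAspects(
    product="exampleserver",
    version="before 2.4.3",
    vulnerability_type="buffer  overflow",
    impact="denial of service",
)

discrepancies = find_discrepancies(record_a, record_b)
print(discrepancies)
```

In this sketch only the version aspect is flagged, since the product and
vulnerability type differ in formatting alone and the remaining aspects
are each stated by at most one source. Exact matching of this kind would
miss paraphrases (for example, two differently worded root causes), which
is the gap that a language-model-based comparison is meant to close.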