Understanding and tracing semantics of concepts to application domains
emerging from source code, documentation, and tests
Abstract
As software artifacts continuously evolve and grow in number, trace
links become more complex and the need for automated traceability
increases. Besides tracing components across different artifacts,
tracing to application domains is critical for understanding how
semantics are classified and covered (i.e., which application domains
are present in each artifact?). In this paper, we propose using NLP to
map concepts emerging from software artifacts to application domains,
and to trace these between artifacts. We extracted
the corpus keywords from source code, documentation, and tests. We ran
an optimised Latent Dirichlet Allocation (LDA) to generate the concepts
emerging from each artifact. We then calculated the similarity score of
each concept against each application domain, and ranked the differences
in these scores between pairs of artifacts. Results show that the
ranking of the inverse of the difference reflects the strength of
semantic tracing, and that different embeddings yield varying results. We
observed the strong applicability of our method and its replicability by
other researchers and practitioners, particularly in detecting
synchronised application domains that are traced between artifacts.
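
The scoring and ranking step described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the two-dimensional vectors stand in for real embeddings, the function names (`cosine`, `domain_scores`, `rank_traces`) are hypothetical, and taking the maximum concept-to-domain similarity as an artifact's score for a domain is an assumption made only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def domain_scores(concepts, domains):
    """Score each domain by its best-matching concept (max is an assumption)."""
    return {
        name: max(cosine(concept, vec) for concept in concepts)
        for name, vec in domains.items()
    }


def rank_traces(artifacts, domains, eps=1e-9):
    """Rank (artifact pair, domain) triples by the inverse score difference.

    A small difference between two artifacts' scores for the same domain
    gives a large inverse, i.e. a stronger trace between the artifacts.
    """
    scores = {name: domain_scores(cs, domains) for name, cs in artifacts.items()}
    ranked = []
    for a, b in combinations(sorted(artifacts), 2):
        for d in domains:
            diff = abs(scores[a][d] - scores[b][d])
            # eps avoids division by zero when the two scores are identical
            ranked.append((1.0 / (diff + eps), a, b, d))
    ranked.sort(key=lambda row: row[0], reverse=True)
    return ranked


# Toy data: one concept vector per artifact, two hypothetical domains.
domains = {"networking": [1.0, 0.0], "graphics": [0.0, 1.0]}
artifacts = {
    "code": [[0.9, 0.1]],
    "docs": [[0.8, 0.2]],
    "tests": [[0.1, 0.9]],
}

for strength, a, b, d in rank_traces(artifacts, domains):
    print(f"{a} <-> {b} on {d}: {strength:.2f}")
```

With these toy vectors, the code and documentation concepts point in nearly the same direction, so their pair ranks first for the "networking" domain. In practice the vectors would come from the chosen embedding model and the concepts from the LDA topics of each artifact.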