Prevalence of ‘Atoms of Confusion’ in Open Source Java Systems: An
Empirical Study
Abstract
Atoms of confusion, or simply “atoms,” are pieces of code that
lead to misunderstanding while being interpreted. Previous research has
shown that the presence of atoms has an effect on code readability.
Beyond causing misunderstanding in lab settings, atoms of confusion
are common and meaningful in open source C and C++ projects, and are
removed by bug-fix commits. However, because of syntactic differences
between languages, the prevalence of atoms may differ in projects
written in other languages (e.g., Java), which has yet to be explored.
This study takes a first step towards investigating the prevalence
of 12 different atoms in the 13 most popular open-source Java projects.
The relationship between the presence of atoms and aspects of code
maintainability is also studied. Results show that atoms are 4.7 times
more prevalent in Java projects than in open source C/C++ projects,
based on occurrences per line. Our corpus contains a total of 1,085,223
atoms, or one every 4.8 lines. Some atoms are very rare (e.g., the
Logic As Control Flow atom, which occurs once in 440,060 lines), while
others occur frequently (e.g., the Infix Operator Precedence atom,
which occurs once in 6.4 lines). The impact of the presence of atoms on
code maintainability is also explored. In addition, correlations between
atoms are investigated. Results indicate that object-oriented metrics
contribute less to atom prevalence, whereas fine-grained code metrics
show a relatively stronger association.
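
To make the two atoms named above concrete, the following is a minimal
illustrative sketch in Java. It is not taken from the studied corpus:
the class name AtomExamples and all method names are hypothetical, and
the "clear" variants are only one plausible way to rewrite each atom
without changing behavior.

```java
public class AtomExamples {

    // Infix Operator Precedence atom: the reader must recall that
    // '*' binds more tightly than '+'.
    static int precedenceAtom(int a, int b, int c) {
        return a + b * c;
    }

    // Same computation with the evaluation order made explicit.
    static int precedenceClear(int a, int b, int c) {
        return a + (b * c);
    }

    // Logic As Control Flow atom: the short-circuit '&&' decides
    // whether the increment on the right-hand side runs at all.
    static int logicAtom(int x, int[] counter) {
        boolean ignored = x > 0 && ++counter[0] > 0;
        return counter[0];
    }

    // Same behavior written with an explicit 'if' statement.
    static int logicClear(int x, int[] counter) {
        if (x > 0) {
            counter[0]++;
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(precedenceAtom(2, 3, 4));          // 14
        System.out.println(precedenceClear(2, 3, 4));         // 14
        System.out.println(logicAtom(5, new int[] {0}));      // 1
        System.out.println(logicAtom(-1, new int[] {0}));     // 0
        System.out.println(logicClear(5, new int[] {0}));     // 1
    }
}
```

Each pair of methods returns the same value for the same input; the
difference lies only in how much of the evaluation order the reader
has to reconstruct mentally.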