Abstract
Binary similarity detection determines whether two given binary code
snippets are similar, usually at function granularity. This task
is challenging due to differing compilation optimizations and CPU
architectures. Recently, deep-learning methods have made great
progress in this field, but most of them rely on manually selected
features or ignore important semantic information, such as code
literals and function signatures, during feature processing. In
addition, similarity training typically relies on random sampling and a
pairwise loss function, which cover only limited similarity relations
between functions. In this paper, we propose a new framework, MFEN-Sim,
to detect
similar binary functions. The framework contains three stages: feature
extraction and normalization, a multi-feature-based function feature
embedding network (MFEN), and a similarity learning network. Multiple
features including assembly instructions, CFG structures and function
code literals are extracted from binary functions. These features are
then fed into MFEN, which is composed of three modules: a function
semantic and structure embedding module, a function signature
prediction module, and a function code literal embedding module. The
three modules generate embeddings representing, respectively, the
function's semantic and structural features, its signature prediction
features and its code literal features. Finally, MFEN-Sim utilizes a
similarity training network based
on contrastive learning to make MFEN recognize more similarity relations
between functions. MFEN-Sim is evaluated on 281,601 functions in 144
binaries and 17 CVEs in real-world software. Experimental results show
that MFEN-Sim outperforms state-of-the-art systems (i.e., Gemini, FIT
and SAFE) by 7.1%, 9.9% and 8.2%, respectively, in AUC for
cross-architecture and cross-optimization-level similarity detection,
and achieves higher recall than the baselines when searching for
vulnerabilities in real-world applications.
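
To illustrate why contrastive training can cover more similarity
relations than a pairwise loss, here is a minimal sketch of a
batch-wise contrastive objective. The abstract does not state the exact
loss used by MFEN-Sim, so the InfoNCE form, the cosine similarity, the
temperature value and the names cosine and info_nce_loss are all
assumptions made for illustration. Each function embedding is compared
against every other function in the batch, not against a single
sampled partner.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.1):
    """Hypothetical batch-wise contrastive (InfoNCE) loss.

    anchors[i] and positives[i] are assumed to be embeddings of the same
    source function compiled in two different ways (for example, another
    architecture or optimization level). Every positives[j] with j != i
    acts as a negative for anchors[i], so one batch of n functions
    yields n * (n - 1) negative relations.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the matching variant as the correct class.
        total += log_sum - logits[i]
    return total / n


# Toy 2-D embeddings: two functions, each compiled in two variants.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
aligned_loss = info_nce_loss(anchors, positives)
mismatched_loss = info_nce_loss(anchors, positives[::-1])
```

In this toy setup the loss is small when each anchor is closest to its
own variant and large when the variants are swapped, which is the
signal a contrastive similarity network would train on.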