XIANGYU LI

and 3 more

Binary similarity detection determines whether two given binary code snippets are similar, usually at function granularity. The task is challenging because the same source code produces different binaries under different compilation optimizations and CPU architectures. Deep-learning methods have recently achieved strong results in this field, but most of them rely on manually selected features or ignore important semantic information, such as code literals and function signatures, during feature processing. In addition, they typically train with randomly sampled pairs and a pairwise loss function, which covers only a limited set of similarity relations between functions. In this paper, we propose a new framework, MFEN-Sim, to detect similar binary functions. The framework contains three stages: feature extraction and normalization, a multi-feature based function feature embedding network (MFEN), and a similarity learning network. Multiple features, including assembly instructions, CFG structures and function code literals, are extracted from binary functions. These features are then fed into MFEN, which is composed of three modules: a function semantic and structure embedding module, a function signature prediction module, and a function code literal embedding module. The three modules generate embeddings representing the function semantic and structural features, the function signature prediction features, and the function code literal features. Finally, MFEN-Sim uses a similarity training network based on contrastive learning so that MFEN recognizes more similarity relations between functions. MFEN-Sim is evaluated on 281,601 functions in 144 binaries and on 17 CVEs in real-world software. Experimental results show that our work outperforms state-of-the-art systems (i.e., Gemini, FIT and SAFE) by 7.1%, 9.9% and 8.2% on the AUC metric in cross-architecture, optimization-level similarity detection, and achieves higher recall than the baselines when searching for vulnerabilities in real-world applications.
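To make the contrast between pairwise training and contrastive training concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the three module embeddings are fused by simple concatenation and that the contrastive objective is an InfoNCE-style loss with in-batch negatives; the abstract specifies neither, and the names `fuse`, `info_nce_loss` and `temperature` are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(semantic, signature, literal):
    """Hypothetical fusion step: concatenate the three module embeddings
    and normalize. The real MFEN fusion may differ."""
    return l2_normalize(list(semantic) + list(signature) + list(literal))


def cosine(a, b):
    """Cosine similarity for inputs that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (an assumed objective, not taken from the paper).

    anchors[i] and positives[i] embed the same source function compiled
    under two settings; every positives[j] with j != i acts as a negative,
    so one batch of n pairs yields n * (n - 1) negative relations instead
    of a single relation per randomly sampled pair.
    """
    n = len(anchors)
    total = 0.0
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_denominator = peak + math.log(
            sum(math.exp(value - peak) for value in logits))
        total += log_denominator - logits[i]
    return total / n


if __name__ == "__main__":
    # Toy batch: three functions with hand-written, made-up embeddings.
    batch = [
        fuse([1.0, 0.0], [0.0], [0.0]),
        fuse([0.0, 1.0], [0.0], [0.0]),
        fuse([0.0, 0.0], [1.0], [0.0]),
    ]
    print("aligned pairs:   ", info_nce_loss(batch, batch))
    print("misaligned pairs:", info_nce_loss(batch, batch[1:] + batch[:1]))
```

In this toy batch the loss is close to zero when each anchor is paired with its own counterpart and much larger when the pairs are shifted by one position, which is the behavior a contrastive objective is meant to reward.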