Self-supervised training can fully exploit the structural characteristics of unlabeled data and has substantially advanced natural language processing. We explore how these advantages of self-supervision can be applied to query expansion. To this end, we collected a large query corpus and used it to train a generative model, and we introduce the SG-BERT/GPT2 expansion method, which relies on two generators, BERT and GPT2. We also analyze the impact on code search before and after training the generative model, as well as the influence of the original query length on the effectiveness of expansion. Experimental results consistently show that the trained model outperforms the untrained one. Moreover, in deep-learning-based code search, query expansion is most effective on datasets whose original queries are relatively short, while for longer original queries expansion can have the opposite effect. Finally, compared with several established query expansion methods, the SG-BERT/GPT2 expansion method, rooted in the self-supervised generative model, consistently performs better on test datasets in multiple languages. With continued self-supervised generation and ongoing optimization of the query corpus, further gains in effectiveness can be achieved.
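To make the basic idea concrete, the sketch below shows generative query expansion with an off-the-shelf GPT-2 checkpoint from the Hugging Face `transformers` library: the original query is used as a prompt and the generated continuation is appended to form the expanded query. This is a minimal illustration of the general technique only; the SG-BERT/GPT2 generators described here are trained on a dedicated self-supervised query corpus, which this example does not reproduce, and the model name and sampling parameters are assumptions.

```python
# Minimal sketch of generative query expansion (illustrative only).
# Assumes the public "gpt2" checkpoint; the paper's generators are
# fine-tuned on a self-supervised query corpus, not shown here.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def expand_query(query: str, max_new_tokens: int = 12) -> str:
    """Append model-generated continuation tokens to the original query."""
    inputs = tokenizer(query, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,   # keep expansions short for short queries
        do_sample=True,
        top_k=50,
        top_p=0.95,
        pad_token_id=tokenizer.eos_token_id,
    )
    # The decoded output contains the original query followed by the expansion.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage with a hypothetical code-search query:
print(expand_query("convert string to datetime"))
```

In practice, a BERT-style generator would instead fill masked slots appended to the query rather than autoregressively continuing it, which is the difference between the two generators the method combines.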