In traditional machine learning settings, non-parametric error estimation is an effective way to set the discriminative threshold of a classifier for the best accuracy. The method is not effective in transfer learning, however, because it is only reliable when the training and testing data have similar distributions, which is not the case in a transfer learning setting. Control variate techniques have been proposed to exploit information about the error in the training sample to reduce the error in the test sample, but this method yields a finite variance and the model uncertainty is not distributed among the variance. In this paper, we propose and test a new transfer learning validation method, control linear minimum mean-squared error (CLMMSE), for source model selection under homogeneous transfer learning settings in the absence of an adequate pre-trained source model. Our approach adopts the Bayesian linear minimum mean-squared error (LMMSE) and integrates importance sampling into a control variate approach to provide an accurate estimate of the LMMSE, which is then used to select the optimal source model. By combining importance sampling with the control variate technique to further reduce the variance, we achieve a much tighter bound with the LMMSE. This approach reduces the risk in the target domain under data shift. Experimental results on synthetic data under two data shift settings demonstrate the efficacy of our approach. A further experiment on two real-world datasets shows that our method improves the accuracy of the two state-of-the-art models tested, BERT (0.94\% to 65\%) and CodeBERT (1.82\% to 18.2\%), compared with previous selection methods.
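To illustrate the general idea of combining importance sampling with a control variate, the following minimal Python sketch estimates a target-domain error rate from source-domain samples. It is not the CLMMSE estimator itself, whose details are not given in this abstract. It assumes the density ratio w(x) = p_target(x) / p_source(x) is known exactly and uses the fact that its mean under the source distribution is 1 as the control variate. The function name `iw_control_variate_risk`, the Gaussian shift, and the toy loss are all illustrative assumptions.

```python
import numpy as np


def iw_control_variate_risk(losses, weights):
    """Importance-weighted risk estimate with the weights as a control variate.

    Hypothetical helper: `losses` are per-sample losses on source data and
    `weights` are density ratios p_target / p_source, whose source-domain
    expectation is known to be 1.
    """
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weighted = weights * losses

    var_w = weights.var()
    if var_w == 0.0:
        # Constant weights carry no information; fall back to the plain mean.
        return float(weighted.mean())

    # Variance-minimising coefficient, estimated from the same sample.
    cov = np.mean((weighted - weighted.mean()) * (weights - weights.mean()))
    beta = cov / var_w
    return float(weighted.mean() - beta * (weights.mean() - 1.0))


# Toy covariate shift (assumed): source is N(0, 1), target is N(0.5, 1).
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=20_000)
weights = np.exp(0.5 * x - 0.125)   # exact density ratio for this shift
losses = (x > 1.0).astype(float)    # toy 0/1 loss: an "error" when x > 1

plain = float(np.mean(weights * losses))            # importance sampling only
controlled = iw_control_variate_risk(losses, weights)
print(plain, controlled)
```

Both estimates target the same quantity, the probability that x > 1 under the target distribution; the control variate term only corrects for how far the sample mean of the weights drifts from its known value of 1.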
Cross-project defect prediction (CPDP) makes use of cross-project (CP) data to overcome the lack of data necessary to train well-performing software defect prediction (SDP) classifiers in the early stage of new software projects. Because the CP data (known as the source) may differ from the new project’s data (known as the target), it is difficult for CPDP classifiers to perform well. In particular, it is the mismatch of data distributions between source and target that creates this difficulty. Transfer learning-based CPDP classifiers are designed to minimize these distribution differences. The first transfer learning-based CPDP classifiers treated these differences equally, thereby degrading prediction performance. To address this, recent research has proposed the Weighted Balanced Distribution Adaptation (W-BDA) method, which leverages the importance of both distribution differences to improve classification performance. Although W-BDA has been shown to improve model performance in CPDP, research to date has failed to consider model performance in light of increasing target data or variances in data sampling. We provide the first investigation of when, and to what extent, increasing the target data and using various sampling techniques affect performance when leveraging the importance of both distribution differences. We extend the initial W-BDA method and call this extension W-BDA+. To evaluate the effectiveness of W-BDA+ for improving CPDP performance, we conduct eight experiments on 18 projects from four datasets, where data sampling was performed with different sampling methods. We evaluate our method using four complementary indicators (i.e., Balanced Accuracy, AUC, F-measure and G-Measure). Our findings reveal an average improvement of 6%, 7.5%, 10% and 12% for these four indicators when W-BDA+ is compared to five other baseline methods (including W-BDA), for all four of the sampling methods used.
Also, as the target-to-source ratio is increased with different sampling methods, we observe a decrease in performance for the original W-BDA, with our W-BDA+ approach outperforming the original W-BDA in most cases. Our results highlight the importance of adjusting for data imbalance and of being aware of the effect of the increasing availability of target data in CPDP scenarios.
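As a rough illustration of the four indicators named above, the sketch below computes them for a binary defect classifier. The abstract does not state which formulas the study uses, so the definitions here are assumptions: F-measure is taken as the harmonic mean of precision and recall, and G-measure as the harmonic mean of recall and specificity. The function name `cpdp_indicators` is hypothetical.

```python
import numpy as np


def cpdp_indicators(y_true, y_pred, y_score):
    """Balanced accuracy, AUC, F-measure and G-measure for binary labels.

    Hypothetical helper; the definitions are assumptions, not taken from the
    study. Label 1 marks a defective module; `y_score` is the predicted
    score for class 1.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    y_score = np.asarray(y_score, dtype=float)

    tp = np.sum(y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    tn = np.sum(~y_true & ~y_pred)
    fp = np.sum(~y_true & y_pred)

    def ratio(num, den):
        # Return 0.0 when a ratio is undefined (for example, no positives).
        return float(num) / float(den) if den > 0 else 0.0

    recall = ratio(tp, tp + fn)        # probability of detection
    specificity = ratio(tn, tn + fp)   # 1 - probability of false alarm
    precision = ratio(tp, tp + fp)

    # AUC as the fraction of (defective, clean) pairs ranked correctly,
    # counting ties as half a correct pair.
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    auc = ratio(np.sum(diff > 0) + 0.5 * np.sum(diff == 0), pos.size * neg.size)

    return {
        "balanced_accuracy": (recall + specificity) / 2.0,
        "auc": auc,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "g_measure": ratio(2 * recall * specificity, recall + specificity),
    }


result = cpdp_indicators(
    y_true=[1, 1, 0, 0, 0, 1],
    y_pred=[1, 0, 0, 0, 1, 1],
    y_score=[0.9, 0.4, 0.2, 0.1, 0.6, 0.8],
)
print(result)
```

Balanced accuracy and G-measure both combine recall with specificity, which is why they are often preferred over plain accuracy when defective modules are a small minority of the data.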