
Software defect prediction using K-PCA and various kernel-based extreme learning machines: an empirical study

Predicting defects during software testing substantially reduces testing effort and helps to deliver a high-quality software system. Owing to the skewed class distribution of public datasets, software defect prediction (SDP) suffers from the class imbalance problem, which leads to unsatisfactory results. Overfitting is another major challenge for SDP. In this study, the authors performed an empirical study of these two problems and investigated probable solutions. They conducted 4840 experiments over five different classifiers using eight NASA projects and 14 PROMISE repository datasets. They proposed varying the kernel function of an extreme learning machine (ELM) combined with kernel principal component analysis (K-PCA) and obtained better results than other classical SDP models. They used the synthetic minority oversampling technique (SMOTE) to address the class imbalance problem and k-fold cross-validation to mitigate overfitting. The ELM-based SDP model achieved a higher area under the receiver operating characteristic curve on 11 of the 22 datasets, and higher precision and F-score on ten and nine datasets, respectively, than other state-of-the-art models. On 17 datasets, the Matthews correlation coefficient (MCC) of the proposed model surpassed that of the other classical models.
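The core of the approach above is a kernel-based ELM: rather than randomly initialising hidden-layer weights, the kernel variant solves for the output weights in closed form as β = (I/C + K)⁻¹T, where K is the kernel matrix and T the one-hot targets. The sketch below is a minimal NumPy illustration of that idea on a toy imbalanced two-class dataset; it is not the authors' implementation, and the RBF kernel, the regularisation constant C, and the synthetic data are illustrative assumptions (K-PCA and SMOTE preprocessing are omitted for brevity).

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # pairwise squared Euclidean distances -> RBF (Gaussian) kernel matrix
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class KernelELM:
    """Minimal kernel extreme learning machine (RBF kernel).

    Output weights are solved in closed form: beta = (I/C + K)^-1 T,
    following the standard kernel-ELM formulation (Huang et al.).
    Hyperparameters C and gamma here are illustrative choices.
    """
    def __init__(self, C=10.0, gamma=0.5):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X_train = X
        T = np.eye(y.max() + 1)[y]                   # one-hot targets
        K = rbf_kernel(X, X, self.gamma)             # n x n kernel matrix
        n = len(X)
        self.beta = np.linalg.solve(np.eye(n) / self.C + K, T)
        return self

    def predict(self, X):
        K = rbf_kernel(X, self.X_train, self.gamma)  # kernel vs. training points
        return (K @ self.beta).argmax(axis=1)

# Toy imbalanced "defect" data: a majority (non-defective) and a
# minority (defective) Gaussian blob -- hypothetical stand-in data.
rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, (90, 2))
X1 = rng.normal(3.0, 1.0, (30, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 90 + [1] * 30)

model = KernelELM(C=10.0, gamma=0.5).fit(X, y)
acc = (model.predict(X) == y).mean()
```

Because β has a closed-form solution, no iterative gradient training is needed, which is the main speed argument for ELM-style models; swapping `rbf_kernel` for another positive-definite kernel is how the "various kernel" comparison in the title would be realised.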
