A Hybrid Approach to Improving the Results of SVM Classification Using Posterior Probability and Correlation

Canggih Ajika Pamungkas, Megat F. Zuhairi

Abstract


Class imbalance in datasets poses significant challenges to traditional machine learning models such as the Support Vector Machine (SVM), leading to poor performance on minority-class classification. To address this issue, this study introduces a hybrid approach, Posterior Probability and Correlation-SVM (PC-SVM), which combines posterior probability estimation with correlation analysis. The aim is to enhance the SVM's ability to classify imbalanced datasets by weighting attributes according to their correlation with the target class and using posterior probabilities to refine decision boundaries. The methodology comprises preprocessing the datasets to ensure data quality, applying correlation analysis to compute attribute weights, and using these weights to transform the input features into posterior probability estimates, which then serve as inputs to the SVM classifier. Experiments were conducted on two datasets, Yeast and Churn, which exhibit different degrees of class imbalance. The results show that the PC-SVM model achieves 100% accuracy, precision, recall, and F1-score across all classes, significantly outperforming the standard SVM, and that the approach mitigates the bias toward majority classes by improving sensitivity to minority instances. These findings highlight the robustness and reliability of PC-SVM for imbalanced data classification. In conclusion, integrating posterior probabilities with correlation-based attribute weighting substantially enhances SVM performance on imbalanced datasets. Future research should extend the approach to multiclass problems and optimize its computational efficiency.
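
The pipeline described above (correlation-weighted attributes, posterior-probability features, SVM classifier) can be sketched as follows. This is a minimal illustration of the idea, not the authors' exact algorithm: the attribute weights are assumed to be absolute Pearson correlations with the label, the posterior probabilities are approximated with a Platt-calibrated SVM from scikit-learn, and the synthetic imbalanced data merely stands in for the Yeast and Churn datasets.

```python
# Minimal sketch of the PC-SVM idea (illustrative reading of the abstract, not the
# authors' published implementation). All names and modelling choices here are
# assumptions made for the example.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Imbalanced toy data (90% / 10%) standing in for the Yeast/Churn datasets.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 1) Correlation-based attribute weights: |Pearson r| between each feature and the label.
weights = np.array([abs(np.corrcoef(X_train[:, j], y_train)[0, 1])
                    for j in range(X_train.shape[1])])
weights /= weights.sum()  # normalise the weights so they sum to 1

# 2) Posterior-probability features: approximated here with a probability-calibrated
#    SVM (Platt scaling via probability=True) trained on the weighted attributes.
prob_model = SVC(kernel="rbf", probability=True, random_state=42)
prob_model.fit(X_train * weights, y_train)
P_train = prob_model.predict_proba(X_train * weights)
P_test = prob_model.predict_proba(X_test * weights)

# 3) Final SVM trained on the posterior-probability representation.
clf = SVC(kernel="rbf", class_weight="balanced", random_state=42)
clf.fit(P_train, y_train)
print(classification_report(y_test, clf.predict(P_test)))
```

Under this reading, the correlation weights de-emphasise attributes that carry little information about the class, and the posterior-probability transform gives the final SVM a low-dimensional, class-aware representation in which minority instances are easier to separate.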




DOI: https://doi.org/10.29040/ijcis.v6i1.217


This work is licensed under a Creative Commons Attribution 4.0 International License.