Finance and Credit


Overcoming the class imbalance in modeling the credit default

Vol. 18, Iss. 11, NOVEMBER 2019


Received: 17 October 2019

Received in revised form: 31 October 2019

Accepted: 14 November 2019

Available online: 29 November 2019

Subject Heading: Banking

JEL Classification: G21, G28

Pages: 2534–2561

Roskoshenko V.Vl. Lomonosov Moscow State University (MSU), Moscow, Russian Federation

ORCID id: not available

Subject The banking sector faces class imbalance in the samples used to model credit default. Data pre-processing is traditionally the first choice in bank modeling because it helps overcome the class imbalance. Existing studies of such approaches and their comparison cover only a few methods or focus on very specific data. Moreover, previous research overlooks approaches that combine data pre-processing with ensemble-based solutions (stacking).
Objectives The study aims to find the best option for overcoming class imbalance within each group of approaches, as applied to bank data on retail lending.
Methods The study employs mathematical modeling, statistical analysis and content analysis of sources.
Results Although mathematically rather complex, the EditedNearestNeighbours approach proved the most suitable for data pre-processing. It removes members of the dominant class that do not fit their surrounding neighborhood, which is determined through clustering. Among combinations of data pre-processing and stacking, RandomOverSampler performed best. It randomly increases the share of the minority class and is the simplest of the approaches.
Conclusions and Relevance The article presents an exhaustive comparison of approaches to class imbalance in samples. I selected the most appropriate data pre-processing approach and the best combination of data pre-processing with an ensemble-based solution. The findings can be used in credit scoring and statistical modeling wherever binary classification is required.

Keywords: credit scoring, logistic regression, ensemble, class imbalance, binary classification
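
The two resampling ideas named in the Results can be illustrated with a minimal sketch in plain Python. It is not the author's code and not the imbalanced-learn implementation. The function names, the toy data, the choice of k = 3, and the strict rule that all neighbours must belong to the dominant class are assumptions made for illustration. The neighborhood here is found by a k-nearest-neighbour search with Euclidean distance, which is also an assumption; the abstract only says that it is determined through clustering.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))  # sampling with replacement
            y_out.append(label)
    return X_out, y_out


def edited_nearest_neighbours(X, y, majority_label, k=3):
    """Drop dominant-class rows unless all of their k nearest neighbours are dominant-class too."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)  # minority rows are never removed
            continue
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        if all(y[j] == majority_label for j in others[:k]):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy sample: label 0 is the dominant class (non-default),
# label 1 is the minority class (default). The last row is a dominant-class
# point sitting inside the minority region.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.2, 0.0), (0.0, 0.2),
     (1.0, 1.0), (1.1, 1.0), (1.05, 1.05)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 0]

X_enn, y_enn = edited_nearest_neighbours(X, y, majority_label=0, k=3)
X_ros, y_ros = random_oversample(X, y, seed=1)
print(Counter(y_enn), Counter(y_ros))
```

On this toy sample the editing step removes only the dominant-class point that lies among minority points, while oversampling leaves the original rows untouched and appends duplicated minority rows until both classes have the same count.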





ISSN 2311-8709 (Online)
ISSN 2071-4688 (Print)
