International Journal of Applied Mathematics and Computer Science

Paper details

Number 2 - June 2024
Volume 34 - 2024

A Gaussian-based WGAN-GP oversampling approach for solving the class imbalance problem

Qian Zhou, Bo Sun

In practical applications of machine learning, the class distribution of the collected training set is usually imbalanced, i.e., the sizes of the classes differ greatly. This class imbalance problem often severely limits the generalization performance achievable by most classifier learning algorithms. To improve learning performance, several effective approaches have been proposed in the literature, among which the recently presented GAN-based oversampling methods are representative. However, the minority class examples they generate risk a high degree of similarity and duplication. To further improve the quality of the generated minority class examples, i.e., to make the generated examples effectively expand the minority class region, a novel oversampling approach named the GWGAN-GP is proposed, which is based on a Gaussian distribution label within the framework of a Wasserstein generative adversarial network with gradient penalty (WGAN-GP). The GWGAN-GP approach incorporates the Gaussian distribution as an input label, thereby making the generated examples more diverse and dispersed. The generated examples are then combined with the original dataset to form a balanced dataset, which is subsequently used to evaluate the classification performance of three selected classification algorithms. Experimental results on 16 imbalanced datasets demonstrate that the GWGAN-GP not only generates examples that better conform to the distribution of the original dataset, but also achieves superior classification performance. Specifically, when combined with the KNN classifier, the GWGAN-GP significantly outperforms the other oversampling approaches considered in the study.
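The balancing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `balance_by_oversampling` and `toy_generator`, the binary-class setting, and the label distribution N(1, 0.1^2) are all assumptions made for illustration, and the toy generator merely jitters existing minority examples where a trained WGAN-GP generator would be used.

```python
import random
from collections import Counter


def balance_by_oversampling(X, y, generate, seed=0):
    """Append synthetic minority examples until the two classes are equal in size.

    `generate(soft_label, rng)` is a hypothetical stand-in for a trained
    conditional generator that takes a Gaussian-distributed label as input.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    n_missing = max(counts.values()) - counts[minority]

    X_bal, y_bal = list(X), list(y)
    for _ in range(n_missing):
        # Assumption: the input label is drawn from N(1, 0.1^2) instead of
        # being a fixed constant; the paper's actual parameters are not given.
        soft_label = rng.gauss(1.0, 0.1)
        X_bal.append(generate(soft_label, rng))
        y_bal.append(minority)
    return X_bal, y_bal


def toy_generator(minority_examples):
    """Placeholder generator: jitters a randomly chosen minority example."""
    def generate(soft_label, rng):
        base = rng.choice(minority_examples)
        return [v + soft_label * rng.gauss(0.0, 0.05) for v in base]
    return generate


# Toy imbalanced dataset: six majority (0) and two minority (1) examples.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1],
     [1.0, 1.1], [1.2, 0.9]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = balance_by_oversampling(X, y, toy_generator(X[6:]))
class_sizes = Counter(y_bal)
```

After the call, `class_sizes` holds equal counts for both classes, and the original examples remain at the front of `X_bal`, so the balanced set can be passed unchanged to any classifier, such as KNN.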

machine learning, class imbalance, generative adversarial networks, oversampling, data duplication