Rebalancing Clinical Data with Probabilistic Random Oversampling
Keywords:
Data imbalance, Oversampling, Prediction
Abstract
Data analysis has become a popular tool for extracting knowledge and building useful applications in many business areas. However, this is often not the case in the healthcare industry, because the data collected by hospitals and health centers are often biased: either most records come from similar patients, or the data sets do not contain enough cases of the disease of interest. As a result, prediction and classification on healthcare data usually suffer from data bias. Since most machine learning algorithms assume sufficiently balanced data, an assumption that is essentially based on general statistics, their prediction performance is strongly affected by this bias. Several well-known methods, such as SMOTE and random oversampling, address this data imbalance problem. However, most of them do not make use of the knowledge hidden in the data. We propose a probability model-based random oversampling technique that uses this knowledge (the probability distribution of the data) to perform random oversampling better than existing methods. The generated data are based on probability models of the portion of the data we are interested in. This reduces the chance of generating rigid or strict samples that can strongly alter the statistical information of the data in the experiment. We tested our method on widely known data sets, namely the UCI diabetes and breast cancer data sets. We found that our technique outperforms SMOTE and random oversampling in terms of sensitivity and specificity.
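To make the idea concrete, the following is a minimal sketch of probability model-based oversampling, not the exact procedure evaluated here. It assumes a binary problem, numeric features, and an independent normal distribution fitted to each feature of the minority class; the choice of distribution, the independence assumption, and the helper name `probabilistic_oversample` are illustrative assumptions only.

```python
import numpy as np

def probabilistic_oversample(X, y, minority_label, rng=None):
    """Hypothetical helper: add synthetic minority rows until classes are balanced.

    Assumption: each minority feature is modeled by an independent normal
    distribution; other probability models could be fitted instead.
    Requires at least two minority rows so the sample std is defined.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    minority = X[y == minority_label]
    # Number of synthetic rows needed to match the majority class size.
    n_needed = int((y != minority_label).sum() - len(minority))
    if n_needed <= 0:
        return X, y

    # Fit the per-feature probability model on the minority class only.
    mu = minority.mean(axis=0)
    sigma = minority.std(axis=0, ddof=1)

    # Draw new rows from the fitted model instead of duplicating records.
    synthetic = rng.normal(mu, sigma, size=(n_needed, X.shape[1]))

    X_bal = np.vstack([X, synthetic])
    y_bal = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_bal, y_bal

# Toy data: 90 majority rows and 10 minority rows with 3 features.
demo_rng = np.random.default_rng(0)
X = np.vstack([demo_rng.normal(0, 1, (90, 3)), demo_rng.normal(2, 1, (10, 3))])
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = probabilistic_oversample(X, y, minority_label=1, rng=0)
```

Because new rows are drawn from a fitted distribution rather than copied (random oversampling) or interpolated between neighbors (SMOTE), the synthetic minority records vary around the observed statistics instead of repeating existing points.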
References
Ma, Y., He, H., “Imbalanced Learning: Foundations, Algorithms, and Applications”, 1st edition, Wiley-IEEE Press, 2013.
Fernández, “Learning from Imbalanced Data Sets”, 1st edition, Springer, 2018.
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P., “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research, Vol. 16, 2002.
Madan, D., “Estimating Parametric Models of Probability Distributions”, SSRN, 2015.
Law, A., Kelton, D., “Simulation Modeling and Analysis”, 3rd edition, McGraw-Hill Higher Education, 2000.
Law, A., “Simulation Modeling and Analysis”, 5th edition, McGraw-Hill Series in Industrial Engineering and Management, 2014.
Breast Cancer Wisconsin (Diagnostic) Data Set, UCI Machine Learning Database.
Pima Indians Diabetes Database, UCI Machine Learning Database.