Hello everybody, I hope everything is going well. In this blog post, we will talk about 8 ways to handle imbalanced data in Python and how imbalance affects the outcome at the prediction step.
When the distribution of the data across labels or classes is biased or skewed, we have an imbalanced classification problem. As an illustration, take a CSV file with two classes, where one class has 3252 data points and the other has 392. That is a ratio of roughly 8 to 1, a clear case of imbalance.
Imbalanced classification is a challenge for predictive modeling because most machine learning algorithms for classification were designed on the assumption that each class has roughly the same number of samples.
Imbalance is most concerning when models are deployed to solve real-world problems or to drive automation, since it increases the risk that a minority class will be incorrectly predicted as the majority class. For instance, if a self-driving car misinterprets a road sign instructing it to stop, it may speed up instead, endangering the lives of anyone in the car or its vicinity. This could happen if the dataset the car was trained on had heavily imbalanced classes and nothing was done to correct the imbalance before model training.
Classification Predictive Modelling
In predictive modeling, classification means assigning a class label to each observation.
For example, consider a dataset of plant species that contains vascular plants, flowering plants, and other smaller plant groups. The dataset seems to be imbalanced with respect to flowering plants and the smaller plant groups. This may create a problem during prediction. Alternatively, instead of predicting a class label directly, we could predict the likelihood of each class.
A classification predictive modeling task may have just two class labels. This two-class case, known as binary classification, is the simplest kind of classification problem. Alternatively, the problem may have more than two classes: perhaps three, ten, hundreds, or even thousands. These are called multi-class classification problems.
- A binary classification problem is one in which there are only two possible classifications for each example.
- A multiclass classification problem is one in which there are more than two classes and each example belongs to one of them.
The above example we saw can be considered a multiclass classification problem.
We must gather a training dataset when working on classification predictive modeling problems.
A training dataset is a collection of samples from the domain that comprise both the input data (such as observations of plants) and the output data (e.g. the class label).
Training Dataset: A collection of instances taken from the given problem that include the input observations and the class labels for the outcomes. We may require dozens, thousands, or even millions of samples from the domain to make up a training dataset, depending on the nature of the problem and the models we choose to apply.
The training data is used to understand the input data well enough to prepare it appropriately for modeling. It is also used to assess a variety of modeling approaches and to fine-tune a model’s hyperparameters. In the last step, the training dataset is used to fit a final model that can be applied to new data to predict future examples from the area of study.
Reasons Behind Class Imbalance
Biases and measurement errors made during data collection could be the main factors contributing to the class imbalance.
It’s possible that mistakes were made while gathering the observations. One such mistake is applying incorrect class labels to many samples. Alternatively, the systems or processes from which the instances were drawn may have been broken or impaired, causing the imbalance.
When a measurement error or sampling bias is to blame, the training dataset does not fairly represent the problem domain, and the imbalance can frequently be fixed by using better sampling techniques or by correcting the measurement error.
The imbalance may also be a property of the problem domain itself.
In simple terms, a class distribution that counts as an imbalance for one type of dataset or problem may be perfectly natural for another.
These two explanations are general ones; there may be additional, more problem-specific reasons for the class imbalance.
Challenges Of Class Imbalance
The severity of class imbalance varies with the problem and the dataset. For example, suppose the dataset has a small skew, with the majority class having 4000 data points and the minority class having 3980. You need not be concerned about this kind of data and can feed it to the classification model as it is.
However, a severe case of class imbalance occurs when the minority class has tens of data points while the majority class has hundreds, thousands, or even more data points. In this case, you should be concerned.
The minority class is often of the greatest interest when dealing with a severely imbalanced classification problem. This implies that the quality of a model’s predictions of the class label or likelihood for the minority class matters much more than for the class or classes that make up the majority. At the same time, the minority class, by definition, has few examples, which makes it harder to predict: it is more difficult for a model to learn the traits of samples from this class and to tell them apart from instances of the majority class (or classes).
This becomes a major issue because the abundance of data in the majority classes raises the likelihood of inaccurate predictions for the minority class. There will always be some degree of uncertainty in a prediction made using a machine learning model or algorithm, but under severe class imbalance the likelihood of misclassification is significant and should be reduced as far as possible.
Techniques to handle Imbalance data in Python
1- Choosing the right evaluation metrics
The selection of evaluation metrics is a crucial step, since accuracy is not an ideal metric under severe imbalance. For example, if 99.6% of the test samples belong to class “0”, a model that labels every test sample as “0” will reach an accuracy of 99.6%, but obviously this model won’t offer us any useful information.
A widely used evaluation metric for imbalanced data is the F1 score.
The F1 score is the harmonic mean of precision and recall. Because it ignores true negatives, it is far less flattered by a large majority class than accuracy is. It ranges from 0 (worst) to 1 (best): the higher, the better.
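To make the accuracy pitfall concrete, here is a minimal sketch in plain Python. The test set is hypothetical: 1000 samples with only 4 positives, chosen so that an all-zero model reaches the 99.6% accuracy mentioned above. The helper name precision_recall_f1 is my own and not part of any library.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical test set: 996 negatives followed by 4 positives.
y_true = [0] * 996 + [1] * 4

# A "model" that always predicts the majority class.
y_pred_all_zero = [0] * 1000
acc_all_zero = accuracy(y_true, y_pred_all_zero)
_, _, f1_all_zero = precision_recall_f1(y_true, y_pred_all_zero)

# A model that finds 3 of the 4 positives at the cost of 5 false alarms.
y_pred_useful = [0] * 991 + [1] * 5 + [1, 1, 1, 0]
acc_useful = accuracy(y_true, y_pred_useful)
_, _, f1_useful = precision_recall_f1(y_true, y_pred_useful)

print(f"all-zero model: accuracy={acc_all_zero:.3f}, F1={f1_all_zero:.3f}")
print(f"useful model:   accuracy={acc_useful:.3f}, F1={f1_useful:.3f}")
```

The all-zero model wins on accuracy but scores an F1 of 0, while the second model has slightly lower accuracy and a clearly higher F1. In practice, f1_score from sklearn.metrics computes the same quantity.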
2- Resampling the training dataset
2.1 Under Sampling
Under Sampling is the removal of some of the majority class’s observations in order to nearly balance the majority and minority classes or to equalize them.
A balanced new dataset can be produced for further modeling by keeping all samples in the minority or rare class and randomly choosing an equal number of observations from the majority or plentiful class.
But undersampling has the disadvantage that we are discarding potentially useful data.
Implementation of the undersampling algorithm
The first step is to download the dataset from this link. Next, import all the libraries needed for this task.
    # used to load the CSV files
    import pandas as pd
    # used to plot visualizations like corr plots, bar charts, etc.
    import seaborn as sns
    import matplotlib.pyplot as plt
The next step is to load the dataset. You can do this with the pandas read_csv function. In this CSV dataset the delimiter is “;”.
    # loading the dyslexia dataset
    data = pd.read_csv('Dyt-desktop.csv', sep=";")
    data.head()
The third step is to plot the target column with the help of the seaborn library. The count plot helps us to know whether the dataset is imbalanced or not.
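As a rough sketch of this check, the class counts can also be inspected directly with pandas. The small DataFrame below is a hypothetical stand-in for the loaded dataset: only the target column is modeled, and the counts are borrowed from the illustration at the start of this post, not taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset (target column only).
data = pd.DataFrame({"Dyslexia": ["No"] * 3252 + ["Yes"] * 392})

# Count how many rows fall into each class of the target column.
counts = data["Dyslexia"].value_counts()
ratio = counts.max() / counts.min()

print(counts)
print(f"imbalance ratio: {ratio:.1f}:1")
```

With seaborn and matplotlib installed, sns.countplot(x="Dyslexia", data=data) followed by plt.show() draws the same counts as a bar chart.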
From the above plot, you can observe that the dyslexia dataset is imbalanced. After that, the categorical columns are encoded as integers with the help of the cat.codes attribute. The next step is to specify the inputs and the target variable.
    # encoding the categorical columns
    for col in data.columns:
        if data[col].dtype == 'object':
            data[col] = data[col].astype('category')
            data[col] = data[col].cat.codes

    # specifying the inputs and the target variable
    y = data['Dyslexia']
    X = data.drop(labels='Dyslexia', axis=1)
Next, I will implement undersampling with the help of the RandomUnderSampler class. It randomly selects a subset of the majority class to balance the class distribution.
    from imblearn.under_sampling import RandomUnderSampler

    # Create the undersampler object
    undersampler = RandomUnderSampler()
    # Fit and transform the data
    X_undersampled, y_undersampled = undersampler.fit_resample(X, y)
The fit_resample method fits the undersampler to the data and returns the undersampled inputs and target variable.
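To show what random undersampling does conceptually, here is a minimal NumPy sketch. It is not imblearn's implementation; the function name random_undersample and the toy data are assumptions made for illustration.

```python
import numpy as np


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class is as small as the smallest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        # Sample without replacement so no row is kept twice.
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    keep = np.sort(np.concatenate(keep))
    return X[keep], y[keep]


# Toy data (hypothetical): 90 majority rows and 10 minority rows.
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.array([0] * 90 + [1] * 10)

X_res, y_res = random_undersample(X_toy, y_toy)
print("before:", np.bincount(y_toy))
print("after: ", np.bincount(y_res))
```

All minority rows survive, while 80 of the 90 majority rows are thrown away, which is exactly the information loss mentioned above.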
2.2 Over Sampling
Oversampling is the reverse of undersampling. It is used when the available data is insufficient, and it balances the dataset by increasing the number of samples in the rare class.
New samples are generated by repetition, bootstrapping, data augmentation, or synthetic techniques such as SMOTE (Synthetic Minority Oversampling Technique).
Oversampling has risks, including the potential for overfitting and inadequate generalization to your testing data.
Implementation of the Oversampling algorithm
Oversampling is performed with the help of the RandomOverSampler class, which randomly selects samples from the minority class to balance the class distribution.
    from imblearn.over_sampling import RandomOverSampler

    # Create the oversampler object
    oversampler = RandomOverSampler()
    # Fit and transform the data
    X_oversampled, y_oversampled = oversampler.fit_resample(X, y)
Similarly, there are various other algorithms, such as SMOTE, ADASYN, BorderlineSMOTE, and SVMSMOTE, to implement the oversampling technique.
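The core idea behind SMOTE, creating a synthetic point somewhere on the line between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of NumPy. This is a simplified illustration under my own assumptions (function name, k=3, toy points), not the imblearn implementation.

```python
import numpy as np


def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))
        nb = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic


# Toy minority class (hypothetical values).
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_points = smote_like_oversample(X_min, n_new=10)
print(new_points.round(2))
```

Because every new point lies between two existing minority points, the synthetic samples stay inside the region the minority class already occupies. With imblearn, the same fit_resample interface shown earlier applies, e.g. SMOTE().fit_resample(X, y).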
3- Using K-Fold cross-validation
K-fold cross-validation makes sure that every observation from the original dataset has a chance of appearing in both the training and the test sets, which gives a more reliable estimate of how well the model predicts. It is crucial to emphasize that cross-validation must be applied properly when oversampling is used to address imbalance.
Keep in mind that oversampling takes the observed rare samples and uses bootstrapping to create new random data from them based on a distribution function. If cross-validation is applied after oversampling, copies of the same rare samples can land in both the training and the test folds, and we are essentially overfitting our model to a specific bootstrapping result. Therefore, the data must always be split into folds before oversampling, just as it must be before feature selection, and only the training folds should be resampled. Repeating the resampling inside each fold brings randomness into the dataset and helps prevent overfitting.
    # importing required libraries
    from sklearn.model_selection import KFold
    from imblearn.over_sampling import RandomOverSampler

    # Define the number of folds
    k = 5
    # Create the KFold object
    kf = KFold(n_splits=k, shuffle=True, random_state=0)

    # Loop through the folds
    for train_index, test_index in kf.split(X):
        # Split the data into training and test sets
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # Handle data imbalance by oversampling the minority class,
        # using the training fold only so the test fold stays untouched
        ros = RandomOverSampler(sampling_strategy='minority')
        X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
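The loop above stops after resampling. A fuller sketch, which also trains and scores a model in each fold, could look like the following. Several things here are my own assumptions rather than part of the original walkthrough: the data is synthetic, StratifiedKFold is swapped in for KFold so that every fold keeps the minority share, the oversampling is done by a small hand-written helper (oversample_minority), and logistic regression is just a placeholder model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def oversample_minority(X, y, rng):
    """Duplicate random minority rows until both classes have equal size
    (assumes binary labels 0 and 1)."""
    counts = np.bincount(y)
    minority = int(np.argmin(counts))
    n_extra = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


# Synthetic, hypothetical data with roughly 10% positives.
X_cv, y_cv = make_classification(n_samples=500, n_features=8,
                                 weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in skf.split(X_cv, y_cv):
    # Resample the training fold only; the test fold keeps its original skew.
    X_tr, y_tr = oversample_minority(X_cv[train_idx], y_cv[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores.append(f1_score(y_cv[test_idx], model.predict(X_cv[test_idx])))

print(f"mean F1 over {len(scores)} folds: {np.mean(scores):.3f}")
```

Scoring on the untouched test fold is what keeps the estimate honest: the model never sees duplicates of the samples it is evaluated on.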
4- Near Miss Algorithm
An approach called “near-miss” can assist in balancing an unbalanced dataset. This technique falls under the category of undersampling algorithms. Rather than dropping samples purely at random, it looks at the class distribution and uses the distances between points of the larger class and points of the smaller class to decide which samples of the larger class to keep. Only the points of the larger class that lie closest to the smaller class are retained; the rest are removed to even out the class distribution.
The steps of the near-miss algorithm are as follows:
- For the process of undersampling, the algorithm calculates the distance between all points in the larger class and the points in the smaller class.
- The data points of the larger class that have the minimum distance to the smaller class are selected and kept; the remaining points of the larger class are deleted.
- If the smaller class has p instances and the q nearest instances of the larger class are kept for each of them, the procedure yields p*q elements of the larger class.
    from imblearn.under_sampling import NearMiss

    # fit and transform the data
    X_undersampled, y_undersampled = NearMiss().fit_resample(X, y)
Similarly, there are various other algorithms, such as TomekLinks, CondensedNearestNeighbour, and OneSidedSelection, to implement the undersampling technique.
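The near-miss steps described above can be sketched with NumPy as follows. This is a simplified take on the first variant of the algorithm (often called NearMiss-1), written under my own assumptions: the function name near_miss_1 is hypothetical, and the sketch keeps exactly as many majority points as there are minority points.

```python
import numpy as np


def near_miss_1(X_maj, X_min, k=2):
    """Keep the len(X_min) majority points whose mean distance to their
    k nearest minority points is smallest."""
    # Distance from every majority point to every minority point.
    diff = X_maj[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # shape (n_maj, n_min)
    # Mean distance to the k closest minority points, per majority point.
    mean_dist = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    keep = np.argsort(mean_dist)[:len(X_min)]
    return X_maj[keep]


# Toy data (hypothetical): 3 minority points near the origin and
# 10 majority points moving away from them along the diagonal.
X_min = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
X_maj = np.array([[float(i), float(i)] for i in range(1, 11)])

kept = near_miss_1(X_maj, X_min)
print(kept)
```

Only the three majority points closest to the minority class survive, so the resulting dataset is balanced and concentrated around the class boundary.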
5- Penalizing algorithms
Penalized classification imposes an additional cost on the model for misclassifying the minority class during training. As a consequence of these penalties, the model may pay more attention to the minority class.
How class penalties or weights are handled is often specific to the learning algorithm. Some algorithms have penalized variants, including penalized-SVM and penalized-LDA.
Penalization is a good option if you are required to use a specific algorithm and are unable to resample. It provides yet another way of “balancing” the classes. Setting up the penalty matrix can be challenging, however.
The SVM method finds the hyperplane decision boundary that most effectively divides the instances into two groups. The split is softened by a margin that permits some points to be incorrectly classified. On unbalanced datasets, this margin favors the majority class by default; however, it can be modified through hyperparameters to take the importance of each class into account, which can significantly improve the algorithm’s performance on certain datasets.
- Weighted SVM, also known as cost-sensitive or penalized SVM, is a variant of SVM that weights the margin in proportion to the importance of each class.
- Penalized LDA stands for penalized linear discriminant analysis. LDA projects features from a higher-dimensional space onto a lower-dimensional space in an attempt to avoid the curse of dimensionality and to reduce resource and dimensional costs.
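As a minimal sketch of a weighted SVM, scikit-learn's SVC accepts a class_weight argument. The data below is synthetic and hypothetical; the "balanced" setting weights each class by n_samples / (n_classes * class_count), which compute_class_weight lets us inspect directly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, hypothetical data with roughly 10% positives.
X_svm, y_svm = make_classification(n_samples=400, n_features=6,
                                   weights=[0.9, 0.1], random_state=0)

# Inspect the weights that class_weight="balanced" stands for.
classes = np.unique(y_svm)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_svm)
print(dict(zip(classes.tolist(), weights.round(2).tolist())))

# Same model with and without the class penalty.
plain_svm = SVC(kernel="linear").fit(X_svm, y_svm)
weighted_svm = SVC(kernel="linear", class_weight="balanced").fit(X_svm, y_svm)

print("predicted positives, plain:   ", int(plain_svm.predict(X_svm).sum()))
print("predicted positives, weighted:", int(weighted_svm.predict(X_svm).sum()))
```

The minority class gets the larger weight, so mistakes on it cost more during training. Instead of "balanced", a dictionary such as {0: 1, 1: 10} can be passed when the penalties should be set by hand.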
6- Ensemble different models together
For example, you might keep all samples of the minority class, randomly split the majority class into n subsets so that no single subset biases the result, and train n different models, each on the full minority class plus one of the majority subsets. Each model therefore sees a roughly balanced dataset, and combining the predictions of the n models gives a final prediction that still draws on all of the majority data.
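One way to read this idea in code: keep every minority sample, split the majority class into n random chunks, train one model per chunk, and average the predictions. The sketch below follows that reading. The data is synthetic, and logistic regression and the choice of n_models = 5 are placeholders I picked for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, hypothetical data with roughly 10% positives.
X_ens, y_ens = make_classification(n_samples=1000, n_features=8,
                                   weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)

minority_idx = np.flatnonzero(y_ens == 1)
majority_idx = rng.permutation(np.flatnonzero(y_ens == 0))

n_models = 5
models = []
for chunk in np.array_split(majority_idx, n_models):
    # Each model sees all minority rows plus one slice of the majority rows.
    idx = np.concatenate([minority_idx, chunk])
    models.append(LogisticRegression(max_iter=1000).fit(X_ens[idx], y_ens[idx]))

# Average the predicted probabilities of the models and threshold at 0.5.
avg_proba = np.mean([m.predict_proba(X_ens)[:, 1] for m in models], axis=0)
y_pred = (avg_proba >= 0.5).astype(int)
print("predicted positives:", int(y_pred.sum()), "of", len(y_pred))
```

Unlike plain undersampling, no majority sample is discarded: each one is used by exactly one of the models.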
Many ensemble algorithms, such as random forest, XGBoost, and LightGBM, have specific parameters for dealing with an imbalanced dataset.
For example, there is the class_weight parameter in a random forest, scale_pos_weight in XGBoost, and is_unbalance or scale_pos_weight in LightGBM. A focal loss can also be used with LightGBM, but it has to be supplied as a custom objective.
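A short sketch of how these parameters are typically set, using scikit-learn's random forest on synthetic, hypothetical data. XGBoost is not imported here; the ratio of negative to positive samples computed below is the value commonly suggested as a starting point for its scale_pos_weight.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, hypothetical data with roughly 5% positives.
X_rf, y_rf = make_classification(n_samples=600, n_features=8,
                                 weights=[0.95, 0.05], random_state=0)

# class_weight="balanced" reweights classes inversely to their frequency.
forest = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                random_state=0)
forest.fit(X_rf, y_rf)

# Ratio of negatives to positives, a common starting value for
# XGBoost's scale_pos_weight parameter.
n_neg, n_pos = np.bincount(y_rf)
scale_pos_weight = n_neg / n_pos
print(f"negatives={n_neg}, positives={n_pos}, "
      f"scale_pos_weight ~ {scale_pos_weight:.1f}")
```

With XGBoost installed, the computed value could be passed as XGBClassifier(scale_pos_weight=scale_pos_weight).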
7- Clustering Algorithm
Rather than randomly selecting samples from the common class, we could use a clustering technique, which seems to be more reliable. Instead of relying on random samples to cover the diversity of the training data, cluster the abundant class into n groups, where n is the number of cases in the rare class. Only the medoid of each group (its most central point, found from the distances between the points in the group) is kept. The model is then trained using only the medoids and the members of the uncommon class.
Common clustering algorithms are K-means clustering and DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
You could follow this post by Milecia McGregor to get a more detailed understanding of clustering, its types, and its implementations.
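As a rough sketch of cluster-based undersampling, the majority class can be replaced by K-means cluster centers. Note the swap: K-means returns centroids (cluster means), not medoids (actual data points), so this is an approximation of the description above. The data is synthetic and hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: 300 majority points and 20 minority points.
X_major = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X_minor = rng.normal(loc=3.0, scale=0.5, size=(20, 2))

# One cluster per minority sample, so both classes end up the same size.
kmeans = KMeans(n_clusters=len(X_minor), n_init=10, random_state=0).fit(X_major)
X_major_reduced = kmeans.cluster_centers_

# Balanced training set: cluster centers (label 0) plus all minority rows (label 1).
X_balanced = np.vstack([X_major_reduced, X_minor])
y_balanced = np.array([0] * len(X_major_reduced) + [1] * len(X_minor))
print(np.bincount(y_balanced))
```

The imblearn library offers a ready-made version of this idea as ClusterCentroids in imblearn.under_sampling, with the same fit_resample interface used earlier.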
8- Experiment with different models that best suit your dataset
While trying a range of techniques is a good idea for any machine learning task, it can be extremely helpful for datasets that are unbalanced. For instance, tree ensembles (Random Forests, Gradient Boosted Trees (XGBoost), etc.) in contemporary machine learning virtually always outperform single decision trees; this might be employed in the case of data imbalance.
We might utilize data augmentation techniques like geometric modifications (rotations, flips, cropping, translations, image scaling), color transformations, combining images, kernel filters like blurring and sharpening, etc., to address the class imbalance in image data.
To understand the implementations of the above techniques, you could follow the Kaggle notebook by Janio Martinez Bachmann.
Conclusion
The approaches we covered above each have their own advantages and disadvantages, so don’t limit yourself to these methods alone when handling imbalanced datasets. The right technique depends on the user and on what that person expects from the data; don’t be afraid to experiment with and explore various techniques. Data scientists encounter data imbalance frequently, most often in credit fraud detection, spam detection, outlier recognition, anomaly detection, and many other applications. I hope you understood a few of the techniques I discussed above. Thank you for reading this far. I sincerely value your constructive feedback and suggestions as I strive to get better.