I have gathered a training dataset for my network model, but unfortunately it is critically imbalanced. Is there a way to balance the data using the Keras library, without having to balance it manually? The dataset has two object classes: one has 2,000 samples and the other has 15,000. I don't want to use upsampling or downsampling because I want to avoid problems with overfitting or underfitting.
There are a number of ways and best practices to deal with so-called imbalanced datasets.
Upsample the minority class (Drawback: possibly overfitting of minority class)
Downsample the majority class (Drawback: loss of training data, information loss)
There are a number of techniques you can use for this; some even offer ways to overcome these drawbacks (e.g. synthetic sampling). Have a look at the imbalanced-learn package for an easy-to-use implementation.
Another option is to weight the loss of your model in order to tell the model that it should "pay more attention" to specific classes. This can be done by passing the optional class_weight argument to the Keras fit function. The class weights can be computed with sklearn's compute_class_weight function.
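As a rough illustration of the class-weight idea, here is a minimal sketch in plain Python that reproduces the "balanced" heuristic, n_samples / (n_classes * class_count), which is the formula sklearn's compute_class_weight uses for class_weight='balanced'. The helper name balanced_class_weights and the toy labels are assumptions made for this example; the Keras call is only shown as a comment because no model is defined here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels matching the question: 15000 samples of class 0, 2000 of class 1.
y_train = [0] * 15000 + [1] * 2000
class_weight = balanced_class_weights(y_train)
print(class_weight)

# With a compiled Keras model (not defined here), the dictionary would be passed as:
# model.fit(x_train, y_train, class_weight=class_weight, epochs=10)
```

With these counts the minority class ends up with a weight of 4.25 and the majority class with a weight of roughly 0.57, so each minority sample contributes about 7.5 times as much to the loss.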
I am performing multi-class text classification using BERT in Python. The dataset that I am using for retraining my model is highly imbalanced. I understand that class imbalance can lead to a poor model and that one should balance the training set by undersampling, oversampling, etc. before model training.
However, it is also a fact that the distribution of the training set should be similar to the distribution of the production data.
Now, if I am sure that the data thrown at me in the production environment will also be imbalanced, i.e., the samples to be classified are more likely to belong to one or a few classes than to the others, should I balance my training set?
OR
Should I keep the training set as it is as I know that the distribution of the training set is similar to the distribution of data that I will encounter in the production?
Please give me some ideas, or provide some blogs or papers for understanding this problem.
Class imbalance is not a problem by itself; the problem is that too few minority-class samples make it harder to describe that class's statistical distribution, which is especially true for high-dimensional data (and BERT embeddings have 768 dimensions IIRC).
Additionally, logistic regression tends to underestimate the probability of rare events (see e.g. https://gking.harvard.edu/files/gking/files/0s.pdf for the mechanics), which can be offset by selecting a classification threshold as well as by resampling.
There are quite a few discussions on CrossValidated regarding this (like https://stats.stackexchange.com/questions/357466). TL;DR:
while too few samples of a class may degrade prediction quality, resampling is not guaranteed to give an overall improvement; at the very least, there is no universal recipe for a perfect resampling proportion, so you'll have to test it out for yourself;
however, real-life tasks often weigh classification errors unequally: resampling may help improve the metrics of certain classes at the cost of overall accuracy. The same applies to classification threshold selection, however.
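The threshold trade-off mentioned above can be sketched with a few made-up numbers. The labels, scores and the helper recall_and_accuracy are all invented for illustration; they are not taken from any real model.

```python
def recall_and_accuracy(y_true, scores, threshold):
    """Rare-class recall and overall accuracy when predicting 1 if score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    true_pos = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    positives = sum(y_true)
    correct = sum(1 for t, p in zip(y_true, preds) if t == p)
    return true_pos / positives, correct / len(y_true)


# Made-up labels (1 = rare class) and made-up predicted probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.42, 0.45, 0.40, 0.60]

for threshold in (0.5, 0.4):
    recall, accuracy = recall_and_accuracy(y_true, scores, threshold)
    print(f"threshold={threshold}: rare-class recall={recall:.2f}, accuracy={accuracy:.2f}")
```

In this toy case, lowering the threshold from 0.5 to 0.4 raises rare-class recall from 0.50 to 1.00 while overall accuracy drops from 0.90 to 0.80, which is the kind of exchange you would have to evaluate on your own data.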
This depends on the goal of your classification:
Do you want a high probability that a random sample is classified correctly? -> Do not balance your training set.
Do you want a high probability that a random sample from a rare class is classified correctly? -> balance your training set or apply weighting during training increasing the weights for rare classes.
For example in web applications seen by clients, it is important that most samples are classified correctly, disregarding rare classes, whereas in the case of anomaly detection/classification, it is very important that rare classes are classified correctly.
Keep in mind that a model trained on a highly imbalanced dataset tends to always predict the majority class; therefore increasing the number or the weights of rare-class samples can be a good idea, even without perfectly balancing the training set.
P(label | sample) is not the same as P(label).
P(label | sample) is your training goal.
In the case of gradient-based learning with mini-batches on models with a large parameter space, rare labels have a small footprint on the model training. So, your model ends up fitting P(label).
To avoid fitting to P(label), you can balance batches.
Over all the batches of an epoch, the data then looks as if the minority class had been up-sampled. The goal is to get a loss function whose gradients move the parameters toward a better classification goal.
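One possible way to balance batches is sketched below in plain Python. The function balanced_batches, its parameters and the toy data are assumptions for illustration only; deep learning frameworks ship their own sampling utilities, which are not shown here.

```python
import random


def balanced_batches(samples, labels, per_class, n_batches, seed=0):
    """Yield batches holding `per_class` examples of every class.

    Classes are sampled with replacement, so rare classes are repeated
    across batches, which is effectively up-sampling over an epoch.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    for _ in range(n_batches):
        batch = []
        for label, pool in by_class.items():
            batch.extend((s, label) for s in rng.choices(pool, k=per_class))
        rng.shuffle(batch)
        yield batch


# Toy data: 90 examples of class "a" and 10 of class "b".
samples = list(range(100))
labels = ["a"] * 90 + ["b"] * 10
batches = list(balanced_batches(samples, labels, per_class=4, n_batches=5))
first_batch_labels = sorted(label for _, label in batches[0])
print(first_batch_labels)
```

Every batch contains four examples of each class regardless of how rare the class is in the full dataset, so the gradient of each step sees both classes equally often.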
UPDATE
I don't have any proof to show this here. It is perhaps not an accurate statement. With enough training data (with respect to the complexity of features) and enough training steps you may not need balancing. But most language tasks are quite complex and there is not enough data for training. That was the situation I imagined in the statements above.
I am dealing with a classification problem with 3 classes [0,1,2], and imbalanced class distribution as shown below.
I want to apply XGBClassifier (in Python) to this classification problem, but the model does not respond to class_weight adjustments and skews towards the majority class 0, and ignores the minority classes 1,2. Which hyperparameters other than class_weight can help me?
I tried 1) computing class weights using sklearn compute_class_weight; 2) setting weights according to the relative frequency of the classes; 3) and also manually adjusting classes with extreme values to see if any change happens at all, such as {0:0.5,1:100,2:200}. But in any case, it does not help the classifier to take the minority classes into account.
Observations:
I can handle the problem in the binary case: if I turn the problem into a binary classification by merging classes [1,2], then I can get the classifier to work properly by adjusting scale_pos_weight (even in this case class_weight alone does not help).
But scale_pos_weight, as far as I know, only works for binary classification. Is there an analogue of this parameter for multi-class problems?
Using RandomForestClassifier instead of XGBClassifier, I can handle the problem by setting class_weight='balanced_subsample' and tuning max_leaf_nodes. But, for some reason, this approach does not work for XGBClassifier.
Remark: I know about balancing techniques, such as over/undersampling, or SMOTE. But I want to avoid them as much as possible, and would prefer a solution based on hyperparameter tuning of the model if possible.
My observation above shows that this can work for the binary case.
The sample_weight parameter is useful for handling imbalanced data when training with XGBoost. You can compute the sample weights with compute_sample_weight() from the sklearn library.
This code should work for multiclass data:
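from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# One weight per row, inversely proportional to the frequency of its class
sample_weights = compute_sample_weight(
    class_weight='balanced',
    y=train_df['class']  # provide your own target column
)
xgb_classifier = XGBClassifier()
# X and y are your training features and labels (y is the same target column)
xgb_classifier.fit(X, y, sample_weight=sample_weights)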
You can use sample_weight as @Prakash Dahal suggested, but compute your own weights. I found that different weights made a dramatic difference (I have 12 classes and very imbalanced data).
If you compute your own weights, you need to assign the relevant weight to each entry and pass the param to the classifier in the same way:
xgb_class.fit(X_train, y_train, sample_weight=weights)
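For illustration, assigning "the relevant weight to each entry" could look like the sketch below. The weight values and the toy labels are placeholders invented for this example, not recommended settings; in practice you would tune them for your own classes.

```python
# Hypothetical hand-tuned weights per class; the values are placeholders.
class_weights = {0: 0.5, 1: 3.0, 2: 6.0}

# Toy training labels standing in for y_train.
y_train = [0, 0, 1, 0, 2, 0, 1]

# One weight per training row, looked up from the row's class.
weights = [class_weights[label] for label in y_train]
print(weights)

# With an XGBClassifier instance and training features (not defined here):
# xgb_class.fit(X_train, y_train, sample_weight=weights)
```

The resulting list has exactly one entry per training row, in the same order as y_train, which is what the sample_weight argument expects.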
I have a really large dataset with 60 million rows and 11 features.
It is a highly imbalanced dataset, 20:1 (signal:background).
As I saw, there are two ways to tackle this problem:
First: Under-sampling/Oversampling.
I have two problems/questions in this way.
If I under-sample before the train/test split, I lose a lot of data.
More importantly, if I train a model on a balanced dataset, I lose information about the frequency of my signal data (say, the frequency of benign tumors relative to malignant ones). Because the model is both trained and evaluated on balanced data, it will appear to perform well. But if I later apply the model to new data, it will perform badly because real data is imbalanced.
If I under-sample after the train/test split, my model will underfit because it is trained on balanced data but validated/tested on imbalanced data.
Second - class weight penalty
Can I use a class weight penalty for XGB, Random Forest, and Logistic Regression?
So, everybody, I am looking for an explanation and idea for a way of work on this kind of problem.
Thank you in advance, I will appreciate any of your help.
I suggest this short paper co-authored by Breiman (the author of Random Forest):
Using Random Forest to Learn Imbalanced Data
The suggested methods are weighted RF, where you compute the splits using weighted Gini (or Entropy, which in my opinion is better when weighted), and Balanced Random Forest, where you try to balance the classes during the bootstrap.
Both methods can be implemented also for boosted trees!
One suggested approach is the Synthetic Minority Over-sampling Technique (SMOTE), which attempts to balance the dataset by creating synthetic instances. You can then train any classification algorithm on the balanced dataset.
For comparing multiple models, the Area Under the ROC Curve (AUC score) can be used to determine which model is superior.
This guide will be able to give you some ideas on different methodologies you can use and compare to resolve the imbalance problem.
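To make the AUC comparison mentioned above concrete, here is a small sketch that computes AUC directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score. The function roc_auc and both score lists are made up for illustration; for real datasets a library implementation would be far more efficient than this quadratic loop.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = minority class) and made-up scores from two hypothetical models.
y_true = [0, 0, 0, 0, 1, 1]
model_a_scores = [0.1, 0.2, 0.3, 0.8, 0.7, 0.9]
model_b_scores = [0.1, 0.6, 0.7, 0.8, 0.5, 0.9]

auc_a = roc_auc(y_true, model_a_scores)
auc_b = roc_auc(y_true, model_b_scores)
print(auc_a, auc_b)
```

Here the first model ranks 7 of the 8 positive/negative pairs correctly (AUC 0.875) and the second only 5 of 8 (AUC 0.625), so the first would be preferred under this criterion.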
The above issue is pretty common when dealing with medical datasets and other types of fault detection where one of the classes (ill-effect) is always under-represented.
The best way to tackle this is to generate folds and apply cross-validation. The folds should be generated so that the classes are balanced within each fold. In your case this creates 20 folds, each containing the same under-represented class and a different fraction of the over-represented class.
Generating balanced folds and using cross-validation also results in a better generalised and more robust model. In your case, 20 folds might seem too harsh, so you could instead create 10 folds, each with a 2:1 class ratio.
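One way to read the fold construction described above is sketched below: every fold reuses the whole under-represented class and receives a different, disjoint slice of the over-represented class. The function balanced_folds and the toy rows are assumptions for illustration, scaled down from the 20:1 ratio in the question.

```python
import random


def balanced_folds(majority, minority, n_folds, seed=0):
    """Pair the full minority class with a different slice of the majority class."""
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)
    folds = []
    for i in range(n_folds):
        majority_slice = shuffled[i::n_folds]  # disjoint slices that cover the class
        folds.append((majority_slice, list(minority)))
    return folds


# Toy stand-in for a 20:1 dataset: 200 majority rows, 10 minority rows.
majority_rows = [("maj", i) for i in range(200)]
minority_rows = [("min", i) for i in range(10)]

folds = balanced_folds(majority_rows, minority_rows, n_folds=20)
print([(len(maj), len(mino)) for maj, mino in folds[:3]])
```

With n_folds=20 each fold is balanced 1:1; calling the same function with n_folds=10 gives folds with a 2:1 ratio, matching the softer variant suggested above.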
I am training on three classes with one dominant majority class of about 80% and the other two even. I am able to train a model using undersampling / oversampling techniques to get validation accuracy of 67% which would already be quite good for my purposes. The issue is that this performance is only present on the balanced validation data, once I test on out of sample with imbalanced data it seems to have picked up a bias towards even class predictions. I have also tried using weighted loss functions but also no joy on out of sample. Is there a good way to ensure the validation performance translates over? I have tried using auroc to validate the model successfully but again the strong performance is only present in the balanced validation data.
Methods of resampling I have tried: SMOTE oversampling and random undersampling.
If I understood correctly, you may be looking for better performance measurement and better classification results on imbalanced datasets.
Measuring performance by accuracy alone on imbalanced datasets usually gives a high but misleading number, and the minority class could be ignored entirely. Instead, use the F1-score and precision/recall.
For my project work on imbalanced datasets, I have used SMOTE sampling methods along with K-Fold cross-validation.
Cross-validation helps check that the model picks up the correct patterns from the data and is not picking up too much noise.
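A small worked example shows why accuracy can mislead here. The labels below are invented, and precision_recall_f1 is a hand-written helper for illustration rather than a library function.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels: 80% majority class 0, 20% minority class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
minority_metrics = precision_recall_f1(y_true, y_pred, positive=1)
print(accuracy, minority_metrics)
```

The always-majority predictor reaches 80% accuracy, yet its precision, recall and F1 on the minority class are all zero, which is exactly the failure that accuracy hides.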
References :
What is the correct procedure to split the Data sets for classification problem?
Keras uses a class_weight parameter to deal with imbalanced datasets.
Here is what we can find in the doc:
Optional dictionary mapping class indices (integers) to a weight (float) to apply to the model's loss for the samples from this class during training. This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
Does that mean that class_weight gives each class a different weight in the training error function? Does it have an influence elsewhere? Is it really effective against generalization errors, compared with "physically" dropping instances from the most represented class?
The class_weight parameter scales the loss of each training example by the weight assigned to its class; the weights are typically chosen in proportion to each class's underrepresentation in the training set. This counteracts class imbalance during training and can make your network more robust to generalization error.
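The effect on the loss can be sketched by hand. The function below multiplies each sample's cross-entropy by the weight of its class and then averages over the batch; that averaging is an assumption made for this illustration, and the exact normalization Keras applies internally may differ. The predictions are made up.

```python
import math


def weighted_cross_entropy(y_true, probs, class_weight):
    """Mean over samples of class_weight[label] * -log(p_label)."""
    losses = [
        class_weight[label] * -math.log(p[label])
        for label, p in zip(y_true, probs)
    ]
    return sum(losses) / len(losses)


# Toy batch: three majority samples (class 0) and one minority sample (class 1).
y_true = [0, 0, 0, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]  # made-up predictions

unweighted = weighted_cross_entropy(y_true, probs, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(y_true, probs, {0: 1.0, 1: 3.0})
print(unweighted, weighted)
```

Tripling the weight of class 1 triples the contribution of the single misclassified minority sample, so its gradient has more influence on the parameter update. Only the training loss changes; the data itself is untouched.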
I'd exercise caution when physically dropping data instances corresponding to the most represented class, however - if your network is deep and therefore has significant representational capacity, culling your dataset can lead to overfitting, and consequently poor generalization to the validation/test sets.
I would recommend using the class_weight parameter as specified in the Keras documentation. If you really are intent on dropping data instances from the most represented class, make sure you regularize your network to limit the model's effective representational capacity (e.g. add Dropout layers and/or L2 regularization).