I currently have an imbalanced dataset of over 800,000 datapoints. The imbalance is severe, as there are only 3,719 datapoints for one of the two classes. After undersampling the data using the NearMiss algorithm in Python and applying a Random Forest classifier, I am able to achieve the following results:
Accuracy: 81.4%
Precision: 82.6%
Recall: 79.4%
Specificity: 83.4%
However, when re-testing this same model on the full dataset, the confusion matrix shows a large bias towards the minority class, with a large number of false positives. Is this the correct way of testing the model after undersampling?
Undersampling from 800k records down to about 4k throws away most of the information in your data. Most of the time you do over-sampling first and under-sampling second.
There's a dedicated package for that: imblearn (imbalanced-learn). As for validation: you don't want to score resampled records, because their class ratio no longer matches the real one and the metrics will be misleading. Look closer at the averaging options of the scoring functions in sklearn, namely micro, macro, and weighted. There are also some metrics designed specifically for imbalanced data. See:
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/evaluation/plot_classification_report.html#sphx-glr-auto-examples-evaluation-plot-classification-report-py
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/evaluation/plot_metrics.html#sphx-glr-auto-examples-evaluation-plot-metrics-py
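The large number of false positives on the full dataset follows from the class ratio alone. A back-of-the-envelope sketch, assuming the recall and specificity measured on the balanced sample carry over unchanged to the full data, and assuming exactly 800,000 rows (the question only says "over 800,000"):

```python
# Precision as a function of class prevalence, for fixed recall and specificity.
# Assumption: recall and specificity from the balanced test also hold on the full data.

def precision_at_prevalence(recall, specificity, prevalence):
    true_pos = recall * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

recall, specificity = 0.794, 0.834           # values reported in the question
balanced = precision_at_prevalence(recall, specificity, 0.5)
real = precision_at_prevalence(recall, specificity, 3719 / 800_000)

print(f"precision on balanced data: {balanced:.3f}")
print(f"precision on the full data: {real:.3f}")
```

On balanced data this reproduces a precision of about 83%, close to the reported value. At the real prevalence the same classifier lands at roughly 2% precision, because the 16.6% false-positive rate is applied to a majority class that is more than 200 times larger.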
I am performing multi-class text classification using BERT in python. The dataset that I am using for retraining my model is highly imbalanced. Now, I am very clear that the class imbalance leads to a poor model and one should balance the training set by undersampling, oversampling, etc. before model training.
However, it is also a fact that the distribution of the training set should be similar to the distribution of the production data.
Now, if I am sure that the data thrown at me in the production environment will also be imbalanced, i.e., the samples to be classified will belong to some classes far more often than to others, should I balance my training set?
OR
Should I keep the training set as it is as I know that the distribution of the training set is similar to the distribution of data that I will encounter in the production?
Please give me some ideas, or provide some blogs or papers for understanding this problem.
Class imbalance is not a problem by itself; the problem is that too few minority-class samples make it harder to estimate that class's statistical distribution, which is especially true for high-dimensional data (and BERT embeddings have 768 dimensions IIRC).
Additionally, logistic regression tends to underestimate the probability of rare events (see e.g. https://gking.harvard.edu/files/gking/files/0s.pdf for the mechanics), which can be offset by adjusting the classification threshold or by resampling.
There are quite a few discussions on CrossValidated regarding this (like https://stats.stackexchange.com/questions/357466). TL;DR:
while having too few samples of a class may degrade prediction quality, resampling is not guaranteed to give an overall improvement; at least, there is no universal recipe for a perfect resampling proportion, so you'll have to test it out for yourself;
however, real-life tasks often weigh classification errors unequally: resampling may help improve a certain class's metrics at the cost of overall accuracy. The same applies to classification threshold selection, however.
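To illustrate the threshold point: when the two kinds of error have different costs, the usual decision-theoretic rule is to predict the positive class once the predicted probability exceeds cost_fp / (cost_fp + cost_fn). A small sketch; the cost values are made up, the function names are hypothetical, and it assumes the model's probabilities are reasonably calibrated:

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting 'positive' has the lower expected cost.

    Predicting positive costs (1 - p) * cost_fp in expectation, predicting
    negative costs p * cost_fn; the two are equal at the returned value.
    """
    return cost_fp / (cost_fp + cost_fn)

def decide(prob_positive, cost_fp, cost_fn):
    return int(prob_positive > cost_threshold(cost_fp, cost_fn))

# Hypothetical costs: missing a rare positive is 9x worse than a false alarm.
threshold = cost_threshold(cost_fp=1.0, cost_fn=9.0)
print(threshold)
print(decide(0.2, cost_fp=1.0, cost_fn=9.0))   # flagged although p < 0.5
print(decide(0.2, cost_fp=1.0, cost_fn=1.0))   # equal costs give the usual 0.5 cut-off
```

Moving the threshold this way changes only the decision rule, not the trained model, so it can be tried without retraining.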
This depends on the goal of your classification:
Do you want a high probability that a random sample is classified correctly? -> Do not balance your training set.
Do you want a high probability that a random sample from a rare class is classified correctly? -> balance your training set or apply weighting during training increasing the weights for rare classes.
For example in web applications seen by clients, it is important that most samples are classified correctly, disregarding rare classes, whereas in the case of anomaly detection/classification, it is very important that rare classes are classified correctly.
Keep in mind that a model trained on a highly imbalanced dataset tends to always predict the majority class, so increasing the number or the weights of rare-class samples can be a good idea, even without perfectly balancing the training set.
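As a sketch of what "increasing the weights for rare classes" can look like, the helper below computes inverse-frequency weights. The formula is the same heuristic scikit-learn documents for class_weight="balanced"; the function name and the toy labels are made up for illustration:

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight[c] = n_samples / (n_classes * count[c])"""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

labels = ["common"] * 90 + ["rare"] * 10     # toy labels, 9:1 imbalance
weights = balanced_class_weights(labels)
print(weights)
```

Each rare sample then counts nine times as much as a common one in the loss. Scaling these weights down towards 1 gives the partial rebalancing mentioned above.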
P(label | sample) is not the same as P(label).
P(label | sample) is your training goal.
In the case of gradient-based learning with mini-batches on models with a large parameter space, rare labels have a small footprint on the model training. So your model ends up fitting P(label).
To avoid fitting to P(label), you can balance batches.
Over all the batches of an epoch, the data then looks as if the minority class had been up-sampled. The goal is to get a loss function whose gradients move the parameters toward a better classification.
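One way to build balanced batches is to draw the same number of indices from every class for each batch. This is only a minimal sketch with a hypothetical helper name; deep learning frameworks ship their own samplers, which are not shown here:

```python
import random
from collections import defaultdict

def balanced_batches(labels, batch_size, n_batches, seed=0):
    """Yield lists of sample indices with an equal number of indices per class.

    Indices are drawn with replacement, so rare-class samples repeat often.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    per_class = batch_size // len(by_class)
    for _ in range(n_batches):
        batch = []
        for indices in by_class.values():
            batch.extend(rng.choices(indices, k=per_class))
        rng.shuffle(batch)
        yield batch

labels = [0] * 95 + [1] * 5                  # toy 19:1 imbalance
batches = list(balanced_batches(labels, batch_size=8, n_batches=3))
for batch in batches:
    print([labels[i] for i in batch])
```

Every printed batch contains four samples of each class, even though class 1 makes up only 5% of the data.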
UPDATE
I don't have any proof to show this here. It is perhaps not an accurate statement. With enough training data (with respect to the complexity of features) and enough training steps you may not need balancing. But most language tasks are quite complex and there is not enough data for training. That was the situation I imagined in the statements above.
I have a really large dataset with 60 million rows and 11 features.
It is a highly imbalanced dataset, 20:1 (signal:background).
As I saw, there are two ways to tackle this problem:
First: Under-sampling/Oversampling.
I have two problems/questions in this way.
If I undersample before the train/test split, I lose a lot of data.
More importantly, if I train a model on a balanced dataset, I lose the information about how frequent my signal data is (say, the frequency of benign tumors relative to malignant ones). Because the model is both trained and evaluated on balanced data, it will appear to perform well. But if I later apply the model to new data, it will perform badly, because real data is imbalanced.
If I undersample after the train/test split, my model will underfit, because it will be trained on balanced data but validated/tested on imbalanced data.
Second - class weight penalty
Can I use a class weight penalty for XGB, Random Forest, and Logistic Regression?
So, everybody, I am looking for an explanation and idea for a way of work on this kind of problem.
Thank you in advance, I will appreciate any of your help.
I suggest this quick paper by Chen, Liaw, and Breiman (Breiman being the author of Random Forest):
Using Random Forest to Learn Imbalanced Data
The suggested methods are weighted RF, where you compute the splits using weighted Gini (or Entropy, which in my opinion is better when weighted), and Balanced Random Forest, where you try to balance the classes during the bootstrap.
Both methods can be implemented also for boosted trees!
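On the class weight question: in scikit-learn, both LogisticRegression and RandomForestClassifier accept a class_weight argument, and XGBoost's classifier has a scale_pos_weight parameter for the binary case. The sketch below covers only the two scikit-learn models, on synthetic stand-in data rather than the asker's 60 million rows; all sizes and settings are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 5% positives (not the asker's dataset).
X, y = make_classification(n_samples=4000, n_features=11, weights=[0.95, 0.05],
                           random_state=0)
# Split first; the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                     random_state=0),
}
recalls = {}
for name, model in models.items():
    model.fit(X_train, y_train)              # the weights only affect training
    recalls[name] = recall_score(y_test, model.predict(X_test))
    print(name, round(recalls[name], 3))
```

Because weighting leaves the data itself untouched, nothing is thrown away and the test set still reflects the real class frequencies.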
One option is the Synthetic Minority Oversampling Technique (SMOTE), which balances the dataset by creating synthetic minority-class instances. You can then train any classification algorithm on the balanced dataset.
For comparing multiple models, Area Under the ROC Curve (AUC score) can be used to determine which model is superior.
This guide will be able to give you some ideas on different methodologies you can use and compare to resolve imbalance problem.
The above issue is pretty common when dealing with medical datasets and other types of fault detection where one of the classes (ill-effect) is always under-represented.
The best way to tackle this is to generate folds and apply cross-validation. The folds should be generated so that the classes are balanced within each fold. In your case this creates 20 folds, each containing the same under-represented class samples and a different fraction of the over-represented class.
Generating balanced folds and using cross-validation also results in a better generalised and more robust model. In your case, 20 folds might seem too harsh, so you can instead create 10 folds, each with a 2:1 class ratio.
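The fold construction described above can be sketched in plain Python. The helper name and the toy indices are made up; the 20:1 ratio mirrors the question:

```python
import random

def balanced_subsets(minority_idx, majority_idx, n_subsets, seed=0):
    """Split the majority class into n_subsets disjoint chunks and
    pair each chunk with the full minority class."""
    rng = random.Random(seed)
    shuffled = list(majority_idx)
    rng.shuffle(shuffled)
    chunks = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return [sorted(minority_idx) + chunk for chunk in chunks]

# Toy 20:1 data: 10 minority rows, 200 majority rows.
minority = list(range(10))
majority = list(range(10, 210))

subsets = balanced_subsets(minority, majority, n_subsets=20)
print(len(subsets), len(subsets[0]))         # 20 subsets, 10 + 10 rows each

ten = balanced_subsets(minority, majority, n_subsets=10)
print(len(ten), len(ten[0]))                 # 10 subsets, 10 + 20 rows each (2:1)
```

Each subset can train one model. A final check on data with the original class ratio is still advisable, since every subset is balanced by construction.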
I am new to Machine Learning
I have a dataset with highly unbalanced classes (dominated by the negative class), more than 2K numeric features, and a binary target (0 or 1). I have trained a logistic regression and I am getting an accuracy of 89%, but the confusion matrix shows that the number of true positives is very low. Below are the scores of my model:
Accuracy Score : 0.8965989500114129
Precision Score : 0.3333333333333333
Recall Score : 0.029545454545454545
F1 Score : 0.05427974947807933
How can I increase my true positives? Should I be using a different classification model?
I have tried PCA and represented my data in 2 components. It increased the model accuracy to approximately 90%, but the number of true positives decreased again.
There are several ways to do this :
You can change your model and test whether it performs better or not
You can fix a different prediction threshold: here I guess you predict 0 if the output of your regression is < 0.5; you could change the 0.5 to 0.25, for example. It would increase your true positive rate, but of course at the price of some more false positives.
You can duplicate every positive example in your training set so that your classifier has the feeling that classes are actually balanced.
You could change the loss of the classifier in order to penalize more False Negatives (this is actually pretty close to duplicating your positive examples in the dataset)
I'm sure many other tricks could apply, here is just my favorite short-list.
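The effect of moving the threshold from 0.5 to 0.25 is easy to see on a handful of predictions. The labels and probabilities below are invented purely for illustration:

```python
def confusion_counts(y_true, probs, threshold):
    """Count true positives, false positives and false negatives at a threshold."""
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= threshold)
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < threshold)
    return tp, fp, fn

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1,   1,   1,   1,   0,   0,   0,   0,   0,   0]
probs  = [0.8, 0.4, 0.3, 0.1, 0.6, 0.3, 0.2, 0.2, 0.1, 0.05]

for threshold in (0.5, 0.25):
    tp, fp, fn = confusion_counts(y_true, probs, threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn}")
```

With these numbers the true positives go from 1 to 3 while the false positives go from 1 to 2. With a scikit-learn classifier, the probabilities would come from predict_proba(X)[:, 1].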
I'm assuming that your purpose is to obtain a model with good classification accuracy on some test set, regardless of the form of that model.
In that case, if you have access to the computational resources, try Gradient-Boosted Trees. That's an ensemble classifier that builds decision trees one after another, each new tree fitted to correct the errors of the trees before it, and combines their outputs to make predictions. As far as I know, it can give good results with unbalanced class counts.
scikit-learn has the class sklearn.ensemble.GradientBoostingClassifier for this. I have not used that particular one, but I use the regression version often and it seems good. I'm pretty sure MATLAB has this as a package too, if you have access.
2k features might be difficult for the scikit-learn implementation. I don't know; I've never tried.
What is the size of your dataset? How many rows are we talking about here?
Your dataset is not balanced, so it's kind of normal for a simple classification algorithm to predict the majority class most of the time and give you an accuracy of 90%. Can you collect more data that contains more positive examples?
Or just try oversampling/undersampling and see if that helps.
You can also use a penalized version of the algorithm, which imposes a penalty whenever the wrong class is predicted. That may help.
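To see why 90% accuracy says little here, it helps to look at the confusion matrix behind the scores. The counts below are not given in the question; they are reconstructed as the set of counts consistent with the four reported scores, so treat them as an inference:

```python
# Counts reconstructed from the scores in the question (an inference, not reported data).
tp, fp, fn, tn = 13, 26, 427, 3915

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
majority_baseline = (tn + fp) / total        # accuracy of always predicting 0

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
print(f"always-negative baseline accuracy={majority_baseline:.4f}")
```

Under this reconstruction about 10% of the rows are positive, so a model that always predicts 0 would score slightly higher on accuracy than the trained model while finding no positives at all.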
You can try many different solutions.
If you have quite a lot of data points, for instance 2k 1s and 20k 0s, you can simply drop the extra 0s and keep only 2k of them, then train on that. You can also train multiple models, each on a different set of 2k 0s together with the same set of 2k 1s, and make the decision based on all of the models.
You can also try adding weights at the output layer. For instance, if you have 10 times more 0s than 1s, try multiplying the prediction value for the 1s by 10.
You could probably also try increasing dropout.
And so on.
I am training on three classes with one dominant majority class of about 80% and the other two even. I am able to train a model using undersampling / oversampling techniques to get validation accuracy of 67% which would already be quite good for my purposes. The issue is that this performance is only present on the balanced validation data, once I test on out of sample with imbalanced data it seems to have picked up a bias towards even class predictions. I have also tried using weighted loss functions but also no joy on out of sample. Is there a good way to ensure the validation performance translates over? I have tried using auroc to validate the model successfully but again the strong performance is only present in the balanced validation data.
Methods of resampling I have tried: SMOTE oversampling and random undersampling.
If I understood correctly, you are looking for a way to measure performance and get better classification results on imbalanced datasets.
Measuring performance with accuracy alone on imbalanced datasets usually gives a high but misleading number, because the minority class can be ignored entirely. Use the F1-score and precision/recall instead.
For my project work on imbalanced datasets, I have used SMOTE sampling together with K-fold cross-validation.
Cross-validation helps check that the model is learning real patterns from the data rather than picking up too much noise.
References :
What is the correct procedure to split the Data sets for classification problem?
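When resampling is combined with K-fold cross-validation, a common way to keep the validation scores honest is to resample only the training part of each fold and leave the validation part imbalanced. The sketch below uses plain random oversampling as a stand-in (it is not SMOTE) and synthetic three-class data; the helper name and all sizes are arbitrary. imbalanced-learn also provides its own Pipeline for running a sampler inside cross-validation, which is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def oversample_minority(X, y, rng):
    """Randomly duplicate rows until every class has as many rows as the largest."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(parts)
    return X[keep], y[keep]

# Synthetic three-class data, roughly 80/10/10 (a stand-in, not real data).
X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                           n_classes=3, weights=[0.8, 0.1, 0.1], random_state=0)
rng = np.random.default_rng(0)
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in splitter.split(X, y):
    # Resample the training part only.
    X_train, y_train = oversample_minority(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # The validation fold keeps its original, imbalanced class ratio.
    scores.append(f1_score(y[val_idx], model.predict(X[val_idx]), average="macro"))
print(np.round(scores, 3))
```

Scores measured this way are closer to what the model will do on imbalanced out-of-sample data than scores from a balanced validation set.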
I am trying to create a binary classification model for an imbalanced dataset using Random Forest (class 0: 84K, class 1: 16K). I have tried class_weight = 'balanced', class_weight = {0:1, 1:5}, downsampling, and oversampling, but none of these seem to work. My metrics are usually in the range below:
Accuracy = 66%
Precision = 23%
Recall = 44%
I would really appreciate any help on this! Thanks
There are lots of ways to improve classifier behavior. If you think your data are balanced (or rather, that your weighting method balances them enough), then consider expanding your forest, either with deeper trees or with more trees.
Try other methods like SVM, or ANN, and see how they compare.
Try stratified sampling so that the training and test sets keep the same class ratio, and then use the balanced class weights you have already tried. If you want to improve accuracy, there are plenty of other things to try:
1) First make sure that the dataset you were given is accurate and verified.
2) You can improve accuracy by playing with the probability threshold: in binary classification, make a prediction only when the model is confident enough (say, probability > 0.7) and abstain otherwise. The drawback is that many samples get no prediction at all, but for a business setting this is often a good approach, because people prefer fewer wrong predictions.
3) Use stratified sampling rather than a plain train_test_split, so that the training and test sets keep a constant class ratio. The stratified splitters return the indices for training and testing, and you can vary the number of cross-validation iterations.
4) In the confusion matrix, look at the precision per class and see which class shows more errors (I believe applying a threshold would solve this).
5) Try other classifiers: logistic regression, SVM (linear or with another kernel: LinearSVC or SVC), naive Bayes. In most binary classification cases I have seen, logistic regression and SVC perform ahead of other algorithms, so try these first.
6) Search for the best hyperparameters (using a grid search over learning rates, kernels, class weights, or other parameters). If it's text classification, are you applying CountVectorizer with TF-IDF (and have you played with max_df and stop-word removal)?
If you have tried all of these, then reconsider whether the algorithm itself suits the problem.
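The stratified splitting mentioned in the list can be sketched with scikit-learn. The labels below are toy values in the question's 84:16 ratio, scaled down, and X is a dummy feature:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy labels in the question's 84:16 ratio, scaled down; X is a dummy feature.
y = np.array([0] * 840 + [1] * 160)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the positive rate the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(y_train.mean(), y_test.mean())

# StratifiedKFold yields index arrays; every fold keeps the same class ratio.
folds = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y))
for train_idx, test_idx in folds:
    print(len(test_idx), y[test_idx].mean())
```

Stratification only fixes the class ratio of each split. It does not rebalance the classes, so it is usually combined with class weights or a tuned threshold.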