Metrics to consider for heavily imbalanced dataset - python

I am trying to train a GradientBoosting model on heavily imbalanced data in Python. The class distribution is roughly 0.96 : 0.04 for class 0 and class 1 respectively.
After some parameter tuning guided by the recall and precision scores, I arrived at a good model. Its scores on the validation set are given below; they are close to the cross-validation scores.
recall : 0.928777
precision : 0.974747
auc : 0.9636
kappa : 0.948455
f1 weighted : 0.994728
If I want to tune the model further, which metrics should I try to improve? In my problem, misclassifying 1 as 0 is more problematic than misclassifying 0 as 1.

There are various techniques for dealing with class imbalance. A few are listed below.
(Python's imblearn and costcla packages implement them.)
Resample:
Undersample the majority class (class 0 in your case). You can try random undersampling for starters.
Oversample the minority class (class 1). Explore the SMOTE/ADASYN techniques.
Ensemble Techniques:
Bagging/Boosting techniques.
Cost-sensitive learning: You should definitely explore this, since you mentioned:
In my problem, misclassifying 1 as 0 is more problematic than misclassifying 0 as 1.
For cost-sensitive learning with the costcla package, you could try the following approach, keeping GradientBoostingClassifier as your base classifier:
costcla.sampling.cost_sampling(X, y, cost_mat, method='RejectionSampling', oversampling_norm=0.1, max_wc=97.5)
Here you supply a cost matrix cost_mat with the columns [C_FP, C_FN, C_TP, C_TN] for each data point in the train and test sets. C_FP and C_FN are the misclassification costs you want to assign to false positives and false negatives. See the full tutorial on credit-scoring data for details.
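Since a missed positive (1 predicted as 0) is the costlier error here, one option is to track a recall-weighted metric such as F-beta with beta > 1 instead of F1 alone. The following is a minimal pure-Python sketch of that idea; the confusion counts are made up for illustration and are not the asker's data. scikit-learn's sklearn.metrics.fbeta_score computes the same quantity from label arrays.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    # beta > 1 weights recall more heavily than precision
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Hypothetical counts for two candidate models on the same validation set
model_a = dict(tp=90, fp=5, fn=10)   # very precise, but misses more positives
model_b = dict(tp=97, fp=20, fn=3)   # more false alarms, fewer missed positives

for name, counts in [("A", model_a), ("B", model_b)]:
    p, r, f1 = precision_recall_fbeta(**counts, beta=1.0)
    _, _, f2 = precision_recall_fbeta(**counts, beta=2.0)
    print(f"model {name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f} F2={f2:.3f}")
```

With these counts F1 ranks model A higher while F2 ranks model B higher, so tuning against F2 (or recall at a fixed minimum precision) pushes the search towards fewer false negatives.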

Related

Reducing False positives ML models

Is there a nice way to enforce a limit on the false positives while training an ML model?
Suppose you start with a balanced dataset with two classes and develop an ML model for binary classification. As the task is easy, the output distributions peak at 0 and 1 respectively and overlap around 0.5. However, what you really care about is that the false positive rate stays sustainable and does not exceed a certain amount.
So, ideally, you would like to have only one class for pred > 0.8.
At the moment I'm weighting the two classes to penalise errors on class "0".
history = model.fit(..., class_weight={0:5, 1:1}, ...)
As expected, this decreases the FPR in the region pred > 0.8, and of course it worsens the recall of class 1.
I'm wondering if there are other ways to enforce this.
Thank you
Depending on your problem, you can consider a one-class SVM. This article may be useful: https://towardsdatascience.com/outlier-detection-with-one-class-svms-5403a1a1878c (it also shows why one-class classification can be a better choice than classical techniques such as oversampling/undersampling or class weighting). But of course it depends on the problem you want to solve.
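Another option, complementary to class weighting, is to leave training unchanged and pick the decision threshold afterwards so that the false positive rate on a held-out validation set stays under the limit. Below is a minimal pure-Python sketch of that idea; the function name and the toy scores are hypothetical. sklearn.metrics.roc_curve returns the same FPR/recall trade-off for every candidate threshold.

```python
def threshold_for_max_fpr(y_true, scores, max_fpr):
    """Lowest score threshold whose false positive rate is <= max_fpr.

    A sample is predicted positive when score >= threshold.
    Returns (threshold, fpr, recall); threshold is None if no cut-off qualifies.
    """
    n_neg = sum(1 for y in y_true if y == 0)
    n_pos = sum(1 for y in y_true if y == 1)
    best = None
    # Candidate thresholds: every distinct score, highest first
    for t in sorted(set(scores), reverse=True):
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fpr = fp / n_neg
        if fpr > max_fpr:
            break  # lowering the threshold further can only raise the FPR
        best = (t, fpr, tp / n_pos)
    return best if best is not None else (None, 0.0, 0.0)


# Toy validation set (made up): 10 negatives followed by 10 positives
y_val = [0] * 10 + [1] * 10
val_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.55, 0.62, 0.85,
              0.40, 0.58, 0.65, 0.70, 0.75, 0.80, 0.88, 0.90, 0.95, 0.99]

for limit in (0.0, 0.1):
    t, fpr, recall = threshold_for_max_fpr(y_val, val_scores, limit)
    print(f"max FPR {limit}: threshold={t}, FPR={fpr}, recall={recall}")
```

The threshold should be chosen on validation data and then checked on a separate test set, since the FPR measured on the data used for selection is optimistic.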

How to perform supervised training of a deep neural network when the target label has only 0 and 1?

I am trying to train a Deep Neural Network (DNN) with labeled data. The labels are encoded so that they contain only the values 0 and 1. The shape of the encoded label is 5 x 5 x 232. About 95% of the values in the label are 0 and the rest are 1. Currently, I am using the binary_crossentropy loss function to train the network.
What is the best technique to train the DNN in such a scenario? Is binary_crossentropy an appropriate choice of loss function in this case? Any suggestions to improve the performance of the model?
You can try MSE loss. If you want to stick to binary cross-entropy (used in binary classification), consider using label smoothing.
You may use two alternative loss functions instead of binary cross-entropy:
Hinge Loss
An alternative to cross-entropy for binary classification problems is the hinge loss function, primarily developed for use with Support Vector Machine (SVM) models.
It is intended for use with binary classification where the target values are in the set {-1, 1}.
Squared Hinge Loss
Hope this helps, happy learning.
binary_crossentropy as the loss is fine.
Don't use accuracy as your metric, because a model that predicts everything as label 0 will still get 95% accuracy. Instead use the F1 score (or precision or recall).
Use a weighted loss, i.e. penalize errors on class 1 more heavily than errors on class 0.
Instead of class weights, you can also oversample from the minority class (techniques like SMOTE).
How to calculate class weights:
You can use sklearn.utils.class_weight to calculate the weights from your labels.
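To make the weighted-loss suggestion concrete, here is a minimal pure-Python sketch. It uses the "balanced" heuristic n_samples / (n_classes * count), which is also what sklearn.utils.class_weight.compute_class_weight applies with class_weight="balanced". The helper names and the toy labels are illustrative assumptions; how the weights are passed to the training loop depends on the framework.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count(class))."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_bce(y_true, y_prob, weights, eps=1e-7):
    """Mean binary cross-entropy with a per-class weight on each element."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += weights[y] * loss
    return total / len(y_true)


# Toy flattened label tensor: 95% zeros and 5% ones, as in the question
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
print("class weights:", weights)

# A degenerate model that predicts "almost certainly 0" everywhere
always_zero = [0.01] * len(labels)
plain = weighted_bce(labels, always_zero, {0: 1.0, 1: 1.0})
weighted = weighted_bce(labels, always_zero, weights)
print(f"unweighted loss: {plain:.4f}, weighted loss: {weighted:.4f}")
```

The always-zero predictor looks cheap under the unweighted loss but is penalized far more once the rare class carries a larger weight, which is the effect the weighting is meant to have.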
In scenarios like this, where the data is highly imbalanced, I would suggest going with a Random Forest with up-sampling. Up-sampling the minority class gives the model more examples of it to learn from, which can improve the model's accuracy on that class.

Is it feasible to have the training set < the test set after undersampling the majority class?

I have a data set of 1500 records with two imbalanced classes. Class 0 has 1300 records while Class 1 has 200 records, hence a ratio of around 6.5:1.
I built a random forest with this data set for classification. I know from past experience that if I use the whole data set the recall is pretty low, which is probably due to the class imbalance.
So I decided to undersample Class 0. My steps are as follows:
Randomly split the data set into train & test set of ratio 7:3 (hence 1050 for training and 450 for test.)
Now the train set has ~900 records of Class 0 and ~100 of Class 1. I clustered the ~900 Class 0 records and undersampled them (proportionally) to ~100 records.
So now the train set has ~100 Class 0 + ~100 Class 1 = ~200 records in total, while the test set has ~380 Class 0 + ~70 Class 1 = 450 records in total.
Here come my questions:
1) Are my steps valid? I split the train/test first and then undersample the majority class of the train set.
2) Now my train set (~200) < test set (450). Does it make sense?
3) The performance is still not very good. Precision is 0.34, recall is 0.72 and the f1 score is 0.46. Is there any way to improve? Should I use CV?
Many thanks for helping!
1) Are my steps valid? I split the train/test first and then undersample the majority class of the train set.
You should split into train and test so that the class balance is preserved in both. If the ratio in your whole dataset is 6.5:1, it should be the same in both train and test.
Yes, you should split before undersampling (there is no need to undersample the test cases). Just remember to monitor multiple metrics (the F1 score, recall and precision you already mentioned should be fine), as you are training on a different distribution than the test set.
2) Now my train set (~200) < test set (450). Does it make sense?
Yes, it does. You may also oversample the training dataset (e.g. repeat minority-class examples at random to match the number of majority-class examples). In this case you also have to split first; otherwise copies of training samples may end up in your test set, which is even more disastrous.
3) The performance is still not very good. Precision is 0.34, recall is 0.72 and the f1 score is 0.46. Is there any way to improve? Should I use CV?
It depends on the specific problem. What I would do:
use oversampling instead of undersampling: neural networks need a lot of data, and you don't have many samples right now
try other non-DL algorithms (maybe SVM if you have a lot of features; otherwise RandomForest might be a good bet as well)
otherwise fine-tune your neural network (focus especially on the learning rate, and use CV or related methods if you have the time)
try some pretrained neural networks if they are available for the task at hand
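The split-then-resample order from points 1) and 2) can be sketched as follows. This is a minimal pure-Python illustration on label indices only, using the class counts from the question; it uses plain random undersampling rather than the cluster-based undersampling described by the asker, and the helper names are hypothetical. With real data, sklearn's train_test_split(..., stratify=y) and imblearn's RandomUnderSampler cover the same two steps.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Return (train_idx, test_idx) with the class ratio preserved in both."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx


def undersample_majority(indices, labels, rng):
    """Randomly drop samples so that every class matches the smallest class."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(v) for v in by_class.values())
    kept = []
    for idx in by_class.values():
        kept += rng.sample(idx, n_min)
    return kept


def class_counts(indices, labels):
    return {c: sum(1 for i in indices if labels[i] == c) for c in (0, 1)}


rng = random.Random(0)
labels = [0] * 1300 + [1] * 200            # class counts from the question

# 1) stratified split first, 2) undersample the training part only
train_idx, test_idx = stratified_split(labels, test_fraction=0.3, rng=rng)
train_idx = undersample_majority(train_idx, labels, rng)

print("train:", class_counts(train_idx, labels))
print("test: ", class_counts(test_idx, labels))
```

This prints a balanced training set of 140 + 140 records and an untouched test set of 390 + 60 records, so the test set keeps the original 6.5:1 ratio and shares no samples with the training set.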

Imbalanced Dataset - Binary Classification Python

I am trying to create a binary classification model for an imbalanced dataset using Random Forest (class 0: 84K, class 1: 16K). I have tried class_weight = 'balanced', class_weight = {0:1, 1:5}, downsampling and oversampling, but none of these seem to work. My metrics are usually in the range below:
Accuracy = 66%
Precision = 23%
Recall = 44%
I would really appreciate any help on this! Thanks
There are lots of ways to improve classifier behavior. If you think your data are balanced (or rather, that your weighting method balances them enough), then consider expanding your forest, with either deeper trees or more trees.
Try other methods such as SVM or ANN, and see how they compare.
Try stratified sampling so that the class ratio is the same in both the training and the test datasets, and then use the balanced class weights you have already tried. If you want to improve the accuracy, there are many other things to try:
1) First, make sure that the dataset you were given is accurate and verified.
2) You can increase the accuracy by playing with the probability threshold: in binary classification, make a prediction only if the model is more than 0.7 confident and otherwise don't. The drawback of this approach is NULL values, i.e. samples that get no prediction because the algorithm is not confident enough, but for a business model it is a good approach because people prefer fewer false negatives in their model.
3) Use stratified sampling rather than a plain train_test_split to divide the training and test datasets, so that the class ratio stays constant. Stratified sampling returns the indexes for training and testing, and you can play with the number of cross-validation iterations.
4) In the confusion matrix, look at the precision score per class and see which class shows more misclassifications (I believe applying the threshold limit would solve this problem).
5) Try other classifiers: logistic regression, SVM (linear or with another kernel: LinearSVC or SVC), Naive Bayes. In most binary classification cases, logistic regression and SVC seem to perform ahead of other algorithms, so try these approaches first.
6) Make sure to search for the best hyperparameters (using grid search over a couple of learning rates, different kernels, class weights or other parameters). If it is text classification, are you applying CountVectorizer with TF-IDF (and have you played with max_df and stop-word removal)?
If you have already tried these, then reconsider whether the algorithm itself is the right choice.
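The confidence-threshold idea from point 2) can be sketched as below: predict only when the model is confident and abstain otherwise, then report how many samples were decided and how accurate those decisions were. This is a minimal pure-Python illustration; the function names and the probabilities are made up, and in practice the probabilities would come from something like a fitted classifier's predict_proba.

```python
def predict_with_abstention(probs, upper=0.7):
    """Predict 1 if p >= upper, 0 if p <= 1 - upper, otherwise abstain (None)."""
    lower = 1 - upper
    return [1 if p >= upper else 0 if p <= lower else None for p in probs]


def coverage_and_accuracy(y_true, y_pred):
    """Share of samples that got a prediction, and accuracy on those samples."""
    decided = [(y, p) for y, p in zip(y_true, y_pred) if p is not None]
    coverage = len(decided) / len(y_true)
    accuracy = sum(y == p for y, p in decided) / len(decided) if decided else 0.0
    return coverage, accuracy


# Hypothetical positive-class probabilities for ten validation samples
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.05, 0.20, 0.30, 0.45, 0.60, 0.75, 0.35, 0.55, 0.80, 0.95]

for upper in (0.5, 0.7):
    preds = predict_with_abstention(probs, upper)
    cov, acc = coverage_and_accuracy(y_true, preds)
    print(f"upper={upper}: coverage={cov:.2f}, accuracy on decided={acc:.2f}")
```

On this toy data, raising the cut-off from 0.5 to 0.7 lifts accuracy on the decided samples from 0.70 to about 0.83 while coverage drops from 1.00 to 0.60, which is the trade-off described in point 2).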

How to interpret feature importance for ensemble methods?

I'm using ensemble methods (random forest, xgbclassifier, etc.) for classification.
One important aspect is the predicted feature importance, which looks like this:
Feature      Importance
Feature-A    0.25
Feature-B    0.09
Feature-C    0.08
.......
This model achieves an accuracy score of around 0.85. Feature-A is obviously dominant, so I decided to remove it and train again.
However, after removing Feature-A, I still got good performance, with accuracy around 0.79.
This doesn't make sense to me: if Feature-A contributes 25% to the model, why is the accuracy score barely affected when it is removed?
I know ensemble methods have the advantage of combining 'weak' features into 'strong' ones, so does the accuracy score rely mostly on aggregation, making it less sensitive to the removal of an important feature?
Thanks
It's possible there are other features that are redundant with Feature A. For instance, suppose that features G,H,I are redundant with feature A: if you know the value of features G,H,I, then the value of feature A is pretty much determined.
That would be consistent with your results. If we include feature A, the model will learn to use it, because it is very simple to get excellent accuracy from feature A alone while ignoring features G,H,I. The result is excellent accuracy, high importance for feature A, and low importance for features G,H,I. If we exclude feature A, the model can still get almost-as-good accuracy by using features G,H,I, so it will still have very good accuracy (though the model might become more complicated, because the relationship between G,H,I and the class is more complicated than the relationship between A and the class).
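The redundancy argument can be checked on synthetic data. The sketch below is an illustration under stated assumptions, not a reconstruction of the asker's dataset: the label depends on a feature A, features G and H are noisy copies of A, and three further features are pure noise. It assumes numpy and scikit-learn are installed, and all variable names are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)                 # "Feature-A": determines the label
g = a + 0.1 * rng.normal(size=n)       # noisy copy of A (redundant)
h = a + 0.1 * rng.normal(size=n)       # another noisy copy of A
noise = rng.normal(size=(n, 3))        # features unrelated to the label
y = (a > 0).astype(int)

X_full = np.column_stack([a, g, h, noise])
X_drop = X_full[:, 1:]                 # same data without Feature-A


def fit_and_score(X, y):
    """Fit a random forest on a stratified split and return (model, test accuracy)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    return model, model.score(X_te, y_te)


model_full, acc_full = fit_and_score(X_full, y)
model_drop, acc_drop = fit_and_score(X_drop, y)

print("importances with A:   ", model_full.feature_importances_.round(2))
print("importances without A:", model_drop.feature_importances_.round(2))
print(f"accuracy with A: {acc_full:.3f}, without A: {acc_drop:.3f}")
```

In a setup like this one would expect accuracy to drop only slightly when A is removed, because G and H carry nearly the same information. How the importance is divided among correlated features depends on the model and its settings, so a single feature's importance is not a reliable estimate of how much accuracy is lost when that feature is removed.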
