Is there a nice way to enforce a limit on the false positives while training an ML model?
Let's suppose you start with a balanced dataset with two classes. You develop an ML model for binary classification. As the task is easy, the output distributions will be peaked at 0 and 1 respectively, overlapping around 0.5. However, what you really care about is that your false positive rate is sustainable and cannot exceed a certain amount.
So at best you would like that for pred > 0.8 you only have one class.
At the moment I'm weighting the two classes to penalise an error on class "0".
history = model.fit(..., class_weight={0:5, 1:1}, ...)
As expected, it does decrease the FPR in the region pred > 0.8, and of course it worsens the recall of class 1.
I'm wondering if there are other ways to enforce this.
Thank you
Depending on your problem, you can consider a one-class SVM. This article can be useful: https://towardsdatascience.com/outlier-detection-with-one-class-svms-5403a1a1878c . The article also shows why one-class classification can be a better choice than some other classical techniques, such as oversampling/undersampling or class weighting. But of course it depends on the problem you want to solve.
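One more option, separate from the one-class SVM suggested above: keep the classifier as it is and choose the decision threshold on held-out data so that the measured false positive rate stays under a budget. This is only a minimal sketch, assuming scikit-learn is available. The synthetic scores, the helper name threshold_for_max_fpr and the 1% budget are illustrative assumptions, not something from the question.

```python
import numpy as np
from sklearn.metrics import roc_curve


def threshold_for_max_fpr(y_true, scores, max_fpr):
    """Return (threshold, fpr, tpr) for the ROC point with the highest TPR
    whose FPR on the given data does not exceed max_fpr."""
    # drop_intermediate=False keeps every candidate threshold on the curve.
    fpr, tpr, thresholds = roc_curve(y_true, scores, drop_intermediate=False)
    # fpr is non-decreasing, so the last admissible index has the highest TPR.
    idx = np.flatnonzero(fpr <= max_fpr)[-1]
    return thresholds[idx], fpr[idx], tpr[idx]


# Synthetic stand-in for validation labels and model outputs (assumption).
rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=2000)
scores_val = np.clip(0.6 * y_val + rng.normal(0.2, 0.15, size=2000), 0.0, 1.0)

thr, fpr_at_thr, tpr_at_thr = threshold_for_max_fpr(y_val, scores_val, max_fpr=0.01)
pred_val = (scores_val >= thr).astype(int)  # predict class 1 only at or above thr
print(f"threshold={thr:.3f}  FPR={fpr_at_thr:.4f}  TPR={tpr_at_thr:.3f}")
```

The limit holds exactly only on the data used to pick the threshold. On new data it is an estimate, so it is probably wise to leave some margin below the real budget.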
Related
How to determine the best threshold value for a deep learning model. I am working on predicting epileptic seizures using a CNN. I want to determine the best threshold for my deep learning model in order to get the best results.
I have been trying for more than 2 weeks to find out how I can do it.
Any help would be appreciated.
code
history = model.fit_generator(
    generate_arrays_for_training(indexPat, filesPath, end=75),
    validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
    steps_per_epoch=int(len(filesPath) - int(len(filesPath) / 100 * 25)),
    validation_steps=int(len(filesPath) - int(len(filesPath) / 100 * 75)),
    verbose=2,
    epochs=50, max_queue_size=2, shuffle=True, callbacks=[callback, call])
In general, choosing the right classification threshold depends on the use case. You should remember that choosing the threshold is not part of hyperparameter tuning. The value of the classification threshold greatly impacts the behaviour of the model after you train it.
If you increase it, you require your model to be very sure about a positive prediction, which means you will be filtering out false positives: you will be targeting precision. This might be the case when your model is part of a mission-critical pipeline where a decision made on a positive output of the model is costly (in terms of money, time, human resources, computational resources etc.).
If you decrease it, your model will say that more examples are positive, which will allow you to explore more examples that are potentially positive: you target recall. This is important when a false negative is disastrous, e.g. in medical cases (you would rather check whether a low-probability patient has cancer than ignore him and find out later that he was indeed sick).
For more examples please see When is precision more important over recall?
Now, choosing between recall and precision is a trade-off, and you have to choose based on your situation. Two tools to help you achieve this are ROC and precision-recall curves (see How to Use ROC Curves and Precision-Recall Curves for Classification in Python), which indicate how the model handles false positives and false negatives depending on the classification threshold.
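To make the two curves concrete, here is a small sketch that computes both with scikit-learn on synthetic scores. The data is made up for illustration. In practice y_true and y_score would be the validation labels and the model's predicted probabilities for class 1.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve

# Synthetic labels and scores standing in for real model output (assumption).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=500), 0.0, 1.0)

# ROC curve: false positive rate vs. true positive rate for each threshold.
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Precision-recall curve: precision and recall have one more entry than
# pr_thresholds (the final point is precision=1, recall=0).
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

print(f"ROC AUC = {auc:.3f}")
for i in np.linspace(0, len(pr_thresholds) - 1, 5).astype(int):
    print(f"threshold={pr_thresholds[i]:.2f}  "
          f"precision={precision[i]:.2f}  recall={recall[i]:.2f}")
```

Reading the printed rows from top to bottom shows the trade-off: as the threshold goes up, recall falls and precision generally rises.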
Many ML algorithms are capable of predicting a score for class membership, which needs to be interpreted before it can be mapped to a class label. You achieve this by using a threshold, such as 0.5, whereby values greater than or equal to the threshold are mapped to one class and the rest are mapped to the other class.
Class 0 = Prediction < 0.5; Class 1 = Prediction >= 0.5
It's crucial to find the best threshold value for the kind of problem you're working on, and not just assume a classification threshold such as 0.5.
Why? The default threshold can often result in pretty poor performance for classification problems with severe class imbalance.
ML thresholds are problem-specific and must be fine-tuned. Read a short article about it here
One of the best ways to get the best results from your deep learning model is to tune the threshold used to map probabilities to a class.
The best threshold for the CNN can be calculated directly using ROC curves and precision-recall curves. In some cases, you can use a grid search to fine-tune the threshold and find the optimal value.
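The grid search over thresholds can be sketched as below. It assumes the usual convention that scores at or above the threshold map to class 1, and it uses F1 as the selection metric, which is only one possible choice. The helper names to_labels and best_f1_threshold and the synthetic probabilities are illustrative. With a Keras CNN, p_val would be something like the model's predictions on the validation set.

```python
import numpy as np
from sklearn.metrics import f1_score


def to_labels(probabilities, threshold):
    """Map predicted probabilities of class 1 to hard 0/1 labels."""
    return (np.asarray(probabilities) >= threshold).astype(int)


def best_f1_threshold(y_true, probabilities, candidates):
    """Try every candidate threshold and return (threshold, f1) of the best."""
    scores = [f1_score(y_true, to_labels(probabilities, t), zero_division=0)
              for t in candidates]
    best = int(np.argmax(scores))
    return float(candidates[best]), float(scores[best])


# Synthetic imbalanced validation data (assumption): about 10% positives.
rng = np.random.default_rng(0)
y_val = (rng.random(2000) < 0.1).astype(int)
p_val = np.clip(0.3 * y_val + rng.normal(0.25, 0.12, size=2000), 0.0, 1.0)

candidates = np.round(np.linspace(0.05, 0.95, 19), 2)  # 0.05, 0.10, ..., 0.95
best_t, best_f1 = best_f1_threshold(y_val, p_val, candidates)
default_f1 = f1_score(y_val, to_labels(p_val, 0.5), zero_division=0)
print(f"best threshold={best_t:.2f}  F1={best_f1:.3f}  (F1 at 0.5: {default_f1:.3f})")
```

The threshold should be tuned on validation data and then checked once on the test set, otherwise the reported score is optimistic.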
The code below will help you check the option that will give the best results. GitHub link:
from deepchecks.checks.performance import PerformanceReport
check = PerformanceReport()
check.run(ds, clf)
I have a data set of 1500 records with two classes which are imbalanced. Class 0 has 1300 records while Class 1 has 200 records, hence a ratio of around 6.5:1.
I built a random forest with this data set for classification. I know from past experience that if I use the whole data set, the recall is pretty low, which is probably due to the class imbalance.
So I decided to undersample Class 0. My steps are as follows:
Randomly split the data set into train and test sets with a ratio of 7:3 (hence 1050 for training and 450 for test).
Now the train set has ~900 records of Class 0 and ~100 of Class 1. I clustered the ~900 records of Class 0 and undersampled them (proportionally) to ~100 records.
So now the train set is ~100 Class 0 + ~100 Class 1 = ~200 records in total, while the test set is 380 Class 0 + 70 Class 1 = 450 records in total.
Here are my questions:
1) Are my steps valid? I split the train/test first and then undersample the majority class of the train set.
2) Now my train set (~200) < test set (450). Does it make sense?
3) The performance is still not very good. Precision is 0.34, recall is 0.72 and the f1 score is 0.46. Is there any way to improve? Should I use CV?
Many thanks for helping!
1) Are my steps valid? I split the train/test first and then undersample the majority class of the train set.
You should split train and test so that the class balance is preserved in both. If the ratio in your whole dataset is 6.5:1, it should be the same in both train and test.
Yes, you should split before undersampling (there is no need to undersample the test cases). Just remember to monitor multiple metrics (e.g. F1 score, recall and precision were already mentioned, and you should be fine with those), as you are training on a different distribution than the one you test on.
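A minimal sketch of that order of operations, assuming scikit-learn and NumPy: a stratified 7:3 split first, then undersampling of class 0 in the training part only. The features are random placeholders, and plain random undersampling is used here instead of the cluster-based undersampling described in the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the class counts from the question (features are random).
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 5))
y = np.array([0] * 1300 + [1] * 200)

# 1) Stratified split: both parts keep roughly the original 6.5:1 ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# 2) Undersample the majority class in the training part only.
idx_majority = np.flatnonzero(y_train == 0)
idx_minority = np.flatnonzero(y_train == 1)
kept_majority = rng.choice(idx_majority, size=len(idx_minority), replace=False)
keep = np.concatenate([kept_majority, idx_minority])
rng.shuffle(keep)
X_train_bal, y_train_bal = X_train[keep], y_train[keep]

print("train (balanced):", np.bincount(y_train_bal))
print("test (untouched):", np.bincount(y_test))
```

The test set keeps the original imbalance, so metrics measured on it reflect what the model would see in practice.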
2) Now my train set (~200) < test set (450). Does it make sense?
Yes, it does. You may also go for oversampling on the training dataset (e.g. the minority class is repeated at random to match the number of examples from the majority class). In this case you have to split beforehand as well, otherwise you may contaminate your test set with training samples, which is even more disastrous.
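Random oversampling of the minority class could look roughly like this, using sklearn.utils.resample. The helper name oversample_minority and the placeholder arrays are assumptions for illustration. It is meant to be called on the training split only, after the train/test split has been made.

```python
import numpy as np
from sklearn.utils import resample


def oversample_minority(X, y, minority_label=1, random_state=0):
    """Repeat minority rows at random until both classes have equal counts."""
    X, y = np.asarray(X), np.asarray(y)
    is_minority = (y == minority_label)
    X_min, y_min = X[is_minority], y[is_minority]
    X_maj, y_maj = X[~is_minority], y[~is_minority]
    # Sample with replacement up to the size of the majority class.
    X_up, y_up = resample(X_min, y_min, replace=True,
                          n_samples=len(y_maj), random_state=random_state)
    return np.vstack([X_maj, X_up]), np.concatenate([y_maj, y_up])


# Placeholder training split: 910 records of class 0 and 140 of class 1.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1050, 5))
y_train = np.array([0] * 910 + [1] * 140)

X_over, y_over = oversample_minority(X_train, y_train)
print(np.bincount(y_over))
```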
3) The performance is still not very good. Precision is 0.34, recall is 0.72 and the f1 score is 0.46. Is there any way to improve? Should I use CV?
It depends on the specific problem. What I would do:
oversampling instead of undersampling - neural networks need a lot of data, and you don't have many samples right now
try other non-DL algorithms (maybe SVM if you have a lot of features? RandomForest might be a good bet otherwise)
otherwise fine-tune your neural network (focus especially on the learning rate, and use CV or related methods if you have the time)
try to use some pretrained neural networks if available for the task at hand
I am new to Machine Learning
I have a dataset which has highly unbalanced classes (dominated by the negative class) and contains more than 2K numeric features, and the target is [0,1]. I have trained a logistic regression and I am getting an accuracy of 89%, but the confusion matrix shows that the number of true positives is very low. Below are the scores of my model
Accuracy Score : 0.8965989500114129
Precision Score : 0.3333333333333333
Recall Score : 0.029545454545454545
F1 Score : 0.05427974947807933
How can I increase my true positives? Should I be using a different classification model?
I have tried PCA and represented my data in 2 components. It increased the model accuracy to approximately 90%, however the true positives decreased again.
There are several ways to do this:
You can change your model and test whether it performs better or not.
You can fix a different prediction threshold: here I guess you predict 0 if the output of your regression is < 0.5; you could change the 0.5 into 0.25, for example. It would increase your true positive rate but, of course, at the price of some more false positives.
You can duplicate every positive example in your training set so that your classifier has the feeling that the classes are actually balanced.
You could change the loss of the classifier in order to penalize false negatives more (this is actually pretty close to duplicating the positive examples in your dataset).
I'm sure many other tricks could apply; this is just my favorite short-list.
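Two of the options above, a lower prediction threshold and a loss that penalizes false negatives more, can be sketched with scikit-learn's LogisticRegression. The synthetic data from make_classification and the 0.25 threshold are only for illustration, and class_weight="balanced" is one way of reweighting the loss, not the only one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 90% negatives.
X, y = make_classification(n_samples=4000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# class_weight="balanced" scales each class's loss inversely to its frequency.
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

proba = plain.predict_proba(X_te)[:, 1]  # probability of class 1
recall_default = recall_score(y_te, (proba >= 0.5).astype(int))
recall_lowered = recall_score(y_te, (proba >= 0.25).astype(int))
recall_weighted = recall_score(y_te, weighted.predict(X_te))

print(f"recall at 0.5:  {recall_default:.3f}")
print(f"recall at 0.25: {recall_lowered:.3f}")
print(f"recall with class_weight='balanced': {recall_weighted:.3f}")
```

Lowering the threshold can never reduce recall for the same model, but precision should be checked alongside it.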
I'm assuming that your purpose is to obtain a model with good classification accuracy on some test set, regardless of the form of that model.
In that case, if you have access to the computational resources, try gradient-boosted trees. That's an ensemble classifier that fits many decision trees in sequence, each one trained to correct the errors of the trees before it, and combines their outputs to make predictions. As far as I know, it can give good results with unbalanced class counts.
scikit-learn has the class sklearn.ensemble.GradientBoostingClassifier for this. I have not used that particular one, but I use the regression version often and it seems good. I'm pretty sure MATLAB has this as a package too, if you have access.
2k features might be difficult for the scikit-learn algorithm. I don't know, I've never tried.
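For reference, a minimal use of the scikit-learn classifier mentioned above might look like this. The synthetic data (30 features, about 10% positives) and the hyperparameters are placeholders, not a recommendation for the 2k-feature dataset in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset (assumption).
X, y = make_classification(n_samples=1000, n_features=30, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = GradientBoostingClassifier(n_estimators=100, random_state=0)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Per-class precision, recall and F1 say more than accuracy on imbalanced data.
print(classification_report(y_te, y_pred, digits=3))
```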
What is the size of your dataset? How many rows are we talking about here?
Your dataset is not balanced, so it's kind of normal for a simple classification algorithm to predict the majority class most of the time and give you an accuracy of 90%. Can you collect more data that will have more positive examples in it?
Or just try oversampling/undersampling and see if that helps.
You can also use a penalized version of the algorithm to impose a penalty whenever the wrong class is predicted. That may help.
You can try many different solutions.
If you have quite a lot of data points, for instance 2k 1s and 20k 0s, you can try simply dropping the extra 0s and keeping only 2k 0s, then train on that. You can also try using different sets of 2k 0s with the same set of 2k 1s to train multiple models, and make the decision based on the multiple models.
You can also try adding weights at the output layer. For instance, if you have 10 times as many 0s as 1s, try multiplying the prediction value for the 1s by 10.
You could probably also try increasing dropout.
And so on.
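The idea of training several models, each on a different subset of 0s plus the same 1s, can be sketched as follows. Logistic regression is used here only as a lightweight stand-in for whatever model is actually being trained, the function names are hypothetical, and the code assumes class 0 is the majority class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def fit_undersampled_ensemble(X, y, n_models=5, seed=0):
    """Fit one model per random subset of 0s, each paired with all the 1s."""
    rng = np.random.default_rng(seed)
    idx0, idx1 = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    models = []
    for _ in range(n_models):
        sub0 = rng.choice(idx0, size=len(idx1), replace=False)
        idx = np.concatenate([sub0, idx1])
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models


def predict_ensemble(models, X, threshold=0.5):
    """Average the class-1 probabilities of all models, then threshold."""
    mean_proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
    return (mean_proba >= threshold).astype(int)


# Synthetic data with roughly ten times more 0s than 1s (assumption).
X, y = make_classification(n_samples=5500, weights=[0.9, 0.1], random_state=0)
models = fit_undersampled_ensemble(X, y, n_models=5)
preds = predict_ensemble(models, X)
print("predicted positives:", int(preds.sum()), "of", len(preds))
```

Averaging probabilities is one way to combine the models. Majority voting over hard labels would be another.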
I am trying to train a GradientBoosting model on heavily imbalanced data in Python. The class distribution is roughly 0.96 : 0.04 for class 0 and class 1 respectively.
After some parameter tuning based on the recall and precision scores, I came up with a good model. The metric scores on the validation set are given below. They are also close to the cross-validation scores.
recall : 0.928777
precision : 0.974747
auc : 0.9636
kappa : 0.948455
f1 weighted : 0.994728
If I want to tune the model further, which metrics should I try to increase? In my problem, misclassifying 1 as 0 is more problematic than mispredicting 0 as 1.
There are various techniques for dealing with the class imbalance issue. A few are listed below:
(Links include Python's imblearn package and the costcla package)
Resample:
Undersample the majority class (class 0 in your case). You can try random undersampling for starters.
Oversample the minority class (Class 1). Explore SMOTE/ADASYN techniques.
Ensemble Techniques:
Bagging/Boosting techniques.
Cost-sensitive learning: you should definitely explore this, since you mentioned:
In my problem, misclassifying 1 as 0 is more problematic than mispredicting 0 as 1.
In cost-sensitive learning using the costcla package, you could try the following approach, keeping your base classifier as GradientBoostingRegressor:
costcla.sampling.cost_sampling(X, y, cost_mat, method='RejectionSampling', oversampling_norm=0.1, max_wc=97.5)
Here you can load a cost_mat [C_FP, C_FN, C_TP, C_TN] for each data point in train and test. C_FP and C_FN are based on the misclassification costs that you want to set for the positive and negative classes. Refer to the full tutorial on credit score data here.
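If costcla is not an option, a rough approximation of the same idea with plain scikit-learn is to pass per-sample weights to the classifier and to track a recall-oriented metric such as F2. This is a substitute technique, not the costcla method above, and the 10:1 cost ratio is a hypothetical value chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly the 0.96 : 0.04 class ratio (assumption).
X, y = make_classification(n_samples=3000, weights=[0.96, 0.04], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Hypothetical costs: missing a 1 is taken to be 10x worse than a false alarm.
COST_FN, COST_FP = 10.0, 1.0
sample_weight = np.where(y_tr == 1, COST_FN, COST_FP)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_tr, y_tr, sample_weight=sample_weight)
pred = model.predict(X_te)

# F-beta with beta=2 weights recall more heavily than precision.
f2 = fbeta_score(y_te, pred, beta=2)
rec = recall_score(y_te, pred)
print(f"F2={f2:.3f}  recall={rec:.3f}")
```

When misclassifying 1 as 0 is the costlier error, recall or F2 are reasonable metrics to push up, with precision kept as a guardrail.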
I am trying to create a binary classification model for an imbalanced dataset using Random Forest (class 0: 84K, class 1: 16K). I have tried using class_weight = 'balanced', class_weight = {0:1, 1:5}, downsampling and oversampling, but none of these seem to work. My metrics are usually in the range below:
Accuracy = 66%
Precision = 23%
Recall = 44%
I would really appreciate any help on this! Thanks
There are lots of ways to improve classifier behavior. If you think your data are balanced (or rather, that your weighting method balances them enough), then consider expanding your forest, either with deeper trees or with more trees.
Try other methods like SVM, or ANN, and see how they compare.
Try stratified sampling for the dataset so that the class ratio stays constant in both the test and the training datasets. Then use the balanced class weight, which you have already tried. If you want to improve the accuracy, there are plenty of other ways.
1) First, be sure that the dataset you were given is accurate or verified.
2) You can increase the accuracy by playing with the probability threshold (in binary classification, make a prediction only if the model is more than 0.7 confident, otherwise don't). The drawback of this approach is that many examples get NULL values, i.e. no prediction, because the algorithm is not confident enough. For a business model it is still a good approach, because people prefer fewer false predictions from their model.
3) Use stratified sampling to divide the training and testing datasets so that the class ratio is kept constant in both, rather than a plain train_test_split: a stratified splitter will return the indexes for training and testing. You can also experiment with cross-validation (different iterations).
4) For the confusion matrix, have a look at the precision score per class and see which class is showing more (I believe applying a threshold limitation would solve this problem).
5) Try other classifiers: logistic regression, SVM (linear or with another kernel, i.e. LinearSVC or SVC), Naive Bayes. In most cases of binary classification that I have seen, logistic regression and SVC seem to perform ahead of other algorithms, so try these approaches first.
6) Make sure to look for the best fitting parameters, i.e. the choice of hyperparameters (using grid search with a couple of learning rates, different kernels, class weights or other parameters). If it is text classification, are you applying CountVectorizer with TF-IDF (and have you played with max_df and stop-word removal)?
If you have tried these, then make sure the algorithm itself is suitable first.
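Points 3 and 6 can be combined in a short sketch: stratified cross-validation with a grid search over random forest settings, scored by F1 rather than accuracy. The synthetic data and the small parameter grid are assumptions chosen to keep the example quick. A real search would use the actual dataset and probably a wider grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic data with roughly the 84:16 class ratio (assumption).
X, y = make_classification(n_samples=1000, weights=[0.84, 0.16], random_state=0)

# StratifiedKFold keeps the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 8],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=cv)
search.fit(X, y)

print("best params:", search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.3f}")
```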