How to determine the best threshold value for a deep learning model? I am working on predicting epileptic seizures using a CNN. I want to determine the best classification threshold for my deep learning model in order to get the best results.
I have been trying for more than two weeks to find out how to do it.
Any help would be appreciated.
code
history = model.fit_generator(
    generate_arrays_for_training(indexPat, filesPath, end=75),
    validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
    # steps_per_epoch: number of files minus 25% of them
    steps_per_epoch=int(len(filesPath) - int(len(filesPath) / 100 * 25)),
    # validation_steps: number of files minus 75% of them
    validation_steps=int(len(filesPath) - int(len(filesPath) / 100 * 75)),
    verbose=2,
    epochs=50, max_queue_size=2, shuffle=True, callbacks=[callback, call])
In general, choosing the right classification threshold depends on the use case. You should remember that choosing the threshold is not part of hyperparameter tuning. The value of the classification threshold greatly impacts the behaviour of the model after you train it.
If you increase it, you require the model to be very sure about a positive prediction, which filters out false positives: you are targeting precision. This might be the case when your model is part of a mission-critical pipeline where a decision based on a positive output of the model is costly (in terms of money, time, human resources, computational resources, etc.).
If you decrease it, your model will label more examples as positive, which lets you explore more examples that are potentially positive: you are targeting recall. This is important when a false negative is disastrous, e.g. in medical cases (you would rather check whether a low-probability patient has cancer than ignore them and find out later that they were indeed sick).
For more examples, please see "When is precision more important over recall?"
Now, choosing between recall and precision is a trade-off, and you have to make it based on your situation. Two tools to help you with this are ROC curves and precision-recall curves (see "How to Use ROC Curves and Precision-Recall Curves for Classification in Python"), which show how the model handles false positives and false negatives depending on the classification threshold.
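As a rough illustration of how a ROC curve relates to the threshold, the sketch below uses scikit-learn's roc_curve on made-up validation labels and scores (y_true and y_prob are hypothetical, not the asker's data) and picks the threshold that maximizes Youden's J statistic (TPR minus FPR). That criterion is an assumption chosen for the example; the right one depends on the costs discussed above.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical validation labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

# roc_curve returns one (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Youden's J = TPR - FPR; maximizing it is one common criterion, not the only one.
j_scores = tpr - fpr
best_idx = int(np.argmax(j_scores))
best_threshold = thresholds[best_idx]

print(f"threshold={best_threshold:.2f}  "
      f"TPR={tpr[best_idx]:.2f}  FPR={fpr[best_idx]:.2f}")
```

With a Keras model, y_prob would typically come from model.predict on held-out data rather than from the training set.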
Many ML algorithms predict a score for class membership, which needs to be interpreted before it can be mapped to a class label. You achieve this by using a threshold, such as 0.5, whereby values greater than or equal to the threshold are mapped to one class and the rest are mapped to the other class.
Class 1 = Prediction >= 0.5; Class 0 = Prediction < 0.5
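A minimal sketch of this mapping, assuming the model outputs the probability of class 1 and that scores at or above the threshold are labeled class 1 (to_labels is a hypothetical helper name, not a library function):

```python
import numpy as np

def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities of class 1 to 0/1 labels."""
    return (np.asarray(probabilities) >= threshold).astype(int)

probs = [0.10, 0.45, 0.50, 0.80]
print(to_labels(probs))        # default threshold of 0.5
print(to_labels(probs, 0.4))   # lower threshold, so more positives
```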
It's crucial to find the best threshold value for the kind of problem you are working on, and not just assume a classification threshold such as 0.5.
Why? The default threshold can often result in pretty poor performance for classification problems with severe class imbalance.
ML thresholds are problem-specific and must be fine-tuned. Read a short article about it here
One of the most direct ways to get the best results from your deep learning model is to tune, on held-out data, the threshold used to map probabilities to a class label.
The best threshold for the CNN can be calculated directly using ROC Curves and Precision-Recall Curves. In some cases, you can use a grid search to fine-tune the threshold and find the optimal value.
The code below will help you check the option that will give the best results. GitHub link:
from deepchecks.checks.performance import PerformanceReport
check = PerformanceReport()
check.run(ds, clf)
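The grid search over thresholds mentioned above can be sketched as follows. This is only an illustration under assumptions: the labels and scores are made up, and F1 is used as the metric to maximize, which may not be the right objective for seizure prediction.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical validation labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.03, 0.11, 0.22, 0.31, 0.37, 0.42, 0.61, 0.72, 0.83, 0.91])

# Candidate thresholds from 0.05 to 0.95 in steps of 0.05.
thresholds = np.linspace(0.05, 0.95, 19)

# Score every candidate threshold on the validation data.
f1_scores = [
    f1_score(y_true, (y_prob >= t).astype(int), zero_division=0)
    for t in thresholds
]

best_idx = int(np.argmax(f1_scores))
best_threshold = thresholds[best_idx]
best_f1 = f1_scores[best_idx]
print(f"best threshold={best_threshold:.2f}, F1={best_f1:.2f}")
```

The threshold should be selected on validation data and then checked once on a separate test set, otherwise the reported score is optimistic.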
Related
Is there a nice way to enforce a limit on the false positives while training an ML model?
Let's suppose you start with a balanced dataset with two classes. You develop an ML model for binary classification. As the task is easy, the output distributions will be peaked at 0 and 1 respectively, and will overlap around 0.5. However, what you really care about is that your false positive rate is sustainable and cannot exceed a certain amount.
So ideally, for pred > 0.8 you would like to have only one class.
At the moment I'm weighting the two classes to penalise errors on class "0".
history = model.fit(..., class_weight={0:5, 1:1}, ...)
As expected, it does decrease the FPR in the region pred > 0.8, and of course it worsens the recall of class 1.
I'm wondering if there are other ways to enforce this.
Thank you
Depending on your problem, you can consider a one-class SVM. This article can be useful: https://towardsdatascience.com/outlier-detection-with-one-class-svms-5403a1a1878c . The article also explains why one-class classification can be preferable to some classical techniques, such as oversampling/undersampling or class weighting. But of course it depends on the problem you want to solve.
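Another option, independent of how the model is trained, is to pick the decision threshold afterwards so that the false positive rate measured on validation data stays under a chosen limit. The sketch below assumes made-up labels and scores, and threshold_for_max_fpr is a hypothetical helper built on scikit-learn's roc_curve. It does not guarantee the same FPR on unseen data.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_max_fpr(y_true, y_prob, max_fpr):
    """Lowest threshold whose validation FPR is <= max_fpr.

    Returns (threshold, fpr, tpr). If only index 0 qualifies, the returned
    threshold lies above every score, i.e. "never predict positive".
    """
    fpr, tpr, thresholds = roc_curve(y_true, y_prob, drop_intermediate=False)
    admissible = np.flatnonzero(fpr <= max_fpr)
    idx = admissible[-1]  # fpr is non-decreasing, so this has the highest TPR
    return thresholds[idx], fpr[idx], tpr[idx]

# Hypothetical validation labels and predicted probabilities of class 1.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90])

threshold, fpr_at_t, tpr_at_t = threshold_for_max_fpr(y_true, y_prob, max_fpr=0.2)
y_pred = (y_prob >= threshold).astype(int)
print(f"threshold={threshold:.2f}  FPR={fpr_at_t:.2f}  TPR={tpr_at_t:.2f}")
```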
I am new to machine learning.
I have a dataset with highly unbalanced classes (dominated by the negative class) and more than 2K numeric features; the target is binary (0 or 1). I have trained a logistic regression and I am getting an accuracy of 89%, but the confusion matrix shows that the number of true positives is very low. Below are the scores of my model:
Accuracy Score : 0.8965989500114129
Precision Score : 0.3333333333333333
Recall Score : 0.029545454545454545
F1 Score : 0.05427974947807933
How can I increase my true positives? Should I be using a different classification model?
I have tried PCA and represented my data with 2 components. It increased the model accuracy to approximately 90%, but the number of true positives decreased again.
There are several ways to do this:
You can change your model and test whether it performs better or not.
You can set a different prediction threshold: here I guess you predict 0 if the output of your regression is <0.5; you could change the 0.5 to 0.25, for example. It would increase your true positive rate, but of course at the price of some more false positives.
You can duplicate every positive example in your training set so that your classifier sees the classes as roughly balanced.
You could change the loss of the classifier in order to penalize false negatives more heavily (this is actually pretty close to duplicating the positive examples in your dataset).
I'm sure many other tricks could apply; this is just my favorite short-list.
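Two of these suggestions (a heavier penalty on the rare class and a lower threshold) can be sketched together with scikit-learn. The data here are synthetic, generated with make_classification as a stand-in for the real dataset, and class_weight="balanced" is just one way to reweight the loss.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives), not the asker's dataset.
X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# class_weight="balanced" penalizes errors on the rare class more heavily,
# which is similar in effect to duplicating the positive examples.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
prob_pos = clf.predict_proba(X_test)[:, 1]

# Compare the default threshold with a lower one.
results = {}
for threshold in (0.5, 0.25):
    y_pred = (prob_pos >= threshold).astype(int)
    results[threshold] = (
        recall_score(y_test, y_pred),
        precision_score(y_test, y_pred, zero_division=0),
    )
    print(f"threshold={threshold}: recall={results[threshold][0]:.2f}, "
          f"precision={results[threshold][1]:.2f}")
```

Lowering the threshold can only add predicted positives, so recall never drops, while precision usually does.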
I'm assuming that your purpose is to obtain a model with good classification accuracy on some test set, regardless of the form of that model.
In that case, if you have access to the computational resources, try gradient-boosted trees. That's an ensemble classifier that fits decision trees one after another, each new tree correcting the errors of the trees before it, and combines their outputs to make predictions. As far as I know, it can give good results with unbalanced class counts.
scikit-learn has the class sklearn.ensemble.GradientBoostingClassifier for this. I have not used that particular one, but I use the regression version often and it seems good. I'm pretty sure MATLAB has this as a package too, if you have access.
2k features might be difficult for the scikit-learn algorithm. I don't know, I've never tried.
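A minimal sketch of trying GradientBoostingClassifier, on small synthetic data with 20 features (an assumption made to keep the example fast; it says nothing about how the model behaves with 2k features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives), not the asker's dataset.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Default settings (100 trees); tuning is left out of this sketch.
gbc = GradientBoostingClassifier(random_state=0)
gbc.fit(X_train, y_train)
y_pred = gbc.predict(X_test)

# Per-class precision and recall say more than accuracy on imbalanced data.
print(classification_report(y_test, y_pred, zero_division=0))
```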
What is the size of your dataset? How many rows are we talking about here?
Your dataset is not balanced, so it's kind of normal for a simple classification algorithm to predict the majority class most of the time and give you an accuracy of 90%. Can you collect more data that will have more positive examples in it?
Or just try oversampling/undersampling and see if that helps.
You can also use a penalized version of the algorithm, which imposes a penalty whenever the wrong class is predicted. That may help.
You can try many different solutions.
If you have quite a lot of data points, for instance 2k 1s and 20k 0s, you can drop the extra 0s and keep only 2k of them, then train on that. You can also train multiple models, each on a different set of 2k 0s and the same set of 2k 1s, and make the decision based on all of the models.
You can also try adding weights at the output layer. For instance, if you have 10 times more 0s than 1s, try multiplying the predicted value for the 1s by 10.
You could probably also try increasing dropout.
And so on.
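The idea of training several models on different subsets of the majority class can be sketched like this. Logistic regression, five models, and probability averaging are all assumptions made for the example; the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 10% positives); a stand-in for real data.
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

rng = np.random.default_rng(0)
pos_idx = np.flatnonzero(y_train == 1)
neg_idx = np.flatnonzero(y_train == 0)

# Train several models, each on all positives plus a different random
# subset of negatives of the same size.
models = []
for _ in range(5):
    neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    idx = np.concatenate([pos_idx, neg_sample])
    model = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
    models.append(model)

# Combine the models by averaging their predicted probabilities.
avg_prob = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
y_pred = (avg_prob >= 0.5).astype(int)
print("predicted positives:", int(y_pred.sum()), "of", len(y_pred))
```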
I am trying to create a binary classification model for an imbalanced dataset using Random Forest (class 0: 84K, class 1: 16K). I have tried using class_weight = 'balanced', class_weight = {0:1, 1:5}, downsampling and oversampling, but none of these seem to work. My metrics are usually in the range below:
Accuracy = 66%
Precision = 23%
Recall = 44%
I would really appreciate any help on this! Thanks
There are lots of ways to improve classifier behavior. If you think your data are balanced (or rather, your weighting method balances them enough), then consider expanding your forest, either with deeper trees or with more trees.
Try other methods like SVM or ANN, and see how they compare.
Try stratified sampling for the dataset, so that the same class ratio is kept in both the training and the test set. Then use the balanced class weight, which you have already tried. If you want to improve the accuracy, there are many other ways.
1) First, be sure that the dataset being provided is accurate or verified.
2) You can increase the accuracy by playing with the probability threshold: in binary classification, make a prediction only if the model is more than 0.7 confident, and otherwise make none. The drawback of this approach is NULL values, i.e. many examples get no prediction because the algorithm is not confident enough, but for a business model it is a good approach because people prefer fewer wrong predictions from their model.
3) Use stratified sampling to divide the training and testing datasets so that the class ratio stays constant, rather than a plain train_test_split: a stratified splitter returns the indices for training and testing. You can also play with cross-validation (different numbers of iterations).
4) For the confusion matrix, have a look at the precision score per class and see which class is predicted more often (I believe applying a threshold would solve this problem).
5) Try other classifiers: logistic regression, SVM (linear or with another kernel, i.e. LinearSVC or SVC), Naive Bayes. In most binary classification cases I have seen, logistic regression and SVC perform ahead of other algorithms, but try the approaches above first.
6) Make sure to check the best parameters for the fit, i.e. the choice of hyperparameters (using grid search with a couple of learning rates, different kernels, class weights or other parameters). If it is text classification, are you applying CountVectorizer with TF-IDF (and have you tried tuning max_df and stop-word removal)?
If you have tried all of these, then re-examine whether the algorithm itself is appropriate.
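Points 2 and 3 can be sketched as follows. The labels are made up in the same 84:16 ratio as in the question, and predict_or_abstain is a hypothetical helper that returns -1 where the model is not confident enough; the 0.7 confidence level is taken from point 2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with the same 84:16 class ratio as in the question.
y = np.array([0] * 84 + [1] * 16)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
print("positives in train:", int(y_train.sum()), "of", len(y_train))
print("positives in test: ", int(y_test.sum()), "of", len(y_test))

def predict_or_abstain(prob_pos, confidence=0.7):
    """Return 1 or 0 only when the model is confident enough, else -1."""
    prob_pos = np.asarray(prob_pos)
    labels = np.full(prob_pos.shape, -1)
    labels[prob_pos >= confidence] = 1       # confident positive
    labels[prob_pos <= 1 - confidence] = 0   # confident negative
    return labels

print(predict_or_abstain([0.9, 0.5, 0.1]))
```

Metrics computed only on the non-abstained examples will look better, so the share of abstentions should be reported next to them.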
I'm using ensemble methods (random forest, xgbclassifier, etc) for classification.
One important aspect is feature importance prediction, which is like below:
Importance
Feature-A 0.25
Feature-B 0.09
Feature-C 0.08
.......
This model achieves an accuracy score of around 0.85. Feature-A is obviously the dominant feature, so I decided to remove Feature-A and run the calculation again.
However, after removing Feature-A, I still got good performance, with accuracy around 0.79.
This doesn't make sense to me: if Feature-A contributes 25% to the model, why is the accuracy score barely affected when it is removed?
I know ensemble methods have the advantage of combining 'weak' features into 'strong' ones, so does the accuracy score mostly rely on aggregation and is it therefore less sensitive to the removal of an important feature?
Thanks
It's possible there are other features that are redundant with Feature A. For instance, suppose that features G, H, I are redundant with feature A: if you know the values of features G, H, I, then the value of feature A is pretty much determined.
That would be consistent with your results. If we include feature A, the model will learn to use it, as it's very simple to get excellent accuracy using just feature A and ignoring features G, H, I, so it'll have excellent accuracy, high importance for feature A, and low importance for features G, H, I. If we exclude feature A, the model can still get almost-as-good accuracy by using features G, H, I, so it'll still have very good accuracy (though the model might become more complicated, because the relationship between G, H, I and the class is more complicated than the relationship between A and the class).
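This effect is easy to reproduce on synthetic data. In the sketch below (all names and numbers are made up for illustration), the label depends only on feature A, and feature G is a noisy copy of A. Dropping A should then cost only a little accuracy, because G carries nearly the same information.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 2000
feature_a = rng.normal(size=n)
feature_g = feature_a + rng.normal(scale=0.1, size=n)  # noisy copy of A
noise = rng.normal(size=(n, 3))                         # irrelevant features
y = (feature_a > 0).astype(int)                         # label depends on A only

X_with_a = np.column_stack([feature_a, feature_g, noise])
X_without_a = np.column_stack([feature_g, noise])

def cv_accuracy(X):
    """Mean 3-fold cross-validated accuracy of a small random forest."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()

acc_with_a = cv_accuracy(X_with_a)
acc_without_a = cv_accuracy(X_without_a)

# Importances of the model that has access to A (columns: A, G, 3x noise).
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_with_a, y)
importances = forest.feature_importances_

print("accuracy with A:   ", round(acc_with_a, 3))
print("accuracy without A:", round(acc_without_a, 3))
print("importances [A, G, noise...]:", importances.round(2))
```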
I'm using scikit-learn to perform cross-validation with StratifiedKFold to compute the F1 score, but it says that for some of my labels the sum of true positives and false positives is equal to zero. I thought using StratifiedKFold should prevent this? Why am I getting this problem?
Also, is there a way to get the confusion matrix from the cross_val_score function?
Your classifier is probably classifying all data points as negative, so there are no predicted positives. You can check whether that is the case by looking at the confusion matrix (docs and example here). It's hard to tell what is happening without information about your data and choice of classifier, but common causes include:
A bug in your code. Check your training data contains negative data points, and that these data points contain non-zero features.
Inappropriate classifier parameters. If using Naive Bayes, check your class biases. If using SVM, try using grid search over parameter values.
The sklearn classification_report function may come in handy (docs).
Re your second question: stratification ensures that each fold contains roughly the same proportion of data points from all classes. This does not mean your classifier will perform sensibly.
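On getting a confusion matrix out of cross-validation: cross_val_score only returns scores, but cross_val_predict returns the out-of-fold prediction for every sample, which can then be passed to confusion_matrix and classification_report. A minimal sketch on synthetic data, with logistic regression as a placeholder classifier:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data (about 10% positives), for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each sample is predicted by the model trained on the folds that exclude it.
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)

cm = confusion_matrix(y, y_pred)  # rows: true class, columns: predicted class
print(cm)
print(classification_report(y, y_pred, zero_division=0))
```

A column of zeros in the matrix means that class was never predicted, which is the situation the F1 warning describes.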
Update:
In a classification task (and especially when class imbalance is present) you are trading off precision for recall. Depending on your application, you can set your classifier so it does well most of the time (i.e. high accuracy) or so that it can detect the few points that you care about (i.e. high recall of the smaller classes). For example, if the task is to forward support emails to the right department, you want high accuracy. It is somewhat acceptable to misclassify the kind of email you get once a year, because you only upset one person. If your task is to detect posts by sexual predators on a children's forum, you definitely do not want to miss any of them, even if the price is that a few posts will get incorrectly flagged. Bottom line: you should optimise for your application.
Are you micro or macro averaging recall? In the former case, more weight will be given to the frequent classes (which is similar to optimising for accuracy), and in the latter all classes will have the same weight.
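The difference between the two averages shows up on a small made-up example with eight samples of class 0 and two of class 1, where one of the rare samples is missed:

```python
from sklearn.metrics import recall_score

# Hypothetical labels: class 0 is frequent, class 1 is rare.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Micro: pool all samples, so the frequent class dominates (9 of 10 correct).
micro = recall_score(y_true, y_pred, average="micro")

# Macro: average the per-class recalls (1.0 for class 0, 0.5 for class 1).
macro = recall_score(y_true, y_pred, average="macro")

print(f"micro recall={micro:.2f}, macro recall={macro:.2f}")
```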