How to evaluate a PyTorch model using metrics like precision and recall? - python

I have trained a simple PyTorch neural network on some data, and now wish to test and evaluate it using metrics like accuracy, recall, F1 and precision. I searched the PyTorch documentation thoroughly and could not find any classes or functions for these metrics. I then tried converting the predicted labels and the actual labels to numpy arrays and using scikit-learn's metrics, but the predicted labels don't seem to be either 0 or 1 (my labels), but instead continuous values. Because of this, the scikit-learn metrics don't work.
The fast.ai documentation didn't make much sense either; I could not understand which class to inherit from for precision etc. (although I was able to calculate accuracy). Any help would be much appreciated.

Usually, in a binary classification setting, your neural network will output the probability that the event occurs (e.g., if you are using a sigmoid activation and a single neuron at the output layer), which is a continuous value between 0 and 1. To evaluate precision and recall of your model (e.g., with scikit-learn's precision_score and recall_score), you need to convert your model's probabilities into binary values. This is achieved by specifying a threshold on your model's output probability. (For an overview of thresholding, please take a look at this reference: https://developers.google.com/machine-learning/crash-course/classification/thresholding)
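For instance, a minimal sketch (the tensors probs and y_true are hypothetical placeholders for your model's sigmoid outputs and your true labels) could look like this:
import torch
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical tensors; replace with your model's sigmoid outputs and true labels
probs = torch.tensor([0.1, 0.8, 0.6, 0.3])    # continuous outputs in [0, 1]
y_true = torch.tensor([0, 1, 1, 0])           # ground-truth 0/1 labels

threshold = 0.5                               # chosen decision threshold
y_pred = (probs >= threshold).int().numpy()   # map probabilities to hard 0/1 labels

print(precision_score(y_true.numpy(), y_pred))
print(recall_score(y_true.numpy(), y_pred))
print(f1_score(y_true.numpy(), y_pred))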
Scikit-learn's precision_recall_curve (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html) is commonly used to understand how the precision and recall metrics behave for different probability thresholds. By analysing the precision and recall values per threshold, you will be able to pick the best threshold for your problem (you may want higher precision, so you will aim for higher thresholds, e.g., 90%; or you may want a balanced precision and recall, in which case you should look for the threshold that returns the best F1 score for your problem). A good overview of the topic may be found in the following reference: https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/
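As a rough sketch (reusing the y_true and probs placeholders from above), you could pick the threshold that maximizes the F1 score like this:
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true and probs as in the sketch above (true 0/1 labels and predicted probabilities)
precision, recall, thresholds = precision_recall_curve(y_true.numpy(), probs.numpy())

# precision and recall have one more entry than thresholds, so drop the last point
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best = np.argmax(f1)
print("best threshold:", thresholds[best], "F1:", f1[best])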
I hope this may be of help.

Related

How to perform supervised training of a deep neural network when the target label has only 0 and 1?

I am trying to train a Deep Neural Network (DNN) with labeled data. The labels are encoded in such a way that they only contain values 0 and 1. The shape of the encoded label is 5 x 5 x 232. About 95% of the values in the label are 0 and the rest are 1. Currently, I am using the binary_crossentropy loss function to train the network.
What is the best technique to train the DNN in such a scenario? Is the choice of binary_crossentropy as the loss function appropriate in this case? Any suggestions to improve the performance of the model are welcome.
You can try MSE loss. If you want to stick to binary cross-entropy (used in binary classification), consider using label smoothing.
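For instance, a minimal sketch in Keras (assuming you stay with binary cross-entropy; the model variable is a placeholder for your existing network) could enable label smoothing like this:
import tensorflow as tf

# Label smoothing pulls the hard 0/1 targets slightly towards 0.5,
# e.g. with 0.1 smoothing: 0 -> 0.05 and 1 -> 0.95
loss = tf.keras.losses.BinaryCrossentropy(label_smoothing=0.1)

# model is assumed to be your existing Keras model with a sigmoid output
model.compile(optimizer='adam', loss=loss, metrics=['accuracy'])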
You may use two alternative loss functions instead of binary cross-entropy. They are:
Hinge Loss
An alternative to cross-entropy for binary classification problems is the hinge loss function, primarily developed for use with Support Vector Machine (SVM) models.
It is intended for use with binary classification where the target values are in the set {-1, 1}.
Squared Hinge Loss
For more detail on loss functions with examples, click here; a minimal sketch of using these losses is shown below.
Hope this is helpful, happy learning.
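A minimal sketch of using these losses in Keras (assuming a tanh output layer; y_train, x_train and model are placeholder names for your data and network):
import tensorflow as tf

# Hinge losses expect targets in {-1, 1}, so convert the 0/1 labels first
y_train_hinge = y_train * 2 - 1

model.compile(optimizer='adam', loss='hinge')   # or loss='squared_hinge'
model.fit(x_train, y_train_hinge, epochs=10)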
binary_crossentropy as the loss is fine.
Don't use accuracy as your metric, because the model will just predict everything as label 0 and still get 95% accuracy. Instead use the F1 score (or precision or recall).
Use a weighted loss: i.e. penalize mistakes on class 1 more heavily than mistakes on class 0.
Instead of class weights you can also use methods like oversampling the minority class (techniques like SMOTE).
How to calculate class weight
You can use sklearn.utils.class_weight to calculate weights from your labels. Check this answer.
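A minimal sketch with sklearn.utils.class_weight (assuming y_train is a flat array of 0/1 labels and x_train, model are placeholders for your data and Keras network):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))   # e.g. roughly {0: 0.53, 1: 10} for 95% zeros

# Keras will then penalize mistakes on the rare class more heavily
model.fit(x_train, y_train, epochs=10, class_weight=class_weight)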
In scenarios where you have highly imbalanced data, I would suggest going with a Random Forest combined with up-sampling. This approach up-samples the minority class and hence improves the model's performance.
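As a rough illustration (X and y are placeholders for a feature matrix and 0/1 labels, with class 1 as the minority), simple up-sampling with a Random Forest might look like this:
import numpy as np
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier

# Split the data by class; class 1 is assumed to be the minority class
X_min, y_min = X[y == 1], y[y == 1]
X_maj, y_maj = X[y == 0], y[y == 0]

# Up-sample the minority class until it matches the majority class size
X_min_up, y_min_up = resample(X_min, y_min, replace=True,
                              n_samples=len(y_maj), random_state=42)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_bal, y_bal)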

Determine the best classification threshold value for deep learning model

How do I determine the best threshold value for a deep learning model? I am working on predicting epileptic seizures using a CNN. I want to determine the best threshold for my deep learning model in order to get the best results.
I have been trying for more than 2 weeks to find out how I can do it.
Any help would be appreciated.
code
history = model.fit_generator(
    generate_arrays_for_training(indexPat, filesPath, end=75),
    validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
    steps_per_epoch=int(len(filesPath) - int(len(filesPath) / 100 * 25)),
    validation_steps=int(len(filesPath) - int(len(filesPath) / 100 * 75)),
    verbose=2,
    epochs=50, max_queue_size=2, shuffle=True, callbacks=[callback, call])
In general, choosing the right classification threshold depends on the use case. You should remember that choosing the threshold is not part of hyperparameter tuning; the value of the classification threshold greatly impacts the behaviour of the model after you train it.
If you increase it, you want your model to be very sure about its predictions, which means you will be filtering out false positives - you will be targeting precision. This might be the case when your model is part of a mission-critical pipeline where a decision made based on a positive output of the model is costly (in terms of money, time, human resources, computational resources, etc.).
If you decrease it, your model will flag more examples as positive, which will allow you to explore more examples that are potentially positive (you target recall). This is important when a false negative is disastrous, e.g. in medical cases (you would rather check whether a low-probability patient has cancer than ignore him and find out later that he was indeed sick).
For more examples please see When is precision more important over recall?
Now, choosing between recall and precision is a trade-off, and you have to make it based on your situation. Two tools to help you achieve this are ROC and Precision-Recall curves (see How to Use ROC Curves and Precision-Recall Curves for Classification in Python), which indicate how the model handles false positives and false negatives depending on the classification threshold.
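For instance, a minimal sketch with scikit-learn (y_true and probs are placeholder names for the true labels and predicted probabilities) shows how the trade-off moves with the threshold:
from sklearn.metrics import roc_curve, precision_recall_curve

fpr, tpr, roc_thresholds = roc_curve(y_true, probs)
precision, recall, pr_thresholds = precision_recall_curve(y_true, probs)

# Inspect how the false/true positive rates move as the threshold changes
for thr, f, t in zip(roc_thresholds, fpr, tpr):
    print(f"threshold={thr:.2f}  FPR={f:.2f}  TPR={t:.2f}")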
Many ML algorithms predict a score for class membership which needs to be interpreted before it can be mapped to a class label. You achieve this by using a threshold, such as 0.5, whereby values >= the threshold are mapped to one class and the rest to the other.
Class 1 = Prediction >= 0.5; Class 0 = Prediction < 0.5
It's crucial to find the best threshold value for the kind of problem you're working on and not just assume a classification threshold of, e.g., 0.5.
Why? The default threshold can often result in pretty poor performance for classification problems with severe class imbalance.
ML thresholds are problem-specific and must be fine-tuned. Read a short article about it here.
One of the best ways to get the best results from your deep learning model is to tune the threshold used to map probabilities to a class label.
The best threshold for the CNN can be calculated directly using ROC Curves and Precision-Recall Curves. In some cases, you can use a grid search to fine-tune the threshold and find the optimal value.
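A minimal sketch of such a grid search (y_true and probs are placeholder names for the true labels and the CNN's predicted probabilities) could be:
import numpy as np
from sklearn.metrics import f1_score

thresholds = np.arange(0.0, 1.0, 0.01)            # candidate thresholds
scores = [f1_score(y_true, (probs >= t).astype(int)) for t in thresholds]
best = int(np.argmax(scores))
print("best threshold:", thresholds[best], "F1:", scores[best])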
Alternatively, the code below, using the deepchecks library, will help you check which option gives the best results:
from deepchecks.checks.performance import PerformanceReport

# ds: a deepchecks Dataset wrapping your evaluation data; clf: the trained model
check = PerformanceReport()
check.run(ds, clf)

how to select the metric to optimize in sklearn's fit function?

When using TensorFlow to train a neural network I can set the loss function arbitrarily. Is there a way to do the same in sklearn when training an SVM? Let's say I want my classifier to only optimize sensitivity (regardless of whether that makes sense), how would I do that?
This is not possible with Support Vector Machines, as far as I know. With other models you might either change the loss that is optimized, or change the classification threshold on the predicted probability.
SVMs, however, minimize the hinge loss, and they do not model the probability of classes but rather their separating hyperplane, so there is not much room for manual adjustments.
If you need to focus on sensitivity or specificity, use a different model that allows maximizing that function directly, or that allows predicting class probabilities (logistic regression or tree-based methods, for example).
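As a rough sketch, you could for example use a logistic regression and lower the decision threshold on its predicted probabilities to favour sensitivity (X_train, y_train and X_test are placeholder names):
from sklearn.linear_model import LogisticRegression

# Fit a model that exposes class probabilities
clf = LogisticRegression(class_weight='balanced').fit(X_train, y_train)

# Lower the threshold from the default 0.5 so more samples are flagged positive,
# trading specificity for sensitivity (recall)
probs = clf.predict_proba(X_test)[:, 1]
y_pred = (probs >= 0.3).astype(int)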

Does tf.keras.metrics.AUC work on multi-class problems?

I have a multi-class classification problem and I want to measure AUC on training and test data.
tf.keras has implemented the AUC metric (tf.keras.metrics.AUC), but I'm not able to see whether this metric can safely be used in multi-class problems. Even the example "Classification on imbalanced data" on the official web page is dedicated to a binary classification problem.
I have implemented a CNN model that predicts six classes, having a softmax layer that gives the probabilities of all the classes. I used this metric as follows:
self.model.compile(loss='categorical_crossentropy',
                   optimizer=Adam(hp.get("learning_rate")),
                   metrics=['accuracy', AUC()])
and the code executed without any problem. However, sometimes I see results that seem quite strange to me. For example, the model reported an accuracy of 0.78333336 and an AUC equal to 0.97327775. Is this possible? Can a model have a low accuracy and such a high AUC?
I wonder whether, although the code does not give any error, the AUC metric is being computed incorrectly.
Can somebody confirm whether or not this metric supports multi-class classification problems?
There is the argument multi_label which is a boolean inside your tf.keras.metrics.AUC call.
If True (not the default), multi-label data will be treated as such, and so AUC is computed separately for each label and then averaged across labels.
When False (the default), the data will be flattened into a single label before AUC computation. In the latter case, when multi-label data is passed to AUC, each label-prediction pair is treated as an individual data point.
The documentation recommends to set it to False for multi-class data.
e.g., for multi-label data: tf.keras.metrics.AUC(multi_label=True)
See the AUC Documentation for more details.
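For the six-class softmax model in the question, a minimal sketch (model is a placeholder for the existing CNN) would be:
import tensorflow as tf

# For one-hot multi-class targets, the default multi_label=False flattens
# label-prediction pairs before computing AUC
auc = tf.keras.metrics.AUC(multi_label=False)

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy', auc])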
AUC can have a higher score than accuracy.
Additionally, you can use AUC to decide the cutoff threshold for a binary classifier (this cutoff is 0.5 by default). Though there are more technical ways to decide this cutoff, you could simply increase it from 0 to 1 to find the value which maximizes your accuracy (this is a naive solution, and I recommend you read this https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/One_ROC_Curve_and_Cutoff_Analysis.pdf for an in-depth explanation of cutoff analysis).

Targeting a specific metric to optimize in tensorflow

Is there any way we can target a specific metric to optimize using the built-in TensorFlow optimizers? If not, how can this be achieved? For example, if I want to focus specifically on maximizing the F-score of my classifier, is it possible to do so in TensorFlow?
estimator = tf.estimator.LinearClassifier(
    feature_columns=feature_cols,
    config=my_checkpointing_config,
    model_dir=output_dir,
    optimizer=lambda: tf.train.FtrlOptimizer(
        learning_rate=tf.train.exponential_decay(
            learning_rate=0.1,
            global_step=tf.train.get_or_create_global_step(),
            decay_steps=1000,
            decay_rate=0.96)))
I am trying to optimize my classifier specifically on the basis of getting a better F-score. Despite using a decaying learning_rate and 300 training steps I am getting inconsistent results. While checking the metrics in the logs, I found the behavior of precision, recall and accuracy to be very erratic. Despite increasing the number of training steps, there was no significant improvement. So I thought that if I could make the optimizer focus more on improving the F-score as a whole, I might get better results. Hence the question. Is there something that I am missing?
In classification settings, optimizers minimize the loss, e.g. cross-entropy; quantities like accuracy, F-score, precision, recall etc. are essentially business metrics, and they are not (and cannot be) directly optimized during the optimization process.
This is a question that pops up rather frequently here in SO in various disguises; here are some threads which will hopefully help you disentangle the concepts (although they refer to accuracy, precision, and recall, the argument is exactly the same for the F-score):
Loss & accuracy - Are these reasonable learning curves?
Cost function training target versus accuracy desired goal
Is there an optimizer in keras based on precision or recall instead of loss?
The bottom line, adapting one of my own (linked) answers:
Loss and metrics like accuracy or F-score are different things; roughly speaking, metrics like accuracy & F-score are what we are actually interested in from a business perspective, while the loss is the objective function that the learning algorithms (optimizers) are trying to minimize from a mathematical perspective. Even more roughly speaking, you can think of the loss as the "translation" of the business objective (accuracy, F-score etc) to the mathematical domain, a translation which is necessary in classification problems (in regression ones, usually the loss and the business objective are the same, or at least can be the same in principle, e.g. the RMSE)...
One could technically adjust the threshold parameter that distinguishes between class 1 and class 0. For example, in logistic regression, if the threshold is lowered from 0.5 to 0.3, recall would increase and precision would decrease, and vice versa. But as others have mentioned, this is not the same as optimizing ("minimizing") the loss function.
