I have a multi-class classification problem and I want to measure AUC on training and test data.
tf.keras has implemented AUC metric (tf.keras.metrics.AUC), but I'm not be able to see whether this metric could safely be used in multi-class problems. Even, the example "Classification on imbalanced data" on the official Web page is dedicated to a binary classification problem.
I have implemented a CNN model that predicts six classes, having a softmax layer that gives the probabilities of all the classes. I used this metric as follows
self.model.compile(loss='categorical_crossentropy',
optimizer=Adam(hp.get("learning_rate")),
metrics=['accuracy', AUC()]),
and the code was executed without any problem. However, sometimes I see some results that are quite strange for me. For example, the model reported an accuracy of 0.78333336 and AUC equal to 0.97327775, Is this possible? Can a model have a low accuracy and an AUC so high?
I wonder that, although the code does not give any error, the AUC metric is computing wrong.
Somebody may confirm me whether or not this metrics support multi-class classification problems?
There is the argument multi_label which is a boolean inside your tf.keras.metrics.AUC call.
If True (not the default), multi-label data will be treated as such, and so AUC is computed separately for each label and then averaged across labels.
When False (the default), the data will be flattened into a single label before AUC computation. In the latter case, when multi-label data is passed to AUC, each label-prediction pair is treated as an individual data point.
The documentation recommends to set it to False for multi-class data.
e.g.: tf.keras.metrics.AUC(multi_label = True)
See the AUC Documentation for more details.
AUC can have a higher score than accuracy.
Additionally, you can use AUC to decide the cutoff threshold for a binary classifier(this cutoff is by default 0.5). Though there are more technical ways to decide this cutoff, you could simply simply increase it from 0 to 1 to find the value which maximizes your accuracy(this is a naive solution and 1 recommend you to read this https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/One_ROC_Curve_and_Cutoff_Analysis.pdf for an in depth explanation on cutoff analysis )
Related
I'm using BayesSearchCV from scikit-optimize to train a model on a fairly imbalanced dataset. From what I'm reading precision or ROC AUC would be the best metrics for imbalanced dataset. In my code:
knn_b = BayesSearchCV(estimator=pipe, search_spaces=search_space, n_iter=40, random_state=7, scoring='roc_auc')
knn_b.fit(X_train, y_train)
The number of iterations is just a random value I chose (although I get a warning saying I already reached the best result, and there is not a way to early stop as far as I'm aware?). For the scoring parameter, I specified roc_auc, which I'm assuming it will be the primary metric to monitor for the best parameter in the results. So when I call knn_b.best_params_, I should have the parameters where the roc_auc metrics is higher. Is that correct?
My confusion is when I look at the results using knn_b.cv_results_. Shouldn't the mean_test_score be the roc_auc score because of the scoring param in the BayesSearchCV class? What I'm doing it plotting the results and seeing how each combination of params performed.
sns.relplot(
data=knn_b.cv_results_, kind='line', x='param_classifier__n_neighbors', y='mean_test_score',
hue='param_scaler', col='param_classifier__p',
)
When I try to use to roc_auc_score() function on the true and predicted values, I get something completely different.
Is the mean_test_score here different? How would I be able to get the individual/mean roc_auc score of each CV/split of each iteration? Similarly for when I want to use RandomizedSearchCV or GridSearchCV.
EDIT: tldr; I want to know what's being computed exactly in mean_test_score. I thought it was roc_auc because of the scoring param, or accuracy, but it seems to be neither.
mean_test_score is the AUROC, because of your scoring parameter, yes.
Your main problem is that the ROC curve (and the area under it) require the probability predictions (or other continuous score), not the hard class predictions. Your manual calculation is thus incorrect.
You shouldn't expect exactly the same score anyway. Your second score is on the test set, and the first score is optimistically biased by the hyperparameter selection.
I have trained a simple Pytorch neural network on some data, and now wish to test and evaluate it using metrics like accuracy, recall, f1 and precision. I searched the Pytorch documentation thoroughly and could not find any classes or functions for these metrics. I then tried converting the predicted labels and the actual labels to numpy arrays and using scikit-learn's metrics, but the predicted labels don't seem to be either 0 or 1 (my labels), but instead continuous values. Because of this scikit-learn metrics don't work.
Fast.ai documentation didn't make much sense either, I could not understand which class to inherit for precision etc (although I was able to calculate accuracy). Any help would be much desperately appreciated.
Usually, in a binary classification setting, your neural network will output the probability that the event occurs (e.g., if you are using sigmoid activation and a single neuron at the output layer), which is a continuous value between 0 and 1. To evaluate precision and recall of your model (e.g., with scikit-learn's precision_score and recall_score), it is required that you convert the probability of your model into binary value. This is achieved by specifying a threshold value for your model's probability. (For a overview about threshold, please take a look at this reference: https://developers.google.com/machine-learning/crash-course/classification/thresholding)
Scikit-learn's precision_recall_curve (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html) is commonly used to understand how precision and recall metrics behave for different probability thresholds. By analysing the precision and recall values per threshold, you will be able to specify the best threshold for your problem (you may want higher precision, so you will aim for higher thresholds, e.g., 90%; or you may want to have a balanced precision and recall, and you will need to check the threshold that returns the best f1 score for your problem). A good overview on the topic may be found in the following reference: https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/
I hope this may be of help.
When using tensorflow to train a neural network I can set the loss function arbitrarily. Is there a way to do the same in sklearn when training a SVM? Let's say I want my classifier to only optimize sensitivity (regardless of the sense of it), how would I do that?
This is not possible with Support Vector Machines, as far as I know. With other models you might either change the loss that is optimized, or change the classification threshold on the predicted probability.
SVMs however minimize the hinge loss, and they do not model the probability of classes but rather their separating hyperplane, so there is not much room for manual adjustements.
If you need to focus on Sensitivity or Specificity, use a different model that allows maximizing that function directly, or that allows predicting the class probabilities (thinking Logistic Regressions, Tree based methods, for example)
Is there any way we can target a specific metric to optimize using inbuilt tensorflow optimizers? If not, how to achieve this? For eg. If I want to focus only on maximizing F-score of my classifier specifically, is it possible to do so in tensorflow?
estimator = tf.estimator.LinearClassifier(
feature_columns=feature_cols,
config=my_checkpointing_config,
model_dir=output_dir,
optimizer=lambda: tf.train.FtrlOptimizer(
learning_rate=tf.train.exponential_decay(
learning_rate=0.1,
global_step=tf.train.get_or_create_global_step(),
decay_steps=1000,
decay_rate=0.96)))
I am trying to optimize my classifier specifically on the basis of getting a better F-score. Despite using the decaying learning_rate and 300 training steps I am getting inconsistent results. While checking the metrics in the logs, I found the behavior of precision, recall and accuracy to be very erratic. Despite increasing the number of training steps, there was no significant improvement. So I thought that if i could make the optimizer focus more on improving the F-score as a whole I might get better results. Hence the question. Is there something that I am missing?
In classification settings, optimizers minimize the loss, e.g. cross entropy; quantities like accuracy, F-score, precision, recall etc. are essentially business metrics, and they are not (and cannot be) directly minimized during the optimization process.
This is a question that pops up rather frequently here in SO in various disguises; here are some threads which will hopefully help you disentangle the concepts (although they refer to accuracy, precision, and recall, the argument is exactly the same for the F-score):
Loss & accuracy - Are these reasonable learning curves?
Cost function training target versus accuracy desired goal
Is there an optimizer in keras based on precision or recall instead of loss?
The bottom line, adapting one of my own (linked) answers:
Loss and metrics like accuracy or F-score are different things; roughly speaking, metrics like accuracy & F-score are what we are actually interested in from a business perspective, while the loss is the objective function that the learning algorithms (optimizers) are trying to minimize from a mathematical perspective. Even more roughly speaking, you can think of the loss as the "translation" of the business objective (accuracy, F-score etc) to the mathematical domain, a translation which is necessary in classification problems (in regression ones, usually the loss and the business objective are the same, or at least can be the same in principle, e.g. the RMSE)...
One could technically adjust the threshold parameter that distinguishes between class 1 and 0. For example, in logistic regression, if the threshold is lowered from 0.5 to 0.3, recall would decrease and precision would increase, and viceversa. But as others have mentioned, this is not the same as optimizing ("minimizing") the loss function.
I am trying to create a binary classification model for imbalance dataset using Random Forest - 0- 84K, 1- 16K. I have tried using class_weights = 'balanced', class_weights = {0:1, 1:5}, downsampling and oversampling but none of these seem to work. My metrics are usually in the below range:
Accuracy = 66%
Precision = 23%
Recall = 44%
I would really appreciate any help on this! Thanks
there are lots of ways to improve classifier behavior. If you think your data are balanced (or rather, your weight method balances them enough), then consider expanding your forest, either with deeper trees or more numerous trees.
Try other methods like SVM, or ANN, and see how they compare.
Try Stratified sampling for the dataset so that you can get the constant ration being taken in account for both the test and the training dataset. And then use the class weight balanced which you have already used. If you want the accuraccy improved there are tons other ways.
1) First be sure that the dataset being provided is accurate or verified.
2) You can increase the accuracy by playing with threshold of the probability (if in binary classification if its >0.7 confident then do a prediction else wise don't , the draw back in this approach would be NULL values or mostly being not predicting as algorithm is not confident enough, but for a business model its a good approach because people prefer less False Negatives in their model.
3) Use Stratified Sampling to equally divide the training and the testing dataset, so that constant ration is being divided. rather than train_test_splitting : stratified sampling will return you the indexes for training and testing . You can play with the (cross_validation : different iteration)
4) For the confusion matrix, have a look at the precision score per class and see which class is showing more( I believe if you apply threshold limitation it would solve the problem for this.
5) Try other classifiers , Logistic, SVM(linear or with other kernel) : LinearSVC or SVC , NaiveBayes. As per seen in most cases of Binary classification Logistc and SVC seems to be performing ahead of other algorithms. Although try these approach first.
6) Make sure to check the best parameters for the fitting such as choice of Hyper Parameters (using Gridsearch with couple of learning rates or different kernels or class weights or other parameters). If its textual classification are you applying CountVectorizer with TFIDF (and have you played with max_df and stop_words removal) ?
If you have tried these, then possibly be sure about the algorithm first.