I'm using AUC metrics to do multilabel classification. Since Keras has removed predict_classes for obtaining the predicted classes, I just use a threshold of 0.5 to get the output classes. However, as I understand it, for AUC the threshold should not be 0.5 for an imbalanced data set. How can I get the threshold that was used for training the model?
Besides, I know that AUC is used for binary classification. Can I just use it for a multilabel problem? How should the threshold be calculated: by taking the average across labels, or some other way?
You can use AUC for the multi-label problem; see the example below.
import numpy as np
import tensorflow as tf

# Random binary labels and predictions for 100 samples and 4 labels
y_true = np.random.randint(0, 2, (100, 4))
y_pred = np.random.randint(0, 2, (100, 4))

# multi_label=True computes the AUC per label and averages the results
m = tf.keras.metrics.AUC(multi_label=True, thresholds=[0, 0.5])
m(y_true, y_pred).numpy()
FYI, from TF 2.5 onwards it also supports logit predictions, via the from_logits argument.
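A hedged sketch of that (assuming TF >= 2.5 and the from_logits argument mentioned above; y_true as defined earlier):
# Raw logits: unbounded scores rather than probabilities in [0, 1]
logits = np.random.randn(100, 4)
m_logits = tf.keras.metrics.AUC(multi_label=True, from_logits=True)
m_logits.update_state(y_true, logits)
print(m_logits.result().numpy())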
I wanted to do cross-validation on a regression (non-classification) model and ended up getting mean accuracies of about 0.90. However, I don't know what metric the method uses to compute those accuracies. I know how splitting works in k-fold cross-validation; I just don't know the formula the scikit-learn library uses to calculate the accuracy of a prediction (I know how it works for a classification model, though). Can someone give me the metric/formula used by sklearn.model_selection.cross_val_score?
Thanks in advance.
from sklearn.model_selection import cross_val_score

def metrics_of_accuracy(classifier, X_train, y_train):
    # 10-fold cross-validation; scoring defaults to the estimator's .score()
    accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
    print(accuracies.mean())
    print(accuracies.std())
    return accuracies
By default, sklearn uses accuracy for classification and r2_score for regression when you use the model.score method (the same applies to cross_val_score). So in this case it is r2_score, whose formula is
r2 = 1 - (SSE(y_hat) / SSE(y_mean))
where
SSE(y_hat) is the sum of squared errors of the predictions, and
SSE(y_mean) is the sum of squared errors when every prediction is the mean of the actual values.
Yes. Also, you can compute the same metric with sklearn.metrics.r2_score(y_true, y_pred). This score is also called the coefficient of determination, or R-squared.
The formula is R2 = 1 - SS_res / SS_tot (originally linked as an image: https://i.stack.imgur.com/USaWH.png).
For more on this -
https://en.wikipedia.org/wiki/Coefficient_of_determination
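A minimal sketch (with made-up numbers) that checks the formula above against sklearn's r2_score:
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 2.0, 5.0, 6.0])
y_pred = np.array([2.5, 2.0, 5.5, 6.5])

sse_pred = np.sum((y_true - y_pred) ** 2)         # SSE(y_hat)
sse_mean = np.sum((y_true - y_true.mean()) ** 2)  # SSE(y_mean)
print(1 - sse_pred / sse_mean)                    # manual R-squared
print(r2_score(y_true, y_pred))                   # should print the same value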
I did a grid search on a logistic regression and set scoring to 'roc_auc'. grid_clf1.best_score_ gave me an AUC of 0.7557. After that I wanted to plot the ROC curve of the best model, but the ROC curve I saw had an AUC of 0.50. I do not understand this at all.
I looked into the predicted probabilities and saw that they were all 0.0 or 1.0. Hence, I think something went wrong here, but I cannot find what it is.
My code for the grid search CV is as follows:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

clf1 = Pipeline([('RS', RobustScaler()),
                 ('LR', LogisticRegression(random_state=1, solver='saga'))])
params = {'LR__C': np.logspace(-3, 0, 5),
          'LR__penalty': ['l1']}
grid_clf1 = GridSearchCV(clf1, params, scoring='roc_auc', cv=5, n_jobs=-1)
grid_clf1.fit(X_train, y_train)
grid_clf1.best_estimator_
grid_clf1.best_score_
So this gave an AUC of 0.7557 for the best model.
Then if I calculate the AUC for the model myself:
from sklearn.metrics import roc_auc_score

y_pred_proba = grid_clf1.best_estimator_.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_pred_proba))
This gave me an AUC of 0.50.
It looks like there are two problems with your example code:
You compare ROC AUC scores on different datasets: the train set is used during fitting, while the test set is used when roc_auc_score is called.
Scoring with cross-validation works slightly differently from a plain roc_auc_score call; best_score_ roughly expands to np.mean(cross_val_score(...)).
So, if you take that into account, you will get the same scoring values. You can use the colab notebook as a reference.
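A minimal sketch of that comparison (reusing grid_clf1, X_train, y_train, X_test, y_test from the question):
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

best = grid_clf1.best_estimator_

# Roughly what best_score_ reports: mean cross-validated ROC AUC on the train set
cv_auc = np.mean(cross_val_score(best, X_train, y_train, scoring='roc_auc', cv=5))

# ROC AUC on the held-out test set
test_auc = roc_auc_score(y_test, best.predict_proba(X_test)[:, 1])

print(cv_auc, test_auc)  # compare like with like before worrying about a gap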
I tried googling but could not find how to implement sklearn metrics like Cohen's kappa, ROC AUC, or F1 score in Keras as metrics for imbalanced data.
How can I implement an sklearn metric as a Keras metric?
Metrics in Keras and in sklearn mean different things.
In Keras, metrics are treated almost the same as losses: they get called during training at the end of each batch and each epoch, for reporting and logging purposes. An example use case is training with the loss 'mse' while still wanting to watch 'mae'; in that case you add 'mae' as a metric to the model.
In sklearn, metric functions are applied to finished predictions, as per the definition: "The metrics module implements functions assessing prediction error for specific purposes." While there is overlap, sklearn's statistical functions don't fit the definition of metrics in Keras: an sklearn metric can return a float, an array, or even a 2D array with both dimensions greater than 1, and nothing of that shape is produced by Keras during training.
As for your question, it depends on where you want the metric to be triggered:
At the end of each batch or each epoch
You can write a custom callback that is fired at the end of each batch or epoch; see the sketch after this answer.
After prediction
This is easier. Let Keras predict on the entire dataset, capture the result, and then feed the y_true and y_pred arrays to the respective sklearn metric.
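A minimal sketch of the callback approach (assuming a binary-classification model that outputs probabilities, and hypothetical validation arrays x_val, y_val):
import numpy as np
import tensorflow as tf
from sklearn.metrics import f1_score

class F1Callback(tf.keras.callbacks.Callback):
    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val = x_val
        self.y_val = y_val

    def on_epoch_end(self, epoch, logs=None):
        # Predict on the validation set, threshold at 0.5, then hand off to sklearn
        y_pred = (self.model.predict(self.x_val, verbose=0) > 0.5).astype(int)
        print(f"epoch {epoch}: val_f1 = {f1_score(self.y_val, y_pred):.4f}")

# model.fit(x_train, y_train, epochs=5, callbacks=[F1Callback(x_val, y_val)])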
Everything you need is in the confusion matrix. Compute the confusion matrix and follow the formulas below:
In practice, this is done as follows:
from sklearn.metrics import confusion_matrix

# NBC is a previously constructed classifier (e.g. a naive Bayes model)
NBC = NBC.fit(X_train, y_train)
cm = confusion_matrix(y_test, NBC.predict(X_test))
tn, fp, fn, tp = cm.ravel()
print('tn: ', tn)
print('fp: ', fp)
print('fn: ', fn)
print('tp: ', tp)
print('------------------')
print(cm)
and now:
total = tn + fp + fn + tp

# Observed agreement
p_0 = (tn + tp) / total
print('p_0: ', p_0)

# Chance agreement for each class
p_class0 = ((tn + fp) / total) * ((tn + fn) / total)
print('p_class0: ', p_class0)
p_class1 = ((fn + tp) / total) * ((fp + tp) / total)
print('p_class1: ', p_class1)

pe = p_class0 + p_class1
print('pe: ', pe)

# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)
kappa = (p_0 - pe) / (1 - pe)
print('kappa: ', kappa)
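As a sanity check, the hand-computed value can be compared against sklearn's built-in implementation:
from sklearn.metrics import cohen_kappa_score

# Should match the kappa computed from the confusion matrix above
print(cohen_kappa_score(y_test, NBC.predict(X_test)))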
I have two classifiers in Python, an SVM and a logistic regression.
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score

scaler = preprocessing.StandardScaler()
scaler.fit(synthetic_data)
synthetic_data = scaler.transform(synthetic_data)
test_data = scaler.transform(test_data)

svc = svm.SVC(tol=0.0001, C=100.0).fit(synthetic_data, synthetic_label)
predictedSVM = svc.predict(test_data)
print(accuracy_score(test_label, predictedSVM))

LRmodel = LogisticRegression(penalty='l2', tol=0.0001, C=100.0, random_state=1,
                             max_iter=1000, n_jobs=-1)
predictedLR = LRmodel.fit(synthetic_data, synthetic_label).predict(test_data)
print(accuracy_score(test_label, predictedLR))
I use the same input, but their accuracies are very different: the SVM sometimes predicts 1 for every test sample. The accuracy of the SVM is 0.45 and the accuracy of the logistic regression is 0.75. I changed the C parameter in different ways, but I still have problems.
It is because SVC uses an RBF (radial basis function) kernel by default (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), which is something different from linear classification.
If you want to use a linear kernel, add the parameter kernel='linear' to SVC.
If you want to keep using the RBF kernel, I suggest also tuning the gamma parameter.
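A minimal sketch of both suggestions, reusing the variable names from the question:
from sklearn import svm
from sklearn.metrics import accuracy_score

# Option 1: a linear kernel, comparable to logistic regression's linear boundary
svc_linear = svm.SVC(kernel='linear', tol=0.0001, C=100.0)
svc_linear.fit(synthetic_data, synthetic_label)
print(accuracy_score(test_label, svc_linear.predict(test_data)))

# Option 2: keep the RBF kernel but control gamma (here the 'scale' heuristic)
svc_rbf = svm.SVC(kernel='rbf', gamma='scale', C=100.0)
svc_rbf.fit(synthetic_data, synthetic_label)
print(accuracy_score(test_label, svc_rbf.predict(test_data)))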
I am working on a regression model based on the IMDB data to predict the IMDB rating. With my linear regression, I was unable to obtain the accuracy score.
My line of code:
metrics.accuracy_score(test_y, linear_predicted_rating)
Error :
ValueError: continuous is not supported
If I change that line to obtain the R2 score instead,
metrics.r2_score(test_y, linear_predicted_rating)
I am able to obtain R2 without any error.
Any clue why I am seeing this?
Thanks.
Edit:
One thing I found out is that test_y is a pandas DataFrame whereas linear_predicted_rating is a NumPy array.
metrics.accuracy_score is used to measure classification accuracy; it can't be used to measure the accuracy of a regression model, because accuracy doesn't make sense for regression: predictions rarely equal the expected values exactly. If predictions differed from the expected values by just 1%, the accuracy would be zero, even though those predictions are great.
Here are some metrics for regression: http://scikit-learn.org/stable/modules/classes.html#regression-metrics
NOTE: Accuracy (i.e. classification accuracy) is a measure for classification, not regression, so we can't calculate accuracy for a regression model. For regression, one of the metrics used to score a model (sometimes ambiguously termed accuracy) is R-squared (R2).
You can get the R2 score of your prediction using the score(X, y, sample_weight=None) method of LinearRegression, adapting the logic below accordingly.
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)
# For regressors, .score() returns the R-squared of the predictions
r2_score = regressor.score(x_test, y_test)
print(r2_score * 100, '%')
Output (according to my model):
86.23%
The value above is the R-squared value, not accuracy. A closely related metric is the explained variance score:
# explained variance (equals R-squared only when the residuals have zero mean)
metrics.explained_variance_score(y_test, predictions)
What do your variables look like? The code below works well:
from sklearn import metrics

# Works because both inputs are discrete class labels, not continuous values
test_y, linear_predicted_rating = [1, 2, 3, 4], [1, 2, 3, 5]
metrics.accuracy_score(test_y, linear_predicted_rating)
You cannot compute accuracy for a regression model; however, you can analyze your model using mean absolute error, mean squared error, root mean squared error, max error, median absolute error, R-squared, etc.
For reference, see the regression-metrics link above to gain more knowledge.
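A minimal sketch of those metrics, using made-up true and predicted values:
from sklearn import metrics

y_true = [3.0, 2.5, 4.1, 7.2]  # hypothetical actual values
y_pred = [2.8, 2.9, 4.0, 7.6]  # hypothetical predictions

print(metrics.mean_absolute_error(y_true, y_pred))
print(metrics.mean_squared_error(y_true, y_pred))
print(metrics.mean_squared_error(y_true, y_pred) ** 0.5)  # RMSE
print(metrics.max_error(y_true, y_pred))
print(metrics.median_absolute_error(y_true, y_pred))
print(metrics.r2_score(y_true, y_pred))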