I wanted to do Cross Validation on a regression (non-classification ) model and ended getting mean accuracies of about 0.90. however, i don't know what metric is used in the method to find out the accuracies. I know how splitting in k-fold cross validation works . I just don't know the formula that the scikit learn library is using to calculate the accuracy of prediction. (I know how it works for classification model though). Can someone give me the metric/formula used by sklearn.model_selection.cross_val_score?
Thanks in advance.
from sklearn.model_selection import cross_val_score
def metrics_of_accuracy(classifier , X_train , y_train) :
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
return accuracies
By default, sklearn uses accuracy in case of classification and r2_score for regression when you use the model.score method(same for cross_val_score). So r2_score in this case whose formula is
r2 = 1 - (SSE(y_hat)/SSE(y_mean))
where
SSE(y_hat) is the squared error for predictions made
SSE(y_mean) is the squared error when all predictions are the mean of the actual predictions
Yes, Also I can use the same metric using sklearn.metrics-> r2_score.
r2_score(y_true, y_pred). This score is also called Coefficient of determination or R-squared.
The formula for the same is as follows -
Find the link to image below.
https://i.stack.imgur.com/USaWH.png
For more on this -
https://en.wikipedia.org/wiki/Coefficient_of_determination
Related
Exploring some classification models in Scikit learn I noticed that the scores I got for log loss and for ROC AUC were consistently lower while performing cross validation than while fitting and predicting on the whole training set (done to check for overfitting), thing that did not make sense to me.
Specifically, using cross_validate I set the scorings as ['neg_log_loss', 'roc_auc'] and while performing manual fitting and prediction on the training set I used the metric functions log_loss' and roc_auc_score.
To try to figure out what was happening, i wrote a code to perform the cross validation manually in order to be able to call the metric functions manually on the various folds and compare the results with the ones from cross_validate. As you can see below, I got different results even like this!
from sklearn.model_selection import StratifiedKFold
kf = KFold(n_splits=3, random_state=42, shuffle=True)
log_reg = LogisticRegression(max_iter=1000)
for train_index, test_index in kf.split(dataset, dataset_labels):
X_train, X_test = dataset[train_index], dataset[test_index]
y_train, y_test = dataset_labels_np[train_index], dataset_labels_np[test_index]
log_reg.fit(X_train, y_train)
pr = log_reg.predict(X_test)
ll = log_loss(y_test, pr)
print(ll)
from sklearn.model_selection import cross_val_score
cv_ll = cross_val_score(log_reg, dataset_prepared_stand, dataset_labels, scoring='neg_log_loss',
cv=KFold(n_splits=3, random_state=42, shuffle=True))
print(abs(cv_ll))
Outputs:
4.795481869275026
4.560119170517534
5.589818973403791
[0.409817 0.32309 0.398375]
The output running the same code for ROC AUC are:
0.8609669592272686
0.8678563239907938
0.8367147503682851
[0.925635 0.94032 0.910885]
To be sure to have written the code right, I also tried the code using 'accuracy' as scoring for cross validation and accuracy_score as metric function and the results are instead consistent:
0.8611584327086882
0.8679727427597955
0.838160136286201
[0.861158 0.867973 0.83816 ]
Can someone explain me why the results in the case of the log loss and the ROC AUC are different? Thanks!
Log-loss and auROC both need probability predictions, not the hard class predictions. So change
pr = log_reg.predict(X_test)
to
pr = log_reg.predict_proba(X_test)[:, 1]
(the subscripting is to grab the probabilities for the positive class, and assumes you're doing binary classification).
I did a grid search on a logistic regression and set scoring to 'roc_auc'. The grid_clf1.best_score_ gave me an auc of 0.7557. After that I wanted to plot the ROC curve of the best model. The ROC curve I saw had an AUC of 0.50 I do not understand this at all.
I looked into the predicted probabilites and I saw that they were all 0.0 or 1.0. Hence, I think something went wrong here but I cannot find what it is.
My code is as follows for the grid search cv:
clf1 = Pipeline([('RS', RobustScaler()), ('LR',
LogisticRegression(random_state=1, solver='saga'))])
params = {'LR__C': np.logspace(-3, 0, 5),
'LR__penalty': ['l1']}
grid_clf1 = GridSearchCV(clf1, params, scoring='roc_auc', cv = 5,
n_jobs=-1)
grid_clf1.fit(X_train, y_train)
grid_clf1.best_estimator_
grid_clf1.best_score_
So this gave an AUC of 0.7557 for the best model.
Then if I calculate the AUC for the model myself:
y_pred_proba = grid_clf1.best_estimator_.predict_probas(X_test)[::,1]
print(roc_auc_score(y_test, y_pred_proba))
This gave me an AUC of 0.50.
It looks like there are two problems with your example code:
You compare ROC_AUC scores on different datasets. During fitting train set is used, and test set is used when roc_auc_score is called
Scoring with cross validation works slightly different than simple roc_auc_score function call. It can be expanded to np.mean(cross_val_score(...))
So, if take that into account you will get the same scoring values. You can use the colab notebook as a reference.
I want to use Support Vector Regression (SVR) for regression as it seems quite powerful when I have several features. As I found a very-easy-to-use implementation in scikit-learn, I'm using that implementation. My questions below are regarding this python package in particular, but if you have any solution in any other language or package please let me know as well.
So, I'm using the following code:
from sklearn.svm import SVR
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
svr_rbf = SVR(kernel='rbf')
scoring = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'r2']
scores = cross_validate(estimator, X, y, cv=KFold(10, shuffle=True), scoring=scoring, return_train_score=False)
score = -1 * scores['test_neg_mean_absolute_error']
print("MAE: %.4f (%.4f)" % (score.mean(), score.std()))
score = -1 * scores['test_neg_mean_squared_error']
print("MSE: %.4f (%.4f)" % (score.mean(), score.std()))
score = scores['test_r2']
print("R^2: %.4f (%.4f)" % (score.mean(), score.std()))
As you can see, I can easily use 10-fold cross validation by dividing my data in 10 shuffled folds, and getting all the MAE, MSE and r^2 for each fold very easily.
However, my big question is how can I get the pvalue, r and adjusted r^2 for my SVR regression model specifically, just like I find in other python packages including statsmodels for linear regressions?
I guess I will have to implement the cross validation with KFold by myself in order to achieve this, but I don't think that's a big problem. The issue is that I'm not sure how to get these scores from sklearn's implementation of SVR itself.
I am working on my regression model based on the IMDB data, to predict IMDB value. On my linear-regression, i was unable to obtain the accuracy score.
my line of code:
metrics.accuracy_score(test_y, linear_predicted_rating)
Error :
ValueError: continuous is not supported
if i were to change that line to obtain the r2 score,
metrics.r2_score(test_y,linear_predicted_rating)
i was able to obtain r2 without any error.
Any clue why i am seeing this?
Thanks.
Edit:
One thing i found out is test_y is panda data frame whereas the linear_predicted_rating is in numpy array format.
metrics.accuracy_score is used to measure classification accuracy, it can't be used to measure accuracy of regression model because it doesn't make sense to see accuracy for regression - predictions rarely can equal the expected values. And if predictions differ from expected values by 1%, the accuracy will be zero, though these predictions are great
Here are some metrics for regression: http://scikit-learn.org/stable/modules/classes.html#regression-metrics
NOTE: Accuracy (e.g. classification accuracy) is a measure for classification, not regression so we can't calculate accuracy for a regression model. For regression, one of the matrices we've to get the score (ambiguously termed as accuracy) is R-squared (R2).
You can get the R2 score (i.e accuracy) of your prediction using the score(X, y, sample_weight=None) function from LinearRegression as follows by changing the logic accordingly.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train,y_train)
r2_score = regressor.score(x_test,y_test)
print(r2_score*100,'%')
output (a/c to my model)
86.23%
The above is R squared value and not the accuracy :
# R squared value
metrics.explained_variance_score(y_test, predictions)
What does your variables look like. Code below works well.
from sklearn import metrics
test_y, linear_predicted_rating = [1,2,3,4], [1,2,3,5]
metrics.accuracy_score(test_y, linear_predicted_rating)
You can not predict the accuracy of regression model,however you can analyze your model using Mean absolute error ,Mean squared error ,Root mean squared error,Max error,median error R-square etc.
for reference
you can go this to gain more knowledge
While working with a linear regression model I split the data into a training set and test set. I then calculated R^2, RMSE, and MAE using the following:
lm.fit(X_train, y_train)
R2 = lm.score(X,y)
y_pred = lm.predict(X_test)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
MAE = metrics.mean_absolute_error(y_test, y_pred)
I thought that I was calculating R^2 for the entire data set (instead of comparing the training and original data). However, I learned that you must fit the model before you score it, therefore I'm not sure if I'm scoring the original data (as inputted in R2) or the data that I used to fit the model (X_train, and y_train). When I run:
lm.fit(X_train, y_train)
lm.score(X_train, y_train)
I get a different result than what I got when I was scoring X and y. So my question is are the inputs to the .score parameter compared to the model that was fitted (thereby making lm.fit(X,y); lm.score(X,y) the R^2 value for the original data and lm.fit(X_train, y_train); lm.score(X,y) the R^2 value for the original data based off the model created in .fit.) or is something else entirely happening?
fit() that only fit the data which is synonymous to train, that is fit the data means train the data.
score is something like testing or predict.
So one should use different dataset for training the classifier and testing the acuracy
One can do like this.
X_train,X_test,y_train,y_test=cross_validation.train_test_split(X,y,test_size=0.2)
clf=neighbors.KNeighborsClassifier()
clf.fit(X_train,y_train)
accuracy=clf.score(X_test,y_test)