Unable to obtain accuracy score for my linear regression - python

I am working on a regression model based on the IMDB data, to predict the IMDB rating. For my linear regression, I was unable to obtain the accuracy score.
My line of code:
metrics.accuracy_score(test_y, linear_predicted_rating)
Error :
ValueError: continuous is not supported
If I change that line to obtain the R2 score instead,
metrics.r2_score(test_y, linear_predicted_rating)
I am able to obtain the R2 score without any error.
Any clue why I am seeing this?
Thanks.
Edit:
One thing I found out is that test_y is a pandas DataFrame, whereas linear_predicted_rating is a NumPy array.

metrics.accuracy_score is used to measure classification accuracy; it can't be used to measure the accuracy of a regression model, because accuracy doesn't make sense for regression: predictions can rarely equal the expected values exactly. And if every prediction differed from its expected value by just 1%, the accuracy would be zero, even though those predictions would be great.
Here are some metrics for regression: http://scikit-learn.org/stable/modules/classes.html#regression-metrics
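For example, a minimal sketch using two of those metrics (with made-up continuous ratings standing in for test_y and linear_predicted_rating from the question):
from sklearn import metrics

test_y = [7.2, 6.5, 8.1, 5.9]                   # continuous targets, as in the question
linear_predicted_rating = [7.0, 6.8, 7.9, 6.1]

print(metrics.r2_score(test_y, linear_predicted_rating))
print(metrics.mean_absolute_error(test_y, linear_predicted_rating))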

NOTE: Accuracy (e.g. classification accuracy) is a measure for classification, not regression, so we can't calculate accuracy for a regression model. For regression, one of the metrics we can use to score a model (ambiguously termed "accuracy") is R-squared (R2).
You can get the R2 score of your prediction using the score(X, y, sample_weight=None) method of LinearRegression as follows, adjusting the logic to your own variables:
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(x_train, y_train)
r2_score = regressor.score(x_test, y_test)  # R2 on the test set
print(r2_score * 100, '%')
Output (according to my model):
86.23%

The above is the R-squared value and not the accuracy:
# explained variance; this equals R-squared only when the residuals have zero mean
metrics.explained_variance_score(y_test, predictions)
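Note that explained_variance_score matches r2_score only when the residuals have zero mean; a small sketch with deliberately biased predictions shows the difference:
from sklearn import metrics

y_test = [3.0, 5.0, 7.0, 9.0]
predictions = [4.0, 6.0, 8.0, 10.0]  # every prediction is off by +1

print(metrics.explained_variance_score(y_test, predictions))  # 1.0 (the constant bias is ignored)
print(metrics.r2_score(y_test, predictions))                  # 0.8 (the bias is penalized)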

What do your variables look like? The code below works well:
from sklearn import metrics
test_y, linear_predicted_rating = [1, 2, 3, 4], [1, 2, 3, 5]  # discrete integer labels
metrics.accuracy_score(test_y, linear_predicted_rating)
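By contrast, passing continuous values (as in the original question) reproduces the error, because accuracy_score only accepts discrete class labels:
from sklearn import metrics

test_y = [1.5, 2.3, 3.1, 4.8]
linear_predicted_rating = [1.4, 2.5, 3.0, 5.0]
metrics.accuracy_score(test_y, linear_predicted_rating)
# ValueError: continuous is not supported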

You cannot compute classification accuracy for a regression model; however, you can analyze your model using mean absolute error, mean squared error, root mean squared error, max error, median absolute error, R-squared, and so on, as sketched below. See the scikit-learn regression metrics documentation linked above for reference.
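A quick sketch of those metrics (with made-up arrays for illustration):
import numpy as np
from sklearn import metrics

y_test = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

print("MAE:  ", metrics.mean_absolute_error(y_test, y_pred))
print("MSE:  ", metrics.mean_squared_error(y_test, y_pred))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("Max:  ", metrics.max_error(y_test, y_pred))
print("MedAE:", metrics.median_absolute_error(y_test, y_pred))
print("R2:   ", metrics.r2_score(y_test, y_pred))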

Related

Calculating percentage error from R-squared error

I created an ML model with scikit-learn and Python and calculated the R-squared error. Is there a way to convert this error to a percentage error?
For example, if my true values are 100 and 50, and my predicted values are 90 and 40, my average percentage error is 15%, because the error for the first prediction is 10% and the error for the second prediction is 20%.
Is there a way to calculate the percentage error (average percentage error) based on the value I get for R-squared?
It is not possible. R-squared is calculated via RSS, the residual sum of squares: your R-squared is 1 - (RSS of the model) / (RSS of the intercept-only model). From that you can see that R-squared is not really an error per se, but the percentage of variance explained.
We can use an example dataset:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2; substitute another dataset on recent versions
import numpy as np

X, y = load_boston(return_X_y=True)
reg = LinearRegression().fit(X, y)
We let the predictions and the mean of y be:
ybar = reg.predict(X)
ymean = y.mean()
The R-squared is:
1 - sum((y - ybar)**2) / sum((y - ymean)**2)
# 0.7406426641094095
reg.score(X, y)
# 0.7406426641094095
Whereas your percentage error is:
np.mean(abs(y - ybar) / y)
# 0.16417298806489977
As you can see, you cannot recover the mean percentage error from R-squared: R-squared has already summed up the residuals, whereas the percentage error needs each error relative to its own observation.
From your question, it sounds like you're working with a regression model. I would recommend looking into sklearn's other built-in regression metrics instead of trying to convert R^2, which is itself a regression accuracy metric. For what you are trying to do, I would probably recommend trying out mean_absolute_error or median_absolute_error, but other accuracy metrics can be useful in tuning your model!
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import median_absolute_error
y_pred = model.predict(X_test)
MAE = mean_absolute_error(y_test, y_pred)
MEDAE = median_absolute_error(y_test, y_pred)
If you're building a classifier, you should be able to use sklearn's accuracy_score metric. It divides the number of correct predictions by the total number of predictions; multiplying this number by 100 yields the percentage of correct predictions. To get the percentage of incorrect predictions you can use 100 * (1 - accuracy_score).
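For example, with some dummy labels:
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
print(100 * acc)        # 80.0 -> percent correct
print(100 * (1 - acc))  # 20.0 -> percent incorrect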
The above answer seems adequate, but I sensed some confusion in the question, so I'm leaving this here.
R-squared is a metric that answers the question "If I were to use the average of the target instead of my predictions, would that be better?". It comes out below zero if your model is worse than that baseline (the average of the target), and close to one if your model is better than the baseline. As stated before, there are guidelines regarding which error metric you should use. If you're a beginner, I would suggest mean squared error: fitting finds "the slope and the intercept where the derivative of the mean squared error is zero" (some fancy differentiation takes place there). In MSE, the distance between each data point and your model's prediction is squared, the squared distances (errors) are summed up, and the mean of the sum is taken. Therefore, there's no way to calculate a percentage error from R-squared, as they're not really related. You can compute MSE with sklearn like this:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
For other metrics in sklearn (including classification and clustering metrics), see here. Sklearn's documentation is even better than the online tutorials.
You can also simply type sklearn.metrics.SCORERS.keys() to see the available metrics in sklearn.
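For example (note that in recent scikit-learn releases, SCORERS was removed in favor of the get_scorer_names() function):
from sklearn import metrics

# Older releases:
# print(sorted(metrics.SCORERS.keys()))
# Newer releases:
print(sorted(metrics.get_scorer_names()))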

Metric for K-fold Cross Validation for Regression models

I wanted to do cross validation on a regression (non-classification) model and ended up getting mean accuracies of about 0.90. However, I don't know what metric the method uses to compute those accuracies. I know how the splitting in k-fold cross validation works; I just don't know the formula that the scikit-learn library uses to calculate the accuracy of a prediction (I know how it works for a classification model, though). Can someone give me the metric/formula used by sklearn.model_selection.cross_val_score?
Thanks in advance.
from sklearn.model_selection import cross_val_score

def metrics_of_accuracy(classifier, X_train, y_train):
    accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
    print(accuracies.mean())  # mean score across the 10 folds
    print(accuracies.std())   # spread of the fold scores
    return accuracies
By default, sklearn uses accuracy for classification and r2_score for regression when you use the model.score method (the same applies to cross_val_score). So it is r2_score in this case, whose formula is
r2 = 1 - SSE(y_hat) / SSE(y_mean)
where
SSE(y_hat) is the sum of squared errors of the predictions made, and
SSE(y_mean) is the sum of squared errors when every prediction is the mean of the actual values.
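A quick sketch confirming this on a synthetic regression dataset (the dataset itself is just for illustration):
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
reg = LinearRegression()

default_scores = cross_val_score(reg, X, y, cv=10)           # uses reg.score, i.e. R2
r2_scores = cross_val_score(reg, X, y, cv=10, scoring="r2")  # explicit R2
print(np.allclose(default_scores, r2_scores))                # True: same metric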
Yes. You can also use the same metric via sklearn.metrics -> r2_score, as r2_score(y_true, y_pred). This score is also called the coefficient of determination, or R-squared.
The formula for it is in the image linked below:
https://i.stack.imgur.com/USaWH.png
For more on this -
https://en.wikipedia.org/wiki/Coefficient_of_determination

How can we calculate accuracy for a random forest classifier if we are using 4-label classification?

I am trying to predict the quality attributes of a product which has been sold over the last decade.
Based on the likes/dislikes, I have kept 4 labels for the product.
The labels are: bad, good, very good, very bad.
I have downloaded the last decade of data and categorized the samples into these 4 labels. When I put the input into a random forest classifier, it gives a valid result and the feature importances.
Here is the code for the same:
classifier = RandomForestClassifier(
    n_estimators=100, n_jobs=6, oob_score=True, random_state=50,
    max_features="auto", min_samples_leaf=50
)

# Alternative configuration that was tried:
# classifier = RandomForestClassifier(
#     n_estimators=100, n_jobs=6, oob_score=True, random_state=50  # , max_depth=3
# )
I just want to understand how we can calculate the accuracy of the model, as it has 4 labels.
There are a few accuracies you can check to assess model quality; the first is the overall model accuracy (how many did it get right). For this you can simply use the sklearn accuracy score
from sklearn.metrics import accuracy_score
accuracy_score(y_true, y_pred)
Of course this does not give you enough information about which class is being misclassified and what it is confused with (e.g. it might be more acceptable to categorize "very good" as "good" rather than "bad"). For this you need a confusion matrix:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_pred)
You probably also want to look into recall and precision, as they will help you understand and quantify the matrix.
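For example, classification_report summarizes precision and recall per class; a sketch with the four labels from the question:
from sklearn.metrics import classification_report

y_true = ["bad", "good", "very good", "very bad", "good", "bad"]
y_pred = ["bad", "very good", "very good", "very bad", "good", "good"]

print(classification_report(y_true, y_pred))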
Since your labels are ranked, you can also convert them to int values and address the problem with a regression instead of a classification (then convert the outputs back to ints). This way the model gets an understanding of order, giving you an ordinal classification; see the sketch below.
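A minimal sketch of that ordinal approach, assuming x_train, y_train, and x_val are the arrays from the question (the label-to-int mapping is a hypothetical choice that preserves the ranking):
import numpy as np
from sklearn.ensemble import RandomForestRegressor

order = {"very bad": 0, "bad": 1, "good": 2, "very good": 3}  # hypothetical mapping
inverse = {v: k for k, v in order.items()}

y_train_num = np.array([order[label] for label in y_train])
reg = RandomForestRegressor(n_estimators=100, random_state=50)
reg.fit(x_train, y_train_num)

# Round predictions back to the nearest rank, then map to labels.
y_pred_num = np.clip(np.round(reg.predict(x_val)), 0, 3).astype(int)
y_pred = [inverse[v] for v in y_pred_num]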
EDIT:
Just in case the answer is not clear, you get y_pred in the following way:
classifier.fit(x_train, y_train)
y_pred = classifier.predict(x_val)

AUC score from GridSearchCV best_score_ is different from roc_auc_score on the best model from GridSearchCV

I did a grid search on a logistic regression with scoring set to 'roc_auc'. grid_clf1.best_score_ gave me an AUC of 0.7557. After that I wanted to plot the ROC curve of the best model, but the ROC curve I got had an AUC of 0.50. I do not understand this at all.
I looked into the predicted probabilities and saw that they were all 0.0 or 1.0. Hence, I think something went wrong here, but I cannot find what it is.
My code for the grid search CV is as follows:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

clf1 = Pipeline([('RS', RobustScaler()),
                 ('LR', LogisticRegression(random_state=1, solver='saga'))])
params = {'LR__C': np.logspace(-3, 0, 5),
          'LR__penalty': ['l1']}
grid_clf1 = GridSearchCV(clf1, params, scoring='roc_auc', cv=5, n_jobs=-1)
grid_clf1.fit(X_train, y_train)
grid_clf1.best_estimator_
grid_clf1.best_score_
So this gave an AUC of 0.7557 for the best model.
Then if I calculate the AUC for the model myself:
y_pred_proba = grid_clf1.best_estimator_.predict_proba(X_test)[:, 1]  # note: predict_proba, not predict_probas
print(roc_auc_score(y_test, y_pred_proba))
This gave me an AUC of 0.50.
It looks like there are two problems with your example code:
First, you compare ROC AUC scores on different datasets: the train set is used during fitting, but the test set is used when roc_auc_score is called.
Second, scoring with cross validation works slightly differently from a plain roc_auc_score function call; it expands to np.mean(cross_val_score(...)).
If you take both into account, you will get the same scoring values. You can use the colab notebook as a reference.
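For example, a sketch of that comparison (reusing grid_clf1, X_train, and y_train from the question):
import numpy as np
from sklearn.model_selection import cross_val_score

# best_score_ is the mean cross-validated AUC on the *training* data,
# so recompute it the same way rather than scoring on the test set.
cv_auc = np.mean(cross_val_score(grid_clf1.best_estimator_, X_train, y_train,
                                 scoring='roc_auc', cv=5))
print(cv_auc)                 # comparable to the value below
print(grid_clf1.best_score_)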

Relationship between sklearn .fit() and .score()

While working with a linear regression model I split the data into a training set and test set. I then calculated R^2, RMSE, and MAE using the following:
lm.fit(X_train, y_train)
R2 = lm.score(X,y)
y_pred = lm.predict(X_test)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
MAE = metrics.mean_absolute_error(y_test, y_pred)
I thought that I was calculating R^2 for the entire data set (instead of comparing the training and original data). However, I learned that you must fit the model before you score it, so I'm not sure whether I'm scoring the original data (as passed in for R2) or the data that I used to fit the model (X_train and y_train). When I run:
lm.fit(X_train, y_train)
lm.score(X_train, y_train)
I get a different result than what I got when I scored X and y. So my question is: are the inputs to .score() compared against the model that was fitted (thereby making lm.fit(X, y); lm.score(X, y) the R^2 value for the original data, and lm.fit(X_train, y_train); lm.score(X, y) the R^2 value for the original data based on the model created in .fit()), or is something else entirely happening?
fit() only fits the model, which is synonymous with training: fitting the data means training the model on it.
score() is more like testing or prediction.
So you should use different datasets for training the classifier and testing its accuracy.
You can do it like this:
from sklearn import neighbors
from sklearn.model_selection import train_test_split  # the old cross_validation module was removed from sklearn

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = neighbors.KNeighborsClassifier()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
