After using sklearn.linear_model.LogisticRegression to fit a training data set, I would like to obtain the value of the cost function for the training data set and a cross validation data set.
Is it possible to have sklearn simply give me the value (at the fit minimum) of the function it minimized?
The function is stated in the documentation at http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression (depending on the regularization one has chosen). But I can't find how to get sklearn to give me the value of this function.
I would have thought this is what LogisticRegression.score does, but that simply returns the accuracy (the fraction of data points its prediction classifies correctly).
I have found sklearn.metrics.log_loss, but of course this is not the actual function being minimized.
Unfortunately there is no "nice" way to do so, but there is a private function
_logistic_loss(w, X, y, alpha, sample_weight=None)
in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py, so you can call it by hand:
from sklearn.linear_model.logistic import _logistic_loss
print(_logistic_loss(clf.coef_, X, y, 1 / clf.C))
where clf is your fitted LogisticRegression.
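To also compare the training cost against a held-out validation set, a rough sketch along the same lines should work, with X and y being your full data as above. This assumes a binary problem with 0/1 labels; note that the private helper expects a flat weight vector with the intercept appended and labels encoded as -1/+1, and that its module path may differ between scikit-learn versions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model.logistic import _logistic_loss
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# Flat weight vector; the fitted intercept goes in the last position.
w = np.hstack([clf.coef_.ravel(), clf.intercept_])

# The private loss expects labels encoded as -1/+1.
train_cost = _logistic_loss(w, X_train, np.where(y_train == 1, 1, -1), 1 / clf.C)
val_cost = _logistic_loss(w, X_val, np.where(y_val == 1, 1, -1), 1 / clf.C)
print(train_cost, val_cost)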
I used the code below to calculate a cost value:
import numpy as np
# Sum of squared errors between the predictions and the true labels
cost = np.sum((reg.predict(x) - y) ** 2)
where reg is your fitted LogisticRegression.
I have the following suggestion: write the loss function of logistic regression yourself as a Python function.
After you obtain the model's predictions for your data, you can invoke that function to calculate the cost values, as sketched below.
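For example, a minimal sketch of such a function, assuming a binary problem with labels in {0, 1} and the default L2 penalty of LogisticRegression; the helper name logistic_cost and the use of predict_proba are this sketch's own choices, not part of the sklearn API:
import numpy as np

def logistic_cost(clf, X, y):
    """Regularized logistic loss for a fitted binary LogisticRegression."""
    p = clf.predict_proba(X)[:, 1]                       # predicted P(y = 1)
    log_loss = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    l2_penalty = 0.5 / clf.C * np.sum(clf.coef_ ** 2)    # alpha = 1 / C
    return log_loss + l2_penalty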
I created an ML model with Scikit Learn and Python and calculated its R-squared. Is there a way to convert this to a percentage error?
For example, if my true values are 100 and 50, and my predicted values are 90 and 40, my average percentage error is 15%, because the error for the first prediction is 10% and the error for the second prediction is 20%.
Is there a way to calculate the percentage error (average percentage error) based on the value I get for R-squared?
It is not possible. R-squared is calculated from the RSS, the residual sum of squares: your R-squared is 1 - (RSS of your model) / (RSS of an intercept-only model). From this you can see that R-squared is not really an error per se, but the percentage of variance explained.
We can use an example dataset
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import numpy as np
X, y = load_boston(return_X_y=True)
reg = LinearRegression().fit(X, y)
We let the prediction and mean of y be:
ybar = reg.predict(X)
ymean = y.mean()
The R-squared is
1 - sum((y-ybar)**2) / sum((y-ymean)**2)
0.7406426641094095
reg.score(X, y)
0.7406426641094095
Whereas your percentage error is:
np.mean(abs(y-ybar)/y)
0.16417298806489977
As you can see, it is not quite possible to get the mean percentage error back from R-squared: the R-squared has already summed up the residuals, whereas the percentage error needs each error relative to its own observation.
From your question, it sounds like you're working with a regression model. I would recommend looking into sklearn's built-in regression error metrics instead of trying to work backwards from R^2, which is a goodness-of-fit measure rather than an error. For what you are trying to do, I would probably recommend mean_absolute_error or median_absolute_error, but other metrics can also be useful in tuning your model!
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import median_absolute_error
y_pred = model.predict(X_test)
MAE = mean_absolute_error(y_test, y_pred)
MEDAE = median_absolute_error(y_test, y_pred)
If you're building a classifier, you should be able to use sklearn's accuracy_score metric. This divides the number of correct predictions by the total number of predictions. Multiplying by 100 yields the percentage of correct predictions; to get the percentage of incorrect predictions you can use 100 * (1 - accuracy_score).
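For instance, a minimal sketch of that classification case (model, X_test and y_test are placeholders for your own classifier and test split):
from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred)     # fraction of correct predictions
percent_correct = 100 * acc
percent_incorrect = 100 * (1 - acc)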
The above answer seems adequate, but I sensed some confusion in the question, so I'm leaving this here as well.
R-squared is a metric that answers the question "If I were to just use the average of the target, would that be better than my predictions?". It gives you a value below zero if your model is worse than that baseline (the average of the target), and something close to one if your model is better than the baseline. As stated before, there are guidelines regarding which error you should use. If you're a beginner, I would suggest mean squared error, since it is the quantity the fit minimizes: you end up with "the slope and the intercept where the derivative of the mean squared error is zero" (some fancy differentiation takes place there). With MSE, the distance between each data point and your model's prediction is squared, the squared distances (errors) are summed up, and the mean of those squared errors is taken. Therefore, there is no way to calculate your error from R-squared, as the two are not really related. You can implement MSE with sklearn like this:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
For other metrics in sklearn (including classification and clustering metrics), see here; sklearn's documentation is even better than the online tutorials.
You can also simply call sklearn.metrics.SCORERS.keys() to see the available scorer names.
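For example, this just lists the keys of that dict, which are the strings accepted by the scoring= parameter:
from sklearn.metrics import SCORERS
print(sorted(SCORERS.keys()))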
In the sklearn.model_selection.cross_val_predict page it is stated:
Generate cross-validated estimates for each input data point. It is
not appropriate to pass these predictions into an evaluation metric.
Can someone explain what this means? If it gives an estimate of Y (a predicted y) for every Y (true y), why can't I calculate metrics such as RMSE or the coefficient of determination from these results?
It seems to be based on how samples are grouped and predicted. From the user guide linked in the cross_val_predict docs:
Warning: Note on inappropriate usage of cross_val_predict
The result of
cross_val_predict may be different from those obtained using
cross_val_score as the elements are grouped in different ways. The
function cross_val_score takes an average over cross-validation folds,
whereas cross_val_predict simply returns the labels (or probabilities)
from several distinct models undistinguished. Thus, cross_val_predict
is not an appropriate measure of generalisation error.
cross_val_score averages across all of the folds, while cross_val_predict simply concatenates the predictions from the distinct fold models without averaging them, so a metric computed on that concatenation won't necessarily measure generalisation error in the same way. For example, using the sample code from the sklearn page:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer
diabetes = datasets.load_diabetes()
X = diabetes.data[:200]
y = diabetes.target[:200]
lasso = linear_model.Lasso()
y_pred = cross_val_predict(lasso, X, y, cv=3)
print("Cross Val Prediction score:{}".format(mean_squared_error(y,y_pred)))
print("Cross Val Score:{}".format(np.mean(cross_val_score(lasso, X, y, cv=3, scoring = make_scorer(mean_squared_error)))))
Cross Val Prediction score:3993.771257795029
Cross Val Score:3997.1789145156217
Just to add a little more clarity: it is easier to understand the difference if you consider a non-linear scoring function such as maximum absolute error instead of something like mean absolute error.
cross_val_score() would compute the maximum absolute error on each of the 3 folds (assuming a 3-fold cross-validator) and report the aggregate (say, the mean) of those 3 scores. That is, something like mean(a, b, c), where a, b, c are the max absolute errors of the 3 folds respectively. I guess it is safe to interpret the returned value as the max absolute error of your estimator in the average or general case.
With cross_val_predict() you would get 3 sets of predictions corresponding to the 3 folds, and taking the maximum absolute error over the aggregate (concatenation) of these 3 sets of predictions is certainly not the same as above. Even if the predicted values are identical in both scenarios, what you end up with here is max(a, b, c). Also, max(a, b, c) would be an unreasonable and overly pessimistic characterization of the max-absolute-error score of your model. A rough sketch of the difference follows.
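Here is that difference done by hand with a plain KFold split and NumPy rather than with cross_val_score/cross_val_predict themselves; the diabetes dataset and the 3-fold split are arbitrary choices for this sketch:
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import KFold

X, y = datasets.load_diabetes(return_X_y=True)
lasso = linear_model.Lasso()

per_fold_max = []            # max absolute error computed fold by fold
all_true, all_pred = [], []

for train_idx, test_idx in KFold(n_splits=3).split(X):
    model = lasso.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    per_fold_max.append(np.max(np.abs(y[test_idx] - pred)))
    all_true.append(y[test_idx])
    all_pred.append(pred)

# cross_val_score-style aggregate: mean of the per-fold maxima, mean(a, b, c)
print(np.mean(per_fold_max))
# cross_val_predict-style aggregate: one maximum over the concatenated predictions, max(a, b, c)
print(np.max(np.abs(np.concatenate(all_true) - np.concatenate(all_pred))))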
I am trying to evaluate a multivariable dataset by leave-one-out cross-validation and then remove those samples not predictive of the original dataset (Benjamini-corrected, FDR > 10%).
Using the docs on cross-validation, I've found the leave-one-out iterator. However, when trying to get the score for the nth fold, an exception is raised saying that more than one sample is needed. Why does .predict() work while .score() doesn't? How can I get the score for a single sample? Do I need to use another approach?
Unsuccessful code:
from sklearn import ensemble, cross_validation, datasets
dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target
clf = ensemble.RandomForestRegressor(n_estimators=500)
loo = cross_validation.LeaveOneOut(x.shape[0])
for train_i, test_i in loo:
    score = clf.fit(x[train_i], y[train_i]).score(x[test_i], y[test_i])
    print('Sample %d score: %f' % (test_i[0], score))
Resulting exception:
ValueError: r2_score can only be computed given more than one sample.
[EDIT, to clarify]:
I am not asking why this doesn't work, but for a different approach that does. After fitting/training my model, how do I test how good a single sample fits the trained model?
cross_validation.LeaveOneOut(x.shape[0]) is creating as many folds as the number of rows. This results in each validation run getting only one instance.
Now, to draw a "line" you need two points, whereas with your one instance, you only have one point. That's what your error message says, that it needs more than one instance (or sample) to draw the "line" that will be used to calculate the r^2 value.
Generally, in the ML world, people report 10-fold or 5-fold cross-validation results. So I would recommend setting n to 10 or 5 accordingly.
Edit: After a quick discussion with @banana, we realized that the question was not understood correctly initially. Since it is not possible to get the R2 score for a single data point, an alternative is to calculate the distance between the actual and predicted points. This can be done using
numpy.linalg.norm(clf.predict(x[test_i])[0] - y[test_i])
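For example, here is a sketch of that per-sample distance inside the leave-one-out loop, written against the newer model_selection API; averaging the distances at the end is just one way to summarize them:
import numpy as np
from sklearn import datasets, ensemble
from sklearn.model_selection import LeaveOneOut

dataset = datasets.load_linnerud()
x, y = dataset.data, dataset.target
clf = ensemble.RandomForestRegressor(n_estimators=500)

errors = []
for train_i, test_i in LeaveOneOut().split(x):
    clf.fit(x[train_i], y[train_i])
    # Euclidean distance between the predicted and the true target vector
    errors.append(np.linalg.norm(clf.predict(x[test_i])[0] - y[test_i]))

print(np.mean(errors))  # average per-sample distance over the leave-one-out folds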
How can one use cross_val_score for regression? The default scoring seems to be accuracy, which is not very meaningful for regression. Suppose I would like to use mean squared error; is it possible to specify that in cross_val_score?
I tried the following two options, but neither works:
scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring='mean_squared_error')
and
scores = cross_validation.cross_val_score(svr, diabetes.data, diabetes.target, cv=5, scoring=metrics.mean_squared_error)
The first one generates a list of negative numbers while mean squared error should always be non-negative. The second one complains that:
mean_squared_error() takes exactly 2 arguments (3 given)
I don't have the reputation to comment, but I want to provide this link for you and/or any passers-by, where the negative output of the MSE in scikit-learn is discussed: https://github.com/scikit-learn/scikit-learn/issues/2439
In addition (to make this a real answer), your first option is correct: not only is MSE the metric you want to use to compare models, but R^2 cannot be calculated depending (I think) on the type of cross-validation you are using.
If you choose MSE as a scorer, it outputs a list of errors which you can then take the mean of, like so:
# Doing linear regression with leave one out cross val
from sklearn import cross_validation, linear_model
import numpy as np
# Including this to remind you that it is necessary to use numpy arrays rather
# than lists otherwise you will get an error
X_digits = np.array(x)
Y_digits = np.array(y)
loo = cross_validation.LeaveOneOut(len(Y_digits))
regr = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(regr, X_digits, Y_digits, scoring='mean_squared_error', cv=loo,)
# This will print the mean of the list of errors that were output and
# provide your metric for evaluation
print(scores.mean())
The first one is correct. It outputs the negative of the MSE, as it always tries to maximize the score. Please help us by suggesting an improvement to the documentation.
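For example, with a recent scikit-learn, where the scorer string is spelled 'neg_mean_squared_error', a sketch of recovering the positive MSE looks like this; SVR and the diabetes data stand in for your model and data:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# The scorer negates the MSE so that "greater is better" holds everywhere;
# flip the sign to recover the usual non-negative MSE.
scores = cross_val_score(SVR(), X, y, cv=5, scoring='neg_mean_squared_error')
print(-scores.mean())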
I am using sklearn.lda for classification and was a little puzzled about the score function, which returns the mean classification error.
Is it determined by leave one out - jackknife?
How do I interpret the result? It's only a float value without much documentation.
The score method takes samples X and their true labels y and compares its own predictions with y. It returns the mean accuracy, which is always a single figure. For example,
lda = LDA().fit(X, y)
print(lda.score(X, y))
will print the accuracy of the classifier on its own training set.
Every classifier has a score method, which usually (though not necessarily) returns mean accuracy. The method is used by the GridSearchCV model selection algorithm to determine the quality of the classifier if you don't explicitly give it a scoring argument.
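For example, here is a sketch of passing an explicit scorer instead of relying on that default, using LinearDiscriminantAnalysis (the current home of the old sklearn.lda estimator) with an arbitrary parameter grid and scoring choice:
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Without scoring=..., GridSearchCV falls back to the estimator's own
# score method (mean accuracy here); an explicit scorer overrides it.
grid = GridSearchCV(LinearDiscriminantAnalysis(),
                    param_grid={'solver': ['svd', 'lsqr']},
                    scoring='balanced_accuracy', cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)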