I need to implement Lasso and Ridge regression and tune their hyperparameters by cross-validation.
I found code that does it, but I cannot quite understand it.
import numpy as np
from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV
from sklearn.metrics import mean_squared_error

lassocv = LassoCV(alphas=None, cv=15, max_iter=100000, normalize=True)
lassocv.fit(X_train, y_train)
lasso = Lasso(alpha=lassocv.alpha_, normalize=True)
lasso.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_val, lasso.predict(X_val)))

ridgecv = RidgeCV(alphas=alphas, normalize=True)
ridgecv.fit(X_train, y_train)
ridge = Ridge(alpha=ridgecv.alpha_, normalize=True)
ridge.fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_val, ridge.predict(X_val)))
So, why alphas=alphas in RidgeCV? If I write alphas=None, an error occurs. Why is it not necessary to write cv for ridgecv?
I think the answer comes down to how the regularization strength is chosen in the two estimators.
In RidgeCV you have to declare the alphas grid explicitly because the estimator cannot build one automatically: it simply evaluates each candidate you give it (by efficient leave-one-out cross-validation when cv is not set, which is also why no cv argument is required) and keeps the best one. Its default grid is just (0.1, 1.0, 10.0), and alphas=None is not accepted.
LassoCV, however, can compute a grid of alphas automatically from the data along the regularization path. If you want to set the grid explicitly, you can do so via the optional parameters n_alphas and alphas.
Refer to The Elements of Statistical Learning (https://web.stanford.edu/~hastie/Papers/ESLII.pdf), Chapter 7, Section 4 for further details.
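To make the contrast concrete, here is a rough sketch (my own illustration, not part of the original answer; it drops the deprecated normalize argument and reuses X_train, y_train from the question):
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# LassoCV can build its own alpha grid from the data; n_alphas and eps control it
lassocv = LassoCV(n_alphas=100, cv=15, max_iter=100000)
lassocv.fit(X_train, y_train)

# RidgeCV needs an explicit grid (its default is only (0.1, 1.0, 10.0));
# with cv left unset it scores each alpha by efficient leave-one-out CV,
# which is why no cv argument is required
ridgecv = RidgeCV(alphas=np.logspace(-4, 2, 50))
ridgecv.fit(X_train, y_train)

print(lassocv.alpha_, ridgecv.alpha_)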
I have a time-dependent data set, where I (as an example) am trying to do some hyperparameter tuning on a Lasso regression.
For that I use sklearn's TimeSeriesSplit instead of regular KFold CV, i.e. something like this:
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
model = GridSearchCV(
    estimator=pipeline,  # a Pipeline whose final step is named "estimator"
    param_grid={"estimator__alpha": np.linspace(0.05, 1, 50)},  # GridSearchCV uses param_grid, not param_distributions
    scoring="neg_mean_absolute_percentage_error",
    n_jobs=-1,
    cv=tscv,
    return_train_score=True,
)
model.fit(X_train, y_train)
With this I get a model, which I can then use for predictions etc. The idea behind that cross-validation is the usual expanding-window time-series split.
However, my issue is that I would actually like to have the predictions from the test sets of all the CV folds, and I have no idea how to get that out of the model.
If I look at cv_results_ I get the score (from the scoring parameter) for each split and each hyperparameter, but I don't seem to be able to find the predicted values for each observation in each test split. And I actually need those for some backtesting: I don't think it would be "fair" to use the final model to predict the earlier values, since I would imagine there would be some kind of overfitting in that case.
So, is there any way for me to extract the predicted values for each split?
You can pass a custom scoring function to GridSearchCV. Inside it you get the estimator fitted on that particular fold, so you can generate (and store) predictions for that fold's test set.
From the documentation, the scoring parameter is the
"Strategy to evaluate the performance of the cross-validated model on the test set."
from sklearn.metrics import mean_absolute_percentage_error

def custom_scorer(clf, X, y):
    # clf is the pipeline fitted on this fold's training data; X, y are the fold's test split
    y_pred = clf.predict(X)
    # save y_pred somewhere
    return -mean_absolute_percentage_error(y, y_pred)

model = GridSearchCV(estimator=pipeline,
                     scoring=custom_scorer)  # keep your other arguments (param_grid, cv=tscv, ...) as before
The input X and y in the above code come from the test split of the current fold, and clf is the pipeline you passed to the estimator parameter, already fitted on that fold's training data.
Obviously your estimator should implement the predict method (i.e. be a valid scikit-learn model). You can also pass standard scorers alongside the custom one as a sanity check on its values.
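For completeness, here is a minimal sketch (my own illustration, not part of the original answer) of one way to actually store the per-fold test predictions, reusing the pipeline, alpha grid and TimeSeriesSplit from the question. Note that it requires n_jobs=1, because with parallel workers the scorer runs in separate processes and the list would stay empty, and that it records predictions for every hyperparameter candidate, not only the best one:
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

fold_predictions = []  # will hold (true values, predictions) for every scored fold

def saving_scorer(clf, X, y):
    y_pred = clf.predict(X)
    fold_predictions.append((np.asarray(y), y_pred))  # stash this fold's test predictions
    return -mean_absolute_percentage_error(y, y_pred)

tscv = TimeSeriesSplit(n_splits=5)
model = GridSearchCV(estimator=pipeline,  # pipeline from the question
                     param_grid={"estimator__alpha": np.linspace(0.05, 1, 50)},
                     scoring=saving_scorer,
                     cv=tscv,
                     n_jobs=1)  # single process so fold_predictions actually gets filled
model.fit(X_train, y_train)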
Exploring some classification models in scikit-learn, I noticed that the scores I got for log loss and ROC AUC were consistently lower during cross-validation than when fitting and predicting on the whole training set (which I did to check for overfitting), and that did not make sense to me.
Specifically, with cross_validate I set the scorings to ['neg_log_loss', 'roc_auc'], while for the manual fitting and prediction on the training set I used the metric functions log_loss and roc_auc_score.
To figure out what was happening, I wrote code to perform the cross-validation manually, so that I could call the metric functions on the individual folds and compare the results with those from cross_validate. As you can see below, I got different results even like this!
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold, cross_val_score

kf = KFold(n_splits=3, random_state=42, shuffle=True)
log_reg = LogisticRegression(max_iter=1000)

for train_index, test_index in kf.split(dataset, dataset_labels):
    X_train, X_test = dataset[train_index], dataset[test_index]
    y_train, y_test = dataset_labels[train_index], dataset_labels[test_index]
    log_reg.fit(X_train, y_train)
    pr = log_reg.predict(X_test)
    ll = log_loss(y_test, pr)
    print(ll)

cv_ll = cross_val_score(log_reg, dataset, dataset_labels, scoring='neg_log_loss',
                        cv=KFold(n_splits=3, random_state=42, shuffle=True))
print(abs(cv_ll))
Outputs:
4.795481869275026
4.560119170517534
5.589818973403791
[0.409817 0.32309 0.398375]
The outputs running the same code for ROC AUC are:
0.8609669592272686
0.8678563239907938
0.8367147503682851
[0.925635 0.94032 0.910885]
To be sure I had written the code correctly, I also tried 'accuracy' as the scoring for cross-validation and accuracy_score as the metric function, and those results are consistent:
0.8611584327086882
0.8679727427597955
0.838160136286201
[0.861158 0.867973 0.83816 ]
Can someone explain to me why the results for log loss and ROC AUC are different? Thanks!
Log-loss and auROC both need probability predictions, not the hard class predictions. So change
pr = log_reg.predict(X_test)
to
pr = log_reg.predict_proba(X_test)[:, 1]
(the subscripting is to grab the probabilities for the positive class, and assumes you're doing binary classification).
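Applied to the manual loop in the question, the fix looks roughly like this (a sketch using the question's variable names):
from sklearn.metrics import log_loss, roc_auc_score

for train_index, test_index in kf.split(dataset, dataset_labels):
    X_train, X_test = dataset[train_index], dataset[test_index]
    y_train, y_test = dataset_labels[train_index], dataset_labels[test_index]
    log_reg.fit(X_train, y_train)
    proba = log_reg.predict_proba(X_test)[:, 1]  # positive-class probabilities
    print(log_loss(y_test, proba), roc_auc_score(y_test, proba))
The values printed here should now agree (up to the sign convention of neg_log_loss) with the cross_val_score results.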
I wanted to do cross-validation on a regression (non-classification) model and ended up getting mean "accuracies" of about 0.90. However, I don't know what metric the method uses to produce those numbers. I know how the splitting in k-fold cross-validation works; I just don't know what formula the scikit-learn library uses to score the predictions (I do know how it works for classification models). Can someone give me the metric/formula used by sklearn.model_selection.cross_val_score?
Thanks in advance.
from sklearn.model_selection import cross_val_score

def metrics_of_accuracy(classifier, X_train, y_train):
    # despite the parameter name, this works for any estimator, including regressors
    accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=10)
    print(accuracies.mean())  # mean of the 10 fold scores
    print(accuracies.std())   # their standard deviation
    return accuracies
By default, scikit-learn uses accuracy for classification and r2_score for regression when you call a model's score method (and the same default applies to cross_val_score). So in this case the scores are R² values, whose formula is
r2 = 1 - (SSE(y_hat) / SSE(y_mean))
where
SSE(y_hat) is the sum of squared errors of the predictions, and
SSE(y_mean) is the sum of squared errors obtained when every prediction is the mean of the actual target values.
Yes, and you can also compute the same metric directly with sklearn.metrics.r2_score, as r2_score(y_true, y_pred). This score is also called the coefficient of determination, or R-squared.
The formula is R² = 1 - Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)² (image: https://i.stack.imgur.com/USaWH.png).
For more on this, see https://en.wikipedia.org/wiki/Coefficient_of_determination
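As a quick self-contained check (synthetic data, purely illustrative), the default score of cross_val_score for a regressor is indeed the same R² that the score method and r2_score compute:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(LinearRegression(), X, y, cv=cv))                # default scoring
print(cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2"))  # identical values

train_idx, test_idx = next(iter(cv.split(X)))
model = LinearRegression().fit(X[train_idx], y[train_idx])
print(model.score(X[test_idx], y[test_idx]),
      r2_score(y[test_idx], model.predict(X[test_idx])))  # same number twice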
I'm training a dataset and then testing it on some other dataset.
To improve performance, I wanted to fine-tune my parameters with a 5-fold cross validation.
However, I think I'm not writing the correct code, because when I try to use the model on my test set it says the model hasn't been fitted yet. I thought the cross-validation part fitted the model? Or maybe I have to extract it?
Here's my code:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

svm = SVC(kernel='rbf', probability=True, random_state=42)
accuracies = cross_val_score(svm, data_train, lbs_train, cv=5)
pred_test = svm.predict(data_test)  # raises NotFittedError: svm itself was never fitted
accuracy = accuracy_score(lbs_test, pred_test)
That is correct: cross_val_score doesn't fit the estimator you pass in; it fits clones of it internally, so your svm object is never fitted. Besides, in your example you have cv=5, which means the model was fit 5 times, so which of those would you want? The last one?
The function cross_val_score is a simplified version of sklearn.model_selection.cross_validate, which returns not only the scores but more information.
So you can do something like this:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

svm = SVC(kernel='rbf', probability=True, random_state=42)
cv_results = cross_validate(svm, data_train, lbs_train, cv=5, return_estimator=True)
# cv_results is a dict with the following keys:
# 'test_score'  - what cross_val_score returns
# 'train_score' - only if return_train_score=True
# 'fit_time'
# 'score_time'
# 'estimator'   - the fitted estimators, one per fold, only if return_estimator=True
accuracies = cv_results['test_score']  # what you had before
svms = cv_results['estimator']
print(len(svms))  # 5
svm = svms[-1]    # the last fitted svm, or pick any that you want
pred_test = svm.predict(data_test)
accuracy = accuracy_score(lbs_test, pred_test)
Note, here you need to pick one of the 5 fitted SVMs. Ideally, you would use cross-validation only for estimating the performance of your model, so you don't need to score it again at the end; then you would fit your model one more time, this time on ALL of the training data, and that is the model you would actually use in production.
Another note: you mentioned that you want to fine-tune the parameters of your model, so perhaps you should look at hyper-parameter optimization, for example https://datascience.stackexchange.com/a/36087/54395, where you will see how to use cross-validation and define a parameter search space (see the sketch below).
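A minimal sketch of what those two notes suggest (my own illustration; the parameter grid is purely illustrative): GridSearchCV runs the cross-validation over the grid and, because refit=True by default, refits the best model on all of the training data, so it can be used directly afterwards:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}  # illustrative values
search = GridSearchCV(SVC(kernel='rbf', probability=True, random_state=42),
                      param_grid, cv=5)  # refit=True by default
search.fit(data_train, lbs_train)        # data from the question
print(search.best_params_, search.best_score_)
pred_test = search.predict(data_test)    # uses the best model, refit on all training data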
I'm using both the Scikit-Learn and Seaborn logistic regression functions -- the former for extracting model info (i.e. log-odds, parameters, etc.) and the latter for plotting the resulting sigmoid curve fit to the probability estimates.
Maybe my intuition about how to interpret this plot is off, but I don't seem to be getting the results I'd expect:
# Build and visualize a simple logistic regression
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression

ap_X = ap[['TOEFL Score']].values
ap_y = ap['Chance of Admit'].values

ap_lr = LogisticRegression()
ap_lr.fit(ap_X, ap_y)

def ap_log_regplot(ap_X, ap_y):
    plt.figure(figsize=(15, 10))
    sns.regplot(ap_X, ap_y, logistic=True, color='green')
    return None

ap_log_regplot(ap_X, ap_y)
plt.xlabel('TOEFL Score')
plt.ylabel('Probability')
plt.title('Logistic Regression: Probability of High Chance by TOEFL Score')
plt.show()
Seems alright, but then I attempt to use the predict_proba function in Scikit-Learn to find the probabilities of Chance of Admit for some arbitrary TOEFL Score values (in this case 108, 104, and 112):
eight = ap_lr.predict_proba([[108]])[:, 1]
four = ap_lr.predict_proba([[104]])[:, 1]
twelve = ap_lr.predict_proba([[112]])[:, 1]
print(eight, four, twelve)
Where I get:
[0.49939019] [0.44665597] [0.55213799]
To me, this seems to indicate that a TOEFL Score of 112 gives an individual a 55% chance of being admitted based on this data set. If I were to extend a vertical line from 112 on the x-axis to the sigmoid curve, I'd expect the intersection at around .90.
Am I interpreting/modeling this correctly? I realize that I'm using two different packages to calculate the model coefficients but with another model using a different data set, I seem to get correct predictions that fit the logistic curve.
Any ideas or am I completely modeling/interpreting this inaccurately?
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=4)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

logreg = LogisticRegression()
logreg.fit(x_train, y_train)
y_pred = logreg.predict(x_test)
print('log: ', metrics.accuracy_score(y_test, y_pred))
This way you can easily check a model's accuracy and decide which model to use for your application data.
After some searching, Cross Validated provided the correct answer to my question. Although it already exists on Cross Validated, I wanted to provide the answer on Stack Overflow as well.
Simply put, Scikit-Learn automatically adds an L2 regularization penalty to the logistic model, which shrinks the coefficients, while Statsmodels does not. Apparently there was no way to turn this off in the version I was using, so one has to set the C= parameter within the LogisticRegression instantiation to some arbitrarily high value like C=1e9; recent scikit-learn versions also let you disable the penalty entirely with penalty=None.
After trying this and comparing the Scikit-Learn predict_proba() to the sigmoidal graph produced by regplot (which uses statsmodels for its calculation), the probability estimates align.
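For reference, a minimal sketch of the fix described above (my own illustration, reusing ap_X and ap_y from the question):
from sklearn.linear_model import LogisticRegression

ap_lr = LogisticRegression(C=1e9)  # effectively unpenalized; penalty=None also works in recent versions
ap_lr.fit(ap_X, ap_y)
print(ap_lr.predict_proba([[112]])[:, 1])  # should now match the regplot curve at x=112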
Link to full post: https://stats.stackexchange.com/questions/203740/logistic-regression-scikit-learn-vs-statsmodels