GridSearchCV returns worse accuracy than default - python

I was working with the Heart Disease Prediction dataset from Kaggle and came up with something odd that I couldn't find an answer.
With default Logistic Regression with 'liblinear' solver (C = 1) I get an accuracy on the train set of 87.26% and 86.81% on the test set. Pretty good. However, I tried using GridSearchCV tweaking C in case I find better values and I constantly get worse results (an accuracy about 85% on train set and 82.5% on test set).
Is GridSearchCV using some other metric for comparing these values of C? I just don't understand why it returns a worse solution.
I leave the last part of my code here.
Default Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver = 'liblinear', random_state = 2)
lr = lr.fit(X_train, y_train)
model_score(lr, X_train, y_train, X_test, y_test)
GridSearchCV
from sklearn.model_selection import GridSearchCV
# Finding better hyperparameters for the Logistic Regression
lr_params = [ {'C': np.logspace(-1, 0.3, 30)} ]
lr = LogisticRegression(solver = 'liblinear', random_state = 2)
lr_cv = GridSearchCV(lr, lr_params, cv = 5, scoring = 'accuracy')
lr_cv.fit(X_train, y_train)
lr_best_params = lr_cv.best_params_
lr = LogisticRegression(**lr_best_params)
lr.fit(X_train, y_train)
model_score(lr, X_train, y_train, X_test, y_test)
EDIT
Full code in this link (check section 4-4.1).

Related

Reproducing Model results from RandomizedSearchCV

I have used RandomizedSearchCV to tune the parameters of my Random Forest model, as in the code cell below:
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(X_train[rf_cols], y_train)
It turns out that the rf_random model outperforms any of my manually trained models with the same parameters which were retrieved using
rf_random.best_params_
I need to reproduce the exact prediction that I have made using RandomizedSearchCV, but I am unable to do so, mainly due to two reasons:
best_params_ differ on each run
I am having trouble understanding how RandomizedSearchCV splits the data into train set and validation set, which means that it is nearly impossible for me to train a new model that behaves the same.
What can I do? What more information do I need to reproduce the results? Or is it even possible to reproduce results from RandomizedSearchCV, despite the fact that I have fixed my random_state to 42? Should I stick to GridSearchCV instead if I need to reproduce the results?
I believe you are looking for the best_estimator_ attribute of RandomizedSearchCV which will return the fitted estimator which scored highest on the left out data:
kf = KFold(n_splits=3, random_state=42)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
n_iter = 100, cv = kf, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(X_train[rf_cols], y_train)
tuned_rf = rf_random.best_estimator_
for train_index, test_index in kf.split(X_train[rf_cols], y_train):
# use train and test indices here

Python: I want to perform 5 fold cross validation for logistic regression and report scores. Do I use LogisticRegressionCV() or cross_val_score()?

cross_val_scores gives different results than LogisticRegressionCV, and I can't figure out why.
Here is my code:
seed = 42
test_size = .33
X_train, X_test, Y_train, Y_test = train_test_split(scale(X),Y, test_size=test_size, random_state=seed)
#Below is my model that I use throughout the program.
model = LogisticRegressionCV(random_state=42)
print('Logistic Regression results:')
#For cross_val_score below, I just call LogisticRegression (and not LogRegCV) with the same parameters.
scores = cross_val_score(LogisticRegression(random_state=42), X_train, Y_train, scoring='accuracy', cv=5)
print(np.amax(scores)*100)
print("%.2f%% average accuracy with a standard deviation of %0.2f" % (scores.mean() * 100, scores.std() * 100))
model.fit(X_train, Y_train)
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(Y_test, predictions)
coef=np.round(model.coef_,2)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
The output is this.
Logistic Regression results:
79.90483019359885
79.69% average accuracy with a standard deviation of 0.14
Accuracy: 79.81%
Why is the maximum accuracy from cross_val_score higher than the accuracy used by LogisticRegressionCV?
And, I recognize that cross_val_scores does not return a model, which is why I want to use LogisticRegressionCV, but I am struggling to understand why it is not performing as well. Likewise, I am not sure how to get the standard deviations of the predictors from LogisticRegressionCV.
For me, there might be some points to take into consideration:
Cross validation is generally used whenever you should simulate a validation set (for instance when the training set is not that big to be divided into training, validation and test sets) and only uses training data. In your case you're computing accuracy of model on test data, making it impossible to exactly compare results.
According to the docs:
Cross-validation estimators are named EstimatorCV and tend to be roughly equivalent to GridSearchCV(Estimator(), ...). The advantage of using a cross-validation estimator over the canonical estimator class along with grid search is that they can take advantage of warm-starting by reusing precomputed results in the previous steps of the cross-validation process. This generally leads to speed improvements.
If you look at this snippet, you'll see that's what happens indeed:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
data = load_breast_cancer()
X, y = data['data'], data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
estimator = LogisticRegression(random_state=42, solver='liblinear')
grid = {
'C': np.power(10.0, np.arange(-10, 10)),
}
gs = GridSearchCV(estimator, param_grid=grid, scoring='accuracy', cv=5)
gs.fit(X_train, y_train)
print(gs.best_score_) # 0.953846153846154
lrcv = LogisticRegressionCV(Cs=list(np.power(10.0, np.arange(-10, 10))),
cv=5, scoring='accuracy', solver='liblinear', random_state=42)
lrcv.fit(X_train, y_train)
print(lrcv.scores_[1].mean(axis=0).max()) # 0.953846153846154
I would suggest to have a look here, too, so as to get the details of lrcv.scores_[1].mean(axis=0).max().
Eventually, to get the same results with cross_val_score you should better write:
score = cross_val_score(gs.best_estimator_, X_train, y_train, cv=5, scoring='accuracy')
score.mean() # 0.953846153846154

Sklearn GridSearch with PredefinedSplit scoring does not match a standalone classifier

I am using sklearn GridSearch to find best parameters for random forest classification using a predefined validation set. The scores from the best estimator returned by GridSearch do not match the scores obtained by training a separate classifier with the same parameters.
The data split definition
X = pd.concat([X_train, X_devel])
y = pd.concat([y_train, y_devel])
test_fold = -X.index.str.contains('train').astype(int)
ps = PredefinedSplit(test_fold)
The GridSearch definition
n_estimators = [10]
max_depth = [4]
grid = {'n_estimators': n_estimators, 'max_depth': max_depth}
rf = RandomForestClassifier(random_state=0)
rf_grid = GridSearchCV(estimator = rf, param_grid = grid, cv = ps, scoring='recall_macro')
rf_grid.fit(X, y)
The classifier definition
clf = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
clf.fit(X_train, y_train)
The recall was calculated explicitly using sklearn.metrics.recall_score
y_pred_train = clf.predict(X_train)
y_pred_devel = clf.predict(X_devel)
uar_train = recall_score(y_train, y_pred_train, average='macro')
uar_devel = recall_score(y_devel, y_pred_devel, average='macro')
GridSearch
uar train: 0.32189884516029466
uar devel: 0.3328299259976279
Random Forest:
uar train: 0.483040291148839
uar devel: 0.40706644557392435
What is the reason for such a mismatch?
There are multiple issues here:
Your input arguments to recall_score are reversed. The actual correct order is:
recall_score(y_true, y_test)
But you are are doing:
recall_score(y_pred_train, y_train, average='macro')
Correct that to:
recall_score(y_train, y_pred_train, average='macro')
You are doing rf_grid.fit(X, y) for grid-search. That means that after finding the best parameter combinations, the GridSearchCV will fit the whole data (whole X, ignoring the PredefinedSplit because that's only used during cross-validation in search of best parameters). So in essence, the estimator from GridSearchCV will have seen the whole data, so scores will be different from what you get when you do clf.fit(X_train, y_train)
It's because in your GridSearchCV you are using the scoring function as recall-macro which basically return the recall score which is macro averaged. See this link.
However, when you are returning the default score from your RandomForestClassifier it returns the mean accuracy. So, that is why the scores are different. See this link for info on the same. (Since one is recall and the other is accuracy).

How to run SVC classifier after running 10-fold cross validation in sklearn?

I'm relatively new to machine learning and would like some help in the following:
I ran a Support Vector Machine Classifier (SVC) on my data with 10-fold cross validation and calculated the accuracy score (which was around 89%). I'm using Python and scikit-learn to perform the task. Here's a code snippet:
def get_scores(features,target,classifier):
X_train, X_test, y_train, y_test =train_test_split(features, target ,
test_size=0.3)
scores = cross_val_score(
classifier,
X_train,
y_train,
cv=10,
scoring='accuracy',
n_jobs=-1)
return(scores)
get_scores(features_from_df,target_from_df,svm.SVC())
Now, how can I use my classifier (after running the 10-folds cv) to test it on X_test and compare the predicted results to y_test? As you may have noticed, I only used X_train and y_train in the cross validation process.
I noticed that sklearn have cross_val_predict:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html should I replace my cross_val_score by cross_val_predict? just FYI: my target data column is binarized (have values of 0s and 1s).
If my approach is wrong, please advise me with the best way to proceed with.
Thanks!
You only need to split your X and y. Do not split the train and test.
Then you can pass your classifier in your case svm to the cross_val_score function to get the accuracy for each experiment.
In just 3 lines of code:
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X, y, cv=10)
print scores
from sklearn.metrics import classification_report
classifier = svm.SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(classification_report(y_test , y_pred)
You're almost there:
# Build your classifier
classifier = svm.SVC()
# Train it on the entire training data set
classifier.fit(X_train, y_train)
# Get predictions on the test set
y_pred = classifier.predict(X_test)
At this point, you can use any metric from the sklearn.metrics module to determine how well you did. For example:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))

Difference Between Python's Functions `cls.score` and `cls.cv_result_`

I have written a code for a logistic regression in Python (Anaconda 3.5.2 with sklearn 0.18.2). I have implemented GridSearchCV() and train_test_split() to sort parameters and split the input data.
My goal is to find the overall (average) accuracy over the 10 folds with a standard error on the test data. Additionally, I try to predict correctly predicted class labels, creating a confusion matrix and preparing a classification report summary.
Please, advise me in the following:
(1) Is my code correct? Please, check each part.
(2) I have tried two different Sklearn functions, clf.score() and clf.cv_results_. I see that they give different results. Which one is correct? (However, the summaries are not included).
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.pipeline import Pipeline
# Load any n x m data and label column. No missing or NaN values.
# I am skipping loading data part. One can load any data to test below code.
sc = StandardScaler()
lr = LogisticRegression()
pipe = Pipeline(steps=[('sc', sc), ('lr', lr)])
parameters = {'lr__C': [0.001, 0.01]}
if __name__ == '__main__':
clf = GridSearchCV(pipe, parameters, n_jobs=-1, cv=10, refit=True)
X_train, X_test, y_train, y_test = train_test_split(Data, labels, random_state=0)
# Train the classifier on data1's feature and target data
clf.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}% \n".format((clf.score(X_train, y_train))*100))
print("Accuracy on test set: {:.2f}%\n".format((clf.score(X_test, y_test))*100))
print("Best Parameters: ")
print(clf.best_params_)
# Alternately using cv_results_
print("Accuracy on training set: {:.2f}% \n", (clf.cv_results_['mean_train_score'])*100))
print("Accuracy on test set: {:.2f}%\n", (clf.cv_results_['mean_test_score'])*100))
# Predict class labels
y_pred = clf.best_estimator_.predict(X_test)
# Confusion Matrix
class_names = ['Positive', 'Negative']
confMatrix = confusion_matrix(y_test, y_pred)
print(confMatrix)
# Accuracy Report
classificationReport = classification_report(labels, y_pred, target_names=class_names)
print(classificationReport)
I will appreciate any advise.
First of all, the desired metrics, i. e. the accuracy metrics, is already considered a default scorer of LogisticRegression(). Thus, we may omit to define scoring='accuracy' parameter of GridSearchCV().
Secondly, the parameter score(X, y) returns the value of the chosen metrics IF the classifier has been refit with the best_estimator_ after sorting all possible options taken from param_grid. It works like so as you have provided refit=True. Note that clf.score(X, y) == clf.best_estimator_.score(X, y). Thus, it does not print out the averaged metrics but rather the best metrics.
Thirdly, the parameter cv_results_ is a much broader summary as it includes the results of each fit. However, it prints out the averaged results obtained by averaging the batch results. These are the values that you wish to store.
Quick Example
Let me hereby introduce a toy example for better understanding:
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, 0.2)
param_grid = {'C': [0.001, 0.01]}
clf = GridSearchCV(cv=10, estimator=LogisticRegression(), refit=True,
param_grid=param_grid)
clf.fit(X_train, y_train)
clf.best_estimator_.score(X_train, y_train)
print('____')
clf.cv_results_
This code yields the following:
0.98107957707289928 # which is the best possible accuracy score
{'mean_fit_time': array([ 0.15465896, 0.23701136]),
'mean_score_time': array([ 0.0006465 , 0.00065773]),
'mean_test_score': array([ 0.934335 , 0.9376739]),
'mean_train_score': array([ 0.96475625, 0.98225632]),
'param_C': masked_array(data = [0.001 0.01],
'params': ({'C': 0.001}, {'C': 0.01})
mean_train_score has two mean values as I grid over two options for C parameter.
I hope that helps!

Categories

Resources