Model help using Scikit-learn when using GridSearch - python

As part of the Enron project, I built the model attached below; here is a summary of the steps.
The following model gives suspiciously perfect scores:
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # with the full dataset
for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.predict(x_test)
The following model gives more reasonable but lower scores:
cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(features, labels)  # with the full dataset
for train_ind, test_ind in cv.split(features, labels):
    x_train, x_test = features[train_ind], features[test_ind]
    y_train, y_test = labels[train_ind], labels[test_ind]
    gcv.best_estimator_.fit(x_train, y_train)
    gcv.best_estimator_.predict(x_test)
Used SelectKBest to get the feature scores, sorted the features, and tried combinations of higher- and lower-scoring features.
Used an SVM with a GridSearch using a StratifiedShuffleSplit.
Used the best_estimator_ to predict and calculate precision and recall.
The problem is that the estimator is spitting out perfect scores, in some cases 1.0.
But when I refit the best classifier on the training data and then run the test, it gives reasonable scores.
My doubt/question is: what exactly does GridSearch do with the test data after the split using the ShuffleSplit object we pass in to it? I assumed it would not fit anything on the test data. If that were true, then when I predict using the same test data, it should not give such high scores, right? Since I used a random_state value, the ShuffleSplit should have created the same splits for the grid search fit and for the predict.
So, is using the same ShuffleSplit for both wrong?

GridSearchCV, as @Gauthier Feuillen said, is used to search for the best parameters of an estimator on the given data.
Description of GridSearchCV:
gcv = GridSearchCV(pipe, clf_params,cv=cv)
gcv.fit(features,labels)
clf_params will be expanded to get all possible combinations separately using ParameterGrid.
features will now be split into features_train and features_test using cv. Same for labels.
Now the grid search estimator (pipe) will be trained using features_train and labels_train and scored using features_test and labels_test.
For each possible combination of parameters in step 3, steps 4 and 5 will be repeated for the cv iterations. The average score across the cv iterations will be calculated and assigned to that parameter combination. This can be accessed using the cv_results_ attribute of GridSearchCV.
For the parameters which give the best score, the internal estimator will be re-initialized with those parameters and refit on the whole data supplied to it (features and labels).
Because of that last step, you get different scores in the first and second approaches. In the first approach, all data is used for training and you are predicting on that same data; the second approach predicts on previously unseen data.
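A minimal sketch of the safer workflow (my own illustration, reusing pipe, clf_params, features and labels from the question): hold out a test set before running the search, so the refit best_estimator_ never sees it.
from sklearn.model_selection import train_test_split, StratifiedShuffleSplit, GridSearchCV
from sklearn.metrics import precision_score, recall_score

# hold out a test set BEFORE the search; the refit best_estimator_ never sees it
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42)

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.2, random_state=42)
gcv = GridSearchCV(pipe, clf_params, cv=cv)
gcv.fit(x_train, y_train)                    # best_estimator_ is refit on x_train only

pred = gcv.best_estimator_.predict(x_test)   # evaluation on truly unseen data
print(precision_score(y_test, pred), recall_score(y_test, pred))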

Basically the grid search will:
Try every combination of your parameter grid
For each of them it will do a K-fold cross validation
Select the best available.
So your second case is the right one. Otherwise you are actually predicting on data that you trained with (which is not the case in the second option, where you only keep the best parameters from your grid search).

Related

Get all prediction values for each CV in GridSearchCV

I have a time-dependent data set, where I (as an example) am trying to do some hyperparameter tuning on a Lasso regression.
For that I use sklearn's TimeSeriesSplit instead of regular KFold CV, i.e. something like this:
tscv = TimeSeriesSplit(n_splits=5)
model = GridSearchCV(
    estimator=pipeline,
    param_grid={"estimator__alpha": np.linspace(0.05, 1, 50)},
    scoring="neg_mean_absolute_percentage_error",
    n_jobs=-1,
    cv=tscv,
    return_train_score=True,
)
model.fit(X_train, y_train)
With this I get a model, which I can then use for predictions etc. The idea behind that cross validation is based on this:
However, my issue is that I would actually like to have the predictions from all the test sets from all the CV splits, and I have no idea how to get that out of the model.
If I try cv_results_ I get the score (from the scoring parameter) for each split and each hyperparameter, but I don't seem to be able to find the prediction values for each sample in each test split. I actually need those for some backtesting. I don't think it would be "fair" to use the final model to predict the previous values; I would imagine there would be some kind of overfitting in that case.
So yeah, is there any way for me to extract the predicted values for each split?
You can have custom scoring functions in GridSearchCV. With that you can predict outputs with the estimator given to GridSearchCV in that particular fold.
From the documentation, the scoring parameter is:
Strategy to evaluate the performance of the cross-validated model on the test set.
from sklearn.metrics import mean_absolute_percentage_error

def custom_scorer(clf, X, y):
    y_pred = clf.predict(X)
    # save y_pred somewhere
    return -mean_absolute_percentage_error(y, y_pred)

model = GridSearchCV(estimator=pipeline,
                     scoring=custom_scorer)
The input X and y in the above code come from the test set. clf is the pipeline given to the estimator parameter.
Obviously your estimator should implement the predict method (it should be a valid scikit-learn model). You can add other scorings to the custom one to avoid nonsensical scores from the custom function.
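If you only need the out-of-fold predictions for the single best parameter setting, a simpler sketch (my own addition, assuming X_train and y_train are NumPy arrays) is to refit a clone of the pipeline with best_params_ on each training split and store the test-split predictions:
from sklearn.base import clone
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
best_pipeline = clone(pipeline).set_params(**model.best_params_)

fold_predictions = []  # one array of test-split predictions per split
for train_idx, test_idx in tscv.split(X_train):
    best_pipeline.fit(X_train[train_idx], y_train[train_idx])
    fold_predictions.append(best_pipeline.predict(X_train[test_idx]))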

Model Fitting and Cross-Validation

I'm training a dataset and then testing it on some other dataset.
To improve performance, I wanted to fine-tune my parameters with a 5-fold cross validation.
However, I think I'm not writing the correct code, because when I try to fit the model to my testing set it says the model hasn't been fitted yet. I thought the cross-validation part fitted the model? Or maybe I have to extract it?
Here's my code:
svm = SVC(kernel='rbf', probability=True, random_state=42)
accuracies = cross_val_score(svm, data_train, lbs_train, cv=5)
pred_test = svm.predict(data_test)
accuracy = accuracy_score(lbs_test, pred_test)
That is correct, cross_val_score doesn't return a fitted model. In your example, you have cv=5, which means the model was fit 5 times. So, which of those do you want? The last?
The function cross_val_score is a simpler version of sklearn.model_selection.cross_validate, which returns not only the scores but more information.
So you can do something like this:
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

svm = SVC(kernel='rbf', probability=True, random_state=42)
cv_results = cross_validate(svm, data_train, lbs_train, cv=5, return_estimator=True)
# cv_results is a dict with the following keys:
# 'test_score'  which is what cross_val_score returns
# 'train_score' (only present if return_train_score=True)
# 'fit_time'
# 'score_time'
# 'estimator'   which is a tuple of size cv, only present if return_estimator=True
accuracies = cv_results['test_score']  # what you had before
svms = cv_results['estimator']
print(len(svms))  # 5
svm = svms[-1]  # the last fitted svm, or pick any that you want
pred_test = svm.predict(data_test)
accuracy = accuracy_score(lbs_test, pred_test)
Note, here you need to pick one of the 5 fitted SVMs. Ideally, you would use cross-validation for testing the performance of your model, so you don't need to do it again at the end. Then you would fit your model one more time, but this time with ALL the data, and that would be the model you actually use in production.
Another note: you mentioned that you want this to fine-tune the parameters of your model. Perhaps you should look at hyper-parameter optimization, for example https://datascience.stackexchange.com/a/36087/54395, where you will see how to use cross-validation and define a parameter search space.
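A minimal sketch of that workflow (my own illustration, reusing data_train, lbs_train and data_test from the question): estimate performance with cross-validation, then fit one final model on all the training data for actual use.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

svm = SVC(kernel='rbf', probability=True, random_state=42)

# performance estimate: 5 scores, one per held-out fold
accuracies = cross_val_score(svm, data_train, lbs_train, cv=5)
print(accuracies.mean(), accuracies.std())

# final model: fit once on ALL the training data, then use it for new data
svm.fit(data_train, lbs_train)
pred_test = svm.predict(data_test)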

Getting probabilities of best model for RandomizedSearchCV

I'm using RandomizedSearchCV to get the best parameters with a 10-fold cross-validation and 100 iterations. This works well. But now I would like to also get the probabilities of each predicted test data point (like predict_proba) from the best performing model.
How can this be done?
I see two options. First, perhaps it is possible to get these probabilities directly from the RandomizedSearchCV, or second, get the best parameters from RandomizedSearchCV and then do another 10-fold cross-validation (with the same seed so that I get the same splits) with these best parameters.
Edit: Is the following code correct for getting the probabilities of the best performing model? X is the training data, y are the labels, and model is my RandomizedSearchCV containing a Pipeline with missing-value imputation, standardization, and an SVM.
cv_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_prob = np.empty([y.size, nrClasses]) * np.nan
best_model = model.fit(X, y).best_estimator_
for train, test in cv_outer.split(X, y):
    probas_ = best_model.fit(X[train], y[train]).predict_proba(X[test])
    y_prob[test] = probas_
If I understood it right, you would like to get the individual scores of every sample in your test split for the case with the highest CV score. If that is the case, you have to use one of those CV generators which give you control over split indices, such as those here: http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html#cross-validation-generators
If you want to calculate scores of a new test sample with the best performing model, the predict_proba() function of RandomizedSearchCV would suffice, given that your underlying model supports it.
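For that second case, a minimal sketch (my own addition; it assumes the search was fit with the default refit=True, that the pipeline's final estimator exposes predict_proba, e.g. SVC(probability=True), and X_new is a hypothetical array of new samples):
# model is the fitted RandomizedSearchCV from the question; with refit=True (the
# default) it delegates predict_proba to the best estimator refit on all of X, y
proba_new = model.predict_proba(X_new)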
Example for the first case (getting the scores for the best-scoring split):
import numpy
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=10, random_state=0, shuffle=True)
scores = cross_val_score(svc, X, y, cv=skf, n_jobs=-1)
max_score_split = numpy.argmax(scores)
Now that you know that your best model happens at max_score_split, you can get that split yourself and fit your model with it.
train_indices, test_indices = list(skf.split(X, y))[max_score_split]
X_train = X[train_indices]
y_train = y[train_indices]
X_test = X[test_indices]
y_test = y[test_indices]
model.fit(X_train, y_train)  # this is your model object that should have been created before
And finally get your predictions by:
model.predict_proba(X_test)
I haven't tested the code myself, but it should work with minor modifications.
You need to look at cv_results_; this will give you the scores for each fold, the mean scores across folds, fit times, etc.
If you want predict_proba() for each of the iterations, the way to do this would be to loop through the params given in cv_results_, re-fit the model for each of them, then predict the probabilities, as the individual models are not cached anywhere, as far as I know.
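A minimal sketch of that loop (my own illustration; model is the fitted RandomizedSearchCV, and X_train, y_train, X_test are hypothetical names for a fixed training/held-out split):
from sklearn.base import clone

probas_per_setting = []
for params in model.cv_results_['params']:
    est = clone(model.estimator).set_params(**params)
    est.fit(X_train, y_train)            # re-fit, since per-setting models are not cached
    probas_per_setting.append(est.predict_proba(X_test))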
best_params_ will give you the parameter combination with the best score, for when you want to train a model using just the best parameters next time.
See cv_results_ in the information page http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Using Scikit-Learn GridSearchCV for cross validation with PredefinedSplit - Suspiciously good cross validation results

I'd like to use scikit-learn's GridSearchCV to perform a grid search and calculate the cross validation error using a predefined development and validation split (1-fold cross validation).
I'm afraid that I've done something wrong, because my validation accuracy is suspiciously high. Where I think I'm going wrong: I'm splitting up my training data into development and validation sets, training on the development set and recording the cross validation score on the validation set. My accuracy might be inflated because I am really training on a mix of the development and validation sets, then testing on the validation set. I'm not sure if I'm using scikit-learn's PredefinedSplit module correctly. Details below:
Following this answer, I did the following:
import numpy as np
from sklearn.model_selection import train_test_split, PredefinedSplit
from sklearn.model_selection import GridSearchCV
# I split up my data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    data[training_features], data[training_response], test_size=0.2, random_state=550)

# sanity check - dimensions of training and test splits
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
# dimensions of X_train and y_train are (323430, 26) and (323430, 1) respectively
# dimensions of X_test and y_test are (80858, 26) and (80858, 1)
''' Now, I define indices for a pre-defined split.
This is a 323430-dimensional array, where the indices for the development
set are set to -1, and the indices for the validation set are set to 0.'''
validation_idx = np.repeat(-1, y_train.shape)
np.random.seed(550)
validation_idx[np.random.choice(validation_idx.shape[0],
                                int(round(.2 * validation_idx.shape[0])),
                                replace=False)] = 0

# Now, create a list which contains a single tuple of two elements,
# which are arrays containing the indices for the development and
# validation sets, respectively.
validation_split = list(PredefinedSplit(validation_idx).split())

# sanity check
print(len(validation_split[0][0]))  # outputs 258744
print(len(validation_split[0][0]) / float(validation_idx.shape[0]))  # outputs .8
print(validation_idx.shape[0] == y_train.shape[0])  # True
print(set(validation_split[0][0]).intersection(set(validation_split[0][1])))  # set([])
Now, I run a grid search using GridSearchCV. My intention is that a model will be fit on the development set for each parameter combination over the grid, and the cross validation score will be recorded when the resulting estimator is applied to the validation set.
# a vanilla XGBoost model
model1 = XGBClassifier()

# create a parameter grid for the number of trees and depth of trees
n_estimators = range(300, 1100, 100)
max_depth = [8, 10]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

# A grid search.
# NOTE: I'm passing a PredefinedSplit object as an argument to the `cv` parameter.
grid_search = GridSearchCV(model1, param_grid,
                           scoring='neg_log_loss',
                           n_jobs=-1,
                           cv=validation_split,
                           verbose=1)
Now, here is where a red flag is raised for me. I use the best estimator found by the grid search to check the accuracy on the validation set. It's very high: 0.89207865689639176. What's worse is that it's almost identical to the accuracy I get if I use the classifier on the development set (on which I just trained): 0.89295597192591902. BUT when I use the classifier on the true test set, I get a much lower accuracy, roughly 0.78:
# accuracy score on the validation set. This yields .89207865
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][1]]),
               y_true=y_train[validation_split[0][1]])

# accuracy score when applied to the development set. This yields .8929559
accuracy_score(y_pred=grid_result2.predict(X_train.iloc[validation_split[0][0]]),
               y_true=y_train[validation_split[0][0]])

# finally, the score when applied to the test set. This yields .783
accuracy_score(y_pred=grid_result2.predict(X_test), y_true=y_test)
To me, the almost exact correspondence between the model's accuracy on the development and validation datasets, and the significant loss in accuracy on the test set, is a clear sign that I'm training on the validation data by accident, and thus my cross validation score is not representative of the true accuracy of the model.
I can't seem to find where I went wrong, mostly because I don't know what GridSearchCV is doing under the hood when it receives a PredefinedSplit object as the argument to the cv parameter.
Any ideas where I went wrong? If you need more details/elaboration, please let me know. The code is also in this notebook on github.
Thanks!
You need to set refit=False (it is not the default), otherwise the grid search will refit the estimator on the whole dataset (ignoring cv) after the search completes.
Yes, there was a data leakage problem for the validation data. Set refit=False for GridSearchCV and it will not refit on the whole data, i.e. the training plus validation data.
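A minimal sketch of that fix (my own illustration, reusing model1, param_grid and validation_split from the question; the manual refit on the development indices only is an assumption about what you want the final model trained on):
grid_search = GridSearchCV(model1, param_grid,
                           scoring='neg_log_loss',
                           n_jobs=-1,
                           cv=validation_split,
                           refit=False,       # do NOT refit on development + validation data
                           verbose=1)
grid_search.fit(X_train, y_train)

# refit manually on the development indices only, with the winning parameters
dev_idx = validation_split[0][0]
best_model = XGBClassifier(**grid_search.best_params_)
best_model.fit(X_train.iloc[dev_idx], y_train[dev_idx])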

How do I handle unbalanced classes in my classifier?

I am using LinearSVC to classify my documents into categories. However, my dataset is unbalanced, with some categories having 48,000 documents and some as few as 100. When I train my model, even using StratifiedKFold, I see that the category with 48,000 documents gets a larger portion of documents (3,300) compared to the others. In such a case, it would definitely give me biased predictions. How can I make sure this selection isn't biased?
kf = StratifiedKFold(n_splits=10, shuffle=True)
for train_index, test_index in kf.split(docs, labels):
    X_train, X_test = docs[train_index], docs[test_index]
    Y_train, Y_test = labels[train_index], labels[test_index]
Then I'm writing these (X_train, Y_train) to a file, computing the feature matrix, and passing them to the classifier as follows:
model1 = LinearSVC()
model1 = model1.fit(matrix, label_tmp)
pred = model1.predict(matrix_test)
print("Accuracy is:")
print(metrics.accuracy_score(label_test, pred))
print(metrics.classification_report(label_test, pred))
The StratifiedKFold method by default takes into account the ratio of labels in all your classes, meaning that each fold will have the exact (or close to exact) ratio of each label in that sample. Whether you want to adjust for this or not is somewhat up to you - you can either let the classifier learn some kind of bias for labels with more samples (as you are now), or you can do one of two things:
Construct a separate train/test set, where the training set has an equal number of samples in each label (so in your case each class label in the training set might only have 50 examples, which is not ideal). Then you can train on your training set and test on the rest. If you do this multiple times with different samples, you are essentially doing k-fold cross validation, just choosing your sample sizes in a different way.
You can change your loss function (i.e. the way you initialize LinearSVC()) to account for the class imbalance. For example: model = LinearSVC(class_weight='balanced'). This will cause the model to learn a loss function that takes the class imbalance into account.
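A minimal sketch of the second option (my own addition, reusing the matrix and label variables from the question; the per-class report is there because overall accuracy alone hides imbalance problems):
from sklearn.svm import LinearSVC
from sklearn import metrics

# 'balanced' weights each class inversely proportional to its frequency in the data
model1 = LinearSVC(class_weight='balanced')
model1.fit(matrix, label_tmp)
pred = model1.predict(matrix_test)

# per-class precision/recall shows whether minority classes are still being ignored
print(metrics.classification_report(label_test, pred))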
