How to save a GridSearchCV object? - python

Lately, I have been working on applying grid search cross-validation (sklearn GridSearchCV) for hyper-parameter tuning in Keras with a Tensorflow backend. As soon as my model is tuned,
I try to save the GridSearchCV object for later use, without success.
The hyper-parameter tuning is done as follows:
x_train, x_val, y_train, y_val = train_test_split(NN_input, NN_target, train_size = 0.85, random_state = 4)
history = History()
kfold = 10
regressor = KerasRegressor(build_fn = create_keras_model, epochs = 100, batch_size=1000, verbose=1)
neurons = np.arange(10,101,10)
hidden_layers = [1,2]
optimizer = ['adam','sgd']
activation = ['relu']
dropout = [0.1]
parameters = dict(neurons = neurons,
                  hidden_layers = hidden_layers,
                  optimizer = optimizer,
                  activation = activation,
                  dropout = dropout)
gs = GridSearchCV(estimator = regressor,
                  param_grid = parameters,
                  scoring='mean_squared_error',
                  n_jobs = 1,
                  cv = kfold,
                  verbose = 3,
                  return_train_score=True)
grid_result = gs.fit(NN_input,
                     NN_target,
                     callbacks=[history],
                     verbose=1,
                     validation_data=(x_val, y_val))
Remark: the create_keras_model function initializes and compiles a Keras Sequential model.
After the cross-validation is performed, I try to save the grid search object (gs) with the following code:
from sklearn.externals import joblib
joblib.dump(gs, 'GS_obj.pkl')
The error I am getting is the following:
TypeError: can't pickle _thread.RLock objects
Could you please let me know what might be the reason for this error?
Thank you!
P.S.: the joblib.dump method works well for saving GridSearchCV objects that are used
for training MLPRegressors from sklearn.

Use
import joblib
directly, instead of
from sklearn.externals import joblib
Save objects or results with:
joblib.dump(gs, 'model_file_name.pkl')
and load your results using:
joblib.load("model_file_name.pkl")
Here is a simple working example:
import joblib
#save your model or results
joblib.dump(gs, 'model_file_name.pkl')
#load your model for further usage
joblib.load("model_file_name.pkl")

Try this:
from sklearn.externals import joblib
joblib.dump(gs.best_estimator_, 'filename.pkl')
If you want to dump your object into a single file, use:
joblib.dump(gs.best_estimator_, 'filename.pkl', compress = 1)
Simple Example:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.externals import joblib
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
gs = GridSearchCV(svc, parameters)
gs.fit(iris.data, iris.target)
joblib.dump(gs.best_estimator_, 'filename.pkl')
#['filename.pkl']
EDIT 1:
you can also save the whole object:
joblib.dump(gs, 'gs_object.pkl')

Subclass the sklearn.model_selection._search.BaseSearchCV class. Override the fit(self, X, y=None, groups=None, **fit_params) method, and modify its internal evaluate_candidates(candidate_params) function. Instead of immediately returning the results dictionary from evaluate_candidates(candidate_params), perform your serialization here (or in the _run_search method depending on your use case). With some additional modifications, this approach has the added benefit of allowing you to execute the grid search sequentially (see the comment in the source code here: _search.py). Note that the results dictionary returned by evaluate_candidates(candidate_params) is the same as the cv_results dictionary. This approach worked for me, but I was also attempting to add save-and-restore functionality for interrupted grid search executions.
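A minimal sketch of that idea (assuming a recent scikit-learn where evaluate_candidates returns the results dictionary; the class name and the gs_checkpoint.pkl path are illustrative, and the private _run_search/evaluate_candidates machinery may change between versions):
import joblib
from sklearn.model_selection import GridSearchCV, ParameterGrid

class CheckpointingGridSearchCV(GridSearchCV):
    # Sketch only: evaluate candidates one at a time (sequentially) and dump
    # the accumulated results after each one, so a partially completed search
    # can still be inspected later.
    def _run_search(self, evaluate_candidates):
        for candidate in ParameterGrid(self.param_grid):
            results = evaluate_candidates([candidate])
            # `results` has the same layout as cv_results_ and grows with
            # every candidate evaluated so far.
            joblib.dump(results, 'gs_checkpoint.pkl')
Fitting CheckpointingGridSearchCV(regressor, parameters, cv=kfold) would then behave like the original gs, except that each candidate's results are also written to disk as they complete.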

Related

dask_xgboost.predict works but cannot be shown - Data must be 1-dimensional

I am trying to create a model using XGBoost.
It seems like I manage to train the model; however, when I try to predict on my test data and see the actual predictions, I get the following error:
ValueError: Data must be 1-dimensional
This is how I tried to predict my data:
from dask_ml.model_selection import train_test_split
import dask
import xgboost
import dask_xgboost
from dask.distributed import Client
import dask_ml.model_selection as dcv
#split the data
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.33,random_state=42)
client = Client(n_workers=10, threads_per_worker=1)
# Trying to do hyperparameter tuning
model_xgb = xgb.XGBRegressor(seed=42,verbose=True)
params={
    'learning_rate':[0.1,0.01,0.05],
    'max_depth':[1,5,8],
    'gamma':[0,0.5,1],
    'scale_pos_weight':[1,3,5]
}
grid_search = GridSearchCV(model_xgb, params, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
# train data with best parameters
bst = dask_xgboost.train(client, grid_search.best_params_, x_train, y_train, num_boost_round=10)
#predict data
dask_xgboost.predict(client, bst, x_test).persist()
The last line with predict works, but when I add compute to the end in order to see the actual array, I get the dimensionality error:
dask_xgboost.predict(client, bst, x_test).persist().compute()
>>>ValueError: Data must be 1-dimensional
How can I get predictions with .predict?
As noted on the PyPI page for dask-xgboost:
Dask-XGBoost has been deprecated and is no longer maintained.
The functionality of this project has been included directly
in XGBoost. To use Dask and XGBoost together, please use
xgboost.dask instead
https://xgboost.readthedocs.io/en/latest/tutorials/dask.html.
The code you provided has a few missing assignments and expressions (e.g. how x is defined, where GridSearchCV is imported from). A few things that probably should be changed:
# note the .dask
model_xgb = xgb.dask.DaskXGBRegressor(seed=42, verbose=True)
grid_search = GridSearchCV(model_xgb, params, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(x_train, y_train)
#train data with best params
model_xgb.client = client
model_xgb.set_params(**grid_search.best_params_)
model_xgb.fit(x_train, y_train, eval_set=[(x_test, y_test)])
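To answer the "How can I get predictions with .predict?" part: with the Dask estimator, predict returns a lazy Dask collection, so (a sketch, assuming x_test is a Dask array or DataFrame as in the split above):
preds = model_xgb.predict(x_test)   # lazy dask array
preds.compute()                     # materialize the actual values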

set_params() in sklearn pipeline not working with TransformedTargetRegressor

I would like to make a prediction using a single tree of my random forest. However, if I wrap my pipeline in a TransformedTargetRegressor, .set_params() does not seem to work.
Please find below an example:
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
# loading data
boston = load_boston()
X = boston["data"]
Y = boston["target"]
# pipeline and training
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(n_estimators = 100, max_depth = 4, random_state = 0))
])
treg = TransformedTargetRegressor(regressor=pipe, transformer=StandardScaler())
treg.fit(X, Y)
# single tree from random forest
tree = treg.regressor_.named_steps['model'].estimators_[0]
x_sample = X[0:1]
print('baseline: ', treg.predict(x_sample))
x_scaled = treg.regressor_.named_steps['scaler'].transform(x_sample)
y_predicted = tree.predict(x_scaled)
y_transformed = treg.transformer_.inverse_transform([y_predicted])
print("internal pipeline changes: ", y_transformed)
new_model = treg.set_params(**{'regressor__model': tree})
y_predicted = new_model.predict(x_sample)
print('with set_params(): ', y_predicted)
The output that I am getting is shown below. I would expect 'with set_params()' to be the same as 'internal pipeline changes':
baseline: [26.41013313]
internal pipeline changes: [[30.02424242]]
with set_params(): [26.41013313]
TransformedTargetRegressor has a parameter regressor and an attribute regressor_. The former can be set with set_params and is considered a hyperparameter, but is not used in prediction; rather, it is cloned and fitted when the TTR is fitted, and stored in the regressor_ attribute.
So you cannot use set_params to update the fitted regressor attribute. (You can check that in your code, new_model.regressor_['model'] is still a random forest.) The best you can do is directly modify the attribute (though this is probably unorthodox, and in some situations may lead to other issues):
import copy
mod_model = copy.deepcopy(treg)
mod_model.regressor_.steps[-1] = ('model', tree)
y_predicted = mod_model.predict(x_sample)
print('with modifying regressor: ', y_predicted)
Apparently, scikit-learn TransformedTargetRegressor objects don't allow you to change the regressor used for prediction, unless you re-fit on the dataset after setting the new regressor with set_params. If you do this:
new_model = treg.set_params(**{'regressor__model': tree})
print(new_model)
you can see that the new parameters have been set. However, as you correctly discovered, the estimator used in predict is still the old one. If you want to change the estimator in the object, you can do:
new_model = treg.set_params(**{'regressor__model': tree})
new_model.fit(X, Y)
new_model.predict(x_sample)
And you can see that the prediction changes and uses the single tree to perform the estimation. If you are interested in the single tree's prediction and don't want to re-fit on the whole dataset, you can just call tree.predict() separately.
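For that last option, a minimal sketch reusing treg, tree and x_sample from the question (the tree alone knows nothing about the input scaler or the target transformer, so both have to be applied by hand):
x_scaled = treg.regressor_.named_steps['scaler'].transform(x_sample)
y_tree = tree.predict(x_scaled)                                    # tree output, in transformed target space
print(treg.transformer_.inverse_transform(y_tree.reshape(-1, 1)))  # back to the original units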

sklearn RandomizedSearchCV with Pipelined KerasClassifier

I am performing hyperparameter tuning optimization tasks with sklearn on Keras models. I am trying to optimize KerasClassifiers within a Pipeline...
Code follows:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold,RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline
my_seed=7
dataframe = pd.read_csv("z:/sonar.all-data.txt", header=None)
dataset = dataframe.values
# split into input and output variables
X = dataset[:,:60].astype(float)
Y = dataset[:,60]
encoder = LabelEncoder()
Y_encoded=encoder.fit_transform(Y)
myScaler = StandardScaler()
X_scaled = myScaler.fit_transform(X)
def create_keras_model(hidden=60):
    model = Sequential()
    model.add(Dense(units=hidden, input_dim=60, kernel_initializer="normal", activation="relu"))
    model.add(Dense(1, kernel_initializer="normal", activation="sigmoid"))
    # compile model
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
def create_pipeline(hidden=60):
    steps = []
    steps.append(('scaler', StandardScaler()))
    steps.append(('dl', KerasClassifier(build_fn=create_keras_model, hidden=hidden, verbose=0)))
    pipeline = Pipeline(steps)
    return pipeline
my_neurons = [15, 30, 60]
my_epochs= [50, 100, 150]
my_batch_size = [5,10]
my_param_grid = dict(hidden=my_neurons, epochs=my_epochs, batch_size=my_batch_size)
model2Tune = KerasClassifier(build_fn=create_keras_model, verbose=0)
model2Tune2 = create_pipeline()
griglia = RandomizedSearchCV(estimator=model2Tune, param_distributions = my_param_grid, n_iter=8 )
griglia.fit(X_scaled, Y_encoded) #this works
griglia2 = RandomizedSearchCV(estimator=create_pipeline, param_distributions = my_param_grid, n_iter=8 )
griglia2.fit(X, Y_encoded) #this does not
We see that RandomizedSearchCV works with griglia, whilst it does not work with griglia2, returning
"TypeError: estimator should be an estimator implementing 'fit'
method, was passed".
Is it possible to amend the code to make it run under a Pipeline object?
Thanks in advance
The estimator parameter expects an object, not a function. Currently you are passing a reference to the function which generates the pipeline object. Add () to call it and solve this:
griglia2 = RandomizedSearchCV(estimator=create_pipeline(), param_distributions = my_param_grid, n_iter=8 )
Now for the second comment, about the invalid parameters error: you need to prefix the actual parameter names with the step name you defined when creating the pipeline, so that they can be passed on correctly.
Look at the description of Pipeline usage here.
Use this:
my_param_grid = dict(dl__hidden=my_neurons, dl__epochs=my_epochs,
                     dl__batch_size=my_batch_size)
Notice the dl__ (with two underscores). This is useful when you want to tune the parameters of multiple objects inside the pipeline.
For example, let's say that along with the above parameters, you also want to tune or specify the parameters of StandardScaler.
Then your parameter grid becomes:
my_param_grid = dict(dl__hidden=my_neurons, dl__epochs=my_epochs,
                     dl__batch_size=my_batch_size,
                     scaler__with_mean=False)
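Putting both fixes together, a sketch reusing create_pipeline, my_neurons, my_epochs, my_batch_size, X and Y_encoded from the question:
my_param_grid = dict(dl__hidden=my_neurons,
                     dl__epochs=my_epochs,
                     dl__batch_size=my_batch_size)
griglia2 = RandomizedSearchCV(estimator=create_pipeline(),   # note the ()
                              param_distributions=my_param_grid,
                              n_iter=8)
griglia2.fit(X, Y_encoded)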
Hope this clears things up.

Can I send callbacks to a KerasClassifier?

I want the classifier to run faster and stop early if the patience reaches the number I set. In the following code it does 10 iterations of fitting the model.
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.constraints import maxnorm
from keras.optimizers import SGD
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataframe = pandas.read_csv("sonar.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:60].astype(float)
Y = dataset[:,60]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
calls=[EarlyStopping(monitor='acc', patience=10), ModelCheckpoint('C:/Users/Nick/Data Science/model', monitor='acc', save_best_only=True, mode='auto', period=1)]
def create_baseline():
    # create model
    model = Sequential()
    model.add(Dropout(0.2, input_shape=(33,)))
    model.add(Dense(33, init='normal', activation='relu', W_constraint=maxnorm(3)))
    model.add(Dense(16, init='normal', activation='relu', W_constraint=maxnorm(3)))
    model.add(Dense(122, init='normal', activation='softmax'))
    # Compile model
    sgd = SGD(lr=0.1, momentum=0.8, decay=0.0, nesterov=False)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model
numpy.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, nb_epoch=300, batch_size=16, verbose=0, callbacks=calls)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Here is the resulting error:
RuntimeError: Cannot clone object <keras.wrappers.scikit_learn.KerasClassifier object at 0x000000001D691438>, as the constructor does not seem to set parameter callbacks
I changed the cross_val_score call to the following:
numpy.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, nb_epoch=300, batch_size=16, verbose=0, callbacks=calls)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold, fit_params={'callbacks':calls})
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
and now I get this error:
ValueError: need more than 1 value to unpack
This code came from here. The code is by far the most accurate I've used so far. The problem is that there is no model.fit() call defined anywhere in the code. It also takes forever to fit. The fit() operation happens inside results = cross_val_score(...), and there is no parameter there to pass a callback through.
How do I go about doing this?
Also, how do I run the model trained on a test set?
I need to be able to save the trained model for later use...
Reading from here, which is the source code of KerasClassifier, you can pass it the arguments of fit and they should be used.
I don't have your dataset so I cannot test it, but you can tell me if this works and if not I will try and adapt the solution. Change this line :
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, nb_epoch=300, batch_size=16, verbose=0, callbacks=[...your_callbacks...])))
A small explanation of what's happening: KerasClassifier takes all the possible arguments for fit, predict and score, and uses them accordingly when each method is called. They made a function that filters the arguments that should go to each of the above functions that can be called in the pipeline.
I guess there are several fit and predict calls inside the StratifiedKFold step to train on different splits every time.
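Roughly, that filtering works like the sketch below (an illustration of the idea, not the actual Keras wrapper source); constructor kwargs are kept aside and only handed to the underlying call that accepts them:
import inspect

def filter_kwargs(fn, sk_params):
    # keep only the kwargs that `fn` actually accepts
    accepted = inspect.signature(fn).parameters
    return {k: v for k, v in sk_params.items() if k in accepted}

# e.g. fit arguments such as batch_size and callbacks end up in model.fit(...),
# while build_fn arguments (like hidden layer sizes) go to create_baseline().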
The reason it takes forever to fit, and fits 10 times, is that one fit does 300 epochs, as you asked. So the KFold repeats this step over the different folds:
it calls fit with all the parameters given to KerasClassifier (300 epochs and batch size = 16), training on 9/10 of your data and using 1/10 as validation.
EDIT:
OK, so I took the time to download the dataset and try your code... First of all, you need to correct a "few" things in your network:
Your input has 60 features. You clearly show it in your data prep:
X = dataset[:,:60].astype(float)
so why would you have this:
model.add(Dropout(0.2, input_shape=(33,)))
please change it to:
model.add(Dropout(0.2, input_shape=(60,)))
About your targets/labels: you changed the objective from the original code (binary_crossentropy) to categorical_crossentropy, but you didn't change your Y array. So either do this in your data preparation:
from keras.utils.np_utils import to_categorical
encoded_Y = to_categorical(encoder.transform(Y))
or change your objective back to binary_crossentropy.
Now the network's output size: 122 units on the last dense layer? Your dataset obviously has 2 categories, so why are you trying to output 122 classes? It won't match the target. Please change your last layer back to:
model.add(Dense(2, init='normal', activation='softmax'))
if you choose to use categorical_crossentropy, or
model.add(Dense(1, init='normal', activation='sigmoid'))
if you go back to binary_crossentropy.
So now that your network compiles, I could start to troubleshoot.
Here is your solution
So now I could get the real error message. It turns out that when you feed fit_params=whatever into the cross_val_score() function, you are feeding those parameters to a pipeline. In order to know to which part of the pipeline you want to send those parameters, you have to specify it like this:
fit_params={'mlp__callbacks':calls}
Your error was saying that the process couldn't unpack 'callbacks'.split('__', 1) into 2 values. It was actually looking for the name of the pipeline's step to apply this to.
It should be working now :)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold, fit_params={'mlp__callbacks':calls})
BUT, you should be aware of what's happening here... the cross-validation actually calls the create_baseline() function to recreate the model from scratch 10 times and trains it 10 times on different parts of the dataset. So it's not doing the epochs just once, as you were saying; it's doing 300 epochs, 10 times.
What is also happening as a consequence of using this tool: since the models are always different, the fit() method is applied 10 times on different models; therefore, the callbacks are also applied 10 different times, the files saved by ModelCheckpoint() get overridden, and you find yourself with only the best model of the last run.
This is intrinsic to the tools you use; I don't see any way around it. It comes as a consequence of using different general-purpose tools that weren't specifically designed to be used together in all possible configurations.
Try:
estimators.append(('mlp',
                   KerasClassifier(build_fn=create_model2,
                                   nb_epoch=300,
                                   batch_size=16,
                                   verbose=0,
                                   callbacks=list_of_callbacks)))
where list_of_callbacks is a list of the callbacks you want to apply. You can find details here. It's mentioned there that parameters fed to KerasClassifier can be legal fit parameters.
It's also worth mentioning that if you are using multiple runs with GPUs, there might be a problem due to several reported memory leaks, especially when using Theano. I also noticed that running multiple fits consecutively may show results which seem not to be independent when using the sklearn API.
Edit:
Try also:
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold, fit_params = {'mlp__callbacks': calls})
Instead of putting the callbacks list in the wrapper instantiation.
This is what I have done:
results = cross_val_score(estimator, X, Y, cv=kfold,
                          fit_params = {'callbacks': [checkpointer, plateau]})
and it has worked so far.
Despite the TensorFlow, Keras & SciKeras documentation suggesting you can define training callbacks via the fit method, for my setup it turns out (as @NassimBen suggests) you should do it through the model constructor instead.
Rather than this:
model = KerasClassifier(..).fit(X, y, callbacks=[<HERE>])
Try this:
model = KerasClassifier(callbacks=[<HERE>]).fit(X, y)

When do items in the Pipeline call fit_transform(), and when do they call transform()? (scikit-learn, Pipeline)

I'm trying to fit a model that I've put together using Pipeline:
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.grid_search import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
cross_validation_object = cross_validation.StratifiedKFold(Y, n_folds = 10)
scaler = MinMaxScaler(feature_range = [0,1])
logistic_fit = LogisticRegression()
pipeline_object = Pipeline([('scaler', scaler),('model', logistic_fit)])
tuned_parameters = [{'model__C': [0.01, 0.1, 1, 10],
                     'model__penalty': ['l1', 'l2']}]
grid_search_object = GridSearchCV(pipeline_object, tuned_parameters, cv = cross_validation_object, scoring = 'accuracy')
grid_search_object.fit(X_train,Y_train)
My question: Is the best_estimator going to scale the test data based on the values in the training data? For example, if I call:
grid_search_object.best_estimator_.predict(X_test)
It will NOT try to fit the scaler on the X_test data, right? It will just transform it using the original parameters.
Thanks!
The predict methods never fit any data. In this case, exactly as you describe it, the best_estimator_ pipeline is going to scale based on the scaling it has learnt on the training set.
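A quick hedged check of that behaviour, reusing the fitted grid_search_object from the question (the step names 'scaler' and 'model' are the ones defined in the pipeline above):
best = grid_search_object.best_estimator_
X_test_scaled = best.named_steps['scaler'].transform(X_test)   # transform only, no re-fit
manual = best.named_steps['model'].predict(X_test_scaled)
direct = best.predict(X_test)
assert (manual == direct).all()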
