I am performing a hyperparameter tuning task with sklearn on a Keras model. I am trying to optimize a KerasClassifier within a Pipeline...
Code follows:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline
my_seed=7
dataframe = pd.read_csv("z:/sonar.all-data.txt", header=None)
dataset = dataframe.values
# split into input and output variables
X = dataset[:,:60].astype(float)
Y = dataset[:,60]
encoder = LabelEncoder()
Y_encoded=encoder.fit_transform(Y)
myScaler = StandardScaler()
X_scaled = myScaler.fit_transform(X)
def create_keras_model(hidden=60):
    model = Sequential()
    model.add(Dense(units=hidden, input_dim=60, kernel_initializer="normal", activation="relu"))
    model.add(Dense(1, kernel_initializer="normal", activation="sigmoid"))
    # compile model
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

def create_pipeline(hidden=60):
    steps = []
    steps.append(('scaler', StandardScaler()))
    steps.append(('dl', KerasClassifier(build_fn=create_keras_model, hidden=hidden, verbose=0)))
    pipeline = Pipeline(steps)
    return pipeline
my_neurons = [15, 30, 60]
my_epochs= [50, 100, 150]
my_batch_size = [5,10]
my_param_grid = dict(hidden=my_neurons, epochs=my_epochs, batch_size=my_batch_size)
model2Tune = KerasClassifier(build_fn=create_keras_model, verbose=0)
model2Tune2 = create_pipeline()
griglia = RandomizedSearchCV(estimator=model2Tune, param_distributions = my_param_grid, n_iter=8 )
griglia.fit(X_scaled, Y_encoded) #this works
griglia2 = RandomizedSearchCV(estimator=create_pipeline, param_distributions = my_param_grid, n_iter=8 )
griglia2.fit(X, Y_encoded) #this does not
We see that RandomizedSearchCV works with griglia, whilst it does not work with griglia2, returning
"TypeError: estimator should be an estimator implementing 'fit'
method, was passed".
Is it possible to amend the code to make it run under a Pipeline object?
Thanks in advance
The estimator parameter expects an estimator object, not a function. Currently you are passing the function that builds the pipeline; add () to call it, so that the resulting Pipeline object is passed:
griglia2 = RandomizedSearchCV(estimator=create_pipeline(), param_distributions = my_param_grid, n_iter=8 )
Now for the second point, about the invalid parameters error: you need to prefix each parameter with the step name you defined when creating the pipeline, so that it can be routed to the right step. Look at the description of Pipeline usage here.
Use this:
my_param_grid = dict(dl__hidden=my_neurons, dl__epochs=my_epochs,
dl__batch_size=my_batch_size)
Notice the dl__ (with two underscores). This is useful when you want to tune the parameters of multiple objects inside the pipeline.
For example, let's say that along with the above parameters, you also want to tune or specify the parameters of the StandardScaler.
Then your parameter grid becomes:
my_param_grid = dict(dl__hidden=my_neurons, dl__epochs=my_epochs,
                     dl__batch_size=my_batch_size,
                     scaler__with_mean=[False])
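Putting both fixes together, a minimal sketch of the corrected search over the pipeline (reusing the question's variable names) would be:

# Pass a Pipeline instance (note the parentheses on create_pipeline)
# and the step-prefixed parameter grid defined above
griglia2 = RandomizedSearchCV(estimator=create_pipeline(),
                              param_distributions=my_param_grid,
                              n_iter=8)
griglia2.fit(X, Y_encoded)  # raw X is fine: the pipeline's own scaler handles the scaling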
Hope this clears things up.
Related
I would like to make a prediction with a single tree from my random forest. However, if I wrap my pipeline in a TransformedTargetRegressor, .set_params does not seem to work.
Please find below an example:
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
# loading data
boston = load_boston()
X = boston["data"]
Y = boston["target"]
# pipeline and training
pipe = Pipeline([
('scaler', StandardScaler()),
('model', RandomForestRegressor(n_estimators = 100, max_depth = 4, random_state = 0))
])
treg = TransformedTargetRegressor(regressor=pipe, transformer=StandardScaler())
treg.fit(X, Y)
# single tree from random forest
tree = treg.regressor_.named_steps['model'].estimators_[0]
x_sample = X[0:1]
print('baseline: ', treg.predict(x_sample))
x_scaled = treg.regressor_.named_steps['scaler'].transform(x_sample)
y_predicted = tree.predict(x_scaled)
y_transformed = treg.transformer_.inverse_transform([y_predicted])
print("internal pipeline changes: ", y_transformed)
new_model = treg.set_params(**{'regressor__model': tree})
y_predicted = new_model.predict(x_sample)
print('with set_params(): ', y_predicted)
The output I am getting is shown below. I would expect the 'with set_params():' result to match the 'internal pipeline changes:' result:
baseline: [26.41013313]
internal pipeline changes: [[30.02424242]]
with set_params(): [26.41013313]
TransformedTargetRegressor has a parameter regressor and an attribute regressor_. The former can be set with set_params and is considered a hyperparameter, but is not used in prediction; rather, it is cloned and fitted when the TTR is fitted, and stored in the regressor_ attribute.
So you cannot use set_params to update the fitted regressor_ attribute. (You can check that in your code, new_model.regressor_['model'] is still a random forest.) The best you can do is directly modify the attribute (though this is probably unorthodox, and in some situations may lead to other issues):
import copy
mod_model = copy.deepcopy(treg)
mod_model.regressor_.steps[-1] = ('model', tree)
y_predicted = mod_model.predict(x_sample)
print('with modifying regressor: ', y_predicted)
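As a quick check of the earlier point that set_params does not touch the fitted pipeline, a small sketch reusing the question's variables:

new_model = treg.set_params(**{'regressor__model': tree})
# The fitted pipeline stored in regressor_ still ends with the full forest
print(type(new_model.regressor_['model']).__name__)  # RandomForestRegressor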
Apparently, scikit-learn TransformedTargetRegressor objects don't allow you to change the regressor used to predict unless you re-fit on the dataset with the new regressor after calling set_params. If you do this:
new_model = treg.set_params(**{'regressor__model': tree})
print(new_model)
you can see that the new parameters have been set. However, as you correctly discovered, the estimator used in predict is still the old one. If you want to change the estimator in the object, you can do:
new_model = treg.set_params(**{'regressor__model': tree})
new_model.fit(X, Y)
new_model.predict(x_sample)
And you can see that the prediction changes and uses the single tree to perform the estimation. If you are interested in the single tree's prediction and do not want to re-fit on the whole dataset, you can just call tree.predict() separately.
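Concretely, predicting with the single tree without re-fitting amounts to the same steps already shown in the question (a sketch reusing the fitted treg and tree):

# Scale the input with the fitted scaler, predict with the single tree,
# then invert the target scaling
x_scaled = treg.regressor_.named_steps['scaler'].transform(x_sample)
y_single_tree = treg.transformer_.inverse_transform([tree.predict(x_scaled)])
print(y_single_tree)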
I'm using sklearn pipelines to build a Keras autoencoder model and use gridsearch to find the best hyperparameters. This works fine if I use a Multilayer Perceptron model for classification; however, in the autoencoder I need the output values to be the same as the input. In other words, I am using a StandardScaler instance in the pipeline to scale the input values, which leads to my question: how can I make the StandardScaler instance inside the pipeline work on both the input data and the target data, so that they end up the same?
I'm providing a code snippet as an example.
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, KFold
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop, Adam
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor
X, y = make_classification (n_features = 50, n_redundant = 0, random_state = 0,
scale = 100, n_clusters_per_class = 1)
# Define wrapper
# input_shape has no default, so it must come before the parameters that do
def create_model (input_shape, learn_rate = 0.01, metrics = ['mse']):
    model = Sequential ()
    model.add (Dense (units = 64, activation = 'relu',
                      input_shape = (input_shape, )))
    model.add (Dense (32, activation = 'relu'))
    model.add (Dense (8, activation = 'relu'))
    model.add (Dense (32, activation = 'relu'))
    model.add (Dense (input_shape, activation = None))
    model.compile (loss = 'mean_squared_error',
                   optimizer = Adam (lr = learn_rate),
                   metrics = metrics)
    return model
# Create scaler
my_scaler = StandardScaler ()
steps = list ()
steps.append (('scaler', my_scaler))
standard_scaler_transformer = Pipeline (steps)
# Create classifier
clf = KerasRegressor (build_fn = create_model, verbose = 2)
# Assemble pipeline
# How to scale input and output??
clf = Pipeline (steps = [('scaler', my_scaler),
('classifier', clf)],
verbose = True)
# Run grid search
param_grid = {'classifier__input_shape' : [X.shape [1]],
'classifier__batch_size' : [50],
'classifier__learn_rate' : [0.001],
'classifier__epochs' : [5, 10]}
cv = KFold (n_splits = 5, shuffle = False)
grid = GridSearchCV (estimator = clf, param_grid = param_grid,
scoring = 'neg_mean_squared_error', verbose = 1, cv = cv)
grid_result = grid.fit (X, X)
print ('Best: %f using %s' % (grid_result.best_score_, grid_result.best_params_))
You can use TransformedTargetRegressor to apply arbitrary transformations on the target values (i.e. y) by providing either a function (i.e. using func argument) or a transformer (i.e. transformer argument).
In this case (i.e. fitting an auto-encoder model), since you want to apply the same StandardScaler instance on the target values as well, you can use the transformer argument. It could be done in one of the following ways:
You can use it as one of the pipeline steps, wrapping the regressor:
scaler = StandardScaler()
regressor = KerasRegressor(...)
pipe = Pipeline(steps=[
('scaler', scaler),
('ttregressor', TransformedTargetRegressor(regressor, transformer=scaler))
])
# Use `__regressor` to access the regressor hyperparameters
param_grid = {'ttregressor__regressor__hyperparam_name' : ...}
gridcv = GridSearchCV(estimator=pipe, param_grid=param_grid, ...)
gridcv.fit(X, X)
Alternatively, you can wrap it around the GridSearchCV like this:
ttgridcv = TransformedTargetRegressor(GridSearchCV(...), transformer=scaler)
ttgridcv.fit(X, X)
# Use `regressor_` attribute to access the fitted regressor (i.e. `GridSearchCV` instance)
print(ttgridcv.regressor_.best_score_, ttgridcv.regressor_.best_params_)
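For concreteness, a minimal sketch of the first option wired up with the question's create_model and parameter grid (the step names here are illustrative, not mandated):

scaler = StandardScaler()
regressor = KerasRegressor(build_fn=create_model, verbose=2)

pipe = Pipeline(steps=[
    ('scaler', scaler),
    ('ttregressor', TransformedTargetRegressor(regressor=regressor,
                                               transformer=StandardScaler()))
])

# Each parameter now needs the extra `regressor` level in its prefix
param_grid = {'ttregressor__regressor__input_shape': [X.shape[1]],
              'ttregressor__regressor__batch_size': [50],
              'ttregressor__regressor__learn_rate': [0.001],
              'ttregressor__regressor__epochs': [5, 10]}

gridcv = GridSearchCV(estimator=pipe, param_grid=param_grid,
                      scoring='neg_mean_squared_error', cv=KFold(n_splits=5))
gridcv.fit(X, X)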
I want to build a neural network using Keras on transforms of my input variables AND my output variables using the sklearn Pipeline (so I can perform CV). I am trying to use TransformedTargetRegressor, but my mean squared errors do not make sense to me.
This is my code, adapted from sklearn's example for TransformedTargetRegressor on the Boston Housing dataset, with a simple neural network added and the input variables (X) scaled.
Set up (this section is fine):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import load_boston
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
#load data
X, y = load_boston(return_X_y=True)
#define simple neural network
def simple_nn():
    model = Sequential()
    model.add(Dense(13, input_dim=13, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
#create pipeline for input variables (X) preprocessing
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(build_fn=simple_nn, epochs=100, batch_size=5, verbose=True)))
pipeline = Pipeline(estimators)
I am trying to do the following (section in question):
#Section in question
transformer = MinMaxScaler()
model = TransformedTargetRegressor(regressor=pipeline,
transformer=transformer)
results = cross_val_score(model, X, y, cv=KFold(n_splits=5))
The resulting cross validation scores are:
array([ 0.61321517, 0.35811762, -2.67674546, -0.30623006, -0.38187424])
The middle number is of particular concern to me since the y target is supposed to have been scaled from 0 to 1, so a mean squared error of -2.67 seems wrong. What am I doing wrong here?
A mean squared error is squared, and thus can't be negative.
That means that your score is not the mean squared error.
The cross_val_score documentation tells us that if the scoring parameter is not defined, the scorer defaults to the estimator's scorer:
"If None, the estimator’s default scorer (if available) is used.
In your case, it's the TransformedTargetRegressor's scorer that is being used, and the TransformedTargetRegressor documentation tells us that its default score is:
Return the coefficient of determination R^2 of the prediction.
So the values you are displaying are R2 scores. R2 can be negative if your model performs badly. See this question for instance.
As a good practice, you should always define the scorer you want to use, to avoid relying on the wrong one.
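For example, to get (negated) mean squared errors from the question's setup, pass the scorer explicitly (a small sketch using the question's model):

results = cross_val_score(model, X, y, cv=KFold(n_splits=5),
                          scoring='neg_mean_squared_error')
# These values are -MSE, so closer to 0 is better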
I want to build an sklearn VotingClassifier ensemble out of multiple different models (Decision Tree, SVC, and a Keras Network). All of them need a different kind of data preprocessing, which is why I made a pipeline for each of them.
# Define pipelines
# DTC pipeline
featuriser = Featuriser()
dtc = DecisionTreeClassifier()
dtc_pipe = Pipeline([('featuriser',featuriser),('dtc',dtc)])
# SVC pipeline
scaler = TimeSeriesScalerMeanVariance(kind='constant')
flattener = Flattener()
svc = SVC(C = 100, gamma = 0.001, kernel='rbf')
svc_pipe = Pipeline([('scaler', scaler),('flattener', flattener), ('svc', svc)])
# Keras pipeline
cnn = KerasClassifier(build_fn=get_model())
cnn_pipe = Pipeline([('scaler',scaler),('cnn',cnn)])
# Make an ensemble
ensemble = VotingClassifier(estimators=[('dtc', dtc_pipe),
('svc', svc_pipe),
('cnn', cnn_pipe)],
voting='hard')
The Featuriser, TimeSeriesScalerMeanVariance and Flattener classes are custom-made transformers that all implement fit, transform and fit_transform methods.
When I try to fit the whole ensemble with ensemble.fit(X, y), I get the error message:
ValueError: The estimator list should be a classifier.
Which I can understand, as the individual estimators are not specifically classifiers but pipelines. Is there a way to still make it work?
The problem is with the KerasClassifier: it does not provide the _estimator_type attribute, which is checked in _validate_estimator.
It is not a problem with using a pipeline; Pipeline exposes this information as a property. See here.
Hence, the quick fix is setting _estimator_type='classifier'.
A reproducible example:
# Define pipelines
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler, Normalizer
from sklearn.ensemble import VotingClassifier
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.datasets import make_classification
from keras.layers import Dense
from keras.models import Sequential
X, y = make_classification()
# DTC pipeline
featuriser = MinMaxScaler()
dtc = DecisionTreeClassifier()
dtc_pipe = Pipeline([('featuriser', featuriser), ('dtc', dtc)])
# SVC pipeline
scaler = Normalizer()
svc = SVC(C=100, gamma=0.001, kernel='rbf')
svc_pipe = Pipeline(
[('scaler', scaler), ('svc', svc)])
# Keras pipeline
def get_model():
    # create model
    model = Sequential()
    model.add(Dense(10, input_dim=20, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
cnn = KerasClassifier(build_fn=get_model)
cnn._estimator_type = "classifier"
cnn_pipe = Pipeline([('scaler', scaler), ('cnn', cnn)])
# Make an ensemble
ensemble = VotingClassifier(estimators=[('dtc', dtc_pipe),
('svc', svc_pipe),
('cnn', cnn_pipe)],
voting='hard')
ensemble.fit(X, y)
Lately, I have been working on applying grid search cross validation (sklearn GridSearchCV) for hyper-parameter tuning in Keras with the Tensorflow backend. As soon as my model is tuned, I try to save the GridSearchCV object for later use, without success.
The hyper-parameter tuning is done as follows:
x_train, x_val, y_train, y_val = train_test_split(NN_input, NN_target, train_size = 0.85, random_state = 4)
history = History()
kfold = 10
regressor = KerasRegressor(build_fn = create_keras_model, epochs = 100, batch_size=1000, verbose=1)
neurons = np.arange(10,101,10)
hidden_layers = [1,2]
optimizer = ['adam','sgd']
activation = ['relu']
dropout = [0.1]
parameters = dict(neurons = neurons,
hidden_layers = hidden_layers,
optimizer = optimizer,
activation = activation,
dropout = dropout)
gs = GridSearchCV(estimator = regressor,
param_grid = parameters,
scoring='mean_squared_error',
n_jobs = 1,
cv = kfold,
verbose = 3,
return_train_score=True)
grid_result = gs.fit(NN_input,
NN_target,
callbacks=[history],
verbose=1,
validation_data=(x_val, y_val))
Remark: create_keras_model function initializes and compiles a Keras Sequential model.
After the cross validation is performed I am trying to save the grid search object (gs) with the following code:
from sklearn.externals import joblib
joblib.dump(gs, 'GS_obj.pkl')
The error I am getting is the following:
TypeError: can't pickle _thread.RLock objects
Could you please let me know what might be the reason for this error?
Thank you!
P.S.: the joblib.dump method works well for saving GridSearchCV objects that are used for training MLPRegressors from sklearn.
Use import joblib directly instead of from sklearn.externals import joblib.
Save objects or results with:
joblib.dump(gs, 'model_file_name.pkl')
and load your results using:
joblib.load("model_file_name.pkl")
Here is a simple working example:
import joblib
#save your model or results
joblib.dump(gs, 'model_file_name.pkl')
#load your model for further usage
joblib.load("model_file_name.pkl")
Try this:
from sklearn.externals import joblib
joblib.dump(gs.best_estimator_, 'filename.pkl')
If you want to dump your object into one file - use:
joblib.dump(gs.best_estimator_, 'filename.pkl', compress = 1)
Simple Example:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from sklearn.externals import joblib
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
gs = GridSearchCV(svc, parameters)
gs.fit(iris.data, iris.target)
joblib.dump(gs.best_estimator_, 'filename.pkl')
#['filename.pkl']
EDIT 1:
you can also save the whole object:
joblib.dump(gs, 'gs_object.pkl')
Subclass the sklearn.model_selection._search.BaseSearchCV class. Override the fit(self, X, y=None, groups=None, **fit_params) method, and modify its internal evaluate_candidates(candidate_params) function. Instead of immediately returning the results dictionary from evaluate_candidates(candidate_params), perform your serialization here (or in the _run_search method depending on your use case). With some additional modifications, this approach has the added benefit of allowing you to execute the grid search sequentially (see the comment in the source code here: _search.py). Note that the results dictionary returned by evaluate_candidates(candidate_params) is the same as the cv_results dictionary. This approach worked for me, but I was also attempting to add save-and-restore functionality for interrupted grid search executions.
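A rough sketch of that idea is below. It relies on private sklearn API (BaseSearchCV and _run_search), whose behaviour has changed between versions, so treat the class name, constructor arguments, and the assumption that evaluate_candidates returns the results dictionary as a starting point rather than a drop-in solution:

import joblib
from sklearn.model_selection import ParameterGrid
from sklearn.model_selection._search import BaseSearchCV

class CheckpointingGridSearchCV(BaseSearchCV):
    """Grid search that serializes the candidate results as soon as they are available."""

    def __init__(self, estimator, param_grid, checkpoint_path='gs_results.pkl',
                 scoring=None, cv=None, refit=True, verbose=0):
        super().__init__(estimator=estimator, scoring=scoring, cv=cv,
                         refit=refit, verbose=verbose)
        self.param_grid = param_grid
        self.checkpoint_path = checkpoint_path

    def _run_search(self, evaluate_candidates):
        # In recent sklearn versions evaluate_candidates returns the results
        # dictionary (the same data later exposed as cv_results_); it contains
        # only plain numpy/Python objects, so it pickles without hitting the
        # _thread.RLock error raised by the Keras model itself.
        results = evaluate_candidates(list(ParameterGrid(self.param_grid)))
        joblib.dump(results, self.checkpoint_path)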