Maybe this is just a bug, or maybe I am missing something obvious. I wrapped (or rather, a colleague wrapped) a Keras model, along with some Keras preprocessing steps also wrapped as transformers, so that we can use the Keras model with the sklearn library.
When I call fit on the Pipeline it works fine: it runs and returns a working model instance. However, when I use GridSearchCV it seems to fail to apply the transforms, and it gives me the following error:
InvalidArgumentError (see above for traceback): indices[11,2] = 26048 is not in [0, 10001)
[[Node: embedding_4/Gather = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](embedding_4/embeddings/read, embedding_4/Cast)]]
The code looks something like this:
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold

vocab_size = 10001

class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def fit(self, X, y=None):
        print('fitting the text')
        print(self.document_count)
        self.fit_on_texts(X)
        return self

    def transform(self, X, y=None):
        print('transforming the text')
        r = np.array(self.texts_to_sequences(X))
        print(r)
        print(self.document_count)
        return r

class Padder(BaseEstimator, TransformerMixin):
    def __init__(self, maxlen=500):
        self.maxlen = maxlen
        self.max_index = None

    def fit(self, X, y=None):
        #self.max_index = pad_sequences(X, maxlen=self.maxlen).max()
        return self

    def transform(self, X, y=None):
        print('pad the text')
        X = pad_sequences(X, maxlen=self.maxlen, padding='post')
        #X[X > self.max_index] = 0
        print(X)
        return X

maxlen = 15

def makeLstmModel():
    model = Sequential()
    model.add(Embedding(10001, 100, input_length=15))
    model.add(LSTM(35, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(16, activation='sigmoid'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

lstmmodel = KerasClassifier(build_fn=makeLstmModel, epochs=5, batch_size=1000, verbose=42)

pipeline = [
    ('seq', TextsToSequences(num_words=vocab_size)),
    ('pad', Padder(maxlen)),
    ('clf', lstmmodel)
]

textClassifier = Pipeline(pipeline)

# Setup parameters
parameters = {}  # Some params to use in grid search

skf = StratifiedKFold(n_splits=numberOfFolds, shuffle=True, random_state=1)
gscv = GridSearchCV(textClassifier, parameters, cv=skf, iid=False, n_jobs=1, verbose=50)
gscv.fit(x_train, y_train)
The above code fails with the InvalidArgumentError when run through GridSearchCV, but when I call fit on the Pipeline directly it works.
Is there a difference between fit() in GridSearchCV and Pipeline? Am I really stupid or is this just a bug?
BTW, I am currently forced to use Sklearn 0.19.1.
After hours of thinking and debugging, I came to the following conclusion:
Pipeline.fit() uses the estimator instances as they are, so parameters passed through **kwargs survive.
GridSearchCV.fit() clones every estimator first, and clone() rebuilds an estimator from get_params(); BaseEstimator.get_params() only picks up parameters declared explicitly in __init__, so anything passed only through **kwargs is silently dropped in the clone.
I tested this on sklearn 0.19.1.
My issue was that the bag of words created with the Keras Tokenizer is supposed to be limited with the num_words parameter so that it matches the input dimension of the LSTM model's Embedding layer. Because num_words got lost, the bag was always bigger than the input dimension, producing indices outside [0, 10001).
num_words was passed to the Tokenizer through **kwargs:
class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
GridSearchCV.fit() cannot recover a parameter that only exists inside **kwargs. The solution is to declare it as an explicit argument:
class TextsToSequences(Tokenizer, BaseEstimator, TransformerMixin):
    def __init__(self, num_words=8000, **kwargs):
        super().__init__(num_words, **kwargs)
After this change GridSearchCV.fit() works.
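To see the mechanism concretely, here is a minimal sketch (the classes KwargsOnly and ExplicitArg are made up for illustration, not part of the original code) showing that clone() drops parameters hidden in **kwargs but keeps explicitly declared ones:

from sklearn.base import BaseEstimator, clone

class KwargsOnly(BaseEstimator):
    def __init__(self, **kwargs):                  # num_words hidden in **kwargs
        self.num_words = kwargs.get('num_words')

class ExplicitArg(BaseEstimator):
    def __init__(self, num_words=8000):            # num_words declared explicitly
        self.num_words = num_words

print(KwargsOnly(num_words=10001).get_params())       # {} -> the parameter is invisible
print(clone(KwargsOnly(num_words=10001)).num_words)   # None -> lost after cloning
print(clone(ExplicitArg(num_words=10001)).num_words)  # 10001 -> survives cloning

This is exactly what GridSearchCV runs into: it clones the pipeline for every fold and parameter combination, and the cloned Tokenizer no longer knows about num_words.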
Related
I am using Keras' sklearn wrapper for a regressor, namely tf.keras.wrappers.scikit_learn.KerasRegressor.
I want this regressor to work within sklearn's cross validation scheme, namely sklearn.model_selection.cross_validate.
The regressor generally works without CV.
However, the latter fails because my regressor's __init__ has a required parameter that defines the batch input shape, and that parameter appears to go missing during cloning.
This seems to be the case because MyRegressor or KerasRegressor isn't correctly cloneable using clone(estimator). The specific error message is:
KeyError: 'batch_input_shape'
Is there a way to make MyRegressor work with cross_validate? Am I somehow violating sklearn's requirements?
Please see this condensed working example:
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_validate
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.wrappers.scikit_learn import KerasRegressor

class MyRegressor(KerasRegressor):
    def __init__(self, batch_input_shape, build_fn=None, **kwargs):
        self.batch_input_shape = batch_input_shape
        super().__init__(**kwargs)

    def __call__(self, *kwargs):
        model = Sequential([
            LSTM(16, stateful=True, batch_input_shape=self.batch_input_shape),
            Dense(1),
        ])
        model.compile(optimizer='adam', loss='mean_squared_error', metrics=['RootMeanSquaredError'])
        return model

    def reset_states(self):
        self.model.reset_states()

X, y = make_regression(6400, 5)
X = X.reshape(X.shape[0], 1, X.shape[1])

batch_size = 64
batch_input_shape = (batch_size, 1, X.shape[-1])

# Works fine
reg = MyRegressor(batch_input_shape)
for i in range(10):
    reg.fit(X, y, batch_size=batch_size)
    reg.reset_states()

# Doesn't work
reg = MyRegressor(batch_input_shape)
results = cross_validate(reg, X, y, scoring=['neg_mean_squared_error'])
Cloneability requires a proper get_params method. Most often this is obtained by inheriting from sklearn's BaseEstimator, but KerasRegressor instead implements its own directly (source). The way it does it is incompatible with your additional batch_input_shape; you can tweak it to make things work:
def get_params(self, deep=False):
    res = self.sk_params.copy()  # sk_params was set by KerasRegressor.__init__
    res.update({
        'build_fn': self.build_fn,
        'batch_input_shape': self.batch_input_shape,
    })
    return res
(I get an error in your example after this update, about input shapes. But I'm less familiar with batch sizes and keras to be able to answer that followup.)
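As a quick check (a sketch, assuming the get_params override above has been added to MyRegressor), cloning now preserves the extra parameter instead of raising the KeyError:

from sklearn.base import clone

reg = MyRegressor(batch_input_shape=(64, 1, 5))
reg_clone = clone(reg)               # clone() rebuilds the estimator from get_params()
print(reg_clone.batch_input_shape)   # (64, 1, 5)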
I am trying to do some hyper-parameter tuning in my pipeline and have the following setup:
model = KerasClassifier(build_fn=create_model, epochs=5)

pipeline = Pipeline(steps=[('Tokenizepadder', TokenizePadding()),
                           ('NN', model)])
Where I have a variable 'maxlen' in both the Tokenizepadder and my neural network (for the neural network it is called max_length; I was afraid naming them the same would cause errors later in the code). When I try to perform a grid search, I am struggling to keep these two values in sync. If I grid search over them separately, they won't match and the input data won't fit the neural network.
In short I would like to do something like:
pipeline = Pipeline(steps=[('Tokenizepadder', TokenizePadding()),
                           ('NN', KerasClassifier(build_fn=create_model, epochs=5,
                                                  max_length=pipeline.get_params()['Tokenizepadder__maxlen']))])
So that when I am performing a grid search for the parameter 'Tokenizepadder__maxlen', it will change the value 'NN__max_length' to the same value.
Maybe you can change your classifier and tokenizer to pass the max_len parameter around, and then grid search only over the tokenizer's max_len parameter.
Not the cleanest way, but it might do.
from sklearn.base import BaseEstimator, TransformerMixin

class TokenizePadding(BaseEstimator, TransformerMixin):
    def __init__(self, max_len, ...):
        self.max_len = max_len
        ...

    def fit(self, X, y=None):
        ...
        return self

    def transform(self, X, y=None):
        data = ...  # your stuff
        return {"array": data, "max_len": self.max_len}

class KerasClassifier(...):
    ...
    def fit(self, data, y):
        self.max_len = data["max_len"]
        self.build_model()
        X = data["array"]
        ...  # your stuff
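With that structure there is only one knob left to tune. A rough sketch of how the grid search might be wired up (purely illustrative; it assumes the sketch classes above have been filled in, and the parameter values and cv setting are made up):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipe = Pipeline(steps=[('Tokenizepadder', TokenizePadding(max_len=100)),
                       ('NN', KerasClassifier())])

# Only the tokenizer's max_len is searched; the classifier reads the value
# from the dict produced by transform() at fit time.
param_grid = {'Tokenizepadder__max_len': [50, 100, 200]}
search = GridSearchCV(pipe, param_grid, cv=3)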
I'm implementing a complicated loss function, so I use a custom layer to pass the loss. Something like:
class SIAMESE_LOSS(Layer):
    def __init__(self, **kwargs):
        super(SIAMESE_LOSS, self).__init__(**kwargs)

    @staticmethod
    def mmd_loss(source_samples, target_samples):
        return mmd(source_samples, target_samples)

    @staticmethod
    def regression_loss(pred, labels):
        return K.mean(mae(pred, labels))

    def call(self, inputs, **kwargs):
        source_labels = inputs[0]
        target_labels = inputs[1]
        source_pred = inputs[2]
        target_pred = inputs[3]
        source_samples = inputs[4]
        target_samples = inputs[5]

        source_loss = self.regression_loss(source_pred, source_labels)
        target_loss = self.regression_loss(target_pred, target_labels)
        mmd_loss = self.mmd_loss(source_samples, target_samples)

        self.add_loss(source_loss)
        self.add_loss(target_loss)
        self.add_loss(mmd_loss)

        self.add_metric(source_loss, aggregation='mean', name='source_mae')
        self.add_metric(target_loss, aggregation='mean', name='target_mae')
        self.add_metric(mmd_loss, aggregation='mean', name='MMD')
        return mmd_loss + target_loss + source_loss
So the labels are sent to the model as inputs.
Therefore, fitting the model looks like this:
history = model.fit(
    x=[train_data_s, train_data_t, self.train_labels, self.train_data_t],
    y=None,
    batch_size=self.batch_size,
    epochs=base_epochs,
    verbose=2,
    callbacks=cp_callback,
    validation_data=[val_data_s, val_data_t, self.val_labels, self.val_labels_t],
    shuffle=True
)
However, according to the official document in Tensorflow, validation_data should be:
Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. validation_data will override validation_split. validation_data could be: a tuple (x_val, y_val) of Numpy arrays or tensors, a tuple (x_val, y_val, val_sample_weights) of Numpy arrays, or a dataset. For the first two cases, batch_size must be provided. For the last case, validation_steps could be provided. Note that validation_data does not support all the data types that are supported in x, e.g. dict, generator or keras.utils.Sequence.
There's no 'label' to pass separately, since the labels are already fed to the model as inputs. How can I solve this if I still want to use validation data?
To write your own loss, you need to inherit from the Loss class and then implement your loss calculation in the __init__ and call methods.
https://www.tensorflow.org/api_docs/python/tf/keras/losses/Loss
That way you don't have to train without passing y to model.fit().
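A minimal sketch of that pattern (the weighted MAE here is illustrative, not the MMD loss from the question): subclass tf.keras.losses.Loss, store any configuration in __init__, and compute the per-batch loss from y_true and y_pred in call():

import tensorflow as tf

class WeightedMAELoss(tf.keras.losses.Loss):
    def __init__(self, weight=1.0, name='weighted_mae'):
        super().__init__(name=name)
        self.weight = weight

    def call(self, y_true, y_pred):
        # Receives labels and predictions for each batch.
        return self.weight * tf.reduce_mean(tf.abs(y_true - y_pred))

# Usage (model, x_train, y_train, x_val, y_val defined elsewhere):
# model.compile(optimizer='adam', loss=WeightedMAELoss(weight=0.5))
# model.fit(x_train, y_train, validation_data=(x_val, y_val))

Because the labels now go in through y, validation_data can be passed as a normal (x_val, y_val) tuple.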
I'm trying to build a deep neural network in Python using a class. (The main idea behind it is to be able to customize the loss function later on.)
I'm trying to use Keras in the function that defines the neural network structure but it doesn't seem to be working.
# create a class to
class PGNN(keras.Sequential):
    def __init__(self, x, y):
        super().__init__()
        X = np.concatenate([x, y], axis=1)
        self.X = X
        self.x = X[:, 0:1]
        self.y = X[:, 1:2]

    def build_model_u(self):
        model_u = models.Sequential
        model_u.add(layers.Dense(64, activation='tanh', input_shape=1000))
        model_u.add(layers.Dense(32, activation='tanh'))
        model_u.add(layers.Dense(16, activation='tanh'))
        model_u.add(layers.Dense(8, activation='tanh'))
        model_u.add(layers.Dense(4, activation='tanh'))
        model_u.add(layers.Dense(1))
        model_u.compile(optimizer='Adam', loss='mse', metrics=['mae'])

    def train(self, x_train, y_train):
        model = build_model_u(self)
        model.fit()

    def predict(self, x_test):
        model.predict(x_test)

    def validation(self, x_test, y_test):
        model.evaluate(x_test, y_test, verbose=2)
I expected the model to start training when I call model.fit(x_train,y_train) but I always get the error "build_model_u is not defined"
model = build_model_u(self)
NameError: name 'build_model_u' is not defined
You must be calling the method like this:

model = build_model_u(self)
model.fit(x_train, y_train)

Since build_model_u is a method of the class, call it through self (and make sure build_model_u actually returns model_u):

model = self.build_model_u()
model.fit(x_train, y_train)
You can also remove the output variable:

output = model_u.add(layers.Dense(3))

and keep it like this:

model_u.add(layers.Dense(3))
Yes, there is a way. You have to extend your custom model class from keras.Model and override the call method; then you can call the fit method on your own custom class.
You can follow the following Keras documentations.
https://keras.io/models/about-keras-models/#model-subclassing
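A minimal sketch of that subclassing pattern (the layer sizes and dummy data here are illustrative, not taken from the question):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

class PGNN(keras.Model):
    def __init__(self):
        super().__init__()
        self.hidden = layers.Dense(64, activation='tanh')
        self.out = layers.Dense(1)

    def call(self, inputs):
        # Forward pass; fit()/predict()/evaluate() all go through this.
        return self.out(self.hidden(inputs))

model = PGNN()
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
model.fit(np.random.rand(100, 1000), np.random.rand(100, 1), epochs=2)

Once the model subclass owns its forward pass in call, a custom loss can later be plugged in through compile().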
I am new to PyTorch and I am trying out the Embedding Layer.
I wrote a naive classification task, where all the inputs are equal and all the labels are set to 1.0. I therefore expect the model to quickly learn to predict 1.0.
The input is always 0, which is fed into an nn.Embedding(1, 32) layer, followed by nn.Linear(32, 1) and nn.ReLU().
However, an unexpected and undesired behavior occurs: training outcome is different for different times I run the code.
For example,
setting the random seed to 10, the model converges: the loss decreases and the model always predicts 1.0
setting the random seed to 1111, the model doesn't converge: the loss doesn't decrease and the model always predicts 0.5; in those cases the parameters are not updated
Here is a minimal, reproducible example:
from torch.nn import BCEWithLogitsLoss
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torch.autograd import Variable
from torch.utils.data import Dataset
import torch

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.vgg_fc = nn.Linear(32, 1)
        self.relu = nn.ReLU()
        self.embeddings = nn.Embedding(1, 32)

    def forward(self, data):
        emb = self.embeddings(data['index'])
        return self.relu(self.vgg_fc(emb))

class MyDataset(Dataset):
    def __init__(self):
        pass

    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        return {'label': 1.0, 'index': 0}

def train():
    model = MyModel()
    db = MyDataset()
    dataloader = DataLoader(db, batch_size=256, shuffle=True, num_workers=16)
    loss_function = BCEWithLogitsLoss()
    optimizer_rel = optim.SGD(model.parameters(), lr=0.1)
    for epoch in range(50):
        for i_batch, sample_batched in enumerate(dataloader):
            model.zero_grad()
            out = model({'index': Variable(sample_batched['index'])})
            labels = Variable(sample_batched['label'].type(torch.FloatTensor).view(sample_batched['label'].shape[0], 1))
            loss = loss_function(out, labels)
            loss.backward()
            optimizer_rel.step()
            print 'Epoch:', epoch, 'batch', i_batch, 'Tr_Loss:', loss.data[0]
    return model

if __name__ == '__main__':
    # please, try seed 10 (converges) and seed 1111 (fails)
    torch.manual_seed(10)
    train()
Without specifying the random seed, different runs have different outcome.
Why is the model, in those cases, unable to learn such an easy task?
Is there any mistake in the way I use nn.Embedding layer?
Thank you
I found that the problem was the final ReLU layer before the sigmoid.
As stated here, that layer will:
throw away information without adding any additional benefit
After removing the layer, the network learned as expected with any seed.
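A sketch of the fix described above (the same MyModel from the question, with the final ReLU dropped so the raw logit goes straight into BCEWithLogitsLoss, which applies the sigmoid internally):

import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.embeddings = nn.Embedding(1, 32)
        self.vgg_fc = nn.Linear(32, 1)

    def forward(self, data):
        emb = self.embeddings(data['index'])
        # No ReLU here: negative logits stay reachable, so the output is not
        # stuck at sigmoid(0) = 0.5 when the pre-activation happens to be negative.
        return self.vgg_fc(emb)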