Finding the optimum learning rate & epochs in a Neural Network

Finding the optimum learning rate & epochs in a Neural Network - python

I have created a single layer neural network with two outputs (one for each class, 0 or 1) trained with the sigmoid method and SGD optimizer. I have also trained the NN without any hidden layer. Furthermore I have validated the performance of the model using StratifiedKFold with 4 splits. The model trained is designed with a lr=0.1 and epochs=150, however, I don´t know if these values are optimising the model. For this reason, I would like to run 20 combinations of the learning rate parameter and epochs to see the most accurate result and for which combination of these parameters I´m obtaining it. Below the restrictions:
epochs: values between 10 and 150
learning rate: values between 0.01 and 1
Please, see below the code:
from sklearn.model_selection import StratifiedKFold
from keras.models import Sequential
from keras.layers.core import Dense
from keras import layers
from keras.optimizers import SGD
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
#Function to create the NN model
def create_model():
#Neural Network model
ann = Sequential()
#Number of columns of training dataset
n_cols = x_train.shape[1]
#Output
ann.add(Dense(units=1,activation='sigmoid',input_shape=(n_cols,)))
#SGD Optimizer
sgd = SGD(lr=0.1)
#Compile with SGD optimizer and binary_crossentropy
ann.compile(optimizer=sgd,
loss='binary_crossentropy',
metrics=['accuracy'])
return ann
#Creating the model
model=KerasClassifier(build_fn=create_model,epochs=150,batch_size=10,verbose=0)
#Evaluating the model using StratifiedKFold
kfold = StratifiedKFold(n_splits=4, shuffle=True, random_state=2)
results=cross_val_score(model,x_train,y_train,cv=kfold)
#Accuracies
print(results)
To create the 20 combinations formed by the learning rate and epochs, firstly, I have created random values of lr and epochs:
#Epochs
epo = np.random.randint(10,150)
#Learning Rate
learn = np.random.randint(0.01,1)
My problem is that I don´t know how to fit this into the code of the NN in order to find which is the combination that gives the best accuracy of the model.

there is no need to optimize the number of epochs you could easily use early stop which will stop when there is no improvement in your loss or accuracy
so just set your epochs a big number (for example 300 ) and add:
keras.callbacks.callbacks.EarlyStopping(monitor='val_loss', min_delta=0.1)
you also can call the best weights (just before the model started to overfit) by:
restore_best_weights=True

First in the create_model() function you defined the optimizer you passed the learning rate as parameter to:
#SGD Optimizer
sgd = SGD(lr=0.1)
This is the starting learning rate of the optimization process, from this point the optimizer handles the optimal learning rate.
Nevertheless you can pass several starting learning rate inside a loop calling repeatedly the create_model() function and passsing a learning rate parameter to that.
Furthermore as parsa mentioned choosing the right epoch number is based on the validation result, which shows where your model became overfitted. There is the point is, where the epoch number reached its optimum.

Related

Looking for ways and experiences to improve Keras model accuracy

I am an electrical engineer and I am looking for a solution to calculate the DC current of a permanent synchronous motor. So I decided to check the ANN solutions with Keras and so on.Long story short, I'll show you a screenshot of some measured signals.
The first 5 signals are the measured signals. The last one is the DC current, which I will estimate. Here the value was recorded with the help of a current clamp. Okay, I started building a model in Python and tried some things that I assume will increase the accuracy of the model. But after all that, I am not getting that good results from the model and my hope is that maybe I am choosing wrong parameters or not an ideal model for this purpose.
Here is my code:
import numpy as np
from keras.layers import Dense, LSTM
from keras.models import Sequential
from keras.callbacks import EarlyStopping
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from matplotlib import pyplot as plt
import seaborn as sns
# Import input (x) and output (y) data, and asign these to df1 and df1
df = pd.read_csv('train_data.csv')
df = df[['rpm','iq','uq','udc','idc']]
X = df[df.columns[:-1]]
Y = df.idc
plt.figure()
sns.heatmap(df.corr(),annot=True)
plt.show()
# Split the data into input (x) training and testing data, and ouput (y) training and testing data,
# with training data being 80% of the data, and testing data being the remaining 20% of the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)#, shuffle=True)
# Scale both training and testing input data
X_train = preprocessing.maxabs_scale(X_train)
X_test = preprocessing.maxabs_scale(X_test)
model = Sequential()
model.add(Dense(4, input_shape=(4,)))
model.add(Dense(4, input_shape=(4,)))
model.add(Dense(1, input_shape=(4,)))
model.compile(optimizer="adam", loss="msle", metrics=['mean_squared_logarithmic_error','accuracy'])
# Pass several parameters to 'EarlyStopping' function and assign it to 'earlystopper'
earlystopper = EarlyStopping(monitor='val_loss', min_delta=0, patience=15, verbose=1, mode='auto')
model.summary()
history = model.fit(X_train, y_train, epochs = 2000, validation_split = 0.3, verbose = 2, callbacks = [earlystopper])
# Runs model (the one with the activation function, although this doesn't really matter as they perform the same)
# with its current weights on the training and testing data
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculates and prints r2 score of training and testing data
print("The R2 score on the Train set is:\t{:0.3f}".format(r2_score(y_train, y_train_pred)))
print("The R2 score on the Test set is:\t{:0.3f}".format(r2_score(y_test, y_test_pred)))
df = pd.read_csv('test_two_data.csv')
df = df[['rpm','iq','uq','udc','idc']]
X = df[df.columns[:-1]]
Y = df.idc
X_validate = preprocessing.maxabs_scale(X)
y_pred = model.predict(X_validate)
plt.plot(Y)
plt.plot(y_pred)
plt.show()
(weight_0,bias_0) = model.layers[0].get_weights()
(weight_1,bias_1) = model.layers[1].get_weights()
One limitation is that I can't use LSTM layers or other complex algorithms because I need to implement the trained model in a microcontroller on a motor application later.
I guess you could find some words for me to make my model a little better in accuracy.
At the end here is a figure where I show you the worse prediction performance. Orange is the prediction and blue is the measured current.
The training dataset was this one.
The correlation between the individual values can be found here. Since the values of id and ud have no correlation to idc, I decided to delete them.

The most important thing to keep in mind when trying to improve the accuracy of the model is ALWAYS Normalise the input data which basically means rescaling real-valued numeric attributes into the range 0 and 1. I am not able to understand the way you are providing the training data to the model. Could you please explain that. It would be better in understanding and identifying the scope of higher accuracy.
Now if we talk about parameters, I would suggest you the addition of a Tuning Algorithm for the parameters to get the optimized value of each parameter.
It is always a good parctice to include hidden layers which could provide better feature extract.

Machine Learning Model overfitting

So I build a GRU model and I'm comparing 3 different datasets on the same model. I was just running the first dataset and set the number of epochs to 25, but I have noticed that my validation loss is increasing just after the 6th epoch, doesn't that indicate overfitting, am I doing something wrong?
import pandas as pd
import tensorflow as tf
from keras.layers.core import Dense
from keras.layers.recurrent import GRU
from keras.models import Sequential
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from google.colab import files
from tensorboardcolab import TensorBoardColab, TensorBoardColabCallback
tbc=TensorBoardColab() # Tensorboard
df10=pd.read_csv('/content/drive/My Drive/Isolation Forest/IF 10 PERCENT.csv',index_col=None)
df2_10= pd.read_csv('/content/drive/My Drive/2019 Dataframe/2019 10minutes IF 10 PERCENT.csv',index_col=None)
X10_train= df10[['WindSpeed_mps','AmbTemp_DegC','RotorSpeed_rpm','RotorSpeedAve','NacelleOrientation_Deg','MeasuredYawError','Pitch_Deg','WindSpeed1','WindSpeed2','WindSpeed3','GeneratorTemperature_DegC','GearBoxTemperature_DegC']]
X10_train=X10_train.values
y10_train= df10['Power_kW']
y10_train=y10_train.values
X10_test= df2_10[['WindSpeed_mps','AmbTemp_DegC','RotorSpeed_rpm','RotorSpeedAve','NacelleOrientation_Deg','MeasuredYawError','Pitch_Deg','WindSpeed1','WindSpeed2','WindSpeed3','GeneratorTemperature_DegC','GearBoxTemperature_DegC']]
X10_test=X10_test.values
y10_test= df2_10['Power_kW']
y10_test=y10_test.values
# scaling values for model
x_scale = MinMaxScaler()
y_scale = MinMaxScaler()
X10_train= x_scale.fit_transform(X10_train)
y10_train= y_scale.fit_transform(y10_train.reshape(-1,1))
X10_test= x_scale.fit_transform(X10_test)
y10_test= y_scale.fit_transform(y10_test.reshape(-1,1))
X10_train = X10_train.reshape((-1,1,12))
X10_test = X10_test.reshape((-1,1,12))
# creating model using Keras
model10 = Sequential()
model10.add(GRU(units=512, return_sequences=True, input_shape=(1,12)))
model10.add(GRU(units=256, return_sequences=True))
model10.add(GRU(units=256))
model10.add(Dense(units=1, activation='sigmoid'))
model10.compile(loss=['mse'], optimizer='adam',metrics=['mse'])
model10.summary()
history10=model10.fit(X10_train, y10_train, batch_size=256, epochs=25,validation_split=0.20, verbose=1, callbacks=[TensorBoardColabCallback(tbc)])
score = model10.evaluate(X10_test, y10_test)
print('Score: {}'.format(score))
y10_predicted = model10.predict(X10_test)
y10_predicted = y_scale.inverse_transform(y10_predicted)
y10_test = y_scale.inverse_transform(y10_test)
plt.plot( y10_predicted, label='Predicted')
plt.plot( y10_test, label='Measurements')
plt.legend()
plt.savefig('/content/drive/My Drive/Figures/Power Prediction 10 Percent.png')
plt.show()

LSTMs(and also GRUs in spite of their lighter construction) are notorious for easily overfitting.
Reduce the number of units(the output size) in each of the layers(32(layer1)-64(layer2); you could also eliminate the last layer altogether.
The second of all, you are using the activation 'sigmoid', but your loss function + metric is mse.
Ensure that your problem is either a regression or a classification one. If it is indeed a regression, then the activation function should be 'linear' at the last step. If it is a classification one, you should change your loss_function to binary_crossentropy and your metric to 'accuracy'.
Therefore, the plot displayed is just misleading for the moment. If you modify like I suggested and you still get such a train-val loss plot, then we can state for sure that you have an overfitting case.

Keras autoencoder : validation loss > training loss - but performing well on testing dataset

IN SHORT:
I have trained an Autoencoder whose validation loss is always higher than its training loss (see attached figure). I would think that this is a signal of overfitting. However, my Autoencoder performs well on the testing dataset. I was wondering if:
1) with reference to the architecture of the network, provided below, anyone could provide insights on how to reduce the validation loss (and how it is possible that the validation loss is much higher than the training one, despite the performance of the Autoencoder being good on the testing dataset);
2) if it is actually a problem that there is this gap between training and validation loss (when the performance on the testing dataset is actually good).
DETAILS:
I coded up my deep Autoencoder in Keras (code below). The architecture is 2001 (input layer) - 1000 - 500 - 200 - 50 - 200 - 500 - 1000 - 2001 (output layer). My samples are 1d functions of time. Each of them has 2001 time components. I have 2000 samples, which I split in 1500 for training, 500 for testing. Ot of the 1500 training samples, 20% of them (i.e. 300) are used as validation set. I normalize the training set removing the mean and dividing by the standard deviation. I use the mean and standard deviation of the training dataset to normalise the testing dataset as well.
I train the Autoencoder using Adamax optimizer and mean squared error as loss function.
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras import optimizers
import numpy as np
import copy
# data
data = # read my input samples. They are 1d functions of time and I have 2000 of them.
# Each function has 2001 time components
# shuffling data before training
import random
random.seed(4)
random.shuffle(data)
# split training (1500 samples) and testing (500 samples) dataset
X_train = data[:1500]
X_test = data[1500:]
# normalize training and testing set using mean and std deviation of training set
X_mean = X_train.mean()
X_train -= X_mean
X_std = X_train.std()
X_train /= X_std
X_test -= X_mean
X_test /= X_std
### MODEL ###
# Architecture
# input layer
input_shape = [X_train.shape[1]]
X_input = Input(input_shape)
# hidden layers
x = Dense(1000, activation='tanh', name='enc0')(X_input)
encoded = Dense(500, activation='tanh', name='enc1')(x)
encoded_2 = Dense(200, activation='tanh', name='enc2')(encoded)
encoded_3 = Dense(50, activation='tanh', name='enc3')(encoded_2)
decoded_2 = Dense(200, activation='tanh', name='dec2')(encoded_3)
decoded_1 = Dense(500, activation='tanh', name='dec1')(decoded_2)
x2 = Dense(1000, activation='tanh', name='dec0')(decoded_1)
# output layer
decoded = Dense(input_shape[0], name='out')(x2)
# the Model
model = Model(inputs=X_input, outputs=decoded, name='autoencoder')
# optimizer
opt = optimizers.Adamax()
model.compile(optimizer=opt, loss='mse', metrics=['acc'])
print(model.summary())
###################
### TRAINING ###
epochs = 1000
# train the model
history = model.fit(x = X_train, y = X_train,
epochs=epochs,
batch_size=100,
validation_split=0.2) # using 20% of training samples for validation
# Testing
prediction = model.predict(X_test)
for i in range(len(prediction)):
prediction[i] = np.multiply(prediction[i], X_std) + X_mean
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(epochs)
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
plt.close()

2) if it is actually a problem that there is this gap between training and validation loss (when the performance on the testing dataset is actually good).
This is just the generalization gap, i.e. the expected gap in the performance between the training and validation sets; quoting from a recent blog post by Google AI:
An important concept for understanding generalization is the generalization gap, i.e., the difference between a model’s performance on training data and its performance on unseen data drawn from the same distribution.
.
I would think that this is a signal of overfitting. However, my Autoencoder performs well on the testing dataset.
It is not, but the reason is not exactly what you think (let alone the fact that "well" is a highly subjective term).
The telltale signature of overfitting is when your validation loss starts increasing, while your training loss continues decreasing, i.e.:
Your graph does not show such a behavior; also, notice the gap (pun intended) between the curves in the above plot (adapted from the Wikipedia entry on overfitting).
how it is possible that the validation loss is much higher than the training one, despite the performance of the Autoencoder being good on the testing dataset
There is absolutely no contradiction here; notice that your training loss is almost zero, which is not necessarily surprising in itself, but it would certainly be surprising if the validation loss were anywhere close to zero. And, again, "good" is a highly subjective term.
In other words, nothing in the info you have provided shows that there is something wrong with your model...

Keras: Re-use trained weights in a new experiment

I am quite new to Keras so apologies in advance for any stupid mistakes. I am currently attempting to try out some good old cross-domain transfer learning between two datasets. I have a model here that is trained and executed on a voice recognition dataset that I have generated (code is at the bottom of this question because it's quite long)
If I were to train a new model, say model_2 on a different dataset, then I'd get a baseline from the initial random distribution of weights.
I wonder, is it possible to train model_1 and model_2, then, and this is the bit I don't know how to do; can I take the two 256 and 128 dense layers from model_1 (with trained weights) and use them as starting points for a model_3 - which is dataset 2 with the initial weight distribution from model_1?
So, in the end, I have the following:
Model_1 which starts from a random distribution and trains on dataset 1
Model_2 which starts from a random distribution and trains on dataset 2
Model_3 which starts from the distribution trained in Model_1 and trains on dataset 2.
My question is, how would I go about doing step 3 in the above? I don't want to freeze the weights, I just want an initial distribution for training from a past experiment
Any help would be greatly appreciated. Thank you! Apologies if I didn't make it quite clear enough what I'm going for
My code to train Model_1 is as follows:
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.callbacks import EarlyStopping
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from keras.utils import np_utils
from keras.layers.normalization import BatchNormalization
import time
start = time.clock()
# fix random seed for reproducibility
seed = 1
numpy.random.seed(seed)
# load dataset
dataframe = pandas.read_csv("voice.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
numVars = len(dataframe.columns) - 1
numClasses = dataframe[numVars].nunique()
X = dataset[:,0:numVars].astype(float)
Y = dataset[:,numVars]
print("THERE ARE " + str(numVars) + " ATTRIBUTES")
print("THERE ARE " + str(numClasses) + " UNIQUE CLASSES")
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)
calls = [EarlyStopping(monitor='acc', min_delta=0.0001, patience=100, verbose=2, mode='max', restore_best_weights=True)]
# define baseline model
def baseline_model():
# create model
model = Sequential()
model.add(BatchNormalization())
model.add(Dense(256, input_dim=numVars, activation='sigmoid'))
model.add(Dense(128, activation='sigmoid'))
model.add(Dense(numClasses, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
estimator = KerasClassifier(build_fn=baseline_model, epochs=2000, batch_size=1000, verbose=1)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, dummy_y, cv=kfold, fit_params={'callbacks':calls})
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
#your code here
print (time.clock() - start)
PS: Input attributes and outputs will all be the same between the two datasets, all that will change are attribute values. I am curious, can this be done if the two datasets have different numbers of output classes?

In short, to fine-tune Model_3 from Model_1, just call model.load_weights('/path/to/model_1.h5', by_name=True) after model.compile(...). Of course, you must have saved the trained Model_1 first.
If I understood correct, you have the same number of features and classes among the two datasets, so you do not even need to re-design your model. If you had different set of classes, then you had to give different names to the last layers of Model_1 and Model_3:
model.add(Dense(numClasses, activation='softmax', name='some_unique_name'))

Can I send callbacks to a KerasClassifier?

I want the classifier to run faster and stop early if the patience reaches the number I set. In the following code it does 10 iterations of fitting the model.
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.constraints import maxnorm
from keras.optimizers import SGD
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataframe = pandas.read_csv("sonar.csv", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:60].astype(float)
Y = dataset[:,60]
# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)
calls=[EarlyStopping(monitor='acc', patience=10), ModelCheckpoint('C:/Users/Nick/Data Science/model', monitor='acc', save_best_only=True, mode='auto', period=1)]
def create_baseline():
# create model
model = Sequential()
model.add(Dropout(0.2, input_shape=(33,)))
model.add(Dense(33, init='normal', activation='relu', W_constraint=maxnorm(3)))
model.add(Dense(16, init='normal', activation='relu', W_constraint=maxnorm(3)))
model.add(Dense(122, init='normal', activation='softmax'))
# Compile model
sgd = SGD(lr=0.1, momentum=0.8, decay=0.0, nesterov=False)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
return model
numpy.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, nb_epoch=300, batch_size=16, verbose=0, callbacks=calls)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Here is the resulting error-
RuntimeError: Cannot clone object <keras.wrappers.scikit_learn.KerasClassifier object at 0x000000001D691438>, as the constructor does not seem to set parameter callbacks
I changed the cross_val_score in the following-
numpy.random.seed(seed)
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, nb_epoch=300, batch_size=16, verbose=0, callbacks=calls)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold, fit_params={'callbacks':calls})
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
and now I get this error-
ValueError: need more than 1 value to unpack
This code came from here. The code is by far the most accurate I've used so far. The problem is that there is no defined model.fit() anywhere in the code. It also takes forever to fit. The fit() operation occurs at the results = cross_val_score(...) and there's no parameters to throw a callback in there.
How do I go about doing this?
Also, how do I run the model trained on a test set?
I need to be able to save the trained model for later use...

Reading from here, which is the source code of KerasClassifier, you can pass it the arguments of fit and they should be used.
I don't have your dataset so I cannot test it, but you can tell me if this works and if not I will try and adapt the solution. Change this line :
estimators.append(('mlp', KerasClassifier(build_fn=create_baseline, nb_epoch=300, batch_size=16, verbose=0, callbacks=[...your_callbacks...])))
A small explaination of what's happening : KerasClassifier is taking all the possibles arguments for fit, predict, score and uses them accordingly when each method is called. They made a function that filters the arguments that should go to each of the above functions that can be called in the pipeline.
I guess there are several fit and predict calls inside the StratifiedKFold step to train on different splits everytime.
The reason why it takes forever to fit and it fits 10 times is because one fit is doing 300 epochs, as you asked. So the KFold is repeating this step over the different folds :
calls fit with all the parameters given to KerasClassifier (300 epochs and batch size = 16). It's training on 9/10 of your data and using 1/10 as validation.
EDIT :
Ok, so I took the time to download the dataset and try your code... First of all you need to correct a "few" things in your network :
your input have a 60 features. You clearly show it in your data prep :
X = dataset[:,:60].astype(float)
so why would you have this :
model.add(Dropout(0.2, input_shape=(33,)))
please change to :
model.add(Dropout(0.2, input_shape=(60,)))
About your targets/labels. You changed the objective from the original code (binary_crossentropy) to categorical_crossentropy. But you didn't change your Y array. So either do this in your data preparation :
from keras.utils.np_utils import to_categorical
encoded_Y = to_categorical(encoder.transform(Y))
or change your objective back to binary_crossentropy.
Now the network's output size : 122 on the last dense layer? your dataset obviously has 2 categories so why are you trying to output 122 classes? it won't match the target. Please change back your last layer to :
model.add(Dense(2, init='normal', activation='softmax'))
if you choose to use categorical_crossentropy, or
model.add(Dense(1, init='normal', activation='sigmoid'))
if you go back to binary_crossentropy.
So now that your network compiles, I could start to troubleshout.
here is your solution
So now I could get the real error message. It turns out that when you feed fit_params=whatever in the cross_val_score() function, you are feeding those parameters to a pipeline. In order to know to which part of the pipeline you want to send those parameters you have to specify it like this :
fit_params={'mlp__callbacks':calls}
Your error was saying that the process couldn't unpack 'callbacks'.split('__', 1) into 2 values. It was actually looking for the name of the pipeline's step to apply this to.
It should be working now :)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold, fit_params={'mlp__callbacks':calls})
BUT, you should be aware of what's happening here... the cross validation actually calls the create_baseline() function to recreate the model from scratch 10 times an trains it 10 times on different parts of the dataset. So it's not doing epochs as you were saying, it's doing 300 epochs 10 times.
What is also happening as a consequence of using this tool : since the models are always differents, it means the fit() method is applied 10 times on different models, therefore, the callbacks are also applied 10 different times and the files saved by ModelCheckpoint() get overriden and you find yourself only with the best model of the last run.
This is intrinsec to the tools you use, I don't see any way around this. This comes as consequence to using different general tools that weren't especially thought to be used together with all the possible configurations.

Try:
estimators.append(('mlp',
KerasClassifier(build_fn=create_model2,
nb_epoch=300,
batch_size=16,
verbose=0,
callbacks=[list_of_callbacks])))
where list_of_callbacks is a list of callbacks you want to apply. You could find details here. It's mentioned there that parameters fed to KerasClassifier could be legal fitting parameters.
It's also worth to mention that if you are using multiple runs with GPUs there might be a problem due to several reported memory leaks especially when you are using theano. I also noticed that running multiple fits consequently may show results which seem to be not independent when using sklearn API.
Edit:
Try also:
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold, fit_params = {'mlp__callbacks': calls})
Instead of putting callbacks list in a wrapper instantiation.

This is what I have done
results = cross_val_score(estimator, X, Y, cv=kfold,
fit_params = {'callbacks': [checkpointer,plateau]})
and has worked so far

Despite the TensorFlow, Keras & SciKeras documentation suggesting you can define training callbacks via the fit method, for my setup it turns out (like #NassimBen suggests) you should do it through the model constructor instead.
Rather than this:
model = KerasClassifier(..).fit(X, y, callbacks=[<HERE>])
Try this:
model = KerasClassifier(callbacks=[<HERE>]).fit(X, y)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.