Graph Model evaluation index error - python

I am trying to learn Keras GraphNN using simple examples. I have a simple example dataset with 784 features and I want to run this example:
# graph model with one input and two outputs
graph = Graph()
graph.add_input(name='input', input_shape=(784,))
graph.add_node(Dense(input_dim=784, output_dim=13), name='dense1', input='input')
graph.add_node(Dense(input_dim=784, output_dim=1), name='dense2', input='input')  # dense2 is fed directly from the 784-dim input, so input_dim must be 784
graph.add_node(Dense(input_dim=13, output_dim=1), name='dense3', input='dense1')
graph.add_output(name='output1', input='dense2')
graph.add_output(name='output2', input='dense3')
graph.compile('rmsprop', {'output1': 'mse', 'output2': 'mse'})
graph.fit({'input': X_train, 'output1': y_train, 'output2': y_train}, nb_epoch=30)
### here is where I am facing difficulty
score = graph.evaluate({'input': X_test, 'output1': y_test, 'output2': y_test}, batch_size=16, verbose=1)
print('score:', score)
The documentation describes graph.evaluate() as follows:
evaluate(data, batch_size=128, verbose=1): Show performance of the model over some validation data.
Return: The loss score over the data.
Arguments: Same meaning as fit method above. verbose is used as a binary flag (progress bar or nothing).
And from the definition of graph.fit() we know that:
Arguments:
data: dictionary mapping input names and output names to appropriate numpy arrays. All arrays should contain the same number of samples.
Although my fit method runs perfectly, I get IndexError: index 1 is out of bounds for size 1 on evaluate.
My input shapes are:
Xtrain: (32738, 784)
Xtest: (16125, 784)
ytest: (16125,)
ytrain: (32738,)
What am I missing here?

Related

Splitting data into training, testing and validation sets when making a Keras model

I'm a little confused about splitting the dataset when making and evaluating Keras machine learning models.
Let's say that I have a dataset of 1000 rows.
features = df.iloc[:,:-1]
results = df.iloc[:,-1]
Now I want to split this data into training and testing (33% of data for testing, 67% for training):
X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
I have read on the internet that fitting the data into model should look like this:
history = model.fit(features, results, validation_split = 0.2, epochs = 10, batch_size=50)
So I'm fitting the full data (features and results) to my model, and from that data I'm using 20% for validation: validation_split = 0.2.
So basically, my model will be trained with 80% of the data and tested on 20% of it.
So confusion starts when I need to evaluate the model:
score = model.evaluate(X_test, y_test, batch_size=50)
Is this correct?
I mean, why should I split the data into training and testing at all? Where do X_train and y_train go?
Can you please explain the correct order of steps for creating a model?
Generally, at training time (model.fit), you have two sets: one is the training set and the other is the validation/tuning/development set. With the training set, you train the model; with the validation set, you find the best set of hyperparameters. And when you're done, you may then test your model on an unseen data set - a set that was completely hidden from the model, unlike the training or validation sets.
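As a minimal sketch of that three-set workflow, assuming the question's features and results arrays (the 60/20/20 proportions are just an example choice):
from sklearn.model_selection import train_test_split

# First hold out a test set (20% of all data) that the model never sees during training or tuning
X_tmp, X_test, y_tmp, y_test = train_test_split(features, results, test_size=0.2)
# Then split the remainder into training and validation sets
# (0.25 of the remaining 80% = 20% of the full data)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25)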
Now, when you used
X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
By this, you split the features and results into 33% of the data for testing and 67% for training. Now, you can do one of two things:
use X_test and y_test as the validation set in model.fit(...), or
use them for the final prediction in model.predict(...).
So, if you choose these test sets as a validation set (option 1), you would do as follows:
model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test), ...)
In the training log, you will get the validation results along with the training score. The validation results should be the same if you later compute model.evaluate(X_test, y_test).
Now, if you choose those test sets for the final prediction or final evaluation (option 2), then you need to make a new validation set or use the validation_split argument as follows:
model.fit(x=X_train, y=y_train,
          validation_split=0.2, ...)
Keras will take 20% of the training data (X_train and y_train) and use it for validation. And lastly, for the final evaluation of your model, you can do as follows:
y_pred = model.predict(X_test, batch_size=50)
Now you can compare y_test and y_pred with some relevant metrics.
Generally, you'd want to use the X_train and y_train data that you have split as arguments in the fit method. So it would look something like:
history = model.fit(X_train, y_train, batch_size=50)
Not splitting your data beforehand and using the validation_split argument instead works as well; just be careful to refer to the Keras documentation on the validation_data and validation_split arguments to make sure the data is split up as expected.
There is a related question here:
https://datascience.stackexchange.com/questions/38955/how-does-the-validation-split-parameter-of-keras-fit-function-work
Keras documentation:
https://keras.rstudio.com/reference/fit.html
I have read on the internet that fitting the data into model should look like this:
That means you need to fit both features and labels. You have already split them into X_train and y_train, so your fit should look like this:
history = model.fit(X_train, y_train, validation_split=0.2, epochs=10, batch_size=50)
So confusion starts when I need to evaluate the model:
score = model.evaluate(X_test, y_test, batch_size=50) --> Is this correct?
That's correct: you evaluate the model using the testing features and the corresponding labels. Furthermore, if you only want the predicted labels, for example, you can use:
y_hat = model.predict(X_test)
Then you can compare y_hat with y_test, e.g. compute a confusion matrix.
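As a minimal sketch of that comparison, assuming a multi-class softmax output (with a single sigmoid output you would threshold at 0.5 instead of taking the argmax):
import numpy as np
from sklearn.metrics import confusion_matrix

y_hat = model.predict(X_test)            # class probabilities, shape (n_samples, n_classes)
y_hat_labels = np.argmax(y_hat, axis=1)  # most probable class per sample
print(confusion_matrix(y_test, y_hat_labels))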

keras: could not convert string to float in model.fit

I have a data frame like this, of DNA sequences:
Feature Label
GCTAGATGACAGT 0
TTTTAAAACAG 1
TAGCTATACT 2
TGGGGCAAAAAAAA 0
AATGTCG 3
AATGTCG 0
AATGTCG 1
There is one column with a DNA sequence, and a label that can be 0, 1, 2 or 3 (i.e. a category of that DNA sequence). I want to develop a NN that predicts the probability of each sequence being classified into category 1, 2 or 3 (not 0; I don't care about 0). Each sequence can appear multiple times in the data frame, and it is possible that a sequence appears in multiple (or all) categories. So the output should look like this:
GCTAGATGACAGT (0.9,0.1,0.2)
TTTTAAAACAG (0.7,0.6,0.3)
TAGCTATACT (0.3,0.3,0.2)
TGGGGCAAAAAAAA (0.1,0.5,0.6)
Where the numbers in the tuple are the probabilities that the sequence is found in categories 1, 2 and 3.
I wrote this basic code to get started. You can see I've commented out the trickier bits; I'm trying to get a basic method working and then I'll gradually expand on it, but I've included everything so people can see the general idea I was thinking of:
import numpy
from sklearn.model_selection import StratifiedKFold
from keras.models import Sequential
from keras.layers import Dense

seed = 7  # assumed here; the original snippet uses `seed` without defining it

# Split into input (X) and output (y) variables
X = df.iloc[:, [0]].as_matrix()  # as_matrix due to this error: https://stackoverflow.com/questions/45479239/pandas-keyerror-not-in-index-when-training-a-keras-model
y = df.iloc[:, -1].as_matrix()
print(X[0:10])
print(y[0:10])
# Define 10-fold cross-validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
kf = kfold.get_n_splits(X)
cvscores = []
for train, test in kfold.split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    # Pre-process the data
    # X_train = sequence.pad_sequences(X[train], maxlen=30)  # based on 30 aa being the max we're interested in
    # X_test = sequence.pad_sequences(X[test], maxlen=30)
    # Create model
    model = Sequential()
    # model.add(Embedding(3000, 32, input_length=30))
    # model.add(Bidirectional(LSTM(20, return_sequences=True), input_shape=(n_timesteps, 1)))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Monitor val accuracy and perform early stopping
    # es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
    # mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)
    # Fit the model
    model.fit(X_train, y_train, epochs=150, batch_size=10, verbose=0)
    # Evaluate the model
    # scores = model.evaluate(X_test, y_test, verbose=0)
    # print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    # cvscores.append(scores[1] * 100)
# print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
# output a three-sigmoid model, and plot accuracy and loss
The output first prints the sequences, as expected (i.e. from the print statements):
[['GCTAGATGACAGT']
['TTTTAAAACAG']
['TAGCTATACT']
['TGGGGCAAAAAAAA']
['AATGTCG']
['AATGTCG']
['AATGTCG']
['TTATATAAAAG']
['GCTGGGAG']
['TTTGCGTATAGATAGATAG']]
[0 1 2 0 3 0 1 2 2 0]
And then I get the error:
ValueError: could not convert string to float: 'XXXX' (where XXXX is one of the sequences in the data set, but not one of the first 10 shown above), and further up the traceback points to this line:
model.fit(X_train, y_train, epochs=150, batch_size=10, verbose=0)
I did see this question, but I don't think mine has the same root cause. Can someone explain why I'm getting this? I'm wondering if it's because I haven't yet properly told the model that I'm dealing with calculating the probability of a sequence rather than a categorical feature?
As the print statements show, you are feeding your NN strings/text, and this is not possible. You have to encode them into numbers. Different approaches are available to carry out this operation: you can one-hot encode your characters, or you can create a trainable embedding for each character.
I suggest the Tokenizer from TensorFlow/Keras, which can help you in the process of numerically encoding text sequences.
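As a minimal sketch of that encoding step (maxlen=30 is borrowed from the commented-out pad_sequences calls in the question; everything else is just a plain character-level Tokenizer):
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = ['GCTAGATGACAGT', 'TTTTAAAACAG', 'TAGCTATACT']

# char_level=True assigns one integer per character (A, C, G, T, ...)
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(sequences)
encoded = tokenizer.texts_to_sequences(sequences)

# Pad to a fixed length so the result is a rectangular numeric array
X = pad_sequences(encoded, maxlen=30)
print(X.shape)  # (3, 30), integer-encoded and ready for an Embedding layer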

How to get the log loss?

I am playing with the Leaf Classification data set and I am struggling to compute the log loss of my model after testing it. After importing log_loss from sklearn.metrics, I do:
# fitting the knn with train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Optimisation via grid search
knn = KNeighborsClassifier()
params = {'n_neighbors': range(1, 40),
          'weights': ['uniform', 'distance'],
          'metric': ['minkowski', 'euclidean'],
          'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']}
k_grd = GridSearchCV(estimator=knn, param_grid=params, cv=5)
k_grd.fit(X_train, y_train)

# testing
yk_grd = k_grd.predict(X_test)

# calculating the log loss
print(log_loss(y_test, yk_grd))
However, my last line results in the following error:
y_true and y_pred contain different number of classes 93, 2. Please provide the true labels explicitly through the labels argument. Classes found in y_true.
But when I run the following:
X_train.shape, X_test.shape, y_train.shape, y_test.shape, yk_grd.shape
# results
((742, 192), (248, 192), (742,), (248,), (248,))
What am I really missing here?
From the sklearn.metrics.log_loss documentation:
y_pred : array-like of float, shape = (n_samples, n_classes) or (n_samples,)
Predicted probabilities, as returned by a classifier's predict_proba method.
Then, to get log loss:
yk_grd_probs = k_grd.predict_proba(X_test)
print(log_loss(y_test, yk_grd_probs))
If you still get an error, it means that a specific class is missing in y_test.
Use:
print(log_loss(y_test, yk_grd_probs, labels=all_classes))
where all_classes is a list containing all the classes in your dataset.
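Since the train/test split came from a single y array, a simple sketch for building all_classes would be:
import numpy as np

all_classes = np.unique(y)  # y is the full label array the train/test split was made from
print(log_loss(y_test, yk_grd_probs, labels=all_classes))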

CNN text document classification with Keras: how to fit a model with two independent inputs

I have a CNN model that takes two independent documents as input and applies the convolutional and pooling layers to each separately. After the pooling layers, it concatenates the two pooled feature maps and feeds them into one fully connected layer. The model compiles successfully. However, I now have a problem fitting the model to obtain the training history.
The main idea of my problem: how do I change x_train into x_train1 and x_train2, and x_test into x_test1 and x_test2, while y_train and y_test remain the same?
I've seen this kind of pattern referenced on a site:
history = model.fit(X_train, y_train, epochs=100, verbose=False, validation_data=(X_test, y_test), batch_size=10)
# fit the model, method 1
train_history = model.fit(train_data=(x_train1, x_train2),
                          validation_data=(x_test1, x_test2),
                          epochs=8, batch_size=8, verbose=1)

# fit the model, method 2
train_history = model.fit(x=[x_train1, x_train2], y=[y_train],
                          validation_split=0.2,
                          validation_data=[[x_test1, x_test2], [y_test]],
                          epochs=8, batch_size=8, verbose=1)
The error message for method 1 is:
Unrecognized keyword arguments: {'train_data': (['.....'......}
The error message for method 2 is:
ValueError: Error when checking input: expected input_1 to have shape (10,) but got array with shape (1,)
Each of the x_train and x_test variables is a long list containing many strings, whereas y_train and y_test are lists mapping each string in x_train/x_test to a specific label. For classification into 3 types, the labels look like this: [0,0,1], [0,1,0], [1,0,0].
I expect to fit the model successfully.
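For reference, a Keras functional-API model with two inputs takes a list of input arrays in fit, and validation_data takes an (inputs, targets) tuple. A minimal sketch of the call, assuming the model was built as Model(inputs=[input_1, input_2], outputs=output) and the documents have already been encoded as numeric arrays:
# Sketch only: x_train1/x_train2 must already be numeric arrays
# (e.g. tokenized and padded), not lists of raw strings
train_history = model.fit(
    [x_train1, x_train2],                          # one array per model input, in order
    y_train,                                       # a single target array of shape (n_samples, 3)
    validation_data=([x_test1, x_test2], y_test),  # (inputs, targets) tuple
    epochs=8, batch_size=8, verbose=1)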

Using `predict` in Keras to predict a 1D array in the same order as given

I am doing regression in Keras, with a neural network with 1 input, 10 hidden units and 1 output. I fit the model as usual:
model.fit(x_train, y_train, nb_epoch=15, batch_size=32)
Now I want to predict for an xtest that is (like x_train and y_train) a very big 1-dimensional numpy array. In the documentation on the Keras website, you can find:
predict(self, x, batch_size=32, verbose=0)
so I understand you have to do:
model.predict(xtest, batch_size=32)
I am confused by the batch_size argument. Does it mean that predict takes the values of xtest in a random order?
What I need is for predict to generate the outputs in exactly the same order as given by xtest: first the output predicted for xtest[0], then the output predicted for xtest[1], then the output predicted for xtest[2], and so on. With that predicted array I want to do some comparisons with an actual ytest that I have, and compute some statistics. So the order is essential. How can I do it?
Thank you in advance.
The predict method preserves the order of the examples. The batch size matters when your data is big and you simply cannot load all the examples into memory at once; the data is then loaded and evaluated batch by batch, in the order of the original set.
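So a short sketch of the intended comparison (assuming xtest and ytest are aligned index by index and the model has a single output):
import numpy as np

preds = model.predict(xtest, batch_size=32)  # preds[i] is the prediction for xtest[i]
preds = preds.ravel()  # flatten the (n, 1) output so it matches ytest's shape

# Order is preserved, so element-wise statistics line up correctly
mse = np.mean((preds - ytest) ** 2)
print('test MSE:', mse)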
