I have a data frame like this, of DNA sequences:
Feature Label
GCTAGATGACAGT 0
TTTTAAAACAG 1
TAGCTATACT 2
TGGGGCAAAAAAAA 0
AATGTCG 3
AATGTCG 0
AATGTCG 1
Where there is one column with a DNA sequence, and a label that can be 0, 1, 2 or 3 (i.e. a category of that DNA sequence). I want to develop a NN that predicts the probability of each sequence being classified into category 1, 2 or 3 (not 0; I don't care about 0). Each sequence can appear multiple times in the data frame, and it is possible that a sequence appears in multiple (or all) categories. So the output should look like this:
GCTAGATGACAGT (0.9,0.1,0.2)
TTTTAAAACAG (0.7,0.6,0.3)
TAGCTATACT (0.3,0.3,0.2)
TGGGGCAAAAAAAA (0.1,0.5,0.6)
Where the numbers in the tuple are the probabilities that the sequence is found in categories 1, 2 and 3 respectively.
I wrote this basic code to get started. You can see I've commented out the trickier bits; I'm trying to get a basic method working and then I'll gradually expand on it, but I've included everything so people can see the general idea I was thinking of.
# Split into input (X) and output (Y) variables
X = df.iloc[:,[0]].as_matrix() #as matrix due to this error: https://stackoverflow.com/questions/45479239/pandas-keyerror-not-in-index-when-training-a-keras-model
y = df.iloc[:,-1].as_matrix()
print(X[0:10])
print(y[0:10])
# Define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
kf = kfold.get_n_splits(X)
cvscores = []
for train, test in kfold.split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    # Pre-process the data
    # X_train = sequence.pad_sequences(X[train], maxlen=30) #based on 30 aa being max we're interested in
    # X_test = sequence.pad_sequences(X[test], maxlen=30) #based on 30 aa being max we're interested in
    # Create model
    model = Sequential()
    # model.add(Embedding(3000, 32, input_length=30))
    # model.add(Bidirectional(LSTM(20, return_sequences=True), input_shape=(n_timesteps, 1)))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Monitor val accuracy and perform early stopping
    # es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
    # mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)
    # Fit the model
    model.fit(X_train, y_train, epochs=150, batch_size=10, verbose=0)
    # Evaluate the model
    # scores = model.evaluate(X[test], y[test], verbose=0)
    # print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    # cvscores.append(scores[1] * 100)
#print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
#output a three sigmoid model, and plot accuracy and loss
The output first prints the sequences, as expected (i.e. the print statement):
[['GCTAGATGACAGT']
['TTTTAAAACAG']
['TAGCTATACT']
['TGGGGCAAAAAAAA']
['AATGTCG']
['AATGTCG']
['AATGTCG']
['TTATATAAAAG']
['GCTGGGAG']
['TTTGCGTATAGATAGATAG']]
[0 1 2 0 3 0 1 2 2 0]
And then I get the error:
ValueError: could not convert string to float: 'XXXX' (where XXXX is one of the sequences in the data set, but not one of the top 10 shown in the output above), and further up the traceback points to the error coming from this line:
model.fit(X_train, y_train, epochs=150, batch_size=10, verbose=0)
I did see this question, but I don't think mine has the same root cause. Can someone explain why I'm getting this? I'm wondering whether it's because I haven't yet told the model properly that I want to calculate the probability of each sequence belonging to a category, rather than predicting a single categorical feature?
As I can see from the print statements, you are feeding your NN with strings/text, and this is not possible. You have to encode them as numbers first. Different approaches are available for this: you can one-hot encode your characters, or you can create a trainable embedding for each character.
I suggest the Tokenizer from TF, which can help you with the numerical encoding of text sequences.
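For example, here is a minimal sketch of character-level integer encoding with the Keras Tokenizer (the 'Feature' column name and maxlen=30 are assumptions taken from your data frame and your commented-out code):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# char_level=True gives one integer per character, so each nucleotide (A, C, G, T) gets its own index
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(df['Feature'])
encoded = tokenizer.texts_to_sequences(df['Feature'])
# pad/truncate every sequence to the same length so the network gets a fixed-size input
X = pad_sequences(encoded, maxlen=30)

After this, X is an integer matrix of shape (num_sequences, 30) that an Embedding layer (like the one you commented out) can consume directly.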
I have seen that many others on stackoverflow have posted about this same problem, but I haven't been able to figure out how to apply those solutions to my example.
I have been working on a model to predict an outcome of either 0 or 1 based on a dataset that contains 16 features. Everything has seemed to work fine (accuracy evaluation, epoch completion, etc.).
As mentioned, my training features include 16 different variables, but when I pass in a list of 16 values (separate from the training dataset) to try to make an individual prediction (of either 0 or 1), I get this error:
ValueError: Layer sequential_11 expects 1 input(s), but it received 16 input tensors.
Here is my code -
y = datas.Result
X = datas.drop(columns = ['Date', 'home_team', 'away_team', 'home_pitcher', 'away_pitcher', 'Result'])
X = X.values.astype('float32')
y = y.values.astype('float32')
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size = 0.2)
model = keras.Sequential([
    keras.layers.Dense(32, input_shape=(16,)),
    keras.layers.Dense(20, activation=tf.nn.relu),
    keras.layers.Dense(2, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['acc'])
history = model.fit(X_train,y_train,epochs=20, validation_data=(X_validation, y_validation))
#all variables within features list are single values, ex: .351, 11, .991, etc.
features = [t1_pqm,t2_pqm,t1_elo,t2_elo,t1_era,t2_era,t1_bb9,t2_bb9,t1_fip,t2_fip,t1_ba,t2_ba,t1_ops,t2_ops,t1_so,t2_so]
prediction = model.predict(features)
The model expects an input of shape (None, 16), but features has the shape (16,) (a 1D list). The easiest solution is to make it a NumPy array with the right shape, (1, 16):
features = np.array([[t1_pqm,t2_pqm,t1_elo,t2_elo,t1_era,t2_era,t1_bb9,t2_bb9,t1_fip,t2_fip,t1_ba,t2_ba,t1_ops,t2_ops,t1_so,t2_so]])
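Equivalently, you can keep building the flat list you already have and just reshape it before predicting; a minimal sketch (assuming NumPy is imported as np):

import numpy as np

features = np.array(features, dtype='float32').reshape(1, 16)  # one sample with 16 features
prediction = model.predict(features)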
I'm using Tensorflow to train a network to predict the third item in a list of numbers.
When I train, the network appears to train quite well and does well on both the training and test sets. However, when I evaluate its performance myself, it seems to be doing quite poorly.
For example, at the end of training, Tensorflow says that the validation loss is 2.1 x 10^(-5). However, when I compute it myself, I get 0.17 x 10^0. What am I doing wrong?
Here's code that can be run on Google Colab:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
def create_dataset(k=5, n=2, example_amount=200):
    '''Create a dataset of numbers where the goal is to always output the nth number'''
    # UPGRADE: this could be done better with numpy to just generate all the examples at once
    example_amount = 1000
    x = []
    y = []
    ans = [x, y]
    for i in range(example_amount):
        example_x = np.random.rand(k)
        example_y = example_x[n]
        x.append(example_x)
        y.append(example_y)
    return ans

def tensorize(tensor_like) -> tf.Tensor:
    '''Turn stuff into tensors'''
    return tf.convert_to_tensor(tensor_like, dtype=tf.float32)

def split_dataset(dataset, train_split=0.8, random_state=42):
    '''
    Takes in a list (or tuple) where index 0 contains the inputs and index 1 contains the outputs
    outputs x_train, x_test, y_train, y_test, train_indexes, test_indexes all as tf.Tensor
    '''
    indices = np.arange(len(dataset[0]))
    return tuple([tensorize(data) for data in train_test_split(dataset[0], dataset[1], indices, train_size=train_split, random_state=random_state)])
# how many numbers in each example
K = 5
# the index of the solution
N = 2
# how many examples
EXAMPLE_AMOUNT = 20000
# what percentage of the examples are in the training set
TRAIN_SPLIT = 0.5
# how long to train for
epochs = 50
dataset = create_dataset(K, N, EXAMPLE_AMOUNT)
x_train, x_test, y_train, y_test, train_indexes, test_indexes = split_dataset(dataset, train_split=TRAIN_SPLIT)
model_input = tf.keras.layers.Input(shape=(K,), name="input")
model_dense1 = tf.keras.layers.Dense(10, name="dense1")(model_input)
model_dense2 = tf.keras.layers.Dense(10, name="dense2")(model_dense1)
model_output = tf.keras.layers.Dense(1, name="output")(model_dense2)
model = tf.keras.Model(inputs=model_input, outputs=model_output)
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
history = model.fit(x=x_train, y=y_train, validation_data=(x_test, y_test), epochs=epochs)
# the validation loss as Tensorflow computes it
print(history.history["val_loss"][-1]) # 2.1036579710198566e-05
# the validation loss as I compute it
val_loss = tf.math.reduce_mean(tf.keras.losses.MSE(y_test, model.predict(x_test))).numpy()
print(val_loss) # 0.1655631
What you're missing is the shape of y_test:
y_test.numpy().shape
(500,) <-- causing the behaviour
Simply reshape it like:
val_loss = tf.math.reduce_mean(tf.keras.losses.MSE(y_test.numpy().reshape(-1,1), model.predict(x_test))).numpy()
print(val_loss) # 1.1548506e-05
Also:
history.history["val_loss"][-1] # 1.1548506336112041e-05
Or you can flatten() both arrays when calculating it:
val_loss = tf.math.reduce_mean(tf.keras.losses.MSE(y_test.numpy().flatten(), model.predict(x_test).flatten())).numpy()
print(val_loss) # 1.1548506e-05
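To see why the shape matters: model.predict(x_test) returns shape (500, 1) while y_test has shape (500,), so the subtraction inside MSE broadcasts the two to (500, 500) and the mean is taken over the wrong pairs. A small standalone sketch of that effect:

import numpy as np
import tensorflow as tf

y_true = np.array([1.0, 2.0], dtype=np.float32)       # shape (2,), like y_test
y_pred = np.array([[1.0], [2.0]], dtype=np.float32)   # shape (2, 1), like model.predict output
print((y_pred - y_true).shape)                         # (2, 2): every prediction paired with every target
print(tf.math.reduce_mean(tf.keras.losses.MSE(y_true, y_pred)).numpy())  # 0.5 instead of the correct 0.0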
I'm working on a classification task, trying to reconstruct a network from a paper. In that paper, they do a train/test split 300 times, train the network each time, and then take the mean of all the predictions from each network for specific input data.
So here's the question: what is the best way to do that? I've already reconstructed their network and am thinking about using a for loop and saving the outputs of each network in a data frame, but I can't get it working.
Here's the code :
# Set X and Y for training
X = dum_bll_fsrq.drop(['type2', 'name', 'Type_is_bll', 'Type_is_fsrq'], axis = 1)
Y = dum_bll_fsrq.iloc[:,-2:]
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, stratify = Y)
# Create model
model_two_neuron = tf.keras.Sequential([
    tf.keras.layers.Dense(40, input_shape=(15,)),  # input shape required
    tf.keras.layers.Dense(2, activation=tf.nn.sigmoid)
])
model_two_neuron.compile(optimizer=tf.keras.optimizers.Adam(),
                         loss=tf.keras.losses.MeanSquaredError(),
                         metrics=[tf.keras.metrics.Precision()])
# Train
model_two_neuron.fit(X_train, y_train, epochs=20)
You can use callbacks to save the best weights for each of your models, and then, after training, evaluate each model using the best weights the callback saved.
Here is a basic example from the documentation:
model.compile(loss=..., optimizer=...,
              metrics=['accuracy'])

EPOCHS = 10
checkpoint_filepath = '/tmp/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)

# Model weights are saved at the end of every epoch, if it's the best seen
# so far.
model.fit(epochs=EPOCHS, callbacks=[model_checkpoint_callback])

# The model weights (that are considered the best) are loaded into the model.
model.load_weights(checkpoint_filepath)
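If you also need the repeated train/test split averaging described in the paper, here is a minimal sketch of one way to do it (build_model() is a hypothetical helper that returns a freshly compiled copy of your network, and X_new is a hypothetical name for the specific inputs you want averaged predictions for):

import numpy as np
from sklearn.model_selection import train_test_split

all_preds = []
for i in range(300):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)
    model = build_model()                   # hypothetical: rebuilds and recompiles the network each round
    model.fit(X_train, y_train, epochs=20, verbose=0)
    all_preds.append(model.predict(X_new))  # X_new: your inputs of interest (hypothetical name)
mean_pred = np.mean(all_preds, axis=0)      # average over the 300 networks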
I have a CNN model that takes two independent documents as input and runs the convolutional and pooling layers on each separately. After the pooling layers, it concatenates the two pooled feature maps and feeds them into one fully connected layer. The model compiles successfully. However, I now have a problem fitting the model when producing the training history.
The main idea of my problem: how do I change x_train into x_train1 and x_train2, and x_test into x_test1 and x_test2, while y_train and y_test remain the same?
I've referenced this kind of pattern from a site:
history = model.fit(X_train, y_train, epochs=100, verbose=False, validation_data=(X_test, y_test), batch_size=10)
# fit the model method 1
train_history = model.fit(train_data = (X_train1, X_train2),
validation_data=(x_test1, x_test2), epochs=8, batch_size=8, verbose=1)
# =================
# fit the model method 2
train_history = model.fit(x = [x_train1, x_train2], y = [y_train], validation_split=0.2, validation_data=[[x_test1, x_test2], [y_test]], epochs=8, batch_size=8, verbose=1)
The error message from method 1 is:
Unrecognized keyword arguments: {'train_data': (['.....'......}
The error message from method 2 is:
ValueError: Error when checking input: expected input_1 to have shape (10,) but got array with shape (1,)
Every x_train and x_test is a long list containing many strings, whereas y_train and y_test are lists that map each string in x_train/x_test to a specific label; with 3 classes the labels look like [0,0,1], [0,1,0], [1,0,0].
I expect to fit the model successfully.
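For what it's worth, a Keras model with two inputs and one output is normally fit by passing the inputs as a list and the labels as a single array; here is a minimal sketch along the lines of the second attempt (epochs/batch_size kept from the post, variable names assumed from it):

train_history = model.fit([x_train1, x_train2], y_train,
                          validation_data=([x_test1, x_test2], y_test),
                          epochs=8, batch_size=8, verbose=1)

Note that the inputs still need to be numeric arrays of the shapes the two input layers expect, not lists of raw strings.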
I am trying to learn the Keras Graph model using simple examples. I have a simple example dataset with 784 features and I want to run this example:
# graph model with one input and two outputs
graph = Graph()
graph.add_input(name='input', input_shape=(784,))
graph.add_node(Dense(input_dim=784, output_dim=13), name='dense1', input='input')
graph.add_node(Dense(input_dim=13, output_dim=1), name='dense2', input='input')
graph.add_node(Dense(input_dim=13, output_dim=1), name='dense3', input='dense1')
graph.add_output(name='output1', input='dense2')
graph.add_output(name='output2', input='dense3')
graph.compile('rmsprop', {'output1': 'mse', 'output2': 'mse'})
graph.fit({'input': X_train, 'output1': y_train, 'output2': y_train}, nb_epoch=30)
### here is where I am facing difficulty
score = graph.evaluate({'input': X_test, 'output1': y_test, 'output2': y_test}, batch_size=16, verbose=1)
print 'score: ', score
The documentation mentions that graph.evaluate():
evaluate(data, batch_size=128, verbose=1): Show performance of the model over some validation data.
Return: The loss score over the data.
Arguments: Same meaning as fit method above. verbose is used as a binary flag (progress bar or nothing).
And from the definition of the graph.fit() we know that:
Arguments:
data: dictionary mapping input names and output names to appropriate numpy arrays. All arrays should contain the same number of samples.
Although my fit method runs perfectly, I get IndexError: index 1 is out of bounds for size 1 on evaluate.
My input shapes are:
Xtrain: (32738, 784)
Xtest: (16125, 784)
ytest: (16125,)
ytrain: (32738,)
What am I missing here ?