I'm working on a classification task and trying to reconstruct a network from a paper. In that paper, they perform a train/test split 300 times, train the network on each split, and then take the mean of the predictions from all of these networks for a given input.
So here's the question: what is the best way to do that? I've already reconstructed their network and I'm thinking of using a for loop and saving the outputs of each network in a data frame, but I can't get it to work properly.
Here's the code :
# Set X and Y for training
X = dum_bll_fsrq.drop(['type2', 'name', 'Type_is_bll', 'Type_is_fsrq'], axis = 1)
Y = dum_bll_fsrq.iloc[:,-2:]
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, stratify = Y)
# Create model
model_two_neuron = tf.keras.Sequential([
    tf.keras.layers.Dense(40, input_shape=(15,)),  # input shape required
    tf.keras.layers.Dense(2, activation=tf.nn.sigmoid)
])
model_two_neuron.compile(optimizer=tf.keras.optimizers.Adam(),
                         loss=tf.keras.losses.MeanSquaredError(),
                         metrics=[tf.keras.metrics.Precision()])
# Train
model_two_neuron.fit(X_train, y_train, epochs=20)
You can use callbacks to save the best weights for each of your models, and then, after training, evaluate each model with the best weights saved by its callback.
Here is a basic example, provided in the Documentation:
model.compile(loss=..., optimizer=...,
              metrics=['accuracy'])
EPOCHS = 10
checkpoint_filepath = '/tmp/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',
    mode='max',
    save_best_only=True)
# Model weights are saved at the end of every epoch, if it's the best seen
# so far.
model.fit(epochs=EPOCHS, callbacks=[model_checkpoint_callback])
# The model weights (that are considered the best) are loaded into the model.
model.load_weights(checkpoint_filepath)
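For the repeated-split part of your question, a minimal sketch of the loop-and-average idea could look like the following. It reuses the X, Y, and model definition from your own code; the number of runs and the choice to average predictions over the full X are assumptions, so adjust them to whatever inputs you actually need.

import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

N_RUNS = 300          # as many repeated splits as in the paper
all_preds = []

for run in range(N_RUNS):
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, stratify=Y)

    # Rebuild the model every run so each network starts from fresh weights
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(40, input_shape=(15,)),
        tf.keras.layers.Dense(2, activation=tf.nn.sigmoid)
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss=tf.keras.losses.MeanSquaredError(),
                  metrics=[tf.keras.metrics.Precision()])
    model.fit(X_train, y_train, epochs=20, verbose=0)

    # Store this run's predictions for the inputs you care about
    all_preds.append(model.predict(X, verbose=0))

# Element-wise mean over the runs: one averaged prediction per input row
mean_pred = np.mean(np.stack(all_preds), axis=0)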
I am trying to build an LSTM model to detect the sentiment of texts (0 -> normal, 1 -> hateful). After I trained my model, I sent some texts to it for prediction, and the predicted results were as I expected. However, after I load my model from the "h5" file, I cannot get the same results even if I send the same texts. Here is my training code:
texts = tweets['text']
labels = tweets['label']
labels = LabelEncoder().fit_transform(labels)
labels = labels.reshape(-1, 1)
X_train, X_test, Y_train, Y_test = train_test_split(texts, labels, test_size=0.20)
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences, maxlen=max_len)
inputs = Input(name='inputs', shape=[max_len])
layer = Embedding(max_words, 50, input_length=max_len)(inputs)
layer = LSTM(64)(layer)
layer = Dense(256, name='FC1')(layer)
layer = Activation('relu')(layer)
layer = Dropout(0.5)(layer)
layer = Dense(1, name='out_layer')(layer)
layer = Activation('sigmoid')(layer)
model = Model(inputs=inputs, outputs=layer)
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.0001,
restore_best_weights=False)
model.summary()
model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])
model.fit(sequences_matrix, Y_train, batch_size=128, shuffle=True, epochs=10,
validation_split=0.2, callbacks=[earlyStopping])
model.save("ModelsDL/LSTM.h5")
test_sequences = tokenizer.texts_to_sequences(X_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences, maxlen=max_len)
accr = model.evaluate(test_sequences_matrix, Y_test)
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(accr[0], accr[1]))
texts = ["hope", "feel relax", "feel energy", "peaceful day"]
tokenizer.fit_on_texts(texts)
test_samples_token = tokenizer.texts_to_sequences(texts)
test_samples_tokens_pad = pad_sequences(test_samples_token, maxlen=max_len)
print(model.predict(x=test_samples_tokens_pad))
del model
The output of print(model.predict(x=test_samples_tokens_pad)) is:
[[0.0387207 ]
[0.02622151]
[0.3856796 ]
[0.03749594]]
Text with "normal" sentiment results closer to 0.Also text with "hateful" sentiment results closer to 1.
As you see in the output, my results are consistent because they have "normal" sentiment.
However, after I load my model, I always encounter different results. Here is my codes:
texts = ["hope", "feel relax", "feel energy", "peaceful day"] # same texts
model = load_model("ModelsDL/LSTM.h5")
tokenizer.fit_on_texts(texts)
test_samples_token = tokenizer.texts_to_sequences(texts)
test_samples_tokens_pad = pad_sequences(test_samples_token, maxlen=max_len)
print(model.predict(x=test_samples_tokens_pad))
Output of print(model.predict(x=test_samples_tokens_pad)):
[[0.9838583 ]
[0.99957573]
[0.9999665 ]
[0.9877912 ]]
As you can see, the same LSTM model now treats the texts as if they had a hateful sentiment.
What should I do about this problem?
EDIT: I solved the problem. I saved the tokenizer that was used during model training, and then loaded that saved tokenizer before tokenizing the texts for prediction.
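For reference, a minimal sketch of saving the training tokenizer and reloading it at prediction time with pickle (the file path is just an example):

import pickle

# In the training script, right after tokenizer.fit_on_texts(X_train):
with open("ModelsDL/tokenizer.pickle", "wb") as handle:
    pickle.dump(tokenizer, handle)

# In the prediction script, before tokenizing the new texts:
with open("ModelsDL/tokenizer.pickle", "rb") as handle:
    tokenizer = pickle.load(handle)

test_samples_token = tokenizer.texts_to_sequences(texts)  # no fit_on_texts on the new texts
test_samples_tokens_pad = pad_sequences(test_samples_token, maxlen=max_len)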
In your train/test split you need to set a random state to get reproducible results. For example:
X_train, X_test, Y_train, Y_test = train_test_split(texts, labels, test_size=0.20, random_state=15)
Try different states like 1, 2, 3, 4, ... Once you get a result you like, you can keep that random state and reuse it later. Hope this solves your problem.
I'm a little confused about splitting the dataset when I'm making and evaluating Keras machine learning models.
Let's say that I have a dataset of 1000 rows.
features = df.iloc[:,:-1]
results = df.iloc[:,-1]
Now I want to split this data into training and testing (33% of data for testing, 67% for training):
x_train, x_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
I have read on the internet that fitting the data to the model should look like this:
history = model.fit(features, results, validation_split = 0.2, epochs = 10, batch_size=50)
So I'm fitting the full data (features and results) to my model, and from that data I'm using 20% for validation: validation_split = 0.2.
So basically, my model will be trained with 80% of the data and tested on the remaining 20%.
So confusion starts when I need to evaluate the model:
score = model.evaluate(x_test, y_test, batch_size=50)
Is this correct?
I mean, why should I split the data into training and testing at all? Where do x_train and y_train go?
Can you please explain what the correct order of steps is for creating a model?
Generally, at training time (model.fit) you have two sets: the training set and the validation/tuning/development set. You train the model with the training set, and you use the validation set to find the best set of hyper-parameters. When you're done, you can then test your model on unseen data: a set that was completely hidden from the model, unlike the training or validation set.
Now, when you use
X_train, X_test, y_train, y_test = train_test_split(features, results, test_size=0.33)
you split the features and results into 33% of the data for testing and 67% for training. You can then do one of two things:
1. use X_test and y_test as the validation set in model.fit(...), or
2. use them for the final prediction in model.predict(...).
So, if you choose the test set as a validation set (option 1), you would do as follows:
model.fit(x=X_train, y=y_train,
          validation_data=(X_test, y_test), ...)
In the training log, you will get the validation results along with the training score. The validation results should be the same if you later compute model.evaluate(X_test, y_test).
Now, if you choose the test set as the final prediction or final evaluation set (option 2), then you need to make a new validation set or use the validation_split argument as follows:
model.fit(x=X_train, y=y_train,
          validation_split=0.2, ...)
The Keras API will take 20% of the training data (X_train and y_train) and use it for validation. And lastly, for the final evaluation of your model, you can do as follows:
y_pred = model.predict(X_test, batch_size=50)
Now, you can compare y_test and y_pred with some relevant metrics.
Generally, you'd want to use your X_train, y_train data that you have split as arguments in the fit method. So it would look something like:
history = model.fit(X_train, y_train, batch_size=50)
While not splitting your data beforehand and simply passing the validation_split argument to the fit method works as well, be careful to read the Keras documentation on the validation_data and validation_split arguments to make sure the data is being split up as you expect.
There is a related question here:
https://datascience.stackexchange.com/questions/38955/how-does-the-validation-split-parameter-of-keras-fit-function-work
Keras documentation:
https://keras.rstudio.com/reference/fit.html
I have read on the internet that fitting the data to the model should look like this:
That means you need to fit features and labels. You already split them into x_train & y_train. So your fit should look like this:
history = model.fit(x_train, y_train, validation_split = 0.2, epochs = 10, batch_size=50)
So confusion starts when I need to evaluate the model:
score = model.evaluate(x_test, y_test, batch_size=50) --> Is this correct?
That's correct: you evaluate the model using the test features and the corresponding labels. Furthermore, if you only want the predicted labels, for example, you can use:
y_hat = model.predict(X_test)
Then you can compare y_hat with y_test, e.g. compute a confusion matrix.
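For instance, a minimal sketch of that comparison with scikit-learn, reusing y_hat and y_test from above (the 0.5 threshold assumes a single sigmoid output unit):

from sklearn.metrics import accuracy_score, confusion_matrix

y_hat_labels = (y_hat > 0.5).astype(int)  # turn predicted probabilities into 0/1 labels
print(accuracy_score(y_test, y_hat_labels))
print(confusion_matrix(y_test, y_hat_labels))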
I have a data frame like this, of DNA sequences:
Feature Label
GCTAGATGACAGT 0
TTTTAAAACAG 1
TAGCTATACT 2
TGGGGCAAAAAAAA 0
AATGTCG 3
AATGTCG 0
AATGTCG 1
There is one column with a DNA sequence and a label that can be 0, 1, 2 or 3 (i.e. a category of that DNA sequence). I want to develop a NN that predicts the probability of each sequence being classified into category 1, 2 or 3 (not 0, I don't care about 0). Each sequence can appear multiple times in the data frame, and it is possible that a sequence appears in multiple (or all) categories. So the output should look like this:
GCTAGATGACAGT (0.9,0.1,0.2)
TTTTAAAACAG (0.7,0.6,0.3)
TAGCTATACT (0.3,0.3,0.2)
TGGGGCAAAAAAAA (0.1,0.5,0.6)
Where the numbers in the tuple are the probabilities that the sequence is found in categories 1, 2 and 3.
I wrote this basic code to get started. You can see I've commented out the trickier bits; I'm trying to get a basic method working first and will then gradually expand on it, but I've included everything so people can see the general idea I was thinking of.
# Split into input (X) and output (Y) variables
X = df.iloc[:,[0]].as_matrix() #as matrix due to this error: https://stackoverflow.com/questions/45479239/pandas-keyerror-not-in-index-when-training-a-keras-model
y = df.iloc[:,-1].as_matrix()
print(X[0:10])
print(y[0:10])
# Define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
kf = kfold.get_n_splits(X)
cvscores = []
for train, test in kfold.split(X, y):
    X_train, X_test = X[train], X[test]
    y_train, y_test = y[train], y[test]
    # Pre-process the data
    # X_train = sequence.pad_sequences(X[train], maxlen=30) #based on 30 aa being max we're interested in
    # X_test = sequence.pad_sequences(X[test], maxlen=30) #based on 30 aa being max we're interested in
    # Create model
    model = Sequential()
    # model.add(Embedding(3000, 32, input_length=30))
    # model.add(Bidirectional(LSTM(20, return_sequences=True), input_shape=(n_timesteps, 1)))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Monitor val accuracy and perform early stopping
    # es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
    # mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy', mode='max', verbose=1, save_best_only=True)
    # Fit the model
    model.fit(X_train, y_train, epochs=150, batch_size=10, verbose=0)
    # Evaluate the model
    # scores = model.evaluate(X[test], y[test], verbose=0)
    # print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    # cvscores.append(scores[1] * 100)
#print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
#output a three sigmoid model, and plot accuracy and loss
The output first prints the sequences, as expected (i.e. the print statement):
[['GCTAGATGACAGT']
['TTTTAAAACAG']
['TAGCTATACT']
['TGGGGCAAAAAAAA']
['AATGTCG']
['AATGTCG']
['AATGTCG']
['TTATATAAAAG']
['GCTGGGAG']
['TTTGCGTATAGATAGATAG']]
[0 1 2 0 3 0 1 2 2 0]
And then I get the error:
ValueError: could not convert string to float: 'XXX' (where XXX is one of the sequences in the data set, but not one of the ten printed above), and further up the traceback points to this line as the source of the error:
model.fit(X_train, y_train, epochs=150, batch_size=10, verbose=0)
I did see this question, but I don't think mine has the same root cause. Can someone explain why I'm getting this? I'm wondering whether it's because I haven't yet properly told the model that I want to calculate the probability of a sequence rather than predict a categorical feature?
As I can see from the print statement, you are feeding your NN with strings/text, and this is not possible. You have to encode them as numbers. Different approaches are available to carry out this operation: you can one-hot encode your characters, or you can create a trainable embedding for each character.
I suggest the Tokenizer from TF, which can help you with the numerical encoding of the text sequences.
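For example, a minimal sketch of a character-level encoding with that Tokenizer, assuming the data frame from the question is called df with 'Feature' and 'Label' columns and an arbitrary maximum length of 30:

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = df['Feature'].astype(str).tolist()

# char_level=True assigns an integer id to each character (A, C, G, T, ...)
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(seqs)

encoded = tokenizer.texts_to_sequences(seqs)  # one list of integer ids per sequence
X = pad_sequences(encoded, maxlen=30)         # pad/truncate to a fixed length
y = df['Label'].values

The resulting integer matrix X can then be fed into the (currently commented-out) Embedding layer instead of the raw strings.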
I have tried searching for similar questions, but I have not been able to find any so far. My problem is:
I want to do cross-validation on non-overlapping subsets of the data using KFold. What I did was create the subsets with KFold and fix the outcome by setting random_state to a certain integer. When I print out the subsets multiple times, the results look OK. However, the problem is that when I use the same subsets on a model multiple times with model.predict (meaning I run my code multiple times), I get different results. Naturally, I suspect there is something wrong with my implementation of the model training, but I cannot figure out what it is. I would very much appreciate a hint. Here is my code:
random.seed(42)
# define K-fold cross validation test harness
kf = KFold(n_splits=3, random_state=42, shuffle=True)
for train_index, test_index in kf.split(data):
    print('Train', train_index, '\nTest ', test_index)
    # select this fold's train/test subsets
    testX = data[test_index]
    trainX = data[train_index]
    testYcheck = labels[test_index]
    testP = Path[test_index]
    # convert the labels from integers to vectors
    trainY = to_categorical(labels[train_index], num_classes=2)
    testY = to_categorical(labels[test_index], num_classes=2)
    # construct the image generator for data augmentation
    aug = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
                             height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
                             horizontal_flip=True, fill_mode="nearest")
    # train the network
    print("[INFO] training network...")
    model.fit_generator(aug.flow(trainX, trainY, batch_size=BS),
                        validation_data=(testX, testY),
                        steps_per_epoch=len(trainX) // BS, epochs=EPOCHS, verbose=1)
    # predict the test data
    y_pred = model.predict(testX)
    predYl = []
    for element in range(len(y_pred)):
        if y_pred[element, 1] > y_pred[element, 0]:
            predYl.append(1)
        else:
            predYl.append(0)
    pred_Y = np.array(predYl)
    # Compute confusion matrix
    cnf_matrix = confusion_matrix(testYcheck, pred_Y)
    np.set_printoptions(precision=2)
    print(cnf_matrix)
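For what it's worth, a common source of this kind of run-to-run variation is that only Python's random module is seeded, while NumPy and TensorFlow (which drive weight initialization and shuffling) are not, and the same model object keeps accumulating training across folds. A minimal sketch of a more reproducible setup, assuming TF 2.x, reusing data, labels and EPOCHS from the code above, and using a hypothetical build_model() helper that returns a freshly built and compiled network:

import random
import numpy as np
import tensorflow as tf
from sklearn.model_selection import KFold
from tensorflow.keras.utils import to_categorical

# Seed every RNG involved, not just Python's random module
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)  # TF 2.x; controls weight initialization and data shuffling

kf = KFold(n_splits=3, random_state=42, shuffle=True)
for train_index, test_index in kf.split(data):
    # Rebuild the model each fold so training does not carry over between folds
    model = build_model()  # hypothetical helper returning a compiled Keras model
    model.fit(data[train_index], to_categorical(labels[train_index], num_classes=2),
              epochs=EPOCHS, verbose=0)
    y_pred = model.predict(data[test_index])

Even with all seeds fixed, GPU kernels can still introduce a little non-determinism, so results may not be bit-for-bit identical.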
I'm currently undertaking my first 'real' DL project of (surprise) predicting stock movements. I know the odds of making anything useful are about 1000:1, but I'm enjoying it and want to see it through; I've learnt more in my few weeks of attempting this than I have in the prior 6 months of completing MOOCs.
I'm building an LSTM using Keras to predict the next step forward, and I have attempted the task both as classification (up/down/steady) and now as a regression problem. Both hit a similar roadblock in that my validation loss never improves from epoch #1.
I can get the model to overfit so that the training loss approaches zero with MSE (or 100% accuracy if classification), but at no stage does the validation loss decrease. This screams overfitting to my untrained eye, so I added varying amounts of dropout, but all that does is stifle the model's learning/training accuracy and shows no improvement in the validation accuracy.
I have attempted to change a significant number of hyperparameters: learning rate, optimiser, batch size, lookback window, #layers, #units, dropout, #samples, etc. I've also tried with a subset of the data and a subset of the features, but I just can't get it to work, so I'm very thankful for any help.
Code Below (it's not pretty I know):
# Import saved full dataframe ~ 200 features
import feather
df = feather.read_dataframe('df_feathered')
df.set_index('time', inplace=True)
# Difference the dataset to make stationary
df = df.diff(periods=1, axis=0)
# MAKE LARGE SAMPLE FOR TESTING
df_train = df.loc['2017-3-1':'2017-6-30']
df_val = df.loc['2017-7-1':'2017-8-31']
df_test = df.loc['2017-9-1':'2017-9-30']
# Make x_train, x_val sets by dropping target variable
x_train = df_train.drop('close+1', axis=1)
x_val = df_val.drop('close+1', axis=1)
# Scale the training data first then fit the transform to the test set
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_val)
# scaler = MinMaxScaler(feature_range=(0,1))
# x_train = scaler.fit_transform(df_train1)
# x_test = scaler.transform(df_val1)
# Create y_train, y_test, simply target variable for regression
y_train = df_train['close+1']
y_test = df_val['close+1']
# Define Lookback window for LSTM input
sliding_window = 15
# Convert x_train, x_test, y_train, y_test into 3d arrays (samples, timesteps, features) for LSTM input
dataXtrain = []
for i in range(len(x_train)-sliding_window-1):
    a = x_train[i:(i+sliding_window), 0:(x_train.shape[1])]
    dataXtrain.append(a)

dataXtest = []
for i in range(len(x_test)-sliding_window-1):
    a = x_test[i:(i+sliding_window), 0:(x_test.shape[1])]
    dataXtest.append(a)

dataYtrain = []
for i in range(len(y_train)-sliding_window-1):
    dataYtrain.append(y_train[i + sliding_window])

dataYtest = []
for i in range(len(y_test)-sliding_window-1):
    dataYtest.append(y_test[i + sliding_window])
# Make data the divisible by a variety of batch_sizes for training
# Started at 1000 to not include replaced NaN values
dataXtrain = np.array(dataXtrain[1000:172008])
dataYtrain = np.array(dataYtrain[1000:172008])
dataXtest = np.array(dataXtest[1000:83944])
dataYtest = np.array(dataYtest[1000:83944])
# Checking input shapes
print('dataXtrain size is: {}'.format((dataXtrain).shape))
print('dataXtest size is: {}'.format((dataXtest).shape))
print('dataYtrain size is: {}'.format((dataYtrain).shape))
print('dataYtest size is: {}'.format((dataYtest).shape))
### ACTUAL LSTM MODEL
batch_size = 256
timesteps = dataXtrain.shape[1]
features = dataXtrain.shape[2]
# Model set-up, stacked 4 layer stateful LSTM
model = Sequential()
model.add(LSTM(512, return_sequences=True, stateful=True,
batch_input_shape=(batch_size, timesteps, features)))
model.add(LSTM(256,stateful=True, return_sequences=True))
model.add(LSTM(256,stateful=True, return_sequences=True))
model.add(LSTM(128,stateful=True))
model.add(Dense(1, activation='linear'))
model.summary()
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.9, patience=5, min_lr=0.000001, verbose=1)
def coeff_determination(y_true, y_pred):
    from keras import backend as K
    SS_res = K.sum(K.square(y_true - y_pred))
    SS_tot = K.sum(K.square(y_true - K.mean(y_true)))
    return 1 - SS_res / (SS_tot + K.epsilon())
model.compile(loss='mse',
optimizer='nadam',
metrics=[coeff_determination,'mse','mae','mape'])
history = model.fit(dataXtrain, dataYtrain,validation_data=(dataXtest, dataYtest),
epochs=100,batch_size=batch_size, shuffle=False, verbose=1, callbacks=[reduce_lr])
score = model.evaluate(dataXtest, dataYtest,batch_size=batch_size, verbose=1)
print(score)
predictions = model.predict(dataXtest, batch_size=batch_size)
print(predictions)
import matplotlib.pyplot as plt
%matplotlib inline
#plt.plot(history.history['mean_squared_error'])
#plt.plot(history.history['val_mean_squared_error'])
plt.plot(history.history['coeff_determination'])
plt.plot(history.history['val_coeff_determination'])
#plt.plot(history.history['mean_absolute_error'])
#plt.plot(history.history['mean_absolute_percentage_error'])
#plt.plot(history.history['val_mean_absolute_percentage_error'])
#plt.title("MSE")
plt.ylabel("R2")
plt.xlabel("epoch")
plt.legend(["train", "val"], loc="best")
plt.show()
plt.plot(history.history["loss"][5:])
plt.plot(history.history["val_loss"][5:])
plt.title("model loss")
plt.ylabel("loss")
plt.xlabel("epoch")
plt.legend(["train", "val"], loc="best")
plt.show()
plt.figure(figsize=(20,8))
plt.plot(dataYtest)
plt.plot(predictions)
plt.title("Prediction")
plt.ylabel("Price")
plt.xlabel("Time")
plt.legend(["Truth", "Prediction"], loc="best")
plt.show()
Maybe you should remember that you are predicting stock returns, which very likely cannot be predicted at all. So the increasing val_loss is not overfitting at all. Instead of adding more dropout, maybe you should think about adding more layers to increase the model's power.
Try reducing the learning rate a lot (and remove the dropout for now).
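For example, a minimal sketch of compiling with an explicitly lower learning rate (assuming tf.keras; the 1e-4 value is only a starting point to experiment with, and the argument is named lr in older Keras releases):

from tensorflow.keras.optimizers import Nadam

model.compile(loss='mse',
              optimizer=Nadam(learning_rate=1e-4),  # Nadam's default is 0.002
              metrics=['mse', 'mae'])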
Why do you use
shuffle=False
in the fit() function?