So I have a GRU model that predicts output power. For the training data I have a CSV file with data from 2018, while my testing data is a different CSV file with data from 2019.
I just have two short questions.
Since I'm using two different CSV files, one for training and one for testing, I do not need train_test_split?
When it comes to model.fit, I really don't know the difference between validation_data and validation_split, and which one should I use?
I have tested these 3 lines separately; the 2nd and 3rd lines give me the exact same results, while the first gives me a much lower val_loss.
Thank you.
history=model.fit(X_train, y_train, batch_size=256, epochs=25, validation_split=0.1, verbose=1, callbacks=[TensorBoardColabCallback(tbc)])
history=model.fit(X_train, y_train, batch_size=256, epochs=25, validation_data=(X_test, y_test), verbose=1, callbacks=[TensorBoardColabCallback(tbc)])
history=model.fit(X_train, y_train, batch_size=256, epochs=25, validation_data=(X_test, y_test), validation_split=0.1, verbose=1, callbacks=[TensorBoardColabCallback(tbc)])
You can do what you want: yes, you can use one file to train and one to validate. But you could also merge them and then use train_test_split if you wish. However, since you have data from different periods of time, there may be differences between the two files, so I would recommend merging them.
Using validation_data means you are providing the training set and validation set yourself, whereas using validation_split means you only provide a training set and Keras splits it into a training set and a validation set (with the validation set being a validation_split fraction of the data you passed in).
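To make the difference concrete, validation_split=0.1 behaves roughly like slicing off the last 10% of the arrays yourself and passing that slice as validation_data. A minimal sketch, assuming X_train and y_train are NumPy arrays in their original (unshuffled) order:
# Keras takes the LAST samples when validation_split is used
n_val = int(len(X_train) * 0.1)
x_tr, x_val = X_train[:-n_val], X_train[-n_val:]
y_tr, y_val = y_train[:-n_val], y_train[-n_val:]
# roughly equivalent to model.fit(X_train, y_train, ..., validation_split=0.1)
history = model.fit(x_tr, y_tr, batch_size=256, epochs=25,
                    validation_data=(x_val, y_val), verbose=1)
That the first of your three lines validates on a slice of the 2018 training file, while the other two validate on the 2019 test file, may explain the much lower val_loss you observed.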
I am trying to understand the TensorFlow architecture and find a way to see how the calculations are made to obtain the results recorded each epoch in the history variable of the model.fit() function:
history = model.fit(X_train, y_train, epochs=150, batch_size=16, verbose=0, validation_split=0.3, steps_per_epoch=10)
and likewise in the model.evaluate() function, for the final test accuracy and loss obtained:
loss, acc = model.evaluate(X_test, y_test, verbose=0)
I have tried to debug the functions behind model.fit and model.evaluate, but I realized that there are so many nested functions that it is hard to understand how the calculations are made.
What should I do to visualize the calculations made in these two functions?
Not an ML expert, but the normal flow I follow to train a machine learning model is, after data cleaning, to split the dataset into train and test using scikit-learn's train_test_split:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X.values, y.values,
                                                    test_size=0.30,
                                                    random_state=0)
Skipping over the whole model-building process: when you go to train the model (fit it) after defining and compiling it, one way is to use the validation_split parameter like below:
history = model.fit(X_train, y_train, epochs=10,validation_split=0.2)
This again splits off 20% of the training data points to validate the model during training. If you had, say, 1000 data points (rows) in a dataset, the first snippet above would give
700 data points for training and 300 for testing,
and the second would again set aside 20% of those 700 for validation, leaving
560 data points for training and 140 for validation,
leaving us with less data to train our model with.
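As a quick sanity check of that arithmetic (a sketch, nothing model-specific):
# nested splits: 30% test, then 20% of the remainder for validation
n = 1000
n_test = int(n * 0.30)        # 300 rows held out by train_test_split
n_train = n - n_test          # 700 rows passed to model.fit
n_val = int(n_train * 0.20)   # 140 rows set aside by validation_split=0.2
n_fit = n_train - n_val       # 560 rows actually trained on
print(n_fit, n_val, n_test)   # 560 140 300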
I recently encountered an approach, though, where you use the test data for validation, like below:
history = model.fit(x_train, y_train, validation_data=(x_test, y_test))
My question is: what actually happens to the validation data after training completes? Is it automatically added back to somehow train our model, improving our accuracy at the end? Also, is using the test data for validation sound, and if we do that, will it have an effect when we try to evaluate our model using the same test data?
As stated by the Keras documentation (the last model.fit call with validation_data comes from Keras):
validation_split: Float between 0 and 1. Fraction of the training data to be used as validation data. The model will set apart this fraction of the training data, will not train on it, and will evaluate the loss and any model metrics on this data at the end of each epoch. The validation data is selected from the last samples in the x and y data provided, before shuffling.

validation_data: Data on which to evaluate the loss and any model metrics at the end of each epoch. The model will not be trained on this data. This could be a list (x_val, y_val) or a list (x_val, y_val, val_sample_weights).

validation_data will override validation_split.
So with this setup, the model will not train on the validation data.
In theory, validation data is used to evaluate your model and tune its hyperparameters.
If you need to put this model in production, you should retrain the model with all the data, knowing that the performance would be close to what you obtained on the validation/test data.
In any case, the performance on validation/test data is often an optimistic estimate of the true performance.
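A minimal sketch of that retraining step, assuming x_train/x_val and y_train/y_val are NumPy arrays and build_model is a hypothetical function that rebuilds the same (already tuned) architecture:
import numpy as np

# once hyperparameters are fixed, refit on all available data
x_all = np.concatenate([x_train, x_val])
y_all = np.concatenate([y_train, y_val])

production_model = build_model()  # hypothetical: recreate the tuned architecture
production_model.fit(x_all, y_all, epochs=10, batch_size=32, verbose=1)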
I have an LSTM model that was trained on a multi-feature daily dataset and predicts the target feature's value one day into the future.
How should I retrain the model each day as new data becomes available? Should I rerun model.fit with the full dataset (which gets updated each day), like in the example below?
model.fit(x_train, y_train, epochs=50, batch_size=20,
validation_data=(x_test, y_test), verbose=2, shuffle=False)
Or can I call model.fit with only the newly available data?
# run once at the beginning
model.fit(x_train, y_train, epochs=50, batch_size=20,
          validation_data=(x_test, y_test), verbose=2, shuffle=False)
# run every day as new data becomes available
model.fit(x_yesterday, y_yesterday)
Assuming you are using Keras, I would use train_on_batch. See this previous question for the answer: What is the use of train_on_batch() in keras?
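A sketch of the daily update with train_on_batch, assuming x_yesterday and y_yesterday hold the samples that arrived since the last update:
# incremental update on yesterday's data only; the existing weights are reused
for _ in range(5):  # a few passes over the small new batch
    loss = model.train_on_batch(x_yesterday, y_yesterday)
print('loss after daily update:', loss)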
I split my data into training and test samples (70/30) for a regression/forecasting problem (MLP, LSTM, etc.).
Within the code:
history = model.fit(X_train, y_train, epochs=100, batch_size=32,
validation_data=(X_test, y_test), verbose=0, shuffle=False)
I used my test data as the validation set and did a couple of weeks' worth of predictions. So I did not hold back the test data...
But now that I think about it, I guess it was wrong to put the test data into the fit function, or was it OK?
NEVER EVER use your test data as part of training or validation! The test set should only be used for inference after training. So yes, it is wrong to use your test data in the fit function; it should only appear in model.evaluate(x_test, y_test) or model.predict(x_test).
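One common pattern (a sketch, assuming plain NumPy arrays X and y) is to split twice, so that a separate validation set goes into fit and the test set stays untouched until the very end:
from sklearn.model_selection import train_test_split

# 70/30 into train+val and test, then 80/20 into train and validation
x_tmp, x_test, y_tmp, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
x_train, x_val, y_train, y_val = train_test_split(x_tmp, y_tmp, test_size=0.20, random_state=0)

history = model.fit(x_train, y_train, epochs=100, batch_size=32,
                    validation_data=(x_val, y_val), verbose=0)
loss = model.evaluate(x_test, y_test, verbose=0)  # test data touched only here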
I'm training my data in batches using train_on_batch, but it seems train_on_batch doesn't have an option to use callbacks, which seem to be a requirement for using checkpoints.
I can't use model.fit, as that seems to require I load all of my data into memory.
model.fit_generator is giving me strange problems (like hanging at the end of an epoch).
Here is the example from Keras API docs showing the use of ModelCheckpoint:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.callbacks import ModelCheckpoint

model = Sequential()
model.add(Dense(10, input_dim=784, kernel_initializer='uniform'))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# save the best weights (lowest validation loss) seen so far
checkpointer = ModelCheckpoint(filepath='/tmp/weights.hdf5', verbose=1,
                               save_best_only=True)
model.fit(x_train, y_train, batch_size=128, epochs=20, verbose=0,
          validation_data=(x_test, y_test), callbacks=[checkpointer])
If you train on each batch manually, you can do whatever you want at any epoch (or batch). There is no need to use a callback; just call model.save or model.save_weights yourself.
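A sketch of that manual checkpointing, where batch_generator is a hypothetical function yielding batches from disk and x_val/y_val are held-out arrays:
best_val = float('inf')
for epoch in range(20):
    for x_batch, y_batch in batch_generator():  # hypothetical: stream batches from disk
        model.train_on_batch(x_batch, y_batch)
    val_loss = model.evaluate(x_val, y_val, verbose=0)
    if val_loss < best_val:  # mimics ModelCheckpoint(save_best_only=True)
        best_val = val_loss
        model.save_weights('/tmp/weights.hdf5')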