I use Keras to train an LSTM. The input sequences have different lengths. Let's say the sequences have lengths between 1 and num_seq. Therefore, I group the sequences by length in each epoch so that I can use a batch size > 1:
for epoch in range(nb_epochs):
    for i in range(1, num_seq):
        X, y = get_sequences(length=i)
        model.fit(X, y, batch_size=100, epochs=1, validation_split=0.1, callbacks=None)
Because I use a custom loop over the epochs, callbacks that rely on epoch information (e.g. TensorBoard, History) do not work properly. What would be a way around this problem? Is there a way to tell the fit function which epoch it is currently on?
When you manipulate your training data during training, you should either call model.train_on_batch incrementally or, better yet, use fit_generator, which lets you define a Python generator that produces (x, y) tuples for each batch. This also takes care of invoking the callbacks properly.
For example:
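def train_gen():
    while True:  # Keras expects the generator to loop indefinitely
        for i in range(1, num_seq):
            X, y = get_sequences(length=i)
            yield X, y

# Pass the generator object (call train_gen), not the function itself.
# range(1, num_seq) yields num_seq - 1 batches per pass over the data.
model.fit_generator(train_gen(), steps_per_epoch=num_seq - 1)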
The downside of this is that you have to do the batching yourself and also have to supply the validation split yourself which you can do with a generator as well (therefore you can reuse most of the code).
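The manual batching and validation split mentioned above could look roughly like the sketch below. It is written in plain Python so it runs without Keras. The helper names (bucket_by_length, split_buckets, batch_gen, steps_per_epoch), the toy data and the 10% split are assumptions made for illustration. In a real setup the yielded lists would be converted to arrays before being handed to the model.

```python
import random
from collections import defaultdict


def bucket_by_length(sequences, labels):
    """Group (sequence, label) pairs so every bucket holds one length only."""
    buckets = defaultdict(list)
    for seq, label in zip(sequences, labels):
        buckets[len(seq)].append((seq, label))
    return dict(buckets)


def split_buckets(buckets, val_fraction=0.1, seed=0):
    """Hold out a fraction of every bucket for validation."""
    rng = random.Random(seed)
    train, val = {}, {}
    for length, pairs in buckets.items():
        pairs = list(pairs)
        rng.shuffle(pairs)
        n_val = int(len(pairs) * val_fraction)
        val[length] = pairs[:n_val]
        train[length] = pairs[n_val:]
    return train, val


def batch_gen(buckets, batch_size):
    """Yield (X, y) batches forever; each batch holds a single length.

    Assumes at least one bucket is non-empty, otherwise it never yields.
    """
    while True:
        for length in sorted(buckets):
            pairs = buckets[length]
            for start in range(0, len(pairs), batch_size):
                chunk = pairs[start:start + batch_size]
                X = [seq for seq, _ in chunk]
                y = [label for _, label in chunk]
                yield X, y


def steps_per_epoch(buckets, batch_size):
    """Number of batches batch_gen yields before it starts over."""
    return sum(-(-len(pairs) // batch_size) for pairs in buckets.values())


# Toy data: 20 sequences for each length 1..3, label = sequence length.
sequences = [[0.0] * n for n in (1, 2, 3) for _ in range(20)]
labels = [len(s) for s in sequences]
train_buckets, val_buckets = split_buckets(bucket_by_length(sequences, labels))
train_batches = batch_gen(train_buckets, batch_size=8)
```

With this layout, batch_gen(train_buckets, ...) would be the training generator and batch_gen(val_buckets, ...) the validation generator, each paired with its own step count from steps_per_epoch.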
I came across this notebook that covers forecasting. I got it through this article.
I am confused about the 2nd and 4th line from below
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_data = train_data.cache().shuffle(buffer_size).batch(batch_size).repeat()
val_data = tf.data.Dataset.from_tensor_slices((x_vali, y_vali))
val_data = val_data.batch(batch_size).repeat()
I understand that we shuffle the data because we don't want to feed it to the model in serial order. On additional reading I realized that it is better to set buffer_size to the size of the dataset. But I am not sure what repeat is doing in this case. Could someone explain what is being done here and what the function of repeat is?
I also looked at this page and saw below text but still not clear.
The following methods in tf.Dataset :
repeat( count=0 ) The method repeats the dataset count number of times.
shuffle( buffer_size, seed=None, reshuffle_each_iteration=None) The method shuffles the samples in the dataset. The buffer_size is the number of samples which are randomized and returned as tf.Dataset.
batch(batch_size,drop_remainder=False) Creates batches of the dataset with batch size given as batch_size which is also the length of the batches.
The repeat call with nothing passed to the count param makes this dataset repeat infinitely.
In Python terms, Datasets are a subclass of Python iterables. If you have an object ds of type tf.data.Dataset, then you can execute iter(ds). If the dataset was generated by repeat(), then it will never run out of items, i.e., it will never raise a StopIteration exception.
In the notebook you referenced, the call to tf.keras.Model.fit() is passed an argument of 100 to the param steps_per_epoch. This means that the dataset should be infinitely repeating, and Keras will pause training to run validation every 100 steps.
tldr: leave it in.
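The interplay between repeat() and steps_per_epoch can be imitated with plain Python iterators. This is only an analogy and not tf.data itself: itertools.cycle stands in for repeat() with no count, and the hypothetical run_epochs helper stands in for the way fit() pulls steps_per_epoch batches per epoch from one iterator. How Keras itself reports running out of data depends on the version.

```python
import itertools


def run_epochs(batches, epochs, steps_per_epoch):
    """Pull steps_per_epoch batches per epoch from a single iterator."""
    it = iter(batches)
    history = []
    for _ in range(epochs):
        epoch_batches = []
        for _ in range(steps_per_epoch):
            # next() raises StopIteration once a finite iterator is exhausted.
            epoch_batches.append(next(it))
        history.append(epoch_batches)
    return history


batches = ["b0", "b1", "b2"]  # a finite "dataset" of three batches

# Without repetition the data runs out during the second epoch.
try:
    run_epochs(batches, epochs=2, steps_per_epoch=2)
    ran_out = False
except StopIteration:
    ran_out = True

# cycle() never raises StopIteration, like a dataset built with repeat().
repeated = run_epochs(itertools.cycle(batches), epochs=2, steps_per_epoch=2)
print(ran_out)   # True
print(repeated)  # [['b0', 'b1'], ['b2', 'b0']]
```

Note that the second epoch starts where the first one stopped, so with a repeating source an "epoch" is just a fixed number of steps rather than one full pass over the data.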
https://github.com/tensorflow/tensorflow/blob/3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/python/data/ops/dataset_ops.py#L134-L3445
https://docs.python.org/3/library/exceptions.html
I'm working on a Keras model with images separated into patches.
I have a quite peculiar pipeline:
for i in range(n_iteration):
    print("Epoch:", i, "/", n_iteration)
    start = time.time()
    self.train_batch, self.validation_batch = self.get_batch()
    end = time.time()
    print("Time for loading: ", end - start)
    K.set_value(self.batch_source, self.train_batch[0][:self.batch_size])
    K.set_value(self.batch_target, self.train_batch[0][self.batch_size:])
    pred = self.model.predict(self.train_batch[0])
    K.set_value(self.gamma, self.compute_gamma(pred))
    hist = self.model.train_on_batch(self.train_batch[0], self.train_batch[1])
Based on my model's prediction at time t (for a given batch), I need to compute a value named gamma. This value is then taken into account in my loss function, but it is not differentiable, so I cannot integrate its computation into the loss function.
When measuring the necessary time for loading and training, it appears that the bottleneck is in the loading phase.
My question is: is it possible to load several batches (the function self.get_batch()) while computing the prediction and gamma, and training on another batch?
I guess the idea would be to create some kind of queue in which I store my batches, but I don't really know how to do that.
PS: in my get_batch function I'm accessing an hdf5 file. Can that cause any trouble with multiprocessing?
Thank you in advance.
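One possible shape for the queue idea mentioned above, sketched with only the standard library: a background thread keeps calling a loader and puts the results into a bounded queue.Queue while the main thread does the prediction, gamma computation and training. BatchPrefetcher, load_fn and the toy loader are hypothetical names. Whether this actually hides the loading time depends on the loader spending its time in I/O, and its interaction with HDF5 access has not been checked here.

```python
import queue
import threading


class BatchPrefetcher:
    """Load batches in a background thread while the caller is busy."""

    def __init__(self, load_fn, n_batches, max_prefetch=4):
        self._load_fn = load_fn
        self._n_batches = n_batches
        # Bounded queue: the loader blocks once max_prefetch batches wait.
        self._queue = queue.Queue(maxsize=max_prefetch)
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self):
        # Note: if load_fn raises, no sentinel is sent and the consumer
        # would wait forever; a fuller version should forward the error.
        for _ in range(self._n_batches):
            self._queue.put(self._load_fn())
        self._queue.put(None)  # sentinel: nothing more to load

    def __iter__(self):
        while True:
            batch = self._queue.get()  # blocks until a batch is ready
            if batch is None:
                return
            yield batch


counter = iter(range(1000))


def toy_get_batch():
    # Stand-in for self.get_batch(); returns (train_batch, validation_batch).
    i = next(counter)
    return ("train-%d" % i, "val-%d" % i)


loaded = list(BatchPrefetcher(toy_get_batch, n_batches=5))
```

In the loop from the question, the `for i in range(n_iteration)` header would become an iteration over the prefetcher, so that the next call to the loader already runs while the current batch is being trained on.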
Suppose you are training a custom tf.estimator.Estimator with tf.estimator.train_and_evaluate using a validation dataset in a setup similar to that of @simlmx's:
classifier = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=model_dir,
    params=params)
train_spec = tf.estimator.TrainSpec(
    input_fn=training_data_input_fn,
)
eval_spec = tf.estimator.EvalSpec(
    input_fn=validation_data_input_fn,
)
tf.estimator.train_and_evaluate(
    classifier,
    train_spec,
    eval_spec
)
Often, one uses a validation dataset to cut off training to prevent over-fitting when the loss continues to improve for the training dataset but not for the validation dataset.
Currently the tf.estimator.EvalSpec allows one to specify after how many steps (defaults to 100) to evaluate the model.
How can one (if possible without using tf.contrib functions) terminate training after n evaluation calls (n * steps) in which the evaluation loss does not improve, and then save the "best" model / checkpoint (as determined by the validation dataset) to a unique file name (e.g. best_validation.checkpoint)?
I understand your confusion now. The documentation for stop_if_no_decrease_hook states (emphasis mine):
max_steps_without_decrease: int, maximum number of training steps with no decrease in the given metric.
eval_dir: If set, directory containing summary files with eval metrics. By default, estimator.eval_dir() will be used.
Looking through the code of the hook (version 1.11), though, you find:
def stop_if_no_metric_improvement_fn():
    """Returns `True` if metric does not improve within max steps."""
    eval_results = read_eval_metrics(eval_dir)  #<<<<<<<<<<<<<<<<<<<<<<<
    best_val = None
    best_val_step = None
    for step, metrics in eval_results.items():  #<<<<<<<<<<<<<<<<<<<<<<<
        if step < min_steps:
            continue
        val = metrics[metric_name]
        if best_val is None or is_lhs_better(val, best_val):
            best_val = val
            best_val_step = step
        if step - best_val_step >= max_steps_without_improvement:  #<<<<<
            tf_logging.info(
                'No %s in metric "%s" for %s steps, which is greater than or equal '
                'to max steps (%s) configured for early stopping.',
                increase_or_decrease, metric_name, step - best_val_step,
                max_steps_without_improvement)
            return True
    return False
What the code does is load the evaluation results (produced with your EvalSpec parameters) and extract the eval results and the global_step (or whichever other custom step you use to count) associated with the specific evaluation record.
This is the source of the training steps part of the docs: the early stopping is not triggered according to the number of non-improving evaluations, but to the number of non-improving evals in a certain step range (which IMHO is a bit counter-intuitive).
So, to recap: Yes, the early-stopping hook uses the evaluation results to decide when it's time to cut the training, but you need to pass in the number of training steps you want to monitor and keep in mind how many evaluations will happen in that number of steps.
Examples with numbers to hopefully clarify more
Let's assume you're training indefinitely, with an evaluation every 1k steps. The specifics of how the evaluation runs are not relevant, as long as it runs every 1k steps and produces the metric we want to monitor.
If you set the hook as hook = tf.contrib.estimator.stop_if_no_decrease_hook(my_estimator, 'my_metric_to_monitor', 10000) the hook will consider the evaluations happening in a range of 10k steps.
Since you're running 1 eval every 1k steps, this boils down to early-stopping if there's a sequence of 10 consecutive evals without any improvement.
If then you decide to rerun with evals every 2k steps, the hook will only consider a sequence of 5 consecutive evals without improvement.
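The arithmetic above can be checked with a small stand-alone re-implementation of the loop quoted earlier. This is a sketch and not the TensorFlow hook: should_stop and evals_until_stop are hypothetical helpers, the metric is assumed to be a loss (lower is better), and min_steps is taken to be 0.

```python
def should_stop(eval_results, max_steps_without_decrease, min_steps=0):
    """Return True if the metric failed to decrease within the step window.

    eval_results maps global step -> metric value, one entry per evaluation.
    """
    best_val = None
    best_val_step = None
    for step in sorted(eval_results):
        if step < min_steps:
            continue
        val = eval_results[step]
        if best_val is None or val < best_val:
            best_val = val
            best_val_step = step
        if step - best_val_step >= max_steps_without_decrease:
            return True
    return False


def evals_until_stop(eval_every, max_steps_without_decrease):
    """Count the non-improving evals needed to trigger the stop."""
    results = {eval_every: 1.0}  # the first eval sets the best value
    n_bad = 0
    while not should_stop(results, max_steps_without_decrease):
        n_bad += 1
        # Each later eval is slightly worse than the first one.
        results[eval_every * (n_bad + 1)] = 1.0 + 0.01 * n_bad
    return n_bad


print(evals_until_stop(1000, 10000))  # evals every 1k steps -> 10
print(evals_until_stop(2000, 10000))  # evals every 2k steps -> 5
```

The same 10k-step window therefore tolerates a different number of bad evaluations depending on how often evaluation runs, which is why the step budget has to be chosen together with the evaluation frequency.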
Keeping the best model
First of all, an important note: this has nothing to do with early stopping. Keeping a copy of the best model throughout training and stopping the training once performance starts degrading are two completely unrelated issues.
Keeping the best model can be done very easily defining a tf.estimator.BestExporter in your EvalSpec (snippet taken from the link):
serving_input_receiver_fn = ...  # define your serving_input_receiver_fn
exporter = tf.estimator.BestExporter(
    name="best_exporter",
    serving_input_receiver_fn=serving_input_receiver_fn,
    exports_to_keep=5)  # this will keep the 5 best exported models
# train_and_evaluate expects a single EvalSpec, not a list
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,
    exporters=exporter,
    start_delay_secs=0,
    throttle_secs=5)
If you don't know how to define the serving_input_receiver_fn, have a look here.
This allows you to keep the overall best 5 models you obtained, stored as SavedModels (which is the preferred way to store models at the moment).
I know this question has been asked more than once, but I couldn't understand the code or the logic behind it.
For my data set, I first created a sigmoid layer, then connected it to the output layer, which uses the softmax function.
fl = tf.layers.dense(x, 10,activation=tf.sigmoid)
output = tf.layers.dense(fl, 2,activation=tf.nn.softmax)
I've created loss and accuracy, initialized variables, set optimizer and train variables, then I start running on my data:
loss = tf.losses.softmax_cross_entropy(onehot_labels=y,logits=output)
accuracy = tf.metrics.accuracy(tf.argmax(y_train,1),tf.argmax(output,1))
# inits
init_local = tf.local_variables_initializer()
init_global = tf.global_variables_initializer()
sess.run(init_global)
sess.run(init_local)
optimizer = tf.train.GradientDescentOptimizer(rate)
train = optimizer.minimize(loss)
for i in range(1000):
    _, lv = sess.run((train, loss))
    if i % 5 == 0:
        print("L: " + str(lv))
print("Accuracy: " + str(sess.run(accuracy)))
I can see that my loss value decreases every time I run on the training set. And my accuracy is ~0.93.
The problem is, from now on, I don't know how to test this model with real data.
Also, how can I draw a histogram of my real data? I have correct labels for my real data as well.
I will assume that you use Dataset to feed your training data and you want to run on test data immediately after training (since you don't have checkpoints in your code).
When using Dataset, you would create an iterator and call get_next() on it. Then, you would use the return values of get_next() as inputs to your model.
To run your model on the test data, you can use two high-level approaches:
If your test data has the same format as your training data, create a dataset that reads your test data. Then, create another copy (sometimes called a "tower") of your model (the operations will be new but the variables will be shared) that uses the test Dataset. Then, use sess.run() similarly to how you use it for training. You might not need to compute loss or train, only accuracy.
If your test data has a different format, you can feed it directly by using the feed_dict argument to sess.run(). You would feed your test data as values for the tensors returned from get_next(). Usually, one feeds placeholders, but TensorFlow allows you to feed any tensor.
As for histograms, Tensorboard has a nice way of visualizing them: https://www.tensorflow.org/programmers_guide/tensorboard_histograms.
I have a pipeline to read train and validation datasets from tfrecords.
I build batches using tf.train.batch. During training I want to switch between training and evaluation on validation dataset.
Here is a simplified snippet of how I implement it now.
is_training_pl = tf.placeholder(tf.bool)
images_train, labels_train = tf.train.batch([img_train, label_train])
images_val, labels_val = tf.train.batch([img_val, label_val])
data = tf.cond(is_training_pl, lambda: [images_train, labels_train], lambda: [images_val, labels_val])
loss = my_model(input=data)
I know that one can do it with tf.cond, but the problem with it is that both train and val batch ops would be executed when tf.cond is called.
On GitHub, ebrevdo said (link to the comment) that it's possible to use tf.train.maybe_batch for this purpose instead, which is more efficient.
Can anyone give an example of how to use tf.train.maybe_batch in my case, please?