Run evaluation after part of a training epoch

I load two datasets with the Dataset API, one for training and one for evaluation. I switch between them by running the matching initializer, e.g. sess.run(train_init_op), before running the training or the evaluation.
Currently I run the evaluation after finishing one epoch, i.e. after the training dataset has been run through completely.
If I want to evaluate my network before the training dataset is exhausted, I have to switch earlier, and TensorFlow then forgets its position in the training dataset. Is there any way to remember the state of the training dataset iterator and switch back to that position after the evaluation has finished?

I think it is not only about remembering the position in the training set, but also about accumulated gradients, the parameters of the optimizer (if you use something like Adam), etc. Switching context between training and validation can be tricky.
For instance, in the Google Object Detection API there is a separate validation process that watches for fresh checkpoints and runs validation on them. Meanwhile, training keeps running. Thus, by setting the checkpoint interval you can achieve any validation frequency you want.
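As for the iterator position itself, the idea can be sketched in plain Python: if training and evaluation each have their own iterator, running an evaluation in the middle of an epoch does not reset the training iterator. This is only an illustration with generators, not TensorFlow code; the names batches, train_iter and eval_runs are made up for the example. In TF 1.x the analogous approach, as far as I know, is to give each dataset its own iterator (for example a feedable iterator built with tf.data.Iterator.from_string_handle) instead of re-initializing one shared iterator.

```python
def batches(data, batch_size):
    """Yield consecutive batches; the generator object remembers its position."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


train_data = list(range(10))
eval_data = [100, 101, 102, 103]

# One training iterator for the whole epoch; it is never re-created mid-epoch.
train_iter = batches(train_data, batch_size=2)

seen = []        # training samples consumed so far (stand-in for training steps)
eval_runs = []   # batches consumed by each evaluation pass

for step, batch in enumerate(train_iter, start=1):
    seen.extend(batch)
    if step % 2 == 0:
        # Evaluate every 2 training steps with a fresh, separate iterator.
        # train_iter is untouched, so training resumes where it left off.
        eval_runs.append(list(batches(eval_data, batch_size=2)))
```

After the loop, every training sample has been seen exactly once even though evaluation ran twice in the middle of the epoch.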

Related

Tensorflow stop and resume training

I am using Tensorflow to train my model. I am routinely saving my model every 10 epochs. I have a limited number of samples to train, so I am augmenting my dataset to make a larger training dataset.
If I need to use my saved model to resume training after a power outage would it be best to resume training using the same dataset or to make a new dataset?

Your question very much depends on how you're augmenting your dataset. If your augmentation skews the statistical distribution of the underlying dataset then you should resume training with the pre-power outage dataset. Otherwise, you're assuming that your augmentation has not changed the distribution of the dataset.
It is a fairly safe assumption to make (assuming your augmentations do not change the data in an extremely significant way) that you are safe to resume training on a new dataset or the old dataset without significant change in accuracy.

Keras + TensorFlow Model

I'm currently creating a model, and while creating it I came up with some questions. Does training the same model with the same data multiple times lead to better precision for those objects, since you're training it every time? And what could be the issue when an object sometimes gets 90% precision and, when I re-run it, gets lower precision or is not even predicted correctly? Is it because TensorFlow is running on the GPU?

I will guess that you are doing image recognition and that you want to identify images (objects) using a neural network made with Keras. You should train it once, but during training you will do several epochs, meaning the algorithm adapts the weights in several rounds (epochs). In each round it goes over all training images. Once trained, you can use the model to identify images/objects.
You can evaluate the accuracy of your trained model on the same training set, but it is better to use a different set (see train_test_split from sklearn, for instance).
Training is a stochastic process, meaning that every time you train your network the result will be different. Hence, you will get different accuracies. The stochasticity comes, for instance, from different initial weights or from using stochastic gradient descent methods.
The question does not appear to have anything to do with Keras or TensorFlow, but with a basic understanding of how neural networks work. There is no connection to running TensorFlow on the GPU. You will also not get better precision by training again on the same objects. If you train your model on a dataset for a very long time (many epochs), you might run into overfitting. Then the accuracy of your model on this training dataset will be very high, but the model will have low accuracy on other datasets.
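To make the point about random initial weights concrete, here is a tiny sketch in plain Python (no Keras involved). The function init_weights and the seed values are invented for the illustration: two runs with different seeds start from different weights, while fixing the seed reproduces the same starting point.

```python
import random


def init_weights(n, seed=None):
    """Draw n initial weights from a small Gaussian, as a stand-in for a layer initializer."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.1) for _ in range(n)]


run_a = init_weights(3, seed=1)
run_b = init_weights(3, seed=2)        # different seed: different starting point
run_a_again = init_weights(3, seed=1)  # same seed: identical starting point
```

Different starting points usually lead to slightly different trained models, which is one reason the accuracy varies between runs.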

A common technique is to split your data into training and validation datasets, then train your model using EarlyStopping. This trains on the training dataset, calculates the loss on the validation dataset after each epoch, and keeps training until no further improvement is seen. You can set a patience parameter to wait for X epochs without an improvement before training stops (and optionally save the best model).
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/
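The patience rule described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the Keras EarlyStopping implementation; the function name early_stopping_epoch and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for per-epoch validation losses.

    Training stops once `patience` epochs in a row fail to beat the best loss so far.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: remember it and reset the patience counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training ends at the last epoch.
    return len(val_losses) - 1, best_epoch


val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.64]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
```

With these numbers the best validation loss is at epoch 2, and training stops at epoch 5 after three epochs without improvement; the model saved at epoch 2 is the one you would keep.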
Another trick is image augmentation with ImageDataGenerator, which generates synthetic data for you (rotations, shifts, mirror images, brightness adjustments, noise, etc.). This effectively increases the amount of data you have to train with, which helps reduce overfitting.
https://machinelearningmastery.com/how-to-configure-image-data-augmentation-when-training-deep-learning-neural-networks/

what's the mechanism of `tf.estimator.train_and_evaluate` function to control training and evaluation period?

When I trained a SSD object detection model 20K steps using TensorFlow Object Detection API, I found that the training time varies:
It trained fast for the first 10 minutes, performing around 500 steps (i.e. 0.83 steps/second). Then it slowed down and took about 40~50 minutes to perform a single training step, evaluate the model on the evaluation dataset and save the checkpoint to disk. So I interrupted the training after a few steps and continued by restoring the training.
Every time, it trained fast for the first 10 minutes and then slowed down sharply, as the figures showed.
The model's training is implemented with TensorFlow's Estimator API, tf.estimator.train_and_evaluate().
Can anyone explain how it works? How the estimator controls the training and evaluation period? I do not want to evaluate the model for every step!

If you look at EvalSpec (not TrainSpec), there is an argument throttle_secs, which decides when evaluation is called: it sets the minimum time between the starts of two evaluations. Refer to this heated discussion, which has many details about Estimator methods! Tuning this argument is the way to control the train and eval cycle. In general, train_and_evaluate works by building a graph for the training operation and one for the evaluation operation. The training graph is created only once, but the evaluation graph is recreated every time you evaluate. This means that each evaluation loads the checkpoint that was created during training, which may be one reason why it takes so long! The InMemoryEvaluatorHook mentioned in that discussion may help you out, since it does not reload the checkpoint every time evaluation is called.
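A toy model of the throttling rule, in plain Python, may make this easier to picture. It assumes only what the EvalSpec documentation states: evaluation needs a new checkpoint, and it is not started again unless the last evaluation began at least throttle_secs seconds ago. It ignores start_delay_secs and everything else the Estimator does, and the function name evaluation_times is invented for the sketch.

```python
def evaluation_times(checkpoint_times, throttle_secs):
    """Return the checkpoint times (in seconds) at which an evaluation would start.

    Toy rule: evaluate at a new checkpoint only if at least throttle_secs
    have passed since the previous evaluation started.
    """
    started = []
    last_eval = None
    for t in checkpoint_times:
        if last_eval is None or t - last_eval >= throttle_secs:
            started.append(t)
            last_eval = t
    return started


# Checkpoints saved every 300 s for one hour, evaluation throttled to 1200 s.
checkpoints = list(range(300, 3601, 300))
evals = evaluation_times(checkpoints, throttle_secs=1200)
```

Here twelve checkpoints lead to only three evaluations (at 300 s, 1500 s and 2700 s). In practice the checkpoint interval and throttle_secs together determine how often evaluation runs, so both are worth checking when evaluation seems to happen after every step.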

Difference between training and testing accuracy+ Tensorflow tutorial

The code in this tensorflow tutorial uses this section of the code to calculate the validation accuracy right?
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": eval_data},
    y=eval_labels,
    num_epochs=1,
    shuffle=False)
eval_results = mnist_classifier.evaluate(input_fn=eval_input_fn)
print(eval_results)
Question: So if I wanted to calculate the training set accuracy, i.e. to see whether my model is overfitting my training data, and I changed the value of "x" to train_data and fed the training data for evaluation as well, would it give me the training set accuracy?
If not, how do I check whether my model is overfitting my dataset?
How does the number of steps affect the accuracy?
For example, if I have trained it for 20000 steps and then train it for another 100, why does the accuracy change? Is it because the weights are being calculated all over again? Would it be advisable to do something like this then?
mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook])

Normally you have three datasets: one for training, one for validation and one for testing. These datasets must not overlap; an image from the training set may not occur in the validation or test set, etc. You train with the training set and, after each epoch, validate the model with the validation data. The optimizer always tries to update the weights to classify the training data perfectly, so the training accuracy will get very high (>90%). The validation data is data the model has not been trained on. Validation is done after each epoch (or every x steps) to show how well the model handles data it hasn't seen before, which indicates how well it is generalizing as training progresses.
The more you train, the higher the training accuracy becomes, since the optimizer does its best to push that value to 100%. The validation accuracy (validation data does not update the weights) also increases over time, but not indefinitely. While the training accuracy keeps improving, the validation accuracy might stop improving. Once the validation accuracy starts to decrease over time, you're overfitting. This means that the model is focusing too much on the training data and can't classify another character correctly if it differs from the training set.
At the end of all the training you use a test set; this determines the actual accuracy of your model on new data.
#xmacz: I cannot add comments yet, only answers, so I am just updating my answer. Yes, I checked the source code; your first lines of code test the model on the test data.

evaluate is just a function that performs some computation on the input data and produces some output. If you give it the testing data it should report the testing accuracy, and if you give it the training data it should report the training accuracy.
At the end of the day it is just mathematics. What the output means is something you have to work out from what you fed in.
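A small plain-Python sketch of that idea: the same accuracy function is applied to two different datasets, and the gap between the two numbers is what hints at overfitting. The "model" here is a deliberately silly lookup table that memorizes the training set; all names and data are invented for the example and have nothing to do with the MNIST tutorial code.

```python
def accuracy(predict, inputs, labels):
    """Fraction of inputs for which predict(x) matches the label."""
    correct = sum(1 for x, y in zip(inputs, labels) if predict(x) == y)
    return correct / len(labels)


# Toy task: label is 1 for odd numbers, 0 for even numbers.
train_x, train_y = [1, 2, 3, 4], [1, 0, 1, 0]
eval_x, eval_y = [5, 6, 7, 8], [1, 0, 1, 0]

# A "model" that only memorizes the training pairs and answers 0 otherwise.
memory = dict(zip(train_x, train_y))


def memorizer(x):
    return memory.get(x, 0)


train_acc = accuracy(memorizer, train_x, train_y)  # same function, training data
eval_acc = accuracy(memorizer, eval_x, eval_y)     # same function, held-out data
```

The memorizer scores 1.0 on the data it was built from and only 0.5 on the held-out data, which is the pattern to look for when checking for overfitting.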

Whether your model is overfitting is something you check while training the model. You have to set apart another set, called the validation dataset, which is different from the test and training sets. A typical split of datasets is 70%-20%-10% for training, testing and validating respectively.
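A minimal sketch of such a 70/20/10 split in plain Python, assuming the samples fit in a list. The function name split_dataset, its parameters and the fixed seed are choices made for this example only.

```python
import random


def split_dataset(items, train_pct=70, test_pct=20, seed=0):
    """Shuffle items and split them into train, test and validation lists.

    Whatever is left after the train and test parts becomes the validation set.
    """
    items = list(items)
    random.Random(seed).shuffle(items)   # shuffle a copy, reproducibly
    n_train = len(items) * train_pct // 100
    n_test = len(items) * test_pct // 100
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]
    return train, test, validation


train, test, validation = split_dataset(range(100))
```

The three lists do not share any sample, which is the property that matters for the validation and test scores to be meaningful.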
During training, every n steps you test your model on the validation dataset. During the first iterations the score on your validation set will get better, but at some point it will get worse. You can use this information to stop training when your model starts to overfit, but doing it right is an art. You could, for instance, stop after 5 consecutive tests in which your accuracy has decreased, because sometimes it gets worse in one test but better again in the next. It's hard to say; it depends on many factors.
Regarding your second question, iterating another 100 steps could make your model better or worse, depending on whether it's overfitting or not, so I'm afraid that question doesn't have a clear answer. The weights will rarely stop changing, because the iterations/steps keep "moving" them, for better or for worse. Again, it's difficult to say how to get good results, but you could try early stopping using a validation set, as I've mentioned before.

What does train_on_batch() do in keras model?

I saw a sample of code (too big to paste here) where the author used model.train_on_batch(in, out) instead of model.fit(in, out). The official documentation of Keras says:
Single gradient update over one batch of samples.
But I don't get it. Is it the same as fit(), but instead of doing many feed-forward and backprop steps, it does it once? Or am I wrong?

Yes, train_on_batch trains using a single batch only and once.
While fit trains many batches for many epochs. (Each batch causes an update in weights).
The idea of using train_on_batch is probably to do more things yourself between each batch.
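The relationship can be pictured with a toy model in plain Python. This is not Keras code: TinyLinearModel is invented for the illustration, its fit does not shuffle, and it only borrows the two method names. The point is that fit is a loop over epochs and batches, and each pass through the loop body is one train_on_batch call, i.e. one gradient update.

```python
class TinyLinearModel:
    """Toy model y = w * x trained with plain SGD on mean squared error."""

    def __init__(self, lr=0.01):
        self.w = 0.0
        self.lr = lr
        self.updates = 0   # number of gradient updates performed so far

    def train_on_batch(self, xs, ys):
        """Do exactly one gradient update on one batch; return the loss before the update."""
        n = len(xs)
        loss = sum((self.w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        grad = sum(2 * (self.w * x - y) * x for x, y in zip(xs, ys)) / n
        self.w -= self.lr * grad
        self.updates += 1
        return loss

    def fit(self, xs, ys, batch_size, epochs):
        """Split the data into batches and call train_on_batch on each, once per epoch."""
        for _ in range(epochs):
            for i in range(0, len(xs), batch_size):
                self.train_on_batch(xs[i:i + batch_size], ys[i:i + batch_size])


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # true relation: y = 2 * x

model = TinyLinearModel(lr=0.05)
model.fit(xs, ys, batch_size=2, epochs=50)   # 2 batches per epoch, 100 updates in total
```

Calling train_on_batch yourself means you write that double loop by hand, which is what gives you room to do something else between two updates.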

It is used when we want to inspect things or make custom changes after each batch of training.
A more precise use case is GANs.
You have to update the discriminator, but while updating the combined GAN network you have to keep the discriminator untrainable. So you first train the discriminator and then train the GAN while keeping the discriminator untrainable.
See this for more details:
https://medium.com/datadriveninvestor/generative-adversarial-network-gan-using-keras-ce1c05cfdfd3

The fit method trains the model by passing through the data you gave it, once per epoch. Because of memory limitations (especially GPU memory), we can't train on a large number of samples at once, so the data has to be divided into small pieces called mini-batches (or just batches). The fit method of Keras models does this division for you and passes through all the data you gave it.
However, sometimes we need a more complicated training procedure. For example, we may want to randomly select new samples to put in the batch buffer each epoch (e.g. GAN training, Siamese CNN training ...). In these cases we don't use the fancy and simple fit method but use the train_on_batch method instead. To use this method, we generate a batch of inputs and a batch of outputs (labels) in each iteration and pass them to it; it trains the model on all the samples in the batch at once and returns the loss and other metrics calculated on the batch samples.
