How to get a callback when the specified epoch number is over? - python

I want to fine-tune my model in Keras: when training reaches epoch 10, I want to switch to different training data and a different learning rate. So how can I get a callback when the specified epoch number is over?

You need to write your own Callback subclass.
https://keras.io/callbacks/ (general information)
https://github.com/keras-team/keras/blob/master/keras/callbacks.py#L275 (source code for the Callback base class)
Your Callback subclass should define an on_epoch_end() method, which accepts the epoch number as an argument.
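For example, a minimal sketch of such a subclass (model, x_train and y_train are placeholder names, and the data/learning-rate switch itself is left for you to fill in):

import keras

class EpochSwitchCallback(keras.callbacks.Callback):
    """Hypothetical example: react when a given epoch finishes."""
    def __init__(self, switch_epoch=10):
        super(EpochSwitchCallback, self).__init__()
        self.switch_epoch = switch_epoch

    def on_epoch_end(self, epoch, logs=None):
        # `epoch` is 0-based, so epoch 9 marks the end of the 10th epoch.
        if epoch + 1 == self.switch_epoch:
            print("Epoch %d finished - change training data / learning rate here" % self.switch_epoch)

model.fit(x_train, y_train, epochs=20, callbacks=[EpochSwitchCallback(10)])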

Actually, given the way Keras works, this is probably not the best way to go. It would be much better to treat this as fine-tuning: finish the 10 epochs, save the model, then load it (from another script) and continue training with whatever learning rate and data you fancy (the save-and-resume workflow is sketched after the list below).
There are several reasons for this.
It is much clearer and easier to debug. You check your model properly after the 10 epochs, verify that it works properly, and carry on.
It is much better to do several experiments this way, starting from epoch 10.
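For reference, a minimal sketch of that save-then-resume workflow (the model, data, optimizer and file names are all illustrative, not taken from the question):

# First script: train the initial 10 epochs and save the result.
model.fit(x_train, y_train, epochs=10)
model.save('stage1.h5')

# Second script: load, recompile with the new learning rate, continue on new data.
from keras.models import load_model
from keras.optimizers import Adam

model = load_model('stage1.h5')
model.compile(optimizer=Adam(lr=1e-4), loss='mse')  # reuse whatever loss you trained with
model.fit(x_new, y_new, epochs=10)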
Good luck!

Related

How to break the Word2vec training from a callback function?

I am training a skipgram model using gensim word2vec. I would like to stop the training before reaching the number of epochs passed in the parameters, based on a specific accuracy test on a separate data set, in order to avoid overfitting the model.
Is there a way in gensim to interrupt the train of word2vec from a callback function?
If in fact more training makes your Word2Vec model worse on some external evaluation, there is likely something else wrong with your setup. (For example, many many online code examples that call train() multiple times in a loop mismanage the learning-rate alpha such that it actually goes negative, which would mean each training-example results in anti-corrections to the model via backpropagation.)
If instead the main problem is truly overfitting, a better solution than conditional early-stopping would probably be adjusting other parameters, such as the model size, so that it can't overshoot useful generalization no matter how many training passes are made.
But if you really want to try the less-good approach of early stopping, you could potentially raise a catchable exception in your callback, and catch it outside train() to allow your other code to continue with the results of the aborted training. For example...
A custom exception...
class OverfitException(Exception):
    pass
...then in your callback...
raise OverfitException()
...and around training...
try:
    model.train(...)
except OverfitException:
    print("training cut short")
    # ... & your code with partially-trained model continues
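Putting the pieces together, here is a hedged sketch of what such a callback might look like, reusing the OverfitException defined above and gensim's CallbackAny2Vec base class (the evaluation function and threshold are illustrative placeholders):

from gensim.models.callbacks import CallbackAny2Vec

class EarlyStopCallback(CallbackAny2Vec):
    """Hypothetical callback: abort training when an external check fails."""
    def __init__(self, evaluate_fn, min_score):
        self.evaluate_fn = evaluate_fn  # your own accuracy test on held-out data
        self.min_score = min_score

    def on_epoch_end(self, model):
        if self.evaluate_fn(model) < self.min_score:
            raise OverfitException()

The callback instance is then passed via the callbacks argument of train(), and the try/except shown above catches the exception.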
But again, this is not the best way to deal with overfitting or other cases where more training is seeming to hurt evaluation-scores.

Why does more epochs make my model worse?

Most of my code is based on this article and the issue I'm asking about is evident there, but also in my own testing. It is a sequential model with LSTM layers.
Here is a plotted prediction over real data from a model that was trained with around 20 small data sets for one epoch.
Here is another plot but this time with a model trained on more data for 10 epochs.
What causes this, and how can I fix it? Also, the article I linked above shows the same result at the bottom: 1 epoch does great and 3500 epochs are terrible.
Furthermore, when I run a training session for the higher data count but with only 1 epoch, I get identical results to the second plot.
What could be causing this issue?
A few questions:
Is this graph for training data or validation data?
Do you consider it better because:
The graph seems cool?
You actually have a better "loss" value?
If so, was it training loss?
Or validation loss?
Cool graph
The early graph seems interesting, indeed, but take a close look at it:
I clearly see huge predicted valleys where the expected data should be a peak
Is this really better? It looks like a random wave that is completely out of phase, meaning that a straight line would actually represent a better loss than this.
Take a look at the "training loss"; that is what can really tell you whether your model is better or not.
If this is the case and your model isn't reaching the desired output, then you should probably make a more capable model (more layers, more units, a different method, etc.). But be aware that many datasets are simply too random to be learned, no matter how good the model.
Overfitting - Training loss gets better, but validation loss gets worse
Suppose you actually do have a better training loss. OK, so your model is indeed getting better.
Are you plotting training data? - Then this straight line is actually better than a wave out of phase
Are you plotting validation data?
What is happening with the validation loss? Better or worse?
If your "validation" loss is getting worse, your model is overfitting. It's memorizing the training data instead of learning generally. You need a less capable model, or a lot of "dropout".
Often, there is an optimal point where the validation loss stops going down, while the training loss keeps going down. This is the point to stop training if you're overfitting. Read about the EarlyStopping callback in keras documentation.
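For example, a minimal sketch (the data names are placeholders, and restore_best_weights needs a reasonably recent Keras version):

from keras.callbacks import EarlyStopping

# Stop when val_loss has not improved for 5 consecutive epochs,
# and roll back to the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100,
          callbacks=[early_stop])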
Bad learning rate - Training loss is going up indefinitely
If your training loss is going up, then you've got a real problem: either a bug, a badly prepared calculation somewhere (if you're using custom layers), or simply a learning rate that is too big.
Reduce the learning rate (divide it by 10, or 100), create and compile a "new" model and restart training.
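As an illustrative sketch (build_model is a hypothetical function that creates your architecture; the optimizer, loss and data names are placeholders):

from keras.optimizers import Adam

model = build_model()                                   # fresh, untrained model
model.compile(optimizer=Adam(lr=0.0001), loss='mse')    # learning rate divided by 10 (or 100)
model.fit(x_train, y_train, epochs=50)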
Another problem?
Then you need to detail your question properly.

TensorFlow train batches for multiple epochs?

I don't understand how to run the result of tf.train.batch for multiple epochs. It runs out once of course and I don't know how to restart it.
Maybe I can repeat it using tile, which is complicated but described in full here.
If I can redraw a batch each time that would be fine -- I would need batch_size random integers between 0 and num_examples. (My examples all sit in local RAM). I haven't found an easy way to get these random draws at once.
Ideally there would be a reshuffle too when the batch is repeated, but it makes more sense to me to run an epoch and then reshuffle, etc., instead of tiling the training set num_epochs times and then shuffling.
I think this is confusing because I'm not really building an input pipeline, since my input fits in memory, yet I still need batching, shuffling and multiple epochs, which possibly requires more knowledge of input pipelines.
tf.train.batch simply groups upstream samples into batches, and nothing more. It is meant to be used at the end of an input pipeline. Data and epochs are dealt with upstream.
For example, if your training data fits into a tensor, you could use tf.train.slice_input_producer to produce samples. This function has arguments for shuffling and epochs.
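A hedged sketch of that queue-based TF 1.x pattern, using random in-memory arrays as stand-in data:

import numpy as np
import tensorflow as tf  # TF 1.x API

features = np.random.rand(1000, 10).astype(np.float32)      # stand-in for your in-memory data
labels = np.random.randint(0, 2, size=1000).astype(np.int64)

# slice_input_producer handles shuffling and epoch counting upstream...
feature_slice, label_slice = tf.train.slice_input_producer(
    [tf.constant(features), tf.constant(labels)], num_epochs=5, shuffle=True)

# ...while tf.train.batch only groups the upstream slices into batches.
feature_batch, label_batch = tf.train.batch([feature_slice, label_slice], batch_size=32)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            f, l = sess.run([feature_batch, label_batch])   # one shuffled batch per call
    except tf.errors.OutOfRangeError:
        pass  # raised once num_epochs passes over the data have been produced
    finally:
        coord.request_stop()
        coord.join(threads)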

Flow of execution in word2vec tensorflow

From past few days, I have been trying to figure out the flow of execution in the code https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec.py#L28 .
I understood the logic behind negative sampling and the loss function, but I am getting confused about the flow of execution inside the train function, especially when it comes to the _train_thread_body function. I am confused about the while loop and the if condition (what is their effect?) and about the concurrency-related parts. It would be great if someone could give a decent explanation before down-voting this.
This sample code is called "Multi-threaded word2vec mini-batched skip-gram model", that's why it uses several independent threads for training. Word2Vec can be trained with a single thread as well, but this tutorial shows that word2vec is faster to compute when done in parallel.
The input, label and epoch tensors are provided by the native word2vec.skipgram_word2vec function, which is implemented in tutorials/embedding/word2vec_kernels.cc file. There you can see that current_epoch is a tensor updated once the whole corpus of sentences is processed.
The method you're asking about is actually pretty simple:
def _train_thread_body(self):
    initial_epoch, = self._session.run([self._epoch])
    while True:
        _, epoch = self._session.run([self._train, self._epoch])
        if epoch != initial_epoch:
            break
First, it reads the current epoch, then it keeps invoking the training until the epoch counter increases. This means that all of the threads running this method will perform exactly one epoch of training. Each thread does one step at a time, in parallel with the others.
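To illustrate the concurrency, here is a hedged sketch of the launching pattern (not the tutorial's exact code; num_threads and model are placeholders):

import threading

# Launch several workers that all run _train_thread_body; each one keeps
# calling the train op until the shared epoch counter advances, then exits.
workers = []
for _ in range(num_threads):
    t = threading.Thread(target=model._train_thread_body)
    t.start()
    workers.append(t)

for t in workers:
    t.join()  # wait until every worker has seen the epoch change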
self._train is an op that optimizes the loss function (see the optimize method), which is computed from the current examples and labels (see the build_graph method). The exact values of these tensors come from native code again, namely from NextExample. Essentially, each call of word2vec.skipgram_word2vec extracts a set of examples and labels, which form the input to the optimization function. Hopefully that makes it clearer now.
By the way, this model uses NCE loss in training, not negative sampling.

Accessing Variable in Keras Callback

So I have a CNN implemented. I have made custom callbacks that are confirmed working, but I have an issue.
This is a sample output.
Example of iteration 5 (batch-size of 10,000 for simplicity)
50000/60000 [========================>.....] - ETA: 10s ('new lr:', 0.01)
('accuracy:', 0.70)
I have 2 callbacks (tested to work as shown in the output):
(1) Changes the learning rate at each iteration. (2) Prints the accuracy at each iteration.
I have an external script that determines the learning rate by taking in the accuracy.
Question:
How do I make the accuracy at each iteration available so that an external script can access it (in essence, an accessible variable at each iteration)? Right now I'm only able to access it once the process is over, via AccuracyCallback.accuracy.
Problem
I can pass in a changing learning rate, but how do I get the accuracy back out, as an accessible variable, at each iteration?
Example
My external script determines the learning rate at iteration 1: 0.01. How do I get the accuracy as an accessible variable in my external script at iteration 1 instead of a print statement?
You can create your own callback
class AccCallback(keras.callbacks.Callback):
    def on_batch_end(self, batch, logs={}):
        accuracy = logs.get('acc')
        # pass accuracy to your 'external' script and set new lr here
In order for logs.get('acc') to work, you have to tell Keras to monitor it:
model.compile(optimizer='...', loss='...', metrics=['accuracy'])
Lastly, note that the type of accuracy is ndarray here. Should it cause you any issue, I suggest wrapping it: float(accuracy).
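Building on that, a hedged sketch of a callback that both exposes the per-batch accuracy and applies an externally computed learning rate (lr_fn stands in for whatever function your external script provides):

import keras
import keras.backend as K

class AccCallback(keras.callbacks.Callback):
    def __init__(self, lr_fn):
        super(AccCallback, self).__init__()
        self.lr_fn = lr_fn    # external function: accuracy -> new learning rate
        self.history = []     # readable from outside at any point during training

    def on_batch_end(self, batch, logs=None):
        logs = logs or {}
        acc = float(logs.get('acc', 0.0))
        self.history.append(acc)
        # Apply the learning rate your external script decided on.
        K.set_value(self.model.optimizer.lr, self.lr_fn(acc))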
