I am training a skip-gram model using gensim Word2Vec. I would like to stop training before reaching the number of epochs passed in the parameters, based on a specific accuracy test on a separate data set, in order to avoid overfitting the model.
Is there a way in gensim to interrupt the training of word2vec from a callback function?
If in fact more training makes your Word2Vec model worse on some external evaluation, there is likely something else wrong with your setup. (For example, many online code examples that call train() multiple times in a loop mismanage the learning-rate alpha such that it actually goes negative, which would mean each training example results in anti-corrections to the model via backpropagation.)
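For comparison, a single well-parameterized train() call lets gensim manage the epochs and the alpha decay internally, which avoids that pitfall. A minimal sketch, assuming gensim 4.x and that sentences is your iterable of tokenized texts:

from gensim.models import Word2Vec

# sentences is assumed to be an iterable of lists of tokens
model = Word2Vec(vector_size=100, sg=1, min_count=5)  # sg=1 selects skip-gram
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)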
If instead the main problem is truly overfitting, a better solution than conditional early-stopping would probably be adjusting other parameters, such as the model size, so that it can't overshoot useful generalization no matter how many training passes are made.
But if you really want to try the less-good approach of early stopping, you could potentially raise a catchable exception in your callback, and catch it outside train() to allow your other code to continue with the results of the aborted training. For example...
A custom exception...
class OverfitException(Exception):
    pass
...then in your callback...
raise OverfitException()
...and around training...
try:
    model.train(...)
except OverfitException:
    print("training cut short")
    # ... & your code with partially-trained model continues
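Putting the pieces together, a minimal sketch of such a callback using gensim's CallbackAny2Vec base class (evaluate_on_holdout() is a hypothetical placeholder for whatever external accuracy test you run):

from gensim.models.callbacks import CallbackAny2Vec

class OverfitException(Exception):
    pass

class EarlyStop(CallbackAny2Vec):
    def __init__(self):
        self.best_score = None

    def on_epoch_end(self, model):
        score = evaluate_on_holdout(model)  # your own external evaluation (placeholder)
        if self.best_score is not None and score < self.best_score:
            raise OverfitException()        # caught around train(), as above
        self.best_score = score

The callback instance would then be passed to training via the callbacks parameter, e.g. model.train(..., callbacks=[EarlyStop()]).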
But again, this is not the best way to deal with overfitting or other cases where more training seems to hurt evaluation scores.
Related
I want to fine-tune my model using Keras, and I want to change my training data and learning rate once training reaches epoch 10. So how do I get a callback when the specified epoch number is over?
You need to write your own Callback subclass.
https://keras.io/callbacks/ (general information)
https://github.com/keras-team/keras/blob/master/keras/callbacks.py#L275 (source code for the Callback base class)
Your Callback subclass should define an on_epoch_end() method, which accepts the epoch number as an argument.
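For example, a minimal sketch of such a subclass (what you actually do at epoch 10 is left as a placeholder):

from keras.callbacks import Callback

class EpochTenHook(Callback):
    def on_epoch_end(self, epoch, logs=None):
        # epoch is 0-based, so epoch == 9 means the 10th epoch just finished
        if epoch == 9:
            print("10 epochs done")  # placeholder: swap data / learning rate here

You would pass an instance of it to model.fit(..., callbacks=[EpochTenHook()]).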
Actually, given the way Keras works, this is probably not the best way to go; it would be much better to treat this as fine-tuning, meaning that you finish the 10 epochs, save the model, and then load the model (from another script) and continue training with the learning rate and data you fancy (see the sketch after the list below).
There are several reasons for this.
It is much clearer and easier to debug. You check your model properly after the 10 epochs, verify that it works properly, and carry on.
It is much better to do several experiments this way, starting from epoch 10.
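A rough sketch of that save-then-resume workflow (the file name, loss, data and learning rate are placeholders):

from keras.models import load_model
from keras.optimizers import Adam

# script 1: train for 10 epochs and save everything (weights + optimizer state)
model.fit(x_train, y_train, epochs=10)
model.save('model_after_10_epochs.h5')

# script 2: reload, recompile with the new learning rate, continue on new data
model = load_model('model_after_10_epochs.h5')
model.compile(optimizer=Adam(learning_rate=1e-4), loss='mse')  # lr= in older Keras
model.fit(x_new, y_new, epochs=10)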
Good luck!
Most of my code is based on this article, and the issue I'm asking about is evident there as well as in my own testing. It is a sequential model with LSTM layers.
Here is a plotted prediction over real data from a model that was trained with around 20 small data sets for one epoch.
Here is another plot but this time with a model trained on more data for 10 epochs.
What causes this and how can I fix it? Also, the first link I sent shows the same result at the bottom: 1 epoch does great and 3500 epochs is terrible.
Furthermore, when I run a training session on the larger amount of data but with only 1 epoch, I get results identical to the second plot.
What could be causing this issue?
A few questions:
Is this graph for training data or validation data?
Do you consider it better because:
The graph seems cool?
You actually have a better "loss" value?
If so, was it training loss?
Or validation loss?
Cool graph
The early graph seems interesting, indeed, but take a close look at it:
I clearly see huge predicted valleys where the expected data has a peak.
Is this really better? It looks like a random wave that is completely out of phase, meaning that a straight line would actually represent a better loss than this.
Take a look at the "training loss"; this is what can surely tell you whether your model is better or not.
If this is the case and your model isn't reaching the desired output, then you should probably make a more capable model (more layers, more units, a different method, etc.), as illustrated below. But be aware that many datasets are simply too random to be learned, no matter how good the model is.
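As an illustration only, "more capable" for this kind of sequential LSTM model usually just means stacking more or larger layers; a sketch with placeholder sizes and input shape:

from keras.models import Sequential
from keras.layers import LSTM, Dense

timesteps, features = 50, 1   # placeholders for your input shape
model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(timesteps, features)),
    LSTM(64),
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')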
Overfitting - Training loss gets better, but validation loss gets worse
In case you actually have a better training loss: OK, so your model is indeed getting better.
Are you plotting training data? - Then this straight line is actually better than a wave out of phase
Are you plotting validation data?
What is happening with the validation loss? Better or worse?
If your "validation" loss is getting worse, your model is overfitting. It's memorizing the training data instead of learning generally. You need a less capable model, or a lot of "dropout".
Often, there is an optimal point where the validation loss stops going down, while the training loss keeps going down. This is the point to stop training if you're overfitting. Read about the EarlyStopping callback in keras documentation.
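For instance, a minimal EarlyStopping setup (the patience value is just an example, and restore_best_weights requires a reasonably recent Keras):

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=100,
          callbacks=[early_stop])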
Bad learning rate - Training loss is going up indefinitely
If your training loss is going up, then you've got a real problem there: either a bug, a badly prepared calculation somewhere (if you're using custom layers), or simply a learning rate that is too big.
Reduce the learning rate (divide it by 10, or 100), create and compile a "new" model and restart training.
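For example (build_model() is a hypothetical function that recreates your architecture from scratch, and Adam is just one possible optimizer):

from keras.optimizers import Adam

model = build_model()                                           # fresh, re-initialized model
model.compile(optimizer=Adam(learning_rate=1e-4), loss='mse')   # learning rate divided by 10 or 100
model.fit(x_train, y_train, epochs=10)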
Another problem?
Then you need to detail your question properly.
For the past few days, I have been trying to figure out the flow of execution in the code at https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec.py#L28 .
I understand the logic behind negative sampling and the loss function, but I am getting confused about the flow of execution inside the train function, especially when it comes to the _train_thread_body function. I am confused about the while loop and the if condition (what is their impact?) and the concurrency-related parts. It would be great if someone could give a decent explanation before down-voting this.
This sample code is called "Multi-threaded word2vec mini-batched skip-gram model", that's why it uses several independent threads for training. Word2Vec can be trained with a single thread as well, but this tutorial shows that word2vec is faster to compute when done in parallel.
The input, label and epoch tensors are provided by the native word2vec.skipgram_word2vec function, which is implemented in tutorials/embedding/word2vec_kernels.cc file. There you can see that current_epoch is a tensor updated once the whole corpus of sentences is processed.
The method you're asking about is actually pretty simple:
def _train_thread_body(self):
    initial_epoch, = self._session.run([self._epoch])
    while True:
        _, epoch = self._session.run([self._train, self._epoch])
        if epoch != initial_epoch:
            break
First, it reads the current epoch, then it invokes training steps until the epoch counter increases. This means that each thread running this method will perform exactly one epoch of training, doing one step at a time in parallel with the others.
self._train is an op that optimizes the loss function (see the optimize method), which is computed from the current examples and labels (see the build_graph method). The exact values of these tensors come from native code again, namely from NextExample. Essentially, each call of word2vec.skipgram_word2vec extracts a set of examples and labels, which form the input to the optimization function. Hope that makes it clearer now.
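The surrounding train() method basically just launches several of these workers and waits for them, roughly like this (a simplified sketch, not the exact tutorial code; self and opts refer to the tutorial's Word2Vec class and options):

import threading

workers = []
for _ in range(opts.concurrent_steps):   # one thread per concurrent worker
    t = threading.Thread(target=self._train_thread_body)
    t.start()
    workers.append(t)
for t in workers:
    t.join()                             # wait until every thread has finished its epoch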
By the way, this model uses NCE loss in training, not negative sampling.
Recently, I used Keras to train a network to classify pictures, using the Keras function model.fit_generator() to fit my model. fit_generator() automatically runs the model on the validation data and returns a validation accuracy when an epoch finishes.
But an odd thing happened: when I used the model to predict on the validation data and compared the results with the correct classes, the validation accuracy was lower than what I got from fit_generator().
I have two assumptions:
1. I use a generator to get data from a dictionary, so I assume that within a single epoch the generator may repeatedly fetch data that the model already fits well, so the accuracy may come out higher.
2. Keras may use some tricks, or preprocess the data differently, when doing validation, thus enhancing the accuracy.
I tried to look through the Keras source code and documentation, but nothing helped. I would be very thankful if anyone could give me some advice about this problem.
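For reference, the kind of manual check described above might look roughly like this (a sketch that assumes a flow_from_directory-style validation generator created with shuffle=False, so that generator.classes lines up with the predictions):

import numpy as np

val_generator.reset()                             # start from the first batch
probs = model.predict_generator(val_generator, steps=len(val_generator))
pred_classes = np.argmax(probs, axis=1)
true_classes = val_generator.classes              # aligned only when shuffle=False
manual_accuracy = np.mean(pred_classes == true_classes)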
I am new to tensorflow, so please pardon my ignorance.
I have a TensorFlow demo model (from an online tutorial) that should predict stock market prices for the S&P. When I run the code I get inconsistent results every time I run it. The training data does not change, and I suppressed block shuffling, ...
But when I run the prediction twice in the same run, I get consistent results (i.e. one training run, prediction run twice).
My questions are:
Why am I getting inconsistent results?
If you were going to release such code to production, would you just take the results of the last training run? If not, then what would you do?
Does it make sense to force the model to produce consistent predictions? How would you do that?
Here is my code location: github repo
In training a neural network there is more randomness involved than just the batch shuffling. The initial weights of the layers are also randomly initialized.
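For what it's worth, that randomness can at least be made reproducible by fixing the seeds before the graph is built; a minimal sketch, assuming TensorFlow 1.x as used in such tutorials:

import random
import numpy as np
import tensorflow as tf

# fix the three common sources of randomness before building the graph
random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)   # tf.random.set_seed(42) in TensorFlow 2.x

This makes runs repeatable, but it does not by itself make the model any better.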
Typically you would use the best model you have trained so far. To determine which model is the best you usually use some test dataset you did not use during training.
It is probably not a good sign if your performance fluctuates across training runs; it means your result depends a lot on the random initialization. I personally don't know of any general techniques to make learning more stable, but there probably are some.