For the past few days, I have been trying to figure out the flow of execution in the code https://github.com/tensorflow/models/blob/master/tutorials/embedding/word2vec.py#L28 .
I understand the logic behind negative sampling and the loss function, but I am getting confused about the flow of execution inside the train function, especially the _train_thread_body function. I am confused by the while loop and the if check (what is their impact?) and by the concurrency-related parts. It would be great if someone could give a decent explanation before down-voting this.
This sample code is called "Multi-threaded word2vec mini-batched skip-gram model", that's why it uses several independent threads for training. Word2Vec can be trained with a single thread as well, but this tutorial shows that word2vec is faster to compute when done in parallel.
The input, label and epoch tensors are provided by the native word2vec.skipgram_word2vec function, which is implemented in tutorials/embedding/word2vec_kernels.cc file. There you can see that current_epoch is a tensor updated once the whole corpus of sentences is processed.
The method you're asking about is actually pretty simple:
def _train_thread_body(self):
    initial_epoch, = self._session.run([self._epoch])
    while True:
        _, epoch = self._session.run([self._train, self._epoch])
        if epoch != initial_epoch:
            break
First, it reads the current epoch, then it keeps running training steps until the epoch counter increases. This means that every thread running this method will perform exactly one epoch of training. Each thread does one step at a time, in parallel with the others.
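For context, the surrounding train() method launches several worker threads that all run this body, roughly as in the simplified sketch below (opts.concurrent_steps is the tutorial's thread-count option; this is an illustration of the pattern, not a verbatim copy of the tutorial code):

import threading

# inside train(): start one worker per concurrent step
workers = []
for _ in range(opts.concurrent_steps):
    t = threading.Thread(target=self._train_thread_body)
    t.start()
    workers.append(t)

# wait for every worker to finish its single epoch of training
for t in workers:
    t.join()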
self._train is an op that optimizes the loss function (see the optimize method), which is computed from the current examples and labels (see the build_graph method). The exact values of these tensors come from native code again, namely from NextExample. Essentially, each call of word2vec.skipgram_word2vec extracts a set of examples and labels, which form the input to the optimization function. Hope it makes it clearer now.
By the way, this model uses NCE loss in training, not negative sampling.
Related
I am training a skip-gram model using gensim word2vec. I would like to stop training before reaching the number of epochs passed in the parameters, based on an accuracy test on a different data set, in order to avoid overfitting the model.
Is there a way in gensim to interrupt the train of word2vec from a callback function?
If in fact more training makes your Word2Vec model worse on some external evaluation, there is likely something else wrong with your setup. (For example, many online code examples that call train() multiple times in a loop mismanage the learning-rate alpha such that it actually goes negative, which would mean each training example results in anti-corrections to the model via backpropagation.)
If instead the main problem is truly overfitting, a better solution than conditional early-stopping would probably be adjusting other parameters, such as the model size, so that it can't overshoot useful generalization no matter how many training passes are made.
But if you really want to try the less-good approach of early stopping, you could potentially raise a catchable exception in your callback, and catch it outside train() to allow your other code to continue with the results of the aborted training. For example...
A custom exception...
class OverfitException(Exception):
    pass
...then in your callback...
raise OverfitException()
...and around training...
try:
    model.train(...)
except OverfitException:
    print("training cut short")
    # ... & your code with the partially-trained model continues
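Putting the pieces together, here is a minimal sketch of what the callback itself could look like, assuming gensim's CallbackAny2Vec base class; evaluate_on_holdout() is a hypothetical function returning your external accuracy score:

from gensim.models.callbacks import CallbackAny2Vec

class EarlyStopCallback(CallbackAny2Vec):
    def __init__(self):
        self.best_score = None

    def on_epoch_end(self, model):
        score = evaluate_on_holdout(model)      # hypothetical held-out accuracy check
        if self.best_score is not None and score < self.best_score:
            raise OverfitException()            # caught in the try/except around train()
        self.best_score = score

You would then pass it in via the callbacks argument, e.g. model.train(sentences, total_examples=model.corpus_count, epochs=20, callbacks=[EarlyStopCallback()]), inside the try/except shown above.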
But again, this is not the best way to deal with overfitting or other cases where more training is seeming to hurt evaluation-scores.
I want to fine-tune my model when using Keras: I want to change my training data and learning rate for further training once the number of epochs reaches 10. So how do I get a callback when the specified epoch number is over?
You need to write your own Callback subclass.
https://keras.io/callbacks/ (general information)
https://github.com/keras-team/keras/blob/master/keras/callbacks.py#L275 (source code for the Callback base class)
Your Callback subclass should define an on_epoch_end() method, which accepts the epoch number as an argument.
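A minimal sketch of such a subclass, assuming the goal from the question (lower the learning rate once epoch 10 is over; the new rate of 1e-4 is just a placeholder). Swapping the training data mid-fit is much harder to do from a callback, which is one reason the advice below suggests saving and reloading instead:

from keras.callbacks import Callback
import keras.backend as K

class SwitchAfterEpoch10(Callback):
    def on_epoch_end(self, epoch, logs=None):
        if epoch == 9:                          # epochs are 0-indexed, so this fires after epoch 10
            K.set_value(self.model.optimizer.lr, 1e-4)
            print('Lowered learning rate after epoch 10')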
Actually, given the way Keras works, this is probably not the best way to go. It would be much better to treat this as fine-tuning: finish the 10 epochs, save the model, and then load the model (from another script) and continue training with the learning rate and data you fancy (see the sketch after the list below).
There are several reasons for this.
It is much clearer and easier to debug. You check your model properly after the 10 epochs, verify that it works as expected, and carry on.
It is much better to do several experiments this way, starting from epoch 10.
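A minimal sketch of that two-stage approach (file name, learning rate, loss and the fine-tuning data are all placeholders):

# Stage 1: train for 10 epochs and save
model.fit(x_train, y_train, epochs=10)
model.save('stage1.h5')

# Stage 2 (possibly in a separate script): reload and fine-tune
from keras.models import load_model
from keras.optimizers import Adam

model = load_model('stage1.h5')
model.compile(optimizer=Adam(lr=1e-4),          # new, smaller learning rate
              loss='categorical_crossentropy',  # whatever loss you originally used
              metrics=['accuracy'])
model.fit(x_finetune, y_finetune, epochs=10)    # the new data for fine-tuning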
Good luck!
When I trained an SSD object detection model for 20K steps using the TensorFlow Object Detection API, I found that the training time varied:
It trained fast for the first 10 minutes, and around 500 steps were performed (i.e. 0.83 steps/second). Then it slowed down and took about 40~50 minutes to perform a single training step, evaluate the model on the evaluation dataset, and save the checkpoint to disk. So I interrupted the training after a few steps and continued by restoring the checkpoint.
Every time, it trained fast for the first 10 minutes and then slowed down sharply, as the figures showed.
The model's training is implemented with TensorFlow's Estimator API, tf.estimator.train_and_evaluate().
Can anyone explain how it works? How does the estimator control the training and evaluation periods? I do not want to evaluate the model at every step!
If you look at EvalSpec and TrainSpec, there is an argument throttle_secs, which is responsible for deciding when evaluation is called. Refer to this heated discussion, which has many details about Estimator methods! Controlling this would be the way to control train and eval cycles. Also, in general, train_and_evaluate works by building a graph of the training and evaluation operations. The training graph is created only once, but the evaluation graph is recreated every time you need to evaluate. This means that it will load the checkpoint that was created during training, which may be one reason why this is taking so long! Maybe the InMemoryEvaluationHook mentioned in that discussion can help you out, since it does not reload the checkpoint every time evaluation is called.
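A minimal sketch of wiring throttle_secs in, assuming you already have an estimator and the two input functions (the concrete numbers are placeholders):

import tensorflow as tf

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=20000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                  steps=100,             # number of evaluation batches
                                  start_delay_secs=600,  # wait before the first evaluation
                                  throttle_secs=1800)    # at most one evaluation every 30 minutes

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)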
This is probably a very basic question. I'm new to deep learning, and from what I've gathered so far, one generally creates batches of data, and once all the training data has been used (or "enough" of it), the process is repeated a couple of times (each iteration is called an epoch). However, when I look at the CIFAR10 tutorial:
CIFAR10 Tensorflow tutorial
There is no such thing as epochs. They are only mentioned here:
cifar10.py
as NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN, NUM_EXAMPLES_PER_EPOCH_FOR_EVAL and NUM_EPOCHS_PER_DECAY.
Do they use this to implicitly define the epochs?
num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / FLAGS.batch_size
I also ask because I'm a bit confused about how I should set the num_epochs argument here (in my own model):
tf.train.string_input_producer(..., num_epochs=num_epochs, ...)
Should I just set it to None, or do I have to calculate the number of epochs first?
There are two things in your question:
Understanding: one epoch does not mean one iteration in most situations. One epoch means one pass over the full training set. NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN etc. are defined here as 50000, since CIFAR-10 has 50000 examples for training. With that, num_batches_per_epoch is easy to understand.
As for the code: in tf.train.string_input_producer(..., num_epochs=num_epochs, ...), you can check the API documentation, which explains num_epochs. For CIFAR-10 you don't specify num_epochs, because this string_input_producer does not read each example directly: the dataset is divided into 5 parts/files, each of which stores 10000 examples, and string_input_producer reads the file names.
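For illustration, this is roughly how the CIFAR-10 input pipeline queues the five data files without setting num_epochs (a simplified sketch; the data directory is a placeholder):

import os
import tensorflow as tf

data_dir = '/tmp/cifar10_data/cifar-10-batches-bin'     # placeholder path
filenames = [os.path.join(data_dir, 'data_batch_%d.bin' % i) for i in range(1, 6)]

# num_epochs defaults to None, so the queue cycles over the file names
# indefinitely; the training loop decides how many steps to run.
filename_queue = tf.train.string_input_producer(filenames)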
I'm using Keras with Theano to train a basic logistic regression model.
Say I've got a training set of 1 million entries; it's too large for my system to use the standard model.fit() without blowing away memory.
I decide to use a python generator function and fit my model using model.fit_generator().
My generator function returns batch sized chunks of the 1M training examples (they come from a DB table, so I only pull enough records at a time to satisfy each batch request, keeping memory usage in check).
It's an endlessly looping generator: once it reaches the end of the 1 million examples, it loops and continues over the set.
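For reference, a minimal sketch of such an endlessly looping generator; fetch_batch_from_db() is a hypothetical helper that would run a LIMIT/OFFSET style query:

def db_batch_generator(total_rows, batch_size):
    # Yield (X, y) batches forever, cycling over the table.
    while True:                                 # fit_generator expects an endless generator
        for offset in range(0, total_rows, batch_size):
            X, y = fetch_batch_from_db(offset, batch_size)   # hypothetical DB helper
            yield X, y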
There is a mandatory argument in fit_generator() to specify samples_per_epoch. The documentation indicates
samples_per_epoch: integer, number of samples to process before going to the next epoch.
I'm assuming that fit_generator() doesn't reset the generator each time an epoch runs, hence the need for an infinitely running generator.
I typically set the samples_per_epoch to be the size of the training set the generator is looping over.
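So, under the old Keras 1.x signature referred to here, the call typically looks roughly like this (the batch size and generator name are placeholders, reusing the sketch above):

train_gen = db_batch_generator(total_rows=1000000, batch_size=128)

model.fit_generator(train_gen,
                    samples_per_epoch=1000000,  # one full pass over the table per epoch
                    nb_epoch=10)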
However, if samples_per_epoch is smaller than the size of the training set the generator is working from, and nb_epoch > 1:
Will you get odd/adverse/unexpected training results, as it seems the epochs will have differing sets of training examples to fit to?
If so, do you 'fast-forward' your generator somehow?
I'm dealing with something similar right now. I want to make my epochs shorter so I can record more information about the loss or adjust my learning rate more often.
Without diving into the code, I think the fact that .fit_generator works with the randomly augmented/shuffled data produced by the Keras built-in ImageDataGenerator supports your suspicion that it doesn't reset the generator per epoch. So I believe you should be fine: as long as the model is exposed to your whole training set, it shouldn't matter if some of it is trained in a separate epoch.
If you're still worried you could try writing a generator that randomly samples your training set.
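For example, a minimal sketch of a generator that draws each batch by random sampling instead of cycling in order; fetch_rows_from_db() is again a hypothetical helper keyed by row ids:

import numpy as np

def random_batch_generator(total_rows, batch_size):
    # Yield (X, y) batches of randomly chosen rows, forever.
    while True:
        ids = np.random.choice(total_rows, size=batch_size, replace=False)
        X, y = fetch_rows_from_db(ids)          # hypothetical DB helper
        yield X, y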