TensorFlow CIFAR-10 Tutorial: Determining the number of epochs in the training process - python

This is probably a very basic question. I'm new to deep learning, and from what I've gathered so far, one generally creates batches of data, and once all the training data has been used (or "enough" of it), the whole process is repeated several times (each such repetition is called an epoch). However, when I look at the CIFAR-10 tutorial:
CIFAR10 Tensorflow tutorial
There is no such thing as epochs. They are only mentioned here:
cifar10.py
as NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN, NUM_EXAMPLES_PER_EPOCH_FOR_EVAL and NUM_EPOCHS_PER_DECAY.
Do they use this to implicitly define the epochs?
num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /FLAGS.batch_size
I also ask because I'm a bit confused about how I should set the num_epochs argument here (in my own model):
tf.train.string_input_producer(..., num_epochs=num_epochs, ...)
Should I just set it to None, or do I have to calculate the number of epochs first?

There are two things in your question:
Understanding: one epoch does not mean one iteration in most situations. One epoch means one pass over the full training set. NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN etc. are defined there as 50000, because CIFAR-10 has 50000 training examples. With that in mind, num_batches_per_epoch is easy to understand: it is the number of batches needed to cover the training set once.
As for the coding: in tf.train.string_input_producer(..., num_epochs=num_epochs, ...), check the API documentation, which explains num_epochs. For CIFAR-10, you don't specify num_epochs, because this string_input_producer does not read individual examples. The dataset is split into 5 files of 10000 examples each, and string_input_producer produces the filenames, not the examples.
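To make the relationship concrete, here is a small illustrative sketch: the constant mirrors cifar10.py, while the batch size and global step are made-up values. The number of epochs is never set explicitly; it simply falls out of how many batches have been consumed.
# Illustrative only: the number of completed epochs is implied by the number
# of training steps rather than being configured anywhere.
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000   # CIFAR-10 training set size
batch_size = 128                           # assumed value of FLAGS.batch_size

num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / batch_size

step = 100000                              # hypothetical global step
epochs_completed = step / num_batches_per_epoch
print(num_batches_per_epoch, epochs_completed)  # ~390.6 batches per epoch, 256 epochs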

Related

Missing one batch in the training for loop?

The data has n_rows rows
The batch size is batch_size
I see some code uses:
n_batches = int(n_rows / batch_size)
What if n_rows is not a multiple of batch size?
Is the n_batches still correct?
In fact, you see this in quite a lot of code, and since labeled data is extremely valuable, you don't want to lose any precious labeled examples. At first glance it looks like a bug, as if we were losing some training examples, but we have to take a closer look at the code.
In general, as in the code that you sent, the data is shuffled at the start of each epoch (where one epoch means processing n_batches = int(n_rows / batch_size) full batches). The leftover examples therefore differ from epoch to epoch, so over time (after several epochs) the network sees all of the training examples. We're not losing any examples \o/
Small conclusion: if you see this, make sure the data is shuffled at each epoch, otherwise your network might never see some training examples.
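A minimal NumPy sketch of that pattern (train_step and the other names are placeholders, not taken from any code in the question): shuffle at every epoch and iterate over full batches only, so that the at most batch_size - 1 leftover rows change from epoch to epoch.
import numpy as np

def run_epochs(X, y, batch_size, n_epochs, train_step):
    n_rows = len(X)
    n_batches = n_rows // batch_size          # same as int(n_rows / batch_size)
    for epoch in range(n_epochs):
        perm = np.random.permutation(n_rows)  # the crucial per-epoch shuffle
        X_shuf, y_shuf = X[perm], y[perm]
        for b in range(n_batches):
            sl = slice(b * batch_size, (b + 1) * batch_size)
            train_step(X_shuf[sl], y_shuf[sl])  # always exactly batch_size rows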
What are the advantages of doing that?
It's efficient:
By using this mechanism you ensure that at each training step your network sees exactly batch_size examples, so you never run a training step on a tiny leftover batch.
It's more rigorous: imagine you have one example left over and you don't shuffle. At each epoch, assuming your loss is the average loss over the batch, this last example would behave like a batch consisting of one element repeated batch_size times, i.e. it would effectively be weighted as more important. If you shuffle, this effect is reduced (since the leftover example changes over time), but it's more rigorous to keep a constant batch size during a training epoch.
There are also some advantages to shuffling your data during training; see:
this Stats Exchange post
I'll also add that if you are using mechanisms such as Batch Normalization, it's better to have a constant batch size during training: for example, if n_rows % batch_size == 1, passing a single example as a batch during training can cause problems.
Note:
I'm talking about a constant batch size during a training epoch, not over the whole training cycle (multiple epochs). Although the batch size is normally kept constant over the whole training process, there is research that varies the batch size during training, e.g. Don't Decay the Learning Rate, Increase the Batch Size.

On training LSTMs efficiently but well, parallelism vs training regime

For a model that I intend to use to spontaneously generate sequences, training it sample by sample and keeping state in between feels most natural. I've managed to construct this in Keras after reading many helpful resources. (SO: Q and two fantastic answers, Machine Learning Mastery 1, 2, 3)
First a sequence is constructed (in my case one-hot encoded, too). X and Y are produced from this sequence by shifting Y forward by one time step. Training is done in batches of one sample and one time step.
For Keras this looks something like this:
data = get_some_data()  # Shape (samples, features)
Y = data[1:, :]  # Shape (samples-1, features)
X = data[:-1, :].reshape((-1, 1, data.shape[-1]))  # Shape (samples-1, 1, features)

model = Sequential()
model.add(LSTM(256, batch_input_shape=(1, 1, X.shape[-1]), stateful=True))
model.add(Dense(Y.shape[-1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])

for epoch in range(10):
    model.fit(X, Y, batch_size=1, shuffle=False)
    model.reset_states()
It does work. However, after checking Task Manager, it seems to be using only about 10% of my GPU resources, which are already quite limited. I'd like to improve this to speed up training. Increasing the batch size would allow for parallel computation.
The network in its current state presumably "remembers" things even from the start of the training sequence. For training in batches, one would need to first set up the sequence and then predict one value, and do this for multiple values. To train on the full sequence, one would need to generate data of shape (samples-steps, steps, features). I imagine it wouldn't be uncommon to have sequences spanning at least a couple of hundred time steps, so that would mean a huge increase in the amount of data.
Between framing the problem a bit differently and needing far more data in memory on one hand, and utilising only a small fraction of my processing resources on the other, I must ask:
Is my intuition of the natural way of training and statefulness correct?
Are there other downsides to this training with one sample per batch?
Could the utilisation issues be resolved any other way?
Finally, is there an accepted way of performing this kind of training to generate long sequences?
Any help is greatly appreciated, I'm fairly new with LSTMs.
I do not know your specific application; however, sending in only one timestep of data is surely not a good idea. Instead, give the LSTM the entire sequence of previously seen one-hot vectors (presumably words), and pre-pad with zeros if necessary, since it appears you are working with sequences of varying length. Also consider using an embedding layer before your LSTM if these are indeed words. Read the documentation carefully.
Low GPU utilization is not a problem in itself. You simply do not feed enough data per batch to fully use the available resources. Training with batches is a sequential process; there is no real way to parallelize across batches, at least not one that is introductory and beneficial to what your goals appear to be. If you give the LSTM more timesteps per sample, however, this will surely increase your utilization.
stateful in an LSTM does not do what you think it does. An LSTM always remembers the sequence it is iterating over as it updates its internal hidden states, h and c, and the weight transformations that build those internal states are learned during training. What stateful does is preserve the previous hidden state from the last batch, per batch index: the final hidden state of the third element of one batch is used as the initial hidden state of the third element of the next batch, and so on. I do not believe this is useful for your application.
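A tiny illustrative sketch of exactly that behaviour (layer size, batch size and input shapes are arbitrary, not taken from the question):
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM

model = Sequential()
model.add(LSTM(8, batch_input_shape=(2, 1, 4), stateful=True))

x1 = np.random.rand(2, 1, 4)
x2 = np.random.rand(2, 1, 4)
model.predict(x1, batch_size=2)  # hidden states for batch indices 0 and 1 are kept...
model.predict(x2, batch_size=2)  # ...and used as the initial states for this call
model.reset_states()             # an explicit reset clears them again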
There are downsides to training an LSTM with one sample per batch; in general, training with mini-batches increases stability. However, you do not appear to be training with one sample per batch so much as with one timestep per sample.
Edit (from comments)
If you use stateful and send the next 'character' of your sequence at the same index of the next batch, this is analogous to sending the full sequence of timesteps per sample. I would still recommend the approach described above (see the sketch below), in order to improve the speed of the application and to be more in line with other LSTM applications. I see no disadvantage to sending the full sequence per sample instead of spreading it over batches, and the advantages in speed, in being able to shuffle your input data per batch, and in readability/consistency would be worth the change IMO.
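A minimal sketch of that full-sequence-per-sample setup, keeping the layer sizes from the question; the window length, batch size and the toy one-hot data are assumptions for illustration only:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Toy one-hot encoded sequence standing in for get_some_data().
ids = np.random.randint(0, 50, size=1000)
data = np.eye(50, dtype=np.float32)[ids]

steps = 100  # how many past timesteps each training sample carries

# Overlapping windows: X has shape (samples-steps, steps, features),
# Y is the vector that follows each window.
X = np.stack([data[i:i + steps] for i in range(len(data) - steps)])
Y = data[steps:]

model = Sequential()
model.add(LSTM(256, input_shape=(steps, X.shape[-1])))  # stateful no longer needed
model.add(Dense(Y.shape[-1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Larger, shuffled batches keep the GPU much busier than batch_size=1.
model.fit(X, Y, batch_size=64, epochs=10, shuffle=True)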

What does epochs mean in Doc2Vec and train when I have to manually run the iteration?

I am trying to understand the epochs parameter in the Doc2Vec function and epochs parameter in the train function.
In the following code snippet, I manually set up a loop of 4000 iterations. Is that required, or is passing 4000 as the epochs parameter to Doc2Vec enough? Also, how is epochs in Doc2Vec different from epochs in train()?
documents = Documents(train_set)
model = Doc2Vec(vector_size=100, dbow_words=1, dm=0, epochs=4000, window=5,
seed=1337, min_count=5, workers=4, alpha=0.001, min_alpha=0.025)
model.build_vocab(documents)
for epoch in range(model.epochs):
print("epoch "+str(epoch))
model.train(documents, total_examples=total_length, epochs=1)
ckpnt = model_name+"_epoch_"+str(epoch)
model.save(ckpnt)
print("Saving {}".format(ckpnt))
Also, how and when are the weights updated?
You don't have to manually run the iteration, and you shouldn't call train() more than once unless you're an expert who needs to do so for very specific reasons. If you've seen this technique in some online example you're copying, that example is likely outdated and misleading.
Call train() once, with your preferred number of passes as the epochs parameter.
Also, don't use a starting alpha learning-rate that is low (0.001) that then rises to a min_alpha value 25 times larger (0.025) - that's not how this is supposed to work, and most users shouldn't need to adjust the alpha-related defaults at all. (Again, if you're getting this from an online example somewhere - that's a bad example. Let them know they're giving bad advice.)
Also, 4000 training epochs is absurdly large. A value of 10-20 is common in published work dealing with tens of thousands to millions of documents. If your dataset is smaller, it may not work well with Doc2Vec, though sometimes more epochs (or a smaller vector_size) can still learn something generalizable from tiny data - but even then expect to use closer to dozens of epochs, not thousands.
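A minimal sketch of the recommended pattern, with a tiny toy corpus so it runs on its own (swap in your Documents(train_set); the epoch count is only an illustrative choice):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; replace with your own iterable of TaggedDocument objects.
documents = [TaggedDocument(words=["some", "words", "here"], tags=[0]),
             TaggedDocument(words=["more", "words", "there"], tags=[1])]

model = Doc2Vec(vector_size=100, dm=0, dbow_words=1, window=5,
                min_count=1, workers=4, epochs=20)  # alpha left at its defaults
model.build_vocab(documents)

# A single train() call; gensim handles the epoch loop and alpha decay itself.
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
model.save("doc2vec_checkpoint.model")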
A good intro (albeit with a tiny dataset that barely works with Doc2Vec) is the doc2vec-lee.ipynb Jupyter notebook that's bundled with gensim, and also viewable online at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
Good luck!

TensorFlow train batches for multiple epochs?

I don't understand how to run the result of tf.train.batch for multiple epochs. It runs out after one pass, of course, and I don't know how to restart it.
Maybe I can repeat it using tile, which is complicated but described in full here.
If I could redraw a batch each time that would be fine - I would need batch_size random integers between 0 and num_examples. (My examples all sit in local RAM.) I haven't found an easy way to get these random draws all at once.
Ideally there would be a reshuffle when the batches repeat, too, but it makes more sense to me to run an epoch and then reshuffle, etc., rather than concatenating the training set to itself num_epochs times and then shuffling.
I think this is confusing because I'm not really building an input pipeline, since my input fits in memory, yet I still need batching, shuffling and multiple epochs, which possibly requires more knowledge of input pipelines.
tf.train.batch simply groups upstream samples into batches, and nothing more. It is meant to be used at the end of an input pipeline. Data and epochs are dealt with upstream.
For example, if your training data fits into a tensor, you could use tf.train.slice_input_producer to produce samples. This function has arguments for shuffling and epochs.
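A minimal sketch of that suggestion with the old queue-runner API; the toy arrays, batch size and epoch count are assumptions for illustration:
import numpy as np
import tensorflow as tf  # TF1-style API

# Toy in-memory data standing in for the asker's examples.
features = np.random.rand(100, 4).astype(np.float32)
labels = np.random.randint(0, 2, size=(100,)).astype(np.int32)

# slice_input_producer handles shuffling and epoch counting upstream...
feature_slice, label_slice = tf.train.slice_input_producer(
    [features, labels], num_epochs=10, shuffle=True)

# ...while tf.train.batch only groups the upstream samples into batches.
feature_batch, label_batch = tf.train.batch(
    [feature_slice, label_slice], batch_size=25)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(),
              tf.local_variables_initializer()])  # num_epochs uses a local variable
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            f, l = sess.run([feature_batch, label_batch])  # training step goes here
    except tf.errors.OutOfRangeError:
        pass  # raised after num_epochs passes over the data
    finally:
        coord.request_stop()
        coord.join(threads)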

In Keras, If samples_per_epoch is less than the 'end' of the generator when it (loops back on itself) will this negatively affect result?

I'm using Keras with Theano to train a basic logistic regression model.
Say I've got a training set of 1 million entries; it's too large for my system to use the standard model.fit() without running out of memory.
I decide to use a python generator function and fit my model using model.fit_generator().
My generator function returns batch sized chunks of the 1M training examples (they come from a DB table, so I only pull enough records at a time to satisfy each batch request, keeping memory usage in check).
It's an endlessly looping generator: once it reaches the end of the 1 million examples, it wraps around and continues over the set.
There is a mandatory argument in fit_generator() to specify samples_per_epoch. The documentation indicates
samples_per_epoch: integer, number of samples to process before going to the next epoch.
I'm assuming fit_generator() doesn't reset the generator each time an epoch runs, hence the need for an infinitely looping generator.
I typically set the samples_per_epoch to be the size of the training set the generator is looping over.
However, if samples_per_epoch is smaller than the size of the training set the generator is working from, and nb_epoch > 1:
Will you get odd/adverse/unexpected training results, since the epochs will see differing sets of training examples to fit to?
If so, do you 'fast-forward' your generator somehow?
I'm dealing with something similar right now. I want to make my epochs shorter so I can record more information about the loss or adjust my learning rate more often.
Without diving into the code, I think the fact that .fit_generator works with the randomly augmented/shuffled data produced by the Keras built-in ImageDataGenerator supports your suspicion that it doesn't reset the generator per epoch. So I believe you should be fine: as long as the model is exposed to your whole training set, it shouldn't matter if some of it is trained in a separate epoch.
If you're still worried, you could try writing a generator that randomly samples your training set.
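A minimal sketch of such a generator (pure NumPy rather than the asker's DB-backed code; the batch size and the fit_generator arguments in the comment are illustrative):
import numpy as np

def random_batch_generator(X, y, batch_size=32):
    """Yield batches forever, drawing a fresh random sample each time."""
    n_rows = len(X)
    while True:
        idx = np.random.randint(0, n_rows, size=batch_size)
        yield X[idx], y[idx]

# Usage with the old Keras API discussed in the question:
# model.fit_generator(random_batch_generator(X_train, y_train, 128),
#                     samples_per_epoch=len(X_train), nb_epoch=10)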
