I need to understand how epochs/iterations affect the training of a deep learning model.
I am training a NER model with spaCy 2.1.3. My documents are very long, so I cannot train on more than 200 documents per iteration. So basically I do:
from document 0 to document 200 -> 20 epochs
from document 201 to document 400 -> 20 epochs
and so on.
Maybe it is a stupid question, but should the number of epochs for the next batches be the same as for the first 0-200? So if I chose 20 epochs, must I train the next batches with 20 epochs too?
Thanks
"I need to understand how the epochs/iterations affect the training of a deep learning model" - nobody is sure about that one. You may overfit after a certain number of epochs, so you should check your accuracy (or other metrics) on a validation dataset. Techniques like early stopping are often employed to combat this.
"so I cannot train more than 200 documents per iteration" - do you mean a batch of examples? If so, it should be smaller (200 documents is too much information in a single iteration and too costly). A batch size of 32 is usually used for textual data, up to 64. Batch sizes are often made smaller the more epochs you train, in order to settle into the minimum better (or to escape saddle points).
Furthermore, you should use Python generators so you can iterate over data bigger than your RAM capacity (a minimal sketch follows).
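For illustration, here is a minimal sketch of that generator idea, assuming (purely hypothetically) that the training documents live one per line in a file called train.txt; the function name document_chunks is mine, not part of spaCy:

def document_chunks(path, chunk_size=200):
    # stream documents from disk in fixed-size chunks instead of loading everything into RAM
    chunk = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            chunk.append(line.strip())
            if len(chunk) == chunk_size:
                yield chunk          # hand one chunk of documents to the training loop
                chunk = []
    if chunk:                        # leftover documents, if any
        yield chunk

You would then loop over document_chunks("train.txt") in your training code instead of keeping the whole corpus in memory.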
Last but not least, each example is usually trained on once per epoch. Different approaches (say oversampling or undersampling) are sometimes used, but usually when your class distribution is imbalanced (say 10% of examples belong to class 0 and 90% to class 1) or when the neural network has problems with a specific class (though that requires a more carefully thought-out approach).
The common practice is to train on each batch for only 1 epoch. Training on the same subset of data for 20 epochs can lead to overfitting, which harms your model's performance.
To understand better how the number of epochs trained on each batch affects your performance, you can do a grid search over that number and compare the results, as in the rough sketch below.
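For what it's worth, such a grid search could look roughly like this. The helper names (build_fresh_model, document_chunks, train_on_chunk, evaluate) are placeholders for your own spaCy training code, not real library calls:

best_score, best_epochs = None, None
for epochs_per_chunk in [1, 5, 10, 20]:
    model = build_fresh_model()                                  # hypothetical: re-initialize the NER model
    for chunk in document_chunks("train.txt", chunk_size=200):   # hypothetical: stream 200-document chunks
        train_on_chunk(model, chunk, epochs=epochs_per_chunk)    # hypothetical: your spaCy update loop
    score = evaluate(model, "validation.txt")                    # hypothetical: score on held-out documents
    if best_score is None or score > best_score:
        best_score, best_epochs = score, epochs_per_chunk
print("best epochs per chunk:", best_epochs)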
Related
I'm using a relatively simple neural network with fully connected layers in Keras. For some reason, the accuracy drastically increases to basically its final value after only one training epoch (likewise, the loss sharply decreases). I've tried architectures with larger and smaller numbers of hidden layers too. This network also performs poorly on the testing data, so I am trying to find a more optimal architecture or improve my training set accordingly.
It is trained on a set of 6500 1D array-like samples, and I'm using a batch size of 512.
As Murilo said, it's hard to say much without more information, but it can come from multiple things:
- Your network learns through the batches of each epoch, meaning that your ~12 batches (6500/512) are already enough to learn a good bit of the classification.
- Your weights are not really well initialized and produce a huge loss for the first epoch. The massive decrease in the loss is actually the solver 'squishing' the weights. The best explanation I found for this comes from A. Karpathy in his 'MakeMore' tutorial: https://youtu.be/P6sfmUTpUmc?t=260
Now, this sudden decrease of the loss is not extreme here (from 0.5 to 0.2), so I would not worry much. I agree with Murilo that low accuracy in validation can come from too few samples in your validation set, or from a bad shuffle between the train and validation sets.
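As a quick, hedged sanity check (my addition, not part of the answer above): for a roughly balanced classification problem, the loss of a freshly initialized network should be close to the cross-entropy of a uniform guess, -log(1/n_classes). If the very first reported loss is far above that, the sharp first-epoch drop is mostly the solver squishing badly initialized weights back to sensible values:

import numpy as np

n_classes = 2                                      # adjust to your problem
expected_initial_loss = -np.log(1.0 / n_classes)   # ~0.693 for two balanced classes
print("expected initial loss:", round(expected_initial_loss, 3))

# With a compiled Keras model (assumed to exist as `model`), compare this
# against the untrained network, e.g.:
# print(model.evaluate(x_train, y_train, verbose=0))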
The data has n_rows rows
The batch size is batch_size
I see some code uses:
n_batches = int(n_rows / batch_size)
What if n_rows is not a multiple of batch size?
Is the n_batches still correct?
In fact you can see that in a lot of code, and we know that labeled data is extremely valuable, so you don't want to lose precious labeled examples. At first glance it looks like a bug, and it seems that we are losing some training examples, but we have to take a closer look at the code.
In general, when you see that (as in the code you sent), one epoch only covers n_batches = int(n_rows / batch_size) full batches, but the data is shuffled after each epoch. Therefore, over time (after several epochs), the network sees all of your training examples. We're not losing any examples \o/
Small conclusion: when you see that, make sure the data is shuffled at each epoch, otherwise your network might never see some training examples.
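A minimal sketch of that pattern (the function name is mine, and data/labels are assumed to be NumPy arrays): the remainder examples are dropped within one epoch, but a fresh shuffle at the start of every epoch means different examples get dropped each time, so nothing is permanently lost.

import numpy as np

def iterate_batches(data, labels, batch_size, n_epochs, seed=0):
    rng = np.random.default_rng(seed)
    n_rows = len(data)
    n_batches = n_rows // batch_size            # same as int(n_rows / batch_size)
    for epoch in range(n_epochs):
        order = rng.permutation(n_rows)         # reshuffle at every epoch
        for b in range(n_batches):
            idx = order[b * batch_size:(b + 1) * batch_size]
            yield data[idx], labels[idx]        # always exactly batch_size examples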
What are the advantages of doing that?
It's efficient:
By using this mechanism you ensure that at each training step your network sees batch_size examples, and you never run a training step with only a small number of training examples.
It's more rigorous: imagine you have one example left over and you don't shuffle. At each epoch, assuming your loss is the average loss over the batch, that last example would be equivalent to a batch consisting of one element repeated batch_size times, i.e. it would be weighted as more important than the others. If you shuffle, this effect is reduced (since the leftover example changes over time), but it's more rigorous to keep a constant batch size during your training epoch.
There are also some advantages to shuffling your data during training; see:
this Stats Stack Exchange post
I'll also add that if you are using mechanisms such as Batch Normalization, it's better to have a constant batch size during training; for example, if n_rows % batch_size = 1, passing a single example as a batch during training can cause problems.
Note:
I'm talking about a constant batch size during a training epoch, not over the whole training cycle (multiple epochs). Even though the batch size is normally constant during the whole training process, you can find research work that modifies it during training, e.g. Don't Decay the Learning Rate, Increase the Batch Size.
I have trained an LSTM model for time series forecasting. I used early stopping with a patience of 150 epochs.
I used a dropout of 0.2, and this is the plot of the train and validation loss:
Early stopping stopped the training after 650 epochs and saved the best weights around epoch 460, where the validation loss was at its best.
My question is:
Is it normal that the train loss is always above the validation loss?
I know that if it were the opposite (validation loss above the train loss), it would be a sign of overfitting.
But what about this case?
EDIT:
My dataset is a time series with an hourly frequency. It is composed of 35,000 instances. I split the data into 80% train and 20% validation, but in temporal order. So, for example, the training set contains the data up to the beginning of 2017 and the validation set the data from 2017 until the end.
I created this plot by averaging the data over 15 days, and this is the result:
So maybe the reason is, as you said, that the validation data has an easier pattern. How can I solve this problem?
In most cases, the validation loss should be higher than the training loss because the labels in the training set are accessible to the model. In fact, one good habit when training a new network is to use a small subset of the data and see whether the training loss can converge to 0 (fully overfitting the training set). If not, it means the model is somehow unable to memorize the data.
Let's go back to your problem. The observation that the validation loss is less than the training loss does happen, but this is possibly not because of your model; rather, it is because of how you split the data. Consider that there are two types of patterns (A and B) in the dataset. If you split in a way that the training set contains both pattern A and pattern B while the small validation set only contains pattern B, and B is easier to recognize, then you might get a higher training loss.
In a more extreme example, pattern A is almost impossible to recognize, but only 1% of the dataset belongs to it, and the model can recognize all of pattern B. If the validation set happens to contain only pattern B, then the validation loss will be smaller.
As alex mentioned, using K-fold cross-validation is a good way to make sure every sample is used as both validation and training data. Also, printing out the confusion matrix to make sure all labels are relatively balanced is another method to try.
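Since this particular dataset is a time series, plain K-fold would let the model peek at the future; an expanding-window scheme such as sklearn's TimeSeriesSplit (my suggestion, not something the answer above specifies) gives a similar rotation without that leakage. X and y below are random placeholders for the real data:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.random.rand(35000, 10)     # placeholder: 35,000 hourly instances, 10 features each
y = np.random.rand(35000)         # placeholder targets

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    # each fold trains on the past and validates on the block that follows it
    print("fold", fold, "train ends at", train_idx[-1], "validation covers", val_idx[0], "-", val_idx[-1])
    # model.fit(X[train_idx], y[train_idx], validation_data=(X[val_idx], y[val_idx]))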
Usually the opposite is true, but since you are using dropout it is common to have the validation loss less than the training loss: the training loss is computed with dropout active, while the validation loss is computed with dropout turned off. And, like others have suggested, try k-fold cross-validation.
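One hedged way to check the dropout explanation (assuming a compiled Keras model named model and the same train/validation arrays; the names here are placeholders): evaluate() runs in inference mode, i.e. with dropout disabled, so comparing these two numbers removes the dropout asymmetry from the reported curves:

train_metrics = model.evaluate(X_train, y_train, verbose=0)   # dropout is off during evaluation
val_metrics = model.evaluate(X_val, y_val, verbose=0)
print("train (dropout off):", train_metrics)
print("validation:", val_metrics)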
I am trying to understand the epochs parameter in the Doc2Vec function and the epochs parameter in the train function.
In the following code snippet, I manually set up a loop of 4000 iterations. Is that required, or is passing 4000 as the epochs parameter to Doc2Vec enough? Also, how is epochs in Doc2Vec different from epochs in train?
documents = Documents(train_set)
model = Doc2Vec(vector_size=100, dbow_words=1, dm=0, epochs=4000, window=5,
                seed=1337, min_count=5, workers=4, alpha=0.001, min_alpha=0.025)
model.build_vocab(documents)
for epoch in range(model.epochs):
    print("epoch " + str(epoch))
    model.train(documents, total_examples=total_length, epochs=1)
    ckpnt = model_name + "_epoch_" + str(epoch)
    model.save(ckpnt)
    print("Saving {}".format(ckpnt))
Also, how and when are the weights updated?
You don't have to manually run the iteration, and you shouldn't call train() more than once unless you're an expert who needs to do so for very specific reasons. If you've seen this technique in some online example you're copying, that example is likely outdated and misleading.
Call train() once, with your preferred number of passes as the epochs parameter.
Also, don't use a starting alpha learning rate that is low (0.001) and then rises to a min_alpha value 25 times larger (0.025) - that's not how it is supposed to work, and most users shouldn't need to adjust the alpha-related defaults at all. (Again, if you're getting this from an online example somewhere, that's a bad example. Let them know they're giving bad advice.)
Also, 4000 training epochs is absurdly large. A value of 10-20 is common in published work dealing with tens of thousands to millions of documents. If your dataset is smaller, it may not work well with Doc2Vec, but sometimes more epochs (or a smaller vector_size) can still learn something generalizable from tiny data - but still expect to use closer to dozens of epochs, not thousands.
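For reference, a hedged sketch of the recommended pattern, reusing the variable names from the question (documents, model_name) and illustrative hyperparameter values: let Doc2Vec manage the epochs and the alpha decay internally, and call train() exactly once.

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(vector_size=100, dbow_words=1, dm=0, epochs=20,
                window=5, seed=1337, min_count=5, workers=4)
model.build_vocab(documents)
model.train(documents,
            total_examples=model.corpus_count,   # counted during build_vocab
            epochs=model.epochs)                 # all passes in a single call
model.save(model_name + "_final")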
A good intro (albeit with a tiny dataset that barely works with Doc2Vec) is the doc2vec-lee.ipynb Jupyter notebook that's bundled with gensim, and also viewable online at:
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb
Good luck!
Assuming we have 500k items of training data, does it matter if we train the model one item at a time, 'n' items at a time, or all at once?
Consider inputTrainingData and outputTrainingData to be [[]] and train_step to be any generic TensorFlow training step.
Option 1 Train one item at a time -
for i in range(len(inputTrainingData)):
    train_step.run(feed_dict={x: [inputTrainingData[i]], y: [outputTrainingData[i]], keep_prob: .60}, session=sess)
Option 2 Train on all at once -
train_step.run(feed_dict={x: inputTrainingData, y: outputTrainingData, keep_prob: .60}, session=sess)
Is there any difference between options 1 and 2 above as far as the quality of training is concerned?
Yes, there is a difference. Option 1 is much less memory-consuming but also much less accurate. Option 2 could eat up all of your RAM but should prove more accurate. However, if you use your whole training set at once, be sure to limit the number of steps to avoid over-fitting.
Ideally, use data in batches (typically between 16 and 256).
Most optimization techniques are 'stochastic', i.e. they rely on a statistical sample of examples to estimate a model update.
To sum up:
- More data => more accuracy (but more memory) => higher risk of over-fitting (so limit the number of training steps)
There is a difference between these options. Normally you use a batch size and train on, for example, 128 examples per iteration.
You could also use a batch size of one, like in the first of your examples.
The advantage of this method is that you can track how well the neural network is training as you go.
If you learn on all the data at once, you will be a little bit faster, but you will only know at the end whether the training went well.
The best way is to pick a batch size and learn batch by batch, so you can check the performance after every batch and keep the training under control.
Mathematically these two methods are different. One is called stochastic gradient descent and the other is called batch gradient descent. You are missing the most commonly used one - mini-batch gradient descent. There has been a lot of research on this topic, but basically different batch sizes have different convergence properties. Generally people use batch sizes greater than one but not the full dataset. This is usually necessary since most datasets cannot fit into memory all at once. Also, if your model uses batch normalization, then a batch size of one won't converge. This paper discusses the effects of batch size (among other things) on performance. The takeaway is that larger batch sizes do not generalize as well. (They actually argue it isn't the batch size itself but the fact that you have fewer updates when the batch is larger.) I would recommend a batch size of 32 to start, and experiment to see how batch size affects performance.
Here is a graph of the effects of batch size on training and validation performance from the paper I linked.
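For completeness, a hedged sketch of the mini-batch option the question is missing, written in the same TF1 feed_dict style. The variable names (inputTrainingData, outputTrainingData, x, y, keep_prob, train_step, sess) come from the question; batch_size = 32 and n_epochs = 10 are illustrative choices:

import numpy as np

inputs = np.asarray(inputTrainingData)
outputs = np.asarray(outputTrainingData)
batch_size, n_epochs = 32, 10
rng = np.random.default_rng(0)

for epoch in range(n_epochs):
    order = rng.permutation(len(inputs))            # reshuffle every epoch
    for start in range(0, len(inputs), batch_size):
        idx = order[start:start + batch_size]
        train_step.run(feed_dict={x: inputs[idx],
                                  y: outputs[idx],
                                  keep_prob: .60},
                       session=sess)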