The data has n_rows rows
The batch size is batch_size
I see some code uses:
n_batches = int(n_rows / batch_size)
What if n_rows is not a multiple of batch size?
Is n_batches still correct?
In fact you can see this in quite a lot of code, and since labeled data is extremely valuable you don't want to lose any precious labeled examples. At first glance it looks like a bug, and it seems that we are losing some training examples, but we have to take a closer look at the code.
In general, as in the code that you sent, one epoch consists of n_batches = int(n_rows / batch_size) batches, so it only covers n_batches * batch_size examples, and the data is shuffled after each epoch. Therefore, over time (after several epochs) you'll see all your training examples. We're not losing any examples \o/
Small conclusion: if you see this pattern, make sure the data is shuffled at each epoch, otherwise your network might never see some training examples.
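To make the mechanism concrete, here is a minimal NumPy sketch (my own illustration, not from the code in the question; the sizes are arbitrary) of drop-last batching with a per-epoch shuffle:

import numpy as np

n_rows, batch_size = 10, 3
n_batches = n_rows // batch_size          # 3 batches per epoch, 1 example left over
indices = np.arange(n_rows)
seen = set()

for epoch in range(5):
    np.random.shuffle(indices)            # the per-epoch shuffle is what saves the leftover example
    for b in range(n_batches):
        batch_idx = indices[b * batch_size:(b + 1) * batch_size]
        seen.update(batch_idx.tolist())   # stand-in for one training step on this batch

print(sorted(seen))                       # after a few epochs, typically all 10 indices

Because the leftover example changes from epoch to epoch, nothing is permanently dropped.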
What are the advantages of doing that?
It's efficient:
By using this mechanism you ensure that at each training step your network sees batch_size examples, and you never perform a training step with only a handful of examples.
It's more rigorous: imagine you have one example left over and you don't shuffle. At each epoch, assuming your loss is the average loss of the batch, this last example is equivalent to a batch consisting of one element repeated batch_size times; it's like weighting this example to have more importance. If you shuffle, this effect is reduced (since the remaining example changes over time), but it's more rigorous to have a constant batch size during a training epoch.
There are also other advantages to shuffling your data during training; see this Stack Exchange post.
I'll also add that if you are using mechanisms such as Batch Normalization, it's better to have a constant batch size during training: for example, if n_rows % batch_size == 1, passing a single example as a batch during training can cause trouble.
Note:
I'm talking about a constant batch size during a training epoch, not over the whole training cycle (multiple epochs), because even though the batch size is normally constant over the whole training process, there is research that modifies the batch size during training, e.g. Don't Decay the Learning Rate, Increase the Batch Size.
Related
We are developing a prediction model using deepchem's GCNModel.
Training the model and verifying its performance proceeded without problems, but we found that a lot of time is spent on prediction.
We are trying to predict a total of 1 million data points, and the parameters used are as follows.
model = GCNModel(n_tasks=1, mode='regression', number_atom_features=32, learning_rate=0.0001, dropout=0.2, batch_size=32, device=device, model_dir=model_path)
I changed the batch size to try to improve performance, and found that prediction was faster when the batch size was decreased than when it was increased.
All models had the same GPU memory usage.
From common sense, I would expect that the larger the batch size, the faster prediction would be. Can you tell me why it works in reverse?
We would be grateful if you could also let us know how we can further improve the prediction time.
Let's clarify some definitions first.
Epoch
The number of complete passes your model and learning algorithm make through your entire dataset.
BatchSize
The number of samples (individual rows of your training data) processed before the internal model is updated.
So your batch size is something between 1 and len(training_data).
Generally, a larger batch size gives a more accurate gradient estimate on the training data.
Epoch ↑ Batch Size ↑ Accuracy ↑ Speed ↓
So the short answer to the question is: a larger batch size takes more memory, needs more processing per step, and so obviously takes a longer time to learn.
Here is a link with more details: https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network
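To make these definitions concrete, here is a small illustrative calculation (a sketch of my own, with arbitrary numbers) relating dataset size, batch size, steps per epoch, and the total number of weight updates:

import math

n_samples = 50_000
batch_size = 32
epochs = 10

steps_per_epoch = math.ceil(n_samples / batch_size)   # updates per complete pass
total_updates = steps_per_epoch * epochs               # updates over the whole run

print(steps_per_epoch)   # 1563
print(total_updates)     # 15630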
There are two components regarding the speed:
Your batch size and model size
Your CPU/GPU power in spawning and processing batches
And the two of them need to be balanced. For example, if your model has finished predicting on the current batch but the next batch is not yet spawned, you will notice a drop in GPU utilization for a brief moment. Sadly there is no built-in metric that directly tells you about this balance - try using time.time() to benchmark your model's prediction as well as the dataloader speed.
However, I don't think that's worth the effort, so you can keep decreasing the batch size up to the point where there is no further improvement - that's where to stop.
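As a rough illustration of the kind of benchmark meant above, here is a sketch where make_batches and predict are hypothetical stand-ins for your own dataloader and model's prediction call; only the timing pattern matters:

import time

def make_batches(data, batch_size):
    # Stand-in batch generator; replace with your real dataloader.
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

def predict(batch):
    # Stand-in for your model's prediction call.
    time.sleep(0.001)
    return [0.0] * len(batch)

data = list(range(10_000))
data_time, predict_time = 0.0, 0.0

t0 = time.time()
for batch in make_batches(data, batch_size=32):
    t1 = time.time()
    data_time += t1 - t0          # time spent producing the batch
    predict(batch)
    t0 = time.time()
    predict_time += t0 - t1       # time spent inside the model

print(f"data: {data_time:.2f}s  predict: {predict_time:.2f}s")

Whichever side dominates tells you whether to optimize the data pipeline or the model's prediction path.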
Obviously, I know that adding validation data makes training take longer, but the time difference I am talking about here is absurd. Code:
# Training
def training(self, callback_bool):
    if callback_bool:
        callback_list = []
    else:
        callback_list = []

    self.history = self.model.fit(self.x_train, self.y_train,
                                  validation_data=(self.x_test, self.y_test),
                                  batch_size=1, steps_per_epoch=10, epochs=100)
The code above takes more than 30 minutes to train even though the size of my test data is only 10,000 data points. The size of my train data is 40,000 data points, and when I train without validation data, I am done within seconds. Is there a way to remedy this? Why does it take this long? To boot, I am training on a GPU as well!
I assume validation works as intended and the problem is in the training process itself. You are using batch_size = 1 and steps_per_epoch = 10, which means the model sees only 10 data points during every epoch. That's why it takes only a few seconds. On the other hand, you don't use the validation_steps argument, which means the validation after every epoch runs until your validation dataset is exhausted, i.e. for 10,000 steps. Hence the difference in times. You can read more about model.fit and its arguments in the official documentation.
If your training dataset isn't infinite, I suggest removing the steps_per_epoch argument. If it is infinite, pass it the value len(x_train) // batch_size instead. That way the model will be fed every single training data point in each epoch. I expect every epoch to take ~1.5 hours instead of the seconds you currently see. I also suggest increasing the batch_size if there is no specific reason to use a batch size of 1.
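For concreteness, the fit call from the question might then look something like this (the batch size of 32 is just an illustrative choice):

self.history = self.model.fit(self.x_train, self.y_train,
                              validation_data=(self.x_test, self.y_test),
                              batch_size=32,      # instead of 1
                              epochs=100)         # steps_per_epoch omitted for a finite dataset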
I need to understand how the number of epochs/iterations affects the training of a deep learning model.
I am training an NER model with Spacy 2.1.3; my documents are very long, so I cannot train on more than 200 documents per iteration. So basically I do:
from the document 0 to the document 200 -> 20 epochs
from the document 201 to the document 400 -> 20 epochs
and so on.
Maybe it is a stupid question, but should the number of epochs for the next batches be the same as for the first 0-200? So if I chose 20 epochs, must I train the next batches with 20 epochs too?
Thanks
i need to understand how the epochs/iterations affect the training of a deep learning model - nobody is entirely sure about that one. You may overfit after a certain number of epochs, so you should check your accuracy (or other metrics) on a validation dataset. Techniques like early stopping are often employed in order to battle this.
so i cannot train more than 200 documents per iteration - do you mean a batch of examples? If so, it should be smaller (that's too much information in a single iteration and too costly). A batch size of 32 is usually used for textual data, up to 64. Batch sizes are often made smaller the more epochs you train, in order to settle into a minimum better (or to escape saddle points).
Furthermore, you should use Python generators so you can iterate over data bigger than your RAM capacity (see the sketch after this answer).
Last but not least, each example is usually trained on once per epoch. Different approaches (say oversampling or undersampling) are sometimes used, but usually when your class distribution is imbalanced (say 10% of examples belong to class 0 and 90% to class 1) or the neural network has problems with a specific class (though this one requires a more well-thought-out approach).
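Building on the point about generators, here is a minimal, framework-agnostic sketch of streaming mini-batches of documents; it is not Spacy-specific, and the actual update call is left as a comment:

def minibatches(documents, batch_size=32):
    # Yield successive batches of `batch_size` documents without loading
    # the whole corpus into memory; `documents` can itself be a lazy iterator.
    batch = []
    for doc in documents:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                      # final, possibly smaller, batch
        yield batch

docs = ["doc %d" % i for i in range(1000)]   # stand-in for your real documents
for epoch in range(20):
    for batch in minibatches(docs, batch_size=32):
        pass                       # call your training/update step on `batch` here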
The common practice is to train on each batch for only 1 epoch. Training on the same subset of data for 20 epochs can lead to overfitting, which harms your model's performance.
To understand better how the number of epochs trained on each batch affects your performance, you can do a grid search and compare the results.
Assuming we have 500k items worth of training data, does it matter if we train the model one item at a time or 'n' items at a time or all at once?
Consider inputTrainingData and outputTrainingData to be [[]] and train_step to be any generic TensorFlow training step.
Option 1 Train one item at a time -
for i in range(len(inputTrainingData)):
    train_step.run(feed_dict={x: [inputTrainingData[i]], y: [outputTrainingData[i]], keep_prob: .60}, session=sess)
Option 2 Train on all at once -
train_step.run(feed_dict={x: inputTrainingData, y: outputTrainingData, keep_prob: .60}, session=sess)
Is there any difference between options 1 and 2 above as far as the quality of training is concerned?
Yes, there is a difference. Option 1 consumes much less memory but is also much less accurate. Option 2 could eat up all of your RAM but should prove more accurate. However, if you use your whole training set at once, be sure to limit the number of steps to avoid over-fitting.
Ideally, use data in batches (typically between 16 and 256).
Most optimization techniques are 'stochastic', i.e. they rely on a statistical sample of examples to estimate a model update.
To sum up:
- More data => more accuracy (but more memory) => higher risk of over-fitting (so limit the number of training steps)
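As a sketch of that middle ground, here is what a mini-batched version of the question's loop could look like, reusing the placeholders x, y, keep_prob and the train_step/sess objects from the question (the batch size of 64 is an arbitrary illustrative choice):

batch_size = 64                                   # illustrative; typically 16-256
for start in range(0, len(inputTrainingData), batch_size):
    end = start + batch_size
    train_step.run(feed_dict={x: inputTrainingData[start:end],
                              y: outputTrainingData[start:end],
                              keep_prob: .60},
                   session=sess)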
There is a difference between these options. Normally you use a batch size to train, for example 128 examples per iteration.
You could also use a batch size of one, like in your first example.
The advantage of that method is that you can output the training performance of the neural network after every step.
If you learn on all the data at once, you will be a little bit faster, but you will only know at the end whether your performance is good.
The best way is to use a batch size and learn batch by batch, so that you can output your performance after every batch and keep it under control.
Mathematically these two methods are different. One is called stochastic gradient descent and the other is called batch gradient descent. You are missing the most commonly used one - mini-batch gradient descent. There has been a lot of research on this topic, but basically different batch sizes have different convergence properties. Generally people use batch sizes that are greater than one but not the full dataset. This is usually necessary since most datasets cannot fit into memory all at once. Also, if your model uses batch normalization, then a batch size of one won't converge. This paper discusses the effects of batch size (among other things) on performance. The takeaway is that larger batch sizes do not generalize as well. (They actually argue it isn't the batch size itself but the fact that you have fewer updates when the batch is larger.) I would recommend a batch size of 32 to start, and experiment to see how batch size affects performance.
Here is a graph of the effects of batch size on training and validation performance from the paper I linked.
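To illustrate the convergence differences on a toy problem, here is a self-contained NumPy sketch (my own illustration, not from the linked paper) fitting y ≈ 2x + 1 with the three update schemes; note how the full-batch run, which gets only one update per pass, lags behind after the same number of epochs:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 2.0 * X + 1.0 + 0.1 * rng.normal(size=1000)

def fit(batch_size, lr=0.05, epochs=5):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            err = w * X[idx] + b - y[idx]
            w -= lr * np.mean(err * X[idx])   # gradient of 0.5 * mean squared error
            b -= lr * np.mean(err)
    return w, b

print(fit(batch_size=1))        # stochastic GD: close to (2.0, 1.0)
print(fit(batch_size=32))       # mini-batch GD: close to (2.0, 1.0)
print(fit(batch_size=len(X)))   # full-batch GD: still far off after only 5 updates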
I'm using Keras with Theano to train a basic logistic regression model.
Say I've got a training set of 1 million entries; it's too large for my system to use the standard model.fit() without running out of memory.
I decide to use a python generator function and fit my model using model.fit_generator().
My generator function returns batch sized chunks of the 1M training examples (they come from a DB table, so I only pull enough records at a time to satisfy each batch request, keeping memory usage in check).
It's an endlessly looping generator: once it reaches the end of the 1 million entries, it loops back and continues over the set.
There is a mandatory argument in fit_generator() to specify samples_per_epoch. The documentation indicates
samples_per_epoch: integer, number of samples to process before going to the next epoch.
I'm assuming fit_generator() doesn't reset the generator each time an epoch runs, hence the need for an infinitely running generator.
I typically set the samples_per_epoch to be the size of the training set the generator is looping over.
However, if samples_per_epoch is smaller than the size of the training set the generator is working from, and nb_epoch > 1:
Will you get odd/adverse/unexpected training results, as it seems the epochs will have differing sets of training examples to fit to?
If so, do you 'fast-forward' your generator somehow?
I'm dealing with something similar right now. I want to make my epochs shorter so I can record more information about the loss or adjust my learning rate more often.
Without diving into the code, I think the fact that .fit_generator works with the randomly augmented/shuffled data produced by the Keras built-in ImageDataGenerator supports your suspicion that it doesn't reset the generator per epoch. So I believe you should be fine: as long as the model is exposed to your whole training set, it shouldn't matter if some of it is trained on in a separate epoch.
If you're still worried, you could try writing a generator that randomly samples your training set.
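If you do go that route, a generator along these lines could work as a sketch; fetch_rows is a hypothetical placeholder for however you pull the corresponding records from your DB table:

import numpy as np

def random_batch_generator(n_rows, batch_size, fetch_rows):
    # fetch_rows(indices) is assumed to return (X, y) arrays for those row indices.
    while True:                                  # fit_generator expects an endless generator
        idx = np.random.randint(0, n_rows, size=batch_size)
        yield fetch_rows(idx)

# Usage sketch (Keras 1-style arguments, as in the question):
# model.fit_generator(random_batch_generator(1000000, 128, fetch_rows),
#                     samples_per_epoch=1000000, nb_epoch=10)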