Accumulate gradients with distributed strategy in Tensorflow 2 - python

I have implemented a distributed strategy to train my model on multiple GPUs.
strategy = tf.distribute.MirroredStrategy(devices=devices[:FLAGS.n_gpus])
strategy.run(fn=self.train_step, args=(model, data))
My model has now become bigger and more complex, and I had to reduce the batch size to fit it onto the GPUs.
The gradient is quite noisy now and I want to increase the batch size again by accumulating gradients.
Now my question is: is this even possible when using a mirrored strategy? I know that loss and gradients are combined across the replicas anyway, so is there a way to sum them across the replicas AND across, e.g., a loop running over several batches? I tried the straightforward thing and returned the per-replica gradients so I could add them up and apply them outside of strategy.run(), like this:
for b in batches:
    per_replica_gradients = strategy.run(fn=self.train_step, args=(model, b))
    total_gradient += per_replica_gradients
optimizer.apply_gradients(zip(total_gradient, model.trainable_variables))
but Tensorflow tells me that this is not possible and that the gradients have to be applied within strategy.run(). This also makes sense to me, but I wonder whether there is a way to accumulate gradients AND use a mirrored strategy?

You could use tf.distribute.ReplicaContext.all_reduce: this differs from Strategy.reduce in that it is meant for replica context and does not copy the results to the host device. all_reduce is typically used for reductions inside the training step, such as gradients.
More details can be found in the documentation here.
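As a rough sketch of how that can look (under a few assumptions: model and optimizer were created inside strategy.scope(), loss_fn already scales the loss by the global batch size, the dataset yields (features, labels) pairs, and accum_steps/num_updates are defined elsewhere; with the newest Keras optimizers the keyword argument is called skip_gradients_aggregation rather than experimental_aggregate_gradients):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

@tf.function
def accumulated_train_step(sub_batches):
    def step_fn(sub_batches):
        # Local accumulator: one slot per trainable variable.
        accum = [tf.zeros_like(v) for v in model.trainable_variables]
        for x, y in sub_batches:  # the sub-batches seen by this replica
            with tf.GradientTape() as tape:
                loss = loss_fn(y, model(x, training=True))
            grads = tape.gradient(loss, model.trainable_variables)
            accum = [a + g for a, g in zip(accum, grads)]
        # Sum the accumulated gradients across replicas while still in replica context.
        summed = tf.distribute.get_replica_context().all_reduce(
            tf.distribute.ReduceOp.SUM, accum)
        # Average over the number of accumulated sub-batches.
        mean_grads = [g / len(sub_batches) for g in summed]
        # The gradients are already aggregated, so the optimizer must not reduce them again.
        optimizer.apply_gradients(
            zip(mean_grads, model.trainable_variables),
            experimental_aggregate_gradients=False)

    strategy.run(step_fn, args=(sub_batches,))

dist_dataset = strategy.experimental_distribute_dataset(dataset)
it = iter(dist_dataset)
for _ in range(num_updates):
    # Fetch accum_steps distributed batches and hand them to a single update step.
    accumulated_train_step(tuple(next(it) for _ in range(accum_steps)))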

Related

How to deal with batch normalization for multiple datasets?

I am working on a task of generating synthetic data to help the training of my model. This means that the training is performed on synthetic + real data, and tested on real data.
I was told that batch normalization layers might end up finding weights that are good for both distributions at once during training, which is a problem since the distribution of my synthetic data is not exactly equal to the distribution of the real data. So the idea would be to have different 'copies' of the weights of the batch normalization layers, so that the neural network estimates different weights for synthetic and real data, and uses just the weights for real data at evaluation time.
Could someone suggest good ways to actually implement that in pytorch? My idea was the following: after each epoch of training on one dataset I would go through all batch-norm layers and save their weights, and at the beginning of the next epoch I would iterate over them again and load the right weights. Is this a good approach? Still, I am not sure how I should deal with the batch-norm weights at test time, since batch norm behaves differently there.
It sounds like the problem you're worried about is that your neural network will learn weights that work well when the batch norm is computed for a batch of both real and synthetic data, and then later at test time it will compute a batch norm on just real data?
Rather than trying to track multiple batch norms, you probably just want to set track_running_stats to True for your batch norm layer, and then put it into eval mode when testing. This will cause it to compute a running mean and variance over multiple batches while training, and then it will use that mean and variance later at test time, rather than looking at the batch stats for the test batches.
(This is often what you want anyway, because depending on your use case, you might be sending very small batches to the deployed model, and so you want to use a pre-computed mean and variance rather than relying on stats for those small batches.)
If you really want to be computing fresh means and variances at test time, what I would do is instead of passing a single batch with both real and synthetic data into your network, I'd pass in one batch of real data, then one batch of synthetic data, and average the two losses together before backprop. (Note that if you do this you should not rely on the running mean and variance later -- you'll have to either set track_running_stats to False, or reset it when you're done and run through a few dummy batches with only real data to compute reasonable values. This is because the running mean and variance stats are only useful if they're expected to be roughly the same for every batch, and you're instead polarizing the values by feeding in different types of data in different batches.)
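For concreteness, a minimal sketch of the first suggestion in PyTorch (the layer sizes and data below are made up, not from the question): leave track_running_stats=True so the layer keeps running estimates during training, and switch the model to eval() for testing so those stored estimates, not the test-batch statistics, are used.

import torch
import torch.nn as nn

# track_running_stats defaults to True; written out here for emphasis.
model = nn.Sequential(
    nn.Linear(32, 64),
    nn.BatchNorm1d(64, track_running_stats=True),
    nn.ReLU(),
    nn.Linear(64, 1),
)

# Training mode: batch statistics are used and the running mean/var are updated.
model.train()
out = model(torch.randn(16, 32))   # stand-in for a mixed real+synthetic batch

# Test mode: the stored running mean/var are used, so the output no longer
# depends on the size or composition of the test batch.
model.eval()
with torch.no_grad():
    test_out = model(torch.randn(4, 32))   # stand-in for a real-data batch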

How to update model parameters along with batch norm with accumulated gradients?

So, similar to this question: How to update model parameters with accumulated gradients?
I have a large network, and a very small batch size. To combat this I want to accumulate gradients (multiple forward passes) and then apply the update of the parameters using the mean gradient.
However, my network has BN layers. How should I handle this?
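For reference, a minimal sketch of the plain accumulation pattern the linked question is about, written TensorFlow 2 style and leaving the batch-norm issue aside (model, optimizer, loss_fn and batches are assumed to exist); note that each forward pass still normalises with its own small-batch statistics, which is exactly the complication being asked about here.

import tensorflow as tf

accum_steps = 4  # number of small batches to accumulate per parameter update
accum = [tf.zeros_like(v) for v in model.trainable_variables]

for i, (x, y) in enumerate(batches):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    accum = [a + g for a, g in zip(accum, grads)]

    if (i + 1) % accum_steps == 0:
        # Apply the mean of the accumulated gradients, then reset the accumulator.
        mean_grads = [a / accum_steps for a in accum]
        optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))
        accum = [tf.zeros_like(v) for v in model.trainable_variables]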

Is it normal to obtain different test results with different batch sizes with tensorflow

I am using tensorflow for a classification problem.
I have some utility for saving and loading my network. When I restore the network, I can specify a different batch size than for training.
My problem is that I am getting different test results when I restore with a different batch size. However, there is no difference when using the same batch size.
EDIT: Please note that I am not using dropout.
The difference is between 0% and 1% (0.5% on average).
My network is a fully connected layer that predicts two different outputs. I did not have the issue when I only had one task to predict.
My loss op is a sum of both losses.
What could be the issue? Does it have to do with Tensorflow's parallelization strategy?
This usually means that you did not set the phase_train parameter back to False when testing.
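As a hedged illustration of what that typically refers to (assuming phase_train is a boolean placeholder that controls batch-normalization layers; the question's actual graph is not shown, so all names below are illustrative):

import numpy as np
import tensorflow as tf  # TF1-style graph code, matching the question's era

phase_train = tf.placeholder(tf.bool, name="phase_train")
x = tf.placeholder(tf.float32, [None, 128])
h = tf.layers.dense(x, 64, activation=tf.nn.relu)
h = tf.layers.batch_normalization(h, training=phase_train)
logits = tf.layers.dense(h, 2)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    test_x = np.random.rand(10, 128).astype(np.float32)
    # At test time phase_train must be False so the stored running statistics
    # are used; if it stays True, the batch statistics (and therefore the test
    # batch size) leak into the predictions.
    preds = sess.run(logits, feed_dict={x: test_x, phase_train: False})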

Does size of training data for an epoch matter in tensorflow?

Assuming we have 500k items worth of training data, does it matter if we train the model one item at a time or 'n' items at a time or all at once?
Considering inputTrainingData and outputTrainingData to be [[]] and train_step to be any generic tensorflow training step.
Option 1 Train one item at a time -
for i in range(len(inputTrainingData)):
    train_step.run(feed_dict={x: [inputTrainingData[i]], y: [outputTrainingData[i]], keep_prob: .60}, session=sess)
Option 2 Train on all at once -
train_step.run(feed_dict={x: inputTrainingData, y: outputTrainingData, keep_prob: .60}, session=sess)
Is there any difference between options 1 and 2 above as far as the quality of training is concerned?
Yes, there is a difference. Option 1 is much less memory consuming but is also much less accurate. Option 2 could eat up all of your RAM but should prove more accurate. However, if you use all your training set at once, be sure to limit the number of steps to avoid over-fitting.
Ideally, use data in batches (typically between 16 and 256).
Most optimization techniques are 'stochastic', i.e. they rely on a statistical sample of examples to estimate a model update.
To sum up:
- More data per step => a more accurate gradient estimate (but more memory) => higher risk of over-fitting (so limit the number of training steps)
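A minimal sketch of the batched middle ground this answer recommends, reusing the placeholders and train_step from the question (batch_size and num_epochs are illustrative values):

batch_size = 64   # typically tuned somewhere in the 16-256 range
num_epochs = 10

for epoch in range(num_epochs):
    for start in range(0, len(inputTrainingData), batch_size):
        end = start + batch_size
        train_step.run(
            feed_dict={x: inputTrainingData[start:end],
                       y: outputTrainingData[start:end],
                       keep_prob: .60},
            session=sess)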
There is a difference between these options. Normally you train with a batch size of, for example, 128 examples per iteration.
You could also use a batch size of one, as in your first example.
The advantage of that method is that you can monitor how well the network is training after every single example.
If you learn on all the data at once, training is a little faster, but you only find out at the end whether the result is good.
The best approach is to pick a batch size and learn batch by batch, so you can check the training progress after every batch and adjust accordingly.
Mathematically these two methods are different. One is called stochastic gradient descent and the other is called batch gradient descent. You are missing the most commonly used one - mini-batch gradient descent. There has been a lot of research on this topic, but basically different batch sizes have different convergence properties. Generally people use batch sizes that are greater than one but not the full dataset. This is usually necessary since most datasets cannot fit into memory all at once. Also, if your model uses batch normalization then a batch size of one won't converge. This paper discusses the effects of batch size (among other things) on performance. The takeaway is that larger batch sizes do not generalize as well. (They actually argue it isn't the batch size itself but the fact that you have fewer updates when the batch is larger.) I would recommend a batch size of 32 to start and experiment to see how batch size affects performance.
Here is a graph of the effects of batch size on training and validation performance from the paper I linked.

How to efficiently compute gradients layer-wise in Tensorflow?

I am trying to implement the distributed synchronous SGD approach described in this paper using Tensorflow. For that I need to compute and apply gradients layer-wise. In principle I can do it in the following way (note: the code is incomplete):
# WORKER CODE
opt = tf.train.GradientDescentOptimizer(learning_rate)
for layer_vars in all_layer_vars:
    grads_vars = opt.compute_gradients(loss, layer_vars)
    grads = sess.run([grad_var[0] for grad_var in grads_vars], feed_dict)
    send_grads_to_master(zip(grads, layer_vars))

# MASTER CODE
while True:
    grads_vars = receive_grads_from_worker()
    sess.run(opt.apply_gradients(grads_vars))
What I wonder is whether in this scenario (with several compute_gradients() calls, within different session.run()'s) the number of internal operations performed by Tensorflow is the same or higher than in the "standard" scenario where all grads are computed with just one invocation of compute_gradients().
That is, thinking about the backpropagation algorithm, I wonder whether in this distributed scenario Tensorflow will compute the different "deltas" only once, or not. If not, is there a more efficient way of doing what I want?
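For comparison, a rough sketch of the single-call variant mentioned above, in which all gradients are computed with one compute_gradients() and one sess.run() and only regrouped per layer afterwards (learning_rate, loss, all_layer_vars, sess, feed_dict and send_grads_to_master are assumed as in the worker snippet):

opt = tf.train.GradientDescentOptimizer(learning_rate)

# One graph construction and one backward pass for all variables.
all_vars = [v for layer_vars in all_layer_vars for v in layer_vars]
grads_vars = opt.compute_gradients(loss, all_vars)
all_grads = sess.run([g for g, _ in grads_vars], feed_dict)

# Regroup the flat gradient list layer by layer before sending.
i = 0
for layer_vars in all_layer_vars:
    layer_grads = all_grads[i:i + len(layer_vars)]
    send_grads_to_master(zip(layer_grads, layer_vars))
    i += len(layer_vars)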
