How to deal with batch normalization for multiple datasets? - python

I am working on generating synthetic data to help train my model. This means that training is performed on synthetic + real data, while testing is done on real data only.
I was told that batch normalization layers might be trying to find weights (and statistics) that fit both datasets at once during training, which is a problem since the distribution of my synthetic data is not exactly equal to the distribution of the real data. The idea would therefore be to keep different 'copies' of the batch normalization layers' weights, so that the network estimates separate weights for synthetic and real data and uses only the real-data weights for evaluation.
Could someone suggest good ways to actually implement that in PyTorch? My idea was the following: after each epoch of training on one dataset, I would go through all batch-norm layers and save their weights; at the beginning of the next epoch I would iterate through them again and load the right weights. Is this a good approach? I am also unsure how to handle the batch-norm weights at test time, since batch norm behaves differently during evaluation.
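A minimal sketch of the weight-swapping idea described above, assuming a plain PyTorch model; the function and variable names (batchnorm_state, bn_states, train_one_epoch, loader) are hypothetical and only illustrate the bookkeeping:

import torch.nn as nn

_BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

def batchnorm_state(model):
    # copy the weights AND running statistics of every batch-norm layer
    return {name: {k: v.clone() for k, v in m.state_dict().items()}
            for name, m in model.named_modules() if isinstance(m, _BN_TYPES)}

def load_batchnorm_state(model, state):
    # restore a previously saved batch-norm state into the model
    for name, m in model.named_modules():
        if name in state:
            m.load_state_dict(state[name])

bn_states = {"real": None, "synthetic": None}   # one copy per domain

def train_one_epoch(model, loader, domain):
    if bn_states[domain] is not None:
        load_batchnorm_state(model, bn_states[domain])
    # ... usual training loop over `loader` goes here ...
    bn_states[domain] = batchnorm_state(model)

# at test time, load the batch-norm state estimated on real data:
# load_batchnorm_state(model, bn_states["real"]); model.eval()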

It sounds like the problem you're worried about is that your neural network will learn weights that work well when the batch norm is computed for a batch of both real and synthetic data, and then later at test time it will compute a batch norm on just real data?
Rather than trying to track multiple batch norms, you probably just want to set track_running_stats to True for your batch norm layer, and then put it into eval mode when testing. This will cause it to compute a running mean and variance over multiple batches while training, and then it will use that mean and variance later at test time, rather than looking at the batch stats for the test batches.
(This is often what you want anyway, because depending on your use case, you might be sending very small batches to the deployed model, and so you want to use a pre-computed mean and variance rather than relying on stats for those small batches.)
If you really want to be computing fresh means and variances at test time, then rather than passing a single batch containing both real and synthetic data into your network, I would pass in one batch of real data, then one batch of synthetic data, and average the two losses before backprop. (Note that if you do this you should not rely on the running mean and variance later -- you'll have to either set track_running_stats to False, or reset it when you're done and run through a few dummy batches with only real data to compute reasonable values. This is because the running mean and variance stats are only useful if they're expected to be roughly the same for every batch, and you're instead polarizing the values by feeding in different types of data in different batches.)
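Here is a rough sketch of that suggestion, assuming a generic PyTorch classifier; model, optimizer, criterion and the two batches are placeholders, not code from the question:

import torch.nn as nn

# track_running_stats=True (the PyTorch default) keeps running mean/variance
# during training; model.eval() later makes BatchNorm use those running stats.
bn_example = nn.BatchNorm2d(64, track_running_stats=True)

def train_step(model, optimizer, criterion, real_batch, synth_batch):
    model.train()
    optimizer.zero_grad()
    x_r, y_r = real_batch
    x_s, y_s = synth_batch
    # one forward pass per domain, so each batch-norm call sees a single domain
    loss = 0.5 * (criterion(model(x_r), y_r) + criterion(model(x_s), y_s))
    loss.backward()
    optimizer.step()
    return loss.item()

# at test time:
# model.eval()                   # BatchNorm now uses the running statistics
# predictions = model(test_batch)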

Related

Training design / sequential loading of images for Mask-RCNN

I am training a deep learning model using Mask R-CNN from the following git repository: matterport/Mask_RCNN. I rely on heavy augmentation of my dataset (original dataset: 59 images of 1988x1355x3 with > 80 annotations each), which I store locally (necessary to evaluate the type/degree of augmentation against validation metrics). The augmented dataset contains 6000 images. The images vary in x and y dimensions because of resolution reduction and affine transformations - I assume the differing x,y dimensions will not affect the final tests.
However, my Python kernel crashes whenever I load more than 'X' images to train the model.
Hence, I came up with the idea of splitting the dataset into sub-datasets and iterating through them, using the last trained weights as the starting point for each new round. But I am not sure whether the results will be the same (read: the same, taking the stochastic nature of stochastic gradient descent into account)?
I also wonder whether the results would be the same if I don't iterate through the sub-datasets within each epoch, but instead train each sub-dataset for Y epochs (e.g. 20 for 'heads' only, 10 for 'all layers')?
Yet I am sure this is not the most efficient way of solving this issue. Ideas for improvement are welcome.
Note that I am not using keras.preprocessing.image.ImageDataGenerator(); as I understand it, it randomly generates data and feeds it to the model, replacing the input for the epoch, whereas I would like to feed the whole dataset to the model.
I came up with the idea of splitting the dataset into sub-datasets and iterating through them, using the last trained weights as the starting point for each new round. But I am not sure if the results will be the same?
You are doing the same thing ImageDataGenerator does (creating your own mini-batches), just less optimally. The result will be the same with respect to what?
If you mean with respect to a model trained with all the data in a single batch: most probably not, as a smaller batch means slower convergence. But this can be compensated for by training for more epochs.
Another issue is reproducibility. If you want to reproduce your model with the same results each time, just use seeds:
import random
random.seed(1)                 # seed Python's built-in RNG

import numpy as np
np.random.seed(1)              # seed NumPy

import tensorflow
tensorflow.random.set_seed(1)  # seed TensorFlow (TF 2.x API)
Another concept is gradient accumulation. It lets you train with a large effective batch size without keeping too many images in memory at a time (a sketch of the idea follows the link below).
https://github.com/CyberZHG/keras-gradient-accumulation
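To illustrate the idea, here is a hand-rolled sketch using TensorFlow's GradientTape (not the API of the linked library): gradients from several small batches are summed and applied once, emulating a larger batch size.

import tensorflow as tf

def accumulate_and_apply(model, optimizer, loss_fn, small_batches):
    # sum gradients over several small batches, then apply them in one step
    accumulated = [tf.zeros_like(v) for v in model.trainable_variables]
    for x, y in small_batches:                     # e.g. 4 batches of 25 images
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        grads = tape.gradient(loss, model.trainable_variables)
        accumulated = [a + g for a, g in zip(accumulated, grads)]
    n = float(len(small_batches))
    optimizer.apply_gradients(
        [(g / n, v) for g, v in zip(accumulated, model.trainable_variables)])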
Finally, keras.preprocessing.image.ImageDataGenerator() in fact trains on the whole dataset; it just chooses a random sample at each step (you are doing the same thing with your sub-datasets).
You can seed the ImageDataGenerator so it is reproducible rather than entirely random.
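For example (x_train and y_train are placeholders here; seed is an argument of the Keras flow() method):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=20, horizontal_flip=True)
# passing a seed makes the augmentation sequence reproducible across runs
generator = datagen.flow(x_train, y_train, batch_size=32, seed=1)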

Does fitting a model to a new data set overwrite previous training progress?

I'd like to implement some active learning algorithm (modAL) with Keras. But I'd like to know whether initiating multiple training instances (i.e., running .fit() more than once) will build on previous training, or if the weights are reset. In other words, is training additive or iterative?
In case training starts from scratch each time, is there a way to have the model build on previous training?
For a given iteration, the inputs are provided to the network with the network weights set to certain values. For each applied input, back-propagation is used to calculate a gradient. For a batch size of 100, 100 inputs are applied and the resulting 100 gradients are averaged to determine new values for the network weights. Those weights are then used to process the next batch of 100 inputs. So the process is iterative, not additive: each call to .fit() continues from whatever weights the model currently has rather than re-initializing them. There are many possible explanations for why a network might appear not to be learning.
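A small sketch of this behaviour in Keras (the model and the x_pool_1/y_pool_1, x_pool_2/y_pool_2 arrays are hypothetical stand-ins for successive labelled pools from an active-learning loop):

from tensorflow import keras

# tiny placeholder model; the point is only that .fit() does not reset weights
model = keras.Sequential([keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="sgd", loss="mse")

model.fit(x_pool_1, y_pool_1, epochs=5)   # first labelled pool
model.fit(x_pool_2, y_pool_2, epochs=5)   # continues from the weights left above

# weights are re-initialized only if the model object is rebuilt from scratch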

Is there any rules of thumb for the relation of number of iterations and training size for lightgbm?

When I train a classification model using lightgbm, I usually use validation set and early stopping to determine the number of iterations.
Now I want to combine the training and validation sets to train a model (so I have more training examples) and use that model to predict the test data. Should I change the number of iterations derived from the validation process?
Thanks!
As you said in your comment, this is not comparable to the number of epochs in deep learning, because deep learning is usually stochastic.
With LightGBM, all parameters and features being equal, adding 10% to 15% more training points means we can expect the trees to look alike: with more information your split values will be better, but it is unlikely to drastically change your model (this is less true if you use parameters such as bagging_fraction, or if the added points come from a different distribution).
I have seen people multiply the number of iterations by 1.1 (I can't find my sources, sorry). Intuitively it makes sense to add some trees, as you are potentially adding information. Experimentally this value worked well, but the optimal value will depend on your model and data.
For a similar problem in deep learning with Keras: I use an early stopper and cross-validation with train and validation data, and let the model optimize itself against the validation data during training.
After each training run I test the model on the test data and examine the mean accuracies. After each run I also save the stopped_epoch from the EarlyStopping callback. If the CV scores are satisfying, I take the mean of the stopped epochs, do a full training (including all the data I have) for that mean number of epochs, and save the model.
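Roughly, that workflow looks like this (build_model, folds, x and y are placeholders; stopped_epoch and restore_best_weights are attributes/arguments of Keras's EarlyStopping):

from tensorflow.keras.callbacks import EarlyStopping

stopped_epochs = []
for train_idx, val_idx in folds:                  # folds from e.g. sklearn KFold
    model = build_model()                         # hypothetical model factory
    stopper = EarlyStopping(patience=10, restore_best_weights=True)
    model.fit(x[train_idx], y[train_idx],
              validation_data=(x[val_idx], y[val_idx]),
              epochs=500, callbacks=[stopper])
    stopped_epochs.append(stopper.stopped_epoch)

# final training on all data for the mean number of stopped epochs
final_model = build_model()
final_model.fit(x, y, epochs=int(sum(stopped_epochs) / len(stopped_epochs)))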
I'm not aware of a well-established rule of thumb for such an estimate. As Florian has pointed out, sometimes people rescale the number of iterations obtained from early stopping by a factor. If I remember correctly, the factor typically assumes a linear dependence of the optimal number of trees on the data size; i.e. for 10-fold CV this would be a rescaling factor of about 1.1 (10/9). But there is no solid justification for this. As Florian also pointed out, the dependence around the optimum is typically reasonably flat, so a few trees more or less will not have a dramatic effect.
Two suggestions:
do k-fold validation instead of a single train-validation split. This lets you evaluate how stable the estimate of the optimal number of trees is; if it fluctuates a lot between folds, do not rely on the estimate :) (a sketch of this is given after these suggestions)
fix the size of the validation sample and re-train your model with early stopping on a gradually increasing training set. This lets you evaluate how the optimal number of trees depends on the sample size and extrapolate it to the full sample size.
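A sketch of the k-fold estimate combined with the ~1.1 rescaling, assuming a recent LightGBM version where early stopping is passed as a callback; X and y are placeholder NumPy arrays, not data from the question:

import lightgbm as lgb
from sklearn.model_selection import KFold

best_iters = []
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    model = lgb.LGBMClassifier(n_estimators=5000, learning_rate=0.05)
    model.fit(X[train_idx], y[train_idx],
              eval_set=[(X[val_idx], y[val_idx])],
              callbacks=[lgb.early_stopping(stopping_rounds=50)])
    best_iters.append(model.best_iteration_)       # optimal trees per fold

# inflate the averaged estimate by ~10/9 to account for the extra 1/10 of data
n_trees = int(1.1 * sum(best_iters) / len(best_iters))
final_model = lgb.LGBMClassifier(n_estimators=n_trees, learning_rate=0.05)
final_model.fit(X, y)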

What's the reason for the weights of my NN model don't change a lot?

I am training a neural network model, and my model fits the training data well. The training loss decreases stably; everything works fine. However, when I printed out the weights of my model, I found that they had not changed much since random initialization (I didn't use any pretrained weights; all weights are initialized by PyTorch's defaults). Every dimension of the weights changed by only about 1%, while the accuracy on the training data climbed from 50% to 90%.
What could account for this phenomenon? Is the dimensionality of the weights too high, so that I need to reduce the size of my model? Or is there some other possible explanation?
I understand this is quite a broad question, but I think it's impractical to show my model and analyze it mathematically here, so I just want to know what the general / common causes of this behaviour could be.
There are almost always many local optima in a problem, so one thing you can't say, especially in high-dimensional feature spaces, is which optimum your model parameters will settle into. One important point is that for any set of weights your model reaches at an optimum there are, because the weights are real-valued, infinitely many other sets that behave essentially the same; the proportions of the weights to each other are what matter, because you are trying to minimize the cost, not to find a unique set of weights with zero loss on every sample. Every time you train you may get a different result depending on the initial weights.
When the weights change very little and by roughly the same ratio relative to each other, it suggests your features are highly correlated (i.e. redundant), and since you reach very high accuracy with only a small change in the weights, the only thing I can think of is that the classes in your data set are far apart from each other. Try removing features one at a time, retraining, and checking the results; if accuracy stays good, keep removing features until you hopefully reach a 2- or 3-dimensional space where you can plot your data, visualize how the points are distributed, and make some sense of this.
EDIT: A better approach is to use PCA for dimensionality reduction instead of removing features one by one.
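For instance, a minimal sketch with scikit-learn (X and y are placeholders for the feature matrix and labels, not code from the question):

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# project the features to 2D and inspect how well the classes separate
X_2d = PCA(n_components=2).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=5)
plt.show()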

Normalization of input data to Qnetwork

I am well aware that a "normal" neural network should use normalized input data, so that no single variable has a bigger influence on the weights than the others.
But what if you have a Q-network whose training data and test data can differ a lot and can change over time in a continuous problem?
My idea was to do a first run without normalizing the input data, record the mean and variance of the inputs seen during that run, and then use that mean and variance to normalize the input data of my next run.
But what is the standard to do in this case?
Best regards Søren Koch
Normalizing the input can lead to faster convergence. It is highly recommended to normalize the inputs.
And since the data passes through different layers and non-linearities, the activations flowing between layers will no longer be normalized; that is why batch normalization layers are often used for faster convergence. Roughly unit-Gaussian data generally helps convergence, so try to keep the data in that form as much as possible.
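One common way to handle inputs whose distribution shifts over time, close to the idea in the question, is to keep a running estimate of the mean and variance and normalize each observation with it. A minimal sketch (the class name and the agent-loop usage are hypothetical):

import numpy as np

class RunningNormalizer:
    # online mean/variance via Welford's algorithm; standardizes new inputs
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)
        self.count = 0
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        var = self.m2 / max(self.count - 1, 1)
        return (x - self.mean) / np.sqrt(var + self.eps)

# inside the agent loop (hypothetical usage):
# norm = RunningNormalizer(obs_shape)
# norm.update(obs)
# q_input = norm.normalize(obs)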
