Normalization of input data to Qnetwork - python

I am well aware that a “normal” neural network should use normalized input data, so that no single variable has a bigger influence on the weights in the NN than the others.
But what if you have a Q-network where the training data and test data can differ a lot and can change over time in a continuous problem?
My idea was to do one run without normalizing the input data, record the mean and variance of the inputs seen during that run, and then use them to normalize the input data of the next run.
But what is the standard to do in this case?
Best regards Søren Koch

Normalizing the input can lead to faster convergence, so it is highly recommended.
As data passes through the layers, the non-linearities mean that the activations flowing between layers are no longer normalized, which is why batch normalization layers are often used to speed up convergence. Data that is roughly unit Gaussian generally helps convergence, so try to keep it in that form as much as possible.
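For inputs whose distribution drifts over time, one option is to keep running estimates of the mean and variance and update them as new observations arrive, instead of fixing them from a single earlier run. The following is a minimal sketch of that idea using Welford's online algorithm; the class name `RunningNormalizer` and the sample `states` array are assumptions made for illustration, not something from the question.

```python
import numpy as np


class RunningNormalizer:
    """Tracks a per-feature running mean and variance (Welford's algorithm)."""

    def __init__(self, num_features, eps=1e-8):
        self.count = 0
        self.mean = np.zeros(num_features)
        self.m2 = np.zeros(num_features)  # running sum of squared deviations
        self.eps = eps

    def update(self, x):
        """Fold one observation (1-D array of features) into the statistics."""
        x = np.asarray(x, dtype=float)
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def var(self):
        # Population variance; fall back to 1 until there are two samples.
        if self.count < 2:
            return np.ones_like(self.mean)
        return self.m2 / self.count

    def normalize(self, x):
        """Scale inputs with the statistics collected so far."""
        return (np.asarray(x, dtype=float) - self.mean) / np.sqrt(self.var + self.eps)


# Made-up observations with very different scales per feature.
rng = np.random.default_rng(0)
states = rng.normal(loc=[10.0, -3.0], scale=[5.0, 0.1], size=(1000, 2))

normalizer = RunningNormalizer(num_features=2)
for s in states:
    normalizer.update(s)

normalized_states = normalizer.normalize(states)
```

Whether to keep updating the statistics during evaluation or to freeze them after training is a design choice that depends on how much the inputs are expected to drift.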


How to deal with batch normalization for multiple datasets?

I am working on a task of generating synthetic data to help the training of my model. This means that the training is performed on synthetic + real data, and tested on real data.
I was told that batch normalization layers might be trying to find weights that are good for all while training, which is a problem since the distribution of my synthetic data is not exactly equal to the distribution of the real data. So, the idea would be to have different ‘copies’ of the weights of batch normalization layers. So that the neural network estimates different weights for synthetic and real data, and uses just the weights of real data for evaluation.
Could someone suggest good ways to implement that in PyTorch? My idea was the following: after each epoch of training on a dataset, I would go through all batch-norm layers and save their weights. Then, at the beginning of the next epoch, I would iterate again and load the right weights. Is this a good approach? Still, I am not sure how I should deal with the batch-norm weights at test time, since batch norm behaves differently then.
It sounds like the problem you're worried about is that your neural network will learn weights that work well when the batch norm is computed for a batch of both real and synthetic data, and then later at test time it will compute a batch norm on just real data?
Rather than trying to track multiple batch norms, you probably just want to set track_running_stats to True for your batch norm layer, and then put it into eval mode when testing. This will cause it to compute a running mean and variance over multiple batches while training, and then it will use that mean and variance later at test time, rather than looking at the batch stats for the test batches.
(This is often what you want anyway, because depending on your use case, you might be sending very small batches to the deployed model, and so you want to use a pre-computed mean and variance rather than relying on stats for those small batches.)
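As a small illustration of the train/eval behaviour described above, here is a sketch with a single `nn.BatchNorm1d` layer. The data is made up (mean 5, standard deviation 3) purely to make the effect visible.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# track_running_stats=True is the default; it is spelled out here for clarity.
bn = nn.BatchNorm1d(num_features=4, track_running_stats=True)

# Training mode: each forward pass normalizes with the statistics of the
# current batch and updates running_mean / running_var as a side effect.
bn.train()
for _ in range(200):
    batch = torch.randn(64, 4) * 3.0 + 5.0  # made-up data
    bn(batch)

# Eval mode: the stored running statistics are used instead of batch
# statistics, so even a batch of size 1 is normalized consistently.
bn.eval()
with torch.no_grad():
    single = torch.full((1, 4), 5.0)
    out = bn(single)

print(bn.running_mean)  # should end up near the data mean
print(out)              # should end up near zero for an input at the mean
```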
If you really want to be computing fresh means and variances at test time, what I would do is instead of passing a single batch with both real and synthetic data into your network, I'd pass in one batch of real data, then one batch of synthetic data, and average the two losses together before backprop. (Note that if you do this you should not rely on the running mean and variance later -- you'll have to either set track_running_stats to False, or reset it when you're done and run through a few dummy batches with only real data to compute reasonable values. This is because the running mean and variance stats are only useful if they're expected to be roughly the same for every batch, and you're instead polarizing the values by feeding in different types of data in different batches.)
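The separate-batches idea could look roughly like the sketch below. The model, tensor shapes, and random placeholder data are assumptions for illustration only. The last part shows one way to reset the running statistics and recompute them from real-only batches, using `reset_running_stats()` and `momentum = None` (which makes PyTorch use a cumulative average).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small model with one batch-norm layer.
model = nn.Sequential(
    nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(), nn.Linear(16, 1)
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Placeholder tensors standing in for one real and one synthetic batch.
real_x, real_y = torch.randn(32, 8), torch.randn(32, 1)
synth_x, synth_y = torch.randn(32, 8) + 0.5, torch.randn(32, 1)

model.train()
optimizer.zero_grad()
# Two separate forward passes, so batch norm sees each domain on its own.
loss_real = loss_fn(model(real_x), real_y)
loss_synth = loss_fn(model(synth_x), synth_y)
loss = 0.5 * (loss_real + loss_synth)  # average the two losses
loss.backward()
optimizer.step()

# After training: discard the mixed running statistics and recompute them
# from real data only, without touching the learned weights.
for module in model.modules():
    if isinstance(module, nn.BatchNorm1d):
        module.reset_running_stats()
        module.momentum = None  # cumulative average over the passes below

with torch.no_grad():
    for _ in range(10):
        model(torch.randn(32, 8))  # stand-in for real-only batches

model.eval()
```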

Are there any rules of thumb for the relation between the number of iterations and the training size for lightgbm?

When I train a classification model using lightgbm, I usually use validation set and early stopping to determine the number of iterations.
Now I want to combine the training and validation sets to train a model (so that I have more training examples) and use that model to predict the test data. Should I change the number of iterations derived from the validation process?
As you said in your comment, this is not comparable to the Deep Learning number of epochs because deep learning is usually stochastic.
With LGBM, all parameters and features being equal, when you add 10% to 15% more training points you can expect the trees to look alike: with more information your split values will be better, but the model is unlikely to change drastically (this is less true if you use parameters such as bagging_fraction, or if the added points come from a different distribution).
I have seen people multiply the number of iterations by 1.1 (sorry, I can't find my sources). Intuitively, it makes sense to add some trees, since you are potentially adding information. Experimentally this value worked well, but the optimal value will depend on your model and data.
For a similar problem in deep learning with Keras, I use an early stopper and cross-validation on the train and validation data, and let the model optimize itself against the validation data during training.
After each training run, I test the model on the test data and examine the mean accuracy. I also save the stopped_epoch from the EarlyStopper after each run. If the CV scores are satisfactory, I take the mean of the stopped epochs, do a full training run (including all the data I have) for that number of epochs, and save the model.
I'm not aware of a well-established rule of thumb for such an estimate. As Florian has pointed out, people sometimes rescale the number of iterations obtained from early stopping by a factor. If I remember correctly, the factor typically assumes a linear dependence of the optimal number of trees on the data size, i.e. for 10-fold CV this would be a rescaling factor of about 1.1. But there is no solid justification for this. As Florian also pointed out, the dependence around the optimum is typically reasonably flat, so a few trees more or fewer will not have a dramatic effect.
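To make the arithmetic of that heuristic concrete: with k folds, each model is trained on (k - 1)/k of the data, so a linear assumption gives a factor of k/(k - 1), which is about 1.11 for k = 10. The helper below is a hypothetical sketch of that calculation, not a LightGBM API, and the linear dependence is exactly the unproven assumption mentioned above.

```python
def rescale_iterations(best_iteration, k_folds):
    """Scale an early-stopping iteration count from a k-fold training split
    up to the full dataset, assuming the optimal number of trees grows
    linearly with the number of training rows."""
    if k_folds < 2:
        raise ValueError("k_folds must be at least 2")
    # Full data size / training-fold size = k / (k - 1).
    return round(best_iteration * k_folds / (k_folds - 1))


# Example: early stopping picked 900 iterations in 10-fold CV.
full_data_iterations = rescale_iterations(900, 10)
print(full_data_iterations)
```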
Two suggestions:
- Do k-fold validation instead of a single train-validation split. This lets you evaluate how stable the estimate of the optimal number of trees is. If it fluctuates a lot between folds, do not rely on such an estimate :)
- Fix the size of the validation sample and re-train your model with early stopping on a gradually increasing training set. This lets you evaluate how the number of trees depends on the sample size and extrapolate it to the full sample size.
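For the first suggestion, a simple way to judge stability is to look at the spread of the per-fold early-stopping iterations relative to their mean. The sketch below assumes you have already collected one best iteration per fold; the numbers in `fold_best_iterations` are hypothetical placeholders, and what counts as "too much" fluctuation is a judgment call.

```python
import statistics


def iteration_stability(best_iterations):
    """Return mean, standard deviation, and coefficient of variation of the
    per-fold early-stopping iteration counts."""
    mean = statistics.mean(best_iterations)
    std = statistics.stdev(best_iterations)  # sample standard deviation
    return mean, std, std / mean


# Hypothetical per-fold results from a 5-fold run with early stopping.
fold_best_iterations = [480, 510, 495, 520, 505]
mean_iter, std_iter, cv_iter = iteration_stability(fold_best_iterations)
print(mean_iter, std_iter, cv_iter)
```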

How do I better process my data and set parameters for my Neural Network?

When I run my NN, the only way to get any training to occur is to divide X by 1000. The network also has to be trained for fewer than 70000 iterations with a learning rate of 0.03; if either value is larger, the NN gets worse. I think this is due to bad processing of the data and maybe the lack of biases, but I don't really know.
Code on Google Colab
In short: all of the problems you mentioned and more.
- Scaling is essential, typically to a mean of 0 and a variance of 1. Otherwise you will quickly saturate the hidden units, their gradients will be near zero, and (almost) no learning will be possible.
- A bias is mandatory for such an ANN. It is like the offset when fitting a linear function. If you drop it, getting a good fit will be very difficult.
- You seem to be checking accuracy on your training data.
- You have very few training samples.
- Sigmoid is known to be a poor choice for hidden units. Use ReLU instead.
Also, I'd recommend spending some time learning Python before going into this. For a start, avoid using global; it can cause unforeseen behaviour if you're not careful.
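The saturation point can be seen numerically. The sketch below standardizes a small made-up feature matrix with the training mean and standard deviation, then compares the sigmoid gradient for raw and scaled inputs; the data, weights, and bias are arbitrary illustrative values, not taken from the Colab notebook.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


# Made-up training inputs: the first feature is on a much larger scale.
X_train = np.array([[1200.0, 3.0],
                    [800.0, 1.0],
                    [1500.0, 4.0],
                    [950.0, 2.0]])

# Standardize each feature using statistics from the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_scaled = (X_train - mean) / std

# Arbitrary weights and a bias term for a single sigmoid unit.
w = np.array([0.5, 0.5])
b = 0.1

raw_grad = sigmoid_grad(X_train @ w + b)      # pre-activations in the hundreds
scaled_grad = sigmoid_grad(X_scaled @ w + b)  # pre-activations of order 1

print(raw_grad)     # effectively zero: the unit is saturated
print(scaled_grad)  # clearly non-zero: gradients can flow
```

The same `mean` and `std` would be reused to scale validation and test inputs.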

Why don't the weights of my NN model change much?

I am training a neural network model, and my model fits the training data well. The training loss decreases steadily. Everything works fine. However, when I output the weights of my model, I found that they had not changed much since random initialization (I didn't use any pretrained weights; all weights are initialized by default in PyTorch). Each dimension of the weights changed by only about 1%, while the accuracy on the training data climbed from 50% to 90%.
What could account for this phenomenon? Is the dimension of the weights too high, so that I need to reduce the size of my model? Or are there other possible explanations?
I understand this is a quite broad question, but I think it's impractical for me to show my model and analyze it mathematically here. So I just want to know what could be the general / common cause for this problem.
There are almost always many local optima in a problem, so, especially in high-dimensional feature spaces, you cannot say which optimum your model parameters will settle into. One important point is that, because the weights are real-valued, there are infinitely many sets of weights corresponding to a given optimum; what matters is the proportion of the weights to each other, because you are trying to minimize the cost, not to find a unique set of weights with a loss of 0 for every sample. Every time you train, you may get a different result depending on the initial weights.
When the weights all change by nearly the same ratio, this suggests that your features are highly correlated (i.e. redundant). Since you are getting very high accuracy with only a small change in the weights, the only thing I can think of is that the classes in your data set are far apart from each other. Try removing features one at a time, then train and check the results; if the accuracy is still good, remove another one, until you hopefully reach a 2- or 3-dimensional space in which you can plot the data, see how the points are distributed, and make some sense of it.
EDIT: A better approach is to use PCA for dimensionality reduction instead of removing features one by one.
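A minimal sketch of the PCA route, using NumPy's SVD rather than any particular library's PCA class. The function name `pca_project` and the synthetic, deliberately redundant data are assumptions for illustration.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project X onto its first principal components.

    Returns the projected data and the fraction of variance explained by
    each kept component."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    projected = X_centered @ Vt[:n_components].T
    return projected, explained[:n_components]


# Synthetic example: 10 features that are linear mixes of 2 hidden factors,
# plus a little noise, so the features are highly correlated.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = factors @ mixing + 0.01 * rng.normal(size=(200, 10))

projected, explained = pca_project(X, n_components=2)
print(projected.shape)
print(explained.sum())  # near 1 when two components capture most variance
```

The 2-D `projected` array can then be scatter-plotted, colored by class, to see how far apart the classes are.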

Neural Network - Input Normalization

It is a common practice to normalize input values (to a neural network) to speed up the learning process, especially if features have very large scales.
In theory, normalization is easy to understand. But I wonder how it is done if the training data set is very large, say 1 million training examples. If the number of features per training example is large as well (say, 100 features per example), two problems pop up all of a sudden:
- It will take some time to normalize all training samples
- Normalized training examples need to be saved somewhere, so that we need to double the necessary disk space (especially if we do not want to overwrite the original data).
How is input normalization solved in practice, especially if the data set is very large?
One option might be to normalize the inputs dynamically in memory, per mini-batch, while training. But the normalization results would then change from one mini-batch to another. Would that be tolerable?
Maybe someone on this platform has hands-on experience with this question. I would really appreciate it if you could share your experiences.
Thank you in advance.
A large number of features makes it easier to parallelize the normalization of the dataset, so this is not really an issue. Normalization of large datasets is easily GPU-accelerated and quite fast, even for datasets of the size you are describing. One of the frameworks I have written can normalize the entire MNIST dataset in under 10 seconds on a 4-core, 4-thread CPU; a GPU could easily do it in under 2 seconds. Computation is not the problem.
For smaller datasets you can hold the entire normalized dataset in memory, while for larger datasets, like the one you mention, you will need to swap out to disk if you normalize the entire dataset. However, if you use reasonably large batch sizes, about 128 or higher, your minimums and maximums will not fluctuate that much, depending on the dataset. This allows you to normalize each mini-batch right before you train the network on it, but again this depends on the network. I would recommend experimenting on your datasets and choosing the method that works best.
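A variation on the per-mini-batch idea, offered here as an assumption rather than as the method described above: make one cheap pass over the data in chunks to collect global per-feature minimums and maximums, then scale each mini-batch in memory with those fixed values. This avoids storing a normalized copy on disk and keeps the scaling identical across batches. The function names and the random data are hypothetical.

```python
import numpy as np


def streaming_min_max(chunks):
    """Compute per-feature min and max over an iterable of 2-D chunks,
    without holding the whole dataset in memory at once."""
    lo, hi = None, None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=float)
        chunk_min, chunk_max = chunk.min(axis=0), chunk.max(axis=0)
        lo = chunk_min if lo is None else np.minimum(lo, chunk_min)
        hi = chunk_max if hi is None else np.maximum(hi, chunk_max)
    return lo, hi


def scale_batch(batch, lo, hi, eps=1e-12):
    """Min-max scale one mini-batch to [0, 1] using fixed global statistics."""
    span = np.maximum(hi - lo, eps)  # guard against constant features
    return (np.asarray(batch, dtype=float) - lo) / span


# Made-up dataset: 10,000 examples with 100 features, read in 20 chunks.
rng = np.random.default_rng(0)
data = rng.uniform(-50.0, 200.0, size=(10_000, 100))
lo, hi = streaming_min_max(np.array_split(data, 20))

# Scale a mini-batch of 128 examples right before it is fed to the network.
mini_batch = scale_batch(data[:128], lo, hi)
```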

