Speeding up TensorFlow Cifar10 Example for Experimentation - python

The TensorFlow tutorial for using a CNN on the cifar10 data set has the following advice:
EXERCISE: When experimenting, it is sometimes annoying that the first training step can take so long. Try decreasing the number of images that initially fill up the queue. Search for NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN in cifar10.py.
In order to play around with it, I tried decreasing this number by a lot, but it doesn't seem to change the training time. Is there anything I can do? I even tried changing it to something as low as 5, and the training session still continued very slowly.
Any help would be appreciated!

Note that this exercise only speeds up the first step, by skipping the prefetching of a larger fraction of the data. It does not speed up the overall training.
That said, the tutorial text needs to be updated. It should read:
Search for min_fraction_of_examples_in_queue in cifar10_input.py.
If you lower this number, the first step should be much quicker because the model will not attempt to prefetch the input.
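For reference, this is roughly how that setting feeds into the input queue in the tutorial's (TF1-era) cifar10_input.py; this is a sketch from memory of the tutorial code, so the exact surrounding code in your copy may differ:

import tensorflow as tf

NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000

def _generate_image_and_label_batch(image, label, batch_size):
    # Fraction of an epoch that must sit in the queue before training starts.
    # Lowering this is what actually shortens the long first step.
    min_fraction_of_examples_in_queue = 0.4
    min_queue_examples = int(NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN *
                             min_fraction_of_examples_in_queue)
    images, labels = tf.train.shuffle_batch(
        [image, label],
        batch_size=batch_size,
        num_threads=16,
        capacity=min_queue_examples + 3 * batch_size,
        min_after_dequeue=min_queue_examples)
    return images, labels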

Related

Data shuffling changes gpu memory use drastically

I'm having a problem with a pytorch-ignite classification model. The code is quite long, so I'd like to first ask if anyone can explain this behavior in theory.
I am doing many classifications in a row. In each iteration, I select a subset of my data randomly and perform classification. My results were quite poor (accuracy ~ 0.6). I realized that in each iteration my training dataset is not balanced. I have a lot more class 0 data than class 1; so in a random selection, there tends to be more data from class 0.
So, I modified the selection procedure: I randomly select N data points from class 1, then select N data points from class 0, then concatenate these two together (so the label order is like [1111111100000000]). Finally, I shuffle this list to mix the labels before feeding it to the network.
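In code, the selection looks roughly like this (a minimal PyTorch sketch with placeholder tensors and a placeholder N, not my real data or sizes):

import torch

# toy imbalanced data standing in for the real dataset
features = torch.randn(1000, 20)
labels = torch.cat([torch.zeros(900, dtype=torch.long), torch.ones(100, dtype=torch.long)])
N = 64

idx0 = torch.nonzero(labels == 0, as_tuple=True)[0]
idx1 = torch.nonzero(labels == 1, as_tuple=True)[0]

# draw N random indices from each class
sel0 = idx0[torch.randperm(len(idx0))[:N]]
sel1 = idx1[torch.randperm(len(idx1))[:N]]

# concatenated order is [1...1, 0...0]; the final permutation mixes the labels
sel = torch.cat([sel1, sel0])
sel = sel[torch.randperm(len(sel))]

batch_x, batch_y = features[sel], labels[sel]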
The problem is, with this new data selection, my gpu runs out of memory within seconds. This was odd, since with the first data selection policy the code ran for tens of hours.
I retraced my steps: Turns out, if I do not shuffle my data in the end, meaning, if I keep the [1111111100000000] order, all is well. If I do shuffle the data, I need to reduce my batch_size by a factor of 5 or more so the code doesn't crash due to running out of gpu memory.
Any idea what is happening here? Is this to be expected?
I found the solution to my problem. But I don't really understand the details of why it works:
When trying to choose a batch_size at first, I chose 90. 64 was slow, I was worried 128 was going to be too large, and a quick googling led me to believe keeping to powers of 2 shouldn't matter much.
Turns out, it does matter! At least, when your classification training data is balanced. As soon as I changed my batch_size to a power of 2, there was no memory overflow. In fact, I ran the whole thing on a batch_size of 128 and there was no problem :)

Why is Normalization causing my network to have exploding gradients in training?

I've built a network (In Pytorch) that performs well for image restoration purposes. I'm using an autoencoder with a Resnet50 encoder backbone, however, I am only using a batch size of 1. I'm experimenting with some frequency domain stuff that only allows me to process one image at a time.
I have found that my network performs reasonably well, however, it only behaves well if I remove all batch normalization from the network. Now of course batch norm is useless for a batch size of 1 so I switched over to group norm, designed for this purpose. However, even with group norm, my gradient explodes. The training can go very well for 20 - 100 epochs and then game over. Sometimes it recovers and explodes again.
I should also say that in training, every new image fed in is given a wildly different amount of noise to train for random noise amounts. This has been done before but perhaps coupled with a batch size of 1 it could be problematic.
I'm scratching my head at this one and I'm wondering if anyone has suggestions. I've dialed in my learning rate and clipped the max gradients but this isn't really solving the actual issue. I can post some code but I'm not sure where to start and hoping someone could give me a theory. Any ideas? Thanks!
To answer my own question: my network was unstable in training because a batch size of 1 makes the data too different from batch to batch, or, as the papers like to put it, too high an internal covariate shift.
Not only were my images drawn from a very large, varied dataset, but they were also rotated and flipped randomly. On top of that, a random amount of Gaussian noise between 0 and 30 was chosen for each image, so one image may have little to no noise while the next may be barely distinguishable in some cases.
In the above question I mentioned group norm - my network is complex and some of the code is adapted from other work. There were still batch norm functions hidden in my code that I missed. I removed them. I'm still not sure why BN made things worse.
Following this I reimplemented group norm with groups of size=32 and things are training much more nicely now.
In short, removing the extra BN and adding group norm helped.
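For anyone in the same situation, here is a minimal sketch of hunting down leftover BatchNorm2d layers and swapping them for GroupNorm in PyTorch; the 32 groups mirror what I used, but the right group count depends on your channel sizes, and this is an illustration rather than my exact code:

import torch.nn as nn
import torchvision

def replace_bn_with_gn(module, num_groups=32):
    # recursively replace every BatchNorm2d with a GroupNorm over the same channels
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            channels = child.num_features
            groups = min(num_groups, channels)
            while channels % groups != 0:  # GroupNorm needs channels divisible by groups
                groups -= 1
            setattr(module, name, nn.GroupNorm(groups, channels))
        else:
            replace_bn_with_gn(child, num_groups)

model = torchvision.models.resnet50()
replace_bn_with_gn(model)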

Generator in GAN model generates images of one digit only (with MNIST Dataset)?

I am learning to code a GAN model; the code is linked here.
After 100 epochs, when I run the generator it generates only one digit: "7".
Why does it generate only one digit, even though I change the random numbers every time before I give them to the generator? The result is 7 every time.
I use this code to test the generator:
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
u = np.random.randn(1, 100)  # one latent vector of length 100 (randn(0, 100) would be an empty array)
ph = tf.reshape(generator(u)[0], (28, 28)).numpy()
plt.imshow(ph, cmap='gray')
Every time, it gives me images of the digit 7:
Am I making a mistake in the code, or what?
The generator and discriminator models are here to download if you want to try them :)
Several things here:
The discriminator is nowhere near powerful enough to keep up with the generator. You need at least 4-5 conv layers.
You haven't used any activations in the generator. Try adding ReLU() after every BatchNormalization().
To improve performance, try adding one or two conv layers after every transpose conv layer in the generator.
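To illustrate the last two points, here is a hedged sketch of what one upsampling block of the generator could look like in Keras; the filter counts and kernel sizes are made up for the example and are not taken from the linked code:

from tensorflow.keras import layers

def upsample_block(x, filters):
    # transpose conv to upsample, then BatchNormalization followed by a ReLU activation
    x = layers.Conv2DTranspose(filters, kernel_size=4, strides=2, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # one extra conv layer after the transpose conv, as suggested above
    x = layers.Conv2D(filters, kernel_size=3, padding='same', use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return x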
@Susmit Argawal's answer may give you some solutions; you can try them. I will add my idea of why it always generates seven.
You didn't do anything wrong. A GAN is a fight between the Generator (G) and the Discriminator (D). Think of it philosophically: when you manage to fool somebody in a certain way, you will try the same thing next time. The same problem appears in GANs: once G can "minimize" the loss, it learns less from the data and produces samples with less and less variation (nearly identical), which is often called mode collapse.
Some newer GAN models try to reduce this in multiple ways, for example the "minibatch standard deviation" layer in the ProGAN paper. There are also several well-known tips for training GANs; following them will save you a lot of time.
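A rough sketch of that minibatch standard deviation idea as a Keras layer, written from my understanding of the ProGAN paper rather than taken from your code (it assumes NHWC image tensors):

import tensorflow as tf
from tensorflow.keras import layers

class MinibatchStdDev(layers.Layer):
    # Appends one feature map holding the average per-feature standard deviation
    # across the batch, so the discriminator can "see" how varied the batch is.
    def call(self, x):
        mean = tf.reduce_mean(x, axis=0, keepdims=True)
        std = tf.sqrt(tf.reduce_mean(tf.square(x - mean), axis=0, keepdims=True) + 1e-8)
        avg_std = tf.reduce_mean(std)  # one scalar summarizing batch variety
        dims = tf.stack([tf.shape(x)[0], tf.shape(x)[1], tf.shape(x)[2], 1])
        return tf.concat([x, tf.fill(dims, avg_std)], axis=-1)

You would place such a layer near the end of the discriminator, so batches where every sample looks the same become easy for it to flag as fake.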
Training GANs and tuning their parameters is painful and needs a little luck, so just experiment as much as possible.
Good luck!

Optimizing RAM usage when training a learning model

I have been working on creating and training a deep learning model for the first time. I did not have any knowledge about the subject prior to the project, and therefore my knowledge is limited even now.
I used to run the model on my own laptop, but after implementing a well-working OHE and SMOTE I simply couldn't run it on my own device anymore due to a MemoryError (8GB of RAM). Therefore I am currently running the model on a 30GB RAM RDP, which allows me to do so much more, or so I thought.
My code seems to have some horrible inefficiencies, and I wonder if they can be solved. One example: using pandas.concat, my model's RAM usage skyrockets from 3GB to 11GB, which seems very extreme. Afterwards I drop a few columns, making the RAM spike to 19GB, but it does return to 11GB after the computation is completed (unlike the concat). I have also forced myself to stop using SMOTE for now, just because the RAM usage would go up way too much.
At the end of the code, where the training happens, the model breathes its final breath while trying to fit. What can I do to optimize this?
I have thought about splitting the code into multiple parts (for example preprocessing and training), but to do so I would need to store massive datasets in a pickle, which can only reach 4GB (correct me if I'm wrong). I have also given thought to using pre-trained models, but I truly did not understand how this process works and how to use one in Python.
P.S.: I would also like my SMOTE back if possible
Thank you all in advance!
Let's analyze the steps:
Step 1: OHE
For your OHE, the only dependence between data points is that it needs to be clear which categories exist overall. So the OHE can be broken into two steps, neither of which requires that all data points are in RAM.
Step 1.1: determine categories
Stream read your data points, collecting all the categories. It is not necessary to save the data points you read.
Step 1.2: transform data
After step 1.1, each data point can be independently converted. So stream read, convert, stream write. You only need one or very few data points in memory at all times.
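A minimal sketch of this two-pass, chunked approach with pandas; the file names, chunk size and the assumption that the data lives in a CSV are placeholders for the example:

import pandas as pd

# pass 1: collect the categories per column without keeping the rows around
categories = {}
for chunk in pd.read_csv('train.csv', chunksize=100_000):
    for col in chunk.select_dtypes(include='object'):
        categories.setdefault(col, set()).update(chunk[col].dropna().unique())

# pass 2: encode each chunk independently and stream the result straight back to disk
first = True
for chunk in pd.read_csv('train.csv', chunksize=100_000):
    for col, cats in categories.items():
        chunk[col] = pd.Categorical(chunk[col], categories=sorted(cats))
    encoded = pd.get_dummies(chunk, columns=list(categories))
    encoded.to_csv('train_ohe.csv', mode='w' if first else 'a', header=first, index=False)
    first = False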
Step 1.3: feature selection
It may be worthwhile to look at feature selection to reduce the memory footprint and improve performance. This answer argues it should happen before SMOTE.
Feature selection methods based on entropy depend on all the data. While you can probably also throw together something that streams, one approach that worked well for me in the past is removing features that only one or two data points have, since these features definitely have low entropy and probably don't help the classifier much. This can be done again like Step 1.1 and Step 1.2.
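A rough sketch of that streamed rare-feature filter, here counting how often each one-hot column is non-zero; the file names are the same placeholders as above:

import pandas as pd

# single streaming pass over the encoded file, counting how many rows use each feature
counts = None
for chunk in pd.read_csv('train_ohe.csv', chunksize=100_000):
    chunk_counts = (chunk != 0).sum()
    counts = chunk_counts if counts is None else counts.add(chunk_counts, fill_value=0)

# keep features used by more than two data points (always keep your target column)
keep = counts[counts > 2].index.tolist()
# later reads can then use pd.read_csv('train_ohe.csv', usecols=keep, chunksize=...)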
Step 2: SMOTE
I don't know SMOTE enough to give an answer, but maybe the problem has already solved itself if you do feature selection. In any case, save the resulting data to disk so you do not need to recompute for every training.
Step 3: training
See if the training can be done in batches or streaming (online, basically), or simply with less sampled data.
With regards to saving to disk: Use a format that can be easily streamed, like csv or some other splittable format. Don't use pickle for that.
Slightly orthogonal to your actual question: if your high RAM usage is caused by having the entire dataset in memory for the training, you could eliminate that memory footprint by reading and storing only one batch at a time: read a batch, train on this batch, read the next batch, and so on.
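With Keras, for example, that could look roughly like the sketch below; the file name, the 'target' column and the tiny model are placeholders, not your actual setup:

import pandas as pd
import tensorflow as tf

def batch_generator(path, batch_size=256):
    # read the already-encoded training file one batch at a time
    while True:  # Keras expects the generator to loop indefinitely
        for chunk in pd.read_csv(path, chunksize=batch_size):
            y = chunk.pop('target').values
            yield chunk.values.astype('float32'), y

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

model.fit(batch_generator('train_ohe.csv'), steps_per_epoch=1000, epochs=5)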

Running Tensorflow Predictions code twice does *not* result same outcome

I am new to tensorflow, so please pardon my ignorance.
I have a tensorflow demo model (from an online tutorial) that should predict stock market prices for the S&P. When I run the code, I get inconsistent results every time I run it. The training data does not change, I suppressed block shuffling, ...
But when I run the prediction twice in the same run, I get consistent results (i.e. only one training, prediction run twice).
My questions are:
Why am I getting inconsistent results?
If you were going to release such code to production, would you just take the results of the last training run? If not, what would you do?
Does it make sense to force the model to produce consistent predictions? How would you do that?
Here is my code location github repo
In training a neural network there is more randomness involved than just the batch shuffling. The initial weights of the layers are also randomly initialized.
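If what you want is run-to-run repeatability, one common approach is to pin all the random seeds before the model is built; this is a generic sketch rather than something from your repo, and it only removes the randomness, it does not make the model any better:

import random
import numpy as np
import tensorflow as tf

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)  # TF 2.x; in TF 1.x this was tf.set_random_seed(SEED)

# build and train the model only after the seeds are set, so weight initialization repeats
# note: some GPU ops are still nondeterministic, so results may not be bit-identical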
Typically you would use the best model you have trained so far. To determine which model is the best you usually use some test dataset you did not use during training.
It is probably not a good sign if your performance fluctuates between training runs; it means your result depends a lot on the random initialization. I personally don't know of any general techniques to make learning more stable, but there probably are some.
