base_model.summary() Crashes my notebook and VS Code - ResNet101 - python

I am trying to print a model summary in TensorFlow, and I think the model is so large that it's crashing my notebook. The model is ResNet101.
The whole computer comes to a halt: memory usage goes up to 99% and VS Code crashes. I have 16 GB of RAM, so I didn't think printing something large would actually eat all of it. Also, because the kernel crashes, all the variables are lost, like history = model.fit(), which I need in order to fine-tune the model afterwards. Moreover, I need to print the base_model summary in order to choose which layer to fine-tune from.
Is there a way to print the summary differently and save the entire notebook with its variables, so I can continue working? I have checkpoints for the model weights, but I need to keep track of past epochs through history to resume training afterwards.
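One workaround (a sketch only, not from the original thread; base_model, model, and history stand in for the asker's existing objects) is to redirect the summary to a file instead of the notebook output, and to pickle history.history so past epochs survive a kernel restart:
import pickle

# write the summary to a file, one line at a time, instead of the notebook output
with open("resnet101_summary.txt", "w") as f:
    base_model.summary(print_fn=lambda line: f.write(line + "\n"))

# a lighter-weight way to pick a fine-tuning point: list layer indices and names
for i, layer in enumerate(base_model.layers):
    print(i, layer.name)

# persist past epochs; history is the object returned by model.fit()
with open("history.pkl", "wb") as f:
    pickle.dump(history.history, f)
history.history is a plain dict of per-epoch metric lists, so it pickles cleanly even when the Model object itself does not.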

Related

How can I stop model training and resume it?

I am working on object detection with autonomous datasets. I want to train my model with 10000 training images, 2000 test images, and 2000 validation images. I will use the TensorFlow Lite Model Maker for object detection.
Project link: tensorflow.org/lite/tutorials/model_maker_object_detection
With a batch size of 32, training takes 50 epochs and runs for 2 days (Step 3). I can't keep my computer on for two days. I am running the project in a Jupyter notebook.
How can I stop model training and resume it later? (e.g. stop at the 10th epoch and continue one day later)
I'm sure it depends on the code you're working on, but you can do that with TensorFlow checkpoints; see
How to Pause / Resume Training in Tensorflow
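For example, with plain tf.keras (a sketch only; Model Maker may not expose callbacks directly, and model / train_data stand in for your own objects), you can checkpoint after every epoch and resume with initial_epoch:
import tensorflow as tf

# save the weights after every epoch so training can be interrupted safely
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    "ckpt/epoch-{epoch:02d}.h5", save_weights_only=True
)
model.fit(train_data, epochs=50, callbacks=[checkpoint_cb])

# after a restart: rebuild the model, reload the last checkpoint, and
# continue counting epochs from where you stopped (here, epoch 10)
model.load_weights("ckpt/epoch-10.h5")
model.fit(train_data, epochs=50, initial_epoch=10, callbacks=[checkpoint_cb])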
Sleep mode is a better option: it lets your PC rest for some time, and your work resumes after you log in again.

Spyder kernel dies during training

I am trying to train a fairly complex GCN network on my 10 GB GPU. It runs smoothly until epoch 87, but then the Spyder kernel restarts. Is it because of a memory issue, and if so, how can I handle it?
As you mentioned, if the model is too large it is good to store a checkpoint after every epoch.
import os
import torch

## after every epoch
path = os.path.join(SAVE_DIR, 'model.pth')
torch.save(model.cpu().state_dict(), path)  # save the parameters from CPU memory
model.cuda()  # move the model back to the GPU for further training

## if the kernel terminates, rebuild the model and load the parameters
device = torch.device("cuda")
model = TheModelClass()
model.load_state_dict(torch.load(path))
model.train()
model.to(device)
So, whatever happens to the process, you can restart from the last completed epoch.
From your information it's hard to tell exactly what is causing the kernel to terminate. RAM overloading is less likely because of the GPU acceleration and the PyTorch framework, but it could be the cause.
Either way, the solution above will help.

PyTorch: Is there a way to store the model in CPU RAM, but run all operations on the GPU, for large models?

From what I see, most people seem to initialize an entire model and send the whole thing to the GPU. But I have a neural net model that is too big to fit entirely on my GPU. Is it possible to keep the model in CPU RAM, but run all the operations on the GPU?
I do not believe this is possible. However, one easy workaround would be to split your model into sections that each fit into GPU memory along with your batch input, then:
1. Send the first section of the model to the GPU and calculate its outputs.
2. Release that section from GPU memory, and send the next section of the model to the GPU.
3. Feed the output from step 1 into the next section and save its outputs.
Repeat steps 1 through 3 until you reach the model's final output.
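A minimal PyTorch sketch of that idea (the Sequential sections and layer sizes below are made-up placeholders; a real model would be split along its own architecture):
import torch
import torch.nn as nn

# placeholder model, pre-split into sections that each fit in GPU memory
sections = [
    nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()),
    nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()),
    nn.Sequential(nn.Linear(4096, 10)),
]

def forward_in_sections(x, sections, device="cuda"):
    x = x.to(device)
    for section in sections:
        section.to(device)  # send this section to the GPU
        x = section(x)      # run it, keeping the intermediate output on the GPU
        section.to("cpu")   # release the section before loading the next one
    return x

out = forward_in_sections(torch.randn(32, 4096), sections)
Note that this only covers the forward pass; backpropagating through swapped-out sections needs extra care (e.g. gradient checkpointing or re-running each section).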

how to save and reload the tensorflow best model python

I have created the following Jupyter notebook. I used 200 epochs for training, but there was a power failure when it reached epoch 110, and now I need to restart from zero. All my hard work was wasted.
Have a look at the Jupyter notebook created by me.
What should I do to save the trained model and pick the best one out of it? I want to be able to restore the saved model even if there is a power-related problem.
How do I save and restore the model successfully? Currently I am training for 200 epochs, but I soon plan to move to 1000.
If possible, also suggest how I can move to Keras. I have heard it is simpler to understand compared to TensorFlow.
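One common pattern for this (a sketch, assuming a tf.keras model; x_train, y_train, x_val, and y_val are placeholders for your own data) is ModelCheckpoint with save_best_only=True, so a power failure costs at most the epochs since the last improvement:
import tensorflow as tf

# keep only the best model seen so far, judged by validation loss
best_cb = tf.keras.callbacks.ModelCheckpoint(
    "best_model.h5", monitor="val_loss", save_best_only=True
)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=200, callbacks=[best_cb])

# after a restart (or power failure), restore the best saved model
model = tf.keras.models.load_model("best_model.h5")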

Running Tensorflow Predictions code twice does *not* result in the same outcome

I am new to TensorFlow, so please pardon my ignorance.
I have a TensorFlow demo model "from an online tutorial" that should predict stock market prices for the S&P. When I run the code I get inconsistent results every time, even though the training data does not change and I suppressed batch shuffling, etc.
But when I run the prediction twice within the same run, I get consistent results (i.e. train only once, run the prediction twice).
My questions are:
1. Why am I getting inconsistent results?
2. If you were going to release such code to production, would you just take the results from the last training run? If not, what would you do?
3. Does it make sense to force the model to produce consistent predictions? How would you do that?
Here is my code location: github repo
In training a neural network there is more randomness involved than just the batch shuffling; the initial weights of the layers are also randomly initialized.
Typically you would use the best model you have trained so far. To determine which model is best, you usually evaluate on a test dataset that was not used during training.
It is probably not a good sign if your performance fluctuates across training runs, since it means your result depends heavily on the random initialization. I personally don't know of any general techniques to make learning more stable, but there probably are some.
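To make runs more repeatable, the usual first step (a TF 2.x sketch; in TF 1.x the call is tf.set_random_seed, and some GPU kernels can remain nondeterministic) is to pin every seed before building the model:
import random
import numpy as np
import tensorflow as tf

# pin the common sources of randomness so repeated runs start from the
# same initial weights and the same shuffle order
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)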
