hdf5 keras model - very large. Re-training gives OOM error - python

I trained a network with the Keras library and saved the model as an HDF5 file, but the file is very large (~3.6 GB). I am unable to continue the training process because the program throws an "Out of memory" error; I reduced the batch size from 128 to 8 and the same error is still thrown. I would like to retrain the existing model when I encounter new data, or train it for a few more epochs.
Using 12 GB of GPU RAM, 65 GB of CPU RAM.
2 GB of training images, 1 GB of training labels.
Any way to counter this?
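For reference, a minimal sketch of the retraining workflow being asked about, assuming the model was saved with model.save and that the data can be streamed from disk in small batches instead of being loaded onto the GPU all at once (file names, shapes and step counts below are placeholders, not the asker's actual setup):

import numpy as np
from tensorflow import keras

# Load the previously saved model (architecture, weights and optimizer state).
model = keras.models.load_model("model.h5")  # placeholder path

# Yield small batches so only one batch at a time needs to sit in GPU memory.
def batch_generator(x_path, y_path, batch_size=8):
    x = np.load(x_path, mmap_mode="r")  # memory-mapped, stays on disk
    y = np.load(y_path, mmap_mode="r")
    while True:
        idx = np.random.randint(0, len(x), size=batch_size)
        yield x[idx], y[idx]

# Continue training the existing model for a few more epochs.
model.fit(
    batch_generator("train_x.npy", "train_y.npy", batch_size=8),
    steps_per_epoch=1000,  # placeholder
    epochs=5,
)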

Related

XGBoost model: train on GPU, run on CPU without GPU RAM allocation

How can I train an XGBoost model on a GPU but run predictions on CPU without allocating any GPU RAM?
My situation: I create an XGBoost model (tree_method='gpu_hist', predictor='cpu_predictor') in Python, train it on the GPU, save (pickle) it to disk, read the model back from disk, and then use it for predictions.
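A minimal sketch of that workflow with the scikit-learn wrapper, using dummy data in place of the real training set (the parameter values are the ones from the question):

import pickle
import numpy as np
import xgboost as xgb

# Dummy data standing in for the real training set.
X_train = np.random.rand(1000, 20)
y_train = np.random.randint(0, 2, size=1000)

# Train on the GPU, but request CPU-based prediction.
model = xgb.XGBClassifier(tree_method="gpu_hist", predictor="cpu_predictor")
model.fit(X_train, y_train)

# Persist the trained model.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later, in a separate prediction process: load and predict.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
preds = model.predict(np.random.rand(10, 20))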
My problem: once the model starts doing predictions, even though I run it on the CPU, it still allocates a small amount of GPU RAM (around 289 MB). This is a problem for the following reasons:
I run multiple copies of the model to parallelize predictions, and if I run too many, the prediction processes crash.
I cannot use the GPU for training other models if I run predictions on the same machine at the same time.
So, how can one tell XGBoost not to allocate any GPU RAM and to use only the CPU and regular RAM for predictions?
Thank you very much for your help!

OOM error when running custom Tensorflow training loop for MNIST dataset

I wrote a custom TensorFlow training loop to train an MNIST classifier.
I experienced an error:
OOM when allocating tensor for MNIST
(screenshot of the error message omitted)
Here is my code: https://github.com/soon22/learningTensorflowCustomTrainingLoop/blob/master/mnist_custom_training_loop.ipynb
Using tensorflow.keras's model.compile and model.fit, the training was successful with more than 90% accuracy and did not have this problem.
What did I do wrong?
You're trying to make the model do a prediction for all 60000 data points in the MNIST dataset at the same time, and compute gradients for the resulting loss. That is way too much for your graphics card to handle.
Try training on batches of, say, a hundred data points. The reason model.fit doesn't give an OOM is that model.fit defaults to a batch size of 32 if you don't specify another value for batch_size (see https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit).
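A minimal sketch of what a batched custom training loop could look like (the model and optimizer below are placeholders, not the code from the linked notebook):

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

# Batch the data so each step processes 100 examples, not all 60000 at once.
dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(60000).batch(100)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

for epoch in range(5):
    for x_batch, y_batch in dataset:
        with tf.GradientTape() as tape:
            logits = model(x_batch, training=True)
            loss = loss_fn(y_batch, logits)
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))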

In Keras, how can I load images to GPU in groups larger than batch size?

I have the following code segment:
model.fit(x=train_x, y=train_y, batch_size=32, epochs=10, verbose=2, validation_data=(val_x, val_y), initial_epoch=0)
print(model.evaluate(test_x, test_y))
My GPU will still work with a batch size of 1024. However, this will severely penalize the frequency with which the model updates. Is it possible to load the images in groups of 1024 to the GPU but adjust the weights for the model every 32 images?
My intention is to improve performance by reducing the number of times the GPU has to fetch data from main memory since there is high latency involved with this operation. My question is similar to this one: How can you load all batch data into GPU memory in Keras (Theano backend)?
However, I am not necessarily trying to load all my data to the GPU at once, as the dataset is too large.
Thank you!
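One direction that matches what is being asked, assuming the data can be wrapped in a tf.data pipeline, is to keep the update batch size at 32 but let TensorFlow stage several batches on the GPU ahead of time with prefetch_to_device; a sketch reusing the train_x, train_y, val_x, val_y and model from the snippet above (buffer_size=32 here means 32 batches of 32 images, roughly the 1024-image group mentioned):

import tensorflow as tf

dataset = (
    tf.data.Dataset.from_tensor_slices((train_x, train_y))
    .shuffle(10000)
    .batch(32)  # weights still update every 32 images
    .apply(tf.data.experimental.prefetch_to_device("/gpu:0", buffer_size=32))
)

model.fit(dataset, epochs=10, verbose=2, validation_data=(val_x, val_y))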

Train on own data set. Mask_RCNN Resource exhausted: OOM when allocating

I'm trying to train my own dataset with Mask_RCNN, but I get the following framework errors:
I have followed the github tutorial on Mask R-CNN for object detection.
Can I do something to decrease the memory needed to train the dataset? Or how can I solve this problem?
We need more information about your PC. I think the batch size or the image size may be too large, so training needs more video memory or system memory. When I train on my own dataset using mask-rcnn-resnet101 in tensorflow-models/object_detection, a GTX 1080 Ti with 11 GB works fine (batch_size=1, image_size=1000*600).
In https://stackoverflow.com/a/64145818/11262633, I have described 3 changes you can make to reduce memory utilization in TensorFlow v2.
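If you are using the Matterport Mask_RCNN implementation from the github tutorial, a hedged sketch of a low-memory config follows; the attribute names below are the ones that repo's Config class exposes, but double-check them against your copy:

from mrcnn.config import Config

class LowMemoryConfig(Config):
    # Trade image resolution and batch size for lower GPU memory use.
    NAME = "low_memory"
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1          # effective batch size of 1
    IMAGE_MIN_DIM = 512         # train on smaller images
    IMAGE_MAX_DIM = 512
    TRAIN_ROIS_PER_IMAGE = 100  # fewer ROIs per image than the default
    NUM_CLASSES = 1 + 1         # background + number of your classes (adjust)

config = LowMemoryConfig()
config.display()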

Tensorflow GPU error: Resource Exhausted in middle of training a model

I'm trying to train a model (an implementation of a research paper) on a K80 GPU with 12 GB of memory available for training. The dataset is about 23 GB and, after data extraction, it shrinks to 12 GB for the training script.
At about the 4640th step (max_steps being 500,000), I receive the following error saying Resource Exhausted, and the script stops soon after that:
The memory usage at the beginning of the script is:
I went through a lot of similar questions and found that reducing the batch size might help, but I have reduced the batch size to 50 and the error persists. Is there any other solution except switching to a more powerful GPU?
This does not look like a GPU Out Of Memory (OOM) error but more like you ran out of space on your local drive to save the checkpoint of your model.
Are you sure that you have enough space on your disk, or that the folder you save to doesn't have a quota?
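A quick way to check this from Python, using the directory your training script writes checkpoints to (the path below is a placeholder):

import shutil

total, used, free = shutil.disk_usage("/path/to/checkpoints")  # placeholder path
print(f"free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")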
