I tried to train an image classifier using TensorFlow. I used the tf.data API to load the dataset and enabled dataset caching to speed up the training process. While trying to train the model I ran into a ResourceExhaustedError. I tried changing the batch size, but even after trying different batch sizes such as 32, 64 and 128 I could not overcome this problem.
I have also tried removing some layers, but I could not fix this error.
Check your batch_size and decrease it. It seems to be overwhelming your available memory.
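Besides lowering the batch size, note that dataset.cache() with no argument keeps the whole dataset in RAM; caching to a file keeps it on disk instead. A minimal sketch, not a guaranteed fix — the array shapes, cache path and batch size below are placeholders, not taken from the question:
import numpy as np
import tensorflow as tf

# Placeholder data standing in for the real images and labels.
images = np.random.rand(1000, 64, 64, 3).astype("float32")
labels = np.random.randint(0, 10, size=(1000,))

dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .cache("/tmp/train_cache")   # cache to a file on disk instead of holding everything in RAM
           .shuffle(1000)
           .batch(16)                   # a smaller batch needs less memory per training step
           .prefetch(tf.data.AUTOTUNE))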
I tried to build a CNN from scratch based on the LeNet architecture from this article.
I implemented backprop and am now trying to train it on the MNIST dataset using SGD with a batch size of 16. I want a quick way to verify that learning is going well and there are no bugs. For this, I visualize the loss for every 100th batch, but it takes too long on my laptop and I can't see the overall trend (the loss fluctuates downwards but occasionally jumps back up, so I am not sure). Could anyone suggest a proven way to check that the CNN works well without waiting many hours of training?
MNIST consists of 60k training images of 28 × 28 pixels. Training a CNN with a batch size of 16 means 3,750 forward passes per epoch.
Also take into consideration that you are using LeNet, which is not a very deep model.
I would suggest you do the following:
Check your PC specifications, such as RAM, processor, GPU, etc.
Try to train your model on a cloud service such as Google Colab, Kaggle, or others.
Try a batch size of 128 or 64.
Try to normalize your image dataset before training (see the sketch after this list).
Training speed also depends on the machine learning framework you are using, such as TensorFlow or PyTorch.
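As a small illustration of the normalization point above (using Keras only to download MNIST; the scaling itself is framework-independent):
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Scale pixel values from [0, 255] to [0, 1].
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Optionally also center the data, using training-set statistics only.
mean = x_train.mean()
x_train -= mean
x_test -= mean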
I hope this helps.
I tried training an AutoEnsembleEstimator with two DNNEstimators (with hidden units of 1000, 500, 100) on a dataset with around 1850 features (after feature engineering), and I kept running out of memory (even on larger 400G+ high-mem GCP VMs).
I'm using the above for binary classification. Initially I had trained various models and combined them by training a traditional ensemble classifier over the trained models. I was hoping that AdaNet would simplify the generated model graph and make inference easier, rather than having separate graphs/pickles for various scalers/scikit-learn models/Keras models.
Three hypotheses:
You might have too many DNNs in your ensemble, which can happen if max_iteration_steps is too small and max_iterations is not set (both of those are constructor arguments to AutoEnsembleEstimator). If you want to train each DNN for N steps, and you want an ensemble with 2 DNNs, you should set max_iteration_steps=N, set max_iterations=2, and train the AutoEnsembleEstimator for 2N steps (a sketch follows this list).
You might have been on adanet-0.6.0-dev, which had a memory leak. To fix this, try updating to the latest release and seeing if this problem still arises.
Your batch size might have been too large. Try lowering your batch size.
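To make the first hypothesis concrete, here is a rough sketch rather than a verified configuration: the head, feature column, input_fn and step counts are placeholders, and the max_iteration_steps / max_iterations argument names come from the description above, so check them against the adanet version you have installed.
import adanet
import tensorflow as tf

N = 10000  # placeholder: steps to train each candidate DNN

head = tf.estimator.BinaryClassHead()
feature_columns = [tf.feature_column.numeric_column("features", shape=(1850,))]  # placeholder column

def train_input_fn():
    # Placeholder input_fn; replace with your real data pipeline.
    features = {"features": tf.random.uniform([32, 1850])}
    labels = tf.zeros([32, 1], dtype=tf.int32)
    return tf.data.Dataset.from_tensors((features, labels)).repeat()

estimator = adanet.AutoEnsembleEstimator(
    head=head,
    candidate_pool={
        "dnn": tf.estimator.DNNEstimator(
            head=head,
            hidden_units=[1000, 500, 100],
            feature_columns=feature_columns),
    },
    max_iteration_steps=N,  # each AdaNet iteration trains one candidate for N steps
    max_iterations=2,       # cap the ensemble at 2 subnetworks
    model_dir="adanet_model")

# Train for 2N steps so the ensemble ends up with exactly 2 DNNs.
estimator.train(input_fn=train_input_fn, max_steps=2 * N)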
I'm trying to train my own dataset with Mask_RCNN, but I get the following framework errors:
I have followed the GitHub tutorial on Mask R-CNN for object detection.
Can I do something to decrease the memory needed to train the dataset? Or how else can I solve this problem?
We need more information about your PC. I think the batch size or the image size may be too large, so training needs more video memory or RAM. When I train on my own dataset using mask-rcnn-resnet101 from tensorflow-models/object_detection, a GTX 1080 Ti with 11 GB works fine (batch_size=1, image_size=1000*600).
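If you are on the matterport/Mask_RCNN implementation (an assumption; the question does not say which repository), the batch size and image size live in its Config class. A rough sketch with placeholder values:
from mrcnn.config import Config

class LowMemoryConfig(Config):
    NAME = "my_dataset"          # placeholder name
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1           # effective batch size = GPU_COUNT * IMAGES_PER_GPU
    IMAGE_MIN_DIM = 512          # smaller images need less GPU memory
    IMAGE_MAX_DIM = 512
    TRAIN_ROIS_PER_IMAGE = 100   # fewer ROIs per image also reduces memory
    NUM_CLASSES = 1 + 1          # background + number of your classes (placeholder)

config = LowMemoryConfig()
config.display()
If you are instead using tensorflow-models/object_detection, the equivalent knobs are batch_size and the image_resizer block in the pipeline .config file.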
In https://stackoverflow.com/a/64145818/11262633, I have described 3 changes you can make to reduce memory utilization in TensorFlow v2.
I am following Dat Tran's example to train my own object detector with TensorFlow's Object Detection API.
I successfully started training on the custom objects. I am using a CPU to train the model, but it takes around 3 hours to complete 100 training steps. I suppose I have to change some parameter in the .config file.
I tried to convert .ckpt to .pb; I referred to this post, but I was still not able to convert it.
1) How can I reduce the number of training steps?
2) Is there a way to convert .ckpt to .pb?
I don't think you can reduce the number of training steps, but you can stop at any checkpoint (ckpt) and then convert it to a .pb file.
From the TensorFlow models git repository you can use export_inference_graph.py
with the following command:
python tensorflow_models/object_detection/export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path architecture_used_while_training.config \
--trained_checkpoint_prefix path_to_saved_ckpt/model.ckpt-NUMBER \
--output_directory model/
where NUMBER refers to your latest saved checkpoint file number; however, you can use an older checkpoint file if it looks better in TensorBoard.
1) I'm afraid there is no effective way to just "reduce" training steps. Using bigger batch sizes may lead to "faster" training (as in, reaching high accuracy in a lower number of steps), but each step will take longer to compute, since you're running on your CPU.
Playing around with the input image resolution might give you a speedup, at the price of lower accuracy.
You should really consider moving to a machine with a GPU.
2) .pb files (and their corresponding text version .pbtxt) by default contain only the definition of your graph. If you freeze your graph, you take a checkpoint, get all the variables defined in the graph, convert them to constants and assign them the values stored in the checkpoint. You typically do this to ship your trained model to whoever will use it, but this is useless in the training stage.
I would highly recommend finding a way to speed up your per-training-step running time rather than reducing the number of training steps. The best way is to get your hands on a GPU. If you can't do this, you can look into reducing image resolution or using a lighter network.
For converting to a frozen inference graph (the .pb file), please see the documentation here:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md
Yes, there is a parameter in the .config file that lets you reduce the number of steps as much as you want: num_steps, which is the total number of training steps (not epochs).
But please keep in mind that it is not recommended to reduce it too much, because if you do, the loss will not decrease enough and you will get poor output.
So keep watching the loss function; once it comes under 1, you can start testing your model separately while the training keeps running.
1. Yup there is a way to change the number of training steps:
try this,
python model_main_tf2.py --pipeline_config_path="config_path_here" --num_train_steps=5000 --model_dir="model_dir_here" --alsologtostderr
here I set the number of training steps to 5000
2. Yup there is a way to convert checkpoints into .pb:
try this,
python exporter_main_v2.py --trained_checkpoint_dir="checkpoint_dir_here" --pipeline_config_path="config_path_here" --output_directory "output_dir_here"
this will create a directory where the checkpoints and .pb file will be saved.
I would like to train a Caffe network with the Python interface.
The main reason is that I use multi-dimensional input of a few TBs of data, and I don't want to convert all of it to LMDB for training.
I found this one answer on Stack Overflow.
But that code loads the complete data at once and starts from initialized weights.
I would like to load the data into NumPy arrays and then pass them to Caffe.
And save the weights to a .caffemodel file once every 1000 iterations.
The print_network(), get_accuracy() and load_data() functions are very useful and give me good insight.
Besides using a PythonLayer, one thing you can do is use a MemoryData layer and feed in one batch of data at a time by calling solver.net.set_input_arrays(your_data) after however many iterations are needed to go through one batch of data (see the sketch below).
Remember, you can always restore the training state by using the .solverstate file from your snapshots.
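A rough sketch of that training loop, assuming a net whose input is a MemoryData layer and a standard solver prototxt; all file names, shapes and iteration counts below are placeholders:
import numpy as np
import caffe

caffe.set_mode_gpu()
solver = caffe.get_solver("solver.prototxt")  # solver whose net starts with a MemoryData layer

batch_size = 32  # must match the MemoryData layer's batch_size
for it in range(100000):
    # Assemble one batch from your multi-TB source into NumPy here
    # (data must be float32 in (N, C, H, W) order, labels float32 of length N).
    data = np.random.rand(batch_size, 3, 64, 64).astype(np.float32)
    labels = np.random.randint(0, 10, size=batch_size).astype(np.float32)

    solver.net.set_input_arrays(data, labels)
    solver.step(1)  # one SGD iteration on this batch

    if it > 0 and it % 1000 == 0:
        # Saves weights only; snapshots configured in solver.prototxt also write a .solverstate.
        solver.net.save("weights_iter_%d.caffemodel" % it)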