Tensorflow Object Detection API Out of Memory - python

I am new to TensorFlow and trying to train my own object detection model.
Hardware: I have tried multiple setups; the biggest was a CPU-only machine with 32 GB of RAM. I also tried an 8 GB CPU machine and a 2 GB GPU.
All running Windows, Python 3.6.2 & TensorFlow 1.3.
I've been following the guide here https://pythonprogramming.net/introduction-use-tensorflow-object-detection-api-tutorial/
Error: OOM error message (sorry, the VM was shut down and that's all I have of it)
Batch Size: 3
I'm starting to think my issue is the initial image size (4 MB, 4288x3216), so I've been trying other resources (Paperspace) and am still running out of memory on an 8 GB GPU.
I've got 571 images to train the model, and 250 or so more for testing, all of the same size.
Any suggestions on passing the model a smaller amount of data so that I can keep it running?
As one of the comments suggested, I can resize, but how will that affect my XML files?
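One common approach is to downscale the images and rescale the bounding boxes in the matching annotation files by the same factor. Below is a minimal sketch, assuming Pascal VOC-style XML as produced by labelImg (the tool that tutorial uses); the paths, target width, and function name are placeholders rather than anything from the original question.

```python
# Sketch: downscale an image and rescale the matching Pascal VOC XML annotation.
# Assumes labelImg-style XML (size/width, size/height, bndbox xmin/ymin/xmax/ymax).
# Paths, output directory and target width are hypothetical placeholders.
import os
import xml.etree.ElementTree as ET
from PIL import Image

def resize_with_annotation(img_path, xml_path, out_dir, target_width=800):
    img = Image.open(img_path)
    scale = target_width / img.width
    new_size = (target_width, round(img.height * scale))
    img.resize(new_size, Image.LANCZOS).save(
        os.path.join(out_dir, os.path.basename(img_path)))

    tree = ET.parse(xml_path)
    root = tree.getroot()
    root.find('size/width').text = str(new_size[0])
    root.find('size/height').text = str(new_size[1])
    for box in root.iter('bndbox'):           # scale every bounding box
        for tag in ('xmin', 'ymin', 'xmax', 'ymax'):
            el = box.find(tag)
            el.text = str(round(float(el.text) * scale))
    tree.write(os.path.join(out_dir, os.path.basename(xml_path)))
```

After resizing you would regenerate the TFRecords from the new images and XML files, so the box coordinates stay consistent with the pixels.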

Related

What machine specs do you need to train DETR? (End-to-End Object Detection with Transformers)

I am researching the machine specs needed for DETR training.
However, I only have a GeForce 1660 Super and I got an "out of memory" error. Could you please let me know what machine specs are needed to complete DETR training?
Please help me with my research.
DETR(https://github.com/facebookresearch/detr)
You are getting the out-of-memory error because your GPU memory isn't sufficient to hold the batch size you chose. Try running the code with the minimum possible batch size and see how much memory it consumes, then increase the batch size slightly and check the increase in memory consumption again. This way you will be able to estimate how much GPU memory you need to run it with your actual batch size.
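A rough sketch of that probing loop, assuming PyTorch with CUDA available; it loads DETR through the torch.hub entry point from the facebookresearch/detr repo and feeds dummy 800x800 inputs, both of which are stand-ins for your real model and data:

```python
# Sketch: estimate peak GPU memory at increasing batch sizes (dummy data).
import torch

model = torch.hub.load('facebookresearch/detr', 'detr_resnet50',
                       pretrained=False).cuda()

for batch_size in (1, 2, 4, 8):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    try:
        images = torch.randn(batch_size, 3, 800, 800, device='cuda')
        outputs = model(images)
        # Dummy scalar "loss" just to trigger a backward pass, since training
        # (activations + gradients) needs far more memory than inference.
        loss = outputs['pred_logits'].sum() + outputs['pred_boxes'].sum()
        loss.backward()
        peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
        print(f'batch {batch_size}: peak GPU memory ~{peak_gb:.2f} GB')
        model.zero_grad(set_to_none=True)
    except RuntimeError as err:        # CUDA OOM surfaces as RuntimeError
        print(f'batch {batch_size}: out of memory ({err})')
        break
```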
I had the same issue, I switched to a machine with larger GPU memory (around 24 GB) and then everything worked fine!

Is there a limit on image size when we train custom objects with already trained models?

I already trained ssd_mobilenet_v2_coco with my custom data set on TensorFlow. I also trained YOLO with my data set. I solved all problems and they work.
I encountered a problem with both models: when my data set includes images larger than 400 KB, the trained models do not work. Sometimes an "allocation of memory" problem occurs. I solved those by changing parameters (batch size, etc.). But I still don't know whether there is a limit on image size when preparing a data set.
Why are images larger than 400 KB a problem for my system? My question is not about pixel size, it's about image file size.
Thanks...
System Info
Nvidia RTX 2060 6 GB
AMD Ryzen 7
16 GB DDR4 2600 MHz RAM
CUDA: 10.0
cuDNN: 7.4.2 (I also tried different versions and the same results occur)
TensorFlow: 2
Providing the solution here in the answer section, even though it is present in the comment section (thanks dasmehdix for the update), for the benefit of the community.
No, there is no limit on the size of the images used to train the model.
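One way to see why the file size on disk isn't the number that matters: memory pressure comes from the decoded pixel array, which scales with image dimensions, not with how well the JPEG compresses. A minimal sketch with Pillow, where 'photo.jpg' is a placeholder path:

```python
# Sketch: compare a file's on-disk size with its decoded in-memory footprint.
import os
from PIL import Image

path = 'photo.jpg'                           # hypothetical example file
file_mb = os.path.getsize(path) / 1024 ** 2
with Image.open(path) as img:
    w, h = img.size                          # reading size does not decode pixels
decoded_mb = w * h * 3 / 1024 ** 2           # uint8 RGB after decoding
float_mb = decoded_mb * 4                    # float32 after preprocessing
print(f'{path}: {file_mb:.1f} MB on disk, {w}x{h} pixels, '
      f'~{decoded_mb:.1f} MB decoded, ~{float_mb:.1f} MB as float32')
```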

Tensorflow-GPU Object Detection API gets stuck after first saved checkpoint

I'm trying to train an SSD MobileNet v2 using the Tensorflow Object Detection API, with Tensorflow GPU. The training goes well and fast until the first checkpoint save (after some hundreds of steps), where it gets stuck after restoring the last checkpoint. The GPU usage drops and never comes back up. Sometimes Python itself crashes.
I'm running Tensorflow GPU on Windows 7, with an NVIDIA Quadro M4000, with CUDA 8.0 (the only version I managed to work with). The model is an SSD Mobilenet v2 pretrained with COCO, using a very low batch size of 4.
The config file is the same as the one that comes with the TensorFlow Model Zoo, of course changing paths, batch size, number of classes and number of steps, and adding shuffle: true to the training section.
I'm adding the terminal info that comes out below; this is where it gets stuck.
Has anyone experienced the same kind of problem, or have any idea why?
Thanks in advance
I faced the same problem as you describe. I waited a long time and found something interesting: I got some evaluation results, and the training process continued after that. It seems that the evaluation process takes a very long time. Since it gives no output at the beginning, it just looks like it's stuck. Maybe changing the parameter 'sample_1_of_n_eval_examples' will help. I'm still trying...
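For reference, a sketch of how that flag is passed when launching the Object Detection API training script; the paths and the value 10 are placeholders for your own setup:

```python
# Sketch: launch model_main.py with a reduced evaluation sample so the
# post-checkpoint evaluation finishes faster. Paths are hypothetical.
import subprocess

subprocess.run([
    'python', 'models/research/object_detection/model_main.py',
    '--pipeline_config_path=training/ssd_mobilenet_v2.config',
    '--model_dir=training/',
    '--sample_1_of_n_eval_examples=10',   # evaluate only every 10th example
    '--alsologtostderr',
], check=True)
```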

How can I train dlib shape predictor using a very large training set

I'm trying to use the python dlib.train_shape_predictor function to train using a very large set of images (~50,000).
I've created an XML file containing the necessary data, but it seems like train_shape_predictor loads all the referenced images into RAM before it starts training. This leads to the process getting terminated because it uses over 100 GB of RAM. Even trimming down the data set uses over 20 GB (the machine only has 16 GB of physical memory).
Is there some way to get train_shape_predictor to load images on demand, instead of all at once?
I'm using python 3.7.2 and dlib 19.16.0 installed via pip on macOS.
I posted this as an issue on the dlib github and got this response from the author:
It's not reasonable to change the code to cycle back and forth between disk and ram like that. It will make training very slow. You should instead buy more RAM, or use smaller images.
As designed, large training sets need tons of RAM.
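If you do go the smaller-images route, the training call itself doesn't change. A minimal sketch of invoking dlib.train_shape_predictor with explicit options; the file names and option values are placeholders:

```python
# Sketch: train a dlib shape predictor from an XML dataset file.
import dlib

options = dlib.shape_predictor_training_options()
options.be_verbose = True          # print progress so long runs are visible
options.num_threads = 4            # match your CPU core count
options.oversampling_amount = 5    # lower values train faster

dlib.train_shape_predictor('training.xml', 'predictor.dat', options)
```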

Tensorflow GPU error: Resource Exhausted in middle of training a model

I'm trying to train a model (an implementation of a research paper) on a K80 GPU with 12 GB of memory available for training. The dataset is about 23 GB and after data extraction, it shrinks to 12 GB for the training script.
At about the 4640th step (max_steps being 500,000), I receive the following error saying Resource Exhausted, and the script stops soon after that:
The memory usage at the beginning of the script is:
I went through a lot of similar questions and found that reducing the batch size might help, but I have reduced the batch size to 50 and the error persists. Is there any other solution besides switching to a more powerful GPU?
This does not look like a GPU Out Of Memory (OOM) error, but more like you ran out of space on your local drive to save the checkpoint of your model.
Are you sure that you have enough space on your disk, or that the folder you save to doesn't have a quota?
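A quick sketch for checking that, assuming the checkpoints go to a directory called 'training/' (a placeholder for your actual model_dir):

```python
# Sketch: report free space on the filesystem holding the checkpoint directory.
import shutil

total, used, free = shutil.disk_usage('training/')
print(f'free: {free / 1024 ** 3:.1f} GB of {total / 1024 ** 3:.1f} GB total')
```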
