I'm trying to train an SSD MobileNet v2 with the TensorFlow Object Detection API, using TensorFlow GPU. Training runs well and fast until the first checkpoint save (after a few hundred steps), where it gets stuck after restoring the last checkpoint. GPU usage drops and never comes back up. Sometimes Python itself crashes.
I'm running TensorFlow GPU on Windows 7, with an NVIDIA Quadro M4000 and CUDA 8.0 (the only version I managed to get working). The model is an SSD MobileNet v2 pretrained on COCO, using a very low batch size of 4.
The config file is the same one that ships with the TensorFlow Model Zoo; I only changed the paths, batch size, number of classes and number of steps, and added shuffle: true to the training input section.
I'm attaching the terminal output below; this is where it gets stuck.
Has anyone experienced the same kind of problem, or any idea why it happens?
Thanks in advance
I faced the same problem you describe. I waited a long time and noticed something interesting: some evaluation results eventually appeared, and the training process continued after that. It seems the evaluation step simply takes a very long time; since it prints no output at the beginning, it just looks as if training got stuck. Maybe changing the parameter 'sample_1_of_n_eval_examples' will help. I'm trying...
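In case it helps, here is a hedged example of how that parameter is usually passed when training is driven by the Object Detection API's model_main.py; the config path and model_dir below are assumptions, adjust them to your setup. Sampling only every Nth example shortens the evaluation pause after each checkpoint.

```python
# Hypothetical launch command wrapped in Python; paths are placeholders.
import subprocess

subprocess.run([
    "python", "model_main.py",
    "--pipeline_config_path=ssd_mobilenet_v2_coco.config",  # assumed config path
    "--model_dir=training/",                                # assumed output dir
    "--sample_1_of_n_eval_examples=10",  # evaluate only 1 of every 10 eval examples
])
```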
I don't have a GPU on my machine. Since most of the performance recommendations for TensorFlow only mention GPUs, can someone confirm whether e.g.
tf.data.Dataset.prefetch
tf.distribute.MirroredStrategy
tf.distribute.MultiWorkerMirroredStrategy
will only work with multiple GPUs?
I tried them on my PC and most of these functions really slow down the process instead of speeding it up. So is there no benefit from multiple CPU cores here?
In case you haven't solved your problem yet, you can use Google Colab (https://colab.research.google.com) to get a GPU - there you can change the runtime to GPU or TPU.
I did not understand exactly what you are asking, but let me give you a 10,000-foot explanation of those three. It might help you understand what they do and when to use them.
tf.data.Dataset.prefetch: suppose you have 2 steps while training your model: a) read data, b) process the data. While you are processing the data, you could already be reading more data so it is available as soon as training is done with the current batch. Think of a producer/consumer model: you don't want your consumer sitting idle while you produce more data.
tf.distribute.MirroredStrategy: this one helps if you have a single machine with more than one GPU. It allows you to train a model in "parallel" across those GPUs on the same machine.
tf.distribute.MultiWorkerMirroredStrategy: now suppose you have a cluster with 5 machines. You could train your model in a distributed fashion using all of them.
This is just a simple explanation of the 3 items you mentioned here; a minimal usage sketch follows below.
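Here is a minimal sketch of the first two ideas, assuming TF 2.x and made-up data shapes (none of this comes from the original question). Note that prefetch can help even on a CPU-only machine as long as reading or decoding data has real cost, while MirroredStrategy only pays off when there is more than one GPU.

```python
import tensorflow as tf

# Producer/consumer: prefetch prepares the next batch while the current one
# is being consumed by the training step.
dataset = (
    tf.data.Dataset.from_tensor_slices(
        (tf.random.normal([1024, 32]),
         tf.random.uniform([1024], maxval=10, dtype=tf.int32)))
    .shuffle(1024)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)  # tf.data.experimental.AUTOTUNE on older TF 2.x
)

# MirroredStrategy replicates the model on every visible GPU of this machine;
# on a CPU-only box it just adds coordination overhead.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

model.fit(dataset, epochs=1)
```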
I'm trying to train a model (an implementation of a research paper) on a K80 GPU with 12 GB of memory available for training. The dataset is about 23 GB, and after data extraction it shrinks to 12 GB for the training script.
At about the 4,640th step (max_steps being 500,000), I receive the following error saying Resource Exhausted, and the script stops soon after that:
The memory usage at the beginning of the script is:
I went through a lot of similar questions and found that reducing the batch size might help, but I have already reduced the batch size to 50 and the error persists. Is there any other solution besides switching to a more powerful GPU?
This does not look like a GPU Out Of Memory (OOM) error but rather like you ran out of space on your local drive while saving a checkpoint of your model.
Are you sure that you have enough space on your disk, and that the folder you save to doesn't have a quota?
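A quick way to check, as a rough sketch (the path is a placeholder for your actual model_dir, and this is my suggestion rather than part of the original answer):

```python
import shutil

# Replace with the directory your checkpoints are written to.
checkpoint_dir = "/path/to/model_dir"

usage = shutil.disk_usage(checkpoint_dir)
print("free: %.1f GiB out of %.1f GiB" % (usage.free / 2**30, usage.total / 2**30))
```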
I randomly encounter the same error whenever I run an XGBoost model (both the normal run and grid search). The error message says this:
H2OConnectionError: Local server has died unexpectedly. RIP.
I don't know what is happening; I tried changing versions, but that didn't work. I'm currently using version 3.18.0.5. Does anyone have any idea what is going on? Thanks in advance.
The only time I've seen this happen is when H2O runs out of memory. Please check that you have enough memory -- an H2O cluster should have at least 4x as much RAM as the dataset you're trying to train a model on (data size on disk).
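For instance, you could restart the local H2O server with an explicit memory cap following that rule of thumb. The "16G" below is just an example value; size it to your own data (this snippet is a sketch, not part of the original answer):

```python
import h2o

# Give the local H2O JVM an explicit heap size; nthreads=-1 uses all cores.
h2o.init(max_mem_size="16G", nthreads=-1)
```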
I am a beginner in machine learning. Recently, I successfully got a machine learning application running using the TensorFlow Object Detection API.
My dataset is 200 images of the object at 300*300 resolution. However, the training has been running for two days and has yet to complete.
I wonder how long it would take to complete the training. At the moment it is at global step 9000; how many global steps are needed to complete the training?
P.S.: the training uses only CPUs.
It depends on your desired accuracy and dataset, of course, but I generally stop training when the loss value gets to around 4 or less. What is your current loss value after 9000 steps?
To me this sounds like your training is not converging.
See the discussion in the comments of this question.
Basically, it is recommended that you run eval.py in parallel and check how it performs there as well.
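As a concrete sketch of "run eval.py in parallel" (the paths below are assumptions, and this uses the legacy evaluation script of the Object Detection API): launch evaluation alongside training so you can watch the validation metrics while training keeps going.

```python
import subprocess

# Start the legacy eval script in its own process; it periodically evaluates
# the newest checkpoint in checkpoint_dir and writes summaries for TensorBoard.
eval_proc = subprocess.Popen([
    "python", "eval.py",
    "--logtostderr",
    "--pipeline_config_path=training/pipeline.config",  # assumed path
    "--checkpoint_dir=training/",                        # assumed path
    "--eval_dir=eval/",                                  # assumed path
])
```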
I'm running the TensorFlow tf.estimator Quickstart example on a Raspberry Pi 3. It works well; however, prediction is very slow.
The call predictions = list(classifier.predict(input_fn=predict_input_fn)) takes seconds to return.
That's a problem for me because I want to predict a simple tuple immediately after I receive it from my sensors, so I cannot batch the predictions.
My model_dir is a tmpfs folder (i.e., in RAM), so I don't think it's related to I/O latency. Perhaps the network is rebuilt every time; I don't know.
I'm probably doing something wrong. Do you know how to run repeated TensorFlow predictions on the same classifier in a short time?
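One commonly suggested workaround for this pattern (not from the original post; the feature name "x" and the input shape below are assumptions based on the Iris quickstart): each call to classifier.predict() rebuilds the graph and reloads the checkpoint, so instead keep a single predict() generator alive and feed it new samples through a queue.

```python
import queue

import numpy as np
import tensorflow as tf

sample_queue = queue.Queue()  # sensor readings get pushed here

def predict_input_fn():
    def generator():
        while True:
            # Blocks until a new sample arrives, so predict() never terminates.
            yield {"x": sample_queue.get()}
    return tf.data.Dataset.from_generator(
        generator,
        output_types={"x": tf.float32},
        output_shapes={"x": (4,)},  # assumed 4-feature input, adjust to your model
    ).batch(1)

# 'classifier' is the estimator from the Quickstart example.
predictions = classifier.predict(input_fn=predict_input_fn)

def predict_one(sample):
    """Push one sample and pull one prediction; the graph stays loaded."""
    sample_queue.put(np.asarray(sample, dtype=np.float32))
    return next(predictions)
```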