How can I stop model training and resume it? - python

I am working on object detection with autonomous datasets. I want to train my model with 10,000 training images, 2,000 test images, and 2,000 validation images. I will use the TensorFlow Lite Model Maker for object detection.
Project link : tensorflow.org/lite/tutorials/model_maker_object_detection
With a batch size of 32, training runs for 50 epochs and takes about two days (Step 3). I can't keep my computer on for two days. I am running the project in a Jupyter notebook.
How can I stop model training and resume it later? (e.g. stop at the 10th epoch and continue a day later)

I'm sure it depends on the code you're working on. You can do that with TensorFlow checkpoints; see:
How to Pause / Resume Training in Tensorflow
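A minimal sketch of that checkpoint pattern in plain Keras (Model Maker does not expose a pause/resume switch directly, but it trains a Keras model underneath; build_model() and train_ds below are placeholders for your own code, not Model Maker's API):

import tensorflow as tf

model = build_model()  # placeholder for your own model-building code

# Save the weights at the end of every epoch.
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='ckpt/epoch_{epoch:02d}', save_weights_only=True)

# Day 1: train the first 10 epochs, then shut down.
model.fit(train_ds, epochs=10, callbacks=[ckpt_cb])

# Day 2: rebuild the model, restore the last weights, and continue from
# epoch 10 so epoch numbering and any schedules stay consistent.
model = build_model()
model.load_weights('ckpt/epoch_10')
model.fit(train_ds, initial_epoch=10, epochs=50, callbacks=[ckpt_cb])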

Sleep mode may be a better option: it lets your PC rest for some time, and your work resumes after you log in again.

Related

base_model.summary() Crashes my notebook and VS Code - ResNet101

I am trying to print a model summary in TensorFlow, and I think the model is so large that it crashes my notebook. The model is ResNet101.
The whole computer comes to a halt, memory usage goes up to 99%, and VS Code crashes. I have 16 GB of RAM, so I didn't think printing something large would actually eat all of it. Also, because the kernel crashes, all the variables are lost, such as history = model.fit(), which I need in order to fine-tune the model afterwards. Moreover, I need to print the base_model summary to choose which layer to fine-tune from.
Is there a way to print the summary differently and save the entire notebook with its variables, so I can continue working? I have checkpoints for the model weights, but I need to keep track of past epochs through history to resume training afterwards.
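If the crash comes from printing to the notebook output, one workaround (a sketch; the file names are arbitrary) is to route the summary to a file and persist history.history, which is a plain dict of lists and therefore JSON-serializable:

import json

# Write the very long summary to a file instead of the notebook output.
with open('resnet101_summary.txt', 'w') as f:
    base_model.summary(print_fn=lambda line: f.write(line + '\n'))

# Persist the training history so it survives a kernel crash.
with open('history.json', 'w') as f:
    json.dump(history.history, f)

# After a restart, reload it to keep track of past epochs.
with open('history.json') as f:
    past_history = json.load(f)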

Spyder kernel dies during training

I am trying to train a fairly complex GCN on my 10 GB GPU. It runs smoothly until epoch 87, but then the Spyder kernel restarts. Is it a memory issue? If so, how can I handle it?
As you mentioned, if the model is large it is good to store checkpoints after every epoch:
import os
import torch

## after every epoch
path = os.path.join(SAVE_DIR, 'model.pth')
torch.save(model.cpu().state_dict(), path)  # save parameters from a CPU copy
model.cuda()  # move the model back to the GPU for further training

## if the kernel terminates, load the model parameters
device = torch.device("cuda")
model = TheModelClass()
model.load_state_dict(torch.load(path))
model.to(device)
model.train()
That way, if anything happens to the process, you can restart from the last completed epoch.
From the information given, it's hard to tell exactly what is terminating the kernel. RAM overload is less likely given GPU acceleration and the PyTorch framework, but it is possible.
In any case, the checkpointing approach above will help.
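If you also want optimizer state (e.g. Adam moments) and the epoch counter to survive a restart, a common extension of the snippet above is to save everything in one dict; SAVE_DIR, model, optimizer, and epoch are assumed from your own training script:

import os
import torch

ckpt_path = os.path.join(SAVE_DIR, 'checkpoint.pth')
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, ckpt_path)

# After a kernel restart:
checkpoint = torch.load(ckpt_path)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
start_epoch = checkpoint['epoch'] + 1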

Run evaluation after part of a training epoch

I load two datasets with the Dataset API, one for training and one for evaluation, and I switch between them with sess.run(train_init_op) before running evaluation or training.
Currently I run the evaluation after finishing one epoch, i.e. after the training dataset has been run through completely.
If I want to evaluate my network before the training dataset is finished, I would have to switch earlier, and by doing so TensorFlow would forget where it was in the training dataset. Is there any way to remember the state of the training dataset iterator and switch back to that position after the evaluation has finished?
I think it is not only about remembering the position in the training set, but also about accumulated gradients, optimizer parameters (if you use something like Adam), etc. Switching context between training and validation can be tricky.
For instance, in the Google object detection API there is a separate validation process that watches for fresh checkpoints and runs validation on them while training keeps running. By setting the checkpoint interval, you can achieve any validation frequency you want.
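A rough sketch of that pattern using the TF1 APIs from the question (TRAIN_DIR, build_graph(), and the op names are placeholders; this is one way to set it up, not the object detection API's actual code):

import tensorflow as tf

# Runs as a second process, so the training iterator is never disturbed.
eval_init_op, eval_op = build_graph()  # rebuild the same graph here
saver = tf.train.Saver()

# Yields each new checkpoint path as the training process writes it.
for ckpt_path in tf.train.checkpoints_iterator('/tmp/train_logs'):
    with tf.Session() as sess:
        saver.restore(sess, ckpt_path)
        sess.run(eval_init_op)  # point the iterator at the evaluation set
        print(sess.run(eval_op))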

How long does the TensorFlow object detection API's train.py take to complete training using only a CPU?

I am a beginner in machine learning. Recently, I successfully ran a machine learning application using the TensorFlow object detection API.
My dataset is 200 images of the object at 300×300 resolution. However, the training has been running for two days and has yet to complete.
How long should it take to complete training? At the moment it is at global step 9000; how many global steps are needed to complete the training?
P.S.: the training uses only CPUs.
It depends on your desired accuracy and your dataset, of course, but I generally stop training when the loss value gets to around 4 or less. What is your current loss value after 9000 steps?
To me this sounds like your training is not converging.
See the discussion in the comments of this question.
Basically, it is recommended that you run eval.py in parallel and check how the model performs there as well.

My TensorFlow DNN Classifier is very slow to start predicting results

I'm running the Tensorflow tf.estimator Quickstart example on a Raspberry Pi 3. It works well, however the prediction is very slow.
The instruction predictions = list(classifier.predict(input_fn=predict_input_fn)) takes seconds to return.
It's a problem for me because I want to predict a simple tuple immediately after I receive it from my sensors, I cannot batch the predictions.
My model_dir is a tmpfs folder (i.e. in RAM), so I don't think it's related to I/O latency. Perhaps the network is built every time; I don't know.
I'm probably doing something wrong. Do you know how to run TensorFlow predictions on the same classifier quickly?
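One likely cause (an assumption, but consistent with the symptom): Estimator.predict() rebuilds the graph and reloads the checkpoint on every call. In TF 1.x, tf.contrib.predictor keeps one session alive across calls; a sketch, with the feature name 'x' and shape taken from the Quickstart:

import numpy as np
import tensorflow as tf

def serving_input_fn():
    # 'x' matches the Quickstart's numeric_column("x", shape=[4]).
    features = {'x': tf.placeholder(tf.float32, shape=[None, 4], name='x')}
    return tf.estimator.export.ServingInputReceiver(features, features)

# Builds the graph and loads the checkpoint once.
predictor = tf.contrib.predictor.from_estimator(classifier, serving_input_fn)

# Subsequent calls reuse the live session, so per-sample latency stays low.
sample = np.array([[6.4, 3.2, 4.5, 1.5]], dtype=np.float32)
print(predictor({'x': sample}))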
