I am trying to train a TensorFlow object detection model on a custom dataset on Google Colab, and I have a saved model trained for 5000 steps. Is it possible to use the saved model to resume training? I am planning to train for another 20000 steps. Training will take around 36 hours, so I'm planning to use checkpoints. How can I store the best model checkpoints and use them when the session runs out?
To resume training using weights from a saved checkpoint, edit your pipeline.config file and change the fine_tune_checkpoint line from <path_to_ckpt>/model.ckpt to <path_to_ckpt>/model.ckpt-XXXX, where XXXX is your checkpoint number.
As far as saving only the best weights is concerned, you can refer to this post and/or this GitHub link.
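The pipeline.config edit described above can also be scripted, which is convenient on Colab where the session has to be rebuilt after a timeout. This is only a minimal sketch and not part of the Object Detection API: the helper name set_fine_tune_checkpoint and the sample paths are made up for illustration, and it assumes the config contains a single fine_tune_checkpoint: "..." entry.

```python
import re


def set_fine_tune_checkpoint(config_text: str, checkpoint_prefix: str) -> str:
    """Return config_text with fine_tune_checkpoint pointing at checkpoint_prefix.

    Hypothetical helper: it only rewrites the text of pipeline.config and
    assumes one `fine_tune_checkpoint: "..."` entry is present.
    """
    pattern = re.compile(r'fine_tune_checkpoint:\s*"[^"]*"')
    if not pattern.search(config_text):
        raise ValueError("no fine_tune_checkpoint entry found in the config")
    replacement = f'fine_tune_checkpoint: "{checkpoint_prefix}"'
    # A function is used as the replacement so backslashes in the path
    # are inserted literally instead of being read as regex escapes.
    return pattern.sub(lambda _match: replacement, config_text, count=1)


# Example with made-up paths; 5000 stands in for the XXXX checkpoint number.
sample = 'fine_tune_checkpoint: "/content/ckpts/model.ckpt"\nnum_steps: 25000\n'
updated = set_fine_tune_checkpoint(sample, "/content/ckpts/model.ckpt-5000")
print(updated)
```

In practice you would read pipeline.config from disk, pass its text through the helper, and write the result back before launching training again.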
Related
I am trying to deploy a TensorFlow Keras model using Amazon SageMaker. The process finishes successfully, yet I get different results when predicting directly with Keras and when calling the SageMaker endpoint.
I used these steps to deploy the model to SageMaker.
Check the following example.
import numpy as np

# Random input batch: one 150x150 image with 3 channels
data = np.random.randn(1, 150, 150, 3)
# Predict using Amazon SageMaker
sagemaker_predict = uncompiled_predictor.predict(data)
print(sagemaker_predict)
# Predict the same input using Keras
val = model.predict(data)
print(val)
>>{'predictions': [[0.491645753]]}
[[0.]]
Is this supposed to happen? To my knowledge, the results should be the same. For some reason, either the data gets corrupted or the weights are reinitialized on SageMaker. Any ideas?
This is not supposed to happen.
See what you get if you deploy the model directly to TensorFlow Serving (which is what the SageMaker inference container wraps).
To experiment faster, you can work with the SageMaker inference container in local mode, so you can start and stop an endpoint in seconds.
I finally found a solution. The problem seems to be with the .h5 (HDF5) file: for some reason SageMaker does not seem to extract the weights from it. I therefore switched to the TensorFlow SavedModel format.
As per the TensorFlow Keras save and serialize documentation:
There are two formats you can use to save an entire model to disk: the TensorFlow SavedModel format, and the older Keras H5 format. The recommended format is SavedModel. It is the default when you use model.save().
You can switch to the H5 format by:
Passing save_format='h5' to save().
Passing a filename that ends in .h5 or .keras to save()
So instead of saving the model as
model.save("my_model.h5")
save it as
model.save("my_model")
and load the same model with
keras.models.load_model("my_model")
This saves the model in the TensorFlow SavedModel format, which you can then load and deploy to SageMaker by following the documentation above.
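After redeploying, it helps to compare the endpoint output with the local Keras output numerically rather than by eye. The sketch below assumes the endpoint response has the {'predictions': [[...]]} shape shown in the question. The helper name predictions_match and the tolerance are my own illustrative choices, not part of the SageMaker SDK.

```python
import numpy as np


def predictions_match(endpoint_response: dict, local_output, atol: float = 1e-5) -> bool:
    """Compare an endpoint response with a local model.predict() output.

    Assumes the response looks like {'predictions': [[...]]}; the tolerance
    is an arbitrary illustrative value.
    """
    remote = np.asarray(endpoint_response["predictions"], dtype=np.float64)
    local = np.asarray(local_output, dtype=np.float64)
    return remote.shape == local.shape and bool(np.allclose(remote, local, atol=atol))


# The mismatching values reported in the question:
broken = predictions_match({"predictions": [[0.491645753]]}, [[0.0]])
# A made-up local value that agrees with the endpoint:
healthy = predictions_match({"predictions": [[0.491645753]]}, [[0.4916457]])
print(broken, healthy)
```

With a real deployment you would pass the dictionary returned by the predictor and the array returned by model.predict(data) for the same input.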
I am using the TensorFlow 2.x Object Detection API. I have trained a deep learning model from the model zoo on my dataset, using Google Colab. Now that training is done, I want to evaluate my model with the COCO detection metrics. I used the following script to evaluate it:
!python3 model_main_tf2.py \
    --model_dir=path/to/model_directory \
    --pipeline_config_path=path/to/pipeline_config_file \
    --checkpoint_dir=path/to/checkpoint_directory
After running the above code, I get the mean average precision (mAP) and average recall (AR) for the latest checkpoint on my test set. For academic purposes, however, I want these metrics for all the checkpoints so I can plot how my model has improved over time. Is there a way to do that? Or is it possible to train and evaluate at the same time in the TensorFlow 2 Object Detection API? I am a beginner in this field, so kindly help me out with this issue. Thank you.
I am facing the same problem, so I had an idea. We can run the model_main_tf2.py command you mentioned to evaluate the model, but change the current checkpoint (the first line of the checkpoint file) between runs so that each run evaluates a different checkpoint:
model_checkpoint_path: "ckpt-1"
then
model_checkpoint_path: "ckpt-2"
then
model_checkpoint_path: "ckpt-3"
.
.
.
Each evaluation run writes an events.out.tfevents file, so you can then open TensorBoard pointing at the directory that contains all of them and see how the model improves over time.
I only kept the last 3 checkpoints on my computer, so I can't see the progress from the beginning (my fault), but if you have all the checkpoints, try what I suggest.
See my graph evaluating the last 3 checkpoints.
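Switching the checkpoint file by hand gets tedious with many checkpoints, so here is a rough sketch of how it could be scripted. It assumes every checkpoint has a ckpt-N.index file next to the checkpoint file. The function names list_checkpoints and point_checkpoint_file_at are hypothetical, and the demo runs on empty placeholder files in a temporary directory.

```python
import re
import tempfile
from pathlib import Path
from typing import List


def list_checkpoints(checkpoint_dir: str) -> List[str]:
    """Return checkpoint names such as 'ckpt-3', sorted by their number.

    Assumes each checkpoint has a 'ckpt-N.index' file in checkpoint_dir.
    """
    numbered = []
    for index_file in Path(checkpoint_dir).glob("ckpt-*.index"):
        match = re.fullmatch(r"ckpt-(\d+)", index_file.stem)
        if match:
            numbered.append((int(match.group(1)), index_file.stem))
    return [name for _, name in sorted(numbered)]


def point_checkpoint_file_at(checkpoint_dir: str, name: str) -> None:
    """Rewrite the first line of the 'checkpoint' file to select `name`."""
    checkpoint_file = Path(checkpoint_dir) / "checkpoint"
    lines = checkpoint_file.read_text().splitlines() or [""]
    lines[0] = f'model_checkpoint_path: "{name}"'
    checkpoint_file.write_text("\n".join(lines) + "\n")


# Demo on placeholder files; a real directory would hold actual checkpoints.
with tempfile.TemporaryDirectory() as demo_dir:
    for number in (1, 2, 10):
        (Path(demo_dir) / f"ckpt-{number}.index").touch()
    (Path(demo_dir) / "checkpoint").write_text('model_checkpoint_path: "ckpt-10"\n')
    ordered = list_checkpoints(demo_dir)
    point_checkpoint_file_at(demo_dir, ordered[0])
    first_line = (Path(demo_dir) / "checkpoint").read_text().splitlines()[0]
    print(ordered)
    print(first_line)
```

After each call to point_checkpoint_file_at you would rerun the evaluation command from the question. As far as I can tell, the evaluation job keeps waiting for newer checkpoints, so it may have to be interrupted before moving on to the next one.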
You should have an eval directory including an events.out.tfevents file under your model directory. You can run !tensorboard --logdir=path/to/eval/directory to access the graphs.
You can run training with the same snippet you have, except without the checkpoint_dir flag, and open another terminal to run evaluation as you are currently doing.
I am new to TensorFlow and built a CNN that reaches 80-85% accuracy on both training and test data. I saved the trained model using model.save('example.h5') and downloaded the file using files.download('example.h5').
Afterwards I loaded it in my Flask back end using model = tf.keras.models.load_model('example.h5').
When I tried it on random images, it behaved as if the model had never been trained. Any solutions? Thank you.
A common reason for this is that you normalize your data during training but not before predicting. If you trained on normalized data, make sure you apply the same normalization to the input before you run predictions.
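For illustration, here is what such a preprocessing step could look like. It assumes the network was trained on images scaled to [0, 1] by dividing by 255. That assumption and the function name preprocess are mine, so substitute whatever normalization your training pipeline really used.

```python
import numpy as np


def preprocess(image_uint8: np.ndarray) -> np.ndarray:
    """Scale pixels to [0, 1] and add a batch axis.

    Assumes training used images divided by 255; this must mirror the
    normalization applied during training, whatever it was.
    """
    scaled = image_uint8.astype(np.float32) / 255.0
    return np.expand_dims(scaled, axis=0)


# A made-up 150x150 RGB image standing in for an uploaded file.
raw = np.random.randint(0, 256, size=(150, 150, 3), dtype=np.uint8)
batch = preprocess(raw)
print(batch.shape, batch.dtype)
```

In the Flask handler, the resulting batch is what would be passed to model.predict(batch) instead of the raw image.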
Which method is better: saving model checkpoints, or saving the entire model to disk at each epoch? Why does nobody save the entire model?
A Keras model consists of two things: an architecture and weights. If you save the whole model at each checkpoint, you save the architecture every time even though it does not change. For this reason, the usual practice during training is to save only the weights and reuse the architecture already held in memory.
The tensorflow.keras documentation has more on the other methods.
Checkpoints are used to save your model in case your system crashes or your code is interrupted during training, so that when you restart training you don't have to start from scratch. Checkpoints capture the exact value of all parameters (tf.Variable objects) used by a model. Checkpoints do not contain any description of the computation defined by the model.
The SavedModel format on the other hand includes a serialized description of the computation defined by the model in addition to the parameter values (checkpoint). Models in this format are independent of the source code that created the model.
You can find the above information in the official TensorFlow documentation. #R Nanthak
I trained a neural network with Google Colab.
I saved the neural network using joblib.dump()
I then loaded the model on my PC using joblib.load()
I made a prediction on the exact same sample, using the same model, on both Colab and my PC. On Colab, the output is [[0.51]]. On my PC, the output is [[nan]].
The model summary reports that the architecture of the model is the same.
I checked the weights of the model I loaded on my PC, and the model on colab, and the weights are the exact same.
Any ideas as to what I can do? Thank you.
Quick update: even if I change all of my inputs to zero, the prediction is still nan.
As far as I know, Keras has its own function for saving a model, such as model.save('file.h5'), whereas the joblib library is typically used to save scikit-learn models.