I have been trying to train an object detection model for the past 2 months and have finally succeeded by following this tutorial.
Here is my colab which contains all my work.
The problem is that the training loss is shown (and is decreasing on average), but the validation loss is not shown at all.
In the pipeline.config file, I did specify the evaluation TFRecord file (which I assumed to be the validation data input), like this:
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
}
eval_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
}
and I read through model_main_tf2.py, which does not seem to evaluate while training; it only evaluates when checkpoint_dir is provided.
Hence, I have only been able to monitor the loss on the training set and not the loss on the validation set.
As a result, I have no clue about overfitting or underfitting.
Have any of you managed to use model_main_tf2.py successfully to view validation loss?
Also, it would be nice to see the mAP score with training.
I know Keras training lets you see all of this in TensorBoard, but the Object Detection API seems to be much harder.
Thank you for your time; if anything is unclear, please let me know.
You have to open another terminal and run this command:
python model_main_tf2.py \
--model_dir=models/my_ssd_resnet50_v1_fpn \
--pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config \
--checkpoint_dir=models/my_ssd_resnet50_v1_fpn
This API tutorial is unclear on that topic. I had the exact same issue.
It turns out that the evaluation process is not included in the training loop; you must launch it in parallel.
It will wait and print "Waiting for new checkpoint", which means that once you launch training with:
python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config # note that the checkpoint_dir argument is not there
it will run the evaluation once every eval_interval_secs, as set in your eval_config.
According to the documentation, the eval metrics will then be stored next to your checkpoints inside an eval_0 directory, which you can then plot in TensorBoard.
I agree that this is a bit hard to figure out, as it is not very clear in the documentation. It is also not very convenient, since I had to allocate a second GPU for the evaluation to avoid CUDA out-of-memory errors.
Have a nice day
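As a side note, if you want the raw numbers rather than just the TensorBoard plots, the events written to that eval_0 directory can be read directly. The snippet below is only a sketch added for illustration (it is not from the original answer): the events file path is a placeholder, and it assumes the eval metrics were written either as TF2 tensor summaries or as TF1-style simple values.

import tensorflow as tf

# Placeholder path: point this at an events file inside the eval_0 directory
# created next to your checkpoints.
events_path = "models/my_ssd_resnet50_v1_fpn/eval_0/events.out.tfevents.XXXXXXXXXX"

for event in tf.compat.v1.train.summary_iterator(events_path):
    for value in event.summary.value:
        try:
            # TF2 stores scalars as tensors; older event files use simple_value.
            if value.HasField("tensor"):
                scalar = float(tf.make_ndarray(value.tensor))
            else:
                scalar = value.simple_value
        except (TypeError, ValueError):
            continue  # skip non-scalar summaries such as detection images
        print(event.step, value.tag, scalar)  # e.g. validation loss and COCO mAP per step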
Related
I want to do cross-validation or hold-out validation with the API that I have mentioned in the title. I have already read all the similar questions, but I have not found a solution.
I want to simulate something like Keras's validation_split: before testing, I want to save the model with the best performance on the validation set, but using the TensorFlow Object Detection API 2 and the TensorFlow Model Zoo.
I have already tried the guide at this link, but it only gives me performance metrics, and I need to find the best model.
At the moment, I set the fine-tuned model config like this:
train_input_reader {
  label_map_path: "/annotations/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/annotations/train.record"
  }
}
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1
}
eval_input_reader {
  label_map_path: "/annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "/annotations/validation.record"
  }
}
and I run in parallel:
the validation script which waits for new checkpoints:
!python model_main_tf2.py \
--pipeline_config_path= path to your config file \
--model_dir= path to a directory with your model \
--checkpoint_dir= path to a directory with checkpoints \
--sample_1_of_n_eval_examples=1
and the training script which generates checkpoints:
!python model_main_tf2.py \
--model_dir= path to a directory with your model \
--pipeline_config_path= path to your config file \
--num_train_steps=13000
I'm working on Google Colab.
Running this code gives me the checkpoints for the trained model and eval events such as ./events.out.tfevents.1635340070.da52917cf102.2947.0.v2, which I can plot with TensorBoard, but I can't do anything else.
I think this is a kind of Keras-like single-split validation, but I can't find an automatic way to save the best model, and I can't figure out how to extend this to multiple folds.
Any ideas?
I am using the TensorFlow 2.x Object Detection API. I have trained a deep learning model from the model zoo on my dataset. I am using Google Colab. After training, I now want to evaluate my model using the COCO detection metrics. I used the following script to evaluate my model:
!python3 model_main_tf2.py \
--model_dir=path/to/model directory \
--pipeline_config_path=path/to/pipeline config file \
--checkpoint_dir=path/to/checkpoint directory
After running the above code I get the mean average precision (mAP) and average recall (AR) for the latest checkpoint on my test set. But for academic purposes, I want to get these metrics for all the checkpoints, so I can graph how my model has improved over time. Is there a way to do that? Or is it possible to train and evaluate at the same time in the TensorFlow 2 Object Detection API? I am a beginner in this field, so kindly help me out with this issue. Thank you.
I am facing the same problem, so I had an idea: we can run the model_main_tf2.py script you mentioned to evaluate the model, but change which checkpoint gets evaluated by editing the first line of the checkpoint file:
model_checkpoint_path: "ckpt-1"
then
model_checkpoint_path: "ckpt-2"
then
model_checkpoint_path: "ckpt-3"
.
.
.
For each checkpoint you will get a .tfevents file; then open TensorBoard pointing at the directory that contains all the .tfevents files, and you can see how the model improves over time.
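If you kept many checkpoints, you can script this instead of editing the file by hand. The sketch below is only an illustration of that idea, not code from this answer: the paths, the checkpoint range, and the --eval_timeout value are assumptions you will need to adapt, and you may want to back up the original checkpoint file first, since the script overwrites it.

import os
import subprocess

# Placeholder paths: adapt to your own directory layout.
MODEL_DIR = "models/my_ssd_resnet50_v1_fpn"
PIPELINE_CONFIG = os.path.join(MODEL_DIR, "pipeline.config")
CHECKPOINT_STATE_FILE = os.path.join(MODEL_DIR, "checkpoint")

# Evaluate ckpt-1 ... ckpt-10; adjust the range to the checkpoints you actually kept.
for i in range(1, 11):
    # Rewrite the checkpoint state file so the evaluator picks up this checkpoint.
    with open(CHECKPOINT_STATE_FILE, "w") as f:
        f.write(f'model_checkpoint_path: "ckpt-{i}"\n')

    # Run model_main_tf2.py in eval-only mode (checkpoint_dir is set).
    # --eval_timeout makes it exit shortly after evaluating instead of waiting
    # indefinitely for a new checkpoint.
    subprocess.run(
        [
            "python", "model_main_tf2.py",
            f"--model_dir={MODEL_DIR}",
            f"--pipeline_config_path={PIPELINE_CONFIG}",
            f"--checkpoint_dir={MODEL_DIR}",
            "--eval_timeout=60",
        ],
        check=True,
    )

Each run should append a new .tfevents file tagged with that checkpoint's global step, so pointing TensorBoard at the eval events directory afterwards should show the full curve.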
I only saved the last 3 checkpoints on my computer, so I can't see the progress from the beginning (my fault), but if you have all the checkpoints, try doing what I suggest.
See my graph evaluating the last 3 checkpoints.
You should have an eval directory including an events.out.tfevents file under your model directory. You can run !tensorboard --logdir=path/to/eval/directory to access the graphs.
You can run training with the same snippet you have, just without the checkpoint_dir, and open another terminal to run evaluation like you're currently doing.
I have trained an image classification model using Keras. After training, the model has 95% accuracy on the training data, and using model.evaluate on untouched validation data I get ~92.8% accuracy.
But when I instead use the model.predict function to get the prediction probabilities and take the class with the maximum probability, I get ~80% accuracy.
The complete code is available as a colab notebook on the following link - https://colab.research.google.com/drive/1RQ2KnT2sVsdCAWfpsDj_kcMZiqiwJrpc?usp=sharing
You should be able to run everything and see the difference in accuracy. The problem lies in the code blocks as shown below
To make the accuracies from predict_generator and evaluate_generator match, you have to set the following 3 things as parameters in your functions:
shuffle = False
pickle_safe = True
workers = 1
Your program might be running on different threads and these settings make it run on the main thread.
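For illustration, here is a minimal tf.keras sketch of the comparison (the data directory, image size, and model file below are assumptions, not the asker's actual code, which is in the linked Colab). The key point is that the validation generator must be created with shuffle=False; otherwise the order of predict's outputs no longer matches generator.classes, and the manually computed accuracy drops.

import numpy as np
import tensorflow as tf

# Placeholder data directory and image size; adapt to your own dataset.
datagen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)
val_gen = datagen.flow_from_directory(
    "data/validation",
    target_size=(224, 224),
    batch_size=32,
    class_mode="categorical",
    shuffle=False,  # crucial: keeps predictions aligned with val_gen.classes
)

model = tf.keras.models.load_model("my_model.h5")  # placeholder saved model

# Accuracy reported by evaluate() (assumes the model was compiled with metrics=['accuracy']).
loss, acc = model.evaluate(val_gen, verbose=0)

# Accuracy computed manually from predict().
probs = model.predict(val_gen, verbose=0)
pred_classes = np.argmax(probs, axis=1)
manual_acc = np.mean(pred_classes == val_gen.classes)

print(f"evaluate accuracy: {acc:.4f}, predict-based accuracy: {manual_acc:.4f}")

With shuffle=False on the generator, the two numbers should agree.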
The solution I could find so far, after having posted the issue here and on the official Keras GitHub (without any answer for weeks), is to use tf.keras instead of Keras. Most of the implementation stayed the same, and the shuffle option is definitely what was messing up the accuracy; the lower accuracy with shuffle=False is probably a bug in the standalone Keras implementation. The tf.keras implementation gives the same result in the evaluate_generator function, and the predict and evaluate outputs now match with respect to accuracy. I hope that if other people encounter this error, they don't waste as much time as I did on it.
I would like to evaluate a custom-trained Tensorflow object detection model on a new test set using Google Cloud.
I obtained the initial checkpoints from:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md
I know that the Tensorflow object-detection API allows me to run training and evaluation simultaneously by using:
https://github.com/tensorflow/models/blob/master/research/object_detection/model_main.py
To start such a job, I submit the following ml-engine job:
gcloud ml-engine jobs submit training [JOBNAME] \
    --runtime-version 1.9 \
    --job-dir=gs://path_to_bucket/model-dir \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,pycocotools-2.0.tar.gz \
    --module-name object_detection.model_main \
    --region us-central1 \
    --config object_detection/samples/cloud/cloud.yml \
    -- \
    --model_dir=gs://path_to_bucket/model_dir \
    --pipeline_config_path=gs://path_to_bucket/data/model.config
However, after I have successfully transfer-trained a model, I would like to calculate performance metrics such as COCO mAP (http://cocodataset.org/#detection-eval) or PASCAL mAP (http://host.robots.ox.ac.uk/pascal/VOC/pubs/everingham10.pdf) on a new test data set which has not been used before (neither during training nor during evaluation).
I have seen that there is a relevant flag in model_main.py:
flags.DEFINE_string(
    'checkpoint_dir', None, 'Path to directory holding a checkpoint.  If '
    '`checkpoint_dir` is provided, this binary operates in eval-only mode, '
    'writing resulting metrics to `model_dir`.')
But I don't know whether this really implies that model_main.py can be run in an evaluation-only mode. If yes, how should I submit the ML-Engine job?
Alternatively, are there any functions in the TensorFlow API which allow me to evaluate an existing output dictionary (containing bounding boxes, class labels, scores) based on COCO and/or PASCAL mAP? If there are, I could easily read in a TFRecord file locally, run inference and then evaluate the output dictionary.
I know how to obtain these metrics for the evaluation data set that is evaluated during training in model_main.py. However, from my understanding I should still report model performance on a new test data set, since I compare multiple models and do some hyper-parameter optimization, and thus I should not report results on the evaluation data set, am I right? On a more general note: I really cannot comprehend why one would switch from separate training and evaluation (as in the legacy code) to a combined training and evaluation script.
Edit:
I found two related posts. However I do not think that the answers provided are complete:
how to check both training/eval performances in tensorflow object_detection
How to evaluate a pretrained model in Tensorflow object detection api
The latter has been written while TF's object detection API still had separate evaluation and training scripts. This is not the case anymore.
Thank you very much for any help.
If you specify the checkpoint_dir and set run_once to true, then it should run evaluation exactly once on the eval dataset. I believe that metrics will be written to the model_dir and should also appear in your console logs. I usually just run this on my local machine, since it's just doing one pass over the dataset and is not a distributed job. Unfortunately I haven't tried running this particular codepath on CMLE.
Regarding why we have a combined script... from the perspective of the Object Detection API, we were trying to write things in the tf.Estimator paradigm --- but you are right that personally I found it a bit easier when the two functionalities lived in separate binaries. If you want, you can always wrap up this functionality in another binary :)
I am using a tensorflow estimator object to train a model from the official tensorflow layers documentation (https://www.tensorflow.org/tutorials/layers). I can see that the training loss is displayed on the console during training. Is there a way to store these training loss values?
Thanks!
The displaying is done via logging.info. tf.estimator creates a LoggingTensorHook for the training loss to do this; see here.
I suppose you could reroute the logging output to some file, but this would still not give you the raw values.
Two ways I could think of:
Write your own hook to store the values; this would probably look very similar to LoggingTensorHook, except that you would write the numbers to a file instead of printing them (a rough sketch of such a hook is shown after this list).
By default tf.estimator also creates summary data in Tensorboard for the training loss; you could open the "Scalar" tab in Tensorboard where you should see the loss curve. Tick "Show data download links" in the top left corner. This will give you an option to download each graph's data in either CSV or JSON format. By default, both the logging and summary hooks are set up such that they both log values every 100 steps. So the graph should have the same information you saw in the console. If you're unfamiliar with Tensorboard, there are tutorials on the Tensorflow website as well; the basic usage should be quite simple!
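For the first option, here is a rough sketch of such a hook, written against the TF 1.x estimator API that this question uses. It is not from the answer above, and the default tensor name "loss:0" is an assumption: look up the real name of the loss tensor produced by your model_fn (for example by printing loss.name there).

import csv

import tensorflow as tf


class LossToCsvHook(tf.train.SessionRunHook):
    """Writes the training loss to a CSV file, similar in spirit to LoggingTensorHook."""

    def __init__(self, csv_path, loss_tensor_name="loss:0", every_n_steps=100):
        # "loss:0" is an assumed name; replace it with your actual loss tensor name.
        self._csv_path = csv_path
        self._loss_tensor_name = loss_tensor_name
        self._every_n_steps = every_n_steps

    def begin(self):
        self._step = 0
        self._loss_tensor = tf.get_default_graph().get_tensor_by_name(
            self._loss_tensor_name)
        self._file = open(self._csv_path, "w", newline="")
        self._writer = csv.writer(self._file)
        self._writer.writerow(["step", "loss"])

    def before_run(self, run_context):
        # Ask the session to also return the loss value for this step.
        return tf.train.SessionRunArgs(fetches={"loss": self._loss_tensor})

    def after_run(self, run_context, run_values):
        if self._step % self._every_n_steps == 0:
            self._writer.writerow([self._step, run_values.results["loss"]])
        self._step += 1

    def end(self, session):
        self._file.close()

You would then pass it to training, e.g. estimator.train(input_fn=train_input_fn, hooks=[LossToCsvHook("train_loss.csv")]).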
You can use the TensorBoard event file in model_dir after training your estimator with estimator.train():
model = tf.estimator.Estimator(..., model_dir= 'tmp')
# model data will be saved in the 'tmp' directory after training
The event file has a name like events.out.tfevents.15121254.... This file saves the log of the training process (there is another event file in the eval folder that saves the evaluation log). You can get the training loss with:
import tensorflow as tf

# Iterate over all events in the file and print the training-loss values.
for e in tf.train.summary_iterator(path_to_events_file):
    for v in e.summary.value:
        if v.tag == 'loss':
            print(v.simple_value)
In addition, you can save other values during training by adding tf.summary calls inside your model_fn:
tf.summary.scalar('accuracy', accuracy)
reference: https://www.tensorflow.org/api_docs/python/tf/train/summary_iterator
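To make that last point concrete, here is a toy model_fn sketch (my own illustration, not from the answer; the feature key "x", the layer size, and the optimizer are arbitrary) showing where such a summary call sits. Anything logged this way ends up in the same events file and can be read back with summary_iterator or viewed in TensorBoard.

import tensorflow as tf

def model_fn(features, labels, mode):
    # Toy linear classifier; only the TRAIN path is sketched here.
    logits = tf.layers.dense(features["x"], units=10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    predictions = tf.argmax(logits, axis=1, output_type=tf.int32)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(predictions, labels), tf.float32))
    tf.summary.scalar('accuracy', accuracy)  # logged to the events file alongside the loss

    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=train_op)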