I have a few queries regarding the TensorFlow Object Detection API.
While training, only the previous 5 checkpoints are stored. I want to store more than that, say the previous 10 checkpoints. How can this be done? (I think it should be one of the parameters of train.proto in object_detection/protos.)
By default, checkpoints are stored every 10 minutes (600 seconds). To change this frequency, I believe it is one of these two parameters that has to be changed; please confirm which one it is:
from learning.py in
/home/user/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim
save_summaries_secs=600 or
save_interval_secs=600
While training my model (ssd_mobilenet_v2_coco_2018_03_29), I also run the evaluation simultaneously. The latest checkpoint represented in the eval graph always lags the latest one saved in the object_detection/training folder. For example, in the case below, the latest checkpoint shown on the graph is 29.437k, while the model has already been trained up to checkpoint 32.891k (and saved in the training folder). What is the reason for this lag (a 20-minute lag)? Why isn't one step (10 minutes) enough to perform evaluation on the trained model?
This is for anyone who wants to configure the updated Object Detection API that supports TensorFlow 2.
To save the previous 10 checkpoints, open model_lib.py and pass the keyword argument max_to_keep=10 to every tf.train.Saver constructor.
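For illustration, a standalone sketch of that Saver argument (the actual Saver call sites inside model_lib.py vary by release, so the surrounding lines here are just a stand-in):
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

w = tf.get_variable('w', shape=[2])     # dummy variable so the Saver has something to track
saver = tf.train.Saver(max_to_keep=10)  # keep the 10 most recent checkpoints (the default is 5)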
To change the frequency from 600 seconds to 3600 seconds (1 hour),
open model_main.py and find the line that contains tf.estimator.RunConfig in the main function. Pass the keyword argument save_checkpoints_secs=3600 to the tf.estimator.RunConfig class.
Here is the code snippet after configuring checkpoint save frequency in model_main.py:
def main(unused_argv):
  flags.mark_flag_as_required('model_dir')
  flags.mark_flag_as_required('pipeline_config_path')
  config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_secs=3600)
Please note that there is a keep_checkpoint_max parameter in the tf.estimator.RunConfig class, but setting it didn't affect the number of saved checkpoints for me.
This post should work, I believe, for changing keep_checkpoint_every_n_hours and max_to_keep: How to store best models checkpoints, not only newest 5, in Tensorflow Object Detection API?
You can also refer to the official doc:
https://www.tensorflow.org/api_docs/python/tf/train/Saver
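If you are on the TF2 side of the API, the equivalent knobs live on tf.train.CheckpointManager; a minimal standalone sketch (directory and tracked object are placeholders):
import tensorflow as tf

ckpt = tf.train.Checkpoint(step=tf.Variable(0))   # placeholder object to checkpoint
manager = tf.train.CheckpointManager(
    ckpt, directory='training/',                  # hypothetical checkpoint directory
    max_to_keep=10,                               # keep the 10 most recent checkpoints
    keep_checkpoint_every_n_hours=2)              # plus one long-lived checkpoint every 2 hours
manager.save()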
I am using TensorFlow with Python for object detection.
I want to start training, leave it for a while, and keep all the training checkpoints (model.ckpt files). Standard TensorFlow training seems to delete older checkpoints and only keep the last few. How do I prevent that?
Please excuse me if this is the wrong place to ask such questions. I would be obliged if told of a proper place. Thank you.
You can use the keep_checkpoint_max flag to tf.estimator.RunConfig in model_main.py.
You can set it to a very large number to practically save all checkpoints.
You should be warned though that depending on the model size and saving frequency, it might fill up your disk (and therefore crash during training).
You can change the saving frequency with the save_checkpoints_steps or save_checkpoints_secs arguments of RunConfig. The default is to use save_checkpoints_secs, with a default value of 600 (10 minutes).
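For example, a minimal sketch of such a RunConfig (the directory and values are placeholders, not the API's defaults):
import tensorflow as tf

config = tf.estimator.RunConfig(
    model_dir='training/',        # hypothetical model directory
    keep_checkpoint_max=1000,     # large enough to effectively keep everything (mind your disk)
    save_checkpoints_secs=600)    # or save_checkpoints_steps=..., but not both at once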
You can also save model checkpoints as .hdf5 files and load them again when you want to predict on test data.
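Note that .hdf5 saving applies to tf.keras models rather than the Estimator-based detection pipeline; a minimal Keras sketch for completeness (the model here is just a placeholder):
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])  # placeholder model
model.compile(optimizer='adam', loss='mse')
model.save('model_checkpoint.h5')                             # HDF5 file: architecture + weights
restored = tf.keras.models.load_model('model_checkpoint.h5')  # reload later for prediction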
Hope that helps.
I have a model already trained with the DyNet library, but I forgot the --dynet-seed parameter used when training this model.
Does anyone know how to read back this parameter from the saved model?
Thank you in advance for any feedback.
You can't read back the seed parameter. A DyNet model does not save the seed, for the obvious reason that it is not required at test time. The seed is only used to get fixed initial weights, random shuffling, etc. across different experimental runs. At test time no parameter initialisation or shuffling is required, so there is no need to save the seed.
To the best of my knowledge, none of the other libraries such as TensorFlow, PyTorch, etc. save the seed parameter either.
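You can't recover the old seed, but you can pin one explicitly for future runs so it is at least recorded in your own code; a sketch assuming DyNet's dynet_config module:
import dynet_config
dynet_config.set(random_seed=42)  # must run before importing dynet; pick and record your own seed
import dynet as dy                # DyNet now initialises with the fixed seed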
I am following Dat Tran's example to train my own object detector with TensorFlow's Object Detection API.
I successfully started training on the custom objects. I am using a CPU to train the model, but it takes around 3 hours to complete 100 training steps, so I suppose I have to change some parameter in the .config file.
I also tried to convert the .ckpt to a .pb file; I referred to this post, but I was still not able to convert it.
1) How can I reduce the number of training steps?
2) Is there a way to convert .ckpt to .pb?
I don't think you can reduce the number of training steps, but you can stop at any checkpoint (ckpt) and then convert it to a .pb file.
From the TensorFlow models git repository you can use export_inference_graph.py with the following command:
python tensorflow_models/object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path architecture_used_while_training.config \
    --trained_checkpoint_prefix path_to_saved_ckpt/model.ckpt-NUMBER \
    --output_directory model/
where NUMBER refers to your latest saved checkpoint file number; however, you can use an older checkpoint file if you find it performs better in TensorBoard.
1) I'm afraid there is no effective way to just "reduce" training steps. Using bigger batch sizes may lead to "faster" training (as in, reaching high accuracy in a lower number of steps), but each step will take longer to compute, since you're running on your CPU.
Playing around with the input image resolution might give you a speedup, at the price of lower accuracy.
You should really consider moving to a machine with a GPU.
2) .pb files (and their corresponding text version .pbtxt) by default contain only the definition of your graph. If you freeze your graph, you take a checkpoint, get all the variables defined in the graph, convert them to constants and assign them the values stored in the checkpoint. You typically do this to ship your trained model to whoever will use it, but this is useless in the training stage.
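For illustration, a rough TF 1.x sketch of what freezing does under the hood; the checkpoint path and output node name are placeholders, and the Object Detection API's export_inference_graph.py does all of this (and more) for you:
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

with tf.Session() as sess:
    # Hypothetical checkpoint prefix; substitute your own model.ckpt-NUMBER files.
    saver = tf.train.import_meta_graph('path_to_saved_ckpt/model.ckpt-1234.meta')
    saver.restore(sess, 'path_to_saved_ckpt/model.ckpt-1234')
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ['detection_boxes'])   # output node name is an assumption
    with tf.io.gfile.GFile('frozen_inference_graph.pb', 'wb') as f:
        f.write(frozen_graph_def.SerializeToString())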
I would highly recommend finding a way to speed up your per-training-step running time rather than reducing the number of training steps. The best way is to get your hands on a GPU. If you can't do this, you can look into reducing image resolution or using a lighter network.
For converting to a frozen inference graph (the .pb file), please see the documentation here:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md
Yes, there is one parameter in the .config file with which you can reduce the number of steps as much as you want: num_steps, which is the total number of training steps (not epochs).
But please keep in mind that it is not recommended to reduce it too much, because if you do, the loss will not come down far enough and you will get poor output.
So keep watching the loss; once it comes under 1, you can start testing your model separately while the training keeps running.
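For reference, the relevant excerpt of the pipeline .config file looks roughly like this (the value is only an example; the default differs per model):
train_config {
  # ... other training settings unchanged ...
  num_steps: 20000   # total number of training steps; lower this value to stop earlier
}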
1. Yup there is a way to change the number of training steps:
try this,
python model_main_tf2.py --pipeline_config_path="config_path_here" --num_train_steps=5000 --model_dir="model_dir_here" --alsologtostderr
Here I set the number of training steps to 5000.
2. Yup there is a way to convert checkpoints into .pb:
try this,
python exporter_main_v2.py --trained_checkpoint_dir="checkpoint_dir_here" --pipeline_config_path="config_path_here" --output_directory "output_dir_here"
This will create a directory where the checkpoints and the .pb file will be saved.
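Once exported, the SavedModel can be loaded for inference roughly like this (the directory layout is the exporter's default, so treat the path as an assumption):
import tensorflow as tf

# "output_dir_here" is whatever you passed to --output_directory above.
detect_fn = tf.saved_model.load('output_dir_here/saved_model')
# Example call: pass a uint8 batch of shape [1, height, width, 3]
# detections = detect_fn(image_tensor)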
I am creating an 11-class object detector using the Faster R-CNN model, set to a maximum size of 300x400 in the image_resizer tag. This is due to a CUDA OOM error popping up if I go any higher, as the GPU is a 1050 Ti (4 GB), so I have approximately 3800-3900 MB of run-time training memory for the model.
I have followed erishima's steps and combined them with the Pets scripts and Dat Tran's to generate the TFRecord files.
The steps were as follows:
Create the labels for the categories using labelImg.
Use the name field in labelImg to annotate the class of the image file.
Create a CSV file and extract the filename, class, xmin, ymin, xmax, ymax from the XML file. (Custom Script)
Create a train and test/eval CSV from the main CSV file.
Generate the train and test TFRecord files referenced in the config file. (Dat Tran's script, modified to suit my needs)
Modify faster_rcnn_config without touching the hyper-parameters.
Create a label_map.pbtxt file corresponding to the names of the classes, starting the IDs from 1 as stated in many other answers on this topic.
Start training the model via the stated method.
The dataset for the classes is custom, and the number of images per class varies from 300 to 2500. The annotations do not define the orientation (pose) of the object or the detection difficulty, even though every possible angle of the object is present in those images.
The problem that arose when I had trained to a loss value of .002 after 217k steps was that a single class was enveloping the objects of all the other classes, whether I ran the detector on a video or on images. I have not tried to run the eval.py script, as that takes too long on this setup, so I can't really see the mAP for the classes, but I assume it would be redundant information, since the problem should be in the dataset preparation method or in the dataset itself.
When retrained anew for 60k steps, the problem persisted, but with another class enveloping all the others.
The warnings shown were:
A warning that the sparse index tensor is going to take a lot of memory. Can I change the code so that this does not pop up, and possibly save myself some precious memory?
"Wanted [x,?,?,y], got [x,y,z,a,b] instead." This one stops the training. I got it twice in the training up to 217k steps, and I have no idea where it originates; probably the dataset.
If someone can show me even a hint to the proper fix to this, I would highly appreciate it.
I believe you have class imbalance; I had a similar problem in the past.
Do an analysis of your dataset and make sure the number of images per class is of a similar order of magnitude.
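A quick way to check this, assuming the CSV produced in your preparation step has 'filename' and 'class' columns (a pandas sketch; the file name is hypothetical):
import pandas as pd

df = pd.read_csv('train_labels.csv')               # hypothetical path to the CSV from your step 3
print(df['class'].value_counts())                  # bounding boxes per class
print(df.groupby('class')['filename'].nunique())   # distinct images per class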
I am using tensorflow 1.0 to train a DNNRegressor. Most of the training is already handled automatically by the new tensorflow 1.0 features. The model information is saved automatically in a folder. I call the train(filepath, isAuthentic) function repeatedly, with different training files, using a for loop.
The problem is that the events.out.tfevents files keep getting larger and larger, taking up space. I have gotten around this by deleting these files as they are generated, but the CPU still wastes incrementally more time trying to generate these files. These don't affect the results of training or predicting. Is there a way to stop these events.out.tfevents files from being generated?
I've noticed that when I run the python program for a long period, the events.out.tfevents file sizes start small and then get large, but if I run the training for several periods of shorter intervals, the file sizes stay small.
picture of model folder, contents ordered by size
When I let the training run long enough, the events.out.tfevents reaches over 200 MB, wasting much time and space. I have already tried changing the checkpoint and summary parameters in a RunConfig object passed to the DNNRegressor.
import tensorflow as tf   # TF 1.x, with tf.contrib available
import networkSetup       # user module providing FEATURES and MODEL_DIR

def getRegressor():
    feature_cols = [tf.contrib.layers.real_valued_column(k) for k in networkSetup.FEATURES]
    # Build a 2-layer fully connected DNN with 8, 8 units respectively.
    regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols,
                                              hidden_units=[8, 8],
                                              model_dir=networkSetup.MODEL_DIR,
                                              activation_fn=tf.nn.sigmoid,
                                              optimizer=tf.train.GradientDescentOptimizer(
                                                  learning_rate=0.001
                                              ))
    return regressor

def train(filepath, isAuthentic):
    regressor = getRegressor()
    # training on the training set
    regressor.fit(input_fn=lambda: input_fn(filepath, isAuthentic), steps=1)
.tfevents files contain events written by the fit method. DNNRegressor saves at least histograms and the fraction of zeros for each hidden layer. You can use TensorBoard to view them.
TensorFlow doesn't overwrite event files; instead it appends to them, so a bigger file size doesn't mean more CPU cycles.
You can pass a config parameter to the DNNRegressor constructor (a RunConfig instance) and specify how often you want summaries to be saved using its save_summary_steps property. The default is to save a summary every 100 steps.
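A sketch of that, reusing the names from the snippet above (tf.contrib.learn API as in the question; the step count is just an example):
import tensorflow as tf  # TF 1.x with tf.contrib

run_config = tf.contrib.learn.RunConfig(save_summary_steps=1000)  # write summaries every 1000 steps instead of 100
regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=feature_cols,      # as built in getRegressor() above
    hidden_units=[8, 8],
    model_dir=networkSetup.MODEL_DIR,
    config=run_config)                 # pass the RunConfig here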
To prevent TensorFlow from creating the events.out file, you just have to comment out the part of the code that writes this file each time the model is trained.
In all the models there are writers in the main class that create these summaries/logs for further analysis of the data, although they are not useful in many cases.
Sample code lines from TensorFlow Inception's "retrain.py":
train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train', sess.graph)
validation_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/validation')
Just comment out the part of the code creating the events.out file and you are done.
That's because of the newly generated graph files for TensorBoard.
tfFileWriter = tf.summary.FileWriter(os.getcwd())
tfFileWriter.add_graph(sess.graph)
tfFileWriter.close()
Comment out these lines if you find them in your code, and the files will go away.