I am creating a 11 class object detector using the faster-RCNN model set up to the maximum size of 300x400 in the image-resizer tag. This is due to CUDA OOM error popping up if I go any higher as the GPU is a 1050 Ti, 4Gb ver, so I have approximately 3800-3900 Mb of model run-time training memory.
I have followed erishima's steps and mutated them with the Pet's scripts and Dati Tran's to generate the TFRecord files.
The steps were as follows:
Create the labels for the categories using labelImg.
Use the name field in labelImg to annotate the class of the image file.
Create a CSV file and extract the filename, class, xmin, ymin, xmax, ymax from the XML file. (Custom Script)
Create a train and test/eval CSV from the main CSV file.
Generate the TFRecord files to be inputted into the config file. Train and Test.(Dati Tran's script modified to suit needs)
Modify faster_rcnn_config without touching the hyper-parameters.
Created a label_map.pbtxt file which corresponded to the names of the classes. Started from 1 as stated in many other answers related to this topic.
Started training the model via the stated method.
The dataset for the classes is custom and the images/class varies from 2500 to 300. The dataset has no definition of orientation of the object and the difficulty of detection in the image even though every possible angle of the object is present in those images.
The problem which arises when I have trained to a loss value of .002 after 217k steps was that a single class was enveloping the objects of all other classes whether I ran the detector on a video or images. I have not tried to run the eval.py script as that takes too long on this setup and those I can't really see the mAP for the classes but I would assume that it should be redundant information as the problem should be in the dataset set preparation method or in the dataset itself.
When retrained from anew for 60k steps, the problem persisted but with another class enveloping all the other.
The warnings shown were:
The Sparse Index Tensor going to take alot of memory. Can I change the code so that this does not pop-up and possibly save me some precious memory.
Wanted [x,?,?,y], got [x,y,z,a,b] instead. This one stops the training. Got this 2 times in the training upto 217k steps. Have no idea where this one originates; probably, the dataset.
If someone can show me even a hint to the proper fix to this, I would highly appreciate it.
I believe you have class imbalance. Had similar problem in the past
Do an analysis of your dataset - make sure # of images per class are in similar order of magnitude.
Related
I am trying to segment 4 lesions with semantic segmentation. I follow this
this great post
My training folder has only 2 subfolders with patches: masks and images. Inside the folder with masks, ALL the classes are mixed. The other folder has the corresponding images. So, when I train the model ,it appears: ONE CLASS FOUND, just following the abovementioned post. The results are disappointing and I am wondering if I have to split the classes in the folders, and thus the model recognizes 4 classes instead of the one.
What your really need to be attentive at is the way in which the masks are created.
It is possible that by default the ImageDataGenerator in Keras to output the number of folders, regardless of how you manually build and adapt the ImageDataGenerator for image segmentation instead of image classification.
My suggestion is to follow the entire post along and change nothing in the first instance. If you pay attention the final results obtained are quite good; this means that the dataset preparation process (mask creation) is correct.
I want to use tensorflow for detecting cars in an embedded system, so I tried ssd_mobilenet_v2 and it actually did pretty well for me, except for some specific car types which are not very common and I think that is why the model does not recognize them. I have a dataset of these cases and I want to improve the model by fine-tuning it. I should also note that I need a .tflite file because I'm using tflite_runtime in python.
I followed these instructions https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10 and I could train the model and reached a reasonable loss value. I then used export_tflite_ssd_graph.py in the object detection API to build inference_graph from the trained model. Afterwards I used toco tool to build a .tflite file out of it.
But here is the problem, after I've done all that; not only the model did not improve, but now it does not detect any cars. I got confused and do not know what is the problem, I searched a lot and did not find any tutorial about doing what I need to do. They just added a new object to a model and then exported it, which I tried and I was successful doing that. I also tried to build a .tflite file without training the model and directly from the Tensorflow detection model zoo and it worked fine. So I think the problem has something to do with the training process. Maybe I am missing something there.
Another thing that I did not find in documents is that whether is it possible to "add" a class to the current classes of an object detection model. For example, let's assume the mobilenet ssd v2 detects 90 different object classes, I would like to add another class so that the model detects 91 different classes instead of 90 classes. As far as I understand and tested after doing transfer learning using object detection API, I could only detect the objects that I had in my dataset and the old classes will be gone. So how do I do what I explained?
I found out that there is no way to 'add' a class to the previously trained classes but with providing a little amount of data of that class you can have your model detect it. The reason is that the last layer of the model changes when transfer learning is applied. In my case I labeled around 3k frames containing about 12k objects because my frames would be complicated. But for simpler tasks as I saw in tutorials 200-300 annotated images would be enough.
And for the part that the model did not detect anything it has something to do with the convert command that I used. I should have used tflite_convert instead of toco. I explained more here.
I have a few queries regarding the Tensorflow Object Detection API.
While training, only the previous 5 check-points are stored. I want to store more than that, say the previous 10 check points. How can this be done? (I think it should be one of the parameters of train.proto in object_detection/protos.)
By default, the check points are stored every 10 minutes (600 seconds). To change this frequency, I believe it is one of these two parameters that have to be changed, please confirm which one it is:
from learning.py in
/home/user/tensorflow-gpu/lib/python3.5/site-packages/tensorflow/contrib/slim/python/slim
save_summaries_secs=600 or
save_interval_secs=600
While training my model (ssd_mobilenet_v2_coco_2018_03_29), I also run the evaluation simultaneously. The latest checkpoint represented in the eval graph always lags the latest one saved in object_detection/training folder. For example, in the case below, the latest checkpoint shown on graph is 29.437k, while the model is already trained till the checkpoint 32.891k (and saved in the training folder). What is the reason for this lag (20 minutes lag) Why isn't one step (10 minutes) enough to perform evaluation on the trained model?
This is for anyone who wants to configure the updated object detection API that supports TensorFlow 2
To save the previous 10 checkpoints, open model_lib.py and pass keyword argument max_to_keep=10 to every tf.train.Saver function
To change the frequency from 600 seconds to 3600 seconds (1 hour),
open model_main.py and find the line that contains tf.estimator.RunConfig in the main function.Pass the keyword argument save_checkpoints_secs=3600 to the tf.estimator.RunConfig class.
Here is the code snippet after configuring checkpoint save frequency in model_main.py:
def main(unused_argv):
flags.mark_flag_as_required('model_dir')
flags.mark_flag_as_required('pipeline_config_path')
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, save_checkpoints_secs=3600)
please note that there is a parameter keep_checkpoint_max in the
tf.estimator.RunConfig class but setting it didn't affect the number of saved checkpoints for me.
This post here should work i believe to change keep_checkpoint_every_n_hours
max_to_keep
How to store best models checkpoints, not only newest 5, in Tensorflow Object Detection API?
You can also refer official doc
https://www.tensorflow.org/api_docs/python/tf/train/Saver
I have just started working with Tensorflow, with Caffe it was super practical reading in the data in an efficient manner but with Tensorflow I see that I have to write data loading process myself, creating TFRecords, the batching, the multiple threats, handling those threads etc. So I started with an example, inception v3, as they handle the part to read in the data. I am new to Tensorflow and relatively new to Python, so I feel like I don't understand what is going on with this part exactly (I mean yes it extends the size of the labels to label_index * no of files -but- why? Is it creating one hot encoding for labels? Do we have to? Why doesn't it just extend as much for the length or files as each file have a label? Thx.
labels.extend([label_index] * len(filenames))
texts.extend([text] * len(filenames))
filenames.extend(filenames)
The whole code is here: https://github.com/tensorflow/models/tree/master/research/inception
The part mentioned is under data/build_image_data.py and builds image dataset from an existing dataset as images stored under folders (where foldername is the label): https://github.com/tensorflow/models/blob/master/research/inception/inception/data/build_image_data.py
Putting together what we discussed in the comments:
You have to one-hot encode because the network architecture requires you to, not because it's Tensorflow's demand. The network is a N-class classifier, so the final layer will have one neuron per class and you'll train the network to activate the neuron matching the class the sample belongs to. One-hot encoding the label is the first step in doing this.
About the human-readable labels, the code you're referring to is located in the _find_image_files function, which in turn is used by _process_dataset to transform the dataset from a set of folders to a set TfRecord files, which are a convenient input format type for Tensorflow.
The human-readable label string is included as a feature in the Examples inside the tfrecord files as an 'extra' (probably to simplify visualization of intermediate results during training), it is not strictly necessary for the dataset and will not be used in any way in the actual optimization of the network's parameters.
I am using tensorflow 1.0 to train a DNNRegressor. Most of the training is already handled automatically by the new tensorflow 1.0 features. The model information is saved automatically in a folder. I call the train(filepath, isAuthentic) function repeatedly, with different training files, using a for loop.
The problem is that the events.out.tfevents files keep getting larger and larger, taking up space. I have gotten around this by deleting these files as they are generated, but the CPU still wastes incrementally more time trying to generate these files. These don't affect the results of training or predicting. Is there a way to stop these events.out.tfevents files from being generated?
I've noticed that when I run the python program for a long period, the events.out.tfevents file sizes start small and then get large, but if I run the training for several periods of shorter intervals, the file sizes stay small.
picture of model folder, contents ordered by size
When I let the training run long enough, the events.out.tfevents reaches over 200 MB, wasting much time and space. I have already tried changing the checkpoint and summary parameters in a RunConfig object passed to the DNNRegressor.
def getRegressor():
feature_cols = [tf.contrib.layers.real_valued_column(k) for k in networkSetup.FEATURES]
# Build 2 layer fully connected DNN with 8, 8 units respectively.
regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols,
hidden_units=[8, 8],
model_dir=networkSetup.MODEL_DIR,
activation_fn=tf.nn.sigmoid,
optimizer=tf.train.GradientDescentOptimizer(
learning_rate=0.001
)
)
return regressor
def train(filepath, isAuthentic):
regressor = getRegressor()
# training on training set
regressor.fit(input_fn=lambda: input_fn(filepath, isAuthentic), steps=1)
.tfevents files contain events written by fit method. DNNRegressor saves at least histograms and fraction of zeros for each hidden layer. You can use Tensorboard to view them.
Tensorflow doesn't overwrite event files instead it appends to them so bigger file size doesn't mean more CPU cycles.
You can pass config parameter to DNNRegressor constructor (RunConfig instance) and specify how often you want summary to be saved using its save_summary_steps property. Default is to save summary every 100 steps.
To prevent tensorflow creating the events.out file, you just have to comment that part of code that writes this file each time a new user trains the model.
In all the models, there are writers written in the main class to create these summaries/logs for further analysis of data, although it is not useful in many cases.
Sample code lines from Tensorflow Inception's "retrain.py":
train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train',sess.graph)
validation_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/validation')
Just comment out the part of code creating the events.out file and you are up.
That's because of newly generated Graph files for tensorboard.
tfFileWriter = tf.summary.FileWriter(os.getcwd())
tfFileWriter.add_graph(sess.graph)
tfFileWriter.close()
Comment out these lines if you find them in your code and it would go away.