TensorFlow training: how to prevent checkpoint deletion - python

I am using TensorFlow with Python for object detection.
I want to start training, leave it for a while, and keep all training checkpoints (model.ckpt files). Standard TensorFlow training seems to delete older checkpoints and keep only the last few. How do I prevent that?
Please excuse me if this is the wrong place to ask such questions; I would be obliged if told of a better one. Thank you.

You can use the keep_checkpoint_max flag of tf.estimator.RunConfig in model_main.py.
You can set it to a very large number to keep practically all checkpoints.
Be warned, though, that depending on the model size and saving frequency, this might fill up your disk (and therefore crash the training).
You can change the saving frequency with the save_checkpoints_steps or save_checkpoints_secs flag of RunConfig. The default is save_checkpoints_secs, with a default value of 600 (10 minutes).
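For illustration, a minimal sketch of how those flags fit together (the numbers are placeholders, and my_model_fn stands in for your own model function):

import tensorflow as tf

# Keep (practically) all checkpoints instead of the default 5, and save
# one every 30 minutes. Only one of save_checkpoints_secs and
# save_checkpoints_steps may be set.
run_config = tf.estimator.RunConfig(
    keep_checkpoint_max=10000,      # large enough to never delete in practice
    save_checkpoints_secs=1800,
)
# estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)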

You can save model checkpoints as .hdf5 files and load them again when you want to predict on test data.
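If you are using Keras, a minimal sketch of that pattern with the ModelCheckpoint callback (the tiny model and random data are only there to make it runnable):

import numpy as np
from tensorflow import keras

# Dummy data standing in for your real training set.
x_train = np.random.rand(100, 8)
y_train = np.random.randint(0, 3, size=100)

model = keras.Sequential([
    keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Write one .hdf5 file per epoch; save_best_only=False keeps every one.
checkpoint = keras.callbacks.ModelCheckpoint(
    'weights.{epoch:02d}.hdf5', save_best_only=False)
model.fit(x_train, y_train, epochs=3, callbacks=[checkpoint])

# Later, load a saved file and predict on test data.
restored = keras.models.load_model('weights.03.hdf5')
preds = restored.predict(np.random.rand(5, 8))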
Hope that helps.

Related

Prediction and forecasting with Python tensorflow

I have created a prediction model using an RNN from the TensorFlow library in Python. Here is the complete code I have created and tried:
Jupyter Notebook of the Code
But I have doubts.
1) Is an RNN correct for what I am trying to predict?
2) Is there a better algorithm I can try?
3) Can anyone suggest how I can give multiple inputs and get the necessary output using a TensorFlow model? Can anyone guide me, please?
I hope I am clear on my points. Please do tell me if anything else is required.
Having doubts is normal, but you should try to measure them before asking for advice. If you don't have a clear thing you want to improve, it's unlikely you will get something better.
1) Is an RNN correct for what I am trying to predict?
Yes. An RNN is used appropriately here. If you don't care much about having arbitrary-length input sequences, you can also try to force them to a fixed size and then apply convolutions on top (see convolutional neural networks), or even try a simpler DNN.
The more important question to ask yourself is if you have the right inputs and if you have sufficient training data to learn what you hope to learn.
2) Is there a better algorithm I can try?
Probably not. As I said, an RNN seems appropriate for this problem. Do try some hyperparameter tuning to make sure you don't accidentally pick a sub-optimal configuration.
3) Can anyone suggest how I can give multiple inputs and get the necessary output using a TensorFlow model? Can anyone guide me, please?
The common way to handle variable-length inputs is to set a max length and pad the shorter examples until they reach that length. The max length can be a value you pick, or you can set it dynamically to the largest length in the batch. It is needed only because the internal operations are done in batches. You can pick which outputs you want. Picking the last one is reasonable (the model just has to learn to propagate the state across the padding values). Another reasonable choice is to pick the first output you get after feeding the last meaningful value into the RNN.
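For illustration, a minimal padding sketch in NumPy (the sequences are made up):

import numpy as np

# Three variable-length input sequences (made-up data).
sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Pad every sequence with zeros up to the longest one in the batch.
max_len = max(len(s) for s in sequences)
batch = np.zeros((len(sequences), max_len), dtype=np.int32)
for i, seq in enumerate(sequences):
    batch[i, :len(seq)] = seq

# Keep the true lengths so you can pick the right RNN output per example.
lengths = np.array([len(s) for s in sequences])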
Looking at your code, there's one thing I would improve:
Instead of computing a loss on the last value only, I would compute it over all values in the series. This gives your model more training signal at very little extra computational cost.
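A sketch of that idea, with random tensors standing in for the real RNN outputs and targets:

import tensorflow as tf

batch, time, units = 4, 10, 32
outputs = tf.random.normal((batch, time, units))  # RNN output at every step
targets = tf.random.normal((batch, time))         # target value at every step

# Project each step's output to a prediction and average the squared error
# over all steps, not just the final one.
predictions = tf.squeeze(tf.keras.layers.Dense(1)(outputs), axis=-1)
loss = tf.reduce_mean(tf.square(predictions - targets))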

TensorFlow estimator: predict without loading from checkpoint every time

I am using Estimator in TensorFlow (1.8) and Python 3.6 to build a neural network for my reinforcement learning project. I notice that every time you use estimator.predict(), TensorFlow loads the checkpoint under model_dir. But this is extremely inefficient if you have to use this function many times for the same checkpoint; e.g. in reinforcement learning I may need to predict the next action based on the current state, and the next state is realized only after a specific action is chosen. So it's commonplace to call this function thousands of times.
So my question is: how do I call this function without loading the (same) checkpoint every time?
Thank you.
Well, I think I've just found a good answer to my own question. A good solution to this problem is to construct a tf.data Dataset from a generator. Link is here.
The generator keeps your estimator.predict call open, so you don't need to keep reloading the checkpoint. The only thing you need to do is change the object the generator yields (self.next_feature in the linked fast-predict class) when necessary.
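A minimal sketch of that pattern (the feature name 'x' and its shape are assumptions; the linked code differs in the details):

import tensorflow as tf

class FastPredict:
    """Keeps a single estimator.predict() call open via a generator,
    so the checkpoint is loaded only once."""

    def __init__(self, estimator):
        self.estimator = estimator
        self.next_feature = None
        self.predictions = None

    def _generator(self):
        while True:
            yield self.next_feature        # re-read before every prediction

    def _input_fn(self):
        ds = tf.data.Dataset.from_generator(
            self._generator,
            output_types={'x': tf.float32},
            output_shapes={'x': [4]})      # assumed feature spec
        return ds.batch(1)

    def predict(self, feature):
        # feature: a dict like {'x': np.array([...], dtype=np.float32)}
        self.next_feature = feature
        if self.predictions is None:       # first call loads the checkpoint
            self.predictions = self.estimator.predict(self._input_fn)
        return next(self.predictions)      # later calls reuse the open session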
However, I should mention that if your ultimate goal is to turn the whole thing into a service, you may need something like TensorFlow Serving, and I suggest you go that way directly. I wasted a lot of time in the process, so I hope this answer saves yours.

Optimizing RAM usage when training a learning model

I have been working on creating and training a deep learning model for the first time. I had no knowledge of the subject before this project, so my knowledge is limited even now.
I used to run the model on my own laptop, but after implementing a well-working OHE and SMOTE I simply couldn't run it on my own device anymore due to a MemoryError (8 GB of RAM). So I am currently running the model on a 30 GB RAM RDP, which I thought would let me do much more.
My code seems to have some horrible inefficiencies, and I wonder if they can be solved. One example: with pandas.concat my model's RAM usage skyrockets from 3 GB to 11 GB, which seems very extreme. Afterwards I drop a few columns, making the RAM spike to 19 GB, but here it actually returns to 11 GB after the computation completes (unlike the concat). I have also forced myself to stop using SMOTE for now, just because its RAM usage goes up way too much.
At the end of the code, where the training happens, the model breathes its final breath while trying to fit. What can I do to optimize this?
I have thought about splitting the code into multiple parts (for example preprocessing and training), but to do so I would need to store massive datasets in a pickle, which can only reach 4 GB (correct me if I'm wrong). I have also thought about using pre-trained models, but I truly did not understand how that process works or how to use one in Python.
P.S.: I would also like my SMOTE back, if possible.
Thank you all in advance!
Let's analyze the steps:
Step 1: OHE
For your OHE, the only dependence between data points is that the overall set of categories needs to be known. So the OHE can be broken into two steps, neither of which requires all data points to be in RAM.
Step 1.1: determine categories
Stream read your data points, collecting all the categories. It is not necessary to save the data points you read.
Step 1.2: transform data
After step 1.1, each data point can be independently converted. So stream read, convert, stream write. You only need one or very few data points in memory at all times.
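A minimal sketch of the two passes over a CSV file (the file name and the 'color' column are made up):

import csv

# Pass 1: stream over the data and collect the set of categories; nothing
# is kept in memory except the category set itself.
categories = set()
with open('data.csv') as f:
    for row in csv.DictReader(f):
        categories.add(row['color'])
columns = sorted(categories)

# Pass 2: stream read, one-hot encode each row independently, stream write.
with open('data.csv') as fin, open('data_ohe.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(['color_' + c for c in columns])
    for row in csv.DictReader(fin):
        writer.writerow([1 if row['color'] == c else 0 for c in columns])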
Step 1.3: feature selection
It may be worthwhile to look at feature selection to reduce the memory footprint and improve performance. This answer argues it should happen before SMOTE.
Feature selection methods based on entropy depend on all the data. While you could probably put together a streaming version of those too, one approach that worked well for me in the past is removing features that only one or two data points have, since such features definitely have low entropy and probably don't help the classifier much. This can be done again in the style of steps 1.1 and 1.2, as sketched below.
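A sketch of that filter in the same streaming style (the file names are made up; "rare" here means a feature that two or fewer rows have):

import csv
from collections import Counter

# Pass 1: count in how many rows each one-hot feature is set.
counts = Counter()
with open('data_ohe.csv') as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        for name, value in zip(header, row):
            if value == '1':
                counts[name] += 1

# Keep only features that more than two data points have.
keep = [i for i, name in enumerate(header) if counts[name] > 2]

# Pass 2: stream read, project each row onto the kept features, stream write.
with open('data_ohe.csv') as fin, open('data_fs.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    for row in csv.reader(fin):             # the header row goes through too
        writer.writerow([row[i] for i in keep])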
Step 2: SMOTE
I don't know SMOTE well enough to give an answer, but maybe the problem has already solved itself if you do feature selection. In any case, save the resulting data to disk so you don't need to recompute it for every training run.
Step 3: training
See if the training can be done in batches or streaming (online learning, basically), or simply with less sampled data.
With regards to saving to disk: use a format that can be easily streamed, like CSV or some other splittable format. Don't use pickle for that.
Slightly orthogonal to your actual question: if your high RAM usage is caused by keeping the entire dataset in memory during training, you can eliminate that footprint by reading and storing only one batch at a time: read a batch, train on it, read the next batch, and so on.
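A sketch of that chunked training loop with pandas and Keras (the file name, the 'label' column, and the tiny model are placeholders):

import pandas as pd
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# Read the preprocessed CSV in chunks; only one chunk is in RAM at a time.
for epoch in range(3):
    for chunk in pd.read_csv('data_fs.csv', chunksize=4096):
        y = chunk.pop('label').values          # assumed label column
        x = chunk.values.astype('float32')
        model.train_on_batch(x, y)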

How to read back the "random-seed" from a saved model of Dynet

I have a model already trained with the DyNet library, but I forgot the --dynet-seed parameter I used when training this model.
Does anyone know how to read this parameter back from the saved model?
Thank you in advance for any feedback.
You can't read back the seed parameter: a DyNet model does not save it. The obvious reason is that it is not required at test time. The seed is only used to fix the initial weights, random shuffling, etc. across different experimental runs. At test time no parameter initialisation or shuffling is required, so there is no need to save the seed.
To the best of my knowledge, none of the other libraries, like TensorFlow or PyTorch, save the seed parameter either.

How to reduce the number of training steps in Tensorflow's Object Detection API?

I am following Dat Trans' example to train my own object detector with TensorFlow's Object Detection API.
I successfully started training on custom objects. I am using a CPU to train the model, but it takes around 3 hours to complete 100 training steps, so I suppose I have to change some parameter in the .config file.
I tried to convert the .ckpt to a .pb (I referred to this post), but I was still not able to convert it.
1) How do I reduce the number of training steps?
2) Is there a way to convert .ckpt to .pb?
I don't think you can reduce the number of training steps, but you can stop at any checkpoint (.ckpt) and then convert it to a .pb file.
From the TensorFlow models git repository you can use export_inference_graph.py with the following command:
python tensorflow_models/object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path architecture_used_while_training.config \
    --trained_checkpoint_prefix path_to_saved_ckpt/model.ckpt-NUMBER \
    --output_directory model/
where NUMBER refers to the number of your latest saved checkpoint file; you can, however, use an older checkpoint if it looks better in TensorBoard.
1) I'm afraid there is no effective way to just "reduce" training steps. Using bigger batch sizes may lead to "faster" training (as in reaching high accuracy in a lower number of steps), but each step will take longer to compute, since you're running on a CPU.
Playing around with the input image resolution might give you a speedup, at the price of lower accuracy.
You should really consider moving to a machine with a GPU.
2) .pb files (and their corresponding text version .pbtxt) by default contain only the definition of your graph. If you freeze your graph, you take a checkpoint, get all the variables defined in the graph, convert them to constants and assign them the values stored in the checkpoint. You typically do this to ship your trained model to whoever will use it, but this is useless in the training stage.
I would highly recommend finding a way to speed up your per-training-step running time rather than reducing the number of training steps. The best way is to get your hands on a GPU. If you can't do this, you can look into reducing image resolution or using a lighter network.
For converting to a frozen inference graph (the .pb file), please see the documentation here:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/exporting_models.md
Yes, there is one parameter in the .config file with which you can reduce the number of steps as much as you want: num_steps, which is the total number of training steps (not epochs) the job will run.
But please keep in mind that it is not recommended to reduce it by much, because if you do, your loss will not come down enough, which will give you bad output.
So keep watching the loss; once it comes under 1, you can start testing your model separately while the training keeps running.
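For reference, the relevant part of the pipeline .config file looks like this (the values are placeholders):

train_config {
  batch_size: 24
  num_steps: 20000   # total number of training steps
}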
1. Yup, there is a way to change the number of training steps. Try this:
python model_main_tf2.py --pipeline_config_path="config_path_here" --num_train_steps=5000 --model_dir="model_dir_here" --alsologtostderr
Here the number of training steps is set to 5000.
2. Yup, there is a way to convert checkpoints into .pb. Try this:
python exporter_main_v2.py --trained_checkpoint_dir="checkpoint_dir_here" --pipeline_config_path="config_path_here" --output_directory "output_dir_here"
This will create a directory where the checkpoint and the .pb file are saved.
