tensorflow change checkpoint location - python

I was wondering if there's a way to tell your training models (in TensorFlow), or the TensorFlow configuration in general, where to store checkpoint files. I was training a neural network and got these errors:
InternalError: Error writing (tmp) checkpoint file: /tmp/tmpn2cWXm/model.ckpt-500-00000-of-00001.tempstate12392765014661958578: Resource exhausted
and
ERROR:tensorflow:Got exception during tf.learn final checkpoint .
I'm also getting operating system alerts (Debian Linux) about low disk space, so I assume the problem is that my disk got full with checkpoint files. I have several partitions with enough space and would like to move the checkpoint files there.
Thank you!

You can specify your save path as the second argument of tf.train.Saver.save(sess, 'your/save/path', ...). Similarly, you can restore your previously saved variables by passing your restore path as the second argument of tf.train.Saver.restore(sess, 'your/restore/path').
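For instance, here is a minimal TF 1.x sketch; the variable and the /mnt/bigdisk/checkpoints path are just placeholders, so point it at any partition with enough free space:

import os
import tensorflow as tf

save_dir = "/mnt/bigdisk/checkpoints"   # placeholder: any partition with free space
os.makedirs(save_dir, exist_ok=True)    # Saver.save expects the directory to exist

# A throwaway variable so the graph has something to checkpoint.
w = tf.Variable(tf.zeros([10]), name="w")
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Write the checkpoint somewhere other than the default /tmp location.
    saver.save(sess, os.path.join(save_dir, "model.ckpt"))

# Later, with the same graph defined, load the variables back:
with tf.Session() as sess:
    saver.restore(sess, os.path.join(save_dir, "model.ckpt"))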
Please, see TensorFlow Saver documentation and these saving and restoring examples for further details.

This is a pretty old query that I stumbled upon today while searching for some other issue. Anyway, I thought I would put down my thoughts in case it helps someone in the future.
You can specify the model directory while constructing your regressor. This can be any location on your filesystem where you have write permission and enough space (the events file needs quite a bit). For example:
dnn_regressor = tf.estimator.DNNRegressor(
    ...,
    model_dir="/tmp/mydata"
)

Related

Is there a way to save the weights and load them on another file

So, it's as simple as the question title: is there a way to save the weights after training like this
model.save_weights("path")
and then load them on another project only with
model = load_weights("path")
model.predict(x)
Is it possible?
Yes, it is possible if you use the right path.
For instance, say you have this layout:
- project1/cool.py
- project2/another_cool.py
You train with cool.py and the model is saved inside project1's folder. To load the model in another_cool.py,
just call the load function with the path ../project1/weigh.h5.
If you only want to save/load the weights, you can use
model.save_weights("path/to/my_model_weights.hdf5")
and then reload them (potentially in another Python project or another interpreter; you just have to update the path accordingly) with
other_model.load_weights("path/to/my_model_weights.hdf5")
However, both models should have the same architecture (instances of the same class), and the Python/TensorFlow/Keras versions should be the same. See the documentation for more info.
You can save both the weights and the architecture with model.save("path/to/my_model.hdf5") for saving to disk, and keras.models.load_model("path/to/my_model.hdf5") for loading from disk (once again, the documentation provides the details).
Once loaded in memory, you can retrain your model or call predict on it; predictions should be identical between projects.
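As a compact sketch of the whole round trip (the build_model helper and the file paths are only illustrative; the one real requirement is that both sides build the same architecture):

import tensorflow as tf
from tensorflow import keras

# Hypothetical helper: both projects must build the exact same architecture.
def build_model():
    return keras.Sequential([
        keras.layers.Dense(32, activation="relu", input_shape=(8,)),
        keras.layers.Dense(1),
    ])

# Project 1: train, then save only the weights.
model = build_model()
model.compile(optimizer="adam", loss="mse")
# model.fit(x_train, y_train, epochs=10)   # training data omitted here
model.save_weights("path/to/my_model_weights.hdf5")

# Project 2: rebuild the same architecture, then load the weights.
other_model = build_model()
other_model.load_weights("path/to/my_model_weights.hdf5")
# predictions = other_model.predict(x_test)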

Tensorflow training, how to prevent training node deletion

I am using TensorFlow with Python for object detection.
I want to start training, leave it for a while, and keep all training checkpoints (model.ckpt files). Standard TensorFlow training seems to delete older checkpoints and only keep the last few. How do I prevent that?
Please excuse me if this is the wrong place to ask such questions; I would be obliged if pointed to a proper place. Thank you.
You can use the keep_checkpoint_max flag to tf.estimator.RunConfig in model_main.py.
You can set it to a very large number to practically save all checkpoints.
You should be warned though that depending on the model size and saving frequency, it might fill up your disk (and therefore crash during training).
You can change saving frequency by the flags save_checkpoints_steps or save_checkpoints_secs of RunConfig. The default is to use save_checkpoints_secs, with a default value of 600 (10 minutes).
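For example, here is a sketch of the relevant RunConfig settings; the feature column, hidden units, and paths below are placeholders rather than anything from the object-detection pipeline:

import tensorflow as tf

run_config = tf.estimator.RunConfig(
    model_dir="/data/training_runs/my_model",  # any location with enough space
    keep_checkpoint_max=1000,                  # effectively "keep everything"
    save_checkpoints_secs=600,                 # or save_checkpoints_steps=..., not both
)

feature_columns = [tf.feature_column.numeric_column("x")]
estimator = tf.estimator.DNNRegressor(
    hidden_units=[32, 16],
    feature_columns=feature_columns,
    config=run_config,
)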
You can also save model checkpoints as .hdf5 files and load them again when you want to predict on test data, for example with a Keras ModelCheckpoint callback as sketched below.
Hope that helps.
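A sketch of that Keras route, assuming a compiled model named model already exists; the checkpoints/ directory and filename pattern are just examples:

import os
from tensorflow import keras

# Make sure the target directory exists, then write one .hdf5 file per epoch
# so older checkpoints are never overwritten or deleted.
os.makedirs("checkpoints", exist_ok=True)
checkpoint_cb = keras.callbacks.ModelCheckpoint(
    "checkpoints/model-{epoch:02d}.hdf5",
    save_weights_only=False,
)
# model.fit(x_train, y_train, epochs=20, callbacks=[checkpoint_cb])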

Is there a default output-directory for TensorFlow training batchjobs?

Lately, I've been trying to implement a training service for TensorFlow.
I wasn't able to find any information in the documentation about an output directory for training jobs.
Is there any default directory that TensorFlow uses? (I'm really not familiar with TensorFlow at all.)
I was also thinking that the output directory could (maybe) be coded into a training script or even specified in the application call via the CLI.
Can anyone please help?
If you are using the Estimator API, the default output directory is specified through the model_dir argument. If you don't specify one, a temporary dir is created for you. See the docstring:
model_dir: Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into an estimator to continue training a previously saved model. If PathLike object, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be same. If both are None, a temporary directory will be used.
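A quick way to see this behavior (the LinearRegressor, feature column, and paths below are only illustrative):

import tensorflow as tf

feature_columns = [tf.feature_column.numeric_column("x")]

# No model_dir given: a temporary directory (something like /tmp/tmpXXXXXXXX)
# is created, and you can read it back from the estimator.
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)
print(estimator.model_dir)

# To control the location, pass model_dir explicitly instead:
estimator = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    model_dir="/data/training_runs/linear",
)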

How to Port a .ckpt to a .pb for use in Tensorflow for Mobile Poets

I am trying to convert a pretrained InceptionV3 model (.ckpt) from the Open Images Dataset to a .pb file for use in the Tensorflow for Mobile Poets example. I have searched the site as well as the GitHub Repository and have not found any conclusive answers.
(OpenImages Inception Model: https://github.com/openimages/dataset)
Thank you for your responses.
Below I've included some draft documentation I'm working on that might be helpful. One other thing to look out for is that if you're using Slim, you'll need to run export_inference_graph.py to get a .pb GraphDef file initially.
In most situations, training a model with TensorFlow will give you a folder containing a GraphDef file (usually ending with the .pb or .pbtxt extension) and a set of checkpoint files. What you need for mobile or embedded deployment is a single GraphDef file that’s been ‘frozen’, or had its variables converted into inline constants so everything’s in one file.
To handle the conversion, you'll need the freeze_graph.py script, which lives in tensorflow/python/tools/freeze_graph.py. You'll run it like this:
bazel build tensorflow/python/tools:freeze_graph
bazel-bin/tensorflow/python/tools/freeze_graph \
--input_graph=/tmp/model/my_graph.pb \
--input_checkpoint=/tmp/model/model.ckpt-1000 \
--output_graph=/tmp/frozen_graph.pb \
--input_node_names=input_node \
--output_node_names=output_node
The input_graph argument should point to the GraphDef file that holds your model architecture. It’s possible that your GraphDef has been stored in a text format on disk, in which case it’s likely to end in ‘.pbtxt’ instead of ‘.pb’, and you should add an extra --input_binary=false flag to the command.
The input_checkpoint should be the most recent saved checkpoint. As mentioned in the checkpoint section, you need to give the common prefix to the set of checkpoints here, rather than a full filename.
output_graph defines where the resulting frozen GraphDef will be saved. Because it’s likely to contain a lot of weight values that take up a large amount of space in text format, it’s always saved as a binary protobuf.
output_node_names is a list of the names of the nodes that you want to extract the results of your graph from. This is needed because the freezing process needs to understand which parts of the graph are actually needed, and which are artifacts of the training process, like summarization ops. Only ops that contribute to calculating the given output nodes will be kept. If you know how your graph is going to be used, these should just be the names of the nodes you pass into Session::Run() as your fetch targets. If you don’t have this information handy, you can get some suggestions on likely outputs by running the summarize_graph tool.
Because the output format for TensorFlow has changed over time, there are a variety of other less commonly used flags available too, like input_saver, but hopefully you shouldn’t need these on graphs trained with modern versions of the framework.
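For reference (this part is not from the draft documentation above), the same freezing step can also be done from Python with the TF 1.x graph_util API; the checkpoint prefix and output node name below are placeholders for whatever your own training run produced:

import tensorflow as tf

checkpoint_prefix = "/tmp/model/model.ckpt-1000"   # common prefix, not a full filename
output_node_names = ["output_node"]                # your fetch targets

with tf.Session() as sess:
    # Rebuild the graph from the .meta file and load the variable values.
    saver = tf.train.import_meta_graph(checkpoint_prefix + ".meta")
    saver.restore(sess, checkpoint_prefix)

    # Convert variables to inline constants so everything lives in one GraphDef.
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names)

with tf.gfile.GFile("/tmp/frozen_graph.pb", "wb") as f:
    f.write(frozen_graph_def.SerializeToString())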

How to turn off events.out.tfevents file in tf.contrib.learn Estimator

When using estimator.Estimator in tensorflow.contrib.learn, after training and prediction there are these files in the modeldir:
checkpoint
events.out.tfevents.1487956647
events.out.tfevents.1487957016
graph.pbtxt
model.ckpt-101.data-00000-of-00001
model.ckpt-101.index
model.ckpt-101.meta
When the graph is complicated or the number of variables is big, the graph.pbtxt file and the events files can become very big. Is there a way to not write these files? Since model reloading only needs the checkpoint files, removing them won't affect evaluation and prediction down the road.
On Google's Machine Learning Crash Course they use the following approach:
# Create classifier:
classifier = tf.estimator.DNNRegressor(
    feature_columns=feature_columns,
    optimizer=optimizer
)

# Train it:
classifier.train(
    input_fn=training_input_fn,
    steps=steps
)

# Remove event files to save disk space. (The course snippet uses
# `_ = map(os.remove, ...)`, which only works in Python 2; in Python 3
# map() is lazy, so use an explicit loop instead.)
import glob
import os

for event_file in glob.glob(os.path.join(classifier.model_dir, 'events.out.tfevents*')):
    os.remove(event_file)
If you don't want the events.out.tfevents files to be written at all, find code like the following in your project and remove it:
tfFileWriter = tf.summary.FileWriter(os.getcwd())
tfFileWriter.add_graph(sess.graph)
tfFileWriter.close()
I had the same issue and was not able to find any resolution while the events file kept growing in size. My understanding is that this file stores the events generated by TensorFlow. I went ahead and deleted it manually.
Interestingly, it never got created again, while the other files keep getting updated when I run a training sequence.
