I am using TensorFlow 1.0 to train a DNNRegressor. Most of the training is handled automatically by the new TensorFlow 1.0 features, and the model information is saved automatically in a folder. I call the train(filepath, isAuthentic) function repeatedly with different training files, using a for loop.
The problem is that the events.out.tfevents files keep getting larger and larger, taking up space. I have gotten around this by deleting them as they are generated, but the CPU still wastes incrementally more time generating these files. They don't affect the results of training or predicting. Is there a way to stop these events.out.tfevents files from being generated?
I've noticed that when I run the Python program for one long period, the events.out.tfevents files start small and then grow large, but if I run the training in several shorter sessions, the file sizes stay small.
[picture of model folder, contents ordered by size]
When I let the training run long enough, the events.out.tfevents file reaches over 200 MB, wasting a lot of time and space. I have already tried changing the checkpoint and summary parameters in a RunConfig object passed to the DNNRegressor.
def getRegressor():
    feature_cols = [tf.contrib.layers.real_valued_column(k) for k in networkSetup.FEATURES]
    # Build a 2-layer fully connected DNN with 8 units per hidden layer.
    regressor = tf.contrib.learn.DNNRegressor(
        feature_columns=feature_cols,
        hidden_units=[8, 8],
        model_dir=networkSetup.MODEL_DIR,
        activation_fn=tf.nn.sigmoid,
        optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.001)
    )
    return regressor
def train(filepath, isAuthentic):
    regressor = getRegressor()
    # train on the training set
    regressor.fit(input_fn=lambda: input_fn(filepath, isAuthentic), steps=1)
The .tfevents files contain events written by the fit method. DNNRegressor saves at least histograms and the fraction of zeros for each hidden layer. You can use TensorBoard to view them.
TensorFlow doesn't overwrite event files; it appends to them, so a bigger file doesn't by itself mean more CPU cycles per step.
You can pass a config parameter (a RunConfig instance) to the DNNRegressor constructor and specify how often summaries are saved via its save_summary_steps property. The default is to save a summary every 100 steps.
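For example, something along these lines should make summaries far less frequent (a minimal sketch only, reusing the names from the question's getRegressor; check which arguments your RunConfig version accepts):
config = tf.contrib.learn.RunConfig(save_summary_steps=10000)  # write summaries far less often
regressor = tf.contrib.learn.DNNRegressor(
    feature_columns=feature_cols,
    hidden_units=[8, 8],
    model_dir=networkSetup.MODEL_DIR,
    activation_fn=tf.nn.sigmoid,
    optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.001),
    config=config
)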
To prevent TensorFlow from creating the events.out file, you just have to comment out the part of the code that writes this file each time the model is trained.
In most models, writers are set up in the main script to create these summaries/logs for further analysis of the data, even though they are not useful in many cases.
Sample code lines from TensorFlow Inception's retrain.py:
train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train', sess.graph)
validation_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/validation')
Just comment out the part of the code creating the events.out file and you are done.
That's because of the newly generated graph files for TensorBoard.
tfFileWriter = tf.summary.FileWriter(os.getcwd())
tfFileWriter.add_graph(sess.graph)
tfFileWriter.close()
Comment out these lines if you find them in your code, and the file will no longer be generated.
Related
So, I would like to train a LightGBM model on a remote, large Ray cluster with a large dataset. Before that, I would like to write the code so that I can also run the training in a memory-constrained setting, e.g. my local laptop, where the dataset does not fit in memory. That will require some way of lazily loading the data.
The way I imagine it, it should be possible with Ray to load batches of random samples of the large dataset from disk (multiple .pq files) and feed them to the LightGBM training function. The memory would thereby act as a fast buffer that holds random loaded batches, which are fed to the training function and then removed from memory. Multiple workers take care of training plus the I/O ops for loading new samples from disk into memory. The maximum amount of memory could be constrained so as not to exceed my local resources, so that my PC doesn't crash. Is this possible?
I have not yet understood whether LightGBM needs the full dataset at once, or whether it can be fed batches iteratively, as with neural networks, for instance. So far, I have tried using the lightgbm_ray lib for this:
from lightgbm_ray import RayDMatrix, RayParams, train, RayFileType

# some stuff before
...

# make dataset
data_train = RayDMatrix(
    data=filenames,
    label=TARGET,
    feature_names=features,
    filetype=RayFileType.PARQUET,
    num_actors=2,
    lazy=True,
)

# feed to training function
evals_result = {}
bst = train(
    params_model,
    data_train,
    evals_result=evals_result,
    valid_sets=[data_train],
    valid_names=["train"],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2),
)
I thought the lazy=True keyword might take care of it; however, when I execute this, I see memory being maxed out and then my app crashes.
Thanks for any advice!
LightGBM requires loading the entire dataset for training, so in this case, you can test on your laptop with a subset of the data (i.e. only pass a subset of the parquet filenames in).
The lazy=True flag delays the data loading to be split across the actors, rather than loading into memory first, then splitting+sending to actors. However, this would still load the entire dataset into memory, since all actors are on the same (local) node.
Additionally, when you do move to running on the remote cluster, these tips might be helpful to optimize memory usage: https://docs.ray.io/en/latest/train/gbdt.html?highlight=xgboost%20memro#how-to-optimize-xgboost-memory-usage.
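For the local test suggested above, a hypothetical sketch would be to hand RayDMatrix only a few of the parquet files (filenames, TARGET, and features are the names from the question's code):
subset = filenames[:4]  # or random.sample(filenames, 4)
data_train = RayDMatrix(
    data=subset,
    label=TARGET,
    feature_names=features,
    filetype=RayFileType.PARQUET,
)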
I'm trying to load three different models in the same process. Only the first one works as expected; the rest of them return what look like random results.
Basically the order is as follows:
define and compile the first model
load its previously trained weights
rename its layers
repeat the same process for the second model
repeat the same process for the third model
So, something like:
model1 = Model(inputs=Input(shape=input_size_im), outputs=layers_firstmodel)
model1.compile(optimizer='sgd', loss='mse')
model1.load_weights(weights_first, by_name=True)
# rename layers, but it didn't work

model2 = Model(inputs=Input(shape=input_size_im), outputs=layers_secondmodel)
model2.compile(optimizer='sgd', loss='mse')
model2.load_weights(weights_second, by_name=True)
# rename layers, but it didn't work

model3 = Model(inputs=Input(shape=input_size_im), outputs=layers_thirdmodel)
model3.compile(optimizer='sgd', loss='mse')
model3.load_weights(weights_third, by_name=True)
# rename layers, but it didn't work

for im in list_images:
    results_firstmodel = model1.predict(im)
    results_secondmodel = model2.predict(im)
    results_thirdmodel = model3.predict(im)
I'd like to perform inference over a bunch of images. The idea is to loop over the images, run inference with these three models, and return the results.
I have tried renaming all layers to make them unique, with no success. I also created a different graph for each network and ran inference in a different session for each. This works, but it's very inefficient (in addition, I have to set the weights every time because sess.run(tf.global_variables_initializer()) removes them), and every time a session is created TensorFlow prints "creating tensorflow device (/device:GPU:0)".
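For reference, a rough sketch of the one-graph-and-session-per-model workaround described above (build_first_model is a placeholder for whatever builds the first architecture; weights_first and im are the variables from the code above):
import tensorflow as tf
from keras import backend as K

graph1 = tf.Graph()
with graph1.as_default():
    sess1 = tf.Session(graph=graph1)
    K.set_session(sess1)
    model1 = build_first_model()                      # hypothetical builder
    model1.load_weights(weights_first, by_name=True)

# later, at inference time, switch back to this model's own graph and session
with graph1.as_default():
    K.set_session(sess1)
    results_firstmodel = model1.predict(im)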
I am running TensorFlow 1.4.0-rc0, Keras 2.1.1, and Ubuntu 16.04 with kernel 4.14.
The OP is correct here. There is a serious bug when you try to load multiple weight files in the same script; the above answer doesn't solve it. If you actually interrogate the weights when loading weights for multiple models in the same script, you will notice that they differ from the weights you get when loading only one model on its own. This is where the randomness the OP observes comes from.
EDIT: To solve this problem, you have to encapsulate the model.load_weights call within a function, and the randomness that you are experiencing should go away. The problem is that something goes wrong when you have multiple load_weights calls in the same script, as above. If you load those model weights within a function, your issues should go away.
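A minimal sketch of that wrapper, where build_first_model and friends are hypothetical functions that each build one of the architectures from the question:
def load_model_with_weights(build_model, weights_path):
    model = build_model()
    model.compile(optimizer='sgd', loss='mse')
    model.load_weights(weights_path, by_name=True)
    return model

model1 = load_model_with_weights(build_first_model, weights_first)
model2 = load_model_with_weights(build_second_model, weights_second)
model3 = load_model_with_weights(build_third_model, weights_third)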
From the Keras docs we have this explanation for the use of load_weights:
loads the weights of the model from a HDF5 file (created by save_weights). By default, the architecture is expected to be unchanged. To load weights into a different architecture (with some layers in common), use by_name=True to load only those layers with the same name.
Therefore, if your architecture is unchanged, you should drop by_name=True or set it to False (its default value). This could be causing the inconsistencies you are facing, as your weights may not be loading at all if your layers have different names.
Another important thing to consider is the nature of your HDF5 file, and the way you created it. If it indeed contains only the weights (created with save_weights as the docs point out) then there should be no problem in proceeding as explained before.
Now, if that HDF5 contains weights and architecture in the same file, then you should be loading it with keras.models.load_model instead (further reading if you like here). If this is the case then this would also explain those inconsistencies.
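In short, and with hypothetical file names, the distinction looks like this:
# Weights-only file (created with model.save_weights); the architecture must already be built:
model1.load_weights('first_weights.h5')

# Full-model file (created with model.save); architecture and weights are restored together:
from keras.models import load_model
model1 = load_model('first_model.h5')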
As a side suggestion, I prefer to save my models using callbacks, like ModelCheckpoint or EarlyStopping if you want to automatically determine when to stop training. This not only gives you greater flexibility when training and saving your models (as you can stop at the optimal training epoch or whenever you desire), but also makes loading those models easier, as you can simply use the load_model method to load both architecture and weights into your desired variable.
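For instance, a sketch of that setup (the model and data variables, file path, monitored metric, and patience are placeholders):
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import load_model

callbacks = [
    ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True),
    EarlyStopping(monitor='val_loss', patience=5),
]
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=callbacks)

# later: restore architecture + weights in one call
restored = load_model('best_model.h5')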
Finally, here is one useful SO post where saving (and loading) Keras models is explained.
I am creating an 11-class object detector using the Faster R-CNN model, with the image resizer capped at a maximum size of 300x400. This is because a CUDA OOM error pops up if I go any higher: the GPU is a 1050 Ti 4 GB version, so I have approximately 3800-3900 MB of run-time training memory for the model.
I have followed erishima's steps and combined them with the Pets scripts and Dati Tran's script to generate the TFRecord files.
The steps were as follows:
Create the labels for the categories using labelImg.
Use the name field in labelImg to annotate the class of the image file.
Create a CSV file and extract the filename, class, xmin, ymin, xmax, and ymax from the XML files. (custom script)
Create a train and a test/eval CSV from the main CSV file.
Generate the train and test TFRecord files that are referenced in the config file. (Dati Tran's script, modified to suit my needs)
Modify faster_rcnn_config without touching the hyper-parameters.
Create a label_map.pbtxt file corresponding to the class names, with IDs starting from 1, as stated in many other answers on this topic (see the sketch after this list).
Start training the model via the stated method.
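For reference, the label map mentioned above looks roughly like this (the class names here are placeholders, not the actual ones from my dataset):
item {
  id: 1
  name: 'class_one'
}
item {
  id: 2
  name: 'class_two'
}
# ... up to id: 11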
The dataset for the classes is custom, and the number of images per class varies from 2500 down to 300. The dataset has no annotation of object orientation or of detection difficulty, even though every possible angle of the objects is present in those images.
The problem is that after I trained to a loss value of .002 over 217k steps, a single class was enveloping the objects of all the other classes, whether I ran the detector on a video or on images. I have not tried to run the eval.py script, as that takes too long on this setup, so I can't really see the mAP for the classes, but I assume it would be redundant information, since the problem should be in the dataset preparation method or in the dataset itself.
When I retrained from scratch for 60k steps, the problem persisted, but with another class enveloping all the others.
The warnings shown were:
A warning that the sparse index tensor is going to take a lot of memory. Can I change the code so that this doesn't pop up and possibly save some precious memory?
"Wanted [x,?,?,y], got [x,y,z,a,b] instead." This one stops the training. I got it twice over the 217k-step run. I have no idea where it originates; probably the dataset.
If someone can give me even a hint toward the proper fix for this, I would highly appreciate it.
I believe you have class imbalance; I had a similar problem in the past.
Do an analysis of your dataset and make sure the number of images per class is of a similar order of magnitude.
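A quick hypothetical check, assuming the annotations CSV from the question with a 'class' column (the file name is a placeholder):
import pandas as pd

df = pd.read_csv('train_labels.csv')
print(df['class'].value_counts())  # number of annotated boxes per class, largest first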
When using estimator.Estimator in tensorflow.contrib.learn, after training and prediction there are these files in the modeldir:
checkpoint
events.out.tfevents.1487956647
events.out.tfevents.1487957016
graph.pbtxt
model.ckpt-101.data-00000-of-00001
model.ckpt-101.index
model.ckpt-101.meta
When the graph is complicated or the number of variables is big, the graph.pbtxt file and the events files can be very big. Is there a way to not write these files? Since reloading the model only needs the checkpoint files, removing them won't affect evaluation and prediction down the road.
On Google's Machine Learning Crash Course they use the following approach:
import glob
import os

# Create classifier:
classifier = tf.estimator.DNNRegressor(
    feature_columns=feature_columns,
    optimizer=optimizer
)

# Train it:
classifier.train(
    input_fn=training_input_fn,
    steps=steps
)

# Remove event files to save disk space.
_ = map(os.remove, glob.glob(os.path.join(classifier.model_dir, 'events.out.tfevents*')))
If you don't want the events.out.tfevents files to be written, find something like the following in your code and delete it:
tfFileWriter = tf.summary.FileWriter(os.getcwd())
tfFileWriter.add_graph(sess.graph)
tfFileWriter.close()
I had the same issue and was not able to find any resolution for it while the events file kept growing in size. My understanding is that this file stores the events generated by TensorFlow. I went ahead and deleted it manually.
Interestingly, it never got created again, while the other files keep getting updated when I run a training sequence.
I would like to train a Caffe network with the Python interface.
The main reason is that I use multi-dimensional input of a few TBs of data, and I don't want to convert all of it to LMDB to train it.
I have found this one answer on Stack Overflow.
But that answer loads the complete data at once and uses already initialized weights.
I would like to load data into a NumPy array and then pass it to Caffe, and save the weights of the model to a .caffemodel file once every 1000 iterations.
The print_network(), get_accuracy(), and load_data() functions there are very useful and give me a good insight.
Besides using a PythonLayer, one thing you can do is use a MemoryData layer and feed in one batch of data at a time by calling solver.net.set_input_arrays(your_data) after however many iterations are needed to go through one batch of data.
Remember, you can always restore the training state by using the .solverstate file from your snapshots.
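A rough sketch of that loop, assuming the net's prototxt starts with a MemoryData layer and the solver prototxt sets snapshot: 1000 (the file names, num_batches, iters_per_batch, and the loader functions are placeholders):
import numpy as np
import caffe

solver = caffe.SGDSolver('solver.prototxt')

for batch_idx in range(num_batches):
    data = load_batch_from_disk(batch_idx)      # hypothetical loader -> float32 array, shape (N, C, H, W)
    labels = load_labels_from_disk(batch_idx)   # float32 array, shape (N,)
    solver.net.set_input_arrays(np.ascontiguousarray(data, dtype=np.float32),
                                np.ascontiguousarray(labels, dtype=np.float32))
    solver.step(iters_per_batch)                # run enough iterations to cover this batch
# snapshots (.caffemodel and .solverstate) are written every 1000 iterations by the solver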