Use multiple GPUs for inception_v3 model in TF slim - python

I am trying to train a slim model using 3 GPUs.
I am specifically telling TF to use the second GPU to allocate the model:
with tf.device('device:GPU:1'):
    logits, end_points = inception_v3(inputs)
However, I'm getting an OOM error on that GPU every time I run my code. I've tried reducing the batch_size so the model fits in memory, but then the network's performance is ruined.
I own 3 GPUs, so is there a way to tell TF to use my third GPU when the second is full? I've tried not telling TF to use any GPU and allowing soft placement, but it is not working either.

This statement with tf.device('device:GPU:1') tells tensorflow specifically to use GPU-1, so it won't attempt to use any other device you have.
When the model is too big, the recommended way is to use model parallelism via manually splitting your graph into different GPUs. The complication in your case is that the model definition is in the library, so you can't insert tf.device statements for different layers unless you patch tensorflow.
But there is a workaround.
You can define and place the variables yourself before invoking the inception_v3 builder. That way inception_v3 will reuse those variables and keep their placement. Example:
with tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
    with tf.device('device:GPU:1'):
        tf.get_variable("InceptionV3/Logits/Conv2d_1c_1x1/biases", shape=[1000])
        tf.get_variable("InceptionV3/Logits/Conv2d_1c_1x1/weights", shape=[1, 1, 2048, 1000])
    with tf.device('device:GPU:0'):
        logits, end_points = inception_v3(inputs)
Upon running, you'll see that all variables except the Conv2d_1c_1x1 ones are placed on GPU-0, while the Conv2d_1c_1x1 layer is on GPU-1.
The drawback is that you need to know the name and shape of each variable you want to place manually. But it is doable and at least gets your model running.
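If you are unsure of the exact variable names and shapes the builder creates, one way to list them (a minimal sketch, assuming TF 1.x with the slim nets from tf.contrib) is to build the model once in a throwaway graph and print every variable:
import tensorflow as tf
from tensorflow.contrib.slim.nets import inception
# Build the model once in a scratch graph, just to enumerate its variables.
with tf.Graph().as_default():
    images = tf.placeholder(tf.float32, [None, 299, 299, 3])
    inception.inception_v3(images, num_classes=1000)
    for var in tf.global_variables():
        print(var.name, var.shape.as_list())
You can then copy the names and shapes of the layers you want to move to another GPU into tf.get_variable calls like the ones above.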

Related

Train complicated nn models with tf.eager (better with TF2 symbolic support)

Is there a (more or less) simple way to write a complicated NN model so that it will be trainable in eager mode? Are there examples of such code?
For example, I want to use InceptionResnetV2. I have the code created with tf.contrib.slim. According to this link, https://github.com/tensorflow/tensorflow/issues/16182 , slim is deprecated and I need to use Keras. And I really can't use the slim code for training with eager execution because I can't take the list of variables and apply gradients (OK, I can try to wrap the model into a GradientTape, but I'm not sure what to do with the regularization loss).
Ok, let's try Keras.
In [30]: tf.__version__
Out[30]: '1.13.1'
In [31]: tf.enable_eager_execution()
In [32]: from keras.applications.inception_resnet_v2 import InceptionResNetV2
In [33]: model = InceptionResNetV2(weights=None)
...
/usr/local/lib/python3.6/dist-packages/keras_applications/inception_resnet_v2.py in InceptionResNetV2(include_top, weights, input_tensor, input_shape, pooling, classes, **kwargs)
    246
    247     if input_tensor is None:
--> 248         img_input = layers.Input(shape=input_shape)
    249     else:
    250         if not backend.is_keras_tensor(input_tensor):
...
RuntimeError: tf.placeholder() is not compatible with eager execution.
Doesn't work by default.
In this tutorial they say that I need to define my own model class and maintain the variables myself: https://www.tensorflow.org/tutorials/eager/custom_training#define_the_model. I'm not sure I want to do that for Inception; there are too many variables to create and maintain. It's like going back to the old versions of TF, in the days when even slim didn't exist.
In this tutorial networks are created using Keras: https://www.tensorflow.org/tutorials/eager/custom_training_walkthrough#create_a_model_using_keras, but I doubt that I can easily maintain a complicated structure that way, by only defining the model without running it on an Input. For example, in this article, if I understand correctly, the author initializes a Keras Input and propagates it through the model (which causes the RuntimeError when used with eager execution, as you saw earlier). I can make my own model by subclassing the Model class as here: https://www.tensorflow.org/api_docs/python/tf/keras/Model, but then I need to maintain layers, not variables. It seems like almost the same problem to me.
There is an interesting mention of AutoGraph here: https://www.tensorflow.org/beta/guide/autograph#keras_and_autograph. They only override __call__, so it seems like I don't need to maintain variables in this case, but I haven't tested it yet.
So, is there any simple solution?
Wrap the slim model in a GradientTape? How can I then apply the regularization loss to the weights?
Track every variable myself? Sounds a little bit painful.
Use Keras? How do I use it with eager execution when the model has branches and a complicated structure?
Your first approach is probably the most common. This error:
RuntimeError: tf.placeholder() is not compatible with eager execution.
is because one cannot use a tf.placeholder in eager mode. There is no concept of such a thing when executing eagerly.
You could use the tf.data API to build a dataset for your training data and feed it to the model. Something like this, with the datasets replaced by your real data:
import tensorflow as tf
tf.enable_eager_execution()
BATCH_SIZE = 2
model = tf.keras.applications.inception_resnet_v2.InceptionResNetV2(weights=None)
# The labels below are integer class indices, so use the sparse cross-entropy variant.
model.compile(tf.keras.optimizers.Adam(), loss=tf.keras.losses.sparse_categorical_crossentropy)
### Replace with tf.data.Datasets built from your actual training data!
train_x = tf.data.Dataset.from_tensor_slices(tf.random.normal((10, 299, 299, 3)))
train_y = tf.data.Dataset.from_tensor_slices(tf.random.uniform((10,), maxval=10, dtype=tf.int32))
training_data = tf.data.Dataset.zip((train_x, train_y)).batch(BATCH_SIZE)
model.fit(training_data)
This approach works as is in TensorFlow 2.0 too as mentioned in your title.
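If you would rather keep the model but write the training loop yourself, the GradientTape route from your first bullet also works, and the regularization terms are available through model.losses. A minimal sketch, assuming TF 1.x eager execution (hence tf.train.AdamOptimizer) and a Keras-built model; swap in your own model, loss, and dataset:
import tensorflow as tf
tf.enable_eager_execution()
model = tf.keras.applications.inception_resnet_v2.InceptionResNetV2(weights=None)
optimizer = tf.train.AdamOptimizer()
loss_fn = tf.keras.losses.sparse_categorical_crossentropy
def train_step(images, labels):
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = tf.reduce_mean(loss_fn(labels, predictions))
        # model.losses collects the per-layer regularization terms, if any were defined.
        if model.losses:
            loss += tf.add_n(model.losses)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
for images, labels in training_data:  # training_data is the tf.data.Dataset from above
    train_step(images, labels)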

Keras tf backend predict speed slow for batch size of 1

I am combining a Monte Carlo tree search with a convolutional neural network as the rollout policy. I've identified the Keras model.predict function as being very slow. After experimentation, I found that, surprisingly, the model parameter size and the prediction sample size don't affect the speed significantly. For reference:
0.00135549 s for 3 samples with batch_size = 3
0.00303991 s for 3 samples with batch_size = 1
0.00115528 s for 1 sample with batch_size = 1
0.00136132 s for 10 samples with batch_size = 10
As you can see, I can predict 10 samples at about the same speed as 1 sample. The change is also very minimal, though noticeable, if I decrease the parameter size by 100x, but I'd rather not change the parameter size that much anyway. In addition, the predict function is very slow the first time it is run (~0.2 s), though I don't think that's the problem here since the same model is predicting multiple times.
I wonder if there is some workaround, because clearly the 10 samples can be evaluated very quickly. All I want is to be able to predict the samples at different times rather than all at once, since I need to update the tree search before making a new prediction. Perhaps I should work with tensorflow directly instead?
The batch size controls parallelism when predicting, so it is expected that increasing the batch size gives better performance, as you can use more cores and use the GPU more efficiently.
You cannot really work around this; there is nothing to work around, as using a batch size of one is the worst case for performance. Maybe you should look into a smaller network that is faster to predict, or predict on the CPU if your experiments are run on a GPU, to minimize the overhead due to transfers.
Don't forget that model.predict does a full forward pass of the network, so its speed depends entirely on the network architecture.
One change that gave me a speed-up was switching from model.predict(x) to
model.predict_on_batch(x)
making sure your x shape has 1 as its first dimension.
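As a rough illustration (a sketch only; model stands in for your own network and the input shape is arbitrary), you can time the two calls yourself. predict splits its input into batches and sets up extra machinery on every call, while predict_on_batch runs the single batch directly:
import time
import numpy as np
x = np.random.rand(1, 299, 299, 3).astype(np.float32)  # single sample, batch dimension of 1
start = time.perf_counter()
model.predict(x)
print("predict:", time.perf_counter() - start)
start = time.perf_counter()
model.predict_on_batch(x)
print("predict_on_batch:", time.perf_counter() - start)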
I don't think working with pure Tensorflow would change the performance much; Keras is a high-level API over the low-level Tensorflow primitives. You could use a smaller model instead, like MobileNetV3 or EfficientNet, but this would require retraining.
If you need to stay with the existing model, you could try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting it to Intermediate Representation (IR), performing graph pruning and fusing some operations into others while preserving accuracy. Then it uses vectorization at runtime.
It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow2]
Save your model as SavedModel
OpenVINO cannot convert an HDF5 model directly, so you have to save it as a SavedModel first.
import tensorflow as tf
from custom_layer import CustomLayer  # only needed if your model uses custom layers
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the Tensorflow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (the graphics integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice is, use AUTO. Since you care about latency, I suggest adding a performance hint (as shown below) so that the device that fulfills your requirement is used.
from openvino.runtime import Core
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})
# Get the output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.

Keras seems slower than tensorflow while feeding training data

I am currently converting a project from tensorflow to keras.
Everything seems fine and I am very impressed by how easy it is to build models with keras. However, training is much slower with Keras, where my GPU is significantly less utilized.
I am using a Tensorflow generator Dataset to load my training data. Luckily keras seems to accept that with no problems.
The problem
However, while using tensorflow to train on the dataset I achieve an average GPU utilization of ~70%. When I train the same network with the same dataset generator using Keras, I only achieve ~35% GPU utilization.
The problem seems to be that I have a very simple network, so I need to feed data to the GPU as fast as possible, since much of the time is spent there compared to actually doing backpropagation.
Using tensorflow, the key seemed to be to not use feed-dicts but instead use the tensor from my dataset as the input to the graph. Basically, this can be reduced to:
x, y = iterator.get_next()  # Get the dataset tensors
loss = tf.reduce_sum(tf.square(y - model_out))  # Use the y tensor directly for the loss
# Use x as the input layer in my model <- implementation omitted
I would like to achieve the same thing with Keras, hence I did something like this, where I set x as the input and y as the target tensor. (Can I somehow get rid of having to put y in a list for the target tensor?)
x, y = iterator.get_next()  # Get the dataset tensors
model_input = keras.Input(tensor=x)
# Build the model with model_input as the input layer and something as the output layer. <- implementation omitted
model = tf.keras.Model(inputs=model_input, outputs=something)  # Insert the dataset tensor directly as input
model.compile(loss='mean_squared_error',
              optimizer=...,  # something
              metrics=['accuracy'],
              target_tensors=[y])  # Use the dataset y tensor directly in the loss calculation
Basically, that should set x as the input tensor and y as the tensor used directly for the loss, just like in the tensorflow version. I can now call model.fit without providing x and y arguments explicitly, since they are used directly in the graph:
model.fit(validation_data=validation_iterator,
          steps_per_epoch=5000,
          validation_steps=1)
To me it seems like I am now doing the same thing with Keras and tensorflow; however, Keras is way slower, with about half the GPU utilization of the pure tensorflow implementation.
Am I doing something wrong here, or should I just accept this slowdown if I want to use Keras?
I experienced the same issue on TensorFlow 1.13 and solved it by upgrading to TensorFlow 1.14 / 2.0.0.
As a sanity check, I wrapped the TensorFlow graph (as is) as a Keras model and trained it using model.fit(). When using TensorFlow 1.13, I got a 50% drop in throughput relative to training the pure TensorFlow implementation. In both cases, I used the same tf.data.Dataset input pipeline.
Using TensorFlow 1.14 solved the issue (now I get roughly the same throughput for both cases mentioned above). Later I migrated to TensorFlow 2.0.0 (alpha) and also got the same throughput for both cases.
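As a side note, on 1.14+/2.0 you also no longer need the Input(tensor=x) / target_tensors workaround; a tf.data.Dataset of (x, y) pairs can be passed straight to model.fit. A minimal sketch, where the toy dataset and tiny model are placeholders for your real pipeline and network:
import tensorflow as tf
# Placeholder data; replace with your real generator-backed tf.data pipeline.
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal((1000, 32)), tf.random.normal((1000, 1))))
dataset = dataset.batch(64).prefetch(tf.data.experimental.AUTOTUNE)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer='adam', loss='mean_squared_error')
# The dataset yields (x, y) batches, so no explicit x/y or target_tensors are needed.
model.fit(dataset, epochs=1)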

Keras with Tensorflow backend - Run predict on CPU but fit on GPU

I am using keras-rl to train my network with the D-DQN algorithm. I am running my training on the GPU with the model.fit_generator() function, to allow data to be sent to the GPU while it is doing backprops. I suspect the generation of data is too slow compared to the speed at which the GPU processes it.
In the generation of data, as instructed by the D-DQN algorithm, I must first predict Q-values with my models and then use these values for the backpropagation. And if the GPU is used to run these predictions, it means that they are breaking the flow of my data (I want backprops to run as often as possible).
Is there a way I can specify on which device to run specific operations? In a way that I could run the predictions on the CPU and the backprops on the GPU.
Maybe you can save the model at the end of the training, then start another Python process and set os.environ["CUDA_VISIBLE_DEVICES"] = "-1" before you import any Keras or tensorflow modules. Then you should be able to load the model and make predictions with your CPU.
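A minimal sketch of such a CPU-only prediction script (the model path and input shape are illustrative placeholders; the important part is setting the environment variable before the first Keras/tensorflow import):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide all GPUs before importing Keras/TF
from keras.models import load_model
import numpy as np
model = load_model("my_model.h5")  # hypothetical path to the model saved after training
q_values = model.predict(np.zeros((1, 84, 84, 4), dtype=np.float32))  # example input shape; adjust to your network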
It's hard to properly answer your question without seeing your code.
The code below shows how you can list the available devices and force tensorflow to use a specific device.
from tensorflow.python.client import device_lib
import tensorflow as tf
def get_available_devices():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]
print(get_available_devices())
with tf.device('/gpu:0'):
    ...  # Do GPU stuff here
with tf.device('/cpu:0'):
    ...  # Do CPU stuff here

Unable to load and use multiple keras models

I'm trying to load three different models in the same process. Only the first one works as expected; the rest of them return what look like random results.
Basically the order is as follows:
define and compile the first model
load its previously trained weights
rename layers
the same process for the second model
the same process for the third model
So, something like:
model1 = Model(inputs=Input(shape=input_size_im), outputs=layers_firstmodel)
model1.compile(optimizer='sgd', loss='mse')
model1.load_weights(weights_first, by_name=True)
# rename layers but didn't work
model2 = Model(inputs=Input(shape=input_size_im), outputs=layers_secondmodel)
model2.compile(optimizer='sgd', loss='mse')
model2.load_weights(weights_second, by_name=True)
# rename layers but didn't work
model3 = Model(inputs=Input(shape=input_size_im), outputs=layers_thirdmodel)
model3.compile(optimizer='sgd', loss='mse')
model3.load_weights(weights_third, by_name=True)
# rename layers but didn't work
for im in list_images:
    results_firstmodel = model1.predict(im)
    results_secondmodel = model2.predict(im)
    results_thirdmodel = model3.predict(im)
I'd like to perform inference over a bunch of images. To do that, the idea is to loop over the images, run inference with these three models, and return the results.
I have tried renaming all layers to make them unique, with no success. I also created a different graph for each network and ran the inference with a different session for each. This works, but it is very inefficient (in addition, I have to set their weights every time because sess.run(tf.global_variables_initializer()) resets them). Each time a session is created, tensorflow prints "Creating TensorFlow device (/device:GPU:0)".
I am running Tensorflow 1.4.0-rc0, Keras 2.1.1 and Ubuntu 16.04 kernel 4.14.
The OP is correct here. There is a serious bug when you try to load multiple weight files in the same script. The answer above doesn't solve this. If you actually inspect the weights when loading weights for multiple models in the same script, you will notice that they differ from the weights you get when loading only one model on its own. This is where the randomness the OP observes comes from.
EDIT: To solve this problem, encapsulate the model.load_weights call within a function, and the randomness you are experiencing should go away. The problem is that something goes wrong when you have multiple load_weights calls at the top level of the same script, as you have above. If you load those model weights inside a function, your issues should go away.
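A sketch of what that encapsulation could look like; the builder functions (build_first_model and friends) are hypothetical stand-ins for however you construct each architecture:
def build_and_load(build_fn, weights_path):
    # Build, compile and load one model entirely inside a function scope.
    model = build_fn()  # e.g. a function returning Model(inputs=..., outputs=...)
    model.compile(optimizer='sgd', loss='mse')
    model.load_weights(weights_path, by_name=True)
    return model
model1 = build_and_load(build_first_model, weights_first)
model2 = build_and_load(build_second_model, weights_second)
model3 = build_and_load(build_third_model, weights_third)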
From the Keras docs we have this explanation for the use of load_weights:
loads the weights of the model from a HDF5 file (created by save_weights). By default, the architecture is expected to be unchanged. To load weights into a different architecture (with some layers in common), use by_name=True to load only those layers with the same name.
Therefore, if your architecture is unchanged, you should drop the by_name=True or set it to False (its default value). This could be causing the inconsistencies you are facing, as your weights may not be loaded properly due to your layers having different names.
Another important thing to consider is the nature of your HDF5 file and the way you created it. If it indeed contains only the weights (created with save_weights, as the docs point out), then there should be no problem proceeding as explained before.
Now, if that HDF5 file contains both the weights and the architecture, then you should load it with keras.models.load_model instead (further reading here, if you like). If this is the case, that would also explain those inconsistencies.
As a side suggestion, I prefer to save my models using callbacks, like ModelCheckpoint or EarlyStopping if you want to automatically determine when to stop training (see the sketch below). This not only gives you greater flexibility when training and saving your models (as you can stop at the optimal training epoch or whenever you desire), but also makes loading them easy, since you can simply use the load_model method to load both the architecture and the weights into your desired variable.
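A minimal sketch of that callback-based workflow (the filenames, monitored metric, and training data names are illustrative, not fixed requirements):
from keras.callbacks import ModelCheckpoint, EarlyStopping
from keras.models import load_model
callbacks = [
    # Save the full model (architecture + weights) whenever the validation loss improves.
    ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True),
    # Stop training once the validation loss has not improved for 5 epochs.
    EarlyStopping(monitor='val_loss', patience=5),
]
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100, callbacks=callbacks)
# Later, restore architecture and weights in one call.
best_model = load_model('best_model.h5')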
Finally, here is one useful SO post where saving (and loading) Keras models is explained.
