Best practice for allocating GPU and CPU resources in TensorFlow

I'm wondering what is the correct way to set devices for creating/training a model in order to optimize resource usage for speedy training in TensorFlow with the Keras API? I have 1 CPU and 2 GPUs at my disposal. I was initially using a tf.device context to create my model and train on GPUs only, but then I saw in the TensorFlow documentation for tf.keras.utils.multi_gpu_model, they suggest explicitly instantiating the model on the CPU:
# Instantiate the base model (or "template" model).
# We recommend doing this with under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
I did this, and now when I train I see my CPU usage go way up, with all 8 cores at about 70% usage each, and my GPU memory is maxed out. Would things go faster if the model were created on one of the GPUs? Even if I have just 1 GPU, is it still better to create the model on the CPU and use a tf.device context to train the model on the GPU?

Many TensorFlow operations are accelerated using the GPU for computation. Without any annotations, TensorFlow automatically decides whether to use the GPU or CPU for an operation—copying the tensor between CPU and GPU memory, if necessary. Tensors produced by an operation are typically backed by the memory of the device on which the operation executed.
TensorFlow will only allocate memory and place operations on visible physical devices, as otherwise no LogicalDevice will be created on them. By default all discovered devices are marked as visible.
GPU utilization also depends on the batch_size, so it may change as you vary the batch_size.
You can also compare your current results (time taken and utilization) against a model built as in Example 3 of the multi_gpu_model documentation.
Also, if you follow the link, the documentation states:
Warning: THIS FUNCTION IS DEPRECATED. It will be removed after 2020-04-01. Instructions for updating: Use tf.distribute.MirroredStrategy instead.
You will likely see better performance and GPU utilization with tf.distribute.MirroredStrategy. This strategy is typically used for training on one machine with multiple GPUs. The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units. The goal is to allow users to enable distributed training using existing models and training code, with minimal changes.
For example, a variable created under a MirroredStrategy is a MirroredVariable. If no devices are specified in the constructor argument of the strategy then it will use all the available GPUs. If no GPUs are found, it will use the available CPUs. Note that TensorFlow treats all CPUs on a machine as a single device, and uses threads internally for parallelism.
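As a rough sketch of what this looks like with the Keras API (the layer sizes and input shape below are placeholders, not taken from the question), the model is built and compiled inside the strategy scope instead of under an explicit tf.device context:

```python
import tensorflow as tf

# With no `devices` argument the strategy uses every visible GPU,
# or falls back to the CPU when no GPU is found.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables must be created inside the scope so that they are mirrored,
# which means building and compiling the model here.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(8,)),           # placeholder input shape
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(3, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

# model.fit(...) is then called as usual, outside the scope.
```

With this approach there is no need to pin the template model to the CPU by hand; placement of the mirrored variables is left to the strategy.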
I would recommend going through the "Custom training with tf.distribute.Strategy" tutorial, which demonstrates how to use tf.distribute.Strategy with custom training loops. It trains a simple CNN model on the Fashion MNIST dataset.
Hope this answers your question. Happy Learning.

Related

How to train a TF model that is larger than GPU memory?

I want to train a large object detection model using TF2, preferably the EfficientDet D7 network. With my Tesla P100 card that has 16 GB of memory I am running into an "out of memory" exception, i.e. not enough memory can be allocated on the graphics card.
So I am wondering what my options are in this case. Is it correct that if I had multiple GPUs, the TF model would be split so that it fills the memory of both cards? So in my case, with a second Tesla card that also has 16 GB, would I have 32 GB in total during training? If that is the case, would that also be true for a cloud provider, where I could utilize multiple GPUs?
Moreover, if I am wrong and it would not work to split a model across multiple GPUs during training, what other approach would work in order to train a large network that does not fit into my GPU memory?
PS: I know that I could reduce the batch_size to 1, but unfortunately that still does not solve my issue for the really large models ...

You can use multiple GPUs on GCP (Google Cloud Platform) at least; I'm not too sure about other cloud providers. And yes, once you do that, you can train with a larger batch size (the exact number would depend on the GPU, its memory, and how many GPUs you have running in your VM).
You can check this link for the list of all GPUs available in GCP.
If you're using the object detection API, you can check this post regarding training using multiple GPUs.
Alternatively, if you want to go with a single GPU, one clever trick would be to use gradient accumulation, where you virtually increase your batch size without using much extra GPU memory. This is discussed in this post.
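To illustrate the idea behind gradient accumulation, here is a minimal framework-free sketch. It assumes a toy one-parameter model y = w * x with a squared-error loss; the function names and the data are made up for the illustration. Gradients from several small micro-batches are summed and applied in one update, which gives the same step as one large batch:

```python
def grad_sum(w, xs, ys):
    """Sum of per-example gradients of (w*x - y)**2 with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys))


def full_batch_step(w, xs, ys, lr):
    """One gradient step using the whole batch at once."""
    return w - lr * grad_sum(w, xs, ys) / len(xs)


def accumulated_step(w, xs, ys, lr, micro_batch_size):
    """Same update, but only one micro-batch is processed at a time."""
    accumulated = 0.0
    for start in range(0, len(xs), micro_batch_size):
        stop = start + micro_batch_size
        accumulated += grad_sum(w, xs[start:stop], ys[start:stop])
    # Apply a single update after all micro-batches have been seen
    return w - lr * accumulated / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_full = full_batch_step(0.0, xs, ys, lr=0.01)
w_accum = accumulated_step(0.0, xs, ys, lr=0.01, micro_batch_size=1)
print(w_full, w_accum)
```

Note that accumulation only raises the effective batch size; it does not reduce the memory needed for a single micro-batch, so it does not help if the model already fails to fit with a batch size of 1.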

One gpu uses more memory than others during training

I use multiple GPUs to train a model with PyTorch. One GPU uses more memory than the others, causing an "out-of-memory" error. Why would one GPU use more memory? Is it possible to make the usage more balanced? Are there other ways to reduce memory usage? (Deleting variables that will not be used anymore...?) The batch size is already 1. Thanks.

DataParallel splits the batch and sends each split to a different GPU, and each GPU has a copy of the model. The forward pass is computed independently on each GPU, but the outputs are then collected back onto one GPU, where the loss is computed, instead of the loss being computed independently on each GPU. That GPU therefore uses more memory than the others.
If you want to mitigate this issue, you can include the loss computation in the DataParallel module.
If this is still an issue, you might want model parallelism instead of data parallelism: move different parts of your model to different GPUs using .cuda(gpu_id). This is useful when the weights of your model are pretty large.
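One possible way to include the loss computation in the DataParallel module is to wrap the model and the criterion in a single nn.Module. This is only a sketch: ModelWithLoss is a hypothetical name, and the linear layer and random data stand in for the real model and dataset.

```python
import torch
import torch.nn as nn


class ModelWithLoss(nn.Module):
    """Hypothetical wrapper that runs the model and the loss on the same replica."""

    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, targets):
        outputs = self.model(inputs)
        # Return a 1-element tensor so the per-replica losses can be gathered
        return self.criterion(outputs, targets).unsqueeze(0)


net = ModelWithLoss(nn.Linear(10, 2), nn.CrossEntropyLoss())
if torch.cuda.is_available():
    net = net.cuda()
parallel_net = nn.DataParallel(net)

inputs = torch.randn(8, 10)            # placeholder batch
targets = torch.randint(0, 2, (8,))    # placeholder labels

# Only one small loss value per GPU is gathered, not the full outputs
loss = parallel_net(inputs, targets).mean()
loss.backward()
```

Only the scalar losses travel back to the first GPU, so the full outputs no longer pile up there.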

TensorFlow v1.10+ load SavedModel with different device placement or manually set dynamic device placement?

So in TensorFlow's guide for using GPUs there is a part about using multiple GPUs in a "multi-tower fashion":
...
for d in ['/device:GPU:2', '/device:GPU:3']:
    with tf.device(d): # <---- manual device placement
        ...
Seeing this, one might be tempted to leverage this style for multiple GPU training in a custom Estimator to indicate to the model that it can be distributed across multiple GPUs efficiently.
To my knowledge, if manual device placement is absent, TensorFlow does not have some form of optimal device mapping (except perhaps that, if you have the GPU version installed and a GPU is available, it uses the GPU over the CPU). So what other choice do you have?
Anyway, you carry on with training your estimator and export it to a SavedModel via estimator.export_savedmodel(...) and wish to use this SavedModel later... perhaps on a different machine, one which may not have as many GPUs as the device on which the model was trained (or maybe no GPUs)
so when you run
from tensorflow.contrib import predictor
predict_fn = predictor.from_saved_model(model_dir)
you get
Cannot assign a device for operation <OP-NAME>. Operation was
explicitly assigned to <DEVICE-NAME> but available devices are
[<AVAILABLE-DEVICE-0>,...]
An older S.O. Post suggests that changing device placement was not possible... but hopefully over time things have changed.
Thus my question is:
1. When loading a SavedModel, can I change the device placement to be appropriate for the device it is loaded on? E.g. if I train a model with 6 GPUs and a friend wants to run it at home with their e-GPU, can they set '/device:GPU:1' through '/device:GPU:5' to '/device:GPU:0'?
2. If 1 is not possible, is there a (painless) way for me, in the custom Estimator's model_fn, to specify how to generically distribute a graph?
e.g.
with tf.device('available-gpu-3')
where available-gpu-3 is the third available GPU if there are three or more GPUs, otherwise the second or first available GPU, and if no GPU it is CPU
This matters because if there is a shared machine which is training two models, say one model on '/device:GPU:0' while the other model is trained explicitly on GPUs 1 and 2, then on another 2 GPU machine, GPU 2 will not be available....

I have been doing some research on this topic recently and, to my knowledge, your question 1 can work only if you clear all devices when you export the model in the original TensorFlow code, with the flag clear_devices=True.
In my own code, it looks like
builder = tf.saved_model.builder.SavedModelBuilder('osvos_saved')
builder.add_meta_graph_and_variables(sess, ['serve'], clear_devices=True)
builder.save()
If you only have an exported model, it does not seem to be possible. You can refer to this issue.
I'm currently trying to find a way to fix this, as stated in my stackoverflow question. Hope the workaround can help you.
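For the second part of the question (picking "the third available GPU, otherwise the second or first, otherwise the CPU" when the graph is built), a small helper along these lines might do. pick_device is a hypothetical name, and the list of GPU names is assumed to be collected by the caller, for example from device_lib.list_local_devices(). This only affects graph construction; it does not change the placement stored in an already exported SavedModel.

```python
def pick_device(gpu_names, preferred_index):
    """Return the preferred GPU, falling back to a lower index, then to the CPU.

    gpu_names is assumed to be a list such as ['/device:GPU:0', '/device:GPU:1'],
    gathered by the caller from whatever device listing is available.
    """
    if not gpu_names:
        return "/device:CPU:0"
    # Clamp the requested index to the last GPU that actually exists
    index = min(preferred_index, len(gpu_names) - 1)
    return gpu_names[index]


three_gpus = ["/device:GPU:0", "/device:GPU:1", "/device:GPU:2"]
print(pick_device(three_gpus, 2))
print(pick_device(three_gpus[:2], 2))
print(pick_device([], 2))
```

The returned string can then be passed to tf.device(...) in the model_fn instead of a hard-coded device name.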

Keras not using full CPU cores for training

I am training an LSTM model on a very large dataset on my machine using Keras on the TensorFlow backend. My machine has 16 cores. While training the model I noticed that the load on every core is below 40%.
I have gone through different sources looking for a solution and have tried specifying the number of cores to use in the backend as
config = tf.ConfigProto(device_count={"CPU": 16})
backend.tensorflow_backend.set_session(tf.Session(config=config))
Even after that the load is still the same.
Is this because the model is very small? It is taking around 5 minutes per epoch. If it used all the cores, the speed could be improved.
How do I tell Keras or TensorFlow to use all the available cores, i.e. 16 cores, to train the model?
I have gone through these Stack Overflow questions and tried the solutions mentioned there. They didn't help.
Limit number of cores used in Keras

How are you training the model exactly? You might want to look into using model.fit_generator(), but with a Keras Sequence object instead of a custom generator. This allows you to safely use multiprocessing and will result in all cores being used.
You can check out the Keras docs for an example.
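A minimal Sequence might look like the sketch below. ArraySequence is a hypothetical name, and the random arrays stand in for the real dataset; the array shapes are placeholders.

```python
import math

import numpy as np
import tensorflow as tf


class ArraySequence(tf.keras.utils.Sequence):
    """Hypothetical Sequence that serves batches from in-memory arrays."""

    def __init__(self, features, labels, batch_size):
        super().__init__()
        self.features = features
        self.labels = labels
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch (the last batch may be smaller)
        return math.ceil(len(self.features) / self.batch_size)

    def __getitem__(self, index):
        start = index * self.batch_size
        stop = start + self.batch_size
        return self.features[start:stop], self.labels[start:stop]


features = np.random.rand(100, 20, 4).astype("float32")  # placeholder data
labels = np.random.rand(100, 1).astype("float32")
train_sequence = ArraySequence(features, labels, batch_size=32)

# With an existing compiled `model`, training would then look like:
# model.fit_generator(train_sequence, workers=4, use_multiprocessing=True)
```

Because each batch is addressed by index, worker processes can prepare different batches in parallel without duplicating data. In newer Keras releases, model.fit accepts a Sequence directly, and the workers and use_multiprocessing options may be set on the Sequence itself, so check the documentation for the version you are using.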

Keras with Tensorflow backend - Run predict on CPU but fit on GPU

I am using keras-rl to train my network with the D-DQN algorithm. I am running my training on the GPU with the model.fit_generator() function to allow data to be sent to the GPU while it is doing backprops. I suspect the generation of data to be too slow compared to the speed of processing data by the GPU.
In the generation of data, as instructed in the D-DQN algorithm, I must first predict Q-values with my models and then use these values for the backpropagation. And if the GPU is used to run these predictions, it means that they are breaking the flow of my data (I want backprops to run as often as possible).
Is there a way I can specify on which device to run specific operations? In a way that I could run the predictions on the CPU and the backprops on the GPU.

Maybe you can save the model at the end of the training. Then start another Python file and write os.environ["CUDA_VISIBLE_DEVICES"]="-1" before you import any Keras or TensorFlow stuff. Now you should be able to load the model and make predictions with your CPU.
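A sketch of what that second file could start with is shown below. The commented lines are placeholders: "model.h5" and batch are assumptions, not names from the question.

```python
import os

# Hide all GPUs from this process. This must run before TensorFlow or
# Keras is imported, otherwise the GPUs have already been discovered.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# Only now import the framework and load the saved model:
# import tensorflow as tf
# model = tf.keras.models.load_model("model.h5")  # placeholder path
# predictions = model.predict(batch)              # `batch` is a placeholder
```

This applies to the whole process, so training on the GPU and predicting on the CPU would have to happen in separate processes with this approach.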

It's hard to properly answer your question without seeing your code.
The code below shows how you can list the available devices and force TensorFlow to use a specific device.
import tensorflow as tf
from tensorflow.python.client import device_lib

def get_available_devices():
    # Device names look like '/device:CPU:0' or '/device:GPU:0'
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]

print(get_available_devices())

with tf.device('/gpu:0'):
    pass  # Do GPU stuff here
with tf.device('/cpu:0'):
    pass  # Do CPU stuff here
