I use multi-GPU training with PyTorch. One GPU uses more memory than the others, causing an "out of memory" error. Why would one GPU use more memory? Is it possible to make the usage more balanced? Are there other ways to reduce memory usage (deleting variables that will not be used anymore...)? The batch size is already 1. Thanks.
DataParallel splits the batch and sends each split to a different GPU. Each GPU has a copy of the model and runs its forward pass independently, but the outputs from every GPU are then gathered back onto a single GPU instead of the loss being computed independently on each GPU. That gathering step is why the first GPU ends up using more memory than the others.
If you want to mitigate this issue, you can include the loss computation in the DataParallel module, so that each GPU returns only a small loss tensor instead of its full output.
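A minimal sketch of that idea, assuming a toy model and criterion (the ModelWithLoss wrapper, the layer sizes, and the shapes here are made up for illustration):

import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Wraps a model and its criterion so the loss is computed on each replica."""
    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, targets):
        outputs = self.model(inputs)              # runs on this replica's GPU
        return self.criterion(outputs, targets)   # returns a scalar loss per GPU

# Toy model/criterion; replace with your own.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
wrapped = nn.DataParallel(ModelWithLoss(model, nn.CrossEntropyLoss()).cuda())

inputs = torch.randn(8, 128).cuda()
targets = torch.randint(0, 10, (8,)).cuda()
loss = wrapped(inputs, targets).mean()   # gathers one scalar loss per GPU
loss.backward()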
If that is still not enough, then you might want model parallelism instead of data parallelism: move different parts of your model to different GPUs using .cuda(gpu_id). This is useful when the weights of your model are very large.
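A minimal sketch of that kind of model parallelism, assuming two GPUs and made-up layer sizes:

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    """Keeps part of the network on each GPU and moves activations between them."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(128, 256), nn.ReLU()).cuda(0)
        self.part2 = nn.Linear(256, 10).cuda(1)

    def forward(self, x):
        x = self.part1(x.cuda(0))      # first part runs on GPU 0
        return self.part2(x.cuda(1))   # activations moved to GPU 1 for the rest

model = TwoGPUModel()
out = model(torch.randn(4, 128))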
I want to train a large object detection model using TF2, preferably the EfficientDet D7 network. With my Tesla P100 card, which has 16 GB of memory, I am running into an "out of memory" exception, i.e. not enough memory can be allocated on the graphics card.
So I am wondering what my options are in this case. Is it correct that if I had multiple GPUs, the TF model would be split so that it fills the memory of both cards? So in my case, with a second Tesla card with another 16 GB, would I have 32 GB in total available during training? If that is the case, would that also be true for a cloud provider, where I could utilize multiple GPUs?
Moreover, if I am wrong and splitting a model across multiple GPUs during training does not work, what other approach would let me train a large network that does not fit into my GPU memory?
PS: I know that I could reduce the batch_size to 1, but unfortunately that still does not solve my issue for the really large models ...
You can use multiple GPUs in GCP (Google Cloud Platform) at least; I'm not sure about other cloud providers. And yes, once you do that, you can train with a larger batch size (the exact number depends on the GPU, its memory, and how many GPUs you have running in your VM).
You can check this link for the list of all GPUs available in GCP.
If you're using the Object Detection API, you can check this post regarding training with multiple GPUs.
Alternatively, if you want to stay with a single GPU, one clever trick is gradient accumulation, where you virtually increase your batch size without using much extra GPU memory; this is discussed in this post.
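For illustration, here is a minimal sketch of gradient accumulation with a TF2 custom training loop (the tiny model, the random dataset, and the accum_steps value are all placeholders); gradients from several small batches are summed and applied in one optimizer step, which mimics a larger batch without the extra activation memory:

import tensorflow as tf

# Stand-in model, optimizer, and data; swap in your own network and pipeline.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(32,))])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 32]),
     tf.random.uniform([64], maxval=10, dtype=tf.int32))).batch(2)

accum_steps = 8   # effective batch size = accum_steps * per-step batch size
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True)) / accum_steps
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % accum_steps == 0:
        # One optimizer update with the summed gradients, then reset.
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]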
If I have a single GPU with 8GB RAM and I have a TensorFlow model (excluding training/validation data) that is 10GB, can TensorFlow train the model?
If yes, how does TensorFlow do this?
Notes:
I'm not looking for distributed GPU training. I want to know about single GPU case.
I'm not concerned about the training/validation data sizes.
No, you cannot train a model larger than your GPU's memory (there may be some ways with dropout that I am not aware of, but in general it is not advised). Furthermore, you would need even more memory than the parameters alone take up, because your GPU has to retain the parameters along with their gradients at each step in order to do back-prop.
Not to mention the smaller batch size this would require, since there is less space left for the data.
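To make the memory argument concrete, a rough back-of-the-envelope estimate (assuming fp32 weights trained with Adam; the exact factor depends on the optimizer, and activations are not counted at all):

param_bytes = 10 * 1024**3          # a "10 GB" model held in fp32
training_bytes = 4 * param_bytes    # params + gradients + 2 Adam moment buffers
print(training_bytes / 1024**3)     # ~40 GB needed before activations and data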
I am running the GPT-2 code of the large model (774M). It is used for the generation of text samples through interactive_conditional_samples.py, link: here
So I've given an input file containing prompts which are automatically selected to generate output. This output is also automatically copied into a file. In short, I'm not training it; I'm using the model to generate text.
Also, I'm using a single GPU.
The problem I'm facing is that the code is not utilizing the GPU fully.
Using the nvidia-smi command, I was able to see the image below:
https://imgur.com/CqANNdB
It depends on your application. It is not unusual to have low GPU utilization when the batch_size is small. Try increasing the batch_size for more GPU utilization.
In your case, you have set batch_size=1 in your program. Increase the batch_size to a larger number and verify the GPU utilization.
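For instance, assuming your modified script still exposes the same arguments as the original interact_model() in interactive_conditional_samples.py (where nsamples must be a multiple of batch_size), a call along these lines generates several samples per forward pass instead of one:

# Hypothetical call; assumes interact_model() from the GPT-2 sample code is
# importable and keeps its nsamples/batch_size arguments.
from interactive_conditional_samples import interact_model

interact_model(
    model_name='774M',
    nsamples=4,      # total samples to generate per prompt
    batch_size=4,    # samples generated in a single forward pass
)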
Let me explain using MNIST-size networks. They are tiny and it's hard to achieve high GPU (or CPU) efficiency for them. You get higher computational efficiency with a larger batch size, meaning you can process more examples per second, but you also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it's a trade-off. For tiny character models, the statistical efficiency drops off very quickly after batch_size=100, so it's probably not worth growing the batch size for training. For inference, you should use the largest batch size you can.
Hope this answers your question. Happy Learning.
I'm wondering what the correct way is to set devices for creating/training a model in order to optimize resource usage for speedy training in TensorFlow with the Keras API. I have 1 CPU and 2 GPUs at my disposal. I was initially using a tf.device context to create my model and train on the GPUs only, but then I saw that the TensorFlow documentation for tf.keras.utils.multi_gpu_model suggests explicitly instantiating the model on the CPU:
# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)

# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
I did this, and now when I train I see my CPU usage go way up, with all 8 cores at about 70% usage each, and my GPU memory maxed out. Would things go faster if the model were created on one of the GPUs? Even if I have just 1 GPU, is it still better to create the model on the CPU and use a tf.device context to train it on the GPU?
Many TensorFlow operations are accelerated using the GPU for computation. Without any annotations, TensorFlow automatically decides whether to use the GPU or CPU for an operation—copying the tensor between CPU and GPU memory, if necessary. Tensors produced by an operation are typically backed by the memory of the device on which the operation executed.
Tensorflow will only allocate memory and place operations on visible physical devices, as otherwise no LogicalDevice will be created on them. By default all discovered devices are marked as visible.
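For instance, a minimal sketch of controlling which physical GPUs are visible (the device index here is arbitrary):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Only the first GPU stays visible; no LogicalDevice (and no memory
    # allocation) is created for the others.
    tf.config.set_visible_devices(gpus[0], 'GPU')
    print(tf.config.list_logical_devices('GPU'))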
Also GPU utilization depends on the batch_size. The utilization may change with varying batch_size.
You can also compare your current results (time taken and utilization) with the model from Example 3 in the multi_gpu_model documentation.
Also, if you go to that link, it states:
Warning: THIS FUNCTION IS DEPRECATED. It will be removed after 2020-04-01. Instructions for updating: Use tf.distribute.MirroredStrategy instead.
There should be a performance and GPU utilization improvement with tf.distribute.MirroredStrategy. This strategy is typically used for training on one machine with multiple GPUs. The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units. The goal is to allow users to enable distributed training using existing models and training code, with minimal changes.
For example, a variable created under a MirroredStrategy is a MirroredVariable. If no devices are specified in the constructor argument of the strategy then it will use all the available GPUs. If no GPUs are found, it will use the available CPUs. Note that TensorFlow treats all CPUs on a machine as a single device, and uses threads internally for parallelism.
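A minimal sketch of switching to MirroredStrategy (the small Keras model here is only a placeholder; the important part is creating and compiling the model inside the strategy scope):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # uses all visible GPUs by default
print('Number of replicas:', strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside this scope become MirroredVariables.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer='rmsprop')

# model.fit(...) is then called as usual; each batch is split across the replicas.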
I would recommend going through the Custom training with tf.distribute.Strategy tutorial, which demonstrates how to use tf.distribute.Strategy with custom training loops. It trains a simple CNN model on the Fashion MNIST dataset.
Hope this answers your question. Happy Learning.
From what I see, most people seem to initialize an entire model and send the whole thing to the GPU. But I have a neural net model that is too big to fit entirely on my GPU. Is it possible to keep the model in RAM, but run all the operations on the GPU?
I do not believe this is possible. However, one easy workaround would be to split your model into sections that fit into GPU memory along with your batch input, and then:
1. Send the first part(s) of the model to the GPU and compute the outputs.
2. Release that part of the model from GPU memory, and send the next section of the model to the GPU.
3. Feed the output from step 1 into the next section of the model and save the outputs.
Repeat steps 1 through 3 until you reach your model's final output.
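A rough PyTorch sketch of those steps (the section sizes are arbitrary, and this only covers the forward pass; back-prop would additionally require keeping or reloading the relevant sections and activations):

import torch
import torch.nn as nn

# Hypothetical model kept in CPU RAM, split into sections that fit on the GPU.
sections = [
    nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()),
    nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()),
    nn.Linear(1024, 10),
]

def forward_in_sections(x, sections, device='cuda'):
    x = x.to(device)
    for section in sections:
        section.to(device)        # steps 1/2: move the next section to the GPU
        with torch.no_grad():
            x = section(x)        # step 3: feed the previous output into it
        section.to('cpu')         # release the section from GPU memory
    return x

out = forward_in_sections(torch.randn(4, 1024), sections)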