Why is single-GPU training faster than training on 2 GPUs when using the Huggingface Trainer?

I have a VM with 2 V100s and I am training gpt2-like models (same architecture, fewer layers) using the really nice Trainer API from Huggingface. I am using the pytorch back-end.
I am observing that when I train the exact same model (6 layers, ~82M parameters) with exactly the same data and TrainingArguments, training on a single GPU is significantly faster than on 2 GPUs: ~5 hrs vs ~6.5 hrs.
How would one debug this kind of issue to understand what's causing the slowdown?
Extra notes:
the 2 gpus are both being used (watching nvidia-smi output)
I am using fp16 precision
My TrainingArguments values are:
{
"optim": "adamw_torch",
"evaluation_strategy": "epoch",
"save_strategy": "epoch",
"fp16": true,
"gradient_checkpointing": true,
"per_device_train_batch_size": 16,
"per_device_eval_batch_size": 16,
"dataloader_num_workers": 4,
"dataloader_pin_memory": true,
"gradient_accumulation_steps": 1,
"num_train_epochs": 5
}
The output of nvidia-smi topo -m is:
$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     0-11            N/A
GPU1    SYS      X      0-11            N/A
I understand that without NVLink, inter-GPU communication is not as fast as it could be, but can that be the only cause of a slowdown like the one I'm observing? And if so, is there anything I can do, or will I always have slower training times on 2 GPUs (thus making multi-GPU training essentially useless)?

Keeping this here for reference. The cause was "gradient_checkpointing": true. The slowdown induced by gradient checkpointing appears to be larger on 2 GPUs than on a single GPU. I don't know why this happens; if anyone does, I would really appreciate an explanation.
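To put the two timings from the question on a common footing, the arithmetic can be sketched as below. It assumes (this is not stated in the post) that the 2-GPU run is data-parallel with the same per-device batch size, so each epoch takes half as many optimizer steps. The function names are made up for illustration.

```python
def per_step_slowdown(single_gpu_hours, multi_gpu_hours, n_gpus):
    """Ratio of wall-clock time per optimizer step (multi-GPU / single-GPU).

    Assumes data parallelism with a fixed per-device batch size, so an epoch
    on n_gpus devices needs 1/n_gpus as many optimizer steps.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return (multi_gpu_hours / single_gpu_hours) * n_gpus


def parallel_efficiency(single_gpu_hours, multi_gpu_hours, n_gpus):
    """Achieved speedup divided by the ideal speedup (n_gpus)."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    return (single_gpu_hours / multi_gpu_hours) / n_gpus


# Timings reported in the question: ~5 hrs on 1 GPU, ~6.5 hrs on 2 GPUs.
slowdown = per_step_slowdown(5.0, 6.5, 2)
efficiency = parallel_efficiency(5.0, 6.5, 2)
print(f"each step is about {slowdown:.1f}x slower on 2 GPUs")
print(f"parallel efficiency: {efficiency:.0%}")
```

Under that assumption, each step on 2 GPUs takes roughly 2.6 times as long as a single-GPU step. Re-measuring the same two numbers with gradient checkpointing turned off would show how much of that gap it accounts for.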

Related

how to make YOLOv7 use only part of the GPU?

I am trying to train the YOLOv7 network, but I need to limit GPU memory usage to 8 GB.
From experiments, including some on Google Colab, I understood that before starting the training the network tries to occupy most of the available memory, regardless of how much there is or how much is actually needed. On Google Colab it takes 11 GB out of 14 to start training, while on my GPU it takes 45 GB out of 50, although the actual training phase then requires only 6 GB.
I tried reducing the parameters (batch size, workers), but nothing changes because, as mentioned, the problem is the pre-training allocation, which is fixed.
I tried using the PyTorch function
torch.cuda.set_per_process_memory_fraction(0.16, CUDA_VISIBLE_DEVICES)
but this function does not make the network use only 8 GB; it just raises an error if 8 GB is exceeded.
YOLOX has the "-o" parameter which, if omitted, avoids the pre-training memory allocation, so only the memory needed during training is used, but I have not found an equivalent parameter in YOLOv7.
Is it possible to make YOLOv7 see only 8GB available and therefore allocate a smaller amount of GB?
Or is it possible that pre-training allocation is avoided like in YOLOX?

How to set different GPUs in different tasks in the same script?

I am running a deep learning script but I am not an expert. My task is to run multiple GPUs for data training. However, I have trouble specifying GPUs. Here are the steps of my confusion.
I set multiple GPUs by
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"]="2"
os.environ["CUDA_VISIBLE_DEVICES"]= "5,6"
print("# of GPU available:", len(tf.config.list_physical_devices('GPU')))
# of GPU available: 2
When I start the model creation, I receive this error, which I did not receive when using only ONE GPU.
tf.random.set_seed(42)
model_unet = binary_unet(256,256,6)
ResourceExhaustedError: OOM when allocating tensor with shape[3,3,64,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:TruncatedNormal]
I thought I could select one GPU first (for step 2) by setting CUDA_VISIBLE_DEVICES to the ONE GPU wanted, and then specify multiple GPUs (after step 2) by setting CUDA_VISIBLE_DEVICES to multiple GPUs. But then TensorFlow couldn't recognize multiple GPUs. For example:
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"]="2"
os.environ["CUDA_VISIBLE_DEVICES"]= "5,6"
print("# of GPU available:", len(tf.config.list_physical_devices('GPU')))
# of GPU available: 1
Note that the number of GPUs available becomes 1. This sticks around unless I restart the kernel and clear the output. In fact, I need to restart the kernel and clear the output between steps 1 and 2 as well, to make sure only 1 GPU is available and step 2 doesn't fail. But I can't just restart and clear everything, because I am going to use previous outputs to run epochs.
I believe some potential solutions are: 1) make step 2 (creating a U-Net model) run with multiple GPUs; 2) somehow clear the logs in TensorFlow without having to restart the kernel, so that I can create the model with 1 GPU but train/run epochs with multiple GPUs. But I have no idea how to do this. Could someone help?
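One possible workaround, sketched under the assumption that each task can run in its own process: CUDA reads CUDA_VISIBLE_DEVICES when it initializes, so changing it later in the same kernel has no effect, whereas a fresh child process picks up whatever value it is given. The helper name run_with_gpus and the placeholder child code are hypothetical. A real task would build or train the model and pass results back through files.

```python
import os
import subprocess
import sys


def run_with_gpus(gpu_ids, code):
    """Run `code` in a fresh Python process that only sees `gpu_ids`."""
    env = dict(os.environ)
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Placeholder child task: it only reports which devices it was given.
REPORT = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"

single = run_with_gpus([5], REPORT)    # e.g. the model-creation step
multi = run_with_gpus([5, 6], REPORT)  # e.g. the training step
print(single, multi)
```

The parent kernel never initializes the GPUs itself, so it does not need to be restarted between the two steps.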

Best practice for allocating GPU and CPU resources in TensorFlow

I'm wondering what is the correct way to set devices for creating/training a model in order to optimize resource usage for speedy training in TensorFlow with the Keras API? I have 1 CPU and 2 GPUs at my disposal. I was initially using a tf.device context to create my model and train on GPUs only, but then I saw in the TensorFlow documentation for tf.keras.utils.multi_gpu_model, they suggest explicitly instantiating the model on the CPU:
# Instantiate the base model (or "template" model).
# We recommend doing this with under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
# Replicates the model on 8 GPUs.
# This assumes that your machine has 8 available GPUs.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
I did this, and now when I train I see my CPU usage go way up with all 8 cores at about 70% usage each, and my GPU memory is maxed out. Would things go faster if the model were created on one of the GPUs? Even if I have just 1 GPU, is it still better to create model on CPU and use tf.device context to train the model on the GPU?

Many TensorFlow operations are accelerated using the GPU for computation. Without any annotations, TensorFlow automatically decides whether to use the GPU or CPU for an operation—copying the tensor between CPU and GPU memory, if necessary. Tensors produced by an operation are typically backed by the memory of the device on which the operation executed.
TensorFlow will only allocate memory and place operations on visible physical devices, as otherwise no LogicalDevice will be created on them. By default all discovered devices are marked as visible.
GPU utilization also depends on batch_size, so it may change as you vary batch_size.
You can also compare your current results (time taken and utilization) with a model built using Example 3 from the multi_gpu_model documentation.
The multi_gpu_model documentation also states:
Warning: THIS FUNCTION IS DEPRECATED. It will be removed after 2020-04-01. Instructions for updating: Use tf.distribute.MirroredStrategy instead.
Switching to tf.distribute.MirroredStrategy should improve both performance and GPU utilization. This strategy is typically used for training on one machine with multiple GPUs. The tf.distribute.Strategy API provides an abstraction for distributing your training across multiple processing units. The goal is to allow users to enable distributed training using existing models and training code, with minimal changes.
For example, a variable created under a MirroredStrategy is a MirroredVariable. If no devices are specified in the constructor argument of the strategy then it will use all the available GPUs. If no GPUs are found, it will use the available CPUs. Note that TensorFlow treats all CPUs on a machine as a single device, and uses threads internally for parallelism.
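A minimal sketch of the MirroredStrategy usage described above, not the answerer's code. The layer sizes and the function names build_mirrored_model and global_batch_size are made up for illustration. TensorFlow is imported inside the function so that the batch-size helper can be used without it.

```python
def global_batch_size(per_replica_batch_size, num_replicas):
    """Batch size for the input pipeline when each replica should
    receive per_replica_batch_size examples per step."""
    return per_replica_batch_size * num_replicas


def build_mirrored_model(input_dim=32, num_classes=10):
    """Create and compile a small Keras model under MirroredStrategy."""
    # Imported lazily; requires TensorFlow 2.x to be installed.
    import tensorflow as tf

    # With no arguments, the strategy uses all visible GPUs
    # (or the CPU if no GPU is found).
    strategy = tf.distribute.MirroredStrategy()

    # Variables created inside the scope are mirrored across replicas.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(input_dim,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(num_classes, activation="softmax"),
        ])
        model.compile(loss="sparse_categorical_crossentropy",
                      optimizer="rmsprop")
    return strategy, model
```

With Keras model.fit, the batch size given to the input pipeline is the global one, which the strategy splits across replicas. For example, after strategy, model = build_mirrored_model(), calling global_batch_size(64, strategy.num_replicas_in_sync) gives 128 on two GPUs.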
I recommend going through the "Custom training with tf.distribute.Strategy" tutorial, which demonstrates how to use tf.distribute.Strategy with custom training loops by training a simple CNN model on the Fashion MNIST dataset.
Hope this answers your question. Happy Learning.

TensorFlow / Keras multi_gpu_model does not split training across more than one GPU

I have run into a problem: I cannot successfully split my training batches across more than one GPU. If multi_gpu_model from tensorflow.keras.utils is used, TensorFlow allocates the full memory on all available GPUs (for example 2), but only the first one (gpu[0]) is utilized to 100% when watching nvidia-smi.
I'm using tensorflow 1.12 right now.
Test on single device
model = getSimpleCNN(... some parameters)
model.compile()
model.fit()
As expected, data is loaded by the CPU and the model runs on gpu[0] with 97% - 100% GPU utilization.
Create a multi_gpu model
As described in the TensorFlow API documentation for multi_gpu_model, the device scope for model definition is not changed.
from tensorflow.keras.utils import multi_gpu_model
model = getSimpleCNN(... some parameters)
parallel_model = multi_gpu_model(model, gpus=2, cpu_merge=False) # weights merge on GPU (recommended for NV-link)
parallel_model.compile()
parallel_model.fit()
As seen in the timeline, the CPU now not only loads the data but also performs some other calculations. Note that the second GPU is doing almost nothing.
The question
The effect gets even worse as soon as four GPUs are used. Utilization of the first one goes up to 100%, but the rest show only short peaks.
Is there any solution to fix this? How to properly train on multiple gpus?
Is there any difference between tensorflow.keras.utils and keras.utils which causes the unexpected behavior?

I just ran into the same issue.
In my case, the problem came from the use of a build_model(... parameters) function that returned the model.
Be careful with your getSimpleCNN() function. Since I don't know what is in it, my best advice is to build the model sequentially in your code without using this function.

Caffe's GPU Utilization Is Not Full Enough When Doing Forward Inference, Any Idea?

I wrote both Python and C++ versions of Caffe forward classification scripts to test Caffe's inference performance. The model is already trained. The results are quite similar: GPU utilization is not full.
My settings:
1. Card: Titan XP, 12GB
2. Model: InceptionV3
3. Img size: 3*299*299
When batch_size is set to 40, GPU memory usage can reach 10 GB, but GPU utilization only reaches 77%~79%, for both Python and C++. The resulting performance is about 258 frames/s.
In my scripts, I load the image, preprocess it, load it into the input layer, and then repeat the net_.forward() operation. As I understand it, this shouldn't cause any memory copy operations, so ideally it should push GPU utilization to the maximum. But I cannot get above 80%.
In the C++ Classification Tutorial, I found below phrase:
Use multiple classification threads to ensure the GPU is always fully utilized and not waiting for an I/O blocked CPU thread.
So I tried the multi-threaded build of OpenBLAS. Under the CPU backend, more CPU cores are indeed involved in the forward pass, but it makes no difference for the GPU backend, where CPU utilization stays fixed at about 100%.
Then I even tried reducing batch_size to 20 and starting two classification processes in two terminals. As a result, GPU memory usage increases to 11 GB, but GPU utilization decreases to 64%~66%, and performance drops to around 200 frames/s.
Has anyone encountered this problem? I'm really confused.
Any opinion is welcome.
Thanks,

From what I have observed, GPU utilization decreases with:
1) fewer PCI Express lanes: ResNet-152 (x16) 90% > ResNet-152 (x8) 80% > ResNet-152 (x4) 70%
2) larger models: VGG (x16) 100%; ResNet-50 (x16) 95~100%; ResNet-152 (x16) 90%
In addition, if I turn off cuDNN, GPU utilization is always 100%.
So I think the problem is related to cuDNN, but I don't know more than that.
NVCaffe is somewhat better, and MXNet can utilize GPU 100% (resnet-152; x4).
