I'm using NVIDIA Tensor Cores on the Volta architecture (V100 GPU). I want to measure the impact of Tensor Cores on my code (a convolutional neural network in TensorFlow/Python, for testing purposes).
How can I measure the Tensor Core speedup? Is it possible to disable Tensor Cores and run the same code with and without them?
What I've tried:
Setting TF_DISABLE_CUDNN_TENSOR_OP_MATH to 1 (from this). But I still see that Tensor Cores are used. More precisely, in the nvprof log the volta_s884cudnn_fp16 lines disappear with this option, but the volta_s884gemm_fp16 lines are still there. Side question: what do these lines mean?
Comparing with the same code on the Pascal architecture (P100), which has no Tensor Cores. I see a 30% speedup on the V100, but I can't tell which part of this 30% comes from the general GPU improvement and which part comes from Tensor Cores.
Training the same network in tf.float16 and tf.float32. Same problem: I see improvements, but I can't tell how much of them is caused by the reduced model size.
Thanks in advance for any help/advice on this.
I chose a hack to estimate the performance gain of Tensor Cores:
I ran the code in float32 on both the Pascal and Volta architectures (to estimate the performance gain of the architecture alone).
I ran the code in float16 on both as well. Assuming the architecture's performance gain is the same in float32 and float16, I can estimate that the remaining part of the float16 performance gain is attributable to Tensor Cores.
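The arithmetic behind this estimate can be sketched as below. The four timings are made-up placeholders, not measurements, and the function name is only for illustration. The estimate rests on the assumption stated above, that the architecture gain is the same in float32 and float16.

```python
def tensor_core_factor(t_pascal_fp32, t_volta_fp32, t_pascal_fp16, t_volta_fp16):
    """Split the float16 Volta-vs-Pascal speedup into an architecture part
    and a remaining part attributed to Tensor Cores."""
    # Speedup from the architecture alone (float32 does not use Tensor Cores here).
    arch_speedup = t_pascal_fp32 / t_volta_fp32
    # Total speedup observed in float16 (architecture plus Tensor Cores).
    fp16_speedup = t_pascal_fp16 / t_volta_fp16
    # Whatever the architecture does not explain is credited to Tensor Cores.
    return arch_speedup, fp16_speedup, fp16_speedup / arch_speedup


# Placeholder step times in milliseconds (hypothetical, not measured).
arch, total, tc = tensor_core_factor(
    t_pascal_fp32=100.0, t_volta_fp32=80.0,
    t_pascal_fp16=90.0, t_volta_fp16=45.0,
)
print(f"architecture: {arch:.2f}x, float16 total: {total:.2f}x, Tensor Cores: {tc:.2f}x")
```

The result is only as good as the assumption: if the two architectures differ in float16 throughput for reasons other than Tensor Cores, that difference ends up in the last factor too.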
Related
I am trying to train the YOLOv7 network, but I need to limit GPU memory usage to 8 GB.
From my experiments, including some on Google Colab, I understood that before training starts the network tries to occupy most of the available memory, regardless of how much there is or how much training actually needs. On Google Colab it takes 11 GB out of 14 to start the training, while on my GPU it takes 45 GB out of 50, although the actual training phase then needs only 6 GB.
I tried to reduce the parameters (batch size, workers), but nothing changes because, as mentioned, the problem is the pre-training allocation, which is fixed.
I tried using this PyTorch function:
torch.cuda.set_per_process_memory_fraction(0.16, CUDA_VISIBLE_DEVICES)
but this function does not make the network use only 8 GB. Instead, it raises an error if 8 GB is exceeded.
YOLOX has the "-o" parameter which, if omitted, avoids the pre-training memory allocation, so that only the memory needed during training is used. I have not found an equivalent of this parameter in YOLOv7.
Is it possible to make YOLOv7 see only 8GB available and therefore allocate a smaller amount of GB?
Or is it possible to avoid the pre-training allocation, as in YOLOX?
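For reference, the 0.16 in the call above is just the 8 GB limit divided by the 50 GB total. A minimal sketch of that calculation follows. The helper name `memory_fraction` is made up for illustration, and device index 0 is an assumption: the second argument of `torch.cuda.set_per_process_memory_fraction` is a device (an index or a `torch.device`), not the name of the CUDA_VISIBLE_DEVICES environment variable. The torch call is guarded so the snippet also runs on a machine without PyTorch or without a GPU.

```python
def memory_fraction(limit_gb, total_gb):
    """Fraction of total GPU memory that corresponds to the desired limit."""
    if not 0 < limit_gb <= total_gb:
        raise ValueError("limit must be positive and no larger than total memory")
    return limit_gb / total_gb


fraction = memory_fraction(8, 50)  # 8 GB out of 50 GB
print(fraction)

try:
    import torch
except ImportError:  # PyTorch not installed: only the arithmetic above runs
    torch = None

if torch is not None and torch.cuda.is_available():
    # Caps what this process may allocate on device 0 (assumed index).
    # As observed above, exceeding the cap raises an out-of-memory error;
    # it does not make the program allocate less on its own.
    torch.cuda.set_per_process_memory_fraction(fraction, device=0)
```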
If I have a single GPU with 8GB RAM and I have a TensorFlow model (excluding training/validation data) that is 10GB, can TensorFlow train the model?
If yes, how does TensorFlow do this?
Notes:
I'm not looking for distributed GPU training. I want to know about single GPU case.
I'm not concerned about the training/validation data sizes.
No, you cannot train a model larger than your GPU's memory. (There may be some ways with dropout that I am not aware of, but in general it is not advised.) Furthermore, you would need more memory than the parameters alone take up, because the GPU has to hold the parameters along with their derivatives at each step in order to do backprop.
Not to mention the smaller batch size this would require, as there is less space left for the dataset.
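To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes one gradient value per parameter, stored at the same precision, plus an optional number of extra per-parameter optimizer buffers (0 for plain SGD, 1 for SGD with momentum, 2 for Adam). It ignores activations and the batch itself, so it is a lower bound. The function name is made up for illustration.

```python
def training_memory_gb(param_gb, optimizer_slots=0):
    """Lower bound on memory for one training step, excluding activations.

    param_gb: size of the parameters alone.
    optimizer_slots: extra per-parameter buffers kept by the optimizer
        (assumption: 0 for plain SGD, 1 for momentum, 2 for Adam).
    """
    gradients_gb = param_gb  # one derivative per parameter
    optimizer_gb = optimizer_slots * param_gb
    return param_gb + gradients_gb + optimizer_gb


print(training_memory_gb(10))                     # plain SGD
print(training_memory_gb(10, optimizer_slots=2))  # Adam
```

For the 10 GB model in the question, this gives at least 20 GB with plain SGD and at least 40 GB with Adam, both well above 8 GB before any activations are stored.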
I use multiple GPUs to train a model with PyTorch. One GPU uses more memory than the others, causing an out-of-memory error. Why would one GPU use more memory? Is it possible to make the usage more balanced? Are there other ways to reduce memory usage (for example, deleting variables that will not be used anymore)? The batch size is already 1. Thanks.
DataParallel splits the batch and sends each split to a different GPU. Each GPU has a copy of the model, and the forward pass is computed independently on each of them. The outputs of all GPUs are then collected back onto one GPU, and the loss is computed there instead of independently on each GPU. That is why one GPU uses more memory than the others.
If you want to mitigate this issue you can include the loss computation in the DataParallel module.
If doing this is still an issue, then you might want model parallelism instead of data parallelism: move different parts of your model to different GPUs using .cuda(gpu_id). This is useful when the weights of your model are pretty large.
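Including the loss computation in the DataParallel module, as suggested above, could look roughly like the sketch below. The wrapper class `ModelWithLoss`, the small `nn.Linear` stand-in model, and the tensor shapes are all hypothetical. The snippet falls back to plain CPU execution when fewer than two GPUs are visible, so it can be tried anywhere.

```python
import torch
import torch.nn as nn


class ModelWithLoss(nn.Module):
    """Wraps a model and its criterion so each replica returns only its loss."""

    def __init__(self, model, criterion):
        super().__init__()
        self.model = model
        self.criterion = criterion

    def forward(self, inputs, targets):
        outputs = self.model(inputs)
        # Return a 1-element tensor so DataParallel can gather one loss
        # value per replica instead of the full outputs.
        return self.criterion(outputs, targets).unsqueeze(0)


model = nn.Linear(16, 4)  # hypothetical stand-in for the real network
wrapped = ModelWithLoss(model, nn.CrossEntropyLoss())
if torch.cuda.device_count() > 1:
    wrapped = nn.DataParallel(wrapped.cuda())

device = next(wrapped.parameters()).device
inputs = torch.randn(8, 16, device=device)
targets = torch.randint(0, 4, (8,), device=device)

# Only small per-replica losses are gathered; average them and backpropagate.
loss = wrapped(inputs, targets).mean()
loss.backward()
print(loss.item())
```

Averaging the per-replica losses matches the full-batch mean only when the splits have equal size, so treat the last step as an approximation otherwise.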
I wrote both Python and C++ versions of a Caffe forward classification script to test Caffe's inference performance. The model is already trained. The results are quite similar in both versions: GPU utilization does not reach its maximum.
My settings:
1. Card: Titan XP, 12GB
2. Model: InceptionV3
3. Img size: 3*299*299
When batch_size is set to 40, GPU memory usage reaches 10 GB, but GPU utilization only reaches 77%~79%, for both Python and C++. The resulting performance is about 258 frames/s.
In my scripts, I load the image, preprocess it, load it into the input layer, and then repeat the net_.forward() operation. As I understand it, this does not cause any memory copy operations, so ideally it should push GPU utilization to the maximum. But I cannot get above 80%.
In the C++ Classification Tutorial, I found the following sentence:
Use multiple classification threads to ensure the GPU is always fully utilized and not waiting for an I/O blocked CPU thread.
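What that sentence describes is overlapping CPU-side work with GPU-side work. The following is a minimal, framework-free sketch of the pattern, not Caffe code: `preprocess` and `forward` are hypothetical stand-ins for image preprocessing and for net.forward(), and the queue sizes are arbitrary. Worker threads prepare inputs while the main loop keeps consuming ready batches.

```python
import queue
import threading


def preprocess(item):
    # Hypothetical stand-in for CPU-side image loading and resizing.
    return item * 2


def forward(batch):
    # Hypothetical stand-in for the GPU forward pass on one batch.
    return sum(batch)


def run_pipeline(items, batch_size, num_workers=2):
    raw = queue.Queue()
    ready = queue.Queue(maxsize=4 * batch_size)  # bounded buffer of prepared inputs
    for item in items:
        raw.put(item)

    def worker():
        while True:
            try:
                item = raw.get_nowait()
            except queue.Empty:
                ready.put(None)  # sentinel: this worker is finished
                return
            ready.put(preprocess(item))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    results, batch, finished = [], [], 0
    while finished < num_workers:
        item = ready.get()
        if item is None:
            finished += 1
            continue
        batch.append(item)
        if len(batch) == batch_size:
            results.append(forward(batch))
            batch = []
    if batch:  # last, possibly smaller batch
        results.append(forward(batch))
    for t in threads:
        t.join()
    return results


outputs = run_pipeline(list(range(10)), batch_size=4)
print(len(outputs), sum(outputs))
```

Note that this is about threads feeding the network, which is a different thing from the number of threads a BLAS library uses internally.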
So I tried using OpenBLAS compiled with multi-threading. With the CPU backend, more CPU cores are indeed involved in the forward pass, but it makes no difference for the GPU backend. With the GPU backend, CPU utilization stays at about 100%.
Then I even tried reducing batch_size to 20 and starting two classification processes in two terminals. As a result, GPU memory usage increases to 11 GB, but GPU utilization decreases to 64%~66%, and performance drops to around 200 frames/s.
Has anyone encountered this problem? I'm really confused.
Any opinion is welcome.
Thanks,
From what I have observed, GPU utilization decreases with:
1) a narrower PCI Express link: ResNet-152 (x16) 90% > ResNet-152 (x8) 80% > ResNet-152 (x4) 70%
2) a larger model: VGG (x16) 100%; ResNet-50 (x16) 95~100%; ResNet-152 (x16) 90%
In addition, if I turn off cuDNN, GPU utilization is always 100%.
So I think the problem is somehow related to cuDNN, but I don't know anything more about it.
NVCaffe is somewhat better, and MXNet can utilize GPU 100% (resnet-152; x4).
I need to train a very large number of neural nets using TensorFlow with Python. My neural nets (MLPs) range from very small ones (~2 hidden layers with ~30 neurons each) to large ones (3-4 layers with >500 neurons each).
I am able to run all of them sequentially on my GPU, which is fine. But my CPU is almost idle. Additionally, I found out that my CPU is quicker than the GPU for my very small nets (I assume because of the GPU overhead etc.). That's why I want to use both my CPU and my GPU in parallel to train my nets. The CPU should process the networks from the smallest to the largest, and the GPU should process them from the largest to the smallest, until they meet somewhere in the middle... I thought this was a good idea :-)
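The scheduling idea itself can be sketched independently of TensorFlow. In the sketch below, `train` is a hypothetical placeholder for building and training one net on the given device, the list of sizes is made up, and threads are used only to keep the example short (the setup described here uses separate processes, which would need a shared queue instead of an in-process deque).

```python
import threading
from collections import deque


def run_two_ended(jobs_sorted_by_size, train_fn):
    """CPU worker takes the smallest remaining job, GPU worker the largest."""
    pending = deque(jobs_sorted_by_size)
    done = {"cpu": [], "gpu": []}

    def worker(device, take):
        while True:
            try:
                job = take()  # deque.popleft / deque.pop are atomic
            except IndexError:  # nothing left: the two ends have met
                return
            train_fn(job, device)
            done[device].append(job)

    threads = [
        threading.Thread(target=worker, args=("cpu", pending.popleft)),
        threading.Thread(target=worker, args=("gpu", pending.pop)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


def train(job, device):
    # Hypothetical placeholder: build and train one net of this size on `device`.
    pass


sizes = [30, 60, 120, 500, 800]  # made-up hidden-layer widths, sorted ascending
done = run_two_ended(sizes, train)
print(done)
```

Each job is taken exactly once, and no fixed split point has to be chosen in advance: whichever device is faster simply ends up doing more of the list.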
So I simply start my consumer twice, in different processes: one with device = CPU, the other with device = GPU. Both start and consume the first 2 nets as expected. But then the GPU consumer throws an exception saying that its tensor is being accessed/violated by another process on the CPU(!), which I find weird, because it is supposed to run on the GPU...
Can anybody help me fully segregate my two processes?
Do any of your networks share operators?
E.g. they use variables with the same name in the same variable_scope, which is set to variable_scope(reuse=True).
In that case, multiple nets will try to reuse the same underlying Tensor structures.
Also check whether tf.ConfigProto.allow_soft_placement is set to True or False in your tf.Session. If it is True, there is no guarantee that the device placement will actually be executed the way you intended in your code.