PyTorch list slicing on GPU slower than on CPU

PyTorch list slicing on GPU slower than on CPU - python

I would like to optimize ML code (SSD in PyTorch) on NVIDIA Jetson Xavier NX (development kit). One of the bottlenecks seems to be list slicing on PyTorch (1.6.0) tensors on GPU device.
The same problem occured on NVIDIA GeForce GTX 1050 Ti (GP107), CPU was ~2 times faster.
Let me create the variables first
import torch
from time import time
cuda0 = torch.device('cuda:0')
probs = torch.ones([3000], dtype=torch.float64, device=cuda0)
mask = torch.ones([3000], dtype=torch.bool, device=cuda0)
probs_cpu = probs.cpu()
mask_cpu = mask.cpu()
Then run the logic (Approximately same results occurred every run)
before = time()
probs[mask]
print(f'GPU {time() - before:.5f}') # output: GPU 0.00263
before = time()
probs_cpu[mask_cpu]
print(f'CPU {time() - before:.5f}') # output: CPU 0.00066
Why is the list slicing ~4 times slower on GPU compared to CPU using PyTorch library vesrion 1.6.0 on NVIDIA Jetson Xavier NX Developer kit according to the code above? How to speed it up?
Code details: see line 51 in predictor.py which is part of SSD Implementation in PyTorch
Run it on CPU?: Whole algorithm will not be faster if I run it on the CPU since the downloading from GPU takes too long (~0.00805 s).

Related

my GPU Memory Usage become almost full whenever I run the tensorflow code [duplicate]

I work in an environment in which computational resources are shared, i.e., we have a few server machines equipped with a few Nvidia Titan X GPUs each.
For small to moderate size models, the 12 GB of the Titan X is usually enough for 2–3 people to run training concurrently on the same GPU. If the models are small enough that a single model does not take full advantage of all the computational units of the GPU, this can actually result in a speedup compared with running one training process after the other. Even in cases where the concurrent access to the GPU does slow down the individual training time, it is still nice to have the flexibility of having multiple users simultaneously train on the GPU.
The problem with TensorFlow is that, by default, it allocates the full amount of available GPU memory when it is launched. Even for a small two-layer neural network, I see that all 12 GB of the GPU memory is used up.
Is there a way to make TensorFlow only allocate, say, 4 GB of GPU memory, if one knows that this is enough for a given model?

You can set the fraction of GPU memory to be allocated when you construct a tf.Session by passing a tf.GPUOptions as part of the optional config argument:
# Assume that you have 12GB of GPU memory and want to allocate ~4GB:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
The per_process_gpu_memory_fraction acts as a hard upper bound on the amount of GPU memory that will be used by the process on each GPU on the same machine. Currently, this fraction is applied uniformly to all of the GPUs on the same machine; there is no way to set this on a per-GPU basis.

config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
https://github.com/tensorflow/tensorflow/issues/1578

For TensorFlow 2.0 and 2.1 (docs):
import tensorflow as tf
tf.config.gpu.set_per_process_memory_growth(True)
For TensorFlow 2.2+ (docs):
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
tf.config.experimental.set_memory_growth(gpu, True)
The docs also list some more methods:
Set environment variable TF_FORCE_GPU_ALLOW_GROWTH to true.
Use tf.config.experimental.set_virtual_device_configuration to set a hard limit on a Virtual GPU device.

Here is an excerpt from the Book Deep Learning with TensorFlow
In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as it is needed by the process. TensorFlow provides two configuration options on the session to control this. The first is the allow_growth option, which attempts to allocate only as much GPU memory based on runtime allocations, it starts out allocating very little memory, and as sessions get run and more GPU memory is needed, we extend the GPU memory region needed by the TensorFlow process.
1) Allow growth: (more flexible)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
The second method is per_process_gpu_memory_fraction option, which determines the fraction of the overall amount of memory that each visible GPU should be allocated. Note: No release of memory needed, it can even worsen memory fragmentation when done.
2) Allocate fixed memory:
To only allocate 40% of the total memory of each GPU by:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config, ...)
Note:
That's only useful though if you truly want to bind the amount of GPU memory available on the TensorFlow process.

For Tensorflow version 2.0 and 2.1 use the following snippet:
import tensorflow as tf
gpu_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(gpu_devices[0], True)
For prior versions , following snippet used to work for me:
import tensorflow as tf
tf_config=tf.ConfigProto()
tf_config.gpu_options.allow_growth=True
sess = tf.Session(config=tf_config)

All the answers above assume execution with a sess.run() call, which is becoming the exception rather than the rule in recent versions of TensorFlow.
When using the tf.Estimator framework (TensorFlow 1.4 and above) the way to pass the fraction along to the implicitly created MonitoredTrainingSession is,
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
trainingConfig = tf.estimator.RunConfig(session_config=conf, ...)
tf.estimator.Estimator(model_fn=...,
config=trainingConfig)
Similarly in Eager mode (TensorFlow 1.5 and above),
opts = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
conf = tf.ConfigProto(gpu_options=opts)
tfe.enable_eager_execution(config=conf)
Edit: 11-04-2018
As an example, if you are to use tf.contrib.gan.train, then you can use something similar to bellow:
tf.contrib.gan.gan_train(........, config=conf)

You can use
TF_FORCE_GPU_ALLOW_GROWTH=true
in your environment variables.
In tensorflow code:
bool GPUBFCAllocator::GetAllowGrowthValue(const GPUOptions& gpu_options) {
const char* force_allow_growth_string =
std::getenv("TF_FORCE_GPU_ALLOW_GROWTH");
if (force_allow_growth_string == nullptr) {
return gpu_options.allow_growth();
}

Tensorflow 2.0 Beta and (probably) beyond
The API changed again. It can be now found in:
tf.config.experimental.set_memory_growth(
device,
enable
)
Aliases:
tf.compat.v1.config.experimental.set_memory_growth
tf.compat.v2.config.experimental.set_memory_growth
References:
https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/config/experimental/set_memory_growth
https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
See also:
Tensorflow - Use a GPU: https://www.tensorflow.org/guide/gpu
for Tensorflow 2.0 Alpha see: this answer

All the answers above refer to either setting the memory to a certain extent in TensorFlow 1.X versions or to allow memory growth in TensorFlow 2.X.
The method tf.config.experimental.set_memory_growth indeed works for allowing dynamic growth during the allocation/preprocessing. Nevertheless one may like to allocate from the start a specific-upper limit GPU memory.
The logic behind allocating a specific GPU memory would also be to prevent OOM memory during training sessions. For example, if one trains while opening video-memory consuming Chrome tabs/any other video consumption process, the tf.config.experimental.set_memory_growth(gpu, True) could result in OOM errors thrown, hence the necessity of allocating from the start more memory in certain cases.
The recommended and correct way in which to allot memory per GPU in TensorFlow 2.X is done in the following manner:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
# Restrict TensorFlow to only allocate 1GB of memory on the first GPU
try:
tf.config.experimental.set_virtual_device_configuration(
gpus[0],
[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)]

Shameless plug: If you install the GPU supported Tensorflow, the session will first allocate all GPUs whether you set it to use only CPU or GPU. I may add my tip that even you set the graph to use CPU only you should set the same configuration(as answered above:) ) to prevent the unwanted GPU occupation.
And in an interactive interface like IPython and Jupyter, you should also set that configure, otherwise, it will allocate all memory and leave almost none for others. This is sometimes hard to notice.

If you're using Tensorflow 2 try the following:
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)

For Tensorflow 2.0 this this solution worked for me. (TF-GPU 2.0, Windows 10, GeForce RTX 2070)
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)

# allocate 60% of GPU memory
from keras.backend.tensorflow_backend import set_session
import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.6
set_session(tf.Session(config=config))

this code has worked for me:
import tensorflow as tf
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.InteractiveSession(config=config)

Well I am new to tensorflow, I have Geforce 740m or something GPU with 2GB ram, I was running mnist handwritten kind of example for a native language with training data containing of 38700 images and 4300 testing images and was trying to get precision , recall , F1 using following code as sklearn was not giving me precise reults. once i added this to my existing code i started getting GPU errors.
TP = tf.count_nonzero(predicted * actual)
TN = tf.count_nonzero((predicted - 1) * (actual - 1))
FP = tf.count_nonzero(predicted * (actual - 1))
FN = tf.count_nonzero((predicted - 1) * actual)
prec = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * prec * recall / (prec + recall)
plus my model was heavy i guess, i was getting memory error after 147, 148 epochs, and then I thought why not create functions for the tasks so I dont know if it works this way in tensrorflow, but I thought if a local variable is used and when out of scope it may release memory and i defined the above elements for training and testing in modules, I was able to achieve 10000 epochs without any issues, I hope this will help..

i tried to train unet on voc data set but because of huge image size, memory finishes. i tried all the above tips, even tried with batch size==1, yet to no improvement. sometimes TensorFlow version also causes the memory issues. try by using
pip install tensorflow-gpu==1.8.0

Using GPU to run python script in anaconda prompt

I am trying to run the following python script from my anaconda prompt:
python object_tracker.py --video test.mp4 --model yolov4 --dont_show
This comes directly from the AI Guys yolov4-deepsort repository (https://github.com/theAIGuysCode/yolov4-deepsort). It is code for an object tracker so its very computation heavy and running it on longer videos take hours. A computer with and Nvidia Graphics card has become available to me and I want to use GPU to speed up the process but I'm not sure how.

Python OpenCV uses NumPy for computation and NumPy runs on CPU. You can convert NumPy arrays to Pytorch tensors and can run your code on GPU. A simple idea is
N = 8000
np.random.seed(42)
nA = np.random.rand(N,N).astype(np.float32)
nB = np.random.rand(N,N).astype(np.float32)
nC = nA.dot(nB) # numpy dot product runs on CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'PyTorch set to {device}') # cuda when available
tA = torch.from_numpy(nA).float().to(device)
tB = torch.from_numpy(nB).to(device)
tC = torch.mm(tA,tB) # torch matrix multiplication aka dot product runs on GPU
PS:
Just came across this
https://medium.com/swlh/understanding-torchvision-functionalities-for-pytorch-391273299dc9
seems you can do it in a better way with torchvision.

OnnxRuntime vs OnnxRuntime+OpenVinoEP inference time difference

I'm trying to accelerate my model's performance by converting it to OnnxRuntime. However, I'm getting weird results, when trying to measure inference time.
While running only 1 iteration OnnxRuntime's CPUExecutionProvider greatly outperforms OpenVINOExecutionProvider:
CPUExecutionProvider - 0.72 seconds
OpenVINOExecutionProvider - 4.47 seconds
But if I run let's say 5 iterations the result is different:
CPUExecutionProvider - 3.83 seconds
OpenVINOExecutionProvider - 14.13 seconds
And if I run 100 iterations, the result is drastically different:
CPUExecutionProvider - 74.19 seconds
OpenVINOExecutionProvider - 46.96seconds
It seems to me, that the inference time of OpenVinoEP is not linear, but I don't understand why.
So my questions are:
Why does OpenVINOExecutionProvider behave this way?
What ExecutionProvider should I use?
The code is very basic:
import onnxruntime as rt
import numpy as np
import time
from tqdm import tqdm
limit = 5
# MODEL
device = 'CPU_FP32'
model_file_path = 'road.onnx'
image = np.random.rand(1, 3, 512, 512).astype(np.float32)
# OnnxRuntime
sess = rt.InferenceSession(model_file_path, providers=['CPUExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name
start = time.time()
for i in tqdm(range(limit)):
out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)
# OnnxRuntime + OpenVinoEP
sess = rt.InferenceSession(model_file_path, providers=['OpenVINOExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name
start = time.time()
for i in tqdm(range(limit)):
out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)

The use of ONNX Runtime with OpenVINO Execution Provider enables the inferencing of ONNX models using ONNX Runtime API while the OpenVINO toolkit runs in the backend.
This accelerates ONNX model's performance on the same hardware compared to generic acceleration on Intel® CPU, GPU, VPU and FPGA.
Generally, CPU Execution Provider works best with small iteration since its intention is to keep the binary size small. Meanwhile, the OpenVINO Execution Provider is intended for Deep Learning inference on Intel CPUs, Intel integrated GPUs, and Intel® MovidiusTM Vision Processing Units (VPUs).
This is why the OpenVINO Execution Provider outperforms the CPU Execution Provider during larger iterations.
You should choose Execution Provider that would suffice your own requirements. If you going to execute complex DL with large iteration, then go for OpenVINO Execution Provider. For a simpler use case, where you need the binary size to be smaller with smaller iterations, you can choose the CPU Execution Provider instead.
For more information, you may refer to this ONNX Runtime Performance Tuning

Regarding non-linear time, it might be the case that there is some preparation that happens when you first run the model with OpenVINO - perhaps the model is first compiled to OpenVINO when you first call sess.run. I observed a similar effect when using TFLite. For these scenarios, it makes sense to discard the time of the first iteration when benchmarking. There also tends to be quite a bit of variance so running >10 or ideally >100 iterations is a good idea.

tensorflow - same program on different computers allocate different GPU memory

ubuntu 16.04, python 2.7.12, tensorflow 1.10.1 (gpu version), cuda 9.0, cudnn 7.2
I have built and trained a CNN model, and now I am using a while loop to repeatedly let my model make predictions.
In order to limit the memory usage, I am using the following code to create my classifier:
import tensorflow as tf
session_config = tf.ConfigProto(log_device_placement=False)
session_config.gpu_options.allow_growth = True
run_config = tf.estimator.RunConfig().replace(session_config=session_config)
classifier = tf.estimator.Estimator(
model_fn = my_model_fn,
model_dir = my_trained_model_dir,
config = run_config,
params={}
)
And I call classifier.predict(my_input_fn) in a while loop to repeatedly make predictions.
Issue:
I am running my codes on two computers, both with the same software environment as I listed above.
However, the two computers have different GPUs:
Computer A: 1050 2G
Computer B: 1070 8G
My code works well on both computer.
However, when I use nvidia-smi to check the GPU memory allocation, I found that my code will allocate 1.4G GPU memory on Computer A, while it becomes 3.6G on Computer B.
So, Why would this happen?
I think session_config.gpu_options.allow_growth = True tells the program to allocate as much as it needs. Computer A has proved that 1.4G is enough, then why would the same code allocate 3.6G on Computer B?

It may be that 1.4gb is actually not enough, and some of the required memory is swapped into main memory. Video cards drivers do that.

Pytorch speed comparison - GPU slower than CPU

I was trying to find out if GPU tensor operations are actually faster than CPU ones. So, I wrote this particular code below to implement a simple 2D addition of CPU tensors and GPU cuda tensors successively to see the speed difference:
import torch
import time
###CPU
start_time = time.time()
a = torch.ones(4,4)
for _ in range(1000000):
a += a
elapsed_time = time.time() - start_time
print('CPU time = ',elapsed_time)
###GPU
start_time = time.time()
b = torch.ones(4,4).cuda()
for _ in range(1000000):
b += b
elapsed_time = time.time() - start_time
print('GPU time = ',elapsed_time)
To my surprise, the CPU time was 0.93 sec and the GPU time was as high as 63 seconds. Am I doing the cuda tensor operation properly or is the concept of cuda tensors works faster only in very highly complex operations, like in neural networks?
Note: My GPU is NVIDIA 940MX and torch.cuda.is_available() call returns True.

GPU acceleration works by heavy parallelization of computation. On a GPU you have a huge amount of cores, each of them is not very powerful, but the huge amount of cores here matters.
Frameworks like PyTorch do their to make it possible to compute as much as possible in parallel. In general matrix operations are very well suited for parallelization, but still it isn't always possible to parallelize computation!
In your example you have a loop:
b = torch.ones(4,4).cuda()
for _ in range(1000000):
b += b
You have 1000000 operations, but due to the structure of the code it impossible to parallelize much of these computations. If you think about it, to compute the next b you need to know the value of the previous (or current) b.
So you have 1000000 operations, but each of these has to be computed one after another. Possible parallelization is limited to the size of your tensor. This size though is not very large in your example:
torch.ones(4,4)
So you only can parallelize 16 operations (additions) per iteration.
As the CPU has few, but much more powerful cores, it is just much faster for the given example!
But things change if you change the size of the tensor, then PyTorch is able to parallelize much more of the overall computation. I changed the iterations to 1000 (because I did not want to wait so long :), but you can put in any value you like, the relation between CPU and GPU should stay the same.
Here are the results for different tensor sizes:
#torch.ones(4,4) - the size you used
CPU time = 0.00926661491394043
GPU time = 0.0431208610534668
#torch.ones(40,40) - CPU gets slower, but still faster than GPU
CPU time = 0.014729976654052734
GPU time = 0.04474186897277832
#torch.ones(400,400) - CPU now much slower than GPU
CPU time = 0.9702610969543457
GPU time = 0.04415607452392578
#torch.ones(4000,4000) - GPU much faster then CPU
CPU time = 38.088677167892456
GPU time = 0.044649362564086914
So as you see, where it is possible to parallelize stuff (here the addition of the tensor elements), GPU becomes very powerful. GPU time is not changing at all for the given calculations, the GPU can handle much more! (as long as it doesn't run out of memory :)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

PyTorch list slicing on GPU slower than on CPU - python

Related

my GPU Memory Usage become almost full whenever I run the tensorflow code [duplicate]

Using GPU to run python script in anaconda prompt

OnnxRuntime vs OnnxRuntime+OpenVinoEP inference time difference

tensorflow - same program on different computers allocate different GPU memory

Pytorch speed comparison - GPU slower than CPU

Categories

Resources