I was trying to find out whether GPU tensor operations are actually faster than CPU ones. So I wrote the code below, which performs a simple 2D tensor addition repeatedly, first with CPU tensors and then with GPU CUDA tensors, to see the speed difference:
import torch
import time
###CPU
start_time = time.time()
a = torch.ones(4,4)
for _ in range(1000000):
    a += a
elapsed_time = time.time() - start_time
print('CPU time = ',elapsed_time)
###GPU
start_time = time.time()
b = torch.ones(4,4).cuda()
for _ in range(1000000):
    b += b
elapsed_time = time.time() - start_time
print('GPU time = ',elapsed_time)
To my surprise, the CPU time was 0.93 sec and the GPU time was as high as 63 seconds. Am I performing the CUDA tensor operation properly, or do CUDA tensors only pay off for much more complex operations, like those in neural networks?
Note: My GPU is NVIDIA 940MX and torch.cuda.is_available() call returns True.
GPU acceleration works by heavily parallelizing computation. A GPU has a huge number of cores; each of them is not very powerful on its own, but their sheer number is what matters here.
Frameworks like PyTorch do their best to compute as much as possible in parallel. In general, matrix operations are very well suited for parallelization, but it still isn't always possible to parallelize a computation!
In your example you have a loop:
b = torch.ones(4,4).cuda()
for _ in range(1000000):
    b += b
You have 1000000 operations, but due to the structure of the code it is impossible to parallelize much of this computation. If you think about it, to compute the next b you need to know the value of the previous (or current) b.
So you have 1000000 operations, but each of them has to be computed one after another. The possible parallelization is limited to the size of your tensor, and this size is not very large in your example:
torch.ones(4,4)
So you can only parallelize 16 operations (additions) per iteration.
As the CPU has few but much more powerful cores, it is simply much faster for this example!
But things change if you increase the size of the tensor; then PyTorch is able to parallelize much more of the overall computation. I changed the iterations to 1000 (because I did not want to wait that long :), but you can put in any value you like; the relation between CPU and GPU should stay the same.
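For reference, here is a minimal sketch of one way to run such a comparison (the sizes match the results below; the torch.cuda.synchronize() calls are an addition so that the GPU timing includes the queued kernel executions):

import time
import torch

def benchmark(size, iterations=1000):
    # CPU timing
    a = torch.ones(size, size)
    start = time.time()
    for _ in range(iterations):
        a += a
    cpu_time = time.time() - start

    # GPU timing; synchronize so the measurement waits for the kernels to finish
    b = torch.ones(size, size).cuda()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iterations):
        b += b
    torch.cuda.synchronize()
    gpu_time = time.time() - start
    print(f'size {size}x{size}: CPU time = {cpu_time}, GPU time = {gpu_time}')

for size in (4, 40, 400, 4000):
    benchmark(size)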
Here are the results for different tensor sizes:
#torch.ones(4,4) - the size you used
CPU time = 0.00926661491394043
GPU time = 0.0431208610534668
#torch.ones(40,40) - CPU gets slower, but still faster than GPU
CPU time = 0.014729976654052734
GPU time = 0.04474186897277832
#torch.ones(400,400) - CPU now much slower than GPU
CPU time = 0.9702610969543457
GPU time = 0.04415607452392578
#torch.ones(4000,4000) - GPU much faster than CPU
CPU time = 38.088677167892456
GPU time = 0.044649362564086914
So as you can see, where it is possible to parallelize the work (here the addition of the tensor elements), the GPU becomes very powerful. The GPU time barely changes for the given calculations; the GPU can handle much more! (as long as it doesn't run out of memory :)
Related
I'm trying to accelerate my model by running it with ONNX Runtime. However, I'm getting weird results when trying to measure inference time.
When running only 1 iteration, ONNX Runtime's CPUExecutionProvider greatly outperforms the OpenVINOExecutionProvider:
CPUExecutionProvider - 0.72 seconds
OpenVINOExecutionProvider - 4.47 seconds
But if I run let's say 5 iterations the result is different:
CPUExecutionProvider - 3.83 seconds
OpenVINOExecutionProvider - 14.13 seconds
And if I run 100 iterations, the result is drastically different:
CPUExecutionProvider - 74.19 seconds
OpenVINOExecutionProvider - 46.96 seconds
It seems to me that the inference time of the OpenVINO EP is not linear in the number of iterations, but I don't understand why.
So my questions are:
Why does OpenVINOExecutionProvider behave this way?
What ExecutionProvider should I use?
The code is very basic:
import onnxruntime as rt
import numpy as np
import time
from tqdm import tqdm
limit = 5
# MODEL
device = 'CPU_FP32'
model_file_path = 'road.onnx'
image = np.random.rand(1, 3, 512, 512).astype(np.float32)
# OnnxRuntime
sess = rt.InferenceSession(model_file_path, providers=['CPUExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name
start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)
# OnnxRuntime + OpenVinoEP
sess = rt.InferenceSession(model_file_path, providers=['OpenVINOExecutionProvider'], provider_options=[{'device_type' : device}])
input_name = sess.get_inputs()[0].name
start = time.time()
for i in tqdm(range(limit)):
    out = sess.run(None, {input_name: image})
end = time.time()
inference_time = end - start
print(inference_time)
The use of ONNX Runtime with the OpenVINO Execution Provider enables inferencing of ONNX models through the ONNX Runtime API while the OpenVINO toolkit runs as the backend.
This accelerates ONNX model performance on Intel® CPUs, GPUs, VPUs and FPGAs compared to generic acceleration on the same hardware.
Generally, the CPU Execution Provider works best for small iteration counts, since its goal is to keep the binary size small. The OpenVINO Execution Provider, on the other hand, is intended for deep-learning inference on Intel CPUs, Intel integrated GPUs, and Intel® Movidius™ Vision Processing Units (VPUs).
This is why the OpenVINO Execution Provider outperforms the CPU Execution Provider for larger iteration counts.
You should choose the Execution Provider that fits your requirements. If you are going to run a complex DL model for many iterations, go for the OpenVINO Execution Provider. For a simpler use case, where you need a smaller binary and run fewer iterations, you can choose the CPU Execution Provider instead.
For more information, you may refer to the ONNX Runtime Performance Tuning documentation.
Regarding non-linear time, it might be the case that there is some preparation that happens when you first run the model with OpenVINO - perhaps the model is first compiled to OpenVINO when you first call sess.run. I observed a similar effect when using TFLite. For these scenarios, it makes sense to discard the time of the first iteration when benchmarking. There also tends to be quite a bit of variance so running >10 or ideally >100 iterations is a good idea.
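A minimal sketch of benchmarking that way, reusing the model file and input from the question (one untimed warm-up run, then the average over many timed runs):

import time
import numpy as np
import onnxruntime as rt

model_file_path = 'road.onnx'
image = np.random.rand(1, 3, 512, 512).astype(np.float32)

sess = rt.InferenceSession(model_file_path, providers=['OpenVINOExecutionProvider'],
                           provider_options=[{'device_type': 'CPU_FP32'}])
input_name = sess.get_inputs()[0].name

# Warm-up run: not timed, gives the provider a chance to compile/prepare the model
sess.run(None, {input_name: image})

runs = 100
start = time.time()
for _ in range(runs):
    sess.run(None, {input_name: image})
print('average inference time:', (time.time() - start) / runs)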
Yesterday I tried using the GPU for computation and I was very disappointed. My assumption was that a GPU (with thousands of cores) would be at least 1000x faster than a CPU.
I first wanted to use the GPU to compute cosine similarity, and here are the results.
First off, the GPU is not able to compute the full pairwise similarity (matrix to matrix); it reports out of memory even though the GPU has 24 GB of RAM (the setup was 2x RTX 3090, 88.8 TFLOPS, 24 GB RAM per card, 64 GB system RAM).
So I had to use a vector-to-matrix computation, which you then need to iterate: if your original matrix has 400k rows, you have to run the vector-to-matrix cosine similarity computation 400k times.
By the way, I used TensorFlow for this, like so:
import tensorflow as tf
import numpy as np
Xn = np.random.uniform(0,10, (400000,100)).astype('float32')
X = tf.constant(Xn)
count = 1000 * 100
for i in range(count):
    tf.keras.losses.cosine_similarity(
        X[0],
        X,
        axis=-1
    )
As you can see, the matrix has shape (400000, 100), meaning 400k rows and 100 columns.
Then I ran the iterations above; that is 100k of them. These 100k iterations took 176 seconds, which is indeed about 20x faster than the CPU on my notebook, but to be honest you can achieve the same results with the Python Numba library, which translates Python code to native machine code (although 400k iterations then take 30 minutes with Numba; I don't know how long they would take on the GPU).
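For scale, here is a sketch (with an assumed block size to be tuned to the available GPU memory) of the usual middle ground between the full pairwise matrix and the one-row-at-a-time loop above: computing the similarities for a block of rows per call, so far fewer kernel launches are needed.

import numpy as np
import tensorflow as tf

Xn = np.random.uniform(0, 10, (400000, 100)).astype('float32')
X_norm = tf.math.l2_normalize(tf.constant(Xn), axis=1)

block = 1000  # assumed block size; tune to the available GPU memory
for start in range(0, Xn.shape[0], block):
    rows = X_norm[start:start + block]
    # cosine similarity of every row in the block against every row of X
    # (tf.keras.losses.cosine_similarity returns the negative of this value)
    sims = tf.matmul(rows, X_norm, transpose_b=True)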
The result is that I am disappointed. I expected the GPU to be more powerful in terms of memory and, above all, at least 1000x faster than the CPU. I expected the times to be on the order of milliseconds or microseconds.
Why do they still say this card delivers 80 teraflops and so on, when the computation takes the same time as on a CPU with native instructions? And above all, a CPU has only a few cores while a GPU has thousands of cores.
Can I make it somehow faster?
As I am learning to write Neural Networks with Python, I have just written the following linear association network that takes in K input vectors x_1, ..., x_K of respective length L and K output vectors of respective length N and finds optimal weights using gradient descent.
As the calculation time explodes really quickly when increasing K, L and N, I was looking into how to speed this up. I discovered CuPy, but CuPy is much, much slower than NumPy in this case. Why would this be? When changing the code to the CuPy variant, I do nothing but substitute every np with cp, having imported cupy as cp.
I have also used f = njit()(ManyAssociations.fit), but then I had to return W in fit instead of writing ManyAssociations.weights = W. Is there any way to use njit inside the class, or, apart from that, a better way to use Numba/CUDA? It turns out to be much quicker after "warming up" with a first function call, but it still reaches its limit with vectors of the mentioned shapes around K = L = N = 9.
What are some other good ways to speed up code like the below one? Could I be writing more efficiently? Could I be using the GPU better?
import numpy as np

class ManyAssociations:

    def fit(x_train, y_train, learning_rate, tol):
        L_L = x_train.shape[1]
        L_N = y_train.shape[1]
        W = np.zeros((L_N, L_L))
        for n in range(L_N):
            learning = True
            w = np.random.rand(L_L)
            while learning:
                delta = (x_train @ w - y_train[:, n])
                grad_E = delta @ x_train
                w = w - learning_rate * grad_E
                if (grad_E @ grad_E) < tol:
                    W[n] = w
                    learning = False
        ManyAssociations.weights = W

    def predict(x_pred, W):
        preds = []
        for k in range(x_pred.shape[0]):
            preds.append(W @ x_pred[k])
        return np.array(preds)
I discovered CuPy, but CuPy is much, much slower than NumPy in this case. Why would this be?
Computations on a GPU are split into basic, computationally intensive building blocks called kernels. The kernels are submitted to the GPU by the CPU. Each kernel call takes some time: the CPU has to communicate with the GPU, often over the relatively slow PCI interconnect (and both have to be synchronized), allocations have to be made on the GPU so that the resulting data can be written, and so on. The CuPy package naively transforms each basic NumPy instruction into a GPU kernel. Since your loop executes a lot of small kernels, the overhead is huge. So if you want your code to be faster on GPUs using CuPy, you either need to work on huge chunks of data or write your own kernel directly (which is hard, since GPUs are quite complex).
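As a rough illustration of the kernel-launch overhead (the array sizes here are arbitrary), accumulating many tiny arrays one by one launches one kernel per operation, while the same reduction expressed as a single call launches only a few:

import time
import cupy as cp

data = cp.random.rand(10000, 16).astype(cp.float32)

# Many tiny kernels: one launch per row, so the launch overhead dominates
start = time.time()
total = cp.zeros(16, dtype=cp.float32)
for i in range(data.shape[0]):
    total += data[i]
cp.cuda.Device().synchronize()
print('row by row:', time.time() - start)

# One large kernel: the same reduction as a single operation on the whole array
start = time.time()
total = data.sum(axis=0)
cp.cuda.Device().synchronize()
print('single sum:', time.time() - start)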
Is there any way to use njit inside the class, or, apart from that, a better way to use Numba/CUDA?
You can use @jitclass. You can find more information in the documentation.
Moreover, you can take advantage of parallelism to speed your code up. To do that, you can for example replace range with prange and pass parallel=True to Numba's njit. You can find more information here.
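A minimal sketch of that pattern, written as a stand-alone function rather than a method of the class from the question (it assumes SciPy is installed, which Numba relies on for the @ matrix products):

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def fit_all(x_train, y_train, learning_rate, tol):
    L_L = x_train.shape[1]
    L_N = y_train.shape[1]
    W = np.zeros((L_N, L_L))
    # Each output dimension is independent, so the outer loop can run in parallel
    for n in prange(L_N):
        w = np.random.rand(L_L)
        while True:
            delta = x_train @ w - y_train[:, n]
            grad_E = delta @ x_train
            w = w - learning_rate * grad_E
            if grad_E @ grad_E < tol:
                W[n] = w
                break
    return W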
What are some other good ways to speed up code like the below one? Could I be writing more efficiently? Could I be using the GPU better?
Neural networks are generally very computationally intensive. Numba should be quite good for getting reasonably high performance. But if you want really fast code, you will either need to use a higher-level library or get your hands dirty and rewrite what those libraries do yourself (likely with much lower-level code).
The standard way to work with neural networks is to use dedicated libraries like TensorFlow, PyTorch or Keras. AFAIK, the first is flexible and highly optimized, although it is a bit more low-level than the others.
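As a rough illustration of the library route (a sketch, not a drop-in replacement for the class above), the same kind of linear association can be trained with gradient descent in PyTorch, which runs the heavy matrix work in optimized C++/CUDA code:

import torch

def fit(x_train, y_train, learning_rate=1e-3, steps=1000):
    # x_train: (K, L), y_train: (K, N); W has shape (N, L) as in the question
    x = torch.as_tensor(x_train, dtype=torch.float32)
    y = torch.as_tensor(y_train, dtype=torch.float32)
    W = torch.zeros(y.shape[1], x.shape[1], requires_grad=True)
    opt = torch.optim.SGD([W], lr=learning_rate)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(x @ W.T, y)  # squared error over all outputs at once
        loss.backward()
        opt.step()
    return W.detach()

def predict(x_pred, W):
    return torch.as_tensor(x_pred, dtype=torch.float32) @ W.T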
We ran inference on 1280x720 RGB images using Faster R-CNN from the TensorFlow model zoo, trained on the COCO dataset, and got the following results.
Test case 1:
Created a tf session using tf.Session(graph=tf.Graph())
Batch Size = 4, GPU use = 100% (as default)
Time Taken for inference of 4 images = 0.32 secs
Test case 2:
Then we restricted the GPU memory usage of each TF session to a fraction using gpu_options:
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.40)
tf.Session(graph=tf.Graph(), config=tf.ConfigProto(gpu_options=gpu_options))
For a batch size of 2, the time taken for inference was 0.16 secs, which is understandable because this is linear.
Test case 3:
Ran 2 instances of inference as two different Python processes. Here, both batch sizes were 2.
Inference slowed down considerably, and the final value was 0.32 secs, which is the same as one process running inference with a batch size of 4.
Batch Size = 2, GPU use = 40%, Processes = 2
Time Taken = 0.32 secs
Hence, in Case 1 and Case 3, time taken is the same.
Q.1 Is there any way to reduce the time taken?
Q.2 Where is the bottleneck? In the 1st case too, it doesn't seem that the entire GPU memory is being utilised. Therefore, we thought that if we were to divide the inference into two, the resources could be utilised more efficiently. Where are we going wrong in our understanding?
I am running a simple deep learning model on Google Colab, but it's running slower than my MacBook Air with no GPU.
I read this question and found out the problem is the dataset being imported over the internet, but I am unable to figure out how to speed this up.
My model can be found here. Any idea of how I can make the epoch faster?
My local machine takes 0.5-0.6 seconds per epoch and Google Colab takes 3-4 seconds.
Is a GPU always faster than a CPU? No. Why? Because the speedup from a GPU depends on a few factors:
How much of your code runs in parallel, i.e. how much of it creates work that can execute in parallel; this is automatically taken care of by Keras and should not be a problem in your scenario.
The time spent sending data between the CPU and GPU. This is where many people falter: it is assumed that the GPU will always outperform the CPU, but if the data being passed is too small, the time needed for the actual computation (the number of computation steps required) is smaller than the overhead of breaking the data/work into threads, executing them on the GPU, and then recombining the results on the CPU.
The second scenario looks probable in your case since you have used a batch_size of 5.
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=5)
If your dataset is big enough, increasing the batch_size will increase the GPU's performance advantage over the CPU.
Other than that, you have used a fairly simple model, and as @igrinis pointed out, the data is loaded only once from Drive to memory, so in theory the problem should not be loading time, because the data is on Drive.
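As a quick way to check the batch-size effect on your own setup, here is a sketch that times a few epochs at two batch sizes; the tiny model and random data below are hypothetical stand-ins for the ones in the linked notebook:

import time
import numpy as np
from tensorflow import keras

# Hypothetical stand-in data and model; replace with the real ones
x = np.random.rand(10000, 20).astype('float32')
y = np.random.randint(0, 2, size=(10000, 1))

def build_classifier():
    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

for batch_size in (5, 256):
    model = build_classifier()
    start = time.time()
    model.fit(x, y, epochs=5, batch_size=batch_size, verbose=0)
    print('batch_size', batch_size, ':', time.time() - start, 'seconds')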