Yesterday I tried using a GPU for computation and I was very disappointed. My assumption was that a GPU (with thousands of cores) would be at least 1,000x faster than a CPU.
I first wanted to use the GPU to compute cosine similarity, and let me tell you the results.
First off, the GPU was not able to compute the pairwise similarity (matrix to matrix) at all; it reported out of memory despite the card having 24 GB of RAM (the setup: 2x RTX 3090, 88.8 TFLOPS, 24 GB RAM per card, 64 GB of system RAM).
So I had to use a vector-to-matrix computation and iterate: if your original matrix has 400k rows, you have to run the vector-to-matrix cosine similarity computation 400k times.
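(For a sense of scale, the full pairwise result alone cannot fit in 24 GB; a quick back-of-the-envelope check in plain Python:)

rows = 400_000
bytes_per_float32 = 4
# the full 400k x 400k float32 similarity matrix, before any intermediate buffers:
print(rows * rows * bytes_per_float32 / 1e9, "GB")  # ~640 GB, far beyond 24 GB of GPU RAM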
By the way, I used TensorFlow for this, like so:
import tensorflow as tf
import numpy as np

# 400k rows, 100 columns, float32
Xn = np.random.uniform(0, 10, (400000, 100)).astype('float32')
X = tf.constant(Xn)

count = 1000 * 100  # 100k iterations
for i in range(count):
    # similarity of the first row against the whole matrix
    tf.keras.losses.cosine_similarity(
        X[0],
        X,
        axis=-1
    )
As you can see, the matrix has shape (400k, 100), i.e. 400k rows and 100 columns.
Then I ran the iterations above, which is 100k of them. Those 100k iterations took 176 seconds, which is, yes, roughly 20x faster than the CPU in my notebook, but to be honest you can achieve the same with the Python Numba library, which compiles Python code to native machine code (though with Numba the full 400k iterations take about 30 minutes; I don't know how long that would take on the GPU).
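(For reference, the per-row Python loop can be replaced by processing the rows in chunks, with one matmul per chunk on L2-normalized rows. This is only a sketch, not the code from the post, and the chunk size is an arbitrary assumption to keep each block of the result within GPU memory:)

import tensorflow as tf
import numpy as np

Xn = np.random.uniform(0, 10, (400000, 100)).astype('float32')
X_norm = tf.math.l2_normalize(tf.constant(Xn), axis=1)  # normalize rows once

chunk = 1024  # assumed; tune to available GPU memory
for start in range(0, Xn.shape[0], chunk):
    block = X_norm[start:start + chunk]
    # (chunk, 100) x (100, 400k) -> one (chunk, 400k) block of cosine similarities
    sims = tf.matmul(block, X_norm, transpose_b=True)
    # process/reduce `sims` here; keeping all blocks in memory would again be ~640 GB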
The result is that I am disappointed. I expected the GPU to be more powerful in terms of memory and, above all, at least 1,000x faster than the CPU. I expected times on the order of milliseconds or microseconds.
Why do they still say this card delivers 80 teraflops and so on, when the computation takes about the same time as on a CPU with native instructions? And above all, a CPU has only a few cores, while a GPU has thousands of cores.
Can I somehow make it faster?
Related
I am trying to replace a single 2D convolution layer that has a relatively large kernel with several 2D convolution layers that have much smaller kernels. In theory, the replacement should be much faster (with respect to the number of operations), but in practice it is not.
Suppose my kernel has size OxCxHxW (output channels, input channels, height, width, respectively); then, in theory, the computational cost is OxCxHxW operations per input pixel. However, it seems that larger kernels run much faster than this model predicts.
For example:
a kernel of size 32x32x5x5 takes 46 ms to run (the baseline large kernel)
a smaller kernel of size 32x32x1x1 takes 8 ms (supposed to be 25x faster than #1, i.e. about 2 ms)
a group convolution with a kernel of size 32x1x5x5 takes about 9 ms, even though the reduction in kernel size is much more significant (supposed to be 32x faster than #1, i.e. about 1.5 ms)
I have also tried TensorFlow and got similar results.
I understand that larger kernels probably utilize the CPU (or GPU) better and that there is per-call overhead, but the time reduction going from the large kernel to the small ones is not as significant as I expected.
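(For reference, a minimal PyTorch timing sketch for this kind of comparison; the input size, the CUDA device, and the use of torch.cuda.synchronize() before reading the clock are assumptions, not taken from the notebook:)

import time
import torch
import torch.nn as nn

x = torch.randn(1, 32, 128, 128).cuda()  # assumed input size
layers = {
    "32x32x5x5 (baseline)": nn.Conv2d(32, 32, 5, padding=2).cuda(),
    "32x32x1x1": nn.Conv2d(32, 32, 1).cuda(),
    "32x1x5x5 (grouped)": nn.Conv2d(32, 32, 5, padding=2, groups=32).cuda(),
}
for name, conv in layers.items():
    for _ in range(10):  # warm-up
        conv(x)
    torch.cuda.synchronize()  # measure kernel execution, not just launch overhead
    start = time.time()
    for _ in range(100):
        conv(x)
    torch.cuda.synchronize()
    print(name, (time.time() - start) / 100, "s per call")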
A colab notebook can be found here.
Can anyone suggest an idea to accelerate those small convolutions?
When running a PyTorch training program with num_workers=32 for the DataLoader, htop shows 33 python processes, each with 32 GB of VIRT and 15 GB of RES.
Does this mean that the PyTorch training is using 33 processes x 15 GB = 495 GB of memory? htop shows that only about 50 GB of RAM and 20 GB of swap are being used on the entire machine, which has 128 GB of RAM. So how do we explain the discrepancy?
Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?
Thank you
Does this mean that the PyTorch training is using 33 processes X 15 GB = 495 GB of memory?
Not necessarily. You have a main process (with several subprocesses, the workers), and the CPU has several cores. One worker usually loads one batch; the next batch can already be loaded and ready to go by the time the main process is ready for another batch. That is the secret of the speed-up. Also, much of the memory reported for each worker is shared with the parent process (the workers are forked copies), so multiplying RES by the number of processes overcounts the real usage.
My guess is that you should use far fewer num_workers.
It would also be interesting to know your batch size, which you can tune for the training process as well (see the sketch below).
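(A hedged sketch of what a more conservative DataLoader configuration could look like; the dataset here is a dummy stand-in, and the numbers are assumptions rather than recommendations for your setup:)

import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy dataset standing in for the real one
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

# num_workers closer to the number of cores doing real preprocessing work,
# batch_size whatever the model and memory allow
loader = DataLoader(dataset, batch_size=64, num_workers=8, pin_memory=True)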
Is there a more accurate way of calculating the total amount of RAM being used by the main PyTorch program and all its child DataLoader worker processes?
I googled but could not find a concrete formula. I think it comes down to a rough estimate based on how many cores your CPU has, how much memory you have, and your batch size.
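(One practical approximation, sketched with the psutil package: sum the memory of the main process and its children. Note that RSS counts shared pages once per process, so on Linux the USS figure gives a less inflated total:)

import psutil

main = psutil.Process()  # or psutil.Process(<pid of the training script>)
procs = [main] + main.children(recursive=True)  # main process + DataLoader workers
total_uss = sum(p.memory_full_info().uss for p in procs)  # memory unique to each process
total_rss = sum(p.memory_info().rss for p in procs)       # overcounts shared pages
print(f"USS total: {total_uss / 2**30:.1f} GiB, RSS total: {total_rss / 2**30:.1f} GiB")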
The choice of num_workers depends on what kind of machine you are using, what kind of dataset you are working with, and how much on-the-fly pre-processing your data requires.
HTH
There is a Python module called tracemalloc which traces memory blocks allocated by Python: https://docs.python.org/3/library/tracemalloc.html
It gives you:
Tracebacks
Statistics on memory per filename
Computing the diff between snapshots
import tracemalloc

tracemalloc.start()
do_something_that_consumes_ram_and_releases_some()  # placeholder for your own code
# show how much RAM the above code allocated and the peak usage
current, peak = tracemalloc.get_traced_memory()
print(f"{current:0.2f}, {peak:0.2f}")
tracemalloc.stop()
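(And a minimal sketch of the snapshot diff mentioned above, using the standard tracemalloc API; the allocation in the middle is just a placeholder:)

import tracemalloc

tracemalloc.start()
snapshot_before = tracemalloc.take_snapshot()
data = [bytes(1000) for _ in range(10_000)]  # placeholder allocation
snapshot_after = tracemalloc.take_snapshot()
# top lines of code ranked by how much memory they added between the snapshots
for stat in snapshot_after.compare_to(snapshot_before, 'lineno')[:5]:
    print(stat)
tracemalloc.stop()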
https://discuss.pytorch.org/t/measuring-peak-memory-usage-tracemalloc-for-pytorch/34067
I am running a simple deep learning model on Google Colab, but it runs slower than my MacBook Air, which has no GPU.
I read this question and found out that the problem is importing the dataset over the internet, but I am unable to figure out how to speed this process up.
My model can be found here. Any idea how I can make each epoch faster?
My local machine takes 0.5-0.6 seconds per epoch and Google Colab takes 3-4 seconds.
Is a GPU always faster than a CPU? No. Why? Because the speed-up from a GPU depends on a few factors:
How much of your code runs in parallel, i.e. how much of it can be split into work that executes in parallel; this is automatically taken care of by Keras and should not be a problem in your scenario.
The time spent sending data between the CPU and the GPU; this is where people often go wrong. It is assumed that the GPU will always outperform the CPU, but if the data being passed is too small, the computation itself (the number of computation steps required) takes less time than splitting the data into batches, executing them on the GPU, and then combining the results back on the CPU.
The second factor looks likely in your case, since you have used a batch_size of 5:
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=5)
If your dataset is big enough, increasing the batch_size will increase the advantage of the GPU over the CPU.
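(A minimal sketch of that change; the model here is a stand-in for yours, the batch size of 128 is an assumption to illustrate the idea, and the import path is the one used by older Keras/TensorFlow versions (newer setups use the scikeras package instead):)

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

def build_classifier():
    # stand-in for the model in the question
    model = Sequential([Dense(16, activation='relu', input_shape=(20,)),
                        Dense(1, activation='sigmoid')])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model

# same wrapper call, with a larger (assumed) batch size to keep the GPU busy
# and amortize the CPU-GPU transfer per step
classifier = KerasClassifier(build_fn=build_classifier, epochs=100, batch_size=128)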
Other than that, you have used a fairly simple model, and as #igrinis pointed out, the data is loaded only once from Drive into memory, so in theory the problem should not be loading time.
I was trying to find out whether GPU tensor operations are actually faster than CPU ones. So I wrote the code below to perform a simple addition of 2D CPU tensors and GPU CUDA tensors in succession, to see the speed difference:
import torch
import time

### CPU
start_time = time.time()
a = torch.ones(4, 4)
for _ in range(1000000):
    a += a
elapsed_time = time.time() - start_time
print('CPU time = ', elapsed_time)

### GPU
start_time = time.time()
b = torch.ones(4, 4).cuda()
for _ in range(1000000):
    b += b
elapsed_time = time.time() - start_time
print('GPU time = ', elapsed_time)
To my surprise, the CPU time was 0.93 s while the GPU time was as high as 63 s. Am I doing the CUDA tensor operations properly, or do CUDA tensors only pay off for much more complex operations, like those in neural networks?
Note: my GPU is an NVIDIA 940MX, and torch.cuda.is_available() returns True.
GPU acceleration works through massive parallelization of the computation. On a GPU you have a huge number of cores; each of them is not very powerful, but it is the sheer number of cores that matters.
Frameworks like PyTorch do their best to compute as much as possible in parallel. In general, matrix operations are very well suited for parallelization, but it still isn't always possible to parallelize a computation!
In your example you have a loop:
b = torch.ones(4, 4).cuda()
for _ in range(1000000):
    b += b
You have 1,000,000 operations, but due to the structure of the code it is impossible to parallelize much of this computation. If you think about it, to compute the next b you need to know the value of the previous (or current) b.
So you have 1,000,000 operations, but each of them has to be computed one after another. The possible parallelism is limited to the size of your tensor, and that size is not very large in your example:
torch.ones(4,4)
So you can only parallelize 16 operations (additions) per iteration.
Since the CPU has few but much more powerful cores, it is simply much faster for this example!
But things change if you increase the size of the tensor; then PyTorch is able to parallelize much more of the overall computation. I changed the number of iterations to 1000 (because I did not want to wait that long :), but you can put in any value you like; the relation between CPU and GPU should stay the same.
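(A sketch of how such a sweep can be timed, following the question's approach; the torch.cuda.synchronize() calls are an added assumption so that the GPU timing waits for the queued kernels to actually finish:)

import time
import torch

for n in (4, 40, 400, 4000):
    # CPU
    a = torch.ones(n, n)
    start = time.time()
    for _ in range(1000):
        a += a
    cpu_time = time.time() - start

    # GPU
    b = torch.ones(n, n).cuda()
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(1000):
        b += b
    torch.cuda.synchronize()  # wait for the queued kernels to finish
    gpu_time = time.time() - start

    print(f"torch.ones({n},{n}): CPU {cpu_time:.4f}s, GPU {gpu_time:.4f}s")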
Here are the results for different tensor sizes:
#torch.ones(4,4) - the size you used
CPU time = 0.00926661491394043
GPU time = 0.0431208610534668
#torch.ones(40,40) - CPU gets slower, but still faster than GPU
CPU time = 0.014729976654052734
GPU time = 0.04474186897277832
#torch.ones(400,400) - CPU now much slower than GPU
CPU time = 0.9702610969543457
GPU time = 0.04415607452392578
#torch.ones(4000,4000) - GPU much faster than CPU
CPU time = 38.088677167892456
GPU time = 0.044649362564086914
So, as you can see, where it is possible to parallelize the work (here, the addition of the tensor elements), the GPU becomes very powerful. The GPU time barely changes across these sizes; the GPU can handle much more! (as long as it doesn't run out of memory :)
I am trying to classify paragraphs based on their sentiment. I have training data of 600 thousand documents. When I convert them to a TF-IDF vector space with words as the analyzer and an n-gram range of 1-2, there are almost 6 million features. So I have to apply singular value decomposition (SVD) to reduce the number of features.
I have tried gensim's and sklearn's SVD implementations. Both work fine for reducing to 100 features, but as soon as I try 200 features they throw a memory error.
Also, I have not used the entire corpus (600 thousand documents) as training data; I have taken only 50,000 documents. So essentially my training matrix is:
50,000 x 6 million, and I want to reduce it to 50,000 x (100 to 500)
Is there any other way I can implement this in Python, or do I have to use Spark MLlib's SVD (available only for Java and Scala)? If so, how much faster will it be?
System specification: 32 GB RAM, 4-core processor, Ubuntu 14.04.
I don't really see why using Spark MLlib's SVD would improve performance or avoid memory errors. You simply exceed the size of your RAM. You have some options to deal with that:
Reduce the vocabulary size of your TF-IDF (for example, by playing with the max_df and min_df parameters in scikit-learn).
Use a hashing vectorizer instead of TF-IDF (see the sketch after this list).
Get more RAM (but at some point TF-IDF + SVD will not scale anyway).
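(A minimal sketch of the second option with scikit-learn's HashingVectorizer feeding TruncatedSVD, which works directly on sparse input; the number of hashed features and components are assumptions, and the documents are placeholders:)

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["example paragraph number %d with some words" % i for i in range(1000)]  # stands in for your 50k documents

# fixed-size hashed feature space instead of a 6-million-term vocabulary
vectorizer = HashingVectorizer(n_features=2**18, ngram_range=(1, 2))
X = vectorizer.transform(docs)  # sparse matrix, no vocabulary kept in memory

svd = TruncatedSVD(n_components=200)  # randomized SVD, accepts sparse input
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (1000, 200)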
Also, you should show a code sample; you might be doing something wrong in your Python code.