I am trying to replace a single 2D convolution layer that has a relatively large kernel with several 2D convolution layers that have much smaller kernels. In theory, the replacement should be much faster (in terms of the number of operations), but in practice it is not.
Suppose my kernel has size OxCxHxW (output channels, input channels, height, width, respectively); then in theory the computational cost is OxCxHxW multiply-accumulates per input pixel. In practice, however, larger kernels run much faster than this count would suggest.
For example:
- a kernel of size 32x32x5x5 takes 46 ms to run (the baseline large kernel)
- a smaller kernel of size 32x32x1x1 takes 8 ms (it should be about 25x faster than #1, i.e. roughly 2 ms)
- a group convolution with a kernel of size 32x1x5x5 takes about 9 ms, even though the reduction in kernel size is much more significant (it should be about 32x faster than #1, i.e. roughly 1.5 ms)
I have also tried TensorFlow and got similar results.
I understand that larger kernels probably utilize the CPU (or GPU) better and that there is per-call overhead, but the time reduction when going from the large kernel to the small ones is not as significant as I expected.
A colab notebook can be found here.
Can anyone suggest an idea to accelerate those small convolutions?
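For reference, here is a minimal timing sketch of the three cases above. The input size (1x32x128x128), the warm-up scheme, and the use of PyTorch are my assumptions, not the exact setup from the notebook:

import time
import torch

x = torch.randn(1, 32, 128, 128)  # assumed input size

conv_5x5 = torch.nn.Conv2d(32, 32, kernel_size=5, padding=2)              # 32x32x5x5
conv_1x1 = torch.nn.Conv2d(32, 32, kernel_size=1)                         # 32x32x1x1
conv_grp = torch.nn.Conv2d(32, 32, kernel_size=5, padding=2, groups=32)   # 32x1x5x5 (group conv)

def bench(layer, reps=100):
    with torch.no_grad():
        for _ in range(10):       # warm-up, so one-time allocation cost is excluded
            layer(x)
        start = time.perf_counter()
        for _ in range(reps):
            layer(x)
    return (time.perf_counter() - start) / reps * 1000  # average ms per call

for name, layer in [("5x5", conv_5x5), ("1x1", conv_1x1), ("group 5x5", conv_grp)]:
    print(f"{name}: {bench(layer):.2f} ms")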
Yesterday I tried using the GPU for computation and I was very disappointed. My expectation was that a GPU (with thousands of cores) would be at least 1,000x faster than a CPU.
I first wanted to use the GPU to compute cosine similarity, and here are the results.
First off, the GPU is not capable of computing the full pairwise similarity (matrix to matrix): it reports out of memory, despite the GPU having 24 GB of RAM (the setup was 2x RTX 3090, 88.8 TFLOPS and 24 GB of RAM per card, with 64 GB of system RAM).
So I had to use a vector-to-matrix computation, which requires iterating: if the original matrix has 400k rows, you have to run the vector-to-matrix cosine similarity computation 400k times.
By the way, I used TensorFlow for this, like so:
import tensorflow as tf
import numpy as np
Xn = np.random.uniform(0, 10, (400000, 100)).astype('float32')
X = tf.constant(Xn)
count = 1000 * 100
for i in range(count):
    tf.keras.losses.cosine_similarity(
        X[0],
        X,
        axis=-1
    )
As you can see, the matrix has shape (400k, 100), meaning 400k rows and 100 columns.
Then I ran the iterations above, which is 100k of them. These 100k iterations took 176 seconds, which is, yes, roughly 20x faster than the CPU in my notebook, but to be honest you can achieve the same with the Python Numba library, which translates Python code to native machine code (although with Numba, 400k iterations take about 30 minutes; I don't know how long that would take on the GPU).
The result is that I am disappointed. I expected the GPU to be more powerful in terms of memory and, above all, at least 1,000x faster than the CPU. I expected the times to be on the order of milliseconds or microseconds.
Why do they say this card has around 80 teraflops of compute when the computation takes about the same time as a CPU running native instructions? And above all, the CPU has only a few cores while the GPU has thousands.
Can I make it somehow faster?
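For what it's worth, the per-row similarities can also be computed in chunks, with one matrix multiply per chunk instead of a Python-level loop; since the full 400k x 400k similarity matrix would be about 640 GB in float32, each chunk has to be reduced (here to an argmax) rather than stored. This is only a sketch: the chunk size of 1024 is an arbitrary assumption, and it computes the positive cosine similarity, whereas tf.keras.losses.cosine_similarity returns its negative.

import tensorflow as tf
import numpy as np

Xn = np.random.uniform(0, 10, (400000, 100)).astype('float32')
X = tf.constant(Xn)

# Normalize each row once; cosine similarity then reduces to a plain dot product,
# so a whole chunk of rows can be compared against the full matrix with one matmul.
X_unit = tf.math.l2_normalize(X, axis=1)

chunk = 1024  # arbitrary chunk size, chosen so one (chunk x 400000) block fits in GPU memory
best = []
for start in range(0, X_unit.shape[0], chunk):
    block = X_unit[start:start + chunk]                 # (chunk, 100)
    sims = tf.matmul(block, X_unit, transpose_b=True)   # (chunk, 400000) cosine similarities
    best.append(tf.argmax(sims, axis=1))                # keep only a reduction, not the full matrix
best = tf.concat(best, axis=0)                          # most-similar row index for every row of X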
I've implemented a basic 1-hidden-layer NN to identify digits from part of the MNIST dataset (this is adapted from the Coursera ML course). It is implemented using matrices and conjugate-gradient (fmincg) optimization.
In addition, I've written both a Keras implementation and a vanilla TF implementation.
In the manual implementation there are 50 iterations (or epochs; since I'm training on the whole batch, each epoch consists of a single update).
Now, what I've noticed is that the manual implementation takes a while to compute, but after 50 iterations it gives me really good results.
Keras and TF, on the other hand, take less time to compute but need about 500-1000 iterations on the whole batch to give me the same results. (Using 50 epochs with mini-batches of size 32 seems to work well too.)
So I wonder: how come the manual implementation converges better? And why does it take longer to compute even though it's the same number of iterations (50)?
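For reference, here is a minimal sketch of the Keras variant described above, trained full-batch for 50 epochs; the hidden-layer size, activations, and optimizer are my assumptions, not the original values.

import tensorflow as tf

# Load and flatten MNIST; the Coursera exercise uses only a subset, so this is an approximation.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Dense(25, activation='sigmoid'),   # hidden size is an assumption
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Full-batch training: one update per epoch, 50 epochs, mirroring the manual implementation.
model.fit(x_train, y_train, epochs=50, batch_size=len(x_train))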
I'm using NVIDIA Tensor Cores on the Volta architecture (a V100 GPU). I want to measure the impact of Tensor Cores on my code (a convolutional neural network in TensorFlow/Python, used for testing purposes).
How can I measure the Tensor Core speedup? Is it possible to disable Tensor Cores and run the same code with and without them?
What I've tried:
Setting TF_DISABLE_CUDNN_TENSOR_OP_MATH to 1 (from this). But I still see that Tensor Cores are used; more precisely, the nvprof log shows volta_s884cudnn_fp16 lines (which disappear with this option) and volta_s884gemm_fp16 lines (which are still there). Side question: what do these lines mean? (See the sketch after this list.)
Comparing with the same code on the Pascal architecture (P100), which has no Tensor Cores: I see a 30% speedup, but I can't tell which part of that 30% comes from the general GPU improvement and which part comes from Tensor Cores.
Training the same network in tf.float16 and tf.float32, with the same problem: I see improvements, but I can't tell how much of them is caused by the reduction in model size.
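As a rough sketch of the setup referenced above: the environment variables have to be set before TensorFlow initializes the GPU. The cuDNN variable is the one I tried; the cuBLAS variable is my assumption for the GEMM path (the volta_s884gemm_fp16 kernels) and its exact name should be verified against your TensorFlow build.

import os

# Both variables must be set before TensorFlow touches the GPU.
os.environ['TF_DISABLE_CUDNN_TENSOR_OP_MATH'] = '1'   # cuDNN convolutions (from the attempt above)
os.environ['TF_DISABLE_CUBLAS_TENSOR_OP_MATH'] = '1'  # assumed name for the cuBLAS GEMM path; verify

import tensorflow as tf  # imported only after the variables are set

# ... build and train the network as usual, then compare wall-clock time and nvprof output
# with and without the variables set.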
Thanks in advance for any help/advice on this.
I chose a hack to estimate the performance gain of Tensor Cores:
I ran the code in float32 on both the Pascal and Volta architectures (to estimate the performance gain due to the architecture itself).
I also ran the code in float16 on both and, assuming the architectural gain is the same for float32 and float16, I can attribute the remaining part of the float16 performance gain to Tensor Cores.
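In arithmetic terms, with hypothetical timings standing in for the real measurements:

# Hypothetical per-epoch timings in seconds; placeholders, not measured values.
t_p100_fp32, t_v100_fp32 = 100.0, 70.0
t_p100_fp16, t_v100_fp16 = 60.0, 25.0

arch_gain = t_p100_fp32 / t_v100_fp32            # speedup attributed to the architecture alone
total_fp16_gain = t_p100_fp16 / t_v100_fp16      # total speedup observed in float16
tensor_core_gain = total_fp16_gain / arch_gain   # residual, attributed to Tensor Cores

print(f"architecture: {arch_gain:.2f}x, float16 total: {total_fp16_gain:.2f}x, "
      f"Tensor Cores (estimated): {tensor_core_gain:.2f}x")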
It is a common practice to normalize the input values to a neural network to speed up the learning process, especially when features are on very different scales.
In theory, normalization is easy to understand. But I wonder how this is done when the training data set is very large, say 1 million training examples. If the number of features per training example is large as well (say, 100 features per example), two problems suddenly pop up:
- It will take some time to normalize all training samples
- The normalized training examples need to be saved somewhere, so we roughly double the required disk space (especially if we do not want to overwrite the original data).
How is input normalization solved in practice, especially if the data set is very large?
One option might be to normalize the inputs dynamically, in memory, per mini-batch while training. But the normalization results will then change from one mini-batch to another. Would that be tolerable?
Maybe someone on this platform has hands-on experience with this. I would really appreciate it if you could share your experience.
Thank you in advance.
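For concreteness, here is a rough sketch of one common answer to both concerns: compute the per-feature statistics in a single streaming pass, so only the statistics live in memory and no normalized copy of the data is ever written to disk. The chunked loader and the choice of mean/std normalization are my assumptions.

import numpy as np

# One streaming pass over the data: only per-feature running sums are kept in memory.
n = 0
s = np.zeros(100, dtype=np.float64)
s2 = np.zeros(100, dtype=np.float64)

def load_chunks():
    # Stand-in for reading the real dataset from disk chunk by chunk.
    for _ in range(100):
        yield np.random.rand(10_000, 100).astype('float32')

for chunk in load_chunks():
    n += len(chunk)
    s += chunk.sum(axis=0)
    s2 += (chunk.astype(np.float64) ** 2).sum(axis=0)

mean = s / n
std = np.sqrt(s2 / n - mean ** 2)

# During training, each mini-batch is then normalized on the fly with the fixed statistics:
# batch = (batch - mean) / (std + 1e-8)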
A large number of features makes it easier to parallelize the normalization of the dataset. This is not really an issue. Normalization on large datasets would be easily GPU accelerated, and it would be quite fast, even for large datasets like you are describing. One of my frameworks that I have written can normalize the entire MNIST dataset in under 10 seconds on a 4-core, 4-thread CPU. A GPU could easily do it in under 2 seconds. Computation is not the problem.

While for smaller datasets you can hold the entire normalized dataset in memory, for larger datasets, like you mentioned, you will need to swap out to disk if you normalize the entire dataset.

However, if you are using reasonably large batch sizes, about 128 or higher, your minimums and maximums will not fluctuate that much, depending upon the dataset. This allows you to normalize the mini-batch right before you train the network on it, but again this depends upon the network. I would recommend experimenting on your datasets and choosing the best method.
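A rough sketch of the per-mini-batch approach described above; the batch size of 128 and the min-max scaling are assumptions, and the stand-in array takes the place of the real data loader.

import numpy as np

def normalize_batch(batch):
    # Min-max scale each feature of the mini-batch to [0, 1] using the batch's own statistics.
    mins = batch.min(axis=0, keepdims=True)
    maxs = batch.max(axis=0, keepdims=True)
    return (batch - mins) / (maxs - mins + 1e-8)

batch_size = 128
data = np.random.rand(100_000, 100).astype('float32')  # stand-in for the raw (unnormalized) data
for start in range(0, len(data), batch_size):
    batch = normalize_batch(data[start:start + batch_size])
    # model.train_on_batch(batch, labels[start:start + batch_size])  # hypothetical training call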
In my face recognition project, a face is represented as a 128-dimensional embedding (face_descriptor), as used in FaceNet.
I can generate the embedding from an image in two ways.
Using the TensorFlow ResNet v1 model:
emb_array = sess.run(embedding_layer,
                     {images_placeholder: images_array, phase_train_placeholder: False})
An array of images can be passed and a list of embeddings is obtained.
This is a bit slow; it took about 1.6 s (though the time is almost constant for a large number of images).
Note: GPU not available
The other method uses dlib:
dlib.face_recognition_model_v1.compute_face_descriptor(image, shape)
This gives a fast result: almost 0.05 seconds.
But only one image can be passed at a time, and the total time increases with the number of images.
Is there any way to pass an array of images to calculate embeddings in dlib, or any other way to improve dlib's speed?
Or is there any other, faster method to generate 128-dimensional face embeddings?
Update:
I concatenated multiple images into a single image and passed it to dlib:
dlib.face_recognition_model_v1.compute_face_descriptor(big_image, shapes)
That is, I converted multiple images, each with a single face, into a single image with multiple faces.
Still, the time is proportional to the number of images (i.e. the number of faces) concatenated; it is almost the same as iterating over the individual images.
One of the more important aspects of this question is that you have no GPU available. I'm putting this here so that anyone reading this answer has a better understanding of the context.
There are two major parts to the time consumed by inference. First is the setup time. Tensorflow takes its sweet, sweet time to set itself up when you first run the network, so your measurement of 1.6 seconds is probably 99.9999% setup time and 0.0001% actually processing your image. Then there is the actual inference calculation, which is probably tiny for one image compared to the setup. A better measurement would be to run 1,000 images through it, then 2,000 images, and divide the difference by 1,000 to get how much time each image takes to infer.
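A minimal sketch of that measurement; the input shape and the stub inference function are placeholders for the real model call.

import sys
import time
import numpy as np

def run_inference(batch):
    # Stub standing in for the real call (sess.run(...) or dlib's compute_face_descriptor).
    return np.zeros((len(batch), 128), dtype='float32')

# Run the script twice, e.g. `python bench.py 1000` and `python bench.py 2000`, so the one-time
# setup cost is paid in both runs and cancels out when the two timings are differenced.
n = int(sys.argv[1]) if len(sys.argv) > 1 else 1000
images = np.random.rand(n, 160, 160, 3).astype('float32')  # hypothetical input shape

start = time.perf_counter()
run_inference(images)
print(f"{n} images: {time.perf_counter() - start:.3f} s")

# per-image time ≈ (t_2000 - t_1000) / 1000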
From the look of it, Dlib doesn't spend much time setting itself up on the first run, but it would still be a better benchmark to do the same as outlined above.
I suspect Tensorflow and Dlib should be fairly similar in terms of execution speed on a CPU because both use optimized linear algebra libraries (BLAS, LAPACK) and there is only so much optimization one can do for matrix multiplication.
There is another thing you might want to try, though. Most networks use 32-bit floating-point calculations for training and inference, but research shows that in most cases switching to 8-bit integers for inference doesn't degrade accuracy much while speeding up inference by a lot.
It is generally better to train a network with the later quantization in mind, which is not the case here because you use a pre-trained model, but you can probably still benefit a lot from quantization. You can quantize your model basically by running a script that's included in Tensorflow (with the surprising name quantize_graph), but there is a little bit more to it. There is a nice quantization tutorial to follow; keep in mind, though, that the script now lives in tensorflow/tools/quantization and not in contrib any more, as written in the tutorial.
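As an aside, and as an alternative to the quantize_graph route above, newer TensorFlow releases expose post-training (dynamic-range) quantization through the TFLite converter; a minimal sketch, where the saved-model path is a placeholder:

import tensorflow as tf

# Post-training dynamic-range quantization via the TFLite converter; the model path is hypothetical.
converter = tf.lite.TFLiteConverter.from_saved_model('facenet_saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open('facenet_quantized.tflite', 'wb') as f:
    f.write(tflite_model)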