I have recently been learning PyCUDA and am planning to replace some code in a camera system to speed up its image processing. That part originally used cv2.filter2D; my intention is to accelerate the processing with the GPU. Here are my timings:
Time for signal.convolve2d: 1.6639747619628906
Time for cusignal.convolve2d: 0.6955723762512207
Time for cv2.filter2D: 0.18787837028503418
However, it seems that cv2.filter2D is still the fastest of the three. If the input is a long list of images, could a custom PyCUDA kernel outperform cv2.filter2D? Here is my benchmark code:
import time
import cv2
from cusignal.test.utils import array_equal
import cusignal
import cupy as cp
import numpy as np
from scipy import signal
from scipy import misc
ascent = misc.ascent()
ascent = np.array(ascent, dtype='int16')
ascentList = [ascent]*100
filterSize = 3
scharr = np.ones((filterSize, filterSize), dtype="float") * (1.0 / (filterSize*filterSize))
startTime = time.time()
for asc in ascentList:
    grad = signal.convolve2d(asc, scharr, boundary='symm', mode='same')
endTime = time.time()
print("Time for signal.convolve2d: "+str(endTime - startTime))
startTime = time.time()
for asc in ascentList:
    gpu_convolve2d = cp.asnumpy(cusignal.convolve2d(cp.asarray(asc), scharr, boundary='symm', mode='same'))
endTime = time.time()
print("Time for cusignal.convolve2d: "+str(endTime - startTime))
print("If signal equal to cusignal: "+ str(array_equal(grad, gpu_convolve2d)))
startTime = time.time()
for asc in ascentList:
    opencvOutput = cv2.filter2D(asc, -1, scharr)
endTime = time.time()
print("Time for cv2.filter2D: "+str(endTime - startTime))
print("If cv2 equal to cusignal: "+ str(array_equal(opencvOutput, gpu_convolve2d)))
In your timing analysis of the GPU, you are including the time to copy asc to the GPU, execute convolve2d, and transfer the answer back. Transfers to and from the GPU are very slow in the scheme of things. If you want a true comparison of the compute, just profile convolve2d.
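For example, a compute-only timing could look like this (a minimal sketch reusing the variables from the question's code; it assumes cusignal accepts device arrays for both inputs, and synchronizes explicitly because kernel launches are asynchronous):

d_asc = cp.asarray(ascent)          # pay the host-to-device copy outside the timed region
d_scharr = cp.asarray(scharr)
cp.cuda.Stream.null.synchronize()
startTime = time.time()
for _ in range(len(ascentList)):
    gpuGrad = cusignal.convolve2d(d_asc, d_scharr, boundary='symm', mode='same')
cp.cuda.Stream.null.synchronize()   # wait for all queued kernels before reading the clock
print("Compute-only time for cusignal.convolve2d: " + str(time.time() - startTime))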
Currently, cusignal.convolve2d is written in Numba. We are in the process of porting this to use CuPy raw kernels, and there will be an improvement. I don't have an ETA for convolve2d.
It looks like there might be an OpenCV CUDA version (see the usage sketch at the end of this answer): https://github.com/opencv/opencv_contrib/blob/master/modules/cudafilters/src/cuda/filter2d.cu
Have you tried scipy.ndimage.filters.convolve? http://blog.rtwilson.com/convolution-in-python-which-function-to-use/
Also, check out CuPy's convolve: https://github.com/cupy/cupy/blob/master/cupyx/scipy/ndimage/filters.py
Now to your original question. When trying to determine whether the GPU will be faster than the CPU, you need to ensure there is enough work to keep the GPU busy. It is known that in some cases, where the data size is small, the CPU will perform faster.
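For reference, the OpenCV CUDA filter linked above is exposed in Python roughly as follows. This is a sketch only: it requires an OpenCV build compiled with CUDA support (the standard pip wheels do not include it), and the image size and kernel here are illustrative.

import cv2
import numpy as np

img = np.random.rand(512, 512).astype(np.float32)
kernel = np.ones((3, 3), np.float32) / 9.0

gpu_img = cv2.cuda_GpuMat()
gpu_img.upload(img)
# createLinearFilter is the CUDA counterpart of cv2.filter2D
filt = cv2.cuda.createLinearFilter(cv2.CV_32FC1, cv2.CV_32FC1, kernel)
result = filt.apply(gpu_img).download()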
Related
I have a large number of 2x2 matrices that I need multiplied together.
I can set up the general shape of the problem as
import numpy as np
import time
A_dim = 6*6
B_dim = 2**8
C_dim = B_dim
A = np.random.rand(A_dim,A_dim,2,2)
B = np.random.rand(B_dim,2,2)
C = np.random.rand(C_dim,2,2)
tic = time.perf_counter()
X = A[None,None,:,:,:,:] @ B[:,None,None,None,:,:] @ A[None,None,:,:,:,:] @ C[None,:,None,None,:,:]
toc = time.perf_counter()
print(f"matrix multiplication took {toc - tic:0.4f} seconds")
where I get
matrix multiplication took 14.4403 seconds
This seems to be a vectorized implementation, but is there anything I can do to speed it up? My NumPy build runs this on only one core, so perhaps it would be faster if I got OpenBLAS working? The problem is that each individual matrix product takes a very small amount of time to complete. Is there a better way to construct such a multidimensional array?
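One thing worth trying (a sketch, not a guaranteed win) is np.einsum with optimize=True, which can choose a contraction order that computes intermediate products once instead of redoing them for every (k, l) pair of the broadcast:

import numpy as np
import time

A = np.random.rand(36, 36, 2, 2)   # A_dim = 6*6
B = np.random.rand(256, 2, 2)      # B_dim = 2**8
C = np.random.rand(256, 2, 2)

tic = time.perf_counter()
# X[k,l,i,j] = A[i,j] @ B[k] @ A[i,j] @ C[l], as 2x2 matrix products;
# optimize=True lets einsum pick an efficient pairwise contraction path.
X = np.einsum('ijab,kbc,ijcd,lde->klijae', A, B, A, C, optimize=True)
toc = time.perf_counter()
print(f"einsum took {toc - tic:0.4f} seconds")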
Newbie starting with Numba/cuda here.
I wrote this little test script to compare speeds between @jit and @cuda.jit, just to get a feel for it. It calculates 10M steps of a logistic equation for 256 separate instances.
The cuda part takes approximately 1.2s to finish.
The cpu 'jitted' part finishes in close to 5s (just one thread used on the cpu).
So there is a speedup of about x4, from going to the GPU (a dedicated GTX1080TI not doing anything else). I expected the cuda part, doing all 256 instances in parallel, to be much faster. What am I doing wrong?
Here is the working example:
#!/usr/bin/python3
# logistic equation on gpu/cpu comparison
import os, sys

# Set environment variables (needed for numba 0.42 to find nvvm)
os.environ['NUMBAPRO_NVVM'] = '/usr/lib/x86_64-linux-gnu/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/lib/nvidia-cuda-toolkit/libdevice/'

from time import time
from numpy import random, zeros
from numba import cuda, jit
from numba import int64, int32, float64

@cuda.jit
def logistic_cuda(array_in, array_out):
    # each thread iterates the logistic map for one instance
    pos = cuda.grid(1)
    x = array_in[pos]
    for k in range(10*1000*1000):
        x = 3.9 * x * (1 - x)
    array_out[pos] = x

@jit
def logistic_cpu(array_in, array_out):
    for pos, x in enumerate(array_in):
        for k in range(10*1000*1000):
            x = 3.9 * x * (1 - x)
        array_out[pos] = x

if __name__ == '__main__':
    N = 256
    in_ary = random.uniform(low=0.2, high=0.9, size=N).astype('float32')
    out_ary = zeros(N, dtype='float32')

    t0 = time()
    # explicit copying. not really needed
    d_in_ary = cuda.to_device(in_ary)
    d_out_ary = cuda.to_device(out_ary)
    t1 = time()
    logistic_cuda[1, N](d_in_ary, d_out_ary)
    cuda.synchronize()
    t2 = time()
    out_ary = d_out_ary.copy_to_host()
    t3 = time()
    print(out_ary)
    print('Total time cuda: %g seconds.' % (t3-t0))

    out_ary2 = zeros(N)
    t4 = time()
    logistic_cpu(in_ary, out_ary2)
    t5 = time()
    print('Total time cpu: %g seconds.' % (t5-t4))
    print('\nDifference:')
    print(out_ary2 - out_ary)

# Total time cuda: 1.19364 seconds.
# Total time cpu: 5.01788 seconds.
Thanks!
The problem likely comes from the very small amount of data and the loop dependency. Modern Nvidia GPUs can execute thousands of CUDA threads simultaneously (packed in warps of 32 threads) thanks to their large number of CUDA cores. In your case, each thread performs a computation on one cell of array_out using a sequential loop, but there are only 256 cells. Thus, at most 256 threads (8 warps) can run simultaneously, which is only a tiny fraction of the number of simultaneous threads your GPU should be able to manage. As a result, if you want a better speed-up, you need to provide more parallelism to the GPU, for example by increasing the data size or by computing many more independent instances at the same time (see the sketch below).
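As an illustration of that last point, here is a sketch of the same kernel launched over many more instances spread across many blocks (the sizes are arbitrary; a bounds guard is added because the grid may be rounded up past the array length):

import numpy as np
from numba import cuda

@cuda.jit
def logistic_cuda(array_in, array_out):
    pos = cuda.grid(1)
    if pos < array_in.size:           # guard against the rounded-up grid
        x = array_in[pos]
        for k in range(10*1000*1000):
            x = 3.9 * x * (1 - x)
        array_out[pos] = x

N = 256 * 1024                        # enough instances to occupy all SMs
threads_per_block = 256
blocks = (N + threads_per_block - 1) // threads_per_block

d_in = cuda.to_device(np.random.uniform(0.2, 0.9, N).astype('float32'))
d_out = cuda.device_array(N, dtype=np.float32)
logistic_cuda[blocks, threads_per_block](d_in, d_out)
cuda.synchronize()
out = d_out.copy_to_host()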
I have been starting to learn TensorFlow (on GPU) and have been playing with some of the basic functions, including random number generation with TensorFlow Probability. I have found the random number generation to be surprisingly slow. Here is the code I used to test:
import time
import tensorflow as tf
import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions
size = (100,100)
counts = np.random.randint(1000,size = size)
probs = 0.5*np.ones(size)
t_counts = tf.cast(tf.convert_to_tensor(counts),tf.float64)
t_probs = tf.convert_to_tensor(probs)
N = 20
samples = np.zeros((size[0],size[1],N))
tic = time.time()
for i in range(N):
    samples[:,:,i] = np.random.binomial(counts,probs)
print('Numpy took',time.time()-tic,'seconds')
t_samples = []
b = tfd.Binomial(t_counts,t_probs)
tic = time.time()
for i in range(N):
    t_samples.append(b.sample())
print('Tensorflow took',time.time()-tic,'seconds')
I expected a marginal speedup over numpy at these array sizes but instead found Tensorflow to be dramatically slower. For the given size the output I get is:
Numpy took 0.04088950157165527 seconds
Tensorflow took 404.8193895816803 seconds
which seems dramatic. I tried scaling up the size of the array to see how the time scaled, and for size = (200,200) I get:
Numpy took 0.16755127906799316 seconds
Tensorflow took 1649.4787373542786 seconds
So TensorFlow (ostensibly using the GPU) seems to be scaling in exactly the same way as NumPy using the CPU. Is this expected behavior? I haven't been able to find any benchmarks for TensorFlow random number generation, and it's not something I have specialized expertise in. Is there something I am doing wrong when setting this up?
I'm looking to parallelize multiple 1D FFTs using CUDA. I'm working on a GTX 1050 Ti (compute capability 6.1).
For instance, in the code attached I have a 3D input array 'data', and I want to do 1D FFTs over the second dimension of this array. The purpose is, of course, to speed up the execution time by an order of magnitude.
I'm able to use Python's scikit-cuda cufft package to run a batch of one 1D FFT, and the results match NumPy's FFT. The problem comes when I go to a real batch size: there, I'm not able to match NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the code attached, the parameter 'singleFFT' controls whether we schedule a batch of 1 or of many. Help in correcting the FFT output, and also in speeding up execution further (if possible), will be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray
# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0
if (1 == singleFFT):
    data_t = data[0,:,0]
    fftAxis = 0
    BATCH = np.int32(1)
else:
    data_t = data
    fftAxis = 1
    BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)
data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
    dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line in the code compares the result of NumPy's FFT with cuFFT's. With singleFFT=1 the result is True, while for singleFFT=0 (i.e. a batch of many 1D FFTs) the result is False.
Following my attempts, I would conclude that:
Using the cufft library from skcuda is a bit tricky, and getting to the correct FFT output might take a long time in development. I also noticed that there wasn't an order-of-magnitude difference in execution time between NumPy's FFT and skcuda's cufft.
Using CuPy, and arranging your data so that the FFT dimension is laid out in contiguous memory, gives an order-of-magnitude improvement in FFT compute time. For my case, the speedup was a little better than 10x!
Using CuPy for FFTs is a great option if one wants to stick to Python-based development only. The back-and-forth between C and Python when writing C GPU kernels is an added overhead that CuPy resolves very conveniently, even though CuPy itself handles laying out the plan and calling the FFT exec engine internally.
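For completeness, the CuPy approach described above looks roughly like this (a sketch; the sizes follow the question, and the key step is moving the FFT axis to the last, contiguous, dimension before transforming):

import numpy as np
import cupy as cp

nSamp, nChirp, nTx, nRx = 512, 256, 16, 16
data = (np.random.randn(nSamp, nChirp, nTx, nRx)
        + 1j*np.random.randn(nSamp, nChirp, nTx, nRx)).astype(np.complex64)

# Move the chirp (FFT) axis last so each transform reads contiguous memory
d_data = cp.asarray(np.moveaxis(data, 1, -1))   # shape (nSamp, nTx, nRx, nChirp)
d_fft = cp.fft.fft(d_data, axis=-1)
cp.cuda.Stream.null.synchronize()               # kernels launch asynchronously

# Compare against NumPy; a loose tolerance since the GPU stays in complex64
ref = np.fft.fft(np.moveaxis(data, 1, -1), axis=-1)
print(np.allclose(ref, cp.asnumpy(d_fft), atol=1e-4))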
I am writing code to compare numpy.fft.fft2 with PyCUDA, but the results do not match. Additionally, the PyCUDA results are inconsistent from run to run.
data file : https://nofile.io/f/bjGRQGRVSCG/gauss.npy
from pyfft.cuda import Plan
import numpy as np
from pycuda.tools import make_default_context
import pycuda.gpuarray as gpuarray
import pycuda.driver as cuda
import time
import matplotlib.pyplot as plt
cuda.init()
context = make_default_context()
data = np.load('gauss.npy')
data_complex = data.astype(np.complex64)
start_time = time.time()
plan = Plan((32,32))
gpu_data = gpuarray.to_gpu(data_complex)
plan.execute(gpu_data)
result = gpu_data.get()
print("--- %s seconds (FFT calculation pycuda)---" % (time.time() - start_time))
start_time_3 = time.time()
result_np = np.fft.fft2(data)
#print(result_np)
print("--- %s seconds (FFT calculation numpy.fft.fft)---" % (time.time() - start_time_3))
context.pop()
#plt.plot(result)
#plt.plot(result_np)
I'm starting to wonder whether we can even perform 2D FFT with pycuda?
pyfft.cuda is almost assuredly using cuFFT, which does not compute FFTs in exactly the same way as NumPy's fft (IIRC, even scipy.fft and np.fft can produce different results). You should read the documentation for each library in order to understand the differences. You can definitely perform 2D FFTs with PyCUDA; a quick cross-check sketch follows.
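As a sanity check, one way to compare a GPU 2D FFT against NumPy is to match precisions and use a loose tolerance. A sketch using CuPy (a simpler route than raw PyCUDA), assuming a 32x32 complex input as in the question:

import numpy as np
import cupy as cp

data = np.random.randn(32, 32).astype(np.complex64)
ref = np.fft.fft2(data)                          # NumPy computes in double precision
gpu = cp.asnumpy(cp.fft.fft2(cp.asarray(data)))  # CuPy stays in complex64
# single-precision GPU result vs double-precision CPU result: loose tolerance
print(np.allclose(ref, gpu, atol=1e-4))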