I have started learning TensorFlow (on GPU) and have been playing with some of the basic functions, including random number generation with TensorFlow Probability. I have found the random number generation to be surprisingly slow. Here is the code I used to test:
import time
import tensorflow as tf
import numpy as np
import tensorflow_probability as tfp
tfd = tfp.distributions

size = (100, 100)
counts = np.random.randint(1000, size=size)
probs = 0.5 * np.ones(size)
t_counts = tf.cast(tf.convert_to_tensor(counts), tf.float64)
t_probs = tf.convert_to_tensor(probs)
N = 20

samples = np.zeros((size[0], size[1], N))
tic = time.time()
for i in range(N):
    samples[:, :, i] = np.random.binomial(counts, probs)
print('Numpy took', time.time() - tic, 'seconds')

t_samples = []
b = tfd.Binomial(t_counts, t_probs)
tic = time.time()
for i in range(N):
    t_samples.append(b.sample())
print('Tensorflow took', time.time() - tic, 'seconds')
I expected a marginal speedup over NumPy at these array sizes, but instead found TensorFlow to be dramatically slower. For the given size, the output I get is:
Numpy took 0.04088950157165527 seconds
Tensorflow took 404.8193895816803 seconds
which seems dramatic. I tried scaling up the array size to see how it scaled, and for size = (200,200) I get:
Numpy took 0.16755127906799316 seconds
Tensorflow took 1649.4787373542786 seconds
So TensorFlow (ostensibly using the GPU) seems to be scaling in exactly the same way as NumPy on the CPU. Is this expected behavior? I haven't been able to find any benchmarks for TensorFlow random number generation, and it's not something I have specialized expertise in. Is there something I am doing wrong when setting this up?
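One possible factor (an assumption on my part, not a confirmed diagnosis) is the per-call overhead of b.sample() inside an eager Python loop. Drawing all N samples in one vectorized call, or compiling the sampling step with tf.function, removes that loop overhead. A minimal sketch, with the distribution rebuilt so the snippet is self-contained and probs passed by keyword:
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

counts = tf.cast(tf.convert_to_tensor(np.random.randint(1000, size=(100, 100))), tf.float64)
probs = tf.convert_to_tensor(0.5 * np.ones((100, 100)))
b = tfd.Binomial(total_count=counts, probs=probs)

# Draw all 20 samples in one vectorized call instead of looping in Python
samples = b.sample(20)          # shape (20, 100, 100)

# Optionally compile the sampling step; the first call traces, later calls reuse the graph
@tf.function
def draw(n):
    return b.sample(n)

samples_compiled = draw(20)
Whether this recovers the expected GPU performance depends on where tfd.Binomial sampling actually runs, which I have not verified here.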
I have a large number of 2x2 matrices that I need multiplied together.
I can set up the general shape of the problem as
import numpy as np
import time
A_dim = 6*6
B_dim = 2**8
C_dim = B_dim
A = np.random.rand(A_dim,A_dim,2,2)
B = np.random.rand(B_dim,2,2)
C = np.random.rand(C_dim,2,2)
tic = time.perf_counter()
X = A[None,None,:,:,:,:] @ B[:,None,None,None,:,:] @ A[None,None,:,:,:,:] @ C[None,:,None,None,:,:]
toc = time.perf_counter()
print(f"matrix multiplication took {toc - tic:0.4f} seconds")
where I get
matrix multiplication took 14.4403 seconds
This seems to be a vectorized implementation, but is there anything I can do to speed it up? My NumPy build runs this on only one core, so perhaps it would be faster if I got OpenBLAS working? The problem is that each individual matrix product takes a very small amount of time to complete. Is there a better way to construct such a multidimensional array?
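One thing that may help (a sketch, not a benchmarked claim) is expressing the whole chain as a single np.einsum contraction with optimize=True, so NumPy can choose a pairwise contraction order instead of evaluating the broadcast products strictly left to right:
import numpy as np

A_dim = 6 * 6
B_dim = 2 ** 8
C_dim = B_dim
A = np.random.rand(A_dim, A_dim, 2, 2)
B = np.random.rand(B_dim, 2, 2)
C = np.random.rand(C_dim, 2, 2)

# X[p, q, i, j] = A[i, j] @ B[p] @ A[i, j] @ C[q], contracted in an optimized order
X = np.einsum('ijab,pbc,ijcd,qde->pqijae', A, B, A, C, optimize=True)
Since the output array alone is on the order of gigabytes, memory bandwidth may dominate no matter how the contraction is ordered.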
I was recently learning PyCUDA and planning to replace some code in a camera system to speed up image processing. That part was originally using cv2.filter2D. My intention is to accelerate the processing with the GPU.
Time for signal.convolve2d: 1.6639747619628906
Time for cusignal.convolve2d: 0.6955723762512207
Time for cv2.filter2D: 0.18787837028503418
However, it seems that cv2.filter2D is still the fastest of the three. If the input is a long list of images, could a custom PyCUDA kernel outperform cv2.filter2D?
import time
import cv2
from cusignal.test.utils import array_equal
import cusignal
import cupy as cp
import numpy as np
from scipy import signal
from scipy import misc

ascent = misc.ascent()
ascent = np.array(ascent, dtype='int16')
ascentList = [ascent] * 100
filterSize = 3
scharr = np.ones((filterSize, filterSize), dtype="float") * (1.0 / (filterSize * filterSize))

startTime = time.time()
for asc in ascentList:
    grad = signal.convolve2d(asc, scharr, boundary='symm', mode='same')
endTime = time.time()
print("Time for signal.convolve2d: " + str(endTime - startTime))

startTime = time.time()
for asc in ascentList:
    gpu_convolve2d = cp.asnumpy(cusignal.convolve2d(cp.asarray(asc), scharr, boundary='symm', mode='same'))
endTime = time.time()
print("Time for cusignal.convolve2d: " + str(endTime - startTime))
print("If signal equal to cusignal: " + str(array_equal(grad, gpu_convolve2d)))

startTime = time.time()
for asc in ascentList:
    opencvOutput = cv2.filter2D(asc, -1, scharr)
endTime = time.time()
print("Time for cv2.filter2D: " + str(endTime - startTime))
print("If cv2 equal to cusignal: " + str(array_equal(opencvOutput, gpu_convolve2d)))
In your timing analysis of the GPU, you are measuring the time to copy asc to the GPU, execute convolve2d, and transfer the answer back. Transfers to and from the GPU are very slow in the scheme of things. If you want a true comparison of the compute, just profile convolve2d.
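As a rough sketch of that (assuming the image and the filter are moved to the device once, outside the timed loop), timing only the kernel could look like the following; GPU work is asynchronous, so a synchronize is needed before stopping the timer:
import time
import cupy as cp
import cusignal
import numpy as np
from scipy import misc

ascent = np.array(misc.ascent(), dtype='int16')
scharr = np.ones((3, 3)) / 9.0

# Move the data to the GPU once, outside the timed region
d_ascent = cp.asarray(ascent)
d_scharr = cp.asarray(scharr)

startTime = time.time()
for _ in range(100):
    d_grad = cusignal.convolve2d(d_ascent, d_scharr, boundary='symm', mode='same')
cp.cuda.Stream.null.synchronize()   # wait for all queued GPU work to finish
endTime = time.time()
print("Time for cusignal.convolve2d (compute only):", endTime - startTime)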
Currently, cusignal.convolve2d is written in Numba. We are in the process of porting this to use CuPy raw kernels, and there will be an improvement. I don't have an ETA for convolve2d.
It looks like there might be an OpenCV CUDA version: https://github.com/opencv/opencv_contrib/blob/master/modules/cudafilters/src/cuda/filter2d.cu
Have you tried scipy.ndimage.filters.convolve - http://blog.rtwilson.com/convolution-in-python-which-function-to-use/
Also, checkout CuPy's convolve - https://github.com/cupy/cupy/blob/master/cupyx/scipy/ndimage/filters.py
Now to your original question. When trying to determine if the GPU will be faster than the CPU, you need to ensure there is enough work to keep the GPU busy. It is known that in some cases, where the data size is small, the CPU will perform faster.
I'm looking to parallelize multiple 1D FFTs using CUDA. I'm working on a GTX 1050 Ti (compute capability 6.1).
For instance, in the code attached below, I have a 3D input array 'data', and I want to do 1D FFTs over the second dimension of this array. The purpose is, of course, to speed up execution by an order of magnitude.
I'm able to use scikit-cuda's cufft package to run a batch of one 1D FFT, and the results match NumPy's FFT. The problem comes when I go to a real batch size: there, I'm not able to match NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the code attached, the parameter 'singleFFT' controls whether we schedule a batch of one or many. Help in correcting the FFT output, and also in speeding up execution further (if possible), would be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray

# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0

singleFFT = 0
if (1 == singleFFT):
    data_t = data[0,:,0]
    fftAxis = 0
    BATCH = np.int32(1)
else:
    data_t = data
    fftAxis = 1
    BATCH = np.int32(nSamp*nTx*nRx)

# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)

data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
    dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line in the code compares the result of NumPy's FFT with cuFFT's. With singleFFT=1 the result is True, while for singleFFT=0 (i.e. a batch of many 1D FFTs) the result is False.
After my attempts, I would conclude the following:
Using the cufft library from skcuda is a bit tricky, and getting to the correct FFT output might take a long time in development. I also noticed that there wasn't an order-of-magnitude difference in execution time between NumPy's FFT and skcuda's cufft.
Using CuPy, and arranging your data so that the FFT dimension is laid out in contiguous memory, gives an order-of-magnitude improvement in FFT compute time. In my case, the speedup was a little better than 10x!
Using CuPy for FFTs is a great option if one wants to stick to Python-based development only. The back-and-forth between C and Python that comes with writing C GPU kernels is an added overhead which CuPy very conveniently removes, even though CuPy itself lays out the plan and calls the FFT execution engine internally.
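A minimal sketch of that CuPy arrangement (an illustration of the idea, not the exact code used above): move the chirp axis to the last, contiguous position before transforming, so each 1D FFT reads contiguous memory:
import numpy as np
import cupy as cp

nSamp, nChirp, nTx, nRx = 512, 256, 16, 16
data = (np.random.randn(nSamp, nChirp, nTx, nRx)
        + 1j * np.random.randn(nSamp, nChirp, nTx, nRx)).astype(np.complex64)

# Move the FFT (chirp) axis to the end so each 1D transform is over contiguous memory
d_data = cp.asarray(np.ascontiguousarray(np.moveaxis(data, 1, -1)))

d_fft = cp.fft.fft(d_data, axis=-1)      # batched 1D FFTs over the last axis
cp.cuda.Stream.null.synchronize()

# Move the axis back and copy to the host; comparable to np.fft.fft(data, axis=1)
result = np.moveaxis(cp.asnumpy(d_fft), -1, 1)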
I use pyculib to perform a 3D FFT on a matrix in Anaconda 3.5. I just followed the example code posted on the website, but I found something interesting and don't understand why.
Performing a 3D FFT on a matrix with pyculib is correct only when using numpy.arange to create the matrix.
Here is the code:
from pyculib.fft.binding import Plan, CUFFT_C2C
import numpy as np
from numba import cuda

data = np.random.rand(26, 256, 256).astype(np.complex64)
orig = data.copy()
n = data.size  # cuFFT does not normalize the inverse transform

d_data = cuda.to_device(data)
fftplan = Plan.three(CUFFT_C2C, *data.shape)
fftplan.forward(d_data, d_data)
fftplan.inverse(d_data, d_data)
d_data.copy_to_host(data)

result = data / n
np.allclose(orig, result.real)
Finally, it turns out to be False, and the difference between orig and result is not negligible.
I tried some other data sets (not random numbers) and got the same wrong results.
I also tested without the inverse FFT:
from pyculib.fft.binding import Plan, CUFFT_C2C
import numpy as np
from numba import cuda
from scipy.fftpack import fftn,ifftn
data = np.random.rand(26,256,256).astype(np.complex64)
orig = data.copy()
orig_fft = fftn(orig)
d_data = cuda.to_device(data)
fftplan = Plan.three(CUFFT_C2C, *data.shape)
fftplan.forward(d_data, d_data)
d_data.copy_to_host(data)
np.allclose(orig_fft, data)
The result is also wrong.
The test code on the website uses numpy.arange to create the matrix, so I tried:
n = 26*256*256
data = np.arange(n, dtype=np.complex64).reshape(26,256,256)
The FFT result for this matrix is right.
Could anyone help point out why?
I don't use CUDA, but I think your problem is numerical in nature. The difference lies in the two data sets you are using. random.rand has a dynamic range of 0-1, and arange of 0-26*256*256. The FFT attempts to resolve spatial frequency components on the order of (range of values) / (number of points). For arange, this ratio is unity, and the FFT is numerically accurate. For rand, it is 1/(26*256*256) ~ 5.8e-7, which is close to the limit of single precision (complex64).
Just running FFT/IFFT on your numpy arrays without using CUDA shows similar differences.
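A minimal CPU-only sketch of that check (my own illustration, not the exact comparison referred to above), contrasting the forward/inverse round trip in single and double precision:
import numpy as np
from scipy.fftpack import fftn, ifftn

orig = np.random.rand(26, 256, 256)

# Single-precision round trip, as in the complex64 pyculib example
data32 = orig.astype(np.complex64)
err32 = np.abs(ifftn(fftn(data32)) - orig).max()

# Double-precision round trip for comparison
data64 = orig.astype(np.complex128)
err64 = np.abs(ifftn(fftn(data64)) - orig).max()

print(err32, err64)   # the single-precision error is much larger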
I have a Lenovo IdeaPad laptop with 8 GB of RAM and an Intel Core i5 processor. I have 60k data points, each 100-dimensional. I want to do k-NN, and for that I am running the LMNN algorithm to find a Mahalanobis metric.
The problem is that after 2 hours of running, a blank screen appears on my Ubuntu machine. I can't tell what the problem is: is my memory getting full, or is it something else?
Is there some way to optimize my code?
My dataset: data
My LMNN implementation:
import numpy as np
import sys
from modshogun import LMNN, RealFeatures, MulticlassLabels
from sklearn.datasets import load_svmlight_file

def main():
    # Get training file name from the command line
    traindatafile = sys.argv[1]

    # The training file is in libSVM format
    tr_data = load_svmlight_file(traindatafile)
    Xtr = tr_data[0].toarray()  # Converts sparse matrices to dense
    Ytr = tr_data[1]            # The training labels

    # Cast data to Shogun format to work with LMNN
    features = RealFeatures(Xtr.T)
    labels = MulticlassLabels(Ytr.astype(np.float64))

    # Number of target neighbours per example - tune this using validation
    k = 18

    # Initialize the LMNN package
    lmnn = LMNN(features, labels, k)
    init_transform = np.eye(Xtr.shape[1])

    # Choose an appropriate timeout
    lmnn.set_maxiter(200000)
    lmnn.train(init_transform)

    # Let LMNN do its magic and return a linear transformation
    # corresponding to the Mahalanobis metric it has learnt
    L = lmnn.get_linear_transform()
    M = np.matrix(np.dot(L.T, L))

    # Save the model for use in testing phase
    # Warning: do not change this file name
    np.save("model.npy", M)

if __name__ == '__main__':
    main()
Exact k-NN has scalability problems.
Scikit-learn has a documentation page (scaling strategies) on what to do in such situations (many algorithms have a partial_fit method, but unfortunately kNN doesn't).
If you are willing to trade some accuracy for speed, you can use something like approximate nearest neighbors.
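As a rough sketch of that idea (the Annoy library is my own choice here, not something prescribed above; any approximate-nearest-neighbour index would do), you could project the data with the transform L learned by LMNN and query an approximate index instead of doing exact k-NN:
import numpy as np
from annoy import AnnoyIndex

# Hypothetical inputs: Xtr is the (60000, 100) training matrix, L the transform learned by LMNN
Xtr = np.random.rand(60000, 100)
L = np.eye(100)

# Euclidean distance on L x equals the Mahalanobis distance with M = L.T @ L
Xproj = Xtr @ L.T

index = AnnoyIndex(Xproj.shape[1], 'euclidean')
for i, v in enumerate(Xproj):
    index.add_item(i, v)
index.build(10)            # more trees -> better accuracy, slower build

# Approximate 18 nearest neighbours of point 0
neighbours = index.get_nns_by_item(0, 18)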