Result of 3D FFT using pyculib is wrong - python

I use pyculib to perform 3D FFT on a matrix in Anaconda 3.5. I just followed the example code posted in the website. But I found something interesting and don't understand why.
Performing a 3D FFT on matrix with pyculib is correct only when using numpy.arange to create the matrix.
Here is the code:
from pyculib.fft.binding import Plan, CUFFT_C2C
import numpy as np
from numba import cuda
data = np.random.rand(26, 256, 256).astype(np.complex64)
orig = data.copy()
d_data = cuda.to_device(data)
fftplan = Plan.three(CUFFT_C2C, *data.shape)
fftplan.forward(d_data, d_data)
fftplan.inverse(d_data, d_data)
d_data.copy_to_host(data)
result = data / n
np.allclose(orig, result.real)
Finally, it turns out to be False. And the difference between orig and result is not a small number, not negligible.
I tried some other data sets (not random numbers), and get the some wrong results.
Also, I test without inverse FFT:
from pyculib.fft.binding import Plan, CUFFT_C2C
import numpy as np
from numba import cuda
from scipy.fftpack import fftn,ifftn
data = np.random.rand(26,256,256).astype(np.complex64)
orig = data.copy()
orig_fft = fftn(orig)
d_data = cuda.to_device(data)
fftplan = Plan.three(CUFFT_C2C, *data.shape)
fftplan.forward(d_data, d_data)
d_data.copy_to_host(data)
np.allclose(orig_fft, data)
The result is also wrong.
The test code on website, they use numpy.arange to create the matrix. And I tried:
n = 26*256*256
data = np.arange(n, dtype=np.complex64).reshape(26,256,256)
And the FFT result of this matrix is right.
Could anyone help to point out why?

I don't use CUDA, but I think your problem is numerical in nature. The difference lies in the two data sets you are using. random.rand has dynamic range of 0-1, and arange 0-26*256*256. The FFT attempts to resolve spatial frequency components on the order of range of values / number of points. For arange, this becomes unity, and the FFT is numerically accurate. For rand, this is 1/26*256*256 ~ 5.8e-7.
Just running FFT/IFFT on your numpy arrays without using CUDA shows similar differences.

Related

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

I'm looking to parallelize multiple 1d FFTs using CUDA. I'm working on a GTX 1050Ti with CUDA 6.1.
For instance in the code I attached, I have a 3d input array 'data', and I want to do 1d FFTs over the second dimension of this array. The purpose is, of course, to speed up the execution time by an order of magnitude.
I'm able to use Python's scikit-cuda's cufft package to run a batch of 1 1d FFT and the results match with NumPy's FFT. The problem comes when I go to a real batch size. There, I'm not able to match the NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the code attached, parameter 'singleFFT' controls whether we schedule a batch of 1 or many. Help in correcting the output FFT and also speeding up execution further (if possible) will be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray
# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0
if (1 == singleFFT):
data_t = data[0,:,0]
fftAxis = 0
BATCH = np.int32(1)
else:
data_t = data
fftAxis = 1
BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)
data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
print(np.allclose(dataFft,dataFft_gpu,atol=1e-6))
The last line in the code matches the result of NumPy's FFT with cuFFT. It could be seen with singleFFT=1, the result is True, while for singleFFT=0 (i.e. batch of many 1d FFTs), the result is False.
Post my attempts, I would want to conclude that:
Using cufft library from skcuda is a bit tricky and to get to the correct FFT output might take a long time, in development. I also noticed that there wasn't an order of magnitude difference in execution time between NumPy's FFT and cufft's FFT (from skcuda)
Using CuPy and arranging your data in a format so that the FFT dimension is laid out in contiguous memory gives an order of magnitude improvement in the FFT compute time. For my case, the order was a little better than 10!
Using CuPy for FFTs is a great option if one wants to stick to Py-based development only. Also the to and fro from C to Python when writing C GPU kernels is an added overhead which is very conveniently resolved with CuPy. Though CuPy itself calls laying out the plan and calling the FFT exec engine internally.

How to get the second smallest eigenvalue of laplacian matrix of a complex network with python?

I'm trying to compute the second smallest eigenvalue of the laplacian matrix of a complex network(with 10000 nodes) with python using the shift-invert mode, here is the code:
import numpy as np
import networkx as nx
from scipy import sparse
G = nx.watts_strogatz_graph(10000,4,0.1)
degree_dict = nx.degree(G)
degree_list = []
for i in degree_dict:
degree_list.append(degree_dict[i])
lap_matrix = sparse.diags(degree_list, 0)-nx.adjacency_matrix(G)
eigval, eigvec = sparse.linalg.eigsh(lap_matrix, 2, sigma=0, which='LM')
second_eigval = eigval[1]
when running above code, i got:
RuntimeError: Factor is exactly singular
Does the error mean the laplacian matrix is singular?
Any ideas as to how I should proceed?
Is there any other way to compute this second smallest eigenvalue(with Matlab or any other programming language)?
Your code works for me (SciPy 1.0.0) almost exactly as written, except I simplified the formation of degree_list (which threw a KeyError in your version)
import numpy as np
import networkx as nx
from scipy import sparse
G = nx.watts_strogatz_graph(10000,4,0.1)
degree_dict = nx.degree(G)
degree_list = [x[1] for x in degree_dict]
lap_matrix = sparse.diags(degree_list, 0)-nx.adjacency_matrix(G)
eigval, eigvec = sparse.linalg.eigsh(lap_matrix, 2, sigma=0, which='LM')
Now eigval is [1.48814294e-16, 4.88863211e-02]; the smallest eigenvalue is zero within machine precision but the second smallest is not.

Re-compose a Tensor after tensor factorization

I am trying to decompose a 3D matrix using python library scikit-tensor. I managed to decompose my Tensor (with dimensions 100x50x5) into three matrices. My question is how can I compose the initial matrix again using the decomposed matrix produced with Tensor factorization? I want to check if the decomposition has any meaning. My code is the following:
import logging
from scipy.io.matlab import loadmat
from sktensor import dtensor, cp_als
import numpy as np
//Set logging to DEBUG to see CP-ALS information
logging.basicConfig(level=logging.DEBUG)
T = np.ones((400, 50))
T = dtensor(T)
P, fit, itr, exectimes = cp_als(T, 10, init='random')
// how can I re-compose the Matrix T? TA = np.dot(P.U[0], P.U[1].T)
I am using the canonical decomposition as provided from the scikit-tensor library function cp_als. Also what is the expected dimensionality of the decomposed matrices?
The CP product of, for example, 4 matrices
can be expressed using Einstein notation as
or in numpy as
numpy.einsum('az,bz,cz,dz -> abcd', A, B, C, D)
so in your case you would use
numpy.einsum('az,bz->ab', P.U[0], P.U[1])
or, in your 3-matrix case
numpy.einsum('az,bz,cz->abc', P.U[0], P.U[1], P.U[2])
sktensor.ktensor.ktensor also have a method totensor() that does exactly this:
np.allclose(np.einsum('az,bz->ab', P.U[0], P.U[1]), P.totensor())
>>> True
See an explanation of CP here. You may also use tensorlearn package to rebuild the tensor.

The equivalence of Matlab sprand() in Python?

I am trying to translate a Matlab code snippet into a Python one. However, I am not very sure how to correctly implement the sprand() function.
This is how the Matlab code use sprand():
% n_z is an integer, n_dw is a matrix
n_p_z_dw = cell(n_z, 1); % n(d,w) * p(z|d,w)
for z = 1:n_z
n_p_z_dw{z} = sprand(n_dw);
And this is how I implement the above logic in Python:
n_p_z_dw = [None]*n_z # n(d,w) * p(z|d,w)
density = np.count_nonzero(n_dw)/float(n_dw.size)
for i in range(0, n_z):
n_p_z_dw[i] = scipy.sparse.rand(n_d, n_w, density=density)
It seems to work, but I am not very sure about this. Any comment or suggestion?
The following should be a relatively fast way, I think, for a sparse array A:
import scipy.sparse as sparse
import numpy as np
sparse.coo_matrix((np.random.rand(A.nnz),A.nonzero()),shape=A.shape)
This will construct a COO format sparse matrix: it uses A.nonzero() as the coordinates, and A.nnz (the number of nonzero entries in A) to find the number of random numbers to generate.
I wonder, though, whether this might be a useful addition to the scipy.sparse.rand function.

Computing cross-correlation function?

In R, I am using ccf or acf to compute the pair-wise cross-correlation function so that I can find out which shift gives me the maximum value. From the looks of it, R gives me a normalized sequence of values. Is there something similar in Python's scipy or am I supposed to do it using the fft module? Currently, I am doing it as follows:
xcorr = lambda x,y : irfft(rfft(x)*rfft(y[::-1]))
x = numpy.array([0,0,1,1])
y = numpy.array([1,1,0,0])
print xcorr(x,y)
To cross-correlate 1d arrays use numpy.correlate.
For 2d arrays, use scipy.signal.correlate2d.
There is also scipy.stsci.convolve.correlate2d.
There is also matplotlib.pyplot.xcorr which is based on numpy.correlate.
See this post on the SciPy mailing list for some links to different implementations.
Edit: #user333700 added a link to the SciPy ticket for this issue in a comment.
If you are looking for a rapid, normalized cross correlation in either one or two dimensions
I would recommend the openCV library (see http://opencv.willowgarage.com/wiki/ http://opencv.org/). The cross-correlation code maintained by this group is the fastest you will find, and it will be normalized (results between -1 and 1).
While this is a C++ library the code is maintained with CMake and has python bindings so that access to the cross correlation functions is convenient. OpenCV also plays nicely with numpy. If I wanted to compute a 2-D cross-correlation starting from numpy arrays I could do it as follows.
import numpy
import cv
#Create a random template and place it in a larger image
templateNp = numpy.random.random( (100,100) )
image = numpy.random.random( (400,400) )
image[:100, :100] = templateNp
#create a numpy array for storing result
resultNp = numpy.zeros( (301, 301) )
#convert from numpy format to openCV format
templateCv = cv.fromarray(numpy.float32(template))
imageCv = cv.fromarray(numpy.float32(image))
resultCv = cv.fromarray(numpy.float32(resultNp))
#perform cross correlation
cv.MatchTemplate(templateCv, imageCv, resultCv, cv.CV_TM_CCORR_NORMED)
#convert result back to numpy array
resultNp = np.asarray(resultCv)
For just a 1-D cross-correlation create a 2-D array with shape equal to (N, 1 ). Though there is some extra code involved to convert to an openCV format the speed-up over scipy is quite impressive.
I just finished writing my own optimised implementation of normalized cross-correlation for N-dimensional arrays. You can get it from here.
It will calculate cross-correlation either directly, using scipy.ndimage.correlate, or in the frequency domain, using scipy.fftpack.fftn/ifftn depending on whichever will be quickest.
For 1D array, numpy.correlate is faster than scipy.signal.correlate, under different sizes, I see a consistent 5x peformance gain using numpy.correlate. When two arrays are of similar size (the bright line connecting the diagonal), the performance difference is even more outstanding (50x +).
# a simple benchmark
res = []
for x in range(1, 1000):
list_x = []
for y in range(1, 1000):
# generate different sizes of series to compare
l1 = np.random.choice(range(1, 100), size=x)
l2 = np.random.choice(range(1, 100), size=y)
time_start = datetime.now()
np.correlate(a=l1, v=l2)
t_np = datetime.now() - time_start
time_start = datetime.now()
scipy.signal.correlate(in1=l1, in2=l2)
t_scipy = datetime.now() - time_start
list_x.append(t_scipy / t_np)
res.append(list_x)
plt.imshow(np.matrix(res))
As default, scipy.signal.correlate calculates a few extra numbers by padding and that might explained the performance difference.
>> l1 = [1,2,3,2,1,2,3]
>> l2 = [1,2,3]
>> print(numpy.correlate(a=l1, v=l2))
>> print(scipy.signal.correlate(in1=l1, in2=l2))
[14 14 10 10 14]
[ 3 8 14 14 10 10 14 8 3] # the first 3 is [0,0,1]dot[1,2,3]

Categories

Resources