I'm trying to identify elements of a Euclidean distance matrix that fall below a given threshold. I then use the positions returned by this search to compare elements of a second array (for the sake of demonstration this array is the first eigenvector from a PCA, but the search is the most relevant part of my question). The approach must work for an unknown number of observations, and should run efficiently on several million.
import numpy as np
from scipy.spatial.distance import cdist
threshold = 10
data = np.random.uniform((1, 2, 3), 5000)
searchValues = np.where(cdist(data, data) < threshold)
My problem is twofold.
First, the Euclidean distance matrix quickly becomes too large to build with a single call to scipy.spatial.distance.cdist(). To work around this, I apply cdist in batches over the dataset and run the search iteratively.
cdist(data, data)
Traceback (most recent call last):
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-fb93ae543712>", line 1, in <module>
cdist(data, data)
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\", line 2142, in cdist
dm = np.zeros((mA, mB), dtype=np.double)
The second problem is a runtime issue that results from constructing the distance matrix iteratively. With the iterative approach the runtime grows steeply, since the number of batch pairs grows quadratically with the number of batches. This isn't unexpected given the nature of the approach.
import numpy as np
import dask.array as da
from scipy.spatial.distance import cdist
import itertools
import timeit
threshold = 10
data = np.random.uniform(1, 100, (200000,40)) #Build random data
data = da.asarray(data)
it = round(data.shape[0]/10000)
dataArrays = [data[i*10000:(i+1)*10000] for i in range(0, it)]
comparisons = itertools.combinations(dataArrays, 2)
start = timeit.default_timer()
searchvalues = []
for comparison in comparisons:
    searchvalues.append(np.where(cdist(comparison[0], comparison[1]) < threshold))
time = timeit.default_timer() - start
Neither of these issues is unexpected given the nature of the problem. To try to offset both, I've used dask both as a framework for larger-than-memory data in Python and to parallelize the batch process. However, this hasn't significantly improved the run time, and the iterative method in dask runs into a fairly strict memory limit (it requires taking in batches of 1000 observations at a time).
from dask.diagnostics import ProgressBar
import dask.delayed
import dask.bag
def eucDist(comparison):
    return da.asarray(cdist(comparison[0], comparison[1]))
def findValues(euclideanMatrix):
    return np.where(euclideanMatrix < threshold)
start = timeit.default_timer()
searchvalues = []
test = []
for comparison in comparisons:
    comp = dask.delayed(eucDist)(comparison)
look = []
with ProgressBar():
for element in test:
I'm hoping to parallelize the comparisons to increase speed, but I'm not sure how to implement that in Python. Any help with that, or any recommendations for improving the initial comparison code, would be appreciated.

You can calculate the Euclidean distance in Dask by using dask_distance.euclidean(x,y).

I believe that the dask-image package has some dask-enabled distance algorithms.
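A side note on the batching loop in the question, offered as a sketch rather than a drop-in fix: itertools.combinations(dataArrays, 2) only pairs distinct batches, so neighbours inside the same batch are never compared, and round(n / 10000) can drop trailing rows. The helpers batch_slices and batch_pairs below are hypothetical names of my own, not from any library. They enumerate every block of the distance matrix on or above the diagonal; each block can then be passed to cdist independently, which is also what makes the work easy to hand to parallel workers.

```python
import itertools
import math

def batch_slices(n_rows, batch_size):
    """Split range(n_rows) into consecutive slices, keeping the short tail."""
    n_batches = math.ceil(n_rows / batch_size)
    return [slice(i * batch_size, min((i + 1) * batch_size, n_rows))
            for i in range(n_batches)]

def batch_pairs(n_rows, batch_size):
    """Return (rows, cols) slice pairs covering the upper triangle,
    including the diagonal blocks that combinations(..., 2) skips."""
    slices = batch_slices(n_rows, batch_size)
    return itertools.combinations_with_replacement(slices, 2)

# Example: 25 rows in batches of 10 gives 3 batches and 6 blocks.
blocks = list(batch_pairs(25, 10))

# Collect every (i, j) index pair with i <= j that the blocks touch.
covered = {(i, j)
           for rows, cols in blocks
           for i in range(rows.start, rows.stop)
           for j in range(cols.start, cols.stop)
           if i <= j}
```

Inside the loop, cdist(data[rows], data[cols]) yields block-local indices; adding rows.start and cols.start to the output of np.where maps them back to positions in the full matrix.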


Can Dask automatically create a tree to parallelize a computation and reduce the copies between workers?

I've structured this in two sections, BACKGROUND and QUESTION. The Question is all the way at the bottom.
Suppose I want to (using Dask distributed) do an embarrassingly parallel computation like summing 16 gigantic dataframes. I know that this is going to be blazing fast using CUDA but let's please stay with Dask for this example.
A basic way to accomplish this (using delayed) is:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np
@delayed
def gen_matrix():
    return np.random.rand(1000, 1000)
@delayed
def calc_sum(matrices):
    return reduce(lambda a, b: a + b, matrices)
if __name__ == '__main__':
    num_matrices = 16
    # Plop them into a big list
    matrices = [gen_matrix() for _ in range(num_matrices)]
    # Here's the Big Sum
    matrices = calc_sum(matrices)
    # Go!
    with dd.Client('localhost:8786') as client:
        f = client.submit(compute, matrices)
        result = client.gather(f)
And here's the dask graph:
This certainly will work, BUT as the size of the matrices (see gen_matrix above) gets too large, the Dask distributed workers start to have three problems:
They time out sending data to the main worker performing the sum
The main worker runs out of memory gathering all of the matrices
The overall sum is not running in parallel (only matrix generation is)
Note that none of these issues are Dask's fault; it's working as advertised. I've just set up the computation poorly.
One solution is to break this into a tree computation, which is shown here, along with the dask visualization of that graph:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np
@delayed
def gen_matrix():
    return np.random.rand(1000, 1000)
@delayed
def calc_sum(a, b):
    return a + b
if __name__ == '__main__':
    num_matrices = 16
    # Plop them into a big list
    matrices = [gen_matrix() for _ in range(num_matrices)]
    # This tells us the depth of the calculation portion
    # of the tree we are constructing in the next step
    depth = int(math.log(num_matrices, 2))
    # This is the code I don't want to have to manually write
    for _ in range(depth):
        matrices = [
            calc_sum(matrices[i], matrices[i+1])
            for i in range(0, len(matrices), 2)
        ]
    # Go!
    with dd.Client('localhost:8786') as client:
        f = client.submit(compute, matrices)
        result = client.gather(f)
And the graph:
I would like to be able to get this tree generation done by either a library or perhaps Dask itself. How can I accomplish this?
And for those who are wondering, why not just use the code above? Because there are edge cases that I don't want to have to code for, and also because it's just more code to write :)
I have also seen this: Parallelize tree creation with dask
Is there something in functools or itertools that knows how to do this (and can be used with dask.delayed)?
Dask bag has a reduction/aggregation method that generates a tree-like DAG: fold.
The workflow would be to 'bag' the delayed objects and then fold them.
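For comparison, the pairwise tree can also be built by a small generic helper. This is a minimal sketch, not a Dask API: tree_reduce is a hypothetical name, and it works with any binary function, so passing a dask.delayed-wrapped calc_sum should produce the same tree-shaped graph as the manual loop in the question while also handling list lengths that are not a power of two.

```python
import operator

def tree_reduce(binop, items):
    """Combine items pairwise, level by level, until one value remains.

    An odd element at the end of a level is carried up unchanged,
    so any non-empty list length works.
    """
    level = list(items)
    if not level:
        raise ValueError("tree_reduce() needs at least one item")
    while len(level) > 1:
        paired = [binop(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # carry the unpaired last element
            paired.append(level[-1])
        level = paired
    return level[0]

# Plain numbers: same result as a left-to-right sum.
total = tree_reduce(operator.add, range(1, 12))

# Recording the call structure shows the balanced shape for 4 inputs.
shape = tree_reduce(lambda a, b: (a, b), ["m0", "m1", "m2", "m3"])
```

With the question's code this would be used as tree_reduce(calc_sum, matrices), where calc_sum is the two-argument delayed function.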

How to schedule multiple 1d FFTs using Scikit-cuda FFT?

I'm looking to parallelize multiple 1d FFTs using CUDA. I'm working on a GTX 1050Ti with CUDA 6.1.
For instance in the code I attached, I have a 3d input array 'data', and I want to do 1d FFTs over the second dimension of this array. The purpose is, of course, to speed up the execution time by an order of magnitude.
I'm able to use Python's scikit-cuda's cufft package to run a batch of 1 1d FFT and the results match with NumPy's FFT. The problem comes when I go to a real batch size. There, I'm not able to match the NumPy's FFT output (which is the correct one) with cufft's output (which I believe isn't correct). In the code attached, parameter 'singleFFT' controls whether we schedule a batch of 1 or many. Help in correcting the output FFT and also speeding up execution further (if possible) will be greatly appreciated.
import numpy as np
from time import process_time
from skcuda import cufft as cf
import pycuda.autoinit
from pycuda import gpuarray
# params
nSamp = 512
nTx = 16
nRx = 16
nChirp = 256
NX = nChirp
# Uncomment the following line to generate same data always
# np.random.seed(seed=1)
data = (np.random.randn(nSamp,nChirp,nTx,nRx) + 1j*np.random.randn(nSamp,nChirp,nTx,nRx)).astype(np.complex64)
data = data.reshape(nSamp,-1,nTx*nRx)
dataShp0 = np.int32(data.shape[0])
dataShp2 = np.int32(data.shape[2])
idx1 = 0
idx2 = 0
idx3 = 0
singleFFT = 0
if (1 == singleFFT):
    data_t = data[0,:,0]
    fftAxis = 0
    BATCH = np.int32(1)
else:
    data_t = data
    fftAxis = 1
    BATCH = np.int32(nSamp*nTx*nRx)
# calculate and time NumPy FFT
t1 = process_time()
dataFft = np.fft.fft(data_t, axis=fftAxis)
t2 = process_time()
print('\nCPU NumPy time is: ',t2-t1)
data_o_gpu = gpuarray.empty((BATCH*NX),dtype=np.complex64)
# calculate and time GPU FFT
data_t = data_t.reshape((BATCH*NX))
t1 = process_time()
# transfer input data to Device
data_t_gpu = gpuarray.to_gpu(data_t)
# Make FFT plan
plan = cf.cufftPlan1d(NX, cf.CUFFT_C2C, BATCH)
# Execute FFT plan
res = cf.cufftExecC2C(plan, int(data_t_gpu.gpudata), int(data_o_gpu.gpudata), cf.CUFFT_FORWARD)
dataFft_gpu = data_o_gpu.get()
t2 = process_time()
if (0 == singleFFT):
    dataFft_gpu = dataFft_gpu.reshape((nSamp,-1,nTx*nRx))
print('\nGPU time is: ',t2-t1)
The last line in the code compares the result of NumPy's FFT with cuFFT's. With singleFFT=1 the result is True, while for singleFFT=0 (i.e. a batch of many 1d FFTs) the result is False.
After my attempts, I would conclude that:
Using the cufft library from skcuda is a bit tricky, and getting to the correct FFT output can take a long time in development. I also noticed that there wasn't an order of magnitude difference in execution time between NumPy's FFT and cufft's FFT (from skcuda).
Using CuPy and arranging your data so that the FFT dimension is laid out in contiguous memory gives an order of magnitude improvement in the FFT compute time. In my case, the speedup was a little better than 10x!
Using CuPy for FFTs is a great option if one wants to stick to Python-only development. The back and forth between C and Python when writing C GPU kernels is an added overhead that CuPy conveniently removes. CuPy itself handles creating the plan and calling the FFT execution engine internally.
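The layout point can be illustrated on the CPU with NumPy alone. The sketch below makes no GPU calls. It assumes that a batched 1-D plan treats the flat input buffer as BATCH back-to-back signals of length NX, and mimics that with a reshape. The array sizes are shrunk for illustration and the helper name batched_fft is mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samp, nx, n_ch = 4, 8, 3          # small stand-ins for nSamp, nChirp, nTx*nRx
data = (rng.standard_normal((n_samp, nx, n_ch))
        + 1j * rng.standard_normal((n_samp, nx, n_ch))).astype(np.complex64)
reference = np.fft.fft(data, axis=1)   # FFT over the second dimension
batch = n_samp * n_ch

def batched_fft(flat, batch, nx):
    """Mimic a batched 1-D plan: `batch` contiguous signals of length `nx`."""
    return np.fft.fft(flat.reshape(batch, nx), axis=1)

# Naive: flatten as-is. The FFT axis has stride n_ch, so each run of nx
# consecutive values mixes several channels and the result is wrong.
naive = batched_fft(data.reshape(-1), batch, nx)
naive = naive.reshape(n_samp, nx, n_ch)

# Fixed: move the FFT axis last and copy so each signal is contiguous,
# then undo the axis move on the output.
contig = np.ascontiguousarray(np.moveaxis(data, 1, -1))   # (n_samp, n_ch, nx)
fixed = batched_fft(contig.reshape(-1), batch, nx)
fixed = np.moveaxis(fixed.reshape(n_samp, n_ch, nx), -1, 1)

naive_matches = np.allclose(naive, reference, atol=1e-4)
fixed_matches = np.allclose(fixed, reference, atol=1e-4)
```

If the assumption holds, the same transpose-and-copy before gpuarray.to_gpu, followed by the inverse axis move on the output, is what the batched call in the question would need.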

Need to find eigenvectors in pyspark for a non-symmetric square matrix through eigen value decomposition similar to scipy.linalg.eig

I am a beginner so please correct me if I go wrong somewhere.
I have a square matrix of size 1 million x 1 million.
I want to find the eigenvectors for it in pyspark. I know computeSVD gives me eigenvectors but those are through SVD and the result is a Dense Matrix which is a local data structure. I want the results which scipy.linalg.eig would give.
I saw there is a function EigenValueDecomposition using ARPACK in the Java and Scala APIs for Spark. Will it give the same eigenvectors as eig in scipy? If yes, is there any way I can use it in pyspark? Or is there an alternative solution? Can I use ARPACK directly in my code somehow, or will I have to code Arnoldi iteration (for example) on my own?
Thanks for your help.
I have developed a python code to get a scipy sparse matrix and create a RowMatrix as the input to the computeSVD.
This is the part you need to convert the csr_matrix to list of SparseVectors. I use the parallel version since the sequential version is much slower and it is easy to make it parallel.
from import SparseVector
from pyspark.mllib.linalg.distributed import RowMatrix
from multiprocessing.dummy import Pool as ThreadPool
from functools import reduce
from pyspark.sql import DataFrame
num_row, num_col = fullMatrix.shape
lst_total = [None] * num_row
selected_indices = [i for i in range(num_row)]
def addMllibSparseVector(idx):
    curr = fullMatrix.getrow(idx)
    # SparseVector expects indices in increasing order
    arr_ind = np.argsort(curr.indices)
    lst_total[idx] = (idx, SparseVector(num_col\
        , curr.indices[arr_ind], curr.data[arr_ind]),)
pool = ThreadPool()
pool.map(addMllibSparseVector, selected_indices)
Then I create the dataframes using below code.
import math
lst_dfs = []
batch_size = 5000
num_range = math.ceil(num_row / batch_size)
lst_dfs = [None] * num_range
selected_dataframes = [i for i in range(num_range)]
def makeDataframes(idx):
    start = idx * batch_size
    end = min(start + batch_size, num_row)
    lst_dfs[idx] = sqlContext.createDataFrame(lst_total[start:end]\
        , ["id", "features"])
pool = ThreadPool()
pool.map(makeDataframes, selected_dataframes)
Then I reduce them to 1 dataframe and create the RowMatrix.
raw_df = reduce(DataFrame.unionAll, lst_dfs)
raw_rdd ='features')
mat = RowMatrix(raw_rdd)
svd = mat.computeSVD(100, computeU=True)
I simplified the code and haven't tested it completely. Please feel free to comment if something is wrong.
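On the question of coding the iteration by hand: the answer above computes an SVD, and for a non-symmetric matrix the singular vectors are in general not its eigenvectors. As a rough illustration of the matrix-free idea behind Arnoldi-type methods, here is a minimal power-iteration sketch in plain Python. It is a swapped-in, much simpler technique than what ARPACK does: it finds only the dominant eigenpair and assumes that eigenvalue is real and strictly largest in magnitude. The names power_iteration and matvec are hypothetical; matvec stands in for a distributed matrix-vector product.

```python
import math
import random

def power_iteration(matvec, n, iters=500, tol=1e-12, seed=0):
    """Estimate the dominant eigenpair using only matrix-vector products."""
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]                      # unit-length start vector
    lam = 0.0
    for _ in range(iters):
        w = matvec(v)
        # Rayleigh quotient v.(A v); v has unit length
        new_lam = sum(a * b for a, b in zip(v, w))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("iterate was mapped to the zero vector")
        v = [x / norm for x in w]
        done = abs(new_lam - lam) < tol
        lam = new_lam
        if done:
            break
    return lam, v

# Small non-symmetric example with eigenvalues 4 and 1.
A = [[2.0, 1.0],
     [2.0, 3.0]]

def matvec(v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

lam, vec = power_iteration(matvec, n=2)
```

Only matvec touches the matrix, so in a distributed setting the matrix can stay on the cluster while the vector remains small and local. For more than one eigenpair, or for complex eigenvalues, an Arnoldi-based solver is the appropriate tool.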

Speeding up Evaluation of Sympy Symbolic Expressions

A Python program I am currently working on (Gaussian process classification) is bottlenecking on evaluation of Sympy symbolic matrices, and I can't figure out what, if anything, I can do to speed it up. I've already ensured that other parts of the program are typed properly (in terms of numpy arrays) so calculations between them are properly vectorised, etc.
I looked into Sympy's codegen functions a bit (autowrap and binary_function in particular), but because my ImmutableMatrix object itself contains partial derivatives over elements of a symbolic matrix, there is a long list of 'unhashable' things which prevent me from using the codegen functionality.
Another possibility I looked into was using Theano, but after some initial benchmarks I found that while it built the initial partial derivative symbolic matrices much quicker, it seemed to be a few orders of magnitude slower at evaluation, the opposite of what I was seeking.
Below is a working, extracted snippet of the code I am currently working on.
import theano
import sympy
from sympy.utilities.autowrap import autowrap
from sympy.utilities.autowrap import binary_function
import numpy as np
import math
from datetime import datetime
# 'Vectorized' cdist that can handle symbols/arbitrary types - preliminary benchmarking put it at ~15 times faster than python list comprehension, but still notably slower (forgot at the moment) than cdist, of course
def sqeucl_dist(x, xs):
    m = np.sum(np.power(
        np.repeat(x[:,None,:], len(xs), axis=1) -
        np.resize(xs, (len(x), xs.shape[0], xs.shape[1])),
        2), axis=2)
    return m
def build_symbolic_derivatives(X):
    # Pre-calculate derivatives of inverted matrix to substitute values in the Squared Exponential NLL gradient
    f_err_sym, n_err_sym = sympy.symbols("f_err, n_err")
    # (1,n) shape 'matrix' (vector) of length scales for each dimension
    l_scale_sym = sympy.MatrixSymbol('l', 1, X.shape[1])
    # K matrix
    print("Building sympy matrix...")
    eucl_dist_m = sqeucl_dist(X/l_scale_sym, X/l_scale_sym)
    m = sympy.Matrix(f_err_sym**2 * math.e**(-0.5 * eucl_dist_m)
                     + n_err_sym**2 * np.identity(len(X)))
    # Element-wise derivative of K matrix over each of the hyperparameters
    print("Getting partial derivatives over all hyperparameters...")
    pd_t1 = datetime.now()
    dK_df = m.diff(f_err_sym)
    dK_dls = [m.diff(l_scale_sym) for l_scale_sym in l_scale_sym]
    dK_dn = m.diff(n_err_sym)
    print("Took: {}".format(datetime.now() - pd_t1))
    # Lambdify each of the dK/dts to speed up substitutions per optimization iteration
    print("Lambdifying ")
    l_t1 = datetime.now()
    dK_dthetas = [dK_df] + dK_dls + [dK_dn]
    dK_dthetas = sympy.lambdify((f_err_sym, l_scale_sym, n_err_sym), dK_dthetas, 'numpy')
    print("Took: {}".format(datetime.now() - l_t1))
    return dK_dthetas
# Evaluates each dK_dtheta pre-calculated symbolic lambda with current iteration's hyperparameters
def eval_dK_dthetas(dK_dthetas_raw, f_err, l_scales, n_err):
    l_scales = sympy.Matrix(l_scales.reshape(1, len(l_scales)))
    return np.array(dK_dthetas_raw(f_err, l_scales, n_err), dtype=np.float64)
dimensions = 3
X = np.random.rand(50, dimensions)
dK_dthetas_raw = build_symbolic_derivatives(X)
f_err = np.random.rand()
l_scales = np.random.rand(3)
n_err = np.random.rand()
t1 = datetime.now()
dK_dthetas = eval_dK_dthetas(dK_dthetas_raw, f_err, l_scales, n_err) # ~99.7%
print(datetime.now() - t1)
In this example, 5 50x50 symbolic matrices are evaluated, i.e. only 12,500 elements, taking 7 seconds. I've spent quite some time looking for resources on speeding operations like this up, and trying to translate it into Theano (at least until I found its evaluation slower in my case) and having no luck there either.
Any help greatly appreciated!
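One way around symbolic evaluation altogether is to differentiate by hand and evaluate with NumPy. This is a sketch under the assumption that the kernel is exactly the squared-exponential form built in the snippet, K = f_err^2 * exp(-0.5 * sum_k ((x_ik - x_jk) / l_k)^2) + n_err^2 * I. In that case dK/df_err = 2 f_err exp(...), dK/dl_k = f_err^2 exp(...) (x_ik - x_jk)^2 / l_k^3, and dK/dn_err = 2 n_err I. The function names se_kernel and se_kernel_grads are mine, and the finite-difference check at the end is only there to confirm the algebra.

```python
import numpy as np

def se_kernel(X, f_err, l_scales, n_err):
    """K = f_err^2 * exp(-0.5 * scaled squared distance) + n_err^2 * I."""
    diff = X[:, None, :] - X[None, :, :]            # (n, n, d) pairwise differences
    sq = np.sum((diff / l_scales) ** 2, axis=2)     # scaled squared distances
    return f_err ** 2 * np.exp(-0.5 * sq) + n_err ** 2 * np.eye(len(X))

def se_kernel_grads(X, f_err, l_scales, n_err):
    """Return [dK/df_err, dK/dl_1, ..., dK/dl_d, dK/dn_err] as one array."""
    diff = X[:, None, :] - X[None, :, :]
    e = np.exp(-0.5 * np.sum((diff / l_scales) ** 2, axis=2))
    dK_df = 2.0 * f_err * e
    # d/dl_k of exp(-0.5 * diff_k^2 / l_k^2) brings down diff_k^2 / l_k^3
    dK_dls = [f_err ** 2 * e * diff[:, :, k] ** 2 / l_scales[k] ** 3
              for k in range(X.shape[1])]
    dK_dn = 2.0 * n_err * np.eye(len(X))
    return np.array([dK_df] + dK_dls + [dK_dn])

rng = np.random.default_rng(0)
X = rng.random((50, 3))
f_err, l_scales, n_err = 0.7, np.array([0.5, 1.0, 1.5]), 0.3
grads = se_kernel_grads(X, f_err, l_scales, n_err)   # shape (5, 50, 50)

# Central finite-difference check of the first length-scale derivative.
h = 1e-6
step = np.array([h, 0.0, 0.0])
fd = (se_kernel(X, f_err, l_scales + step, n_err)
      - se_kernel(X, f_err, l_scales - step, n_err)) / (2 * h)
max_err = np.abs(fd - grads[1]).max()
```

The output order matches dK_dthetas in the question ([dK_df] + dK_dls + [dK_dn]). The (n, n, d) intermediate array is fine for 50 points but would need chunking for large n.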

Speed up sampling of kernel estimate

Here's a MWE of a much larger code I'm using. Basically, it performs a Monte Carlo integration over a KDE (kernel density estimate) for all values located below a certain threshold (the integration method was suggested over at this question BTW: Integrate 2D kernel density estimate).
import numpy as np
from scipy import stats
import time
# Generate some random two-dimensional data:
def measure(n):
    "Measurement model, return two coupled measurements."
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2
# Get data.
m1, m2 = measure(20000)
# Define limits.
xmin = m1.min()
xmax = m1.max()
ymin = m2.min()
ymax = m2.max()
# Perform a kernel density estimate on the data.
x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
values = np.vstack([m1, m2])
kernel = stats.gaussian_kde(values)
# Define point below which to integrate the kernel.
x1, y1 = 0.5, 0.5
# Get kernel value for this point.
tik = time.time()
iso = kernel((x1,y1))
print 'iso: ', time.time()-tik
# Sample from KDE distribution (Monte Carlo process).
tik = time.time()
sample = kernel.resample(size=1000)
print 'resample: ', time.time()-tik
# Filter the sample leaving only values for which
# the kernel evaluates to less than what it does for
# the (x1, y1) point defined above.
tik = time.time()
insample = kernel(sample) < iso
print 'filter/sample: ', time.time()-tik
# Integrate for all values below iso.
tik = time.time()
integral = insample.sum() / float(insample.shape[0])
print 'integral: ', time.time()-tik
The output looks something like this:
iso: 0.00259208679199
resample: 0.000817060470581
filter/sample: 2.10829401016
integral: 4.2200088501e-05
which clearly means that the filter/sample call is consuming almost all of the time the code uses to run. I have to run this block of code iteratively several thousand times so it can get pretty time consuming.
Is there any way to speed up the filtering/sampling process?
Here's a slightly more realistic MWE of my actual code with Ophion's multi-threading solution written into it:
import numpy as np
from scipy import stats
from multiprocessing import Pool
def kde_integration(m_list):
    m1, m2 = [], []
    for item in m_list:
        # Color data.
        # Magnitude data.
    # Define limits.
    xmin, xmax = min(m1), max(m1)
    ymin, ymax = min(m2), max(m2)
    # Perform a kernel density estimate on the data:
    x, y = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
    values = np.vstack([m1, m2])
    kernel = stats.gaussian_kde(values)
    out_list = []
    for point in m_list:
        # Compute the point below which to integrate.
        iso = kernel((point[0], point[1]))
        # Sample KDE distribution
        sample = kernel.resample(size=1000)
        #Create definition.
        def calc_kernel(samp):
            return kernel(samp)
        #Choose number of cores and split input array.
        cores = 4
        torun = np.array_split(sample, cores, axis=1)
        pool = Pool(processes=cores)
        results = pool.map(calc_kernel, torun)
        #Reintegrate and calculate results
        insample_mp = np.concatenate(results) < iso
        # Integrate for all values below iso.
        integral = insample_mp.sum() / float(insample_mp.shape[0])
    return out_list
# Generate some random two-dimensional data:
def measure(n):
    "Measurement model, return two coupled measurements."
    m1 = np.random.normal(size=n)
    m2 = np.random.normal(scale=0.5, size=n)
    return m1+m2, m1-m2
# Create list to pass.
m_list = []
for i in range(60):
    m1, m2 = measure(5)
# Call KDE integration function.
print 'Integral result: ', kde_integration(m_list)
The solution presented by Ophion works great on the original code I presented, but fails with the following error in this version:
Integral result: Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/", line 551, in __bootstrap_inner
File "/usr/lib/python2.7/", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/multiprocessing/", line 319, in _handle_tasks
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
I tried moving the calc_kernel function around since one of the answers in this question Multiprocessing: How to use on a function defined in a class? states that "the function that you give to map() must be accessible through an import of your module"; but I still can't get this code to work.
Any help will be very much appreciated.
Add 2
Implementing Ophion's suggestion to remove the calc_kernel function and simply using:
results = pool.map(kernel, torun)
works to get rid of the PicklingError but now I see that if I create an initial m_list of anything more than around 62-63 items I get this error:
Traceback (most recent call last):
File "~/", line 67, in <module>
print 'Integral result: ', kde_integration(m_list)
File "~/", line 38, in kde_integration
pool = Pool(processes=cores)
File "/usr/lib/python2.7/multiprocessing/", line 232, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild)
File "/usr/lib/python2.7/multiprocessing/", line 161, in __init__
File "/usr/lib/python2.7/", line 494, in start
_start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
Since my actual list in my real implementation of this code can have up to 2000 items, this issue renders the code unusable. Line 38 is this one:
pool = Pool(processes=cores)
so apparently it has something to do with the number of cores I'm using?
This question, "Can't start a new thread error" in Python, suggests checking the number of running threads at the moment the error occurs. I checked, and it always crashes when it reaches 374 threads. How can I code around this issue?
Here's the new question dealing with this last issue: Thread error: can't start new thread
Probably the easiest way to speed this up is to parallelize kernel(sample):
Taking this code fragment:
tik = time.time()
insample = kernel(sample) < iso
print 'filter/sample: ', time.time()-tik
#filter/sample: 1.94065904617
Change this to use multiprocessing:
from multiprocessing import Pool
tik = time.time()
#Create definition.
def calc_kernel(samp):
    return kernel(samp)
#Choose number of cores and split input array.
cores = 4
torun = np.array_split(sample, cores, axis=1)
pool = Pool(processes=cores)
results = pool.map(calc_kernel, torun)
#Reintegrate and calculate results
insample_mp = np.concatenate(results) < iso
print 'multiprocessing filter/sample: ', time.time()-tik
#multiprocessing filter/sample: 0.496874094009
Double check they are returning the same answer:
print np.all(insample==insample_mp)
A 3.9x improvement on 4 cores. Not sure what you are running this on, but after about 6 processors your input array size is not large enough to get considerable gains. For example, using 20 processors it's only about 5.8x faster.
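If this snippet ends up inside a loop, creating a new Pool on every pass without closing it leaves worker processes and helper threads behind, which is a plausible cause of the "can't start new thread" error described in the question. Below is a minimal sketch of the alternative: one pool created once and reused for every point. It is written in Python 3 and uses the thread-backed multiprocessing.pool.ThreadPool plus a stand-in density function (both are my substitutions, so the example runs without SciPy). With a real KDE you would pass the kernel object to map and would likely use a process Pool instead.

```python
import math
from multiprocessing.pool import ThreadPool

def density(chunk):
    """Stand-in for kernel(chunk): evaluate a 1-D standard normal pdf."""
    return [math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi) for x in chunk]

def split(seq, parts):
    """Split seq into `parts` nearly equal consecutive chunks."""
    size, extra = divmod(len(seq), parts)
    chunks, start = [], 0
    for i in range(parts):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(seq[start:stop])
        start = stop
    return chunks

def fraction_below(pool, sample, iso, workers):
    """Fraction of sample points whose density is below iso."""
    results = pool.map(density, split(sample, workers))
    values = [v for part in results for v in part]
    return sum(v < iso for v in values) / float(len(values))

workers = 4
sample = [i / 100.0 - 3.0 for i in range(601)]      # grid on [-3, 3]
points = [0.0, 1.0, 2.0]

# The pool is created once, reused for every point, and closed on exit.
with ThreadPool(processes=workers) as pool:
    fractions = [fraction_below(pool, sample, density([p])[0], workers)
                 for p in points]
```

The key design choice is that the with block wraps the whole loop, so the number of live workers stays fixed no matter how many points are processed.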
The claim in the comments section of this article (link below) is
"SciPy’s gaussian_kde doesn’t use FFT, while there is a statsmodels implementation that does"
…which is a possible cause of the observed poor performance. It goes on to report orders of magnitude improvement using FFT. See #jseabold's reply.
Disclaimer: I have no experience with statsmodels or scipy.

