Im looking from smart ways to optimise this looped euclidean distance calculation. This calculation is looking for the mean distance from all other vectors.
As my vector arrays are really big to just do: eucl_dist = euclidean_distances(eigen_vs_cleaned)
Im running a loop row by row.
Typical eigen_vs_cleaned shape is at least (300000,1000) at the moment and I have to go up way more. (like 2000000,10000)
Any smarter way to do this?
eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0],dtype=float)
from sklearn.metrics.pairwise import euclidean_distances
for z in range(eigen_vs_cleaned.shape[0]):
if z%10000==0:
print(z)
eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[z].reshape(1, -1), eigen_vs_cleaned)
eucl_dist_meaned[z] = eucl_dist_temp.mean(axis=1)
Im no python/numpy guru but this is the first step I took optimising this. it runs way better on my MacPro at least.
from joblib import Parallel, delayed
import multiprocessing
import os
import tempfile
import shutil
from sklearn.metrics.pairwise import euclidean_distances
# Creat a temporary directory and define the array pat
path = tempfile.mkdtemp()
out_path = os.path.join(path,'out.mmap')
out = np.memmap(out_path, dtype=float, shape=eigen_vs_cleaned.shape[0], mode='w+')
eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0],dtype=float)
num_cores = multiprocessing.cpu_count()
def runparallel(row, out):
if row%10000==0:
print(row)
eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[row].reshape(1, -1), eigen_vs_cleaned)
out[row] = eucl_dist_temp.mean(axis=1)
##
nothing = Parallel(n_jobs=num_cores)(delayed(runparallel)(r, out) for r in range(eigen_vs_cleaned.shape[0]))
Then I save the output:
eucl_dist_meaned = np.array(out,copy=True,dtype=float)
Related
I have a rather simple parallelization question that I can't seem to work out. I had parallelized a simple matrix assignment using joblib in Python which worked nicely on my workstation, but now I need to run the code on a HPC and the as-is code is not playing nicely with MPI. A skeleton of the code is below (I have stripped out a lot of not-relevant computation). Basically I have a large matrix that I want to fill in and at each point the value is a sum over many energies and eigenvalues, so this is the 'slow step' of the calculation. When I run this on my workstation I just parallelize that fill in using Parallel and delayed from joblib but of course when I run this on the cluster using mpirun --bind-to none -n 16 python KZ_spectral_function.py | tee spectral.out for example the code runs basically in serial (although with some odd behavior).
So, what I think I need to do is to convert that joblib line over to mpi4py, include an if rank == 0: statement encompassing everything in the main function, and just modify the the contents of gen_spec_func() and divvy up the calls to spec_func() to the different cores. This is the part where I am stuck, as all the examples I have read that were simple enough for me to understand use some variation of COMMS.scatter() and then append the results to a list, as far as I can tell in a random order, and I don't know quite enough to adapt them to something where I want the results to go in a specific place in the matrix. Any help or advice would be greatly appreciated, as neither parallelization nor python are strengths of mine ...
Code Snippet (simplified):
import numpy as np
import numpy.linalg as lina
import time
from functools import partial
from joblib import Parallel, delayed
# Helper Functions
def get_eigenvals(k_cart,cellMap,Hwannier,G):
## [....] some linear algebra, not important
return Ek
def gen_spec_func(eigenvals,Nkpts,Energies,Sigma):
## This is really the only part that I care to parallelize
## This is the joblib version
num_cores=16
tempfunc = partial(spec_func,Energies,Sigma,eigenvals)
spectral = np.reshape(Parallel(n_jobs=num_cores)(delayed(tempfunc)(i,j) for j in range(0,Nkpts) for i in range(0,len(Energies))),(Nkpts,len(Energies))).T
return np.matrix(spectral)
def spec_func(Energies,Sigma,eigenvals,i,j) :
return sum([(1.0/( (Energies[i]-val)**2 + (Sigma)**2 )) for val in eigenvals[j,:]])
#--- Start of main script
Tstart = time.time()
# [...] Declare Constants & Parameters
# [...] Read data from disk
# [...] Some calculations on that data we want done in serial
Energies = [Emin + (Emax-Emin)*i/(Nenergies-1) for i in range(0,Nenergies)]
kzs = [kzi+(kzf-kzi)*l/(nkzs-1) for l in range(0,nkzs)]
DomainAvg = np.matrix([[0.0 for j in range(0,Nkpts)] for i in range(0,Nenergies)])
for kz in kzs:
## An outer loop over Kz
print("Starting loop for kz = ",kz)
# Generate the base k-grid (symmetric) in 1/A for convenience
# [...] Generate the appropriate kpoint grid
for angle in range(0,Nangles):
## Inner loop over rotation angles
#--- For each angle generate the kpoint grid for that domain
# [...] Calculate some eigenvalues, small matrices not a big deal, serial fine
#--- Now we Generate the spectral function for that grid
### Ok, this is the slow part that we want to parallelize
DomainAvg += gen_spec_func(eigenvals,Nkpts,Energies,Sigma)
if(angle%20 == 0):
Tend = time.time()
print("Completed iteration ", angle, "of ", Nangles, " at T = ", Tend-Tstart)
# Output the results (one file for each Kz)
DomainAvg = DomainAvg/Nangles
outfile = "Spectral"+str(kz)+".txt"
np.savetxt(outfile, DomainAvg)
# And we are done
Tend = time.time()
print("Total execution time was :", Tend-Tstart)
EDIT: A very very hack solution I came up with was to encode the matrix indices in the matrix itself as floats, then use scatter() and gather() to distribute the matrix, replace the value with the calculation output, and reassemble the matrix. This is of course not a good idea since it requires int<->float conversion but it was the only way I could come up with that didn't require rebuilding the entire matrix from the gathered data index by index (instead just using hstack() and reshape() to put it together). I feel like there must be some tool I am missing that assists in distributed calculation for arrays/matrices where the index matters so I would still be interested if someone has a tip/pointer in this regard.
Minimum Working Example:
import numpy.linalg as lina
import time
import math
from functools import partial
from mpi4py import MPI
#-- Standard Comms
COMM = MPI.COMM_WORLD
size = COMM.Get_size()
rank = COMM.Get_rank()
Nkpts = 3
Energies = [1.00031415926*i for i in range(0,11)]
#--- Now we Generate the spectral function for that grid
# This will be done in parallel using scatter/gather in MPI
if rank == 0:
# List that we will scatter to the different nodes
# Encode the matrix index from whcn each element came as a float
datalist = [float(j+i*Nkpts) for j in range(0,Nkpts) for i in range(0,len(Energies))]
data = np.array_split(datalist, COMM.Get_size())
else:
data = None
# Distribute to the different nodes
data = COMM.scatter(data, root=0)
print("I am processor ",rank," and my data is",data)
for index in range(0,len(data)) :
# Decode the indicies
j = data[index]%Nkpts
i = math.floor(data[index]/Nkpts)
data[index] = 100.100*j+Energies[i]
COMM.Barrier()
dataMPI = COMM.gather(data,root=0)
if(rank==0) :
spectral = np.reshape(np.hstack(dataMPI),(Nkpts,len(Energies))).T
spectral_func = np.matrix(spectral)
print(spectral_func)```
I've structured this in two sections, BACKGROUND and QUESTION. The Question is all the way at the bottom.
BACKGROUND:
Suppose I want to (using Dask distributed) do an embarrassingly parallel computation like summing 16 gigantic dataframes. I know that this is going to be blazing fast using CUDA but let's please stay with Dask for this example.
A basic way to accomplish this (using delayed) is:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np
#delayed
def gen_matrix():
return np.random.rand(1000, 1000)
#delayed
def calc_sum(matrices):
return reduce(lambda a, b: a + b, matrices)
if __name__ == '__main__':
num_matrices = 16
# Plop them into a big list
matrices = [gen_matrix() for _ in range(num_matrices)]
# Here's the Big Sum
matrices = calc_sum(matrices)
# Go!
with dd.Client('localhost:8786') as client:
f = client.submit(compute, matrices)
result = client.gather(f)
And here's the dask graph:
This certainly will work, BUT as the size of the matrices (see gen_matrix above) gets too large, the Dask distributed workers start to have three problems:
They time out sending data to the main worker performing the sum
The main worker runs out of memory gathering all of the matrices
The overall sum is not running in parallel (only matrix ganeration is)
Note that none of these issues are Dask's fault, it's working as advertised. I've just set up the computation poorly.
One solution is to break this into a tree computation, which is shown here, along with the dask visualization of that graph:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np
#delayed
def gen_matrix():
return np.random.rand(1000, 1000)
#delayed
def calc_sum(a, b):
return a + b
if __name__ == '__main__':
num_matrices = 16
# Plop them into a big list
matrices = [gen_matrix() for _ in range(num_matrices)]
# This tells us the depth of the calculation portion
# of the tree we are constructing in the next step
depth = int(math.log(num_matrices, 2))
# This is the code I don't want to have to manually write
for _ in range(depth):
matrices = [
calc_sum(matrices[i], matrices[i+1])
for i in range(0, len(matrices), 2)
]
# Go!
with dd.Client('localhost:8786') as client:
f = client.submit(compute, matrices)
result = client.gather(f)
And the graph:
QUESTION:
I would like to be able to get this tree generation done by either a library or perhaps Dask itself. How can I accomplish this?
And for those who are wondering, why not just use the code above? Because there are edge cases that I don't want to have to code for, and also because it's just more code to write :)
I have also seen this: Parallelize tree creation with dask
Is there something in functools or itertools that knows how to do this (and can be used with dask.delayed)?
Dask bag has a reduction/aggregation method that will generate tree-like DAG: fold.
The workflow would be to 'bag' the delayed objects and then fold them.
I'm attempting to identify elements in the euclidean distance matrix that fall under a certain threshold. I then take the positional arguments for this search and use them to compare elements in a second array (for sake of demonstration this array is the first eigenvector of PCA, but the sort is the most relevant part for my question). The application needs to be applicable for an unknown number of observations, but should run effectively on several million.
#
import numpy as np
from scipy.spatial.distance import cdist
threshold = 10
data = np.random.uniform((1, 2, 3), 5000)
searchValues = np.where(cdist(data, data) < threshold)
#
My problem is two fold.
Firstly the euclidean distance matrix quickly becomes too large for simply applying scipy.spatial.distance.cdist(). To solve this issue I apply the cdist function in batches over the dataset and implement the search iteratively.
#
cdist(data, data)
Traceback (most recent call last):
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-fb93ae543712>", line 1, in <module>
cdist(data, data)
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py", line 2142, in cdist
dm = np.zeros((mA, mB), dtype=np.double)
MemoryError
#
The second problem is a runtime issue that results from constructing distance matrix iteratively. When I institute my iterative approach the runtime increases exponentially. This isn't unexpected due to the nature of the iterative approach.
#
import numpy as np
import dask.array as da
from scipy.spatial.distance import cdist
import itertools
import timeit
threshold = 10
data = np.random.uniform(1, 100, (200000,40)) #Build random data
data = da.asarray(data)
it = round(data.shape[0]/10000)
dataArrays = [data[i*10000:(i+1)*10000] for i in range(0, it)]
comparisons = itertools.combinations(dataArrays, 2)
start = timeit.default_timer()
searchvalues = []
for comparison in comparisons:
searchvalues.append(np.where(cdist(comparison[0], comparison[1]) < threshold))
time = timeit.default_timer() - start
print(time)
#
Neither of these issues are unexpected due to the nature of the problem. To try and offset both problems I've tried using dask to implement both a large data framework in python, and insert parallelization in the batch process. However, this hasn't resulted in a significant improvement in the time calculation, and I have a pretty strict memory limitation with this iterative method in dask (requiring taking in batches of 1000 obs at a time.
from dask.diagnostics import ProgressBar
import dask.delayed
import dask.bag
#dask.delayed
def eucDist(comparison):
return da.asarray(cdist(comparison[0], comparison[1]))
#dask.delayed
def findValues(euclideanMatrix):
return np.where(euclideanMatrix < threshold)
start = timeit.default_timer()
searchvalues = []
test = []
for comparison in comparisons:
comp = dask.delayed(eucDist)(comparison)
test.append(comp)
look = []
with ProgressBar():
for element in test:
look.append(dask.delayed(findValues)(element).compute())
I'm hoping that I can parallelize the comparisons to increase my speed, but I'm not sure how to implement that in python. Any help with that, or any recommendations for how I can improve the initial comparison code would be appreciated.
You can calculate the Euclidean distance in Dask by using dask_distance.euclidean(x,y).
I believe that the dask-image package has some dask-enabled distance algorithms.
https://github.com/dask/dask-image
I am a beginner so please correct me if I go wrong somewhere.
I have a square matrix of size 1 million x 1 million.
I want to find the eigenvectors for it in pyspark. I know computeSVD gives me eigenvectors but those are through SVD and the result is a Dense Matrix which is a local data structure. I want the results which scipy.linalg.eig would give.
I saw there is a function EigenValueDecomposition using ARPACK in java and scala api for spark. Will it give same eigenvectors as eig in scipy? If yes, is there any way I can use it in pyspark? Or is there any alternate solution for the same. Can I use ARPACK directly in my code somehow or will I have to code Arnoldi iteration(for example) on my own?
Thanks for your help.
I have developed a python code to get a scipy sparse matrix and create a RowMatrix as the input to the computeSVD.
This is the part you need to convert the csr_matrix to list of SparseVectors. I use the parallel version since the sequential version is much slower and it is easy to make it parallel.
from pyspark.ml.linalg import SparseVector
from pyspark.mllib.linalg.distributed import RowMatrix
from multiprocessing.dummy import Pool as ThreadPool
from functools import reduce
from pyspark.sql import DataFrame
num_row, num_col = fullMatrix.shape
lst_total = [None] * num_row
selected_indices = [i for i in range(num_row)]
def addMllibSparseVector(idx):
curr = fullMatrix.getrow(idx)
arr_ind = np.argsort(curr.indices)
lst_total[idx] = (idx, SparseVector(num_col\
, curr.indices[arr_ind], curr.data[arr_ind]),)
pool = ThreadPool()
pool.map(addMllibSparseVector, selected_indices)
pool.close()
pool.join()
Then I create the dataframes using below code.
import math
lst_dfs = []
batch_size = 5000
num_range = math.ceil(num_row / batch_size)
lst_dfs = [None] * num_range
selected_dataframes = [i for i in range(num_range)]
def makeDataframes(idx):
start = idx * batch_size
end = min(start + batch_size, num_row)
lst_dfs[idx] = sqlContext.createDataFrame(lst_total[start:end]\
, ["id", "features"])
pool = ThreadPool()
pool.map(makeDataframes, selected_dataframes)
pool.close()
pool.join()
Then I reduce them to 1 dataframe and create the RowMatrix.
raw_df = reduce(DataFrame.unionAll,*lst_dfs)
raw_rdd = raw_df.select('features').rdd.map(list)
raw_rdd.cache()
mat = RowMatrix(raw_rdd)
svd = mat.computeSVD(100, computeU=True)
I simplified the code and haven't tested it completely. Please feel free to comment if something has problem.
I'm trying to do knn search on big data with limited memory.
I'm using HDF5 and python.
I tried bruteforce linear search(using pytables) and kd-tree search (using sklearn)
It's suprising but kd-tree method takes more time(maybe kd-tree will work better if we increase batch size? but I don't know optimal size also it limited by memory)
Now I'm looking for how to speed up calculations, I think HDF5 file can be tuned for individual PC, also norm calculation can be speeded maybe using nymexpr or some python tricks.
import numpy as np
import time
import tables
import cProfile
from sklearn.neighbors import NearestNeighbors
rows = 10000
cols = 1000
batches = 100
k= 10
#USING HDF5
vec= np.random.rand(1,cols)
data = np.random.rand(rows,cols)
fileName = 'C:\carray1.h5'
shape = (rows*batches, cols) # predefined size
atom = tables.UInt8Atom() #?
filters = tables.Filters(complevel=5, complib='zlib') #?
#create
# h5f = tables.open_file(fileName, 'w')
# ca = h5f.create_carray(h5f.root, 'carray', atom, shape, filters=filters)
# for i in range(batches):
# ca[i*rows:(i+1)*rows]= data[:]+i # +i to modify data
# h5f.close()
#can be parallel?
def test_bruteforce_knn():
h5f = tables.open_file(fileName)
t0= time.time()
d = np.empty((rows*batches,))
for i in range(batches):
d[i*rows:(i+1)*rows] = ((h5f.root.carray[i*rows:(i+1)*rows]-vec)**2).sum(axis=1)
print (time.time()-t0)
ndx = d.argsort()
print ndx[:k]
h5f.close()
def test_tree_knn():
h5f = tables.open_file(fileName)
# it will not work
# t0= time.time()
# nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(h5f.root.carray)
# distances, indices = nbrs.kneighbors(vec)
# print (time.time()-t0)
#need to concatenate distances, indices somehow
t0= time.time()
d = np.empty((rows*batches,))
for i in range(batches):
nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(h5f.root.carray[i*rows:(i+1)*rows])
distances, indices = nbrs.kneighbors(vec) # put in dict?
#d[i*rows:(i+1)*rows] =
print (time.time()-t0)
#ndx = d.argsort()
#print ndx[:k]
h5f.close()
cProfile.run('test_bruteforce_knn()')
cProfile.run('test_tree_knn()')
If I understand correctly your data has 1000 dimensions? If this is the case then it's expected that kd-tree won't fare well as it suffers from the curse of dimensionality.
You might want to have a look at Approximate Nearest Neighbors search methods instead. For instance have a look at flann.