knn search using HDF5 - python

I'm trying to do kNN search on big data with limited memory, using HDF5 and Python.
I tried brute-force linear search (using PyTables) and kd-tree search (using sklearn).
Surprisingly, the kd-tree method takes more time (maybe the kd-tree would work better if we increased the batch size? but I don't know the optimal size, and it's also limited by memory).
Now I'm looking for ways to speed up the calculations. I think the HDF5 file can be tuned for an individual PC, and the norm calculation could perhaps be sped up with numexpr or some Python tricks.
import numpy as np
import time
import tables
import cProfile
from sklearn.neighbors import NearestNeighbors

rows = 10000
cols = 1000
batches = 100
k = 10

# USING HDF5
vec = np.random.rand(1, cols)
data = np.random.rand(rows, cols)
fileName = 'C:\\carray1.h5'
shape = (rows*batches, cols)  # predefined size
atom = tables.UInt8Atom()  #?
filters = tables.Filters(complevel=5, complib='zlib')  #?

# create
# h5f = tables.open_file(fileName, 'w')
# ca = h5f.create_carray(h5f.root, 'carray', atom, shape, filters=filters)
# for i in range(batches):
#     ca[i*rows:(i+1)*rows] = data[:] + i  # +i to modify data
# h5f.close()

# can be parallel?
def test_bruteforce_knn():
    h5f = tables.open_file(fileName)
    t0 = time.time()
    d = np.empty((rows*batches,))
    for i in range(batches):
        d[i*rows:(i+1)*rows] = ((h5f.root.carray[i*rows:(i+1)*rows] - vec)**2).sum(axis=1)
    print(time.time() - t0)
    ndx = d.argsort()
    print(ndx[:k])
    h5f.close()

def test_tree_knn():
    h5f = tables.open_file(fileName)

    # it will not work
    # t0 = time.time()
    # nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(h5f.root.carray)
    # distances, indices = nbrs.kneighbors(vec)
    # print(time.time() - t0)

    # need to concatenate distances, indices somehow
    t0 = time.time()
    d = np.empty((rows*batches,))
    for i in range(batches):
        nbrs = NearestNeighbors(n_neighbors=k, algorithm='ball_tree').fit(h5f.root.carray[i*rows:(i+1)*rows])
        distances, indices = nbrs.kneighbors(vec)  # put in dict?
        # d[i*rows:(i+1)*rows] =
    print(time.time() - t0)
    # ndx = d.argsort()
    # print(ndx[:k])
    h5f.close()

cProfile.run('test_bruteforce_knn()')
cProfile.run('test_tree_knn()')

If I understand correctly, your data has 1000 dimensions? If that is the case, then it's expected that a kd-tree won't fare well, as it suffers from the curse of dimensionality.
You might want to have a look at Approximate Nearest Neighbors search methods instead. For instance, have a look at flann.
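For illustration, here is a minimal sketch of what an approximate search could look like through FLANN's pyflann bindings; the dataset below is a small random stand-in and the algorithm parameters are assumptions based on typical FLANN examples, so check the FLANN documentation for your version:

import numpy as np
from pyflann import FLANN  # assumes the pyflann bindings are installed

dataset = np.random.rand(10000, 1000)   # stand-in for the data stored in HDF5
query = np.random.rand(1, 1000)

flann = FLANN()
# One-shot approximate k-NN query; the kmeans-tree parameters are illustrative only.
indices, dists = flann.nn(dataset, query, 10,
                          algorithm='kmeans', branching=32, iterations=7, checks=16)
print(indices, dists)

Since the data in the question lives in HDF5, the index would have to be built from a subset or from batches that fit in memory.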

Related

Gatherv 2D numpy array mpi4py

I'm learning parallel computing through mpi4py. Since I deal with a large dataset, I need to preallocate the memory at the master process in order to avoid memory issues. That's the reason why I use the Scatterv and Gatherv methods. The code below only aims to allocate the memory, without doing any specific operation.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

if rank == 0:
    sendbuf = np.random.rand(4, 3)
    r, c = sendbuf.shape
    ave, res = divmod(c, nprocs)
    count = [ave + 1 if p < res else ave for p in range(nprocs)]
    count = np.array(count)
    print("count is ", count)
    # displacement: the starting index of each sub-task
    displ = [sum(count[:p]) for p in range(nprocs)]
    displ = np.array(displ)
else:
    sendbuf = None
    # initialize count on worker processes
    count = np.zeros(nprocs, dtype=np.int)
    displ = None

# broadcast count
comm.Bcast(count, root=0)

# initialize recvbuf on all processes
recvbuf = np.zeros((4, count[rank]))
comm.Scatterv([sendbuf, count, displ, MPI.DOUBLE], recvbuf, root=0)

a, b = recvbuf.shape
sendbuf2 = np.random.rand(a, b)
recvbuf2 = np.zeros((4, sum(count)))
comm.Gatherv(sendbuf2, [recvbuf2, count, displ, MPI.DOUBLE], root=0)
In the master process, I first define a random 2D array (sendbuf) of dimensions (4,3). What I want to do is scatter this matrix to the different processes by dividing it into columns (so preserving the number of rows). Then I initialize the recvbuf variable in order to receive the chunks of sendbuf. Then I use the Scatterv method to pass the information. I noticed that only the data in the first row are passed correctly. This is not really important, since in the real application the recvbuf variable is used only to pre-allocate the memory. At this point I redefine the recvbuf variable, and then I try to send the information back to the master node, but the code gives an error. I don't really understand what I'm doing wrong in the Gatherv part.
I tried to keep the example as simple as possible, so the code doesn't do anything specific. What I want to learn is how to correctly scatter and gather a 2D numpy array.
What 'error' do you have exactly?
Here's a working version of gatherv with 2D arrays (using Allgatherv). Keep in mind that the counts are scaled by the row length, since NumPy stores arrays row-major in memory.
from mpi4py import MPI
import numpy as np

comm_world = MPI.COMM_WORLD
my_rank = comm_world.Get_rank()
num_proc = comm_world.Get_size()

# Parameters for this script
rowlength = 2
sizes = 2*np.ones((num_proc), dtype=np.int32)
sizes[-1] = 1

# Construct some data
data = [np.array((), dtype=np.double) for _ in range(num_proc)]
data[my_rank] = np.array(my_rank + np.random.rand(sizes[my_rank], rowlength), np.double)

# Compute sizes and offsets for Allgatherv
sizes_memory = rowlength*sizes
offsets = np.zeros(num_proc)
offsets[1:] = np.cumsum(sizes_memory)[:-1]

if my_rank == 0:
    print(f"Total size {np.sum(sizes)}")
    print(f"Sizes: {sizes}")
    print(f"Sizes: {sizes_memory}")
    print(f"Offsets: {offsets}")

# Prepare buffer for Allgatherv
data_out = np.empty((np.sum(sizes), rowlength), dtype=np.double)
comm_world.Allgatherv(
    data[my_rank],
    recvbuf=[data_out, sizes_memory.tolist(), offsets.tolist(), MPI.DOUBLE])

if (my_rank == 0):
    print(f"Data_out has shape {data_out.shape}")
    print(data_out[:, 0])
Linux OS, MPI: 'mpirun (Open MPI) 4.1.2', MPI4PY: 'mpi4py 3.1.4'
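For comparison, here is a minimal Gatherv sketch along the same lines, with counts and displacements expressed in elements (rows times row length, because of the row-major layout); it gathers a different number of rows from each rank and is an illustrative example, not code from the question or the answer:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

ncols = 3                      # every rank contributes full rows of this width
nrows_local = rank + 1         # each rank sends a different number of rows
sendbuf = np.full((nrows_local, ncols), rank, dtype=np.float64)

# Counts and displacements are in elements, so scale the per-rank row counts by ncols.
rows_per_rank = np.array(comm.allgather(nrows_local))
counts = rows_per_rank * ncols
displs = np.concatenate(([0], np.cumsum(counts)[:-1]))

recvspec = None
if rank == 0:
    recvbuf = np.empty((rows_per_rank.sum(), ncols), dtype=np.float64)
    recvspec = [recvbuf, counts, displs, MPI.DOUBLE]

comm.Gatherv(sendbuf, recvspec, root=0)
if rank == 0:
    print(recvbuf)

Saved as, say, gatherv_rows.py (a hypothetical name), it can be run with mpirun -n 4 python gatherv_rows.py.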

How to combine many numpy arrays efficiently?

I am having difficulty trying to load 18k training data files for training with TensorFlow. The files are .npy files named 0.npy, 1.npy, ..., 18000.npy.
I looked around the web and came up with simple code to first read the files in the correct sequence and concatenate the training data, but it takes forever.
import numpy as np
import glob
import re
import tensorflow as tf

print("TensorFlow version: {}".format(tf.__version__))

files = glob.glob('D:/project/train/*.npy')
files.sort(key=lambda var: [int(x) if x.isdigit() else x for x in
                            re.findall(r'[^0-9]|[0-9]+', var)])
# print(files)

final_dataset = []
i = 0
for file in files:
    dataset = np.load(file, mmap_mode='r')
    print(i)
    # print("Size of dataset: {} ".format(dataset.shape))
    if (i == 0):
        final_dataset = dataset
    else:
        final_dataset = np.concatenate((final_dataset, dataset), axis=0)
    i = i + 1

print("Size of final_dataset: {} ".format(final_dataset.shape))
np.save('combined_train.npy', final_dataset)
'Combining' arrays in any way involves (1) creating an array with the two arrays' total size, and (2) copying their contents into it. If you do this each time you load an array, it repeats 18000 times, with the time per iteration growing each iteration (due to the ever-larger final_dataset).
A simple workaround is to append the arrays to a list, and then combine them all once at the end:
dataset = []
for file in files:
    data = np.load(file, mmap_mode='r')
    dataset.append(data)

final_dataset = np.concatenate(dataset, axis=0)
But beware: be sure final_dataset can actually fit in your RAM, else the program will crash. You can find out via ram_required = size_per_file * number_of_files. Relevant SO. (To speed things up even further, you can look into multiprocessing, but it's not simple to get working.)
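If every file has the same shape, another option (not from the answer) is to preallocate the combined array once and fill it slice by slice, which also avoids the temporary memory spike of the final np.concatenate; a minimal sketch, assuming the files list from the question and uniform per-file shapes:

import numpy as np

# Peek at one file to learn the per-file shape and dtype (assumed identical for all files).
first = np.load(files[0], mmap_mode='r')
rows_per_file, *rest = first.shape

# Allocate the output once, then copy each file into its slot.
final_dataset = np.empty((rows_per_file * len(files), *rest), dtype=first.dtype)
for i, file in enumerate(files):
    final_dataset[i * rows_per_file:(i + 1) * rows_per_file] = np.load(file, mmap_mode='r')

np.save('combined_train.npy', final_dataset)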

Need to find eigenvectors in pyspark for a non-symmetric square matrix through eigen value decomposition similar to scipy.linalg.eig

I am a beginner so please correct me if I go wrong somewhere.
I have a square matrix of size 1 million x 1 million.
I want to find the eigenvectors for it in pyspark. I know computeSVD gives me eigenvectors, but those come through the SVD, and the result is a DenseMatrix, which is a local data structure. I want the results that scipy.linalg.eig would give.
I saw there is an EigenValueDecomposition function using ARPACK in the Java and Scala APIs for Spark. Will it give the same eigenvectors as eig in scipy? If yes, is there any way I can use it in pyspark? Or is there an alternate solution for the same? Can I use ARPACK directly in my code somehow, or will I have to code the Arnoldi iteration (for example) on my own?
Thanks for your help.
I have developed Python code that takes a scipy sparse matrix and creates a RowMatrix as the input to computeSVD.
This is the part where the csr_matrix is converted to a list of SparseVectors. I use the parallel version since the sequential version is much slower, and it is easy to parallelize.
import numpy as np  # needed below for np.argsort
from pyspark.ml.linalg import SparseVector
from pyspark.mllib.linalg.distributed import RowMatrix
from multiprocessing.dummy import Pool as ThreadPool
from functools import reduce
from pyspark.sql import DataFrame

num_row, num_col = fullMatrix.shape
lst_total = [None] * num_row
selected_indices = [i for i in range(num_row)]

def addMllibSparseVector(idx):
    curr = fullMatrix.getrow(idx)
    arr_ind = np.argsort(curr.indices)
    lst_total[idx] = (idx, SparseVector(num_col,
                                        curr.indices[arr_ind], curr.data[arr_ind]),)

pool = ThreadPool()
pool.map(addMllibSparseVector, selected_indices)
pool.close()
pool.join()
Then I create the dataframes using the code below.
import math

lst_dfs = []
batch_size = 5000
num_range = math.ceil(num_row / batch_size)
lst_dfs = [None] * num_range
selected_dataframes = [i for i in range(num_range)]

def makeDataframes(idx):
    start = idx * batch_size
    end = min(start + batch_size, num_row)
    lst_dfs[idx] = sqlContext.createDataFrame(lst_total[start:end],
                                              ["id", "features"])

pool = ThreadPool()
pool.map(makeDataframes, selected_dataframes)
pool.close()
pool.join()
Then I reduce them to 1 dataframe and create the RowMatrix.
raw_df = reduce(DataFrame.unionAll, lst_dfs)
raw_rdd = raw_df.select('features').rdd.map(list)
raw_rdd.cache()
mat = RowMatrix(raw_rdd)
svd = mat.computeSVD(100, computeU=True)
I simplified the code and haven't tested it completely. Please feel free to comment if something has a problem.
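For context, the snippets above assume that a fullMatrix (a SciPy CSR matrix) and a sqlContext are already in scope; a minimal, hypothetical setup for trying them out could look like this (the names and sizes are placeholders, not from the answer):

import scipy.sparse as sp
from pyspark.sql import SparkSession

# A SparkSession also exposes createDataFrame, so it can stand in for sqlContext here.
spark = SparkSession.builder.appName("sparse-svd-example").getOrCreate()
sqlContext = spark

# A small random sparse matrix in CSR format, standing in for the real data.
fullMatrix = sp.random(10000, 500, density=0.01, format="csr", random_state=42)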

looped sklearn euclidean distances optimisation

I'm looking for smart ways to optimise this looped Euclidean distance calculation. The calculation looks for the mean distance from all other vectors.
As my vector arrays are really too big to just do: eucl_dist = euclidean_distances(eigen_vs_cleaned)
I'm running a loop row by row.
A typical eigen_vs_cleaned shape is at least (300000, 1000) at the moment, and I have to go up way more (like (2000000, 10000)).
Any smarter way to do this?
from sklearn.metrics.pairwise import euclidean_distances

eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0], dtype=float)
for z in range(eigen_vs_cleaned.shape[0]):
    if z % 10000 == 0:
        print(z)
    eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[z].reshape(1, -1), eigen_vs_cleaned)
    eucl_dist_meaned[z] = eucl_dist_temp.mean(axis=1)
I'm no Python/NumPy guru, but this is the first step I took to optimise this. It runs way better on my Mac Pro, at least.
import numpy as np
import multiprocessing
import os
import tempfile
import shutil
from joblib import Parallel, delayed
from sklearn.metrics.pairwise import euclidean_distances

# Create a temporary directory and define the array path
path = tempfile.mkdtemp()
out_path = os.path.join(path, 'out.mmap')
out = np.memmap(out_path, dtype=float, shape=eigen_vs_cleaned.shape[0], mode='w+')

eucl_dist_meaned = np.zeros(eigen_vs_cleaned.shape[0], dtype=float)
num_cores = multiprocessing.cpu_count()

def runparallel(row, out):
    if row % 10000 == 0:
        print(row)
    eucl_dist_temp = euclidean_distances(eigen_vs_cleaned[row].reshape(1, -1), eigen_vs_cleaned)
    out[row] = eucl_dist_temp.mean(axis=1)

##
nothing = Parallel(n_jobs=num_cores)(delayed(runparallel)(r, out) for r in range(eigen_vs_cleaned.shape[0]))
Then I save the output:
eucl_dist_meaned = np.array(out,copy=True,dtype=float)
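Another option (not from the question or the answer) is to replace the per-row loop entirely with a chunked, vectorised computation, using the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x.y, so that each chunk of rows is handled by a single matrix product; a minimal sketch with a small random stand-in for eigen_vs_cleaned:

import numpy as np

def mean_euclidean_distances(X, chunk=256):
    """Mean Euclidean distance from each row of X to all rows of X, computed in row chunks."""
    X = np.asarray(X, dtype=np.float64)
    sq_norms = np.einsum('ij,ij->i', X, X)          # ||y||^2 for every row
    out = np.empty(X.shape[0], dtype=np.float64)
    for start in range(0, X.shape[0], chunk):
        stop = min(start + chunk, X.shape[0])
        block = X[start:stop]
        # Squared distances from this chunk to all rows: ||x||^2 + ||y||^2 - 2*x.y
        sq = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * (block @ X.T)
        np.maximum(sq, 0.0, out=sq)                 # clip tiny negatives from round-off
        out[start:stop] = np.sqrt(sq).mean(axis=1)
    return out

# Small random stand-in for eigen_vs_cleaned, just to make the sketch runnable.
eigen_vs_cleaned = np.random.rand(5000, 100)
eucl_dist_meaned = mean_euclidean_distances(eigen_vs_cleaned)

The chunk size trades peak memory (chunk * n_rows * 8 bytes for the distance block) against the benefit of larger matrix products.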

Numba slower than pure Python in frequency counting

Given a data matrix with discrete entries represented as a 2D numpy array, I'm trying to compute the observed frequencies of some features (the columns), looking only at some instances (the rows of the matrix).
I can do that quite easily with numpy by applying bincount to each slice after some fancy slicing. Doing it in pure Python, with an external data structure as a count accumulator, is a C-style double loop.
import numpy
import numba

try:
    from time import perf_counter
except:
    from time import time
    perf_counter = time


def estimate_counts_numpy(data,
                          instance_ids,
                          feature_ids):
    """
    WRITEME
    """
    #
    # slicing the data array (probably memory consuming)
    curr_data_slice = data[instance_ids, :][:, feature_ids]
    estimated_counts = []
    for feature_slice in curr_data_slice.T:
        counts = numpy.bincount(feature_slice)
        #
        # checking just for the all 0 case:
        # this is not stable for not binary datasets TODO: fix it
        if counts.shape[0] < 2:
            counts = numpy.append(counts, [0], 0)
        estimated_counts.append(counts)
    return estimated_counts


@numba.jit(numba.types.int32[:, :](numba.types.int8[:, :],
                                   numba.types.int32[:],
                                   numba.types.int32[:],
                                   numba.types.int32[:],
                                   numba.types.int32[:, :]))
def estimate_counts_numba(data,
                          instance_ids,
                          feature_ids,
                          feature_vals,
                          estimated_counts):
    """
    WRITEME
    """
    #
    # actual counting
    for i, feature_id in enumerate(feature_ids):
        for instance_id in instance_ids:
            estimated_counts[i][data[instance_id, feature_id]] += 1
    return estimated_counts


if __name__ == '__main__':
    #
    # creating a large synthetic matrix, testing for performance
    rand_gen = numpy.random.RandomState(1337)
    n_instances = 2000
    n_features = 2000
    large_matrix = rand_gen.binomial(1, 0.5, (n_instances, n_features))

    #
    # random indexes too
    n_sample = 1000
    rand_instance_ids = rand_gen.choice(n_instances, n_sample, replace=False)
    rand_feature_ids = rand_gen.choice(n_features, n_sample, replace=False)
    binary_feature_vals = [2 for i in range(n_features)]

    #
    # testing
    numpy_start_t = perf_counter()
    e_counts_numpy = estimate_counts_numpy(large_matrix,
                                           rand_instance_ids,
                                           rand_feature_ids)
    numpy_end_t = perf_counter()
    print('numpy done in {0} secs'.format(numpy_end_t - numpy_start_t))

    binary_feature_vals = numpy.array(binary_feature_vals)
    #
    #
    curr_feature_vals = binary_feature_vals[rand_feature_ids]
    #
    # creating a data structure to hold the slices
    # (with numba I cannot use list comprehension?)
    # e_counts_numba = [[0 for val in range(feature_val)]
    #                   for feature_val in
    #                   curr_feature_vals]
    e_counts_numba = numpy.zeros((n_sample, 2), dtype='int32')

    numba_start_t = perf_counter()
    estimate_counts_numba(large_matrix,
                          rand_instance_ids,
                          rand_feature_ids,
                          binary_feature_vals,
                          e_counts_numba)
    numba_end_t = perf_counter()
    print('numba done in {0} secs'.format(numba_end_t - numba_start_t))
These are the times I get while running the above code:
numpy done in 0.2863295429997379 secs
numba done in 11.55551904299864 secs
My point here is that my implementation is even slower when I try to apply a jit with numba, so I highly suspect I am messing things up.
The reason your function is slow is that Numba has fallen back to object mode to compile the loop.
There are two problems:
Numba doesn't yet support chained indexing of multidimensional arrays, so you need to rewrite this:
estimated_counts[i][data[instance_id, feature_id]]
into this:
estimated_counts[i, data[instance_id, feature_id]]
Your explicit type signature is incorrect. All of your input arrays are actually int64, rather than int8/int32. Rather than fix your signature, you can rely on Numba's automatic JIT to detect the argument types and compile the right version. All you have to do is change the decorator to just @numba.jit. Just make sure you call the function once before you benchmark if you don't want to include compilation time.
With these changes, I benchmark Numba to be about 15% faster than NumPy for this function.
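Putting the answer's two changes together (tuple indexing and no explicit type signature), a corrected version of the question's function might look like this sketch; the surrounding setup (large_matrix, rand_instance_ids, rand_feature_ids, binary_feature_vals, n_sample) is assumed to come from the question's script:

import numba
import numpy

@numba.jit  # let Numba infer the argument types on the first call
def estimate_counts_numba_fixed(data, instance_ids, feature_ids, feature_vals, estimated_counts):
    # actual counting, with tuple indexing instead of chained indexing
    for i, feature_id in enumerate(feature_ids):
        for instance_id in instance_ids:
            estimated_counts[i, data[instance_id, feature_id]] += 1
    return estimated_counts

# Warm-up call so compilation time is not included in the benchmark, as the answer suggests.
warmup = numpy.zeros((n_sample, 2), dtype='int32')
estimate_counts_numba_fixed(large_matrix, rand_instance_ids, rand_feature_ids,
                            binary_feature_vals, warmup)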
