I have generated a large data frame by reading a large number of files in a directory. I have managed to parallelize the section that reads the files, and I take that data and generate the data frame for the next step, which is calculating a similarity matrix.
Now I am trying to calculate the cosine similarity between rows of the data frame. Since it is a large data frame, it takes a long time (hours) to run. How can I parallelize this process?
Here is my current code for calculating the cosine similarity, which runs on a single thread:
import numpy as np
from scipy.spatial.distance import cosine

df = df.fillna(0)
data = df.values
m, k = data.shape

mat = np.zeros((m, m))

"""
scipy cosine distance is between 0 and 2 instead of -1 to 1;
in that case 1 is 0 and 2 is -1
"""
for i in xrange(m):
    for j in xrange(m):
        if i != j:
            mat[i][j] = 1 - cosine(data[i, :], data[j, :])
        else:
            mat[i][j] = 1.  # 0 if we don't do 1-cosine()
First, I'm assuming your cosine is scipy.spatial.distance.cosine, whose key calculation is:
dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
So it looks like I can replace your double loop with:
data1 = data/np.linalg.norm(data,axis=1)[:,None]
mat1 = np.einsum('ik,jk->ij', data1, data1)
That is, normalize data once at the start, rather than inside every pair comparison, and then use einsum to calculate the whole set of dot products.
For a small test case (m,k=4,3), this is 25x faster than your double loop.
Cautions: I've only tested this against your code on one small data array.
scipy.spatial.distance.cosine (and the norm it uses) has some checks that I haven't implemented.
einsum, while fast for this sort of thing on modest size arrays, can get bogged down with larger ones, and will run into memory errors before your element by element dot. And the underlying dot library may be better tuned to handle multi-core machines.
But even if data is too large to handle with one call to einsum, you could break the calculation into blocks, e.g.
mat[n1:n2,m1:m2] = np.einsum('ik,jk->ij', data1[n1:n2,:], data1[m1:m2,:])
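For instance, a minimal sketch of that block-by-block version (the block size is arbitrary here; data1 is the normalized array from above, and mat ends up identical to the single einsum result):
block = 1000                       # arbitrary block size, tune to available memory
m = data1.shape[0]
mat = np.empty((m, m))
for n1 in range(0, m, block):
    n2 = min(n1 + block, m)
    for m1 in range(0, m, block):
        m2 = min(m1 + block, m)
        mat[n1:n2, m1:m2] = np.einsum('ik,jk->ij', data1[n1:n2, :], data1[m1:m2, :])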
I'd like to point you in the direction of https://docs.python.org/2/library/multiprocessing.html
Take note of pool.map(function, iterable)
Then build the set of triangular position tuples, write the appropriate function and fire away.
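For example, a rough sketch along those lines (this assumes data is the array from the question, defined at module level so that fork-based workers inherit it; with very cheap per-pair work, the inter-process overhead can easily dominate):
import itertools
import numpy as np
from multiprocessing import Pool
from scipy.spatial.distance import cosine

def pair_similarity(pair):
    """Worker: compute 1 - cosine distance for one (i, j) pair of rows."""
    i, j = pair
    return i, j, 1 - cosine(data[i, :], data[j, :])

if __name__ == '__main__':
    m = data.shape[0]
    pairs = itertools.combinations(xrange(m), 2)   # the upper-triangular position tuples
    pool = Pool()                                  # one worker per core by default
    mat = np.eye(m)                                # diagonal is 1, as in the original loop
    for i, j, sim in pool.map(pair_similarity, pairs):
        mat[i, j] = mat[j, i] = sim
    pool.close()
    pool.join()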
Related
Let's say that I have a data stream where a single data point is retrieved at a time:
import numpy as np

def next_data_point():
    """
    Mock a data stream. Data points will always be a positive float
    """
    return np.random.uniform(0, 1_000_000)
I need to be able to update a NumPy array and track the top-K smallest-values-so-far from this stream (or until the user decides when it is okay to stop the analysis via some check_stop_condition() function). Let's say we want to capture the top 1,000 smallest values from the stream, then a naive way to accomplish this might be:
k = 1000
topk = np.full(k, fill_value=np.inf, dtype='float')

while check_stop_condition():
    topk[:] = np.sort(np.append(topk, next_data_point()))[:k]
This works fine but is quite inefficient and can be slow if repeated millions of times since we are:
creating a new array every time
sorting the concatenated array every time
So, I came up with a different approach to address these 2 inefficiencies:
k = 1000
topk = np.full(k, fill_value=np.inf)

while check_stop_condition():
    data_point = next_data_point()
    idx = np.searchsorted(topk, data_point)
    if idx < k:
        topk[idx + 1 :] = topk[idx:-1]   # shift larger values right, dropping the largest
        topk[idx] = data_point
Here, I leverage np.searchsorted() to replace np.sort and to quickly find the insertion point, idx, for the next data point. I believe that np.searchsorted uses some sort of binary search and assumes that the initial array is pre-sorted first. Then, we shift the data in topk to accommodate and insert the new data point if and only if idx < k.
I haven't seen this being done anywhere, so my question is whether there is anything that can be done to make this even more efficient, especially in the way that I am shifting things around inside the if statement.
Sorting a huge array is very expensive, so it is not surprising that the second method is faster. However, the speed of the second method is probably bounded by the slow array copy. The complexity of the first method is O(n k log k), while the second method has a complexity of O(n (log k + k p)), with n the number of points and p the probability of the branch being taken.
To build a faster implementation, you can use a tree, more specifically a self-balancing binary search tree. Here is the algorithm:
topk = Tree()
maxi = np.inf
while check_stop_condition():           # O(n) iterations
    data_point = next_data_point()
    if len(topk) < 1000:                # O(1)
        topk.insert(data_point)         # O(log k)
    elif data_point < maxi:             # otherwise discard the value in O(1)
        topk.insert(data_point)         # O(log k)
        topk.deleteMaxNode()            # O(log k)
        maxi = topk.findMaxValue()      # O(log k)
The above algorithm runs in O(n log k). One can show that this complexity is optimal (using only comparisons of data points).
In practice, binary heaps can be a bit faster (with the same complexity). Indeed, they have several advantages over self-balancing binary search trees in this case:
they can be implemented in a very compact way in memory (reducing cache misses and memory consumption)
insertion of the first k=1000 items can be done in O(k) time and very quickly
Note that discarded values are handled in constant time. This happens a lot on huge random datasets, as most values quickly become bigger than maxi. One can even prove that random datasets can be processed in O(n) time (which is optimal).
Note that Python provides a standard heap implementation in the heapq module, which is probably a good starting point.
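As a concrete starting point, here is a minimal sketch of that idea with heapq, reusing check_stop_condition() and next_data_point() from the question (values are negated because heapq is a min-heap and we need to evict the current largest kept value):
import heapq
import numpy as np

k = 1000
heap = []                                   # holds at most k negated values
while check_stop_condition():
    x = next_data_point()
    if len(heap) < k:
        heapq.heappush(heap, -x)            # still filling the heap
    elif x < -heap[0]:                      # smaller than the current k-th smallest
        heapq.heapreplace(heap, -x)         # pop current max, push new value, O(log k)

topk = np.sort(-np.array(heap))             # sorted ascending, as in the question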
I am writing code to process a large point cloud. I have been able to successfully vectorize most of my code to make it efficient, however, I cannot think of a good way of achieving this:
I have an nx3 numpy array that represents points in 3d (x,y,z) with z >= 0. I want to create a list of length k, which is produced by bucketing the rows of the numpy array by z value. i.e. if a point has z value $z_i$, then this point should go into bucket $j =\lfloor z_i / d \rfloor$, where $d$ is the resolution of the buckets. $d$ is precomputed as $z_{max} / k$, where k is the fixed number of buckets. Right now, I have tried the following:
import math

buckets = [set() for _ in range(num_buckets)]
d = max_z / num_buckets

for i in range(len(points)):
    p = points[i]
    k = int(math.floor(p[2] / d))
    buckets[min(k, num_buckets - 1)] |= {tuple(p)}
However, this is very inefficient since I am iterating over the whole dataset of points (which is very large). On the other hand, it emphasizes independence from the number of buckets. I have a closed form for the index of a point given its z value, so I should not need to iterate over the bucket list.
That being said, num_buckets is much smaller than the number of points, so I also wrote this, which is much faster:
for i in range(num_buckets):
    ps = points[np.floor(in_fov[:, 2] / d) == i]
    buckets[i] = ps
This is certainly faster, but now the runtime depends on num_buckets, and the growth is fairly troubling since I am doing a numpy filtering on a large array for every bucket.
My question is whether there is a better way to do this: something that allocates each row of this nx3 array to a list index, leverages the fact that I can compute the index for each point independently, and hopefully eliminates the need to iterate over num_buckets.
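For example, a sketch of the kind of fully vectorized allocation being asked about (reusing points, d and num_buckets from above; the argsort/split grouping is only one possible way to gather the rows per bucket):
import numpy as np

# compute every bucket index in one vectorized shot, clamping to the last bucket
idx = np.minimum((points[:, 2] // d).astype(int), num_buckets - 1)

# group rows by bucket: sort by index, then split at the bucket boundaries
order = np.argsort(idx, kind='stable')
counts = np.bincount(idx, minlength=num_buckets)
buckets = np.split(points[order], np.cumsum(counts)[:-1])
# buckets[j] is an (n_j, 3) array of the points whose z value falls in bucket j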
I have an array of x,y,z coordinates of several (~10^10) points (only 5 shown here)
a= [[ 34.45 14.13 2.17]
[ 32.38 24.43 23.12]
[ 33.19 3.28 39.02]
[ 36.34 27.17 31.61]
[ 37.81 29.17 29.94]]
I want to make a new array with only those points which are at least some distance d away from all other points in the list. I wrote code using while loops:
import numpy as np
from scipy.spatial import distance

d = 0.1  # or some distance
i = 0
selected_points = []
while i < len(a):
    interdist = []
    j = i + 1
    while j < len(a):
        interdist.append(distance.euclidean(a[i], a[j]))
        j += 1
    if all(dis >= d for dis in interdist):
        selected_points.append(a[i])
    i += 1
This works, but it takes a really long time to perform this calculation. I read somewhere that while loops are very slow.
I was wondering if anyone has any suggestions on how to speed up this calculation.
EDIT: While my objective of finding the particles that are at least some distance away from all the others stays the same, I just realized that there is a serious flaw in my code. Say I have 3 particles: for the first iteration of i, the code calculates the distances 1->2 and 1->3. Suppose 1->2 is less than the threshold distance d, so the code throws away particle 1. For the next iteration of i, it only computes 2->3, and suppose it finds that this is greater than d, so it keeps particle 2. But that is wrong, since 2 should also be discarded along with particle 1. The solution by @svohara is the correct one!
For big data sets and low-dimensional points (such as your 3-dimensional data), sometimes there is a big benefit to using a spatial indexing method. One popular choice for low-dimensional data is the k-d tree.
The strategy is to index the data set. Then query the index using the same data set, to return the 2-nearest neighbors for each point. The first nearest neighbor is always the point itself (with dist=0), so we really want to know how far away the next closest point is (2nd nearest neighbor). For those points where the 2-NN is > threshold, you have the result.
from scipy.spatial import cKDTree as KDTree
import numpy as np
#a is the big data as numpy array N rows by 3 cols
a = np.random.randn(10**8, 3).astype('float32')
# This will create the index, prepare to wait...
# NOTE: took 7 minutes on my mac laptop with 10^8 rand 3-d numbers
# there are some parameters that could be tweaked for faster indexing,
# and there are implementations (not in scipy) that can construct
# the kd-tree using parallel computing strategies (GPUs, e.g.)
k = KDTree(a)
#ask for the 2-nearest neighbors by querying the index with the
# same points
(dists, idxs) = k.query(a, 2)
# (dists, idxs) = k.query(a, 2, n_jobs=4) # to use more CPUs on query...
#Note: 9 minutes for query on my laptop, 2 minutes with n_jobs=6
# So less than 10 minutes total for 10^8 points.
# If the second NN is > thresh distance, then there is no other point
# in the data set closer.
thresh_d = 0.1 #some threshold, equiv to 'd' in O.P.'s code
d_slice = dists[:, 1] #distances to second NN for each point
res = np.flatnonzero( d_slice >= thresh_d )
Here's a vectorized approach using distance.pdist -
# Store number of pts (number of rows in a)
m = a.shape[0]

# Get the first of the pairwise indices formed from the pairs of rows of a
# Simpler version, but a bit slow: idx1, _ = np.triu_indices(m, 1)
shifts_arr = np.zeros(m*(m-1)//2, dtype=int)
shifts_arr[np.arange(m-1, 1, -1).cumsum()] = 1
idx1 = shifts_arr.cumsum()

# Get the IDs of the first rows of pairs that are less than "d" apart, and
# select the rest of the rows using a boolean mask created with np.in1d over
# the entire range of row numbers of a. Index into a to get the selected points.
selected_pts = a[~np.in1d(np.arange(m), idx1[distance.pdist(a) < d])]
For a huge dataset like 10e10, we might have to perform the operations in chunks based on the system memory available.
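For example, a rough sketch of such a chunked variant (using scipy's cdist for each block rather than the pdist trick above, and assuming a is an (N, 3) NumPy array for which one block of pairwise distances fits in memory):
import numpy as np
from scipy.spatial.distance import cdist

chunk = 10000                                   # rows per block, tune to memory
keep = np.empty(len(a), dtype=bool)
for start in range(0, len(a), chunk):
    stop = min(start + chunk, len(a))
    dists = cdist(a[start:stop], a)             # distances from this block to all points
    dists[np.arange(stop - start), np.arange(start, stop)] = np.inf   # ignore self-distances
    keep[start:stop] = dists.min(axis=1) >= d   # nearest other point is at least d away
selected_pts = a[keep]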
Your algorithm is quadratic (~10^20 operations). Here is a linear approach if the distribution is nearly random.
Split your space into cubes of side d/sqrt(3), so that the diagonal of each cube is exactly d, and put each point in its cube.
Then, for each cube:
if it contains just one point, you only have to calculate distances to the points in a small neighborhood of surrounding cubes;
otherwise it contains several points that are all within d of each other, so there is nothing to do: they are all rejected.
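A rough sketch of that idea (assuming a is an (N, 3) NumPy array and d > 0; the two-cell neighborhood is my choice, since with cubes of side d/sqrt(3) a point up to two cells away can still be closer than d):
import numpy as np
from collections import defaultdict
from itertools import product

side = d / np.sqrt(3)
boxes = defaultdict(list)
for i, p in enumerate(a):
    boxes[tuple((p // side).astype(int))].append(i)   # box key from integer cell coords

selected_points = []
for key, idxs in boxes.items():
    if len(idxs) > 1:
        continue                                      # crowded box: all its points are rejected
    i = idxs[0]
    # compare the lone point only against points in boxes up to two cells away
    neighbours = [j for off in product(range(-2, 3), repeat=3)
                  if off != (0, 0, 0)
                  for j in boxes.get(tuple(np.asarray(key) + off), [])]
    if all(np.linalg.norm(a[i] - a[j]) >= d for j in neighbours):
        selected_points.append(a[i])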
Drop the append; it is probably really slow. You can preallocate a fixed-size array of distances and use [] to put each number in the right position.
Use min instead of all. You only need to check whether the minimum distance is bigger than d.
Actually, you can break out of the inner loop the moment you find a distance smaller than your limit, and then drop both points. That way you do not even have to save any distances (unless you need them later).
Since d(a,b) = d(b,a), you can run the inner loop only over the following points and forget about the distances you have already calculated. If you need them, you can look them up in the array, which is faster.
From your comment, I believe this would do, if you have no repeated points.
selected_points = []
for p1 in a:
    save_point = True
    for p2 in a:
        # array_equal avoids an elementwise != on rows; assumes no repeated points (see above)
        if not np.array_equal(p1, p2) and distance.euclidean(p1, p2) < d:
            save_point = False
            break
    if save_point:
        selected_points.append(p1)
In the end I check both (a,b) and (b,a) because you should not modify a list while processing it, but you could be smarter using some additional variables.
I have an array of doubles, roughly 200,000 rows by 100 columns, and I'm looking for a fast algorithm to find the rows that contain sequences most similar to a given pattern (the pattern can be anywhere from 10 to 100 elements). I'm using python, so the brute force method (code below: looping over each row and starting column index, and computing the Euclidean distance at each point) takes around three minutes.
The numpy.correlate function promises to solve this problem much faster (running over the same dataset in less than 20 seconds). However, it simply computes a sliding dot product of the pattern over the full row, meaning that to compare similarity I'd have to normalize the results first. Normalizing the cross-correlation requires computing the standard deviation of each slice of the data, which instantly negates the speed improvement of using numpy.correlate in the first place.
Is it possible to compute normalized cross-correlation quickly in python? Or will I have to resort to coding the brute force method in C?
import numpy as np

def norm_corr(x, y, mode='valid'):
    ya = np.array(y)
    slices = [x[pos:pos + len(y)] for pos in range(len(x) - len(y) + 1)]
    return [np.linalg.norm(np.array(z) - ya) for z in slices]

similarities = [norm_corr(arr, pointarray) for arr in arraytable]
If your data is in a 2D Numpy array, you can take a 2D slice from it (200000 rows by len(pattern) columns) and compute the norm for all the rows at once. Then slide the window to the right in a for loop.
import numpy as np

ROWS = 200000
COLS = 100
PATLEN = 20

# random data for example's sake
a = np.random.rand(ROWS, COLS)
pattern = np.random.rand(PATLEN)

tmp = np.empty([ROWS, COLS - PATLEN + 1])
for i in xrange(COLS - PATLEN + 1):
    window = a[:, i:i + PATLEN]
    tmp[:, i] = np.sum((window - pattern)**2, axis=1)
result = np.sqrt(tmp)
I'm just starting to play about with OpenCL, and I'm stuck on how to structure the program in a reasonably efficient manner (mainly avoiding lots of data transfer to/from the GPU, or wherever the work is being done).
What I'm trying to do is, given:
v = r*i + b*j + g*k
…I know v for various values of r, g and b, but i, j and k are unknown. I want to calculate reasonable values for i/j/k via brute force.
In other words, I have a bunch of "raw" RGB pixel values, and I have a desaturated version of these colours. I do not know the weightings (i/j/k) used to calculate the desaturated values.
My initial plan was to:
load the data into a CL buffer (so the input r/g/b values, and the output)
have a kernel which takes the three possible matrix values, and the various pixel-data buffers.
It then performs v = r*i + b*j + g*k, subtracts the computed v from the known value, and stores the difference in a "score" buffer
Another kernel calculates the RMS error for that value (if the difference is zero for all input values, the values for i/j/k are "correct")
I have this working (written using Python and PyCL, the code is here), but I'm wondering how I can parallelise this chunk of work further (by trying multiple i/j/k values at once).
One issue is that I have the 4 read-only buffers (3 for the input values, 1 for the expected values), but I need a separate "score" buffer for every combination of i/j/k.
Another issue is that the RMS calculation is the slowest part, since it's effectively single-threaded (totalling up all the values in "score" and taking the sqrt() of the total).
Basically, I'm wondering if there's a sensible way to structure such a program.
It seems like a task well-suited to OpenCL - hopefully the description of my goal wasn't too convoluted! As mentioned, my current code is here, and in case it is clearer, this is the Python version of what I'm trying to do:
import sys
import math
import random

def make_test_data(w = 128, h = 128):
    in_r, in_g, in_b = [], [], []
    print "Make raw data"
    for x in range(w):
        for y in range(h):
            in_r.append(random.random())
            in_g.append(random.random())
            in_b.append(random.random())

    # the unknown values
    mtx = [random.random(), random.random(), random.random()]
    print "Secret numbers were: %s" % mtx

    out_r = [(r*mtx[0] + g*mtx[1] + b*mtx[2]) for (r, g, b) in zip(in_r, in_g, in_b)]

    return {'in_r': in_r, 'in_g': in_g, 'in_b': in_b,
            'expected_r': out_r}

def score_matrix(ir, ig, ib, expected_r, mtx):
    ms = 0
    for i in range(len(ir)):
        val = ir[i] * mtx[0] + ig[i] * mtx[1] + ib[i] * mtx[2]
        ms += abs(val - expected_r[i]) ** 2
    rms = math.sqrt(ms / float(len(ir)))
    return rms

# Make random test data
test_data = make_test_data(16, 16)

lowest_rms = sys.maxint
closest = []

divisions = 10
for possible_r in range(divisions):
    for possible_g in range(divisions):
        for possible_b in range(divisions):
            pr, pg, pb = [x / float(divisions-1) for x in (possible_r, possible_g, possible_b)]
            rms = score_matrix(
                test_data['in_r'], test_data['in_g'], test_data['in_b'],
                test_data['expected_r'],
                mtx = [pr, pg, pb])
            if rms < lowest_rms:
                closest = [pr, pg, pb]
                lowest_rms = rms

print closest
Are the i,j,k sets independent? I assume they are. A few things hurt your performance:
running too many small kernels
using global memory for communication between score_matrix and rm_to_rms
You could rewrite both kernels into one with the following changes:
make it so that one OpenCL work-group works on a different i,j,k combination - you can pre-generate these on the CPU
in order to do 1, you need to process multiple elements of the array with one thread; you can do it like this:
int i = get_local_id(0);
float my_sum = 0;
for (; i < array_size; i += get_local_size(0)) {
    float val = in_r[i] * mtx_r + in_g[i] * mtx_g + in_b[i] * mtx_b;
    my_sum += pow(fabs(expect_r[i] - val), 2);
}
After this you write my_sum for each thread into local memory and sum it up with a reduction (an O(log n) algorithm).
Save the result into global memory.
Alternatively, if you need to compute the i,j,k combinations sequentially, you can look up the barrier and memory fence functions in the OpenCL specification and use those instead of running two kernels; just remember to sum up everything in the first step, write the result into global memory, synchronize all threads, and then sum up again.
There are two potential issues:
Kernel launch overhead may be large if the work required to process each of your images is small. This is what you would address by combining the evaluation of multiple i,j,k values in a single kernel.
Serialization of the sum calculation for the RMSE. This is likely the larger issue, currently.
To address (2), notice that summation can be evaluated in parallel, but it is not as trivial as mapping a function separately over every pixel in your input. That is because summation requires communicating values between neighboring elements, rather than treating all elements independently. This pattern is commonly called a reduction.
PyOpenCL includes high-level support for common reductions. What you want here is a sum reduction: pyopencl.array.sum(array).
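For instance, a minimal usage sketch (the score values here are stand-ins for whatever the scoring kernel produced):
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# stand-in for the per-pixel squared errors written by the scoring kernel
scores_host = np.random.rand(128 * 128).astype(np.float32)
scores_dev = cl_array.to_device(queue, scores_host)

total = cl_array.sum(scores_dev).get()      # the reduction runs on the device
rms = np.sqrt(total / scores_host.size)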
Looking further into how this is implemented in raw OpenCL, Apple's OpenCL docs include an example of parallel reduction for sum. The pieces most relevant to what you want to do are the kernel and the main and create_reduction_pass_counts functions of the host C program which runs the reduction.