I am writing code to process a large point cloud. I have been able to successfully vectorize most of my code to make it efficient; however, I cannot think of a good way of achieving this:
I have an nx3 numpy array that represents points in 3d (x, y, z) with z >= 0. I want to create a list of length k, produced by bucketing the rows of the numpy array by z value: if a point has z value $z_i$, then it should go into bucket $j = \lfloor z_i / d \rfloor$, where $d$ is the resolution of the buckets. $d$ is precomputed as $z_{max} / k$, where $k$ is the fixed number of buckets. Right now, I have tried the following:
import math

buckets = [set() for _ in range(num_buckets)]
d = max_z / num_buckets
for i in range(len(points)):
    p = points[i]
    k = int(math.floor(p[2] / d))
    buckets[min(k, num_buckets - 1)] |= {tuple(p)}
However, this is very inefficient, since I am iterating over the whole dataset of points (which is very large) in Python. On the other hand, its runtime is independent of the number of buckets. I have a closed form for the index of a point given its z value, so I should not need to iterate over the bucket list.
That being said, num_buckets is much smaller than the number of points, so I also wrote this, which is much faster:
for i in range(num_buckets):
    ps = points[np.floor(points[:, 2] / d) == i]
    buckets[i] = ps
This is certainly faster, but now the runtime depends on num_buckets, and the growth is fairly troubling since I am doing a numpy filtering on a large array for every bucket.
My question is whether there is a better way to do this: a way to allocate each row of this nx3 array to a list index that leverages the fact that I can compute the index for each point independently, and that hopefully eliminates the need to iterate over num_buckets.
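For what it's worth, computing the bucket index of every point at once is easy; it is the grouping into a list that I am unsure how to do without a loop. A sketch of one idea (grouping via a single argsort is a guess on my part, not something I know to be the fastest option):

idx = np.minimum((points[:, 2] // d).astype(int), num_buckets - 1)  # closed-form index, clamped
order = np.argsort(idx, kind='stable')                              # group equal indices together
sorted_idx = idx[order]
boundaries = np.flatnonzero(np.diff(sorted_idx)) + 1                 # where the bucket index changes
groups = np.split(points[order], boundaries)
buckets = [np.empty((0, 3))] * num_buckets                           # empty buckets by default
for b, g in zip(np.unique(sorted_idx), groups):                      # place the non-empty groups
    buckets[b] = g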
Let's say that I have a data stream where a single data point is retrieved at a time:
import numpy as np

def next_data_point():
    """
    Mock a data stream. Data points will always be a positive float.
    """
    return np.random.uniform(0, 1_000_000)
I need to be able to update a NumPy array and track the top-K smallest-values-so-far from this stream (or until the user decides when it is okay to stop the analysis via some check_stop_condition() function). Let's say we want to capture the top 1,000 smallest values from the stream, then a naive way to accomplish this might be:
k = 1000
topk = np.full(k, fill_value=np.inf, dtype='float')
while check_stop_condition():
    topk[:] = np.sort(np.append(topk, next_data_point()))[:k]
This works fine but is quite inefficient and can be slow if repeated millions of times since we are:
creating a new array every time
sorting the concatenated array every time
So, I came up with a different approach to address these 2 inefficiencies:
k = 1000
topk = np.full(k, fill_value=np.inf)
while check_stop_condition():
    data_point = next_data_point()
    idx = np.searchsorted(topk, data_point)
    if idx < k:
        topk[idx + 1:] = topk[idx:-1]  # shift the larger values one slot to the right
        topk[idx] = data_point
Here, I leverage np.searchsorted() to replace np.sort and to quickly find the insertion point, idx, for the next data point. np.searchsorted uses a binary search and assumes that the array is already sorted. Then, we shift the data in topk to accommodate and insert the new data point if and only if idx < k.
I haven't seen this done anywhere, so my question is whether there is anything that can be done to make this even more efficient, especially in the way I am shifting things around inside the if statement.
Sorting a large array on every iteration is very expensive, so it is not surprising that the second method is faster. However, the speed of the second method is probably bounded by the slow array copy. The complexity of the first method is O(n k log(k)), while the second method has a complexity of O(n (log(k) + k p)), with n the number of points and p the probability of the branch being taken.
To build a faster implementation, you can use a tree, more specifically a self-balancing binary search tree. Here is the algorithm:
topk = Tree()
maxi = np.inf
while check_stop_condition():       # O(n) iterations
    data_point = next_data_point()
    if len(topk) < k:               # tree not yet full: O(1)
        topk.insert(data_point)     # O(log k)
    elif data_point < maxi:         # otherwise discard the value in O(1)
        topk.insert(data_point)     # O(log k)
        topk.deleteMaxNode()        # O(log k)
        maxi = topk.findMaxValue()  # O(log k)
The above algorithm runs in O(n log k). One can show that this complexity is optimal (using only data_point comparisons).
In practice, binary heaps can be a bit faster (with the same complexity). Indeed, they have several advantages over self-balancing binary search trees in this case:
they can be implemented in a very compact way in memory (reducing cache misses and memory consumption)
insertion of the first k=1000 items can be done in O(k) time and very quickly
Note that discarded values are handled in constant time. This happens a lot on huge random datasets, as most values quickly turn out to be bigger than maxi. One can even prove that random datasets can be processed in O(n) time (which is optimal).
Note that Python provides a standard heap implementation in the heapq module, which is probably a good starting point (see the sketch below).
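As a rough illustration (a sketch only, not a tuned implementation), the heap version of the algorithm above could look like this, reusing k, next_data_point() and check_stop_condition() from the question and storing negated values so that heapq's min-heap acts as a max-heap:

import heapq

k = 1000
heap = []  # max-heap of the k smallest values seen so far, stored negated

while check_stop_condition():
    data_point = next_data_point()
    if len(heap) < k:
        heapq.heappush(heap, -data_point)      # still filling up: O(log k)
    elif data_point < -heap[0]:                # smaller than the current max: O(1) test
        heapq.heapreplace(heap, -data_point)   # pop the max and push the new value: O(log k)

topk = sorted(-x for x in heap)                # the k smallest values, ascending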
I have a 4-dimensional data set, say X, which happens to be the iris dataset. I form a sublist of 10 data points from this set, called mu. For each of these 10 data points, I need to calculate the squared distance to its closest neighbor, where closest neighbors can include data points from the original data set, and then sum these 10 smallest squared distances.
How can I achieve this?
I think I could use something like this -
np.array([min([np.linalg.norm(x - c)**2 for x in X]) for c in mu])
But 'x' here wouldn't exclude the very point under consideration ('c'), would it?
If it is safe to assume that your points are unique (so that you will never have two points overlap exactly), you can filter out points that are equal to c from your list comprehension:
np.array([min([np.linalg.norm(x - c)**2 for x in X if not np.array_equal(x, c)]) for c in mu])
As a one-liner, however, this becomes a bit too long to read easily. I would therefore recommend rewriting it in a PEP-8-compliant way:
res = np.empty(len(mu))  # allocate space for the result
for i, c in enumerate(mu):
    res[i] = min([np.linalg.norm(x - c)**2
                  for x in X if not np.array_equal(x, c)])
even though it is not quite as elegant as a one-liner.
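As an aside (not part of the loop above, and assuming X and mu are plain 2-D arrays and the points are unique), the same result can be sketched without a Python loop by computing all squared distances with scipy.spatial.distance.cdist and masking out the zero self-distances:

from scipy.spatial.distance import cdist
import numpy as np

sq = cdist(mu, X, metric='sqeuclidean')  # squared distances, shape (len(mu), len(X))
sq[sq == 0] = np.inf                     # mask each point's zero distance to itself
res = sq.min(axis=1)                     # squared distance to the closest other point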
I would like to perform the operation $A_{ijk} = \sum_p X_{ijp}\,\alpha_{ipk}$.
If $X$ had a regular shape, then I could use np.einsum; I believe the syntax would be
np.einsum('ijp,ipk->ijk', X, alpha)
Unfortunately, my data X has a non-regular structure on the 1st (if we zero-index) axis.
To give a little more context, $X_{ijp}$ refers to the $p$-th feature of the $j$-th member of the $i$-th group. Because groups have different sizes, X is effectively a list of lists of different lengths, of lists of the same length.
$\alpha$ has a regular structure and thus can be saved as a standard numpy array (it comes in 1-dimensional, and then I use alpha.reshape(a, b, c), where a, b, c are problem-specific integers).
I would like to avoid storing X as a list of lists of lists or a list of np.arrays of different dimensions and writing something like
A = []
for i in range(num_groups):
    temp = np.empty((group_sizes[i], alpha.shape[2]), dtype=float)
    for j in range(group_sizes[i]):
        temp[j] = np.einsum('p,pk->k', X[i][j], alpha[i, :, :])
    A.append(temp)
Is there some nice numpy function/data structure for doing this, or am I going to have to compromise with a only partially vectorised implementation?
I know this sounds obvious, but, if you can afford the memory, I'd start just by checking the performance you get simply by padding the data to have a uniform size, that is, simply adding zeros and performing the operation. Sometimes a simpler solution is faster than a supposedly more optimal one that has more Python/C roundtrips.
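A minimal sketch of that padding idea, assuming X is your list of lists of lists and alpha has shape (num_groups, P, K) as implied by the einsum signature (max_len, X_pad and A_pad are names I made up):

import numpy as np

num_groups = len(X)
max_len = max(len(x_i) for x_i in X)                 # size of the largest group
P = alpha.shape[1]

X_pad = np.zeros((num_groups, max_len, P), dtype=alpha.dtype)
for i, x_i in enumerate(X):
    X_pad[i, :len(x_i), :] = x_i                     # zero-pad every group to max_len

A_pad = np.einsum('ijp,ipk->ijk', X_pad, alpha)      # one regular einsum over the padded array

A = [A_pad[i, :len(x_i), :] for i, x_i in enumerate(X)]  # strip the padding again if needed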
If that doesn't work, then your best bet, as Tom Wyllie suggested, is probably a bucketing strategy. Assuming X is your list of lists of lists and alpha is an array, you can start by collecting the sizes of the second index (maybe you already have this):
X_sizes = np.array([len(x_i) for x_i in X])
And sort them:
idx_sort = np.argsort(X_sizes)
X_sizes_sorted = X_sizes[idx_sort]
Then you choose a number of buckets, which is the number of divisions of your work. Let's say you pick BUCKETS = 4. You just need to divide the data so that each piece is more or less the same size:
sizes_cumsum = np.cumsum(X_sizes_sorted)
total = sizes_cumsum[-1]
bucket_idx = []
for i in range(BUCKETS):
    low = np.round(i * total / float(BUCKETS))
    high = np.round((i + 1) * total / float(BUCKETS))
    m = (sizes_cumsum > low) & (sizes_cumsum <= high)
    idx = np.where(m)[0]
    # Make the indices relative to X, not to idx_sort
    idx = idx_sort[idx]
    bucket_idx.append(idx)
And then you make the computation for each bucket:
bucket_results = []
for idx in bucket_idx:
    # The last index in the bucket corresponds to the biggest group
    bucket_size = X_sizes[idx[-1]]
    # Fill the padded bucket array
    X_bucket = np.zeros((len(idx), bucket_size, len(X[0][0])), dtype=alpha.dtype)
    for i, X_i in enumerate(idx):
        X_bucket[i, :X_sizes[X_i]] = X[X_i]
    # Compute
    res = np.einsum('ijp,ipk->ijk', X_bucket, alpha[idx])
    bucket_results.append(res)
Filling the array X_bucket will probably be slow in this part. Again, if you can afford the memory, it would be more efficient to have X in a single padded array and then just slice X[idx, :bucket_size, :].
Finally, you can put back your results into a list:
result = [None] * len(X)
for res, idx in zip(bucket_results, bucket_idx):
    for r, X_i in zip(res, idx):
        result[X_i] = r[:X_sizes[X_i]]
Sorry I'm not giving a complete function, but I'm not sure exactly what your input or expected output looks like, so I have just provided the pieces; you can use them as you see fit.
I have an array of x,y,z coordinates of several (~10^10) points (only 5 shown here)
a = np.array([[34.45, 14.13,  2.17],
              [32.38, 24.43, 23.12],
              [33.19,  3.28, 39.02],
              [36.34, 27.17, 31.61],
              [37.81, 29.17, 29.94]])
I want to make a new array with only those points which are at least some distance d away from all other points in the list. I wrote code using while loops:
import numpy as np
from scipy.spatial import distance

d = 0.1  # or some distance
i = 0
selected_points = []
while i < len(a):
    interdist = []
    j = i + 1
    while j < len(a):
        interdist.append(distance.euclidean(a[i], a[j]))
        j += 1
    if all(dis >= d for dis in interdist):
        selected_points.append(a[i])
    i += 1
This works, but it is taking really long to perform this calculation. I read somewhere that while loops are very slow.
I was wondering if anyone has any suggestions on how to speed up this calculation.
EDIT: While my objective of finding the particles which are at least some distance away from all the others stays the same, I just realized that there is a serious flaw in my code. Say I have 3 particles. For the first iteration of i, my code calculates the distances 1->2 and 1->3; say 1->2 is less than the threshold distance d, so the code throws away particle 1. For the next iteration of i, it only computes 2->3, and say it finds that this is greater than d, so it keeps particle 2. But this is wrong! Particle 2 should have been discarded along with particle 1. The solution by @svohara is the correct one!
For big data sets and low-dimensional points (such as your 3-dimensional data), sometimes there is a big benefit to using a spatial indexing method. One popular choice for low-dimensional data is the k-d tree.
The strategy is to index the data set. Then query the index using the same data set, to return the 2-nearest neighbors for each point. The first nearest neighbor is always the point itself (with dist=0), so we really want to know how far away the next closest point is (2nd nearest neighbor). For those points where the 2-NN is > threshold, you have the result.
from scipy.spatial import cKDTree as KDTree
import numpy as np
#a is the big data as numpy array N rows by 3 cols
a = np.random.randn(10**8, 3).astype('float32')
# This will create the index, prepare to wait...
# NOTE: took 7 minutes on my mac laptop with 10^8 rand 3-d numbers
# there are some parameters that could be tweaked for faster indexing,
# and there are implementations (not in scipy) that can construct
# the kd-tree using parallel computing strategies (GPUs, e.g.)
k = KDTree(a)
#ask for the 2-nearest neighbors by querying the index with the
# same points
(dists, idxs) = k.query(a, 2)
# (dists, idxs) = k.query(a, 2, n_jobs=4) # to use more CPUs on query...
#Note: 9 minutes for query on my laptop, 2 minutes with n_jobs=6
# So less than 10 minutes total for 10^8 points.
# If the second NN is > thresh distance, then there is no other point
# in the data set closer.
thresh_d = 0.1 #some threshold, equiv to 'd' in O.P.'s code
d_slice = dists[:, 1] #distances to second NN for each point
res = np.flatnonzero( d_slice >= thresh_d )
Here's a vectorized approach using distance.pdist -
# Store number of pts (number of rows in a)
m = a.shape[0]
# Get the first of pairwise indices formed with the pairs of rows from a
# Simpler version, but a bit slow : idx1,_ = np.triu_indices(m,1)
shifts_arr = np.zeros(m*(m-1)//2, dtype=int)
shifts_arr[np.arange(m-1,1,-1).cumsum()] = 1
idx1 = shifts_arr.cumsum()
# Get the IDs of pairs of rows that are more than "d" apart and thus select
# the rest of the rows using a boolean mask created with np.in1d for the
# entire range of number of rows in a. Index into a to get the selected points.
selected_pts = a[~np.in1d(np.arange(m),idx1[distance.pdist(a) < d])]
For a huge dataset with around 10^10 points, we might have to perform the operations in chunks based on the available system memory.
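As a rough sketch of that chunking (only to bound memory; the work is still quadratic), one could use cdist row-blocks against the whole array together with the corrected keep-only-if-all-neighbours-are-far criterion from the edit above; chunk is an arbitrary block size:

import numpy as np
from scipy.spatial.distance import cdist

chunk = 10_000                               # rows per block, chosen to fit in memory
keep = np.ones(len(a), dtype=bool)
for start in range(0, len(a), chunk):
    block = a[start:start + chunk]
    dists = cdist(block, a)                  # distances from this block to every point
    rows = np.arange(len(block))
    dists[rows, start + rows] = np.inf       # ignore each point's zero distance to itself
    keep[start:start + chunk] = dists.min(axis=1) >= d
selected_points = a[keep]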
Your algorithm is quadratic (10^20 operations for 10^10 points). Here is a roughly linear approach if the distribution is nearly random.
Split your space into cubic boxes of side d/sqrt(3), so that any two points in the same box are closer than d. Put each point in its box.
Then, for each box:
if it contains just one point, you only have to calculate distances to points in a small neighborhood of boxes;
else there is nothing to do: every point in that box has a neighbor closer than d, so all of them are discarded.
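A minimal sketch of that box idea, assuming a is a NumPy array as above and d is the threshold (the ±2-cell neighbourhood is enough because with a cell side of d/sqrt(3), a point within distance d is at most two cells away along each axis):

from collections import defaultdict
import numpy as np

cell = d / np.sqrt(3)                        # two points in the same cell are always closer than d
boxes = defaultdict(list)
for i, p in enumerate(a):
    boxes[tuple((p // cell).astype(int))].append(i)

selected_idx = []
for key, members in boxes.items():
    if len(members) > 1:                     # at least two points within d of each other
        continue                             # -> none of them can be kept
    i = members[0]
    # candidates from the surrounding cells; +-2 cells cover a radius of d
    neighbours = [j
                  for dx in range(-2, 3) for dy in range(-2, 3) for dz in range(-2, 3)
                  if (dx, dy, dz) != (0, 0, 0)
                  for j in boxes.get((key[0] + dx, key[1] + dy, key[2] + dz), [])]
    if not neighbours or np.linalg.norm(a[neighbours] - a[i], axis=1).min() >= d:
        selected_idx.append(i)
selected_points = a[selected_idx]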
Drop the append; it must be really slow. You can use a static vector of distances and use [] to put each value in the right position.
Use min instead of all. You only need to check whether the minimum distance is bigger than d.
Actually, you can break out of the inner loop the moment you find a distance smaller than your limit, and then you can drop both points. That way you do not even have to save any distances (unless you need them later).
Since d(a,b) = d(b,a), you can run the inner loop only over the following points and forget about the distances you have already calculated. If you need them, you can pick them from the array later, which is faster.
From your comment, I believe this would do, if you have no repeated points.
selected_points = []
for p1 in a:
    save_point = True
    for p2 in a:
        if not np.array_equal(p1, p2) and distance.euclidean(p1, p2) < d:
            save_point = False
            break
    if save_point:
        selected_points.append(p1)
In the end I check both (a, b) and (b, a) because you should not modify a list while iterating over it, but you can be smarter with some additional variables.
I have generated a large data frame by reading a large number of files in a directory. I have managed to parallelize the section that reads the files. I take that data and generate the data frame for the next step, which is calculating a similarity matrix.
Now I am trying to calculate the cosine similarity between rows of the data frame. Since it is a large data frame, it takes a long time (hours) to run. How can I parallelize this process?
Here is my current code for calculating the cosine similarity, which runs on a single thread:
df = df.fillna(0)
data = df.values
m, k = data.shape
mat = np.zeros((m, m))
"""
scipy's cosine() is a distance in the range 0-2 rather than a similarity
in the range -1 to 1; taking 1 - cosine() maps 0 -> 1 and 2 -> -1
"""
for i in xrange(m):
    for j in xrange(m):
        if i != j:
            mat[i][j] = 1 - cosine(data[i, :], data[j, :])
        else:
            mat[i][j] = 1.  # 0 if we don't do 1 - cosine()
First, I'm assuming your cosine is scipy.spatial.distance.cosine, whose key calculation is:
dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
So it looks like I can replace your double loop with:
data1 = data/np.linalg.norm(data,axis=1)[:,None]
mat1 = np.einsum('ik,jk->ij', data1, data1)
That is, normalize data once at the start, rather than for each pair. And then use einsum to calculate the whole set of dot products.
For a small test case (m,k=4,3), this is 25x faster than your double loop.
Cautions: I've only tested against your answer for one small data array.
scipy.spatial.distance.norm and cosine have some checks that I haven't implemented.
einsum, while fast for this sort of thing on modest size arrays, can get bogged down with larger ones, and will run into memory errors before your element by element dot. And the underlying dot library may be better tuned to handle multi-core machines.
But even if data is too large to handle with one call to einsum, you could break the calculation into blocks, e.g.
mat[n1:n2,m1:m2] = np.einsum('ik,jk->ij', data1[n1:n2,:], data1[m1:m2,:])
I'd like to point you in the direction of https://docs.python.org/2/library/multiprocessing.html
Take note of pool.map(function, iterable)
Then build the set of triangular position tuples, write the appropriate function and fire away.
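A hedged sketch of what that could look like, assuming data is the (m, k) array from the question, defined at module level so forked workers can see it (the pair_similarity helper and the simple pool.map over all upper-triangle pairs are my own illustration, not a prescribed API):

import numpy as np
from multiprocessing import Pool
from scipy.spatial.distance import cosine

def pair_similarity(pair):
    i, j = pair
    return i, j, 1 - cosine(data[i, :], data[j, :])

if __name__ == '__main__':
    m = data.shape[0]
    # upper-triangular index pairs, so each similarity is computed exactly once
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    pool = Pool()
    results = pool.map(pair_similarity, pairs)
    pool.close()
    pool.join()

    mat = np.eye(m)                    # diagonal: similarity of a row with itself is 1
    for i, j, s in results:
        mat[i, j] = mat[j, i] = s      # the matrix is symmetric, so fill both halves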