I have a large set of 3-dimensional vectors, and each vector is associated with a weight. That is, the set is in the form
{[[0.707,0.5,0.5],0.3],[[0.6,0.8,0],0.2]....}
I want to collect vectors that are close to each other and sum up their weights. My idea is: if a vector A is close to another vector B, treat them as the same vector and add the weight of A to the weight of B. The Python code is
def gathervecs(vecs):
    gathers = []
    for vec in vecs:  # each vec has two elements: the normalized vector and the norm**2 (the weight)
        index = 0
        for i, avec in enumerate(gathers):
            if sum(abs(vec[0] - avec[0])) < 10**(-10):
                gathers[i][1] = avec[1] + vec[1]
                index = 1
                break
        if index == 0:
            gathers.append(vec)
    return gathers
But the running time of this code is quadratic in the size of the original set. So the question is: how can I design a more efficient algorithm?
PS: please generate the original set randomly for testing the efficiency of the algorithm.
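This is not part of the question, just one possible direction plus the requested random test set: bucket vectors by their coordinates rounded to a fixed precision, so each insertion is an average O(1) dictionary lookup instead of a scan over everything gathered so far. The function names and the 8-decimal rounding are my own choices; the caveat is that two vectors closer than 1e-10 can still land in different buckets if they straddle a rounding boundary (a k-d tree or clustering approach would avoid that).

import numpy as np

def make_test_set(n, n_distinct=50, seed=0):
    """Randomly build n [normalized vector, weight] pairs, drawn (with tiny
    noise) from n_distinct base directions so that near-duplicates occur."""
    rng = np.random.default_rng(seed)
    base = rng.normal(size=(n_distinct, 3))
    base /= np.linalg.norm(base, axis=1, keepdims=True)
    picks = base[rng.integers(n_distinct, size=n)] + rng.normal(scale=1e-12, size=(n, 3))
    weights = rng.random(n)
    return [[v, w] for v, w in zip(picks, weights)]

def gathervecs_dict(vecs, decimals=8):
    """Group vectors whose rounded coordinates match and sum their weights.
    Average cost is O(n) over the whole set."""
    buckets = {}
    for v, w in vecs:
        key = tuple(np.round(v, decimals))
        if key in buckets:
            buckets[key][1] += w
        else:
            buckets[key] = [np.asarray(v), w]
    return list(buckets.values())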
I have two arrays, A and B, which contain instances from DATA. These two arrays are then used to index into another array called Distance.
I need a fast way to:
find the point combinations between A and B,
find the corresponding distances for those combinations in Distance
For example:
DATA = [0,1,...100]
A = [0,1,2]
B = [6,7,8]
Distance = [100x100] # contains the pairwise distance of all instances from DATA
# need a function to combine A and B
points_combination=[[0,6],[0,7],[0,8],[1,6],[1,7],[1,8],[2,6],[2,7],[2,8]]
# need a function to refer points_combination with Distance, so that I can get this results
distance_points=[0.346, 0.270, 0.314, 0.339, 0.241, 0.283, 0.304, 0.294, 0.254]
I have already tried to solve it myself, but with large data it's very slow.
Here's the code I tried:
import numpy as np

def function(pair_distances, k, clusters):
    list_distance = []
    cluster_qty = k
    for cluster_id in range(cluster_qty):
        all_clusters = clusters[:]  # list of all instance IDs in their own cluster
        in_cluster = all_clusters.pop(cluster_id)  # instance IDs inside the cluster
        not_in_cluster = all_clusters  # instance IDs outside the cluster
        # combine the A and B arrays into index pairs that refer into the Distance array
        list_dist_id = np.array(np.meshgrid(in_cluster, np.concatenate(not_in_cluster))).T.reshape(-1, 2)
        temp_dist = 9999999
        for instance in range(len(list_dist_id)):
            # look up the distance value for this pair in the pair_distances array
            temp_dist = min(temp_dist, pair_distances[list_dist_id[instance][0], list_dist_id[instance][1]])
        list_distance.append(temp_dist)
    return list_distance
Notice that the nested loop is what makes this so time-consuming.
This is my first time asking in this forum, so please let me know if you need more information.
The first part (points_combination) is already covered extensively in this post:
Cartesian product of x and y array points into single array of 2D points
The second part (distance_points): it seems the algorithm linking points_combination to distance_points is not specified. It would be helpful if you could provide small sample data sets showing how to get from your data sets to your distance_points.
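In case it helps, here is one way both parts could be done without Python loops, assuming Distance is an N x N array indexed by instance IDs; the random Distance below is only filler so the snippet runs on its own.

import numpy as np

# Hypothetical stand-ins in the spirit of the question: a 100x100
# pairwise-distance matrix and two groups of instance IDs.
rng = np.random.default_rng(0)
Distance = rng.random((100, 100))
A = np.array([0, 1, 2])
B = np.array([6, 7, 8])

# All index pairs between A and B (the points_combination array),
# in the order (0,6), (0,7), (0,8), (1,6), ...
points_combination = np.array(np.meshgrid(A, B, indexing="ij")).reshape(2, -1).T

# Look up all pairwise distances in one shot; ravel() gives the flat
# distance_points list in the same order as points_combination.
distance_points = Distance[np.ix_(A, B)].ravel()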
I have 2 arrays; one is an ordered array generated from a set of previous positions of connected points; the second is a new set of points specifying the new positions of those points. The task is to match each old point with the best-fitting new position. The difference between each pair of points is stored in a new array of size n*n. The objective is to find a way to map each previous point to a new point so that the total sum is smallest. Each old point is a row of the matrix and must be matched to a single column.
I have already looked into an exhaustive search. Although this works, it has complexity O(n!), which is simply not a viable solution.
The code below can be used to generate test data for the 2D array.
import numpy as np

def make_data():
    org = np.random.randint(5000, size=(100, 2))
    new = np.random.randint(5000, size=(100, 2))
    arr = []
    for i, j in enumerate(org):
        # distance from old point j to every new point
        values = np.linalg.norm(new - j, axis=1)
        arr.append(values)
    arr = np.array(arr)
    return arr
Here are some small examples of the array and the expected output.
Ex. 1
1 3 5
0 2 3
5 2 6
The above input should return [0,2,1] to signify that row 0 maps to column 0, row 1 to column 2 and row 2 to column 1, since the optimal solution would be the entries 1, 3, 2.
Ideally the algorithm would be 100% accurate, although something much quicker that is 85%+ accurate would also be acceptable.
Google search terms: "weighted graph minimum matching". You can consider your array to be a weighted graph, and you're looking for a matching that minimizes edge length.
The assignment problem is a fundamental combinatorial optimization problem. It consists of finding, in a weighted bipartite graph, a matching in which the sum of weights of the edges is as large as possible. A common variant consists of finding a minimum-weight perfect matching.
https://en.wikipedia.org/wiki/Assignment_problem
The Hungarian method is a combinatorial optimization algorithm that solves the assignment problem in polynomial time and which anticipated later primal-dual methods.
https://en.wikipedia.org/wiki/Hungarian_algorithm
I'm not sure whether to post the whole algorithm here; it's several paragraphs and in wikipedia markup. On the other hand I'm not sure whether leaving it out makes this a "link-only answer". If people have strong feelings either way, they can mention them in the comments.
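For reference (not part of the answer above), SciPy ships a Hungarian-style solver as scipy.optimize.linear_sum_assignment. A minimal sketch on the 3x3 example from the question:

import numpy as np
from scipy.optimize import linear_sum_assignment

# Cost matrix from the question's small example.
cost = np.array([[1, 3, 5],
                 [0, 2, 3],
                 [5, 2, 6]])

# Solves the minimum-cost assignment problem in polynomial time.
row_ind, col_ind = linear_sum_assignment(cost)

print(col_ind)                       # [0 2 1] -> row 0 -> col 0, row 1 -> col 2, row 2 -> col 1
print(cost[row_ind, col_ind].sum())  # 6, i.e. the entries 1 + 3 + 2

The same call works directly on the 100x100 matrix produced by make_data().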
I'm a newbie at Python and I'm trying to do something like binning the data of a NumPy array, but I'm really struggling with it!
My array is a simulation of a simple particle diffusion model, given each particle's probabilities of walking forward or backward. It can have an arbitrary number of particle species; that information, together with the total number of particles, is encoded in the key vector, which is composed of numbers ranging from 0 to nSpecies. Each of these numbers appears in a proportion chosen by the user, and the size of the vector is chosen by the user as well.
def walk(diff, key, progressProbability, recessProbability, nSpecies):
    """
    Returns an array with the positions of the particles weighted by their
    walk probabilities.
    """
    random = np.random.rand(len(key))
    forward = key.astype(float)
    backward = key.astype(float)
    for i in range(nSpecies):
        forward[key == i] = progressProbability[i]
        backward[key == i] = recessProbability[i]
    diff = np.add(diff, random < forward)
    diff = np.subtract(diff, random > 1 - backward)
    return diff
To add time into this simulation, I run this walk function presented above many times. Therefore, the values in diff after running this function many times are a representation of how far the particle has gone.
def probability_diffusion(time, progressProbability, recessProbability,
                          changeProbability, key, nSpecies, nBins):
    populationSize = len(key)
    diff = np.zeros(populationSize, dtype=int)
    for t in range(time):
        diff = walk(diff, key, progressProbability, recessProbability, nSpecies)
    return diff
My goal is to turn this diff array into an array of size 381 without losing the information encoded in it. I thought about doing so by binning the data and averaging within each bin.
I've tried using the scipy binned_statistic function but I can't really wrap my head around how it works.
Any thoughts? Thank you.
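Not from the original post, but under one reading of the goal (split the population axis into 381 chunks and average the displacement in each chunk), scipy.stats.binned_statistic does the bin-and-average step in one call; the diff array below is just hypothetical filler so the snippet runs on its own.

import numpy as np
from scipy.stats import binned_statistic

# Hypothetical example: diff as produced by probability_diffusion.
diff = np.random.randint(-50, 50, size=10000)
nBins = 381

# Bin the particles by their index along the population axis and take the
# mean displacement in each bin; means has shape (nBins,).
means, bin_edges, bin_number = binned_statistic(
    np.arange(len(diff)), diff, statistic="mean", bins=nBins)

# A similar pure-NumPy version: split into nBins nearly equal chunks and average.
means_np = np.array([chunk.mean() for chunk in np.array_split(diff, nBins)])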
Here is a simplified version of a function that I have:
def create_edge(a, b, network=G):
    weight = calculate_weight(matrix[a], matrix[b])
    network.addedge(array[a], array[b], weight=weight)
Basically it takes two matrix row-indices, calculates the weight between the two rows and then adds it as the weight for the edge between two nodes.
My goal is to perform this function on every pair combination in an array. What I mean by this is that if I have an array like this:
array = np.array(['A','B','C','D'])
I want to perform these calls:
create_edge('A','B')
create_edge('A','C')
create_edge('A','D')
create_edge('B','C')
create_edge('B','D')
create_edge('C','D')
The catch is my array is large! It contains roughly 15000 elements. This means it is very slow. I'm wondering if there is a quick way to do this?
What I have tried so far:
To prevent an XY problem, I should probably note that I don't necessarily need pair combinations, since B->A and A->B are the same; I just gathered combinations would be faster after first doing this:
def create_network(network):
    for i in range(len(array)):
        for j in range(len(array)):
            create_edge(i, j, network)
I also tried this:
comb = list(itertools.combinations(array, 2))

def create_network(network):
    for i in range(len(comb)):
        create_edge(comb[i][0], comb[i][1], network)
Either case was too slow. I understand that's likely due to the size of my array but I'm sure there is a faster/more effective/better method to do this.
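One common way to speed this up, assuming network is a networkx graph and calculate_weight can be expressed as a metric accepted by scipy.spatial.distance.pdist (here 'euclidean' stands in for it): compute all pairwise weights in a single vectorized call, then add the edges in bulk. The array, matrix, and G below are hypothetical stand-ins for the question's objects.

import itertools
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist

# Hypothetical stand-ins for the question's objects.
array = np.array(['A', 'B', 'C', 'D'])
matrix = np.random.rand(len(array), 10)
G = nx.Graph()

# All pairwise weights in condensed form, in the same order as
# itertools.combinations(range(n), 2). A custom callable can be passed as the
# metric, but a built-in metric is what actually makes this fast.
weights = pdist(matrix, metric='euclidean')

# Build (node_u, node_v, weight) triples and add them all in one call.
pairs = itertools.combinations(array, 2)
G.add_weighted_edges_from((u, v, w) for (u, v), w in zip(pairs, weights))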
I have an array of x,y,z coordinates of several (~10^10) points (only 5 shown here)
a= [[ 34.45 14.13 2.17]
[ 32.38 24.43 23.12]
[ 33.19 3.28 39.02]
[ 36.34 27.17 31.61]
[ 37.81 29.17 29.94]]
I want to make a new array containing only those points that are at least some distance d away from all other points in the list. I wrote code using while loops:
import numpy as np
from scipy.spatial import distance

d = 0.1  # or some distance
i = 0
selected_points = []
while i < len(a):
    interdist = []
    j = i + 1
    while j < len(a):
        interdist.append(distance.euclidean(a[i], a[j]))
        j += 1
    if all(dis >= d for dis in interdist):
        selected_points.append(a[i])
    i += 1
This works, but it is taking really long to perform this calculation. I read somewhere that while loops are very slow.
I was wondering if anyone has any suggestions on how to speed up this calculation.
EDIT: While my objective of finding the particles that are at least some distance away from all the others stays the same, I just realized there is a serious flaw in my code. Say I have 3 particles. For the first iteration of i, the code calculates the distances 1->2 and 1->3; suppose 1->2 is less than the threshold distance d, so the code throws away particle 1. For the next iteration of i it only checks 2->3, and suppose that is greater than d, so it keeps particle 2. But this is wrong: particle 2 should also be discarded along with particle 1. The solution by #svohara is the correct one!
For big data sets and low-dimensional points (such as your 3-dimensional data), sometimes there is a big benefit to using a spatial indexing method. One popular choice for low-dimensional data is the k-d tree.
The strategy is to index the data set. Then query the index using the same data set, to return the 2-nearest neighbors for each point. The first nearest neighbor is always the point itself (with dist=0), so we really want to know how far away the next closest point is (2nd nearest neighbor). For those points where the 2-NN is > threshold, you have the result.
from scipy.spatial import cKDTree as KDTree
import numpy as np
#a is the big data as numpy array N rows by 3 cols
a = np.random.randn(10**8, 3).astype('float32')
# This will create the index, prepare to wait...
# NOTE: took 7 minutes on my mac laptop with 10^8 rand 3-d numbers
# there are some parameters that could be tweaked for faster indexing,
# and there are implementations (not in scipy) that can construct
# the kd-tree using parallel computing strategies (GPUs, e.g.)
k = KDTree(a)
#ask for the 2-nearest neighbors by querying the index with the
# same points
(dists, idxs) = k.query(a, 2)
# (dists, idxs) = k.query(a, 2, n_jobs=4) # to use more CPUs on query...
#Note: 9 minutes for query on my laptop, 2 minutes with n_jobs=6
# So less than 10 minutes total for 10^8 points.
# If the second NN is > thresh distance, then there is no other point
# in the data set closer.
thresh_d = 0.1 #some threshold, equiv to 'd' in O.P.'s code
d_slice = dists[:, 1] #distances to second NN for each point
res = np.flatnonzero( d_slice >= thresh_d )
Here's a vectorized approach using distance.pdist -
# Store number of pts (number of rows in a)
m = a.shape[0]
# Get the first of pairwise indices formed with the pairs of rows from a
# Simpler version, but a bit slow : idx1,_ = np.triu_indices(m,1)
shifts_arr = np.zeros(m*(m-1)//2, dtype=int)  # integer division so the length is an int (Python 3)
shifts_arr[np.arange(m-1,1,-1).cumsum()] = 1
idx1 = shifts_arr.cumsum()
# Get the IDs of pairs of rows that are more than "d" apart and thus select
# the rest of the rows using a boolean mask created with np.in1d for the
# entire range of number of rows in a. Index into a to get the selected points.
selected_pts = a[~np.in1d(np.arange(m),idx1[distance.pdist(a) < d])]
For a huge dataset like 10^10 points, we might have to perform the operations in chunks based on the available system memory.
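The chunking idea in the last sentence could look roughly like the sketch below (my code, not part of the answer). Note it uses the corrected criterion from the question's EDIT, keeping a point only if its nearest other point is at least d away, rather than the pdist/in1d trick above, and even a chunk-by-N distance block is only feasible for far smaller N than 10^10.

import numpy as np
from scipy.spatial.distance import cdist

def isolated_mask(a, d, chunk=10_000):
    """Boolean mask of points that are >= d away from every other point,
    computed block by block to bound memory use (each block needs chunk*N floats)."""
    n = len(a)
    keep = np.empty(n, dtype=bool)
    for start in range(0, n, chunk):
        block = a[start:start + chunk]
        dists = cdist(block, a)              # (len(block), n) distances
        rows = np.arange(len(block))
        dists[rows, start + rows] = np.inf   # ignore each point's distance to itself
        keep[start:start + chunk] = dists.min(axis=1) >= d
    return keep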
Your algorithm is quadratic (~10^20 operations for 10^10 points). Here is a linear approach, assuming the distribution is nearly random.
Split your space into cubic boxes of side d/sqrt(3), so that any two points in the same box are within d of each other. Put each point in its box.
Then, for each box:
if it contains just one point, you only have to calculate distances to the points in a small neighborhood of boxes;
otherwise (more than one point) there is nothing to do, since those points are already too close to each other.
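This answer is prose only, so here is a rough sketch of the box idea (my code, not the answerer's), assuming a is an (N, 3) NumPy array; the function name select_isolated is made up for the example.

import itertools
from collections import defaultdict
import numpy as np

def select_isolated(a, d):
    """Return the points of `a` that are at least `d` away from every other point,
    using the box/grid idea described above."""
    side = d / np.sqrt(3)                    # box diagonal equals d
    cells = np.floor(a / side).astype(int)   # integer grid coordinates per point

    boxes = defaultdict(list)                # cell -> indices of the points in it
    for idx, cell in enumerate(map(tuple, cells.tolist())):
        boxes[cell].append(idx)

    # Two points closer than d can sit up to two cells apart on each axis
    # (the cell side is d/sqrt(3) ~ 0.58 d), so check a 5x5x5 neighborhood.
    offsets = list(itertools.product(range(-2, 3), repeat=3))
    keep = []
    for cell, members in boxes.items():
        if len(members) > 1:
            continue                         # several points share one box: each has a neighbor at distance <= d
        i = members[0]
        isolated = True
        for off in offsets:
            neighbour = (cell[0] + off[0], cell[1] + off[1], cell[2] + off[2])
            for j in boxes.get(neighbour, ()):
                if j != i and np.linalg.norm(a[i] - a[j]) < d:
                    isolated = False
                    break
            if not isolated:
                break
        if isolated:
            keep.append(i)
    return a[keep]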
Drop the append; it is probably really slow. You can preallocate a fixed-size array of distances and use [] indexing to put each number in the right position.
Use min instead of all. You only need to check whether the minimum distance is bigger than d.
Actually, you can break out of the inner loop the moment you find a distance smaller than your limit, and then drop both points. That way you don't even have to save any distances (unless you need them later).
Since d(a,b) = d(b,a), you can run the inner loop only over the points that follow the current one and skip the distances you have already calculated. If you need them later, you can look them up in the array, which is faster.
From your comment, I believe this would do, if you have no repeated points.
selected_points = []
for p1 in a:
    save_point = True
    for p2 in a:
        # p1 and p2 are NumPy rows, so compare them with array_equal rather than !=
        if not np.array_equal(p1, p2) and distance.euclidean(p1, p2) < d:
            save_point = False
            break
    if save_point:
        selected_points.append(p1)
# selected_points now holds the result
In the end I check both (a,b) and (b,a) because you should not modify a list while iterating over it, but you could be smarter with some additional variables.