For every element in list A, I need to calculate the Levenshtein distance between it and every element in list B. It's 375 million calculations total, which will take too long (over 10 hours) with the nested for-loop that I currently have below:
for a in range(10000):
    listA_element = listA[a]
    # Calculate the levenshtein distance between the listA element and every listB element
    for b in range(50000):
        listB_element = listB[b]
        score = abd.DiscountedLevenshtein().sim(listA_element, listB_element)
How can I do what the code above does, but in under 1-2 hours? I have looked into using NumPy, but it does not seem to work with the Levenshtein distance library, and I need the flexibility to do several other things inside the loops besides the calculation (creating lists, appending to lists, etc.). I am having issues with the Cython route, so any alternatives are welcome.
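Not a full answer to the timing question, but two cheap wins that keep the loop structure are to build the DiscountedLevenshtein object once (instead of once per pair) and to spread the outer loop over CPU cores with multiprocessing. Below is a rough sketch under the assumption that abd is abydos.distance and that listA and listB are defined at module level (so spawned worker processes can see them on Windows/macOS); the per-pair work inside best_match is just an illustration.

from multiprocessing import Pool
import abydos.distance as abd

scorer = abd.DiscountedLevenshtein()   # build the scorer once, not inside the inner loop

def best_match(listA_element):
    # Whatever per-pair work the loops need (building lists, appending, etc.) can
    # live here; as an illustration this returns the index of the best-scoring listB element.
    scores = [scorer.sim(listA_element, listB_element) for listB_element in listB]
    return max(range(len(scores)), key=scores.__getitem__)

if __name__ == '__main__':
    with Pool() as pool:               # one worker process per CPU core by default
        results = pool.map(best_match, listA, chunksize=16)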
I have a problem where I am trying to compute the nearest strings using the Edit/Levenshtein distance.
I have a list containing about 250,000 unique strings, and for each item in the list, I need to return the index of the string in the list that is closest.
My problem is that I can't just use something like pdist, because it would generate a 250k^2/2 array and lead to memory problems. But if I do a row-by-row operation like
def closest(s):
    """
    Returns index of minimum Levenshtein distance
    """
    distances = [levenshtein_distance(s, X[i]) for i in range(len(X))]
    minimum_distance = min(i for i in distances if i > 0)
    return distances.index(minimum_distance)
this will also be super inefficient as it isn't optimised like pdist and is the same as generating a dense matrix.
Would anyone have any suggestions? Many thanks!
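One hedged suggestion, assuming the rapidfuzz package is acceptable: its process.cdist computes blocks of Levenshtein distances in C across all cores, so you can work through the matrix one block of rows at a time and keep only the running best match per string. That caps memory at one block rather than the full 250k x 250k matrix (the comparison count is still quadratic). X is the list of strings from the question.

import numpy as np
from rapidfuzz.process import cdist
from rapidfuzz.distance import Levenshtein

block = 256                                   # rows of the distance matrix held in memory at once
nearest = np.empty(len(X), dtype=np.int64)    # index of the closest other string, per string

for start in range(0, len(X), block):
    stop = min(start + block, len(X))
    # distances of X[start:stop] against every string, computed in C on all cores
    d = cdist(X[start:stop], X, scorer=Levenshtein.distance, workers=-1)
    d[np.arange(stop - start), np.arange(start, stop)] = d.max() + 1  # mask self-matches
    nearest[start:stop] = d.argmin(axis=1)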
I am learning how to traverse a 2D matrix spirally, and I came across this following algorithm:
def spiralOrder(self, matrix):
    result = []
    while matrix:
        result.extend(matrix.pop(0))
        matrix = zip(*matrix)[::-1]
    return result
I am currently having a hard time figuring out the time complexity of this algorithm, with the zip call sitting inside the while loop.
It would be greatly appreciated if anyone could help me figure out the time complexity with explanations.
Thank you!
The known time complexity for this problem is O(MxN), where M is the number of rows and N is the number of columns of the MxN matrix. This is an awesome algorithm, but it looks like it might be slower than that.
Looking at it more closely, every iteration of the loop performs the following operations:
matrix.pop(0)    # O(rows): the remaining rows of the list are shifted down
result.extend()  # O(k), where k is the number of elements added in that call
*matrix          # O(rows): the remaining rows are unpacked as arguments to zip
zip()            # O(remaining elements); Python 2.7 builds the list directly,
                 # while Python 3 requires an explicit list() around it
[::-1]           # O(rows): builds a reversed shallow copy of the list of rows
Regardless of how many loop iterations there are, by the time this completes result.extend will have been called on every one of the MxN elements of the matrix, so the best case is O(MxN).
Where I am less sure is how much time the repeated zips and list reversals add. The loop only runs roughly M+N-1 times, but the zip/reverse is applied to (M-1) * N elements, then to (M-1) * (N-1) elements, and so on. My best guess is that this contributes at least a logarithmic factor, so I would guess the overall time complexity is somewhere around O(MxN log(MxN)).
https://wiki.python.org/moin/TimeComplexity
No matter how you traverse a 2D matrix, you have to visit each of its elements at least once, so the time complexity is always quadratic in the dimensions, i.e. proportional to the number of elements.
An m×n matrix therefore takes O(mn) time to traverse, regardless of whether the order is spiral or row-major.
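If in doubt, one way to test the two claims above is to time the traversal on growing n x n matrices and see how the runtime scales with the number of elements. A rough sketch (the method is rewritten as a plain function without self, and the zip call is wrapped in list() so it also runs on Python 3, as noted in the first answer):

import time

def spiral_order(matrix):
    result = []
    while matrix:
        result.extend(matrix.pop(0))
        matrix = list(zip(*matrix))[::-1]
    return result

for n in (100, 200, 400, 800):
    matrix = [[0] * n for _ in range(n)]
    t0 = time.perf_counter()
    spiral_order(matrix)
    elapsed = time.perf_counter() - t0
    print(f"n={n:4d}  elements={n*n:7d}  seconds={elapsed:.3f}  seconds/element={elapsed/(n*n):.2e}")

If the seconds/element column stays roughly constant as n grows, the traversal behaves like O(MxN); if it keeps climbing, the repeated zip/reverse passes are adding a super-linear factor.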
Here is a simplified version of a function that I have:
def create_edge(a, b, network=G):
    weight = calculate_weight(matrix[a], matrix[b])
    network.add_edge(array[a], array[b], weight=weight)
Basically, it takes two matrix row indices, calculates the weight between the two rows, and then adds it as the weight of the edge between the two corresponding nodes.
My goal is to perform this function on every pair combination in an array. What I mean by this is that if I have an array such as:
array = np.array(['A','B','C','D'])
to perform these function calls:
create_edge('A','B')
create_edge('A','C')
create_edge('A','D')
create_edge('B','C')
create_edge('B','D')
create_edge('C','D')
The catch is my array is large! It contains roughly 15000 elements. This means it is very slow. I'm wondering if there is a quick way to do this?
What I have tried so far:
To prevent an XY problem, I should probably note that I don't necessarily need pair combinations, since B->A and A->B are the same; I just gathered it would be faster than doing this:
def create_network(network):
    for i in range(len(array)):
        for j in range(len(array)):
            create_edge(i, j, network)
I also tried this:
comb = list(itertools.combinations(array, 2))

def create_network(network):
    for i in range(len(comb)):
        create_edge(comb[i][0], comb[i][1], network)
Both approaches were too slow. I understand that's likely due to the size of my array, but I'm sure there is a faster, more effective method to do this.
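If calculate_weight is (or can be replaced by) a standard pairwise metric over the rows of matrix, one option is to compute all C(n, 2) weights in a single vectorised call and then add the edges in bulk instead of one at a time. The sketch below uses dummy data, assumes networkx, and uses a Euclidean weight purely as a stand-in for whatever calculate_weight actually computes; note that with 15000 nodes the complete graph has about 112 million edges, so the graph itself becomes the memory bottleneck.

import numpy as np
import networkx as nx
from itertools import combinations
from scipy.spatial.distance import pdist

array = np.array(['A', 'B', 'C', 'D'])       # node labels (~15000 in practice)
matrix = np.random.rand(len(array), 10)      # one feature row per node (dummy data)

# pdist emits one value per pair, in the same order as itertools.combinations(range(n), 2)
weights = pdist(matrix, metric='euclidean')  # stand-in for calculate_weight on every pair
edges = ((u, v, w) for (u, v), w in zip(combinations(array, 2), weights))

G = nx.Graph()
G.add_weighted_edges_from(edges)
print(G.number_of_edges())                   # C(4, 2) == 6 for the dummy data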
I have an array of x,y,z coordinates of several (~10^10) points (only 5 shown here)
a= [[ 34.45 14.13 2.17]
[ 32.38 24.43 23.12]
[ 33.19 3.28 39.02]
[ 36.34 27.17 31.61]
[ 37.81 29.17 29.94]]
I want to make a new array containing only those points that are at least some distance d away from all other points in the list. I wrote the following code using while loops:
import numpy as np
from scipy.spatial import distance

d = 0.1  # or some distance
i = 0
selected_points = []
while i < len(a):
    interdist = []
    j = i + 1
    while j < len(a):
        interdist.append(distance.euclidean(a[i], a[j]))
        j += 1
    if all(dis >= d for dis in interdist):
        selected_points.append(a[i])
    i += 1
This works, but it takes a really long time to run. I read somewhere that while loops are very slow.
I was wondering if anyone has any suggestions on how to speed up this calculation.
EDIT: While my objective of finding the particles that are at least some distance away from all the others stays the same, I just realized that there is a serious flaw in my code. Say I have 3 particles; for the first iteration of i, my code calculates the distances 1->2 and 1->3. Say 1->2 is less than the threshold distance d, so the code throws away particle 1. For the next iteration of i it only computes 2->3, and say it finds that this is greater than d, so it keeps particle 2. But this is wrong: particle 2 should be discarded along with particle 1. The solution by @svohara is the correct one!
For big data sets and low-dimensional points (such as your 3-dimensional data), sometimes there is a big benefit to using a spatial indexing method. One popular choice for low-dimensional data is the k-d tree.
The strategy is to index the data set. Then query the index using the same data set, to return the 2-nearest neighbors for each point. The first nearest neighbor is always the point itself (with dist=0), so we really want to know how far away the next closest point is (2nd nearest neighbor). For those points where the 2-NN is > threshold, you have the result.
from scipy.spatial import cKDTree as KDTree
import numpy as np
#a is the big data as numpy array N rows by 3 cols
a = np.random.randn(10**8, 3).astype('float32')
# This will create the index, prepare to wait...
# NOTE: took 7 minutes on my mac laptop with 10^8 rand 3-d numbers
# there are some parameters that could be tweaked for faster indexing,
# and there are implementations (not in scipy) that can construct
# the kd-tree using parallel computing strategies (GPUs, e.g.)
k = KDTree(a)
#ask for the 2-nearest neighbors by querying the index with the
# same points
(dists, idxs) = k.query(a, 2)
# (dists, idxs) = k.query(a, 2, n_jobs=4) # to use more CPUs on query...
#Note: 9 minutes for query on my laptop, 2 minutes with n_jobs=6
# So less than 10 minutes total for 10^8 points.
# If the second NN is > thresh distance, then there is no other point
# in the data set closer.
thresh_d = 0.1 #some threshold, equiv to 'd' in O.P.'s code
d_slice = dists[:, 1] #distances to second NN for each point
res = np.flatnonzero( d_slice >= thresh_d )
Here's a vectorized approach using distance.pdist -
# Store the number of points (number of rows in a)
m = a.shape[0]

# Get the first of the pairwise indices formed with the pairs of rows from a
# Simpler version, but a bit slow: idx1, _ = np.triu_indices(m, 1)
shifts_arr = np.zeros(m*(m-1)//2, dtype=int)
shifts_arr[np.arange(m-1, 1, -1).cumsum()] = 1
idx1 = shifts_arr.cumsum()

# Get the IDs of the pairs of rows that are less than "d" apart and keep
# the remaining rows using a boolean mask created with np.in1d over the
# entire range of row numbers. Index into a to get the selected points.
selected_pts = a[~np.in1d(np.arange(m), idx1[distance.pdist(a) < d])]
For a huge dataset like ~10^10 points, we might have to perform these operations in chunks, based on the system memory available.
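For reference, a sketch of that chunked idea using distance.cdist: process one block of rows at a time against the full array, mask out each point's zero distance to itself, and keep a point only if everything else is at least d away. It still performs O(n^2) distance computations, so for data sets anywhere near 10^10 points the k-d tree or box-partition approaches are the practical options; a is assumed to be an (N, 3) NumPy array as in the question.

import numpy as np
from scipy.spatial.distance import cdist

def select_far_apart(a, d, chunk=1000):
    n = len(a)
    keep = np.ones(n, dtype=bool)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        dists = cdist(a[start:stop], a)                # (chunk, n) block of pairwise distances
        dists[np.arange(stop - start), np.arange(start, stop)] = np.inf  # ignore self-distances
        keep[start:stop] = (dists >= d).all(axis=1)    # keep only points with no neighbour closer than d
    return a[keep]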
Your algorithm is quadratic (~10^20 operations). Here is a linear approach, assuming the points are distributed roughly at random.
Split your space into cubic boxes of side d/sqrt(3) (volume (d/sqrt(3))^3), so that any two points inside the same box are within d of each other, and put each point in its box.
Then, for each box:
if it contains just one point, you only have to compute distances to the points in a small neighbourhood of boxes;
otherwise (two or more points share the box), all of its points are closer than d to each other and there is nothing more to do: they are all discarded.
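A minimal sketch of this box idea, assuming a is an (N, 3) NumPy array of coordinates and d > 0. The cell side is d/sqrt(3), so the cell diagonal equals d and any two points sharing a cell are treated as too close; because that side is smaller than d, a surviving singleton still has to be checked against the points in the 5x5x5 block of surrounding cells.

import numpy as np
from collections import defaultdict
from itertools import product

def select_isolated(a, d):
    cell = d / np.sqrt(3.0)
    boxes = defaultdict(list)                          # cell index (3-tuple) -> list of point indices
    for i, key in enumerate(map(tuple, np.floor(a / cell).astype(np.int64))):
        boxes[key].append(i)

    keep = []
    for key, members in boxes.items():
        if len(members) > 1:
            continue                                   # a shared cell means these points are too close
        i = members[0]
        too_close = False
        for off in product(range(-2, 3), repeat=3):    # scan the surrounding 5x5x5 block of cells
            for j in boxes.get((key[0] + off[0], key[1] + off[1], key[2] + off[2]), ()):
                if j != i and np.linalg.norm(a[i] - a[j]) < d:
                    too_close = True
                    break
            if too_close:
                break
        if not too_close:
            keep.append(i)
    return a[keep]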
Drop the append; it can be really slow. You can preallocate a fixed-size list of distances and use [] indexing to put each number in the right position.
Use min instead of all. You only need to check whether the minimum distance is bigger than d.
Actually, you can break out of the inner loop the moment you find a distance smaller than your limit, and then drop both points. That way you do not even have to store any distances (unless you need them later).
Since d(a,b) = d(b,a), you only need to run the inner loop over the points that come after the current one and can skip the distances you have already calculated; if you need them, it is faster to read them back from the array than to recompute them.
From your comment, I believe this would do, if you have no repeated points.
selected_points = []
for p1 in a:
    save_point = True
    for p2 in a:
        if not np.array_equal(p1, p2) and distance.euclidean(p1, p2) < d:
            save_point = False
            break
    if save_point:
        selected_points.append(p1)
# selected_points now holds the points that are at least d away from every other point
In the end I check both (a, b) and (b, a) because you should not modify a list while iterating over it, but you could be smarter with some additional variables.