Speed up pairwise distance matrix calculation in Python

I have about 9000 trajectories, and for a project I have to calculate the distance between each pair of them. A trajectory consists of 11 points, and each point contains x and y coordinates. A sample dataset can be generated with:
import numpy as np
trajs = np.random.rand(9000,11,2)
I took the Frechet distance function from https://pypi.org/project/similaritymeasures/, which takes two trajectories as input and returns a float.
At the beginning I wrote a nested for-loop:
from similaritymeasures import frechet_dist

distance_matrix = []
for i in trajs:
    for j in trajs:
        distance_matrix.append(frechet_dist(i, j))
It takes too long to get results.
Since the distance calculation is symmetric (i.e. frechet_dist(t1, t2) == frechet_dist(t2, t1)), I cut the number of calculations roughly in half like this:
from scipy.spatial.distance import squareform

n = len(trajs)
distance_matrix = []
flag = 0
for i in range(n):
    for j in range(flag, n):
        if i != j:
            distance_matrix.append(frechet_dist(trajs[i], trajs[j]))
    flag += 1
dist_mat = squareform(np.asarray(distance_matrix))
Now, for 9000 trajectories, it takes 19 hours. I got the result, but it is still too long. Are there any methods to speed up the calculation?
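One generic way to speed this up (a sketch, not from the original post): the frechet_dist calls are independent, so the outer loop can be parallelised across CPU cores with the standard-library multiprocessing module. The per-pair cost stays the same, but you get roughly a factor-of-core-count improvement.

import numpy as np
from multiprocessing import Pool
from scipy.spatial.distance import squareform
from similaritymeasures import frechet_dist

np.random.seed(0)                     # so every worker process sees the same sample data
trajs = np.random.rand(9000, 11, 2)

def upper_row(i):
    # distances from trajectory i to all later trajectories (one row of the upper triangle)
    return [frechet_dist(trajs[i], trajs[j]) for j in range(i + 1, len(trajs))]

if __name__ == "__main__":
    with Pool() as pool:
        rows = pool.map(upper_row, range(len(trajs)), chunksize=50)
    dist_mat = squareform(np.asarray([d for row in rows for d in row]))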

Related

Square Matrix in Python with values -1 or 0 or 1

I'm trying to generate a square matrix that contains only the values -1, 0, or 1 and no other values. The matrix is used as a relationship matrix for a genetic algorithm project that I am working on. The diagonal has to be all zeros.
So far I have tried this:
import random
import numpy as np

n = 5
M = []
for i in range(n * n):  # one draw per matrix entry
    disc = random.random()
    if disc <= 0.33:
        M.append(-1)
    elif disc <= 0.66:
        M.append(0)
    else:
        M.append(1)
RelMat = np.array(M).reshape(n, -1)
np.fill_diagonal(RelMat, 0)
This yields a matrix with all three values, but it doesn't let me make it symmetric. I have tried multiplying it by its transpose, but then the values are no longer correct.
I have also tried to get it to work with:
import numpy as np

N = 5
b = np.random.randint(-1, 3, size=(N, N))  # random_integers is deprecated; randint's upper bound is exclusive
b_symm = (b + b.T) / 2
but this gives me 0.5 values in the matrix, which poses a problem.
My main issues are the symmetry of the matrix and the requirement that the matrix contain all three numbers. Any help is appreciated.
numpy.triu returns the upper triangular portion of a matrix (it sets the elements below the k-th diagonal to 0). With k=1 it zeroes the main diagonal as well, so you can skip the call to fill_diagonal.
After that, b + b.T gives you a symmetric matrix with the desired values.
Here's a much more compact way to build the matrix, in this case 5x5:
b = np.triu(np.random.randint(-1, 2, size=[5,5]), k=1)
b + b.T
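As a quick illustration (not part of the original answer), you can check that the result is symmetric, has a zero diagonal, and contains only -1, 0 and 1:

import numpy as np

b = np.triu(np.random.randint(-1, 2, size=[5, 5]), k=1)  # strict upper triangle
M = b + b.T                                              # mirror it to get symmetry
assert np.array_equal(M, M.T)            # symmetric
assert np.all(np.diag(M) == 0)           # zero diagonal
assert set(np.unique(M)) <= {-1, 0, 1}   # only -1, 0, 1 appear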

Discrete Fourier Transform implementation using Python - Infinite loop

I need to program the Discrete Fourier Transform in Python (I cannot use numpy's fft function).
You will find the numpy fft call in my code, but it is only there for verification.
Not sure why, but my code seems to get stuck in an infinite loop (I have to interrupt it with a KeyboardInterrupt).
Any ideas?
import matplotlib.pyplot as plt
import numpy as np
import cmath

Fs = 40000               # sampling rate
Ts = 1.0 / Fs            # sampling period
t = np.arange(0, 1, Ts)  # time vector

f = 100                  # signal frequency
x1_n = np.sin(2 * np.pi * f * t + 0)
f = 1000
x2_n = np.sin(2 * np.pi * f * t + 180)
x_n = x1_n + x2_n

n = len(x_n)             # signal length
k = np.arange(n)         # k vector
T = n / Fs
frq = k / T                    # two-sided frequency vector
frq = frq[range(int(n / 2))]   # one side of the frequency vector

X = np.fft.fft(x_n) / n  # fft using numpy (for verification) with normalization
print("A")
print(X)                 # printing the results of numpy's fft

m = len(x_n)
output = []
for k in range(m):       # for each output element
    s = complex(0)
    for t in range(m):   # for each input element
        angle = 2j * cmath.pi * t * k / m
        s += x_n[t] * cmath.exp(-angle)
    output.append(s)

print("B")  # printing the results of the fft implementation using for loops
print(output)
It's not an infinite loop, but since m = 40000, the body of your inner loop is going to run 1.6 billion times. That takes a LOT of time, which is why FFTs are not implemented in pure Python. On my machine it does about 5 outer loops per second, so it will take around 3 hours to run.
You've written a perfectly good implementation of a Fourier transform. You have not written a Fast Fourier Transform, which specifically uses a series of techniques to bring the runtime down from O(n^2) to O(n log n).
This is what made the FFT so revolutionary when it was discovered. Hard problems that could only be tackled with a Fourier transform suddenly became a lot faster.
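As an aside (a sketch, not part of the original answer): if the goal is only to avoid the Python-level loops rather than to implement a true FFT, the same O(n^2) DFT can be written as a single matrix product in NumPy, which is far faster in practice for moderate n:

import numpy as np

def dft(x):
    # Direct O(n^2) DFT as a matrix product: X[k] = sum_t x[t] * exp(-2j*pi*k*t/n)
    n = len(x)
    k = np.arange(n)
    W = np.exp(-2j * np.pi * np.outer(k, k) / n)  # n x n DFT matrix
    return W @ x

# Caveat: for n = 40000 the DFT matrix alone needs roughly 25 GB of memory,
# so for large signals an O(n log n) FFT is still the right tool.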

How to generate a Rank 5 matrix with entries Uniform?

I want to generate a rank-5 100x600 matrix in numpy with all entries sampled from np.random.uniform(0, 20), so that all entries are uniformly distributed on [0, 20). What is the best way to do this in Python?
I see there is an SVD-inspired way to do it here (https://math.stackexchange.com/questions/3567510/how-to-generate-a-rank-r-matrix-with-entries-uniform), but I am not sure how to code it up. I am looking for a working example of this SVD-inspired approach that yields uniformly distributed entries.
I have actually managed to code up a rank-5 100x100 matrix by vertically stacking five 20x100 rank-1 matrices and then shuffling the row indices. However, the resulting 100x100 matrix does not have entries uniformly distributed on [0, 20).
Here is my code (my best attempt):
import numpy as np

def randomMatrix(m, n, p, q):
    # creates an m x n matrix with entries drawn uniformly between p and q
    return np.random.uniform(p, q, size=(m, n))

Qs = []
my_rank = 5
for i in range(my_rank):
    L = randomMatrix(20, 1, 0, np.sqrt(20))   # L is tall
    R = randomMatrix(1, 100, 0, np.sqrt(20))  # R is long
    Q = np.outer(L, R)
    Qs.append(Q)

Q = np.vstack(Qs)
# shuffle the rows (preserves rank 5 [confirmed])
np.random.shuffle(Q)
Not a perfect solution, I must admit. But it's simple and comes pretty close.
I create 5 vectors that will span the space of the matrix and fill the rest of the matrix with random linear combinations of them.
My initial thought was that a trivial solution would be to copy those vectors 20 times.
To improve on that, I created linear combinations of them with weights drawn from a uniform distribution, but then the distribution of the entries becomes normal, because the weighted averaging essentially lets the central limit theorem take effect.
A middle ground between the trivial approach and the second approach (which doesn't work) is to use sets of weights that favour one of the vectors over the others. You can generate these kinds of weight vectors by passing any vector through the softmax function with an appropriately high temperature parameter.
The distribution is almost uniform, but the vectors are still very close to the base vectors. You can play with the temperature parameter to find a sweet spot that suits your purpose.
from scipy.special import softmax
import numpy as np
from matplotlib import pyplot as plt

N = 100
R = 5
low = 0
high = 20
sm_temperature = 100

p = np.random.uniform(low, high, (1, R, N))          # R base vectors
weights = np.random.uniform(0, 1, (N - R, R, 1))
weights = softmax(weights * sm_temperature, axis=1)  # favour one base vector per row
p_lc = (weights * p).sum(1)                          # linear combinations

rand_mat = np.concatenate([p[0], p_lc])
plt.hist(rand_mat.flatten())
I just couldn't get over the fact that my previous solution (the "selection" method) did not really produce strictly uniformly distributed entries, only entries close enough to fool a statistical test sometimes; in the asymptotic case, the entries will almost surely not be uniformly distributed. But I did dream up another crazy idea that's just as bad, only in a different way - it's not really random.
In this solution, I do something similar to the OP's method of forming R rank-1 matrices and then concatenating them, but a little differently. I create each matrix by stacking the base vector multiplied by 0.5 on top of the same base vector, also multiplied by 0.5 but shifted by half the dynamic range of the uniform distribution. The process continues with multiplication by a third, two thirds and 1, with the corresponding shifts, and so on, until I have the required number of vectors in that part of the matrix.
I know it sounds incomprehensible, and unfortunately I couldn't find a way to explain it better. Hopefully, reading the code will shed some more light.
I hope this "staircase" method will be more reliable and useful.
import numpy as np
from matplotlib import pyplot as plt

'''
params:
N - base dimension
M - matrix length
R - matrix rank
high - max value of the matrix
low - min value of the matrix
'''
N = 100
M = 600
R = 5
high = 20
low = 0

# base vectors of the matrix
base = low + np.random.rand(R - 1, N) * (high - low)

def build_staircase(base, num_stairs, low, high):
    '''
    create a uniformly distributed matrix of rank 2 from 'num_stairs' different
    vectors whose elements are all uniformly distributed like the values of
    'base'.
    '''
    l = levels(num_stairs)
    vectors = []
    for l_i in l:
        for i in range(l_i):
            vector_dynamic = (base - low) / l_i
            vector_bias = low + np.ones_like(base) * i * ((high - low) / l_i)
            vectors.append(vector_dynamic + vector_bias)
    return np.array(vectors)

def levels(total):
    '''
    create a sequence of strictly increasing numbers summing up to the total.
    '''
    l = []
    sum_l = 0
    i = 1
    while sum_l < total:
        l.append(i)
        i += 1
        sum_l = sum(l)
    i = 0
    while sum_l > total:
        l[i] -= 1
        if l[i] == 0:
            l.pop(i)
        else:
            i += 1
        if i == len(l):
            i = 0
        sum_l = sum(l)
    return l

n_rm = R - 1  # number of matrix subsections
len_rms = [M // n_rm for _ in range(n_rm)]
len_rms[-1] += M % n_rm

rm_list = []
for i in range(n_rm):
    # create a rank-2 matrix with uniform entries
    # out of the vector 'base[i]' and a ones vector.
    rm_list.append(build_staircase(
        base=base[i],
        num_stairs=len_rms[i],
        low=low,
        high=high,
    ))

rm = np.concatenate(rm_list)
plt.hist(rm.flatten(), bins=100)
A few example histograms (images not reproduced here) illustrate this, including one with N = 1000 and M = 6000 to empirically demonstrate the nearly asymptotic behaviour.

Is there a better and faster way to convert from Scipy condensed distance matrix to a Scipy sparse distance matrix under a threshold

I'm trying to calculate the euclidean distance between n-dimensional points, and then get a sparse distance matrix of all points where the distance is under a set threshold.
I've already got a method working, but it is too slow. For 12,000 points in 3D it takes about 8 seconds. The rest of the program runs in under a second, so this is the main bottleneck. It will be run hundreds of times, so improving this step will increase performance by a large amount.
This is my current implementation.
import numpy as np
from scipy import sparse, spatial

def make_sparse_dm(points: np.ndarray, thresh):
    n = points.shape[0]
    distance_matrix = spatial.distance.squareform(spatial.distance.pdist(points))
    # alternative: sklearn.metrics.pairwise_distances(points)
    [i, j] = np.meshgrid(np.arange(n), np.arange(n))
    points_under_thresh = distance_matrix <= thresh
    i = i[points_under_thresh]
    j = j[points_under_thresh]
    v = distance_matrix[points_under_thresh]
    return sparse.coo_matrix((v, (i, j)), shape=(n, n)).tocsr()
The output is then fed into a library which is much faster when the input is in scipy sparse distance matrix form.
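One alternative worth benchmarking (a sketch, not from the original post) is to skip the dense n x n matrix entirely and let scipy's cKDTree report only the pairs whose distance is at or below the threshold:

from scipy.spatial import cKDTree

def make_sparse_dm_kdtree(points, thresh):
    # Builds only the entries with distance <= thresh. Note that exact-zero
    # distances (e.g. the diagonal) are not stored explicitly, which differs
    # slightly from the coo_matrix construction above.
    tree = cKDTree(points)
    return tree.sparse_distance_matrix(tree, max_distance=thresh).tocsr()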

Finding the sum of minimum distances from the points in one list to the points in the other list?

I have two lists containing x and y n-dimensional points respectively. I have to calculate the sum of the minimum distances from each point in the first list (containing x points) to the points in the second list (containing y points). The distance I am calculating is the Euclidean distance, and an optimized solution is needed.
I have already implemented a naive solution in Python, but its time complexity is too high to be usable anywhere. There should be room for optimization. Can the time complexity be reduced below that of my implementation?
I was reading this paper, which I was trying to implement. The authors face a similar problem, which they describe as a special case of the Earth Mover's Distance. Since no code was given, I could not see how they implemented it, hence my naive implementation below, which was too slow to work on a data set of 11k documents. I used Google Colab to execute my code.
# Calculating Euclidean distance between two points
def euclidean_dist(x, y):
    dd = 0.0
    # len(x) is the number of dimensions; x and y are lists
    # containing the coordinates of a point
    for i in range(len(x)):
        dd = dd + (x[i] - y[i])**2
    return dd**(1/2)

# Calculating the desired solution to our problem
def dist(l1, l2):
    min_dd = 0.0
    dd = euclidean_dist(l1[0], l2[0])
    for j in range(len(l1)):
        for k in range(len(l2)):
            temp = euclidean_dist(l1[j], l2[k])
            if dd > temp:
                dd = temp
        min_dd = min_dd + dd
        dd = euclidean_dist(l1[j], l2[0])
    return min_dd
To reduce the runtime, I would suggest first computing Manhattan distances (delta x + delta y), sorting the resulting array for each point, and then defining a buffer of +20% above the lowest Manhattan distance; for the values in the sorted list that fall within that +20% range, you compute the Euclidean distances and take the correct (minimum) Euclidean answer.
This will save some time, but the 20% figure might not help if the points are all close together, since most of them will then fall inside the buffer region; try fine-tuning the 20% parameter to see what works best for your dataset. Keep in mind that reducing it too much might lead to inaccurate answers because of the difference between Euclidean and Manhattan distances.
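A rough sketch of this pruning idea (my reading of it, not the answerer's code), using NumPy for the per-point filtering:

import numpy as np

def min_euclidean_sum(l1, l2, buffer=0.2):
    # For each point in l1, keep only the candidates in l2 whose Manhattan
    # distance is within +20% of the smallest one, then take the minimum
    # Euclidean distance among those candidates. This is a heuristic: a very
    # tight buffer may miss the true nearest neighbour.
    l1 = np.asarray(l1, dtype=float)
    l2 = np.asarray(l2, dtype=float)
    total = 0.0
    for p in l1:
        manhattan = np.abs(l2 - p).sum(axis=1)
        candidates = l2[manhattan <= manhattan.min() * (1.0 + buffer)]
        total += np.sqrt(((candidates - p) ** 2).sum(axis=1)).min()
    return total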
This is similar to a k-nearest-neighbour problem: finding the closest point to a given point costs O(N), so your problem as stated is O(N^2).
Using a k-d tree MAY improve performance if your data is low-dimensional, as sketched below.
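For example (a sketch, not part of the original answer), scipy's cKDTree reduces this to a few lines:

import numpy as np
from scipy.spatial import cKDTree

l1 = np.random.random((1000, 3))   # the x points
l2 = np.random.random((2000, 3))   # the y points

# query() returns each l1 point's nearest-neighbour (Euclidean) distance in l2
dists, _ = cKDTree(l2).query(l1)
min_distance_sum = dists.sum()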
To calculate the distance between two points you can use the Euclidean distance formula, d = sqrt((x1 - x2)^2 + (y1 - y2)^2), which you can implement in Python like this:
import math

def dist(x1, y1, x2, y2):
    return math.sqrt(pow(x1 - x2, 2) + pow(y1 - y2, 2))
Then all you need to do is loop over the lists of points, compute the distance between two points, and store it if it is below the currently stored minimal distance. You end up with an O(n²) algorithm, which seems to be what you want. Here is a working example:
min_dd = None
for i in range(len(l1)):
    for j in range(i + 1, len(l1)):
        dd = dist(l1[i], l2[i], l1[j], l2[j])
        if min_dd is None or dd < min_dd:
            min_dd = dd
With this you can get pretty good performance even with large lists of points.
Small arrays
For two numpy arrays x and y of shape (n,) and (m,) respectively, you can vectorize the distance calculations and then get the minimum distance:
import numpy as np
n = 10
m = 20
x = np.random.random(n)
y = np.random.random(m)
# Using squared distance matrix and taking the
# square root at the minimum value
distance_matrix = (x[:,None]-y[None,:])**2
minimum_distance_sum = np.sum(np.sqrt(np.min(distance_matrix, axis=1)))
For arrays of shape (n,l) and (m,l), you just need to calculate the distance_matrix as:
distance_matrix = np.sum((x[:,None]-y[None,:])**2, axis=2)
Alternatively, you could use np.linalg.norm, scipy.spatial.distance.cdist, np.einsum etc., but in many cases they are not faster.
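For instance, the cdist variant mentioned above is a one-liner (same result as the broadcasting version, assuming 2-D inputs x and y of shapes (n, l) and (m, l)):

from scipy.spatial.distance import cdist

minimum_distance_sum = cdist(x, y).min(axis=1).sum()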
Large arrays
If l, n and m above are too large for you to keep the distance_matrix in memory, you can use the mathematical lower and upper bounds of the Euclidean distance to increase the speed (see this paper). Since this relies on for loops, it will be very slow in pure Python, but one can wrap the functions with numba to counter this:
import numpy as np
import numba

@numba.jit(nopython=True, fastmath=True)
def get_squared_distance(a, b):
    return np.sum((a - b)**2)

def get_minimum_distance_sum(x, y):
    n = x.shape[0]
    m = y.shape[0]
    l = x.shape[1]
    # Calculate mean and standard deviation of both arrays
    mx = np.mean(x, axis=1)
    my = np.mean(y, axis=1)
    sx = np.std(x, axis=1)
    sy = np.std(y, axis=1)
    return _get_minimum_distance_sum(x, y, n, m, l, mx, my, sx, sy)

@numba.jit(nopython=True, fastmath=True)
def _get_minimum_distance_sum(x, y, n, m, l, mx, my, sx, sy):
    min_distance_sum = 0
    for i in range(n):
        min_distance = get_squared_distance(x[i], y[0])
        for j in range(1, m):
            if i == 0 and j == 0:
                continue
            lower_bound = l * ((mx[i] - my[j])**2 + (sx[i] - sy[j])**2)
            if lower_bound >= min_distance:
                continue
            distance = get_squared_distance(x[i], y[j])
            if distance < min_distance:
                min_distance = distance
        min_distance_sum += np.sqrt(min_distance)
    return min_distance_sum

def test_minimum_distance_sum():
    # Will likely need to be much larger for this to beat the other method
    n = 10
    m = 20
    l = 100
    x = np.random.random((n, l))
    y = np.random.random((m, l))
    return get_minimum_distance_sum(x, y)
This approach should be faster than the former approach with increased array size. The algorithm can be improved slightly as described in the paper, but any speedup would depend heavily on the shape of the arrays.
Timings
On my laptop, on two arrays of shape (1000,100), your approach takes ~1 min, the "small arrays" approach takes 690 ms and the "large arrays" approach takes 288 ms. For two arrays of shape (100, 3), your approach takes 28 ms, the "small arrays" approach takes 429 μs and the "large arrays" approach takes 578 μs.
