Finding squared distances between n points and m points in numpy - python

I have two numpy arrays (say X and Y) in which each row represents a point vector.
I would like to find the squared Euclidean distance (call it 'dist') between each point in X and each point in Y.
I would like the output to be a matrix D where D(i,j) is dist(X(i), Y(j)).
I have the following python code based on : http://nonconditional.com/2014/04/on-the-trick-for-computing-the-squared-euclidian-distances-between-two-sets-of-vectors/
def get_sq_distances(X, Y):
    a = np.sum(np.square(X), axis=1, keepdims=True)
    b = np.ones((1, Y.shape[0]))
    c = a.dot(b)
    a = np.ones((X.shape[0], 1))
    b = np.sum(np.square(Y), axis=1, keepdims=True).T
    c += a.dot(b)
    c -= 2 * X.dot(Y.T)
    return c
I'm trying to avoid loops (should I?) and to use matrix multiplication in order to do a fast computation.
However, I get a "Memory Error" on large arrays. Maybe there is a better way to do this?

Scipy has the cdist function that does exactly what you want:
from scipy.spatial import distance
distance.cdist(X, Y, 'sqeuclidean')
The docs linked above have some good examples.
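If the full n x m matrix does not fit in memory (the "Memory Error" from the question), one option is to process X in row blocks and reduce each block before moving on. The following is only a sketch under that assumption: the helper name sq_distances_chunked and the chunk_size parameter are made up for illustration, and the nearest-neighbour reduction is just one example of what could be done with each block.

```python
import numpy as np
from scipy.spatial import distance

def sq_distances_chunked(X, Y, chunk_size=1024):
    """Yield (start, block), where block holds the squared distances
    from X[start:start + chunk_size] to every row of Y."""
    for start in range(0, X.shape[0], chunk_size):
        stop = start + chunk_size
        yield start, distance.cdist(X[start:stop], Y, 'sqeuclidean')

rng = np.random.default_rng(0)
X = rng.random((2500, 8))
Y = rng.random((300, 8))

# Example reduction: index of the nearest point in Y for every point in X.
# Only one (chunk_size x m) block is held in memory at a time.
nearest = np.empty(X.shape[0], dtype=np.intp)
for start, block in sq_distances_chunked(X, Y, chunk_size=1000):
    nearest[start:start + block.shape[0]] = block.argmin(axis=1)
```

The chunk size trades memory for the number of cdist calls, so it is worth tuning for the machine at hand.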

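Subtract the two vectors, square the differences, then sum them. This gives the squared distance for a single pair of points rather than the full matrix D.
import numpy as np
def get_sq_distances(a, b):
    return np.sum(np.square(np.subtract(a, b)))
print(get_sq_distances([5,7,9],[4,5,6]))  # prints 14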

Related

Lanczos algorithm for finding top eigenvalues of a matrix sum

I am trying to find the top k leading eigenvalues of a numpy matrix (using python dot product notation)
L@L + a*Y@Y.T, where L and Y are a symmetric nxn and an nxd matrix, respectively.
According to the below text from this paper, I should be able to calculate these leading eigenvalues with L@(L@v) + a*Y@(Y.T@v), where I guess v is an arbitrary vector. The Lanczos paper they cite is here.
I'm not quite sure where to start. I know that scipy has scipy.sparse.linalg.eigsh here, and from the notes it looks like it uses the Lanczos algorithm - but I am at a loss as to whether it's possible to use sparse.linalg.eigsh for my specific use case. I googled around and didn't find a Python implementation for this very quickly -- does anybody know if I can use sparse.linalg.eigsh to calculate this somehow? I definitely don't want to write this algorithm myself.
I also wasn't sure whether to post this in math.stackexchange or here, since it's a question about the Python implementation of a very mathy thing.
You could check scipy.sparse.linalg.eigsh.
import numpy as np
from scipy.sparse.linalg import eigsh
from numpy.linalg import eigh
a = 1.4
n = 20
d = 7
# random symmetric n x n matrix
L = np.random.randn(n, n)
L = L + L.T
# random n x d matrix
Y = np.random.randn(n, d)
A = L @ L.T + a * Y @ Y.T  # your equation
eigsh requires A to be real symmetric (or complex Hermitian), which holds here. If a > 0, A is also positive semi-definite, so its largest-magnitude eigenvalues are its largest eigenvalues.
You could check the four leading eigenvalues as follows
eigsh(A, 4)[0]
For reference, you can compare with numpy.linalg.eigh, which computes all the eigenvalues. Sort them and take the last four elements of the sorted array; the results should be close.
np.sort(eigh(A)[0])[-4:]
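The matrix-vector product from the question, L@(L@v) + a*Y@(Y.T@v), can also be passed to eigsh without ever forming the n x n matrix, by wrapping it in a scipy.sparse.linalg.LinearOperator. This is a minimal sketch, assuming L is symmetric as stated in the question; the sizes and the names matvec, op and top_k are chosen only for illustration.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(0)
n, d, a, k = 100, 7, 1.4, 4

L = rng.standard_normal((n, n))
L = L + L.T                      # symmetric n x n matrix
Y = rng.standard_normal((n, d))  # n x d matrix

def matvec(v):
    # Computes (L @ L + a * Y @ Y.T) @ v using only matrix-vector products
    return L @ (L @ v) + a * (Y @ (Y.T @ v))

op = LinearOperator((n, n), matvec=matvec, dtype=np.float64)

# which='LA' asks for the k largest (algebraic) eigenvalues
top_k = np.sort(eigsh(op, k=k, which='LA', return_eigenvectors=False))

# Dense reference, only feasible for small n
dense = np.linalg.eigvalsh(L @ L + a * Y @ Y.T)[-k:]
```

For small n the dense reference is cheaper; the operator form mainly pays off when n is large or L is sparse.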

Is there any inverse np.dot function?

If I have two matrices a and b, is there any function that finds the matrix x that, when dot-multiplied by a, gives b? I am looking for Python solutions for matrices in the form of numpy arrays.
Finding X such that A*X=B amounts to finding the "inverse of A", i.e. a matrix such that X = Ainverse * B.
For information, in math Ainverse is written A^(-1) ("A to the power -1", but you can say "A inverse" instead).
In numpy, there is a built-in function to find the inverse of a matrix a:
import numpy as np
ainv = np.linalg.inv(a)
See for instance this tutorial for explanations.
You need to be aware that some matrices are not "invertible". The most obvious examples (roughly) are:
matrices that are not square
matrices that represent a projection
numpy can still compute an approximate solution in certain cases.
If A is a full-rank, square matrix:
import numpy as np
from numpy.linalg import inv
X = inv(A) @ B
If not, an exact solution generally does not exist, but when A has full column rank we can compute the least-squares approximation:
import numpy as np
from numpy.linalg import inv
X = inv(A.T @ A) @ A.T @ B
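In practice it is usually preferable to let numpy solve the system rather than forming the inverse explicitly. A small sketch with made-up random data, using np.linalg.solve for the square case and np.linalg.lstsq for the non-square case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Square, full-rank case: solve A @ X = B directly
A = rng.standard_normal((4, 4))
X_true = rng.standard_normal((4, 3))
B = A @ X_true
X = np.linalg.solve(A, B)

# Tall (non-square) case: least-squares solution of A_tall @ X = B_tall
A_tall = rng.standard_normal((6, 4))
B_tall = rng.standard_normal((6, 3))
X_ls, residuals, rank, singular_values = np.linalg.lstsq(A_tall, B_tall, rcond=None)

# Same result as the normal-equations formula when A_tall has full column rank
X_normal = np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T @ B_tall
```

Both calls avoid the explicit inverse, which tends to be faster and numerically more stable.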

Finding the sum of minimum distance from the points in one list to points in other list?

I have two lists containing x and y n-dimensional points respectively. For each point in the first list (containing x points), I need the minimum distance to the points in the second list (containing y points), and then the sum of these minima. The distance is the Euclidean distance. I need an optimized solution.
I have already implemented a naive solution in Python, but its time complexity is too high to be practical. Can the time complexity of this problem be reduced below that of my implementation?
I was reading this paper, which I was trying to implement. It has a similar problem, which the authors describe as a special case of the Earth Mover's Distance. No code was given, so I do not know how they implemented it. My naive implementation, shown below, was too slow to work on a data set of 11k documents. I used Google Colab to run my code.
# Calculating Euclidean distance between two points
def euclidean_dist(x,y):
    dd = 0.0
    #len(x) is number of dimensions. Basically x and y is a
    #list which contains coordinates of a point
    for i in range(len(x)):
        dd = dd+(x[i]-y[i])**2
    return dd**(1/2)
# Calculating the desired solution to our problem
def dist(l1,l2):
    min_dd = 0.0
    for j in range(len(l1)):
        # start from the distance to the first point of l2
        dd = euclidean_dist(l1[j],l2[0])
        for k in range(1, len(l2)):
            temp = euclidean_dist(l1[j],l2[k])
            if dd > temp:
                dd = temp
        min_dd = min_dd+dd
    return min_dd
To reduce runtime, I would suggest first computing Manhattan distances (|delta x| + |delta y|) and sorting them for each point. Then create a buffer of +20% above the lowest Manhattan distance and compute Euclidean distances only for the candidates inside that buffer to find the minimum Euclidean distance.
This will save some time, but the 20% figure might not help if the points are all close together, as most of them will fall inside the buffer region. Try fine-tuning the 20% parameter to see what works best for your dataset. Keep in mind that reducing it too much might lead to inaccurate answers, because the nearest point in Manhattan distance is not necessarily the nearest in Euclidean distance.
This is similar to a nearest-neighbor problem: finding the closest point to a given point by brute force costs O(N), so your whole problem costs O(N^2).
Using a kd-tree may improve performance if your data is low-dimensional.
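The kd-tree idea can be sketched with scipy.spatial.cKDTree: build the tree on the second list once, then query the nearest neighbour of every point in the first list. The function name min_distance_sum and the random test data are assumptions made for this illustration, and the benefit is expected mainly for low-dimensional points.

```python
import numpy as np
from scipy.spatial import cKDTree

def min_distance_sum(points_a, points_b):
    """Sum over points_a of the Euclidean distance to the nearest point in points_b."""
    tree = cKDTree(points_b)
    nearest_dist, _ = tree.query(points_a, k=1)  # distances and indices of nearest neighbours
    return nearest_dist.sum()

rng = np.random.default_rng(0)
l1 = rng.random((500, 3))
l2 = rng.random((800, 3))
total = min_distance_sum(l1, l2)
```

In high dimensions a kd-tree tends to degrade towards brute force, so it is worth timing both on real data.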
To calculate the distance between two points, you can use the distance formula:
$$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$
which you can implement like this in Python:
import math
def dist(x1, y1, x2, y2):
    return math.sqrt(pow(x1 - x2, 2) + pow(y1 - y2, 2))
Then all you need to do is loop over the points, compute the distance between each pair, and store it if it is below the current minimum. You end up with an O(n²) algorithm, which is what you seem to want. Here is an example; note that it treats l1 and l2 as the x and y coordinates of a single set of 2D points:
min_dd = None
for i in range(len(l1)):
    for j in range(i + 1, len(l1)):
        dd = dist(l1[i], l2[i], l1[j], l2[j])
        if min_dd is None or dd < min_dd:
            min_dd = dd
With this you can get pretty good performance, although the O(n²) cost grows quickly with the number of points.
Small arrays
For two numpy arrays x and y of shape (n,) and (m,) respectively, you can vectorize the distance calculations and then get the minimum distance:
import numpy as np
n = 10
m = 20
x = np.random.random(n)
y = np.random.random(m)
# Using squared distance matrix and taking the
# square root at the minimum value
distance_matrix = (x[:,None]-y[None,:])**2
minimum_distance_sum = np.sum(np.sqrt(np.min(distance_matrix, axis=1)))
For arrays of shape (n,l) and (m,l), you just need to calculate the distance_matrix as:
distance_matrix = np.sum((x[:,None]-y[None,:])**2, axis=2)
Alternatively, you could use np.linalg.norm, scipy.spatial.distance.cdist, np.einsum etc., but in many cases they are not faster.
Large arrays
If l, n and m above are too large for you to keep the distance_matrix in memory, you can use the mathematical lower and upper bounds of the Euclidean distance to increase the speed (see this paper). Since this relies on for loops, it will be very slow in pure Python, but one can wrap the functions with numba to counter this:
import numpy as np
import numba
@numba.jit(nopython=True, fastmath=True)
def get_squared_distance(a,b):
    return np.sum((a-b)**2)
def get_minimum_distance_sum(x,y):
    n = x.shape[0]
    m = y.shape[0]
    l = x.shape[1]
    # Calculate mean and standard deviation of both arrays
    mx = np.mean(x, axis=1)
    my = np.mean(y, axis=1)
    sx = np.std(x, axis=1)
    sy = np.std(y, axis=1)
    return _get_minimum_distance_sum(x,y,n,m,l,mx,my,sx,sy)
@numba.jit(nopython=True, fastmath=True)
def _get_minimum_distance_sum(x,y,n,m,l,mx,my,sx,sy):
    min_distance_sum = 0.0
    for i in range(n):
        min_distance = get_squared_distance(x[i], y[0])
        for j in range(1,m):
            if i == 0 and j == 0:
                continue
            # Cheap lower bound on the squared distance; skip if it cannot beat the current minimum
            lower_bound = l * ((mx[i] - my[j])**2 + (sx[i] - sy[j])**2)
            if lower_bound >= min_distance:
                continue
            distance = get_squared_distance(x[i], y[j])
            if distance < min_distance:
                min_distance = distance
        min_distance_sum += np.sqrt(min_distance)
    return min_distance_sum
def test_minimum_distance_sum():
    # Will likely be much larger for this to be faster than the other method
    n = 10
    m = 20
    l = 100
    x = np.random.random((n,l))
    y = np.random.random((m,l))
    return get_minimum_distance_sum(x,y)
This approach should be faster than the former approach with increased array size. The algorithm can be improved slightly as described in the paper, but any speedup would depend heavily on the shape of the arrays.
Timings
On my laptop, on two arrays of shape (1000,100), your approach takes ~1 min, the "small arrays" approach takes 690 ms and the "large arrays" approach takes 288 ms. For two arrays of shape (100, 3), your approach takes 28 ms, the "small arrays" approach takes 429 μs and the "large arrays" approach takes 578 μs.

Solve overdetermined system with QR decomposition in Python

I'm trying to solve an overdetermined system with QR decomposition and linalg.solve but the error I get is
LinAlgError: Last 2 dimensions of the array must be square.
This happens when the R array is not square, right? The code looks like this
import numpy as np
import math as ma
A = np.random.rand(2,3)
b = np.random.rand(2,1)
Q, R = np.linalg.qr(A)
Qb = np.matmul(Q.T,b)
x_qr = np.linalg.solve(R,Qb)
Is there a way to write this in a more efficient way for arbitrary A dimensions? If not, how do I make this code snippet work?
The reason is indeed that the matrix R is not square, because A itself is not square. You can try np.linalg.lstsq instead, which finds the solution that minimizes the squared error (and yields an exact solution if one exists).
import numpy as np
A = np.random.rand(2, 3)
b = np.random.rand(2, 1)
x_qr = np.linalg.lstsq(A, b)[0]
You need QR in mode='reduced' (this is the default of np.linalg.qr in current NumPy versions). In mode='complete' the Q and R matrices are returned as M x M and M x N, so if M is greater than N then your matrix R will be nonsquare. In reduced (economic) mode the matrices will be M x N and N x N when M >= N, in which case the solve routine will work fine.
However, you also have equations/unknowns backwards for an overdetermined system. Your code snippet should be
import numpy as np
A = np.random.rand(3,2)
b = np.random.rand(3,1)
Q, R = np.linalg.qr(A, mode='reduced')
#print(Q.shape, R.shape)
Qb = np.matmul(Q.T,b)
x_qr = np.linalg.solve(R,Qb)
As noted by other contributors, you could also call lstsq directly, but sometimes it is more convenient to have Q and R themselves (e.g. if you are also planning on computing a projection matrix).
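To check that the QR route and lstsq agree on an overdetermined system, a short sketch along these lines can be used. It assumes SciPy is available and uses scipy.linalg.solve_triangular to exploit the triangular structure of R; the 5 x 2 random system is made up for the example.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
A = rng.random((5, 2))  # 5 equations, 2 unknowns: overdetermined
b = rng.random((5, 1))

Q, R = np.linalg.qr(A, mode='reduced')  # Q is 5 x 2, R is 2 x 2 upper triangular
x_qr = solve_triangular(R, Q.T @ b)     # back substitution on R x = Q^T b

# Reference least-squares solution
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
```

np.linalg.solve(R, Q.T @ b) gives the same answer here; solve_triangular just avoids a general factorization of R.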
As shown in the documentation of numpy.linalg.solve:
Computes the “exact” solution, x, of the well-determined, i.e., full rank, linear matrix equation ax = b.
Your system of equations is underdetermined, not overdetermined. Notice that it has 3 variables and 2 equations, thus fewer equations than unknowns.
Also notice that the documentation requires a in numpy.linalg.solve(a,b) to be an MxM matrix. The reason is that a unique exact solution of Ax=b exists only when A is invertible, and only square matrices are invertible.
In these cases a common approach is to use the Moore-Penrose pseudoinverse, which gives a best-fit (least-squares) solution of the system. So instead of trying to solve for the exact solution, use numpy.linalg.lstsq:
x_qr = np.linalg.lstsq(R, Qb, rcond=None)[0]

Pearson's correlation coefficient between all pairs of rows from two 2D arrays using scipy.stats.pearsonr vs. numpy.corrcoef in python 3.5

I tried to calculate the Pearson's correlation coefficients between every pair of rows from two 2D arrays, and then sort the rows/columns of the correlation matrix based on its diagonal elements. First, the correlation coefficient matrix (i.e., 'ccmtx') was calculated from one random matrix (i.e., 'randmtx') in the following code:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
def correlation_map(x, y):
    n_row_x = x.shape[0]
    n_row_y = y.shape[0]
    ccmtx_xy = np.empty((n_row_x, n_row_y))
    for n in range(n_row_x):
        for m in range(n_row_y):
            ccmtx_xy[n, m] = pearsonr(x[n, :], y[m, :])[0]
    return ccmtx_xy
randmtx = np.random.randn(100, 1000) # generating random matrix
#ccmtx = np.corrcoef(randmtx, randmtx) # cc matrix based on numpy.corrcoef
ccmtx = correlation_map(randmtx, randmtx) # cc matrix based on scipy pearsonr
#
ccmtx_diag = np.diagonal(ccmtx)
#
ids, vals = np.argsort(ccmtx_diag, kind = 'mergesort'), np.sort(ccmtx_diag, kind = 'mergesort')
#ids, vals = np.argsort(ccmtx_diag, kind = 'quicksort'), np.sort(ccmtx_diag, kind = 'quicksort')
plt.plot(ids)
plt.show()
plt.plot(ccmtx_diag[ids])
plt.show()
vals[0]
The issue here is that when 'pearsonr' was used, the diagonal elements of 'ccmtx' are exactly 1.0, which makes sense. However, when 'corrcoef' was used, the diagonal elements of 'ccmtx' are not exactly one (slightly less than 1 for some diagonals), seemingly due to floating-point precision error.
I find it annoying that the auto-correlation matrix of a single matrix has diagonal elements that are not 1.0, since this results in shuffling of the rows/columns of the correlation matrix when the matrix is sorted based on the diagonal elements.
My questions are:
[1] Is there any good way to speed up the computation if I stick to the 'pearsonr' function? (e.g., a vectorized pearsonr?)
[2] Is there any good way/practice to prevent this precision error when using 'corrcoef' in numpy? (e.g. the 'decimals' option in np.around?)
I have searched for correlation coefficient calculations between all pairs of rows or columns from two matrices. However, as the algorithms contain some sort of "cov / variance" operation, this kind of precision issue seems to always exist.
Minor point: the 'mergesort' option seems to provide more reliable results than 'quicksort', as quicksort shuffled a 1d array of values exactly equal to 1 into random order.
Any thoughts/comments would be greatly appreciated!
For question 1 (vectorized pearsonr), see the comments to the question.
I will answer only question 2: how to improve the precision of np.corrcoef.
The correlation matrix R is computed from the covariance matrix C according to
$$R_{ij} = \frac{C_{ij}}{\sqrt{C_{ii} C_{jj}}}.$$
The implementation is optimized for performance and memory usage. It computes the covariance matrix, and then performs two divisions, by sqrt(C_ii) and by sqrt(C_jj). This separate square-rooting is where the imprecision comes from. For example:
np.sqrt(3 * 3) - 3 == 0.0
np.sqrt(3) * np.sqrt(3) - 3 == -4.4408920985006262e-16
We can fix this by implementing our own simple corrcoef routine:
def corrcoef(a, b):
    c = np.cov(a, b)
    d = np.diag(c)
    return c / np.sqrt(d[:, None] * d[None, :])
Note that this implementation requires more memory than the numpy implementation because it needs to store a temporary matrix with size n * n and it is slightly slower because it needs to do n^2 square roots instead of only 2 n.
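For question [1], one possible vectorized replacement for the double loop over pearsonr is to center and normalize each row and then take a single matrix product. This is only a sketch: the function name correlation_map_vectorized is made up here, and the diagonal of an auto-correlation computed this way can still differ from 1.0 by rounding error.

```python
import numpy as np

def correlation_map_vectorized(x, y):
    """Pearson correlation between every row of x and every row of y."""
    # Center each row
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    # Scale each row to unit Euclidean norm
    xc /= np.linalg.norm(xc, axis=1, keepdims=True)
    yc /= np.linalg.norm(yc, axis=1, keepdims=True)
    # Dot products of unit-norm centered rows are the correlation coefficients
    return xc @ yc.T

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 200))
y = rng.standard_normal((7, 200))
cc = correlation_map_vectorized(x, y)  # shape (5, 7)
```

If exact ones on the diagonal matter for sorting, setting them explicitly (for example with np.fill_diagonal on the auto-correlation matrix) may be simpler than chasing the last bit of precision.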
