Create distance matrix from 2D-equidistant matrix - python

I have an equidistant two-dimensional grid of size n x n, so N = n*n points. I want to create an N x N distance matrix D where element D_ij is the distance between point i and point j.
E.g. the grid [[a b],[c d]] should have the distance matrix
[[0 1 1 1.414],[1 0 1.414 1],[1 1.414 0 1],[1.414 1 1 0]].
I have done this in a very brute-force way, shown below:
import numpy as np
N=2**2
n=int(np.sqrt(N))
row = np.arange(0,n)
grid = np.tile(row,n)
#East matrix. For each point, the matrix yields the distance in the x direction (east) to the other points. The first row covers each point in the first row in the grid, and so on.
D_E = np.zeros((N,N))
for i in range(N):
    for j in range(N):
        D_E[i,j] = grid[j]-grid[i]
#Same logic as above, but this time in the y direction
D_S = np.zeros((N,N))
row2 = np.arange(0,n)
grid2 = np.repeat(row2,n)
for i in range(N):
    for j in range(N):
        D_S[i,j] = grid2[j]-grid2[i]
D = np.sqrt(D_E**2 + D_S**2)
Is there any better way to do this, perhaps more pythonic that I have not thought about?
PS.
I can add that the main goal in the end is to create a covariance matrix for the grid. The distance matrix D is used to build the N x N covariance matrix, since each element of the covariance matrix is given by a covariance function that takes the Euclidean distance between two grid points as input.
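One way to avoid the double loops is NumPy broadcasting. The following is a minimal sketch, assuming unit grid spacing and row-major ordering of the points; the names ii, jj, coords and diff are illustrative and not from the question.

```python
import numpy as np

n = 2                                    # grid side length, so N = n * n points
ii, jj = np.indices((n, n))              # row and column index of every grid cell
coords = np.column_stack([ii.ravel(), jj.ravel()])   # shape (N, 2), row-major order

# Broadcast (N, 1, 2) against (1, N, 2) to get all pairwise coordinate differences
diff = coords[:, None, :] - coords[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))    # shape (N, N), D[i, j] = distance between i and j
```

For n = 2 the first row of D comes out as [0, 1, 1, 1.414], matching the example in the question. A covariance matrix could then be obtained by applying the covariance function elementwise to D.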

Related

Vectorized KL divergence calculation between all pairs of rows of a matrix

I would like to find the KL divergence between all pairs of rows of a matrix. To explain, let's assume there is a matrix V of shape N x K. Now I want to create a matrix L of dimension N x N, where each element L[i,j] = KL(V[i,:],V[j,:]). So far I have used scipy.stats.entropy to compute it as follows:
import itertools
import numpy as np
from scipy.stats import entropy
upper_triangle = [entropy(V[i,:],V[j,:]) for (i,j) in itertools.combinations(range(N),2)]
lower_triangle = [entropy(V[j,:],V[i,:]) for (i,j) in itertools.combinations(range(N),2)]
L = np.zeros((N,N))
L[np.triu_indices(N,k = 1)] = upper_triangle
# fill L[j,i] through the transposed view so the pair order matches combinations()
L.T[np.triu_indices(N,k = 1)] = lower_triangle
Is there a cleverer way?
OK, after massaging the equation for KL divergence a bit, the following should work too, and of course it is orders of magnitude faster:
kl = np.dot(V, np.log(Vc).T)
right = kl + kl.T
left = np.tile(np.diag(kl),(kl.shape[0],1))
left = left + left.T
L = left - right
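For reference, here is a hedged sketch of the same idea written out term by term. It assumes every row of V is a strictly positive probability distribution; Vc is not defined in the post, so this sketch uses V itself. If Vc does stand for V, the snippet above appears to produce the symmetrised sum KL(V[i], V[j]) + KL(V[j], V[i]) rather than the one-directional divergence, so both are shown.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((5, 4)) + 0.1             # made-up data with strictly positive entries
V /= V.sum(axis=1, keepdims=True)        # each row now sums to 1

logV = np.log(V)
cross = V @ logV.T                       # cross[i, j] = sum_k V[i, k] * log V[j, k]
self_term = np.sum(V * logV, axis=1)     # self_term[i] = sum_k V[i, k] * log V[i, k]

L = self_term[:, None] - cross           # L[i, j] = KL(V[i] || V[j])
L_sym = L + L.T                          # KL(V[i] || V[j]) + KL(V[j] || V[i])
```

The diagonal of L is zero, and L is in general not symmetric, which is why the loop version fills the upper and lower triangles separately.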

Algorithm to pixelate positions and fluxes

What would you do if you had n particles on a plane (with positions (x_n,y_n)), each with a certain flux flux_n, and you had to pixelate these particles, i.e. go from (x,y) to (pixel_i, pixel_j) space and sum up the flux of the m particles that fall into each pixel? Any suggestions? Thank you!
There are several ways to solve your problem.
Assumptions: your positions are stored in two numpy arrays of shape (N, ), i.e. the position x_n (or y_n) for n in [0, N); let's call them x and y. The flux is stored in a numpy array with the same shape, fluxes.
1 - INTENSIVE CASE
Create something that looks like a grid:
#get minimums and maximums position
mins = int(x.min()), int(y.min())
maxs = int(x.max()), int(y.max())
#actually you can also add and subtract 1 or more unit
#in order to have a grid larger than the x, y extremes
#something like mins-=epsilon and maxs += epsilon
#create the grid
xx = np.arange(mins[0], maxs[0])
yy = np.arange(mins[1], maxs[1])
Now you can perform a double for loop, taking two consecutive elements of xx and yy each time. To do this, you can simply take:
x1 = xx[:-1] #excluding the last element
x2 = xx[1:] #excluding the first element
#the same for y:
y1 = yy[:-1] #excluding the last element
y2 = yy[1:] #excluding the first element
fluxes_grid = np.zeros((xx.shape[0], yy.shape[0]))
for i, (x1_i, x2_i) in enumerate(zip(x1, x2)):
    for j, (y1_j, y2_j) in enumerate(zip(y1, y2)):
        idx = np.where((x>=x1_i) & (x<x2_i) & (y>=y1_j) & (y<y2_j))[0]
        fluxes_grid[i,j] = np.sum(fluxes[idx])
At the end of this loop you have a grid whose elements are pixels representing the sum of fluxes.
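The same binning can be done without explicit loops using np.histogram2d with its weights argument. This is a sketch on made-up data, assuming unit-sized pixels; x_edges and y_edges are illustrative names. Note that histogram2d treats the last bin as closed on the right, whereas the loop above uses half-open intervals throughout.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1000)        # made-up particle positions
y = rng.uniform(0, 10, size=1000)
fluxes = rng.uniform(0, 1, size=1000)    # made-up fluxes

x_edges = np.arange(0, 11)               # pixel edges, one unit apart
y_edges = np.arange(0, 11)

# Each particle contributes its flux to the pixel it falls into
fluxes_grid, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges], weights=fluxes)
```

Here fluxes_grid has shape (10, 10), and fluxes_grid[i, j] is the summed flux of the particles with x_edges[i] <= x < x_edges[i+1] and y_edges[j] <= y < y_edges[j+1].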
2 - USING A QUANTIZATION ALGORITHM LIKE K-NN
What happens if you have a lot of points, so many that the loop takes hours?
A faster solution is to use a quantization method, such as K Nearest Neighbors (KNN) on a rigid grid. There are many ways to run a KNN (including ready-made implementations, e.g. sklearn's KNN), but it is very efficient if you can take advantage of a GPU. For example, this is my TensorFlow (v2.1) implementation. After you have defined a square grid:
_min, _max = min(mins), max(maxs)
xx = np.arange(_min, _max)
yy = np.arange(_min, _max)
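You can build the grid matrix, grid, and your position matrix, X:
gx, gy = np.meshgrid(xx, yy) #all (x, y) combinations, not only the diagonal
grid = np.column_stack([gx.ravel(), gy.ravel()]).astype(np.float32)
X = np.column_stack([x, y]).astype(np.float32)
Then you have to define a pairwise Euclidean distance function: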
import tensorflow as tf
@tf.function
def pairwise_dist(A, B):
    # squared norms of each row in A and B
    na = tf.reduce_sum(tf.square(A), 1)
    nb = tf.reduce_sum(tf.square(B), 1)
    # na as a column vector and nb as a row vector
    na = tf.reshape(na, [-1, 1])
    nb = tf.reshape(nb, [1, -1])
    # return the pairwise Euclidean distance matrix
    D = tf.sqrt(tf.maximum(na - 2*tf.matmul(A, B, False, True) + nb, 0.0))
    return D
Thus:
#compute the pairwise distances:
D = pairwise_dist(grid, X)
D = D.numpy() #get a numpy matrix from a tf tensor
#D has shape M, N, where M is the number of points in the grid and N the number of positions.
#now take a rank and from this the best K (e.g. 10)
ranks = np.argsort(D, axis=1)[:, :10]
#for each point in the grid you have the nearest ten.
Now you have to take the fluxes corresponding to these 10 positions and sum over them.
I have avoided specifying this second method further, since I don't know the size of your catalogue, whether you have a GPU, or whether you want this kind of optimization.
If you are interested, I can expand this explanation.
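The last step of the second method (summing the fluxes of the K nearest positions for each grid point) can be sketched with NumPy fancy indexing. The distance matrix here is random stand-in data, and M, N, K and knn_flux are illustrative names, not part of the answer.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, K = 6, 50, 10                      # grid points, particles, neighbours (made up)
D = rng.random((M, N))                   # stand-in for the (M, N) distance matrix
fluxes = rng.random(N)                   # made-up fluxes, one per particle

ranks = np.argsort(D, axis=1)[:, :K]     # indices of the K nearest particles per grid point
knn_flux = fluxes[ranks].sum(axis=1)     # shape (M,): summed flux of those K particles
```

Indexing fluxes with the (M, K) integer array ranks returns an (M, K) array of fluxes, so the sum over axis 1 gives one value per grid point.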

calculate the distance matrix from the center given a matrix of size n

Let's say I have a matrix of size n (an odd number that's not 1) and I want to calculate the distance of each entry from the center. For example, if n = 5, then the matrix is 5 by 5, and to find the center of the matrix you would do:
import numpy as np
import math
center = math.floor(5/2)
Matrix[math.floor(5/2)][math.floor(5/2)] = 0
The center is zero because the distance to itself is 0. My approach is to make the center the origin of a coordinate plane, treat each of the 25 "squares" (5 by 5 matrix) as a dot in the center of its square, and then calculate the Euclidean distance of that dot from the center.
my idea so far..
Matrix = [[0 for x in range(n)] for y in range(n)] #initialize the n by n matrix
for i in range(0, n):
    for j in range(0, n):
        Matrix[i][j] = ...
Or is there a better way to find the distance matrix?
The output should be symmetric, and for an n = 5 matrix it would be
Matrix
[[2.82843, 2.23607, 2, 2.23607, 2.82843],
[2.23607, 1.41421, 1, 1.41421, 2.23607],
[2, 1, 0, 1, 2],
[2.23607, 1.41421, 1, 1.41421, 2.23607],
[2.82843, 2.23607, 2, 2.23607, 2.82843]]
TIA
The answer is Pythagoras' famous theorem: https://www.mathsisfun.com/pythagoras.html
For a cell at (i,j) you'll need the (x,y) offset to the center cell, then apply Pythagoras' theorem to compute the distance to that cell:
import math
def pythag(a, b):
    return math.sqrt(a*a + b*b)
n = 5
center = math.floor(n/2)
for i in range(0, n):
    for j in range(0, n):
        dist = pythag(i-center, j-center)
        print(dist)
Here's a repl with the code: https://repl.it/@powderflask/DizzyValuableQuark
Try to avoid loops when using numpy:
x_size, y_size = 5, 5
x_arr, y_arr = np.mgrid[0:x_size, 0:y_size]
cell = (2, 2)
dists = np.sqrt((x_arr - cell[0])**2 + (y_arr - cell[1])**2)
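The vectorised idea generalises to any odd n. A small sketch, assuming the center cell is at index n // 2 in both directions; center_distances is a hypothetical helper name.

```python
import numpy as np

def center_distances(n):
    """Euclidean distance of every cell of an n x n grid from the central cell (n odd)."""
    c = n // 2
    rows, cols = np.indices((n, n))      # row and column index of each cell
    return np.hypot(rows - c, cols - c)  # sqrt(dx**2 + dy**2), elementwise

dists = center_distances(5)
```

For n = 5 the corners come out as about 2.82843 and the center as 0, in line with the expected output listed in the question.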

Adding Euclidean distance to a matrix

Let me explain what I have to develop.
I have to write a function that receives two matrices, which have the same number of columns but can differ in the number of rows.
In summary, we will have two matrices of vectors with the same dimension but a different number N of elements.
I have to calculate the Euclidean distance between each pair of vectors from my two matrices, and then store it in another matrix that will contain the Euclidean distances between all my vectors.
This is the code I have developed:
def compute_distances(x, y):
    # Dimension:
    N, d = x.shape
    M, d_ = y.shape
    # The dimension should be the same
    if d != d_:
        print "Dimensions of x and y do not match, cannot compute the distances..."
        return None
    # Calculate distance with loops:
    D = np.zeros((N, M))
    i = 0
    j = 0
    for v1 in x:
        for v2 in y:
            if(j != M):
                D[i,j] = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(v1,v2)]))
                #print "[",i,",",j,"]"
                j = j + 1
            else:
                j = 0
        i = i + 1;
    print D
In this method I receive the two matrices and then create a matrix that holds the Euclidean distances between the vectors of my matrices x and y.
The problem is the following: I do not know how to assign each calculated Euclidean distance value to the correct position of the new matrix D that I have generated.
My main function has the following structure:
n = 1000
m = 700
d = 10
x = np.random.randn(n, d)
y = np.random.randn(m, d)
print "x shape =", x.shape
print "y shape =", y.shape
D_bucle = da.compute_distances(x, y)
D_cdist = cdist(x, y)
print np.max(np.abs(D_cdist - D_bucle))
D_cdist calculates the Euclidean distance using efficient methods.
D_bucle should give the same result, computed with inefficient code, but I'm not getting the result I should.
I think the problem is in how I fill my distance matrix D, so the calculations come out incorrect.
Updated!!!
I just updated my solution. My problem at first was that I didn't know how to assign the Euclidean distance for each pair of vectors to the D matrix.
Now I know how to assign it, but only the first row of the D matrix matches the result of the cdist function.
I don't fully understand what you're asking, but I do see one problem which may explain your results:
for v1 in x:
    for v2 in y:
        D = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(v1,v2)]))
You are overwriting the value of D each of the NxM times you go through this loop. When you're done, D only contains the distance of the last comparison. You might need something like D[i,j] = math.sqrt(...
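A hedged sketch of how the indexing could be handled: enumerate yields the row and column indices directly, so no manual counters are needed, and a broadcast version gives the same matrix without Python loops. Both function names below are illustrative, and the code is written for Python 3.

```python
import numpy as np

def compute_distances_loop(x, y):
    """Pairwise Euclidean distances with explicit loops; returns an (N, M) array."""
    D = np.zeros((x.shape[0], y.shape[0]))
    for i, v1 in enumerate(x):           # i is the row index of v1 in x
        for j, v2 in enumerate(y):       # j is the row index of v2 in y
            D[i, j] = np.sqrt(np.sum((v1 - v2) ** 2))
    return D

def compute_distances_vectorized(x, y):
    """Same result via broadcasting: (N, 1, d) - (1, M, d) -> (N, M, d)."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 3))         # made-up data
y = rng.standard_normal((15, 3))
D_loop = compute_distances_loop(x, y)
D_vec = compute_distances_vectorized(x, y)
```

Both functions return the matrix instead of printing it, so the result can be compared against another implementation with np.max(np.abs(...)) as in the question.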

Distance calculation on matrix using numpy

I am trying to implement a K-means algorithm in Python (I know there are libraries for that, but I want to learn how to implement it myself). Here is the function I am having a problem with:
def AssignPoints(points, centroids):
    """
    Takes two arguments:
    points is a numpy array such that points.shape = m , n where m is number of examples,
    and n is number of dimensions.
    centroids is numpy array such that centroids.shape = k , n where k is number of centroids.
    k < m should hold.
    Returns:
    numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.
    """
    m ,n = points.shape
    temp = []
    for i in xrange(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    distances = np.hypot(*temp)
    return distances.argmin(axis=1)
The purpose of this function is, given m points in n-dimensional space and k centroids in n-dimensional space, to produce a numpy array (x1 x2 x3 x4 ... xm) where x1 is the index of the centroid closest to the first point. This was working fine until I tried it with 4-dimensional examples. When I pass in 4-dimensional examples, I get this error:
File "/path/to/the/kmeans.py", line 28, in AssignPoints
distances = np.hypot(*temp)
ValueError: invalid number of arguments
How can I fix this, or if I can't, how do you suggest I calculate what I am trying to calculate here?
My Answer
def AssignPoints(points, centroids):
    m ,n = points.shape
    temp = []
    for i in xrange(n):
        temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
    for i in xrange(len(temp)):
        temp[i] = temp[i] ** 2
    distances = np.add.reduce(temp) ** 0.5
    return distances.argmin(axis=1)
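np.hypot is a binary ufunc: it accepts exactly two input arrays, which would explain why the call works for 2-dimensional data and raises ValueError for 4 dimensions. Besides squaring and summing, one alternative is to fold np.hypot over the per-dimension differences with functools.reduce, since hypot(hypot(a, b), c) equals sqrt(a**2 + b**2 + c**2). A Python 3 sketch, assuming at least two dimensions; assign_points and the sample data are illustrative.

```python
from functools import reduce
import numpy as np

def assign_points(points, centroids):
    """Index of the nearest centroid for each point (points: (m, n), centroids: (k, n))."""
    # One (m, k) array of coordinate differences per dimension
    diffs = [np.subtract.outer(points[:, i], centroids[:, i])
             for i in range(points.shape[1])]
    distances = reduce(np.hypot, diffs)  # folds the n arrays pairwise into Euclidean distances
    return distances.argmin(axis=1)

points = np.array([[0.0, 0.0, 0.0, 0.0],
                   [10.0, 10.0, 10.0, 10.0],
                   [1.0, 0.0, 0.0, 1.0]])
centroids = np.array([[0.0, 0.0, 0.0, 1.0],
                      [9.0, 9.0, 9.0, 9.0]])
labels = assign_points(points, centroids)
```

With this made-up data the first and third points are closest to centroid 0 and the second to centroid 1.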
Try this:
np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)
Or:
diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)
And don't ask what's it doing :D
Edit: nah, just kidding. The broadcasting in the middle (points[np.newaxis] - centroids[:,np.newaxis]) is "making" two 3D arrays from the original ones. The result is such that each "plane" contains the difference between all the points and one of the centroids. Let's call it diffs.
Then we do the usual operation to calculate the Euclidean distance (square root of the sum of squared differences): np.sqrt((diffs ** 2).sum(axis=2)). We end up with a (k, m) matrix where row 0 contains the distances to centroids[0], etc. So the .argmin(axis=0) gives you the result you wanted.
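To make the shapes in this explanation concrete, here is a small sketch on made-up data (m = 7 points, k = 3 centroids, n = 4 dimensions); the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((7, 4))              # m = 7 points in n = 4 dimensions
centroids = rng.random((3, 4))           # k = 3 centroids

# (1, m, n) - (k, 1, n) broadcasts to (k, m, n): one "plane" per centroid
diffs = points[np.newaxis] - centroids[:, np.newaxis]
norm = np.sqrt((diffs ** 2).sum(axis=2)) # (k, m): distance of every point to every centroid
closest = norm.argmin(axis=0)            # (m,): index of the nearest centroid per point
```

Because the sum runs over the last axis, the same code works for any number of dimensions n.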
You need to define a distance function where you are using hypot. Usually in K-means it is
Distance=sum((point-centroid)^2)
Here is some matlab code that does it ... I can port it if you can't, but give it a go. Like you said, only way to learn.
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
% idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
% in idx for a dataset X where each row is a single example. idx = m x 1
% vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set K
K = size(centroids, 1);
[numberOfExamples numberOfDimensions] = size(X);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);
% Go over every example, find its closest centroid, and store
% the index inside idx at the appropriate location.
% Concretely, idx(i) should contain the index of the centroid
% closest to example i. Hence, it should be a value in the
% range 1..K
%
for loop=1:numberOfExamples
    Distance = sum(bsxfun(@minus,X(loop,:),centroids).^2,2);
    [value index] = min(Distance);
    idx(loop) = index;
end;
end
UPDATE
This should return the (squared) distances. Notice that the MATLAB code above just returns the index of the closest centroid, whereas your function computes all distances, as does the one below.
def FindDistance(X,centroids):
    K = centroids.shape[0]
    examples, dimensions = X.shape
    distance = np.zeros((examples,K))
    for ex in xrange(examples):
        # squared Euclidean distance from example ex to every centroid
        distance[ex,:] = np.sum((X[ex,:]-centroids)**2,1)
    return distance
