Let me explain what I need to build.
Say I have to write a function that receives two matrices, which have the same number of columns but may differ in the number of rows.
In other words, both matrices hold vectors of the same dimension, but each matrix may contain a different number N of them.
I have to compute the Euclidean distance between every vector of the first matrix and every vector of the second, and store the results in another matrix that holds all of these pairwise distances.
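Concretely, if x has shape (N, d) and y has shape (M, d), the result D should have shape (N, M) with D[i, j] = sqrt(sum_k (x[i, k] - y[j, k])**2).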
This is the code I have developed:
import math
import numpy as np

def compute_distances(x, y):
    # Dimensions:
    N, d = x.shape
    M, d_ = y.shape
    # The dimensions must be the same
    if d != d_:
        print "Dimensions of x and y do not match, cannot compute the distances..."
        return None
    # Calculate distances with loops:
    D = np.zeros((N, M))
    i = 0
    j = 0
    for v1 in x:
        for v2 in y:
            if(j != M):
                D[i,j] = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(v1,v2)]))
                #print "[",i,",",j,"]"
                j = j + 1
            else:
                j = 0
                i = i + 1
    return D
In this method I receive the two matrices and then build a matrix that holds the Euclidean distances between the vectors of my matrices x and y.
The problem is the following: I do not know how to assign each computed Euclidean distance value to the correct position of the new matrix D that I have created.
My main function has the following structure:
n = 1000
m = 700
d = 10
x = np.random.randn(n, d)
y = np.random.randn(m, d)
print "x shape =", x.shape
print "y shape =", y.shape
D_bucle = da.compute_distances(x, y)
D_cdist = cdist(x, y)
print np.max(np.abs(D_cdist - D_bucle))
D_cdist computes the Euclidean distances using scipy's efficient cdist function.
It should give the same result as D_bucle, which computes the same thing with the inefficient looped code, but I'm not getting the result I should.
I think the problem is in how I build my Euclidean distance matrix D: it is not being filled correctly, so the computed values end up wrong.
Updated!
I just updated my solution. My problem at first was that I didn't know how to assign my Euclidean distance result for each pair of vectors to the correct position of the D matrix.
Now I know how to assign it, but my new problem is that only the first row of the D matrix matches the result of the cdist function.
I'm not fully understanding what you're asking, but I do see one problem which may explain your results:
for v1 in x:
    for v2 in y:
        D = math.sqrt(sum([(xi-yi)**2 for xi,yi in zip(v1,v2)]))
You are overwriting the value of D on each of the NxM passes through this loop. When you're done, D only contains the distance from the last comparison. You might need something like D[i,j] = math.sqrt(...
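For reference, here is a minimal corrected version of the looped function (a sketch of the fix described above, not the asker's exact code): using enumerate for the indices removes the manual i/j bookkeeping entirely, so every cell of D gets filled.

import math
import numpy as np

def compute_distances(x, y):
    N, d = x.shape
    M, d_ = y.shape
    if d != d_:
        return None
    D = np.zeros((N, M))
    # enumerate yields the row index alongside each vector,
    # so every (i, j) cell is visited exactly once
    for i, v1 in enumerate(x):
        for j, v2 in enumerate(y):
            D[i, j] = math.sqrt(sum((xi - yi)**2 for xi, yi in zip(v1, v2)))
    return D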
Related
I have an equidistant two-dimensional grid of N points arranged as an n x n square (so N = n*n). I want to create an N x N distance matrix D where element D_ij is the distance between point i and point j.
E.g. the grid [[a, b], [c, d]] should have the distance matrix
[[0, 1, 1, 1.414], [1, 0, 1.414, 1], [1, 1.414, 0, 1], [1.414, 1, 1, 0]].
I have done this in a very brute force way, shown below
import numpy as np

N = 2**2
n = int(np.sqrt(N))
row = np.arange(0, n)
grid = np.tile(row, n)

# East matrix. For each point, the matrix yields the distance in the
# x direction (east) to the other points. The first row covers each
# point in the first row of the grid, and so on.
D_E = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        D_E[i, j] = grid[j] - grid[i]

# Same logic as above, but this time in the y direction.
D_S = np.zeros((N, N))
row2 = np.arange(0, n)
grid2 = np.repeat(row2, n)
for i in range(N):
    for j in range(N):
        D_S[i, j] = grid2[j] - grid2[i]

D = np.sqrt(D_E**2 + D_S**2)
Is there any better, perhaps more Pythonic, way to do this that I have not thought of?
PS. I should add that the main end goal is to create a covariance matrix for the grid: each element of the N x N covariance matrix is given by a covariance function whose input is the Euclidean distance between two grid points, so it is built directly from the distance matrix D.
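A more vectorized construction is possible. Here is a minimal sketch, assuming the same row-major point ordering as the loops above; it builds the (N, 2) coordinate array once and lets scipy's cdist (or plain broadcasting) produce the full distance matrix:

import numpy as np
from scipy.spatial.distance import cdist

n = 2
# (N, 2) array of grid coordinates, same row-major order as above
coords = np.column_stack([np.tile(np.arange(n), n),
                          np.repeat(np.arange(n), n)])

# option 1: scipy's pairwise-distance routine
D = cdist(coords, coords)

# option 2: pure numpy broadcasting, (N, 1, 2) - (1, N, 2) -> (N, N, 2)
diff = coords[:, np.newaxis, :] - coords[np.newaxis, :, :]
D2 = np.sqrt((diff**2).sum(axis=2))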
I would like to find the KL divergence between all pairs of rows of a matrix. To explain, assume there is a matrix V of shape N x K. Now I want to create a matrix L of dimension N x N, where each element L[i,j] = KL(V[i,:], V[j,:]). So far I have used scipy.stats.entropy to compute them:
upper_triangle = [entropy(V[i,:], V[j,:]) for (i,j) in itertools.combinations(range(N), 2)]
# note: tril_indices enumerates positions row by row, which is not the same
# order as itertools.combinations, so the lower triangle is built from the
# lower-triangle index pairs directly
lower_triangle = [entropy(V[i,:], V[j,:]) for (i,j) in zip(*np.tril_indices(N, k=-1))]
L = np.zeros((N,N))
L[np.triu_indices(N, k=1)] = upper_triangle
L[np.tril_indices(N, k=-1)] = lower_triangle
Is there a cleverer way?
OK, after massaging the equation for KL divergence a bit, the following should work too, and of course it's orders of magnitude faster:
# kl[i,j] = sum_k V[i,k] * log(V[j,k]); the original's Vc is assumed
# to be V, possibly clipped away from zero before taking the log
kl = np.dot(V, np.log(V).T)
right = kl + kl.T
left = np.tile(np.diag(kl), (kl.shape[0], 1))
left = left + left.T
L = left - right
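If you want the asymmetric matrix from the question, with L[i,j] = KL(V[i,:], V[j,:]) exactly, the same trick gives it in one step. A minimal sketch, assuming each row of V is a proper distribution (strictly positive entries summing to 1) and natural-log KL, matching scipy.stats.entropy:

import numpy as np

# kl[i, j] = sum_k V[i, k] * log(V[j, k])
kl = np.dot(V, np.log(V).T)
# KL(V[i] || V[j]) = sum_k V[i,k]*log(V[i,k]) - sum_k V[i,k]*log(V[j,k])
L = np.diag(kl)[:, np.newaxis] - kl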
What would you do if you had n particles on a plane (with positions (x_n, y_n)), each with a certain flux flux_n, and you had to pixelate these particles, i.e. go from (x, y) space to (pixel_i, pixel_j) space and sum up the flux of the m particles that fall into every single pixel? Any suggestions? Thank you!
There are several ways to solve your problem.
Assumptions: your positions are stored in two numpy arrays of shape (N,), i.e. the positions x_n (or y_n) for n in [0, N); let's call them x and y. The fluxes are stored in a numpy array of the same shape, fluxes.
1 - INTENSIVE CASE
Create something that looks like a grid:
# get minimum and maximum positions
mins = int(x.min()), int(y.min())
maxs = int(x.max()), int(y.max())
# you can also add and subtract 1 or more units in order to have a grid
# larger than the x, y extremes, i.e. mins -= epsilon and maxs += epsilon
# create the grid
xx = np.arange(mins[0], maxs[0])
yy = np.arange(mins[1], maxs[1])
Now you can perform a double for loop, taking, each time, two consecutive elements of xx and yy. To do this, you can simply take:
x1 = xx[:-1] #excluding the last element
x2 = xx[1:] #excluding the first element
#the same for y:
y1 = yy[:-1] #excluding the last element
y2 = yy[1:] #excluding the first element
fluxes_grid = np.zeros((xx.shape[0], yy.shape[0]))
for i, (x1_i, x2_i) in enumerate(zip(x1, x2)):
    for j, (y1_j, y2_j) in enumerate(zip(y1, y2)):
        idx = np.where((x >= x1_i) & (x < x2_i) & (y >= y1_j) & (y < y2_j))[0]
        fluxes_grid[i, j] = np.sum(fluxes[idx])
At the end of this loop you have a grid whose elements are pixels representing the sum of fluxes.
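Incidentally, numpy has a built-in routine that performs exactly this weighted binning in one call; a minimal sketch, reusing the xx and yy edges defined above:

import numpy as np

# histogram2d sums the per-sample weights falling into each 2-D bin,
# which is the flux-per-pixel computation done by the loop above
fluxes_grid, xedges, yedges = np.histogram2d(x, y, bins=[xx, yy], weights=fluxes)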
2 - USING A QUANTIZATION ALGORITHM LIKE K-NN
What happens if you have a lot of points, so many that the loop takes hours?
A faster solution is to use a quantization method, like K-Nearest Neighbours (KNN), on a rigid grid. There are many ways to run a KNN (including already-implemented versions, e.g. sklearn's KNN), but it is very efficient if you can take advantage of a GPU. For example, here is my tensorflow (v2.1) implementation. After you have defined a square grid:
_min, _max = min(mins), max(maxs)
xx = np.arange(_min, _max)
yy = np.arange(_min, _max)
You can build the matrix, grid, and your position matrix, X:
# build the full set of grid points (all (x, y) pairs, not just the diagonal)
gx, gy = np.meshgrid(xx, yy)
grid = np.column_stack([gx.ravel(), gy.ravel()])
X = np.column_stack([x, y])
then you have to define a pairwise Euclidean distance function for matrices:
import tensorflow as tf

@tf.function
def pairwise_dist(A, B):
    # squared norms of each row in A and B
    na = tf.reduce_sum(tf.square(A), 1)
    nb = tf.reduce_sum(tf.square(B), 1)
    # na as a column and nb as a row vector
    na = tf.reshape(na, [-1, 1])
    nb = tf.reshape(nb, [1, -1])
    # return pairwise euclidean distance matrix
    D = tf.sqrt(tf.maximum(na - 2*tf.matmul(A, B, False, True) + nb, 0.0))
    return D
Thus:
# compute the pairwise distances:
D = pairwise_dist(grid, X)
D = D.numpy()  # get a numpy matrix from a tf tensor
# D has shape (M, N), where M is the number of points in the grid
# and N the number of positions
# now rank the distances and keep the best K (e.g. 10)
ranks = np.argsort(D, axis=1)[:, :10]
# for each point in the grid you now have the nearest ten positions
Now you have to take the fluxes corresponding to these 10 positions and sum over them.
I have avoided specifying this second method further because I don't know the size of your catalogue, whether you have a GPU, or whether you want to use this kind of optimization. If you want, and if you are interested, I can expand this explanation.
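If a GPU is not available, the same nearest-neighbour lookup can be done efficiently on the CPU with scipy's k-d tree; a minimal sketch under the same grid, X and fluxes assumptions as above:

from scipy.spatial import cKDTree

# query the 10 nearest particle positions for every grid point
tree = cKDTree(X)
dists, idx = tree.query(grid, k=10)
# sum the fluxes of the 10 nearest particles per grid point
flux_per_grid_point = fluxes[idx].sum(axis=1)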
I need to efficiently calculate the weighted Euclidean distances from every x, y point in a given array to every other x, y point in another array. This is the code I have, which works as expected:
import numpy as np
import random
def rand_data(integ):
    '''
    Function that generates 'integ' random values between [0., 1.)
    '''
    rand_dat = [random.random() for _ in range(integ)]
    return rand_dat
def weighted_dist(indx, x_coo, y_coo):
    '''
    Function that calculates *weighted* euclidean distances.
    '''
    dist_point_list = []
    # Iterate through every point in array_2.
    for indx2, x_coo2 in enumerate(array_2[0]):
        y_coo2 = array_2[1][indx2]
        # Weighted distance in x.
        x_dist_weight = (x_coo - x_coo2) / w_data[0][indx]
        # Weighted distance in y.
        y_dist_weight = (y_coo - y_coo2) / w_data[1][indx]
        # Weighted distance between the point passed in from array_1 and this
        # point from array_2.
        dist = np.sqrt(x_dist_weight**2 + y_dist_weight**2)
        # Append weighted distance value to list.
        dist_point_list.append(round(dist, 8))
    return dist_point_list
# Generate random x, y data points.
array_1 = np.array([rand_data(10), rand_data(10)], dtype=float)
# Generate weights for each x, y coord for points in array_1.
w_data = np.array([rand_data(10), rand_data(10)], dtype=float)
# Generate second, larger array.
array_2 = np.array([rand_data(100), rand_data(100)], dtype=float)

# Obtain *weighted* distances for every point in array_1 to every point in array_2.
dist = []
# Iterate through every point in array_1.
for indx, x_coo in enumerate(array_1[0]):
    y_coo = array_1[1][indx]
    # Call function to get weighted distances for this point to every point in
    # array_2.
    dist.append(weighted_dist(indx, x_coo, y_coo))
The final list dist holds as many sub-lists as there are points in the first array, each with as many elements as there are points in the second one (the weighted distances).
I'd like to know if there's a way to make this code more efficient, perhaps using the cdist function, because this process becomes quite expensive when the arrays have lots of elements (which mine do) and when I have to compute the distances for lots of arrays (which I also do).
@Evan and @Martinis Group are on the right track - to expand on Evan's answer, here's a function that uses broadcasting to quickly calculate the n-dimensional weighted euclidean distance without Python loops:
import numpy as np
def fast_wdist(A, B, W):
    """
    Compute the weighted euclidean distance between two arrays of points:

        D{i,j} = sqrt( ((A{0,i}-B{0,j})/W{0,i})^2 + ... + ((A{k,i}-B{k,j})/W{k,i})^2 )

    inputs:
        A is an (k, m) array of coordinates
        B is an (k, n) array of coordinates
        W is an (k, m) array of weights

    returns:
        D is an (m, n) array of weighted euclidean distances
    """
    # compute the differences and apply the weights in one go using
    # broadcasting jujitsu. the result is (n, k, m)
    wdiff = (A[np.newaxis,...] - B[np.newaxis,...].T) / W[np.newaxis,...]
    # square and sum over the second axis, take the sqrt and transpose. the
    # result is an (m, n) array of weighted euclidean distances
    D = np.sqrt((wdiff*wdiff).sum(1)).T
    return D
To check that this works OK, we'll compare it to a slower version that uses nested Python loops:
def slow_wdist(A, B, W):
    k, m = A.shape
    _, n = B.shape
    D = np.zeros((m, n))
    for ii in xrange(m):
        for jj in xrange(n):
            wdiff = (A[:,ii] - B[:,jj]) / W[:,ii]
            D[ii,jj] = np.sqrt((wdiff**2).sum())
    return D
First, let's make sure that the two functions give the same answer:
# make some random points and weights
def setup(k=2, m=100, n=300):
    return np.random.randn(k,m), np.random.randn(k,n), np.random.randn(k,m)

a, b, w = setup()
d0 = slow_wdist(a, b, w)
d1 = fast_wdist(a, b, w)
print np.allclose(d0, d1)
# True
Needless to say, the version that uses broadcasting rather than Python loops is several orders of magnitude faster:
%%timeit a, b, w = setup()
slow_wdist(a, b, w)
# 1 loops, best of 3: 647 ms per loop
%%timeit a, b, w = setup()
fast_wdist(a, b, w)
# 1000 loops, best of 3: 620 us per loop
You could use cdist if you don't need weighted distances. If you need weighted distances and performance, create an array of the appropriate output size, and use either an automated accelerator like Numba or Parakeet, or hand-tune the code with Cython.
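For the unweighted case mentioned above, a minimal cdist sketch (note that cdist expects points as rows, so the (k, m) arrays used here need transposing):

from scipy.spatial.distance import cdist

# plain (unweighted) pairwise euclidean distances, shape (m, n)
D_unweighted = cdist(a.T, b.T)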
You can avoid looping by using code that looks like the following:
def compute_distances(A, B, W):
    # pairwise differences along x via broadcasting:
    # a (1, m) row against an (n, 1) column yields an (n, m) matrix
    Ax = A[:, 0].reshape(1, A.shape[0])
    Bx = B[:, 0].reshape(B.shape[0], 1)
    dx = Bx - Ax
    # same for dy
    Ay = A[:, 1].reshape(1, A.shape[0])
    By = B[:, 1].reshape(B.shape[0], 1)
    dy = By - Ay
    dist = np.sqrt(dx**2 + dy**2) * W
    return dist
That will run a lot faster in Python than anything that loops, as long as you have enough memory for the arrays.
You could try removing the square root: for non-negative values, a > b implies a squared > b squared, so comparisons between distances still work... and computers are REALLY slow at square roots normally.
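To illustrate, a small sketch, assuming you only need to rank or compare the distances (reusing dx and dy from compute_distances above) rather than use their absolute values:

# squared distances preserve the ordering, so argmin finds the same
# nearest neighbour without ever calling sqrt
d2 = dx**2 + dy**2
nearest = d2.argmin(axis=0)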
I am trying to implement a K-means algorithm in Python (I know there are libraries for that, but I want to learn how to implement it myself). Here is the function I am having a problem with:
def AssignPoints(points, centroids):
    """
    Takes two arguments:
    points is a numpy array such that points.shape = m, n where m is number
    of examples, and n is number of dimensions.
    centroids is a numpy array such that centroids.shape = k, n where k is
    number of centroids. k < m should hold.

    Returns:
    numpy array A such that A.shape = (m,) and A[i] is the index of the
    centroid which points[i] is assigned to.
    """
    m, n = points.shape
    temp = []
    for i in xrange(n):
        temp.append(np.subtract.outer(points[:,i], centroids[:,i]))
    distances = np.hypot(*temp)
    return distances.argmin(axis=1)
The purpose of this function: given m points in n-dimensional space and k centroids in n-dimensional space, produce a numpy array (x1 x2 x3 ... xm) where x1 is the index of the centroid closest to the first point. This was working fine until I tried it with 4-dimensional examples, where I get this error:
File "/path/to/the/kmeans.py", line 28, in AssignPoints
distances = np.hypot(*temp)
ValueError: invalid number of arguments
How can I fix this, or if I can't, how do you suggest I calculate what I am trying to calculate here?
My Answer
def AssignPoints(points, centroids):
    m, n = points.shape
    temp = []
    for i in xrange(n):
        temp.append(np.subtract.outer(points[:,i], centroids[:,i]))
    for i in xrange(len(temp)):
        temp[i] = temp[i] ** 2
    distances = np.add.reduce(temp) ** 0.5
    return distances.argmin(axis=1)
Try this:
np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)
Or:
diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)
And don't ask what it's doing :D
Edit: nah, just kidding. The broadcasting in the middle (points[np.newaxis] - centroids[:,np.newaxis]) is "making" two 3D arrays from the original ones. The result is such that each "plane" contains the difference between all the points and one of the centroids. Let's call it diffs.
Then we do the usual operation to calculate the euclidean distance (square root of the sum of squared differences): np.sqrt((diffs ** 2).sum(axis=2)). We end up with a (k, m) matrix where row 0 contains the distances to centroids[0], etc. So the .argmin(axis=0) gives you the result you wanted.
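To make the shapes concrete, a small sketch (the sizes are chosen just for illustration):

import numpy as np

points = np.random.randn(5, 3)     # m=5 points in n=3 dimensions
centroids = np.random.randn(2, 3)  # k=2 centroids

diffs = points[np.newaxis] - centroids[:, np.newaxis]  # (k, m, n) = (2, 5, 3)
norm = np.sqrt((diffs**2).sum(axis=2))                 # (k, m) distance matrix
closest = norm.argmin(axis=0)                          # (m,) index of nearest centroid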
Where you are using hypot, you need to define a distance function. In K-means it is usually:
Distance = sum((point - centroid)^2)
Here is some matlab code that does it ... I can port it if you can't, but give it a go. Like you said, only way to learn.
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);
[numberOfExamples numberOfDimensions] = size(X);

% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);

% Go over every example, find its closest centroid, and store
% the index inside idx at the appropriate location.
% Concretely, idx(i) should contain the index of the centroid
% closest to example i. Hence, it should be a value in the
% range 1..K
%
for loop=1:numberOfExamples
    Distance = sum(bsxfun(@minus, X(loop,:), centroids).^2, 2);
    [value index] = min(Distance);
    idx(loop) = index;
end;

end
UPDATE
This should return the distances; notice that the matlab code above returns only the distance (and index) of the closest centroid... your function returns all distances, as does the one below.
def FindDistance(X, centroids):
    K = centroids.shape[0]
    examples, dimensions = X.shape
    # note: these are *squared* euclidean distances, which is all K-means
    # needs to find the closest centroid (sqrt is monotonic)
    distance = np.zeros((examples, K))
    for ex in xrange(examples):
        distance[ex,:] = np.sum((X[ex,:] - centroids)**2, 1)
    return distance
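The same thing can also be written without the explicit loop; a minimal sketch using the broadcasting trick from the earlier answer:

# (m, 1, n) - (1, k, n) broadcasts to (m, k, n); summing over the last
# axis gives the (m, k) matrix of squared distances
distance = ((X[:, np.newaxis, :] - centroids[np.newaxis, :, :])**2).sum(axis=2)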