how to calculate the distances between all datapoints among each other - python

I want to check which data points within X are close to each other and which are far. by calculating the distances between each other without getting to zero, is it possible?
X = np.random.rand(20, 10)
dist = (X - X) ** 2
print(X)

Using just numpy you can either do,
np.linalg.norm((X - X[:,None]),axis=-1)
or,
np.sqrt(np.square(X - X[:,None]).sum(-1))

Another possible solution:
from scipy.spatial.distance import cdist
X = np.random.rand(20, 10)
cdist(X, X)

You can go though each point in sequence
X = np.random.rand(20, 10)
no_points = X.shape[0]
distances = np.zeros((no_points, no_points))
for i in range(no_points):
for j in range(no_points):
distances[i, j] = np.linalg.norm(X[i, :] - X[j, :])
print(distances,np.max(distances))

I would assume you want a way to actually get some way of keeping track of the distances, correct? If so, you can easily build a dictionary that will contain the distances as the keys and a list of tuples that correspond to the points as the value. Then you would just need to iterate through the keys in asc order to get the distances from least to greatest and the points that correspond to that distance. One way to do so would be to just brute force each possible connection between points.
dist = dict()
X = np.random.rand(20, 10)
for indexOfNumber1 in range(len(X) - 1):
for indexOfNumber2 in range(1, len(X)):
distance = sqrt( (X[indexOfNumber1] - X[indexOfNumber2])**2 )
if distance not in dist.keys():
dist[distance] = [tuple(X[indexOfNumber1], X[indexOfNumber2])]
else:
dist[distance] = dist[distance].append(tuple(X[indexOfNumber1], X[indexOfNumber2]))
The code above will then have a dictionary dist that contains all of the possible distances from the points you are looking at and the corresponding points that achieve that distance.

Related

Calculate euclidean distance from a set in Python

I have an array S(i) which contains all of the index values of the x coordinates j which are close to i, i.e. if S(1) = {2,3} this means that x2 and x3 are close to x1. In total I have S(1), ..., S(N) sets.
So this part of the code works fine:
arr = np.array([[1,3], [2,8],[3,1],[6,18], [9,8]])
arr = [item[0] for item in arr] #Extract x-coordinates
def Si(x): #This is the set i want to use
return [[j for j in range(len(x)) if np.abs(x[j] - x[i]) < 2] for i in range(len(x))]
Now i have a subscript of j's, I want to calculate the euclidean distance between (x_i,y_i) to each (x_j,y_j) in S(i), e.g for i=1, if S(1) = {7}, find distance between (x_1, y_1) and (x_7, y_7), for i=2, if S(2) = {3,9}, find distance between (x_2,y_2) and (x_3,y_3) and (x_2,y_2) and (x_9,y_9) and repeat for each i.
I don't know how to implement this, i'm really confused! Here is a Euclidean distance code which finds it for ALL values in the array, but not in the set which i want.
def euc_dist(arr):
arr_x = (arr[:,0,np.newaxis].T - arr[:,0,np.newaxis])**2 ##x-coordinates
arr_y = (arr[:,1,np.newaxis].T - arr[:,1,np.newaxis])**2 ##y-coordinates
arr = np.sqrt(arr_x + arr_y)
return arr
This should work:
S = Si(arr) # get the array
def my_fn(i): # take the value of i
euc_dists = []
for j in S[i]: # iterate over j's in S[i]
if i!= j:
dist = np.linalg.norm(arr[i]-arr[j]) # euclidean distance
euc_dists.append(dist)
return euc_dists

Algorithm to pixelate positions and fluxes

What would you do if you had n particles on a plane (with positions (x_n,y_n)), with a certain flux flux_n, and you have to pixelate these particles, so you have to go from (x,y) to (pixel_i, pixel_j) space and you have to sum up the flux of the m particles which fall in to every single pixel? Any suggestions? Thank you!
The are several ways with which you can solve your problem.
Assumptions: your positions have been stored into two numpy array of shape (N, ), i.e. the position x_n (or y_n) for n in [0, N), let's call them x and y. The flux is stored into a numpy array with the same shape, fluxes.
1 - INTENSIVE CASE
Create something that looks like a grid:
#get minimums and maximums position
mins = int(x.min()), int(y.min())
maxs = int(x.max()), int(y.max())
#actually you can also add and subtract 1 or more unit
#in order to have a grid larger than the x, y extremes
#something like mins-=epsilon and maxs += epsilon
#create the grid
xx = np.arange(mins[0], maxs[0])
yy = np.arange(mins[1], maxs[1])
Now you can perform a double for loop, tacking, each time, two consecutive elements of xx and yy, to do this, you can simple take:
x1 = xx[:-1] #excluding the last element
x2 = xx[1:] #excluding the first element
#the same for y:
y1 = yy[:-1] #excluding the last element
y2 = yy[1:] #excluding the first element
fluxes_grid = np.zeros((xx.shape[0], yy.shape[0]))
for i, (x1_i, x2_i) in enumerate(zip(x1, x2)):
for j, (y1_j, y2_j) in enumerate(zip(y1, y2)):
idx = np.where((x>=x1_i) & (x<x2_i) & (y>=y1_j) & (y<y2_j))[0]
fluxes_grid[i,j] = np.sum(fluxes[idx])
At the end of this loop you have a grid whose elements are pixels representing the sum of fluxes.
2 - USING A QUANTIZATION ALGORITHM LIKE K-NN
What happen if you have a lot o points, so many that the loop takes hours?
A faster solution is to use a quantization method, like K Nearest Neighbor, KNN on a rigid grid. There are many way to run a KNN (included already implemented version, e.g. sklearn KNN). But this is vary efficient if you can take advantage of a GPU. For example this my tensorflow (vs 2.1) implementation. After you have defined a squared grid:
_min, maxs = min(mins), max(maxs)
xx = np.arange(_min, _max)
yy = np.arange(_min, _max)
You can build the matrix, grid, and your position matrix, X:
grid = np.column_stack([xx, yy])
X = np.column_stack([x, y])
then you have to define a matrix euclidean pairwise-distance function:
#tf.function
def pairwise_dist(A, B):
# squared norms of each row in A and B
na = tf.reduce_sum(tf.square(A), 1)
nb = tf.reduce_sum(tf.square(B), 1)
# na as a row and nb as a co"lumn vectors
na = tf.reshape(na, [-1, 1])
nb = tf.reshape(nb, [1, -1])
# return pairwise euclidead difference matrix
D = tf.sqrt(tf.maximum(na - 2*tf.matmul(A, B, False, True) + nb, 0.0))
return D
Thus:
#compute the pairwise distances:
D = pairwise_dist(grid, X)
D = D.numpy() #get a numpy matrix from a tf tensor
#D has shape M, N, where M is the number of points in the grid and N the number of positions.
#now take a rank and from this the best K (e.g. 10)
ranks = np.argsort(D, axis=1)[:, :10]
#for each point in the grid you have the nearest ten.
Now you have to take the fluxes corresponding to this 10 positions and sum on them.
I had avoid to further specify this second method, I don't know the dimension of your catalogue, if you have or not a GPU or if you want to use such kind of optimization.
If you want I can improve this explanation, only if you are interested.

Using combinations or another trick to iterate though 3 different arrays?

consider my code
a,b,c = np.loadtxt ('test.dat', dtype='double', unpack=True)
a,b, and c are the same array length.
for i in range(len(a)):
q[i] = 3*10**5*c[i]/100
x[i] = q[i]*math.sin(a)*math.cos(b)
y[i] = q[i]*math.sin(a)*math.sin(b)
z[i] = q[i]*math.cos(a)
I am trying to find all the combinations for the difference between 2 points in x,y,z to iterate this equation (xi-xj)+(yi-yj)+(zi-zj) = r
I use this combination code
for combinations in it.combinations(x,2):
xdist = (combinations[0] - combinations[1])
for combinations in it.combinations(y,2):
ydist = (combinations[0] - combinations[1])
for combinations in it.combinations(z,2):
zdist = (combinations[0] - combinations[1])
r = (xdist + ydist +zdist)
This takes a long time for python for a large file I have and I am wondering if there is a faster way to get my array for r preferably using a nested loop?
Such as
if i in range(?):
if j in range(?):
Since you're apparently using numpy, let's actually use numpy; it'll be much faster. It's almost always faster and usually easier to read if you avoid python loops entirely when working with numpy, and use its vectorized array operations instead.
a, b, c = np.loadtxt('test.dat', dtype='double', unpack=True)
q = 3e5 * c / 100 # why not just 3e3 * c?
x = q * np.sin(a) * np.cos(b)
y = q * np.sin(a) * np.sin(b)
z = q * np.cos(a)
Now, your example code after this doesn't do what you probably want it to do - notice how you just say xdist = ... each time? You're overwriting that variable and not doing anything with it. I'm going to assume you want the squared euclidean distance between each pair of points, though, and make a matrix dists with dists[i, j] equal to the distance between the ith and jth points.
The easy way, if you have scipy available:
# stack the points into a num_pts x 3 matrix
pts = np.hstack([thing.reshape((-1, 1)) for thing in (x, y, z)])
# get squared euclidean distances in a matrix
dists = scipy.spatial.squareform(scipy.spatial.pdist(pts, 'sqeuclidean'))
If your list is enormous, it's more memory-efficient to not use squareform, but then it's in a condensed format that's a little harder to find specific pairs of distances with.
Slightly harder, if you can't / don't want to use scipy:
pts = np.hstack([thing.reshape((-1, 1)) for thing in (x, y, z)])
sqnorms = np.sum(pts ** 2, axis=1)
dists = sqnorms.reshape((-1, 1)) - 2 * np.dot(pts, pts.T) + sqnorms
which basically implements the formula (a - b)^2 = a^2 - 2 a b + b^2, but all vector-like.
Apologies for not posting a full solution, but you should avoid nesting calls to range(), as it will create a new tuple every time it gets called. You are better off either calling range() once and storing the result, or using a loop counter instead.
For example, instead of:
max = 50
for number in range (0, 50):
doSomething(number)
...you would do:
max = 50
current = 0
while current < max:
doSomething(number)
current += 1
Well, the complexity of your calculation is pretty high. Also, you need to have huge amounts of memory if you want to store all r values in a single list. Often, you don't need a list and a generator might be enough for what you want to do with the values.
Consider this code:
def calculate(x, y, z):
for xi, xj in combinations(x, 2):
for yi, yj in combinations(y, 2):
for zi, zj in combinations(z, 2):
yield (xi - xj) + (yi - yj) + (zi - zj)
This returns a generator that computes only one value each time you call the generator's next() method.
gen = calculate(xrange(10), xrange(10, 20), xrange(20, 30))
gen.next() # returns -3
gen.next() # returns -4 and so on

Distance calculation on matrix using numpy

I am trying to implement a K-means algorithm in Python (I know there is libraries for that, but I want to learn how to implement it myself.) Here is the function I am havin problem with:
def AssignPoints(points, centroids):
"""
Takes two arguments:
points is a numpy array such that points.shape = m , n where m is number of examples,
and n is number of dimensions.
centroids is numpy array such that centroids.shape = k , n where k is number of centroids.
k < m should hold.
Returns:
numpy array A such that A.shape = (m,) and A[i] is index of the centroid which points[i] is assigned to.
"""
m ,n = points.shape
temp = []
for i in xrange(n):
temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
distances = np.hypot(*temp)
return distances.argmin(axis=1)
Purpose of this function, given m points in n dimensional space, and k centroids in n dimensional space, produce a numpy array of (x1 x2 x3 x4 ... xm) where x1 is the index of centroid which is closest to first point. This was working fine, until I tried it with 4 dimensional examples. When I try to put 4 dimensional examples, I get this error:
File "/path/to/the/kmeans.py", line 28, in AssignPoints
distances = np.hypot(*temp)
ValueError: invalid number of arguments
How can I fix this, or if I can't, how do you suggest I calculate what I am trying to calculate here?
My Answer
def AssignPoints(points, centroids):
m ,n = points.shape
temp = []
for i in xrange(n):
temp.append(np.subtract.outer(points[:,i],centroids[:,i]))
for i in xrange(len(temp)):
temp[i] = temp[i] ** 2
distances = np.add.reduce(temp) ** 0.5
return distances.argmin(axis=1)
Try this:
np.sqrt(((points[np.newaxis] - centroids[:,np.newaxis]) ** 2).sum(axis=2)).argmin(axis=0)
Or:
diff = points[np.newaxis] - centroids[:,np.newaxis]
norm = np.sqrt((diff*diff).sum(axis=2))
closest = norm.argmin(axis=0)
And don't ask what's it doing :D
Edit: nah, just kidding. The broadcasting in the middle (points[np.newaxis] - centroids[:,np.newaxis]) is "making" two 3D arrays from the original ones. The result is such that each "plane" contains the difference between all the points and one of the centroids. Let's call it diffs.
Then we do the usual operation to calculate the euclidean distance (square root of the squares of differences): np.sqrt((diffs ** 2).sum(axis=2)). We end up with a (k, m) matrix where row 0 contain the distances to centroids[0], etc. So, the .argmin(axis=0) gives you the result you wanted.
You need to define a distance function where you are using hypot. Usually in K-means it is
Distance=sum((point-centroid)^2)
Here is some matlab code that does it ... I can port it if you can't, but give it a go. Like you said, only way to learn.
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
% idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
% in idx for a dataset X where each row is a single example. idx = m x 1
% vector of centroid assignments (i.e. each entry in range [1..K])
%
% Set K
K = size(centroids, 1);
[numberOfExamples numberOfDimensions] = size(X);
% You need to return the following variables correctly.
idx = zeros(size(X,1), 1);
% Go over every example, find its closest centroid, and store
% the index inside idx at the appropriate location.
% Concretely, idx(i) should contain the index of the centroid
% closest to example i. Hence, it should be a value in the
% range 1..K
%
for loop=1:numberOfExamples
Distance = sum(bsxfun(#minus,X(loop,:),centroids).^2,2);
[value index] = min(Distance);
idx(loop) = index;
end;
end
UPDATE
This should return the distance, notice that the above matlab code just returns the distance(and index) of the closest centroid...your function returns all distances, as does the one below.
def FindDistance(X,centroids):
K=shape(centroids)[0]
examples, dimensions = shape(X)
distance = zeros((examples,K))
for ex in xrange(examples):
distance[ex,:] = np.sum((X[ex,:]-centroids)**2,1)
return distance

Distance formula between two points in a list

I need to take a list I have created and find the closest two points and print them out. How can I go about comparing each point in the list?
There isn't any need to plot or anything, just compare the points and find the closest two in the list.
import math # 'math' needed for 'sqrt'
# Distance function
def distance(xi,xii,yi,yii):
sq1 = (xi-xii)*(xi-xii)
sq2 = (yi-yii)*(yi-yii)
return math.sqrt(sq1 + sq2)
# Run through input and reorder in [(x, y), (x,y) ...] format
oInput = ["9.5 7.5", "10.2 19.1", "9.7 10.2"] # Original input list (entered by spacing the two points).
mInput = [] # Manipulated list
fList = [] # Final list
for o in oInput:
mInput = o.split()
x,y = float(mInput[0]), float(mInput[1])
fList += [(x, y)] # outputs [(9.5, 7.5), (10.2, 19.1), (9.7, 10.2)]
It is more convenient to rewrite your distance() function to take two (x, y) tuples as parameters:
def distance(p0, p1):
return math.sqrt((p0[0] - p1[0])**2 + (p0[1] - p1[1])**2)
Now you want to iterate over all pairs of points from your list fList. The function iterools.combinations() is handy for this purpose:
min_distance = distance(fList[0], fList[1])
for p0, p1 in itertools.combinations(fList, 2):
min_distance = min(min_distance, distance(p0, p1))
An alternative is to define distance() to accept the pair of points in a single parameter
def distance(points):
p0, p1 = points
return math.sqrt((p0[0] - p1[0])**2 + (p0[1] - p1[1])**2)
and use the key parameter to the built-in min() function:
min_pair = min(itertools.combinations(fList, 2), key=distance)
min_distance = distance(min_pair)
I realize that there are library constraints on this question, but for completeness if you have N points in an Nx2 numpy ndarray (2D system):
from scipy.spatial.distance import pdist
x = numpy.array([[9.5,7.5],[10.2,19.1],[9.7,10.2]])
mindist = numpy.min(pdist(x))
I always try to encourage people to use numpy/scipy if they are dealing with data that is best stored in a numerical array and it's good to know that the tools are out there for future reference.
Note that the math.sqrt function is both slow and, in this case, unnecessary. Try comparing the distance squared to speed it up (sorting distances vs. distance squared will always produce the same ordering):
def distSquared(p0, p1):
return (p0[0] - p1[0])**2 + (p0[1] - p1[1])**2
This might work:
oInput = ["9.5 7.5", "10.2 19.1", "9.7 10.2"]
# parse inputs
inp = [(float(j[0]), float(j[1])) for j in [i.split() for i in oInput]]
# initialize results with a really large value
min_distance = float('infinity')
min_pair = None
# loop over inputs
length = len(inp)
for i in xrange(length):
for j in xrange(i+1, length):
point1 = inp[i]
point2 = inp[j]
if math.hypot(point1[0] - point2[0], point1[1] - point2[0]) < min_distance:
min_pair = [point1, point2]
once the loops are done, min_pair should be the pair with the smallest distance.
Using float() to parse the text leaves room for improvement.
math.hypot is about a third faster than calculating the distance in a handwritten python-function
Your fixed code. No efficient algorithm, just the brute force one.
import math # math needed for sqrt
# distance function
def dist(p1, p2):
return math.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)
# run through input and reorder in [(x, y), (x,y) ...] format
input = ["9.5 7.5", "10.2 19.1", "9.7 10.2"] # original input list (entered by spacing the two points)
points = [map(float, point.split()) for point in input] # final list
# http://en.wikipedia.org/wiki/Closest_pair_of_points
mindist = float("inf")
for p1, p2 in itertools.combinations(points, 2):
if dist(p1, p2) < mindist:
mindist = dist(p1, p2)
closestpair = (p1, p2)
print(closestpair)
First, some notes:
a**2 # squares a
(xi - xii)**2 # squares the expression in parentheses.
mInput doesn't need to be declared in advance.
fList.append((x, y)) is more pythonic than using +=.
Now you have fList. Your distance function can be rewritten to take 2 2-tuple (point) arguments, which I won't bother with here.
Then you can just write:
shortest = float('inf')
for pair in itertools.combinations(fList, 2):
shortest = min(shortest, distance(*pair))
Many of the above questions suggest finding square root using math.sqrt which is slow as well as not a good approach to find square root. In spite of using such approach just recall the basic concepts from school: think of taking the square root of any positive number, x. The square root is then written as a power of one-half: x½. Thus, a fractional exponent indicates that some root is to be taken.
so rather than using math.sqrt((p0[0] - p1[0])**2 + (p0[1] - p1[1])**2)
Use
def distance(a,b):
euclidean_distance = ((b[0]-a[0])**2 + (a[1]-a[1])**2)**0.5
return(euclidean_distance)
Hope it helps

Categories

Resources