Finding 3D distances using an inbuilt function in Python

I have 6 lists storing x,y,z coordinates of two sets of positions (3 lists each). I want to calculate the distance between each point in both sets. I have written my own distance function but it is slow. One of my lists has about 1 million entries.
I have tried cdist, but it produces a distance matrix and I do not understand what it means. Is there another inbuilt function that can do this?

If possible, use the numpy module to handle this kind of thing. It is a lot more efficient than working with regular Python lists.
I am interpreting your problem like this:
- You have two sets of points.
- Both sets have the same number of points (N).
- Point k in set 1 is related to point k in set 2. If each point is the coordinate of some object, I am interpreting it as set 1 containing the initial point and set 2 the point at some other time t.
- You want to find the distance d(k) = dist(p1(k), p2(k)), where p1(k) is point number k in set 1 and p2(k) is point number k in set 2.
Assuming that your 6 lists are x1_coords, y1_coords, z1_coords and x2_coords, y2_coords, z2_coords respectively, you can calculate the distances like this:
import numpy as np
p1 = np.array([x1_coords, y1_coords, z1_coords])
p2 = np.array([x2_coords, y2_coords, z2_coords])
squared_dist = np.sum((p1-p2)**2, axis=0)
dist = np.sqrt(squared_dist)
The distance between p1(k) and p2(k) is now stored in the numpy array as dist[k].
As for speed: on my laptop with an "Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz", the time to calculate the distance between two sets of points with N=1E6 is 45 ms.

Another option, also using numpy, is np.linalg.norm.
Say you have one point p0 = np.array([1,2,3]) and a second point p1 = np.array([4,5,6]). Then the quickest way to find the distance between the two would be:
import numpy as np
dist = np.linalg.norm(p0 - p1)
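For whole arrays of point pairs, np.linalg.norm also accepts an axis argument, so one call can handle all rows at once. A minimal sketch (the six coordinate lists are small stand-ins for the ones in the question):
import numpy as np
# small stand-ins for the six coordinate lists from the question
x1_coords, y1_coords, z1_coords = [1.0, 4.0], [2.0, 5.0], [3.0, 6.0]
x2_coords, y2_coords, z2_coords = [4.0, 4.0], [5.0, 5.0], [6.0, 6.0]
p1 = np.column_stack([x1_coords, y1_coords, z1_coords])  # N x 3
p2 = np.column_stack([x2_coords, y2_coords, z2_coords])  # N x 3
# one distance per point pair, same result as the squared-sum/sqrt approach above
dist = np.linalg.norm(p1 - p2, axis=1)
print(dist)  # [5.19615242 0.        ]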

# Use the distance formula for Cartesian 3D space:
# Example
import math

def distance(x1, y1, z1, x2, y2, z2):
    return math.sqrt((x2 - x1)**2 + (y2 - y1)**2 + (z2 - z1)**2)

You can use math.dist(A, B) (available since Python 3.8), with A and B each being a sequence of coordinates.
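For example, a minimal sketch with made-up coordinates:
import math

A = (1.0, 2.0, 3.0)
B = (4.0, 5.0, 6.0)
print(math.dist(A, B))  # 5.196152422706632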

Related

Fit a line segment to a set of points

I'm trying to fit a line segment to a set of points but I have trouble finding an algorithm for it. I have a 2D line segment L and a set of 2D points C. L can be represented in any suitable way (I don't care), like support and definition vector, two points, a linear equation with left and right bound, ... The only important thing is that the line has a beginning and an end, so it's not infinite.
I want to fit L in C, so that the sum of all distances of c to L (where c is a point in C) is minimized. This is a least squares problem but I (think I) cannot use polynomial fitting, because L is only a segment. My mathematical knowledge in that area is a bit lacking, so any hints on further reading would be appreciated as well.
Here is an illustration of my problem:
The orange line should be fitted to the blue points so that the sum of squares of distances of each point to the line is minimal. I don't mind if the solution is in a different language or not code at all, as long as I can extract an algorithm from it.
Since this is more of a mathematical question I'm not sure if it's ok for SO or should be moved to cross validated or math exchange.
This solution is relatively similar to one already posted here, but I think is slightly more efficient, elegant and understandable, which is why I posted it despite the similarity.
As was already written, the min(max(...)) formulation makes it hard to solve this problem analytically, which is why scipy.optimize fits well.
The solution is based on the mathematical formulation for distance between a point and a finite line segment outlined in https://math.stackexchange.com/questions/330269/the-distance-from-a-point-to-a-line-segment
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize, NonlinearConstraint

def calc_distance_from_point_set(v_):
    # v_ is accepted as a 1d array to make it easier to use with scipy.optimize
    # Reshape into two points; P (the point set) is the module-level array defined below
    v = (v_[:2].reshape(2, 1), v_[2:].reshape(2, 1))
    # Calculate t* for s(t*) = v_0 + t*(v_1 - v_0), for the line segment w.r.t. each point
    t_star_matrix = np.minimum(np.maximum(np.matmul(P - v[0].T, v[1] - v[0]) / np.linalg.norm(v[1] - v[0])**2, 0), 1)
    # Calculate s(t*)
    s_t_star_matrix = v[0] + ((t_star_matrix.ravel()) * (v[1] - v[0]))
    # Take the distance between all points and their respective points on the segment
    distance_from_every_point = np.linalg.norm(P.T - s_t_star_matrix, axis=0)
    return np.sum(distance_from_every_point)

if __name__ == '__main__':
    # Random points from a bounding box
    box_1 = np.random.uniform(-5, 5, 20)
    box_2 = np.random.uniform(-5, 5, 20)
    P = np.stack([box_1, box_2], axis=1)
    segment_length = 3
    segment_length_constraint = NonlinearConstraint(
        fun=lambda x: np.linalg.norm(np.array([x[0], x[1]]) - np.array([x[2], x[3]])),
        lb=[segment_length], ub=[segment_length])
    point = minimize(calc_distance_from_point_set, (0.0, -.0, 1.0, 1.0),
                     options={'maxiter': 100, 'disp': True},
                     constraints=segment_length_constraint).x
    plt.scatter(box_1, box_2)
    plt.plot([point[0], point[2]], [point[1], point[3]])
    plt.show()
Example result:
Here is a proposal in Python. The distance between the points and the segment is computed based on the approach proposed here: Fit a line segment to a set of points
The fact that the segment has a finite length, which imposes the use of min and max functions (or if tests) to decide whether to use the perpendicular distance or the distance to one of the end points, makes it really difficult (impossible?) to get an analytic solution.
The proposed solution will thus use an optimization algorithm to approach the best solution. It uses scipy.optimize.minimize, see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
Since the segment length is fixed, we have only three degrees of freedom. In the proposed solution I use the x and y coordinates of the segment's starting point and the segment slope as free parameters. The getCoordinates function computes the starting and ending points of the segment from these 3 parameters and the length.
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
import math as m
from scipy.spatial import distance

# Plot the points and the segment
def plotFunction(points, x1, x2):
    'Plotting function for plane and iterations'
    plt.plot(points[:, 0], points[:, 1], 'ro')
    plt.plot([x1[0], x2[0]], [x1[1], x2[1]])
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.show()

# Get the sum of the distances between all the points and the segment
# The segment is defined by guess and length where:
#   guess[0] = x coordinate of the starting point
#   guess[1] = y coordinate of the starting point
#   guess[2] = slope
# Since distance is always > 0 there is no need to use root mean square values
def getDist(guess, points, length):
    start_pt = np.array([guess[0], guess[1]])
    slope = guess[2]
    [x1, x2] = getCoordinates(start_pt, slope, length)
    total_dist = 0
    # Loop over each point to get the distance between the point and the segment
    for pt in points:
        total_dist += minimum_distance(x1, x2, pt, length)
    return total_dist

# Return the minimum distance between line segment x1-x2 and point pt
# Adapted from https://stackoverflow.com/questions/849211/shortest-distance-between-a-point-and-a-line-segment
def minimum_distance(x1, x2, pt, length):
    length2 = length**2  # i.e. |x1-x2|^2 - avoid a sqrt, reuse the length we already know
    if length2 == 0.0:
        return distance.euclidean(pt, x1)
    # Consider the line extending the segment, parameterized as x1 + t (x2 - x1).
    # We find the projection of point pt onto the line.
    # It falls where t = [(pt-x1) . (x2-x1)] / |x2-x1|^2
    # We clamp t to [0,1] to handle points that project outside the segment.
    t = max(0, min(1, np.dot(pt - x1, x2 - x1) / length2))
    projection = x1 + t * (x2 - x1)  # Projection falls on the segment
    return distance.euclidean(pt, projection)

# Get the coordinates of the start and end points of the segment from start_pt,
# slope and length, obtained by solving slope=dy/dx, dx^2+dy^2=length^2
def getCoordinates(start_pt, slope, length):
    x1 = start_pt
    dx = length / m.sqrt(slope**2 + 1)
    dy = slope * dx
    x2 = start_pt + np.array([dx, dy])
    return [x1, x2]

if __name__ == '__main__':
    # Generate random points
    num_points = 20
    points = np.random.rand(num_points, 2)
    # Starting position
    length = 0.5
    start_pt = np.array([0.25, 0.5])
    slope = 0
    # Use scipy.optimize.minimize to find the best start_pt and slope combination
    res = minimize(getDist, x0=[start_pt[0], start_pt[1], slope], args=(points, length), method="Nelder-Mead")
    # Retrieve the best parameters
    start_pt = np.array([res.x[0], res.x[1]])
    slope = res.x[2]
    [x1, x2] = getCoordinates(start_pt, slope, length)
    print("\n** The best segment found is defined by:")
    print("\t** start_pt:\t", x1)
    print("\t** end_pt:\t", x2)
    print("\t** slope:\t", slope)
    print("** The total distance is:", getDist([x1[0], x1[1], slope], points, length), "\n")
    # Plot results
    plotFunction(points, x1, x2)

Find all the points that lie within a spherical region

For example, see the image below, which explains the problem for a simple 2D case. The label (N) and coordinates (x, y) of each point are known. I need to find all the point labels that lie within the red circle.
My actual problem is in 3D and the points are not uniformly distributed.
A sample input file containing the coordinates of 7.25 M points is attached here: point file.
I tried the following piece of code
import numpy as np
C = [50,50,50]
R = 20
centroid = np.loadtxt('centroid') #chk the file attached
def dist(x,y): return sum([(xi-yi)**2 for xi, yi in zip(x,y)])
elabels=[i+1 for i in range(len(centroid)) if dist(C,centroid[i])<=R**2]
For a single search it takes ~10 min. Any suggestions to make it faster?
Thanks,
Prithivi
When using numpy, avoid using list comprehensions on arrays.
Your computation can be done using vectorized expressions like this
centre = np.array((50., 50., 50.))
points = np.loadtxt('data')
distances2 = np.sum((points-centre)**2, axis=1)
points is an N x 3 array, points-centre is also an N x 3 array,
(points-centre)**2 computes the squares of each element of the difference and eventually np.sum(..., axis=1) sums the elements of the squared differences along axis no. 1, that is, across columns.
To filter the array of positions, you can use boolean indexing
close = points[distances2<max_dist**2]
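If you also need the 1-based labels from the question (elabels) rather than the coordinates, a minimal sketch building on the same vectorized distances could look like this (R plays the role of max_dist above, and 'centroid' is the file from the question):
import numpy as np

centre = np.array((50., 50., 50.))
R = 20
points = np.loadtxt('centroid')      # N x 3 array of coordinates
distances2 = np.sum((points - centre)**2, axis=1)
inside = distances2 <= R**2          # boolean mask of points inside the sphere
elabels = np.nonzero(inside)[0] + 1  # 1-based labels, matching the original list comprehension
close = points[inside]               # their coordinates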
You are calling the dist function heavily. You could try to optimize it at a low level, and check with the timeit Python module which variant is more efficient. On my machine, I tried this one:
def dist(x, y):
    d0 = y[0] - x[0]
    d1 = y[1] - x[1]
    d2 = y[2] - x[2]
    return d0*d0 + d1*d1 + d2*d2
and timeit said it was more than 3 times quicker.
This one was just in the middle:
def dist(x, y):
    s = 0
    for i in range(len(x)):
        d = y[i] - x[i]
        s += d * d
    return s

Calculating distances on grid

I have a 10 x 10 grid of cells (as a numpy array). I also have a list of 3 points on that grid. For each cell on the grid, I need to find the closest of the three points. I can do this with a series of nested loops in Python (2.7), which works but is slow (especially if I upscale to larger grids), but I suspect there is a faster way. Does anyone have any suggestions?
The simplest way I know of to calculate the distance between two points on a plane is using the Pythagorean theorem.
That is, picture a right angle triangle where the hypotenuse goes between the two points and the base of the triangle is parallel to the x axis and the height is parallel to the y axis. We then know that the distance (represented by the length of the hypotenuse) h adheres to the following: h^2 = a^2 + b^2, where a and b are the lengths of the two remaining sides of the triangle.
It's hard to give any other help without seeing your code. Have you tried something similar yet? You need to specify your question more if you want more specific answers.
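For instance, a minimal worked example of the formula above:
from math import sqrt

# distance between grid cells (0, 0) and (3, 4): a = 3, b = 4
a, b = 3 - 0, 4 - 0
h = sqrt(a**2 + b**2)  # h^2 = a^2 + b^2, so h = 5.0
print(h)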
If we assume that you know the point coordinates, then you can calculate the distance between a cell and a point using the distance formula: https://en.wikipedia.org/wiki/Distance
So for example, let's say that your cell corresponds to x and your 3 points correspond to y1, y2 and y3. You can simply get the distances between x and y1, x and y2, x and y3, and then compare the three distances.
If we assume that you do not know the point coordinates, then you first have to find them. You can do that by scanning your grid and checking whether a cell corresponds to a point. Once you have found all your points, you can find the closest one using the distance formula.
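Since the question mentions that nested loops are slow, here is a minimal sketch of a vectorized version of the idea above, using numpy broadcasting (the three point coordinates are made up for illustration):
import numpy as np

# 10 x 10 grid: one (row, col) coordinate per cell
rows, cols = np.indices((10, 10))
cells = np.stack([rows.ravel(), cols.ravel()], axis=1)   # shape (100, 2)
# example coordinates for the 3 points
targets = np.array([[1, 1], [5, 8], [9, 2]])
# pairwise distances: (100, 1, 2) - (1, 3, 2) broadcasts to (100, 3, 2)
d = np.linalg.norm(cells[:, None, :] - targets[None, :, :], axis=2)
# index of the closest of the 3 points for each cell, reshaped back onto the grid
closest = d.argmin(axis=1).reshape(10, 10)
print(closest)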
There is a function in scipy called euclidean that will calculate the distance between two points, if you want to loop through them.
from scipy.spatial.distance import euclidean
import numpy as np
a = np.array([1, 1, 1])
b = np.array([2, 2, 2])
dist = euclidean(a, b)
But I think for large data sets you would be better off using scipy's k-d tree to perform the search, as sketched below.
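A minimal sketch of that k-d tree approach for the grid problem (the point coordinates are made up; cKDTree.query returns, for every cell, the distance to and the index of the nearest of the three points):
import numpy as np
from scipy.spatial import cKDTree

# example coordinates for the 3 points
points = np.array([[1.0, 1.0], [5.0, 8.0], [9.0, 2.0]])
# coordinates of every cell in the 10 x 10 grid
rows, cols = np.indices((10, 10))
cells = np.column_stack([rows.ravel(), cols.ravel()])
tree = cKDTree(points)
dist, idx = tree.query(cells)     # nearest-point distance and index per cell
closest = idx.reshape(10, 10)     # index (0, 1 or 2) of the closest point per cell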

How to efficiently determine if a set of points contains two that are close

I need to determine if a set of points (each given by a tuple of floats, each of which is in [0, 1]) contains two that are within some threshold, say 0.01, of each other. I should also mention that in the version of the problem that I am interested in, these "points" are given by tuples of length ~30, that is they are points in [0, 1]^30.
I can test if any two are within this threshold using something like:
from math import sqrt

def is_near(p1, p2):
    return sqrt(sum((x1 - x2)**2 for x1, x2 in zip(p1, p2))) < 0.01  # Threshold.
Using this I can just check every pair using something like:
def contains_near(points):
    from itertools import combinations
    return any(is_near(p1, p2) for p1, p2 in combinations(points, r=2))
However this is quadratic in the length of the list, which is too slow for the long list of points that I have.
Is there an n log n way of solving this?
I tried doing things like snapping these points to a grid so I could use a dictionary / hash map to store them:
def contains_near_hash(points):
    seen = dict()
    for point in points:
        # The rescaling constant should be 1 / threshold.
        grid_point = tuple([round(x * 100, 0) for x in point])
        if grid_point in seen:
            for other in seen[grid_point]:
                if is_near(point, other):
                    return True
            seen[grid_point].append(point)
        else:
            seen[grid_point] = [point]
    return False
However this doesn't work when
points = [(0.1149999,), (0.1150001,)]
As these round to two different grid points. I also tried a version in which the point was appended to all neighbouring grid points however as the examples that I want to do have ~30 coordinates, each grid point has 2^30 neighbours which makes this completely impractical.
A pair of points can only be 'near' each other if their distance in at least one dimension is less than the threshold. This can be exploited to reduce the number of candidate pairs by filtering one dimension after the other.
I suggest:
- sort the points in one dimension (say: x)
- find all points which are close enough to the next point in the sorted list and put their index into a set of candidates
- do not use sqrt() but the quadratic distance (x1 - x2)**2 or even abs(x1 - x2) for efficiency
- do that for the second dimension as well
- determine the intersection of both sets; these are the points near each other
This way, you avoid costly is_near() calls, operate on way smaller sets, only work with unique points, and set lookups are very efficient.
This scheme can easily be expanded to include more than 2 dimensions; a minimal sketch is given below.
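Here is a small sketch of that candidate-filtering idea. The final exact is_near() check over the surviving candidates is added here to confirm true near pairs:
from itertools import combinations
from math import sqrt

def is_near(p1, p2, threshold=0.01):
    return sqrt(sum((a - b)**2 for a, b in zip(p1, p2))) < threshold

def near_in_dim(points, dim, threshold):
    # indices of points that are within threshold of a neighbour in this dimension
    order = sorted(range(len(points)), key=lambda i: points[i][dim])
    near = set()
    for a, b in zip(order, order[1:]):
        if points[b][dim] - points[a][dim] < threshold:
            near.update((a, b))
    return near

def contains_near_filtered(points, threshold=0.01):
    # a true near pair is near in every dimension, so both members survive every per-dimension filter
    candidates = near_in_dim(points, 0, threshold) & near_in_dim(points, 1, threshold)
    return any(is_near(points[i], points[j], threshold)
               for i, j in combinations(sorted(candidates), r=2))

print(contains_near_filtered([(0.1149999, 0.5), (0.1150001, 0.5), (0.9, 0.1)]))  # True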

Fast, elegant way to calculate empirical/sample covariogram

Does anyone know a good method to calculate the empirical/sample covariogram, if possible in Python?
This is a screenshot of a book which contains a good definition of the covariogram:
If I understood it correctly, for a given lag/width h, I'm supposed to get all the pairs of points that are separated by h (or less than h), multiply their values, and for each of these points calculate its mean, which in this case is defined as m(x_i). However, according to the definition of m(x_i), if I want to compute m(x1), I need to obtain the average of the values located within distance h of x1. This looks like a very intensive computation.
First of all, am I understanding this correctly? If so, what is a good way to compute this assuming a two dimensional space? I tried to code this in Python (using numpy and pandas), but it takes a couple of seconds and I'm not even sure it is correct, that is why I will refrain from posting the code here. Here is another attempt of a very naive implementation:
import numpy as np
from scipy.spatial.distance import pdist, squareform

distances = squareform(pdist(np.array(coordinates)))  # coordinates is an n x 2 array
z = np.array(z)  # z are the values
cutoff = np.max(distances)/3.0  # somewhat arbitrary cutoff
width = cutoff/15.0
widths = np.arange(0, cutoff + width, width)
Z = []
Cov = []
for w in np.arange(len(widths)-1):  # for each width
    # for each pairwise distance
    for i in np.arange(distances.shape[0]):
        for j in np.arange(distances.shape[1]):
            if distances[i, j] <= widths[w+1] and distances[i, j] > widths[w]:
                m1 = []
                m2 = []
                # when a distance is within a given width, calculate the means of
                # the points involved
                for x in np.arange(distances.shape[1]):
                    if distances[i, x] <= widths[w+1] and distances[i, x] > widths[w]:
                        m1.append(z[x])
                for y in np.arange(distances.shape[1]):
                    if distances[j, y] <= widths[w+1] and distances[j, y] > widths[w]:
                        m2.append(z[y])
                mean_m1 = np.array(m1).mean()
                mean_m2 = np.array(m2).mean()
                Z.append(z[i]*z[j] - mean_m1*mean_m2)
    Z_mean = np.array(Z).mean()  # calculate covariogram for width w
    Cov.append(Z_mean)  # collect covariances for all widths
However, now I have confirmed that there is an error in my code. I know that because I used the variogram to calculate the covariogram (covariogram(h) = covariogram(0) - variogram(h)) and I get a different plot:
And it is supposed to look like this:
Finally, if you know a Python/R/MATLAB library to calculate empirical covariograms, let me know. At least, that way I can verify what I did.
One could use scipy.cov, but if one does the calculation directly (which is very easy), there are more ways to speed this up.
First, make some fake data that has some spatial correlations. I'll do this by first making the spatial correlations, and then generating random data points from them, where the data is positioned according to the underlying map and also takes on the values of the underlying map.
Edit 1:
I changed the data point generator so positions are purely random, but z-values are proportional to the spatial map. And, I changed the map so that the left and right sides are shifted relative to each other to create negative correlation at large h.
from numpy import *
import math
import random
import matplotlib.pyplot as plt

S = 1000
N = 900
# first, make some fake data, with correlations on two spatial scales
# density map
x = linspace(0, 2*pi, S)
sx = sin(3*x)*sin(10*x)
density = .8 * abs(outer(sx, sx))
density[:, :S//2] += .2
# make a point cloud motivated by this density
random.seed(10)  # so this can be repeated
points = []
while len(points) < N:
    v, ix, iy = random.random(), random.randint(0, S-1), random.randint(0, S-1)
    if True:  # v < density[ix, iy]:
        points.append([ix, iy, density[ix, iy]])
locations = array(points).transpose()
print(locations.shape)
plt.imshow(density, alpha=.3, origin='lower')
plt.plot(locations[1, :], locations[0, :], '.k')
plt.xlim((0, S))
plt.ylim((0, S))
plt.show()
# build these into the main data: all pairs into distances and z0 z1 values
L = locations
m = array([[math.sqrt((L[0, i]-L[0, j])**2 + (L[1, i]-L[1, j])**2), L[2, i], L[2, j]]
           for i in range(N) for j in range(N) if i > j])
Which gives:
The above is just the simulated data, and I made no attempt to optimize its production, etc. I assume this is where the OP starts, with the task below, since the data already exists in a real situation.
Now calculate the "covariogram" (which is much easier than generating the fake data, btw). The idea here is to sort all the pairs and associated values by h, and then index into these using ihvals. That is, summing up to index ihval is the sum over N(h) in the equation, since this includes all pairs with hs below the desired values.
Edit 2:
As suggested in the comments below, N(h) is now only the pairs that are between h-dh and h, rather than all pairs between 0 and h (where dh is the spacing of h-values in ihvals -- ie, S/1000 was used below).
# now do the real calculations for the covariogram
# sort by h and give clear names
i = argsort(m[:, 0])  # h sorting
h = m[i, 0]
zh = m[i, 1]
zsh = m[i, 2]
zz = zh*zsh
hvals = linspace(0, S, 1000)  # the values of h to use (S should be in units of distance; here I just used ints)
ihvals = searchsorted(h, hvals)
result = []
for i, ihval in enumerate(ihvals[1:]):
    start, stop = ihvals[i], ihval  # pairs whose h falls between hvals[i] and hvals[i+1]
    N = stop - start
    if N > 0:
        mnh = sum(zh[start:stop])/N
        mph = sum(zsh[start:stop])/N
        szz = sum(zz[start:stop])/N
        C = szz - mnh*mph
        result.append([h[ihval], C])
result = array(result)
plt.plot(result[:, 0], result[:, 1])
plt.grid()
plt.show()
which looks reasonable to me, as one can see bumps or troughs at the expected h values, but I haven't done a careful check.
The main speedup here over scipy.cov is that one can precalculate all of the products, zz. Otherwise, one would feed zh and zsh into cov for every new h, and all the products would be recalculated. This calculation could be sped up even more by doing partial sums, i.e., from ihvals[n-1] to ihvals[n] at each step n, but I doubt that will be necessary.
