I am trying to implement a Nearest neighbour search for Lat and Lon data. Here is the Data.txt
61.3000183105 -21.2500038147 0
62.299987793 -23.750005722 1
66.3000488281 -28.7500038147 2
40.8000183105 -18.250005722 3
71.8000183105 -35.7500038147 3
39.3000183105 -19.7500019073 4
39.8000183105 -20.7500038147 5
41.3000183105 -20.7500038147 6
The problem is, when I want to do the nearest neighbour for each of the Lat and Lon on the data set, it is searching it self. e.g Nearest Neighbour of (-21.2500038147,61.3000183105) will be (-21.2500038147,61.3000183105) and the resulting distance will be 0.0. I am trying to avoid this but with no luck. I tried doing if not (array_equal) but still...
Below is my python code
import numpy as np
from numpy import *
import decimal
from scipy import spatial
from scipy.spatial import KDTree
from math import radians,cos,sin,sqrt,exp
Lat =[]
Lon =[]
Day =[]
nja = []
Data = np.loadtxt('Data.txt',delimiter=" ")
for i in range(0,len(Data)):
Lon.append(Data[i][:][0])
Lat.append(Data[i][:][1])
Day.append(Data[i][:][2])
tree =spatial.KDTree(zip(Lon,Lat) )
print "Lon :",len(Lon)
print "Tree :",len(tree.data)
for i in range(0,len(tree.data)):
pts = np.array([tree.data[i][0],tree.data[i][1]])
nja.append(pts)
for i in range(0, len(nja)):
if not (np.array_equal(nja,tree.data)):
nearest = tree.query(pts,k=1,distance_upper_bound =9)
print nearest
For each point P[i] in your data set, you're asking "Which is the point nearest to P[i] in my data set?" and you get the answer "It is P[i]".
If you ask a different question, "Which are the TWO points nearest to P[i]?", i.e., tree.query(pts,k=2) (the difference with your code being s/k=1/k=2/)
you will get P[i] and also a P[j], the second nearest point, that is the result you want.
Side note:
I'd recommend that you project your data before building the tree, cause in your range of latitudes there is a large fluctuation in what is meant by a 1 degree distance in longitude.
How'bout a low-tech solution? If you have a large number of points (say 10000 or more), this is no more reasonable, but for a smaller number this brute force solution might be useful:
import numpy as np
dist = (Lat[:,None]-Lat[None,:])**2 + (Lon[:,None]-Lon[None,:])**2
Now you have an NxN array (N is the number of points) with distances (or squares of distances, to be more precise) between all point pairs. Finding the shortest distance for each point is then a matter of finding the smallest value on each row. To exclude the point itself you may set the diagonal to NaN and use nanargmax:
np.fill_diagonal(dist, np.nan)
closest = np.nanargmin(dist, axis=1)
This approach is very simple and guaranteed to find the closest points, but has two significant downsides:
It is O(n^2), and at 10000 points it takes around one second
Ot consumes a lot of memory (800 MB for the above mentioned case)
The latter problem can of course be avoided by doing this piecewise, but the first problem excludes large point sets.
This can be carried out also by using scipy.spatial.distance.pdist:
dist=scipy.spatial.distance.pdist(np.column_stack((Lon, Lat)))
This is a bit faster (by half at least), but the output matrix is in the condensed form, see the documentation for scipy.spatial.distance.squareform.
If you need to calculate the real distances, then this is a good alternative, as pdist can handle distances on a sphere.
Then, again, you may use your KDtree approach by just extending your query to two closest point:
nearest = tree.query(pts, k=2, distance_upper_bound=9)
Then nearest[1][0] has the point itself ("me, myself, and I"), nearest[1][1] the real nearest neighbour (or inf if there is nothing near enough).
The best solution depends on the number of points you have. Also, you might want to use something else than cartesian 2D distances if your map points are not close to each other on the globe.
Just a note about using latitudes and longitudes in finding distances: If you just try to pretend they are 2D Cartesian points, you get it wrong. At 60°N one degree of latitude is 1111 km, whereas one degree of longitude is 555 km. So, at least you will have to divide the longitudes by cos(latitude). And even with that trick you will end up in trouble when the longitudes change from east to west.
Probably the easiest way out of this trouble is to calculate the coordinate points into Cartesian 3D points:
x = cos(lat) * cos(lon)
y = cos(lat) * sin(lon)
z = sin(lat)
If you then calculate the shortest distances between these points, you will get the correct results. (Just note that the distances are not the same as real shortest distances on the surface of the globe.)
Related
Goal
I have a list of about 500K points in 3D space. I want to find the two coordinates with the maximum first nearest neighbor distance.
Approach
I am using scipy to calculate a sparse distance matrix:
from scipy.spatial import cKDTree
tree = cKDTree(points, 40)
spd = tree.sparse_distance_matrix(tree, 0.01)
spo = spd.tocsr()
spo.eliminate_zeros()
I eliminate explicit zeros to account for the diagonal elements where the distance between each point and itself is calculated.
I wanted to now find the coordinates of the minimum distance in each row/column, which should correspond to the first nearest neighbor of each point, with something like:
spo.argmin(axis=0)
By finding the maximum distance for the elements in this array I should be able to find the two elements with the maximum first nearest neighbor distance.
The problem
The issue is that the min and argmin functions of scipy.sparse.csr_matrix also take the implicit zeros into account, which for this application I do not want. How do I solve this issue? With this huge matrix, performance and memory are both issues. Or is there an entirely different approach to what I want to do?
I didn't find a solution with the distance matrix but it appears I overlooked the most obvious solution using the query method of the tree.
So to find the maximum distance between first nearest neighbors I did (with vectors a numpy array of shape (N, 3)):
tree = cKDTree(vectors, leaf_size)
# get the indexes of the first nearest neighbor of each vertex
# we use k=2 because k=1 are the points themselves with distance 0
nn1 = tree.query(vectors, k=2)[1][:,1]
# get the vectors corresponding to those indexes. Basically this is "vectors" sorted by
# first nearest neighbor of each point in "vectors".
nn1_vec = vectors[nn1]
# the distance between each point and its first nearest neighbor
nn_dist = np.sqrt(np.sum((vectors - nn1_vec)**2, axis=1))
# maximum distance
return np.max(nn_dist)
I have a set of approximately 10,000 vectors max (random directions) in 3d space and I'm looking for a new direction v_dev (vector) which deviates from all other directions in the set by e.g. a minimum of 5 degrees. My naive initial try is the following, which has of course bad runtime complexity but succeeds for some cases.
#!/usr/bin/env python
import numpy as np
numVecs = 10000
vecs = np.random.rand(numVecs, 3)
randVec = np.random.rand(1, 3)
notFound=True
foundVec=randVec
below=False
iter = 1
for vec in vecs:
angle = np.rad2deg(np.arccos(np.vdot(vec, foundVec)/(np.linalg.norm(vec) * np.linalg.norm(foundVec))))
print("angle: %f\n" % angle)
while notFound:
for vec in vecs:
angle = np.rad2deg(np.arccos(np.vdot(vec, randVec)/(np.linalg.norm(vec) * np.linalg.norm(randVec))))
if angle < 5:
below=True
if below:
randVec = np.random.rand(1, 3)
else:
notFound=False
print("iteration no. %i" % iter)
iter = iter + 1
Any hints how to approach this problem (language agnostic) would be appreciate.
Consider the vectors in a spherical coordinate system (u,w,r), where r is always 1 because vector length doesn't matter here. Any vector can be expressed as (u,w) and the "deadzone" around each vector x, in which the target vector t cannot fall, can be expressed as dist((u_x, w_x, 1), (u_x-u_t, w_x-w_t, 1)) < 5°. However calculating this distance can be a bit tricky, so converting back into cartesian coordinates might be easier. These deadzones are circular on the spherical shell around the origin and you're looking for a t that doesn't hit any on them.
For any fixed u_t you can iterate over all x and using the distance function can find the start and end point of a range of w_t, that are blocked because they fall into the deadzone of the vector x. The union of all 10000 ranges build the possible values of w_t for that given u_t. The same can be done for any fixed w_t, looking for a u_t.
Now comes the part that I'm not entirely sure of: Given that you have two unknows u_t and w_t and 20000 knowns, the system is just a tad overdetermined and if there's a solution, it should be possible to find it.
My suggestion: Set u_t fixed to a random value and check which w_t are possible. If you find a non-empty range, great, you're done. If all w_t are blocked, select a different u_t and try again. Now, selecting u_t at random will work eventually, yet a smarter iteration should be possible. Maybe u_t(n) = u_t(n-1)*phi % 360°, where phi is the golden ratio. That way the u_t never repeat and will cover the whole space with finer and finer granularity instead of starting from one end and going slowly to the other.
Edit: You might also have more luck on the mathematics stackexchange since this isn't so much a code question as it is a mathematics question. For example I'm not sure what I wrote is all that rigorous, so I don't even know it works.
One way would be two build a 2d manifold (area on the sphere) of forbidden areas. You start by adding a point, then, the forbidden area is a circle on the sphere surface.
While true, pick a point on the boundary of the area. If this is not close (within 5 degrees) to any other vector, then, you're done, return it. If not, you just found a new circle of forbidden area. Add it to your manifold of forbidden area. You'll need to chop the circle in line or arc segments and build the boundary as a list.
If the set of vector has no solution, you boundary will collapse to an empty point. Then you return failure.
It's not the easiest approach, and you'll have to deal with the boundaries of a complex shape over a sphere. But it's guaranteed to work and should have reasonable complexity.
What is the most efficient way compute (euclidean) distance of the nearest neighbor for each point in an array?
I have a list of 100k (X,Y,Z) points and I would like to compute a list of nearest neighbor distances. The index of the distance would correspond to the index of the point.
I've looked into PYOD and sklearn neighbors, but those seem to require "teaching". I think my problem is simpler than that. For each point: find nearest neighbor, compute distance.
Example data:
points = [
(0 0 1322.1695
0.006711111 0 1322.1696
0.026844444 0 1322.1697
0.0604 0 1322.1649
0.107377778 0 1322.1651
0.167777778 0 1322.1634
0.2416 0 1322.1629
0.328844444 0 1322.1631
0.429511111 0 1322.1627...)]
compute k = 1 nearest neighbor distances
result format:
results = [nearest neighbor distance]
example results:
results = [
0.005939372
0.005939372
0.017815632
0.030118587
0.041569616
0.053475883
0.065324964
0.077200014
0.089077602)
]
UPDATE:
I've implemented two of the approaches suggested.
Use the scipy.spatial.cdist to compute the full distances matrices
Use a nearest X neighbors in radius R to find subset of neighbor distances for every point and return the smallest.
Results are that Method 2 is faster than Method 1 but took a lot more effort to implement (makes sense).
It seems the limiting factor for Method 1 is the memory needed to run the full computation, especially when my data set is approaching 10^5 (x, y, z) points. For my data set of 23k points, it takes ~ 100 seconds to capture the minimum distances.
For method 2, the speed scales as n_radius^2. That is, "neighbor radius squared", which really means that the algorithm scales ~ linearly with number of included neighbors. Using a Radius of ~ 5 (more than enough given application) it took 5 seconds, for the set of 23k points, to provide a list of mins in the same order as the point_list themselves. The difference matrix between the "exact solution" and Method 2 is basically zero.
Thanks for everyones' help!
Similar to Caleb's answer, but you could stop the iterative loop if you get a distance greater than some previous minimum distance (sorry - no code).
I used to program video games. It would take too much CPU to calculate the actual distance between two points. What we did was divide the "screen" into larger Cartesian squares and avoid the actual distance calculation if the Delta-X or Delta-Y was "too far away" - That's just subtraction, so maybe something like that to qualify where the actual Eucledian distance metric calculation is needed (extend to n-dimensions as needed)?
EDIT - expanding "too far away" candidate pair selection comments.
For brevity, I'll assume a 2-D landscape.
Take the point of interest (X0,Y0) and "draw" an nxn square around that point, with (X0,Y0) at the origin.
Go through the initial list of points and form a list of candidate points that are within that square. While doing that, if the DeltaX [ABS(Xi-X0)] is outside of the square, there is no need to calculate the DeltaY.
If there are no candidate points, make the square larger and iterate.
If there is exactly one candidate point and it is within the radius of the circle incribed by the square, that is your minimum.
If there are "too many" candidates, make the square smaller, but you only need to reexamine the candidate list from this iteration, not all the points.
If there are not "too many" candidates, then calculate the distance for that list. When doing so, first calculate DeltaX^2 + DeltaY^2 for the first candidate. If for subsequent candidates the DetlaX^2 is greater than the minumin so far, no need to calculate the DeltaY^2.
The minimum from that calculation is the minimum if it is within the radius of the circle inscribed by the square.
If not, you need to go back to a previous candidate list that includes points within the circle that has the radius of that minimum. For example, if you ended with one candidate in a 2x2 square that happened to be on the vertex X=1, Y=1, distance/radius would be SQRT(2). So go back to a previous candidate list that has a square greated or equal to 2xSQRT(2).
If warranted, generate a new candidate list that only includes points withing the +/- SQRT(2) square.
Calculate distance for those candidate points as described above - omitting any that exceed the minimum calcluated so far.
No need to do the square root of the sum of the Delta^2 until you have only one candidate.
How to size the initial square, or if it should be a rectangle, and how to increase or decrease the size of the square/rectangle could be influenced by application knowledge of the data distribution.
I would consider recursive algorithms for some of this if the language you are using supports that.
How about this?
from scipy.spatial import distance
A = (0.003467119 ,0.01422762 ,0.0101960126)
B = (0.007279433 ,0.01651597 ,0.0045558849)
C = (0.005392258 ,0.02149997 ,0.0177409387)
D = (0.017898802 ,0.02790659 ,0.0006487222)
E = (0.013564214 ,0.01835688 ,0.0008102952)
F = (0.013375397 ,0.02210725 ,0.0286032185)
points = [A, B, C, D, E, F]
results = []
for point in points:
distances = [{'point':point, 'neighbor':p, 'd':distance.euclidean(point, p)} for p in points if p != point]
results.append(min(distances, key=lambda k:k['d']))
results will be a list of objects, like this:
results = [
{'point':(x1, y1, z1), 'neighbor':(x2, y2, z2), 'd':"distance from point to neighbor"},
...]
Where point is the reference point and neighbor is point's closest neighbor.
The fastest option available to you may be scipy.spatial.distance.cdist, which finds the pairwise distances between all of the points in its input. While finding all of those distances may not be the fastest algorithm to find the nearest neighbors, cdist is implemented in C, so it is likely run faster than anything you try in Python.
import scipy as sp
import scipy.spatial
from scipy.spatial.distance import cdist
points = sp.array(...)
distances = sp.spatial.distance.cdist(points)
# An element is not its own nearest neighbor
sp.fill_diagonal(distances, sp.inf)
# Find the index of each element's nearest neighbor
mins = distances.argmin(0)
# Extract the nearest neighbors from the data by row indexing
nearest_neighbors = points[mins, :]
# Put the arrays in the specified shape
results = np.stack((points, nearest_neighbors), 1)
You could theoretically make this run faster (mostly by combining all of the steps into one algorithm), but unless you're writing in C, you won't be able to compete with SciPy/NumPy.
(cdist runs in Θ(n2) time (if the size of each point is fixed), and every other part of the algorithm in O(n) time, so even if you did try to optimize the code in Python, you wouldn't notice the change for small amounts of data, and the improvements would be overshadowed by cdist for more data.)
This question was edited after answers for show final solution I used
I have unstructured 2D datasets coming from different sources, like by example:
Theses datasets are 3 numpy.ndarray (X, Y coordinates and Z value).
My final aim is to interpolate theses datas on a grid for conversion to image/matrix.
So, I need to find the "best grid" for interpolate theses datas. And, for this I need to find the best X and Y step between pixels of that grid.
Determinate step based on Euclidean distance between points:
Use the mean of Euclidean distances between each point and its nearest neighbour.
Use KDTree/cKDTree from scipy.spacial for build tree of the X,Y datas.
Use the query method with k=2 for get the distances (If k=1, distances are only zero because query for each point found itself).
# Generate KD Tree
xy = np.c_[x, y] # X,Y data converted for use with KDTree
tree = scipy.spacial.cKDTree(xy) # Create KDtree for X,Y coordinates.
# Calculate step
distances, points = tree.query(xy, k=2) # Query distances for X,Y points
distances = distances[:, 1:] # Remove k=1 zero distances
step = numpy.mean(distances) # Result
Performance tweaking:
Use of scipy.spatial.cKDTree and not scipy.spatial.KDTree because it is really faster.
Use balanced_tree=False with scipy.spatial.cKDTree: Big speed up in my case, but may not be true for all data.
Use n_jobs=-1 with cKDTree.query for use multithreading.
Use p=1 with cKDTree.query for use Manhattan distance in place of Euclidian distance (p=2): Faster but may be less accurate.
Query the distance for only a random subsample of points: Big speed up with large datasets, but may be less accurate and less repeatable.
Interpolate points on grid:
Interpolate dataset points on grid using the calculated step.
# Generate grid
def interval(axe):
'''Return numpy.linspace Interval for specified axe'''
cent = axe.min() + axe.ptp() / 2 # Interval center
nbs = np.ceil(axe.ptp() / step) # Number of step in interval
hwid = nbs * step / 2 # Half interval width
return np.linspace(cent - hwid, cent + hwid, nbs) # linspace
xg, yg = np.meshgrid(interval(x), interval(y)) # Generate grid
# Interpolate X,Y,Z datas on grid
zg = scipy.interpolate.griddata((x, y), z, (xg, yg))
Set NaN if pixel too far from initials points:
Set NaN to pixels from grid that are too far (Distance > step) from points from initial X,Y,Z data. The previous generated KDTree is used.
# Calculate pixel to X,Y,Z data distances
dist, _ = tree.query(np.c_[xg.ravel(), yg.ravel()])
dist = dist.reshape(xg.shape)
# Set NaN value for too far pixels
zg[dist > step] = np.nan
The problem you want to solve is called the "all-nearest-neighbors problem". See this article for example: http://link.springer.com/article/10.1007/BF02187718
I believe solutions to this are O(N log N), so on the same order as KDTree.query, but in practice much, much faster than a bunch of separate queries. I'm sorry, I don't know of a python implementation of this.
I suggest you to go with KDTree.query.
You are searching of a carachteristic distance to scale your binning: I suggest you to take only a random subset of your points, and to use the Manhattan distance, becasue KDTree.query is very slow (and yet it is a n*log(n) complexity).
Here is my code:
# CreateTree
tree=scipy.spatial.KDTree(numpy.array(points)) # better give it a copy?
# Create random subsample of points
n_repr=1000
shuffled_points=numpy.array(points)
numpy.random.shuffle(shuffled_points)
shuffled_points=shuffled_points[:n_repr]
# Query the tree
(dists,points)=tree.query(shuffled_points,k=2,p=1)
# Get _extimate_ of average distance:
avg_dists=numpy.average(dists)
print('average distance Manhattan with nearest neighbour is:',avg_dists)
I suggest you to use the Manhattan distance ( https://en.wikipedia.org/wiki/Taxicab_geometry ) because it is was faster to compute than the euclidean distance. And since you need only an estimator of the average distance it should be sufficient.
I have a 10 x 10 grid of cells (as a numpy array). I also have a list of 3 points on that grid. For each cell on the grid, I need to find the closest of the three points. I can do this in series of nested loops in python (2.7) which works but is slow (especially if I upscale to larger grids) but I suspect there is a faster way. Does anyone have any suggestions?
The simplest way I know of to calculate the distance between two points on a plane is using the Pythagorean theorem.
That is, picture a right angle triangle where the hypotenuse goes between the two points and the base of the triangle is parallel to the x axis and the height is parallel to the y axis. We then know that the distance (represented by the length of the hypotenuse) h adheres to the following: h^2 = a^2 + b^2, where a and b are the lengths of the two remaining sides of the triangle.
It's hard to give any other help without seeing your code. Have you tried something similar yet? You need to specify your question more if you want more specific answers.
If we assume that you know the point coord, then you can calculate the distance between a cell and the point using the distance formula: https://en.wikipedia.org/wiki/Distance
So for example, let's say that your cell correspond to 'x' and your 3 points correspond to y1, y2 and y3. You can simply get the distance between x - y1, x - y2 and x - y3 and then compare the three distances.
If we assume that you do not know the point coord, then you first have to find the point coord. You can find the point coord by scanning your grid and cheecking if a cell correspond to a point coord. When you found all your point, you can find the closest distance using the formula distance.
There is function in scipi called euclidean that will calculate the distances between points, if you want to loop through them.
from scipy.spatial.distance import euclidean
import numpy as np
a = np.array([1, 1, 1])
b = np.array([2, 2, 2])
dist = euclidean(a, b)
But I think for large data sets you would be better of using scipi's k-d tree to preform the search.